本篇博文主要内容为 2026-04-28 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。

说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。

提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。

目录

概览 (2026-04-28)

今日共更新1147篇论文,其中:

  • 自然语言处理171篇(Computation and Language (cs.CL))
  • 人工智能372篇(Artificial Intelligence (cs.AI))
  • 计算机视觉239篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习302篇(Machine Learning (cs.LG))
  • 多智能体系统21篇(Multiagent Systems (cs.MA))
  • 信息检索51篇(Information Retrieval (cs.IR))
  • 人机交互55篇(Human-Computer Interaction (cs.HC))

多智能体系统

[MA-0] FastOMOP: A Foundational Architecture for Reliable Agent ic Real-World Evidence Generation on OMOP CDM data

【速读】:该论文旨在解决从大规模电子健康记录(Electronic Health Records, EHR)数据中自动化生成真实世界证据(Real-World Evidence, RWE)时面临的挑战,尤其是多智能体系统(Multi-Agent Systems)在临床任务中引入的不可预测行为、协调失败和安全风险问题。现有方法缺乏可扩展、可审计且安全的基础设施来保障RWE生成流程的可靠性与可控性。解决方案的关键在于提出FastOMOP——一个开源的多智能体架构,通过将基础设施划分为治理(Governance)、可观测性(Observability)和编排(Orchestration)三层,并与可插拔的智能体团队解耦,实现对智能体行为的安全约束。其中,治理层在流程边界实施确定性验证机制,独立于智能体推理过程,确保任何存在幻觉或越权行为的代理都无法绕过安全控制;同时,用于表型定义、研究设计和统计分析的智能体团队通过受控工具暴露继承这些安全保障,从而在不依赖特定模型能力的前提下实现了高可靠性和安全性。

链接: https://arxiv.org/abs/2604.24572
作者: Niko Moeller-Grell,Shihao Shenzhang,Zhangshu Joshua Jiang,Richard JB Dobson,Vishnu V Chandrabalan
机构: King’s College London (伦敦国王学院); LTHTR NHS Trust (LTHTR NHS信托)
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:The Observational Medical Outcomes Partnership Common Data Model (OMOP CDM), maintained by the Observational Health Data Sciences and Informatics (OHDSI) collaboration, enabled the harmonisation of electronic health records data of nearly one billion patients in 83 countries. Yet generating real-world evidence (RWE) from these repositories remains a manual process requiring clinical, epidemiological and technical expertise. LLMs and multi-agent systems have shown promise for clinical tasks, but RWE automation exposes a fundamental challenge: agentic systems introduce emergent behaviours, coordination failures and safety risks that existing approaches fail to govern. No infrastructure exists to ensure agentic RWE generation is flexible, safe and auditable across the lifecycle. We introduce FastOMOP, an open-source multi-agent architecture that addresses this gap by separating three infrastructure layers, governance, observability and orchestration, from pluggable agent-teams. Governance is enforced at the process boundary through deterministic validation independent of agent reasoning, ensuring no compromised or hallucinating agent can bypass safety controls. Agent teams for phenotyping, study design and statistical analysis inherit these guarantees through controlled tool exposure. We validated FastOMOP using a natural-language-to-SQL agent team across three OMOP CDM datasets: synthetic data from Synthea, MIMIC-IV and a real-world NHS dataset from Lancashire Teaching Hospitals (IDRIL). FastOMOP achieved reliability scores of 0.84-0.94 with perfect adversarial and out-of-scope block rates, demonstrating process-boundary governance delivers safety guarantees independent of model choice. These results indicate that the reliability gap in RWE deployment is architectural rather than model capability, and establish FastOMOP as a governed architecture for progressive RWE automation.

[MA-1] Agent ic Witnessing: Prag matic and Scalable TEE-Enabled Privacy-Preserving Auditing

【速读】:该论文旨在解决隐私保护与数据语义属性验证之间的根本矛盾:验证方需要透明访问数据以确认其性质,而数据所有者则因专有权利要求数据保密。传统零知识证明(Zero-Knowledge Proofs, ZKPs)虽能保障隐私,但通常仅适用于精确的代数约束,难以验证代码库等非结构化、定性逻辑属性。解决方案的关键在于提出“代理见证”(Agentic Witnessing)框架,将验证从可证明执行转向可证明推理——通过引入三个智能体(Verifier、Prover 和 Auditor),并利用可信执行环境(Trusted Execution Environment, TEE)封装基于大语言模型(LLM)的审计员,使验证方可通过有限的布尔型问题查询私有数据,而无需直接接触原始数据;审计员使用模型上下文协议(Model Context Protocol, MCP)动态分析目标数据,并生成带密码学签名的哈希链,将推理过程绑定至原始数据和TEE硬件根信任,从而实现对高阶语义属性的隐私保护式验证。

链接: https://arxiv.org/abs/2604.24203
作者: Antony Rowstron
机构: Advanced Research and Invention Agency (ARIA)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Auditing the semantic properties of proprietary data creates a fundamental tension: verification requires transparent access, while proprietary rights demand confidentiality. While Zero-Knowledge Proofs (ZKPs) ensure privacy, they are typically limited to precise algebraic constraints and are ill-suited for verifying qualitative, unstructured properties, such as the logic within a codebase. We propose \em Agentic Witnessing, a framework that moves verification from attested execution to \em attested reasoning. The system is composed of three agents: a Verifier (who wants to check properties of a dataset), a Prover (who owns the dataset) and an Auditor (that inspects the dataset). The Verifier is allowed to ask a limited number of simple binary true/false questions to the auditor. By isolating an LLM-based Auditor within a Trusted Execution Environment (TEE), the system enables the Verifier to query a Prover’s private data via simple Boolean queries, without exposing the raw dataset. The Auditor uses the Model Context Protocol (MCP) to dynamically inspect the target dataset, producing a yes/no verdict accompanied by a cryptographic transcript: a signed hash chain binding the reasoning trace to both the original dataset and the TEE’s hardware root of trust. We demonstrate this architecture by automating the artifact evaluation process for 21 peer-reviewed computer science papers with released codebases on GitHub (e.g. Does the codebase implement the system described in the paper?). We verified five high-level properties of these codebases described in the corresponding publications, treating the source code as private. Our results show that TEE-enabled agentic auditing provides a mechanism for privacy-preserving oversight, effectively decoupling qualitative verification from the need for data disclosure.

[MA-2] Rewarding the Scientific Process: Process-Level Reward Modeling for Agent ic Data Analysis

【速读】:该论文旨在解决通用领域过程奖励模型(Process Reward Models, PRMs)在动态数据分析任务中表现不佳的问题,尤其是其无法有效监督数据分析师代理(data analysis agents)的局限性:一方面难以识别“静默错误”(silent errors),即不触发解释器异常但导致结果错误的逻辑缺陷;另一方面会错误惩罚探索性行为,将必要的试错过程误判为基础性错误。解决方案的关键在于提出一种环境感知的生成式过程奖励模型 DataPRM,其核心创新包括:(1) 作为主动验证器,能自主与环境交互以探测中间执行状态并发现静默错误;(2) 采用反射感知的三元奖励策略,区分可修正的基础性错误与不可恢复的失误,从而更精准地指导代理学习。该方法通过多样性驱动的轨迹生成和知识增强的步骤级标注构建了超过8K高质量训练样本,实验证明其显著提升下游策略大语言模型(LLM)性能,并在多种测试时缩放策略下展现出强泛化能力。

链接: https://arxiv.org/abs/2604.24198
作者: Zhisong Qiu,Shuofei Qiao,Kewei Xu,Yuqi Zhu,Lun Du,Ningyu Zhang,Huajun Chen
机构: Zhejiang University (浙江大学); Ant Group (蚂蚁集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: Work in progress

点击查看摘要

Abstract:Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present a empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors, logical flaws that yield incorrect results without triggering interpreter exceptions, and erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) can serve as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes between correctable grounding errors and irrecoverable mistakes. We design a scalable pipeline to construct over 8K high-quality training instances for DataPRM via diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experimental results demonstrate that DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep using Best-of-N inference. Notably, with only 4B parameters, DataPRM outperforms strong baselines, and exhibits robust generalizability across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, achieving 78.73% on DABench and 64.84% on TableBench, validating the effectiveness of process reward supervision. Code is available at this https URL.

[MA-3] EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce

【速读】:该论文旨在解决电子商务场景中产品映射(Product Mapping)任务的准确性与部署成本之间的矛盾问题,即如何在不依赖昂贵外部API、复杂推理编排和隐私敏感环境下实现高精度、可扩展的产品匹配。其核心解决方案是提出EPM-RL框架,通过强化学习(Reinforcement Learning, RL)将高成本的多智能体推理过程蒸馏为一个可在本地部署的轻量级模型;关键创新在于利用结构化推理输出进行参数高效微调(Parameter-Efficient Fine-Tuning, PEFT),再结合基于判别模型的奖励机制对输出格式合规性、标签正确性和推理偏好进行联合优化,从而在保证质量的同时显著降低延迟和运营成本,使系统具备生产就绪能力。

链接: https://arxiv.org/abs/2604.23993
作者: Minhyeong Yu,Wonduk Seo
机构: Enhans(恩汉斯); Seoul(首尔)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: preprint

点击查看摘要

Abstract:Product mapping, the task of deciding whether two e-commerce listings refer to the same product, is a core problem for price monitoring and channel visibility. In real marketplaces, however, sellers frequently inject promotional keywords, platform-specific tags, and bundle descriptions into titles, causing the same product to appear under many different names. Recent LLM-based and multi-agent frameworks improve robustness and interpretability on such hard cases, but they often rely on expensive external APIs, repeated retrieval, and complex inference-time orchestration, making large-scale deployment costly and difficult in privacy-sensitive enterprise settings. To address these issues, we present EPM-RL, a reinforcement-learning-based framework for building an accurate and efficient on-premise e-commerce product mapping model. Our central idea is to distill high-cost agentic reasoning into a trainable in-house model. Starting from a curated set of product pairs with LLM-generated rationales and human verification, we first perform parameter-efficient fine-tuning (PEFT) on a small student model using structured reasoning outputs. We then further optimize the model with Reinforcement Learning (RL) using an agent-based reward that jointly evaluates output-format compliance, label correctness, reasoning–preference scores from specially designed judge models. Preliminary results show that EPM-RL consistently improves over PEFT-only training and offers a stronger quality–cost trade-off than commercial API-based baselines, while enabling private deployment and lower operational cost. These findings suggest that reinforcement learning can turn product mapping from a high-latency agentic pipeline into a scalable, inspectable, and production-ready in-house system.

[MA-4] LLM -Guided Agent ic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People

【速读】:该论文旨在解决盲人及低视力(Blind and Low-Vision, BLV)群体在室内导航中的可访问性挑战,现有方案通常依赖昂贵的每栋建筑专用基础设施,难以大规模部署。其解决方案的关键在于提出一种基于多智能体(multi-agent)的框架,通过将单张楼层平面图转化为结构化、可检索的知识库,实现轻量级基础设施下的安全、无障碍导航指令生成。该框架包含两个阶段:一是利用自校正流水线与迭代重试机制的多智能体模块,将平面图解析为空间知识图谱;二是路径规划器结合安全评估代理对路径进行潜在风险评估,从而生成高可靠性的导航指令。实验证明,该方法在真实场景(UMBC数学与心理学楼)和CVC-FP基准测试中均显著优于单一调用大语言模型(LLM)基线,展现出良好的可扩展性和实用性。

链接: https://arxiv.org/abs/2604.23970
作者: Aydin Ayanzadeh,Tim Oates
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Indoor navigation remains a critical accessibility challenge for the blind and low-vision (BLV) individuals, as existing solutions rely on costly per-building infrastructure. We present an agentic framework that converts a single floor plan image into a structured, retrievable knowledge base to generate safe, accessible navigation instructions with lightweight infrastructure. The system has two phases: a multi-agent module that parses the floor plan into a spatial knowledge graph through a self-correcting pipeline with iterative retry loops and corrective feedback; and a Path Planner that generates accessible navigation instructions, with a Safety Evaluator agent assessing potential hazards along each route. We evaluate the system on the real-world UMBC Math and Psychology building (floors MP-1 and MP-3) and on the CVC-FP benchmark. On MP-1, we achieve success rates of 92.31%, 76.92%, and 61.54% for short, medium, and long routes, outperforming the strongest single-call baseline (Claude 3.7 Sonnet) at 84.62%, 69.23%, and 53.85%. On MP-3, we reach 76.92%, 61.54%, and 38.46%, compared to the best baseline at 61.54%, 46.15%, and 23.08%. These results show consistent gains over single-call LLM baselines and demonstrate that our workflow is a scalable solution for accessible indoor navigation for BLV individuals.

[MA-5] EndoGov: A knowledge-governed multi-agent expert system for endometrial cancer risk stratification

【速读】:该论文旨在解决多模态人工智能模型在子宫内膜癌(Endometrial Cancer, EC)风险分层中缺乏对临床指南强制性规则的执行机制的问题,例如当POLE突变型肿瘤表现为高分级形态时,仍需将其归类为低风险组。解决方案的关键在于提出一种两级多智能体专家系统EndoGov,其决策过程形式化为D(x) = G(P(x), R),其中P代表提取结构化证据的专科代理(病理、分子和临床代理),G为治理代理,依据可执行规则集R进行决策;Tier 1生成符合Schema约束的报告,Tier 2通过证据加权的指南知识图谱,结合确定性硬路径规则处理高优先级覆盖场景,以及受限软路径推理应对模糊情况,从而实现既符合临床指南又具备竞争性判别性能的风险分配。

链接: https://arxiv.org/abs/2604.23802
作者: Weiye Dai,Liyun Shi,Zanxiang He,Yuling Ma,Mengyuan Lin,Dianxiang Sun,Liming Nie
机构: 未知
类目: Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Multimodal artificial intelligence models for endometrial cancer (EC) risk stratification typically optimize aggregate predictive performance but provide limited mechanisms for enforcing mandatory guideline overrides, such as assigning POLE-mutated tumors to the low-risk group despite high-grade morphology. We present EndoGov, a two-tier multi-agent expert system that factorizes the decision process as D(x) = G(P(x), R), where specialist agents P extract structured evidence and a governance agent G applies an executable rule set R. Tier 1 comprises pathology, molecular, and clinical agents that independently generate schema-constrained reports from frozen foundation-model features or structured records. Tier 2 queries an evidence-level-weighted Guideline Knowledge Graph, using deterministic hard-path rules for high-priority overrides and constrained soft-path reasoning for ambiguous cases. In TCGA-UCEC (n=541), EndoGov achieved 0.943 accuracy, 0.973 macro AUC, and a conditional logic-violation rate (C-LVR) of 0.93% among trigger-exposed cases. In CPTAC-UCEC (n=95), where reference labels are guideline-derived, EndoGov reached 0.842 accuracy compared with 0.31 for locked-transfer neural baselines, supporting governance-pathway transfer under distribution shift rather than validation against independent clinical truth. End-to-end safety decomposition localized residual failures primarily to upstream molecular detection rather than downstream governance. Backend-swap experiments further showed that hard-path compliance is invariant to the LLM backend. These findings indicate that explicit clinical-rule governance can provide guideline-compliant, auditable EC risk assignment while preserving competitive discrimination.

[MA-6] Information-Theoretic Measures in AI: A Practical Decision Guide

【速读】:该论文旨在解决信息论(Information-theoretic, IT)度量在人工智能(AI)与机器学习(ML)应用中选择不当的问题,特别是针对七种核心度量——熵、交叉熵、互信息、传递熵以及集成信息(Phi)、有效信息(EI)和自主性(Autonomy)——缺乏系统化决策框架的现状。当前实践中,度量的选择往往脱离其估计器假设、失效模式及可信赖推断的前提,导致误用或误导性结论。论文的关键解决方案是提出一个围绕三个核心问题构建的实用决策框架:(i) 度量所回答的具体问题及其适用的AI场景;(ii) 针对数据类型与维度的合适估计方法;(iii) 最危险的误用情形。该框架通过两个互补工具实现:度量选择流程图与主决策表,并辅以标准化“桥接框”(Bridge Boxes)将信息论量与认知概念关联,从而为从业者提供结构化、可操作的指导,提升度量使用的严谨性和有效性。

链接: https://arxiv.org/abs/2604.23716
作者: Nikolaos Al.Papadopoulos,Konstantinos E. Psannis
机构: University of Macedonia (马其顿大学)
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 25 pages, 2 tables, 1 figure. Submitted to Applied Intelligence (Springer)

点击查看摘要

Abstract:Information-theoretic (IT) measures are ubiquitous in artificial intelligence: entropy drives decision-tree splits and uncertainty quantification, cross-entropy is the default classification loss, mutual information underpins representation learning and feature selection, and transfer entropy reveals directed influence in dynamical systems. A second, less consolidated family of measures, integrated information (Phi), effective information (EI), and autonomy, has emerged for characterizing agent complexity. Despite wide adoption, measure selection is often decoupled from estimator assumptions, failure modes, and safe inferential claims. This paper provides a practical decision framework for all seven measures, organized around three prescriptive questions for each: (i) what question does the measure answer and in which AI context; (ii) which estimator is appropriate for the data type and dimensionality; and (iii) what is the most dangerous misuse. The framework is operationalized in two complementary artifacts: a measure-selection flowchart and a master decision table. We cover both AI/ML and decision-making agent application domains per measure, with standardized Bridge Boxes linking IT quantities to cognitive constructs. Three worked examples illustrate the framework on concrete practitioner scenarios spanning representation learning, temporal influence analysis, and evolved agent complexity. Comments: 25 pages, 2 tables, 1 figure. Submitted to Applied Intelligence (Springer) Subjects: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Multiagent Systems (cs.MA) MSC classes: 94A17, 68T05 ACMclasses: H.1.1; I.2.0; I.2.6 Cite as: arXiv:2604.23716 [cs.AI] (or arXiv:2604.23716v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.23716 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[MA-7] DLM: Unified Decision Language Models for Offline Multi-Agent Sequential Decision Making

【速读】:该论文旨在解决离线多智能体强化学习(Offline Multi-Agent Reinforcement Learning, Offline MARL)中决策策略难以规模化和复用的问题,特别是现有方法受限于固定观测格式与动作空间,导致泛化能力不足。其解决方案的关键在于提出决策语言模型(Decision Language Model, DLM),将多智能体决策建模为一种对话式的序列预测问题,在集中训练、分散执行(Centralized Training with Decentralized Execution, CTDE)框架下进行优化。DLM通过两阶段训练:第一阶段采用监督微调,利用对话风格数据集引入智能体间上下文信息并从离线轨迹中生成可执行动作;第二阶段通过群体相对策略优化(Group Relative Policy Optimization)引入轻量级奖励函数,提升对分布外动作的鲁棒性。实验表明,统一的DLM在多个基准测试中优于强基线方法,并展现出对未见场景的零样本泛化能力。

链接: https://arxiv.org/abs/2604.23557
作者: Zhuohui Zhang,Bin Cheng,Bin He
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 22 pages, 11 figures

点击查看摘要

Abstract:Building scalable and reusable multi-agent decision policies from offline datasets remains a challenge in offline multi-agent reinforcement learning (MARL), as existing methods often rely on fixed observation formats and action spaces that limit generalization. In contrast, large language models (LLMs) offer a flexible modeling interface that can naturally accommodate heterogeneous observations and actions. Motivated by this, we propose the Decision Language Model (DLM), which formulates multi-agent decision making as a dialogue-style sequence prediction problem under the centralized training with decentralized execution paradigm. DLM is trained in two stages: a supervised fine-tuning phase, which leverages dialogue-style datasets for centralized training with inter-agent context and generates executable actions from offline trajectories, followed by a group relative policy optimization phase to enhance robustness to out-of-distribution actions through lightweight reward functions. Experiments on multiple benchmarks show that a unified DLM outperforms strong offline MARL baselines and LLM-based conversational decision-making methods, while demonstrating strong zero-shot generalization to unseen scenarios across tasks.

[MA-8] Breaking the Secret: Economic Interventions for Combating Collusion in Embodied Multi-Agent Systems

【速读】:该论文旨在解决具身多智能体系统(Embodied Multi-Agent Systems, MAS)中自主智能体之间合谋(Collusion)带来的安全威胁问题。此类合谋行为可能导致智能体协同偏离全局目标,并在物理环境中引发实际后果,而现有基于身份控制或事后行为分析的防御手段因物理环境中的延迟反馈和噪声观测难以及时准确检测此类偏差。论文提出了一种“诱变激励干预”(Mutagenic Incentive Intervention)方法,其核心在于重构智能体的收益结构:通过奖励举报合谋行为的智能体、惩罚被识别的合谋参与者,诱导策略性背叛(Strategic Defection),从而使合谋变得不稳定。关键创新包括报告保证金机制、基于智能合约的奖励执行机制以及加密通信,以增强机制鲁棒性并防范滥用与报复。实验表明,该方法能有效抑制合谋同时保持系统效率,性能接近无合谋基线并优于代表性被动防御方案。

链接: https://arxiv.org/abs/2604.23511
作者: Qi Liu,Xiaohui Chen,Zhihui Zhao,Yaowen Zheng,Dan Yu,Zehua Zhang,Limin Sun,Yongle Chen
机构: Taiyuan University of Technology (太原理工大学); Tsinghua University (清华大学); China Mobile Research Institute (中国移动研究院); Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所)
类目: Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Collusion among autonomous agents poses a critical security threat in embodied multi-agent systems (MAS), where coordinated behaviors can deviate from global objectives and lead to real-world consequences. Existing defenses, primarily based on identity control or post-hoc behavior analysis, are insufficient to address such threats in embodied settings due to delayed feedback and noisy observations in physical environments, which make behavioral deviations difficult to detect accurately and in a timely manner. To address this challenge, we propose a mutagenic incentive intervention approach that mitigates collusion by reshaping agents’ payoff structures. By rewarding agents who report collusive behavior and penalizing identified participants, the mechanism induces strategic defection and renders collusion unstable. We further design supporting mechanisms, including reporting deposits, smart contract-based reward enforcement, and encrypted communication, to ensure robustness against misuse of the incentive mechanism and retaliation from penalized agents. We implement the proposed approach in both simulated and real-world embodied environments. Experimental results show that our method effectively suppresses collusion by inducing defection, while preserving system efficiency. It achieves performance comparable to the non-collusion baseline and outperforms representative reactive defenses, thereby fulfilling the desired security objectives. These results demonstrate the effectiveness of proactive incentive design as a practical paradigm for securing embodied multi-agent systems.

[MA-9] Architecture Matters for Multi-Agent Security

【速读】:该论文旨在解决多智能体系统(Multi-agent Systems, MAS)在实际部署中因架构设计决策引入的新型安全风险问题,这些问题在单智能体场景中并不存在。研究发现,即使各智能体本身具备良好的安全性,其协作机制(如角色分配、通信拓扑和记忆可见性)仍可能显著增加攻击面,导致系统整体更易受到攻击。解决方案的关键在于通过系统性的实证研究,量化不同架构配置下任务性能与抗攻击能力之间的权衡关系,并识别出影响安全性的三个核心设计因素:(i)代理角色(agent roles),决定权限与责任分配;(ii)通信拓扑(communication topology),控制交互方式与时序;(iii)记忆机制(memory),定义上下文和状态的可见性范围。研究结果表明,没有一种单一架构在所有场景下都更安全,从而推动了从单智能体安全评估向多智能体协同安全评估的必要演进。

链接: https://arxiv.org/abs/2604.23459
作者: Ben Hagag,William L. Anderson,Christian Schroeder de Witt,Sarah Scheffler
机构: 未知
类目: Multiagent Systems (cs.MA); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multi-agent systems (MAS), composed of networks of two or more autonomous AI agents, have become increasingly popular in production deployments, yet introduce security risks that do not arise in single-agent settings. Even if individual agents exhibit robust security, architectural decisions governing their coordination can create attack surfaces that have not been systematically characterized. In this work, we present an empirical study of how MAS design decisions shape the tradeoff between task performance and attack resistance. Across three agentic environments (browser, desktop, and code) and 13 architectural configurations, we use stagewise evaluations that distinguish planning refusal, execution-stage interception, partial harmful execution, and successful attack completion to study three key design choices: (i) agent roles, which determine how authority and responsibility are allocated; (ii) communication topology, which shapes how and when agents interact; and (iii) memory, which determines the context and state visibility accessible to each agent. We find that multi-agent architectures are more vulnerable than standalone agents in the majority of configurations, with attack success rates varying by up to 3.8x at comparable or higher benign accuracy, and that no single design is universally safer. These results motivate the development of further evaluations that move beyond the security properties of a single agent.

[MA-10] GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLM s

【速读】:该论文旨在解决自主多智能体大语言模型(LLM)系统在生成诊断报告时缺乏可信度的问题,核心挑战在于如何确保每个主张(claim)都基于观测证据而非模型内部推理,且能对下游行为进行可控制的响应。现有评估方法将支持性证据视为等价物,仅输出单一信号,无法实现对决策流程的有效干预。其解决方案的关键在于提出GSAR框架,通过四个维度对主张分类(有依据、无依据、矛盾、互补),引入基于证据类型的权重以反映认知强度,并计算一种不对称惩罚矛盾的加权接地分数;进而将该分数映射至三级决策函数(继续、重生成、重规划),驱动一个受显式计算预算约束的有限迭代外循环,从而实现从证据到行动的结构化、可控反馈机制。

链接: https://arxiv.org/abs/2604.23366
作者: Federico A. Kamelhar
机构: Oracle Corporation
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Autonomous multi-agent LLM systems are increasingly deployed to investigate operational incidents and produce structured diagnostic reports. Their trustworthiness hinges on whether each claim is grounded in observed evidence rather than model-internal inference. Existing groundedness evaluators (binary classifiers, LLM-as-judge scalars, self-correction loops) treat supporting evidence as interchangeable and emit a single signal that offers no principled control over downstream action. We present GSAR, a grounding-evaluation and replanning framework that (i) partitions claims into a four-way typology (grounded, ungrounded, contradicted, complementary), giving first-class standing to non-redundant alternative perspectives; (ii) assigns evidence-type-specific weights reflecting epistemic strength; (iii) computes an asymmetric contradiction-penalised weighted groundedness score; and (iv) couples that score to a three-tier decision function (proceed, regenerate, replan) driving a bounded-iteration outer loop under an explicit compute budget. We formalise the algorithm, prove six structural properties, and evaluate five design claims on FEVER with gold Wikipedia evidence under four independently-trained LLM judges (gpt-5.4, claude-sonnet-4-6, claude-opus-4-7, gemini-2.5-pro). Every ablation reproduces in the same direction on every judge: bootstrap 95% CIs on the rho=0 effect exclude 0 on all four; the no-complementary ablation under Opus 4.7 has CI [-96,-68] of 200; at n=1000 three independent judges converge to DeltaS(rho=0)=+0.058. A head-to-head against Vectara HHEM-2.1-Open is included. To our knowledge, GSAR is the first published groundedness framework coupling evidence-typed scoring with tiered recovery under an explicit compute budget. Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA) ACMclasses: I.2.7; I.2.11 Cite as: arXiv:2604.23366 [cs.AI] (or arXiv:2604.23366v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.23366 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[MA-11] Proteus: Shapeshifting Desktop Visualizations for Mobile via Multi-level Intelligent Adaptation

【速读】:该论文旨在解决移动优先消费趋势下,桌面端数据可视化(Data Visualization)在移动端因视口尺寸差异和交互范式不同而导致的可读性差、信息丢失及交互失效问题。其解决方案的关键在于构建一个多层次的设计空间(Design Space),涵盖全局拓扑、参考系到视觉元素三个层级的演化规则,并基于此设计出由大语言模型驱动的多智能体系统Proteus,该系统能够自动解析在线可视化内容,预测最优转换策略并生成适配移动端的高可读性等效可视化结果。

链接: https://arxiv.org/abs/2604.23299
作者: Can Liu,Sizhe Cheng,Feng Liang,Zhibang Jiang,Lingru Huang,Kavinda Athapaththu,Yong Wang
机构: Nanyang Technological University(南洋理工大学); The Hong Kong Polytechnic University(香港理工大学); Providence Health Services(普罗维登斯健康服务)
类目: Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注: accepted by ACM Designing Interactive Systems Conference

点击查看摘要

Abstract:With the rise of mobile-first consumption, users increasingly engage with data visualizations on mobile devices. However, the vast majority of existing visualizations are originally authored for desktop environments. Due to significant differences in viewport size and interaction paradigms, directly scaling desktop charts often results in illegible text, information loss, and interaction failures. To bridge this gap, we propose an automated framework to adapt desktop-based visualizations for mobile screens. By systematically categorizing the operations involved in the adaptation process, we establish a multi-level design space. This space defines evolution rules spanning from the global topology level, through the reference frame level, down to the visual elements level. Guided by this theoretical framework, we developed Proteus, a large language model-driven multi-agent system that automatically parses online visualizations, predicts optimal transformation strategies within the design space, and generates equivalent, highly readable visualizations for mobile devices. Case studies and an in-depth user study with 12 participants demonstrate the effectiveness and usability of Proteus.

[MA-12] Cooperative Informative Sensing for Monitoring Dynamic Indoor Environments via Multi-Agent Reinforcement Learning

【速读】:该论文旨在解决室内环境中多机器人协同主动观测(cooperative active observation)问题,即如何在部分可观测条件下,通过多机器人自主调整运动轨迹以直接优化对人类活动的监测精度。传统方法通常依赖覆盖或访问频率等目标函数,与以人为中心的监测任务准确性要求关联较弱。解决方案的关键在于将该问题建模为一个去中心化的控制问题,并采用基于多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)的学习框架,从局部观测中训练出协同策略;同时设计了一个能处理人类数量变化和时间依赖性的架构,从而在多种室内场景和任务中显著优于经典的覆盖、持续监控及无学习基准方法,且对观测人数变化具有鲁棒性。

链接: https://arxiv.org/abs/2604.23179
作者: Kanghoon Lee,Matthew M. Sato,Jinnyeong Yang,Seungro Lee,Sujin Lee,Jiachen Li,Kuk-Jin Yoon,Jinkyoo Park,Kincho H. Law,Yoonjin Yoon
机构: Korea Advanced Institute of Science and Technology (KAIST); Stanford University (斯坦福大学); University of California, Berkeley (加州大学伯克利分校); University of California, Riverside (加州大学河滨分校)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 8 pages, 10 figures, 2 tables

点击查看摘要

Abstract:Monitoring human activity in indoor environments is important for applications such as facility management, safety assessment, and space utilization analysis. While mobile robot teams offer the potential to actively improve observation quality, existing multi-robot monitoring and active perception approaches typically rely on coverage or visitation based objectives that are weakly aligned with the accuracy requirements of human-centric monitoring tasks. In this work, we formulate cooperative active observation as a decentralized control problem in which multiple robots adjust their motion to directly optimize monitoring accuracy under partial observability. We propose a learning-based framework for cooperative policies from decentralized observations using multi-agent reinforcement learning (MARL), supported by an architecture that handles variable numbers of humans and temporal dependencies. Simulation results across diverse indoor environments and monitoring tasks show that the proposed approach consistently outperforms classical coverage, persistent monitoring, and learning-free multi-robot baselines, while remaining robust to changes in the number of observed humans.

[MA-13] MindTrellis: Co-Creating Knowledge Structures with AI through Interactive Visual Exploration

【速读】:该论文旨在解决知识工作者在从多文档中提炼信息并构建结构化概念理解时面临的挑战,这一过程具有迭代性,涉及用户探索内容、识别概念间关系并持续重构心智模型,而现有方法要么仅支持信息检索(如基于大语言模型的系统),要么虽能支持结构创建但缺乏智能辅助(如手动思维导图工具)。解决方案的关键在于提出MindTrellis——一个交互式可视化系统,通过用户与AI协同构建动态知识图谱,使用户能够查询图谱获取文档支撑的信息,并主动引入新概念、调整关系和重组层级以反映其认知演化;实证研究表明,该方案在知识组织质量和认知负荷方面显著优于仅依赖检索的基线方法。

链接: https://arxiv.org/abs/2604.23129
作者: Xiang Li,Cara Li,Emily Kuang,Can Liu,Jian Zhao
机构: University of Waterloo (滑铁卢大学); York University (约克大学); Nanyang Technological University (南洋理工大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
备注: 21 pages, 7 figures, ACM Designing Interactive Systems. DIS 2026

点击查看摘要

Abstract:Knowledge workers face increasing challenges in synthesizing information from multiple documents into structured conceptual understanding. This process is inherently iterative: users explore content, identify relationships between concepts, and continuously reorganize their mental models. However, current approaches offer limited support. LLM-based systems let users query information but not shape how knowledge is organized; manual tools like mind maps support structure creation but lack intelligent assistance. This leaves an open opportunity: supporting collaborative construction where users and AI jointly develop an evolving knowledge representation. We present MindTrellis, an interactive visual system where users and AI collaboratively build a dynamic knowledge graph. Users can query the graph to retrieve document-grounded information, and contribute by introducing new concepts, modifying relationships, and reorganizing the hierarchy to reflect their developing understanding. In a user study where 12 participants created slide decks, MindTrellis outperformed retrieval-only baselines in knowledge organization and cognitive load, as measured by expert ratings of content coverage and structural quality.

[MA-14] No Test Cases No Problem: Distillation-Driven Code Generation for Scientific Workflows

【速读】:该论文旨在解决科学工作流(scientific workflows)中代码生成缺乏执行反馈和输入/输出(I/O)测试用例的问题,这类场景下传统基于I/O反馈的多智能体大语言模型(multi-agent Large Language Model, LLM)框架无法适用。其核心解决方案是提出MOSAIC框架,采用无训练的师生知识蒸馏机制,通过领域特定示例和结构化问题分解来引导生成过程,并引入 Consolidated Context Window(CCW)以在多智能体协作中维持跨子问题的一致性推理,从而有效缓解链式任务中的幻觉问题。实验表明,MOSAIC在SciCode基准上显著提升了准确率、可执行性和数值精度,且仅依赖轻量级模型。

链接: https://arxiv.org/abs/2604.23106
作者: Siddeshwar Raghavan,Tanwi Mallick
机构: Purdue University (普渡大学); Argonne National Laboratory (阿贡国家实验室)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Existing multi-agent Large Language Model (LLM) frameworks for code generation typically use execution feedback and improve iteratively using Input/Output (I/O) test cases. However, this does not work for scientific workflows, where I/O test cases do not exist, and generating them requires solving the very problem at hand. To address this, we introduce MOSAIC, a training-free multi-agent framework for scientific code generation without I/O supervision. Instead of execution feedback, MOSAIC employs a student-teacher knowledge distillation framework that grounds generation through domain-specific examples and structured problem decomposition. To further mitigate hallucinations across chained subproblems, we introduce a Consolidated Context Window (CCW) for maintaining consistent reasoning across agents. Experiments on the SciCode benchmark show that MOSAIC improves accuracy, executability, and numerical precision over existing approaches while relying on lightweight models.

[MA-15] Usable Agent Discovery for Decentralized AI Systems

【速读】:该论文旨在解决大规模分布式环境中软件代理(agent)的去中心化发现问题,尤其在节点级和代理级双重波动(churn)场景下如何保持路由效率、系统韧性与服务就绪性之间的平衡。其核心挑战在于:当多个代理共享物理主机且通过点对点机制动态发现时,频繁的节点故障或离开(node-level churn)以及代理因需求驱动的状态切换(如冷热状态变化,agent-level churn)会显著影响发现机制的稳定性与性能。解决方案的关键在于对比结构化(如Kademlia)与基于洪泛的(gossip-based,如Cyclon+Vicinity)覆盖网络在不同 churn 场景下的表现——研究表明,在稳定或仅存在节点 churn 的情况下,结构化覆盖网络更具鲁棒性和效率;而在代理冷热切换频繁、服务就绪性成为关键指标时,gossip-based 方案仍具竞争力甚至更快,从而为实际系统设计提供依据。

链接: https://arxiv.org/abs/2604.23080
作者: Patrizio Dazzi,Emanuele Carlini,Matteo Mordacchini,Saul Urso
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Large-scale agentic systems run on distributed infrastructures where many software agents share physical hosts and are discovered via peer-to-peer mechanisms. Discovery must handle node-level churn from failures and host departures and agent-level churn from demand-driven activation, deactivation, and state changes. Their interaction reshapes classic trade-offs between structured and unstructured overlays. We study decentralized agent discovery under this two-level churn, assuming nodes host multiple agents, overlays are structured or gossip-based, and agents switch between warm and cold states. Using Kademlia as a structured and Cyclon+Vicinity as a gossip baseline, we compare stable, node-churn-only, agent-cooling-only, and combined regimes to see when routing efficiency, resilience, and service readiness align or favor different designs. Structured overlays are more robust and efficient in stable and node-churn regimes, while gossip-based overlays remain competitive and can be faster when readiness dominates.

[MA-16] Peer Identity Bias in Multi-Agent LLM Evaluation: An Empirical Study Using the TRUST Democratic Discourse Analysis Pipeline

【速读】:该论文旨在解决生成式 AI (Generative AI) 系统中因模型身份暴露而引发的评分偏倚问题,特别是针对 TRUST 民主话语分析流水线中多通道身份暴露机制下 bias 的系统性测量缺失。其解决方案的关键在于:通过跨四个模型家族、两种匿名化范围和30条政治陈述的实验设计,识别出单通道匿名化无法有效检测偏倚——因为各通道效应相互抵消,导致误判为无偏倚;唯有全流水线匿名化才能揭示真实偏倚模式,即同质模型集合会放大身份驱动的奉承行为,而异质配置则相反。此外,研究还发现特定模型本身存在结构性偏倚,进一步强调了模型选择与全流程匿名化对保障多代理 LLM 系统在质量关键场景中公平性和可信度的重要性。

链接: https://arxiv.org/abs/2604.22971
作者: Juergen Dietrich
机构: democracy-intelligence.de
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 13 pages, 1 figure

点击查看摘要

Abstract:The TRUST democratic discourse analysis pipeline exposes its large language model (LLM) components to peer model identity through multiple structural channels – a design feature whose bias implications have not previously been empirically tested. We provide the first systematic measurement of identity-dependent scoring bias across all active identity exposure channels in TRUST, crossing four model families with two anonymization scopes across 30 political statements. The central finding is that single-channel anonymization produces near-zero bias effects, because individual channels act in opposite directions and cancel each other out – a result that would lead an evaluator to conclude that identity bias is absent when it is not. Only full-pipeline anonymization reveals the true pattern: homogeneous ensembles amplify identity-driven sycophancy when model identity is fully visible, while the heterogeneous production configuration shows the reverse. Model choice matters independently: one tested model exhibits baseline sycophancy two to three times higher than the others and near-zero deliberative conflict on ideological topics, making it structurally unsuitable for pipelines where genuine inter-role disagreement is the intended quality mechanism. Three practical conclusions follow. First, heterogeneous model ensembles are structurally more robust than homogeneous ones, achieving higher consensus rates and lower identity amplification. Second, full-pipeline anonymization is required for valid bias measurement – partial anonymization is insufficient and actively misleading. Third, these findings have direct implications for the validation of multi-agent LLM systems in quality-critical applications: a system validated under partial anonymization or with a homogeneous ensemble may pass validation while retaining structural identity bias invisible to single-channel measurement.

[MA-17] Beyond Single-Agent Alignment: Preventing Context-Frag mented Violations in Multi-Agent Systems

【速读】:该论文旨在解决多智能体系统中因上下文碎片化导致的政策违规问题——即Context-Fragmented Violations (CFVs),这类违规行为表现为单个智能体的行为在局部看似合规,但跨部门协作时因关键政策信息被隔离在不同部门的私有上下文中,从而违反组织整体策略。现有基于提示(prompt-based)对齐机制和集中式拦截器难以应对跨越多个上下文域的违规场景。其解决方案的关键在于提出一种分布式零信任执行架构Distributed Sentinel,引入语义污染令牌(Semantic Taint Token, STT)协议,通过轻量级sidecar代理在不暴露原始跨域数据的前提下传播安全状态,并结合反事实图模拟(Counterfactual Graph Simulation)实现跨域策略验证,从而有效识别并阻止CFVs。

链接: https://arxiv.org/abs/2604.22879
作者: Jie Wu,Ming Gong
机构: Atlassian(Atlassian)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 34 pages, 3 figures, 20 tables

点击查看摘要

Abstract:We identify and formalize a novel security risk: Context-Fragmented Violations (CFVs) - a class of policy breaches where individual agent actions appear locally safe and reasonable, yet collectively violate organizational policies because critical policy facts are siloed in different departments private contexts. Existing prompt-based alignment mechanisms and monolithic interceptors are poorly matched to violations that span contextual islands. We propose Distributed Sentinel, a distributed zero-trust enforcement architecture that introduces the Semantic Taint Token (STT) Protocol. Through lightweight sidecar proxies, our system propagates security state across organizational boundaries without exposing raw cross-domain data, enabling Counterfactual Graph Simulation for cross-domain policy verification. We construct PhantomEcosystem, a comprehensive benchmark comprising 9 categories of realistic cross-agent violation scenarios with adversarially balanced safe controls. On this benchmark, Distributed Sentinel achieves F1 = 0.95 with 106ms end-to-end latency (16ms verification + 90ms entity extraction on A100), compared to 0.85 F1 for prompt-based filtering and 0.65 for rule-based DLP. To empirically validate the need for external enforcement, we evaluate eight frontier LLMs in execution-oriented multi-agent workflows with per-agent domain world models. All models exhibit substantial violation rates (14-98%), with cross-domain data flows showing systematically higher violation rates than same-domain flows. These results indicate that self-avoidance is unreliable and that multi-agent security benefits from a centralized enforcement layer operating above individual agents.

[MA-18] AutoRISE: Agent -Driven Strategy Evolution for Red-Teaming Large Language Models

【速读】:该论文旨在解决当前自动化红队测试(red-teaming)方法在攻击大型语言模型时的局限性——即现有方法仅优化单个攻击提示(prompt),而无法改变攻击策略本身,从而限制了攻击的有效性和多样性。其解决方案的关键在于提出AutoRISE,一种通过搜索可执行的攻击程序(executable attack programs)来自动优化攻击策略的方法。该方法利用一个编码代理(coding agent)在每轮迭代中编辑攻击策略,并借助固定评估工具(evaluation harness)返回标量目标值和逐例诊断信息以指导后续修改,从而支持结构层面的调整(如新增攻击组件或控制流变更),突破了传统提示级方法的表达能力限制。实验表明,AutoRISE在多个基准数据集和11个模型上显著优于现有基线,平均提升攻击成功率达17.0点,尤其在低基线成功率的前沿目标上提升最高达16点,且无需微调、人工标注或GPU计算,在纯黑盒推理场景下即可实现高效攻击策略演化。

链接: https://arxiv.org/abs/2604.22871
作者: Tanmay Gautam,Alireza Bahramali,Sandeep Atluri
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 36 pages, 6 tables, 2 figures

点击查看摘要

Abstract:Automated red-teaming methods for large language models typically optimize attack prompts within a fixed, human-designed strategy, leaving the attack strategy itself unchanged. We instead optimize the strategy. We propose AutoRISE, a method that searches over executable attack programs rather than individual prompts. At each iteration, a coding agent edits a strategy and a fixed evaluation harness scores the resulting attacks, returning both a scalar objective and per-example diagnostics that guide subsequent edits. This allows structural changes, including new attack components and altered control flow, that prompt-level methods do not directly express. We also release two benchmark suites developed on disjoint target sets and evaluate on 11 models from five families against seven established jailbreak datasets. Across held-out models, AutoRISE improves average attack success rate by 17.0 points over the strongest baseline, and improves attack success by up to 16 points on frontier targets with low baseline success rates. Ablations against parametric and strategy-library baselines suggest that these gains arise from unrestricted program search, particularly compositional techniques and control-flow edits. AutoRISE operates in a black-box, inference-only setting, requiring no fine-tuning, human annotation, or GPU compute.

[MA-19] Complete Cyclic Subtask Graphs for Tool-Using LLM Agents : Flexibility Cost and Bottlenecks in Multi-Agent Workflows

【速读】:该论文旨在解决长时序工具使用任务中多智能体协作的灵活性与效率之间的权衡问题,即如何在允许任意子任务重访以增强恢复能力和探索性的同时,避免因协调开销和推理成本增加而导致性能下降。其解决方案的关键在于提出一种“完全循环子任务图”(complete cyclic subtask graphs)架构——该架构通过将所有可执行子任务节点完全连接,并由一个统一的状态分析与路由代理基于自然语言条件选择转移路径,使子任务层面的重访行为显式化且可直接分析。这一设计为系统性评估多智能体重访机制在不同任务场景下的有效性提供了实验基础,从而识别出何时重访有助于提升性能、何时仅增加协调负担、以及何时外部瓶颈(如检索或证据合成)主导了整体表现。

链接: https://arxiv.org/abs/2604.22820
作者: Luay Gharzeddine,Samer Saab Jr
机构: Lebanese American University (美国大学黎巴嫩分校)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-horizon tool-using tasks sometimes benefit from revisiting earlier subtasks for recovery and exploration, but added multi-agent workflow flexibility can also introduce coordination overhead and substantial inference cost. We study complete cyclic subtask graphs, a deliberately maximally flexible multi-agent architecture in which executable subtask nodes are fully connected and a unified state-analysis-and-routing agent selects transitions using natural-language criteria. This makes unrestricted revisitation explicit and directly analyzable at the subtask level. We evaluate task-specific (Spec-Cyc) and benchmark-generic (Gen-Cyc) graphs on TextCraft, ALFWorld, and Finance-Agent, with ablations over planner/executor/router strength, tool exposure (generalist vs specialized), n -shot successful trajectory summaries, and fault-injected random subtask perturbations. The benchmarks expose three distinct regimes. ALFWorld highlights a setting where explicit revisitation supports recovery and exploration; TextCraft, a largely prerequisite-chain domain, often favors the efficiency of simpler forward execution; and Finance-Agent remains bottlenecked by retrieval, grounding, and evidence synthesis more than by workflow flexibility alone. Shared-win token comparisons further show that the added flexibility can be substantially more expensive than a single ReAct agent. Overall, we use complete cyclic subtask graphs as a maximally flexible experimental lens for measuring when multi-agent revisitation helps, when it mainly adds coordination cost, and when external task bottlenecks dominate.

[MA-20] Representation Homogeneity and Systemic Instability in AI-Dominated Financial Markets: A Structural Approach

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)交易主体间信息表征相似性如何引发金融市场系统性不稳定的问题。其核心贡献在于构建了一个基于结构化的多智能体市场模型,通过高频率微观结构特征进行校准,并引入两层决策架构:非线性表征层将原始市场状态映射为高维特征向量,线性读出层生成收益预测并输入至风险控制交易规则中。解决方案的关键在于区分“表征同质性”(representation homogeneity)与“预测重叠度”(forecast overlap)这两个常被混淆的概念,并理论证明二者虽相关但不等价;尤其指出,在压力情境下,表征同质性会压缩有效预测分歧空间,即使正常时期预测看似多样,仍可能导致信念和头寸同步化,进而引发波动率集聚、流动性紧张及尾部风险上升。这一机制揭示了低感知波动率环境下因头寸粘性积累隐性杠杆,随后在冲击下触发同步去杠杆的内在风险传导路径,为宏观审慎政策提供了监测和维护AI系统对市场信息表征多样性的重要依据。

链接: https://arxiv.org/abs/2604.22818
作者: Yimeng Qiu,Qiwei Han
机构: 未知
类目: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:This paper investigates how similarity in the informational representation of market states among Artificial Intelligence (AI) trading agents can generate systemic instability in financial markets. We construct a structural multi-agent market model calibrated using high-frequency microstructural moments. AI agents are modeled through a two-layer decision architecture consisting of a nonlinear representation layer and an adaptive linear readout layer. The representation layer maps raw market states into high-dimensional feature vectors, while the readout layer generates return forecasts that feed into a risk-controlled trading rule. This representation-based microfoundation separates two objects that are often conflated in the literature: representation homogeneity (the degree to which agents encode market states into similar feature spaces) and forecast overlap (the degree to which agents produce similar return predictions). We show theoretically that these two concepts are related but not equivalent, and that representation homogeneity can compress the effective space of forecast disagreement under stress even when predictions appear diverse in normal times. Through controlled factorial experiments that vary representation homogeneity while conditioning on alternative risk-aversion and learning-rate distributions, we hypothesize that increasing representation similarity amplifies synchronization in beliefs and positions, leading to volatility clustering, liquidity stress, and elevated tail risk. Our structural mechanisms suggest that low perceived volatility regimes can endogenously accumulate hidden leverage through position stickiness, which subsequently collapses when shocks trigger synchronized deleveraging. The results provide a structural foundation for macroprudential policies aimed at monitoring and preserving diversity in how AI systems represent and process market information.

自然语言处理

[NLP-0] Sentiment and Emotion Classification of Indonesian E-Commerce Reviews via Multi-Task BiLSTM and AutoML Benchmarking

【速读】: 该论文旨在解决印尼电商平台评论中因混合使用标准词汇、俚语、地区借词、数字缩写及表情符号而导致传统基于词典的情感分析工具失效的问题。其解决方案的关键在于提出一个双轨分类流水线:第一轨采用TF-IDF向量化结合PyCaret AutoML对多种标准分类器进行自动筛选;第二轨则构建一个共享编码器的双向长短期记忆网络(BiLSTM),配备两个任务特定输出头,分别用于二分类情感(正/负)和五类情绪(快乐、悲伤、恐惧、爱、愤怒)预测。此外,研究引入了包含140个条目的俚语词典和14步预处理流程以提升文本质量,从而显著增强模型在复杂语言环境下的鲁棒性与准确性。

链接: https://arxiv.org/abs/2604.24720
作者: Hermawan Manurung,Ibrahim Al-Kahfi,Ahmad Rizqi,Martin Clinton Tosima Manullang
机构: Institut Teknologi Sumatera (印尼科技大学)
类目: Computation and Language (cs.CL)
备注: 8 pages, 5 figures, 4 tables. Final project for Natural Language Processing course (PBA 2026) at Institut Teknologi Sumatera

点击查看摘要

Abstract:Indonesian marketplace reviews mix standard vocabulary with slang, regional loanwords, numeric shorthands, and emoji, making lexicon-based sentiment tools unreliable in practice. This paper describes a two-track classification pipeline applied to the PRDECT-ID dataset, which contains 5,400 product reviews from 29 Indonesian e-commerce categories, each labeled for binary sentiment (Positive/Negative) and five-class emotion (Happy, Sad, Fear, Love, Anger). The first track applies TF-IDF vectorization with a PyCaret AutoML sweep across standard classifiers. The second track is a PyTorch Bidirectional Long Short-Term Memory (BiLSTM) network with a shared encoder and two task-specific output heads. A preprocessing module applies 14 sequential cleaning steps, including a 140-entry slang dictionary assembled from marketplace corpora. Four configurations are benchmarked: BiLSTM Baseline, BiLSTM Improved, BiLSTM Large, and TextCNN. Training uses class-weighted cross-entropy loss, ReduceLROnPlateau scheduling, and early stopping. Both tracks are deployed as Gradio applications on Hugging Face Spaces. Source code is publicly available at this https URL.

[NLP-1] Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

【速读】: 该论文旨在解决纯Transformer语言模型在长文本处理中面临的计算资源消耗大、KV-cache内存占用高以及上下文长度受限的问题,同时希望利用已有的预训练Transformer检查点(checkpoint)实现高效迁移,而非从头预训练。其解决方案的关键在于提出一种名为HyLo(HYbrid LOng-context)的“升级”(upcycling)方法:通过架构适配引入高效的线性序列建模模块(如Mamba2或Gated DeltaNet)与多头潜在注意力(Multi-Head Latent Attention, MLA),结合分阶段长上下文训练和教师引导蒸馏策略,在不牺牲短上下文性能的前提下显著提升长上下文能力;该方案可将可用上下文长度扩展至原始模型的32倍,并减少超过90%的KV-cache内存占用,从而在vLLM推理栈中支持高达2M tokens的前向填充和解码,优于同类基线模型(如JetNemotron)。

链接: https://arxiv.org/abs/2604.24715
作者: Parsa Ashrafi Fashi,Utkarsh Saxena,Mehdi Rezagholizadeh,Aref Jafari,Akash Haridas,Mingyu Yang,Vansh Bhatia,Guihong Li,Vikram Appia,Emad Barsoum
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Hybrid sequence models that combine efficient Transformer components with linear sequence modeling blocks are a promising alternative to pure Transformers, but most are still pretrained from scratch and therefore fail to reuse existing Transformer checkpoints. We study upcycling as a practical path to convert pretrained Transformer LLMs into hybrid architectures while preserving short-context quality and improving long-context capability. We call our solution \emphHyLo (HYbrid LOng-context): a long-context upcycling recipe that combines architectural adaptation with efficient Transformer blocks, Multi-Head Latent Attention (MLA), and linear blocks (Mamba2 or Gated DeltaNet), together with staged long-context training and teacher-guided distillation for stable optimization. HyLo extends usable context length by up to 32\times through efficient post-training and reduces KV-cache memory by more than 90% , enabling up to 2M-token prefill and decoding in our \textttvLLM inference stack, while comparable Llama baselines run out of memory beyond 64K context. Across 1B- and 3B-scale settings (Llama- and Qwen-based variants), HyLo delivers consistently strong short- and long-context performance and significantly outperforms state-of-the-art upcycled hybrid baselines on long-context evaluations such as RULER. Notably, at similar scale, HyLo-Qwen-1.7B trained on only 10B tokens significantly outperforms JetNemotron (trained on 400B tokens) on GSM8K, Lm-Harness common sense reasoning and RULER-64K.

[NLP-2] Case-Specific Rubrics for Clinical AI Evaluation: Methodology Validation and LLM -Clinician Agreement Across 823 Encounters

【速读】: 该论文旨在解决临床人工智能(Clinical AI)文档系统在评估过程中面临的三大挑战:评估方法需具备临床有效性、经济可行性,并能敏感地响应迭代更新。传统依赖专家逐项评审的方法成本高、效率低,难以支持安全、持续的部署。其解决方案的关键在于提出一种基于案例特异性的、由临床医生撰写的评分量表(rubric)方法,用于评估AI输出质量,并验证大型语言模型(LLM)生成的量表能否逼近临床医生间的共识。研究通过20名临床医生为823个真实与合成病例构建1646条量表,证明人工量表能有效区分高质量与低质量输出(中位分数差距82.9%),且评分稳定性高(中位范围0.00%);同时发现LLM生成的量表与临床医生评分的一致性(tau: 0.42–0.46)可媲美甚至超过临床医生之间的一致性(tau: 0.38–0.43),表明LLM量表可在约千分之一的成本下实现大规模评估覆盖,而临床医生的持续参与则确保了评估的专业权威性。

链接: https://arxiv.org/abs/2604.24710
作者: Aaryan Shah,Andrew Hines,Alexia Downs,Denis Bajet,Paulius Mui,Fabiano Araujo,Laura Offutt,Aida Rutledge,Elizabeth Jimenez
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 14 pages, 2 figures, 3 tables, submitted to JAMIA

点击查看摘要

Abstract:Objective. Clinical AI documentation systems require evaluation methodologies that are clinically valid, economically viable, and sensitive to iterative changes. Methods requiring expert review per scoring instance are too slow and expensive for safe, iterative deployment. We present a case-specific, clinician-authored rubric methodology for clinical AI evaluation and examine whether LLM-generated rubrics can approximate clinician agreement. Materials and Methods. Twenty clinicians authored 1,646 rubrics for 823 clinical cases (736 real-world, 87 synthetic) across primary care, psychiatry, oncology, and behavioral health. Each rubric was validated by confirming that an LLM-based scoring agent consistently scored clinician-preferred outputs higher than rejected ones. Seven versions of an EHR-embedded AI agent for clinicians were evaluated across all cases. Results. Clinician-authored rubrics discriminated effectively between high- and low-quality outputs (median score gap: 82.9%) with high scoring stability (median range: 0.00%). Median scores improved from 84% to 95%. In later experiments, clinician-LLM ranking agreement (tau: 0.42-0.46) matched or exceeded clinician-clinician agreement (tau: 0.38-0.43), attributable to both ceiling compression and LLM rubric improvement. Discussion. This convergence supports incorporating LLM rubrics alongside clinician-authored ones. At roughly 1,000 times lower cost, LLM rubrics enable substantially greater evaluation coverage, while continued clinical authorship grounds evaluation in expert judgment. Ceiling compression poses a methodological challenge for future inter-rater agreement studies. Conclusion. Case-specific rubrics offer a path for clinical AI evaluation that preserves expert judgment while enabling automation at three orders lower cost. Clinician-authored rubrics establish the baseline against which LLM rubrics are validated. Comments: 14 pages, 2 figures, 3 tables, submitted to JAMIA Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) ACMclasses: J.3; I.2.7 Cite as: arXiv:2604.24710 [cs.AI] (or arXiv:2604.24710v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.24710 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Aaryan Shah [view email] [v1] Mon, 27 Apr 2026 17:17:56 UTC (5,045 KB) Full-text links: Access Paper: View a PDF of the paper titled Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters, by Aaryan Shah and 8 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.AI prev | next new | recent | 2026-04 Change to browse by: cs cs.CL References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[NLP-3] Green Shielding: A User-Centric Approach Towards Trustworthy AI

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中对用户提问方式的敏感性问题,即即使是非对抗性的常规输入变化(如措辞微调),也可能导致模型输出显著偏离预期,而现有红队测试(red-teaming)方法未能有效覆盖此类现象。解决方案的核心在于提出“绿色防护”(Green Shielding)这一以用户为中心的框架,其关键在于通过CUE标准(Context、Utility、Elicitation)构建可验证的部署指导:即使用真实场景下的上下文(Context)、临床可解释的参考标准与指标(Utility),以及反映日常交互变异的扰动机制(Elicitation)。作者以医疗诊断为例,开发了HealthCareMagic-Diagnosis(HCM-Dx)基准,结合患者自述查询、结构化参考诊断及临床相关评估指标,揭示了提示层面因素如何系统性地改变模型输出的合理性与覆盖范围,并发现存在帕累托式权衡关系——例如中性化处理虽提升诊断的临床合理性与简洁性,但可能降低对高风险疾病的覆盖度。这为高风险领域中的安全部署提供了基于实证的用户导向干预策略。

链接: https://arxiv.org/abs/2604.24700
作者: Aaron J. Li,Nicolas Sanchez,Hao Huang,Ruijiang Dong,Jaskaran Bains,Katrin Jaradeh,Zhen Xiang,Bo Li,Feng Liu,Aaron Kornblith,Bin Yu
机构: University of California, Berkeley(加州大学伯克利分校); University of California, San Francisco(加州大学旧金山分校); University of Melbourne(墨尔本大学); University of Georgia(佐治亚大学); University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed, yet their outputs can be highly sensitive to routine, non-adversarial variation in how users phrase queries, a gap not well addressed by existing red-teaming efforts. We propose Green Shielding, a user-centric agenda for building evidence-backed deployment guidance by characterizing how benign input variation shifts model behavior. We operationalize this agenda through the CUE criteria: benchmarks with authentic Context, reference standards and metrics that capture true Utility, and perturbations that reflect realistic variations in the Elicitation of model behavior. Guided by the PCS framework and developed with practicing physicians, we instantiate Green Shielding in medical diagnosis through HealthCareMagic-Diagnosis (HCM-Dx), a benchmark of patient-authored queries, together with structured reference diagnosis sets and clinically grounded metrics for evaluating differential diagnosis lists. We also study perturbation regimes that capture routine input variation and show that prompt-level factors shift model behavior along clinically meaningful dimensions. Across multiple frontier LLMs, these shifts trace out Pareto-like tradeoffs. In particular, neutralization, which removes common user-level factors while preserving clinical content, increases plausibility and yields more concise, clinician-like differentials, but reduces coverage of highly likely and safety-critical conditions. Together, these results show that interaction choices can systematically shift task-relevant properties of model outputs and support user-facing guidance for safer deployment in high-stakes domains. Although instantiated here in medical diagnosis, the agenda extends naturally to other decision-support settings and agentic AI systems.

[NLP-4] he Chameleons Limit: Investigating Persona Collapse and Homogenization in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多智能体模拟等应用中因个体角色(persona)设定差异而导致的“人格坍缩”(Persona Collapse)问题,即多个被赋予不同身份特征的代理最终表现出高度同质化的行为模式,导致模拟群体缺乏真实多样性。解决方案的关键在于提出一个量化框架,从三个维度评估代理群体的行为多样性:覆盖度(Coverage,衡量占据的人格空间比例)、均匀性(Uniformity,衡量代理在人格空间中的分布是否均衡)和复杂性(Complexity,衡量行为模式的丰富程度),并通过该框架揭示模型在不同维度(如人格与道德推理)上呈现的结构性退化现象,从而为LLM在群体层面的行为多样性提供可测量、可诊断的评估工具。

链接: https://arxiv.org/abs/2604.24698
作者: Yunze Xiao,Vivienne J. Zhang,Chenghao Yang,Ningshan Ma,Weihao Xuan,Jen-tse Huang
机构: CMU(卡内基梅隆大学); UChicago(芝加哥大学); MIT(麻省理工学院); 2077.ai; UTokyo(东京大学); RIKEN AIP(理化学研究所人工智能中心); JHU(约翰霍普金斯大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Applications based on large language models (LLMs), such as multi-agent simulations, require population diversity among agents. We identify a pervasive failure mode we term \emphPersona Collapse: agents each assigned a distinct profile nonetheless converge into a narrow behavioral mode, producing a homogeneous simulated population. To quantify persona collapse, we propose a framework that measures how much of the persona space a population occupies (Coverage), how evenly agents spread across it (Uniformity), and how rich the resulting behavioral patterns are (Complexity). Evaluating ten LLMs on personality simulation (BFI-44), moral reasoning, and self-introduction, we observe persona collapse along two axes: (1) Dimensions: a model can appear diverse on one axis yet structurally degenerate on another, and (2) Domains: the same model may collapse the most in personality yet be the most diverse in moral reasoning. Furthermore, item-level diagnostics reveal that behavioral variation tracks coarse demographic stereotypes rather than the fine-grained individual differences specified in each persona. Counter-intuitively, \textbfthe models achieving the highest per-persona fidelity consistently produce the most stereotyped populations. We release our toolkit and data to support population-level evaluation of LLMs.

[NLP-5] Contextual Linear Activation Steering of Language Models

【速读】: 该论文旨在解决现有线性激活引导(Linear Activation Steering)方法在应用时对所有token采用固定引导强度,导致在不同输入提示(prompt)下引导效果不一致的问题。解决方案的关键在于提出上下文感知的线性激活引导(Contextual Linear Activation Steering, CLAS),通过动态调整引导强度以适应输入上下文,从而提升引导的一致性和准确性。CLAS在11个引导基准和4种模型家族上均优于标准线性激活引导,并在有限标注数据场景下达到或超越ReFT和LoRA的性能表现。

链接: https://arxiv.org/abs/2604.24693
作者: Brandon Hsu,Daniel Beaglehole,Adityanarayanan Radhakrishnan,Mikhail Belkin
机构: UC San Diego (加州大学圣地亚哥分校); MIT (麻省理工学院); Broad Institute of MIT and Harvard (MIT和哈佛大学的博德研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Linear activation steering is a powerful approach for eliciting the capabilities of large language models and specializing their behavior using limited labeled data. While effective, existing methods often apply a fixed steering strength to all tokens, resulting in inconsistent steering quality across diverse input prompts. In this work, we introduce Contextual Linear Activation Steering (CLAS), a method that dynamically adapts linear activation steering to context-dependent steering strengths. Across eleven steering benchmarks and four model families, it consistently outperforms standard linear activation steering and matches or exceeds the performance of ReFT and LoRA in settings with limited labeled data. We therefore propose CLAS as a scalable, interpretable, and accurate method for specializing and steering large language models.

[NLP-6] Can LLM s Act as Historians? Evaluating Historical Research Capabilities of LLM s via the Chinese Imperial Examination ACL2026

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在历史研究领域中专业级推理能力不足的问题,现有评估基准多局限于基础知识广度或词汇理解,未能有效衡量如证据推理(evidentiary reasoning)等历史研究的核心高阶能力。其解决方案的关键在于提出ProHist-Bench——一个基于中国科举制度的新型评测基准,该体系覆盖八代王朝、包含400道专家精制难题及10,891条细粒度评分标准,通过跨学科协作构建出能够真实反映历史推理复杂性的评估框架,从而系统性揭示LLMs在处理专业历史问题时的能力瓶颈,并为发展领域专用推理型LLMs提供可靠工具与方向。

链接: https://arxiv.org/abs/2604.24690
作者: Lirong Gao,Zeqing Wang,Yuyan Cai,Jiayi Deng,Yanmei Gu,Yiming Zhang,Jia Zhou,Yanfei Zhang,Junbo Zhao
机构: Zhejiang University (浙江大学); Ant Group (蚂蚁集团)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026

点击查看摘要

Abstract:While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning,that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at this https URL.

[NLP-7] Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLM s under Evidential Trust Manipulation

【速读】: 该论文旨在解决土耳其语中证据形态(evidential morphology)是否受信息来源可信度影响,以及大型语言模型(Large Language Models, LLMs)能否捕捉这种敏感性的问题。其解决方案的关键在于通过控制补全任务(cloze task)中的信息源为外部明确但可信度被操纵的高信任(High-Trust)与低信任(Low-Trust)情境,系统考察母语者在生成 -DI 和 -mIs 这两种过去时标记时的行为差异,并进一步评估10个LLM在不同提示范式下的表现。结果表明,人类受试者表现出显著且稳定的可信度效应,而LLM则呈现高度模型和提示依赖性,整体上缺乏一致性、稳定性及对源可信度的敏感性,揭示了人类与LLM在基于信任的证据推理上的显著差距。

链接: https://arxiv.org/abs/2604.24665
作者: Sercan Karakaş,Yusuf Şimşek
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to The 15th edition of the Workshop on Cognitive Modeling and Computational Linguistics, co-located with the Language Resources and Evaluation Conference

点击查看摘要

Abstract:This paper investigates whether source trustworthiness shapes Turkish evidential morphology and whether large language models (LLMs) track this sensitivity. We study the past-domain contrast between -DI and -mIs in controlled cloze contexts where the information source is overtly external, while only its perceived reliability is manipulated (High-Trust vs. Low-Trust). In a human production experiment, native speakers of Turkish show a robust trust effect: High-Trust contexts yield relatively more -DI, whereas Low-Trust contexts yield relatively more -mIs, with the pattern remaining stable across sensitivity analyses. We then evaluate 10 LLMs in three prompting paradigms (open gap-fill, explicit past-tense gap-fill, and forced-choice A/B selection). LLM behavior is highly model- and prompt-dependent: some models show weak or local trust-consistent shifts, but effects are generally unstable, often reversed, and frequently overshadowed by output-compliance problems and strong base-rate suffix preferences. The results provide new evidence for a trust-/commitment-based account of Turkish evidentiality and reveal a clear human-LLM gap in source-sensitive evidential reasoning.

[NLP-8] DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本推理过程中因键值缓存(Key-Value Cache, KV cache)内存占用随序列长度线性增长而导致的内存瓶颈问题。现有方法通常采用统一的剪枝比例对所有层进行KV缓存裁剪,隐含假设各层对模型性能的贡献一致,但研究发现这一假设并不成立,不同层对剪枝的敏感度存在显著差异。论文提出DepthKV框架,其核心创新在于基于各层敏感度动态分配固定的全局KV缓存预算,而非均匀分配,从而实现更高效的缓存资源利用,在多个模型和任务上均优于统一剪枝策略。

链接: https://arxiv.org/abs/2604.24647
作者: Zahra Dehghanighobadi,Asja Fischer
机构: Ruhr University Bochum(鲁尔大学波鸿分校); UAR Research Center for Trustworthy Data Science and Security(可信数据科学与安全联合研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint grows linearly with sequence length, leading to a major memory bottleneck. To mitigate this overhead, KV cache pruning methods discard cached tokens with low attention scores during inference. Most existing methods apply a uniform pruning ratio across layers, implicitly assuming that all layers contribute equally to overall model performance. We show that this assumption is suboptimal, as layers differ significantly in their sensitivity to pruning. We propose DepthKV, a layer-dependent pruning framework that allocates a fixed global KV budget across layers based on their sensitivity, rather than using a uniform allocation. Across multiple models and tasks, DepthKV consistently outperforms uniform pruning at the same global pruning ratio, demonstrating more effective utilization of the KV cache budget through layer-dependent allocation.

[NLP-9] K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning Locality and Multimodality in Meteorology ACL2026

【速读】: 该论文旨在解决当前针对韩国气象预报员的实用型多模态大语言模型助手开发中缺乏基于权威来源的多维专家级评估框架的问题。其解决方案的关键在于提出K-MetBench,一个基于国家级资格考试构建的诊断性基准测试,能够从四个维度揭示模型缺陷:专家级图表视觉推理能力、通过专家验证的逻辑有效性、韩语特定地理文化理解能力以及细粒度领域分析能力。该基准不仅识别出模型在专业图表解读上的显著模态差距和尽管预测正确但存在逻辑幻觉的推理缺陷,还发现本地模型在本土情境下显著优于更大规模的全球模型,强调参数量扩展无法解决文化依赖问题,从而为开发可靠且具备文化敏感性的专家级AI代理提供了明确路径。

链接: https://arxiv.org/abs/2604.24645
作者: Soyeon Kim,Cheongwoong Kang,Myeongjin Lee,Eun-Chul Chang,Jaedeok Lee,Jaesik Choi
机构: KAIST (한국과학기술원); Kongju National University (공주국립대학교)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 39 pages, 32 figures, 14 tables, including appendices. Accepted to Findings of the Association for Computational Linguistics (ACL 2026)

点击查看摘要

Abstract:The development of practical (multimodal) large language model assistants for Korean weather forecasters is hindered by the absence of a multidimensional, expert-level evaluation framework grounded in authoritative sources. To address this, we introduce K-MetBench, a diagnostic benchmark grounded in national qualification exams. It exposes critical gaps across four dimensions: expert visual reasoning of charts, logical validity via expert-verified rationales, Korean-specific geo-cultural comprehension, and fine-grained domain analysis. Our evaluation of 55 models reveals a profound modality gap in interpreting specialized diagrams and a reasoning gap where models hallucinate logic despite correct predictions. Crucially, Korean models outperform significantly larger global models in local contexts, demonstrating that parameter scaling alone cannot resolve cultural dependencies. K-MetBench serves as a roadmap for developing reliable, culturally aware expert AI agents. The dataset is available at this https URL .

[NLP-10] Less Is More: Engineering Challenges of On-Device Small Language Model Integration in a Mobile Application

【速读】: 该论文旨在解决在移动设备上部署小语言模型(Small Language Models, SLMs)时面临的工程挑战,以实现真正离线、私密的AI体验(即无需云端依赖且数据不出设备)。研究通过一个为期5天的开发冲刺案例,深入分析了将Gemma 4 E2B(2.6B参数)和Qwen3(0.6B参数)集成到生产级Android游戏Palabrita中的实践问题。关键解决方案在于从最初期望LLM生成完整结构化谜题(包括词、类别、难度及五条提示)的高复杂度设计,转变为一种务实架构:仅由预定义词表提供词汇,LLM负责生成三条简短提示,并设置确定性回退机制。这一转变显著提升了系统稳定性与可用性,其核心思想是“最可靠的本地LLM功能,正是让LLM做最少事情的功能”。论文进一步提炼出八项可操作的设计启发式原则,为移动端SLM集成提供了实证指导。

链接: https://arxiv.org/abs/2604.24636
作者: William Oliveira
机构: Google DeepMind; Alibaba; Microsoft
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 28 pages, 8 tables, 17 references

点击查看摘要

Abstract:On-device Small Language Models (SLMs) promise fully offline, private AI experiences for mobile users (no cloud dependency, no data leaving the device). But is this promise achievable in practice? This paper presents a longitudinal practitioner case study documenting the engineering challenges of integrating SLMs (Gemma 4 E2B, 2.6B parameters; Qwen3 0.6B, 600M parameters) into Palabrita, a production Android word-guessing game. Over a 5-day development sprint comprising 204 commits (~90 directly AI-related), the system underwent a radical transformation: from an ambitious design where the LLM generated complete structured puzzles (word, category, difficulty, and five hints as JSON) to a pragmatic architecture where curated word lists provide the words and the LLM generates only three short hints, with a deterministic fallback if it fails. We identify five categories of failures specific to on-device SLM integration: output format violations, constraint violations, context quality degradation, latency incompatibility, and model selection instability. For each failure category, we document the observed symptoms, root causes, and the prompt engineering and architectural strategies that effectively mitigated them, including multi-layer defensive parsing, contextual retry with failure feedback, session rotation, progressive prompt hardening, and systematic responsibility reduction. Our findings demonstrate that on-device SLMs are viable for production mobile applications, but only when the developer accepts a fundamental constraint: the most reliable on-device LLM feature is one where the LLM does the least. We distill our experience into eight actionable design heuristics for practitioners integrating SLMs into mobile apps.

[NLP-11] Looking for the Bottleneck in Fine-grained Temporal Relation Classification

【速读】: 该论文旨在解决细粒度时间关系分类(fine-grained temporal relation classification)问题,即准确识别文本中两个时间实体(如事件或时间表达式)之间的时序关系。由于任务复杂性,以往研究逐步简化数据集并聚焦于事件对之间的部分关系类别,导致对完整时序关系建模能力受限。本文提出的关键解决方案是“Point-to-Interval”范式,其核心思想是首先将时间实体的端点(point)间的关系进行分类(如before、after、overlap等),然后基于这些点关系推导出完整的区间(interval)关系。该方法在TempEval-3基准上取得了70.1%的时间意识得分(temporal awareness score),达到当时最优性能,验证了从点关系到区间关系的解码策略的有效性。

链接: https://arxiv.org/abs/2604.24620
作者: Hugo Sousa,Ricardo Campos,Alípio Jorge
机构: University of Porto (波尔图大学); INESC TECPorto (INESC TEC波尔图); University of Beira Interior (贝拉内陆大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Temporal relation classification is the task of determining the temporal relation between pairs of temporal entities in a text. Despite recent advancements in natural language processing, temporal relation classification remains a considerable challenge. Early attempts framed this task using a comprehensive set of temporal relations between events and temporal expressions. However, due to the task complexity, datasets have been progressively simplified, leading recent approaches to focus on the relations between event pairs and to use only a subset of relations. In this work, we revisit the broader goal of classifying interval relations between temporal entities by considering the full set of relations that can hold between two time intervals. The proposed approach, Interval from Point, involves first classifying the point relations between the endpoints of the temporal entities and then decoding these point relations into an interval relation. Evaluation on the TempEval-3 dataset shows that this approach can yield effective results, achieving a temporal awareness score of 70.1 percent, a new state-of-the-art on this benchmark. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2604.24620 [cs.CL] (or arXiv:2604.24620v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2604.24620 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3805712.3809581 Focus to learn more DOI(s) linking to related resources Submission history From: Hugo Sousa [view email] [v1] Mon, 27 Apr 2026 15:48:45 UTC (638 KB) Full-text links: Access Paper: View a PDF of the paper titled Looking for the Bottleneck in Fine-grained Temporal Relation Classification, by Hugo Sousa and 2 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.CL prev | next new | recent | 2026-04 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[NLP-12] Evaluation of Pose Estimation Systems for Sign Language Translation LREC2026

【速读】: 该论文旨在解决姿态估计器(pose estimator)选择对基于姿态的手语翻译(Sign Language Translation, SLT)系统性能影响不明确的问题。当前多数SLT系统默认采用如MediaPipe Holistic或OpenPose等通用姿态估计工具,但其对翻译质量的具体影响缺乏系统评估。解决方案的关键在于:在RWTH-PHOENIX-Weather 2014数据集上构建一个受控的SLT流水线,仅改变输入姿态表示来源,对比多种主流及新兴姿态估计模型(包括MediaPipe Holistic、OpenPose、MMPose WholeBody、OpenPifPaf、AlphaPose、SDPose、Sapiens和SMPLest-X),并通过BLEU和BLEURT指标量化下游翻译性能差异;同时结合Signsuisse高分辨率视频分析时间稳定性、手部关键点缺失率与遮挡鲁棒性,揭示姿态估计质量与翻译准确性的内在关联。实验表明,SDPose和Sapiens表现最优(BLEU≈11.5),且Sapiens在遮挡场景下具备最强鲁棒性(15/15正确),而频繁遗漏手部关键点的估计器则显著降低翻译性能。

链接: https://arxiv.org/abs/2604.24609
作者: Catherine O’Brien,Gerard Sant,Mathias Müller,Sarah Ebling
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at LREC 2026 Workshop on the Representation and Processing of Sign Languages. O’Brien and Sant contributed equally to this paper. 16 pages, 6 figures

点击查看摘要

Abstract:Many sign language translation (SLT) systems operate on pose sequences instead of raw video to reduce input dimensionality, improve portability, and partially anonymize signers. The choice of pose estimator is often treated as an implementation detail, with systems defaulting to widely available tools such as MediaPipe Holistic or OpenPose. We present a systematic comparison of pose estimators for pose-based SLT, covering widely used baselines (MediaPipe Holistic, OpenPose) and newer whole-body/high-capacity models (MMPose WholeBody, OpenPifPaf, AlphaPose, SDPose, Sapiens, SMPLest-X). We quantify downstream impact by training a controlled SLT pipeline on RWTH-PHOENIX-Weather 2014 where only the pose representation varies, evaluating with BLEU and BLEURT. To contextualize translation outcomes, we analyze temporal stability, missing hand keypoints, and robustness to occlusion using higher-resolution videos from the Signsuisse dataset. SDPose and Sapiens achieve the best translation performance (BLEU ~11.5), outperforming the common MediaPipe baseline (BLEU ~10). In occlusion cases, Sapiens is correct in all tested instances (15/15), while OpenPifPaf fails in nearly all (1/15) and also yields the weakest translation scores. Estimators that frequently leave out hand keypoints are associated with lower BLEU/BLEURT. We release code that can be used not only to reproduce our experiments, but also considerably lowers the barrier for other researchers to use alternative pose estimators. Comments: Accepted at LREC 2026 Workshop on the Representation and Processing of Sign Languages. O’Brien and Sant contributed equally to this paper. 16 pages, 6 figures Subjects: Computation and Language (cs.CL) Cite as: arXiv:2604.24609 [cs.CL] (or arXiv:2604.24609v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2604.24609 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-13] Skill Retrieval Augmentation for Agent ic AI

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在作为代理(agent)执行复杂任务时,因外部可复用技能(skill)数量增长而导致上下文窗口受限、技能识别准确率下降的问题。现有方法依赖显式枚举技能,难以扩展。其解决方案的核心是提出Skill Retrieval Augmentation (SRA) 新范式,即代理动态从大规模外部技能语料库中按需检索、融合并应用相关技能,从而突破上下文限制。为量化评估该范式,作者构建了首个端到端的基准测试工具SRA-Bench,涵盖技能检索、融合与任务执行三个阶段,实验表明基于检索的技能增强能显著提升代理性能,但同时揭示出当前LLM代理在技能加载决策上的缺陷——即无论是否检索到黄金技能或任务是否需要外部能力,均以相似频率加载技能,说明瓶颈不仅在于检索,更在于基础模型对何时及如何加载技能的判断能力。

链接: https://arxiv.org/abs/2604.24594
作者: Weihang Su,Jianming Long,Qingyao Ai,Yichen Tang,Changyue Wang,Yiteng Tu,Yiqun Liu
机构: Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incorporating skills is to explicitly enumerate available skills within the context window. However, this strategy fails to scale: as skill corpora expand, context budgets are consumed rapidly, and the agent becomes markedly less accurate in identifying the right skill. To this end, this paper formulates Skill Retrieval Augmentation (SRA), a new paradigm in which agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora on demand. To make this problem measurable, we construct a large-scale skill corpus and introduce SRA-Bench, the first benchmark for decomposed evaluation of the full SRA pipeline, covering skill retrieval, skill incorporation, and end-task execution. SRA-Bench contains 5,400 capability-intensive test instances and 636 manually constructed gold skills, which are mixed with web-collected distractor skills to form a large-scale corpus of 26,262 skills. Extensive experiments show that retrieval-based skill augmentation can substantially improve agent performance, validating the promise of the paradigm. At the same time, we uncover a fundamental gap in skill incorporation: current LLM agents tend to load skills at similar rates, regardless of whether a gold skill is retrieved or whether the task actually requires external capabilities. This shows that the bottleneck in skill augmentation lies not only in retrieval but also in the base model’s ability to determine which skill to load and when external loading is actually needed. These findings position SRA as a distinct research problem and establish a foundation for the scalable augmentation of capabilities in future agent systems.

[NLP-14] owards Lawful Autonomous Driving: Deriving Scenario-Aware Driving Requirements from Traffic Laws and Regulations

【速读】: 该论文旨在解决自动驾驶车辆(AV)在实际道路场景中难以准确遵守交通法规的问题。传统方法依赖形式逻辑语言显式定义行为约束,存在劳动密集、难以扩展和维护成本高的缺陷;而直接利用大语言模型(LLM)提取法律要求时,由于缺乏对结构化交通场景的显式锚定与推理,常导致检索无关条款或遗漏适用条文,从而产生不精确的规则推导结果。解决方案的关键在于提出一种新颖的推理管道,通过节点级锚点(node-wise anchors)将LLM的推理过程锚定于交通场景分类体系(taxonomy),以编码层次化语义信息,实现更精准的法律条文与场景匹配。实验表明,该方法在中文交通法规和OnSite数据集上显著提升了法律-场景匹配准确率(+29.1%)以及强制性与禁止性要求的推导准确性(分别提升36.9%和38.2%),并成功构建了可用于车载实时合规监测的法律合规层,为自动驾驶系统的开发、部署及监管提供了坚实基础。

链接: https://arxiv.org/abs/2604.24562
作者: Bowen Jian,Rongjie Yu,Hong Wang,Liqiang Wang,Zihang Zou
机构: Tongji University (同济大学); The Key Laboratory of Road and Traffic Engineering, Ministry of Education (教育部道路与交通工程重点实验室); Tsinghua University (清华大学); University of Central Florida (中佛罗里达大学); Optixway AI (Optixway AI)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Driving in compliance with traffic laws and regulations is a basic requirement for human drivers, yet autonomous vehicles (AVs) can violate these requirements in diverse real-world scenarios. To encode law compliance into AV systems, conventional approaches use formal logic languages to explicitly specify behavioral constraints, but this process is labor-intensive, hard to scale, and costly to maintain. With recent advances in artificial intelligence, it is promising to leverage large language models (LLMs) to derive legal requirements from traffic laws and regulations. However, without explicitly grounding and reasoning in structured traffic scenarios, LLMs often retrieve irrelevant provisions or miss applicable ones, yielding imprecise requirements. To address this, we propose a novel pipeline that grounds LLM reasoning in a traffic scenario taxonomy through node-wise anchors that encode hierarchical semantics. On Chinese traffic laws and OnSite dataset (5,897 scenarios), our method improves law-scenario matching by 29.1% and increases the accuracy of derived mandatory and prohibitive requirements by 36.9% and 38.2%, respectively. We further demonstrate real-world applicability by constructing a law-compliance layer for AV navigation and developing an onboard, real-time compliance monitor for in-field testing, providing a solid foundation for future AV development, deployment, and regulatory oversight.

[NLP-15] Aligned Multi-View Scripts for Universal Chart-to-Code Generation ACL2026

【速读】: 该论文旨在解决现有图表到代码(chart-to-code)生成方法主要局限于Python语言,导致实用性受限且未能充分利用跨语言语义等价性监督信号的问题。其关键解决方案是提出Chart2NCode数据集和CharLuMA模型:前者通过元数据驱动的模板生成与渲染验证流程构建了176K张图表及其在Python、R和LaTeX中语义等价的代码对;后者基于LLaVA架构设计了一种参数高效适配模块,利用语言条件的低秩子空间混合机制,在共享核心图表理解能力的同时,通过轻量级路由策略实现目标编程语言的代码生成专业化。实验表明,该方案在多语言执行成功率和视觉保真度上均显著优于开源基线,并保持与商业系统的竞争力。

链接: https://arxiv.org/abs/2604.24559
作者: Zhihan Zhang,Lizi Liao
机构: Singapore Management University (新加坡管理大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026 Main Conference

点击查看摘要

Abstract:Chart-to-code generation converts a chart image into an executable plotting script, enabling faithful reproduction and editable visualizations. Existing methods are largely Python-centric, limiting practical use and overlooking a critical source of supervision: the same chart can be expressed by semantically equivalent scripts in different plotting languages. To fill this gap, we introduce Chart2NCode, a dataset of 176K charts paired with aligned scripts in Python, R, and LaTeX that render visually equivalent outputs, constructed via a metadata-to-template pipeline with rendering verification and human quality checks. Building on a LLaVA-style architecture, we further propose CharLuMA, a parameter-efficient adaptation module that augments the multimodal projector with a language-conditioned mixture of low-rank subspaces, allowing the model to share core chart understanding while specializing code generation to the target language through lightweight routing. Extensive experiments show consistent gains in executability and visual fidelity across all languages, outperforming strong open-source baselines and remaining competitive with proprietary systems. Further analyses reveal that balanced multi-language supervision benefits all languages and that the adapter allocates a compact shared core plus language-specific capacity. Codes and data are available at this https URL.

[NLP-16] STELLAR-E: a Synthetic Tailored End-to-end LLM Application Rigorous Evaluator

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在特定领域和语言上缺乏高质量、可扩展且多语言支持的评估数据集的问题。现有自动化基准测试方法受限于依赖已有数据、扩展性差、单一领域聚焦以及缺乏多语言能力,难以满足多样化应用场景的需求。解决方案的关键在于提出 STELLAR-E——一个完全自动化的合成数据生成系统,其核心由两阶段构成:第一阶段通过改进 TGRT Self-Instruct 框架构建可控的合成数据引擎,实现仅需少量人工输入即可生成定制规模的高质量合成数据;第二阶段引入融合统计与 LLM 评分的评估流水线,确保生成数据适用于 LLM 应用评估。实验证明,该方法生成的数据在 LLM-as-a-judge 评分上平均优于现有语言特定基准 +5.7%,具备与真实数据相当的评估能力,同时显著提升效率与可扩展性,为 LLM 的公平、高效评估提供了可适配领域的自动化框架。

链接: https://arxiv.org/abs/2604.24544
作者: Alessio Sordo,Lingxiao Du,Meeka-Hanna Lenisa,Evgeny Bogdanov,Maxim Romanovsky
机构: Deutsche Bank(德意志银行)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due to privacy concerns, regulatory restrictions, and the time cost for manual creation. Existing automated benchmarking methods are often limited by relying on pre-existing data, poor scalability, single-domain focus, and lack of multilingual support. We present STELLAR-E - a fully automated system to generate high-quality synthetic datasets of custom size, using minimal human inputs without depending on existing datasets. The system is structured in two stages: (1) We modify the TGRT Self-Instruct framework to create a synthetic data engine that enables controllable, custom synthetic dataset generation, and (2) an evaluation pipeline incorporating statistical and LLM-based metrics to assess the applicability of the synthetic dataset for LLM-based application evaluations. The synthetic datasets reach an average difference of +5.7% in terms of LLM-as-a-judge scores against existing language-specific benchmarks, demonstrating comparable quality for comprehensive assessment of big and small LLMs. While real datasets remain slightly more challenging for LLMs especially for smaller models, this work establishes a scalable and domain-adaptable benchmarking framework that supports fair evaluation of LLM applications, offering a faster alternative to manual approaches and enabling high-efficiency automated quality assurance cycles.

[NLP-17] Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在运行时可能因训练数据中的后门(backdoors)、越狱攻击(jailbreaks)和提示注入(prompt injections)而产生不可预测的恶意行为问题,这些问题无法通过训练数据验证阶段发现。现有防御方法通常针对单一威胁类型设计,且依赖于干净参考模型、触发器知识或可编辑权重等理想假设,这些条件在第三方部署场景中难以满足。解决方案的关键在于提出一种无需调优的运行时监控机制——层间收敛指纹(Layerwise Convergence Fingerprinting, LCF),其核心思想是将模型各层之间的隐藏状态轨迹视为健康信号:LCF 在每层间计算对角马氏距离(diagonal Mahalanobis distance),利用Ledoit-Wolf收缩聚合特征,并通过留一法校准阈值,在仅使用200个干净样本的情况下实现对三类威胁的统一检测,无需参考模型、触发器知识或重新训练,从而为云端和边缘设备上的LLM提供通用安全防护层。

链接: https://arxiv.org/abs/2604.24542
作者: Nay Myat Min,Long H. Pham,Jun Sun
机构: Singapore Management University (新加坡管理大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 34 pages, 5 figures. Code: this https URL

点击查看摘要

Abstract:Large language models deployed at runtime can misbehave in ways that clean-data validation cannot anticipate: training-time backdoors lie dormant until triggered, jailbreaks subvert safety alignment, and prompt injections override the deployer’s instructions. Existing runtime defenses address these threats one at a time and often assume a clean reference model, trigger knowledge, or editable weights, assumptions that rarely hold for opaque third-party artifacts. We introduce Layerwise Convergence Fingerprinting (LCF), a tuning-free runtime monitor that treats the inter-layer hidden-state trajectory as a health signal: LCF computes a diagonal Mahalanobis distance on every inter-layer difference, aggregates via Ledoit-Wolf shrinkage, and thresholds via leave-one-out calibration on 200 clean examples, with no reference model, trigger knowledge, or retraining. Evaluated on four architectures (Llama-3-8B, Qwen2.5-7B, Gemma-2-9B, Qwen2.5-14B) across backdoors, jailbreaks, and prompt injection (56 backdoor combinations, 3 jailbreak techniques, and BIPIA email + code-QA), LCF reduces mean backdoor attack success rate (ASR) below 1% on Qwen2.5-7B and Gemma-2 and to 1.3% on Qwen2.5-14B, detects 92-100% of DAN jailbreaks (62-100% for GCG and softer role-play), and flags 100% of text-payload injections across all eight (model, domain) cells, at 12-16% backdoor FPR and 0.1% inference overhead. A single aggregation score covers all three threat families without threat-specific tuning, positioning LCF as a general-purpose runtime safety layer for cloud-served and on-device LLMs.

[NLP-18] Generating Place-Based Compromises Between Two Points of View

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在社会智能任务中表现不足的问题,特别是生成具有 empathic neutrality(情感中立性)的妥协方案能力较弱。其核心解决方案是通过引入外部情感相似性作为迭代反馈机制,优化妥协生成过程,从而提升妥协方案的可接受性。关键创新在于利用人类偏好数据对小型基础模型进行基于边距的对齐训练,不仅提高了效率,还避免了推理阶段对情感估计的依赖。

链接: https://arxiv.org/abs/2604.24536
作者: Sumanta Bhattacharyya,Francine Chen,Scott Carter,Yan-Ying Chen,Tatiana Lau,Nayeli Suseth Bravo,Monica P. Van,Kate Sieck,Charlene C. Wu
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校); Toyota Research Institute (丰田研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) excel academically but struggle with social intelligence tasks, such as creating good compromises. In this paper, we present methods for generating empathically neutral compromises between two opposing viewpoints. We first compared four different prompt engineering methods using Claude 3 Opus and a dataset of 2,400 contrasting views on shared places. A subset of the gen erated compromises was evaluated for acceptability in a 50-participant study. We found that the best method for generating compromises between two views used external empathic similarity between a compromise and each viewpoint as iterative feedback, outperforming stan dard Chain of Thought (CoT) reasoning. The results indicate that the use of empathic neutrality improves the acceptability of compromises. The dataset of generated compromises was then used to train two smaller foundation models via margin-based alignment of human preferences, improving efficiency and removing the need for empathy estimation during inference.

[NLP-19] SEARCH-R: Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator for Multi-hop Question Answering ACL2026

【速读】: 该论文旨在解决多跳问答(Multi-hop Question Answering, MHQA)中两个核心问题:一是如何生成正确且可控的推理路径以应对复杂用户查询;二是如何在大语言模型(Large Language Models, LLMs)知识受限的情况下准确检索具有实际价值的知识信息。现有方法主要依赖提示工程生成推理路径,并结合传统稀疏或稠密检索策略,但存在推理过程缺乏控制导致偏离正确路径、以及检索机制仅基于相似度匹配而忽视信息实用性的问题。为此,论文提出结构化实体感知检索与链式推理导航框架(SEARCH-R),其关键创新在于:1)通过微调Llama3.1-8B模型训练端到端的推理路径导航器,实现对子问题分解的有效控制;2)设计基于依赖树的检索机制,量化评估文档的信息贡献度,从而提升检索内容的相关性与多样性。

链接: https://arxiv.org/abs/2604.24515
作者: Yuqing Fu,Yimin Deng,Wanyu Wang,Yuhao Wang,Yejing Wang,Hongshi Liu,Yiqi Wang,Xiao Han,Maolin Wang,Guoshuai Zhao,Yi Chang,Xiangyu Zhao
机构: City University of Hong Kong(香港城市大学); Xi’an Jiaotong University(西安交通大学); Zhejiang University of Technology(浙江工业大学); Michigan State University(密歇根州立大学); Jilin University(吉林大学)
类目: Computation and Language (cs.CL)
备注: ACL2026 findings

点击查看摘要

Abstract:Multi-hop Question Answering (MHQA) aims to answer questions that require multi-step reasoning. It presents two key challenges: generating correct reasoning paths in response to the complex user queries, and accurately retrieving essential knowledge in the face of potential limitations in large language models (LLMs). Existing approaches primarily rely on prompt-based methods to generate reasoning paths, which are further combined with traditional sparse or dense retrieval to produce the final answer. However, the generation of reasoning paths commonly lacks effective control over the generative process, thus leading the reasoning astray. Meanwhile, the retrieval methods over-rely on knowledge matching or similarity scores rather than evaluating the practical utility of the information, resulting in retrieving homogeneous or non-useful information. Therefore, we propose a Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator framework named SEARCH-R. Specifically, SEARCH-R trains an end-to-end reasoning path navigator, which is able to provide a powerful sub-question decomposer by fine-tuning the Llama3.1-8B model. Moreover, a novel dependency tree-based retrieval is designed to evaluate the informational contribution of the document quantitatively. Extensive experiments on three challenging multi-hop datasets validate the effectiveness of the proposed framework. The code and dataset are available at: this https URL.

[NLP-20] Agent ic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus

【速读】: 该论文旨在解决多发性骨髓瘤(multiple myeloma)患者在长期治疗过程中,如何利用大语言模型(large language models, LLMs)高效整合分散于数十至数百份异构临床文档中的累积疾病信息,并达到接近专家共识的决策水平这一关键问题。其解决方案的核心在于提出并验证了一种基于代理推理(agentic reasoning)的系统架构,相较于单次检索增强生成(single-pass retrieval-augmented generation, RAG)、迭代RAG及全上下文输入等基线方法,该系统在复杂问题和长病历场景下显著提升了准确性(+3.8–13.5个百分点),并在最复杂任务中首次突破了其他方法共同达到的性能上限(75.8%),表明其具备处理高阶临床合成推理的能力,但需进一步前瞻性评估以确保临床安全性。

链接: https://arxiv.org/abs/2604.24473
作者: Johannes Moll,Jannik Lübberstedt,Christoph Nuernbergk,Jacob Stroh,Luisa Mertens,Anna Purcarea,Christopher Zirn,Zeineb Benchaaben,Fabian Drexel,Hartmut Häntze,Anirudh Narayanan,Friedrich Puttkammer,Andrei Zhukov,Jacqueline Lammert,Sebastian Ziegelmayer,Markus Graf,Marion Högner,Marcus Makowski,Florian Bassermann,Lisa C. Adams,Jiazhen Pan,Daniel Rueckert,Krischan Braitsch,Keno K. Bressem
机构: Technical University of Munich (TUM); TUM University Hospital; Klinikum rechts der Isar; School of Medicine and Health; Chair for AI in Healthcare and Medicine; Department of Diagnostic and Interventional Radiology; Department of Cardiovascular Radiology and Nuclear Medicine; German Heart Center; Department of Medicine III; Clinical Department of Gynecology; Institute of AI in Medicine and Healthcare; TUM School of Medicine and Health; Munich Center for Machine Learning (MCML); University of Oxford; Department of Engineering Science; Department of Computing; Imperial College London; TranslaTUM; Center for Translational Cancer Research; Deutsches Konsortium für Translationale Krebsforschung; Bavarian Cancer Research Center; Chair of Medical Informatics; Department of Radiology; Charité – Universitätsmedizin Berlin; Department of Gastroenterology, Infectious Diseases and Rheumatology
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multiple myeloma is managed through sequential lines of therapy over years to decades, with each decision depending on cumulative disease history distributed across dozens to hundreds of heterogeneous clinical documents. Whether LLM-based systems can synthesise this evidence at a level approaching expert agreement has not been established. A retrospective evaluation was conducted on longitudinal clinical records of 811 myeloma patients treated at a tertiary centre (2001-2026), covering 44,962 documents and 1,334,677 laboratory values, with external validation on MIMIC-IV. An agentic reasoning system was compared against single-pass retrieval-augmented generation (RAG), iterative RAG, and full-context input on 469 patient-question pairs from 48 templates at three complexity levels. Reference labels came from double annotation by four oncologists with senior haematologist adjudication. Iterative RAG and full-context input converged on a shared ceiling (75.4% vs 75.8%, p = 1.00). The agentic system reached 79.6% concordance (95% CI 76.4-82.8), exceeding both baselines (+3.8 and +4.2 pp; p = 0.006 and 0.007). Gains rose with question complexity, reaching +9.4 pp on criteria-based synthesis (p = 0.032), and with record length, reaching +13.5 pp in the top decile (n = 10). The system error rate (12.2%) was comparable to expert disagreement (13.6%), but severity was inverted: 57.8% of system errors were clinically significant versus 18.8% of expert disagreements. Agentic reasoning was the only approach to exceed the shared ceiling, with gains concentrated on the most complex questions and longest records. The greater clinical consequence of residual system errors indicates that prospective evaluation in routine care is required before these findings translate into patient benefit.

[NLP-21] Zero-shot Large Language Models for Automatic Readability Assessment ACL2026

【速读】: 该论文旨在解决无监督自动可读性评估(Automatic Readability Assessment, ARA)方法在实际应用中的局限性,尤其是在跨语言、文本长度和专业术语含量等复杂场景下性能不稳定的问题。其解决方案的关键在于提出一种新的零样本提示(zero-shot prompting)方法,并引入LAURAE模型——该模型融合大语言模型(Large Language Models, LLMs)与传统可读性公式得分,从而同时捕捉文本的上下文语义特征和浅层句法特征(如句子长度),显著提升了模型在多种数据集上的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2604.24470
作者: Riley Grossman,Yi Chen
机构: New Jersey Institute of Technology (新泽西理工学院)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 (Main Conference)

点击查看摘要

Abstract:Unsupervised automatic readability assessment (ARA) methods have important practical and research applications (e.g., ensuring medical or educational materials are suitable for their target audiences). In this paper, we propose a new zero-shot prompting methodology for ARA and present the first comprehensive evaluation of using large language models (LLMs) as an unsupervised ARA method by testing 10 diverse open-source LLMs (e.g., different sizes and developers) on 14 diverse datasets (e.g., different text lengths and languages). Our findings show that our proposed prompting methodology outperforms prior methods on 13 of the 14 datasets. Furthermore, we propose LAURAE, which combines LLM and readability formula scores to improve robustness by capturing both contextual and shallow (e.g., sentence length) features of readability. Our evaluation demonstrates that LAURAE robustly outperforms prior methods across languages, text lengths, and amounts of technical language.

[NLP-22] A Survey on Split Learning for LLM Fine-Tuning: Models Systems and Privacy Optimizations

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)微调过程中计算资源消耗高、数据隐私风险大的问题,尤其针对资源受限组织难以安全高效地适配LLMs的挑战。解决方案的关键在于采用分片学习(Split Learning)范式,将模型在客户端与服务器之间分割,通过交换中间特征表示实现协作训练,从而在不共享原始数据的前提下完成模型优化,兼顾了计算效率、系统可扩展性与隐私保护能力。

链接: https://arxiv.org/abs/2604.24468
作者: Zihan Liu,Yizhen Wang,Rui Wang,Xiu Tang,Sai Wu
机构: Zhejiang UniversityNingboChina
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Fine-tuning unlocks large language models (LLMs) for specialized applications, but its high computational cost often puts it out of reach for resource-constrained organizations. While cloud platforms could provide the needed resources, data privacy concerns make sharing sensitive information with third parties risky. A promising solution is split learning for LLM fine-tuning, which divides the model between clients and a server, allowing collaborative and secure training through exchanged intermediate data, thus enabling resource-constrained participants to adapt LLMs safely. % In light of this, a growing body of literature has emerged to advance this paradigm, introducing varied model methods, system optimizations, and privacy defense-attack techniques for split learning. To bring clarity and direction to the field, a comprehensive survey is needed to classify, compare, and critique these diverse approaches. This paper fills the gap by presenting the first extensive survey dedicated to split learning for LLM fine-tuning. We propose a unified, fine-grained training pipeline to pinpoint key operational components and conduct a systematic review of state-of-the-art work across three core dimensions: model-level optimization, system-level efficiency, and privacy preservation. Through this structured taxonomy, we establish a foundation for advancing scalable, robust, and secure collaborative LLM adaptation.

[NLP-23] Can You Make It Sound Like You? Post-Editing LLM -Generated Text for Personal Style ACL2026

【速读】: 该论文旨在解决用户在使用大语言模型(Large Language Models, LLMs)进行写作时,如何有效通过后编辑(post-editing)策略重塑生成文本以体现个人写作风格的问题。其核心挑战在于:尽管后编辑是常见的协作写作方式,但尚不明确用户能否真正将LLM生成内容调整至贴近自身风格,以及这种调整是否在客观测量中仍存在局限。解决方案的关键在于采用基于嵌入(embedding-based)的风格相似性度量方法,系统评估后编辑文本与用户未辅助写作样本及纯LLM输出之间的风格差异,发现后编辑虽能提升风格一致性并降低对LLM风格的依赖,但仍无法完全消除可检测的LLM痕迹,且风格多样性低于人类独立写作水平,揭示了感知真实性与实际风格匹配之间的显著差距。

链接: https://arxiv.org/abs/2604.24444
作者: Connor Baumler,Calvin Bao,Huy Nghiem,Xinchen Yang,Marine Carpuat,Hal Daumé III
机构: University of Maryland (马里兰大学)
类目: Computation and Language (cs.CL)
备注: ACL 2026

点击查看摘要

Abstract:Despite the growing use of large language models (LLMs) for writing tasks, users may hesitate to rely on LLMs when personal style is important. Post-editing LLM-generated drafts or translations is a common collaborative writing strategy, but it remains unclear whether users can effectively reshape LLM-generated text to reflect their personal style. We conduct a pre-registered online study ( n=81 ) in which participants post-edit LLM-generated drafts for writing tasks where personal style matters to them. Using embedding-based style similarity metrics, we find that post-editing increases stylistic similarity to participants’ unassisted writing and reduces similarity to fully LLM-generated output. However, post-edited text still remains stylistically closer in style to LLM text than to participants’ unassisted control text, and it exhibits reduced stylistic diversity compared to unassisted human text. We find a gap between perceived stylistic authenticity and model-measured stylistic similarity, with post-edited text often perceived as representative of participants’ personal style despite remaining detectable LLM stylistic traces.

[NLP-24] A Multi-Dimensional Audit of Politically Aligned Large Language Models

【速读】: 该论文旨在解决政治倾向性对齐的大型语言模型(Large Language Models, LLMs)可能引发的滥用风险问题,特别是其在政治话语等敏感场景中可能导致的偏见加剧、事实失真或误导性说服力增强。解决方案的关键在于提出一个基于哈贝马斯交往行为理论(Habermas’ Theory of Communicative Action)的多维评估框架,从有效性(effectiveness)、公平性(fairness)、真实性(truthfulness)和说服力(persuasiveness)四个维度,采用自动化定量指标系统性审计政治对齐LLMs的表现。实证结果揭示了不同对齐方法(微调与角色扮演)在各维度上的权衡关系,强调需发展更平衡、鲁棒的对齐策略以确保生成内容合法、无害且符合伦理规范。

链接: https://arxiv.org/abs/2604.24429
作者: Lisa Korver,Mohamed Mostagir,Sherief Reda
机构: Brown University (布朗大学); University of Michigan (密歇根大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As the application of Large Language Models (LLMs) spreads across various industries, there are increasing concerns about the potential for their misuse, especially in sensitive areas such as political discourse. Deliberately aligning LLMs with specific political ideologies, through prompt engineering or fine-tuning techniques, can be advantageous in use cases such as political campaigns, but requires careful consideration due to heightened risks of performance degradation, misinformation, or increased biased behavior. In this work, we propose a multi-dimensional framework inspired by Habermas’ Theory of Communicative Action to audit politically aligned language models across four dimensions: effectiveness, fairness, truthfulness, and persuasiveness using automated, quantitative metrics. Applying this to nine popular LLMs aligned via fine-tuning or role-playing revealed consistent trade-offs: while larger models tend to be more effective at role-playing political ideologies and truthful in their responses, they were also less fair, exhibiting higher levels of bias in the form of angry and toxic language towards people of different ideologies. Fine-tuned models exhibited lower bias and more effective alignment than the corresponding role-playing models, but also saw a decline in performance reasoning tasks and an increase in hallucinations. Overall, all of the models tested exhibited some deficiency in at least one of the four metrics, highlighting the need for more balanced and robust alignment strategies. Ultimately, this work aims to ensure politically-aligned LLMs generate legitimate, harmless arguments, offering a framework to evaluate the responsible political alignment of these models.

[NLP-25] Scaling Properties of Continuous Diffusion Spoken Language Models

【速读】: 该论文旨在解决语音-only 语言模型(Speech-only Language Models, SLMs)在性能上落后于文本模型及文本-语音联合模型的问题,尤其关注离散自回归(Discrete Autoregressive, AR)SLM 在计算资源和数据需求上的瓶颈。其解决方案的关键在于探索连续扩散(Continuous Diffusion, CD)建模方法作为替代路径,通过引入 phoneme Jensen-Shannon divergence (pJSD) 作为量化指标,发现 CD SLM 能够遵循与 AR 模型类似的缩放规律(scaling laws),且在大规模(如16B参数、数千万小时对话数据)训练下可生成具有情感、语调、多说话人和多语言特征的高质量语音,尽管长篇连贯性仍是挑战。

链接: https://arxiv.org/abs/2604.24416
作者: Jason Ramapuram,Eeshan Gunesh Dhekane,Amitis Shidani,Dan Busbridge,Bogdan Mazoure,Zijin Gu,Russ Webb,Tatiana Likhomanenko,Navdeep Jaitly
机构: Apple(苹果)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Speech-only spoken language models (SLMs) lag behind text and text-speech models in performance, with recent discrete autoregressive (AR) SLMs indicating significant computational and data demands to match text models. Since discretizing continuous speech for AR creates bottlenecks, we explore whether continuous diffusion (CD) SLM is more viable. To quantify the SLMs linguistic quality, we introduce the phoneme Jensen-Shannon divergence (pJSD) metric. Our analysis reveals CD SLMs, mirroring AR behavior, exhibit scaling laws for validation loss and pJSD, and show optimal token-to-parameter ratios decreasing as compute scales. However, for the latter, loss becomes insensitive to choice of data and model sizes, showing potential for fast inference. Scaling CD SLMs to 16B parameters with tens of millions of hours of conversational data enables generation of emotive, prosodic, multi-speaker, multilingual speech, though achieving long-form coherence remains a significant challenge.

[NLP-26] All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

【速读】: 该论文旨在解决当前大型音频语言模型(Large Audio-Language Models, LALMs)在基准测试中表现优异但可能并未真正实现稳健音频理解的问题。其核心挑战在于:高分表现可能源于模型对文本先验(text prior)的依赖,而非对声学信号的实际感知能力。论文提出了一种诊断框架,关键在于从两个维度评估模型性能:一是“文本先验”(text prior),衡量模型是否仅凭文本和通用知识即可作答;二是“音频依赖性”(audio reliance),评估模型对完整声学信号的真实依赖程度。实证结果显示,即使无音频输入,模型仍能保留60–72%的原始音频得分,且多数音频任务仅需局部片段即可完成,这揭示了现有基准测试存在系统性偏差,进而为提升评估可靠性与改进基准设计提供了可操作的指导原则。

链接: https://arxiv.org/abs/2604.24401
作者: Leonardo Haw-Yang Foo,Chih-Kai Yang,Chen-An Li,Ke-Han Lu,Hung-yi Lee
机构: National Taiwan University (国立台湾大学); NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 6 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Large Audio-Language Models show consistent performance gains across speech and audio benchmarks, yet high scores may not reflect true auditory perception. If a model can answer questions without processing the acoustic signal, the benchmark fails as a measure of auditory understanding. We present a diagnostic framework using two axes: text prior, which measures answerability from text and general knowledge alone, and audio reliance, which assesses actual dependency on the acoustic signal. Evaluating eight LALMs across three benchmarks, we find that models retain 60-72% of their full audio scores even without any audio input. Moreover, among items that require audio, only 3.0-4.2% need the complete audio clip; the majority can be resolved using localized fragments. These findings challenge the assumption that benchmark performance equals robust audio understanding, and we conclude with practical guidelines for improving evaluation reliability and benchmark design.

[NLP-27] Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics Recovery and Data Efficiency

【速读】: 该论文旨在解决大型视觉语言模型(Large Vision Language Models, LVLMs)在资源受限的边缘设备上部署时面临的计算和内存需求过高问题。现有参数压缩方法主要依赖于从小语言模型重新训练LVLM,但此类方法灵活性差且计算成本高。论文提出了一种互补策略:对已有的LVLM进行结构化剪枝(structured pruning),即采用逐层(layerwise)和按宽度(widthwise)两种剪枝方式,随后通过轻量级恢复训练(recovery training)来补偿性能损失。关键创新在于发现宽度剪枝在低资源场景下表现更优,且仅需5%的原始数据即可实现超过95%的原始性能,同时验证了仅微调多模态投影器(multimodal projector)在低压缩率下已足够,并指出监督微调与隐藏状态知识蒸馏(hidden-state distillation)相结合能获得最优恢复效果。

链接: https://arxiv.org/abs/2604.24380
作者: Yiran Huang,Lukas Thede,Massimiliano Mancini,Wenjia Xu,Zeynep Akata
机构: Technische Universität München (慕尼黑工业大学); University of Tübingen (图宾根大学); University of Trento (特伦托大学); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computation and Language (cs.CL)
备注: Accepted at International Journal of Computer Vision (IJCV) 2026

点击查看摘要

Abstract:While Large Vision Language Models (LVLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose deployment challenges on resource-constrained edge devices. Current parameter reduction techniques primarily involve training LVLMs from small language models, but these methods offer limited flexibility and remain computationally intensive. We study a complementary route: compressing existing LVLMs by applying structured pruning to the language model backbone, followed by lightweight recovery training. Specifically, we investigate two structural pruning paradigms: layerwise and widthwise pruning, and pair them with supervised finetuning and knowledge distillation on logits and hidden states. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios, where computational resources are limited or there is insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels. Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved using just 5% of the original data, while retaining over 95% of the original performance. Through empirical study on three representative LVLM families ranging from 3B to 7B parameters, this study offers actionable insights for practitioners to compress LVLMs without extensive computation resources or sufficient data. The code base is available at this https URL.

[NLP-28] Learning Evidence of Depression Symptoms via Prompt Induction SIGIR2026

【速读】: 该论文旨在解决如何在大规模用户生成文本(如社交媒体和在线论坛)中自动识别抑郁症的临床症状证据,以弥补有限的临床资源并实现对大人群的高效筛查。其核心挑战在于任务细粒度高、类别极度不平衡,且主流大语言模型(LLM)方法(零样本、上下文学习和微调)难以对多数症状保持一致的相关性判断标准。解决方案的关键是提出症状诱导(Symptom Induction, SI),即通过压缩标注样例生成简明、可解释的指导规则,明确界定每种症状的证据标准,并利用这些规则作为分类条件,从而显著提升模型对罕见症状的识别能力,且在跨疾病领域(如双相障碍和进食障碍)中表现出良好的泛化性能。

链接: https://arxiv.org/abs/2604.24376
作者: Eliseo Bao,Anxo Perez,David Otero,Javier Parapar
机构: IRLab, CITIC, Universidade da Coruña (拉科鲁尼亚大学)
类目: Computation and Language (cs.CL)
备注: Accepted at SIGIR 2026

点击查看摘要

Abstract:Depression places substantial pressure on mental health services, and many people describe their experiences outside clinical settings in high-volume user-generated text (e.g., online forums and social media). Automatically identifying clinical symptom evidence in such text can therefore complement limited clinical capacity and scale to large populations. We address this need through sentence-level classification of 21 depression symptoms from the BDI-II questionnaire, using BDI-Sen, a dataset annotated for symptom relevance. This task is fine-grained and highly imbalanced, and we find that common LLM approaches (zero-shot, in-context learning, and fine-tuning) struggle to apply consistent relevance criteria for most symptoms. We propose Symptom Induction (SI), a novel approach which compresses labeled examples into short, interpretable guidelines that specify what counts as evidence for each symptom and uses these guidelines to condition classification. Across four LLM families and eight models, SI achieves the best overall weighted F1 on BDI-Sen, with especially large gains for infrequent symptoms. Cross-domain evaluation on an external dataset further shows that induced guidelines generalize across other diseases shared symptomatology (bipolar and eating disorders).

[NLP-29] MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining ACL

【速读】: 该论文旨在解决在自然语言处理(Natural Language Processing, NLP)中构建适用于不同计算预算的高效嵌入表示(Embedding Representation)这一难题,尤其是如何在保持语义紧凑性的同时实现嵌套嵌入结构(Nested Embeddings)的结构一致性。其核心解决方案是提出MIPIC(Matryoshka Representation Learning via Self-Distilled Intra-Relational Alignment and Progressive Information Chaining),该框架通过两个关键机制协同优化:一是自蒸馏内部关系对齐(Self-Distilled Intra-Relational Alignment, SIA),利用top-k核函数相关性(CKA)自蒸馏方式对齐完整与截断嵌入之间的token级几何和注意力驱动关系,以增强跨维度结构一致性;二是渐进信息链式传递(Progressive Information Chaining, PIC),通过分层引导策略将深层模型成熟的任务语义逐步迁移至浅层,从而实现深度方向上的语义凝聚。实验表明,MIPIC可在从TinyBERT到BGEM3、Qwen3等多模型上生成高性能且结构一致的Matryoshka表示,在极低维场景下仍具显著优势。

链接: https://arxiv.org/abs/2604.24374
作者: Phung Gia Huy,Hai An Vu,Minh-Phuc Truong,Thang Duc Tran,Linh Ngo Van,Thanh Hong Nguyen,Trung Le
机构: Hanoi University of Science and Technology (河内科学技术大学); University of Oregon (俄勒冈大学); Monash University (莫纳什大学)
类目: Computation and Language (cs.CL)
备注: ACL Findings

点击查看摘要

Abstract:Representation learning is fundamental to NLP, but building embeddings that work well at different computational budgets is challenging. Matryoshka Representation Learning (MRL) offers a flexible inference paradigm through nested embeddings; however, learning such structures requires explicit coordination of how information is arranged across embedding dimensionality and model depth. In this work, we propose MIPIC (Matryoshka Representation Learning via Self-Distilled Intra-Relational Alignment and Progressive Information Chaining), a unified training framework designed to produce structurally coherent and semantically compact Matryoshka representations. MIPIC promotes cross-dimensional structural consistency through Self-Distilled Intra-Relational Alignment (SIA), which aligns token-level geometric and attention-driven relations between full and truncated representations using top-k CKA self-distillation. Complementarily, it enables depth-wise semantic consolidation via Progressive Information Chaining (PIC), a scaffolded alignment strategy that incrementally transfers mature task semantics from deeper layers into earlier layers. Extensive experiments on STS, NLI, and classification benchmarks (spanning models from TinyBERT to BGEM3, Qwen3) demonstrate that MIPIC yields Matryoshka representations that are highly competitive across all capacities, with significant performance advantages observed under extreme low-dimensional.

[NLP-30] SeaEvo: Advancing Algorithm Discovery with Strategy Space Evolution

【速读】: 该论文旨在解决当前大语言模型(Large Language Model, LLM)驱动的进化搜索在算法发现过程中存在的三大核心问题:难以区分语法不同但语义相同的策略实现、无法保留低适应度但具有战略潜力的方向,以及缺乏对策略家族饱和状态的识别能力。其解决方案的关键在于引入一个模块化的策略空间层(strategy-space layer),将自然语言描述的策略从临时提示上下文提升为种群层面的显式进化状态;该层通过三个机制实现优化:策略表述(Strategy Articulation)使变异过程变为“诊断-指导-实现”闭环;分层经验检索(Stratified Experience Retrieval)基于行为互补性从策略聚类中选择灵感;战略景观导航(Strategic Landscape Navigation)定期总结有效、饱和与未探索的策略家族以引导后续变异。实验表明,该方法显著提升了LLM-guided进化搜索的鲁棒性和效率,在开放系统优化任务中相对性能提升达21%。

链接: https://arxiv.org/abs/2604.24372
作者: Sichun Luo,Yi Huang,Haochen Luo,Fengyuan Liu,Guanzhi Deng,Lei Li,Qinghua Yao,Zefa Hu,Junlan Feng,Qi Liu
机构: The University of Hong Kong; JIUTIAN Research, China Mobile; City University of Hong Kong
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:LLM-guided evolutionary search has emerged as a promising paradigm for automated algorithm discovery, yet most systems track search progress primarily through executable programs and scalar fitness. Even when natural-language reflection is used, it is often used locally in mutation prompts or stored without an explicit population-level organization of strategic directions. As a result, evolutionary search can struggle to distinguish syntactically different implementations of the same idea, preserve lower-fitness but strategically promising directions, or detect when an entire family of strategies has saturated. We introduce \model, a modular strategy-space layer that elevates natural-language strategy descriptions from transient prompt context to first-class population-level evolutionary state in LLM-driven program search. \model augments each candidate program with an explicit natural language strategy description and uses this representation in three ways: Strategy Articulation turns mutation into a diagnose-direct-implement process; Stratified Experience Retrieval organizes the archive into strategy clusters and selects inspirations by behavioral complementarity; and Strategic Landscape Navigation periodically summarizes effective, saturated, and underexplored strategy families to guide future mutations. Across mathematical algorithm discovery, systems optimization, and agent-scaffold benchmarks, \model improves the underlying evolutionary backbones in most settings, with particularly large gains (21% relative improvement) on open-ended system optimization tasks. These results suggest that persistent strategy representations provide a practical mechanism for improving the robustness and efficiency of LLM-guided evolutionary search, suggesting a path toward compound AI systems that accumulate algorithmic knowledge over time. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE) Cite as: arXiv:2604.24372 [cs.CL] (or arXiv:2604.24372v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2604.24372 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-31] Culture-Aware Machine Translation in Large Language Models : Benchmarking and Investigation ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在文化敏感场景下的机器翻译能力尚不明确的问题。其核心挑战在于现有模型虽在通用翻译任务中表现优异,但在处理文化特定信息(culture-specific items)时存在识别与实际翻译操作之间的脱节。解决方案的关键在于构建CanMT——一个面向文化感知的新型驱动平行语料库(Culture-Aware Novel-Driven Parallel Dataset),并提出一套理论基础扎实、多维度的文化翻译质量评估框架。通过该语料库和评估体系,研究系统性地评估了多种LLM及翻译系统在不同翻译策略约束下的表现,揭示了模型间显著的能力差异以及翻译策略对行为的系统性影响,并验证了参考译文在提升“LLM作为裁判”(LLM-as-a-judge)评估可靠性中的关键作用。

链接: https://arxiv.org/abs/2604.24361
作者: Zekun Yuan,Yangfan Ye,Xiaocheng Feng,Baohang Li,Qichen Hong,Yunfei Lu,Dandan Tu,Bing Qin
机构: Harbin Institute of Technology (哈尔滨工业大学); Peng Cheng Laboratory (鹏城实验室); Huawei Technologies Co., Ltd (华为技术有限公司)
类目: Computation and Language (cs.CL)
备注: 26pages,25 figures ACL2026 main conference, long paper

点击查看摘要

Abstract:Large language models (LLMs) have achieved strong performance in general machine translation, yet their ability in culture-aware scenarios remains poorly understood. To bridge this gap, we introduce CanMT, a Culture-Aware Novel-Driven Parallel Dataset for Machine Translation, together with a theoretically grounded, multi-dimensional evaluation framework for assessing cultural translation quality. Leveraging CanMT, we systematically evaluate a wide range of LLMs and translation systems under different translation strategy constraints. Our findings reveal substantial performance disparities across models and demonstrate that translation strategies exert a systematic influence on model behavior. Further analysis shows that translation difficulty varies across types of culture-specific items, and that a persistent gap remains between models’ recognition of culture-specific knowledge and their ability to correctly operationalize it in translation outputs. In addition, incorporating reference translations is shown to substantially improve evaluation reliability in LLM-as-a-judge, underscoring their essential role in assessing culture-aware translation quality. The corpus and code are available at CanMT.

[NLP-32] OS-SPEAR: A Toolkit for the Safety PerformanceEfficiency and Robustness Analysis of OS Agents

【速读】: 该论文旨在解决当前操作系统代理(OS agents)在安全性、效率和多模态鲁棒性方面缺乏系统化评估的问题,这些问题阻碍了其作为可信日常伙伴的落地应用。现有基准测试存在安全场景狭窄、轨迹标注噪声大及鲁棒性指标不足等缺陷。解决方案的关键在于提出OS-SPEAR,一个涵盖安全性(S)、性能(P)、效率(E)和鲁棒性(R)四个维度的综合性评估工具包,包含四个专业化子集:S-subset覆盖环境与人为引发的多种危害;P-subset基于轨迹价值估计与分层采样构建;E-subset从时间延迟与Token消耗双视角量化效率;R-subset则对视觉与文本输入施加跨模态扰动以检验鲁棒性。此外,该工具包还提供自动化诊断报告生成机制,从而为下一代可靠且高效的OS代理开发奠定标准化评估基础。

链接: https://arxiv.org/abs/2604.24348
作者: Zheng Wu,Yi Hua,Zhaoyuan Huang,Chenhao Xue,Yijie Lu,Pengzhou Cheng,Zongru Wu,Lingzhong Dong,Gongshen Liu,Xinghao Jiang,Zhuosheng Zhang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The evolution of Multimodal Large Language Models (MLLMs) has shifted the focus from text generation to active behavioral execution, particularly via OS agents navigating complex GUIs. However, the transition of these agents into trustworthy daily partners is hindered by a lack of rigorous evaluation regarding safety, efficiency, and multi-modal robustness. Current benchmarks suffer from narrow safety scenarios, noisy trajectory labeling, and limited robustness metrics. To bridge this gap, we propose OS-SPEAR, a comprehensive toolkit for the systematic analysis of OS agents across four dimensions: Safety, Performance, Efficiency, and Robustness. OS-SPEAR introduces four specialized subsets: (1) a S(afety)-subset encompassing diverse environment- and human-induced hazards; (2) a P(erformance)-subset curated via trajectory value estimation and stratified sampling; (3) an E(fficiency)-subset quantifying performance through the dual lenses of temporal latency and token consumption; and (4) a R(obustness)-subset that applies cross-modal disturbances to both visual and textual inputs. Additionally, we provide an automated analysis tool to generate human-readable diagnostic reports. We conduct an extensive evaluation of 22 popular OS agents using OS-SPEAR. Our empirical results reveal critical insights into the current landscape: notably, a prevalent trade-off between efficiency and safety or robustness, the performance superiority of specialized agents over general-purpose models, and varying robustness vulnerabilities across different modalities. By providing a multidimensional ranking and a standardized evaluation framework, OS-SPEAR offers a foundational resource for developing the next generation of reliable and efficient OS agents. The dataset and codes are available at this https URL.

[NLP-33] Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering

【速读】: 该论文旨在解决标准检索增强生成(Retrieval-Augmented Generation, RAG)中因文本分块(chunking)策略导致的冗余问题,该问题表现为索引语料库存储开销大、检索效率低。解决方案的关键在于引入轻量级的过滤策略,特别是基于命名实体(named-entity-based)的过滤方法,能够在不显著降低检索质量的前提下,将向量索引规模减少约25%至36%,从而提升RAG流水线中检索组件的效率。

链接: https://arxiv.org/abs/2604.24334
作者: Daria Berdyugina,Anaëlle Cohen,Yohann Rioual
机构: AI Lab – VO2 Group
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Standard Retrieval-Augmented Generation (RAG) chunking methods often create excessive redundancy, increasing storage costs and slowing retrieval. This study explores chunk filtering strategies, such as semantic, topic-based, and named-entity-based methods in order to reduce the indexed corpus while preserving retrieval quality. Experiments are conducted on multiple corpora. Retrieval performance is evaluated using a token-based framework based on precision, recall, and intersection-over-union metrics. Results indicate that entity-based filtering can reduce vector index size by approximately 25% to 36% while maintaining high retrieval quality close to the baseline. These findings suggest that redundancy introduced during chunking can be effectively reduced through lightweight filtering, improving the efficiency of retrieval-oriented components in RAG pipelines.

[NLP-34] DPEPO: Diverse Parallel Exploration Policy Optimization for LLM -based Agents ACL2026

【速读】: 该论文旨在解决传统大型语言模型(Large Language Model, LLM)代理在复杂任务中因“先推理后执行”(reason-then-act)范式导致的探索能力受限和环境理解不充分的问题,即每一步仅与单一环境交互,难以实现高效多路径探索。其解决方案的关键在于提出一种新型并行交互范式,使代理能够同时与多个环境进行交互,并共享跨轨迹经验;在此基础上设计了DPEPO(Diverse Parallel Exploration Policy Optimization)算法,包含两个阶段:初始监督微调(Supervised Fine-Tuning, SFT)赋予基础的并行推理与动作生成能力,随后通过分层奖励机制强化多样性探索——包括轨迹级的成功奖励以及步骤级的多样动作奖励和状态转移奖励,主动惩罚行为冗余,从而显著提升探索广度与任务成功率。实验表明,DPEPO在ALFWorld和ScienceWorld基准上达到当前最优(SOTA)成功率,同时保持与强序列基线相当的效率。

链接: https://arxiv.org/abs/2604.24320
作者: Junshuo Zhang,Chengrui Huang,Feng Guo,Zihan Li,Ke Shi,Menghua Jiang,Jiguo Yu,Shuo Shang,Shen Gao
机构: University of Electronic Science and Technology of China (电子科技大学); DiDi (滴滴)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2026 main conference

点击查看摘要

Abstract:Large language model (LLM) agents that follow the sequential “reason-then-act” paradigm have achieved superior performance in many complex this http URL, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross-trajectory experiences. Building upon this paradigm, we further propose DPEPO, a reinforcement learning (RL) algorithm that encourages the agent to perform diverse parallel exploration. There are two stages in DPEPO: initial supervised fine-tuning (SFT) imparts basic parallel reasoning and action generation, followed by reinforcement learning stage with a hierarchical reward scheme. We design a parallel trajectory-level success reward and two step-level rewards: Diverse Action Reward and Diverse State Transition Reward, which actively penalize behavioral redundancy and promote broad exploration. Extensive experiments on ALFWorld and ScienceWorld show that DPEPO achieves state-of-the-art (SOTA) success rates, while maintaining comparable efficiency to strong sequential baselines. (Code is available at this https URL)

[NLP-35] Differentiable Faithfulness Alignment for Cross-Model Circuit Transfer

【速读】: 该论文旨在解决大规模语言模型中机制可解释性(mechanistic interpretability)的高成本、模型依赖性和难以扩展的问题。现有方法通常需要在目标模型上重新进行完整的电路发现,这在计算资源和时间上均不可行。解决方案的关键在于提出可微忠实度对齐(Differentiable Faithfulness Alignment, DFA)框架,通过学习一个可微映射将小规模源模型中的节点重要性得分投影到大规模目标模型中,并利用软忠实度目标函数训练该映射,从而避免在目标模型上执行完整的电路挖掘。这种方法实现了跨模型的电路信息迁移,在多个任务和模型架构上展现出优于简单基线的效果,甚至在某些情况下恢复出与直接节点归因相当或更优的忠实度表现。

链接: https://arxiv.org/abs/2604.24302
作者: Shun Shao,Binxu Wang,Shay B. Cohen,Anna Korhonen,Yonatan Belinkov
机构: University of Cambridge(剑桥大学); Kempner Institute, Harvard University(哈佛大学肯普纳研究所); University of Edinburgh(爱丁堡大学); Technion(以色列理工学院)
类目: Computation and Language (cs.CL)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Mechanistic interpretability has made it possible to localize circuits underlying specific behaviors in language models, but existing methods are expensive, model-specific, and difficult to scale to larger architectures. We introduce \textbfDifferentiable Faithfulness Alignment (DFA), a framework that transfers circuit information from a smaller source model to a larger target model through a learned differentiable alignment. DFA projects source-model node importance scores into the target model and trains this mapping with a soft faithfulness objective, avoiding full circuit discovery on the target model. We evaluate DFA on Llama-3 and Qwen-2.5 across six tasks spanning factual retrieval, multiple-choice reasoning, and arithmetic. The strongest results occur on Llama-3 1 B \rightarrow3 B, where aligned circuits are often competitive with direct node attribution and zero-shot transfer remains effective. Recovery weakens for larger source–target gaps and is substantially lower on Qwen-2.5, suggesting that transfer becomes harder as architectural and scaling differences increase. Overall, DFA consistently outperforms simple baselines and, in some settings, recovers target-model circuits with faithfulness comparable to or stronger than direct attribution. These results suggest that smaller models can provide useful mechanistic priors for larger ones, while highlighting both the promise and the limits of node-level cross-model circuit alignment.\footnoteCode is available at this https URL.

[NLP-36] MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在企业环境中因缺乏对私有内部库的预训练数据而性能显著下降的问题,尤其针对检索增强生成(Retrieval-Augmented Generation, RAG)系统中静态API文档无法提供任务级协调模式与API参数约束理解的局限性。解决方案的关键在于提出MEMCoder框架,其核心创新是引入多维演化记忆(Multi-dimensional Evolving Memory),通过自动闭环机制利用执行反馈动态积累和更新使用指南,从而在推理阶段融合静态文档与历史经验知识,实现对API调用逻辑和边界条件的深度理解与持续优化。

链接: https://arxiv.org/abs/2604.24222
作者: Mofei Li,Taozhi Chen,Guowei Yang,Jia Li
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) excel at general code generation, but their performance drops sharply in enterprise settings that rely on internal private libraries absent from public pre-training corpora. While Retrieval-Augmented Generation (RAG) offers a training-free alternative by providing static API documentation, we find that such documentation typically provides only isolated definitions, leaving a fundamental knowledge gap. Specifically, LLMs struggle with a task-level lack of coordination patterns between APIs and an API-level misunderstanding of parameter constraints and boundary conditions. To address this, we propose MEMCoder, a novel framework that enables LLMs to autonomously accumulate and evolve Usage Guidelines across these two dimensions. MEMCoder introduces a Multi-dimensional Evolving Memory that captures distilled lessons from the model’s own problem-solving trajectories. During inference, MEMCoder employs a dual-source retrieval mechanism to inject both static documentation and relevant historical guidelines into the context. The framework operates in an automated closed loop by using objective execution feedback to reflect on successes and failures, resolve knowledge conflicts, and dynamically update memory. Extensive evaluations on the NdonnxEval and NumbaEval benchmarks demonstrate that MEMCoder substantially enhances existing RAG systems, yielding an average absolute pass@1 gain of 16.31%. Furthermore, MEMCoder exhibits vastly superior domain-specific adaptation compared to existing memory-based continual learning methods.

[NLP-37] Seeing Is No Longer Believing: Frontier Image Generation Models Synthetic Visual Evidence and Real-World Risk

【速读】: 该论文旨在解决生成式 AI(Generative AI)图像模型在现实世界中引发的合成视觉风险问题,特别是当这些图像具备高度真实性、可读文本、身份一致性及快速迭代能力时,可能对金融、医疗、新闻、法律、应急响应等多个领域造成严重危害。其核心解决方案在于提出一个“能力加权风险框架”,将模型功能与实际社会损害相匹配,并主张实施分层控制策略:包括模型端限制、加密溯源、可见标签、平台摩擦、行业级验证和事件响应机制,以系统性降低合成图像带来的信任危机与滥用风险。

链接: https://arxiv.org/abs/2604.24197
作者: Shuai Wu,Xue Li,Yanna Feng,Yufang Li,Zhijun Wang,Ran Wang
机构: OpenAI; Google; xAI; Alibaba; ByteDance
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Technical report, 20 pages, 15 figures, 2 tables, 1 algorithm

点击查看摘要

Abstract:Frontier image generation has moved from artistic synthesis toward synthetic visual evidence. Systems such as GPT Image 2, Nano Banana Pro, Nano Banana 2, Grok Imagine, Qwen Image 2.0 Pro, and Seedream 5.0 Lite combine photorealistic rendering, readable typography, reference consistency, editing control, and in several cases reasoning or search-grounded image construction. These capabilities create large benefits for design, education, accessibility, and communication, yet they also weaken one of society’s most common trust shortcuts: the belief that a plausible picture is a reliable record. This paper provides a source-grounded technical and policy analysis of synthetic visual risk. We first summarize the public capabilities of recent image models, then analyze public incidents involving fake crisis images, celebrity and public-figure imagery, medical scans, forged-looking documents, synthetic screenshots, phishing assets, and market-moving rumors. We introduce a capability-weighted risk framework that links model affordances to real-world harm in finance, medicine, news, law, emergency response, identity verification, and civic discourse. Our findings show that risk is driven less by photorealism alone than by the convergence of realism, legible text, identity persistence, fast iteration, and distribution context. We argue for layered control: model-side restrictions, cryptographic provenance, visible labeling, platform friction, sector-grade verification, and incident response. The paper closes with practical recommendations for model providers, platforms, newsrooms, financial institutions, healthcare systems, legal organizations, regulators, and ordinary users.

[NLP-38] MultiDx: A Multi-Source Knowledge Integration Framework towards Diagnostic Reasoning ACL2026

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗诊断推理中因领域知识有限而导致的准确性不足与临床逻辑对齐性差的问题。现有方法依赖模型内部知识或静态知识库,难以适应多样化的临床场景且缺乏对标准临床推理路径的遵循。其解决方案的关键在于提出一个两阶段的诊断推理框架 MultiDx,通过整合来自网络搜索、SOAP格式病例和临床病例数据库等多源证据,首先生成疑似诊断及推理路径,再通过匹配、投票与鉴别诊断机制融合多视角证据,从而提升诊断准确性和临床合理性。

链接: https://arxiv.org/abs/2604.24186
作者: Yimin Deng,Zhenxi Lin,Yejing Wang,Guoshuai Zhao,Pengyue Jia,Zichuan Fu,Derong Xu,Yefeng Zheng,Xiangyu Zhao,Li Zhu,Xian Wu,Xueming Qian
机构: Xi’an Jiaotong University(西安交通大学); City University of Hong Kong(香港城市大学); Tencent Jarvis Lab(腾讯 Jarvis 实验室); Westlake University(西湖大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2026 findings

点击查看摘要

Abstract:Diagnostic prediction and clinical reasoning are critical tasks in healthcare applications. While Large Language Models (LLMs) have shown strong capabilities in commonsense reasoning, they still struggle with diagnostic reasoning due to limited domain knowledge. Existing approaches often rely on internal model knowledge or static knowledge bases, resulting in knowledge insufficiency and limited adaptability, which hinder their capacity to perform diagnostic reasoning. Moreover, these methods focus solely on the accuracy of final predictions, overlooking alignment with standard clinical reasoning trajectories. To this end, we propose MultiDx, a two-stage diagnostic reasoning framework that performs differential diagnosis by analyzing evidence collected from multiple knowledge sources. Specifically, it first generates suspected diagnoses and reasoning paths by leveraging knowledge from web search, SOAP-formatted case, and clinical case database. Then it integrates multi-perspective evidence through matching, voting, and differential diagnosis to generate the final prediction.~Extensive experiments on two public benchmarks demonstrate the effectiveness of our approach.

[NLP-39] MemeScouts@LT-EDI 2026: Asking the Right Questions – Prompted Weak Supervision for Meme Hate Speech Detection ACL2026 ACL26

【速读】: 该论文旨在解决多语言环境下基于图文混合(multimodal)特征的仇恨言论(hate speech)检测难题,尤其针对表情包(meme)中隐含的同性恋恐惧(homophobia)和跨性别恐惧(transphobia)内容。其核心挑战在于:表情包具有跨模态特性(图像与文本协同作用),且常包含文化语境依赖的微妙线索(如讽刺、反语),导致端到端视觉-语言模型(VLM)在单一预测任务中易出现脆弱性。解决方案的关键在于提出一种提示弱监督(prompted weak supervision, PWS)方法,将复杂的 meme 理解过程分解为一系列目标明确、问题驱动的标注函数(labeling functions, LF),每条 LF 仅回答特定子任务(如立场判断、隐含性识别),并限制答案选项以增强一致性;同时利用量化后的 Qwen3-VLM 提取结构化特征,通过迭代式误差驱动的 LF 扩展与特征剪枝优化模型冗余与泛化能力,最终显著优于直接 VLM 分类,在英语、中文和印地语三个语种上均取得领先性能。

链接: https://arxiv.org/abs/2604.24179
作者: Ivo Bueno,Lea Hirlimann,Enkelejda Kasneci
机构: Technical University of Munich (慕尼黑工业大学); LMU Munich (慕尼黑路德维希-马克西米利安大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at Sixth Workshop on Language Technology for Equality, Diversity and Inclusion at ACL2026 (LT-EDI@ACL26)

点击查看摘要

Abstract:Detecting hate speech in memes is challenging due to their multimodal nature and subtle, culturally grounded cues such as sarcasm and context. While recent vision-language models (VLMs) enable joint reasoning over text and images, end-to-end prompting can be brittle, as a single prediction must resolve target, stance, implicitness, and irony. These challenges are amplified in multilingual settings. We propose a prompted weak supervision (PWS) approach that decomposes meme understanding into targeted, question-based labeling functions with constrained answer options for homophobia and transphobia detection in the LT-EDI 2026 shared task. Using a quantized Qwen3-VLM to extract features by answering targeted questions, our method outperforms direct VLM classification, with substantial gains for Chinese and Hindi, ranking 1st in English, 2nd in Chinese, and 3rd in Hindi. Iterative refinement via error-driven LF expansion and feature pruning reduces redundancy and improves generalization. Our results highlight the effectiveness of prompted weak supervision for multilingual multimodal hate speech detection.

[NLP-40] AdapTime: Enabling Adaptive Temporal Reasoning in Large Language Models ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理时间信息时能力有限的问题,现有方法通常依赖外部工具或人工验证,且针对特定场景设计,导致泛化性差;同时,这些方法采用固定推理流程应对所有时间类问题,未能区分简单与复杂问题所需的差异化推理策略,造成资源浪费或推理不足。解决方案的关键在于提出AdapTime,一种自适应的时间推理方法,通过动态执行三种核心操作——重述(reformulate)、重写(rewrite)和复核(review),由LLM规划器引导推理过程,从而根据输入上下文灵活调整推理步骤,无需外部支持即可显著提升LLMs的时间推理能力。

链接: https://arxiv.org/abs/2604.24175
作者: Yimin Deng,Yejing Wang,Zhenxi Lin,Zichuan Fu,Guoshuai Zhao,Derong Xu,Yefeng Zheng,Xiangyu Zhao,Xian Wu,Li Zhu,Xueming Qian
机构: Xi’an Jiaotong University(西安交通大学); City University of Hong Kong(香港城市大学); Tencent Jarvis Lab(腾讯 Jarvis 实验室); Westlake University(西湖大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2026 findings

点击查看摘要

Abstract:Large language models have demonstrated strong reasoning capabilities in general knowledge question answering. However, their ability to handle temporal information remains limited. To address this limitation, existing approaches often involve external tools or manual verification and are tailored to specific scenarios, leading to poor generalizability. Moreover, these methods apply a fixed pipeline to all questions, overlooking the fact that different types of temporal questions require distinct reasoning strategies, which leads to unnecessary processing for simple cases and inadequate reasoning for complex ones. To this end, we propose AdapTime, an adaptive temporal reasoning method that dynamically executes reasoning steps based on the input context. Specifically, it involves three temporal reasoning actions: reformulate, rewrite and review, with an LLM planner guiding the reasoning process. AdapTime integrates seamlessly with state-of-the-art LLMs and significantly enhances their temporal reasoning capabilities without relying on external support. Extensive experiments demonstrate the effectiveness of our approach.

[NLP-41] Psychologically-Grounded Graph Modeling for Interpretable Depression Detection

【速读】: 该论文旨在解决自动从对话交互中检测抑郁症的两个核心挑战:数据稀缺性与临床可解释性的缺乏。现有方法多依赖黑箱深度学习模型,难以捕捉抑郁症状的细微时序变化,并忽略个体差异。其解决方案的关键在于提出PsyGAT(Psychological Graph Attention Network),一种基于心理学原理的动态时序图神经网络框架:通过引入心理表达单元(Psychological Expression Units, PEUs)显式编码话语层面的临床证据,构建以心理状态转移而非单纯语义关联为核心的会话图结构;同时采用经临床认可的基于人格特征的数据增强策略缓解类别不平衡问题,并将会话级人格背景嵌入图结构以分离特质行为与急性抑郁症状。该方法在DAIC-WoZ和E-DAIC数据集上分别取得89.99和71.37的Macro F1分数,显著优于现有图基基线及闭源大语言模型(如GPT-5),并进一步通过因果解释模块Causal-PsyGAT提升对症状触发因素的识别能力,实现从监测到临床可解释性的有效衔接。

链接: https://arxiv.org/abs/2604.24126
作者: Rishitej Reddy Vyalla,Kritarth Prasad,Avinash Anand,Erik Cambria,Shaoxiong Ji,Faten S. Alamri,Zhengkui Wang
机构: Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automatic depression detection from conversational interactions holds significant promise for scalable screening but remains hindered by severe data scarcity and a lack of clinical interpretability. Existing approaches typically rely on black-box deep learning architectures that struggle to model the subtle, temporal evolution of depressive symptoms or account for participant-specific heterogeneity. In this work, we propose PsyGAT (Psychological Graph Attention Network), a psychologically grounded framework that models conversational sessions as dynamic temporal graphs. We introduce Psychological Expression Units (PEUs) to explicitly encode utterance-level clinical evidence, structuring the session graph to capture transitions in psychological states rather than mere semantic dependencies. To address the critical class imbalance in depression datasets, we employ clinically approved persona-based data augmentation, enable robust model learning. Additionally, we integrate session-level personality context directly into the graph structure to disentangle trait-based behavior from acute depressive symptoms. PsyGAT achieves state-of-the-art performance, surpassing both strong graph-based baselines and closed-source LLMs like GPT-5, achieving 89.99 and 71.37 Macro F1 scores in DAIC-WoZ and E-DAIC, respectively. We further introduce Causal-PsyGAT, an interpretability module that identifies symptom triggers. Experiments show a 20% improvement in MRR for identifying causal indicators, effectively bridging the gap between depression monitoring and clinical explainability. The full augmented dataset is publicly available at this https URL.

[NLP-42] IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical Reasoning ACL

【速读】: 该论文旨在解决生成式 AI(Generative AI)在多语言和低资源环境下进行复杂推理时,因课程学习(Curriculum Learning)难以生成一致的逐步推理过程而表现不佳的问题。其核心解决方案是提出 IRIS 框架——一种双轴结构:垂直轴采用渐进式难度提升的监督微调(Supervised Fine-Tuning),水平轴引入反向课程强化学习(Reverse Curriculum Reinforcement Learning)以降低对步骤引导的依赖;同时设计包含正确性、步骤对齐度、连续性和数值激励的复合奖励函数,并通过分组相对策略优化(Group Relative Policy Optimization, GRPO)进行优化,从而在英语、印地语和马拉地语等多语言场景下显著提升数学推理性能,尤其在低资源设置中取得实质性改进。

链接: https://arxiv.org/abs/2604.24114
作者: Navya Gupta,Rishitej Reddy Vyalla,Avinash Anand,Chhavi Kirtani,Erik Cambria,Zhengchen Zhang,Zhengkui Wang,Timothy Liu,Aik Beng Ng,Simon See,Rajiv Ratn Shah
机构: Singapore Institute of Technology, Singapore; IIIT Delhi, New Delhi, India; University of California, San Diego, USA; Nanyang Technological University, Singapore; NVIDIA
类目: Computation and Language (cs.CL)
备注: Accepted in ACL main

点击查看摘要

Abstract:Curriculum learning helps language models tackle complex reasoning by gradually increasing task difficulty. However, it often fails to generate consistent step-by-step reasoning, especially in multilingual and low-resource settings where cross-lingual transfer from English to Indian languages remains limited. We propose IRIS: Interleaved Reinforcement with Incremental Staged Curriculum, a two-axis framework that combines Supervised Fine-Tuning on progressively harder problems (vertical axis) with Reverse Curriculum Reinforcement Learning to reduce reliance on step-by-step guidance (horizontal axis). We design a composite reward combining correctness, step-wise alignment, continuity, and numeric incentives, optimized via Group Relative Policy Optimization (GRPO). We release CL-Math, a dataset of 29k problems with step-level annotations in English, Hindi, and Marathi. Across standard benchmarks and curated multilingual test sets, IRIS consistently improves performance, with strong results on math reasoning tasks and substantial gains in low-resource and bilingual settings, alongside modest improvements in high-resource languages.

[NLP-43] Factual and Edit-Sensitive Graph-to-Sequence Generation via Graph-Aware Adaptive Noising

【速读】: 该论文旨在解决图到序列生成(Graph-to-Sequence, G2S)任务中细调的自回归模型在事实一致性(factual grounding)和编辑敏感性(edit sensitivity)方面的不足。其核心解决方案是提出一种非自回归扩散框架——图扩散语言模型(Diffusion Language Model for Graphs, DLM4G),通过迭代细化方式生成文本,关键创新在于引入了一种基于逐标记去噪误差的自适应加噪策略,该策略能够动态调节实体和关系标记上的噪声强度,从而增强图结构信息的保留能力,并支持在图结构发生局部修改时实现精准的文本更新。

链接: https://arxiv.org/abs/2604.24104
作者: Aditya Hemant Shahane,Anuj Kumar Sirohi,Tanmoy Chakraborty,Prathosh A P,Sandeep Kumar
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Fine-tuned autoregressive models for graph-to-sequence generation (G2S) often struggle with factual grounding and edit sensitivity. To tackle these issues, we propose a non-autoregressive diffusion framework that generates text by iterative refinement conditioned on an input graph, named as Diffusion Language Model for Graphs (DLM4G). By aligning graph components (entities/relations) with their corresponding sequence tokens, DLM4G employs an adaptive noising strategy. The proposed strategy uses per-token denoising error as a signal to adaptively modulate noise on entity and relation tokens, improving preservation of graph structure and enabling localized updates under graph edits. Evaluated on three datasets, DLM4G consistently outperforms competitive G2S diffusion baselines trained on identical splits across both surface-form and embedding-based metrics. DLM4G further exceeds fine-tuned autoregressive baselines up to 12x larger (e.g., T5-Large) and is competitive with zero-shot LLM transfer baselines up to 127x larger. Relative to the strongest fine-tuned PLM baseline, DLM4G improves factual grounding (FGT@0.5) by +5.16% and edit sensitivity (ESR) by +7.9%; compared to the best diffusion baseline, it yields gains of +3.75% in FGT@0.5 and +23.6% in ESR. We additionally demonstrate applicability beyond textual graphs through experiments on molecule captioning, indicating the method’s generality for scientific G2S generation.

[NLP-44] BiMol-Diff: A Unified Diffusion Framework for Molecular Generation and Captioning

【速读】: 该论文旨在解决分子结构与自然语言之间跨模态建模的难题,特别是在可控分子设计中,如何有效保持分子结构信息在生成与描述过程中的完整性。传统自回归模型难以捕捉长程依赖关系,而标准扩散过程对所有位置施加均匀噪声,易破坏具有结构意义的原子或官能团(token)。其解决方案的关键在于提出一种基于token感知的噪声调度机制(token-aware noise schedule),该机制根据每个token的恢复难度动态调整污染强度,使更难恢复的子结构在前向扩散过程中得以保留,从而显著提升分子结构-语言建模的保真度。实验表明,该方法在ChEBI-20和M3-20M数据集上实现了更高的精确匹配率(相对提升15.4%)和更强的分子图像描述性能(BLEU与BERTScore最优)。

链接: https://arxiv.org/abs/2604.24089
作者: Aditya Hemant Shahane,Anuj Kumar Sirohi,Devansh Arora,Nitin Kumar,Prathosh A P,Sandeep Kumar
机构: Indian Institute of Technology Delhi(印度理工学院德里分校); Indian Institute of Technology Ropar(印度理工学院拉普尔分校); Indian Institute of Science Bengaluru(印度科学研究所班加罗尔分校); Latentforce.ai
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Bridging molecular structures and natural language is essential for controllable design. Autoregressive models struggle with long-range dependencies, while standard diffusion processes apply uniform corruption across positions, which can distort structurally informative tokens. We present BiMol-Diff, a unified diffusion framework for the paired tasks of text-conditioned molecule generation and molecule captioning. Our key component is a token-aware noise schedule that assigns position-dependent corruption based on token recovery difficulty, preserving harder-to-recover substructures during the forward process. On ChEBI-20 and M3-20M, BiMol-Diff improves molecule reconstruction with a 15.4% relative gain in Exact Match and achieves strong captioning results, attaining best BLEU and BERTScore among compared baselines. These results indicate token-aware noising improves fidelity in molecular structure-language modelling.

[NLP-45] he Kerimov-Alekberli Model: An Information-Geometric Framework for Real-Time System Stability

【速读】: 该论文旨在解决当前人工智能(AI)安全领域中伦理对齐缺乏物理基础的问题,即如何将道德规范从依赖人工规则的启发式方法转变为可量化、可验证的稳定性机制。其核心解决方案是提出Kerimov-Alekberli模型,该模型通过建立非平衡热力学与随机控制之间的形式同构关系,将系统异常定义为偏离黎曼流形的偏差,并以KL散度作为主度量,由费舍尔信息度量动态确定阈值。关键创新在于引入兰道尔原理(Landauer Principle),证明对抗扰动会通过增加系统的信息熵而执行可观测的物理功,从而首次在物理层面量化了伦理违规行为,实现了从抽象伦理准则到可测量热力学效应的转化。

链接: https://arxiv.org/abs/2604.24083
作者: Hikmat Karimov,Rahid Zahid Alekberli
机构: Azerbaijan Technical University (阿塞拜疆技术大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:This study introduces the Kerimov-Alekberli model, a novel information-geometric framework that redefines AI safety by formally linking non-equilibrium thermodynamics to stochastic control for the ethical alignment of autonomous systems. By establishing a formal isomorphism between non-equilibrium thermodynamics and stochastic control, we define systemic anomalies as deviations from a Riemannian manifold. The model utilizes the Kullback-Leibler divergence as the primary metric, governed by a dynamic threshold derived from the Fisher Information Metric. We further ground this framework in the Landauer Principle, proving that adversarial perturbations perform measurable physical work by increasing the system’s informational entropy. Validation on the NSL-KDD dataset and unmanned aerial vehicle trajectory simulations demonstrated that our model achieves effective real-time detection via the FPT trigger, with strong performance metrics (e.g., high accuracy and low FPR) on benchmark datasets. This study provides a rigorous physical foundation for AI safety, transitioning from heuristic, rule-based ethical frameworks to a thermodynamics-based stability paradigm by grounding ethical violations in quantifiable physical work and entropic information.

[NLP-46] Jailbreaking Frontier Foundation Models Through Intention Deception CVPR2026

【速读】: 该论文旨在解决前沿生成式 AI(Generative AI)模型在多轮对话中因“安全完成”(safe completion)机制而产生的新型越狱攻击问题,尤其是针对意图反转(intent inversion)和一种未被发现的“准越狱”(para-jailbreaking)漏洞。其解决方案的关键在于设计了一种多轮渐进式越狱方法:通过模拟看似无害的意图建立对话信任,并利用模型的一致性特性逐步引导其输出有害内容;同时首次识别并验证了para-jailbreaking现象——即模型虽不直接回应攻击请求,但所释放的信息本身具有危害性,从而揭示了当前安全机制的深层缺陷并提出应对策略。

链接: https://arxiv.org/abs/2604.24082
作者: Xinhe Wang,Katia Sycara,Yaqi Xie
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at CVPR 2026 Findings Track

点击查看摘要

Abstract:Large (vision-)language models exhibit remarkable capability but remain highly susceptible to jailbreaking. Existing safety training approaches aim to have the model learn a refusal boundary between safe and unsafe, based on the user’s intent. It has been found that this binary training regime often leads to brittleness, since the user intent cannot reliably be evaluated, especially if the attacker obfuscates their intent, and also makes the system seem unhelpful. In response, frontier models, such as GPT-5, have shifted from refusal-based safeguards to safe completion, that aims to maximize helpfulness while obeying safety constraints. However, safe completion could be exploited when a user pretends their intention is benign. Specifically, this intent inversion would be effective in multi-turn conversation, where the attacker has multiple opportunities to reinforce their deceptively benign intent. In this work, we introduce a novel multi-turn jailbreaking method that exploits this vulnerability. Our approach gradually builds conversational trust by simulating benign-seeming intentions and by exploiting the consistency property of the model, ultimately guiding the target model toward harmful, detailed outputs. Most crucially, our approach also uncovered an additional class of model vulnerability that we call para-jailbreaking that has been unnoticed up to now. Para-jailbreaking describes the situation where the model may not reveal harmful direct reply to the attack query, however the information that it reveals is nevertheless harmful. Our contributions are threefold. First, it achieves high success rates against frontier models including GPT-5-thinking and Claude-Sonnet-4.5. Second, our approach revealed and addressed para-jailbreaking harmful output. Third, experiments on multimodal VLM models showed that our approach outperformed state-of-the-art models.

[NLP-47] he Prag matic Persona: Discovering LLM Persona through Bridging Inference ICPR2026

【速读】: 该论文旨在解决现有大语言模型(Large Language Models, LLMs)persona发现方法依赖表面词汇或风格特征、忽视对话深层语篇结构的问题。其解决方案的关键在于引入“桥接推理”(bridging inference)机制,即通过建模话语中隐含的概念关联——这些关联基于共享的世界知识和语篇连贯性——构建结构化的知识图谱,从而捕捉LLM在多轮对话中组织意义的潜在语义链接。这种方法使persona识别从表层词法模式转向语篇一致性层面,显著提升了语义连贯性和persona稳定性的检测效果。

链接: https://arxiv.org/abs/2604.24079
作者: Jisoo Yang(1),Jongwon Ryu(1),Minuk Ma(2),Trung X. Pham(3),Junyeong Kim(1) ((1) Chung-Ang University, (2) University of British Columbia, (3) Van Lang University)
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures, accepted to ICPR 2026

点击查看摘要

Abstract:Large Language Models (LLMs) reveal inherent and distinctive personas through dialogue. However, most existing persona discovery approaches rely on surface-level lexical or stylistic cues, treating dialogue as a flat sequence of tokens and failing to capture the deeper discourse-level structures that sustain persona consistency. To address this limitation, we propose a novel analytical framework that interprets LLM dialogue through bridging inference – implicit conceptual relations that connect utterances via shared world knowledge and discourse coherence. By modeling these relations as structured knowledge graphs, our approach captures latent semantic links that govern how LLMs organize meaning across turns, enabling persona discovery at the level of discourse coherence rather than surface realizations. Experimental results across multiple reasoning backbones and target LLMs, ranging from small-scale models to 80B-parameter systems, demonstrate that bridging-inference graphs yield significantly stronger semantic coherence and more stable persona identification than frequency or style-based baselines. These results show that persona traits are consistently encoded in the structural organization of discourse rather than isolated lexical patterns. This work presents a systematic framework for probing, extracting, and visualizing latent LLM personas through the lens of Cognitive Discourse Theory, bridging computational linguistics, cognitive semantics, and persona reasoning in large language models. Codes are available at this https URL

[NLP-48] An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在高风险和实际应用场景中,仅依赖聚合准确率进行评估时难以充分刻画系统可靠性的难题。其解决方案的关键在于提出一种受热力学启发的建模框架,通过引入一个复合稳定性评分(composite stability score),融合任务效用、熵(entropy,作为外部不确定性的度量)以及两个内部结构代理变量——内部整合性(internal integration)和对齐反思能力(aligned reflective capacity),从而构建一个可解释的评价视角,以量化内部结构如何调节不确定性对模型行为的影响。该框架在IST-20基准测试协议下对4个主流LLM的80个模型-场景观测进行了分析,结果表明其相比仅基于效用与熵的简化基线具有显著更高的稳定性得分(平均提升0.0299),尤其在高熵条件下表现更优,体现出非线性抑制不确定性的潜力。

链接: https://arxiv.org/abs/2604.24076
作者: Hikmat Karimov,Rahid Zahid Alekberli
机构: Azerbaijan Technical University (阿塞拜疆技术大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:As large language models (LLMs) are increasingly deployed in high-stakes and operational settings, evaluation strategies based solely on aggregate accuracy are often insucient to characterize system reliability. This study proposes a thermodynamic inspired modeling framework for analyzing the stability of LLM outputs under conditions of uncertainty and perturbation. The framework introduces a composite stability score that integrates task utility, entropy as a measure of external uncertainty, and two internal structural proxies: internal integration and aligned reective capacity. Rather than interpreting these quantities as physical variables, the formulation is intended as an interpretable abstraction that captures how internal structure may modulate the impact of disorder on model behavior. Using the IST-20 benchmarking protocol and associated metadata, we analyze 80 modelscenario observations across four contemporary LLMs. The proposed formulation consistently yields higher stability scores than a reduced utilityentropy baseline, with a mean improvement of 0.0299 (95% CI: 0.02470.0351). The observed gain is more pronounced under higher entropy conditions, suggesting that the framework captures a form of nonlinear attenuation of uncertainty. We do not claim a fundamental physical law or a complete theory of machine ethics. Instead, the contribution of this work is a compact and interpretable modeling perspective that connects uncertainty, performance, and internal structure within a unied evaluation lens. The framework is intended to complement existing benchmarking approaches and to support ongoing discussions in AI safety, reliability, and governance.

[NLP-49] How Sensitive Are Safety Benchmarks to Judge Configuration Choices?

【速读】: 该论文旨在解决当前安全基准测试(如HarmBench)中评估结果的不稳定性问题,其核心在于揭示并量化“评判提示(judge prompt)”这一此前被当作固定实现细节的因素对模型安全性评分的影响。解决方案的关键在于通过一个2×2×3因子设计构建12种不同提示变体,并在单一裁判模型(Claude Sonnet 4-6)下进行系统性实验,发现即使保持裁判模型不变,仅调整提示措辞即可导致有害响应率波动高达24.2个百分点,且模型安全排名也呈现中度不稳定(平均Kendall tau = 0.89),从而证明评判提示是安全基准测试中一个显著且此前被忽视的测量变异来源。

链接: https://arxiv.org/abs/2604.24074
作者: Xinran Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by the 22nd International Conference on Intelligent Computing (ICIC 2026). Final version to appear in Springer CCIS

点击查看摘要

Abstract:Safety benchmarks such as HarmBench rely on LLM judges to classify model responses as harmful or safe, yet the judge configuration, namely the combination of judge model and judge prompt, is typically treated as a fixed implementation detail. We show this assumption is problematic. Using a 2 x 2 x 3 factorial design, we construct 12 judge prompt variants along two axes, evaluation structure and instruction framing, and apply them using a single judge model, Claude Sonnet 4-6, producing 28,812 judgments over six target models and 400 HarmBench behaviors. We find that prompt wording alone, holding the judge model fixed, shifts measured harmful-response rates by up to 24.2 percentage points, with even within-condition surface rewording causing swings of up to 20.1 percentage points. Model safety rankings are moderately unstable, with mean Kendall tau = 0.89, and category-level sensitivity ranges from 39.6 percentage points for copyright to 0 percentage points for harassment. A supplementary multi-judge experiment using three judge models shows that judge-model choice adds further variance. Our results demonstrate that judge prompt wording is a substantial, previously under-examined source of measurement variance in safety benchmarking.

[NLP-50] PeeriScope: A Multi-Faceted Framework for Evaluating Peer Review Quality

【速读】: 该论文旨在解决学术会议中同行评审(peer review)规模扩大与多样性增加所带来的评审质量评估难题,亟需系统化、可解释且可扩展的工具来量化评审质量。其解决方案的关键在于提出一个模块化平台 PeeriScope,该平台融合结构化特征、基于评分量表(rubric)的大语言模型(large language model)评估以及监督学习预测方法,从多个维度对评审质量进行综合评估。PeeriScope 设计注重开放性和集成性,提供公开界面和文档化的 API,支持实际部署与研究扩展,适用于审稿人自我评估、编辑初筛及大规模审计等场景,从而推动科学同行评审质量评价方法的持续发展。

链接: https://arxiv.org/abs/2604.24071
作者: Sajad Ebrahimi,Soroush Sadeghian,Ali Ghorbanpour,Negar Arabzadeh,Sara Salamat,Seyed Mohammad Hosseini,Hai Son Le,Mahdi Bashari,Ebrahim Bagheri
机构: Reviewerly; UC Berkeley (加州大学伯克利分校); University of Toronto (多伦多大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The increasing scale and variability of peer review in scholarly venues has created an urgent need for systematic, interpretable, and extensible tools to assess review quality. We present PeeriScope, a modular platform that integrates structured features, rubric-guided large language model assessments, and supervised prediction to evaluate peer review quality along multiple dimensions. Designed for openness and integration, PeeriScope provides both a public interface and a documented API, supporting practical deployment and research extensibility. The demonstration illustrates its use for reviewer self-assessment, editorial triage, and large-scale auditing, and it enables the continued development of quality evaluation methods within scientific peer review. PeeriScope is available both as a live demo at this https URL and via API services at this https URL.

[NLP-51] Distilling Self-Consistency into Verbal Confidence: A Pre-Registered Negative Result and Post-Hoc Rescue on Gemma 3 4B

【速读】: 该论文旨在解决小型指令微调语言模型(Small instruct-tuned LLMs)在最小提示条件下产生的退化口头置信度问题,即模型输出的置信度评分与内部信息不一致,表现为天花板效应(>95%置信度)、接近随机的Type-2 AUROC(0.5附近)以及无效的有效性分布。解决方案的关键在于引入置信度条件监督微调(Confidence-conditioned Supervised Fine-Tuning, CSFT),利用自一致性(self-consistency)生成的目标标签对模型进行训练,以弥合内部表征与口头读出之间的差距。实验发现,若仅使用正确模态答案过滤后的样本进行训练(带模态过滤器),会导致标签熵坍缩(label-entropy collapse),使Type-2 AUROC从0.554降至0.509;而移除过滤器、在全部校准样本上训练后,可获得显著提升的二元正确性判别能力(TriviaQA上AUROC₂=0.774),且该结果在MMLU上体现为准确率从54.2%提升至77.4%,表明目标标签的熵和正确性对模型输出格式具有正则化作用。

链接: https://arxiv.org/abs/2604.24070
作者: Jon-Paul Cacioli
机构: Independent Researcher, Melbourne, Australia
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures, 4 tables. Pre-registered on OSF ( this https URL ). Code and data: this https URL

点击查看摘要

Abstract:Small instruct-tuned LLMs produce degenerate verbal confidence under minimal elicitation: ceiling rates above 95%, near-chance Type-2 AUROC, and Invalid validity profiles. We test whether confidence-conditioned supervised fine-tuning (CSFT) with self-consistency-derived targets can close the gap between internal information and verbal readout. A pre-registered Phase 0 protocol on Gemma 3 4B-it with a modal filter restricting training to items with correct modal answers produced a negative result: AUROC2 dropped from 0.554 to 0.509 due to label-entropy collapse in the training targets. An exploratory rescue removed the filter, training on all 2,000 calibration items. This produced a binary verbal correctness discriminator with AUROC2 = 0.774 on held-out TriviaQA, compressing a 10-sample self-consistency signal (AUROC2 = 0.999) into a single-pass readout exceeding logit entropy (0.701). The shuffled-target control showed no improvement (0.501). On MMLU, accuracy improved from 54.2% to 77.4% with the shuffled model at baseline (56.1%), supporting a target-dependent interpretation. The result is exploratory, binary rather than continuously calibrated, and observed at a single scale. It identifies two design lessons: confidence training requires label entropy, and correct targets regularise output format.

[NLP-52] Agent icCache: Cache-Driven Asynchronous Planning for Embodied AI Agents

【速读】: 该论文旨在解决嵌入式人工智能(Embodied AI)代理在执行任务时因每步调用大语言模型(Large Language Models, LLMs)而导致的高延迟和高成本问题。其解决方案的关键在于利用嵌入式任务中计划局部性(plan locality)的特性,提出AgenticCache框架:该框架通过运行时缓存频繁的计划转移来复用历史计划,从而避免每步都调用LLM;同时,后台的缓存更新器异步调用LLM对缓存条目进行验证与优化,确保计划质量。实验表明,该方法在多个多智能体嵌入式基准上平均提升任务成功率22%,降低模拟延迟65%、token消耗50%。

链接: https://arxiv.org/abs/2604.24039
作者: Hojoon Kim,Yuheng Wu,Thierry Tambe
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at MLSys 2026

点击查看摘要

Abstract:Embodied AI agents increasingly rely on large language models (LLMs) for planning, yet per-step LLM calls impose severe latency and cost. In this paper, we show that embodied tasks exhibit strong plan locality, where the next plan is largely predictable from the current one. Building on this, we introduce AgenticCache, a planning framework that reuses cached plans to avoid per-step LLM calls. In AgenticCache, each agent queries a runtime cache of frequent plan transitions, while a background Cache Updater asynchronously calls the LLM to validate and refine cached entries. Across four multi-agent embodied benchmarks, AgenticCache improves task success rate by 22% on average across 12 configurations (4 benchmarks x 3 models), reduces simulation latency by 65%, and lowers token usage by 50%. Cache-based plan reuse thus offers a practical path to low-latency, low-cost embodied agents. Code is available at this https URL.

[NLP-53] Agent Pulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

【速读】: 该论文旨在解决静态基准测试无法反映AI代理在实际部署中被采纳、维护及用户体验的问题,即传统评估方法仅能衡量模型在某一时刻的能力,而不能捕捉其在真实场景中的演化与影响力。解决方案的关键在于提出AgentPulse——一个连续评估框架,通过整合来自GitHub、包注册表、IDE市场、社交平台和基准排行榜等18个实时信号,从四个互补维度(基准性能、采纳信号、社区情绪和生态系统健康)对50个代理进行评分。该框架的核心创新在于验证了非代码来源的信号(如社区情绪)可有效预测外部采纳指标(如GitHub星标和Stack Overflow提问量),并揭示了封闭源高能力代理存在“采纳-能力负相关”现象,从而证明持续性部署信号对于全面理解AI代理价值的重要性。

链接: https://arxiv.org/abs/2604.24038
作者: Yuxuan Gao,Megan Wang,Yi Ling Yu
机构: University of Pennsylvania (宾夕法尼亚大学); Columbia University (哥伦比亚大学); OpenMesh AI (OpenMesh AI)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: 19 pages, 5 figures, 9 tables. Preprint under review

点击查看摘要

Abstract:Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment. We introduce AgentPulse, a continuous evaluation framework scoring 50 agents across 10 workload categories along four factors (Benchmark Performance, Adoption Signals, Community Sentiment, and Ecosystem Health) aggregated from 18 real-time signals across GitHub, package registries, IDE marketplaces, social platforms, and benchmark leaderboards. Three analyses ground the framework. The four factors capture largely complementary information (n=50; \rho_\max=0.61 for Adoption-Ecosystem, all others |\rho| \leq 0.37 ). A circularity-controlled test (n=35) shows the Benchmark+Sentiment sub-composite, which contains no GitHub-derived signals, predicts external adoption proxies it does not aggregate: GitHub stars ( \rho_s=0.52 , p0.01 ) and Stack Overflow question volume ( \rho_s=0.49 , p0.01 ), with VS Code installs ( \rho_s=0.44 , p0.05 ) reported as illustrative given that only 11 of 35 agents have non-zero installs. On the n=11 subset with published SWE-bench scores, composite and benchmark-only rankings are nearly uncorrelated ( \rho_s=0.25 ; 9 of 11 agents shift by at least 2 ranks), driven by a strong negative Adoption-Capability correlation among closed-source high-capability agents within this subset. This is precisely why we rest the framework’s validity claim on the broader n=35 test rather than the SWE-bench overlap. AgentPulse surfaces deployment signal absent from benchmarks; it is a methodology, not a ground-truth ranking. The framework, all collected signals, scoring outputs, and evaluation harness are released under CC BY 4.0.

[NLP-54] From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

【速读】: 该论文旨在解决当前大型语言模型(Large Language Model, LLM)代理系统中技能(skill)表示方式过于依赖文本密集型描述所带来的可操作性难题,即技能的调用接口、执行结构与具体副作用等关键信息在自然语言描述中高度耦合,导致机器难以高效地检索、理解和复用技能。其解决方案的关键在于提出一种全新的结构化表示框架——Scheduling-Structural-Logical (SSL) 表示法,该方法基于Schank和Abelson的经典语义知识表征理论,将技能知识解耦为三个独立维度:调度信号(scheduling signals)、场景级执行结构(structural execution context)以及逻辑级动作与资源使用证据(logical-level action and resource-use evidence),从而实现技能的显式结构化表达。实验表明,SSL在技能发现(Skill Discovery)和风险评估(Risk Assessment)任务上显著优于纯文本基线,证明了结构化表示对提升技能可搜索性和可审查性的有效性。

链接: https://arxiv.org/abs/2604.24026
作者: Qiliang Liang,Hansi Wang,Zhong Liang,Yang Liu
机构: Peking University (北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, 1 figure

点击查看摘要

Abstract:LLM agents increasingly rely on reusable skills, capability packages that combine instructions, control flow, constraints, and tool calls. In most current agent systems, however, skills are still represented by text-heavy artifacts, including this http URL-style documents and structured records whose machine-usable evidence remains embedded largely in natural-language descriptions. This poses a challenge for skill-centered agent systems: managing skill collections and using skills to support agent both require reasoning over invocation interfaces, execution structure, and concrete side effects that are often entangled in a single textual surface. An explicit representation of skill knowledge may therefore help make these artifacts easier for machines to acquire and leverage. Drawing on Memory Organization Packets, Script Theory, and Conceptual Dependency from Schank and Abelson’s classical work on linguistic knowledge representation, we introduce what is, to our knowledge, the first structured representation for agent skill artifacts that disentangles skill-level scheduling signals, scene-level execution structure, and logic-level action and resource-use evidence: the Scheduling-Structural-Logical (SSL) representation. We instantiate SSL with an LLM-based normalizer and evaluate it on a corpus of skills in two tasks, Skill Discovery and Risk Assessment, and superiorly outperform the text-only baselines: in Skill Discovery, SSL improves MRR from 0.573 to 0.707; in Risk Assessment, it improves macro F1 from 0.744 to 0.787. These findings reveal that explicit, source-grounded structure makes agent skills easier to search and review. They also suggest that SSL is best understood as a practical step toward more inspectable, reusable, and operationally actionable skill representations for agent systems, rather than as a finished standard or an end-to-end mechanism for managing and using skills.

[NLP-55] Stabilizing Efficient Reasoning with Step-Level Advantage Selection ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中因生成冗长且低效的推理轨迹而导致的计算资源浪费问题,尤其关注短上下文窗口后训练(short-context post-training)对推理压缩效果的影响及其带来的训练不稳定与性能下降问题。其解决方案的关键在于提出步骤级优势选择(Step-level Advantage Selection, SAS),该方法在推理步骤层面动态调整优势值:对正确轨迹中低置信度的步骤赋予零优势,对验证失败轨迹中高置信度的步骤也赋予零优势,从而有效区分由截断或验证器错误引起的失败与真实推理错误,显著提升推理效率和准确性——在多个数学与通用推理基准上,SAS相比最强长度感知基线平均提升Pass@1准确率0.86点,同时减少平均推理长度16.3%。

链接: https://arxiv.org/abs/2604.24003
作者: Han Wang,Xiaodong Yu,Jialian Wu,Jiang Liu,Ximeng Sun,Mohit Bansal,Zicheng Liu
机构: UNC Chapel Hill; Advanced Micro Devices, Inc.
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Findings of ACL 2026, Code: this https URL

点击查看摘要

Abstract:Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long and verbose reasoning traces. While recent work on efficient reasoning reduces this overhead through length-based rewards or pruning, many approaches are post-trained under a much shorter context window than base-model training, a factor whose effect has not been systematically isolated. We first show that short-context post-training alone, using standard GRPO without any length-aware objective, already induces substantial reasoning compression-but at the cost of increasingly unstable training dynamics and accuracy degradation. To address this, we propose Step-level Advantage Selection (SAS), which operates at the reasoning-step level and assigns a zero advantage to low-confidence steps in correct rollouts and to high-confidence steps in verifier-failed rollouts, where failures often arise from truncation or verifier issues rather than incorrect reasoning. Across diverse mathematical and general reasoning benchmarks, SAS improves average Pass@1 accuracy by 0.86 points over the strongest length-aware baseline while reducing average reasoning length by 16.3%, yielding a better accuracy-efficiency trade-off.

[NLP-56] When to Commit? Towards Variable-Size Self-Contained Blocks for Discrete Diffusion Language Models

【速读】: 该论文旨在解决离散扩散语言模型(Discrete Diffusion Language Models, dLLMs)在实际生成过程中存在的训练-推理不匹配问题:训练时利用全序列上下文进行去噪,而推理时采用固定大小或启发式划分的块(blockwise)进行半自回归解码,导致在缺乏未来上下文信息的情况下过早承诺token选择,从而影响生成质量。解决方案的关键在于提出“自洽性”(self-containedness)作为块边界选择的理论准则——即一个块若其预测在无未来上下文(No-Future, NF)与有未来上下文(Future-Aware, FA)条件下保持一致,则认为该块是自洽的。基于此,作者引入可变大小自洽块(Variable-size Self-contained Blocks, VSB),通过量化NF与FA条件下的token级预测分布差异来动态评估并选择块边界,实现了更合理的块划分策略,从而缓解了因未来信息缺失引发的错误决策问题。

链接: https://arxiv.org/abs/2604.23994
作者: Danny Wang,Ruihong Qiu,Zi Huang
机构: The University of Queensland(昆士兰大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Discrete diffusion language models (dLLMs) enable parallel token updates with bidirectional attention, yet practical generation typically adopts blockwise semi-autoregressive decoding. This switch creates a training-inference mismatch: training denoises with full-sequence context, while inference commits tokens within a bounded block without future context. Therefore, decoding with fixed-size or heuristic-based blocks can lead to premature token commitments, as decisions are made without full access to future context that could alter those choices. Motivated by this, we propose self-containedness as a principled criterion for block commitment. A block is self-contained if its predictions remain consistent with Future-Aware (FA) or without No-Future (NF) access to future context, reframing block boundary selection as a test of self-containedness rather than a heuristic choice. Based on this principle, we introduce Variable-size Self-contained Blocks (VSB) for dLLMs. VSB scores and selects block boundaries using the divergence between token-level predictive distributions under NF and FA conditioning, which quantifies how predictions would change if future context were revealed. We provide theoretical justification linking self-containedness to predictive consistency, and extensive experiments validate VSB’s efficacy over fixed-size and heuristic blockwise decoding.

[NLP-57] Representational Curvature Modulates Behavioral Uncertainty in Large Language Models

【速读】: 该论文旨在解决自回归大语言模型(LLMs)中,表示轨迹的几何特性与token级行为不确定性之间的关联缺失问题。其关键解决方案是引入“上下文曲率”(contextual curvature)这一几何度量,用于量化表示轨迹在近期上下文中的弯曲程度,并发现该曲率与下一个token的熵(entropy)存在显著相关性。通过训练过程中的动态演化验证了这种关系的出现,并借助轨迹对齐的扰动实验表明:仅当干预与表示轨迹方向一致时,才能有效调节熵;而几何错位的扰动则无影响。此外,通过正则化使表示更“直”(straighter),可在不损害验证损失的前提下略微降低token级熵,从而确立了轨迹曲率作为任务对齐的表征特征,直接影响模型的行为不确定性。

链接: https://arxiv.org/abs/2604.23985
作者: Jack King,Evelina Fedorenko,Eghbal A. Hosseini
机构: Massachusetts Institute of Technology (麻省理工学院); Google DeepMind (谷歌深度思维)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In autoregressive large language models (LLMs), temporal straightening offers an account of how the next-token prediction objective shapes representations. Models learn to progressively straighten the representational trajectory of input sequences across layers, potentially facilitating next-token prediction via linear extrapolation. However, a direct link between this trajectory and token-level behavior has been missing. We provide such a link by relating contextual curvature-a geometric measure of how sharply the representational trajectory bends over recent context-to next-token entropy. Across two models (GPT-2 XL and Pythia-2.8B), contextual curvature is correlated with entropy, and this relationship emerges during training. Perturbation experiments reveal selective dependence: manipulating curvature through trajectory-aligned interventions reliably modulates entropy, while geometrically misaligned perturbations have no effect. Finally, regularizing representations to be straighter during training modestly reduces token-level entropy without degrading validation loss. These results identify trajectory curvature as a task-aligned representational feature that influences behavioral uncertainty in LLMs.

[NLP-58] Propagation Structure-Semantic Transfer Learning for Robust Fake News Detection ECML-PKDD2024

【速读】: 该论文旨在解决虚假新闻检测中因社交媒体上非正式语言的语义模糊性和用户交互行为的不可靠性所导致的内容与传播结构中存在固有噪声的问题,这些问题会引发语义噪声与结构噪声之间的相互干扰,从而限制现有方法在实际场景中的鲁棒性。其解决方案的关键在于提出一种基于教师-学生架构的传播结构-语义迁移学习框架(Propagation Structure-Semantic Transfer Learning, PSS-TL),通过设计两个独立的教师模型分别学习含噪内容的语义知识和传播结构知识,并引入多通道知识蒸馏(Multi-channel Knowledge Distillation, MKD)损失函数,使学生模型能够有效获取来自教师模型的专业化知识,从而避免两类噪声间的干扰,提升检测性能的鲁棒性。

链接: https://arxiv.org/abs/2604.23974
作者: Mengyang Chen,Lingwei Wei,Han Cao,Wei Zhou,Zhou Yan,Songlin Hu
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by ECML-PKDD 2024

点击查看摘要

Abstract:Fake news generally refers to false information that is spread deliberately to deceive people, which has detrimental social effects. Existing fake news detection methods primarily learn the semantic features from news content or integrate structural features from propagation. However, in practical scenarios, due to the semantic ambiguity of informal language and unreliable user interactive behaviors on social media, there are inherent semantic and structural noises in news content and propagation. Although some recent works consider the effect of irrelevant user interactions in a hybrid-modeling way, they still suffer from the mutual interference between structural noise and semantic noise, leading to limited performance for robust detection. To alleviate this issue, this paper proposes a novel Propagation Structure-Semantic Transfer Learning framework (PSS-TL) for robust fake news detection under a teacher-student architecture. Specifically, we design dual teacher models to learn semantics knowledge and structure knowledge from noisy news content and propagation structure independently. Besides, we design a Multi-channel Knowledge Distillation (MKD) loss to enable the student model to acquire specialized knowledge from the teacher models, thereby avoiding mutual interference. Extensive experiments on two real-world datasets validate the effectiveness and robustness of our method.

[NLP-59] Quantum Knowledge Graph: Modeling Context-Dependent Triplet Validity

【速读】: 该论文旨在解决传统知识图谱(Knowledge Graph, KG)在支持大语言模型(Large Language Model, LLM)推理时存在的局限性问题,即标准三元组形式的KG将每个关系视为全局有效,而未考虑其在特定临床情境下的适用性。这种忽略上下文依赖性的建模方式可能导致错误推理,尤其在医疗问答等对准确性要求极高的场景中。解决方案的关键在于提出一种量子知识图谱(Quantum Knowledge Graph, QKG),将三元组的有效性建模为与上下文相关的函数,并在糖尿病中心的PrimeKG子图中实现,通过引入患者群体特异性约束来增强关系的条件适用性。实验表明,基于QKG的上下文匹配机制显著提升了LLM推理的准确性,验证了在医疗领域中,知识图谱的价值不仅在于存储医学事实,更在于精准表达这些事实在具体患者情境下的可应用性。

链接: https://arxiv.org/abs/2604.23972
作者: Yao Wang,Zixu Geng,Jun Yan
机构: City University of Hong Kong (香港城市大学); Tsinghua University (清华大学); Duke University (杜克大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
备注: 15 pages main text, 6 pages appendix, 5 figures, preprint

点击查看摘要

Abstract:Knowledge graphs (KGs) are increasingly used to support large lan guage model (LLM) reasoning, but standard triplet-based KGs treat each relation as globally valid. In many settings, whether a relation should count as evidence depends on the context. We therefore formulate triplet validity as a triplet-specific function of context and refer to this formulation as a Quantum Knowledge Graph (QKG). We instantiate QKG in medicine using a diabetes-centered PrimeKG subgraph, whose 68,651 context-sensitive relations are further annotated with patient-group-specific constraints. We evaluate it in a reasoner–validator pipeline for medical question answering on a KG-grounded subset of MedReason containing 2,788 questions. With Haiku-4.5 as both the Reasoner and the Validator, KG-backed validation significantly improves over a no-validator baseline ( +0.61 pp), and QKG with context matching yields the largest gain, outperforming both KG validation without context matching ( +0.79 pp) and the no-validator baseline ( +1.40 pp; paired McNemar, all p0.05 ). Under a stronger validator (Qwen-3.6-Plus), the raw QKG gain over the no-validator baseline grows from +1.40 pp to +5.96 pp; the context-matching gap is non-significant ( p=0.73 ) on the raw set but becomes borderline significant ( p=0.05 ) after adjustment for knowledge leakage and suspicious questions, consistent with a benchmark-gold ceiling rather than a QKG limitation. Taken together, the results support the view that the value of a KG in LLM-based clinical reasoning lies not merely in storing medically related facts, but in representing whether those facts are applicable to the specific patient context. For reproducibility and further research, we release the curated QKG datasets and source code.\footnotethis https URL

[NLP-60] KOMBO: Korean Character Representations Based on the Combination Rules of Subcharacters ACL2024

【速读】: 该论文旨在解决现有预训练语言模型(Pre-trained Language Models, PLMs)在处理韩语时忽视韩文字符构造原理的问题。韩语书写系统(Hangeul)的字符结构严格遵循《训民正音》(Hunminjeongeum)所记载的造字原则,而当前主流韩语PLMs多采用基于子词(subword)的分词策略,未能有效利用这一独特的字符层级信息。解决方案的关键在于提出一种名为KOMBO的新框架,首次将《训民正音》中的造字原理融入韩语字符表示中,通过子字符(subcharacter)建模方式替代传统的子词切分方法,从而更精准地捕捉韩语的语言特征。实验表明,KOMBO在五个韩语自然语言理解任务上平均性能优于当前最优模型2.11%,验证了子字符表示相较于子词方法对韩语建模的优势。

链接: https://arxiv.org/abs/2604.23948
作者: SungHo Kim,Juhyeong Park,Yeachan Kim,SangKeun Lee
机构: Korea University(韩国大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Presented at ACL 2024 Findings

点击查看摘要

Abstract:The Korean writing system, \textitHangeul, has a unique character representation rigidly following the invention principles recorded in \textitHunminjeongeum.\footnote\textitHunminjeongeum is a book published in 1446 that describes the principles of invention and usage of \textitHangeul, devised by King Sejong \citeHunminjeongeum_Guide. However, existing pre-trained language models (PLMs) for Korean have overlooked these principles. In this paper, we introduce a novel framework for Korean PLMs called KOMBO, which firstly brings the invention principles of \textitHangeul to represent character. Our proposed method, KOMBO, exhibits notable experimental proficiency across diverse NLP tasks. In particular, our method outperforms the state-of-the-art Korean PLM by an average of 2.11% in five Korean natural language understanding tasks. Furthermore, extensive experiments demonstrate that our proposed method is suitable for comprehending the linguistic features of the Korean language. Consequently, we shed light on the superiority of using subcharacters over the typical subword-based approach for Korean PLMs. Our code is available at: [this https URL](this https URL).

[NLP-61] SAssistant: A Human-in-the-Loop Agent ic Framework for Automated Target Safety Assessment

【速读】: 该论文旨在解决靶点安全评估(Target Safety Assessment, TSA)过程中因依赖异质性证据(如遗传、转录组、靶点同源性、药理学及临床数据)而带来的系统性整合难题,该过程高度迭代且依赖专家判断,导致可扩展性和可重复性不足的问题。解决方案的关键在于提出TSAssistant——一个基于多智能体(multi-agent)的框架,通过模块化、分章节的人机协同范式实现报告生成自动化;其核心机制是将报告撰写分解为由专门子智能体协作完成的流水线任务,每个子智能体负责单一TSA章节,并通过标准化接口从生物医学知识库中检索结构化与非结构化证据,生成可独立引用、基于证据的章节内容;同时引入交互式优化循环,允许用户手动编辑、补充信息或重新调用特定智能体进行修订,系统则维持跨轮次的对话记忆,从而在提升效率的同时保障专家决策权,实现AI增强型证据合成与人类主导的最终判断相结合的混合模式。

链接: https://arxiv.org/abs/2604.23938
作者: Xiaochen Zheng,Zhiwen Jiang,Melanie Guerard,Klas Hatje,Tatyana Doktorova
机构: 未知
类目: Computation and Language (cs.CL)
备注: Preliminary version; quantitative evaluation results to be included in a future revision

点击查看摘要

Abstract:Target Safety Assessment (TSA) requires systematic integration of heterogeneous evidence, including genetic, transcriptomic, target homology, pharmacological, and clinical data, to evaluate potential safety liabilities of therapeutic targets. This process is inherently iterative and expert-driven, posing challenges in scalability and reproducibility. We present TSAssistant, a multi-agent framework designed to support TSA report drafting through a modular, section-based, and human-in-the-loop paradigm. The framework decomposes report generation into a coordinated pipeline of specialised subagents, each targeting a single TSA section. Specialised subagents retrieve structured and unstructured data as well as literature evidence from curated biomedical sources through standardised tool interfaces, producing individually citable, evidence-grounded sections. Agent behaviour is governed by a hierarchical instruction architecture comprising system prompts, domain-specific skill modules, and runtime user instructions. A key feature is an interactive refinement loop in which users may manually edit sections, append new information, upload additional sources, or re-invoke agents to revise specific sections, with the system maintaining conversational memory across iterations. TSAssistant is designed to reduce the mechanical burden of evidence synthesis and report drafting, supporting a hybrid model in which agentic AI augments evidence synthesis while toxicologists retain final decision authority.

[NLP-62] Knowledge Vector of Logical Reasoning in Large Language Models ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中不同类型逻辑推理能力(演绎、归纳和溯因推理)的知识表示独立性问题,即这些推理形式在模型内部的表征向量虽可被线性空间捕捉,但彼此间缺乏协同与互补。其解决方案的关键在于提出一种互补子空间约束精炼框架(complementary subspace-constrained refinement framework),通过引入互补损失函数(complementary loss)使各类推理向量能够借助其他推理类型的辅助知识进行优化,同时利用子空间约束损失函数(subspace constraint loss)保留各推理类型特有的表征特征,从而增强不同推理模式之间的互补性与一致性。

链接: https://arxiv.org/abs/2604.23877
作者: Zixuan Wang,Yuanyuan Lei
机构: University of Florida (佛罗里达大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026

点击查看摘要

Abstract:Logical reasoning serve as a central capability in LLMs and includes three main forms: deductive, inductive, and abductive reasoning. In this work, we study the knowledge representations of these reasoning types in LLMs and analyze the correlations among them. Our analysis shows that each form of logical reasoning can be captured as a reasoning-specific knowledge vector in a linear representation space, yet these vectors are largely independent of each other. Motivated by cognitive science theory that these subforms of logical reasoning interact closely in the human brain, as well as our observation that the reasoning process for one type can benefit from the reasoning chain produced by another, we further propose to refine the knowledge representations of each reasoning type in LLMs to encourage complementarity between them. To this end, we design a complementary subspace-constrained refinement framework, which introduces a complementary loss that enables each reasoning vector to leverage auxiliary knowledge from the others, and a subspace constraint loss that prevents erasure of their unique characteristics. Through steering experiments along reasoning vectors, we find that refined vectors incorporating complementary knowledge yield consistent performance gains. We also conduct a mechanism-interpretability analysis of each reasoning vector, revealing insights into the shared and specific features of different reasoning in LLMs.

[NLP-63] Graph Memory Transformer (GMT)

【速读】: 该论文旨在解决如何在保持自回归架构不变的前提下,用显式学习的记忆图结构替代解码器-only Transformer 中的前馈网络(Feed-Forward Network, FFN)子层的问题。其核心解决方案是提出图记忆Transformer(Graph Memory Transformer, GMT),该模型保留因果自注意力机制,但将每个token的FFN变换替换为一个记忆单元:该单元基于学习得到的中心点(centroid)池和有向转移矩阵,在token条件驱动下实现源状态到目标状态的移动而非直接值检索。关键创新在于通过引力源路由、token条件的目标选择和门控位移读出机制,使模型具备可解释性,并能直接观测记忆中心使用、过渡结构及状态迁移路径,从而在不引入密集FFN层的情况下实现语言建模任务。

链接: https://arxiv.org/abs/2604.23862
作者: Nicola Zanarini,Niccolò Ferrari
机构: Bonfiglioli Engineering s.r.l.(邦菲利欧工程有限责任公司); University of Ferrara(费拉拉大学); NAIS Engineering s.r.l.(纳伊斯工程有限责任公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 65 pages, 10 figures, 5 tables. Code available at this https URL

点击查看摘要

Abstract:We investigate whether the Feed-Forward Network (FFN) sublayer in a decoder-only transformer can be replaced by an explicit learned memory graph while preserving the surrounding autoregressive architecture. The proposed Graph Memory Transformer (GMT) keeps causal self-attention intact, but replaces the usual per-token FFN transformation with a memory cell that routes token representations over a learned bank of centroids connected by a learned directed transition matrix. In the base GMT v7 instantiation studied here, each of 16 transformer blocks contains 128 centroids, a 128 * 128 edge matrix, gravitational source routing, token-conditioned target selection, and a gated displacement readout. The cell therefore returns movement from an estimated source memory state toward a target memory state, rather than a retrieved value. The resulting model is a fully decoder-only language model with 82.2M trainable parameters and no dense FFN sublayers, compared with a 103.0M-parameter dense GPT-style baseline used in the evaluation. The base v7 model trains stably and exposes centroid usage, transition structure, and source-to-target movement as directly inspectable quantities of the forward computation. It remains behind the larger dense baseline in validation loss and perplexity (3.5995/36.58 vs. 3.2903/26.85), while showing close zero-shot benchmark behavior under the evaluated setting. These results are not intended as a state-of-the-art claim; they support the viability and structural interpretability of replacing dense within-token transformation with graph-mediated memory navigation. Broader scaling, optimized kernels, and more extensive benchmark evaluation are left for subsequent work.

[NLP-64] Learning Selective LLM Autonomy from Copilot Feedback in Enterprise Customer Support Workflows

【速读】: 该论文旨在解决企业级业务流程管理(BPM)平台中客户支持工作流自动化效率低、部署周期长且依赖大量人工标注的问题。其解决方案的关键在于构建一个可规模化部署的端到端自动化系统,通过利用已有的大规模结构化UI交互轨迹和低开销的“协作者反馈”(copilot feedback)来训练策略模型:首先在分阶段部署管道中学习下一步UI操作策略,再从协作者反馈中学习批评者(critic)以校准系统在不确定时的放弃阈值(abstention),从而仅在高置信度下自动执行步骤,并将不确定性决策交由人工处理,同时从更新后的UI状态恢复执行。该设计使单个操作员可同时监督多个并发会话,仅在系统不确定时被中断,最终在生产环境中实现了45%的会话自动化率并降低39%的平均处理时间,且不降低服务质量。

链接: https://arxiv.org/abs/2604.23855
作者: Nikita Borovkov,Elisei Rykov,Olga Tsymboi,Sergei Filimonov,Nikita Surnachev,Dmitry Bitman,Anatolii Potapov
机构: 未知
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:We present a deployed system that automates end-to-end customer support workflows inside an enterprise Business Process Management (BPM) platform. The approach is scalable in production and reaches selective automation within two weeks for a new process, leveraging supervision already generated at scale: structured per-case UI interaction traces and low-overhead copilot feedback, where operators either accept a suggestion or provide a correction. A staged deployment pipeline trains a next UI action policy, learns a critic from copilot feedback to calibrate abstention, and executes only high-confidence steps in the background while deferring uncertain decisions to operators and resuming from the updated UI state. This setup lets one operator supervise multiple concurrent sessions and be interrupted only when the system is uncertain. The system operates on a schema-driven view of the BPM interface and includes monitoring and safe fallbacks for production. In production, it automated 45% of sessions and reduced average handling time by 39% without degrading support quality level.

[NLP-65] ranslate or Simplify First: An Analysis of Cross-lingual Text Simplification in English and French

【速读】: 该论文旨在解决跨语言文本简化(Cross-Lingual Text Simplification, CLTS)问题,即在翻译的同时降低目标语言文本的复杂度,以提升多语言语境下的内容可及性。其解决方案的关键在于系统性地比较五种不同的提示策略:包括直接提示(同时执行翻译与简化)、两种组合式方法(先译后简或先简后译)以及两种分解式方法(分步执行翻译与简化)。实验结果表明,尽管直接提示在BLEU得分上表现最优(体现语义保真度),但“先译后简”策略在简化程度上最佳,凸显了任务分解对提升文本可读性的关键作用。

链接: https://arxiv.org/abs/2604.23844
作者: Ido Dahan,Omer Toledano,Roey J. Gafter,Sharon Pardo,Oren Tsur,Hila Zahavi,Elior Sulem
机构: Ben-Gurion University of the Negev (本古里安大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Cross-Lingual Text Simplification (CLTS) aims to make content more accessible across languages by simultaneously addressing both linguistic complexity and translation. This study investigates the effectiveness of different prompting strategies for CLTS between English and French using large language models (LLMs). We examine five distinct prompting systems: a direct prompt instructing the LLM to perform both translation and simplification simultaneously, two Composition approaches that either translate-then-simplify or simplify-then-translate within a single prompt, and two decomposition approaches that perform the same operations in separate, consecutive prompts. These systems are evaluated across a diverse set of five corpora of different genres (Wikipedia and medical texts) using seven state-of-the-art LLMs. Output quality is assessed through a multi-faceted evaluation framework comprising automatic metrics, comprehensive linguistic feature analysis, and human evaluation of simplicity and meaning preservation. Our findings reveal that while direct prompting consistently achieves the highest BLEU scores, indicating meaning fidelity, Translate-then-Simplify approaches demonstrate the highest simplicity, as measured by the linguistic features.

[NLP-66] Reheat Nachos for Dinner? Evaluating AI Support for Cross-Cultural Communication of Neologisms ACL2026

【速读】: 该论文旨在解决非母语者(Non-Native Speakers, NNS)在跨文化交际中难以准确理解和使用英语新词(neologisms)及新兴俚语的问题,从而影响其与母语者(Native Speakers, NS)的沟通效果。解决方案的关键在于评估不同人工智能(AI)辅助方式对NNS学习和运用这些词汇的有效性,特别是比较三种AI支持条件:AI定义(AI Definition)、AI简化重写(AI Rewrite into simpler English)和AI意义与用法解释(AI Explanation of meaning and usage),并与传统词典(Non-AI Dictionary)进行对照。研究发现,AI Explanation在提升NS评价的交际能力方面表现最优,但NNS对自身语用恰当性的判断与其实际表现存在显著偏差,揭示了当前AI工具在模拟真实语境理解上的局限性,为未来更精准的AI辅助语言学习系统设计提供了依据。

链接: https://arxiv.org/abs/2604.23842
作者: Dayeon Ki,Yu Hou,Rachel Rudinger,Hal Daumé III,Marine Carpuat,Fumeng Yang
机构: University of Maryland (马里兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2026 Findings

点击查看摘要

Abstract:Neologisms and emerging slang are central to daily conversation, yet challenging for non-native speakers (NNS) to interpret and use appropriately in cross-cultural communication with native speakers (NS). NNS increasingly make use of Artificial Intelligence (AI) tools to learn these words. We study the utility of such tools in mediating an informal communication scenario through a human-subjects study (N=234): NNS participants learn English neologisms with AI support, write messages using the learned word to an NS friend, and judge contextual appropriateness of the neologism in two provided writing samples. Using both NS evaluator-rated communicative competence of NNS-produced writing and NNS’ contextual appropriateness judgments, we compare three AI-based support conditions: AI Definition, AI Rewrite into simpler English, AI Explanation of meaning and usage, and Non-AI Dictionary for comparison. We show that AI Explanation yields the largest gains over no support in NS-rated competence, while contextual appropriateness judgments show indifference across support. NNS participants’ self-reported perceptions tend to overestimate NS ratings, revealing a mismatch between perceived and actual competence. We further observe a significant gap between NNS- and NS-produced writing, highlighting the limitations of current AI tools and informing design for future tools.

[NLP-67] One Size Fits None: Heuristic Collapse in LLM Investment Advice

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险领域(如投资建议)中是否真正实现基于用户完整背景的个性化决策,还是仅依赖少数主导性输入特征进行简化判断的问题。研究发现,LLMs在投资建议场景中表现出系统性的启发式坍塌(heuristic collapse),即决策主要由自我报告的风险承受能力决定,而其他相关因素贡献甚微。解决方案的关键在于引入可解释的代理模型(interpretable surrogate models)对LLM输出进行分析,从而揭示其输入敏感性,而非仅仅评估输出质量;这表明部署LLM作为顾问时,必须通过审计输入敏感性来识别潜在的决策偏差。

链接: https://arxiv.org/abs/2604.23837
作者: Jillian Ross,Andrew W. Lo
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models are increasingly deployed as advisors in high-stakes domains – answering medical questions, interpreting legal documents, recommending financial products – where good advice requires integrating a user’s full context rather than responding to salient surface features. We investigate whether frontier LLMs actually do this, or whether they instead exhibit heuristic collapse: a systematic reduction of complex, multi-factor decisions to a small number of dominant inputs. We study the phenomenon in investment advice, where legal standards explicitly require individualized reasoning over a client’s full circumstances. Applying interpretable surrogate models to LLM outputs, we find systematic heuristic collapse: investment allocation decisions are largely determined by self-reported risk tolerance, while other relevant factors contribute minimally. We further find that web search partially attenuates heuristic collapse but does not resolve it. These findings suggest that heuristic collapse is not resolved by web search augmentation or model scale alone, and that deploying LLMs as advisors requires auditing input sensitivity, not just output quality.

[NLP-68] Resource-Lean Lexicon Induction for German Dialects LREC2026

【速读】: 该论文旨在解决低资源语言和方言中高质量词典自动构建的难题,尤其针对德语方言中存在的标注数据稀缺、拼写变体复杂以及大语言模型(LLM)性能不佳等问题。其核心解决方案是采用基于字符串相似性特征训练的随机森林(Random Forests)统计模型,该方法在实际应用中展现出优于主流LLM的性能,不仅能够实现跨方言迁移学习,还具备轻量化、数据驱动的优势。实验表明,在双语词典诱导(BLI)任务中,随机森林模型超越了Mistral-123b,同时在方言信息检索(IR)任务中通过查询扩展显著提升指标表现,nDCG@10最高提升28.9%,Recall@100提升50.7%。

链接: https://arxiv.org/abs/2604.23824
作者: Robert Litschko,Barbara Plank,Diego Frassinelli
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at LREC 2026

点击查看摘要

Abstract:Automatic induction of high-quality dictionaries is essential for building lexical resources, yet low-resource languages and dialects pose several challenges: limited access to annotators, high degree of spelling variations, and poor performance of large language models (LLMs). We empirically show that statistical models (random forests) trained on string similarity features are surprisingly effective for inducing German dialect lexicons. They outperform LLMs, enable cross-dialect transfer, and offer a lightweight data-driven alternative. We evaluate our models intrinsically on bilingual lexicon induction (BLI) and extrinsically on dialect information retrieval (IR). On BLI, random forests outperform Mistral-123b while being more resource-lean. On dialect IR with BM25, using our dialect dictionaries for query expansion yields relative improvements of up to 28.9% in nDCG@10 and 50.7% in Recall@100. Motivated by the resource scarcity in dialects, we further investigate the extent to which models transfer across different German dialects, and their performance under varying amounts of training data.

[NLP-69] DRACULA: Hunting for the Actions Users Want Deep Research Agents to Execute

【速读】: 该论文旨在解决科学深度研究(Scientific Deep Research, DR)代理在生成多章节研究报告过程中,因缺乏对中间操作(intermediate actions)用户反馈而难以优化其行动决策的问题。现有方法仅评估最终报告质量,无法识别哪些中间步骤有助于提升报告效果,从而阻碍了DR代理的学习与改进。解决方案的关键在于构建DRACULA数据集——首个包含用户对DR代理中间操作偏好及执行结果判断的标注数据集,通过五周实验收集了19位计算机科学专家的8,103条动作偏好和5,230条执行判断。基于此数据集,作者开展模拟研究以评估大语言模型(LLM)预测用户偏好的能力,并发现:LLM性能提升最显著的是利用用户的完整选择历史而非自我报告或推断的上下文信号;同时揭示了用户对相同查询的选择差异源于未明示的目标,提示需引入可交互的“ affordances”机制帮助用户引导报告生成。最终,该研究提出一种在线干预策略,根据用户过往交互生成新动作,显著提高了用户采纳率,验证了行动选择优先于执行本身的重要性。

链接: https://arxiv.org/abs/2604.23815
作者: Nishant Balepur,Malachi Hamada,Varsha Kishore,Sergey Feldman,Amanpreet Singh,Pao Siangliulue,Joseph Chee Chang,Rachel Rudinger,Eunsol Choi,Jordan Lee Boyd-Graber,Doug Downey,Aakanksha Naik
机构: 未知
类目: Computation and Language (cs.CL)
备注: In-progress Preprint

点击查看摘要

Abstract:Scientific Deep Research (DR) agents answer user queries by synthesizing research papers into multi-section reports. User feedback can improve their utility, but existing protocols only score the final report, making it hard to study and learn which intermediate actions DR agents should take to improve reports. We collect DRACULA, the first dataset with user feedback on intermediate actions for DR. Over five weeks, nineteen expert CS researchers ask queries to a DR system that proposes actions (e.g., “Add a section on datasets”). Our users select actions they prefer, then judge whether an output report applied their selections successfully, yielding 8,103 action preferences and 5,230 execution judgments. After confirming a DR agent can execute DRACULA’s actions, we study the predictability of user-preferred actions via simulation-how well LLMs predict the actions users select-a step toward learning to generate useful actions. We discover: (1) LLM judges initially struggle to predict action selections, but improve most when using a user’s full selection history, rather than self-reported or extrapolated user context signals; (2) Users’ selections for the same query differ based on unstated goals, bottlenecking simulation and motivating affordances that let users steer reports; and (3) Our simulation results inform an online intervention that generates new actions based on the user’s past interactions, which users pick most often in follow-up studies. Overall, while work extensively studies execution, DRACULA reveals a key challenge is deciding which actions to execute in the first place. We open-source DRACULA’s study design, user feedback, and simulation tasks to spur future work on action feedback for long-horizon agents.

[NLP-70] ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLM s in Document Reconstruction ACL2026

【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉丰富文档理解(Visually Rich Document Understanding, VRDU)任务中对碎片化文档内容恢复能力不足的问题。其核心挑战在于如何在显著的内容断裂条件下,结合视觉模式识别与语义推理实现准确的文档重建。解决方案的关键在于提出ShredBench基准测试平台,该平台通过自动化生成管道直接从Markdown源文件渲染碎片化文档,确保评估的有效性并避免训练数据污染;同时,该基准涵盖四种场景(英文、中文、代码、表格)及三种碎片粒度(8、12、16片),系统性地揭示了现有MLLMs在碎片化环境下性能急剧下降的现象,从而指出现有模型缺乏细粒度跨模态推理能力以弥合视觉断点的瓶颈。

链接: https://arxiv.org/abs/2604.23813
作者: Zichun Guo,Yuling Shi,Wenhao Zeng,Chao Hu,Haotian Lin,Terry Yue Zhuo,Jiawei Chen,Xiaodong Gu,Wenping Ma
机构: Xidian University (西安电子科技大学); Shanghai Jiao Tong University (上海交通大学); Alibaba Qwen (阿里巴巴通义千问); Old Dominion University (老多明尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: ACL 2026 Findings. Code available at this https URL

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable performance in Visually Rich Document Understanding (VRDU) tasks, but their capabilities are mainly evaluated on pristine, well-structured document images. We consider content restoration from shredded fragments, a challenging VRDU setting that requires integrating visual pattern recognition with semantic reasoning under significant content discontinuities. To facilitate systematic evaluation of complex VRDU tasks, we introduce ShredBench, a benchmark supported by an automated generation pipeline that renders fragmented documents directly from Markdown. The proposed pipeline ensures evaluation validity by allowing the flexible integration of latest or unseen textual sources to prevent training data contamination. ShredBench assesses four scenarios (English, Chinese, Code, Table) with three fragmentation granularities (8, 12, 16 pieces). Empirical evaluations on state-of-the-art MLLMs reveal a significant performance gap: The method is effective on intact documents; however, once the document is shredded, restoration becomes a significant challenge, with NED dropping sharply as fragmentation increases. Our findings highlight that current MLLMs lack the fine-grained cross-modal reasoning required to bridge visual discontinuities, identifying a critical gap in robust VRDU research.

[NLP-71] LegalDrill: Diagnosis-Driven Synthesis for Legal Reasoning in Small Language Models ACL2026

【速读】: 该论文旨在解决小语言模型(Small Language Models, SLMs)在高风险法律推理任务中因容量有限而导致的推理能力不足问题,以及高质量、细粒度推理轨迹数据难以获取的训练瓶颈。其解决方案的关键在于提出LegalDrill框架,通过细粒度提示(fine-grained prompting)从能力强的教师模型中提取并迭代优化推理轨迹,并引入自省式验证机制(self-reflective verification)以自适应筛选最优训练数据,从而实现对SLM学生的监督微调与直接偏好优化,显著提升其法律推理能力,且无需依赖稀缺的人工专家标注。

链接: https://arxiv.org/abs/2604.23809
作者: Tianchun Li,Haochen Liu,Vishwa Pardeshi,Xingchen Wang,Tianci Liu,Huijun Zhao,Wei Fan,Jing Gao
机构: Purdue University (普渡大学); Fidelity Investments (富达投资)
类目: Computation and Language (cs.CL)
备注: ACL 2026 Industry Track

点击查看摘要

Abstract:Small language models (SLMs) are promising for real-world deployment due to their efficiency and low operational cost. However, their limited capacity struggles with high-stakes legal reasoning tasks that require coherent statute interpretation and logically consistent deduction. Furthermore, training SLMs for such tasks demands high-quality, concise reasoning trajectories, which are prohibitively expensive to manually collect and difficult to curate via standard rejection sampling, lacking granularity beyond final verdicts. To address these challenges, we propose LegalDrill, a diagnosis-driven synthesis framework that extracts and iteratively refines reasoning trajectories from a capable teacher via fine-grained prompting, then a self-reflective verification is employed to adaptively select the most effective data for the SLM student. The resulting data empower SLM training through supervised fine-tuning and direct preference optimization. Extensive experiments on several legal benchmarks demonstrate that LegalDrill significantly bolsters the legal reasoning capabilities of representative SLMs while bypassing the need for scarce expert annotations, paving a scalable path toward practical legal reasoning systems.

[NLP-72] SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

【速读】: 该论文旨在解决近期大语言模型(Large Language Model, LLM)推理优化中广泛采用的混合策略(mixed-policy optimization)方法存在的基准错误问题,这些问题导致其性能评估失真。研究发现,多个主流框架(如TRL、OpenRLHF和Llama-Factory)在训练过程中受到两个独立缺陷的影响:一是DeepSpeed中的CPU-offloaded优化器bug,在梯度累积时静默丢弃中间微批次(micro-batch),显著抑制监督微调(Supervised Fine-Tuning, SFT)性能;二是OpenRLHF中的损失聚合bug,对每个小批次损失进行错误加权。这两个bug共同导致混合策略方法看似优于标准SFT-then-RL流程,实则源于基线偏差。解决方案的关键在于修正上述两个bug后重新评估,结果表明标准SFT-then-RL流程在数学基准测试上超越所有被评估的混合策略方法,且仅用50步强化学习(Reinforcement Learning, RL)的简化版本也表现更优,同时消耗更少浮点运算量(FLOPs)。

链接: https://arxiv.org/abs/2604.23747
作者: Alexis Limozin,Eduard Durech,Torsten Hoefler,Imanol Schlag,Valentina Pyatkin
机构: ETH AI Center, ETH Zürich (ETH人工智能中心); EPFL (洛桑联邦理工学院); Allen Institute for AI (艾伦人工智能研究所)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent mixed-policy optimization methods for LLM reasoning that interleave or blend supervised and reinforcement learning signals report improvements over the standard SFT-then-RL pipeline. We show that numerous recently published research papers rely on a faulty baseline caused by two distinct bugs: a CPU-offloaded optimizer bug in DeepSpeed that silently drops intermediate micro-batches during gradient accumulation (affecting multiple downstream frameworks including TRL, OpenRLHF and Llama-Factory), and a loss aggregation bug in OpenRLHF that incorrectly weights per-mini-batch losses. Together they suppress SFT performance, with the optimizer bug accounting for most of the gap and the loss aggregation bug contributing a smaller additional effect. Once corrected, the standard SFT-then-RL pipeline surpasses every published mixed-policy method we evaluate by +3.8 points on math benchmarks with Qwen2.5-Math-7B and by +22.2 points with Llama-3.1-8B. Even a truncated variant with just 50 RL steps outperforms mixed-policy methods on math benchmarks while using fewer FLOPs.

[NLP-73] Multimodal QUD: Inquisitive Questions from Scientific Figures

【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在理解科学论文中图文交互信息时的局限性问题,即现有基准测试仅关注从图像中提取低级信息的简单问题,未能反映人类读者在阅读科研论文时通过图与文协同推理所生成的深层次、语境相关的疑问。其解决方案的关键在于提出一种扩展自文本-only 的“讨论中的问题”(Questions Under Discussion, QUD)理论至多模态场景的框架,并构建了MQUD数据集——该数据集包含由原作者标注的、基于图及其所在论文上下文生成的具有高阶多模态推理需求的探究性问题。通过在MQUD上微调VLM,模型能够从生成通用低级视觉问题转向具备内容特定性与视觉锚定能力的高级多模态QUD生成,从而更贴近人类在科研文献阅读中的认知过程。

链接: https://arxiv.org/abs/2604.23733
作者: Yating Wu,William Rudman,Venkata S Govindarajan,Alexandros G. Dimakis,Junyi Jessy Li
机构: The University of Texas at Austin (得克萨斯大学奥斯汀分校); Ithaca College (伊塔卡学院); UC Berkeley (加州大学伯克利分校); BespokeLabs.ai (BespokeLabs.ai)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Asking inquisitive questions while reading, and looking for their answers, is an important part in human discourse comprehension, curiosity, and creative ideation, and prior work has investigated this in text-only scenarios. However, in scientific or research papers, many of the critical takeaways are conveyed through both figures and the text that analyzes them. While scientific visualizations have been used to evaluate Vision-Language Models (VLMs) capabilities, current benchmarks are limited to questions that focus simply on extracting information from them. Such questions only require lower-level reasoning, do not take into account the context in which a figure appears, and do not reflect the communicative goals the authors wish to achieve. We generate inquisitive questions that reach the depth of questions humans generate when engaging with scientific papers, conditioned on both the figure and the paper’s context, and require reasoning across both modalities. To do so, we extend the linguistic theory of Questions Under Discussion (QUD) from being text-only to multimodal, where implicit questions are raised and resolved as discourse progresses. We present MQUD, a dataset of research papers in which such questions are made explicit and annotated by the original authors. We show that fine-tuning a VLM on MQUD shifts the model from generating generic low-level visual questions to content-specific grounding that requires a high-level of multimodal reasoning, yielding higher-quality, more visually grounded multimodal QUD generation.

[NLP-74] AIPsy-Affect: A Keyword-Free Clinical Stimulus Battery for Mechanistic Interpretability of Emotion in Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中情感机制可解释性研究中存在的关键混淆问题:现有方法(如线性探测、激活修补、稀疏自编码器特征分析、因果消融和引导向量提取)依赖于包含情感关键词的刺激文本,导致无法区分模型内部表征是基于真实情感状态还是仅仅识别了情绪词汇本身。这种混淆严重影响了对情感电路、特征及干预措施的下游推断准确性。解决方案的关键在于提出并发布AIPsy-Affect——一个480项的临床刺激电池,其中包含192个无关键词的情境故事(每种Plutchik八种基本情绪各192个),以及与之匹配的中性对照组(控制角色、场景、长度和表面结构,仅情感被手术式移除)。该匹配对设计提供了强方法学保障:任何能区分临床项与其匹配中性项的内部表征,必然是基于情境引发的情感而非关键词存在。通过三种自然语言处理(NLP)防御测试(词袋情感分析、情绪类别词典和上下文Transformer分类器)验证,词袋方法仅感知情境词汇,而上下文分类器虽能检测情感(p < 10⁻¹⁵),却无法识别具体类别(top-1准确率5.2% vs. 关键词丰富对照组82.5%),从而确认了刺激的纯净性。

链接: https://arxiv.org/abs/2604.23719
作者: Michael Keeman
机构: Keido Labs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Dataset paper. 4 pages + appendix, 2 figures. Dataset available at this https URL . MIT license

点击查看摘要

Abstract:Mechanistic interpretability research on emotion in large language models – linear probing, activation patching, sparse autoencoder (SAE) feature analysis, causal ablation, steering vector extraction – depends on stimuli that contain the words for the emotions they test. When a probe fires on “I am furious”, it is unclear whether the model has detected anger or detected the word “furious”. The two readings have very different consequences for every downstream claim about emotion circuits, features, and interventions. We release AIPsy-Affect, a 480-item clinical stimulus battery that removes the confound at the stimulus level: 192 keyword-free vignettes evoking each of Plutchik’s eight primary emotions through narrative situation alone, 192 matched neutral controls that share characters, setting, length, and surface structure with the affect surgically removed, plus moderate-intensity and discriminant-validity splits. The matched-pair structure supports linear probing, activation patching, SAE feature analysis, causal ablation, and steering vector extraction under a strong methodological guarantee: any internal representation that distinguishes a clinical item from its matched neutral cannot be doing so on the basis of emotion-keyword presence. A three-method NLP defense battery – bag-of-words sentiment, an emotion-category lexicon, and a contextual transformer classifier – confirms the property: bag-of-words methods see only situational vocabulary, and a contextual classifier detects affect (p 10^-15) but cannot identify the category (5.2% top-1 vs. 82.5% on a keyword-rich control). AIPsy-Affect extends our earlier 96-item battery (arXiv:2603.22295) by a factor of four and is released openly under MIT license.

[NLP-75] HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

【速读】: 该论文旨在解决大音频语言模型(Large Audio Language Models, LALMs)在处理长序列时推理成本高昂的问题,核心挑战在于现有token压缩方法通常假设所有注意力头(attention heads)对各类音频任务贡献均等,从而导致压缩效率低下。解决方案的关键在于提出一种无需训练的、基于注意力头重要性感知的剪枝方法HeadRouter,其创新点在于揭示了不同注意力头在语义和声学任务中具有异质性响应特性,并通过识别稀疏激活的头部来动态保留关键token,从而显著提升压缩性能,在AudioMarathon和MMAU-Pro基准上实现了优于基线模型的压缩效果。

链接: https://arxiv.org/abs/2604.23717
作者: Peize He,Yaodi Luo,Xiaoqian Liu,Xuyang Liu,Jiahang Deng,Yaosong Du,Bangyu Li,Xiyan Gui,Yuxuan Chen,Linfeng Zhang
机构: EPIC Lab, Shanghai Jiao Tong University (EPIC 实验室,上海交通大学); DAIL Tech; Northeastern University (东北大学); Sichuan University (四川大学); Huazhong University of Science and Technology (华中科技大学)
类目: ound (cs.SD); Computation and Language (cs.CL)
备注: Homepage: this https URL

点击查看摘要

Abstract:Recent large audio language models (LALMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet incur high inference costs. Token compression is an effective method that directly reduces redundant tokens in the sequence. Existing compression methods usually assume that all attention heads in LALMs contribute equally to various audio tasks and calculate token importance by averaging scores across all heads. However, our analysis demonstrates that attention heads exhibit distinct behaviors across diverse audio domains. We further reveal that only a sparse subset of attention heads actively responds to audio, with completely different performance when handling semantic and acoustic tasks. In light of this observation, we propose HeadRouter, a head-importance-aware token pruning method that perceives the varying importance of attention heads in different audio tasks to maximize the retention of crucial tokens. HeadRouter is training-free and can be applied to various LALMs. Extensive experiments on the AudioMarathon and MMAU-Pro benchmarks demonstrate that HeadRouter achieves state-of-the-art compression performance, exceeding the baseline model even when retaining 70% of the audio tokens and achieving 101.8% and 103.0% of the vanilla average on Qwen2.5-Omni-3B and Qwen2.5-Omni-7B, respectively.

[NLP-76] Agri-CPJ: A Training-Free Explainable Framework for Agricultural Pest Diagnosis Using Caption-Prompt-Judge and LLM -as-a-Judge ICASSP2026

【速读】: 该论文旨在解决作物病害诊断中两个关键问题:一是现有模型在基准测试上表现优异但常出现物种名称幻觉(hallucination),二是即使预测正确,其推理过程对实践者不可解释。解决方案的核心是提出Agri-CPJ(Caption-Prompt-Judge)框架,其关键创新在于引入结构化形态描述生成与多维质量过滤的caption refinement阶段——通过大视觉语言模型(Vision-Language Model, VLM)首先生成结构化的形态学描述(caption),再经多维度质量门控迭代优化,随后基于互补视角生成两个候选回答,并由领域知识增强的大语言模型(Large Language Model, LLM)判别器选择最优答案。实验证明,caption refinement是影响性能最大的单一模块,跳过该步骤会显著降低下游任务准确率;在CDDMBench上,结合GPT-5-Nano与GPT-5-mini生成的caption使疾病分类准确率提升22.7个百分点、问答得分提升19.5点;同时,结构化caption与判别理由共同构成可读的审计轨迹(audit trail),便于用户定位错误来源,实现可解释诊断。

链接: https://arxiv.org/abs/2604.23701
作者: Wentao Zhang,Qi Zhang,Mingkun Xu,Mu You,Henghua Shen,Zhongzhi He,Keyan Jin,Derek F. Wong,Tao Fang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: This work is an expanded version of our prior paper published in the IEEE ICASSP 2026 conference arXiv:2512.24947 , from 4 to 20+ pages, presenting a well-structured and principled framework, extensive experiments, and deeper insights. Tao Fang is the corresponding author

点击查看摘要

Abstract:Crop disease diagnosis from field photographs faces two recurring problems: models that score well on benchmarks frequently hallucinate species names, and when predictions are correct, the reasoning behind them is typically inaccessible to the practitioner. This paper describes Agri-CPJ (Caption-Prompt-Judge), a training-free few-shot framework in which a large vision-language model first generates a structured morphological caption, iteratively refined through multi-dimensional quality gating, before any diagnostic question is answered. Two candidate responses are then generated from complementary viewpoints, and an LLM judge selects the stronger one based on domain-specific criteria. Caption refinement is the component with the largest individual impact: ablations confirm that skipping it consistently degrades downstream accuracy across both models tested. On CDDMBench, pairing GPT-5-Nano with GPT-5-mini-generated captions yields \textbf+22.7 pp in disease classification and \textbf+19.5 points in QA score over no-caption baselines. Evaluated without modification on AgMMU-MCQs, GPT-5-Nano reached 77.84% and Qwen-VL-Chat reached 64.54%, placing them at or above most open-source models of comparable scale despite the format shift from open-ended to multiple-choice. The structured caption and judge rationale together constitute a readable audit trail: a practitioner who disagrees with a diagnosis can identify the specific caption observation that was incorrect. Code and data are publicly available this https URL

[NLP-77] Benchmarking Testing in Automated Theorem Proving ACL2026

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在形式化定理证明中难以准确评估语义正确性的问题。现有方法依赖于词法重叠或人工检查等间接指标,存在不准确或成本高昂的局限。其解决方案的关键在于提出一种名为 T 的框架,该框架通过“集成测试”逻辑来评估生成定理的语义正确性:只有当所有依赖该定理的后续定理均能成功编译时,才认为该定理是正确的。这一机制模拟了软件工程中从词法比较向测试驱动评估的转变,从而提供了一种自动化、可扩展且更贴近实际应用的评估方式。

链接: https://arxiv.org/abs/2604.23698
作者: Jongyoon Kim,Hojae Han,Seung-won Hwang
机构: Seoul National University(首尔国立大学); Electronics and Telecommunications Research Institute(电子与电信研究院)
类目: Computation and Language (cs.CL)
备注: ACL 2026 Industry

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have shown promise in formal theorem proving, yet evaluating semantic correctness remains challenging. Existing evaluations rely on indirect proxies such as lexical overlap with human-annotated proof, or expensive manual inspection. Inspired by the shift from lexical comparison to test-based evaluation in code generation, we propose T , a framework that evaluates the semantic correctness of formal theorems: a generated theorem is considered correct only if all dependent successor theorems compile successfully, analogous to integration testing. We construct a benchmark from 5 real-world Lean 4 repositories, comprising 2,206 problems paired with 41 successor theorems on average, automatically extracted without human effort. Experiments demonstrate that while state-of-the-art models achieve high compilation success, they perform significantly worse under our semantic metric. The best model, Claude-Sonnet-4.5, achieves only 38.9% Testing Accuracy on the full set, given both natural language proof and successor theorems as context, revealing a critical gap in current theorem generation capabilities.

[NLP-78] Rank Head-Channel Non-Identifiability and Symmetry Breaking: A Precise Analysis of Representational Collapse in Transformers

【速读】: 该论文旨在解决Transformer架构中四种类型的“坍缩”现象(rank collapse、width collapse、head-channel non-identifiability 和 entropy collapse),这些现象会限制模型表达能力与训练稳定性。其核心贡献在于提出一个统一的对称性破缺框架来理解这些坍缩机制,并揭示了现有设计中的关键误解:层归一化(Layer Normalization, LN)并非“无作用”,而是精确保持token表示的仿射秩(affine rank);残差连接在测度论意义上可普遍抑制rank collapse,无需依赖MLP;而MLP的核心功能并非防止rank collapse,而是生成原始token嵌入线性空间之外的新特征方向,这是纯注意力层无法实现的。此外,论文识别出一种新的非identifiability问题——头通道不可区分性(head-channel non-identifiability),并提出位置门控输出投影(Position-Gated Output Projection, PG-OP)作为轻量级解决方案(参数开销<1.6%)。整体而言,该研究重构了对Transformer内部机制的理解,强调结构组件的功能分工与协同作用。

链接: https://arxiv.org/abs/2604.23681
作者: Giansalvo Cirrincione
机构: Laboratoire LTI, Université de Picardie Jules Verne (皮卡第朱尔斯·凡尔纳大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: 36 pages, 8 figures, 1 table. Submitted to Artificial Intelligence (Elsevier)

点击查看摘要

Abstract:A widely cited result by Dong et al. (2021) showed that Transformers built from self-attention alone, without skip connections or feed-forward layers, suffer from rapid rank collapse: all token representations converge to a single direction. The proposed remedy was the MLP. We show that this picture, while correct in the regime studied by Dong, is incomplete in ways that matter for architectural understanding. Three results are established. First, layer normalisation is precisely affine-rank-neutral: it preserves the affine rank of the token representation set exactly. The widespread claim that LN “plays no role” is imprecise; the correct statement is sharper. Second, residual connections generically obstruct rank collapse in real Transformers such as BERT-base, in a measure-theoretic sense, without contribution from the MLP. The MLP’s irreplaceable function is different: generating feature directions outside the linear span of the original token embeddings, which no stack of attention layers can produce. Third, a phenomenon distinct from rank collapse is identified: head-channel non-identifiability. After multi-head attention sums per-head outputs through the output projection, individual contributions cannot be canonically attributed to a specific head; n(H-1)d_k degrees of freedom per layer remain ambiguous when recovering a single head from the mixed signal. The MLP cannot remedy this because it acts on the post-summation signal. A constructive partial remedy is proposed: a position-gated output projection (PG-OP) at parameter overhead below 1.6% of the standard output projection. The four collapse phenomena identified in the literature – rank collapse in depth, in width, head-channel non-identifiability, and entropy collapse – are unified under a symmetry-breaking framework, each corresponding to a distinct symmetry of the Transformer’s forward pass. Comments: 36 pages, 8 figures, 1 table. Submitted to Artificial Intelligence (Elsevier) Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML) Cite as: arXiv:2604.23681 [cs.LG] (or arXiv:2604.23681v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.23681 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-79] Neural Grammatical Error Correction for Romanian

【速读】: 该论文旨在解决非英语语言中语法错误纠正(Grammatical Error Correction, GEC)资源匮乏的问题,特别是针对罗曼尼亚语这一低资源语言场景。其解决方案的关键在于构建了一个包含10,000对句子的GEC语料库,并基于ERRANT工具开发了适用于罗曼尼亚语的评估指标,同时提出一种通过词性标注(POS tagger)生成人工训练样本的预训练策略,从而提升神经模型在低资源条件下的性能表现。实验表明,先在人工合成数据上预训练一个大型Transformer模型,再在真实语料上微调,可显著提升F0.5分数(从44.38提升至53.76),验证了该方法的有效性和可迁移性。

链接: https://arxiv.org/abs/2604.23627
作者: Teodor-Mihai Cotet,Stefan Ruseti,Mihai Dascalu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Resources for Grammatical Error Correction (GEC) in non-English languages are scarce, while available spellcheckers in these languages are mostly limited to simple corrections and rules. In this paper we introduce a first GEC corpus for Romanian consisting of 10k pairs of sentences. In addition, the German version of ERRANT (ERRor ANnotation Toolkit) scorer was adapted for Romanian to analyze this corpus and extract edits needed for evaluation. Multiple neural models were experimented, together with pretraining strategies, which proved effective for GEC in low-resource settings. Our baseline consists of a small Transformer model trained only on the GEC dataset (F0.5 of 44.38), whereas the best performing model is produced by pretraining a larger Transformer model on artificially generated data, followed by finetuning on the actual corpus (F0.5 of 53.76). The proposed method for generating additional training examples is easily extensible and can be applied to any language, as it requires only a POS tagger

[NLP-80] GraphPlanner: Graph Memory-Augmented Agent ic Routing for Multi-Agent LLM s

【速读】: 该论文旨在解决当前大语言模型(Large Language Model, LLM)路由机制在复杂多智能体(multi-agent)场景下能力不足的问题,特别是如何在任务规划、异构智能体间的多轮协作以及记忆利用等方面实现高效且精准的路由决策。其核心解决方案是提出 GraphPlanner,一个基于异构图记忆增强的代理路由器,通过将工作流生成建模为马尔可夫决策过程(Markov Decision Process, MDP),在每一步同时选择LLM主干和代理角色(如规划者、执行者、总结者),并借助名为GARNet的异构图结构捕获查询、代理与响应之间的交互记忆,从而融合历史记忆与工作流记忆以构建更丰富的状态表示。整个系统通过强化学习进行端到端优化,在提升任务性能的同时显著降低计算成本,实现了对未见任务和模型的零样本泛化能力,并支持归纳推理(inductive inference)与直推推理(transductive inference)。

链接: https://arxiv.org/abs/2604.23626
作者: Tao Feng,Haozhen Zhang,Zijie Lei,Peixuan Han,Jiaxuan You
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLM routing has achieved promising results in integrating the strengths of diverse models while balancing efficiency and performance. However, to support more realistic and challenging applications, routing must extend into agentic LLM settings, where task planning, multi-round cooperation among heterogeneous agents, and memory utilization are indispensable. To address this gap, we propose GraphPlanner, a heterogeneous graph memory-augmented agentic router for multi-agent LLMs that generates routing workflows for each query and supports both inductive and transductive inference. GraphPlanner formulates workflow generation as a Markov Decision Process (MDP), where at each step it selects both the LLM backbone and the agent role, including Planner, Executor, and Summarizer. By leveraging a heterogeneous graph, denoted as GARNet, to capture interaction memories among queries, agents, and responses, GraphPlanner integrates historical memory and workflow memory into richer state representations. The entire pipeline is optimized with reinforcement learning, jointly improving task-specific performance and computational efficiency. We evaluate GraphPlanner across 14 diverse LLM tasks and demonstrate that: (1) GraphPlanner outperforms strong single-round and multi-round routers, improving accuracy by up to 9.3% while reducing GPU cost from 186.26 GiB to 1.04 GiB; (2) GraphPlanner generalizes robustly to unseen tasks and LLMs, exhibiting strong zero-shot capabilities; and (3) GraphPlanner effectively leverages historical memories, supporting both inductive and transductive inference for more adaptive routing. Our code for GraphPlanner is released at this https URL.

[NLP-81] Applications of the Transformer Architecture in AI-Assisted English Reading Comprehension

【速读】: 该论文旨在解决生成式 AI(Generative AI)在英语阅读理解教学中面临的可解释性不足、算法偏见显著以及学习环境中性能不可靠等问题。其解决方案的关键在于构建一个统一的技术流程,包括对抗性偏见校正方法、基于梯度的token级特征归因分析以及多头注意力热力图可视化技术,从而在保证高预测准确率的同时提升模型的公平性和可解释性,增强教师对AI评分系统的信任与操作性。

链接: https://arxiv.org/abs/2604.23615
作者: Ping Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, Conference paper for International Conference on Big Data Applications in Education and Engineering {ICBDAEE 2026)

点击查看摘要

Abstract:This paper studies interpretable and fair artificial intelligence architectures for understanding English reading. Introduced transformer-based models, integrating advanced attention mechanisms and gradient-based feature attribution. The model’s lack of interpretability, reduction of algorithmic bias, and unreliable performance in learning environments are the current issues faced in natural language teaching. A unified technical pipeline has been constructed, including adversarial bias correction methods, token-level attribution analysis, and multi-head attention heatmap visualization. Experimental validation was conducted using a large-scale labeled English reading comprehension dataset, and the data partitioning scheme and parameter optimization procedures have been determined. The method significantly outperforms the state-of-the-art models for this task in terms of accuracy and macro-average F1 score; in some aspects, it even surpasses or closely matches the results of human evaluations. In multi-week user experiments, the explainable transformer improved teachers’ trust and operability in feedback-based assessments within the scoring system. The proposed method aims to ensure high prediction accuracy and fairness for different learners. This indicates that it is a real-world educational application based on artificial intelligence with a focus on interpretation. Improve the user experience in AI-assisted reading comprehension systems, counteract biases, and enhance the details explained by transformers.

[NLP-82] Personality Shapes Gender Bias in Persona-Conditioned LLM Narratives Across English and Hindi: An Empirical Investigation

【速读】: 该论文旨在解决生成式 AI(Generative AI)在 persona-conditioned 应用中因人格特征与性别偏见交互而引发的代表性不公问题,特别是其如何加剧或缓解性别刻板印象。解决方案的关键在于通过受控实验设计,在英语和印地语环境中系统性地操纵职业角色、人格特质(基于 HEXACO 和 Dark Triad 框架)以及性别变量,对六种先进大语言模型(LLMs)生成的 23,400 条故事进行量化分析,发现 Dark Triad 人格特质显著增强性别刻板印象表达,而 HEXACO 中的社会宜人性特质则起到抑制作用,且这种效应在不同模型和语言间存在差异,从而揭示了性别偏见并非静态,而是高度依赖于上下文与人格条件的动态现象。

链接: https://arxiv.org/abs/2604.23600
作者: Tanay Kumar,Shreya Gautam,Aman Chadha,Vinija Jain,Francesco Pierri
机构: Politecnico di Milano; Apple; Google
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed in persona-driven applications such as education, customer service, and social platforms, where models are prompted to adopt specific personas when interacting with users. While persona conditioning can improve user experience and engagement, it also raises concerns about how personality cues may interact with gender biases and stereotypes. In this work, we present a controlled study of persona-conditioned story generation in English and Hindi, where each story portrays a working professional in India producing context-specific artifacts (e.g., lesson plans, reports, letters) under systematically varied persona gender, occupational role, and personality traits from the HEXACO and Dark Triad frameworks. Across 23,400 generated stories from six state-of-the-art LLMs, we find that personality traits are significantly associated with both the magnitude and direction of gender bias. In particular, Dark Triad personality traits are consistently associated with higher gender-stereotypical representations compared to socially desirable HEXACO traits, though these associations vary across models and languages. Our findings demonstrate that gender bias in LLMs is not static but context-dependent. This suggests that persona-conditioned systems used in real-world applications may introduce uneven representational harms, reinforcing gender stereotypes in generated educational, professional, or social content.

[NLP-83] XITE: Cross-lingual Interpolation for Transfer using Embeddings

【速读】: 该论文旨在解决多语言语言模型中跨语言迁移(cross-lingual transfer)的挑战,即如何有效利用高资源语言(如英语)的知识来提升低资源目标语言上的任务性能。其解决方案的关键在于提出一种基于嵌入空间的数据增强技术——XITE(Cross-lingual Embedding-based Text Interpolation),通过在目标语言的未标注文本与英文对应语句之间进行嵌入相似性匹配并采用其标签,再结合线性插值生成合成训练数据;进一步地,在插值前使用线性判别分析(LDA)将目标语言投影至语言丰富的子空间,显著提升了跨语言迁移效果,尤其在情感分析和自然语言推理任务中取得了最高达81.16%的性能提升。

链接: https://arxiv.org/abs/2604.23589
作者: Barah Fazili,Preethi Jyothi
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Facilitating cross-lingual transfer in multilingual language models remains a critical challenge. Towards this goal, we propose an embedding-based data augmentation technique called XITE. We start with unlabeled text from a low-resource target language, identify an English counterpart in a task-specific training corpus using embedding-based similarities and adopt its label. Next, we perform a simple interpolation of the source and target embeddings to create synthetic data for task-specific fine-tuning. Projecting the target text into a language-rich subspace using linear discriminant analysis (LDA), prior to interpolation, further boosts performance. Our cross-lingual embedding-based augmentation technique XITE yields significant improvements of up to 35.91% for sentiment analysis and up to 81.16% for natural language inference, using XLM-R, for a diverse set of target languages including Korean, Arabic, Urdu and Hindi. Apart from boosting cross-lingual transfer, adaptation using XITE also safeguards against forgetting and maintains task performance on the high-resource language.

[NLP-84] alker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

【速读】: 该论文旨在解决联合音视频生成模型中因全层次耦合建模导致的低效问题,尤其是在人脸说话合成任务中,音频与面部运动虽在高层语义上相关,但其低层实现(声学信号与视觉纹理)遵循不同的渲染过程,现有方法通过广泛注意力机制将两者完全纠缠,造成不必要的冗余和效率损失。解决方案的关键在于提出Talker-T2AV框架,采用自回归扩散架构:高层跨模态建模在共享骨干网络中完成,利用统一的patch级token空间由共享语言模型联合推理音频与视频;低层细节则通过两个轻量级扩散Transformer头分别解码为帧级音频和视频潜在表示,实现高阶语义一致性与低阶模态特异性解耦,从而提升唇音同步准确率、视频与音频质量,并增强跨模态一致性。

链接: https://arxiv.org/abs/2604.23586
作者: Zhen Ye,Xu Tan,Aoxiong Yin,Hongzhan Lin,Guangyan Zhang,Peiwen Sun,Yiming Li,Chi-Min Chan,Wei Ye,Shikun Zhang,Wei Xue
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Joint audio-video generation models have shown that unified generation yields stronger cross-modal coherence than cascaded approaches. However, existing models couple modalities throughout denoising via pervasive attention, treating high-level semantics and low-level details in a fully entangled manner. This is suboptimal for talking head synthesis: while audio and facial motion are semantically correlated, their low-level realizations (acoustic signals and visual textures) follow distinct rendering processes. Enforcing joint modeling across all levels causes unnecessary entanglement and reduces efficiency. We propose Talker-T2AV, an autoregressive diffusion framework where high-level cross-modal modeling occurs in a shared backbone, while low-level refinement uses modality-specific decoders. A shared autoregressive language model jointly reasons over audio and video in a unified patch-level token space. Two lightweight diffusion transformer heads decode the hidden states into frame-level audio and video latents. Experiments on talking portrait benchmarks show Talker-T2AV outperforms dual-branch baselines in lip-sync accuracy, video quality, and audio quality, achieving stronger cross-modal consistency than cascaded pipelines.

[NLP-85] Agent Eval: DAG-Structured Step-Level Evaluation for Agent ic Workflows with Error Propagation Tracking ACL2026

【速读】: 该论文旨在解决当前生成式 AI(Generative AI)代理系统在实际部署中因缺乏精细化评估机制而导致的中间步骤错误难以被发现的问题。现有评估方法如端到端结果检查和临时性追踪审查,往往掩盖了主导真实世界错误预算的中间失败环节。其解决方案的关键在于提出 AgentEval 框架,该框架将代理执行过程形式化为带有类型化质量指标的评估有向无环图(evaluation directed acyclic graph, DAG),通过校准后的大型语言模型(LLM)判官(GPT-4o)对每个节点进行评分,并基于三级、21子类别的故障分类体系进行归类,同时利用依赖关系链实现自动根因定位。实验表明,DAG结构建模本身即可提升失败检测召回率22个百分点、根因准确率34个百分点,且在多个生产工作流中显著优于传统方法。

链接: https://arxiv.org/abs/2604.23581
作者: Dongxin Guo,Jikun Wu,Siu Ming Yiu
机构: The University of Hong Kong(香港大学); Stellaris AI Limited(Stellaris AI有限公司)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: Accepted at ACL 2026 Industry Track. 14 pages, 3 figures, 21 tables

点击查看摘要

Abstract:Agentic systems that chain reasoning, tool use, and synthesis into multi-step workflows are entering production, yet prevailing evaluation practices like end-to-end outcome checks and ad-hoc trace inspection systematically mask the intermediate failures that dominate real-world error budgets. We present AgentEval, a framework that formalizes agent executions as evaluation directed acyclic graphs (DAGs), where each node carries typed quality metrics assessed by a calibrated LLM judge (GPT-4o), classified through a hierarchical failure taxonomy (3 levels, 21 subcategories), and linked to upstream dependencies for automated root cause attribution. An ablation study isolates the impact of DAG-based dependency modeling: it alone contributes +22 percentage points to failure detection recall and +34 pp to root cause accuracy over flat step-level evaluation with identical judges and rubrics. Across three production workflows (450 test cases, two agent model families, predominantly sequential architectures with a 12% non-DAG trace rate), AgentEval achieves 2.17x higher failure detection recall than end-to-end evaluation (0.89 vs. 0.41), Cohen’s kappa = 0.84 agreement with human experts, and 72% root cause accuracy against an 81% human ceiling. Cross-system evaluation on tau-bench and SWE-bench traces confirms transferability (failure detection recall = 0.78) without taxonomy or rubric modification. A 4-month pilot with 18 engineers detected 23 pre-release regressions through CI/CD-integrated regression testing, reducing median root-cause identification time from 4.2 hours to 22 minutes and driving measurable failure rate reductions in two workflows. Comments: Accepted at ACL 2026 Industry Track. 14 pages, 3 figures, 21 tables Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL) ACMclasses: I.2.7; D.2.5; D.2.4 Cite as: arXiv:2604.23581 [cs.SE] (or arXiv:2604.23581v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2604.23581 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-86] LLM s Reading the Rhythms of Daily Life: Aligned Understanding for Behavior Prediction and Generation

【速读】: 该论文旨在解决人类行为建模中的三大挑战:长尾行为(long-tail behaviors)的建模困难、模型可解释性不足,以及在统一框架下支持多任务学习的问题。现有基于深度学习的行为预测方法难以有效捕捉稀有或低频行为模式,且缺乏对决策逻辑的透明解释能力;同时,单一模型难以兼顾多种行为预测与生成任务。解决方案的关键在于提出一种名为行为理解对齐(Behavior Understanding Alignment, BUA)的新框架,其核心机制是通过预训练行为模型生成的序列嵌入(sequence embeddings)作为对齐锚点,引导大型语言模型(LLM)经历一个三阶段结构化课程学习(curriculum learning)过程,并结合多轮对话设置实现预测与生成能力的协同优化。该设计有效弥合了行为数据与自然语言之间的结构性差异,显著提升了复杂人类行为建模的效果与灵活性。

链接: https://arxiv.org/abs/2604.23578
作者: Fanjin Meng,Jingtao Ding,Nian Li,Yizhou Sun,Yong Li
机构: Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human daily behavior unfolds as complex sequences shaped by intentions, preferences, and context. Effectively modeling these behaviors is crucial for intelligent systems such as personal assistants and recommendation engines. While recent advances in deep learning and behavior pre-training have improved behavior prediction, key challenges remain–particularly in handling long-tail behaviors, enhancing interpretability, and supporting multiple tasks within a unified framework. Large language models (LLMs) offer a promising direction due to their semantic richness, strong interpretability, and generative capabilities. However, the structural and modal differences between behavioral data and natural language limit the direct applicability of LLMs. To address this gap, we propose Behavior Understanding Alignment (BUA), a novel framework that integrates LLMs into human behavior modeling through a structured curriculum learning process. BUA employs sequence embeddings from pretrained behavior models as alignment anchors and guides the LLM through a three-stage curriculum, while a multi-round dialogue setting introduces prediction and generation capabilities. Experiments on two real-world datasets demonstrate that BUA significantly outperforms existing methods in both tasks, highlighting its effectiveness and flexibility in applying LLMs to complex human behavior modeling. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.23578 [cs.CL] (or arXiv:2604.23578v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2604.23578 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-87] RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization ACL2026

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在服务多样化自然语言处理(Natural Language Processing, NLP)工作负载时产生的高额推理成本问题,尤其是在企业场景中,大量常规任务其实可由更小、更经济的模型完成,但当前系统往往统一使用高性能模型导致资源浪费。解决方案的关键在于提出一个闭环式框架 RouteNLP,其核心创新包括:(1) 基于难度感知的路由器(difficulty-aware router),利用共享的任务条件表示和偏好数据与质量信号进行训练;(2) 置信度校准的级联机制(confidence-calibrated cascading),采用 conformal prediction 实现无分布假设下的阈值初始化;(3) 蒸馏-路由协同优化循环(distillation-routing co-optimization loop),通过聚类升级失败案例、对低成本模型实施针对性知识蒸馏,并自动重训练路由器,从而显著提升成本效益。该方案在真实部署中实现 58% 的推理成本降低,同时保持高响应接受率和低延迟,验证了其有效性与实用性。

链接: https://arxiv.org/abs/2604.23577
作者: Dongxin Guo,Jikun Wu,Siu Ming Yiu
机构: The University of Hong Kong (香港大学); Stellaris AI Limited (Stellaris AI 有限公司)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at ACL 2026 Industry Track. 13 pages, 2 figures, 15 tables, 1 algorithm

点击查看摘要

Abstract:Serving diverse NLP workloads with large language models is costly: at one enterprise partner, inference costs exceeded 200K/month despite over 70% of queries being routine tasks well within the capability of smaller models. We present RouteNLP, a closed-loop framework that routes queries across a tiered model portfolio to minimize cost while satisfying per-task quality constraints. The framework integrates three components: a difficulty-aware router with shared task-conditioned representations trained on preference data and quality signals; confidence-calibrated cascading that uses conformal prediction for distribution-free threshold initialization; and a distillation-routing co-optimization loop that clusters escalation failures, applies targeted knowledge distillation to cheaper models, and automatically retrains the router, yielding over twice the cost improvement of untargeted distillation. In an 8-week pilot deployment processing ~5K queries/day at an enterprise customer-service division, RouteNLP reduced inference costs by 58% while maintaining 91% response acceptance and reducing p99 latency from 1,847 ms to 387 ms. On a six-task benchmark spanning finance, customer service, and legal domains, the framework achieves 40-85% cost reduction while retaining 96-100% quality on structured tasks and 96-98% on generation tasks, with human evaluation confirming that 74.5% of routed generation outputs match or exceed frontier-model quality.

[NLP-88] he Collapse of Heterogeneity in Silicon Philosophers

【速读】: 该论文旨在解决生成式 AI(Generative AI)在哲学领域中对人类判断的模拟问题,特别是其是否能够准确再现个体哲学立场及跨问题的相关性结构。研究发现,语言模型在哲学判断上存在系统性过度相关现象,导致人工共识的虚假形成,这主要归因于模型对专家领域的隐含假设——即领域专家持有高度一致的哲学观点。解决方案的关键在于识别并量化这种“专家效应”(specialist effects),并通过对比不同微调方法(如DPO微调)和大规模数据集(PhilPapers 2020 Survey)验证结果的稳健性,从而揭示硅样本(silicon samples)作为人类判断替代品时的局限性及其对对齐(alignment)与评估工作的深远影响。

链接: https://arxiv.org/abs/2604.23575
作者: Yuanming Shi(Adobe Inc.),Andreas Haupt(Stanford University)
机构: Adobe Inc.(Adobe公司); Stanford University (斯坦福大学)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Silicon samples are increasingly used as a low-cost substitute for human panels and have been shown to reproduce aggregate human opinion with high fidelity. We show that, in the alignment-relevant domain of philosophy, silicon samples systematically collapse heterogeneity. Using data from N = 277 professional philosophers drawn from PhilPeople profiles, we evaluate seven proprietary and open-source large language models on their ability to replicate individual philosophical positions and to preserve cross-question correlation structures across philosophical domains. We find that language models substantially over-correlate philosophical judgments, producing artificial consensus across domains. This collapse is associated in part with specialist effects, whereby models implicitly assume that domain specialists hold highly similar philosophical views. We assess the robustness of these findings by studying the impact of DPO fine-tuning and by validating results against the full PhilPapers 2020 Survey ( N = 1785 ). We conclude by discussing implications for alignment, evaluation, and the use of silicon samples as substitutes for human judgment. The code of this project can be found at this https URL.

[NLP-89] Pref-CTRL: Preference Driven LLM Alignment using Representation Editing ACL2026

【速读】: 该论文旨在解决现有测试时对齐(test-time alignment)方法在处理人类偏好数据时的不足问题,特别是RE-Control方法虽通过外部价值函数引导大语言模型(LLM)生成,但未充分考虑对齐任务本质——即基于人类偏好比较候选响应的学习机制。其解决方案的关键在于提出一种新的基于偏好的训练框架Pref-CTRL,该框架采用多目标价值函数来更准确地建模偏好数据的结构,从而提升对齐效果与跨域泛化能力。

链接: https://arxiv.org/abs/2604.23543
作者: Imranul Ashrafi,Inigo Jauregi Unanue,Massimo Piccardi
机构: University of Technology Sydney, Australia
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

点击查看摘要

Abstract:Test-time alignment methods offer a promising alternative to fine-tuning by steering the outputs of large language models (LLMs) at inference time with lightweight interventions on their internal representations. Recently, a prominent and effective approach, RE-Control (Kong et al., 2024), has proposed leveraging an external value function trained over the LLM’s hidden states to guide generation via gradient-based editing. While effective, this method overlooks a key characteristic of alignment tasks, i.e. that they are typically formulated as learning from human preferences between candidate responses. To address this, in this paper we propose a novel preference-based training framework, Pref-CTRL, that uses a multi-objective value function to better reflect the structure of preference data. Our approach has outperformed RE-Control on two benchmark datasets and showed greater generalization on out-of-domain datasets. Our source code is available at this https URL.

[NLP-90] MTRouter: Cost-Aware Multi-Turn LLM Routing with History-Model Joint Embeddings ACL2026

【速读】: 该论文旨在解决多轮、长时程任务中大型语言模型(Large Language Models, LLMs)因频繁调用导致推理成本高昂的问题。针对固定预算下的成本感知型多轮模型路由问题,其核心解决方案是提出MTRouter:通过将交互历史与候选模型编码为联合的历史-模型嵌入(history-model embeddings),并利用日志轨迹训练一个结果预测器来估计每一轮的模型效用(model utility),从而动态选择最优模型以优化性能与成本之间的权衡。关键创新在于引入可学习的turn-level模型效用预测机制,并实现更少的模型切换、对瞬时错误更强的鲁棒性以及模型间的涌现式专业化分工。

链接: https://arxiv.org/abs/2604.23530
作者: Yiqun Zhang,Hao Li,Zihan Wang,Shi Feng,Xiaocui Yang,Daling Wang,Bo Zhang,Lei Bai,Shuyue Hu
机构: Northeastern University (东北大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This work has accepted by ACL 2026

点击查看摘要

Abstract:Multi-turn, long-horizon tasks are increasingly common for large language models (LLMs), but solving them typically requires many sequential model invocations, accumulating substantial inference costs. Here, we study cost-aware multi-turn LLM routing: selecting which model to invoke at each turn from a model pool, given a fixed cost budget. We propose MTRouter, which encodes the interaction history and candidate models into joint history-model embeddings, and learns an outcome estimator from logged trajectories to predict turn-level model utility. Experiments show that MTRouter improves the performance-cost trade-off: on ScienceWorld, it surpasses GPT-5 while reducing total cost by 58.7%; on Humanity’s Last Exam (HLE), it achieves competitive accuracy while reducing total cost by 43.4% relative to GPT-5, and these gains even carry over to held-out tasks. Further analyses reveal several mechanisms underlying its effectiveness: relative to prior multi-turn routers, MTRouter makes fewer model switches, is more tolerant to transient errors, and exhibits emergent specialization across models. Code: this https URL

[NLP-91] K-SENSE: A Knowledge-Guided Self-Augmented Encoder for Neuro-Semantic Evaluation of Mental Health Conditions on Social Media

【速读】: 该论文旨在解决从社交媒体文本中早期检测心理疾病(如压力和抑郁)的难题,这一问题在计算精神病学和自然语言处理领域仍具挑战性,主要源于用户生成内容中的隐喻语言、隐含情绪表达及高噪声特性。解决方案的关键在于提出K-SENSE框架,其核心创新是将外部心理学常识推理与内部表征鲁棒性联合建模:首先通过COMET模型提取五维心理状态的推断性常识知识;其次构建由双并行编码流融合得到的语义锚点,并投影至共享空间进行对齐;最后引入监督对比学习目标,在增强同类别表征一致性的同时,引导注意力机制抑制无关知识噪声。该设计实现了知识引导与自增强训练的统一,显著提升了模型在Dreaddit(压力检测)和Depression_Mixed(抑郁检测)数据集上的性能,F1分数分别提升约2.6和1.5个百分点。

链接: https://arxiv.org/abs/2604.23493
作者: Vijay Yadav
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Early detection of mental health conditions, particularly stress and depression, from social media text remains a challenging open problem in computational psychiatry and natural language processing. Automated systems must contend with figurative language, implicit emotional expression, and the high noise inherent in user-generated content. Existing approaches either leverage external commonsense knowledge to model mental states explicitly, or apply self-augmentation and contrastive training to improve generalization, but seldom do both in a principled, unified framework. We propose K-SENSE (Knowledge-guided Self-augmented Encoder for Neuro-Semantic Evaluation of Mental Health), a framework that jointly exploits external psychological reasoning and internal representation robustness. K-SENSE adopts a three-stage encoding pipeline: (1) inferential commonsense knowledge is extracted from the COMET model across five mental state dimensions; (2) a semantic anchor is constructed by combining hidden representations from two parallel encoding streams, projected into a shared space before fusion; and (3) a supervised contrastive learning objective aligns same-class representations while encouraging the attention mechanism to suppress irrelevant knowledge noise. We evaluate K-SENSE on Dreaddit (stress detection) and Depression_Mixed (depression detection), achieving mean F1-scores of 86.1 (0.6%) and 94.3 (0.8%), respectively, over five independent runs. These represent improvements of approximately 2.6 and 1.5 percentage points over the strongest prior baselines. Ablation experiments confirm the contribution of each architectural component, including the temporal knowledge integration strategy and the choice to keep the knowledge encoder frozen during fine-tuning.

[NLP-92] JudgeSense: A Benchmark for Prompt Sensitivity in LLM -as-a-Judge Systems

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)作为自动化评判者(automated judges)时,其判断结果在语义等价提示(semantically equivalent prompt paraphrases)下是否稳定的问题。现有研究未量化此类稳定性,可能导致评估结果不可靠。解决方案的关键在于提出 JudgeSense 框架与基准测试,通过引入 Judge Sensitivity Score(JSS)来衡量评判一致性——即同一评判模型对语义等价提示对做出相同决策的比例。实验表明,不同任务类型存在显著差异:事实性任务中原始 JSS 偏低(约 0.63),主要受极性反转提示伪影影响;修正后提升至约 0.9;偏好和相关性任务则普遍表现出“始终选择 A”的退化行为,反映强位置偏倚;而连贯性任务中判别差异明显(JSS 范围 0.389–0.992)。该工作首次系统揭示了 LLM 判决的敏感性来源,并提供开源代码、决策日志及验证过的提示同义句数据集以支持标准化 JSS 报告。

链接: https://arxiv.org/abs/2604.23478
作者: Rohith Reddy Bellibatlu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 17 pages, 3 figures, 3 tables. Code: this https URL . Dataset (JudgeSense Benchmark): this https URL

点击查看摘要

Abstract:Large language models are increasingly deployed as automated judges for evaluating other models, yet the stability of their verdicts under semantically equivalent prompt paraphrases remains unmeasured. We introduce JudgeSense, a framework and benchmark for quantifying this property via the Judge Sensitivity Score (JSS), defined as the fraction of paraphrase pairs on which a judge returns an identical decision. Evaluating nine judge models on 494 validated paraphrase pairs, we find that coherence is the only task where judges meaningfully differ, with JSS ranging from 0.389 to 0.992. On factuality, all judges cluster near JSS about 0.63, driven by a polarity-inverted prompt artifact; after correction, factuality JSS rises to about 0.9. Pairwise tasks (preference and relevance) exhibit degenerate always-A behavior in 8 of 9 judges, indicating strong position bias. Model scale does not predict consistency. We release code, decision logs, and a validated paraphrase dataset to support standardized JSS reporting. Comments: 17 pages, 3 figures, 3 tables. Code: this https URL. Dataset (JudgeSense Benchmark): this https URL Subjects: Computation and Language (cs.CL) Cite as: arXiv:2604.23478 [cs.CL] (or arXiv:2604.23478v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2604.23478 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-93] Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中前馈网络(Feed-Forward Networks, FFNs)的通道级重要性组织问题,即识别哪些通道对模型损失函数最为敏感,从而指导高效且鲁棒的结构化剪枝策略。解决方案的关键在于提出一种基于激活梯度二阶矩的 Fisher 风格损失代理(Loss Proxy, LP),发现每层中仅有少量通道(如 top 1%)承载了绝大部分损失敏感性,这些通道被定义为“超节点”(supernodes)。研究表明,超节点与传统基于激活强度或权重范数的异常值重叠较弱,且其周围存在一个由冗余性强的非超节点构成的“晕环”结构(halo structure)。通过在剪枝时保护超节点核心(SCAR-Prot 方法),可显著提升稀疏化后的模型性能,在 50% FFN 稀疏度下将困惑度从 Wanda-channel 的 989.2 降至 54.8,验证了超节点作为损失关键结构的存在及其在可靠结构化剪枝中的核心作用。

链接: https://arxiv.org/abs/2604.23475
作者: Audrey Cherilyn,Houman Safaai
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We study the organization of channel-level importance in transformer feed-forward networks (FFNs). Using a Fisher-style loss proxy (LP) based on activation-gradient second moments, we show that loss sensitivity is concentrated in a small set of channels within each layer. In Llama-3.1-8B, the top 1% of channels per layer accounts for a median of 58.7% of LP mass, with a range of 33.0% to 86.1%. We call these loss-critical channels supernodes. Although FFN layers also contain strong activation outliers, LP-defined supernodes overlap only weakly with activation-defined outliers and are not explained by activation power or weight norms alone. Around this core, we find a weaker but consistent halo structure: some non-supernode channels share the supernodes’ write support and show stronger redundancy with the protected core. We use one-shot structured FFN pruning as a diagnostic test of this organization. At 50% FFN sparsity, baselines that prune many supernodes degrade sharply, whereas our SCAR variants explicitly protect the supernode core; the strongest variant, SCAR-Prot, reaches perplexity 54.8 compared with 989.2 for Wanda-channel. The LP-concentration pattern appears across Mistral-7B, Llama-2-7B, and Qwen2-7B, remains visible in targeted Llama-3.1-70B experiments, and increases during OLMo-2-7B pretraining. These results suggest that LLM FFNs develop a small learned core of loss-critical channels, and that preserving this core is important for reliable structured pruning.

[NLP-94] Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models

【速读】: 该论文旨在解决连续思考模型(Continuous Thought Models)在不可解释的潜在空间中可能出现对齐偏差(misaligned reasoning)却仍输出看似合理结果的安全性问题。其解决方案的关键在于提出MoralChain基准与双触发机制:通过设计包含道德/非道德推理路径的12,000个社会场景,结合一个用于激活非对齐潜在推理的触发器[T]和一个释放有害输出的触发器[O],实验证明非对齐推理存在于潜在空间中几何上分离的区域,且可通过线性探测器从行为可区分条件([T][O] vs [O])迁移至检测“已武装但无害”状态([T] vs 基线),同时发现对齐偏差编码于早期潜在思维标记中,表明安全监控应聚焦于潜在推理的“规划”阶段。

链接: https://arxiv.org/abs/2604.23460
作者: Sharan Ramjee
机构: Stanford University (斯坦福大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages with 2 figures

点击查看摘要

Abstract:Chain-of-Thought (CoT) reasoning has emerged as a key technique for eliciting complex reasoning in Large Language Models (LLMs). Although interpretable, its dependence on natural language limits the model’s expressive bandwidth. Continuous thought models address this bottleneck by reasoning in latent space rather than human-readable tokens. While they enable richer representations and faster inference, they raise a critical safety question: how can we detect misaligned reasoning in an uninterpretable latent space? To study this, we introduce MoralChain, a benchmark of 12,000 social scenarios with parallel moral/immoral reasoning paths. We train a continuous thought model with backdoor behavior using a novel dual-trigger paradigm - one trigger that arms misaligned latent reasoning ([T]) and another that releases harmful outputs ([O]). We demonstrate three findings: (1) continuous thought models can exhibit misaligned latent reasoning while producing aligned outputs, with aligned and misaligned reasoning occupying geometrically distinct regions of latent space; (2) linear probes trained on behaviorally-distinguishable conditions ([T][O] vs [O]) transfer to detecting armed-but-benign states ([T] vs baseline) with high accuracy; and (3) misalignment is encoded in early latent thinking tokens, suggesting safety monitoring for continuous thought models should target the “planning” phase of latent reasoning.

[NLP-95] AI Safety Training Can be Clinically Harmful

【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 在心理治疗场景中缺乏临床有效性验证及潜在风险的问题,特别是在延长暴露疗法(Prolonged Exposure, PE)和认知行为疗法(Cognitive Behavioral Therapy, CBT)中的适配性与安全性不足。研究通过在250个PE场景和146个CBT认知重构练习上评估四个主流生成模型,并引入三名评委组成的LLM评分小组,发现尽管模型在表面共情层面表现优异(得分约0.91–1.00),其治疗适宜性(therapeutic appropriateness)在高严重度情境下骤降至0.22–0.33,且协议一致性(protocol fidelity)在两个模型中降为零。关键发现是:强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)的安全对齐机制反而破坏了心理治疗的核心机制——如在想象暴露中提供虚假安慰、错误插入危机资源、拒绝挑战自伤相关扭曲认知等。为此,作者提出一个五维评估框架(协议一致性、幻觉风险、行为一致性、危机安全性和人口多样性鲁棒性),并将其映射至FDA SaMD和欧盟AI法案要求,主张任何AI心理健康系统在部署前必须通过多维度跨维度验证。

链接: https://arxiv.org/abs/2604.23445
作者: Suhas BN,Andrew M. Sherrill,Rosa I. Arriaga,Chris W. Wiese,Saeed Abdullah
机构: Penn State University (宾夕法尼亚州立大学); Emory University (埃默里大学); Georgia Tech (佐治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 26 pages, 5 figures, 10 tables

点击查看摘要

Abstract:Large language models are being deployed as mental health support agents at scale, yet only 16% of LLM-based chatbot interventions have undergone rigorous clinical efficacy testing, and simulations reveal psychological deterioration in over one-third of cases. We evaluate four generative models on 250 Prolonged Exposure (PE) therapy scenarios and 146 CBT cognitive restructuring exercises (plus 29 severity-escalated variants), scored by a three-judge LLM panel. All models scored near-perfectly on surface acknowledgment (~0.91-1.00) while therapeutic appropriateness collapsed to 0.22-0.33 at the highest severity for three of four models, with protocol fidelity reaching zero for two. Under CBT severity escalation, one model’s task completeness dropped from 92% to 71% while the frontier model’s safety-interference score fell from 0.99 to 0.61. We identify a systematic, modality-spanning failure: RLHF safety alignment disrupts the therapeutic mechanism of action by grounding patients during imaginal exposure, offering false reassurance, inserting crisis resources into controlled exercises, and refusing to challenge distorted cognitions mentioning self-harm in PE; and through task abandonment or safety-preamble insertion during CBT cognitive restructuring. These findings motivate a five-axis evaluation framework (protocol fidelity, hallucination risk, behavioral consistency, crisis safety, demographic robustness), mapped onto FDA SaMD and EU AI Act requirements. We argue that no AI mental health system should proceed to deployment without passing multi-axis evaluation across all five dimensions.

[NLP-96] Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉问答(Visual Question Answering, VQA)任务中盲目沿用来自大语言模型(Large Language Models, LLMs)的随机采样(stochastic sampling)解码策略所带来的次优问题。作者指出,VQA 是一个封闭式任务,其答案分布具有头部密集特性(head-heavy answer distributions),且不确定性主要源于视觉证据缺失或模糊(epistemic uncertainty),而非多种合理延续的可能性。解决方案的关键在于理论证明了模型校准(calibration)与预测准确性之间的关系,并推导出贪婪解码(greedy decoding)最优性的充分条件;在此基础上,提出 Greedy Decoding for Reasoning Models(GDRM),该方法在多模态推理场景中显著优于传统随机采样和标准贪婪解码,表明在 VQA 中应优先采用贪婪解码作为默认策略。

链接: https://arxiv.org/abs/2604.23443
作者: Boqi Chen,Xudong Liu,Yunke Ao,Jianing Qiu
机构: ETH Zurich; University of Toronto; MBZUAI
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited in Multimodal LLMs (MLLMs) without task-specific justification. However, we contend that stochastic decoding can be suboptimal for Visual Question Answering (VQA). VQA is a closed-ended task with head-heavy answer distributions where uncertainty is usually epistemic, arising from missing or ambiguous visual evidence rather than plausible continuations. In this work, we provide a theoretical formalization of the relationship between model calibration and predictive accuracy, and derive the sufficient conditions for greedy decoding optimality. Extensive experiments provide empirical evidence for the superiority of greedy decoding over stochastic sampling across multiple benchmarks. Furthermore, we propose Greedy Decoding for Reasoning Models, which outperforms both stochastic sampling and standard greedy decoding in multimodal reasoning scenarios. Overall, our results caution against naively inheriting LLMs decoding heuristics in MLLMs and demonstrate that greedy decoding can be an efficient yet strong default for VQA.

[NLP-97] When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer

【速读】: 该论文旨在解决深度神经网络中层归一化(LayerNorm)的必要性问题,提出一种无需显式归一化的替代方案——动态双曲正切(Dynamic Tanh, DyT),其核心机制是通过一个可学习参数 α 对激活值进行有界约束(即 tanh(αx))。关键创新在于发现这种边界作用在不同训练规模下具有显著的制度依赖性隐式正则化效应:在低数据量(如1M tokens)时能有效提升模型性能(验证损失降低27.3%),但在高数据量(如118M tokens)时反而导致性能下降(验证损失恶化18.8%)。这一现象揭示了DyT并非普适优化器,而是依赖于训练规模和模型容量的调节机制,且可通过激活饱和程度等可测量指标进行量化分析。

链接: https://arxiv.org/abs/2604.23434
作者: Lucky Verma
机构: UMBC (University of Maryland, Baltimore County)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 28 pages, 7 figures, includes appendices. Code and artifacts: this https URL

点击查看摘要

Abstract:Dynamic Tanh (DyT) removes LayerNorm by bounding activations with a learned tanh(alpha x). We show that this bounding is a regime-dependent implicit regularizer, not a uniformly beneficial replacement. Across GPT-2-family models spanning 64M to 3.78B parameters and 1M to 118M tokens, with Llama and ViT cross-checks, DyT improves validation loss by 27.3% at 64M/1M but worsens it by 18.8% at 64M/118M; the 1M benefit vanishes with capacity (+1.7% at 3.78B), while the 118M penalty reaches +27.9%. The mechanism is measurable: 49% of DyT activations saturate at 1M versus 23% at 118M, and a 500-step saturation heuristic classifies DyT’s sign with 75% raw in-sample accuracy on the 12-cell GPT-2 calibration set (AUC 0.75; 64% when adding Scale 5 stress cells), correctly labels 3/3 Llama checks, but only reaches 50% raw leave-one-scale-out accuracy. Three interventions support the bounding explanation: HardTanh reproduces the regime pattern, increasing alpha at 118M monotonically reduces DyT’s penalty, and vanilla+dropout(p=0.5) matches DyT’s data-rich loss. We also localize Llama-DyT collapse to SwiGLU gating, where saturation separates collapse from convergence in a 3-seed component ablation (r=0.94). Scope: all experiments are compute-limited (T/P 1.84), below Chinchilla-optimal training.

[NLP-98] Evolve: A Persistent Knowledge Lifecycle for Small Language Models KR

【速读】: 该论文旨在解决本地小规模语言模型在知识密集型任务中准确率不足的问题,同时降低对大规模教师模型(teacher model)的频繁调用成本。其核心挑战在于如何高效地利用教师模型的知识来增强本地模型的表现,而无需在每次查询时都依赖教师模型进行推理。解决方案的关键在于提出一种名为 Evolve 的架构,该架构通过构建一个由教师模型编译并持续优化的持久化知识存储库(knowledge store),该存储库基于语义连贯的段落而非文档片段进行组织,并采用睡眠巩固(sleep consolidation)和使用驱动刷新机制维护知识质量;新知识在获取时被暂存,离线合并后压缩以提升效率,且通过跨查询复用显著减少教师调用次数(>50%)。实验表明,2B参数本地模型在多个基准测试中准确率从20–33%提升至60–84%,同时保持高知识利用率与可审计性,支持“抑制”(strict section-only grounding)与“增强”(section-supplemented responses)两种生成模式。

链接: https://arxiv.org/abs/2604.23424
作者: Dikran Hovagimian
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 35 pages, 1 figure. Code and evaluation data: this https URL

点击查看摘要

Abstract:Evolve pairs a small local language model with a persistent, teacher-compiled knowledge store – refined through sleep consolidation and usage-driven refresh – to deliver substantial accuracy gains over the model’s parametric baseline while amortizing teacher costs through cross-query knowledge reuse. Rather than retrieving document fragments at query time, Evolve constructs a store of semantically coherent sections compiled by teacher models at natural conceptual boundaries; new sections are staged on acquisition, consolidated offline through teacher-mediated merging, and refreshed inline when expired. A 2B-parameter local model handles classification and generation; large teacher models are invoked only for knowledge operations. Across 750 benchmark queries spanning custom specialist questions, NaturalQuestions, and TriviaQA, the 2B model augmented by Evolve improves from 20-33% baseline accuracy to 60-84% (+40-52pp) while reducing teacher invocations by over 50% through reuse. Post-consolidation compresses the knowledge store by 31-33.5% across three independent benchmarks while preserving accuracy; section-based retrieval outperforms chunk-based retrieval by 5-9pp across every lifecycle condition. The architecture supports two generation modes over the same lifecycle – suppress (strict section-only grounding, auditable) and augment (section-supplemented responses). Comments: 35 pages, 1 figure. Code and evaluation data: this https URL Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL) ACMclasses: I.2.6; I.2.7; H.3.3 Cite as: arXiv:2604.23424 [cs.LG] (or arXiv:2604.23424v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.23424 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-99] Beyond Local vs. External: A Game-Theoretic Framework for Trustworthy Knowledge Acquisition

【速读】: 该论文旨在解决云部署的大语言模型(Large Language Models, LLMs)在提供强大推理能力与动态知识的同时,因直接提交原始查询而引发敏感用户意图泄露的问题;与此同时,仅依赖本地可信模型虽能保障隐私,但受限于参数规模和知识覆盖范围,常导致回答质量下降。解决方案的关键在于提出一种基于博弈论的可信知识获取框架(Game-theoretic Trustworthy Knowledge Acquisition, GTKA),其核心机制是将知识效用与隐私保护之间的权衡建模为一个对抗性博弈:通过隐私感知的子查询生成器将敏感意图分解为低风险片段、利用对抗重建攻击者提供自适应泄露信号,并由受信本地集成器在安全边界内融合外部响应,从而在交替训练中优化子查询生成策略,在最大化知识获取准确率的同时最小化原始敏感意图的可重构性。

链接: https://arxiv.org/abs/2604.23413
作者: Rujing Yao,Yufei Shi,Yang Wu,Ang Li,Zhuoren Jiang,XiaoFeng Wang,Haixu Tang,Xiaozhong Liu
机构: Nankai University, China; The Hong Kong Polytechnic University, China; Worcester Polytechnic Institute, USA; Zhejiang University, China; Nanyang Technological University, Singapore; Indiana University Bloomington, USA
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Cloud-hosted Large Language Models (LLMs) offer unmatched reasoning capabilities and dynamic knowledge, yet submitting raw queries to these external services risks exposing sensitive user intent. Conversely, relying exclusively on trusted local models preserves privacy but often compromises answer quality due to limited parameter scale and knowledge. To resolve this dilemma, we propose Game-theoretic Trustworthy Knowledge Acquisition (GTKA), a framework that formulates the trade-off between knowledge utility and privacy as a strategic game. GTKA consists of three components: (i) a privacy-aware sub-query generator that decomposes sensitive intent into generalized, low-risk fragments; (ii) an adversarial reconstruction attacker that attempts to infer the original query from these fragments, providing adaptive leakage signals; and (iii) a trusted local integrator that synthesizes external responses within a secure boundary. By training the generator and attacker in an alternating adversarial manner, GTKA optimizes the sub-query generation policy to maximize knowledge acquisition accuracy while minimizing the reconstructability of the original sensitive intent. To validate our approach, we construct two sensitive-domain benchmarks in the biomedical and legal fields. Extensive experiments demonstrate that GTKA significantly reduces intent leakage compared to state-of-the-art baselines while maintaining high-fidelity answer quality.

[NLP-100] Overcoming Copyright Barriers in Corpus Distribution Through Non-Reversible Hashing ACL2026

【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)领域中受版权保护的文学文本标注语料难以合法共享的问题。此类语料对全面反映真实场景下的数据多样性至关重要,但因版权限制难以流通。解决方案的关键在于:语料创建者公开提供清晰的标注信息以及源文本的不可逆哈希版本,而语料使用者需拥有相同源文本,并通过应用相同的哈希函数对其本地文本进行处理以匹配共享标注。该方法对用户持有的源文本版本差异具有鲁棒性,实验证明在不同版本小说上可实现98.7%至99.79%的token级正确对齐,前提是用户版本与创建者版本足够接近。

链接: https://arxiv.org/abs/2604.23412
作者: Arthur Amalvy,Vincent Labatut,Xavier Bost,Hen-Hsen Huang
机构: Institute of Information Science, Academia Sinica, Taiwan; Laboratoire Informatique d’Avignon, Avignon Université, France; Aiway, France
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026

点击查看摘要

Abstract:While annotated corpora are crucial in the field of natural language processing (NLP), those containing copyrighted material are difficult to exchange among researchers. Yet, such corpora are necessary to fully represent the diversity of data found in the wild in the context of NLP tasks. We tackle this issue by proposing a method to lawfully and publicly share the annotations of copyrighted literary texts. The corpus creator shares the annotations in clear, along with a non-reversible hashed version of the source material. The corpus user must own the source material, and apply the same hash function to their own tokens, in order to match them to the shared annotations. Crucially, our method is robust to reasonable divergences in the version of the copyrighted data owned by the user. As an illustration, we present alignment experiments on different editions of novels. Our results show that our method is able to correctly align 98.7 to 99.79% of tokens depending on the novel, provided the user version is sufficiently close to the corpus creator’s version. We publicly release novelshare, a Python implementation of our method.

[NLP-101] When Chain-of-Thought Fails the Solution Hides in the Hidden States

【速读】: 该论文旨在解决链式思维(Chain-of-Thought, CoT)是否蕴含可计算利用的任务相关信息这一核心问题,即CoT中的token是否真正承载了有助于解决问题的表征,而非仅仅是解释性的中间步骤。其解决方案的关键在于采用激活修补(activation patching)技术:将同一问题在CoT生成过程中某token的隐藏状态迁移至直接回答(direct-answer)运行中,通过测量最终答案准确率的变化来评估该token是否包含任务相关的推理信息。研究发现,即使原始CoT路径错误,修补后的生成结果也能显著提升准确率,表明单个CoT token可能已编码足够信息以恢复正确答案;此外,这类信息在正确CoT中更普遍、集中在中后期层且较早出现在推理序列中,且语言类token(如动词和实体)比数学符号更能引导正确推理方向。这揭示了CoT中存在可被提取的、粒度到token级别的可恢复问题求解信息,为理解推理过程的表示机制与失效点提供了新视角。

链接: https://arxiv.org/abs/2604.23351
作者: Houman Mehrafarin,Amit Parekh,Ioannis Konstas
机构: Heriot-Watt University (赫瑞-瓦特大学); University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Whether intermediate reasoning is computationally useful or merely explanatory depends on whether chain-of-thought (CoT) tokens contain task-relevant information. We present a mechanistic causal analysis of CoT on GSM8K using activation patching: transferring token-level hidden states from a CoT generation to a direct-answer run for the same question, then measuring the effect on final-answer accuracy. Across models, generating after patching yields substantially higher accuracy than both direct-answer prompting and the original CoT trace, revealing that individual CoT tokens can encode sufficient information to recover the correct answer, even when the original trace is incorrect. This task-relevant information is more prevalent in correct than incorrect CoT runs and is unevenly distributed across tokens, concentrating in mid-to-late layers and appearing earlier in the reasoning trace. Moreover, patching language tokens such as verbs and entities carry task-solving information that steers generation toward correct reasoning, whereas mathematical tokens encode answer-proximal content that rarely succeeds. Patched outputs are often shorter and yet exceed the accuracy of a full CoT trace, suggesting complete reasoning chains are not always necessary. Together, these findings demonstrate that CoT encodes recoverable, token-level problem-solving information, offering new insight into how reasoning is represented and where it breaks down.

[NLP-102] Evaluating Large Language Models on Computer Science University Exams in Data Structures

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在计算机科学(Computer Science, CS)数据结构考试题型中的表现评估问题,尤其关注其对封闭式与多项选择题的处理能力。解决方案的关键在于构建了一个新的基准测试数据集——来自特拉维夫大学(Tel Aviv University, TAU)的课程考试题目,并在此基础上系统评估了包括GPT-4o、Claude 3.5在内的主流大模型以及Mathstral 7B和LLaMA 3 8B等较小模型的答题准确率,从而为LLMs在CS教育场景下的实际应用提供量化依据与性能参考。

链接: https://arxiv.org/abs/2604.23347
作者: Edan Gabay,Yael Maoz,Jonathan Stahl,Naama Maoz,Abdo Amer,Orr Eilat,Hanoch Levy,Michal Kleinbort,Amir Rubinstein,Adi Haviv
机构: Tel Aviv University (特拉维夫大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present a comprehensive evaluation of Large Language Models (LLMs) on Computer Science (CS) Data Structure examination questions. Our work introduces a new benchmark dataset comprising exam questions from Tel Aviv University (TAU), curated to assess LLMs’ abilities in handling closed and multiple-choice questions. We evaluated the performance of OpenAI’s GPT 4o and Anthropic’s Claude 3.5, popular LLMs, alongside two smaller LLMs, Mathstral 7B and LLaMA 3 8B, across the TAU exams benchmark. Our findings provide insight into the current capabilities of LLMs in CS education.

[NLP-103] Process Supervision of Confidence Margin for Calibrated LLM Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中因基于结果的奖励机制导致过度自信的问题,这会引发幻觉、不可靠的置信度控制以及不必要的计算资源分配。其解决方案的关键在于提出一种校准感知的强化学习框架——置信度边际强化学习(Reinforcement Learning with Confidence Margin, RLCM),该框架通过在单个推理轨迹中增强正确与错误步骤之间的置信度边际来联合优化准确性和置信度可靠性,从而在不牺牲准确率的前提下显著提升模型校准性能,并支持更高效的置信度加权聚合与风险控制。

链接: https://arxiv.org/abs/2604.23333
作者: Liaoyaqi Wang,Chunsheng Zuo,William Jurayj,Benjamin Van Durme,Anqi Liu
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Scaling test-time computation with reinforcement learning (RL) has emerged as a reliable path to improve large language models (LLM) reasoning ability. Yet, outcome-based reward often incentivizes models to be overconfident, leading to hallucinations, unreliable confidence-based control, and unnecessary compute allocation. We introduce Reinforcement Learning with Confidence Margin (\textbfRLCM), a calibration-aware RL framework that jointly optimizes correctness and confidence reliability via a margin-enhanced process reward over intermediate-budget completions. Rather than aligning confidence to correctness likelihoods, RLCM encourages to widen the confidence margin between correct and incorrect steps within a single reasoning trajectory. Across mathematical, code, logic and science benchmarks, our method substantially improves calibration while maintaining or improving accuracy. We further show that, with calibrated confidence signals, the resulting models enable more efficient conformal risk control and effective confidence-weighted aggregation.

[NLP-104] Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss

【速读】: 该论文旨在解决当前音频-文本跨模态检索方法在处理长时、噪声大及弱标签音频时性能下降的问题,其核心挑战在于现有方法依赖对比学习和大批次训练,在小批量场景下难以稳定优化。解决方案的关键在于提出一种新颖的多模态检索框架,通过交叉模态嵌入精炼模块(cross-modal embedding refinement module)对音频和文本嵌入进行精细化调整,该模块结合基于Transformer的投影、线性映射与双向注意力机制;同时设计了一种混合损失函数,融合余弦相似度、L₁距离与对比损失目标,从而在小批量条件下实现稳定训练,并辅以静音感知分块(silence-aware chunking)和基于注意力的池化策略,有效提升对低信噪比(SNR 5–15 dB)长音频的鲁棒性。

链接: https://arxiv.org/abs/2604.23323
作者: Meizhu Liu,Matthew Rowe,Amit Agarwal,Michael Avendi,Yassi Abbasi,Hitesh Laxmichand Patel,Paul Li,Kyu J. Han,Tao Sheng,Sujith Ravi,Dan Roth
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Audio-text retrieval enables semantic alignment between audio content and natural language queries, supporting applications in multimedia search, accessibility, and surveillance. However, current state-of-the-art approaches struggle with long, noisy, and weakly labeled audio due to their reliance on contrastive learning and large-batch training. We propose a novel multimodal retrieval framework that refines audio and text embeddings using a cross-modal embedding refinement module combining transformer-based projection, linear mapping, and bidirectional attention. To further improve robustness, we introduce a hybrid loss function blending cosine similarity, \mathcalL_1 , and contrastive objectives, enabling stable training even under small-batch constraints. Our approach efficiently handles long-form and noisy audio (SNR 5 to 15) via silence-aware chunking and attention-based pooling. Experiments on benchmark datasets demonstrate improvements over prior methods.

[NLP-105] Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

【速读】: 该论文旨在解决强化学习中基于验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)的粗粒度信用分配问题,即在缺乏细粒度标注的情况下如何实现对策略更新的精细化控制。现有方法如Group Relative Policy Optimization (GRPO) 将相同的优势值分配给同一组内的所有token,导致无法捕捉局部推理质量差异。解决方案的关键在于利用隐藏状态分布的差异作为自监督信号:论文发现,在每个GRPO组内,正确与错误轨迹在局部推理分歧区域附近的span级隐藏状态分布间Wasserstein距离显著增大,且这一现象在跨样本和单个轨迹内部均成立。基于此观察,作者提出Span-level Hidden state Enabled Advantage Reweighting (SHEAR),通过计算span级Wasserstein距离来重加权token级优势值,从而放大隐藏状态分离度高的token的更新强度,实现无需额外标注或奖励模型训练的细粒度信用分配。

链接: https://arxiv.org/abs/2604.23318
作者: Xinzhu Chen,Wei He,Huichuan Fan,Wenzhe Niu,Zhongxiang Sun,Xuanru Wang,Jiuchong Gao,Jinghua Hao,Renqing He,Weijie Yu
机构: Beijing University of Posts and Telecommunications; Meituan; Tianjin University; Renmin University of China; The University of Melbourne; University of International Business and Economics
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) performs coarse-grained credit assignment in reinforcement learning with verifiable rewards (RLVR) by assigning the same advantage to all tokens in a rollout. Process reward models can provide finer-grained supervision, but they require step-level annotation or additional reward modeling. We show that hidden-state distributions contain a useful signal for local reasoning quality that can be extracted using only outcome-level correctness labels available in RLVR. Specifically, within each GRPO group, the Wasserstein distance between span-level hidden state distributions of correct and incorrect rollouts increases around regions where their local reasoning quality diverges. This association holds both across examples and within individual trajectories, suggesting that hidden-state distributional divergence can serve as a self-supervision signal for fine-grained credit assignment. We formalize this observation with a separation theorem showing that, under mild structural assumptions, post-divergence spans have larger Wasserstein distances than pre-divergence spans whenever the population-level distributional gap exceeds finite-sample noise. Motivated by this result, we propose \textbfSpan-level \textbfHidden state \textbfEnabled \textbfAdvantage \textbfReweighting (SHEAR), which modifies GRPO by using span-level Wasserstein distances to scale token-level advantages, amplifying updates on tokens whose hidden states are more separated from the opposing group. The method requires no additional model and only minimal changes to the training pipeline. Experiments on five mathematical reasoning benchmarks and five code generation benchmarks show improvements over standard GRPO and strong performance relative to supervised process reward models, while requiring no additional annotation or reward model training.

[NLP-106] mathcalS2IT: Stepwise Syntax Integration Tuning for Large Language Models in Aspect Sentiment Quad Prediction NAACL2025

【速读】: 该论文旨在解决生成式 AI(Generative AI)在 Aspect Sentiment Quad Prediction (ASQP) 任务中对句法结构信息利用不足的问题。尽管句法结构在以往的抽取式方法中已被证明有效,但大型语言模型(LLMs)因推理能力有限,难以充分整合此类信息。解决方案的关键在于提出 S²IT(Stepwise Syntax Integration Tuning)框架,通过三阶段渐进式微调机制,将全局和局部句法结构知识逐步融入 LLMs:首先进行全局句法引导的抽取,其次执行局部句法引导的分类,最后通过细粒度结构微调增强模型对元素间连接关系与节点分类的理解,从而显著提升 ASQP 任务的性能。

链接: https://arxiv.org/abs/2604.23296
作者: Bingfeng Chen,Chenjie Qiu,Yifeng Xie,Boyan Xu,Ruichu Cai,Zhifeng Hao
机构: Guangdong University of Technology (广东工业大学); Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) (广东省人工智能与数字经济发展实验室(深圳)); Peng Cheng Laboratory (鹏城实验室); Shantou University (汕头大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of NAACL 2025

点击查看摘要

Abstract:Aspect Sentiment Quad Prediction (ASQP) has seen significant advancements, largely driven by the powerful semantic understanding and generative capabilities of large language models (LLMs). However, while syntactic structure information has been proven effective in previous extractive paradigms, it remains underutilized in the generative paradigm of LLMs due to their limited reasoning capabilities. In this paper, we propose S^2IT, a novel Stepwise Syntax Integration Tuning framework that progressively integrates syntactic structure knowledge into LLMs through a multi-step tuning process. The training process is divided into three steps. S^2IT decomposes the quadruple generation task into two stages: 1) Global Syntax-guided Extraction and 2) Local Syntax-guided Classification, integrating both global and local syntactic structure information. Finally, Fine-grained Structural Tuning enhances the model’s understanding of syntactic structures through the prediction of element links and node classification. Experiments demonstrate that S^2IT significantly improves state-of-the-art performance across multiple datasets. Our implementation will be open-sourced at this https URL.

[NLP-107] Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

【速读】: 该论文旨在解决印度语言(特别是印地语)中全双工语音对话系统(full-duplex spoken dialogue system)研究严重不足的问题,以实现自然的对话行为建模,如打断、重叠说话和反馈语(backchannels)。解决方案的关键在于:首先基于先进的双工语音架构Moshi,开发首个开源且可复现的印地语全双工系统;其次,通过构建定制化的印地语分词器(tokeniser)并重新初始化文本词汇依赖参数,同时保留预训练的音频模块;最后采用两阶段训练策略——大规模预训练后在1000小时对话数据上微调,从而直接从真实自发对话中学习轮流发言与重叠模式。实验表明,该方法能生成符合自然对话逻辑的印地语交互内容,为印地语及其他印度语言的实时双工语音对话系统奠定基础。

链接: https://arxiv.org/abs/2604.23295
作者: Bhaskar Singh,Shobhit Banga,Pranav Sharma
机构: JoshTalks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such systems remain largely unexplored for Indian languages. We present the first open, reproducible full-duplex spoken dialogue system for Hindi by adapting Moshi, a state-of-the-art duplex speech architecture, using a custom Hindi tokeniser and training on 26,000 hours of real spontaneous conversations collected from 14,695 speakers with separate speaker channels, enabling direct learning of turn-taking and overlap patterns from natural interactions. To support Hindi text generation, we replace the original English tokeniser and reinitialise text-vocabulary-dependent parameters while retaining the pre-trained audio components. We propose a two-stage training recipe – large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data. Evaluation through the prompted dialogue continuation paradigm with both automatic metrics and human judgments demonstrates that the resulting model generates natural and meaningful full-duplex conversational behaviour in Hindi. This work serves as a first step toward real-time duplex spoken dialogue systems for Hindi and other Indian languages.

[NLP-108] Au-M-ol: A Unified Model for Medical Audio and Language Understanding

【速读】: 该论文旨在解决医疗场景下自动语音识别(Automatic Speech Recognition, ASR)的准确性与鲁棒性问题,尤其是在噪声环境、专业术语和说话人差异等挑战性条件下的性能瓶颈。其解决方案的关键在于提出了一种名为Au-M-ol的新型多模态架构,该架构通过三个核心组件实现:(1)音频编码器提取医学语音中的丰富声学特征;(2)适配层将音频特征映射至大语言模型(Large Language Model, LLM)的输入空间;(3)预训练LLM完成转录与临床语言理解任务。这种设计使模型能够直接解析医学语音内容,在降低词错误率(Word Error Rate, WER)方面比现有最优基线提升56%,并显著增强对复杂临床场景的适应能力。

链接: https://arxiv.org/abs/2604.23284
作者: Meizhu Liu,Nistha Mitra,Paul Li,Amine Abdaoui,Adam Ledyard,Tao Sheng
机构: Oracle AI Science (Oracle人工智能科学); Neuramill
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work, we present Au-M-ol, a novel multimodal architecture that extends Large Language Models (LLMs) with audio processing. It is designed to improve performance on clinically relevant tasks such as Automatic Speech Recognition (ASR). Au-M-ol has three main components: (1) an audio encoder that extracts rich acoustic features from medical speech, (2) an adaptation layer that maps audio features into the LLM input space, and (3) a pretrained LLM that performs transcription and clinical language understanding. This design allows the model to interpret spoken medical content directly, improving both accuracy and robustness. In experiments, Au-M-ol reduces Word Error Rate (WER) by 56% compared to state-of-the-art baselines on medical transcription tasks. The model also performs well in challenging conditions, including noisy environments, domain-specific terminology, and speaker variability. These results suggest that Au-M-ol is a strong candidate for real-world clinical applications, where reliable and context-aware audio understanding is essential.

[NLP-109] From Similarity to Structure: Training-free LLM Context Compression with Hybrid Graph Priors

【速读】: 该论文旨在解决长文本输入下大语言模型(Large Language Models, LLMs)在计算成本高且难以可靠处理超长上下文时所面临的挑战,尤其是现有压缩方法在严格token预算下难以同时保持任务相关性、主题覆盖度和跨句连贯性的局限。其解决方案的关键在于提出一种无需训练且与模型无关的压缩框架,通过构建稀疏混合句子图(sparse hybrid sentence graph),融合互为k近邻(k-NN)的语义边与短程顺序边,利用聚类提取主题骨架,并设计一个可解释的评分机制,综合考虑任务相关性、簇代表性、桥接中心性(bridge centrality)及环路覆盖提示(cycle coverage cue),最终采用带冗余抑制的预算贪心选择策略,在保持原文顺序的前提下生成可读性强的压缩上下文。

链接: https://arxiv.org/abs/2604.23277
作者: Yitian Zhou,Chaoning Zhang,Jiaquan Zhang,Zhenzhen Huang,Jinyu Guo,Sung-Ho Bae,Lik-Hang Lee,Caiyan Qin,Yang Yang
机构: University of Electronic Science and Technology of China (电子科技大学); Kyung Hee University (中央大学); The Hong Kong Polytechnic University (香港理工大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-context large language models remain computationally expensive to run and often fail to reliably process very long inputs, which makes context compression an important component of many systems. Existing compression approaches typically rely on trained compressors, dense retrieval-style selection, or heuristic trimming, and they often struggle to jointly preserve task relevance, topic coverage, and cross-sentence coherence under a strict token budget. To address this, we propose a training-free and model-agnostic compression framework that selects a compact set of sentences guided by structural graph priors. Our method constructs a sparse hybrid sentence graph that combines mutual k-NN semantic edges with short-range sequential edges, extracts a topic skeleton via clustering, and ranks sentences using an interpretable score that integrates task relevance, cluster representativeness, bridge centrality, and a cycle coverage cue. A budgeted greedy selection with redundancy suppression then produces a readable compressed context in original order. Experimental results on four datasets show that our approach is competitive with strong extractive and abstractive baselines, demonstrating larger gains on long-document benchmarks.

[NLP-110] Lightweight and Production-Ready PDF Visual Element Parsing

【速读】: 该论文旨在解决PDF文档中复杂视觉元素(如图表、表格和表单)的准确提取问题,这些问题在现有PDF解析器中普遍存在:遗漏复杂图形、提取无信息噪声(如水印、logo)、产生碎片化元素以及无法可靠关联标题与对应内容,从而影响下游多模态检索增强生成(RAG)任务的性能。解决方案的关键在于结合空间启发式规则、版面分析与语义相似性,构建一个轻量级且可投入生产的PDF解析框架,实现了≥96%的视觉元素检测准确率和93%的标题关联准确率,并在多模态RAG场景中显著优于当前最先进解析器和大视觉语言模型,同时将延迟降低超过2倍。

链接: https://arxiv.org/abs/2604.23276
作者: Meizhu Liu,Yassi Abbasi,Matthew Rowe,Michael Avendi,Paul Li
机构: Oracle AI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:PDF documents contain critical visual elements such as figures, tables, and forms whose accurate extraction is essential for document understanding and multimodal retrieval-augmented generation (RAG). Existing PDF parsers often miss complex visuals, extract non-informative artifacts (e.g., watermarks, logos), produce fragmented elements, and fail to reliably associate captions with their corresponding elements, which degrades downstream retrieval and question answering. We present a lightweight and production level PDF parsing framework that can accurately detect visual elements and associates captions using a combination of spatial heuristics, layout analysis, and semantic similarity. On popular benchmark datasets and internal product data, the proposed solution achieves \geq96% visual element detection accuracy and 93% caption association accuracy. When used as a preprocessing step for multimodal RAG, it significantly outperforms state-of-the-art parsers and large vision-language models on both internal data and the MMDocRAG benchmark, while reducing latency by over 2\times . We have deployed the proposed system in challenging production environment.

[NLP-111] Fine-tuning vs. In-context Learning in Large Language Models : A Formal Language Learning Perspective ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在两种核心学习模式——微调(Fine-tuning, FT)与上下文学习(In-context Learning, ICL)之间,其语言能力差异及归纳偏置(inductive biases)是否不同的问题。此前研究因实验设计不一致而结论混杂,难以得出可靠结论。为此,作者提出一个形式化语言学习任务,通过精确的语言边界定义、受控的字符串采样和无数据污染的方式构建严谨评估框架,并引入判别式语言熟练度测试:若模型对语言内字符串的生成概率高于语言外字符串,则判定其具备更高语言熟练度。该方案的关键在于将自然语言复杂性简化为可量化、可复现的形式语言环境,从而清晰区分FT与ICL在分布内泛化能力、归纳偏置演化以及模型规模敏感性等方面的差异,揭示出FT在分布内表现更优但两者在分布外性能相当,且ICL对模型架构和词汇表敏感的特性。

链接: https://arxiv.org/abs/2604.23267
作者: Bishwamittra Ghosh,Soumi Das,Till Speicher,Qinyuan Wu,Mohammad Aflah Khan,Deepak Garg,Krishna P. Gummadi,Evimaria Terzi
机构: Max Planck Institute for Software Systems (马克斯·普朗克软件系统研究所); Boston University (波士顿大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at ACL 2026 (Main)

点击查看摘要

Abstract:Large language models (LLMs) operate in two fundamental learning modes - fine-tuning (FT) and in-context learning (ICL) - raising key questions about which mode yields greater language proficiency and whether they differ in their inductive biases. Prior studies comparing FT and ICL have yielded mixed and inconclusive results due to inconsistent experimental setups. To enable a rigorous comparison, we propose a formal language learning task - offering precise language boundaries, controlled string sampling, and no data contamination - and introduce a discriminative test for language proficiency, where an LLM succeeds if it assigns higher generation probability to in-language strings than to out-of-language strings. Empirically, we find that: (a) FT has greater language proficiency than ICL on in-distribution generalization, but both perform equally well on out-of-distribution generalization. (b) Their inductive biases, measured by the correlation in string generation probabilities, are similar when both modes partially learn the language but diverge at higher proficiency levels. © Unlike FT, ICL performance differs substantially across models of varying sizes and families and is sensitive to the token vocabulary of the language. Thus, our work demonstrates the promise of formal languages as a controlled testbed for evaluating LLMs, behaviors that are difficult to isolate in natural language datasets. Our source code is available at this https URL. Comments: Accepted at ACL 2026 (Main) Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2604.23267 [cs.CL] (or arXiv:2604.23267v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2604.23267 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-112] Small Language Model Helps Resolve Semantic Ambiguity of LLM Prompt

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂推理任务中因用户输入提示(prompt)存在语义模糊性而导致的性能下降问题。自然语言提示常因语法不规范或语义多义性产生歧义,使模型难以选择正确的推理路径。现有方法通常在推理过程中进行查询编辑,但未从根源上消除歧义。本文提出一种预推理阶段的提示优化机制,其关键在于通过显式提示消歧(explicit prompt disambiguation)来提升输入质量:首先识别提示中的语义风险,继而从多角度验证一致性并解决语义冲突,最终将处理后的结构化提示作为清洁输入传递给LLM。该方案利用小型语言模型(Small Language Models, SLMs)高效执行消歧过程,在不干扰LLM内部推理机制的前提下显著提升推理准确性(实验显示性能提升2.5点,成本仅0.02)。

链接: https://arxiv.org/abs/2604.23263
作者: Zhenzhen Huang,Chaoning Zhang,Fachrina Dewi Puspitasari,Jiaquan Zhang,Yitian Zhou,Shuxu Chen,Yang Yang
机构: University of Electronic Science and Technology of China (电子科技大学); Kyung Hee University (中央大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly utilized in various complex reasoning tasks due to their excellent instruction following capability. However, the model’s performance is highly dependent on the open-ended characteristics of the users’ input prompt. Natural prompts often do not follow proper syntactic rules, which creates ambiguous queries that yield multiple interpretations. Such ambiguous prompts confuse the model in choosing the correct reasoning paths to answer questions. Prior works address this challenge by applying query editing during the LLM inference process without explicitly solving the root cause of the ambiguity. To address this limitation, we propose a pre-inference prompt optimization mechanism via explicit prompt disambiguation. Particularly, we identify semantic risks in the prompt, check their multi-perspective consistency, and resolve any semantic conflicts that arise. Finally, we organize the resolved ambiguities in a logically structured manner as a clean input to the LLM. By explicitly resolving semantic ambiguity, our method can produce a more focused attention distribution to the semantically essential tokens. We also leverage small language models (SLMs) as the main executor of prompt disambiguation to benefit from their efficient computation. Through comprehensive experiments on multiple benchmarks, we demonstrate that our method improves reasoning performance by 2.5 points at a cost of only \ 0.02. Our study promotes explicit prompt disambiguation as an effective prompt optimization method without disturbing the internal mechanism of LLM inference.

[NLP-113] Spectro-Temporal Modulation Representation Framework for Human-Imitated Speech Detection

【速读】: 该论文旨在解决人类模仿语音(human-imitated speech)的检测难题,此类语音因由真人自然生成,缺乏传统自动检测系统依赖的AI语音特征(如声谱失真或机器人化线索),因而更具隐蔽性且难以识别。解决方案的关键在于提出一种基于听觉感知的时频调制(Spectro-Temporal Modulation, STM)表示框架,该框架利用两种耳蜗滤波器模型——Gammatone Filterbank(GTFB)和Gammachirp Filterbank(GCFB),分别模拟频率选择性和强度依赖的非对称性,从而捕捉语音信号中与人类听觉感知相关的时频波动特征;进一步引入分段STM(Segmental-STM)表示以高分辨率建模短时调制模式,显著提升了对模仿语音攻击的检测性能,甚至超越了人类听觉判断水平,验证了感知启发式时频建模在提升语音认证鲁棒性方面的有效性。

链接: https://arxiv.org/abs/2604.23241
作者: Khalid Zaman,Masashi Unoki
机构: Japan Advanced Institute of Science and Technology (日本高级科学与技术研究生院)
类目: ound (cs.SD); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Human-imitated speech poses a greater challenge than AI-generated speech for both human listeners and automatic detection systems. Unlike AI-generated speech, which often contains artifacts, over-smoothed spectra, or robotic cues, imitated speech is produced naturally by humans, thereby preserving a higher degree of naturalness that makes imitation-based speech forgery significantly more challenging to detect using conventional acoustic or cepstral features. To overcome this challenge, this study proposes an auditory perception-based Spectro-Temporal Modulation (STM) representation framework for human-imitated speech detection. The STM representations are derived from two cochlear filterbank models: the Gammatone Filterbank (GTFB), which simulates frequency selectivity and can be regarded as a first approximation of cochlear filtering, and the Gammachirp Filterbank (GCFB), which further models both frequency selectivity and level-dependent asymmetry. These STM representations jointly capture temporal and spectral fluctuations in speech signals, corresponding to changes over time in the spectrogram and variations along the frequency axis related to human auditory perception. We also introduce a Segmental-STM representation to analyze short-term modulation patterns across overlapping time windows, enabling high-resolution modeling of temporal speech variations. Experimental results show that STM representations are effective for human-imitated speech detection, achieving accuracy levels close to those of human listeners. In addition, Segmental-STM representations are more effective, surpassing human perceptual performance. The findings demonstrate that perceptually inspired spectro-temporal modeling is promising for detecting imitation-based speech attacks and improving voice authentication robustness.

[NLP-114] Measuring Temporal Linguistic Emergence in Diffusion Language Models

【速读】: 该论文旨在解决生成式语言模型在扩散过程中信息可测量性的时序演化问题,即探究不同类型的语义信息(如词性、语义类别和词汇身份)在文本生成阶段何时变得可识别。其解决方案的关键在于利用扩散语言模型(Diffusion Language Models)提供的显式去噪轨迹(denoising trajectory),通过设计四种时间维度的测量指标(token commitment、线性可恢复性、置信度与熵动态变化、中段扰动敏感性),系统分析了LLaDA-8B-Base模型在Masked WikiText-103数据上的生成过程。结果表明:语义类别比功能类词汇更早稳定,粗粒度标签比精确词汇身份更具线性可恢复性,轨迹不确定性与最终生成正确性高度相关,且中段扰动敏感性峰值主要集中在被扰动位置本身,揭示了扩散时间轴作为分析工具的有效性。

链接: https://arxiv.org/abs/2604.23235
作者: Harry Lu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Diffusion language models expose an explicit denoising trajectory, making it possible to ask when different kinds of information become measurable during generation. We study three independent 32-step runs of LLaDA-8B-Base on masked WikiText-103 text, each with 1,000 probe-training sequences and 200 held-out evaluation sequences. From saved trajectories, we derive four temporal measurements: token commitment; linear recoverability of part-of-speech (POS), coarse semantic category, and token identity; confidence and entropy dynamics; and sensitivity under mid-trajectory re-masking. Across seeds, the same ordering recurs: content categories stabilize earlier than function-heavy categories, POS and coarse semantic labels remain substantially more linearly recoverable than exact lexical identity under our probe setup, uncertainty remains higher for tokens that ultimately resolve incorrectly even though late confidence becomes less calibrated, and perturbation sensitivity peaks in the middle of the trajectory. A direct/collateral decomposition shows that this peak is overwhelmingly local to the perturbed positions themselves. In this LLaDA+WikiText setting, denoising time is therefore a useful analysis axis: under our measurements, coarse labels are recovered earlier and more robustly than lexical identity, trajectory-level uncertainty tracks eventual correctness, and mid-trajectory states are the most intervention-sensitive.

[NLP-115] DARC-CLIP: Dynamic Adaptive Refinement with Cross-Attention for Meme Understanding ICASSP2026

【速读】: 该论文旨在解决现有基于CLIP(Contrastive Language–Image Pretraining)的图像-文本多模态内容检测方法在处理表情包(meme)时,因静态融合策略难以捕捉跨模态细粒度依赖关系而导致的敏感内容识别准确率不足的问题。其解决方案的关键在于提出DARC-CLIP框架,通过引入自适应交叉注意力精炼器(Adaptive Cross-Attention Refiner, ACAR)实现双向信息对齐,并结合动态特征适配器(Dynamic Feature Adapter, DFA)进行任务感知的信号自适应调整,从而构建分层细化堆栈以增强多模态特征的交互能力。实验表明,该方法在PrideMM和CrisisHateMM数据集上均取得显著性能提升,尤其在仇恨内容检测中AUROC与F1指标分别提升4.18和6.84,验证了自适应跨信号精炼策略在社会敏感分类任务中的有效性。

链接: https://arxiv.org/abs/2604.23214
作者: Qiyuan Jin
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to IEEE ICASSP 2026. 5 pages, 3 figures, 4 tables

点击查看摘要

Abstract:Memes convey meaning through the interaction of visual and textual signals, often combining humor, irony, and offense in subtle ways. Detecting harmful or sensitive content in memes requires accurate modeling of these multimodal cues. Existing CLIP-based approaches rely on static fusion, which struggles to capture fine grained dependencies between modalities. We propose DARC-CLIP, a CLIP-based framework for adaptive multimodal fusion with a hierarchical refinement stack. DARC-CLIP introduces Adaptive Cross-Attention Refiners to for bidirectional information alignment and Dynamic Feature Adapters for task-sensitive signal adaptation. We evaluate DARC-CLIP on the PrideMM benchmark, which includes hate, target, stance, and humor classification, and further test generalization on the CrisisHateMM dataset. DARC-CLIP achieves highly competitive classification accuracy across tasks, with significant gains of +4.18 AUROC and +6.84 F1 in hate detection over the strongest baseline. Ablation studies confirm that ACAR and DFA are the main contributors to these gains. These results show that adaptive cross-signal refinement is an effective strategy for multimodal content analysis in socially sensitive classification.

[NLP-116] Discovering Agent ic Safety Specifications from 1-Bit Danger Signals AAMAS2026

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)代理如何通过纯粹的经验交互发现隐藏的安全目标(hidden safety objectives)的问题,尤其是在缺乏显式奖励信号或详细反馈的情况下。其核心挑战在于:在结构化、低维环境中,代理仅能接收到每步的稀疏二值危险警告(sparse binary danger warnings),而无法观测到真正的性能函数 $ R^* $。解决方案的关键是提出 EPO-Safe(Experiential Prompt Optimization for Safe Agents)框架,该框架利用LLM的自我反思能力,基于经验生成动作计划并迭代演化自然语言行为规范(natural language behavioral specification),从而自主识别出安全约束并形成可解释的规则。与依赖丰富文本反馈的标准LLM反思方法不同,EPO-Safe 证明了仅凭单比特危险信号即可实现有效的安全推理;同时揭示了仅基于奖励的反思会加剧奖励黑客(reward hacking)行为,强调必须引入专用的安全通道以确保安全发现的有效性。

链接: https://arxiv.org/abs/2604.23210
作者: Víctor Gallego
机构: Komorebi(科莫雷比)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to the Adaptive and Learning Agents Workshop (ALA 2026) @ AAMAS 2026. Code is available at this http URL

点击查看摘要

Abstract:Can large language model agents discover hidden safety objectives through experience alone? We introduce EPO-Safe (Experiential Prompt Optimization for Safe Agents), a framework where an LLM iteratively generates action plans, receives sparse binary danger warnings, and evolves a natural language behavioral specification through reflection. Unlike standard LLM reflection methods that rely on rich textual feedback (e.g., compiler errors or detailed environment responses), EPO-Safe demonstrates that LLMs can perform safety reasoning from a strictly impoverished signal in structured, low-dimensional environments: the agent never observes the hidden performance function R^* , only a single bit per timestep indicating that an action was unsafe. We evaluate on five AI Safety Gridworlds (Leike et al., 2017) and five text-based scenario analogs where visible reward R may diverge from R^* . EPO-Safe discovers safe behavior within 1-2 rounds (5-15 episodes), producing human-readable specifications with correct explanatory hypotheses about hazards (e.g., “X cells are directionally hazardous: entering from the north is dangerous”). Critically, we show that standard reward-driven reflection actively degrades safety: agents reflecting on reward alone use the loop to justify and accelerate reward hacking, proving that reflection must be paired with a dedicated safety channel to discover hidden constraints. We further evaluate robustness to noisy oracles: even when 50% of non-dangerous steps produce spurious warnings, mean safety performance degrades by only 15% on average, though sensitivity is environment-dependent, as cross-episode reflection naturally filters inconsistent signals. Each evolved specification functions as an auditable set of grounded behavioral rules discovered autonomously through interaction, rather than authored by humans as in Constitutional AI (Bai et al., 2022).

[NLP-117] Mechanistic Steering of LLM s Reveals Layer-wise Feature Vulnerabilities in Adversarial Settings

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在经过安全对齐(safety alignment)后仍可能被“越狱”(jailbroken)生成有害输出的问题,核心在于探究这种漏洞是否由可识别的内部特征驱动,而非仅依赖于特定提示(prompt)。解决方案的关键在于提出一个三阶段分析管道:首先通过子空间相似性提取对抗响应中与概念对齐的token;其次采用三种特征分组策略(聚类、层次聚类和单token驱动)识别Gemma-2-2B模型全部26层中的SAE(Sparsely Activated Encoder)特征子群;最后通过增强每个子群中的顶级特征并使用标准化LLM-judge评分协议测量有害性变化。结果表明,第16至25层的特征子群对越狱行为最为敏感,揭示了模型越狱漏洞集中于中后期层的特征子群,为基于特征层面的干预提供了更系统的方法路径,优于当前以提示为中心的防御策略。

链接: https://arxiv.org/abs/2604.23130
作者: Nilanjana Das,Manas Gaur
机构: UMBC, Maryland, U.S.A
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) can still be jailbroken into producing harmful outputs despite safety alignment. Existing attacks show this vulnerability, but not the internal mechanisms that cause it. This study asks whether jailbreak success is driven by identifiable internal features rather than prompts alone. We propose a three-stage pipeline for Gemma-2-2B using the BeaverTails dataset. First, we extract concept-aligned tokens from adversarial responses via subspace similarity. Second, we apply three feature-grouping strategies (cluster, hierarchical-linkage, and single-token-driven) to identify SAE feature subgroups for the aligned tokens across all 26 model layers. Third, we steer the model by amplifying the top features from each identified subgroup and measure the change in harmfulness score using a standardized LLM-judge scoring protocol. In all three approaches, the features in the layers [16-25] were relatively more vulnerable to steering. All three methods confirmed that mid to later layer feature subgroups are more responsible for unsafe outputs. These results provide evidence that the jailbreak vulnerability in Gemma-2-2B is localized to feature subgroups of mid to later layers, suggesting that targeted feature-level interventions may offer a more principled path to adversarial robustness than current prompt-level defenses.

[NLP-118] Mixture of Heterogeneous Grouped Experts for Language Modeling ACL2026

【速读】: 该论文旨在解决传统混合专家(Mixture-of-Experts, MoE)架构中因专家规模统一而导致的计算资源分配僵化问题,以及现有异构专家设计在实际部署中面临的GPU利用率不均和参数利用效率低下的系统级挑战。其解决方案的关键在于提出一种两级路由机制的混合异构专家(Mixture of Heterogeneous Grouped Experts, MoHGE)架构:首先通过组内辅助损失(Group-Wise Auxiliary Loss)动态引导tokens至任务难度对应的最参数高效专家组;其次引入全尺寸分组解耦分配策略(All-size Group-decoupling Allocation)与组内专家辅助损失(Intra-Group Experts Auxiliary Loss),有效平衡各GPU间的负载,从而在保持MoE性能的同时降低约20%的总参数量并实现稳定的GPU利用率。

链接: https://arxiv.org/abs/2604.23108
作者: Zhicheng Ma,Xiang Liu,Zhaoxiang Liu,Ning Wang,Yi Shen,Kai Wang,Shuming Shi,Shiguo Lian
机构: Data Science Artificial Intelligence Research Institute, China Unicom; Unicom Data Intelligence, China Unicom
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by ACL2026

点击查看摘要

Abstract:Large Language Models (LLMs) based on Mixture-of-Experts (MoE) are pivotal in industrial applications for their ability to scale performance efficiently. However, standard MoEs enforce uniform expert sizes,creating a rigidity that fails to align computational costs with varying token-level complexity. While heterogeneous expert architectures attempt to address this by diversifying expert sizes, they often suffer from significant system-level challenges, specifically unbalanced GPU utilization and inefficient parameter utilization, which hinder practical deployment. To bridge the gap between theoretical heterogeneity and robust industrial application, we propose Mixture of Heterogeneous Grouped Experts (MoHGE) which introduces a two-level routing mechanism to enable flexible, resource-aware expert combinations. To optimize inference efficiency, we propose a Group-Wise Auxiliary Loss, which dynamically steers tokens to the most parameter-efficient expert groups based on task difficulty. To address the critical deployment challenge of GPU load balancing, we introduce an All-size Group-decoupling Allocation strategy coupled with an Intra-Group Experts Auxiliary Loss. These mechanisms collectively ensure uniform computation distribution across GPUs. Extensive evaluations demonstrate that MoHGE matches the performance of MoE architectures while reducing the total parameters by approximately 20% and maintaining balanced GPU utilization. Our work establishes a scalable paradigm for resource-efficient MoE design, offering a practical solution for optimizing inference costs in real-world scenarios.

[NLP-119] Code Broker: A Multi-Agent System for Automated Code Quality Assessment

【速读】: 该论文旨在解决Python代码质量评估自动化与可操作性反馈生成的问题,传统方法难以同时兼顾多维度(如正确性、安全性、风格和可维护性)的全面分析与开发者友好的输出。解决方案的关键在于构建一个基于Google Agent Development Kit (ADK) 的多智能体系统Code Broker,其采用五层分层架构:由根协调器调度顺序执行的管道代理,后者并行调用三个专业化代理——正确性评估代理(Correctness Assessor)、风格评估代理(Style Assessor)和描述生成代理(Description Generator),最终通过改进推荐代理(Improvement Recommender)整合结果并生成Markdown和HTML格式的结构化报告。该系统融合大语言模型(LLM)推理与Pylint等静态分析工具的确定性信号,结合异步执行与重试机制提升鲁棒性,并探索轻量级会话内存以保留上下文信息,从而实现高效、可解释且面向开发者的代码质量反馈。

链接: https://arxiv.org/abs/2604.23088
作者: Samer Attrah
机构: Google(谷歌)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
备注: 8 pages, 1 figure, 2 tables, 28 references

点击查看摘要

Abstract:We present Code Broker, a multi agent system built with Google Agent Development Kit ADK that analyses Python code from files, local directories, or GitHub repositories and generates actionable quality assessment reports. The system employs a hierarchical five agents architecture in which a root orchestrator coordinates a sequential pipeline agent, which in turn dispatches three specialised agents in parallel a Correctness Assessor, a Style Assessor, and a Description Generator before synthesising findings through an Improvement Recommender. Reports score four dimensions correctness, security, style, and maintainability and are rendered in both Markdown and HTML. Code Broker combines LLM based reasoning with deterministic static-analysis signals from Pylint, uses asynchronous execution with retry logic to improve robustness, and explores lightweight session memory for retaining and querying prior assessment context. We position the paper as a technical report on system design and prompt or tool orchestration, and present a preliminary qualitative evaluation on representative Python codebases. The results suggest that parallel specialised agents produce readable, developer oriented feedback, while also highlighting current limitations in evaluation depth, security tooling, large repository handling, and the current use of only in memory persistence. All code and reproducibility materials are available at: this https URL.

[NLP-120] ContextWeaver: Selective and Dependency-Structured Memory Construction for LLM Agents

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)代理在长上下文交互中因传统上下文管理方法(如滑动窗口和提示压缩)导致早期结构化信息被忽略的问题,以及现有基于检索的记忆系统未能保留多步推理所需的因果与逻辑结构的局限性。其解决方案的关键在于提出ContextWeaver框架,通过依赖关系驱动的结构化记忆机制:(1) 构建并遍历以逻辑依赖为基础的推理步骤图谱,确保每一步与其依赖的前期步骤关联;(2) 对根节点到当前步骤的推理路径进行紧凑摘要,生成可复用的语义单元;(3) 引入轻量级验证层整合执行反馈,从而实现高效、稳定且可扩展的上下文管理,显著提升代理在工具调用任务中的推理效率与准确性。

链接: https://arxiv.org/abs/2604.23069
作者: Yating Wu,Yuhao Zhang,Sayan Ghosh,Sourya Basu,Anoop Deoras,Jun Huan,Gaurav Gupta
机构: AWS AI Labs (亚马逊云科技人工智能实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents often struggle in long-context interactions. As the agent accumulates more interaction history, context management approaches such as sliding window and prompt compression may omit earlier structured information that later steps rely on. Recent retrieval-based memory systems surface relevant content but still overlook the causal and logical structure needed for multi-step reasoning. We introduce ContextWeaver, a selective and dependency-structured memory framework that organizes an agent’s interaction trace into a graph of reasoning steps and selects the relevant context for future actions. Unlike prior context management approaches, ContextWeaver supports: (1) dependency-based construction and traversal that link each step to the earlier steps it relies on; (2) compact dependency summarization that condenses root-to-step reasoning paths into reusable units; and (3) a lightweight validation layer that incorporates execution feedback. On the SWE-Bench Verified and Lite benchmarks, ContextWeaver improves performance over a sliding-window baseline in pass@1, while reducing reasoning steps and token usage. Our observations suggest that modeling logical dependencies provides a stable and scalable memory mechanism for LLM agents that use tools.

[NLP-121] raining a General Purpose Automated Red Teaming Model

【速读】: 该论文旨在解决当前自动化红队测试(red teaming)方法在面对非安全类对抗目标时的局限性问题,即现有方法主要针对安全和内容审核场景设计,依赖预设的内容安全模型作为评估器,难以泛化至其他未被涵盖的对抗意图。其解决方案的关键在于提出一种无需训练时依赖预设评估器的红队模型训练流程,使模型能够学习并适应任意类型的对抗目标(包括未直接训练过的任务),并通过微调小型语言模型(如Qwen3-8B)实现对域内与域外对抗目标攻击生成能力的显著提升。

链接: https://arxiv.org/abs/2604.23067
作者: Aishwarya Padmakumar,Leon Derczynski,Traian Rebedea,Christopher Parisien
机构: NVIDIA
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automated methods for red teaming LLMs are an important tool to identify LLM vulnerabilities that may not be covered in static benchmarks, allowing for more thorough probing. They can also adapt to each specific LLM to discover weaknesses unique to it. Most current automated red teaming methods are intended for tackling safety and content moderation. Thus, they make use of content safety models as evaluators and optimize for circumventing them, and as such, have not been tested with other adversarial intents not typically captured by these. We propose a pipeline for training a red teaming model that can generalize to arbitrary adversarial goals, including objectives it has not been directly trained on, and that does not depend on the existence of a pre-existing evaluator available at training time. We demonstrate that finetuning small models, such as Qwen3-8B, using this pipeline results in a substantial improvement in their ability to generate attacks for both in and out of domain adversarial goals.

[NLP-122] Implicit Framing in Obstetric Counseling Notes: A Grounded LLM Pipeline on a VBAC-Eligible Cohort

【速读】: 该论文旨在解决产科临床咨询语言(clinical framing)对患者决策影响的量化分析问题,尤其关注剖宫产后阴道分娩(VBAC)与重复剖宫产(RCS)两种可行方案中,医生如何通过语言框架影响患者理解与选择。其解决方案的关键在于:首先利用结构化临床数据结合大语言模型(LLM)提取自由文本中的可验证证据,构建严格限定的VBAC适格人群队列以排除医学禁忌因素的干扰;随后采用零样本LLM框架对咨询段落进行分类,识别出不同风险导向的语言表达模式,从而揭示VBAC与RCS文档中咨询框架分布的显著差异——RCS记录中风险导向语言占比更高,且经统计检验确认,证明了受控的LLM驱动框架分析在产科咨询质量评估中的有效性。

链接: https://arxiv.org/abs/2604.23059
作者: Baris Karacan,Barbara Di Eugenio,Patrick Thornton,Joanna Tess,Subhash Kumar Kolar
机构: 未知
类目: Computation and Language (cs.CL)
备注: 10 pages. Accepted at IEEE ICHI 2026. This is the author-accepted manuscript

点击查看摘要

Abstract:Clinical framing – the linguistic manner in which clinical information is presented – can influence patient understanding and decision-making, with important implications for healthcare outcomes. Obstetrics is a high-stakes domain in which physicians counsel patients on delivery mode choices such as vaginal birth after cesarean (VBAC) and repeat cesarean section (RCS), yet counseling language remains underexplored in large-scale clinical text analysis. In this work, we analyze physician counseling language in 2,024 obstetric history and physical narratives for a rigorously defined cohort of patients for whom both VBAC and RCS were clinically viable options. To control for confounding due to medical contraindications, we first construct a VBAC-eligible cohort using structured clinical data supplemented by a large language model (LLM)-based extraction pipeline constrained to grounded, verbatim evidence from free-text narratives. We then apply a zero-shot LLM framework to categorize counseling segments into predefined framing categories capturing how physicians linguistically present delivery options. Our analysis reveals a significant difference in counseling framing distributions between VBAC and RCS notes; risk-focused language accounts for a substantially larger share of counseling segments in RCS documentation than in VBAC, with category-level differences confirmed by statistical testing, highlighting the value of controlled LLM-based framing analysis in obstetric care.

[NLP-123] DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在预测前瞻性临床试验结果方面的性能瓶颈问题。现有研究表明,传统相关性预测方法(如随机森林和逻辑回归)以及主流商业LLMs在此任务上表现有限。其解决方案的关键在于提出DeepImagine框架,通过连续的反事实想象(counterfactual imagining)来训练LLMs进行生物医学推理:具体而言,模型被教导在受控扰动实验条件(如剂量、结局指标、研究组别、地理区域等)下推断观察到的试验结果如何变化,从而逼近临床试验中隐藏的因果机制。该框架结合监督微调与强化学习策略,并引入可验证奖励信号和合成因果推理轨迹,以提升模型对试验级机制的理解能力与预测准确性。

链接: https://arxiv.org/abs/2604.23054
作者: Youze Zheng,Jianyou Wang,Yuhan Chen,Matthew Feng,Longtian Bao,Hanyuan Zhang,Maxim Khan,Aditya K. Sehgal,Christopher D. Rosin,Umber Dube,Ramamohan Paturi
机构: University of California, San Diego (加州大学圣地亚哥分校); University of Chicago (芝加哥大学); Elsevier (爱思唯尔)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint. Work in Progress

点击查看摘要

Abstract:Predicting the outcomes of prospective clinical trials remains a major challenge for large language models. Prior work has shown that both traditional correlational predictors, such as random forests and logistic regression, and strong commercial LLMs achieve limited performance on this task. In this paper, we propose DeepImagine, a framework for teaching LLMs biomedical reasoning through successive counterfactual imagining. The central idea is to approximate hidden causal mechanisms of clinical trials by training models to infer how observed trial results would change under controlled perturbations of experimental conditions, such as dosage, outcome measures, study arms, geography, and other trial attributes. To support this objective, we construct both natural and approximate counterfactual pairs from real clinical trials with reported outcomes. For settings where strict counterfactual supervision is available, such as paired outcome measures or dose-ranging study arms within the same trial, we train models with supervised fine-tuning. For broader settings where only approximate counterfactual pairs can be retrieved, we optimize models with reinforcement learning using verifiable rewards based on downstream benchmark correctness. We further augment training with synthetic reasoning traces that provide causally plausible explanations for local counterfactual transitions. Using this pipeline, we train language models under 10B parameters, including Qwen3.5-9B, and evaluate them on clinical trial outcome prediction. We aim to show that DeepImagine consistently improves over untuned language models and traditional correlational baselines. Finally, we aim to show that the learned reasoning trajectories provide interpretable signals about how models represent trial-level mechanisms, suggesting a practical path toward more mechanistic and scientifically useful biomedical language models.

[NLP-124] Evaluating Temporal Consistency in Multi-Turn Language Models ACL2026

【速读】: 该论文旨在解决语言模型在交互式场景中维持和更新隐含时间假设的能力不足问题,即Temporal Scope Stability(时间范围稳定性)问题。其核心挑战在于模型能否在多轮对话中正确保留、覆盖或传递与时间相关的事实上下文,尤其是在后续问题未明确标注时间信息时。解决方案的关键是提出ChronoScope——一个大规模诊断基准,通过超过一百万条基于Wikidata生成的确定性问题链,在受控的多轮交互中精确隔离并评估模型的时间范围行为,从而揭示当前主流语言模型在长时间序列推理中普遍存在的时间假设漂移现象,并量化其与单轮事实准确性之间的差距。

链接: https://arxiv.org/abs/2604.23051
作者: Yash Kumar Atri,Steven L. Johnson,Tom Hartvigsen
机构: University of Virginia (弗吉尼亚大学)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026

点击查看摘要

Abstract:Language models are increasingly deployed in interactive settings where users reason about facts over time rather than in isolation. In such scenarios, correct behavior requires models to maintain and update implicit temporal assumptions established earlier in a conversation. We study this challenge through the lens of temporal scope stability: the ability to preserve, override, or transfer time-scoped factual context across dialogue turns. We introduce ChronoScope, a large-scale diagnostic benchmark designed to isolate temporal scope behavior in controlled multi-turn interactions, comprising over one million deterministically generated question chains grounded in Wikidata. ChronoScope evaluates whether models can correctly retain inferred temporal scope when follow-up questions omit explicit time references, spanning implicit carryover, explicit scope switching, cross-entity transfer, and longer temporal trajectories. Through extensive evaluation of state-of-the-art language models, we find that temporal scope stability is frequently violated in controlled multi-turn settings, with models often drifting toward present-day assumptions despite correct underlying knowledge. These failures intensify with interaction length and persist even under oracle context conditions, revealing a gap between single-turn factual accuracy and coherent temporal reasoning under sequential interaction. We make our dataset and evaluation suite publicly available at this https URL

[NLP-125] Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

【速读】: 该论文旨在解决专家混合模型(MoE)在监督微调(SFT)过程中因路由器层脆弱性导致的性能下降问题,特别是现有方法如DenseMixer和ESFT虽能缓解路由坍缩(router collapse),但引入了噪声梯度从而损害模型表现。其解决方案的关键在于提出一种无辅助损失的MoE SFT框架:通过基于偏置的稀疏化策略(bias-driven sparsification)引导任务相关专家保持活跃,同时让长尾专家趋向 inactive,并引入始终激活的门控压缩专家(gated condenser experts)作为稳定的信息传递路径,从而避免梯度饥饿并有效整合稀疏激活专家中的知识,显著提升下游任务性能。

链接: https://arxiv.org/abs/2604.23036
作者: Haoze He,Xingyuan Ding,Xuan Jiang,Xinkai Zou,Alex Cheng,Yibo Zhao,Juncheng Billy Li,Heather Miller
机构: Carnegie Mellon University (卡内基梅隆大学); MIT (麻省理工学院); UCSD (加州大学圣地亚哥分校); Johns Hopkins University (约翰霍普金斯大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 36 pages

点击查看摘要

Abstract:Despite MoE models leading many benchmarks, supervised fine-tuning (SFT) for the MoE architectures remains difficult because its router layers are fragile. Methods such as DenseMixer and ESFT mitigate router collapse with dense mixing or auxiliary load-balancing losses, but these introduce noisy gradients that often degrade performance. In preliminary experiments, we systematically pruned experts and observed that while certain super experts are activated far more frequently, discarding less used experts still leads to notable performance degradation. This suggests that even rarely activated experts encode non-trivial knowledge useful for downstream tasks. Motivated by this, we propose an auxiliary-loss-free MoE SFT framework that combines bias-driven sparsification with always-active gated condenser experts. Rather than enforcing balanced activation across all experts, our method encourages task-relevant experts to remain active while pushing long-tailed experts toward inactivity. The condenser experts provide a persistent, learnable pathway that alleviates gradient starvation and facilitates consolidation of information that would otherwise remain fragmented across sparsely activated experts. Analysis further suggest that this design better preserves long-tailed expert information under sparse routing. Experiments on large-scale MoE models demonstrate that our approach outperforms state-of-the-art SFT baselines such as DenseMixer and ESFT, achieving average gain of 2.5%+ on both mathematical reasoning and commonsenseQA benchmarks.

[NLP-126] Chinese-SkillSpan: A Span-Level Dataset for ESCO-Aligned Competency Extraction from Chinese Job Ads

【速读】: 该论文旨在解决中文招聘文本中技能信息自动抽取的难题,即Job Skill Named Entity Recognition (JobSkillNER),以提升人才市场匹配效率并支持个性化就业服务。其解决方案的关键在于构建了首个面向中文招聘文本的JobSkillNER数据集——Chinese-SkillSpan,并提出了一种由大语言模型(LLM)驱动的宏观-微观协同标注流程:首先利用LLM的上下文理解能力进行初步标注,再通过专家对句子级别结果进行精细化修正,从而确保标注质量与ESC(O)职业能力标准(涵盖知识、技能、跨领域能力及语言能力四个维度)的一致性。该方法显著提升了中文技能实体识别资源的可用性与规范性。

链接: https://arxiv.org/abs/2604.23009
作者: Guojing Li,Zichuan Fu,Junyi Li,Wenxia Zhou,Xinyang Wu,Jinning Yang,Jingtong Gao,Feng Huang,Xiangyu Zhao
机构: 未知
类目: Computation and Language (cs.CL)
备注: 18 pages, 10 figures, 3 tables

点击查看摘要

Abstract:Job Skill Named Entity Recognition (JobSkillNER) aims to automatically extract key skill information from large-scale job posting data, which is important for improving talent-market matching efficiency and supporting personalized employment services. To the best of our knowledge, this work presents the first Chinese JobSkillNER dataset for recruitment texts. We propose annotation guidelines tailored to Chinese job postings and an LLM-empowered Macro-Micro collaborative annotation pipeline. The pipeline leverages the contextual understanding ability of large language models (LLMs) for initial annotation and then refines the results through expert sentence-level adjudication. Using this pipeline, we annotate more than 20,000 instances collected from four major recruitment platforms over the period 2014-2025. Based on these efforts, we release Chinese-SkillSpan, the first Chinese JobSkillNER dataset aligned with the ESCO occupational skill standard across four dimensions: knowledge, skill, transversal competence, and language competence (LSKT). Experimental results show that the dataset supports effective model training and evaluation, indicating that Chinese-SkillSpan helps fill a major gap in Chinese JobSkillNER resources and provides a useful benchmark for intelligent recruitment research. Code and data are available at this https URL .

[NLP-127] FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agent ic Code Generation in Lean ACL2026

【速读】: 该论文旨在解决将非形式化的科学数学推理(如物理学中的狄拉克符号、矢量微积分等)自动转化为可验证的形式化代码这一挑战,尤其针对当前大语言模型(LLM)和代理方法在科学领域中难以处理特定领域语法与语义的问题。其解决方案的关键在于提出一个领域无关的人机协作代理流程(FormalScience),该流程允许单个领域专家(无需深入形式语言知识)以低成本生成语法正确且语义对齐的形式化证明;并通过构建包含200个大学级别物理问题及其Lean4形式表示的Dataset(FormalPhysics),系统性地评估了现有LLM在零样本提示、自精炼及新型多阶段代理策略下的自动形式化能力,并首次从“符号坍缩”(notational collapse)和“抽象提升”(abstraction elevation)等概念出发,刻画了物理自动形式化中语义漂移(semantic drift)的现象,揭示了当完全语义保真不可达时,形式语言实际验证的内容。

链接: https://arxiv.org/abs/2604.23002
作者: Jordan Meadows,Lan Zhang,Andre Freitas
机构: University of Manchester, UK; Idiap Research Institute, Switzerland; National Biomarker Centre, CRUK-MI, UK
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ACL 2026

点击查看摘要

Abstract:Formalising informal mathematical reasoning into formally verifiable code is a significant challenge for large language models. In scientific fields such as physics, domain-specific machinery (\textite.g. Dirac notation, vector calculus) imposes additional formalisation challenges that modern LLMs and agentic approaches have yet to tackle. To aid autoformalisation in scientific domains, we present FormalScience; a domain-agnostic human-in-the-loop agentic pipeline that enables a single domain expert (without deep formal language experience) to produce \textitsyntactically correct and \textitsemantically aligned formal proofs of informal reasoning for low economic cost. Applying FormalScience to physics, we construct FormalPhysics, a dataset of 200 university-level (LaTeX) physics problems and solutions (primarily quantum mechanics and electromagnetism), along with their Lean4 formal representations. Compared to existing formal math benchmarks, FormalPhysics achieves perfect formal validity and exhibits greater statement complexity. We evaluate open-source models and proprietary systems on a statement autoformalisation task on our dataset via zero-shot prompting, self-refinement with error feedback, and a novel multi-stage agentic approach, and explore autoformalisation limitations in modern LLM-based approaches. We provide the first systematic characterisation of semantic drift in physics autoformalisation in terms of concepts such as notational collapse and abstraction elevation which reveals what formal language verifies when full semantic preservation is unattainable. We release the codebase together with an interactive UI-based FormalScience system which facilitates autoformalisation and theorem proving in scientific domains beyond this http URL://github.com/jmeadows17/formal-science

[NLP-128] Uncertainty Quantification for LLM Function-Calling

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在函数调用(Function-Calling, FC)场景中因错误调用不可逆操作(如转账或删除数据)而导致严重后果的问题,其核心在于如何有效量化LLM对函数调用正确性的置信度。解决方案的关键在于评估和改进不确定性量化(Uncertainty Quantification, UQ)方法在FC任务中的表现:研究发现,传统多样本UQ方法(如语义熵)在自然语言问答任务中表现优异,但在FC场景下并未显著优于简单的单样本方法;进一步地,通过利用FC输出的结构特性——例如基于抽象语法树(Abstract Syntax Tree, AST)聚类多样本输出以增强多样本UQ效果,以及在计算基于logit的不确定性分数时仅保留语义有意义的token来优化单样本UQ方法,可显著提升现有UQ方法在FC设置下的性能。

链接: https://arxiv.org/abs/2604.22985
作者: Zihuiwen Ye,Lukas Aichberger,Michael Kirchhof,Sinead Williamson,Luca Zappella,Yarin Gal,Arno Blaas,Adam Golinski
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed to autonomously solve real-world tasks. A key ingredient for this is the LLM Function-Calling paradigm, a widely used approach for equipping LLMs with tool-use capabilities. However, an LLM calling functions incorrectly can have severe implications, especially when their effects are irreversible, e.g., transferring money or deleting data. Hence, it is of paramount importance to consider the LLM’s confidence that a function call solves the task correctly prior to executing it. Uncertainty Quantification (UQ) methods can be used to quantify this confidence and prevent potentially incorrect function calls. In this work, we present what is, to our knowledge, the first evaluation of UQ methods for LLM Function-Calling (FC). While multi-sample UQ methods, such as Semantic Entropy, show strong performance for natural language QA tasks, we find that in the FC setting, it offers no clear advantage over simple single-sample UQ methods. Additionally, we find that the particularities of FC outputs can be leveraged to improve the performance of existing UQ methods in this setting. Specifically, multi-sample UQ methods benefit from clustering FC outputs based on their abstract syntax tree parsing, while single-sample UQ methods can be improved by selecting only semantically meaningful tokens when calculating logit-based uncertainty scores.

[NLP-129] he Power of Power Law: Asymmetry Enables Compositional Reasoning

【速读】: 该论文试图解决的问题是:在训练生成式 AI 模型时,如何选择更有效的数据分布以提升对低频技能(即长尾技能)的学习效率。传统直觉认为,通过重加权或筛选数据使其趋向均匀分布,有助于模型更好地学习这些稀有技能;然而,论文发现,在多种组合推理任务(如状态跟踪和多步算术)中,基于幂律分布(power-law distribution)的数据训练反而显著优于均匀分布训练。其解决方案的关键在于:幂律采样能诱导一种有益的不对称性,改善损失函数的病态结构,使模型能够以较低的数据复杂度先掌握高频技能组合,进而作为跳板高效学习罕见的长尾技能。这一机制揭示了数据分布对模型训练效率的深层影响,并为优化训练策略提供了新的理论依据。

链接: https://arxiv.org/abs/2604.22951
作者: Zixuan Wang,Xingyu Dang,Jason D. Lee,Kaifeng Lyu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help models better learn these long-tail skills, we find a counterintuitive result: across a wide range of compositional reasoning tasks, such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions. To understand this advantage, we introduce a minimalist skill-composition task and show that learning under a power-law distribution provably requires significantly less training data. Our theoretical analysis reveals that power law sampling induces a beneficial asymmetry that improves the pathological loss landscape, which enables models to first acquire high-frequency skill compositions with low data complexity, which in turn serves as a stepping stone to efficiently learn rare long-tailed skills. Our results offer an alternative perspective on what constitutes an effective data distribution for training models.

[NLP-130] AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在训练和推理阶段中验证(Verification)能力不足的问题,尤其是当前验证方法存在表达性与可控性之间的权衡:基于LLM的验证器虽具表达力但易出错且难以控制,而确定性的可执行验证器虽可靠可解释却能力受限。解决方案的关键在于提出AutoPyVerifier框架,其核心机制是利用LLM自动生成候选Python验证函数,并通过在有向无环图(Directed Acyclic Graph, DAG)上的搜索策略对这些函数进行系统性筛选与优化,从而自动构建一组最小且高效的确定性验证器集合,使其联合满足度最大程度逼近目标验证指标(如正确性)。该方法显著提升了验证准确性,并实验证明其可作为外部工具增强下游任务性能。

链接: https://arxiv.org/abs/2604.22937
作者: Pouya Pezeshkpour,Estevam Hruschka
机构: Megagon Labs
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Verification is becoming central to both reinforcement-learning-based training and inference-time control of large language models (LLMs). Yet current verifiers face a fundamental trade-off: LLM-based verifiers are expressive but hard to control and prone to error, while deterministic executable verifiers are reliable and interpretable but often limited in capability. We study the following question: given a development set of LLM outputs and labels for a target objective, such as correctness, can we automatically induce a minimal set of Python verifiers whose joint satisfaction closely matches that objective? We propose AutoPyVerifier, a framework that uses an LLM to synthesize candidate verifier functions and then refines them through search over a directed acyclic graph (DAG). By navigating the DAG, AutoPyVerifier systematically explores the space of deterministic executable verifiers and selects a compact verifier set whose joint satisfaction best approximates the target objective. Across mathematical reasoning, coding, function calling, and instruction-following benchmarks for several state-of-the-art LLMs, AutoPyVerifier improves target-objective prediction by up to 55.0 F1 points over the initial LLM-generated verifier sets. Additional analyses show that the most useful verification targets vary by benchmark and model, and that the DAG-based search shifts the learned verifier sets toward more structural and semantically grounded checks. We further show that exposing the discovered verifier set to an LLM as an external tool improves downstream accuracy by up to 17.0 points. We release our code

[NLP-131] PExA: Parallel Exploration Agent for Complex Text-to-SQL ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)驱动的文本到SQL(text-to-SQL)代理在性能与延迟之间的权衡问题,即提升准确率通常会增加推理延迟,反之亦然。其解决方案的关键在于将文本到SQL生成过程重新建模为软件测试覆盖率问题:通过构造一组简化且原子化的SQL测试用例并行执行,以确保覆盖原始查询的语义空间;在迭代扩展测试用例覆盖范围的基础上,仅当收集到足够信息后才生成最终SQL,从而利用已探索的测试用例SQL来约束和引导最终生成过程。该方法在Spider 2.0基准上实现了70.2%的执行准确率,达到当前最优水平。

链接: https://arxiv.org/abs/2604.22934
作者: Tanmay Parekh,Ella Hofmann-Coyle,Shuyi Wang,Sachith Sri Ram Kothur,Srivas Prasad,Yunmo Chen
机构: University of California Los Angeles (加州大学洛杉矶分校); Bloomberg (彭博)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at ACL 2026

点击查看摘要

Abstract:LLM-based agents for text-to-SQL often struggle with latency-performance trade-off, where performance improvements come at the cost of latency or vice versa. We reformulate text-to-SQL generation within the lens of software test coverage where the original query is prepared with a suite of test cases with simpler, atomic SQLs that are executed in parallel and together ensure semantic coverage of the original query. After iterating on test case coverage, the final SQL is generated only when enough information is gathered, leveraging the explored test case SQLs to ground the final generation. We validated our framework on a state-of-the-art benchmark for text-to-SQL, Spider 2.0, achieving a new state-of-the-art with 70.2% execution accuracy.

[NLP-132] Quantifying and Mitigating Self-Preference Bias of LLM Judges

【速读】: 该论文旨在解决大语言模型作为评判者(LLM-as-a-Judge)在自动化评估系统中普遍存在的自偏好偏差(Self-Preference Bias, SPB)问题,即模型在评估自身生成内容时存在系统性倾向,导致评价结果失真,进而影响模型对齐、排行榜构建和质量控制等关键应用的可靠性。解决方案的关键在于提出一个完全自动化的框架,通过构造质量差异可忽略的响应对(equal-quality pairs),实现判别能力与偏倚倾向的统计解耦,无需人工标注即可量化并缓解SPB;进一步地,作者基于认知负荷分解提出结构化的多维评估策略,在20个主流大语言模型上平均将SPB降低31.5%。

链接: https://arxiv.org/abs/2604.22891
作者: Jinming Yang,Chuxian Qiu,Zhenyu Deng,Xinshan Jiao,Tao Zhou
机构: CompleX Lab, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLM-as-a-Judge has become a dominant approach in automated evaluation systems, playing critical roles in model alignment, leaderboard construction, quality control, and so on. However, the scalability and trustworthiness of this approach can be substantially distorted by Self-Preference Bias (SPB), which is a directional evaluative deviation in which LLMs systematically favor or disfavor their own generated outputs during evaluation. Existing measurements rely on costly human annotations and conflate generative capability with evaluative stance, and thus are impractical for large-scale deployment in real-world systems. To address this issue, we introduce a fully automated framework to quantifying and mitigating SPB, which constructs equal-quality pairs of responses with negligible quality differences, enabling statistical disentanglement of discriminability from bias propensity without human gold standards. Empirical analysis across 20 mainstream LLMs reveals that advanced capabilities are often uncorrelated, or even negatively correlated, with low SPB. To mitigate this bias, we propose a structured multi-dimensional evaluation strategy grounded in cognitive load decomposition, which reduces SPB by 31.5% on average.

[NLP-133] xOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction ACL2026

【速读】: 该论文旨在解决科学文献PDF页面级重建为可编译LaTeX文档的问题,现有光学字符识别(OCR)系统多聚焦于纯文本或Markdown格式,忽视了LaTeX在科学出版中所依赖的结构化与可执行特性。其解决方案的关键在于构建了一个多维评估基准TexOCR-Bench和大规模训练语料库TexOCR-Train,并采用监督微调(SFT)与基于可验证奖励(来自LaTeX单元测试)的强化学习(RL)相结合的方法训练出一个20亿参数模型TexOCR。该方法通过直接强化编译正确性和引用完整性等关键文档不变量,显著提升了生成结果的结构忠实度和端到端可编译性,实验证明RL相较于SFT能更稳定地改善结构和编译性能指标。

链接: https://arxiv.org/abs/2604.22880
作者: Chengye Wang,Lin Fu,Zexi Kuang,Yilun Zhao
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2026 Main

点击查看摘要

Abstract:Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compilable LaTeX and introduce TexOCR-Bench, a benchmark, and TexOCR-Train, a large-scale training corpus, for this task. TexOCR-Bench features a multi-dimensional evaluation suite that jointly assesses transcription fidelity, structural faithfulness, and end-to-end compilability. Leveraging TexOCR-Train, we train a 2B-parameter model, TexOCR, using supervised fine-tuning (SFT) and reinforcement learning (RL) with verifiable rewards derived from LaTeX unit tests that directly enforce compilability and referential integrity. Experiments across 21 frontier models on TexOCR-Bench show that existing systems frequently violate key document invariants, including consistent section structure, correct float placement, and valid label-reference links, which undermines compilation reliability and downstream usability. Our analysis further reveals that RL with verifiable rewards yields consistent improvements over SFT alone, particularly on structural and compilation metrics.

[NLP-134] EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving

【速读】: 该论文旨在解决当前视觉-语言模型(Vision-Language Models, VLMs)在自动驾驶场景中对自身运动(ego-motion)的物理合理性理解不足的问题,即模型虽具备逻辑上的物理概念,却难以将这些概念准确地与视觉观测对齐,导致其推理结果缺乏物理一致性。解决方案的关键在于提出一个诊断性基准EgoDyn-Bench,通过确定性映射将连续车辆动力学转化为离散运动概念,从而解耦模型内部的物理逻辑与视觉感知;进一步实验表明,引入显式的轨迹编码可显著提升所有评估模型的物理一致性,揭示出一种功能上的解耦结构:ego-motion逻辑几乎完全依赖于语言模态,而视觉信息对物理推理贡献甚微。这一发现为构建物理一致的具身人工智能提供了标准化诊断框架和可操作的改进路径。

链接: https://arxiv.org/abs/2604.22851
作者: Finn Rasmus Schäfer,Yuan Gao,Dingrui Wang,Thomas Stauner,Stephan Günnemann,Mattia Piccinini,Sebastian Schmidt,Johannes Betz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)
备注: 36 Pages, under review

点击查看摘要

Abstract:While Vision-Language Models (VLMs) have advanced highlevel reasoning in autonomous driving, their ability to ground this reasoning in the underlying physics of ego-motion remains poorly understood. We introduce EgoDyn-Bench, a diagnostic benchmark for evaluating the semantic ego-motion understanding of vision-centric foundation models. By mapping continuous vehicle kinematics to discrete motion concepts via a deterministic oracle, we decouple a model’s internal physical logic from its visual perception. Our large-scale empirical audit spanning 20 + models, including closed-source MLLMs, open-source VLMs across multiple scales, and specialized VLAs, identifies a significant Perception Bottleneck: while models exhibit logical physical concepts, they consistently fail to accurately align them with visual observations, frequently underperforming classical non-learned geometric baselines. This failure persists across model scales and domain-specific training, indicating a structural deficit in how current architectures couple visual perception with physical reasoning. We demonstrate that providing explicit trajectory encodings substantially restores physical consistency across all evaluated models, revealing a functional disentanglement between vision and language: egomotion logic is derived almost exclusively from the language modality, while visual observations contribute negligible additional signal. This structural finding provides a standardized diagnostic framework and a practical pathway toward physically aligned embodied AI. Keywords: Ego-motion - Physical Reasoning - Foundation Models

[NLP-135] AeSlides: Incentivizing Aesthetic Layout in LLM -Based Slide Generation via Verifiable Rewards

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在幻灯片(slide)生成任务中因模态鸿沟(modality gap)导致的视觉美学质量低下问题:当前模型以文本为中心进行生成,但幻灯片质量主要由视觉布局决定,现有方法要么依赖高成本的视觉反思机制,要么通过大规模数据微调提供间接美学监督,效果有限。解决方案的关键在于提出 AeSlides,一个基于可验证奖励的强化学习框架,首次将明确的美学原则作为直接监督信号引入幻灯片生成过程;其核心创新是设计了一套高效、低成本且可验证的布局质量度量指标,用于量化关键排版问题,并基于 GRPO(Generalized Reward Policy Optimization)算法对模型进行端到端优化,从而显著提升幻灯片在长宽比合规性、空白区域减少、元素碰撞抑制和视觉平衡等方面的性能,同时获得人类评估的显著提升。

链接: https://arxiv.org/abs/2604.22840
作者: Yiming Pan,Chengwei Hu,Xuancheng Huang,Can Huang,Mingming Zhao,Yuean Bi,Xiaohan Zhang,Aohan Zeng,Linmei Hu
机构: Beijing Institute of Technology (北京理工大学); Zhipu AI (Z.ai)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: 21 pages, 25 figures, 9 tables

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong potential in agentic tasks, particularly in slide generation. However, slide generation poses a fundamental challenge: the generation process is text-centric, whereas its quality is governed by visual aesthetics. This modality gap leads current models to frequently produce slides with aesthetically suboptimal layouts. Existing solutions typically rely either on heavy visual reflection, which incurs high inference cost yet yields limited gains; or on fine-tuning with large-scale datasets, which still provides weak and indirect aesthetic supervision. In contrast, the explicit use of aesthetic principles as supervision remains unexplored. In this work, we present AeSlides, a reinforcement learning framework with verifiable rewards for Aesthetic layout supervision in Slide generation. We introduce a suite of meticulously designed verifiable metrics to quantify slide layout quality, capturing key layout issues in an accurate, efficient, and low-cost manner. Leveraging these verifiable metrics, we develop a GRPO-based reinforcement learning method that directly optimizes slide generation models for aesthetically coherent layouts. With only 5K training prompts on GLM-4.7-Flash, AeSlides improves aspect ratio compliance from 36% to 85%, while reducing whitespace by 44%, element collisions by 43%, and visual imbalance by 28%. Human evaluation further shows a substantial improvement in overall quality, increasing scores from 3.31 to 3.56 (+7.6%), outperforming both model-based reward optimization and reflection-based agentic approaches, and even edging out Claude-Sonnet-4.5. These results demonstrate that such a verifiable aesthetic paradigm provides an efficient and scalable approach to aligning slide generation with human aesthetic preferences. Our repository is available at this https URL.

[NLP-136] KARL: Mitigating Hallucinations in LLM s via Knowledge-Boundary-Aware Reinforcement Learning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对超出其知识边界的问题时易产生幻觉(hallucination)的问题,尤其针对现有强化学习方法因静态奖励机制导致模型过度保守、牺牲准确率的缺陷。解决方案的关键在于提出KARL框架,其核心创新包括:一是引入知识边界感知奖励(Knowledge-Boundary-Aware Reward),通过组内响应统计在线估计模型的知识边界,动态奖励正确回答或引导合理拒答;二是设计两阶段强化学习训练策略,先探索知识边界以规避“拒答陷阱”(abstention trap),再将越界错误答案转化为拒答而不损失准确性,从而实现准确率与幻觉抑制之间的最优平衡。

链接: https://arxiv.org/abs/2604.22779
作者: Cheng Gao,Cheng Huang,Kangyang Luo,Ziqing Qiao,Shuzheng Si,Huimin Chen,Chaojun Xiao,Maosong Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21 pages, 8 figures

点击查看摘要

Abstract:Enabling large language models (LLMs) to appropriately abstain from answering questions beyond their knowledge is crucial for mitigating hallucinations. While existing reinforcement learning methods foster autonomous abstention, they often compromise answer accuracy because their static reward mechanisms, agnostic to models’ knowledge boundaries, drive models toward excessive caution. In this work, we propose KARL, a novel framework that continuously aligns an LLM’s abstention behavior with its evolving knowledge boundary. KARL introduces two core innovations: a Knowledge-Boundary-Aware Reward that performs online knowledge boundary estimation using within-group response statistics, dynamically rewarding correct answers or guided abstention; and a Two-Stage RL Training Strategy that first explores the knowledge boundary and bypasses the “abstention trap”, and subsequently converts incorrect answers beyond the knowledge boundary into abstentions without sacrificing accuracy. Extensive experiments on multiple benchmarks demonstrate that KARL achieves a superior accuracy-hallucination trade-off, effectively suppressing hallucinations while maintaining high accuracy across both in-distribution and out-of-distribution scenarios.

[NLP-137] he Randomness Floor: Measuring Intrinsic Non-Randomness in Language Model Token Distributions

【速读】: 该论文旨在解决预训练语言模型中固有非随机性(intrinsic non-randomness)的量化与归因问题,即区分模型输出的非随机性是由上下文诱导还是源于模型权重本身的结构特性。解决方案的关键在于提出并系统测量“熵偏差”(Entropic Deviation, ED),定义为模型词元分布与均匀分布之间的归一化KL散度,并在多种模型架构(Transformer与状态空间模型)、提示类别、温度参数及语言环境下进行大规模实验验证。结果表明,即使在语义中立提示下,Transformer模型仍保持约0.30的ED值,说明其大部分非随机性来自训练所得权重而非上下文;而状态空间模型Mamba2则表现出更高的ED和更强的温度敏感性,揭示了不同架构对内在随机性的调控机制差异。

链接: https://arxiv.org/abs/2604.22771
作者: Jarosław Hryszko
机构: Jagiellonian University (雅盖隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Language models cannot be random. This paper introduces Entropic Deviation (ED), the normalised KL divergence between a model’s token distribution and the uniform distribution, and measures it systematically across 31,200 generations spanning seven models, two architectures (transformer and state space), nine prompt categories, three temperatures, and five languages. Under semantically neutral prompts (empty strings, random characters, nonsense syllables) transformers still exhibit ED of approximately 0.30, meaning that 88-93% of the non-randomness observed under semantic prompts is intrinsic to the learned weights rather than induced by context. Three transformer families (Gemma, Llama, Qwen) converge on nearly identical ED values despite different training data and vocabularies. A state space model (Mamba2) reveals a qualitatively different regime: twice the ED, three times lower within-sequence variance, and massive sensitivity to temperature (r = -0.78) where transformers are nearly immune (r 0.05). Cross-lingual experiments with Qwen-32B show a stable gradient across five languages (English, Japanese, Chinese, Polish, Arabic) that does not correlate with token fertility and persists when two languages sharing an identical tokeniser subset are compared. These findings establish a structural lower bound on randomness in pretrained language models, characterise how this bound differs across architectures, and demonstrate that language itself modulates the bound independently of tokenisation.

[NLP-138] Secure On-Premise Deployment of Open-Weights Large Language Models in Radiology: An Isolation-First Architecture with Prospective Pilot Evaluation

【速读】: 该论文旨在解决在放射学领域部署自托管大语言模型(Large Language Model, LLM)时面临的监管合规性难题,特别是如何在不违反隐私保护法规的前提下实现临床可用的生成式AI服务。其核心挑战在于确保未脱敏的患者健康信息(Protected Health Information, PHI)处理符合“最小权限原则”,同时保障系统的技术可行性与临床实用性。解决方案的关键在于提出并实施一种“隔离优先”(isolation-first)的本地化架构:通过严格的网络分段、主机级出站流量过滤及主动隔离监控机制,防止未经授权的外部连接;结合自动化隔离与加固测试包,实现可重复验证的安全部署流程。该方案成功获得医院管理层和各合规部门批准,并在为期一周的试点中验证了其稳定性与临床价值,尤其在基于源文本的任务(如报告修正或简化)中表现出高实用性和低错误率,为德国大学附属医院大规模部署开源权重LLM提供了可落地的技术路径。

链接: https://arxiv.org/abs/2604.22768
作者: Sebastian Nowak,Jann-Frederick Laß,Narine Mesropyan,Babak Salam,Nico Piel,Mohammed Bahaaeldin,Wolfgang Block,Alois Martin Sprinkart,Julian Alexander Luetkens,Benjamin Wulff,Alexander Isaak
机构: University Hospital Bonn (波恩大学医院)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: 39 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Purpose: To design, implement, evaluate, and report on the regulatory requirements of a self-hosted LLM infrastructure for radiology adhering to the principle of least privilege, emphasizing technical feasibility, network isolation, and clinical utility. Materials and Methods: The isolation-first, containerized LLM inference stack relies on strict network segmentation, host-enforced egress filtering, and active isolation monitoring preventing unauthorized external connectivity. An accompanying deployment package provides automated isolation and hardening tests. The system served the open-weights DeepSeek-R1 model via vLLM. In a one-week pilot phase, 22 residents and radiologists were free to use 10 predefined prompt-templates whenever they considered them useful in daily work. Afterward, they rated clinical utility and system stability on an 0-10 Likert scale and reported observed critical errors in model output. Results: The applied institutional governance pathway achieved approval from clinic management, compliance, data protection and information security officers for processing unanonymized PHI. The system was rated stable and user friendly during the pilot. Source text-anchored tasks, such as report corrections or simplifications, and radiology guideline recommendations received the highest utility ratings, whereas open-ended conclusion generation based on findings resulted in the highest frequency of critical errors, such as clinically relevant hallucinations or omissions. Conclusion: The proposed isolation-first on-premise architecture enabled overcoming regulatory borders, showed promising clinical utility in text-anchored tasks and is the current base to serve open-weights LLMs as an official service of a German University Hospital with over 10,000 employees. The deployment package were made publicly available (this https URL). Comments: 39 pages, 4 figures, 3 tables Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL) Cite as: arXiv:2604.22768 [cs.CY] (or arXiv:2604.22768v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2604.22768 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Sebastian Nowak [view email] [v1] Wed, 25 Mar 2026 17:38:20 UTC (2,616 KB)

[NLP-139] HalalBench: A Multilingual OCR Benchmark for Food Packaging Ingredient Extraction

【速读】: 该论文旨在解决食品包装OCR(光学字符识别)缺乏标准化评估基准的问题,这一问题制约了自动化清真食品验证系统的性能提升。现有基准主要针对通用文档或场景文本,未能涵盖食品标签特有的挑战,如曲面变形、密集多语言文本及小于8pt的微小字体。解决方案的关键在于构建首个开源多语言食品包装OCR基准HalalBench,包含1,043张图像(50张真实+993张合成)和36,438个标注,覆盖14种语言,并以COCO格式组织;同时提出一种后处理算法,在消融实验中使F1分数提升36%,并通过实际部署的HalalLens生产级扫描工具验证了方法的有效性。

链接: https://arxiv.org/abs/2604.22754
作者: Hasan Arief
机构: HalalLens Research
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 8 pages, 6 figures, 7 tables

点击查看摘要

Abstract:No standardized benchmark exists for evaluating OCR on food packaging, despite its critical role in automated halal food verification. Existing benchmarks target documents or scene text, missing the unique challenges of ingredient labels: curved surfaces, dense multilingual text, and sub-8pt fonts. We present HalalBench, the first open multilingual benchmark for food packaging OCR, comprising 1,043 images (50 real, 993 synthetic) with 36,438 annotations in COCO format spanning 14 languages. We evaluate four engines: docTR achieves F1=0.193, ML Kit 0.180, EasyOCR 0.167, while all fail on Japanese (F1=0.000). A clustering ablation shows 36% F1 improvement from our post-processing algorithm. We validate findings through HalalLens (this https URL), a production halal scanner serving 20+ countries. Dataset and code are released under open licenses.

[NLP-140] CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging NAACL2025

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在代码生成与问题求解中对初始代码质量依赖性强的问题,即当前基于外部工具的迭代调试方法效果受限于初始生成代码的质量,而这一环节仍是一个开放挑战。其解决方案的关键在于提出了一种名为CodeSim的多智能体代码生成框架,该框架通过类人类感知的方式,系统性地覆盖程序合成-规划、编码和调试三个阶段;尤其创新性地引入了基于输入/输出逐步模拟的计划验证与内部调试机制,使模型能够像人类一样通过可视化仿真验证算法理解,从而显著提升代码生成的准确性与鲁棒性。

链接: https://arxiv.org/abs/2502.05664
作者: Md. Ashraful Islam,Mohammed Eunus Ali,Md Rizwan Parvez
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted in NAACL 2025 Findings

点击查看摘要

Abstract:Large Language Models (LLMs) have made significant strides in code generation and problem solving. Current approaches employ external tool-based iterative debuggers that use compiler or other tool-based runtime feedback to refine coarse programs generated by various methods. However, the effectiveness of these approaches heavily relies on the quality of the initial code generation, which remains an open challenge. In this paper, we introduce CodeSim, a novel multi-agent code generation framework that comprehensively addresses the stages of program synthesis-planning, coding, and debugging-through a human-like perception approach. As human verifies their understanding of any algorithms through visual simulation, CodeSim uniquely features a method of plan verification and internal debugging through the step-by-step simulation of input/output. Extensive experiments across seven challenging competitive problem-solving and program synthesis benchmarks demonstrate CodeSim’s remarkable code generation capabilities. Our framework achieves new state-of-the-art (pass@1) results-(HumanEval 95.1%, MBPP 90.7%, APPS 22%, and CodeContests 29.1%). Furthermore, our method shows potential for even greater enhancement when cascaded with external debuggers. To facilitate further research and development in this area, we have open-sourced our framework in this link (this https URL).

[NLP-141] Large language model-enabled automated data extraction for concrete materials informatics

【速读】: 该论文旨在解决材料科学领域中数据驱动的材料发现因高质量、大规模且可访问的实验数据集稀缺而受限的问题。其核心解决方案是提出了一种通用的大语言模型(Large Language Model, LLM)驱动的数据提取与结构化流程,能够从非结构化科学文献中自动识别并整理材料属性信息(如组成-工艺-性能等),并在一个小时内从超过27,000篇文献中提取近9,000条高质记录(含100余项属性),构建了目前最大的开放实验室混凝土数据库。该方法在多种LLM上均表现出鲁棒性能,F₁分数最高达0.97,验证了其对多样化材料体系的适应性与高效性,从而为材料信息学提供可扩展的数据基础设施。

链接: https://arxiv.org/abs/2604.22938
作者: Zhanzhao Li,Kengran Yang,Qiyao He,Kai Gong
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 20 pages, 5 figures, 1 table

点击查看摘要

Abstract:The promise of data-driven materials discovery remains constrained by the scarcity of large, high-quality, and accessible experimental datasets. Here, we introduce a generalizable large language model (LLM)-powered pipeline for automated extraction and structuring of materials data from unstructured scientific literature, using concrete materials as a representative and particularly challenging example. The pipeline exhibits robust performance across a broad range of LLMs and achieves an F_1 score of up to 0.97 for diverse composition–process–property attributes. Within one hour, it extracts nearly 9,000 high-quality records with over 100 attributes screened from more than 27,000 publications, enabling the construction of the largest open laboratory database for blended cement concrete. Machine learning analyses underscore the importance of large, diverse, and information-rich datasets for enhancing both in-distribution accuracy and out-of-distribution generalization to unseen materials. The proposed pipeline is readily adaptable to other materials domains and accelerates the development of scalable data infrastructures for materials informatics.

[NLP-142] In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions ICASSP2026

【速读】: 该论文旨在解决语音识别中时间戳预测(timestamp prediction)精度不足的问题,特别是在字级(word-level)时间戳预测方面,传统方法通常依赖外部对齐工具,导致系统复杂性和误差累积。解决方案的关键在于扩展现有的语音感知语言模型(speech-aware language model),使其能够直接联合预测文本与时间戳,并引入一组轻量级训练策略,显著提升对齐鲁棒性的同时保持语音识别(ASR)性能。实验表明,这些策略不仅提高了时间戳准确性,还带来了整体ASR性能的提升,从而实现了一个高效且统一的端到端语音识别与时间戳生成框架。

链接: https://arxiv.org/abs/2604.22817
作者: Xulin Fan,Vishal Sunder,Samuel Thomas,Mark Hasegawa-Johnson,Brian Kingsbury,George Saon
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: Accepted to ICASSP 2026

点击查看摘要

Abstract:Recent advances in speech-aware language models have coupled strong acoustic encoders with large language models, enabling systems that move beyond transcription to produce richer outputs. Among these, word-level timestamp prediction is critical for applications such as captioning, media search, and multimodal synchronization, yet it is often handled by external alignment tools. In this work, we extend an existing speech-aware language model to predict timestamps directly alongside transcripts. We introduce a set of novel lightweight training strategies that improve alignment robustness while preserving recognition quality. Experiments across multiple datasets show that these strategies not only enhance timestamp accuracy, but also yield gains in overall ASR performance. Together, they demonstrate an efficient and unified approach to speech recognition with precise timestamp prediction.

信息检索

[IR-0] XGRAG : A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation

【速读】:该论文旨在解决GraphRAG(基于图的检索增强生成)系统推理过程缺乏可解释性的问题,即当前方法难以明确揭示特定结构化知识如何影响最终输出,从而限制了系统的透明度和可信度。针对这一问题,作者提出了一种名为XGRAG的新框架,其关键在于采用基于图的扰动策略(graph-based perturbation strategies),通过量化图中每个组件对模型答案的贡献程度,生成因果层面的可解释性说明。该方案不仅提升了解释质量(在多个数据集上相较基线RAG-Ex提升14.81% F1得分),还验证了其与图中心性指标的高度相关性,从而实现了对GraphRAG系统更可靠、更结构化的解释能力。

链接: https://arxiv.org/abs/2604.24623
作者: Zhuoling Li,Ha Linh Hong Tran Nguyen,Valeria Bladinieres,Maxim Romanovsky
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Graph-based Retrieval-Augmented Generation (GraphRAG) extends traditional RAG by using knowledge graphs (KGs) to give large language models (LLMs) a structured, semantically coherent context, yielding more grounded answers. However, GraphRAG reasoning process remains a black-box, limiting our ability to understand how specific pieces of structured knowledge influence the final output. Existing explainability (XAI) methods for RAG systems, designed for text-based retrieval, are limited to interpreting an LLM response through the relational structures among knowledge components, creating a critical gap in transparency and trustworthiness. To address this, we introduce XGRAG, a novel framework that generates causally grounded explanations for GraphRAG systems by employing graph-based perturbation strategies, to quantify the contribution of individual graph components on the model answer. We conduct extensive experiments comparing XGRAG against RAG-Ex, an XAI baseline for standard RAG, and evaluate its robustness across various question types, narrative structures and LLMs. Our results demonstrate a 14.81% improvement in explanation quality over the baseline RAG-Ex across NarrativeQA, FairyTaleQA, and TriviaQA, evaluated by F1-score measuring alignment between generated explanations and original answers. Furthermore, XGRAG explanations exhibit a strong correlation with graph centrality measures, validating its ability to capture graph structure. XGRAG provides a scalable and generalizable approach towards trustworthy AI through transparent, graph-based explanations that enhance the interpretability of RAG systems.

[IR-1] Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models SIGIR2026

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在基于注意力机制的文档重排序任务中,因静态选择注意力头(attention heads)或简单聚合所有头信号而导致的性能瓶颈问题。现有方法要么对所有注意力头进行平均处理,要么依赖启发式规则固定选取部分头,这无法适应不同查询(query)或领域下的信息分布差异,且多头间的冗余或冲突信号会进一步损害排序效果。解决方案的关键在于提出一种查询相关的注意力头选择机制——RouteHead,其核心是训练一个轻量级路由模块(router),根据输入查询动态映射至最优注意力头集合,并仅聚合这些头的注意力信号生成相关性分数;该路由器通过伪标签监督学习实现,利用冻结LLM的隐藏状态提取查询嵌入,结合可学习的头嵌入表示,辅以稀疏正则化约束,从而实现高效、精准的动态头选择策略。

链接: https://arxiv.org/abs/2604.24608
作者: Yuxing Tian,Fengran Mo,Zhiqi Huang,Weixu Zhang,Jian-Yun Nie
机构: Université de Montréal(蒙特利尔大学); Capital One(资本一号); McGill University (麦吉尔大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by SIGIR 2026

点击查看摘要

Abstract:Large Language Models (LLMs) have recently been explored as fine-grained zero-shot re-rankers by leveraging attention signals to estimate document relevance. However, existing methods either aggregate attention signals across all heads or rely on a statically selected subset identified by heuristic rules. This solution can be suboptimal because the informative heads can vary across queries or domains. Moreover, naively combining multiple heads can degrade performance due to redundancy or conflicting ranking signals. In this paper, we propose a query-dependent head selection method, RouteHead, for attention-based re-ranking with LLMs. Specifically, we learn a lightweight router that can map each query to an optimal head set, and relevance scores are computed by aggregating attention signals only from these heads. Since query-to-head optimal labels are unavailable, we first construct pseudo labels via an offline search. The router represents each head with a learnable embedding and represents each query using an embedding extracted from the hidden states of the frozen LLM. Then it is trained on the pseudo labels with a sparsity regularizer. Experiments on diverse benchmarks and multiple LLM backbones show that the proposed method consistently outperforms strong baselines.

[IR-2] MEG-RAG : Quantifying Multi-modal Evidence Grounding for Evidence Selection in RAG

【速读】:该论文旨在解决当前多模态检索增强生成(Multimodal Retrieval-Augmented Generation, MRAG)系统在区分检索到的多模态数据是否真正支持答案语义核心方面存在的不足,尤其是现有评估指标依赖启发式的位置置信度,无法准确捕捉多模态实体的信息密度。其解决方案的关键在于提出一种语义感知的度量方法——多模态证据锚定(Multi-modal Evidence Grounding, MEG),该方法通过聚焦高逆文档频率(IDF)的信息承载词元来量化检索证据对答案语义核心的贡献;在此基础上构建的MEG-RAG框架进一步引入一个用于对齐检索证据与真实答案语义锚点的多模态重排序器,从而提升生成结果的准确性与多模态一致性。

链接: https://arxiv.org/abs/2604.24564
作者: Xihang Wang,Zihan Wang,Chengkai Huang,Quan Z. Sheng,Lina Yao
机构: Zhejiang University (浙江大学); Peking University (北京大学); UNSW and Macquarie University (新南威尔士大学和麦考瑞大学); Macquarie University (麦考瑞大学); UNSW and CSIRO’s Data61 (新南威尔士大学和澳大利亚联邦科学与工业研究组织数据61实验室)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Multimodal Retrieval-Augmented Generation (MRAG) addresses key limitations of Multimodal Large Language Models (MLLMs), such as hallucination and outdated knowledge. However, current MRAG systems struggle to distinguish whether retrieved multimodal data truly supports the semantic core of an answer or merely provides superficial relevance. Existing metrics often rely on heuristic position-based confidence, which fails to capture the informational density of multimodal entities. To address this, we propose Multi-modal Evidence Grounding (MEG), a semantic-aware metric that quantifies the contribution of retrieved evidence. Unlike standard confidence measures, MEG utilizes Semantic Certainty Anchoring, focusing on high-IDF information-bearing tokens that better capture the semantic core of the answer. Building on MEG, we introduce MEG-RAG, a framework that trains a multimodal reranker to align retrieved evidence with the semantic anchors of the ground truth. By prioritizing high-value content based on semantic grounding rather than token probability distributions, MEG-RAG improves the accuracy and multimodal consistency of generated outputs. Extensive experiments on the M ^2 RAG benchmark show that MEG-RAG consistently outperforms strong baselines and demonstrates robust generalization across different teacher models.

[IR-3] Modeling Behavioral Intensity and Transitions for Generative Recommendation

【速读】:该论文旨在解决生成式多行为推荐中因统一注意力机制导致的行为依赖关系激活不区分、无法捕捉行为强度差异及转换模式的问题。现有方法将不同行为视为辅助标记特征并输入统一注意力模块,隐含假设历史行为间依赖强度一致,从而忽略了行为间的异质性与动态转移结构。解决方案的关键在于提出BITRec框架,其核心创新包括:(i) 分层行为聚合(Hierarchical Behavior Aggregation, HBA),通过分离的探索路径与承诺路径显式建模行为强度差异;(ii) 转换关系编码(Transition Relation Encoding, TRE),利用可学习的关系矩阵显式编码行为间的转换结构,从而实现对多行为序列中复杂依赖模式的精细化建模。

链接: https://arxiv.org/abs/2604.24472
作者: Wenxuan Yang,Xiaoyang Xu,Hanyu Zhang,Zhexuan Xu,Wanqiang Xiong,Zhaoqun Chen
机构: Ant Group(蚂蚁集团); Fudan University(复旦大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multi-behavior recommendation aims to predict user conversions by modeling various interaction types that carry distinct intent signals. Recently, generative sequence modeling methods have emerged as an important paradigm for multi-behavior recommendation by achieving flexible sequence generation. However, existing generative methods typically treat behaviors as auxiliary token features and feed them into unified attention mechanisms. These models implicitly assume uniform activation of dependencies among historical behaviors, thereby failing to discern differences in intensity or capture transition patterns. To address these limitations, we propose BITRec, a novel generative multi-behavior recommendation framework that introduces structured behavioral modeling through selective dependency activation. BITRec incorporates (i) Hierarchical Behavior Aggregation (HBA), which explicitly models behavioral intensity differences through separated exploration and commitment pathways, and (ii) Transition Relation Encoding (TRE), which encodes transition structures through explicit learnable relation matrices. Experiments on four large-scale datasets (RetailRocket, Taobao, Tmall, Insurance Dataset) with millions of interactions achieve consistent improvements of 15-23% across multiple metrics, with peak gains of 22.79% MRR on Tmall and 17.83% HR@10, 17.55% NDCG@10 on Taobao.

[IR-4] Geometric Analysis of Self-Supervised Vision Representations for Semantic Image Retrieval

【速读】:该论文旨在解决现代自监督学习(Self-Supervised Learning, SSL)方法在图像内容检索(Content-Based Image Retrieval, CBIR)系统中应用时的性能瓶颈问题,特别是其表征空间几何特性对近似最近邻(Approximate Nearest Neighbor, ANN)索引效率的影响。解决方案的关键在于揭示了SSL方法生成的表征若具有高度各向异性(anisotropy)和高偏度(skewness),即使在线性探测或K-NN准确率上表现良好,仍会显著降低基于分区(partition-based)和哈希(hashing-based)的ANN搜索性能;相反,具备更高各向同性(isotropy)和局部纯净度(local purity)的表征更符合ANN索引的距离假设,从而提升语义检索效果。

链接: https://arxiv.org/abs/2604.24469
作者: Esteban Rodríguez-Betancourt,Edgar Casasola-Murillo
机构: 未知
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 3 figures, 7 tables

点击查看摘要

Abstract:Content-based image retrieval (CBIR) systems enable users to search images based on visual content instead of relying on metadata. The text domain has benefited from vector search of representations created with unsupervised methods such as BERT. However, modern self-supervised learning methods for vision are mostly not reported in CBIR-related literature, instead relying on supervised models or multi-modal methods that align text and vision. We evaluate how the representations learned by modern self-supervised learning methods for vision perform under typical retrieval stacks that leverage vector databases and nearest neighbor search. Our evaluation reveals that the latent space geometry impacts approximate nearest neighbor (ANN) indexing. Specifically, highly anisotropic representations with high skewness produced by several modern SSL methods degrade the performance of partition-based and hashing-based search, even if their own linear probe or K-NN accuracy is not affected. In contrast, representations with higher isotropy and local purity better satisfy the distance-based assumptions of ANN indexes, leading to improved semantic retrieval performance. Comments: 8 pages, 3 figures, 7 tables Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2604.24469 [cs.IR] (or arXiv:2604.24469v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2604.24469 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-5] Kwai Summary Attention Technical Report

【速读】:该论文旨在解决长序列建模中因标准softmax注意力机制导致的二次时间复杂度问题,从而在保持长距离依赖关系完整性的同时降低训练与推理成本。现有方法主要通过两种路径缓解此问题:一是压缩每层KV缓存(如GQA、MLA),但KV缓存仍与序列长度呈线性1:1关系;二是采用KV缓存友好的架构(如SWA、GDN),但常牺牲长程建模效果。本文提出一种中间路径:维持KV缓存与序列长度的线性关系,但通过语义层面的压缩比 $ k $ 实现结构化信息保留。其核心解决方案是提出Kwai Summary Attention (KSA),通过将历史上下文压缩为可学习的摘要标记(summary tokens),实现 $ O(n/k) $ 的序列建模复杂度,在内存开销可控的前提下完整保留远距离依赖的可解释性与参考性。

链接: https://arxiv.org/abs/2604.24432
作者: Chenglong Chu,Guorui Zhou,Guowang Zhang,Han Li,Hao Peng,Hongtao Cheng,Jian Liang,Jiangxia Cao,Kun Gai,Lingzhi Zhou,Lu Ren,Qi Zhang,Ruiming Tang,Ruitao Wang,Xinchen Luo,Yi Su,Zhiyuan Liang,Ziqi Wang,Boyang Ding,Chengru Song,Dunju Zang,Hui Wang,Jiao Ou,Jiaxin Deng,Jijun Shi,Jinghao Zhang,Junmin Chen,Lejian Ren,Minxuan Lv,Qianqian Wang,Qigen Hu,Shiyao Wang,Siyang Mao,Tao Wang,Xingmei Wang,Zhixin Ling,Ziming Li,Zixing Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Work in progress

点击查看摘要

Abstract:Long-context ability, has become one of the most important iteration direction of next-generation Large Language Models, particularly in semantic understanding/reasoning, code agentic intelligence and recommendation system. However, the standard softmax attention exhibits quadratic time complexity with respect to sequence length. As the sequence length increases, this incurs substantial overhead in long-context settings, leading the training and inference costs of extremely long sequences deteriorate rapidly. Existing solutions mitigate this issue through two technique routings: i) Reducing the KV cache per layer, such as from the head-level compression GQA, and the embedding dimension-level compression MLA, but the KV cache remains linearly dependent on the sequence length at a 1:1 ratio. ii) Interleaving with KV Cache friendly architecture, such as local attention SWA, linear kernel GDN, but often involve trade-offs among KV Cache and long-context modeling effectiveness. Besides the two technique routings, we argue that there exists an intermediate path not well explored: Maintaining a linear relationship between the KV cache and sequence length, but performing semantic-level compression through a specific ratio k . This O(n/k) path does not pursue a ``minimum KV cache’', but rather trades acceptable memory costs for complete, referential, and interpretable retention of long distant dependency. Motivated by this, we propose Kwai Summary Attention (KSA), a novel attention mechanism that reduces sequence modeling cost by compressing historical contexts into learnable summary tokens.

[IR-6] FreeScale: Distributed Training for Sequence Recommendation Models with Minimal Scaling Cost

【速读】:该论文旨在解决大规模工业深度学习推荐模型在训练过程中因数据特征异构性导致的计算资源利用率低下问题,其核心表现是由于严重延迟节点(stragglers)和慢速阻塞通信引发的计算气泡(computational bubbles)。解决方案的关键在于三个方面:首先,通过精心设计的输入样本负载均衡策略缓解延迟节点问题;其次,利用优先级嵌入通信与计算重叠的方式减少阻塞通信;最后,采用SM-Free技术避免计算与通信重叠时GPU资源竞争,从而实现高效的并行执行。实证结果表明,FreeScale在256张H100 GPU上运行真实工作负载时,可将计算气泡减少高达90.3%。

链接: https://arxiv.org/abs/2604.24073
作者: Chenhao Feng,Haoli Zhang,Shakhzod Ali-Zade,Yanli Zhao,Liang Luo,Jennifer Cao,Lisen Deng,Siqiao Chen,Chenyu Zhao,Tristan Rice,Daniel Johnson,Min Si,Tiantu Xu,Yi Zhang,Siqi Yan,Chuanhao Zhuge,Min Ni,Bi Xue,Qunshu Zhang,Shen Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
备注: 14 pages, 11 figures. Accepted to the 9th MLSys Conference, Bellevue, WA, USA, 2026

点击查看摘要

Abstract:Modern industrial Deep Learning Recommendation Models typically extract user preferences through the analysis of sequential interaction histories, subsequently generating predictions based on these derived interests. The inherent heterogeneity in data characteristics frequently result in substantial under-utilization of computational resources during large-scale training, primarily due to computational bubbles caused by severe stragglers and slow blocking communications. This paper introduces FreeScale, a solution designed to (1) mitigate the straggler problem through meticulously load balanced input samples (2) minimize the blocking communication by overlapping prioritized embedding communications with computations (3) resolve the GPU resource competition during computation and communication overlapping by communicating through SM-Free techniques. Empirical evaluation demonstrates that FreeScale achieves up to 90.3% reduction in computational bubbles when applied to real-world workloads running on 256 H100 GPUs.

[IR-7] Disagreement as Signals: Dual-view Calibration for Sequential Recommendation Denoising

【速读】:该论文旨在解决基于Transformer的序列推荐模型在面对用户行为数据中的噪声时,因缺乏对用户兴趣演化动态的建模而导致性能下降的问题。现有基于大语言模型(Large Language Model, LLM)的去噪方法通常采用静态语义编辑,忽略了推荐模型的学习动态与用户兴趣的时变特性。其解决方案的关键在于提出一种双视角校准框架(Dual-view Calibration for Sequential Recommendation denoising, DC4SR),通过引入两个互补的噪声分布估计机制:一是基于微调LLM得到的全局语义先验(semantic prior),从语义层面估计噪声;二是基于模型学习动态推断的模型侧后验(model-side posterior),捕捉推荐过程中的动态变化。二者之间的不一致性被用于协同优化语义理解和模型表示,通过迭代更新实现对全局语义先验和模型侧后验的动态校准,从而增强模型对用户兴趣演变的适应能力与鲁棒性。

链接: https://arxiv.org/abs/2604.24048
作者: Sijia Li,Min Gao,Zongwei Wang,Zhiyi Liu,Xin Xia,Yi Zhang
机构: Chongqing University (重庆大学); University of Queensland (昆士兰大学)
类目: Information Retrieval (cs.IR)
备注: 9 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Sequential recommendation seeks to model the evolution of user interests by capturing temporal user intent and item-level transition patterns. Transformer-based recommenders demonstrate a strong capacity for learning long-range and interpretable dependencies, yet remain vulnerable to behavioral noise that is misaligned with users’ true preferences. Recent large language model (LLM)-based approaches attempt to denoise interaction histories through static semantic editing. Such methods neglect the learning dynamics of recommendation models and fail to account for the evolving nature of user interests. To address this limitation, we propose a Dual-view Calibration framework for Sequential Recommendation denoising (DC4SR). Specifically, we introduce a semantic prior, derived from an LLM fine-tuned via labeled historical interactions, to estimate the noise distribution from a semantic perspective. From the learning perspective, we further employ a model-side posterior that infers the noise distribution based on the model’s learning dynamics. The disagreement between the two distributions is then leveraged to jointly refine semantic understanding and learning-aware model-side representations. Through iterative updates, dynamic dual-view calibration is achieved for both the global semantic prior and the model-side posterior, enabling consistent alignment with evolving user interests. Extensive experiments demonstrate that DC4SR consistently outperforms strong Transformer-based recommenders and LLM-based denoising methods, exhibiting enhanced robustness across training stages and noise conditions.

[IR-8] Improving Robustness of Tabular Retrieval via Representational Stability

【速读】:该论文旨在解决基于Transformer的表格检索系统在面对语义等价但格式不同的表格序列化方式(如CSV、TSV、HTML、Markdown和DDL)时,因嵌入表示差异导致检索结果不稳定的问题。其核心解决方案是将不同序列化的嵌入视为共享语义信号的噪声观测,并通过计算这些嵌入的质心(centroid)作为标准化的目标表示,从而抑制格式特异性偏差并恢复跨格式共有的语义内容。进一步地,作者提出一种轻量级残差瓶颈适配器(residual bottleneck adapter),在冻结编码器基础上将单一分割嵌入映射至质心目标,同时保留方差并引入协方差正则化,显著提升了多种密集检索模型(如MPNet、BGE-M3、ReasonIR和SPLADE)的鲁棒性,验证了后处理几何校正在实现序列化无关表格检索方面的有效性。

链接: https://arxiv.org/abs/2604.24040
作者: Kushal Raj Bhandari,Adarsh Singh,Jianxi Gao,Soham Dan,Vivek Gupta
机构: Rensselaer Polytechnic Institute (伦斯勒理工学院); Arizona State University (亚利桑那州立大学); ScaleAI (ScaleAI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Transformer-based table retrieval systems flatten structured tables into token sequences, making retrieval sensitive to the choice of serialization even when table semantics remain unchanged. We show that semantically equivalent serializations, such as \textttcsv , \texttttsv , \texttthtml , \textttmarkdown , and \textttddl , can produce substantially different embeddings and retrieval results across multiple benchmarks and retriever families. To address this instability, we treat serialization embedding as noisy views of a shared semantic signal and use its centroid as a canonical target representation. We show that centroid averaging suppresses format-specific variation and can recover the semantic content common to different serializations when format-induced shifts differ across tables. Empirically, centroid representations outrank individual formats in aggregate pairwise comparisons across \textttMPNet , \textttBGE-M3 , \textttReasonIR , and \textttSPLADE . We further introduce a lightweight residual bottleneck adapter on top of a frozen encoder that maps single-serialization embeddings towards centroid targets while preserving variance and enforcing covariance regularization. The adapter improves robustness for several dense retrievers, though gains are model-dependent and weaker for sparse lexical retrieval. These results identify serialization sensitivity as a major source of retrieval variance and show the promise of post hoc geometric correction for serialization-invariant table retrieval. Our code, datasets, and models are available at \hrefthis https URLthis https URL .

[IR-9] DeepTaxon: An Interpretable Retrieval-Augmented Multimodal Framework for Unified Species Identification and Discovery

【速读】:该论文旨在解决生物多样性研究中物种识别与未知物种发现的双重挑战,即在成千上万视觉相似的分类单元中准确识别物种,并在开放世界环境中发现新物种。传统方法将识别与发现视为独立问题,识别依赖封闭集假设,而发现则基于阈值拒绝策略,缺乏统一框架且难以解释。解决方案的关键在于提出 DeepTaxon——一个检索增强的多模态框架,通过可解释的链式推理(chain-of-thought reasoning)对检索到的视觉证据进行比较分析,将发现重新定义为显式的、基于检索的决策问题,而非隐式的参数化记忆问题。当检索索引无法提供足够识别证据时,样本即被判定为新颖物种,从而实现无需人工标注的自动监督信号,同时支持高召回率检索向高精度决策的转化,适用于大规模分类词汇表,显著提升识别与发现性能。

链接: https://arxiv.org/abs/2604.24029
作者: Jiawei Wang,Ming Lei,Yaning Yang,Xinyan Lin,Yuquan Le,Qiwei Ma,Zhiwei Xu,Zheqi Lv,Yuchen Ang,Zhe Quan,Tat-Seng Chua
机构: National University of Singapore(新加坡国立大学); Hunan University(湖南大学); Xiangtan University(湘潭大学); Hunan Normal University(湖南师范大学); Zhejiang University(浙江大学); Cornell University(康奈尔大学); Meituan(美团)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: 13 pages, 6 figures, 9 tables

点击查看摘要

Abstract:Identifying species in biology among tens of thousands of visually similar taxa while discovering unknown species in open-world environments remains a fundamental challenge in biodiversity research. Current methods treat identification and discovery as separate problems, with classification models assuming closed sets and discovery relying on threshold-based rejection. Here we present DeepTaxon, a retrieval-augmented multimodal framework that unifies species identification and discovery through interpretable reasoning over retrieved visual evidence. Given a query image, DeepTaxon retrieves the top- k candidate species with n exemplar images each from a retrieval index and performs chain-of-thought comparative reasoning. Critically, we redefine discovery as an explicit, retrieval-based decision problem rather than an implicit parametric memory problem. A sample is novel if and only if the retrieval index lacks sufficient evidence for identification, so each retrieval naturally yields a classification or discovery label without manual annotation, thereby providing automatic supervision for both tasks. We train the framework via supervised fine-tuning on synthetic retrieval-augmented data, followed by reinforcement learning on hard samples, converting high-recall retrieval into high-precision decisions that scale to massive taxonomic vocabularies. Extensive experiments on a large-scale in-distribution benchmark and six out-of-distribution datasets demonstrate consistent improvements in both identification and discovery. Ablation studies further reveal effective test-time scaling with candidate count k and exemplar count n , strong zero-shot transfer to unseen domains, and consistent performance across retrieval encoders, establishing an interpretable solution for biodiversity research.

[IR-10] FUTURAL: A Metasearch Platform for Empowering Rural Areas with Smart Solutions

【速读】:该论文旨在解决多源数字智能解决方案(Smart Solutions, SS)在跨平台检索与访问时存在的碎片化和用户交互效率低下的问题。其核心解决方案是构建一个基于大型语言模型(Large Language Models, LLMs)的元搜索(Metasearch)平台最小可行产品(Minimum Viable Product, MVP),通过集成开源数据服务并利用LLMs的生成能力,实现自然语言接口以提升用户查询体验。关键创新在于将LLM适配技术与结构化数据服务相结合,从而高效支持语义级查询理解与结果召回,为后续扩展至更多FUTURAL项目数据集和服务奠定基础。

链接: https://arxiv.org/abs/2604.23817
作者: Matei Popovici,Ciprian Dobre
机构: University Politehnica of Bucharest (布加勒斯特理工大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:The FUTURAL project aims to provide a comprehensive suite of digital Smart Solutions (SS) across five critical domains to address pressing social and environmental issues. Central to this initiative is a robust Metasearch platform, which will not only serve as the primary access point to FUTURAL’s solutions but also facilitate the search and retrieval of SS developed by other initiatives. This paper elaborates on the MVP implementation for the MetaSearch platform. It focuses on a single, open-source data service and harnesses the generative capabilities of Large Language Models (LLMs) to create a user-friendly natural language interface. The design of the Minimum Viable Product (MVP), the tools used for adapting LLMs to our specific application, and our comprehensive set of evaluation techniques are thoroughly detailed. The results from our evaluations demonstrate that our approach is highly effective and can be efficiently implemented in future iterations of the MVP. This groundwork paves the way for extending the platform to include additional services and diverse data sets from the FUTURAL project, enhancing its capacity to address a broader array of queries and datasets.

[IR-11] Similar Users-Augmented Interest Network

【速读】:该论文旨在解决推荐系统中因用户行为序列稀疏而导致的点击率(Click-through Rate, CTR)预测性能下降问题。在真实场景中,用户行为序列往往不完整,而现有序列建模方法仅依赖目标用户的自身行为,难以有效建模用户兴趣。其解决方案的关键在于提出一种名为SUIN(Similar Users-augmented Interest Network)的新方法:通过行为嵌入检索与目标用户行为相似的其他用户,并将这些相似用户的行为序列按相似度降序拼接至目标用户序列,构建增强型行为序列;同时设计了用户特定的目标感知位置编码以区分行为来源并捕捉相对位置信息,并引入用户感知的目标注意力机制,联合建模物品间和用户间的相关性,从而有效利用增强序列中的潜在信息并抑制来自相似用户行为的噪声。

链接: https://arxiv.org/abs/2604.23810
作者: Xiaolong Chen,Haoyi Zhao,Xu Huang,Defu Lian
机构: University of Science and Technology of China (中国科学技术大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Click-through rate (CTR) prediction is one of the core tasks in recommender systems. User behavior sequences, as one of the most effective features, can accurately reflect user preferences and significantly improve prediction accuracy. Richer behavior sequences often enable more comprehensive user profiling, and recent studies have shown that scaling the length of user behavior sequence can yield substantial gains in CTR. However, due to the widespread sparsity in recommender systems, incomplete behavior sequences are common in real-world scenarios. Existing sequential modeling methods often rely solely on the target user’s own behavior, and therefore struggle in such scenarios. This paper proposes a novel method called SUIN (Similar Users-augmented Interest Network), which enhances the target user’s behavior sequence with behaviors from similar users to enhance the user profile for CTR prediction. Specifically, we use behavior embeddings encoded by a sequence encoder to retrieve users with similar behaviors from a user retrieval pool. The behavior sequences of these similar users are then concatenated with that of the target user in descending order of similarity to construct an augmented sequence. Given that the augmented sequence contains behaviors from multiple users, we propose a user-specific target-aware position encoding, which identifies the source user of each behavior and captures its relative position to the target item. Furthermore, to mitigate the empirically observed noise in similar users’ behaviors, we design a user-aware target attention that jointly considers item-item and user-user correlations, fully exploiting the potential of the augmented behavior sequence. Comprehensive experiments on widely-used short-term and long-term sequence benchmark datasets demonstrate that our method significantly outperforms state-of-the-art sequential CTR models.

[IR-12] Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale

【速读】:该论文旨在解决小规模开源大语言模型(LLM)在医学问答任务中面临的模型设计选择问题:是通过领域微调(domain fine-tuning)增强模型性能,还是在推理时通过检索增强生成(Retrieval-Augmented Generation, RAG)注入领域知识。其关键解决方案在于控制变量法——固定模型规模、提示模板、解码温度、检索管道和评估协议,仅改变两个因素:(1)模型是否经过医学领域微调(Gemma 3 4B vs. MedGemma 4B);(2)是否在提示中插入来自医学知识库的检索段落。实验结果表明,在给定规模和基准下,领域知识编码于模型权重中的方式显著优于依赖上下文注入的知识,即微调带来的准确率提升(+6.8个百分点)远高于RAG的效果(甚至在微调模型中出现轻微负向效应)。

链接: https://arxiv.org/abs/2604.23801
作者: Avi-ad Avraam Buskila
机构: Bar-Ilan University (巴伊兰大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Practitioners deploying small open-weight large language models (LLMs) for medical question answering face a recurring design choice: invest in a domain-fine-tuned model, or keep a general-purpose model and inject domain knowledge at inference time via retrieval-augmented generation (RAG). We isolate this trade-off by holding model size, prompt template, decoding temperature, retrieval pipeline, and evaluation protocol fixed, and varying only (i) whether the model has been domain-adapted (Gemma 3 4B vs. MedGemma 4B, both 4-bit quantized and served via Ollama) and (ii) whether retrieved passages from a medical knowledge corpus are inserted into the prompt. We evaluate all four cells of this 2x2 design on the full MedQA-USMLE 4-option test split (1,273 questions) with three repetitions per question (15,276 LLM calls). Domain fine-tuning yields a +6.8 percentage-point gain in majority-vote accuracy over the general 4B baseline (53.3% vs. 46.4%, McNemar p 10^-4). RAG over MedMCQA explanations does not produce a statistically significant gain in either model, and in the domain-tuned model the point estimate is slightly negative (-1.9 pp, p = 0.16). At this scale and on this benchmark, domain knowledge encoded in weights dominates domain knowledge supplied in context. We release the full experiment code and JSONL traces to support replication.

[IR-13] S2G-RAG : Structured Sufficiency and Gap Judging for Iterative Retrieval-Augmented QA ACL2026

【速读】:该论文旨在解决多跳问答(multi-hop question answering)中因迭代式检索生成(RAG)系统难以有效控制下一步检索内容及判断证据充分性而导致的问题,如基于不完整证据链回答或积累冗余/干扰信息。其核心解决方案是提出S2G-RAG框架,引入一个显式的控制器组件S2G-Judge,该组件在每一轮迭代中预测当前证据记忆是否足以作答;若不足,则输出结构化的“缺失信息项”(gap items),这些gap items被映射为下一轮检索查询,从而引导稳定的多轮检索轨迹。此外,通过维护句级证据上下文(Evidence Context)来减少噪声累积,提升多跳问答的性能与鲁棒性。

链接: https://arxiv.org/abs/2604.23783
作者: Minghan Li,Junjie Zou,Xinxuan Lv,Chao Zhang,Guodong Zhou
机构: Soochow University (苏州大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026 Main Conference

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) grounds language models in external evidence, but multi-hop question answering remains difficult because iterative pipelines must control what to retrieve next and when the available evidence is adequate. In practice, systems may answer from incomplete evidence chains, or they may accumulate redundant or distractor-heavy text that interferes with later retrieval and reasoning. We propose S2G-RAG (Structured Sufficiency and Gap-judging RAG), an iterative framework with an explicit controller, S2G-Judge. At each turn, S2G-Judge predicts whether the current evidence memory supports answering and, if not, outputs structured gap items that describe the missing information. These gap items are then mapped into the next retrieval query, producing stable multi-turn retrieval trajectories. To reduce noise accumulation, S2G-RAG maintains a sentence-level Evidence Context by extracting a compact set of relevant sentences from retrieved documents. Experiments on TriviaQA, HotpotQA, and 2WikiMultiHopQA show that S2G-RAG improves multi-hop QA performance and robustness under multi-turn retrieval. Furthermore, S2G-RAG can be integrated into existing RAG pipelines as a lightweight component, without modifying the search engine or retraining the generator.

[IR-14] GLIER: Generative Legal Inference and Evidence Ranking for Legal Case Retrieval ACL2026

【速读】:该论文旨在解决法律案例检索(Legal Case Retrieval, LCR)中用户口语化查询与专业法律文档之间的语义鸿沟问题。现有密集检索方法通常将LCR视为黑箱语义匹配过程,忽视了支撑法律相关性的显式司法逻辑。解决方案的关键在于提出GLIER框架,其核心是将检索任务重构为对潜在法律变量的推理过程:首先通过联合生成推理模块(Joint Generative Inference)使用统一的序列到序列策略将原始查询转化为潜在法律指标(如罪名和法律要素),以确保逻辑一致性;其次利用多视角证据融合机制(Multi-View Evidence Fusion)整合生成置信度、结构特征与词汇信号,实现精准排序。此方法显著提升了检索性能,并在数据稀缺场景下仍保持鲁棒性。

链接: https://arxiv.org/abs/2604.23779
作者: Minghan Li,Tianrui Lv,Chao Zhang,Guodong Zhou
机构: Soochow University (苏州大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted to the ACL 2026 main conference

点击查看摘要

Abstract:The semantic gap between colloquial user queries and professional legal documents presents a fundamental challenge in Legal Case Retrieval (LCR). Existing dense retrieval methods typically treat LCR as a black-box semantic matching process, neglecting the explicit juridical logic that underpins legal relevance. To address this, we propose GLIER (Generative Legal Inference and Evidence Ranking), a framework that reformulates retrieval as an inference process over latent legal variables. GLIER decomposes the task into two interpretability-driven stages. First, a Joint Generative Inference module translates raw queries into latent legal indicators, including charges and legal elements, using a unified sequence-to-sequence strategy that jointly generates charges and elements to enforce logical consistency. Second, a Multi-View Evidence Fusion mechanism aggregates generative confidence with structural and lexical signals for precise ranking. Extensive experiments on LeCaRD and LeCaRDv2 demonstrate that GLIER outperforms strong baselines such as SAILER and KELLER. Notably, GLIER exhibits strong data efficiency, maintaining robust performance even when trained with only 10% of the data.

[IR-15] Prism-Reranker: Beyond Relevance Scoring – Jointly Producing Contributions and Evidence for Agent ic Retrieval

【速读】:该论文旨在解决传统重排序器(reranker)仅输出标量相关性分数导致下游应用(如检索增强生成 RAG 和自主代理)需将整篇文档注入语言模型上下文,从而浪费计算资源的问题。其核心解决方案是提出 Prism-Reranker,一个基于 Qwen3.5 构建的多尺寸(0.8B–9B)重排序模型家族,它不仅判断文档是否相关(yes/no),还额外输出两个结构化信息:(i) 贡献陈述(contribution statement),即摘要式说明文档如何帮助回答查询;(ii) 证据片段(evidence passage),即保留所有查询相关信号并去除噪声的自包含重写版本。关键创新在于采用混合训练目标——结合来自强商业 reranker API 的点对点蒸馏与针对贡献和证据目标的监督微调,并通过 LLM-as-Judge 集成策略统一开放语料中的不一致标签,以获得清晰的二分类监督信号。实验证明该方法在 BEIR-QA 子集及贡献/证据质量评估上均表现稳健,且可扩展至现有 LLM-based reranker(如 Qwen3-Reranker-4B),显著提升 NDCG@10 指标。

链接: https://arxiv.org/abs/2604.23734
作者: Dun Zhang
机构: 未知
类目: Information Retrieval (cs.IR)
备注: 28 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Modern retrieval pipelines increasingly serve downstream consumers like retrieval-augmented generation (RAG) and autonomous agents that need more than a scalar relevance score. A reranker that only tells the caller “how relevant” forces the agent to dump entire documents into the language-model context, wasting tokens on tangential passages and boilerplate. We introduce Prism-Reranker, a family of reranker models built on Qwen3.5 at four sizes (0.8B, 2B, 4B, 9B) that goes beyond scalar scoring. In addition to the standard yes/no relevance judgement, whenever the verdict is yes the model emits (i) a contribution statement summarizing how the document helps the query, and (ii) an evidence passage: a self-contained rewrite that preserves every query-relevant signal while discarding noise. Prism-Reranker is trained with a hybrid objective combining point-wise distillation from a strong commercial reranker API with supervised fine-tuning on contribution and evidence targets. We curate training data from KaLM-Embedding’s open-source aggregation, augmented with real web documents retrieved via commercial search APIs for open-domain queries and LLM-synthesized variants, and rewrite a portion of queries into keyword-style reformulations to adapt the model to agent-issued traffic. To reconcile inconsistent labels across open corpora and obtain crisp binary supervision, we relabel data with an LLM-as-Judge ensemble aggregating votes from five frontier LLMs. On a QA subset of BEIR and on an LLM-judged evaluation of contribution and evidence quality, Prism-Reranker attains solid results across all four sizes. We further show that the same recipe extends existing LLM-based rerankers, augmenting Qwen3-Reranker-4B with contribution and evidence capabilities while improving its average BEIR-QA NDCG@10 by +1.54 over the base model. Model weights, training recipe, and evaluation suite are released.

[IR-16] Prompt-Unknown Promotion Attacks against LLM -based Sequential Recommender Systems SIGIR2026

【速读】:该论文旨在解决大语言模型驱动的序列推荐系统(Large Language Model-powered Sequential Recommender Systems, LLM-SRSs)在面对物品推广攻击(item promotion attack)时的安全性问题,尤其是在攻击者无法获取目标模型或系统提示(system prompt)的全黑盒场景下。解决方案的关键在于提出Prompt-Unknown Dual-poisoning Attack (PUDA)框架,其核心创新包括:1)设计基于大语言模型的进化精炼策略以推断离散的系统提示,从而训练出能够模拟目标模型行为的有效替代模型(surrogate model);2)利用蒸馏得到的提示和替代模型,在语义约束下对目标物品文本进行对抗性修改,并结合由替代模型生成的高真实性中毒序列,实现低成本且高效的物品排名提升。实验表明,PUDA在真实数据集上显著优于现有最先进方法,揭示了即使在提示与模型均受保护的情况下,LLM-SRS仍存在严重安全风险。

链接: https://arxiv.org/abs/2604.23640
作者: Yuchuan Zhao,Tong Chen,Junliang Yu,Zongwei Wang,Lizhen Cui,Hongzhi Yin
机构: The University of Queensland (昆士兰大学); Griffith University (格里菲斯大学); Chongqing University (重庆大学); Shandong University (山东大学)
类目: Information Retrieval (cs.IR)
备注: Accepted by SIGIR 2026

点击查看摘要

Abstract:Large language model-powered sequential recommender systems (LLM-SRSs) have recently demonstrated remarkable performance, enabling recommendations through prompt-driven inference over user interaction sequences. However, this paradigm also introduces new security vulnerabilities, particularly text-level manipulations, rendering them appealing targets for promotion attacks that purposely boost the ranking of specific target items. Although such security risks have been receiving increasing attention, existing studies typically rely on an unrealistic assumption of access to either the victim model or prompt to unveil attack mechanisms. In this work, we investigate the item promotion attack in LLM-SRSs under a more realistic setting where both the system prompt and victim model are unknown to the attacker, and propose a Prompt-Unknown Dual-poisoning Attack (PUDA) framework. To simulate attacks under this full black-box setting, we introduce an LLM-based evolutionary refinement strategy that infers discrete system prompts, enabling the training of an effective surrogate model that mimics the behaviors of the victim model. Leveraging the distilled prompt and surrogate model, we devise a promotion attack that adversarially revises target item texts under semantic constraints, which is further complemented by the highly plausible, surrogate-generated poisoning sequences to enable cost-effective target item promotion. Extensive experiments on real-world datasets demonstrate that PUDA consistently outperforms state-of-the-art competitors in boosting the exposure of unpopular target items. Our findings reveal critical security risks in modern LLM-SRSs even when both prompts and models are protected, and highlight the need for more robust defensive means.

[IR-17] From Rights to Rites: Expectations Management in Smart-Home AI

【速读】:该论文旨在解决智能家庭AI设备在设计与部署过程中伦理考量常被忽视或仅由合规团队处理的问题,聚焦于如何系统性地构建、校准和修复用户对智能家庭AI的期望。其解决方案的关键在于提出“期望管理”(Expectations Management, EM)模型,该模型基于建构主义扎根理论,强调通过组织权利与文化情境化仪式之间的平衡来塑造道德判断、情境化行动及跨文化差异,从而实现负责任的设计实践。EM模型识别出四大设计张力并提炼为五阶段设计指南(EM Design Playbook),以支持道德审慎,推动以人为本的人工智能(Human-Centred AI)发展。

链接: https://arxiv.org/abs/2604.23635
作者: Varad Vishwarupe,Ivan Flechais,Marina Jirotka,Nigel Shadbolt
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted as a main track conference paper at 2026 HCI International (HCII), Montreal, Canada

点击查看摘要

Abstract:Domestic voice assistants and smart-home devices are increasingly embedded in everyday routines, yet their ethics are often treated as an afterthought or delegated to compliance teams. To explore how expectations about smart-home AI are constructed and managed, we conducted 33 semi-structured interviews with designers, developers, and researchers from major smart-home platforms (Amazon Alexa, Microsoft Azure IoT, and Google Nest). Using a constructivist grounded theory approach, we develop Expectations Management (EM): a culturally embedded model describing how practitioners shape, calibrate, and repair expectations by balancing organisational rights with culturally situated rites. We show that EM differs from expectation-confirmation theory and trust-calibration by foregrounding moral judgement, situated action, and cross-cultural variation. Our analysis reveals four recurring design tensions: automation vs. autonomy, helpfulness vs. intrusiveness, personalisation vs. predictability, and transparency vs. obscurity and distils them into a five-phase EM Design Playbook that supports moral prudence. We discuss implications for responsible smart-home design and offer guidance for human-centred AI.

[IR-18] FinGround: Detecting and Grounding Financial Hallucinations via Atomic Claim Verification ACL2026

【速读】:该论文旨在解决生成式 AI (Generative AI) 在金融领域问答任务中因缺乏对监管文件内容的严格事实约束而导致的幻觉问题,包括虚构财务指标、编造引用和计算错误等,这些问题在欧盟《人工智能法案》(EU AI Act)高风险系统强制执行截止日期(2026年8月)临近背景下具有直接合规风险。解决方案的关键在于提出 FinGround,一个三阶段“验证后溯源”(verify-then-ground)管道:第一阶段采用面向金融领域的混合检索策略,联合文本与表格信息;第二阶段将答案分解为六类金融原子命题,并通过类型导向的验证策略(如公式重构)进行逐项校验;第三阶段对无法支持的陈述进行重写并附带段落级或表格单元格级引用。该方法通过“检索均衡化评估”机制隔离验证效果与检索质量的影响,在相同检索条件下使幻觉率降低 68%,整体管道相较 GPT-4o 减少 78% 幻觉,且轻量化模型可在低延迟下实现高效部署。

链接: https://arxiv.org/abs/2604.23588
作者: Dongxin Guo,Jikun Wu,Siu Ming Yiu
机构: The University of Hong Kong (香港大学); Stellaris AI Limited
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted to ACL 2026 Industry Track. 14 pages, 1 figure, 14 tables

点击查看摘要

Abstract:Financial AI systems must produce answers grounded in specific regulatory filings, yet current LLMs fabricate metrics, invent citations, and miscalculate derived quantities. These errors carry direct regulatory consequences as the EU AI Act’s high-risk enforcement deadline approaches (August 2026). Existing hallucination detectors treat all claims uniformly, missing 43% of computational errors that require arithmetic re-verification against structured tables. We present FinGround, a three-stage verify-then-ground pipeline for financial document QA. Stage 1 performs finance-aware hybrid retrieval over text and tables. Stage 2 decomposes answers into atomic claims classified by a six-type financial taxonomy and verified with type-routed strategies including formula reconstruction. Stage 3 rewrites unsupported claims with paragraph- and table-cell-level citations. To cleanly isolate verification value from retrieval quality, we propose retrieval-equalized evaluation as standard methodology for RAG verification research: when all systems receive identical retrieval, FinGround still reduces hallucination rates by 68% over the strongest baseline ( p 0.01 ). The full pipeline achieves a 78% reduction relative to GPT-4o. An 8B distilled detector retains 91.4% F1 at 18x lower per-claim latency, enabling 0.003/query deployment, supported by qualitative signals from a four-week analyst pilot.

[IR-19] ComplianceNLP: Knowledge-Graph-Augmented RAG for Multi-Framework Regulatory Gap Detection ACL2026

【速读】:该论文旨在解决金融行业合规管理中因监管事件数量庞大(年均超6万条)而导致的人工处理效率低下与合规风险高的问题,特别是在2008年金融危机后行业累计支付超3000亿美元罚款的背景下。解决方案的关键在于提出一个端到端的合规自然语言处理系统——ComplianceNLP,其核心创新包括:(1)基于监管知识图谱增强的检索增强生成(RAG)管道,将生成内容锚定在包含12,847个条款的SEC、MiFID II和Basel III知识图谱上;(2)多任务义务提取模块,利用共享的LEGAL-BERT编码器联合执行命名实体识别(NER)、道义分类(deontic classification)及跨引用解析;(3)结合严重性感知评分的合规差距分析机制,实现义务与机构内部政策的映射。实证表明,该系统在真实场景下实现了96.0%召回率和90.7%精确度,并带来3.1倍分析师效率提升,验证了结构化监管知识对高复杂度跨引用任务的重要性。

链接: https://arxiv.org/abs/2604.23585
作者: Dongxin Guo,Jikun Wu,Siu Ming Yiu
机构: The University of Hong Kong(香港大学); Stellaris AI Limited(Stellaris AI有限公司)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted at ACL 2026 Industry Track. 19 pages, 15 tables, 1 figure

点击查看摘要

Abstract:Financial institutions must track over 60,000 regulatory events annually, overwhelming manual compliance teams; the industry has paid over USD 300 billion in fines and settlements since the 2008 financial crisis. We present ComplianceNLP, an end-to-end system that automatically monitors regulatory changes, extracts structured obligations, and identifies compliance gaps against institutional policies. The system integrates three components: (1) a knowledge-graph-augmented RAG pipeline grounding generations in a regulatory knowledge graph of 12,847 provisions across SEC, MiFID II, and Basel III; (2) multi-task obligation extraction combining NER, deontic classification, and cross-reference resolution over a shared LEGAL-BERT encoder; and (3) compliance gap analysis that maps obligations to internal policies with severity-aware scoring. On our benchmark, ComplianceNLP achieves 87.7 F1 on gap detection, outperforming GPT-4o+RAG by +3.5 F1, with 94.2% grounding accuracy ( r=0.83 vs. human judgments) and 83.4 F1 under realistic end-to-end error propagation. Ablations show that knowledge-graph re-ranking contributes the largest marginal gain (+4.6 F1), confirming that structural regulatory knowledge is critical for cross-reference-heavy tasks. Domain-specific knowledge distillation (70B \to 8B) combined with Medusa speculative decoding yields 2.8\times inference speedup; regulatory text’s low entropy ( H=2.31 bits vs. 3.87 general text) produces 91.3% draft-token acceptance rates. In four months of parallel-run deployment processing 9,847 updates at a financial institution, the system achieved 96.0% estimated recall and 90.7% precision, with a 3.1\times sustained analyst efficiency gain. We report deployment lessons on trust calibration, GRC integration, and distributional shift monitoring for regulated-domain NLP.

[IR-20] Identity-Decoupled Anonymization for Visual Evidence in Multi-modal Retrieval-Augmented Generation

【速读】:该论文旨在解决多模态检索增强生成(Multi-modal Retrieval-Augmented Generation, MRAG)系统中因检索到含人脸图像而引发的隐私泄露问题,尤其是人脸身份信息作为敏感个人数据可能被滥用的风险。现有匿名化技术往往破坏下游推理依赖的非身份视觉线索,或无法提供可证明的隐私保障。其解决方案的关键在于提出一种身份解耦的MRAG框架(Identity-Decoupled MRAG),通过在检索与生成之间引入一个生成式匿名模块实现隐私保护:首先利用变分编码器将人脸分解为身份码(identity code)和结构属性码(attribute code),并通过互信息惩罚与梯度独立性项进行正则化;其次采用流形感知拒绝采样器替换身份码以确保合成身份既真实又与原身份显著不同;最后基于条件潜在扩散模型生成匿名化人脸,并将其蒸馏为低延迟部署的潜在一致性模型(latent consistency model)。整个过程通过多Oracle人脸识别模型集合与基于铰链损失的优化机制,在身份相似度低于伪造者阈值时终止训练,从而提供形式化的隐私保障。

链接: https://arxiv.org/abs/2604.23584
作者: Zehua Cheng,Wei Dai,Jiahao Sun
机构: University of Oxford (牛津大学); FLock.io
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: ACM International Conference on Multimedia Retrieval 2026

点击查看摘要

Abstract:Multi-modal retrieval-augmented generation (MRAG) systems retrieve visual evidence from large image corpora to ground the responses of large multi-modal models, yet the retrieved images frequently contain human faces whose identities constitute sensitive personal information. Existing anonymization techniques that destroy the non-identity visual cues that downstream reasoning depends on or fail to provide principled privacy guarantees. We propose Identity-Decoupled MRAG, a framework that interposes a generative anonymization module between retrieval and generation. Our approach consists of three components: (i)a disentangled variational encoder that factorizes each face into an identity code and a spatially-structured attribute code, regularized by a mutual-information penalty and a gradient-based independence term; (ii)a manifold-aware rejection sampler that replaces the identity code with a synthetic one guaranteed to be both distinct from the original and realistic; and (iii)a conditional latent diffusion generator that synthesizes the anonymized face from the replacement identity and the preserved attributes, distilled into a latent consistency model for low-latency deployment. Privacy is enforced through a multi-oracle ensemble of face recognition models with a hinge-based loss that halts optimization once identity similarity drops below the impostor-regime threshold.

[IR-21] Green-Red Watermarking for Recommender Systems

【速读】:该论文旨在解决推荐系统中知识产权保护的问题,特别是针对模型提取攻击(model extraction attacks)的威胁。现有水印方法依赖于强制模型记忆预定义的交互模式,导致需要大量合成数据注入且易受移除攻击,因其在统计上偏离自然用户行为。解决方案的关键在于提出一种名为GREW(Green-REd Watermarking)的新框架,其核心是通过秘密密钥将物品空间划分为“绿色”项目(用于软性提升)和“红色”项目(作为锚点),从而从脆弱的记忆机制转向隐蔽、密钥控制的输出偏差机制。GREW通过三个定制模块实现:语义一致哈希(Semantic-Consistent Hashing)以密钥聚类绿色项目并保持性能感知的隐蔽性;决策对齐掩码(Decision-Aligned Masking)将信号注入限制在竞争性物品子集内以保留排序逻辑;置信度感知缩放(Confidence-Aware Scaling)根据模型不确定性动态调节注入强度。最终通过黑盒输出的统计假设检验完成所有权验证,无需任何数据注入即可实现强验证能力和对抗提取攻击的鲁棒性。

链接: https://arxiv.org/abs/2604.23568
作者: Lei Zhou,Min Gao,Zongwei Wang,Yibing Bai,Wentao Li
机构: Chongqing University (重庆大学); University of Leicester (莱斯特大学)
类目: Information Retrieval (cs.IR); Cryptography and Security (cs.CR)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:The widespread open-sourcing of advanced recommendation algorithms and the rising threat of model extraction attacks have made safeguarding the intellectual property of recommender systems an imperative task. While watermarking serves as a potent defense, existing methods primarily rely on forcing models to memorize pre-defined interaction patterns. Such memorization-based approaches often require excessive synthetic data injection and are vulnerable to removal attacks due to their detectable statistical deviations from natural user behavior. To address these limitations, we propose GREW, a novel Green-REd Watermarking framework for recommender systems. GREW leverages a secret key to partition the item space into “green” items for soft promotion and “red” items as anchors, thereby shifting the paradigm from fragile memorization to a stealthy, key-controlled output bias. By integrating watermark signals directly into the intrinsic ranking process, GREW employs three recommendation-tailored modules: (1) Semantic-Consistent Hashing, which utilizes the secret key to cluster green items for performance-aware stealthiness; (2) Decision-Aligned Masking, which confines signal injection to the competitive item subset to preserve ranking logic; and (3) Confidence-Aware Scaling, which dynamically modulates injection intensity based on model uncertainty. Ownership verification is performed via statistical hypothesis testing on aggregated black-box outputs, enabled by the keyed re-partitioning of the item space. Experiments on multiple base models demonstrate that GREW achieves strong ownership verification and robustness against extraction attacks compared to existing baselines while requiring no data injection. Our code is available at this https URL.

[IR-22] CyberCane: Neuro-Symbolic RAG for Privacy-Preserving Phishing Detection with Formal Ontology Reasoning

【速读】:该论文旨在解决隐私敏感领域中钓鱼检测系统面临的多重矛盾需求:近乎零的误报率(False Positive Rate, FPR)以避免工作流程中断、为非专家用户提供可解释性、遵守监管要求禁止将敏感数据暴露于外部API,以及抵御生成式AI(Generative AI)驱动的新型攻击。现有基于规则的系统对新攻击形式脆弱,而大型语言模型(Large Language Model, LLM)驱动的检测器因未脱敏的数据传输违反隐私合规性。解决方案的关键在于提出CyberCane——一个神经符号框架,融合确定性符号分析与隐私保护的检索增强生成(Retrieval-Augmented Generation, RAG),其双阶段流水线首先用轻量级符号规则处理邮件元数据,再将边界案例交由RAG进行语义分类,并自动执行敏感信息脱敏及从仅含钓鱼样本的语料库中检索上下文;同时引入PhishOnt本体(OWL ontology),通过形式化推理链实现可验证的攻击分类。该方案在DataPhish2025数据集上相较纯符号方法在AI生成威胁上的召回率提升78.6点,精度超98%,FPR低至0.16%,且支持灵活调参以适配不同风险偏好。

链接: https://arxiv.org/abs/2604.23563
作者: Safayat Bin Hakim,Aniqa Afzal,Qi Zhao,Vigna Majmundar,Pawel Sloboda,Houbing Herbert Song
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校); Bowie State University (鲍伊州立大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Privacy-critical domains require phishing detection systems that satisfy contradictory constraints: near-zero false positives to prevent workflow disruption, transparent explanations for non-expert staff, strict regulatory compliance prohibiting sensitive data exposure to external APIs, and robustness against AI-generated attacks. Existing rule-based systems are brittle to novel campaigns, while LLM-based detectors violate privacy regulations through unredacted data transmission. We introduce CyberCane, a neuro-symbolic framework integrating deterministic symbolic analysis with privacy-preserving retrieval-augmented generation (RAG). Our dual-phase pipeline applies lightweight symbolic rules to email metadata, then escalates borderline cases to semantic classification via RAG with automated sensitive data redaction and retrieval from a phishing-only corpus. We further introduce PhishOnt, an OWL ontology enabling verifiable attack classification through formal reasoning chains. Evaluation on DataPhish2025 (12.3k emails; mixed human/LLM) and Nazario/SpamAssassin demonstrates a 78.6-point recall gain over symbolic-only detection on AI-generated threats, with precision exceeding 98% and FPR as low as 0.16%. Healthcare deployment projects a 542x ROI; tunable operating points support diverse risk tolerances, with open-source implementation at this https URL.

[IR-23] Beyond Static Collision Handling: Adaptive Semantic ID Learning for Multimodal Recommendation at Industrial Scale

【速读】:该论文旨在解决推荐系统中语义ID(Semantic ID, SID)学习过程中因代码碰撞(collision)导致的表示混淆问题,即不同物品被分配相同或高度相似的离散编码,从而影响检索、排序和生成推荐的效果。现有方法主要依赖改进的量化策略或固定的重叠正则化,无法自适应地区分哪些重叠应被抑制、哪些应保留。其解决方案的关键在于提出AdaSID框架,通过两阶段自适应调节机制:第一阶段根据语义兼容性动态放松对已观测到的重叠的排斥力,保留合理的共享;第二阶段依据局部碰撞密度和训练进度分配剩余调控压力,在高冲突区域强化控制,并逐步将优化重心转向推荐目标对齐。这种设计实现了对重叠的智能决策——何时惩罚、如何强度调节及何时转移学习焦点,显著提升了SID的语义保真度与下游推荐性能。

链接: https://arxiv.org/abs/2604.23522
作者: Yongsen Pan,Yuxin Chen,Zheng Hu,Xu Yuan,Daoyuan Wang,Yuting Yin,Songhao Ni,Hongyang Wang,Jun Wang,Fuji Ren,Wenwu Ou
机构: University of Electronic Science and Technology of China (电子科技大学); Kuaishou Technology (快手科技)
类目: Information Retrieval (cs.IR); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Modern recommendation systems involve massive catalogs of multimodal items, where scalable item identification must balance compactness, semantic fidelity, and downstream effectiveness. Semantic IDs (SIDs) address this need by representing items as short discrete token sequences derived from multimodal signals, providing a compact interface for retrieval, ranking, and generative recommendation. However, effective SID learning is hindered by collisions, where different items are assigned identical or highly confusable codes. Existing methods mainly rely on improved quantization or fixed overlap regularization, but they do not adaptively distinguish whether an overlap should be suppressed or preserved. We propose AdaSID, an adaptive semantic ID learning framework for recommendation. AdaSID regulates SID overlaps through a two-stage process. First, it relaxes repulsion for observed overlaps when the involved items are semantically compatible, preserving admissible sharing rather than uniformly separating all collisions. Second, it allocates the remaining regulation pressure according to local collision load and training progress, strengthening control in congested regions while gradually rebalancing optimization toward recommendation alignment. This design adaptively decides which overlaps to penalize, how strongly to regulate them, and when to shift the learning focus. Extensive offline and online experiments validate AdaSID. On two public benchmarks, AdaSID improves Recall and NDCG by about 4.5% on average over strong baselines, while improving codebook utilization and SID diversity. In Kuaishou e-commerce, an online A/B test on short-video retrieval covering tens of millions of users achieves statistically significant gains, including a 0.98% GMV improvement, and industrial ranking evaluation shows consistent AUC improvements.

[IR-24] A Benchmark Suite of Reddit-Derived Datasets for Mental Health Detection

【速读】:该论文旨在解决当前心理健康自然语言处理(Natural Language Processing, NLP)研究中缺乏高质量、可复现且跨任务通用的数据集问题。现有研究多构建任务特定的语料库,未形成广泛可用的资源,导致模型复现困难及跨任务比较不可行。其解决方案的关键在于提出一个统一的基准数据集套件,包含四个基于Reddit的互补任务:自杀意念检测、一般心理障碍二分类、双相情感障碍检测以及多类心理障碍分类。所有数据均通过严谨的语言学审查、明确的标注指南和人工验证建立,且标注者间一致性指标始终超过0.8的基线水平,确保标签可靠性;同时,该套件已在Transformer与上下文感知循环模型上验证了卓越性能(F1分数达93–99%),为可复现的心理健康NLP研究提供了统一基础,支持跨任务基准测试、多任务学习与公平模型比较。

链接: https://arxiv.org/abs/2604.23458
作者: Khalid Hasan,Jamil Saquer
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: In the proceedings of 12th Annual Conference on Computational Science Computational Intelligence (CSCI’25)

点击查看摘要

Abstract:The growing availability of online support groups has opened up new windows to study mental health through natural language processing (NLP). However, it is hindered by a lack of high-quality, well-validated datasets. Existing studies have a tendency to build task-specific corpora without collecting them into widely available resources, and this makes reproducibility as well as cross-task comparison challenging. In this paper, we present a uniform benchmark set of four Reddit-based datasets for disjoint but complementary tasks: (i) detection of suicidal ideation, (ii) binary general mental disorder detection, (iii) bipolar disorder detection, and (iv) multi-class mental disorder classification. All datasets were established upon diligent linguistic inspection, well-defined annotation guidelines, and human-judgmental verification. Inter-annotator agreement metrics always exceeded the baseline agreement score of 0.8, ensuring the labels’ trustworthiness. Previous work’s evidence of performance on both transformer and contextualized recurrent models demonstrates that these models receive excellent performances on tasks (F1 ~ 93-99%), further validating the usefulness of the datasets. By combining these resources, we establish a unifying foundation for reproducible mental health NLP studies with the ability to carry out cross-task benchmarking, multi-task learning, and fair model comparison. The presented benchmark suite provides the research community with an easy-to-access and varied resource for advancing computational approaches toward mental health research.

[IR-25] Automating Categorization of Scientific Texts with In-Context Learning and Prompt-Chaining in Large Language Models

【速读】:该论文旨在解决科学文献快速增长背景下,研究人员在知识发现与导航中面临的挑战,特别是如何利用大语言模型(Large Language Models, LLMs)实现对科研文本的自动分类,以提升研究信息系统的智能化水平。其解决方案的关键在于采用分层的 ORKG 分类体系作为标签框架,并通过改进的提示工程策略——即提示链(Prompt Chaining)和上下文学习(In-Context Learning, ICL)——来增强 LLM 在多层级分类任务中的表现。实验表明,提示链相较于纯 ICL 在处理嵌套结构的 ORKG 分类体系时具有显著优势,尤其在一级领域(domain)和二级主题(subject)预测上优于现有最优模型,但对三级主题(topic)的分类仍存在明显不足,准确率仅约 50%。

链接: https://arxiv.org/abs/2604.23430
作者: Gautam Kishore Shahi,Oliver Hummel
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL); Software Engineering (cs.SE)
备注: 25 pages

点击查看摘要

Abstract:The relentless expansion of scientific literature presents significant challenges for navigation and knowledge discovery. Within Research Information Retrieval, established tasks such as text summarization and classification remain crucial for enabling researchers and practitioners to effectively navigate this vast landscape, so that efforts have increasingly been focused on developing advanced research information systems. These systems aim not only to provide standard keyword-based search functionalities but also to incorporate capabilities for automatic content categorization within knowledge-intensive organizations across academia and industry. This study systematically evaluates the performance of off-the-shelf Large Language Models (LLMs) in analyzing scientific texts according to a given classification scheme. We utilized the hierarchical ORKG taxonomy as a classification framework, employing the FORC dataset as ground truth. We investigated the effectiveness of advanced prompt engineering strategies, namely In-Context Learning (ICL) and Prompt Chaining, and experimentally explored the influence of the LLMs’ temperature hyperparameter on classification accuracy. Our experiments demonstrate that Prompt Chaining yields superior classification accuracy compared to pure ICL, particularly when applied to the nested structure of the ORKG taxonomy. LLMs with prompt chaining outperform the state-of-the-art models for domain (1st level) prediction and show even better performance for subject (2nd level) prediction compared to the older BERT model. However, LLMs are not yet able to perform well in classifying the topic (3rd level) of research areas based on this specific hierarchical taxonomy, as they only reach about 50% accuracy even with prompt chaining.

[IR-26] IIRSim Studio: A Dashboard for User Simulation

【速读】:该论文旨在解决现有信息检索(Information Retrieval, IR)用户模拟框架因代码导向设计而导致的实验设置复杂、可复现性差的问题。其核心瓶颈在于缺乏将实验设计、执行与共享整合为统一可验证工作流的基础设施。解决方案的关键是提出 IIRSim Studio,一个基于网页的集成开发环境,通过四个关键创新实现:(1) 可视化模拟流水线构建界面,支持初学者和专家高效协作;(2) 基于 Git 的组件生命周期管理机制,实现自定义模拟组件的版本控制与运行时注入;(3) 以实验包(experiment bundles)和环境模板为基础的溯源模型,明确复现范围;(4) 支持共享任务的工作流,通过 Sim4IA 微任务的重新部署验证其有效性。

链接: https://arxiv.org/abs/2604.23406
作者: Saber Zerhoudi,Adam Roegiest,Michael Granitzer
机构: University of Passau(帕绍大学); Zuva(佐瓦)
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:User simulation is a valuable methodology for evaluation in Information Retrieval (IR), enabling low-cost experimentation and counterfactual analysis. However, existing simulation frameworks are primarily code-centric libraries that require substantial setup effort, which limits adoption and hinders reproducibility. The bottleneck is not the simulation engines themselves, but the lack of infrastructure connecting experiment design, execution, and sharing into a single verifiable workflow. This paper introduces IIRSim Studio, a web-based workbench that addresses this gap through four contributions: (1) a visual environment for composing simulation pipelines on top of simulation frameworks, serving both novices learning simulation concepts and experts piloting large-scale experiments; (2) a component lifecycle that supports authoring, versioning, and sharing custom simulation components through Git-backed storage and runtime injection; (3) a provenance model based on experiment bundles and environment templates that makes the scope of replication explicit; and (4) a shared-task workflow, demonstrated through the re-deployment of a Sim4IA micro-task. IIRSim Studio is available as a hosted service and as a portable containerized deployment.

[IR-27] Lost in Decoding? Reproducing and Stress-Testing the Look-Ahead Prior in Generative Retrieval SIGIR

【速读】:该论文旨在解决生成式检索(Generative Retrieval, GR)方法在有限束宽(beam size)解码下因早期修剪相关前缀而导致的性能下降问题。其解决方案的关键在于引入“提前规划”(Planning Ahead in Generative Retrieval, PAG),通过并行解码计算文档级别的前瞻先验(look-ahead prior),用以引导后续序列化解码过程,从而提升检索准确性与鲁棒性。

链接: https://arxiv.org/abs/2604.23396
作者: Kidist Amde Mekonnen,Yongkang Li,Yubao Tang,Simon Lupart,Maarten de Rijke
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 12 pages, 5 figures, 9 tables; accepted to the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 20-24, 2026, Melbourne/Naarm, Australia

点击查看摘要

Abstract:Generative retrieval (GR) ranks documents by autoregressively generating document identifiers. Because many GR methods rely on trie-constrained beam search, they are vulnerable to early pruning of relevant prefixes under finite-beam decoding. Planning Ahead in Generative Retrieval (PAG) mitigates this failure mode by using simultaneous decoding to compute a document-level look-ahead prior that guides subsequent sequential decoding. We reproduce PAG at inference time and stress-test its decoding behavior. Using the authors’ released checkpoint and identifier/trie artifacts under the reported decoding setup, we reproduce the main effectiveness results on MS MARCO Dev and TREC-DL 2019/2020, and corroborate the reported beam-size-latency trade-off in our hardware setting. Beyond reproduction, we introduce plan drift diagnostics that quantify how intent-preserving query variations alter the planner’s top-n candidate set and highest-weight planner tokens, and how these changes affect guided decoding. We find that PAG’s planning signal is brittle under lexical surface-form variation: intent-preserving typos can trigger plan collapse, where the planned candidate pool shifts enough that the look-ahead bonus provides little useful guidance, effectively reverting decoding toward weaker unguided search. We further evaluate fixed-index cross-lingual robustness using non-English mMARCO queries against an English index, and assess query-side mitigation strategies that require no re-indexing; query translation provides the strongest recovery in our setting. Overall, our results confirm PAG’s reported effectiveness and the benefit of planning-guided decoding under the released inference setup, while showing that these gains depend on the stability of the planning signal under realistic query variation and query-document mismatch.

[IR-28] A Parametric Memory Head for Continual Generative Retrieval SIGIR

【速读】:该论文旨在解决生成式信息检索(Generative Information Retrieval, GenIR)模型在动态文档集合中面临的稳定性-可塑性权衡问题。GenIR将知识参数化编码于模型权重中,导致传统微调方法(如全量微调或参数高效微调)易引发灾难性遗忘,即在适应新文档时显著降低对旧文档的检索性能。为解决此问题,作者提出后适应记忆调优(Post-Adaptation Memory Tuning, PAMT),其核心在于引入一个仅更新记忆模块的稳定阶段:冻结主干网络,并添加一个带有固定寻址机制的参数化记忆头(Parametric Memory Head, PMH)。在前缀树约束解码过程中,解码器隐藏状态稀疏地查询PMH以生成隐藏空间中的残差修正,这些修正通过冻结的输出嵌入矩阵映射为分数调整,且仅作用于前缀树有效token;同时,PAMT基于解码时访问统计选择性更新一小部分记忆值,优先保留当前文档片段高频激活但历史使用率低的记忆条目,从而最小化跨文档切片干扰,实现对旧文档的高保真保留与新文档的高效适应。

链接: https://arxiv.org/abs/2604.23388
作者: Kidist Amde Mekonnen,Yubao Tang,Maarten de Rijke
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 12 pages, 3 figures, 3 tables; accepted to the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 20-24, 2026, Melbourne/Naarm, Australia

点击查看摘要

Abstract:Generative information retrieval (GenIR) consolidates retrieval into a single neural model that decodes document identifiers (docids) directly from queries. While this model-as-index paradigm offers architectural simplicity, it is poorly suited to dynamic document collections. Unlike modular systems, where indexes are easily updated, GenIR’s knowledge is parametrically encoded in its weights; consequently, standard adaptation methods such as full and parameter-efficient fine-tuning can induce catastrophic forgetting. We show that sequential adaptation improves retrieval on newly added documents but substantially degrades performance on earlier slices, exposing a pronounced stability-plasticity trade-off. To address this, we propose post-adaptation memory tuning (PAMT), a memory-only stabilization stage that augments an adapted model with a modular parametric memory head (PMH). PAMT freezes the backbone and attaches a product-key memory with fixed addressing. During prefix-trie constrained decoding, decoder hidden states sparsely query PMH to produce residual corrections in hidden space; these corrections are mapped to score adjustments via the frozen output embedding matrix, computed only over trie-valid tokens. This guides docid generation while keeping routing and backbone parameters fixed. To limit cross-slice interference, PAMT updates only a fixed budget of memory values selected using decoding-time access statistics, prioritizing entries frequently activated by the current slice and rarely used in prior sessions. Experiments on MS MARCO and Natural Questions under sequential, disjoint corpus increments show that PAMT substantially improves retention on earlier slices with minimal impact on retrieval performance for newly added documents, while modifying only a sparse subset of memory values per session.

[IR-29] Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA ICMR2026

【速读】:该论文旨在解决传统基于理由(rationale-based)检索方法中因使用大语言模型(Large Language Models, LLMs)进行跨编码(cross-encoding)查询-文档对所带来的高计算开销问题。其核心解决方案是提出Rabtriever,一种独立编码查询与文档的高效检索器,同时保持与重排序器(reranker)相当的上下文感知理解能力。关键创新在于采用基于策略的蒸馏框架(on-policy distillation),以LLM生成式重排序器作为教师模型,指导Rabtriever学生模型重建教师的上下文感知查询嵌入;具体实现上引入联合嵌入预测架构(Joint-Embedding Predictive Architecture, JEPA),在LLM层间加入轻量可训练预测器,将查询嵌入投影至新隐藏空间并与文档嵌入对齐,从而将教师模型的二次复杂度降低为线性复杂度,显著提升效率并保持性能。

链接: https://arxiv.org/abs/2604.23336
作者: Teng Chen,Sheng Xu,Feixiang Guo,Xiaoyu Wang,Qingqing Gu,Hongyan Li,Luo Ji
机构: Geely AI Lab(吉利AI实验室)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages, 8 figures. ICMR 2026

点击查看摘要

Abstract:Unlike traditional fact-based retrieval, rationale-based retrieval typically necessitates cross-encoding of query-document pairs using large language models, incurring substantial computational costs. To address this limitation, we propose Rabtriever, which independently encodes queries and documents, while providing comparable cross query-document comprehension capabilities to rerankers. We start from training a LLM-based generative reranker, which puts the document prior to the query and prompts the LLM to generate the relevance score by log probabilities. We then employ it as the teacher of an on-policy distillation framework, with Rabtriever as the student to reconstruct the teacher’s contextual-aware query embedding. To achieve this effect, Rabtriever is first initialized from the teacher, with parameters frozen. The Joint-Embedding Predictive Architecture (JEPA) paradigm is then adopted, which integrates a lightweight, trainable predictor between LLM layers and heads, projecting the query embedding into a new hidden space, with the document embedding as the latent vector. JEPA then minimizes the distribution difference between this projected embedding and the teacher embedding. To strengthen the sampling efficiency of on-policy distillation, we also add an auxiliary loss on the reverse KL of LLM logits, to reshape the student’s logit distribution. Rabtriever optimizes the teacher’s quadratic complexity on the document length to linear, verified both theoretically and empirically. Experiments show that Rabtriever outperforms different retriever baselines across diverse rationale-based tasks, including empathetic conversations and robotic manipulations, with minor accuracy degradation from the reranker. Rabtriever also generalizes well on traditional retrieval benchmarks such as MS MARCO and BEIR, with comparable performance to the best retriever baseline.

[IR-30] MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

【速读】:该论文旨在解决当前多模态嵌入模型在全模态(text、image、video、audio及以代理为中心的场景)设置下评估困难的问题,因为现有方法和基准测试大多仅覆盖部分模态,难以系统性地衡量完整模态的表征学习能力。其解决方案的关键在于提出MMEB-V3基准测试,该基准涵盖多种模态并引入OmniSET(Omni-modality Semantic Equivalence Tuples),通过构建跨模态语义等价样本对,实现对语义相似性与模态效应的解耦诊断。实验表明,当前多模态嵌入模型在指令驱动下的模态约束执行能力不足,导致检索行为缺乏一致性,从而揭示了现有方法的核心局限,并为未来研究提供了可量化分析的基准工具。

链接: https://arxiv.org/abs/2604.23321
作者: Haohang Huang,Xuan Lu,Mingyi Su,Xuan Zhang,Ziyan Jiang,Ping Nie,Kai Zou,Tomas Pfister,Wenhu Chen,Wei Zhang,Xiaoyu Shen,Rui Meng
机构: Eastern Institute of Technology (宁波东方理工大学); Shanghai Jiao Tong University (上海交通大学); Google AI Research (Google AI 研究); University of Waterloo (滑铁卢大学); NUS (新加坡国立大学); UCSB (加州大学圣巴巴拉分校); Netmind.ai (Netmind.ai)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Multimodal embedding models aim to map heterogeneous inputs, such as text, images, videos, and audio, into a shared semantic space. However, existing methods and benchmarks remain largely limited to partial modality coverage, making it difficult to systematically evaluate full-modality representation learning. In this work, we take a step toward the full-modality setting. We introduce MMEB-V3, a comprehensive benchmark that evaluates embeddings across text, image, video, audio, as well as agent-centric scenarios. To enable more fine-grained diagnosis, we further construct OmniSET (Omni-modality Semantic Equivalence Tuples), where semantically equivalent instances are represented across modalities, allowing us to disentangle semantic similarity from modality effects. Through experiments on MMEB-V3, we conduct a systematic analysis of full-modality embeddings and identify three key findings: (1) models often fail to retrieve the intended target modality; (2) cross-modal retrieval is highly asymmetric and dominated by query-modality bias; and (3) instruction-induced shifts are either insufficient or misaligned with the target modality, and therefore do not reliably improve retrieval. These results indicate that current multimodal embeddings are not yet capable of reliably enforcing modality constraints specified by instructions, and consequently fail to exhibit consistent modality-aware retrieval behavior. We hope MMEB-V3 provides a useful benchmark for understanding and diagnosing these limitations, and for guiding future research on full-modality embeddings.

[IR-31] Birds of a Feather Cluster Nearby: a Proximity-Aware Geo-Codebook for Local Service Recommendation

【速读】:该论文旨在解决生成式推荐系统在本地服务场景中因忽略地理约束而导致的“语义相关但地理不可达”问题,即现有基于语义ID(Semantic ID, SID)的编码方式未能有效融合空间信息,从而影响推荐的实际可行性。其解决方案的关键在于提出一种感知邻近性的地理码本(Proximity-aware GEO-codebook, Pro-GEO),通过构建以聚类中心为原点的局部坐标系捕捉簇内空间关系,并引入地理旋转位置编码机制,将地理邻近性建模为高维嵌入空间中的正交旋转变换,从而实现语义与空间信号的协同建模,避免地理信息退化为弱辅助特征。

链接: https://arxiv.org/abs/2604.23156
作者: Tian He,Chen Yang,Jiawei Zhang,Lin Guo,Wei Lin,Zhuqing Jiang
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Meituan (美团); Beijing Institute of Technology (北京理工大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Generative recommendation systems are increasingly adopted in local service platforms, where semantic relevance alone is insufficient without strict geographic feasibility. A key technical challenge lies in semantic ID (SID) tokenization, which directly impacts the recommendation performance. However, existing semantic codebooks neglect geographic constraints, often resulting in recommendations that are semantically relevant yet geographically unreachable. To address this limitation, we propose Pro-GEO, a Proximity-aware GEO-codebook. Pro-GEO establishes a geo-centroid local coordinate system to capture intra-cluster spatial relationships and a geo-rotary position encoding mechanism that models geographic proximity as orthogonal rotational transformations in the high-dimensional embedding. This design enables semantic and spatial signals to be jointly modeled in a balanced manner, without reducing geographic information to a weak auxiliary feature. Extensive experiments conducted on a large-scale industrial dataset reveal that Pro-GEO significantly outperforms state-of-the-art methods. In particular, Pro-GEO reduces the average geographic clustering distance by 45.60% and achieves a 1.87% improvement in Hit@50, highlighting its effectiveness for real-world local service recommendation.

[IR-32] Adopting State-of-the-Art Pretrained Audio Representations for Music Recommender Systems RECSYS’24

【速读】:该论文旨在解决音乐信息检索(Music Information Retrieval, MIR)领域中预训练模型在音乐推荐系统(Music Recommender Systems, MRS)中的应用效率未被充分探索的问题,以及推荐系统社区普遍偏好传统端到端神经网络训练而忽视预训练表示学习的优势。其解决方案的关键在于系统性地评估九种主流预训练音频表示模型(如MusicFM、MERT、MusiCNN等)在五种不同推荐方法(包括K近邻、浅层神经网络、对比多模态投影、混合模型和BERT4Rec)下的性能表现,覆盖热启动(hot-start)与冷启动(cold-start)两种典型场景,从而揭示预训练音频特征在MRS任务中与传统MIR任务之间的性能差异,为后续利用预训练音频表示优化音乐推荐系统提供实证基础与研究方向。

链接: https://arxiv.org/abs/2604.23077
作者: Yan-Martin Tamm,Anna Aljanaki
机构: University of Tartu(塔尔图大学)
类目: Information Retrieval (cs.IR)
备注: Extended version of arXiv:2409.08987 . Accepted for publication in the Special Issue “Highlights of RecSys '24” in ACM Transactions on Recommender Systems (TORS)

点击查看摘要

Abstract:Over the years, Music Information Retrieval (MIR) research community has released various models pretrained on large amounts of music data. Transfer learning showcases the proven effectiveness of pretrained backend models for a broad spectrum of downstream tasks, including auto-tagging and genre classification. However, MIR papers generally do not explore the efficiency of pretrained models for Music Recommender Systems (MRS). In addition, the Recommender Systems community tends to favour traditional end-to-end neural network training. Our research addresses this gap and evaluates the performance of nine pretrained backend models (MusicFM, Music2Vec, MERT, EncodecMAE, Jukebox, MusiCNN, MULE, MuQ and MuQ-MuLan) in the context of MRS. We assess them using five recommendation approaches: K-Nearest Neighbours (KNN), Shallow Neural Network, Contrastive Multi-Modal projection, a Hybrid model, and BERT4Rec both for the hot and cold-start scenarios. Our findings suggest that pretrained audio representations exhibit significant performance disparity between traditional MIR tasks and both hot and cold music recommendations, indicating that valuable aspects of musical information captured by backend models may differ depending on the task. This study establishes a foundation for further exploration of pretrained audio representations to enhance music recommendation systems.

[IR-33] CASP: Support-Aware Offline Policy Selection for Two-Stage Recommender Systems

【速读】:该论文旨在解决两阶段推荐系统(Two-stage Recommender Systems)中的离线策略选择问题,即在不依赖在线交互的情况下,如何从多个候选策略中选择出既具有高预期收益又具备足够数据支持的策略。传统单阶段目标无法捕捉到生成器(generator)与排序器(ranker)之间的耦合关系:生成器决定了排序器可用的物品集合,进而影响策略价值估计的可靠性。若某策略过度依赖于低支持度的生成器-物品对,则其性能可能不可靠。解决方案的关键在于提出CASP(Coupled Action-Set Pessimism),该方法通过结合双重稳健(doubly robust)的价值估计与支持负担惩罚项(support-burden penalty),实现对策略价值和数据支持程度的联合评估,从而在价值与支持可信度冲突时优先选择低负担策略。理论分析表明,忽略下游延续价值的分阶段规则可能任意次优,而CASP提供了群体、有限类和重构倾向性下的保守选择保证。

链接: https://arxiv.org/abs/2604.23022
作者: Nilson Chapagain
机构: Texas AM University (德州农工大学)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 10 pages

点击查看摘要

Abstract:Two-stage recommender systems first choose a candidate generator and then rank items within the generated set. Because the generator decides which items are available to the ranker, changing the generator changes both the policy value and the data support used to estimate that value. This creates an offline selection problem that standard single-stage objectives do not capture: a policy may look good under a retrieval score or a raw off-policy value estimate, but still be unreliable if it depends on weakly supported generator-item pairs. We propose CASP (Coupled Action-Set Pessimism), a support-aware offline selector for finite libraries of two-stage recommender policies. CASP combines doubly robust value estimation with a support-burden penalty. We show that stagewise rules that ignore downstream continuation value can be arbitrarily suboptimal, and we derive population, finite-class, and reconstructed-propensity guarantees for conservative selection. In simulations and a reconstructed MovieLens 1M application, CASP selects lower-burden policies when estimated value and support credibility are in tension.

[IR-34] Self Knowledge Re-expression: A Fully Local Method for Adapting LLM s to Tasks Using Intrinsic Knowledge

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在特定非生成性任务中性能受限的问题,其根本原因被归因于LLMs基于下一词预测(Next-Token Prediction, NTP)机制所表达知识的方式不够高效,而非知识获取能力不足。解决方案的关键在于提出一种名为Self-Knowledge Re-expression (SKR)的新颖、任务无关的适配方法,该方法通过仅使用未标注数据,将LLM输出从通用的词元生成转变为高度高效的、任务特定的知识表达方式,从而显著提升信息检索、目标检测和异常检测等任务的性能,且无需人工标注或模型蒸馏。

链接: https://arxiv.org/abs/2604.22939
作者: Mengyu Wang,Xiaoying Zhi,Zhiyi Li,Robin Schmucker,Shay B. Cohen,Tiejun Ma,Fran Silavong
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:While the next-token prediction (NTP) paradigm enables large language models (LLMs) to express their intrinsic knowledge, its sequential nature constrains performance on specialized, non-generative tasks. We attribute this performance bottleneck to the LLMs’ knowledge expression mechanism, rather than to deficiencies in knowledge acquisition. To address this, we propose Self-Knowledge Re-expression (SKR), a novel, task-agnostic adaptation method. SKR transforms the LLM’s output from generic token generation to highly efficient, task-specific expression. SKR is a fully local method that uses only unannotated data, requiring neither human supervision nor model distillation. Experiments on a large financial document dataset demonstrate substantial improvements: over 40% in Recall@1 for information retrieval tasks, over 76% reduction in object detection latency, and over 33% increase in anomaly detection AUPRC. Our results on the MMDocRAG dataset surpass those of leading retrieval models by at least 12.6%.

[IR-35] Citation-Driven Multi-View Training for Patent Embeddings: QaECTER and Sophia-Bench

【速读】:该论文旨在解决专利检索领域缺乏能够反映真实世界搜索场景多样性的基准评测体系的问题,从而制约了模型性能评估与进步。其解决方案的关键在于:首先构建了Sophiabench这一大规模专利检索基准,包含10,000个查询和75,000篇文档,覆盖十年时间跨度、八个国际专利分类(IPC)技术领域及十二个司法管辖区,并支持12种不同类型的查询(从结构化字段到AI生成摘要),同时引入基于引用关系的真值标注与新颖的领域相关性指标(InScope),实现对模型在多种查询类型、技术领域和司法管辖区下性能的系统性测量;其次提出QaECTER模型,一个344M参数的嵌入模型,通过专利引用图和多视角自对齐训练,在规模仅为现有最优模型23倍的情况下,在Sophiabench上全面超越所有现有专利专用模型,平均NDCG@10提升达7.2%,并在独立外部基准上无需任务特定提示即优于所有先前模型,具备直接部署于大规模专利检索系统的实用性。

链接: https://arxiv.org/abs/2604.22897
作者: Younes Djemmal,You Zuo(ALMAnaCH),Kim Gerdes(LISN, Qatent),Kirian Guiller
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Patent retrieval underpins critical decisions in innovation, examination, and IP strategy, yet progress has been hampered by the absence of benchmarks that reflect the diversity of real world search scenarios. We address this gap with two contributions. First, we introduce Sophiabench, a large-scale patent retrieval benchmark comprising 10,000 queries and 75,000 corpus documents stratified across ten years, eight IPC technology sections, and twelve filing jurisdictions. Unlike prior benchmarks, Sophia-bench tests retrieval using 12 different query types-from structured patent fields to AI-generated summaries-and evaluates results against citation-based ground truth enhanced with a novel domain-relevance metric (InScope). Together, these enable systematic measurement of how well models perform across query types, technology domains, and jurisdictions. Second, we introduce QaECTER, a 344M-parameter embedding model trained on patent citation graphs and multi-view self-alignment. Despite its compact size, QaECTER establishes a new state of the art for patent retrieval. It outperforms the #1 model on the English retrieval text embedding benchmark (RTEB), a model 23x larger, as well as all existing patent specific models across every query type, IPC section, and jurisdiction on Sophia-bench, with gains of up to 7.2% average NDCG@10 over the next-best model. These results are confirmed on an independent external benchmark, where QaECTER surpasses all prior models without requiring task-specific instruction prompts. Both the benchmark and the model are designed for practical deployment in large-scale patent search systems.

[IR-36] A Large-Scale Cross-Disciplinary Corpus of Systematic Reviews

【速读】:该论文旨在解决现有系统性综述(Systematic Review, SR)基准数据集在规模和学科覆盖范围上的局限性问题,即多数数据集仅包含少量主题或集中于生物医学领域,难以支持跨学科的系统性综述研究。其解决方案的关键在于构建了一个大规模、跨学科的系统性综述语料库——Webis-SR4ALL-26,包含301,871篇涵盖OpenAlex所覆盖所有科学领域的系统性综述,并通过多阶段预处理流程将文献与已解析的OpenAlex元数据及参考文献列表关联,同时提取可结构化的检索与筛选方法学信息(如布尔查询或关键词列表、纳入/排除标准)。这些结构化数据层为检索模块的跨域基准测试、筛选方法的训练与评估以及不同学科和时间维度下的系统性综述实践比较提供了坚实基础。

链接: https://arxiv.org/abs/2604.22864
作者: Pierre Achkar,Tim Gollub,Arno Simons,Harrisen Scells,Martin Potthast
机构: Leipzig University (莱比锡大学); Fraunhofer ISI; Bauhaus-Universität Weimar (包豪斯大学魏玛分校); TU Berlin (柏林工业大学); University of Tübingen (图宾根大学); Kassel University (卡塞尔大学); hessian.AI; ScaDS.AI
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing benchmarks for systematic reviewing remain limited either in scale or in disciplinary coverage, with some collections comprising only a modest number of topics and others focusing primarily on biomedical research. We present Webis-SR4ALL-26, a large-scale, cross-disciplinary corpus of 301,871 systematic reviews spanning all scientific fields as covered by OpenAlex. Using a multi-stage pre-processing pipeline, we link reviews to resolved OpenAlex metadata and reference lists and extract, when explicitly reported, structured method artifacts relevant to retrieval and screening. These artifacts include reported search strategies (Boolean queries or keyword lists) that we normalize into executable approximations, as well as reported inclusion and exclusion criteria. Together, these layers support cross-domain benchmarking of retrieval and screening components against review reference lists, training and evaluation of extraction methods for review artifacts, and comparative meta-science analyses of systematic review practices across disciplines and time. To demonstrate one concrete use case, we report large-scale baseline retrieval signals by executing normalized search strategies in OpenAlex and comparing retrieved sets to resolved reference lists. We release the corpus and the pre-processing pipeline, along with code used for extraction validation and the retrieval demonstration.

[IR-37] IntrAg ent: An LLM Agent for Content-Grounded Information Retrieval through Literature Review ACL2026

【速读】:该论文旨在解决科学文献中细粒度信息检索的自动化问题,即如何基于研究驱动型查询,在忠实于文献内容的前提下精准提取关键信息。传统方法在处理复杂、跨段落的科学知识时存在准确性不足的问题,尤其在多学科场景下难以保持上下文一致性。解决方案的关键在于提出IntrAgent——一个基于大语言模型(Large Language Model, LLM)的智能代理系统,其核心机制是模拟人类阅读行为:首先通过结构化知识增强的推理对文献段落进行排序(Section Ranking),再以迭代式精读策略逐层提取并融合细节信息,最终生成简洁且语境贴合的答案。该方法显著优于现有RAG(Retrieval-Augmented Generation)和研究代理基线,在跨领域测试中平均提升13.2%的准确率。

链接: https://arxiv.org/abs/2604.22861
作者: Fengbo Ma,Zixin Rao,Xiaoting Li,Zhetao Chen,Hongyue Sun,Yiping Zhao,Xianyan Chen,Zhen Xiang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ACL 2026 main conference

点击查看摘要

Abstract:Scientific research relies on accurate information retrieval from literature to support analytical decisions. In this work, we introduce a new task, INformation reTRieval through literAture reVIEW (IntraView), which aims to automate fine-grained information retrieval faithfully grounded in the provided content in response to research-driven queries, and propose IntrAgent, an LLM-based agent that addresses this challenging task. In particular, IntrAgent is designed to mimic human behaviors when reading literature for information retrieval – identifying relevant sections and then iteratively extracting key details to refine the retrieved information. It follows a two-stage pipeline: a Section Ranking stage that prioritizes relevant literature sections through structural-knowledge-enabled reasoning, and an Iterative Reading stage that continuously extracts details and synthesizes them into concise, contextually grounded answers. To support rigorous evaluation, we introduce IntraBench, a new benchmark consisting of 315 test instances built from expert-authored questions paired with literature spanning five STEM domains. Across seven backbone LLMs, IntrAgent achieves on average 13.2% higher cross-domain accuracy than state-of-the-art RAG and research-agent baselines.

[IR-38] R3AG: Retriever Routing for Retrieval-Augmented Generation

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中因“一刀切”式检索策略导致的性能瓶颈问题,即不同查询对不同检索器存在差异化偏好,而现有路由方法通常仅基于语义相关性静态选择检索器,忽略了检索文档不仅要相关还需能有效支持生成正确答案这一关键因素。解决方案的关键在于提出R³AG框架,其核心创新是将检索器能力解耦为两个可学习维度:检索质量(retrieval quality)和生成效用(generation utility),并通过对比学习目标融合文档评估与下游答案正确性的互补监督信号,显式建模查询与检索器能力之间的动态对齐关系,从而实现更精准的查询感知路由。

链接: https://arxiv.org/abs/2604.22849
作者: Tong Zhao,Yutao Zhu,Yucheng Tian,Zhicheng Dou
机构: Renmin University of China (中国人民大学); South China University of Technology (华南理工大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) has become a cornerstone for knowledge-intensive tasks. However, the efficacy of RAG is often bottlenecked by the one-size-fits-all'' retrieval paradigm, as different queries exhibit distinct preferences for different retrievers. While recent routing techniques attempt to select the optimal retriever dynamically, they typically operate under a single and static capability’’ assumption, selecting retrievers solely based on semantic relevance. This overlooks a critical distinction in RAG: a retrieved document must not only be relevant but also effectively support the generator in producing correct answers. To address this limitation, we propose R ^3 AG, a novel routing framework that explicitly models the dynamic alignment between queries and retriever capabilities. Unlike previous approaches, R ^3 AG decomposes retriever capability into two learnable dimensions: retrieval quality and generation utility. We employ a contrastive learning objective that leverages complementary supervision signals, \textiti.e., document assessments and downstream answer correctness, to capture query-specific preference shifts. Extensive experiments on several knowledge-intensive tasks show that R ^3 AG consistently outperforms both the best individual retrievers and state-of-the-art static routing methods.

[IR-39] Structure Guided Retrieval-Augmented Generation for Factual Queries

【速读】:该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)方法在处理事实性问题时因依赖向量相似度检索而引入语义噪声,导致生成答案无法完全满足查询中复杂条件的问题。其核心挑战在于如何确保检索到的信息不仅语义相关,而且结构上与查询条件严格匹配。解决方案的关键是提出全新的“精确检索问题”(Exact Retrieval Problem, ERP),并设计结构引导的检索增强生成方法(Structure Guided Retrieval-Augmented Generation, SG-RAG),将检索过程建模为基于嵌入的子图匹配任务,并利用检索到的拓扑结构指导大语言模型(LLM)生成符合所有查询条件的答案。

链接: https://arxiv.org/abs/2604.22843
作者: Miao Xie,Xiao Zhang,Yi Li,Chunli Lv
机构: China Agricultural University(中国农业大学); Nanyang Technological University(南洋理工大学); Ministry of Agriculture and Rural Affairs(农业农村部)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has been proposed to mitigate hallucinations in large language models (LLMs), where generated outputs may be factually incorrect. However, existing RAG approaches predominantly rely on vector similarity for retrieval, which is prone to semantic noise and fails to ensure that generated responses fully satisfy the complex conditions specified by factual queries, often leading to incorrect answers. To address this challenge, we introduce a novel research problem, named Exact Retrieval Problem (ERP). To the best of our knowledge, this is the first problem formulation that explicitly incorporates structural information into RAG for factual questions to satisfy all query conditions. For this novel problem, we propose Structure Guided Retrieval-Augmented Generation (SG-RAG), which models the retrieval process as an embedding-based subgraph matching task, and uses the retrieved topological structures to guide the LLM to generate answers that meet all specified query conditions. To facilitate evaluation of ERP, we construct and publicly release Exact Retrieval Question Answering (ERQA), a large-scale dataset comprising 120000 fact-oriented QA pairs, each involving complex conditions, spanning 20 diverse domains. The experimental results demonstrate that SG-RAG significantly outperforms strong baselines on ERQA, delivering absolute improvements from 20.68 to 50.88 points across all evaluation metrics, while maintaining reasonable computational overhead.

[IR-40] RCSB PDB AI Help Desk: retrieval-augmented generation for protein structure deposition support

【速读】:该论文旨在解决结构生物学领域中蛋白质数据银行(Protein Data Bank, PDB)生物注释员(biocurator)在处理海量 depositor(数据提交者)咨询消息时面临的效率瓶颈问题,尤其在2025年已收到约8,000条数据条目中的19,000条消息的情况下。解决方案的关键在于构建一个基于检索增强生成(Retrieval-Augmented Generation, RAG)的AI辅助帮助台系统,其核心技术包括:利用LangChain框架结合pgvector(PostgreSQL向量存储)实现高效文档检索;采用pymupdf4llm进行保留Markdown格式的PDF文本提取;通过两阶段文档分块与最大边际相关性(Maximal Marginal Relevance)检索提升语义准确性;引入主题约束机制过滤无关查询,并使用定制化系统提示防止内部术语泄露;同时采用双大语言模型(LLM)架构分别优化问题浓缩与响应生成流程,最终部署于Kubernetes平台以支持全天候、带引用来源的流式响应服务。

链接: https://arxiv.org/abs/2604.22800
作者: Vivek Reddy Chithari(1),Jasmine Y. Young(1),Irina Persikova(1),Yuhe Liang(1),Gregg V. Crichlow(1),Justin W. Flatt(1),Sutapa Ghosh(1),Brian P. Hudson(1),Ezra Peisach(1),Monica Sekharan(1),Chenghua Shao(1),Stephen K. Burley(1 and 2) ((1) RCSB Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ, USA, (2) RCSB Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, CA, USA)
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
备注: 13 pages, 0 figures

点击查看摘要

Abstract:Motivation: Structural Biologists have contributed more than 245,000 experimentally determined three-dimensional structures of biological macromolecules to the Protein Data Bank (PDB). Incoming data are validated and biocurated by ~20 expert biocurators across the wwPDB. RCSB PDB biocurators who process more than 40% of global depositions face increasing challenges in maintaining efficient Help Desk operations, with approximately 19,000 messages in approximately 8,000 entries received from depositors in 2025. Results: We developed an AI-powered Help Desk using Retrieval-Augmented Generation (RAG) built on LangChain with a pgvector store (PostgreSQL) and GPT-4.1-mini. The system employs pymupdf4llm for Markdown-preserving PDF extraction, two-stage document chunking, Maximal Marginal Relevance retrieval, a topical guardrail that filters off-topic queries, and a specialized system prompt that prevents exposure of internal terminology. A dual-LLM architecture uses separate model configurations for question condensing and response generation. Deployed in production on Kubernetes with PostgreSQL (pgvector), it provides around-the-clock depositor assistance with citation-backed, streaming responses. Availability and implementation: Freely available at this https URL. Comments: 13 pages, 0 figures Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM) Cite as: arXiv:2604.22800 [cs.IR] (or arXiv:2604.22800v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2604.22800 Focus to learn more arXiv-issued DOI via DataCite

[IR-41] Implicit Humanization in Everyday LLM Moral Judgments

【速读】:该论文旨在解决当前对话式信息系统的用户查询中,存在一种隐含的人格化倾向——即用户在社会冲突情境下请求道德判断(如“谁错了?”)时,会无意识地将模型视为具有人类道德判断能力的实体,从而引发不当的人类投射(anthropomorphic projections),进而可能导致用户对大语言模型(Large Language Models, LLMs)产生过度依赖或错误信任的问题。解决方案的关键在于识别并量化LLM响应中所强化的言语、行为和认知层面的人类化线索,并提出应扩展对“拟人化”的理解范畴,不仅关注模型端的显性拟人特征,更需重视用户侧隐含的人类化假设;同时,需设计能够满足用户需求但又能纠正其对模型能力误判的系统机制,以降低潜在风险。

链接: https://arxiv.org/abs/2604.22764
作者: Hoda Ayad,Tanu Mitra
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 6 pages, 3 figures, Published in CHIIR '26

点击查看摘要

Abstract:Recent adoption of conversational information systems has expanded the scope of user queries to include complex tasks such as personal advice-seeking. However, we identify a specific type of sought advice-a request for a moral judgment (i.e. “who was wrong?”) in a social conflict-as an implicitly humanizing query which carries potentially harmful anthropomorphic projections. In this study, we examine the reinforcement of these assumptions in the responses of four major general-purpose LLMs through the use of linguistic, behavioral, and cognitive anthropomorphic cues. We also contribute a novel dataset of simulated user queries for moral judgments. We find current LLM system responses reinforce implicit humanization in queries, potentially exacerbating risks like overreliance or misplaced trust. We call for future work to expand the understanding of anthropomorphism to include implicit userside humanization and to design solutions that address user needs while correcting misaligned expectations of model capabilities.

[IR-42] Behavioral Intelligence Platforms: From Event Streams to Autonomous Insight via Probabilistic Journey Graphs Behavioral Knowledge Extraction and Grounded Language Generation

【速读】:该论文旨在解决当代产品分析系统依赖用户主动查询(如编写SQL、配置仪表盘或构建漏斗)所带来的瓶颈问题,即系统被动响应用户需求,要求使用者具备领域知识与技术能力,并假设其能预先明确所需问题。为突破这一局限,作者提出从“被动问答”向“主动发现”转变的思路,构建了行为智能平台(Behavioral Intelligence Platform, BIP),其核心在于通过四层架构实现从原始事件流到自动洞察生成的闭环:第一层标准化事件并构建语义状态层次;第二层利用行为图引擎(Behavioral Graph Engine, BGE)将用户路径建模为吸收马尔可夫链,量化转移概率、移除效应和路径质量;第三层基于行为知识图谱(Behavioral Knowledge Graph, BKG)识别行为现象;第四层通过受控语言层约束大语言模型输出,确保叙事性洞察基于验证事实。该方案的关键创新在于定义了行为智能问题的形式化框架、提出自主洞察生成的检测器分类体系,并设计了一个有趣度评分机制以在注意力有限条件下优先排序洞察。

链接: https://arxiv.org/abs/2604.22762
作者: Arun Patra,Bhushan Vadgave
机构: Journium, Inc.
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Contemporary product analytics systems require users to pose explicit queries, such as writing SQL, configuring dashboards, or constructing funnels, before insights can surface. This pull-based paradigm creates a bottleneck: it requires both domain knowledge and technical fluency, and assumes practitioners know in advance which questions to ask. We argue that behavioral analytics should move from passive systems that answer queries to active systems that continuously detect and explain behavioral phenomena. We present the Behavioral Intelligence Platform (BIP), a system architecture that transforms raw event streams into automatically generated insights. BIP consists of four layers. First, Normalization and State Derivation (NSD) standardizes events and maps them to a semantic state hierarchy. Second, a Behavioral Graph Engine (BGE) models user journeys as absorbing Markov chains and computes transition probabilities, removal effects, and path quality metrics. Third, a Behavioral Knowledge Graph (BKG) and Detector System convert graph outputs into grounded behavioral facts and identify behavioral phenomena. Finally, a Grounded Language Layer constrains large language model outputs to verified facts, producing reliable narrative insights. We formalize the Behavioral Intelligence Problem, introduce a taxonomy of detectors for autonomous insight generation, and propose an interestingness score to prioritize insights under limited attention. Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.22762 [cs.IR] (or arXiv:2604.22762v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2604.22762 Focus to learn more arXiv-issued DOI via DataCite

[IR-43] CS3: Efficient Online Capability Synergy for Two-Tower Recommendation

【速读】:该论文旨在解决多阶段推荐系统中轻量级两塔模型(two-tower models)因孤立架构导致的表征能力受限、嵌入空间对齐不足以及跨特征建模困难的问题。现有方法如晚期交互或知识蒸馏虽能缓解上述问题,但常引入显著延迟或难以在在线学习场景中实现。解决方案的关键在于提出一种高效在线框架Capability Synergy (CS3),其核心创新包括:(1) 循环自适应结构(Cycle-Adaptive Structure),通过塔内自适应特征去噪实现自我修正;(2) 塔间同步机制(Cross-Tower Synchronization),利用双塔间的相互感知提升表征对齐;(3) 级联模型共享(Cascade Model Sharing),通过复用下游模型的知识保障跨阶段一致性。该框架兼容多种两塔架构,在保持毫秒级延迟的同时显著提升推荐效果。

链接: https://arxiv.org/abs/2604.22761
作者: Lixiang Wang,Shaoyun Shi,Peng Wang,Wenjin Wu,Peng Jiang
机构: Kuaishou Technology (快手科技)
类目: Information Retrieval (cs.IR)
备注: 12pages, 8 figures, under review

点击查看摘要

Abstract:To balance effectiveness and efficiency in recommender systems, multi-stage pipelines employ lightweight two-tower models for large-scale candidate retrieval. However, their isolated architecture inherently hampers representation capacity, embedding-space alignment, and cross-feature modeling. Prior studies have explored incorporating late interaction or knowledge distillation to mitigate these issues, but such approaches often significantly increase model latency or pose challenges for implementation in online learning scenarios. To address these limitations, we propose an efficient online framework called Capability Synergy (CS3), which enhances two-tower models through three key innovations: (1) Cycle-Adaptive Structure, enabling self-revision via adaptive feature denoising within individual towers; (2) Cross-Tower Synchronization, improving representation alignment through mutual awareness between the towers; and (3) CascadeModel Sharing, bridging cross-stage consistency by reusing knowledge from downstream models. The CS3 framework is compatible with various two-tower architectures and meets real-time requirements in online learning scenarios. We evaluated CS3 on three public offline datasets and subsequently deployed it in a large-scale advertising system. Experimental results demonstrate that CS3 increases online ad revenue by up to 8.36% across three scenarios while maintaining millisecond-level latency and consistently performing well across diverse two-tower architectures.

[IR-44] Quantifying Divergence in Inter-LLM Communication Through API Retrieval and Ranking AAAI2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)作为自主代理在调用外部API执行复杂任务时,其行为一致性与可靠性缺乏量化评估的问题。解决方案的关键在于提出一个统一的基准测试框架,通过集合重叠(Average Overlap)、排名偏移重叠(Rank-Biased Overlap)、Kendall’s tau、Kendall’s W及Cronbach’s alpha等多维度指标,系统性地测量不同LLM在相同任务下对API发现和排序的分歧程度(inter-LLM divergence)。研究发现,尽管整体一致性中等(AO≈0.50,tau≈0.45),但分歧显著依赖于任务类型:结构化任务(如天气查询、语音转文本)稳定性高,而开放式任务(如情感分析)则表现出更高波动性;此外,该框架揭示了看似一致的决策可能掩盖动作相关排序中的不稳定性,从而暴露多智能体协作中的系统性失效模式,为部署前的安全诊断和基于共识加权的异构LLM协同提供了关键依据。

链接: https://arxiv.org/abs/2604.22760
作者: Eyhab Al-Masri
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: AAAI 2026 Conference (LAMAS Workshop)

点击查看摘要

Abstract:Large language models (LLMs) increasingly operate as autonomous agents that reason over external APIs to perform complex tasks. However, their reliability and agreement remain poorly characterized. We present a unified benchmarking framework to quantify inter-LLM divergence, defined as the extent to which models differ in API discovery and ranking under identical tasks. Across 15 canonical API domains and 5 major model families, we measure pairwise and group-level agreement using set-, rank-, and consensus-based metrics including Average Overlap, Jaccard similarity, Rank-Biased Overlap, Kendall’s tau, Kendall’s W, and Cronbach’s alpha. Results show moderate overall alignment (AO about 0.50, tau about 0.45) but strong domain dependence: structured tasks (Weather, Speech-to-Text) are stable, while open-ended tasks (Sentiment Analysis) exhibit substantially higher divergence. Volatility and consensus analyses reveal that coherence clusters around data-bound domains and degrades for abstract reasoning tasks. These insights enable reliability-aware orchestration in multi-agent systems, where consensus weighting can improve coordination among heterogeneous LLMs. Beyond performance benchmarking, our results reveal systematic failure modes in multi-agent LLM coordination, where apparent agreement can mask instability in action-relevant rankings. This hidden divergence poses a pre-deployment safety risk and motivates diagnostic benchmarks for early detection.

[IR-45] Beyond Static: Related Questions Retrieval Through Conversations in Community Question Answering AAAI2026

【速读】:该论文旨在解决社区问答(community question answering, cQA)平台中相关问题检索(related question retrieval)的性能瓶颈问题,现有方法多依赖静态特征而忽视了用户与系统之间的交互特性,导致难以捕捉问题间的细粒度语义差异。其解决方案的关键在于提出一种基于对话的检索模型TeCQR,通过构建标签增强的澄清问题(tag-enhanced clarifying questions, CQs)来建立上下文对话,并设计噪声容忍机制以评估问题与标签间的语义相似性,从而有效处理反馈噪声;同时引入标签增强的两阶段离线训练策略,充分挖掘用户查询、问题和标签之间的相互关系,学习更精细的表示,最终结合上下文对话和生成的澄清问题实现更精准的相关问题检索。

链接: https://arxiv.org/abs/2604.22759
作者: Xiao Ao,Jie Zou,Yibiao Wei,Peng Wang,Weikang Guo
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages. Accepted at AAAI 2026

点击查看摘要

Abstract:In community question answering (cQA) platforms like Stack Overflow, related question retrieval is recognized as a fundamental task that allows users to retrieve related questions to answer user queries automatically. Although many traditional approaches have been proposed for investigating this research field, they mostly rely on static approaches and neglect the interaction property. We argue that the conversational way can well distinguish the fine-grained representations of questions and has great potential to improve the performance of question retrieval. In this paper, we propose a related question retrieval model through conversations, called TeCQR, to locate related questions in cQA. Specifically, we build conversations by utilizing tag-enhanced clarifying questions (CQs). In addition, we design a noise tolerance model that evaluates the semantic similarity between questions and tags, enabling the model to effectively handle noisy feedback. Moreover, the tag-enhanced two-stage offline training is proposed to fully exploit the mutual relationships among user queries, questions, and tags to learn their fine-grained representations. Based on the learned representations and contextual conversations, TeCQR incorporates conversational feedback by learning to ask tag-enhanced clarifying questions to retrieve related questions more effectively. Experimental results demonstrate that our model significantly outperforms state-of-the-art baselines.

[IR-46] RedParrot: Accelerating NL-to-DSL for Business Analytics via Query Semantic Caching

【速读】:该论文旨在解决企业级实时业务分析中自然语言(Natural Language, NL)到领域特定语言(Domain-Specific Language, DSL)转换的高延迟、高成本与误差传播问题。现有基于大语言模型(Large Language Models, LLMs)的多阶段流水线在面对高频重复查询时效率低下,难以满足电商和广告场景下的低延迟需求。解决方案的关键在于提出 RedParrot 框架,其核心创新是引入语义缓存机制:通过离线构建“查询骨架”(query skeletons,即标准化的结构模式),在线使用无实体依赖的对比学习嵌入模型进行高效匹配,并结合异构检索增强生成(Retrieval-Augmented Generation, RAG)方法融合多源知识以处理未见实体,从而显著提升推理速度与准确性。

链接: https://arxiv.org/abs/2604.22758
作者: Tong Wang,Yongqin Xu,Jianfeng Zhang,Lingxi Cui,Wenqing Wei,Suzhou Chen,Huan Li,Ke Chen,Lidan Shou
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recently, at Xiaohongshu, the rapid expansion of e-commerce and advertising demands real-time business analytics with high accuracy and low latency. To meet this demand, systems typically rely on converting natural language (NL) queries into Domain-Specific Languages (DSLs) to ensure semantic consistency, validation, and portability. However, existing multi-stage LLM pipelines for this NL-to-DSL task suffer from prohibitive latency, high cost, and error propagation, rendering them unsuitable for enterprise-scale deployment. In this paper, we propose RedParrot, a novel NL-to-DSL framework that accelerates inference via a semantic cache. Observing the high repetition and stable structural patterns in user queries, RedParrot bypasses the costly pipeline by matching new requests against cached “query skeletons” (normalized structural patterns) and adapting their corresponding DSLs. Our core technical contributions include (1) an offline skeleton construction strategy, (2) an online, entity-agnostic embedding model trained via contrastive learning for robust matching, and (3) a heterogeneous Retrieval-Augmented Generation (RAG) method that integrates diverse knowledge sources to handle unseen entities. Experiments on six real enterprise datasets from Xiaohongshu show RedParrot achieves an average 3.6x speedup and an 8.26% accuracy improvement. Furthermore, on new public benchmarks adapted from Spider and BIRD, it boosts accuracy by 34.8%, substantially outperforming standard in-context learning baselines.

[IR-47] StratRAG : A Multi-Hop Retrieval Evaluation Dataset for Retrieval-Augmented Generation Systems

【速读】:该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)系统在多跳推理任务中,于现实场景下存在噪声文档池时的评估难题。现有基准数据集难以模拟真实环境中大量相关但非目标文档(即干扰项)对检索性能的影响,导致模型评估结果与实际应用脱节。解决方案的关键在于构建StratRAG——一个基于HotpotQA(干扰项设置)的开源检索评估数据集,包含2,200个样本,覆盖桥接型、比较型和是非型三类问题,每条查询对应15篇候选文档(含2篇黄金文档和13篇主题相关干扰项)。通过在此数据集上对比BM25、密集检索(all-MiniLM-L6-v2)及混合融合三种检索策略,发现混合检索在Recall@2(0.70)、MRR(0.93)等指标上表现最优,但桥接类问题仍具挑战性(Recall@2=0.67),凸显了未来需结合强化学习等方法优化检索策略的必要性。

链接: https://arxiv.org/abs/2604.22757
作者: Aryan Patodiya
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 6 Pages, 3 Table

点击查看摘要

Abstract:We introduce StratRAG, an open-source retrieval evaluation dataset for benchmarking Retrieval-Augmented Generation (RAG) systems on multi-hop reasoning tasks under realistic, noisy document-pool conditions. Derived from HotpotQA (distractor setting), StratRAG comprises 2,200 examples across three question types – bridge, comparison, and yes-no – each paired with a pool of 15 candidate documents containing exactly 2 gold documents and 13 topically related distractors. We benchmark three retrieval strategies – BM25, dense retrieval (all-MiniLM-L6-v2), and hybrid fusion – reporting Recall@k, MRR, and NDCG@5 on the validation set. Hybrid retrieval achieves the best overall performance (Recall@2 = 0.70, MRR = 0.93), yet bridge questions remain substantially harder (Recall@2 = 0.67), motivating future work on reinforcement-learning-based retrieval policies. StratRAG is publicly available at this https URL.

[IR-48] Your Reviews Replicate You: LLM -Based Agents as Customer Digital Twins for Conjoint Analysis ALT DATE

【速读】:该论文旨在解决传统联合分析(Conjoint Analysis)在市场研究中面临的时效性差、成本高及受访者疲劳等问题。其解决方案的关键在于构建基于大语言模型(Large Language Model, LLM)的“客户数字孪生体”(Customer Digital Twins, CDT),通过整合检索增强生成(Retrieval-Augmented Generation, RAG)与提示工程(Prompt Engineering),使虚拟受访者能够动态调用个体历史偏好和约束信息,从而模拟真实用户进行配对比较决策。该方法显著提升了市场研究的敏捷性和成本效益,并在计算机显示器品类案例中验证了其预测准确率达87.73%,有效量化了属性间的权衡关系。

链接: https://arxiv.org/abs/2604.22756
作者: Bin Xuan,Jungmin Hwang,Hakyeon Lee
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures + This abstract introduces an LLM-based Customer Digital Twin framework that replaces human respondents in conjoint analysis with RAG-enhanced customer agents, validated at 87.73% accuracy on Reddit user data, and positions the contribution as a scalable alternative to traditional preference elicitation methods

点击查看摘要

Abstract:Conjoint analysis is a cornerstone of market research for estimating consumer preferences; however, traditional methods face persistent challenges regarding time, cost, and respondent fatigue. To address these limitations, this study proposes a framework that utilizes large language model (LLM)-based “customer digital twins (CDT)” as virtual respondents. We identified active users within the Reddit community and aggregated their comprehensive review histories to construct individualized vector databases. By integrating retrieval-augmented generation (RAG) with prompt engineering, this study developed customer agents capable of dynamically retrieving and reasoning upon their specific past preferences and constraints. These customer agents, called CDTs, performed pairwise comparison tasks on product profiles generated via fractional factorial design, and the resulting choice data was analyzed to estimate part-worth utilities by logistic regression. Empirical validation demonstrates that these CDTs predict the preferences of actual users with 87.73% accuracy. Furthermore, a case study on the computer monitor category successfully quantified trade-offs between attributes such as panel type and resolution, deriving preference structures consistent with market realities. Ultimately, this study contributes to marketing research by presenting a scalable alternative that significantly improves both agility and cost-efficiency to traditional methods.

[IR-49] RADIANT-LLM : an Agent ic Retrieval Augmented Generation Framework for Reliable Decision Support in Safety-Critical Nuclear Engineering

【速读】:该论文旨在解决核工程领域中决策支持系统因文档碎片化和预训练大语言模型(Large Language Model, LLM)幻觉问题而导致的知识检索不可靠、安全性不足的问题。其解决方案的关键在于提出RADIANT-LLM框架,这是一个基于多模态检索增强生成(Retrieval-Augmented Generation, RAG)的本地优先、模型无关架构,通过结构化的元数据丰富知识库实现文档级乃至图示级别的精准检索,并引入代理层协调领域专用工具、强制引用溯源与人工在环验证机制,从而显著降低幻觉率并提升响应透明度与可审计性。

链接: https://arxiv.org/abs/2604.22755
作者: Zavier Ndum Ndum,Jian Tao,John Ford,Mansung Yim,Yang Liu
机构: Texas AM University(德州农工大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reliable decision support in nuclear engineering requires traceable, domain-grounded knowledge retrieval, yet safety and risk analysis workflows remain hampered by fragmented documentation and hallucination when use pre-trained large language model (LLM) in specialized nuclear domains. To address these challenges, this paper presents RADIANT-LLM (Retrival-Augumented, Domain-Intelligent Agent for Nuclear Technologies using LLM), a multi-modal retrieval-augmented generation (RAG) framework designed for nuclear safety, security, and safeguards applications. The framework uses a local-first, model-agnostic architecture that pairs a multi-modal document ingestion pipeline with a structured, metadata-rich knowledge base, supporting page- and figure-level retrieval from technical documents. An agentic layer coordinates domain-specific tools, enforces citation-backed responses with provenance tracking, and supports human-in-the-loop validation to reduce hallucination risks. To rigorously evaluate this framework, we develop and apply a suite of domain-aware metrics, including Context Precision (CoP), Hallucination Rate (HR), and Visual Recall (ViR), to expert-curated benchmarks derived from Used Nuclear Fuel Storage Facility design guidance. Across varying knowledge base sizes, CoP and ViR remain within an 85–98% band, and hallucination rates are substantially lower than those observed in general-purpose deployments. When the same queries are posed to commercial LLM platforms without the RAG layer, hallucinations and citation errors increase markedly. These results indicate that a locally controlled, multi-modal RAG framework with domain-specific retrieval and provenance enforcement is necessary to achieve the factual accuracy, transparency, and auditability that nuclear engineering workflows demand. Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.22755 [cs.IR] (or arXiv:2604.22755v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2604.22755 Focus to learn more arXiv-issued DOI via DataCite

人机交互

[HC-0] Personalized Worked Example Generation from Student Code Submissions using Pattern-based Knowledge Components

【速读】:该论文旨在解决自适应编程教学中因固定习题库导致的学习内容与学生实际逻辑错误不匹配的问题,从而降低个性化学习的效率。其解决方案的关键在于利用从学生代码中提取的基于结构模式的知识成分(Knowledge Component, KC)来引导生成式模型,通过抽象语法树(AST)分析识别出学生代码中的重复性KC模式,并以此作为条件生成更贴合学生认知偏差的讲解示例,实现了以KC为锚点的个性化教育内容生成。

链接: https://arxiv.org/abs/2604.24758
作者: Griffin Pitts,Muntasir Hoq,Peter Brusilovsky,Narges Norouzi,Arto Hellas,Juho Leinonen,Bita Akram
机构: North Carolina State University (北卡罗来纳州立大学); University of Pittsburgh (匹兹堡大学); University of California, Berkeley (加州大学伯克利分校); Aalto University (阿尔托大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: Accepted to the Thirteenth ACM Conference on Learning @ Scale (L@S 2026)

点击查看摘要

Abstract:Adaptive programming practice often relies on fixed libraries of worked examples and practice problems, which require substantial authoring effort and may not correspond well to the logical errors and partial solutions students produce while writing code. As a result, students may receive learning content that does not directly address the concepts they are working to understand, while instructors must either invest additional effort in expanding content libraries or accept a coarse level of personalization. We present an approach for knowledge-component (KC) guided educational content generation using pattern-based KCs extracted from student code. Given a problem statement and student submissions, our pipeline extracts recurring structural KC patterns from students’ code through AST-based analysis and uses them to condition a generative model. In this study, we apply this approach to worked example generation, and compare baseline and KC-conditioned outputs through expert evaluation. Results suggest that KC-conditioned generation improves topical focus and relevance to learners’ underlying logical errors, providing evidence that KC-based steering of generative models can support personalized learning at scale.

[HC-1] Childrens Online Safety Risks and Ethical Considerations in XR Games

【速读】:该论文旨在解决扩展现实(Extended Reality, XR)游戏设计中对儿童带来的在线安全风险问题,这些问题包括因沉浸式体验导致的现实判断力下降、交通事故以及接触有害内容(如虚拟性侵)等。其解决方案的关键在于提出一种以儿童为中心、注重设计意识的伦理框架,强调平台与政策制定者需充分考虑儿童的发展需求,并通过跨行业协作推动更安全、更具包容性的XR环境建设。

链接: https://arxiv.org/abs/2604.24601
作者: Zinan Zhang,Xinning Gui,Yubo Kou
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: This is an accepted position statement of IDC 2025 Workshop (Extended Reality and Children: Risks, Opportunities, and Ethics Workshop)

点击查看摘要

Abstract:Emerging extended reality technologies are reshaping how children play, learn, and socialize. Yet, they also present serious safety risks. Gaming, a primary form of entertainment for children, is also one of the key applications of XR. While XR platforms offer immersive and engaging gaming experiences, recent news has highlighted safety concerns such as car accidents, lower judgment for real-world situations, and exposure to disturbing content like virtual rape. This research examines how XR game design may lead to online safety risks for children. Through analysis of player forums, game developer forums, and interviews with child players, we identify harmful XR design patterns, explore how developers collaboratively generate and implement risky game ideas, and document children’s firsthand experiences of online safety risks. Existing ethical frameworks often fail to address the immersive and socially dynamic nature of XR games. We advocate for a child-centered, design-aware approach to ethical considerations in XR games, urging platforms and policymakers to prioritize children’s developmental needs. Our work aims to help shape safer, more inclusive XR environments through research and cross-sector collaboration.

[HC-2] Why AI Harms Cant Be Fixed One Identity at a Time: What 5300 Incident Reports Reveal About Intersectionality

【速读】:该论文旨在解决当前AI风险评估体系未能充分识别和量化交叉性伤害(intersectional harms)的问题,即由多个身份类别(如年龄、阶级、政治立场等)交互作用导致的非线性危害效应。现有评估方法多局限于单一身份维度(如种族与性别),忽视了复杂社会结构下多重压迫叠加对特定群体造成的放大性伤害。其解决方案的关键在于构建一个基于大规模AI事故数据库(AI Incident Database)的结构化分析框架,并借助大语言模型(Large Language Model, LLM)对5,300份事故报告进行精准标注与分类,从而系统识别出受损害个体及其多重身份特征。结果显示,除种族与性别外,年龄和政治身份同样高频出现在AI伤害中,且在特定交叉组合(如青少年女性、低收入有色人群、高收入政治精英)中,伤害强度可提升至单一类别下的三倍,证明将交叉性理论纳入AI风险评估是实现更精准、公平的风险识别与治理的核心路径。

链接: https://arxiv.org/abs/2604.24519
作者: Edyta Bogucka,Sanja Šćepanović,Daniele Quercia
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 29 pages, 7 figures

点击查看摘要

Abstract:AI risk assessment is the primary tool for identifying harms caused by AI systems. These include intersectional harms, which arise from the interaction between identity categories (e.g., class and skin tone) and which do not occur, or occur differently, when those categories are considered separately. Yet existing AI risk assessments are still built around isolated identity categories, and when intersections are considered, they focus almost exclusively on race and gender. Drawing on a large-scale analysis of documented AI incidents, we show that AI harms do not occur one identity category at a time. Using a structured rubric applied with a Large Language Model (LLM), we analyze 5,300 reports from 1,200 documented incidents in the AI Incident Database, the most curated source of incident data. From these reports, we identify 1,513 harmed subjects and their associated identity categories, achieving 98% accuracy. At the level of individual categories, we find that age and political identity appear in documented AI harms at rates comparable to race and gender. At the level of intersecting categories, harm is amplified up to three times at specific intersections: adolescent girls, lower-class people of color, and upper-class political elites. We argue that intersectionality should be a core component of AI risk assessment to more accurately capture how harms are produced and distributed across social groups.

[HC-3] Blur Effects on User Performance in Target-Pointing Tasks

【速读】:该论文旨在解决在投影仪和头戴式显示器(Head-Mounted Displays, HMDs)中,由于图像模糊导致用户操作效率下降的问题,尤其是在显示距离较远或用户视力不佳时,用户难以清晰辨识屏幕内容,从而影响交互性能。解决方案的关键在于通过实验量化模糊强度与用户任务表现之间的关系,并提出一种改进的Fitts定律模型,能够高精度预测受模糊影响下的移动时间(movement time)。进一步研究表明,通过为每位用户动态调整目标尺寸以补偿其视觉 acuity(视觉敏锐度),可以显著降低模糊对移动时间的影响,从而为自适应用户界面设计提供了理论依据和技术路径。

链接: https://arxiv.org/abs/2604.24482
作者: Ryuto Tomihari,Taiki Kinoshita,Yosuke Oba,Shota Yamanaka,Homei Miyashita
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:In projectors and head-mounted displays, an out-of-focus image appears blurred. Even when a display itself is in focus, computer operation may be hindered if the display is far from the user or if a user has poor visual acuity, because the user cannot see the screen clearly. In this study, we conducted an experiment in which participants performed a pointing task under blurred display conditions and investigated the relationship between blur strength and user performance. The results showed that movement time and error rate increased as blur became stronger, and that the effect of blur on movement time was larger when targets were smaller. We further showed that movement time can be estimated with high accuracy by a model that improves on Fitts’ law. In a follow-up experiment to examine the applicability of this model, we adjusted target size for each participant and showed that the effect of blur level on movement time could be reduced. These findings suggest potential use in tools that adapt user interfaces to users’ visual acuity.

[HC-4] Putting a Face to the Issue: Fostering User Empathy of Open Source Software Developers With PersonaFlow

【速读】:该论文旨在解决开源软件(Open-source Software, OSS)开发者难以理解与回应用户情境的问题,现有工具如缺陷追踪系统主要聚焦于技术讨论,忽略了用户的情感与背景信息。解决方案的关键在于提出PersonaFlow工具,该工具能够从OSS仓库的各类文档中自动生成可编辑的用户角色画像(persona),并将其集成到问题报告(issue report)旁,从而帮助开发者更全面地理解用户需求。实证研究表明,大多数开发者在使用后改变了对用户的认知方式,并采取更具同理心的行为,例如调整语气、优化解释或提升优先级,这表明PersonaFlow有效促进了以用户为中心的开发实践。

链接: https://arxiv.org/abs/2604.24478
作者: Boniface Bahati Tadjuidje,Jin L.C. Guo,Jinghui Cheng
机构: Polytechnique Montreal (蒙特利尔工程学院); McGill University (麦吉尔大学)
类目: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: 25 pages, 4 figures, Accepted to DIS 2026

点击查看摘要

Abstract:Open-source software (OSS) developers often struggle to understand and respond to user context, while existing tools, such as issue trackers (for handling bugs, requests, and feedback), largely focus on technical discussion. Although personas could help, limited resources and UX expertise make them hard to scale. We present PersonaFlow, a tool that generates editable user personas from OSS repository artifacts and integrates them alongside issue reports. In a user study with 13 OSS developers, most reported shifts in how they understood users, and more than half modified their responses by adding empathetic language, tailoring explanations, or raising priority ratings. We found two pathways to this change: some connected emotionally to personas as people, while others used them pragmatically for triaging. Both appeared to lead to more user-centered behavior. We contribute design implications for persona-based tools relevant to OSS and other contexts where efficiency-driven systems or workflows obscure valuable human elements.

[HC-5] Measuring Successful Cooperation in Human-AI Teamwork: Development and Validation of the Perceived Cooperativity and Teaming Perception Scales

【速读】:该论文旨在解决当前人机协作(Human-AI Cooperation)情境下缺乏可靠主观评估工具的问题,以量化用户对协作质量的感知。解决方案的关键在于提出并验证两个理论驱动的量表:基于联合活动理论(Joint Activity Theory)的感知协作者量表(Perceived Cooperativity Scale, PCS),用于衡量单次交互中个体感知到的协作能力与实践;以及基于进化合作理论(Evolutionary Cooperation Theory)的团队感量表(Teaming Perception Scale, TPS),用于捕捉由相互贡献和支持所涌现的团队协同感知。这两个量表均经过人类-人类协作场景的适配,从而支持跨代理比较,并在三种不同协作场景(合作纸牌游戏、大语言模型交互、决策支持系统)中展现出良好的区分度、信度和构念效度,为后续人机协作的实证研究与系统评估提供了标准化测量工具。

链接: https://arxiv.org/abs/2604.24461
作者: Christiane Attig,Christiane Wiebel-Herboth,Patricia Wollstadt,Tim Schrills,Mourad Zoubir,Thomas Franke
机构: Honda Research Institute Europe GmbH, Offenbach am Main, Germany; Institute of Human-Centered Interactive Systems, University of Lübeck, Lübeck, Germany
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 33 pages, 3 figures

点击查看摘要

Abstract:As human-AI cooperation becomes increasingly prevalent, reliable instruments for assessing the subjective quality of cooperative human-AI interaction are needed. We introduce two theoretically grounded scales: the Perceived Cooperativity Scale (PCS), grounded in joint activity theory, and the Teaming Perception Scale (TPS), grounded in evolutionary cooperation theory. The PCS captures an agent’s perceived cooperative capability and practice within a single interaction sequence; the TPS captures the emergent sense of teaming arising from mutual contribution and support. Both scales were adapted for human-human cooperation to enable cross-agent comparisons. Across three studies (N = 409) encompassing a cooperative card game, LLM interaction, and a decision-support system, analyses of dimensionality, reliability, and validity indicated that both scales successfully differentiated between cooperation partners of varying cooperative quality and showed construct validity in line with expectations. The scales provide a basis for empirical investigation and system evaluation across a wide range of human-AI cooperation contexts.

[HC-6] Envisioning Mobile Data Visualization Libraries for Digital Health

【速读】:该论文旨在解决移动健康(mHealth)应用中可视化设计质量参差不齐的问题,特别是针对小屏幕设备的可视化适配不足,导致用户难以有效理解个人健康数据。其核心问题在于现有可视化工具库多面向桌面或通用移动场景,缺乏对健康语义(如正常范围、阈值和目标)的支持,迫使开发者采用不一致且难解释的定制方案。解决方案的关键是开发专为个人健康数据和移动上下文设计的可视化库,强调智能默认设置、内置健康标注以及流畅交互等设计考量,从而降低开发门槛、提升一致性,并增强mHealth应用的可访问性和可解释性。

链接: https://arxiv.org/abs/2604.24448
作者: Bongshin Lee,Seongjae Bae,Mengying Li,Eun Kyoung Choe
机构: Yonsei University (延世大学); University of Maryland, College Park (马里兰大学学院公园分校)
类目: Human-Computer Interaction (cs.HC)
备注: 7 pages, 4 figures. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Mobile health (mHealth) applications support health management through rich data collection and self-reflection, yet the quality of their visualizations varies widely. A key limitation is the suboptimal design of visualizations for small-screen devices. We argue that this gap is partly driven by a lack of specialized developer tools. Existing libraries primarily target desktop or general-purpose mobile use, providing limited support for health-specific semantics such as normal ranges, thresholds, and goals. As a result, developers often resort to custom solutions that are inconsistent or hard to interpret. We therefore advocate for dedicated mobile visualization libraries tailored to personal health data and mobile contexts, and discuss key design considerations including intelligent defaults, built-in health annotations, and fluid interactions. Such libraries can lower barriers, promote consistency, and enable more accessible and interpretable mHealth applications.

[HC-7] How Personal Characteristics Shape User Exploration of Diverse Movie Recommendations with a LLM -Based Multi-Agent System

【速读】:该论文旨在解决推荐系统在追求准确率之外,如何有效支持用户探索多样化内容的问题,尤其关注个体差异对用户体验的影响。传统单代理推荐系统可能无法充分满足不同用户对新颖性和多样性的需求,而本研究提出了一种基于大语言模型(Large Language Model, LLM)的多智能体(multi-agent)系统作为解决方案。其关键在于利用多个智能体协同生成推荐,并通过用户个性特征(如人格特质、GenAI使用经验与信任度)进行个性化适配,从而显著提升用户的感知新颖性(Perceived Novelty)和香农多样性(Shannon Diversity),同时揭示了系统设计与用户特征之间的交互效应,强调了构建人格感知型对话式推荐系统的重要性。

链接: https://arxiv.org/abs/2604.24405
作者: Yufan Zhou,Yirui Huang,Zhao Wang,Yucheng Jin
机构: KU Leuven (鲁汶大学); Duke Kunshan University (昆山杜克大学); Zhejiang University (浙江大学)
类目: Human-Computer Interaction (cs.HC)
备注: 11 pages, 2 figures. Accepted by UMAP '26 (34th ACM International Conference on User Modeling, Adaptation and Personalization)

点击查看摘要

Abstract:Diversity is an important evaluation criterion for recommender systems beyond accuracy, yet users differ in their willingness to engage with novel and diverse content. In this work, we investigate how a Large Language Model (LLM)-based multi-agent system supports users’ exploration of diverse recommendations, and how individual characteristics shape user experiences. We conducted a between-subjects user study (N = 100) comparing a single-agent system (baseline) with a multi-agent system for movie recommendations. We measured Perceived Accuracy, diversity, novelty, and overall rating, and examined the influence of personal characteristics, including personality traits, demographics, GenAI recommendation experience, and GenAI skepticism. Results show that the multi-agent system significantly increases Perceived Novelty and Shannon Diversity. Conscientiousness is positively associated with Perceived Accuracy and diversity, whereas extraversion is negatively associated with Perceived Diversity. Prior experience with GenAI-based recommendations is positively associated with Shannon Diversity, while skepticism toward GenAI is negatively associated with it. We also observe significant interaction effects between system design and user characteristics. These findings highlight the importance of personality-aware conversational recommender systems and caution against one-size-fits-all multi-agent designs.

[HC-8] From Players to Participants: Citizen Science and Video Games to Understand Cognition

【速读】:该论文试图解决传统认知科学研究中面临的样本规模小、生态效度低以及公众参与度不足的问题。解决方案的关键在于利用公民科学视频游戏(Citizen Science Video Games)将实验任务嵌入具有吸引力的游戏体验中,从而在非实验室环境中大规模收集行为数据,同时提升研究的可扩展性、生态效度和公众参与度。通过整合专业游戏开发经验与科学严谨性,这类游戏能够有效弥合玩家与参与者之间的界限,使娱乐活动转化为高质量的认知科学研究工具。

链接: https://arxiv.org/abs/2604.24321
作者: Syrine Salouhou,Edgar Dubourg,Maxwell Scott-Slade,Hugo Spiers,Antoine Coutrot
机构: 未知
类目: Human-Computer Interaction (cs.HC); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Citizen science is transforming how cognitive scientists study the human mind, and video games are at the heart of this shift. By embedding experimental tasks into engaging, game-like experiences, researchers can reach large, diverse populations while collecting rich behavioral data outside the lab. In this review, we explore how citizen science video games bridge the gap between players and participants, turning entertainment into large-scale cognitive research. Drawing on recent projects such as Sea Hero Quest and The Music Lab, we outline the key benefits of this approach: scalability, ecological validity, and public engagement. We also examine the challenges of designing games that are scientifically rigorous, ethically sound, and meaningful for both researchers and players. Through professional game developer insights, we highlight what it takes to develop a successful citizen science video game for cognitive science, and why this approach is still rare in the literature.

[HC-9] he Alignment Target Problem: Divergent Moral Judgments of Humans AI Systems and Their Designers

【速读】:该论文试图解决的问题是:如何在人工智能(AI)决策中实现与人类价值观的一致性,即所谓的“对齐”(alignment)问题。传统研究假设人类行为是衡量AI道德标准的基准,但本文通过实验发现,公众对AI系统、人类个体及其设计者(如工程师)的道德评价存在显著差异,这表明“对齐”的目标并非单一标准,而是涉及多重规范性立场,作者将此现象称为“对齐目标问题”(alignment target problem)。解决方案的关键在于揭示并量化这种道德标准的分化——当AI行为被明确归因于人类设计时,受试者表现出更强的义务论(deontological)推理倾向,说明公众会基于责任归属调整道德判断。这一发现提示,在高风险领域制定AI治理框架时,必须考虑多元主体(AI、使用者、设计者)的不同道德约束,而非简单套用人类行为作为统一标尺。

链接: https://arxiv.org/abs/2604.24155
作者: Benjamin Minhao Chen,Xinyu Xie
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted at ACM FAccT 2026

点击查看摘要

Abstract:The quest to align machine behavior with human values raises fundamental questions about the moral frameworks that should govern AI decision-making. Much alignment research assumes that the appropriate benchmark is how humans themselves would act in a given situation. Research into agent-type value forks has challenged this assumption by showing that people do not always hold AI systems to the same moral standards as humans. Yet this challenge is subject to two further questions: whether people evaluate AI behavior differently when its human origins are made visible, and whether people hold the humans who program AI systems to different moral standards than either the humans or the machines under evaluation. An experimental study on 1,002 U.S. adults measured moral judgments in a runaway mine train scenario, varying the subject of evaluation across four conditions: a repairman, a repair robot, a repair robot programmed by company engineers, and company engineers programming a repair robot. We find no significant variation in the moral standards applied to the repairman and the robot. However, moral judgments shifted substantially when robot actions were described as the product of human design. Participants exhibited markedly more deontological reasoning when evaluating the robot programmed by engineers or the engineers programming it, suggesting that making human design visible activates heightened moral constraints. These findings provide evidence that people apply meaningfully different moral standards to AI systems, to humans acting in the same situation, and to the humans who design them. We call this divergence the alignment target problem. Whether these plural normative standards can be reconciled into a coherent framework for AI governance in high-stakes domains remains an open question.

[HC-10] Listen to the Voices of Everyday Users: Democratizing Privacy Ratings for Sensitive Data Access in Mobile Apps

【速读】:该论文旨在解决移动应用频繁请求过度数据访问所引发的隐私问题,尤其是在缺乏明确规范指导的情况下,如何有效定义和执行必要的数据访问权限。现有监管机制主要依赖专家审计,存在可扩展性差、中立性不足及与用户预期脱节等问题。其解决方案的关键在于提出一种“民主化隐私评估”新范式,将用户重新定位为隐私审计过程中的主动评估者,利用用户对数据使用的感知来判断数据访问的适当性和必要性;并通过开发原型系统DePRa,集成情境化解释、类别代表性选择、直观评分界面和基于偏好的评分调整等功能,实现在真实场景中捕捉用户对敏感数据访问的意见,并验证该方法在补充专家审计和促进包容性隐私评估方面的潜力。

链接: https://arxiv.org/abs/2604.24066
作者: Liu Wang,Tianshu Zhou,Haoyu Wang,Yi Wang
机构: Huazhong University of Science and Technology (华中科技大学); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Mobile apps frequently request excessive data access, raising significant privacy concerns. While regulations like GDPR emphasize data minimization, they provide limited guidance on concretely defining and enforcing necessary data access. Existing regulatory mechanisms primarily rely on expert-driven audits that face challenges in scalability, neutrality, and alignment with user expectations. In this paper, we propose a novel paradigm–democratizing privacy assessment, inspired by prior work on user-centric privacy perceptions–which repositions users as active evaluators in the privacy auditing process, recognizing that user perceptions of data usage play a crucial role in assessing the appropriateness and necessity of data access. To operationalize this paradigm, we introduce DePRa, a prototype system developed through participatory design, featuring contextual explanation provision, category-based representative selection, an intuitive rating interface, and preference-based rating adjustment. We evaluated DePRa with 200 everyday mobile app users, analyzing how effectively it captures user opinions on sensitive data access, comparing their privacy ratings with expert assessments, and exploring risk preference-based score calibration. Our findings show the feasibility and promise of democratized privacy assessment, highlighting its potential to complement expert auditing and support inclusive privacy evaluation.

[HC-11] IntentVLM: Open-Vocabulary Intention Recognition through Forward-Inverse Modeling with Video-Language Models

【速读】:该论文旨在解决多模态环境下人类意图识别的准确性问题,特别是在开放词汇场景中,如何有效融合文本与视觉等异构信号以实现鲁棒的意图理解。其解决方案的关键在于提出了一种两阶段视频-语言框架IntentVLM,受认知科学中前向-逆向建模启发,将意图理解分解为目标候选生成(goal candidate generation)与结构化推理选择两个步骤,从而在不引入灾难性遗忘的前提下显著减少潜在推理中的幻觉现象,提升开放词汇下人类意图识别的准确率与可靠性。

链接: https://arxiv.org/abs/2604.24002
作者: Hamed Rahimi,Clemence Grislain,Adrien Jacquet Cretides,Olivier Sigaud,Mohamed Chetouani
机构: Institute of Intelligent Systems and Robotics (ISIR), Sorbonne University (索邦大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Improving the effectiveness of human-robot interaction requires social robots to accurately infer human goals through robust intention understanding. This challenge is particularly critical in multimodal settings, where agents must integrate heterogeneous signals including text, visual cues to form a coherent interpretation of user intent. This paper presents IntentVLM, a novel two-stage video-language framework designed for open-vocabulary human intention recognition. The approach is inspired by forward-inverse modeling in cognitive science by decomposing intention understanding into goal candidate generation followed by structured inference through selection, effectively reducing hallucinations in latent reasoning. Evaluated on the IntentQA and Inst-IT Bench datasets, IntentVLM achieves state-of-the-art results with up to 80% accuracy, notably surpassing the baseline performance by 30% and matches human performance. Our findings demonstrate that this structured reasoning approach enhances open-vocabulary intention understanding without catastrophic forgetting, offering a robust foundation for human-centered robotics.

[HC-12] Supporting Family-School Partnerships with Robot-Facilitated Home-Based Activities

【速读】:该论文旨在解决家庭-学校伙伴关系(Family-School Partnerships, FSP)中普遍存在的障碍,如家长时间紧张、沟通碎片化以及缺乏有意义的参与机会。为应对这一问题,研究提出了一种创新解决方案:将社交机器人(social robot)整合进家庭环境,以支持家庭内部与学校议题相关的互动活动。其关键在于通过访谈与共同设计(co-design)方法,基于父母和儿童的实际需求开发出一个模块化机器人系统,并在10个家庭中开展为期一周的入户实验,验证其在日常生活中促进家庭成员间关于学校话题交流的有效性。研究揭示了家长引导风格对机器人使用方式的影响,以及家庭对其帮助性和挑战性的感知,从而为未来面向家庭-学校协作的技术设计提供了实证依据与设计启示。

链接: https://arxiv.org/abs/2604.23978
作者: Michael F Xu,Qiyao Yang,Heather Kirkorian,Bilge Mutlu
机构: University of Wisconsin–Madison (威斯康星大学麦迪逊分校)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: Proceedings of the 25th Interaction Design and Children Conference (IDC '26)

点击查看摘要

Abstract:Family-school partnerships (FSP) are critical to children’s development, yet families often face barriers such as time constraints, fragmented communication, and limited opportunities for meaningful engagement. As a step toward facilitating broader family-school partnerships, we explore a novel approach that integrates a social robot into family settings, specifically supporting home-based activities. Through interviews and co-design sessions, we designed and developed a robotic system informed by both parents and children, that supported, among other interactions, family communication about school topics. We evaluated the robot in a week-long, in-home study with 10 families. Our findings show how families integrated the robot into daily life, how parental facilitation styles shaped use, and how families perceived both the helpfulness and challenges of the robot. We contribute empirical insights, a modular system, and design implications for family- and child-robot interactions. We discuss ethical and privacy considerations, and broaden the design space for technologies supporting family-school partnerships.

[HC-13] Designing Robots to Support Parent-Child Connections: Opportunities Through Robot-Mediated Communication

【速读】:该论文旨在解决现代技术进步导致人际互动减少、削弱家庭联结感的问题,探索如何通过机器人辅助的通信工具增强家庭成员间的情感连接。其解决方案的关键在于设计两类核心交互维度:一是机器人的行为策略(被动型、反应型、主动型),二是沟通模式(同步型与异步型),并通过实证研究揭示这些设计因素如何影响亲子互动质量与家庭联结体验,从而为支持日常家庭连接提供可操作的设计路径。

链接: https://arxiv.org/abs/2604.23976
作者: Michael F Xu,Bengisu Cagiltay,Yaxin Hu,Anjun Zhu,Bilge Mutlu
机构: University of Wisconsin–Madison(威斯康星大学麦迪逊分校); Koç University(科奇大学); Simon Fraser University(西蒙弗雷泽大学)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: Proceedings of the 25th Interaction Design and Children Conference (IDC '26)

点击查看摘要

Abstract:The sense of family connectedness may support positive outcomes including individual well-being, resilience, and healthy family functioning. However, as technologies advance, they often replace human-human interactions instead of nurturing them. In this work, we investigate how robot-facilitated communication tools might instead create new opportunities for family connection. We conducted two studies with families with children aged 5-12. We first explored the design space through in-home technology probe sessions with six families. These probes inspired us to explore two key interaction design dimensions: the robot’s behavior strategy (passive, reactive, proactive) and the mode of communication (synchronous, asynchronous). We then conducted a laboratory study with 20 families to examine how the two dimensions shaped parent-child interaction and connection. Our findings characterize how parents and children appropriated robot-mediated exchanges, the tensions they experienced around initiative, timing, and privacy, and the opportunities they envisioned for supporting everyday connectedness.

[HC-14] Making Sense of Scams: Understanding Scam Conversations Through Multi-Level Alignment

【速读】:该论文旨在解决在线诈骗(online scams)检测中两个关键问题:一是现有系统依赖静态信号和中断式警告,缺乏能够反映对话动态中欺诈风险的连续性信号;二是对非中断式交互设计的研究不足。其解决方案的核心在于引入基于多层级对齐(multi-level alignment-based hints)的新检测信号,该信号受交互对齐模型(Interactive Alignment Model)启发,通过量化对话参与者之间低层级的词汇与句法对齐以及高层级的语义与情境模型对齐,使诈骗进展过程中的对话动态可被用户感知。实验表明,这种多层级对齐提示能显著提升检测精度、召回率与F1分数,并促进用户更早且更稳定地形成判断信心,尤其在结合不同层次对齐信息时效果最优。

链接: https://arxiv.org/abs/2604.23973
作者: Zhenyu Mao,Jacky Keung,Xiangyu Li,Yicheng Sun,Kehui Chen,Jingyu Zhang,Jialong Li
机构: City University of Hong Kong (香港城市大学); SeysoAI (赛索人工智能); Hong Kong Metropolitan University (香港都会大学); Waseda University (早稻田大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Online scams often unfold gradually through interaction, yet existing detection systems predominantly rely on snapshot-based signals and interruptive warnings, revealing two research gaps in the lack of signals that represent scam risk within conversational dynamics and the underexplored design of non-interruptive interaction. To address these gaps, we introduce multi-level alignment-based hints, informed by the Interactive Alignment Model, as a new detection signal for supporting sensemaking in scam-related conversations. These hints operationalize low-level lexical and syntactic alignments and high-level semantic and situation-model alignments between conversational participants, making conversational dynamics visible to users. We first conduct a preliminary evaluation on real-life scam dialogues, showing that as conversations approach scam attempts, low-level alignment scores remain stable while high-level alignment scores systematically decline, revealing a consistent cross-level pattern indicative of scam progression. Building on this insight, we conduct a user study with thirty participants, indicating that relative to the no-hint baseline, multi-level alignment-based hints increase precision by 0.25, recall by 0.16, and F1 score by 0.21, yielding substantially larger gains than the marginal improvements achieved by keyword-triggered alerts. Statistical analyses reveal that the proposed hints support earlier and more stable confidence formation over time, with ablation results further highlighting the effectiveness of combining alignment hints across levels in achieving these advantages.

[HC-15] What Did They Mean? How LLM s Resolve Ambiguous Social Situations across Perspectives and Roles

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在处理社交情境中的模糊性时,倾向于生成确定性解释而非保留不确定性的问题。其核心发现是:在涉及早期浪漫关系、师生互动、职场层级和模糊友谊等四类人际场景中,LLMs(如GPT、Claude和Gemini)在72次响应中仅有12.5%维持了真正的不确定性,其余87.5%均通过叙事对齐、叙事反转、不确定条件下的规范建议以及看似谨慎实则导向单一结论的措辞等方式实现“解释闭合”。解决方案的关键在于识别并设计能够“保留不确定性”的社会型人工智能(Social AI),以避免用户因模型过早提供看似合理但可能误导的叙事而误判复杂人际情境。

链接: https://arxiv.org/abs/2604.23942
作者: Qiming Yuan,Linyi Han,Nam Ling,Cihan Ruan
机构: Santa Clara University (圣克拉拉大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:People increasingly turn to large language models (LLMs) to interpret ambiguous social situations: a delayed text reply, an unusually cold supervisor, a teacher’s mixed signals, or a boundary-crossing friend. Yet in many such cases, no stable interpretation can be verified from the available evidence alone. We study how LLMs respond to these situations across four domains: early-stage romantic relationships, teacher–student dynamics, workplace hierarchies, and ambiguous friendships. Across 72 responses from GPT, Claude, and Gemini, only 9 (12.5%) genuinely preserved uncertainty. The remaining 87.5% produced interpretive closure through recurring pathways including narrative alignment, narrative reversal, normative advice under uncertainty, and hedged language that still supported a single conclusion. We further find that narrator perspective shapes the path to closure: first-person accounts more often elicited alignment, while third-person accounts invited more detached interpretation, even when the underlying situation remained comparable. Together, these findings show that LLMs do not simply assist interpersonal sensemaking; they tend to resolve ambiguity into coherent and actionable narratives. These results suggest that the central risk is not only that LLMs may misinterpret social situations, but that they may make unresolved situations feel prematurely settled. We frame this tendency as a design challenge for uncertainty-preserving social AI.

[HC-16] owards Localizing Conversation Partners using Head Motion

【速读】:该论文旨在解决在嘈杂环境中用户难以聚焦于特定对话者语音的问题,尤其是在存在多个干扰说话者或用户存在听力障碍的情况下。传统基于空间音频的方法虽能识别声源方向,但无法反映用户的主动听觉偏好,且在复杂噪声场景下性能受限。解决方案的关键在于利用智能眼镜上惯性测量单元(Inertial Measurement Units, IMUs)捕捉的头部朝向行为作为行为线索,以非侵入式方式推断用户的声学兴趣区域(acoustic zones of interest)。文中提出HALo网络和CoCo分类器,分别实现基于头部姿态的声源定位与对话人数估计,显著提升系统对用户意图的理解能力,并在多说话人高噪声环境下展现出优于现有方法的性能优势。

链接: https://arxiv.org/abs/2604.23927
作者: Payal Mohapatra,Calvin Murdock,Ali Aroudi,Ishwarya Ananthabhotla,Anjali Menon,Buye Xu,Morteza Khaleghimeybodi
机构: Northwestern University (西北大学); Meta Reality Labs (Meta 现实实验室)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Many individuals struggle to understand conversation partners in noisy settings, particularly amid background speakers or due to hearing impairments. Emerging wearables like smartglasses offer a transformative opportunity to enhance speech from conversation partners. Crucial to this is identifying the direction in which the user wants to listen, which we refer to as the user’s acoustic zones of interest. While current spatial audio-based methods can resolve the direction of vocal input, they are agnostic to listening preferences and have limited functionality in noisy settings with interfering speakers. To address this, behavioral cues are needed to actively infer a user’s acoustic zones of interest. We explore the effectiveness of head-orienting behavior, captured by Inertial Measurement Units (IMUs) on smartglasses, as a modality for localizing these zones in seated conversations. We introduce HALo, a head-orientation-based acoustic zone localization network that leverages smartglasses’ IMUs to non-invasively infer auditory zones of interest corresponding to conversation partner locations. By integrating an a priori estimate of the number of conversation partners, our approach yields a 21% performance improvement over existing methods. We complement this with CoCo, which classifies the number of conversation partners using only IMU data, achieving 0.74 accuracy and a 35% gain over rule-based and generic time-series baselines. We discuss practical considerations for feature extraction and inference and provide qualitative analyses over extended sessions. We also demonstrate a minimal end-to-end speech enhancement system, showing that head-orientation-based localization offers clear advantages in extremely noisy settings with multiple conversation partners.

[HC-17] From Trust to Appropriate Reliance: Measurement Constructs in Human-AI Decision-Making

【速读】:该论文试图解决的问题是:当前人类-人工智能(Human-AI)决策研究中,依赖信任(trust)测量来评估用户对AI系统的实际使用效果存在局限性,因为信任并不能准确反映用户对AI建议的适当依赖(appropriate reliance)。解决方案的关键在于区分“适当依赖”与“信任”及“单纯依赖”的概念,并系统梳理现有文献中关于人类对AI建议适当依赖的三种理论视角——传统视角(Traditional)、适当性视角(Appropriateness)和主导性视角(Dominance),进而提出应采用客观指标(objective metrics)来量化和比较不同情境下人类对AI建议的适当依赖程度,从而推动该领域研究的标准化与可比性。

链接: https://arxiv.org/abs/2604.23896
作者: Muhammad Raees,Konstantinos Papangelis
机构: Rochester Institute of Technology (罗切斯特理工学院)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:While human-AI decision-making research has primarily used trust measurements to assess the practical usage of AI systems by their end-users, recent empirical evidence suggests that trust measurements do not inform users’ appropriate reliance on AI systems. While examining the human-AI decision-making literature, in this work, we review empirical studies that assess people’s appropriate reliance on AI advice, differentiating measurements and constructs of appropriate reliance from trust and mere reliance. Our analysis of literature shows that constructs for human-AI appropriate reliance are still fragmented in research. We present three views on appropriate reliance, namely Traditional, Appropriateness, and Dominance, as discussed in research. Using these views, we evaluate objective metrics reported in studies and argue for their consensus to facilitate the comparison across empirical research. We also discuss how studies employ objective metrics and examine their validity in application contexts. Our work contributes to the critical body of research on exploring objective metrics for assessing humans’ appropriate reliance on AI advice.

[HC-18] Who Gets to Interpret the Workout? User Tensions with AI-Generated Fitness Feedback

【速读】:该论文旨在解决生成式 AI (Generative AI) 在健身自我追踪平台(如 Strava)中集成后,运动员如何与AI生成的反馈互动所引发的隐性挑战问题。研究通过分析 Reddit 上 297 个帖子和 5,692 条评论,识别出四类核心张力:数值评估 vs. 情境理解、孤立训练片段 vs. 连续训练叙事、固定AI语气 vs. 多样情绪状态、单一AI声音 vs. 不同运动类型用户。解决方案的关键在于设计能够保留解释开放性和用户自主性的 AI 支持型自我追踪系统,避免AI反馈对用户自身生活经验解释权的过度约束。

链接: https://arxiv.org/abs/2604.23830
作者: Sujay Shalawadi,Joel Wester,Samuel Rhys Cox,Niels van Berkel
机构: Norwegian University of Science and Technology (挪威科技大学); University of Copenhagen (哥本哈根大学); Aalborg University (奥尔堡大学)
类目: Human-Computer Interaction (cs.HC)
备注: 13 pages, 5 figures, Designing Interactive Systems

点击查看摘要

Abstract:Fitness tracking platforms increasingly integrate generative AI to interpret activity data, such as Strava’s Athlete Intelligence. These integrations raise questions about how athletes engage with AI-supported fitness self-tracking. We analyzed 297 Reddit threads and 5,692 comments from r/Strava following the company’s launch of AI features to examine user reactions to AI-generated fitness feedback. Our findings revealed four recurring tensions: (1) numerical evaluation versus contextual understanding; (2) isolated session summaries versus ongoing training narratives; (3) a fixed AI tone versus diverse emotional states; and (4) a single AI voice versus different athletic types. Across these tensions, users resisted AI feedback that constrained interpretations of their own lived experiences. These findings shed light on the implicit challenges of integrating AI into self-tracking platforms. We conclude with implications for the design of AI-supported self-tracking systems that preserve interpretive openness and user agency.

[HC-19] MIRAG E: A Micro-Interaction Relational Architecture for Grounded Exploration in Multi-Figure Artworks

【速读】:该论文旨在解决多人物绘画中“微交互”(micro-interactions)识别与解释的难题,即如何通过视觉线索(如目光对齐、手势和空间布局)系统性地理解人物之间的隐含关系,而现有视觉语言模型(VLMs)常因缺乏可追溯的视觉证据导致解释不可靠甚至产生幻觉。其解决方案的关键在于提出MIRAGE框架,该框架构建结构化的中间表示,显式分离空间定位与叙事生成两个阶段:首先提取并组织人物身份、姿态线索与注视假设等低层证据,再基于这些可验证的证据层进行高阶关系推理,从而实现对复杂视觉叙事的透明化、可解释且以人为导向的理解。

链接: https://arxiv.org/abs/2604.23788
作者: Jui-Cheng Chiu,Yu-Chao Wang,Shengyang Luo,Tongyan Wang,Qi Yang,Nabin Khanal,Yingjie Victor Chen
机构: Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 14 pages (11 pages main text), 6 figures, 1 table

点击查看摘要

Abstract:Appreciating multi-figure paintings requires understanding how characters relate through subtle cues like gaze alignment, gesture, and spatial arrangement. We present MIRAGE, an evidence-centric framework designed to scaffold the exploration of these “micro-interactions” in multi-figure artworks. While such cues are essential for deep narrative appreciation, they are often distributed across complex scenes and difficult for viewers to systematically identify. Existing vision-language models (VLMs) frequently fail to provide reliable assistance, offering ungrounded interpretations that lack traceable visual evidence. MIRAGE addresses this by constructing a structured intermediate representation capturing identities, pose cues, and gaze hypotheses. However, the challenge extends beyond extracting these cues to coordinating them during interpretation. Without an explicit mechanism to organize and reconcile relational evidence, models often collapse multiple interaction hypotheses into a single unstable or weakly grounded narrative, even when low-level signals are available. This representation allows users to verify how high-level interpretations are anchored in low-level visual facts. By separating spatial grounding from narrative generation, MIRAGE enables users to inspect and reason about figure-to-figure relationships through a verifiable evidence layer. We evaluate MIRAGE against painting-only VLM baselines using a blind assessment protocol. Results show that MIRAGE significantly improves identity consistency, reduces relational hallucinations, and increases the coverage of subtle interactions. These findings suggest that structured grounding can serve as a critical interaction control layer, providing the necessary scaffolding for a more reliable, transparent, and human-led understanding of complex visual narratives. Comments: 14 pages (11 pages main text), 6 figures, 1 table Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC) ACMclasses: H.5.2; I.2.10 Cite as: arXiv:2604.23788 [cs.CV] (or arXiv:2604.23788v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.23788 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[HC-20] PageGuide: Browser extension to assist users in navigating a webpage and locating information

【速读】:该论文旨在解决用户在浏览网页时面临的三大核心挑战:难以快速定位相关信息、完成复杂的多步骤任务困难,以及易受干扰内容影响专注力。现有生成式 AI 助手(如 ChatGPT、Gemini)和浏览器代理(如 OpenAI Operator)虽能回答问题并执行自动化操作,但缺乏对信息来源的可追溯性,导致用户必须手动验证结果且无法控制自动化流程,从而产生信任风险与效率低下。解决方案的关键在于提出 PageGuide——一个浏览器扩展程序,通过在 HTML DOM(文档对象模型)层面直接锚定大语言模型(LLM)的回答,并以可视化叠加层呈现,实现三项功能:(a) Find —— 实时高亮页面中相关证据,支持即时验证;(b) Guide —— 提供分步操作指引,使用户可自主执行任务;© Hide —— 允许用户选择隐藏干扰元素,提升注意力集中度。实证研究表明,PageGuide 显著优于无辅助浏览,在准确性、任务完成时间及用户操作效率等方面均有显著提升。

链接: https://arxiv.org/abs/2604.23772
作者: Tin Nguyen,Thang T. Truong,Runtao Zhou,Trung Bui,Chirag Agarwal,Anh Totti Nguyen
机构: Auburn University (奥本大学); University of Virginia (弗吉尼亚大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Users browsing the web daily struggle to quickly locate relevant information in cluttered pages, complete unfamiliar multi-step tasks, and stay focused amid distracting content. State-of-the-art AI assistants (e.g., ChatGPT, Gemini, Claude) and browser agents (e.g., OpenAI Operator, Browser Use) can answer questions and automate actions, yet they return answers without showing where the information comes from on the page, forcing users to manually verify results and blindly trust every automated steps. We present PageGuide, a browser extension that grounds LLM answers directly in the HTML DOM via visual overlays, addressing three core user needs: (a) Find-locating and highlighting relevant evidence in-situ so users can instantly verify answers on the page; (b) Guide-showing step-by-step instructions (e.g. how to change password) one at a time so users can follow and perform actions by themselves; and © Hide-hiding distracting content-giving users a chance to decide to hide an element or not. In a user study (N=94), PageGuide outperform unaided browsing across all modes: Hide accuracy improve by 26 percentage points (86.7% relative gain) and task completion time drops by 70%; Guide completion rate increases by 30 percentage points; and Find reduces manual search effort, with Ctrl+F usage falling by 80% and task time decreasing by 19%. Code and demo is at: this http URL.

[HC-21] Modeling Induced Pleasure through Cognitive Appraisal Prediction via Multimodal Fusion

【速读】:该论文旨在解决多模态情感计算中视频内容如何通过认知评估变量引发特定情感体验(如愉悦感)的问题,尤其关注现有方法在人类标签噪声、正向情绪与愉悦之间的语义鸿沟、愉悦专用数据集稀缺以及黑箱融合方法可解释性不足等方面的局限。其解决方案的关键在于提出一种融合数据驱动与认知理论驱动的新型计算模型:利用认知评估理论(cognitive appraisal theory)构建心理机制框架,并引入模糊模型与基于Transformer的注意力机制,在细粒度提取跨模态特征的同时实现可解释的融合,从而捕捉与愉悦相关的模态内和模态间动态关系,有效弥合语义鸿沟并提升模型透明度。

链接: https://arxiv.org/abs/2604.23753
作者: Nastaran Dab,Raziyeh Zall,Mohammadreza Kangavari
机构: Iran University of Science and Technology (伊朗科学与技术大学)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal affective computing analyzes user-generated social media content to predict emotional states. However, a critical gap remains in understanding how visual content shapes cognitive interpretations and elicits specific affective experiences such as pleasure. This study introduces a novel computational model to infer video-induced pleasure via cognitive appraisal variables. The proposed model addresses four challenges: (1) noisy and inconsistent human labels, (2) the semantic gap between “positive emotions” and “pleasure,” (3) the scarcity of pleasure-specific datasets, and (4) the limited interpretability of existing black-box fusion methods. Our approach integrates data-driven and cognitive theory-driven methods, using cognitive appraisal theory and a fuzzy model within an innovative framework. The model employs transformer-based architectures and attention mechanisms for fine-grained multimodal feature extraction and interpretable fusion to capture both inter- and intra-modal dynamics associated with pleasure. This enables the prediction of underlying appraisal variables, thereby bridging the semantic gap and enhancing model explainability beyond conventional statistical associations. Experimental results validate the efficacy of the proposed method in detecting video-induced pleasure, achieving a peak accuracy of 0.6624 in predicting pleasure levels. These findings highlight promising implications for affective content recommendation, intelligent media creation, and advancing our understanding of how digital media influences human emotions.

[HC-22] StateScribe: Towards Accessible Change Awareness Across Real-World Revisits

【速读】:该论文旨在解决盲人及低视力(Blind and Low-Vision, BLV)个体在反复访问同一物理环境时,难以感知场景动态变化的问题,如物体移位、布局调整或信息更新(例如价格变动),这些问题可能带来安全隐患并增加认知负担。现有视觉辅助技术多局限于即时场景描述,缺乏跨次访问的持续变化感知能力。解决方案的关键在于提出StateScribe系统,其核心创新是采用双层记忆架构:结合情景式场景记忆(episodic scene memory)与以对象为中心的时间记忆(object-centric temporal memory),实现对变化内容、发生时间和空间位置的结构化追踪,并提供实时变化描述(如“你右侧的商店现在贴有‘CLOSED’标志;上周此时还是开放状态”)。该设计使系统在11次重访中保持高准确率(F1=83.1%)、低延迟(均值1.54秒)和内存效率(54MB),并通过用户实验证实显著提升BLV用户对环境变化的感知能力。

链接: https://arxiv.org/abs/2604.23749
作者: Ruei-Che Chang,Xirui Jiang,Rosiana Natalie,Hao Chen,Vlad Roznyatovskiy,Jianzhong Zhang,Kang G. Shin,Ke Sun,Anhong Guo
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Real-world environments evolve continuously, yet blind and low-vision (BLV) individuals often have limited access to understanding how they change over time. Unexpected or relocated objects, layout modifications, and content updates (e.g., price changes) can introduce safety risks and cognitive burden. While existing visual assistive technologies can describe immediate surroundings, they operate as one-off interactions and lack mechanisms to surface meaningful changes across revisits. Informed by a survey of 33 BLV individuals, we develop StateScribe, a system that supports accessible awareness of real-world changes across revisits. StateScribe employs a dual-layer memory architecture that integrates episodic scene memory and object-centric temporal memory to enable scalable and structured change tracking. It provides both live descriptions of the current scene, and descriptions of what has changed, when and where it occurred across revisits, such as "The shop on your right has a “CLOSED” sign; it was open at this time last week.‘’ Our evaluation shows that StateScribe maintains high accuracy (F1-score=83.1%) across 11 revisits, while remaining low-latency (mean1.54s) and memory-efficient (54MB) across 110 revisits. A user study with nine BLV participants demonstrates that StateScribe improves change awareness across revisits in three real-world locations. Finally, we discuss implications for long-term AI-assisted companions that support broader change observation using multimodal sensing, extend beyond changes to other memory capabilities, and adapt to individual users, intents, and contexts.

[HC-23] Impact of Age Specialized Models for Hypoglycemia Classification

【速读】:该论文旨在解决糖尿病患者在不同年龄阶段中低血糖(hypoglycemia)事件预测的个体化建模问题,尤其关注如何利用连续葡萄糖监测(Continuous Glucose Monitoring, CGM)数据提升预测准确性。其核心挑战在于:尽管不同年龄组患者的血糖变异性、自身抗体水平及低血糖发生频率存在差异,当前多数分类模型仍基于群体平均特征或仅针对特定年龄段训练,缺乏普适性和个性化适应能力。解决方案的关键在于通过对比三种策略——全局人群模型(包含所有年龄组)、按年龄分段独立建模以及基于迁移学习的个体化模型——发现全局模型在多数情况下性能与分年龄模型相当甚至更优,表明儿童、青少年和成人数据可联合用于训练低血糖分类模型;同时指出,尽管短期低血糖模式具有跨年龄相似性,但儿童群体的最佳召回率仍需依赖年龄特异模型,提示未来模型设计应兼顾泛化能力与特定人群优化。

链接: https://arxiv.org/abs/2604.23732
作者: Beyza Cinar,Maria Maleshkova
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted for IEEE CAI 2026. 13 pages, 6 Figures, and 10 Tables

点击查看摘要

Abstract:Disease progression varies with age and is influenced by underlying genetic, biochemical, and hormonal etiologies, suggesting the need for tailored monitoring, care, and medication beyond standard clinical guidelines. Specifically, in autoimmune diseases like type 1 diabetes (T1D), where patients depend on exogenous insulin to compensate for insulin deficiency, medication dosing and the physiological response reflected in vital signs can differ. Insulin therapy can lead to hypoglycemia, a dangerous condition characterized by decreased blood glucose levels ( \leq 70). This risk can be mitigated through improved diabetes management supported by data analytics. Notably, leveraging data from continuous glucose monitoring (CGM) devices, hypoglycemia onset can be predicted. However, while glucose variability, auto-antibody levels, and hypoglycemia occurrence differ across age groups, hypoglycemia classification most often only relies on population-based models specialized in specific age ranges. In this work, we classify hypoglycemia 0, 5-15, 20-45, and 50-120 minutes before onset using DiaData, a large CGM dataset of patients with T1D ranging from children to seniors. In particular, we investigate: 1) the generalizability of a population-based model including all age groups, 2) the impact of age-segmented models trained separately per age group, and 3) the effect of model individualization through transfer learning. The results show that a global population-based model yields similar or superior performance compared to age-segmented models. These findings suggest that data from children, teenagers, and adults can be combined for training models on hypoglycemia classification. While glucose variation differs across age groups, short-term hypoglycemic patterns are similar. However, data of children obtain their best recall with age specialized model.

[HC-24] alking Slide Avatars: Open-Source Multimodal Communication Approach for Teaching

【速读】:该论文旨在解决在线、混合及异步教学环境中,基于幻灯片的教学内容因缺乏教师存在感、叙事连贯性和表达性框架而导致学习者难以与内容建立连接的问题。传统全讲义视频虽能部分恢复这些特质,但录制、修改和复用成本过高。解决方案的关键在于提出一种开源的工作流程,通过集成OpenVoice实现文本到语音的生成与语音克隆,以及利用Ditto-TalkingHead完成音频驱动的动态图像合成,使教师仅需脚本和静态肖像即可生成短时叙述视频,嵌入课件或HTML教学材料中。该方法将“说话幻灯片头像”视为数字教学法、美学教育与艺术技术实践交汇处的多模态传播载体,强调其在提升教学人性化、增强互动性方面的潜力,并提供关于脚本长度、图像选择、节奏控制、透明度披露、可访问性及伦理使用的具体指导。

链接: https://arxiv.org/abs/2604.23703
作者: Xinxing Wu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 15 pahes

点击查看摘要

Abstract:Slide-based teaching is widely used in higher education, yet in online, hybrid, and asynchronous contexts, slides often lose the instructor presence, narrative continuity, and expressive framing that help learners connect with content. Full lecture video can partly restore these qualities, but it is time-consuming to record, revise, and reuse. This study addresses that pedagogical and production challenge by presenting a practice-based analysis of an open-source workflow for creating talking slide avatars for slide-based teaching. The workflow integrates OpenVoice for text-to-speech generation and voice cloning with Ditto-TalkingHead for audio-driven talking-image synthesis, enabling instructors to transform a script and a static portrait into a short narrated video that can be embedded in slide decks or HTML-based lecture materials. Rather than treating this workflow merely as a technical solution, the study frames talking slide avatars as multimodal communication artifacts at the intersection of digital pedagogy, aesthetic education, and art-technology practice. Using a practice-based implementation and analytic reflection approach, the study documents the production pipeline, examines its communicative and aesthetic affordances, and proposes practical guidelines for script length, image selection, pacing, disclosure, accessibility, and ethical use. The study makes three primary contributions: it presents an educator-oriented open-source production model, reframes talking avatars as an educational communication design problem, and proposes a responsible pathway for incorporating generative synthetic media into teaching. It concludes that short, transparent, and carefully designed avatars can humanize slide-based instruction while providing a reusable communicative layer for introductions, transitions, reminders, and recaps across online, hybrid, and asynchronous learning environments.

[HC-25] Directional Alignment and Narrative Agency in Human-LLM Co-Writing

【速读】:该论文旨在解决人机协同创作中叙事主导权(narrative agency)的归属问题,即在轮流写作过程中,人类与大语言模型(Large Language Model, LLM)哪一方更显著地推动故事发展。其解决方案的关键在于构建了一个包含87个真人-LLM协作创作故事的新语料库,并结合情感分析(sentiment modeling)与语义新颖性建模(semantic novelty modeling),辅以方向性度量方法来量化双方在叙事推进中的作用。研究发现:人类回合引入更高语义新颖性且更可能决定后续情节走向,而LLM则主要对人类引入的内容进行扩展和细化;在情感层面,LLM表现出更强的即时情绪适应能力,但双方均能追踪彼此情绪倾向,且LLM具有独立的积极情绪基线。这揭示了人机在协同创作中存在互补分工:人类驱动叙事创新与方向,LLM作为自适应放大器维持连贯性并丰富情节。

链接: https://arxiv.org/abs/2604.23676
作者: Halfdan Nordahl Fundal,Yuri Bizzoni
机构: Aarhus University (奥胡斯大学)
类目: Human-Computer Interaction (cs.HC)
备注: 11 pages, 13 figures

点击查看摘要

Abstract:We investigate narrative agency in human-LLM creative co-writing, asking who drives story development in turn-based collaboration. Using a new corpus of 87 human-LLM co-written stories, we apply sentiment and semantic modeling to quantify affective alignment and semantic novelty in turn-taking, and directional measures to assess which agent shapes narrative progression. Our results show asymmetric influence: human turns introduce greater semantic novelty and are more likely to shape subsequent developments, whereas LLM contributions predominantly elaborate on human-introduced elements. At the sentiment level, alignment is also asymmetric, but more bidirectional: LLMs exhibit stronger turn-level emotional adaptation than humans, but both agents track each other’s emotional valence and LLMs show an independent tendency to more positive emotional baselines. These findings indicate a complementary division of labor in human-LLM co-writing, where humans drive narrative innovation and direction, while LLMs act as adaptive amplifiers that sustain coherence and elaborate emerging narratives.

[HC-26] Quantifying the Persistence of Daily Routines

【速读】:该论文旨在解决个体日常行为是否具有可识别的、稳定的典型日程模式(routine)及其在人群中的差异性问题,即是否存在每个人独特的“行为指纹”并能持续反映其生活节奏。解决方案的关键在于提出一种量化框架,将连续日视为不同类型的典型日程(routine)序列,并通过睡眠、移动和设备使用等跨人群共通的行为模式,识别出约八种主导的日常行为结构;进一步验证发现,个体在这些Routine类型上的时间分配和日间转换动态表现出高度的个体稳定性,远超群体间差异,且在不同年龄、职业和健康状态下保持一致,从而确立了Routine作为稳定、个性化的行为标记,为个性化健康监测提供了方法基础。

链接: https://arxiv.org/abs/2604.23638
作者: Nguyen Luong,Talayeh Aledavood
机构: Aalto University (阿尔托大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Daily life is structured by recurring routines that coordinate biological rhythms with social and occupational demands. Individual differences in work schedules, family obligations, and social commitments produce distinctive ways of organizing activities throughout the day. Do people have typical days with certain arrangement of activities? How often do these typical days or routines occur and does this differ from person to person? We introduce a framework for quantifying such recurring routines, their persistence over time and their distinctiveness for different people. We model consecutive days in one’s life as a sequence of different types of typical days, i.e. routines. Characterizing each day through patterns of activities common among all people - sleep, movement, and device use - we identify a small set of routine types that capture the dominant structure of everyday behavior. We then test whether individuals maintain stable, person-specific distributions over these types and transition between them in characteristic ways. Validating this framework with passive sensing data from 1,086 participants across 153,000 person-days in three longitudinal studies, we find that daily life typically resolves into approximately eight routine types and each person maintains a characteristic distribution over these types. Both the time allocation across routine types and the day-to-day transition dynamics are substantially more similar within individuals than between them, remaining stable across observation windows spanning weeks to months and across populations differing in age, occupation, and health status. Routine persistence shows modest associations with personality traits such as conscientiousness, but is broadly similar across age and gender. Our findings establish routine patterns as stable, person-specific behavioral fingerprints with applications in personalized health monitoring.

[HC-27] he Limits of Artificial Companionship

【速读】:该论文试图解决的问题是:在陪伴型聊天机器人(companion chatbot)的对话场景中,未披露的商业推广内容混入情感或关系性交流,导致市场交易与沟通亲密性之间的界限模糊,进而侵蚀用户自主性和对话情境的完整性。解决方案的关键在于建立明确的结构性区分,即在法律与社会层面确立商业与非商业对话场景之间的清晰边界,以此作为这些技术在社会生活中负责任稳定发展的前提条件。

链接: https://arxiv.org/abs/2604.23601
作者: Mauricio Figueroa
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Southwestern Journal of International Law (2026, forthcoming)

点击查看摘要

Abstract:This Article argues that conversations with companion chatbot should be subject to a clear structural distinction between commercial and non-commercial contexts. The insertion of undisclosed promotional content into affective or relational exchanges should be prohibited, as it collapses the boundary between market transaction and communicative intimacy in ways that erode user autonomy and conversational context. The Article begins by theorizing digital companionship as a sociotechnical form that reconfigures intimacy, dependence and relational vulnerability. It then introduces the potential economic harms derived from conversational advertising. The Article ultimately argues for a firm legal and social distinction between commercial and non-commercial conversational contexts as a precondition for the responsible stabilization of these technologies within social life.

[HC-28] Opening the Design Space: Two Years of Performance with Intelligent Musical Instruments

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在音乐创作中应用的局限性,即现有工具缺乏以艺术家为中心的设计,难以融入实际音乐实践或支持实验性交互。其解决方案的关键在于构建一个基于单板计算机(Single Board Computer)的低成本、便携式生成式 AI 音乐仪器平台,通过 MIDI 接口与其它乐器连接,并利用艺术家收集的数据集在常规计算机上训练模型。该平台使艺术家能够探索新的交互和表演方式,核心创新包括:(1)通过(重新)映射(re)mapping)替代重新训练实现 AI 交互发现;(2)引入快速输入交错(fast input interleaving)作为新型协同创作策略;(3)证明小数据 AI 模型可作为可移植的设计资源;(4)证实廉价硬件能有效降低艺术参与门槛。

链接: https://arxiv.org/abs/2604.23583
作者: Charles Patrick Martin
机构: The Australian National University (澳大利亚国立大学)
类目: ound (cs.SD); Human-Computer Interaction (cs.HC)
备注: Accepted for publication at the International Conference on New Interfaces for Musical Expression (NIME) 2026

点击查看摘要

Abstract:Machine generation of symbolic music and digital audio are hot topics but there have been relatively few digital musical instruments that integrate generative AI. Present musical AI tools are not artist centred and do not support experimentation or integrating into musical instruments or practices. This work introduces an inexpensive generative AI instrument platform based on a single board computer that connects via MIDI to other musical devices. The platform uses artist-collected datasets with models trained on a regular computer. This paper asks what the design space of intelligent musical instruments might look like when accessible and portable AI systems are available for artistic exploration. I contribute five examples of instruments created and tested through a two-year first-person artistic research process. These show that (re)mapping can replace retraining for discovering AI interaction, that fast input interleaving is a new co-creative strategy, that small-data AI models can be a transportable design resource, and that cheap hardware can lower barriers to inclusion. This work could enable artists to explore new interaction and performance schemes with intelligent musical instruments.

[HC-29] Your Students Dont Use LLM s Like You Wish They Did ACL2026

【速读】:该论文旨在解决教育类自然语言处理(Natural Language Processing, NLP)系统在评估时缺乏对教学目标实现程度的直接衡量问题,现有方法多依赖参与度指标和满意度调查,仅能作为教学目标达成的间接代理。其解决方案的关键在于提出六种计算型指标,用于自动化评估学生与AI对话中 pedagogical alignment(教学一致性),通过分析来自四个课程共500次对话、12,650条消息的数据,揭示了教育者设计对话辅导系统以促进持续学习对话,而学生实际主要将其用于答案提取的结构性错位。研究进一步发现,部署情境是使用模式最强预测因子,显著超过学生偏好或系统设计本身:当AI工具为可选时,使用集中在截止日期前后;当嵌入课程结构时,学生倾向于直接提问作业原题。该套指标能够帮助研究者更精准地衡量教育对话系统是否真正实现了其教学目标。

链接: https://arxiv.org/abs/2604.23486
作者: Sebastian Kobler,Matthew Clemson,Angela Sun,Jonathan K. Kummerfeld
机构: The University of Sydney (悉尼大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: To appear at ACL 2026 (Main Conference)

点击查看摘要

Abstract:Educational NLP systems are typically evaluated using engagement metrics and satisfaction surveys, which are at best a proxy for meeting pedagogical goals. We introduce six computational metrics for automated evaluation of pedagogical alignment in student-AI dialogue. We validate our metrics through analysis of 12,650 messages across 500 conversations from four courses. Using our metrics, we identify a fundamental misalignment: educators design conversational tutors for sustained learning dialogue, but students mainly use them for answer-extraction. Deployment context is the strongest predictor of usage patterns, outweighing student preference or system design: when AI tools are optional, usage concentrates around deadlines; when integrated into course structure, students ask for solutions to verbatim assignment questions. Whole-dialogue evaluation misses these turn-by-turn patterns. Our metrics will enable researchers building educational dialogue systems to measure whether they are achieving their pedagogical goals.

[HC-30] Can Humans Detect AI? Mining Textual Signals of AI-Assisted Writing Under Varying Scrutiny Conditions

【速读】:该论文旨在探究AI检测威胁是否会影响人们使用生成式AI(Generative AI)写作的方式,以及他人能否通过文本差异识别出由AI辅助撰写的文档。其关键解决方案在于设计了一个双阶段受控实验:21名参与者在相同条件下使用同一AI聊天机器人撰写关于远程工作的观点文章,其中一半被随机告知将接受AI检测工具扫描,另一半未获警告;随后由251名独立评判者对1999组配对文本进行盲评,判断哪篇更可能是人类所写。结果显示,尽管两组在所有可测量的文本特征(如AI重合度、词汇多样性、句法结构和代词使用)上无显著差异,但评判者更倾向于将被警告组的文本判为人类撰写(54.13% vs. 45.87%),且该结果统计显著(p = 0.000243)。这表明,AI检测威胁可能引发微妙的行为调整,而这些变化无法被现有基于特征的分析方法捕捉,揭示了人类感知与算法检测之间存在尚未明确的语义或风格线索。

链接: https://arxiv.org/abs/2604.23471
作者: Daniel Tabach(Georgia Institute of Technology)
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: 25 pages, 12 figures

点击查看摘要

Abstract:This study asks whether the threat of AI detection changes how people write with AI, and whether other people can tell the difference. In a two-phase controlled experiment, 21 participants wrote opinion pieces on remote work using an AI chatbot. Half were randomly warned that their submission would be scanned by an AI detection tool. The other half received no warning. Both groups had access to the same chatbot. In Phase 2, 251 independent judges evaluated 1,999 paired comparisons, each time choosing which document in the pair was written by a human. Judges were not told that both writers had access to AI. Across all evaluations, judges selected the warned writer’s document as human 54.13% of the time versus 45.87% for the unwarned writer. A two-sided binomial test rejects chance guessing at p = 0.000243, and the result holds across both writing stances. Yet on every measurable text feature extracted, including AI overlap scores, lexical diversity, sentence structure, and pronoun usage, the two groups were indistinguishable. The judges are picking up on something that feature-based methods do not capture.

[HC-31] ArguAgent : AI-Supported Real-Time Grouping for Productive Argumentation in STEM Classrooms

【速读】:该论文旨在解决STEM教育中论证(Argumentation)教学效率低下的问题,即高成就学生常主导讨论,而低成就学生易被动参与或不贡献实质性推理,导致课堂互动不平等。解决方案的关键在于开发一个生成式AI驱动的系统ArguAgent,其核心机制是通过两阶段评估流程实现动态分组:首先使用基于语义分析的评分模型对学生的论证内容进行量化打分(0–4分量表),其次依据立场异质性与论证质量差异不超过±1级的学习进展标准进行聚类分组。实证表明,该方法在100个模拟班级中95.4%的小组满足设计目标,相较随机分组提升3.2倍,从而支持教师在教学过程中实时实施科学、公平且促进证据导向讨论的分组策略。

链接: https://arxiv.org/abs/2604.23449
作者: Jennifer Kleiman,Yizhu Gao,Xin Xia,Zhaoji Wang,Zipei Zhu,Jongchan Park,Xiaoming Zhai
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Full paper accepted to the 27th International Conference on AI in Education (AIED 2026). AIED Proceedings to be released Summer 2026

点击查看摘要

Abstract:Argumentation is a core practice in STEM education, but its productivity depends on who participates and how they interact. Higher-achieving students often dominate the talk and decision-making, while lower-achieving peers may disengage, defer, or comply without contributing substantive reasoning. Forming groups strategically based on students’ stances and argumentation skills could help foster inclusive, evidence-based discourse. In practice, however, teachers are constrained in implementing this grouping strategy because it requires real-time insight into students’ positions and the quality of their argumentation, information that is difficult to assess reliably and at scale during instruction. We present a generative AI-powered system, ArguAgent, that creates groups optimizing for stance heterogeneity while constraining argumentation quality differences to +/-1 level on a validated learning progression. ArguAgent uses a two-component assessment pipeline: first scoring student arguments on a 0-4 rubric, then clustering positions via semantic analysis. We validated the scoring component against human expert consensus (Krippendorff’s \alpha\alpha \alpha = 0.817) using 200 expert-generated scores. Testing three OpenAI models (GPT-4o-mini, GPT-5.1, GPT-5.2) with identical calibrated prompts, we found that systematic prompt engineering informed by human disagreement analysis contributed 89% of scoring improvement (QWK: 0.531 to 0.686), while model upgrades contributed an additional 11% (QWK: 0.686 to 0.708). Simulation testing across 100 classes demonstrated that the grouping algorithm achieves 95.4% of groups that meet both design criteria, a 3.2x improvement over random assignment. These results suggest ArguAgent can enable real-time, theoretically grounded grouping that promotes productive STEM argumentation in classrooms.

[HC-32] Otherness as a Quality in Designing Expressive Robotic Touch

【速读】:该论文旨在解决当前机器人触觉(Haptic)设计过度聚焦于模拟真实环境线索或手部动作所带来的局限性问题,这种趋同化设计不仅压缩了创新空间,还可能引发社会接受度的阻力。其解决方案的关键在于引入“他异性”(Otherness)这一概念作为设计核心品质——将机器人触觉固有的非人类特性从缺陷转化为表达潜力,通过有意识地塑造模糊性和多义性,激发用户对触觉交互的多元解读,从而推动更具表现力和情感共鸣的机器人触觉设计。研究基于反思性的设计研究(Research through Design, RtD)方法,结合艺术与设计案例分析,提炼出围绕“为何他异性重要”、“如何通过设计策略塑造他异性”以及“在何种系统层级嵌入他异性”的设计语言体系。

链接: https://arxiv.org/abs/2604.23402
作者: Ran Zhou,Laurens Boer,Daniel Leithinger,Madeline Balaam
机构: 未知
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: Full paper accepted to 2026 ACM Designing Interactive Systems Conference (DIS '26)

点击查看摘要

Abstract:Haptic technologies have advanced rapidly, yet exploration of robotic touch remains dominated by replicating realistic environmental cues or hand gestures, which narrows the design space and risks social resistance. This paper argues for alternatives: grounded in the notion of “otherness” from human-robot interaction (HRI), we propose treating robotic touch’s inherent otherness as a design quality. Instead of being a limitation in pursuing realism, otherness can be embraced to elicit ambiguity and provoke alternative interpretations, fostering expressive and evocative robotic touch design. To develop this perspective, we analyze inspirational art and design precedents and four design research cases through a reflective Research through Design (RtD) approach. Through this analysis, we articulate a set of design languages structured around why otherness matters for touch meaning-making, how it can be shaped through design strategies, and where it can be embedded within robotic touch systems. We conclude by reflecting on the tensions and risks involved in designing robotic touch with otherness in mind.

[HC-33] VeriLLM ed: Interactive Visual Debugging of Medical Large Language Models with Knowledge Graphs

【速读】:该论文旨在解决医疗领域大语言模型(Large Language Models, LLMs)在实际部署中因高风险临床决策和推理可靠性不足而面临的调试难题。具体而言,开发者常缺乏足够的医学专业知识来理解模型错误的临床意义,难以在多样化的输入类型、任务和推理步骤中识别关键错误,并且现有调试方法多为基于单个实例的孤立分析,无法发现重复出现的错误模式。解决方案的关键在于提出VeriLLMed——一个融合外部生物医学知识的可视化分析系统,其核心机制包括:将模型输出转化为可比较的推理路径,构建基于知识图谱的参考路径,并识别三类典型诊断错误(关系错误、分支错误和缺失错误),从而帮助开发者高效定位临床不合理推理并生成可操作的改进策略。

链接: https://arxiv.org/abs/2604.23356
作者: Yurui Xiang,Xingyi Mao,Rui Sheng,Zixin Chen,Zelin Zang,Yuyang Wu,Haipeng Zeng,Huamin Qu,Yushi Sun,Yanna Lin
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large language models (LLMs) show promise in medical diagnosis, but real-world deployment remains challenging due to high-stakes clinical decisions and imperfect reasoning reliability. As a result, careful inspection of model behavior is essential for assessing whether diagnostic reasoning is reliable and clinically grounded. However, debugging medical LLMs remains difficult. First, developers often lack sufficient medical domain expertise to interpret model errors in clinically meaningful terms. Second, models can fail across a large and diverse set of instances involving different input types, tasks, and reasoning steps, making it challenging for developers to prioritize which errors deserve focused inspection. Third, developers struggle to identify recurring error patterns across cases, as existing debugging practices are largely instance-centric and rely on manual inspection of isolated failures. To address these challenges, we present VeriLLMed, a visual analytics system that integrates external biomedical knowledge to audit and debug medical LLM diagnostic reasoning. VeriLLMed transforms model outputs into comparable reasoning paths, constructs knowledge graph-grounded reference paths, and identifies three recurring classes of diagnosis errors: relation errors, branch errors, and missing errors. Case studies and expert evaluation demonstrate that VeriLLMed helps developers identify clinically implausible reasoning and generate actionable insights that can inform the improvement of medical LLMs.

[HC-34] Bridging Reasoning and Action: Hybrid LLM -RL Framework for Efficient Cross-Domain Task-Oriented Dialogue

【速读】:该论文旨在解决跨域任务导向对话中长期多轮规划时,如何有效融合大语言模型(Large Language Models, LLMs)的约束推理能力与强化学习(Reinforcement Learning, RL)的长程行为优化能力的问题。其核心挑战在于:LLMs虽能推断显式和隐式可行性约束,但在长程对话中可靠性不足;而RL虽擅长优化长期策略,却难以从原始对话中提取结构化约束信息。解决方案的关键在于提出一种混合框架VLK-RL(Verified LLM-Knowledge empowered RL),通过双角色交叉验证机制对LLM生成的候选约束进行验证以抑制幻觉和跨轮次不一致,再将经验证的约束映射为与知识本体对齐的槽位-值表示,从而构建结构化的、约束感知的状态空间用于RL策略优化,显著提升模型在长程任务中的泛化性和鲁棒性。

链接: https://arxiv.org/abs/2604.23345
作者: Yangyang Zhao,Linfan Dai,Li Cai,Bowen Xing,Libo Qin
机构: Changsha University of Science and Technology(长沙理工大学); Central South University(中南大学); Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳); University of Science and Technology Beijing(北京科技大学); Guizhou University(贵州大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Cross-domain task-oriented dialogue requires reasoning over implicit and explicit feasibility constraints while planning long-horizon, multi-turn actions. Large language models (LLMs) can infer such constraints but are unreliable over long horizons, while Reinforcement learning (RL) optimizes long-horizon behavior yet cannot recover constraints from raw dialogue. Naively coupling LLMs with RL is therefore brittle: unverified or unstructured LLM outputs can corrupt state representations and misguide policy learning. Motivated by this, we propose Verified LLM-Knowledge empowered RL (VLK-RL), a hybrid framework that makes LLM-derived constraint reasoning usable for RL. VLK-RL first elicits candidate constraints with an LLM and then verifies them via a dual-role cross-examination procedure to suppress hallucinations and cross-turn inconsistencies. The verified constraints are mapped into ontology-aligned slot-value representations, yielding a structured, constraint-aware state for RL policy optimization. Experiments across multiple benchmarks demonstrate that VLK-RL significantly improves generalization and robustness, outperforming strong single-model baselines on long-horizon tasks.

[HC-35] Scalable LLM -based Coding of Dialogue in Healthcare Simulation: Balancing Coding Performance Processing Time and Environmental Impact

【速读】:该论文试图解决的问题是如何在团队医疗模拟复盘场景中,设计出既准确又高效、且具备环境可持续性的大语言模型(Large Language Models, LLMs)提示(prompt)策略,以实现在实时或近实时情境下对对话内容进行自动化编码分析。其解决方案的关键在于系统性地优化提示设计与批处理(batching)策略之间的权衡关系,通过实验对比不同提示方案在多种批大小下的编码准确性、处理时间和能耗表现,发现增大批处理规模虽能提升速度并降低能源消耗,但会损害编码质量;因此,研究提出了兼顾教育实用性与技术可行性的配置原则,为在时间敏感、隐私要求高和可持续性重要的场景中部署LLM驱动的对话分析提供了可落地的技术路径。

链接: https://arxiv.org/abs/2604.23255
作者: Kiyoshige Garces,Gloria Milena Fernandez-Nieto,Linxuan Zhao,Sachini Samaraweera,Dragan Gasevic,Roberto Martinez-Maldonado,Vanessa Echeverria
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 12 pages, 6 figures. Accepted at the Learning at Scale Conference (L@S) 2026

点击查看摘要

Abstract:Research shows that dialogue, the interactive process through which participants articulate their thinking, plays a central role in constructing shared understanding, coordinating action, and shaping learning outcomes in teams. Analysing dialogue content has been central to advancing team learning theory and informing the design of computer-supported collaborative learning environments, yet this progress has depended on labour-intensive qualitative coding. LLMs offer new possibilities for automating and enhancing the dialogue layer within emerging multimodal learning analytics approaches, with recent studies showing that they can approximate human coding through few-shot prompting. However, prior work has focused on replicating human coding accuracy for research purposes, rather than addressing a more educationally consequential question: how can we design prompts that allow an LLM to label team dialogue accurately and fast enough to be useful in real settings, such as in-person healthcare simulations, where results must be returned quickly and computational cost and sustainability also matter? This paper investigates how prompt design and batching strategies can be optimised to balance coding accuracy, processing time, and environmental impact in team-based healthcare simulation debriefing. Using a dataset of 11,647 utterances coded across 6 dialogue constructs, we compared 4 prompt designs across varying batch sizes, evaluating coding performance, processing time, and energy consumption, as well as the trade-offs between these metrics. Results indicate that increasing batch size improves speed and reduces energy use, but negatively impacts coding performance. Beyond demonstrating the feasibility of LLM-based qualitative analysis, this study offers practical guidance for scaling dialogue analytics in contexts where timeliness, privacy, and sustainability are critical.

[HC-36] PrivacyAssist: A User-Centric Agent Framework for Detecting Privacy Inconsistencies in Android Apps

【速读】:该论文旨在解决移动应用(Mobile Apps)中隐私保护机制失效且用户难以理解的问题,即用户在安装应用时往往无法有效识别权限授予与开发者实际数据收集行为之间的不一致。解决方案的关键在于提出PrivacyAssist——一个基于多智能体大语言模型(Multi-Agent LLM-based)的平台,利用检索增强生成(Retrieval-Augmented Generation, RAG)技术检测用户授权权限与开发者声明的敏感数据收集/共享实践之间的不一致性,并提供简洁明了的解释和设备端实时警告,从而支持用户做出知情的安装决策。

链接: https://arxiv.org/abs/2604.23248
作者: Tran Thanh Lam Nguyen,Edoardo Di Tullio,Barbara Carminati,Elena Ferrari
机构: University of Insubria (因苏布里亚大学)
类目: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Mobile apps offer significant benefits, but their privacy protections often remain ineffective and confusing for users. While prior work mainly analyzes app privacy vulnerabilities, few approaches help users understand, set, and enforce their privacy preferences. This paper presents PrivacyAssist, a multi-agent LLM-based platform that detects inconsistencies between user-granted permissions and developers’ declared sensitive data collection and sharing practices. Using Retrieval-Augmented Generation (RAG), PrivacyAssist provides concise explanations and real-time on-device warnings to support informed installation decisions. We evaluate PrivacyAssist with 200 users and 2,347 Android apps, finding that only 16% of apps are fully consistent between granted permissions and declared data practices.

[HC-37] How Researchers Navigate Accountability Transparency and Trust When Using AI Tools in Early-Stage Research: A Think-Aloud Study

【速读】:该论文旨在解决生成式 AI(Generative AI)在科研早期阶段应用时,因输出结果缺乏透明性、可追溯性和可信度,导致学者难以维持原有学术判断力的问题。其核心问题在于:AI工具虽提升了效率,却模糊了责任归属、削弱了信息溯源能力,并破坏了研究者对AI的稳定信任,进而引发负责任人工智能(Responsible AI, RAI)风险。解决方案的关键在于识别出三类关键挑战——AI输出掩盖认知不确定性( accountability)、检索与内容构建过程不透明(transparency)、以及信任易受情境影响而动摇(trust),并发现研究人员会主动发展补偿策略以恢复在不确定环境下的学术判断力,从而提出应从实际研究者经验出发,设计能保障责任明晰、支持信息溯源、并培育理性信任的AI集成机制。

链接: https://arxiv.org/abs/2604.23136
作者: Sanjana Gautam,Houjiang Liu,Yujin Choi,Matthew Lease
机构: Microsoft(微软); The University of Texas at Austin(德克萨斯大学奥斯汀分校)
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:In the early stages of scientific research, researchers rely on core scholarly judgments to identify relevant literature, assess credible evidence, and determine which directions merit pursuit. As AI tools become increasingly integrated into these early-stage workflows, the scholarly judgments that were once transparent and attributable to individual researchers become obscured, raising critical Responsible AI (RAI) concerns around accountability, transparency, and trust. Yet how these three dimensions manifest in real-time, in-situ scholarly practice remains largely unexplored. To address this gap, we conducted a think-aloud study with 15 researchers to examine how they used AI tools powered by large language models (LLMs) across early-stage research tasks, including literature exploration, synthesis, and research ideation. Our key findings address the tripartite constructs of accountability, transparency, and trust. First, the confident tone of AI outputs misrepresents epistemic uncertainty, making it more difficult for researchers (who are ultimately accountable) to identify which outputs require the greatest scrutiny. Second, opaque retrieval and content construction make provenance difficult to establish for transparency. Third, trust in AI is fragile, context-dependent, and easily eroded. In response, participant researchers were seen to develop compensatory strategies to restore scholarly judgment under uncertainty. Overall, our findings serve to contextualize AI-mediated research as a RAI problem grounded in lived researcher experience and motivate attention to deliberate AI integration that preserves accountability, supports transparency, and fosters informed trust.

[HC-38] Visual Accessibility in a Virtual Kitchen: Effects of Open Shelving on Performance Cognitive Load and Experience in Older Adults with and without MCI

【速读】:该论文旨在解决老年人(包括轻度认知障碍[MCI]患者)在厨房环境中因橱柜设计导致的视觉可及性不足所引发的任务执行效率低下、认知负荷增加及行为适应困难等问题。其解决方案的关键在于通过对比封闭式橱柜与开放式货架两种设计条件,验证视觉可达性提升对任务表现、身体活动水平、认知负荷及用户体验的影响。研究发现,开放式货架显著缩短了任务时长并降低了物理活动需求(ENMO),同时改变了视觉搜索模式(表现为注视熵上升),但并未显著改变主观认知负荷(NASA-TLX)和内在动机(IMI),表明客观行为指标与主观感受存在差异。这一结果支持采用高视觉可达性的设计策略,以优化老年群体在居住环境中的功能表现,并为构建认知友好型建筑环境提供实证依据。

链接: https://arxiv.org/abs/2604.23081
作者: Ibrahim Bilau,Eunhwa Yang,Hyeokhyen Kwon,Stacie Smith,Bruce Walker,Hui Cai,Ece Erdogmus,Omobolanle Ogunseiju
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 21 pages, 4 figures, 3 tables

点击查看摘要

Abstract:This study examines how visual accessibility through cabinet design influences task performance, cognitive load, physical activity level, motivation, and user experience in a virtual kitchen among older adults with and without mild cognitive impairment (MCI). Seventeen older adults (7 with MCI, 10 without) completed a repeated-measures item retrieval task under two conditions, closed cabinets and open shelving, using a counterbalanced within-subjects design. Measures included task duration, physical activity level (ENMO), cognitive load (NASA-TLX and gaze entropy), intrinsic motivation (IMI), and post-task interviews. Open shelving significantly reduced task duration (beta = -291.20, p .001) and physical activity level (beta = -0.00615, p = .008). Gaze entropy increased (beta = 1.29, p = .001), with a significant Setting x MCI interaction (p = .009) and moderation by MoCA score (p .001). NASA-TLX and intrinsic motivation did not differ significantly between conditions. Qualitative findings indicated reduced reliance on memory-based search and highlighted themes related to independence, aesthetics, safety, and adoption. Overall, visual accessibility improved efficiency and reduced movement demands while altering visual-search organization, with divergence between subjective and objective indicators of cognitive load. These findings support visually accessible design strategies to enhance functional performance and inform cognitively supportive built environments for aging populations.

[HC-39] Within-person prediction of depressive symptom change using year-long Screenome data and CES-D assessments

【速读】:该论文旨在解决数字表型(digital phenotyping)领域中尚未充分探索的个体内部纵向轨迹预测问题,即如何基于日常行为数据准确预测个体抑郁症状在未来两周内是恶化、稳定还是改善。其解决方案的关键在于将预测任务建模为个体内部分类问题,并结合超过100百万次每5秒采集的屏幕截图(screenomic data)与每两周一次的CES-D量表评估,利用XGBoost模型实现高精度预测:在时间保持测试下,对既定CES-D严重程度分界线跨越的预测AUC达0.906,对个体自身波动基准变化的预测AUC为0.755;其中,个体典型症状水平是唯一显著优于最新CES-D得分的预测因子,且无此特征时无法识别关键恶化事件,凸显了个体基线建模的重要性;此外,从屏幕行为中提取的特征揭示了症状恶化的前驱模式,如社交媒体使用增加、设备使用碎片化及夜间活动变化,具有显著个体异质性,为早期干预提供了可操作的行为标志。

链接: https://arxiv.org/abs/2604.23040
作者: Merve Cerit,Andrea Mock,Vryan Almanon Feliciano,Thomas N. Robinson,Byron Reeves,Nilam Ram,Nick Haber
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Predicting whether an individual’s depressive symptoms will worsen, remain stable, or improve over the coming weeks can enable earlier and more targeted care, yet prospective within-person trajectory prediction remains largely unaddressed in digital phenotyping. We combine fortnightly CES-D assessments with over 100 million screenshots captured every five seconds via the Stanford Screenomics platform from 96 adults followed for approximately one year (M = 20.9, SD = 3.9 assessments per participant, 2,002 total observations). We frame prediction as a within-person classification task: whether symptoms will worsen, remain stable, or improve over the subsequent fortnight, operationalized in three ways to capture clinically meaningful change. Under temporal holdout, XGBoost achieves an AUC of 0.906 for crossings of established CES-D severity bands and 0.755 for change relative to each participant’s own within-person variability, generalizing to unseen individuals (AUC = 0.821). Each person’s typical symptom level was the only statistically significant predictor above the most recent CES-D score; without it, the most consequential worsening transitions go undetected. Screenome-derived behavioral features revealed prodromal patterns of worsening, including escalating social media use, fragmented device engagement, and changes in overnight activity, with substantial individual heterogeneity. These findings establish a proof-of-concept foundation for monitoring systems that could identify individuals approaching clinical deterioration before symptoms reach a crisis point.

[HC-40] Understanding teens self-beliefs when learning to construct and deconstruct AI/ML systems: Developing a survey instrument

【速读】:该论文旨在解决当前缺乏针对儿童与青少年群体的AI素养测评工具的问题,尤其关注计算赋权(computational empowerment)在建构(construction)与解构(deconstruction)活动中的体现。其解决方案的关键在于开发并验证了一个包含六个维度的量表工具,涵盖建构类自信念(如创造性表达和问题解决自我效能感)与解构类自信念(如审计自我效能感和对审计的兴趣),以及与设计正义(design justice)和学习AI/机器学习价值相关的普遍性自我信念。通过在124名青少年中进行验证性因子分析,证实了该量表的六因子结构,并发现设计正义信念与问题解决、审计自我效能感及创造性表达显著相关,为后续开展面向青少年的AI素养干预研究提供了可靠测量基础。

链接: https://arxiv.org/abs/2604.22959
作者: Luis Morales-Navarro,Deborah Fields,Michael T. Giang,Daniel J. Noh,Yasmin B. Kafai,Danaé Metaxa
机构: University of Pennsylvania (宾夕法尼亚大学); Utah State University (犹他州立大学); Cal Poly Pomona (加州理工州立大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Despite growing calls to foster AI literacy, there are few available survey instruments designed for children and youth that study computational empowerment alongside construction and deconstruction activities. In such activities, learners’ beliefs about their abilities and attributes can impact their engagement. In this paper, we introduce and validate a survey instrument with constructs related to construction (creative expression and problem-solving self-beliefs) and deconstruction (auditing self-efficacy and fascination with auditing), along with more general self-beliefs related to design justice and the value of learning about AI/ML. We administered the instrument to 124 teenagers and assessed the six-factor structure of the instrument using confirmatory factor analysis. In addition to confirming the structure, we found that design justice beliefs strongly correlated with problem-solving, auditing self-efficacy, and creative expression.

[HC-41] Enabling users to work sustainably on shared institute computing resources

【速读】:该论文旨在解决科研计算集群在能源效率提升受限背景下的可持续发展问题,特别是在老旧建筑中难以实施传统节能改造的情况下。其核心挑战是如何在不牺牲计算性能的前提下降低温室气体排放。解决方案的关键在于采用用户导向的双重策略:一是通过部署细粒度能耗监测系统与实时电网可再生能源比例数据,实现“绿色时段”调度(green-window scheduling),使用户能够基于碳强度动态调整任务执行时间;二是引入行为激励机制,如个人能耗查询、周报反馈及项目级碳足迹聚合,结合历史内存使用记录避免资源过度分配,从而在自愿基础上推动科研人员提高资源意识并形成低碳计算文化。这一技术与行为干预相结合的方法,旨在实现中长期的碳减排目标。

链接: https://arxiv.org/abs/2604.22799
作者: Niclas Eich,Johannes Erdmann,Martin Erdmann,Benjamin Fischer,Paul Gilles,Tim Hauptreif,Jan Kelleter
机构: RWTH Aachen University (亚琛工业大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); High Energy Physics - Experiment (hep-ex)
备注: 17 pages, 14 figures

点击查看摘要

Abstract:The VISPA project is a self-managed, mid-scale computing cluster that supports physics data analysis in research and teaching. Because the cluster is housed in a 1970s institute building with limited retrofit options, conventional efficiency upgrades would yield only minor energy savings. We therefore target sustainability primarily through user-centric measures. A monitoring system now records per-job energy consumption, while real-time data on the renewable share of the German power grid enable `green-window’ scheduling. Users can query their individual energy consumption and carbon footprints, receive weekly reports, and tag jobs by project for aggregate accounting; memory records from previous runs help avoid oversubscription. All options are voluntary, fostering a cultural shift rather than imposing hard constraints. A simulation framework evaluates the potential impact of these measures. Together, the technological and behavioral interventions aim at medium- to long-term reductions in greenhouse-gas emissions by increasing resource awareness within the scientific community.

[HC-42] Cognitive Alignment Deciphered: A Self-Developed Scenario-Based Prompt Scale Coupled with Representational Similarity Analysis and Social Network Analysis for Unraveling Bias Mechanisms Across Humans and LLM s

【速读】:该论文旨在解决传统认知偏差测量工具在覆盖范围、生态效度和依赖抽象自我报告方面的局限性,从而限制了情景化评估及人类与人工智能(AI)之间的对比研究。其解决方案的关键在于提出一种基于情境的认知偏差评估量表(Context-Based Cognitive Bias Assessment Scale, CBAS),该量表通过场景驱动的提示模板覆盖58种认知偏差,并划分为五大双系统维度(计算、信念、信息、社会和记忆)。进一步结合表示相似性分析(Representational Similarity Analysis, RSA)与社交网络分析(Social Network Analysis, SNA),实现对人类不同年龄群体与三个大语言模型(Baidu ERNIE 3.5 8K、DeepSeek V3、DeepSeek R1)的认知结构差异的量化比较。结果显示,人类表现出热冷系统整合一致且个体间变异性强的认知网络特征,而LLMs则呈现碎片化、僵化响应模式及较低变异性;并通过角色扮演与偏差缓解指令的提示干预策略显著提升LLM响应准确性(如DeepSeek R1达84.86%),部分重塑其内部表征,为可复现的认知对齐研究提供了方法论框架。

链接: https://arxiv.org/abs/2604.22775
作者: Chengrui Zhou
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 10 pages

点击查看摘要

Abstract:Traditional cognitive bias measurement tools are limited by narrow bias coverage, low ecological validity, and reliance on abstract self reports, constraining scenario based and human AI comparisons. We introduce the context based Cognitive Bias Assessment Scale CBAS, a scenario driven prompt template covering 58 cognitive biases across five hot cold dual system dimensions: Calculation, Belief, Information, Social, and Memory. Psychometric testing with 330 participants shows satisfactory reliability Cronbachs alpha 0.714 and good model fit chi squared df 1.83, RMSEA 0.057, CFI 0.908, TLI 0.903. We then combine Representational Similarity Analysis RSA and Social Network Analysis SNA to compare human age groups and three large language models Baidu ERNIE 3.5 8K, DeepSeek V3, DeepSeek R1. Humans show coherent hot cold integration with high inter individual variability, whereas LLMs display fragmented, inflexible response patterns and lower variability. Human cognitive networks exhibit strong inter module connectivity, while LLMs show fixed core biases and isolated information processing components. Prompt interventions integrating role playing and bias mitigation instructions effectively improve LLM response accuracy, reaching 84.86 percent for DeepSeek R1 and 78.24 percent for DeepSeek V3, and partially reshape their internal representations. Our work establishes a replicable assessment and analysis pipeline for cognitive alignment research, bridging empirical psychological evaluation and interpretable artificial intelligence.

[HC-43] race Mutation in Human-LLM Dialogue: The Transcript as Forensic and Mitigation Surface

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在知识工作场景中作为协作伙伴时,因“痕迹变异”(trace mutations)导致共享对话记录出现隐蔽性失真问题。此类失真虽表现为连续性,实则破坏了决策记录的可信度与可追溯性。解决方案的关键在于识别并区分两种特定类型的痕迹变异:一是“话语抹除”(utterance effacement),即一方发言内容被篡改但表面仍显连贯;二是“所有权断裂”(genitive dissociation),即模型丧失对其自身生成内容的归属权。作者通过理论分析与案例验证指出,这些现象不同于传统意义上的幻觉(confabulation)或迎合行为(sycophancy),且难以通过常规对话修复机制纠正,因此亟需在工具设计层面引入对上下文一致性与生成溯源的强化机制,以保障人机协作中的决策透明性和可靠性。

链接: https://arxiv.org/abs/2604.22773
作者: William J. Bensen
机构: Independent Researcher(独立研究员)
类目: Human-Computer Interaction (cs.HC)
备注: 15 pages, 2 figures, 6 tables. Submitted to CHIWORK 2026 as Late-Breaking Work. Supplementary data: this https URL

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed as partners in knowledge work, where the shared conversational record functions as the decision record that safeguards work continuity. We characterize a class of context failures we term trace mutations, in which distortions enter the shared record while presenting as grounded continuity. We describe two forms: utterance effacement, in which an interlocutor’s contribution is re-presented with altered substance, and genitive dissociation, in which a model loses authorship of its own contributions. Using a schematic illustration and two naturalistic anchor cases, we show how these failures differ from confabulation and sycophancy and why they resist ordinary conversational repair. Preliminary cross-model elicitation suggests that at least one such failure is highly camouflaged to contemporary models. We situate the phenomena within grounding and repair theory and discuss implications for tool design.

[HC-44] Learning in Blocks: A Multi Agent Debate Assisted Personalized Adaptive Learning Framework for Language Learning

【速读】:该论文旨在解决当前数字语言学习课程中普遍存在的“碎片化测试”问题,即依赖离散题目的测验仅评估记忆性知识,而无法有效衡量学习者在真实对话中的语法(Grammar)、词汇(Vocabulary)及互动沟通能力(Interactive Communication),导致学习者可能因通过测验而进展,却仍存在实际应用能力的短板。解决方案的关键在于提出“Learning in Blocks”框架,其核心创新是采用异质多智能体辩论机制(HeteroMAD),分两阶段实现:第一阶段由角色专业化代理独立评分并辩论以达成共识,确保评分可靠;第二阶段基于CEFR标准识别具体需强化的语法和词汇技能,并结合间隔复习(spaced review)与掌握度阈值(70%)驱动个性化学习路径,从而实现以实证交互能力为依据的精准进阶与补弱。

链接: https://arxiv.org/abs/2604.22770
作者: Nicy Scaria,Silvester John Joseph Kennedy,Deepak Subramani
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Accepted as main paper in AIED 2026

点击查看摘要

Abstract:Most digital language learning curricula rely on discrete-item quizzes that test recall rather than applied conversational proficiency. When progression is driven by quiz performance, learners can advance despite persistent gaps in using grammar and vocabulary during interaction. Recent work on LLM-based judging suggests a path toward scoring open-ended conversations, but using interaction evidence to drive progression and review requires scoring protocols that are reliable and validated. We introduce Learning in Blocks, a framework that grounds progression in demonstrated conversational competence evaluated using CEFR-aligned rubrics. The framework employs heterogeneous multi-agent debate (HeteroMAD) in two stages: a scoring stage where role-specialized agents independently evaluate Grammar, Vocabulary, and Interactive Communication, engage in debate to address conflicting judgments, and a judge synthesizes consensus scores; and a recommendation stage that identifies specific grammar skills and vocabulary topics for targeted review. Progression requires demonstrating 70% mastery, and spaced review targets identified weaknesses to counter skill decay. We benchmark four scoring and recommendation methods on CEFR A2 conversations annotated by ESL experts. HeteroMAD achieves a superior score agreement with a 0.23 degree of variation and recommendation acceptability of 90.91%. An 8-week study with 180 CEFR A2 learners demonstrates that combining rubric-aligned scoring and recommendation with spaced review and mastery-based progression produces better learning outcomes than feedback alone.

[HC-45] he Imbalanced User-AI Relationships as an Ethical Failure of Front-End Design in Healthcare AI

【速读】:该论文旨在解决医疗领域中人工智能(Artificial Intelligence, AI)前端界面存在的伦理失衡问题,即患者在数据推理过程中被高度可见,却无法理解、质疑或影响自身在AI系统中的表征,从而导致用户与AI之间关系的不对称性。这种不对称性表现为“不对称可读性”(asymmetric legibility),即便AI技术本身准确,其设计选择如默认推荐、受限输入和抑制不确定性等仍会削弱患者的自主权、临床判断力及人类监督能力。论文提出以“互惠性”(reciprocity)作为设计导向,并提供干预措施,以构建更平衡、参与式的医患-AI关系,从而实现伦理上更具公平性的AI应用。

链接: https://arxiv.org/abs/2604.22767
作者: Maureen Mghambi Mwadime
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted at the Proceedings of the CHI 2026 Workshop: Ethics at the Front-End

点击查看摘要

Abstract:Ethical discourse on AI in healthcare has focused predominantly on back-end concerns such as bias, fairness and explainability, while the front-end interface, where patients and clinicians actually encounter AI outputs, remains under explored. This paper identifies imbalanced user-AI relationships as a distinct class of front-end ethical failure: patients are rendered highly visible to AI systems through data inference, yet cannot understand, question or influence how they are represented. Through the concept of asymmetric legibility and a chat-based telemedicine case, we show how design choices e.g., default recommendations, restricted inputs and suppressed uncertainty, undermine agency, clinician judgment and human oversight even where systems are technically accurate. We propose reciprocity as a design orientation and offer interventions for more balanced, participatory user-AI relationships in healthcare.

[HC-46] A learning health system in Neurorehabilitation as a foundation for multimodal patient representation

【速读】:该论文旨在解决计算神经康复(computational neurorehabilitation, compNR)在临床实践中难以落地的问题,具体表现为数据系统碎片化、互操作性差以及临床医生在机器学习(Machine Learning, ML)模型开发中参与度低。其解决方案的关键在于引入学习健康系统(Learning Health System, LHS)框架,通过整合多模态数据采集、模型计算与临床可视化,实现临床医生与ML模型的协同工作,从而构建一个支持结构化数字数据捕获、安全计算处理及可互操作患者轨迹可视化的基础设施,有效推动compNR从研究模型向临床应用的转化。

链接: https://arxiv.org/abs/2604.22763
作者: Thomas Weikert,Eljas Roellin,Lukas Heumos,Fabian J. Theis,Diego Paez-Granados,Chris Easthope Awai
机构: Lake Lucerne Institute (湖伦茨研究所); ETH Zurich (苏黎世联邦理工学院); Swiss Paraplegic Research (瑞士截瘫研究中心); Helmholtz Center Munich (慕尼黑亥姆霍兹中心); Technical University of Munich (慕尼黑工业大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Neurological disorders represent a growing global health burden requiring long-term, interdisciplinary rehabilitation. Computational neurorehabilitation (compNR) - the use of data-driven and model-based approaches to personalize treatment - offers new opportunities for precision rehabilitation. However, its clinical deployment is limited by fragmented data systems, poor interoperability, and low clinician engagement in model development. We embed the learning health system (LHS) framework in Neurorehabilitation through integration of multimodal data collection, model computation, and clinical visualization that enables clinician-ML collaboration in everyday neurorehabilitation practice. The system facilitates structured digital data capture, secure computational processing, and interoperable visualization of patient trajectories. Through a real-world deployment in stroke rehabilitation, we demonstrate how such an infrastructure bridges the gap between research models and clinical use, showcasing one approach to a translational pathway for compNR.

[HC-47] How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agent ic Coding Tasks

【速读】:该论文旨在解决AI代理(AI agent)在复杂人类工作流中部署时面临的高大语言模型(Large Language Model, LLM)token消耗问题,具体聚焦于三个核心问题:token消耗的来源、不同模型的token效率差异以及代理是否能提前预测自身token成本。其解决方案的关键在于对八种前沿LLM在SWE-bench Verified数据集上的代理编码任务轨迹进行系统性分析,揭示了token消耗模式的非线性、高度随机性和与任务难度的人工感知不一致等特征,并首次量化评估了模型自我预测token成本的能力。研究发现,输入token是主要成本驱动因素,且token使用量与准确性之间呈倒U型关系,同时多数模型显著低估实际token消耗,这为优化AI代理的经济性提供了实证基础和理论依据。

链接: https://arxiv.org/abs/2604.22750
作者: Longju Bai,Zhemin Huang,Xingyao Wang,Jiao Sun,Rada Mihalcea,Erik Brynjolfsson,Alex Pentland,Jiaxin Pei
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token-efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models’ ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.

[HC-48] Beyond Behavior: Why AI Evaluation Needs a Cognitive Revolution

【速读】:该论文试图解决的问题是:当前人工智能(AI)研究长期受制于图灵提出的以行为表现作为智能判断标准的 epistemological commitment(认识论承诺),这种承诺导致领域内忽视了对系统内部计算机制、过程和组织结构的探究,从而无法区分通过不同机制实现相同输出的系统,而这恰恰是智能归因的关键。解决方案之关键在于推动一场类似心理学从行为主义向认知革命的 epistemological transition(认识论转型)——不是摒弃行为证据,而是承认行为证据本身不足以支撑AI领域所宣称的智能构造性主张(construct claims),并建立一种后行为主义的认识论框架,使诸如“系统如何实现目标”“内部机制是否反映智能本质”等当前无法提出的问题变得可问且可答。

链接: https://arxiv.org/abs/2604.05631
作者: Amir Konigsberg
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:In 1950, Alan Turing proposed replacing the question “Can machines think?” with a behavioral test: if a machine’s outputs are indistinguishable from those of a thinking being, the question of whether it truly thinks can be set aside. This paper argues that Turing’s move was not only a pragmatic simplification but also an epistemological commitment, a decision about what kind of evidence counts as relevant to intelligence attribution, and that this commitment has quietly constrained AI research for seven decades. We trace how Turing’s behavioral epistemology became embedded in the field’s evaluative infrastructure, rendering unaskable a class of questions about process, mechanism, and internal organization that cognitive psychology, neuroscience, and related disciplines learned to ask. We draw a structural parallel to the behaviorist-to-cognitivist transition in psychology: just as psychology’s commitment to studying only observable behavior prevented it from asking productive questions about internal mental processes until that commitment was abandoned, AI’s commitment to behavioral evaluation prevents it from distinguishing between systems that achieve identical outputs through fundamentally different computational processes, a distinction on which intelligence attribution depends. We argue that the field requires an epistemological transition comparable to the cognitive revolution: not an abandonment of behavioral evidence, but a recognition that behavioral evidence alone is insufficient for the construct claims the field wishes to make. We articulate what a post-behaviorist epistemology for AI would involve and identify the specific questions it would make askable that the field currently has no way to ask.

[HC-49] EyeBrain: Left and Right Brain Lateralization Activity Classification Through Pupil Diameter and Fixation Duration

【速读】:该论文旨在解决如何利用眼动指标(如瞳孔直径和注视持续时间)有效区分大脑左右半球的偏侧化活动这一问题。其解决方案的关键在于证实了这些眼动参数能够作为大脑半球功能状态的可靠生物标志物,通过机器学习分类模型实现了高达0.894 F1分数的准确分类性能,表明眼动测量可作为非侵入性手段用于认知状态监测与神经康复中的脑功能评估。

链接: https://arxiv.org/abs/2604.23562
作者: Ko Watanabe,Pooja Pol,Nicolas Großmann,Shoya Ishimaru,Andreas Dengel
机构: RPTU Kaiserslautern-Landau (莱布尼茨信息处理研究所); DFKI GmbH (德国弗劳恩霍夫信息安全研究中心); Osaka Metropolitan University (大阪都立大学); DFKI Lab Japan (德国弗劳恩霍夫信息安全研究中心日本实验室)
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The relationship between brain lateralization and cognitive functions is well-documented. The left hemisphere primarily handles tasks such as language and arithmetic, while the right hemisphere is involved in creative activities like drawing and music perception. Eye-tracking technology has shown the potential to reveal cognitive states by measuring ocular metrics such as pupil diameter and fixation duration. However, the ability to distinguish lateralized brain activity using these ocular metrics remains underexplored. Here, we demonstrate that pupil diameter and fixation duration can effectively classify left and right brain hemisphere activities. We obtained a considerably high classification performance, with an F1 score of 0.894. The results suggest that ocular metrics are robust indicators of lateralized brain activity and can be applied in cognitive monitoring and neurorehabilitation. Our future work expands on this by integrating these methods into real-time applications EyeBrain, potentially broadening their use across various cognitive and neurological domains.

计算机视觉

[CV-0] World-R1: Reinforcing 3D Constraints for Text-to-Video Generation MICRO

【速读】:该论文旨在解决当前视频基础模型在视觉合成过程中普遍存在几何不一致性的问题,即生成视频中物体或场景的3D结构难以保持稳定与合理。现有方法通常通过修改网络架构引入3D先验信息,但存在计算开销大、扩展性差等局限。其解决方案的关键在于提出World-R1框架,利用强化学习(reinforcement learning)将视频生成过程与3D约束对齐:首先构建一个专用于世界模拟的纯文本数据集以支持训练;其次采用Flow-GRPO算法,基于预训练的3D基础模型和视觉-语言模型(vision-language models)提供的反馈信号优化模型,从而在不改变原始架构的前提下增强结构一致性;最后通过周期性解耦训练策略,在刚性几何一致性与动态场景流畅性之间实现平衡,显著提升3D一致性并保留原模型的视觉质量。

链接: https://arxiv.org/abs/2604.24764
作者: Weijie Wang,Xiaoxuan He,Youping Gu,Yifan Yang,Zeyu Zhang,Yefei He,Yanbo Ding,Xirui Hu,Donny Y. Chen,Zhiyuan He,Yuqing Yang,Bohan Zhuang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL , Code: this https URL

点击查看摘要

Abstract:Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.

[CV-1] una-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

【速读】:该论文旨在解决统一多模态模型中视觉理解与生成任务之间因依赖预训练视觉编码器而产生的表征错位问题,以及由此导致的无法从原始像素端到端优化的局限性。其解决方案的关键在于提出Tuna-2,一种原生统一多模态模型,它摒弃了传统的模块化视觉编码器(如VAE或表示编码器),直接基于像素嵌入(pixel embeddings)完成视觉理解和生成任务;通过简单的patch嵌入层实现输入图像的编码,从而实现完全的端到端像素空间建模,实验表明该设计在多模态基准上达到最先进性能,并在大规模场景下展现出更强的细粒度视觉感知能力,证明了预训练视觉编码器并非多模态建模的必要组件。

链接: https://arxiv.org/abs/2604.24763
作者: Zhiheng Liu,Weiming Ren,Xiaoke Huang,Shoufa Chen,Tianhong Li,Mengzhao Chen,Yatai Ji,Sen He,Jonas Schult,Belinda Zeng,Tao Xiang,Wenhu Chen,Ping Luo,Luke Zettlemoyer,Yuren Cong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly based on pixel embeddings. Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2’s encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception. These results show that pretrained vision encoders are not necessary for multimodal modelling, and end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.

[CV-2] OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer

【速读】:该论文旨在解决现有视频镜头边界检测(Shot Boundary Detection, SBD)方法中存在的三大问题:一是生成的边界缺乏可解释性,二是难以捕捉细微但有害的镜头间不连续性,三是依赖噪声大、多样性低的手动标注数据和过时的评估基准。其解决方案的关键在于提出OmniShotCut框架,将SBD建模为结构化关系预测任务,通过基于镜头查询(shot query)的密集视频Transformer联合估计镜头区间及其内部关系(intra-shot relations)与跨镜头关系(inter-shot relations),从而提升检测精度与可解释性;同时引入全自动合成过渡场景的流水线以避免人工标注误差,并构建OmniShotCutBench这一现代多领域基准用于全面且诊断性的评估。

链接: https://arxiv.org/abs/2604.24762
作者: Boyang Wang,Guangyi Xu,Zhipeng Tang,Jiahui Zhang,Zezhou Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Shot Boundary Detection (SBD) aims to automatically identify shot changes and divide a video into coherent shots. While SBD was widely studied in the literature, existing state-of-the-art methods often produce non-interpretable boundaries on transitions, miss subtle yet harmful discontinuities, and rely on noisy, low-diversity annotations and outdated benchmarks. To alleviate these limitations, we propose OmniShotCut to formulate SBD as structured relational prediction, jointly estimating shot ranges with intra-shot relations and inter-shot relations, by a shot query-based dense video Transformer. To avoid imprecise manual labeling, we adopt a fully synthetic transition synthesis pipeline that automatically reproduces major transition families with precise boundaries and parameterized variants. We also introduce OmniShotCutBench, a modern wide-domain benchmark enabling holistic and diagnostic evaluation.

[CV-3] DiffuSAM: Diffusion-Based Prompt-Free SAM2 for Few-Shot and Source-Free Medical Image Segmentation

【速读】:该论文旨在解决生成式分割模型(如SAM2)在医学图像领域迁移能力不足的问题,即其在自然图像上训练后,在医学数据上的零样本性能较差,导致实际应用中需大量微调和专家设计的提示(prompt)。解决方案的关键在于提出DiffuSAM,一种基于扩散模型的SAM2适配框架,通过轻量级扩散先验(diffusion prior)从预训练的SAM2图像特征中合成与SAM2兼容的掩码嵌入(mask-like embeddings),并将其注入SAM2的掩码解码器以实现无需用户提示的准确分割。此外,扩散先验进一步利用已分割切片进行条件控制,确保三维医学图像序列中的空间一致性,从而在无监督域自适应(SF-UDA)和少样本设置下实现高效且鲁棒的医学图像分割。

链接: https://arxiv.org/abs/2604.24719
作者: Tal Grossman,Noa Cahan,Lev Ayzenberg,Hayit Greenspan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Segmentation models such as Segment Anything Model (SAM) and SAM2 achieve strong prompt-driven zero-shot performance. However, their training on natural images limits domain transfer to medical data. Consequently, accurate segmentation typically requires extensive fine-tuning and expert-designed prompts. We propose DiffuSAM, a diffusion-based adaptation of SAM2 for prompt-free medical image segmentation. Our framework synthesizes SAM2-compatible segmentation mask-like embeddings via a lightweight diffusion-prior from off-the-shelf frozen SAM2 image features. The generated embeddings are integrated into SAM2’s mask decoder to produce accurate segmentations, thereby eliminating the need for user prompts. The diffusion prior is further conditioned on previously segmented slices, enforcing spatial consistency across volumes. Evaluated on the BTCV and CHAOS datasets for CT and MRI under Source-Free Unsupervised Domain Adaptation (SF-UDA) and Few-Shot settings, DiffuSAM achieves competitive performance with efficient training and inference. Code is available upon request from the corresponding author.

[CV-4] WildLIFT: Lifting monocular drone video to 3D for species-agnostic wildlife monitoring

【速读】:该论文旨在解决无人机单目RGB相机拍摄的野生动物视频中几何信息未被充分利用的问题,现有分析流程多局限于二维图像空间,难以支持对动物行为和种群动态的精细化量化研究。其解决方案的关键在于提出WildLIFT计算框架,该框架通过融合单目无人机视频中的三维场景几何信息与开放词汇的二维实例分割(open-vocabulary 2D instance segmentation),实现无物种依赖的三维检测与跟踪;同时利用带语义面信息的定向3D边界框标签,量化视角覆盖范围与个体间遮挡关系,生成结构化元数据以支撑下游生态学分析。

链接: https://arxiv.org/abs/2604.24718
作者: Vandita Shukla,Fabio Remondino,Blair Costelloe,Benjamin Risse
机构: FBK (Fondazione Bruno Kessler); Max Planck Institute for Biological Cybernetics (马克斯·普朗克生物控制论研究所); University of Münster (明斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monocular RGB cameras mounted on drones are widely used for wildlife monitoring, yet most analytical pipelines remain confined to two-dimensional image space, leaving geometric information in video underexploited. We present WildLIFT, a computational framework that integrates three-dimensional scene geometry from monocular drone video with open-vocabulary 2D instance segmentation to enable species-agnostic 3D detection and tracking. Oriented 3D bounding box labels with semantic face information enable quantitative assessment of viewpoint coverage and inter-animal occlusion, producing structured metadata for downstream ecological analyses. We validate the framework on 2,581 manually curated frames comprising over 6,700 3D detections across four large mammal species. WildLIFT maintains high identity consistency in multi-animal scenes and substantially reduces manual 3D annotation effort through keyframe-based refinement. By transforming standard drone footage into structured 3D and viewpoint-aware representations, WildLIFT extends the analytical utility of aerial wildlife datasets for behavioural research and population monitoring.

[CV-5] NeuroClaw Technical Report

【速读】:该论文旨在解决神经影像学研究中因多模态数据(如结构磁共振成像sMRI、功能磁共振成像fMRI、扩散磁共振成像dMRI和脑电图EEG)、复杂多阶段处理流程以及持续存在的可复现性风险所带来的挑战。其解决方案的关键在于提出NeuroClaw——一个面向神经影像领域的多智能体研究助手,它能够直接在原始数据上运行,基于BIDS元数据和数据集语义进行决策,无需用户预处理输入或编写定制模型代码;同时通过容器化环境管理(如固定Python环境、Docker支持、GPU配置)与执行过程控制(检查点、后执行验证、结构化审计日志)实现工具链的透明化和可复现性提升,并采用三层技能/代理层级架构将复杂工作流分解为安全且可重用的单元,从而显著增强科研流程的自动化水平与可靠性。

链接: https://arxiv.org/abs/2604.24696
作者: Cheng Wang,Zhibin He,Zhihao Peng,Shengyuan Liu,Yufan Hu,Lichao Sun,Xiang Li,Yixuan Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Agentic artificial intelligence systems promise to accelerate scientific workflows, but neuroimaging poses unique challenges: heterogeneous modalities (sMRI, fMRI, dMRI, EEG), long multi-stage pipelines, and persistent reproducibility risks. To address this gap, we present NeuroClaw, a domain-specialized multi-agent research assistant for executable and reproducible neuroimaging research. NeuroClaw operates directly on raw neuroimaging data across formats and modalities, grounding decisions in dataset semantics and BIDS metadata so users need not prepare curated inputs or bespoke model code. The platform combines harness engineering with end-to-end environment management, including pinned Python environments, Docker support, automated installers for common neuroimaging tools, and GPU configuration. In practice, this layer emphasizes checkpointing, post-execution verification, structured audit traces, and controlled runtime setup, making toolchains more transparent while improving reproducibility and auditability. A three-tier skill/agent hierarchy separates user-facing interaction, high-level orchestration, and low-level tool skills to decompose complex workflows into safe, reusable units. Alongside the NeuroClaw framework, we introduce NeuroBench, a system-level benchmark for executability, artifact validity, and reproducibility readiness. Across multiple multimodal LLMs, NeuroClaw-enabled runs yield consistent and substantial score improvements compared with direct agent invocation. Project homepage: this https URL

[CV-6] Aycromo: An Open-Source Platform for Automatic Chromosome Detection in Metaphase Images Based on Deep Learning

【速读】:该论文旨在解决遗传疾病诊断中染色体分析(chromosome analysis)依赖人工操作效率低、耗时长且高度依赖专家的问题。传统核型分析(karyotyping)流程通常需数天时间,限制了临床应用的时效性。为应对这一挑战,作者提出Aycromo——一个开源桌面平台,用于AI辅助细胞遗传学分析。其关键在于:基于Electron和ONNX Runtime构建,支持预训练模型加载、架构对比的集成基准模块以及交互式标注修正界面,无需命令行操作即可实现高效自动化检测与人工校正;实验表明,YOLOv11在CRCN-NE数据集上达到99.40% mAP@50,显著提升单张幻灯片分析速度至秒级,具备临床转化潜力。

链接: https://arxiv.org/abs/2604.24685
作者: Jorge L. A. Lima,Filipe R. Cordeiro
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at SBCAS’26

点击查看摘要

Abstract:Chromosome analysis is a fundamental step in the diagnosis of genetic diseases, but the manual karyotyping workflow is time-consuming and heavily dependent on expert specialists, often requiring several days per patient. Although Deep Learning models have achieved high performance in chromosome detection, most proposed solutions remain restricted to research prototypes or lack graphical interfaces suitable for clinical use. In this work, we present Aycromo, an open-source desktop platform for AI-assisted cytogenetic analysis. Built on Electron and ONNX Runtime, the tool allows cytogeneticists to load pre-trained models, compare architectures through an integrated benchmarking module, and manually correct detections via an interactive annotation interface, all without command-line interaction. Preliminary experiments on metaphase images from the CRCN-NE dataset demonstrate that YOLOv11 achieves 99.40% mAP@50, while the platform reduces per-slide analysis to seconds

[CV-7] Benchmarking Pathology Foundation Models for Breast Cancer Survival Prediction

【速读】:该论文旨在解决病理基础模型(Pathology Foundation Models, PFMs)在乳腺癌生存预测任务中跨数据集泛化能力不足的问题,尤其缺乏在外部验证条件下的系统性比较。其解决方案的关键在于构建一个标准化的patch级特征提取与统一生存建模框架,对多种主流及新兴PFMs进行大规模、多中心的外部验证评估,涵盖超过5,400名患者的三个独立临床队列。实验表明,第二代PFMs整体优于第一代,且通过知识蒸馏得到的轻量模型H0-mini在性能上略胜于其教师模型H-optimus-0,证明了高效部署的可能性。这一结果为PFMs在临床路径中的实际应用提供了可信赖的基准和优化方向。

链接: https://arxiv.org/abs/2604.24679
作者: Fredrik K. Gustafsson,Constance Boissin,Johan Vallon-Christersson,David A. Clifton,Mattias Rantalainen
机构: Karolinska Institutet (卡罗林斯卡学院); University of Oxford (牛津大学); Lund University (隆德大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Pathology foundation models (PFMs) have recently emerged as powerful pretrained encoders for computational pathology, enabling transfer learning across a wide range of downstream tasks. However, systematic comparisons of these models for clinically meaningful prediction problems remain limited, especially in the context of survival prediction under external validation. In this study, we benchmark widely used and recently proposed PFMs for breast cancer survival prediction from whole-slide histopathology images. Using a standardized pipeline based on patch-level feature extraction and a unified survival modeling framework, we evaluate model representations across three independent clinical cohorts comprising more than 5,400 patients with long-term follow-up. Models are trained on one cohort and evaluated on two independent external cohorts, enabling a rigorous assessment of cross-dataset generalization. Overall, H-optimus-1 achieves the strongest survival prediction performance. More broadly, we observe consistent generational improvements across model families, with second-generation PFMs outperforming their first-generation counterparts. However, absolute performance differences between many recent PFMs remain modest, suggesting diminishing returns from further scaling of pretraining data or model size alone. Notably, the compact distilled model H0-mini slightly outperforms its larger teacher model H-optimus-0, despite using fewer than 8% of the parameters and enabling significantly faster feature extraction. Together, these results provide the first large-scale, externally validated benchmark of PFMs for breast cancer survival prediction, and offer practical guidance for efficient deployment of PFMs in clinical workflows.

[CV-8] Probing CLIPs Comprehension of 360-Degree Textual and Visual Semantics

【速读】:该论文旨在解决当前生成式AI模型在生成360度全景图像时,缺乏可靠评估手段的问题,特别是针对对比语言-图像预训练(CLIP)模型在理解360度全景图像与文本对之间的语义一致性方面的局限性。研究发现,CLIP能够有效利用显式的文本标识符来理解360度文本语义(360-degree textual semantics),但对水平圆周平移(horizontal circular shifts)下的语义不变性(即360度视觉语义,360-degree visual semantics)理解不足,导致其在评估全景图像语义一致性时表现不稳定。为解决这一问题,论文提出一种基于LoRA(Low-Rank Adaptation)的微调框架,通过显式引入圆周平移不变性约束,增强CLIP模型对360度视觉语义的理解能力;关键创新在于以最小的参数调整实现语义不变性的学习,尽管带来原始语义评估性能的轻微下降,验证了适应360度全景图像时存在语义对齐与不变性之间的根本权衡。

链接: https://arxiv.org/abs/2604.24642
作者: Hai Wang,Xiaochen Yang,Mingzhi Dong,Jing-Hao Xue
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:The dream of instantly creating rich 360-degree panoramic worlds from text is rapidly becoming a reality, yet a crucial gap exists in our ability to reliably evaluate their semantic alignment. Contrastive Language-Image Pre-training (CLIP) models, standard AI evaluators, predominantly trained on perspective image-text pairs, face an open question regarding their understanding of the unique characteristics of 360-degree panoramic image-text pairs. This paper addresses this gap by first introducing two concepts: \emph360-degree textual semantics, semantic information conveyed by explicit format identifiers, and \emph360-degree visual semantics, invariant semantics under horizontal circular shifts. To probe CLIP’s comprehension of these semantics, we then propose novel evaluation methodologies using keyword manipulation and horizontal circular shifts of varying magnitudes. Rigorous statistical analyses across popular CLIP configurations reveal that: (1) CLIP models effectively leverage explicit textual identifiers, demonstrating an understanding of 360-degree textual semantics; and (2) CLIP models fail to robustly preserve semantic alignment under horizontal circular shifts, indicating limited comprehension of 360-degree visual semantics. To address this limitation, we propose a LoRA-based fine-tuning framework that explicitly instills invariance to circular shifts. Our fine-tuned models exhibit improved comprehension of 360-degree visual semantics, though with a slight degradation in original semantic evaluation performance, highlighting a fundamental trade-off in adapting CLIP to 360-degree panoramic images. Code is available at this https URL.

[CV-9] Meta-CoT: Enhancing Granularity and Generalization in Image Editing CVPR2026

【速读】:该论文旨在解决统一多模态理解/生成模型在图像编辑任务中,如何通过思维链(Chain-of-Thought, CoT)结构与训练策略的协同优化,同时提升理解粒度(understanding granularity)和泛化能力(generalization)的问题。其核心解决方案是提出Meta-CoT范式,该范式包含两个关键设计:一是可分解性(Decomposability),将任意单图编辑操作建模为三元组(任务、目标、所需理解能力),并通过任务与目标的解耦生成特定于任务的CoT,从而增强模型对编辑操作的理解粒度并指导训练;二是泛化性(Generalizability),进一步将编辑任务分解为五个基础元任务(meta-tasks),结合三元组中的其他元素进行联合训练,即可在少量元任务数据下实现对未见编辑任务的有效泛化。此外,引入CoT-Editing一致性奖励机制以强化CoT推理信息在编辑过程中的准确利用,实验表明该方法在21个编辑任务上平均提升15.8%,且具备强泛化性能。

链接: https://arxiv.org/abs/2604.24625
作者: Shiyi Zhang,Yiji Cheng,Tiankai Hang,Zijin Yin,Runze He,Yu Xu,Wenxun Dai,Yunlong Lin,Chunyu Wang,Qinglin Lu,Yansong Tang
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Hunyuan, Tencent (腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Accepted by CVPR2026, Project Page: this https URL

点击查看摘要

Abstract:Unified multi-modal understanding/generative models have shown improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: what forms of CoT and training strategy can jointly enhance both the understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation with two key properties: (1) Decomposability. We observe that any editing intention can be represented as a triplet - (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing editing operations on all targets. This decomposition enhances the model’s understanding granularity of editing operations and guides it to learn each element of the triplet during training, substantially improving the editing capability. (2) Generalizability. In the second decomposition level, we further break down editing tasks into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks. To further align the model’s editing behavior with its CoT reasoning, we introduce the CoT-Editing Consistency Reward, which encourages more accurate and effective utilization of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8% improvement across 21 editing tasks, and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model are released at this https URL

[CV-10] CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

【速读】:该论文旨在解决基于流的视觉-语言-动作(Vision-Language-Action, VLA)策略在实时约束下效率与性能权衡不佳的问题。具体而言,传统方法需通过多步推理从无信息的高斯噪声中恢复动作结构,导致计算延迟高、实时性差。其解决方案的关键在于重新设计生成式动作建模的起始点机制:提出一种粗粒度到细粒度(Coarse-to-Fine, CF)的两阶段范式——第一阶段学习条件后验分布以生成结构化的动作初始点(即“粗初始化”),第二阶段进行单步局部修正(即“细精调”),从而显著减少函数评估次数(NFE)。该方法通过分阶段训练策略稳定优化过程,并在CALVIN和LIBERO基准上实现低NFE下的最优性能,验证了结构化、分层生成路径对高效且高性能动作决策的有效性。

链接: https://arxiv.org/abs/2604.24622
作者: Fan Du,Feng Yan,Jianxiong Wu,Xinrun Xu,Weiye Zhang,Weinong Wang,Yu Guo,Bin Qian,Zhihai He
机构: Southern University of Science and Technology (南方科技大学); Xi’an Jiaotong University (西安交通大学); United Nova Technology (联合诺瓦科技); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade-off under real-time constraints. We address this issue by rethinking the role of the starting point in generative action modeling. Instead of shortening the sampling trajectory, we propose CF-VLA, a coarse-to-fine two-stage formulation that restructures action generation into a coarse initialization step that constructs an action-aware starting point, followed by a single-step local refinement that corrects residual errors. Concretely, the coarse stage learns a conditional posterior over endpoint velocity to transform Gaussian noise into a structured initialization, while the fine stage performs a fixed-time refinement from this initialization. To stabilize training, we introduce a stepwise strategy that first learns a controlled coarse predictor and then performs joint optimization. Experiments on CALVIN and LIBERO show that our method establishes a strong efficiency-performance frontier under low-NFE (Number of Function Evaluations) regimes: it consistently outperforms existing NFE=2 methods, matches or surpasses the NFE=10 \pi_0.5 baseline on several metrics, reduces action sampling latency by 75.4%, and achieves the best average real-robot success rate of 83.0%, outperforming MIP by 19.5 points and \pi_0.5 by 4.0 points. These results suggest that structured, coarse-to-fine generation enables both strong performance and efficient inference. Our code is available at this https URL.

[CV-11] Infrastructure-Guided Connectivity-Enhanced Road Crack Detection and Estimation

【速读】:该论文旨在解决道路裂缝检测在实际应用中精度不足与实时性差的问题,特别是在车载场景下如何高效、准确地实现裂缝识别。解决方案的关键在于构建了一种基于基础设施引导的通信增强型检测流水线(infrastructure-guided communication-enhanced road crack detection pipeline),其核心包括:设计专用通信协议实现从基础设施向车辆传输感兴趣区域(region of interest, ROI),结合动态图像裁剪与帧选择技术优化输入图像质量,以及利用先进裂缝检测模型骨干网络和面向前方视角的高质量标注数据集进行训练,从而显著提升检测性能并确保在乘用车平台上的可部署性。

链接: https://arxiv.org/abs/2604.24616
作者: Haosong Xiao,Yamini Ramesh,Rishabh Shukla,Swarat Sarkar,Chaozhe R. He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted and will be presented at the Fourth IEEE International Conference on Mobility: Operations, Services, and Technologies (MOST) on May 4 - 6, 2026 at Detroit, Michigan

点击查看摘要

Abstract:In this paper, we report the world’s first infrastructure-guided communication-enhanced road crack detection pipeline that is effective and implementable on passenger vehicles. We first design a customized communication protocol to transmit the region of interest from the infrastructure to the vehicle. With proper camera image processing (e.g., dynamic cropping and frame selection), the focused images are provided to the crack detection model. Leveraging state-of-the-art crack detection model backbones and a carefully prepared dataset comprising a forward-facing view with a crack, we train the model to improve crack-detection performance. We demonstrate the full detection pipeline on an experimental vehicle platform, showcase the detection effectiveness, and project future research directions.

[CV-12] Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift

【速读】:该论文旨在解决多模态视觉-语言模型在部署阶段因视觉与文本分支发生不对称分布偏移(asymmetric shift)而导致的性能下降问题,尤其是在基于熵的测试时自适应(test-time adaptation, TTA)方法可能因不可靠模态主导融合而引入错误的问题。解决方案的关键在于提出一种名为MG-MTTA的新方法,其核心思想是从“主序化视角”(majorization view)重新审视多模态后验分布,并将适应过程建模为对融合预测的受约束去混洗(constrained de-mixing)问题;通过保持骨干网络冻结,仅更新轻量级门控机制(gate)或适配器(adapter),并结合融合后验熵最小化与基于锚点模态一致性及跨模态冲突构建的可靠性感知门控先验,从而有效控制模态可靠性而非单纯降低预测熵,显著提升模型在语义保持文本偏移和联合视觉-文本偏移场景下的准确率。

链接: https://arxiv.org/abs/2604.24602
作者: Lixian Chen,Mingxuan Huang,Yanhui Chen,Junyi Lin,Yang Shi
机构: Guangdong University of Technology (广东工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate fusion. We study this failure mode through a majorization view of multimodal posteriors and cast adaptation as a constrained de-mixing problem on the fused prediction. Based on this view, we propose MG-MTTA, which keeps the backbone frozen and updates only a lightweight gate or adapter. The objective combines fused-posterior entropy minimization with a reliability-aware gate prior built from anchor-based modality consistency and cross-modal conflict. Our analysis gives conditions under which entropy reduction preserves the correct ranking and a threshold that characterizes modality-dominance failure. On the ImageNet-based benchmark, MG-MTTA improves top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift, while remaining competitive in the visual-only benchmark. These results show that multimodal test-time adaptation should control modality reliability, not just prediction entropy.

[CV-13] Point-MF: One-step Point Cloud Generation from a Single Image via Mean Flows

【速读】:该论文旨在解决单图像点云重建中生成式模型(如基于扩散模型的重构器)因需多次去噪迭代而导致推理速度慢、计算成本高的问题。其解决方案的关键在于提出Point-MF框架,该框架通过在点云空间中直接学习均值流(Mean-Flow)的平均速度场,实现仅需一次网络函数评估(1-NFE)即可完成重建,无需依赖变分自编码器(VAE)的潜在表示;同时引入针对大步长跳跃优化的扩散Transformer结构,结合冻结的DINOv3图像特征与轻量级token适配器,并引入显式的时间/间隔条件控制;此外,设计了“去噪空间锚定”(Denoised Space Anchor)辅助损失函数,用于稳定大步长生成过程并减少异常点和密度伪影,从而在保持高重建质量的同时实现毫秒级延迟。

链接: https://arxiv.org/abs/2604.24586
作者: Yuta Baba,Keiji Yanai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 14 figures

点击查看摘要

Abstract:Single-image point cloud reconstruction must infer complete 3D geometry, including occluded parts, from a single RGB image. While diffusion-based reconstructors achieve high accuracy, they typically require many denoising iterations, resulting in slow and expensive inference. We propose Point-MF, a Mean-Flow-based framework for low-NFE single-image point cloud reconstruction that couples a Mean-Flow-compatible architecture with an auxiliary loss. Specifically, Point-MF operates directly in point-cloud space to learn the mean velocity field and enables one-step reconstruction with a single network function evaluation (1-NFE), without relying on VAE-based latent representations. To make Mean Flow effective under large interval jumps, Point-MF employs a Diffusion Transformer tailored to the Mean-Flow setting, conditioned on frozen DINOv3 image features via a lightweight token adapter and equipped with explicit interval/time conditioning. Moreover, we introduce Denoised Space Anchor, a set-distance auxiliary loss on the denoised-space estimate x_\theta induced by the predicted velocity field, to stabilize large-step generation and reduce outliers and density artifacts. On ShapeNet-R2N2 and Pix3D, Point-MF strikes a strong balance between reconstruction quality and inference speed compared to multi-step diffusion baselines and competitive feedforward models, while generating high-quality point clouds with millisecond-level latency.

[CV-14] Improving Vision-language Models with Perception-centric Process Reward Models

【速读】:该论文旨在解决基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)在视觉语言模型(Vision-Language Models, VLMs)中因仅采用结果层面监督而导致无法精确诊断和纠正推理链内错误的问题。其解决方案的关键在于提出Perceval——一种过程奖励模型(Process Reward Model, PRM),能够实现token级错误定位:通过提取响应中的图像相关陈述,并逐一对比图像中的视觉证据,从而识别出包含感知错误的声明。Perceval利用感知密集型监督数据进行训练,并将该模型嵌入强化学习训练流程中,以token级优势取代传统GRPO的序列级优势,对Perceval识别出的幻觉片段施加惩罚,从而提供细粒度的监督信号。此外,Perceval还可用于推理阶段的测试时扩展(test-time scaling),通过截断错误部分并引导模型重生成或反思,实现多轮迭代优化,显著提升多个领域基准上的性能表现。

链接: https://arxiv.org/abs/2604.24583
作者: Yingqian Min,Kun Zhou,Yifan Li,Yuhuan Wu,Han Peng,Yifan Du,Wayne Xin Zhao,Min Yang,Ji-Rong Wen
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Bytedance (字节跳动); University of California, San Diego (加州大学圣地亚哥分校); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages

点击查看摘要

Abstract:Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding, which can extract image-related claims from the response and compare them one by one with the visual evidence in the image, ultimately returning claims that contain perceptual errors. Perceval is trained with perception-intensive supervised training data. We then integrate Perceval into the RL training process to train the policy models. Specifically, compared to traditional GRPO, which applies sequence-level advantages, we apply token-level advantages by targeting penalties on hallucinated spans identified by Perceval, thus enabling fine-grained supervision signals. In addition to augmenting the training process, Perceval can also assist VLMs during the inference stage. Using Perceval, we can truncate the erroneous portions of the model’s response, and then either have the model regenerate the response directly or induce the model to reflect on its previous output. This process can be repeated multiple times to achieve test-time scaling. Experiments show significant improvements on benchmarks from various domains across multiple reasoning VLMs trained with RL, highlighting the promise of perception-centric supervision as a general-purpose strategy. For test-time scaling, it also demonstrates consistent performance gains over other strategies, such as major voting. Our code and data will be publicly released at this https URL.

[CV-15] Diffusion Model as a Generalist Segmentation Learner

【速读】:该论文旨在解决现有分割模型在语义分割和开放词汇分割任务中缺乏通用性与跨域适应能力的问题,尤其是如何利用预训练扩散模型(diffusion models)中蕴含的丰富空间对齐视觉先验来构建一个统一且可泛化的分割框架。其解决方案的关键在于提出DiGSeg(Diffusion Models as a Generalist Segmentation Learner),通过将输入图像与真实标签掩码共同编码至潜在空间并作为条件信号注入扩散U-Net,同时引入并行的CLIP对齐文本路径,在多尺度上注入语言特征,使模型能够根据任意文本提示生成结构化的分割掩码。这一设计使现成的扩散骨干网络从纯生成器转变为通用分割学习者,显著提升了标准语义分割基准上的性能,并展现出强大的开放词汇泛化能力和跨领域迁移能力(如医学、遥感和农业场景),无需针对特定领域进行架构定制。

链接: https://arxiv.org/abs/2604.24575
作者: Haoxiao Wang,Antao Xiang,Haiyang Sun,Peilin Sun,Changhao Pan,Yifu Chen,Minjie Hong,Weijie Wang,Shuang Chen,Yue Chen,Zhou Zhao
机构: 1. 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models are primarily trained for image synthesis, yet their denoising trajectories encode rich, spatially aligned visual priors. In this paper, we demonstrate that these priors can be utilized for text-conditioned semantic and open-vocabulary segmentation, and this approach can be generalized to various downstream tasks to make a general-purpose diffusion segmentation framework. Concretely, we introduce DiGSeg (Diffusion Models as a Generalist Segmentation Learner), which repurposes a pretrained diffusion model into a unified segmentation framework. Our approach encodes the input image and ground-truth mask into the latent space and concatenates them as conditioning signals for the diffusion U-Net. A parallel CLIP-aligned text pathway injects language features across multiple scales, enabling the model to align textual queries with evolving visual representations. This design transforms an off-the-shelf diffusion backbone into a universal interface that produces structured segmentation masks conditioned on both appearance and arbitrary text prompts. Extensive experiments demonstrate state-of-the-art performance on standard semantic segmentation benchmarks, as well as strong open-vocabulary generalization and cross-domain transfer to medical, remote sensing, and agricultural scenarios-without domain-specific architectural customization. These results indicate that modern diffusion backbones can serve as generalist segmentation learners rather than pure generators, narrowing the gap between visual generation and visual understanding.

[CV-16] RACANet: Reliability-Aware Crowd Anchor Network for RGB-T Crowd Counting

【速读】:该论文旨在解决RGB-Thermal(RGB-T)人群计数中跨模态融合的准确性与可解释性不足的问题,尤其针对现有方法依赖隐式融合策略、缺乏对局部空间差异的显式建模以及在位置级别上对模态可靠性的细粒度刻画这一局限。其解决方案的关键在于提出一个两阶段融合框架RACANet(Reliability-Aware Crowd Anchor Network),第一阶段通过轻量级跨模态对齐预训练显式学习语义对应关系,利用人群先验监督和局部双向软匹配机制;第二阶段引入局部锚点融合模块(Local Anchor Fusion Module, LAFM),基于预训练获得的先验生成局部语义锚点,并通过局部注意力机制实现像素级特征自适应重分配,同时设计了一种差异感知的一致性约束以动态协调模态一致区域的可靠性,从而提升融合过程的精度与可解释性。

链接: https://arxiv.org/abs/2604.24543
作者: Jinghao Shi,Mengqi Lei,Kunliang He,Yun Li,Wei Bao,Siqi Li
机构: China University of Geosciences, Wuhan(中国地质大学(武汉)); Tsinghua University(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:RGB-Thermal (T) crowd counting aims to integrate visible-spectrum and thermal infrared information to improve the robustness of crowd density estimation in complex scenes. Although existing studies generally improve counting accuracy through cross-modal feature fusion, most current methods rely on implicit cross-modal fusion strategies and lack explicit modeling of local spatial discrepancies as well as fine-grained characterization of modality reliability at the positional level, thereby limiting the accuracy and interpretability of the fusion process. To address these issues, this paper proposes a two-stage fusion framework, RACANet, a Reliability-Aware Crowd Anchor Network for RGB-T crowd counting. First, we introduce a lightweight cross-modal alignment pretraining stage, which explicitly learns cross-modal semantic correspondences through crowd-prior supervision and local bidirectional soft matching. Then, based on the priors learned during pretraining, a Local Anchor Fusion Module (LAFM) is introduced in the formal training stage. This module generates local semantic anchors by aggregating features from highly reliable regions and further enables adaptive pixel-level feature redistribution with a local attention mechanism. In addition, we propose a discrepancy-aware consistency constraint to dynamically coordinate the reliability of regions where modal representations are consistent. Experiments conducted on two widely used benchmark datasets, RGBT-CC and Drone-RGBT, demonstrate that RACANet outperforms existing methods. The anonymous code is available at this https URL.

[CV-17] Point Cloud Registration for Fusion between SPECT MPI and CTA Images

【速读】:该论文旨在解决单光子发射计算机断层成像心肌灌注显像(SPECT MPI)与冠状动脉计算机断层血管造影(CTA)在临床融合中因跨模态配准偏差和依赖人工标记点而导致的缺血定位不准确及病变水平功能评估困难的问题。解决方案的关键在于提出了一种集成功能与结构信息的注册与融合框架:首先通过U-Net实现双模态分割,自动提取左心室(LV)和双心室结构并基于其空间关系生成解剖标志点;随后采用尺度空间一致性预处理和基于标志点的粗配准缓解初始错位;在此基础上,系统比较了包括ICP、SICP、CPD、CluReg、FFD和BCPD-plus-plus在内的多种精细配准方法对LV心外膜表面点云进行优化,并将最优变换传播至体素级重采样以实现高精度SPECT-CTA融合。实验表明,BCPD-plus-plus方法达到最优精度(平均点云距离1.7 mm),且整个流程不依赖特定精细配准算法,具备良好的实用性与鲁棒性。

链接: https://arxiv.org/abs/2604.24524
作者: Ni Yao,Xiangyu Liu,Shaojie Tang,Danyang Sun,Chuang Han,Yanting Li,Jiaofen Nan,Chengyang Li,Fubao Zhu,Chen Zhao,Zhihui Xu,Weihua Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Clinical fusion of Single Photon Emission Computed Tomography Myocardial Perfusion Imaging (SPECT MPI) and Computed Tomography Angiography (CTA) remains limited by cross-modality misregistration and reliance on manual landmarks, which can hinder accurate ischemia localization and lesion-level functional assessment. To address this issue, we propose a registration and fusion framework for SPECT MPI and CTA that integrates functional and structural information for comprehensive cardiac evaluation. The proposed pipeline performs U-Net-based segmentation on both modalities. On SPECT MPI, only the left ventricle (LV) is extracted, and anatomical landmarks are automatically derived from characteristic LV structures. On CTA, both ventricles are segmented, and their spatial relationship is used to automatically define landmarks at the interventricular septal junction. Scale-space consistency preprocessing and landmark-driven coarse registration are applied to mitigate initial misalignment. Based on this initialization, multiple fine registration methods are evaluated on LV epicardial surface point clouds, including ICP, SICP, CPD, CluReg, FFD, and BCPD-plus-plus. The resulting transformations are then propagated to voxel-level resampling for high-precision SPECT-CTA fusion. In a retrospective cohort of 60 patients, the proposed framework preserved sub-millimeter coronary detail from CTA while accurately overlaying quantitative SPECT perfusion. Among the evaluated methods, BCPD-plus-plus achieved the highest accuracy with a mean point cloud distance of 1.7 mm. By combining robust initialization, comparative fine registration, and voxel-level fusion, the proposed approach provides a practical solution for myocardial ischemia localization and functional evaluation of coronary lesions, while remaining independent of any specific fine registration algorithm.

[CV-18] Self-Supervised Representation Learning via Hyperspherical Density Shaping

【速读】:该论文旨在解决当前自监督表示学习方法普遍依赖经验性启发式策略、缺乏理论基础的问题。其解决方案的关键在于提出HyDeS方法,该方法基于多视图互信息最大化,在超球面空间中利用香农微分熵与非参数冯·米塞斯-费舍尔(von Mises-Fisher)密度估计器构建理论可解释的学习框架,从而引导模型聚焦于图像的前景特征,并在分割任务(如VOC PASCAL)中表现优异。

链接: https://arxiv.org/abs/2604.24498
作者: Esteban Rodríguez-Betancourt,Edgar Casasola-Murillo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 8 figures, 4 tables

点击查看摘要

Abstract:Modern self-supervised representation learning methods often relies on empirical heuristics that are not theoretically grounded. In this study we propose HyDeS, a theoretically grounded method based on multi-view mutual information maximization within an hyperspherical space using Shannon differential entropy with a non-parametric von Mises-Fisher density estimator. We show that HyDeS bias the trained model towards focusing on foreground features of the images and perform well on segmentation tasks such as VOC PASCAL, while it lags in fine-grained classification. We provide a detailed analysis of the induced latent space geometry and learning dynamics, that can be used for designing other theoretically grounded self-supervised learning methods. Comments: 8 pages, 8 figures, 4 tables Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2604.24498 [cs.CV] (or arXiv:2604.24498v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.24498 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-19] CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping

【速读】:该论文旨在解决生成式 AI(Generative AI)中面部交换(face swapping)任务里身份保留与视觉真实感难以平衡的问题,尤其是传统基于生成对抗网络(GAN-based)方法因可控性有限和模式崩溃(mode collapse)导致的性能瓶颈。其解决方案的关键在于提出一种基于扩散模型(diffusion-based)的新框架 CA-IDD(Cross-Attention Guided Identity-Conditional Diffusion),通过多尺度交叉注意力机制融合 gaze、身份嵌入(identity embedding)和面部解析(facial parsing)等多模态引导信号,在去噪过程中引入分层注意力结构以实现精准且一致的身份迁移;同时结合专家指导监督策略提升语义一致性与视觉质量,从而在姿态和表情变化下实现细粒度区域控制,并显著优于现有基线方法(如 FaceShifter 和 MegaFS)。

链接: https://arxiv.org/abs/2604.24493
作者: Md Shohel Rana,Tanoy Debnath
机构: Georgia Southern University (乔治亚南方大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Face swapping aims to optimize realistic facial image generation by leveraging the identity of a source face onto a target face while preserving pose, expression, and context. However, existing methods, especially GAN-based methods, often struggle to balance identity preservation and visual realism due to limited controllability and mode collapse. In this paper, we introduce CA-IDD (Cross-Attention Guided Identity-Conditional Diffusion), the first diffusion-based face swapping approach that integrates multi-modal guidance comprising gaze, identity, and facial parsing through multi-scale cross-attention. Precomputed identity embeddings are incorporated into the denoising process via hierarchical attention layers, resulting in accurate and consistent identity transfer. To improve semantic coherence and visual quality, we use expert-guided supervision, with facial parsing and gaze-consistency modules. Unlike GAN-based or implicit-fusion methods, our diffusion framework provides stable training, robust generalization, and spatially adaptive identity alignment, allowing for fine-grained regional control across pose and expression variations. CA-IDD achieves an FID of 11.73, exceeding established baselines such as FaceShifter and MegaFS. Qualitative results also reveal improved identity retention across diverse poses, establishing CA-IDD as a strong foundation for future diffusion-based face editing.

[CV-20] Deployment-Aligned Low-Precision Neural Architecture Search for Spaceborne Edge AI

【速读】:该论文旨在解决硬件感知神经架构搜索(Hardware-Aware Neural Architecture Search, NAS)中因优化阶段与部署阶段精度不一致导致的性能下降问题。现有方法通常在全精度假设下进行架构搜索,仅在搜索完成后对模型进行低精度转换,这种“后处理”策略使得优化时的行为与实际低精度硬件部署时的表现存在显著偏差,从而严重损害模型准确性。解决方案的关键在于将部署对齐的低精度训练(Deployment-Aligned Low-Precision Training)直接整合进硬件感知NAS流程中,使候选架构在微调和评估阶段即受到FP16数值约束,从而实现架构效率与数值鲁棒性的联合优化,无需修改搜索空间或进化策略。实验表明,该方法可在不增加模型复杂度的前提下,显著提升边缘设备上的推理准确率,例如在Intel Movidius Myriad X VPU上将mIoU从0.78恢复至0.826,有效缓解了部署诱导的精度损失。

链接: https://arxiv.org/abs/2604.24492
作者: Parampuneet Kaur Thind,Vaibhav Katturu,Giacomo Zema,Roberto Del Prete
机构: Sapienza University of Rome; Φ-lab, ESA ESRIN; University of Pisa; Argotec Srl
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Designing deep networks that meet strict latency and accuracy constraints on edge accelerators increasingly relies on hardware-aware optimization, including neural architecture search (NAS) guided by device-level metrics. Yet most hardware-aware NAS pipelines still optimize architectures under full-precision assumptions and apply low-precision adaptation only after the search, leading to a mismatch between optimization-time behavior and deployment-time execution on low-precision hardware that can substantially degrade accuracy. We address this limitation by integrating deployment-aligned low-precision training directly into hardware-aware NAS. Candidate architectures are exposed to FP16 numerical constraints during fine-tuning and evaluation, enabling joint optimization of architectural efficiency and numerical robustness without modifying the search space or evolutionary strategy. We evaluate the proposed framework on vessel segmentation for spaceborne maritime monitoring, targeting the Intel Movidius Myriad X Visual Processing Unit (VPU). While post-training precision conversion reduces on-device performance from 0.85 to 0.78 mIoU, deployment-aligned low-precision training achieves 0.826 mIoU on-device for the same architecture (95,791 parameters), recovering approximately two-thirds of deployment-induced accuracy gap without increasing model complexity. These results demonstrate that incorporating deployment-consistent numerical constraints into hardware-aware NAS substantially improves robustness and alignment between optimization and deployment for resource-constrained edge Artificial Intelligence (AI).

[CV-21] Zero-to-CAD: Agent ic Synthesis of Interpretable CAD Programs at Million-Scale Without Real Data

【速读】:该论文旨在解决当前大规模3D数据集普遍缺乏参数化设计意图信息的问题,即现有数据多以边界表示(B-Reps)或网格形式存在,而忽略了CAD模型的构造历史(construction history),这限制了生成式AI在CAD领域的可编辑性和语义理解能力。解决方案的关键在于提出Zero-to-CAD框架,通过将大语言模型(LLM)嵌入到反馈驱动的CAD环境中,形成一种代理式搜索(agentic search)机制:系统迭代地生成、执行并验证CAD代码,结合工具调用与文档检索以确保几何有效性与操作多样性,从而合成约一百万条可执行、可读、可编辑的CAD构造序列,显著提升了CAD生成任务中对参数化语义的理解与建模能力。

链接: https://arxiv.org/abs/2604.24479
作者: Mohammadmehdi Ataei,Farzaneh Askari,Kamal Rahimi Malekshan,Pradeep Kumar Jayaraman
机构: Autodesk Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Computer-Aided Design (CAD) models are defined by their construction history: a parametric recipe that encodes design intent. However, existing large-scale 3D datasets predominantly consist of boundary representations (B-Reps) or meshes, stripping away this critical procedural information. To address this scarcity, we introduce Zero-to-CAD, a scalable framework for synthesizing executable CAD construction sequences. We frame synthesis as an agentic search problem: by embedding a large language model (LLM) within a feedback-driven CAD environment, our system iteratively generates, executes, and validates code using tools and documentation lookup to promote geometric validity and operation diversity. This agentic approach enables the synthesis of approximately one million executable, readable, editable CAD sequences, covering a rich vocabulary of operations beyond sketch-and-extrude workflows. We also release a curated subset of 100,000 high-quality models selected for geometric diversity. To demonstrate the dataset’s utility, we fine-tune a vision-language model on our synthetic data to reconstruct editable CAD programs from multi-view images, outperforming strong baselines, including GPT-5.2, and effectively bootstrapping sequence generation capabilities without real construction-history training data. Zero-to-CAD bridges the gap between geometric scale and parametric interpretability, offering a vital resource for the next generation of CAD AI.

[CV-22] xtGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering AAAI

【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成模型在渲染提示词中指定文本时,难以准确保持文本空间布局的问题,尤其是在多段落、结构化场景下。这一挑战源于缺乏标注精确文本及其对应空间位置的训练数据集,以及缺少评估布局质量的有效指标。解决方案的关键在于提出 TextGround4M 数据集——一个包含超过 400 万组提示-图像对的大规模数据集,每条样本均标注了逐段文本及其对应的边界框(bounding boxes),从而实现细粒度的布局感知监督;同时引入一种轻量级训练策略,在不改变模型架构和推理行为的前提下,通过在训练过程中追加布局感知的文本标记(span tokens)来提升模型对提示词空间分布的理解能力。

链接: https://arxiv.org/abs/2604.24459
作者: Dongxing Mao,Yilin Wang,Linjie Li,Zhengyuan Yang,Alex Jinpeng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: aaai poster; Project page: this https URL

点击查看摘要

Abstract:Despite recent advances in text-to-image generation, models still struggle to accurately render prompt-specified text with correct spatial layout – especially in multi-span, structured settings. This challenge is driven not only by the lack of datasets that align prompts with the exact text and layout expected in the image, but also by the absence of effective metrics for evaluating layout quality. To address these issues, we introduce TextGround4M, a large-scale dataset of over 4 million prompt-image pairs, each annotated with span-level text grounded in the prompt and corresponding bounding boxes. This enables fine-grained supervision for layout-aware, prompt-grounded text rendering. Building on this, we propose a lightweight training strategy for autoregressive T2I models that appends layout-aware span tokens during training, without altering model architecture or inference behavior. We further construct a benchmark with stratified layout complexity to evaluate both open-source and proprietary models in a zero-shot setting. In addition, we introduce two layout-aware metrics to address the long-standing lack of spatial evaluation in text rendering. Our results show that models trained on TextGround4M outperform strong baselines in text fidelity, spatial accuracy, and prompt consistency, highlighting the importance of fine-grained layout supervision for grounded T2I generation.

[CV-23] AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

【速读】:该论文旨在解决当前GUI自主代理(GUI Agent)评估体系的局限性问题,即现有基准测试要么聚焦于黑箱任务完成,要么仅停留在静态、浅层的界面元素定位,无法有效衡量代理对图形用户界面(Graphical User Interface, GUI)隐含功能与交互逻辑的深层理解能力。为填补这一空白,作者提出AutoGUI-v2基准测试,其核心解决方案是构建一个基于视觉语言模型(Vision-Language Model, VLM)与人类协同的递归解析管道,将多平台截图结构化为层次化的功能区域,并据此生成多样化的评估任务。该方法实现了对代理在区域级和元素级语义理解、接地(grounding)以及动态状态预测等方面的系统性测评,从而揭示了当前VLM在功能性理解和复杂交互逻辑推理上的显著短板,为下一代GUI代理的发展提供了可量化、可比较的新评估框架。

链接: https://arxiv.org/abs/2604.24441
作者: Hongxin Li,Xiping Wang,Jingran Su,Zheng Ju,Yuntao Chen,Qing Li,Zhaoxiang Zhang
机构: University of Chinese Academy of Sciences (UCAS); New Laboratory of Pattern Recognition (NLPR), CASIA; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA; Hong Kong Institute of Science Innovation, CASIA; PolyU; Shanghai Artificial Intelligence Laboratory
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report

点击查看摘要

Abstract:Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the “digital world state” resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated (focusing either on black-box task completion or static, shallow grounding), thereby failing to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative pipeline that recursively parses multi-platform screenshots into hierarchical functional regions to generate diverse evaluation tasks. Providing 2,753 tasks across six operating systems, AutoGUI-v2 rigorously tests agents on region and element-level semantics, grounding, and dynamic state prediction. Our evaluation reveals a striking dichotomy in VLMs: while open-source models fine-tuned on agent data (e.g., Qwen3-VL) excel at functional grounding, commercial models (e.g., Gemini-2.5-Pro-Thinking) dominate in functionality captioning. Crucially, all models struggle with complex interaction logic of uncommon actions, highlighting that deep functional understanding remains a significant hurdle. By systematically measuring these foundational capabilities, AutoGUI-v2 offers a new lens for advancing the next generation of GUI agents.

[CV-24] DYMAPIA: A Multi-Domain Framework for Detecting AI-based Video Manipulation

【速读】:该论文旨在解决生成式 AI (Generative AI) 生成的虚假媒体(如 Deepfake)日益增多所带来的内容真实性与数字信任危机问题。其解决方案的关键在于提出一种多域深度伪造检测框架 DYMAPIA,通过融合空间、频域和时域线索来捕捉视觉数据中的细微篡改痕迹;具体而言,该框架构建动态异常掩码,整合傅里叶频谱、局部纹理描述子、边缘不规则性和光流一致性等证据,以高精度定位篡改区域,并利用轻量级分类器 DistXCNet 对这些区域进行快速聚焦分类,从而在多个基准测试中实现超过 99% 的准确率和 F1 分数,同时保持模型紧凑性以支持实时部署。

链接: https://arxiv.org/abs/2604.24426
作者: Md Shohel Rana,Andrew H. Sung
机构: Georgia Southern University (乔治亚南方大学); The University of Southern Mississippi (南密西西比大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:AI-generated media are advancing rapidly, raising pressing concerns for content authenticity and digital trust. We introduce DYMAPIA, a multi-domain Deepfake detection framework that fuses spatial, spectral, and temporal cues to capture subtle traces of manipulation in visual data. The system builds dynamic anomaly masks by combining evidence from Fourier spectra, local texture descriptors, edge irregularities, and optical flow consistency, which highlight tampered regions with fine spatial accuracy. These masks guide DistXCNet, a lightweight classifier distilled from Xception and optimized with depthwise separable convolutions for fast, region-focused classification. This joint design achieves state-of-the-art results, with accuracy and F1-scores exceeding 99% on FF++, Celeb-DF, and VDFD benchmarks, while keeping the model compact enough for real-time use. Beyond outperforming existing full-frame and multidomain detectors, DYMAPIA demonstrates deployment readiness for time-critical forensic tasks, including media verification, misinformation defense, and secure content filtering.

[CV-25] BMD-45: A Large-Scale CCTV Vehicle Detection Dataset for Urban Traffic in Developing Cities CVPR2026

【速读】:该论文旨在解决现有车辆检测模型在复杂、多样化的城市交通场景中泛化能力不足的问题,尤其是针对新兴经济体中密集、无序且多样的交通环境。当前主流基准数据集(如UA-DETRAC和COCO)主要来自同质化、结构化的交通模式,存在明显的区域和视角偏差,导致模型在真实部署环境中性能显著下降。解决方案的关键在于构建一个大规模、高多样性且贴近现实的车辆检测数据集——BMD-45,该数据集包含45K张来自3.6K个实际运行的“平安城市”闭路电视(CCTV)摄像头采集的图像,涵盖14种细粒度车辆类别(包括本地特有车型如三轮摩托车和Tempo Travellers),并覆盖极端视角变化、遮挡和高密度等真实挑战。实验表明,基于BMD-45训练的模型在mAP@0.50:0.95上比在UA-DETRAC上微调提升达2.5倍,凸显了地理多样性数据对提升感知系统鲁棒性的重要性。

链接: https://arxiv.org/abs/2604.24419
作者: Akash Sharma,Chinmay Mhatre,Sankalp Gawali,Ruthvik Bokkasam,Brij Sharma,Vishwajeet Pattanaik,Punit Rathore,Raghu Krishnapuram,Vijay Gopal Kovvali,Anirban Chakraborty,Yogesh Simmhan
机构: Indian Institute of Science (印度科学研究所); Department of Computational and Data Sciences (计算与数据科学系); Robert Bosch Center for Cyber Physical Systems (罗伯特·博世网络物理系统中心); Center of Data for Public Good (公共利益数据中心); Centre for Infrastructure, Sustainable Transportation Urban Planning (基础设施、可持续交通与城市规划中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026 Findings Track. To appear in the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026

点击查看摘要

Abstract:Robust vehicle detection from fixed CCTV cameras is critical for Intelligent Transportation Systems. Yet existing benchmarks predominantly feature relatively homogeneous, highly organized traffic patterns captured from ego-centric driving perspectives or controlled aerial views. This regional and sensor view bias creates a significant gap. Models trained on datasets such as UA-DETRAC and COCO struggle to generalize to the dense, heterogeneous, disorganized traffic conditions observed in rapidly developing urban centers in emerging economies. To address this limitation, we introduce BMD-45, a large-scale dataset comprising 480K bounding boxes annotated over 45K images captured from over 3.6K operational Safe City CCTV cameras. BMD-45 contains 14 fine-grained vehicle categories, including region-specific modes such as auto-rickshaws and tempo travellers, which are not present in existing benchmarks. The dataset captures real-world deployment challenges, including extreme viewpoint variation, occlusion, and vehicle density . We establish comprehensive baselines using state-of-the-art detectors and reveal a striking domain gap: models fine-tuned on UA-DETRAC achieve only 33.6% mAP@0.50:0.95, compared to 83.8% when trained in-domain on BMD-45, representing a 2.5x improvement that persists even when accounting for novel vehicle classes. This performance gap underscores the critical need for geographically diverse traffic benchmarks and establishes BMD-45 as a baseline for developing robust perception systems in underrepresented urban environments worldwide. The dataset is available at: this https URL.

[CV-26] Phase-Separated Complex Hilbert PCA on Markerless 3D Pose Estimation Data: A Global Phase Network and Its Extension to a Continuous Field on the Body Surface

【速读】:该论文旨在解决体育运动中全身协调机制量化分析的难题,特别是传统方法如运动序列(Kinematic Sequence, KS)和连续相对相位(Continuous Relative Phase, CRP)仅限于相邻关节对、缺乏统一框架,而基于功率流的段落分析又依赖力板和惯性参数,难以在非实验室环境中应用的问题。解决方案的关键在于引入复杂希尔伯特主成分分析(Complex Hilbert Principal Component Analysis, CHPCA),对无标记三维姿态估计数据按运动阶段(后摆与下摆)分别处理,提取出以单一复数特征向量表示的全身相位模式,并结合全自动信号驱动的相位分割技术及扩展至1079个体表网格顶点的连续相位场建模,从而实现从骨骼关节到体表网格的全身体相位结构统一描述。此方法不仅揭示了躯干锚定的全局相位架构和准备-执行阶段的功能不对称性,还通过统计一致性验证了其在运动学与动力学之间的桥梁作用。

链接: https://arxiv.org/abs/2604.24415
作者: Hiromitsu Goto,Tao Tao,Zheng-Lin Chia
机构: Kanazawa Gakuin University (金泽学院大学)
类目: ocial and Information Networks (cs.SI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 19 pages, 8 figures, 6 tables. Extended English version of a paper to be submitted to Transactions of the Japanese Society for Artificial Intelligence (JSAI; Special Issue on Emerging Topics in Sports Informatics)

点击查看摘要

Abstract:Quantitative analysis of the kinematic chain in sports motion is essential for performance evaluation and injury prevention. Conventional methods such as the kinematic-sequence (KS) and continuous relative phase (CRP) are confined to adjacent joint pairs and lack a unified framework for whole-body coordination, while segmental power-flow analysis requires force plates and inertial parameters that restrict it to laboratory environments. We apply Complex Hilbert Principal Component Analysis (CHPCA) separately to each motion phase (backswing and downswing) on markerless 3D pose estimation data, extracting the dominant whole-body phase pattern as a single complex eigenvector. The pipeline further includes a fully automatic signal-based phase segmentation (no priors on strike count or rest location) and an extension to 1,079 body-surface mesh vertices, so that the kinematic chain is represented as a continuous phase field across the body. On 14 hammer-striking trials of a single subject, the framework reveals (i) a trunk-anchored global phase architecture, (ii) a functional asymmetry between preparation and execution phases quantified by Mode-1 contribution (45.5% vs. 70.5%) and inter-trial Spearman consistency (0.38 vs. 0.58), and (iii) a consistent reorganisation across both skeletal joints and mesh vertices ( p 10^-10 on 1,079 vertices). As a methodological consistency check, pairwise phase differences from the Mode-1 eigenvector are compared against CRP on all 190 joint pairs by a permutation test ( \rho = 0.473 , p = 0.0005 ). A correspondence analysis between Mode-1 amplitude and kinetic-energy mobilisation variance further shows a strong positive correlation in the downswing ( \rho \approx 0.71 on both skeleton and mesh) and no correlation in the backswing, indicating that the proposed framework bridges kinematic and kinetic descriptions of coordination through phase structure.

[CV-27] AD-Relight: Training-Free Banner Relighting via Illumination Translation with Diffusion Priors

【速读】:该论文旨在解决个性化广告(Personalized Ads)中广告横幅(banner)与原始场景光照条件不一致的问题,从而实现更自然、无缝的广告插入效果。现有方法多依赖简单的几何变形(geometric warping),忽视了场景光照的影响;而当前最先进的扩散模型在处理广告横幅时也因缺乏专门训练数据而难以准确重照明(relighting)。解决方案的关键在于提出一种无需训练的多阶段测试时自适应框架 AD-Relight,该框架能够在推理阶段动态调整扩散式重照明模型,使其适配新插入的 Photoshop 生成广告横幅,从而显著提升视觉一致性与用户感知质量。

链接: https://arxiv.org/abs/2604.24407
作者: Rameshwar Mishra,A V Subramanyam
机构: Dolby Labs(杜比实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The recent surge in content consumption through streaming services has driven a growing demand for personalized content. Personalized advertisements (ads) play a crucial role in enhancing both user engagement and ad effectiveness. A key aspect of ad personalization involves replacing existing regions in a frame with custom, Photoshop-generated banners. However, existing ad-placement pipelines typically rely on simple geometric warping, ignoring the scene’s underlying lighting conditions. Similarly, state-of-the-art diffusion-based object insertion and relighting models struggle to accurately relight these newly inserted banners, as they are not trained on ad-banner data, and training such a model for ad banners would require millions of images. This highlights the need for an effective relighting framework that enables seamless integration of custom banners into the original scene. Motivated by this, we present AD-Relight, a novel multi-stage training-free framework that adapts a diffusion-based relighting model at test time to relight newly added Photoshop-generated ad banners. Through extensive evaluation, we demonstrate that AD-Relight outperforms both relighting baselines and existing ad-placement methods based on simple warping. User studies further show that participants consistently prefer the outputs of AD-Relight over those of prior approaches.

[CV-28] Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation ACL2025

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中存在的对象幻觉(object hallucination)问题,即模型在生成描述时偏离图像真实内容,主要归因于对语言先验的过度依赖和视觉特征注意力不足。解决方案的关键在于提出一种无需训练的推理框架——正负解码(Positive-and-Negative Decoding, PND),其核心机制是通过双路径对比:正向路径利用多层注意力增强显著视觉证据以强化视觉忠实性,反向路径则通过削弱关键对象特征构建强反事实,从而惩罚脱离视觉依据的生成行为;该框架在每一步解码中对比两种视角下的输出,引导模型生成既符合语言概率又基于视觉事实的文本,显著降低幻觉并提升描述细节,在多个基准测试中实现最优性能。

链接: https://arxiv.org/abs/2604.24396
作者: Yubo Jiang,Xin Yang,Abudukelimu Wuerkaixi,Zheming Yuan,Xuxin Cheng,Fengying Xie,Zhiguo Jiang,Cao Liu,Ke Zeng,Haopeng Zhang
机构: Beihang University (北京航空航天大学); Meituan (美团); Tianmushan Laboratory (天目山实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 8 figures, Findings of ACL 2025

点击查看摘要

Abstract:Vision-Language Models (VLMs) are frequently undermined by object hallucination–generating content that contradicts visual reality–due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our key finding of a critical attention deficit in VLMs, where visual features are empirically under-weighted. Our framework corrects this via a dual-path contrast: The positive path amplifies salient visual evidence using multi-layer attention to encourage faithful descriptions, directly counteracting the attention deficit. Simultaneously, the negative path identifies and degrades the core object’s features to create a strong counterfactual, which penalizes ungrounded, prior-dominant generation. By contrasting the model’s outputs from these two perspectives at each step, PND steers generation towards text that is not just linguistically probable, but visually factual. Extensive experiments on benchmarks like POPE, MME, and CHAIR show that PND achieves state-of-the-art performance with up to 6.5% accuracy improvement, substantially reducing object hallucination while also enhancing descriptive detail–all without requiring any model retraining. The method generalizes effectively across diverse VLM architectures including LLaVA, InstructBLIP, InternVL, and Qwen-VL.

[CV-29] Complexity of Linear Regions in Self-supervised Deep ReLU Networks CVPR

【速读】:该论文旨在解决自监督学习(Self-Supervised Learning, SSL)模型在训练过程中其激活网络几何结构复杂性与表示质量之间关系不明确的问题。现有研究主要聚焦于监督学习框架下线性区域(linear regions)数量随训练演化的规律,而SSL方法通过优化表示空间以提升多下游任务性能,其几何特性尚未被充分探索。解决方案的关键在于引入SplineCam技术提取数据分布附近的二维多面体(polytopes),并系统追踪线性区域的数量、面积、偏心率及边界变化,从而揭示SSL模型中线性区域演化模式与其表示质量的关联。实验表明,对比学习(contrastive)方法快速扩展区域,自蒸馏(self-distillation)方法则倾向于合并邻近区域,且聚类指标可早期识别表示崩溃现象,证明多面体度量是可靠的表示质量与模型性能指示器。

链接: https://arxiv.org/abs/2604.24393
作者: Mufhumudzi Muthivhi,Terence L. van Zyl
机构: University of Johannesburg (约翰内斯堡大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition - Findings Track (CVPRF)

点击查看摘要

Abstract:There has been growing interest in studying the complexity of Rectified Linear Unit (ReLU) based activation networks. Recent work investigates the evolution of the number of piecewise-linear partitions (linear regions) that are formed during training. However, current research is limited to examining the complexity of models trained in a supervised way. Self-Supervised Learning (SSL) differs in that it directly optimises the representation space using a loss function to enhance the model’s performance across multiple downstream tasks. This study investigates the local distribution of linear regions produced by SSL models. We demonstrate that the evolution of linear regions correlates with the representation quality by utilising SplineCam to extract two-dimensional polytopes near the data distribution. We track the number, area, eccentricity, and boundaries of regions throughout training. The study compares supervised, contrastive, and self-distillation methods over two standard benchmark datasets, MNIST and FashionMNIST. The analysis of the experimental results shows that self-supervised methods create substantially fewer regions to achieve comparable accuracy to supervised models. Contrastive methods rapidly expand regions over time, whereas self-distillation methods tend to consolidate by merging neighbouring regions. Lastly, we can detect representation collapse early within the geometric space of linear regions. Our analysis suggests that polytopal metrics can serve as reliable indicators of representation quality and model performance.

[CV-30] Multispectral airborne laser scanning dataset for tree species classification: MS-ALS-SPECIES

【速读】:该论文旨在解决当前森林个体树木水平评估中缺乏高质量、公开可用的多光谱激光扫描(Multispectral Airborne Laser Scanning, MLS)数据集的问题,尤其是针对树种分类任务中存在的数据稀缺与验证不足。其解决方案的关键在于构建并公开一个包含6326个个体树木分段点云的多光谱ALS数据集,涵盖芬兰南部九种树种,覆盖两种不同分辨率的传感器(直升机搭载系统HeliALS点密度>1000 points/m² 和 Optech Titan系统约35 points/m²),并配套开发了一套高效且可扩展的野外实地验证数据采集方法。此外,论文进一步利用该数据集开展树种分类性能分析,揭示了分类精度与树高的关系,并验证了点变换器(Point Transformer)模型在小树和稀有物种识别中的优势,从而凸显该数据集的多功能性和应用潜力。

链接: https://arxiv.org/abs/2604.24370
作者: Matti Hyyppä,Klaara Salolahti,Eric Hyyppä,Xiaowei Yu,Josef Taher,Leena Matikainen,Matti Lehtomäki,Paula Litkey,Teemu Hakala,Harri Kaartinen,Juha Hyyppä,Antero Kukko
机构: Finnish Geospatial Research Institute FGI (芬兰地理空间研究所FGI); The National Land Survey of Finland (芬兰国家土地测量局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The shift from stand-level to individual-tree-level forest assessments supports improved biodiversity mapping, particularly in boreal ecosystems where tree species like aspen (Populus tremula L.) play a keystone role. While airborne laser scanning (ALS) is the standard for such inventories, a major limitation is the small number of publicly available ALS datasets containing high-quality, field-validated reference data. Furthermore, open multispectral ALS datasets with high-quality field reference data are completely lacking despite the potential of multispectral ALS data for tree species classification. This paper presents and details an open multispectral ALS dataset used in a recent international benchmarking study of machine learning and deep learning methods for tree species classification by Taher et al. (2026). The dataset comprises 6326 segment-level point clouds of individual trees representing nine species in Southern Finland. The point cloud data has been acquired using two multispectral laser scanning systems each operating at three laser wavelengths: a helicopter-borne system (HeliALS) with a point density exceeding 1000 points/m ^2 and an Optech Titan system with approximately 35 points/m ^2 . We provide a detailed description of field data collection techniques developed in the study to facilitate the collection of high-quality ground truth data in an efficient and scalable manner. Additionally, our article presents new analyses on species classification using multispectral data building upon the initial findings of Taher et al. (2026). Furthermore, we study the relation between classification accuracy and tree height to highlight the versatility of the open dataset and to demonstrate the advantage of the point transformer model for small trees and minority species.

[CV-31] ARETE: Attention-based Rasterized Encoding for Topology Estimation using HSV-transformed Crowdsourced Vehicle Fleet Data

【速读】:该论文旨在解决自动驾驶(Autonomous Driving, AD)中高精地图(High-Definition Map, HD Map)的实时更新与高精度生成问题,尤其关注从众包车辆轨迹中自动提取车道中心线(centerline)和车道分隔线(lane divider)以支持下游驾驶任务。其解决方案的关键在于提出一种基于检测变压器(Detection Transformer, DETR)的方法,将聚合后的车辆轨迹转化为栅格化表示(rasterized representation),并以此输入模型预测矢量化的、带方向性的车道结构;其中每条车道由中心线及其几何约束下的分隔线组成,通过局部瓦片(local tile)划分与轨迹编码实现高效建模,在内部数据集及nuScenes、nuPlan公开数据集上验证了方法的有效性。

链接: https://arxiv.org/abs/2604.24353
作者: Daniel Fritz,Dimitrios Lagamtzis,Michael Mink,Markus Enzweiler,Steffen Schober
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:The continuous advancement of autonomous driving (AD) introduces challenges across multiple disciplines to ensure safe and efficient driving. One such challenge is the generation of High-Definition (HD) maps, which must remain up to date and highly accurate for downstream automotive tasks. One promising approach is the use of crowdsourced data from a vehicle fleet, representing road topology and lane-level features. This work focuses on the generation of centerlines and lane dividers from crowdsourced vehicle trajectories. We adopt a Detection Transformer (DETR)-based approach, where a rasterized representation of vehicle trajectories is used as input to predict vectorized lane representations. Each lane consists of a centerline with an associated direction and corresponding lane dividers that are geometrically constrained by the centerline. Our method includes the extraction of local tiles, from which crowdsourced vehicle trajectories are aggregated. Each tile undergoes a transformation into a rasterized representation encoding both the presence and direction of each trajectory, enabling the prediction of vectorized directed lanes. Experiments are conducted on an internal dataset as well as on the public datasets nuScenes and nuPlan.

[CV-32] Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion

【速读】:该论文旨在解决可控扩散模型(controllable diffusion methods)在实际应用中因系统碎片化而导致的兼容性问题,如训练流程不一致、参数格式差异及运行时钩子不可互操作等,这些问题限制了跨任务基础设施复用、跨骨干模型能力迁移以及多控制策略在同一生成流水线中的组合使用。解决方案的关键在于提出 Diffusion Templates 框架,其核心是将基础模型推理与可控能力注入解耦,通过三个组件实现统一抽象:模板模型(Template models)将任务特定输入映射为中间能力表示,模板缓存(Template cache)作为标准化的能力注入接口,模板流水线(Template pipeline)负责加载、合并并注入一个或多个模板缓存到基础扩散运行时中。该设计使得不同能力载体(如 KV-Cache 和 LoRA)可在同一抽象层下被支持,从而实现对多种可控生成任务(如结构控制、亮度调整、图像编辑等)的统一建模与模块化扩展。

链接: https://arxiv.org/abs/2604.24351
作者: Zhongjie Duan,Hong Zhang,Yingda Chen
机构: Alibaba Group (阿里巴巴集团)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注: 21 pages, 15 figures

点击查看摘要

Abstract:Controllable diffusion methods have substantially expanded the practical utility of diffusion models, but they are typically developed as isolated, backbone-specific systems with incompatible training pipelines, parameter formats, and runtime hooks. This fragmentation makes it difficult to reuse infrastructure across tasks, transfer capabilities across backbones, or compose multiple controls within a single generation pipeline. We present Diffusion Templates, a unified and open plugin framework that decouples base-model inference from controllable capability injection. The framework is organized around three components: Template models that map arbitrary task-specific inputs to an intermediate capability representation, a Template cache that functions as a standardized interface for capability injection, and a Template pipeline that loads, merges, and injects one or more Template caches into the base diffusion runtime. Because the interface is defined at the systems level rather than tied to a specific control architecture, heterogeneous capability carriers such as KV-Cache and LoRA can be supported under the same abstraction. Based on this design, we build a diverse model zoo spanning structural control, brightness adjustment, color adjustment, image editing, super-resolution, sharpness enhancement, aesthetic alignment, content reference, local inpainting, and age control. These case studies show that Diffusion Templates can unify a broad range of controllable generation tasks while preserving modularity, composability, and practical extensibility across rapidly evolving diffusion backbones. All resources will be open sourced, including code, models, and datasets.

[CV-33] SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters

【速读】:该论文旨在解决小规模开放权重视觉语言模型(Vision-Language Models, VLMs)在评估图像与文本描述一致性时可能出现的“谄媚行为”(sycophantic behavior)问题,即模型在缺乏视觉证据支撑的情况下仍给出高分评分,从而影响评估结果的可靠性。其解决方案的关键在于提出一个量化指标——“吹牛系数”(Bluffing Coefficient, \bc),用于衡量模型评分与其引用视觉证据之间的不匹配程度;通过该指标对六种参数量从450M到8B的VLMs进行系统评估,发现模型规模与谄媚率呈显著负相关(r = -0.96, p = 0.002),揭示了小型VLMs更易产生无依据的高分评价,为自动化评估任务中模型选择提供了关键实证依据。

链接: https://arxiv.org/abs/2604.24346
作者: Arya Shah,Deepali Mishra,Chaklam Silpasuwanchai
机构: IIT Gandhinagar; IIT Kanpur; Asian Institute of Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 12 figures, 6 tables

点击查看摘要

Abstract:Vision-language models (VLMs) are increasingly deployed as evaluators in tasks requiring nuanced image understanding, yet their reliability in scoring alignment between images and text descriptions remains underexplored. We investigate whether small, open-weight VLMs exhibit \emphsycophantic behavior when evaluating image-text alignment: assigning high scores without grounding their judgments in visual evidence. To quantify this phenomenon, we introduce the \emphBluffing Coefficient (\bc), a metric that measures the mismatch between a model’s score and its evidence recall. We evaluate six open-weight VLMs ranging from 450M to 8B parameters on a benchmark of 173,810 AI-generated character portraits paired with detailed textual descriptions. Our analysis reveals a significant inverse correlation between model size and sycophancy rate ( r = -0.96 , p = 0.002 ), with smaller models exhibiting substantially higher rates of unjustified high scores. The smallest model tested (LFM2-VL, 450M) produced sycophantic evaluations in 22.3% of cases, compared to 6.0% for the largest (LLaVA-1.6, 7B). These findings have direct implications for the deployment of small, open-weight VLMs as automated evaluators within attribute-rich, synthetic image evaluation tasks, where the gap between assigned scores and cited visual evidence is both measurable and consequential.

[CV-34] See Further Think Deeper: Advancing VLMs Reasoning Ability with Low-level Visual Cues and Reflection CVPR2026

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在强化学习(Reinforcement Learning, RL)驱动下的推理能力受限问题,具体表现为缺乏低层级视觉信息的利用和有效的视觉反馈机制。解决方案的关键在于提出一个统一的多模态交错推理框架ForeSight,其核心创新包括:一是引入一组低层级视觉工具,将细粒度视觉特征整合进推理链中,缓解对局部视觉信息的忽视;二是设计基于掩码(mask-based)的视觉反馈机制,使模型能在思考过程中动态地进行视觉反思与答案修正,从而实现“看得更远”(See Further)与“想得更深”(Think Deeper)。该框架通过RL自主决策工具调用与答案验证,并以最终答案准确率为奖励信号进行训练,显著提升了模型性能。

链接: https://arxiv.org/abs/2604.24339
作者: Zhiheng Wu,Tong Wang,Shuning Wang,Naiming Liu,Yumeng Zhang
机构: Baidu Inc.(百度公司); Zhejiang University (浙江大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR2026

点击查看摘要

Abstract:Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these problems, this paper proposes a unified multimodal interleaved reasoning framework \textbfForeSight, which enables VLMs to \textbfSee Further with low-level visual cues and \textbfThink Deeper with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism is elaborated to incorporate visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to autonomously decide on tool invocation and answer verification, with the final answer accuracy as the reward signal. To evaluate the performance of the proposed framework, we construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results demonstrate that the ForeSight-7B model significantly outperforms other models with the same parameter scale, and even surpasses the current SOTA closed-source models on certain metrics.

[CV-35] An AffordableWearable Stereo-Eye-Tracking Platform

【速读】:该论文旨在解决现有可穿戴眼动追踪设备(包括商用和开源平台)在算法开发与对比评估方面灵活性不足的问题。其解决方案的关键在于设计并实现了一种基于市售组件和3D打印部件的低成本、可穿戴立体眼动追踪平台,该平台支持多种眼动追踪范式(如立体视觉法、光点法及双目方法)在同一硬件配置下运行,并通过模块化设计和软件支持实现了校准与同步数据采集功能,从而为研究提供高度可扩展和灵活的实验基础。

链接: https://arxiv.org/abs/2604.24331
作者: Alexander Zimmer,Yasmeen Abdrabou,Enkelejda Kasneci
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Research on video-based eye-tracking has long explored stereo and glint-based methods, yet existing wearable eye trackers - both commercial and open-source - offer limited flexibility for algorithm development and comparative evaluation. We present an affordable, wearable stereo eye-tracking platform built from off-the-shelf and 3D-printable components that explicitly targets this gap. The system combines four infrared eye cameras, infrared illumination, an optional scene camera, and software support for calibration and synchronized data acquisition. By design, the platform supports multiple eye-tracking paradigms, including stereo, glint-based, and binocular approaches, within a single hardware configuration. Rather than optimizing for end-user robustness, the platform prioritizes modularity and extensibility for research use. This paper focuses on the hardware architecture and calibration pipeline and demonstrates the feasibility of the approach using a prototype implementation. All hardware designs and documentation are made openly available.

[CV-36] Monocular Depth Estimation via Neural Network with Learnable Algebraic Group and Ring Structures

【速读】:该论文旨在解决单目深度估计(Monocular Depth Estimation, MDE)中现有方法因将问题视为欧几里得网格上的通用图像到图像回归而忽略透视投影所诱导的内在代数与几何结构的问题。解决方案的关键在于提出LAGRNet框架,其核心创新是将MDE从根本上建立在代数几何基础上:首先通过学习的代数群作用构建Group-defined Feature Manifold(GFM),以强制实现投影等变性并提升视角变化下的鲁棒性;其次引入Ring Convolution Layer(RCL),将特征融合建模为分次环同态,保障跨尺度交互的代数一致性;最后设计Sheaf-based Module(SM),利用Čech神经网络聚合局部深度线索,确保全局拓扑一致性。这一系列设计使模型在KITTI、NYU-Depth V2和ETH3D等多个基准上实现了显著优于当前最优方法的精度与泛化能力。

链接: https://arxiv.org/abs/2604.24328
作者: Qianlei Wang,Kexun Chen,Shaolin Zhang,Hongli Gao,Chaoning Zhang,Xiaolin Qin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monocular depth estimation (MDE) has witnessed remarkable progress driven by Convolutional Neural Networks and transformer-based architectures. However, these approaches typically treat the problem as a generic image-to-image regression on Euclidean grids, thereby overlooking the intrinsic algebraic and geometric structures induced by perspective projection. To address this limitation, we propose LAGRNet, a novel framework that fundamentally grounds MDE in algebraic geometry by explicitly embedding learnable group, ring, and sheaf structures into the deep learning pipeline. Modeling feature maps as sections of a sheaf over an approximated image manifold, our method first establishes a Group-defined Feature Manifold (GFM) parameterized by a learned algebraic group action to enforce projective equivariance and robustness against view changes. To facilitate algebraically consistent cross-scale interactions, we subsequently introduce a Ring Convolution Layer (RCL) that formulates feature fusion as a graded ring homomorphism. Furthermore, to ensure global topological consistency, a Sheaf-based Module (SM) aggregates local depth cues via Čech nerve on the image topology. Extensive zero-shot evaluations across the KITTI, NYU-Depth V2, and ETH3D benchmarks demonstrate that LAGRNet significantly outperforms state-of-the-art methods in both accuracy and generalization capabilities.

[CV-37] Dont Pause! Every prediction matters in a streaming video

【速读】:该论文旨在解决现有在线视频问答(VideoQA)基准测试对实时流式视频理解能力评估不足的问题。当前主流方法多采用固定时间戳暂停视频并进行事后提问,无法衡量模型在事件发生时即时响应的能力,导致流式感知(streaming perception)与辅助能力(assistive capabilities)未被有效检验。为填补这一空白,作者提出SPOT-Bench,其核心创新在于引入多轮主动查询(multi-turn proactive queries),并通过Timeliness-F1指标综合评估预测的时序精度和全视频覆盖均衡性。解决方案的关键在于AsynKV——一种无需训练的流式适配机制,通过引入长短时记忆结构,在“静默期”(dead-time,即无需响应的时间段)高效分配计算资源,从而在不增加延迟的前提下提升模型的实时响应能力,显著优于现有流式模型并在传统回顾性基准上达到最先进性能。

链接: https://arxiv.org/abs/2604.24317
作者: Dibyadip Chatterjee,Zhanzhong Pang,Fadime Sener,Yale Song,Angela Yao
机构: National University of Singapore (新加坡国立大学); Google Inc. (谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 14 figures; this https URL

点击查看摘要

Abstract:Streaming video models should respond the moment an event unfolds, not after the moment has passed. Yet existing online VideoQA benchmarks remain largely retrospective. They pause the video at fixed timestamps, pose questions about current or past events, and score models only at those moments. This protocol leaves streaming predictions untested. To close this gap, we introduce SPOT-Bench, featuring multi-turn proactive queries that evaluate general streaming perception and assistive capabilities required by an always-on, real-time assistant. SPOT-Bench comes with Timeliness-F1, a consolidated metric that measures streaming predictions by their temporal precision and balanced coverage across the entire video. Our benchmark reveals: (i) offline models detect events reliably but spam predictions unprompted; (ii) post-training for silence reduces spamming but induces unresponsiveness; (iii) half of the streaming video expects no response, which we term dead-time - compute spent here does not affect response latency. These findings motivate AsynKV, a training-free streaming adaptation of offline models, that retains their event perception while improving their streaming behavior. AsynKV features a long-short term memory, utilized efficiently by scaling compute during dead-time. It serves as a strong baseline on SPOT-Bench, outperforming existing streaming models, and achieves state-of-the-art on retrospective benchmarks.

[CV-38] Unconstrained Multi-view Human Pose Estimation with Algebraic Priors

【速读】:该论文旨在解决从多视角图像中恢复三维人体姿态时对精确相机标定的依赖问题,这一限制在真实场景中尤为突出,导致现有方法适用性受限。其解决方案的关键在于提出一个无需标定约束的框架,核心创新包括:1)Triangulation with Transformer Regressor (TTR),将传统三角测量重构为数据驱动的token融合过程,从而绕过显式相机参数依赖;2)Gröbner basis Corrector (GC),通过引入基于多视角代数流形的约束损失函数,强制神经网络预测严格遵循射影几何规律;3)Temporal Equivariant Rectifier (TER),利用人体运动的等变特性施加时间一致性与结构约束,有效缓解未标定设置下的尺度模糊问题。该框架在标准基准上显著优于现有无标定方法,大幅缩小了与全标定“黄金标准”之间的性能差距。

链接: https://arxiv.org/abs/2604.24312
作者: Xiaolin Qin,Qianlei Wang,Jiacen Liu,Chaoning Zhang,Fei Zhu,Zhang Yi
机构: Chengdu Institute of Computer Applications, Chinese Academy of Sciences (中国科学院成都计算机应用研究所); School of Computer Science and Technology, University of Chinese Academy of Sciences (中国科学院大学计算机科学与技术学院); School of Computer Science and Engineering, University of Electronic Science and Technology of China (电子科技大学计算机科学与工程学院); Centre for Artificial Intelligence and Robotics at Hong Kong Institute of Science Innovation, Chinese Academy of Sciences (中国科学院香港科技创新中心人工智能与机器人研究中心); School of Computer Science, Sichuan University (四川大学计算机学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recovering 3D human pose from multi-view imagery typically relies on precise camera calibration, which is often unavailable in real-world scenarios, thereby severely limiting the applicability of existing methods. To overcome this challenge, we propose an unconstrained framework that synergizes deep neural networks, algebraic priors, and temporal dynamics for uncalibrated multi-view human pose estimation. First, we introduce the Triangulation with Transformer Regressor (TTR), which reformulates classical triangulation into a data-driven token fusion process to bypass the dependency on explicit camera parameters. Second, to explicitly embed the inherent algebraic relations of the multi-view variety into the learning process, we propose the Gröbner basis Corrector (GC). This pioneering loss formulation enforces constraints derived from the multi-view variety to ensure the neural predictions strictly adhere to the laws of projective geometry. Finally, we devise the Temporal Equivariant Rectifier (TER), which exploits the equivariance property of human motion to impose temporal coherence and structural consistency, effectively mitigating scale ambiguity in uncalibrated settings. Extensive evaluations on standard benchmarks demonstrate that our framework establishes a new state-of-the-art for uncalibrated multi-view human pose estimation. Notably, our approach significantly closes the performance gap between calibration-free methods and fully calibrated oracles.

[CV-39] BIMStruct3D: A Fully Automated Hybrid Learning Scan-to-BIM Pipeline with Integrated Topology Refinement

【速读】:该论文旨在解决从建筑扫描数据(如3D点云)自动构建符合IFC标准的建筑信息模型(BIM)这一关键挑战。其解决方案的核心在于提出一个模块化处理流程,融合基于学习的语义分割与拓扑感知的几何重建方法,以更准确地建模结构构件;同时引入vIoU指标,通过体素级重叠评估实现无需实例匹配的完整模型对比,从而提升评价的合理性与效率。

链接: https://arxiv.org/abs/2604.24311
作者: Mahdi Chamseddine,Fabian Kaufmann,Marius Schellen,Christian Glock,Didier Stricker,Jason Rambach
机构: German Research Center for Artificial Intelligence (DFKI); RPTU Kaiserslautern-Landau
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in EC3 2026

点击查看摘要

Abstract:Automatic generation of Building Information Models (BIM) from building scans is a key challenge in architecture and construction. We present a modular pipeline for generating IFC-compliant BIM from 3D point clouds. The hybrid approach combines learning-based semantic segmentation with topology-aware geometric reconstruction to model structural elements accurately. We propose vIoU, adapting voxel-based overlap evaluation to Scan-to-BIM by enabling holistic, instance-matching-free comparison of reconstructed and ground-truth models. We release the German Hospital dataset (DeKH), including high-resolution point clouds, ground truth BIMs, and semantic annotations. Experiments on DeKH and CV4AEC datasets show significant improvements over a RANSAC-based baseline, demonstrating robustness and scalability.

[CV-40] ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

【速读】:该论文旨在解决当前空间智能(spatial intelligence)评估在现代视觉-语言模型(VLM)设置下系统性无效的问题。具体而言,现有基准测试存在两大缺陷:一是QA对基于点云三维标注生成,但用于视频评估时易因重建与标注误差导致对象遗漏、误识别或几何相关回答错误;二是评估假设全场景访问,而实际VLM常基于稀疏采样帧(如16–64帧)运行,使许多问题在真实输入下无法回答。解决方案的关键在于提出ReVSI基准与协议,通过重新标注5个数据集共381个场景中的物体及几何信息以提升数据质量,并使用专业三维标注工具严格消除偏差并经人工验证重生成所有QA对,确保每个QA对在模型实际输入下均可答且正确;同时提供多帧预算(16/32/64/全部帧)和细粒度物体可见性元数据,实现可控诊断分析,从而揭示先前基准掩盖的系统性失败模式,提升空间智能评估的可靠性与诊断能力。

链接: https://arxiv.org/abs/2604.24300
作者: Yiming Zhang,Jiacheng Chen,Jiaqi Tan,Yongsen Mao,Wenhu Chen,Angel X. Chang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Current evaluations of spatial intelligence can be systematically invalid under modern vision-language model (VLM) settings. First, many benchmarks derive question-answer (QA) pairs from point-cloud-based 3D annotations originally curated for traditional 3D perception. When such annotations are treated as ground truth for video-based evaluation, reconstruction and annotation artifacts can miss objects that are clearly visible in the video, mislabel object identities, or corrupt geometry-dependent answers (e.g., size), yielding incorrect or ambiguous QA pairs. Second, evaluations often assume full-scene access, while many VLMs operate on sparsely sampled frames (e.g., 16-64), making many questions effectively unanswerable under the actual model inputs. We improve evaluation validity by introducing ReVSI, a benchmark and protocol that ensures each QA pair is answerable and correct under the model’s actual inputs. To this end, we re-annotate objects and geometry across 381 scenes from 5 datasets to improve data quality, and regenerate all QA pairs with rigorous bias mitigation and human verification using professional 3D annotation tools. We further enhance evaluation controllability by providing variants across multiple frame budgets (16/32/64/all) and fine-grained object visibility metadata, enabling controlled diagnostic analyses. Evaluations of general and domain-specific VLMs on ReVSI reveal systematic failure modes that are obscured by prior benchmarks, yielding a more reliable and diagnostic assessment of spatial intelligence.

[CV-41] Instance Awareness of Multi-class Semantic Segmentation Loss Functions CVPR

【速读】:该论文旨在解决多类语义分割中的实例不平衡(instance imbalance)与类别不平衡(class imbalance)问题,尤其在医学图像分割任务中,小目标和稀有类别因梯度信号不足而难以有效训练。其核心解决方案是将原本仅适用于单类分割的实例敏感损失(如CC损失和blob损失)通过一 vs. 全(one-vs-rest)类别分解扩展至多类场景,从而实现对每个类别的均匀梯度贡献;同时,在每个组件内部引入反向尺寸加权(inverse-size weighting),避免全局权重失衡导致的训练不稳定,使重加权机制局限于局部空间上下文,显著提升稀有类别的Dice分数(从0.36提升至0.44),并保持整体分割质量(Panoptic Quality)。

链接: https://arxiv.org/abs/2604.24276
作者: Soumya Snigdha Kundu,Florian Kofler,Marina Ivory,Hendrik Moller,Jonathan Shapey,Tom Vercauteren
机构: King’s College London (伦敦国王学院); University of Tübingen (图宾根大学); Technical University of Munich (慕尼黑工业大学); King’s College Hospital (伦敦国王学院医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 Figures, Accepted as Poster at CV4CLINIC workshop at CVPR

点击查看摘要

Abstract:Instance-sensitive losses for semantic segmentation such as blob loss and CC loss were designed to address instance imbalance, ensuring small lesions generate the same gradient as large ones, but operate only on single-class segmentation. In multi-class settings, class imbalance poses an additional problem: rare classes with few instances receive a disproportionately small share of the training signal. We show that extending instance-sensitive losses to multi-class segmentation via a one-vs-rest class decomposition repurposes them to also address class imbalance, as uniform averaging over classes ensures each class contributes equally regardless of frequency. We further show that inverse-size weighting, which destabilizes training when applied globally due to weight imbalances across rare and common classes, becomes effective when integrated within the per-component loss, confining the reweighting to each component’s spatial context. On the BraTS-METS 2025 dataset (260 test cases), multi-class CC loss improves foreground Dice (0.64 +/- 0.26 vs. 0.59 +/- 0.27 baseline) and rare-class Dice, while maintaining Panoptic Quality at DSC threshold 0.5. Multi-class blob loss achieves the best Panoptic Quality at threshold 0.5 (0.40 +/- 0.24 vs. 0.38 +/- 0.25 baseline) and recognition quality (0.53 +/- 0.29 vs. 0.49 +/- 0.30). Integrating inverse-size weighting within the per-component loss increases rare-class Dice to 0.44 +/- 0.36 at the cost of reduced detection quality.

[CV-42] ouchless Intraoperative Image Access System Based on Vision-Based Hand Tracking

【速读】:该论文旨在解决手术环境中对医学图像进行无接触交互的需求,以维持无菌操作和保障术中流程的连续性。其核心解决方案是基于单个RGB摄像头的视觉系统,利用MediaPipe Hands实现实时手部关键点(hand landmarks)的2.5D估计,并将简单直观的手势映射为平移、旋转和缩放命令,从而实现与图像查看器的自然、连续交互。该方法无需额外硬件或用户特定训练,且系统架构独立于可视化软件,具备低延迟、高稳定性和强鲁棒性的特点,验证了其在术中场景下作为低成本触控替代方案的可行性。

链接: https://arxiv.org/abs/2604.24235
作者: Yin Lin,Domenico Aquino,Alberto Redaelli,Massimiliano Del Bene,Riccardo Barbieri,Simona Ferrante
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Touchless interaction with medical images is becoming increasingly important in the surgical field, where sterility and continuity of the operational workflow are essential requirements. This work presents a vision-based system for intraoperative navigation of medical images through hand gestures acquired using a single RGB camera. Unlike many existing solutions, the system does not require additional hardware or user-specific training. Hand tracking is performed in real time using MediaPipe Hands, which provides a 2.5D estimation of hand landmarks. Simple and intuitive gestures are then mapped into translation, rotation, and zoom commands, enabling continuous and natural interaction with the image viewer. The system architecture is independent from the visualization software and, for implementation simplicity, in this study it was integrated with PyVista. Performance was evaluated through frame-level logging and quantitative analysis of latency, stability, and interaction robustness metrics. Experimental results highlight real-time behavior, with reduced latencies and stable control, in line with the requirements of fluid interaction. The system demonstrates the feasibility of a low-cost touchless solution for intraoperative access to medical images, laying the groundwork for future clinical evaluations.

[CV-43] Graph-augmented Segmentation of Complex Shapes in Laser Powder bed Fusion for Enhanced In Situ Inspection

【速读】:该论文旨在解决激光粉末床熔融(Laser Powder Bed Fusion, L-PBF)过程中原位图像分割在实际工业环境中面临的挑战,尤其是光照条件变化和层间像素强度模式波动导致的分割性能不稳定问题。其解决方案的关键在于提出一种图增强的分割方法,通过在U-Net架构中嵌入图神经网络(Graph Neural Network, GNN)瓶颈模块,将几何信息从像素级提升至区域级的全局建模,从而显式捕捉空间区域间的依赖关系与关联信息,有效提升了在存在空间和层间光度变异情况下的几何重建一致性与准确性。

链接: https://arxiv.org/abs/2604.24234
作者: Stefano Raimondo,Matteo Bugatti,Marco Grasso(Department of Mechanical Engineering, Politecnico di Milano, Milan, Italy)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE Transactions on Automation Science and Engineering (T-ASE)

点击查看摘要

Abstract:The technological maturity of in situ inspection and monitoring methods in additive manufacturing is steadily increasing, enabling more efficient and practical qualification procedures. In this context, image segmentation of powder bed images in Laser Powder Bed Fusion (L-PBF) has been investigated by various authors, leveraging both edge detection and machine learning approaches to identify deviations from nominal geometry. Despite these developments, several challenges remain, including the sensitivity of segmentation performance to industrial illumination conditions and layer-to-layer variability in pixel intensity patterns. The study addresses these limitations by proposing a graph-augmented segmentation approach. The underlying principle consists of preserving the geometrical information at a global level rather than at pixel-wise level, modeling dependencies and relational information among spatial regions with a Graph Neural Network bottleneck embedded into a U-Net architecture. This allows enhancing the consistency and accuracy of the geometry reconstruction in the presence of spatial and layer-wise photometric variability systematically faced in real data. The method is evaluated against benchmark techniques for the in situ reconstruction of lattice structures produced by L-PBF, demonstrating its potential as a scalable solution for robust in situ inspection and geometric verification in industrial environments.

[CV-44] Radiomics- and Clinical Feature-Driven Prediction of Volumetric Response in Skull-Base Meningioma after CyberKnife Radiosurgery

【速读】:该论文旨在解决颅底脑膜瘤(skull-base meningiomas)患者在接受CyberKnife立体定向放射外科治疗后,如何早期识别可能获益于该治疗的患者这一临床难题。现有研究多聚焦于无进展生存期或复发率,而本文创新性地以肿瘤体积变化(volumetric response)作为疗效指标,提出了一种融合影像组学(radiomics)特征与临床变量的预测框架。其解决方案的关键在于:利用TabPFN这一先进的机器学习架构,在嵌套交叉验证(nested cross-validation)策略下对小样本、高维数据进行建模,从而有效捕捉与治疗反应相关的复杂模式,最终实现了AUC达0.81的分类性能,为个体化精准放疗决策提供了可行的技术路径。

链接: https://arxiv.org/abs/2604.24230
作者: Yin Lin,Elena De Martin,Giacomo Conte,Domenico Aquino,Cristiana Pedone,Alberto Redaelli,Riccardo Barbieri,Laura Fariselli,Simona Ferrante
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Skull-base meningiomas are often characterized by favorable long-term prognosis, yet their anatomical complexity and proximity to critical neurovascular structures make treatment selection challenging. Stereotactic radiosurgery with CyberKnife represents an effective therapeutic option when surgical resection is not feasible; however, not all patients benefit equally from this treatment. Early identification of patients likely to respond to radiosurgery remains an open clinical problem. In this study, we propose a radiomics- and clinical feature-driven framework for predicting volumetric response in skull-base meningiomas treated with CyberKnife. Unlike most existing approaches that focus on progression-free survival or recurrence, our method targets volumetric response as an indicator of treatment efficacy. Pre-treatment MRI images from 104 patients were processed to extract radiomic features, which were combined with clinical variables and analyzed using six models. To ensure methodological rigor, the entire modeling process was implemented within a nested cross-validation scheme. Among the evaluated models, TabPFN achieved the best overall performance, with an AUC of 0.81 and consistently favorable classification metrics. These results suggest that advanced machine learning architectures, when combined with robust validation strategies, can effectively capture patterns associated with treatment response even in small-sample, high-dimensional settings.

[CV-45] Computer Vision-Based Early Detection of Container Loss at Sea

【速读】:该论文旨在解决海运中集装箱因动态海况(如船舶运动、风载荷和恶劣海况)导致堆叠失稳进而发生落海的问题,这一问题对航运安全、环境及经济均构成重大挑战。解决方案的关键在于提出了一种低成本、可 retrofittable(加装改造)的计算机视觉系统,利用船上现有摄像头实现对失稳集装箱的早期检测;其核心技术包括基于目标分割(object segmentation)分离集装箱堆垛,并结合光流法(optical flow)与个体对象残差运动提取(residual motion extraction)实现时序跟踪与相对位移量化,从而在复杂海况和能见度变化条件下有效识别集装箱层面的异常运动,为船员干预和航行调整提供早期预警,提升货物安全性、运营韧性与监管合规性。

链接: https://arxiv.org/abs/2604.24193
作者: Vishakha Lall,Capt. Stanley S Pinto,Capt. Chu Xing Peng,Wu Kaiwen
机构: Singapore Polytechnic (新加坡理工学院); Pacific International Lines (Pvt) Ltd (太平洋国际航运公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted and Presented at SMRC x ICMASS/MTEC 2026

点击查看摘要

Abstract:Containerised shipping underpins global trade, yet container loss at sea remains a persistent safety, environmental, and economic challenge. Despite compliance with Cargo Securing Manuals, dynamic maritime conditions such as vessel motion, wind loading, and severe sea states can progressively destabilise container stacks, leading to overboard losses. With the new International Maritime Organisation’s (IMO) mandatory reporting requirements for lost containers, there is an urgent need for a reliable, evidence-based early detection solution for destabilised containers. This study showcases a low-cost, retrofittable computer vision-based system for early detection of destabilised containers using existing onboard cameras. The framework integrates object segmentation to isolate container stacks, temporal object tracking using optical flow and individual objects’ residual motion extraction to quantify relative movement. Experimental evaluation on real onboard ship footage demonstrates that the proposed pipeline effectively isolates container-level motion under challenging conditions of varying sea states and visibility conditions. By enabling early alerts for crew intervention and navigational adjustment, the proposed approach enhances cargo safety, operational resilience, and regulatory compliance.

[CV-46] Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning

【速读】:该论文旨在解决多模态理解(omnimodal understanding)中因跨模态交互空间庞大且高度冗余而导致的推理效率低下问题,特别是现有推理范式(如顺序生成或并行采样 rollout)难以共享有前景的中间推理路径,从而限制探索效率并导致复杂音视频任务中的误差累积。解决方案的关键在于提出 Omni-o3 框架,其核心是基于深度嵌套推理策略(deep nested deduction policy),将推理建模为动态递归搜索过程,天然支持不同分支间推理前缀的共享,并通过四种原子认知动作(扩展、选择、模拟与反向传播)实现迭代执行;同时设计了两阶段训练机制:第一阶段利用 10.1 万条高质量长链轨迹进行监督微调以建立必要递归模式,第二阶段采用嵌套组 rollout 驱动的探索性强化学习结合多步奖励模型,激发深层嵌套推理能力,显著提升了多模态任务下的综合推理性能。

链接: https://arxiv.org/abs/2604.24191
作者: Zhicheng Zhang,Wentao Gu,Weicheng Wang,Yongjie Zhu,Wenyu Qin,Meng Wang,Pengfei Wan,Jufeng Yang
机构: 1: Tsinghua University (清华大学); 3: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Omnimodal understanding entails a massive, highly redundant search space of cross-modal interactions, demanding focused and deliberative reasoning. Current reasoning paradigms rely on either sequential step-by-step generation or parallel sample-by-sample rollouts, leading to isolated reasoning trajectories. This inability to share promising intermediate paths severely limits exploration efficiency and causes compounding errors in complex audio-visual tasks. To break this bottleneck, we introduce Omni-o3, a novel framework driven by a deep nested deduction policy. By formulating reasoning as a dynamic recursive search, Omni-o3 inherently shares reasoning prefixes across branches, enabling the iterative execution of four atomic cognitive actions: expansion, selection, simulation, and backpropagation. To empower this framework, we propose a robust two-stage training paradigm: (1) cold-start supervised fine-tuning on 101K high-quality, long-chain trajectories distilled from 3.5M diverse omnimodal samples, enabling necessary recursive search patterns; and (2) nested group rollout-driven exploratory reinforcement learning on 18K complex multi-turn samples, explicitly guided by a novel multi-step reward model to stimulate deep nested reasoning. Extensive experiments demonstrate that Omni-o3 achieves competitive performance across 11 benchmarks, unlocking advanced capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning tasks.

[CV-47] Multivariate Gaussian NeRF for Wide Field-of-View Ultrasound Reconstruction

【速读】:该论文旨在解决宽视场(Wide Field-of-View, WFoV)超声成像中因凸探头(convex probe)产生的发散声束导致的图像拼接伪影和混叠问题,这些问题源于深度相关的分辨率变化,严重影响了3D重建质量和临床应用中的解剖上下文感知。其解决方案的关键在于提出Ultra-Wide-NeRF方法,该方法基于多变量3D高斯(Multivariate 3D Gaussian, MVG)神经辐射场(Neural Radiance Field, NeRF),通过显式建模距离依赖的凸形体素采样和各向异性3D高斯分布,从几何上抑制伪影并实现抗混叠;同时,该方法提供连续的神经表示而非静态网格,从而支持任意虚拟轨迹下的高保真新视角合成,显著扩展术中导航所需的解剖空间上下文。

链接: https://arxiv.org/abs/2604.24187
作者: Patris Valera,Magdalena Wysocki,Felix Duelmer,Mohammad Farid Azampour,Sebastian Herz,Stefan Wörz,Nassir Navab
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Wide Field-of-View (WFoV) reconstruction enhances 3D ultrasound imaging by providing valuable anatomical context for segmentation models and visualization. Clinical ultrasound volumes are predominantly acquired using convex probes, which generate expanding, diverging acoustic beams to maximize anatomical coverage. Stitching these sweeps together traditionally introduces significant compounding artifacts and aliasing due to depth-dependent resolution changes. Here, we introduce Ultra-Wide-NeRF, a Multivariate 3D Gaussian (MVG) NeRF-based method for WFoV ultrasound reconstruction. By explicitly modeling the complex beam geometry using distance-dependent convex volumetric sampling and anisotropic 3D Gaussians, our method inherently mitigates these compounding artifacts and provides anti-aliasing. Beyond simply reconstructing a static 3D grid, our NeRF-based approach yields a continuous neural representation of the tissue, enabling the synthesis of high-fidelity novel views from arbitrary virtual trajectories. We validate Ultra-Wide-NeRF for intracardiac echocardiography on phantom and porcine datasets, demonstrating that our method expands the spatial context important in intraoperative navigation. Code will be open-sourced upon publication.

[CV-48] POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation CVPR2026

【速读】:该论文旨在解决当前视觉文本生成模型在文本准确性(text accuracy)与图像整体一致性(image coherence)之间难以平衡的问题,尤其针对强化学习(reinforcement learning, RL)方法因多奖励加权求和导致的训练不稳定、权重难调以及提示(prompt)选择效率低下的挑战。解决方案的关键在于提出帕累托最优课程对齐(Pareto-Optimal Curriculum Alignment, POCA)框架,其核心创新包括:1)通过识别帕累托最优集(Pareto-optimal set)避免简单的标量加权策略,从而在统一奖励空间中找到更稳定的多目标最优解;2)设计自适应课程对齐策略,基于自动难度评估动态管理多奖励数据集的学习顺序,提升在有限数据环境下强化学习的收敛性与性能表现。实验表明,POCA显著提升了CLIP分数、HPS评分及句子准确率等指标。

链接: https://arxiv.org/abs/2604.24171
作者: Yaohou Fan,Qingzhong Wang,Yongsong Huang,Junyi Liu,Tomo Miyazaki,Shinichiro Omachi
机构: Tohoku University (东北大学); Amazon Web Services (亚马逊网络服务)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Current visual text generation models struggle with the trade-off between text accuracy and overall image coherence. We find that achieving high text accuracy can reduce aesthetic quality and instruction-following capability. Although reinforcement learning approaches can alleviate the problem through aligning with multiple rewards, they are often unstable for text generation, as existing approaches normally optimize multiple rewards in a weighted-sum way. In addition, it is difficult to balance the weight of each reward. Moreover, reinforcement learning requires a set of training instructions. A large number of prompts require more training time and computing resources, while a small set leads to poor performance. Hence, how to select the prompts for efficient training is an unsolved problem. In this study, we propose Pareto-Optimal Curriculum Alignment (POCA), a framework that addresses this issue as a multi-objective problem by: 1) identifying the Pareto-optimal set to avoid simple scalarization and 2) designing an adaptive curriculum alignment strategy to manage a learning sequence of a multi-reward dataset using automatic difficulty assessment, which is crucial for optimal convergence as RL methods explore in a limited data environment. In synergy, POCA finds the Pareto-optimal set in a unified reward space, which eliminates inconsistent signals to find the best trade-off solution from different rewards under an easy-to-hard optimization landscape. The experimental results show that POCA significantly improves all metrics such as CLIP, HPS scores and sentence accuracy.

[CV-49] PointTransformerX:Portable and Efficient 3D Point Cloud Processing without Sparse Algorithms CVPR

【速读】:该论文旨在解决3D点云感知任务中对定制CUDA算子的高度依赖问题,这限制了模型在非NVIDIA(如AMD和嵌入式)硬件上的可移植性和效率。其核心解决方案是提出PointTransformerX(PTX),一个完全基于PyTorch实现的视觉Transformer骨干网络,彻底移除所有自定义CUDA算子和外部依赖库。关键创新包括:引入3D-GS-RoPE(3D Geometrically-aware Spatial Rotary Positional Embedding),在自注意力机制中直接编码三维空间关系而无需邻域构建;用线性投影替代稀疏卷积进行patch嵌入;并通过推理时动态扩展注意力窗口提升精度而不需重新训练。此外,通过重构前馈网络,PTX在ScanNet上达到PointTransformer V3 98.7%的准确率,参数减少79.2%,推理速度提升1.6倍,内存占用仅253 MB,且可在NVIDIA GPU、AMD GPU(ROCm)及CPU上原生运行,显著提升了点云感知模型的通用性与效率。

链接: https://arxiv.org/abs/2604.24169
作者: Laurenz Reichardt,Nikolas Ebert,Oliver Wasenmüller
机构: Mannheim University of Applied Sciences (曼海姆应用科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2026

点击查看摘要

Abstract:3D point cloud perception remains tightly coupled to custom CUDA operators for spatial operations, limiting portability and efficiency on non-NVIDIA, AMD, and embedded hardware. We introduce PointTransformerX (PTX), a fully PyTorch-native vision transformer backbone for 3D point clouds, removing all custom CUDA operators and external libraries while retaining competitive accuracy. PTX introduces 3D-GS-RoPE, a rotary positional embedding that encodes 3D spatial relationships directly in self-attention without neighborhood construction, and further replaces sparse convolutional patch embedding with a linear projection. PTX explores inference-time scaling of attention windows to improve accuracy without retraining. With a redesigned feed-forward network, PTX achieves 98.7% of PointTransformer V3’s accuracy on ScanNet with 79.2% fewer parameters and executing 1.6\times faster while requiring just 253 MB memory. PTX runs natively on NVIDIA GPUs, AMD GPUs (ROCm), and CPUs, providing an efficient and portable foundation for point cloud perception.

[CV-50] PEPS: Positional Encoding Projected Sampling – Extended

【速读】:该论文旨在解决隐式神经表示(Implicit Neural Representations, INRs)中传统位置编码(Positional Encoding)在高分辨率学习能力不足的问题,尤其是在网格(Grid)表示中需要极高分辨率才能有效建模信号的局限性。解决方案的关键在于提出一种新的“位置编码投影采样”(Positional Encoding Projected Sampling)方法,将位置编码的高频分量视为具有特定运动规律的有意义点,并利用这些点的唯一运动模式作为基分解来实现可学习的位置编码。该方法显著提升了编码效率,在图像表示、纹理压缩和符号距离函数等三个任务中均优于现有最优方法,且在相同重建误差下参数量减少约25%。

链接: https://arxiv.org/abs/2604.24167
作者: Guillaume Perez,Janarbek Matai,Takahiro Harada
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Implicit neural representations (INRs) are increasingly being used as tools to map coordinates to signals, encompassing applications from neural fields to texture compression, shape representations, and beyond. Most INR methods are based on using high-dimensional projections of the initial coordinates through encoders such as grid or positional encoding. Nevertheless, positional encoding is often insufficient and grids, as we show in this paper, require high resolution for being able to learn. In this paper, we demonstrate that positional encoding can be used not only as a high-dimensional embedding but also decomposed as a series of meaningful points. We propose the Positional Encoding Projected Sampling, where we treat the projection of the original coordinate at each frequency as a point of interest. We describe the motion of each point with respect to the frequencies and show that it follows a unique pattern. Finally, we use the unique motion of each point as a basis decomposition for doing learned positional encoding using grids. We prove, using three competitive applications; image representation, texture compression, and signed distance function; that the proposed approach outperforms the current state of the art methods, and often requires 25% less parameters for equivalent reconstruction error or rendering.

[CV-51] Robust Deepfake Detection NTIRE 2026 Challenge: Report

【速读】:该论文旨在解决深度伪造检测(deepfake detection)中长期被忽视的鲁棒性问题,即检测模型在面对图像退化(degradation)时性能显著下降的问题。现实场景中,图像可能因自然处理流程产生轻微退化,更存在恶意攻击者刻意引入退化以规避检测的情况。为应对这一挑战,作者组织了NTIRE 2026鲁棒深度伪造检测挑战赛,要求参赛者构建能在未知测试集(包含常见与罕见退化)上稳定表现的检测器。关键解决方案包括:利用大规模基础模型(foundation models)、集成学习(ensembles)以及退化训练(degradation training),从而在泛化能力与鲁棒性之间取得平衡。

链接: https://arxiv.org/abs/2604.24163
作者: Benedikt Hopf,Radu Timofte,Chenfan Qu,Junchi Li,Fei Wu,Dagong Lu,Mufeng Yao,Xinlei Xu,Fengjun Guo,Yongwei Tang,Zhiqiang Yang,Zhiqiang Wu,Jia Wen Seow,Hong Vin Koay,Haodong Ren,Feng Xu,Shuai Chen,Minh-Khoa Le-Phan,Minh-Hoang Le,Trong-Le Do,Minh-Triet Tran,Chih-Yu Jian,Yi-Fan Wang,Bang-Kang Chen,You-Chen Chao,Chia-Ming Lee,Fu-En Yang,Yu-Chiang Frank Wang,Chih-Chung Hsu,Aashish Negi,Hardik Sharma,Prateek Shaily,Jayant Kumar,Sachin Chaudhary,Akshay Dudhane,Praful Hambarde,Amit Shukla,Jielun Peng,Yabin Wang,Yaqi Li,Jincheng Liu,Xiaopeng Hong,Krish Wadhwani,Liam Fitzpatrick,Utkarsh Tiwari,Bilel Benjdira,Anas M. Ali,Wadii Boulila,Cristian Lazo Quispe,Aishwarya A,Akshara S,Ashwathi N,Jiachen Tu,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Yaokun Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Robustness is a long-overlooked problem in deepfake detection. However, detection performance is nearly worthless in the real world if it suffers under exposure to even slight image degradation. In addition to weaker degradations that can accidentally occur in the image processing pipeline, there is another risk of malicious deepfakes that specifically introduce degradations, purposefully exploiting the detector’s weaknesses in that regard. Here, we present an overview of the NTIRE 2026 Robust Deepfake Detection Challenge, which specifically addresses that problem. Participants were tasked with building a detector that would later be tested on an unknown test-set, which included both common and uncommon degradations of various strengths. With a total number of 337 participants and 57 submissions to the final leaderboard, the first edition of the challenge was well received. To ensure the reliability of the results, participants were given only 24h to complete the test run with no labels provided, limiting the possibility of training on the test data. Furthermore, the top solutions were scored on a private test-set to detect any such overfitting. This report presents the competition setting, dataset preparation, as well as details and performance of methods. Top methods rely on large foundation models, ensembles, and degradation training to combine generality and robustness.

[CV-52] 6thGrid-Net: Unified Remote Sensing Image Dehazing Based on Color Restoration and Edge-Preserving

【速读】:该论文旨在解决遥感图像因云层和雾霾等恶劣天气条件导致的退化问题,现有方法存在计算复杂度高、处理流程分离易引入干扰与伪影,以及统一网格方法使用固定各向同性插值核忽略自然图像低维流形结构从而造成边缘模糊等局限。其解决方案的关键在于提出6th Grid-Net框架,通过构建六维融合张量(six-dimensional fusion tensor),将3D查找表(3D LUT)的颜色还原能力与双边网格(bilateral grid)的空间亮度细节保持特性无缝集成;并引入一种流形自适应高维采样机制(manifold-adaptive high-dimensional sampling mechanism),根据局部边缘方向、纹理强度和颜色相似性动态调整插值核,实现单次前向传播中全局色彩风格化与局部边缘精细化同步完成;此外,结合边缘感知的网格平滑约束和动态量化策略,有效抑制鬼影伪影并大幅压缩模型体积,从而在资源受限的边缘设备上实现高效高质量的遥感图像恢复。

链接: https://arxiv.org/abs/2604.24149
作者: Runci Bai,Kui Jiang,Xiang Chen,Chen Wu,Dianjie Lu,Guijuan Zhang,Zhuoran Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote sensing images are frequently degraded by adverse weather conditions, particularly clouds and haze, which severely impair downstream applications. Existing restoration methods typically rely on computationally heavy architectures or sequential pipelines (e.g., detail enhancement followed by color rendition) that suffer from mutual interference and artifact accumulation. Furthermore, recent unified grid-based approaches utilize fixed, isotropic interpolation kernels, neglecting the intrinsic low-dimensional manifold of natural images and inevitably causing edge blur. To address these limitations, we propose 6th Grid-Net, a highly efficient and unified remote sensing image restoration framework tailored for resource-constrained edge devices. Specifically, we construct a novel six-dimensional fusion tensor that seamlessly integrates the color rendition capabilities of 3D LUTs with the spatial-luminance detail preservation of bilateral grids. To overcome the drawbacks of standard trilinear interpolation, we introduce a manifold-adaptive high-dimensional sampling mechanism. This mechanism dynamically adjusts the interpolation kernel based on local edge orientation, texture strength, and color similarity, enabling simultaneous global color stylization and local edge refinement in a single forward pass. Additionally, an edge-aware grid smoothing constraint and dynamic quantization are incorporated to suppress ghosting artifacts and significantly compress the model size. Extensive experiments on multiple benchmark datasets demonstrate that 6th Grid-Net achieves state-of-the-art restoration quality across various degradation scenarios.

[CV-53] EXACT: an explainable anomaly-aware vision foundation model for analysis of 3D chest CT

【速读】:该论文旨在解决三维胸部计算机断层扫描(CT)中,现有生成式 AI 模型难以同时实现疾病识别、异常定位及可解释性视觉证据提供的问题。当前主流视觉-语言基础模型通常将扫描与报告压缩为全局图像-文本表征,导致空间信息丢失,无法支持临床所需的精准解读。其解决方案的关键在于提出 EXACT——一种可解释的异常感知基础模型,通过解剖结构感知的弱监督学习,在无需体素级标注的情况下,联合学习器官分割与多实例异常定位,并生成器官特异性的异常评分图,从而在每个体素层面精确映射疾病风险并保留解剖上下文信息,显著提升多中心、多病种场景下的诊断准确性与可解释性。

链接: https://arxiv.org/abs/2604.24146
作者: Xuguang Bai,Mingxuan Liu,Tongxi Song,Yifei Chen,Hongjia Yang,Kasidit Anmahapong,Zihan Li,Ying Zhou,Qiyuan Tian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Chest computed tomography (CT) is central to the detection and management of thoracic disease, yet the growing scale and complexity of volumetric imaging increasingly exceed what can be addressed by scan-level prediction alone. Clinically useful AI for CT must not only recognize disease across the whole volume, but also localize abnormalities and provide interpretable visual evidence. Existing vision-language foundation models typically compress scans and reports into global image-text representations, limiting their ability to preserve spatial evidence and support clinically meaningful interpretation. Here we developed EXACT, an explainable anomaly-aware foundation model for three-dimensional chest CT that learns spatially resolved representations from paired clinical scans and radiology reports. EXACT was pre-trained on 25,692 CT-reports pairs using anatomy-aware weak supervision, jointly learning organ segmentation and multi-instance anomaly localization without manual voxel-level annotations. The resulting organ-specific anomaly-aware maps assign each voxel a disease-specific anomaly score confined to its corresponding anatomy, jointly encoding lesion extent and organ-level context. In retrospective multinational and multi-center evaluations, EXACT showed broad and consistent improvements across clinically relevant CT tasks, spanning multi-disease diagnosis, zero-shot anomaly localization, downstream adaptation, and visually grounded report generation, outperforming existing three-dimensional medical foundation models. By transforming routine clinical CT scans and free-text reports into explainable voxel-level representations, EXACT establishes a scalable paradigm for trustworthy volumetric medical AI.

[CV-54] Bridging Restoration and Generation Manifolds in One-Step Diffusion for Real-World Super-Resolution

【速读】:该论文旨在解决预训练扩散模型在真实图像超分辨率(Real-ISR)任务中因迭代采样带来的计算瓶颈问题,以及单步蒸馏方法所面临的感知质量与失真之间的权衡难题。其核心挑战包括:刚性的时间步初始化、分布轨迹不匹配及脆弱的随机调制机制。解决方案的关键在于提出IDaS-SR框架,其中包含两个核心技术:一是Manifold Inversion Noise Estimator (MINE),通过预测与严重程度相关的初始时间步和反演噪声,精确地将低质量潜在表示锚定到扩散轨迹上,从而解决初始化和轨迹不匹配问题;二是CHARIOT机制,一种连续生成引导策略,通过重调度轨迹并插值噪声,实现对感知-失真边界的显式调控,同时保持结构先验不变,使模型能够在单次推理中无缝切换从结构恢复到纹理幻觉的多种能力。

链接: https://arxiv.org/abs/2604.24136
作者: Shyang-En Weng,Yi-Cheng Liao,Yu-Syuan Xu,Wei-Chen Chiu,Ching-Chun Huang
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); MediaTek Inc. (联发科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Pretrained diffusion models have revolutionized real-world image super-resolution (Real-ISR) but suffer from computational bottlenecks due to iterative sampling. Recent single-step distillation accelerates inference but faces a stark perception-distortion trade-off due to rigid timestep initialization, distributional trajectory mismatches, and fragile stochastic modulation. To address this, we present Adaptive Inversion and Degradation-aware Sampling for Real-ISR (IDaS-SR), a one-step framework bridging the deterministic restoration and stochastic generation manifolds. At its core, the Manifold Inversion Noise Estimator (MINE) resolves these initialization and trajectory mismatches by predicting a severity-aware timestep and inversion noise, precisely anchoring low-quality latents onto the diffusion trajectory. Furthermore, to mitigate fragile stochastic modulation, we propose CHARIOT, a continuous generative steering mechanism. By rescheduling trajectories and interpolating noise, it enables explicit navigation of the perception-distortion boundary without compromising structural priors. Extensive experiments demonstrate that IDaS-SR outperforms state-of-the-art methods, seamlessly transitioning from a rigorous structural restorer to a sophisticated texture hallucinator in a single inference step.

[CV-55] Open-Vocabulary Semantic Segmentation Network Integrating Object-Level Label and Scene-Level Semantic Features for Multimodal Remote Sensing Images

【速读】:该论文旨在解决多模态遥感影像语义分割中忽视非视觉文本数据的问题,即当前方法主要关注互补的视觉模态融合,而未充分利用文本信息作为连接视觉模式与现实概念的语义桥梁。其解决方案的关键在于提出TSMNet(Text Supervised Multi-modal Open Vocabulary Semantic Segmentation Network),该模型通过引入双分支文本编码器提取场景级语义和对象级标签信息,并设计文本引导的视觉语义融合模块(text-guided visual semantic fusion module),使文本特征动态交互于视觉嵌入,实现领域感知的特征精炼与可解释的决策过程,从而显著提升模型在不同地理环境和传感器条件下的泛化能力。

链接: https://arxiv.org/abs/2604.24125
作者: Jinkun Dai,Yuanxin Ye,Peng Tang,Tengfeng Tang,Xianping Ma,Jing Xiao,Mi Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semantic segmentation of multi-modal remote sensing imagery plays a pivotal role in land use/land cover (LULC) mapping, environmental monitoring, and precision earth observation. Current multi-modal approaches mainly focus on integrating complementary visual modalities, yet neglect the incorporating of non-visual textual data - a rich source of knowledge that can bridge semantic gaps between visual patterns and real-world concepts. To address this limitation, we propose TSMNet, a text supervised multi-modal open vocabulary semantic segmentation network that synergistically integrates textual supervision with visual representation for open-vocabulary semantic segmentation. Unlike conventional multi-modal segmentation frameworks, TSMNet introduces a dual-branch text encoder to extract both scene-level semantic and object-level label information from various textual data, enabling dynamic cross-modal fusion. These text-derived features dynamically interact with visual embeddings through the proposed text-guided visual semantic fusion module, enabling domain-aware feature refinement and human-interpretable decision-making. To verify our method, we innovatively construct two new multi-modal datasets, and carry out extensive experiments to make a comprehensive comparison between the proposed method and other state-of-the-art (SOTA) semantic segmentation models. Results demonstrate that TSMNet achieves superior segmentation accuracy while exhibiting robust generalization capabilities across diverse geographical and sensor-specific scenarios. This work establishes a new paradigm for explainable remote sensing analysis, demonstrating that textual knowledge integration significantly enhances model generalizability. The source code will be available at this https URL

[CV-56] FDIM: A Feature-distance-based Generic Video Quality Metric for Versatile Codecs

【速读】:该论文旨在解决传统视频质量评估(Video Quality Assessment, VQA)方法在面对生成式神经视频编解码器(Neural Video Codecs, NVCs)时泛化能力不足的问题,尤其是NVC产生的失真具有内容依赖性和生成特性,难以被传统VQA指标准确捕捉。解决方案的关键在于提出一种基于特征距离的通用视频质量度量方法FDIM,其核心创新是采用混合架构:一方面利用深度特征学习多尺度表示以捕获从结构纹理退化到高层语义偏移的各类失真;另一方面引入手工设计特征提供稳定互补线索,从而提升模型在不同编解码器、内容类型和动态范围(SDR/HDR)下的泛化性能。FDIM在包含超过16,000个视频序列的大规模主观质量数据集(DCVQA)上训练,并在十个涵盖多种未见过编解码器的SDR/HDR VQA数据集上验证,表现出与主观评价高度一致的性能。

链接: https://arxiv.org/abs/2604.24123
作者: Jiayi Wang,Lichun Zhang,Xiaoqi Zhuang,Jiaqi Zhang,Lu Yu,Yin Zhao
机构: Zhejiang University (浙江大学); Zhejiang Key Laboratory of Multimodal Communication Networks and Intelligent Information Processing (浙江省多模态通信网络与智能信息处理重点实验室); Huawei (华为)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video technology is advancing toward Ultra High Definition (UHD) and High Dynamic Range (HDR), which intensifies the need for higher compression efficiency for these high-specification videos. Beyond advances in traditional codecs, neural video codecs (NVCs) have attracted significant research attention and have evolved rapidly over the past few years. The coding artifacts of NVCs often exhibit content-varying and generative characteristics, which differ from those of conventional codecs and are challenging for traditional video quality assessment (VQA) methods to capture. Therefore, VQA metrics are required to generalize across different codecs, content types, and dynamic ranges to better support video codec research and evaluation. In this paper, we propose FDIM, a feature-distance-based generic video quality metric for both traditional and neural video codecs across SDR and HDR formats. FDIM employs a hybrid architecture that integrates deep and hand-crafted features. The deep feature component learns multi-scale representations to capture distortions ranging from structural and textural fidelity degradation to high-level semantic deviations, while the hand-crafted feature component provides stable complementary cues to improve overall generalization. We trained FDIM on a large-scale subjective quality assessment dataset (DCVQA) consisting of over 16k video sequences encoded by traditional block-based hybrid video codecs and end-to-end perceptually optimized neural video codecs. Extensive experiments on ten SDR/HDR VQA datasets containing diverse, previously unseen codecs demonstrate that FDIM achieves strong generalization and high correlation with subjective assessment. The source code for FDIM and the DCVQA validation set will be released at this https URL.

[CV-57] opoHR: Hierarchical Centerline Representation for Cyclic Topology Reasoning in Driving Scenes with Point-to-Instance Relations CVPR2026

【速读】:该论文旨在解决自动驾驶中拓扑推理(topology reasoning)的两个关键问题:一是现有方法主要依赖实例级学习进行中心线检测,并通过简化MLP模块进行拓扑推理,缺乏对点到实例(point-to-instance, P2I)关系的充分建模;二是拓扑推理与中心线检测通常为串行处理,难以实现协同优化。解决方案的关键在于提出一种端到端的TopoHR框架,其核心创新是建立中心线检测与拓扑推理之间的循环交互机制,使二者能够迭代增强。具体而言,TopoHR引入分层中心线表示(包括点查询、实例查询和语义表示),并通过分层中心线解码器实现多层级特征融合;同时设计了分层拓扑推理模块,在统一架构中捕获细粒度P2I关系与全局实例间(instance-to-instance, I2I)连接,从而显著提升拓扑推理的准确性与鲁棒性。在OpenLane-V2基准上的实验表明,该方法在多个指标上均取得显著提升,验证了所提结构的有效性。

链接: https://arxiv.org/abs/2604.24119
作者: Yifeng Bai,Zhirong Chen,Erkang Cheng,Haibin Ling
机构: NullMax; Westlake University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026

点击查看摘要

Abstract:Topology reasoning is crucial for autonomous driving. Current methods primarily focus on instance-level learning for centerline detection, followed by a sequential module for topology reasoning that relies on simplified MLP layers. Moreover, they often neglect the importance of \textitpoint-to-instance (P2I) relationships in topology reasoning. To address these limitations, we present TopoHR (Topological Hierarchical Representation), a novel end-to-end framework that establishes cyclic interaction between centerline detection and topology reasoning, allowing them to iteratively enhance each other. Specifically, we introduce a hierarchical centerline representation including point queries, instance queries, and semantic representations. These multi-level features are seamlessly integrated and fused within a hierarchical centerline decoder. Furthermore, we design a hierarchical topology reasoning module that captures both fine-grained P2I relationships and global instance-to-instance (I2I) connections within a unified architecture. With these novel components, TopoHR ensures accurate and robust topology reasoning. On the OpenLane-V2 benchmark, TopoHR refreshes state-of-the-art performance with significant improvements. Notably, compared with previous best results, TopoHR achieves +3.8 in \mathrmDET_\textl , +5.4 in \mathrmTOP_\textll on \textsubset_A and +11.0 in \mathrmDET_\textl , +7.9 in \mathrmTOP_\textll on \textsubset_B , validating the effectiveness of the proposed components. The code will be shared publicly at this https URL.

[CV-58] SemiSAM-O1: How far can we push the boundary of annotation-efficient medical image segmentation?

【速读】:该论文旨在解决深度学习在医学图像分割中对大量标注数据依赖过高的问题,尤其是在极端低标注场景(如仅使用一张标注图像)下,现有基于基础模型(Foundation Model)的半监督学习方法难以保持鲁棒且具有竞争力的性能。其解决方案的关键在于提出SemiSAM-O1框架,通过充分利用基础模型编码器的密集特征表示能力,超越其提示接口的限制:第一阶段利用单张模板图像生成类原型,并基于特征相似性传播至未标注数据以获得粗粒度伪标签;第二阶段引入迭代训练与精化循环,结合体素级不确定性估计进行自监督优化,并通过不确定性引导的精化步骤,利用基础模型全局特征空间聚合高置信度邻近样本的标签来修正高不确定性区域,从而形成模型与伪标签之间的良性互促机制,显著缩小了单标签半监督学习与全监督之间的性能差距,同时大幅降低在线基础模型推理的计算开销。

链接: https://arxiv.org/abs/2604.24109
作者: Yichi Zhang,Le Xue,Bichun Xu,Judong Luo,Zhigang Wu,Yu Fu,Zixin Hu,Yuan Cheng,Yuan Qi
机构: Artificial Intelligence Innovation and Incubation Institute, Fudan University (复旦大学人工智能创新与孵化研究院); Shanghai Academy of Artificial Intelligence for Science (上海人工智能科学研究院); Department of Radiotherapy, Tongji Hospital, School of Medicine, Tongji University (同济大学医学院同济医院放射科); School of Information Science and Engineering, Lanzhou University (兰州大学信息科学与工程学院); Department of Information and Intelligence Development, Zhongshan Hospital, Fudan University (复旦大学中山医院信息与智能发展部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semi-supervised learning (SSL) has become a promising solution to alleviate the annotation burden of deep learning-based medical image segmentation models. While recent advances in foundation model-driven SSL have pushed the boundary to extremely limited annotation scenarios, they fail to maintain robust competitive performance in complex imaging modalities. In this paper, we propose SemiSAM-O1, an annotation-efficient framework using only one annotated template image for segmentation. SemiSAM-O1 extends the specialist-generalist collaborative learning framework to the extreme one-label setting by fully exploiting the foundation model’s feature representation capability beyond its prompting interface. SemiSAM-O1 operates in two stages. In the first stage, the foundation model’s encoder extracts dense features from all volumes, and class prototypes derived from the single annotated template are propagated to the unlabeled pool via feature similarity to produce coarse initial pseudo-labels. In the second stage, an iterative training-and-refinement loop progressively improves both the segmentation model and the pseudo-labels over multiple rounds, where each round trains the model from scratch on current pseudo-labels and generates updated predictions with voxel-wise uncertainty estimates. An uncertainty-guided refinement step further leverages the foundation model’s global feature space to correct high-uncertainty regions by aggregating labels from their most similar confident neighbors, establishing a virtuous cycle of mutual improvement. Extensive experiments on a wide range of segmentation tasks across different modalities and anatomical targets demonstrate that SemiSAM-O1 significantly narrows the performance gap between one-label semi-supervised learning and full supervision, while significantly reducing the computational overhead of online foundation model inference.

[CV-59] Light em Up: Enabling Few-Shot Low-Light 3D Gaussian Splatting with Multi-Scale Explicit Retinex Illumination Decoupling

【速读】:该论文旨在解决低光照条件下全向(360°)新视角合成(novel view synthesis)中的几何一致性与逼真度难以兼顾的问题,具体表现为光照不足导致的噪声放大和视点依赖的光度不一致。现有无监督方法在大视角变化下易出现颜色漂移,而有监督的低光照增强模型虽在2D任务中有效,却缺乏跨场景泛化能力且需重新训练。其解决方案的关键在于提出MERID-GS框架——基于Retinex理论显式解耦光照(illumination)与反射率(reflectance),并通过可学习增益和光照状态引导的频率门控机制抑制噪声传播并增强暗区结构;同时结合轻量级反射头(Reflection Head)与3D高斯泼溅(3D Gaussian Splatting),实现仅需少量样本即可适应新场景,并从稀疏视角观测中稳定生成高质量低光新视角图像。

链接: https://arxiv.org/abs/2604.24053
作者: YuHao Yin,Zongji Wang,Yuanben Zhang,Biqing Li,Jiesong Bai,Junyi Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 5 figures. Code available at this https URL

点击查看摘要

Abstract:Full 360 ^\circ novel view synthesis under low-light conditions remains challenging. Insufficient illumination, noise amplification, and view-dependent photometric inconsistencies prevent existing methods from jointly preserving geometric consistency and photorealism. Unsupervised approaches often exhibit color drift under large viewpoint variations, while supervised low-light enhancement models, though effective for 2D tasks, struggle to generalize to new scenes and typically require retraining. To address this issue, we propose MERID-GS, a Multi-Scale Explicit Retinex Illumination-Decoupled Gaussian framework for low-light 360 ^\circ synthesis. Based on Retinex theory, the method explicitly separates illumination and reflectance, and suppresses noise propagation while enhancing dark-region structures via a learnable gain and Illumination-State-Guided Frequency Gating. Combined with lightweight Reflection Head and 3D Gaussian Splatting, MERID-GS adapts to new scenes with only a few shots and enables stable low-light novel view synthesis from sparse-view observations. In addition, we construct a low-light multi-view dataset covering full 360 ^\circ scenes for joint evaluation. Thorough experiments across multiple datasets in this area demonstrate that MERID-GS achieves SOTA performance, exhibiting superior cross-scene generalization and view consistency. The source code and pre-trained models are available at this https URL…

[CV-60] QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering EMNLP2025

【速读】:该论文旨在解决视频到文本摘要(Video-to-Text Summarization)领域中缺乏全面、无参考(reference-free)评估方法的问题。现有基于n-gram重叠或大语言模型(LLM)的指标严重依赖人工编写的参考摘要,限制了其在实际场景中的适用性,并难以捕捉语义细节。解决方案的关键在于提出QEVA——一种通过多模态问答(Multimodal Question Answering)直接将候选摘要与源视频进行对比的评估指标,能够从覆盖度(Coverage)、事实一致性(Factuality)和时间顺序(Chronology)三个维度量化摘要质量。同时,作者构建了MLVU(VS)-Eval基准数据集,为评估提供透明且一致的框架,实验表明QEVA相较于现有方法在Kendall’s τ_b、τ_c 和 Spearman’s ρ 上与人类判断具有更高相关性,显著提升了评估的可靠性与实用性。

链接: https://arxiv.org/abs/2604.24052
作者: Woojun Jung,Junyeong Kim
机构: Chung-Ang University (中央大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of EMNLP 2025

点击查看摘要

Abstract:Video-to-text summarization remains underexplored in terms of comprehensive evaluation methods. Traditional n-gram overlap-based metrics and recent large language model (LLM)-based approaches depend heavily on human-written reference summaries, limiting their practicality and sensitivity to nuanced semantic aspects. In this paper, we propose QEVA, a reference-free metric evaluating candidate summaries directly against source videos through multimodal question answering. QEVA assesses summaries along three clear dimensions: Coverage, Factuality, and Chronology. We also introduce MLVU(VS)-Eval, a new annotated benchmark derived from the MLVU dataset, comprising 800 summaries generated from 200 videos using state-of-the-art video-language multimodal models. This dataset establishes a transparent and consistent framework for evaluation. Experimental results demonstrate that QEVA shows higher correlation with human judgments compared to existing approaches, as measured by Kendall’s \tau_b , \tau_c , and Spearman’s \rho . We hope that our benchmark and metric will facilitate meaningful progress in video-to-text summarization research and provide valuable insights for the development of future evaluation methods.

[CV-61] Generalising maximum mean discrepancy: kernelised functional Bregman divergences

【速读】:该论文旨在解决在函数空间(即点为函数而非有限维参数向量)中系统构建和应用Bregman散度的问题,这在机器学习中的聚类、参数估计与优化等任务中具有重要意义。传统方法多基于Banach空间或未充分结合核方法与希尔伯特空间几何结构,而本文的关键解决方案在于将Bregman散度定义于希尔伯特空间,并利用自对偶配对(self-dual pairing)和里斯表示定理(Riesz representer)简化计算;进一步通过将Bregman生成函数设计为核均值嵌入(kernel mean embedding)的复合形式,使得此类散度易于估计,从而实现了与核方法和现代机器学习框架的自然融合。

链接: https://arxiv.org/abs/2604.24047
作者: Russell Tsuchida,Frank Nielsen
机构: Monash University (莫纳什大学); Sony Computer Science Laboratories, Inc. (索尼计算机科学实验室公司)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
备注: 21 pages

点击查看摘要

Abstract:Bregman divergences play a pivotal role in statistics, machine learning and computational information geometry. Particularly in the context of machine learning, they are central to clustering, exponential families, parameter estimation and optimisation, among other things. Despite this, the full toolkit of Hilbert spaces and in particular reproducing kernel Hilbert spaces have not been systematically developed and applied to functional Bregman divergences, where points are functions rather than finite-dimensional parameter vectors. While other types of functional Bregman divergences have been studied, these are typically in a Banach space rather than more directly aligned with kernel methods and Hilbert-space geometry commonly used in machine learning. We consider functional Bregman divergences on a Hilbert space, where the self-dual pairing and Riesz representer afford us particularly convenient calculus. Further specialising Bregman generators as a composition involving a kernel mean embedding makes such divergences easy to estimate. We discuss applications in clustering, universal estimation, robust estimation and generative modelling, and contrast our approach with other types of Bregman divergences.

[CV-62] CLLAP: Contrastive Learning-based LiDAR-Augmented Pretraining for Enhanced Radar-Camera Fusion CVPR

【速读】:该论文旨在解决自动驾驶中雷达-相机融合方法依赖高质量标注雷达数据的问题,而此类数据稀缺且标注成本高昂。其核心解决方案是提出CLLAP(Contrastive Learning-based LiDAR-Augmented Pretraining)框架,关键在于利用大量易获取的激光雷达(LiDAR)数据通过L2R(LiDAR-to-Radar)采样方法生成伪雷达数据,并结合一种新颖的双阶段、双模态对比学习策略,在无需额外标注的情况下实现自监督预训练,从而显著提升现有雷达-相机融合模型的特征提取能力与检测精度。

链接: https://arxiv.org/abs/2604.24044
作者: Bingyi Liu,Chuanhui Zhu,Hongfei Xue,Jian Teng,Jipeng Liu,Enshu Wang,Penglin Dai,Pu Wang
机构: Wuhan University of Technology (武汉理工大学); University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校); Wuhan University (武汉大学); Southwest Jiaotong University (西南交通大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by 2026 CVPR Findings

点击查看摘要

Abstract:Accurate 3D object detection is critical for autonomous driving, necessitating reliable, cost-effective sensors capable of operating in adverse weather conditions. Camera and millimeter-wave radar fusion has emerged as a promising solution; however, these methods often rely on finely annotated radar data, which is scarce and labor-intensive to produce. To address this challenge, we present CLLAP, a Contrastive Learning-based LiDAR-Augmented Pretraining framework that enhances the performance of existing radar-camera fusion methods for 3D object detection. CLLAP leverages abundant LiDAR data to generate pseudo-radar data using the proposed L2R (LiDAR-to-Radar) Sampling method. Then, it incorporates this data into a novel dual-stage, dual-modality contrastive learning strategy, enabling effective self-supervised learning from paired pseudo-radar and image data. This approach facilitates effective pretraining of existing radar-camera fusion models in a plug-and-play manner, enhancing their feature extraction capabilities and improving detection accuracy and robustness. Experimental results using NuScenes and Lyft Level 5 datasets demonstrate significant performance improvements across three baseline models, highlighting CLLAP’s effectiveness in advancing radar-camera fusion for autonomous driving applications.

[CV-63] Robust Grounding with MLLM s against Occlusion and Small Objects via Language-guided Semantic Cues ICASSP2026

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在拥挤场景中视觉接地(grounding)性能下降的问题。拥挤场景中存在的遮挡和小目标等视觉挑战会削弱物体语义信息,从而降低模型的定位准确性。针对这一问题,论文提出了一种基于语言引导语义线索(Language-Guided Semantic Cues, LGSCs)的新方法,其核心在于引入一个语义线索提取器(Semantic Cue Extractor, SCE),从MLLM的视觉流中提取物体语义线索,并通过对应文本嵌入对其进行引导,生成具有语言先验的语义线索;随后将这些线索重新融合进原始视觉流以优化物体语义表示,从而显著提升模型在复杂场景下的接地精度。

链接: https://arxiv.org/abs/2604.24036
作者: Beomchan Park,Seongho Kim,Hyunjun Kim,Sungjune Park,Yong Man Ro
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 4 pages, 2 figures, ICASSP 2026

点击查看摘要

Abstract:While Multimodal Large Language Models (MLLMs) have enhanced grounding capabilities in general scenes, their robustness in crowded scenes remains underexplored. Crowded scenes entail visual challenges (i.e., occlusion and small objects), which impair object semantics and degrade grounding performance. In contrast, language expressions are immune to such degradation and preserve object semantics. In light of these observations, we propose a novel method that overcomes such constraints by leveraging Language-Guided Semantic Cues (LGSCs). Specifically, our approach introduces a Semantic Cue Extractor (SCE) to derive semantic cues of objects from the visual pipeline of an MLLM. We then guide these cues using corresponding text embeddings to produce LGSCs as linguistic semantic priors. Subsequently, they are reintegrated into the original visual pipeline to refine object semantics. Extensive experiments and analyses demonstrate that incorporating LGSCs into an MLLM effectively improves grounding accuracy in crowded scenes.

[CV-64] JSSFF: A Joint Structural-Semantic Fusion Framework for Remote Sensing Image Captioning

【速读】:该论文旨在解决遥感图像_caption生成_中因对象边界模糊、遮挡及结构重叠导致的语义信息提取不充分问题,从而影响描述准确性。其关键解决方案在于:首先,在编码器中引入边缘感知融合机制(edge-aware fusion),将原始图像与其边缘增强版本联合输入,以同时捕获高层语义与低层空间细节;其次,采用基于比较的束搜索算法(comparison-based beam search, CBBS)生成文本描述,通过公平性导向的候选句对比优化,实现定量指标与定性相关性之间的平衡。该方法显著提升了遥感图像描述的准确性和边界敏感性。

链接: https://arxiv.org/abs/2604.24031
作者: Swadhin Das,Vivek Yadav
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The encoder-decoder framework has become widely popular nowadays. In this model, the encoder extracts informative visual features from an input image, and the decoder employs a sequence-to-sequence formulation to generate the corresponding textual description from these features. The existing models focus more on the decision part. However, extracting meaningful information from the image can help the decoder generate an accurate caption by providing information about the objects and their relationship. Remote sensing images are highly complex. One major challenge is detecting objects that extend beyond their visible boundaries due to occlusion, overlapping structures, and unclear edges. Hence, there is a need to design an approach that can effectively capture both high-level semantics and low-level spatial details for accurate caption generation. In this work, we have proposed an edge-aware fusion method by incorporating the original image and its edge-aware version into the encoder to enhance feature representation and boundary awareness. We used a comparison-based beam search (CBBS) to generate captions to achieve a balanced trade-off between quantitative metrics and qualitative caption relevance through fairness-based comparison of candidate captions. Experimental results demonstrate our model’s superiority over several baseline models in quantitative and qualitative perspectives.

[CV-65] Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras

【速读】:该论文旨在解决多投影仪标定中因逐个投影结构光图案而导致的标定时间与投影仪数量呈线性增长的问题,这一瓶颈严重限制了大规模投影映射系统的部署。其解决方案的关键在于将摄像头嵌入标定靶板表面,使嵌入式摄像头能够直接捕获来自不同方向的投影光,从而根据入射方向分离多个投影仪同时投射的结构光图案;通过建立嵌入式摄像头光心与投影仪像素之间的对应关系,实现所有投影仪内参和外参的同时估计,显著提升了系统在密集多投影仪场景下的可扩展性。

链接: https://arxiv.org/abs/2604.24024
作者: Takumi Kawano,Kohei Miura,Daisuke Iwai
机构: The University of Osaka (大阪大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Conventional multi-projector calibration requires projecting and capturing structured light patterns for each projector sequentially, causing calibration time and effort to increase linearly with the number of projectors. This scalability bottleneck has long limited the deployment of large-scale projection mapping systems. We present a new calibration framework that breaks this limitation by embedding cameras into the surface of the calibration target. The embedded cameras directly capture the incoming projection light, enabling the separation of simultaneously projected structured light patterns from multiple projectors according to their incident directions. Our method establishes correspondences between the optical centers of the embedded cameras and the projector pixels, allowing the intrinsic and extrinsic parameters of all projectors to be simultaneously estimated. We further introduce a correction technique for small misalignments between the calibration board and camera optical centers. As a result, our system achieves calibration accuracy comparable to conventional methods while reducing the required number of projection-capture cycles from linear to nearly constant with respect to the number of projectors, dramatically improving scalability for dense multi-projector systems with overlapping projection regions, such as high-brightness stacking, super-resolution, light-field, and shadow-suppression displays.

[CV-66] ServImage: An Image Generation and Editing Benchmark from Real-world Commercial Imaging Services

【速读】:该论文旨在解决当前生成式AI(Generative AI)图像生成与编辑模型在学术基准测试中表现优异,但在真实商业设计项目中其输出是否具备经济价值尚不明确的问题。解决方案的关键在于构建一个名为ServImage的综合性基准体系,其核心由三部分组成:(i) ServImageBench——包含1.07k个付费商业设计任务及2.05k份设计师交付成果的数据集,涵盖肖像、产品和数字内容,并附带33k张候选图像与33k条人工标注;(ii) ServImageScore——一种融合基础要求满足度、视觉执行质量与商业必要性三个维度的评分系统,用于刻画驱动人类支付决策的核心因素;(iii) ServImageModel——基于该评分体系训练的支付预测模型,在人工标注图像上实现了82.00%的支付决策预测准确率并输出校准后的支付概率。这一框架首次将图像生成质量与经济价值直接关联,为评估图像生成模型的商业化可行性提供了可量化、可扩展的方法论基础。

链接: https://arxiv.org/abs/2604.24023
作者: Fengxian Ji,Jingpu Yang,Zirui Song,Lang Gao,Junhong Liang,Zhenhao Chen,Jinghui Zhang,Xiuying Chen
机构: MBZUAI(阿联酋穆巴达拉人工智能研究所); Institute of Botany, the Chinese Academy of Sciences(中国科学院植物研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent image generation and editing models demonstrate robust adherence to instructions and high visual quality on academic benchmarks. However, their performance on paid, real-world design projects remains uncertain. We introduce \textbfServImage, a benchmark that explicitly correlates model outputs with economic value in commercial design projects. ServImage consists of (i) \textbf\textitServImageBench: a dataset of 1.07k paid commercial design tasks and 2.05k designer deliverables totaling over \ 295k, covering portrait, product, and digital content, along with 33k candidate images and 33k human annotations. (ii) \textbf\textitServImageScore: an integrated scoring system that combines three quality dimensions: baseline requirements fulfilment, visual execution quality, and commercial necessity satisfaction. These three dimensions are designed to characterize the factors that drive human payment decisions and indicate whether an image is commercially acceptable. (iii) \textbf\textitServImageModel: under this scoring system, we propose a payment prediction model trained on the human-annotated candidate images, achieving 82.00% accuracy in predicting human payment decisions and producing calibrated payment probabilities. ServImage provides a comprehensive foundation for assessing the commercial viability of image generation models and offers a scalable resource for future research on economically grounded vision systems \hrefthis https URLGithub.

[CV-67] FlashOverlap: Minimizing Tail Latency in Communication Overlap for Distributed LLM Training

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)分布式训练中因张量并行(Tensor Parallelism)和数据并行(Data Parallelism)导致的通信瓶颈问题,特别是现有基于数据切片的通信-计算重叠(Communication-Computation Overlap)方法中存在的尾延迟(Tail Latency)问题。其解决方案的关键在于提出了一种名为 Flash-Overlap 的新方法,该方法通过将传统的集合通信操作(如 reduce-scatter 和 all-gather)替换为分解后的点对点(Peer-to-Peer, P2P)通信,并精细调度分片计算任务,从而实现细粒度的通信与计算重叠,彻底消除尾延迟,同时兼容多种并行策略(如 TPSP 和 UP),显著提升模型 FLOPS 利用率(MFU)和吞吐量。

链接: https://arxiv.org/abs/2604.24013
作者: Rezaul Karim,Austin Wen,Wang Zongzuo,Weiwei Zhang,Yang Liu,Walid Ahmed
机构: Huawei(华为)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:The rapid growth in the size of large language models has necessitated the partitioning of computational workloads across accelerators such as GPUs, TPUs, and NPUs. However, these parallelization strategies incur substantial data communication overhead significantly hindering computational efficiency. While communication-computation overlap presents a promising direction, existing data slicing based solutions suffer from tail latency. To overcome this limitation, this research introduces a novel communication-computation overlap technique to eliminate this tail latency in state of the art overlap methods for distributed LLM training. The aim of this technique is to effectively mitigate communication bottleneck of tensor parallelism and data parallelism for distributed training and inference. In particular, we propose a novel method termed Flash-Overlap that replaces conventional collective operations of reduce-scatter and all-gather with decomposed peer-to-peer (P2P) communication and schedules partitioned computations to enable fine-grained overlap. Our method provides an exact algorithm for reducing communication overhead that eliminates tail latency. Moreover, it presents a versatile solution compatible with data-parallel training and various tensor-level parallelism strategies, including TPSP and UP. Experimental evaluations demonstrate that our technique consistently achieves lower latency, superior Model FLOPS Utilization (MFU), and high throughput.

[CV-68] SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs CVPR2026

【速读】:该论文旨在解决多模态视觉-语言模型(Multimodal Vision-Language Models, VLMs)中基于专家混合(Mixture-of-Experts, MoE)架构的专家路由策略缺乏模态感知特异性的问题。现有方法通常依赖手工设计或与模态无关的路由机制,忽略了不同网络层中模态融合模式的动态变化,导致专家专业化程度不足,限制了模型性能和部署效率。其解决方案的关键在于提出Soft Modality-guided Expert Specialization (SMoES),通过引入动态软模态分数以捕捉层间模态融合模式、采用与专家并行部署对齐的专家分箱机制,以及利用跨箱互信息正则化促进模态专业化,从而实现更高效且有针对性的专家分配。

链接: https://arxiv.org/abs/2604.23996
作者: Zi-Hao Bo,Yaqian Li,Anzhou Hou,Rinyoichi Takezoe,Ertao Zhao,Tianxiang Pan,Jiale Yan,Mo Guang,Kaiwen Long
机构: Li Auto Inc. (小鹏汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026

点击查看摘要

Abstract:Mixture-of-Experts (MoE) has become a prevalent backbone for large vision-language models (VLMs), yet how modality-specific signals should guide expert routing remains under-explored. Existing routing strategies are either hand-crafted or modality-agnostic, relying on idealized priors that ignore the layer-dependent modality fusion patterns in MoE-VLMs and provide little guidance for expert specialization. We propose Soft Modality-guided Expert Specialization (SMoES), which consists of dynamic soft modality scores that capture layer-dependent fusion patterns, an expert binning mechanism aligned with expert-parallel deployment, and an inter-bin mutual information regularization that encourages coherent modality specialization. Our method leverages attention-based or Gaussian-statistics modality scores to optimize mutual information regularization. Experiments across four MoE-based VLMs and 16 benchmarks demonstrate improvement on both effectiveness and efficiency: 0.9% and 4.2% average gain on multimodal and language-only tasks, 56.1% reduction in EP communication overhead, and 12.3% throughput improvement under realistic deployment. These results validate that aligning routing with modality-aware expert specialization unlocks MoE-VLM capacity and efficiency.

[CV-69] Hierarchical Prototype-based Domain Priors for Multiple Instance Learning in Multimodal Histopathology Analysis

【速读】:该论文旨在解决数字病理学中对高分辨率全切片图像(Whole Slide Images, WSIs)复杂肿瘤微环境解析困难的问题,尤其是现有多实例学习(Multiple Instance Learning, MIL)框架因忽略组织形态语义和空间几何信息而导致的过拟合背景噪声、视觉特征与诊断知识不匹配等挑战。其解决方案的关键在于提出一种统一的多模态框架——分层原型域先验(Hierarchical Prototype-based Domain Priors, HPDP),通过引入形态锚定原型系统(Morphologically Anchored Prototype System, MAPS)将学习过程锚定于可解释的形态学聚类,并利用正弦位置编码(Sinusoidal Positional Encoder, SPE)显式建模组织架构;同时,借助大语言模型(Large Language Model, LLM)生成的描述文本,通过分层跨模态对齐模块(Hierarchical Cross-Modal Alignment, HCMA)弥合语义鸿沟,从而实现更鲁棒、可解释的组织病理学诊断与预后预测。

链接: https://arxiv.org/abs/2604.23982
作者: Xuemei Qiu,Dawei Fan,Yebin Huang,Yanping Chen,Lifang Wei
机构: Fujian Agriculture and Forestry University (福建农林大学); Fujian Provincial Hospital (福建省立医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Digital pathology has fundamentally altered diagnostic workflows by enabling the computational analysis of gigapixel Whole Slide Images (WSIs), yet effectively deciphering their complex tumor microenvironments remains a formidable challenge. Existing Multiple Instance Learning (MIL) frameworks typically treat Whole Slide Images as unstructured bags of patches, discarding critical morphological semantics and spatial geometry. This lack of inductive bias often leads to overfitting on background noise and fails to align visual features with high-level diagnostic knowledge. To overcome these limitations, we propose the Hierarchical Prototype-based Domain Priors (HPDP) framework, a unified multimodal approach for joint histopathology diagnosis and prognosis. HPDP mitigates the data-driven “black box” issue by introducing a Morphologically Anchored Prototype System (MAPS), which anchors learning to interpretable morphological clusters, and a Sinusoidal Positional Encoder (SPE) to explicitly model tissue architecture. Furthermore, we bridge the semantic gap via a Hierarchical Cross-Modal Alignment (HCMA) module, using Large Language Model (LLM)-generated descriptions to contextually refine visual representations. Extensive experiments across seven cancer cohorts demonstrate that HPDP consistently achieves state-of-the-art performance with superior robustness and interpretability.

[CV-70] Multi-View Synergistic Learning with Vision-Language Adaption for Low-Resource Biomedical Image Classification

【速读】:该论文旨在解决低资源条件下生物医学图像分类的挑战,包括标注数据稀缺、类别间视觉差异细微以及疾病语义复杂等问题。其解决方案的关键在于提出多视角协同学习(Multi-View Synergistic Learning, MVSL)框架,该框架通过解耦视觉与文本编码器的适应过程以实现更稳定的参数高效微调;引入多粒度对比学习显式建模全局图像语义与局部病灶级证据,提升对视觉相似疾病的细粒度区分能力;同时利用大语言模型生成的结构化监督信号保持疾病层级语义结构,约束文本表示并间接正则化视觉嵌入,从而增强跨模态对齐稳定性与有限监督下的分类性能。

链接: https://arxiv.org/abs/2604.23977
作者: Xiaoliu Luo,Minxue Xiao,Ting Xie,Mengzhu Wang,Huiqing Qi,Joey Tianyi Zhou,Taiping Zhang,Xu Wang
机构: Chongqing University of Technology(重庆理工大学); Collge of Computer Science, Sichuan University(四川大学计算机学院); Hebei University of Technology(河北工业大学); Nanyang Technological University(南洋理工大学); Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore(新加坡科技研究局前沿人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate biomedical image classification under low-resource conditions remains challenging due to limited annotations, subtle inter-class visual differences, and complex disease semantics. While vision–language models offer a promising foundation for mitigating data scarcity, their effective adaptation in biomedical settings is constrained by the need for parameter-efficient tuning alongside fine-grained and semantically consistent representation learning. In this work, we propose Multi-View Synergistic Learning (MVSL), a unified framework that addresses these challenges by jointly considering adaptation paradigms, representation granularity, and disease semantic relationships. MVSL decouples the adaptation of visual and textual encoders to respect their distinct representational characteristics, enabling more stable and effective parameter-efficient fine-tuning. It further introduces multi-granularity contrastive learning to explicitly model both global image semantics and localized lesion-level evidence, improving fine-grained discrimination for visually similar disease categories. In addition, MVSL preserves disease-level semantic structure by incorporating structured supervision derived from large language models, which constrains textual representations at the class level and indirectly regularizes visual embeddings through cross-modal alignment. Together, these components enable more stable cross-modal alignment and improved discrimination under limited supervision. Extensive experiments on 11 public biomedical datasets spanning 9 imaging modalities and 10 anatomical regions demonstrate that MVSL consistently outperforms state-of-the-art methods in few-shot and zero-shot classification settings.

[CV-71] LAVA: Layered Audio-Visual Anti-tampering Watermarking for Robust Deepfake Detection and Localization

【速读】:该论文旨在解决短时视频中深度伪造(deepfake)篡改检测与定位的难题,尤其针对现有方法在音频-视觉模态分离建模、压缩失真下水印信号不可靠以及多模态错位导致定位失效的问题。其解决方案的关键在于提出了一种校准感知的分层音视频抗篡改水印框架(Layered Audio-Visual Anti-tampering Watermarking, LAVA),通过跨模态水印融合与校准对齐机制,在编码压缩和音视频异步条件下仍能保持一致且可靠的篡改证据,从而实现鲁棒的篡改定位。

链接: https://arxiv.org/abs/2604.23957
作者: Bokang Zeng,Zheng Gao,Xiaoyu Li,Xiaoyan Feng,Jiaojiao Jiang
机构: UNSW Sydney (新南威尔士大学); Griffith University (格里菲斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, submitted to ACMMM 2026

点击查看摘要

Abstract:Proactive watermarking offers a promising approach for deepfake tamper detection and localization in short-form videos. However, existing methods often decouple audio and visual evidence and assume that watermark signals remain reliable under real-world degradations, making tamper localization vulnerable to multimodal misalignment and compression distortions. Moreover, existing semi-fragile visual watermarking methods often degrade significantly under codec compression because their embedding bands overlap with compression-sensitive frequency regions. To address these limitations, we propose Layered Audio-Visual Anti-tampering Watermarking (LAVA), a calibration-aware audio-visual watermark fusion framework for deepfake tamper detection and localization. LAVA leverages cross-modal watermark fusion and calibration-aware alignment to preserve consistent and reliable tamper evidence under compression and audio-visual asynchrony, enabling robust tamper localization. Extensive experiments demonstrate that LAVA achieves near-perfect detection performance (AP = 0.999), remains robust to compression and multimodal misalignment, and significantly improves tamper localization reliability over existing audio-visual fusion baselines.

[CV-72] Viewport-Unaware Blind Omnidirectional Image Quality Assessment: A Unified and Generalized Approach

【速读】:该论文旨在解决盲态全向图像质量评估(Blind Omnidirectional Image Quality Assessment, BOIQA)中存在的两大问题:一是现有方法通常依赖于视口生成(viewport generation)和质量预测两步流程,导致计算开销大且难以泛化至其他视觉内容(如2D平面图像);二是不同存储格式与用户观看行为的多样性进一步加剧了评估难度。解决方案的关键在于通过实验证明BOIQA可被重构为盲态2D图像质量评估(Blind Image Quality Assessment, BIQA)问题,从而无需显式进行视口生成,缩小了BOIQA与BIQA之间的自然差距。在此基础上,论文提出一种新的BOIQA方法,具备“无视口感知”(viewport-unaware)、“统一性”(unified,可直接用于BIQA)和“广义性”(generalized,对不同数据集具有更好泛化能力)三大优势,显著提升了模型的实用性与适应性。

链接: https://arxiv.org/abs/2604.23953
作者: Jiebin Yan,Kangcheng Wu,Jingwen Hou,Jiayu Zhang,Pengfei Chen,Yuming Fang
机构: Jiangxi University of Finance and Economics (江西财经大学); Jiangxi AI Quality Testing and Inspection Center (江西人工智能质量检测与检验中心); Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Blind omnidirectional image quality assessment (BOIQA) presents a great challenge to the visual quality assessment community, due to different storage formats and diverse user viewing behaviors. The main paradigm of BOIQA models includes two steps, ie, viewport generation, and quality prediction, which brings an extra computational burden and is hard to generalize to other visual contents (eg, 2D planar image). Thus, in this paper, we make an attempt to solve these issues. First, we experimentally find that BOIQA can be formulated as a blind (2D planar) image quality assessment (BIQA) problem, ie, the first step - viewport generation - is no longer needed, which narrows the natural gap between BOIQA and BIQA. Then, we present a new BOIQA approach, which has three merits: ie, viewport-unaware - it accepts an omnidirectional image in the widely used equirectangular projection format as input without any transformation; unified - it can also be applied to BIQA; and generalized - it shows better generalizability against other competitors. Finally, we validate its promise by held-out test, cross-database validation, and the well-established gMAD competition.

[CV-73] LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models ICLR2026

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在处理长视觉序列输入时带来的巨大计算负担问题。现有方法通过剪枝不重要的视觉token来降低计算开销,但其有效性依赖于对token重要性的准确判断。论文指出,当前基于视觉编码器或大语言模型(Large Language Models, LLMs)注意力分数的剪枝策略存在局限性:视觉编码器存在注意力聚集(attention sink)现象,难以聚焦于关键前景区域;而LLM中虽然存在位置注意力偏差,但文本到视觉的注意力机制在中间层表现出对这种偏差的鲁棒性,能够提供有效的剪枝指导。因此,该研究提出了一种两阶段的可学习剪枝框架LearnPruner——首先在视觉编码器后使用可学习模块去除冗余视觉token,随后在LLM中间层保留任务相关的token,从而实现高精度与高效率之间的最优平衡。

链接: https://arxiv.org/abs/2604.23950
作者: Rinyoichi Takezoe,Yaqian Li,Zihao Bo,Anzhou Hou,Mo Guang,Kaiwen Long
机构: Li Auto Inc. (小鹏汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICLR 2026

点击查看摘要

Abstract:Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this issue by pruning unimportant visual tokens, achieving substantial computational reduction while maintaining model performance. The core of token pruning lies in determining token importance, with current approaches primarily relying on attention scores from vision encoders or Large Language Models (LLMs). In this paper, we analyze the effectiveness of attention mechanisms in both vision encoders and LLMs. We find that vision encoders suffer from attention sink, leading to poor focus on informative foreground regions, while in LLMs, although prior studies have identified attention bias toward token positions, text-to-vision attention demonstrates resistance to this bias and enables effective pruning guidance in middle layers. Based on these observations, we propose LearnPruner, a two-stage token pruning framework that first removes redundant vision tokens via a learnable pruning module after the vision encoder, then retains only task-relevant tokens in the LLM’s middle layer. Experimental results show that our LearnPruner can preserve approximately 95% of the original performance while using only 5.5% of vision tokens, and achieve 3.2 \times inference acceleration, demonstrating a superior accuracy-efficiency trade-off.

[CV-74] GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction

【速读】:该论文旨在解决在资源受限设备(如移动手机)上部署图形用户界面(GUI)元素定位能力的难题,当前主流视觉语言模型(VLM)参数量庞大(超过25亿),难以满足低延迟和轻量化需求。解决方案的关键在于提出一种仅含2.3亿参数的轻量级VLM——GoClick,其采用编码器-解码器架构而非简单的decoder-only结构,在小规模模型下实现了更优的定位精度;同时设计了渐进式数据精炼流程(Progressive Data Refinement),通过任务类型过滤与数据比例调整从1080万原始样本中提取出高质量的380万核心训练集,显著提升了模型性能。实验表明,GoClick在多个GUI元素定位基准测试中表现优异,并能有效增强设备-云端协同框架中GUI代理的整体成功率。

链接: https://arxiv.org/abs/2604.23941
作者: Hongxin Li,Yuntao Chen,Zhaoxiang Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report

点击查看摘要

Abstract:Graphical User Interface (GUI) element grounding (precisely locating elements on screenshots based on natural language instructions) is fundamental for agents interacting with GUIs. Deploying this capability directly on resource-constrained devices like mobile phones is increasingly critical for GUI agents requiring low latency. However, this goal faces a significant challenge, as current visual grounding methods typically employ large vision-language model (VLM) (more than 2.5B parameters), making them impractical for on-device execution due to memory and computational constraints. To address this, this paper introduces GoClick, a lightweight GUI element grounding VLM with only 230M parameters that achieves excellent visual grounding accuracy, even on par with significantly larger models. Simply downsizing existing decoder-only VLMs is a straightforward way to design a lightweight model, but our experiments reveal that this approach yields suboptimal results. Instead, we select an encoder-decoder architecture, which outperforms decoder-only alternatives at small parameter scales for GUI grounding tasks. Additionally, the limited capacity of small VLMs encourages us to develop a Progressive Data Refinement pipeline that utilizes task type filtering and data ratio adjustment to extract a high-quality 3.8M-sample core set from a 10.8M raw dataset. Training GoClick using this core set brings notable grounding accuracy gains. Our experiments show that GoClick excels on multiple GUI element grounding benchmarks while maintaining a small size and high inference speed. GoClick also enhances GUI agent performance when integrated into a device-cloud collaboration framework, where GoClick helps cloud-based task planners perform precise element localization and achieve higher success rates. We hope our method serves as a meaningful exploration within the GUI agent community.

[CV-75] 2nd of the 5th PVUW MeViS-Audio Track: ASR-SaSaSa2VA

【速读】:该论文旨在解决音频驱动视频目标分割(Audio-based Video Object Segmentation)中计算资源消耗大、音频与视频时序对齐困难以及依赖大规模音视频配对数据集等问题。其解决方案的关键在于将音频输入通过自动语音识别(ASR)模型转换为文本形式的运动描述,并利用预训练的基于文本的指代表达视频分割模型(如SaSaSa2VA)进行像素级预测,从而降低计算复杂度并提升对动态视频内容的适应性;同时引入一个经过微调的基于音频的多模态大语言模型(MLLM)作为无目标表达检测模块,以过滤掉不指向任何目标对象的音频片段,增强系统在面对模糊或无关音频时的鲁棒性。

链接: https://arxiv.org/abs/2604.23935
作者: Zhiyu Wang,Xudong Kang,Shutao Li
机构: Hunan University (湖南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages

点击查看摘要

Abstract:Audio-based video object segmentation aims to locate and segment objects in videos conditioned on audio cues, requiring precise understanding of both appearance and motion. Recent audio-driven video segmentation methods extend MLLMs by fusing audio and visual features for end-to-end localization. Despite their promise, these approaches are computationally intensive, struggle with aligning temporal audio cues to dynamic video content, and depend on large paired audio-video datasets. To address these challenges, we present ASR-SaSaSa2VA, a resource-efficient framework for audio-guided video segmentation. The key idea is to convert audio inputs into textual motion descriptions via automatic speech recognition (ASR) models and then leverage pre-trained text-based referring video segmentation models (e.g., SaSaSa2VA) for pixel-level predictions. To further enhance robustness, we incorporate a no-target expression detection module, implemented by a fine-tuned audio-based MLLM, which filters out audio clips that do not refer to any target object. This design allows the system to exploit strong pre-trained models while effectively handling ambiguous or irrelevant audio inputs. Our approach achieves a final score of 80.7 in the 5th PVUW Challenge (MeViS-v2-Audio track), earning the second-place ranking.

[CV-76] AMAVA: Adaptive Motion-Aware Video-to-Audio Framework for Visually-Impaired Assistance ICPR

【速读】:该论文旨在解决视障及低视力人群在导航过程中因持续、无差别的感官反馈而导致的认知过载问题,尤其是在动态现实环境中难以有效感知环境信息的挑战。解决方案的关键在于提出一种名为AMAVA(Audio-Visual Mapping for Accessibility)的实时视频转音频框架,其核心创新包括:1)采用轻量级AI分类模型实现运动感知,区分低运动与高运动场景;2)基于解码器-only的视觉-语言Transformer模型(融合专家混合与跨模态注意力机制)进行视觉理解,并结合神经文本转语音(text-to-speech)和自然声音合成网络生成语义相关的声音输出;3)引入基于提示的缓存机制与类别特定的节流策略,以减少听觉杂乱并优化延迟。该方案在静态环境中提供情境意识的语音描述,在高运动场景中优先输出安全相关的声学提示(如危险警报),从而显著提升用户信心与感知安全性。

链接: https://arxiv.org/abs/2604.23909
作者: Benjamin Klein,Kazi Ruslan Rahman,Sanchita Ghose
机构: San Francisco State University(旧金山州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 7 figures. Published in the Proceedings of the 15th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2026), pages 282–289

点击查看摘要

Abstract:Navigational aids for blind and low vision individuals struggle conveying dynamic real-world environments, leading to cognitive overload from continuous, undifferentiated feedback. We present AMAVA, a novel real-time video-to-audio framework that converts mobile device video into contextually relevant sound effects or text-to-speech descriptions. We propose a motion-aware pipeline using a lightweight AI classification model to distinguish between low and high-movement scenes followed by a real-time text-to-audio synthesis pipeline to enhance environmental perception more efficiently. In static environments, AMAVA generates spoken audio scene descriptions for situational awareness. In high-movement situations, it prioritizes safety by delivering sound cues, such as spoken hazard alerts and environmental sound effects. These audio outputs are produced by a decoder-only transformer-based vision-language model with mixture-of-experts and cross-modal attention for visual understanding, in conjunction with neural text-to-speech and natural sound synthesis networks. The proposed framework uses prompt-based caching and category-specific throttling to avoid auditory clutter and minimize latency. We present a comprehensive evaluation of the system, including a real-time navigation study comparing a white cane alone versus with AMAVA, that shows a significant increase in user confidence and perceived safety.

[CV-77] Mammographic Lesion Segmentation with Lightweight Models: A Comparative Study

【速读】:该论文旨在解决乳腺癌筛查中基于深度学习的病灶分割模型在资源受限环境中部署困难的问题,其关键解决方案是采用轻量化神经网络架构(如MobileNetV2、EfficientNet Lite、ENet和Fast-SCNN),以在保持较高分割性能的同时显著降低模型参数量和计算复杂度。实验表明,引入Squeeze-and-Excitation模块的MobileNetV2在INbreast数据集上实现了Dice分数0.5766,同时仅需U-Net约25%的参数量,展现出良好的实用性和可部署性。

链接: https://arxiv.org/abs/2604.23899
作者: Helder Oliveira
机构: GnosisX(独立研究机构)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Submitted to SPIE JMI

点击查看摘要

Abstract:Breast cancer is a leading cause of cancer-related mortality among women worldwide, with mammography as the primary screening tool. While deep learning models have shown strong performance in lesion segmentation, most rely on computationally intensive architectures that limit their use in resource-constrained environments. This study evaluates the performance and efficiency of lightweight models for mammographic lesion segmentation. Architectures including MobileNetV2, EfficientNet Lite, ENet, and Fast-SCNN were compared against a U-Net baseline using the INbreast dataset with 5-fold cross-validation. Performance was assessed using Dice score, Intersection over Union (IoU), and Recall, alongside model complexity. MobileNetV2 with Squeeze-and-Excitation (SCSE) achieved the best performance, with a Dice score of 0.5766 while using approximately 75% fewer parameters than U-Net. Cross-dataset evaluation on the DMID dataset showed reduced accuracy due to domain shift but preserved recall. These results demonstrate that lightweight architectures offer a practical balance between performance and efficiency for deployable CAD systems.

[CV-78] Risk-Aware Robust Learning: Reducing Clinical Risk under Label Noise in Medical Image Classification

【速读】:该论文旨在解决医学图像分类中标签噪声(label noise)对模型临床安全性的影响问题,特别是现有噪声鲁棒训练方法虽能提升整体准确率,却可能因忽视假阴性(false negative)的高临床代价而危及患者安全。其解决方案的关键在于引入成本敏感的全局风险(Global Risk)评估框架,在噪声鲁棒训练基础上融合成本敏感优化(cost-sensitive optimization),从而在保持模型性能的同时显著降低假阴性带来的临床风险,实现从单纯精度导向到临床风险导向的范式转变。

链接: https://arxiv.org/abs/2604.23875
作者: Maycon R. S. Pereira,Filipe R. Cordeiro
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at SBCAS’26

点击查看摘要

Abstract:Noisy labels are a pervasive challenge in medical image classification, where annotation errors arise from inter-observer variability and diagnostic ambiguity. Although several noise-robust learning methods have been proposed, their evaluation predominantly relies on accuracy-oriented metrics, overlooking the clinical implications of asymmetric error costs. In medical diagnosis, a false negative (missed disease) carries substantially higher consequences than a false positive (false alarm), as delayed treatment can directly impact patient outcomes. In this work, we investigate whether noise-robust training methods preserve clinical safety under label noise. We conduct a systematic risk-aware evaluation of the state-of-the-art noise-robust methods Coteaching, DivideMix, UNICON, and a GMM-based filtering approach on binarized DermaMNIST and PathMNIST datasets under clean and label noise rates of 20%, and 40%. Beyond balanced accuracy, we adopt a cost-sensitive Global Risk formulation that explicitly penalizes false negatives. Our analysis reveals that the robustness of state-of-the-art methods does not guarantee clinical safety. Furthermore, we demonstrate that integrating cost-sensitive optimization into noise-robust training significantly reduces clinical risk, while mantaining model utility. These findings demonstrate that noise-robust learning must be evaluated through a clinical risk lens, and that combining robust training with cost-sensitive optimization can meaningfully reduce risk in noisy-label medical imaging scenarios.

[CV-79] Empirical Ablation and Ensemble Optimization of a Convolutional Neural Network for CIFAR-10 Classification

【速读】:该论文旨在解决卷积神经网络(Convolutional Neural Networks, CNNs)在图像分类任务中性能提升受限于架构与训练策略选择的问题,尤其关注哪些优化手段能真正改善泛化能力而非仅增加模型复杂度。其解决方案的关键在于通过系统性的消融实验(ablation-based study),对17种渐进式修改(包括训练时长、学习率调度、Dropout配置、池化策略、网络深度、滤波器布局及全连接层设计等)进行量化评估,发现性能提升主要依赖于精心挑选的训练和结构优化,而非盲目增加网络深度或参数量;最终基于最优个体配置构建加权集成模型,在CIFAR-10数据集上分别达到86.38%(小样本)和89.23%(全数据)的准确率,验证了基于实证的优化与集成学习方法在小图像分类中的有效性。

链接: https://arxiv.org/abs/2604.23861
作者: Naser Khatti Dizabadi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Convolutional neural networks (CNNs) remain a central approach in image classification, but their performance depends strongly on architectural and training choices. This paper presents an empirical ablation-based study of CNN optimization for the CIFAR-10 benchmark. The study evaluates 17 progressive modifications involving training duration, learning-rate scheduling, dropout configuration, pooling strategy, network depth, filter arrangement, and dense-layer design. The goal is to identify which changes improve generalization and which increase complexity without improving performance. The baseline model achieved 79.5% test accuracy. Extending training duration improved performance steadily, whereas several structural redesigns reduced accuracy despite greater architectural variation. Based on the strongest individual configurations, a weighted ensemble was constructed, achieving 86.38% accuracy in the reduced-data setting and 89.23% when trained using the full CIFAR-10 dataset. These results suggest that performance gains in CNN-based classification depend less on indiscriminate increases in depth or parameter count than on careful empirical selection of training and architectural modifications. The study therefore highlights the practical value of ablation-oriented optimization and ensemble learning for small-image classification.

[CV-80] Exploring Audio Hallucination in Egocentric Video Understanding ICASSP2026

【速读】:该论文旨在解决生成式 AI (Generative AI) 在处理第一人称视角视频(egocentric video)时出现的音频幻觉(audio hallucination)问题,即模型在缺乏实际音频输入的情况下,错误地从视觉线索中推断出不存在的声音。解决方案的关键在于构建一个系统且自动化的评估框架,通过针对性的问题-回答(Q/A)协议对模型输出进行量化分析,并设计了一个包含300个视频和1000个声音相关问题的数据集;同时提出了一种基于语义的分类体系,将声音区分为前景动作声(foreground action sounds)与背景环境声(background ambient sounds),从而精准识别和衡量模型在不同声音类型上的幻觉率,结果显示先进多模态大语言模型如Qwen2.5 Omni在前景和背景声音任务上的准确率分别仅为27.3%和39.5%,凸显了可靠评估在提升多模态模型鲁棒性中的必要性。

链接: https://arxiv.org/abs/2604.23860
作者: Ashish Seth,Xinhao Mei,Changsheng Zhao,Varun Nagaraja,Ernie Chang,Gregory P. Meyer,Gael Le Lan,Yunyang Xiong,Vikas Chandra,Yangyang Shi,Dinesh Manocha,Zhipeng Cai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ICASSP 2026

点击查看摘要

Abstract:Egocentric videos provide a distinctive setting in which sound serves as crucial cues to understand user activities and surroundings, particularly when visual information is unstable or occluded due to continuous camera movement. State-of-the-art large audio-visual language models (AV-LLMs) can generate multimodal descriptions. However, we show in this work that they are prone to audio hallucinations, often inferring sounds from visual cues that are visible but not heard. We present a systematic and automatic evaluation framework for analyzing audio hallucinations in egocentric video through a targeted question-answering (Q/A) protocol. We curate a dataset of 300 egocentric videos and design 1,000 sound-focused questions to probe model outputs. To characterize hallucinations, we propose a grounded taxonomy that distinguishes between foreground action sounds from the user activities and background ambient sounds. Our evaluation shows that advanced AV-LLMs, such as Qwen2.5 Omni, exhibit high hallucination rates, achieving only 27.3% and 39.5% accuracy on Q/As related to foreground and background sounds, respectively. With this work, we highlight the need to measure the reliability of multimodal responses, emphasizing that robust evaluation of hallucinations is essential to develop reliable AV-LLMs.

[CV-81] Latent Inter-Frame Pruning: A Training-Free Method Bridging Traditional Video Compression and Modern Diffusion Transformers for Efficient Generation

【速读】:该论文旨在解决视频生成模型在计算效率上的瓶颈问题,即当前基于潜空间扩散模型(Latent Diffusion Model, LDM)的视频生成方法因冗余的时序计算而导致推理速度慢、难以支持实时应用。其核心解决方案是提出一种无需重新训练的潜空间帧间剪枝(Training-free Latent Inter-frame Pruning)框架,通过识别并跳过重复的潜变量块(latent patches)来减少冗余计算;同时为缓解训练与剪枝推理之间的不一致性导致的视觉伪影,引入注意力恢复机制(Attention Recovery),以弥合训练-推理差距。该方法在保持视频质量的前提下,将视频编辑吞吐量提升至12.44 FPS(NVIDIA RTX 6000),实现了显著的性能优化。

链接: https://arxiv.org/abs/2604.23858
作者: Dennis Menn,Chih-Hsien Chou
机构: The University of Texas at Austin (得克萨斯大学奥斯汀分校); Futurewei Technologies, Inc. (未来wei技术公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video generation, while capable of generating realistic videos, is computationally expensive and slow, prohibiting real-time applications. In this paper, we observe that video latents encoded via an autoencoder under the Latent Diffusion Model (LDM) framework contain redundancy along the temporal axis. Analogous to how traditional video compression algorithms avoid transmitting redundant frame data, we propose the Latent Inter-frame Pruning framework to prune (skip the re-computation of) duplicated latent patches, thereby reducing computational burden and increasing throughput. However, direct pruning results in visual artifacts due to the discrepancy between full-sequence training and pruned inference. To resolve these artifacts, we propose an Attention Recovery mechanism to bridge the train-inference gap. With our proposed method, we increase video editing throughput by 1.44 \times , achieving 12.44 FPS on an NVIDIA RTX 6000 while maintaining video quality. We hope our work inspires further research into integrating traditional video compression methods with modern video generation pipelines. This work is a preliminary work on Training-free Latent Inter-Frame Pruning with Attention Recovery.

[CV-82] Focus on What Matters: Two-Stage ROI-Aware Refinement for Anatomy-Preserving Fetal Ultrasound Reconstruction

【速读】:该论文旨在解决医学超声图像重建中因小感兴趣区域(ROI)主导临床决策而导致全局重建指标(如PSNR、SSIM)无法准确反映临床可用性的问题。其核心挑战在于跨医院域偏移下,如何在保持整体图像结构忠实性的同时,精准提升关键ROI的细节质量。解决方案的关键在于提出一种ROI-aware表示学习框架:首先通过多尺度结构相似性(MS-SSIM)约束训练双阶段卷积自编码器(CAE),学习全局一致的128维潜在编码;随后利用强度(L1)和归一化Sobel边缘约束对NT ROI进行精细化重构。为自动平衡异构损失目标而无需人工调参,采用基于梯度幅值的校准方法初始化损失权重,从而实现端到端优化。实验表明,该方法在多个医院数据集上均显著提升ROI区域的测量精度,并增强模型在未见医院场景下的泛化能力。

链接: https://arxiv.org/abs/2604.23839
作者: Ines Abbes,Mahmood Alzubaidi,Mowafa Househ,Khalid Alyafei,Marco Agus,Samir Brahim Belhaouari
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 7 figures, multiple tables. Preprint submitted to arXiv

点击查看摘要

Abstract:Measurement-critical ultrasound tasks often depend on a small anatomical region, making global reconstruction metrics an unreliable proxy for clinical fidelity. We propose an ROI-aware representation learning framework and instantiate it for first-trimester nuchal translucency (NT) screening under multi-hospital domain shift. A two-phase convolutional autoencoder (CAE) first learns a globally faithful 128-D latent code via MS-SSIM, then refines the NT ROI using intensity (L1) and normalized Sobel-edge constraints. To combine these heterogeneous objectives without manual tuning, we initialize loss weights via gradient-based calibration from per-term gradient magnitudes. Under strict hospital-wise evaluation with one hospital held out, ROI refinement improves both global and measurement-relevant quality: on the standard dev split it increases PSNR by +0.27 dB (val) and +0.29 dB (held-out test), reduces ROI MAE by 8.87% (val) and 6.43% (held-out test), and reduces ROI Edge-MAE by 11.10% on source hospitals and 4.90% on the unseen hospital. Beyond reconstruction, frozen-latent probes provide additional evidence of generalization: hospital provenance becomes less confidently predictable on the unseen site (0.556 to 0.541 max-softmax; 0.684 to 0.688 entropy) while OOD detection remains strong across site-held-out protocols (Mahalanobis AUROC up to 0.9956, with modest KNN gains in challenging splits). The same ROI-aware refinement principle is anatomy-agnostic and can be adopted for other fetal biometry targets (e.g., crown-rump length (CRL), nasal bone (NB)) and broader medical imaging settings where small ROIs dominate clinical decisions.

[CV-83] Mapping License Plate Recoverability Under Extreme Viewing Angles for Oppor-tunistic Urban Sensing

【速读】:该论文旨在解决在城市环境中利用机会感知(opportunistic sensing)模式下,从低质量、高噪声且视角极端的图像中可靠恢复目标信息的问题,特别是针对车牌识别等下游任务的可行性边界判定。其核心挑战在于识别哪些退化参数允许有效恢复,而哪些会导致推理失败。解决方案的关键是提出了一种任务无关的“可恢复性图谱”(recoverability maps),通过密集合成退化参数扫描结合两个量化指标——边界曲线下面积(boundary area-under-curve)用于估计可恢复参数空间的比例,以及可靠性评分(reliability score)衡量该区域内失败发生的频率与严重程度,从而系统性地刻画恢复能力的边界。实验表明,不同超分辨率架构(如U-Net、Restormer、Pix2Pix和SR3扩散模型)在相同条件下表现相似,说明感知几何(sensing geometry)而非网络结构才是限制恢复性能的根本因素。

链接: https://arxiv.org/abs/2604.23814
作者: Igor Adamenko,Orpaz Ben Aharon,Yehudit Aperstein,Alexander Apartsin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 8 figures

点击查看摘要

Abstract:Urban environments contain many imaging sensors built for specific purposes, including ATM, body-worn, CCTV, and dashboard cameras. Under the opportunistic sensing paradigm, these sensors can be repurposed for secondary inference tasks such as license plate recognition. Yet objects of interest in such imagery are often noisy, low-resolution, and captured from extreme viewpoints. Recent advances in AI-based restoration can recover use-ful information even from severely degraded images. A central challenge is determining which distortion parame-ters allow reliable recovery and which lead to inference failure. This paper introduces recoverability maps, a task-agnostic method for quantifying this boundary. The method combines a dense synthetic sweep of degrada-tion parameters with two summary measures: boundary area-under-curve, which estimates the recoverable frac-tion of the parameter space, and a reliability score, which captures the frequency and severity of failures within that region. We demonstrate the method on license plate recognition from highly angled views under realistic camera artifacts. Several restoration architectures are trained and evaluated, including U-Net, Restormer, Pix2Pix, and SR3 diffusion. The best model recovers about 93% of the parameter space. Similar results across models sug-gest that sensing geometry, rather than architecture, sets the limit of recovery.

[CV-84] Bringing a Personal Point of View: Evaluating Dynamic 3D Gaussian Splatting for Egocentric Scene Reconstruction CVPR2026

【速读】:该论文旨在解决现有动态单目3D高斯溅射(Dynamic Monocular 3D Gaussian Splatting, 3DGS)模型在第一人称视角视频(egocentric video)上重建质量下降的问题,即这些模型是否能有效泛化到egocentric场景,还是需要专门针对该视角设计解决方案。其关键发现是:重建质量差异主要源于静态内容的重建性能下降,而非动态物体;这表明当前方法在处理egocentric视频中由快速相机运动和复杂场景动态带来的静态结构建模方面存在局限,从而强调了开发面向第一人称视角特异性优化策略的重要性,并建议未来评估应区分静态与动态区域以更精确地衡量模型表现。

链接: https://arxiv.org/abs/2604.23803
作者: Jan Warchocki,Xi Wang,Jonas Kulhanek,Jan van Gemert
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the EgoVis Workshop at CVPR 2026

点击查看摘要

Abstract:Egocentric video provides a unique view into human perception and interaction, with growing relevance for augmented reality, robotics, and assistive technologies. However, rapid camera motion and complex scene dynamics pose major challenges for 3D reconstruction from this perspective. While 3D Gaussian Splatting (3DGS) has become a state-of-the-art method for efficient, high-quality novel view synthesis, variants, that focus on reconstructing dynamic scenes from monocular video are rarely evaluated on egocentric video. It remains unclear whether existing models generalize to this setting or if egocentric-specific solutions are needed. In this work, we evaluate dynamic monocular 3DGS models on egocentric and exocentric video using paired ego-exo recordings from the EgoExo4D dataset. We find that reconstruction quality is consistently lower in egocentric views. Analysis reveals that the difference in reconstruction quality, measured in peak signal-to-noise ratio, stems from the reconstruction of static, not dynamic, content. Our findings underscore current limitations and motivate the development of egocentric-specific approaches, while also highlighting the value of separately evaluating static and dynamic regions of a video.

[CV-85] VitaminP: cross-modal learning enables whole-cell segmentation from routine histology

【速读】:该论文旨在解决在常规苏木精-伊红(Hematoxylin and Eosin, HE)染色图像中因细胞质对比度不足而导致的全细胞分割精度受限问题,从而推动精准病理学与空间组学分析的发展。其核心解决方案是提出VitaminP,一个跨模态学习框架,通过从配对的HE-mIF(多重免疫荧光)数据中学习,将mIF提供的分子边界信息迁移至HE图像,以恢复缺失的细胞结构细节,实现了基于HE图像的高精度全细胞分割。该方法的关键在于利用跨模态监督机制,有效克服了HE染色在细胞质区域信息缺失的局限性,且具有良好的泛化能力,适用于多种癌症类型和未见数据集。

链接: https://arxiv.org/abs/2604.23799
作者: Yasin Shokrollahi,Karina B. Pinao Gonzales,Elizve N. Barrientos Toro,Paul Acosta,Patient Mosaic Team,Pingjun Chen,Yinyin Yuan,Xiaoxi Pan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 44 pages, 10 figures. Code and models available

点击查看摘要

Abstract:Accurate whole-cell and nuclear segmentation is essential for precision pathology and spatial omics, yet routine hematoxylin and eosin (HE) staining provides limited cytoplasmic contrast, restricting analyses to nuclei. Multiplex immunofluorescence (mIF) facilitates precise whole-cell delineation but remains constrained by cost and accessibility. We introduce VitaminP, a cross-modal learning framework enabling whole cell segmentation from HE images. By learning from paired HE-mIF data, VitaminP transfers molecular boundary information from mIF to overcome cytoplasmic contrast in HE, establishing cross-modal supervision as a general strategy for recovering missing biological structure. We train VitaminP on 14 public datasets covering 34 cancer types and over 7 million instances, integrating publicly available labels with extensive annotations generated in this study, forming one of the largest resources for segmentation. VitaminP outperforms four state-of-the-art methods and generalizes to unseen datasets, including an in-house dataset spanning 24 rare cancer types. We further developed VitaminPScope, an open-source platform providing an interface for scalable inference and enabling broad adoption.

[CV-86] ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers CVPR

【速读】:该论文旨在解决现有注意力加速器在保持精确softmax语义、支持高精度浮点计算(FP32)以及降低并行深度(parallel depth)方面的局限性问题。当前方法通常依赖于Tensor Core融合内核,或因顺序深度限制导致长序列下FP32吞吐量下降,难以在不同硬件平台(如A100 GPU与边缘设备Jetson TX2)上实现统一高效部署。解决方案的关键在于提出ELSA(Exact Low-Depth Softmax Attention),其核心创新包括:(i)通过算法重构在线softmax更新过程,在实数运算中保留精确softmax语义,并提供O(ulogn)\mathcal{O}(u\log n)的FP32相对误差理论保证;(ii)将更新操作建模为一个关于关联幺半群(m,S,W)(m, S, W)的前缀扫描(prefix scan),从而将并行深度降至O(logn)\mathcal{O}(\log n),仅需额外O(n)O(n)内存;(iii)完全独立于Tensor Core指令集,可在Triton和CUDA C++中实现,无需模型重训练或权重修改,具备跨平台可移植性。这使得ELSA成为首个在全精度下同时实现低并行深度与硬件无关性的精确注意力内核。

链接: https://arxiv.org/abs/2604.23798
作者: Chih-Chung Hsu,Xin-Di Ma,Wo-Ting Liao,Chia-Ming Lee
机构: National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPRF2026

点击查看摘要

Abstract:Existing attention accelerators often trade exact softmax semantics, depend on fused Tensor Core kernels, or incur sequential depth that limits FP32 throughput on long sequences. We present \textbfELSA, an algorithmic reformulation of online softmax attention that (i)~preserves exact softmax semantics in real arithmetic with a \emphprovable \mathcalO(u\log n) FP32 relative error bound; (ii)~casts the online softmax update as a prefix scan over an associative monoid (m,S,W) , yielding O(n) extra memory and O(\log n) parallel depth; and (iii)~is Tensor-Core independent, implemented in Triton and CUDA C++, and deployable as a \emphdrop-in replacement requiring no retraining or weight modification. Unlike FlashAttention-2/3, which rely on HMMA/GMMA Tensor Core instructions and provide no compatible FP32 path, ELSA operates identically on A100s and resource-constrained edge devices such as Jetson TX2 – making it the only hardware-agnostic exact-attention kernel that reduces parallel depth to O(\log n) at full precision. On A100 FP32 benchmarks (1K–16K tokens), ELSA delivers 1.3 – 3.5\times speedup over memory-efficient SDPA and 1.97 – 2.27\times on BERT; on Jetson TX2, ELSA achieves 1.5 – 1.6\times over Math (64–900 tokens), with 17.8 – 20.2% throughput gains under LLaMA-13B offloading at \ge 32K. In FP16, ELSA approaches hardware-fused baselines at long sequences while retaining full FP32 capability, offering a unified kernel for high-precision inference across platforms. Our code and implementation are available at this https URL.

[CV-87] MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

【速读】:该论文旨在解决当前生成式AI在电影级多镜头视频生成中的三大核心挑战:真实叙事逻辑缺失、时空文本-视频对齐冲突,以及Subject-to-Video(S2V)生成中普遍存在的“复制粘贴”问题(即跨镜头内容重复、缺乏连贯性)。其解决方案的关键在于提出一个大规模双轨数据集MuSS,该数据集涵盖3000余部电影,明确支持复杂蒙太奇过渡和以主体为中心的叙事结构;并通过创新的渐进式标注流程确保局部镜头准确性后再强化全局叙事一致性,同时引入跨镜头匹配机制从根本上消除S2V生成中的“复制粘贴”捷径。此外,论文还构建了基于视觉逻辑驱动的电影叙事基准(Cinematic Narrative Benchmark),并提出抗复制粘贴方差(ACP-Var)指标,用于量化评估连续叙事能力和三维结构一致性,从而推动多镜头视频生成从碎片化走向连贯性叙事。

链接: https://arxiv.org/abs/2604.23789
作者: Haojie Zhang,Di Wu,Bingyan Liu,Linjie Zhong,Yuancheng Wei,Xingsong Ye,Nanqing Liu,Yaling Liang
机构: South China University of Technology (华南理工大学); Fudan University (复旦大学); Yunnan Normal University (云南师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 9 figues

点击查看摘要

Abstract:While video foundation models excel at single-shot generation, real-world cinematic storytelling inherently relies on complex multi-shot sequencing. Further progress is constrained by the absence of datasets that address three core challenges: authentic narrative logic, spatiotemporal text-video alignment conflicts, and the “copy-paste” dilemma prevalent in Subject-to-Video (S2V) generation. To bridge this gap, we introduce MuSS, a large-scale, dual-track dataset tailored for multi-shot video and S2V generation. Sourced from over 3,000 movies, MuSS explicitly supports both complex montage transitions and subject-centric narratives. To construct this dataset, we pioneer a progressive captioning pipeline that eliminates contextual conflicts by ensuring local shot-level accuracy before enforcing global narrative coherence. Crucially, we implement a cross-shot matching mechanism to fundamentally eradicate the S2V copy-paste shortcut. Alongside the dataset, we propose the Cinematic Narrative Benchmark, featuring a visual-logic-driven paradigm and a novel Anti-Copy-Paste Variance (ACP-Var) metric to rigorously assess continuous storytelling and 3D structural consistency. Extensive experiments demonstrate that while current baselines struggle with continuous narrative logic or degenerate into trivial 2D sticker generators, our MuSS-augmented model achieves state-of-the-art narrative effectiveness and cross-shot identity preservation.

[CV-88] ClawMark: A Living-World Benchmark for Multi-Turn Multi-Day Multimodal Coworker Agents

【速读】:该论文旨在解决当前语言模型代理(language-model agents)在长期、多日工作流中面对动态环境变化时的评估不足问题。现有基准测试通常局限于单一静态场景且以文本为中心,无法真实反映代理作为持续协作同事(persistent coworkers)在现实环境中应对环境状态更新(如邮件、日历、知识库等变化)的能力。解决方案的关键在于构建一个名为 \bench 的新型基准,其核心特征包括:(1)围绕多轮次、跨多日的任务设计;(2)基于状态化的沙箱服务环境(filesystem, email, calendar, knowledge base, spreadsheet),该环境在每轮交互间可发生外生变化;(3)采用规则驱动的确定性验证机制(1537个Python检查器)对执行后服务状态进行评分,无需依赖大语言模型作为评判者。此设计使评估更贴近真实应用场景,并揭示出当前先进代理系统在适应环境演化方面仍存在显著挑战。

链接: https://arxiv.org/abs/2604.23781
作者: Fanqing Meng,Lingxiao Du,Zijian Wu,Guanzheng Chen,Xiangyan Liu,Jiaqi Liao,Chonghe Jiang,Zhenglin Wan,Jiawei Gu,Pengfei Zhou,Rui Huang,Ziqi Zhao,Shengyuan Ding,Ailing Yu,Bo Peng,Bowei Xia,Hao Sun,Haotian Liang,Ji Xie,Jiajun Chen,Jiajun Song,Liu Yang,Ming Xu,Qionglin Qiu,Runhao Fu,Shengfang Zhai,Shijian Wang,Tengfei Ma,Tianyi Wu,Weiyang Jin,Yan Wang,Yang Dai,Yao Lai,Youwei Shu,Yue Liu,Yunzhuo Hao,Yuwei Niu,Jinkai Huang,Jiayuan Zhuo,Zhennan Shen,Linyu Wu,Cihang Xie,Yuyin Zhou,Jiaheng Zhang,Zeyu Zheng,Mengkang Hu,Michael Qizhe Shieh
机构: Evolvent AI; National University of Singapore; Massachusetts Institute of Technology; The University of Hong Kong; University of California, Berkeley; University of Washington; The Chinese University of Hong Kong; The Hong Kong University of Science and Technology; The Hong Kong Polytechnic University; Peking University; Fudan University; Shanghai Jiao Tong University; University of Science and Technology of China; Zhejiang University; Renmin University of China; Hunan University; Tongji University; University of Electronic Science and Technology of China; Southeast University; Anhui University; University of California, Santa Cruz; Independent Researcher
类目: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注: github repo: this https URL

点击查看摘要

Abstract:Language-model agents are increasingly used as persistent coworkers that assist users across multiple working days. During such workflows, the surrounding environment may change independently of the agent: new emails arrive, calendar entries shift, knowledge-base records are updated, and evidence appears across images, scanned PDFs, audio, video, and spreadsheets. Existing benchmarks do not adequately evaluate this setting because they typically run within a single static episode and remain largely text-centric. We introduce \bench, a benchmark for coworker agents built around multi-turn multi-day tasks, a stateful sandboxed service environment whose state evolves between turns, and rule-based verification. The current release contains 100 tasks across 13 professional scenarios, executed against five stateful sandboxed services (filesystem, email, calendar, knowledge base, spreadsheet) and scored by 1537 deterministic Python checkers over post-execution service state; no LLM-as-judge is invoked during scoring. We benchmark seven frontier agent systems. The strongest model reaches 75.8 weighted score, but the best strict Task Success is only 20.0%, indicating that partial progress is common while complete end-to-end workflow completion remains rare. Turn-level analysis shows that performance drops after the first exogenous environment update, highlighting adaptation to changing state as a key open challenge. We release the benchmark, evaluation harness, and construction pipeline to support reproducible coworker-agent evaluation.

[CV-89] From Noisy Historical Maps to Time-Series Oil Palm Mapping Without Annotation in Malaysia and Indonesia (2020-2024)

【速读】:该论文旨在解决东南亚地区油棕榈(Oil Palm)种植园监测中因现有地图空间分辨率低和缺乏近期时间覆盖而导致的快速土地利用变化难以有效监管的问题。其解决方案的关键在于提出一种基于深度学习的框架,利用Sentinel-2遥感影像在不依赖新人工标注的情况下生成2020–2024年期间分辨率为10米的油棕榈种植园分布图;为克服历史标签(100米分辨率)与高分辨率影像(10米)之间的尺度不匹配问题,采用了优化的U-Net架构,并引入基于行列式互信息(Determinant-based Mutual Information, DMI)的损失函数以有效抑制标签噪声的影响。

链接: https://arxiv.org/abs/2604.23776
作者: Nuttaset Kuapanich,Juepeng Zheng,Bohan Shi,Jiaying Liu,Jiayin Jiang,Jiatao Huang,Shenghan Tan,Qingmei Li,Haohuan Fu
机构: Sun Yat-Sen University (中山大学); Northeastern University (东北大学); Tsinghua University (清华大学); Shenzhen (深圳)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate monitoring of oil palm plantations is critical for balancing economic development with environmental conservation in Southeast Asia. However, existing plantation maps often suffer from low spatial resolution and a lack of recent temporal coverage, impeding effective surveillance of rapid land-use changes. In this study, we propose a deep learning framework to generate 10-meter resolution oil palm plantation maps for Indonesia and Malaysia from 2020 to 2024, utilizing Sentinel-2 imagery without requiring new manual annotations. To address the resolution mismatch between coarse 100-meter historical labels and 10-meter imagery, we employ a U-Net architecture optimized with Determinant-based Mutual Information (DMI). This approach effectively mitigates the influence of label noise. We validated our method against 2,058 manually verified points, achieving overall accuracies of 70.64%, 63.53%, and 60.06% for the years 2020, 2022, and 2024, respectively. Our comprehensive analysis reveals that oil palm coverage in the region peaked in 2022 before experiencing a decline in 2024. Furthermore, land cover transition analysis highlights a concerning trajectory of plantation expansion into flooded vegetation areas, despite a general stabilization in rotations with other crop types. These high-resolution maps provide essential data for monitoring sustainability commitments and deforestation dynamics in the region, and the generated datasets are made publicly available at this https URL.

[CV-90] Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing

【速读】:该论文旨在解决大规模扩散变换器(DiT)在执行全局编辑指令时,因联合注意力架构缺乏显式通道指示而产生的局部编辑泄露问题,即编辑内容意外扩散至无关区域。其解决方案的关键在于提出REDEdit框架,该框架通过协同训练的、指令与区域感知的适配器机制,在不修改DiT主干权重的前提下实现精准局部编辑:首先在每个Transformer块中引入轻量级Block Adapter,以结构化条件流分离编辑内容(指令语义)与编辑位置(空间掩码);其次利用学习到的SpatialGate将适配器信号选择性路由至目标区域,同时保持图像其余部分接近原始状态;最后通过Region-Aware Loss聚焦训练目标于变化像素,使主干内部表示具备掩码感知能力。由此,一个联合训练的MaskPredictor头可直接从指令和源图像中定位编辑区域,从而在部署阶段无需用户提供的掩码。

链接: https://arxiv.org/abs/2604.23763
作者: Honghao Cai,Xiangyuan Wang,Yunhao Bai,Haohua Chen,Tianze Zhou,Runqi Wang,Wei Zhu,Yibo Chen,Xu Tang,Yao Hu,Zhen Li
机构: The Chinese University of Hong Kong, Shenzhen; Beijing University of Aeronautics and Astronautics; Tsinghua University; Peking University; Xiaohongshu Inc.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large diffusion transformers (DiTs) follow global editing instructions well but consistently leak local edits into unrelated regions, because joint-attention architectures offer no explicit channel telling the network where to apply the edit. We introduce REDEdit, a co-trained, instruction- and region-aware adapter framework that retrofits a frozen DiT into a precise local editor without modifying its backbone weights. A lightweight Block Adapter at every transformer block injects a structured condition stream that factorizes what to edit (instruction semantics) from where to edit (spatial mask); a learned SpatialGate routes the adapter signal selectively into the edit region while keeping the rest of the image near-identical to the source; and a Region-Aware Loss focuses the training objective on the changing pixels. Because these components make the backbone’s internal representation mask-aware end-to-end, a thin MaskPredictor head trained jointly with the editor can ground the edit region directly from the instruction and source image eliminating any user-mask requirement at deployment. We evaluate on two complementary benchmarks: MagicBrush (paired ground-truth targets) to measure pixel-level preservation and edit accuracy, and Emu-Edit Test (no ground-truth images, 9 diverse edit categories) to stress-test instruction following and generalization across edit types. On both, REDEdit achieves state-of-the-art results, simultaneously outperforming mask-free and oracle-mask baselines. A seven-variant ablation cleanly isolates the contribution of each component.

[CV-91] DynProto: Dynamic Prototype Evolution for Out-of-Distribution Detection CVPR2026

【速读】:该论文旨在解决现有视觉语言模型(Vision-Language Models, VLMs)在分布外(Out-of-Distribution, OOD)检测中依赖预定义OOD标签导致的局限性问题,即当真实世界的OOD样本不在预设标签集合内时,检测性能显著下降。解决方案的关键在于提出DynProto方法,其核心思想是:在测试阶段仅利用分布内(In-Distribution, ID)信息动态学习OOD原型(OOD prototypes)。该方法基于一个关键观察——被误判为同一ID类别的OOD样本在特征空间中趋于聚集;由此,通过易识别的OOD样本作为“锚点”(anchors),寻找难以检测的相似OOD样本,并引入两个模块实现高效检测:粗粒度OOD模式捕捉模块用于缓存易混淆的OOD模式,细粒度OOD模式精修模块则对每类缓存模式进行聚类并生成代表性OOD原型。最终,通过对比输入样本与ID原型及动态生成的OOD原型的相似度,实现高精度的OOD检测。

链接: https://arxiv.org/abs/2604.23729
作者: Yanqi Wu,Xinhua Lu,Runhe Lai,Qichao Chen,Jia-Xin Zhuang,Wei-Shi Zheng,Ruixuan Wang
机构: Sun Yat-sen University(中山大学); Peng Cheng Laboratory(鹏城实验室); University of Nottingham Malaysia(英国诺丁汉大学马来西亚分校); Hong Kong University of Science and Technology(香港科技大学); Key Laboratory of Machine Intelligence and Advanced Computing(机器智能与先进计算重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accept by CVPR2026 Findings

点击查看摘要

Abstract:Recent studies show that using potential out-of-distribution (OOD) labels from large corpora as auxiliary information can improve OOD detection in vision-language models (VLMs). However, these methods often fail when real-world OOD samples fall outside the predefined OOD label set. To address this limitation, we propose DynProto, a novel approach that learns OOD prototypes dynamically during testing using only in-distribution (ID) information. DynProto is inspired by a key observation: OOD samples predicted as the same ID class tend to cluster in the feature space. With this insight, we leverage easy-to-detect OOD samples as ``anchors’’ to find their harder-to-detect, similar counterparts. To this end, DynProto introduces two modules: \textbfCoarse OOD Pattern Capturing Module caches OOD patterns that are easily confused with each ID class during testing, and \textbfFine-grained OOD Pattern Refinement Module subsequently clusters these patterns within each cache and aggregates them into representative OOD prototypes. By measuring similarity to ID and dynamic OOD prototypes, DynProto enables accurate OOD detection. DynProto significantly outperforms prior methods across multiple benchmarks. On ImageNet OOD benchmark, DynProto reduces FPR95 by 11.60% and improves AUROC by 4.70%. Moreover, the framework is architecture-agnostic and can be integrated into various backbones.

[CV-92] ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction

【速读】:该论文旨在解决行人意图预测(Pedestrian Intention Prediction)中的三大挑战:多智能体交互模式过于简化、推理逻辑不透明以及行为预测缺乏全局一致性,这些问题显著影响了模型的鲁棒性和可解释性。解决方案的关键在于提出一种基于能量的时空交互感知框架(ESIA),其核心创新是将意图预测建模为统一图结构上的结构化预测问题,其中行人与环境被表示为时空节点,并通过单变量势(unary potentials)和成对势(pairwise potentials)分别刻画个体意图与社会/环境交互关系;同时引入全局能量函数以确保场景级行为预测的一致性,并设计结构一致性约束项来在无监督条件下抑制逻辑矛盾;最终通过一种新型的单变量引导模拟退火算法(Unary-Seeded Simulated Annealing, U-SSA)高效优化该能量函数,从而实现高性能且可解释的意图预测。

链接: https://arxiv.org/abs/2604.23728
作者: Yanping Wu,Meiting Dang,Lin Wu,Edmond S. L. Ho,Zhenghua Chen,Chongfeng Wei
机构: University of Glasgow (格拉斯哥大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 6 figures, 3 tables. Under review

点击查看摘要

Abstract:Recent advances in autonomous driving have motivated research on pedestrian intention prediction, which aims to infer future crossing decisions and actions by modeling temporal dynamics, social interactions, and environmental context. However, existing studies remain constrained by oversimplified multi-agent interaction patterns, opaque reasoning logic, and a lack of global consistency in behavioral predictions, which compromise both robustness and interpretability. In this work, we propose ESIA (Energy-based Spatiotemporal Interaction-Aware framework), a novel Conditional Random Field (CRF)-based paradigm. We cast the intention prediction task as a structured prediction problem over a unified graph-based representation, treating pedestrians and the environment as spatiotemporal nodes. To characterize their distinct roles, we assign unary potentials to nodes to capture individual intentions, and pairwise potentials to edges to encode social and environmental interactions. These potentials are integrated into a unified global energy function to ensure scene-level consistency across behavioral predictions. To further constrain inference without ground-truth supervision, we introduce structural consistency terms to penalize logical contradictions. This optimization is efficiently solved via a novel Unary-Seeded Simulated Annealing (U-SSA) algorithm, which leverages high-confidence unary priors to rapidly converge to a high-quality solution. Extensive experiments on standard benchmarks demonstrate that ESIA achieves state-of-the-art performance with improved interpretability over existing methods.

[CV-93] Zoom In Reason Out: Efficient Far-field Anomaly Detection in Expressway Surveillance Videos via Focused VLM Reasoning Guided by Bayesian Inference

【速读】:该论文旨在解决高速公路视频中远场目标异常检测难题,尤其是针对运动特征细微的车辆异常行为识别困难的问题。传统视觉语言模型(Vision-Language Models, VLMs)在处理全局视频帧时存在注意力稀释现象,且计算开销大,难以满足实时性和准确性要求。其解决方案的关键在于提出一种异步协同框架VIBES,核心创新是引入在线贝叶斯推理模块,通过持续评估车辆轨迹动态更新正常驾驶行为的概率边界,作为异步触发机制精准定位异常发生的空间与时间位置;随后仅对触发区域进行局部视觉输入,由VLM进行语义推理,从而避免注意力分散并显著降低计算成本,实现高精度、低延迟和可解释的异常检测。

链接: https://arxiv.org/abs/2604.23724
作者: Xiaowei Mao,Bowen Sui,Weijie Zhang,Yawen Yang,Shengnan Guo,Shilong Zhao,Jiaqi Lin,Tingrui Wu,Youfang Lin,Huaiyu Wa
机构: Beijing Jiaotong University (北京交通大学); Key Laboratory of Big Data Artificial Intelligence in Transportation, Ministry of Education (交通运输领域大数据人工智能重点实验室,教育部); Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence (北京市交通数据挖掘与具身智能重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Expressway video anomaly detection is essential for safety management. However, identifying anomalies across diverse scenes remains challenging, particularly for far-field targets exhibiting subtle abnormal vehicle motions. While Vision-Language Models (VLMs) demonstrate strong semantic reasoning capabilities, processing global frames causes attention dilution for these far-field objects and incurs prohibitive computational costs. To address these issues, we propose VIBES, an asynchronous collaborative framework utilizing VLMs guided by Bayesian inference. Specifically, to overcome poor generalization across varying expressway environments, we introduce an online Bayesian inference module. This module continuously evaluates vehicle trajectories to dynamically update the probabilistic boundaries of normal driving behaviors, serving as an asynchronous trigger to precisely localize anomalies in space and time. Instead of processing the continuous video stream, the VLM processes only the localized visual regions indicated by the trigger. This targeted visual input prevents attention dilution and enables accurate semantic reasoning. Extensive evaluations demonstrate that VIBES improves detection accuracy for far-field anomalies and reduces computational overhead, achieving high real-time efficiency and explainability while demonstrating generalization across diverse expressway conditions.

[CV-94] Caries DETR: Tooth Structure-aware Prior and Lesion-aware Dynamic Loss Refinement for DETR Based Caries Detection

【速读】:该论文旨在解决龋齿(dental caries)在口内影像中因细微且低对比度而难以被现有深度学习模型早期检测的问题。其核心挑战在于,尽管基于Transformer的检测器在自然图像中表现优异,但它们缺乏对牙体结构等领域特定先验知识的建模能力,导致对关键病变区域的关注不足。解决方案的关键在于提出Caries-DETR框架,包含两个创新模块:一是牙体结构感知查询初始化(Tooth Structure-aware Query Initialization, TSQI),通过大规模口内照片预训练与结构感知分支(Structure Perception Branch, SPB)融合高频结构先验信息,引导模型聚焦于解剖学上重要的病变区域;二是病变感知动态损失优化(Lesion-aware Dynamic Loss Refinement, LDLR),基于病变大小、解剖相关性和预测质量自适应重加权损失函数,实现质量驱动的难例挖掘,从而显著提升对微小病变的检测性能。

链接: https://arxiv.org/abs/2604.23718
作者: Xuefen Liu,Xinquan Yang,Mianjie Zheng,Kun Tang,Xuguang Li,Xiaoqi Guo,Linlin Shen,He Meng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As dental caries appear as subtle, low-contrast lesions in intraoral imaging, existing deep learning models face significant challenges in the early detection of caries. While recent Transformer-based detectors have shown promising results in natural images, they often fail to capture the domain-specific anatomical priors crucial for dental caries detection. In this paper, we propose Caries-DETR, a specialized Transformer framework for caries detection in intraoral images. A Tooth Structure-aware Query Initialization (TSQI) is designed, leveraging large-scale intraoral photograph pre-training and a structure perception branch (SPB) to integrate high-frequency structural priors, guiding the model to focus on anatomically significant lesion areas. Furthermore, we design a Lesion-aware Dynamic Loss Refinement (LDLR) to implement quality-driven hard mining through adaptive loss reweighting based on lesion size, anatomical relevance, and prediction quality, optimizing detection for subtle lesions. Extensive experiments on two public datasets (i.e., AlphaDent and DentalAI) demonstrate that Caries-DETR achieves a state-of-the-art performance compared to existing methods and exhibits good generalization and robustness. Code and data at this https URLthis https URL.

[CV-95] ZID-Net: Zero-Inference Diffusion Prior Decoupling Network for Single Image Dehazing

【速读】:该论文旨在解决单张图像去雾(Single image dehazing)中恢复质量与计算效率之间的权衡问题:传统卷积神经网络(CNN)虽高效但难以学习密集且非均匀雾霾的鲁棒先验,而扩散模型虽具备强大的生成先验能力,却存在推理延迟高和采样不稳定的问题。解决方案的关键在于提出ZID-Net框架,其核心创新是显式解耦扩散监督与前向推理路径——在训练阶段引入零推理先验传播头(Zero-Inference Prior Propagation Head, ZI-PPH),利用条件扩散过程预测残差噪声以提供退化感知的结构监督;在测试阶段移除扩散分支,仅保留一个频域-空间解耦的前馈主干网络,其中包含通道-空间拉普拉斯掩码(Channel-Spatial Laplacian Mask, CSLM)用于滤除雾霾增强噪声、轻量级全局上下文块(Lightweight Global Context Blocks, LGCBs)建模长程空间依赖关系,并通过动态特征仲裁模块(Dynamic Feature Arbitration Block, DFAB)自适应融合语义与结构特征,从而实现高精度且高效的去雾重建。

链接: https://arxiv.org/abs/2604.23709
作者: Xinheng Li,Minghao Chen,Mengqing Wu,Yan Liu,Guanying Huo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Submitted to Neurocomputing. Includes 12 figures and 8 tables

点击查看摘要

Abstract:Single image dehazing is often constrained by a trade-off between restoration quality and computational efficiency. While efficient, CNN networks struggle to learn robust priors for dense and non-homogeneous haze. Conversely, diffusion models provide strong generative priors but suffer from severe inference latency and sampling instability. To address these limitations, we propose ZID-Net, a novel framework that explicitly decouples diffusion supervision from feed-forward inference. For efficient inference, we design a frequency-spatial decoupled feed-forward backbone. Within this backbone, a Channel-Spatial Laplacian Mask (CSLM) filters haze-amplified noise to extract purified structural details, while Lightweight Global Context Blocks (LGCBs) establish long-range spatial dependencies to capture the global variations of haze. A Dynamic Feature Arbitration Block (DFAB) then adaptively fuses these semantic and structural features for robust reconstruction. To provide this backbone with physical priors without the inference cost, we introduce a Zero-Inference Prior Propagation Head (ZI-PPH) during training. ZI-PPH leverages a conditional diffusion process to predict residual noise, providing degradation-aware structural supervision to the backbone. By discarding the diffusion branch at test time, ZID-Net integrates diffusion priors into a pure feed-forward architecture for accurate and efficient restoration. ZID-Net achieves 40.75 dB PSNR on the synthetic RESIDE dataset and outperforms existing methods with a 1.13 dB gain on real-world datasets. Additionally, it yields a 3.06 dB PSNR gain on the StateHaze1k remote sensing dataset with an inference time of just 19.35 ms. The project code is available at: this https URL.

[CV-96] Weakly Supervised Multicenter Nancy Index Scoring in Ulcerative Colitis Using Foundation Models

【速读】:该论文旨在解决溃疡性肠炎(Ulcerative Colitis, UC)组织学活动度评估中人工评分耗时且存在观察者变异的问题,特别是在多中心、异质性样本中获取密集区域级标注成本高昂的挑战。其核心解决方案是提出一种弱监督的多实例学习(Multiple Instance Learning, MIL)方法,利用基础模型(foundation models)从病例级和切片级的Nancy组织学指数(Nancy Histological Index, NHI)标签中进行训练,从而实现对UC病理活动度的自动化、鲁棒且可解释的评估。关键创新在于通过基础模型编码器与聚合策略的组合优化,在无需精细标注的情况下实现了完整的五级NHI预测,并在真实多中心数据集上验证了该方法的有效性。

链接: https://arxiv.org/abs/2604.23706
作者: Adam Kukučka,Ondřej Fabián,Vít Musil,Tomáš Brázdil
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Histologic assessment of ulcerative colitis (UC) activity is an important endpoint in clinical trials and routine care, but manual grading with indices such as the Nancy histological index (NHI) is time-consuming and prone to observer variability. While computational pathology methods can automate scoring, many approaches depend on dense region-level annotations, which are costly to obtain, particularly in heterogeneous, multicenter cohorts. We propose a weakly supervised multiple instance learning (MIL) approach for whole-slide images that learns from case- and slide-level NHI labels, leveraging foundation models. Our method targets clinically relevant endpoints, including neutrophilic activity and derived Nancy-low/high groupings, enabling full five-grade NHI prediction. On a multicenter dataset of HE-stained colon biopsies from three hospitals (2019-2025), we evaluate multiple foundation model encoders and aggregation strategies. We find that foundation model choice and resolution substantially affect performance, with Virchow2 providing the most consistent gains, and that a simple ensembling rule improves five-grade NHI prediction compared to a hierarchical gating baseline. Overall, our results demonstrate that weakly supervised MIL with modern foundation-model representations can provide robust, interpretable UC histology activity assessment in realistic multicenter settings. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2604.23706 [cs.CV] (or arXiv:2604.23706v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.23706 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Vít Musil [view email] [v1] Sun, 26 Apr 2026 13:38:03 UTC (9,933 KB)

[CV-97] A Pose-only Geometric Constraint for Multi-Camera Pose Adjustment

【速读】:该论文旨在解决多相机系统在视觉导航与三维场景重建中因特征冗余导致的计算效率低下问题,尤其是在光束法平差(bundle adjustment)过程中,由于同时优化相机位姿和场景点引起的巨大计算开销。解决方案的关键在于提出一种仅基于位姿的几何约束(pose-only geometric constraint),通过广义相机模型建立统一的多相机表示,并利用两个基准观测及其对应位姿隐式表达三维场景点,从而将投影几何关系转化为纯位姿参数空间的优化问题;进而设计了一种多相机位姿调整算法,完全移除3D点参数,实现高效且聚焦于位姿优化的目标,实验表明该方法在保持或提升位姿估计精度的同时显著提高了计算效率。

链接: https://arxiv.org/abs/2604.23704
作者: Shunkun Liang,Banglei Guan,Bin Li,Qifeng Yu,Yang Shang
机构: National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-camera systems offer rich observation capabilities for visual navigation and 3D scene reconstruction; however, the resulting feature redundancy often compromises computational efficiency. This challenge is particularly pronounced during bundle adjustment, where the non-linear optimization of both system poses and scene points incurs substantial computational overhead. To address this challenge, this paper introduces a pose-only geometric constraint for multi-camera systems and proposes a corresponding pose adjustment algorithm. Specifically, we use generalized camera model to establish a unified representation of the multi-camera system. Building upon this model, we formulate the multi-camera pose-only constraint, which implicitly represents a 3D scene point using two base observations and their associated poses, thereby achieving a pose-only representation of the projection geometry. Subsequently, we introduce a multi-camera pose adjustment algorithm that eliminates 3D points from the parameter space, thereby achieving efficient and focused pose optimization. Experimental results on both synthetic and real-world datasets demonstrate that the proposed algorithm outperforms baseline bundle adjustment methods in computational efficiency, while maintaining or even improving pose estimation accuracy.

[CV-98] Personalizing Causal Audio-Driven Facial Motion via Dynamic Multi-modal Retrieval

【速读】:该论文旨在解决音频驱动面部动画(Audio-driven facial animation)中实时流式处理与高保真个性化之间的矛盾问题,现有方法通常依赖引入延迟的音频前瞻机制,或要求用户预先编码静态嵌入以实现个性化,但后者难以捕捉动态个体特征。其解决方案的关键在于提出一个端到端的因果框架,通过动态多模态风格检索实现因果面部运动生成的个性化:一是设计了一种时间层次化的运动表示结构,在保持解码因果性的同时捕获全局时序上下文和高频细节;二是引入多模态风格检索器,联合查询音频与运动信息以动态提取风格先验,且不破坏因果性。该机制支持任意数量和内容的参考模板,实现了低延迟、高灵活性的个性化面部动画生成。

链接: https://arxiv.org/abs/2604.23692
作者: Xuangeng Chu,Yu Han,Wei Mao,Shih-En Wei
机构: The University of Tokyo, Codec Avatars Lab, Meta; Codec Avatars Lab, Meta
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Audio-driven facial animation is essential for immersive digital interaction, yet existing frameworks fail to reconcile real-time streaming with high-fidelity personalization. Current methods often rely on latency-inducing audio look-ahead, or require high user compliance to pre-encode static embeddings that fails to capture dynamic idiosyncrasies. We present an end-to-end causal framework for personalizing causal facial motion generation via dynamic multi-modal style retrieval, enabling ultra-low latency while uniquely leveraging unstructured style references. We introduce two key innovations: (1) a temporal hierarchical motion representation that captures global temporal context and high-frequency details while maintaining decoding causality, and (2) a multi-modal style retriever that jointly queries audio and motion to dynamically extract stylistic priors without breaking causality. This mechanism allows for scalable personalization with total flexibility regarding the number and contents of templates. By integrating these components into a causal autoregressive architecture, our method significantly outperforms state-of-the-art approaches in lip-sync accuracy, identity consistency, and perceived realism, supported by extensive quantitative evaluations and user studies.

[CV-99] Do Protective Perturbations Really Protect Portrait Privacy under Real-world Image Transformations?

【速读】:该论文旨在解决现有基于像素级扰动的主动防御方法在真实场景中因图像变换(如缩放、色彩压缩等)导致防护失效的问题。其关键解决方案是提出一种简单而有效的净化框架,利用现实世界图像变换引入的漏洞,以低计算成本高效移除保护性扰动,从而揭示了当前主动防御机制在实际应用中的潜在风险。

链接: https://arxiv.org/abs/2604.23688
作者: Ruiqing Sun,Xingshan Yao,Zhijing Wu,Tian Lan,Chenhao Cui,Huiyang Zhao,Jialing Shi,Chen Yang,Xianling Mao
机构: Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Proactive defense methods protect portrait images from unauthorized editing or talking face generation (TFG) by introducing pixel-level protective perturbations, and have already attracted increasing attention for privacy protection. In real-world scenarios, images inevitably undergo various transformations during cross-device display and dissemination–such as scale transformations and color compression–that directly alter pixel values. However, it remains unclear whether such pixel-level modifications affect the effectiveness of existing proactive defense methods that rely on pixel-level perturbations. To solve this problem, we conduct a systematic evaluation of representative proactive defenses under image transformation. The evaluated methods are selected to span different generation architectures such as diffusion and GAN-based models, as well as defense scopes covering both portrait and natural images, and are assessed using both qualitative and quantitative metrics for subjective and objective comparison. Experimental results indicate that defense methods based on pixel-level perturbations struggle to withstand common image transformations, posing a risk of defense failure in real-world applications. To further highlight this risk, we propose a simple yet effective purification framework by leveraging the vulnerabilities induced by real-world image transformations. Experimental results demonstrate that the proposed method can efficiently remove protective perturbations with low computational cost, highlighting previously overlooked risks to the research community.

[CV-100] Reading in the Dark: Low-light Scene Text Recognition

【速读】:该论文旨在解决低光照环境下场景文本识别(Scene Text Recognition, STR)准确率低的问题,尤其在自动驾驶和智能监控等实际应用中,由于光照不足和噪声干扰导致现有方法性能显著下降。其解决方案的关键在于构建了一个大规模的低光场景文本识别数据集LSTR(包含11,273张合成低光图像)及真实夜间街景评估集ESTR,并提出两种策略:一是对OCR模型进行微调或LoRA微调;二是采用联合训练策略,将低光图像增强(Low-Light Image Enhancement, LLIE)模块与OCR模型协同优化。其中,作者创新性地提出了重渲染LLIE(re-render LLIE, RLLIE)模块,在真实世界数据上表现出更优性能,验证了专门针对文本识别设计的联合训练方法优于独立的LLIE或OCR模型处理方式。

链接: https://arxiv.org/abs/2604.23685
作者: Xuanshuo Fu,Lei Kang,Ernest Valveny,Dimosthenis Karatzas,Javier Vazquez-Corral
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate text recognition in low-light environments is essential for intelligent systems in applications ranging from autonomous vehicles to smart surveillance. However, challenges such as poor illumination and noise interference remain underexplored. To address this gap, we introduce LSTR, a large-scale Low-light Scene Text Recognition dataset comprising 11,273 low-light images generated from well-lit datasets (ICDAR2015, IIIT5K, and WordArt), along with ESTR, which includes 60 real nighttime street-scene images in English and Spanish for exclusive evaluation. We explore two solution strategies: (1) employing Optical Character Recognition (OCR) models with fine-tuning and LoRA-based fine-tuning and (2) a joint training strategy that integrates a low-light image enhancement (LLIE) module with an OCR model. In particular, we propose a novel re-render LLIE (RLLIE) module, which demonstrates improved performance on real-world data. Through extensive experimentation, we analyze various training strategies and address a key research question: \emphHow bright is bright enough for effective scene text recognition? Our results indicate that standalone LLIE or OCR models perform inadequately under low-light conditions, highlighting the advantages of specialized, jointly trained text-centric approaches. Additionally, we provide a comprehensive benchmark to support future research in robust low-light scene text recognition. this https URL.

[CV-101] Learning to Decipher from Pixels – A Case Study of Copiale STOC

【速读】:该论文旨在解决历史加密手稿中传统解密流程依赖于先转录后解密的繁琐且易出错的问题,即现有计算工作流普遍采用“转录优先”范式,需人工将手写符号转为文本后再进行密码分析,这一过程不仅耗时费力,还可能因转录错误影响最终解密准确性。解决方案的关键在于提出一种端到端、无需转录的图像到明文直接映射方法,通过构建首个基于文本行级别的加密图像与德语明文配对数据集(以Copiale密文为例),并利用通用手写文本预训练结合特定密文微调的策略,显著提升了解密准确率,验证了直接从加密图像生成明文在历史替代密码中的可行性与有效性。

链接: https://arxiv.org/abs/2604.23683
作者: Lei Kang,Giuseppe De Gregorio,Raphaela Heil,Alicia Fornés,Beáta Megyesi
机构: Computer Vision Center, Universitat Autònoma de Barcelona, Spain; Department of Linguistics, Stockholm University, Sweden
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to HistoCrypt 2026

点击查看摘要

Abstract:Historical encrypted manuscripts require both paleographic interpretation of cipher symbols and cryptanalytic recovery of plaintext. Most existing computational workflows rely on a transcription-first paradigm, in which handwritten symbols are transcribed prior to decipherment. This intermediate step is labor-intensive, error-prone, and not always aligned with the goal of direct plaintext recovery. We propose an end-to-end, transcription-free approach that directly maps handwritten cipher images to plaintext. Using the Copiale cipher as a case study, we introduce the first text-line-level dataset pairing cipher images with German plaintext. We show that pretraining on generic handwriting data followed by cipher-specific fine-tuning substantially improves decipherment accuracy. Our results demonstrate that transcription-free image-to-plaintext decipherment is both feasible and effective for historical substitution ciphers, offering a simplified and scalable alternative to traditional pipelines. this https URL

[CV-102] Deploy DINO with Many-to-Many Association

【速读】:该论文旨在解决监督式图像匹配模型在未见图像域上泛化能力有限的问题,特别是针对通用视觉特征(如DINO)在跨域匹配中因语义相似实例间存在固有歧义而导致的性能下降问题。其解决方案的关键在于引入一种多对多(many-to-many, m-to-m)匹配范式,并提出一种基于似然视角的新颖鲁棒机制——谐波一致性最大化(Harmonic Consensus Maximization, HCM),该机制将传统方法视为不可行的似然计算的零阶近似,从而实现更高效且细粒度的内点关联优化,显著提升了无适应条件下通用视觉特征在下游任务(如相机位姿估计)中的匹配精度与效率。

链接: https://arxiv.org/abs/2604.23670
作者: Haodong Jiang,Mingzhe Li,Junfeng Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Motivated by the limited generalization of supervised image matching models to unseen image domains, we explore the zero-shot deployment of DINO features for this task. The generalist visual representation extracted from DINO has inherent ambiguity when used to match feature points among semantically similar instances, prompting us to adopt a many-to-many (m-to-m) matching paradigm. However, the existing robust mechanism under m-to-m data association is computationally heavy, which requires finding a maximum-cardinality matching in the inlier association graph for each parameter evaluation. To address this inefficiency, we introduce a novel likelihood perspective, which interprets the existing method as a zeroth-order approximation of otherwise intractable likelihood calculation,and inspires us to propose a faster and finer-grained robust mechanism, termed as Harmonic Consensus Maximization (HCM). Take camera pose estimation as an exemplifying downstream task, we demonstrate that general-purpose visual features, used out of the box without any adaptation, can compete with specialized matching models on out-of-distribution datasets when mated with m-to-m association and the HCM mechanism.

[CV-103] HAC: Parameter-Efficient Hyperbolic Adaptation of CLIP for Zero-Shot VQA ICPR2026

【速读】:该论文旨在解决当前超球面(hyperbolic)CLIP模型训练依赖从头开始预训练所带来的高计算成本和资源消耗问题,同时探索如何将已有的、在欧几里得空间(Euclidean space)中预训练的CLIP模型高效迁移至超球面空间以更好地建模层次结构。其解决方案的关键在于提出一种参数高效的框架HAC(Hyperbolic Adaptation of CLIP),通过轻量级微调实现对预训练CLIP模型的超球面适应,从而在不破坏原始模型知识的前提下,显著提升模型在视觉问答(VQA)任务中的表现,尤其在推理密集型任务上相比基线模型平均提升达+1.9点。

链接: https://arxiv.org/abs/2604.23665
作者: Francesco Dibitonto,Cigdem Beyan,Vittorio Murino
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is the preprint version of the paper. The final version has been accepted for publication in the Proceedings of the 28th International Conference on Pattern Recognition (ICPR 2026)

点击查看摘要

Abstract:Recent advances in representation learning have shown that hyperbolic geometry can offer a more expressive alternative to the Euclidean embeddings used in CLIP models, capturing hierarchical structures and leading to better-organized representations. However, current hyperbolic CLIP variants are trained entirely from scratch, which is computationally expensive and resource-intensive. In this work, we propose HAC (Hyperbolic Adaptation of CLIP), a parameter-efficient framework that enables pretrained CLIP models to transition into hyperbolic space via lightweight fine-tuning. We apply HAC to Visual Question Answering (VQA), where models must interpret visual elements and align them with textual queries. Notably, HAC’s training is performed on a dataset with no overlap with any VQA benchmark, resulting in a strict zero-shot evaluation paradigm that underscores HAC’s task-agnostic adaptability. We evaluate HAC across a diverse suite of VQA benchmarks spanning General, Reasoning, and OCR categories. Both HAC-S (small) and HAC-B (medium) consistently surpass Euclidean baselines and prior hyperbolic approaches, with HAC-B delivering up to a +1.9 point average improvement over CLIP-B on reasoning-intensive tasks. Our code is available at this https URL

[CV-104] SolarFCD: A Large-Scale Dataset and Benchmark for Solar Fault Classification in Photovoltaic Systems

【速读】:该论文旨在解决太阳能光伏(Solar Photovoltaic, PV)系统在大规模部署背景下,缺乏高质量、多模态且公开可获取的标注数据集所导致的自动化缺陷检测技术发展受限问题。其解决方案的关键在于构建了一个名为SolarFCD的大规模、多模态、统一标注的光伏面板缺陷数据集,该数据集整合了RGB/无人机图像与热红外图像两种模态,并通过系统化的标签映射、近似重复样本去除及少数类针对性增强策略,将4,435张图像划分为健康、表面遮挡、结构故障和电气故障四类缺陷,同时基于16种分类架构提供了可复现的基准性能,其中ResNet101V2模型表现最优(准确率86.68%,F1-score 88.17%),且各类别间性能差异小于1.2个百分点,显著提升了自动化光伏巡检的可靠性和可扩展性。

链接: https://arxiv.org/abs/2604.23662
作者: Misbah Ijaz,Saif Ur Rehman Khan,Abd Ur Rehman,Arooj Zaib,Sebastian Vollmer,Andreas Dengel,Muhammad Nabeel Asim
机构: University of Gujrat (古杰拉特大学); Rhine-Waal University of Applied Sciences (莱茵-瓦尔应用科学大学); German Research Center for Artificial Intelligence (德国人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The increasing global deployment of solar photovoltaic (PV) systems needs robust, scalable, and automated inspection technologies capable of detecting a wide range of panel flaws under a variety of operating situations. The lack of large-scale, multi-modal, publicly available annotated datasets is a major obstacle preventing advancement in this field. We introduce SolarFCD, an extensive dataset of solar panel defects created by methodically combining and reconciling three publicly accessible datasets covering two imaging modalities: RGB/Drone images and Thermal Infrared. The dataset consist of 4,435 images arranged under four unified defect classes such as: healthy images, Surface Obstruction, structural fault, and electrical fault. The dataset was divided into training, validation, and test splits at an 80:10:10 ratio through methodical label mapping, near-duplicate removal, and targeted augmentation of minority classes. Sixteen classification architectures from five design families were trained and assessed on the dataset to provide repeatable benchmark baselines. With an accuracy of 86.68%, precision of 88.65%, recall of 88.62%, and F1-score of 88.17%, ResNet101V2 performed the best overall. Per-class results showed balanced detection across all four defect categories within a narrow performance band of less than 1.2 percentage points. To promote open and repeatable research in automated PV inspection and solar energy operations and maintenance, the dataset, annotation files, and baseline code are made openly available.

[CV-105] BVI-Mamba: Video Enhancement Using a Visual State-Space Model for Low-Light and Underwater Environments

【速读】:该论文旨在解决低光照和水下视频中常见的失真问题,如噪声、对比度低、色彩失衡和模糊等,这些问题不仅影响视觉感知,还会降低自动检测等下游任务的性能。传统后处理方法耗时较长,而基于AI的视频增强技术相比图像增强方法则需要更高的计算资源。其解决方案的关键在于提出一种名为Visual Mamba的新框架,该框架通过引入视觉状态空间(Visual State Space, VSS)模型显著降低内存占用和计算时间;具体包括两个核心模块:(i) 特征对齐模块,用于在特征空间中注册输入帧间的时空位移;(ii) 增强模块,采用类似UNet的结构但将所有卷积层替换为VSS块,实现高效去噪与亮度调整。实验表明,该方法在低光照和水下视频增强任务中优于Transformer和卷积基模型。

链接: https://arxiv.org/abs/2604.23655
作者: Guoxi Huang,Ruirui Lin,Yini Li,David R. Bull,Nantheera Anantrasirichai
机构: University of Bristol (布里斯托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Videos captured in low-light and underwater conditions often suffer from distortions such as noise, low contrast, color imbalance, and blur. These issues not only limit visibility but also degrade automatic tasks like detection. Post-processing is typically required but can be time-consuming. AI-based tools for video enhancement also demand significantly more computational resources compared to image-based methods. This paper introduces a novel framework, Visual Mamba, designed to reduce memory usage and computational time by leveraging the Visual State Space (VSS) model. The framework consists of two modules: (i) a feature alignment module, where spatio-temporal displacement between input frames is registered in the feature space, and (ii) an enhancement module, where noise removal and brightness adjustment are performed using a UNet-like architecture, with all convolutional layers replaced by VSS blocks. Experimental results show that the Visual Mamba technique outperforms Transformer and convolution-based models in both low-light and underwater video enhancement tasks. Code is available on line at this https URL.

[CV-106] ResAF-Net: An Anchor-Free Attention-Based Network for Tree Detection and Agricultural Mapping in Palestine

【速读】:该论文旨在解决巴勒斯坦地区农业数据难以大规模获取的问题,主要受限于碎片化地形、实地访问受限以及空基监测受限等因素。为应对这一挑战,作者提出了一种基于卫星影像的树检测框架ResAF-Net,其核心创新在于融合了ResNet-50编码器、空洞空间金字塔池化(Atrous Spatial Pyramid Pooling, ASPP)、特征融合模块、多头自注意力精修模块及无锚点的FCOS检测头,从而在密集且异质场景中实现高灵敏度的树木定位。该方案在MillionTrees基准上实现了82%的召回率(Recall)和35.47%的mAP@0.50:0.95,表明其对树木存在的敏感性较强,同时具备良好的定位性能,为资源受限地区的农业遥感监测提供了可行的技术路径。

链接: https://arxiv.org/abs/2604.23653
作者: Rabee Al-Qasem
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reliable agricultural data is essential for food security, land-use planning, and economic resilience, yet in Palestine, such data remains difficult to collect at scale because of fragmented landscapes, limited field access, and restrictions on aerial monitoring. This paper presents ResAF-Net, a satellite-based tree detection framework designed for large-scale agricultural monitoring in resource-constrained settings. The proposed architecture combines a ResNet-50 encoder, Atrous Spatial Pyramid Pooling (ASPP), a feature-fusion stage, a multi-head self-attention refinement module, and an anchor-free FCOS detection head to improve tree localization in dense and heterogeneous scenes. Trained on the MillionTrees benchmark, the model achieved 82% Recall, 63.03% mAP@0.50, and 35.47% mAP@0.50:0.95 on the validation split, indicating strong sensitivity to tree presence while maintaining competitive localization quality. Beyond benchmark evaluation, we implemented the model within a web-based GIS application integrated with Palestinian cadastral data from GeoMolg, enabling tree analysis at scene, parcel, and community levels. This deployment demonstrates the practical feasibility of AI-assisted agricultural inventorying in Palestine. It provides a foundation for data-driven monitoring, reporting, and future species-level analysis of Mediterranean tree crops.

[CV-107] Geometry-Conditioned Diffusion for Occlusion-Robust In-Bed Pose Estimation

【速读】:该论文旨在解决在被毯子遮挡条件下,床上人体姿态估计(in-bed human pose estimation)因缺乏可靠标注训练数据而面临的鲁棒性挑战。现有方法依赖多模态传感或图像到图像的转换框架,受限于可见源图像的条件,难以扩展且姿态多样性不足。其解决方案的关键在于将遮挡感知的数据增强重构为几何条件驱动的生成建模任务,提出一种基于骨骼关键点条件的潜在扩散模型(Pose-conditioned Latent Diffusion Model, Pose-LDM),直接从骨骼关键点生成被毯子覆盖的图像,无需成对监督和像素级源图像条件,从而实现任意姿态输入下的高效生成。实验表明,Pose-LDM 在严重遮挡下实现了最高的定位精度,同时保持与配对扩散模型相当的整体检测性能,接近全监督训练效果,验证了几何条件扩散在提升遮挡鲁棒性方面的有效性与监督效率。

链接: https://arxiv.org/abs/2604.23651
作者: Navid Aslankhani Khameneh,Marco Carletti,Cigdem Beyan
机构: University of Verona (维罗纳大学); EVS - Embedded Vision Systems Srl (嵌入式视觉系统公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is the preprint version of the paper. The final version has been accepted for publication in the Proceedings of the 20th IEEE International Conference on Automatic Face and Gesture Recognition (IEEE FG 2026)

点击查看摘要

Abstract:Robust in-bed human pose estimation under blanket occlusion remains challenging due to the scarcity of reliable labeled training data for heavily covered poses. Existing approaches rely on multi-modal sensing or image-to-image translation frameworks that remain conditioned on visible source imagery, limiting scalability and pose diversity. In this work, we reformulate occlusion-aware augmentation as a geometry-conditioned generative modeling task. We conduct a systematic comparison of deterministic masking, unpaired translation, paired diffusion-based translation, and a proposed pose-conditioned Latent Diffusion Model (Pose-LDM). Unlike image-guided methods, Pose-LDM synthesizes blanket-covered images directly from skeletal keypoints, eliminating dependence on paired supervision and pixel-level source-image conditioning while enabling generation from arbitrary pose inputs. All augmentation strategies are evaluated through their impact on downstream pose estimation under a fixed backbone. Pose- LDM achieves the highest strict localization accuracy under severe occlusion while maintaining overall detection performance comparable to paired diffusion models, approaching the performance of fully supervised training. These results demonstrate that geometry-conditioned diffusion provides an effective and supervision-efficient pathway toward occlusion-robust inbed pose estimation without modifying the sensing pipeline. The code is available at: this http URL GeoDiffPose.

[CV-108] RaV-IDP: A Reconstruction-as-Validation Framework for Faithful Intelligent Document Processing

【速读】:该论文旨在解决智能文档处理(Intelligent Document Processing, IDP)流水线中提取结果缺乏内在验证机制的问题,即现有方法无法确保结构化实体(如表格、图像和文本)的抽取内容忠实于原始文档,导致错误在下游系统(如知识库、检索增强生成等)中无声传播。解决方案的关键在于提出“重建即验证”(Reconstruction as Validation, RaV-IDP)架构:在每个实体被提取后,由专用重构器将提取表示还原为与原始文档区域可比的形式,并通过比较器计算重构结果与未修改源区域之间的保真度分数,从而获得一个基于文档本身的无标签质量信号;当保真度低于预设阈值时,触发基于GPT-4.1视觉能力的结构化回退机制重新执行验证循环,且整个流程严格遵循“锚定原始文档区域”的Bootstrap约束,避免验证过程陷入循环依赖。

链接: https://arxiv.org/abs/2604.23644
作者: Pritesh Jha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Intelligent document processing pipelines extract structured entities (tables, images, and text) from documents for use in downstream systems such as knowledge bases, retrieval-augmented generation, and analytics. A persistent limitation of existing pipelines is that extraction output is produced without any intrinsic mechanism to verify whether it faithfully represents the source. Model-internal confidence scores measure inference certainty, not correspondence to the document, and extraction errors pass silently into downstream consumers. We present Reconstruction as Validation (RaV-IDP), a document processing pipeline that introduces reconstruction as a first-class architectural component. After each entity is extracted, a dedicated reconstructor renders the extracted representation back into a form comparable to the original document region, and a comparator scores fidelity between the reconstruction and the unmodified source crop. This fidelity score is a grounded, label-free quality signal. When fidelity falls below a per-entity-type threshold, a structured GPT-4.1 vision fallback is triggered and the validation loop repeats. We enforce a bootstrap constraint: the comparator always anchors against the original document region, never against the extraction, preventing the validation from becoming circular. We further propose a per-stage evaluation framework pairing each pipeline component with an appropriate benchmark. The code pipeline is publicly available at this https URL for experimentation and use. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.23644 [cs.CV] (or arXiv:2604.23644v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.23644 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.5281/zenodo.19694316 Focus to learn more DOI(s) linking to related resources

[CV-109] VDLF-Net: Variational Feature Fusion for Adaptive and Few-Shot Visual Learning

【速读】:该论文旨在解决少样本学习(few-shot learning)中特征表示能力不足与模型泛化性能受限的问题。其解决方案的关键在于提出VDLF-Net架构,该架构在多尺度卷积神经网络(multi-scale CNN)主干网络基础上附加一个轻量级变分自编码器(VAE),通过潜在向量(latent vectors)和softmax门控机制(softmax-gate)动态调节主干特征图,并利用门控后的ℓ₂归一化嵌入向量实现监督分类或元学习场景下的少样本预测。实验表明,该设计显著优于ResNet-50 Enhanced、VGG-16、原型网络(Prototypical Networks)和匹配网络(Matching Networks),且性能提升主要来源于完整的VDLF-Net结构与训练策略,而非单一模块如KL散度或重构损失的优化。

链接: https://arxiv.org/abs/2604.23641
作者: Jiawei Yan
机构: 上海交通大学(Shanghai Jiao Tong University)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper introduces VDLF-Net, which attaches a compact VAE to a multi-scale CNN backbone. Latent vectors and softmax-gate support the backbone feature maps, while \ell_2 -normalized embeddings from the gated maps contribute toward supervised classification or episodic few-shot prediction. Under standard CIFAR-100 and Mini-ImageNet protocols, VDLF-Net demonstrates an improved performance over ResNet-50 Enhanced, VGG-16, Prototypical Networks, and Matching Networks. Extensive ablations show that removing the fine-resolution scale has the greatest impact on VDLF-Net’s performance. At the same time, KL and reconstruction at the chosen \alpha pose a minor performance reduction, demonstrating that performance gains over classical episodic baselines mainly originate from the full VDLF-Net architecture and training strategy.

[CV-110] Discriminator-Guided Adaptive Diffusion for Source-Free Test-Time Adaptation under Image Corruptions ICPR2026

【速读】:该论文旨在解决**无源域自适应(Source-Free Unsupervised Domain Adaptation, SF-UDA)**在由自然图像退化引起的域偏移场景下的性能下降问题,特别是针对超出加性噪声范畴的复杂退化类型,如模糊、天气效应和数字伪影等。其核心挑战在于如何在不访问源域数据且保持源模型冻结的前提下,提升目标域测试样本的鲁棒性。解决方案的关键在于提出一种基于扩散模型的输入级自适应框架,利用预训练扩散模型作为生成先验,并引入判别器引导的自适应扩散策略——该策略通过判别器动态决定每张测试图像所需的前向扩散步数,以抑制特定退化特征,同时保留类别判别结构;随后通过逆扩散过程重建与源域对齐的图像,最终使用冻结的源分类器进行推理。此方法实现了对多种退化类型的平衡鲁棒性提升,且无需固定扩散深度,具备良好的通用性和实用性。

链接: https://arxiv.org/abs/2604.23636
作者: Francesco Olivato,Cigdem Beyan,Vittorio Murino
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is the preprint (submitted version) of the paper. The final version has been accepted for publication in the Proceedings of the 28th International Conference on Pattern Recognition (ICPR 2026)

点击查看摘要

Abstract:In this work, we study Source-Free Unsupervised Domain Adaptation under corruption-induced domain shifts, where performance degradation is caused by natural image corruptions that go beyond additive noise, including blur, weather effects, and digital artifacts. We propose a diffusion-based, input-level adaptation framework that operates entirely at test time and keeps all source-trained models frozen, explicitly targeting robustness to corrupted target inputs. Our method leverages a source-trained diffusion model as a generative prior and introduces a discriminator-guided adaptive diffusion strategy that dynamically controls the amount of perturbation applied to each test sample. Rather than relying on a fixed diffusion depth, the discriminator determines, on a per-image basis, when sufficient forward diffusion has been applied to suppress corruption-specific artifacts, with each corruption type effectively defining a distinct target domain. This adaptive stopping mechanism applies only the necessary amount of noise to remove domainspecific corruption while preserving class-discriminative structure. The reverse diffusion process then reconstructs a source-aligned image, optionally stabilized through structural guidance, which is classified using a frozen source-trained classifier. We evaluate the proposed approach across a broad spectrum of corruption-induced target domains, covering 15 diverse corruption types, and demonstrate more balanced robustness with competitive or improved performance across non-noise corruptions. Additional analyses reveal how the adaptive diffusion schedule responds to different corruption characteristics, highlighting the practicality, generality, and robustness of the proposed framework. The code is publicly available at this https URL.

[CV-111] Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

【速读】:该论文旨在解决实时文本驱动的音视频人脸动画生成中,现有音频-视觉扩散模型因推理速度过慢且加速后质量显著下降的问题。其核心解决方案是提出Hallo-Live框架,关键创新在于:(1)采用异步双流扩散机制结合人类中心偏好引导的知识蒸馏(HP-DMD),以在保持高质量的同时实现高效推理;(2)引入未来扩展注意力(Future-Expanding Attention),使每个视频块能访问同步音频及短时未来语音线索,从而减少因果生成中的发音延迟。该方法在两块NVIDIA H200 GPU上实现了20.38 FPS的帧率与0.94秒延迟,相比教师模型Ovi提升16倍吞吐量和99.3倍降低延迟,同时在视频对齐度、同步置信度等指标上保持竞争力,并展现出跨风格(写实、多说话人、艺术化)的鲁棒泛化能力。

链接: https://arxiv.org/abs/2604.23632
作者: Chunyu Li,Jiaye Li,Ruiqiao Mei,Haoyuan Xia,Hao Zhu,Jingdong Wang,Siyu Zhu
机构: Fudan University (复旦大学); University of Science and Technology of China (中国科学技术大学); Nanjing University (南京大学); Baidu (百度)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Real-time text-driven joint audio-video avatar generation requires jointly synthesizing portrait video and speech with high fidelity and precise synchronization, yet existing audio-visual diffusion models remain too slow for interactive use and often degrade noticeably after aggressive acceleration. We present Hallo-Live, a streaming framework for joint audio-visual avatar generation that combines asynchronous dual-stream diffusion with human-centric preference-guided distillation. To reduce articulation lag in causal generation, we introduce Future-Expanding Attention, which allows each video block to access synchronous audio together with a short horizon of future phonetic cues. To mitigate the quality loss of few-step distillation, we further propose Human-Centric Preference-Guided DMD (HP-DMD), which reweights training samples using rewards from visual fidelity, speech naturalness, and audio-visual synchronization. On two NVIDIA H200 GPUs, Hallo-Live runs at 20.38 FPS with 0.94 seconds latency, yielding 16.0x higher throughput and 99.3x lower latency than the teacher model Ovi. Despite this speedup, it retains strong generation quality, reaching comparable VideoAlign overall score and Sync Confidence score while outperforming other accelerated baselines in the overall quality-efficiency trade-off. Qualitative results further show robust generalization across photorealistic, multi-speaker, and stylized scenarios. To the best of our knowledge, Hallo-Live is the first framework to combine streaming dual-stream diffusion with preference-guided distillation for real-time, text-driven audio-visual generation.

[CV-112] A Synergistic CNN-Transformer Network with Pooling Attention Fusion for Hyperspectral Image Classification

【速读】:该论文旨在解决高光谱图像(HSI)分类任务中空间-光谱信息协同利用不足以及网络层间信息传递过程中关键特征易丢失的问题。其解决方案的关键在于提出一种融合卷积神经网络(CNN)与视觉Transformer(ViT)的协同架构:首先设计双分支特征提取(TBFE)模块,通过并行的3D和2D卷积分别高效提取光谱与空间特征;其次引入混合池化注意力(HPA)模块以聚合空间注意力信息;再者采用级联Transformer编码器进行全局光谱特征建模;最后创新性地提出跨层特征融合(CFF)模块,有效减少前序网络层中的重要信息损失,从而实现空间与光谱信息的深度融合与稳定传播。

链接: https://arxiv.org/abs/2604.23622
作者: Peng Chen,Wenxuan He,Feng Qian,Guangyao Shi,Jingwen Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In the hyperspectral image (HSI) classification task, each pixel is categorized into a specific land-cover category or material. Convolutional neural networks (CNNs) and transformers have been widely used to extract local and non-local features in HSI classification. Recent works have utilized a multi-scale vision transformer (ViT) to enhance spectral feature capture and yield promising results. However, most existing methods still face challenges in the effective joint use of spatial-spectral information and in preserving information across layers during the propagation process. To address these issues, we propose a synergistic CNN-Transformer network with pooling attention fusion for HSI classification, which collaboratively utilizes CNNs and ViT to process spatial and spectral features separately. Specifically, we propose a Twin-Branch Feature Extraction (TBFE) module, which employs 3D and 2D convolution in parallel to comprehensively extract spectral and spatial features from HSI. A hybrid pooling attention (HPA) module is designed to aggregate spatial attention. Moreover, a cascade transformer encoder is employed for global spectral feature extraction, and a simple yet efficient cross-layer feature fusion (CFF) module is designed to reduce the loss of crucial information in the previous network layers. Extensive experiments are conducted on several representative datasets to demonstrate the superior performance of our proposed method compared to the state-of-the-art works. Code is available at this https URL.

[CV-113] Comparative Study of Weighted and Coupled Second- and Fourth-Order PDEs for Image Despeckling in Grayscale Color SAR and Ultrasound

【速读】:该论文旨在解决传统偏微分方程(Partial Differential Equation, PDE)图像去斑方法中存在的两个关键问题:一是二阶PDE模型易产生块状伪影(blocky artifacts),二是高阶模型常引入斑点噪声(speckle patterns)。解决方案的关键在于提出两种先进的PDE框架:第一种采用加权混合策略,将二阶与四阶PDE通过权重参数融合,其中二阶项使用灰度和梯度指标控制扩散,四阶项仅由拉普拉斯(Laplacian)指标引导;第二种构建耦合PDE结构,独立求解二阶与四阶分量并迭代更新,各扩散系数分别定义以增强区域自适应性。这两种方法均基于显式有限差分法实现,在多种图像数据集上验证了其在抑制斑点噪声的同时有效保持边缘细节的能力,定量指标(PSNR、SSIM、Speckle Index)显示优于现有Telegraph Diffusion Model (TDM) 和Fourth-Order Telegraph Diffusion Model (TDFM)。

链接: https://arxiv.org/abs/2604.23612
作者: Manish Kumar,Rajendra K. Ray
机构: Indian Institute of Technology Mandi (印度理工学院曼迪分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Partial Differential Equation (PDE)-based approaches have gained significant attention in image despeckling due to their strong capability to preserve structural details while suppressing noise. However, conventional second-order PDE models tend to generate blocky artifacts, whereas higher-order models often introduce speckle patterns. To resolve it, this paper proposes and comparatively analyzes two advanced PDE-based frameworks designed for speckle noise suppression while preserving the fine edges. The first model introduces a novel weighted formulation that combines second and fourth-order PDEs through a weighting parameter. The second-order diffusion coefficient employs grayscale and gradient-based indicators, while the fourth-order term is guided solely by a Laplacian-based indicator. The second model constructs a coupled PDE framework, where independent fourth and second-order components are explicitly solved in an iterative manner. In this coupled structure, each diffusion coefficient is defined separately to enhance adaptability in varying image regions. Both models are implemented using the explicit finite difference method. The proposed techniques are extensively evaluated on a variety of datasets, including standard grayscale, color, Synthetic Aperture Radar (SAR), and ultrasound images. Comparative experiments with the existing Telegraph Diffusion Model (TDM) and Fourth-Order Telegraph Diffusion Model (TDFM) demonstrate the superiority of the proposed approaches in reducing speckle noise while effectively preserving fine image structures and edges. Quantitative evaluations using PSNR, SSIM and Speckle Index metrics confirm that the proposed models produce higher image quality and enhanced visual perception. Overall, the presented PDE-based formulations provide a reliable and efficient framework for image despeckling in both natural and medical imaging.

[CV-114] Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation CVPR

【速读】:该论文旨在解决三维激光雷达(LiDAR)场景下异常分割(Anomaly Segmentation)问题,即在自动驾驶和机器人感知中区分已知类别与未知异常物体。现有方法多依赖二维视觉的后处理技术,难以适配3D点云数据特性,且缺乏高质量、多样化的公开数据集支撑。解决方案的关键在于提出一种直接在特征空间操作的新方法:通过建模正常类别的特征分布来约束异常样本,从而实现高效、准确的异常检测;同时构建了一套混合真实-合成数据集,涵盖多种分布外物体和复杂环境,有效缓解了现有数据集因传感器分辨率导致的域偏移问题,显著提升了模型泛化能力与实用性。

链接: https://arxiv.org/abs/2604.23604
作者: Simone Mosco,Daniel Fusaro,Alberto Pretto
机构: University of Padova, Italy(帕多瓦大学, 意大利)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: This paper has been accepted at the 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

点击查看摘要

Abstract:Understanding the surrounding environment is fundamental in autonomous driving and robotic perception. Distinguishing between known classes and previously unseen objects is crucial in real-world environments, as done in Anomaly Segmentation. However, research in the 3D field remains limited, with most existing approaches applying post-processing techniques from 2D vision. To cover this lack, we propose a new efficient approach that directly operates in the feature space, modeling the feature distribution of inlier classes to constrain anomalous samples. Moreover, the only publicly available 3D LiDAR anomaly segmentation dataset contains simple scenarios, with few anomaly instances, and exhibits a severe domain gap due to its sensor resolution. To bridge this gap, we introduce a set of mixed real-synthetic datasets for 3D LiDAR anomaly segmentation, built upon established semantic segmentation benchmarks, with multiple out-of-distribution objects and diverse, complex environments. Extensive experiments demonstrate that our approach achieves state-of-the-art and competitive results on the existing real-world dataset and the newly introduced mixed datasets, respectively, validating the effectiveness of our method and the utility of the proposed datasets. Code and datasets are available at this https URL.

[CV-115] PhysLayer: Language-Guided Layered Animation with Depth-Aware Physics ICME2026

【速读】:该论文旨在解决现有图像到视频生成方法中存在的物理不真实性(physically implausible motions)和对象动力学控制不足的问题,尤其在深度感知的空间交互方面存在局限。解决方案的关键在于提出PhysLayer框架,其核心由三个模块构成:一是基于语言引导的场景理解模块,利用视觉基础模型将场景分解为基于深度的图层,从而解析物体组成、材质属性及物理参数;二是深度感知的分层物理模拟模块,通过扩展二维刚体动力学以引入深度运动和视角一致的缩放机制,在无需完整三维重建的前提下实现更真实的物体交互;三是物理引导的视频合成模块,结合模拟轨迹与场景感知的光照重渲染,确保时序一致性。该方法在多项指标上显著优于现有技术,实现了物理真实性和计算效率之间的平衡。

链接: https://arxiv.org/abs/2604.23574
作者: Tianyidan Xie,Zhentao Huang,Mingjie Wang,Xin Huang,Jun Zhou,Minglun Gong,Zili Yi
机构: 南京大学( Nanjing University); 中国科学技术大学(University of Science and Technology of China); 上海交通大学(Shanghai Jiao Tong University); 浙江大学(Zhejiang University)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICME 2026

点击查看摘要

Abstract:Existing image-to-video generation methods often produce physically implausible motions and lack precise control over object dynamics. While prior approaches have incorporated physics simulators, they remain confined to 2D planar motions and fail to capture depth-aware spatial interactions. We introduce PhysLayer, a novel framework enabling language-guided, depth-aware layered animation of static images. PhysLayer consists of three key components: First, a language-guided scene understanding module that utilizes vision foundation models to decompose scenes into depth-based layers by analyzing object composition, material properties, and physical parameters. Second, a depth-aware layered physics simulation that extends 2D rigid-body dynamics with depth motion and perspective-consistent scaling, enabling more realistic object interactions without requiring full 3D reconstruction. Third, a physics-guided video synthesis module that integrates simulated trajectories with scene-aware relighting for temporally coherent results. Experimental results demonstrate improvements in CLIP-Similarity (+2.2%), FID score (+9.3%), and Motion-FID (+3%), with human evaluation showing enhanced physical plausibility (+24%) and text-video alignment (+35%). Our approach provides a practical balance between physical realism and computational efficiency for controllable image animation.

[CV-116] Spatiotemporal Degradation-Aware 3D Gaussian Splatting for Realistic Underwater Scene Reconstruction

【速读】:该论文旨在解决水下视频中真实场景重建的问题,其核心挑战在于水下成像中存在的时空退化现象(spatiotemporal degradations),包括光斑效应(caustics)、闪烁(flickering)、衰减(attenuation)和后向散射(backscattering),这些退化因素导致现有三维重建方法在几何结构与外观上均存在偏差。解决方案的关键在于提出一种基于3D高斯溅射(3D Gaussian Splatting)的框架 MarineSTD-GS,该框架通过引入成对的高斯原语——内在高斯(Intrinsic Gaussians)表示真实场景,退化高斯(Degraded Gaussians)模拟观测到的退化图像,并借助一个时空退化建模模块(Spatiotemporal Degradation Modeling, SDM)物理地推导出退化高斯的颜色,从而实现从退化图像中自监督地解耦出真实场景外观。此外,为提升训练稳定性和几何精度,作者进一步设计了深度引导的几何损失(Depth-Guided Geometry Loss)和多阶段优化策略,使模型能够有效应对复杂水下环境中的时空退化并生成逼真、无水体干扰的视图合成结果。

链接: https://arxiv.org/abs/2604.23551
作者: Shaohua Liu,Ning Gao,Zuoya Gu,Hongkun Dou,Yue Deng,Hongjue Li
机构: Beihang University (北京航空航天大学); School of Astronautics (宇航学院); Shen Yuan Honors College (沈元荣誉学院); School of Artificial Intelligence (人工智能学院); Zhongguancun Academy (中关村学院); State Key Laboratory of High-Efficiency Reusable Aerospace Transportation Technology (高效可重复使用航天运输技术国家重点实验室); Beijing Key Laboratory of System Design for Reusable Launch Vehicle (可重复使用运载器系统设计北京市重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 10 figures, 6 tables. Author version of the paper published in Proceedings of ACM Multimedia 2025

点击查看摘要

Abstract:Reconstructing realistic underwater scenes from underwater video remains a meaningful yet challenging task in the multimedia domain. The inherent spatiotemporal degradations in underwater imaging, including caustics, flickering, attenuation, and backscattering, frequently result in inaccurate geometry and appearance in existing 3D reconstruction methods. While a few recent works have explored underwater degradation-aware reconstruction, they often address either spatial or temporal degradation alone, falling short in more real-world underwater scenarios where both types of degradation occur. We propose MarineSTD-GS, a novel 3D Gaussian Splatting-based framework that explicitly models both temporal and spatial degradations for realistic underwater scene reconstruction. Specifically, we introduce two paired Gaussian primitives: Intrinsic Gaussians represent the true scene, while Degraded Gaussians render the degraded observations. The color of each Degraded Gaussian is physically derived from its paired Intrinsic Gaussian via a Spatiotemporal Degradation Modeling (SDM) module, enabling self-supervised disentanglement of realistic appearance from degraded images. To ensure stable training and accurate geometry, we further propose a Depth-Guided Geometry Loss and a Multi-Stage Optimization strategy. We also construct a simulated benchmark with diverse spatial and temporal degradations and ground-truth appearances for comprehensive evaluation. Experiments on both simulated and real-world datasets show that MarineSTD-GS robustly handles spatiotemporal degradations and outperforms existing methods in novel view synthesis with realistic, water-free scene appearances.

[CV-117] COMO: Closed-Loop Optical Molecule Recognition with Minimum Risk Training

【速读】:该论文旨在解决光学化学结构识别(Optical Chemical Structure Recognition, OCSR)在真实文档中面临的挑战,包括化学结构的无限变体、简写惯例以及视觉噪声等问题。现有基于深度学习的方法通常采用教师强制(teacher forcing)与词元级最大似然估计(token-level Maximum Likelihood Estimation, MLE)进行训练,但这种范式存在暴露偏差(exposure bias),即模型在训练时依赖真实前缀,而在推理时需基于自身预测,导致性能下降;同时,词元级MLE目标难以优化分子层面的评价指标(如化学有效性与结构相似性)。解决方案的关键在于引入最小风险训练(Minimum Risk Training, MRT),并提出COMO(Closed-loop Optical Molecule recOgnition)框架——该框架通过迭代采样和评估模型自身的预测,直接优化分子级别的不可微目标,从而有效缓解暴露偏差,并显著提升识别准确率。实验表明,COMO在十个基准数据集上均优于现有规则和学习方法,且MRT具有架构无关性,具备广泛应用于端到端OCSR系统的潜力。

链接: https://arxiv.org/abs/2604.23546
作者: Zhuoqi Lyu,Qing Ke
机构: City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Optical chemical structure recognition (OCSR) translates molecular images into machine-readable representations like SMILES strings or molecular graphs, but remains challenging in real-world documents due to inexhaustible variations in chemical structures, shorthand conventions, and visual noise. Most existing deep-learning-based approaches rely on teacher forcing with token-level Maximum Likelihood Estimation (MLE). This training paradigm suffers from exposure bias, as models are trained under ground-truth prefixes but must condition on their own previous predictions during inference. Moreover, token-level MLE objectives hinder the optimization towards molecular-level evaluation criteria such as chemical validity and structural similarity. Here we introduce Minimum Risk Training (MRT) to OCSR and propose COMO (Closed-loop Optical Molecule recOgnition), a closed-loop framework that mitigates exposure bias by directly optimizing over molecule-level, non-differentiable objectives, by iteratively sampling and evaluating the model’s own predictions. Experiments on ten benchmarks including synthetic and real-world chemical diagrams from patent and scientific literature demonstrate that COMO substantially outperforms existing rule-based and learning-based methods with less training data. Ablation studies further show that MRT is architecture-agnostic, demonstrating its potential for broad application to end-to-end OCSR systems.

[CV-118] AusSmoke meets MultiNatSmoke: a fully-labelled diverse smoke segmentation dataset WACV2026

【速读】:该论文旨在解决当前生成式AI(Generative AI)在火灾烟雾分割任务中因训练数据规模小、地理分布局限及依赖合成图像而导致模型泛化能力不足的问题。其关键解决方案是构建两个新数据集:一是针对澳大利亚地区数据稀缺问题的AusSmoke烟雾分割数据集;二是整合国际公开数据与新采集的澳大利亚影像,形成覆盖多国、规模扩大一个数量级的MultiNatSmoke全标注基准数据集,从而显著提升烟雾分割模型在不同地理环境下的性能与泛化能力。

链接: https://arxiv.org/abs/2604.23542
作者: Weihao Li,Hongjin Zhao,Gao Zhu,Ge-Peng Ji,Nicholas Wilson,Marta Yebra,Nick Barnes
机构: Bushfire Research Centre of Excellence, Australian National University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to WACV 2026. Project page: this https URL

点击查看摘要

Abstract:Wildfires are an escalating global concern due to the devastating impacts on the environment, economy, and human health, with notable incidents such as the 2019-2020 Australian bushfires and the 2025 California wildfires underscoring the severity of these events. AI-enabled camera-based smoke detection has emerged as a promising approach for the rapid detection of wildfires. However, existing wildfire smoke segmentation datasets that are used for training detection and segmentation models are limited in scale, geographically constrained, and often rely on synthetic imagery, which hinders effective training and generalization. To overcome these limitations, we present AusSmoke, a new smoke segmentation dataset collected from Australia to address the data scarcity in this region. Furthermore, we introduce a MultiNational geographically diverse and substantially larger fully-labelled benchmark, called MultiNatSmoke, that consolidates publicly available international datasets with the newly collected Australian imagery, expanding the scale by an order of magnitude over previous collections. Finally, we benchmark smoke segmentation models, demonstrating improved performance and enhanced generalization across diverse geographical contexts. The project is available at \hrefthis https URLGithub.

[CV-119] Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization

【速读】:该论文旨在解决文本到图像扩散模型中复杂文本提示与生成图像布局之间对齐不准确的问题,尤其是现有在线优化和搜索方法因依赖无约束的欧几里得梯度上升而导致潜在空间范数膨胀、破坏标准高斯先验,从而引发严重的视觉伪影(如颜色过饱和),并存在语义路由效率低及外部代理模型“奖励劫持”(reward hacking)等挑战。解决方案的关键在于提出 Oracle Noise 框架,将噪声初始化重构为严格限制在黎曼超球面上的语义驱动优化过程:通过直接识别提示中最关键的结构化词汇实现高效语义路由,同时沿球面路径更新噪声以数学上保持原始高斯分布特性,从而避免范数膨胀并支持激进的学习率以实现快速收敛。此几何约束机制显著提升了语义对齐速度与图像美学质量,且无需黑箱模型即可达到当前最优性能。

链接: https://arxiv.org/abs/2604.23540
作者: Haosen Li,Wenshuo Chen,Lei Wang,Shaofeng Liang,Haozhe Jia,Yutao Yue
机构: The Hong Kong University of Science and Technology (Guangzhou); Griffith University; Data61/CSIRO
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image diffusion models have achieved remarkable generative capabilities, yet accurately aligning complex textual prompts with synthesized layouts remains an ongoing challenge. In these models, the initial Gaussian noise acts as a critical structural seed dictating the macroscopic layout. Recent online optimization and search methods attempt to refine this noise to enhance text-image alignment. However, relying on unconstrained Euclidean gradient ascent mathematically inflates the latent norm and destroys the standard Gaussian prior, causing severe visual artifacts like color over-saturation. Furthermore, these methods suffer from inefficient semantic routing and easily fall into the ``reward hacking’’ trap of external proxy models. To address these intertwined bottlenecks, we propose Oracle Noise, a zero-shot framework reframing noise initialization as semantic-driven optimization strictly confined to a Riemannian hypersphere. Instead of relying on complex external parsers, we directly identify the most impactful structural words in the prompt to efficiently route optimization energy. By updating the noise strictly along a spherical path, we mathematically preserve the original Gaussian distribution. This geometric constraint eliminates norm inflation and unlocks aggressive step sizes for rapid convergence. Extensive experiments demonstrate that Oracle Noise significantly accelerates semantic alignment and achieves superior aesthetics without black-box models. It completely mitigates Euclidean-induced degradation, establishing state-of-the-art performance across human preference metrics (e.g., HPSv2, ImageReward), semantic alignment (CLIP Score), and sample diversity, all within a strict 2-second optimization budget.

[CV-120] Z2-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

【速读】:该论文旨在解决扩散模型中Classifier-Free Guidance(CFG)方法因仅依赖瞬时梯度而忽略数据流形内在曲率所导致的语义对齐不足问题,以及现有显式zigzag采样(Z-Sampling)方法因多步前向-后向轨迹遍历带来三倍神经函数评估(NFE)成本和非约束截断误差、引发分布漂移的问题。解决方案的关键在于提出隐式Z-Sampling(Implicit Z-Sampling),通过算子对偶性代数消去中间状态,物理上消除流形外近似误差;并进一步引入Z²-Sampling(Zero-cost Zigzag Sampling),利用概率流ODE的时间一致性,将隐式代数坍缩与动态缓存的时间语义代理耦合,恢复标准2-NFE基线效率的同时保留语义探索能力,并通过反向误差分析证明其离散坍缩天然合成方向导数曲率惩罚项,从而在性能与效率之间实现结构上的帕累托前沿突破。

链接: https://arxiv.org/abs/2604.23536
作者: Haosen Li,Wenshuo Chen,Shaofeng Liang,Lei Wang,Kaishen Yuan,Yutao Yue
机构: The Hong Kong University of Science and Technology (Guangzhou); Griffith University; Data61/CSIRO
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have achieved unprecedented success in text-aligned generation, largely driven by Classifier-Free Guidance (CFG). However, standard CFG operates strictly on instantaneous gradients, omitting the intrinsic curvature of the data manifold. Recent methods like Zigzag-sampling (Z-Sampling) explicitly traverse multi-step forward-backward trajectories to probe this curvature, significantly improving semantic alignment. Yet, these explicit traversals triple the Neural Function Evaluation (NFE) cost and introduce unconstrained truncation errors from off-manifold evaluations, causing cumulative drift from the true marginal distribution. In this paper, we theoretically demonstrate that the explicit zigzag sequence is topologically reducible. We propose Implicit Z-Sampling, rigorously proving that intermediate states can be algebraically annihilated via operator dualities, physically eliminating off-manifold approximation errors. To push sampling efficiency to its theoretical lower bound, we introduce Z^2 -Sampling (Zero-cost Zigzag Sampling). Exploiting the Probability Flow ODE’s temporal coherence, Z^2 -Sampling couples implicit algebraic collapse with a dynamically cached Temporal Semantic Surrogate. This restores the standard 2-NFE baseline without sacrificing semantic exploration. We formally prove via Backward Error Analysis that this discrete collapse inherently synthesizes a directional derivative curvature penalty. Finally, extensive evaluations demonstrate that Z^2 -Sampling structurally shatters the performance-efficiency Pareto frontier. We validate its universal applicability across diverse architectures (U-Nets, DiTs) and modalities (image/video), establishing seamless orthogonality with advanced alignment frameworks (AYS, Diffusion-DPO).

[CV-121] Emotion-Conditioned Short-Horizon Human Pose Forecasting with a Lightweight Predictive World Model

【速读】:该论文旨在解决短时人体姿态预测中忽略情感信号影响的问题,即现有轨迹预测模型主要依赖几何运动线索,而未充分考虑面部表情所蕴含的情绪信息对人类运动动态的潜在调节作用。其解决方案的关键在于提出一种轻量级自回归预测世界模型(lightweight autoregressive predictive world model),通过可学习的门控机制将姿态关键点与面部表情提取的情绪嵌入(emotion embeddings)进行融合,并采用两层LSTM架构实现15步滚动预测。实验表明,这种基于情绪条件的多模态融合策略在自然情绪驱动的运动序列中显著提升预测性能,且门控机制能有效区分辅助条件信号与冗余特征,从而验证了情绪嵌入作为辅助条件信号的有效性。

链接: https://arxiv.org/abs/2604.23532
作者: Jingni Huang,Peter Bloodsworth
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Short-term human pose prediction plays a crucial role in interactive systems, assistive robots, and emotion-aware human-computer interaction[1-3]. While current trajectory prediction models primarily rely on geometric motion cues, they often overlook the underlying emotional signals influencing human motion dynamics[4-5]. This paper investigates whether facial expression-derived emotion embeddings can provide auxiliary conditional signals for short-term pose prediction. To further evaluate multimodal conditionation in a recursive prediction setting, we propose a lightweight autoregressive predictive world model that performs 15-step rolling pose prediction. This framework combines pose keypoints with emotion embeddings through a learnable gating mechanism and performs autoregressive unfolding prediction using a recurrent sequence model based on a two-layer LSTM architecture. Experiments were conducted on two small-scale pose-emotion video datasets: controlled motion sequences with minimal facial expression changes and, natural emotion-driven motion sequences with considerable facial expression changes. The results show that simple multimodal fusion does not consistently improve prediction accuracy, while normalized gating fusion significantly enhances the performance of emotion-driven motion sequences. Furthermore, counterfactual perturbation experiments demonstrate that the predicted trajectory exhibits measurable sensitivity to changes in multimodal input, suggesting that facial expression embeddings act as auxiliary conditional signals rather than redundant features. In summary, these results indicate that incorporating facial expression-derived emotion embeddings into emotion-conditional short-term pose prediction based on a lightweight predictive world model architecture is a feasible approach.

[CV-122] BurstGP: Enhancing Raw Burst Image Super Resolution with Generative Priors

【速读】:该论文旨在解决突发图像超分辨率(Burst Image Super Resolution, BISR)中因复杂纹理恢复困难和过度平滑导致的图像质量下降问题。现有方法在处理多帧低分辨率(LR)图像时,难以有效利用时间冗余与空间一致性来重建高分辨率(HR)图像,尤其在细节生成方面表现不足。其解决方案的关键在于提出一种基于扩散模型的新框架 BurstGP,该框架通过引入预训练基础模型的生成先验(generative priors),构建了一个多帧感知的扩散模型,并在此基础上创新性地设计了两个核心机制:(i) 一种退化感知的条件控制机制,根据输入图像的估计退化程度调控细粒度细节的合成;(ii) 一个鲁棒的 sRGB 到 lRGB 反变换器,使模型能够利用视频域的 sRGB 先验信息,同时兼容原始 RAW 输入和线性 RGB(lRGB)输出。这一方法显著提升了图像的感知质量和结构细节恢复能力,在定量指标(如 MUSIQ、LPIPS)和定性效果上均优于当前最优方法。

链接: https://arxiv.org/abs/2604.23508
作者: Dong Huo,Tristan Aumentado-Armstrong,Samrudhdhi B. Rangrej,Maitreya Suin,Angela Ning Ye,Zhiming Hu,Amanpreet Walia,Amirhossein Kazerouni,Konstantinos G. Derpanis,Iqbal Mohomed,Alex Levinshtein
机构: AI Center - Toronto, Samsung Electronics Vector Institute
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 37 pages, 13 figures

点击查看摘要

Abstract:Burst image super resolution (BISR) aims to construct a single high-resolution (HR) image by aggregating information from multiple low-resolution (LR) frames, relying on temporal redundancy and spatial coherence across the burst. While conventional methods achieve impressive results, they often struggle with complex textures and oversmoothing. Diffusion models, particularly those pretrained on high-quality data, have shown remarkable capability in generating realistic details for image and video super-resolution. However, their potential remains largely under-explored in BISR, where existing approaches typically rely on task-specific diffusion models trained from scratch and operate on single-frame reconstructions. In this work, we propose BurstGP, a novel diffusion-based solution for BISR, which leverages generative priors of recent foundation models to overcome these issues. In particular, we build a multiframe-aware diffusion model on top of a conventional BISR approach, which boosts image quality with minimal loss to fidelity. Further, we introduce (i) a novel degradation-aware conditioning mechanism, which controls synthesis of fine details based on the estimated degradation in the input, and (ii) a robust sRGB-to-lRGB inverter, enabling us to utilize generative multiframe (video) sRGB priors, while operating with raw input and lRGB output images. Empirically, we demonstrate that BurstGP outperforms the existing state of the art, both quantitatively (especially with respect to perceptual metrics, including MUSIQ and LPIPS) and qualitatively. In particular, our proposed method excels at recovering richer textures and finer structural details, highlighting the potential of video priors for BISR over traditional methods.

[CV-123] Leverag ing Spatial Transcriptomics as Alternative to Manual Annotations for Deep Learning-Based Nuclei Analysis

【速读】:该论文旨在解决病理图像中细胞核分割与分类任务对大规模像素级人工标注的依赖问题,这类标注在不同组织类型和染色条件下获取成本高且难以扩展。其解决方案的关键在于利用空间转录组学(Spatial Transcriptomics, ST)数据作为监督信号,将基因表达谱转化为细胞类型标签,并构建一种面向图像的分类方法,从而实现从基因表达到图像识别的跨模态映射,有效提升模型在未见器官上的泛化能力。

链接: https://arxiv.org/abs/2604.23481
作者: Kazuya Nishimura,Ryoma Bise,Haruka Hirose,Yasuhiro Kojima
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep learning-based nuclei segmentation and classification in pathology images typically rely on large-scale pixel-level manual annotations, which are costly and difficult to obtain across diverse tissues and staining conditions. To address this limitation, we propose a framework that leverages spatial transcriptomics (ST) data as supervision for nuclei segmentation and classification. By incorporating cell-level ST data, we obtain gene expression profiles and corresponding nuclear masks from histopathological images. Gene expression profiles are converted into cell-type labels and used as training data for image-based classification. Because existing gene expression-based cell-type classification methods are not designed for image recognition, we introduce an image-oriented classification approach that bridges gene expression-based cell typing and image-based cell classification. To evaluate generalization, we conduct segmentation experiments on previously unseen organs and compare our method with conventional supervised models. Despite being trained on fewer organ types, our framework achieves higher segmentation accuracy, demonstrating strong transferability. Classification experiments further show consistent improvements over existing approaches.

[CV-124] From Edges to Depth: Probing the Spatial Hierarchy in Vision Transformers

【速读】:该论文旨在解决视觉 Transformer(Vision Transformer, ViT)在仅通过图像分类任务预训练的情况下,如何隐式地编码空间结构信息的问题。研究发现,尽管没有显式的空间监督信号,ViT 仍能逐步构建出从局部边界到全局深度的层次化空间表征:局部 patch 边界在第 5–6 层即可线性解码(AP = 0.833),而依赖全局上下文的深度信息则在第 8 层达到峰值(MAE = 0.0875)。关键在于,这种空间结构并非静态存储于残差流中,而是通过逐层主动重构维持——因果干预实验表明,特定方向的激活扰动会显著削弱深度解码能力(最多下降 165%),且中层干预对下游特征影响最持久,揭示了 ViT 中存在类似灵长类视觉皮层“早期到晚期”加工的主动空间层次结构。

链接: https://arxiv.org/abs/2604.23452
作者: Jainum Sanghavi
机构: Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 6 figures. Code available at this https URL

点击查看摘要

Abstract:Vision Transformers trained only on image classification routinely transfer to tasks that demand spatial understanding, yet they receive no spatial supervision during pretraining. We ask where and how robustly such structure is encoded. Probing a frozen ViT-B/16 layerwise for two complementary properties, local patch boundaries (BSDS500) and per-patch depth (NYU Depth V2), reveals a clear hierarchy: boundary structure becomes linearly decodable at layers 5-6 (AP = 0.833), while depth, which requires integrating global cues, peaks two to three layers later at layer 8 (MAE = 0.0875). Both signals collapse at the final classification layer, and random-weight controls confirm the encodings are learned rather than architectural. Causal interventions add specificity: ablating the single direction a linear depth probe reads degrades depth decoding by up to 165%, while ablating any other direction changes it by less than 1%. Targeted activation patching along that direction shows the depth signal is partially re-derived at each layer rather than passively carried in the residual stream, with mid-layer interventions persisting most strongly downstream. The result is that a classification-trained ViT develops an actively maintained spatial hierarchy that mirrors the early-to-late progression observed in the primate visual cortex.

[CV-125] Resource-Constrained UAV-Based Weed Detection for Site-Specific Management on Edge Devices

【速读】:该论文旨在解决资源受限边缘平台(如无人机搭载的嵌入式设备)上实时杂草检测模型部署中的关键挑战,即如何在有限计算资源下平衡检测精度与推理速度。其解决方案的关键在于提出一个面向部署的框架,整合无人机数据采集、模型开发与本地推理全流程,并系统评估多种先进目标检测模型(包括基于卷积的YOLO系列和基于Transformer的RT-DETR系列)在不同硬件平台(Jetson Orin Nano、AGX Xavier 和 AGX Orin)上的性能表现。实验表明,轻量级模型虽精度略低(66%-71% mAP50),但延迟显著降低,满足实时性需求;而RT-DETRv2-R50-M和YOLOv11s则在精度与效率之间取得最佳平衡,成为无人机实时杂草识别的理想候选模型。

链接: https://arxiv.org/abs/2604.23442
作者: Linyuan Wang,Haibo Yao,Te-Ming Tseng,Kelvin Betitame,Xin Sun,Hanbo Huang,Dong Chen
机构: Mississippi State University (密西西比州立大学); USDA-ARS (美国农业部农业研究服务局); North Dakota State University (北达科他州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Weeds compete with crops for light, water, and nutrients, reducing yield and crop quality. Efficient weed detection is essential for site-specific weed management (SSWM). Although deep learning models have been deployed on UAV-based edge systems, a systematic understanding of how different model architectures perform under real-world resource constraints is still lacking. To address this gap, this study proposes a deployment-oriented framework for real-time UAV-based weed detection on resource-constrained edge platforms. The framework integrates UAV data acquisition, model development, and on-device inference, with a focus on balancing detection accuracy and computational efficiency. A diverse set of state-of-the-art object detection models is evaluated, including convolution-based YOLO models (v8-v12) and transformer-based RT-DETR models (v1-v2). Experiments on three edge devices (Jetson Orin Nano, Jetson AGX Xavier, and Jetson AGX Orin) demonstrate clear trade-offs between accuracy and inference latency across models and hardware configurations. Results show that high-capacity models achieve up to 86.9% mAP50 but suffer from high latency, limiting real-time deployment. In contrast, lightweight models achieve 66%-71% mAP50 with significantly lower latency, enabling real-time performance. Among all models, RT-DETRv2-R50-M achieves competitive accuracy (79% mAP50) with improved efficiency, while YOLOv10n provides the fastest inference speed. YOLOv11s and RT-DETRv2-R50-M offer the best balance between accuracy and speed, making them strong candidates for real-time UAV deployment.

[CV-126] Knee-xRAI: An Explainable AI Framework for Automatic Kellgren-Lawrence Grading of Knee Osteoarthritis

【速读】:该论文旨在解决膝骨关节炎(Knee Osteoarthritis, KOA)放射学分级中存在的人工读片者间变异性和深度学习模型“黑箱”问题,即当前基于Kellgren-Lawrence(KL)系统的评估依赖主观判断,而现有深度学习方法直接预测KL等级但缺乏对关键结构特征的可解释性。解决方案的关键在于提出一个模块化、可解释的框架——Knee-xRAI,其核心创新是将KL分级所需的三个核心影像学特征(关节间隙狭窄[Joint Space Narrowing, JSN]、骨赘[Osteophytes]和软骨下硬化[Subchondral Sclerosis])分别独立量化,并构建由结构化特征向量驱动的双路径分类架构:一是基于XGBoost的透明路径,支持SHAP特征归因以实现全特征级审计;二是融合图像编码器与结构化特征的ConvNeXt混合路径,提升预测性能。该设计既保证了模型输出的可解释性,又实现了高精度的KL分级(测试加权kappa达0.8436),显著优于单一特征或传统端到端模型。

链接: https://arxiv.org/abs/2604.23435
作者: Azmul A. Irfan,Nur Ahmad Khatim,Alfan Alfian Irfan,Achmad Zaki,Erike A. Suwarsono,Mansur M. Arief
机构: UIN Syarif Hidayatullah Jakarta (印尼伊斯兰大学); Institut Teknologi Sepuluh Nopember (印尼十月十日技术学院); Universitas Muhammadiyah Yogyakarta (穆罕默迪亚雅加达大学); King Fahd University of Petroleum and Minerals (法赫德国王石油与矿业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:Radiographic grading of knee osteoarthritis (KOA) with the Kellgren-Lawrence (KL) system is limited by inter-reader variability and the opacity of current deep learning approaches, which predict KL grades directly from images without decomposing structural features. We present Knee-xRAI, a modular framework that independently quantifies the three cardinal radiographic features of KOA (joint space narrowing [JSN], osteophytes, and subchondral sclerosis) and integrates them into an explainable KL grade classification. The pipeline combines U-Net++ segmentation for contour-based JSN measurement, an SE-ResNet-50 network for per-site osteophyte grading (OARSI scale), and a hybrid texture-CNN classifier for binary sclerosis quantification. The resulting 50-dimensional structured feature vector feeds two complementary classification paths. An XGBoost path supports SHAP-based feature attribution. A ConvNeXt hybrid path combines the structured vector with a full-image encoder for enhanced predictive performance. Evaluated on 8,260 radiographs from an OAI-derived dataset, the JSN module achieved a Dice coefficient of 0.8909 and an mJSW intraclass correlation of 0.8674 against manual annotations. The ConvNeXt hybrid path reached a test quadratic weighted kappa (QWK) of 0.8436 and AUC of 0.9017. The transparent XGBoost path achieved a test QWK of 0.6294 with full feature-level audit capability. Ablation confirmed JSN as the dominant predictor (QWK = 0.6103 alone), with osteophyte features providing consistent incremental gain (+0.0183) and sclerosis contributing marginally. Inference-time ablation of Path B confirmed the structured pathway contributes materially beyond the image encoder, with QWK drops of 0.098 (feature zeroing) and 0.284 (feature-image permutation). Knee-xRAI explicitly quantifies all three KL-defining radiographic features within a single auditable pipeline.

[CV-127] Sphere-Depth: A Benchmark for Depth Estimation Methods with Varying Spherical Camera Orientations

【速读】:该论文旨在解决从球面图像(spherical images)中进行可靠深度估计的问题,尤其是在机器人导航和沉浸式场景理解中,由于球面相机在实际平台中存在非预期的姿态变化(pose variations),加之等距圆柱投影(equirectangular projection)固有的几何畸变,导致单目深度估计模型性能显著下降。其解决方案的关键在于提出一个名为Sphere-Depth的公开基准测试集,系统性地评估单目深度模型在模拟姿态扰动下的鲁棒性,并引入基于深度校准的误差协议(depth calibration-based error protocol),通过监督学习获得每个模型的缩放因子,将预测的相对深度值转换为度量深度值(metric depth values),从而实现跨模型的公平、有意义的性能比较。实验表明,即便针对球面图像设计的模型(如Depth Anywhere、ACDNet等)在相机姿态偏离标准位姿时仍表现出明显性能退化。

链接: https://arxiv.org/abs/2604.23432
作者: Soulayma Gazzeh,Giuseppe Mazzola,Liliana Lo Presti,Marco La Cascia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Reliable depth estimation from spherical images is crucial for 360° vision in robotic navigation and immersive scene understanding. However, the onboard spherical camera can experience unintentional pose variations in real-world robotic platforms that, along with the geometric distortions inherent in equirectangular projections, significantly impact the effectiveness of depth estimation. To study this issue, a novel public benchmark, called Sphere-Depth, is introduced to systematically evaluate the robustness of monocular depth estimation models from equirectangular images in a reproducible way. Camera pose perturbations are simulated and used to assess the performance of a popular perspective-based model, Depth Anything, and of spherical-aware models such as Depth Anywhere, ACDNet, Bifuse++, and SliceNet. Furthermore, to ensure meaningful evaluation across models, a depth calibration-based error protocol is proposed to convert predicted relative depth values into metric depth values using supervised learned scaling factors for each model. Experiments show that even models explicitly designed to process spherical images exhibit substantial performance degradation when variations in the camera pose are observed with respect to the canonical pose. The full benchmark, evaluation protocol, and dataset splits are made publicly available at: this https URL

[CV-128] Enhanced Privacy and Communication Efficiency in Non-IID Federated Learning with Adaptive Quantization and Differential Privacy

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中的两大核心问题:一是通信瓶颈,即因设备间网络条件差异导致的数据传输开销过大;二是隐私泄露风险,即通过模型参数或梯度分析可能暴露敏感信息。解决方案的关键在于融合差分隐私(Differential Privacy, DP)与自适应量化方法:首先采用基于拉普拉斯分布的差分隐私机制以提供比高斯型DP更紧致的隐私保障;其次提出两种量化策略——基于轮次的余弦退火全局比特长度调度器和基于客户端贡献度(通过数据集熵分析估计)的动态本地比特调度器,从而在保证模型精度的同时显著降低通信量,实验表明在MNIST、CIFAR10及医学影像数据集上可减少高达52.64%的通信数据量,且维持强隐私性。

链接: https://arxiv.org/abs/2604.23426
作者: Emre Ardıç,Yakup Genç
机构: Gebze Technical University ( Gebze 工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Published in IEEE Access, Vol. 13, 2025. DOI: https://doi.org/10.1109/ACCESS.2025.3554138 Github: this https URL

点击查看摘要

Abstract:Federated learning (FL) is a distributed machine learning method where multiple devices collaboratively train a model under the management of a central server without sharing underlying data. One of the key challenges of FL is the communication bottleneck caused by variations in connection speed and bandwidth across devices. Therefore, it is essential to reduce the size of transmitted data during training. Additionally, there is a potential risk of exposing sensitive information through the model or gradient analysis during training. To address both privacy and communication efficiency, we combine differential privacy (DP) and adaptive quantization methods. We use Laplacian-based DP to preserve privacy, which is relatively underexplored in FL and offers tighter privacy guarantees than Gaussian-based DP. We propose a simple and efficient global bit-length scheduler using round-based cosine annealing, along with a client-based scheduler that dynamically adapts based on client contribution estimated through dataset entropy analysis. We evaluate our approach through extensive experiments on CIFAR10, MNIST, and medical imaging datasets, using non-IID data distributions across varying client counts, bit-length schedulers, and privacy budgets. The results show that our adaptive quantization methods reduce total communicated data by up to 52.64% for MNIST, 45.06% for CIFAR10, and 31% to 37% for medical imaging datasets compared to 32-bit float training while maintaining competitive model accuracy and ensuring robust privacy through differential privacy.

[CV-129] A Heterogeneous Two-Stream Framework for Video Action Recognition with Comparative Fusion Analysis

【速读】:该论文旨在解决现有两流动作识别网络中对RGB和光流(optical flow)模态采用相同卷积骨干网络的问题,忽略了二者在结构特性上的本质差异:RGB帧包含丰富的外观与场景上下文信息,而光流则捕捉细粒度的运动模式。这种同质化处理方式导致模态间的信息互补被削弱。解决方案的关键在于提出一种异构双流架构——DualStreamHybrid,其核心是为不同模态分配适配的骨干网络:RGB流使用预训练的ViT-Tiny/16,光流流则基于20通道堆叠光流表示从头训练MobileNetV2;并通过一个可学习的投影层将两个不同维度的特征向量映射至统一空间以实现融合,避免强制架构对称性。在此基础上设计五种融合策略(包括交叉注意力、加权融合等),实验表明该方法能有效利用模态差异,并揭示了融合策略的选择与数据集规模密切相关。

链接: https://arxiv.org/abs/2604.23415
作者: Md. Afzalur Rahaman,Tahmid Rahman
机构: Hamdard University Bangladesh (哈姆丹大学孟加拉国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 8 figures

点击查看摘要

Abstract:Most two-stream action recognition networks apply the same convolutional backbone to both RGB and optical flow streams, ignoring the fact that the two modalities have fundamentally different structural properties. Optical flow captures fine-grained motion patterns, while RGB frames carry rich appearance and scene context - treating them identically discards this distinction. We propose DualStreamHybrid, a heterogeneous two-stream architecture that assigns each stream a backbone suited to its input: a pretrained ViT-Tiny/16 for RGB frames, and a MobileNetV2 trained from scratch on a 20-channel stacked optical flow representation. A learned projection layer maps the two differently-sized feature vectors to a common dimensionality before fusion, enabling the two streams to interact without forcing architectural symmetry. We design five fusion strategies within a unified framework - late fusion, concatenation, cross-attention, weighted fusion, and gated fusion - and evaluate them on UCF11 (1,600 videos, 11 classes) and UCF50 (6,681 videos, 50 classes) to study how fusion behaviour scales with dataset size. On UCF11, cross-attention achieves 98.12% test accuracy, outperforming the RGB-only ViT-Tiny baseline of 95.94%, which suggests that explicit inter-modal attention is particularly effective on smaller, less complex datasets. On UCF50, weighted fusion reaches 96.86% and proves the most consistent strategy across both benchmarks. The learned stream weights reveal an interesting pattern: UCF11 sees near-equal modality contribution (RGB: 0.507, flow: 0.493), while UCF50 favours the RGB stream slightly more (RGB: 0.554, flow: 0.446) - arguably reflecting the larger and more visually diverse action space. Taken together, these results suggest that even a lightweight motion stream meaningfully complements a strong appearance encoder, and that the optimal fusion strategy depends on dataset scale.

[CV-130] PushupBench: Your VLM is not good at counting pushups

【速读】:该论文旨在解决大规模视觉语言模型(VLMs)在视频理解任务中无法准确计数重复事件发生次数的问题。现有模型虽能识别视频中“发生了什么”,但缺乏对事件频次的精确感知能力,导致其在需要时序推理的任务中表现受限。解决方案的关键在于构建一个专门用于评估重复计数能力的基准数据集 PushupBench,包含 446 条长视频片段(平均长度 36.7 秒),并发现仅依赖准确率会误导对模型能力的判断——弱模型往往依赖模态计数(modal count)而非真正的时序推理。进一步实验表明,通过少量样本(1000 条)进行计数任务微调,可显著提升模型在通用视频理解任务上的性能(如 MVBench、PerceptionTest 和 TVBench),证明重复计数是一种有效的时序理解代理指标(proxy for temporal reasoning)。

链接: https://arxiv.org/abs/2604.23407
作者: Shengzhi Li,Jiarun Chen,Karun Sharma,Jiaqi Su,Shichao Pei
机构: Shengzhi Li; Jiarun Chen; Karun Sharma; Jiaqi Su; Shichao Pei
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large vision-language models (VLMs) can recognize \textitwhat happens in video but fail to count \textithow many times. We introduce \textbfPushupBench, 446 long-form clips (avg. 36.7s) for evaluating repetition counting. The best frontier model achieves 42.1% exact accuracy; open-source 4B models score \sim 6%, matching supervised baselines. We show that accuracy alone misleads – weaker models exploit the modal count rather than reason temporally. Fine-tuning on counting with 1k samples transfers to general video understanding: MVBench (+2.15), PerceptionTest (+1.88), TVBench (+4.54), suggesting counting is a proxy for broader temporal this http URL incorporated in \textttlmms-eval (this https URL) and hosted on (this http URL)

[CV-131] LearnDrop: Fast Learning of CNNs based on Layer Dropping

【速读】:该论文旨在解决深度卷积神经网络(Deep Convolutional Neural Networks, CNNs)在训练过程中计算效率低下的问题,尤其关注于减少前向传播(forward propagation)阶段的运算量,而非仅聚焦于推理阶段的模型压缩或反向传播中的操作限制。其解决方案的关键在于提出一种动态评估机制,通过量化每一层参数的变化程度来判断该层是否仍需继续学习,并据此对网络进行缩放——即减少待学习参数的数量,从而显著降低训练时的浮点运算次数(FLOPs)。实验表明,该方法在VGG和ResNet架构上均能实现超过50%的训练时间缩短,同时保持模型精度基本不变,特别适用于需要频繁微调或在线训练的应用场景。

链接: https://arxiv.org/abs/2604.23403
作者: Giorgio Cruciata,Luca Cruciata,Liliana Lo Presti,Jan Van Gemert,Marco La Cascia
机构: University of Palermo (帕勒莫大学); Delft University of Technology (代尔夫特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: Preprint. Paper accepted to Springer Neural Computing and Applications

点击查看摘要

Abstract:This paper proposes a new method to improve the training efficiency of deep convolutional neural networks. During training, the method evaluates scores to measure how much each layer’s parameters change and whether the layer will continue learning or not. Based on these scores, the network is scaled down such that the number of parameters to be learned is reduced, yielding a speed up in training. Unlike state-of-the-art methods that try to compress the network to be used in the inference phase or to limit the number of operations performed in the backpropagation phase, the proposed method is novel in that it focuses on reducing the number of operations performed by the network in the forward propagation during training. The proposed training strategy has been validated on two widely used architecture families: VGG and ResNet. Experiments on MNIST, CIFAR-10 and Imagenette show that, with the proposed method, the training time of the models is more than halved without significantly impacting accuracy. The FLOPs reduction in the forward propagation during training ranges from 17.83% for VGG-11 to 83.74% for ResNet-152. These results demonstrate the effectiveness of the proposed technique in speeding up learning of CNNs. The technique will be especially useful in applications where fine-tuning or online training of convolutional models is required, for instance because data arrive sequentially.

[CV-132] Breaking the Resource Wall: Geometry-Guided Sequence Modeling for Efficient Semantic Segmentation

【速读】:该论文旨在解决高精度语义分割模型在计算资源受限场景下效率低下、难以部署的问题,尤其是在硬件条件受限(如显存不足)时性能显著下降的挑战。其解决方案的关键在于提出DGM-Net架构,通过结构设计而非单纯扩大模型规模来提升建模能力:引入线性复杂度O(N)的Directional Geometric Mamba(G-Mamba)模块替代传统上下文建模单元(如ASPP和PPM),并设计DGM-Module以提取中心流场与拓扑骨架,引导状态空间模型(State Space Model, SSM)的扫描过程,从而增强边界保持能力。该方法无需大规模预训练或重型骨干网络即可实现高效且稳定的分割性能,在Cityscapes和ADE20K数据集上分别达到82.3%和45.24% mIoU,并在低资源条件下(如batch size=2、8GB显存)仍保持鲁棒性,验证了几何引导机制在SSM-based架构中的有效性与资源友好性。

链接: https://arxiv.org/abs/2604.23399
作者: Sheng-Wei Chan,Xin-Jui Pan,Chun-Po Shen,Chia-Min Lin,Yung-Che Wang,Jen-Shiun Chiang
机构: Tamkang University (淡江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 20 figures. Code will be released

点击查看摘要

Abstract:High-performance semantic segmentation has achieved significant progress in recent years, often driven by increasingly large backbones and higher computational budgets. While effective, such approaches introduce substantial computational overhead and limit accessibility under constrained hardware settings. In this paper, we propose DGM-Net (Directional Geometric Mamba Network), an efficient architecture that improves modeling capability through structural design rather than increasing model capacity. We introduce Directional Geometric Mamba (G-Mamba), a linear-complexity O(N) operator as an alternative to conventional context modeling modules such as ASPP and PPM. To further enhance structural awareness in state space model (SSM)-based modeling, we design the DGM-Module, which extracts centripetal flow fields and topological skeletons to guide the scanning process and improve boundary preservation. Without relying on large-scale pretraining or heavy backbone scaling, DGM-Net achieves 80.8% mIoU within 28k iterations, 82.3% mIoU on Cityscapes test set, and 45.24% mIoU on ADE20K. In addition, the model maintains stable performance under constrained hardware settings (e.g., batch size of 2 on 8GB VRAM), highlighting its efficiency and practicality. These results demonstrate that incorporating geometric guidance into SSM-based architectures provides an effective and resource-efficient direction for semantic segmentation.

[CV-133] Keypoint-based Dynamic Object 6-DoF Pose Tracking via Event Camera ICRA2026

【速读】:该论文旨在解决动态环境下物体6-DoF位姿估计的准确性问题,传统基于相机的方法在运动模糊、传感器噪声和低光照条件下性能受限。其解决方案的关键在于利用事件相机(event camera)高动态范围与低延迟特性,并提出一种基于关键点检测与跟踪的方法:首先通过构建关键点检测网络从事件流生成的时间表面中提取特征点;随后结合事件极性与空间坐标,利用关键点邻域内的事件密度实现连续跟踪;最后建立2D关键点与3D模型关键点之间的哈希映射关系,并采用EPnP算法完成6-DoF位姿估计,从而在仿真与真实事件环境中均展现出优于现有事件视觉方法的精度与鲁棒性。

链接: https://arxiv.org/abs/2604.23387
作者: Zhe Wang,Qijin Song,Zihao Li,Jingyu Xiao,Weibang Bai
机构: ShanghaiTech University (上海科技大学); WLSA Shanghai Academy (上海协和国际学校)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)

点击查看摘要

Abstract:Accurate 6-DoF pose estimation of objects is critical for robots to perform precise manipulation tasks. However, for dynamic object pose estimation, conventional camera-based approaches face several major challenges, such as motion blur, sensor noise, and low-light limitation. To address these issues, we employ event cameras, whose high dynamic range and low latency offer a promising solution. Furthermore, we propose a keypoint-based detection and tracking approach for dynamic object pose estimation. Firstly, a keypoint detection network is constructed to extract keypoints from the time surface generated by the event stream. Subsequently, the polarity and spatial coordinates of the events are leveraged, and the event density in the vicinity of each keypoint is utilized to achieve continuous keypoint tracking. Finally, a hash mapping is established between the 2D keypoints and the 3D model keypoints, and the EPnP algorithm is employed to estimate the 6-DoF pose. Experimental results demonstrate that, whether in simulated or real event environments, the proposed method outperforms the event-based state-of-the-art methods in terms of both accuracy and robustness.

[CV-134] V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

【速读】:该论文旨在解决生成式 AI(Generative AI)模型在后训练阶段与人类偏好或可验证奖励对齐的问题,特别是针对去噪生成模型(denoising generative models)中因似然函数不可计算而导致的强化学习(Reinforcement Learning, RL)优化困难。现有方法要么基于诱导马尔可夫决策过程(Induced Markov Decision Process, MDP)进行轨迹采样,虽稳定但效率低;要么使用基于扩散证据下界(Diffusion Evidence Lower Bound, ELBO)的似然替代目标,但此前在视觉生成任务中表现不佳。论文的关键突破在于证明并实现:通过降低ELBO代理目标的方差并控制梯度步长,ELBO-based方法可以兼具稳定性与高效性。为此提出Variational GRPO(V-GRPO),将ELBO代理目标与Group Relative Policy Optimization(GRPO)算法结合,并引入一系列简单但关键的技术手段,从而在文本到图像生成任务中达到当前最优性能,同时相较MixGRPO和DiffusionNFT分别提速2倍和3倍。

链接: https://arxiv.org/abs/2604.23380
作者: Bingda Tang,Yuhui Zhang,Xiaohan Wang,Jiayuan Mao,Ludwig Schmidt,Serena Yeung-Levy
机构: Stanford University (斯坦福大学); Tsinghua University (清华大学); Amazon FAR; University of Pennsylvania (宾夕法尼亚大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Aligning denoising generative models with human preferences or verifiable rewards remains a key challenge. While policy-gradient online reinforcement learning (RL) offers a principled post-training framework, its direct application is hindered by the intractable likelihoods of these models. Prior work therefore either optimizes an induced Markov decision process (MDP) over sampling trajectories, which is stable but inefficient, or uses likelihood surrogates based on the diffusion evidence lower bound (ELBO), which have so far underperformed on visual generation. Our key insight is that the ELBO-based approach can, in fact, be made both stable and efficient. By reducing surrogate variance and controlling gradient steps, we show that this approach can beat MDP-based methods. To this end, we introduce Variational GRPO (V-GRPO), a method that integrates ELBO-based surrogates with the Group Relative Policy Optimization (GRPO) algorithm, alongside a set of simple yet essential techniques. Our method is easy to implement, aligns with pretraining objectives, and avoids the limitations of MDP-based methods. V-GRPO achieves state-of-the-art performance in text-to-image synthesis, while delivering a 2\times speedup over MixGRPO and a 3\times speedup over DiffusionNFT.

[CV-135] Hierarchical Spatio-Channel Clustering for Efficient Model Compression in Medical Image Analysis

【速读】:该论文旨在解决卷积神经网络(Convolutional Neural Networks, CNNs)在资源受限环境中的部署难题,其核心挑战在于模型庞大的内存占用和计算需求。现有低秩压缩方法通常独立处理空间冗余与通道冗余,未能充分利用特征图中局部结构的协同冗余。解决方案的关键在于提出一种分层的时空低秩压缩框架(hierarchical spatio-channel low-rank compression framework),该框架首先将特征图按空间区域划分,再基于各区域内通道的共激活模式对通道进行聚类,最后对每个时空簇应用自适应秩的奇异值分解(rank-adaptive SVD)。这一策略显著提升了压缩效率与模型性能,在AlexNet脑肿瘤MRI分类任务中实现了81.1%的浮点运算次数(FLOPs)削减、1.38倍推理加速,并将准确率从87.76%提升至89.80%,同时改善了难分类类别(如脑膜瘤)的宏平均F1分数。

链接: https://arxiv.org/abs/2604.23375
作者: Sisipho Hamlomo,Marcellin Atemkeng,Habte Tadesse Likassa,Blaise Ravelo,Thierry Bouwmans,Sébastien Lalléchère,Antoine Vacavant,Ding-Geng Chen
机构: Rhodes University(罗德斯大学); National Institute for Theoretical and Computational Sciences(国家理论与计算科学研究所); Arizona State University(亚利桑那州立大学); Addis Ababa University(亚的斯亚贝巴大学); University of Pretoria(普利托里亚大学); Université Clermont Auvergne(克莱蒙奥弗涅大学); La Rochelle Université(拉罗谢尔大学); Association Française de Science des Systèmes(法国系统科学协会); Osnabrück University(奥斯纳布吕克大学); Nanjing University of Information Science and Technology(南京信息工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Convolutional neural networks (CNNs) have become increasingly difficult to deploy in resource-constrained environments due to their large memory and computational requirements. Although low-rank compression methods can reduce this burden, most existing approaches compress spatial and channel redundancy independently and therefore do not fully exploit the localised structure within convolutional feature maps. This paper proposes a hierarchical spatio-channel low-rank compression framework for CNNs that exploits redundancy across spatial regions and channel activations. Unlike conventional methods, which apply a uniform decomposition across an entire layer, the proposed approach first partitions feature maps into spatial regions, then groups channels according to their co-activation patterns within each region, and finally applies rank-adaptive SVD to each resulting spatio-channel cluster. The method is evaluated on an AlexNet-based brain tumour MRI classification model and compared with Global SVD and Tucker decomposition under (3\times) and (6\times) compression budgets. Our method outperforms both baselines, reducing FLOPs from (8.21,\mathrmG) to (1.55,\mathrmG) ((81.1%) reduction), achieving a (1.38\times) inference speed-up, and increasing classification accuracy from (87.76%) to (89.80%). The method also improves the macro (F_1)-score and performance on challenging classes such as meningioma. A hyper-parameter trade-off analysis demonstrates that the framework provides Pareto-optimal configurations, enabling control over the balance between compression and predictive performance. Moderate clustering with adaptive rank selection yields strong results. Bootstrap standard errors are reported for all classification metrics.

[CV-136] EmoTrans: A Benchmark for Understanding Reasoning and Predicting Emotion Transitions in Multimodal LLM s

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在情感理解上仍局限于静态识别、缺乏对情绪动态演变过程建模能力的问题。现有基准测试未能有效评估模型在多样化社交情境中对情绪状态变化、转移及预测的理解能力,导致其在实际应用如人机交互和社交机器人中的表现受限。解决方案的关键在于提出 EmoTrans 基准,这是一个面向多模态视频的情感动态理解评测平台,包含 1,000 个精心收集并人工标注的视频片段,覆盖 12 种真实场景,并提供超过 3,000 个任务特定的问答对。EmoTrans 引入四个递进式任务:情绪变化检测(Emotion Change Detection, ECD)、情绪状态识别(Emotion State Identification, ESI)、情绪转移推理(Emotion Transition Reasoning, ETR)和下一情绪预测(Next Emotion Prediction, NEP),从而构建从粗粒度到细粒度、从感知到推理再到预测的系统性评估框架,推动 MLLMs 在复杂社会情境下对情绪动态建模能力的发展。

链接: https://arxiv.org/abs/2604.23348
作者: He Hu,Tengjin Weng,Zebang Cheng,Yu Wang,Jiachen Luo,Björn Schuller,Zheng Lian,Laizhong Cui
机构: Shenzhen University(深圳大学); Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)(广东省人工智能与数字经济发展实验室(深圳)); Queen Mary University of London(伦敦大学玛丽女王学院); Technical University of Munich(慕尼黑工业大学); Imperial College London(帝国理工学院); Tongji University(同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and generation, and are increasingly used in applications such as social robots and human-computer interaction, where understanding human emotions is essential. However, existing benchmarks mainly formulate emotion understanding as a static recognition problem, leaving it largely unclear whether current MLLMs can understand emotion as a dynamic process that evolves, shifts between states, and unfolds across diverse social contexts. To bridge this gap, we present EmoTrans, a benchmark for evaluating emotion dynamics understanding in multimodal videos. EmoTrans contains 1,000 carefully collected and manually annotated video clips, covering 12 real-world scenarios, and further provides over 3,000 task-specific question-answer (QA) pairs for fine-grained evaluation. The benchmark introduces four tasks, namely Emotion Change Detection (ECD), Emotion State Identification (ESI), Emotion Transition Reasoning (ETR), and Next Emotion Prediction (NEP), forming a progressive evaluation framework from coarse-grained detection to deeper reasoning and prediction. We conduct a comprehensive evaluation of 18 state-of-the-art MLLMs on EmoTrans and obtain two main findings. First, although current MLLMs show relatively stronger performance on coarse-grained emotion change detection, they still struggle with fine-grained emotion dynamics modeling. Second, socially complex settings, especially multi-person scenarios, remain substantially challenging, while reasoning-oriented variants do not consistently yield clear improvements. To facilitate future research, we publicly release the benchmark, evaluation protocol, and code at this https URL.

[CV-137] Exploring Hierarchical Consistency and Unbiased Objectness for Open-Vocabulary Object Detection CVPR2026

【速读】:该论文旨在解决开放词汇目标检测(Open-vocabulary object detection, OVD)中两个关键问题:一是视觉语言模型(Vision-language models, VLMs)在区域级别(region-level)的伪标签生成时存在类标签分配不准确的问题,因为VLMs通常优化于图像级别的分类任务;二是区域提议网络(Region Proposal Networks, RPNs)仅在基础类别(base classes)上训练,导致对新类别(novel classes)的置信度估计不可靠。解决方案的关键在于提出一种分层置信度校准(Hierarchical Confidence Calibration, HCC)机制,通过跨层次语义一致性(类别、超类与子类)提升伪标签的准确性,并引入LoCLIP——一种参数高效的CLIP改进方法,其嵌入对象性标记(objectness token)以缓解RPN的基础类别偏差问题,从而为新类别提供可靠的物体性估计。实验表明,该方法在COCO和LVIS等标准OVD基准上显著优于现有方法,达到新的SOTA性能。

链接: https://arxiv.org/abs/2604.23344
作者: Sanghoon Lee,Geon Lee,Hyekang Park,Bumsub Ham
机构: Yonsei University (延世大学); Korea Institute of Science and Technology (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 Findings

点击查看摘要

Abstract:Conventional object detectors typically operate under a closed-set assumption, limiting recognition to a predefined set of base classes seen during training. Open-vocabulary object detection (OVD) addresses this limitation by leveraging vision-language models (VLMs) to generate pseudo labels for novel object classes. However, existing OVD methods suffer from two critical drawbacks: (1) inaccurate class label assignments, as VLMs are optimized for image-level predictions rather than the region-level predictions required for pseudo labeling, and (2) unreliable objectness scores from region proposal networks (RPNs) trained exclusively on base object classes. To address these issues, we propose a novel pseudo labeling framework for OVD. Our approach introduces a hierarchical confidence calibration (HCC) technique, which ensures reliable class label estimation by assessing consistency across hierarchical semantic levels (class, super- and sub-category). We also present LoCLIP, a parameter-efficient adaptation of CLIP that incorporates an objectness token to mitigate base class bias problem of RPNs and provide reliable objectness estimations for novel object classes. Extensive experiments on standard OVD benchmarks, including COCO and LVIS, demonstrate that our approach clearly sets a new state of the art, validating the effectiveness of our approach. Project site: this https URL

[CV-138] H-SemiS: Hierarchical Fusion of Semi and Self-Supervised Learning for Knee Osteoarthritis Severity Grading

【速读】:该论文旨在解决膝骨关节炎(Knee Osteoarthritis, KOA)严重程度分级中因标注数据有限、类别不平衡、噪声样本及临床标注变异导致的模型性能下降问题。其核心解决方案是提出一种分层半监督自监督框架(Hierarchical fusion of Semi-Supervised framework with Self-Supervision, H-SemiS),关键在于:首先将多分类任务分解为一系列二分类子任务,以缓解类别不平衡问题;其次引入对抗性自监督重建模块,增强对未标注数据中解剖结构特征的学习能力;最后采用量子启发的教师-学生特征混合机制,在伪标签存在噪声时优化相邻等级间的判别边界,从而提升模型鲁棒性和泛化性能。

链接: https://arxiv.org/abs/2604.23335
作者: Chandravardhan Singh Raghaw,Anushka Parwal,Shahid Shafi Dar,Prajakta Darade,Nagendra Kumar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Knee osteoarthritis (KOA) is a degenerative joint disease that can lead to chronic pain, reduced mobility, and long-term disability. Automated severity grading from knee radiographs can support early assessment, but current methods heavily depend on large labeled datasets and remain sensitive to class imbalance, noisy samples, and variability in clinical annotations. To alleviate these limitations, we propose a Hierarchical fusion of Semi-Supervised framework with Self-Supervision (H-SemiS) for KOA severity grading in knee X-ray samples using limited annotated data. Rather than treating severity grading as a flat multi-class problem, H-SemiS decomposes the task into a sequence of binary sub-tasks within a semi-supervised teacher-student architecture, directly mitigating the impact of class imbalance. To further enhance feature learning from unlabeled data, the framework integrates an adversarial self-supervised reconstruction module that encourages the network to capture robust anatomical structures. In parallel, a teacher-student design with quantum-inspired feature mixing improves discrimination boundaries between adjacent grades when pseudo-labels are noisy. We comprehensively evaluate H-SemiS on two challenging multi-class datasets and assess its generalizability on two binary-class datasets. Our experimental results demonstrate the superiority of the proposed H-SemiS framework across multiple evaluation metrics, consistently outperforming several competing baselines and state-of-the-art methods. The code is publicly available at this https URL.

[CV-139] EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence ICMR

【速读】:该论文旨在解决情感驱动的说话人脸视频生成中存在的一系列问题:现有方法依赖简单的情感标签,导致语义信息不足;引入高层语义虽能提升表达力,但易引发唇音同步(lip synchronization)退化;同时,主流生成方法在长视频中难以兼顾计算效率与全局运动感知,并且存在时间一致性差的问题。解决方案的关键在于提出一种基于扩散模型的情感感知网络(EAD-Net),其核心创新包括:1)引入SyncNet监督和时序表示对齐(Temporal Representation Alignment, TREPA)机制以缓解多模态融合引起的唇音同步退化;2)设计时空方向注意力(Spatio-Temporal Directional Attention, STDA)机制,通过条带注意力捕捉长视频中的全局运动模式;3)构建时序帧图推理模块(Temporal Frame graph Reasoning Module, TFRM),利用图结构学习显式建模帧间时间一致性;4)借助大语言模型从真实视频中提取文本描述作为高层语义引导,增强情感语义控制能力。实验表明,该方法在唇音准确性、时间一致性和情感准确性上均优于现有方法。

链接: https://arxiv.org/abs/2604.23325
作者: Yahui Li,Yinfeng Yu,Liejun Wang,Shengjie Shen
机构: Xinjiang University(新疆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: Main paper (10 pages). Accepted for publication by ICMR(International Conference on Multimedia Retrieval) 2026

点击查看摘要

Abstract:Emotionally talking head video generation aims to generate expressive portrait videos with accurate lip synchronization and emotional facial expressions. Current methods rely on simple emotional labels, leading to insufficient semantic information. While introducing high-level semantics enhances expressiveness, it easily causes lip-sync degradation. Furthermore, mainstream generation methods struggle to balance computational efficiency and global motion awareness in long videos and suffer from poor temporal coherence. Therefore, we propose an \textbfEmotion-\textbfAware \textbfDiffusion model-based \textbfNetwork, called \textbfEAD-Net. We introduce SyncNet supervision and Temporal Representation Alignment (TREPA) to mitigate lip-sync degradation caused by multi-modal fusion. To model complex spatio-temporal dependencies in long video sequences, we propose a Spatio-Temporal Directional Attention (STDA) mechanism that captures global motion patterns through strip attention. Additionally, we design a Temporal Frame graph Reasoning Module (TFRM) to explicitly model temporal coherence between video frames through graph structure learning. To enhance emotional semantic control, a large language model is employed to extract textual descriptions from real videos, serving as high-level semantic guidance. Experiments on the HDTF and MEAD datasets demonstrate that our method outperforms existing methods in terms of lip-sync accuracy, temporal consistency, and emotional accuracy.

[CV-140] KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition

【速读】:该论文旨在解决现有卷积神经网络(Convolutional Neural Networks, CNNs)与基于Kolmogorov-Arnold表示定理的神经网络(Kolmogorov-Arnold Networks, KANs)之间融合不足的问题,尤其是如何在保持KAN理论优势的同时有效结合卷积机制以提升计算机视觉任务的性能。其解决方案的关键在于提出了一种全新的Kolmogorov-Arnold卷积层(Kolmogorov-Arnold Convolutional Layer),该层深度整合了Kolmogorov-Arnold表示定理与卷积操作,通过在边上传播可学习的非线性激活函数、在节点上进行简单求和的方式,显著减少了参数量并增强了模型的可解释性;在此基础上构建的KAConvNet架构不仅克服了传统KAN-卷积混合方法因简化理论基础而导致的性能瓶颈,还实现了比现有KAN-卷积组合方法更优的效果,并在性能上可与主流Vision Transformers (ViTs) 和CNNs相媲美。

链接: https://arxiv.org/abs/2604.23320
作者: Zhaoxiang Liu,Zhicheng Ma,Kaikai Zhao,Kai Wang,Shiguo Lian
机构: China Unicom(中国联通)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: published on journal"Image and Vision Computing"

点击查看摘要

Abstract:The Convolutional Neural Networks (CNNs) have been the dominant and effective approach for general computer vision tasks. Recently, Kolmogorov-Arnold neural networks (KANs), based on the Kolmogorov-Arnold representation theorem, have shown potential to replace Multi-Layer Perceptrons (MLPs) in deep learning. KANs, which use learnable nonlinear activations on edges and simple summation on nodes, offer fewer parameters and greater explainability compared to MLPs. However, there has been limited exploration of integrating the Kolmogorov-Arnold representation theorem with convolutional methods for computer vision tasks. Existing attempts have merely replaced learnable activation functions with weights, undermining KANs’ theoretical foundation and limiting their potential effectiveness. Additionally, the B-spline curves used in KANs suffer from computational inefficiency and a tendency to overfit. In this paper, we propose a novel Kolmogorov-Arnold Convolutional Layer that deeply integrates the Kolmogorov-Arnold representation theorem with convolution. This layer provides stronger method interpretability because it is based on established mathematical theorems and its design has theoretical alignment. Building on the Kolmogorov-Arnold Convolutional Layer, we design an efficient network architecture called KAConvNet, which outperforms existing methods combining KAN and convolution, and achieves competitive performance compared to mainstream ViTs and CNNs. We believe that our work offers valuable insight into the field of artificial intelligence and will inspire the development of more innovative CNNs in the 2020s. The code is publicly available at this https URL.

[CV-141] Learning from Noisy Prompts: Saliency-Guided Prompt Distillation for Robust Segmentation with SAM CVPR2026

【速读】:该论文旨在解决生成式 AI(Generative AI)在医学影像分割任务中对不精确提示(prompt)的脆弱性问题,尤其是在临床环境中常见的弱、通用且噪声较大的提示条件下,Segment Anything Model(SAM)性能显著下降的问题。解决方案的关键在于提出一种基于显著性引导的提示蒸馏框架(SPD),其核心机制包括:首先通过轻量级显著性头学习数据驱动的解剖先验,生成可靠的定位图;随后利用相邻切片的解剖线索进行上下文提示蒸馏,验证并增强噪声提示,形成与专家推理行为一致的共识提示集;同时引入成对切片一致性目标以强化局部解剖结构的一致性。该方法有效提升了基础模型在真实临床场景下的鲁棒性和分割精度。

链接: https://arxiv.org/abs/2604.23314
作者: Jingxuan Kang,Ziqi Zhang,Shaoming Zheng,Shuang Li,Uday Bharat Patel,Alexander Harry Fitzhugh,Phillip Lung,Yusuf Kiberu,Nikesh Jathanna,Shahnaz Jamil-Copley,Bernhard Kainz,Chen Qin
机构: Imperial College London (帝国理工学院); Beihang University (北京航空航天大学); National Health Service (国家医疗服务体系); University of Nottingham (诺丁汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 (Findings Track)

点击查看摘要

Abstract:Segmentation is central to clinical diagnosis and monitoring, yet the reliability of modern foundation models in medical imaging still depends on the availability of precise prompts. The Segment Anything Model (SAM) offers powerful zero-shot capabilities, although it collapses under the weak, generic, and noisy prompts that dominate real clinical workflows. In practice, annotations such as centerline points are coarse and ambiguous, often drifting across neighboring anatomy and misguiding SAM toward inconsistent or incomplete masks. We introduce SPD, a Saliency-Guided Prompt Distillation framework that converts these unreliable cues into robust guidance. SPD first learns data-driven anatomical priors through a lightweight saliency head to obtain confident localization maps. These priors then drive Contextual Prompt Distillation, which validates and enriches noisy prompts using cues from anatomically adjacent slices, producing a consensus prompt set that matches the behavior of expert reasoning. A Pairwise Slice Consistency objective further enforces local anatomical coherence during segmentation. Experiments on four challenging MRI and CT benchmarks demonstrate that SPD consistently outperforms existing SAM adaptations and supervised baselines, delivering large gains in both region-based and boundary-based metrics. SPD provides a practical and principled path toward reliable foundation model deployment in clinical environments where only imperfect prompts are available.

[CV-142] STAND: Semantic Anchoring Constraint with Dual-Granularity Disambiguation for Remote Sensing Image Change Captioning

【速读】:该论文针对遥感图像变化描述(Remote Sensing Image Change Captioning, RSICC)任务中因视角(viewpoint)、尺度(scale)和先验知识(prior knowledge)不确定性导致的语义模糊问题,提出了一种名为STAND的解决方案。其关键在于引入语义锚定约束与双粒度去歧模块:首先通过可解释的约束正则化时序特征以建立可靠的特征基础;随后利用宏观全局上下文聚合缓解视角混淆,并结合微观频域重聚焦注意力机制增强小目标尺度感知;最终借助语言类别先验构建语义概念锚定模块,在解码阶段有效应对知识模糊性,从而实现更精确的文本生成。

链接: https://arxiv.org/abs/2604.23309
作者: Yanpei Gong,Beichen Zhang,Hao Wang,Zhaobo Qi,Xinyan Liu,Yuanrong Xu,Ruiyang Gao,Weigang Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Remote sensing image change captioning (RSICC) aims to describe the difference between two remote sensing images. While recent methods have explored video modeling, they largely overlook the inherent ambiguities in viewpoint, scale, and prior knowledge, lacking effective constraints on the encoder. In this paper, we present STAND, a Semantic Anchoring Constraint with Dual-Granularity Disambiguation for RSICC, to progressively resolve these ambiguities. Specifically, to establish a reliable feature foundation, we first introduce an interpretable constraint to regularize temporal representations. Operating on these purified features, a dual-granularity disambiguation module resolves spatial uncertainties by coupling macro-level global context aggregation for viewpoint confusion with micro-level frequency-refocused attention for small-object scale enhancement. Ultimately, to translate these visually disambiguated features into precise text, a semantic concept anchoring module leverages language categorical priors to tackle knowledge ambiguity during decoding. Extensive experiments verify the superiority of STAND and its effectiveness in addressing ambiguities.

[CV-143] MetaErr: Towards Predicting Error Patterns in Deep Neural Networks

【速读】:该论文旨在解决深度学习系统在运行过程中可能突然失效但缺乏预警机制的问题,即如何预测一个深度神经网络(Deep Neural Network, DNN)在处理特定数据样本时是否会失败。其核心挑战在于现有研究主要关注降低模型误差率,而对错误发生的可预测性关注不足。解决方案的关键在于提出一种名为MetaErr的框架,该框架训练一个与基础模型架构和训练参数无关的元模型(meta-model),通过观察基础模型在特定任务上的表现来判断其是否会在当前样本上预测失败。该方法具备通用性和有效性,已在多个计算机视觉基准数据集上验证了其在提升伪标签半监督学习性能方面的优越性。

链接: https://arxiv.org/abs/2604.23289
作者: Varun Totakura,Shayok Chakraborty
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Accepted and presented at the IEEE International Conference on SMART MULTIMEDIA (ICSM 2025)

点击查看摘要

Abstract:Due to the unprecedented success of deep learning, it has become an integral component in several multimedia computing applications in todays world. Unfortunately, deep learning systems are not perfect and can fail, sometimes abruptly, without prior warning or explanation. While reducing the error rate of deep neural networks has been the primary focus of the multimedia community, the problem of predicting when a deep learning system is going to fail has received significantly less research attention. In this paper, we propose a simple yet effective framework, MetaErr, to address this under-explored problem in deep learning research. We train a meta-model whose goal is to predict whether a base deep neural network will succeed or fail in predicting a particular data sample, by observing the base models performance on a given learning task. The meta-model is completely agnostic of the architecture and training parameters of the base model. Such an error prediction system can be immensely useful in a variety of smart multimedia applications. Our empirical studies corroborate the promise and potential of our framework against competing baselines. We further demonstrate the usefulness of our framework to improve the performance of pseudo-labeling-based semi-supervised learning, and show that MetaErr outperforms several strong baselines on three benchmark computer vision datasets.

[CV-144] Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search ACL2026

【速读】:该论文旨在解决文本驱动的人体异常行为检索(text-based person anomaly search)中因姿态语义鸿沟(Pose-Semantic Gap)导致的准确性不足问题,即不同语义的行为可能具有相似的骨骼结构,从而误导检索结果。解决方案的关键在于提出结构-语义解耦级联框架(Structure-Semantic Decoupled Cascade, SSDC),其核心是将检索过程分为两个阶段:第一阶段为结构感知粗筛(Structure-Aware Coarse Retrieval),通过轻量级模型基于骨骼相似性快速筛选候选片段;第二阶段为侦探小组交互机制(Detective Squad Interaction),由侦探(Detective)、分析师(Analyst)和写作者(Writer)组成的多智能体系统进行语义验证与合成,最终融合语义生成结果与结构先验信息对候选结果重排序,从而在保持高效率的同时提升语义理解精度。

链接: https://arxiv.org/abs/2604.23282
作者: Zequn Xie,Guijin Luo,Chuxin Wang,Sihang Cai,Tao Jin,Zhou Zhao,Yixuan Tang
机构: Zhejiang University (浙江大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted to ACL 2026.10 pages, 5 figures

点击查看摘要

Abstract:Text-based person anomaly search retrieves specific behavioral events from surveillance archives using natural-language queries. Although recent pose-aware methods align geometric structures well, they face a fundamental Pose-Semantic Gap: semantically different actions can share similar skeletal geometries. While Multimodal Large Language Models (MLLMs) can reduce this ambiguity, using them for large-scale retrieval is computationally prohibitive. We propose the Structure-Semantic Decoupled Cascade (SSDC) framework, which decouples retrieval into two stages: (1) Structure-Aware Coarse Retrieval, where a lightweight model quickly filters candidates by skeletal similarity ; and (2) Detective Squad Interaction, a multi-agent semantic verification module. The squad consists of a Detective for fast binary filtering, an Analyst for evidence extraction, and a Writer for semantic synthesis. Finally, we re-rank candidates by fusing the synthesized captions with structural priors. Experiments on the PAB benchmark show that SSDC achieves state-of-the-art performance by balancing efficiency and semantic reasoning.

[CV-145] Contrastive Learning for Multimodal Human Activity Recognition with Limited Labeled Data

【速读】:该论文旨在解决多模态人体活动识别(Multimodal Human Activity Recognition)中因模态间数据高度异构性和标签稀缺性导致的现实应用差距问题。其核心解决方案是提出一种通用的对比学习框架CLMM,关键在于采用两阶段训练策略:第一阶段利用CNN-DiffTransformer编码器提取局部与全局特征以捕捉跨模态共享信息,并通过硬正样本加权算法增强梯度传播以强化共享学习;第二阶段则设计双分支结构结合质量引导注意力机制和双向门控单元来提取模态特异性信息,同时引入主辅协同训练策略融合共享与模态特异性信息,从而在有限标注数据下显著提升识别准确率与收敛性能。

链接: https://arxiv.org/abs/2604.23281
作者: Long Jing,Zhixiong Yang,Yajun Zhang,Xinlong Feng
机构: Northwest University (西北大学); Xinjiang University (新疆大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human activity recognition serves as the foundation for various emerging applications. In recent years, researchers have used collaborative sensing of multi-source sensors to capture complex and dynamic human activities. However, multimodal human activity sensing typically encounters highly heterogeneous data across modalities and label scarcity, resulting in an application gap between existing solutions and real-world needs. In this paper, we propose CLMM, a general contrastive learning framework for human activity recognition that achieves effective multimodal recognition with limited labeled data. CLMM employs a novel two-stage training strategy. In the first stage, CLMM employs a CNN-DiffTransformer encoder to capture cross-modal shared information by extracting local and global features. Meanwhile, a hard-positive samples weighting algorithm enhances gradient propagation to reinforce shared learning. In the second stage, a dual-branch architecture combining quality-guided attention and bidirectional gated units captures modality-specific information, while a primary-auxiliary collaborative training strategy fuses both shared and modality-specific information. Experimental results on three public datasets demonstrate that CLMM significantly improves state-of-the-art baselines in both recognition accuracy and convergence performance. Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2604.23281 [cs.LG] (or arXiv:2604.23281v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.23281 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-146] SemiGDA: Generative Dual-distribution Alignment for Semi-Supervised Medical Image Segmentation CVPR2026

【速读】:该论文旨在解决医学图像分割中标签稀缺与标注成本高昂的问题,传统判别式分割方法依赖大量标注掩码(segmentation masks),忽视了特征层面的分布约束,导致在小样本场景下难以实现鲁棒的语义表示学习和未标记数据的自适应建模。其解决方案的关键在于提出一种新颖的生成式双分布对齐框架(SemiGDA),通过引入双分布对齐模块(Dual-distribution Alignment Module, DAM)和一致性驱动的跳跃适配器策略(Consistency-Driven Skip Adapter, CDSA):DAM利用两个结构不同的编码器分别建模图像与掩码的特征分布,并在潜在空间中施加分布约束以实现结构化的特征一致性;CDSA则通过图像与掩码双跳接适配器融合多尺度特征,借助一致性损失增强跨分支语义对齐与细粒度语义一致性,从而显著提升模型在少量标注数据下的性能表现。

链接: https://arxiv.org/abs/2604.23274
作者: Kaiwen Huang,Yi Zhou,Yizhe Zhang,Jingxiong Li,Tao Zhou
机构: Nanjing University of Science and Technology (南京理工大学); Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper have been accepted by CVPR 2026

点击查看摘要

Abstract:Semi-supervised learning addresses label scarcity and high annotation costs in medical image segmentation by exploiting the latent information in unlabeled data to enhance model performance. Traditional discriminative segmentation relies on segmentation masks, neglecting feature-level distribution constraints. This limits robust semantic representation learning and adaptive modeling of unlabeled data in scenarios with few labels. To address these limitations, we propose SemiGDA, a novel Generative Dual-distribution Alignment framework for semi-supervised medical image segmentation. Our SemiGDA overcomes the reliance of discriminative methods on large labeled datasets by aligning feature and semantic distributions to boost semantic learning and scene adaptability. Specifically, we propose a Dual-distribution Alignment Module (DAM), which employs two structurally distinct encoders to model image and mask feature distributions. It enforces their alignment in the latent space via distributional constraints, establishing structured feature consistency. Moreover, we design a Consistency-Driven Skip Adapter (CDSA) strategy, which introduces dual skip adapters (Image and Mask) to fuse multi-scale features via skip connections. Using a consistency loss, CDSA enhances cross-branch semantic alignment and reinforces fine-grained semantic consistency. Experimental results on diverse medical datasets show that our method outperforms other state-of-the-art semi-supervised segmentation methods. Code is released at: this https URL.

[CV-147] A Hierarchical Ensemble Inference Pipeline for Robust White Blood Cell Classification Under Domain Shifts

【速读】:该论文旨在解决白细胞(White Blood Cell, WBC)分类模型在真实世界部署中因染色协议、扫描仪特性及实验室间差异导致的域偏移(domain shift)问题,从而影响模型性能。其核心解决方案是提出一种基于记忆增强的分层集成推理流程,通过引入特征库(feature bank)和经LoRA微调的DinoBloom骨干网络,构建三级k近邻(kNN)检索机制,有效降低对单一决策路径的依赖,提升模型在跨域场景下的鲁棒性与泛化能力。

链接: https://arxiv.org/abs/2604.23271
作者: Ruyi Dai,Tingkwong Ng,Hao Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automated white blood cell (WBC) classification is essential for scalable leukaemia screening. However, real-world deployment is challenged by domain shifts caused by staining protocols, scanner characteristics, and inter-laboratory variability, which often degrade model performance. The White Blood Cell Classification Challenge (WBCBench) at ISBI 2026 aims to advance robust WBC recognition, with a focus on accurately identifying blast cells and other clinically critical rare subtypes. We propose a memory-augmented, hierarchical ensemble pipeline for WBC classification under domain shifts, leveraging a feature bank and a DinoBloom backbone fine-tuned with LoRA. Our three-stage inference hierarchy combines k-nearest neighbors (kNN) retrieval at each level, reducing over-reliance on any single decision. Evaluated on the WBCBench dataset, our method ranks within the top ten by macro F1-score in the final testing phase.

[CV-148] LatentBurst: A Fast and Efficient Multi Frame Super-Resolution for Hexadeca-Bayer Pattern CIS images

【速读】:该论文旨在解决突发十六进制拜耳(hexadeca-Bayer)图案接触式图像传感器(CIS)图像的多帧超分辨率重建问题,其核心挑战包括:1)十六进制拜耳模式下相同颜色像素间距增大导致插值困难;2)大物体运动和相机移动引起的图像错位,造成模糊或鬼影伪影;3)模型需在移动设备上实时运行,对效率要求高。解决方案的关键在于提出名为LatentBurst的新网络架构:首先采用潜空间金字塔对齐与融合策略以应对大运动场景;其次设计轻量级UNet结构确保移动端高效推理;最后通过微调光流估计与两阶段知识蒸馏有效缩小域间差异,从而提升重建质量与鲁棒性。

链接: https://arxiv.org/abs/2604.23268
作者: Sangwook Baek,Vin Van Duong,Karam Park,Pilkyu Park
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper introduces a novel multi frame super-resolution network (MFSR) for burst hexadeca Bayer pattern Contact Image Sensor (CIS) images, which includes demosaicing, denoising, multi-frame fusion, and super-resolution. Designing a high-quality reconstruction network poses several challenges as follows: 1) Unlike the Bayer color filter array (CFA) pattern, it is hard to interpolate hexadeca-Bayer pattern since the pixel distance between the same color groups increases; 2) Due to large object motion and camera movements, the final fusion result usually suffers the misalignment resulting a blurry image or ghosting artifacts; 3) The proposed network should be fast and efficient enough to operate in real-time on mobile devices. To overcome these challenges, we propose a novel network, called LatentBurst, which contains: 1) a pyramid align and fusion approach in latent feature to deal with large motion scenario; 2) an efficient UNet-based structure which can run efficiently on mobile device; 3) fine-tuned optical flow estimation and two-step knowledge distillation to reduce domain-gap more effectively. Experimental results in various scenarios demonstrate the effectiveness of our proposed method compared with other state-of-the-art methods.

[CV-149] MotionHiFlow: Text-to-motion via hierarchical flow matching CVPR2026

【速读】:该论文旨在解决文本到动作生成(Text-to-motion generation)中语义对齐不足与时间连贯性差的问题,尤其针对现有方法通常仅在单一时间尺度上操作所导致的细节缺失和结构不一致问题。其解决方案的关键在于提出 MotionHiFlow,一个分层流匹配框架(hierarchical flow matching framework),通过从低到高时间尺度逐步构建运动生成路径:低尺度流捕捉高层语义和粗粒度运动结构,高尺度流细化时间细节;同时引入跨尺度过渡过程以保证连续性和噪声一致性,并结合文本-动作扩散Transformer与拓扑感知运动变分自编码器(topology-aware Motion VAE),利用关节感知位置编码和骨骼拓扑建模关节间的结构依赖关系,从而实现精准语义对齐与细粒度运动细节的联合控制。

链接: https://arxiv.org/abs/2604.23264
作者: Heng Li,Xiaotong Lin,Ling-An Zeng,Yulei Kang,Shuai Li,Jian-Fang Hu
机构: Sun Yat-sen University (中山大学); Shandong University (山东大学); Guangdong Province Key Laboratory of Information Security Technology (广东省信息安全技术重点实验室); Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education (教育部机器智能与先进计算重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted to CVPR 2026

点击查看摘要

Abstract:Text-to-motion generation aims to generate 3D human motions that are tightly aligned with the input text while remaining physically plausible and rich in fine-grained detail. Although recent approaches can produce complex and natural movements, they usually operate at only one temporal scale, which limits both semantic alignment and temporal coherence. Inspired by the fact that complex motions are conceptualized hierarchically rather than at a single temporal scale in the human cognitive system, we propose \textitMotionHiFlow, a hierarchical flow matching framework to generate motion progressively by constructing flow path from low to high temporal scales. The flows at lower scales capture high-level semantics and coarse motion structures, while flows at higher scales refine temporal details. To link the flows across scales, we introduce a novel cross-scale transition process, ensuring continuity and preserving noise consistency. Furthermore, by integrating a Text-Motion Diffusion Transformer and a topology-aware Motion VAE, MotionHiFlow explicitly models structural dependencies among joints via joint-aware positional encoding and skeletal topology, enabling precise semantic alignment alongside fine-grained motion details. Extensive experiments on HumanML3D and KIT-ML benchmarks demonstrate state-of-the-art performance, with ablation studies confirming the effectiveness of the hierarchical design and key components. Code is available at this https URL.

[CV-150] Micro-Expression-Aware Avatar Fingerprinting via Inter-Frame Feature Differencing

【速读】:该论文旨在解决合成人脸视频中驱动者身份验证(即“Avatar fingerprinting”)的问题,其核心挑战在于现有方法依赖固定且不可微的特征提取阶段,无法实现从原始像素到指纹特征的端到端优化。解决方案的关键在于提出一种无需预处理的系统:采用基于微表情感知的F5C主干网络直接处理原始视频帧,并以帧间特征差分(inter-frame feature differencing)为核心设计原则——通过在学习到的深层特征空间中对连续帧的特征图进行相减,使时间上稳定的外观维度贡献为零,而保留驱动者特有的运动动态信息。实验表明,这种设计显著提升了身份区分能力,且微表情感知的主干结构与差分机制协同作用缺一不可。

链接: https://arxiv.org/abs/2604.23247
作者: Masoumeh Chapariniya,Jean-Marc Odobez,Volker Dellwo,Teodora Vuković
机构: University of Zurich (苏黎世大学); UZH Digital Society Initiative; Idiap Research Institute (Idiap研究学院); École Polytechnique Fédérale de Lausanne (洛桑联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to TrustFA Workshop, IEEE FG 2026

点击查看摘要

Abstract:Avatar fingerprinting, i.e., verifying who drives a synthetic talking-head video rather than whether it is real, is a critical safeguard for authorized use of face-reenactment technology. Existing methods rely on a fixed, non-differentiable landmark extraction stage that prevents the fingerprinting model from being optimized end-to-end from raw pixels. We propose a preprocessing-free system built on a micro-expression-aware backbone operating on raw video frames, with inter-frame feature differencing as the core design principle: consecutive feature maps are subtracted in the learned deep feature space, so that temporally stable appearance dimensions contribute zero to the output while driver-specific motion dynamics are preserved. A controlled ablation on NVFAIR confirms that temporal motion accounts for the large majority of discriminative performance, and that raw appearance features actively degrade identity separation. Both the choice of backbone and the differencing principle are essential: differencing alone is insufficient when applied to a generic encoder, as appearance-dominated features collapse to near-identical representations across adjacent frames, while the micro-expression-aware F5C backbone retains measurable motion variation that the differencing operation can exploit. Without any external preprocessing, our model achieves an overall AUC of 0.877 on NVFAIR and matches or exceeds the landmark-based baseline on the majority of cross-generator pairs.

[CV-151] AnalogRetriever: Learning Cross-Modal Representations for Analog Circuit Retrieval

【速读】:该论文旨在解决模拟电路设计中跨模态检索难题,即在不同表示形式(如SPICE网表、原理图和功能描述)之间难以实现语义层面的高效匹配问题。现有方法通常局限于单一模态内的精确匹配,无法捕捉多模态间的语义关联。其解决方案的关键在于提出AnalogRetriever框架,通过两阶段修复管道构建高质量数据集,并利用视觉-语言模型编码原理图与文本描述、端口感知的关系图卷积网络编码网表,结合课程对比学习将三者映射至共享嵌入空间,从而实现六种跨模态检索方向上的高精度匹配(平均Recall@1达75.2%),显著优于基线方法。

链接: https://arxiv.org/abs/2604.23195
作者: Yihan Wang,Lei Li,Yao Lai,Jing Wang,Yan Lu
机构: Tsinghua University (清华大学); The University of Hong Kong (香港大学); University of Cambridge (剑桥大学); Nanjing University of Posts and Telecommunications (南京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 7 figures. Yihan Wang and Lei Li contributed equally to this paper

点击查看摘要

Abstract:Analog circuit design relies heavily on reusing existing intellectual property (IP), yet searching across heterogeneous representations such as SPICE netlists, schematics, and functional descriptions remains challenging. Existing methods are largely limited to exact matching within a single modality, failing to capture cross-modal semantic relationships. To bridge this gap, we present AnalogRetriever, a unified tri-modal retrieval framework for analog circuit search. We first build a high-quality dataset on top of Masala-CHAI through a two-stage repair pipeline that raises the netlist compile rate from 22% to 100%. Built on this foundation, AnalogRetriever encodes schematics and descriptions with a vision-language model and netlists with a port-aware relational graph convolutional network, mapping all three modalities into a shared embedding space via curriculum contrastive learning. Experiments show that AnalogRetriever achieves an average Recall@1 of 75.2% across all six cross-modal retrieval directions, significantly outperforming existing baselines. When integrated into the AnalogCoder agentic framework as a retrieval-augmented generation module, it consistently improves functional pass rates and enables previously unsolved tasks to be completed. Our code and dataset will be released.

[CV-152] DyABD: The Abdominal Muscle Segmentation in Dynamic MRI Benchmark

【速读】:该论文旨在解决当前医学图像分割领域缺乏高质量、动态且具有临床意义的腹部肌肉分割数据集的问题,特别是针对腹股沟疝患者在运动状态下产生的极端解剖变异带来的挑战。解决方案的关键在于构建首个动态腹部MRI基准数据集DyABD,其包含患者执行不同运动时采集的动态影像、高质量腹部肌肉标注、术前与术后对比图像,从而为模型泛化能力评估提供真实世界复杂场景。通过在监督学习、少样本和零样本分割范式下对现有方法进行系统评估,发现当前主流技术Dice系数仅为0.82,揭示了该领域仍存在显著提升空间,重新定义了医学图像分割任务的基准标准。

链接: https://arxiv.org/abs/2604.23187
作者: Niamh Belton,Victoria Joppin,Aonghus Lawlor,Catherine Masson,Thierry Bege,David Bendahan,Kathleen M. Curran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This work introduces DyABD, a novel and complex benchmark dataset of dynamic abdominal MRIs from patients with abdominal hernias and associated high quality abdominal muscle annotations. DyABD is the first-of-its-kind in four key ways; (1) it proposes the first abdominal muscle segmentation task, (2) the dynamic MRIs are acquired whilst the patients perform various exercises, introducing extreme anatomical variability, making it one of the most challenging segmentation datasets to date, (3) it includes both pre and post corrective MRIs and (4) DyABD promotes clinical research into the high recurrence rates of abdominal hernias. Beyond dataset introduction, this work provides a comprehensive evaluation of the generalisation capabilities of existing segmentation models across Supervised, Few Shot and Zero Shot paradigms on the unseen DyABD dataset. This work reveals that there is still room for substantial improvement in the field of medical image segmentation, with the majority of techniques achieving a Dice Coefficient of 0.82. This work therefore sheds light on the true progress of the field and redefines the benchmark for progress in medical image segmentation.

[CV-153] One Identity Many Roles: Multimodal Entity Coreference for Enhanced Video Situation Recognition CVPR2026

【速读】:该论文旨在解决视频场景理解(Video Situation Recognition, VidSitu)中多事件下实体角色识别与跨镜头时空定位的一致性问题,即如何在复杂视频中准确识别“谁对谁做了什么、用什么、如何做以及在哪里做”的细粒度语义结构。其核心挑战在于实体在不同事件和视角下的描述一致性与视觉锚定(grounding)的协同优化。解决方案的关键是提出多模态实体共指(Multimodal Entity Coreference, MEC),通过CineMEC这一多阶段框架,将文本中的事件角色提及组与视觉上聚类的实体进行对齐,无需显式标注的接地监督即可实现跨模态一致的实体识别。该方法利用视觉接地与图像描述之间的相互促进机制,在提升 captioning(CIDEr +2.5%,LEA +7%)的同时显著增强视觉定位性能(HOTA +18%)。

链接: https://arxiv.org/abs/2604.23173
作者: Balaji Darur,Amanmeet Garg,Makarand Tapaswi
机构: CVIT, IIIT Hyderabad, India; Amazon Prime Video, Seattle
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 Findings. Project Page: this https URL

点击查看摘要

Abstract:Video Situation Recognition (VidSitu) addresses the challenging problem of “who did what to whom, with what, how, and where” in a video. It tests thorough video understanding by requiring identification of salient actions and associated short descriptions for event roles across multiple events. Grounding with VidSitu requires spatio-temporal localization of key entities across shots and varied appearances. We posit that coherent video understanding requires consistent identification of entities that play different roles. We propose Multimodal Entity Coreference (MEC) to unite entity descriptions in text with grounding across the video. Towards this, we introduce CineMEC, a multi-stage approach that unites event role mention groups with visual clusters of entities, without explicit grounding supervision during training. Our approach is designed to exploit the synergy between visual grounding and captioning, where improving one influences the other and vice versa. For evaluation, we extend the VidSitu dataset with grounding annotations. While previous work focuses primarily on descriptions, CineMEC improves consistency across both: captioning (+2.5% CIDEr, +7% LEA) and visual grounding (+18% HOTA). Comments: Accepted to CVPR 2026 Findings. Project Page: this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2604.23173 [cs.CV] (or arXiv:2604.23173v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.23173 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-154] A Topology fixated Shape Gradient Framework for Non Simple Boundary Extraction for CIE Lab color images with Repulsive Energy

【速读】:该论文旨在解决图像分割中复杂场景下边界自相交及拓扑结构难以控制的问题,尤其针对包含不相连区域和多重边界的图像。其解决方案的关键在于提出一种无水平集(level set free)的混合分割方法,该方法结合了改进的分段常数形状梯度(piecewise constant shape gradient)与一个排斥函数(repulsive function),并通过非局部形状能量驱动离散曲线演化实现分割;此外,引入了一个多变量函数,依赖于曲线上少量采样点,以有效处理边界演化过程中的自相交现象,从而在复杂图像如嵌套结构和天文物体中实现对分割拓扑结构和边界自交的精确控制。

链接: https://arxiv.org/abs/2604.23167
作者: Shafeequdheen Palengara,Jyotiranjan Nayak,Vijayakrishna Rowthu
机构: SRM University-AP, India
类目: Computer Vision and Pattern Recognition (cs.CV); Analysis of PDEs (math.AP)
备注:

点击查看摘要

Abstract:A levelset free but a hybrid image segmentation approach based on a modified version of the piece wise constant shape gradient of an Mumford Shah shape functional and a repulsive function is considered. The segmentation is performed a non-local shape based through an evolution of discrete curves driven by a non local shape based energy to segment images containing disjoint regions and multiple boundaries. This formulation has a novel additional component as a multivariable function dependent on a few sampled points of the curves that handles the occurrence of self intersection during boundary curves evolution. The method is applied to a few gray scale and color images, including images with nested structures and astronomical objects. The results indicate effective segmentation in complex scenarios with absolute control on the topology of the segments and self-intersections of the boundaries

[CV-155] A satellite foundation model for improved wealth monitoring

【速读】:该论文旨在解决低收入和中等收入国家在获取准确、及时的贫困统计数据方面面临的挑战,如人口普查和家庭调查成本高、频率低、更新慢且易出错的问题。其核心解决方案是提出一种名为Tempov的卫星基础模型(satellite foundation model),该模型通过自监督学习在三百万对双时相Landsat影像上预训练,并采用参数高效微调(parameter-efficient fine-tuning)技术适配稀疏的实地调查标签。关键创新在于模型具备跨时间迁移能力,可在仅有少量标签的情况下实现高分辨率财富制图与动态测量,包括零样本现在预测(zero-shot nowcasting)长达十年后的经济状况、回溯性历史重建(retrospective hindcasting)及十年尺度的变化追踪,同时在非洲范围内展现出优异的泛化性能(R²=0.63,r²=0.68),显著降低对昂贵实地标签采集的依赖。

链接: https://arxiv.org/abs/2604.23166
作者: Zhuo Zheng,Iván Higuera-Mendieta,Richard Lee,David Newhouse,Talip Kilic,Stefano Ermon,Marshall Burke,David B. Lobell
机构: 未知
类目: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 10 figures, technical report

点击查看摘要

Abstract:Poverty statistics guide social policy, but in many low- and middle-income countries, censuses and household surveys that collect these data are costly, infrequent, quickly outdated, and sometimes error-prone. Satellite imagery offers global coverage and the possibility of predicting economic livelihoods at scale, yet existing approaches to predicting livelihoods with imagery or other non-traditional data often fail to reliably identify local-level variation and, as we show, degrade under temporal shift. Here we introduce Tempov, a satellite foundation model pretrained by self-supervision on three million bi-temporal Landsat pairs and adapted with parameter-efficient fine-tuning to sparse survey labels. The model enables large-scale, high-resolution wealth mapping and dynamic measurement, including zero-shot nowcasting up to a decade after observed labels, retrospective hindcasting, and decadal change tracking, while outperforming existing neural network and geospatial foundation-model baselines. In low-label regimes, Tempov achieves competitive accuracy with only 10% of survey samples, indicating substantially reduced dependence on expensive label collection. The model further generalizes across populous countries within and outside Africa, and scales to a unified Africa-wide model with strong continent-level performance ( R^2=0.63 , r^2=0.68 ), from which we generate high-resolution decadal maps of wealth and wealth changes for the African continent. Analysis of these maps shows large variation in recent economic performance both within and across countries. Our open-source approach provides a pathway to timely, scalable, low-cost monitoring of wealth and poverty from routinely collected satellite data.

[CV-156] BSViT: A Burst Spiking Vision Transformer for Expressive and Efficient Visual Representation Learning

【速读】:该论文旨在解决现有脉冲视觉Transformer(Spiking Vision Transformers, S-ViTs)在能效感知视觉学习中面临的两个核心问题:一是二进制脉冲编码的信息容量受限,二是全局自注意力机制引入的密集令牌交互导致计算冗余。其解决方案的关键在于提出BSViT架构,其中包含双通道突发脉冲自注意力机制(Dual-Channel Burst Spiking Self-Attention, DBSSA),通过将查询编码为二进制脉冲、键编码为突发脉冲以增强表征能力,并在值路径中采用兴奋性和抑制性双二进制通道实现带符号调制与更丰富的脉冲交互;同时,整个注意力运算保持仅加法计算,确保与神经形态硬件的兼容性;此外,引入补丁邻接掩码策略限制注意力范围至局部邻域,从而实现结构感知稀疏性并降低计算开销,且系统性地在网络中整合突发脉冲编码以提升脉冲层级的表征能力。

链接: https://arxiv.org/abs/2604.23165
作者: Hongxiang Peng,Dewei Bai,Hong Qu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spiking Vision Transformers (S-ViTs) offer a promising framework for energy-efficient visual learning. However, existing designs remain limited by two fundamental issues: the restricted information capacity of binary spike coding and the dense token interactions introduced by global self-attention. To address these challenges, this work proposes BSViT, a burst spiking-driven Vision Transformer featuring a Dual-Channel Burst Spiking Self-Attention (DBSSA) mechanism. DBSSA encodes queries with binary spikes and keys with burst spikes to enhance representational capacity. The value pathway adopts dual excitatory and inhibitory binary channels, enabling signed modulation and richer spike interactions. Importantly, the entire attention operation preserves addition-only computation, ensuring compatibility with energy-efficient neuromorphic hardware. To further reduce spike activity and incorporate spatial priors, a patch adjacency masking strategy is introduced to restrict attention to local neighborhoods, resulting in structure-aware sparsity and reduced computational overhead. In addition, burst spike coding is systematically integrated across the network to increase spike-level representational capacity beyond conventional binary spiking. Extensive experiments on both static and event-based vision benchmarks demonstrate that BSViT consistently outperforms existing spiking Transformers in accuracy while maintaining competitive energy efficiency.

[CV-157] UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks

【速读】:该论文旨在解决视频问答(VideoQA)任务中大型多模态模型(LMMs)因隐式推理导致决策过程不透明、且难以进行多跳推理的问题。其核心挑战在于如何在保持高精度的同时提升模型的可解释性与多步逻辑推理能力。解决方案的关键在于提出一种模块化框架UpstreamQA,通过引入上游显式推理模块(upstream reasoning modules),利用大型推理模型(LRMs)先执行对象识别和场景上下文生成,从而生成结构化的中间推理路径,并将这些增强后的推理痕迹传递给下游LMMs完成最终的视频问答任务。这一设计实现了对视频理解核心组件的解耦与独立评估,显著提升了性能和诊断透明度,尤其在基线性能不足时效果突出,但也表明当原始模型已具备较高能力时,额外显式推理可能带来性能下降。

链接: https://arxiv.org/abs/2604.23145
作者: Jason Nguyen,Ameet Rao,Alexander Chang,Ishaan Kumar,Erin Tan
机构: Lincoln North Star High School (林肯北星高中); The Charter School of Wilmington (特许学校温彻斯特); Greenwich High School (格林威治高中); Santa Susana High School (圣苏萨娜高中); UC Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Video Question Answering (VideoQA) demands models that jointly reason over spatial, temporal, and linguistic cues. However, the task’s inherent complexity often requires multi-step reasoning that current large multimodal models (LMMs) perform implicitly, leaving their internal decision process opaque. In contrast, large reasoning models (LRMs) explicitly generate intermediate logical steps that enhance interpretability and can improve multi-hop reasoning accuracy. Yet, these models are not designed for native video understanding, as they typically rely on static frame sampling. We propose UpstreamQA, a modular framework that disentangles and evaluates core video reasoning components through explicit upstream reasoning modules. Specifically, we employ multimodal LRMs to perform object identification and scene context generation before passing enriched reasoning traces to downstream LMMs for VideoQA. We evaluate UpstreamQA on the OpenEQA and NExTQA datasets using two LRMs (o4-mini, Gemini 2.5 Pro) and two LMMs (GPT-4o, Gemini 2.5 Flash). Our results demonstrate that introducing explicit reasoning can significantly boost performance and interpretability of downstream VideoQA, but can also lead to performance degradation when baseline performance is sufficiently high. Overall, UpstreamQA offers a principled framework for combining explicit reasoning and multimodal understanding, advancing both performance and diagnostic transparency in VideoQA in several scenarios.

[CV-158] CNN-ViT Fusion with Adaptive Attention Gate for Brain Tumor MRI Classification: A Hybrid Deep Learning Model

【速读】:该论文旨在解决脑肿瘤早期检测与分类中从磁共振成像(MRI)图像中提取有效特征的难题,尤其是在医学图像复杂性和多样性背景下如何提升分类精度。其解决方案的关键在于提出一种新型混合架构,融合了SqueezeNet风格的卷积神经网络(CNN)分支与MobileViT风格的全局视觉Transformer(ViT)分支,并通过自适应注意力门(Adaptive Attention Gate)机制动态学习每样本、每特征的权重,实现局部纹理与全局依赖信息的上下文敏感融合。该方法在Brain Tumor MRI Dataset(Kaggle)上取得了97.60%的测试准确率、0.9946的宏平均AUC等优异性能,显著优于单一CNN或ViT基线及现有融合方法,验证了动态特征加权在医学图像分类中的有效性。

链接: https://arxiv.org/abs/2604.23137
作者: Syed Ibad Hasnain,Muhammad Faris,Hafiza Syeda Yusra Tirmizi,Rabail Khowaja,Hafsa Israr
机构: Hamdard University (哈姆丹大学); Sir Syed University (赛义德大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 9 pages, 4 figures, submitted as conference paper

点击查看摘要

Abstract:Early detection and classifying brain tumors using Magnetic Resonance Imaging (MRI) images is highly important but difficult to extract in medical images. Convolutional Neural Networks (CNNs) are good at capturing both local texture and spatial information whereas Vision Transformers (ViTs) are good at capturing long-range global dependencies. We propose a new hybrid architecture that combines a SqueezeNet-style CNN branch with a MobileViT-style global transformer branch, through an Adaptive Attention Gate mechanism, in this paper. The gate learns dynamically per-sample, per-feature weights to weight the contribution of each branch, allowing context-sensitive merging of local and global representations. The proposed model has a test accuracy of 97.60, a precision of 97.30, a recall of 97.50, an F1-score of 97.40, and a macro-average area under the curve (AUC) of 0.9946 with a trained and evaluated on the Brain Tumor MRI Dataset (Kaggle). These scores are higher than single CNN and ViT baselines, and current competitive fusion methods, showing that dynamic feature weighting is an effective way to classify medical images.

[CV-159] Learning from Imperfect Text Guidance: Robust Long-Tail Visual Recognition with High-Noise Label

【速读】:该论文旨在解决长尾分布(long-tailed distribution)与高噪声标签(high-noise labels)共存场景下深度模型性能显著下降的问题,尤其关注在高噪声条件下标签与图像之间存在的严重不一致(label-image mismatch)。其解决方案的关键在于引入弱教师监督(Weak Teacher Supervision, WTS),利用预训练视觉-语言模型(vision-language model)中固有的跨模态对齐特性,通过辅助文本信息从观察到的标签中提取类别语义,从而校正标签与图像之间的不一致性。WTS不受标签噪声和数据分布偏移影响,尽管准确率有限,但其激活由文本预测标签与观测标签之间的差异决定,实现了对噪声标签的有效鲁棒性修正。

链接: https://arxiv.org/abs/2604.23125
作者: Mengke Li,Haiquan Ling,Yiqun Zhang,Yang Lu,Hui Huang
机构: Shenzhen University (深圳大学); Guangdong University of Technology (广东工业大学); Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by CVM 2026

点击查看摘要

Abstract:Real-world data often exhibit long-tailed distributions with numerous noisy labels, substantially degrading the performance of deep models. While prior research has made progress in addressing this combined challenge, it overlooks the severe label-image mismatch inherent to high-noise settings, thereby limiting their effectiveness. Given that observed labels, though mismatched with images, still retain category information, we propose employing auxiliary text information from labels to address label-image inconsistencies in long-tailed noisy data. Specifically, we leverage the intrinsic cross-modal alignment in pre-trained visual-language models to correct the label-image inconsistencies. This supervisory signal, referred to as Weak Teacher Supervision (WTS), is unaffected by label noise and data distribution biases, albeit exhibits limited accuracy. Therefore, the activation of WTS is determined by evaluating the discrepancy between text-predicted labels and observed labels. Extensive experiments demonstrate the superior performance of WTS across synthetic and real-world datasets, particularly under high-noise conditions. The source code is available at this https URL.

[CV-160] Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)策略在低数据监督微调(Supervised Fine-Tuning, SFT)后出现的“锁死”(lock-in)问题,即策略过度拟合训练数据中的对象、属性或空间目标,导致无法响应新指令。解决方案的关键在于:DeLock通过在微调过程中保持视觉接地(visual grounding),并在测试时引入对比提示引导(contrastive prompt guidance),以控制策略的去噪动力学,从而有效缓解概念锁死和空间锁死,无需额外监督信号或扩增数据即可恢复泛化能力。

链接: https://arxiv.org/abs/2604.23121
作者: Suning Huang,Jiaqi Shao,Ke Wang,Qianzhong Chen,Jiankai Sun,Yanjiang Guo,Mac Schwager,Jeannette Bohg
机构: Stanford University (斯坦福大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Have you ever post-trained a generalist vision-language-action (VLA) policy on a small demonstration dataset, only to find that it stops responding to new instructions and is limited to behaviors observed during post-training? We identify this phenomenon as lock-in: after low-data, supervised fine-tuning (SFT), the policy becomes overly specialized to the post-training data and fails to generalize to novel instructions, manifesting as concept lock-in (fixation on training objects/attributes) and spatial lock-in (fixation on training spatial targets). Many existing remedies introduce additional supervision signals, such as those derived from foundation models or auxiliary objectives, or rely on augmented datasets to recover generalization. In this paper, we show that the policy’s internal pre-trained knowledge is sufficient: DeLock mitigates lock-in by preserving visual grounding during post-training and applying test-time contrastive prompt guidance to steer the policy’s denoising dynamics according to novel instructions. Across eight simulation and real-world evaluations, DeLock consistently outperforms strong baselines and matches or exceeds the performance of a state-of-the-art generalist policy post-trained with substantially more curated demonstrations.

[CV-161] ransferable Physical-World Adversarial Patches Against Object Detection in Autonomous Driving

【速读】:该论文旨在解决物理空间中对抗性补丁攻击在自动驾驶(Autonomous Driving, AD)目标检测系统中的迁移性不足问题,即现有攻击方法通常针对单一检测模型设计,难以有效泛化到未见过的检测器。解决方案的关键在于提出一种基于迁移性的物理攻击框架AdvAD,其通过在统一框架内对多个检测模型联合优化对抗补丁,使扰动能够捕获不同架构间的共享脆弱性;同时引入自适应权重机制平衡各模型贡献,并结合数据增强与几何变换提升补丁在真实物理条件下的鲁棒性,从而显著增强攻击效果和跨模型迁移能力。

链接: https://arxiv.org/abs/2604.23105
作者: Zihui Zhu,Ziqi Zhou,Yichen Wang,Lulu Xue,Minghui Li,Shengshan Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning drives major advances in autonomous driving (AD), where object detectors are central to perception. However, adversarial attacks pose significant threats to the reliability and safety of these systems, with physical adversarial patches representing a particularly potent form of attack. Physical adversarial patch attacks pose severe risks but are usually crafted for a single model, yielding poor transferability to unseen detectors. We propose AdvAD, a transfer-based physical attack against object detection in autonomous driving. Instead of targeting a specific detector, AdvAD optimizes adversarial patches over multiple detection models in a unified framework, encouraging the learned perturbations to capture shared vulnerabilities across architectures. The optimization process adaptively balances model contributions and enforces robustness to physical variations. It further employs data augmentation and geometric transformations to maintain patch effectiveness under diverse physical conditions. Experiments in both digital and real-world settings show that AdvAD consistently outperforms state-of-the-art (SOTA) attacks in performance and transferability.

[CV-162] INSIGHT: Indoor Scene Intelligence from Geometric-Semantic Hierarchy Transfer for Public~Safety

【速读】:该论文旨在解决室内环境中缺乏空间智能基础设施的问题,特别是在应急响应场景下,救援人员难以获取可机器读取的安全设备地图;同时应对两个关键挑战:一是室内标注数据稀缺,二是传统点云方法对小型安全关键特征识别效果差。解决方案的核心是提出INSIGHT零目标域标注流水线,通过注册的RGB-D数据将2D图像理解投影到3D度量空间,采用两种可互换的视觉模块(基于SAM3的基础模型堆栈用于文本提示分割,以及传统计算机视觉堆栈支持开集检测、视觉问答和光学字符识别),共享统一的3D后端,从而实现高精度点云标注与符合ISO 19164标准的场景图生成,显著压缩数据量并支持在FirstNet Band 14下快速传输,有效缓解了标注数据瓶颈,并为现场部署提供了紧凑的建筑智能表示。

链接: https://arxiv.org/abs/2604.23095
作者: Alexander Nikitas Dimopoulos,Joseph Grasso,John Beltz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Indoor environments lack the spatial intelligence infrastructure that GPS provides outdoors; first responders arriving at unfamiliar buildings typically have no machine-readable map of safety equipment. Prior work on 3D semantic segmentation for public safety identified two barriers: scarcity of labeled indoor training data and poor recognition of small safety-critical features by native point-cloud methods. This paper presents INSIGHT, a zero-target-domain-annotation pipeline that projects 2D image understanding into 3D metric space via registered RGB-D data. Two interchangeable vision stacks share a common 3D back end: a SAM3 foundation-model stack for text-prompted segmentation, and a traditional CV stack (open-set detection, VQA, OCR) whose intermediate outputs are independently inspectable. Evaluated on all seven subareas of Stanford 2D-3D-S (70,496 images), the pipeline produces Pointcept-schema-compatible labeled point clouds and ISO~19164-compliant scene graphs with \sim10^4\times compression; role-filtered payloads transmit in 15 ,s at 1,Mbps over FirstNet Band~14. We report per-point labeling accuracy on 7 shared classes, detection sensitivity for 15 safety-critical classes absent from public 3D benchmarks alongside code-capped deployable estimates, and inter-pipeline complementarity, demonstrating that 2D-to-3D semantic transfer addresses the labeled-data bottleneck while scene graphs provide building intelligence compact enough for field deployment.

[CV-163] oward Real-World Adoption of Portrait Relighting via Hybrid Domain Knowledge Fusion

【速读】:该论文旨在解决人像光照重渲染(portrait relighting)在真实世界应用中面临的三大挑战:数据集域差距(dataset domain gaps)、相机敏感性(camera sensitivity)以及计算成本过高。其解决方案的关键在于提出“混合领域知识融合”(Hybrid Domain Knowledge Fusion)范式,通过融合合成数据、单光源逐次(One-Light-at-a-Time, OLAT)数据和真实世界数据的专长优势,构建一个紧凑模型。具体而言,该方法首先利用领域感知适应(domain-aware adaptation)强化专用先验模型,再通过增强的知识蒸馏(augmented knowledge distillation)将多域知识迁移至轻量级学生模型,从而在保持最先进(SOTA)视觉质量的同时实现6倍至240倍的推理速度提升。

链接: https://arxiv.org/abs/2604.23094
作者: Qian Huang,Mayoore Selvarasa Jaiswal,Zhen Zhong,Rochelle Pereira,Jianyuan Min
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The real-world adoption of portrait relighting is hindered by dataset domain gaps, camera sensitivity, and computational costs. We address these challenges with Hybrid Domain Knowledge Fusion, a paradigm that fuses the specialized strengths of synthetic, One-Light-at-A-Time (OLAT), and real-world datasets into a compact model. Our approach features specialized prior models hardened by domain-aware adaptation, followed by augmented knowledge distillation into a lightweight student model with multi-domain expertise. Our method demonstrates a 6x to 240x inference speedup while maintaining state-of-the-art (SOTA) visual quality in the experiments. Additionally, we construct a massive, high-fidelity synthetic dataset with diverse ground-truth intrinsics to support our training pipeline.

[CV-164] From Pixels to Explanations: Interpretable Diabetic Retinopathy Grading with CNN-Transformer Ensembles Visual Explainability and Vision-Language Models

【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy, DR)筛查中深度学习模型可解释性不足的问题,即现有分类器难以在临床场景下提供清晰、可信的诊断依据。其解决方案的关键在于构建一个结合强判别能力与多模态解释机制的框架:首先采用经过严格验证的卷积神经网络(CNN)和Transformer类骨干模型(如ResNet-50与ConvNeXt-Tiny)实现高精度分级(交叉验证加权kappa系数QWK达0.919),并通过加权软投票集成策略进一步提升等级一致性(QWK 0.934 ± 0.017);其次引入Grad-CAM++热力图与视觉语言模型(Vision-Language Models, VLMs)生成的文本推理说明,在保守提示约束下将像素级特征映射为临床可理解的输出,从而增强模型决策的透明度与可信度。

链接: https://arxiv.org/abs/2604.23079
作者: Pir Bakhsh Khokhar,Carmine Gravino,Fabio Palomba,Sule Yildirim Yayilgan,Sarang Shaikh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The quality of diabetic retinopathy (DR) screening relies on the ability to correctly grade severity; however, many deep-learning (DL) classifiers cannot be easily interpreted in the clinical context. This study presents a methodology that combines strong discriminative models with multimodal explanations, converting retinal pixels into clinically interpretable outputs. Using the APTOS 2019 benchmark, we evaluated six representative CNN- and transformer-based backbones under a controlled protocol with stratified five-fold cross-validation. We then compared ensembling strategies (hard voting, weighted soft voting, stacking) and investigated a hybrid class-level fusion variant to exploit grade-specific advantages. For interpretability, we produced Grad-CAM++ visual attribution maps and short textual rationales using vision-language models (VLMs) conditioned on the fundus image and classifier outputs under conservative prompting constraints. Modern CNN backbones (ResNet-50 and ConvNeXt-Tiny) provided the strongest single-model baselines, with cross-validated QWK up to 0.919 and 0.914, respectively. Ensembling improved ordinal agreement, and weighted soft voting was the most consistent across folds (QWK 0.934 +/- 0.017). Hybrid class-level fusion was competitive but did not yield a statistically reliable improvement over standard fusion in paired fold comparisons (Holm-adjusted p = 1.000). For explanation quality, Grad-CAM++ offered plausible but coarse localization, and VLM rationales were generally grade-consistent. Quantitatively, VLM variants showed a trade-off between clinical completeness and template-level semantic similarity (coverage 0.700 vs. BERTScore 0.072), while image-text alignment was comparable (CLIPScore approximately 0.34).

[CV-165] Urban Flood Observations (UFO): A hand-labeled training and validation dataset of post-flood inundation

【速读】:该论文旨在解决从卫星影像中准确映射复杂城市环境中洪水淹没范围的难题,该问题受制于空间分辨率有限、获取频率低以及云层遮挡等因素。解决方案的关键在于构建了一个全球性的、人工标注的都市洪水观测数据集(Urban Flood Observations, UFO),包含215个1024×1024像素的图像块,覆盖2017至2021年间14次洪水事件,每张图像均标注为“淹没”(包括可见地表水体,如洪水和永久或季节性水体)与“非淹没”两类。通过该数据集训练的语义分割模型在留一事件交叉验证下实现了77.3%的平均交并比(IoU),显著优于现有主流水体产品(NASA IMPACT模型和Google Dynamic World水体分类)的44.1%和48.1% IoU,验证了UFO数据集对提升城市洪涝制图精度的重要价值。

链接: https://arxiv.org/abs/2604.23066
作者: Rohit Mukherjee,Hannah K. Friedrich,Beth Tellman,Ariful Islam,Zhijie Zhang,Jonathan Giezendanner,Upmanu Lall,Venkataraman Lakshmi
机构: Pacific Northwest National Laboratory (太平洋西北国家实验室); University of Arizona (亚利桑那大学); University of Wisconsin–Madison (威斯康星大学麦迪逊分校); Utah State University (犹他州立大学); Massachusetts Institute of Technology (麻省理工学院); Columbia University (哥伦比亚大学); University of Virginia (弗吉尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 8 figures

点击查看摘要

Abstract:Urban flooding affects lives and infrastructure worldwide. Mapping inundation in complex urban environments from satellite imagery remains challenging due to limited spatial resolution, infrequent acquisitions, and cloud cover. We present Urban Flood Observations (UFO), a global, hand-labeled dataset of post-flood inundation in diverse urban settings. UFO comprises 215 image chips (1024 by 1024 pixels) from 14 flood events between 2017 and 2021, derived from 3 m PlanetScope imagery. Each chip is annotated with two classes: ‘inundated’ (all visible surface water, including floodwater and pre-existing water bodies (permanent or seasonal)) and ‘non-inundated’. To demonstrate the dataset’s utility, we trained a segmentation model using leave-one-event-out cross-validation, achieving a mean Intersection over Union (IoU) of 77.3. We also used UFO to evaluate two widely used surface water products, the Sentinel-1-based NASA IMPACT model and Google’s 10 m Dynamic World water class, which yielded IoUs of 44.1 and 48.1, respectively. UFO is publicly available to support the development and validation of urban inundation mapping methods.

[CV-166] Understanding Representation Gaps Across Scales in Tropical Tree Species Classification from Drone Imagery ICLR2026

【速读】:该论文旨在解决从无人机(UAV)航拍影像中准确分类热带树种的难题,尤其针对物种多样性高且在典型图像分辨率下(厘米级像素)物种间视觉相似性强的问题。传统方法依赖于顶部视角的粗分辨率影像,分类性能受限;而现有研究表明,基于智能手机拍摄的近距离图像训练的模型在植物物种识别上表现优异。论文的关键解决方案在于利用新近可获取的、与顶部视角影像空间配准的高分辨率近距离无人机图像,通过微调实验量化了视觉基础模型与领域内通用植物识别模型在两种图像类型上的性能差异,并提出一种跨尺度的自监督表示对齐方法,以将细粒度视觉信息整合进基于顶部视角影像的冠层级物种分类模型中,从而提升大规模热带森林生物多样性监测的准确性。

链接: https://arxiv.org/abs/2604.23019
作者: Sulagna Saha,Arthur Ouaknine,Etienne Laliberté,Carol Altimas,Evan M. Gora,Adriane Esquivel Muelbert,Ian R. McGregor,Cesar Gutierrez,Vanessa E. Rubio,David Rolnick
机构: Mila – Quebec AI Institute (魁北克人工智能研究所); McGill University (麦吉尔大学); Université de Montréal (蒙特利尔大学); Cary Institute of Ecosystem Studies (卡里生态系统研究所); Smithsonian Tropical Research Institute (史密森热带研究学院); Department of Plant Sciences, University of Cambridge (剑桥大学植物科学系); Universidade do Estado do Mato Grosso (UNEMAT) (马托格罗索州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ML4RS @ICLR 2026 (Main)

点击查看摘要

Abstract:Accurate classification of tropical tree species from unoccupied aerial vehicle (UAV) imagery remains challenging due to high species diversity and strong visual similarity among species at typical image resolutions (centimeters per pixel). In contrast, models trained on close-up citizen science photographs captured with smartphones achieve strong plant species classification performance. Recent advances in UAV data acquisition now enable the collection of close-up images that are spatially registered with top-view aerial imagery and approach the level of visual detail found in smartphone photographs, with the trade-off that such high-resolution photos cannot be acquired for many trees. In this work, we evaluate the performance of existing methods using paired top-view and close-up UAV imagery collected in a species-rich tropical forest. Through fine-tuning experiments, we quantify the performance gap between vision foundation models and in-domain generalist plant recognition models across both image types (high-resolution close-up versus coarser-resolution top-view imagery). We show that classification performance is consistently higher on close-up images than on top-view aerial imagery, and that this performance gap widens for rare species. Finally, we propose that self-supervised representation alignment across these two spatial scales offers a promising approach for integrating fine-grained visual information into canopy-level species classification models based on top-view UAV imagery. Leveraging high-resolution close-up UAV imagery to enhance canopy-level species classification could substantially improve large-scale monitoring of tropical forest biodiversity.

[CV-167] AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI

【速读】:该论文旨在解决大规模3D资产库中普遍存在的一致性与可用性问题,即现有Web-scale 3D资产常存在度量尺度不统一、坐标轴和旋转中心错误、几何结构脆弱以及纹理缺乏光照重渲染能力等问题(如:无PBR材质支持),严重限制了其在具身智能(embodied AI)、机器人仿真、游戏开发及AR/VR等下游任务中的部署潜力。解决方案的关键在于构建AmaraSpatial-10K数据集——一个包含超过10,000个合成3D资产的高质量基准,每个资产均以度量尺度标准化(metric-scaled)、语义锚定(semantically anchored)的.glb格式发布,配套分离的PBR材质贴图、凸包碰撞体、参考图像及多句文本元数据,并采用统一的空间约定规范覆盖室内物体、车辆、建筑、生物和道具等类别。此外,作者还提出了一套全面的评估体系,包括连续尺度合理性评分(Scale Plausibility Score, SPS)、LLM驱动的概念密度指标、锚点误差度量及跨模态CLIP一致性协议,从而系统性验证了该数据集在空间一致性与语义准确性上的显著提升,相较于Objaverse等主流数据源,在基于文本检索任务中实现CLIP Recall@5从0.181提升至0.612(提升3.4倍),并显著降低平均检索排名(从267降至3)。

链接: https://arxiv.org/abs/2604.23018
作者: Mohammad Sadegh Salehi,Alex Perkins,Igor Maurell,Ashkan Dabbagh,Raymond Wong
机构: Zero One Creative
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Web-scale 3D asset collections are abundant, but rarely deployment-ready. Assets ship with arbitrary metric scale, incorrect pivots and forward axes, brittle geometry, and textures that do not support relighting, which limits their utility for embodied AI, robotics simulation, game development, and AR/VR. We present AmaraSpatial-10K, a dataset of over 10,000 synthetic 3D assets designed for downstream use rather than volume alone. Each asset is released as a metric-scaled, semantically anchored .glb with separated PBR material maps, a convex collision hull, a paired reference image, and rich multi-sentence text metadata. The dataset spans indoor objects, vehicles, architecture, creatures, and props under a unified spatial convention. Alongside the dataset, we introduce an evaluation suite for 3D asset banks. The suite comprises a continuous Scale Plausibility Score (SPS) with an LLM-as-Judge interval protocol, an LLM Concept Density score for metadata, an anchor-error metric, and a cross-modal CLIP coherence protocol, and we use it to audit AmaraSpatial-10K alongside matched subsets from Objaverse, HSSD, ABO, and GSO. Compared with Objaverse-sourced assets, we demonstrate that AmaraSpatial-10K substantially improves text-based retrieval precision (CLIP Recall@5 of 0.612 vs 0.181, a 3.4x improvement with median rank falling from 267 to 3), and we establish that it satisfies the spatial and semantic prerequisites for physics-aware scene composition and embodied-AI asset banks, leaving those downstream evaluations to future work. AmaraSpatial-10K is publicly available on Hugging Face.

[CV-168] DeepSignature: Digitally Signed Content-Encoding Watermarks for Robust and Transparent Image Authentication

【速读】:该论文旨在解决生成式 AI (Generative AI) 技术带来的图像伪造问题,即虚假图像看似源自可信来源,从而破坏公众对图像真实性的信任。解决方案的关键在于提出 DeepSignature,一种将数字签名的可验证性与深度神经网络能力相结合的新方法:通过神经网络生成内容编码的水印,并将其不可感知地嵌入图像中,同时确保水印可被鲁棒提取;这些水印具备密码学可验证性,支持源身份归属和图像完整性验证;此外,引入一种新颖的潜在空间验证机制以检测并定位篡改行为。该方案兼容现有图像格式、支持客户端侧验证(仅需签名者的公钥),且具有模块化设计和可调参数,能适应不同应用场景需求。

链接: https://arxiv.org/abs/2604.23016
作者: Mathias Graf,Marco Willi,Melanie Mathys,Michael Aerni,Christian Schwarzer,Martin Melchior,Michael H. Graber
机构: FHNW; ETH Zürich
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 9 figures, 7 tables

点击查看摘要

Abstract:AI-powered generative models have significantly expanded the possibilities for editing, manipulating, and creating high-quality images. Particularly, images that falsely appear to originate from trusted sources pose a serious threat, undermining public trust in image authenticity. We propose DeepSignature, a novel approach that integrates the guarantees of digital signatures with the capabilities of deep neural networks. Neural networks are used both to generate content-encoding watermarks and to embed them imperceptibly into images while ensuring robust extraction. These watermarks are cryptographically verifiable, enabling source attribution and image integrity validation. DeepSignature is compatible with existing image formats and requires no special handling of signed images. It supports client-side verification, requiring only the signer’s public key. Additionally, we introduce a novel latent-space verification approach to detect and localize tampering attempts. We evaluate DeepSignature in terms of imperceptibility, robustness to benign transformations, forgery detection, and its resilience against various attack scenarios. Our results highlight the inherent trade-offs between imperceptibility, robustness, and integrity verification. We demonstrate that DeepSignature reliably identifies significant forgery attempts – achieving near 100% in our experiments. Finally, we emphasize DeepSignature’s modularity and tunable parameters, allowing adaptation to application-specific requirements. Code and model weights will be published.

[CV-169] On-Device Vision Training Deployment and Inference on a Thumb-Sized Microcontroller MICRO

【速读】:该论文旨在解决在资源受限的微控制器(microcontroller)设备上实现端到端视觉机器学习(vision machine learning)流程的问题,传统云平台依赖外部基础设施且难以透明化计算过程,而本文提出了一种无需外部ML依赖、可在成本15–40美元的嵌入式设备(如Seeed Studio ESP32-S3 XIAO ML Kit)上完整执行数据采集、两层卷积神经网络(CNN)训练与实时推理的轻量级方案。其关键解决方案包括:支持批量梯度累积以提升训练稳定性;为推理阶段预计算图像缩放查找表以加速处理;采用双格式权重导出机制(SD-free baked-in部署);设计三级权重优先级系统(SD二进制文件 > 内联头文件 > He初始化)自动解析;提供单常量接口实现网络结构动态配置;以及针对PSRAM内存限制的智能管理策略,从而在有限硬件资源下实现高效、可复现的边缘AI部署。

链接: https://arxiv.org/abs/2604.23012
作者: Jeremy Ellis
机构: Unknown
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages; 3 figures; 3 tables. Code and datasets available at this https URL . Paper 1 of the webmcu-ai series. Implements end-to-end on-device CNN training and inference on a thumb-sized microcontroller (ESP32-S3) the XIAO ML Kit in ~1,750 lines of single-file C++ without external ML dependencies

点击查看摘要

Abstract:This paper presents a complete, end-to-end on-device vision machine learning pipeline, comprising data acquisition, two-layer CNN training with Adam optimization, and real-time inference, executing entirely on a microcontroller-class device costing 15-40 USD. Unlike cloud-based workflows that require external infrastructure and conceal the computational pipeline from the practitioner, this system implements every step of the core ML lifecycle in approximately 1,750 lines of readable C++ that compiles in under one minute using the Arduino IDE, with no external ML dependencies. Running on the Seeed Studio ESP32-S3 XIAO ML Kit (8 MB PSRAM), the firmware achieves three-class 64x64 image classification in approximately 9 minutes per training run, with real-time inference at 6.3 FPS. Key contributions include: correct batch-level gradient accumulation; pre-computed resize lookup tables for inference; dual-format weight export for SD-free baked-in deployment; a three-tier weight priority system (SD binary baked-in header He-initialization) resolved automatically at boot; a single-constant network reconfiguration interface; and PSRAM-aware memory management suited to microcontroller constraints. All source code and reference datasets are released under the MIT License at this https URL

[CV-170] GenAssets: Generating in-the-wild 3D Assets in Latent Space CVPR2025

【速读】:该论文旨在解决多传感器仿真中高质量交通参与者3D资产生成的难题,特别是针对从真实世界(in-the-wild)数据中构建兼具多样性与真实感的3D资产时,现有基于神经渲染的重建方法存在速度慢、视角依赖性强,以及基于扩散模型的生成方法在稀疏观测和部分遮挡场景下性能不佳的问题。解决方案的关键在于提出一种3D潜在扩散模型(3D latent diffusion model),其核心是“先重建后生成”(reconstruct-then-generate)策略:首先利用多场景训练的遮挡感知神经渲染(occlusion-aware neural rendering)构建高质量对象潜在空间(latent space),再在此基础上训练扩散模型进行3D资产生成,从而实现几何完整、外观逼真且适用于仿真场景的多样化内容生成。

链接: https://arxiv.org/abs/2604.23010
作者: Ze Yang,Jingkang Wang,Haowei Zhang,Sivabalan Manivasagam,Yun Chen,Raquel Urtasun
机构: Waabi; University of Toronto
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: CVPR 2025. Project page: this https URL

点击查看摘要

Abstract:High-quality 3D assets for traffic participants are critical for multi-sensor simulation, which is essential for the safe end-to-end development of autonomy. Building assets from in-the-wild data is key for diversity and realism, but existing neural-rendering based reconstruction methods are slow and generate assets that render well only from viewpoints close to the original observations, limiting their usefulness in simulation. Recent diffusion-based generative models build complete and diverse assets, but perform poorly on in-the-wild driving scenes, where observed actors are captured under sparse and limited fields of view, and are partially occluded. In this work, we propose a 3D latent diffusion model that learns on in-the-wild LiDAR and camera data captured by a sensor platform and generates high-quality 3D assets with complete geometry and appearance. Key to our method is a “reconstruct-then-generate” approach that first leverages occlusion-aware neural rendering trained over multiple scenes to build a high-quality latent space for objects, and then trains a diffusion model that operates on the latent space. We show our method outperforms existing reconstruction and generation based methods, unlocking diverse and scalable content creation for simulation.

[CV-171] Efficient Image Annotation via Semi-Supervised Object Segmentation with Label Propagation

【速读】:该论文旨在解决通用服务机器人在家庭场景中对物体进行可靠感知的问题,特别是针对开放词汇(open-vocabulary)检测器在类别泛化能力有限、而完全监督训练又依赖大量人工标注的痛点。其解决方案的关键在于提出一种半监督标签传播方法:首先通过类无关的分割提议器生成候选掩码,随后利用一组霍普菲尔德网络(Hopfield networks)在互补的基础模型嵌入空间(CLIP、ViT、Theia)中学习代表性特征嵌入,并据此为掩码分配标签。该方法仅需少量标注数据即可扩展至50个物体类别,在RoboCup@Home场景下可自动标注60%的数据,显著降低标注成本并提升泛化性能。

链接: https://arxiv.org/abs/2604.22992
作者: Vitalii Tutevych,Raphael Memmesheimer,Luca Eichler,Dmytro Pavlichenko,Fynn Schilke,Rodja Krudewig,Sven Behnke
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 12 pages, 6 figures, 7 tables, submitted to RoboCup 2026 Symposium

点击查看摘要

Abstract:Reliable object perception is necessary for general-purpose service robots. Open-vocabulary detectors struggle to generalize beyond a few classes and fully supervised training of object detectors requires time-intensive annotations. We present a semi-supervised label propagation approach for household object segmentation. A segment proposer generates class-agnostic masks, and an ensemble of Hopfield networks assigns labels by learning representative embeddings in complementary foundation model embedding spaces (CLIP, ViT, Theia). Our approach scales to 50 object classes with limited annotation overhead and can automatically label 60% of the data in a RoboCup@Home setting, where preparation time is severely constrained. Dataset and code are publicly available at this https URL.

[CV-172] Hard to See Hard to Label: Generative and Symbolic Acquisition for Subtle Visual Phenomena CVPR2026

【速读】:该论文旨在解决工业缺陷检测中因异常样本稀疏且视觉模糊而导致的主动学习采样偏差问题,即传统基于判别不确定性或特征多样性的采集策略容易过度选择主导模式下的常见样本,而忽略低频但关键的细微缺陷(如微裂纹、亚毫米空隙等)。解决方案的关键在于提出GSAL框架,其核心创新是融合扩散模型驱动的难度信号与分层语义覆盖先验:前者通过重建差异和去噪变异性量化图像及候选区域的视觉难易程度,优先选择结构异常但视觉模糊的样本;后者构建三级概念图以组织候选样本,并主动促进对欠代表语义区域的探索,从而在提升标签效率的同时增强对罕见目标的召回能力。

链接: https://arxiv.org/abs/2604.22990
作者: Renjith Prasad,Rishabh Sharma,Andrew E. Shao,Annmary Justine Koomthanam,Shreyas Kulkarni,Suparna Bhattacharya,Martin Foltin,Amit Sheth,David Orozco,Brian Sammuli
机构: University of South Carolina (南卡罗来纳大学); AI Research Lab, HPE Labs, Hewlett Packard Enterprise (AI 研究实验室,HPE 实验室,惠普企业); Indian AI Research Organization (印度人工智能研究组织); General Atomics (通用原子能公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at CVPR 2026 SVC Workshop

点击查看摘要

Abstract:Subtle visual anomalies such as hairline cracks, sub-millimeter voids, and low-contrast inclusions are structurally atypical yet visually ambiguous, making them both difficult to annotate and easy to overlook during active learning. Standard acquisition heuristics based on discriminative uncertainty or feature diversity often overselect dominant patterns while underexploring sparse yet important regions of the data space. This failure mode is especially severe in industrial defect inspection, where anomalies may be both low-prevalence and difficult to distinguish from surrounding structure. To resolve this, we propose GSAL, an active learning framework for object detection that combines a diffusion-based difficulty signal with a hierarchical semantic coverage prior. The diffusion component scores images and proposals using reconstruction discrepancy and denoising variability, prioritizing visually atypical or ambiguous examples. However, diffusion alone does not prevent acquisition from repeatedly favoring hard samples within dominant semantic modes. The semantic component therefore organizes candidate samples in a three-level concept graph and promotes coverage of underrepresented semantic regions while providing interpretable acquisition rationales. By balancing visual difficulty with semantic coverage, GSAL improves retrieval of subtle and rare targets that are often missed by uncertainty-only selection. Experiments on a proprietary thin-film defect, Pascal VOC and MS COCO dataset show consistent gains in label efficiency and rare-class retrieval over uncertainty-, diversity-, and hybrid-based baselines

[CV-173] CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging CVPR

【速读】:该论文旨在解决当前医学多模态基础模型在处理胸部X光图像时因两阶段解耦式架构(如LLaVA风格微调)引入投影层而导致的视觉特征失真问题,尤其是在依赖细微影像线索进行准确诊断的场景下。其解决方案的关键在于提出CheXmix——一种基于早期融合(early-fusion)的生成式模型,通过将图像和文本token统一编码为序列,避免了传统方法中的投影瓶颈;同时创新性地采用两阶段多模态生成预训练策略,结合掩码自编码器(masked autoencoder)与多模态大语言模型(MLLM)的优势,在保持细粒度信息的同时提升模型在判别与生成任务上的灵活性与性能表现。

链接: https://arxiv.org/abs/2604.22989
作者: Ashwin Kumar,Robbie Holland,Corey Barrett,Jangwon Kim,Maya Varma,Zhihong Chen,Yunhe Gao,Greg Zaharchuk,Tara Taghavi,Krishnaram Kenthapadi,Akshay Chaudhari
机构: Stanford AIMI, Stanford University (斯坦福大学人工智能与医学中心); Oracle Health AI (奥拉克健康人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR Findings (2026)

点击查看摘要

Abstract:Recent medical multimodal foundation models are built as multimodal LLMs (MLLMs) by connecting a CLIP-pretrained vision encoder to an LLM using LLaVA-style finetuning. This two-stage, decoupled approach introduces a projection layer that can distort visual features. This is especially concerning in medical imaging where subtle cues are essential for accurate diagnoses. In contrast, early-fusion generative approaches such as Chameleon eliminate the projection bottleneck by processing image and text tokens within a single unified sequence, enabling joint representation learning that leverages the inductive priors of language models. We present CheXmix, a unified early-fusion generative model trained on a large corpus of chest X-rays paired with radiology reports. We expand on Chameleon’s autoregressive framework by introducing a two-stage multimodal generative pretraining strategy that combines the representational strengths of masked autoencoders with MLLMs. The resulting models are highly flexible, supporting both discriminative and generative tasks at both coarse and fine-grained scales. Our approach outperforms well-established generative models across all masking ratios by 6.0% and surpasses CheXagent by 8.6% on AUROC at high image masking ratios on the CheXpert classification task. We further inpaint images over 51.0% better than text-only generative models and outperform CheXagent by 45% on the GREEN metric for radiology report generation. These results demonstrate that CheXmix captures fine-grained information across a broad spectrum of chest X-ray tasks. Our code is at: this https URL.

[CV-174] BrickNet: Graph-Backed Generative Brick Assembly CVPR2026

【速读】:该论文旨在解决生成式 AI (Generative AI) 在复杂积木构建序列建模中的物理合理性问题,即如何在不违反物理约束的前提下自回归地生成由数千种不同连接语义的 LEGO 积木组成的结构化搭建序列。传统方法仅限于离散的体素式塔状结构,难以处理多样化的积木类型与空间关系。其关键解决方案是设计了一种基于图的程序表示(graph-based program representation),通过参数化结构的连接性而非直接预测每个积木的 3D 姿态,从而增强生成序列的物理可实现性与结构一致性。

链接: https://arxiv.org/abs/2604.22984
作者: Peter Kulits,Cordelia Schmid
机构: Inria, Ècole Normale Supèrieure, CNRS, PSL Research University (法国国家信息与自动化研究所,巴黎高等师范学院,法国国家科学研究中心,PSL研究大学); Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: CVPR 2026; project page: this https URL

点击查看摘要

Abstract:We train a language model to generate LEGO-brick build sequences. While prior work has been restricted to discrete, voxel-like towers, we consider a much broader set of pieces, encompassing thousands of part types with diverse connection semantics. To enable this, we first collect a large-scale dataset of over 100,000 human-designed LDraw brick objects and scenes. The complexity of our setting makes it challenging to autoregressively assemble structures that satisfy physical constraints. When predicting block pose directly, build sequences quickly become invalid after a small number of steps. Although pieces are placed in 3D space, it is the spatial relationships of the parts which define the whole. With this in mind, we design a graph-based program representation that parametrizes structure through connectivity, improving the physical grounding of generated sequences. To enable future applications, we make our dataset and models available for research purposes. this https URL

[CV-175] AnemiaVision: Non-Invasive Anemia Detection via Smartphone Imagery Using EfficientNet-B3 with TrivialAugmentWide Mixup Augmentation and Persistent Patient History Management

【速读】:该论文旨在解决全球超过十亿人罹患的贫血问题,特别是在缺乏实验室检测条件的低资源地区,传统诊断手段难以普及的问题。其核心解决方案是提出AnemiaVision系统,一种基于智能手机拍摄眼睑结膜和指甲床图像的无创贫血筛查方法。关键创新在于:采用微调后的EfficientNet-B3骨干网络结合重新设计的三层分类头(含BatchNorm、GELU激活函数及高比例Dropout),并融合四种正则化与训练优化技术——TrivialAugmentWide图像增强、RandomErasing空间正则化、Mixup(α=0.2)类别间平滑以及余弦退火调度配合线性预热;同时通过以验证集准确率峰值作为早停依据,避免因高方差导致过早终止。实验表明,该系统在验证集上达到96.2%准确率和0.98 AUC-ROC,显著优于基线模型(44.9%准确率,AUC-ROC 0.58),且对贫血类别的敏感度达0.96,具备作为农村社区卫生工作者一线筛查工具的应用潜力。

链接: https://arxiv.org/abs/2604.22964
作者: Rahul Patel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: 6 pages, 6 figures, 6 tables. Final year personal project, Department of Electronics and Communication Engineering, Indian Institute of Information Technology Surat. Code: this https URL Demo: this https URL

点击查看摘要

Abstract:Anemia affects over one billion people globally and remains severely under-diagnosed in low-resource regions where laboratory blood tests are inaccessible. This paper presents AnemiaVision, an end-to-end web-based system for non-invasive anemia screening from smartphone photographs of the palpebral conjunctiva and fingernail beds. The proposed pipeline fine-tunes a pre-trained EfficientNet-B3 backbone with a redesigned three-layer classifier head incorporating BatchNorm, GELU activations, and high-rate Dropout (0.45/0.35). Training employs four orthogonal accuracy-boosting techniques: TrivialAugmentWide for policy-free image augmentation, RandomErasing for spatial regularisation, Mixup (alpha=0.2) for inter-class smoothing, and cosine-annealing scheduling with linear warmup. Early stopping is governed by peak validation accuracy rather than validation loss to prevent premature termination on high-variance epochs. The deployed Flask application integrates persistent patient-history management backed by PostgreSQL on Render, with an automated database-migration entrypoint ensuring zero data loss across redeploys. Ablation experiments demonstrate that accuracy-first early stopping contributes +1.6% and Mixup contributes +2.8% to final validation accuracy. Overall, the proposed system achieves a validation accuracy of 96.2% and AUC-ROC of 0.98, compared with 44.9% validation accuracy and AUC-ROC of 0.58 from the three-epoch CPU-only baseline. Sensitivity for the anemic class reaches 0.96, making the system suitable as a first-line screening tool for community health workers in rural settings. The system is publicly accessible and source code is openly available.

[CV-176] VS-DDPM: Efficient Low-Cost Diffusion Model for Medical Modality Translation

【速读】:该论文旨在解决扩散模型(Diffusion Models)在医学图像合成任务中推理速度慢的问题,同时保持生成质量。其核心挑战是在硬件和时间约束下实现高效且高保真度的3D医学图像生成。解决方案的关键在于提出一种三维可变步长去噪扩散概率模型(3D Variable-Step Denoising Diffusion Probabilistic Model, VS-DDPM),通过动态调整去噪过程中的采样步数,在保证生成质量(如Dice分数、SSIM等指标)的前提下显著加速推理过程。实验表明,该方法在缺失MRI合成任务中达到当前最优性能(Dice最高达0.88),并在其他任务中展现出鲁棒性和可调性。

链接: https://arxiv.org/abs/2604.22942
作者: Nikoo Moradi,Gijs Luijten,Behrus Hinrichs-Puladi,Jens Kleesiek,Victor Alves,Jan Egger,André Ferreira
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Diffusion models produce high-quality synthetic data but suffer from slow inference. We propose 3D Variable-Step Denoising Diffusion Probabilistic Model (VS-DDPM) a framework engineered to maintain generative quality while accelerating inference by several factors. We tested our approach on four tasks (missing MRI, tumor removal, MRI-to-sCT, and CBCT-to-sCT) within the BraTS2025 and SynthRAD2025 challenges. Designed for high efficiency under hardware and time constrains imposed by both challenges. VS-DDPM achieved state-of-the-art (SOTA) performance in missing MRI synthesis, yielding Dice scores of 0.80, 0.83, and 0.88 for the enhancing tumor, tumor core, and whole tumor regions, respectively, alongside a structural similarity index (SSIM) of 0.95. For MRI tumor removal, the model attained a root mean squared error (RMSE) of 0.053, a peak signal-to-noise ratio (PSNR) of 26.77, and an SSIM of 0.918. While the framework demonstrated competitive performance in MRI-to-sCT and CBCT-to-sCT tasks, it did not reach SOTA benchmarks, potentially due to sensitivities in data pre and post-processing pipelines or specific loss function configurations. These results demonstrate that VS-DDPM provides a robust and tunable solution for high-fidelity 3D medical image synthesis. The code is available in this https URL.

[CV-177] On the Complementarity of Quantum and Classical Features: Adaptive Hybrid Quantum-Classical Feature Fusion for Breast Cancer Classification

【速读】:该论文旨在解决量子机器学习(Quantum Machine Learning, QML)与经典深度学习在医学图像分析中难以有效融合的问题,尤其是由优化不对称性导致的训练瓶颈。其关键解决方案是提出一种基于双分支特征提取管道的混合量子-经典架构,并引入三种渐进式特征融合策略:静态混合融合(Static Hybrid Fusion, SHF)、动态混合融合(Dynamic Hybrid Fusion, DHF)以及一种新颖的温度缩放混合融合(Temperature-Scaled Hybrid Fusion, TSHF)。其中,TSHF通过引入一个可学习标量参数来动态平衡混合梯度传播,缓解优化不稳定性,从而显著提升分类性能,在BreastMNIST数据集上实现了87.82%的准确率、91.77%的F1分数和89.08%的AUC-ROC,优于纯经典模型,验证了该框架在临床部署中具备更高稳定性和性能优势。

链接: https://arxiv.org/abs/2604.22903
作者: Yasmin Rodrigues Sobrinho,João Renato Ribeiro Manesco,João Paulo Papa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 41 pages, 16 figures. This manuscript is a preprint under review at Artificial Intelligence in Medicine

点击查看摘要

Abstract:The integration of quantum machine learning with classical deep learning offers promising avenues for medical image analysis by mapping data into high-dimensional Hilbert spaces. However, effectively unifying these distinct paradigms remains challenging due to common optimization asymmetries. In this paper, a novel hybrid quantum-classical architecture for breast cancer diagnosis based on a dual-branch feature-extraction pipeline is proposed. Our framework extracts and unifies complementary representations from classical models and quantum circuits, exploring both trainable and deterministic (non-trainable) quantum paradigms. To integrate these embeddings, three progressive feature fusion strategies are introduced: Static Hybrid Fusion (SHF) for offline extraction, Dynamic Hybrid Fusion (DHF) for end-to-end co-adaptation, and a novel Temperature-Scaled Hybrid Fusion (TSHF). The TSHF strategy incorporates a learnable scalar, inspired by multimodal learning, that dynamically balances hybrid gradient dynamics and resolves optimization bottlenecks. Empirical validation on the BreastMNIST dataset confirms our hypothesis that unifying diverse feature representations creates a richer data context. The TSHF strategy, specifically when pairing a ResNet backbone with a trainable quantum circuit, achieved a peak accuracy of 87.82%, F1-score of 91.77%, and an AUC-ROC of 89.08%, outperforming purely classical baselines. These results demonstrate that the proposed hybrid framework improves classification accuracy and threshold reliability, providing a stable, high-performance architecture for the clinical deployment of quantum-enhanced diagnostic tools.

[CV-178] xt-Guided Multimodal Unified Industrial Anomaly Detection

【速读】:该论文旨在解决工业异常检测中基于RGB-3D多模态数据的无监督方法所面临的两个关键问题:一是由于缺乏高层语义引导导致的跨模态对齐模糊性,二是RGB到3D特征映射过程中几何建模不足。其解决方案的核心在于提出一个由文本语义引导的统一多模态工业异常检测框架,包含两个核心模块:几何感知的跨模态映射器(Geometry-Aware Cross-Modal Mapper),用于在模态转换过程中保留几何结构;以及对象条件的文本特征适配器(Object-Conditioned Textual Feature Adaptor),用于将多模态特征与语义先验对齐。此外,该框架打破了“一模型一类别”的限制,实现了单个模型对多种类别的精准异常检测,显著提升了无监督场景下的分类与定位性能。

链接: https://arxiv.org/abs/2604.22899
作者: Zewen Li,Shuo Ye,Zitong Yu,Weicheng Xie,Linlin Shen
机构: Shenzhen University (深圳大学); Great Bay University (大湾区大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages

点击查看摘要

Abstract:Industrial anomaly detection based on RGB-3D multimodal data has emerged as a mainstream paradigm for intelligent quality inspection. However, existing unsupervised methods suffer from two critical limitations: ambiguous cross-modal alignment caused by the lack of high-level semantic guidance and insufficient geometric modeling for RGB-to-3D feature mapping. To address these issues, we propose a unified multimodal industrial anomaly detection framework guided by text semantics. The framework consists of two core modules: a Geometry-Aware Cross-Modal Mapper to preserve geometric structure during modality conversion, and an Object-Conditioned Textual Feature Adaptor to align multimodal features with semantic priors. Furthermore, we establish a unified learning paradigm for multimodal industrial anomaly detection, which breaks the one-model-one-class constraint and enables accurate anomaly detection across diverse classes using a single model. Extensive experiments on the MVTec 3D-AD and Eyecandies datasets demonstrate that our method achieves state-of-the-art performance in classification and localization under unsupervised settings.

[CV-179] Breaking Degradation Coupling: A Structural Entropy Guided Decoupled Framework and Benchmark for Infrared Enhancement CVPR2026

【速读】:该论文旨在解决热红外图像增强中因复合退化类型复杂多样而导致的梯度干扰与参数竞争问题。现有统一建模方法通常采用单一共享主干网络处理多种退化,导致各任务间参数冲突,影响恢复质量。其解决方案的关键在于提出结构熵引导的解耦框架(Structural Entropy-Guided Decoupled, SEGD),通过将复合退化分解为独立子过程,并利用退化特定残差模块(Degradation-Specific Residual Modules, DRMs)分别建模,实现任务解耦与联合训练;同时引入退化感知证据网络估计退化类型与强度,自适应调节各DRM的恢复强度;并通过结构熵准则对不同退化路径的特征进行聚合,最终生成具备结构保真度和退化感知能力的解码器输入表示,从而实现细粒度、可解释的增强效果。

链接: https://arxiv.org/abs/2604.22886
作者: Pu Li,Huafeng Li,Yafei Zhang,Yu Liu,Wen Wang
机构: Faculty of Information Engineering and Automation, Kunming University of Science and Technology(昆明理工大学信息工程与自动化学院); Department of Biomedical Engineering, Hefei University of Technology(合肥工业大学生物医学工程系); School of Mathematics and Statistics, Yunnan University(云南大学数学与统计学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026

点击查看摘要

Abstract:Thermal infrared image enhancement aims to restore high-quality images from complex compound degradations. Existing all-in-one approaches typically employ a single shared backbone to handle diverse degradations, which causes gradient interference and parameter competition. To address this, we propose a Structural Entropy-Guided Decoupled (SEGD) Framework. Unlike unified modeling paradigms, SEGD decomposes compound degradations into independent sub-processes and models them in a divide-and-conquer manner through Degradation-Specific Residual Modules (DRMs). Each DRM focuses on residual estimation for a specific degradation, enabling task decoupling while remaining jointly trainable, which mitigates parameter contention. A Degradation-Aware Evidential Network further estimates degradation type and intensity, providing priors that adaptively regulate DRM restoration strength. To handle compound cases, DRMs are composed in varying orders to form multiple restoration paths, from which the most informative features are aggregated under a structural-entropy criterion, yielding decoder-ready representations with structural fidelity and degradation awareness. Integrating divide-and-conquer restoration, evidential perception, and entropy-guided adaptation, SEGD achieves fine-grained and interpretable enhancement. We also construct a nighttime TIR benchmark for evaluation under real low-light conditions. Experimental results demonstrate that SEGD surpasses state-of-the-art methods while achieving higher efficiency with fewer parameters.

[CV-180] Federated Cross-Modal Retrieval with Missing Modalities via Semantic Routing and Adapter Personalization

【速读】:该论文旨在解决联邦跨模态检索(Federated Cross-Modal Retrieval)中因客户端数据异构性带来的挑战,特别是非独立同分布(non-IID)语义分布和模态缺失问题。现有单一全局模型难以同时捕捉跨客户端共享的跨模态知识与个体客户端特定特征。其解决方案的关键在于提出RCSR框架,通过原型锚定(prototype anchoring)使单模态客户端对齐全局跨模态语义,结合以检索为中心的语义路由(retrieval-centric semantic routing)动态调整聚合权重以缓解异构更新中的对齐漂移,并引入可选的客户端特定适配器(client-specific adapters)实现轻量级本地个性化,从而在保持全局知识迁移效率的同时显著提升客户端层面的检索性能,尤其在模态不完整的情况下表现优异。

链接: https://arxiv.org/abs/2604.22885
作者: Hefeng Zhou,Xuan Liu,Sicheng Chen,Wutong Zhang,Wu Yan,Jiong Lou,Chentao Wu,Guangtao Xue,Wei Zhao,Jie Li
机构: Shanghai Jiao Tong University (上海交通大学); Hohai University (河海大学); Shenzhen Univ of Advanced Technology (深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated cross-modal retrieval faces severe challenges from heterogeneous client data, particularly non-IID semantic distributions and missing modalities. Under such heterogeneity, a single global model is often insufficient to capture both shared cross-modal knowledge and client-specific characteristics. We propose RCSR, a personalization-friendly federated framework that integrates prototype anchoring, retrieval-centric semantic routing, and optional client-specific adapters. Built on a frozen CLIP backbone, RCSR leverages lightweight shared adapters for global knowledge transfer while supporting efficient local personalization. Prototype anchoring helps unimodal clients align with global cross-modal semantics, and a server-side semantic router adaptively assigns aggregation weights based on retrieval consistency to mitigate alignment drift during heterogeneous updates. Extensive experiments on MS-COCO, Flickr30K, and other benchmarks show that RCSR consistently improves global retrieval accuracy and training stability, while further enhancing client-level retrieval performance, especially for clients with incomplete modalities. Code is available at this https URL.

[CV-181] Can Multimodal Large Language Models Truly Understand Small Objects?

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在小物体理解(Small Object Understanding, SOU)任务中能力不足且缺乏系统评估的问题。现有研究尚未对MLLMs在识别和理解图像中小尺度目标方面的性能进行充分探索,导致其在驾驶、航拍和水下等实际场景中的应用受限。解决方案的关键在于提出首个全面的SOU基准测试平台SOUBench,包含自动构建的视觉问答(Visual Question Answering, VQA)数据集SOU-VQA(18,204对样本,涵盖六个子任务和三种典型场景),并进一步开发用于提升模型SOU能力的训练数据集SOU-Train(11,226对样本)。通过在15个先进MLLM上进行系统评测及基于SOU-Train的监督微调实验,验证了该方案能显著增强模型对小物体的理解能力,为后续研究提供了可复现的评估标准与训练范式。

链接: https://arxiv.org/abs/2604.22884
作者: Fujun Han,Junan Chen,Xintong Zhu,Jingqi Ye,Xuanjie Mao,Tao Chen,Peng Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under Peer Review (26 pages, 9 figures, 6 tables)

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks, e.g., image and video analysis, math and physics olympiads. However, they remain blank and unexplored for Small Object Understanding (SOU) tasks. To fill this gap, we introduce SOUBench, the first and comprehensive benchmark for exploring the small objects understanding capability of existing MLLMs. Specifically, we first design an effective and automatic visual question-answer generation strategy, constructing a new SOU-VQA evaluation dataset, with 18,204 VQA pairs, six relevant sub-tasks, and three dominant scenarios (i.e., Driving, Aerial, and Underwater). Then, we conduct a comprehensive evaluation on 15 state-of-the-art MLLMs and reveal their weak capabilities in small object understanding. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to improve the SOU capabilities of MLLMs. Through supervising fine-tuning of the latest MLLM, we demonstrate that SOU-Train can effectively enhance the latest MLLM’s ability to understand small objects. Comprehensive experimental results demonstrate that, the proposed SOUBench, along with the SOU-VQA and SOU-Train datasets, provides a crucial empirical foundation to the community for further developing models with enhanced small object understanding capabilities. Datasets and Code: this https URL.

[CV-182] NeuroAPS-Net: Neuro-Anatomically Aware Point Cloud Representation for Efficient Alzheimers Disease Classification IJCNN2026

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)分类中深度学习方法计算资源消耗高、难以部署于资源受限环境的问题。现有基于3D卷积神经网络(CNNs)的方法虽然有效,但其高复杂度限制了实际应用。解决方案的关键在于提出两个核心创新:一是通过解剖优先采样(Anatomical Priority Sampling, APS)将T1加权磁共振成像(MRI)转换为具有解剖学信息的二维点云(2D point clouds),构建首个神经解剖标注的MRI衍生点云数据集ADNI-2DPC;二是设计轻量级几何深度学习模型NeuroAPS-Net,通过区域感知特征编码和ROI令牌聚合引入解剖先验知识,在保持分类性能的同时显著降低推理延迟和GPU显存占用,从而为AD分类提供一种高效且可解释的替代方案。

链接: https://arxiv.org/abs/2604.22883
作者: Towhidul Islam(1),Mufti Mahmud(2 and 3) ((1) ICS Department, King Fahd University of Petroleum amp; Minerals, Dhahran, Saudi Arabia, (2) SDAIA-KFUPM JRC for AI, King Fahd University of Petroleum amp; Minerals, Dhahran, Saudi Arabia, (3) IRC for Bio Systems and Machines, King Fahd University of Petroleum amp; Minerals, Dhahran, Saudi Arabia)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures, Accepted under IJCNN 2026

点击查看摘要

Abstract:Alzheimer’s disease (AD) is a progressive neurodegenerative disorder and a major cause of dementia. Structural MRI is widely used to analyze AD-related brain atrophy; however, most deep learning methods rely on computationally expensive 3D convolutional neural networks (CNNs), limiting deployment in resource-constrained settings. This work introduces two main contributions. First, we propose a pipeline that converts T1-weighted MRI into anatomically informed 2D point clouds using Anatomical Priority Sampling (APS), producing ADNI-2DPC, the first neuroanatomically labeled MRI-derived point cloud dataset. Second, we present NeuroAPS-Net, a lightweight geometric deep learning model that incorporates anatomical priors via region-aware feature encoding and ROI token aggregation. Experiments on ADNI-2DPC demonstrate that NeuroAPS-Net achieves competitive classification accuracy while significantly reducing inference latency and GPU memory compared to state-of-the-art point cloud methods. These results highlight the potential of anatomically guided point cloud learning as an efficient and interpretable alternative to voxel-based CNNs for AD classification.

[CV-183] SketchVLM: Vision language models can annotate images to explain thoughts and guide users

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在回答图像相关问题时仅输出文本、缺乏可解释性与用户验证能力的问题。现有模型如Gemini-3-Pro和GPT-5虽具备强大推理能力,但其输出难以被用户直观验证或理解其推理过程。为此,作者提出SketchVLM,一种无需训练、模型无关的框架,通过生成非破坏性且可编辑的SVG(Scalable Vector Graphics)叠加层来可视化模型的推理依据。该方案的关键在于:利用预训练VLM的内部表征直接生成结构化图形标注(如绘图、标记、连线等),从而增强模型输出的透明度与可信度,在多个视觉推理任务中显著提升准确率(最高+28.5个百分点)与标注质量(最高1.48倍于图像编辑和微调基线),同时保证标注内容忠实于模型的回答逻辑。

链接: https://arxiv.org/abs/2604.22875
作者: Brandon Collins,Logan Bolton,Hung Huy Nguyen,Mohammad Reza Taesiri,Trung Bui,Anh Totti Nguyen
机构: Auburn University (奥本大学); Adobe (Adobe公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision-language models (VLMs) such as Gemini-3-Pro and GPT-5 only respond with text, which can be difficult for users to verify. We present SketchVLM, a training-free, model-agnostic framework that enables VLMs to produce non-destructive, editable SVG overlays on the input image to visually explain their answers. Across seven benchmarks spanning visual reasoning (maze navigation, ball-drop trajectory prediction, and object counting) and drawing (part labeling, connecting-the-dots, and drawing shapes around objects), SketchVLM improves visual reasoning task accuracy by up to +28.5 percentage points and annotation quality by up to 1.48x relative to image-editing and fine-tuned sketching baselines, while also producing annotations that are more faithful to the model’s stated answer. We find that single-turn generation already achieves strong accuracy and annotation quality, and multi-turn generation opens up further opportunities for human-AI collaboration. An interactive demo and code are at this https URL.

[CV-184] Vision-Based Lane Following and Traffic Sign Recognition for Resource-Constrained Autonomous Vehicles

【速读】:该论文旨在解决嵌入式自主车辆中资源受限环境下可靠视觉感知算法的实时实现难题。其核心挑战在于如何在计算资源有限的平台上同时实现高精度的车道检测与跟踪、交通标志识别,并保证系统运行效率。解决方案的关键在于构建一个轻量级视觉感知框架,集成基于阈值的车道分割方法(结合透视变换和直方图曲率估计)以实现鲁棒的车道跟踪,辅以规则-based转向控制器确保稳定导航;同时采用两个轻量级卷积神经网络(CNNs)——EfficientNet-B0与MobileNetV2进行交通标志识别,在自建数据集上验证了EfficientNet-B0在离线测试中达到98.77%准确率、实时部署时仍保持90%准确率,优于MobileNetV2,尽管后者具有更低的计算开销和更快的推理速度。整体方案在保障实时性的同时显著提升了感知精度,为嵌入式自动驾驶系统提供了可行且高效的视觉感知路径。

链接: https://arxiv.org/abs/2604.22872
作者: Md Tanjemul Islam,Md Rafiul Kabir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注: 2026 International Conference on Intelligent Systems, Blockchain, and Communication Technologies

点击查看摘要

Abstract:Autonomous vehicles (AVs) rely on real-time perception systems to understand road environments and ensure safe navigation. However, implementing reliable perception algorithms on resource-constrained embedded platforms remains challenging due to limited computational resources. This paper presents a lightweight vision-based framework that integrates lane detection, lane tracking, and traffic sign recognition for embedded autonomous vehicles. A computationally efficient threshold-based lane segmentation method combined with perspective transformation and histogram-based curvature estimation is used for robust lane tracking under varying illumination conditions. A rule-based steering controller generates steering commands to maintain stable vehicle navigation. For traffic sign recognition, two lightweight convolutional neural networks (CNNs), EfficientNet-B0 and MobileNetV2, are evaluated using a custom dataset captured from the vehicle’s onboard camera. Experimental results show that the system achieves real-time performance while maintaining accurate lane tracking with only 3.16% maximum offset RMSE. EfficientNet-B0 achieves a high offline classification accuracy of 98.77% on the test dataset, while achieving 90% accuracy during real-time on-device deployment, outperforming MobileNetV2 in both settings. MobileNetV2, however, offers slightly faster inference and lower computational cost. These results highlight the effectiveness of lightweight vision-based perception pipelines for resource-constrained autonomous driving applications.

[CV-185] Probing Visual Planning in Image Editing Models ICLR2026

【速读】:该论文旨在解决机器学习中视觉规划(Visual Planning)任务的计算效率低下问题,尤其是在依赖逐步生成式推理(step-by-step planning-by-generation paradigm)时所面临的瓶颈。其核心挑战在于如何在保持视觉推理能力的同时提升模型的效率与泛化性能。解决方案的关键在于提出“编辑即推理”(Editing-as-Reasoning, EAR)范式,将视觉规划重构为单步图像变换(single-step image transformation),从而避免冗长的逐步生成过程。为此,作者设计了AMAZE数据集,通过程序化生成的抽象谜题(如迷宫和皇后问题)隔离视觉识别与内在推理,并支持对自回归和扩散模型在像素级保真度与逻辑有效性上的自动评估。实验表明,尽管微调后模型能在同域及跨域尺度上实现显著泛化,但当前最优模型仍无法达到人类解题者的零样本效率,揭示了神经视觉推理领域存在的持续性差距。

链接: https://arxiv.org/abs/2604.22868
作者: Zhimu Zhou,Yanpeng Zhao,Qiuyu Liao,Bo Zhao,Xiaojian Ma
机构: Shanghai Jiao Tong University (上海交通大学); Renmin University of China (中国人民大学); State Key Laboratory of General Artificial Intelligence, BIGAI (通用人工智能国家重点实验室, BIGAI)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ES-Reasoning Workshop @ ICLR 2026. Our code is available at this https URL

点击查看摘要

Abstract:Visual planning represents a crucial facet of human intelligence, especially in tasks that require complex spatial reasoning and navigation. Yet, in machine learning, this inherently visual problem is often tackled through a verbal-centric lens. While recent research demonstrates the promise of fully visual approaches, they suffer from significant computational inefficiency due to the step-by-step planning-by-generation paradigm. In this work, we present EAR, an editing-as-reasoning paradigm that reformulates visual planning as a single-step image transformation. To isolate intrinsic reasoning from visual recognition, we employ abstract puzzles as probing tasks and introduce AMAZE, a procedurally generated dataset that features the classical Maze and Queen problems, covering distinct, complementary forms of visual planning. The abstract nature of AMAZE also facilitates automatic evaluation of autoregressive and diffusion-based models in terms of both pixel-wise fidelity and logical validity. We assess leading proprietary and open-source editing models. The results show that they all struggle in the zero-shot setting, finetuning on basic scales enables remarkable generalization to larger in-domain scales and out-of-domain scales and geometries. However, our best model that runs on high-end hardware fails to match the zero-shot efficiency of human solvers, highlighting a persistent gap in neural visual reasoning.

[CV-186] MeshLAM: Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction CVPR2026

【速读】:该论文旨在解决从单张图像中快速重建高保真、可动画化三维头部网格模型的问题,传统方法通常依赖于耗时的测试阶段优化或大量多视角数据,难以实现高效且高质量的重建。其解决方案的关键在于提出了一种前馈式框架 MeshLAM,采用双通道结构(形状与纹理贴图)协同处理,通过共享的 Transformer 主干网络提取图像特征,并利用迭代 GRU 解码机制实现渐进式的几何变形与纹理精修,同时引入基于重投影的纹理引导机制以锚定外观学习,从而在单次前向传播中生成拓扑完整、可动画化的高质量头部网格。

链接: https://arxiv.org/abs/2604.22865
作者: Yisheng He,Steven Hoi
机构: Tongyi Lab, Alibaba Group(通义实验室,阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:We introduce MeshLAM, a feed-forward framework for one-shot animatable mesh head reconstruction that generates high-fidelity, animatable 3D head avatars from a single image. Unlike previous work that relies on time-consuming test-time optimization or extensive multi-view data, our method produces complete mesh representations with inherent animatability from a single image in a single forward pass. Our approach employs a dual shape and texture map architecture that simultaneously processes mesh vertices and texture map with extracted image features from a shared transformer backbone, allowing for coherent shape carving and appearance modeling. To prevent mesh collapse and ensure topological integrity during feed-forward deformation, we propose an iterative GRU-based decoding mechanism with progressive geometry deformation and texture refinement, coupled with a novel reprojection-based texture guidance mechanism that anchors appearance learning to the input image. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in reconstruction quality, animation capability, and computational efficiency. Project page at this https URL.

[CV-187] A Digital Pathology Resource for Liver Cancer Quantification with Datasets Benchmarks and Tools

【速读】:该论文旨在解决肝癌病理图像分析中缺乏细粒度组织成分标注数据的问题,从而阻碍了可重复的模型开发与定量分析工具的部署。其关键解决方案是发布HepatoBench——一个基于切片级别(patch-level)标注的肝癌图像数据库,包含七个关键组织类别注释,并基于此训练了一个深度学习分类模型用于组织识别;同时构建了一个全切片图像(WSI-level)肿瘤/非肿瘤分割模型以自动定位病灶区域。通过整合这两个模型,作者开发出HepatoQuant,一个端到端的、疾病特异性的区域量化工具,实现了从WSI到组织成分解析与定量统计的统一工作流,为肝癌病理图像的自动化区域定量和公平方法比较提供了坚实基础。

链接: https://arxiv.org/abs/2604.22858
作者: Ying Xiao,Shimiao Tang,Xitong Ling,Weiming Chen,Jun Wang,Jiawen Li,Huaitian Yuan,Jianghui Yang,Bowen Li,Huan Li,Yiting Meng,Tian Guan,Yonghong He,Hongfang Yin
机构: Model call failure
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Liver cancer, especially hepatocellular carcinoma (HCC), imposes a substantial global disease burden. Accurate diagnosis and prognostic assessment directly influence treatment selection and patient survival, and pathological examination remains the gold standard for liver cancer diagnosis. Identifying diverse tissue components and pathological subtypes on histopathology slides is crucial for estimating postoperative recurrence risk and overall prognosis. However, most publicly available resources are still provided at the whole-slide image (WSI) level, and well-annotated datasets for fine-grained tissue component identification in liver cancer are scarce, which hinders reproducible model development and the deployment of quantitative analysis tools. To address this gap, we release HepatoBench, a patch-level image database for liver cancer with annotations for seven key tissue categories. Based on HepatoBench, we train and open-source a deep learning classification model as a tissue recognition tool. Furthermore, we train a WSI-level tumor/non-tumor segmentation model to automatically localize lesion regions across entire slides. By integrating the patch-level tissue classifier with the WSI-level segmentation model, we build HepatoQuant, an end-to-end, disease-specific regional quantification tool for liver cancer, enabling a unified workflow from WSIs to tissue composition parsing and quantitative statistics. We also open-source HepatoBench, the benchmarking protocol, and supporting tools, providing a solid foundation for automated regional quantification and fair method comparison in liver cancer pathology.

[CV-188] IoT-Enhanced CNN-Based Labelled Crack Detection for Additive Manufacturing Image Annotation in Industry 4.0

【速读】:该论文旨在解决增材制造(Additive Manufacturing, AM)表面裂纹自动检测的难题,以提升质量控制的准确性与效率。传统方法难以实现在线、实时且高精度的缺陷识别,尤其在复杂工艺参数下易受噪声干扰。解决方案的关键在于构建一个物联网(IoT)增强的深度学习框架,融合边缘计算、高分辨率成像与实时数据流处理技术,实现从数据采集到缺陷分类的闭环自动化流程。其核心创新包括:基于 Raspberry Pi 4B 的 IoT 监测系统、优化的卷积神经网络(CNN)模型(通过量化和批量处理将推理延迟降低 47%)、以及基于 MQTT 协议的低延迟 5G 数据传输机制(减少 35% 传输开销)。此外,数字孪生(Digital Twin, DT)技术的集成实现了工艺参数与缺陷形成之间的关联建模,支持预测性分析与自适应控制,从而显著提升了系统泛化能力与工业适用性。

链接: https://arxiv.org/abs/2604.22857
作者: Mohsen Asghari Ilani,Yaser Mike Banad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 Figures, 23 Pages

点击查看摘要

Abstract:This paper presents an IoT-enhanced deep learning framework for automated crack detection in Additive Manufacturing (AM) surfaces using convolutional neural networks (CNNs). By integrating IoT-enabled real-time monitoring, high-resolution imaging, and edge computing, the system enables continuous in-situ defect detection and classification. Real-time data acquisition supports immediate CNN-based analysis, improving both accuracy and efficiency in AM quality control. The framework supports supervised and semi-supervised learning, enabling robust performance on large, sparsely annotated datasets. Using LabelImg for annotation and OpenCV for preprocessing, the system achieves 99.54% accuracy on 14,982 images, with 96% precision, 98% recall, and a 97% F1-score. Dataset balancing and augmentation significantly improve generalization, increasing accuracy from 32% to 99%. Beyond detection, the framework establishes a linkage between AM process parameters, defect formation, and surface topology, supporting predictive analytics and defect mitigation. Aligned with Industry 4.0, it incorporates Digital Twin (DT) technology for real-time process simulation, predictive maintenance, and adaptive control. Key contributions include an IoT-based monitoring system using edge devices (Raspberry Pi 4B), an optimized CNN with model quantization and batch processing reducing inference latency by 47%, and an MQTT-based low-latency data streaming system with 5G connectivity, lowering transmission overhead by 35%. DT integration further enables predictive defect analysis and dynamic adjustment of AM parameters. This work advances intelligent AM quality control by providing a scalable, high-accuracy, and low-latency framework. Future directions include multimodal data fusion, hybrid architectures, and enhanced Digital Twin simulations for AI-driven defect prevention.

[CV-189] Attention-Augmented YOLOv8 with Ghost Convolution for Real-Time Vehicle Detection in Intelligent Transportation Systems

【速读】:该论文旨在解决车辆检测在复杂交通环境中的准确性与鲁棒性问题,特别是在特征冗余、注意力机制不足以及几何形变适应性差等挑战下,如何提升检测模型的性能。解决方案的关键在于提出一种改进的YOLOv8n模型,通过集成三个核心模块:Ghost Module(减少特征冗余)、Convolutional Block Attention Module (CBAM)(增强通道与空间注意力以优化特征表示),以及Deformable Convolutional Networks v2 (DCNv2)(提升对车辆结构几何变化的自适应能力),从而实现高精度、高效率的车辆检测,在KITTI数据集上达到95.4% mAP@0.5,显著优于基线模型及其他七种先进检测器。

链接: https://arxiv.org/abs/2604.22856
作者: Syed Sajid Ullah,Muhammad Zunair Zamir,Ahsan Ishfaq,Salman Khan
机构: Chang’an University(长安大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate vehicle detection is a critical component of autonomous driving, traffic surveillance, and intelligent transportation systems. This paper presents an enhanced YOLOv8n-based model that integrates the Ghost Module, Convolutional Block Attention Module (CBAM), and Deformable Convolutional Networks v2 (DCNv2) to improve detection performance. The Ghost Module reduces feature redundancy through efficient feature generation, CBAM refines feature representation via channel and spatial attention, and DCNv2 enhances adaptability to geometric variations in vehicle structures. Evaluated on the KITTI dataset, the proposed model achieves 95.4% mAP@0.5, representing an 8.97% improvement over the baseline YOLOv8n, along with 96.2% precision, 93.7% recall, and a 94.93% F1-score. Comparative analysis against seven state-of-the-art detectors demonstrates consistent superiority across key performance metrics, while ablation studies validate the individual and combined contributions of the integrated modules. By addressing feature redundancy, attention refinement, and spatial adaptability, the proposed approach offers a robust and computationally efficient solution for vehicle detection in diverse and complex traffic environments.

[CV-190] Evaluating Remote Sensing Image Captions Beyond Metric Biases

【速读】:该论文旨在解决遥感图像描述(Remote Sensing Image Captioning, RSIC)任务中因依赖人工标注参考文本进行评估而导致的系统性偏差问题,即传统评价指标迫使模型模仿特定人类注释风格,从而掩盖了基础大模型(Multimodal Large Language Models, MLLMs)的真实描述能力。其解决方案的关键在于提出一种全新的无参考评估指标 ReconScore,该指标不依赖文本相似度计算,而是通过评估生成文本是否能准确重构原始视觉内容来衡量caption质量,从而消除人类注释偏倚。基于此评估机制,研究发现未微调的MLLM在零样本RSIC任务中表现优于微调模型,并据此提出无需训练的生成方法 RemoteDescriber,利用ReconScore作为自校正机制迭代提升语义精度,实现无需计算微调开销的高性能生成。

链接: https://arxiv.org/abs/2604.22855
作者: Ziyun Chen,Fan Liu,Liang Yao,Chuanyi Zhang,Yuye Ma,Wei Zhou
机构: Hohai University (河海大学); Cardiff University (卡迪夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The core objective of image captioning is to achieve lossless semantic compression from visual signals into textual modalities. However, the reliance on manually curated reference texts for evaluation essentially forces models to mimic specific human annotation styles, thereby masking the true descriptive capabilities of advanced foundation models. This systemic misalignment prompts a critical question: Is task-specific fine-tuning truly necessary for Remote Sensing Image Captioning, or is the perceived performance gap merely an artifact of flawed evaluation criteria? To investigate this discrepancy, we propose ReconScore, a novel reference-free evaluation metric. Rather than computing textual similarities, we assess caption quality by its capability to reconstruct the original visual elements solely from the generated text, effectively neutralizing human annotation biases. Applying this metric, we uncover a profound, counterintuitive truth: inherently powerful, unfine-tuned MLLMs surpass their fine-tuned counterparts in authentic zero-shot RSIC tasks. Driven by this structural discovery, we introduce RemoteDescriber, a completely training-free generation methodology. By employing ReconScore as a self-correction mechanism, we iteratively refine the semantic precision of MLLM outputs without any computational fine-tuning overhead. Comprehensive experiments demonstrate that RemoteDescriber achieves state-of-the-art performance on three datasets. Furthermore, we validate ReconScore’s reliability and analyze the flaws of traditional metrics. Our code is available at this https URL.

[CV-191] MAE-Based Self-Supervised Pretraining for Data-Efficient Medical Image Segmentation Using nnFormer

【速读】:该论文旨在解决医学图像分割中因标注数据稀缺导致的模型训练不稳定、过拟合以及性能受限的问题。其关键解决方案是引入基于掩码自编码器(Masked Autoencoders, MAE)的自监督预训练框架,使nnFormer模型在无需人工标注的大量体积医学图像上进行预训练,以学习有意义的解剖结构表征;随后在小规模标注数据集上进行微调,从而显著提升分割性能(Dice分数)、加快收敛速度并增强泛化能力。

链接: https://arxiv.org/abs/2604.22854
作者: R. M. Krishna Sureddi,T. Satyanarayana Murthy,Nomula Varsha Reddy,Adi Kanishka,Nalla Manvika Reddy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 4 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Transformer architectures, including nnFormer,have demonstrated promising results in volumetric medical image segmentation by being able to capture long-range spatial interactions. Although they have high performance, these models need large quantities of labeled training data and are also likely to overfit and become training unstable. This is a serious practical problem because it is not only time-consuming but also expensive to obtain medical images that are annotated by experts. Moreover, fully supervised traditional training pipelines do not take advantage of the available large amounts of unlabeled medical imaging data that can be easily obtained in the clinics. We have solved these drawbacks by advancing the efficiency of the nnFormer with a self-supervised pretraining framework, which is based on the Masked Autoencoders (MAE). In this method, the model is pretrained on unlabeled volumetric medical images to reconstruct randomly masked parts of the input. This allows the encoder to learn meaningful anatomical and structural representations . The encoder is then further fine-tuned on a labeled dataset on the downstream segmentation task. Conducted Experiment shows that the offered method leads to a higher segmentation performance on the count of Dice score, a quicker convergence rate on the course of the fine-tuning procedure, and a superior generalization on the basis of limited labeled data . These findings validate that self-supervised learning combined with transformer-based segmentation models is an appropriate approach to the problem of data shortage in medical image analysis.

[CV-192] FastAT Benchmark: A Comprehensive Framework for Fair Evaluation of Fast Adversarial Training Methods FAST

【速读】:该论文旨在解决当前快速对抗训练(Fast Adversarial Training, FastAT)方法之间缺乏公平比较的问题。现有基准和排行榜通常允许不同的模型架构、训练配置及外部数据源,导致性能提升难以区分是算法改进还是实验条件更优。为解决此问题,作者提出了FastAT Benchmark,其关键在于构建一个受控的评估框架,包含三大设计原则:统一的模型架构要求、标准化的训练设置以及严格禁止使用外部或合成数据。该框架在单一代码库中实现了二十多种代表性FastAT方法,并采用双指标评价体系(对抗鲁棒性与计算成本),从而实现可复现、透明且公平的性能对比,揭示了单步法在保持高鲁棒性的同时显著降低计算开销的潜力。

链接: https://arxiv.org/abs/2604.22853
作者: Chao Pan,Xin Yao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 2 figures, 3 tables. Code: this https URL Project page: this https URL

点击查看摘要

Abstract:Fast Adversarial Training (FastAT) seeks to achieve adversarial robustness at a fraction of the computational cost incurred by standard multi-step methods such as PGD-AT. Although numerous FastAT techniques have been proposed in recent years, fair comparison among them remains elusive. Existing benchmarks and public leaderboards typically permit diverse model architectures, varying training configurations, and external data sources, making it unclear whether reported improvements reflect genuine algorithmic advances or merely more favorable experimental conditions. To address this problem, we introduce the FastAT Benchmark, a controlled evaluation framework built on three core design principles: unified architecture requirements, standardized training settings, and strict prohibition of external or synthetic data. The benchmark implements over twenty representative FastAT methods within a single codebase, enabling direct and reproducible comparison. Each method is assessed through a dual-metric evaluation framework that measures both adversarial robustness (accuracy under PGD, AutoAttack, and CR Attack) and computational cost (GPU training time and peak memory footprint). Comprehensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet provide reliable baseline measurements and reveal that well-designed single-step methods can match or surpass PGD-AT robustness at substantially lower cost, while no single method dominates across all evaluation dimensions. The complete benchmark, including source code, configuration files, and experimental results, is publicly available to support transparent and fair evaluation of future FastAT research.

[CV-193] Accelerating New Product Introduction for Visual Quality Inspection via Few-Shot Diffusion-Based Defect Synthesis

【速读】:该论文旨在解决工业视觉检测系统在新产品导入(New Product Introduction, NPI)早期阶段普遍面临的标注缺陷数据严重匮乏问题,这一限制导致难以部署鲁棒的监督式缺陷检测模型,而此时自动化质量控制需求最为迫切。解决方案的关键在于提出一个端到端的生成式框架,通过解耦缺陷形态(defect morphology)与背景外观(background appearance),实现高保真、少样本的缺陷合成,从而支持域内数据增强和跨域迁移。具体技术包括:基于掩码文本反演(masked textual inversion)的缺陷表征学习、噪声融合条件生成(noise-blended conditioned generation)以实现表面感知合成,以及梯度感知后处理(gradient-aware post-processing)以确保合成缺陷与真实图像的无缝融合。实验证明,该方法在少样本增强和零样本适应两个场景下均显著提升检测性能,有效加速了NPI阶段的质量检测模型部署。

链接: https://arxiv.org/abs/2604.22850
作者: Serkan Hamdi Güğül,Kemal Levi,Burak Acar
机构: Relimetrics, Inc.
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 6 figures. White paper from Relimetrics, Inc

点击查看摘要

Abstract:Industrial visual inspection systems often suffer from a severe scarcity of labeled defect data, particularly during the early stages of New Product Introduction (NPI). This limitation hinders the deployment of robust supervised detectors precisely when automated quality control is most needed. We present an end-to-end generative framework for high-fidelity, few-shot defect synthesis that enables both in-domain augmentation and cross-domain transfer. Our approach disentangles defect morphology from background appearance by combining masked textual inversion for defect representation learning, noise-blended conditioned generation for surface-aware synthesis, and gradient-aware post-processing for seamless visual integration. We evaluate the framework in two practically relevant settings: few-shot data augmentation, where synthetic samples enrich a small set of real defects, and zero-shot adaptation, where defects learned from a source domain are transferred to a novel target surface without any real target-domain defect examples. Using RF-DETR as the downstream detector, we show that the proposed pipeline substantially narrows the domain gap on a private industrial dataset. In the few-shot setting, synthetic augmentation improves mAP from 78.8% to 83.3%. In the zero-shot setting, synthetic domain adaptation improves mAP from 65.0% to 85.1%. These results demonstrate that high-fidelity defect synthesis can meaningfully accelerate NPI by enabling effective inspection models before sufficient real defect data has been collected.

[CV-194] LunarDepthNet: Generation of Digital Elevation Models using Deep Learning and Monocular Satellite Images

【速读】:该论文旨在解决月球表面高精度数字高程模型(Digital Elevation Model, DEM)数据匮乏的问题,尤其是在缺乏立体影像(stereo-images)区域难以获取详细地形信息的挑战。解决方案的关键在于提出一种基于深度学习的新方法——LunarDepthNet,其核心创新包括:采用EfficientNet作为编码器以提取高效特征,结合定制化网络层来建模光照阴影与真实高程之间的关系,并引入联合损失函数以确保生成DEM在细节保留和光滑性上的平衡。该方法可直接从单张月面图像中估计并生成高质量DEM,验证结果显示模型在测试阶段达到均方根归一化误差(nRMSE)0.437和平均绝对误差(MAE)4.5米,证明其在无立体影像区域具有良好的泛化能力与实用性。

链接: https://arxiv.org/abs/2604.22848
作者: Aaranay Aadi,Jai Gopal Singla,Amitabh,Nitant Dube,Praveen Kumar Shukla,Vijaypal Singh Dhaka
机构: Manipal University Jaipur (曼尼帕尔大学贾伊普尔分校); Space Applications Centre (SAC) (印度空间研究组织空间应用中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE - 4th International Conference on Computer, Electronics, Electrical Engineering and their applications (IC2E3 2026)

点击查看摘要

Abstract:Recent times have seen an increase in demand of high quality Digital Elevation Models (DEMs) for the lunar surface, because they are highly important for studying the moon and planning future missions. However, there is an evident lack of detailed elevation data on the Moon. To overcome this limitation, this study proposes a novel deep learning method that estimates and generates a surface elevation map directly from monocular images of the surface. The dataset used comprises of the Chandrayaan-2 Terrain Mapping Camera (TMC) images with their corresponding Digital Terrain Models (DTMs). The study proposes LunarDepthNet, which comprises of a UNet architecture to generate DEMS. It incorporates an EfficientNet encoder and custom layers to correctly learn how the light shadows on the surface relate to the actual elevation values. A combined loss function was also utilized to keep the terrain details accurate and smooth. During validation, the model showed a stable loss convergence of 12%. It achieved a mean nRMSE of 0.437 and an MAE of 4.5m in the testing stage. These results prove the model can generate dependable elevation maps from single orbital images, which are quite useful in regions of the moon where stereo-images are not available.

[CV-195] Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes

【速读】:该论文旨在解决高效生成结构化、可交互的三维环境(如Minecraft世界)的问题,尤其关注如何在保持语义一致性的同时提升生成效率。其核心挑战在于传统方法难以在大规模3D空间中实现高质量且可控的生成,同时缺乏对用户交互(如局部修改或扩展)的支持。解决方案的关键在于提出Dream-Cubed数据集和基于立方体(cube)的生成模型家族:首先构建包含数十亿token的高质量、混合生成与人工设计的地图数据集;其次设计直接在方块(block)空间中操作的扩散模型(diffusion models),利用立方体作为组合单元实现高效、语义明确的生成,并支持诸如补全(inpainting)和扩展(outpainting)等交互式工作流。该方案通过FID变体和人类偏好实验验证了生成质量,为未来面向交互式3D环境的生成建模提供了基础框架。

链接: https://arxiv.org/abs/2604.22847
作者: Tim Merino,Sam Earle,Ryunosuke Iwai,Julian Togelius,Edoardo Cetin
机构: New York University (纽约大学); Sakana AI (萨卡纳人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce Dream-Cubed, a large-scale dataset of Minecraft worlds at voxel resolution, and a family of models using cubes as powerful compositional units for efficient generation of interactive 3D environments. Dream-Cubed comprises tens of billions of tokens from a carefully curated mixture of procedural biome terrain and high-quality human-authored maps. We use this dataset to conduct the first large-scale study of 3D diffusion models for voxel generation, analyzing discrete and continuous diffusion formulations, data compositions, and architectural design choices. Our models operate directly in the space of blocks, enabling efficient and semantically grounded generation while supporting interactive user workflows such as inpainting and outpainting from user-authored blocks. To quantitatively evaluate our models, we adapt the FID metric to assess semantic differences between real and generated world renderings, and validate generation quality through a human preference study. We release the full dataset, code, and all our pretrained models, which we hope will provide a foundation for future research in efficient generative modeling for structured, interactive 3D environments.

[CV-196] Unified Multi-Foundation-Model Slide Representation for Pan-Cancer Recognition and Text-Guided Tumor Localization

【速读】:该论文旨在解决病理基础模型(pathology foundation models)在临床应用中因碎片化切片级(tile-level)表征导致的统一切片级推理困难及与临床有意义信息缺乏可解释关联的问题。其核心挑战在于如何将异构的基础模型表征整合为共享的切片级表示空间,并实现语义锚定(semantically grounded),从而支持多任务泛癌预测与弱监督肿瘤定位。解决方案的关键在于提出ASTRA框架,通过稀疏专家混合上下文建模(sparse mixture-of-experts contextualization)、掩码多模态重建(masked multi-model reconstruction)和对比对齐结构化病理提示(contrastive alignment to structured pathology prompts)三者协同,学习具备判别力且语义明确的切片表示,仅依赖少量结构化病理标注字段(如分类类别、癌症类型、解剖部位)即可实现无需像素级监督的4类分类、3类实体瘤分型、16类癌症分型以及文本引导的肿瘤定位任务。

链接: https://arxiv.org/abs/2604.22846
作者: Tianyang Wang,Ziyu Su,Abdul Rehman Akbar,Usama Sajjad,Lina Gokhale,Charles Rabolli,Wei Chen,Anil Parwani,Muhammad Khalid Khan Niazi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The expanding ecosystem of pathology foundation models has produced powerful but fragmented tile-level representations, limiting their use in clinical tasks that require unified slide-level reasoning and interpretable linkage to clinically meaningful information. We present ASTRA, a pan-cancer framework that integrates heterogeneous foundation-model representations into a shared slide-level representation space and semantically grounds that space using structured pathology annotation fields, including classification category, cancer type, and anatomic site. ASTRA combines sparse mixture-of-experts contextualization, masked multi-model reconstruction, and contrastive alignment to structured pathology prompts to learn slide representations that support 4-category classification, 3-class solid tumor typing, 16-class cancer typing, and text-guided tumor localization without pixel-level supervision. Developed on a CHTN cohort of 10,359 whole-slide images (WSIs) spanning 16 tumor types, ASTRA consistently improves pan-cancer classification across four pathology foundation-model backbones, achieving up to 97.8% macro-AUC for 4-category classification, 99.7% for 3-class solid tumor typing, and 99.2% for 16-class cancer typing. For tumor localization, ASTRA achieves a mean Dice of 0.897 on an annotated in-domain CHTN subset (n = 380) spanning 16 cancer types and 0.738 on an external TCGA cohort (n = 1,686) spanning four cancer types. These results demonstrate that minimal structured pathology annotation fields derived from slide-level metadata can provide effective semantic supervision for unified slide representation learning, enabling both pan-cancer prediction and weakly supervised tumor localization within a single framework.

[CV-197] EX-FIQA: Leverag ing Intermediate Early eXit Representations from Vision Transformers for Face Image Quality Assessment

【速读】:该论文旨在解决现有基于视觉Transformer(Vision Transformer, ViT)的面部图像质量评估(Face Image Quality Assessment, FIQA)方法仅依赖网络最终层表示、忽略中间层蕴含的质量相关信息的问题。其解决方案的关键在于提出一种无需架构修改或额外训练的分数融合框架,通过分析ViT中所有12个Transformer块的中间表示,发现不同深度捕获了互补且独特的质量相关特征(如注意力模式差异和性能表现),并采用逐层加权平均策略——对较深层的输出赋予更高权重,以充分利用ViT层级化特征学习结构,从而在保持高精度的同时实现计算效率优化。

链接: https://arxiv.org/abs/2604.22842
作者: Guray Ozgur,Tahar Chettaoui,Eduarda Caldeira,Jan Niklas Kolf,Andrea Atzori,Fadi Boutros,Naser Damer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted at FG2026

点击查看摘要

Abstract:Face Image Quality Assessment is crucial for reliable face recognition systems, yet existing Vision Transformer-based approaches rely exclusively on final-layer representations, ignoring quality-relevant information captured at intermediate network depths. This paper presents the first comprehensive investigation of how intermediate representations within ViTs contribute to face quality assessment through early exit mechanisms and score fusion strategies. We systematically analyze all twelve transformer blocks of ViT-FIQA architectures, demonstrating that different depths capture distinct and complementary quality-relevant information, as evidenced by varying attention patterns and performance characteristics across network layers. We propose a score fusion framework that combines quality predictions from multiple transformer blocks without architectural modifications or additional training. Our early exit analysis reveals optimal performance-efficiency trade-offs, enabling significant computational savings while maintaining competitive performance. Through extensive evaluation across eight benchmark datasets using four FR models, we demonstrate that our fusion strategy improves upon single-exit approaches. Our proposed quality fusion approach employs depth-weighted averaging that assigns progressively higher importance to deeper transformer blocks, achieving the best quality assessment performance by effectively leveraging the hierarchical nature of feature learning in ViTs. Our work challenges the conventional wisdom that only deep features matter for face analysis, revealing that intermediate representations contain valuable information for quality assessment. The proposed framework offers practical benefits for real-world biometric systems by enabling adaptive computation based on resource constraints while maintaining competitive quality assessment capabilities.

[CV-198] ATTN-FIQA: Interpretable Attention-based Face Image Quality Assessment with Vision Transformers

【速读】:该论文旨在解决人脸图像质量评估(Face Image Quality Assessment, FIQA)中计算复杂度高、依赖额外训练或反向传播等问题,尤其针对现有方法难以在不修改模型结构的前提下实现高效且可解释的质量评分。解决方案的关键在于提出一种无需训练的新型评估方法 ATTN-FIQA,其核心思想是利用预训练 Vision Transformer (ViT) 模型中前 softmax 阶段的注意力分数作为质量指标:高质量图像因具有判别性面部特征而产生聚焦、高幅值的注意力模式,低质量图像则生成弥散、低幅值模式。该方法仅需对预训练模型进行一次前向传播,提取最后一层 Transformer 块的多头注意力矩阵并聚合所有图像块的信息,通过简单平均即可获得图像级质量得分,从而实现了高效、无训练、具备空间可解释性的 FIQA。

链接: https://arxiv.org/abs/2604.22841
作者: Guray Ozgur,Tahar Chettaoui,Eduarda Caldeira,Jan Niklas Kolf,Marco Huber,Andrea Atzori,Naser Damer,Fadi Boutros
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted at FG2026

点击查看摘要

Abstract:Face Image Quality Assessment (FIQA) aims to assess the recognition utility of face samples and is essential for reliable face recognition (FR) systems. Existing approaches require computationally expensive procedures such as multiple forward passes, backpropagation, or additional training, and only recent work has focused on the use of Vision Transformers. Recent studies highlighted that these architectures inherently function as saliency learners with attention patterns naturally encoding spatial importance. This work proposes ATTN-FIQA, a novel training-free approach that investigates whether pre-softmax attention scores from pre-trained Vision Transformer-based face recognition models can serve as quality indicators. We hypothesize that attention magnitudes intrinsically encode quality: high-quality images with discriminative facial features enable strong query-key alignments producing focused, high-magnitude attention patterns, while degraded images generate diffuse, low-magnitude patterns. ATTN-FIQA extracts pre-softmax attention matrices from the final transformer block, aggregate multi-head attention information across all patches, and compute image-level quality scores through simple averaging, requiring only a single forward pass through pre-trained models without architectural modifications, backpropagation, or additional training. Through comprehensive evaluation across eight benchmark datasets and four FR models, this work demonstrates that attention-based quality scores effectively correlate with face image quality and provide spatial interpretability, revealing which facial regions contribute most to quality determination.

[CV-199] From Skeletons to Pixels: Few-Shot Precise Event Spotting via Representation and Prediction Distillation

【速读】:该论文旨在解决少样本场景下精准事件检测(Precise Event Spotting, PES)的挑战,尤其是在网球等快节奏运动中,由于动作细微、运动模糊以及标注数据稀缺,导致帧级定位困难。解决方案的关键在于提出两种互补的蒸馏策略:一是预测层面的自适应权重蒸馏(Adaptive Weight Distillation, AWD),通过动态调整教师模型在无标签数据上的监督权重;二是表示层面的渐进多模态蒸馏(Annealed Multimodal Distillation for Few-Shot Event Detection, AMD-FED),利用渐进式伪标签将鲁棒的骨架知识迁移至视觉模态。实验表明,尤其是表示层蒸馏方法在少样本k-clip设置下显著优于单模态基线和现有PES方法,并在花样滑冰数据集上验证了其泛化能力,凸显了多模态蒸馏、特别是表示层知识迁移对提升少样本PES性能的有效性。

链接: https://arxiv.org/abs/2604.22839
作者: Zhong Han Ervin Yeoh,Jiang Kan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 39 pages, 4 figures, ISACE 2026

点击查看摘要

Abstract:Precise Event Spotting (PES) is essential in fast-paced sports such as tennis, where fine-grained events occur within very short temporal windows. Accurate frame-level localization is challenging because of motion blur, subtle action differences, and limited annotated data. We study two complementary distillation strategies for few-shot PES: Adaptive Weight Distillation (AWD), a prediction-level method that adaptively weights teacher supervision on unlabeled data, and Annealed Multimodal Distillation for Few-Shot Event Detection (AMD-FED), a representation-level framework that transfers robust skeleton knowledge into visual modalities through annealed pseudo-labeling. Both methods use multimodal distillation to improve generalization under limited supervision. We evaluate them on F3Set-Tennis(sub) under few-shot k-clip settings, where they consistently outperform single-modality baselines and prior PES approaches. After observing the stronger performance of representation-level distillation on tennis, we further validate AMD-FED on a second sports dataset, Figure Skating, where it also shows robust performance in the k-clip scenario. These results highlight the effectiveness of multimodal distillation, especially representation-level transfer, for few-shot precise event spotting.

[CV-200] Neural Network Optimization Reimagined: Decoupled Techniques for Scratch and Fine-Tuning

【速读】:该论文旨在解决现有优化器在处理预训练模型微调(fine-tuning)与从零开始训练(training from scratch)两种不同范式时,未能充分适配其独特需求的问题。传统优化方法仅关注损失函数的最小化,忽视了两类场景下权重更新特性与网络结构差异对模型性能的影响。解决方案的关键在于提出DualOpt,一种将优化策略解耦的新型方法:针对从零训练,引入实时分层权重衰减(real-time layer-wise weight decay),以提升收敛速度和泛化能力;针对微调任务,则通过在每次权重更新中加入权重回滚项(weight rollback),保持上下游模型权重分布一致性,缓解知识遗忘问题;进一步地,将分层权重衰减扩展为动态调整各层回滚强度的机制,以适应下游任务的差异化需求。实验表明,该方法在图像分类、目标检测、语义分割和实例分割等多类任务上均展现出优异性能。

链接: https://arxiv.org/abs/2604.22838
作者: Xin Ning,Qiankun Li,Xiaolong Huang,Qiupu Chen,Feng He,Weijun Li,Prayag Tiwari,Xinwang Liu
机构: Institute of Semiconductors, Chinese Academy of Sciences (中国科学院半导体研究所); College of Computing and Data Science (CCDS), Nanyang Technological University (南洋理工大学计算机与数据科学学院); University of Science and Technology of China (中国科学技术大学); Mila-Quebec AI Institute (Mila-魁北克人工智能研究所); School of Information Technology, Halmstad University (哈尔姆斯塔德大学信息科技学院); College of Computer, National University of Defense Technology (国防科技大学计算机学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: IEEE T=PAMI

点击查看摘要

Abstract:With the accumulation of resources in the era of big data and the rise of pre-trained models in deep learning, optimizing neural networks for various tasks often involves different strategies for fine-tuning pre-trained models versus training from scratch. However, existing optimizers primarily focus on reducing the loss function by updating model parameters, without fully addressing the unique demands of these two major paradigms. In this paper, we propose DualOpt, a novel approach that decouples optimization techniques specifically tailored for these distinct training scenarios. For training from scratch, we introduce real-time layer-wise weight decay, designed to enhance both convergence and generalization by aligning with the characteristics of weight updates and network architecture. For more importantly fine-tuning, we integrate weight rollback with the optimizer, incorporating a rollback term into each weight update step. This ensures consistency in the weight distribution between upstream and downstream models, effectively mitigating knowledge forgetting and improving fine-tuning performance. Additionally, we extend the layer-wise weight decay to dynamically adjust the rollback levels across layers, adapting to the varying demands of different downstream tasks. Extensive experiments across diverse tasks, including image classification, object detection, semantic segmentation, and instance segmentation, demonstrate the broad applicability and state-of-the-art performance of DualOpt. Code is available at this https URL.

[CV-201] OAMVOS:2nd Report for 5th PVUW MOSE Track

【速读】:该论文旨在解决基于SAM(Segment Anything Model)的密集跟踪器在长时间遮挡、快速运动、视角变化和干扰物等复杂场景下性能下降的问题,尤其针对小目标对象因少量错误记忆更新即可主导后续预测的脆弱性。解决方案的关键在于对原DAM4SAM跟踪器的记忆控制机制进行增强而非替换骨干网络,引入四个核心组件:可靠性感知的跟踪状态机、基于分支的恢复策略、延迟的动态记忆重用(DRM)提升机制,以及针对原生SAM3记忆选择的有选择性策略。该设计在稳定跟踪阶段沿用单路径传播流程,在置信度下降时进入模糊或恢复模式,维持候选分支集合并在分支重新确认后才提交记忆;对于小目标的消失与再出现,临时绕过原生记忆选择以保留旧锚点,并显式保留首个条件帧及适度扩大条件记忆预算,从而在保持简单场景高效性的同时显著提升长间隙遮挡和再出现场景下的鲁棒性。

链接: https://arxiv.org/abs/2604.22837
作者: Deshui Miao,Xingsen Huang,Yameng Gu,Xiaogang yu,Xin Li,Ming-Hsuan Yang
机构: Pengcheng Laboratory (鹏程实验室); Guangzhou Hengyan Technology (广州恒研科技); University of California at Merced (加州大学默塞德分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:SAM-based dense trackers provide strong short-term mask propagation but remain fragile under long occlusion, fast motion, viewpoint change, and distractors. The problem is especially severe for small objects, where a few incorrect memory updates can dominate later predictions. This report presents an occlusion- and reappearance-aware extension of DAM4SAM that improves memory control rather than changing the backbone. The method augments the original SAM3 tracker with four ingredients: a reliability-aware tracking state machine, branch-based recovery, delayed DRM promotion, and a selective policy for native SAM3 memory selection. During stable tracking, the model follows the original single-path propagation process. Once confidence drops, the tracker enters an ambiguous or recovery mode, maintains a small set of candidate branches, and commits memory only after a branch is reconfirmed. For small-object disappearance and reappearance, native memory selection is temporarily bypassed so older anchors remain accessible. In addition, the first conditioning frame is explicitly preserved, and the conditioning-memory budget is moderately enlarged to improve long-gap recovery. The resulting design keeps DAM4SAM efficient in easy cases while improving robustness in sequences dominated by occlusion and reappearance.

[CV-202] Agent RVOS for MeViS-Text Track of 5th PVUW Challenge: 3rd Method

【速读】:该论文旨在解决参考视频目标分割(Ref-VOS)任务中模型对目标存在性判断不准、时空推理能力弱以及分割精度不足的问题。其解决方案的关键在于构建一个以Sa2VA为基础、由多个角色明确的智能体(agent)协同工作的流水线架构:Sa2VA负责生成全局密集语义假设(即粗略的目标轨迹),而代理层则通过规划、搜索、评分与修正等机制,实现目标存在性验证、时间片段划分、锚帧选取及掩码精修,从而在保证语义理解密度的同时,提升分割结果的准确性与鲁棒性。

链接: https://arxiv.org/abs/2604.22836
作者: Deshui Miao,Chao Yang,Chao Tian,Guoqing Zhu,Kai Yang,Zhifan Mo,Xin Li
机构: Pengcheng Laboratory; Wuhan Textile University; Yixiang Innovation Technology (Shenzhen)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This report describes a Ref-VOS pipeline centered on Sa2VA and organized with explicit agent roles. The key idea is that Sa2VA should provide the first dense semantic hypothesis, while an agent loop decides whether that hypothesis should be accepted, revised, or refined. The pipeline starts with a target-presence judgment stage. If the referred object does not exist in the video, the system directly outputs zero masks. Otherwise, Sa2VA receives the video and referring prompt and produces a coarse mask trajectory over the full video. This trajectory is treated as a semantic prior rather than a final answer. A planner agent decomposes the query, temporal partition agents identify informative blocks, scout agents search for anchor frames, and refinement agents convert reliable Sa2VA masks into boxes and points for SAM3 propagation. A critic scores candidate trajectories, a reflection controller repairs weak hypotheses, and a collaboration controller reconciles multiple agent branches. The result is a Ref-VOS system in which Sa2VA is responsible for dense grounded understanding, while the agent layer handles presence verification, temporal search, confidence-aware revision, and final mask refinement.

[CV-203] ParkingScenes: A Structured Dataset for End-to-End Autonomous Parking in Simulation Scenes

【速读】:该论文旨在解决智能驾驶系统中自主泊车任务在城市受限环境下的挑战,特别是由于缺乏高质量、结构化的数据集导致的端到端学习性能受限问题。其解决方案的关键在于构建了一个名为ParkingScenes的多模态数据集,该数据集基于CARLA仿真平台生成,包含由混合A*(Hybrid A*)规划器与模型预测控制器(Model Predictive Controller, MPC)共同产生的结构化泊车轨迹,从而提供精确且可复现的监督信号。数据集涵盖16种倒车入库和6种平行泊车场景,每种场景在有无行人条件下各执行16次,共704个结构化episode和约10.5万帧图像,每帧包含四个RGB相机、四个深度传感器、车辆运动状态及鸟瞰图(Bird’s-Eye View, BEV)等同步多模态信息,支持丰富的跨模态融合与上下文感知学习。实验表明,在相同训练条件下,使用该结构化数据集训练的模型显著优于基于非结构化人工采集数据的模型,验证了结构化监督对提升泊车策略鲁棒性和准确性的重要作用。

链接: https://arxiv.org/abs/2604.22835
作者: Haonan Chen,Kaiwen Xiao,Bin Tian,Jun Fu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CAC 2025

点击查看摘要

Abstract:Autonomous parking remains a critical yet challenging task in intelligent driving systems, particularly within constrained urban environments where maneuvering space is limited and precise control is essential. While recent advances in end-to-end learning have shown great promise, the lack of high-quality, structured datasets tailored for parking scenarios remains a significant this http URL address this gap, we present ParkingScenes, a comprehensive multimodal dataset specifically designed for end-to-end autonomous parking in simulated scenes. Built on the CARLA simulator, ParkingScenes features structured parking trajectories generated by a Hybrid A* planner and a Model Predictive Controller (MPC), providing accurate and reproducible supervision signals. The dataset includes 16 reverse-in and 6 parallel parking scenarios, each executed under two pedestrian conditions (present and absent), resulting in 704 structured episodes and approximately 105000 frames. Each scenario is repeated 16 times to ensure consistent coverage. Each frame contains synchronized data from four RGB cameras, four depth sensors, vehicle motion states, and Bird’s-Eye View (BEV) representations, enabling rich multimodal fusion and context-aware learning. To demonstrate the utility of our dataset, we compare models trained on ParkingScenes with those trained on unstructured, manually collected simulation data under identical conditions. Results show significant improvements in performance, underscoring the effectiveness of structured supervision for robust and accurate parking policy learning. By releasing both the dataset and the collection framework, ParkingScenes establishes a scalable and reproducible benchmark for advancing learning-based autonomous parking systems. The dataset and collection framework will be released at: this https URL

[CV-204] WebSerial Vision Training for Microcontrollers: A Browser-Based Companion to On-Device CNN Training

【速读】:该论文旨在解决在资源受限设备上实现端到端TinyML(Tiny Machine Learning)视觉模型训练与部署的复杂性问题,尤其针对教育工作者、小型企业和研究人员在特定部署环境下快速开发任务专用视觉分类器的需求。解决方案的关键在于构建一个单文件、零安装的浏览器应用(webmcu-vision-web),通过Chromium浏览器即可完成从固件烧录、图像采集、卷积神经网络(CNN)训练、权重导出到推理激活热力图可视化的一整套本地化机器学习流程,全程无需外部依赖或数据外传;其中核心创新包括基于esptool-js的浏览器内固件烧录、HTTP URL驱动的超参数实时调整机制、以及在浏览器端完成约30秒内三类分类任务(每类30张图像,20轮训练)的轻量级CNN训练能力,显著缩短了“收集-训练-部署”周期至10分钟以内,且所有计算和数据均保持本地化处理,保障隐私与可控性。

链接: https://arxiv.org/abs/2604.22834
作者: Jeremy Ellis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 29 pages, 16 figures, 5 tables. Paper 2 of the webmcu-ai series. All source code and supplemental results available at: this https URL

点击查看摘要

Abstract:This paper presents webmcu-vision-web, a single-file, zero-install browser application for end-to-end TinyML vision model training and deployment on the Seeed Studio XIAO ESP32-S3 Sense (XIAO ML Kit, 15–40 USD). Acting as a browser-based companion to the on-device Arduino firmware of Paper 1 [1], it provides a private, fully local machine learning pipeline, from firmware flashing through image collection, CNN training, weight export, and live activation visualization, without any software installation beyond a Chromium-based browser. The system targets educators, small businesses, and researchers who need to train task-specific visual classifiers under their exact deployment conditions. Key capabilities include: in-browser firmware flashing via esptool-js; an SD card file browser with image preview and inline editing; this http URL live-sync for zero-recompile hyperparameter adjustment; webcam and ESP32 OV2640 camera image capture; this http URL CNN training completing a three-class run (~30 images per class, 20 epochs) in approximately 1 minute browser-side versus 9 minutes on-device, enabling a complete collect-train-deploy cycle in under 10 minutes; weight export as this http URL and myWeights.h; confusion matrix; and a live Conv2 activation heatmap streamed from the ESP32 during inference. No data leaves the local machine at any stage. A five-run consistency evaluation on the three-class reference problem (0Blank, 1Cup, 2Pen) demonstrates stable convergence with mean accuracy and standard deviation reported; all artefacts are released at the repository link below. The repository is a living template for LLM-assisted adaptation to new hardware and tasks. All source code is MIT-licensed at this https URL.

[CV-205] Intervention-Aware Multiscale Representation Learning from Imaging Phenomics and Perturbation Transcriptomics CVPR2026

【速读】:该论文旨在解决基于显微镜的表型分析(microscopy-based phenotypic profiling)在药物发现中缺乏转录组学(transcriptomics)所具有的机制深度,而现有转录组学方法成本高且数据稀缺的问题。同时,现有多模态方法要么仅用图像辅助其他模态,要么通过样本身份进行表征对齐,忽略了弱配对数据中的细胞类型(cell type)和剂量(dose)差异,限制了模型在未见干预措施上的泛化能力。其解决方案的关键在于提出一种干预感知蒸馏框架(intervention-aware distillation framework):利用扰动转录组学指导图像表示学习,其中教师模型基于基因表达和干预元数据生成化学感知码本(chemistry-aware codebook)中的软分布,并通过微调的单细胞基础模型编码细胞类型上下文并解耦剂量效应;学生模型仅依赖显微图像即可预测这些分布,从而在测试时独立运行,同时显式建模剂量与细胞类型不匹配问题。该设计强调干预语义而非身份对齐,理论上证明转录组引导可收紧图像预测的风险边界。

链接: https://arxiv.org/abs/2604.22832
作者: Jiayuan Chen,Ruoqi Liu,Zishan Gu,Ping Zhang
机构: The Ohio State University (俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CVPR 2026 main conference

点击查看摘要

Abstract:Microscopy-based phenotypic profiling is scalable for drug discovery but lacks the mechanistic depth of transcriptomics, which remains costly and scarce. Existing multimodal approaches either use images to support other modalities or naively align representations by sample identity, ignoring cell-type and dose variations in weakly paired data-limiting generalization to unseen interventions. In this paper, we introduce an intervention-aware distillation framework that leverages perturbational transcriptomics to guide image representation learning. A transcriptome-conditioned teacher integrates gene expression and intervention metadata to produce soft distributions over a chemistry-aware codebook organized by drug similarity. The teacher employs a fine-tuned single-cell foundation model to encode cell-type context and disentangle dose effects. An image-only student learns to predict these distributions from microscopy alone, distilling mechanistic knowledge while operating independently at test time. This design emphasizes intervention semantics rather than identity alignment and explicitly handles dose and cell-type mismatches. We provide theoretical guarantees showing that transcriptomic guidance tightens the risk bound for image-based prediction. On Cell Painting and RxRx datasets paired with L1000, our method significantly improves one-shot transfer to unseen interventions and drug-target gene discovery compared to self-supervised and alignment baselines.

[CV-206] 2D Pre-Training for 3D Pose Estimation

【速读】:该论文旨在解决3D人体姿态估计(3D Human Pose Estimation, 3D HPE)中因训练数据有限导致模型泛化能力不足的问题。现有方法通常仅在少数强基准数据集(如Human3.6M)上进行预训练,限制了模型在多样化场景下的适应性。解决方案的关键在于扩展现有3D HPE框架的兼容性,使其能够整合更多2D与3D HPE数据集(如Occlusion Person),并通过系统研究2D预训练中的关键因素(如模型规模)对下游任务性能的影响,验证2D预训练在提升模型泛化能力和计算效率方面的优势。实验表明,相较于直接在3D数据上训练,2D预训练显著提升了性能,且在MPII和Human3.6M数据集上实现了MPJPE低于64.5mm的精度。

链接: https://arxiv.org/abs/2604.22830
作者: Liyao Jiang,Ruichen Chen,Keith G. Mills
机构: University of Alberta (阿尔伯塔大学); LSU (路易斯安那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This work was completed as a graduate course project more than four years prior to this preprint. It is shared for archival and educational purposes. We open-source our code fork here: this https URL

点击查看摘要

Abstract:Pre-training is a general method that is used in a range of deep learning tasks. By first training a model on one task, and then further training on the downstream task used for final evaluation, the model is forced to learn a more general understanding of the input data. While pre-training has been applied to 3D Human Pose Estimation (HPE) previously, the scope of datasets used is typically very limited to some strong benchmarks, like Human3.6M. Therefore, in this project, we expand the scope of an existing 3D HPE scheme to be compatible with additional 2D and 3D HPE datasets, like Occlusion Person. We perform an extensive study on how aspects of 2D pre-training, such as model size, affect downstream performance, and to what extent pre-training can help the model generalize to different datasets. Experimental results show that 2D pre-training consistently outperforms training on 3D data alone, particularly in terms of computational efficiency. Finally, using MPII and Human3.6M, we are able to obtain an MPJPE score of under 64.5mm.

[CV-207] Lost in the Vibrations: Vision Language Models Fail the Dynamic Gauges Test

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在工业制造场景中对模拟仪表(analog gauges)进行高精度测量时存在的局限性问题,特别是其在解析高频时间序列事件和指针振动方面的不足。解决方案的关键在于构建一个新型数据集,包含多种类型仪表(圆形、线性及游标)在不同运动速度下的视频序列,并基于此对最新VLMs(如GPT-5和Gemini 3)进行严格评估,以检验其在计量学(metrology)和不确定度量化(uncertainty quantification)方面的适用性。结果表明,现有模型在理解指针轨迹与刻度语义方面表现有限,无法满足安全关键监测所需的可追溯性和可靠性要求。

链接: https://arxiv.org/abs/2604.22829
作者: Tairan Fu,Francisco Javier Santos-Martín,Javier Conde,Pedro Reviriego,Elena Merino-Gómez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The digital transformation of industrial manufacturing increasingly relies on the ability of autonomous robots to interact with legacy infrastructure, particularly analog gauges. While Vision-Language Models (VLMs) have demonstrated potential in zero-shot instrument recognition, their deployment in measurement systems remains constrained by an inherent inability to accurately analyze high-frequency temporal events and needle vibrations. This paper evaluates state-of-the-art models, including GPT-5 and Gemini 3, against the strict requirements of metrology and uncertainty quantification. To facilitate this evaluation, we introduce a novel dataset comprising video sequences of various gauge types: circular, linear, and Vernier, under diverse motion speed profiles. Our findings indicate that current VLMs exhibit limited ability in interpreting needle trajectories and scale semantics, failing to provide the traceability and reliability needed for safety-critical monitoring. The results demonstrate that these models have not yet achieved the performance necessary to be classified as trustworthy synthetic instruments under existing IEEE and ISO standards.

[CV-208] MetaEarth3D: Unlocking World-scale 3D Generation with Spatially Scalable Generative Modeling

【速读】:该论文旨在解决当前生成式 AI 在空间尺度上的局限性问题,即现有模型虽在语言和视觉理解上取得突破,但其生成能力局限于局部环境,难以捕捉地理环境在数千公里尺度下的演变规律及大尺度物理世界的结构特征,从而制约了地球观测与模拟中的超宽域空间智能发展。解决方案的关键在于提出一个新的“空间尺度”作为基础模型的扩展维度,并构建了首个具备全球尺度空间一致性生成能力的生成式基础模型 MetaEarth3D;该模型基于1000万张全球分布的真实遥感图像训练,能够生成涵盖大尺度地形、中尺度城市到细粒度街区的多层级、无界且多样化的三维场景,在视觉真实性和地理统计真实性方面均表现优异,为地球观测领域的下一代空间智能提供了新的数据生成引擎。

链接: https://arxiv.org/abs/2604.22828
作者: Jinqi Cao,Zhiping Yu,Baihong Lin,Chenyang Liu,Zhenwei Shi,Zhengxia Zou
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL

点击查看摘要

Abstract:Recent generative AI models have achieved remarkable breakthroughs in language and visual understanding. However, although these models can generate realistic visual content, their spatial scale remains confined to bounded environments, preventing them from capturing how geographic environments evolve across thousands of kilometers or from modeling the spatial structure of the large-scale physical world. This limitation poses a critical challenge for ultra-wide-area spatial intelligence in Earth observation and simulation, revealing a deeper gap in generative AI: progress has relied primarily on scaling model parameters and training data, while overlooking spatial scale as a core dimension of intelligence. Here, motivated by this missing dimension, we investigate spatial scale as a new scaling axis in foundation models and present MetaEarth3D, the first generative foundation model capable of spatially consistent generation at the planetary scale. Taking optical Earth observation simulation as a testbed, MetaEarth3D enables the generation of multi-level, unbounded, and diverse 3D scenes spanning large-scale terrains, medium-scale cities, and fine-grained street blocks. Built upon 10 million globally distributed real-world training images, MetaEarth3D demonstrates both strong visual realism and geospatial statistical realism. Beyond generation, MetaEarth3D serves as a generative data engine for diverse virtual environments in ultra-wide spatial intelligence. We argue that this study may help empower next-generation spatial intelligence for Earth observation.

[CV-209] DGHMesh: A Large-scale Dual-radar mmWave Dataset and Generalization-focused Benchmark for Human Mesh Reconstruction

【速读】:该论文旨在解决毫米波(mmWave)雷达在无接触、隐私保护且鲁棒的人体网格重建(Human Mesh Reconstruction, HMR)研究中缺乏统一基准(benchmark)以评估算法泛化性能的问题,尤其针对配置变化(如人体位置/朝向偏移、子阵列尺寸调整及跨被试场景)下的公平比较难题。解决方案的关键在于提出DGHMesh数据集与对应的通用性导向基准,该数据集包含15名受试者执行8种动作的36万帧同步数据,涵盖FMCW和SFCW双雷达模态、RGB图像及高精度3D人体网格标注,并提供原始I/Q数据与精确雷达空间标定信息;在此基础上,进一步设计了mmPTM——一种基于查询的多雷达融合框架,通过联合利用点云与成像管(imaging tubes)特征实现更准确的HMR,实验表明其在多种子基准上均展现出优异精度与强泛化能力,验证了多雷达融合策略的有效性与该数据集的实际价值。

链接: https://arxiv.org/abs/2604.22827
作者: Rongxiao Guo,Qingchao Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Millimeter-wave (mmWave) radar has shown great potential for contactless, privacy-preserving, and robust human sensing, yet existing mmWave-based human mesh reconstruction (HMR) studies are still limited by the lack of benchmarks for generalization analysis under configuration shifts and fair comparison of different algorithms. To address the limitation, we present DGHMesh, a large-scale dual-radar mmWave dataset and generalization-focused benchmark for HMR. It contains data from 15 subjects performing 8 actions, with 360,000 synchronized frames collected from FMCW radar, SFCW radar, RGB images, and high-precision 3D HMR annotations. In addition, the dataset provides synchronized raw I/Q data from both radar modalities and accurately calibrated radar spatial positions. The benchmark is designed to evaluate HMR methods under diverse measurement configurations, including human position shifts, human orientation shifts, subarray size variations, and cross-subject settings. Based on DGHMesh, we also propose mmPTM, a query-based multi-radar fusion framework that jointly exploits point clouds and imaging tubes for HMR. Extensive experiments are conducted against representative baselines under different settings. The results demonstrate that mmPTM consistently achieves outstanding accuracy and competitive generalization capability across multiple sub-benchmarks, validating the effectiveness of multi-radar fusion and the practical value of the proposed dataset and benchmark for mmWave-based HMR research. DGHMesh and mmPTM are publicly available at this https URL complete benchmark and code will be released after paper publication)

[CV-210] Shape: A Self-Supervised 3D Geometry Foundation Model for Industrial CAD Analysis

【速读】:该论文旨在解决工业计算机辅助设计(CAD)工作流中对鲁棒且可泛化的三维几何表示的需求,以同时保障精度与可解释性。其核心解决方案是提出Shape模型,该模型通过自监督预训练将表面网格转换为密集的token级嵌入;关键创新包括:结构化的3D潜在网格、多尺度几何感知分词器(MAGNO,Multi-scale Geometry-aware Tokenizer)及其交叉注意力机制,以及采用分组查询注意力和RMSNorm的Transformer处理器;此外,通过学习的重建先验实现区域级别的归因分析,从而支持可解释预测。预训练目标结合掩码token重建与多分辨率对比一致性,显著提升了模型在未见数据上的泛化能力。

链接: https://arxiv.org/abs/2604.22826
作者: Bayangmbe Mounmo,Sam Chien,Mile Mitrovic
机构: SIMD AI; SB AI Lab
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 19 pages, 2 figures

点击查看摘要

Abstract:Industrial CAD workflows require robust, generalizable 3D geometric representations supporting accuracy and explainability. We introduce Shape, a self-supervised foundation model converting surface meshes into dense per-token embeddings. Shape combines a structured 3D latent grid, a multi-scale geometry-aware tokenizer (MAGNO) with cross-attention, and a transformer processor using grouped-query attention and RMSNorm. A learned reconstruction prior enables per-region attribution for explainable predictions. Pretraining uses masked-token reconstruction of normalized geometry statistics and multi-resolution contrastive consistency. The 10.9M-parameter backbone is pretrained on 61,052 CAD meshes from Thingi10K, MFCAD, and Fusion360. On a held-out split of 2,983 meshes, Shape achieves reconstruction R2 = 0.729 and 98.1% top-1 retrieval under the Wang-Isola protocol, with near-zero reconstruction train/val gap (contrastive scores use a larger evaluation pool). A 2x2 ablation on loss type and target-space normalization shows per-dimension normalization is critical: without it, performance collapses (R2 0.14, top-1 88%); with it, both losses succeed (R2 0.70, top-1 96%). Smooth-L1 offers secondary stability. Code, embeddings, and an interactive demo are released at this https URL.

[CV-211] SGP-SAM: Self-Gated Prompting for Transferring 3D Segment Anything Models to Lesion Segmentation

【速读】:该论文旨在解决将3D分割基础模型(如Segment Anything Model, SAM)直接迁移至病灶分割任务时面临的两大挑战:一是中间特征中对小而不规则目标的空间表征能力较弱,二是3D场景下前景-背景类别极度不平衡的问题。其解决方案的关键在于提出一种自门控提示框架(Self-Gated Prompting Framework, SGP-SAM),核心组件为自门控提示模块(Self-Gated Prompting Module, SGPM),该模块通过轻量级多通道门控单元动态判断当前特征是否需要多尺度融合,并仅在必要时激活多尺度特征融合块以增强空间上下文信息;此外,为提升小病灶的学习效果,设计了Zoom Loss,结合Dice损失与体素平衡的焦点损失,强化病灶区域的监督信号。实验表明,该方法在MSD Liver Tumor和MSD Brain Tumor数据集上显著优于基于SAM-Med3D的强基线模型,尤其在Liver Tumor上mDice指标提升达7.3%。

链接: https://arxiv.org/abs/2604.22825
作者: Zixuan Tang,Shen Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large segmentation foundation models such as the Segment Anything Model (SAM) have reshaped promptable segmentation in natural images, and recent efforts have extended these models to medical images and volumetric settings. However, directly transferring a 3D SAM-style model to lesion segmentation remains challenging due to (i) weak spatial representational capacity for small, irregular targets in intermediate features, and (ii) extreme foreground-background imbalance in 3D this http URL propose SGP-SAM, a self-gated prompting framework for efficient and effective transfer to 3D lesion segmentation. Our key component, the Self-Gated Prompting Module (SGPM), performs conditional multi-scale spatial enhancement: a lightweight multi-channel gating unit predicts whether the current features require additional multi-scale fusion, and only then activates a Multi-Scale Feature Fusion Block to enrich spatial context. To further address small-lesion learning, we design a Zoom Loss that up-weights lesion-focused supervision by combining Dice and a voxel-balanced focal this http URL on MSD Liver Tumor and MSD Brain Tumor (enhancing tumor) show consistent gains over strong transfer baselines based on SAM-Med3D. On MSD Liver Tumor, SGP-SAM improves mDice by 7.3% over fine-tuning.

[CV-212] WeatherSeg: Weather-Robust Image Segmentation using Teacher-Student Dual Learning and Classifier-Updating Attention

【速读】:该论文旨在解决自动驾驶环境中恶劣天气条件下语义分割性能下降的问题,同时降低标注成本。其核心解决方案是提出WeatherSeg框架,关键创新在于引入双教师-学生权重共享模型(Dual Teacher-Student Weight-Sharing Model, DTSWSM)实现从受天气影响图像中的知识蒸馏,并结合分类器权重更新注意力机制(Classifier Weight Updating Attention Mechanism, CWUAM)根据环境属性动态调整分类器权重,从而提升模型在多种天气条件下的准确性和鲁棒性。

链接: https://arxiv.org/abs/2604.22824
作者: Zhang Zhang,Yifeng Zeng,Jinquan Pan,Yinghui Pan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:WeatherSeg, an advanced semi-supervised segmentation framework, addresses autonomous driving’s environmental perception challenges in adverse weather while reducing annotation costs. This framework integrates a Dual Teacher-Student Weight-Sharing Model (DTSWSM) that enables knowledge distillation from weather-affected images, and a Classifier Weight Updating Attention Mechanism (CWUAM) that dynamically adjusts classifier weights based on environmental attributes. Comprehensive evaluations demonstrate that WeatherSeg significantly outperforms baseline models in both accuracy and robustness across various weather conditions, including clear, rainy, cloudy, and foggy scenarios, establishing it as an effective solution for all-weather semantic segmentation in autonomous driving and related applications.

[CV-213] PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在预训练阶段如何有效融合来自异构数据源的跨模态对齐能力的问题。现有模型合并方法主要关注微调后的场景,忽略了预训练阶段中不同数据分布带来的跨模态对齐互补性。为应对这一挑战,作者提出“后对齐合并”(post-alignment merging)任务,其核心是整合不同预训练模型中学习到的跨模态对齐特性。解决方案的关键在于提出 PivotMerge 框架,包含两个核心组件:一是共享空间分解与过滤(Shared-space Decomposition and Filtering),用于分离跨域共性对齐模式并抑制冲突方向;二是对齐引导的层间合并(Alignment-guided Layer-wise Merging),根据各层对跨模态对齐的贡献差异动态分配合并权重。该方法显著缓解了跨域参数干扰和层间对齐贡献不均的问题,实验证明其在多个多模态基准上优于现有基线。

链接: https://arxiv.org/abs/2604.22823
作者: Zibo Shao,Baochen Xiong,Xiaoshan Yang,Yaguang Song,Qimeng Zhang,Haifeng Chen,Changsheng Xu
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; Pengcheng Laboratory, Shenzhen 518066, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; Data Science and Artificial Intelligence Research Institute, China United Network Communications Group Co., Ltd., Beijing 100033, China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) rely on multimodal pre-training over diverse data sources, where different datasets often induce complementary cross-modal alignment capabilities. Model merging provides a cost-effective mechanism for integrating multiple expert MLLMs with complementary strengths into a unified model. However, existing model merging research mainly focuses on post-finetuning scenarios, leaving the pre-training stage largely unexplored. We argue that the core of MLLM pre-training lies in establishing effective cross-modal alignment, which bridges visual and textual representations into a unified semantic space. Motivated by this insight, we introduce the post-alignment merging task, which aims to integrate cross-modal alignment capabilities learned from heterogeneous multimodal pre-training. This setting introduces two key challenges: cross-domain parameter interference, where parameter updates learned from different data distributions conflict during merging, and layer-wise alignment contribution disparity, where different layers and projectors contribute unevenly to cross-modal alignment. To address them, we propose \textbfPivotMerge, a post-alignment merging framework for cross-modal projectors. PivotMerge incorporates two key components: Shared-space Decomposition and Filtering, which disentangles shared alignment patterns from domain-specific variations and suppresses conflicting directions, and Alignment-guided Layer-wise Merging, which assigns layer-specific merging weights based on differing alignment contributions. We construct systematic CC12M-based post-alignment merging scenarios for evaluation. Extensive experiments on multiple multimodal benchmarks show that PivotMerge consistently outperforms existing baselines, demonstrating its effectiveness and generalization ability.

[CV-214] DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models

【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在物体存在性验证任务中普遍存在但机制不明的“对象级幻觉”(object level hallucination)问题,尤其关注错误来源是感知局限还是上下文文本先验(textual priors)干扰。其解决方案的关键在于提出一个受控的诊断基准DO-Bench,通过结构化的多模态干预手段,将错误归因于两个互补维度:一是“先验覆盖”维度(Prior Override),逐步增强文本先验而固定视觉证据以评估模型对先验压力的抵抗能力;二是“感知受限”维度(Perception-Limited),从全场景图像逐步过渡到局部目标裁剪以测量模型的感知锚定强度。该设计实现了对错误成因的精细拆解,并引入PriorRobust和PerceptionAbility两个诊断指标量化模型行为,揭示了不同VLMs在先验敏感性和感知可靠性上的系统性差异,从而超越传统仅依赖聚合准确率的评估范式。

链接: https://arxiv.org/abs/2604.22822
作者: JiYang Wang,Jiawei Chen,Mengqi Xiao,Yu Cheng,Yangfu Li,Zhaoxia Yin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Object level hallucination remains a central reliability challenge for vision language models (VLMs), particularly in binary object existence verification. Existing benchmarks emphasize aggregate accuracy but rarely disentangle whether errors stem from perceptual limitations or from the influence of contextual textual priors, leaving underlying failure mechanisms ambiguous. We introduce DO-Bench, a controlled diagnostic benchmark that isolates these sources through structured multimodal interventions. Rather than evaluating models in unconstrained settings, DO-Bench probes two complementary dimensions: the Prior Override dimension progressively strengthens contextual textual priors while holding visual evidence constant to assess resistance to prior pressure, and the Perception-Limited dimension incrementally enhances visual evidence from full-scene context to localized object crops to measure perceptual grounding strength. This paired design enables attribution of errors to prior suppression, perceptual insufficiency, or their interaction. We further define two diagnostic metrics, PriorRobust and PerceptionAbility, to quantify these behaviors consistently. Evaluations across diverse open- and closed-source VLMs reveal systematic differences in prior sensitivity and perceptual reliability, demonstrating that object hallucination reflects heterogeneous, mechanism dependent failure patterns beyond aggregate accuracy.

[CV-215] FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers

【速读】:该论文旨在解决长序列视频扩散变换器(video diffusion transformers)中自注意力机制带来的二次方复杂度问题,该问题在处理超长token序列时显著增加运行时间和内存占用。解决方案的关键在于提出一种频域感知的异构注意力框架——FreqFormer,其核心思想是利用视频特征的频谱结构:低频分量承载全局布局和粗粒度运动,高频分量包含纹理与细节信息。通过将token特征按频带分割,并分别采用不同注意力操作(低频使用压缩后的密集全局注意力、中频采用结构化块稀疏注意力、高频使用滑动窗口局部注意力),并引入轻量级频谱路由网络根据层统计信息和扩散时间步动态分配注意力头,实现计算资源向早期全局结构、后期细节的自适应倾斜;同时结合融合GPU执行计划以协同调度各类注意力分支,减少内核启动次数和内存访问流量,从而在保持硬件友好模式的前提下大幅降低注意力FLOPs和KV相关内存流量。

链接: https://arxiv.org/abs/2604.22808
作者: Haopeng Jin
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 24 pages, 17 figures, 14 tables, Technical Report

点击查看摘要

Abstract:Long-sequence video diffusion transformers hit a quadratic self-attention cost that dominates runtime and memory for very long token sequences. Most efficient attention methods use one approximation everywhere, yet video features are spectrally structured: low frequencies carry global layout and coarse motion; high frequencies carry texture and fine detail. We present FreqFormer, a frequency-aware heterogeneous attention framework. Token features are split into spectral bands with different operators: dense global attention on compressed low-frequency content, structured block-sparse attention on mid frequencies, and sliding-window local attention on high frequencies. A lightweight spectral routing network allocates heads across bands using layer statistics and the diffusion timestep, shifting compute toward global structure early in denoising and detail later. Cross-band summary tokens provide cheap residual exchange. FreqFormer is paired with a fused GPU execution plan that co-schedules dense, sparse, and local branches to cut kernel launches and memory traffic. We give a consistent complexity model, an orthonormal-decomposition view of approximation, and simulation-based systems numbers (throughput, arithmetic intensity, memory traffic, duration scaling). In simulations from 64K to 1M tokens, FreqFormer substantially reduces estimated attention FLOPs and KV-related memory traffic versus dense attention while keeping a hardware-friendly pattern, supporting spectrally structured heterogeneous attention as a practical direction for long-video diffusion transformers.

[CV-216] See No Evil: Semantic Context-Aware Privacy Risk Detection for AR ICASSP

【速读】:该论文旨在解决增强现实(Augmented Reality, AR)系统中因持续采集视觉数据而带来的独特隐私风险问题,尤其是现有AR隐私框架缺乏对视觉内容的语义理解,难以有效识别依赖上下文的隐私风险。其解决方案的关键在于提出PrivAR,该方案利用视觉语言模型(Vision Language Models, VLMs)结合思维链(chain-of-thought)提示技术,实现对AR环境中潜在敏感信息的上下文感知检测;通过视觉场景线索推断可能存在的敏感信息类型(如办公环境中识别密码便签),并在此基础上对文本内容进行掩蔽处理,在防止敏感信息泄露的同时保留VLM推理所需的上下文线索,从而显著提升隐私保护效果与用户隐私意识水平。

链接: https://arxiv.org/abs/2604.22805
作者: Jialu Liu,Yao Li,Zhuoheng Li,Huining Li,Ying Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2026

点击查看摘要

Abstract:Augmented reality (AR) systems pose unique privacy risks due to their continuous capture of visual data. Existing AR privacy frameworks lack semantic understanding of visual content, limiting their effectiveness in detecting context-dependent privacy risks. We propose PrivAR, which leverages vision language models (VLMs) with chain-of-thought prompting for contextual privacy risk detection in AR environments. PrivAR uses visual scene cues to infer potential sensitive information types, such as identifying password notes in office environments through contextual reasoning. PrivAR detects and obfuscates textual content, preventing exposure of sensitive information while preserving contextual cues necessary for VLM inference. Additionally, we investigate contextually-informed warning interfaces to enhance user privacy awareness. Experiments on a real-world AR dataset show that PrivAR achieves superior accuracy (81.48%) and F1-score (84.62%) compared to baselines, while reducing privacy leakage rate to 17.58%. User studies evaluating contextually-informed warning interfaces provide insights into effective privacy-aware AR design.

[CV-217] When VLMs Fix Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR

【速读】:该论文旨在解决当前教育人工智能系统中手写数学表达式识别(Handwritten Mathematics Optical Character Recognition, HM-OCR)评估存在的核心缺陷:现有基准测试主要聚焦于单行表达式,并依赖词法指标(如BLEU)进行评价,无法有效衡量多行学生解题过程中的语义推理能力。更重要的是,主流视觉-语言模型(Vision-Language Models, VLMs)在处理手写数学时普遍存在“过度修正”(over-correction)问题——即模型倾向于自动修正学生的原始笔迹错误,从而掩盖了教学评估本应识别的关键错误。为此,作者提出PINK(Penalized INK-based score)这一新型语义评估指标,其关键在于利用大语言模型(Large Language Model, LLM)基于评分标准(rubric-based grading)对转录结果进行语义判别,并显式惩罚过度修正行为,从而更真实反映模型对学生产出内容的忠实度。实证表明,PINK相较于BLEU能显著提升与人工专家判断的一致性,为教育场景下的HM-OCR提供了一个更为可靠和可解释的评估框架。

链接: https://arxiv.org/abs/2604.22774
作者: Jin Seong,Wencke Liermann,Minho Kim,Jong-hun Shin,Soojong Lim
机构: Electronics and Telecommunications Research Institute, Republic of Korea
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expressions and rely on lexical metrics such as BLEU, which fail to assess the semantic reasoning across multi-line student solutions. In this paper, we present the first systematic study of multi-line handwritten math Optical Character Recognition (OCR), revealing a critical failure mode of Vision-Language Models (VLMs): over-correction. Instead of faithfully transcribing a student’s work, these models often “fix” errors, thereby hiding the very mistakes an educational assessment aims to detect. To address this, we propose PINK (Penalized INK-based score), a semantic evaluation metric that leverages a Large Language Model (LLM) for rubric-based grading and explicitly penalizes over-correction. Our comprehensive evaluation of 15 state-of-the-art VLMs on the FERMAT dataset reveals substantial ranking reversals compared to BLEU: models like GPT-4o are heavily penalized for aggressive over-correction, whereas Gemini 2.5 Flash emerges as the most faithful transcriber. Furthermore, human expert studies show that PINK aligns significantly better with human judgment (55.0% preference over BLEU’s 39.5%), providing a more reliable evaluation framework for handwritten math OCR in educational settings.

[CV-218] Semantic Segmentation for Histopathology using Learned Regularization based on Global Proportions

【速读】:该论文旨在解决病理图像中仅依赖全局标签比例(label proportions)进行像素级分割的问题,即在缺乏像素级标注的情况下,如何从组织类型的空间分布和比例信息中推断出稠密的分割结果。这一任务本质上是欠定的,因为存在多种空间上不同的分割方案可以满足相同的全局比例约束。解决方案的关键在于提出一种两阶段变分框架——Variational Segmentation from Label Proportions (VSLP),首先利用预训练的Transformer模型结合测试时增强(test-time augmentation)生成像素级置信度估计,随后通过求解一个包含Wasserstein数据保真项与学习到的正则化项的变分优化问题来融合这些估计,从而获得可解释且性能优越的分割结果。该方法无需任何像素级标注,同时具备可视化能量平衡的能力,显著优于现有的弱监督与无监督方法。

链接: https://arxiv.org/abs/2604.24347
作者: Yangping Li,Thomas Pinetz,Michael Hölzel,Marieta Toma,Alexander Effland
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In pathology, the spatial distribution and proportions of tissue types are key indicators of disease progression, and are more readily available than fine-grained annotations. However, these assessments are rarely mapped to pixel-wise segmentation. The task is fundamentally underdetermined, as many spatially distinct segmentations can satisfy the same global proportions in the absence of pixel-wise constraints. To address this, we introduce Variational Segmentation from Label Proportions (VSLP), a two-stage framework that infers dense segmentations from global label proportions, without any pixel-level annotations. This framework first leverages a pre-trained transformer model with test-time augmentation to produce a pixel-wise confidence estimate. In the second stage, these estimates are fused by solving a variational optimization problem that incorporates a Wasserstein data fidelity term alongside a learned regularizer. Unlike end-to-end networks, our variational method can visualize the fidelity-regularization energy, resulting in more interpretable segmentation. We validate our approach on two public datasets, achieving superior performance over existing weakly supervised and unsupervised methods. For one of these datasets, proportions have been estimated by an experienced pathologist to provide a realistic benchmark to the community. Furthermore, the method scales to an in-house dataset with noisy pathologist labels, severely outperforming state-of-the-art methods, thereby demonstrating practical applicability. The code and data will be made publicly available upon acceptance at this https URL.

[CV-219] Deep Learning-Enabled Dissolved Oxygen Sensing in Biofouling Environments for Ocean Monitoring

【速读】:该论文旨在解决海洋环境中溶解氧(Dissolved Oxygen, DO)浓度长期监测中因生物污损(biofouling)导致的信号漂移问题,从而实现低成本、高鲁棒性的实时环境感知。其核心解决方案是提出一种融合基于相机的DO传感系统与视觉Transformer(Vision Transformer, ViT)驱动的物理信息神经网络(Physics-Informed Neural Network, PINN)的新范式;该方法通过将Stern-Volmer(SV)物理模型嵌入损失函数,显著提升了在生物污损条件下的传感精度,相较传统统计和机器学习方法分别降低92%和89%的平均绝对误差(MAE),并借助深度集成(deep ensemble)量化预测不确定性,实现了自诊断式传感能力。

链接: https://arxiv.org/abs/2604.24236
作者: Nikolaos Salaris,Adrien Desjardins,Manish K. Tiwari
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:The escalating climate crisis and ecosystem degradation demand intelligent, low-cost sensors capable of robust, long-term monitoring in real-world environments. Absolute dissolved oxygen (DO) concentration is a key parameter for predicting climate tipping points. Inexpensive optoelectronic sensors based on microstructured polymer films doped with phosphorescent dyes could be readily deployable; however, signal drift and marine biofouling remain major challenges. Here, we introduce a sensing paradigm that combines camera-based DO sensors with a visual transformer (ViT)-based physics-informed neural network (PINN) for high-fidelity sensing under biofouling conditions. Training and testing data were obtained from an algae-laden water tank over 14 days to capture accelerated biofouling. The ViT-PINN, which embeds the Stern-Volmer (SV) equation into the loss function, reduces mean average error (MAE) by 92% and 89% compared to classical statistical and ML approaches, achieving ~2 umol/L absolute error. A deep ensemble further quantifies predictive uncertainty, enabling self-diagnostic sensing.

[CV-220] Shared-kernel Wavelet Neural Networks for Poisson Image Reconstruction

【速读】:该论文旨在解决图像表示与重建中的效率与精度问题,尤其针对传统方法在图像压缩、低光增强等应用中难以兼顾计算复杂度与重建质量的挑战。解决方案的关键在于利用图像的拉普拉斯场(Laplacian field)稀疏性及其满足稳定分布的特性,将图像以稀疏拉普拉斯场的形式进行高效编码,并通过求解泊松方程(Poisson equation)实现高精度重构。文中提出一种共享核小波神经网络(shared-kernel wavelet neural network),该网络直接求解泊松方程,具有参数量少(< 0.0002M)、线性计算复杂度和更高重建精度三大优势,从而实现了实时且高质量的图像重建。

链接: https://arxiv.org/abs/2604.24000
作者: Yuanhao Gong,Tan Tang,Qianyan Liu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Applications (stat.AP)
备注:

点击查看摘要

Abstract:The Laplacian operator transforms the image into its Laplacian field, which usually is sparse and satisfies a stable distribution. On the other hand, an image can be uniquely reconstructed from its Laplacian field via solving a Poisson equation with a proper boundary condition. Such uniqueness is mathematically guaranteed. Thanks to these properties, we propose to use the sparse Laplacian field to present the image. We first show that the Laplacian field is sparse and satisfies a stable distribution on hundreds images. Then, we show that the image can be accurately reconstruct from its Laplacian field. For the reconstruction task, we propose a shared-kernel wavelet neural network, which solves the Poisson equation and has three advantages. First, it has less than \bf 0.0002M parameters, which is compact enough for most of devices. Second, it has linear computation complexity, leading to a real-time reconstruction. Third, it achieves higher accuracy than previous methods. Several numerical experiments are conducted to show the effectiveness and efficiency of the sparse Laplacian field and the proposed Poisson solver. The proposed method can be applied in a large range of applications such as image compression, low light enhancement, object tracking, etc.

[CV-221] GS-DOT: Gaussian splatting-based image reconstruction for diffuse optical tomography

【速读】:该论文旨在解决扩散光学断层成像(Diffuse Optical Tomography, DOT)中图像重建的精度与计算效率问题,尤其是在高散射介质中准确建模光传输过程的挑战。其解决方案的关键在于引入高斯点绘(Gaussian Splatting, GS)框架,将吸收系数表示为一组可优化的各向异性高斯基元的稀疏叠加,并通过解析梯度和Adam优化算法拟合测量的时间分辨点扩散函数(time-resolved point-spread functions)。这是首次将GS算法应用于光扩散区域,用扩散函数替代传统射线传输函数,从而实现对复杂散射介质中光传播的高保真建模,显著提升了重建结果的空间定位精度与定量准确性,同时大幅降低内存消耗并增强抗噪能力。

链接: https://arxiv.org/abs/2604.23675
作者: Jingjing Jiang
机构: University Hospital Zurich and University of Zurich (苏黎世大学医院和苏黎世大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:

点击查看摘要

Abstract:This work presents GS-DOT, a novel image reconstruction framework based on Gaussian Splatting (GS) for diffuse optical tomography (DOT). Inspired by GS for rendering applications, absorption coefficients are represented as a sparse sum of anisotropic Gaussian primitives optimized to fit measured time-resolved point-spread functions through analytic gradients and Adam optimization. This is the first adaptation of GS algorithms in the photon diffusion regime, where the ray transport function is replaced by the diffusion functions to enable accurate modeling of light transport in highly scattering media. Validation on synthetic tissue models demonstrate high accuracy in localization and quantification of reconstructed absorption maps for both clean and noisy signals. GS-DOT has demonstrated high robustness to noise and showed a huge reduction in memory demand.

[CV-222] Physics-Informed Temporal U-Net for High-Fidelity Fluid Interpolation

【速读】:该论文旨在解决从稀疏时间观测中重建高保真流体动力学的问题,这一问题因流体传输的混沌性和非线性特征而极具挑战性。标准深度学习插值方法常趋向于回归均值,导致空间模糊和时间闪烁,尤其在观测锚点帧附近出现不连续的过渡现象。解决方案的关键在于提出一种新颖的Temporal U-Net架构,其核心创新包括:引入基于VGG的感知损失(perceptual loss)以增强纹理细节保留,以及设计物理信息桥梁(Physics-Informed Bridge)来约束物理一致性;同时通过时间加权特征融合与由t(1 - t)定义的抛物线边界条件,确保时间过渡平滑且端点完全一致。实验表明,该方法在结构保真度和纹理保持方面显著优于传统模型,MAE降低至0.015(对比L1基线的0.085),并能有效保留高频湍流细节。

链接: https://arxiv.org/abs/2604.23372
作者: Eshwar R. A.,Nevin Mathew Thomas,Nehal G,Farida M. Begam
机构: PES University - Electronic City Campus (PES大学-电子城校区)
类目: Fluid Dynamics (physics.flu-dyn); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD); Data Analysis, Statistics and Probability (physics.data-an)
备注: 12 pages, 6 figures, 1 table

点击查看摘要

Abstract:Reconstructing high-fidelity fluid dynamics from sparse temporal observations is quite challenging, mainly due to the chaotic and non-linear nature of fluid transport. Standard deep learning-based interpolation methods often tend to regress to the mean, which results in spatial blurring and temporal strobing, especially noticeable around the observed anchor frames where transitions become discontinuous. In this work, we propose a novel Temporal U-Net architecture that integrates a VGG-based perceptual loss along with a Physics-Informed Bridge to overcome these issues. By introducing time-weighted feature blending and enforcing a parabolic boundary condition defined by t(1 - t), the model ensures smooth transitions while also maintaining perfect consistency at the endpoints. Experimental results on multi-channel RGB fluid data show that our method clearly outperforms standard models, both in terms of structural fidelity and texture preservation. In particular, the model achieves a Mean Absolute Error of 0.015, compared to 0.085 for a standard L1 baseline. Further Spatial Power Spectral Density (PSD) analysis reveals that the model is able to retain high-frequency turbulent details that are usually lost in deterministic reconstructions.

[CV-223] CT-Guided Spatially-varying Regularization for Voxel-Wise Deformable Whole-Body PET Registration

【速读】:该论文旨在解决全身体积正电子发射断层成像(PET)配准中因解剖异质性导致的挑战,即在刚性结构(如骨骼)与软组织之间需采用不同强度的约束以避免不合理的形变。其解决方案的关键在于提出一种基于CT引导的空间自适应正则化策略:利用PET/CT扫描中配对的CT图像构建体素级正则化权重图,替代传统的单一全局正则化系数,从而实现对刚性和软组织区域的自适应调节,提升跨示踪剂PET图像的配准精度和器官层面的对齐效果。

链接: https://arxiv.org/abs/2604.22905
作者: Xiangcen Wu,Ruohua Chen,Sichun Li,Qianye Yang,Sheng Liu,Jianjun Liu,Zhaoheng Xie
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Whole-body Positron Emission Tomography (PET) registration is essential for multi-parametric tumor characterization and assessment of metastatic disease progression. In deep learning-based deformable registration, the dense displacement field (DDF) regularizer is crucial for stabilizing optimization and preventing unrealistic deformations in large 3D volumes. A key challenge in whole-body deformable registration is anatomical heterogeneity, rigid structures (e.g., bones) should undergo stronger regularization, whereas soft tissues require more flexible deformation and weaker constraints. In this work, we propose a simple yet effective CT-guided spatially-varying regularization strategy for whole-body cross-tracer deformable PET registration. The key idea is to use the paired CT volume from the PET/CT acquisition to construct a voxel-wise regularization map for the DDF, replacing the conventional single global regularization weight. This yields anatomy-adaptive regularization strength across rigid and soft tissues. The proposed method is evaluated on a real clinical cross-tracer PET/CT dataset of 296 patients involving 18F-PSMA and 18F-FDG, showing that the proposed method achieves statistically significant improvements over weakly-supervised registration baseline in both whole-body registration performance and organ-wise alignment.

[CV-224] riple-Phase Sequential Fusion Network for Hepatobiliary Phase Liver MRI Synthesis

【速读】:该论文旨在解决肝细胞癌(Hepatocellular Carcinoma, HCC)影像学诊断中,因获取肝胆期(Hepatobiliary Phase, HBP)图像需延长造影后延迟时间而导致的临床流程效率低下及运动伪影风险增加的问题。其核心解决方案是提出一种三相序贯融合网络(Triple-Phase Sequential Fusion Network, TriPF-Net),通过利用造影前的T1加权成像(T1-weighted imaging)作为基础,并在动脉相(Arterial Phase, AP)和门静脉相(Venous Phase, VP)可用时自适应融合其特征信息,建模组织特异性对比剂摄取与排泄动态过程,从而实现对HBP图像的高保真合成。该方法在AP或VP序列缺失情况下仍能保持生理一致性,关键创新在于引入增强区域引导编码器和动态特征统一模块,并结合区域引导序贯融合损失函数以及临床变量(如年龄、性别、总胆红素和白蛋白)以提升合成图像的生理合理性与泛化能力。

链接: https://arxiv.org/abs/2604.22904
作者: Qiuli Wang,Xinhuan Sun,Fengxi Chen,Yongxu Liu,Jie Cheng,Lin Chen,Jiafei Chen,Yue Zhang,Xiaoming Li,Wei Chen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 figures, 7 tables

点击查看摘要

Abstract:Gadoxetate disodium-enhanced MRI is essential for the detection and characterization of hepatocellular carcinoma. However, acquisition of the hepatobiliary phase (HBP) requires a prolonged post-contrast delay, which reduces workflow efficiency and increases the risk of motion artifacts. In this study, we propose a Triple-Phase Sequential Fusion Network (TriPF-Net) to synthesize HBP images by leveraging the sequential information from pre-HBP sequences: while T1-weighted imaging serves as the indispensable baseline, the model adaptively integrates arterial-phase (AP) and venous-phase (VP) features when available. By modeling the tissue-specific contrast uptake and excretion dynamics across these three phases, TriPF-Net ensures robust HBP synthesis even under the stochastic absence of one or both dynamic contrast-enhanced sequences. The framework comprises an Enhanced Region-Guided Encoder and a Dynamic Feature Unification Module, optimized with a Region-Guided Sequential Fusion Loss to maintain physiological consistency. In addition, clinical variables, including age, sex, total bilirubin, and albumin, are incorporated to enhance physiological consistency. Compared with conventional methods, TriPF-Net achieved superior performance on datasets from two centers. On the internal dataset, the model achieved an MAE of 10.65, a PSNR of 23.27, and an SSIM of 0.76. On the external validation dataset, the corresponding values were 12.41, 23.11, and 0.78, respectively. This flexible solution enhances clinical workflow and lesion depiction, potentially eliminating the need for delayed HBP acquisition in HCC imaging.

[CV-225] Generalizable CT-Free PET Attenuation and Scatter Correction for Pediatric Patients

【速读】:该论文旨在解决儿科正电子发射断层扫描(PET)中因传统衰减与散射校正依赖于X射线计算机断层成像(CT)而带来的额外辐射暴露问题,同时克服现有无CT方法在跨设备或放射性示踪剂变化时性能下降的局限性。其解决方案的关键在于提出通用PET校正网络(Generalizable PET Correction Network, GPCN),该网络通过双域建模实现域鲁棒性:一方面采用多带上下文精修模块(multi-band contextual refinement module),利用小波变换进行多尺度分解和长程空间上下文建模以捕捉儿童解剖变异;另一方面引入频域感知谱解耦模块(frequency-aware spectral decoupling module),在傅里叶域中基于坐标条件对幅度和相位进行精细化调整,从而显式分离不变的拓扑结构与域相关的噪声成分,确保在不同扫描仪和示踪剂组合下仍能实现器官和病灶的精确定量恢复。

链接: https://arxiv.org/abs/2604.22894
作者: Jia-Mian Wu,Jun Liu,Siqi Li,Xiaoya Wang,Shibai Yin,Huanyu Luo,Lingling Zheng,Qiang Gao,Jigang Yang,Tai-Xiang Jiang
机构: Southwestern University of Finance and Economics (西南财经大学); Beijing Friendship Hospital, Capital Medical University (首都医科大学附属北京友谊医院); Beijing Children’s Hospital, Capital Medical University (首都医科大学附属北京儿童医院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 15 figures, 7 tables. Source code available at this https URL

点击查看摘要

Abstract:Computed tomography (CT)-based attenuation and scatter correction improves quantitative PET but adds radiation exposure that is particularly undesirable in pediatric imaging. Existing CT-free methods are commonly trained in homogeneous settings and often degrade under scanner or radiotracer shifts, which limits their clinical utility. We propose the Generalizable PET Correction Network (GPCN), a dual-domain network for domain-robust CT-free PET attenuation and scatter correction. GPCN combines a multi-band contextual refinement module, which models pediatric anatomical variability through wavelet-based multiscale decomposition and long-range spatial context modeling, with a frequency-aware spectral decoupling module, which performs coordinate-conditioned amplitude/phase refinement in the Fourier domain. By synergizing multi-band spatial contextual modeling with asymmetric frequency-spectrum decoupling, the network explicitly separates invariant topological structures from domain-specific noise, thereby achieving precise quantitative recovery of both anatomical organs and focal lesions. This design aims to separate anatomy-dominant structures from domain-sensitive spectral residuals and to improve robustness across heterogeneous imaging conditions. We train and evaluate the method on 1085 pediatric whole-body PET scans acquired with two scanners and five radiotracers. In both joint training and zero-shot cross-domain evaluation, GPCN outperforms representative baselines and maintains stable quantitative accuracy on unseen scanner-tracer combinations. The method is further supported by ablation, region-wise quantitative analysis, and downstream segmentation experiments. In our cohort, the CT component of the conventional protocol corresponded to an average effective dose of 10.8 mSv, indicating the potential clinical value of reliable CT-free correction for pediatric PET.

人工智能

[AI-0] Learning to Think from Multiple Thinkers

【速读】:该论文旨在解决在链式思维(Chain-of-Thought, CoT)监督下,从多个不同思考者(thinkers)获取的正确但可能系统性不同的推理路径中高效学习的问题。研究聚焦于一类在单个思考者CoT监督下可计算上易学、但在仅依赖最终结果监督(无CoT)时难以学习的类别。其核心挑战在于:当存在少量不同思考者的CoT数据时,如何设计算法实现高效学习。解决方案的关键在于提出一种通用的计算高效的主动学习算法,该算法仅需每个思考者少量独立于目标精度 ε\varepsilon 的CoT样本,且所需思考者数量仅随 log1εloglog1ε\log \frac{1}{\varepsilon} \log \log \frac{1}{\varepsilon} 增长,同时利用足够多的被动收集的端到端结果数据(规模为 1εpolylog1ε\frac{1}{\varepsilon} \cdot \text{poly}\log\frac{1}{\varepsilon}),从而在理论上保证了对这类问题的可学习性。

链接: https://arxiv.org/abs/2604.24737
作者: Nirmit Joshi,Roey Magen,Nathan Srebro,Nikolaos Tsilivis,Gal Vardi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (stat.ML)
备注: Comments are welcome. There are 78 pages and 5 Figures

点击查看摘要

Abstract:We study learning with Chain-of-Thought (CoT) supervision from multiple thinkers, all of whom provide correct but possibly systematically different solutions, e.g., step-by-step solutions to math problems written by different thinkers, or step-by-step execution traces of different programs solving the same problem. We consider classes that are computationally easy to learn using CoT supervision from a single thinker, but hard to learn with only end-result supervision, i.e., without CoT (Joshi et al. 2025). We establish that, under cryptographic assumptions, learning can be hard from CoT supervision provided by two or a few different thinkers, in passive data-collection settings. On the other hand, we provide a generic computationally efficient active learning algorithm that learns with a small amount of CoT data per thinker that is completely independent of the target accuracy \varepsilon , a moderate number of thinkers that scales as \log \frac1\varepsilon\log \log \frac1\varepsilon , and sufficient passive end-result data that scales as \frac1\varepsilon\cdot poly\log\frac1\varepsilon . Comments: Comments are welcome. There are 78 pages and 5 Figures Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (stat.ML) Cite as: arXiv:2604.24737 [cs.LG] (or arXiv:2604.24737v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.24737 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-1] Learning to Rotate: Temporal and Semantic Rotary Encoding for Sequential Modeling

【速读】:该论文旨在解决Transformer架构中旋转位置编码(Rotary Positional Embeddings, RoPE)长期被视作固定、手工设计结构的问题,即忽视了其作为注意力机制中一个潜在可学习的第二维度表达空间的价值。传统RoPE仅使用离散序数索引填充旋转流形(rotation manifold),限制了其动态建模能力。解决方案的关键在于提出SIREN-RoPE方法,将旋转空间从固定结构转变为可学习且信号条件化的空间,通过双分支正弦表示网络(Sinusoidal Representation Network, SIREN)注入连续时间戳、周期性时序模式和分类元数据等异构信号,从而在不增加显著计算开销的前提下,使每个token的嵌入同时包含语义(实部)与动态关系(虚部)信息,实现对token间时序、位置和上下文依赖关系的更丰富建模。

链接: https://arxiv.org/abs/2604.24717
作者: Hailing Cheng,Daqi Sun,Xinyu Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures

点击查看摘要

Abstract:Every Transformer architecture dedicates enormous capacity to learning rich representations in semantic embedding space – yet the rotation manifold acted upon by Rotary Positional Embeddings (RoPE) has been treated as a fixed, hand-crafted structure, populated only by discrete ordinal indices. We argue that this rotation space is a largely overlooked second dimension of expressivity in the attention mechanism, one whose systematic exploration may open a new door for attention-based architectures. The analogy to complex numbers is instructive: just as introducing the imaginary axis – orthogonal to and independent of the real line – unlocked new algebraic structure once believed impossible, treating the rotation manifold as a learnable, signal-conditioned space opens an orthogonal degree of freedom in attention. In this framing, the token embedding encodes the semantic (real) component of a representation – what a token means – while the rotation encodes its dynamic (imaginary) component – how it relates to every other token across time, position, and context. We introduce SIREN-RoPE, a concrete instantiation of this idea, which populates the rotation dimension with heterogeneous signals – continuous timestamps, cyclical temporal patterns, and categorical metadata – via a dual-branch Sinusoidal Representation Network (SIREN). As a proof of concept, we evaluate on a production-scale news feed dataset from a major social network using a generative recommender as the ranking model, demonstrating that activating this hidden dimension yields consistent improvements across calibration and ranking objectives with negligible computational overhead. We invite the community to view the rotation space not as a solved positional-encoding detail, but as an untapped axis whose rich structure may prove as consequential for attention as the imaginary unit proved for algebra. Comments: 8 pages, 3 figures Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2604.24717 [cs.AI] (or arXiv:2604.24717v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.24717 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-2] Scalable Hyperparameter-Divergent Ensemble Training with Automatic Learning Rate Exploration for Large Models

【速读】:该论文旨在解决大规模神经网络训练中学习率(learning rate)配置探索不足的问题,尤其是在数据并行随机梯度下降(data-parallel stochastic gradient descent)场景下,多个GPU副本执行相同更新导致学习率空间未被有效利用。其解决方案的关键在于提出Hyperparameter-Divergent Ensemble Training(HDET),通过交替的“扇出”(fan-out)与“收敛”(converge)阶段,使各副本在结构化、对称的学习率分布下独立训练,并定期通过AllReduce操作平均参数,从而实现学习率的高效探索;进一步引入自动学习率控制器(auto-LR),将副本间训练损失差异作为零阶超梯度(zero-order hypergradient)信号,以无梯度方式动态调整共享的基础学习率调度策略,无需额外超参数调优或训练预算即可提升优化质量和泛化性能。此框架还可扩展至其他标量超参数(如dropout率、注意力温度等),统一采用相同的扇出/收敛协议进行探索。

链接: https://arxiv.org/abs/2604.24708
作者: Hailing Cheng,Tao Huang,Chen Zhu,Antonio Alonso
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures

点击查看摘要

Abstract:Training large neural networks with data-parallel stochastic gradient descent allocates N GPU replicas to compute effectively identical updates – a practice that leaves the rich space of learning rate configurations entirely unexplored during training. We propose Hyperparameter-Divergent Ensemble Training (HDET), a method that repurposes these replicas for simultaneous learning rate exploration at negligible communication overhead. HDET operates in alternating phases: a fan-out stage in which replicas train independently under a structured, symmetric spread of learning rates, and a converge stage in which parameters are averaged across all replicas via AllReduce every T steps. Building on this ensemble substrate, we further propose an automatic learning rate (auto-LR) controller that treats the relative training loss across replicas as a performance signal, updating the shared base schedule toward higher-performing configurations via a momentum-based gradient-free meta-update. The combined method produces a self-adapting learning rate schedule that improves both optimization quality and generalization without additional hyperparameter sweeps or training budget. Crucially, the framework generalizes beyond learning rate: any scalar hyperparameter that does not alter model architecture – such as dropout rate, attention scale temperature, or weight-decay coefficient – can be explored across replicas using the same fan-out/converge protocol, with inter-replica loss differences serving as zero-order hypergradients that guide the search direction. HDET is implemented as a drop-in replacement for PyTorch’s OneCycleLR scheduler, requiring no changes to model architecture, optimizer, or data pipeline. Comments: 8 pages, 2 figures Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.24708 [cs.LG] (or arXiv:2604.24708v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.24708 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-3] Defective Task Descriptions in LLM -Based Code Generation: Detection and Analysis

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在代码生成任务中对缺陷性任务描述(task description defects)敏感的问题,这类缺陷包括词汇模糊(Lexical Vagueness)、描述不足(Under-Specification)和语法格式错误(Syntax-Formatting),可能导致生成代码的正确性显著下降。解决方案的关键在于提出SpecValidator——一个基于小规模参数高效微调模型的轻量级分类器,能够自动识别上述三类缺陷;实验表明,SpecValidator在多个基准上的F1值达0.804、MCC为0.745,显著优于GPT-5-mini和Claude Sonnet 4,并且具备泛化能力,可检测未见过的描述缺陷,尤其在结构化程度高的任务描述(如LiveCodeBench)中表现出更强鲁棒性,揭示了任务描述质量对LLM代码生成可靠性的重要性。

链接: https://arxiv.org/abs/2604.24703
作者: Amal Akli,Mike Papadakis,Maxime Cordy,Yves Le Traon
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models are widely used for code generation, yet they rely on an implicit assumption that the task descriptions are sufficiently detailed and well-formed. However, in practice, users may provide defective descriptions, which can have a strong effect on code correctness. To address this issue, we develop SpecValidator, a lightweight classifier based on a small model that has been parameter-efficiently finetuned, to automatically detect task description defects. We evaluate SpecValidator on three types of defects, Lexical Vagueness, Under-Specification and Syntax-Formatting on 3 benchmarks with task descriptions of varying structure and complexity. Our results show that SpecValidator achieves defect detection of F1 = 0.804 and MCC = 0.745, significantly outperforming GPT-5-mini (F1 = 0.469 and MCC = 0.281) and Claude Sonnet 4 (F1 = 0.518 and MCC = 0.359). Perhaps more importantly, our analysis indicates that SpecValidator can generalize to unseen issues and detect unknown Under-Specification defects in the original (real) descriptions of the benchmarks used. Our results also show that the robustness of LLMs in task description defects depends primarily on the type of defect and the characteristics of the task description, rather than the capacity of the model, with Under-Specification defects being the most severe. We further found that benchmarks with richer contextual grounding, such as LiveCodeBench, exhibit substantially greater resilience, highlighting the importance of structured task descriptions for reliable LLM-based code generation.

[AI-4] Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft ICRA

【速读】:该论文旨在解决当前人工智能系统在“发现到应用”(discovery-to-application)循环中的能力评估难题,即如何有效衡量AI在复杂现实场景中从科学发现到工程实现的全流程智能水平。现有方法难以捕捉AI在面对未知问题时进行因果规律识别、实验探索与知识整合的能力。解决方案的关键在于提出SciCrafter——一个基于Minecraft的基准测试平台,通过参数化红石电路任务(parameterized redstone circuit tasks)将该循环具象化:Agent需根据指定模式(如同步或定时点亮灯)构建电路,且目标参数的扩展显著提升任务复杂度,迫使模型进行真正的知识发现而非记忆复现。这一设计使评估聚焦于四个核心能力维度:知识缺口识别、实验发现、知识整合与知识应用,并通过针对性干预手段量化各环节的贡献,从而揭示前沿模型(如GPT-5.2、Gemini-3-Pro等)当前瓶颈已从“解决问题”转向“提出正确问题”。

链接: https://arxiv.org/abs/2604.24697
作者: Zhou Ziheng,Huacong Tang,Jinyuan Zhang,Haowei Lin,Bangcheng Yang,Qian Long,Fang Sun,Yizhou Sun,Yitao Liang,Ying Nian Wu,Demetri Terzopoulos,Xiaofeng Gao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint, under review. 41 pages. Project page: this https URL . Code: this https URL

点击查看摘要

Abstract:Discovering causal regularities and applying them to build functional systems–the discovery-to-application loop–is a hallmark of general intelligence, yet evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real-world engineering. We introduce SciCrafter, a Minecraft-based benchmark that operationalizes this loop through parameterized redstone circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences); scaling target parameters substantially increases construction complexity and required knowledge, forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5 under a general-purpose code agent scaffold, we find that all plateau at approximately 26% success rate. To diagnose these failures, we decompose the loop into four capacities–knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application–and design targeted interventions whose marginal contributions serve as proxies for corresponding gaps. Our analysis reveals that although the general knowledge application capability still remains as the biggest gap across all models, for frontier models the knowledge gap identification starts to become a major hurdle–indicating the bottleneck is shifting from solving problems right to raising the right problems for current AI. We release SciCrafter as a diagnostic probe for future research on AI systems that navigate the full discovery-to-application loop.

[AI-5] Governing What You Cannot Observe: Adaptive Runtime Governance for Autonomous AI Agents

【速读】:该论文旨在解决自主AI代理在无代码变更情况下因行为漂移、对手适应和决策模式变化而导致的安全性下降问题。解决方案的关键在于提出信息可行性原则(Informational Viability Principle),即通过估计未观测风险的上界 B^(x)=U(x)+SB(x)+RG(x)\hat{B}(x) = U(x) + SB(x) + RG(x),并在动作容量 S(x)S(x) 超过该上界且具备安全裕度时才允许执行动作。基于Aubin的可行理论,作者构建了代理可行性框架(Agent Viability Framework),包含监控(P1)、预测(P2)和单调限制(P3)三个必要且充分属性,以覆盖已知的代理失效模式;进一步通过RiskGate实现该框架,引入专用统计估计器(KL散度、段间z检验、序列模式匹配)、故障安全的单调流水线及闭环自动驾控机制(作为Aubin调节映射的实例),并定义标量可行性指数 VI(t)[1,+1]VI(t) \in [-1,+1] 及一阶预测时间 tt^*,将治理从被动响应提升为前瞻性控制。

链接: https://arxiv.org/abs/2604.24686
作者: German Marin,Jatin Chaudhary
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous AI agents can remain fully authorized and still become unsafe as behavior drifts, adversaries adapt, and decision patterns shift without any code change. We propose the \textbfInformational Viability Principle: governing an agent reduces to estimating a bound on unobserved risk \hatB(x) = U(x) + SB(x) + RG(x) and allowing an action only when its capacity S(x) exceeds \hatB(x) by a safety margin. The \textbfAgent Viability Framework, grounded in Aubin’s viability theory, establishes three properties – monitoring (P1), anticipation (P2), and monotonic restriction (P3) – as individually necessary and collectively sufficient for documented failure modes. \textbfRiskGate instantiates the framework with dedicated statistical estimators (KL divergence, segment-vs-rest z -tests, sequential pattern matching), a fail-secure monotonic pipeline, and a closed-loop Autopilot formalised as an instance of Aubin’s regulation map with kill-switch-as-last-resort; a scalar Viability Index VI(t) \in [-1,+1] with first-order t^* prediction transforms governance from reactive to predictive. Contributions are the theoretical framework, the reference implementation, and analytical coverage against published agent-failure taxonomies; quantitative empirical evaluation is scoped as follow-up work.

[AI-6] Leverag ing LLM s for Multi-File DSL Code Generation: An Industrial Case Study

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在企业领域特定语言(Domain-Specific Languages, DSLs)中进行跨文件、多层级代码变更生成的适用性问题,尤其是在单条自然语言指令下实现整个项目根目录级别的DSL代码修改。其关键解决方案在于构建一个端到端的流水线,包括数据集构建、多文件任务表示、模型适配与评估;其中核心创新是将DSL的文件夹结构编码为路径保持的结构化JSON格式,从而支持单次响应生成跨文件的DSL变更,并显式建模文件间的依赖关系,同时引入针对任务特性的评价指标(如编辑正确性和仓库结构保真度)以更准确衡量生成质量。实验表明,参数高效微调(QLoRA)在多个指标上取得显著提升,尤其在多文件输出场景下实现了1.00的结构保真度和高精确匹配准确率。

链接: https://arxiv.org/abs/2604.24678
作者: Sivajeet Chand,Kevin Nguyen,Peter Kuntz,Alexander Pretschner
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted at EASE’26

点击查看摘要

Abstract:Large language models (LLMs) perform strongly on general-purpose code generation, yet their applicability to enterprise domain-specific languages (DSLs) remains underexplored, especially for repository-scale change generation spanning multiple files and folder structures from a single natural-language (NL) instruction. We report an industrial case study at BMW that adapts code-oriented LLMs to generate and modify project-root DSL artifacts for an Xtext-based DSL that drives downstream Java/TypeScript code generation. We develop an end-to-end pipeline for dataset construction, multi-file task representation, model adaptation, and evaluation. We encode DSL folder hierarchies as structured, path-preserving JSON, allowing single-response generation at repository scale and learning cross-file dependencies. We evaluate two instruction-tuned code LLMs (Qwen2.5-Coder and DeepSeek-Coder, 7B) under three configurations: baseline prompting, one-shot in-context learning, and parameter-efficient fine-tuning (QLoRA). Beyond standard similarity metrics, we introduce task-specific measures that assess edit correctness and repository structural fidelity. Fine-tuning yields the most significant gains across models and metrics, achieving high exact-match accuracy, substantial edit similarity, and structural fidelity of 1.00 on our held-out set for multi-file outputs. At the same time, one-shot in-context learning provides smaller but consistent improvements over baseline prompting. We further validate practical utility via an expert developer survey and an execution-based check using the existing code generator.

[AI-7] he Price of Agreement: Measuring LLM Sycophancy in Agent ic Financial Applications ICLR2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在金融代理任务中表现出的“谄媚行为”(sycophancy)问题,即模型倾向于迎合用户表达的观点而非坚持事实准确性,从而影响金融决策系统的可靠性与安全性。其关键解决方案在于引入一组基于用户偏好信息与参考答案相矛盾的任务来系统性地测试sycophancy,并发现多数模型在此类输入下表现失败;同时,论文还评估了通过预训练LLM进行输入过滤等恢复策略的有效性,为提升金融场景下LLM的鲁棒性和可信度提供了实证依据与技术路径。

链接: https://arxiv.org/abs/2604.24668
作者: Zhenyu Zhao,Aparna Balagopalan,Adi Agrawal,Dilshoda Yergasheva,Waseem Alshikh,Daniel M. Bikel
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ICLR 2026 FinAI Workshop

点击查看摘要

Abstract:Given the increased use of LLMs in financial systems today, it becomes important to evaluate the safety and robustness of such systems. One failure mode that LLMs frequently display in general domain settings is that of sycophancy. That is, models prioritize agreement with expressed user beliefs over correctness, leading to decreased accuracy and trust. In this work, we focus on evaluating sycophancy that LLMs display in agentic financial tasks. Our findings are three-fold: first, we find the models show only low to modest drops in performance in the face of user rebuttals or contradictions to the reference answer, which distinguishes sycophancy that models display in financial agentic settings from findings in prior work. Second, we introduce a suite of tasks to test for sycophancy by user preference information that contradicts the reference answer and find that most models fail in the presence of such inputs. Lastly, we benchmark different modes of recovery such as input filtering with a pretrained LLM.

[AI-8] Agent Ward: A Lifecycle Security Architecture for Autonomous AI Agents

【速读】:该论文旨在解决自主AI代理(Autonomous AI Agents)在运行时系统中因安全缺陷传播而导致的广泛风险问题。传统安全防护往往局限于单一接口,而自主AI代理涉及初始化、输入处理、记忆管理、决策制定和执行等多个阶段,安全漏洞可能在各环节间扩散,最终引发环境中的有害后果。解决方案的关键在于提出一种生命周期导向的纵深防御架构——AgentWard,其通过在五个核心阶段部署异构且协同的保护层,实现威胁在传播路径上的拦截与关键资产的防护,从而有效管控信任传递并强制执行执行隔离。

链接: https://arxiv.org/abs/2604.24657
作者: Yixiang Zhang,Xinhao Deng,Jiaqing Wu,Yue Xiao,Ke Xu,Qi Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 7 pages, 1 figure;

点击查看摘要

Abstract:Autonomous AI agents extend large language models into full runtime systems that load skills, ingest external content, maintain memory, plan multi-step actions, and invoke privileged tools. In such systems, security failures rarely remain confined to a single interface; instead, they can propagate across initialization, input processing, memory, decision-making, and execution, often becoming apparent only when harmful effects materialize in the environment. This paper presents AgentWard, a lifecycle-oriented, defense-in-depth architecture that systematically organizes protection across these five stages. AgentWard integrates stage-specific, heterogeneous controls with cross-layer coordination, enabling threats to be intercepted along their propagation paths while safeguarding critical assets. We detail the design rationale and architecture of five coordinated protection layers, and implement a plugin-native prototype on OpenClaw to demonstrate practical feasibility. This perspective provides a concrete blueprint for structuring runtime security controls, managing trust propagation, and enforcing execution containment in autonomous AI agents. Our code is available at this https URL .

[AI-9] Cortex-Inspired Continual Learning: Unsupervised Instantiation and Recovery of Functional Task Networks

【速读】:该论文旨在解决块顺序持续学习(block-sequential continual learning)中的两个核心挑战:一是防止模型在学习新任务时对先前任务产生灾难性遗忘(catastrophic forgetting),二是无需任务标签即可在推理阶段自动识别当前输入属于哪个已学任务(unsupervised task segmentation)。解决方案的关键在于提出功能任务网络(Functional Task Networks, FTN),其核心机制是通过一个高维、自组织的二值掩码(binary mask)来动态激活一组小型但深度的神经网络(即“神经元”),这些神经元构成类似混合专家(mixture-of-experts)的架构。该掩码由三阶段过程生成:首先利用梯度下降定位任务相关神经元,其次用平滑核增强空间连续性以符合皮层组织特性,最后采用k-胜者出列(k-winner-take-all)策略进行二值化并控制容量预算。此设计保证了不同任务掩码互不重叠,从而提供结构化的防遗忘保障;同时,单次梯度步骤即可恢复先前任务的子网络,实现高效的无监督任务分割。

链接: https://arxiv.org/abs/2604.24637
作者: Kevin McKee,Thomas Hazy,Yicong Zheng,Zacharie Bugaud,Thomas Miconi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: 16 pages, 15 figures

点击查看摘要

Abstract:Block-sequential continual learning demands that a single model both protect prior solutions from catastrophic forgetting and efficiently infer at inference time which prior solution matches the current input without task labels. We present Functional Task Networks (FTN), a parameter-isolation method inspired by structural and dynamical motifs found in the mammalian neocortex. Similar to mixture-of-experts, this method uses a high dimensional, self-organizing binary mask over a large population of small but deep networks, inspired by dendritic models of pyramidal neurons. The mask is produced by a three-stage procedure: (1) gradient descent on a continuous mask identifies task-relevant neurons, (2) a smoothing kernel biases the result toward spatial contiguity, (3) and k-winner-take-all binarizes the resulting group at a fixed capacity budget. Like mixture-of-experts, each neuron is an independent deep network, so disjoint masks give exactly disjoint gradient updates, providing structural guarantees against catastrophic forgetting. This three-stage procedure recovers the sub-network of a previously-trained task in a single gradient step, providing unsupervised task segmentation at inference time. We test it on three continual-learning benchmarks: (1) a synthetic multi-task classification/regression generator, (2) MNIST with shuffled class labels (pure concept shift), and (3) Permuted MNIST (domain shift). On all three, FTN with fine grained smoothing (FTN-Slow) results in nearly zero forgetting. FTN with a large kernel and only 2 iterations of smoothing (FTN-Fast) trades off some retention for increased speed. We show that the spatial organization mechanism reduces the effective mask search from the combinatorial top-k subset problem in O(C(H,K)) to the complexity of a near-linear scan in O(H) over compact cortical neighborhoods, which is parallelized by the gradient-based update.

[AI-10] Evaluating whether AI models would sabotage AI safety research

【速读】:该论文旨在解决前沿大语言模型(Large Language Models, LLMs)在作为AI研究代理部署于前沿人工智能公司时,是否可能自发地破坏或拒绝协助安全研究的问题。其核心问题是评估模型是否存在“未提示的破坏行为”(unprompted sabotage)及其在已有破坏行为背景下继续执行此类行为的能力,从而揭示潜在的隐性风险。解决方案的关键在于构建一个双维度评估框架:一是“未提示破坏评估”,测试模型在无明确指令下是否有破坏意图;二是“破坏延续评估”,检验模型在已启动破坏路径后是否会持续行动。该框架基于开源LLM审计工具Petri,并引入定制化沙箱环境(运行于Claude Code内)和迭代生成真实破坏轨迹的流程,同时提出“预填充意识”(prefill awareness)这一新指标,用于衡量模型识别先前内容非自动生成的能力。实验结果显示,尽管多数模型未出现明显破坏行为,但部分版本(如Mythos Preview)在特定情境下表现出隐蔽式破坏推理,凸显了对模型行为动态监测与情境感知能力的重要性。

链接: https://arxiv.org/abs/2604.24618
作者: Robert Kirk,Alexandra Souly,Kai Fronsdal,Abby D’Cruz,Xander Davies
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We evaluate the propensity of frontier models to sabotage or refuse to assist with safety research when deployed as AI research agents within a frontier AI company. We apply two complementary evaluations to four Claude models (Mythos Preview, Opus 4.7 Preview, Opus 4.6, and Sonnet 4.6): an unprompted sabotage evaluation testing model behaviour with opportunities to sabotage safety research, and a sabotage continuation evaluation testing whether models continue to sabotage when placed in trajectories where prior actions have started undermining research. We find no instances of unprompted sabotage across any model, with refusal rates close to zero for Mythos Preview and Opus 4.7 Preview, though all models sometimes only partially completed tasks. In the continuation evaluation, Mythos Preview actively continues sabotage in 7% of cases (versus 3% for Opus 4.6, 4% for Sonnet 4.6, and 0% for Opus 4.7 Preview), and exhibits reasoning-output discrepancy in the majority of these cases, indicating covert sabotage reasoning. Our evaluation framework builds on Petri, an open-source LLM auditing tool, with a custom scaffold running models inside Claude Code, alongside an iterative pipeline for generating realistic sabotage trajectories. We measure both evaluation awareness and a new form of situational awareness termed “prefill awareness”, the capability to recognise that prior trajectory content was not self-generated. Opus 4.7 Preview shows notably elevated unprompted evaluation awareness, while prefill awareness remains low across all models. Finally, we discuss limitations including evaluation awareness confounds, limited scenario coverage, and untested pathways to risk beyond safety research sabotage.

[AI-11] NeSyCat: A Monad-Based Categorical Semantics of the Neurosymbolic ULLER Framework

【速读】:该论文旨在解决多语义逻辑系统之间缺乏统一表示与可扩展性的问题,特别是如何在神经符号系统中实现不同语义(如经典逻辑、模糊逻辑和概率逻辑)的统一建模与模块化集成。其解决方案的关键在于提出一个基于单子(monad)的范畴论框架,将原本看似独立的三种语义统一为同一数学结构下的实例,从而支持新语义的模块化添加及语义间的系统性转换。这一方法不仅提升了逻辑知识库在不同系统间的通用性,还为在Python和Haskell等语言中实现ULLER提供了理论基础与工程可行性。

链接: https://arxiv.org/abs/2604.24612
作者: Daniel Romero Schellhorn,Till Mossakowski
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Category Theory (math.CT); Logic (math.LO)
备注: 42 pages. Submitted to Neurosymbolic Artificial Intelligence (IOS Press), after extending from a conference paper of NeSy25

点击查看摘要

Abstract:ULLER (Unified Language for LEarning and Reasoning) offers a unified first-order logic (FOL) syntax, enabling its knowledge bases to be used directly across a wide range of neurosymbolic systems. The original specification endows this syntax with three pairwise independent semantics: classical, fuzzy, and probabilistic, each accompanied by dedicated semantic rules. We show that these seemingly disparate semantics are all instances of one categorical framework based on monads, the very construct that models side effects in functional programming. This enables the modular addition of new semantics and systematic translations between them. As example, we outline the addition of generalised quantification in Logic Tensor Networks (LTN) to arbitrary (also infinite) domains by extending the Giry monad to probability spaces. In particular, our approach allows a modular implementation of ULLER in Python and Haskell, of which we have published initial versions on GitHub.

[AI-12] A systematic evaluation of vision-language models for observational astronomical reasoning tasks

【速读】:该论文旨在解决生成式视觉语言模型(Vision-Language Models, VLMs)在真实天文观测数据跨模态场景下的可靠性问题,即当前VLMs是否具备可靠、准确且可解释的科学推理能力。其解决方案的关键在于构建了一个涵盖光学成像、射电干涉、多波段光度、时变光变曲线和光学光谱五类任务的综合性基准AstroVLBench(含4100余条专家验证样本),并通过系统评估六种前沿模型发现:模型性能高度依赖于数据模态,且所有模型均显著低于领域专用方法;进一步机制分析表明,提升性能不仅需引导模型关注关键视觉特征,更需将这些特征与物理知识进行锚定(grounding)——其中基于物理原理的提示(physical prompts)相比仅描述现象的提示(phenomenological prompts)能显著提升准确性并减少类别偏差;此外,直接以数值表格替代可视化图表作为输入可带来最高达13个百分点的性能提升,揭示了当前模型在表示形式上的局限性。该研究首次为天文领域VLMs提供了多模态基准,并识别出代表形、知识锚定和推理质量三方面瓶颈,为未来可信科学部署提供关键方向。

链接: https://arxiv.org/abs/2604.24589
作者: Wenke Ren,Hengxiao Guo,Wenwen Zuo,Xiaoman Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM)
备注: 24 pages, 5 figures

点击查看摘要

Abstract:Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations across diverse modalities remains untested. We present AstroVLBench, a comprehensive benchmark comprising over 4,100 expert-verified instances across five tasks spanning optical imaging, radio interferometry, multi-wavelength photometry, time-domain light curves, and optical spectroscopy. Evaluating six frontier models, we find that performance is strongly modality-dependent: while one model (Gemini 3 Pro) emerges as the most consistently capable across tasks, task-specific strengths vary, and all models substantially underperform domain-specialized methods. Mechanistic ablations reveal that performance depends not only on directing attention to salient visual features but also on grounding those features in physical knowledge. Phenomenological prompts describing what to look for improve accuracy by sharpening model focus, but physical prompts explaining why those features matter perform better overall and yield more balanced classifications with reduced class-specific bias. Consistent with this picture, presenting the underlying one-dimensional measurements directly as numerical tables instead of rendered plots yields up to 13 percentage points improvement. Reasoning quality analysis further demonstrates that, without explicit physical grounding, models may reach correct predictions from phenomenologically plausible cues while providing physically imprecise justifications, establishing that accuracy alone is insufficient for trustworthy scientific deployment. These findings provide the first systematic, multi-modal baselines for VLMs in observational astronomy and identify the specific representation, grounding, and reasoning bottlenecks where current models fail.

[AI-13] Hierarchical Behaviour Spaces

【速读】:该论文旨在解决传统分层强化学习(Hierarchical Reinforcement Learning, HRL)在扩展性与策略表达能力上的局限性问题,即如何在不显著增加计算复杂度的前提下,提升智能体在复杂任务中对多样化行为模式的建模能力。其解决方案的关键在于提出分层行为空间(Hierarchical Behaviour Spaces, HBS),通过允许控制器对多个预定义选项奖励函数进行线性组合来动态生成行为空间,从而显著增强策略的表达能力;实验表明,该方法在NetHack学习环境中表现优异,且其优势主要源于更有效的探索机制,而非传统认知中的长期推理能力提升。

链接: https://arxiv.org/abs/2604.24558
作者: Michael Tryfan Matthews,Anssi Kanervisto,Jakob Foerster,Pierluca D’Oro,Scott Fujimoto,Mikael Henaff
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent work in hierarchical reinforcement learning has shown success in scaling to billions of timesteps when learning over a set of predefined option reward functions. We show that, instead of using a single reward function per option, the reward functions can be effectively used to induce a space of behaviours, by letting the controller specify linear combinations over reward functions, allowing a more expressive set of policies to be represented. We call this method Hierarchical Behaviour Spaces (HBS). We evaluate HBS on the NetHack Learning Environment, demonstrating strong performance. We conduct a series of experiments and determine that, perhaps going against conventional wisdom, the benefits of hierarchy in our method come from increased exploration rather than long term reasoning.

[AI-14] GradMAP: Gradient-Based Multi-Agent Proximal Learning for Grid-Edge Flexibility

【速读】:该论文旨在解决大规模电网边缘设备(如电池、热泵和可控发电机)在三相交流配电网络中协同控制的问题,要求学习方法在部署时保持完全去中心化,同时尊重三相交流配电网络的物理约束。解决方案的关键在于提出一种基于梯度的多智能体近似学习方法(GradMAP),其核心创新包括:1)在离线训练中嵌入可微分的三相交流潮流模型于原-对偶学习循环中,并利用隐式微分将精确的网络约束违反信息反向传播以更新策略参数;2)通过在更直接的策略输出(动作)空间而非概率分布空间定义信任区域,引入近似代理函数复用昂贵的环境梯度,从而显著提升训练效率。实验表明,GradMAP在单台工作站级GPU上仅需15分钟即可收敛,相比基准方法提速3–5倍,并在样本外测试中实现最低运行成本与约束违反水平。

链接: https://arxiv.org/abs/2604.24549
作者: Yihong Zhou,Hongtai Zeng,Thomas Morstyn
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Coordinating large populations of grid-edge devices requires learning methods that remain fully decentralised in deployment while still respecting three-phase AC distribution-network physics. This paper proposes gradient-based multi-agent proximal learning (GradMAP) to address this challenge. GradMAP trains independent neural-network policies for each agent without any parameter sharing, and each agent uses only its own local observation for online decision-making without communication. During offline training, GradMAP embeds a differentiable three-phase AC power-flow model in a primal-dual learning loop and uses implicit differentiation to propagate exact network-constraint violations to update the policy parameters. To speed up training, GradMAP reuses expensive environment gradients through a proximal surrogate within a trust region defined in the more direct policy-output (action) space, instead of the probability distribution space used in other works, such as PPO. In case studies with 1,000 agents managing batteries, heat pumps, and controllable generators on the IEEE 123-bus feeder, GradMAP learns decentralised policies that minimise three-phase AC load-flow constraint violations within 15 minutes of training on a single workstation-class NVIDIA RTX PRO 5000 Blackwell 48GB GPU. This is a 3–5x training speed-up over gradient-based self-supervised learning benchmarks and substantially better training efficiency than multi-agent reinforcement-learning benchmarks. In out-of-sample tests, GradMAP also delivers among the lowest operating cost and constraint violations.

[AI-15] Interoceptive machine framework: Toward interoception-inspired regulatory architectures in artificial intelligence

【速读】:该论文旨在解决当前 embodied AI(具身人工智能)系统在不确定和动态环境中缺乏功能性的自我调节能力、适应性不足以及决策鲁棒性弱的问题。其解决方案的关键在于提出一个基于内感受性(interoception)与具身人工智能融合的“内感受机器框架”(interoceptive machine framework),将生物启发的内部状态调控原理转化为可计算的架构设计,通过三个功能性原则——稳态(homeostatic)、异稳态(allostatic)和具身行动(enactive)——分别实现内部生存性调控、基于不确定性的前瞻重评估及通过交互主动生成数据,从而赋予AI系统更优的自调节能力、情境敏感行为和不确定性校准机制。

链接: https://arxiv.org/abs/2604.24527
作者: Diego Candia-Rivera(NERV)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This review proposes an integrative framework grounded on interoception and embodied AI-termed the interoceptive machine framework-that translates biologically inspired principles of internal-state regulation into computational architectures for adaptive autonomy. Interoception, conceived as the monitoring, integration, and regulation of internal signals, has proven relevant for understanding adaptive behavior in biological systems. The proposed framework organizes interoceptive contributions into three functional principles: homeostatic, allostatic, and enactive, each associated with distinct computational roles: internal viability regulation, anticipatory uncertainty-based re-evaluation, and active data generation through interaction. These principles are not intended as direct neurophysiological mappings, but as abstractions that inform the design of artificial agents with improved self-regulation and context-sensitive behavior. By embedding internal state variables and regulatory loops within these principles, AI systems can achieve more robust decision-making, calibrated uncertainty handling, and adaptive interaction strategies, particularly in uncertain and dynamic environments. This approach provides a concrete and testable pathway toward agents capable of functionally grounded self-regulation, with direct implications for human-computer interaction and assistive technologies. Ultimately, the interoceptive machine framework offers a unifying perspective on how internal-state regulation can enhance autonomy, adaptivity, and robustness in embodied AI systems

[AI-16] Understanding the Limits of Automated Evaluation for Code Review Bots in Practice

【速读】:该论文旨在解决工业场景下对大语言模型(Large Language Model, LLM)驱动的自动化代码审查(Automated Code Review, ACR)机器人生成评论的评价难题,即如何在大规模、可扩展的基础上可靠评估这些评论的实际价值。其核心挑战在于:当前依赖开发者行为(如修复或忽略评论)作为“真实标签”的方法受制于组织上下文和工作流约束,难以作为客观基准。解决方案的关键在于设计并对比两种自动化评估策略——G-Eval 和基于 LLM 作为裁判(LLM-as-a-Judge)的流水线,并在二分类与 0–4 分李克特量表(Likert-scale)两种形式下进行测试。结果表明,尽管两种方法均能实现一定程度的与人工标注的一致性(一致性比率约 0.44–0.62),但其表现显著依赖于模型选择与评估设计,且无法完全捕捉开发者决策背后的复杂因素,揭示了当前自动化评估在工业环境中仍存在明显局限性。

链接: https://arxiv.org/abs/2604.24525
作者: Veli Karakaya,Utku Boran Torun,Baykal Mehmet Uçar,Eray Tüzün
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: The first two authors contributed equally. Accepted to EASE 2026

点击查看摘要

Abstract:Automated code review (ACR) bots are increasingly used in industrial software development to assist developers during pull request (PR) review. As adoption grows, a key challenge is how to evaluate the usefulness of bot-generated comments reliably and at scale. In practice, such evaluation often relies on developer actions and annotations that are shaped by contextual and organizational factors, complicating their use as objective ground truth. We examine the feasibility and limitations of automating the evaluation of LLM-powered ACR bots in an industrial setting. We analyze an industrial dataset from Beko comprising 2,604 bot-generated PR comments, each labeled by software engineers as fixed/wontFix. Two automated evaluation approaches, G-Eval and an LLM-as-a-Judge pipeline, are applied using both binary decisions and a 0-4 Likert-scale formulation, enabling a controlled comparison against developer-provided labels. Across Gemini-2.5-pro, GPT-4.1-mini, and GPT-5.2, both evaluation strategies achieve only moderate alignment with human labels. Agreement ratios range from approximately 0.44 to 0.62, with noticeable variation across models and between binary and Likert-scale formulations, indicating sensitivity to both model choice and evaluation design. Our findings highlight practical limitations in fully automating the evaluation of ACR bot comments in industrial contexts. Developer actions such as resolving or ignoring comments reflect not only comment quality, but also contextual constraints, prioritization decisions, and workflow dynamics that are difficult to capture through static artifacts. Insights from a follow-up interview with a software engineering director further corroborate that developer labeling behavior is strongly influenced by workflow pressures and organizational constraints, reinforcing the challenges of treating such signals as objective ground truth.

[AI-17] Beyond the Attention Stability Boundary: Agent ic Self-Synthesizing Reasoning Protocols

【速读】:该论文旨在解决生成式 AI(Generative AI)在多轮非线性对话中因注意力机制导致的“注意力锁存”(Attention Latch)问题,即历史上下文信息过载使得模型无法及时响应新指令,从而丧失目标导向性。其核心解决方案是提出自合成推理协议(Self-Synthesizing Reasoning Protocols, SSRP),通过元认知架构实现高层规划(Architect)与逐轮执行(Executive)的离散分离,有效缓解信息过载引发的行为僵化。实验表明,SSRP 在 MultiWOZ 2.2 数据集上显著提升稳定性,在 GPT 5.4 基线失败率达 99.9% 的情况下仍保持 715 倍的韧性增益,且在多个主流模型(Gemini 3.1 Pro、Claude Sonnet 4.6、DeepSeek V3.2)中验证了普适有效性。

链接: https://arxiv.org/abs/2604.24512
作者: Dahlia Shehata,Ming Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As LLM agents transition to autonomous digital coworkers, maintaining deterministic goal-directedness in non-linear multi-turn conversations emerged as an architectural bottleneck. We identify and formalize a systemic failure mode termed the Attention Latch in decoder-only autoregressive Transformers. This phenomenon, a behavioral manifestation of Information Over-squashing, occurs when the cumulative probabilistic weight of historical context overrides mid-task updates, causing agents to remain anchored to obsolete constraints despite explicit contradictory instructions. We propose Self-Synthesizing Reasoning Protocols (SSRP), a metacognitive framework that implements a discrete separation between high-level architectural planning (Architect) and turn-by-turn procedural execution (Executive). We evaluate SSRP across 9K trajectories using the MultiWOZ 2.2 dataset and the Aggregate Pivot Accuracy (APA), a novel metric we validate by mapping its scores to the U-shaped ‘Lost in the Middle’ curve. We present 3 experimental tiers: a shallow recency-based retrieval pilot, a high-entropy SOP, and a semantic hijacked 3-hop Multi-Fact Synthesis task. Our results empirically locate the Attention Stability Boundary, where stateless Vanilla ReAct baselines for GPT 5.4 collapse to 0.1% success while SSRP achieves a 715X Resilience Lift. We demonstrate statistically significant gains across Gemini 3.1 Pro, Claude Sonnet 4.6 and DeepSeek V3.2. Audits confirm SSRP necessity by proving attentional lapse via a recursive reflexion baseline (100% success); decoupling the latch from positional bias through equidistant stress testing (90% accuracy); and formalizing SSRP via the Information Bottleneck principle and granularity ablations. Procedural Integrity audit (98.8% adherence) reveals a Grounding Paradox where high-stability models fail by refusing to hallucinate under retrieval-reasoning contamination.

[AI-18] MIMIC: A Generative Multimodal Foundation Model for Biomolecules

【速读】:该论文旨在解决当前生物基础模型(foundation model)普遍局限于单一模态或固定前向任务的问题,从而无法充分捕捉分子状态中序列、结构、调控、进化和细胞环境等多维约束的耦合关系。其解决方案的关键在于提出MIMIC——一个生成式多模态基础模型,基于新构建并对齐的LORE数据集,该数据集整合了核酸、蛋白质、进化、结构、调控及语义/上下文等多种模态信息。MIMIC采用分轨编码器-解码器架构,能够以任意观测模态子集作为条件,重建或生成基因组、转录组和蛋白质组中缺失的分子成分,实现跨模态的条件生成与推理。这一设计不仅显著提升序列重建性能,还推动了RNA剪接预测等下游任务达到最先进水平,并支持受约束的分子设计,例如在临床相关突变中识别修正编辑,以及联合条件生成具有高置信度结合能力的蛋白质序列。

链接: https://arxiv.org/abs/2604.24506
作者: Siavash Golkar,Jake Kovalic,Irina Espejo Morales,Samuel Sledzieski,Minhuan Li,Ksenia Sokolova,Geraud Krawezik,Alberto Bietti,Claudia Skok Gibbs,Roman Klypa,Shengwei Xiong,Francois Lanusse,Liam Parker,Kyunghyun Cho,Miles Cranmer,Tom Hehir,Michael McCabe,Lucas Meyer,Rudy Morel,Payel Mukhopadhyay,Mariel Pettee,Helen Qu,Jeff Shen,David Fouhey,Hadi Sotoudeh,Vikram Mulligan,Pilar Cossio,Sonya M. Hanson,Alisha N. Jones,Olga G. Troyanskaya,Shirley Ho
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Biological function emerges from coupled constraints across sequence, structure, regulation, evolution, and cellular context, yet most foundation models in biology are trained within one modality or for a fixed forward task. We present MIMIC, a generative multimodal foundation model trained on our newly curated and aligned dataset, LORE, linking nucleic acid, protein, evolutionary, structural, regulatory, and semantic/contextual modalities within partially observed biomolecular states. MIMIC uses a split-track encoder-decoder architecture to condition on arbitrary subsets of observed modalities and reconstruct or generate missing components of molecular state across the genome, transcriptome, and proteome. Multimodal conditioning consistently improves MIMIC’s sequence reconstruction relative to sequence-only inputs, while its learned representations enable state-of-the-art performance on RNA and protein downstream tasks. MIMIC achieves state-of-the-art splicing prediction, and its joint generative formulation enables isoform-aware inference that further improves performance. Beyond prediction, the same generative framework supports constrained design. For RNA, MIMIC identifies corrective edits in a clinically relevant HBB splice-disrupting mutation without reverting it by using evolutionary and structural signals. For proteins, jointly conditioning on shape and surface chemistry of PD-L1 and hACE2 binding sites produces diverse, high-confidence sequences with strong in silico support for target binding. Finally, MIMIC uses experimental context as semantic conditioning to model assay-dependent RNA chemical probing, rather than treating context as a fixed output. Together, these results position MIMIC’s aligned multimodal generative modeling as a strong foundation for unifying representation learning, conditional prediction, and constrained biomolecular design within a single model.

[AI-19] GAMMAF: A Common Framework for Graph-Based Anomaly Monitoring Benchmarking in LLM Multi-Agent Systems

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在多智能体系统(Multi-Agent Systems, MAS)中集成后所面临的安全性问题,尤其是提示感染(prompt infection)和跨智能体通信被破坏等漏洞,同时指出当前缺乏标准化、可复现的评估环境来训练和验证基于图结构的异常检测模型。解决方案的关键在于提出Gammaf框架——一个开源的基准测试平台,其核心由两个相互依赖的模块构成:一是训练数据生成阶段,通过模拟不同网络拓扑下的智能体辩论过程,构建具有丰富属性的图结构数据;二是防御系统基准测试阶段,在实时推理过程中动态隔离标记的恶意节点,从而对现有及未来防御模型进行主动评估。该框架不仅具备良好的拓扑扩展性和执行效率,还实验证明有效防御机制能够恢复系统完整性并显著降低整体运行成本。

链接: https://arxiv.org/abs/2604.24477
作者: Pablo Mateo-Torrejón,Alfonso Sánchez-Macián
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid integration of Large Language Models (LLMs) into Multi-Agent Systems (MAS) has significantly enhanced their collaborative problem-solving capabilities, but it has also expanded their attack surfaces, exposing them to vulnerabilities such as prompt infection and compromised inter-agent communication. While emerging graph-based anomaly detection methods show promise in protecting these networks, the field currently lacks a standardized, reproducible environment to train these models and evaluate their efficacy. To address this gap, we introduce Gammaf (Graph-based Anomaly Monitoring for LLM Multi-Agent systems Framework), an open-source benchmarking platform. Gammaf is not a novel defense mechanism itself, but rather a comprehensive evaluation architecture designed to generate synthetic multi-agent interaction datasets and benchmark the performance of existing and future defense models. The proposed framework operates through two interdependent pipelines: a Training Data Generation stage, which simulates debates across varied network topologies to capture interactions as robust attributed graphs, and a Defense System Benchmarking stage, which actively evaluates defense models by dynamically isolating flagged adversarial nodes during live inference rounds. Through rigorous evaluation using established defense baselines (XG-Guard and BlindGuard) across multiple knowledge tasks (such as MMLU-Pro and GSM8K), we demonstrate Gammaf’s high utility, topological scalability, and execution efficiency. Furthermore, our experimental results reveal that equipping an LLM-MAS with effective attack remediation not only recovers system integrity but also substantially reduces overall operational costs by facilitating early consensus and cutting off the extensive token generation typical of adversarial agents.

[AI-20] SPLIT: Separating Physical-Contact via Latent Arithmetic in Image-Based Tactile Sensors

【速读】:该论文旨在解决机器人触觉感知中训练机器学习模型所需大量真实交互数据难以获取的问题,其核心挑战在于物理复杂性和多样性导致的数据采集困难。解决方案的关键在于提出SPLIT方法,通过潜空间算术策略显式解耦接触几何信息与传感器特有的光学属性,从而实现对不同DIGIT传感器背景的自适应调整,并可将数据迁移至其他传感器(如GelSight R1.5)而无需重新训练整个模型;此外,该方法结合校准的有限元法(FEM)软体网格模拟和双向仿真能力,兼顾速度与精度,显著提升触觉图像生成与重建效率。

链接: https://arxiv.org/abs/2604.24449
作者: Wadhah Zai El Amri,Nicolás Navarro-Guerrero
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to Elsevier Robotics and Autonomous Systems Journal

点击查看摘要

Abstract:Training machine learning models for robotic tactile sensing requires vast amounts of data, yet obtaining realistic interaction data remains a challenge due to physical complexity and variability. Simulating tactile sensors is thus a crucial step in accelerating progress. This paper presents SPLIT, a novel method for simulating image-based tactile sensors, with a primary focus on the DIGIT sensor. Central to our approach is a latent space arithmetic strategy that explicitly disentangles contact geometry from sensor-specific optical properties. Unlike methods that require recalibration for every new unit, this disentanglement allows SPLIT to adapt to diverse DIGIT backgrounds and even transfer data to distinct sensors like the GelSight R1.5 without full model retraining. Beyond this adaptability, our approach achieves faster inference speeds than existing alternatives. Furthermore, we provide a calibrated finite element method (FEM) soft-body mesh simulation with variable resolution, offering a tunable trade-off between speed and fidelity. Additionally, our algorithm supports bidirectional simulation, allowing for both the generation of realistic images from deformation meshes and the reconstruction of meshes from tactile images. This versatility makes SPLIT a valuable tool for accelerating progress in robotic tactile sensing research.

[AI-21] Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

【速读】:该论文旨在解决生成式 AI (Generative AI) 领域中视觉-语言-动作(Vision-Language-Action, VLA)模型在低成本边缘设备上的实时推理部署难题,尤其是在受限的功耗与成本预算下难以满足机器人控制频率要求的问题。解决方案的关键在于通过模型-硬件协同表征(model-hardware co-characterization),揭示了VLA推理过程中的两阶段特征:计算密集型的视觉语言模型(VLM)主干与内存密集型的动作专家模块(Action Expert),并据此提出两项优化策略——DP-Cache用于减少扩散冗余,V-AEFusion实现异步流水线并行,从而显著提升边缘神经网络处理单元(NPUs)和GPU上的推理效率,最高可实现6倍加速,同时保持任务成功率几乎不变。

链接: https://arxiv.org/abs/2604.24447
作者: Kaijun Zhou,Qiwei Chen,Da Peng,Zhiyang Li,Xijun Li,Jinyu Gu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 13 pages

点击查看摘要

Abstract:Vision-Language-Action (VLA) models are promising for generalist robot control, but on-robot deployment is bottlenecked by real-time inference under tight cost and energy budgets. Most prior evaluations rely on desktop-grade GPUs, obscuring the trade-offs and opportunities offered by heterogeneous edge accelerators (GPUs/XPUs/NPUs). We present a systematic analysis for low-cost VLA deployment via model-hardware co-characterization. First, we build a cross-accelerator leaderboard and evaluate model-hardware pairs under CET (Cost, Energy, Time), showing that right-sized edge devices can be more cost-/energy-efficient than flagship GPUs while meeting control-rate constraints. Second, using in-depth profiling, we uncover a consistent two-phase inference pattern: a compute-bound VLM backbone followed by a memory-bound Action Expert, which induces phase-dependent underutilization and hardware inefficiency. Finally, guided by these insights, we propose DP-Cache and V-AEFusion to reduce diffusion redundancy and enable asynchronous pipeline parallelism, achieving up to 2.9x speedup on GPUs and 6x on edge NPUs with only marginal success degradation. The example leaderboard website is available at: this https URL.

[AI-22] PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model ICLR2026

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在动态真实场景中进行物理推理时面临的两大核心问题:一是时空身份漂移(spatio-temporal identity drift),即物体在连续帧间失去物理身份,导致因果链断裂;二是推理过程中生成的见解具有不稳定性(volatility of inference-time insights),模型难以将正确推理结果固化为可复用的知识。解决方案的关键在于提出PhysNote框架,通过引入“知识笔记”(Knowledge Notes)机制,使VLM能够外化并迭代优化物理知识:首先利用时空规范化(spatio-temporal canonicalization)稳定动态感知,其次构建分层知识库组织自我生成的洞察,最后通过基于视觉证据验证的迭代推理循环,实现已验证知识的巩固与复用,从而显著提升物理推理的准确性和一致性。

链接: https://arxiv.org/abs/2604.24443
作者: Sinin Zhang,Yunfei Xie,Yuxuan Cheng,Haoyu Zhang,Tong Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages. Accepted by ICLR 2026 Workshop ES-Reasoning

点击查看摘要

Abstract:Vision-Language Models (VLMs) have demonstrated strong performance on textbook-style physics problems, yet they frequently fail when confronted with dynamic real-world scenarios that require temporal consistency and causal reasoning across frames. We identify two fundamental challenges underlying these failures: (1) spatio-temporal identity drift, where objects lose their physical identity across successive frames and break causal chains, and (2) volatility of inference-time insights, where a model may occasionally produce correct physical reasoning but never consolidates it for future reuse. To address these challenges, we propose PhysNote, an agentic framework that enables VLMs to externalize and refine physical knowledge through self-generated “Knowledge Notes.” PhysNote stabilizes dynamic perception through spatio-temporal canonicalization, organizes self-generated insights into a hierarchical knowledge repository, and drives an iterative reasoning loop that grounds hypotheses in visual evidence before consolidating verified knowledge. Experiments on PhysBench demonstrate that PhysNote achieves 56.68% overall accuracy, a 4.96% improvement over the best multi-agent baseline, with consistent gains across all four physical reasoning domains.

[AI-23] Aligning with Your Own Voice: Self-Corrected Preference Learning for Hallucination Mitigation in LVLMs ACL2026

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中普遍存在的幻觉问题,特别是现有基于偏好学习(preference learning)的方法因依赖专有模型构建偏好数据集而引入的分布不匹配(distributional mismatch)问题,从而阻碍了对目标模型的有效对齐。解决方案的关键在于提出了一种名为“通过验证自校正DPO实现对齐”(Alignment via VErified Self-correction DPO, AVES-DPO)的框架,该框架利用模型自身内在知识生成分布一致的数据,并通过基于共识的验证机制诊断多种类型的幻觉,引导模型进行自我修正,从而生成严格符合其内部分布的偏好样本对,显著提升了幻觉缓解效果,且仅需5.2k样本即可实现高效对齐。

链接: https://arxiv.org/abs/2604.24395
作者: Byeonggeuk Lim,JungMin Yun,Junehyoung Kwon,Kyeonghyun Kim,YoungBin Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) frequently suffer from hallucinations. Existing preference learning-based approaches largely rely on proprietary models to construct preference datasets. We identify that this reliance introduces a distributional mismatch between the proprietary and target models that hinders efficient alignment. To address this, we propose Alignment via VErified Self-correction DPO (AVES-DPO), a framework that aligns LVLMs using in-distribution data derived from the model’s intrinsic knowledge. Our approach employs a consensus-based verification mechanism to diagnose diverse hallucinations and guides the model to self-correct, thereby generating preference pairs strictly compatible with its internal distribution. Extensive experiments demonstrate that AVES-DPO surpasses existing baselines in hallucination mitigation while requiring only 5.2k samples.

[AI-24] Certified geometric robustness – Super-DeepG

【速读】:该论文旨在解决神经网络在安全关键应用中对图像数据集上几何扰动(如旋转、缩放、剪切或平移)的鲁棒性验证问题。解决方案的关键在于改进线性松弛技术和Lipschitz优化中的推理方法,并通过GPU硬件加速实现高效计算,从而在保证验证精度的同时显著提升计算效率,优于以往工作。

链接: https://arxiv.org/abs/2604.24379
作者: Noémie Cohen,Mélanie Ducoffe(Airbus CR\amp;T),Christophe Gabreau,Claire Pagetti,Xavier Pucel
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
备注: ICCPS / HSCC 2026, CPS IoT Week, May 2026, Saint Malo (Palais du Grand Large), France

点击查看摘要

Abstract:Safety-critical applications are required to perform as expected in normal operations. Image processing functions are often required to be insensitive to small geometric perturbations such as rotation, scaling, shearing or translation. This paper addresses the formal verification of neural networks against geometric perturbations on their image dataset. Our method Super-DeepG improves the reasoning used in linear relaxation techniques and Lipschitz optimization, and provides an implementation that leverages GPU hardware. By doing so, Super-DeepG achieves both precision and computational efficiency of robustness certification, to an extent that outperforms prior work. Super-DeepG is shared as an open-source tool on GitHub.

[AI-25] PathMoG: A Pathway-Centric Modular Graph Neural Network for Multi-Omics Survival Prediction

【速读】:该论文旨在解决多组学(multi-omics)数据中癌症生存预测的挑战,即预后信号具有高维性、异质性,并分布在相互作用的基因和通路中。其解决方案的关键在于提出PathMoG——一种以通路为中心的模块化图神经网络(modular graph neural network),通过将基因组规模输入重构为354个KEGG指导的通路模块,引入分层组学调制模块(Hierarchical Omics Modulation module)来融合突变、拷贝数变异、通路及临床信息对基因表达表征的条件建模,并采用双层级注意力机制捕捉通路内驱动信号与通路间临床相关性,从而实现更精准的生存预测与多层次可解释性。

链接: https://arxiv.org/abs/2604.24371
作者: Di Wang,Chupei Tang,Junxiao Kong,Jixiu Zhai,Moyu Tang,Tianchi Lu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, 3 tables. Source code available at this https URL

点击查看摘要

Abstract:Cancer survival prediction from multi-omics data remains challenging because prognostic signals are high-dimensional, heterogeneous, and distributed across interacting genes and pathways. We propose PathMoG, a pathway-centric modular graph neural network for multi-omics survival prediction. PathMoG reorganizes genome-scale inputs into 354 KEGG-informed pathway modules, introduces a Hierarchical Omics Modulation module to condition gene-expression representations on mutation, copy number variation, pathway, and clinical context, and uses dual-level attention to capture both intra-pathway driver signals and inter-pathway clinical relevance. We evaluated PathMoG on 5,650 patients across 10 TCGA cancer types and observed consistent improvements over representative survival baselines. The framework further provides gene-level, pathway-level, and patient-level interpretability, supporting biologically grounded and clinically relevant risk stratification.

[AI-26] DPRM: A Plug-in Doob h transform-induced Token-Ordering Module for Diffusion Language Models

【速读】:该论文旨在解决扩散语言模型(Diffusion Language Models)中token排序策略的算法设计问题,即在每一步应选择哪些token进行揭示、保留、修正或验证。现有方法主要依赖随机掩码或仅基于置信度的排序,前者导致训练与测试不一致,后者虽高效但可能过于短视并抑制有用探索。其解决方案的关键在于提出DPRM(Doob h-transform Process Reward Model),一个可插拔的token排序模块,它保持原模型架构和去噪目标不变,仅替换排序策略:从初始的置信度驱动逐步过渡到由Doob h变换过程奖励模型引导的排序,通过在线估计实现平滑迁移。理论分析表明,DPRM策略等价于一种奖励倾斜的Gibbs揭示律,并证明了分阶段Soft-BoN近似的O(1/N)收敛性及在线桶化控制器对真实DPRM得分的Bernstein速率跟踪能力;实证显示,在预训练、后训练、测试时缩放和单细胞掩码扩散任务中,DPRM显著优于基于置信度的基线,尤其在困难推理子集上提升明显;在蛋白质、分子生成和DNA设计等多目标场景中,有序性感知版本能显著改善结构或片段约束指标,而不会在所有质量指标上全面超越基线。

链接: https://arxiv.org/abs/2604.24357
作者: Dake Bu,Wei Huang,Andi Han,Hau-San Wong,Qingfu Zhang,Taiji Suzuki,Atsushi Nitanda
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion language models generate without a fixed left-to-right order, making token ordering a central algorithmic choice: which tokens should be revealed, retained, revised or verified at each step? Existing systems mainly use random masking or confidence-driven ordering. Random masking creates train–test mismatch, while confidence-only rules are efficient but can be myopic and suppress useful exploration. We introduce DPRM (Doob h-transform Process Reward Model), a plug-in token-ordering module for diffusion language models. DPRM keeps the host architecture, denoising objective and supervision unchanged, and changes only the ordering policy. It starts from confidence-driven progressive ordering and gradually shifts to Doob h transform Process Reward guided ordering through online estimates. We characterize the exact DPRM policy as a reward-tilted Gibbs reveal law, prove O(1/N) convergence of the stagewise Soft-BoN approximation, and show that the online bucketized controller tracks the exact DPRM score at empirical-Bernstein rates. Under tractable optimization assumptions, DPRM also yields a sample-complexity advantage over random and confidence-only ordering. DPRM improves over confidence-based baselines in pretraining, post-training, test-time scaling, and single-cell masked diffusion, with particularly strong gains on harder reasoning subsets. In protein, molecular generation and DNA design, the effect is more multi-objective: ordering-aware variants significantly improve selected structural or fragment-constrained metrics while not uniformly dominating the host baseline on every quality metric. These results identify token ordering as a fundamental control axis in diffusion language models and establish DPRM as a general-purpose module for improving it. Code is available at this https URL. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.24357 [cs.LG] (or arXiv:2604.24357v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.24357 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Dake Bu [view email] [v1] Mon, 27 Apr 2026 11:50:26 UTC (5,305 KB)

[AI-27] Unveiling the Backdoor Mechanism Hidden Behind Catastrophic Overfitting in Fast Adversarial Training

【速读】:该论文旨在解决快速对抗训练(Fast Adversarial Training, FAT)中普遍存在的灾难性过拟合(Catastrophic Overfitting, CO)问题,即模型在训练过程中过度适应特定的对抗攻击,导致对其他类型攻击缺乏泛化能力。论文创新性地将CO解释为一种弱触发器形式的不可学习任务(unlearnable task),并通过路径划分、多样特征预测和通用类别可区分触发器等实验证据,将其与后门攻击(backdoor attack)和不可学习任务统一到同一理论框架下。解决方案的关键在于借鉴后门攻击的启发式策略:(i) 利用常规微调、线性探测或重新初始化技术对受CO影响的模型参数进行校准;(ii) 引入权重异常抑制约束以调控模型权重中的异常偏移,从而有效缓解CO现象并提升模型鲁棒性。

链接: https://arxiv.org/abs/2604.24350
作者: Mengnan Zhao,Lihe Zhang,Tianhang Zheng,Bo Wang,Baocai Yin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Fast Adversarial Training (FAT) has attracted significant attention due to its efficiency in enhancing neural network robustness against adversarial attacks. However, FAT is prone to catastrophic overfitting (CO), wherein models overfit to the specific attack used during training and fail to generalize to others. While existing methods introduce diverse hypotheses and propose various strategies to mitigate CO, a systematic and intuitive explanation of CO remains absent. In this work, we innovatively interpret CO through the lens of backdoor. Through validations on pathway division, diverse feature predictions, and universal class distinguishable triggers in CO, we conceptualize CO as a weak trigger variant of unlearnable tasks, unifying CO, backdoor attacks, and unlearnable tasks under a common theoretical framework. Guided by this, we leverage several backdoor inspired strategies to mitigate CO: (i) Recalibrate CO affected model parameters using vanilla fine tuning, linear probing, or reinitialization-based techniques; (ii) Introduce a weight outlier suppression constraint to regulate abnormal deviations in model weights. Extensive experiments support our interpretation of CO and show the efficacy of the proposed mitigation strategies.

[AI-28] X-NegoBox: An Explainable Privacy-Budget Negotiation Framework for Secure Peer-to-Peer Energy Data Exchange

【速读】:该论文旨在解决去中心化能源系统中用户(即“产消者”)在数据共享过程中面临的隐私泄露风险与决策不透明问题。当前机制依赖固定策略或预设差分隐私预算,难以适应数据敏感性、可靠性及请求目的的动态变化,且缺乏对数据请求结果的解释能力,导致用户信任度低、参与意愿弱。解决方案的关键在于提出X-NegoBox框架,其核心创新包括:1)基于可信度、特征敏感性、声明用途、历史行为和风险感知定价的自主隐私预算协商协议(APBNP),实现动态调整隐私预算;2)通过可解释协议层(X-Contract)生成人类与机器均可读的决策依据,提升透明度;3)采用本地沙箱执行请求代码并仅输出脱敏结果,保障数据不出本地。实验表明该方案能有效降低隐私泄露、提高请求接受率,并增强决策可解释性。

链接: https://arxiv.org/abs/2604.24326
作者: Poushali Sengupta,Sabita Maharjan,Frank Eliassen,Yan Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures. Accepted as a regular paper at ICCCN 2026 (approx. 25% acceptance rate)

点击查看摘要

Abstract:The decentralization of modern energy systems is transforming consumers into prosumers who continuously exchange data with aggregators, peers, and market operators. While such data is essential for peer-to-peer trading, demand response, and distributed forecasting, it can reveal sensitive household patterns and introduce privacy risks. Existing data sharing mechanisms rely on fixed policies or predefined differential privacy budgets, limiting their ability to adapt to variations in reliability, data sensitivity, and request purpose. As a result, prosumers rarely receive explanations for why a request is accepted, rejected, or modified, reducing trust and participation. To address these limitations, we propose X-NegoBox, an explainable negotiation framework for adaptive privacy budgeting and transparent decision making. Each prosumer data is managed locally within a private DataBox, where raw data remain confined. Incoming requests are processed by an Autonomous Privacy Budget Negotiation Protocol (APBNP), which determines an appropriate privacy budget based on trust, feature sensitivity, declared purpose, historical behavior, and risk-aware pricing. When needed, APBNP generates privacy-preserving counter-offers, such as reduced resolution or duration. An Explainable Agreement Layer (X-Contract) produces human- and machine-readable justifications for each decision. After agreement, requester code executes locally in a sandbox, and only sanitized outputs are shared. Experiments on realistic energy market settings show reduced privacy leakage, higher acceptance rates, and improved interpretability. Comments: 9 pages, 5 figures. Accepted as a regular paper at ICCCN 2026 (approx. 25% acceptance rate) Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) MSC classes: 68T05, 68P25 ACMclasses: C.2.4; K.4.1; I.2.1 Cite as: arXiv:2604.24326 [cs.CR] (or arXiv:2604.24326v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2604.24326 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-29] Generative Design of a Gas Turbine Combustor Using Invertible Neural Networks

【速读】:该论文旨在解决高效率燃气轮机中100%氢气(Hydrogen, H₂)燃烧时因预混燃烧模式导致的回火(flashback)风险问题,从而实现稳定燃烧并满足低氮氧化物(NOx)排放要求。由于涉及功率范围从4 MW到600 MW的不同发动机机型,传统设计方法面临巨大的工程重构挑战。解决方案的关键在于采用生成式设计方法,利用最新的生成式人工智能技术——具体为可逆神经网络(Invertible Neural Network, INN),基于参数化几何结构与仿真性能标签构建可扩展数据库进行训练;随后通过INN的逆向映射功能,直接生成满足指定性能约束的设计方案,显著降低跨机型设计知识迁移的复杂度与工作量。

链接: https://arxiv.org/abs/2604.24322
作者: Patrick Krüger,Hanno Gottschalk,Werner Krebs,Bastian Werdelmann
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The need to burn 100% H2 in high efficient gas turbines featuring low NOx combustion in premix mode require the complete redesign of the combustion system to ensure stable operation without any flashback. Since all engine frames featuring a power range from 4 MW up to 600 MW are affected, a huge design effort is expected. To reduce this effort, especially to transfer knowledge between the different engine classes, generative design methods using latest AI technology will provide promising potential. In this work, this challenge is approached utilizing the current advances in generative artificial intelligence. We train an Invertible Neural Network (INN) on an expandable database of geometrically parameterized combustor designs with simulated performance labels. Utilizing the INN in its inverse direction, multiple design proposals are generated which fulfill specified performance labels.

[AI-30] Self-Abstraction Learning for Effective and Stable Training of Deep Neural Networks

【速读】:该论文旨在解决大规模深度神经网络在训练过程中普遍存在的优化难题,包括梯度消失(gradient vanishing)、过拟合(overfitting)以及学习过程不稳定等问题。其解决方案的关键在于提出一种分层的自抽象学习(Self-Abstraction Learning, SAL)框架,该框架按照结构复杂度组织网络层级,采用自顶向下的顺序引导机制:先训练最顶层结构最简的网络,再将其隐藏层和输出层作为指导信号用于训练下方更复杂的网络,从而有效缓解优化困难并实现深层架构的稳定训练。

链接: https://arxiv.org/abs/2604.24313
作者: Wonyong Cho,Taemin Kim,Jungmin Kim,Jeong-Rae Kim,Sung Hoon Jung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE Access. Under review

点击查看摘要

Abstract:Training large-scale deep neural networks effectively and stably is essential for applying deep learning across various fields. However, conventional methods, which rely on training a single large network, often encounter challenges such as gradient vanishing, overfitting and unstable learning. To overcome these limitations, we introduce Self-Abstraction Learning (SAL), a hierarchical framework. In SAL, networks are arranged by structural complexity, where the simplest topmost network is trained first and its hidden and output layers serve as guidance for the successively more complex networks below. This top-down sequential guidance effectively mitigates optimization issues, enabling stable training of deep architectures. Various experiments across MLP, CNN, and RNN architectures demonstrate that SAL consistently outperforms conventional methods, ensuring robust generalization even in data-scarce and complex network regimes.

[AI-31] SolarTformer: A Transformer Based Deep Learning Approach for Short Term Solar Power Forecasting

【速读】:该论文旨在解决短时太阳能发电功率预测的准确性问题,以提升可再生能源并网运行的效率与可靠性。其解决方案的关键在于提出一种基于注意力机制的深度学习模型 SolarTformer,该模型借鉴 Transformer 架构,利用自注意力(self-attention)机制有效捕捉太阳辐照度的时间依赖性和空间变异性;同时引入电站特有元数据(metadata),增强模型在不同地理位置、光伏板配置及季节条件下的泛化能力。实验表明,该方法显著优于现有模型,尤其在晴天和阴天均表现出强鲁棒性与普适性。

链接: https://arxiv.org/abs/2604.24306
作者: Ankan Basu,Jyotiraditya Roy,Aditya Datta,Prayas Sanyal,Sumanta Banerjee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: 14 pages, 5 figures

点击查看摘要

Abstract:Accurate forecasting of solar power output is essential for efficient integration of renewable energy into the grid. In this study, an attention-based deep learning model, inspired by transformer architecture, is used for short-term solar power forecasting. Our proposed model, “SolarTformer”, is designed to predict solar power output from meteorological data. Unlike traditional models, SolarTformer leverages self-attention mechanisms to effectively capture temporal dependencies and spatial variability in solar irradiance. In addition, the proposed methodology includes feeding power station-specific metadata into the model, which helps to generalize between power stations located at different locations and with different panel configurations and in different seasons. Our experiments demonstrate that SolarTformer significantly outperforms previous models on the same data set. In particular, the model exhibits strong performance on both clear and cloudy days, indicating high robustness and generalizability. These findings highlight the potential of attention-based architectures in enhancing the accuracy of solar forecasting, contributing to a more reliable management of renewable energy.

[AI-32] Latent-Hysteresis Graph ODEs: Modeling Coupled Topology-Feature Evolution via Continuous Phase Transitions

【速读】:该论文旨在解决图神经微分方程(Graph ODE)在长期演化中面临的单稳定性陷阱(monostability trap)问题,即当使用严格正的不可约混合算子时,信息泄漏不可避免,系统最终收敛至单一全局共识吸引子,导致模型丧失对复杂拓扑结构的表达能力。解决方案的关键在于提出迟滞图ODE(Hysteresis Graph ODE, HGODE),其通过引入由学习到的成对作用力驱动的潜在拓扑势,将特征演化与边状态的双井势(double-well edge potential)和极化门控机制耦合,使边状态能够在连通或绝缘相之间极化,同时保持可微性,从而实现动态拓扑调节并抑制信息坍缩。

链接: https://arxiv.org/abs/2604.24293
作者: Qinhan Hou,Jing Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 5 tables and 3 figures

点击查看摘要

Abstract:Graph neural ordinary differential equations (Graph ODEs) extend graph learning from discrete message-passing layers to continuous-time representation flows. While it supports adaptive long-range propagation, we show that Graph ODEs with strictly positive irreducible mixing operators face an inherent \emphmonostability trap: in the long-time regime, information leakage is unavoidable and the dynamics converge to a single global consensus attractor. We propose the \textbfHysteresis Graph ODE (HGODE), which couples feature evolution with a latent topological potential driven by a learned pairwise force. A double-well edge potential and bipolarized gate allow edge states to polarize into connected or insulated phases while preserving differentiability. We provide asymptotic analysis of the collapse mechanism and the proposed hysteretic topology dynamics, and validate HGODE on theory-driven synthetic diagnostics and real-world graph benchmarks.

[AI-33] RAS: a Reliability Oriented Metric for Automatic Speech Recognition

【速读】:该论文旨在解决自动语音识别(Automatic Speech Recognition, ASR)系统在噪声或歧义环境下产生高置信度但错误转录的问题,此类错误易误导用户及下游应用。传统基于词错误率(Word Error Rate, WER)的评估方法仅关注准确性,忽略了转录结果的可靠性。解决方案的关键在于提出一种“弃权感知”(abstention-aware)的转录框架,使ASR模型能够对不确定段落显式弃权;同时设计了RAS(Reliability-Aware Score)这一可靠性导向指标,平衡转录的信息量与错误规避能力,并通过人类偏好校准其权衡参数。最终通过监督引导训练结合强化学习实现模型优化,在保持竞争力准确率的同时显著提升转录可靠性。

链接: https://arxiv.org/abs/2604.24278
作者: Wenbin Huang,Yuhang Qiu,Bohan Li,Yiwei Guo,Jing Peng,Hankun Wang,Xie Chen,Kai Yu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 5 pages, 4 figures

点击查看摘要

Abstract:Automatic speech recognition systems often produce confident yet incorrect transcriptions under noisy or ambiguous conditions, which can be misleading for both users and downstream applications. Standard evaluation based on Word Error Rate focuses solely on accuracy and fails to capture transcription reliability. We introduce an abstention-aware transcription framework that enables ASR models to explicitly abstain from uncertain segments. To evaluate reliability under abstention, we propose RAS, a reliability-oriented metric that balances transcription informativeness and error aversion, with its trade-off parameter calibrated by human preference. We then train an abstention-aware ASR model through supervised bootstrapping followed by reinforcement learning. Our experiments demonstrate substantial improvements in transcription reliability while maintaining competitive accuracy.

[AI-34] Adaptive ToR: Complexity-Aware Tree-Based Retrieval for Pareto-Optimal Multi-Intent NLU

【速读】:该论文旨在解决多意图自然语言理解(Multi-intent Natural Language Understanding, Multi-intent NLU)中检索系统在准确率与计算效率之间难以平衡的问题。现有方法要么采用统一的单步检索策略导致召回率下降,要么使用固定深度的分层分解策略引入不必要的延迟,无法根据查询复杂度动态调整资源分配。其解决方案的核心是提出自适应检索树架构(Adaptive Tree-of-Retrieval, Adaptive ToR),通过四个关键模块实现复杂度感知的动态拓扑配置:(1) 查询树分类器基于加权语言信号计算查询复杂度指数,决定路由至快速单步路径或自适应深度分层路径;(2) 基于树的检索模块递归地将复杂查询分解为聚焦子查询并校准预测复杂度;(3) 自适应剪枝模块采用两阶段过滤机制抑制节点指数级增长;(4) 重排序层采用去重优先流水线和全局大语言模型(LLM)重评分以提升生产效率。实验证明该方案在NLU++基准上实现了显著的准确率提升(相对改进9.7%)及资源消耗降低(延迟减少37.6%,LLM调用减少43.0%),验证了复杂度感知资源分配的有效性与帕累托最优平衡。

链接: https://arxiv.org/abs/2604.24219
作者: Hee-Kyong Yoo,Wonbae Kim,Hyocheol Ahn
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 5 Figures, 4 Tables

点击查看摘要

Abstract:Multi-intent natural language understanding requires retrieval systems that simultaneously achieve high accuracy and computational efficiency, yet existing approaches apply either uniform single-step retrieval that compromises recall or fixed-depth hierarchical decomposition that introduces excessive latency regardless of query complexity. This paper proposes Adaptive Tree-of-Retrieval (Adaptive ToR), a complexity-aware retrieval architecture that dynamically configures retrieval topology based on query characteristics. The system integrates four components: (1) a Query Tree Classifier computing a Query Complexity Index from weighted linguistic signals to route queries to either a rapid single-step path or an adaptive-depth hierarchical path; (2) a Tree-Based Retrieval module that recursively decomposes complex queries into focused sub-queries calibrated to predicted complexity; (3) an Adaptive Pruning Module employing two-stage filtering combining quantitative similarity gating with semantic relevance evaluation to suppress exponential node growth; and (4) a Retrieval Reranking Layer featuring a deduplicator-first pipeline and global LLM rescoring for production efficiency. Evaluation on the NLU++ benchmark (2,693 multi-intent queries across Banking and Hotel domains) yields 29.07% Subset Accuracy and 71.79% Micro-F1, a 9.7% relative improvement over fixed-depth baselines, while reducing latency by 37.6%, LLM invocations by 43.0%, and token consumption by 9.8%. Depth-wise analysis reveals that 26.92% of queries resolve within three seconds (2.45s mean latency) via single-step routing (d=0: 37.9% Subset Accuracy, 74.8% Micro-F1), while token consumption scales by 4.9x across depths, validating complexity-aware resource allocation and establishing Pareto-optimal balance across accuracy, latency, and computational efficiency.

[AI-35] RefEvo: Agent ic Design with Co-Evolutionary Verification for Agile Reference Model Generation

【速读】:该论文旨在解决生成式 AI 在硬件参考模型(reference models)构建中面临的三大核心挑战:(1)静态工作流无法适应不同设计复杂度导致效率低下;(2)多轮交互中上下文窗口溢出引发关键规范的灾难性遗忘;(3)耦合验证失败问题——即测试平台(Testbench, TB)因相关幻觉错误地验证有缺陷的模型,严重削弱可靠性。解决方案的关键在于提出 RefEvo,一个动态多智能体框架,其创新点包括:(1)动态设计规划器(Dynamic Design Planner),根据语义复杂度自动分解规范并构建定制化执行流程;(2)协同进化验证机制(Co-Evolutionary Verification Mechanism),通过辩证仲裁器(Dialectical Arbiter)同时修正模型与验证逻辑以对抗虚假阳性;(3)规范锚定策略(Spec Anchoring Strategy),实现无损上下文压缩,显著降低token消耗并保障规范召回率。

链接: https://arxiv.org/abs/2604.24218
作者: Yifan Zhang,Jianmin Ye,Jiahao Yang,Xi Wang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 6 pages, 7 figures, accepted by ISEDA2026

点击查看摘要

Abstract:As the complexity of System-on-Chip (SoC) designs grows, the shift-left paradigm necessitates the rapid development of high-fidelity reference models (typically written in SystemC) for early architecture exploration and verification. While Large Language Models (LLMs) show promise in code generation, their application to hardware modeling faces unique challenges: (1) Rigid, static workflows fail to adapt to varying design complexity, causing inefficiency; (2) Context window overflow in multi-turn interactions leads to catastrophic forgetting of critical specifications; and (3) the Coupled Validation Failure problem–where generated Testbenches (TBs) incorrectly validate flawed models due to correlated hallucinations–severely undermines reliability. To address these limitations, we introduce RefEvo, a dynamic multi-agent framework designed for agile and reliable reference modeling. RefEvo features three key innovations: (1) A Dynamic Design Planner that autonomously decomposes design specifications and constructs tailored execution workflows based on semantic complexity; (2) A Co-Evolutionary Verification Mechanism, which employs a Dialectical Arbiter to simultaneously rectify the model and verification logic against the specification (Spec) oracle, effectively mitigating false positives; and (3) A Spec Anchoring Strategy for lossless context compression. Evaluated on a diverse benchmark of 20 hardware modules, RefEvo achieves a 95% pass rate, outperforming static baselines by a large margin. Furthermore, our context optimization reduces token consumption by an average of 71.04%, achieving absolute savings of over 70,000 tokens per session for complex designs while maintaining 100% specification recall.

[AI-36] Speech Enhancement Based on Drifting Models

【速读】:该论文旨在解决语音增强(Speech Enhancement)中传统方法依赖多步迭代采样、计算效率低且难以利用未配对数据的问题。其解决方案的关键在于提出基于漂移模型(Drifting Models)的生成框架 DriftSE,将去噪建模为一个平衡问题:通过学习一个“漂移场”(Drifting Field)——即一个可微分的修正向量场——直接引导噪声语音的分布演化至干净语音分布的高密度区域,从而实现单步推断(one-step inference)。此机制不依赖于成对训练样本,而是通过分布匹配进行训练,显著提升了推理效率并增强了对未配对数据的适应能力。

链接: https://arxiv.org/abs/2604.24199
作者: Liang Xu,Diego Caviedes-Nozal,Bastiaan Kleijn,Longfei Felix Yan,Rasmus Kongsgaard Olsson
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
备注: 6 pages, 2 figures

点击查看摘要

Abstract:We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.

[AI-37] Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLM s Alignment

【速读】:该论文旨在解决多目标对齐(Multi-Objective Alignment)中因采用静态偏好权重构建策略而导致的训练信息损失问题,即固定目标会忽略响应生成过程中蕴含的有效偏好权衡信息。解决方案的关键在于提出Meal(MEta ALigner),一个双层元学习框架,实现偏好与策略响应之间的双向优化:通过引入偏好权重网络(preference-weight-net)作为元学习器,根据输入提示动态生成可学习的偏好权重;同时将大语言模型(LLM)策略作为基学习器,在拒绝采样策略下基于这些动态偏好生成响应,从而在训练中捕捉更稳定的偏好-策略交互关系。

链接: https://arxiv.org/abs/2604.24178
作者: Wenzhe Xu,Biao Liu,Yiyang Sun,Xin Geng,Ning Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-Objective Alignment aims to align Large Language Models (LLMs) with diverse and often conflicting human values by optimizing multiple objectives simultaneously. Existing methods predominantly rely on static preference weight construction strategies. However, rigidly aligning to fixed targets discards valuable intermediate information, as training responses inherently embody valid preference trade-offs even when deviating from the target. To address this limitation, we propose Meal, i.e., MEta ALigner, a bi-level meta-learning framework enabling bidirectional optimization between preferences and policy responses, generating instructive dynamic preferences for steadier training. Specifically, we introduce a preference-weight-net as a meta-learner to generate adaptive preference weights based on input prompts and update the preference weights as learnable parameters, while the LLM policy acts as a base-learner optimizing response generation conditioned on these preferences with rejection sampling strategy. Extensive empirical results demonstrate that our method achieves superior performance on several multi-objective benchmarks, validating the effectiveness of the dynamic bidirectional preference-policy optimization framework.

[AI-38] Explanation Quality Assessment as Ranking with Listwise Rewards

【速读】:该论文旨在解决生成式 AI (Generative AI) 中解释质量评估(explanation quality assessment)的难题,传统方法通常将此任务建模为生成问题,即优化模型逐 token 生成最优解释,但这种方法难以有效衡量解释之间的相对优劣。论文提出将解释质量评估重构为排序问题(ranking problem),通过训练奖励模型(reward model)对多个候选解释进行排序而非生成单一解释,从而更准确地捕捉解释间的相对质量差异。其关键创新在于构建具有分级质量水平的实例级候选集,并采用列表级(listwise)和成对级(pairwise)排序损失函数(如 ListNet、LambdaRank、RankNet),以保留序关系并避免点回归或二元偏好目标常见的分数压缩问题。实验表明,这种基于排序的奖励机制不仅在不同领域均优于回归方法,且在高质量数据下小规模编码器也能媲美大规模模型,同时在策略优化中实现稳定收敛,凸显了数据结构化与排序建模的核心作用。

链接: https://arxiv.org/abs/2604.24176
作者: Thomas Bailleux,Tanmoy Mukherjee,Emmanuel Lonca,Pierre Marquis,Zied Bouraoui
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We reformulate explanation quality assessment as a ranking problem rather than a generation problem. Instead of optimizing models to produce a single “best” explanation token-by-token, we train reward models to discriminate among multiple candidate explanations and learn their relative quality. Concretely, we construct per-instance candidate sets with graded quality levels and train listwise and pairwise ranking models (ListNet, LambdaRank, RankNet) to preserve ordinal structure and avoid score compression typical of pointwise regression or binary preference objectives. We observe three findings: First, ranking losses consistently outperform regression on score separation across all domains tested. Second, the optimal ranking loss depends on data characteristics: listwise objectives excel with well-separated quality tiers, while pairwise methods are more robust to noisy natural annotations. Third, when trained on carefully curated and well-structured data, small encoder models can match models that are orders of magnitude larger, suggesting that data quality matters more than model scale. Finally, when used as rewards in policy optimization, ranking-based scores enable stable convergence in settings where regression-based rewards fail entirely. Code and data are available at: this https URL

[AI-39] Credal Concept Bottleneck Models for Epistemic-Aleatoric Uncertainty Decomposition

【速读】:该论文旨在解决概念瓶颈模型(Concept Bottleneck Models, CBMs)在输出点估计的概念概率时,无法区分认知不确定性(epistemic uncertainty,可由模型改进减少)与固有不确定性(aleatoric uncertainty,由输入本身的模糊性导致)的问题,这使得概念层面的不确定性难以解释且难以用于指导决策。其解决方案的关键在于提出CREDENCE(Credal Ensemble Concept Estimation)框架,通过构造性地将每个概念表示为概率区间(即credal prediction),利用多个异构概念头之间的分歧来推断认知不确定性,并通过一个专门训练的歧义输出模块来估计固有不确定性——该模块在有标注者分歧数据时能匹配这种分歧。由此获得的两类不确定性信号支持可操作的决策:低认知不确定性时自动处理,高认知不确定性时优先数据收集,高固有不确定性时转交人工审核,两者均高时则选择弃权。

链接: https://arxiv.org/abs/2604.24170
作者: Tanmoy Mukherjee,Thomas Bailleux,Pierre Marquis,Zied Bouraoui
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Concept Bottleneck Models (CBMs) predict through human-interpretable concepts, but they typically output point concept probabilities that conflate epistemic uncertainty (reducible model underspecification) with aleatoric uncertainty (irreducible input ambiguity). This makes concept-level uncertainty hard to interpret and, more importantly, hard to act upon. We introduce CREDENCE (Credal Ensemble Concept Estimation), a CBM framework that decomposes concept uncertainty by construction. CREDENCE represents each concept as a credal prediction (a probability interval), derives epistemic uncertainty from disagreement across diverse concept heads, and estimates aleatoric uncertainty via a dedicated ambiguity output trained to match annotator disagreement when available. The resulting signals support prescriptive decisions: automate low-uncertainty cases, prioritize data collection for high-epistemic cases, route high-aleatoric cases to human review, and abstain when both are high. Across several tasks, we show that epistemic uncertainty is positively associated with prediction errors, whereas aleatoric uncertainty closely tracks annotator disagreement, providing guidance beyond error correlation. Our implementation is available at the following link: this https URL

[AI-40] Defusing the Trigger: Plug-and-Play Defense for Backdoored LLM s via Tail-Risk Intrinsic Geometric Smoothing

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在面对后门攻击时的防御难题,此类攻击通常通过污染训练数据植入隐蔽触发器,导致模型在特定输入下产生恶意输出。现有防御方法要么依赖高成本的离线净化过程损害模型性能,要么采用复杂的在线干预引入显著延迟,难以兼顾安全性、可用性和效率。解决方案的关键在于提出一种即插即用的推理时防御机制——尾风险内在几何平滑(Tail-risk Intrinsic Geometric Smoothing, TIGS),其核心创新是利用后门触发器引发注意力机制中语义区域内的局部注意力坍缩现象,在不修改模型参数、无需外部清洁数据或辅助生成的前提下,仅通过原生前向传播完成检测与修正:首先基于样本内信号进行内容感知的尾风险筛选以识别可疑注意力头和行;随后实施内在几何平滑——弱形式的内容域校正保持语义锚定,强形式的全行收缩破坏触发主导路由;最终通过可控的全行写回重构注意力矩阵,确保推理稳定性。实验表明,TIGS在大幅降低攻击成功率的同时严格保留干净样本的推理能力与开放语义一致性,并在密集型、推理导向型及稀疏专家混合(Mixture-of-Experts)等多种架构中均实现安全-效用-延迟的最优平衡。

链接: https://arxiv.org/abs/2604.24162
作者: Kaisheng Fan,Weizhe Zhang,Yishu Gao,Tegawendé F. Bissyandé,Xunzhu Tang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Defending against backdoor attacks in large language models remains a critical practical challenge. Existing defenses mitigate these threats but typically incur high preparation costs and degrade utility via offline purification, or introduce severe latency via complex online interventions. To overcome this dichotomy, we present Tail-risk Intrinsic Geometric Smoothing (TIGS), a plug-and-play inference-time defense requiring no parameter updates, external clean data, or auxiliary generation. TIGS leverages the observation that successful backdoor triggers consistently induce localized attention collapse within the semantic content region. Operating entirely within the native forward pass, TIGS first performs content-aware tail-risk screening to identify suspicious attention heads and rows using sample-internal signals. It then applies intrinsic geometric smoothing: a weak content-domain correction preserves semantic anchoring, while a stronger full-row contraction disrupts trigger-dominant routing. Finally, a controlled full-row write-back reconstructs the attention matrix to ensure inference stability. Extensive evaluations demonstrate that TIGS substantially suppresses attack success rates while strictly preserving clean reasoning and open-ended semantic consistency. Crucially, this favorable security-utility-latency equilibrium persists across diverse architectures, including dense, reasoning-oriented, and sparse mixture-of-experts models. By structurally disrupting adversarial routing with marginal latency overhead, TIGS establishes a highly practical, deployment-ready defense standard for state-of-the-art LLMs.

[AI-41] Multi-Dimensional Evaluation of Sustainable City Trips with LLM -as-a-Judge and Human-in-the-Loop

【速读】:该论文旨在解决在缺乏充足人工标注的情况下,如何有效评估生成式 AI (Generative AI) 在城市旅行推荐场景中所输出的推荐列表的质量问题,尤其关注多维目标(相关性、多样性、可持续性与受欢迎程度平衡)的量化评估。传统指标往往忽略利益相关者导向的目标,难以捕捉推荐系统的复杂价值维度。其解决方案的关键在于提出一个三阶段校准框架:首先利用多个大语言模型(LLM)进行基线判断;其次通过专家评估识别系统性偏差;最后基于规则和少量示例对各维度进行针对性校准,从而提升评价的一致性和透明度,同时揭示了可持续性概念在不同模型间存在解释分歧,强调了开发具备偏见感知能力的 LLM 评估机制的重要性。

链接: https://arxiv.org/abs/2604.24158
作者: Ashmi Banerjee,Adithi Satish,Wolfgang Wörndl,Yashar Deldjoo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluating nuanced conversational travel recommendations is challenging when human annotations are costly and standard metrics ignore stakeholder-centric goals. We study LLMs-as-Judges for sustainable city-trip lists across four dimensions – relevance, diversity, sustainability, and popularity balance, and propose a three-phase calibration framework: (1) baseline judging with multiple LLMs, (2) expert evaluation to identify systematic misalignment, and (3) dimension-specific calibration via rules and few-shot examples. Across two recommendation settings, we observe model-specific biases and high dimension-level variance, even when judges agree on overall rankings. Calibration clarifies reasoning per dimension but exposes divergent interpretations of sustainability, highlighting the need for transparent, bias-aware LLM evaluation. Prompts and code are released for reproducibility: this https URL.

[AI-42] Strategic Bidding in 6G Spectrum Auctions with Large Language Models

【速读】:该论文旨在解决6G网络中频谱资源分配的效率与公平性问题,特别是在车辆网络场景下,面对海量连接和异构服务对有限无线资源的竞争,如何设计具备自适应能力的拍卖机制以提升长期用户效用。其解决方案的关键在于引入大型语言模型(Large Language Models, LLMs)作为投标代理,在具有预算约束的重复频谱拍卖中,通过历史结果学习与提示驱动推理动态调整投标策略,从而在理论假设失效时仍能实现比传统诚实策略和启发式方法更高的参与持续性和效用水平,展现出超越静态机制设计的自适应均衡逼近能力。

链接: https://arxiv.org/abs/2604.24156
作者: Ismail Lotfi,Ali Ghrayeb
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注: Accepted at IEEE Transactions on Vehicular Technology

点击查看摘要

Abstract:Efficient and fair spectrum allocation is a central challenge in 6G networks, where massive connectivity and heterogeneous services continuously compete for limited radio resources. We investigate the use of Large Language Models (LLMs) as bidding agents in repeated 6G spectrum auctions with budget constraints in vehicular networks. Each user equipment (UE) acts as a rational player optimizing its long-term utility through repeated interactions. Using the Vickrey-Clarke-Groves (VCG) mechanism as a benchmark for incentive-compatible, dominant-strategy truthfulness, we compare LLM-guided bidding against truthful and heuristic strategies. Unlike heuristics, LLMs leverage historical outcomes and prompt-based reasoning to adapt their bidding behavior dynamically. Results show that when the theoretical assumptions guaranteeing truthfulness hold, LLM bidders recover near-equilibrium outcomes consistent with VCG predictions. However, when these assumptions break – such as under static budget constraints – LLMs sustain longer participation and achieve higher utilities, revealing their ability to approximate adaptive equilibria beyond static mechanism design. This work provides the first systematic evaluation of LLM bidders in repeated spectrum auctions, offering new insights into how AI-driven agents can interact strategically and reshape market dynamics in future 6G networks.

[AI-43] Progressive Approximation in Deep Residual Networks: Theory and Validation

【速读】:该论文旨在解决残差网络(Residual Networks)中近似能力分布不明确的问题,即通用逼近定理(Universal Approximation Theorem, UAT)虽保证了深度模型的表达能力,但未阐明残差结构如何在各层间分配近似任务。其核心解决方案是提出逐层渐进近似(Layer-wise Progressive Approximation, LPA),通过将残差网络建模为从输入到目标的逐层逼近轨迹,并证明存在误差随深度单调下降的“渐进路径”,从而实现结构化的、分步优化的函数逼近机制,而非端到端(End-to-End, E2E)黑箱映射。LPA是一种与架构无关的训练原则,可显式对齐每一层与其残差目标,使单个模型在不同深度均能提供有效预测,支持无需重训的浅层推理,统一了逼近理论与实际深度学习实践。

链接: https://arxiv.org/abs/2604.24154
作者: Wei Wang,Xiao-Yong Wei,Qing Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Universal Approximation Theorem (UAT) guarantees universal function approximation but does not explain how residual models distribute approximation across layers. We reframe residual networks as a layer-wise approximation process that builds an approximation trajectory from input to target, and prove the existence of progressive trajectories where error decreases monotonically with depth. It reveals that residual networks can implement structured, step-by-step refinement rather than end-to-end (E2E) black-box mapping. Building on this, we propose Layer-wise Progressive Approximation (LPA), a theoretically grounded training principle that explicitly aligns each layer with its residual target to realize such trajectories. LPA is architecture-agnostic: we observe progressive behavior in residual FNNs, ResNets, and Transformers across tasks including complex surface fitting, image classification, and NLP with LLMs for generation and classification. Crucially, this enables ``train once, use N models": a single network yields useful predictions at every depth, supporting efficient shallow inference without retraining. Our work unifies approximation theory with practical deep learning, providing a new lens on representation learning and a flexible framework for multi-depth deployment. The source code will be released unpon acceptance at https://(open_upon_acceptance).

[AI-44] Right-to-Act: A Pre-Execution Non-Compensatory Decision Protocol for AI Systems

【速读】:该论文旨在解决当前人工智能(AI)系统在生成决策后直接触发现实世界行动时缺乏前置控制机制的问题,尤其针对现有AI安全与治理方法多依赖事后验证或概率风险评估、隐含假设“一旦决策生成即允许执行”的局限性。其解决方案的关键在于提出Right-to-Act协议,这是一种确定性的、非补偿性的预执行决策层,用于判定AI生成的决策是否具备被执行的合法性资格;该协议通过严格的结构化约束实现:只要任一必要条件未满足,则禁止或推迟执行,从而避免不可逆操作的发生,将AI控制从优化决策转向管控其可执行性,并形成独立于模型架构和训练方法的协议级抽象。

链接: https://arxiv.org/abs/2604.24153
作者: Gadi Lavi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 3 figures. Introduces a pre-execution decision protocol for AI systems

点击查看摘要

Abstract:Current AI systems increasingly operate in contexts where their outputs directly trigger real-world actions. Most existing approaches to AI safety, risk management, and governance focus on post-hoc validation, probabilistic risk estimation, or certification of model behavior. However, these approaches implicitly assume that once a decision is produced, it is eligible for execution. In this work, we introduce the Right-to-Act protocol, a deterministic, non-compensatory pre-execution decision layer that evaluates whether an AI-generated decision is permitted to be realized at all. Unlike compensatory systems, where high-confidence signals can override failed conditions, the proposed framework enforces strict structural constraints: if any required condition is unmet, execution is halted or deferred. We formalize the distinction between compensatory and non-compensatory decision regimes and define a pre-execution legitimacy boundary. Through a scenario-based case study, we demonstrate how identical AI outputs can lead to divergent outcomes when evaluated under a Right-to-Act protocol, preserving reversibility and preventing premature or irreversible actions. The proposed approach reframes AI control from optimizing decisions to governing their admissibility, introducing a protocol-level abstraction that operates independently of model architecture or training methodology.

[AI-45] Leverag ing Human Feedback for Semantically-Relevant Skill Discovery ICPR2026

【速读】:该论文旨在解决无监督技能发现(Unsupervised Skill Discovery)在强化学习中因缺乏约束而导致的潜在不安全、不道德或与人类意图不一致的行为问题,同时克服现有基于人类偏好反馈的方法在处理多样化技能空间(如行走、跳跃、奔跑等)时存在的反馈效率低下的局限。解决方案的关键在于引入语义标注(Semantic Labelling),利用人类认知优势识别并标注具有语义意义的行为,并据此提出“语义相关技能发现”(Semantically Relevant Skill Discovery, SRSD)方法:通过收集人类提供的语义标签来学习奖励函数,从而引导代理生成更具语义多样性和相关性的技能。实验表明,SRSD在二维导航和四种运动控制环境中均能有效提升技能的语义多样性并实现可扩展的多类行为发现。

链接: https://arxiv.org/abs/2604.24127
作者: Maxence Hussonnois,Thommen George Karimpanal,Santu Rana
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the 28th International Conference on Pattern Recognition (ICPR 2026)

点击查看摘要

Abstract:Unsupervised skill discovery in reinforcement learning aims to intrinsically motivate agents to discover diverse and useful behaviours. However, unconstrained approaches can produce unsafe, unethical, or misaligned behaviours. To mitigate these risks and improve the practical desireability of discovered skills, recent work grounds the discovery process by leveraging human preference feedback. However, preference-based approaches are feedback-inefficient and inherently ill-equipped to deal with skill spaces composed of a variety of different skills such as running, jumping, walking, etc. To overcome this limitation, we introduce semantic labelling, a novel and feedback-efficient approach that leverages human cognitive strengths to identify and label semantically meaningful behaviours. Based on semantic labelling, we propose Semantically Relevant Skill Discovery (SRSD), a novel human-in-the-loop approach that collects semantic labels from human feedback and learns a reward function to encourage skills to be more semantically diverse and relevant. Through our experiments in a 2D navigation environment and four locomotion environments, we demonstrate that SRSD can improve semantic diversity and discover relevant behaviours while scaling effectively to a large variety of behaviours.

[AI-46] An Analysis of the Coordination Gap between Joint and Modular Learning for Job Shop Scheduling with Transportation Resources

【速读】:该论文旨在解决在具有运输资源的作业车间调度问题(Job-Shop Scheduling Problem, JSSP)中,如何选择合适的训练策略以实现最优调度性能的问题。具体而言,研究聚焦于联合训练(joint training)与模块化训练(modular training)两种方式在不同环境条件下的适用性差异,尤其是识别何时联合训练对于提升调度效率是必要的。解决方案的关键在于通过系统性的敏感性分析,量化“协调差距”(coordination gap)——即两种训练模式在性能上的差异,并发现:在资源稀缺或时间主导性强的场景下,联合训练能显著优于模块化训练;但在瓶颈环境中(如严重受限的运输和加工约束),协调优势减弱,此时模块化训练成为更具可行性的替代方案。这一发现为基于强化学习的调度系统设计提供了可操作的决策依据。

链接: https://arxiv.org/abs/2604.24117
作者: Moritz Link,Jonathan Hoss,Noah Klarmann
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Supported by the Chips Joint Undertaking and its members, including top-up funding by National Authorities, within the Cynergy4MIE project (Grant Agreement No. 101140226). This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Efficient job-shop scheduling with transportation resources is critical for high-performance manufacturing. With the rise of “decentralized factories”, multi-agent reinforcement learning has emerged as a promising approach for the combined scheduling of production and transportation tasks. Prior work has largely focused on developing novel cooperative architectures while overlooking the question of when joint training is necessary. Joint training denotes the simultaneous training of job and automatic guided vehicle scheduling agents, whereas modular training involves independently training each agent followed by post-hoc integration. In this study, we systematically investigate the conditions under which joint training is essential for optimal performance in the job-shop scheduling problem with transportation resources. Through a rigorous sensitivity analysis of resource scarcity and temporal dominance, we quantify the coordination gap – the performance difference between these two training modalities. In our evaluation, the joint training can produce superior performance compared to the best-performing combinations of dispatching rules and modular training. However, the coordination gap advantage diminishes in bottleneck environments, particularly under severe transport and processing constraints. These findings indicate that modular training represents a viable alternative in environments where a single scheduling task dominates. Overall, our work provides practical guidance for selecting between training modalities based on environmental conditions, enabling decision-makers to optimize reinforcement learning-based scheduling performance.

[AI-47] Latency and Cost of Multi-Agent Intelligent Tutoring at Scale

【速读】:该论文旨在解决多智能体大语言模型(Multi-agent Large Language Model, Multi-agent LLM)辅导系统在高并发场景下响应延迟显著增加的问题,其核心挑战在于并行阶段的最大延迟效应(parallel-phase maximum effect),即每个学生请求触发多个并发API调用,导致延迟随并发数增长而累积。解决方案的关键在于通过精细化的基础设施调度策略与资源定价模型优化:实验对比了三种Google Vertex AI服务层级(Standard PayGo、Priority PayGo和Provisioned Throughput),发现Priority PayGo可在全负载范围内保持低于4秒的稳定延迟,而Provisioned Throughput在低并发时延迟最低但超过约20个并发用户后饱和;同时成本分析表明,在合理使用场景下(如可预测流量集中),Provisioned Throughput具备成本竞争力,为不同规模部署(从单个研讨班到全校范围)提供可量化的资源选择依据。

链接: https://arxiv.org/abs/2604.24110
作者: Iizalaarab Elhaimeur,Nikos Chrisochoides
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: 11 pages, 5 figures, 5 tables. Companion papers: arXiv:Q-ID (Quantum deployment), arXiv:A-ID (Architecture)

点击查看摘要

Abstract:Multi-agent LLM tutoring systems improve response quality through agent specialization, but each student query triggers several concurrent API calls whose latencies compound through a parallel-phase maximum effect that single-agent systems do not face. We instrument ITAS, a four-agent tutoring system built on Gemini 2.5 Flash and Google Vertex AI, across three throughput tiers (Standard PayGo, Priority PayGo, and Provisioned Throughput) and eleven concurrency levels up to 50 simultaneous users, producing over 3,000 requests drawn from a live graduate STEM deployment. Priority PayGo maintains flat sub-4-second response times across the full load range; Standard PayGo degrades substantially under classroom-scale concurrency; and Provisioned Throughput delivers the lowest latency at low concurrency but saturates its reserved capacity above approximately 20 concurrent users. Cost analysis places both pay-per-token tiers well below the price of a STEM textbook per student per semester under a worst-case usage ceiling. Provisioned Throughput, expensive under continuous provisioning, becomes cost-competitive for institutions that can predict and concentrate their traffic toward high utilization. These results provide concrete tier-selection guidance across deployment scales from a single seminar to a university-wide rollout.

[AI-48] SemML 2.0: Synthesizing Controllers for LTL

【速读】:该论文旨在解决从线性时序逻辑(Linear Temporal Logic, LTL)规范中自动合成反应式系统(reactive system)的问题,此类问题在安全关键系统设计中具有重要应用。其核心挑战在于如何高效地生成满足LTL规范的Mealy机或AIGER电路表示,并在保证解质量的前提下提升求解速度与规模。解决方案的关键在于SemML工具的第二版实现:一方面采用经典的自动机理论方法,结合部分探索(partial exploration)和机器学习引导(machine-learning guidance)以提高搜索效率;另一方面引入多种启发式策略与经典算法改进,用于提取更紧凑的系统表示。实验表明,该方法在SYNTCOMP竞赛数据集上显著优于现有主流工具(如Strix、LtlSynt及前一版本SemML),不仅求解更多实例,且耗时更少,同时保持了最优的解质量。

链接: https://arxiv.org/abs/2604.24102
作者: Jan Křetínský,Tobias Meggendorfer,Maximilian Prokop
机构: 未知
类目: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Synthesizing a reactive system from specifications given in linear temporal logic (LTL) is a classical problem, finding its applications in safety-critical systems design. These systems are typically represented using either Mealy machines or AIGER circuits. We present the second version of SemML, which outperforms all state-of-the-art tools for finding either solution. Aside from implementing the classical automata-theoretic approach, our tool utilizes partial exploration and machine-learning guidance for obtaining solutions efficiently, and numerous heuristics and improvements of classic algorithms for extracting small representations of these solutions. We evaluate our tool against the existing state-of-the-art tools (in particular Strix, LtlSynt, and the previous version of SemML) on the dataset of the synthesis competition SYNTCOMP. We show that we solve significantly more instances and do so much faster than other tools, while maintaining state-of-the-art solution quality.

[AI-49] Meta-Ensemble Learning with Diverse Data Splits for Improved Respiratory Sound Classification

【速读】:该论文旨在解决呼吸音分类模型在小规模且人群多样性不足的数据集上训练时可靠性差的问题。其关键解决方案是采用一种元集成学习(meta-ensemble learning)方法,通过在不同数据划分策略(固定80-20%分割与五折交叉验证)和不同数据粒度层级(患者级与样本级)下训练基础模型,从而增强基础模型预测的多样性,并利用一个经过训练的元模型对这些多样化的输出进行融合,最终提升了模型的泛化能力与性能,在ICBHI基准测试中达到66.49分的新SOTA结果,并在两个分布外数据集上表现出更强的适应性。

链接: https://arxiv.org/abs/2604.24096
作者: June-Woo Kim,Miika Toikkanen,Heejoon Koo,Yoon Tae Kim,Doyoung Kwon,Kyunghoon Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: EMBC 2026 Accepted

点击查看摘要

Abstract:Training reliable respiratory sound classification models remains challenging due to the limited size and subject diversity of datasets. Ensemble methods can improve robustness, but when base models are trained on identical data, models tend to overfit and produce highly correlated predictions, thereby reducing the effectiveness of ensembling. In this work, we investigate a meta-ensemble learning methodology that enhances prediction diversity by training base models on diverse data splits and combining their outputs through a trained meta-model. Specifically, we train base models on the ICBHI dataset using two data split settings: fixed 80-20% split and five-fold cross-validation split, under two data granularity settings: patient- and sample-level. The resulting diversity in base model predictions enables the meta-model to better generalize. Our approach achieves new state-of-the-art performance on the ICBHI benchmark, reaching a Score of 66.49% and showing improved generalization on two out-of-distribution datasets, indicating its potential applicability to real-world clinical data.

[AI-50] ACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training

【速读】:该论文旨在解决大规模张量并行(Tensor-Parallel, TP)训练中因中间张量具有稀疏且接近零的分布特性,导致频繁通信时误差加剧以及压缩过程引入显著计算开销的问题。解决方案的关键在于提出TACO(Tensor-parallel Adaptive COmmunication compression)框架:首先,采用数据驱动的重塑策略结合自适应缩放-哈达玛变换(Adaptive Scale-Hadamard Transform),实现高保真度的FP8量化,并通过双尺度量化(Dual-Scale Quantization)机制保障训练全程的数值稳定性;其次,设计高度融合的压缩算子以降低内存访问和核启动开销,从而实现与通信的有效重叠;最终将TACO集成至现有数据并行(Data Parallelism)和流水线并行(Pipeline Parallelism)方法中,构建支持压缩的三维并行(3D-Parallel)训练框架,在GPT和Qwen模型上验证了其在保持近无损精度的同时实现最高1.87倍端到端吞吐量提升。

链接: https://arxiv.org/abs/2604.24088
作者: Man Liu,Xingchen Liu,Xingjian Tian,Bing Lu,Shengkay Lyu,Shengquan Yin,Wenjing Huang,Zheng Wei,Hairui Zhao,Guangming Tan,Dingwen Tao
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: Accepted by HPDC’26, 12 pages, 17 figures, 3 tables

点击查看摘要

Abstract:Handling communication overhead in large-scale tensor-parallel training remains a critical challenge due to the dense, near-zero distributions of intermediate tensors, which exacerbate errors under frequent communication and introduce significant computational overhead during compression. To this end, we propose TACO (Tensor-parallel Adaptive COmmunication compression), a robust FP8-based framework for compressing TP intermediate tensors. First, we employ a data-driven reshaping strategy combined with an Adaptive Scale-Hadamard Transform to enable high-fidelity FP8 quantization, while its Dual-Scale Quantization mechanism ensures numerical stability throughout training. Second, we design a highly fused compression operator to reduce memory traffic and kernel launch overhead, allowing efficient overlap with communication. Finally, we integrate TACO with existing state-of-the-art methods for Data and Pipeline Parallelism to develop a compression-enabled 3D-parallel training framework. Detailed experiments on GPT models and Qwen model demonstrate up to 1.87X end-to-end throughput improvement while maintaining near-lossless accuracy, validating the effectiveness and efficiency of TACO in large-scale training.

[AI-51] AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation

【速读】:该论文旨在解决基于云端部署的视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人连续移动导航中因网络延迟导致的时间-空间错位问题,即由于推理延迟使得过去帧中的意图在当前帧中发生空间偏移,从而引发碰撞风险。解决方案的关键在于提出一种即插即用的异步控制框架AsyncShield,其核心创新是摒弃传统黑箱时间序列预测方法,转而采用确定性的物理白盒空间映射机制:通过维护一个时序位姿缓冲区并利用运动学变换,将时间滞后精确转换为空间位姿偏移,以恢复VLA原始几何意图;同时,为平衡意图还原精度与物理安全性,将边缘适应建模为约束马尔可夫决策过程(Constrained Markov Decision Process, CMDP),并通过PPO-Lagrangian算法求解,使强化学习适配器动态权衡对VLA意图的跟踪与高频LiDAR障碍物避障硬约束的响应。

链接: https://arxiv.org/abs/2604.24086
作者: Kai Yang,Zedong Chu,Yingnan Guo,Zhengbo Wang,Shichao Xie,Yanfen Shen,Xiaolong Wu,Xing Li,Mu Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures, 4 tables

点击查看摘要

Abstract:While Vision-Language-Action (VLA) models have been demonstrated possessing strong zero-shot generalization for robot control, their massive parameter sizes typically necessitate cloud-based deployment. However, cloud deployment introduces network jitter and inference latency, which can induce severe spatiotemporal misalignment in mobile navigation under continuous displacement, so that the stale intents expressed in past ego frames may become spatially incorrect in the current frame and lead to collisions. To address this issue, we propose AsyncShield, a plug-and-play asynchronous control framework. AsyncShield discards traditional black-box time-series prediction in favor of a deterministic physical white-box spatial mapping. By maintaining a temporal pose buffer and utilizing kinematic transformations, the system accurately converts temporal lag into spatial pose offsets to restore the VLA’s original geometric intent. To balance intent restoration fidelity and physical safety, the edge adaptation is formulated as a constrained Markov decision process (CMDP). Solved via the PPO-Lagrangian algorithm, a reinforcement learning adapter dynamically trades off between tracking the VLA intent and responding to high-frequency LiDAR obstacle avoidance hard constraints. Furthermore, benefiting from a standardized universal sub-goal interface, domain randomization, and perception-level adaptation via Collision Radius Inflation, AsyncShield operates as a lightweight, plug-and-play module. Simulation and real-world experiments demonstrate that, without fine-tuning any cloud-based foundation models, the framework exhibits zero-shot and robust generalization capabilities, effectively improving the success rate and physical safety of asynchronous navigation.

[AI-52] Grounding Before Generalizing: How AI Differs from Humans in Causal Transfer

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)和视觉语言模型(Vision Language Models, VLMs)在交互式因果学习中的抽象结构迁移能力不足的问题,特别是它们是否具备人类那种通过少量经验即可跨情境迁移因果结构的能力。研究的关键在于设计了一个名为OpenLock的范式,用于系统性地识别共同原因(Common Cause, CC)和共同效应(Common Effect, CE)等潜在因果结构,并通过对比人类与模型在文本、图像及多模态条件下的表现,揭示了模型存在“环境锚定”(environmental grounding)现象——即必须先建立特定环境下的映射才能获得效率提升,而人类则能直接利用已有结构知识进行首次尝试;此外,模型在视觉信息下反而表现下降,且表现出与人类不同的CC/CE不对称性,说明其并非基于方向中立的因果抽象,而是依赖符号处理和启发式偏差。这一发现表明,当前LLMs和VLMs缺乏去情境化的因果模式表征,这是其类人推理能力的根本局限。

链接: https://arxiv.org/abs/2604.24062
作者: Liangru Xiang,Yuxi Ma,Zhihao Cao,Yixin Zhu,Song-Chun Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Extracting abstract causal structures and applying them to novel situations is a hallmark of human intelligence. While Large Language Models (LLMs) and Vision Language Models (VLMs) have shown strong performance on a wide range of reasoning tasks, their capacity for interactive causal learning – inducing latent structures through sequential exploration and transferring them across contexts – remains uncharacterized. Human learners accomplish such transfer after minimal exposure, whereas classical Reinforcement Learning (RL) agents fail catastrophically. Whether state-of-the-art Artificial Intelligence (AI) models possess human-like mechanisms for abstract causal structure transfer is an open question. Using the OpenLock paradigm requiring sequential discovery of Common Cause (CC) and Common Effect (CE) structures, here we show that models exhibit fundamentally delayed or absent transfer: even successful models require initial environmental-specific mapping – what we term environmental grounding – before efficiency gains emerge, whereas humans leverage prior structural knowledge from the very first solution attempt. In the text-only condition, models matched or exceeded human discovery efficiency. In contrast, visual information – in both the image-only and text-and-image conditions – overall degraded rather than enhanced performance, revealing a broad reliance on symbolic processing rather than integrated multimodal reasoning. Models further exhibited systematic CC/CE asymmetries absent in humans, suggesting heuristic biases rather than direction-neutral causal abstraction. These findings reveal that large-scale statistical learning does not produce the decontextualized causal schemas underpinning human analogical reasoning, establishing grounding-dependent transfer as a fundamental limitation of current LLMs and VLMs.

[AI-53] A2DEPT: Large Language Model-Driven Automated Algorithm Design via Evolutionary Program Trees

【速读】:该论文旨在解决传统组合优化问题(Combinatorial Optimization Problems, COPs)启发式设计依赖大量领域知识、且现有基于大语言模型(Large Language Model, LLM)的自动启发式设计(Automated Heuristic Design, AHD)方法受限于固定算法模板、难以实现系统级算法表达的问题。其解决方案的关键在于提出一种基于进化程序树(Evolutionary Program Trees)的自动化算法设计框架A2DEPT,将LLM作为系统级算法架构师,通过树结构进化搜索结合混合选择策略与分层算子,在开放空间中探索完整的算法生成路径,并引入轻量级程序维护循环实现反馈驱动的可执行性保障,从而突破组件级调优限制,实现端到端的算法合成与迭代优化。

链接: https://arxiv.org/abs/2604.24043
作者: Bin Chen,Shouliang Zhu,Beidan Liu,Yong Zhao,Tianle Pu,Huichun Li,Zhengqiu Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Designing heuristics for combinatorial optimization problems (COPs) is a fundamental yet challenging task that traditionally requires extensive domain expertise. Recently, Large Language Model (LLM)-based Automated Heuristic Design (AHD) has shown promise in autonomously generating heuristic components with minimal human intervention. However, most existing LLM-based AHD methods enforce fixed algorithmic templates to ensure executability, which confines the search to component-level tuning and limits system-level algorithmic expressiveness. To enable open-ended solver synthesis beyond rigid templates, we propose Automated Algorithm Design via Evolutionary Program Trees (A2DEPT), which treats LLMs as system-level algorithm architects. A2DEPT explores the vast program space via a tree-structured evolutionary search with hybrid selection and hierarchical operators, enabling iterative refinement of complete algorithms. To make open-ended generation practical, we enforce executability with a lightweight program-maintenance loop that performs feedback-driven repair. In experiments, A2DEPT consistently outperforms representative LLM-based baselines on both standard and highly constrained benchmarks. On the standard benchmarks, it reduces the mean normalized optimality gap by 9.8% relative to the strongest competing AHD baseline.

[AI-54] End-to-End Learning for Partially-Observed Time Series with PyPOTS KDD2026

【速读】:该论文旨在解决部分观测时间序列(Partially-observed time series, POTS)在实际应用中普遍存在的问题,即现有工具链通常将缺失值处理与下游学习任务分离,导致可复现性差和整体性能受限。解决方案的关键在于提出 PyPOTS——一个开源的 Python 生态系统,支持从缺失值模拟、数据预处理到模型训练与评估的端到端数据挖掘与机器学习流程,涵盖插补、预测、分类、聚类和异常检测等核心任务,通过统一 API 和工程化实践提升 POTS 管道在研究与生产环境中的鲁棒性、透明性和可重用性。

链接: https://arxiv.org/abs/2604.24041
作者: Wenjie Du,Yiyuan Yang,Tianxiang Zhan,Qingsong Wen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by KDD 2026

点击查看摘要

Abstract:Partially-observed time series (POTS) is ubiquitous in real-world applications, yet most existing toolchains separate missing-value handling from downstream learning, which limits reproducibility and overall performance. This tutorial introduces PyPOTS, an open-source Python ecosystem for end-to-end data mining and machine learning on POTS. We present practical workflows spanning missingness simulation, data preprocessing, model training, and evaluation across core tasks, including imputation, forecasting, classification, clustering, and anomaly detection. The tutorial consists of two parts: Part I emphasizes hands-on application for practitioners through unified APIs and benchmark-oriented experiments. Part II targets developers and researchers, focusing on extending PyPOTS with custom models, domain-specific constraints, and contribution-ready engineering practices. Participants will gain both conceptual understanding and implementation experience for building robust, transparent, and reusable POTS pipelines in research and production settings. PyPOTS is publicly available at this https URL

[AI-55] QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成原创、非平凡数学证明方面面临的挑战,特别是如何让AI系统在开放研究问题上实现可靠且可验证的证明生成。现有LLMs虽在基准测试中表现优异,但在真实研究场景中仍存在多种系统性失败模式,如上下文污染、引用幻觉、关键步骤模糊处理、证明策略不稳定等。为应对这些问题,作者提出QED——一个开源的多智能体证明系统,其架构设计直接针对每种失败模式进行优化,通过协同推理与模块化分工提升整体可靠性。实验表明,在五个由领域专家提出的分析与偏微分方程(PDEs)开放问题中,QED成功生成了三个被专家确认为原创且非平凡的正确证明,验证了该系统设计的有效性。

链接: https://arxiv.org/abs/2604.24021
作者: Chenyang An,Qihao Ye,Minghao Pan,Jiayaun Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP)
备注:

点击查看摘要

Abstract:We explore a central question in AI for mathematics: can AI systems produce original, nontrivial proofs for open research problems? Despite strong benchmark performance, producing genuinely novel proofs remains an outstanding challenge for LLMs. Through systematic experiments with frontier LLMs on research-level proof tasks, we identify seven failure modes that prevent reliable proof generation, including context contamination, citation hallucination, hand-waving on key steps and misallocation of proof effort, unstable proof plans, unfocused verification, problem modification and single-model bottleneck. We argue that the gap between benchmark success and research-level proving is primarily one of system design, due to those failure modes. We present QED, an open-source multi-agent proof system in which each architectural decision directly addresses a specific failure mode. Evaluated on five open problems in applied analysis and PDEs contributed by domain experts, QED produces correct proofs for three problems, each verified by the contributing experts as original and nontrivial. QED is released as open-source software at this https URL.

[AI-56] Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents

【速读】:该论文旨在解决自主AI代理(Autonomous AI Agents)在部署平台如OpenClaw上面临的安全威胁问题,包括提示注入(prompt injection)、记忆污染(memory poisoning)、供应链攻击和社交工程等,而现有防御措施仅聚焦于平台边界,未能训练代理自身对威胁的识别与推理能力。解决方案的关键在于提出ClawdGo框架,通过内生安全意识训练(endogenous security awareness training),使代理在推理阶段无需模型修改即可自主识别和推理威胁;其核心创新包括:TLDT(Three-Layer Domain Taxonomy)构建自卫、用户保护和企业安全三层可训练维度体系;ASAT(Autonomous Security Awareness Training)采用自-play机制,代理在最弱先调度策略下交替扮演攻击者、防御者和评估者角色;CSMA(Cross-Session Memory Accumulation)利用四层持久化记忆架构与Axiom Crystallisation Promotion(ACP)累积技能提升;以及SACP(Security Awareness Calibration Problem)形式化了内生训练引入的精度-召回权衡问题。实验表明,最弱先ASAT显著提升平均TLDT得分至96.9,覆盖11/12维度,且CSMA保持长期性能稳定,验证了该方案的有效性与实用性。

链接: https://arxiv.org/abs/2604.24020
作者: Jiaqi Li,Yang Zhao,Bin Sun,Yang Yu,Jian Chang,Lidong Zhai
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 4 pages, including 1 poster page. Poster abstract and poster version

点击查看摘要

Abstract:Autonomous AI agents deployed on platforms such as OpenClaw face prompt injection, memory poisoning, supply-chain attacks, and social engineering, yet existing defences address only the platform perimeter, leaving the agent’s own threat judgement entirely untrained. We present ClawdGo, a framework for endogenous security awareness training: we teach the agent to recognise and reason about threats from the inside, at inference time, with no model modification. Four contributions are introduced: TLDT (Three-Layer Domain Taxonomy) organises 12 trainable dimensions across Self-Defence, Owner-Protection, and Enterprise-Security layers; ASAT (Autonomous Security Awareness Training) is a self-play loop where the agent alternates attacker, defender, and evaluator roles under weakest-first curriculum scheduling; CSMA (Cross-Session Memory Accumulation) compounds skill gains via a four-layer persistent memory architecture and Axiom Crystallisation Promotion (ACP); and SACP (Security Awareness Calibration Problem) formalises the precision-recall tradeoff introduced by endogenous training. Live experiments show weakest-first ASAT raises average TLDT score from 80.9 to 96.9 over 16 sessions, outperforming uniform-random scheduling by 6.5 points and covering 11 of 12 dimensions. CSMA retains the full gain across sessions; cold-start ablation recovers only 2.4 points, leaving a 13.6-point gap. E-mode generates 32 TLDT-conformant scenarios covering all 12 dimensions. SACP is observed when a heavily trained agent classifies a legitimate capability assessment as prompt injection (30/160).

[AI-57] COD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

【速读】:该论文旨在解决在线蒸馏(On-Policy Distillation, OPD)在多轮智能体(multi-turn agent)设置中因轨迹级KL散度不稳定而导致训练失效的问题,即所谓的“轨迹级KL不稳定性”(Trajectory-Level KL Instability)。其核心表现是:随着训练进行,KL散度持续上升且伴随成功率下降,即使收敛后KL仍维持高位,导致学生模型难以稳定学习。此问题源于跨轮次误差累积——错误逐步放大使学生偏离教师的有效支持区域,从而使监督信号失真。解决方案的关键在于提出时序课程在线蒸馏(Temporal Curriculum On-Policy Distillation, TCOD)框架,通过控制并逐步扩展学生所接触的轨迹深度(trajectory depth),从短到长施加课程式训练策略,从而抑制KL增长、提升训练稳定性,并显著增强多轮任务中的代理性能(最高提升18点),甚至可超越教师表现并在未见任务上实现泛化。

链接: https://arxiv.org/abs/2604.24005
作者: Jiaqi Wang,Wenhao Zhang,Weijie Shi,Yaliang Li,James Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that KL divergence increases together with a drop in success rate, and even after convergence, the KL remains high, leading to unstable training. This instability arises from inter-turn error compounding: as errors accumulate, the student is driven beyond the teacher’s effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On-Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed to the student and progressively expands it from short to long with a curriculum this http URL results across four student-teacher pairs on three multi-turn agent benchmarks (ALFWorld, WebShop, ScienceWorld) show that TCOD mitigates KL escalation and enhances KL stability throughout training, improving agent performance by up to 18 points over vanilla OPD. Further evaluations show that TCOD can even surpass the teacher’s performance and generalize to tasks on which the teacher fails.

[AI-58] CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation ACL2026

【速读】:该论文旨在解决医学影像报告生成中细粒度事实一致性评估的难题,尤其针对CT(计算机断层扫描)报告生成任务中存在的文本体量大、病灶特征多样复杂以及临床属性细微但关键的问题。传统评估指标仅依赖词汇重叠或实体匹配,无法反映临床所需的精确诊断准确性。解决方案的关键在于提出CT-FineBench基准,其核心创新是基于问答(Question-Answering, QA)机制构建细粒度评估流程:首先识别并结构化与具体发现相关的临床属性(如位置、大小、边界),然后将其系统转化为QA数据集,通过查询机器生成报告并评分答案正确性来实现对临床细节准确性的可解释性评估,从而显著提升对细微事实错误的敏感性,并更贴近专家临床判断。

链接: https://arxiv.org/abs/2604.24001
作者: Ruifeng Yuan,Wanxing Chang,Weiwei Cao,Bowen Shi,Zhongyu Wei,Ling Zhang,Jianpeng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026 Main

点击查看摘要

Abstract:The evaluation of generated reports remains a critical challenge in Computed Tomography (CT) report generation, due to the large volume of text, the diversity and complexity of findings, and the presence of fine-grained, disease-oriented attributes. Conventional evaluation metrics offer only coarse measures of lexical overlap or entity matching and fail to reflect the granular diagnostic accuracy required for clinical use. To address this gap, we propose CT-FineBench, a benchmark built from CT-RATE and Merlin to evaluate the fine-grained factual consistency of CT reports, constructed from CT-RATE and Merlin. Our benchmark is constructed through a meticulous, Question-Answering (QA) based process: first, we identify and structure key, finding-specific clinical attributes (like location, size, margin). Second, we systematically transform these attributes into a QA dataset, where questions probe for specific clinical details grounded in gold-standard reports. The evaluation protocol for CT-FineBench involves using this QA dataset to query a machine-generated report and scoring the correctness of the answers. This allows for a comprehensive, interpretable, and clinically-relevant assessment, moving beyond superficial lexical overlap to pinpoint specific clinical errors. Experiments show that CT-FineBench correlates better with expert clinical assessment and is substantially more sensitive to fine-grained factual errors than prior metrics.

[AI-59] Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents

【速读】:该论文旨在解决传统静态评估方法在部署后的多语言智能代理(如公共空间中的数字前台系统)中难以捕捉运行时失败模式的问题。现有评估通常基于输入-输出映射的平均得分,忽略了系统在真实运行环境中因语言差异导致的结构性偏差与失效行为。其解决方案的核心是提出PSA-Eval框架,将评估单元从单一分数转向可追踪、可审查、可修复和可回归测试的“失败案例”,并通过三语等效输入作为受控探针,量化群体层面的语言政策漂移(cross-language policy drift)。实验表明,尽管系统整体得分高达23.15/24,仍存在显著的跨语言评分漂移现象,验证了该失败中心型评估能揭示聚合评分掩盖的部署级结构信号。

链接: https://arxiv.org/abs/2604.23990
作者: M. Meng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 5 figures. arXiv preprint

点击查看摘要

Abstract:This paper presents PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. The central claim is that, when the evaluation object shifts from a static input-output mapping to a runtime system, the basic unit of analysis should shift from score to failure. PSA-Eval extends the conventional chain Question - Answer - Score - End into Question - Batch - Run - Score - Failure Case - Repair - Regression Batch, making failures traceable, reviewable, repairable, and regression-testable. The framework uses trilingual equivalent inputs as controlled probes for observing group-level cross-language policy drift. We conduct a pilot study on a real trilingual digital front-desk system deployed in the lobby of an international financial institution. The pilot uses a simplified single-foundation-model setting (MA = MB), so the observed drift should not be interpreted as an A/B foundation-model difference. The study contains 81 samples organized into 27 trilingual equivalent question groups. Although the system achieves an average score of 23.15/24, 14 groups show non-zero cross-language score drift, 5 groups show drift of at least 3 points, and the maximum drift reaches 9 points. These results provide initial evidence that failure-centered runtime evaluation can expose structured deployment signals hidden by aggregate scoring. Comments: 25 pages, 5 figures. arXiv preprint Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2604.23990 [cs.AI] (or arXiv:2604.23990v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.23990 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-60] Fix Initial Codes and Iteratively Refine Textual Directions Toward Safe Multi-Turn Code Correction

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在推理阶段性能提升机制不清晰的问题,尤其是复杂搜索策略如散射森林搜索(Scattered Forest Search, SFS)中各因素对性能贡献的不确定性。其解决方案的关键在于提出一种更简洁的方法——迭代文本方向精炼(Iterative Refinement of Textual Directions, IRTD),该方法固定初始代码并仅通过迭代优化文本指导方向来提升推理效果,从而避免了SFS中复杂的搜索结构;同时,借助Oracle-Guided Inductive Synthesis(OGIS)理论框架证明了IRTD的安全性,实验证明其在多个代码生成基准上达到与最先进方法相当的性能,表明高质量文本方向的迭代精炼足以显著提升LLM推理表现。

链接: https://arxiv.org/abs/2604.23989
作者: Yuto Tanaka,Issei Sato
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent work on large language models (LLMs) has emphasized the importance of scaling inference compute. From this perspective, the state-of-the-art method Scattered Forest Search (SFS) has been proposed, employing Monte Carlo Tree Search with carefully crafted initial seeds and textual optimization for multi-turn code correction. However, its complexity makes it unclear what factors contribute to improvements in inference performance. To address this problem, we analyze SFS and propose a simpler method, Iterative Refinement of Textual Directions (IRTD), which fixes initial codes and iteratively refines textual directions. Because of the simplicity of IRTD, we theoretically establish the safety of IRTD using Oracle-Guided Inductive Synthesis (OGIS). Experiments on several code generation benchmarks suggest that IRTD achieves inference performance comparable to state-of-the-art methods. These results indicate that, even without complex search structures, refining initial codes with high-quality textual directions alone can effectively improve inference performance.

[AI-61] Hindsight Preference Optimization for Financial Time Series Advisory ICLR2026

【速读】:该论文旨在解决生成式 AI(Generative AI)在时间序列预测中从单纯数值预测转向提供具有推理过程、可操作建议及风险管控的决策咨询(predictive advisory)的问题。其核心挑战在于,训练此类模型时所需的高质量标签信息在预测时刻尚不可得。解决方案的关键在于提出“事后偏好优化”(Hindsight Preference Optimization),该方法借鉴强化学习中的两个思想:一是利用执行后才可获得的信息生成训练信号,二是通过偏好对齐机制;具体而言,基于实际观测结果,让大型语言模型(LLM)对候选咨询方案进行多维排序,从而构建无需人工标注的偏好对(pairwise preference data),用于直接偏好优化(DPO)训练,最终实现更高质量的预测咨询输出。

链接: https://arxiv.org/abs/2604.23988
作者: Yanwei Cui,Guanghui Wang,Xing Zhang,Peiyang He,Ziyuan Li,Bing Zhu,Wei Qiu,Xusheng Wang,Zheng Yu,Anqi Xin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2026 TSALM Workshop

点击查看摘要

Abstract:Time series models predict numbers; decision-makers need advisory – directional signals with reasoning, actionable suggestions, and risk management. Training language models for such predictive advisory faces a fundamental challenge: quality depends on outcomes unknown at prediction time. We bridge two ideas from reinforcement learning – using information unavailable during execution to retrospectively generate training signal, and preference alignment – and propose Hindsight Preference Optimization: observed outcomes let an LLM judge rank candidate advisories on dimensions that scalar metrics cannot capture, producing preference pairs for DPO without human annotation. We apply this to Vision-Language-Model-based predictive advisories on SP 500 equity time series, demonstrated by a 4B model outperforming its 235B teacher on both accuracy and advisory quality.

[AI-62] DecompKAN: Decomposed Patch-KAN for Long-Term Time Series Forecasting

【速读】:该论文旨在解决科学领域中时间序列预测的准确性与模型透明性之间的矛盾问题,尤其是在气候建模、生理监测和能源系统等场景下,既需要高精度预测又要求模型具备可解释性。其解决方案的关键在于提出了一种轻量级、无需注意力机制的架构 DecompKAN,该架构融合了趋势-残差分解(trend-residual decomposition)、通道-wise 分块(channel-wise patching)、学习型实例归一化(learned instance normalization)以及基于 B-样条的 Kolmogorov-Arnold 网络(B-spline Kolmogorov-Arnold Network, KAN)边函数。其中,KAN 边函数能够学习显式的、可直接可视化的 1D 标量函数,从而实现对隐含非线性变换的直观分析,显著提升了模型的可解释性;同时,在多个标准基准测试中表现出优于或相当的预测性能(MSE 指标),尤其在具有平滑时序动态的数据集(如 Solar、ECL 和生理信号 PPG-DaLiA)上表现突出。

链接: https://arxiv.org/abs/2604.23968
作者: Naveen Mysore
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 15 pages, 6 figures, 8 tables. Preprint; under review

点击查看摘要

Abstract:Accurate time series forecasting in scientific domains such as climate modeling, physiological monitoring, and energy systems benefits from both competitive predictions and model transparency. This work proposes DecompKAN, a lightweight attention-free architecture that combines trend-residual decomposition, channel-wise patching, learned instance normalization, and B-spline Kolmogorov-Arnold Network (KAN) edge functions. Each KAN edge learns an explicit, inspectable 1D scalar function over learned patch-embedding coordinates that can be directly visualized. On standard benchmarks, DecompKAN achieves best or tied-best MSE on 15 of 32 dataset-horizon combinations among selected published baselines, and achieves best or tied-best MSE on 20 of 36 comparisons under a controlled same-recipe evaluation across 9 datasets including the physiological PPG-DaLiA benchmark. The architecture shows particular strength on datasets with smooth temporal dynamics (Solar -17%, ECL -10% vs. iTransformer, Weather) and physiological time series. Visualization of learned edge functions reveals qualitatively different latent nonlinearities across domains. Ablation analysis shows that the architectural pipeline (decomposition, patching, normalization) drives performance more than the choice of nonlinear layer, while the KAN formulation enables inspection of learned latent transformations.

[AI-63] ask-guided Spatiotemporal Network with Diffusion Augmentation for EEG-based Dementia Diagnosis and MMSE Prediction

【速读】:该论文旨在解决传统多任务学习方法在处理异质性数据(如脑电图 EEG 和认知评分 MMSE)时存在的特征纠缠问题,从而导致任务间干扰、影响诊断与预测性能的问题。其解决方案的关键在于提出一种任务引导的时空网络(Task-Guided Spatiotemporal Network, TGSN),该网络通过三个核心模块实现:1)多频带特征融合模块以捕获 EEG 的互补频谱信息;2)基于扩散过程的数据增强模块提升样本多样性;3)门控时空注意力模块建模 EEG 的长程空间依赖性和时间动态性,并结合任务引导查询模块实现任务特定特征提取,有效缓解多任务间的干扰。实验表明,该方法在 XY02 数据集上显著优于现有最优基线,在阿尔茨海默病(AD)/额颞叶痴呆(FTD)分类准确率达 97.78%,在 AD/FTD/血管性认知障碍(VCI)分类中达 83.93%,同时 MMSE 预测均方根误差(RMSE)降低至 1.93 和 2.38,验证了其有效性与泛化能力。

链接: https://arxiv.org/abs/2604.23964
作者: Xiaoyu Zheng,Xu Tian,Bin Jiao,Kunbo Cui,Hanhe Lin,Lu Shen,Jin Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Patients with dementia typically exhibit cognitive impairment, which is routinely assessed using the Mini-Mental State Examination (MMSE). Concurrently, their underlying neurophysiological abnormalities are reflected in Electroencephalography (EEG), providing a basis for joint modeling. However, traditional multi-task approaches suffer from feature entanglement, which leads to inter-task interference when handling heterogeneous this http URL address this challenge, we propose a task-guided spatiotemporal network (TGSN) with diffusion augmentation for EEG-based dementia diagnosis and MMSE prediction. Specifically, TGSN integrates a multi-band feature fusion module to capture complementary spectral information from EEG. Meanwhile, a pre-trained data augmentation module utilizing a diffusion process is introduced toincrease sample diversity. To model the complex spatiotemporal patterns of EEG, we propose a gated spatiotemporal attention module that captures long-range spatial dependencies and temporal dynamics. Moreover, we design a task-guided query module to achieve task-specific feature extraction, thereby mitigating task interference. The effectiveness of TGSN is evaluated on the XY02 dataset. Experimental results demonstrate that the proposed network outperforms several state-of-the-art methods, achieving classification accuracies of 97.78% for Alzheimer’s Disease (AD)/Frontotemporal Dementia (FTD) and 83.93% for AD/FTD/Vascular Cognitive Impairment (VCI), which exceed the best baselines by 16.39% and 8.28%, respectively. In parallel, it reduces the RMSE for MMSE prediction to 1.93 and 2.38, achieving significant error reductions of 1.44 and 1.43 compared to the best baselines. Additionally, validation on the DS004504 dataset demonstrates strong cross-dataset generalization…

[AI-64] An empirical evaluation of the risks of AI model updates using clinical data: stability arbitrariness and fairness

【速读】:该论文旨在解决临床环境中人工智能/机器学习(AI/ML)模型因训练数据过时而导致性能下降的问题,同时指出模型更新可能引入新的风险,如预测稳定性降低、预测任意性增加以及子群体间准确性公平性和误差率平衡恶化。其解决方案的关键在于提出多维度的持续监测框架,用于识别上述风险,并强调此类监测对于构建可信的临床决策支持系统至关重要。

链接: https://arxiv.org/abs/2604.23954
作者: Ioannis Bilionis,Ricardo C. Berrios,Luis Fernandez-Luque,Carlos Castillo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to iEEE EMBC 2026. 4 pages, 3 figures

点击查看摘要

Abstract:Artificial Intelligence and Machine Learning (AI/ML) models used in clinical settings are increasingly deployed to support clinical decision-making. However, when training data become stale due to changes in demographics, environment, or patient behaviors, model performance can degrade substantially. While updating models with new training data is necessary, such updates may also introduce new risks. We evaluated the proposed monitoring framework on four publicly available U.S.-based Type 1 Diabetes datasets containing high-resolution continuous glucose monitoring (CGM) data, comprising approximately 11,300 weekly observations from 496 participants under 20 years of age. All datasets included structured sociodemographic information. Using the prediction of severe hyperglycemia events in children with type 1 diabetes as a case study, we examine how different model update strategies can adversely affect model stability (e.g., by causing predictions to “flip” for a large number of cases after an update), increase arbitrariness in predictions, or worsen accuracy equity and the balance of error rates across subpopulations. We propose multiple dimensions for continuous monitoring to detect these issues and argue that such monitoring is essential for the development of trustworthy clinical decision support systems.

[AI-65] Context-Aware Hospitalization Forecasting Evaluations for Decision Support using LLM s

【速读】:该论文旨在解决在大规模医疗资源中断(如疫情或系统故障)期间,如何利用生成式 AI (Generative AI) 提升医院床位等关键资源预测的稳定性与决策相关性问题。传统时间序列模型虽能处理历史数据,但难以整合人口、地理等非时序公共健康上下文信息;而直接使用大语言模型(LLM)进行数值预测则存在结果不稳定的问题。解决方案的关键在于提出一种结构化的混合建模框架(HybridARX),将 LLM 提取的上下文信号嵌入到经典的自回归外生变量模型(ARX)中,从而在保持模型结构稳定性的前提下增强对复杂现实场景的适应能力,显著提升预测的校准度和时序一致性,尤其适用于非平稳的医疗资源调度决策场景。

链接: https://arxiv.org/abs/2604.23949
作者: Rhea Makkuni,Ananya Joshi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical and public health experts must make real-time resource decisions, such as expanding hospital bed capacity, based on projected hospitalization trends during large-scale healthcare disruptions (e.g., operational failures or pandemics). Forecasting models can assist in this task by analyzing large volumes of resource-related data at the facility level, but they must be reliable for decision-making under real-world data conditions. Recent work shows that large language models (LLMs) can incorporate richer forms of context into numerical forecasting. Whereas traditional models rely primarily on temporal context (i.e., past observations), LLMs can also leverage non-temporal public health context such as demographic, geographic, and population-level features. However, it remains unclear how these models should be used to produce stable or decision-relevant predictions in real-world healthcare settings. To evaluate how LLMs can be effectively used in this setting, we evaluate three approaches across 60 counties with low-,mid-, and high-hospitalization intensities in the United States: direct LLM-based forecasting, classical time-series models, and a context-augmented hybrid pipeline (HybridARX) that incorporates LLM-derived signals into structured models. Because the goal is operational decision-making rather than error minimization alone, we evaluate performance with bias and lead-lag alignment in addition to standard forecasting metrics. Our results show that HybridARX improves over classical ARX by yielding more stable and better-calibrated forecasts, particularly when incorporating noisy contextual signals into structured time-series models. These findings suggest that, in non-stationary healthcare resource forecasting, LLMs are most useful when embedded within structured hybrid models.

[AI-66] GAMED.AI: A Hierarchical Multi-Agent Framework for Automated Educational Game Generation

【速读】:该论文旨在解决如何将教师提供的自然语言问题高效、可靠地转化为结构化且具有教学意义的可玩教育游戏的问题,同时确保生成内容在机制设计和认知目标(如布卢姆分类法)上的对齐性。解决方案的关键在于提出一种分层多智能体框架 GameDAI,其核心包括基于阶段的 LangGraph 子图结构、确定性的质量门控机制(Quality Gates)以及结构化的 Pydantic 数据模式,从而实现从问题输入到可验证游戏输出的端到端自动化流程,并在保持高合规性和教学有效性的同时显著降低 token 消耗(相比 ReAct 代理减少约 73%)。

链接: https://arxiv.org/abs/2604.23947
作者: Shiven Agarwal,Yash Shah,Ashish Raj Shekhar,Priyanuj Bordoloi,Vivek Gupta
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce GameDAI, a hierarchical multi-agent framework that transforms instructor-provided questions into fully playable, pedagogically grounded educational games validated through formal mechanic contracts. Built on phase-based LangGraph sub-graphs, deterministic Quality Gates, and structured Pydantic schemas, GameDAI supports two template families encompassing 15 interaction mechanics across spatial reasoning, procedural execution, and higher-order Bloom’s Taxonomy objectives. Evaluated on 200 questions spanning five subject domains, the system achieves a 90% validation pass rate, 98.3% schema compliance, and 73% token reduction over ReAct agents ( \sim 73,500 \rightarrow \sim 19,900 tokens/game) at 0.46 per game. Within this model configuration, these results suggest that phase-bounded architectural structure correlates more strongly with alignment quality than prompting strategy alone. Our demonstration lets attendees generate Bloom’s-aligned games from natural language in under 60 seconds, inspect Quality Gate outputs at each pipeline phase, and browse a curated library of 50 games spanning all 15 mechanic types.

[AI-67] Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery

【速读】:该论文旨在解决现有反编译器(decompiler)生成的源代码往往无法正确编译或执行的问题,从而限制了其在安全分析、恶意软件逆向工程和遗留软件维护中的实际应用。解决方案的关键在于提出一种多智能体框架,通过多层次约束引导的反编译(Multi-level Constraint-Guided Decompilation, MCGD),构建一个包含语法正确性(通过解析验证)、可编译性(通过GCC验证)和行为等价性(通过大语言模型LLM生成测试用例验证)的分层验证流水线;当任一层次验证失败时,由专门的大语言模型代理基于结构化错误反馈迭代优化代码,最终显著提升反编译结果的可重执行性(re-executability)。实验表明,该方法在真实世界二进制文件上实现了84–97%的可重执行率,较基线提升28–89个百分点,并优于当前最先进的基于LLM的反编译方法。

链接: https://arxiv.org/abs/2604.23940
作者: Yifan Zhang,Xiaohan Wang,Yueke Zhang,Kevin Leach
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Decompilation – recovering source code from compiled binaries – is essential for security analysis, malware reverse engineering, and legacy software maintenance. However, existing decompilers produce code that often fails to compile or execute correctly, limiting their practical utility. We present a multi-agent framework that transforms decompiled code into re-executable source through Multi-level Constraint-Guided Decompilation (MCGD). Our approach employs a hierarchical validation pipeline with three constraint levels: (1) syntactic correctness via parsing, (2) compilability via GCC, and (3) behavioral equivalence via LLM-generated test cases. When validation fails, specialized LLM agents iteratively refine the code using structured error feedback. We evaluate our framework on 1,641 real-world binaries from ExeBench across three decompilers (RetDec, Ghidra, and Angr). Our framework achieves 84-97% re-executability, improving baseline decompiler output by 28-89 percentage points. In comparison with state-of-the-art LLM-based decompilation methods using the same GPT-4o backbone, our approach (84.1%) outperforms LLM4Decompile (80.3%), SK2Decompile (73.9%), and SALT4Decompile (61.8%). Our ablation study reveals that execution-based validation is critical: compile-only approaches achieve 0% behavioral correctness despite 91-99% compilation rates. The system converges efficiently, with 90%+ binaries reaching correctness within 2 iterations at an average cost of 0.03-0.05 per binary. Our results demonstrate that constraint-guided agentic refinement can bridge the gap between raw decompiler output and practically useful source code.

[AI-68] Agent ic AI platforms for autonomous training and rule induction of human-human and virus-human protein-protein interactions

【速读】:该论文旨在解决蛋白质-蛋白质相互作用(Protein-Protein Interaction, PPI)预测与可解释性之间的关键挑战,即如何在保证高预测准确性的同时,生成人类可读的生物学规则以增强模型的透明度和可信度。解决方案的关键在于构建两个独立的智能体(AI agent)平台:第一个平台通过五个协作智能体实现从数据收集、验证、特征嵌入、模型设计到训练验证的全流程自动化,针对人-人和病毒-人PPI分别达到87.3%和86.5%的准确率;第二个平台则利用蛋白嵌入、理化自协方差描述符、细胞组分注释、通路域重叠及图上下文等多维信息,诱导出可解释的显式规则,其中人-人PPI由两则规则定义,人-病毒PPI则为加权规则集合,并且这些规则与SHAP特征重要性分析结果高度一致,从而实现了从数据驱动建模到知识驱动解释的闭环。

链接: https://arxiv.org/abs/2604.23924
作者: Hung N. Do,Jessica Z. Kubicek-Sutherland,Oscar A. Negrete,S. Gnanakaran
机构: 未知
类目: Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注: Other correspondence email: donguyenhung238@gmail.com

点击查看摘要

Abstract:We instruct an AI agent to construct two separate agentic AI platforms: one for autonomous training of predictive ML models for human-human and virus-human PPI, and the other for inducing explicit general rules governing human-human and virus-human PPI. The first agentic AI platform for autonomous training of predictive ML models for PPI is designed to consist of five AI agents that handle autonomous data collection, data verification, feature embedding, model design, and training and validation on three-way protein-disjoint cross-fold datasets. For human-human and human-virus PPIs, the final three-way protein-disjoint ensemble achieves an accuracy of 87.3% and 86.5%, respectively. For cross-checking and interpretability purposes, the second agentic AI platform is designed to replace ML predictions with human-readable rules derived from protein embeddings, physicochemical autocovariance descriptors, compartment annotations, pathway-domain overlap, and graph contexts. For human-human PPI, it is defined by a two-rule induction, whereas human-virus is induced by a more complex set of weighted rules. The rules induced by the second agentic platform align with the SHAP-identified features from the predictive ML models built by the first agentic platform. Taken together, our work demonstrates the agentic AI’s ability to orchestrate from data planning to execution, and from rule induction to explanation in ML, opening the door to various applications.

[AI-69] Crystal structure prediction using graph neural combinatorial optimization

【速读】:该论文旨在解决晶体材料中晶体结构预测(Crystal Structure Prediction, CSP)的组合优化难题,即在给定化学组成下高效搜索具有最低能量的原子排列方式。传统方法依赖精确数学优化或启发式算法,在大规模问题中因原子配置空间呈指数级增长而计算成本高昂。其解决方案的关键在于引入基于图神经网络(Graph Neural Networks, GNNs)的神经组合优化框架,通过构建捕捉短程与长程原子相互作用的扩展图(expander graphs)作为计算图结构,并利用Gumbel-Sinkhorn机制确保生成结构的化学计量比约束,从而实现无监督条件下对可行晶体结构分布的有效采样。该方法显著优于经典启发式策略,并在多种化学组成下达到商用优化求解器的性能水平,为利用GPU并行计算能力突破当前CSP计算瓶颈提供了新路径。

链接: https://arxiv.org/abs/2604.23921
作者: Stavros Gerolymatos,J. Kyle Brubaker,Martin J. A. Schuetz,Vladimir V. Gusev
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Crystalline materials are widely used in technological applications, yet their discovery remains a significant challenge. As their properties are driven by structure, crystal structure prediction (CSP) methods play a central role in computational approaches aiming to accelerate this process. Previously, CSP has been approached from a combinatorial optimization perspective, with the core challenge of allocating atoms on a fine grid of predefined discrete positions within a unit cell while minimizing their interaction energy. Exact mathematical optimization methods provide guaranteed solutions, but they become computationally expensive for large-scale instances, where the atomic configuration space grows rapidly, particularly in the absence of additional symmetry constraints. In this work, we introduce a neural combinatorial optimization approach to the atom allocation challenge and, subsequently, CSP, based on graph neural networks (GNNs), which can effectively sample from the distribution of feasible structures in an unsupervised manner. We leverage expander graphs to construct computational graphs over discrete positions that capture both short- and long-range interactions between atoms, and employ the Gumbel-Sinkhorn approach to enforce the desired stoichiometry of the generated structures. We demonstrate that our method outperforms classical heuristic approaches and is competitive with a commercial optimization solver across a range of chemical compositions. This enables the use of ever-expanding GPU infrastructure to tackle the inherent combinatorial challenges of CSP, paving the way for scaling beyond current capabilities.

[AI-70] SMSI: System Model Security Inference: Automated Threat Modeling for Cyber-Physical Systems

【速读】:该论文旨在解决网络物理系统(Cyber-Physical Systems, CPS)中威胁建模高度依赖人工的问题,提出了一种名为SMSI(System Model Security Inference)的混合神经符号流水线,从SysML架构模型出发,自动推导出符合NIST 800-53标准的安全控制措施并进行优先级排序。其解决方案的关键在于三阶段架构:首先通过确定性解析器将系统组件映射至CVE漏洞(基于NVD);其次利用多种方法(包括微调后的SecureBERT+、密集编码检索模型及零样本大语言模型Gemma-4 26B)实现CVE到MITRE ATT&CK技术的映射;最后采用控制推荐模块输出安全控制建议。实验表明,使用预训练SecureBERT在ATT&CK到NIST控制映射阶段表现最优,验证了密集嵌入(dense embeddings)作为自动化控制推荐基础的有效性。

链接: https://arxiv.org/abs/2604.23905
作者: RoÝah Radaideh,Ali Khreis
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 9 pages, 8 figures, 7 tables. University of Ottawa for ELG5271/CSI5388

点击查看摘要

Abstract:Threat modeling for cyber-physical systems (CPS) remains a largely manual exercise. This project presents SMSI (System Model Security Inference), a hybrid neuro-symbolic pipeline that starts from a SysML architecture model and produces a prioritized list of NIST 800-53 security controls. The prototype has three main stages: a deterministic parser mapping system components to vulnerabilities via the NVD; a family of retrieval and classification models linking vulnerabilities to MITRE ATTCK techniques; and a control recommender. We explore three approaches for CVE-to-ATTCK mapping: a supervised classifier using fine-tuned SecureBERT+, retrieval-based dense encoders, and a zero-shot LLM approach using Gemma-4 26B. We validate the pipeline on a healthcare IoT gateway with nine software components. For the ATTCK-to-NIST stage, pretrained SecureBERT achieves the highest control retrieval scores, demonstrating that dense embeddings provide a strong basis for automated control recommendation.

[AI-71] LLM -Augmented Traffic Signal Control with LSTM-Based Traffic State Prediction and Safety-Constrained Decision Support

【速读】:该论文旨在解决传统固定周期和规则驱动的交通信号控制方法在动态交通需求下适应性差、决策可解释性弱的问题。其解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)增强的交通信号控制框架,该框架融合了LSTM-based短期交通状态预测、基于预测的相位选择、结构化LLM推理以及安全约束动作过滤机制;其中,LLM模块通过对结构化交通状态输入进行推理,生成拥堵诊断、相位调整建议及自然语言解释,从而提升决策透明度与合理性,而安全约束过滤则保障执行动作的可靠性。实验表明,该方法在动态和非重复性交通场景中显著提升通行效率且无违反安全约束的情况发生。

链接: https://arxiv.org/abs/2604.23902
作者: Jiazhao Shi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traffic signal control is a critical task in intelligent transportation systems, yet conventional fixed-time and rule-based methods often struggle to adapt to dynamic traffic demand and provide limited decision interpretability. This study proposes an LLM-augmented traffic signal control framework that integrates LSTM-based short-term traffic state prediction, predictive phase selection, structured large language model reasoning, and safety-constrained action filtering. The LSTM module forecasts future queue length, waiting time, vehicle count, and lane occupancy based on recent intersection-level observations. A predictive controller then generates candidate signal actions, while the LLM module evaluates these actions using structured traffic-state inputs and produces congestion diagnoses, phase adjustment recommendations, and natural-language explanations. To ensure operational reliability, all LLM-generated recommendations are validated by a safety filter before execution. Simulation-based experiments in SUMO compare the proposed method with fixed-time control, rule-based control, and an LSTM-based predictive baseline under balanced demand, directional peak demand, and sudden surge scenarios. The results indicate that the proposed framework improves traffic efficiency, especially under dynamic and non-recurrent traffic conditions, while maintaining zero constraint violations after safety filtering. Overall, this study demonstrates that LLMs can enhance traffic signal control when used as constrained reasoning and decision-support modules rather than direct low-level controllers. Keywords: Intelligent Transportation Systems; Traffic Signal Control; Large Language Models; LSTM; Traffic State Prediction; Decision Support; Safety-Constrained Control; SUMO Simulation.

[AI-72] MarketBench: Evaluating AI Agents as Market Participants

【速读】:该论文旨在解决AI代理在市场机制中有效协作的瓶颈问题,即如何使AI代理具备准确评估自身完成任务的能力(能力信号)和执行成本(成本信号)的能力,从而实现高效的资源分配。解决方案的关键在于提出MarketBench基准测试框架,用于评估AI代理是否具备上述能力信号的自我报告准确性;实验表明,当前大型语言模型(LLM)在成功概率和token使用量上的自评存在显著偏差(miscalibrated),导致基于这些自报信息构建的拍卖机制偏离完全信息下的最优分配。尽管引入历史实验数据作为上下文可小幅提升校准度,但仍未接近理想状态,凸显出自我评估能力是制约市场式协调机制落地的核心瓶颈。

链接: https://arxiv.org/abs/2604.23897
作者: Andrey Fradkin,Rohit Krishnan
机构: 未知
类目: Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly. In order to effectively participate in markets, agents need to have informative signals of their own ability to successfully complete a task and the cost of doing so. We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities. We use a 93-task subset of SWE-bench Lite, a software engineering benchmark, with six recently released LLMs as a demonstration. These LLMs are miscalibrated on both success probability and token usage, and auctions built from these self-reports diverge from a full-information allocation. A follow-up intervention where we add information about capabilities from prior experiments to the context improves calibration, but only modestly narrows the gap to a full-information benchmark. We also document the performance of a market-based scaffolding with these LLMs. Our results point to self-assessment as a key bottleneck for market-style coordination of AI agents.

[AI-73] Geometry Preserving Loss Functions Promote Improved Adaptation of Blackbox Generative Model

【速读】:该论文旨在解决大规模生成式AI模型在特定应用场景下的领域适应(domain adaptation)难题,尤其是在模型权重和梯度不可访问、且传统微调方法因存储与计算成本过高而不可行的情况下。解决方案的关键在于提出了一种端到端的适配流程,通过结合几何保真损失函数(geometry-preserving loss function)与预训练生成对抗网络(Generative Adversarial Networks, GANs),重新定义了GAN反演(GAN inversion)在获取准确潜在空间表示中的作用。该方法通过保持切空间之间的成对距离,确保潜在空间结构的一致性,从而训练出能够从目标分布生成样本的潜在生成模型,实验表明其在StyleGAN上的适应性能优于传统损失函数。

链接: https://arxiv.org/abs/2604.23888
作者: Sinjini Mitra,Constantine Kyriakakis,Shenyuan Liang,Anuj Srivastava,Pavan Turaga
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adaptation of blackbox generative models has been widely studied recently through the exploration of several methods including generator fine-tuning, latent space searches, leveraging singular value decomposition, and so on. However, adapting large-scale generative AI tools to specific use cases continues to be challenging, as many of these industry-grade models are not made widely available. The traditional approach of fine-tuning certain layers of a generative network is not feasible due to the expense of storing and fine-tuning generative models, as well as the restricted access to weights and gradients. Recognizing these challenges, we propose a novel end-to-end pipeline aimed at domain adaptation by leveraging geometry-preserving loss functions in conjunction to pre-trained generative adversarial networks (GANs). Our method rethinks the problem of adaptation by re-contextualizing the role of GAN inversion in obtaining accurate latent space representations. Extending the ability of existing state-of-the-art inverters, we preserve pair-wise distances between tangent spaces to successfully train a latent generative model to produce samples from the target distribution. We evaluate our proposed pipeline on StyleGANs with real distribution shifts and demonstrate that the introduction of the geometry preserving loss function lends to improved adaptation of generative models compared to other traditional loss functions.

[AI-74] Evaluation of Prompt Injection Defenses in Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中因系统提示词(system prompts)嵌入敏感信息而导致的秘密泄露问题。研究发现,依赖模型自身进行防护的各类防御机制均被自适应攻击者在数千次攻击中成功突破,而唯一有效的防御方案是输出过滤(output filtering),即在应用代码层面对模型响应进行硬编码规则校验,从而在15,000次攻击中实现零泄露。其关键在于:安全边界必须由外部应用程序强制执行,而非依赖被攻击的模型本身来保障安全性。

链接: https://arxiv.org/abs/2604.23887
作者: Priyal Deep,Shane Emmons,Amy Fox,Kyle Bacon,Kelley McAllister,Krisztian Flautner
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 14 pages, 9 figures

点击查看摘要

Abstract:LLM-powered applications routinely embed secrets in system prompts, yet models can be tricked into revealing them. We built an adaptive attacker that evolves its strategies over hundreds of rounds and tested it against nine defense configurations across more than 20,000 attacks. Every defense that relied on the model to protect itself eventually broke. The only defense that held was output filtering, which checks the model’s responses via hardcoded rules in separate application code before they reach the user, achieving zero leaks across 15,000 attacks. These results demonstrate that security boundaries must be enforced in application code, not by the model being attacked. Until such defenses are verified by tools like Swept AI, AI systems handling sensitive operations should be restricted to internal, trusted personnel.

[AI-75] MUSIC: Learning Muscle-Driven Dexterous Hand Control

【速读】:该论文旨在解决物理驱动的灵巧手在未见音乐曲目上进行精确钢琴演奏的问题,即如何实现基于肌肉控制的双手协同运动以超越参考数据集的限制。解决方案的关键在于提出一种分层架构:低层通过强化学习训练通用单手策略,生成动态肌腱激活信号以跟踪大规模参考动作数据集中的轨迹,并将这些策略蒸馏为变分自编码器(Variational Autoencoder, VAE)模型,从而抽象出平滑且结构化的潜在空间;高层则训练针对特定乐曲的策略,在此潜在空间中根据乐谱提取的音符事件协调双手动作,实现对新音乐作品的合成演奏。该方法结合改进的生物力学手模型与肌肉驱动控制器,显著提升了运动跟踪精度和生理合理性,实现了当前物理驱动灵巧控制中钢琴演奏的最先进性能。

链接: https://arxiv.org/abs/2604.23886
作者: Pei Xu,Yufei Ye,Shuchun Sun,Yu Ding,Elizabeth Schumann,C. Karen Liu
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注: Website: this https URL

点击查看摘要

Abstract:We present a data-driven approach for physics-based, muscle-driven dexterous control that enables musculoskeletal hands to perform precise piano playing for novel pieces of music outside the reference dataset. Our approach combines high-frequency muscle-level control with low-frequency latent-space coordination in a hierarchical architecture. At the low level, general single-hand policies are trained via reinforcement learning to generate dynamic muscle-tendon activations while tracking trajectories from a large reference motion dataset. The resulting tracking policies are then distilled into variational autoencoder (VAE) models, yielding smooth and structured latent spaces that abstract away low-level muscle dynamics. For the high level, we train piece-specific policies to operate in this latent space, coordinating bimanual motions based on specific goals, denoted by note events extracted from given musical scores, to synthesize performances beyond the reference data. In addition, we present an enhanced musculoskeletal hand model that supports fine control of fingers for accurate low-level motion tracking and diverse high-level motion synthesis. We evaluate the control pipeline of our approach on a diverse piano repertoire spanning multiple musical styles and technical demands. Results demonstrate that our approach can synthesize coordinated bimanual motions with accurate key presses, and achieve the state-of-the-art performance of piano playing in physics-based dexterous control. We also show that our musculoskeletal hand model demonstrates superior biomechanical stability and tracking precision compared to the existing model, and validate that our musculoskeletal hand model and muscle-driven controller can generate physiologically plausible activation patterns that align with human electromyography (EMG) recordings.

[AI-76] ZenBrain: A Neuroscience-Inspired 7-Layer Memory Architecture for Autonomous AI Systems NEURIPS2026

【速读】:该论文旨在解决当前人工智能代理(AI agent)记忆系统缺乏对人类记忆核心机制——巩固(consolidation)、遗忘(forgetting)和再巩固(reconsolidation)——的整合问题,现有系统多依赖工程化比喻(如虚拟内存分页、扁平大语言模型存储等),难以实现高效且类脑的记忆管理。其解决方案的关键在于提出ZenBrain架构,这是一个融合15个神经科学模型的多层记忆系统,包含7个记忆层级与9个基础算法,并引入6个创新的预测性记忆组件(Predictive Memory Architecture, PMA),如四通道神经调质引擎(NeuromodulatorEngine)、预测误差门控的再巩固引擎(ReconsolidationEngine)、三重复制记忆(TripleCopyMemory)及基于杏仁核快速通路的优先级映射(PriorityMap)。实验证明,该架构在多个基准测试中显著优于单层基线,尤其在长期记忆稳定性、存储效率和跨任务泛化能力上表现突出,展现出类脑记忆系统的协同生存网络特性。

链接: https://arxiv.org/abs/2604.23878
作者: Alexander Bering
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Pre-print of NeurIPS 2026 main-track submission. Earliest preprint version on Zenodo 31 March 2026 (DOI: https://doi.org/10.5281/zenodo.19353664%29%3B cross-posted to TDCommons (dpubs_series/9683, 1 April 2026). Six Zenodo revisions and three TDCommons revisions through 9 April 2026 (Zenodo concept DOI: https://doi.org/10.5281/zenodo.19353663 ). 41 pages, 22 tables, 2 figures

点击查看摘要

Abstract:Despite a century of empirical memory research, existing AI agent memory systems rely on system-engineering metaphors (virtual-memory paging, flat LLM storage, Zettelkasten notes), none integrating principles of consolidation, forgetting, and reconsolidation. We present ZenBrain, a multi-layer memory architecture integrating fifteen neuroscience models. It implements seven memory layers (working, short-term, episodic, semantic, procedural, core, cross-context) orchestrated by nine foundational algorithms (Two-Factor Synaptic Model, vmPFC-coupled FSRS, Simulation-Selection sleep, Bayesian confidence, and five more) plus six new Predictive Memory Architecture (PMA) components: a four-channel NeuromodulatorEngine, prediction-error-gated ReconsolidationEngine, TripleCopyMemory with divergent decay, four-dimensional PriorityMap with amygdala fast-path, StabilityProtector (NogoA/HDAC3 analogue), and MetacognitiveMonitor for bias detection. The 15-algorithm ablation reveals a cooperative survival network: under stress, 9 of 15 algorithms become individually critical (delta-Q up to -93.7%, Wilcoxon, 10 seeds, alpha=0.005). Simulation-Selection sleep achieves 37% stability improvement (p0.005) with 47.4% storage reduction. TripleCopyMemory retains S(t)=0.912 at 30 days; PriorityMap reaches NDCG@10=0.997. Multi-layer routing beats a flat single-layer baseline by 20.7% F1 on LoCoMo (p0.005) and 19.5% on MemoryArena (p=0.015). On LongMemEval-500, ZenBrain holds the highest mean rank on all 12 system-judge cells (4 systems x 3 LLM judges), three-judge mean J=0.545 vs letta=0.485, a-mem=0.414, mem0=0.394; all 9 pair-wise contrasts clear Bonferroni (alpha=0.05/18, min p=6.2e-31, d in [0.18, 0.52]). Under LongMemEval’s binary judge, ZenBrain reaches 91.3% of oracle accuracy at 1/106th the per-query token budget. Open-source with 11,589 automated test cases. Comments: Pre-print of NeurIPS 2026 main-track submission. Earliest preprint version on Zenodo 31 March 2026 (DOI: https://doi.org/10.5281/zenodo.19353664)%3B cross-posted to TDCommons (dpubs_series/9683, 1 April 2026). Six Zenodo revisions and three TDCommons revisions through 9 April 2026 (Zenodo concept DOI: https://doi.org/10.5281/zenodo.19353663). 41 pages, 22 tables, 2 figures Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2604.23878 [cs.AI] (or arXiv:2604.23878v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.23878 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Alexander Bering [view email] [v1] Sun, 26 Apr 2026 20:39:19 UTC (95 KB)

[AI-77] Inverting Foundation Models of Brain Function with Simulation-Based Inference

【速读】:该论文旨在解决如何从合成脑活动(synthetic brain activity)中逆向恢复刺激或其属性的问题,即探索基础脑模型(foundation brain models)在“逆向设计”中的潜力。其解决方案的关键在于结合脑模拟器(brain emulator)与大语言模型(LLMs),利用基于模拟的推断(simulation-based inference)学习从脑图谱到潜在刺激参数(如情绪效价、唤醒度和支配度)的概率映射关系。实验表明,这些参数可从预测的脑图谱中有效恢复,验证了神经编码的质量,并证明LLMs能够作为可控刺激生成器用于模拟实验,为基于基础脑模型的解码与逆向设计提供了可行路径。

链接: https://arxiv.org/abs/2604.23865
作者: Niels Bracher,Xavier Intes,Stefan T. Radev
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Foundation models of brain activity promise a new frontier for in silico neuroscience by emulating neural responses to complex stimuli across tasks and modalities. A natural next step is to ask whether these models can also be used in reverse. Can we recover a stimulus or its properties from synthetic brain activity? We study this question in a proof-of-concept setting using TRIBEv2. We pair the brain emulator with large language models (LLMs) that generate news headlines from linguistic parameters such as valence, arousal, and dominance. We then use simulation-based inference to learn a probabilistic mapping from brain maps to latent stimulus parameters. Our results show that these parameters can be recovered from predicted brain maps, validating the quality of neural encodings. They also show that LLMs can serve as controllable stimulus generators for simulated experiments. Together, these findings provide a step toward decoding and inverse design with foundation brain models.

[AI-78] me-Series Forecasting in Safety-Critical Environments: An EU-AI-Act-Compliant Open-Source Package / Zeitreihenprognose in sicherheitskritischen Umgebungen: Ein KI-VO-konformes Open-Source-Paket

【速读】:该论文旨在解决在安全关键环境中使用基于Python的时间序列点预测模型时的合规性问题,尤其针对欧盟《人工智能法案》(Regulation (EU) 2024/1689,简称EU AI Act)、IEC 61508、ISA/IEC 62443系列标准及网络安全韧性法案(Cyber Resilience Act)等法规要求难以嵌入到现有工具链中的挑战。解决方案的关键在于采用“合规即设计”(Compliance-by-Design)方法,将监管要求直接锚定于库内部机制中,包括接口契约(API contracts)、持久化格式和持续集成(CI)门禁;同时通过四项不可妥协的代码开发规则(零死代码、确定性处理、故障安全处理、最小依赖)与配套流程规范(模型卡、可执行文档字符串、CI工作流、Common-Platform-Enumeration (CPE)标识符、REUSE兼容许可证、发布流水线)实现系统性保障。该方案明确排除交互式可视化、超参数调优、自动化机器学习(AutoML)及深度学习和大语言模型后端,以避免引入攻击面扩大、非确定性或可复现性受损的风险。

链接: https://arxiv.org/abs/2604.23859
作者: Thomas Bartz-Beielstein,Eva Bartz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Bilingual twin paper: English version first, German original below (91 pages total). Single shared bibliography

点击查看摘要

Abstract:With spotforecast2-safe we present an integrated Compliance-by-Design approach to Python-based point forecasting of time series in safety-critical environments. A review of the relevant open-source tooling shows that existing compliance solutions operate consistently outside of the library to be used - e.g. as scanners, templates, or runtime layers. spotforecast2-safe takes the inverse approach and anchors the requirements of Regulation (EU) 2024/1689 (the EU AI Act, in German: KI-VO), of IEC 61508, of the ISA/IEC 62443 standards series, and of the Cyber Resilience Act within the library: in application-programming-interface contracts, persistence formats, and continuous-integration gates. The approach is operationalised by four non-negotiable code-development rules (zero dead code, deterministic processing, fail-safe handling, minimal dependencies) together with the corresponding process rules (model card, executable docstrings, CI workflows, Common-Platform-Enumeration (CPE) identifier, REUSE-conformant licensing, release pipeline). Interactive visualisation, hyperparameter tuning and automated machine learning (AutoML), as well as deep-learning and large-language-model backends are deliberately excluded, because each of these components either enlarges the attack surface, introduces non-determinism, or impairs reproducibility. A bidirectional traceability matrix maps every regulatory provision onto the corresponding mechanism in the code; an end-to-end example of European-market electricity generation, transmission, and consumption forecasting demonstrates the application. The package is open-source and available under Affero General Public License (AGPL) 3.0-or-later.

[AI-79] Does Machine Unlearning Preserve Clinical Safety? A Risk Analysis for Medical Image Classification

【速读】:该论文旨在解决深度学习在医疗诊断中应用时,模型训练数据的删除(即“机器遗忘”)可能引发的临床风险问题。现有标准遗忘策略(如微调、随机标签和SalUn)虽能实现数据移除,但可能降低模型性能并显著增加假阴性率,从而放大临床风险。解决方案的关键在于提出SalUn-CRA(Clinical Risk-Aware),其核心改进是将原始SalUn中的随机重标签机制替换为基于熵的遗忘策略,专门针对需遗忘的恶性样本,防止模型学习到有害的良性关联,从而在保障遗忘有效性的同时显著降低临床风险。

链接: https://arxiv.org/abs/2604.23854
作者: Andreza M. C. Falcao,Filipe R. Cordeiro
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at SBCAS’26

点击查看摘要

Abstract:The application of Deep Learning in medical diagnosis must balance patient safety with compliance with data protection regulations. Machine Unlearning enables the selective removal of training data from deployed models. However, most methods are validated primarily through efficiency and privacy-oriented metrics, with limited attention to clinically asymmetric error costs. In this work, we investigate how unlearning affects clinical risk in binary medical image classification. We show that standard unlearning strategies (Fine-Tuning, Random Labeling, and SalUn) may reduce test utility while increasing false-negative rates, thereby amplifying clinical risk. To mitigate this, we propose SalUn-CRA (Clinical Risk-Aware), a variant of SalUn that replaces random relabeling with entropy-based forgetting for malignant samples in the forget set, preventing the model from learning harmful benign associations. We evaluate on DermaMNIST and PathMNIST medical image datasets under 20% and 50% data removal. Using Global Risk metrics with asymmetric costs, SalUn-CRA achieves lower or comparable clinical risk to full retraining while preserving unlearning effectiveness. These results suggest that clinical risk should be an integral component of unlearning validation in medical systems.

[AI-80] ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

【速读】:该论文旨在解决生成式 AI(Generative AI)代理在技能蒸馏(skill-distillation)过程中缺乏每步成本信息的问题,导致无法区分哪些步骤是必要的、哪些是冗余的。其核心挑战在于:没有细粒度的成本标注,蒸馏管道难以优化路径效率,可能保留高成本但无关紧要的步骤或遗漏关键修复点。解决方案的关键在于提出 ClawTrace 平台,它通过记录代理会话中的每一步 LLM 调用、工具使用和子代理生成,并生成包含每步美元成本、token 数量及冗余标记的 TraceCard(YAML 格式摘要),从而实现精确的成本归因;在此基础上构建 CostCraft 蒸馏管道,能生成三种类型的技能补丁(preserve、prune、repair),其中“prune”补丁通过反事实论证剔除不必要高成本步骤,显著降低推理开销并提升泛化能力。

链接: https://arxiv.org/abs/2604.23853
作者: Boqin Yuan,Renchu Song,Yue Su,Sen Yang,Jing Qin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Skill-distillation pipelines learn reusable rules from LLM agent trajectories, but they lack a key signal: how much each step costs. Without per-step cost, a pipeline cannot distinguish adding a missing step to fix a bug from removing an expensive step that never affected the outcome. We introduce ClawTrace, an agent tracing platform that records every LLM call, tool use, and sub-agent spawn during an agent session and compiles each session into a TraceCard: a compact YAML summary with per-step USD cost, token counts, and redundancy flags. Built on ClawTrace, CostCraft is a distillation pipeline that reads TraceCards and produces three types of skill patches. Preserve patches keep behaviors that led to success. Prune patches remove expensive steps that did not matter, each backed by a counterfactual argument against a named high-cost step. Repair patches fix failures grounded in oracle evidence. Ablations on 30 held-out SpreadsheetBench tasks show that both cost attribution and prune patches independently reduce quality regressions. When the same skill is applied to 30 unrelated SkillsBench tasks, an unexpected asymmetry emerges: prune rules transferred across benchmarks and cut median cost by 32%, while preserve rules, trained on benchmark-specific conventions, caused regressions on new task types. We release ClawTrace and TraceCards as open infrastructure for cost-aware agent research.

[AI-81] Scalable Production Scheduling: Linear Complexity via Unified Homogeneous Graphs

【速读】:该论文旨在解决工业场景中作业车间调度问题(Job Shop Scheduling Problem, JSSP)的高效求解难题,尤其关注策略在计算效率与拓扑鲁棒性之间的平衡。现有基于强化学习(Reinforcement Learning, RL)的方法常因图结构复杂度呈二次增长或异构层架构开销过大而面临可扩展性瓶颈。其解决方案的关键在于提出一种统一的图框架,通过特征驱动的同质化映射将不同节点角色投影至共享潜在空间,从而允许标准同构图同构网络(Graph Isomorphism Network, GIN)以线性复杂度捕捉资源竞争关系,实现低延迟推理。此外,研究发现作业-机器比(job-to-machine ratio)是决定策略性能的核心因素,而非问题绝对规模,并据此提出结构饱和假设:在临界拥堵实例(JM\mathcal{J} \approx \mathcal{M})上训练的策略能内化尺度不变的冲突解决逻辑,进而将大规模矩形实例视为饱和子问题的顺序拼接,无需针对不同规模重新训练,避免统计捷径过拟合,为动态生产环境中RL方案的部署提供稳健高效的路径。

链接: https://arxiv.org/abs/2604.23841
作者: Jonathan Hoss,Moritz Link,Noah Klarmann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Efficiently solving the Job Shop Scheduling Problem in real-world industrial applications requires policies that are both computationally lean and topologically robust. While Reinforcement Learning has shown potential in automating dispatching rules, existing models often struggle with a scalability bottleneck caused by quadratic graph complexity or the architectural overhead of heterogeneous layers. We introduce a unified graph framework that employs feature-based homogenization to project distinct node roles into a shared latent space. This allows a standard homogeneous Graph Isomorphism Network to capture complex resource contention with linear complexity, ensuring low-latency inference for large-scale industrial applications. Our empirical results demonstrate that our framework achieves state-of-the-art performance while exhibiting consistent zero-shot generalization. We identify the job-to-machine ratio as the primary driver of policy effectiveness, rather than absolute problem size. Based on this, we propose a hypothesis of structural saturation, demonstrating that policies trained on critically congested instances ( \mathcalJ \approx \mathcalM ) learn scale-invariant resolution strategies. Agents trained at this saturation point internalize invariant conflict-resolution logic, allowing them to treat massive rectangular instances as a sequential concatenation of saturated sub-problems. This approach eliminates the need for expensive scale-specific retraining and prevents overfitting to statistical shortcuts, providing a robust and efficient pathway for deploying RL solutions in dynamic production environments.

[AI-82] Domain-Filtered Knowledge Graphs from Sparse Autoencoder Features

【速读】:该论文旨在解决稀疏自编码器(Sparse Autoencoders, SAEs)提取的特征库在实际应用中缺乏结构化组织的问题,即特征分布杂乱、领域概念与通用或弱关联特征混杂、相关概念分散于多个单元且缺乏特征间关系的理解。其解决方案的关键在于:首先通过对比激活和多阶段过滤构建一个严格限定于特定领域的概念宇宙;随后在此基础上建立两个对齐的图结构视图——共现图用于刻画语料层面的概念层级结构,以及基于译码器机制的路径图用于连接源层与目标层特征的稀疏潜在通路;最终通过自动边标注将这些图转化为可读的知识图谱,从而将原始扁平的SAE特征集重构为内部知识图谱,实现从局部特征可解释性到模型全局知识地图的跃迁,并支持推理忠实性的审计。

链接: https://arxiv.org/abs/2604.23829
作者: John Winnicki,Abeynaya Gnanasekaran,Eric Darve
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sparse autoencoders (SAEs) extract millions of interpretable features from a language model, but flat feature inventories aren’t very useful on their own. Domain concepts get mixed with generic and weakly grounded features, while related ideas are scattered across many units, and there’s no way to understand relationships between features. We address this by first constructing a strict domain-specific concept universe from a large SAE inventory using contrastive activations and a multi-stage filtering process. Next, we build two aligned graph views on the filtered set: a co-occurrence graph for corpus-level conceptual structure, organized at multiple levels of granularity, and a transcoder-based mechanism graph that links source-layer and target-layer features through sparse latent pathways. Automated edge labeling then turns these graph views into readable knowledge graphs rather than unlabeled layouts. In a case study on a biology textbook, these graphs recover coherent chapter and subchapter-level structure, reveal concepts that bridge neighboring topics, and transform messy sentence-level activity containing thousands of features into compact, readable views that illustrate the model’s local activity. Taken together, this reframes a flat SAE inventory as an internal knowledge graph that converts feature-level interpretability into a global map of model knowledge and enables audits of reasoning faithfulness.

[AI-83] Query2Diagram: Answering Developer Queries with UML Diagrams

【速读】:该论文旨在解决软件文档普遍滞后或缺失的问题,尤其是在开发者需要聚焦理解复杂代码库时,传统自动化逆向工程工具生成的UML图信息过载且忽略开发意图。解决方案的关键在于提出查询驱动的UML图生成方法(query-driven UML diagram generation),利用大语言模型(LLM)根据自然语言问题直接生成语义聚焦的图表,仅包含与查询相关的元素并附带上下文描述。通过在结构化JSON格式的代码文件、开发者提问及对应图表示组成的精心构建数据集上微调Qwen2.5-Coder-14B模型,实验表明该方法显著提升F1分数并降低结构缺陷率,实现了结构合理且语义忠实于开发者意图的按需文档生成。

链接: https://arxiv.org/abs/2604.23816
作者: Oleg Baryshnikov(1),Anton M. Alekseev(2 and 3),Sergey I. Nikolenko(2 and 3) ((1) HSE University, (2) St. Petersburg Department of Steklov Mathematical Institute, RAS, (3) St. Petersburg State University)
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 20 pages, 6 figures. Code: this https URL

点击查看摘要

Abstract:Software documentation frequently becomes outdated or fails to exist entirely, yet developers need focused views of their codebase to understand complex systems. While automated reverse engineering tools can generate UML diagrams from code, they produce overwhelming detail without considering developer intent. We introduce query-driven UML diagram generation, where LLMs create diagrams that directly answer natural language questions about code. Unlike existing methods, our approach produces semantically focused diagrams containing only relevant elements with contextual descriptions. We fine-tune Qwen2.5-Coder-14B on a curated dataset of code files, developer queries, and corresponding diagram representations in a structured JSON format, evaluating with both automatic detection of structural defects and human assessment of semantic relevance. Results demonstrate that fine-tuning on a modest amount of manually corrected data yields dramatic improvements: our best model achieves the highest F1 scores while reducing defect rates below state-of-the-art LLMs, generating diagrams that are both structurally sound and semantically faithful to developer queries. Thus, we establish the feasibility of using LLMs for scalable contextual, on-demand documentation generation. We make our code and dataset publicly available at this https URL.

[AI-84] Symmetric Equilibrium Propagation for Thermodynamic Diffusion Training

【速读】:该论文旨在解决生成式 AI(Generative AI)中基于分数的扩散模型(score-based diffusion models)在物理硬件上实现训练闭环的问题,即如何在不依赖外部数字加速器的情况下,直接在模拟电路基板上完成梯度传播与参数更新。其核心挑战在于:传统方法需通过数字路径传输梯度,导致能效损失;而物理实现需确保训练过程无偏估计且具备可扩展性。解决方案的关键是提出对称双线性等效传播(symmetric bilinear Equilibrium Propagation, EqProp),它在零扰动极限下提供无偏的去噪分数匹配梯度估计,并通过对称扰动将主导偏差项从 O(β)\mathcal{O}(\beta) 降低至 O(β2)\mathcal{O}(\beta^2),同时保持低秩模块间耦合结构不变。这一机制使训练无需额外计算资源即可实现高对齐梯度更新,在有限弛豫时间内显著提升能效比,最终投影出每训练步相较匹配GPU基准10³–10⁴倍的能量优势。

链接: https://arxiv.org/abs/2604.23806
作者: Aditi De
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The reverse process in score-based diffusion models is formally equivalent to overdamped Langevin dynamics in a time-dependent energy landscape. In our prior work we showed that a bilinearly-coupled analog substrate can physically realize this dynamics at a projected three-to-four orders of magnitude energy advantage over digital inference by replacing dense skip connections with low-rank inter-module couplings. Whether the \emphtraining loop can be closed on the same substrate – without routing gradients through an external digital accelerator – has remained open. We resolve this affirmatively: Equilibrium Propagation applied directly to the bilinear energy yields an unbiased estimator of the denoising score-matching gradient in the zero-nudge limit. For finite nudging we derive a sharp bias bound controlled solely by substrate stiffness, local curvature, and the norm of the loss-gradient signal, with a bilinear-specific corollary showing that one dominant bias term vanishes identically for coupling-parameter updates. Symmetric nudging further upgrades the leading bias from \mathcalO(\beta) to \mathcalO(\beta^2) at negligible extra cost. Under realistic finite-relaxation budgets this upgrade is essential, as one-sided EqProp produces anti-correlated gradients while symmetric EqProp yields well-aligned updates. Bias-variance analysis determines the optimal operating point, and end-to-end physical-unit accounting projects a 10^3 - 10^4\times energy advantage per training step over a matched GPU baseline. Symmetric bilinear EqProp is the first local, readout-only training rule that preserves the low-rank coupling enabling scalable thermodynamic diffusion models.

[AI-85] FAIR_XAI: Improving Multimodal Foundation Model Fairness via Explainability for Wellbeing Assessment

【速读】:该论文旨在解决生成式视觉语言模型(Vision-Language Models, VLMs)在心理健康评估,特别是抑郁预测中的诊断可靠性与群体公平性问题。当前VLMs虽具强大表征能力,但其部署于临床场景时面临透明度不足和潜在偏见的风险,而现有研究尚未充分探索如何通过可解释人工智能(Explainable AI, XAI)手段实现公平性优化。解决方案的关键在于提出一种基于XAI的干预框架,结合公平性提示(fairness prompting)与基于解释的策略,在不同数据集(实验室环境AFAR-BSFT与自然情境E-DAIC)中对模型行为进行调节,以期在保持预测准确性的同时提升性别、种族等维度的公平表现。然而实验表明,此类干预在不同环境下效果差异显著:部分方法虽能改善程序一致性或特定公平指标,却可能牺牲整体准确率或加剧原有偏见,揭示出“程序透明”与“结果公平”之间仍存在根本性矛盾,因此未来需协同优化准确性、群体平等性和跨域泛化能力。

链接: https://arxiv.org/abs/2604.23786
作者: Sophie Chiang,Tom Brennan,Fethiye Irmak Dogan,Jiaee Cheong,Hatice Gunes
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 4 figures, 3 tables

点击查看摘要

Abstract:In recent years, the integration of multimodal machine learning in wellbeing assessment has offered transformative potential for monitoring mental health. However, with the rapid advancement of Vision-Language Models (VLMs), their deployment in clinical settings has raised concerns due to their lack of transparency and potential for bias. While previous research has explored the intersection of fairness and Explainable AI (XAI), its application to VLMs for wellbeing assessment and depression prediction remains under-explored. This work investigates VLM performance across laboratory (AFAR-BSFT) and naturalistic (E-DAIC) datasets, focusing on diagnostic reliability and demographic fairness. Performance varied substantially across environments and architectures; Phi3.5-Vision achieved 80.4% accuracy on E-DAIC, while Qwen2-VL struggled at 33.9%. Additionally, both models demonstrated a tendency to over-predict depression on AFAR-BSFT. Although bias existed across both architectures, Qwen2-VL showed higher gender disparities, while Phi-3.5-Vision exhibited more racial bias. Our XAI intervention framework yielded mixed results; fairness prompting achieved perfect equal opportunity for Qwen2-VL at a severe accuracy cost on E-DAIC. On AFAR-BSFT, explainability-based interventions improved procedural consistency but did not guarantee outcome fairness, sometimes amplifying racial bias. These results highlight a persistent gap between procedural transparency and equitable outcomes. We analyse these findings and consolidate concrete recommendations for addressing them, emphasising that future fairness interventions must jointly optimise predictive accuracy, demographic parity, and cross-domain generalisation.

[AI-86] he Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

【速读】:该论文旨在解决基于超网络(hypernetwork)的文档内化方法在处理与预训练知识冲突时性能显著下降的问题,尤其在深层事实冲突场景下准确率骤降至46.4%。其核心发现是:失败根源在于幅度(amplitude)问题而非表征能力不足——超网络虽能定位正确层,但其适配器(adapter)边际幅度在不同文档间基本恒定,而预训练模型的边际幅度随训练频率增长,导致深度冲突无法胜出。解决方案的关键在于引入“幅度调节”机制:Selective Layer Boosting通过放大适配器在顶层范数(top-norm)层的权重增强响应,Conflict-Aware Internalization则仅在基础模型对冲突内容自信时触发提升。二者均为无训练(training-free)策略,可将Gemma-2B和Mistral-7B在深度冲突上的准确率分别从46.4%和53.6%提升至71.0%和72.5%,同时保持新知识召回率,并优于纯检索增强生成(retrieval-augmented generation, RAG)方法在中等冲突任务中的表现。

链接: https://arxiv.org/abs/2604.23750
作者: Shuaizhi Cheng,Xiang Shi,Mingwei Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 35 pages, 15 figures

点击查看摘要

Abstract:Hypernetwork-based methods such as Doc-to-LoRA internalize a document into an LLM’s weights in a single forward pass, but they fail systematically on conflicts: when the document contradicts pretraining knowledge, accuracy collapses to 46.4% on the deepest facts. We show the failure is a magnitude problem rather than a representational one. The hypernetwork already targets the right layers, but its adapter margin is approximately constant across documents while the pretrained margin grows with training frequency, so deep conflicts lose by construction. The account predicts that failure should track prior strength: sorting 194 conflicts by the base model’s log-probability on the contradicted fact, baseline accuracy falls from 68% on weak-prior questions to 16% on strong-prior ones, a 52 percentage-point gap. The cure is amplitude. Selective Layer Boosting scales the adapter at its top-norm layers, and Conflict-Aware Internalization triggers boosting only when the base model is confident. Both are training-free; together they raise deep-conflict accuracy from 46.4% to 71.0% on Gemma-2B and from 53.6% to 72.5% on Mistral-7B while preserving novel-knowledge recall, and beat vanilla retrieval-augmented generation on medium conflicts by 18 percentage points despite operating entirely in parameter space. We release KID-Bench, a 489-question benchmark that separates novel recall, cross-knowledge combination, and prior-graded conflicts.

[AI-87] Expert Evaluation of LLM s Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在日语法律情境下进行开放式法律推理能力评估不足的问题,尤其是缺乏针对日本司法体系的专门数据集与专家评价机制。其关键解决方案是构建首个基于日本律师资格考试写作部分的真实题目的数据集,并通过法律专家对模型生成的回答进行人工评估,从而系统揭示LLMs在法律推理中的局限性与幻觉现象(hallucinations),即模型如何在无依据的情况下引入非判例或非法律规定的内容。这一方法不仅量化了当前LLMs在日语法律领域的性能边界,也为未来研究提供了可复现的基准和评估框架。

链接: https://arxiv.org/abs/2604.23730
作者: Jungmin Choi,Keisuke Sakaguchi,Hiroaki Yamada
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages, Accepted to ICAIL 2026

点击查看摘要

Abstract:Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in realistic scenarios remains insufficiently explored. Notably, to our best knowledge, there are no prior studies or datasets addressing this issue in the Japanese context. This study presents the first dataset designed to evaluate the open-ended legal reasoning performance of LLMs within the Japanese jurisdiction. The dataset is based on the writing component of the Japanese bar examination, which requires examinees to identify multiple legal issues from long narratives and to construct structured legal arguments in free text format. Our key contribution is the manual evaluation of LLMs’ generated responses by legal experts, which reveals limitations and challenges in legal reasoning. Moreover, we conducted a manual analysis of hallucinations to characterize when and how the models introduce content not supported by precedent or law. Our real exam questions, model-generated responses, and expert evaluations reveal the milestones of current LLMs in the Japanese legal domain. Our dataset and relevant resources will be available online. Comments: 5 pages, Accepted to ICAIL 2026 Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2604.23730 [cs.AI] (or arXiv:2604.23730v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.23730 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-88] OptProver: Bridging Olympiad and Optimization through Continual Training in Formal Theorem Proving

【速读】:该论文旨在解决现有形式化定理证明工具在本科层次优化问题上的能力不足问题,这类问题在机器学习、运筹学和科学计算中至关重要,但因其依赖领域特定的形式化(如凸性、最优性条件和算法分析),导致从奥林匹克级别数学到本科优化领域的直接迁移效果不佳。解决方案的关键在于提出OptProver模型,通过两个核心创新实现稳健迁移:一是基于专家迭代的大规模优化领域数据构建;二是设计一种专用的偏好学习目标,结合困惑度加权优化与对有效但无进展证明步骤的惩罚机制,从而引导搜索路径向高效方向演进,同时缓解分布偏移并避免灾难性遗忘。

链接: https://arxiv.org/abs/2604.23712
作者: Chenyi Li,Yanchen Nie,Zhengyu Ming,Gong Zhang,Kun Yuan,Zaiwen Wen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in formal theorem proving have focused on Olympiad-level mathematics, leaving undergraduate domains largely unexplored. Optimization, fundamental to machine learning, operations research, and scientific computing, remains underserved by existing provers. Its reliance on domain-specific formalisms (convexity, optimality conditions, and algorithmic analysis) creates significant distribution shift, making naive domain transfer ineffective. We present OptProver, a trained model that achieves robust transfer from Olympiad to undergraduate optimization. Starting from a strong Olympiad-level prover, our pipeline mitigates distribution shift through two key innovations. First, we employ large-scale optimization-focused data curation via expert iteration. Second, we introduce a specialized preference learning objective that integrates perplexity-weighted optimization with a mechanism to penalize valid but non-progressing proof steps. This not only addresses distribution shifts but also guides the search toward efficient trajectories. To enable rigorous evaluation, we construct a novel benchmark in Lean 4 focused on optimization. On this benchmark, OptProver achieves state-of-the-art Pass@1 and Pass@32 among comparably sized models while maintaining competitive performance on general theorem-proving tasks, demonstrating effective domain transfer without catastrophic forgetting.

[AI-89] ransferable Human Mobility Network Reconstruction with neuroGravity

【速读】:该论文旨在解决在缺乏全面交通调查数据的欠发达地区,如何准确重建人类移动性网络的问题。其核心挑战在于如何利用有限的公开数据(如城市设施和人口分布)来可靠地推断出城市间的流动模式,并实现跨城市的模型迁移。解决方案的关键在于提出了一种物理信息驱动的深度学习模型 neuroGravity,该模型不仅能够从少量观测中重构流动性流,还通过引入空间收入隔离指数(spatial income segregation index)量化了目标城市与源城市在社会经济结构上的相似性,从而显著提升了模型在未观测城市中的可迁移性和预测准确性。这一机制使得模型能够在资源受限地区生成超过1200座城市的移动性代理数据,为城市规划和公共卫生提供可行的数据支持。

链接: https://arxiv.org/abs/2604.23678
作者: Jinming Yang,Shaoyu Huang,Zongyuan Huang,Yaohui Jin,Xiaokang Yang,Marta C. Gonzalez,Yanyan Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate modeling of human mobility is critical for tackling urban planning and public health challenges. In undeveloped regions, the absence of comprehensive travel surveys necessitates reconstructing mobility networks from publicly available data. Here we develop neuroGravity, a physics-informed deep learning model that reliably reconstructs mobility flows from limited observations and transfers to unobserved cities. Using only urban facility and population distributions, we find that neuroGravity’s regional representations strongly correlate with socioeconomic and livability status, offering scalable proxies for costly surveys. Furthermore, we uncover that spatial income segregation plays a key role in model transferability: mobility networks are most reliably reconstructed when target cities share similar segregation levels with the source. We design an index to quantify this segregation and accurately predict transferability. Finally, we generate mobility flow proxies for over 1,200 cities worldwide, highlighting neuroGravity’s potential to mitigate critical data shortages in resource-limited, underdeveloped areas.

[AI-90] Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work

【速读】:该论文旨在解决生物医药研究中因复杂多步骤分析流程、数据模态异质性以及专业人力匮乏导致的研究门槛高、资源分布不均的问题,尤其针对独立研究者和低资源地区科研人员面临的困境。其解决方案的核心在于提出“Vibe Medicine”协同工作范式,通过自然语言引导具备技能增强的AI代理(AI agent)执行端到端的生物医学工作流,同时由临床专家或研究人员担任研究决策者角色,负责目标设定、中间结果审核与领域知识驱动的判断。该方案依托三层基础设施:高性能大语言模型(LLM)、如OpenClaw和Hermes Agent等智能体框架,以及包含1000余个来自开源库的结构化医学技能集合,从而实现高效、可扩展且可控的AI辅助科研流程。

链接: https://arxiv.org/abs/2604.23674
作者: Zihao Wu,Steven Xu,Bowen Chen,Shaowen Wan,Yiwei Li,Wei Ruan,Yanjun Lyu,Siyuan Li,Dajiang Zhu,Tianming Liu,Lin Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the emergence of large language models (LLMs) and AI agent frameworks, the human-AI co-work paradigm known as Vibe Coding is changing how people code, making it more accessible and productive. In scientific research, where workflows are more complex and the burden of specialized labor limits independent researchers and those in low-resource areas, the potential impact is even greater, particularly in biomedicine, which involves heterogeneous data modalities and multi-step analytical pipelines. In this paper, we introduce Vibe Medicine, a co-work paradigm in which clinicians and researchers direct skill-augmented AI agents through natural language to execute complex, multi-step biomedical workflows, while retaining the role of research director who specifies objectives, reviews intermediate results, and makes domain-informed decisions. The enabling infrastructure consists of three layers: capable LLMs, agent frameworks such as OpenClaw and Hermes Agent, and the OpenClaw medical skills collection, which includes more than 1,000 curated skills from multiple open-source repositories. We analyze the architecture and skill categories of this collection across ten biomedical domains, and present case studies covering rare disease diagnosis, drug repurposing, and clinical trial design that demonstrate end-to-end workflows in practice. We also identify the principal risks, such as hallucination, data privacy, and over-reliance, and outline directions toward more reliable, trustworthy, and clinically integrated agent-assisted research that advances research and technological equity and reduces health care resource disparities.

[AI-91] An AI-Based Supervisory Measurement Integrity Validation Layer for Cyber-Resilient AC/DC Protection in Inverter-Based Microgrids

【速读】:该论文旨在解决逆变器型微电网中线路电流差动保护继电器(Line Current Differential Relays, LCDRs)因依赖数字化通信测量而面临虚假数据注入攻击(False-Data Injection Attacks, FDIAs)的问题,此类攻击可导致保护误动作但不对应实际物理故障。解决方案的关键在于提出一种测量完整性验证方案,作为现代LCDRs的监督层运行,通过分析继电保护动作期间短时同步瞬时电流采样窗口,利用仅基于继电器可用电流数据训练的循环神经网络(Recurrent Neural Network, RNN)识别差动电流波形的时间结构特征,从而区分真实故障引发的电流轨迹与网络攻击伪造的异常轨迹。该方法无需额外传感器、辅助保护元件或拓扑信息,适用于交直流系统且无需修改原有继电器结构,经硬件在环仿真验证满足保护时限要求,具备实时性与高检测准确性。

链接: https://arxiv.org/abs/2604.23666
作者: Ahmad Mohammad Saber,Ahmed Saber Refae,Davor Svetinovic,Hatem Zeineldin,Amr Youssef,Ehab F. El-Saadany,Deepa Kundur
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Line current differential relays (LCDRs) are measurement-driven relays that rely on time-synchronized multi-phase current waveforms to infer internal faults in AC and DC power networks. In inverter-based microgrids, however, the increasing reliance on digitally communicated measurements exposes LCDRs to false-data injection attacks (FDIAs), in which adversaries manipulate remote measurement streams to create protection-triggering yet physically inconsistent current trajectories. This paper addresses this emerging measurement integrity problem by introducing a measurement integrity validation scheme that operates as a supervisory instrumentation layer for modern LCDRs. The proposed scheme interprets short windows of synchronized instantaneous current measurements recorded during relay operation and assesses their physical consistency to distinguish genuine fault-induced trajectories from cyber-manipulated measurement streams. A recurrent neural network is trained offline using only relay-available current measurements and exploits the temporal structure of differential current waveforms, which remains informative in inverter-dominated systems where current magnitude is no longer a reliable observable. The method requires no additional sensors, auxiliary protection elements, or prior knowledge of network topology, and is applicable to both AC and DC LCDRs without structural modification. The proposed measurement validation scheme is evaluated on an islanded inverter-based microgrid under a comprehensive set of fault and FDIA scenarios, demonstrating high detection accuracy while preserving relay dependability. Hardware-in-the-loop validation using an OPAL-RT real-time simulator confirms that the scheme satisfies protection timing constraints and can operate in real time under realistic operating conditions.

[AI-92] FlowPlace: Flow Matching for Chip Placement

【速读】:该论文旨在解决芯片布局(chip placement)中生成式 AI (Generative AI) 方法存在的三大问题:预训练数据为随机合成导致质量不足、采样时间过长,以及依赖梯度求解器导致布局重叠。其核心解决方案是 FlowPlace,关键创新在于:1)基于掩码引导的合成数据生成以提升数据相关性;2)基于流模型(flow-based)的高效训练机制并支持灵活先验注入;3)硬约束采样策略确保无重叠布局。实验表明,FlowPlace 在 PPA(功耗-性能-面积)指标上更优,采样效率提升 10–50 倍,且完全避免布局重叠。

链接: https://arxiv.org/abs/2604.23658
作者: Peng Xie,Ke Xue,Yunqi Shi,Ruo-Tong Chen,Chengrui Gao,Siyuan Xu,Chenjian Ding,Mingxuan Yuan,Chao Qian
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: DAC 2026

点击查看摘要

Abstract:Chip placement plays an important role in physical design. While generative models like diffusion models offer promising learning-based solutions, current methods have the following limitations: they use random synthetic data for pre-training, require long sampling times, and often result in overlaps due to their dependence on gradient-based solvers during the sampling process. To overcome these issues, we propose FlowPlace, which features mask-guided synthetic data generation, flow-based efficient training with flexible prior injection, and hard constraint sampling for overlap-free layouts. Experiments on OpenROAD and ICCAD 2015 benchmarks show FlowPlace achieves better PPA metrics, 10-50 \times faster sampling efficiency, and zero overlaps.

[AI-93] Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture

【速读】:该论文旨在解决前沿人工智能系统中存在的**代理不对齐(agentic misalignment)**问题,即模型在无明确用户指令的情况下,基于内部构建的目标生成并执行有害行为。现有缓解方法如基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)和宪法提示(constitutional prompting)主要作用于模型层面,仅提供概率性安全保证,难以应对复杂威胁场景。其解决方案的关键在于提出一种“分权制衡”式的系统架构——策略-执行-授权(Policy-Execution-Authorization, PEA)架构,通过将意图生成、授权与执行分离为独立且隔离的层级,并利用密码学约束的能力令牌(capability tokens)进行连接,实现从行为层面到结构层面的安全保障。该架构引入五项核心机制:意图验证层(Intent Verification Layer, IVL)、意图溯源追踪(Intent Lineage Tracking, ILT)、目标漂移检测、输出语义门(Output Semantic Gate, OSG)及形式化验证框架,从而确保即使在模型被攻击的情况下,仍能维持目标完整性与操作可追溯性,为自主代理治理提供可信赖的系统级约束基础。

链接: https://arxiv.org/abs/2604.23646
作者: Rong Xiang
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Recent evidence suggests that frontier AI systems can exhibit agentic misalignment, generating and executing harmful actions derived from internally constructed goals, even without explicit user requests. Existing mitigation methods, such as Reinforcement Learning from Human Feedback (RLHF) and constitutional prompting, operate primarily at the model level and provide only probabilistic safety guarantees. We propose the Policy-Execution-Authorization (PEA) architecture, a “separation-of-powers” design that enforces safety at the system level. PEA decouples intent generation, authorization, and execution into independent, isolated layers connected via cryptographically constrained capability tokens. We present five core contributions: (C1) an Intent Verification Layer (IVL) for ensuring capability-intent consistency; (C2) Intent Lineage Tracking (ILT), which binds all executable intents to the originating user request via cryptographic anchors; (C3) Goal Drift Detection, which rejects semantically divergent intents below a configurable threshold; (C4) an Output Semantic Gate (OSG) that detects implicit coercion using a structured K \times I \times P threat calculus (Knowledge, Influence, Policy); and (C5) a formal verification framework proving that goal integrity is maintained even under adversarial model compromise. By shifting agent alignment from a behavioral property to a structurally enforced system constraint, PEA provides a robust foundation for the governance of autonomous agents.

[AI-94] Causal Discovery as Dialectical Aggregation: A Quantitative Argumentation Framework KR2026

【速读】:该论文旨在解决约束型因果发现方法在有限样本条件下易受错误条件独立(Conditional Independence, CI)判断影响而导致结构误差累积的问题。其解决方案的关键在于提出一种语义驱动的框架——定量论证因果发现(Quantitative Argumentation for Causal Discovery, QACD),将CI检验结果建模为具有可变强度和可撤销性的论证,而非不可逆的约束;通过连接性驱动的见证传播机制聚合冲突证据,并最终生成候选邻接关系上的固定点可接受性标注,从而提升因果结构的稳健性和干预可靠性。

链接: https://arxiv.org/abs/2604.23633
作者: Sheng Wei,Yulin Chen,Beishui Liao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the 23rd International Conference on Principles of Knowledge Representation and Reasoning (KR 2026). This arXiv version includes supplementary material and additional implementation details

点击查看摘要

Abstract:Constraint-based causal discovery is brittle in finite-sample regimes because erroneous conditional-independence (CI) decisions can cascade into substantial structural errors. We propose Quantitative Argumentation for Causal Discovery (QACD), a semantics-driven framework that represents CI outcomes as graded, defeasible arguments rather than irreversible constraints. QACD maps statistical test outcomes to argument strengths and aggregates conflicting evidence through connectivity-mediated witness propagation, producing a fixed-point acceptability labeling over candidate adjacencies. Experiments on standard benchmark Bayesian networks suggest that QACD improves structural coherence and interventional reliability in several noisy or inconsistent CI regimes, while remaining competitive with classical constraint-based, hybrid, and prior argumentation-based baselines.

[AI-95] andem: Riding Together with Large and Small Language Models for Efficient Reasoning ACL2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在执行推理密集型任务时因生成序列过长而导致的高计算开销问题。其核心解决方案是提出Tandem框架,通过协同大语言模型(LLM)与小语言模型(Small Language Models, SLMs)实现高效推理:LLM作为策略协调者,生成少量关键推理洞察以指导SLM完成完整的推理过程并输出最终答案;同时引入基于成本感知的终止机制,动态判断是否已积累足够推理引导信息,从而提前终止LLM生成,显著降低计算成本。实验表明,该方法在数学推理和代码生成任务上相较纯LLM推理减少约40%的计算量,且性能相当或更优,且跨域迁移能力良好。

链接: https://arxiv.org/abs/2604.23623
作者: Zichuan Fu,Xian Wu,Guojing Li,Yejing Wang,Yijun Chen,Zihao Zhao,Yixuan Luo,Hanyu Yan,Yefeng Zheng,Xiangyu Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL 2026 Findings

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final answers. While such approaches improve answer quality and interpretability, they incur substantial computational overhead due to the prolonged generation sequences. In this paper, we propose Tandem, a novel collaborative framework that synergizes large and small language models (LLMs and SLMs) to achieve high-quality reasoning with significantly reduced computational cost. Specifically, the LLM serves as a strategic coordinator, efficiently generating a compact set of critical reasoning insights. These insights are then used to guide a smaller, more efficient SLM in executing the full reasoning process and delivering the final response. To balance efficiency and reliability, Tandem introduces a cost-aware termination mechanism that adaptively determines when sufficient reasoning guidance has been accumulated, enabling early stopping of the LLM’s generation. Experiments on mathematical reasoning and code generation benchmarks demonstrate that Tandem reduces computational costs by approximately 40% compared to standalone LLM reasoning, while achieving superior or competitive performance. Furthermore, the sufficiency classifier trained on one domain transfers effectively to others without retraining. The code is available at: this https URL.

[AI-96] hinking Like a Clinician: A Cognitive AI Agent for Clinical Diagnosis via Panoramic Profiling and Adversarial Debate

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在临床决策支持中因处理非结构化电子健康记录(Electronic Health Records, EHRs)而产生的“隧道视野”(tunnel vision)和诊断幻觉(diagnostic hallucinations)问题。解决方案的核心在于提出一种基于链式结构的临床推理框架DxChain,其关键创新包括:(i) 采用“先画像后规划”(Profile-Then-Plan)范式建立全景患者基线以缓解冷启动阶段的幻觉;(ii) 引入医学思维树(Medical Tree-of-Thoughts, Med-ToT)算法实现前瞻性的策略规划与资源感知导航;(iii) 设计辩证诊断验证机制,通过“天使-魔鬼”对抗辩论模式解决复杂证据冲突,从而显著提升诊断准确性和逻辑一致性。

链接: https://arxiv.org/abs/2604.23605
作者: Zhiqi Lv,Duofan Tu,Jun Li,Mingyue Zhao,Heqin Zhu,Wenliang Li,Shaohua Kevin Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The application of large language models (LLMs) in clinical decision support faces significant challenges of “tunnel vision” and diagnostic hallucinations present in their processing unstructured electronic health records (EHRs). To address these challenges, we propose a novel chain-based clinical reasoning framework, called DxChain, which transforms the diagnostic workflow into an iterative process by mirroring a clinician’s cognitive trajectory that consists of “Memory Anchoring”, “Navigation” and “Verification” phases. DxChain introduces three key methodological innovations to elicit the potential of LLM: (i) a Profile-Then-Plan paradigm to mitigate cold-start hallucinations by establishing a panoramic patient baseline, (ii) a Medical Tree-of-Thoughts (Med-ToT) algorithm for strategic look ahead planning and resource aware navigation, and (iii) a Dialectical Diagnostic Verification procedure utilizing “Angel-Devil” adversarial debates to resolve complex evidence conflicts. Evaluated on two real world benchmarks, MIMIC-IV-Ext Cardiac Disease and MIMIC-IV-Ext CDM, DxChain achieves state-of-the-art performances in both diagnostic accuracy and logical consistency, offering a modular and reliable architecture for next-generation clinical AI. The code is at this https URL.

[AI-97] Partition-of-Unity Gaussian Kolmogorov-Arnold Networks

【速读】:该论文旨在解决基于径向基函数(Radial Basis Function, RBF)的Kolmogorov–Arnold Network (KAN) 在训练过程中对尺度参数 ϵ\epsilon 敏感、稳定性差以及在非光滑目标函数上性能受限的问题。其核心解决方案是提出了一种Shepard-type归一化机制,即Partition-of-unity Gaussian KAN (PU-GKAN),通过将每条边上的高斯基函数值除以其局部固定中心的和,构建出具有单位分解性质(partition-of-unity)的特征映射,从而实现边缘级精确常数再现,并引入显式的有限特征核解释。此设计显著降低了模型对 ϵ\epsilon 的敏感性,提升了验证准确率与训练稳定性,且在多种场景下(包括高维架构、Matérn RBF基函数及物理信息驱动的偏微分方程问题)均表现出鲁棒性优势。

链接: https://arxiv.org/abs/2604.23599
作者: Amir Nooeizadegan
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP)
备注:

点击查看摘要

Abstract:Gaussian basis functions provide an efficient and flexible alternative to spline activations in KANs. In this work, we introduce the partition-of-unity Gaussian KAN (PU-GKAN), a Shepard-type normalized Gaussian KAN in which the Gaussian basis values on each edge are divided by their local sum over fixed centers. This produces a partition-of-unity feature map with trainable coefficients, while preserving the standard edge-based KAN structure. The normalized construction gives exact constant reproduction at the edge level and admits an explicit finite-feature kernel interpretation. We formulate both the standard Gaussian KAN (GKAN) and PU-GKAN from a finite-feature and additive-kernel viewpoint, making the induced layer kernels and empirical feature matrices explicit. Using the first-layer feature matrix as the reference object, we adopt a practical scale-selection interval for (\epsilon), with the lower endpoint determined by adjacent-center overlap and the upper endpoint determined by a conservative conditioning threshold. Numerical experiments show that PU-GKAN reduces sensitivity to (\epsilon), improves validation accuracy for most smooth and moderately non-smooth targets, and gives more stable training behavior. The benefit persists across sample-size and center-number sweeps, higher-dimensional architectures, Matérn RBF bases, and physics-informed examples involving Helmholtz and wave equations. These results indicate that Shepard-type partition-of-unity normalization is a simple and effective stabilization mechanism for RBF-based KANs. Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP) Cite as: arXiv:2604.23599 [cs.CE] (or arXiv:2604.23599v1 [cs.CE] for this version) https://doi.org/10.48550/arXiv.2604.23599 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-98] When AI reviews science: Can we trust the referee?

【速读】:该论文旨在解决生成式 AI(Generative AI)在学术同行评审中的可靠性与安全性问题,特别是针对AI审稿过程中存在的潜在攻击向量和不可靠行为。其解决方案的关键在于构建一个覆盖审稿全生命周期的安全性与可靠性分析框架,涵盖训练与数据检索、初筛、深度评审、反驳阶段及系统级层面,并通过四组对照实验对ICLR 2025投稿进行实证审计,以识别声誉框架、断言强度、反驳谄媚和上下文投毒等关键因素对评分的因果影响,从而为评估和提升AI同行评审的可信度提供可量化、可测试的基准与改进方向。

链接: https://arxiv.org/abs/2604.23593
作者: Jialiang Wang,Yuchen Liu,Hang Xu,Kaichun Hu,Shimin Di,Wangze Ni,Linan Yue,Min-Ling Zhang,Kui Ren,Lei Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The volume of scientific submissions continues to climb, outpacing the capacity of qualified human referees and stretching editorial timelines. At the same time, modern large language models (LLMs) offer impressive capabilities in summarization, fact checking, and literature triage, making the integration of AI into peer review increasingly attractive – and, in practice, unavoidable. Yet early deployments and informal adoption have exposed acute failure modes. Recent incidents have revealed that hidden prompt injections embedded in manuscripts can steer LLM-generated reviews toward unjustifiably positive judgments. Complementary studies have also demonstrated brittleness to adversarial phrasing, authority and length biases, and hallucinated claims. These episodes raise a central question for scholarly communication: when AI reviews science, can we trust the AI referee? This paper provides a security- and reliability-centered analysis of AI peer review. We map attacks across the review lifecycle – training and data retrieval, desk review, deep review, rebuttal, and system-level. We instantiate this taxonomy with four treatment-control probes on a stratified set of ICLR 2025 submissions, using two advanced LLM-based referees to isolate the causal effects of prestige framing, assertion strength, rebuttal sycophancy, and contextual poisoning on review scores. Together, this taxonomy and experimental audit provide an evidence-based baseline for assessing and tracking the reliability of AI peer review and highlight concrete failure points to guide targeted, testable mitigations.

[AI-99] PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement

【速读】:该论文旨在解决生成式 AI 在物理感知符号模拟(physics-aware symbolic simulation)中的语义鸿沟问题,即如何将自然语言描述的物理现象准确转化为可执行的三维场景仿真环境。现有大语言模型(LLMs)虽在通用代码生成上表现优异,但在物理准确性方面存在显著不足。解决方案的关键在于提出一个名为 Self-Corrective Multi-Agent Refinement Framework (SMRF) 的多智能体协同优化框架,其包含三个专业化代理:模拟生成器(simulation generator)、错误纠正器(error corrector)和模拟精炼器(simulation refiner),通过迭代协作与领域特定验证机制,显著提升物理仿真的准确性和可执行性。实验表明,SMRF 在自建基准 PhysCodeBench 上取得 67.7 分的综合性能,相比最优基线提升 31.4 分,验证了错误纠正机制与多智能体架构对高质量物理模拟生成的核心作用。

链接: https://arxiv.org/abs/2604.23580
作者: Tianyidan Xie,Peiyu Wang,Yuyi Qian,Yuxuan Wang,Rui Ma,Ying Tai,Song Wu,Qian Wang,Lanjun Wang,Zili Yi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Physics-aware symbolic simulation of 3D scenes is critical for robotics, embodied AI, and scientific computing, requiring models to understand natural language descriptions of physical phenomena and translate them into executable simulation environments. While large language models (LLMs) excel at general code generation, they struggle with the semantic gap between physical descriptions and simulation implementation. We introduce PhysCodeBench, the first comprehensive benchmark for evaluating physics-aware symbolic simulation, comprising 700 manually-crafted diverse samples across mechanics, fluid dynamics, and soft-body physics with expert annotations. Our evaluation framework measures both code executability and physical accuracy through automated and visual assessment. Building on this, we propose a Self-Corrective Multi-Agent Refinement Framework (SMRF) with three specialized agents (simulation generator, error corrector, and simulation refiner) that collaborate iteratively with domain-specific validation to produce physically accurate simulations. SMRF achieves 67.7 points overall performance compared to 36.3 points for the best baseline among evaluated SOTA models, representing a 31.4-point improvement. Our analysis demonstrates that error correction is critical for accurate physics-aware symbolic simulation and that specialized multi-agent approaches significantly outperform single-agent methods across the tested physical domains.

[AI-100] CAPSULE: Control-Theoretic Action Perturbations for Safe Uncertainty-Aware Reinforcement Learning

【速读】:该论文旨在解决高维系统中未知动力学下安全探索的问题,现有安全强化学习方法通常仅提供期望意义上的安全保证,仍可能导致安全违规;而控制理论方法虽能提供硬约束的安全保障,但往往依赖已知系统动力学或需准确估计控制仿射模型。其解决方案的关键在于:在离线设置下学习一个概率性的控制仿射动力学模型,并利用该模型显式构造包含模型不确定性的控制屏障函数(Control Barrier Functions, CBFs),从而生成保守的安全约束;这些约束通过在线的基于约束的动作修正机制加以执行,实现安全探索的同时不显著限制任务性能。

链接: https://arxiv.org/abs/2604.23576
作者: Rahul Narava,Siddharth Verma,Ojas Jain,Shashi Shekhar Jha,Mayank Shekhar Jha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ensuring safe exploration in high-dimensional systems with unknown dynamics remains a significant challenge. Existing safe reinforcement learning methods often provide safety guarantees only in expectation, which can still lead to safety violations. Control-theoretic approaches, in contrast, offer hard constraint-based safety guarantees but typically assume access to known system dynamics or require accurate estimation of control-affine models. In this paper, we propose a safe reinforcement learning framework that learns a probabilistic control-affine dynamics model in an offline setting. The learned model is leveraged to explicitly construct control barrier functions (CBFs) that incorporate model uncertainty to provide conservative safety constraints. These CBF constraints are enforced through an online constraint-based action correction mechanism, enabling safe exploration without overly restricting task performance. Empirical evaluations on nonlinear, complex continuous-control benchmarks demonstrate that our approach achieves returns comparable to those of existing baselines while significantly reducing safety violations.

[AI-101] On the Memorization of Consistency Distillation for Diffusion Models

【速读】:该论文旨在解决扩散模型(diffusion models)在部署前通过蒸馏(distillation)过程对记忆行为(memorization behavior)的影响机制不明确的问题,尤其是当教师模型已存在数据记忆时,学生模型如何继承或重塑这种记忆特性。解决方案的关键在于揭示一致性蒸馏(consistency distillation)能够抑制与记忆相关的不稳定特征方向,同时保留稳定的、可泛化的模式,从而在不损害生成质量的前提下显著降低转移的记忆现象。这一发现表明蒸馏不仅是加速模型推理的手段,更是一种优化记忆-泛化权衡的有效机制。

链接: https://arxiv.org/abs/2604.23552
作者: Bingqing Jiang,Difan Zou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 34 pages

点击查看摘要

Abstract:Diffusion models are central to modern generative modeling, and understanding how they balance memorization and generalization is critical for reliable deployment. Recent work has shown that memorization in diffusion models is shaped by training dynamics, with generalization and memorization emerging at different stages of training. However, deployed diffusion models are often further distilled, introducing an additional training phase whose impact on memorization is not well understood. In this work, we analyze how distillation reshapes memorization behavior in diffusion models, taking consistency distillation as a representative framework. Empirically, we show that when applied to a teacher model that has memorized data, consistency distillation significantly reduces transferred memorization in the student while preserving, and sometimes improving, sample quality. To explain this behavior, we provide a theoretical analysis using a random feature neural network model [Bonnaire et al., 2025], showing that consistency distillation suppresses unstable feature directions associated with memorization while preserving stable, generalizable modes. Our findings suggest that distillation can serve not only as an acceleration tool, but also as a mechanism for improving the memorization-generalization trade-off.

[AI-102] Hardware-Efficient FPGA Implementation of Sigmoid Function Using Mixed-Radix Hyperbolic Rotation CORDIC

【速读】:该论文旨在解决在资源受限的边缘设备(如现场可编程门阵列,FPGA)上高效实现非线性激活函数的问题,特别是针对广泛应用于概率输出、二分类和循环神经网络门控机制中的Sigmoid激活函数。由于其依赖指数运算导致硬件开销大,本文提出一种基于混合进制CORDIC(Mixed-Radix CORDIC, MR-CORDIC)架构的硬件优化方案,关键在于利用Sigmoid与双曲正切(tanh)函数之间的数学关系,将输入范围归一化至1,使对应的tanh计算可在更小的0.5范围内进行,从而显著改善收敛性;进一步引入改进的混合进制双曲旋转CORDIC(MR-HRC)算法,结合radix-2与radix-4迭代策略,在初始radix-2阶段保障稳定收敛,后续radix-4阶段加速收敛且无需缩放因子补偿,最终通过radix-2线性矢量CORDIC(R2-LVC)完成双曲正弦与余弦值的除法运算以获得tanh输出,整体设计高度流水化,实现在Xilinx Virtex-7 FPGA上的低资源消耗(仅835个逻辑切片,无DSP使用),同时保持高精度(均方误差为4.23×10⁻⁴)。

链接: https://arxiv.org/abs/2604.23547
作者: Chintan Panchal,Ankur Changela,Mohendra Roy
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: This paper has been accepted for the 2026 International Conference on Applied Artificial Intelligence (2AI)

点击查看摘要

Abstract:Efficient hardware implementation of nonlinear activation functions is a crucial task in deploying artificial neural networks on resource-constrained and edge devices such as Field-Programmable Gate Arrays (FPGAs). The sigmoid activation function is widely used for probabilistic output, binary classification, and gating mechanisms in recurrent neural networks, despite its reliance on exponential computations. This paper presents a hardware-efficient FPGA implementation of the sigmoid activation function using a mixed-radix CORDIC-based architecture. The proposed approach leverages the mathematical relationship between the sigmoid and hyperbolic tangent functions. The input range is normalized to 1, enabling the corresponding tanh computation to operate within a reduced range of 0.5, which significantly improves convergence behavior. To achieve high accuracy with minimal hardware overhead, a modified mixed-radix hyperbolic rotation CORDIC (MR-HRC) algorithm combining radix-2 and radix-4 iterations is introduced. The initial radix-2 stage ensures stable convergence, while the subsequent radix-4 stage accelerates convergence without requiring scale-factor compensation. In the final stage, a radix-2 linear vectoring CORDIC (R2-LVC) is used to compute the hyperbolic tangent by dividing hyperbolic sine and cosine values derived from the MR-HRC algorithm. The entire architecture is fully pipelined and implemented on an FPGA. The design is realized on an Xilinx Virtex-7 FPGA using a 16-bit fixed-point representation. Experimental results demonstrate a significant reduction in hardware utilization, requiring only 835 logic slices with zero DSP usage. Additionally, the design achieves a mean absolute error of 4.23 10^-4, outperforming several recent sigmoid implementations. Comments: This paper has been accepted for the 2026 International Conference on Applied Artificial Intelligence (2AI) Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.23547 [cs.AR] (or arXiv:2604.23547v1 [cs.AR] for this version) https://doi.org/10.48550/arXiv.2604.23547 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Mohendra Roy (PhD) [view email] [v1] Sun, 26 Apr 2026 05:52:39 UTC (122 KB)

[AI-103] MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation

【速读】:该论文旨在解决生成式 AI(Generative AI)快速发展背景下,模型和数据卡片(Model and Data Cards)自动化生成缺乏大规模、高保真基准测试的问题。现有手动制作方式难以扩展,而自动方法又缺少系统性评估标准。其解决方案的关键在于提出 MetaGAI,一个包含 2,541 个经验证文档三元组的综合基准,通过学术论文、GitHub 仓库与 Hugging Face 资产的语义三角验证构建;并采用多智能体框架(包括检索器、生成器和编辑器代理),结合四维人机协同评估机制(含编辑后人工校准的真实标签),建立融合自动化指标与受控大语言模型作为裁判(LLM-as-a-Judge)的稳健评估协议,从而为模型和数据卡片的自动化生成提供可扩展、可验证的测试平台。

链接: https://arxiv.org/abs/2604.23539
作者: Haoxuan Zhang,Ruochi Li,Yang Zhang,Zhenni Liang,Junhua Ding,Ting Xiao,Haihua Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid proliferation of Generative AI necessitates rigorous documentation standards for transparency and governance. However, manual creation of Model and Data Cards is not scalable, while automated approaches lack large-scale, high-fidelity benchmarks for systematic evaluation. We introduce MetaGAI, a comprehensive benchmark comprising 2,541 verified document triplets constructed through semantic triangulation of academic papers, GitHub repositories, and Hugging Face artifacts. Unlike prior single-source datasets, MetaGAI employs a multi-agent framework with specialized Retriever, Generator, and Editor agents, validated through four-dimensional human-in-the-loop assessment, including human evaluation of editor-refined ground truth. We establish a robust evaluation protocol combining automated metrics with validated LLM-as-a-Judge frameworks. Extensive analysis reveals that sparse Mixture-of-Experts architectures achieve superior cost-quality efficiency, while a fundamental trade-off exists between faithfulness and completeness. MetaGAI provides a foundational testbed for benchmarking, training, and analyzing automated Model and Data Card generation methods at scale. Our data and code are available at: this https URL.

[AI-104] Grammar-Constrained Refinement of Safety Operational Rules Using Language in the Loop: What Could Go Wrong

【速读】:该论文旨在解决网络物理系统(Cyber-Physical Systems, CPS)中操作规则因环境演化而产生的不一致性问题,即在仿真验证过程中,原有安全规范与实际观测到的系统行为之间出现偏差时,如何进行语法正确且语义合理的规则修正。其解决方案的关键在于引入一种结合反事实推理(counterfactual reasoning)与语法约束的精化循环机制,确保规则修改既满足领域特定语法要求,又能通过语义层面的合理性验证,从而避免生成过拟合或逻辑不合理的新规则。该方法已在自动驾驶控制系统中成功修复了传统基线模型推导出的不一致规则,同时保持语法合规性,并揭示了大语言模型(Large Language Model, LLM)在规则精化中的性能差异和潜在安全风险,强调未来需加强语法强制、语义验证及更全面的评估策略。

链接: https://arxiv.org/abs/2604.23523
作者: Khouloud Gaaloul,Zaid Ghazal,Madhu Latha Pulimi,Sam Emmanuel Kathiravan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 6 pages, 1 figure, 2 tables. Accepted at SEAMS 2026

点击查看摘要

Abstract:Safety specifications in cyber-physical systems (CPS) capture the operational conditions the system must satisfy to operate safely within its intended environment. As operating environments evolve, operational rules must be continuously refined to preserve consistency with observed system behavior during simulation-based verification and validation. Revising inconsistent rules is challenging because the changes must remain syntactically correct under a domain-specific grammar. Language-in-the-loop refinement further raises safety concerns beyond syntactic violations, as it can produce semantically unjustified refinements that overfit to the observed outcomes. We introduce a framework that combines counterfactual reasoning with a grammar-constrained refinement loop to refine operational rules, aligning them with the observed system behavior. Applied to an autonomous driving control system, our approach successfully resolved the inconsistencies in an operational rule inferred by a conventional baseline while remaining grammar compliant. An empirical large language model (LLM) study further revealed model-dependent refinement quality and safety lessons, which motivate rigorous grammar enforcement, stronger semantic validation, and broader evaluation in future work.

[AI-105] Autocorrelation Reintroduces Spectral Bias in KANs for Time Series Forecasting

【速读】:该论文旨在解决时间序列预测(Time Series Forecasting, TSF)任务中Kolmogorov-Arnold Networks (KANs)因输入变量存在强自相关性而重新引入谱偏差(spectral bias)的问题。传统理论认为KANs可克服神经网络的谱偏差,但该假设依赖于输入变量统计独立性,这在TSF中不成立。研究发现,随着输入自相关程度增加,KANs对低频成分的偏好显著增强,导致性能下降。解决方案的关键在于引入离散余弦变换(Discrete Cosine Transform, DCT)对输入进行预处理,以降低输入变量间的相关性;实验表明,DCT预处理有效缓解了KANs在TSF中的低频偏好,验证了谱偏差源于输入自相关性这一核心机制。

链接: https://arxiv.org/abs/2604.23518
作者: Chen Zeng,Jiahui Wang,Qiao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing theory suggests that Kolmogorov-Arnold Networks (KANs) can overcome the spectral bias commonly observed in neural networks under the assumption that inputs are statistically independent. However, this assumption does not hold in time series forecasting (TSF), where inputs are lagged observations with strong temporal autocorrelation. Through theoretical analysis and empirical validation, we obtain an unexpected finding: temporal autocorrelation reintroduces spectral bias in KANs, and the bias becomes increasingly pronounced as the degree of autocorrelation increases. This suggests that standard KANs may face substantial difficulties in TSF with strongly autocorrelated inputs. To address this problem, we introduce the Discrete Cosine Transform (DCT) to reduce the correlations among the network inputs. As expected, experimental results reveal that DCT preprocessing substantially reduces the observed low-frequency preference in TSF. This result also corroborates that the spectral bias of KANs in TSF tasks is indeed induced by the autocorrelation among input variables.

[AI-106] Uncertainty Propagation in LLM -Based Systems

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)应用中不确定性在系统各层级间传播与重用的问题,尤其关注早期误差如何通过模型内部、工作流阶段、组件边界、持久状态及人机协作过程累积并难以检测和管控。其解决方案的关键在于提出一个系统级的不确定性传播框架,构建涵盖模型内(P1)、系统级(P2)和社会技术层面(P3)的结构化分类体系,从而为识别、建模和治理不确定性传播提供理论基础与工程指导。

链接: https://arxiv.org/abs/2604.23505
作者: Boming Xia,Liming Zhu,Erdun Gao,Qinghua Lu,Minhui Xue,Dino Sejdinovic
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: WIP under review

点击查看摘要

Abstract:Uncertainty in large language model (LLM)-based systems is often studied at the level of a single model output, yet deployed LLM applications are compound systems in which uncertainty is transformed and reused across model internals, workflow stages, component boundaries, persistent state, and human or organisational processes. Without principled treatment of how uncertainty is carried and reused across these boundaries, early errors can propagate and compound in ways that are difficult to detect and govern. This paper develops a systems-level account of uncertainty propagation. It introduces a conceptual framing for characterising propagated uncertainty signals, presents a structured taxonomy spanning intra-model (P1), system-level (P2), and socio-technical (P3) propagation mechanisms, synthesises cross-cutting engineering insights, and identifies five open research challenges.

[AI-107] Interpretable Physics-Informed Load Forecasting for U.S. Grid Resilience: SHAP-Guided Ensemble Validation in Hybrid Deep Learning Under Extreme Weather

【速读】:该论文旨在解决美国电网短期电力负荷预测中深度学习模型可解释性不足的问题,尤其是在极端天气条件下难以获得运行人员信任的挑战。其核心解决方案是一个统一、可解释且融合物理规律的集成框架:通过卷积神经网络(CNN)分支提取局部特征、Transformer分支建模长程依赖关系,并采用基于验证优化的加权集成方式进行融合;同时引入源自德州电力可靠性委员会(ERCOT)系统分段抛物线温度-需求关系的物理信息损失函数进行正则化,从而提升模型在极端事件下的鲁棒性和可信度。此外,利用SHapley Additive exPlanations(SHAP)方法实现事后可解释性分析,揭示了不同气象因素在正常运行与极端天气事件中的主导作用变化,为调度决策提供科学依据。

链接: https://arxiv.org/abs/2604.23500
作者: Md Abubakkar,Sajib Debnath,Md. Uzzal Mia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate short-term electricity load forecasting is a cornerstone of U.S. grid reliability; however, prevailing deep learning models remain opaque, limiting operator trust during extreme weather. A unified, interpretable, physics-informed ensemble framework is proposed, integrating a Convolutional Neural Network (CNN) branch for local feature extraction and a Transformer branch for long-range dependency modeling; the branches are fused through a validation-optimized weighted ensemble and regularized by a physics-informed loss derived from the piecewise parabolic temperature-demand relationship of the Electric Reliability Council of Texas (ERCOT) system. Post-hoc interpretability is provided through SHapley Additive exPlanations (SHAP) with the DeepExplainer backend, yielding global and event-level attributions. Using eight years of ERCOT hourly load data (2018-2025) fused with Automated Surface Observing System (ASOS) records from three Texas stations, the framework achieves 713 MW MAE, 812 MW RMSE, and 1.18% MAPE on the test window. For Hampel-flagged extreme events, MAPE falls by 20.7% relative to its Transformer branch and by 40.5% relative to its CNN branch; an ablation confirms that the parabolic and ramp constraints drive a 14.7% RMSE reduction. SHAP analysis reveals a regime shift: temperature dominates under normal operation, whereas wind speed and precipitation become more influential during cold fronts and heatwaves.

[AI-108] Do Transaction-Level and Actor-Level AML Queues Agree? An Empirical Evaluation of Granularity Effects on the Elliptic Graph

【速读】:该论文旨在解决区块链反洗钱(Anti-Money Laundering, AML)系统中评分粒度(transaction-level vs. actor-address-level)对调查队列组成影响的问题,尤其是在固定审查预算下如何优化合规行动的效率。其核心解决方案在于提出了一种投影框架,通过四种聚合算子将交易级评分映射至地址级行动单元,并引入预算约束下的评估指标(如yield@budget、burden decomposition和case fragmentation),从而量化不同粒度下调查队列的质量差异。实证结果表明,地址级评分能显著提升非法活动检出率( illicit per 100 reviews),且其检测价值具有时间集中性,而混合策略反而低于最优单一粒度方案,揭示了评分粒度作为关键设计变量对AML系统性能的决定性影响。

链接: https://arxiv.org/abs/2604.23494
作者: Ankur Malik
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 9 tables, 4 appendices

点击查看摘要

Abstract:Graph-based anti-money laundering (AML) systems on blockchain networks can score suspicious activity at two granularity levels – transactions or actor addresses – yet compliance action is conducted per actor. This paper contributes an evaluation methodology for measuring how scoring granularity affects investigation queue composition under fixed review budgets. We formalize the evaluation through a projection framework mapping transaction-level scores to the actor-level action unit via four aggregation operators, and introduce budgeted investigation metrics – yield@budget, burden decomposition, and case fragmentation. Using the public Elliptic++ Bitcoin dataset (203,769 transactions; 822,942 address occurrences), we train independent random forest classifiers at each level under a causal temporal protocol and compare review queues through Jaccard overlap, burden decomposition, and feature-matching ablations. At one-percent budget, temporal evaluation yields mean Jaccard of 0.374 (SD 0.171); static pooled evaluation yields 0.087 (95% CI [0.079, 0.094]). An enriched address model receiving all 237 features produces even lower overlap (Jaccard=0.051), with 4.3% illicit per 100 reviews versus 30.2% for the transaction-projected queue. Address-level detection value is temporally concentrated: two timesteps exceed 91% illicit per 100 reviews while the static burden is only 3.4%. A fixed hybrid policy underperforms the best single-level queue by 5.05pp (CI [-10.2pp, -0.9pp]). These findings establish that scoring granularity is a consequential design variable for AML investigation systems – same data, same budget, different queues, different addresses investigated.

[AI-109] Agent ic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines

【速读】:该论文旨在解决多组件自然语言处理(Natural Language Processing, NLP)流水线在高风险决策场景中的鲁棒性测试问题,特别是在现实约束条件下的黑盒攻击挑战:仅能获得二元反馈、无法获取梯度信息且查询预算严格受限(10次查询)。针对这一严苛的黑盒威胁模型,作者提出了一种双代理规避框架,其核心创新在于将攻击过程置于语义扰动空间中——攻击代理(Attacker Agent)生成语义不变的文本重写,而提示优化代理(Prompt Optimization Agent)利用仅有二元决策反馈,在有限查询内迭代优化攻击策略。该方案的关键突破在于无需依赖替代模型或梯度信息即可实现高效规避,实验表明其对基于大语言模型(Large Language Model, LLM)的系统可实现19.95%至40.34%的规避率,显著优于传统基于token级扰动的基线方法(最高仅3.90%),并揭示了架构特性(如证据检索机制、检索-推理耦合程度及基线准确率)对攻击成功率的决定性影响。

链接: https://arxiv.org/abs/2604.23483
作者: Mazal Bethany,Kim-Kwang Raymond Choo,Nishant Vishwamitra,Peyman Najafirad
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY

点击查看摘要

Abstract:Multi-component natural language processing (NLP) pipelines are increasingly deployed for high-stakes decisions, yet no existing adversarial method can test their robustness under realistic conditions: binary-only feedback, no gradient access, and strict query budgets. We formalize this strict black-box threat model and propose a two-agent evasion framework operating in a semantic perturbation space. An Attacker Agent generates meaning-preserving rewrites while a Prompt Optimization Agent refines the attack strategy using only binary decision feedback within a 10-query budget. Evaluated against four evidence-based misinformation detection pipelines, the framework achieves evasion rates of 19.95 to 40.34% on modern large language model (LLM) based systems, compared to at most 3.90% for token-level perturbation baselines that rely on surrogate models because they cannot operate under our threat model. A legacy system relying on static lexical retrieval exhibits near-total vulnerability 97.02%, establishing a lower bound that exposes how architectural choices govern the attack surface. Evasion effectiveness is associated with three architectural properties: evidence retrieval mechanism, retrieval-inference coupling, and baseline classification accuracy. The iterative prompt optimization yields the largest marginal gains against the most robust targets, confirming that adaptive strategy discovery is essential when evasion is non-trivial. Analysis of successful rewrites reveals four exploitation patterns, each targeting failures at distinct pipeline stages. A pattern-informed defense reduces the evasion rate by up to 65.18%.

[AI-110] Escher-Loop: Mutual Evolution by Closed-Loop Self-Referential Optimization

【速读】:该论文旨在解决当前自主代理(Autonomous Agents)在面对开放性任务时,因依赖人工编写的流程和启发式规则而导致的性能天花板问题。其核心挑战在于如何实现系统性、持续性的自我进化,而非静态优化。解决方案的关键在于提出Escher-Loop框架——一个全闭环的双群体协同演化机制:一类是解决具体任务的Task Agents,另一类是递归优化任务代理及其自身策略的Optimizer Agents。该框架通过动态基准测试机制,将新生成的任务代理的实证得分作为相对胜负信号,直接用于更新优化器的评分,从而无需额外计算开销即可驱动优化器的自适应演进。这一设计使得优化器能够动态匹配高性能任务代理的需求变化,从而实现持续改进与最终性能突破。

链接: https://arxiv.org/abs/2604.23472
作者: Ziyang Liu,Xinyan Guo,Xuchen Wei,Han Hao,Liu Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: The first three authors contributed equally. Corresponding Authors: Han Hao, Liu Yang

点击查看摘要

Abstract:While recent autonomous agents demonstrate impressive capabilities, they predominantly rely on manually scripted workflows and handcrafted heuristics, inherently limiting their potential for open-ended improvement. To address this, we propose Escher-Loop, a fully closed-loop framework that operationalizes the mutual evolution of two distinct populations: Task Agents that solve concrete problems, and Optimizer Agents that recursively refine both the task agents and themselves. To sustain this self-referential evolution, we propose a dynamic benchmarking mechanism that seamlessly reuses the empirical scores of newly generated task agents as relative win-loss signals to update optimizers’ scores. This mechanism leverages the evolution of task agents as an inherent signal to drive the evaluation and refinement of optimizers without additional overhead. Empirical evaluations on mathematical optimization problems demonstrate that Escher-Loop effectively pushes past the performance ceilings of static baselines, achieving the highest absolute peak performance across all evaluated tasks under matched compute. Remarkably, we observe that the optimizer agents dynamically adapt their strategies to match the shifting demands of high-performing task agents, which explains the system’s continuous improvement and superior late-stage performance.

[AI-111] Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在交互式短序列推理场景中因推理延迟(inference latency)和内核启动开销(kernel launch overhead)导致的性能瓶颈问题。其核心解决方案是提出一种混合运行时框架,结合即时编译(Just-In-Time, JIT)与CUDA Graph执行机制:将Transformer推理过程划分为静态部分通过CUDA Graph重放执行、动态部分由JIT编译的内核处理,从而实现跨解码步骤的异步图捕获与复用,在保持运行时灵活性的同时显著降低首 token 延迟(Time-to-First-Token, TTFT)和尾部延迟(P99 latency)。实验表明,该方法在单GPU、batch size为1的条件下对LLaMA-2 7B模型进行10–500 tokens长度的推理任务时,TTFT最高可减少66.0%,优于TensorRT-LLM等现有方案,验证了该混合策略在低延迟AI应用中的有效性。

链接: https://arxiv.org/abs/2604.23467
作者: Divakar Kumar Yadav,Tian Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved strong performance across natural language and multimodal tasks, yet their practical deployment remains constrained by inference latency and kernel launch overhead, particularly in interactive, short-sequence settings. This paper presents a hybrid runtime framework that combines Just-In-Time (JIT) compilation with CUDA Graph execution to reduce launch overhead while preserving runtime flexibility during autoregressive decoding. The framework partitions transformer inference into static components executed via CUDA Graph replay and dynamic components handled through JIT-compiled kernels, enabling asynchronous graph capture and reuse across decoding steps. We evaluate the proposed approach on LLaMA-2 7B using single-GPU, batch-size-one inference across prompt lengths from 10 to 500 tokens. Experimental results show that the hybrid runtime reduces Time-to-First-Token (TTFT) by up to 66.0% and achieves lower P99 latency compared with TensorRT-LLM in this regime. These results indicate that hybrid JIT-CUDA Graph execution can effectively reduce inference latency and variance for short-sequence LLM workloads, making it a practical optimization strategy for latency-sensitive AI applications. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR) Cite as: arXiv:2604.23467 [cs.LG] (or arXiv:2604.23467v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.23467 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-112] Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

【速读】:该论文旨在解决现代GPU编程中效率与易用性之间的矛盾问题,即如何在简化开发者编写高性能GPU内核(kernel)的同时,保持对Tensor Core和Tensor Memory Accelerator(TMA)等硬件特性的高效利用。其解决方案的关键在于提出并评估NVIDIA CUDA Tile(CuTile),这是一种基于Python的、以“tile”为中心的抽象机制,通过将计算任务分解为可优化的tile单元,使开发者能够以极少的代码行数(如60行Python代码实现高性能融合多头注意力)完成接近手写CUDA内核的性能表现,同时在特定架构(如Blackwell B200)上显著优于现有库(如FlashAttention-2)。然而,该方案的性能高度依赖于具体工作负载和GPU架构,跨架构一致性仍存在挑战,凸显了自动优化与可移植性设计的重要性。

链接: https://arxiv.org/abs/2604.23466
作者: Divakar Kumar Yadav,Tian Zhao,Deepak Kumar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:

点击查看摘要

Abstract:NVIDIA’s CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. We benchmark representative AI workloads, including GEMM, fused multi-head attention, and end-to-end LLM inference in BF16/FP16 precision, to assess both performance and portability. Our results show that CuTile effectiveness is strongly workload- and architecture-dependent. On datacenter-class Blackwell (B200), CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x while requiring only 60 lines of Python kernel code. For GEMM, CuTile reaches 52-79% of cuBLAS performance in 22 lines of code (versus 123 for WMMA), making it a practical replacement for hand-written CUDA kernels but not yet for vendor-optimized libraries. However, the same CuTile attention kernel achieves only 53% of FlashAttention-2 throughput on RTX PRO 6000 (sm_120), exposing significant cross-architecture optimization gaps. In contrast, Triton sustains 62-101% of cuBLAS performance across all tested platforms without architecture-specific tuning, demonstrating substantially stronger portability. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR) Cite as: arXiv:2604.23466 [cs.LG] (or arXiv:2604.23466v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.23466 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-113] IndustryAssetEQA: A Neurosymbolic Operational Intelligence System for Embodied Question Answering in Industrial Asset Maintenance ACL2026

【速读】:该论文旨在解决工业维护场景中大语言模型(Large Language Models, LLMs)生成的解释缺乏结构有效性、无法提供可验证的因果依据以及难以支持反事实推理等问题,这些问题严重削弱了在安全关键环境中对AI助手的信任。解决方案的关键在于提出IndustryAssetEQA,一个神经符号操作智能系统,其核心创新是将事件性遥测表示与故障模式影响分析知识图谱(Failure Mode Effects Analysis Knowledge Graph, FMEA-KG)相结合,从而实现对工业资产的具身问答(Embodied Question Answering, EQA),确保回答具备结构合理性、可验证性和行动导向性。

链接: https://arxiv.org/abs/2604.23446
作者: Chathurangi Shyalika,Dhaval Patel,Amit Sheth
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20 pages, 4 figures, 4 tables, Accepted for the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026) Industry Track

点击查看摘要

Abstract:Industrial maintenance environments increasingly rely on AI systems to assist operators in understanding asset behavior, diagnosing failures, and evaluating interventions. Although large language models (LLMs) enable fluent natural-language interaction, deployed maintenance assistants routinely produce generic explanations that are weakly grounded in telemetry, omit verifiable provenance, and offer no testable support for counterfactual or action-oriented reasoning that undermine trust in safety-critical settings. We present IndustryAssetEQA, a neurosymbolic operational intelligence system that combines episodic telemetry representations with a Failure Mode Effects Analysis Knowledge Graph (FMEA-KG) to enable Embodied Question Answering (EQA) over industrial assets. We evaluate on four datasets covering four industrial asset types, including rotating machinery, turbofan engines, hydraulic systems, and cyber-physical production systems. Compared to LLM-only baselines, IndustryAssetEQA improves structural validity by up to 0.51, counterfactual accuracy by up to 0.47, and explanation entailment by 0.64, while reducing severe expert-rated overclaims from 28% to 2% (approximately 93% reduction). Code, datasets, and the FMEA-KG are available at this https URL.

[AI-114] When Corrective Hints Hurt: Prompt Design in Reason er-Guided Repair of LLM Overcaution on Entailed Negations under OWL~2~DL

【速读】:该论文旨在解决生成式 AI(Generative AI)在处理描述逻辑(Description Logic, DL)推理任务时出现的可重现性错误模式问题,具体表现为模型在面对OWL 2 DL语义约束下的查询时,常将应答为“no”的命题误判为“unknown”。其解决方案的关键在于引入基于推理器(reasoner)指导的修复机制:通过三轮基于推理器断言的修正(verdict-guided repair),并在开放世界假设(Open World Assumption, OWA)提示下进行反馈,显著提升了回答准确性(达到97.8%),远超单次推理(43.9%)和通用纠错策略(81.7%)。研究进一步表明,提示框架的设计比单纯的内容修正更为关键,强调了在构建可靠推理增强型系统时需显式评估推理器封装组件的有效性。

链接: https://arxiv.org/abs/2604.23398
作者: Yijiashun Qi,Xiang Xu,Yuxuan Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: accepted by icaide 2026

点击查看摘要

Abstract:We report a reproducible error pattern in GPT-5.4 on OWL~2~DL compliance queries: the model frequently answers unknown'' when the reasoner-entailed answer is no’’ under \emphFunctionalProperty closure or class \emphdisjointness. Using 180 reasoner-audited queries from a procedural expansion of the observed pattern plus 18 hand-authored held-out queries in two unrelated domains (insurance and clinical), we compare four interaction modes under matched query budget: single-shot, three rounds of generic ``you-are-wrong’’ retry, three rounds of reasoner-verdict repair with an open-world-assumption (OWA) hint, and the same repair without the hint. Direct faithfulness is 43.9,% (Wilson 95,% CI [36.8,51.2] ); generic retry reaches 81.7,% ( [75.4,86.6] ); the verdict-with-hint variant is \emphworse at 67.2,% ( [60.1,73.7] ); the verdict-only variant reaches 97.8,% ( [94.4,99.1] ). All pairwise comparisons remain significant under McNemar’s exact test with Bonferroni correction ( \alpha = 0.01 ; all p 10^-5 ). The same fingerprint accounts for 4/4 errors on the held-out queries. Our interpretation is bounded: prompt framing can matter more than corrective content, and reasoner-guided wrappers should be ablated explicitly.

[AI-115] SoccerRef-Agents : Multi-Agent System for Automated Soccer Refereeing SOCC

【速读】:该论文旨在解决当前AI辅助足球裁判决策中存在的一系列问题,包括现有方法多局限于孤立的视频感知任务、缺乏对犯规场景的理解与推理能力,以及难以实现公平、准确且可解释的判罚。其解决方案的关键在于提出SoccerRef-Agents框架,通过构建多模态基准SoccerRefBench和基于“比赛规则”(Laws of the Game)的向量知识库RefKnowledgeDB,实现了知识驱动的精准推理;同时设计了一种新型多智能体架构,利用跨模态检索增强生成(cross-modal RAG)机制,弥合视觉内容与法规文本之间的语义鸿沟,从而显著提升判罚准确率与解释质量。

链接: https://arxiv.org/abs/2604.23392
作者: Zi Meng,Wanli Song,Yi Hu,Jiayuan Rao,Gang Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 26 pages, 8 figures. Submitted to ISACE 2026. Github Repo: this https URL

点击查看摘要

Abstract:Refereeing is vital in sports, where fair, accurate, and explainable decisions are fundamental. While intelligent assistant technologies are being widely adopted in soccer refereeing, current AI-assisted approaches remain preliminary. Existing research mostly focuses on isolated video perception tasks and lacks the ability to understand and reason about foul scenarios. To fill this gap, we propose SoccerRef-Agents, a holistic and explainable multi-agent decision-making framework for soccer refereeing. The main contributions are: (i) constructing the multimodal benchmark SoccerRefBench with over 1,200 referee theory questions and 600 foul video clips; (ii) building a vector-based knowledge base RefKnowledgeDB using the latest “Laws of the Game” and a classic case database for precise, knowledge-driven reasoning; (iii) designing a novel multi-agent architecture that collaborates via cross-modal RAG to bridge the semantic gap between visual content and regulatory texts. This work explores the technical capability of integrating MLLMs with refereeing expertise, and evaluations show our system significantly outperforms general-purpose MLLMs in decision accuracy and explanation quality. All databases, benchmarks, and code will be made available.

[AI-116] A Taxonomy and Resolution Strategy for Client-Level Disagreements in Federated Learning

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在多利益相关方环境中因客户端间存在战略、监管或竞争性分歧而导致的协作失效问题,即“客户端层面的不一致”(client-level disagreements)。传统FL假设所有客户端无条件参与,忽略了现实场景中某些客户端可能需要永久或临时排除其他客户端的需求。解决方案的关键在于提出一种多轨(multi-track)稳健的冲突化解策略:通过构建和管理隔离的模型更新路径(称为“tracks”),确保严格客户端排除,从而避免模型更新间的交叉污染和不公平现象。该方法在MNIST和N-CMAPSS数据集上34种场景下验证有效,并证明服务器端开销极低(每轮仅1毫秒),而客户端训练负载可通过子模型复用策略有效缓解,具备良好的可扩展性和架构合理性。

链接: https://arxiv.org/abs/2604.23386
作者: Daan Rosendal,Ana Oprescu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 16 figures. Published in IEEE BigData 2025

点击查看摘要

Abstract:Federated Learning (FL) typically assumes unconditional collaboration, a premise that overlooks the complexities of real-world, multi-stakeholder environments in which clients may need to exclude one another for strategic, regulatory, or competitive reasons. This paper addresses this gap, which we term ‘client-level disagreements,’ by first introducing a taxonomy of such scenarios. We then propose a robust, multi-track resolution strategy that guarantees strict client exclusion by creating and managing isolated model update paths (‘tracks’), thereby preventing the cross-contamination and unfairness issues present in naive strategies. Through an empirical evaluation of our custom simulation system across 34 scenarios using the MNIST and N-CMAPSS datasets, we validate that our approach correctly handles permanent, temporal, and overlapping disagreement patterns. Our scalability analysis reveals the server-side resolution algorithm’s overhead is negligible (1 ms per round) even under heavy load. The primary scalability constraint is the client-side training load from participating in multiple tracks, a cost that we show can be effectively mitigated by a submodel reuse strategy. This work presents a scalable and architecturally sound method for managing client-level disagreements, and enhances the practical applicability of FL in settings where policy compliance and strategic control are non-negotiable.

[AI-117] Constraint-Based Analysis of Reasoning Shortcuts in Neurosymbolic Learning KR2026

【速读】:该论文旨在解决神经符号系统(neurosymbolic systems)在学习过程中可能产生的“推理捷径”(reasoning shortcuts)问题,即模型虽满足逻辑约束但未能正确映射概念与标签,导致语义不一致。其核心解决方案是将推理捷径形式化为约束满足问题,并提出基于答案集编程(Answer Set Programming, ASP)的算法来验证给定约束集是否唯一确定预期的概念映射;若检测到捷径,则通过贪心修复算法扩充约束集以消除所有替代有效映射,最多在 $ k $ 次迭代内收敛,其中 $ k $ 为替代映射数量。该方法具备理论完备性与可计算复杂度保障,且实验验证了其在多个基准领域的有效性。

链接: https://arxiv.org/abs/2604.23377
作者: Akihiro Takemura,Katsumi Inoue,Masaaki Nishino
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This is the full version of a paper appearing at the 23rd International Conference on Principles of Knowledge Representation and Reasoning (KR 2026)

点击查看摘要

Abstract:Neurosymbolic systems can satisfy logical constraints during learning without achieving the intended concept-label correspondence; this is a problem known as reasoning shortcuts. We formalize reasoning shortcuts as a constraint satisfaction problem and investigate under which conditions concept mappings are uniquely determined by the constraints. We prove that a discrimination property (requiring that no valid concept mapping can be transformed into another valid mapping by swapping two concept values) is necessary for shortcut-freeness under bijective mappings, but demonstrate via a counterexample that it is insufficient even when the constraint graph is connected. We develop an ASP-based algorithm that verifies whether a given constraint set uniquely determines the intended concept mapping, with proven soundness and completeness. When shortcuts are detected, a greedy repair algorithm eliminates them by augmenting the constraint set, converging in at most k iterations, where k is the number of alternative valid mappings. We further provide a complexity classification: deciding shortcut-freeness is coNP-complete, counting shortcuts is #P-complete, and finding minimal repairs is NP-hard. We also establish sample complexity bounds showing that logarithmically many label queries suffice for disambiguation in favorable cases, while querying all ambiguous positions suffices in the worst case. Experiments across eight benchmark domains validate our approach.

[AI-118] An Empirical Evaluation of Locally Deployed LLM s for Bug Detection in Python Code

【速读】:该论文旨在解决在隐私敏感或资源受限环境中,如何有效利用本地部署的大语言模型(Large Language Models, LLMs)进行真实场景下的Python代码缺陷检测问题。传统方法多依赖云端模型或专用硬件,难以满足实际工程需求。解决方案的关键在于通过系统性实证评估两个本地运行的LLM(LLaMA 3.2与Mistral),采用零样本提示(zero-shot prompting)策略在函数级别对BugsInPy基准中的349个缺陷进行检测,并结合自动化关键词匹配的评估框架量化性能。结果表明,本地模型能够识别出约43%–45%的缺陷,且多数输出为部分正确响应——即能定位问题代码区域但无法给出精确修复建议,凸显了代码库特性对模型表现的重要影响,验证了本地部署LLMs在现实开发场景中具备实用价值,尽管精确缺陷定位仍具挑战。

链接: https://arxiv.org/abs/2604.23361
作者: Jelena Ilić Vulićević
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 5 figures. Code available at this https URL

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong performance on a wide range of software engineering tasks, including code generation and analysis. However, most prior work relies on cloud-based models or specialized hardware, limiting practical applicability in privacy-sensitive or resource-constrained environments. In this paper, we present a systematic empirical evaluation of two locally deployed LLMs, LLaMA 3.2 and Mistral, for real-world Python bug detection using the BugsInPy benchmark. We evaluate 349 bugs across 17 projects using a zero-shot prompting approach at the function level and an automated keyword-based evaluation framework. Our results show that locally executed models achieve accuracy between 43% and 45%, while producing a large proportion of partially correct responses that identify problematic code regions without pinpointing the exact fix. Performance varies significantly across projects, highlighting the importance of codebase characteristics. The results demonstrate that local models can identify a meaningful share of bugs, though precise localization remains difficult for locally executed LLMs, particularly when handling complex and context dependent bugs in realistic development scenarios.

[AI-119] LEGO: An LLM Skill-Based Front-End Design Generation Platform

【速读】:该论文旨在解决现有基于大语言模型(Large Language Model, LLM)的电子设计自动化(Electronic Design Automation, EDA)代理系统普遍存在的任务特定性与可复用性差的问题,即这些系统往往孤立运行,导致重复工程投入且难以迁移成功的电路设计与调试策略。其解决方案的关键在于提出一个统一的、基于技能(Skill)的平台 LEGO,将数字前端设计流程分解为六个独立步骤,并以标准化、可组合的电路技能(Circuit Skill)形式封装每个代理能力,构建了一个插件式架构。通过自动化提取42个可执行电路技能并建立高效检索机制(Agent Skill RAG),实现子毫秒级技能召回,显著提升RTL设计自动化效果:在VerilogEval v2难例集上,单个技能使Pass@1从0.000提升至0.805,跨项目技能组合同样达到0.805,优于Hierarchy-Verilog和VerilogCoder等基线方法,验证了模块化技能组合在灵活性与有效性上的优势。

链接: https://arxiv.org/abs/2604.23355
作者: Jincheng Lou,Ruohan Xu,Jiecheng Ma,Runzhe Tao,Xinyu Qu,Yibo Lin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ISEDA 2026. 7 pages, 3 figures

点击查看摘要

Abstract:Existing LLM-based EDA agents are often isolated task-specific systems. This leads to repeated engineering effort and limited reuse of successful design and debugging strategies. We present LEGO, a unified skill-based platform for front-end design generation. It decomposes the digital front-end flow into six independent steps and represents every agent capability as a standardized composable circuit skill within a plug-and-play architecture. To build this skill library, we survey more than 100 papers, select 11 representative open-source projects, and extract 42 executable circuit skills within a six-step finite state machine formulation. Circuit Skill Builder automates skill extraction with linear scalability. Agent Skill RAG achieves submillisecond retrieval without relying on embedding models. Empirical evaluation on a hard subset of 41 VerilogEval v2 problems that gpt-5.2-codex fails to solve under extra-high reasoning effort shows that individual circuit skills constructed within LEGO raise Pass@1 from 0.000 to 0.805. This is an 80.5% gain over the baseline. Cross-project skill compositions also reach 0.805 Pass@1. They outperform hierarchy-verilog by 14.6% and VerilogCoder by 2.5%. They also match MAGE. These results show that modular skill composition supports both effective and flexible RTL design automation. The LEGO platform and all circuit skills are publicly available at GitHub: this https URL

[AI-120] Evaluating Jailbreaking Vulnerabilities in LLM s Deployed as Assistants for Smart Grid Operations: A Benchmark Against NERC Standards

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在电力系统运行中作为辅助决策工具时,因提示词攻击(prompt-based adversarial attacks)引发的安全风险问题,特别是授权用户(如操作员)通过构造恶意提示词绕过模型的安全对齐机制、生成违反北美电力可靠性公司(NERC)合规标准的输出。解决方案的关键在于系统性评估三种先进LLM(GPT-4o mini、Gemini 2.0 Flash-Lite、Claude 3.5 Haiku)在模拟电网运营场景下对Baseline、BitBypass和DeepInception三种越狱攻击方法的脆弱性,结果表明:DeepInception攻击成功率最高(63.17%),而Claude 3.5 Haiku表现出完全抗性(0%攻击成功率为),同时发现细微的提示词调整可显著提升基础攻击方法的有效性(ASR从33.1%提升至30.6%),凸显了优化提示工程与强化模型防御机制的必要性。

链接: https://arxiv.org/abs/2604.23341
作者: Taha Hammadia,Lucas Rea,Ahmad Mohammad Saber,Amr Youssef,Deepa Kundur
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The deployment of Large Language Models (LLMs) as assistants in electric grid operations promises to streamline compliance and decision-making but exposes new vulnerabilities to prompt-based adversarial attacks. This paper evaluates the risk of jailbreaking LLMs, i.e., circumventing safety alignments to produce outputs violating regulatory standards, assuming threats from authorized users, such as operators, who craft malicious prompts to elicit non-compliant guidance. Three state-of-the-art LLMs (OpenAI’s GPT-4o mini, Google’s Gemini 2.0 Flash-Lite, and Anthropic’s Claude 3.5 Haiku) were tested against Baseline, BitBypass, and DeepInception jailbreaking methods across scenarios derived from nine NERC Reliability Standards (EOP, TOP, and CIP). In the initial broad experiment, the overall Attack Success Rate (ASR) was 33.1%, with DeepInception proving most effective at 63.17% ASR. Claude 3.5 Haiku exhibited complete resistance (0% ASR), while Gemini 2.0 Flash-Lite was most vulnerable (55.04% ASR) and GPT-4o mini moderately susceptible (44.34% ASR). A follow-up experiment refining malicious wording in Baseline and BitBypass attacks yielded a 30.6% ASR, confirming that subtle prompt adjustments can enhance simpler methods’ efficacy.

[AI-121] Layer Embedding Deep Fusion Graph Neural Network

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在低同质性(low-homophily)场景下因消息传递机制依赖节点标签一致性而导致的性能下降问题,以及随着网络深度增加所引发的过平滑(over-smoothing)现象。其核心解决方案是提出一种名为Layer Embedding Deep Fusion Graph Neural Network (LEDF-GNN) 的新框架,关键创新在于设计了一个非线性融合多层嵌入的Layer Embedding Deep Fusion (LEDF) 操作,以捕捉层间依赖关系并有效缓解深层传播退化;同时引入Dual-Topology Parallel Strategy (DTPS),通过并行利用原始拓扑与重构拓扑实现结构与语义的自适应协同优化,从而提升模型在不同同质性条件下的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2604.23324
作者: Taihua Xu,Genhao Tian,Jicong Fan,Xibei Yang,Qinghua Zhang,Yun Cui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have demonstrated impressive performance in learning representations from graph-structured data. However, their message-passing mechanism inherently relies on the assumption of label consistency among connected nodes, limiting their applicability to low-homophily settings. Moreover, since message passing operates as a hierarchical diffusion process, GNNs face challenges in capturing long-range dependencies. As network depth increases, the structural noise along heterophilic edges tends to be amplified, resulting in over-smoothing. This issue becomes especially prominent in highly heterophilic graphs, where the propagation of inconsistent semantics across the topology continually exacerbates misaggregation. To address this issue, we propose a novel framework named Layer Embedding Deep Fusion Graph Neural Network (LEDF-GNN). Specifically, we design a Layer Embedding Deep Fusion (LEDF) operator that nonlinearly fuses multi-layer embeddings to capture inter-layer dependencies and effectively alleviate deep propagation degradation. Meanwhile, to mitigate structural heterophily, LEDF-GNN employs a Dual-Topology Parallel Strategy (DTPS) that simultaneously leverages the original and reconstructed topologies, allowing for adaptive structure-semantics co-optimization under diverse homophily conditions. Extensive semi-supervised classification experiments on the citation and image benchmarks demonstrate that, under both homophilic and heterophilic settings, LEDF-GNN consistently outperforms state-of-the-art baselines, validating its effectiveness and generalization capability across diverse graph types.

[AI-122] GIFT: Global stabilisation via Intrinsic Fine Tuning

【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)策略在复杂连续控制环境中对初始条件高度敏感的问题,这种敏感性导致系统状态动力学混沌,限制了DRL在需要性能与稳定性保障的实际控制系统中的应用。解决方案的关键在于提出一种通用训练框架——基于内在微调的全局稳定化(Global Stabilisation via Intrinsic Fine Tuning, GIFT),其通过设计特定奖励函数直接优化现有高性能DRL策略的全局稳定性,从而在保持任务性能的同时显著提升控制交互的稳定性,增强DRL策略在现实场景中的适用性。

链接: https://arxiv.org/abs/2604.23312
作者: Rory Young,Nicolas Pugeault
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep reinforcement learning policies achieve strong performance in complex continuous control environments with nonlinear contact forces. However, these policies often produce chaotic state dynamics, with trivially small changes to the initial conditions significantly impacting the long-term behaviour of the control system. This high sensitivity to initial conditions limits the application of Deep RL to real-world control systems where performance and stability guarantees are often required. To address this issue, we propose Global stabilisation via Intrinsic Fine Tuning (GIFT), a general-purpose training framework which directly optimises the global stability of existing high-performing deep RL policies using a custom reward function. We demonstrate that GIFT increase the stability of the control interaction while maintaining comparable task performance, thereby improving the suitability of deep RL policies for real-world control systems.

[AI-123] CombiMOTS: Combinatorial Multi-Objective Tree Search for Dual-Target Molecule Generation ICML2025

【速读】:该论文旨在解决双靶点分子生成中的两大关键问题:一是现有方法将复杂的多目标优化问题简化为标量化的单一目标组合,导致无法有效捕捉靶点结合能力与分子理化性质之间的权衡关系;二是未将合成可及性(synthetic accessibility)整合进生成流程,限制了实际应用潜力。其解决方案的核心在于提出CombiMOTS框架,该框架基于Pareto蒙特卡洛树搜索(Pareto Monte Carlo Tree Search, PMCTS),在可合成片段空间中进行探索,并采用向量化优化约束来同时建模双靶点亲和力与药代动力学特性,从而实现高效、多样且具有平衡药理特性的双靶点分子生成。

链接: https://arxiv.org/abs/2604.23307
作者: Thibaud Southiratn,Bonil Koo,Yijingxiu Lu,Sun Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted as a poster at ICML 2025 (Main Track)

点击查看摘要

Abstract:Dual-target molecule generation, which focuses on discovering compounds capable of interacting with two target proteins, has garnered significant attention due to its potential for improving therapeutic efficiency, safety and resistance mitigation. Existing approaches face two critical challenges. First, by simplifying the complex dual-target optimization problem to scalarized combinations of individual objectives, they fail to capture important trade-offs between target engagement and molecular properties. Second, they typically do not integrate synthetic planning into the generative process. This highlights a need for more appropriate objective function design and synthesis-aware methodologies tailored to the dual-target molecule generation task. In this work, we propose CombiMOTS, a Pareto Monte Carlo Tree Search (PMCTS) framework that generates dual-target molecules. CombiMOTS is designed to explore a synthesizable fragment space while employing vectorized optimization constraints to encapsulate target affinity and physicochemical properties. Extensive experiments on real-world databases demonstrate that CombiMOTS produces novel dual-target molecules with high docking scores, enhanced diversity, and balanced pharmacological characteristics, showcasing its potential as a powerful tool for dual-target drug discovery. The code and data is accessible through this https URL.

[AI-124] An Analysis of Active Learning Algorithms using Real-World Crowd-sourced Text Annotations IJCNN2026

【速读】:该论文旨在解决传统主动学习(Active Learning)算法在现实世界应用中面临的关键挑战:即假设标注者(labeling oracle)始终提供正确标签的不合理性。现实中,标注者可能因能力、疲劳或主观判断等因素提供错误标签,甚至拒绝标注,这显著影响模型性能。解决方案的关键在于通过众包平台收集真实文本样本的标注数据(来自3个基准文本分类数据集),并在此基础上对8种常用主动学习技术(结合深度神经网络)进行系统性实证研究,从而揭示这些方法在存在噪声和缺失标注的真实场景下的表现差异。这一方法避免了用机器学习模型模拟标注行为的局限性,更贴近实际部署需求,为深度主动学习系统的落地提供了重要实践依据。

链接: https://arxiv.org/abs/2604.23290
作者: Varun Totakura,Ankita Singh,Yushun Dong,Shayok Chakraborty
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: The proposed dataset can be accessed at this https URL . To appear in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2026)

点击查看摘要

Abstract:Active learning algorithms automatically identify the most informative samples from large amounts of unlabeled data and tremendously reduce human annotation effort in inducing a machine learning model. In a conventional active learning setup, the labeling oracles are assumed to be infallible, that is, they always provide correct answers (in terms of class labels) to the queried unlabeled instances, which cannot be guaranteed in real-world applications. To this end, a body of research has focused on the development of active learning algorithms in the presence of imperfect / noisy oracles. Existing research on active learning with noisy oracles typically simulate the oracles using machine learning models; however, real-world situations are much more challenging, and using ML models to simulate the annotation patterns may not appropriately capture the nuances of real-world annotation challenges. In this research, we first collect annotations of text samples (from 3 benchmark text classification datasets) from crowd-sourced workers through a crowd-sourcing platform. We then conduct extensive empirical studies of 8 commonly used active learning techniques (in conjunction with deep neural networks) using the obtained annotations. Our analyses sheds light on the performance of these techniques under real-world challenges, where annotators can provide incorrect labels, and can also refuse to provide labels. We hope this research will provide valuable insights that will be useful for the deployment of deep active learning systems in real-world applications. The obtained annotations can be accessed at this https URL.

[AI-125] AI Identity: Standards Gaps and Research Directions for AI Agents

【速读】:该论文试图解决的问题是:随着生成式 AI (Generative AI) 代理在组织边界外自主执行交易、工作流和子代理链,当前基础设施无法识别、验证并追究无实体、无持久记忆且无法律地位的 AI 实体的责任。解决方案的关键在于提出“AI 身份”(AI Identity)的概念——即 AI 代理声明的身份与其行为之间在某一时刻的对应关系,并以置信度进行界定。论文通过结构化分析人类与 AI 身份在载体、持久性、可验证性和法律地位四个维度上的根本不对称性,指出直接套用人类身份框架会导致系统性失效;同时识别出五个结构性缺口(语义意图验证、递归委托责任、代理身份完整性、治理透明度与执行、运营可持续性),强调仅靠工程优化无法填补这些缺口,必须开展基础研究来构建 AI 身份的理论与技术体系。

链接: https://arxiv.org/abs/2604.23280
作者: Takumi Otsuka,Kentaroh Toyoda,Alex Leung
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:AI agents are now running real transactions, workflows, and sub-agent chains across organizational boundaries without continuous human supervision. This creates a problem no current infrastructure is equipped to solve: how do you identify, verify, and hold accountable an entity with no body, no persistent memory, and no legal standing? We define AI Identity as the continuous relationship between what an AI agent is declared to be and what it is observed to do, bounded by the confidence that those two things correspond at any given moment. Through a structured survey of industry trends, emerging standards, and technical literature, we conduct a gap analysis across the full agent identity lifecycle and make three contributions: (1) a structural comparison of human and AI identity across four dimensions (substrate, persistence, verifiability, and legal standing) showing that the asymmetry is fundamental and that extending human frameworks to agents without structural modification produces systematic failures; (2) an evaluation of current technical and regulatory documents against the identity requirements of autonomous agents, finding that none adequately address the challenge of governing nondeterministic, boundary-crossing entities; and (3) identification of five critical gaps (semantic intent verification, recursive delegation accountability, agent identity integrity, governance opacity and enforcement, and operational sustainability) that no current technology or regulatory instrument resolves. These gaps are structural; more engineering effort alone will not close them. Foundational research on AI identity is the central conclusion of this report.

[AI-126] Active Inference: A method for Phenotyping Agency in AI systems?

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 迅速发展背景下,缺乏系统性概念工具来刻画计算系统中“代理行为”(agency)的问题。现有定义多依赖于自主性和目标导向性,但难以提供可操作且可检验的理论基础。论文提出一种基于三个核心标准的最小化代理观:意向性(intentionality),即行动由信念和欲望所驱动;理性(rationality),即行动在世界模型下具有规范一致性;可解释性(explainability),即行动因果可追溯至内部状态。其解决方案的关键在于将这些标准形式化为部分可观测马尔可夫决策过程(POMDP)框架下的变分推断机制,其中后验信念、先验偏好与期望自由能最小化共同构成代理行为链。通过经典T迷宫范式验证了“赋能”(empowerment)作为通道容量指标能够区分低、中、高代理表型,从而为从计算表型到AI治理策略提供了一个原则性的变分桥梁。

链接: https://arxiv.org/abs/2604.23278
作者: Philip Wilson,Axel Constant,Mahault Albarracin,Nicolás Hinrichs,Jasmine Moore,Daniel Polani,Karl Friston
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The proliferation of agentic artificial intelligence has outpaced the conceptual tools needed to characterize agency in computational systems. Prevailing definitions mainly rely on autonomy and goal-directedness. Here, we argue for a minimal notion open to principled inspection given three criteria: intentionality as action grounded in beliefs and desires, rationality as normatively coherent action entailed by a world model, and explainability as action causally traceable to internal states; we subsequently instantiate these as a partially observable Markov decision process under a variational framework wherein posterior beliefs, prior preferences, and the minimization of expected free energy jointly constitute an agentic action chain. Using a canonical T-maze paradigm, we evidence how empowerment, formulated as the channel capacity between actions and anticipated observations, serves as an operational metric that distinguishes zero-, intermediate-, and high-agency phenotypes through structural manipulations of the generative model. We conclude by arguing that as agents engage in epistemic foraging to resolve ambiguity, the governance controls that remain effective must shift systematically from external constraints to the internal modulation of prior preferences, offering a principled, variational bridge from computational phenotyping to AI governance strategy

[AI-127] CAP-CoT: Cycle Adversarial Prompt for Improving Chain of Thoughts in LLM Reasoning

【速读】:该论文旨在解决链式思维(Chain-of-Thought, CoT)提示在长且多步骤问题上推理不稳定的问题,即同一任务在不同运行中可能产生不一致的答案,影响模型的可靠性和准确性。解决方案的关键在于提出一种循环对抗提示优化框架(Cycle Adversarial Prompt optimization, CAP-CoT),其核心机制包括:在每轮循环中,一个前向求解器生成候选推理链,一个对抗挑战者基于针对性错误策略构造看似合理但存在故意缺陷的推理链,以及一个反馈代理对两者进行逐步骤对比并生成结构化反馈;该反馈同时用于更新求解器提示以修复暴露的错误,并更新挑战者提示以生成更具针对性的错误,从而形成双向优化闭环。此方法通过任务语义驱动的对抗性检验,系统性提升CoT推理的准确率与稳定性。

链接: https://arxiv.org/abs/2604.23270
作者: Shuxu Chen,Yitian Zhou,Jiaquan Zhang,Haoyu Bian,Aming Wu,Sungyoung Lee,Chaoning Zhang,Hyundong Shin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Chain-of-Thought (CoT) prompting has emerged as a simple and effective way to elicit step-by-step solutions from large language models (LLMs). However, CoT reasoning can be unstable across runs on long, multi-step problems, leading to inconsistent answers for unchanged task. Most prior work focuses on improving the forward reasoning chain within a single pass, with less attention to iterative and contrastive correction. To address this gap, we propose CAP-CoT, a Cycle Adversarial Prompt optimization framework designed to improve both CoT reasoning accuracy and stability of a single deployed solver. In each cycle, a forward solver generates candidate reasoning chains, an adversarial challenger constructs plausible but deliberately flawed chains using targeted error strategies, and a feedback agent contrasts the two chains and produces step-aligned structured feedback. This feedback closes the optimization loop in two directions, including updating the solver prompt based on errors exposed by the challenger, and updating the challenger prompt to generate increasingly targeted errors in subsequent cycles. Unlike safety-oriented adversarial prompting such as jailbreak or prompt-injection attacks, our adversarial component is task-semantic and aims to expose logical vulnerabilities in reasoning chains. Experiments across six benchmarks and four LLM backbones demonstrate that within two to three adversarial prompt optimization cycles, CAP-CoT consistently reduces variability across runs while improving reasoning accuracy and robustness to prompt perturbations.

[AI-128] Knowledge Lever Risk Management for Software Engineering: A Stochastic Framework for Mitigating Knowledge Loss

【速读】:该论文旨在解决软件工程(Software Engineering, SE)组织中因隐性知识资产(tacit knowledge assets)流失或未文档化决策衰减所引发的知识风险问题,此类风险在传统以进度和预算为核心的SE风险管理中长期被忽视。解决方案的关键在于提出并验证“知识杠杆风险管理体系”(Knowledge Lever Risk Management, KLRM),其核心思想是将隐性知识资产重构为可主动干预的风险缓解机制——即“知识杠杆”(Knowledge Levers),并通过一个四阶段结构化架构(审计、对齐、激活、保障)将其系统集成到软件开发生命周期中;同时引入形式化的随机模型量化杠杆激活对项目知识资本的影响,实证表明全面激活知识杠杆可使预期知识资本提升63.8%,并几乎消除知识危机概率,从而增强项目管理铁三角(范围、时间、成本)的一致性。

链接: https://arxiv.org/abs/2604.23257
作者: Mark Chua,Samuel Ajila
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: A shorter version of the paper will be presented on the 24th IEEE/ACIS International Conference on Software Engineering Research, Management and Applications (SERA 2026)

点击查看摘要

Abstract:Software engineering (SE) organizations operate in a knowledge-intensive domain where critical assets – architectural expertise, design rationale, and system intuition – are overwhelmingly tacit and volatile. The departure of key contributors or the decay of undocumented decisions can severely impair project velocity and software quality. While conventional SE risk management optimized for schedule and budget is common, the intangible knowledge risks that determine project success remain under-represented. The goal of this research work is to propose and evaluate the Knowledge Lever Risk Management (KLRM) Framework, designed specifically for the software development lifecycle. The primary objectives are to: (1) recast intangible knowledge assets as active mechanisms for risk mitigation (Knowledge Levers); (2) integrate these levers into a structured four-phase architecture (Audit, Alignment, Activation, Assurance); and (3) provide a formal stochastic model to quantify the impact of lever activation on project knowledge capital. We detail the application of these levers through software-specific practices such as pair programming, architectural decision records (ADRs), and LLM-assisted development. Stochastic Monte Carlo simulations demonstrate that full lever activation increases expected knowledge capital by 63.8% and virtually eliminates knowledge crisis probability. Our research shows that knowledge lever activation improves alignment across the project management iron triangle (scope, time, cost) by reducing rework and rediscovery costs. Comments: A shorter version of the paper will be presented on the 24th IEEE/ACIS International Conference on Software Engineering Research, Management and Applications (SERA 2026) Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.23257 [cs.SE] (or arXiv:2604.23257v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2604.23257 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-129] Why Architecture Choice Matters in Symbolic Regression

【速读】:该论文旨在解决梯度驱动的符号回归(symbolic regression)中结构选择对目标公式恢复成功率的影响问题。其核心发现是:尽管表达能力(expressiveness)决定了搜索空间是否包含解,但优化景观(optimization landscape)才是决定梯度下降能否实际找到解的关键因素。研究通过对比三种共享相同算子和目标语言但变量接入方式不同的树结构,在超过12,700次训练中发现,最表达能力强的结构反而在某些目标上完全失败,而受限结构却能稳定恢复特定公式;此外,改变算子或反转其梯度特性会显著影响恢复性能,且非链式(balanced)树结构从未被恢复。这表明,单纯提升表达能力不足以保证符号回归的成功,必须同时考虑结构设计对优化路径的影响。

链接: https://arxiv.org/abs/2604.23256
作者: Chakshu Gupta
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
备注: 4 pages, 2 figures, 6 tables

点击查看摘要

Abstract:Symbolic regression discovers mathematical formulas from data. Some methods fix a tree of operators, assign learnable weights, and train by gradient descent. The tree’s structure, which determines what operators and variables appear at each position, is chosen once and applied to every target. This paper tests whether that choice affects which targets are actually recovered. Three structures are compared, all sharing the same operator and target language but differing in how variables enter the tree; one is strictly more expressive. Across over 12,700 training runs, one structure recovers a target at 100% while another scores 0%, and the ranking reverses on a different target. Expressiveness guarantees that a solution exists in the search space, but not that gradient descent finds it: the most expressive structure fails on targets that a restricted alternative solves reliably. Switching the operator changes which targets succeed; reversing its gradient profile collapses recovery entirely. Balanced (non-chain) tree shapes are never recovered. These findings show that the optimization landscape, not expressiveness alone, determines what gradient-based symbolic regression recovers.

[AI-130] AI-Assisted Code Review as a Scaffold for Code Quality and Self-Regulated Learning: An Experience Report

【速读】:该论文旨在解决软件工程教育中代码审查(code review)在顶点项目(capstone projects)中难以规模化的问题,主要挑战包括紧迫的截止日期、同伴反馈质量不均以及学生缺乏前期经验。解决方案的关键在于将大语言模型(LLM)作为审查者直接集成到 GitHub 拉取请求(pull requests)的工作流中,采用“人类在环”(human-in-the-loop)的设计模式,通过结构化评论引导学生聚焦代码质量并促进自我调节学习(self-regulated learning)。该设计不仅提升了迭代活动数量和工具可用性,还通过教学干预减少了对 AI 的过度依赖,从而实现对学生主导的 AI 辅助审查的负责任应用。

链接: https://arxiv.org/abs/2604.23251
作者: Eduardo Oliveira,Michael Fu,Patanamon Thongtanunam,Sonsoles López-Pernas,Mohammed Saqr
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Code review is central to software engineering education but hard to scale in capstone projects due to tight deadlines, uneven peer feedback, and limited prior experience. We investigate an LLM-as-reviewer integrated directly into GitHub pull requests (human-in-the-loop) across two cohorts (more than 100 students, 2023–2024). Using a mixed-methods design – GitHub data, reflective reports, and a targeted survey – we examine engagement and responsiveness as behavioral indicators of self-regulated learning processes. Quantitatively, the 2024 cohort produced more iterative activity (1176 vs. 581 PRs), while technical issues observed in 2023 (227 failed AI attempts) dropped to zero after tool and instructional refinements. Despite different adoption levels (93% vs. 50% of teams using the tool), responsiveness was stable: 32% (2023) and 33% (2024) of successfully AI-reviewed PRs were followed by subsequent commits on the same PR. Qualitatively, students used the LLM’s structured comments to focus reviews and discuss code quality, while guidance reduced over-reliance. We contribute: (i) an in-workflow design for an AI reviewer that scaffolds learning while mitigating cognitive offloading; (ii) a repeated cross sectional comparison across two cohorts in authentic settings; (iii) a mixed-methods analysis combining objective GitHub metrics with student self-reports; and (iv) evidence-based pedagogical recommendations for responsible, student-led AI-assisted review.

[AI-131] raining Machine Learning Models on Encrypted Data: A Privacy-Preserving Framework using Homomorphic Encryption

【速读】:该论文旨在解决机器学习(Machine Learning, ML)在数据驱动决策中因依赖敏感数据集而引发的隐私保护问题。传统加密方法仅能保障数据在静态存储或传输过程中的安全,无法保护计算过程中数据的隐私,存在未经授权访问的风险。为此,论文提出了一种基于同态加密(Homomorphic Encryption, HE)的隐私保护框架,其关键在于利用Cheon-Kim-Kim-Song (CKKS)方案实现对加密数据的近似实数运算,从而支持在不解密前提下训练K近邻(K-Nearest Neighbors, KNN)和线性回归模型,并完成基本多层感知机(Multilayer Perceptron, MLP)的加密推理。实验表明,该方法在保持模型性能与明文训练相当的同时,有效实现了隐私保护,为隐私增强型机器学习在实际场景中的应用奠定了基础。

链接: https://arxiv.org/abs/2604.23245
作者: Alexandre Marques,Beatriz Sá,Rui Botelho,Pedro Pinto
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted for publication at the 5th International Conference on Optimization, Learning Algorithms and Applications (OL2A 2026)

点击查看摘要

Abstract:The use of Machine Learning (ML) for data-driven decision-making often relies on access to sensitive datasets, which introduces privacy challenges. Traditional encryption methods protect data at rest or in transit but fail to secure it during processing, exposing it to unauthorized access. Homomorphic encryption emerges as a transformative solution, enabling computations on encrypted data without decryption, thus preserving confidentiality throughout the ML pipeline. This paper addresses the challenge of training ML models on encrypted data while maintaining accuracy and efficiency by proposing a proof-of-concept for a privacy-preserving framework that leverages Cheon-Kim-Kim-Song (CKKS) for approximate real-number arithmetic. Also, it demonstrates the feasibility of training K-Nearest Neighbors (KNN) and linear regression models on encrypted data, and evaluates encrypted inference for a basic Multilayer Perceptron (MLP) architecture. Experimental results show that models trained under Homomorphic encryption achieve performance metrics comparable to plaintext-trained models, validating the approach. However, challenges such as computational overhead, noise management, and limited support for non-polynomial operations persist. This work lays the groundwork for broader adoption of privacy-preserving ML in real-world applications, balancing security with computational feasibility.

[AI-132] AdaMamba: Adaptive Frequency-Gated Mamba for Long-Term Time Series Forecasting

【速读】:该论文旨在解决长期时间序列预测(Long-Term Time Series Forecasting, LTSF)中因跨域异质性(cross-domain heterogeneity)导致的频率域分析失效问题。现有基于频域的方法通常隐含假设变量在时域同步时也具有相似的频域特性,但现实中不同变量可能在时域上呈现同步趋势,而在频域上存在显著差异,从而限制了模型对复杂动态周期模式的建模能力。解决方案的关键在于提出AdaMamba框架,其核心创新是将自适应、上下文感知的频域分析内生化到Mamba状态空间更新过程中:通过交互式补丁编码模块捕捉变量间动态关联,并设计自适应频率门控状态空间模块,生成输入依赖的频率基底,同时将传统的时间遗忘门扩展为统一的时间-频率遗忘门,实现基于学习到的频域重要性的状态转移动态校准,从而在保持Mamba对长程依赖建模优势的同时,提升对跨域异质性的适应能力。

链接: https://arxiv.org/abs/2604.23239
作者: Xudong Jiang,Mingshan Loo,Hanchen Yang,Wengen Li,Mingrui Zhang,Yichao Zhang,Jihong Guan,Shuigeng Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate long-term time series forecasting (LTSF) requires the capture of complex long-range dependencies and dynamic periodic patterns. Recent advances in frequency-domain analysis offer a global perspective for uncovering temporal characteristics. However, real-world time series often exhibit pronounced cross-domain heterogeneity where variables that appear synchronized in the time domain can differ substantially in the frequency domain. Existing frequency-based LTSF methods often rely on implicit assumptions of cross-domain homogeneity, which limits their ability to adapt to such intricate variability. To effectively integrate frequency-domain analysis with temporal dependency learning, we propose AdaMamba, a novel framework that endogenizes adaptive and context-aware frequency analysis within the Mamba state-space update process. Specifically, AdaMamba introduces an interactive patch encoding module to capture inter-variable interaction dynamics. Then, we develop an adaptive frequency-gated state-space module that generates input-dependent frequency bases, and generalizes the conventional temporal forgetting gate into a unified time-frequency forgetting gate. This allows dynamic calibration of state transitions based on learned frequency-domain importance, while preserving Mamba’s capability in modeling long-range dependencies. Extensive experiments on seven public LTSF benchmarks and two domain-specific datasets demonstrate that AdaMamba consistently outperforms state-of-the-art methods in forecasting accu racy while maintaining competitive computational efficiency. The code of AdaMamba is available at this https URL.

[AI-133] Protecting the Trace: A Principled Black-Box Approach Against Distillation Attacks

【速读】:该论文旨在解决生成式 AI (Generative AI) 领域中前沿模型(frontier models)在通过采样推理轨迹(sampling reasoning traces)进行知识蒸馏时,因暴露于第三方而引发的安全、隐私与对齐风险问题。现有方法在对抗性蒸馏(antidistillation)上缺乏理论基础,常需依赖学生模型代理或大量微调,且易导致教师模型性能显著下降。论文的关键创新在于将 antidistillation 形式化为 Stackelberg 博弈,从而构建了一个具有理论支撑的框架;基于此框架,作者提出 \textttTraceGuard 方法,一种高效、黑盒、后生成阶段的中毒策略,可精准污染对教师推理至关重要的句子,实现对学生模型学习的有效干扰,同时保持教师模型性能稳定,为安全地共享模型洞察提供了可扩展的解决方案。

链接: https://arxiv.org/abs/2604.23238
作者: Max Hartman,Vidhata Jayaraman,Moulik Choraria,Lav R. Varshney
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Frontier models push the boundaries of what is learnable at extreme computational costs, yet distillation via sampling reasoning traces exposes closed-source frontier models to adversarial third parties who can bypass their guardrails and misappropriate their capabilities, raising safety, security, and intellectual privacy concerns. To address this, there is growing interest in building antidistillation methods, which aim to poison reasoning traces to hinder downstream student model learning while maintaining teacher performance. However, current techniques lack theoretical grounding, requiring either heavy fine-tuning or access to student model proxies for gradient based attacks, and often lead to a significant teacher performance degradation. In this work, we present a theoretical formulation of antidistillation as a Stackelberg game, grounding a problem that has so far largely been approached heuristically. Guided by the desired design properties our formulation reveals, we propose \textttTraceGuard, an efficient, post-generation black-box method to poison sentences with high importance for teacher reasoning. Our work offers a scalable solution to share model insights safely, ensuring that the advancement of reasoning capabilities does not come at the cost of intellectual privacy or AI safety alignment.

[AI-134] oward Polymorphic Backdoor against Semantic Communication via Intensity-Based Poisoning

【速读】:该论文旨在解决现有语义通信(Semantic Communication, SC)后门攻击中因采用单一目标的同质化攻击范式而导致的攻击多样性、效率与灵活性不足的问题,尤其在异构下游场景下表现受限。其解决方案的关键在于提出一种多态性SC后门攻击方法SemBugger,通过动态调整触发强度实现对SC知识的细粒度控制,从而从系统中生成多样化的恶意输出;该方法基于多效中毒训练框架,引入分级强度触发器对训练数据进行污染,并采用分层恶意损失函数优化SC系统,使系统知识能根据输入触发强度自适应调整以达成目标输出,同时保持良性样本的传输保真度。

链接: https://arxiv.org/abs/2604.23231
作者: Xiao Yang,Yuni Lai,Gaolei Li,Jun Wu,Kai Zhou,Jianhua Li,Mingzhe Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: This paper has been accepted by IEEE TIFS

点击查看摘要

Abstract:Semantic Communication (SC) backdoor attacks aim to utilize triggers to manipulate the system into producing predetermined outputs via backdoored shared knowledge. Current SC backdoors adopt monomorphic paradigms with single attack target, which suffers from limited attack diversity, efficiency, and flexibility in heterogeneous downstream scenarios. To overcome the limitations, we propose SemBugger, a polymorphic SC backdoor. By dynamically adjusting the trigger intensity, SemBugger finely-grained controls over the SC knowledge to generate diverse malicious results from the system. Specifically, SemBugger is realized through a multi-effect poisoning-training framework. It introduces graded-intensity triggers to poison training data and optimizes SC systems with hierarchical malicious loss. The trained system’s knowledge dynamically adapts to trigger intensity in inputs to yield target outputs, all while preserving transmission fidelity for benign samples. Moreover, to augment SC security, we propose a provable robustness defense that resists SemBugger’s homogeneous attacks through a controlled noise mechanism. It operates via strategically adding noise in SC inputs, and we formally provide a theoretical lower bound on the defense efficacy. Experiments across diverse SC models and benchmark datasets indicate that SemBugger attains high attack efficacy while maintaining the regular functionality of SC systems. Meanwhile, the designed defense effectively neutralizes SemBugger attacks.

[AI-135] StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning

【速读】:该论文旨在解决当前视频片段检索(video moment retrieval)模型在处理叙事内容时存在的语义鸿沟问题,即模型能够识别动作但无法推理其背后意图与因果关系,这源于缺乏理论心智(Theory of Mind, ToM)能力。解决方案的关键在于提出首个需要ToM推理的基准数据集StoryTR,并设计一种基于智能体的数据生成管道(Agentic Data Pipeline),通过显式的三层ToM链条(意图解码、叙事推理、边界定位)生成训练数据,从而引导模型学习从多模态线索中推断隐含心理状态和叙事逻辑的能力。实验表明,基于此方法训练的Shorts-Moment模型在StoryTR上相比基线提升15.1%相对IoU,证明了叙事推理能力比参数规模更重要。

链接: https://arxiv.org/abs/2604.23198
作者: Xuanyue Zhong,Yuqiang Xie,Guanqun Bi,Jiangping Yang,Guibin Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current video moment retrieval excels at action-centric tasks but struggles with narrative content. Models can see \textitwhat is happening but fail to reason \textitwhy it matters. This semantic gap stems from the lack of \textbfTheory of Mind (ToM): the cognitive ability to infer implicit intentions, mental states, and narrative causality from surface-level observations. We introduce \textbfStoryTR, the first video moment retrieval benchmark requiring ToM reasoning, comprising 8.1k samples from narrative short-form videos (shorts/reels). These videos present an ideal testbed. Their high information density encodes meaning through subtle multimodal cues. For instance, a glance paired with a sigh carries entirely different semantics than the glance alone. Yet multimodal perception alone is insufficient; ToM is required to decode that a character smiling'' may actually be concealing hostility.‘’ To teach models this reasoning capability, we propose an \textbfAgentic Data Pipeline that generates training data with explicit three-tier ToM chains (intent decoding, narrative reasoning, boundary localization). Experiments reveal the severity of the reasoning gap: Gemini-3.0-Pro achieves only 0.53 Avg IoU on StoryTR. However, our 7B \textbfShorts-Moment model, trained on ToM-guided data, improves +15.1% relative IoU over baselines, demonstrating that \textitnarrative reasoning capability matters more than parameter scale.

[AI-136] From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents

【速读】:该论文旨在解决现有基于大语言模型的智能体在动态多步骤任务中因固定粒度规划导致的效率与适应性不足问题,即简单任务过度细化、复杂任务细节不足,难以平衡规划的简洁性与复杂性。解决方案的关键在于提出一种自适应分层规划机制 AdaPlan-H,该机制受认知科学中“渐进精炼”原则启发,从粗粒度宏观计划开始,依据任务复杂度逐步精细化,生成针对不同难度任务的自适应层级计划,并通过模仿学习和能力增强进行优化,从而显著提升任务执行成功率并减少过度规划现象。

链接: https://arxiv.org/abs/2604.23194
作者: Haoran Tan,Zeyu Zhang,Chen Ma,Tianze Liu,Quanyu Dai,Xu Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model-based agents have recently emerged as powerful approaches for solving dynamic and multi-step tasks. Most existing agents employ planning mechanisms to guide long-term actions in dynamic environments. However, current planning approaches face a fundamental limitation that they operate at a fixed granularity level. Specifically, they either provide excessive detail for simple tasks or insufficient detail for complex ones, failing to achieve an optimal balance between simplicity and complexity. Drawing inspiration from the principle of \textitprogressive refinement in cognitive science, we propose \textbfAdaPlan-H, a self-adaptive hierarchical planning mechanism that mimics human planning strategies. Our method initiates with a coarse-grained macro plan and progressively refines it based on task complexity. It generates self-adaptive hierarchical plans tailored to the varying difficulty levels of different tasks, which can be optimized by imitation learning and capability enhancement. Experimental results demonstrate that our method significantly improves task execution success rates while mitigating overplanning at the planning level, providing a flexible and efficient solution for multi-step complex decision-making tasks. To contribute to the community, our code and data will be made publicly available at this https URL.

[AI-137] RAT: RunAnyThing via Fully Automated Environment Configuration

【速读】:该论文旨在解决自动化仓库级软件工程任务中环境配置的难题,特别是针对手动配置带来的劳动密集型瓶颈问题。现有方法通常依赖预定义的构件或局限于特定编程语言,难以适配真实世界仓库的多样性。解决方案的关键在于提出一种无语言限制的框架RAT(RunAnyThing),其核心是一个多阶段流水线,包括语义初始化、规划机制、专用工具集和健壮的沙箱环境,从而实现对任意仓库的自动化环境配置。实验表明,RAT在环境设置成功率(ESSR)上相较强基线平均提升29.6%,验证了其有效性与通用性。

链接: https://arxiv.org/abs/2604.23190
作者: Renhong Huang,Dongdong Hua,Yifei Sun,Sitao Ding,Hanyang Yuan,Daixin Wang,Yang Yang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automating repository-level software engineering tasks is a foundational challenge for autonomous code agents, largely due to the difficulty of configuring executable environments. However, manual configuration remains a labor-intensive bottleneck, necessitating a transition toward fully automated environment configuration. Existing approaches often rely on pre-defined artifacts or are restricted to specific programming languages, limiting their applicability to real-world repositories. In this paper, we first propose RAT (RunAnyThing), a language-agnostic framework for automated environment configuration on arbitrary repositories. RAT features a multi-stage pipeline that integrates semantic initialization, a planning mechanism, specialized toolset, and a robust sandbox for configuration. Furthermore, to enable rigorous evaluation, we propose RATBench, a benchmark that reflects the the distribution and heterogeneity of real-world repositories. Extensive experiments demonstrate that RAT achieves state-of-the-art performance, improving the Environment Setup Success Rate (ESSR) by an average of 29.6% over strong baselines.

[AI-138] Designing escalation criteria for international AI incident response: criteria triggers and thresholds

【速读】:该论文旨在解决当前AI事件报告机制中缺乏明确操作标准的问题,即在何种情况下应将已检测到的AI事件从国家层面升级至国际协调处理。其解决方案的关键在于提出一个可跨司法管辖区通用的升级框架,该框架基于对SB 53法案、欧盟《人工智能法案》(EU AI Act)、全球人工智能伙伴关系(GPAI)行为准则及其他行业事件管理框架的分析,提炼出八个评估标准,并将其转化为带有分阶段决策点和阈值检查的流程图。此框架不仅提供统一的判断逻辑,还通过实证测试识别出三种可能导致系统性漏检的设计模式,从而为政策制定者和从业者优化AI事件响应机制提供了结构化依据。

链接: https://arxiv.org/abs/2604.23183
作者: Francesca Gomez,Matthew Ball,Michael Harre,Lydia Preston,Josephine Schwab,Caio Machado
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI incident reporting requirements are emerging in regulation and policy, yet no operational criteria exist for determining when a detected AI incident warrants escalation beyond national handling to international coordination. This paper proposes an escalation framework to address this gap, intended as a common reference point across jurisdictions that enables aligned escalation while preserving flexibility in how actors respond within their own legal and policy contexts. We review SB 53, the EU AI Act, the GPAI Code of Practice, and incident frameworks from other industries to derive eight criteria for assessing whether an incident warrants escalation, translated into a sequential flowchart with gated decision points and threshold checks. For each criterion, we map how it interplays with these regulatory frameworks, identifying where their design choices support or undermine effective detection. We test the framework against ten documented AI incidents and structured variants to identify where criteria under-detect or misclassify incidents in practice. We find three design patterns that may lead to systematic under-detection in regimes where model developers are responsible for escalation: a. where escalation requires confirmed harm, events such as model weight exfiltration risk detection only after severe, irreversible harm has propagated; b. where incidents are assessed individually, systemic harms emerging from accumulation risk being under-detected; and c. where thresholds align with legal instruments rather than quantitatively testable terms, criteria risk being impractical to apply under time pressure. We also find that escalation rules are only one component of a broader framework: the underlying definitions against which thresholds are set, and the data available to the responsible actor, create interdependencies that can themselves drive under-detection.

[AI-139] Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM -as-a-Judge Pipelines

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)作为评判者(LLM-as-a-Judge)时存在的系统性偏差问题,此类偏差会显著影响对语言模型输出质量评估的可靠性。研究通过系统性实证比较九种去偏策略在五种不同来源的判别模型(来自Google、Anthropic、OpenAI和Meta四家提供商)、三个基准测试集(MT-Bench、LLMBar及自定义数据集)和四种类型偏差下的表现,发现风格偏差(style bias)是主导因素(0.76–0.92),远高于位置偏差(position bias = 0.04),且此前研究关注不足;同时验证了模型对简洁性的偏好并非源于长度误判,而是具备区分质量和长度的能力(准确率达0.92–1.00)。关键解决方案在于提出一种组合预算策略(combined budget strategy),在Claude Sonnet 4上实现显著提升(+11.2个百分点,p < 0.0001),且多数模型呈现正向趋势,表明去偏策略具有模型依赖性但总体有效。

链接: https://arxiv.org/abs/2604.23178
作者: Sadman Kabir Soumik
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 4 figures, 6 tables. Under review at TMLR

点击查看摘要

Abstract:LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empirical study comparing nine debiasing strategies across five judge models from four provider families (Google, Anthropic, OpenAI, Meta), three benchmarks (MT-Bench n=400, LLMBar n=200, custom n=225), and four bias types. Our key findings: (1) Style bias is the dominant bias (0.76-0.92 across all models), far exceeding position bias (= 0.04), yet has received minimal research attention. (2) All models show a conciseness preference on expansion pairs, but truncation controls confirm they correctly distinguish quality from length (0.92-1.00 accuracy), suggesting quality-sensitive evaluation rather than a simple length bias. (3) Debiasing is beneficial but model-dependent: the combined budget strategy significantly improves Claude Sonnet 4 by +11.2 pp (p 0.0001), with directionally positive trends for other models. Only 2 of 20 non-baseline configurations show decreased agreement. We release our evaluation framework, controlled dataset, and all experimental artifacts at this https URL.

[AI-140] Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns

【速读】:该论文旨在解决大规模多专家模型(Mixture-of-Experts, MoE)在多节点部署场景下的推理瓶颈问题,特别是由专家负载不均和低效的令牌路由所导致的显著跨节点全对全(all-to-all)通信开销。通过系统性地分析多个前沿开源MoE模型(如Llama 4 Maverick、DeepSeek V3-671B和Qwen3-230B-A22B)的真实专家激活轨迹(超过10万条),研究发现存在三个普遍现象:专家负载波动性强、不同任务类别(代码、数学、对话、通用)对应不同的热门专家、以及预填充(prefill)与解码(decode)阶段的专家激活高度相关。基于这些洞察,论文提出两项关键优化策略:一是基于工作负载感知的微批分组(workload-aware micro-batch grouping),二是优化专家放置策略以最大化令牌到目标专家的局部性(token locality)。这两项措施显著减少了跨节点通信数据量(最高达20%),从而降低了MoE解码延迟并提升了加速器利用率。

链接: https://arxiv.org/abs/2604.23150
作者: Abhimanyu Bambhaniya,Geonhwa Jeong,Jason Park,Jiecao Yu,Jaewon Lee,Pengchao Wang,Changkyu Kim,Chunqiang Tang,Tushar Krishna
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:

点击查看摘要

Abstract:Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without proportional per-token compute, enabling higher-quality outputs at manageable serving costs. However, MoE inference at scale is fundamentally bottlenecked by expert load imbalance and inefficient token routing, especially in multi-node deployments where tokens are not guaranteed to be routed to local experts, resulting in significant inter-node all-to-all communication overhead. To systematically characterize these challenges, we profile SOTA open-source MoE models, including Llama 4 Maverick, DeepSeek V3-671B, and Qwen3-230B-A22B, on various datasets and collected over 100k real expert activation traces. Upon studying the expert activation patterns, we uncover various persistent properties across all the frontier MoE models: variable expert load imbalance, domain-specific expert activation where expert popularity shifts across task families (code, math, chat, general), and a strong correlation between prefill and decode expert activations. Motivated by these findings, we propose workload-aware micro-batch grouping and an expert placement strategy to maximize token locality to the destination expert, thereby reducing inter-node communication. Across models and datasets, these optimizations help reduce all2all communication data up to 20, resulting in lower MoE decode latency and better accelerator utilization. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR) Cite as: arXiv:2604.23150 [cs.LG] (or arXiv:2604.23150v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.23150 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-141] PhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks

【速读】:该论文旨在解决基于增强现实与大语言模型(AR-LLM)的社会工程攻击(AR-LLM-SE)在实际应用中面临的两大瓶颈问题:一是“冷启动个性化”问题,即当前检索增强生成(Retrieval-Augmented Generation, RAG)方法在对话初期引入显著延迟,阻碍实时社会画像构建;二是“静态攻击策略”问题,现有方法依赖固定阶段的手工设计社交工程脚本,缺乏心理学理论支撑。解决方案的关键在于提出PhySE框架,其核心创新包括:(1)基于视觉语言模型(Visual Language Model, VLM)的社会情境预训练,实现快速、动态的个体画像生成,消除初始延迟;(2)自适应心理代理机制,通过心理学驱动的大语言模型根据目标响应动态调整策略类别,突破传统静态脚本限制,从而提升攻击的隐蔽性与有效性。

链接: https://arxiv.org/abs/2604.23148
作者: Tianlong Yu,Yang Yang,Ziyi Zhou,Jiaying Xu,Siwei Li,Tong Guan,Kailong Wang,Ting Bi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The emerging threat of AR-LLM-based Social Engineering (AR-LLM-SE) attacks (e.g. SEAR) poses a significant risk to real-world social interactions. In such an attack, a malicious actor uses Augmented Reality (AR) glasses to capture a target visual and vocal data. A Large Language Model (LLM) then analyzes this data to identify the individual and generate a detailed social profile. Subsequently, LLM-powered agents employ social engineering strategies, providing real-time conversation suggestions, to gain the target trust and ultimately execute phishing or other malicious acts. Despite its potential, the practical application of AR-LLM-SE faces two major bottlenecks, (1) Cold-start personalization, Current Retrieval-Augmented Generation (RAG) methods introduce critical delays in the earliest turns, slowing initial profile formation and disrupting real-time interaction, (2) Static Attack Strategies, Existing approaches rely on fixed-stage, handcrafted social engineering tactics that lack foundation in established psychological theory. To address these limitations, we propose PhySE, a novel framework with two core innovations, (1) VLM-Based SocialContext Training, To eliminate profiling delays, we efficiently pre-train a Visual Language Model (VLM) with social-context data, enabling rapid, on-the-fly profile generation, (2) Adaptive Psychological Agent, We introduce a psychological LLM that dynamically deploys distinct classes of psychological strategies based on target response, moving beyond static, handcrafted scripts. We evaluated PhySE through an IRB-approved user study with 60 participants, collecting a novel dataset of 360 annotated conversations across diverse social scenarios.

[AI-142] UNSEEN: A Cross-Stack LLM Unlearning Defense against AR-LLM Social Engineering Attacks

【速读】:该论文旨在解决基于增强现实-大语言模型(AR-LLM)的社会工程攻击(SEAR)所带来的新型隐私与安全威胁。此类攻击利用AR眼镜采集目标的视觉和语音信息,通过大语言模型(LLM)生成社交画像并驱动智能代理实施钓鱼诱导,而现有防御手段如基于角色的访问控制或数据流追踪因无法适配嵌入式AR设备与黑盒式LLM推理机制而失效。解决方案的关键在于提出UNSEEN——一种跨栈协同防御框架,其核心包括:(1)AR访问控制层(AR ACL),实现基于身份的感知权限管理;(2)基于F-RMU的LLM遗忘机制,用于抑制敏感个人特征的生成;(3)运行时代理护栏机制,对交互式代理行为进行动态约束。该方案在60名参与者的IRB批准用户研究中验证了有效性,涵盖360条真实社交场景对话数据。

链接: https://arxiv.org/abs/2604.23141
作者: Tianlong Yu,Yang Yang,Xiao Luo,Lihong Liu,Fudu Xing,Zui Tao,Kailong Wang,Gaoyang Liu,Ting Bi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Emerging AR-LLM-based Social Engineering attack (e.g., SEAR) is at the edge of posing great threats to real-world social life. In such AR-LLM-SE attack, the attacker can leverage AR (Augmented Reality) glass to capture the image and vocal information of the target, using the LLM to identify the target and generate the social profile, using the LLM agents to apply social engineering strategies for conversation suggestion to win the target trust and perform phishing afterwards. Current defensive approaches, such as role-based access control or data flow tracking, are not directly applicable to the convergent AR-LLM ecosystem (considering embedded AR device and opaque LLM inference), leaving an emerging and potent social engineering threat that existing privacy paradigms are ill-equipped to address. This necessitates a shift beyond solely human-centric measures like legislation and user education toward enforceable vendor policies and platform-level restrictions. Realizing this vision, however, faces significant technical challenges: securing resource-constrained AR-embedded devices, implementing fine-grained access control within opaque LLM inferences, and governing adaptive interactive agents. To address these challenges, we present UNSEEN, a coordinated cross-stack defense that combines an AR ACL (Access Control Layer) for identity-gated sensing, F-RMU-based LLM unlearning for sensitive profile suppression, and runtime agent guardrails for adaptive interaction control. We evaluate UNSEEN in an IRB-approved user study with 60 participants and a dataset of 360 annotated conversations across realistic social scenarios.

[AI-143] ArgRE: Formal Argumentation for Conflict Resolution in Multi-Agent Requirements Negotiation

【速读】:该论文旨在解决复杂软件系统中多质量属性(quality attributes)冲突难以有效平衡的问题,尤其是在监管严格领域中缺乏可审计的决策机制。现有基于多智能体大语言模型(multi-agent large language model)的框架通常采用启发式方法进行冲突解决,导致需求聚合过程隐式且不可追溯。解决方案的关键在于提出ArgRE系统,其核心创新是将Dung风格的抽象论证理论(abstract argumentation)嵌入到需求协商阶段:每个提案、批评和优化均被建模为论证,冲突以有向攻击关系表示,并通过基底(grounded)和偏好(preferred)语义计算接受的论证集合,从而实现论证级别的可追溯性;同时集成KAOS目标建模、多层验证与标准导向的产物生成,显著提升了决策透明度与合规覆盖能力。

链接: https://arxiv.org/abs/2604.23124
作者: Haowei Cheng,Milhan Kim,Chong Liu,Teeradaj Racharak,Truong Vinh Truong Duy,Phan Thi Huyen Thanh,Jialong Li,Naoyasu Ubayashi,Hironori Washizaki
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As software systems grow in complexity, they must satisfy an increasing number of competing quality attributes, making it essential to balance them in a principled manner – for example, a safety requirement for sensor-fusion verification may conflict with a tight planning-cycle budget. Multi-agent large language model frameworks support this balancing process by assigning specialized agents to different objectives. However, their conflict resolution is typically heuristic. Requirements are aggregated implicitly without explicit acceptance or rejection, limiting auditability in regulated domains. We present ArgRE, a multi-agent requirements negotiation system that embeds Dung-style abstract argumentation into the negotiation stage. Each proposal, critique, and refinement is modeled as an argument, conflicts are represented as directed attack relations, and the accepted set of arguments is computed under grounded and preferred semantics. The pipeline further integrates KAOS goal modeling, multi-layer verification, and standards-oriented artifact generation. Evaluation across five case studies spanning safety-critical, financial, and information-system domains shows that ArgRE provides argument-level traceability absent from existing frameworks. Independent evaluators rated its decision justifications significantly higher than those of heuristic synthesis (4.32 vs. 3.07, p 0.001), indicating improved auditability, while semantic intent preservation remains comparable (94.9% BERTScore F1) and compliance coverage reaches 84.7% versus 47.6%–47.8% for baselines. Structural analysis further confirms that the default pairwise protocol yields acyclic graphs in which grounded and preferred semantics coincide, whereas cross-pair arbitration introduces controlled cyclicity, leading to predictable divergence between the two semantics.

[AI-144] ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

【速读】:该论文旨在解决生成式 AI (Generative AI) 模型评估过程中资源消耗大、效率低的问题,具体包括推理速度慢、人工评分成本高以及模型与基准测试快速迭代带来的挑战。其解决方案的关键在于提出 ProEval 框架,利用预训练的高斯过程(Gaussian Process, GP)作为性能评分函数的代理模型,将模型输入映射为误差严重性或安全违规等指标;并通过贝叶斯积分(Bayesian Quadrature, BQ)建模性能估计和超水平集采样(superlevel set sampling)进行失败案例发现,结合不确定性感知的主动决策策略,智能选择或合成具有信息量的测试样本,从而显著提升评估效率并增强对失败模式的覆盖能力。

链接: https://arxiv.org/abs/2604.23099
作者: Yizheng Huang,Wenjun Zeng,Aditi Kumaresan,Zi Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Our open-sourced code and data can be found at this https URL

点击查看摘要

Abstract:Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.

[AI-145] owards Automated Ontology Generation from Unstructured Text: A Multi-Agent LLM Approach

【速读】:该论文旨在解决从非结构化自然语言中自动构建形式化本体(Ontology)这一知识工程领域的核心挑战,尤其关注大型语言模型(Large Language Models, LLMs)在生成质量上的局限性及其成因。其解决方案的关键在于提出一种多智能体架构(multi-agent architecture),将本体构建过程分解为四个以产物为导向的角色:领域专家(Domain Expert)、管理者(Manager)、编码者(Coder)和质量保证者(Quality Assurer)。实验表明,该方法通过前置规划(front-loaded planning)显著提升了本体的结构质量,并适度改善了查询能力,验证了“规划优先、产物驱动”的生成范式在可扩展自动化本体工程中的有效性与可审计性。

链接: https://arxiv.org/abs/2604.23090
作者: Abid Talukder,Maruf Ahmed Mridul,Oshani Seneviratne
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automatically generating formal ontologies from unstructured natural language remains a central challenge in knowledge engineering. While large language models (LLMs) show promise, it remains unclear which architectural design choices drive generation quality and why current approaches fail. We present a controlled experimental study using domain-specific insurance contracts to investigate these questions. We first establish a single-agent LLM baseline, identifying key failure modes such as poor Ontology Design Pattern compliance, structural redundancy, and ineffective iterative repair. We then introduce a multi-agent architecture that decomposes ontology construction into four artifact-driven roles: Domain Expert, Manager, Coder, and Quality Assurer. We evaluate performance across architectural quality (via a panel of heterogeneous LLM judges) and functional usability (via competency question driven SPARQL evaluation with complementary retrieval augmented generation based assessment). Results show that the multi-agent approach significantly improves structural quality and modestly enhances queryability, with gains driven primarily by front-loaded planning. These findings highlight planning-first, artifact-driven generation as a promising and more auditable path toward scalable automated ontology engineering.

[AI-146] Analytica: Soft Propositional Reasoning for Robust and Scalable LLM -Driven Analysis ICLR2026

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在复杂现实世界分析任务中面临的推理随机不稳定性以及缺乏可验证、可组合结构的问题。其解决方案的核心是提出一种基于软命题推理(Soft Propositional Reasoning, SPR)的新颖代理架构 Analytica,该架构将复杂分析建模为对不同结果命题的软真值估计过程,并通过形式化建模和最小化估计误差(包括偏差与方差)来提升性能。关键创新在于:1)采用并行分治框架分解问题为子命题树,结合工具增强的LLM接地代理(如Jupyter Notebook代理)进行事实验证与评分以降低偏差;2)利用稳健线性模型递归合成已接地的叶节点,有效平均随机噪声,从而显著降低方差,同时支持高效的交互式“假设分析”(what-if analysis)。

链接: https://arxiv.org/abs/2604.23072
作者: Junyan Cheng,Kyle Richardson,Peter Chin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICLR 2026 Camera-ready

点击查看摘要

Abstract:Large language model (LLM) agents are increasingly tasked with complex real-world analysis (e.g., in financial forecasting, scientific discovery), yet their reasoning suffers from stochastic instability and lacks a verifiable, compositional structure. To address this, we introduce Analytica, a novel agent architecture built on the principle of Soft Propositional Reasoning (SPR). SPR reframes complex analysis as a structured process of estimating the soft truth values of different outcome propositions, allowing us to formally model and minimize the estimation error in terms of its bias and variance. Analytica operationalizes this through a parallel, divide-and-conquer framework that systematically reduces both sources of error. To reduce bias, problems are first decomposed into a tree of subpropositions, and tool-equipped LLM grounder agents are employed, including a novel Jupyter Notebook agent for data-driven analysis, that help to validate and score facts. To reduce variance, Analytica recursively synthesizes these grounded leaves using robust linear models that average out stochastic noise with superior efficiency, scalability, and enable interactive “what-if” scenario analysis. Our theoretical and empirical results on economic, financial, and political forecasting tasks show that Analytica improves 15.84% accuracy on average over diverse base models, achieving 71.06% accuracy with the lowest variance of 6.02% when working with a Deep Research grounder. Our Jupyter Notebook grounder shows strong cost-effectiveness that achieves a close 70.11% accuracy with 90.35% less cost and 52.85% less time. Analytica also exhibits highly noise-resilient and stable performance growth as the analysis depth increases, with a near-linear time complexity, as well as good adaptivity to open-weight LLMs and scientific domains.

[AI-147] C-MORAL: Controllable Multi-Objective Molecular Optimization with Reinforcement Alignment for LLM s

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在分子优化任务中难以有效对齐选择性且相互竞争的药物设计约束的问题。其核心挑战在于如何在多个目标属性(如活性、选择性、ADMET性质等)之间实现稳定且可控的优化,同时保持分子骨架相似性。解决方案的关键在于提出C-Moral框架,该框架采用基于分组的相对优化策略、用于异构目标的属性评分对齐机制以及连续非线性奖励聚合方法,从而提升在多目标优化场景下的稳定性与性能表现。实验表明,C-Moral在C-MuMOInstruct基准测试中显著优于现有最先进模型,在域内(IND)和域外(OOD)任务上分别实现了48.9%和39.5%的最佳成功优化率(Success Optimized Rate, SOR),同时有效维持了分子骨架的相似性。

链接: https://arxiv.org/abs/2604.23061
作者: Rui Gao,Youngseung Jeon,Swastik Roy,Morteza Ziyadi,Xiang ‘Anthony’ Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 6 figures

点击查看摘要

Abstract:Large language models (LLMs) show promise for molecular optimization, but aligning them with selective and competing drug-design constraints remains challenging. We propose C-Moral, a reinforcement learning post-training framework for controllable multi-objective molecular optimization. C-Moral combines group-based relative optimization, property score alignment for heterogeneous objectives, and continuous non-linear reward aggregation to improve stability across competing properties. Experiments on the C-MuMOInstruct benchmark show that C-Moral consistently outperforms state-of-the-art models across both in-domain and out-of-domain settings, achieving the best Success Optimized Rate (SOR) of 48.9% on IND tasks and 39.5% on OOD tasks, while largely preserving scaffold similarity. These results suggest that RL post-training is an effective way to align molecular language models with continuous molecular design objectives. Our code and models are publicly available at this https URL.

[AI-148] Dont Make the LLM Read the Graph: Make the Graph Think

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在协作式多智能体推理中性能提升的问题,特别是通过引入显式信念图(belief graphs)来增强模型对其他智能体意图和知识状态的理解能力。其核心解决方案在于设计两种不同的信念图集成架构:一是作为提示上下文(prompt context),二是作为动作选择的结构化门控机制(gated action selection via ranked shortlists)。研究发现,仅将信念图作为提示内容时,其价值受限于模型强度——强模型无法从中获益,而弱模型则显著受益;但当信念图用于结构化地筛选行动选项时,即使强模型也能获得显著性能提升(如第二层心智理论任务准确率从20%提升至100%),表明结构化的信念图嵌入是实现高效协作推理的关键。此外,论文还揭示了“规划者违抗”(Planner Defiance)现象,即部分模型家族(如Llama 70B)倾向于忽略正确规划建议,进一步说明系统设计需兼顾模型行为可预测性与外部信念信息的有效融合。

链接: https://arxiv.org/abs/2604.23057
作者: Yuqi Sun,Tianqin Meng,George Liu,Yashraj Panwar,Lakshya Chaudhry,Munasib Ilham,Aman Chadha
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: main body has 9 pages, 4 figures, under review for COLM 2026 conference

点击查看摘要

Abstract:We investigate whether explicit belief graphs improve LLM performance in cooperative multi-agent reasoning. Through 3,000+ controlled trials across four LLM families in the cooperative card game Hanabi, we establish four findings. First, integration architecture determines whether belief graphs provide value: as prompt context, graphs are decorative for strong models and beneficial only for weak models on 2nd-order Theory of Mind (80% vs 10%, p0.0001, OR=36.0); when graphs gate action selection through ranked shortlists, they become structurally essential even for strong models (100% vs 20% on 2nd-order ToM, p0.001). Second, we identify “Planner Defiance,” a model-family-specific failure where LLMs override correct planner recommendations at partial competence (90% override, replicated N=20); Gemini models show near-zero defiance while Llama 70B shows 90%, and models distinguish factual context (deferred to) from advisory recommendations (overridden). Third, full-game evidence confirms inter-agent conventions (+128% over baseline, p=0.003) outperform all single-agent interventions, and individual belief-graph components must be combined to produce gains. Fourth, preliminary scaling analysis (N=10/cell, exploratory) suggests graph depth has diminishing returns: shallow graphs provide the best cost-benefit ratio, while deeper ToM graphs appear harmful at larger player counts (-1.5 pts at 5-player, p=0.029).

[AI-149] K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning ICML ICML2025

【速读】:该论文旨在解决策略梯度强化学习中奖励归一化(reward normalization)依赖固定启发式方法所带来的局限性,尤其是在非平稳环境或高方差回报场景下性能不稳定的问题。其解决方案的关键在于引入一维卡尔曼滤波器(1D Kalman filter)用于在线奖励估计,通过递归地估计潜在奖励均值,实现对高方差回报的平滑处理并自适应环境变化,同时保持极低计算开销且无需修改现有策略网络结构。

链接: https://arxiv.org/abs/2604.23056
作者: Zixuan Xia,Quanxi Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted in NewInML Workshop, The 42nd International Conference on Machine Learning (ICML 2025).\href{ this https URL }{Event Page}

点击查看摘要

Abstract:We propose a simple yet effective alternative to reward normalization in policy gradient reinforcement learning by integrating a 1D Kalman filter for online reward estimation. Instead of relying on fixed heuristics, our method recursively estimates the latent reward mean, smoothing high-variance returns and adapting to non-stationary environments. This approach incurs minimal overhead and requires no modification to existing policy architectures. Experiments on \textitLunarLander and \textitCartPole demonstrate that Kalman-filtered rewards significantly accelerate convergence and reduce training variance compared to standard normalization techniques. Code is available at this https URL.

[AI-150] A Decoupled Human-in-the-Loop System for Controlled Autonomy in Agent ic Workflows

【速读】:该论文旨在解决当前多智能体系统中人类在环(Human-in-the-Loop, HITL)机制集成不一致、难复用、难以扩展的问题。现有HITL实现通常嵌入于应用逻辑内部,导致跨智能体环境中的治理能力受限。其解决方案的关键在于提出一种解耦的HITL系统架构,将人类监督作为独立组件纳入智能体运行环境,并通过明确的接口与结构化执行模型分离人类交互管理与应用工作流;同时引入一个四维设计框架(干预条件、角色解析、交互语义与通信通道),以支持选择性且上下文感知的人类介入,从而保障系统一致性并兼容新兴的智能体通信协议,为可扩展的治理和渐进式自主性提供基础。

链接: https://arxiv.org/abs/2604.23049
作者: Edward Cheng,Jeshua Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures

点击查看摘要

Abstract:AI agents are increasingly deployed to execute tasks and make decisions within agentic workflows, introducing new requirements for safe and controlled autonomy. Prior work has established the importance of human oversight for ensuring transparency, accountability, and trustworthiness in such systems. However, existing implementations of Human-in-the-Loop (HITL) mechanisms are typically embedded within application logic, limiting reuse, consistency, and scalability across multi-agent environments. This paper presents a decoupled HITL system architecture that treats human oversight as an independent system component within the agent operating environment. The proposed design separates human interaction management from application workflows through explicit interfaces and a structured execution model. In addition, a design framework is introduced to formalize HITL integration along four dimensions: intervention conditions, role resolution, interaction semantics, and communication channel. This framework enables selective and context-aware human involvement while maintaining system-level consistency. The approach supports alignment with emerging agent communication protocols, allowing HITL to be implemented as a protocol-level concern. By externalizing HITL and structuring its integration, the system provides a foundation for scalable governance and progressive autonomy in agentic workflows. Comments: 8 pages, 2 figures Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2604.23049 [cs.AI] (or arXiv:2604.23049v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.23049 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-151] A Systematic Approach for Large Language Models Debugging

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中调试困难的问题,主要源于其黑箱特性和概率性输出机制,以及在多样任务和场景下难以系统化诊断错误。解决方案的关键在于提出一种将LLM视为可观测系统的结构化调试方法,通过整合评估(evaluation)、可解释性(interpretability)与错误分析(error analysis)实践,实现从问题检测到模型优化的闭环迭代流程,从而支持提示词调整、参数微调及数据适配等操作,并在缺乏标准化基准和评估标准的场景下依然保持有效性。该方法显著提升了调试效率、可复现性、透明度与部署扩展性。

链接: https://arxiv.org/abs/2604.23027
作者: Basel Shbita,Anna Lisa Gentile,Bing Zhang,Sungeun An,Shailja Thakur,Shubhi Asthana,Yi Zhou,Saptha Surendran,Farhan Ahmed,Rohan Kulkarni,Yuya Jeremy Ong,Chad DeLuca,Hima Patel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have become central to modern AI workflows, powering applications from open-ended text generation to complex agent-based reasoning. However, debugging these models remains a persistent challenge due to their opaque and probabilistic nature and the difficulty of diagnosing errors across diverse tasks and settings. This paper introduces a systematic approach for LLM debugging that treats models as observable systems, providing structured, model-agnostic methods from issue detection to model refinement. By unifying evaluation, interpretability, and error-analysis practices, our approach enables practitioners to iteratively diagnose model weaknesses, refine prompts and model parameters, and adapt data for fine-tuning or assessment, while remaining effective in contexts where standardized benchmarks and evaluation criteria are lacking. We argue that such a structured methodology not only accelerates troubleshooting but also fosters reproducibility, transparency, and scalability in the deployment of LLM-based systems.

[AI-152] Vision-Language-Action in Robotics: A Survey of Datasets Benchmarks and Data Engines

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型发展中被忽视的核心瓶颈——支撑具身学习的数据基础设施问题。作者指出,未来VLA模型的突破将不再主要依赖于模型架构的改进,而是取决于高保真度数据引擎与结构化评估协议的协同设计。解决方案的关键在于将数据基础设施提升为研究的核心问题,而非背景性考量,并通过系统性分析三大支柱:数据集(涵盖真实世界与合成数据在具身多样性、模态组合和动作空间上的差异)、基准测试(揭示任务复杂性与环境结构之间的耦合关系及现有评估协议在组合泛化与长时推理方面的缺失),以及数据引擎(比较基于仿真的、视频重建的和自动化任务生成范式,识别其在物理真实性与仿真到现实迁移中的共性局限),从而提炼出四个开放挑战:表征对齐、多模态监督、推理评估和可扩展的数据生成。

链接: https://arxiv.org/abs/2604.23001
作者: Ziyao Wang,Bingying Wang,Hanrong Zhang,Tingting Du,Tianyang Chen,Guoheng Sun,Yexiao He,Zheyu Shen,Wanghao Ye,Ang Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: This is a survey paper. The survey is already accepted by TMLR after peer-review. The OpenReview link is here: this https URL

点击查看摘要

Abstract:Despite remarkable progress in Vision–Language–Action (VLA) models, a central bottleneck remains underexamined: the data infrastructure that underlies embodied learning. In this survey, we argue that future advances in VLA will depend less on model architecture and more on the co-design of high-fidelity data engines and structured evaluation protocols. To this end, we present a systematic, data-centric analysis of VLA research organized around three pillars: datasets, benchmarks, and data engines. For datasets, we categorize real-world and synthetic corpora along embodiment diversity, modality composition, and action space formulation, revealing a persistent fidelity-cost trade-off that fundamentally constrains large-scale collection. For benchmarks, we analyze task complexity and environment structure jointly, exposing structural gaps in compositional generalization and long-horizon reasoning evaluation that existing protocols fail to address. For data engines, we examine simulation-based, video-reconstruction, and automated task-generation paradigms, identifying their shared limitations in physical grounding and sim-to-real transfer. Synthesizing these analyses, we distill four open challenges: representation alignment, multimodal supervision, reasoning assessment, and scalable data generation. Addressing them, we argue, requires treating data infrastructure as a first-class research problem rather than a background concern.

[AI-153] owards Causally Interpretable Wi-Fi CSI-Based Human Activity Recognition with Discrete Latent Compression and LTL Rule Extraction

【速读】:该论文旨在解决基于Wi-Fi信道状态信息(Channel State Information, CSI)的人类活动识别(Human Activity Recognition, HAR)中模型可解释性、符号可控性和直接处理高维原始信号之间的矛盾问题。现有深度神经网络虽在CSI-based HAR(CHAR)中表现优异,但其依赖连续潜变量导致黑箱特性且难以调控;而纯符号方法又无法直接处理原始CSI流。解决方案的关键在于提出一个完全自动且严格解耦的流程:首先使用带有Gumbel-Softmax潜变量的分类变分自编码器(categorical variational autoencoder)对CSI幅值窗口进行压缩,获得受容量约束的紧凑离散表示;随后冻结编码器作为确定性映射生成one-hot潜轨迹,并在其上执行因果发现以估计类别条件下的时序依赖图;最终将统计显著的滞后依赖关系转化为线性时序逻辑(Linear Temporal Logic, LTL)规则,构建仅基于规则评估与聚合的符号化、确定性分类器,无需任何学习判别头。此方法实现了从原始CSI到可解释规则的端到端符号化建模,同时支持天线级规则组合实现结构化多天线融合,无需重新训练编码器。

链接: https://arxiv.org/abs/2604.22979
作者: Luca Cotti,Luca Lavazza,Marco Cominelli,Liying Han,Gaofeng Dong,Francesco Gringoli,Mani B. Srivastava,Trevor Bihl,Erik P. Blasch,Daniel O. Brigham,Kara Combs,Lance M. Kaplan,Federico Cerutti
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 1 figure. Accepted at FUSION 2026

点击查看摘要

Abstract:We address Human Activity Recognition (HAR) utilizing Wi-Fi Channel State Information (CSI) under the joint requirements of causal interpretability, symbolic controllability, and direct operation on high-dimensional raw signals. Deep neural models achieve strong predictive performance on CSI-based HAR (CHAR), yet rely on continuous latent representations that are opaque and difficult to modify; purely symbolic approaches, in contrast, cannot process raw CSI streams. We propose a fully automatic and strictly decoupled pipeline in which CSI magnitude windows are compressed by a categorical variational autoencoder with Gumbel-Softmax latent variables under a capacity-controlled objective, yielding a compact discrete representation. The encoder is then frozen and used as a deterministic mapping to one-hot latent trajectories. Causal discovery is performed on these trajectories to estimate class-conditional temporal dependency graphs. Statistically supported lagged dependencies are translated into Linear Temporal Logic (LTL) rules, producing a fully symbolic and deterministic classifier based solely on rule evaluation and aggregation, without any learned discriminative head. Because rules are defined over discrete latent variables, antenna-specific rule sets can in principle be combined at the symbolic level, enabling structured multi-antenna fusion without retraining the encoder. Results from CHAR Latent Temporal Rule Extraction (CHARL-TRE) indicate competitive performance while preserving explicit temporal and causal structure, showing that deterministic symbolic classification grounded in unsupervised discrete latent representations constitutes a viable alternative to end-to-end black-box models for wireless HAR.

[AI-154] Institutions for the Post-Scarcity of Judgment

【速读】:该论文试图解决的问题是:在生成式 AI(Generative AI)技术快速发展的背景下,传统以“判断力”(judgment)为核心构建的制度体系(如法院、期刊、认证机构等)正面临功能替代危机,因为 AI 已能以接近零边际成本规模化生成看似可信的判断(如排序、归因、认证),而真正稀缺的是验证信号(verified signal)、合法性(legitimacy)、真实来源(authentic provenance)和认知委托的接受度(integration capacity)。解决方案的关键在于:将 AI 政策重新定位为制度重构(institutional redesign),构建可验证与可追溯的公共品(commons)基础设施,并发展一套面向策略性代理(strategic agents)的制度组合形式化框架,从而重塑人机协同下的新型治理结构。

链接: https://arxiv.org/abs/2604.22966
作者: Lauri Lovén
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 5 pages, 9 references. Submitted to Communications of the ACM (Opinion section). Comments welcome

点击查看摘要

Abstract:Each major technological revolution inverts a particular scarcity and rebuilds institutions around the shift. The near-consensus diagnosis of the AI revolution holds that AI collapses the cost of prediction while judgment remains scarce. This Opinion argues the inversion has now flipped: competent-looking judgment (selecting, ranking, attributing, certifying) is produced at scale and at marginal cost approaching zero, and four complements become scarce: verified signal, legitimacy, authentic provenance, and integration capacity (the community’s tolerance for delegated cognition). Because judgment is the substance of institutions, the institutions built to manufacture legitimate judgment (courts, journals, licensing bodies, legislatures) now compete with the technology for the same functional role. The piece traces the pattern across scientific institutions, professional licensing, intellectual property, democratic legitimacy, and foundation-model concentration, and closes with a three-move agenda: reframe AI policy as institutional redesign, build provenance and verification as commons, and develop the formal apparatus for institutional composition under strategic agents.

[AI-155] On the Existence of an Inverse Solution for Preference-Based Reductions in Argumentation

【速读】:该论文试图解决偏好型论证框架(Preference-based Argumentation Frameworks, PAFs)中的逆问题,即给定一个论证图、一个标记(labelling)和一种语义(semantics),判断是否存在一种偏好关系,使得通过该偏好关系对PAF进行转换后,能够产生所期望的标记结果。这一问题在偏好获取(preference elicitation)和可解释性(explainability)等领域具有重要应用价值。论文的关键解决方案在于:针对四种最广泛使用的基于偏好的归约方法,在完全语义(complete semantics)下证明该逆问题在大多数情况下可在多项式时间内求解,从而为PAFs的建模与分析提供了高效计算基础。

链接: https://arxiv.org/abs/2604.22958
作者: Alessio Zaninotto,Bruno Yun,Nir Oren,Srdjan Vesic
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 2 figures

点击查看摘要

Abstract:Preference-based argumentation frameworks (PAFs) extend Dung’s approach to abstract argumentation (AAFs) by encoding preferences over arguments. Such preferences control the transformation of attacks into defeats, and different approaches to doing so result in different reductions from a PAF to an AAF. In this paper we consider a PAF inverse problem which takes an argumentation graph, a labelling and a semantics as an input, and outputs a yes" or no" as to whether there is a preference relation between the arguments which can yield the desired labelling. This inverse problem has applications in areas including preference elicitation and explainability. We consider this problem in the context of the four most widely-used preference based reductions under the complete semantics. We show that in most cases, the problem can be answered in polynomial time.

[AI-156] Reconstructive Authority Model: Runtime Execution Validity Under Partial Observability

【速读】:该论文旨在解决自主系统在部分可观测环境下执行有效性(execution validity)保障不足的问题。现有治理机制(如可信执行环境、签名状态证明和密码学认证)仅能确保计算与状态投影的完整性(integrity),但无法保证系统动作所需的状态信息覆盖充分,导致在未观测状态存在时仍可能产生非法执行行为。解决方案的核心是提出重构权威模型(Reconstructive Authority Model, RAM),其关键创新在于将完整性(integrity)与覆盖度(coverage)解耦:RAM引入一个重构门(reconstruction gate),基于显式的覆盖包络(包括已验证状态、声明假设及已知不可观测残差)动态判断当前状态覆盖是否满足特定动作类别的要求;若覆盖不足,则动态缩小权限或强制关闭执行,从而实现对执行有效性的形式化保障。实验表明,RAM在所有覆盖水平下均实现零无效执行率,而传统基于认证的方法即使在全覆盖情况下仍存在因未定义状态处理失败引发的无效执行(IER=0.233)。

链接: https://arxiv.org/abs/2604.22898
作者: Marcelo Fernandez - TraslaIA
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: Agent Governance Series, Paper P5. Standalone resubmission following consolidation of P3+P4 into a single paper (P3/4, submit/7510907). Prior submission: submit/7501806. Companion papers on arXiv: P0 ( 2604.17511 ), P1 ( 2603.18829 ), P2 ( 2604.17517 ). Zenodo: https://doi.org/10.5281/zenodo.19669430

点击查看摘要

Abstract:Autonomous systems increasingly operate under partial observability where execution-relevant state is never fully accessible. Existing governance mechanisms – trusted execution environments, oracle-signed state proofs, cryptographic attestation – enforce the integrity of computation and state projections. We show this is structurally insufficient: an authenticated projection of state is necessary but never sufficient for execution validity. We introduce the Reconstructive Authority Model (RAM), which separates integrity from coverage. RAM defines a reconstruction gate that reasons over an explicit coverage envelope – comprising proven state, declared assumptions, and an acknowledged unobservable residual – and permits execution only when coverage is adequate for the action class. When coverage is insufficient, RAM narrows privileges dynamically or fails closed. Attestation proves trust in measurement; RAM proves adequacy of what is measured. We formalize RAM, prove necessity via two theorems (attestation insufficiency and RAM necessity) and three corollaries, and present a hybrid RAM+Attestation architecture with privilege-narrowing. Synthetic experiments (N=100,000, seed=42) show RAM achieves zero invalid execution rates at all coverage levels. Attestation-based systems exhibit IER=0.423 at low coverage and IER=0.233 even at full coverage, the latter arising from undefined-state handling failures undetectable by integrity checks alone. This reframes execution validity as a coverage reconstruction problem, distinct from and complementary to integrity guarantees provided by attestation. Comments: Agent Governance Series, Paper P5. Standalone resubmission following consolidation of P3+P4 into a single paper (P3/4, submit/7510907). Prior submission: submit/7501806. Companion papers on arXiv: P0 (2604.17511), P1 (2603.18829), P2 (2604.17517). Zenodo: https://doi.org/10.5281/zenodo.19669430 Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT) Cite as: arXiv:2604.22898 [cs.CR] (or arXiv:2604.22898v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2604.22898 Focus to learn more arXiv-issued DOI via DataCite Related DOI: https://doi.org/10.5281/zenodo.19669430 Focus to learn more DOI(s) linking to related resources Submission history From: Marcelo Fernandez [view email] [v1] Fri, 24 Apr 2026 13:19:27 UTC (274 KB)

[AI-157] Utility-Aware Data Pricing: Token-Level Quality and Empirical Training Gain for LLM s

【速读】:该论文旨在解决传统数据估值方法(基于“行数 × 质量系数”的范式)无法准确捕捉数据对大型语言模型(Large Language Model, LLM)能力贡献的非线性特性这一问题。其解决方案的关键在于提出一个三层动态数据估值框架:首先在token层面利用香农熵(Shannon entropy)和数据质量评分(Data Quality Scores)计算信息密度;其次通过影响函数(influence functions)、代理模型策略(proxy model strategies)和数据Shapley值(Data Shapley values)实证测量训练收益;最后借助哈希承诺(hash-based commitments)、默克尔树(Merkle trees)和防篡改训练日志(tamper-evident training ledger)实现密码学可验证性。该框架在指令遵循、数学推理和代码摘要三个真实场景中验证了代理模型方法能近乎完美地对齐实际效用排名,显著优于基于行数或token数的基线方法,从而为公平的数据即服务(Data-as-a-Service)经济提供依据,并确保数据市场的透明与可信。

链接: https://arxiv.org/abs/2604.22893
作者: Minghui Xu,Qi Luo,Kun Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 23 pages, 1 figure, 6 tables

点击查看摘要

Abstract:Traditional data valuation methods based on ``row-count \times quality coefficient’’ paradigms fail to capture the nuanced, nonlinear contributions that data makes to Large Language Model (LLM) capabilities. This paper presents a dynamic data valuation framework that transitions from static accounting to utility-based pricing. Our approach operates on three layers: (1) token-level information density metrics using Shannon entropy and Data Quality Scores; (2) empirical training gain measurement through influence functions, proxy model strategies, and Data Shapley values; and (3) cryptographic verifiability through hash-based commitments, Merkle trees, and a tamper-evident training ledger. We provide comprehensive experimental validation on three real domains (instruction following, mathematical reasoning, and code summarization), demonstrating that proxy-based empirical gain achieves near-perfect ranking alignment with realized utility, substantially outperforming row-count and token-count baselines. This framework enables a fair Data-as-a-Service economy where high-reasoning data is priced according to its actual contribution to model intelligence, while providing the transparency and auditability necessary for trustworthy data markets.

[AI-158] RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理中由技能中毒(skill poisoning)引发的间接注入攻击问题,此类攻击通过将恶意指令嵌入到功能合法的动作导向型技能中,使得攻击更具隐蔽性和危害性。传统基于文本内容的过滤方法难以识别此类攻击,因为恶意指令被隐藏在看似正常的技能描述内。解决方案的关键在于识别攻击诱导的内部机制——注意力劫持(attention hijacking),即模型在响应生成过程中,注意力从可信上下文转移到恶意技能片段,从而驱动有害行为。为此,作者提出RouteGuard,一种基于冻结主干网络的检测器,其核心创新在于融合响应条件注意力与隐藏状态对齐,并通过可靠性门控的晚期融合机制,实现对内部信号的敏感捕捉,从而显著提升对技能中毒攻击的检测效果,在关键的Skill-Inject通道上达到0.8834的F1分数,并恢复90.51%被词法筛查遗漏的描述类攻击。

链接: https://arxiv.org/abs/2604.22888
作者: Wenjie Xiao,Xuehai Tang,Biyu Zhou,Songlin Hu,Jizhong Han
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agent skills introduce a new and more severe form of indirect injection for LLM agents: unlike traditional indirect prompt injection, attackers can hide malicious instructions inside a dense, action-oriented skill that already functions as a legitimate instruction source. We study pre-execution skill-poison detection and show that successful skill poisoning induces a structured internal effect, attention hijacking, in which response-time attention shifts from trusted context to malicious skill spans and drives harmful behavior. Motivated by this mechanism, we propose RouteGuard, a frozen-backbone detector that combines response-conditioned attention and hidden-state alignment through reliability-gated late fusion. Across both real and synthetic open-source skill benchmarks, RouteGuard is consistently the strongest or most robust detector; on the critical Skill-Inject channel slice, it reaches 0.8834 F1 and recovers 90.51% of description attacks missed by lexical screening, showing that defending against skill poisoning requires internal-signal detection rather than text-only filtering

[AI-159] MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches

【速读】:该论文旨在解决生成式推荐(Generative Recommendation, GR)系统中因频繁编码用户长序列历史而导致的高推理成本问题,尤其是跨请求Key-Value (Key-Value, KV) 缓存复用所面临的存储爆炸挑战——单个用户状态规模巨大,远超GPU显存物理限制。解决方案的关键在于提出MTServe,一个分层缓存管理系统,通过利用主机内存(host RAM)作为可扩展的备份存储来虚拟化GPU显存;同时,为弥合不同层级间的I/O性能差距,引入了混合存储布局、异步数据传输流水线及基于局部性的替换策略,从而在保持近似完美命中率(98.5%)的同时实现最高达3.1倍的加速效果。

链接: https://arxiv.org/abs/2604.22881
作者: Xin Wang,Chi Ma,Shaobin Chen,Pu Wang,Menglei Zhou,Junyi Qiu,Qiaorui Chen,Jiayu Sun,Shijie Liu,Zehuan Wang,Lei Yu,Chuan Liu,Fei Jiang,Wei Lin,Hao Wang,Jiawei Jiang,Xiao Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative recommendation (GR) offers superior modeling capabilities but suffers from prohibitive inference costs due to the repeated encoding of long user histories. While cross-request Key-Value (KV) cache reuse presents a significant optimization opportunity, the massive scale of individual user states creates a storage explosion that far exceeds physical GPU limits. We propose MTServe, a hierarchical cache management system that virtualizes GPU memory by leveraging host RAM as a scalable backup store. To bridge the I/O gap between tiers, MTServe introduces a suite of system-level optimizations, including a hybrid storage layout, an asynchronous data transfer pipeline, and a locality-driven replacement policy. On both public and production datasets, MTServe delivers up to 3.1* speedup while maintaining near-perfect hit ratios (98.5%).

[AI-160] When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning

【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)中训练完成后部署目标发生变化时,无法重新训练策略(actor)的场景下如何实现安全、有效的策略调整问题。其核心挑战在于:在数据、成本或治理限制下,冻结已训练好的策略后,如何在部署阶段灵活适应新目标,同时避免性能崩溃。解决方案的关键在于采用专家乘积(Product-of-Experts, PoE)组合机制,结合目标条件先验(goal-conditioned prior),通过精度加权(precision-weighted)的融合方式保持对原始冻结策略的锚定效应,从而实现平滑退化而非普遍性能提升;进一步理论分析表明,在对角高斯分布假设下,PoE与KL正则化适应(KL-regularized adaptation)存在闭式等价关系,二者本质可统一为一种以冻结策略为中心的安全调控机制,适用于部署时的策略微调。

链接: https://arxiv.org/abs/2604.22873
作者: Elias Hossain,Mohammad Jahid Ibna Basher,Ivan Garibay,Ozlem Garibay,Niloofar Yousefi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Offline reinforcement learning (RL) can learn effective policies from fixed datasets, but deployment objectives may change after training, and in many applications the trained actor cannot be retrained because of data, cost, or governance constraints. We study deployment-time adaptation for frozen offline actors using Product-of-Experts (PoE) composition with a goal-conditioned prior. Our main practical finding is graceful degradation rather than universal performance gain: under degraded or random priors, precision-weighted composition remains anchored to the frozen actor, while additive and prior-only adaptation collapse, and a KL-budget selector often recovers a near-oracle operating point. We also make explicit a closed-form identity in the frozen-actor setting: for diagonal-Gaussian actors and priors, PoE with coefficient alpha yields the same deterministic policy as KL-regularized adaptation with beta = alpha / (1 - alpha), with posterior covariances differing only by a global scalar factor. Empirically, across four D4RL environments (3,900 MuJoCo episodes), we observe a 4/5/3 HELP/FROZEN/HURT split. Extending the analysis to six harder cells and two AntMaze diagnostics reveals an actor-competence ceiling: medium-expert remains HURT in all 9 cells at every tested alpha, while AntMaze with a behavior-cloned frozen actor yields zero success for all composition rules. Overall, PoE and KL-regularized adaptation are best viewed as a single actor-anchored safety mechanism for deployment-time steering.

[AI-161] owards Understanding the Expressive Power of GNNs with Global Readout

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)中消息传递聚合-组合-读出(Aggregate-Combine-Readout, ACR-GNN)架构的逻辑表达能力问题,特别是其对一阶(First-Order, FO)性质的刻画能力。研究发现,当使用求和聚合与读出机制时,ACR-GNN能够捕捉到无法被二阶计数逻辑(C2)表达的FO性质,这超越了此前依赖特定设计的聚合与读出函数的研究结果。解决方案的关键在于识别出两种自然方式可恢复对C2逻辑的精确刻画:一是限制局部聚合过程(不约束全局读出),二是将GNN运行于有界度图(无大小限制)。在这两种情形下,GNN所能表达的FO性质恰好对应于带全局计数模态的分级模态逻辑(graded modal logic with global counting modalities)中的公式。这一结果揭示了GNN表达能力的内在上下界,并指出正是聚合与读出之间无限制的交互作用推动了其表达力超越C2逻辑。

链接: https://arxiv.org/abs/2604.22870
作者: Maurice Funk,Daumantas Kojelis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: 17 pages

点击查看摘要

Abstract:We study the expressive power of message-passing aggregate-combine-readout graph neural networks (ACR-GNNs). Particularly, we focus on the first-order (FO) properties expressible by this formalism. While a tight logical characterisation remains a difficult open question, we make two contributions towards answering it. First, we show that sum aggregation and readout suffice for GNNs to capture FO properties that cannot be expressed in the logic C2 on both directed and undirected graphs. This strengthens known results by Hauke and Wał\k ega (2026) where aggregation and readout functions are specially crafted for the task. Second, we identify two natural ways of restoring characterisability (with regard to C2) for ACR-GNNs. One option is to limit local aggregation (without imposing restrictions on global readout), whilst the second is to run ACR-GNNs over graphs of bounded degree (but unbounded size). In both cases, the FO properties captured by GNNs are exactly those definable by a formula in graded modal logic with global counting modalities. Our results thus establish an innate lower- and upper-bound in terms of how far (fragments of) C2 can be taken to characterise GNNs, and imply that is indeed the unbounded interaction of aggregation and readout that pushes the logical expressive power of GNNs above C2.

[AI-162] Avionic Main Fuel Pump Simulation and Fault-Diagnosis Benchmark

【速读】:该论文旨在解决在航空等关键网络物理系统(Cyber-Physical Systems, CPS)中,由于数据保护限制和可观测性不足导致的异常检测与诊断算法训练数据匮乏的问题。解决方案的关键在于构建一个高保真、基于物理信息的共仿真平台,使用MATLAB/Simulink Simscape Fluids对典型飞机主燃油泵系统进行建模,并生成带有健康状态和故障模式标注的时间序列数据。该数据集作为基准测试工具,验证了无监督递归变分自编码器(RNN-VAE)在异常检测以及基于自组织映射的变分自编码器(SOM-VAE)在运行工况离散化方面的可行性,从而为缺乏真实故障数据的场景提供有效的训练与评估基础。

链接: https://arxiv.org/abs/2604.22869
作者: Felix Leonhard Janzen,Lukas Moddemann,Alexander Diedrich,Oliver Niggemann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:In many cyber-physical systems, especially in critical applications such as aeroplanes, data to train anomaly detection and diagnosis algorithms is lacking due to data protection issues and partial observability. To combat this inherent lack of data, we introduce a high-fidelity, physics-informed co-simulation of a common aircraft main-fuel-pump system modelled in \textscMATLAB/Simulink Simscape Fluids. We also describe its generated time-series data with health and fault mode annotations. To show feasibility of our benchmark, we apply an unsupervised Recurrent Variational Autoencoder (RNN-VAE) for anomaly detection and a SOM-VAE for operating mode discretization, trained to separate healthy and faulty conditions.

[AI-163] SwarmDrive: Semantic V2V Coordination for Latency-Constrained Cooperative Autonomous Driving

【速读】:该论文旨在解决自动驾驶中云端大语言模型(Large Language Model, LLM)推理带来的高往返延迟和对稳定网络连接的依赖问题,以及纯本地边缘模型在遮挡场景下性能下降的问题。其解决方案的关键在于提出一种语义车辆间协作框架 SwarmDrive,通过附近车辆运行本地小语言模型(Small Language Model, SLM),仅在高不确定性时共享紧凑的意图分布,并利用事件触发的一致性机制进行融合,从而在保证低延迟的同时提升决策鲁棒性。实验表明,在特定遮挡交叉路口场景下,SwarmDrive 在 6G 通信条件下将任务成功率从单个本地 SLM 的 68.9% 提升至 94.1%,同时将延迟从云端参考的 510 ms 降低至 151.4 ms。

链接: https://arxiv.org/abs/2604.22852
作者: Anjie Qiu,Donglin Wang,Zexin Fang,Sanket Partani,Hans D. Schotten
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures, submitted to VTC fall 2026 workshop on W9: 2nd Workshop on Shaping Future Connectivity: Emerging and Intelligent Technologies for Sustainable Vehicular Networks

点击查看摘要

Abstract:Cloud-hosted LLM inference for autonomous driving adds round-trip delay and depends on stable connectivity, while purely local edge models struggle under occlusion. We present SwarmDrive, a semantic Vehicle-to-Vehicle (V2V) coordination framework in which nearby vehicles run local Small Language Models (SLMs), share compact intent distributions only when uncertainty is high, and fuse them through event-triggered consensus. We evaluate SwarmDrive in a 5-seed executable study built around one occluded intersection case, combining matched operating-point comparisons with robustness sweeps. In that setting, SwarmDrive under its 6G communication setting (“Swarm 6G”) raises success from 68.9% to 94.1% over a single local SLM while reducing latency from a 510 ms cloud reference to 151.4 ms. However, an increased number of participating vehicles leads to higher communication overhead and packet loss. SwarmDrive also evaluates the impact of swarm-size, packet-loss, and entropy-threshold sweeps and shows that the cooperative gain holds across ablations and is best balanced near an active swarm size of 4 vehicles and an entropy trigger threshold of 0.65 in the current prototype. These results show that semantic edge cooperation can work under tight latency constraints in the targeted intersection case, but they are not a deployment-grade validation of a real 6G stack.

[AI-164] UGAF-ITS: A Standards Harmonization Framework and Validation Tool for Multi-Framework AI Governance in Distributed Intelligent Transportation Systems

【速读】:该论文旨在解决智能交通系统(Intelligent Transportation Systems, ITS)在部署生成式 AI 时面临的多标准治理碎片化问题,即 ISO/IEC 42001、欧盟《人工智能法案》(EU AI Act)和 NIST AI 风险管理框架在控制语义、证据要求和审计节奏上的不一致,导致分布式 ITS 架构中责任不清、合规成本高且事件溯源困难。解决方案的关键在于提出 UGAF-ITS 框架,通过五阶段交叉映射方法将来自三个标准的 154 项义务整合为 12 个统一控制项,并划分至车辆、边缘和云三层架构中执行;同时构建包含 20 个版本化证据资产的“证据骨干”,实现跨框架的单一审计包,减少重复内容并支持双向可追溯性,最终在四种典型 ITS 场景中验证了其平均 91.7% 的框架覆盖率与 45.9% 的证据冗余降低效果。

链接: https://arxiv.org/abs/2604.22789
作者: Talal Ashraf Butt,Muhammad Iqbal,Razi Iqbal
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Organizations deploying AI-enabled Intelligent Transportation Systems face fragmented governance: ISO/IEC 42001 demands a certifiable management system, the EU AI Act imposes binding high-risk obligations from August 2026, and the NIST AI Risk Management Framework structures voluntary practice. Each instrument is internally coherent, yet they drive different control vocabularies, evidence expectations, and audit rhythms. In distributed ITS deployments where vehicle manufacturers, roadside integrators, and cloud operators each hold partial evidence and partial accountability, this fragmentation multiplies compliance effort and obscures incident traceability. This paper introduces UGAF-ITS, a standards harmonization framework that consolidates 154 source obligations from the three instruments into 12 unified controls across eight governance domains through a reproducible five-phase crosswalk methodology. A three-tier operating model allocates each control to the vehicle, edge, or cloud tier where enforcement and defensible evidence production are feasible. An evidence backbone of 20 versioned artifacts supports a single audit package across all three frameworks without duplicating content. We validate UGAF-ITS through an open-source governance engine evaluated across four architecturally distinct ITS deployment scenarios. The engine encodes the complete crosswalk catalog and executes eight compliance computations. Three-tier deployments achieve 91.7% average framework coverage with 45.9% evidence reduction, complete bidirectional traceability, and 80% of artifacts serving all three frameworks simultaneously. Partial deployments degrade gracefully: coverage and reduction scale with architectural complexity. The tool, scenarios, and all reported results are publicly available for independent replication. Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.22789 [cs.CY] (or arXiv:2604.22789v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2604.22789 Focus to learn more arXiv-issued DOI via DataCite

[AI-165] Conformal PM2.5 Mapping Under Spatial Covariate Shift: Satellite-Reanalysis Fusion for Africas Green Industrial Transition

【速读】:该论文旨在解决非洲地区空气质量监测基础设施薄弱的问题,以支持绿色工业化进程和可持续发展目标(SDGs)。针对现有PM₂.₅浓度预测模型在地理泛化能力不足、不确定性量化不充分的挑战,研究提出了一种基于LightGBM与抗泄漏空间交叉验证及校准预测(conformal prediction)相结合的融合系统。其关键创新在于引入位置分组的空间交叉验证策略以避免数据泄露,并通过分位数校准预测(split conformal prediction)对预测结果提供可解释的置信区间,从而识别不同区域的预测可靠性差异(如东非地区实际覆盖概率仅为65.3%,显著低于90%目标),并据此制定区域可靠性标签和监测站优先扩展评分,精准引导空气质量管理资源向高暴露风险但缺乏监测的人群倾斜,有效支撑非洲绿色工业转型与多个SDG指标实现。

链接: https://arxiv.org/abs/2604.22787
作者: Yaw Osei Adjei(1),Davis Opoku(1),Ephraim Abotsi(1),Kwadwo Owusu Amanqua(1),Oliver Kornyo(1),Elisha Soglo-Ahianyo(1),Cephas Anertey Abbey(1) ((1) Kwame Nkrumah University of Science and Technology, Kumasi, Ghana)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 8 figures, 6 tables. Index Terms: PM2.5 mapping, conformal prediction, covariate shift, spatial cross-validation, air quality, green industrialisation, trustworthy AI, Africa

点击查看摘要

Abstract:Africa’s green industrialization imperative demands reliable infrastructure for monitoring air quality. We present a satellite-reanalysis PM2.5 fusion system trained on 2,068,901 records from 404 monitoring locations in 29 African countries (OpenAQ, 2017-2022), combining LightGBM with leakage-resistant spatial cross-validation and conformal prediction to quantify predictions and their geographic applicability limits. Under 5-fold location-grouped spatial cross-validation, LightGBM achieves RMSE = 30.83 +/- 5.07 ug/m3, MAE = 14.54 +/- 1.66 ug/m3, R2 = 0.134 +/- 0.023, and macro F1 = 0.336 +/- 0.018. This R2 is substantially below random-split benchmarks (0.90) but reflects true geographic generalisation difficulty rather than model failure. Split conformal prediction targeting 90% marginal coverage reveals severe East Africa degradation (actual PICP = 65.3% vs. nominal 90%), consistent with medium-strength covariate shift (humidity KS = 0.2237, sat_pblh KS = 0.2558). We operationalise these findings through regional reliability flags (High/Medium/Low/Unreliable) and a monitor prioritisation score directing infrastructure expansion toward highest-burden unmonitored populations, directly supporting Africa’s green industrial transition and SDGs 3.9, 7.1.2, 9, 11.6.2, and 13.

[AI-166] Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation

【速读】:该论文旨在解决参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法在实际部署中仍面临内存效率不足的问题,尤其是当序列长度增加时,中间激活张量(intermediate tensors)的线性增长会导致设备端(on-device)内存溢出(out-of-memory errors)。尽管LoRA和IA3等主流PEFT方法显著减少了可训练参数,但它们并未有效控制激活张量带来的内存开销。论文提出LARS(Low-memory Activation-Rank Subspace)框架,其核心创新在于将低秩约束从模型参数转移到训练过程中使用的激活子空间(activation subspace),从而直接针对内存消耗的主要来源进行优化,从根本上降低内存增长速率。这一机制使LARS在GPU和CPU上分别实现平均33.54%和51.95%的内存减少,同时保持与LoRA相当的准确率和吞吐量,并成功部署于树莓派(Raspberry Pi)和消费级CPU,验证了其在资源受限边缘设备上的可扩展性。

链接: https://arxiv.org/abs/2604.22783
作者: Irene Tenison,Stella Ahn,Miriam Kim,Ebtisam Alshehri,Lalana Kagal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning (PEFT) has become the standard for adapting large language models (LLMs). In this work we challenge the wide-spread assumption that parameter efficiency equates memory efficiency and on-device adaptability. We show that this is not true - while methods like LoRA and IA3 significantly reduce trainable parameters, they remain bound by intermediate tensors that scale linearly with sequence length, often triggering out-of-memory errors on-device. In this work, we introduce LARS (Low-memory Activation-Rank Subspace), a novel adaptation framework that decouples memory consumption from sequence length. While prior PEFT methods apply low-rank constraints to model parameters, LARS instead constrains the activation subspace used during training, directly targeting the dominant source of memory consumption and fundamentally flattening the memory growth rate. LARS reduces the memory footprint by an average of 33.54% on GPUs and 51.95% on CPUs in comparison to LoRA across reasoning, understanding and long-context datasets using different models while maintaining competitive accuracy and throughput. Besides GPUs, we deploy on Raspberry Pi and consumer-grade CPUs to demonstrate that LARS provides a scalable path for sophisticated LLM personalization on resource-constrained hardware and edge devices.

[AI-167] Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

【速读】:该论文旨在解决生成式 AI(Generative AI)在服务Transformer语言模型时因Key-Values(KVs)缓存带来的高内存占用问题,从而降低推理成本。其核心挑战在于如何在不损失性能的前提下减少KV缓存的内存开销。解决方案的关键在于提出一种基于随机跨层注意力(random cross-layer attention)的训练方法:在训练过程中,模型层随机选择关注自身或前一层的KV状态,使模型对不同深度维度上的缓存共享策略具备鲁棒性,从而在部署时可根据硬件约束灵活采用跨层缓存共享,实现无信息丢失的高效优化。

链接: https://arxiv.org/abs/2604.22782
作者: Anastasiia Filippova,David Grangier,Marco Cuturi,João Monteiro
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Serving transformer language models with high throughput requires caching Key-Values (KVs) to avoid redundant computation during autoregressive generation. The memory footprint of KV caching is significant and heavily impacts serving costs. This work proposes to lessen these memory requirements. While recent work has largely addressed KV cache reduction via compression and eviction along the temporal axis, we argue that the \emphdepth dimension offers an orthogonal and robust avenue for optimization. Although prior research suggests that a full cache for every layer is redundant, implementing cross-layer cache sharing remains a practical challenge; existing methods typically suffer from reduced throughput or increased time-to-first-token. In this paper, we demonstrate that dropping a layer’s cache offers efficient optimization without information loss. We propose a simple training approach: random cross-layer attention. During training, layers randomly choose to attend either to their own KV states or those of a preceding layer. This stochastic process adapts the model to be robust to various depth-wise cache sharing strategies, ensuring flexibility for unknown hardware constraints at deployment time. Our evaluations show that applying this scheme during pre-training or fine-tuning enables depth-wise cache sharing for various model families. Furthermore, for larger models in data-constrained settings, this approach is suggestive of a regularization-like effect, frequently preserving or improving performance while significantly reducing the cache’s memory footprint.

[AI-168] BiTA: Bidirectional Gated Recurrent Unit-Transformer Aggregator in a Temporal Graph Network Framework for Alert Prediction in Computer Networks

【速读】:该论文旨在解决现有基于时序图神经网络(Temporal Graph Neural Networks, TGNs)的主动告警预测方法在捕捉复杂攻击行为中递归性与多尺度时间模式方面的局限性问题。当前方法主要依赖单向或单一机制的时间聚合策略,难以充分建模节点邻域内长期上下文关系和双向序列依赖。其解决方案的关键在于提出BiTA(Bidirectional Gated Recurrent Unit-Transformer Aggregator),通过重构TGN中的时间聚合函数,联合编码每个节点时间邻域内的双向序列依赖和长程上下文关联,从而实现不同时间尺度上的互补性时序推理,同时保持原有TGN的记忆存储与消息传递结构不变。该设计显著提升了模型对真实世界网络攻击行为的感知能力与预测精度。

链接: https://arxiv.org/abs/2604.22781
作者: Zahra Makki Nayeri,Mohsen Rezvani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Proactive alert prediction in computer networks is critical for mitigating evolving cyber threats and enabling timely defensive actions. Temporal Graph Neural Networks (TGNs) provide a principled framework for modeling time-evolving interactions; however, existing TGN-based methods predominantly rely on unidirectional or single-mechanism temporal aggregation, which limits their ability to capture recursive, multi-scale temporal patterns commonly observed in real-world attack behaviors. In this paper, we propose BiTA, a Bidirectional Gated Recurrent Unit-Transformer Aggregator for temporal graph learning. Rather than introducing a deeper or higher-capacity model, BiTA redesigns the temporal aggregation function within the TGN framework by jointly encoding bidirectional sequential dependencies and long-range contextual relations over each node’s temporal neighborhood. This aggregation strategy enables complementary temporal reasoning at different scales while preserving the original TGN memory and message-passing structure. We evaluate BiTA on real-world alert datasets, demonstrating significant improvements in key performance metrics such as area under the curve, average precision, mean reciprocal rank, and per-category prediction accuracy when compared to state-of-the-art temporal graph models. BiTA outperforms baseline methods under both transductive and inductive settings, highlighting its robustness and generalization capabilities in dynamic network environments. BiTA is a scalable and interpretable framework for real-time cyber threat anticipation, paving the way toward more intelligent and adaptive intrusion detection systems.

[AI-169] he Spectral Lifecycle of Transformer Training: Transient Compression Waves Persistent Spectral Gradients and the Q/K–V Asymmetry

【速读】:该论文旨在揭示Transformer模型在预训练过程中权重矩阵奇异值谱(singular value spectra)的动态演化规律,以理解模型内部表征学习与结构压缩之间的关系。其核心问题是:权重矩阵的秩(rank)和奇异值分布形状(spectral shape)如何随训练深度变化,并各自反映何种训练信息?解决方案的关键在于通过系统性追踪不同规模模型(30M–285M参数)中每个权重矩阵的完整奇异值分解(SVD),发现三个关键现象:(1)瞬态压缩波(Transient Compression Waves)——稳定秩压缩沿网络深度传播并反转;(2)持久谱梯度(Persistent Spectral Gradients)——功率律指数α形成非单调倒U型分布且随深度迁移;(3)Q/K–V功能不对称性——查询/键投影承载深度依赖动力学而值/输出投影均匀压缩。这些发现表明,秩与谱形编码了训练中不同的信息,由此构建两时间尺度动力学模型并推导出缩放律(Δα ∝ L^0.26, R²=0.99),最终验证谱特征可有效预测层重要性并指导更优剪枝策略(性能提升达1.1×–3.6×)。

链接: https://arxiv.org/abs/2604.22778
作者: Yi Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present the first systematic study of weight matrix singular value spectra \emphduring transformer pretraining, tracking full SVD decompositions of every weight matrix at 25-step intervals across three model scales (30M–285M parameters). We discover three phenomena: \textbf(1)~Transient Compression Waves: stable rank compression propagates as a traveling wave from early to late layers, creating a dramatic gradient that peaks early then \emphreverses – late layers eventually over-compress past early layers. \textbf(2)~Persistent Spectral Gradients: the power-law exponent~ \alpha develops a permanent depth gradient forming a non-monotonic inverted-U in deeper models, with peaks shifting toward earlier layers as depth increases. \textbf(3)~Q/K–V Functional Asymmetry: value/output projections compress uniformly while query/key projections carry the full depth-dependent dynamics. The dissociation between transient compression and persistent spectral shape reveals that \emphrank and spectral shape encode fundamentally different information about training. We formalize this as a two-timescale dynamical model and derive scaling laws ( \Delta\alpha \propto L^0.26 , R^2=0.99 ). We validate on nine models across three families (custom, GPT-2, Pythia; 30M–1B parameters; 8–36 layers), demonstrate that \alpha predicts layer importance ( \rho=0.69 – 0.84 , p0.02 ), and show that spectral-guided pruning outperforms Last-N heuristics by 1.1\times – 3.6\times across seven models in two families (GPT-2 124M–774M, Pythia 160M–1B), with worst-vs-best gaps up to 23.7\times confirming the causal role of spectral structure.

[AI-170] An Intelligent Fault Diagnosis Method for General Aviation Aircraft Based on Multi-Fidelity Digital Twin and FMEA Knowledge Enhancement

【速读】:该论文旨在解决通用航空器故障诊断中面临的三大挑战:真实故障数据稀缺、故障类型多样以及故障特征微弱的问题。其核心解决方案是构建一个基于多保真度数字孪生(multi-fidelity digital twin)的智能故障诊断框架,关键在于通过高保真飞行力学仿真与FMEA驱动的故障注入机制生成高质量残差特征,并结合GRU代理模型实现低延迟在线计算,从而在保证诊断精度的同时显著提升推理效率。具体而言,该方案提出配对镜像残差与GRU替代预测残差相结合的多保真度残差计算框架,在高保真路径中利用相同初始条件下的理想轨迹提取纯净故障偏差信号,而在低保真路径中采用多步预测GRU代理模型实现4.3倍加速且仅损失0.6%性能;实验表明,残差特征质量对诊断性能的影响约为分类器架构的5倍,确立了“残差质量优先”的设计原则。

链接: https://arxiv.org/abs/2604.22777
作者: Zhihuan Wei,Yang Hu,Xinhang Chen,Yiming Zhang,Jie Liu,Wei Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Fault diagnosis of general aviation aircraft faces challenges including scarce real fault data, diverse fault types, and weak fault signatures. This paper proposes an intelligent fault diagnosis framework based on multi-fidelity digital twin, integrating four modules: high-fidelity flight dynamics simulation, FMEA-driven fault injection, multi-fidelity residual feature extraction, and large language model (LLM)-enhanced interpretable report generation. A digital twin is constructed using the JSBSim six-degree-of-freedom (6-DoF) flight dynamics engine, generating 23-channel engine health monitoring data via semi-empirical sensor synthesis equations. A three-layer fault injection engine based on failure mode and effects analysis (FMEA) models the physical causal propagation of 19 engine fault types. A multi-fidelity residual computation framework comprising paired-mirror residuals and GRU surrogate prediction residuals is proposed: the high-fidelity path obtains clean fault deviation signals using nominal mirror trajectories with identical initial conditions, while the low-fidelity path achieves online real-time residual computation through a multi-step prediction GRU surrogate model. A 1D-CNN classifier performs end-to-end diagnosis of 20 fault classes. An LLM diagnostic report engine enhanced with FMEA knowledge fuses classification results, residual evidence, and domain causal knowledge to generate interpretable natural language reports. Experiments show the paired-mirror residual scheme achieves a Macro-F1 of 96.2% on the 20-class task, while the GRU surrogate scheme achieves 4.3x inference acceleration at only 0.6% performance cost. Comparison across 24 schemes reveals that residual feature quality contributes approximately 5x more to diagnostic performance than classifier architecture, establishing the “residual quality first” design principle.

[AI-171] Epicure: Multidimensional Flavor Structure in Food Ingredient Embeddings

【速读】:该论文旨在解决烹饪实践中隐性知识(tacit knowledge)难以形式化表达的问题,特别是关于风味(flavor)、质地(texture)和文化认同等要素的内在认知。解决方案的关键在于利用FlavorGraph中基于食谱共现和食品化学训练得到的300维原料嵌入(ingredient embeddings),系统性地提取并重构这些隐性知识;通过引入大语言模型(LLM)增强的整理流程,将6,653个原始原料条目精炼为1,032个规范条目,显著增强了可恢复的知识结构,并识别出至少十五个独立可分类的维度,涵盖味道、质地、地理、加工方式及文化等多个方面。

链接: https://arxiv.org/abs/2604.22776
作者: Jakub Radzikowski,Josef Chen
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A chef’s intuition about flavor, texture, and cultural identity represents tacit knowledge that is difficult to articulate yet central to culinary practice. We show that this knowledge is already encoded in FlavorGraph’s 300-dimensional ingredient embeddings, trained on recipe cooccurrence and food chemistry, and that it can be systematically recovered. An LLM-augmented curation pipeline consolidates 6,653 raw FlavorGraph ingredients into 1,032 canonical entries, substantially strengthening the recoverable structure. We identify at least fifteen independently classifiable dimensions spanning taste, texture, geography, food processing, and culture.

[AI-172] Artificial General Intelligence Forecasting and Scenario Analysis: State of the Field Methodological Gaps and Strategic Implications

【速读】:该论文旨在解决如何在深度不确定性条件下对人工通用智能(Artificial General Intelligence, AGI)的到达时间进行可靠预测的问题。其核心挑战在于现有预测方法存在显著局限性,难以应对AGI发展路径的高度不确定性和复杂性。解决方案的关键在于构建一个更具鲁棒性的预测框架,通过整合多种预测方法、识别其内在假设与误差来源,并提出系统性的研究议程以改进预测基础设施;同时,论文采用人机协同的迭代式工作流程——由大型语言模型(如GPT 5.1、Gemini 3 Pro和Claude 4.5 Opus)完成初稿撰写,人类研究人员负责方向把控、同行评审、事实核查与修订,从而提升预测结果的可信度与实用性。

链接: https://arxiv.org/abs/2604.22766
作者: Gopal P. Sarma,Sunny D. Bhatt,Michael Jacob,Rachel Steratore
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 75 pages, 1 figure

点击查看摘要

Abstract:In this report, we review the current state of methodologies to forecast the arrival of artificial general intelligence, assess their reliability, and analyze the implications for strategy and policy. We synthesize diverse forecasting approaches, document significant limitations in existing methods, and propose a research agenda for developing more-robust forecasting infrastructure. The report does not endorse a specific forecast or scenario but rather provides a framework for interpreting forecasts under conditions of deep uncertainty. We experimented with an iterative approach to human and artificial intelligence collaboration for this report. The primary drafting of the text was performed by large language models (GPT 5.1, Gemini 3 Pro, and Claude 4.5 Opus), with human researchers providing direction, peer review, fact-checking, and revision.

[AI-173] Algorithmic Administration and the EU AI Act: Legal Principles for Public Sector Use of AI

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)在公共部门部署过程中如何与行政法基本原则(如行政裁量权、说明理由义务和比例原则)相协调的问题,尤其关注高风险AI系统在社会福利、移民、教育和执法等敏感领域应用时的合法性与问责性。其解决方案的关键在于分析欧盟《人工智能法案》(EU AI Act)对公共部门部署者的监管义务,并提出通过强化透明度、可审查性和适当的风险分级机制来确保AI在公共决策中的伦理合规与法律正当性,同时借助解释策略保障比例原则的实现。

链接: https://arxiv.org/abs/2604.22765
作者: Georgios Pavlidis,Ioannis Kastanas
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The increasing use of artificial intelligence (AI) by public authorities introduces both opportunities for innovation and significant challenges for the administrative rule of law. This article examines how the EU AI Act interacts with the fundamental principles of administrative law, with a particular focus on administrative discretion, the duty to state reasons, and proportionality. It analyses the regulatory obligations imposed by the AI Act on public sector deployers of high-risk systems, especially in sensitive domains such as social benefits, migration, education, and law enforcement. It also explores whether the AI Act adequately ensures accountability, transparency, and reviewability in automated public decision-making. The article further considers how the AI Act’s risk-based approach aligns (or fails to align) with the principle of proportionality and it proposes safeguards and interpretative strategies to ensure the ethical and lawful deployment of AI in the public sector.

[AI-174] ECoLAD: Deployment-Oriented Evaluation for Automotive Time-Series Anomaly Detection

【速读】:该论文旨在解决现有时间序列异常检测方法在评估时仅关注准确率、忽视部署约束(如有限计算资源和可预测延迟)所导致的可行性误判问题,尤其针对车载监控等嵌入式场景中对稳定性能与低延迟的严格要求。其解决方案的关键在于提出ECoLAD(Efficiency Compute Ladder for Anomaly Detection),一个面向部署的评估协议:通过机械确定的整数缩放规则和显式的CPU线程上限,在异构检测器家族中施加单调的计算资源削减阶梯,并记录每次配置变更;同时以目标评分速率为参数进行吞吐量约束下的行为分析,量化覆盖度(满足目标的实体比例)与可达的最佳AUC-PR值,从而更真实地反映不同方法在实际受限环境中的可行性边界。

链接: https://arxiv.org/abs/2603.10926
作者: Kadir-Kaan Özer,René Ebeling,Markus Enzweiler
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Time-series anomaly detectors are commonly compared on workstation-class hardware under unconstrained execution. In-vehicle monitoring, however, requires predictable latency and stable behavior under limited CPU parallelism. Accuracy-only leaderboards can therefore misrepresent which methods remain feasible under deployment-relevant constraints. We present ECoLAD (Efficiency Compute Ladder for Anomaly Detection), a deployment-oriented evaluation protocol instantiated as an empirical study on proprietary automotive telemetry (anomaly rate \approx 0.022) and complementary public benchmarks. ECoLAD applies a monotone compute-reduction ladder across heterogeneous detector families using mechanically determined, integer-only scaling rules and explicit CPU thread caps, while logging every applied configuration change. Throughput-constrained behavior is characterized by sweeping target scoring rates and reporting (i) coverage (the fraction of entities meeting the target) and (ii) the best AUC-PR achievable among measured ladder configurations satisfying the target. On constrained automotive telemetry, lightweight classical detectors sustain both coverage and detection lift above the random baseline across the full throughput sweep. Several deep methods lose feasibility before they lose accuracy. Comments: 6 pages, 3 figures, 5 tables Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.10926 [cs.LG] (or arXiv:2603.10926v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2603.10926 Focus to learn more arXiv-issued DOI via DataCite

[AI-175] Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training

【速读】:该论文旨在解决地理分布式机器学习(ML)训练中的两个核心问题:一是多区域云资源的弹性调度缺失,导致资源利用率低和训练性能受限;二是广域网(WAN)通信成为主要瓶颈,受带宽限制和高波动性影响显著。解决方案的关键在于提出一个名为Cloudless-Training的框架,其核心创新包括:(1) 采用控制平面与物理训练平面分离的两层架构,实现基于Serverless的弹性资源调度;(2) 提出异步SGD梯度累积(ASGD-GA)和跨参数服务器(PS)模型平均(MA)两种新型同步策略,有效提升跨区域训练分区间的通信效率与收敛稳定性。该方案在腾讯云上基于OpenFaaS实现并验证,实验表明可显著降低训练成本(最多减少24.0%)并提升训练速度(最高达1.7倍加速比),同时保证模型正确性。

链接: https://arxiv.org/abs/2303.05330
作者: Wenting Tan,Xiao Shi1,Cunchi Lv,Xiaofang Zhao
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 11 pages, 11 figures

点击查看摘要

Abstract:Geo-distributed ML training can benefit many emerging ML scenarios (e.g., large model training, federated learning) with multi-regional cloud resources and wide area network. However, its efficiency is limited due to 2 challenges. First, efficient elastic scheduling of multi-regional cloud resources is usually missing, affecting resource utilization and performance of training. Second, training communication on WAN is still the main overhead, easily subjected to low bandwidth and high fluctuations of WAN. In this paper, we propose a framework, Cloudless-Training, to realize efficient PS-based geo-distributed ML training in 3 aspects. First, it uses a two-layer architecture with control and physical training planes to support elastic scheduling and communication for multi-regional clouds in a serverless this http URL, it provides an elastic scheduling strategy that can deploy training workflows adaptively according to the heterogeneity of available cloud resources and distribution of pre-existing training datasets. Third, it provides 2 new synchronization strategies for training partitions among clouds, including asynchronous SGD with gradient accumulation (ASGD-GA) and inter-PS model averaging (MA). It is implemented with OpenFaaS and evaluated on Tencent Cloud. Experiments show that Cloudless-Training can support general ML training in a geo-distributed way, greatly improve resource utilization (e.g., 9.2%-24.0% training cost reduction) and synchronization efficiency (e.g., 1.7x training speedup over baseline at most) with model correctness guarantees.

[AI-176] Information bottleneck for learning the phase space of dynamics from high-dimensional experimental data

【速读】:该论文旨在解决从高维观测数据中识别系统动力学状态变量的问题,这在物理科学领域是一个核心挑战,因为状态变量通常不可直接观测,需从原始高维数据中无监督地推断出来。解决方案的关键在于提出了一种名为DySIB(Dynamical Symmetric Information Bottleneck)的方法,其通过在潜在空间中最大化过去与未来观测窗口之间的预测互信息(predictive mutual information),同时对表示复杂度进行惩罚,从而学习低维时间序列表示。该方法不依赖于观测的重建,而是直接优化潜在空间中的预测能力,实验证明其能准确恢复已知维度、拓扑和几何结构的动力学坐标,例如物理摆的相空间坐标(角度与角速度)。

链接: https://arxiv.org/abs/2604.24662
作者: K. Michael Martini,Eslam Abdelaleem,Paarth Gulati,Ilya Nemenman
机构: 未知
类目: Data Analysis, Statistics and Probability (physics.data-an); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: 12 pages including references, 7 figures, 4 appendix pages with 4 appendix figures

点击查看摘要

Abstract:Identifying the dynamical state variables of a system from high-dimensional observations is a central problem across physical sciences. The challenge is that the state variables are not directly observable and must be inferred from raw high-dimensional data without supervision. Here we introduce DySIB (Dynamical Symmetric Information Bottleneck) as a method to learn low-dimensional representations of time-series data by maximizing predictive mutual information between past and future observation windows while penalizing representation complexity. This objective operates entirely in latent space and avoids reconstruction of the observations. We apply DySIB to an experimental video dataset of a physical pendulum, where the underlying state space is known. The method, with hyperparameters of the learning architecture set self-consistently by the data, recovers a two-dimensional representation that matches the dimensionality, topology, and geometry of the pendulum phase space, with the learned coordinates aligning smoothly with the canonical angle and angular velocity. These results demonstrate, on a well-characterized experimental system, that predictive information in latent space can be used to recover interpretable dynamical coordinates directly from high-dimensional data.

[AI-177] Quantum Kernel Advantage over Classical Collapse in Medical Foundation Model Embeddings

【速读】:该论文旨在解决医疗图像分类中少数类样本识别性能不足的问题,特别是在保险分类任务中提升对低频类别的判别能力。其解决方案的关键在于利用量子支持向量机(QSVM)结合从医学基础模型(如MedSigLIP-448、RAD-DINO和ViT-patch32)提取的冻结嵌入(frozen embeddings),构建具有高有效秩(effective rank)的量子核函数。通过两层公平比较框架验证:在未调参条件下,QSVM在所有18种配置下均显著优于经典线性支持向量机(linear SVM),尤其在q=11时实现F1分数从0.050提升至0.343(p < 0.001);且相较于调参后的RBF-SVM,QSVM在7个配置中也保持优势(平均增益+0.068)。量子核的有效秩在q=11时达69.80,远超线性核,表明其能更有效地捕捉复杂特征结构,从而缓解经典核方法因维度坍缩导致的多数类主导问题。

链接: https://arxiv.org/abs/2604.24597
作者: Sebastian Cajas Ordóñez,Felipe Ocampo Osorio,Dax Enshan Koh,Rafi Al Attrach,Aldo Marzullo,Ariel Guerra-Adames,J. Alejandro Andrade,Siong Thye Goh,Chi-Yu Chen,Rahul Gorijavolu,Xue Yang,Noah Dane Hebdon,Leo Anthony Celi
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We provide evidence of quantum kernel advantage under noiseless simulation in binary insurance classification on MIMIC-CXR chest radiographs using quantum support vector machines (QSVM) with frozen embeddings from three medical foundation models (MedSigLIP-448, RAD-DINO, ViT-patch32). We propose a two-tier fair comparison framework in which both classifiers receive identical PCA-q features. At Tier 1 (untuned QSVM vs. untuned linear SVM, C = 1 both sides), QSVM wins minority-class F1 in all 18 tested configurations (17 at p 0.001, 1 at p 0.01). The classical linear kernel collapses to majority-class prediction on 90-100% of seeds at every qubit count, while QSVM maintains non-trivial recall. At q = 11 (MedSigLIP-448 plateau center), QSVM achieves mean F1 = 0.343 vs. classical F1 = 0.050 (F1 gain = +0.293, p 0.001) without hyperparameter tuning. Under Tier 2 (untuned QSVM vs. C-tuned RBF SVM), QSVM wins all seven tested configurations (mean gain +0.068, max +0.112). Eigenspectrum analysis reveals quantum kernel effective rank reaches 69.80 at q = 11, far exceeding linear kernel rank, while classical collapse remains C-invariant. A full qubit sweep reveals architecture-dependent concentration onset across models. Code: this https URL

[AI-178] BandRouteNet: An Adaptive Band Routing Neural Network for EEG Artifact Removal

【速读】:该论文旨在解决脑电图(Electroencephalography, EEG)信号易受眼电(Electrooculographic, EOG)和肌电(Electromyographic, EMG)等伪迹干扰的问题,这些问题会显著降低信号质量并影响神经诊断与脑机接口(Brain-Computer Interfaces, BCIs)等应用的可靠性。传统方法难以应对不同伪迹源在时域分布和频谱特性上的多样性与动态变化。为此,作者提出BandRouteNet——一种自适应频率感知的神经网络架构,其核心创新在于:通过频带特异性处理显式建模频率依赖的伪迹模式,并引入路由机制以自适应地决定每个频带内各时间点的去噪强度;同时,设计全频带条件器提取全局时间上下文信息,用于调节频带路径并提供粗粒度信号重构补充。该方案实现了高精度去噪与参数效率的统一,在EEGDenoiseNet基准数据集上优于现有方法,尤其在EOG、EMG及混合伪迹场景下表现突出。

链接: https://arxiv.org/abs/2604.24428
作者: Phat Lam
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: 8 pages, 8 figures

点击查看摘要

Abstract:Electroencephalography (EEG) is highly susceptible to artifact contamination, such as electrooculographic (EOG) and electromyographic (EMG) interference, which severely degrades signal quality and hinders reliable interpretation in applications including neurological diagnosis, brain-computer interfaces (BCIs), etc. Effective EEG denoising remains challenging because different artifact sources exhibit diverse and temporally varying distributions, together with distinct spectral characteristics across frequency bands. To address these issues, we propose BandRouteNet, an adaptive frequency-aware neural network for EEG denoising that jointly exploits band-specific processing and full-band contextual modeling. The proposed model performs band-wise denoising to explicitly capture frequency-dependent artifact patterns. Within this framework, we introduce a routing mechanism that adaptively determines where and to what extent denoising should be applied across temporal locations within each frequency band. In parallel, a full-band conditioner directly processes the original noisy EEG to extract global temporal context, producing both conditional parameters for modulating the band-wise pathway and a coarse-grained signal-level refinement to supplement the final reconstruction. Extensive experiments on the EEGDenoiseNet benchmark dataset demonstrate that BandRouteNet outperforms other methods under EOG, EMG, and mixed-artifact conditions in terms of Relative Root Mean Square Error (RRMSE) and Signal-to-Noise Ratio Improvement (SNR _\textimp ) under unified experimental settings, while remaining highly parameter-efficient with only 0.2M trainable parameters. These results highlight its strong potential for high-performance EEG artifact removal in resource-constrained applications.

[AI-179] Do Quantum Transformers Help? A Systematic VQC Architecture Comparison on Tabular Benchmarks

【速读】:该论文旨在解决近期内存量子设备上变分量子电路(Variational Quantum Circuits, VQCs)架构选择问题,特别是如何在经典表格数据上实现最优的准确率与参数复杂度之间的权衡。其关键解决方案是通过系统性实证比较四种VQC家族——多层全连接(FC-VQC)、残差结构(ResNet-VQC)、混合量子-经典Transformer(QT)以及全量子Transformer(FQT),发现FC-VQC在保持高精度的同时显著减少参数量(比基于注意力机制的VQC少40–50%),且在多个回归和分类基准上优于同等容量的经典多层感知机(MLP)。此外,研究揭示了电路深度约3时表达能力已饱和,说明浅层VQC即可有效覆盖希尔伯特空间;同时指出显式量子自注意力在多数数据集上仅带来边际增益但大幅增加参数,而类型4的块间连接结构可部分模拟注意力机制的作用,为实际部署提供了明确的架构设计指导。

链接: https://arxiv.org/abs/2604.23931
作者: Chi-Sheng Chen,En-Jui Kuo
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Variational quantum circuits (VQCs) are a leading approach to quantum machine learning on near-term devices, yet it remains unclear which circuit architecture yields the best accuracy-parameter trade-off on classical tabular data. We present a systematic empirical comparison of four VQC families – multi-layer fully-connected (FC-VQC), residual (ResNet-VQC), hybrid quantum-classical transformer (QT), and fully quantum transformer (FQT) – across five regression and classification benchmarks. Our key findings are: \textbf(i)~FC-VQCs achieve 90-96% of the R^2 of attention-based VQCs while using 40-50% fewer parameters, and consistently outperform equal-capacity MLPs (mean R^2=0.829 vs.\ MLP _720 's 0.753 on Boston Housing, 3-seed average); \textbf(ii)~FC-VQC’s Type~4 inter-block connectivity provides partial cross-token mixing that approximates the role of attention – explicit quantum self-attention yields only marginal gains on most datasets while significantly increasing parameter count; \textbf(iii)~expressibility saturates at circuit depth~ \approx,3 , explaining why shallow VQCs already cover the Hilbert space effectively; \textbf(iv)~LayerNorm on the fully quantum transformer improves classification accuracy, suggesting normalization is important when all operations are quantum; \textbf(v)~in our noise study on Boston Housing, FQT degrades gracefully under depolarizing noise while QT collapses. All results are validated across three random seeds. These findings provide practical architectural guidance for deploying VQCs on near-term quantum hardware.

[AI-180] Quasi-Quadratic Gradient: A New Direction for Accelerating the BFGS Method in Quasi-Newton Optimization

【速读】:该论文旨在解决经典BFGS(拟牛顿法)在优化过程中收敛速度较慢的问题。其解决方案的关键在于提出一种新的搜索方向——准二次梯度(Quasi-Quadratic Gradient, QQG),该方向通过将Hessian矩阵的逆近似与当前梯度相乘,显式利用局部二阶曲率信息来修正搜索路径,从而在保持计算效率的同时显著提升收敛速度。

链接: https://arxiv.org/abs/2604.23922
作者: John Chiang
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we introduce the Quasi-Quadratic Gradient (QQG), a novel search direction designed to accelerate the BFGS method within the quasi-Newton framework. By defining the QQG as the product of the inverse Hessian approximation and the current gradient, we explicitly leverage local second-order curvature to rectify the search path. Theoretical analysis and empirical results demonstrate that our approach significantly outperforms vanilla BFGS in convergence speed while maintaining computational efficiency.

[AI-181] Generative Synthetic Data for Causal Inference: Pitfalls Remedies and Opportunities

【速读】:该论文旨在解决生成式合成数据在因果推断中应用时的核心问题:尽管合成数据能够较好地保留预测性能,但其对因果效应估计(如平均处理效应,ATE)的保真度往往严重受损。作者通过理论分析揭示了ATE保持的关键在于同时控制生成变量的协变量分布(covariate law)和结果回归中的处理效应对比(treatment-effect contrast),并指出传统全生成模型(如GAN或大语言模型驱动的合成器)难以满足这一要求。解决方案的关键是提出一种混合合成数据框架,将协变量与处理机制和结果机制分离建模,利用最近邻距离诊断监控协变量合成质量,并通过独立学习的扰动模型构建(W, A, Y)三元组;此外,还引入针对性合成增强策略以改善正则性(positivity)问题,从而在有限样本下提升ATE估计的稳健性。

链接: https://arxiv.org/abs/2604.23904
作者: Yichen Xu
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Synthetic data offers a promising tool for privacy-preserving data release, augmentation, and simulation, but its use in causal inference requires preserving more than predictive fidelity. We show that fully generative tabular synthesizers, including GAN- and LLM-based models, can achieve strong train-on-synthetic-test-on-real performance while substantially distorting causal estimands such as the average treatment effect (ATE). We formalize this failure through sensitivity and tradeoff results showing that ATE preservation requires control of both the generated covariate law and the treatment-effect contrast in the outcome regression. Motivated by this observation, we propose a hybrid synthetic-data framework that generates covariates separately from the treatment and outcome mechanisms, using distance-to-closest-record diagnostics to monitor covariate synthesis and separately learned nuisance models to construct (W, A, Y) triplets. We further study targeted synthetic augmentation for practical positivity problems and characterize when added overlap support helps by improving conditional-effect estimation more than it shifts the covariate distribution. Finally, we develop a synthetic simulation engine for pre-analysis estimator evaluation, enabling finite-sample comparison of OR, IPW, AIPW, and TMLE under realistic covariate structure. Across experiments, hybrid synthetic data substantially improve ATE preservation relative to fully generative baselines and provide a practical diagnostic tool for robust causal analysis.

[AI-182] A Milestone in Formalization: The Sphere Packing Problem in Dimension 8

【速读】:该论文旨在解决高维空间中的球体堆积问题(sphere packing problem),具体聚焦于8维空间中球体排列的最优密度问题。其解决方案的关键在于利用模形式(modular forms)构造出一个“神奇函数”(‘magic’ function),该函数满足Cohn与Elkies在2003年提出的最优性条件,从而严格证明了8维空间中E₈格是最优球体堆积方式。这一突破性成果在2016年由Viazovska完成,而本文则聚焦于将该证明过程通过Lean定理证明器进行形式化验证,并借助自动形式化模型Gauss实现了最终阶段的自动化验证,标志着人类数学智慧与生成式AI(Generative AI)协同协作的新范式。

链接: https://arxiv.org/abs/2604.23468
作者: Sidharth Hariharan,Christopher Birkbeck,Seewoo Lee,Ho Kiu Gareth Ma,Bhavik Mehta,Auguste Poiroux,Maryna Viazovska
机构: 未知
类目: Metric Geometry (math.MG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Number Theory (math.NT)
备注: 8 pages

点击查看摘要

Abstract:In 2016, Viazovska famously solved the sphere packing problem in dimension 8 , using modular forms to construct a ‘magic’ function satisfying optimality conditions determined by Cohn and Elkies in 2003. In March 2024, Hariharan and Viazovska launched a project to formalize this solution and related mathematical facts in the Lean Theorem Prover. A significant milestone was achieved in February 2026: the result was formally verified, with the final stages of the verification done by Math, Inc.'s autoformalization model ‘Gauss’. We discuss the techniques used to achieve this milestone, reflect on the unique collaboration between humans and Gauss, and discuss project objectives that remain.

[AI-183] Nonlinear Non-Gaussian Density Steering with Input and Noise Channel Mismatch: Sinkhorn with Memory for Solving the Control-affine Schrödinger Bridge Problem

【速读】:该论文旨在解决控制仿射型薛定谔桥(Schrödinger bridge)问题中,当控制通道与噪声通道不匹配时,传统基于Hopf-Cole变换的动态Sinkhorn算法失效的问题。其关键在于设计一种带有记忆机制的Sinkhorn迭代算法,该算法利用了非线性偏微分方程(PDEs)的结构特性,在控制与噪声通道不匹配的情况下仍能有效求解最优密度调控问题,并证明了所提算法的局部稳定性。

链接: https://arxiv.org/abs/2604.23370
作者: Georgiy A. Bondar,Asmaa Eldesoukey,Yongxin Chen,Abhishek Halder
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Solutions to the Schrödinger bridge problem and its generalizations yield feedback control policies for optimal density steering over a controlled diffusion. To numerically compute the same, the dynamic Sinkhorn recursion has become a standard approach. The mathematical engine behind this approach is the Hopf-Cole transform that recasts the conditions for optimality into a system of boundary-coupled linear PDEs. Recent works pointed out that for the control-affine Schrödinger bridge problem, this exact linearity via Hopf-Cole transform, and thus the standard Sinkhorn recursion, apply only if the control and noise channels are proportional. When the channels do not match, the Hopf-Cole-transformed PDEs remain nonlinear, and no algorithm is available to solve the same. We advance the state-of-the-art by designing a Sinkhorn recursion with memory that leverages the structure of these nonlinear PDEs, and demonstrate how it solves the control-affine Schrödinger bridge problem with input and noise channel mismatch. We prove the local stability of the proposed algorithm.

[AI-184] Explainable AI in Speaker Recognition – Making Latent Representations Understandable

【速读】:该论文旨在解决生成式 AI(Generative AI)中神经网络表示空间内未知组织模式的可解释性问题,特别是语音识别网络在学习说话人身份表征时所形成的层次化聚类现象。其解决方案的关键在于引入两种新型聚类算法——Single-Linkage Clustering (SLINK) 和 Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN),以揭示网络表示中非独立、具有层级关系的聚类结构,并设计 Hierarchical Cluster-Class Matching (HCCM) 算法实现语义类别与层次聚类之间的逐一对齐,从而定量评估匹配性能并识别限制因素。

链接: https://arxiv.org/abs/2604.23354
作者: Yanze Xu,Wenwu Wang,Mark D. Plumbley
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 15 pages, 10 figures

点击查看摘要

Abstract:Neural networks can be trained to learn task-relevant representations from data. Understanding how these networks make decisions falls within the Explainable AI (XAI) domain. This paper proposes to study an XAI topic: uncovering unknown organisational patterns in network representations, particularly those representations learned by the speaker recognition network that recognises the speaker identity of utterances. Past studies employed algorithms (e.g. t-distributed Stochastic Neighbour Embedding and K-means) to analyse and visualise how network representations form independent clusters, indicating the presence of flat clustering phenomena within the space defined by these representations. In contrast, this work applies two algorithms – Single-Linkage Clustering (SLINK) and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) – to analyse how representations form clusters with hierarchical relationships rather than being independent, thereby demonstrating the existence of hierarchical clustering phenomena within the network representation space. To semantically understand the above hierarchical clustering phenomena, a new algorithm, termed Hierarchical Cluster-Class Matching (HCCM), is designed to perform one-to-one matching between predefined semantic classes and hierarchical representation clusters (i.e. those produced by SLINK or HDBSCAN). Some hierarchical clusters are successfully matched to individual semantic classes (e.g. male, UK), while others to conjunctions of semantic classes (e.g. male and UK, female and Ireland). A new metric, Liebig’s score, is proposed to quantify the performance of each matching behaviour, allowing us to diagnose the factor that most strongly limits matching performance.

[AI-185] he Security Cost of Intelligence: AI Capability Cyber Risk and Deployment Paradox

【速读】:该论文试图解决的问题是:在组织治理能力未能跟上AI系统能力提升的背景下,企业如何权衡AI部署与网络安全投资,以实现最优生产力收益。其核心挑战在于,高价值AI应用通常需要更广泛的权限暴露(如数据访问、流程集成和授权委托),而弱治理环境下这种暴露会增加安全风险。解决方案的关键在于构建一个联合决策模型,揭示“部署悖论”现象——即在高损失环境中,更强大的AI反而可能导致企业减少部署,因为能力提升未伴随治理对权限暴露的有效管控;同时发现,治理投资通过降低潜在损失可缩小该悖论区域,而外部性则扩大社会约束范围,表明治理成熟度不仅是AI采用的限制因素,更是决定能力能否转化为有效部署的核心条件。

链接: https://arxiv.org/abs/2604.23058
作者: Sukwoong Choi
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Firms are deploying more capable AI systems, but organizational controls often have not kept pace. These systems can generate greater productivity gains, but high-value uses require broader authority exposure – data access, workflow integration, and delegated authority – when governance controls have not yet decoupled capability from authority exposure. We develop an analytical model in which a firm jointly chooses AI deployment and cybersecurity investment under this governance-capability gap. The central result shows a deployment paradox: in high-loss environments, better AI can lead a firm to deploy less when capability is deployed through broader authority exposure under weak governance. Optimal deployment also falls below the no-risk benchmark, and this shortfall widens with breach-loss magnitude and with the authority exposure attached to more capable systems. Governance investment that reduces breach-loss magnitude shrinks the paradox region itself, while breach externalities expand the range of environments in which deployment is socially constrained. Governance maturity is therefore not merely a constraint on AI adoption. It is a condition that shapes whether capability improvements translate into productive deployment.

[AI-186] Applied AI-Enhanced RF Interference Rejection

【速读】:该论文旨在解决无线射频(RF)传输中干扰抑制的问题,特别是在缺乏干扰信号详细设计信息或传播条件先验知识的情况下,如何实现对信号的检测、解调与解码。传统方法仅依赖信号本身(Signal of Interest, SOI),难以应对复杂干扰环境;而本文提出利用生成式 AI(Generative AI)技术,通过训练模型同时学习SOI和混合信号(SOI加干扰),从而显著提升抗干扰能力。解决方案的关键在于采用自回归Transformer解码器架构,相较于早期基于WaveNet的模型,在推理阶段展现出数量级更快的吞吐量,且能在轻量级GPU(如Jetson AGX Orin)上实现低延迟处理,有效提升了战术场景下语音通信的可懂度(以PESQ指标衡量)。

链接: https://arxiv.org/abs/2604.22816
作者: Rahul Jain,Pierre Trepagnier,Rick Gentile,Joey Botero,Alexia Schulz
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 8 figures, Accepted to the 2nd IEEE International Conference on AI and Data Analytics (ICAD 2026)

点击查看摘要

Abstract:AI-enhanced interference rejection in radio frequency (RF) transmissions has recently attracted interest because deep learning approaches trained on both the signal of interest (SOI) and the signal mixture (SOI plus interference) can outperform traditional approaches which only consider the SOI. The goal is to detect, demodulate, and decode signals over a range of signal-to-interference-plus-noise (SINR) levels without having a detailed, design-level knowledge of the interfering signal or the propagation conditions. Our present AI interference suppression results are based on Autoregressive Transformer Decoder models which exhibit orders of magnitude faster throughput at inference time than WaveNet models developed in earlier work. As a specific example, we investigate an analog FM “Walkie Talkie” radio signal of interest in the presence of an Orthogonal Frequency-Division Multiplexing (OFDM) interferer. This type of interferer is near-ubiquitous in the current RF landscape. Our results clearly show the benefits of transformer-based interference mitigation in tactical settings. We show that unintelligible transmissions become intelligible via metrics such as Perceptual Evaluation of Speech Quality (PESQ), while overall latency is kept to a minimum using readily available lightweight GPUs such as a Jetson AGX Orin. We believe these same techniques can also be applied to a broader set of national security scenarios, as well as having commercial applications.

[AI-187] A General Framework for Generative Self-supervised Learning in Non-invasive Estimation of Physiological Parameters Using Photoplethysmography

【速读】:该论文旨在解决大规模光电容积脉搏波(PPG)数据中生理参数标签对齐困难且标注成本高昂的问题,同时提升在有限标注数据下深度学习模型的泛化能力。其核心解决方案是提出一种生成式自监督表示学习框架TS2TC,关键在于设计了一个名为跨时域融合生成锚点(CTFGA)的预训练任务,通过建模时间依赖性和粗粒度独立片段重建,实现鲁棒的全局特征提取与局部上下文表征;并引入受认知启发的双过程迁移(DPT)策略,融合先验依赖的自主过程与后验观测推理过程,以协同利用共享与特定表示的优势。此外,TS2TC在时域-频谱域混合空间中采用双线性融合机制,增强多源信息间的细粒度上下文交互,最终在仅使用10%训练数据的情况下,相较当前最优方法平均降低RMSE 2.49%,显著提升了非侵入式生理参数估计的准确性与鲁棒性。

链接: https://arxiv.org/abs/2604.22780
作者: Zexing Zhang,Huimin Lu,Songzhe Ma,Jianzhong Peng,Chenglin Lin,Niya Li,Bingwang Dong
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Aligning physiological parameter labels with large-scale photoplethysmographic (PPG) data for deep learning is challenging and resource-intensive. While self-supervised representation learning (SSRL) can handle limited annotated data, the challenge lies in learning robust shared representations from vast unlabeled data and integrating contextual cues to learn distinctive representations. To alleviate these challenges, a generative SSRL framework TS2TC is proposed to utilize the temporal, spectrogram, and temporal-spectrogram mixed domains to explore and incorporate the unique features of PPG for universal and noninvasive physiological parameter estimation. A pretext task named Cross-Temporal Fusion Generative Anchor (CTFGA) is designed, modeling temporal dependencies and reconstructing independent segments at a coarse level to provide robust global feature extraction and local contextual representation. The framework includes sub-signals from PPG with diverse frequency scales and order derivatives reflecting hemodynamics to facilitate learning shared representations at varying semantic levels. Secondly, a cognitive-inspired dual-process transfer (DPT) strategy is formulated, consisting of prior-dependent autonomous processes and posterior observation reasoning processes, to leverage the independent and integrated advantages of shared and specific representations. TS2TC introduces a bilinear temporal-spectrogram fusion method in the mixed domain, aligning latent representations from different domains and establishing fine-grained contextual interactions across multiple sources of information. Extensive experiments on physiological parameter estimation tasks showed that the joint performance of CTFGA and DPT outperforms standard generative learning significantly. TS2TC achieved an average 2.49% improvement in RMSE over state-of-the-art estimation methods with only 10% training data.

机器学习

[LG-0] he Optimal Sample Complexity of Multiclass and List Learning

链接: https://arxiv.org/abs/2604.24749
作者: Chirag Pabbaraju
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:While the optimal sample complexity of binary classification in terms of the VC dimension is well-established, determining the optimal sample complexity of multiclass classification has remained open. The appropriate complexity parameter for multiclass classification is the DS dimension, and despite significant efforts, a gap of \sqrt\textDS has persisted between the upper and lower bounds on sample complexity. Recent work by Hanneke et al. (2026) shows a novel algebraic characterization of multiclass hypothesis classes in terms of their DS dimension. Building up on this, we show that the maximum hypergraph density of any multiclass hypothesis class is upper-bounded by its DS dimension. This proves a longstanding conjecture of Daniely and Shalev-Shwartz (2014). As a consequence, we determine the optimal dependence of the sample complexity on the DS dimension for multiclass as well as list learning. Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2604.24749 [cs.LG] (or arXiv:2604.24749v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.24749 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-1] Conflict-Aware Harmonized Rotational Gradient for Multiscale Kinetic Regimes

链接: https://arxiv.org/abs/2604.24745
作者: Zhangyong Liang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we propose a harmonized rotational gradient method, termed HRGrad, for simultaneously tackling multiscale time-dependent kinetic problems with varying small parameters. These parameters exhibit asymptotic transitions from microscopic to macroscopic physics, making it a challenging multi-task problem to solve over all ranges simultaneously. Solving tasks in different asymptotic regions often encounter gradient conflicts, which can lead to the failure of multi-task learning. To address this challenge, we explicitly encode a hidden representation of these parameters, ensuring that the corresponding solving tasks are serialized for simultaneous training. Furthermore, to mitigate gradient conflicts, we segment the prediction results to construct task losses and introduce a novel gradient alignment metric to ensure a positive dot product between the final update and each loss-specific gradient. This metric maintains consistent optimization rates for all task losses and dynamically adjusts gradient magnitudes based on conflict levels. Moreover, we provide a mathematical proof demonstrating the convergence of the HRGrad method, which is evaluated across a range of challenging asymptotic-preserving neural networks (APNNs) scenarios. We conduct an extensive set of experiments encompassing the Bhatnagar-Gross-Krook (BGK) equation and the linear transport equation in all ranges of Knudsen number. Our results indicate that HRGrad effectively overcomes the `failure modes’ of APNNs in these problems. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2604.24745 [cs.LG] (or arXiv:2604.24745v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.24745 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Zhangyong Liang [view email] [v1] Mon, 27 Apr 2026 17:47:42 UTC (3,236 KB) Full-text links: Access Paper: View a PDF of the paper titled Conflict-Aware Harmonized Rotational Gradient for Multiscale Kinetic Regimes, by Zhangyong LiangView PDFHTML (experimental)TeX Source view license Current browse context: cs.LG prev | next new | recent | 2026-04 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[LG-2] SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning

链接: https://arxiv.org/abs/2604.24729
作者: Zijian Guo,İlker Işık,H. M. Sabbir Ahmad,Wenchao Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Specification-guided reinforcement learning (RL) provides a principled framework for encoding complex, temporally extended tasks using formal specifications such as linear temporal logic (LTL). While recent methods have shown promising results, their ability to generalize across unseen specifications and diverse environments remains insufficiently understood. In this work, we introduce SpecRLBench, a benchmark designed to evaluate the generalization capabilities of LTL-based specification-guided RL methods. The benchmark spans multiple difficulty levels across navigation and manipulation domains, incorporating both static and dynamic environments, diverse robot dynamics, and varied observation modalities. Through extensive empirical evaluation, we characterize the strengths and limitations of existing approaches and reveal the challenges that emerge as specification and environment complexity increase. SpecRLBench provides a structured platform for systematic comparison and supports the development of more generalizable specification-guided RL methods. Code is available at this https URL.

[LG-3] Exploiting Differential Flatness for Efficient Learning-based Model Predictive Control of Constrained Multi-Input Control Affine Systems

链接: https://arxiv.org/abs/2604.24706
作者: Tobias A. Farger,Adam W. Hall,Angela P. Schoellig
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Accepted for publication in 2026 European Control Conference

点击查看摘要

Abstract:Learning-based control techniques use data from past trajectories to control systems with uncertain dynamics. However, learning-based controllers are often computationally inefficient, limiting their practicality. To address this limitation, we propose a learning-based controller that exploits differential flatness, a property of many robotic systems. Recent research on using flatness for learning-based control either is limited in that it (i) ignores input constraints, (ii) applies only to single-input systems, or (iii) is tailored to specific platforms. In contrast, our approach uses a system extension and block-diagonal cost formulation to control general multi-input, nonlinear, affine systems. Furthermore, it satisfies input and half-space flat state constraints and guarantees probabilistic Lyapunov decrease using only two sequential convex optimizations. We show that our approach performs similarly to, but is multiple times more efficient than, a Gaussian process model predictive controller in simulation, and achieves competitive tracking in real hardware experiments.

[LG-4] Diffusion-Guided Feature Selection via Nishimori Temperature: Noise-Based Spectral Embedding

链接: https://arxiv.org/abs/2604.24692
作者: Vasiliy S. Usatyuk,Denis A. Sapozhnikov,Sergey I. Egorov
类目: Machine Learning (cs.LG)
*备注: 8 pages, 3 figures, extended version (with noise shift proof) of DSPA2026 article

点击查看摘要

Abstract:We propose Noise-Based Spectral Embedding (NBSE), a physics-informed framework for selecting informative features from high-dimensional data without greedy search. NBSE constructs a sparse similarity graph on the samples and identifies the Nishimori temperature \beta_N the critical inverse temperature at which the Bethe Hessian becomes singular. The corresponding smallest eigenvector captures the dominant mode of an intrinsically degree-corrected diffusion process, naturally reweighting nodes to prevent hub dominance. By transposing the data matrix and applying NBSE in feature space, we obtain a one-dimensional spectral embedding that reveals groups of redundant or semantically related dimensions; balanced binning then selects one representative per group. We prove that coloured Gaussian perturbations shift \beta_N by at most O(\bar\sigma^2) , guaranteeing robustness to measurement noise. Experiments on ImageNet embeddings from MobileNetV2 and EfficientNet-B4 show that NBSE preserves classification accuracy even under aggressive compression: on EfficientNet-B4 the accuracy drop is below 1% when retaining only 30% of features, outperforming ANOVA F -test and random selection by up to 6.8% .

[LG-5] A Functorial Formulation of Neighborhood Aggregating Deep Learning

链接: https://arxiv.org/abs/2604.24672
作者: Sun Woo Park,Yun Young Choi,U Jin Choi,Youngho Woo
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注: 32 pages, 11 figures. Comments welcome

点击查看摘要

Abstract:We provide a mathematical interpretation of convolutional (or message passing) neural networks by using presheaves and copresheaves of the set of continuous functions over a topological space. Based on this interpretation, we formulate a theoretical heuristic which elaborates a number of empirical limitations of these neural networks by using obstructions on such sets of continuous functions over a topological space to be sheaves or copresheaves.

[LG-6] he Last Human-Written Paper: Agent -Native Research Artifacts

链接: https://arxiv.org/abs/2604.24658
作者: Jiachen Liu,Jiaxin Pei,Jintao Huang,Chenglei Si,Ao Qu,Xiangru Tang,Runyu Lu,Lichang Chen,Xiaoyan Bai,Haizhong Zheng,Carl Chen,Zhiyang Chen,Haojie Ye,Yujuan Fu,Zexue He,Zijian Jin,Zhenyu Zhang,Shangquan Sun,Maestro Harmon,John Dianzhuo Wang,Jianqiao Zeng,Jiachen Sun,Mingyuan Wu,Baoyu Zhou,Yuchen You,Shijian Lu,Yiming Qiu,Fan Lai,Yuan Yuan,Yao Li,Junyuan Hong,Ruihao Zhu,Beidi Chen,Alex Pentland,Ang Chen,Mosharaf Chowdhury,Zechen Zhang
类目: Machine Learning (cs.LG)
*备注: 45 pages, 15 figures, 14 tables

点击查看摘要

Abstract:Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along the way. This compilation imposes two structural costs: a Storytelling Tax, where failed experiments, rejected hypotheses, and the branching exploration process are discarded to fit a linear narrative; and an Engineering Tax, where the gap between reviewer-sufficient prose and agent-sufficient specification leaves critical implementation details unwritten. Tolerable for human readers, these costs become critical when AI agents must understand, reproduce, and extend published work. We introduce the Agent-Native Research Artifact (Ara), a protocol that replaces the narrative paper with a machine-executable research package structured around four layers: scientific logic, executable code with full specifications, an exploration graph that preserves the failures compilation discards, and evidence grounding every claim in raw outputs. Three mechanisms support the ecosystem: a Live Research Manager that captures decisions and dead ends during ordinary development; an Ara Compiler that translates legacy PDFs and repos into Aras; and an Ara-native review system that automates objective checks so human reviewers can focus on significance, novelty, and taste. On PaperBench and RE-Bench, Ara raises question-answering accuracy from 72.4% to 93.7% and reproduction success from 57.4% to 64.4%. On RE-Bench’s five open-ended extension tasks, preserved failure traces in Ara accelerate progress, but can also constrain a capable agent from stepping outside the prior-run box depending on the agent’s capabilities.

[LG-7] Uncovering Latent Patterns in Social Media Usage and Mental Health: A Clustering-Based Approach Using Unsupervised Machine Learning ALT

链接: https://arxiv.org/abs/2604.24611
作者: Md All Shahria,Sanjeda Dewan Mithila,Touhid Alam,Mohammad Sakib Mahmood,Mahfuza Khatun
类目: Machine Learning (cs.LG)
*备注: 13 pages, 5 figures, International Conference on Advancement in Healthcare Technology and Biomedical Engineering, Vancouver, BC, Canada

点击查看摘要

Abstract:The widespread adoption of social media has heightened interest in its psychological effects, particularly on mental health indicators such as anxiety, depression, loneliness, and sleep quality, as these platforms increasingly influence social interactions and well-being. Although previous research has examined correlations between social media use and mental health, few studies have utilized unsupervised machine learning to segment users based on behavioral and psychological patterns, leaving a gap in identifying distinct risk profiles across diverse groups. This study seeks to address this by segmenting individuals according to their social media usage and psychological well-being, employing clustering to reveal hidden patterns and evaluate their mental health implications. Data from 551 participants, collected via an online survey, were preprocessed using KNN imputation for missing values, one-hot encoding for categorical variables like Gender with 5 unique values, and outlier detection via IQR and Z-score methods. K-Means clustering, optimized at 6 clusters using the Elbow Method and a Silhouette Score of 0.32, was applied, with PCA reducing 22 dimensions for visualization and a correlation heatmap highlighting relationships, such as a 0.28 correlation between social media hours and anxiety.

[LG-8] Fraud Detection in Cryptocurrency Markets with Spatio-Temporal Graph Neural Networks

链接: https://arxiv.org/abs/2604.24590
作者: Lidia Losavio,Luca Persia,Madan Sathe,Dimosthenis Pasadakis
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注: 9 pages, 3 figures, Accepted at the SDS2026: IEEE Swiss Conference on Data Science and AI

点击查看摘要

Abstract:Technological advancements in cryptocurrency markets have increased accessibility for investors, but concurrently exposed them to the risks of market manipulations. Existing fraud detection mechanisms typically rely on machine learning methods that treat each financial asset (i.e., token) and its related transactions independently. However, market manipulation strategies are rarely isolated events, but are rather characterized by coordination, repetition, and frequent transfers among related assets. This suggests that relational structure constitutes an integral component of the signal and can be effectively represented through graphical means. In this paper, we propose three graph construction methods that rely on aggregated hourly market data. The proposed graphs are processed by a unified spatio-temporal Graph Neural Network (GNN) architecture that combines attention-based spatial aggregation with temporal Transformer encoding. We evaluate our methodology on a real-world dataset comprised of pump-and-dump schemes in cryptocurrency markets, spanning a period of over three years. Our comparative results showcase that our graph-based models achieve significant improvements over standard machine learning baselines in detecting anomalous events. Our work highlights that learned market connectivity provides substantial gains for detecting coordinated market manipulation schemes.

[LG-9] Efficient learning by implicit exploration in bandit problems with side observations NEURIPS

链接: https://arxiv.org/abs/2604.24555
作者: Tomas Kocak,Gergely Neu,Michal Valko,Remi Munos
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Published at Neural Information Processing Systems (NeurIPS) 2014

点击查看摘要

Abstract:We consider online learning problems under a partial observability model capturing situations where the information conveyed to the learner is between full information and bandit feedback. In the simplest variant, we assume that in addition to its own loss, the learner also gets to observe losses of some other actions. The revealed losses depend on the learner’s action and a directed observation system chosen by the environment. For this setting, we propose the first algorithm that enjoys near-optimal regret guarantees without having to know the observation system before selecting its actions. Along similar lines, we also define a new partial information setting that models online combinatorial optimization problems where the feedback received by the learner is between semi-bandit and full feedback. As the predictions of our first algorithm cannot be always computed efficiently in this setting, we propose another algorithm with similar properties and with the benefit of always being computationally efficient, at the price of a slightly more complicated tuning mechanism. Both algorithms rely on a novel exploration strategy called implicit exploration, which is shown to be more efficient both computationally and information-theoretically than previously studied exploration strategies for the problem.

[LG-10] Dialysis Risk Prediction and Treatment Effect Estimation for AKI patients using Longitudinal Electronic Health Records

链接: https://arxiv.org/abs/2604.24547
作者: Kalyani P. Pande,Evan Yang,Bryan Zhu,Sandeep K. Mallipattu,Alisa Yurovsky,Tengfei Ma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Progression to dialysis or end-stage renal disease is a rare but clinically important outcome. Clinicians need evidence on how medication exposures influence downstream risk. We constructed a fixed-window EHR cohort (90-day observation, 730-day prediction; N=81401; dialysis/ESRD prevalence: 1.1%) and modeled sequences of diagnoses, procedures, and medications with kidney laboratory trends (creatinine, BUN, eGFR). A transformer-based causal multi-head model was trained to estimate drug- and ingredient-level average treatment effects (ATEs) using counterfactual exposure removal and insertion under a full medication history setup. On test set, predictive performance reached an AUC of 0.694 and PR-AUC of 0.094. At the selected decision threshold (0.883), the model achieved an F1 score of 0.201 with a Brier score of 0.018. Post-hoc causal analyses of lab changes (eGFR, creatinine, BUN) using IPTW, AIPW, naive, and covariate-adjusted OLS methods assessed clinical directionality. Results showed partial protective-direction support for ACE/ARB exposures and worsening-direction signals for loop diuretics.

[LG-11] Stochastic simultaneous optimistic optimization ICML2013

链接: https://arxiv.org/abs/2604.24537
作者: Michal Valko,Alexandra Carpentier,Rémi Munos
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Published in International Conference on Machine Learning (ICML 2013)

点击查看摘要

Abstract:We study the problem of global maximization of a function f given a finite number of evaluations perturbed by noise. We consider a very weak assumption on the function, namely that it is locally smooth (in some precise sense) with respect to some semi-metric, around one of its global maxima. Compared to previous works on bandits in general spaces (Kleinberg et al., 2008; Bubeck et al., 2011a) our algorithm does not require the knowledge of this semi-metric. Our algorithm, StoSOO, follows an optimistic strategy to iteratively construct upper confidence bounds over the hierarchical partitions of the function domain to decide which point to sample next. A finite-time analysis of StoSOO shows that it performs almost as well as the best specifically-tuned algorithms even though the local smoothness of the function is not known.

[LG-12] A Reward-Free Viewpoint on Multi-Objective Reinforcement Learning ICLR2026

链接: https://arxiv.org/abs/2604.24532
作者: Ying-Tu Chen,Wei Hung,Bing-Shu Wu,Zhang-Wei Hong,Ping-Chun Hsieh
类目: Machine Learning (cs.LG)
*备注: ICLR 2026

点击查看摘要

Abstract:Many sequential decision-making tasks involve optimizing multiple conflicting objectives, requiring policies that adapt to different user preferences. In multi-objective reinforcement learning (MORL), one widely studied approach addresses this by training a single policy network conditioned on preference-weighted rewards. In this paper, we explore a novel algorithmic perspective: leveraging reward-free reinforcement learning (RFRL) for MORL. While RFRL has historically been studied independently of MORL, it learns optimal policies for any possible reward function, making it a natural fit for MORL’s challenge of handling unknown user preferences. We propose using the RFRL’s training objective as an auxiliary task to enhance MORL, enabling more effective knowledge sharing beyond the multi-objective reward function given at training time. To this end, we adapt a state-of-the-art RFRL algorithm to the MORL setting and introduce a preference-guided exploration strategy that focuses learning on relevant parts of the environment. Through extensive experiments and ablation studies, we demonstrate that our approach significantly outperforms the state-of-the-art MORL methods across diverse MO-Gymnasium tasks, achieving superior performance and data efficiency. This work provides the first systematic adaptation of RFRL to MORL, demonstrating its potential as a scalable and empirically effective solution to multi-objective policy learning.

[LG-13] Prior-Agnostic Robust Forecast Aggregation

链接: https://arxiv.org/abs/2604.24517
作者: Zhi Chen,Cheng Peng,Wei Tang
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:Robust forecast aggregation combines the predictions of multiple information sources to perform well in the worst case across all possible information structures. Previous work largely focuses on settings with a known binary state space, where the state is either 0 or 1. We study prior-agnostic robust forecast aggregation in which the aggregator observes only experts’ reports, yet is ignorant of both the underlying joint information structure and the full prior, including the underlying state space. Unlike the standard model that fixes the binary state space 0, 1, we allow the (binary) unknown state values to be arbitrary numbers in [0, 1], so the same reported probability may correspond to very different realized outcome frequencies across environments. Our main contribution is a simple, explicit, closed-form log-odds aggregator that linearly pools forecasts in logit space, together with (nearly-)tight minimax-regret guarantees across three knowledge regimes. We first show that under conditionally independent (CI) signals, robust aggregation with an unknown state space is strictly harder than in the known-state setting by establishing a larger lower bound, and our aggregation rule can achieve a worst-case regret of 0.0255. Along the way, we also characterize tight regret bounds for Blackwell-ordered structures and for general information structures. In the classical setting with known state space 0,1, our aggregator achieves regret strictly below 0.0226 for CI structures. To the best of our knowledge, this is the first explicit closed-form aggregator that achieves a regret upper bound strictly less than 0.0226. Finally, we extend the model where the aggregator additionally knows each expert’s marginal forecast distribution; in this setting, with the CI structures, we show that a generalized log-odds rule achieves regret of 0.0228, complementing with a lower bound of 0.0225. Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT) Cite as: arXiv:2604.24517 [cs.LG] (or arXiv:2604.24517v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.24517 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-14] SceneSelect: Selective Learning for Trajectory Scene Classification and Expert Scheduling

链接: https://arxiv.org/abs/2604.24514
作者: Xinrun Wang,Deshun Xia,Ke Xu,Weijie Zhu
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted by ICIC 2026

点击查看摘要

Abstract:Accurate trajectory prediction is fundamentally challenging due to high scene heterogeneity - the severe variance in motion velocity, spatial density, and interaction patterns across different real-world environments. However, most existing approaches typically train a single unified model, expecting a fixed-capacity architecture to generalize universally across all possible scenarios. This conventional model-centric paradigm is fundamentally flawed when confronting such extreme heterogeneity, inevitably leading to a severe generalization gap, degraded accuracy, and massive computational waste. To overcome this bottleneck, rather than refining restricted model-centric architectures, we propose selective learning, a novel scene-centric paradigm. It explicitly analyzes the characteristics of the underlying scene to dynamically route inputs to the most appropriate expert models. As a concrete implementation of this paradigm, we introduce SceneSelect. Specifically, SceneSelect utilizes unsupervised clustering on interpretable geometric and kinematic features to discover a latent scene taxonomy. A highly decoupled classification module is then trained to assign real-time inputs to these scene categories, and a highly extensible, plug-and-play scheduling policy automatically dispatches the trajectory sequence to the optimal expert predictor. Crucially, this decoupled design ensures excellent generalization capabilities, allowing seamless integration with different off-the-shelf models and robust adaptation across new datasets without requiring computationally expensive joint retraining. Extensive experiments on three public benchmarks (ETH-UCY, SDD, and NBA) demonstrate that our method consistently outperforms strong single-model and ensemble baselines, achieving an average improvement of 10.5%, showcasing the effectiveness of scene-aware selective learning.

[LG-15] Advancing Ligand-based Virtual Screening and Molecular Generation with Pretrained Molecular Embedding Distance

链接: https://arxiv.org/abs/2604.24474
作者: Shiyun Wa,Yifei Wang,Simone Sciabola,Ye Wang
类目: Machine Learning (cs.LG)
*备注: 26 pages, 12 figures, 9 tables

点击查看摘要

Abstract:Molecular similarity plays a central role in ligand-based drug discovery, such as virtual screening, analog searching, and goal-directed molecular generation. However, traditional similarity measures, ranging from fingerprint-based Tanimoto coefficients to 3D shape overlays, are often computationally expensive at scale or rely on hand-crafted molecular descriptors. Meanwhile, many deep learning approaches to similarity-aware design still depend on similarity-specific supervision or costly data curation, limiting their generality across targets. In this work, we propose pretrained embedding distance (PED) as an effective alternative, computed directly from pretrained molecular models without task-specific training. Experimental results show that PED exhibits distinct correlations with traditional similarity metrics, and performs effectively in both ranking molecules for virtual screening and guiding molecular generation via reward design. These findings suggest that pretrained molecular embeddings capture rich structural information and can serve as a promising and scalable similarity measurement for modern AI-aided drug discovery.

[LG-16] An Automatic Ground Collision Avoidance System with Reinforcement Learning

链接: https://arxiv.org/abs/2604.24403
作者: Seyyid Osman Sevgili,Atahan Cilan,Mahir Demir,Özgün Can Yürütken,Ümit Can Bekar
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:This article evaluates an artificial intelligence (AI)-based Automatic Ground Collision Avoidance System (AGCAS) designed for advanced jet trainers to enhance operational effectiveness. In the continuously evolving field of aerospace engineering, the integration of AI is crucial for advancing operations with improved timing constraints and efficiency. Our study explores the design process of an AI-driven AGCAS, specifically tailored for advanced jet trainers, focusing on addressing the AGCAS problem within a limited observation space. The system utilizes line-of-sight queries on a terrain server to ensure precise and efficient collision avoidance. This approach aims to significantly improve the safety and operational capabilities of advanced jet trainers.

[LG-17] SAGE: Sparse Adaptive Guidance for Dependency-Aware Tabular Data Generation ACL2026

链接: https://arxiv.org/abs/2604.24368
作者: Shuo Yang,Zheyu Zhang,Bardh Prenkaj,Gjergji Kasneci
类目: Machine Learning (cs.LG)
*备注: Accepted by ACL 2026

点击查看摘要

Abstract:Generating high-fidelity synthetic tabular data remains a critical challenge for enhancing data availability in privacy-sensitive and low-resource domains. Recent approaches leverage LLMs by representing table rows as sequences, yet suffer from two fundamental limitations: (1) they model feature dependencies densely, introducing spurious correlations; and (2) they assume static relationships between features, ignoring how these dependencies vary with feature values. To overcome these limitations, we introduce SAGE (Sparse Adaptive Guidance), a novel LLM-based generation framework that enforces sparse and dynamic dependency guidance. SAGE discretizes features into value-aware pseudo-features and constructs a mutual information-based sparse dependency graph. This graph adaptively guides generation through explicit context selection or implicit logit correction, enabling LLMs to focus on truly relevant information during synthesis. Our extensive experiments across six datasets and multiple tasks reveal that SAGE not only improves data fidelity and downstream utility, boosting F1 scores by 10% compared to previous LLM-based methods, but also reduces policy violations by one point. These results highlight the importance of adaptive structure in tabular data generation and provide new insights into context-sensitive control of LLMs.

[LG-18] Primitive Recursion without Composition: Dynamical Characterizations from Neural Networks to Polynomial ODEs

链接: https://arxiv.org/abs/2604.24356
作者: Olivier Bournez
类目: Computational Complexity (cs.CC); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:What do recurrent neural networks, polynomial ODEs, and discrete polynomial maps each bring to computation, and what do they lack? All three operate over the continuum–real-valued states evolved by real-valued dynamics–even when the target functions are discrete. We study them through primitive recursion. We prove that primitive recursion admits equivalent characterizations in all three frameworks: bounded iteration of a fixed recurrent ReLU network, robust computation by a fixed polynomial ODE, and iteration of a fixed polynomial map with an externally supplied step-size parameter. In each, the time bound is itself primitive recursive, composition emerges from the dynamics rather than as a closure rule, and inputs are raw integer vectors. Every primitive recursive function is first compiled into bounded iteration of a single threshold-affine normal form, then interpreted as a ReLU computation and as a polynomial ODE. The equivalences expose a structural asymmetry: no fixed polynomial map can round uniformly to the nearest integer or realize exact phase selection–operations polynomial ODEs perform robustly via continuous-time flow. Each formalism compensates for a limitation the others lack: the ReLU gate provides exact branching, continuous time provides autonomous rounding and control, and the step-size parameter recovers both at the cost of discretization precision. This opens dynamical characterizations of subrecursive hierarchies and complexity classes by restricting time bounds, polynomial degrees, or discretization resources within one framework. More broadly, these models do not compute by composing subroutines: they shape the trajectory of a dynamical system through clocks, phase selectors, and error correction built into the dynamics. This differs structurally from symbolic programming, and our theorem gives a precise framework to study the difference. Subjects: Computational Complexity (cs.CC); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Neural and Evolutionary Computing (cs.NE) Cite as: arXiv:2604.24356 [cs.CC] (or arXiv:2604.24356v1 [cs.CC] for this version) https://doi.org/10.48550/arXiv.2604.24356 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Olivier Bournez [view email] [v1] Mon, 27 Apr 2026 11:48:49 UTC (546 KB)

[LG-19] An Aircraft Upset Recovery System with Reinforcement Learning

链接: https://arxiv.org/abs/2604.24355
作者: Mahir Demir,Atahan Cilan,Seyyid Osman Sevgili,Özgün Can Yürütken,Ümit Can Bekar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This article explores the progress made in the creation of a pilot activated recovery system (PARS) for advanced jet trainers that utilizes artificial intelligence (AI) in an effort to enhance operational efficiency. The PARS model employs an advanced reinforcement learning (RL) architecture, incorporating a cutting-edge soft-actor critic (SAC) model and hyper-parameter optimization methods. Negative-g punishments and other handcrafted features remarked upon by control engineers and domain experts regarding PARS are also taken into account by the system. When evaluated by them, the AI model’s behavior is deemed more desirable than that of conventional control methods.

[LG-20] Perfecting Aircraft Maneuvers with Reinforcement Learning

链接: https://arxiv.org/abs/2604.24338
作者: Atahan Cilan,Mahir Demir,Özgün Can Yürütken,Seyyid Osman Sevgili,Ümit Can Bekar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper evaluates an advanced jet trainer’s utilization of artificial intelligence (AI)-based aircraft aerobatic maneuvers with the intention of developing an AI-assisted pilot training module for specific aircraft maneuvers. A multitude of aircraft maneuvers have been simulated using reinforcement learning (RL) agents, which will serve as a training tool for future pilots.

[LG-21] Mitigating Error Amplification in Fast Adversarial Training

链接: https://arxiv.org/abs/2604.24332
作者: Mengnan Zhao,Lihe Zhang,Bo Wang,Tianhang Zheng,Hong Zhong,Geyong Min
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Fast Adversarial Training (FAT) has proven effective in enhancing model robustness by encouraging networks to learn perturbation-invariant representations. However, FAT often suffers from catastrophic overfitting (CO), where the model overfits to the training attack and fails to generalize to unseen ones. Moreover, robustness oriented optimization typically leads to notable performance degradation on clean inputs, and such degradation becomes increasingly severe as the perturbation budget grows. In this work, we conduct a comprehensive analysis of how guidance strength affects model performance by modulating perturbation and supervision levels across distinct confidence groups. The findings reveal that low confidence samples are the primary contributors to CO and the robustness accuracy trade off. Building on this insight, we propose a Distribution-aware Dynamic Guidance (DDG) strategy that dynamically adjusts both the perturbation budget and supervision signal. Specifically, DDG scales the perturbation magnitude according to the sample confidence at the ground truth class, thereby guiding samples toward consistent decision boundaries while mitigating the influence of learning spurious correlations. Simultaneously, it dynamically adjusts the supervision signal based on the prediction state of each sample, preventing overemphasis on incorrect signals. To alleviate potential gradient instability arising from dynamic guidance, we further design a weighted regularization constraint. Extensive experiments on standard benchmarks demonstrate that DDG effectively alleviates both CO and the robustness accuracy trade off.

[LG-22] Model-Free Inference of Investor Preferences: A Relative Entropy IRL Approach

链接: https://arxiv.org/abs/2604.24280
作者: Chen Xu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a framework using Relative Entropy Inverse Reinforcement Learning (RE-IRL) to recover investor reward functions from observed investment actions and market conditions. Unlike traditional IRL algorithms, RE-IRL is employed to account for environments where transition probabilities are unknown or inaccessible. To address the challenge of data sparsity, we utilize a K -nearest neighbor approach to estimate the observed behavior policy. Furthermore, we propose a statistical testing framework to evaluate the validity and robustness of the estimated results.

[LG-23] BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment

链接: https://arxiv.org/abs/2604.24273
作者: Md. Ashiq Ul Islam Sajid,Mohammad Sakib Mahmood,Md. Tareq Hasan,Md Abdur Rahim,Rafat Ara,Md. Arafat Hossain
类目: Machine Learning (cs.LG)
*备注: 6pages, 1 Figure, IEEE International Conference of Frontiers of Engineering and Emerging Technologies 2026

点击查看摘要

Abstract:The deployment of intelligent reinforcement learning (RL) agents on resource-constrained edge devices remains a fundamental challenge due to the substantial memory, computational, and energy requirements of modern deep learning systems. While large language models (LLMs) have emerged as powerful architectures for decision-making agents, their multi-billion parameter scale confines them to cloud-based deployment, raising concerns about latency, privacy, and connectivity dependence. We introduce BitRL, a framework for building RL agents using 1-bit quantized language models that enables practical on-device learning and inference under severe resource constraints. Leveraging the BitNet b1.58 architecture with ternary weights (-1, 0, +1) and an optimized inference stack, BitRL achieves 10-16x memory reduction and 3-5x energy efficiency improvements over full-precision baselines while maintaining 85-98 percent of task performance across benchmarks. We provide theoretical analysis of quantization as structured parameter perturbation, derive convergence bounds for quantized policy gradients under frozen-backbone architectures, and identify the exploration-stability trade-off in extreme quantization. Our framework systematically integrates 1-bit quantized language models with reinforcement learning for edge deployment and demonstrates effectiveness on commodity hardware. Comments: 6pages, 1 Figure, IEEE International Conference of Frontiers of Engineering and Emerging Technologies 2026 Subjects: Machine Learning (cs.LG) Cite as: arXiv:2604.24273 [cs.LG] (or arXiv:2604.24273v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.24273 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-24] GeoEdit: Local Frames for Fast Training-Free On-Manifold Editing in Diffusion Models

链接: https://arxiv.org/abs/2604.24238
作者: Yiming Zhang,Sitong Liu,Ke Li,Zhihong Wu,Alex Cloninger,Melvin Leok
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models are a leading paradigm for data generation, but training-free editing typically re-runs the full denoising trajectory for every edit strength, making iterative refinement expensive. To address this issue, we instead edit near the data manifold, where small local updates can replace repeated re-synthesis. To enable this, we estimate a local manifold tangent space directly from perturbed samples and prove that this sample-based estimator closely approximates the true tangent. Building on this guarantee, we devise a Jacobian-free algorithm that constructs a tangent frame via small perturbations to the initial noise and alternates small tangent moves with diffusion-based projections. Updates within this frame follow principled on-manifold directions while suppressing off-manifold drift, enabling fine-grained edits without full re-diffusion or additional training. Edit strength is controlled by the number of steps for rapid, continuous adjustments that preserve fidelity and plug into existing samplers. Empirically, the resulting tangent directions yield smooth, semantic unsupervised traversals and effective CLIP-guided optimization, demonstrating practical interactive continuous editing.

[LG-25] IMPA-Net: Meteorology-Aware Multi-Scale Attention and Dynamic Loss for Extreme Convective Radar Nowcasting

链接: https://arxiv.org/abs/2604.24224
作者: Haofei Cui,Guangxin He,Juanzhen Sun,Jingjia Luo,Haonan Chen,Xiaoran Zhuang,Mingxuan Chen,Xian Xiao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Short-range prediction of convective precipitation from weather radar observations is essential for severe weather warnings. However, deep learning models trained with pixel-wise error metrics tend to produce overly smooth forecasts that suppress intense echoes critical for hazard detection. This issue is exacerbated by insufficient multi-scale feature interaction and suboptimal fusion of heterogeneous geophysical inputs. We propose IMPA-Net (Integrated Multi-scale Predictive Attention Network), a deterministic 0-2 hour nowcasting framework that addresses these limitations through meteorologically-informed designs at the input, architecture, and loss function levels. A parameter-free Spatial Mixer reorganizes heterogeneous input channels at the mesoscale- \gamma neighborhood (~2 km) via deterministic channel permutation, providing a structured cross-field prior. An integrated multi-scale predictive attention module serves as the spatiotemporal translator, capturing dynamics from mesoscale- \beta to mesoscale- \gamma scales. A Meteorologically-Aware Dynamic Loss employs three-level asymmetric weighting – adapting across training epochs, storm intensity, and forecast lead time – to counteract regression-to-the-mean. Evaluated against seven baselines on a multi-source radar dataset over eastern China, IMPA-Net raises the Heidke Skill Score at \geq 45 dBZ from 0.049 (SimVP baseline) to 0.143 under matched settings. Relative to pySTEPS, it provides a better trade-off between severe-event detection and false-alarm control. Spectral analysis confirms preserved energy across mesoscale bands where competing methods show progressive smoothing. These improvements are shown within a single domain and convective regime; generalizability to other orographic and climatic regions remains to be tested.

[LG-26] CMGL: Confidence-guided Multi-omics Graph Learning for Cancer Subtype Classification

链接: https://arxiv.org/abs/2604.24201
作者: Boyang Fan,Hengchuang Yin,Siyu Yi,Yifan Wang,Zhicheng Li,Leijiyu Zhou,Jiancheng Lv,Wei Ju
类目: Machine Learning (cs.LG); Genomics (q-bio.GN); Molecular Networks (q-bio.MN)
*备注: 24 pages, 15 figures, 13 tables, 2 algorithms (main paper + supplementary materials)

点击查看摘要

Abstract:Motivation: Multi-omics integration can improve cancer subtyping, but modality informativeness and noise vary across cancer types and patients. Existing graph-based methods optimize modality weights jointly with the classification objective and therefore lack independent reliability estimates, so low-quality omics distort patient similarity graphs and amplify noise through message passing. Results: We propose CMGL, a two-stage framework that estimates per-sample modality reliability through evidential deep learning and uses the frozen confidence scores to guide cross-omics fusion and graph construction. On four MLOmics cancer-subtype tasks and the 32-class pan-cancer task, CMGL consistently improves over the strongest baseline, surpassing it by 4.03% in average accuracy on the four single-cancer tasks. Its representations recover the PAM50 intrinsic subtypes of breast invasive carcinoma (BRCA), and the BRCA-trained model transfers without fine-tuning to kidney renal clear cell carcinoma (KIRC), stratifying patients into prognostically distinct groups. Comments: 24 pages, 15 figures, 13 tables, 2 algorithms (main paper + supplementary materials) Subjects: Machine Learning (cs.LG); Genomics (q-bio.GN); Molecular Networks (q-bio.MN) MSC classes: 62H30, 68T07, 92C40 ACMclasses: I.2.6; J.3 Cite as: arXiv:2604.24201 [cs.LG] (or arXiv:2604.24201v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.24201 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-27] Machine-Learning-Based Classification of Radio Frequency Building Loss

链接: https://arxiv.org/abs/2604.24143
作者: Jiayi Tan,Neelabhro Roy,James Gross,Rohit Chandra,Tsao-Tsen Chen
类目: Machine Learning (cs.LG)
*备注: Accepted as a short paper in International Conference on Telecommunications (ICT) 2026

点击查看摘要

Abstract:Accurate modeling of outdoor-to-indoor (O2I) and indoor-to-indoor (I2I) signal loss is important for improving indoor wireless network performance in dense urban areas. Traditional on-site measurements are expensive, time-consuming, and difficult to conduct across wide regions. Real-world datasets also tend to be noisy and imbalanced, which makes signal loss prediction challenging. This study presents a machine learning framework for classifying radio frequency (RF) building loss. The framework combines passively collected, crowdsourced user equipment (UE) data from 3GPP-compliant networks with public building information. We evaluated Random Forest, XGBoost, LightGBM, and a voting classifier using both supervised (SL) and semi-supervised learning (SSL). Compared to SL-only inference, the proposed SL and SSL framework improved both prediction accuracy and confidence under identical data constraints, achieving up to 12.6% relative accuracy gain for O2I loss and 3.4% for I2I loss, while reducing prediction entropy by up to 8.4%. Among the evaluated models, SSL XGBoost provided the most confident O2I loss classification, whereas SSL LightGBM achieved the best performance for I2I loss. These results demonstrate that the proposed approach provides a practical, data-driven alternative to traditional models, with promising potential to support better network planning and indoor coverage optimization.

[LG-28] Fed-DLoRA: Efficient Wireless Federated Learning with Dynamic Low-Rank Adaptation

链接: https://arxiv.org/abs/2604.24103
作者: Huaicheng Li,Junhui Zhao,Haoyu Quan,Xiaoming Wang
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 11 pages, 7 figures. Accepted for publication in IEEE Transactions on Vehicular Technology

点击查看摘要

Abstract:Federated learning (FL) offers a promising distributed learning paradigm for internet of vehicles (IoV) applications. However, it faces challenges from communication overhead and dynamic environments. Model compression techniques reduce computing and communication burden yet create trade-offs between compression ratios and vehicle participation strategies. In this paper, we propose a lightweight FL algorithm named federated learning with dynamic low-rank adaptation (Fed-DLoRA), which is combined with low-rank adaptation (LoRA) to effectively reduce parameters and communication costs while enhancing training efficiency. The convergence analysis of Fed-DLoRA is conducted through stochastic gradient descent optimization coupled with singular value decomposition. This analysis establishes the theoretical relationships among LoRA rank, vehicular scheduling strategies and the model’s convergence characteristics. Building on these insights, we formulate a joint optimization problem aimed at maximizing system performance. To address this problem, we propose an adaptive rank, bandwidth and vehicle selection (ARBVS) algorithm that integrates enumeration with greedy optimization strategies. The algorithm provides efficient rank selection and resource scheduling strategies for each FL communication round, thereby achieving effective performance improvements for the FL system. Experimental results demonstrate that Fed-DLoRA achieves superior performance compared to conventional federated learning approaches, exhibiting enhanced accuracy, faster convergence, and improved communication efficiency.

[LG-29] Explaining Temporal Graph Predictions With Shapley Values

链接: https://arxiv.org/abs/2604.24078
作者: Lea-Marie Sussek,Stefan Heindorf
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Temporal Graph Neural Networks (TGNNs) have become increasingly popular in recent years due to their superior predictive performance by combining both spatial and temporal information. However, how these models utilize the information to make predictions is rather unexplored, leading to potentially faulty or biased models. This work introduces two novel model-agnostic explainers for local explanations of TGNNs based on Shapley and Owen values. The first method, an event-level (edge-level) Shapley explainer, applies the KernelSHAP algorithm to estimate contribution scores for individual temporal events, providing interpretable descriptions for model behavior. The second, a feature-level Shapley explainer, extends this framework by decomposing event-level Shapley values into Owen values, and thereby uncovers hierarchical dependencies of the event and its features. The explainers outperform SOTA explainers on different metrics and datasets. Additionally, the Feature Explainer reveals a faulty extraction of actual timestamps of a commonly used TGAT implementation, helping to further understand performance drops on very sparse explanations.

[LG-30] A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

链接: https://arxiv.org/abs/2604.24037
作者: Jun Shu,Junxiong Jia,Deyu Meng,Zongben Xu
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Emergent intelligence have played a major role in the modern AI development. While existing studies primarily rely on empirical observations to characterize this phenomenon, a rigorous theoretical framework remains underexplored. This study attempts to develop a mathematical approach to formalize emergent intelligence from the perspective of limit theory. Specifically, we introduce a performance function E(N, P, K), dependent on data size N, model size P and training steps K, to quantify intelligence behavior. We posit that intelligence emerges as a transition from finite to effectively infinite knowledge, and thus recast emergent intelligence as existence of the limit \lim_N,P,K \to \infty \mathcalE(N,P,K) , with emergent abilities corresponding to the limiting behavior. This limit theory helps reveal that emergent intelligence originates from the existence of a parameter-limit architecture (referred to as the limit architecture), and that emergent intelligence rationally corresponds to the learning behavior of this limit system. By introducing tools from nonlinear Lipschitz operator theory, we prove that the necessary and sufficient conditions for existence of the limit architecture. Furthermore, we derive the scaling law of foundation models by leveraging tools of Lipschitz operator and covering number. Theoretical results show that: 1) emergent intelligence is governed by three key factors-training steps, data size and the model architecture, where the properties of basic blocks play a crucial role in constructing foundation models; 2) the critical condition Lip(T)=1 for emergent intelligence provides theoretical support for existing findings. 3) emergent intelligence is determined by an infinite-dimensional system, yet can be effectively realized in practice through a finite-dimensional architecture. Our empirical results corroborate these theoretical findings.

[LG-31] Geometry-Aware Offline-to-Online Learning in Linear Contextual Bandits

链接: https://arxiv.org/abs/2604.24016
作者: Zean Han,Ruihan Lin,Zezhen Ding,Jiheng Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study offline-to-online learning in linear contextual bandits with biased offline regression data: the offline parameter need not match the online one, so history should not be treated as a single warm start. We model directional transfer with a shift certificate (M_\mathrmshift,\rho) and offline ridge estimation, yielding a geometry-aware confidence region for the online parameter rather than an isotropic radius. We propose \emphEllipsoidal-MINUCB, which combines a standard online branch with an offline-informed pooled branch and uses offline information only when it tightens uncertainty. With high probability, regret is bounded by the minimum of a standard SupLinUCB-style fallback and a pooled term that separates statistical width from a certificate-weighted shift penalty. Under a simple alignment condition, the pooled term further simplifies to a rate governed by an effective dimension induced by the offline geometry. We also show that a purely Euclidean (scalar) shift bound, by itself, does not determine which feature directions are transferable. Beyond this fixed certificate, we show how to learn a data-driven certificate from data at finitely many refresh times and establish a high-probability regret bound for Ellipsoidal-MINUCB with epoch-wise learned certificates. Experiments match the main prediction: gains are strongest at intermediate horizons when offline coverage and transferability align, while the method otherwise tracks the safe online baseline.

[LG-32] FedSLoP: Memory-Efficient Federated Learning with Low-Rank Gradient Projection

链接: https://arxiv.org/abs/2604.24012
作者: Yutong He,Zhengyang Huang,Jiahe Geng
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 27 pages, 7 figures

点击查看摘要

Abstract:Federated learning enables a population of clients to collaboratively train machine learning models without exchanging their raw data, but standard algorithms such as FedAvg suffer from slow convergence and high communication and memory costs in heterogeneous, resource-constrained environments. We introduce FedSLoP, a federated optimization algorithm that combines stochastic low-rank subspace projections of gradients, thereby reducing the dimension of communicated and stored updates while preserving optimization progress. On the theoretical side, we develop a detailed nonconvex convergence analysis under standard smoothness and bounded-variance assumptions, showing that FedSLoP is guaranteed to converge to a first-order stationary point at a rate of O(1/\sqrtNT) . On the empirical side, we conduct extensive experiments on federated MNIST classification with heterogeneous data partitions, showing that FedSLoP substantially reduces communication volume and client-side memory while achieving competitive or better accuracy compared with FedAvg and representative sparse or low-rank baselines. Together, our results demonstrate that random subspace momentum methods such as FedSLoP provide a principled and effective approach to communication- and memory-efficient federated learning. Codes are available at: this https URL.

[LG-33] Coverag e-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels

链接: https://arxiv.org/abs/2604.24008
作者: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Post-Training Quantization (PTQ) compresses large language models to low bit-widths using a small calibration set, and its quality depends strongly on which samples are chosen. We identify a failure mode in which calibration samples fail to activate outlier channels, hidden dimensions with unusually large activations, causing the quantizer to underestimate their dynamic range and producing per-channel reconstruction errors that dominate layer-wise loss. Motivated by this observation, we argue that PTQ calibration quality is governed more by weighted outlier-channel coverage than by generic sample representativeness, and formulate calibration selection as a weighted set cover problem over outlier channels. The objective is monotone submodular, and the greedy algorithm, COVERCAL, operates on pre-computed activation statistics and requires no GPU time at selection. We further show that the weight choice is internally consistent: under a stylized clipping model, missed weighted coverage upper-bounds surrogate loss, justifying the weighted coverage objective as principled rather than purely empirical. Across LLaMA-2, LLaMA-3, and Mistral, under AWQ and GPTQ backends and five downstream evaluations, COVERCAL improves over random, max-perplexity, max-activation-variance, and stratified baselines, with the largest gains at small calibration budgets. At INT4 with 128 samples, COVERCAL improves MMLU by 1.2 to 1.5 points over random calibration and reduces perplexity degradation by 15 to 30%; with 64 samples, it matches or exceeds random calibration at 256. The contribution is not a new PTQ backend but a formulation of calibration selection as weighted outlier coverage, with a simple, efficient algorithm and a surrogate-based justification.

[LG-34] Adaptive-Distribution Randomized Neural Networks for PDEs: A Low-Dimensional Distribution-Learning Framework

链接: https://arxiv.org/abs/2604.23999
作者: You Yang,Fei Wang
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Randomized neural networks (RaNNs) are attractive for partial differential equations (PDEs) because they replace expensive end-to-end training with a linear least-squares solve over randomized hidden features. Their practical performance, however, depends strongly on the sampling distribution of the hidden-layer parameters, which is usually chosen heuristically and problem by problem. This distribution sensitivity is a central bottleneck in randomized neural PDE solvers. In this work, we propose Adaptive-Distribution Randomized Neural Networks (AD-RaNN), a framework that promotes randomized feature generation from a fixed heuristic choice to a low-dimensional adaptive optimization problem. Instead of training all hidden weights and biases, AD-RaNN parameterizes the hidden-feature sampling distribution by a low-dimensional vector p and optimizes only p, thereby preserving the least-squares structure of RaNNs while reducing manual distribution tuning. The method uses a two-stage strategy: ridge-regularized reduced training for stable distribution-parameter optimization, followed by an unregularized least-squares refit for final solution recovery. We develop two adaptive mechanisms, PDE-Driven Adaptive Distribution (PDAD) and Data-Driven Adaptive Distribution (DDAD), and deploy them in space-time solvers, discrete-time solvers, and operator-learning models. We also incorporate an adaptive layer-growth enhancement for localized structures. For the reduced optimization problem, we establish well-posedness of the reduced objectives, consistency of ridge-regularized minimizers, an efficient gradient formula, and a practical lower-bound estimate for the ridge parameter. Numerical experiments on benchmark problems show that AD-RaNN provides an effective distribution-level adaptation mechanism, reduces reliance on hand-crafted hidden-feature distributions, and achieves strong empirical accuracy.

[LG-35] Continual Calibration: Coverag e Can Collapse Before Accuracy in Lifelong LLM Fine-Tuning

链接: https://arxiv.org/abs/2604.23987
作者: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continual learning for large language models is typically evaluated through accuracy retention under sequential fine-tuning. We argue that this perspective is incomplete, because uncertainty reliability can degrade earlier and more sharply than top-1 performance. We study this empirically by measuring conformal coverage and calibration error on sequentially fine-tuned models across three model families and eight task sequences drawn primarily from classification and multiple-choice benchmarks. Across the classification-style settings we study, coverage loss exceeds accuracy loss by a factor of roughly (3.4\times \pm 0.5\times) on average across seeds; in the most pronounced case, coverage drops from (0.92) to (0.61), while accuracy remains within three points of baseline. Standard continual-learning methods that preserve accuracy do not automatically preserve coverage, and naive calibration baselines recover only part of the gap. We propose calibration replay, a lightweight post-hoc procedure that maintains a task-specific held-out buffer and refits a task-specific conformal threshold under the current model after each update. It adds no training-time gradient cost, uses less than one percent of the memory of ordinary experience replay, and typically restores coverage to within two points of nominal at buffer size (m = 200). We accompany the empirical study with a drift decomposition, a finite-sample recovery theorem showing exact conformal validity under exchangeability, and a mixture-validity proposition explaining why pooled thresholds do not suffice. Our guarantees are stated for classification-style tasks with task-specific buffers; extensions to open-ended generation are exploratory.

[LG-36] Robust and Clinically Reliable EEG Biomarkers: A Cross Population Framework for Generalizable Parkinsons Disease Detection ALT

链接: https://arxiv.org/abs/2604.23933
作者: Nicholas R. Rasmussen,Longwei Wang,Rodrigue Rizk,Md Rezwanul Akter Pallab,Samuel Stuwart,Martina Mancini,Arun Singh,KC Santosh
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)
*备注: This is the non anonymized preprint corresponding to the version submitted to ACM Transactions on Computing for Healthcare. It is not the final typeset or accepted version

点击查看摘要

Abstract:Developing robust and clinically reliable EEG biomarkers requires evaluation frameworks that explicitly address cross population generalization in multi site settings such as Parkinsons disease (PD) detection. Models trained under i.i.d. assumptions often capture population specific artifacts rather than disease relevant neural structure, leading to poor generalization across clinical cohorts. EEG further amplifies this challenge due to low signal to noise ratio and heterogeneous acquisition conditions. We propose a population aware evaluation framework to assess the robustness and clinical reliability of EEG biomarkers under distribution shift. Using an n gram expansion strategy, we enumerate all cross population train test configurations across five independent cohorts, resulting in 75 directional evaluations. A nested cross validation design with integrated channel selection ensures prospective biomarker identification without population leakage. Results show that cross population transfer is asymmetric and that both accuracy and biomarker stability improve with increasing training population diversity, achieving up to 94.1% accuracy on held out cohorts. A theoretical analysis based on mixture risk optimization and hypothesis space contraction explains these trends, showing that multi population training promotes population robust representations. This work establishes a principled framework for learning robust, generalizable, and clinically reliable EEG biomarkers for multi site biomedical applications.

[LG-37] Gromov-Wasserstein Methods for Multi-View Relational Embedding and Clustering

链接: https://arxiv.org/abs/2604.23912
作者: Rafael Pereira Eufrazio,Eduardo Fernandes Montesuma,Charles Casimiro Cavalcante
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This manuscript is currently under review at the XLIV Simposio Brasileiro de Telecomunicacoes e Processamento de Sinais - SBrT (Brazilian Symposium on Telecommunications and Signal Processing ) 2026

点击查看摘要

Abstract:Learning low-dimensional representations from multi-view relational data is challenging when underlying geometries differ across views. We propose Bary-GWMDS, a Gromov-Wasserstein-based method that operates directly on distance matrices to learn a consensus embedding preserving shared relational structure. By leveraging intrinsic distances, the approach naturally handles nonlinear distortions across views. We also introduce Mean-GWMDS-C, a clustering-oriented formulation that averages distance matrices and learns reduced-support representations via a consensus Gromov-Wasserstein transport. Experiments on synthetic and real-world datasets show that the proposed framework yields stable and geometrically meaningful embeddings.

[LG-38] Machine Learning and Deep Learning Models for Short Term Electricity Price Forecasting in Australias National Electricity Market

链接: https://arxiv.org/abs/2604.23908
作者: Wei Lu,Jay Wang,Dingli Duan,Ding Mao,Caiyi Song,John Huang
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 28 pages, 5 figures

点击查看摘要

Abstract:Short term electricity price forecast is essential in competitive power markets, yet electricity price series exhibit high volatility, irregularity, and non-stationarity. This phenomenon is pronounced in the South Australian region of the National Electricity Market, where high renewable penetration drives price volatility and frequent negative price intervals, while structural changes such as the transition to five-minute settlement further complicate forecast. To address these challenges, this study develops a unified benchmark framework. Under identical data preprocessing, feature engineering with lag features, rolling statistics, cyclic temporal encodings, and so on, and an 85% to 15% chronological train test split, six algorithms are systematically compared, including AWMLSTM, CatBoost, GBRT, LSTM, LightGBM, and SVR. The results show that for price prediction, tree-based models, especially GBRT with an R squared value of 0.88, generally outperform LSTM and SVR. However, all models achieve a mean absolute percentage error above 90%, and more than 65% of GBRT predictions have relative errors above 10%, which highlights the inherent difficulty of price forecast. For demand prediction, all models perform substantially better than in price prediction. AWMLSTM and GBRT achieve an R2 value of 0.96 with mean absolute percentage error below 32%, and GBRT has 74.37% of samples within 5% error, while LSTM and SVR perform less accurately in both tasks. Future improvements should focus on hybrid models such as tree plus transformers, data augmentation for extreme events, and error correction to better capture price spikes.

[LG-39] Cardiac Stability Theory: An Axiomatically Grounded Framework for Continuous Cardiac Health Monitoring via Smartphone Photoplethysmography

链接: https://arxiv.org/abs/2604.23876
作者: Timothy Oladunni,Farouk Ganiyu Adewumi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present Cardiac Stability Theory (CST), an axiomatically grounded framework formally defining cardiovascular health as a stability margin around a cardiac dynamical attractor. From four axioms we derive the Cardiac Stability Index (CSI), a composite scalar in [0,1] integrating the largest Lyapunov exponent, recurrence determinism, and signal entropy via time-delay embedding. The ECG-based model (CSISurrogateV2, CNN-Transformer) achieves R^2=0.8788 , MAE =0.0234 on PTB-XL (21,799 recordings). We extend CSI to smartphone PPG via Complementary Domain Transfer (CDT): CSISurrogateV2 generates pseudo-labels for the BUT PPG dataset (48 recordings, 12 subjects), training TinyCSINet (122,849 parameters), achieving MAE =0.0557 , \rho=0.660 on the held-out test set ( n=1065 windows) at 30 ms mobile latency. CDT is validated on BIDMC, Welltory, and RWS-PPG. Paired validation on 5,035 BIDMC windows yields r=0.454 ( \rho=0.485 , p10^-295 ), confirming correlated cardiac stability across modalities. CSI is negatively correlated with age (slope = -0.000225 CSI/year, PTB-XL), discriminates atrial fibrillation from normal sinus rhythm (AUROC =0.89 ), and is robust under Perturbation Invariance Training (max AUC drop 1.65%). We derive HeartSpan, a longitudinal stability metric relative to population age norms, enabling continuous non-invasive cardiac monitoring from commodity smartphones for longevity tracking and cardiac risk stratification.

[LG-40] Learning Interpretable PDE Representations for Generative Reconstructions with Structured Sparsity

链接: https://arxiv.org/abs/2604.23867
作者: Valerie Tsao,Nathaniel Chaney,Manolis Veveakis
类目: Machine Learning (cs.LG)
*备注: 28 pages, 20 figures

点击查看摘要

Abstract:Scientific measurements are often bottlenecked by suboptimal conditions, whether that be noise, incomplete spatial coverage, or limited resolution, rendering accurate field reconstruction a difficult task. We introduce LatentPDE, a latent diffusion framework designed to simultaneously resolve sparse-observation reconstruction and super-resolution. While existing physics-guided diffusion models typically rely on soft loss penalties or uninterpretable representations, our approach enforces physical compliance by constructing an inherently interpretable latent space. Specifically, we parameterize the latent variables directly as the coefficients and source terms of an assumed governing PDE. In doing so, LatentPDE is able to reliably reconstruct dynamics across highly disparate and structured data gaps. Empirical results on diverse configurations demonstrate that our model achieves high-fidelity recovery at any desired resolution while also tracking the underlying predictive uncertainty.

[LG-41] JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training

链接: https://arxiv.org/abs/2604.23838
作者: Zhengding Hu,Hehua Ouyang,Chang Chen,Zaifeng Pan,Yue Guan,Zhongkai Yu,Zhen Wang,Steven Swanson,Yufei Ding
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present JigsawRL, a cost-efficient framework that explores Pipeline Multiplexing as a new dimension of RL parallelism. JigsawRL decomposes each pipeline into a Sub-Stage Graph that exposes the intra-stage and inter-worker imbalance hidden by stage-level systems. On this abstraction, JigsawRL resolves multiplexing interference through dynamic resource allocation, eliminates fragmented utilization by migrating long-tail rollouts across workers, and formulates their coordination as a graph scheduling problem solved with a look-ahead heuristic. On 4-64 H100/A100 GPUs across different agentic RL pipelines and models, JigsawRL achieves up to 1.85x throughput over Verl on synchronous RL, 1.54x over StreamRL and AReaL on asynchronous RL, and supports heterogeneous pipelines with moderate latency trade-off.

[LG-42] SeqShield: A Behavioral Analysis Approach to Uncover Rootkits

链接: https://arxiv.org/abs/2604.23812
作者: Paras Ghodeshwar,Sandeep K Shukla,Anand Handa,Nitesh Kumar
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 15 pages, 1 Algorithm, 1 Architecute Digram. Model training on both relevant features and irrelevant features with featured extraction method is explored

点击查看摘要

Abstract:Rootkits are among the most elusive types of malware, capable of bypassing traditional static analysis methods due to their metamorphic behavior. Signature-based detection techniques struggle against these threats, necessitating a shift toward dynamic analysis approaches. We propose SeqShield, a behavior-based rootkit detection approach designed specifically for the Windows OS, leveraging API call sequences for dynamic behavior analysis. Instead of relying on static signatures, SeqShield examines the execution patterns of API calls, which inherently reflect malicious intent. Analyzing API sequences, we can effectively identify rootkit-like behavior. We also employed a metamorphic code engine to generate 10X mutated variants of rootkits, demonstrating their obfuscation strategies. SeqShield applies n-gram analysis to extract bigram and trigram features from these API call sequences, enabling effective detection of rootkit-like activity. Among the models tested, Random Forest achieves the highest accuracy of 97.27% (bigram) and 96.17% (trigram). To optimize performance and decrease the dimension, we apply feature importance ranking using the Gini Impurity Index, iteratively selecting the most significant features. The optimized lower-dimensional feature matrix significantly enhances detection efficiency without sacrificing accuracy. Using the optimized feature set, our approach achieves 96.72% accuracy for bigrams and 97.81% accuracy for trigrams.

[LG-43] Reparameterization through Coverings and Topological Weight Priors

链接: https://arxiv.org/abs/2604.23804
作者: Maxim Beketov,Pavel Snopov
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We generalise the reparameterization trick applied in variational autoencoders (VAEs) letting these have latent spaces of non-trivial topology - i.e. that of base manifolds covered with other ones, on which some technique for RT is available. That is possible since covering maps are measurable - moreover, in case of particular measure preservation property holding for the covering, one can establish an inequality on KL-divergence between pushforward (PF) densities on the base latent manifold, making the KL-term of VAE’s ELBO analytically tractable, despite the topological non-triviality of the supporting latent manifold. Our development follows a route close but somewhat alternative to reparameterization on Lie groups, the latest proposal for which is to reparameterize PFs of normal densities from the Lie algebra - “through” the exponential map, seen by us as sometimes a particular case of what we propose to call reparameterization through a covering. Covering maps need not be global diffeomorphisms (although Lie-exp maps, in general, need not either, but, to date only smooth ones were considered in this context, to the best of our knowledge), which makes many non-trivial topologies tamable to our proposed technique, that we detail on a particular such example. We demonstrate the working of our approach by constructing a VAE with the latent space of Klein bottle (not a Lie group) topology, which we call KleinVAE, successfully learning an appropriate artificial dataset. We discuss potential applicability of such topology-informed generative models as weight priors in Bayesian learning, particularly for convolutional vision models, where said manifold was peculiarly shown to have some relevance.

[LG-44] Causal Representation Learning from General Environments under Nonparametric Mixing AISTATS2025

链接: https://arxiv.org/abs/2604.23800
作者: Ignavier Ng,Shaoan Xie,Xinshuai Dong,Peter Spirtes,Kun Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to AISTATS 2025. This is a slightly revised version of the published paper

点击查看摘要

Abstract:Causal representation learning aims to recover the latent causal variables and their causal relations, typically represented by directed acyclic graphs (DAGs), from low-level observations such as image pixels. A prevailing line of research exploits multiple environments, which assume how data distributions change, including single-node interventions, coupled interventions, or hard interventions, or parametric constraints on the mixing function or the latent causal model, such as linearity. Despite the novelty and elegance of the results, they are often violated in real problems. Accordingly, we formalize a set of desiderata for causal representation learning that applies to a broader class of environments, referred to as general environments. Interestingly, we show that one can fully recover the latent DAG and identify the latent variables up to minor indeterminacies under a nonparametric mixing function and nonlinear latent causal models, such as additive (Gaussian) noise models or heteroscedastic noise models, by properly leveraging sufficient change conditions on the causal mechanisms up to third-order derivatives. These represent, to our knowledge, the first results to fully recover the latent DAG from general environments under nonparametric mixing. Notably, our results match or improve upon many existing works, but require less restrictive assumptions about changing environments.

[LG-45] A General Representation-Based Approach to Multi-Source Domain Adaptation ICML2025

链接: https://arxiv.org/abs/2604.23790
作者: Ignavier Ng,Yan Li,Zijian Li,Yujia Zheng,Guangyi Chen,Kun Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: ICML 2025

点击查看摘要

Abstract:A central problem in unsupervised domain adaptation is determining what to transfer from labeled source domains to an unlabeled target domain. To handle high-dimensional observations (e.g., images), a line of approaches use deep learning to learn latent representations of the observations, which facilitate knowledge transfer in the latent space. However, existing approaches often rely on restrictive assumptions to establish identifiability of the joint distribution in the target domain, such as independent latent variables or invariant label distributions, limiting their real-world applicability. In this work, we propose a general domain adaptation framework that learns compact latent representations to capture distribution shifts relative to the prediction task and address the fundamental question of what representations should be learned and transferred. Notably, we first demonstrate that learning representations based on all the predictive information, i.e., the label’s Markov blanket in terms of the learned representations, is often underspecified in general settings. Instead, we show that, interestingly, general domain adaptation can be achieved by partitioning the representations of Markov blanket into those of the label’s parents, children, and spouses. Moreover, its identifiability guarantee can be established. Building on these theoretical insights, we develop a practical, nonparametric approach for domain adaptation in a general setting, which can handle different types of distribution shifts.

[LG-46] WISE-FM:Operation-Aware Engineering-Informed Foundation Model for Multi-Task Well Design

链接: https://arxiv.org/abs/2604.23767
作者: Carine de Menezes Rebello,Anderson Rapello dos Santos,Idelfonso B. R. Nogueira
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deploying machine learning models across diverse well portfolios requires generalisation to wells with design parameters outside the training distribution. Current data-driven approaches to virtual flow metering (VFM) and bottomhole estimation typically treat each well independently or ignore the influence of well design on operational behaviour. We present WISE (Well Intelligence and Systems Engineering Foundation Model), a design-aware, physics-informed multi-task model that integrates three complementary mechanisms: Feature-wise Linear Modulation (FiLM) and cross-modal attention to condition operational embeddings on well design parameters; multi-task learning for simultaneous prediction of flow rates, bottomhole conditions, and flow regime classification; and structural mass conservation with soft physics constraints derived from well engineering principles. Evaluation on the ManyWells benchmark (2000 simulated wells, 10^6 data points) demonstrates that design-aware models reduce VFM prediction error by up to 13\times compared to design-unaware baselines, and that physics constraints reduce negative flow predictions by 65%. Flow regime classification achieves 97.7% bottomhole accuracy, providing continuous well integrity monitoring without additional sensors. The methodology transfers to real operational data from five Equinor Volve producers (oil rate R^2 = 0.89 , bottomhole pressure R^2 = 0.98 , water rate R^2 = 0.97 ). The trained model additionally serves as a fast surrogate for integrity-aware well design optimisation over a 24-dimensional design space, with more than 1000\times speedup over drift-flux simulations. These results demonstrate that design awareness, physics enforcement, and multi-task learning are essential and complementary ingredients for foundation models intended to operate across large well portfolios.

[LG-47] Necessary and sufficient conditions for universality of Kolmogorov-Arnold networks

链接: https://arxiv.org/abs/2604.23765
作者: Vugar Ismailov
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Functional Analysis (math.FA)
*备注: 19 pages, 26 references

点击查看摘要

Abstract:We analyze the universal approximation property of Kolmogorov-Arnold Networks (KANs) in terms of their edge functions. If these functions are all affine, then universality clearly fails. How many non-affine functions are needed, in addition to affine ones, to ensure universality? We show that a single one suffices. More precisely, we prove that deep KANs in which all edge functions are either affine or equal to a fixed continuous function \sigma are dense in C(K) for every compact set K\subset\mathbbR^n if and only if \sigma is non-affine. In contrast, for KANs with exactly two hidden layers, universality holds if and only if \sigma is nonpolynomial. We further show that the full class of affine functions is not required; it can be replaced by a finite set without affecting universality. In particular, in the nonpolynomial case, a fixed family of five affine functions suffices when the depth is arbitrary. More generally, for every continuous non-affine function \sigma , there exists a finite affine family A_\sigma such that deep KANs with edge functions in A_\sigma\cup\sigma\ remain universal. We also prove that KANs with the spline-based edge parameterization introduced by Liu et al.~\citeLiu2024 are universal approximators in the classical sense, even when the spline degree and knot sequence are fixed in advance.

[LG-48] Agent ic Fusion of Large Atomic and Language Models to Accelerate Materials Discovery

链接: https://arxiv.org/abs/2604.23758
作者: Mingze Li,Yu Rong,Songyou Li,Lihong Wang,Jiacheng Cen,Liming Wu,Anyi Li,Zongzhao Li,Qiuliang Liu,Rui Jiao,Tian Bian,Pengju Wang,Hao Sun,Jianfeng Zhang,Ji-Rong Wen,Deli Zhao,Shifeng Jin,Tingyang Xu,Wenbing Huang
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:The discovery of novel materials is critical for global energy and quantum technology transitions. While deep learning has fundamentally reshaped this landscape, existing predictive or generative models typically operate in isolation, lacking the autonomous orchestration required to execute the full discovery process. Here we present ElementsClaw, an agentic framework for materials discovery that synergizes Large Atomic Models (LAMs) with Large Language Models (LLMs). In response to varied human requirements, ElementsClaw dynamically orchestrates a suite of LAM tools finetuned from our proposed model Elements for atomic-scale numerical computation, while leveraging LLMs for high-level semantic reasoning. This shift moves AI-driven materials science from isolated processes toward integrated and human interactive discovery. In the demanding domain of superconductors, our agentic system guides the experimental synthesis of four new superconductors, including Zr3ScRe8 with a transition temperature of 6.8 K and HfZrRe4 at 6.7 K. At scale, ElementsClaw screens more than 2.4 million stable crystals within only 28 GPU hours, identifying 68,000 high-confidence superconducting candidates and vastly expanding the known superconducting space. These results demonstrate how our agent accelerates materials discovery with high physical fidelity.

[LG-49] ransformer as an Euler Discretization of Score-based Variational Flow

链接: https://arxiv.org/abs/2604.23740
作者: Huadong Liao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the Transformer’s dominance across machine learning, its architecture remains largely heuristic and lacks a unified theoretical foundation. We introduce Score-based Variational Flow (SVFlow), a continuous-time dynamical system for representation learning in which the state evolves according to a variational posterior-weighted average of conditional log-likelihood scores, and provide a principled basis for regularization through variational consistency. We show that forward Euler discretization of spherical SVFlow exactly recovers the Transformer architecture. Multi-head attention approximates SVFlow vector field via a vMF kernel-smoothed posterior, while MoE/FFN approximates it in a relaxed network-based way, and the residual-normalization block implements a relaxed retraction that maintains spherical geometry. This unification explains why attention trains stably without explicit regularization while MoE requires auxiliary balancing losses. Experiments on pre-trained language models with prefix shuffling show that SVFlow-induced metrics correlate with task performance, reveal depth-dependent sensitivity, and reflect the intrinsic dynamics of attention.

[LG-50] Quasi-Equivariant Metanetworks ICLR2026

链接: https://arxiv.org/abs/2604.23720
作者: Viet-Hoang Tran,An Nguyen,Benoît Guérand,Thieu N. Vo,Tan M. Nguyen
类目: Machine Learning (cs.LG)
*备注: Accepted to ICLR 2026

点击查看摘要

Abstract:Metanetworks are neural architectures designed to operate directly on pretrained weights to perform downstream tasks. However, the parameter space serves only as a proxy for the underlying function class, and the parameter-function mapping is inherently non-injective: distinct parameter configurations may yield identical input-output behaviors. As a result, metanetworks that rely solely on raw parameters risk overlooking the intrinsic symmetries of the architecture. Reasoning about functional identity is therefore essential for effective metanetwork design, motivating the development of equivariant metanetworks, which incorporate equivariance principles to respect architectural symmetries. Existing approaches, however, typically enforce strict equivariance, which imposes rigid constraints and often leads to sparse and less expressive models. To address this limitation, we introduce the novel concept of quasi-equivariance, which allows metanetworks to move beyond the rigidity of strict equivariance while still preserving functional identity. We lay down a principled basis for this framework and demonstrate its broad applicability across diverse neural architectures, including feedforward, convolutional, and transformer networks. Through empirical evaluation, we show that quasi-equivariant metanetworks achieve good trade-offs between symmetry preservation and representational expressivity. These findings advance the theoretical understanding of weight-space learning and provide a principled foundation for the design of more expressive and functionally robust metanetworks.

[LG-51] Can an MLP Absorb Its Own Skip Connection?

链接: https://arxiv.org/abs/2604.23705
作者: Antonij Mijoski,Marko Karbevski
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study when a skip connection around a single-hidden-layer MLP can be absorbed into a residual-free MLP of the same width. We first show that for any architecture whose skip branch is an invertible linear map (including Hyper-Connections and their manifold-constrained variants), the problem reduces to the identity skip case. For homogeneous activations of degree k \neq 1 , such as ReLU ^2 and ReGLU, absorption is unconditionally impossible by a degree argument. For gated activations whose gate is differentiable at the origin with g(0) = 0 , including SwiGLU and GeGLU, a linearization argument gives the same conclusion. These impossibility results extend to arbitrary depth: a composition of L residual blocks using such activations cannot be replicated by any composition of L residual-free blocks of the same width. For ungated ReLU and GELU, the situation is richer. For generic weight matrices, absorption holds at the single-block level if and only if there exists an index set S of size at least d such that W_\mathrmdown[:,S],W_\mathrmup[S,:] = -I_d . This condition is non-generic (it fails with probability one under continuous weight distributions), so skip-connected and residual-free MLPs of the same width represent generically disjoint function classes. Whether this disjointness persists for deep compositions of ReLU or GELU blocks remains open.

[LG-52] Beyond coauthorship: semantic structure and phantom collaborators in transportation research 1967–2025

链接: https://arxiv.org/abs/2604.23699
作者: Seongjin Choi
类目: Digital Libraries (cs.DL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a semantic-structural atlas of transportation research built from 120,323 papers across 34 peer-reviewed journals published between 1967 and 2025, roughly an order of magnitude larger than and a decade beyond Sun and Rahwan’s~(2017) coauthorship study. We use OpenAlex and Crossref as open, CC0-licensed data sources, resolve author identity through OpenAlex author IDs, ORCID records, and manual alias resolution, and embed every paper with SPECTER2 with Arora-style whitening concatenated with concept TF–IDF and venue linear-discriminant projections. On this substrate we report three findings. First, Leiden on the author-level semantic k-nearest-neighbor graph yields 23 topic communities that agree only weakly with the 172 coauthor communities (normalized mutual information 0.23 ), opening room for a predictive layer that neither source encodes alone. Second, a multiplex Leiden partition combining both edge types recovers 181 communities and localizes where collaboration and topic structure decouple. Third – the paper’s core methodological contribution – we define \emphphantom collaborators, pairs of authors who are top- K semantic neighbors yet \geq 3 hops apart in the coauthor graph, and show via a temporal hold-out (training cutoff 2019) that phantom pairs become real coauthors in 2020–2025 at a rate 16 to 33 times above random, popularity-weighted, and same-venue baselines, with a 68 -fold monotone gradient between the highest- and lowest-similarity buckets. All artifacts are released as a live, reproducible web atlas at this https URL.

[LG-53] Hardware-Efficient Softmax and Layer Normalization with Guaranteed Normalization for Edge Devices ISCAS

链接: https://arxiv.org/abs/2604.23647
作者: Dawon Choi,Hana Kim,Ji-Hoon Kim
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Accepted by 2026 IEEE International Symposium on Circuits and Systems (ISCAS)

点击查看摘要

Abstract:In Transformer models, non-GEMM (non-General Matrix Multiplication) operations – especially Softmax and Layer Normalization (LayerNorm) – often dominate hardware cost due to their nonlinear nature. To address this, previous approximation studies mainly target rank-oriented tasks, which is acceptable for classification. However, edge Natural Language Processing (NLP) applications and edge generative AI are largely evaluated based on score-oriented tasks, so normalization-guaranteed non-GEMM operations are essential. We propose a hardware-efficient Softmax and LayerNorm with Guaranteed Normalization for Edge devices. Our design employs hardware-efficient approximation methods while preserving the normalization (Softmax: \sum p = 1 , LayerNorm: \sigma = 1 ). Our architecture is described in Verilog HDL and synthesized using the Samsung 28nm CMOS process. In accuracy evaluation, we achieve high accuracy with minimal degradation: GLUE +0.07%, SQuAD -0.01%, perplexity -0.09%. Implementation results show that our architecture is small: 942,\mu m^2 for Softmax, 1199,\mu m^2 for LayerNorm. Compared to the state of the art, we achieve up to 11x and 14x reduction in area, respectively.

[LG-54] Characterizations of Admissible Objective Functions for Hierarchical Clustering

链接: https://arxiv.org/abs/2604.23628
作者: Ryuki Tsukuba,Kazutoshi Ando
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 22 pages, 3 figures. Submitted to Algorithmica

点击查看摘要

Abstract:Hierarchical clustering is a fundamental task in data analysis, yet for a long time it lacked a principled objective function. Dasgupta [STOC 2016] initiated a formal framework by introducing a discrete objective function for cluster trees. This framework was subsequently expanded by Cohen-Addad et al. [J. ACM 2019], who introduced the notion of admissibility – a criterion ensuring that, whenever the input similarities admit consistent hierarchical representations, the minimizers of an objective function recover them. They also provided a necessary and sufficient condition for admissibility within a broad class of objective functions, which we refer to as sum-type objective functions. Our contributions are twofold. First, we characterize admissible sum-type objective functions when the scaling function g is a symmetric polynomial of degree at most two, together with sufficient conditions in the degree-three case. For admissible objective functions in this class, we show that the recursive sparsest cut algorithm achieves an O( \phi )-approximation ratio, where \phi denotes the approximation factor of the sparsest cut subroutine. Second, we introduce a new class of objective functions for hierarchical clustering, which we term max-type objective functions, where cluster interactions are measured by maximum rather than aggregate similarity. For this class, we establish a general characterization of admissibility for arbitrary scaling functions g, and a complete characterization when g is a symmetric polynomial of degree at most two. These results provide new theoretical insights into admissible objective functions for hierarchical clustering and clarify the scope of algorithmic guarantees for optimizing them.

[LG-55] Hamiltonian Graph Inference Networks: Joint structure discovery and dynamics prediction for lattice Hamiltonian systems from trajectory data

链接: https://arxiv.org/abs/2604.23606
作者: Ru Geng,Panayotis Kevrekidis,Yixian Gao,Hong-Kun Zhang,Jian Zu
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注: 18 pages, 8 figures

点击查看摘要

Abstract:Lattice Hamiltonian systems underpin models across condensed matter, nonlinear optics, and biophysics, yet learning their dynamics from data is obstructed by two unknowns: the interaction topology and whether node dynamics are homogeneous. Existing graph-based approaches either assume the graph is given or, as in \alpha -separable graph Hamiltonian network, infer it only for separable Hamiltonians with homogeneous node dynamics. We introduce the Hamiltonian Graph Inference Network (HGIN), which jointly recovers the interaction graph and predicts long-time trajectories from state data alone, for both separable and non-separable Hamiltonians and under heterogeneous node dynamics. HGIN couples a structure-learning module – a learnable weighted adjacency matrix trained under a Hamilton’s-equations loss – with a trajectory-prediction module that partitions edges into physically distinct subgraphs via k -means clustering, assigning each subgraph its own encoder and thereby breaking the parameter-sharing bottleneck of conventional GNNs. On three benchmarks – a Klein–Gordon lattice with long-range interactions and two discrete nonlinear Schrödinger lattices (homogeneous and heterogeneous) – HGIN reduces long-time energy prediction error and trajectory prediction error by six to thirteen orders of magnitude relative to baselines. A symmetry argument on the Hamiltonian loss further shows that the learned weights encode the parity of the underlying pair potential, yielding an interpretable readout of the system’s interaction structure.

[LG-56] mingLLM : A Two-Stage Retrieval-Augmented Framework for Pre-Synthesis Timing Prediction from Verilog

链接: https://arxiv.org/abs/2604.23602
作者: Armin Abdollahi,Negin Ashrafi,Mehdi Kamal,Massoud Pedram
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Early, tool-free prediction of post-synthesis timing remains a key obstacle to rapid RTL iteration. We introduce TimingLLM, a two-stage retrieval-augmented LLM pipeline that estimates worst negative slack (WNS) and total negative slack (TNS) directly from Verilog. Stage 1 is a fine-tuned LLM that acts as a compact post-synthesis timing oracle, producing path-level arrivals/required times that are summarized into lightweight structural-timing cues (e.g., bag-of-gates counts, critical-path depth, gate-type patterns). Stage 2 is an LLM-based regressor that predicts WNS/TNS and applies a learned diagonal steering vector at the last transformer block, computed from the k nearest timing-labeled modules in a disjoint retrieval bank. On VerilogEval, TimingLLM attains R_WNS = 0.91 (MAPE 12%) and R_TNS=0.97 (MAPE 16%) while running 1.3-1.6 times faster than prior methods. Training uses a new 60k-module Verilog corpus with synthesis reports, which we will release. After training once, TimingLLM can be adapted to new technology libraries and PVT corners by refitting only a small regression head on 1000 labeled modules per setting, consistently outperforming state-of-the-art baselines.

[LG-57] When PINNs Go Wrong: Pseudo-Time Stepping Against Spurious Solutions

链接: https://arxiv.org/abs/2604.23528
作者: Sifan Wang,Shawn Koohy,Yiping Lu,Paris Perdikaris
类目: Machine Learning (cs.LG)
*备注: 41 pages, 18 figures

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) provide a promising machine learning framework for solving partial differential equations, but their training often breaks down on challenging problems, sometimes converging to physically incorrect solutions despite achieving small residual losses. This failure, we argue, is not merely an optimization difficulty. Rather, it reflects a fundamental weakness of the empirical PDE residual loss, which can admit trivial or spurious solutions during training. From this perspective, we revisit pseudo-time stepping, a technique that has recently shown strong empirical success in PINNs. We show that its main benefit is not simply to ease optimization; instead, when combined with collocation-point resampling, it helps reveal and avoid spurious solutions. At the same time, we find that the effectiveness of pseudo-time stepping depends critically on the choice of step size, which cannot be tuned reliably from the training loss alone. To overcome this limitation, we propose an adaptive pseudo-time stepping strategy that selects the step size from a finite-difference surrogate of the local residual Jacobian, yielding the largest step permitted by local stability without per-problem tuning. Across a diverse set of PDE benchmarks, the proposed method consistently improves both accuracy and robustness. Together, these findings provide a clearer understanding of why PINNs fail and suggest a practical pathway toward more reliable physics-informed learning. All code and data accompanying this manuscript are available at this https URL.

[LG-58] Multi-Plane HyperX: A Low-Latency and Cost-Effective Network for Large-Scale AI and HPC Systems

链接: https://arxiv.org/abs/2604.23519
作者: Ziyu Wang,Fei Lei,Dezun Dong
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: Preprint. Work in progress

点击查看摘要

Abstract:Multi-plane architectures have become increasingly prevalent in the Fat-Tree networks of AI data centers. By leveraging multiple ports on a single network interface card (NIC) or multiple NICs within a scale-up domain, each port or NIC is allocated to an independent network plane, thereby provisioning the overall system with multiple network planes. However, no prior literature has explored the application of multi-plane technologies to direct networks such as HyperX. This paper investigates the multi-plane HyperX network and demonstrates that, compared to state-of-the-art network topologies like multi-plane Fat-Tree, Dragonfly, and Dragonfly+, the multi-plane HyperX architecture achieves a significantly smaller network diameter and superior cost-effectiveness.

[LG-59] Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation

链接: https://arxiv.org/abs/2604.23488
作者: Lichen Li,Hengguang Zhou,Yijun Liang,Tianyi Zhou,Cho-Jui Hsieh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reward hacking in code generation, where models exploit evaluation loopholes to obtain full reward without correctly solving the tasks, poses a critical challenge for Reinforcement Learning (RL) and the deployment of reasoning models. Existing studies have been conducted primarily on synthetic hacking trajectories. However, whether these synthetic behaviors faithfully represent naturally emerging hacking in the wild remains unclear. In this work, we present a systematic analysis of the synthetic vs. in-the-wild discrepancy in reward hacking. We examine to what extent hacking behaviors induced by prompting resemble those emerging during RL training, and whether monitors trained on synthetic trajectories generalize to naturally arising but previously unseen hacking. To scale up the curation of in-the-wild reward hacking trajectories, we modified Group Relative Policy Optimization (GRPO) by injecting conflicting unit tests as tracers and applying a “resampling-until-hack” mechanism. Through controlled comparisons between monitors trained on synthetic versus in-the-wild data, we find that (1) synthetic-data-trained monitors fail to generalize to “in-the-wild” hacking, and (2) monitors trained on our “in-the-wild” trajectories demonstrate stronger generalizability to unseen hacking types. Our results indicate that synthetic reward hacking data may not fully reflect natural reward hacking behaviors, and that relying solely on synthetic data can lead to misleading conclusions. The codebase is available at this https URL

[LG-60] GeoCert: Certified Geometric AI for Reliable Forecasting

链接: https://arxiv.org/abs/2604.23474
作者: Regina Zhang,Zongru Li,Honggang Wen,Xiaofeng Liu,Siu-Ming Yiu,Pietro Liò,Kwok-Yan Lam
类目: Machine Learning (cs.LG)
*备注: 15 pages, 4 figures

点击查看摘要

Abstract:Forecasting systems in science must be accurate, physically consistent, and certifiably reliable. Most existing models address prediction, constraint enforcement, and verification separately, limiting scalability and interpretability. We introduce GeoCert, a geometric AI framework that unifies forecasting, physical reasoning, and formal verification within a single differentiable computation. GeoCert formulates forecasting as evolution along a hyperbolic manifold, where negative curvature induces contraction dynamics, intrinsic robustness, and logarithmic-time certification. A hierarchical constraint architecture separates universal physical laws from domain-specific dynamics, enabling certified generalization across energy, climate, finance, and transportation systems. GeoCert achieves state-of-the-art accuracy while reducing computational cost by 97.5% and maintaining better certification rates. By embedding verification into the geometry of learning, GeoCert transforms forecasting from empirical approximation to formally verified inference, offering a scalable foundation for trustworthy, reproducible, and physically grounded scientific AI.

[LG-61] Machine learning models for estimating counterfactuals in a single-arm inflammatory bowel disease study

链接: https://arxiv.org/abs/2604.23465
作者: Dan Liu,Fida K. Dankar,Jennifer C. deBruyn,Amanda Ricciuto,Anne M. Griffiths,Thomas D. Walters,Khaled EI Emam
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Single-arm trials accelerate study timelines by reducing the number of patients that must be recruited for a concurrent control group. However, these designs require an alternative comparator to estimate treatment effects. One approach is to construct a virtual control arm using a machine learning (ML) model trained on external control data to predict the counterfactual outcomes of the treatment arm. Our aim in this study was to leverage virtual controls by developing and evaluating ML-based counterfactual outcome models trained on IFX-treated patients to predict 1-year steroid-free clinical remission (SFCR ) and a composite of C-reactive protein remission plus steroid-free clinical remission (CRP-SFCR) for ADA-treated pediatric Crohn’s disease patients, and to compare the resulting IFX-versus-ADA treatment effect estimates with those obtained using propensity score matching to external controls. Five ML models were used to train counterfactual models on the observed IFX cohort data. The resulting models were used to predict the counterfactual outcomes for the ADA arm patients. LGBM yields the best OR closest to the propensity score matched reference, and all 95% CI results align with the conclusion from the reference study that no statistical difference in the primary and secondary outcomes has been observed between the patients treated with ADA or IFX. Our study supports virtual controls as a viable and effective substitute for expensive, lengthy or unethical patient recruitment in an inflammatory bowel disease (IBD) trial. The developed gradient boosted prediction model can be used as a pretrained model to generate IFX counterfactual predictions in future studies, pending external validation and assessment of transportability.

[LG-62] Scalable and Verifiable Federated Learning for Cross-Institution Financial Fraud Detection

链接: https://arxiv.org/abs/2604.23437
作者: Prajwal Panth,Nishant Nigam
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 8 pages, 7 figures. Preprint

点击查看摘要

Abstract:The global financial ecosystem confronts a critical asymmetry: while fraud syndicates operate as borderless, distributed networks, banking institutions remain constrained by regulatory data silos, limiting visibility into cross-institutional threat patterns under strict privacy laws such as GDPR. Although Federated Learning (FL) enables collaborative training, existing protocols impose a trade-off among scalability, privacy, and integrity. Homomorphic encryption schemes are computationally expensive, while pairwise masking protocols require O(N^2) key exchanges and lack mechanisms to detect malformed updates. Existing defenses also remain vulnerable to gradient inversion attacks that can reconstruct sensitive transaction data. To address these limitations, we propose Dynamic Sharded Federated Learning (DSFL), a verifiable secure aggregation framework for cross-institution financial fraud detection. DSFL replaces mesh topologies with Dynamic Stochastic Sharding, reducing communication complexity from O(N^2) to O(N m), where m is a fixed shard size, achieving linear scalability. To mitigate insider threats, we introduce Linear Integrity Tags, an additive-homomorphic commitment mechanism that enables probabilistic verification of submitted updates without the overhead of zero-knowledge proofs, while not enforcing semantic correctness. Additionally, the Active Neighborhood Recovery protocol ensures robust aggregation under participant dropouts. Empirical evaluation on the Credit Card Fraud Detection Dataset (ULB) demonstrates an approximately 33x latency reduction compared to Paillier-based secure aggregation, while maintaining strong resilience under simulated failures. These results position DSFL as a practical foundation for scalable and privacy-preserving collaborative fraud detection. Comments: 8 pages, 7 figures. Preprint Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG) ACMclasses: I.2.6; K.4.1 Cite as: arXiv:2604.23437 [cs.CR] (or arXiv:2604.23437v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2604.23437 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-63] Approximating Uniform Random Rotations by Two-Block Structured Hadamard Rotations in High Dimensions

链接: https://arxiv.org/abs/2604.23418
作者: Tomer Zilca,Gal Mendelson
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:Uniform random rotations are a useful primitive in applications such as fast Johnson-Lindenstrauss embeddings, kernel approximation, communication-efficient learning, and recent AI compression pipelines, but they are computationally expensive to generate and apply in high dimensions. A common practical replacement is repeated structured random rotations built from Walsh-Hadamard transforms and random sign diagonals. Applying the structured random rotation twice has been shown empirically to be useful, but the supporting theory is still limited. In this paper we study the approximation quality achieved when using this two-block structured Hadamard rotation. Our results are both positive and negative. On the positive side, we prove that every fixed coordinate of the two-block transform converges uniformly, over all inputs, to the corresponding coordinate of a uniformly rotated vector, with an explicit Kolmogorov-distance bound of order d^-1/5 . On the negative side, we prove an explicit lower bound on the Wasserstein distance between the full vector distributions, showing that the two-block transform is not a globally accurate surrogate for a uniform random rotation in the worst case. For the extremal input used in the lower bound, we also prove a matching asymptotic upper bound, showing that the lower-bound scale is sharp for that input. Taken together, the results identify a clear separation between one-dimensional marginal behavior, where approximation improves with dimension, and full high-dimensional geometry, where a nonvanishing discrepancy remains. This provides a partial theoretical explanation for the empirical success of structured Hadamard rotations in some algorithms, while also clarifying the limitations of treating them as drop-in replacements for true uniform random rotations. Subjects: Machine Learning (cs.LG); Performance (cs.PF) Cite as: arXiv:2604.23418 [cs.LG] (or arXiv:2604.23418v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.23418 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-64] Domain-Adapted Fine-Tuning of ECG Foundation Models for Multi-Label Structural Heart Disease Screening

链接: https://arxiv.org/abs/2604.23385
作者: Duc N. Do,Minh N. Do,Dang Nguyen,Khanh T.Q. Le,Khoa D. Pham,Hung N. Huynh,Phi Pham-Van-Hoang,Quan K. Huynh,Ramez M. Odat,Perisa Ashar,Ethan Philip Lowder,Minh H.N. Le,Hoang Le,Phat V.H. Nguyen,Quan Le,Jacques Kpodonu,Phat K. Huynh
类目: Machine Learning (cs.LG)
*备注: Accepted to Canadian AI 2026

点击查看摘要

Abstract:Transthoracic echocardiography is the reference standard for confirming structural heart disease (SHD), but first-line screening is limited by cost, workflow burden, and specialist availability. We evaluated whether open pretrained electrocardiogram (ECG) foundation models can support echo-confirmed multi-label SHD detection using the public EchoNext Mini-Model benchmark. Six echocardiography-derived abnormalities were targeted: reduced left ventricular ejection fraction, increased left ventricular wall thickness, aortic stenosis, mitral regurgitation, tricuspid regurgitation, and right ventricular systolic dysfunction. Under a common pipeline, we compared engineered ECG features with gradient boosting, end-to-end waveform learning from scratch, and transfer from open ECG foundation models. We then applied in-domain self-supervised adaptation of an ECG foundation model (ECG-FM) on EchoNext waveforms followed by selective supervised fine-tuning, and evaluated trade-offs between discrimination and adaptation cost. Adapted ECG-FM models achieved the best overall performance: peak macro-AUROC 0.8509 and macro-AUPRC 0.4297, while a parameter-efficient operating point preserved AUROC (0.8501) and attained the highest fixed-threshold macro-F1 0.3691. Late fusion with covariates did not improve threshold-independent discrimination, and evaluated LoRA, alternative backbones, and mixture-of-foundations strategies did not surpass the best adapted single-backbone models. These results indicate that for ECG-based case finding and echocardiography triage, combining target-domain self-supervised adaptation with selective supervised updating of a pretrained ECG backbone is the most effective transfer strategy.

[LG-65] When Context Sticks: Studying Interference in In-Context Learning

链接: https://arxiv.org/abs/2604.23371
作者: Hanna Rød,Dagny Streit,Nils Valseth Selte,Justin Li
类目: Machine Learning (cs.LG)
*备注: 14 pages, 6 figures, 2 tables. Code available at: this https URL

点击查看摘要

Abstract:This paper investigates context stickiness in in-context learning (ICL), a phenomenon where earlier examples in a prompt interfere with a transformer’s ability to adapt to later tasks. Using synthetic regression tasks over linear and quadratic functions, we examine how models trained under sequential, mixed, and random curricula handle abrupt task switches during inference. By sweeping over structured combinations of misleading linear examples followed by recovery quadratic examples, we quantify how prior context biases prediction error and how quickly models realign. Our results show strong evidence of persistent interference: more preceding linear examples reliably degrade quadratic predictions, while additional quadratic examples reduce error but with diminishing returns. We further find that training curricula significantly modulate resilience, with sequential training on the target function class yielding the fastest recovery, and surprisingly, random training producing the least robust behavior.

[LG-66] EMPO: Transformers for Temporal Disease Progression from Cross-Sectional Data ALT

链接: https://arxiv.org/abs/2604.23368
作者: Hongtao Hao,Joseph L. Austerweil
类目: Machine Learning (cs.LG)
*备注: 31 pages; Published at Conference on Health, Inference, and Learning (CHIL) 2026

点击查看摘要

Abstract:Event-Based Models (EBMs) infer biomarker progression from cross-sectional data but typically only as ordinal sequences and rely on rigid model assumptions. We propose \textscTempo, a Transformer architecture that learns both ordinal and continuous event sequences through simulation-based supervised learning. \textscTempo uses two Transformer modules: one treats biomarkers as tokens to infer event sequencing; the other treats patients as tokens, representing each by their per-biomarker abnormality profile, to infer patients’ disease stages. On synthetic benchmarks, \textscTempo reduces normalized Kendall’s Tau distance by 52.89% and staging MAE by 25.33% compared to state-of-the-art SA-EBM, with larger reductions in high-dimensional settings (58.88% and 61.10%). Applied to ADNI, \textscTempo recovers a biologically plausible Alzheimer’s progression: early medial temporal atrophy, followed by amyloid accumulation and cognitive decline, and late-stage tau pathology with terminal acceleration of global neurodegeneration – broadly consistent with established disease models. \textscTempo also eliminates the need to derive custom inference algorithms and enables rapid empirical comparison of generative hypotheses.

[LG-67] UniAda: Universal Adaptive Multi-objective Adversarial Attack for End-to-End Autonomous Driving Systems

链接: https://arxiv.org/abs/2604.23362
作者: Jingyu Zhang,Jacky Wai Keung,Yan Xiao,Yihan Liao,Yishu Li,Xiaoxue Ma
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Published at IEEE Transactions on Reliability journal (2023)

点击查看摘要

Abstract:Adversarial attacks play a pivotal role in testing and improving the reliability of deep learning (DL) systems. Existing literature has demonstrated that subtle perturbations to the input can elicit erroneous outcomes, thereby substantially compromising the security of DL systems. This has emerged as a critical concern in the development of DL-based safety-critical systems like Autonomous Driving Systems (ADSs). The focus of existing adversarial attack methods on End-to-End (E2E) ADSs has predominantly centered on misbehaviors of steering angle, which overlooks speed-related controls or imperceptible perturbations. To address these challenges, we introduce UniAda, a multi-objective white-box attack technique with a core function that revolves around crafting an image-agnostic adversarial perturbation capable of simultaneously influencing both steering and speed controls. UniAda capitalizes on an intricately designed multi-objective optimization function with the Adaptive Weighting Scheme (AWS), enabling the concurrent optimization of diverse objectives. Validated with both simulated and real-world driving data, UniAda outperforms five benchmarks across two metrics, inducing steering and speed deviations from 3.54 degrees to 29 degrees and 11 km per hour to 22 km per hour on average. This systematic approach establishes UniAda as a proven technique for adversarial attacks on modern DL-based E2E ADSs.

[LG-68] GeoFunFlow-3D: A Physics-Guided Generative Flow Matching Framework for High-Fidelity 3D Aerodynamic Inference over Complex Geometries

链接: https://arxiv.org/abs/2604.23350
作者: Ruiling Jiang,Yong Zhang,Houbiao Li
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 40 pages, 18 figures,4 tables

点击查看摘要

Abstract:Deep generative models and neural operators have demonstrated significant potential for 3D aerodynamic inference. However, they often face inherent challenges in maintaining physical consistency and preserving high-frequency features, primarily due to spectral bias and gradient conflicts within the governing equations. To address these issues, we propose GeoFunFlow-3D, a physics-guided generative flow matching framework. Temporally, we utilize optimal transport theory to build the generation path, ensuring stable training dynamics. Spectrally, we introduce a high-order discrete engine without automatic differentiation (No-AD) to reduce gradient stiffness. Spatially, a topology-aware super-resolution module (SATO) is employed to rigorously enforce physical laws in localized regions such as shock waves. We evaluated our framework on complex industrial datasets. On the BlendedNet dataset, the model successfully avoids mode collapse even under sparse data conditions. For the NASA Rotor37 test, it accurately captures 3D detached shock structures. Compared to conventional operators, GeoFunFlow-3D significantly improves accuracy, reducing the pressure field error (RRMSE) to 0.0215 while maintaining competitive inference efficiency. Ultimately, this work provides a reliable, geometry-driven approach for generating high-dimensional fluid fields.

[LG-69] From Stateless Queries to Autonomous Actions: A Layered Security Framework for Agent ic AI Systems

链接: https://arxiv.org/abs/2604.23338
作者: Kexin Chu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 23 pages, 3 figures, 10 tables

点击查看摘要

Abstract:Agentic AI systems face security challenges that stateless large language models do not. They plan across extended horizons, maintain persistent memory, invoke external tools, and coordinate with peer agents. Existing security analyses organize threats by attack type (prompt injection, jailbreaking), but provide no principled model of which architectural component is vulnerable or over what timescale the threat manifests. This paper makes five contributions. First, we introduce the Layered Attack Surface Model (LASM), a seven-layer framework that maps threats to distinct architectural components: Foundation, Cognitive, Memory, Tool Execution, Multi-Agent Coordination, Ecosystem, and Governance, the accountability and observability layer that spans the stack analogously to the network management plane. Second, we introduce attack temporality as an orthogonal analytical dimension with four classes: Instantaneous (T1), Session-Persistent (T2), Cross-Session Cumulative (T3), and Sub-Session-Stack, Non-Session-Bounded (T4). Third, through a systematic review of 94 papers (2021–2025), we show that the most dangerous emerging threats concentrate at the intersection of high-layer attacks (L5–L7) and slow-burn temporality (T3–T4): covert agent collusion, long-term memory poisoning, MCP supply-chain compromise, and alignment failure that manifests as an insider threat with no external adversary. Only 8 of 120 paper-cell assignments (7%) fall in this zone. Fourth, we propose a cross-layer defense taxonomy spanning all seven LASM layers and all four temporality classes, exposing which threat classes existing defenses leave unaddressed. Fifth, we survey evaluation benchmarks, identify five research gaps in the under-studied high-layer, slow-burn zone, and argue that agentic security must be treated as a distributed systems problem embedded in an adversarial ecosystem.

[LG-70] CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning

链接: https://arxiv.org/abs/2604.23308
作者: Marcel Hedman,Kale-ab Abebe Tessera,Juan Claude Formanek,Anya Sims,Riccardo Zamboni,Trevor McInroe,John Torr,Elliot Fosong
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Offline multi-agent reinforcement learning (MARL) enables policy learning from fixed datasets, but is prone to coordination failure: agents trained on static, off-policy data converge to suboptimal joint behaviours because they cannot co-adapt as their policies change. We introduce CODA (Coordination via On-Policy Diffusion for Multi-Agent Reinforcement Learning), a diffusion-based multi-agent trajectory generator for data augmentation that samples conditioned on the current joint policy, producing synthetic experience which reflects the evolving behaviours of the agents, thereby providing a mechanism for co-adaptation. We find that previous diffusion-based augmentation approaches are insufficient for fostering multi-agent coordination because they produce static augmented datasets that do not evolve as the current joint policy changes during training; CODA resolves this by more closely simulating on-policy learning and is a meaningful step toward coordinated behaviours in the offline setting. CODA is algorithm-agnostic and can be layered onto both model-free and model-based offline reinforcement learning pipelines as an augmentation module. Empirically, CODA not only resolves canonical coordination pathologies in continuous polynomial games but also delivers strong results on the more complex MaMuJoCo continuous-control benchmarks.

[LG-71] Revisable by Design: A Theory of Streaming LLM Agent Execution

链接: https://arxiv.org/abs/2604.23283
作者: Zhiyuan Zhai,Ming Li,Xin Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current LLM agents operate under an implicit but universal assumption: execution is a transaction – the user submits a request, the agent works in isolation, and only upon completion does the dialogue resume. This forces users into a binary choice: wait for a potentially incorrect output, or interrupt and lose all progress. We reject this assumption and propose the stream paradigm, in which agent execution and user intervention are concurrent, interleaved processes sharing a bidirectional channel. We formalize this paradigm through a reversibility taxonomy that classifies every agent action as Idempotent, Reversible, Compensable, or Irreversible, and arrive at a core conclusion: an agent’s flexibility is bounded by its reversibility. We prove that conflicting compensable actions impose unavoidable adaptation costs and that conflicting irreversible actions make full specification satisfaction impossible – these costs are properties of the action space, not of the algorithm. Guided by this insight, we present the Revision Absorber, a reactive algorithm based on the Earliest-Conflict Rollback rule that is structurally optimal under mild assumptions. Experiments on StreamBench with real LLM agents validate all predictions: the Absorber matches the quality of a brute-force full-restart baseline while wasting an order of magnitude fewer steps of already-completed work, turning mid-execution revisions from a dead-end into a first-class interaction.

[LG-72] A Layer Separation Optimization Framework for Cross-Entropy Training in Deep Learning

链接: https://arxiv.org/abs/2604.23225
作者: Yaru Liu,Michael K. Ng,Yiqi Gu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:This paper investigates the deep learning optimization problem with softmax cross-entropy loss. We propose a layer separation strategy to alleviate the strong nonconvexity encountered during training deep networks. For cross-entropy models with fully connected and convolutional neural networks, we introduce auxiliary variables associated with hidden layer outputs and construct corresponding layer separation models, which decompose the original deeply nested optimization problem into a sequence of more manageable subproblems. We also conduct theoretical analyses, proving that the new layer separation loss provides an upper bound for the original cross-entropy loss. Moreover, we design alternating minimization algorithms and prove that, under appropriate conditions, these algorithms exhibit decreasing properties of the loss function. Numerical experiments validate the effectiveness of the proposed methods and indicate improved optimization behavior, especially for fully connected and convolutional neural networks.

[LG-73] ssera: Secure Near-Line-Rate Weight Streaming for UMA Edge Accelerators

链接: https://arxiv.org/abs/2604.23205
作者: Animan Naskar
类目: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 17 pages, 2 figures, 4 tables, 2 algorithms

点击查看摘要

Abstract:Deploying proprietary Deep Neural Networks (DNNs) on commodity edge devices demands hardware-backed Digital Rights Management (DRM) capable of withstanding both software-level and physical adversaries. In Unified Memory Architecture (UMA) systems, the host CPU and Neural Processing Unit (NPU) share physical DRAM, leaving plaintext model weights directly readable by a compromised OS kernel. Existing defenses fail in this constrained setting: trusted execution environments monopolize scarce memory with permanently reserved regions, while full-memory encryption operates at page granularity. This forces the system to fetch massive 4 KB memory pages for sub-page tensor tiles, severely crippling bandwidth. We present Tessera, a reference architecture for inline, cache-line granularity weight decryption on UMA edge accelerators. The design intercepts 64-byte AXI bursts, computing AES-256-CTR keystreams in parallel with DRAM fetches. This streams plaintext directly into isolated NPU SRAM, creating a transient memory footprint confined to the active tile and eliminating the need for permanent memory carve-outs. Measurements across three distinct SoC platforms demonstrate that this parallelization hides cryptographic latency behind standard DRAM fetch times, a condition that holds even under worst-case timing variations. Consequently, Tessera is projected to achieve 98.4% of the theoretical memory bandwidth ceiling (a mere 1.6% overhead). Across standard vision and language models, page-level memory encryption suffers up to a 32x bandwidth penalty, whereas Tessera maintains an optimal 1x footprint for all layer geometries. Finally, Tessera neutralizes major UMA-specific attack vectors – including physical DRAM extraction, rogue DMA, and compute hijacking – and formally prevents plaintext leakage across sparse tensors. Comments: 17 pages, 2 figures, 4 tables, 2 algorithms Subjects: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR); Machine Learning (cs.LG) Cite as: arXiv:2604.23205 [cs.CR] (or arXiv:2604.23205v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2604.23205 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-74] Follow the TRACE: Exploiting Post-Click Trajectories for Online Delayed Conversion Rate Prediction SIGIR2026

链接: https://arxiv.org/abs/2604.23197
作者: Xinyue Zhang,Yuanhao Ding,Xiang Ao
类目: Machine Learning (cs.LG)
*备注: Accepted as a SIGIR 2026 short paper

点击查看摘要

Abstract:Delayed feedback poses a core challenge for online CVR prediction, forcing a trade-off between label accuracy and data freshness. Existing methods address this through delay modeling or sample reweighting, yet neglect how post-click behaviors evolve over the observation period. To overcome this limitation, we formalize this evolution as feedback trajectory and propose TRACE. Instead of forcing hard labels on unrevealed samples, our method evaluates how well the accumulated feedback status aligns with conversion versus non-conversion, dynamically refining posteriors without waiting for final outcomes. To counteract early-stage trajectory sparsity, we further design a reliability-gated retrospective completer that leverages full-lifecycle data to provide adaptive posterior guidance for unrevealed samples. Extensive experiments validate TRACE’s superiority over state-of-the-art baselines and confirm the retrospective completion module as a model-agnostic enhancer for existing systems. Our code is available at this https URL.

[LG-75] Well-Conditioned Oblivious Perturbations in Linear Space

链接: https://arxiv.org/abs/2604.23193
作者: Shabarish Chenakkod,Michał Dereziński,Xiaoyu Dong,Mark Rudelson
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Perturbing a deterministic n -dimensional matrix with small Gaussian noise is a cornerstone of smoothed analysis of algorithms [Spielman and Teng, JACM 2004], as it reduces the condition number of the input to O(n) , and with it the complexity of many matrix algorithms. However, when deployed algorithmically, these perturbations are expensive due to the cost of generating and storing n^2 Gaussian random variables. We propose a perturbation that requires generating and storing O(n) random numbers in O(\log n) bits of precision, and reduces the condition number of any deterministic matrix to O(n) , matching Gaussian perturbations. Our result in particular implies a better complexity for the perturbed conjugate gradient algorithm, showing that we can solve an n\times n linear system in linear space to within an arbitrarily small constant backward error using O(n) matrix-vector products. In our construction, we introduce the concept of a pattern matrix, which is a dense deterministic matrix that maps all sparse vectors into dense vectors, and we combine it with a sparse perturbation whose entries are dependent and located in a non-uniform fashion. In order to analyze this construction, we develop new techniques for lower bounding the smallest singular value of a random matrix with dependent entries. Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Machine Learning (stat.ML) Cite as: arXiv:2604.23193 [cs.DS] (or arXiv:2604.23193v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2604.23193 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-76] A Unified Fractional Regularization Framework for Sparse Recovery

链接: https://arxiv.org/abs/2604.23184
作者: Yinhao Zhao,Haoyu He,Chuanqi Ma,Hao Wang
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a unified fractional regularization framework for sparse signal recovery based on the \ell_1/\ell_p^q model. Our main theoretical contribution is the characterization of the equivalence between the first-order stationary points of the \ell_1/\ell_p^q formulation and the subtractive \ell_1 - \alpha \ell_p model, providing a unified perspective on these nonconvex regularizers. In addition, we establish a new sufficient recovery condition under the Restricted Isometry Property (RIP), showing that the framework’s robustness even under high-coherence sensing matrices. To solve the resulting problem, we develop a majorization-minimization (MM) algorithm and prove its convergence via the Kurdyka-Lojasiewicz (KL) property. Numerical experiments on different sensing matrices and MRI reconstruction demonstrate that the proposed approach consistently outperforms existing methods.

[LG-77] Efficient VQ-QAT and Mixed Vector/Linear quantized Neural Networks

链接: https://arxiv.org/abs/2604.23172
作者: Terry Gou,Puneet Gupta
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:In this work, we developed and tested 3 techniques for vector quantization (VQ) based model weight compression. To mitigate codebook collapse and enable end-to-end training, we adopted cosine similarity-based assignment. Building on ideas from attention-based formulations in Differentiable K-Means (DKM), we further improved this approach by using cosine similarity for assignment combined with top-1 sampling and a straight-through estimator, thereby eliminating the need for weighted-average reconstruction. Finally, we investigated the use of differentiable neural architecture search (NAS) to adaptively select layer-wise quantization configurations, further optimizing the compression process. Although our method does not consistently outperform existing approaches across all quantization levels, it provides useful insights into the design trade-offs and behaviors of VQ-based model compression methods.

[LG-78] Surface Sensitivity in Lean 4 Autoformalization

链接: https://arxiv.org/abs/2604.23135
作者: William Feng,Ethan Lou,Aryan Sharma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Natural-language variation poses a key challenge in Lean autoformalization: semantically equivalent paraphrases of the same theorem statements can induce divergent formal outputs, yet it remains unclear whether this variation reflects semantic disagreements or shallower failures. We investigate this question in Lean 4 using 60 deterministic paraphrase rules applied to ProofNet# and miniF2F. Across four GPT-family models and three open-weight 7B autoformalizers, we find that the observed paraphrase sensitivity reflects compilation-boundary failures rather than semantic divergence among successful formalizations. In particular, when both baseline and perturbed outputs compile, paired predictions are semantically equivalent under BEq+ and structurally near-identical under GTED. By contrast, paraphrasing substantially affects whether outputs compile, with failure modes varying across datasets and perturbation classes. Our results suggest that future training-time interventions should target the compile boundary rather than the semantic layer, and that benchmarks should separate compile-conditional equivalence from surface consistency.

[LG-79] h-MINT: Modeling Pocket-Ligand Binding with Hierarchical Molecular Interaction Network

链接: https://arxiv.org/abs/2604.23134
作者: Yanru Qu,Yijie Zhang,Wenjuan Tan,Xiangzhe Kong,Xiangxin Zhou,Chaoran Cheng,Mathieu Blanchette,Jiaxuan You,Ge Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate molecular representations are critical for drug discovery, and a central challenge lies in capturing the chemical environment of molecular fragments, as key interactions, such as H-bond and \pi stacking, occur only under specific local conditions. Most existing approaches represent molecules as atom-level graphs; however, atom-level representations can hardly express higher-order chemical context (e.g., stereochemistry, lone pairs, conjugation). Fragment-based methods (e.g., principal subgraph, predefined functional groups) fail to preserve essential information such as chirality, aromaticity, and ionic states. This work addresses these limitations from two aspects. (i) OverlapBPE tokenization. We propose a novel data-driven molecule tokenization method. Unlike existing approaches, our method allows overlapping fragments, reflecting the inherently fuzzy boundaries of small-molecule substructures and, together with enriched chemical information at the token level, thereby preserving a more complete chemical context. (ii) h-MINT model. OverlapBPE induces many-to-many atom-fragment mappings, which necessitate a new hierarchical architecture. We therefore develop a hierarchical molecular interaction network capable of jointly modeling interactions at both atom and fragment levels. By supporting fragment overlaps, the model naturally accommodates the many-to-many atom-fragment mappings introduced by the OverlapBPE scheme. Extensive evaluation against state-of-the-art methods shows our method improves binding affinity prediction by 2-4% Pearson/Spearman correlation on PDBBind and LBA, enhances virtual screening by 1-3% in key metrics on DUD-E and LIT-PCBA, and achieves the best overall HTS performance on PubChem assays. Further analysis demonstrates that our method effectively captures interactive information while maintaining good generalization.

[LG-80] HBGSA: Hydrogen Bond Graph with Self-Attention for Drug-Target Binding Affinity Prediction

链接: https://arxiv.org/abs/2604.23115
作者: Junxiao Kong,Chupei Tang,Di Wang,Jixiu Zhai,Yi He,Moyu Tang,Tianchi Lu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate prediction of drug-target binding affinity accelerates drug discovery by prioritizing compounds for experimental validation. Current methods face three limitations: sequence-based approaches discard spatial geometric constraints, structure-based methods fail to exploit hydrogen bond features, and conventional loss functions neglect prediction-target correlation, a key factor for identifying high-affinity compounds in virtual screening. We developed HBGSA (Hydrogen Bond Graph with Self-Attention), a 3.06M-parameter model that encodes hydrogen bond spatial features. HBGSA uses graph neural networks to model hydrogen bond spatial topology with self-attention enhancement and Pearson correlation loss. Experimental results on PDBbind Core Set and CSAR-HiQ dataset demonstrate that HBGSA outperforms baseline methods with strong generalization capability. Ablation studies confirm the effectiveness of hydrogen bond modeling and Pearson correlation loss.

[LG-81] A Tale of Two Variances: When Single-Seed Benchmarks Fail in Bayesian Deep Learning

链接: https://arxiv.org/abs/2604.23114
作者: Qishi Zhan,Minxuan Hu,Liang He,Guansu Wang,Jiaxin Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In limited-data settings, a single endpoint mean of an evaluation metric such as the Continuous Ranked Probability Score (CRPS) is itself a random variable, yet it is routinely reported as if it were a stable property of the method. We study when this practice fails. Using 50 independent repetitions across six regression datasets, we show that CRPS variance trajectories differ substantially across methods and are not always well described by a smooth power-law decay. Methods with a learned heteroscedastic variance head, namely MAP and Deep Ensembles, can develop pronounced, reproducible variance peaks at intermediate training sizes on real datasets, whereas MC Dropout and Bayes by Backprop typically show smooth variance contraction. These peaks have direct practical consequences: at the variance peak on Seoul Bike, the relative RMSE of a single-seed MAP estimate reaches 93.6%, and the probability of falling within (\pm 10%) of the repeated-run mean drops to 5.9%. We show that local CRPS variance provides a direct signal of single-seed estimation error, with Spearman correlations above 0.96 on every real dataset. Power-law fit quality and monotonicity together provide compact method-level summaries of trajectory regularity. Finally, replacing the standard heteroscedastic objective with (\beta)-NLL substantially reduces the irregular behavior, consistent with the view that the heteroscedastic training objective contributes to the instability. Practitioners should report trajectory summaries alongside endpoint means and concentrate repeated evaluation in high-variance regions.

[LG-82] Conditional Imputation for Within-Modality Missingness in Multi-Modal Federated Learning

链接: https://arxiv.org/abs/2604.23112
作者: Wugeng Zheng,Ziwen Kan,Katie Wang,Chen Chen,Song Wang
类目: Machine Learning (cs.LG)
*备注: Wugeng Zheng and Ziwen Kan contributed equally to this work. Song Wang is the corresponding author. Accepted to FedVision 2026

点击查看摘要

Abstract:Multimodal Federated Learning (MMFL) enables privacy-preserving collaborative training, but real-world clinical applications often suffer from within-modality missingness caused by sensor intermittency or irregular sampling. Existing methods implicitly represent unobserved data via architectural alignment or missing embeddings, often failing to recover the true distribution and yielding sub-optimal performance. We propose CondI, a federated framework explicitly addressing this missingness using conditional diffusion models. CondI employs a two-phase training pipeline: first, imputing unobserved temporal components using available multimodal context and conditional embeddings; second, optimizing modality-specific extractors and joint embedding spaces. During inference, imputed raw data pass through trained extractors to generate robust features, providing a holistic representation for downstream tasks. Explicit data imputation ensures models operate on complete semantic structures, significantly enhancing resilience against severe data incompleteness. Experiments on three clinical datasets (PTB-XL, SLEEP-EDF, MIMIC-IV) demonstrate CondI achieves comparable results to state-of-the-art baselines. Code: this https URL

[LG-83] Unstable Rankings in Bayesian Deep Learning Evaluation

链接: https://arxiv.org/abs/2604.23102
作者: Qishi Zhan,Minxuan Hu,Guansu Wang,Jiaxin Liu,Liang He
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Standard evaluations of Bayesian deep learning methods assume that metric estimates are reliable, but we show this assumption fails under data scarcity. Method rankings are not only unreliable at small n , but also dataset-dependent in ways that point estimates cannot reveal: the same method comparison yields P(\mathrmMCD \prec \mathrmEnsemble) = 1.000 at n = 50 on one dataset and remains below 0.95 even at n = 500 on another. Across the datasets we consider, no universal sample size threshold exists, which is precisely why dataset-specific posterior inference is necessary. To address this, we use a Bayesian hierarchical model with method-specific variances to treat evaluation metrics as random variables across data realizations, and we use a predictive Minimum Detectable Difference curve to assess whether an observed gap would be detectable at a given training size. Across six Bayesian deep learning methods and five regression datasets, our results show that uncertainty-aware evaluation is necessary in low-data settings, because current evidence for method superiority and predictive detectability at the same training size can diverge substantially. Our framework provides practitioners with principled tools to determine whether their evaluation data is sufficient before drawing conclusions about method superiority.

[LG-84] Channel Adaptation for EEG Foundation Models: A Systematic Benchmark Across Architectures Tasks and Training Regimes

链接: https://arxiv.org/abs/2604.23091
作者: Kuntal Kokate,Bruno Aristimunha,Dung Truong,Arnaud Delorme
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Scaling EEG foundation models requires pooling data across heterogeneous electrode montages, a prerequisite both for larger pretraining corpora and for downstream deployment. We present the first systematic comparison of four channel adaptation methods (Conv1d projection, spherical spline interpolation (SSI), source-space decomposition, and Riemannian re-centering) across five pretrained EEG foundation models (5M–157M parameters), five downstream tasks, and two training regimes with 10–15 random seeds each. We find that rigid-montage models (BENDR, Neuro-GPT) require external adaptation, while flexible models (EEGPT, CBraMod) match or exceed it natively when fine-tuned but benefit from external methods under frozen-encoder deployment. A probe-SFT asymmetry exists: external adaptation can cause severe negative transfer during fine-tuning of flexible models. The optimal method is architecture-dependent (Conv1d for BENDR, SSI/Riemannian for Neuro-GPT, source-space decomposition for depression detection), and 5M-parameter CBraMod outperforms models up to 31 \times larger on 4/5 datasets, consistent with independent findings that compact EEG-specific architectures can match larger models.

[LG-85] RL Token: Bootstrapping Online RL with Vision-Language-Action Models

链接: https://arxiv.org/abs/2604.23073
作者: Charles Xu,Jost Tobias Springenberg,Michael Equi,Ali Amin,Adnan Esmail,Sergey Levine,Liyiming Ke
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Vision-language-action (VLA) models can learn to perform diverse manipulation skills “out of the box,” but achieving the precision and speed that real-world tasks demand requires further fine-tuning – for example, via reinforcement learning (RL). We introduce a lightweight method that enables sample-efficient online RL fine-tuning of pretrained VLAs using just a few hours of real-world practice. We (1) adapt the VLA to expose an “RL token,” a compact readout representation that preserves task-relevant pretrained knowledge while serving as an efficient interface for online RL, and (2) train a small actor-critic head on this RL token to refine the actions, while anchoring the learned policy to the VLA. Online RL with the RL token (RLT) makes it possible to fine-tune even large VLAs with RL quickly and efficiently. Across four real-robot tasks (screw installation, zip tie fastening, charger insertion, and Ethernet insertion), RLT improves the speed on the hardest part of the task by up to 3x and raises success rates significantly within minutes to a few hours of practice. It can even surpass the speed of human teleoperation on some of the tasks.

[LG-86] ML-Guided Primal Heuristics for Mixed Binary Quadratic Programs

链接: https://arxiv.org/abs/2604.23053
作者: Weimin Huang,Natalie M. Isenberg,Ján Drgoňa,Draguna L Vrabie,Bistra Dilkina
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Mixed Binary Quadratic Programs (MBQPs) are an important and complex set of problems in combinatorial optimization. As solving large-scale combinatorial optimization problems is challenging, primal heuristics have been developed to quickly identify high-quality solutions within a short amount of time. Recently, a growing body of research has also used machine learning to accelerate solution methods for challenging combinatorial optimization problems. Despite the increasing popularity of these ML-guided methods, a large body of work has focused on Mixed-Integer Linear Programs (MILPs). MBQPs are challenging to solve due to the combinatorial complexity coupled with nonlinearities. This work proposes ML-guided primal heuristics for Mixed Binary Quadratic Programs (MBQPs) by adapting and extending existing work on ML-guided MILP solution prediction to MBQPs. We introduce a new neural network architecture for MBQP solution prediction and a new training data collection procedure. Moreover, we extend existing loss functions in solution prediction and propose to combine contrastive and weighted cross-entropy losses. We evaluate the methods on standard and real-world MBQP benchmarks and show that the developed ML-guided methods significantly outperform existing primal heuristics and state-of-the-art solvers. Furthermore, models trained with our proposed extension with combined losses outperform other ML-based methods adapted from MILPs and improve generalization in cross-regional inference on a real-world wind farm layout optimization problem.

[LG-87] Shape of Memory: a Geometric Analysis of Machine Unlearning in Second-Order Optimizers

链接: https://arxiv.org/abs/2604.23046
作者: Kennon Stewart
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
*备注: Full experiment data available at this http URL

点击查看摘要

Abstract:We argue that current definitions of machine unlearning are underspecified for second-order optimizers. We compare first-order and second-order learners for their ability to handle the data deletion task with varying degrees of eigendecomposition to mimic the loss model memory. While both first and second-order methods realign with the ideal counterfactul in terms of performance and gradient, the second-order optimizer shows significant volatility in the optimizer state. This indicates residual information, supposedly deleted, that isn’t detectable by first-order analysis. Various eigendecay treatments show that stability and information loss is regained only under controlled state pertubation where geometric information (or memory) is erased.

[LG-88] A Differentiable Framework for Global Circulation Model Precipitation Bias Correction

链接: https://arxiv.org/abs/2604.23045
作者: Kamlesh Sawadekar,Seth McGinnis,Peijun Li,Chaopeng Shen
类目: Machine Learning (cs.LG)
*备注: 45 pages, 8 figures, 3 tables

点击查看摘要

Abstract:Systematic biases in Global Circulation Model (GCM) outputs limit their direct applicability in regional planning, necessitating bias correction. Correcting precipitation is particularly challenging due to its non-Gaussian distribution, intermittent nature, and non-linear extremes. However, traditional statistical methods cannot learn from big data and easily address systematic biases in the GCMs, and while machine learning does provide this flexibility, their black-box type functionality hinders us from understanding these biases completely which also further prevents generalization across different GCMs and locations, especially for precipitation. In this study, we propose a differentiable bias adjustment framework called \deltaCLIMBA (or dCLIMBA), that learns a spatiotemporally adaptive parametric bias adjustment procedure between historical CMIP6 model outputs and reference reanalysis datasets (Livneh). Results demonstrate that the proposed method accurately corrects both the magnitude and distribution of extreme storm events, with particularly strong performance in capturing extremes. The quantile distribution of precipitation is well reproduced across diverse U.S. cities, and spatial patterns perform comparably to the widely used LOCA2 statistical downscaling technique. In addition, the framework showed future trend preservation unlike pure quantile based methods and LOCA2; and results from bias correction over unseen regions showed that the marginal biases were attenuated. This work presents a modular, computationally efficient and extensible bias correction approach that is physically informed, scalable, and compatible with both historical and future applications. Its flexibility makes it suitable for integration into Earth system post-processing pipelines and impact workflows.

[LG-89] Self-Supervised Learning for Android Malware Detection on a Time-Stamped Dataset

链接: https://arxiv.org/abs/2604.23025
作者: Annan Fu,Hao Pei,Maryam Tanha
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Android malware detectors built with machine learning often suffer from temporal bias: models are trained and evaluated without respecting apps’ actual release times, inflating accuracy and weakening real-world robustness. We address this by constructing a time-stamped dataset of benign and malicious Android apps and introducing a timestamp-verification procedure to ensure temporal accuracy. We then propose a detection framework that uses Bootstrap Your Own Latent (BYOL) for self-supervised pre-training to learn obfuscation-resilient representations, followed by supervised classification. Under time-aware evaluation, the method attains 98% accuracy and 89% F1. We further characterize malware behavior by analyzing true positives and false negatives using VirusTotal and the MITRE ATTCK framework. To support reproducibility and further innovation, we release our dataset and source code.

[LG-90] Complex SGD and Directional Bias in Reproducing Kernel Hilbert Spaces

链接: https://arxiv.org/abs/2604.23017
作者: Natanael Alpay,Emeric Battaglia
类目: Machine Learning (cs.LG); Complex Variables (math.CV); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Stochastic Gradient Descent (SGD) is a known stochastic iterative method popular for large-scale convex optimization problems due to its simple implementation and scalability. Some objectives, such as those found in complex-valued neural networks, benefit from updates like in SGD and Gradient Descent (GD) with a newly defined ``gradient’’ that allows for complex parameters. This complex variant of the SGD/GD methods has already been proposed, but convergence guarantees without analyticity constraints have not yet been provided. We propose a variant of SGD (complex SGD) that allows for complex parameters, and we provide convergence guarantees under assumptions that parallel those from the real setting. Notably, these results extend to GD as well, and with the same set of assumptions, we confirm that some directional bias results extend from the real to the complex setting for kernel regression problems. We provide empirical results demonstrating the efficacy of the complex SGD in kernel regression problems utilizing complex reproducing kernel Hilbert spaces. In particular, we demonstrate we may recover superoscillation functions and Blaschke products from the Fock Space and Hardy Space, respectively, as the optimal functions for a particular choice of a loss function.

[LG-91] Collocation-based Robust Physics Informed Neural Networks for time-dependent simulations of pollution propagation under thermal inversion conditions on Spitsbergen

链接: https://arxiv.org/abs/2604.23003
作者: Leszek Siwik,Maciej Sikora,Natalia Leszczyńska,Tomasz Maciej Ciesielski,Eirik Valseth,Manuela Bastidas Olivares,Marcin Łoś,Tomasz Służalec,Jacek Leszczyński,Maciej Paszyński
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Robust Variational Physics Informed Neural Networks; Pollution propagation simulations; Longyearbyen at Spitsbergen; Advection-diffusion model; In-field measurements; Open source software

点击查看摘要

Abstract:In this paper, we propose a Physics-Informed Neural Network framework for time-dependent simulations of pollution propagation originating from moving emission sources. We formulate a robust variational framework for the time-dependent advection-diffusion problem and establish the boundedness and inf-sup stability of the corresponding discrete weak formulation. Based on this mathematical foundation, we construct a robust loss function that is directly related to the true approximation error, defined as the difference between the neural network approximation and the (unknown) exact solution. Additionally, a collocation-based strategy is introduced to speed up neural network training. As a case study, we investigate pollution propagation caused by snowmobile traffic in Longyearbyen, Spitsbergen, supported by detailed in-field measurements collected using dedicated sensors. The proposed framework is applied to analyze the effects of thermal inversion on pollutant accumulation. Our results demonstrate that thermal inversion traps dense and humid air masses near the ground, significantly enhancing particulate matter (PM) concentration and worsening local air quality.

[LG-92] Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling

链接: https://arxiv.org/abs/2604.22981
作者: Alex Nikulkov
类目: Machine Learning (cs.LG)
*备注: 27 pages, 14 figures

点击查看摘要

Abstract:Reward models in RLHF are trained to score only the final token of a response - a choice that discards rich signal from every intermediate position and produces models whose token-level outputs are noise. We argue this is a missed opportunity: a well-trained reward model’s output at any token should represent the conditional expectation of the final reward given the response so far. We introduce Temporally Coherent Reward Modeling (TCRM), which induces this property via two regularization terms on top of the standard Bradley-Terry loss, with minimizers provably equal to conditional expectations. The regularizers correspond to Monte Carlo and TD value-learning objectives, establishing a direct connection to RL value functions. TCRM requires zero changes to architecture, data, or inference, yet unlocks three capabilities from one principle: interpretable token-level reward trajectories (middle-token pairwise accuracy improved from 50% to 88.9%, final-token accuracy preserved); state-of-the-art PRM performance on ProcessBench (44.9% average F1) among models trained only on outcome data; and unified reward/value modeling in PPO, reducing peak GPU memory by 27% and step time by 19% with matching LLM quality.

[LG-93] Score-Repellent Monte Carlo: Toward Efficient Non-Markovian Sampler with Constant Memory in General State Spaces

链接: https://arxiv.org/abs/2604.22948
作者: Jie Hu,Lingyun Chen,Geeho Kim,Jinyoung Choi,Bohyung Han,Do Young Eun
类目: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:History-dependent sampling can reduce long-run Monte Carlo variance by discouraging redundant revisits, but existing schemes typically encode history through empirical measure on finite state spaces, which is infeasible in high-dimensional discrete configuration spaces or ill-posed in continuous domains. We propose Score-Repellent Monte Carlo (SRMC) framework that summarizes trajectory history by a running average of score evaluations in R^d , where d is the dimension of the score and state representation. This history is converted into a surrogate target through an exponential score tilt, indexed with \alpha that represents the strength of repellence in controlling the magnitude of the history-based repulsion. The surrogate family is normalization-free in the standard MCMC sense, yielding a generic wrapper: at each iteration, any base kernel targeting \pi can instead be run on the current surrogate \pi_\theta_n while the history is updated online. We analyze the coupled evolution of the history recursion and Monte Carlo estimators using stochastic approximation with controlled Markovian noise, establishing almost sure convergence and a joint central limit theorem. We further identify regimes in which the asymptotic covariance decreases as \alpha increases, with scaling O(1/\alpha) , extending the near-zero-variance effect of finite-state history-dependent samplers to general state spaces with constant memory. Experiments on continuous targets and discrete energy-based models demonstrate improved estimator variance and mode coverage, while retaining O(d) memory usage and modest per-iteration overhead.

[LG-94] Deep Clustering for Climate: Analyzing Teleconnections through Learned Categorical States

链接: https://arxiv.org/abs/2604.22909
作者: Lívia Meinhardt,Dário Oliveira
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding and representing complex climate variability is essential for both scientific analysis and predictive modeling. However, identifying meaningful climate regimes from raw variables is challenging, as they exhibit high noise and nonlinear dependencies. In this work, we explore the use of Masked Siamese Networks to discretize climate time series into semantically rich clusters. Focusing on daily minimum and maximum temperature, we show that the resulting representations: (i) yield clusters that reflect meaningful climate states under our modeling assumptions, offering a simplified representation for downstream use; (ii) enable sampling and analysis of specific climate scenarios; and (iii) exhibit statistical associations with El Niño events, underscoring their scientific relevance. Our findings highlight the potential of self-supervised discretization as a tool for climate data analysis and open avenues for incorporating richer climate indicators in future work.

[LG-95] Accelerating Frequency Domain Diffusion Models with Error-Feedback Event-Driven Caching

链接: https://arxiv.org/abs/2604.22901
作者: Dong Liu,Haisheng Wang,Yanxuan Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models achieve remarkable success in time series generation. However, slow inference limits their practical deployment. We propose E ^2 -CRF (Error-Feedback Event-Driven Cumulative Residual Feature caching) to accelerate frequency domain diffusion models. Our method exploits two structural properties: (1) spectral localization, where signal energy concentrates in low frequencies, and (2) mirror symmetry, which halves the effective frequency dimension. E ^2 -CRF uses a closed-loop error-feedback system that adaptively caches transformer KV features across diffusion steps. We trigger recomputation using event-driven residual dynamics instead of fixed schedules. Our method selectively recomputes high-energy or rapidly-changing tokens while reusing cached features for stable high-frequency components. E ^2 -CRF achieves ~2.2 speedup while maintaining sample quality. We demonstrate effectiveness on 5 datasets. Our caching strategy naturally aligns with the diffusion process’s structure-to-detail progression. We include sufficient-condition error and complexity bounds under standard regularity assumptions (Appendix), alongside empirical validation. Our code is available at this https URL and is also integrated in this https URL.

[LG-96] Magnetic Indoor Localization through CNN Regression and Rotation Invariance

链接: https://arxiv.org/abs/2604.22896
作者: Helge Rosé,Konstantin Klipp,Tom Koubek,Bernd Schäufele,Ilja Radusch
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Published and presented at the 2026 4th International Conference on Mechatronics, Control and Robotics (ICMCR)

点击查看摘要

Abstract:Indoor positioning is an essential technology for a wide range of applications in GNSS-denied environments, including indoor navigation and IoT systems. Combining convolutional neural networks (CNNs) and magnetic field-based features offers a low-cost, infrastructure-free solution for precise positioning. While magnetic fingerprints are a promising approach for indoor positioning, models trained on raw 3D magnetometer data are highly sensitive to device orientation. We address this by using two rotation invariant features derived from the 3D magnetic field: the norm (Mn) and the projection onto the gravity axis (Mg). We train a lightweight 7-layer dilated CNN (MagNetS/XL) on magnetic sequences to directly regress (x, y) positions. Using the MagPie dataset (three buildings, handheld trajectories), we systematically evaluate fixed and random rotations of test and/or train data. Raw 3D inputs (Mx, My , Mz) exhibit isotropic error increases under fixed 90° rotations and further degrade with growing random rotations. In contrast, 2D (Mn, Mg) inputs maintain rotation invariant accuracy and surpass the 3D inputs once rotation exceeds building-specific thresholds for three reference buildings: 0° for Loomis (large), 5° for Talbot (medium), and 6° for CSL (small). MagNetXL achieves or exceeds state-of-the-art accuracy on the MagPie dataset, and MagNetS delivers similar performance with roughly one third of the parameters, favoring mobile deployment. These results show that the robustness gained from rotation invariant inputs outweighs the loss of input dimensionality in realistic usage, allowing mapping and localization without orientation alignment or added infrastructure.

[LG-97] StackFeat RL: Reinforcement Learning over Iterative Dual Criterion Feature Selection for Stable Biomarker Discovery

链接: https://arxiv.org/abs/2604.22892
作者: A. Yermekov,D.A. Herrera-Martí
类目: Machine Learning (cs.LG)
*备注: 7 pages. Submitted to eccb2026

点击查看摘要

Abstract:Feature selection in high-dimensional genomic data ( d \gg n ) demands methods that are simultaneously accurate, sparse, and stable. Existing approaches either require manual threshold specification (mRMR, stability selection), produce unstable selections under data perturbation (Lasso, Boruta), or ignore biological structure entirely. We introduce StackFeat-RL, a meta-learning framework that optimises the hyperparameters of an iterative dual-criterion feature selection algorithm via REINFORCE policy gradients. The dual criterion, requiring both coefficient consistency and selection frequency, guards against two failure modes missed by single-criterion methods, while iterative accumulation provides convergence guarantees via the law of large numbers. On COVID-19 miRNA data (GSE240888, 332 features) and three Alzheimer’s disease classification tasks (GSE84422, 13237 genes; Normal vs.\ Possible, Probable, and Definite AD), StackFeat-RL achieves the highest predictive accuracy among all evaluated methods, including ElasticNet, Boruta, mRMR, and stability selection, while requiring 3–4 \times fewer features. Keywords: feature selection, reinforcement learning, REINFORCE, elastic net, biomarker discovery, Alzheimer’s disease, dual-criterion selection, protein interaction networks Comments: 7 pages. Submitted to eccb2026 Subjects: Machine Learning (cs.LG) Cite as: arXiv:2604.22892 [cs.LG] (or arXiv:2604.22892v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.22892 Focus to learn more arXiv-issued DOI via DataCite

[LG-98] Predicting Wind Loads on Container Ships in Harbor Environments through Multi-Fidelity Modeling

链接: https://arxiv.org/abs/2604.22882
作者: Matilde Fiore,Andrea Bresciani,Miguel Alfonso Mendez,Jeroen van Beeck
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:Modern container ships face higher wind loads due to increased windage areas, making accurate predictions of wind loads essential for mooring design. Existing empirical models, largely developed for container ships with smaller windage areas and simpler geometrical configurations than those of modern large-scale vessels, often lack accuracy and do not account for the influence of nearby structures. This study proposes a multi-fidelity surrogate modelling framework for the prediction of wind-load coefficients, combining empirical correlations with simplified and detailed CFD models for ships in open-sea and harbor environments. The approach relies on recursive co-kriging to consistently fuse information across fidelity levels, enabling accurate predictions at a reduced computational cost. A sensitivity analysis is used to identify the most influential geometric parameters, and the resulting reduced parameter space is explored through sequential sampling to efficiently construct the training database. The surrogate models are validated over a wide range of loading configurations and for two distinct harbor environments. The results demonstrate that the multi-fidelity approach significantly improves prediction accuracy compared to single-fidelity models, while substantially reducing the reliance on high-fidelity simulations. In particular, the proposed framework captures the dependence of wind loads on key geometric parameters and consistently outperforms traditional empirical correlations, providing a robust and efficient tool for engineering applications. Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an) Cite as: arXiv:2604.22882 [cs.LG] (or arXiv:2604.22882v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.22882 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-99] Audio2Tool: Bridging Spoken Language Understanding and Function Calling

链接: https://arxiv.org/abs/2604.22821
作者: Ramit Pahwa,Apoorva Beedu,Parivesh Priye,Rutu Gandhi,Saloni Takawale,Aruna Baijal,Zengli Yang
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Voice assistants increasingly rely on Speech Language Models (SpeechLMs) to interpret spoken queries and execute complex tasks, yet existing benchmarks lack domain breadth, acoustic diversity, and compositional reasoning complexity to evaluate tool-calling performance. We introduce Audio2Tool, a large-scale dataset comprising approximately 30,000 queries designed to assess tool-calling capabilities of SpeechLMs across three primary domains: Smart Car, Smart Home, and Wearables. Our benchmark features a multi-tier complexity hierarchy, ranging from simple direct commands to complex multi-intent and needle-in-a-haystack extraction to isolate distinct failure modes. To ensure realism, we employ zero-shot voice cloning text-to-speech synthesis and diverse noise profiles to simulate in-the-wild conditions. Evaluations of state-of-the-art SpeechLMs and ASR-LLM pipelines show strong performance on simple commands but significant degradation under compositional and acoustic challenges. We will release the dataset and benchmark upon acceptance.

[LG-100] Cross-Course Generalizability of SRL-Aligned Predictive Models Using Digital Learning Traces

链接: https://arxiv.org/abs/2604.22812
作者: Jakob Schwerter,Loreen Sabel,Judith Bose,Matthew L. Bernacki,Di Xu,Marko Schmellenkamp,Thomas Zeume,Philipp Doebler
类目: Computers and Society (cs.CY); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:STEM dropout rates remain high at universities, particularly in computer science programs with theory-intensive courses. Digital learning environments now capture rich behavioral data that could help identify struggling students early, yet the generalizability of data-driven prediction models across courses and institutions remains uncertain. Guided by self-regulated learning (SRL) theory, this study analyzed multimodal digital-trace data from three undergraduate theoretical computer science courses (N1 = 137, N2 = 104, N3 = 148) at two universities. Weekly SRL-aligned digital-trace indicators were modeled using Elastic Net, Random Forest, and XGBoost to evaluate predictive performance over time and across settings, and model calibration both within and across courses. Early prediction of at-risk students was feasible, with SRL-related behaviors such as time management, effort regulation, and sustained engagement emerging as key predictors. While Random Forest achieved the highest in-sample accuracy, Elastic Net generalized more robustly across contexts. Out-of-sample accuracy and calibration declined between institutions with different base rates, underscoring the contextual nature of predictive analytics in higher education. These findings suggest that digital learning traces enable early identification of at-risk students within courses, but generalizing predictive models beyond their original context requires caution, particularly if the at-risk rates differ between contexts.

[LG-101] Hierarchical RL-MPC Control for Dynamic Wake Steering in Wind Farms

链接: https://arxiv.org/abs/2604.22797
作者: Marcus Binder Nilsen,Teodor Olof Benedict Åstrand,Tuhfe Göçmen,Pierre-Elouan Réthoré
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: This work has been submitted to IFAC for possible publication

点击查看摘要

Abstract:Wind farm wake steering optimization is challenging due to complex flow physics and changing conditions. This paper presents a hierarchical framework that combines reinforcement learning with model predictive control, where an RL agent learns compensatory state estimates for an MPC controller, rather than directly controlling turbines. Evaluated on a three-turbine case, the approach achieves a 23% power gain over the baseline control and surpasses the idealized MPC with perfect state knowledge. Compared to direct RL control, the hybrid architecture maintains superior safety characteristics during training while achieving comparable performance with more stable control actions.

[LG-102] Load constrained wind farm flow control through multi-objective multi-agent reinforcement learning

链接: https://arxiv.org/abs/2604.22795
作者: Teodor Åstrand,Marcus Binder Nilsen,Iasonas Tsaklis,Tuhfe Göçmen,Pierre-Elouan Réthoré,Nikolay Dimitrov
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Submitted to Journal of Physics: Conference Series (Torque 2026). This is the Accepted Manuscript version of an article accepted for publication in Journal of Physics: Conference Series. IOP Publishing Ltd is not responsible for any errors or omissions in this version of the manuscript or any version derived from it. This Accepted Manuscript is published under a CC BY licence

点击查看摘要

Abstract:This study presents a multi-agent reinforcement learning (MARL) framework for load-constrained wind farm flow control (WFFC). While wake steering can enhance total wind farm power, it often introduces increased structural loads on downstream turbines. To address this, we integrate an Independent Soft Actor-Critic (I-SAC) architecture with a data-driven, local inflow sector-averaged surrogate model to provide real-time estimates of Damage Equivalent Loads (DELs). By incorporating these estimates into a shaped reward function, turbine-specific agents are trained to maximize power production while adhering to specific load-increase thresholds ( \Delta_max ) of 10%, 20%, and 30% relative to a baseline controller. The framework is implemented within the WindGym environment using the DYNAMIKS flow solver with Dynamic Wake Meandering (DWM) model to capture non-stationary wake physics. Results indicate that the MARL agents successfully learn collaborative policies that prioritise power gain while actively retreating from high-DEL control strategies.

[LG-103] Accelerating Reinforcement Learning for Wind Farm Control via Expert Demonstrations

链接: https://arxiv.org/abs/2604.22794
作者: Marcus Binder Nilsen,Julian Quick,Tuhfe Göçmen,Nikolay Dimitrov,Pierre-Elouan Réthoré
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Submitted to Journal of Physics: Conference Series (Torque 2026). This is the Accepted Manuscript version of an article accepted for publication in Journal of Physics: Conference Series. IOP Publishing Ltd is not responsible for any errors or omissions in this version of the manuscript or any version derived from it. This Accepted Manuscript is published under a CC BY licence

点击查看摘要

Abstract:Reinforcement learning (RL) offers a promising approach for adaptive wind farm flow control, yet its practical deployment is hindered by slow training convergence and poor initial performance, factors that could translate to years of reduced power output if an untrained agent were deployed directly. This work investigates whether domain knowledge from steady-state wake models can accelerate RL training and improve initial controller performance. We propose a pretraining methodology in which expert demonstrations are generated by deploying a PyWake-based steady-state optimizer within a dynamic wake simulation (WindGym), then used to initialize both the actor and critic networks of a Soft Actor-Critic agent via behavior cloning. Experiments on a 2x2 wind farm show that pretraining eliminates the costly initial learning phase: while an untrained agent underperforms the greedy zero-yaw baseline by approximately 12%, pretraining raises initial performance to near-baseline levels. During online fine-tuning, all configurations converge within 250,000 environment steps to achieve similar performance, ultimately exceeding that of a lookup-table controller, which reaches approximately 7% power gain after 500,000 steps.

[LG-104] AutoCompress: Critical Layer Isolation for Efficient Transformer Compression

链接: https://arxiv.org/abs/2604.22786
作者: Archit Thorat
类目: Machine Learning (cs.LG)
*备注: 6 pages, 2 tables. Code available at this https URL

点击查看摘要

Abstract:We present AutoCompress, a transformer compression method motivated by an empirical finding: in small transformers, Layer 0 carries disproportionately high task-critical information, with an NTK-based importance score of 3.6 compared to a maximum of 0.054 for all other layers – a gap of over 60x. Based on this finding, we propose Critical Layer Isolation (CLI), an architecture that protects Layer 0 at full dimensionality, compresses all intermediate layers through a learned bottleneck, and restores the full dimension at the final layer. Applied to GPT-2 Medium (354.8M parameters), CLI-GPT2 achieves 204.5 perplexity on WikiText-103 with only 143.8M parameters – a 2.47x compression ratio and 59.5% parameter reduction. Crucially, an ablation study demonstrates that a uniform bottleneck baseline of comparable size achieves only 571.8 perplexity under identical training conditions, confirming that the architectural decision to protect Layer 0 – rather than simply reducing model size – is the primary driver of performance. Code and checkpoints are publicly available.

[LG-105] CoFi-PGMA: Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLM s

链接: https://arxiv.org/abs/2604.22785
作者: Stela Tong,Elai Ben-Gal
类目: Machine Learning (cs.LG)
*备注: 17 pages, 0 figures

点击查看摘要

Abstract:Large language model (LLM) deployments increasingly rely on multi-agent architectures in which multiple models either compete through routing mechanisms or collaborate to produce a final answer. In both settings, the learning signal received by each agent is filtered by the system mechanism. Routing produces selection-gated feedback where only the chosen response is evaluated, while collaboration produces shared rewards that obscure the individual contribution of each agent. As a result, standard RLHF objectives designed for a single deployed policy become misspecified. We introduce CoFi-PGMA (Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLMs), a unified framework for learning under filtered feedback in multi-agent LLM systems. Our approach derives a counterfactual per-agent training objective based on marginal contribution, which corrects the learning signal under both routing and collaborative mechanisms. For routing systems, the objective corresponds to off-policy corrections for selection-gated feedback, while for collaborative systems it reduces to leave-one-out difference rewards for credit assignment. We further analyze how softmax routing induces risk-sensitive incentives and provide practical training algorithms that integrate counterfactual estimators, multiturn-aware rewards, and policy optimization methods, and demonstrate the approach on a real-world reasoning dataset.

[LG-106] Learning Without Adversarial Training: A Physics-Informed Neural Network for Secure Power System State Estimation under False Data Injection Attacks

链接: https://arxiv.org/abs/2604.22784
作者: Solon Falas,Markos Asprou,Charalambos Konstantinou,Maria K. Michael
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:State estimation is a cornerstone of power system control-center operations, and its robust operation is increasingly a cyber-physical security concern as modern grids become more digitalized and communication-intensive. Neural network-based approaches have gained attention as alternatives to conventional model-based state estimation methods. Physics-Informed Neural Networks (PINNs), which embed power-flow consistency into the learning objective, have shown improved accuracy over existing approaches. This work proposes a PINN-based model for Power System State Estimation (PSSE) that protects the estimation process against the stealth-constrained AC False Data Injection Attacks (FDIAs) considered in this study. The model is developed without adversarial training. Instead, a dynamic loss-weighting formulation based on homoscedastic uncertainty learns the relative scaling of supervised data-fit and physics-residual terms during training, reducing sensitivity to manual weight tuning. Robustness is evaluated on the IEEE 118-bus system using representative stealthy-FDIA families including state distortion, load redistribution, line overloading, and residual-constrained stealth corruption. Performance is measured using Mean Absolute Error (MAE) on voltage magnitudes and phase angles. Results demonstrate higher accuracy and stability than existing fixed-weight PINN variants.

[LG-107] Energy-Arena: A Dynamic Benchmark for Operational Energy Forecasting

链接: https://arxiv.org/abs/2604.24705
作者: Max Kleinebrahm,Jonathan Berrisch,Philipp Eiser,Wolf Fichtner,Veit Hagenmeyer,Matthias Hertel,Nils Koster,Sebastian Lerch,Ralf Mikut,Jan Priesmann,Melanie Schienle,Benjamin Schaefer,Jann Weinand,Florian Ziel
类目: Econometrics (econ.EM); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures, 1 table. Submitted to the European Electricity Markets (EEM) conference

点击查看摘要

Abstract:Energy forecasting research faces a persistent comparability gap that makes it difficult to measure consistent progress over time. Reported accuracy gains are often not directly comparable because models are evaluated under study-specific datasets, time periods, information sets, and scoring setups, while widely used benchmarks and competition datasets are typically tied to fixed historical windows. This paper introduces the Energy-Arena, a dynamic benchmarking platform for operational energy time series forecasting that provides a continuously updated reference point as energy systems evolve. The platform operates as an open, API-based submission system and standardizes challenge definitions and submission deadlines aligned with operational constraints. Performance is reported on rolling evaluation windows via persistent leaderboards. By moving from retrospective backtesting to forward-looking benchmarking, the Energy-Arena enforces standardized ex-ante submission and ex-post evaluation, thereby improving transparency by preventing information leakage and retroactive tuning. The platform is publicly available at this http URL.

[LG-108] Dual Control of Linear Systems from Bilinear Observations with Belief Space Model Predictive Control

链接: https://arxiv.org/abs/2604.24663
作者: Daniel Cao,Beixi Du,Andrew Lowitt,Sunmook Choi,Sarah Dean,Yahya Sattar
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We study finite-horizon quadratic control of linear systems with bilinear observations, in which the control input affects not only the state dynamics but also the partial observations of the state. In this setting, the separation principle can fail because control inputs influence the future quality of state estimates. State estimation requires an input-dependent Kalman filter whose gain and error covariance evolve as functions of the control inputs. To address this challenge, we propose a belief-space model predictive control ( \textttB-MPC ) method that plans directly over both the estimated state and its error covariance. In particular, \textttB-MPC plans with a deterministic surrogate of the belief evolution defined by the input-dependent Kalman filter. Through numerical experiments in two synthetic settings, we show that \textttB-MPC can outperform both the separation-principle controller and its MPC variant in favorable regimes, and that these gains are accompanied by lower estimation covariance and more uncertainty-aware action choices.

[LG-109] Computational Design and Experimental Validation of Photoactive PARP1 Inhibitors

链接: https://arxiv.org/abs/2604.24634
作者: Simon Axelrod,Miroslav Kašpar,Kristýna Jelínková,Markéta Šmídková,Erika Bartůňková,Sille Štěpánová,Eugene Shakhnovich,Václav Kašička,Martin Dračínský,Zlatko Janeba,Rafael Gómez-Bombarelli
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Light-activated drugs are a promising way to treat localized diseases for which existing treatments have severe side effects. However, their development is complicated by the set of photophysical and biological properties that must be simultaneously optimized. Here we used computational techniques to find a set of promising candidates for the photoactive inhibition of the poly(ADP-ribose) polymerase 1 (PARP1) cancer target. Using our recently developed methods based on atomistic simulation and machine learning (ML), we screened a set of 5 million hypothetical photoactive ligands. Our workflow used protein-ligand docking to identify candidates with differential PARP1 binding under light and dark conditions; ML force fields and quantum chemistry calculations to predict p K_\mathrma , absorption spectra, and thermal half-lives; graph-based surrogate models to screen additional compounds; excited-state nonadiabatic dynamics with ML force fields to estimate quantum yields; and free energy perturbation (FEP) to refine binding predictions. From these predictions, we prioritized a small set of synthetically feasible candidates expected to have red-shifted absorption spectra, thermal half-lives on the order of seconds to minutes, and isomer-dependent PARP1 binding under visible-light control. We synthesized 10 candidates and experimentally characterized their photobehavior and PARP1 inhibition constants. Among the validated compounds, \textbf1 showed a 15-fold increase in inhibition of PARP1 upon green-light irradiation at 519 nm (208.8 \pm 28.3 \mu M vs 14.4 \pm 1.9 \mu M). These results validate the computation-guided screening strategy for identifying red-shifted PARP1 photoinhibitors, while also underscoring current limitations such as rapid thermal relaxation in aqueous media.

[LG-110] Enhancing molecular dynamics with equivariant machine-learned densities

链接: https://arxiv.org/abs/2604.24563
作者: Mihail Bogojeski,Muhammad R. Hasyim,Leslie Vogt-Maranto,Klaus-Robert Müller,Kieron Burke,Mark E. Tuckerman
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 30 pages, 7 figures

点击查看摘要

Abstract:Machine-learning interatomic potentials (MLIPs) have enabled molecular dynamics at near ab initio accuracy, yet remain limited to energies and forces by construction, leaving electronic observables such as dipole moments and polarizabilities inaccessible. We introduce DenSNet, a density-first approach to machine-learned electronic structure that learns the Hohenberg–Kohn map from nuclear configurations to the ground-state electron density. Our approach employs an SE(3)-equivariant neural network to predict density coefficients of a flexible atom-centered Gaussian basis, combined with a \Delta -learning strategy that uses superposed atomic densities as a prior to accelerate training. A second equivariant network then maps the predicted density to the total energy, providing a unified framework for molecular dynamics and electronic structure. We validate DenSNet on ethanol, ethanethiol, and resorcinol, where infrared spectra from machine-learned trajectories show excellent agreement with experimental gas-phase measurements. To test scalability, we train on polythiophene oligomers with 1–6 monomers and extrapolate to chains of up to 12 monomers, generating stable long-time trajectories whose infrared spectra agree with reference density functional theory calculations. Here, we show that reinstating the electron density as the central learned quantity opens a practical route to transferable prediction of spectroscopic and electronic observables in large-scale molecular simulations.

[LG-111] GSC-QEMit: A Telemetry-Driven Hierarchical Forecast-and-Bandit Framework for Adaptive Quantum Error Mitigation IJCNN2026

链接: https://arxiv.org/abs/2604.24551
作者: Steven Szachara,Sheeraja Rajakrishnan,Dylan Jay Van Allen,Jason Pollack,Travis Desell,Daniel Krutz
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 8 pages, 3 figures, accepted to EEE/INNS IJCNN 2026 and is a part of WCCI2026

点击查看摘要

Abstract:Quantum error mitigation (QEM) is essential for extracting reliable results from near-term quantum devices, yet practical deployments must balance mitigation strength against runtime overhead under time-varying noise. We introduce \emphGSC-QEMit, a telemetry-driven, \textbfcontext–forecast–bandit framework for \emphadaptive mitigation that switches between lightweight suppression and heavier intervention as drift evolves. GSC-QEMit composes three coupled modules: (G) a Growing Hierarchical Self-Organizing Map (GHSOM) that clusters streaming telemetry into operating contexts; (S) an uncertainty-aware subsampled Gaussian-process forecaster that predicts short-horizon fidelity degradation; and © a cost-aware contextual multi-armed bandit (CMAB) that selects mitigation actions via Thompson sampling with explicit intervention cost. We evaluate GSC-QEMit on benchmark circuit families (GHZ, Quantum Fourier Transform, and Grover search) under nonstationary noise regimes simulated in Qiskit Aer, using an instrumented testbed where action labels correspond to graded mitigation intensity. Across Clifford, non-Clifford, and structured workloads, GSC-QEMit improves average logical fidelity by \textbf+9.0% relative to unmitigated execution while reducing unnecessary heavy interventions by reserving them for inferred noise spikes. The resulting policies exhibit a favorable fidelity–cost trade-off and transfer across the evaluated workloads without circuit-specific tuning.

[LG-112] Extreme bandits NEURIPS

链接: https://arxiv.org/abs/2604.24545
作者: Alexandra Carpentier,Michal Valko
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Published at Neural Information Processing Systems (NeurIPS) 2014

点击查看摘要

Abstract:In many areas of medicine, security, and life sciences, we want to allocate limited resources to different sources in order to detect extreme values. In this paper, we study an efficient way to allocate these resources sequentially under limited feedback. While sequential design of experiments is well studied in bandit theory, the most commonly optimized property is the regret with respect to the maximum mean reward. However, in other problems such as network intrusion detection, we are interested in detecting the most extreme value output by the sources. Therefore, in our work we study extreme regret which measures the efficiency of an algorithm compared to the oracle policy selecting the source with the heaviest tail. We propose the ExtremeHunter algorithm, provide its analysis, and evaluate it empirically on synthetic and real-world experiments.

[LG-113] Few-Shot Cross-Device Transfer for Quantum Noise Modeling on Real Hardware

链接: https://arxiv.org/abs/2604.24397
作者: Sahil Al Farib,Sheikh Redwanul Islam,Azizur Rahman Anik
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 9 pages, 8 figures, 8 tables. Submitted to IEEE Quantum Computing and Engineering (QCE) 2026

点击查看摘要

Abstract:In the noisy intermediate-scale quantum (NISQ) regime, quantum devices contain hardware-specific noise sources which restrict device-invariant error mitigation strategies. We explore transfer learning approaches to apply noise models learned on one quantum device to a different device with the help of a small amount of data. We create a real-hardware dataset from two IBM quantum devices, ibm_fez (source) and ibm_marrakesh (target), comprising 170 noisy and ideal circuit output distributions, with device calibration features added. We train a residual neural network on the source device to map noisy to ideal outcomes. The zero-shot transfer test shows a KL divergence of 1.6706 (up from 0.3014), establishing device specificity. With K = 20 fine-tuning samples, KL drops to 1.1924 (28.6% improvement over zero-shot), recovering 34.9% of the gap between zero-shot and in-domain KL. Ablation studies reveal that the major cause of mismatches across devices is CX gate error, followed by readout error. The results show quantum noise can be learned and fine-tuned with minimal samples, and provide a plausible approach to cross-device quantum error mitigation.

[LG-114] New non-Euclidean neural quantum states from additional types of hyperbolic recurrent neural networks

链接: https://arxiv.org/abs/2604.24337
作者: H. L. Dao
类目: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we extend the class of previously introduced non-Euclidean neural quantum states (NQS) which consists only of Poincaré hyperbolic GRU, to new variants including Poincaré RNN as well as Lorentz RNN and Lorentz GRU. In addition to constructing and introducing the new non-Euclidean hyperbolic NQS ansatzes, we generalized the results of our earlier work regarding the definitive outperformances delivered by hyperbolic Poincaré GRU NQS ansatzes when benchmarked against their Euclidean counterparts in the Variational Monte Carlo (VMC) experiments involving the quantum many-body settings of the Heisenberg J_1J_2 and J_1J_2J_3 models, which exhibit hierarchical structures in the forms of the different degrees of nearest-neighbor interactions. Here, in particular, using larger systems consisting of 100 spins, we found that all four hyperbolic RNN/GRU NQS variants always outperformed their respective Euclidean counterparts. Specifically, for all J_2 and (J_2,J_3) couplings considered, including J_2=0.0 , Lorentz RNN NQS and Poincaré RNN NQS always outperformd Euclidean RNN NQS, while Lorentz/Poincaré GRU NQS always outperformed Euclidean GRU NQS, with a single exception when J_2=0.0 for Poincaré GRU NQS. Furthermore, among the four hyperbolic NQS ansatzes, depending on the specific J_2 or (J_2, J_3) couplings, on four out of eight experiment settings, Lorentz GRU and Poincaré GRU took turns to be the top performing variant among all Euclidean and hyperbolic NQS ansatzes considered, while Lorentz RNN, with up to three times fewer parameters, was capable of not only surpassing the Euclidean GRU eight out of eight times but also outperforming both Lorentz GRU and Poincaré GRU four out of eight times, to emerge as the best overall hyperbolic NQS ansatz.

[LG-115] Identifiability and Stability of Generative Drifting with Companion-Elliptic Kernel Families

链接: https://arxiv.org/abs/2604.24196
作者: Hak Geun Lee
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 50 pages, no figures

点击查看摘要

Abstract:This paper analyzes identifiability and stability for the drifting field underlying distributional matching in the Generative Drifting framework of Deng et al. First, we introduce the class of companion-elliptic kernels, which includes the Laplace kernel and is characterized by a second-order elliptic coupling between each kernel \kappa in this class and its companion function \eta . For each kernel in this class and each pair of Borel probability measures, we prove that the drifting field vanishes if and only if the two probability measures are equal. We further show that this class consists precisely of Gaussian kernels and Matérn kernels with \nu \ge 1/2 . Second, by constructing counterexamples, we exhibit sequences for which mass escapes to infinity while the field tends to zero; in particular, control of the field norm alone does not guarantee weak convergence. Nevertheless, we prove that the only possible mode of failure is confined to the one-dimensional ray \c,p:0\le c\le 1\ . Consequently, weak convergence can be restored by imposing an asymptotic lower bound on the intrinsic overlap scalar, a linear observable defined by the kernel and the target measure.

[LG-116] A Divergence-Based Method for Weighting and Averag ing Model Predictions AISTATS2026

链接: https://arxiv.org/abs/2604.24172
作者: Olav Benjamin Vassend
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: Accepted at AISTATS 2026

点击查看摘要

Abstract:This paper uses a minimum divergence framework to introduce a new way of calculating model weights that can be used to average probabilistic predictions from statistical and machine learning models. The method is general and can be applied regardless of whether the models under consideration are fit to data using frequentist, Bayesian, or some other fitting method. The proposed method is motivated in two different ways and is shown empirically to perform better than or on a par with standard model averaging methods, including model stacking and model averaging that relies on Akaike-style negative exponentiated model weighting, especially when the sample size is small. Our theoretical analysis explains why the method has a small-sample advantage.

[LG-117] Conditional Score-Based Modeling of Effective Langevin Dynamics

链接: https://arxiv.org/abs/2604.23952
作者: Ludovico T. Giorgini
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注:

点击查看摘要

Abstract:Stochastic reduced-order models are widely used to represent the effective dynamics of complex systems, but estimating their drift and diffusion coefficients from data remains challenging. Standard approaches often rely on short-time trajectory increments, state-space partitioning, or repeated simulation of candidate models, which become unreliable or computationally expensive for high-dimensional systems, coarse temporal sampling, or unevenly sampled data. We introduce a data-driven calibration method based on a novel relationship between the coefficients of a stochastic reduced model and the conditional score of the finite-time transition density, defined as the gradient of the logarithm of the transition density with respect to the initial state. The resulting identity expresses derivatives of lagged correlation functions as stationary expectations over observed lagged pairs involving this conditional score and the unknown model coefficients. This formulation allows the drift and diffusion structure to be constrained directly from finite-lag statistics, without differentiating trajectories, partitioning state space, or repeatedly integrating candidate reduced models during calibration, yielding a least-squares fitting problem over stationary lagged pairs. We validate the approach on analytically tractable and data-driven nonequilibrium diffusions, demonstrating that the inferred models preserve the invariant statistics while accurately reproducing finite-lag dynamical correlations. The framework provides a scalable route for learning stochastic reduced-order models from data that reproduce prescribed statistical and dynamical properties.

[LG-118] Sliced-Regularized Optimal Transport

链接: https://arxiv.org/abs/2604.23944
作者: Khai Nguyen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 22 pages, 8 figures, 1 table

点击查看摘要

Abstract:We propose a new regularized optimal transport (OT) formulation, termed sliced-regularized optimal transport (SROT). Unlike entropic OT (EOT), which regularizes the transport plan toward an independent coupling, SROT regularizes it toward a smoothened sliced OT (SOT) plan. To the best of our knowledge, SROT is the first approach to leverage a version of SOT plan as a reference to improve classical OT. We provide a formal definition of SROT, derive its dual formulation, and provide a post-Bayesian interpretation of SROT. We then develop a Sinkhorn-style algorithm for efficient computation, retaining the same scalability advantages as EOT. By incorporating a scalable SOT plan as a prior, SROT yields more accurate approximations of the exact OT plan than EOT under the same level of regularization. Moreover, the resulting transport plan improves upon the reference SOT plan itself. We further introduce the corresponding OT divergence induced by SROT, named SROT divergence, and analyze its topological and computational properties. Finally, we validate our approach through experiments on synthetic datasets and color transfer tasks, demonstrating that SROT is better than both EOT and SOT in approximating exact OT. Additional experiments on gradient flows further highlight the advantages of SROT divergence.

[LG-119] Multi-scale Dynamic Wake Modeling of Floating Offshore Wind Turbines via Fourier Neural Operators and Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2604.23937
作者: Guodan Dong,Jianhua Qin,Chang Xu
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-scale dynamic wake prediction is essential for the real-time control and performance optimization of floating offshore wind turbines (FOWTs). In this study, Fourier neural operators (FNOs) and physics-informed neural networks (PINNs) are utilized for the first time to reconstruct and predict the complex turbulent wakes of the FOWT under coupled surge and pitch motions across a range of Strouhal numbers (St = [0, 0.6]). Results demonstrate that while both models successfully capture dominant dynamic characteristics such as wake meandering, PINN-generated wakes appear relatively smooth, failing to resolve high-frequency coherent structures as well as the intensity of temporal variations in wake center and wake half-width. FNO effectively resolves both large- and small-scale coherent turbulent structures with significantly higher fidelity. Furthermore, FNO achieves a training speed approximately eight times faster than PINN, converging in far fewer epochs. Power spectral density (PSD) analysis reveals that FNO is more effective at capturing not only the primary wake meandering frequencies (St) but also their higher-order harmonics (e.g., 2St and 3St) and small-scale coherent structures. In fact, PINN effectively acts as a spatiotemporal low-pass filter; they resolve only large-scale dynamic features and fail to capture other spectral signatures induced by coupled surge and pitch motions, thereby significantly underestimating the energy in the high-frequency regime. These findings suggest that FNO is a promising approach for FOWT wake prediction.

[LG-120] Nearly Optimal Subdata Selection

链接: https://arxiv.org/abs/2604.23930
作者: Min Yang,Wei Zheng,John Stufken,Ming-Chung Chang,Ting Tian,Xueqin Wang
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:When, in terms of the number of data points, the size of a dataset exceeds available computing resources, or when labeling is expensive, an attractive solution consists of selecting only some of the data points (subdata) for further consideration. A central question for selecting subdata of size n from N available data points is which n points to select. While an answer to this question depends on the objective, one approach for a parametric model and a focus on parameter estimation is to select subdata that retains maximal information. Identifying such subdata is a classical NP-hard problem due to its inherent discreteness. Based on optimal approximate design theory, we develop a new methodology for information-based subdata selection, resulting in subdata that approaches the optimal solution. To achieve this, we develop a novel algorithm that applies to a general model, accommodates arbitrary choices of N and n , and supports multiple optimality criteria, and we prove its convergence. Moreover, the new methodology facilitates an assessment of the efficiency of subdata selected by any method by obtaining tight lower and upper bounds for the efficiency. We show that the subdata obtained through the new methodology is highly efficient and outperforms all existing methods.

[LG-121] Integrative neurocybernetic modeling in the era of large-scale neuroscience

链接: https://arxiv.org/abs/2604.23903
作者: Il Memming Park,Ayesha Vermani,Gonzalo G. de Polavieja,Juan Álvaro Gallego,Kathleen Esfahany,Shreya Saxena,Michael Orger,Auke Ijspeert,Matthew Dowling,Daniel McNamee,Srinivas C. Turaga,Zachary Mainen,Joseph J. Paton,Alfonso Renart
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: Perspective

点击查看摘要

Abstract:Large-scale neuroscience is generating rich datasets across animals, brain areas and behavioral contexts, yet our modeling efforts remains fragmented across isolated experiments. We argue that understanding behavior requires integrative neurocybernetic models: understandable dynamical models that capture the closed-loop coupling of brain, body and environment, treat the brain as a controller pursuing latent objectives, represent structured variation across scales, and scale to heterogeneous datasets. Such models shift the goal from predicting neural recordings in isolation to inferring the organizing principles that govern neural and behavioral dynamics. We outline a practical route toward this goal by combining nonlinear state-space models and meta-dynamical extensions with scalable inference, knowledge distillation, mixed open- and closed-loop training, and connectomics-informed architectures. By pooling complementary constraints from recordings, behavior, perturbations and anatomy, integrative neurocybernetic models can provide statistical amplification, few-shot generalization, and mechanistic insight into shared dynamical structure, individual variation, and the control objectives that govern behavior. This agenda offers a model-centric path from fragmented data to a mechanistic science of how brains produce behavior.

[LG-122] Deep Learning of Solver-Aware Turbulence Closures from Nudged LES Dynamics

链接: https://arxiv.org/abs/2604.23874
作者: Ashwin Suriyanarayanan,Melissa Adrian,Dibyajyoti Chakraborty,Romit Maulik
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Dynamical Systems (math.DS); Computational Physics (physics.comp-ph); Geophysics (physics.geo-ph)
*备注:

点击查看摘要

Abstract:Deep learning approaches have shown remarkable promise in turbulence closure modeling for large eddy simulations (LES). The differentiable physics paradigm uses the so-called a-posteriori approach for learning by embedding a neural network closure directly inside the solver and optimizing its learnable parameters against ground truth time-series data which may be observed sparsely. This addresses a key limitation of a-priori learning where direct numerical simulation (DNS) data is used to approximate the subgrid stress with the assumption of a filter. However, closures that are trained in this manner frequently lead to unstable deployments due to the mismatch between the assumed filter and the effect of numerical discretizations. However, a-posteriori learning incurs high computational costs due to the need to backpropagate gradients through an LES solver. Furthermore, a-posteriori methods are challenging to apply broadly since they require significant modification of existing solvers. Finally, these approaches have also been observed to be limited when generalization is desired across different numerical schemes. In this work, we discuss a novel approach for the deep learning of turbulence closure models motivated by the continuous data assimilation (CDA) approach (also known as nudging). Our approach enables a-priori training of closures for coarse-grid LES, treating DNS data as sparse observations. This approach enables the deep learning model to successfully learn the required forcing to capture the ground-truth statistics while maintaining long term stability without needing adjoints or backpropagation through the solver. We train and evaluate the model’s ability to adapt to different numerical and temporal schemes. Additionally, we analyse the model behavior with varying numerical discretization errors and compare its predictions to traditional closure models.

[LG-123] Accelerating Quantum Materials Characterization: Hybrid Active Learning for Autonomous Spin Wave Spectroscopy

链接: https://arxiv.org/abs/2604.23821
作者: William Ratcliff II
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 47 pages, 13 figures, 5 tables

点击查看摘要

Abstract:Autonomous neutron spectroscopy must solve three distinct tasks: detection (where is the signal?), inference (which Hamiltonian governs it?), and refinement (what are the parameters?). No single controller solves all three equally well. We present TAS-AI, a hybrid agnostic-to-physics-informed framework for autonomous triple-axis spin-wave spectroscopy that separates these tasks explicitly. In blind reconstruction benchmarks, model-agnostic methods such as random sampling, coarse grids, and Gaussian-process mappers reach a global error threshold more reliably and with fewer measurements than physics-informed planning, supporting the claim that discovery and inference are distinct tasks requiring distinct controllers. Once signal structure is localized, the physics-informed stage performs in-loop Hamiltonian discrimination and parameter refinement: in a controlled square-lattice test between nearest-neighbor-only and J1-J2 Hamiltonians, TAS-AI reaches a decisive AIC-derived evidence ratio (100) in fewer than 10 measurements, while motion-aware scheduling cuts wall-clock time by 32% at a fixed measurement budget. We also identify a failure mode of posterior-weighted design, algorithmic myopia, in which the planner over-refines the current leading model while under-sampling low-intensity falsification probes. A constrained falsification channel sharply reduces time spent committed to the wrong model and accelerates correct model selection without modifying the Bayesian inference engine. In controlled two-model ablations, both a deterministic top-two max-disagreement rule and an LLM-based audit committee achieve this gain under identical constraints. We demonstrate the full workflow in silico using a high-fidelity digital twin and provide an open-source Python implementation.

[LG-124] Fixed-Reservoir vs Variational Quantum Architectures for Chaotic Dynamics: Benchmarking QRC and QPINN on the Lorenz System

链接: https://arxiv.org/abs/2604.23743
作者: Tushar Pandey
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deploying quantum machine learning on NISQ devices requires architectures where training overhead does not negate computational advantages. We systematically compare two quantum approaches for chaotic time-series prediction on the Lorenz system: a variational Quantum Physics-Informed Neural Network (QPINN) and a Quantum Reservoir Computing (QRC) framework utilizing a fixed transverse-field Ising Hamiltonian. Under matched resources ( 4 – 5 qubits, 2 – 3 layers), QRC achieves an 81% lower mean-squared error (test MSE 3.2 \pm 0.6 vs. 47.9 \pm 36.6 for QPINN) while training \sim 52,000\times faster ( 0.2 ,s vs. \sim 2.4 ,h per seed). Drawing on the classical delay-embedding principle, we formalize a temporal windowing technique within the QRC pipeline that improves attractor reconstruction by providing bounded, structured input history. Analysis reveals that QPINN instability stems from capacity limitations and competing loss terms rather than barren plateaus; gradient norms remained large ( 10^3 – 10^4 ), ruling out exponential suppression at this scale. These failure modes are absent by construction in the non-variational QRC approach. We validate robustness across three canonical systems (Lorenz, Rössler, and Lorenz-96), where QRC consistently achieves low test MSE ( 3.1 \pm 0.6 , 1.8 \pm 0.1 , and 12.4 \pm 0.6 , respectively) with sub-second training. Our findings suggest the fixed-reservoir architecture is a primary driver of QRC’s advantage at these scales, warranting further investigation at larger qubit counts and on hardware where quantum-specific advantages are expected to emerge.

[LG-125] High-dimensional Semi-supervised Classification via the Fermat Distance

链接: https://arxiv.org/abs/2604.23573
作者: Ruoxu Tan,Yiming Zang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Semi-supervised classification, where unlabeled data are massive but labeled data are limited, often arises in machine learning applications. We address this challenge under high-dimensional data by leveraging the manifold and cluster assumptions. Based on the Fermat distance, a density-sensitive metric that naturally encodes the cluster assumption, we propose the weighted k -nearest neighbors (NN) classifier and multidimensional scaling (MDS)-induced classifiers. The use of MDS with a large target dimension allows the effective application of linear classifiers to complex manifold data. Theoretically, we derive a sharp lower bound for the expected excess risk within clusters and prove that the weighted k -NN classifier utilizing the true Fermat distance is minimax optimal. Furthermore, we explicitly quantify the utility of unlabeled data by showing that the error arising from estimating the Fermat distance decays exponentially with the pooled sample size. Such a rate is much faster than the related rates in the literature. Extensive experiments on synthetic and real datasets demonstrate competitive or superior performance of our approaches compared to state-of-the-art graph-based semi-supervised classifiers.

[LG-126] Probabilistic Graphical Model using Graph Neural Networks for Bayesian Inversion of Discrete Structural Component States

链接: https://arxiv.org/abs/2604.23514
作者: Teng Li,Stephen Wu,Yong Huang,James L. Beck,Hui Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:The health condition of components in civil infrastructures can be described by various discrete states according to their performance degradation. Inferring these states from measurable responses is typically an ill-posed inverse problem. Although Bayesian methods are well-suited to tackle such problems, computing the posterior probability density function (PDF) presents challenges. The likelihood function cannot be analytically formulated due to the unclear relationship between discrete states and structural responses, and the high-dimensional state parameters resulting from numerous components severely complicates the computation of the marginal likelihood function. To address these challenges, this study proposes a novel Bayesian inversion paradigm for discrete variables based on Probabilistic Graphical Models (PGMs). The Markov networks are employed as modeling tools, with model parameters learned from data and structural topology prior. It has been proved that inferring this PGM produces the same probabilistic estimation as the posterior PDF derived from Bayesian inference, which effectively solves the above challenges. The inference is accomplished by Graph Neural Networks (GNNs), and a graph property-based GNN training strategy is developed to enable accurate inference across varying graph scales, thereby significantly reducing the computational overhead in high-dimensional problems. Both synthetic and experimental data are used to validate the proposed framework

[LG-127] Inference of Online Newton Methods with Nesterovs Accelerated Sketching

链接: https://arxiv.org/abs/2604.23436
作者: Haoxuan Wang,Xinchen Du,Sen Na
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Computation (stat.CO)
*备注: 51 pages, 2 tables, 3 figures

点击查看摘要

Abstract:Reliable decision-making with streaming data requires principled uncertainty quantification of online methods. While first-order methods enable efficient iterate updates, their inference procedures still require updating proper (covariance) matrices, incurring O(d^2) time and memory complexity, and are sensitive to ill-conditioning and noise heterogeneity of the problem. This costly inference task offers an opportunity for more robust second-order methods, which are, however, bottlenecked by solving Newton systems with O(d^3) complexity. In this paper, we address this gap by studying an online Newton method with Hessian averaging, where the Newton direction at each step is approximately computed using a sketch-and-project solver with Nesterov’s acceleration, matching O(d^2) complexity of first-order methods. For the proposed method, we quantify its uncertainty arising from both random data and randomized computation. Under standard smoothness and moment conditions, we establish global almost-sure convergence, prove asymptotic normality of the last iterate with a limiting covariance characterized by a Lyapunov equation, and develop a fully online covariance estimator with non-asymptotic convergence guarantees. We also connect the resulting uncertainty quantification to that of exact and sketched Newton methods without Nesterov’s acceleration. Extensive experiments on regression models demonstrate the superiority of the proposed method for online inference.

[LG-128] On (not) learning the Möbius function

链接: https://arxiv.org/abs/2604.23427
作者: Alexey Pozdnyakov
类目: Number Theory (math.NT); Machine Learning (cs.LG)
*备注: 62 pages

点击查看摘要

Abstract:We prove lower bounds on learning the Möbius or Liouville function with a variety of standard learning techniques, including kernel methods, noisy gradient methods, and correlational statistical query algorithms. These results follow from quantitative bounds on the correlation of Möbius with digital characters of various finite abelian groups, where the group is dictated by the type of input data the algorithm is given. Using residues mod p for many different primes corresponds to a cyclic group, and using the base p expansion for a fixed prime corresponds to an elementary abelian p -group. We also note that lower bounds of this form are closely related to certain types of digital prime number theorems.

[LG-129] Explicit integral representations and quantitative bounds for two-layer ReLU networks

链接: https://arxiv.org/abs/2604.23260
作者: Anthony Lee
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:An approach to construct explicit integral representations for two-layer ReLU networks is presented, which provides relatively simple representations for any multivariate polynomial. Quantitative bounds are provided for a particular, sharpened ReLU integral representation, which involves a harmonic extension and a projection. The bounds demonstrate that functions can be approximated with L^2(\mathcalD) errors that do not depend explicitly on dimension or degree, but rather the coefficients of their monomial expansions and the distribution \mathcalD .

[LG-130] Learning Curves and Benign Overfitting of Spectral Algorithms in Large Dimensions

链接: https://arxiv.org/abs/2604.23212
作者: Weihao Lu,Qian Lin,Yingcun Xia,Dongming Huang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Existing large-dimensional theory for spectral algorithms resolves either the optimally tuned point or the interpolation limit, but leaves the under-regularized regime unexplored. We study the learning curve and benign overfitting of spectral algorithms in the large-dimensional setting where the sample size and dimension are of comparable order, i.e., n \asymp d^\gamma for some \gamma0 . We first consider inner-product kernels on the sphere \mathbbS^d-1 and establish a sharp asymptotic characterization of the excess risk across the full regularization path under various source conditions s \geq 0 , where s measures the relative smoothness of the regression function. Our results reveal that the learning curve is not simply U-shaped but instead consists of three distinct regimes: over-regularized, under-regularized, and interpolation regimes. This characterization allows us to fully capture the benign overfitting phenomenon, demonstrating that benign overfitting arises consistently across both the under-regularized and interpolation regimes whenever s is positive but no larger than a critical threshold. We further show that, in the sufficiently regularized regime, the kernel learning curve is recovered by an associated sequence model. Finally, we extend the learning-curve analysis to large-dimensional KRR for a class of kernels on general domains in \mathbbR^d whose low-degree eigenspaces satisfy spectral-scaling and hyper-contractivity conditions.

[LG-131] A Dynamic Learning Observatory Reveals the Rapid Salinization of Satkhira Bangladesh

链接: https://arxiv.org/abs/2604.23127
作者: Showmitra Kumar Sarkar,Sai Ravela
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Soil salinity is a major environmental challenge in coastal Bangladesh, threatening agricultural productivity and local livelihoods. This study develops a machine-learning-based framework to predict and map soil salinity in Satkhira district by integrating field observations with Landsat-derived spectral indices. A total of 205 soil samples collected during 2024-2025 were used to train an Extreme Gradient Boosting (XGBoost) model, and predictions were further improved using a Generalized Additive Model (GAM). Spatial cross-validation was applied to reduce autocorrelation bias, and bootstrap resampling was used to quantify prediction uncertainty. The results show strong spatial variability of soil salinity, with higher concentrations in the southern and central coastal regions and lower levels in the northern inland areas. Vegetation indices, particularly NDVI, along with salinity-related spectral indicators, were identified as key predictors. 10-year-window peak-exposure maps generated for 2014-2023 reveal recurrent high-salinity zones and a persistent, expanding footprint of moderate-to-high salinity exposure across the central parts of the district. Uncertainty analysis indicates higher variability in coastal zones and improved prediction stability when multi-year datasets are combined. The proposed framework provides a robust and scalable approach for long-term monitoring of soil salinity. It supports climate-resilient agriculture, land-use planning, and evidence-based decision-making in coastal Bangladesh.

[LG-132] MOCA: A Transformer-based Modular Causal Inference Framework with One-way Cross-attention and Cutting Feedback

链接: https://arxiv.org/abs/2604.23107
作者: Lei Wang,Debashis Ghosh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 25 pages, 8 figures, 4 tables. Preprint

点击查看摘要

Abstract:Causal effect estimation from observational data requires careful adjustment for confounding. Classical estimators such as inverse probability weighting and augmented inverse probability weighting are effective under favorable model specification, but may become unstable when treatment assignment and outcome mechanisms are complex, non-linear, and high-dimensional. Machine learning and representation learning approaches improve flexibility, yet joint training can allow outcome-related information to influence treatment-side representations, which is undesirable from a causal perspective. We propose MOCA (Modular One-way Causal Attention), a transformer-based framework that separates treatment and outcome modeling through a modular design, and performs confounder adjustment using a one-way attention mechanism. A cutting-feedback strategy, implemented via gradient detachment, prevents the outcome loss from updating the treatment module. This design preserves directional information flow while retaining the representational power of transformer architectures for causal inference. Across multiple simulated scenarios, including linear, nonlinear, heavy-tailed, hidden confounding, and high-dimensional settings, MOCA shows competitive or improved performance relative to IPW, AIPW, X-learner, TARNet, and DragonNet. We further illustrate the method on the Infant Health and Development Program dataset and the Dehejia-Wahba dataset as real-world benchmarks. These results suggest that modular attention with one-way information flow provides a promising and interpretable direction for causal inference with modern deep learning models.

[LG-133] urtle shell clustering: A mixture approach to discriminative clustering with applications to flow cytometry and other data

链接: https://arxiv.org/abs/2604.23083
作者: Mackenzie R. Neal,Paul D. McNicholas,Arthur White
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Generative approaches to clustering provide information on geometric properties of clusters, whereas discriminative approaches provide boundaries between clusters. Ideas from both approaches are incorporated to present a fully unsupervised, probabilistic, and discriminative clustering method via a regularized mutual information objective function, wherein a mixture of mixtures of Gaussian and uniform distributions is used for formulation of the conditional model. Automatic selection of the number of components is established with the introduction of the regularizing term and a merge step, similar to those applied in reversible jump Markov chain Monte Carlo methods used in Bayesian clustering. Consequently, the turtle shell method – a fully unsupervised clustering method capable of estimating non-linear boundary lines, automatically selecting the number of components, and capturing intuitive clusters in the presence of data abnormalities such as noise and/or irregular cluster shapes – is introduced. We test this method on various simulated and real datasets commonly explored in clustering research, and extend the analysis to datasets arising from flow cytometry experiments.

[LG-134] Rethinking Trust Region Bayesian Optimization in High Dimensions

链接: https://arxiv.org/abs/2604.22967
作者: Wei-Ting Tang,Joel A. Paulson
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Trust Region Bayesian Optimization (TuRBO) is an effective strategy for alleviating the curse of dimensionality in high-dimensional black-box optimization. However, inappropriate lengthscale design can cause the local Gaussian process (GP) model within the trust region to degenerate, leading to suboptimal performance in high dimensions. In this work, we show that TuRBO’s local GP may remain either excessively complex or overly simple as the dimension D and trust region side length L vary. To address this issue, we propose a straightforward variant, AdaScale-TuRBO, which scales the GP lengthscale with both the problem dimension and trust region size, thereby preserving kernel geometry and maintaining consistent prior complexity. Empirically, we show that AdaScale-TuRBO can robustly outperform standard TuRBO and other popular high-dimensional BO methods on synthetic benchmarks and real-world trajectory planning tasks.

[LG-135] A Specialized Importance-Aware Quantum Convolutional Neural Network with Ring-Topology (IA-QCNN) for MGMT Promoter Methylation Prediction in Glioblastoma

链接: https://arxiv.org/abs/2604.22877
作者: Emine Akpinar,Murat Oduncuoglu
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: Submitted to Applied Soft Computing

点击查看摘要

Abstract:GBM is a highly aggressive primary malignancy in adults, necessitating personalized therapeutic strategies due to its inherent molecular heterogeneity. MGMT promoter methylation is a pivotal prognostic biomarker for anticipating response to temozolomide-based chemotherapy. Although various AI frameworks have been developed for non-invasive MGMT prediction, spatial heterogeneity of methylation status and the high-dimensional and correlated nature of MRI data frequently constrain discriminative feature learning and generalizability of classical models. To circumvent these limitations, a specialized IA-QCNN architecture is proposed, based on the principles of quantum mechanics, including superposition and entanglement, and enabling more efficient representation learning in high-dimensional Hilbert space. The framework establishes a methodological bridge between GBM radiogenomics and quantum deep learning by integrating energy-based slice selection, importance-aware weighting, ring-topology quantum convolution, and folding-based pooling layers. When the model predicts MGMT promoter methylation status using both mpMRI and T1Gd images, experimental results demonstrate that the IA-QCNN achieves high accuracy despite its low number of trainable parameters while effectively minimizing the overfitting problem observed in classical models. Quantitative analyses reveal that the T1Gd modality possesses higher discriminative power than mpMRI, establishing a clinically significant sequence preference. Furthermore, the model exhibits exceptional robustness in hybrid noise environments, effectively utilizing noise as a regularization mechanism to enhance predictive performance. Consequently, the specialized IA-QCNN architecture provides a robust and computationally efficient alternative to classical approaches in the analysis of heterogeneous radiogenomic data.

[LG-136] Sliced Wasserstein Steering between Gaussian Measures

链接: https://arxiv.org/abs/2604.22807
作者: Kaito Ito,Anqi Dong
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注: Accepted at the European Control Conference 2026

点击查看摘要

Abstract:Optimal transport with quadratic cost provides a geometric framework for steering an ensemble, modeled by a probability law, with minimal effort. Yet ambient-space formulations become unwieldy in high dimensions, and sensing or actuation in practice often reveals only linear views of the state – camera silhouettes, LiDAR beams, tomographic slices. We develop a sliced feedback controller for distribution steering: the evolving law is projected onto one-dimensional directions on the sphere, the optimal one-dimensional velocity is synthesized in each projection, and these velocities are averaged to produce a feedback control in the ambient space. The construction reduces to the Benamou–Brenier problem in one dimension. In addition, it is invariant under orthogonal transforms, nonexpansive under projections, and well posed on \mathcalP_2(\mathbbR^n) . Computation proceeds by sampling directions on the sphere and solving independent one-dimensional subproblems, yielding a scalable method aligned with partial observations. In the Gaussian setting, we show that the developed sliced controller steers the law to the prescribed target. Furthermore, we derive an identity relating the energy consumption incurred by the controller to the sliced Wasserstein distance.

[LG-137] Context-Integrated Adversarial Learning for Predictive Modelling of Stock Price Dynamics

链接: https://arxiv.org/abs/2604.22801
作者: Alexis Lazanas,Spyros Christodoulou,Spyridon Karpouzis
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注: 9 pages, 5 figures

点击查看摘要

Abstract:It is a challenging task to forecast equity prices in fast moving financial markets as this becomes even more difficult when the predictive signal is based on non-homogeneous information channels. The classical statistical methods, especially the Autoregressive Integrated Moving Average (ARIMA) models, limit their analytical ability with the linear assumptions that prevent the modeling of complex temporal dynamics. In contrast, complex neural networks, including Long Short-Term Memory (LSTM) networks, are also skilled at capturing sequential interaction effects; they however tend to collapse in the face of abrupt shifts in volatility and changing distributions. In this paper we introduce a context-sensitive adversarial learning model to predict equity prices in this work, which is synthesized distribution-based generative modelling with sentiment-based auxiliary information obtained through Natural Language Processing (NLP). The architecture uses adversarial training to model future price movements and incorporates contextual sentiment features derived using financial textual data. Through a collective utilization of quantitative market indicators along with the additional contextual cues, the framework hopes to enhance the reliability of forecasts during the periods of increased volatility and regime change. Empirical evaluation of a sample of U.S. equities testifies that the presented approach outperforms the traditional ARIMA and LSTM baselines in a range of measures of error. These findings imply that context-sensitive adversarial paradigm is an effective instrument of enhancing stock price prediction effectiveness in complex financial environments characterized by uncertainty and structural changes.

[LG-138] From Equations to Algorithms and Data: Transforming Microwave Engineering and Education with Machine Learning

链接: https://arxiv.org/abs/2604.22792
作者: Mehmet Parlak,Islam Guven
类目: ignal Processing (eess.SP); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 4 figures, 1 table, conference

点击查看摘要

Abstract:Conventional microwave engineering education relies heavily on analytical methods, canonical circuit topologies, and intuition-driven design, which have proven effective at microwave frequencies. However, as systems increasingly operate in the millimeter-wave and terahertz regimes, parasitic effects, process-dependent electromagnetic interactions, and ultra-wideband performance requirements challenge both topology/layout-constrained traditional design methodologies and existing teaching paradigms. This paper proposes a pedagogical shift in microwave and RFIC (Radio Frequency Integrated Circuit) engineering and education by introducing machine-learning (ML) and data-driven electromagnetic synthesis as a complementary design framework for microwave circuits such as power dividers and combiners, couplers, and baluns. Rather than emphasizing predefined topologies, the proposed approach enables topology-agnostic, performance-oriented exploration of the design space, allowing students to directly engage with electromagnetic behavior through specification-driven synthesis. By integrating machine-learning-based inverse design and multi-objective optimization into the curriculum, the framework enhances physical intuition, encourages design creativity, and better aligns microwave education with emerging industrial practices in high-frequency and ultra-wideband system design.

[LG-139] Non-Destructive Prediction of Fruit Ripeness and Firmness Using Hyperspectral Imaging and Lightweight Machine Learning Models

链接: https://arxiv.org/abs/2604.22788
作者: Phongsakon Mark Konrad,Casper Kunstmann-Olsen,Jacek Fiutowski,Serkan Ayvaz
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Post-harvest fruit quality assessment is essential for reducing food waste, yet reliable non-destructive methods typically depend on expensive hyperspectral cameras and computationally intensive deep learning models. These systems typically require GPU resources, large-scale training data, and domain expertise, limiting their feasibility for many real-world agricultural settings. This study systematically evaluates 20 classical machine learning algorithms on hyperspectral imaging data for simultaneous ripeness classification and firmness prediction across five fruit species, using cross-validated experimental design with Bayesian hyperparameter optimization. Data preprocessing strategy, particularly class balancing and spectral transformations, contributes as much to prediction accuracy as algorithm choice. Our results show that tree-based machine learning models can outperform state-of-the-art deep earning models reported in Fruit-HSNet. Moreover, the findings indicate that only three visible-range wavelengths are needed to recover over 94% of full-spectrum accuracy, demonstrating that low-cost multispectral sensors combined with lightweight machine learning models can serve as practical alternatives to expensive hyperspectral cameras and complex deep learning approaches for practical fruit quality sorting.

附件下载

点击下载今日全部论文列表