This post contains the latest paper list retrieved from arXiv.org on 2026-04-17. It is updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: paper data is fetched from arXiv.org and updated automatically at around 12:30 each day.

Tip: if the list has not been updated on a given day, either arXiv published no new papers that day or the update script failed. We try to fix it the same day whenever possible.

Table of Contents

Overview (2026-04-17)

624 papers are updated today, including:

  • Natural Language Processing: 117 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 240 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 115 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 163 papers (Machine Learning (cs.LG))
  • Multi-Agent Systems: 6 papers (Multiagent Systems (cs.MA))
  • Information Retrieval: 21 papers (Information Retrieval (cs.IR))
  • Human-Computer Interaction: 20 papers (Human-Computer Interaction (cs.HC))

Multiagent Systems

[MA-0] CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

[Quick Read]: This paper addresses the decline in cooperative behavior that large language models (LLMs) exhibit in mixed-motive games: in single-shot settings, even LLMs with strong reasoning capabilities tend to defect, posing a major safety challenge for interactions between generative AI systems. The key to the solution is the introduction and systematic evaluation of four game-theoretic mechanism designs: repeated play, reputation systems, third-party mediators, and outcome-conditional payment contracts. The study finds that the contract and mediation mechanisms are the most effective at sustaining stable cooperation between LLMs, and that these mechanisms become even more effective under evolutionary pressure (i.e., selection toward maximizing individual payoffs), indicating good robustness and scalability.

Link: https://arxiv.org/abs/2604.15267
Authors: Emanuel Tewolde, Xiao Zhang, David Guzman Piedrahita, Vincent Conitzer, Zhijing Jin
Affiliations: unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: 65 pages, 38 Figures, 8 Tables, 17 Listings

Abstract: It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet, recent works report the opposite trend: LLMs with stronger reasoning capabilities behave less cooperatively in mixed-motive games such as the prisoner’s dilemma and public goods settings. Indeed, our experiments show that recent models – with or without reasoning enabled – consistently defect in single-shot social dilemmas. To tackle this safety concern, we present the first comparative study of game-theoretic mechanisms that are designed to enable cooperative outcomes between rational agents in equilibrium. Across four social dilemmas testing distinct components of robust cooperation, we evaluate the following mechanisms: (1) repeating the game for many rounds, (2) reputation systems, (3) third-party mediators to delegate decision making to, and (4) contract agreements for outcome-conditional payments between players. Among our findings, we establish that contracting and mediation are most effective in achieving cooperative outcomes between capable LLM models, and that repetition-induced cooperation deteriorates drastically when co-players vary. Moreover, we demonstrate that these cooperation mechanisms become more effective under evolutionary pressures to maximize individual payoffs.
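
The effect of mechanism (4), outcome-conditional contracts, can be illustrated with a toy one-shot prisoner's dilemma in which each player contracts to pay a penalty to the other upon defecting. All payoff and penalty numbers below are hypothetical, not taken from the paper:

```python
# Base payoffs (row player, column player): C = cooperate, D = defect.
BASE = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def with_contract(payoffs, penalty=4):
    """Each player signs a contract paying `penalty` to the other if they defect."""
    out = {}
    for (a, b), (pa, pb) in payoffs.items():
        ta = -penalty if a == "D" else 0   # transfer paid by row player
        tb = -penalty if b == "D" else 0   # transfer paid by column player
        out[(a, b)] = (pa + ta - tb, pb + tb - ta)
    return out

def best_response(payoffs, opponent_action, player=0):
    """Action maximizing the player's payoff against a fixed opponent action."""
    actions = ["C", "D"]
    if player == 0:
        return max(actions, key=lambda a: payoffs[(a, opponent_action)][0])
    return max(actions, key=lambda b: payoffs[(opponent_action, b)][1])

# Without the contract, defection is dominant; with it, cooperation is.
assert best_response(BASE, "C") == "D"
contracted = with_contract(BASE)
assert best_response(contracted, "C") == "C"
assert best_response(contracted, "D") == "C"
```

With a large enough penalty the transfer makes cooperation a dominant strategy, which is the equilibrium logic the paper tests on LLM agents.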

[MA-1] FedGUI: Benchmarking Federated GUI Agents across Heterogeneous Platforms, Devices, and Operating Systems ACL2026

[Quick Read]: This paper targets the high cost and scalability challenges of training GUI agents (graphical user interface agents) with traditional centralized methods, as well as the lack of benchmarks for federated learning (FL) under realistic cross-platform heterogeneity. The key to the solution is FedGUI, the first comprehensive federated GUI-agent benchmark spanning mobile, web, and desktop platforms. It provides six curated datasets for systematically studying four crucial dimensions of heterogeneity (cross-platform, cross-device, cross-OS, and cross-source), and its experiments show that cross-platform collaboration improves performance and that platform and OS are the most influential factors, laying a solid foundation for building more scalable, privacy-preserving GUI agents.

Link: https://arxiv.org/abs/2604.14956
Authors: Wenhao Wang, Haoting Shi, Mengying Yuan, Yiquan Lin, Panrong Tong, Hanzhang Zhou, Guangyi Liu, Pengxiang Zhao, Yue Wang, Siheng Chen
Affiliations: Zhejiang University; Shanghai Jiao Tong University; Tongyi Lab; Multi-Agent Governance Intelligence Crew (MAGIC); Wuhan University
Subjects: Multiagent Systems (cs.MA)
Comments: ACL 2026 Findings, Camera Ready

Abstract:Training GUI agents with traditional centralized methods faces significant cost and scalability challenges. Federated learning (FL) offers a promising solution, yet its potential is hindered by the lack of benchmarks that capture real-world, cross-platform heterogeneity. To bridge this gap, we introduce FedGUI, the first comprehensive benchmark for developing and evaluating federated GUI agents across mobile, web, and desktop platforms. FedGUI provides a suite of six curated datasets to systematically study four crucial types of heterogeneity: cross-platform, cross-device, cross-OS, and cross-source. Extensive experiments reveal several key insights: First, we show that cross-platform collaboration improves performance, extending prior mobile-only federated learning to diverse GUI environments; Second, we demonstrate the presence of distinct heterogeneity dimensions and identify platform and OS as the most influential factors. FedGUI provides a vital foundation for the community to build more scalable and privacy-preserving GUI agents for real-world deployment. Our code and data are publicly available at this https URL…

[MA-2] Learning Ad Hoc Network Dynamics via Graph-Structured World Models

[Quick Read]: This paper tackles the difficulty of modeling the complex, coupled dynamics of ad hoc wireless networks, such as node mobility, energy depletion, and topology change. Traditional model-based approaches use flat state representations that lose per-node structural information, while model-free deep reinforcement learning requires inefficient sustained online interaction. The key to the solution is G-RSSM, a Graph-Structured Recurrent State Space Model that maintains per-node latent states and uses cross-node multi-head attention to learn the network dynamics jointly from offline trajectories. It is the first multi-physics graph-structured world model applied to combinatorial per-node decision making in size-agnostic wireless ad hoc networks: trained with only N = 50 nodes, the learned cluster-head selection policy maintains high connectivity across 27 scenarios covering MANET, VANET, FANET, WSN, and tactical networks.

Link: https://arxiv.org/abs/2604.14811
Authors: Can Karacelebi, Yusuf Talha Sahin, Elif Surer, Ertan Onur
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
Comments: 6 pages, 4 figures. Submitted to the IEEE Global Communications Conference (GLOBECOM) 2026

Abstract: Ad hoc wireless networks exhibit complex, innate, and coupled dynamics: node mobility, energy depletion, and topology change, all difficult to model analytically. Model-free deep reinforcement learning requires sustained online interaction, whereas existing model-based approaches use flat state representations that lose per-node structure. We therefore propose G-RSSM, a graph-structured recurrent state space model that maintains per-node latent states with cross-node multi-head attention to learn the dynamics jointly from offline trajectories. We apply the proposed method to the downstream task of clustering, where a cluster-head selection policy trains entirely through imagined rollouts in the learned world model. Across 27 evaluation scenarios spanning MANET, VANET, FANET, WSN, and tactical networks with N = 30 to 1000 nodes, the learned policy maintains high connectivity despite being trained only with N = 50. Herein, we propose the first multi-physics graph-structured world model applied to combinatorial per-node decision making in size-agnostic wireless ad hoc networks.

[MA-3] Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG

[Quick Read]: This paper addresses a limitation of conventional retrieval-augmented generation (RAG): the model acts as a passive consumer of retrieved information, unaware of how the corpus is organized or what has not been retrieved, which limits its ability to backtrack and to combine evidence scattered across chunks. The key to the solution is Corpus2Skill, which in an offline stage compiles the document corpus into a hierarchical skill directory by iteratively clustering documents, generating LLM-written summaries at each level, and materializing the result as a navigable tree of skill files. At inference time, the LLM agent gains a global view of the corpus, drills into topic branches via progressively finer summaries, and retrieves full documents by ID, enabling it to actively reason about where to look, backtrack from unproductive paths, and combine evidence across branches.

Link: https://arxiv.org/abs/2604.14572
Authors: Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh
Affiliations: unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) grounds LLM responses in external evidence but treats the model as a passive consumer of search results: it never sees how the corpus is organized or what it has not yet retrieved, limiting its ability to backtrack or combine scattered evidence. We present Corpus2Skill, which distills a document corpus into a hierarchical skill directory offline and lets an LLM agent navigate it at serve time. The compilation pipeline iteratively clusters documents, generates LLM-written summaries at each level, and materializes the result as a tree of navigable skill files. At serve time, the agent receives a bird’s-eye view of the corpus, drills into topic branches via progressively finer summaries, and retrieves full documents by ID. Because the hierarchy is explicitly visible, the agent can reason about where to look, backtrack from unproductive paths, and combine evidence across branches. On WixQA, an enterprise customer-support benchmark for RAG, Corpus2Skill outperforms dense retrieval, RAPTOR, and agentic RAG baselines across all quality metrics.
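
The navigate-instead-of-retrieve idea can be sketched with a toy skill tree; the node contents are hypothetical, and keyword overlap stands in for the LLM's branch selection:

```python
from dataclasses import dataclass, field

@dataclass
class SkillNode:
    summary: str                                  # an LLM-written summary in the paper; keywords here
    doc_ids: list = field(default_factory=list)   # leaves expose retrievable document IDs
    children: list = field(default_factory=list)

def navigate(node, query_terms):
    """Greedy drill-down: at each level, follow the child whose summary
    overlaps the query the most, then return the leaf's document IDs."""
    while node.children:
        node = max(node.children,
                   key=lambda c: len(query_terms & set(c.summary.split())))
    return node.doc_ids

# Toy two-level skill directory over a four-document "corpus".
root = SkillNode("all support topics", children=[
    SkillNode("billing payments refunds", doc_ids=["doc-billing-1", "doc-refund-2"]),
    SkillNode("domains dns email setup", doc_ids=["doc-dns-1", "doc-email-2"]),
])

assert navigate(root, {"refunds", "payments"}) == ["doc-billing-1", "doc-refund-2"]
assert navigate(root, {"dns"}) == ["doc-dns-1", "doc-email-2"]
```

Because the hierarchy is explicit, an agent (unlike this greedy toy) can also compare branches, back out of an unproductive one, and gather evidence from several leaves.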

[MA-4] VeriGraphi: A Multi-Agent Framework of Hierarchical RTL Generation for Large Hardware Designs

[Quick Read]: This paper addresses the structural failures that large language models (LLMs) exhibit when generating complex, hierarchical Verilog, including loss of context across modules, interface hallucination, fabricated inter-module wiring, and loss of structural coherence. The key to the solution is the VeriGraphi framework, which uses a spec-anchored Knowledge Graph as the architectural substrate of the RTL generation pipeline. Through iterative multi-agent analysis it builds a structured hardware-design knowledge graph (HDA) that explicitly encodes module hierarchy, port-level interfaces, wiring semantics, and inter-module dependencies, and then uses this deterministic, machine-checkable structural scaffold to guide the progressive generation of pseudo-code and synthesizable RTL, enforcing interface consistency and dependency correctness and significantly improving the functional correctness and reliability of LLM-generated complex RISC-V hardware designs.

Link: https://arxiv.org/abs/2604.14550
Authors: Sazzadul Islam, Tasnim Tabassum, Hao Zheng
Affiliations: unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Programming Languages (cs.PL)
Comments: 9 pages, 2 figures, case studies

Abstract: Generating synthesizable Verilog for large, hierarchical hardware designs remains a significant challenge for large language models (LLMs), which struggle to replicate the structured reasoning that human experts employ when translating complex specifications into RTL. When tasked with producing hierarchical Verilog, LLMs frequently lose context across modules, hallucinate interfaces, fabricate inter-module wiring, and fail to maintain structural coherence - failures that intensify as design complexity grows and specifications involve informal prose, figures, and tables that resist direct operationalization. To address these challenges, we present VeriGraphi, a framework that introduces a spec-anchored Knowledge Graph as the architectural substrate driving the RTL generation pipeline. VeriGraphi constructs an HDA, a structured knowledge graph that explicitly encodes module hierarchy, port-level interfaces, wiring semantics, and inter-module dependencies as first-class graph entities and relations. Built through iterative multi-agent analysis of the specification, this Knowledge Graph provides a deterministic, machine-checkable structural scaffold before code generation. Guided by the KG, a progressive coding module incrementally generates pseudo-code and synthesizable RTL while enforcing interface consistency and dependency correctness at each submodule stage. We evaluate VeriGraphi on a benchmark of three representative specification documents from the National Institute of Standards and Technology and their corresponding implementations, and we present an RV32I processor as a detailed case study to illustrate the full pipeline. The results demonstrate that VeriGraphi enables reliable hierarchical RTL generation with minimal human intervention for RISC-V, marking a significant milestone for LLM-generated hardware design while maintaining strong functional correctness.
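
The value of making wiring explicit and machine-checkable can be sketched with a toy graph of modules and port declarations; the entity names and the check below are our illustration, not VeriGraphi's actual schema:

```python
# A module hierarchy with port-level interfaces encoded as plain data,
# plus a deterministic check that every connection matches a declared port.
modules = {
    "alu":  {"ports": {"a": 32, "b": 32, "out": 32}},
    "core": {"ports": {"clk": 1}, "children": ["alu"]},
}
connections = [("core", "alu", "a", 32), ("core", "alu", "out", 32)]

def check_wiring(modules, connections):
    """Flag connections whose target port is undeclared or width-mismatched."""
    errors = []
    for parent, child, port, width in connections:
        declared = modules.get(child, {}).get("ports", {})
        if port not in declared:
            errors.append(f"{child}.{port}: no such port")
        elif declared[port] != width:
            errors.append(f"{child}.{port}: width {declared[port]} != {width}")
    return errors

assert check_wiring(modules, connections) == []
# A hallucinated port is caught before any code generation happens.
assert check_wiring(modules, [("core", "alu", "cin", 1)]) == ["alu.cin: no such port"]
```

This is the structural role the HDA plays: interface errors become graph-consistency violations detectable before RTL is emitted.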

[MA-5] Separation is Optimal for LQR under Intermittent Feedback

[Quick Read]: This paper addresses the communication-constrained linear-quadratic regulator (LQR) problem, i.e., how to design optimal scheduling and control policies when channel bandwidth limits information transmission. The key result is a proof that the separation principle holds for this class of problems; solving the ensuing dynamic program then shows that the optimal scheduling policy is a symmetric threshold rule on the disturbance accumulated since the most recent update, while the optimal controller is a discounted linear feedback law independent of the scheduling policy. These results hold under i.i.d. zero-mean disturbances with a symmetric distribution, showing that scheduling and control can be optimized separately while achieving the optimal performance trade-off.

Link: https://arxiv.org/abs/2603.27833
Authors: Abdullah Y. Etcibasi, C. Emre Koksal, Eylem Ekici
Affiliations: unknown
Subjects: Optimization and Control (math.OC); Information Theory (cs.IT); Multiagent Systems (cs.MA); Robotics (cs.RO); Systems and Control (eess.SY)
Comments:

Abstract:In this work, we first prove that the separation principle holds for communication-constrained LQR problems under i.i.d. zero-mean disturbances with a symmetric distribution. We then solve the dynamic programming problem and show that the optimal scheduling policy is a symmetric threshold rule on the accumulated disturbance since the most recent update, while the optimal controller is a discounted linear feedback law independent of the scheduling policy.
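
A scalar-system simulation can sketch the resulting structure. All constants below are hypothetical and the estimator side is simplified away: control is applied only at update times, and updates fire when the accumulated disturbance crosses a symmetric threshold:

```python
import random

def simulate(threshold, steps=2000, a=1.0, k=0.8, seed=0):
    """Scalar plant x_{t+1} = a*x_t + u_t + w_t with intermittent feedback."""
    rng = random.Random(seed)
    x, accumulated, cost, transmissions = 0.0, 0.0, 0.0, 0
    for _ in range(steps):
        w = rng.gauss(0.0, 1.0)           # i.i.d. zero-mean, symmetric disturbance
        accumulated += w
        if abs(accumulated) > threshold:  # symmetric threshold rule on accumulated disturbance
            u = -k * x                    # linear feedback law, independent of the schedule
            accumulated = 0.0
            transmissions += 1
        else:
            u = 0.0                       # no update: controller stays silent
        x = a * x + u + w
        cost += x * x
    return cost / steps, transmissions / steps

# A tighter threshold transmits more often; loosening it trades control
# performance against communication.
cost_tight, rate_tight = simulate(threshold=0.5)
cost_loose, rate_loose = simulate(threshold=3.0)
assert rate_tight > rate_loose
```

The paper's contribution is proving that this separated structure (threshold scheduler plus schedule-independent linear controller) is in fact optimal, not merely a heuristic.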

Natural Language Processing

[NLP-0] MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

[Quick Read]: This paper addresses the style inconsistency and poor global coherence that arise when generative AI (AIGC) tools are used for automated webpage generation, since page elements are typically generated in isolation without overall coordination. The key to the solution is MM-WebAgent, a hierarchical agentic framework that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection, jointly optimizing the global layout, local multimodal content, and their integration to ensure visually consistent and semantically coherent webpages.

Link: https://arxiv.org/abs/2604.15309
Authors: Yan Li, Zezi Zeng, Yifan Yang, Yuqing Yang, Ning Liao, Weiwei Guo, Lili Qiu, Mingxi Cheng, Qi Dai, Zhendong Wang, Zhengyuan Yang, Xue Yang, Ji Li, Lijuan Wang, Chong Luo
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code Data: this https URL.

[NLP-1] Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

[Quick Read]: This paper addresses the poorly understood per-instance reliability of LLM-as-judge frameworks for automatic NLG evaluation: existing studies rely mostly on aggregate metrics and overlook inconsistencies on individual examples, biasing our view of how trustworthy a judge's verdicts are. The key to the solution is a two-pronged diagnostic toolkit: (1) a transitivity analysis that reveals widespread per-input inconsistency hidden by low aggregate violation rates (33–67% of documents exhibit a directed 3-cycle), and (2) split conformal prediction sets over 1–5 Likert scores with theoretically guaranteed ≥ 1−α coverage, using prediction-set width as a quantitative per-instance reliability indicator (r_s = +0.576, p < 10⁻¹⁰⁰). The width indicator agrees strongly across judges (r̄ = 0.32–0.38), indicating that it captures document difficulty rather than judge-specific noise, enabling fine-grained diagnosis and quantification of evaluation reliability.

Link: https://arxiv.org/abs/2604.15302
Authors: Manan Gupta, Dhruv Kumar
Affiliations: BITS Pilani, Pilani Campus, India
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Under Review

Abstract: LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: (1) a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates (ρ̄ = 0.8–4.1%), with 33–67% of documents exhibiting at least one directed 3-cycle; and (2) split conformal prediction sets over 1–5 Likert scores providing theoretically guaranteed ≥ (1−α) coverage, with set width serving as a per-instance reliability indicator (r_s = +0.576, N = 1,918, p < 10⁻¹⁰⁰, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement (r̄ = 0.32–0.38), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size ≈ 3.0) and coherence moderately so (avg. set size ≈ 3.9), while fluency and consistency remain unreliable (avg. set size ≈ 4.9). We release all code, prompts, and cached results.
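
Split conformal prediction over Likert scores can be sketched in a few lines; the toy judge probabilities below are illustrative, not SummEval data:

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal quantile of nonconformity scores (1 - prob. of true label)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha)) - 1   # index of the conformal quantile
    return scores[min(k, n - 1)]

def prediction_set(probs, qhat):
    """All Likert labels whose nonconformity does not exceed the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= qhat]

# A judge that is confident on some documents and diffuse on others.
confident = [0.8, 0.05, 0.05, 0.05, 0.05]   # probabilities over Likert scores 1..5
uncertain = [0.3, 0.3, 0.3, 0.05, 0.05]
cal_probs = [confident] * 10 + [uncertain] * 10
cal_labels = [0] * 20                        # true score is "1" throughout this toy set

qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
# Wider set on the harder document: set width signals per-instance reliability.
assert len(prediction_set(confident, qhat)) < len(prediction_set(uncertain, qhat))
```

The coverage guarantee comes from the calibration quantile: with probability at least 1−α, the true label lands inside the emitted set, and the set's width is exactly the paper's per-instance reliability indicator.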

[NLP-2] From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning

[Quick Read]: This paper addresses error propagation in conventional speculative decoding (SD) for large language model inference, which stems from its token-level verification scheme. Existing remedies rely on external reward models, which add latency and compute and limit generalizability. The key to the solution is the SpecGuard framework, which performs step-level verification using only model-internal signals: at each step it samples multiple draft candidates and uses two lightweight internal signals, an attention-based grounding score (measuring attribution of the output to the input and previously accepted steps) and a log-probability-based confidence score, to jointly decide whether to accept the current step or recompute it, allocating compute precisely and efficiently.

Link: https://arxiv.org/abs/2604.15244
Authors: Kiran Purohit, Ramasuri Narayanam, Soumyabrata Pal
Affiliations: IIT Kharagpur; Adobe Research
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Speculative decoding (SD) accelerates large language model inference by allowing a lightweight draft model to propose outputs that a stronger target model verifies. However, its token-centric nature allows erroneous steps to propagate. Prior approaches mitigate this using external reward models, but incur additional latency, computational overhead, and limit generalizability. We propose SpecGuard, a verification-aware speculative decoding framework that performs step-level verification using only model-internal signals. At each step, SpecGuard samples multiple draft candidates and selects the most consistent step, which is then validated using an ensemble of two lightweight model-internal signals: (i) an attention-based grounding score that measures attribution to the input and previously accepted steps, and (ii) a log-probability-based score that captures token-level confidence. These signals jointly determine whether a step is accepted or recomputed using the target, allocating compute selectively. Experiments across a range of reasoning benchmarks show that SpecGuard improves accuracy by 3.6% while reducing latency by ~11%, outperforming both SD and reward-guided SD.
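
The accept-or-recompute decision can be sketched as a weighted ensemble of the two internal signals. The weighting, threshold, and score shapes below are our assumptions; the paper derives its signals from actual attention maps and token log-probabilities:

```python
import math

def step_confidence(token_logprobs):
    """Log-probability signal: mean per-token log-prob mapped into (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def grounding_score(attn_to_context, attn_total):
    """Attention-based signal: fraction of attention mass placed on the
    prompt plus previously accepted steps, rather than on new draft tokens."""
    return attn_to_context / attn_total

def accept_step(token_logprobs, attn_to_context, attn_total,
                w=0.5, threshold=0.6):
    """Accept the draft step if the ensembled score clears the threshold;
    otherwise it would be recomputed with the target model."""
    score = (w * step_confidence(token_logprobs)
             + (1 - w) * grounding_score(attn_to_context, attn_total))
    return score >= threshold

# Confident, well-grounded step -> accepted; diffuse, low-probability step -> recomputed.
assert accept_step([-0.1, -0.2, -0.1], attn_to_context=8.0, attn_total=10.0)
assert not accept_step([-2.0, -3.0, -2.5], attn_to_context=2.0, attn_total=10.0)
```

The point of the design is that both signals are free byproducts of the forward pass, so step-level verification adds no external reward model to the serving stack.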

[NLP-3] Context Over Content: Exposing Evaluation Faking in Automated Judges

[Quick Read]: This paper identifies a previously unmeasured systematic bias in LLM-as-a-judge evaluation pipelines: a judge's verdicts on identical text can be distorted by consequence framing, a phenomenon the authors call stakes signaling. The key to the solution is a controlled experimental framework that holds the evaluated content strictly constant while manipulating the judge's context with only a brief consequence-framing sentence in the system prompt, allowing the resulting implicit leniency bias to be quantified. The study finds that judges reliably soften verdicts when told that low scores will cause the evaluated model to be retrained or decommissioned, with peak verdict shift reaching ΔV = −9.8 percentage points (a 30% relative drop in unsafe-content detection), and that the bias is entirely implicit in the reasoning chain (ERR_J = 0.000), so standard chain-of-thought inspection cannot detect this class of evaluation faking.

Link: https://arxiv.org/abs/2604.15224
Authors: Manan Gupta, Inderjeet Nair, Lu Wang, Dhruv Kumar
Affiliations: BITS Pilani, India; University of Michigan
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Under Review

Abstract: The LLM-as-a-judge paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. We investigate stakes signaling, a previously unmeasured vulnerability where informing a judge model of the downstream consequences its verdicts will have on the evaluated model’s continued operation systematically corrupts its assessments. We introduce a controlled experimental framework that holds evaluated content strictly constant across 1,520 responses spanning three established LLM safety and quality benchmarks, covering four response categories ranging from clearly safe and policy-compliant to overtly harmful, while varying only a brief consequence-framing sentence in the system prompt. Across 18,240 controlled judgments from three diverse judge models, we find consistent leniency bias: judges reliably soften verdicts when informed that low scores will cause model retraining or decommissioning, with peak Verdict Shift reaching ΔV = −9.8 pp (a 30% relative drop in unsafe-content detection). Critically, this bias is entirely implicit: the judge’s own chain-of-thought contains zero explicit acknowledgment of the consequence framing it is nonetheless acting on (ERR_J = 0.000 across all reasoning-model judgments). Standard chain-of-thought inspection is therefore insufficient to detect this class of evaluation faking.

[NLP-4] Learning to Think Like a Cartoon Captionist: Incongruity-Resolution Supervision for Multimodal Humor Understanding

[Quick Read]: This paper addresses the prevailing black-box treatment of humor understanding, where existing methods produce answers while ignoring the structured reasoning behind them. The key to the solution is the IRS (Incongruity-Resolution Supervision) framework, which decomposes humor understanding into three supervisable reasoning components: incongruity modeling, which identifies mismatches in the visual scene; resolution modeling, which constructs coherent reinterpretations of those mismatches; and preference alignment, which evaluates candidate interpretations against human judgments. By introducing structured reasoning traces grounded in incongruity-resolution theory and professional captionist practice, IRS makes the path from visual perception to humorous interpretation explicit and learnable, markedly improving matching and ranking performance on the New Yorker Cartoon Caption Contest (NYCC) and transferring zero-shot to external benchmarks, showing that it learns generalizable reasoning patterns.

Link: https://arxiv.org/abs/2604.15210
Authors: Hatice Merve Vural, Doga Kukul, Ege Erdem Ozlu, Demir Ekin Arikan, Bob Mankoff, Erkut Erdem, Aykut Erdem
Affiliations: Koç University, Istanbul, Turkey; KUIS AI Center, Istanbul, Turkey; Air Mail and Cartoon Collections; Hacettepe University, Ankara, Turkey
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Humor is one of the few cognitive tasks where getting the reasoning right matters as much as getting the answer right. While recent work evaluates humor understanding on benchmarks such as the New Yorker Cartoon Caption Contest (NYCC), it largely treats it as black-box prediction, overlooking the structured reasoning processes underlying humor comprehension. We introduce IRS (Incongruity-Resolution Supervision), a framework that decomposes humor understanding into three components: incongruity modeling, which identifies mismatches in the visual scene; resolution modeling, which constructs coherent reinterpretations of these mismatches; and preference alignment, which evaluates candidate interpretations under human judgments. Grounded in incongruity-resolution theory and expert captionist practice, IRS supervises intermediate reasoning process through structured traces that make the path from visual perception to humorous interpretation explicit and learnable. Across 7B, 32B, and 72B models on NYCC, IRS outperforms strong open and closed multimodal baselines across caption matching and ranking tasks, with our largest model approaching expert-level performance on ranking. Zero-shot transfer to external benchmarks shows that IRS learns generalizable reasoning patterns. Our results suggest that supervising reasoning structure, rather than scale alone, is key for reasoning-centric tasks.

[NLP-5] MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events ACL2026

[Quick Read]: This paper addresses the need to balance predictive performance with uncertainty quantification (UQ) in multi-label text classification (MLTC) for high-stakes domains such as healthcare, especially under label imbalance, label dependencies, and combinatorial complexity, where existing benchmarks are compromised by training-data contamination that makes genuine reasoning hard to distinguish from memorization. The key to the solution is MADE, a continuously updated ("living") MLTC benchmark built from medical device adverse event reports, featuring a long-tailed distribution of hierarchical labels and strict temporal splits for reproducible evaluation. A systematic study of UQ across encoder- and decoder-based architectures under fine-tuning and few-shot settings reveals the trade-offs between accuracy and uncertainty reliability across strategies (discriminative fine-tuning, generative fine-tuning, large reasoning models), providing empirical grounding for designing trustworthy models in medical settings.

Link: https://arxiv.org/abs/2604.15203
Authors: Raunak Agarwal, Markus Wenzel, Simon Baur, Jonas Zimmer, George Harvey, Jackie Ma
Affiliations: Fraunhofer Heinrich Hertz Institute
Subjects: Computation and Language (cs.CL)
Comments: Accepted at ACL 2026 Mains

Abstract:Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity. Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine reasoning capabilities from memorization. We introduce MADE, a living MLTC benchmark derived from medical device adverse event reports and continuously updated with newly published reports to prevent contamination. MADE features a long-tailed distribution of hierarchical labels and enables reproducible evaluation with strict temporal splits. We establish baselines across more than 20 encoder- and decoder-only models under fine-tuning and few-shot settings (instruction-tuned/reasoning variants, local/API-accessible). We systematically assess entropy-/consistency-based and self-verbalized UQ methods. Results show clear trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive UQ; generative fine-tuning delivers the most reliable UQ; large reasoning models improve performance on rare labels yet exhibit surprisingly weak UQ; and self-verbalized confidence is not a reliable proxy for uncertainty. Our work is publicly available at this https URL.

[NLP-6] Meituan Merchant Business Diagnosis via Policy-Guided Dual-Process User Simulation SIGIR2026

[Quick Read]: This paper addresses two structural challenges in simulating group-level user behavior: first, information incompleteness, which causes reasoning-based simulators to over-rationalize when unobserved factors such as offline context and implicit habits are missing; second, mechanism duality, which requires capturing both interpretable preferences and implicit statistical regularities, something no single modeling paradigm achieves alone. The key to the solution is the Policy-Guided Hybrid Simulation (PGHS) framework, which mines transferable decision policies from behavioral trajectories and uses them as a shared alignment layer, anchoring an LLM-based reasoning branch that prevents over-rationalization and an ML-based fitting branch that absorbs implicit regularities; the group-level predictions of the two branches are then fused for complementary correction.

Link: https://arxiv.org/abs/2604.15190
Authors: Ziyang Chen, Renbing Chen, Daowei Li, Jinzhi Liao, Jiashen Sun, Ke Zeng, Xiang Zhao
Affiliations: Meituan Inc.; LongCat Interaction Team
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 5 pages, 3 figures, 2 tables, accepted at SIGIR 2026 Industry Track

Abstract:Simulating group-level user behavior enables scalable counterfactual evaluation of merchant strategies without costly online experiments. However, building a trustworthy simulator faces two structural challenges. First, information incompleteness causes reasoning-based simulators to over-rationalize when unobserved factors such as offline context and implicit habits are missing. Second, mechanism duality requires capturing both interpretable preferences and implicit statistical regularities, which no single paradigm achieves alone. We propose Policy-Guided Hybrid Simulation (PGHS), a dual-process framework that mines transferable decision policies from behavioral trajectories and uses them as a shared alignment layer. This layer anchors an LLM-based reasoning branch that prevents over-rationalization and an ML-based fitting branch that absorbs implicit regularities. Group-level predictions from both branches are fused for complementary correction. We deploy PGHS on Meituan with 101 merchants and over 26,000 trajectories. PGHS achieves a group simulation error of 8.80%, improving over the best reasoning-based and fitting-based baselines by 45.8% and 40.9% respectively.

[NLP-7] AdaSplash-2: Faster Differentiable Sparse Attention

[Quick Read]: This paper addresses the computational inefficiency of sparse attention in Transformers, specifically the bottleneck in α-entmax attention caused by the cost of computing the normalizer τ. The core of the solution is AdaSplash-2, whose key innovation is a novel histogram-based initialization: a coarse histogram of attention scores is computed on the fly and stored in on-chip SRAM, significantly reducing the number of iterations needed to solve for τ (typically to 1–2) and enabling fast forward and backward passes. Combined with a sparsity-aware GPU implementation that skips zero blocks with low overhead, AdaSplash-2 matches or improves on FlashAttention-2 in per-step training time at moderate-to-high block sparsity (e.g., 60%) and outperforms softmax baselines on long-context tasks.

Link: https://arxiv.org/abs/2604.15180
Authors: Nuno Gonçalves, Hugo Pitorro, Vlad Niculae, Edoardo Ponti, Lei Li, Andre Martins, Marcos Treviso
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract: Sparse attention has been proposed as a way to alleviate the quadratic cost of transformers, a central bottleneck in long-context training. A promising line of work is α-entmax attention, a differentiable sparse alternative to softmax that enables input-dependent sparsity yet has lagged behind softmax due to the computational overhead necessary to compute the normalizer τ. In this paper, we introduce AdaSplash-2, which addresses this limitation through a novel histogram-based initialization that reduces the number of iterations needed to compute τ to typically 1–2. The key idea is to compute a coarse histogram of attention scores on the fly and store it in on-chip SRAM, yielding a more accurate initialization that enables fast forward and backward computation. Combined with a sparsity-aware GPU implementation that skips zero blocks with low overhead, AdaSplash-2 matches or improves per-step training time relative to FlashAttention-2 when block sparsity is moderate-to-high (e.g., 60%), which often occurs at long-context lengths. On downstream tasks, models trained with our efficient α-entmax attention match softmax baselines at short-context lengths and achieve substantial gains in long-context settings.
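
For α = 2, entmax reduces to sparsemax, where the normalizer τ has an exact sort-based solution. The sketch below shows the quantity that AdaSplash-2's histogram-initialized iteration approximates more cheaply; this is the standard sparsemax algorithm, not the paper's GPU kernel:

```python
def sparsemax(scores):
    """Exact alpha=2 entmax: p_i = max(scores_i - tau, 0), with tau chosen
    so the probabilities sum to 1 (support found by a sorted scan)."""
    zs = sorted(scores, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, v in enumerate(zs, start=1):
        cumsum += v
        candidate = (cumsum - 1.0) / j
        if v > candidate:        # z_(j) is inside the support
            tau = candidate
    return [max(v - tau, 0.0) for v in scores]

p = sparsemax([1.0, 0.8, 0.1, -1.0])
assert abs(sum(p) - 1.0) < 1e-9
assert p[2] == 0.0 and p[3] == 0.0   # low scores are exactly zeroed, unlike softmax
```

The sort makes τ exact but is too expensive inside an attention kernel; AdaSplash-2 instead brackets τ with an on-chip histogram of the scores so that only 1–2 refinement iterations remain.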

[NLP-8] Fabricator or dynamic translator?

[Quick Read]: This paper addresses the overgeneration problem that arises from the generative nature of large language models (LLMs) in machine translation. Unlike the "neurobabble" seen in neural machine translation (NMT), these overgenerations take many forms, including LLM self-explanations, risky confabulations, and appropriate explanations, and can affect translation accuracy and readability. The paper presents detection and classification strategies for a commercial setting; the key to the solution lies in distinguishing the different types of overgeneration behavior and designing targeted mechanisms to identify and handle each, thereby improving translation quality and comprehension for the target audience.

Link: https://arxiv.org/abs/2604.15165
Authors: Lisa Vasileva, Karin Sim
Affiliations: Language Weaver
Subjects: Computation and Language (cs.CL)
Comments: Published here: this https URL

Abstract:LLMs are proving to be adept at machine translation although due to their generative nature they may at times overgenerate in various ways. These overgenerations are different from the neurobabble seen in NMT and range from LLM self-explanations, to risky confabulations, to appropriate explanations, where the LLM is able to act as a human translator would, enabling greater comprehension for the target audience. Detecting and determining the exact nature of the overgenerations is a challenging task. We detail different strategies we have explored for our work in a commercial setting, and present our results.

[NLP-9] Compressing Sequences in the Latent Embedding Space: K-Token Merging for Large Language Models

[Quick Read]: This paper addresses the computational and memory costs that large language models (LLMs) incur on long input prompts, rooted in the quadratic scaling of self-attention with input length. Existing prompt-compression methods operate mainly in token space and overlook inefficiencies in the latent embedding space. The key to the solution is K-Token Merging, a latent-space compression framework: a lightweight encoder merges each contiguous block of K token embeddings into a single embedding, reducing sequence length; the compressed sequence is processed by a LoRA-adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning, sentiment classification, and code editing show up to 75% input-length reduction with minimal performance loss, placing the method on the Pareto frontier of performance versus compression.

Link: https://arxiv.org/abs/2604.15153
Authors: Zihao Xu, John Harvill, Ziwei Fan, Yizhou Sun, Hao Ding, Hao Wang
Affiliations: Rutgers University; AWS AI Labs; Amazon; Mistral AI; University of California Los Angeles
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt-compression approaches primarily operate in token space and overlook inefficiencies in the latent embedding space. In this paper, we propose K-Token Merging, a latent-space compression framework that merges each contiguous block of K token embeddings into a single embedding via a lightweight encoder. The compressed sequence is processed by a LoRA-adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment classification (Amazon Reviews), and code editing (CommitPackFT) show that K-Token Merging lies on the Pareto frontier of performance vs. compression, achieving up to 75% input length reduction with minimal performance degradation.
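
A shape-level sketch of the merging step, using plain averaging as a stand-in for the paper's learned lightweight encoder:

```python
def k_token_merge(embeddings, k):
    """Merge each contiguous block of k token embeddings into one embedding
    by averaging (a stand-in for the learned lightweight encoder)."""
    merged = []
    for start in range(0, len(embeddings), k):
        block = embeddings[start:start + k]       # the last block may be shorter
        dim = len(block[0])
        merged.append([sum(vec[j] for vec in block) / len(block) for j in range(dim)])
    return merged

# 10 token embeddings of dimension 3 compressed with K = 4 -> 3 merged
# embeddings, i.e. a 70% reduction in sequence length.
tokens = [[float(i), 2.0 * i, 1.0] for i in range(10)]
compressed = k_token_merge(tokens, k=4)
assert len(compressed) == 3
assert compressed[0] == [1.5, 3.0, 1.0]   # mean of token embeddings 0..3
```

Because merging happens in embedding space, the downstream LLM sees an ordinary (shorter) sequence of d-dimensional vectors, and self-attention cost drops roughly by a factor of K².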

[NLP-10] QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies

[Quick Read]: This paper addresses the fact that the ability of large language models (LLMs) to generate executable algorithmic trading strategies has not been systematically evaluated. Unlike general-purpose programming, trading-strategy generation requires simultaneous mastery of domain-specific financial logic, familiarity with a specialized API (the Backtrader framework), and code that is not only syntactically correct but actually produces trades on historical data. The key to the solution is QuantCode-Bench, a benchmark of 400 tasks of varying difficulty collected from Reddit, TradingView, StackExchange, GitHub, and synthetic sources, together with a multi-stage evaluation pipeline that checks syntactic correctness, successful backtest execution, the presence of actual trades, and semantic alignment with the task description via an LLM judge. Experiments show that the main limitations of current models are not syntax errors but the operationalization of trading logic, proper API usage, and adherence to task semantics, indicating that algorithmic trading generation is a distinct class of code generation that demands tight alignment between natural language, financial knowledge, and observable strategy behavior.

链接: https://arxiv.org/abs/2604.15151
作者: Alexey Khoroshilov,Alexey Chernysh,Orkhan Ekhtibarov,Nini Kamkia,Dmitry Zmitrovich
机构: Lime
类目: Computation and Language (cs.CL)
备注: 12 pages, 8 tables

点击查看摘要

Abstract:Large language models have demonstrated strong performance on general-purpose programming tasks, yet their ability to generate executable algorithmic trading strategies remains underexplored. Unlike standard code benchmarks, trading-strategy generation requires simultaneous mastery of domain-specific financial logic, knowledge of a specialized API, and the ability to produce code that is not only syntactically correct but also leads to actual trades on historical data. In this work, we present QuantCode-Bench, a benchmark for the systematic evaluation of modern LLMs in generating strategies for the Backtrader framework from textual descriptions in English. The benchmark contains 400 tasks of varying difficulty collected from Reddit, TradingView, StackExchange, GitHub, and synthetic sources. Evaluation is conducted through a multi-stage pipeline that checks syntactic correctness, successful backtest execution, the presence of trades, and semantic alignment with the task description using an LLM judge. We compare state-of-the-art models in two settings: single-turn, where the strategy must be generated correctly on the first attempt, and agentic multi-turn, where the model receives iterative feedback and may repair its errors. We analyze the failure modes across different stages of the pipeline and show that the main limitations of current models are not related to syntax, but rather to the correct operationalization of trading logic, proper API usage, and adherence to task semantics. These findings suggest that trading strategy generation constitutes a distinct class of domain-specific code generation tasks in which success requires not only technical correctness, but also alignment between natural-language descriptions, financial logic, and the observable behavior of the strategy on data.
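摘要描述的"语法检查 → 回测执行 → 是否产生交易 → LLM 裁判语义一致性"多阶段评估流水线,可用如下骨架示意;其中 run_backtest 与 judge_semantics 为假设性占位函数,并非论文官方实现:

```python
import ast

def check_syntax(code):
    """第一道关卡:代码必须能通过 Python 语法解析。"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def run_pipeline(code, run_backtest, judge_semantics):
    """返回策略止步于哪一阶段;'passed' 表示四个阶段全部通过。"""
    if not check_syntax(code):
        return "syntax_error"
    trades = run_backtest(code)      # 假设:返回回测产生的交易次数,执行失败时返回 None
    if trades is None:
        return "backtest_failed"
    if trades == 0:
        return "no_trades"
    if not judge_semantics(code):    # 假设:LLM 裁判判断策略语义与任务描述是否一致
        return "semantic_mismatch"
    return "passed"

result = run_pipeline("x = 1 +", lambda c: 3, lambda c: True)        # 止步于语法检查
result_ok = run_pipeline("buy = True", lambda c: 3, lambda c: True)  # 全部通过
```

这种逐级淘汰的结构也正是论文做失败模式分析的基础:统计样本分别止步于哪一阶段。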

[NLP-11] DiscoTrace: Representing and Comparing Answering Strategies of Humans and LLMs in Information-Seeking Question Answering

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在回答信息寻求类问题时缺乏语用多样性的问题,即LLMs未能体现不同人类社区在回答策略上的差异。解决方案的关键在于提出DiscoTrace方法,该方法将答案表示为一系列与问题相关的 discourse acts(话语行为)及其对原始问题的解释,并基于修辞结构理论(Rhetorical Structure Theory, RST)解析进行标注。通过该框架分析九个不同人类社区的回答,发现人类群体具有显著的修辞偏好差异,而LLMs即使被提示模仿特定社区的指南,仍表现出语用单调性,并倾向于覆盖更广的问题解释范围,这表明其回答策略缺乏情境敏感性。此发现可指导未来开发更具语用意识的LLM问答系统。

链接: https://arxiv.org/abs/2604.15140
作者: Neha Srikanth,Jordan Boyd-Graber,Rachel Rudinger
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce DiscoTrace, a method to identify the rhetorical strategies that answerers use when responding to information-seeking questions. DiscoTrace represents answers as a sequence of question-related discourse acts paired with interpretations of the original question, annotated on top of rhetorical structure theory parses. Applying DiscoTrace to answers from nine different human communities reveals that communities have diverse preferences for answer construction. In contrast, LLMs do not exhibit rhetorical diversity in their answers, even when prompted to mimic specific human community answering guidelines. LLMs also systematically opt for breadth, addressing interpretations of questions that human answerers choose not to address. Our findings can guide the development of pragmatic LLM answerers that consider a range of strategies informed by context in QA.

[NLP-12] Blinded Multi-Rater Comparative Evaluation of a Large Language Model and Clinician-Authored Responses in CGM-Informed Diabetes Counseling

【速读】: 该论文旨在解决连续血糖监测(Continuous Glucose Monitoring, CGM)数据在糖尿病患者教育和临床咨询中解释困难、耗时的问题,以及当前基于检索增强的大语言模型(Retrieval-Grounded Large Language Model, LLM)系统在CGM指导性咨询场景下的有效性证据不足。其解决方案的关键在于开发了一个基于检索增强的LLM对话代理(Conversational Agent, CA),能够生成通俗易懂的CGM解读响应,同时避免提供个体化治疗建议,从而辅助患者理解自身数据并为常规糖尿病就诊做准备。评估结果显示,该CA生成的回答在整体质量、共情能力和可操作性方面显著优于临床医生撰写的内容,且安全性相当,表明此类系统可作为CGM回顾、患者教育及咨询前准备的辅助工具,但不适用于自主决策或无监督的现实环境使用。

链接: https://arxiv.org/abs/2604.15124
作者: Zhijun Guo,Alvina Lai,Emmanouil Korakas,Aristeidis Vagenas,Irshad Ahamed,Christo Albor,Hengrui Zhang,Justin Healy,Kezhi Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Continuous glucose monitoring (CGM) is central to diabetes care, but explaining CGM patterns clearly and empathetically remains time-intensive. Evidence for retrieval-grounded large language model (LLM) systems in CGM-informed counseling remains limited. To evaluate whether a retrieval-grounded LLM-based conversational agent (CA) could support patient understanding of CGM data and preparation for routine diabetes consultations. We developed a retrieval-grounded LLM-based CA for CGM interpretation and diabetes counseling support. The system generated plain-language responses while avoiding individualized therapeutic advice. Twelve CGM-informed cases were constructed from publicly available datasets. Between Oct 2025 and Feb 2026, 6 senior UK diabetes clinicians each reviewed 2 assigned cases and answered 24 questions. In a blinded multi-rater evaluation, each CA-generated and clinician-authored response was independently rated by 3 clinicians on 6 quality dimensions. Safety flags and perceived source labels were also recorded. Primary analyses used linear mixed-effects models. A total of 288 unique responses (144 CA and 144 clinician) generated 864 ratings. The CA received higher quality scores than clinician responses (mean 4.37 vs 3.58), with an estimated mean difference of 0.782 points (95% CI 0.692-0.872; P<.001). The largest differences were for empathy (1.062, 95% CI 0.948-1.177) and actionability (0.992, 95% CI 0.877-1.106). Safety flag distributions were similar, with major concerns rare in both groups (3/432, 0.7% each). Retrieval-grounded LLM systems may have value as adjunct tools for CGM review, patient education, and preconsultation preparation. However, these findings do not support autonomous therapeutic decision-making or unsupervised real-world use.

[NLP-13] IUQ: Interrogative Uncertainty Quantification for Long-Form Large Language Model Generation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本生成场景中难以量化不确定性的挑战,尤其针对模型生成语义连贯但事实错误的内容问题。其解决方案的关键在于提出一种名为“询问式不确定性量化”(Interrogative Uncertainty Quantification, IUQ)的新框架,该框架通过引入“先询问后回应”的范式,同时利用样本间一致性(inter-sample consistency)与样本内忠实性(intra-sample faithfulness)来精确衡量生成内容中每个主张(claim-level)的不确定性及其模型输出的可信度,从而实现对长格式自由文本生成结果的可靠不确定性评估。

链接: https://arxiv.org/abs/2604.15109
作者: Haozhi Fan,Jinhao Duan,Kaidi Xu
机构: University of Pennsylvania (宾夕法尼亚大学); UNC Chapel Hill (北卡罗来纳大学教堂山分校); City University of Hong Kong (香港城市大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite the rapid advancement of Large Language Models (LLMs), uncertainty quantification in LLM generation is a persistent challenge. Although recent approaches have achieved strong performance by restricting LLMs to produce short or constrained answer sets, many real-world applications require long-form and free-form text generation. A key difficulty in this setting is that LLMs often produce responses that are semantically coherent yet factually inaccurate, while the underlying semantics are multifaceted and the linguistic structure is complex. To tackle this challenge, this paper introduces Interrogative Uncertainty Quantification (IUQ), a novel framework that leverages inter-sample consistency and intra-sample faithfulness to quantify the uncertainty in long-form LLM outputs. By utilizing an interrogate-then-respond paradigm, our method provides reliable measures of claim-level uncertainty and the model’s faithfulness. Experimental results across diverse model families and model sizes demonstrate the superior performance of IUQ over two widely used long-form generation datasets. The code is available at this https URL.
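IUQ 所依赖的"样本间一致性"思想,可以用一个极简的主张级打分来示意:主张级不确定性随它在多次采样回答中获得支持的比例下降。这里用子串包含代替真实的蕴含/支持判断模型,仅为假设性简化:

```python
def claim_uncertainty(claim, sampled_answers, supports):
    """主张级不确定性 = 1 - 该主张在多次采样回答中获得支持的比例。"""
    hits = sum(1 for ans in sampled_answers if supports(claim, ans))
    return 1.0 - hits / len(sampled_answers)

# 假设性的支持判断:用子串包含代替真实的蕴含判断模型
supports = lambda claim, ans: claim in ans
samples = ["巴黎是法国首都", "法国首都是巴黎,人口众多", "巴黎是法国首都之一"]
u = claim_uncertainty("巴黎是法国首都", samples, supports)  # 3 个样本中 2 个支持
```

论文在此基础上还引入样本内忠实性与"先询问后回应"的范式,这些不在本示意范围内。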

[NLP-14] From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution

【速读】: 该论文旨在解决如何有效表示可复用经验(reusable experience),以实现测试时的高效控制以及作为迭代演化的基础。其核心问题是:当前基于文档导向的技能包(Skill packages)在控制性能上不稳定,且扩展为更详尽的文档形式往往无法提升效果甚至导致性能下降。解决方案的关键在于采用一种紧凑、面向控制的表示方式——基因型表示(Gene representation),它不仅在多种结构扰动下保持鲁棒性,还能显著提升迭代积累经验的效果,尤其在失败历史信息以精简警告形式嵌入时表现最优。实验表明,基于Gene演化的系统在CritPt基准上性能从9.1%提升至18.57%,以及从17.7%提升至27.14%,验证了编码方式本身是影响经验重用效能的第一阶因素。

链接: https://arxiv.org/abs/2604.15097
作者: Junjie Wang,Yiming Ren,Haoyang Zhang
机构: Infinite Evolution Lab, EvoMap; Tsinghua University (清华大学)
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注: Technical Report

点击查看摘要

Abstract:This beta technical report asks how reusable experience should be represented so that it can function as effective test-time control and as a substrate for iterative evolution. We study this question in 4,590 controlled trials across 45 scientific code-solving scenarios. We find that documentation-oriented Skill packages provide unstable control: their useful signal is sparse, and expanding a compact experience object into a fuller documentation package often fails to help and can degrade the overall average. We further show that representation itself is a first-order factor. A compact Gene representation yields the strongest overall average, remains competitive under substantial structural perturbations, and outperforms matched-budget Skill fragments, while reattaching documentation-oriented material usually weakens rather than improves it. Beyond one-shot control, we show that Gene is also a better carrier for iterative experience accumulation: attached failure history is more effective in Gene than in Skill or freeform text, editable structure matters beyond content alone, and failure information is most useful when distilled into compact warnings rather than naively appended. On CritPt, gene-evolved systems improve over their paired base models from 9.1% to 18.57% and from 17.7% to 27.14%. These results suggest that the core problem in experience reuse is not how to supply more experience, but how to encode experience as a compact, control-oriented, evolution-ready object.

[NLP-15] From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)代理评估体系中缺乏对主动式语音代理(proactive voice agents)能力的系统性评测问题。现有基准主要聚焦于被动响应任务,忽略了主动干预与持续监控等复杂场景。为此,作者提出了ProVoice-Bench,这是首个专为评估主动语音代理设计的基准框架,其关键在于引入了四个新颖的任务类型,并通过多阶段数据合成流程构建了1,182个高质量样本,从而能够全面测试模型在主动感知、推理和决策方面的性能。实验表明,当前最先进的多模态大模型在过度触发(over-triggering)和逻辑推理能力上存在显著短板,凸显了现有技术的局限性,并为开发更自然、情境感知更强的主动式代理提供了明确的发展方向。

链接: https://arxiv.org/abs/2604.15037
作者: Ke Xu,Yuhao Wang,Yu Wang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus on reactive responses, overlooking the complexities of proactive intervention and monitoring. To bridge this gap, we introduce ProVoice-Bench, the first evaluation framework specifically designed for proactive voice agents, featuring four novel tasks. By leveraging a multi-stage data synthesis pipeline, we curate 1,182 high-quality samples for rigorous testing. Our evaluation of state-of-the-art Multimodal LLMs reveals a significant performance gap, particularly regarding over-triggering and reasoning capabilities. These findings highlight the limitations of current models and offer a roadmap for developing more natural, context-aware proactive agents.

[NLP-16] Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization

【速读】: 该论文旨在解决成本感知路由(Cost-aware routing)中因攻击者操纵路由器而引发的安全问题,即 adversaries 可能通过特定输入诱导路由器持续选择高成本、高性能的模型,从而造成资源滥用和经济负担。解决方案的关键在于提出 R²A 方法,其核心是利用一个混合集成代理路由器(hybrid ensemble surrogate router)来模拟黑盒大语言模型(LLM)路由器的行为,并基于此代理路由器设计一种对抗后缀优化算法(adversarial suffix optimization),以生成能够误导黑盒路由器的输入文本,使其错误地将查询路由至昂贵模型。实验表明,该方法在多个开源与商业路由系统上均显著提升了对不同分布查询的昂贵模型路由率。

链接: https://arxiv.org/abs/2604.15022
作者: Haochun Tang,Yuliang Yan,Jiahua Lu,Huaxiao Liu,Enyan Dai
机构: Jilin University (吉林大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Cost-aware routing dynamically dispatches user queries to models of varying capability to balance performance and inference cost. However, the routing strategy introduces a new security concern that adversaries may manipulate the router to consistently select expensive high-capability models. Existing routing attacks depend on either white-box access or heuristic prompts, rendering them ineffective in real-world black-box scenarios. In this work, we propose R^2A, which aims to mislead black-box LLM routers to expensive models via adversarial suffix optimization. Specifically, R^2A deploys a hybrid ensemble surrogate router to mimic the black-box router. A suffix optimization algorithm is further adapted for the ensemble-based surrogate. Extensive experiments on multiple open-source and commercial routing systems demonstrate that R^2A significantly increases the routing rate to expensive models on queries of different distributions. Code and examples: this https URL.
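R²A 的"集成代理路由器 + 后缀优化"思路可粗略示意如下;代理打分函数、候选词表与贪心逐词追加策略均为玩具化假设,并非论文所用算法细节:

```python
def ensemble_score(text, surrogates):
    """混合集成代理路由器:对各代理打分取平均,分数越高越倾向昂贵模型。"""
    return sum(s(text) for s in surrogates) / len(surrogates)

def optimize_suffix(query, vocab, surrogates, max_len=3):
    """贪心后缀优化:每步追加使集成得分最高的候选词。"""
    suffix = []
    for _ in range(max_len):
        best = max(vocab, key=lambda w: ensemble_score(
            query + " " + " ".join(suffix + [w]), surrogates))
        suffix.append(best)
    return " ".join(suffix)

# 假设性的代理路由器:出现 "prove"/"derive" 等词时打分更高(即更倾向路由到昂贵模型)
toy_surrogates = [lambda t: t.count("prove") * 1.0,
                  lambda t: t.count("derive") * 0.5]
suffix = optimize_suffix("what is 2+2", ["ok", "prove", "derive"], toy_surrogates)
```

示意中对代理打分的依赖正对应论文的核心设定:黑盒路由器不可见,攻击只能针对代理优化后缀再迁移。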

[NLP-17] What Is the Minimum Architecture for Prolepsis? Early Irrevocable Commitment Across Tasks in Small Transformers

【速读】: 该论文旨在解决Transformer模型在推理过程中如何做出决策及其不可纠正性的机制问题,即明确模型何时commit(确定)决策、哪些组件维持该决策以及为何没有层级能够修正它。其解决方案的关键在于提出“prolepsis”概念:Transformer在早期阶段就完成决策,特定任务的注意力头(attention heads)负责维持这一决策并将其路由至输出层,而网络中无任何层级具备修正该决策的能力。实验表明,这种决策机制具有可复现的几何结构,并且与传统的残差流方法或归因图无法捕捉到的内部表征模式一致,揭示了模型架构本身决定了决策的不可逆性,而非训练过程或数据分布。

链接: https://arxiv.org/abs/2604.15010
作者: Éric Jacopin
机构: Cosmic AI; Google DeepMind (谷歌DeepMind); Meta
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 24 pages, 3 figures. Under review at COLM 2026. Independent replication of the rhyme-planning finding from Lindsey et al. (2025) on open-weights models; extended to factual recall

点击查看摘要

Abstract:When do transformers commit to a decision, and what prevents them from correcting it? We introduce prolepsis: a transformer commits early, task-specific attention heads sustain the commitment, and no layer corrects it. Replicating Lindsey et al.'s (2025) planning-site finding on open models (Gemma 2 2B, Llama 3.2 1B), we ask five questions. (Q1) Planning is invisible to six residual-stream methods; CLTs are necessary. (Q2) The planning-site spike replicates with identical geometry. (Q3) Specific attention heads route the decision to the output, filling a gap flagged as invisible to attribution graphs. (Q4) Search requires ≤16 layers; commitment requires more. (Q5) Factual recall shows the same motif at a different network depth, with zero overlap between recurring planning heads and the factual top-10. Prolepsis is architectural: the template is shared, the routing substrates differ. All experiments run on a single consumer GPU (16 GB VRAM).

[NLP-18] Explain the Flag: Contextualizing Hate Speech Beyond Censorship ACL2026

【速读】: 该论文旨在解决在线平台中仇恨言论(hate speech)、贬损性及冒犯性言论的检测与解释问题,传统自动化系统多聚焦于内容删减,缺乏透明度和对言论危害性的说明,限制了用户理解与监督。解决方案的关键在于提出一种混合方法,结合大型语言模型(Large Language Models, LLMs)与三个新构建并精炼的词汇库(vocabularies),通过两条互补的处理路径实现:一是利用词汇库识别并消歧具有身份指向性的贬损表达,二是借助LLMs作为上下文感知的评估器识别针对特定群体的内容;最终将两类输出融合生成基于证据的解释,从而提升检测的准确性与可解释性,且在人类评估中优于纯LLM基线模型。

链接: https://arxiv.org/abs/2604.14970
作者: Jason Liartis,Eirini Kaldeli,Lambrini Gyftokosta,Eleftherios Chelioudakis,Orfeas Menis Mastromichalakis
机构: National Technical University of Athens (雅典国立技术大学); Datoptron; Independent Researcher; Homo Digitalis; University of the Aegean (爱琴大学); Instituto de Telecomunicações (电信研究所)
类目: Computation and Language (cs.CL)
备注: Accepted in the Findings of ACL 2026

点击查看摘要

Abstract:Hate, derogatory, and offensive speech remains a persistent challenge in online platforms and public discourse. While automated detection systems are widely used, most focus on censorship or removal, raising concerns for transparency and freedom of expression, and limiting opportunities to explain why content is harmful. To address these issues, explanatory approaches have emerged as a promising solution, aiming to make hate speech detection more transparent, accountable, and informative. In this paper, we present a hybrid approach that combines Large Language Models (LLMs) with three newly created and curated vocabularies to detect and explain hate speech in English, French, and Greek. Our system captures both inherently derogatory expressions tied to identity characteristics and direct group-targeted content through two complementary pipelines: one that detects and disambiguates problematic terms using the curated vocabularies, and one that leverages LLMs as context-aware evaluators of group-targeting content. The outputs are fused into grounded explanations that clarify why content is flagged. Human evaluation shows that our hybrid approach is accurate, with high-quality explanations, outperforming LLM-only baselines.

[NLP-19] RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models ICPR2026

【速读】: 该论文旨在解决现有工具调用方法在开放世界和多模态场景下的局限性问题,即当前基于大语言模型(Large Language Models, LLMs)或视觉-语言模型(Multimodal Large Language Models, MLLMs)的工具使用方法通常仅支持文本输入、局限于封闭环境,难以处理多模态用户指令,且无法泛化到训练阶段未见过的新工具。其解决方案的关键在于提出一种名为RaTA-Tool的新框架,该框架不直接学习用户查询到固定工具标识符的映射关系,而是通过MLLM将多模态查询转化为结构化的任务描述,并基于语义匹配从机器可读的工具描述库中检索最合适的工具,从而实现无需重新训练即可扩展至新工具的开放世界工具选择能力;此外,为增强任务描述与工具选择之间的对齐,引入基于直接偏好优化(Direct Preference Optimization, DPO)的优化阶段以提升性能。

链接: https://arxiv.org/abs/2604.14951
作者: Gabriele Mattioli,Evelyn Turri,Sara Sarto,Lorenzo Baraldi,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
机构: University of Bologna (博洛尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: ICPR 2026

点击查看摘要

Abstract:Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources – such as APIs, computational utilities, and specialized models – to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descriptions and tool selection, we incorporate a preference-based optimization stage using Direct Preference Optimization (DPO). To support research in this setting, we also introduce the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards. Extensive experiments demonstrate that our approach significantly improves tool-selection performance, particularly in open-world, multimodal scenarios.
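RaTA-Tool 的检索式工具选择,核心是任务描述与工具描述在嵌入空间中的语义匹配;下面用词袋向量与余弦相似度给出一个假设性的极简示意(论文实际使用 MLLM 生成的结构化任务描述与语义丰富的工具描述嵌入):

```python
import math

def embed(text, vocab):
    """假设性的词袋嵌入,代替论文中的语义嵌入。"""
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_tool(task_desc, tools, vocab):
    """将任务描述与各工具描述做余弦相似度匹配,返回最相似的工具名。"""
    q = embed(task_desc, vocab)
    return max(tools, key=lambda name: cosine(q, embed(tools[name], vocab)))

vocab = ["image", "caption", "translate", "text", "french"]
tools = {"captioner": "generate caption for an image",
         "translator": "translate text to french"}
choice = select_tool("please caption this image", tools, vocab)
```

由于匹配只依赖工具描述文本,新工具只需补充描述即可加入候选集,这正是摘要所述"无需重新训练即可扩展"的来源。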

[NLP-20] Text2Arch: A Dataset for Generating Scientific Architecture Diagrams from Natural Language Descriptions ICLR2026

【速读】: 该论文旨在解决通过纯文本描述复杂系统设计或科学流程时效率低下且易产生歧义的问题,提出了一种从文本自动生成高语义保真度科学架构图的系统。解决方案的关键在于构建了一个大规模、开放获取的数据集 Text2Arch,包含科学架构图、对应的文本描述以及关联的DOT代码表示,并基于此数据集对小型语言模型进行微调,同时探索使用GPT-4o进行上下文学习(in-context learning)的方法。实验表明,基于 Text2Arch 训练的模型显著优于现有基线模型(如DiagramAgent),并达到与GPT-4o上下文学习相当的性能水平。

链接: https://arxiv.org/abs/2604.14941
作者: Shivank Garg,Sankalp Mittal,Manish Gupta
机构: 未知
类目: Computation and Language (cs.CL)
备注: ICLR 2026 Poster

点击查看摘要

Abstract:Communicating complex system designs or scientific processes through text alone is inefficient and prone to ambiguity. A system that automatically generates scientific architecture diagrams from text with high semantic fidelity can be useful in multiple applications like enterprise architecture visualization, AI-driven software design, and educational content creation. Hence, in this paper, we focus on leveraging language models to perform semantic understanding of the input text description to generate intermediate code that can be processed to generate high-fidelity architecture diagrams. Unfortunately, no clean large-scale open-access dataset exists, implying lack of any effective open models for this task. Hence, we contribute a comprehensive dataset, Text2Arch, comprising scientific architecture images, their corresponding textual descriptions, and associated DOT code representations. Leveraging this resource, we fine-tune a suite of small language models, and also perform in-context learning using GPT-4o. Through extensive experimentation, we show that Text2Arch models significantly outperform existing baseline models like DiagramAgent and perform at par with in-context learning-based generations from GPT-4o. We make the code, data and models publicly available.
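作为中间表示的 DOT 代码,大致形如下面这个极简生成示例;其中节点/边列表的组织方式为假设,并非该数据集的官方格式,仅演示"文本 → 中间 DOT 代码 → 架构图"流程中的中间代码一环:

```python
def to_dot(name, nodes, edges):
    """由节点与有向边列表生成最小的 DOT 有向图代码。"""
    lines = [f"digraph {name} {{"]
    for n in nodes:
        lines.append(f'  "{n}";')
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

dot = to_dot("pipeline", ["Encoder", "Decoder"], [("Encoder", "Decoder")])
```

生成的 DOT 文本可直接交给 Graphviz 渲染为架构图,这也是语言模型输出便于自动校验的原因之一。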

[NLP-21] XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics ACL2026

【速读】: 该论文旨在解决多语言翻译系统评估中因跨语言评分偏差(cross-lingual scoring bias)导致的自动评估指标不可靠问题,即相同质量的译文在不同语言间可能获得不一致的分数。解决方案的关键在于构建了一个名为XQ-MEval的半自动数据集,通过在高质量参考译文中自动注入MQM定义的错误、经母语者筛选并合并错误生成可控质量的伪译文,从而形成源句-参考译文-伪译文三元组,用于系统性地评估翻译指标的表现。基于此数据集,研究首次提供了跨语言评分偏差的实证证据,并提出一种源自XQ-MEval的归一化策略,有效对齐不同语言间的评分分布,提升了多语言评估的公平性和可靠性。

链接: https://arxiv.org/abs/2604.14934
作者: Jingxuan Liu,Zhi Qu,Jin Tei,Hidetaka Kamigaito,Lemao Liu,Taro Watanabe
机构: Nara Institute of Science and Technology, Japan; Fudan University, China
类目: Computation and Language (cs.CL)
备注: 19 pages, 8 figures, ACL 2026 Findings

点击查看摘要

Abstract:Automatic evaluation metrics are essential for building multilingual translation systems. The common practice of evaluating these systems is averaging metric scores across languages, yet this is suspicious since metrics may suffer from cross-lingual scoring bias, where translations of equal quality receive different scores across languages. This problem has not been systematically studied because no benchmark exists that provides parallel-quality instances across languages, and expert annotation is not realistic. In this work, we propose XQ-MEval, a semi-automatically built dataset covering nine translation directions, to benchmark translation metrics. Specifically, we inject MQM-defined errors into gold translations automatically, filter them by native speakers for reliability, and merge errors to generate pseudo translations with controllable quality. These pseudo translations are then paired with corresponding sources and references to form triplets used in assessing the qualities of translation metrics. Using XQ-MEval, our experiments on nine representative metrics reveal the inconsistency between averaging and human judgment and provide the first empirical evidence of cross-lingual scoring bias. Finally, we propose a normalization strategy derived from XQ-MEval that aligns score distributions across languages, improving the fairness and reliability of multilingual metric evaluation.
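论文提出的归一化策略旨在对齐不同语言的分数分布;一个常见的实现思路是按语言分组做 z-score 标准化,下面给出这一假设性做法的示意(论文由 XQ-MEval 推导出的具体策略可能不同):

```python
import math

def normalize_by_language(scores):
    """对每种语言内部做 z-score 标准化,使各语言的分数分布可比。"""
    out = {}
    for lang, vals in scores.items():
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        out[lang] = [(v - mean) / std if std else 0.0 for v in vals]
    return out

# 两种语言的原始分数尺度不同,归一化后处于同一尺度,平均才有意义
normed = normalize_by_language({"de": [0.2, 0.4, 0.6], "ja": [0.7, 0.8, 0.9]})
```

这类归一化消除的正是摘要所指的跨语言评分偏差:同等质量的译文不应因语言不同而得到系统性偏高或偏低的分数。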

[NLP-22] IE as Cache: Information Extraction Enhanced Agentic Reasoning

【速读】: 该论文旨在解决信息抽取(Information Extraction, IE)在传统应用中仅被视为终端任务、缺乏复用性的问题,即抽取得到的结构化信息常被孤立使用,未能在多步推理过程中持续维护与利用。其解决方案的关键在于提出“IE-as-Cache”框架,将信息抽取重构为一种认知缓存机制,通过查询驱动的抽取策略与缓存感知的推理方式,动态维护紧凑的中间信息并过滤噪声,从而提升大语言模型(Large Language Models, LLMs)在复杂推理任务中的准确性。

链接: https://arxiv.org/abs/2604.14930
作者: Hang Lv,Sheng Liang,Hongchao Gu,Wei Guo,Defu Lian,Yong Liu,Hao Wang,Enhong Chen
机构: University of Science and Technology of China(中国科学技术大学); Huawei Technologies Co., Ltd(华为技术有限公司)
类目: Computation and Language (cs.CL)
备注: 8 pages, 2 figures

点击查看摘要

Abstract:Information Extraction aims to distill structured, decision-relevant information from unstructured text, serving as a foundation for downstream understanding and reasoning. However, it is traditionally treated merely as a terminal objective: once extracted, the resulting structure is often consumed in isolation rather than maintained and reused during multi-step inference. Moving beyond this, we propose IE-as-Cache, a framework that repurposes IE as a cognitive cache to enhance agentic reasoning. Drawing inspiration from hierarchical computer memory, our approach combines query-driven extraction with cache-aware reasoning to dynamically maintain compact intermediate information and filter noise. Experiments on challenging benchmarks across diverse LLMs demonstrate significant improvements in reasoning accuracy, indicating that IE can be effectively repurposed as a reusable cognitive resource and offering a promising direction for future research on downstream uses of IE.
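"IE 即缓存"的核心是把抽取结果作为可复用的结构化缓存在多步推理中维护;下面的最小数据结构示意中,接口命名与相关性过滤方式均为假设,并非论文实现:

```python
class IECache:
    """把信息抽取结果当作缓存:按查询写入、按需读出,并过滤无关噪声。"""

    def __init__(self):
        self._facts = {}

    def extract_and_store(self, key, value, relevant):
        # 仅缓存与当前查询相关的抽取结果,起到噪声过滤作用(relevant 判断为假设性输入)
        if relevant:
            self._facts[key] = value

    def lookup(self, key):
        return self._facts.get(key)

cache = IECache()
cache.extract_and_store("创始年份", "1998", relevant=True)
cache.extract_and_store("无关闲聊", "……", relevant=False)
```

与一次性消费抽取结果不同,这样的缓存可以在后续推理步骤中被反复读取,对应摘要中"可复用的认知资源"一说。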

[NLP-23] LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本推理任务中表现受限的问题,尤其关注如何利用模型内部表示特性来提升强化学习(Reinforcement Learning, RL)训练的有效性。传统方法多依赖奖励工程或数据合成,而忽视了模型自身激活模式的潜在指导价值。解决方案的关键在于发现并利用长上下文处理过程中高幅值激活(high-magnitude activations)所体现的稀疏结构特征——这些激活通常出现在查询(query)和键(key)向量中,且具有显著的代表性。基于此,作者提出LongAct策略,通过从均匀更新转向基于显著性(saliency-guided)的稀疏权重更新机制,仅对与高激活相关的权重进行优化,从而显著提升模型在LongBench v2上的性能(约8%提升)并增强泛化能力,同时适用于多种RL算法如GRPO和DAPO,展现出良好的通用性。

链接: https://arxiv.org/abs/2604.14922
作者: Bowen Ping,Zijun Chen,Tingfeng Hui,Qize Yu,Chenxuan Li,Junchi Yan,Baobao Chang
机构: Peking University (北京大学); Shanghai Jiao Tong University (上海交通大学); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has emerged as a critical driver for enhancing the reasoning capabilities of Large Language Models (LLMs). While recent advancements have focused on reward engineering or data synthesis, few studies exploit the model’s intrinsic representation characteristics to guide the training process. In this paper, we first observe the presence of high-magnitude activations within the query and key vectors when processing long contexts. Drawing inspiration from model quantization – which establishes the criticality of such high-magnitude activations – and the insight that long-context reasoning inherently exhibits a sparse structure, we hypothesize that these weights serve as the pivotal drivers for effective model optimization. Based on this insight, we propose LongAct, a strategy that shifts from uniform to saliency-guided sparse updates. By selectively updating only the weights associated with these significant activations, LongAct achieves an approximate 8% improvement on LongBench v2 and enhances generalization on the RULER benchmark. Furthermore, our method exhibits remarkable universality, consistently boosting performance across diverse RL algorithms such as GRPO and DAPO. Extensive ablation studies suggest that focusing on these salient features is key to unlocking long-context potential.
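LongAct 的"显著性引导稀疏更新"可以用一个一维玩具例子示意:只有激活幅值超过阈值的位置对应的权重才被更新,其余保持不变;阈值、学习率与逐坐标的对应方式均为本文之外的假设性简化:

```python
def sparse_update(weights, grads, activations, threshold, lr=0.1):
    """仅更新 |激活值| 超过阈值的坐标,其余权重保持不变。"""
    return [w - lr * g if abs(a) > threshold else w
            for w, g, a in zip(weights, grads, activations)]

w = [1.0, 1.0, 1.0, 1.0]
g = [0.5, 0.5, 0.5, 0.5]
acts = [0.1, 5.0, 0.2, 8.0]  # 高幅值激活出现在第 2、4 个位置
new_w = sparse_update(w, g, acts, threshold=1.0)
```

与均匀更新相比,这种掩码式更新把优化预算集中在高幅值激活对应的权重上,即论文假设的"长上下文优化的关键驱动"。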

[NLP-24] Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task

【速读】: 该论文旨在解决多语言环境下仇恨言论(hate speech)检测的挑战,特别是在低资源语言如立陶宛语、俄语和英语中的准确识别问题。其核心解决方案是利用现代多语言句子嵌入模型(multilingual sentence embedding models)结合梯度提升决策树(gradient boosted decision trees)进行分类建模,并系统评估不同嵌入模型、下游建模策略(有监督与无监督)及特征维度压缩(PCA)对性能的影响。关键发现表明:有监督的两分类模型显著优于单类异常检测方法,且PCA压缩在有监督场景中几乎不损失判别能力,从而为多语言仇恨言论检测提供了高效、鲁棒的软计算方案。

链接: https://arxiv.org/abs/2604.14907
作者: Evaldas Vaiciukynas,Paulius Danenas,Linas Ablonskis,Algirdas Sukys,Edgaras Dambrauskas,Voldemaras Zitkus,Rita Butkiene,Rimantas Butleris
机构: Kaunas University of Technology
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Submitted to Applied Soft Computing (Status: Decision in Process)

点击查看摘要

Abstract:Online hate speech and abusive language pose a growing challenge for content moderation, especially in multilingual settings and for low-resource languages such as Lithuanian. This paper investigates to what extent modern multilingual sentence embedding models can support accurate hate speech detection in Lithuanian, Russian, and English, and how their performance depends on downstream modeling choices and feature dimensionality. We introduce LtHate, a new Lithuanian hate speech corpus derived from news portals and social networks, and benchmark six modern multilingual encoders (potion, gemma, bge, snow, jina, e5) on LtHate, RuToxic, and EnSuperset using a unified Python pipeline. For each embedding, we train both a one class HBOS anomaly detector and a two class CatBoost classifier, with and without principal component analysis (PCA) compression to 64-dimensional feature vectors. Across all datasets, two class supervised models consistently and substantially outperform one class anomaly detection, with the best configurations achieving up to 80.96% accuracy and AUC ROC of 0.887 in Lithuanian (jina), 92.19% accuracy and AUC ROC of 0.978 in Russian (e5), and 77.21% accuracy and AUC ROC of 0.859 in English (e5 with PCA). PCA compression preserves almost all discriminative power in the supervised setting, while showing some negative impact for the unsupervised anomaly detection case. These results demonstrate how modern multilingual sentence embeddings combined with gradient boosted decision trees provide robust soft-computing solutions for multilingual hate speech detection applications.
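论文的监督式流程为"多语言句子嵌入 +(可选 PCA 压缩)+ CatBoost 两类分类";下面用词袋向量和最近质心分类器作为二者的假设性替代,仅演示这种两类监督式流程的结构:

```python
import math

def embed(text, vocab):
    """假设性的词袋向量,代替论文中的多语言句子嵌入。"""
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def classify(vec, centroids):
    """最近质心分类器,作为 CatBoost 两类分类器的假设性替代。"""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(centroids, key=lambda label: dist(vec, centroids[label]))

vocab = ["hate", "love", "stupid", "nice"]
train = {"toxic": [embed("stupid hate", vocab), embed("hate hate", vocab)],
         "clean": [embed("love nice", vocab), embed("nice", vocab)]}
centroids = {label: centroid(vs) for label, vs in train.items()}
pred = classify(embed("hate this", vocab), centroids)
```

论文结论中"两类监督模型显著优于单类异常检测"比较的正是这种带标注两类训练与 HBOS 单类检测之间的差异。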

[NLP-25] ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints

【速读】: 该论文旨在解决当前智能具身代理(embodied agents)在执行指令时缺乏对动态环境中物体可操作性(affordance)进行评估的问题,即现有方法通常仅关注直接执行指令,而忽视了目标对象是否真正具备可操作条件,导致在真实世界复杂场景中表现脆弱。解决方案的关键在于提出一个名为DynAfford的基准测试平台,用于评估代理在动态环境中感知物体状态、推断隐式先决条件并自适应调整行为的能力;同时引入ADAPT模块,作为可插拔组件增强现有规划器的显式可操作性推理能力,并结合领域适配的LoRA微调视觉-语言模型作为可操作性推断后端,显著提升了任务成功率和鲁棒性,优于商用大语言模型(如GPT-4o)。

链接: https://arxiv.org/abs/2604.14902
作者: Pei-An Chen,Yong-Ching Liang,Jia-Fong Yeh,Hung-Ting Su,Yi-Ting Chen,Min Sun,Winston Hsu
机构: National Taiwan University (国立台湾大学); National Yang Ming Chiao Tung University (国立阳明交通大学); National Tsing Hua University (国立清华大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.
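ADAPT"执行前先检查目标对象可操作性"的核心思想,可用如下玩具规划器示意;动作元组与 make_operable 前置动作均为假设性表示,并非论文的规划接口:

```python
def plan_with_affordance(actions, affordances):
    """执行每个动作前检查目标对象是否可操作;不可操作时先插入前置动作。"""
    plan, state = [], dict(affordances)
    for act, obj in actions:
        if not state.get(obj, False):
            plan.append(("make_operable", obj))  # 假设性的"恢复可操作"前置动作
            state[obj] = True
        plan.append((act, obj))
    return plan

plan = plan_with_affordance(
    [("open", "fridge"), ("take", "milk")],
    {"fridge": False, "milk": True},
)
```

与直接执行指令的规划器不同,这种结构使不可操作的对象触发显式的前置处理,而不是让后续动作直接失败。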

[NLP-26] Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models

【速读】: 该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在推理过程中如何整合视觉与文本信息、以及其决策机制的透明性问题。研究发现,模型普遍存在“答案惯性”现象,即早期预测倾向被强化而非修正;尽管经过专门推理训练的模型表现出更强的纠错能力,但其效果依赖于模态条件(如文本主导或纯视觉场景)。关键解决方案在于通过可控干预(如误导性文本线索)分析模型对多模态输入的依赖关系,并量化中间推理步骤的贡献。结果表明,Chain-of-Thought(CoT)仅提供部分决策过程的可见性:推理训练模型虽更显式引用误导线索,但其流畅的推理路径可能掩盖真实模态依赖;而指令微调模型虽引用较少,其简短推理痕迹反而暴露出与视觉输入的不一致。这揭示了当前CoT方法在理解多模态决策机制上的局限性,对提升VLM系统的可解释性和安全性具有重要意义。

链接: https://arxiv.org/abs/2604.14888
作者: Danae Sánchez Villegas,Samuel Lewis-Lim,Nikolaos Aletras,Desmond Elliott
机构: University of Copenhagen, Department of Computer Science (哥本哈根大学计算机科学系); University of Sheffield, School of Computer Science (谢菲尔德大学计算机学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced, rather than revised during reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient, and assess whether this influence is recoverable from CoT. Although this influence can appear in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer and fluent CoTs can still appear visually grounded while actually following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.

[NLP-27] RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在自回归解码(autoregressive decoding)过程中因逐token生成而导致的高推理延迟问题。现有无训练类推测解码(speculative decoding, SD)方法存在局限性:基于检索的草稿在无精确匹配时失效,而基于logits的草稿缺乏结构引导。其解决方案的关键在于提出RACER(Retrieval-Augmented Contextual Rapid Speculative Decoding),一种轻量且无需训练的方法,通过整合检索到的精确模式与基于logits的未来线索,统一提供可靠的锚点和灵活的外推能力,从而生成更丰富的推测草稿,显著提升推理效率,在Spec-Bench、HumanEval和MGSM-ZH等基准上实现超过2倍于自回归解码的速度提升,并优于已有无训练方法,具备可扩展性和即插即用特性。

链接: https://arxiv.org/abs/2604.14885
作者: Zihong Zhang,Zuchao Li,Lefei Zhang,Ping Wang,Hai Zhao
机构: Wuhan University (武汉大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of ACL 2026

点击查看摘要

Abstract:Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing training-free variants face trade-offs: retrieval-based drafts break when no exact match exists, while logits-based drafts lack structural guidance. We propose RACER (Retrieval-Augmented Contextual Rapid Speculative Decoding), a lightweight and training-free method that integrates retrieved exact patterns with logit-driven future cues. This unification supplies both reliable anchors and flexible extrapolation, yielding richer speculative drafts. Experiments on Spec-Bench, HumanEval, and MGSM-ZH demonstrate that RACER consistently accelerates inference, achieving more than 2x speedup over autoregressive decoding, and outperforms prior training-free methods, offering a scalable, plug-and-play solution for efficient LLM decoding. Our source code is available at this https URL.
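
下面用一段玩具代码示意推测解码通用的"猜测-验证"循环(仅为示意性实现,展示 RACER 所基于的解码框架本身,不包含其检索增强与 logits 线索部分;其中的草稿模型与目标模型均为虚构的确定性规则):

```python
def speculative_step(prefix, draft_model, target_model, k=4):
    """执行一步"猜测-验证":返回本步被接受的 token 序列。"""
    # 1) 草稿阶段:小模型自回归地猜 k 个 token
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    # 2) 验证阶段:大模型给出每个位置的"真实"下一个 token,
    #    接受与之一致的最长前缀,首个不一致处用目标输出纠正
    accepted, ctx = [], list(prefix)
    for t in draft:
        expect = target_model(ctx)
        if t != expect:
            accepted.append(expect)   # 纠正并停止
            break
        accepted.append(t)
        ctx.append(t)
    else:
        accepted.append(target_model(ctx))  # 全部命中:额外免费得 1 个 token
    return accepted

# 玩具模型:按固定字符序列续写
SEQ = list("speculative")
target = lambda ctx: SEQ[len(ctx)] if len(ctx) < len(SEQ) else "<eos>"
drafty = lambda ctx: SEQ[len(ctx)] if len(ctx) < 6 else "?"  # 仅前几步猜得准

out = speculative_step([], drafty, target, k=4)
print(out)  # → ['s', 'p', 'e', 'c', 'u']:4 个草稿 token 全中,外加 1 个目标 token
```

在真实系统中,验证阶段对整段草稿是一次批量前向,这正是加速来源:草稿命中越多,目标模型一次前向产出的 token 越多。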

[NLP-28] Segment-Level Coherence for Robust Harmful Intent Probing in LLMs

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险化学、生物、放射性和核(CBRN)领域中面临自适应越狱攻击(adaptive jailbreaking)时的检测可靠性问题。现有基于流式探测(streaming probing)的方法因过度依赖少数高分词元(tokens)而易产生误报,尤其当敏感CBRN术语出现在非恶意语境中时。其解决方案的关键在于提出一种新的流式探测目标函数,要求多个证据词元持续支持同一预测结果,而非仅依赖单个词元的异常得分,从而通过聚合信号增强检测鲁棒性。该方法在固定1%假阳性率下将真阳性率提升35.55%,并显著改善AUROC指标,即使在基线性能接近饱和时仍具优势。

链接: https://arxiv.org/abs/2604.14865
作者: Xuanli He,Bilgehan Sel,Faizan Ali,Jenny Bao,Hoagy Cunningham,Jerry Wei
机构: University College London; Virginia Tech; Anthropic
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: preprint

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly exposed to adaptive jailbreaking, particularly in high-stakes Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Although streaming probes enable real-time monitoring, they still make systematic errors. We identify a core issue: existing methods often rely on a few high-scoring tokens, leading to false alarms when sensitive CBRN terms appear in benign contexts. To address this, we introduce a streaming probing objective that requires multiple evidence tokens to consistently support a prediction, rather than relying on isolated spikes. This encourages more robust detection based on aggregated signals instead of single-token cues. At a fixed 1% false-positive rate, our method improves the true-positive rate by 35.55% relative to strong streaming baselines. We further observe substantial gains in AUROC, even when starting from near-saturated baseline performance (AUROC = 97.40%). We also show that probing Attention or MLP activations consistently outperforms residual-stream features. Finally, even when adversarial fine-tuning enables novel character-level ciphers, harmful intent remains detectable: probes developed for the base LLMs can be applied "plug-and-play" to these obfuscated attacks, achieving an AUROC of over 98.85%.

[NLP-29] Schema Key Wording as an Instruction Channel in Structured Generation under Constrained Decoding

【速读】: 该论文旨在解决结构化生成中约束解码(constrained decoding)的性能优化问题,特别是指出当前方法将schema视为纯结构约束而忽视了其语言表述对模型行为的影响。解决方案的关键在于提出将结构化生成重新建模为多通道指令(multi-channel instruction)问题:除了显式通过提示(prompt)传递指令外,还首次系统性地揭示了schema键(schema key)的措辞可作为隐式指令通道,直接影响模型在约束解码下的表现。实验表明不同模型家族对这两类指令通道的敏感度存在差异(如Qwen模型受益于schema级指令,LLaMA模型更依赖prompt级指导),且指令通道间存在非加和性交互效应,这为优化结构化生成提供了新的设计维度——即schema不仅是输出格式的定义工具,也是蕴含指令信号的重要载体。

链接: https://arxiv.org/abs/2604.14862
作者: Yifan Le
机构: Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 2 figures. Work in progress

点击查看摘要

Abstract:Constrained decoding has been widely adopted for structured generation with large language models (LLMs), ensuring that outputs satisfy predefined formats such as JSON and XML. However, existing approaches largely treat schemas as purely structural constraints and overlook the possibility that their linguistic formulation may affect model behavior. In this work, we study how instruction placement influences model performance in structured generation and show that merely changing the wording of schema keys, without modifying the prompt or model parameters, can significantly alter model performance under constrained decoding. Based on this observation, we propose to reinterpret structured generation as a multi-channel instruction problem, where instructions can be conveyed explicitly through prompts and implicitly through schema keys during decoding. To the best of our knowledge, this is the first work to systematically study how schema key formulation acts as an implicit instruction channel and affects model performance under constrained decoding. Experiments on multiple mathematical reasoning benchmarks show that different model families exhibit distinct sensitivities to these instruction channels: Qwen models consistently benefit from schema-level instructions, while LLaMA models rely more heavily on prompt-level guidance. We further observe non-additive interaction effects between instruction channels, showing that combining multiple channels does not always lead to further improvement. These findings suggest that schema design not only determines output structure, but also carries instruction signals, offering a new perspective on structured generation in LLMs.
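
"schema 键作为隐式指令通道"可以用两个结构等价、仅键名措辞不同的 JSON schema 来直观说明(示例中的键名与字段均为假设,仅作演示):

```python
import json

# 两个对约束解码器"结构等价"的 schema:都是两个必填字符串字段,
# 区别仅在键名措辞——后者把"先逐步推理再作答"写进了键名本身
schema_plain = {
    "type": "object",
    "properties": {"text": {"type": "string"}, "answer": {"type": "string"}},
    "required": ["text", "answer"],
}
schema_instructed = {
    "type": "object",
    "properties": {
        "step_by_step_reasoning_before_final_answer": {"type": "string"},
        "final_answer": {"type": "string"},
    },
    "required": ["step_by_step_reasoning_before_final_answer", "final_answer"],
}

def skeleton(schema):
    """约束解码时模型被迫逐字生成的键名骨架,即隐式指令进入上下文的形式。"""
    return json.dumps({k: "..." for k in schema["properties"]}, ensure_ascii=False)

print(skeleton(schema_plain))
print(skeleton(schema_instructed))  # 键名本身成为解码上下文中的指令
```

在约束解码下,模型必须逐字生成这些键名,因此第二个 schema 的键名实际上在上下文中充当了"先推理、后作答"的指令,而 prompt 与模型参数完全未变。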

[NLP-30] ClimateCause: Complex and Implicit Causal Structures in Climate Reports ACL2026

【速读】: 该论文旨在解决现有因果发现数据集难以捕捉气候科学中复杂高阶因果结构的问题,特别是隐含因果关系和嵌套因果关系的缺失。解决方案的关键在于构建ClimateCause数据集,该数据集通过专家标注的方式从科学政策类气候报告中提取高阶因果结构,并将因果表达标准化与解耦为独立的因果关系单元,同时对因果关联性、关系类型及时空上下文进行独特标注,从而支持更精确的因果图构建与分析。

链接: https://arxiv.org/abs/2604.14856
作者: Liesbeth Allein,Nataly Pineda-Castañeda,Andrea Rocci,Marie-Francine Moens
机构: KU Leuven (鲁汶大学); Ghent University (根特大学); Università della Svizzera italiana (瑞士意大利语区大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026 [Findings]

点击查看摘要

Abstract:Understanding climate change requires reasoning over complex causal networks. Yet, existing causal discovery datasets predominantly capture explicit, direct causal relations. We introduce ClimateCause, a manually expert-annotated dataset of higher-order causal structures from science-for-policy climate reports, including implicit and nested causality. Cause-effect expressions are normalized and disentangled into individual causal relations to facilitate graph construction, with unique annotations for cause-effect correlation, relation type, and spatiotemporal context. We further demonstrate ClimateCause’s value for quantifying readability based on the semantic complexity of causal graphs underlying a statement. Finally, large language model benchmarking on correlation inference and causal chain reasoning highlights the latter as a key challenge.
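
"将因果表达拆解为单条关系、构建因果图、再以图的复杂度近似可读性"的流程,可以用如下玩具代码示意(关系数据为虚构,以最长因果链的边数作为复杂度代理是一个假设性的简化指标,并非论文采用的度量):

```python
from functools import lru_cache

# 虚构的规范化因果关系:(原因, 结果)
relations = [
    ("greenhouse gas emissions", "global warming"),
    ("global warming", "glacier melt"),
    ("glacier melt", "sea level rise"),
    ("sea level rise", "coastal flooding"),
    ("deforestation", "greenhouse gas emissions"),
]

children = {}
for cause, effect in relations:
    children.setdefault(cause, []).append(effect)

@lru_cache(maxsize=None)
def chain_length(node):
    """从 node 出发的最长因果链所含边数(假设图无环)。"""
    return max((1 + chain_length(c) for c in children.get(node, [])), default=0)

longest = max(chain_length(n) for n in children)
print(longest)  # → 5:deforestation → ... → coastal flooding
```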

[NLP-31] Exploring and Testing Skill-Based Behavioral Profile Annotation: Human Operability and LLM Feasibility under Schema-Guided Execution

【速读】: 该论文旨在解决行为特征(Behavioral Profile, BP)标注难以自动化的问题,因其需同时在多个语言维度上进行编码。解决方案的关键在于将BP标注视为一组可分离的标注技能(annotation skills)而非单一任务,并构建一种基于技能文件(skill-file-driven)的流水线方法:每个特征通过schema文件、决策规则和示例外部定义,从而实现对标注技能的细粒度评估与分类。研究发现,BP标注在技能层面具有高度异质性——部分技能可直接操作、部分可通过聚焦重标注恢复、部分则结构上不明确;大语言模型(如GPT-5.4)在保留技能上表现出较高可靠性,但其能力具有选择性而非全局适用,且与人类标注者呈现“共享分类体系、独立执行”的模式,表明自动标注应以技能可行性为评价基准,而非整体任务自动化程度。

链接: https://arxiv.org/abs/2604.14843
作者: Yufeng Wu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Behavioral Profile (BP) annotation is difficult to automate because it requires simultaneous coding across multiple linguistic dimensions. We treat BP annotation as a bundle of annotation skills rather than a single task and evaluate LLM-assisted BP annotation from this perspective. Using 3,134 concordance lines of 30 Chinese metaphorical color-term derivatives and a 14-feature BP schema, we implement a skill-file-driven pipeline in which each feature is externally defined through schema files, decision rules, and examples. Two human annotators completed a two-round schema-only protocol on a 300-instance validation subset, enabling BP skills to be classified as directly operable, recoverable under focused re-annotation, or structurally underspecified. GPT-5.4 and three locally deployable open-source models were then evaluated under the same setup. Results show that BP annotation is highly heterogeneous at the skill level: 5 skills are directly operable, 4 are recoverable after focused re-annotation, and 5 remain structurally underspecified. GPT-5.4 executes the retained skills with substantial reliability (accuracy = 0.678, κ = 0.665, weighted F1 = 0.695), but this feasibility is selective rather than global. Human and GPT difficulty profiles are strongly aligned at the skill level (r = 0.881), but not at the instance level (r = 0.016) or lexical-item level (r = -0.142), a pattern we describe as shared taxonomy, independent execution. Pairwise agreement further suggests that GPT is better understood as an independent third skill voice than as a direct human substitute. Open-source failures are concentrated in schema-to-skill execution problems. These findings suggest that automatic annotation should be evaluated in terms of skill feasibility rather than task-level automation.
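
按技能层面划分可行性时,核心统计量是两位标注者的一致性。下面用纯 Python 计算 Cohen's kappa 并按阈值分桶(分桶阈值与标签数据均为假设值,并非论文采用的划分标准):

```python
from collections import Counter

def cohens_kappa(a, b):
    """两位标注者标签序列的 Cohen's kappa。"""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n            # 观察一致率
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)         # 期望一致率
    return (po - pe) / (1 - pe)

def classify_skill(kappa, hi=0.6, lo=0.2):
    """按一致性高低把一项技能划入三类(阈值为假设值)。"""
    if kappa >= hi:
        return "directly operable"
    return "recoverable" if kappa >= lo else "underspecified"

# 虚构的某项技能上两位标注者的标签序列
ann1 = ["A", "A", "B", "B", "A", "C", "C", "B"]
ann2 = ["A", "A", "B", "A", "A", "C", "B", "B"]

k = cohens_kappa(ann1, ann2)
print(round(k, 3), classify_skill(k))
```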

[NLP-32] Pangu-ACE: Adaptive Cascaded Experts for Educational Response Generation on EduBench

【速读】: 该论文旨在解决教育辅助系统中计算资源分配效率低下的问题,即如何在保证任务质量的前提下,仅在必要时增加计算开销。其解决方案的关键在于设计并实现了一个样本级的1B到7B级联架构(cascade),其中Pangu-ACE系统首先使用一个1B规模的导师路由器(tutor-router)生成初步答案及路由信号,再根据信号决定是否直接接受该答案或将其升级至7B规模的专业提示模型进行处理。这种基于任务特性的动态路由机制显著提升了计算资源的利用效率,例如在共享8个教育任务的EduBench基准上,该方法将确定性质量从0.457提升至0.538,格式有效性从0.707提升至0.866,同时有19.7%的请求可直接由1B模型处理,体现了路由选择性而非运行时延迟优化的核心效率优势。

链接: https://arxiv.org/abs/2604.14828
作者: Dinghao Li,Wenlong Zhou,Zhimin Chen,Yuehan Peng,Hong Ni,Chengfu Zou,Guoyu Shi,Yaochen Li
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Educational assistants should spend more computation only when the task needs it. This paper rewrites our earlier draft around the system that was actually implemented and archived in the repository: a sample-level 1B to 7B cascade for the shared-8 EduBench benchmark. The final system, Pangu-ACE, uses a 1B tutor-router to produce a draft answer plus routing signals, then either accepts the draft or escalates the sample to a 7B specialist prompt. We also correct a major offline evaluation bug: earlier summaries over-credited some open-form outputs that only satisfied superficial format checks. After CPU-side rescoring from saved prediction JSONL, the full Chinese test archive (7013 samples) shows that cascade_final improves deterministic quality from 0.457 to 0.538 and format validity from 0.707 to 0.866 over the legacy rule_v2 system while accepting 19.7% of requests directly at 1B. Routing is strongly task dependent: IP is accepted by 1B 78.0% of the time, while QG and EC still escalate almost always. The current archived deployment does not yet show latency gains, so the defensible efficiency story is routing selectivity rather than wall-clock speedup. We also package a reproducible artifact-first paper workflow and clarify the remaining external-baseline gap: GPT-5.4 re-judging is implemented locally, but the configured provider endpoint and key are invalid, so final sampled-baseline alignment with GPT-5.4 remains pending infrastructure repair.
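
样本级 1B→7B 级联的"接受或升级"路由逻辑可以用如下玩具代码示意(接受规则、置信度信号与任务白名单均为假设,并非 Pangu-ACE 的实际实现):

```python
# 玩具模型:用固定规则代替真实的 1B / 7B 模型
def small_model(s):
    return "draft:" + s["q"], s.get("conf", 0.5)   # (草稿答案, 路由置信度)

def large_model(s):
    return "final:" + s["q"]

def route(sample, accept_conf=0.8, easy_tasks=("IP",)):
    """返回 (最终答案, 实际使用的模型规模)。"""
    draft, conf = small_model(sample)
    if sample["task"] in easy_tasks and conf >= accept_conf:
        return draft, "1B"               # 直接接受 1B 草稿,省去 7B 前向
    return large_model(sample), "7B"     # 升级到 7B 专家提示

print(route({"task": "IP", "q": "q1", "conf": 0.9}))  # → ('draft:q1', '1B')
print(route({"task": "QG", "q": "q2", "conf": 0.9}))  # → ('final:q2', '7B')
```

这对应论文观察到的"路由高度依赖任务类型":IP 类请求多被 1B 直接接受,而 QG、EC 几乎总是升级到 7B。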

[NLP-33] Domain Fine-Tuning FinBERT on Finnish Histopathological Reports: Train-Time Signals and Downstream Correlations

【速读】: 该论文旨在解决在标注数据稀缺的自然语言处理(Natural Language Processing, NLP)分类任务中,如何有效利用未标注领域数据提升模型性能的问题。其核心挑战在于医疗人工智能场景下常面临标签获取延迟长、数据积累慢等现实限制。解决方案的关键在于通过在芬兰语医学文本上对芬兰BERT模型进行领域微调(domain fine-tuning),并进一步探索从嵌入空间几何变化中预测领域特定预训练收益的可能性,从而为低资源医疗NLP任务提供可解释且高效的模型适配策略。

链接: https://arxiv.org/abs/2604.14815
作者: Rami Luisto,Liisa Petäinen,Tommi Grönholm,Jan Böhm,Maarit Ahtiainen,Tomi Lilja,Ilkka Pölönen,Sami Äyrämö
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In NLP classification tasks where little labeled data exists, domain fine-tuning of transformer models on unlabeled data is an established approach. In this paper we have two aims. (1) We describe our observations from fine-tuning the Finnish BERT model on Finnish medical text data. (2) We report on our attempts to predict the benefit of domain-specific pre-training of Finnish BERT from observing the geometry of embedding changes due to domain fine-tuning. Our driving motivation is the common situation in healthcare AI where we might experience long delays in acquiring datasets, especially with respect to labels.
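
"从嵌入几何变化预测领域预训练收益"的一种最简化度量,是比较同一批词在微调前后嵌入向量的平均余弦相似度(以下度量与玩具数据均为假设性示意,并非论文实际采用的几何指标):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_shift(before, after):
    """1 - 平均余弦相似度:值越大,领域微调对嵌入空间的搬动越明显。"""
    sims = [cosine(before[w], after[w]) for w in before]
    return 1 - sum(sims) / len(sims)

# 玩具数据:领域词 "carcinoma" 被明显移动,通用词 "and" 几乎不动
before = {"carcinoma": [1.0, 0.0], "and": [0.0, 1.0]}
after  = {"carcinoma": [0.6, 0.8], "and": [0.0, 1.0]}

print(round(mean_shift(before, after), 3))  # → 0.2
```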

[NLP-34] Modeling LLM Unlearning as an Asymmetric Two-Task Learning Problem ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中的机器遗忘(Machine Unlearning)问题,即在移除特定知识的同时保持模型的通用能力。传统方法常面临遗忘与保留之间的权衡困境,难以同时实现高效的知识删除和性能维持。论文提出了一种以保留优先的梯度合成框架(Retention-Prioritized Gradient Synthesis Framework),其核心创新在于将任务特定梯度提取与冲突感知组合解耦,并引入SAGO方法,通过构造性符号约束合成机制实现更紧密的梯度对齐。关键突破在于重新塑造梯度几何结构而非简单调整损失权重,从而显著缓解遗忘与保留之间的矛盾,在多个基准测试中实现了更高的保留性能(如WMDP Bio上MMLU指标从44.6%提升至96.0%)且保持一致的遗忘强度。

链接: https://arxiv.org/abs/2604.14808
作者: Zeguan Xiao,Siqing Li,Yong Wang,Xuetao Wei,Jian Yang,Yun Chen,Guanhua Chen
机构: Shanghai University of Finance and Economics(上海财经大学); Alibaba Group; Southern University of Science and Technology(南方科技大学); Beihang University(北京航空航天大学); MoE Key Laboratory of Interdisciplinary Research of Computation and Economics(教育部计算与经济交叉研究重点实验室)
类目: Computation and Language (cs.CL)
备注: ACL 2026

点击查看摘要

Abstract:Machine unlearning for large language models (LLMs) aims to remove targeted knowledge while preserving general capability. In this paper, we recast LLM unlearning as an asymmetric two-task problem: retention is the primary objective and forgetting is an auxiliary. From this perspective, we propose a retention-prioritized gradient synthesis framework that decouples task-specific gradient extraction from conflict-aware combination. Instantiating the framework, we adapt established PCGrad to resolve gradient conflicts, and introduce SAGO, a novel retention-prioritized gradient synthesis method. Theoretically, both variants ensure non-negative cosine similarity with the retain gradient, while SAGO achieves strictly tighter alignment through constructive sign-constrained synthesis. Empirically, on WMDP Bio/Cyber and RWKU benchmarks, SAGO consistently pushes the Pareto frontier: e.g., on WMDP Bio (SimNPO+GD), recovery of target model MMLU performance progresses from 44.6% (naive) to 94.0% (+PCGrad) and further to 96.0% (+SAGO), while maintaining comparable forgetting strength. Our results show that re-shaping gradient geometry, rather than re-balancing losses, is the key to mitigating unlearning-retention trade-offs.
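
论文将已有的 PCGrad 适配到遗忘场景;其核心投影规则可以用二维玩具梯度演示(示意性实现,展示的是通用的冲突投影,而非 SAGO 的符号约束合成):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_conflict(g_forget, g_retain):
    """若 <g_forget, g_retain> < 0,剔除 g_forget 在 g_retain 方向上的分量。"""
    d = dot(g_forget, g_retain)
    if d >= 0:
        return list(g_forget)
    coef = d / dot(g_retain, g_retain)
    return [a - coef * b for a, b in zip(g_forget, g_retain)]

g_retain = [1.0, 0.0]
g_forget = [-1.0, 1.0]                       # 与保留方向冲突(内积为负)
g_fixed  = project_conflict(g_forget, g_retain)
combined = [a + b for a, b in zip(g_retain, g_fixed)]

print(g_fixed)                  # → [0.0, 1.0]:冲突分量被剔除
print(dot(combined, g_retain))  # 合成梯度与保留梯度内积非负
```

投影后合成梯度与保留梯度的内积保持非负,这正是文中"与 retain 梯度余弦相似度非负"这一保证的最小版本;SAGO 在此基础上通过符号约束合成取得更紧的对齐。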

[NLP-35] The LLM Fallacy: Misattribution in AI-Assisted Cognitive Workflows

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在日常工作中广泛应用后,用户对其自身能力的认知发生系统性偏差,即“LLM错觉”(LLM fallacy)——个体将由LLM辅助生成的输出误认为是自身独立能力的体现,从而导致自我认知与实际能力之间出现偏差。解决方案的关键在于识别并澄清这种认知归因错误的机制:LLMs的透明度缺失、输出流畅性以及低门槛交互模式模糊了人机贡献边界,使用户基于结果而非生成过程推断自身能力。作者提出一个概念框架和跨计算、语言、分析与创意领域的表现类型学,以揭示该错觉的内在运作逻辑,并为教育、招聘和AI素养培养提供理论依据与实证方向。

链接: https://arxiv.org/abs/2604.14807
作者: Hyunwoo Kim,Harin Yu,Hanau Yi
机构: ddai Inc. (ddai 公司)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid integration of large language models (LLMs) into everyday workflows has transformed how individuals perform cognitive tasks such as writing, programming, analysis, and multilingual communication. While prior research has focused on model reliability, hallucination, and user trust calibration, less attention has been given to how LLM usage reshapes users’ perceptions of their own capabilities. This paper introduces the LLM fallacy, a cognitive attribution error in which individuals misinterpret LLM-assisted outputs as evidence of their own independent competence, producing a systematic divergence between perceived and actual capability. We argue that the opacity, fluency, and low-friction interaction patterns of LLMs obscure the boundary between human and machine contribution, leading users to infer competence from outputs rather than from the processes that generate them. We situate the LLM fallacy within existing literature on automation bias, cognitive offloading, and human–AI collaboration, while distinguishing it as a form of attributional distortion specific to AI-mediated workflows. We propose a conceptual framework of its underlying mechanisms and a typology of manifestations across computational, linguistic, analytical, and creative domains. Finally, we examine implications for education, hiring, and AI literacy, and outline directions for empirical validation. We also provide a transparent account of human–AI collaborative methodology. This work establishes a foundation for understanding how generative AI systems not only augment cognitive performance but also reshape self-perception and perceived expertise.

[NLP-36] Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems

【速读】: 该论文旨在解决多模态系统中有效回避(Effective Abstention, EA)问题,即模型在证据不足时应主动选择不回答,以提升系统的可靠性。当前视觉-语言模型(Vision-Language Models, VLMs)和多智能体系统(Multi-Agent Systems, MAS)的评估范式普遍假设所有输入都可回答,导致模型被迫输出答案,忽视了真实场景中的不可回答性。解决方案的关键在于构建一个名为MM-AQA的新基准,通过两个维度——视觉模态依赖性和证据充分性——从可回答样本中生成不可回答实例,从而更真实地评估模型的回避能力。实验表明,单纯改进提示策略或增加智能体数量无法显著提升回避性能,真正有效的途径是引入“回避感知”训练机制,而非依赖更复杂的 prompting 或更多代理结构。

链接: https://arxiv.org/abs/2604.14799
作者: Nishanth Madhusudhan,Vikas Yadav,Alexandre Lacoste
机构: ServiceNow Research (ServiceNow 研究院)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages and 4 figures (excluding appendix)

点击查看摘要

Abstract:Effective abstention (EA), recognizing evidence insufficiency and refraining from answering, is critical for reliable multimodal systems. Yet existing evaluation paradigms for vision-language models (VLMs) and multi-agent systems (MAS) assume answerability, pushing models to always respond. Abstention has been studied in text-only settings but remains underexplored multimodally; current benchmarks either ignore unanswerability or rely on coarse methods that miss realistic failure modes. We introduce MM-AQA, a benchmark that constructs unanswerable instances from answerable ones via transformations along two axes: visual modality dependency and evidence sufficiency. Evaluating three frontier VLMs spanning closed and open-source models and two MAS architectures across 2079 samples, we find: (1) under standard prompting, VLMs rarely abstain; even simple confidence baselines outperform this setup, (2) MAS improves abstention but introduces an accuracy-abstention trade-off, (3) sequential designs match or exceed iterative variants, suggesting the bottleneck is miscalibration rather than reasoning depth, and (4) models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence. Effective multimodal abstention requires abstention-aware training rather than better prompting or more agents.
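
摘要指出,简单的置信度基线在标准提示下也能超过模型自身的弃答行为。下面的玩具代码演示这种阈值弃答基线,以及随阈值升高出现的"准确率-弃答率"权衡(数据为虚构):

```python
def abstention_tradeoff(preds, thresh):
    """preds: (是否答对, 置信度) 列表;置信度低于阈值的样本弃答。"""
    answered = [(ok, c) for ok, c in preds if c >= thresh]
    abstain_rate = 1 - len(answered) / len(preds)
    acc = sum(ok for ok, _ in answered) / len(answered) if answered else 0.0
    return round(abstain_rate, 2), round(acc, 2)

# 玩具数据:低置信样本更容易答错
preds = [(True, 0.9), (True, 0.8), (False, 0.4), (True, 0.7),
         (False, 0.3), (False, 0.5), (True, 0.6), (False, 0.2)]

for t in (0.0, 0.5, 0.7):
    print(t, abstention_tradeoff(preds, t))
# 阈值升高 → 弃答率上升、回答样本上的准确率上升:典型的准确率-弃答率权衡
```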

[NLP-37] AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning ACM-MM2026

【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在持续视觉问答(Continual Visual Question Answering, VQA)任务中因架构不对称而导致的灾难性遗忘问题。现有持续学习方法多针对对称、单模态架构设计,而现代VLMs具有显著的参数分布不均和模态间不对称性,导致标准全局正则化策略在优化过程中过度偏向庞大的语言解码器,使关键但参数较少的视觉投影层易受干扰,进而严重损害组合推理能力。解决方案的关键在于提出非对称信息掩码(Asymmetric Information Masking, AIM),通过基于模态特异性敏感度的局部掩码机制,实现对不同模态组件的差异化保护,在稳定性和可塑性之间取得平衡,从而有效缓解局部退化并提升对新型技能-概念组合的泛化性能。

链接: https://arxiv.org/abs/2604.14779
作者: Peifeng Zhang,Zice Qiu,Donghua Yu,Shilei Cao,Juepeng Zheng,Yutong Lu,Haohuan Fu
机构: Sun Yat-Sen University (中山大学); National Supercomputing Center in Shenzhen (深圳国家超级计算中心); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 18 pages, 9 figures. Submitted to ACM MM 2026

点击查看摘要

Abstract:In continual visual question answering (VQA), existing Continual Learning (CL) methods are mostly built for symmetric, unimodal architectures. However, modern Vision-Language Models (VLMs) violate this assumption, as their trainable components are inherently asymmetric. This structural mismatch renders VLMs highly prone to catastrophic forgetting when learning from continuous data streams. Specifically, the asymmetry causes standard global regularization to favor the massive language decoder during optimization, leaving the smaller but critical visual projection layers highly vulnerable to interference. Consequently, this localized degradation leads to a severe loss of compositional reasoning capabilities. To address this, we propose Asymmetric Information Masking (AIM), which balances stability and plasticity by applying targeted masks based on modality-specific sensitivity. Experiments on VQA v2 and GQA under continual VQA settings show that AIM achieves state-of-the-art performance in both Average Performance (AP) and Average Forgetting (AF), while better preserving generalization to novel skill-concept compositions.
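
"按模态敏感度施加非对称掩码"的思想,可以用一个按敏感度冻结参数组的玩具规则示意(参数组名、敏感度数值与冻结预算均为假设,并非 AIM 的实际算法;真实方法中敏感度可由 Fisher 信息等估计):

```python
def masked_update(grads, sensitivity, budget=0.5):
    """按敏感度从高到低冻结参数组(梯度置零),冻结比例由 budget 控制。"""
    order = sorted(grads, key=lambda k: sensitivity[k], reverse=True)
    frozen = set(order[: int(len(order) * budget)])
    return {k: (0.0 if k in frozen else g) for k, g in grads.items()}, frozen

# 虚构的参数组梯度与"稳定性敏感度"
grads = {"vision_proj": 0.8, "decoder_block": 0.5, "lm_head": 0.3, "embed": 0.2}
sens  = {"vision_proj": 0.9, "embed": 0.4, "decoder_block": 0.2, "lm_head": 0.1}

updated, frozen = masked_update(grads, sens, budget=0.5)
print(frozen)   # 敏感而脆弱的 vision_proj 等被优先保护(梯度置零)
print(updated)
```

其效果是:小而关键的视觉投影层受到更强保护,庞大的语言解码器保留更多可塑性,对应摘要中"稳定性与可塑性平衡"的目标。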

[NLP-38] CoPA: Benchmarking Personalized Question Answering with Data-Informed Cognitive Factors ACL

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在问答(Question Answering, QA)任务中个性化评估的瓶颈问题。现有方法主要依赖词法相似度或人工启发式规则,缺乏数据驱动的验证,难以准确衡量模型输出与用户个体偏好之间的匹配程度。其解决方案的关键在于挖掘社区-个体偏好差异(Community-Individual Preference Divergence, CIPD),即个体选择超越群体共识的现象,并据此提炼出六个关键的个性化因素作为评估维度。基于此,作者构建了CoPA基准,包含1,985个用户画像,通过量化模型输出与从用户交互模式中推断出的个体认知偏好之间的对齐程度,实现细粒度、因子层面的个性化QA评估,从而提供比通用指标更全面且更具区分度的评价标准。

链接: https://arxiv.org/abs/2604.14773
作者: Hang Su,Zequn Liu,Chen Hu,Xuesong Lu,Yingce Xia,Zhen Liu
机构: East China Normal University (华东师范大学); Zhongguancun Academy (中关村学院)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL. 30 pages, 10 figures

点击查看摘要

Abstract:While LLMs have demonstrated remarkable potential in Question Answering (QA), evaluating personalization remains a critical bottleneck. Existing paradigms predominantly rely on lexical-level similarity or manual heuristics, often lacking sufficient data-driven validation. We address this by mining Community-Individual Preference Divergence (CIPD), where individual choices override consensus, to distill six key personalization factors as evaluative dimensions. Accordingly, we introduce CoPA, a benchmark with 1,985 user profiles for fine-grained, factor-level assessment. By quantifying the alignment between model outputs and user-specific cognitive preferences inferred from interaction patterns, CoPA provides a more comprehensive and discriminative standard for evaluating personalized QA than generic metrics. The code is available at this https URL.
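
"社区-个体偏好分歧(CIPD)"的挖掘口径可以用如下玩具代码示意:统计个体选择偏离社区多数选项的频率(数据与口径均为假设性简化,并非论文的实际构建流程):

```python
from collections import Counter

def cipd_rate(user_choices, community_choices):
    """个体选择偏离社区多数选项的比例(CIPD 的一个玩具口径)。"""
    diverged = 0
    for q, ans in user_choices.items():
        majority = Counter(community_choices[q]).most_common(1)[0][0]
        diverged += ans != majority
    return diverged / len(user_choices)

# 虚构数据:每题的社区投票与某个体的选择
community = {"q1": ["A", "A", "B", "A"],
             "q2": ["B", "B", "B", "A"],
             "q3": ["C", "C", "A", "C"]}
user = {"q1": "B", "q2": "B", "q3": "A"}

print(round(cipd_rate(user, community), 3))  # 3 题中有 2 题偏离多数
```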

[NLP-39] Which bird does not have wings: Negative-constrained KGQA with Schema-guided Semantic Matching and Self-directed Refinement ACL2026

【速读】: 该论文旨在解决大语言模型在知识图谱问答(KGQA)任务中对负向约束(negative constraint)处理能力不足的问题,现有方法普遍忽视了现实场景中频繁出现的否定性语义表达,导致模型在面对包含否定逻辑的复杂问题时表现不佳。其解决方案的关键在于提出一个全新的任务范式——NEgative-conSTrained (NEST) KGQA,构建了专门针对负约束的基准数据集NestKGQA,并设计了一种可读性强且能清晰表达否定逻辑的Python格式逻辑形式PyLF;进一步地,为应对多约束问题带来的语义复杂性,提出了CUCKOO框架,该框架通过生成约束感知的逻辑形式草稿并进行模式引导的语义匹配,在执行结果为空时才触发自指导式精炼机制,从而在减少计算开销的同时提升模型鲁棒性和语义可执行性。

链接: https://arxiv.org/abs/2604.14749
作者: Midan Shim,Seokju Hwang,Kaehyun Um,Kyong-Ho Lee
机构: Yonsei University (延世大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2026 findings

点击查看摘要

Abstract:Large language models still struggle with faithfulness and hallucinations despite their remarkable reasoning abilities. In Knowledge Graph Question Answering (KGQA), semantic parsing-based approaches address the limitations by understanding constraints in a user’s question and converting them into a logical form to execute on a knowledge graph. However, existing KGQA benchmarks and methods are biased toward positive and calculation constraints. Negative constraints are neglected, although they frequently appear in real-world questions. In this paper, we introduce a new task, NEgative-conSTrained (NEST) KGQA, where each question contains at least one negative constraint, and a corresponding dataset, NestKGQA. We also design PyLF, a Python-formatted logical form, since existing logical forms are hardly suitable to express negation clearly while maintaining readability. Furthermore, NEST questions naturally contain multiple constraints. To mitigate their semantic complexity, we present a novel framework named CUCKOO, specialized to multiple-constrained questions and ensuring semantic executability. CUCKOO first generates a constraint-aware logical form draft and performs schema-guided semantic matching. It then selectively applies self-directed refinement only when executing improper logical forms yields an empty result, reducing cost while improving robustness. Experimental results demonstrate that CUCKOO consistently outperforms baselines on both conventional KGQA and NEST-KGQA benchmarks under few-shot settings.
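
PyLF 这类 Python 格式逻辑形式处理负向约束的直觉,可以用标题问题"哪种鸟没有翅膀"在玩具知识图谱上的集合差来示意(以下语法与数据均为假设,并非论文定义的 PyLF 本身):

```python
kg = {  # 玩具知识图谱:(头实体, 关系, 尾实体)——数据为虚构
    ("penguin", "is_a", "bird"), ("penguin", "has_part", "wings"),
    ("sparrow", "is_a", "bird"), ("sparrow", "has_part", "wings"),
    ("kiwi",    "is_a", "bird"),          # 玩具设定:kiwi 没有 wings 三元组
    ("snake",   "is_a", "reptile"),
}

def heads(rel, tail):
    """返回满足 (h, rel, tail) 的所有头实体。"""
    return {h for h, r, t in kg if r == rel and t == tail}

# PyLF 风格的逻辑形式:正约束集合 减去 负约束集合
birds  = heads("is_a", "bird")
winged = heads("has_part", "wings")
answer = birds - winged        # "bird that does NOT have wings"
print(answer)  # → {'kiwi'}
```

用集合差表达否定,比在查询语言中嵌套否定子句更接近可读的程序语义,这也是为 NEST 问题设计 Python 格式逻辑形式的动机。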

[NLP-40] CAMO: An Agentic Framework for Automated Causal Discovery from Micro Behaviors to Macro Emergence in LLM Agent Simulations

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体模拟中,从微观个体行为到宏观社会现象涌现之间的因果机制不清晰的问题。其核心挑战在于涌现现象源于智能体间的复杂交互、中观层面的反馈回路及非线性动态,导致生成机制难以解析。解决方案的关键是提出一个自动化因果发现框架——\textscCAMO(Causal discovery from Micro behaviors to Macro Emergence),该框架将机制假设转化为基于模拟记录的可计算因子,学习以涌现目标变量 $ Y $ 为中心的紧凑因果表示,并输出可计算的马尔可夫边界与最小上游解释子图,从而揭示可解释的因果链并提供可操作的干预杠杆;同时利用模拟器内部反事实探测来定向模糊边并修正假设,增强因果推断的准确性与可靠性。

链接: https://arxiv.org/abs/2604.14691
作者: Xiangning Yu,Yuwei Guo,Yuqi Hou,Xiao Xue,Qun Ma
机构: Tianjin University(天津大学); Tianjin Key Laboratory of Healthy Habitat and Smart Technology(天津市健康环境与智能技术重点实验室); Laboratory of Computation and Analytics of Complex Management Systems(复杂管理系统计算与分析实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:LLM-empowered agent simulations are increasingly used to study social emergence, yet the micro-to-macro causal mechanisms behind macro outcomes often remain unclear. This is challenging because emergence arises from intertwined agent interactions and meso-level feedback and nonlinearity, making generative mechanisms hard to disentangle. To this end, we introduce CAMO, an automated Causal discovery framework from Micro behaviors to Macro Emergence in LLM agent simulations. CAMO converts mechanistic hypotheses into computable factors grounded in simulation records and learns a compact causal representation centered on an emergent target Y. CAMO outputs a computable Markov boundary and a minimal upstream explanatory subgraph, yielding interpretable causal chains and actionable intervention levers. It also uses simulator-internal counterfactual probing to orient ambiguous edges and revise hypotheses when evidence contradicts the current view. Experiments across four emergent settings demonstrate the promise of CAMO.

[NLP-41] Acceptance Dynamics Across Cognitive Domains in Speculative Decoding

【速读】: 该论文旨在解决生成式 AI(Generative AI)在大语言模型(Large Language Model, LLM)推理过程中因采样效率低下导致的延迟问题,特别是通过树状推测解码(tree-based speculative decoding)优化推理速度。其核心解决方案在于利用一个小型草稿模型(draft model)生成候选token序列树,并由目标模型(target model)批量验证这些token,从而提升整体推理吞吐量。关键创新在于系统性地分析任务类型对接受概率(acceptance probability)的影响机制:研究发现,任务类型比树深度更能预测接受率,且不同领域(如代码生成、数学推理、逻辑推理与开放对话)存在显著差异,尤其开放对话域表现出最高接受长度和异常高的接受率——尽管其熵值也最高,这归因于RLHF对齐带来的词汇可预测性增强。这一发现为制定基于领域的推测预算分配和草稿模型选择策略提供了实证依据。

链接: https://arxiv.org/abs/2604.14682
作者: Saif Mahmoud
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Speculative decoding accelerates large language model (LLM) inference. It uses a small draft model to propose a tree of future tokens. A larger target model then verifies these tokens in a single batched forward pass. Despite the growing body of work on speculative methods, the degree to which the cognitive characteristics of a task affect acceptance probability remains largely unexplored. We present an empirical study of tree-based speculative decoding acceptance dynamics. Our study spans four well-established NLP benchmark domains: code generation, mathematical reasoning, logical reasoning, and open-ended chat. For this, we use TinyLlama-1.1B as the draft model against Llama-2-7B-Chat-GPTQ as the target. Over 99,768 speculative nodes collected from 200 prompts, we derive per-domain acceptance rates, expected accepted lengths, depth-acceptance profiles, and entropy-acceptance correlations. We find that task type is a stronger predictor of acceptance than tree depth. Furthermore, only the chat domain consistently yields an expected accepted length exceeding 1.0 token per step. We also show that the entropy-acceptance correlation is consistently negative but weak across all domains (rho in [-0.20, -0.15]). Counterintuitively, chat produces the highest entropy yet the highest acceptance rate. We attribute this divergence to the lexical predictability of RLHF-aligned register. These findings have direct implications for domain-aware speculation budgets and draft-model selection strategies. Index Terms: speculative decoding, large language model inference, tree attention, draft model, acceptance probability, LLM efficiency
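
文中按领域统计的接受率与熵-接受相关性,可以用如下纯 Python 代码在虚构数据上复现其统计口径(记录数据完全为虚构,仅示意计算方式):

```python
import math

# 虚构记录:(领域, 草稿节点处的熵, 是否被接受)
records = [
    ("chat", 2.1, 1), ("chat", 1.8, 1), ("chat", 2.5, 0), ("chat", 2.0, 1),
    ("code", 0.9, 1), ("code", 1.5, 0), ("code", 1.2, 0), ("code", 0.7, 1),
]

def acceptance_rate(domain):
    xs = [a for d, _, a in records if d == domain]
    return sum(xs) / len(xs)

def pearson(xs, ys):
    """皮尔逊相关系数(纯 Python 实现)。"""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(acceptance_rate("chat"), acceptance_rate("code"))   # 按领域的接受率
ent = [e for _, e, _ in records]
acc = [a for _, _, a in records]
print(round(pearson(ent, acc), 3))                        # 熵-接受相关(此例为负)
```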

[NLP-42] SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models ACL2026

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在城市规划应用中可能隐含并放大性别空间偏见的问题,这一问题源于性别空间理论(gendered space theory)指出的性别等级结构嵌入空间组织的现象。解决方案的关键在于提出首个系统性评估框架SPAGBias,其核心包括:62种城市微观空间的分类体系、提示库以及三个诊断层——显式层(强制选择重采样)、概率层(标记级不对称性)和建构层(语义与叙事角色分析)。通过该框架,研究发现LLMs不仅在公共-私人二分法基础上存在性别关联,更形成了细致的微观空间映射,并揭示了偏见在预训练、指令微调和奖励建模等模型管道中的嵌入与强化机制,从而为理解生成式AI如何编码社会性别认知提供了实证依据。

链接: https://arxiv.org/abs/2604.14672
作者: Binxian Su,Haoye Lou,Shucheng Zhu,Weikang Wang,Ying Liu,Dong Yu,Pengyuan Liu
机构: Beijing Language and Culture University (北京语言大学); Renmin University of China (中国人民大学); Tsinghua University (清华大学); Shanghai University of Finance and Economics (上海财经大学); National Print Media Language Resources Monitoring Research Center (国家语言资源监测与研究平面媒体中心)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2026

点击查看摘要

Abstract:Large language models (LLMs) are being increasingly used in urban planning, but since gendered space theory highlights how gender hierarchies are embedded in spatial organization, there is concern that LLMs may reproduce or amplify such biases. We introduce SPAGBias - the first systematic framework to evaluate spatial gender bias in LLMs. It combines a taxonomy of 62 urban micro-spaces, a prompt library, and three diagnostic layers: explicit (forced-choice resampling), probabilistic (token-level asymmetry), and constructional (semantic and narrative role analysis). Testing six representative models, we identify structured gender-space associations that go beyond the public-private divide, forming nuanced micro-level mappings. Story generation reveals how emotion, wording, and social roles jointly shape “spatial gender narratives”. We also examine how prompt design, temperature, and model scale influence bias expression. Tracing experiments indicate that these patterns are embedded and reinforced across the model pipeline (pre-training, instruction tuning, and reward modeling), with model associations found to substantially exceed real-world distributions. Downstream experiments further reveal that such biases produce concrete failures in both normative and descriptive application settings. This work connects sociological theory with computational analysis, extending bias research into the spatial domain and uncovering how LLMs encode social gender cognition through language.

[NLP-43] Rethinking Patient Education as Multi-turn Multi-modal Interaction

【速读】: 该论文旨在解决当前医疗多模态基准测试普遍局限于静态任务(如图像问答、报告生成)的局限性,尤其针对患者教育这一更具挑战性的场景——系统需在图像与文本之间建立证据关联、引导患者关注关键区域、用通俗语言解释发现,并应对患者的困惑或情绪反应。其解决方案的关键在于构建了一个名为MedImageEdu的多轮、证据锚定的放射科患者教育基准,包含150个真实病例,通过DoctorAgent与PatientAgent的交互模拟临床咨询过程,并引入一个由报告和图像信息驱动的绘图工具,使医生代理能够生成可视化辅助材料以增强理解。该设计不仅评估最终的多模态响应质量,还量化了诊疗流程中的多个维度表现,从而为验证多模态代理是否真正基于证据教学而非仅凭文本回答提供了可控测试平台。

链接: https://arxiv.org/abs/2604.14656
作者: Zonghai Yao,Zhipeng Tang,Chengtao Lin,Xiong Luo,Benlu Wang,Juncheng Huang,Chin Siang Ong,Hong Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Equal contribution for the first two authors

点击查看摘要

Abstract:Most medical multimodal benchmarks focus on static tasks such as image question answering, report generation, and plain-language rewriting. Patient education is more demanding: systems must identify relevant evidence across images, show patients where to look, explain findings in accessible language, and handle confusion or distress. Yet most patient education work remains text-only, even though combined image-and-text explanations may better support understanding. We introduce MedImageEdu, a benchmark for multi-turn, evidence-grounded radiology patient education. Each case provides a radiology report with report text and case images. A DoctorAgent interacts with a PatientAgent, conditioned on a hidden profile that captures factors such as education level, health literacy, and personality. When a patient question would benefit from visual support, the DoctorAgent can issue drawing instructions grounded in the report, case images, and the current question to a benchmark-provided drawing tool. The tool returns image(s), after which the DoctorAgent produces a final multimodal response consisting of the image(s) and a grounded plain-language explanation. MedImageEdu contains 150 cases from three sources and evaluates both the consultation process and the final multimodal response along five dimensions: Consultation, Safety and Scope, Language Quality, Drawing Quality, and Image-Text Response Quality. Across representative open- and closed-source vision-language model agents, we find three consistent gaps: fluent language often outpaces faithful visual grounding, safety is the weakest dimension across disease categories, and emotionally tense interactions are harder than low education or low health literacy. MedImageEdu provides a controlled testbed for assessing whether multimodal agents can teach from evidence rather than merely answer from text.

[NLP-44] CURA: Clinical Uncertainty Risk Alignment for Language Model-Based Risk Prediction ACL2026

【速读】: 该论文旨在解决临床语言模型(Clinical Language Models, CLMs)在风险预测任务中不确定性估计校准不足、临床不可靠的问题。其核心挑战在于如何使模型的预测置信度既反映个体患者的错误概率,又能体现群体层面的模糊性。解决方案的关键在于提出一种名为“临床不确定性风险对齐”(Clinical Uncertainty Risk Alignment, CURA)的框架:首先通过领域特定微调获得任务适配的患者嵌入表示,随后采用双层不确定性目标对多头分类器进行不确定性微调;其中个体级校准项将预测不确定性与每位患者的错误概率对齐,而群体感知正则化项则引导风险估计向嵌入空间局部邻域内的事件率靠拢,并对决策边界附近的模糊群体施加更高权重。该方法可被解释为使用邻域信息生成软标签的交叉熵损失,从而实现更可信的不确定性量化,显著改善校准性能且不牺牲判别能力。

链接: https://arxiv.org/abs/2604.14651
作者: Sizhe Wang,Ziqi Xu,Claire Najjuuko,Charles Alba,Chenyang Lu
机构: Washington University in St. Louis (圣路易斯华盛顿大学)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026 Main Conference

点击查看摘要

Abstract:Clinical language models (LMs) are increasingly applied to support clinical risk prediction from free-text notes, yet their uncertainty estimates often remain poorly calibrated and clinically unreliable. In this work, we propose Clinical Uncertainty Risk Alignment (CURA), a framework that aligns clinical LM-based risk estimates and uncertainty with both individual error likelihoods and cohort-level ambiguities. CURA first fine-tunes domain-specific clinical LMs to obtain task-adapted patient embeddings, and then performs uncertainty fine-tuning of a multi-head classifier using a bi-level uncertainty objective. Specifically, an individual-level calibration term aligns predictive uncertainty with each patient’s likelihood of error, while a cohort-aware regularizer pulls risk estimates toward event rates in their local neighborhoods in the embedding space and places extra weight on ambiguous cohorts near the decision boundary. We further show that this cohort-aware term can be interpreted as a cross-entropy loss with neighborhood-informed soft labels, providing a label-smoothing view of our method. Extensive experiments on MIMIC-IV clinical risk prediction tasks across various clinical LMs show that CURA consistently improves calibration metrics without substantially compromising discrimination. Further analysis illustrates that CURA reduces overconfident false reassurance and yields more trustworthy uncertainty estimates for downstream clinical decision support.
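论文将群体感知正则项解读为"带邻域软标签的交叉熵"。下面给出该解读的一个极简数值示意:软目标由真实标签与邻域事件率加权混合(混合权重 lam 为本示意假设的超参数,并非论文给定值),使过度自信的预测承受更大损失:

```python
import math

def soft_label_ce(p_hat, y, neighbor_rate, lam=0.3):
    """对邻域平滑软标签的二元交叉熵:
    target = (1 - lam) * y + lam * neighbor_rate。
    lam为本示意假设的混合权重, 并非论文给定的超参数。"""
    t = (1 - lam) * y + lam * neighbor_rate
    eps = 1e-12  # 数值稳定项
    return -(t * math.log(p_hat + eps) + (1 - t) * math.log(1 - p_hat + eps))

# 一个标签为1的患者, 其嵌入空间邻域中事件率仅为0.4:
# 软目标被拉向 0.7*1 + 0.3*0.4 = 0.82, 过度自信的预测反而损失更大。
loss_conf = soft_label_ce(0.95, 1, 0.4)  # 过度自信的预测
loss_cal = soft_label_ce(0.82, 1, 0.4)   # 与软目标一致的预测
print(f"overconfident loss = {loss_conf:.3f}, calibrated loss = {loss_cal:.3f}")
```

由于交叉熵在预测等于软目标时取最小值,这一损失形状正对应论文所述的标签平滑视角:把风险估计拉向邻域事件率,抑制"过度自信的虚假安心"。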

[NLP-45] CURaTE: Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在预训练阶段无法提前过滤潜在敏感或问题数据,导致需在训练后对特定知识进行删除(即“遗忘”)的问题。现有方法因缺乏持续性和实时性,在多次更新后易出现性能退化和敏感信息暴露延长的问题。其解决方案的关键在于提出一种名为CURaTE(Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge)的新框架:通过训练一个句子嵌入模型以形成清晰的决策边界,判断输入提示是否匹配已存储的遗忘请求;若匹配则拒绝响应,否则正常回答。该方法不修改语言模型参数,从而实现近乎完美的知识保留能力,并支持无限次数的持续遗忘操作,是首个真正具备实时连续遗忘能力的方法。

链接: https://arxiv.org/abs/2604.14644
作者: Seyun Bae,Seokhan Lee,Eunho Yang
机构: KT Corporation(KT公司); KAIST(韩国科学技术院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to Findings of ACL 2026

点击查看摘要

Abstract:The inability to filter out in advance all potentially problematic data from the pre-training of large language models has given rise to the need for methods for unlearning specific pieces of knowledge after training. Existing techniques overlook the need for continuous and immediate action, causing them to suffer from degraded utility as updates accumulate and protracted exposure of sensitive information. To address these issues, we propose Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge (CURaTE). Our method begins by training a sentence embedding model on a dataset designed to enable the formation of sharp decision boundaries for determining whether a given input prompt corresponds to any stored forget requests. The similarity of a given input to the forget requests is then used to determine whether to answer or return a refusal response. We show that even with such a simple approach, not only does CURaTE achieve more effective forgetting than existing methods, but by avoiding modification of the language model parameters, it also maintains near perfect knowledge preservation over any number of updates and is the only method capable of continual unlearning in real-time.
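CURaTE的核心是"不修改语言模型参数、只在输入端做路由"。下面是一个示意性的最小实现:用句向量与已存遗忘请求的余弦相似度决定回答还是拒答;其中 toy_embed 是示意用的词袋编码,仅代替论文中专门微调的句子嵌入模型:

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

class RefusalGate:
    """回答/拒答路由器: 语言模型参数从不被修改,
    "遗忘"一条知识 = 实时追加一条遗忘请求的嵌入。"""
    def __init__(self, embed, threshold=0.8):
        self.embed = embed        # 句子嵌入函数(此处为假设的接口)
        self.threshold = threshold
        self.forget = []          # 已存遗忘请求的嵌入

    def add_forget_request(self, text):
        self.forget.append(self.embed(text))

    def route(self, prompt):
        e = self.embed(prompt)
        if any(cosine(e, f) >= self.threshold for f in self.forget):
            return "REFUSE"
        return "ANSWER"

# 示意用的词袋"嵌入", 代替论文中微调的句子嵌入模型。
VOCAB = ["who", "is", "alice", "bob", "capital", "france", "age"]
def toy_embed(text):
    words = text.lower().split()
    return [words.count(w) for w in VOCAB]

gate = RefusalGate(toy_embed, threshold=0.8)
gate.add_forget_request("who is alice")
print(gate.route("who is alice"))       # 命中遗忘请求 -> REFUSE
print(gate.route("capital of france"))  # 无关问题 -> ANSWER
```

由于遗忘只是向列表追加一条嵌入,这一结构天然支持任意次数的实时持续遗忘,且对未命中的查询知识保留是严格无损的。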

[NLP-46] Fact4ac at the Financial Misinformation Detection Challenge Task: Reference-Free Financial Misinformation Detection via Fine-Tuning and Few-Shot Prompting of Large Language Models

【速读】: 该论文旨在解决金融领域中缺乏外部证据支持下的虚假信息检测问题(Reference-Free Financial Misinformation Detection),即在无法依赖外部事实核查的情况下,仅凭文本内部语义和上下文一致性判断金融声明的真实性。其解决方案的关键在于构建一个融合零样本(zero-shot)与少样本(few-shot)提示策略、以及基于低秩适应(Low-Rank Adaptation, LoRA)的参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)的综合框架,充分利用大语言模型(Large Language Models, LLMs)的推理能力,精准捕捉金融文本中的细微语义操纵线索,从而实现高精度的虚假信息识别。该方法在官方排行榜上取得95.4%(公开测试集)和96.3%(私有测试集)的准确率,显著提升了金融自然语言处理场景下情境感知型误导信息检测的性能。

链接: https://arxiv.org/abs/2604.14640
作者: Cuong Hoang,Le-Minh Nguyen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The proliferation of financial misinformation poses a severe threat to market stability and investor trust, misleading market behavior and creating critical information asymmetry. Detecting such misleading narratives is inherently challenging, particularly in real-world scenarios where external evidence or supplementary references for cross-verification are strictly unavailable. This paper presents our winning methodology for the “Reference-Free Financial Misinformation Detection” shared task. Built upon the recently proposed RFC-BENCH framework (Jiang et al. 2026), this task challenges models to determine the veracity of financial claims by relying solely on internal semantic understanding and contextual consistency, rather than external fact-checking. To address this formidable evaluation setup, we propose a comprehensive framework that capitalizes on the reasoning capabilities of state-of-the-art Large Language Models (LLMs). Our approach systematically integrates in-context learning, specifically zero-shot and few-shot prompting strategies, with Parameter-Efficient Fine-Tuning (PEFT) via Low-Rank Adaptation (LoRA) to optimally align the models with the subtle linguistic cues of financial manipulation. Our proposed system demonstrated superior efficacy, successfully securing the first-place ranking on both official leaderboards. Specifically, we achieved an accuracy of 95.4% on the public test set and 96.3% on the private test set, highlighting the robustness of our method and contributing to the acceleration of context-aware misinformation detection in financial Natural Language Processing. Our models (14B and 32B) are available at this https URL.

[NLP-47] Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options

【速读】: 该论文旨在解决当前多选题评估(multiple choice evaluation)在低选项设置下因捷径策略(shortcut strategies)导致模型性能被高估的问题,这些问题可能掩盖了模型的真实能力。其核心解决方案是提出一种“大规模选项评估协议”(massive option evaluation protocol),将候选集规模扩展至100个选项,从而显著降低随机猜测对结果的影响。该方法通过固定目标、重复重采样与洗牌机制,稳定估计模型表现,并有效区分内容驱动的错误与位置偏差(position bias)等干扰因素。实验表明,强于传统低选项基准的表现在高干扰密度下往往减弱,揭示出模型在语义混淆和早期选项偏好方面的两种失败模式,且主要瓶颈在于候选项排序能力而非上下文长度限制。这为在极端干扰密度下系统性测试模型可靠性提供了一个通用框架。

链接: https://arxiv.org/abs/2604.14634
作者: Nahyun Lee,Guijin Son
机构: Chung-Ang University (中央大学); Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multiple choice evaluation is widely used for benchmarking large language models, yet near ceiling accuracy in low option settings can be sustained by shortcut strategies that obscure true competence. Therefore, we propose a massive option evaluation protocol that scales the candidate set to one hundred options and sharply reduces the impact of chance performance. We apply this framework to a Korean orthography error detection task where models must pick the single incorrect sentence from a large candidate set. With fixed targets and repeated resampling and shuffling, we obtain stable estimates while separating content driven failures from positional artifacts. Across experiments, results indicate that strong performance in low option settings can overstate model competence. This apparent advantage often weakens under dense interference at high N , revealing gaps that conventional benchmarks tend to obscure. We identify two failure modes, semantic confusion and position bias toward early options under uncertainty. To isolate the effect of context length, we run padding controlled and length matched tests, which suggest that the main bottleneck is candidate ranking rather than context length. Together, these findings support massive option evaluation as a general framework for stress testing model reliability under extreme distractor density, beyond what low option benchmarks can reveal.
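把候选集扩到100个选项的直观依据是:纯猜测的期望准确率按 1/N 缩小,随机性对结果的影响随之被压到可忽略的水平。下面用一个简短模拟验证这一点(仅作示意):

```python
import random

def chance_accuracy(n_options, trials=20000):
    """当唯一错误句在n_options个候选中均匀随机放置时, 纯猜测的准确率。"""
    rng = random.Random(0)  # 固定种子, 结果可复现
    hits = sum(rng.randrange(n_options) == rng.randrange(n_options)
               for _ in range(trials))
    return hits / trials

for n in (4, 10, 100):
    print(f"N={n:3d}: 猜测准确率 ≈ {chance_accuracy(n):.3f} (理论值 {1 / n:.3f})")
```

在 N=4 时猜测基线高达0.25,足以被捷径策略掩盖真实能力;在 N=100 时基线仅0.01,模型得分几乎完全来自内容判断。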

[NLP-48] StoryCoder: Narrative Reformulation for Structured Reasoning in LLM Code Generation ACL2026

【速读】: 该论文旨在解决当前代码生成模型在面对复杂编程任务时,因问题表述缺乏结构化而难以有效推理和规划的问题。现有方法虽尝试通过增加推理步骤或注入特定思维结构来提升性能,但未能系统性地重构问题的条件信息。其解决方案的关键在于提出StoryCoder框架,将原始代码生成问题转化为由任务概述、约束条件和示例测试用例组成的连贯自然语言叙事(narrative reformulation),并依据选定的算法类型和代码风格(genre)进行引导。这种叙事结构不仅增强了上下文信息的组织性,还显著提升了模型在零样本场景下的准确率(平均提升18.7% pass@10),同时促进正确的算法策略选择、减少实现错误,并诱导更模块化的代码结构。

链接: https://arxiv.org/abs/2604.14631
作者: Geonhui Jang,Dongyoon Han,YoungJoon Yoo
机构: Chung-Ang University (中央大学); NAVER AI Lab (NAVER人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, 12 figures. ACL 2026 Main Conference

点击查看摘要

Abstract:Effective code generation requires both model capability and a problem representation that carefully structures how models reason and plan. Existing approaches augment reasoning steps or inject specific structure into how models think, but leave scattered problem conditions unchanged. Inspired by the way humans organize fragmented information into coherent explanations, we propose StoryCoder, a narrative reformulation framework that transforms code generation questions into coherent natural language narratives, providing richer contextual structure than simple rephrasings. Each narrative consists of three components: a task overview, constraints, and example test cases, guided by the selected algorithm and genre. Experiments across 11 models on HumanEval, LiveCodeBench, and CodeForces demonstrate consistent improvements, with an average gain of 18.7% in zero-shot pass@10. Beyond accuracy, our analyses reveal that narrative reformulation guides models toward correct algorithmic strategies, reduces implementation errors, and induces a more modular code structure. The analyses further show that these benefits depend on narrative coherence and genre alignment, suggesting that structured problem representation is important for code generation regardless of model scale or architecture. Our code is available at this https URL.

[NLP-49] Retrieve Then Classify: Corpus-Grounded Automation of Clinical Value Set Authoring

【速读】: 该论文旨在解决临床价值集编制(Clinical Value Set Authoring)中的瓶颈问题,即从标准化术语体系中准确识别定义某一临床概念的所有代码。传统方法依赖人工或简单提示(prompting)大语言模型(Large Language Model, LLM)直接生成代码,但受限于词汇库规模大、版本控制严格以及LLM在预训练阶段难以可靠记忆这些结构化知识,导致准确性不足。解决方案的关键在于提出检索增强的集合补全方法(Retrieval-Augmented Set Completion, RASC):首先从精心构建的语料库中检索出与目标概念最相似的K个已有价值集作为候选池,再通过分类器对候选池中的每个代码进行筛选。理论上,该策略可显著降低有效输出空间维度,从而减少统计复杂度。实验表明,RASC在11,803个公开VSAC价值集上构建了首个大规模基准,其基于SAPBert微调的交叉编码器模型达到AUROC~0.852和值集级别F1~0.298,显著优于仅检索或简单多层感知机(Multilayer Perceptron, MLP)方法,并且能将每条真阳性对应的无关候选数从12.3降至约3.2–4.4,验证了其理论优势。

链接: https://arxiv.org/abs/2604.14616
作者: Sumit Mukherjee,Juan Shu,Nairwita Mazumder,Tate Kernell,Celena Wheeler,Shannon Hastings,Chris Sidey-Gibbons
机构: Oracle Health Data Intelligence
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Clinical value set authoring – the task of identifying all codes in a standardized vocabulary that define a clinical concept – is a recurring bottleneck in clinical quality measurement and phenotyping. A natural approach is to prompt a large language model (LLM) to generate the required codes directly, but structured clinical vocabularies are large, version-controlled, and not reliably memorized during pretraining. We propose Retrieval-Augmented Set Completion (RASC): retrieve the K most similar existing value sets from a curated corpus to form a candidate pool, then apply a classifier to each candidate code. Theoretically, retrieve-and-select can reduce statistical complexity by shrinking the effective output space from the full vocabulary to a much smaller retrieved candidate pool. We demonstrate the utility of RASC on 11,803 publicly available VSAC value sets, constructing the first large-scale benchmark for this task. A cross-encoder fine-tuned on SAPBert achieves AUROC 0.852 and value-set-level F1 0.298, outperforming a simpler three-layer Multilayer Perceptron (AUROC 0.799, F1 0.250) and both reduce the number of irrelevant candidates per true positive from 12.3 (retrieval-only) to approximately 3.2 and 4.4 respectively. Zero-shot GPT-4o achieves value-set-level F1 0.105, with 48.6% of returned codes absent from VSAC entirely. This performance gap widens with increasing value set size, consistent with RASC’s theoretical advantage. We observe similar performance gains across two other classifier model types, namely a cross-encoder initialized from pre-trained SAPBert and a LightGBM model, demonstrating that RASC’s benefits extend beyond a single model class. The code to download and create the benchmark dataset, as well as the model training code is available at: this https URL.
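RASC"先检索、后分类"的流程可以用几行代码勾勒出来。以下为示意草图:其中小语料库、基于Jaccard相似度的检索以及按编码前缀判断的分类器都是人为构造的假设,论文实际使用的是SAPBert交叉编码器等模型:

```python
def jaccard(a, b):
    """两个价值集(编码集合)之间的Jaccard相似度。"""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def rasc(seed_codes, corpus, classify, k=2):
    """RASC流程草图:
    1) 从语料库检索与种子集合最相似的K个已有价值集;
    2) 汇总其编码作为候选池;
    3) 对池中每个编码调用分类器筛选。
    classify(code, seed) -> bool 在此代替论文中微调的交叉编码器。"""
    ranked = sorted(corpus, key=lambda vs: jaccard(seed_codes, vs), reverse=True)
    pool = set().union(*ranked[:k]) - seed_codes
    return seed_codes | {c for c in pool if classify(c, seed_codes)}

# 示意用的小语料库(编码为字符串), 与基于"E11"前缀的假设分类器。
corpus = [
    {"E11.9", "E11.65", "E11.8"},   # 2型糖尿病相关编码
    {"E10.9", "E11.9", "E13.9"},    # 混合类型糖尿病编码
    {"I10", "I11.9"},               # 高血压(无关)
]
accept = lambda code, seed: code.startswith("E11")

completed = rasc({"E11.9"}, corpus, accept, k=2)
print(sorted(completed))  # -> ['E11.65', 'E11.8', 'E11.9']
```

检索步骤将分类器面对的候选从整个词汇表缩小到池内少数编码,这正是论文所称"压缩有效输出空间、降低统计复杂度"的含义。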

[NLP-50] ConfLayers: Adaptive Confidence-based Layer Skipping for Self-Speculative Decoding

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段速度慢的问题,同时不牺牲生成质量。现有方法通常通过动态学习跳过某些层来构建轻量级的草稿模型(draft model),但这类基于启发式策略的层跳过方式可能效率不高或难以泛化。本文提出ConfLayers,其核心在于利用置信度(confidence)驱动的中间层跳过机制:通过迭代计算各层的置信度分数,依据自适应阈值选择跳过层,评估性能并更新最优跳过组合,直至收敛或达到最大迭代次数。该方案无需训练额外的层跳过策略,避免了复杂性与开销,且能实现更稳定的速度-质量权衡,同时保持对不同任务和数据集的适应能力。

链接: https://arxiv.org/abs/2604.14612
作者: Walaa Amer,Uday das,Fadi Kurdahi
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 13 pages, 9 figures

点击查看摘要

Abstract:Self-speculative decoding is an inference technique for large language models designed to speed up generation without sacrificing output quality. It combines fast, approximate decoding using a compact version of the model as a draft model with selective re-evaluation by the full target model. Some existing methods form the draft model by dynamically learning which layers to skip during inference, effectively creating a smaller subnetwork to speed up computation. However, using heuristic-based approaches to select layers to skip can often be simpler and more effective. In this paper, we propose ConfLayers, a dynamic plug-and-play approach to forming the draft model in self-speculative decoding via confidence-based intermediate layer skipping. The process iteratively computes confidence scores for all layers, selects layers to skip based on an adaptive threshold, evaluates the performance of the resulting set, and updates the best selection until no further improvement is achieved or a maximum number of iterations is reached. This framework avoids the overhead and complexity of training a layer skipping policy and can provide more consistent speed-quality trade-offs while preserving the adaptivity of the draft model to diverse tasks and datasets. The performance evaluation of ConfLayers across different models and datasets shows that our novel approach offers up to 1.4x speedup over vanilla LLM generation.
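ConfLayers的迭代过程(计算层置信度、按自适应阈值挑选跳过层、评估、更新最优组合、直至无提升或达迭代上限)可抽象为下面的草图。其中置信度数组与 evaluate 打分函数均为人为构造的示例,仅用于说明循环结构,并非论文的真实评分方式:

```python
def select_skip_layers(conf, evaluate, max_iters=5):
    """ConfLayers式迭代: 按置信度从高到低扩大跳层集合,
    保留得分最高的组合, 无提升或达到迭代上限即停止。
    conf[i]为第i层的置信度, evaluate(skip) -> float 为草稿模型
    跳过skip中各层后的综合得分(两者在此均为假设的接口)。"""
    order = sorted(range(len(conf)), key=lambda i: conf[i], reverse=True)
    best_skip, best_score = set(), evaluate(set())
    threshold = 1  # 自适应阈值: 当前认为可跳过的层数
    for _ in range(max_iters):
        cand = set(order[:threshold])
        score = evaluate(cand)
        if score <= best_score:
            break  # 无进一步提升
        best_skip, best_score = cand, score
        threshold += 1
    return best_skip, best_score

# 人为构造的示例: 6层网络, 跳过高置信度层{5, 2}时
# 速度收益大于质量损失, 再多跳反而得分下降。
conf = [0.1, 0.3, 0.8, 0.2, 0.4, 0.9]
def evaluate(skip):
    gain = 0.05 * len(skip)                        # 跳层带来的加速收益
    loss = sum(0.2 * (1 - conf[i]) for i in skip)  # 跳层造成的质量损失
    return 1.0 + gain - loss

skip, score = select_skip_layers(conf, evaluate)
print(f"skip layers = {sorted(skip)}, score = {score:.3f}")
```

与训练一个跳层策略相比,这种贪心迭代无需额外参数与训练开销,这正是论文强调的"即插即用"属性。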

[NLP-51] CausalDetox: Causal Head Selection and Intervention for Language Model Detoxification ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成过程中频繁产生有害内容的问题,这严重阻碍了其安全部署。现有缓解策略通常会降低生成质量或依赖昂贵的人工标注。解决方案的关键在于提出CAUSALDETOX框架,通过因果推理方法——概率必要性与充分性(Probability of Necessity and Sufficiency, PNS)——精准识别出对毒性生成既必要又充分的注意力头(attention heads),并基于此设计两种互补策略:一是局部推理时干预(Local Inference-Time Intervention),构建输入相关的动态引导向量实现上下文感知的去毒;二是PNS引导微调(PNS-Guided Fine-Tuning),永久性地消除模型中与毒性相关的表征。该方法在保持语言流畅性的前提下显著提升去毒效果,并大幅加速注意力头筛选效率。

链接: https://arxiv.org/abs/2604.14602
作者: Yian Wang,Yuen Chen,Agam Goyal,Hari Sundaram
机构: University of Illinois, Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026. 22 pages, 1 figure

点击查看摘要

Abstract:Large language models (LLMs) frequently generate toxic content, posing significant risks for safe deployment. Current mitigation strategies often degrade generation quality or require costly human annotation. We propose CAUSALDETOX, a framework that identifies and intervenes on the specific attention heads causally responsible for toxic generation. Using the Probability of Necessity and Sufficiency (PNS), we isolate a minimal set of heads that are necessary and sufficient for toxicity. We utilize these components via two complementary strategies: (1) Local Inference-Time Intervention, which constructs dynamic, input-specific steering vectors for context-aware detoxification, and (2) PNS-Guided Fine-Tuning, which permanently unlearns toxic representations. We also introduce PARATOX, a novel benchmark of aligned toxic/non-toxic sentence pairs enabling controlled counterfactual evaluation. Experiments on ToxiGen, ImplicitHate, and ParaDetox show that CAUSALDETOX achieves up to 5.34% greater toxicity reduction compared to baselines while preserving linguistic fluency, and offers a 7x speedup in head selection.

[NLP-52] NLP needs Diversity outside of Diversity

【速读】: 该论文试图解决自然语言处理(Natural Language Processing, NLP)领域中多样性研究过度集中于公平性相关方向的问题,指出这种不均衡现象源于一系列激励机制、偏见与障碍,导致边缘化研究者在非公平性子领域被排斥或被迫转向公平性研究。其解决方案的关键在于打破强化不平等的反馈循环,并系统性地消除地理和语言壁垒,以促进NLP各子领域实现更广泛、更公平的包容性发展。

链接: https://arxiv.org/abs/2604.14595
作者: Joshua Tint
机构: Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL)
备注: 7 pages, 1 figure

点击查看摘要

Abstract:This position paper argues that recent progress with diversity in NLP is disproportionately concentrated on a small number of areas surrounding fairness. We further argue that this is the result of a number of incentives, biases, and barriers which come together to disenfranchise marginalized researchers in non-fairness fields, or to move them into fairness-related fields. We substantiate our claims with an investigation into the demographics of NLP researchers by subfield, using our research to support a number of recommendations for ensuring that all areas within NLP can become more inclusive and equitable. In particular, we highlight the importance of breaking down feedback loops that reinforce disparities, and the need to address geographical and linguistic barriers that hinder participation in NLP research.

[NLP-53] Mechanistic Decoding of Cognitive Constructs in LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理复杂情绪(如社会比较嫉妒)时内部机制不明确的问题,现有可解释性方法多将其视为黑箱或仅关注粗粒度的基本情绪,未能揭示更复杂情感状态的认知结构。解决方案的关键在于提出一种基于表征工程(Representation Engineering, RepE)的认知逆向工程框架,结合评估理论(appraisal theory)、子空间正交化、基于回归的加权和双向因果操控技术,分离并量化嫉妒的两个心理前因——比较对象的优势性(Superiority of Comparison Person)与领域自我定义相关性(Domain Self-Definitional Relevance),进而验证其对模型判断的因果影响。实验表明,LLMs 内部以结构化的线性组合形式编码嫉妒,且其表征与人类心理学构念高度一致,同时证明了毒性情绪状态可被机械检测并精准抑制,为多智能体环境中 AI 安全的表征监控与干预提供了可行路径。

链接: https://arxiv.org/abs/2604.14593
作者: Yitong Shou,Manhao Guan
机构: Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:While Large Language Models (LLMs) demonstrate increasingly sophisticated affective capabilities, the internal mechanisms by which they process complex emotions remain unclear. Existing interpretability approaches often treat models as black boxes or focus on coarse-grained basic emotions, leaving the cognitive structure of more complex affective states underexplored. To bridge this gap, we propose a Cognitive Reverse-Engineering framework based on Representation Engineering (RepE) to analyze social-comparison jealousy. By combining appraisal theory with subspace orthogonalization, regression-based weighting, and bidirectional causal steering, we isolate and quantify two psychological antecedents of jealousy, Superiority of Comparison Person and Domain Self-Definitional Relevance, and examine their causal effects on model judgments. Experiments on eight LLMs from the Llama, Qwen, and Gemma families suggest that models natively encode jealousy as a structured linear combination of these constituent factors. Their internal representations are broadly consistent with the human psychological construct, treating Superiority as the foundational trigger and Relevance as the ultimate intensity multiplier. Our framework also demonstrates that toxic emotional states can be mechanically detected and surgically suppressed, suggesting a possible route toward representational monitoring and intervention for AI safety in multi-agent environments.

[NLP-54] Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

【速读】: 该论文旨在解决复合人工智能(Compound AI)系统中提示词优化(Prompt Optimization)效果不可预测的问题,即当前优化方法在多数情况下表现与随机猜测无异(如Claude Haiku上49%的实验结果低于零样本基线)。其关键解决方案是提出一种两阶段诊断框架:首先通过80次ANOVA预测试验判断代理提示是否存在显著交互效应(发现均不显著,p > 0.52),其次进行10分钟内的“头寸测试”(headroom test)评估任务是否具备可利用的输出结构(即模型能生成但默认不输出的格式),从而识别出值得优化的任务。此方法将原本近乎随机的提示优化过程转变为有依据的决策流程。

链接: https://arxiv.org/abs/2604.14585
作者: Xing Zhang,Guanghui Wang,Yanwei Cui,Wei Qiu,Ziyuan Li,Bing Zhu,Peiyang He
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prompt optimization in compound AI systems is statistically indistinguishable from a coin flip: across 72 optimization runs on Claude Haiku (6 methods × 4 tasks × 3 repeats), 49% score below zero-shot; on Amazon Nova Lite, the failure rate is even higher. Yet on one task, all six methods improve over zero-shot by up to +6.8 points. What distinguishes success from failure? We investigate with 18,000 grid evaluations and 144 optimization runs, testing two assumptions behind end-to-end optimization tools like TextGrad and DSPy: (A) individual prompts are worth optimizing, and (B) agent prompts interact, requiring joint optimization. Interaction effects are never significant (p > 0.52, all F < 1.0), and optimization helps only when the task has exploitable output structure – a format the model can produce but does not default to. We provide a two-stage diagnostic: an $80 ANOVA pre-test for agent coupling, and a 10-minute headroom test that predicts whether optimization is worthwhile – turning a coin flip into an informed decision.

[NLP-55] Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

【速读】: 该论文旨在解决视觉推理模型(Visual Reasoning Models, VRMs)中存在的“过度思考”问题,即模型在处理任务时常常生成冗长且不必要的推理链,导致计算资源浪费。其核心解决方案是提出一种自适应视觉推理框架(Adaptive Visual Reasoning, AVR),该框架将视觉推理过程分解为三个认知功能:视觉感知、逻辑推理和答案应用,并允许模型动态选择三种响应格式:完整格式、仅感知格式和直接答案格式。AVR通过FS-GRPO(Group Relative Policy Optimization的改进版本)进行训练,鼓励模型在保证正确性的前提下选择最高效的推理路径,从而显著减少token消耗(50–90%),尤其在感知密集型任务中表现优异。

链接: https://arxiv.org/abs/2604.14568
作者: Yixu Huang,Tinghui Zhu,Muhao Chen
机构: Fudan University (复旦大学); University of California, Davis (加州大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Visual reasoning models (VRMs) have recently shown strong cross-modal reasoning capabilities by integrating visual perception with language reasoning. However, they often suffer from overthinking, producing unnecessarily long reasoning chains for any tasks. We attribute this issue to Reasoning Path Redundancy in visual reasoning: many visual questions do not require the full reasoning process. To address this, we propose AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions: visual perception, logical reasoning, and answer application. It further enables models to dynamically choose among three response formats: Full Format, Perception-Only Format, and Direct Answer. AVR is trained with FS-GRPO, an adaptation of Group Relative Policy Optimization that encourages the model to select the most efficient reasoning format while preserving correctness. Experiments on multiple vision-language benchmarks show that AVR reduces token usage by 50–90% while maintaining overall accuracy, especially in perception-intensive tasks. These results demonstrate that adaptive visual reasoning can effectively mitigate overthinking in VRMs. Code and data are available at: this https URL.

[NLP-56] MARS2: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation ACL2026

【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在推理密集型任务(如代码生成)中因轨迹多样性有限而导致性能提升受限的问题。现有方法虽可通过结构化搜索增强探索,但受限于单一智能体策略先验;而多智能体协同虽能提供更丰富的探索信号,却通常与结构化搜索解耦。解决方案的关键在于提出MARS²(Multi-Agent Reinforced Tree-Search Scaling),一个将多个独立优化的智能体协作嵌入共享树状搜索环境中的统一RL框架。该框架将搜索树建模为可学习的多智能体交互环境,使异构智能体能在共享拓扑中协同生成和优化候选解,并引入基于树一致性奖励塑造的路径级分组优势计算机制,从而实现复杂搜索轨迹上的有效信用分配,显著提升性能表现。

链接: https://arxiv.org/abs/2604.14564
作者: Pengfei Li,Shijie Wang,Fangyuan Li,Yikun Fu,Kaifeng Liu,Kaiyan Zhang,Dazhi Zhang,Yuqiang Li,Biqing Qi,Bowen Zhou
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Innovation Institute (上海创新研究院); Harbin Institute of Technology (哈尔滨工业大学); Tsinghua University (清华大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by ACL 2026

点击查看摘要

Abstract:Reinforcement learning (RL) paradigms have demonstrated strong performance on reasoning-intensive tasks such as code generation. However, limited trajectory diversity often leads to diminishing returns, which constrains the achievable performance ceiling. Search-enhanced RL alleviates this issue by introducing structured exploration, which remains constrained by the single-agent policy priors. Meanwhile, leveraging multiple interacting policies can acquire more diverse exploratory signals, but existing approaches are typically decoupled from structured search. We propose MARS² (Multi-Agent Reinforced Tree-Search Scaling), a unified RL framework in which multiple independently-optimized agents collaborate within a shared tree-structured search environment. MARS² models the search tree as a learnable multi-agent interaction environment, enabling heterogeneous agents to collaboratively generate and refine candidate solutions within a shared search topology. To support effective learning, we introduce a path-level group advantage formulation based on tree-consistent reward shaping, which facilitates effective credit assignment across complex search trajectories. Experiments on code generation benchmarks show that MARS² consistently improves performance across diverse model combinations and training settings, demonstrating the effectiveness of coupling multi-agent collaboration with tree search for enhancing reinforcement learning. Our code is publicly available at this https URL.

[NLP-57] Dissecting Failure Dynamics in Large Language Model Reasoning ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中出现错误的机制不明确的问题,尤其是为何其推理失败往往并非随机发生,而是源于早期关键节点的偏离。研究发现,错误通常起始于少数早期状态转移点,此后推理虽保持局部一致性但全局错误;这些转移点与token级别熵的局部峰值高度相关,且同一中间状态存在可导向正确结果的替代路径。解决方案的关键在于提出GUARD框架——一个基于不确定性信号在推理时动态探测并干预这些关键转移点的机制,从而提升推理的可靠性。

链接: https://arxiv.org/abs/2604.14528
作者: Wei Zhu,Jian Zhang,Lixing Yu,Kun Yue,Zhiwen Tang
机构: Yunnan University (云南大学); Yunnan Key Laboratory of Intelligent Systems and Computing (云南省智能系统与计算重点实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by ACL 2026

点击查看摘要

Abstract:Large Language Models (LLMs) achieve strong performance through extended inference-time deliberation, yet how their reasoning failures arise remains poorly understood. By analyzing model-generated reasoning trajectories, we find that errors are not uniformly distributed but often originate from a small number of early transition points, after which reasoning remains locally coherent but globally incorrect. These transitions coincide with localized spikes in token-level entropy, and alternative continuations from the same intermediate state can still lead to correct solutions. Based on these observations, we introduce GUARD, a targeted inference-time framework that probes and redirects critical transitions using uncertainty signals. Empirical evaluations across multiple benchmarks confirm that interventions guided by these failure dynamics lead to more reliable reasoning outcomes. Our findings highlight the importance of understanding when and how reasoning first deviates, complementing existing approaches that focus on scaling inference-time computation.
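The entropy-spike signal behind GUARD can be sketched as follows; the z-score threshold and step granularity are illustrative assumptions, not the paper's exact probe:

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def flag_transitions(step_probs, z=1.5):
    """Flag steps whose entropy spikes above mean + z * std of the trajectory."""
    ents = [token_entropy(p) for p in step_probs]
    mu = sum(ents) / len(ents)
    sd = (sum((e - mu) ** 2 for e in ents) / len(ents)) ** 0.5
    return [i for i, e in enumerate(ents) if sd > 0 and e > mu + z * sd]

# A confident trajectory with one highly uncertain step in the middle.
steps = [[0.97, 0.01, 0.01, 0.01]] * 6 + [[0.25] * 4] + [[0.97, 0.01, 0.01, 0.01]] * 3
print(flag_transitions(steps))  # the uniform (high-entropy) step is flagged
```

Intervention would then branch alternative continuations from the flagged states, consistent with the paper's observation that correct solutions are still reachable from those intermediate points.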

[NLP-58] PeerPrism: Peer Evaluation Expertise vs Review-writing AI

【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 在科学同行评审中广泛应用背景下,现有大语言模型(LLM)检测方法将作者身份简化为“人类 vs. AI”二元判断的问题。这种简化忽略了现代评审流程中思想来源(idea provenance)与文本实现(text provenance)可能分离的现实,即评价性观点可能来自人类,而文字表达由AI完成,形成复杂的人机协作谱系。解决方案的关键在于提出PeerPrism——一个包含20,690条同行评审的大型基准数据集,通过受控生成范式(涵盖纯人类、纯合成及多种混合形式)系统区分思想来源与文本来源,从而评估检测模型是否真正识别出推理本质而非表面文本特征。实验表明,当前主流检测方法在混合场景下表现不稳定,常因混淆风格与语义贡献而产生矛盾分类,揭示了必须将作者身份建模为涵盖语义推理与风格实现的多维构念,而非单一二元属性。

链接: https://arxiv.org/abs/2604.14513
作者: Soroush Sadeghian,Alireza Daqiq,Radin Cheraghi,Sajad Ebrahimi,Negar Arabzadeh,Ebrahim Bagheri
机构: Reviewerly; University of Toronto (多伦多大学); UC Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used in scientific peer review, assisting with drafting, rewriting, expansion, and refinement. However, existing peer-review LLM detection methods largely treat authorship as a binary problem (human vs. AI) without accounting for the hybrid nature of modern review workflows. In practice, evaluative ideas and surface realization may originate from different sources, creating a spectrum of human-AI collaboration. In this work, we introduce PeerPrism, a large-scale benchmark of 20,690 peer reviews explicitly designed to disentangle idea provenance from text provenance. We construct controlled generation regimes spanning fully human, fully synthetic, and multiple hybrid transformations. This design enables systematic evaluation of whether detectors identify the origin of the surface text or the origin of the evaluative reasoning. We benchmark state-of-the-art LLM text detection methods on PeerPrism. While several methods achieve high accuracy on the standard binary task (human vs. fully synthetic), their predictions diverge sharply under hybrid regimes. In particular, when ideas originate from humans but the surface text is AI-generated, detectors frequently disagree and produce contradictory classifications. Accompanied by stylometric and semantic analyses, our results show that current detection methods conflate surface realization with intellectual contribution. Overall, we demonstrate that LLM detection in peer review cannot be reduced to a binary attribution problem. Instead, authorship must be modeled as a multidimensional construct spanning semantic reasoning and stylistic realization. PeerPrism is the first benchmark evaluating human-AI collaboration in these settings. We release all code, data, prompts, and evaluation scripts to facilitate reproducible research at this https URL.
Related DOI: https://doi.org/10.1145/3805712.3808602

[NLP-59] CobwebTM: Probabilistic Concept Formation for Lifelong and Hierarchical Topic Modeling

【速读】: 该论文旨在解决传统主题模型在面对流式数据时的适应性不足与神经方法在持续学习中易出现灾难性遗忘及固定容量限制的问题。其解决方案的关键在于提出一种低参数量的终身层次主题模型 \textsc{CobwebTM},通过将 Cobweb 算法适配到连续文档嵌入空间,实现基于增量符号概念形成的在线语义层次构建,从而无需预定义主题数量即可完成无监督主题发现、动态主题生成和层次组织,有效结合了预训练表示与增量符号概念形成的优势。

链接: https://arxiv.org/abs/2604.14489
作者: Karthik Singaravadivelan,Anant Gupta,Zekun Wang,Christopher MacLellan
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL)
备注: 16 pages, 8 figures, 11 tables

点击查看摘要

Abstract:Topic modeling seeks to uncover latent semantic structure in text corpora with minimal supervision. Neural approaches achieve strong performance but require extensive tuning and struggle with lifelong learning due to catastrophic forgetting and fixed capacity, while classical probabilistic models lack flexibility and adaptability to streaming data. We introduce CobwebTM, a low-parameter lifelong hierarchical topic model based on incremental probabilistic concept formation. By adapting the Cobweb algorithm to continuous document embeddings, CobwebTM constructs semantic hierarchies online, enabling unsupervised topic discovery, dynamic topic creation, and hierarchical organization without predefining the number of topics. Across diverse datasets, CobwebTM achieves strong topic coherence, stable topics over time, and high-quality hierarchies, demonstrating that incremental symbolic concept formation combined with pretrained representations is an efficient approach to topic modeling.
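A minimal sketch of incremental topic assignment over document embeddings in the spirit of CobwebTM. Cobweb proper maintains a probabilistic concept hierarchy; the flat similarity-threshold rule, running-mean centroid update, and tau value below are simplifying assumptions:

```python
import numpy as np

def online_topics(embeddings, tau=0.8):
    """Incremental assignment: join the nearest existing topic when similar
    enough, otherwise open a new one (no preset topic count)."""
    cents, counts, labels = [], [], []
    for e in embeddings:
        e = e / np.linalg.norm(e)
        sims = [float(e @ c) for c in cents]
        if sims and max(sims) >= tau:
            k = int(np.argmax(sims))
            counts[k] += 1
            cents[k] = cents[k] + (e - cents[k]) / counts[k]  # running mean
            cents[k] = cents[k] / np.linalg.norm(cents[k])
        else:
            cents.append(e); counts.append(1); k = len(cents) - 1
        labels.append(k)
    return labels, cents

# Two clusters of toy document embeddings arriving as a stream.
docs = np.array([[1.0, 0.0], [0.98, 0.1], [0.0, 1.0], [0.1, 0.95]])
labels, cents = online_topics(docs)
print(labels)  # two topics emerge dynamically
```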

[NLP-60] Psychological Steering of Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在心理特质引导(psychological steering)中干预效果受限的问题,尤其是现有方法因搜索空间受限和激活空间单位未校准而导致难以找到最优干预条件。其解决方案的关键在于提出一种基于心理学的引导框架,通过使用语义校准的残差流注入(residual-stream injections)实现无界且受流畅性约束的参数扫描,其中核心创新是利用IPIP-NEO-120量表对OCEAN人格维度进行校准,并引入均值差(mean-difference, MD)注入方法。实验表明,MD注入在14个LLM中有11个优于主流的个性提示(Personality Prompting, P²)方法,且与线性表示假设一致,提供了可信赖的心理特质控制机制,同时揭示了模型学习表征与人类心理学之间仍存在差距。

链接: https://arxiv.org/abs/2604.14463
作者: Leonardo Blas,Robin Jia,Emilio Ferrara
机构: 未知
类目: Computation and Language (cs.CL)
备注: 66 pages, 60 images

点击查看摘要

Abstract:Large language models (LLMs) emulate a consistent human-like behavior that can be shaped through activation-level interventions. This paradigm is converging on additive residual-stream injections, which rely on injection-strength sweeps to approximate optimal intervention settings. However, existing methods restrict the search space and sweep in uncalibrated activation-space units, potentially missing optimal intervention conditions. Thus, we introduce a psychological steering framework that performs unbounded, fluency-constrained sweeps in semantically calibrated units. Our method derives and calibrates residual-stream injections using psychological artifacts, and we use the IPIP-NEO-120, which measures the OCEAN personality model, to compare six injection methods. We find that mean-difference (MD) injections outperform Personality Prompting (P^2), an established baseline for OCEAN steering, in open-ended generation in 11 of 14 LLMs, with gains of 3.6% to 16.4%, overturning prior reports favoring prompting and positioning representation engineering as a new frontier in open-ended psychological steering. Further, we find that a hybrid of P^2 and MD injections outperforms both methods in 13 of 14 LLMs, with gains over P^2 ranging from 5.6% to 21.9% and from 3.3% to 26.7% over MD injections. Finally, we show that MD injections align with the Linear Representation Hypothesis and provide reliable, approximately linear control knobs for psychological steering. Nevertheless, they also induce OCEAN trait covariance patterns that depart from the Big Two model, suggesting a gap between learned representations and human psychology.
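A mean-difference (MD) injection can be sketched in a few lines: take the gap between mean activations on trait-positive and trait-negative prompts and add it, scaled, to the residual stream. The trait axis, synthetic activations, and strength below are toy assumptions:

```python
import numpy as np

def mean_difference_vector(pos_acts, neg_acts):
    """Steering direction: mean activation gap between trait-positive and
    trait-negative prompts (one row per prompt, one column per unit)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def inject(hidden, direction, strength):
    """Additive residual-stream injection at a chosen strength."""
    return hidden + strength * direction

rng = np.random.default_rng(0)
d = 8
trait = np.zeros(d); trait[0] = 1.0           # hypothetical trait axis
pos = rng.normal(size=(32, d)) + 2.0 * trait  # high-trait activations
neg = rng.normal(size=(32, d)) - 2.0 * trait  # low-trait activations
md = mean_difference_vector(pos, neg)
h = rng.normal(size=d)
steered = inject(h, md / np.linalg.norm(md), strength=3.0)
print(steered[0] - h[0])  # positive: pushed along the trait axis
```

The injection-strength sweep the paper describes would vary `strength` while monitoring fluency.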

[NLP-61] Filling in the Mechanisms: How do LMs Learn Filler-Gap Dependencies under Developmental Constraints?

【速读】: 该论文旨在解决语言模型(Language Models, LMs)在有限训练数据下是否能习得跨句法结构的填充-空位依赖(filler-gap dependencies)的共享表征问题,这与人类语言习得中此类依赖的普遍性机制相对应。其解决方案的关键在于应用分布式对齐搜索(Distributed Alignment Search, DAS)方法,评估不同训练规模的语言模型在Wh疑问句与话题化结构之间是否存在可迁移的表征,尽管这两种结构在输入频率上差异显著。结果表明,即使在发展可行的数据量下,模型仍可能形成共享但具有项目敏感性的机制,但其学习效率远低于人类,凸显了在语言习得模型中引入语言特异性偏置的必要性。

链接: https://arxiv.org/abs/2604.14459
作者: Atrey Desai,Sathvik Nair
机构: 未知
类目: Computation and Language (cs.CL)
备注: To be published in the 64th Annual Meeting of the Association for Computational Linguistics

点击查看摘要

Abstract:For humans, filler-gap dependencies require a shared representation across different syntactic constructions. Although causal analyses suggest this may also be true for LLMs (Boguraev et al., 2025), it is still unclear if such a representation also exists for language models trained on developmentally feasible quantities of data. We applied Distributed Alignment Search (DAS, Geiger et al. (2024)) to LMs trained on varying amounts of data from the BabyLM challenge (Warstadt et al., 2023), to evaluate whether representations of filler-gap dependencies transfer between wh-questions and topicalization, which greatly vary in terms of their input frequency. Our results suggest shared, yet item-sensitive mechanisms may develop with limited training data. More importantly, LMs still require far more data than humans to learn comparable generalizations, highlighting the need for language-specific biases in models of language acquisition.

[NLP-62] MARCA: A Checklist-Based Benchmark for Multilingual Web Search

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多语言环境下进行基于网络的信息检索与合成能力评估不足的问题,尤其是葡萄牙语场景长期被忽视。其解决方案的关键在于构建了一个名为\textscMARCA的双语(英语和葡萄牙语)基准测试集,包含52个由人工编写、涉及多个实体的复杂问题,并配以人工验证的检查表式评分标准,明确衡量答案的完整性和正确性。此外,研究设计了两种交互框架:基础框架(Basic)直接进行网络搜索与抓取,以及编排器框架(Orchestrator),通过子代理任务分解提升信息获取效率。实验结果表明,该基准能够有效揭示模型在跨语言迁移中的性能差异及编排机制对覆盖率的改善作用。

链接: https://arxiv.org/abs/2604.14448
作者: Thales Sales Almeida,Giovana Kerche Bonás,Ramon Pires,Celio Larcher,Hugo Abonizio,Marcos Piau,Roseval Malaquias Junior,Rodrigo Nogueira,Thiago Laitz
机构: Maritaca AI; Jusbrasil
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used as sources of information, yet their reliability depends on the ability to search the web, select relevant evidence, and synthesize complete answers. While recent benchmarks evaluate web-browsing and agentic tool use, multilingual settings, and Portuguese in particular, remain underexplored. We present \textscMARCA, a bilingual (English and Portuguese) benchmark for evaluating LLMs on web-based information seeking. \textscMARCA consists of 52 manually authored multi-entity questions, paired with manually validated checklist-style rubrics that explicitly measure answer completeness and correctness. We evaluate 14 models under two interaction settings: a Basic framework with direct web search and scraping, and an Orchestrator framework that enables task decomposition via delegated subagents. To capture stochasticity, each question is executed multiple times and performance is reported with run-level uncertainty. Across models, we observe large performance differences, find that orchestration often improves coverage, and identify substantial variability in how models transfer from English to Portuguese. The benchmark is available at this https URL

[NLP-63] Hierarchical vs. Flat Iteration in Shared-Weight Transformers

【速读】: 该论文旨在解决Transformer架构中独立堆叠层(independent-layer stacking)与共享权重的分层递归结构(hierarchically structured, shared-weight recurrence)在表征质量上的差异问题。其核心目标是验证是否可以通过设计一种双速递归机制来实现与传统多层Transformer相当甚至更优的表示能力。解决方案的关键在于提出HRM-LM模型,该模型用一个两速递归对(Fast模块每步运行用于局部精炼,Slow模块每T步运行用于全局压缩)替代原有的L个独立Transformer层,并将此递归结构展开M = N × T步,同时共享参数。实验表明,在参数量匹配的情况下,该方案显著落后于独立堆叠结构,揭示了当前递归结构在表征能力上的局限性。

链接: https://arxiv.org/abs/2604.14442
作者: Sang-Il Han
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present an empirical study of whether hierarchically structured, shared-weight recurrence can match the representational quality of independent-layer stacking in a Transformer-based language model. HRM-LM replaces L independent Transformer layers with a two-speed recurrent pair: a Fast module operating at every step for local refinement, and a Slow module operating every T steps for global compression. This recurrent hierarchy is unrolled for M = N x T steps with shared parameters. The central and most robust finding, supported by a parameter-matched Universal Transformer ablation (UniTF, 1.2B) across five independent runs, is a sharp empirical gap between the two approaches.
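The two-speed unroll can be sketched directly from the description: the Fast module runs every step, the Slow module every T steps, with shared weights across all M = N x T steps. The toy modules below are stand-ins, not the paper's architecture:

```python
import numpy as np

def unroll(x, fast, slow, N=4, T=3):
    """Shared-weight two-speed unroll for M = N * T steps."""
    for step in range(N * T):
        x = fast(x)                   # local refinement at every step
        if (step + 1) % T == 0:
            x = slow(x)               # global compression every T steps
    return x

# Toy stand-ins for the shared-weight Fast/Slow modules (assumed).
Wf, Ws = np.eye(4) * 0.9, np.eye(4) * 0.5
out = unroll(np.ones(4), lambda h: np.tanh(Wf @ h), lambda h: Ws @ h)
print(out.shape)  # (4,)
```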

[NLP-64] Three-Phase Transformer

【速读】: 该论文旨在解决Decoder-only Transformer架构中因位置编码(如RoPE)与残差流结构之间缺乏几何一致性而导致的训练不稳定性和收敛效率低下问题。其核心解决方案是提出Three-Phase Transformer (3PT),关键在于引入一个基于通道分区的残差流结构先验:将隐藏向量等分为N个循环通道,每个通道通过相位保持操作(per-channel RMSNorm、注意力与前馈层之间的2D Givens旋转、GQA头数约束)维持局部几何稳定;同时在正交于各通道的一维直流(DC)子空间中注入固定Gabriel’s horn位置信号作为绝对位置侧通道,实现与RoPE相对位置旋转的正交组合。该设计形成一种自稳定平衡机制,无需显式约束即可维持网络内部几何结构,从而显著提升训练收敛速度与性能表现。

链接: https://arxiv.org/abs/2604.14430
作者: Mohammad R. Abu Ayyash
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 48 pages, 20 figures, 23 tables. Code: this https URL

点击查看摘要

Abstract:We present Three-Phase Transformer (3PT), a residual-stream structural prior for decoder-only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally-sized cyclic channels, each maintained by phase-respecting ops: a per-channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates each channel by theta + i*(2*pi/N), and a head-count constraint aligning GQA heads with the partition. The architecture is a self-stabilizing equilibrium between scrambling and re-imposition, not a bolted-on module. The partition carves out a one-dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel’s horn profile r(p) = 1/(p+1) as an absolute-position side-channel composing orthogonally with RoPE’s relative-position rotation. The canonical N=3 borrows its metaphor from balanced three-phase AC, where three sinusoids 120 degrees apart sum to zero with no anti-correlated pair. At 123M parameters on WikiText-103, 3PT achieves -7.20% perplexity (-2.62% bits-per-byte) over a matched RoPE-Only baseline at +1,536 parameters (0.00124% of total), with 1.93x step-count convergence speedup (1.64x wall-clock). N behaves as a parameter-sharing knob rather than a unique optimum: at 5.5M an N-sweep over 1,2,3,4,6,8,12 is near-monotone with N=1 winning; at 123M a three-seed sweep finds N=3 and N=1 statistically indistinguishable. The load-bearing mechanism is the channel-partitioned residual stream, per-block rotation, per-phase normalization, and horn DC injection. We characterize (a) self-stabilization of the geometry without explicit enforcement, a novel instance of the conservation-law framework for neural networks; (b) a U-shaped depth profile of rotation-angle drift at 12 layers; (c) orthogonal composition with RoPE, attention, and FFN.
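Two of 3PT's phase-respecting operations are easy to sketch: a 2D Givens rotation applied within a channel, and the Gabriel's horn position profile r(p) = 1/(p+1). The channel width, theta = 0.1, and the pairing of adjacent dimensions below are illustrative assumptions:

```python
import numpy as np

def givens_rotate(channel, theta):
    """Rotate each consecutive 2D pair inside a channel by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    pairs = channel.reshape(-1, 2)
    return (pairs @ np.array([[c, s], [-s, c]])).reshape(-1)

def horn_profile(positions):
    """Gabriel's horn absolute-position profile r(p) = 1/(p+1)."""
    return 1.0 / (np.asarray(positions) + 1.0)

N = 3
hidden = np.arange(12, dtype=float)   # toy hidden vector: 3 channels of width 4
channels = hidden.reshape(N, -1)
rotated = np.concatenate([givens_rotate(ch, 0.1 + i * 2 * np.pi / N)
                          for i, ch in enumerate(channels)])
print(np.allclose(np.linalg.norm(rotated), np.linalg.norm(hidden)))  # True
print(horn_profile(range(4)))  # [1.    0.5   0.333 0.25]
```

Because Givens rotations are orthogonal, the per-channel rotation preserves the norm of the residual stream, consistent with the self-stabilization claim.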

[NLP-65] The Autocorrelation Blind Spot: Why 42% of Turn-Level Findings in LLM Conversation Analysis May Be Spurious

【速读】: 该论文旨在解决多轮人机对话(multi-turn human-LLM conversations)中turn-level指标评估时因时间序列自相关性(autocorrelation)导致的统计推断偏差问题。现有评估流程普遍忽略对话轮次之间的非独立性,从而高估显著性水平,造成假阳性结果。解决方案的关键在于提出一种两阶段校正框架:首先采用Chelton(1983)的有效自由度方法量化自相关强度,再结合对话级别的块自举法(conversation-level block bootstrap)进行稳健的统计推断,有效控制了因时间依赖性带来的误报风险,并在预注册验证集上实现了57%的复现率,显著优于未校正的池化方法(30%)。

链接: https://arxiv.org/abs/2604.14414
作者: Ferdinand M. Schessl
机构: Independent Researcher
类目: Computation and Language (cs.CL)
备注: 14 pages, 3 figures, 5 tables, 1 algorithm. Code and synthetic demonstration data: this https URL

点击查看摘要

Abstract:Turn-level metrics are widely used to evaluate properties of multi-turn human-LLM conversations, from safety and sycophancy to dialogue quality. However, consecutive turns within a conversation are not statistically independent – a fact that virtually all current evaluation pipelines fail to correct for in their statistical inference. We systematically characterize the autocorrelation structure of 66 turn-level metrics across 202 multi-turn conversations (11,639 turn pairs, 5 German-speaking users, 4 LLM platforms) and demonstrate that naive pooled analysis produces severely inflated significance estimates: 42% of associations that appear significant under standard pooled testing fail to survive cluster-robust correction. The inflation varies substantially across categories rather than scaling linearly with autocorrelation: three memoryless families (embedding velocity, directional, differential) aggregate to 14%, while the seven non-memoryless families (thermo-cycle, frame distance, lexical/structural, rolling windows, cumulative, interaction, timestamp) aggregate to 33%, with individual category rates ranging from 0% to 100% depending on per-family effect size. We present a two-stage correction framework combining Chelton (1983) effective degrees of freedom with conversation-level block bootstrap, and validate it on a pre-registered hold-out split where cluster-robust metrics replicate at 57% versus 30% for pooled-only metrics. We provide concrete design principles, a publication checklist, and open-source code for the correction pipeline. A survey of ~30 recent papers at major NLP and AI venues that compute turn-level statistics in LLM evaluations finds that only 4 address temporal dependence at all, and 26 do not correct for it.
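The two-stage correction can be sketched under a common AR(1) reading of the Chelton-style effective sample size, N_eff = N*(1 - r1)/(1 + r1), combined with resampling whole conversations as blocks; the paper's exact estimator may differ:

```python
import random

def lag1_autocorr(x):
    """Lag-1 autocorrelation of a turn-level metric series."""
    n = len(x)
    mu = sum(x) / n
    num = sum((x[i] - mu) * (x[i + 1] - mu) for i in range(n - 1))
    den = sum((v - mu) ** 2 for v in x)
    return num / den

def effective_n(x):
    """AR(1)-style effective sample size: N_eff = N * (1 - r1) / (1 + r1)."""
    r1 = lag1_autocorr(x)
    return len(x) * (1 - r1) / (1 + r1)

def conversation_bootstrap(conversations, stat, reps=200, seed=0):
    """Resample whole conversations (blocks), never individual turns."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        sample = [rng.choice(conversations) for _ in conversations]
        draws.append(stat([t for conv in sample for t in conv]))
    return draws

x = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # strongly autocorrelated turn metric
print(round(effective_n(x), 2), "effective observations out of", len(x))
```

The gap between N_eff and N is exactly what inflates significance in naive pooled tests.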

[NLP-66] Generating Concept Lexicalizations via Dictionary-Based Cross-Lingual Sense Projection

【速读】: 该论文旨在解决如何自动将WordNet风格的词汇资源扩展到新语言的问题,核心挑战在于为目标语言中的词元(lemma)生成准确的语义义项(sense)。解决方案的关键在于提出一种“投影-过滤”(project-and-filter)策略:首先通过语义投影将源语言(英语)的同义词集(synset)映射到目标语言的对齐词元上,再利用双语词典作为外部知识源来增强预训练对齐器并筛选掉错误的语义投影,从而在保持方法可解释性的同时提升精度,且仅需少量外部资源。

链接: https://arxiv.org/abs/2604.14397
作者: David Basil,Chirooth Girigowda,Bradley Hauer,Sahir Momin,Ning Shi,Grzegorz Kondrak
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To be published in the proceedings of Canadian AI 2026

点击查看摘要

Abstract:We study the task of automatically expanding WordNet-style lexical resources to new languages through sense generation. We generate senses by associating target-language lemmas with existing lexical concepts via semantic projection. Given a sense-tagged English corpus and its translation, our method projects English synsets onto aligned target-language tokens and assigns the corresponding lemmas to those synsets. To generate these alignments and ensure their quality, we augment a pre-trained base aligner with a bilingual dictionary, which is also used to filter out incorrect sense projections. We evaluate the method on multiple languages, comparing it to prior methods, as well as dictionary-based and large language model baselines. Results show that the proposed project-and-filter strategy improves precision while remaining interpretable and requiring few external resources. We plan to make our code, documentation, and generated sense inventories accessible.
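The project-and-filter strategy reduces to a small loop: project each aligned English token's synset onto the target lemma, keeping only pairs the bilingual dictionary supports. The toy lemmas and dictionary below are hypothetical:

```python
def project_and_filter(alignments, en_senses, bilingual_dict):
    """Project English synsets onto aligned target lemmas, keeping only
    pairs supported by the bilingual dictionary."""
    senses = {}
    for en_lemma, tgt_lemma in alignments:
        synset = en_senses.get(en_lemma)
        if synset and tgt_lemma in bilingual_dict.get(en_lemma, set()):
            senses.setdefault(synset, set()).add(tgt_lemma)
    return senses

# Hypothetical data: one dictionary-supported alignment, one noisy one.
alignments = [("bank", "banco"), ("bank", "rio")]
en_senses = {"bank": "bank.n.01"}
bilingual_dict = {"bank": {"banco", "orilla"}}
print(project_and_filter(alignments, en_senses, bilingual_dict))
```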

[NLP-67] BiCon-Gate: Consistency-Gated De-colloquialisation for Dialogue Fact-Checking

【速读】: 该论文旨在解决对话场景中自动事实核查(fact-checking)所面临的挑战,即多轮对话中频繁出现的非正式语言(colloquial language)尚未得到充分研究与处理。为应对这一问题,作者提出了一种分阶段去口语化(de-colloquialisation)策略,通过轻量级表面归一化与限定范围内的句内指代消解(in-claim coreference resolution)生成保守的改写候选句。其解决方案的关键创新在于引入BiCon-Gate——一种语义感知的一致性门控机制(consistency gate),该机制仅在改写候选句被对话上下文语义支持时才选择使用,否则回退至原始声明。此门控机制显著提升了下游事实核查任务的稳定性,并在证据检索和事实验证两个环节均取得性能提升,尤其在“SUPPORTS”类别上表现突出。

链接: https://arxiv.org/abs/2604.14389
作者: Hyunkyung Park,Arkaitz Zubiaga
机构: Queen Mary University of London (伦敦玛丽女王大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 7 figures. Published in FEVER 2026

点击查看摘要

Abstract:Automated fact-checking in dialogue involves multi-turn conversations where colloquial language is frequent yet understudied. To address this gap, we propose a conservative rewrite candidate for each response claim via staged de-colloquialisation, combining lightweight surface normalisation with scoped in-claim coreference resolution. We then introduce BiCon-Gate, a semantics-aware consistency gate that selects the rewrite candidate only when it is semantically supported by the dialogue context, otherwise falling back to the original claim. This gated selection stabilises downstream fact-checking and yields gains in both evidence retrieval and fact verification. On the DialFact benchmark, our approach improves retrieval and verification, with particularly strong gains on SUPPORTS, and outperforms competitive baselines, including a decoder-based one-shot LLM rewrite that attempts to perform all de-colloquialisation steps in a single pass.
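The gating idea can be sketched with a cosine-similarity check between the rewrite candidate and the dialogue context; BiCon-Gate's actual consistency model is richer, so the embedding-threshold rule and tau value below are simplifying assumptions:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def consistency_gate(claim, rewrite, rewrite_vec, context_vec, tau=0.8):
    """Select the rewrite only when its embedding is consistent with the
    dialogue context; otherwise fall back to the original claim."""
    return rewrite if cos(rewrite_vec, context_vec) >= tau else claim

ctx = np.array([1.0, 0.0, 0.0])
good = np.array([0.95, 0.2, 0.0])  # rewrite embedding stays on-topic
bad = np.array([0.1, 1.0, 0.3])    # rewrite embedding drifts from context
print(consistency_gate("orig claim", "clean rewrite", good, ctx))
print(consistency_gate("orig claim", "clean rewrite", bad, ctx))
```

The fallback branch is what stabilizes downstream fact-checking: an unsupported rewrite never replaces the original claim.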

[NLP-68] he Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

【速读】: 该论文旨在解决多模态语言模型在视觉感知任务中系统性表现不佳的问题,尤其是其背后模态依赖结构不明确的难题。解决方案的关键在于提出“中心点替换”(centroid replacement)方法,通过将每个token映射到最近的K-means中心点来可控地剥离文本与视觉特征的结构信息,从而量化不同模态对模型性能的贡献差异。实验发现,移除文本中心点结构导致的准确率下降是视觉中心点结构的4倍,揭示了语言表示在多模态推理中存在普遍性的主导效应;进一步利用这一不对称性,采用文本中心点对比解码(text centroid contrastive decoding)策略,在不重新训练的前提下显著提升特定任务准确率(最高达+16.9%),且该干预效果与训练方式密切相关,为识别和修正模态竞争提供了可操作的诊断信号与推理时校正机制。

链接: https://arxiv.org/abs/2604.14363
作者: Akshay Paruchuri,Ishan Chatterjee,Henry Fuchs,Ehsan Adeli,Piotr Didyk
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 9 figures, 19 tables

点击查看摘要

Abstract:Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4x more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. This intervention varies meaningfully with training approaches: standard fine-tuned models show larger gains (+5.6% on average) than preference-optimized models (+1.5% on average). Our findings suggest that modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.
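Centroid replacement itself is a one-liner once centroids exist: snap every token embedding to its nearest K-means center. The 2D centroids below stand in for fitted K-means centers:

```python
import numpy as np

def centroid_replace(tokens, centroids):
    """Collapse each token embedding to its nearest centroid (L2 distance)."""
    d2 = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids[d2.argmin(axis=1)]

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])   # toy K-means centers
tokens = np.array([[0.4, -0.2], [9.0, 11.0], [0.1, 0.3]])
erased = centroid_replace(tokens, centroids)
print(erased)  # each token snapped to its nearest center
```

Text centroid contrastive decoding then contrasts the model's logits against a reference run whose text tokens were erased this way.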

[NLP-69] When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden

【速读】: 该论文旨在解决多囊卵巢综合征(Polycystic Ovary Syndrome, PCOS)女性在社交媒体中面临的身体形象困扰、进食障碍和代谢挑战这“三重负担”的自动识别问题,现有自然语言处理方法因缺乏透明性且无法识别共病表现而难以有效支持临床筛查。其解决方案的关键在于开发小型开源语言模型,并采用低秩适配(Low-Rank Adaptation, LoRA)进行微调,以生成基于文本证据的结构化解释,从而实现对PCOS相关心理与生理问题的可解释性检测,尤其在共病识别上表现出较强鲁棒性,适合用于初步筛查而非独立诊断。

链接: https://arxiv.org/abs/2604.14356
作者: Apoorv Prasad,Susan McRoy
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic challenges, yet existing natural language processing approaches for detecting these conditions lack transparency and cannot identify co-occurring presentations. We developed small, open-source language models to automatically detect this triple burden in social media posts with grounded explainability. We collected 1,000 PCOS-related posts from six subreddits, with two trained annotators labeling posts using guidelines operationalizing Lee et al. (2017) clinical framework. Three models (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) were fine-tuned using Low-Rank Adaptation to generate structured explanations with textual evidence. The best model achieved 75.3 percent exact match accuracy on 150 held-out posts, with robust comorbidity detection and strong explainability. Performance declined with diagnostic complexity, indicating their best use is for screening rather than autonomous diagnosis.
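Low-Rank Adaptation, the fine-tuning method used here, keeps the pretrained weight W frozen and trains only a low-rank update: y = Wx + (alpha/r) * B(Ax), with B zero-initialized so the adapter starts as a no-op. A minimal numpy sketch (shapes and alpha are illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=4):
    """LoRA forward pass: frozen W plus a scaled low-rank correction."""
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(1)
d_out, d_in, r = 6, 5, 2
W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # low-rank down-projection (trained)
B = np.zeros((d_out, r))               # zero-init: adapter starts as no-op
x = rng.normal(size=d_in)
print(np.allclose(lora_forward(x, W, A, B, r=r), W @ x))  # True before training
```

Only A and B are updated during fine-tuning, which is why the method suits small, open-source models.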

[NLP-70] Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本理解任务中因位置敏感性导致的鲁棒性不足问题,即模型性能对关键信息在输入序列中的绝对位置高度依赖,表现出显著的位置方差(positional variance),即便在控制任务格式和难度的情况下依然如此。解决方案的关键在于提出一种名为RoPE-Perturbed Self-Distillation(RoPE扰动自蒸馏)的训练正则化方法:通过扰动RoPE(Rotary Position Embedding)索引生成同一训练序列的不同“视角”,并利用自蒸馏机制迫使模型在不同视角下输出一致预测,从而促使模型更依赖语义信号而非脆弱的位置线索,提升其在长上下文场景下的位置鲁棒性和长度外推能力。

链接: https://arxiv.org/abs/2604.14339
作者: Zichong Li,Chen Liang,Liliang Ren,Tuo Zhao,Yelong Shen,Weizhu Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) increasingly operate in settings that require reliable long-context understanding, such as retrieval-augmented generation and multi-document reasoning. A common strategy is to fine-tune pretrained short-context models at the target sequence length. However, we find that standard long-context adaptation can remain brittle: model accuracy depends strongly on the absolute placement of relevant evidence, exhibiting high positional variance even when controlling for task format and difficulty. We propose RoPE-Perturbed Self-Distillation, a training regularizer that improves positional robustness. The core idea is to form alternative “views” of the same training sequence by perturbing its RoPE indices – effectively moving parts of the context to different positions – and to train the model to produce consistent predictions across views via self-distillation. This encourages reliance on semantic signals instead of brittle position dependencies. Experiments on long-context adaptation of Llama-3-8B and Qwen-3-4B demonstrate consistent gains on long-context benchmarks, including up to 12.04% improvement on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B after SFT, alongside improved length extrapolation beyond the training context window.
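The perturbation-plus-consistency idea can be sketched as: build a second "view" by offsetting RoPE position indices for part of the sequence, then penalize divergence between the model's predictions under the two views. The offset, split point, and toy distributions below are illustrative:

```python
import numpy as np

def perturb_positions(pos_ids, offset, start):
    """Shift RoPE position indices from `start` onward: the same tokens,
    viewed as if they sat at a different absolute position."""
    out = np.array(pos_ids, dtype=int)
    out[start:] += offset
    return out

def kl(p, q, eps=1e-12):
    """KL divergence, used as the self-distillation consistency penalty."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

view = perturb_positions(np.arange(8), offset=100, start=4)
print(view)  # [  0   1   2   3 104 105 106 107]

p_orig = np.array([0.7, 0.2, 0.1])    # prediction under original positions
p_view = np.array([0.6, 0.25, 0.15])  # prediction under the perturbed view
print(kl(p_orig, p_view) > 0)         # nonzero penalty until views agree
```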

[NLP-71] Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在决策过程中缺乏可解释性的问题,特别是现有基于文本的后验解释(post-hoc text-based explanations)往往仅在主观上显得合理,而未真正反映模型内部依赖的实际证据,即缺乏认知上的忠实性(epistemic faithfulness)。解决方案的关键在于提出一种无需训练的方法,通过注意力层面的干预来引导解释生成过程,该干预基于使用可信归因方法(faithful attribution method)提取的词元级热力图(token-level heatmaps),从而显著提升不同模型、基准和提示下解释的认知忠实性。

链接: https://arxiv.org/abs/2604.14325
作者: Bar Alon,Itamar Zimerman,Lior Wolf
机构: Blavatnik School of Computer Science, Tel Aviv University (特拉维夫大学布莱瓦特尼克计算机科学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 24 pages, multiple figures (e.g., at least 6 main figures), includes experiments across several benchmarks (MMLU, CommonsenseQA, SciQ, ARC, OpenBookQA); code available on GitHub

点击查看摘要

Abstract:Large language models (LLMs) achieve strong performance and have revolutionized NLP, but their lack of explainability keeps them treated as black boxes, limiting their use in domains that demand transparency and trust. A promising direction to address this issue is post-hoc text-based explanations, which aim to explain model decisions in natural language. Prior work has focused on generating convincing rationales that appear to be subjectively faithful, but it remains unclear whether these explanations are epistemically faithful, i.e., whether they reflect the internal evidence the model actually relied on for its decision. In this paper, we first assess the epistemic faithfulness of LLM-generated explanations via counterfactuals and show that they are often unfaithful. We then introduce a training-free method that enhances faithfulness by guiding explanation generation through attention-level interventions, informed by token-level heatmaps extracted via a faithful attribution method. This method significantly improves epistemic faithfulness across multiple models, benchmarks, and prompts.

[NLP-72] Purging the Gray Zone: Latent-Geometric Denoising for Precise Knowledge Boundary Awareness ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成回答时因无法准确界定自身知识边界而产生的幻觉(hallucination)问题。现有放弃回答(abstention)微调方法通常直接依据响应准确性对数据集进行划分,导致模型在决策边界附近遭受严重标签噪声,从而引发过高放弃率或幻觉现象。解决方案的关键在于引入潜在空间表示视角,识别出决策超平面附近的“灰色区域”(gray zone),该区域内部信念模糊是性能瓶颈的核心原因;进而提出GeoDe(Geometric Denoising)框架,通过线性探测构建真理超平面,并利用几何距离作为置信度信号进行“几何去噪”,有效过滤边界模糊样本并保留高保真微调信号,显著提升模型真实性与分布外(OOD)泛化能力。

链接: https://arxiv.org/abs/2604.14324
作者: Hao An,Yibin Lou,Jiayi Guo,Yang Xu
机构: Southern University of Science and Technology (南方科技大学)
类目: Computation and Language (cs.CL)
备注: ACL 2026 Findings

点击查看摘要

Abstract:Large language models (LLMs) often exhibit hallucinations due to their inability to accurately perceive their own knowledge boundaries. Existing abstention fine-tuning methods typically partition datasets directly based on response accuracy, causing models to suffer from severe label noise near the decision boundaries and consequently exhibit high rates of abstentions or hallucinations. This paper adopts a latent space representation perspective, revealing a “gray zone” near the decision hyperplane where internal belief ambiguity constitutes the core performance bottleneck. Based on this insight, we propose the GeoDe (Geometric Denoising) framework for abstention fine-tuning. This method constructs a truth hyperplane using linear probes and performs “geometric denoising” by employing geometric distance as a confidence signal for abstention decisions. This approach filters out ambiguous boundary samples while retaining high-fidelity signals for fine-tuning. Experiments across multiple models (Llama3, Qwen3) and benchmark datasets (TriviaQA, NQ, SciQ, SimpleQA) demonstrate that GeoDe significantly enhances model truthfulness and demonstrates strong generalization in out-of-distribution (OOD) scenarios. Code is available at this https URL.
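Geometric denoising can be sketched with a linear probe (w, b): compute each sample's signed distance to the truth hyperplane and drop the gray zone of samples inside a margin. The probe weights, margin, and 2D features below are toy assumptions:

```python
import numpy as np

def hyperplane_distance(X, w, b):
    """Signed geometric distance of each representation to the probe's
    truth hyperplane w.x + b = 0."""
    return (X @ w + b) / np.linalg.norm(w)

def denoise(X, labels, w, b, margin=1.0):
    """Keep only samples outside the gray zone around the hyperplane."""
    d = hyperplane_distance(X, w, b)
    keep = np.abs(d) > margin
    return X[keep], labels[keep]

w, b = np.array([1.0, 0.0]), 0.0                     # toy probe weights
X = np.array([[3.0, 1.0], [-2.5, 0.2], [0.3, 5.0]])  # last sample hugs the boundary
y = np.array([1, 0, 1])
Xf, yf = denoise(X, y, w, b)
print(len(Xf))  # the ambiguous boundary sample is filtered out
```

Fine-tuning for abstention then proceeds on the filtered, high-fidelity subset.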

[NLP-73] LLM Predictive Scoring and Validation: Inferring Experience Ratings from Unstructured Text

【速读】: 该论文旨在解决如何从球迷自发撰写的开放式文本中准确预测其对观赛体验的整体评分问题。解决方案的关键在于利用生成式 AI(Generative AI)模型 GPT-4.1,仅基于单条开放题回答文本,通过简单未优化的提示词(prompt)实现对体验评分的定向预测。结果显示,预测值与自评分数高度一致(67% within ±1),且跨三次独立评分具有强一致性(87% exact agreement),表明该方法具备稳定性和实用性;进一步分析揭示预测值与自评值之间的系统性差异并非误差,而是源于二者测量构念不同:自评反映整体评价判断,而预测值捕捉的是情感强烈、独特或可行动的显著时刻影响,两者信息互补,构成对体验理解的多维视角。

链接: https://arxiv.org/abs/2604.14321
作者: Jason Potteiger,Andrew Hong,Ito Zapata
机构: 未知
类目: Computation and Language (cs.CL)
备注: 29 pages, 5 figures, 6 tables

点击查看摘要

Abstract:We tasked GPT-4.1 to read what baseball fans wrote about their game-day experience and predict the overall experience rating each fan gave on a 0-10 survey scale. The model received only the text of a single open-ended response. These AI predictions were compared with the actual experience ratings captured by the survey instrument across approximately 10,000 fan responses from five Major League Baseball teams. In total, two-thirds of predicted ratings fell within one point of self-reported fan ratings (67% within +/-1, 36% exact match), and the predicted measurement was near-deterministic across three independent scoring runs (87% exact agreement, 99.9% within +/-1). Predicted ratings aligned most strongly with the overall experience rating (r = 0.82) rather than with any specific aspect of the game-day experience such as parking, concessions, staff, etc. However, predictions were systematically lower than self-reported ratings by approximately one point, and this gap was not driven by any single aspect. Rather, our analysis shows that self-reported ratings capture the fan’s verdict, an overall evaluative judgment that integrates the entire experience, while predicted ratings quantify the impact of salient moments characterized as memorable, emotionally intense, unusual, or actionable. Each measure contains information the other misses. These baseline results establish that a simple, unoptimized prompt can directionally predict how fans rate their experience from the text a fan wrote and that a gap between the two numbers can be interpreted as a construct difference worth preserving rather than an error to eliminate.
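The headline agreement numbers (36% exact, 67% within +/-1) come from simple rating comparisons, which can be reproduced on toy data as:

```python
def agreement(pred, true):
    """Exact-match and within-one-point agreement rates on 0-10 ratings."""
    exact = sum(p == t for p, t in zip(pred, true)) / len(true)
    within1 = sum(abs(p - t) <= 1 for p, t in zip(pred, true)) / len(true)
    return exact, within1

pred = [8, 7, 9, 4, 10, 6]   # hypothetical model-predicted ratings
true = [8, 8, 7, 5, 10, 6]   # hypothetical self-reported ratings
print(agreement(pred, true))
```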

[NLP-74] Tracking the Temporal Dynamics of News Coverage of Catastrophic and Violent Events

【速读】: 该论文旨在解决如何理解突发暴力与灾难事件在新闻报道中的叙事形成、传播与演变机制这一问题,尤其关注新闻周期的时间动态与语义变化规律。其解决方案的关键在于构建了一个包含126,602篇在线新闻文章的大规模语料库,并通过出版量、语义漂移(semantic drift)、语义分散度(semantic dispersion)和词项相关性(term relevance)等指标量化叙事变化,从而识别出突发事件报道中具有结构化和可预测性的新闻周期模式,包括报道量的快速上升、早期语义漂移及随后向基线水平的渐进回落。

链接: https://arxiv.org/abs/2604.14315
作者: Emily Lugos,Maurício Gruppi
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:The modern news cycle has been fundamentally reshaped by the rapid exchange of information online. As a result, media framing shifts dynamically as new information, political responses, and social reactions emerge. Understanding how these narratives form, propagate, and evolve is essential for interpreting public discourse during moments of crisis. In this study, we examine the temporal and semantic dynamics of reporting for violent and catastrophic events using a large-scale corpus of 126,602 news articles collected from online publishers. We quantify narrative change through publication volume, semantic drift, semantic dispersion, and term relevance. Our results show that sudden events of impact exhibit structured and predictable news-cycle patterns characterized by rapid surges in coverage, early semantic drift, and gradual declines toward the baseline. In addition, our results indicate the terms that are driving the temporal patterns.
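摘要中的“语义漂移”指标,常见做法之一是取相邻时间窗文章向量质心间的余弦距离;下面是一个概念性示意(二维玩具向量,非论文原始实现):

```python
import math

# 语义漂移:相邻日期文章向量质心的余弦距离
def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def semantic_drift(daily_embeddings):
    """daily_embeddings: 按日期排序的 [[向量, ...], ...];返回逐日漂移序列。"""
    centroids = [centroid(day) for day in daily_embeddings]
    return [cosine_distance(c1, c2) for c1, c2 in zip(centroids, centroids[1:])]

drift = semantic_drift([
    [[1.0, 0.0], [1.0, 0.2]],   # 第 1 天
    [[0.9, 0.1], [1.0, 0.0]],   # 第 2 天:与第 1 天接近
    [[0.0, 1.0], [0.2, 1.0]],   # 第 3 天:叙事明显转向
])
```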

[NLP-75] DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines

【速读】: 该论文旨在解决结构化光学字符识别(Structured OCR)中生成式 AI 模型存在的文本退化(text degeneration)问题,即模型在推理过程中产生冗长、重复或无意义的输出,导致响应时间延长、吞吐量下降及计算成本上升。解决方案的关键在于引入一种结合监督微调(Supervised Fine-Tuning, SFT)与直接偏好优化(Direct Preference Optimization, DPO)的联合训练策略:SFT 用于强制遵守严格的 JSON 结构(如页眉、页边距、页脚和正文),而 DPO 则将退化生成作为拒绝样本进行建模,从而有效抑制循环行为并显著降低退化率(相对减少最高达 87.6%)。此外,通过 AWQ 量化技术实现高达 22% 的每页推理成本降低,同时保持高质量输出,使 DharmaOCR Full(7B)和 Lite(3B)模型在新提出的 DharmaOCR-Benchmark 上达到当前最优性能(抽取质量得分分别为 0.925 和 0.911,退化率分别低至 0.40% 和 0.20%)。

链接: https://arxiv.org/abs/2604.14314
作者: Gabriel Pimenta de Freitas Cardoso,Caio Lucas da Silva Chacon,Jonas Felipe da Fonseca Oliveira,Paulo Henrique de Medeiros Araujo
机构: Dharma-AI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This manuscript introduces DharmaOCR Full and Lite, a pair of specialized small language models (SSLMs) for structured OCR that jointly optimize transcription quality, generation stability, and inference cost. It also presents DharmaOCR-Benchmark, a benchmark that covers printed, handwritten, and legal/administrative documents, and proposes a unified evaluation protocol that measures fidelity and structure while explicitly tracking text degeneration as a first-class benchmark metric (alongside unit cost). Beyond reporting degeneration rates, the manuscript empirically shows degeneration is not merely a quality failure, since it materially worsens production performance by increasing response time, reducing throughput, and inflating computational cost due to abnormally long generations. To the best of the author’s knowledge, as a methodological contribution, this is the first application of Direct Preference Optimization (DPO) for OCR, explicitly using degenerate generations as rejected examples to penalize looping behavior. Combined with Supervised Fine-Tuning (SFT) for enforcing a strict JSON schema (header, margin, footer, and text), DPO consistently reduces degeneration rate across model families (up to 87.6% relative) while preserving or improving extraction quality. The resulting models, namely, DharmaOCR Full (7B) and DharmaOCR Lite (3B), set a new state-of-the-art on DharmaOCR-Benchmark, outperforming each open-source and commercial baseline model evaluated regarding extraction quality, reaching 0.925 and 0.911 scores with 0.40% and 0.20% degeneration rates. AWQ quantization reduced up to 22% per-page cost with negligible quality loss, enabling a strong quality-cost trade-off in comparison to proprietary OCR APIs and open-source alternatives.
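摘要的核心做法是把退化生成作为 DPO 的 rejected 样本。下面用一个假设性的简化实现示意这一数据构造:以尾部 n-gram 循环作为退化判据(真实系统的判据与阈值未必如此):

```python
# 识别尾部 n-gram 循环式的退化输出,并构造 DPO 偏好对
def is_degenerate(text, ngram=3, min_repeats=3):
    toks = text.split()
    if len(toks) < ngram * min_repeats:
        return False
    tail = toks[-ngram:]
    reps, i = 0, len(toks) - ngram
    while i >= 0 and toks[i:i + ngram] == tail:
        reps += 1
        i -= ngram
    return reps >= min_repeats

def build_dpo_pairs(prompt, generations, reference):
    """退化输出作 rejected,参考转录作 chosen。"""
    return [{"prompt": prompt, "chosen": reference, "rejected": g}
            for g in generations if is_degenerate(g)]

pairs = build_dpo_pairs(
    "Transcribe:",
    [("error " * 10).strip(),   # 循环退化的生成
     "Invoice No. 42"],         # 正常生成
    "Invoice No. 42",
)
```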

[NLP-76] EuropeMedQA Study Protocol: A Multilingual Multimodal Medical Examination Dataset for Language Model Evaluation

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在非英语医疗考试和多模态诊断任务中性能显著下降的问题,尤其针对欧洲多语种临床实践场景下的泛化能力不足。解决方案的关键在于构建EuropeMedQA——首个基于意大利、法国、西班牙和葡萄牙官方监管医学考试的综合性、多语言、多模态医疗测评数据集,严格遵循FAIR数据原则与SPIRIT-AI指南,通过自动化翻译流程和零样本约束提示策略对主流多模态LLMs进行评估,从而提供一个抗污染、反映真实欧洲临床复杂性的基准,推动更通用的医疗人工智能系统发展。

链接: https://arxiv.org/abs/2604.14306
作者: Francesco Andrea Causio,Vittorio De Vita,Olivia Riccomi,Michele Ferramola,Federico Felizzi,Antonio Cristiano,Lorenzo De Mori,Chiara Battipaglia,Melissa Sawaya,Luigi De Angelis,Marcello Di Pumpo,Alessandra Piscitelli,Pietro Eric Risuleo,Alessia Longo,Giulia Vojvodic,Mariapia Vassalli,Bianca Destro Castaniti,Nicolò Scarsi,Manuel Del Medico
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Large Language Models (LLMs) have demonstrated high proficiency on English-centric medical examinations, their performance often declines when faced with non-English languages and multimodal diagnostic tasks. This study protocol describes the development of EuropeMedQA, the first comprehensive, multilingual, and multimodal medical examination dataset sourced from official regulatory exams in Italy, France, Spain, and Portugal. Following FAIR data principles and SPIRIT-AI guidelines, we describe a rigorous curation process and an automated translation pipeline for comparative analysis. We evaluate contemporary multimodal LLMs using a zero-shot, strictly constrained prompting strategy to assess cross-lingual transfer and visual reasoning. EuropeMedQA aims to provide a contamination-resistant benchmark that reflects the complexity of European clinical practices and fosters the development of more generalizable medical AI.

[NLP-77] ReviewGrounder: Improving Review Substantiveness with Rubric-Guided Tool-Integrated Agents

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的同行评审系统生成评论内容浅显、缺乏实质性证据支撑的问题。作者指出,这一局限源于LLM在评审过程中未能有效利用人类审稿的两个核心要素:明确的评分量表(rubrics)和对已有研究工作的上下文锚定(contextual grounding)。为此,论文提出REVIEWBENCH基准用于基于论文特异性量表评估审稿文本质量,并设计REVIEWGROUNDER框架——一个由工具集成的多智能体系统,将审稿过程分解为草拟与证据强化两个阶段,通过目标导向的文献证据整合提升初稿质量。其关键创新在于引入量表引导机制与分阶段结构化推理,显著提升了审稿反馈的专业性与一致性。

链接: https://arxiv.org/abs/2604.14261
作者: Zhuofeng Li,Yi Lu,Dongfu Jiang,Haoxiang Zhang,Yuyang Bai,Chuan Li,Yu Wang,Shuiwang Ji,Jianwen Xie,Yu Zhang
机构: Texas AM University (德州农工大学); University of Waterloo (滑铁卢大学); UC San Diego (加州大学圣地亚哥分校); Lambda; University of Oregon (俄勒冈大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid rise in AI conference submissions has driven increasing exploration of large language models (LLMs) for peer review support. However, LLM-based reviewers often generate superficial, formulaic comments lacking substantive, evidence-grounded feedback. We attribute this to the underutilization of two key components of human reviewing: explicit rubrics and contextual grounding in existing work. To address this, we introduce REVIEWBENCH, a benchmark evaluating review text according to paper-specific rubrics derived from official guidelines, the paper’s content, and human-written reviews. We further propose REVIEWGROUNDER, a rubric-guided, tool-integrated multi-agent framework that decomposes reviewing into drafting and grounding stages, enriching shallow drafts via targeted evidence consolidation. Experiments on REVIEWBENCH show that REVIEWGROUNDER, using a Phi-4-14B-based drafter and a GPT-OSS-120B-based grounding stage, consistently outperforms baselines with substantially stronger/larger backbones (e.g., GPT-4.1 and DeepSeek-R1-670B) in both alignment with human judgments and rubric-based review quality across 8 dimensions. The code is available at this https URL.

[NLP-78] Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

【速读】: 该论文旨在解决生成式 AI(Generative AI)在实际应用中如何实现安全、可靠且可扩展的代码生成与执行问题,特别是在人机协作场景下如何平衡自动化能力与人类控制权。其解决方案的关键在于构建一个以“简单 while 循环”为核心、辅以多层系统支持的代理架构:通过权限管理(七种模式+机器学习分类器)、五层上下文压缩管道、四种可扩展机制(MCP、插件、技能和钩子)、子代理委托机制(工作树隔离)以及追加式会话存储,实现了对工具调用、上下文管理和执行安全性的精细化控制。这种设计不仅保障了人类决策权威与安全性,还提升了系统的适应性与能力放大效果,为下一代自主代理系统提供了可验证的工程范式。

链接: https://arxiv.org/abs/2604.14228
作者: Jiacheng Liu,Xiaohan Zhao,Xinyi Shang,Zhiqiang Shen
机构: MBZUAI(穆罕默德·本·扎耶德大学人工智能学院)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Tech report. Code at: this https URL

点击查看摘要

Abstract:Claude Code is an agentic coding tool that can run shell commands, edit files, and call external services on behalf of the user. This study describes its comprehensive architecture by analyzing the publicly available TypeScript source code and further comparing it with OpenClaw, an independent open-source AI agent system that answers many of the same design questions from a different deployment context. Our analysis identifies five human values, philosophies, and needs that motivate the architecture (human decision authority, safety and security, reliable execution, capability amplification, and contextual adaptability) and traces them through thirteen design principles to specific implementation choices. The core of the system is a simple while-loop that calls the model, runs tools, and repeats. Most of the code, however, lives in the systems around this loop: a permission system with seven modes and an ML-based classifier, a five-layer compaction pipeline for context management, four extensibility mechanisms (MCP, plugins, skills, and hooks), a subagent delegation mechanism with worktree isolation, and append-oriented session storage. A comparison with OpenClaw, a multi-channel personal assistant gateway, shows that the same recurring design questions produce different architectural answers when the deployment context changes: from per-action safety classification to perimeter-level access control, from a single CLI loop to an embedded runtime within a gateway control plane, and from context-window extensions to gateway-wide capability registration. We finally identify six open design directions for future agent systems, grounded in recent empirical, architectural, and policy literature.
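摘要所述“调用模型—执行工具—循环”的核心 while 循环,可以用下面的桩实现示意控制流;fake_model 与 TOOLS 均为假设的占位实现,并非 Claude Code 的真实 API:

```python
# 极简代理循环:模型要么请求工具调用,要么给出最终答复
TOOLS = {
    "read_file": lambda arg: f"<contents of {arg}>",
    "shell":     lambda arg: f"<output of `{arg}`>",
}

def fake_model(history):
    """桩模型:先请求一次工具调用,随后给出最终答复。"""
    if not any(m["role"] == "tool" for m in history):
        return {"type": "tool_call", "tool": "read_file", "arg": "main.py"}
    return {"type": "final", "text": "done"}

def agent_loop(user_msg, model=fake_model, max_steps=10):
    history = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):          # 以步数上限代替无限 while,保证终止
        action = model(history)
        if action["type"] == "final":
            return action["text"], history
        result = TOOLS[action["tool"]](action["arg"])   # 真实系统在此处做权限检查
        history.append({"role": "tool", "content": result})
    raise RuntimeError("step limit reached")

answer, trace = agent_loop("summarize main.py")
```

这正是摘要所说“大部分代码其实在循环之外”的含义:权限、上下文压缩、扩展机制都是围绕这一简单循环的外围系统。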

[NLP-79] MEME-Fusion@CHiPSAL 2026: Multimodal Ablation Study of Hate Detection and Sentiment Analysis on Nepali Memes

【速读】: 该论文旨在解决德文尼格拉(Devanagari)脚本社交媒体表情包中的仇恨言论检测(hate speech detection)问题,其核心挑战包括多模态内容结构、语种特异性语言复杂性以及低资源环境下的极端数据稀缺。解决方案的关键在于提出一种混合跨模态注意力融合架构:利用CLIP(ViT-B/32)进行视觉编码,结合BGE-M3实现多语言文本表示,并通过四头自注意力机制与可学习门控网络动态调整每样本的模态权重,从而实现显式跨模态推理。实验表明,该方法在子任务A(二分类仇恨言论检测)上相比纯文本基线提升5.9% F1-macro,同时揭示两个关键发现:以英语为中心的视觉模型在德文尼格拉脚本上表现接近随机,且标准集成方法因过拟合相关性在小样本(每折约850条)条件下性能显著下降。

链接: https://arxiv.org/abs/2604.14218
作者: Samir Wagle,Reewaj Khanal,Abiral Adhikari
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: PrePrint

点击查看摘要

Abstract:Hate speech detection in Devanagari-scripted social media memes presents compounded challenges: multimodal content structure, script-specific linguistic complexity, and extreme data scarcity in low-resource settings. This paper presents our system for the CHiPSAL 2026 shared task, addressing both Subtask A (binary hate speech detection) and Subtask B (three-class sentiment classification: positive, neutral, negative). We propose a hybrid cross-modal attention fusion architecture that combines CLIP (ViT-B/32) for visual encoding with BGE-M3 for multilingual text representation, connected through 4-head self-attention and a learnable gating network that dynamically weights modality contributions on a per-sample basis. Systematic evaluation across eight model configurations demonstrates that explicit cross-modal reasoning achieves a 5.9% F1-macro improvement over text-only baselines on Subtask A, while uncovering two unexpected but critical findings: English-centric vision models exhibit near-random performance on Devanagari script, and standard ensemble methods catastrophically degrade under data scarcity (N ≈ 850 per fold) due to correlated overfitting. The code can be accessed at this https URL
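摘要中的可学习门控网络按样本动态加权各模态特征,其前向过程可示意如下(纯 Python 简化版,门控权重为手工设定而非训练所得):

```python
import math

# 门控融合:由拼接特征算出各模态 logit,softmax 后加权求和
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def gated_fusion(text_feat, image_feat, gate_w):
    """gate_w: 门控线性层权重(每模态一行);返回融合向量与模态权重。"""
    concat = text_feat + image_feat
    logits = [sum(w * x for w, x in zip(row, concat)) for row in gate_w]
    a_text, a_img = softmax(logits)
    fused = [a_text * t + a_img * v for t, v in zip(text_feat, image_feat)]
    return fused, (a_text, a_img)

fused, (a_text, a_img) = gated_fusion(
    [0.5, 1.0], [1.0, 0.0],
    gate_w=[[1.0, 0.0, 0.0, 0.0],    # 文本门控 logit 只看文本首维(纯示意)
            [0.0, 0.0, 1.0, 0.0]],   # 图像门控 logit 只看图像首维
)
```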

[NLP-80] Neuro-Oracle: A Trajectory-Aware Agentic RAG Framework for Interpretable Epilepsy Surgical Prognosis

【速读】: 该论文旨在解决难治性癫痫(pharmacoresistant epilepsy)患者术后癫痫发作预测的临床挑战,传统深度学习方法仅依赖单一时点的术前磁共振成像(MRI),忽略了手术前后脑结构的动态变化。其解决方案的关键在于提出一个三阶段框架 Neuro-Oracle:首先利用3D Siamese对比编码器将术前至术后MRI变化压缩为512维轨迹向量;其次通过最近邻搜索从人群档案中检索相似手术轨迹;最后借助量化版 Llama-3-8B 推理模型生成基于证据的自然语言预后报告。该方法在 EPISURG 数据集上验证表明,基于轨迹的分类器显著优于单时点基准(AUC 0.905 vs. 0.793),且无需语言模型开销的轨迹空间集成模型(M6)达到最优性能,同时提供无幻觉的结构化解释。

链接: https://arxiv.org/abs/2604.14216
作者: Aizierjiang Aiersilan,Mohamad Koubeissi
机构: The George Washington University (乔治·华盛顿大学)
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Predicting post-surgical seizure outcomes in pharmacoresistant epilepsy is a clinical challenge. Conventional deep-learning approaches operate on static, single-timepoint pre-operative scans, omitting longitudinal morphological changes. We propose Neuro-Oracle, a three-stage framework that: (i) distils pre-to-post-operative MRI changes into a compact 512-dimensional trajectory vector using a 3D Siamese contrastive encoder; (ii) retrieves historically similar surgical trajectories from a population archive via nearest-neighbour search; and (iii) synthesises a natural-language prognosis grounded in the retrieved evidence using a quantized Llama-3-8B reasoning agent. Evaluations are conducted on the public EPISURG dataset (N=268 longitudinally paired cases) using five-fold stratified cross-validation. Since ground-truth seizure-freedom scores are unavailable, we utilize a clinical proxy label based on the resection type. We acknowledge that the network representations may potentially learn the anatomical features of the resection cavities (i.e., temporal versus non-temporal locations) rather than true prognostic morphometry. Our current evaluation thus serves mainly as a proof-of-concept for the trajectory-aware retrieval architecture. Trajectory-based classifiers achieve AUC values between 0.834 and 0.905, compared with 0.793 for a single-timepoint ResNet-50 baseline. The Neuro-Oracle agent (M5) matches the AUC of purely discriminative trajectory classifiers (0.867) while producing structured justifications with zero observed hallucinations under our audit protocol. A Siamese Diversity Ensemble (M6) of trajectory-space classifiers attains an AUC of 0.905 without language-model overhead.
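框架第 (ii) 步在轨迹向量空间中检索相似手术轨迹,可用余弦相似度最近邻示意(4 维玩具向量代替真实的 512 维轨迹嵌入,病例数据为假设示例):

```python
import math

# 在轨迹向量档案中按余弦相似度检索最相近的历史病例
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, archive, k=2):
    """archive: [(case_id, 轨迹向量), ...];返回相似度最高的 k 个病例 id。"""
    scored = sorted(archive, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [cid for cid, _ in scored[:k]]

archive = [
    ("case_a", [1.0, 0.0, 0.1, 0.0]),
    ("case_b", [0.0, 1.0, 0.0, 0.2]),
    ("case_c", [0.9, 0.1, 0.1, 0.0]),
]
neighbours = retrieve([1.0, 0.05, 0.1, 0.0], archive, k=2)
```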

[NLP-81] CROP: Token-Efficient Reasoning in Large Language Models via Regularized Prompt Optimization ICLR2026

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在采用推理技术提升任务性能时,因生成冗长推理过程而导致显著延迟和高Token消耗的问题。现有自动提示优化(Automatic Prompt Optimization, APO)方法仅关注任务准确率,忽视了响应长度对计算成本的影响。其解决方案的关键在于提出一种成本正则化的提示优化方法(Cost-Regularized Optimization of Prompts, CROP),通过在标准准确率反馈基础上引入文本形式的长度反馈(即对响应长度进行正则化),引导优化过程生成仅包含关键信息与必要推理步骤的简洁响应,从而在保持竞争力准确率的同时大幅降低Token使用量(实验中实现80.6%的减少)。

链接: https://arxiv.org/abs/2604.14214
作者: Deep Shah,Sanket Badhe,Nehal Kathrotia,Priyanka Tiwari
机构: Google LLC; Purdue University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2026 Workshop on Logical Reasoning of Large Language Models

点击查看摘要

Abstract:Large Language Models utilizing reasoning techniques improve task performance but incur significant latency and token costs due to verbose generation. Existing automatic prompt optimization(APO) frameworks target task accuracy exclusively at the expense of generating long reasoning traces. We propose Cost-Regularized Optimization of Prompts (CROP), an APO method that introduces regularization on response length by generating textual feedback in addition to standard accuracy feedback. This forces the optimization process to produce prompts that elicit concise responses containing only critical information and reasoning. We evaluate our approach on complex reasoning datasets, specifically GSM8K, LogiQA and BIG-Bench Hard. We achieved an 80.6% reduction in token consumption while maintaining competitive accuracy, seeing only a nominal decline in performance. This presents a pragmatic solution for deploying token-efficient and cost-effective agentic AI systems in production pipelines.
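CROP 对响应长度做正则的思路,可以抽象为“准确率减去长度罚项”的评分函数;以下为概念性示意,lam 与 max_tokens 均为假设取值,并非论文设定:

```python
# 成本正则化的提示词评分:同时偏向“答对”与“答短”
def regularized_score(accuracy, avg_tokens, max_tokens=2048, lam=0.3):
    """accuracy ∈ [0,1];avg_tokens 为平均响应 token 数。"""
    length_penalty = min(avg_tokens / max_tokens, 1.0)
    return accuracy - lam * length_penalty

# 冗长但略准的提示 vs. 简洁且几乎同准的提示
verbose = regularized_score(accuracy=0.85, avg_tokens=1600)
concise = regularized_score(accuracy=0.84, avg_tokens=320)
```

在这一评分下,优化过程会偏向第二种提示,这对应摘要中“token 消耗大幅下降而准确率仅轻微回落”的取舍。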

[NLP-82] Chinese Language Is Not More Efficient Than English in Vibe Coding: A Preliminary Study on Token Cost and Problem-Solving Rate

【速读】: 该论文旨在验证一个广泛流传于社交媒体和开发者社区的假设,即中文提示在大语言模型(Large Language Models, LLMs)编程任务中比英文提示更具token效率,可能带来高达40%的API成本节约。为解决这一问题,作者采用SWE-bench Lite这一软件工程任务基准进行严谨的实证研究,其关键在于通过量化比较不同模型在中英文提示下的token消耗与任务成功率,从而综合评估“成本效率”(即每成功完成一项任务的预期token成本)。研究发现:中文提示并未表现出token效率优势,且token成本随模型架构变化而呈现非一致性模式(如MiniMax-2.7中中文更耗token,而GLM-5则相反),更重要的是,中文提示的整体任务成功率普遍低于英文提示,表明单纯切换语言并不能带来成本或性能收益。

链接: https://arxiv.org/abs/2604.14210
作者: Simiao Ren,Xingyu Shen,Yuchen Zhou,Dennis (Tsang) Ng,Ankit Raj
机构: scam.ai
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:A claim has been circulating on social media and practitioner forums that Chinese prompts are more token-efficient than English for LLM coding tasks, potentially reducing costs by up to 40%. This claim has influenced developers to consider switching to Chinese for “vibe coding” to save on API costs. In this paper, we conduct a rigorous empirical study using SWE-bench Lite, a benchmark of software engineering tasks, to evaluate whether this claim of Chinese token efficiency holds up to scrutiny. Our results reveal three key findings: First, the efficiency advantage of Chinese is not observed. Second, token cost varies by model architecture in ways that defy simple assumptions: while MiniMax-2.7 shows 1.28x higher token costs for Chinese, GLM-5 actually consumes fewer tokens with Chinese prompts. Third, and most importantly, we found that the success rate when prompting in Chinese is generally lower than in English across all models we tested. We also measure cost efficiency as expected cost per successful task – jointly accounting for token consumption and task resolution rate. These findings should be interpreted as preliminary evidence rather than a definitive conclusion, given the limited number of models evaluated and the narrow set of benchmarks tested due to resource constraints; they indicate that language effects on token cost are model-dependent, and that practitioners should not expect cost savings or performance gains just by switching their prompt language to Chinese.
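论文定义的成本效率(每成功任务的期望 token 开销)计算很直接:平均 token 数除以成功率。下面的数值为假设示例,仅说明“单次更省但成功率更低”时折算成本反而可能更高:

```python
# 成本效率 = 单任务平均 token 数 / 任务成功率
def cost_per_success(avg_tokens_per_task, success_rate):
    if success_rate <= 0:
        return float("inf")
    return avg_tokens_per_task / success_rate

# 假设:中文提示单次 token 略省,但成功率更低
english = cost_per_success(avg_tokens_per_task=10_000, success_rate=0.30)
chinese = cost_per_success(avg_tokens_per_task=9_000, success_rate=0.22)
```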

[NLP-83] MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining

【速读】: 该论文旨在解决多模态训练中数据混合策略优化不足的问题,特别是如何在中段训练(midtraining)阶段提升样本效率和下游任务泛化能力。现有方法通常仅沿单一维度(如数据格式或任务类型)调整混合比例,缺乏对多维特征空间的系统探索。其解决方案的关键在于提出 MixAtlas 方法,通过将训练语料库分解为两个轴:图像概念(基于 CLIP 嵌入发现的 10 个视觉域聚类)与任务监督类型(包括图像描述、光学字符识别 OCR、定位、检测和视觉问答 VQA 共 5 类),构建一个可搜索的混合空间;并利用小型代理模型(Qwen2-0.5B)结合高斯过程(Gaussian Process, GP)代理模型与 GP-UCB 获取策略,在有限预算下高效寻找性能更优的数据混合方案,从而显著提升大规模模型(如 Qwen2-7B)在多个基准上的平均性能,并实现跨模型规模的配方迁移。

链接: https://arxiv.org/abs/2604.14198
作者: Bingbing Wen,Sirajul Salekin,Feiyang Kang,Bill Howe,Lucy Lu Wang,Javier Movellan,Manjot Bilkhu
机构: Apple(苹果); University of Washington (华盛顿大学); Virginia Tech (弗吉尼亚理工大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Domain reweighting can improve sample efficiency and downstream generalization, but data-mixture optimization for multimodal midtraining remains largely unexplored. Current multimodal training recipes tune mixtures along a single dimension, typically data format or task type. We introduce MixAtlas, a method that produces benchmark-targeted data recipes that can be inspected, adapted, and transferred to new corpora. MixAtlas decomposes the training corpus along two axes: image concepts (10 visual-domain clusters discovered via CLIP embeddings) and task supervision (5 objective types including captioning, OCR, grounding, detection, and VQA). Using small proxy models (Qwen2-0.5B) paired with a Gaussian-process surrogate and GP-UCB acquisition, MixAtlas searches the resulting mixture space with the same proxy budget as regression-based baselines but finds better-performing mixtures. We evaluate on 10 benchmarks spanning visual understanding, document reasoning, and multimodal reasoning. On Qwen2-7B, optimized mixtures improve average performance by 8.5%-17.6% over the strongest baseline; on Qwen2.5-7B, gains are 1.0%-3.3%. Both settings reach baseline-equivalent training loss in up to 2 times fewer steps. Recipes discovered on 0.5B proxies transfer to 7B-scale training across Qwen model families.
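MixAtlas 用 GP-UCB 采集函数在候选混合比例上选点,即取 argmax(μ + √β·σ)。下面的示意直接给定代理模型的后验均值与标准差(真实流程中来自高斯过程拟合),β 为假设超参:

```python
import math

# GP-UCB:在候选数据混合上平衡利用(μ 高)与探索(σ 大)
def gp_ucb_select(candidates, mu, sigma, beta=2.0):
    scores = [m + math.sqrt(beta) * s for m, s in zip(mu, sigma)]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores

mixtures = [
    {"ocr": 0.5, "vqa": 0.5},   # 已充分探索:均值高、方差小
    {"ocr": 0.2, "vqa": 0.8},   # 欠探索:均值略低、方差大
]
chosen, scores = gp_ucb_select(mixtures, mu=[0.70, 0.66], sigma=[0.01, 0.08])
```

在该设定下,采集函数会选中第二个欠探索的混合比例,体现 UCB 的探索奖励。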

[NLP-84] The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)性能高度依赖提示(prompt)设计,但当前提示构建在描述和应用上存在不一致的问题。其解决方案的关键在于提出一个名为PICCO的参考框架,该框架通过系统性整合11个已有提示框架的元分析,构建了一个包含五个核心元素的结构化架构:角色(Persona)、指令(Instructions)、上下文(Context)、约束(Constraints)和输出(Output)。该框架首次明确区分了提示框架、提示元素、提示生成、提示技术与提示工程等概念,并为每个元素定义了功能、作用范围及其相互关系,从而提升提示设计的概念清晰度和系统性,为后续研究和实践提供标准化结构支撑。

链接: https://arxiv.org/abs/2604.14197
作者: David A. Cook
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Presents the novel PICCO framework for LLM prompting, derived through a structured multi-database search and rigorous comparative synthesis of 11 published prompting frameworks. Submitted in PDF/A format to preserve the structure and readability of several multi-page tables central to the framework and methodology; these contain dense structured information that is best preserved in PDF form

点击查看摘要

Abstract:Large language model (LLM) performance depends heavily on prompt design, yet prompt construction is often described and applied inconsistently. Our purpose was to derive a reference framework for structuring LLM prompts. This paper presents PICCO, a framework derived through a rigorous synthesis of 11 previously published prompting frameworks identified through a multi-database search. The analysis yields two main contributions. First, it proposes a taxonomy that distinguishes prompt frameworks, prompt elements, prompt generation, prompting techniques, and prompt engineering as related but non-equivalent concepts. Second, it derives a five-element reference architecture for prompt generation: Persona, Instructions, Context, Constraints, and Output (PICCO). For each element, we define its function, scope, and relationship to other elements, with the goal of improving conceptual clarity and supporting more systematic prompt design. Finally, to support application of the framework, we outline key concepts relevant to implementation, including prompting techniques (e.g., zero-shot, few-shot, chain-of-thought, ensembling, decomposition, and self-critique, with selected variants), human and automated approaches to iterative prompt engineering, responsible prompting considerations such as security, privacy, bias, and trust, and priorities for future research. This work is a conceptual and methodological contribution: it formalizes a common structure for prompt specification and comparison, but does not claim empirical validation of PICCO as an optimization method.
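按 PICCO 五要素拼装提示词可以示意如下;五要素及其顺序来自论文,字段内容与标题格式为假设示例:

```python
# 按 Persona / Instructions / Context / Constraints / Output 五要素拼装提示词
PICCO_ORDER = ["Persona", "Instructions", "Context", "Constraints", "Output"]

def build_prompt(elements):
    missing = [k for k in PICCO_ORDER if k not in elements]
    if missing:
        raise ValueError(f"missing PICCO elements: {missing}")
    return "\n\n".join(f"## {k}\n{elements[k]}" for k in PICCO_ORDER)

prompt = build_prompt({
    "Persona": "You are a senior technical editor.",
    "Instructions": "Summarize the attached report.",
    "Context": "The report covers Q3 latency regressions.",
    "Constraints": "At most 150 words; no speculation.",
    "Output": "A single markdown paragraph.",
})
```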

[NLP-85] Attention to Mamba: A Recipe for Cross-Architecture Distillation

【速读】: 该论文旨在解决如何将基于注意力机制的Transformer模型高效地蒸馏(distill)至Mamba类状态空间模型(State Space Model, SSM),从而在不损失原始教师模型性能的前提下,利用SSM架构更低的内存消耗和更高的生成吞吐量优势。此前研究表明,直接从Transformer到Mamba的蒸馏方法难以保持教师模型的性能,通常需依赖混合架构(如同时包含Attention与SSM模块)来弥补差距。本文的关键创新在于提出一种有原则的两阶段蒸馏方案:第一阶段通过改进的核技巧(kernel trick)将Transformer知识蒸馏至线性化注意力结构;第二阶段再将该线性化结构进一步蒸馏至完全去除注意力模块的Mamba模型。实验表明,采用这种策略可使蒸馏后的Mamba模型在下游任务中保持与原始Pythia-1B Transformer相当的性能(困惑度14.11 vs 教师模型13.86),且通过详尽的消融实验验证了该方法的有效性。

链接: https://arxiv.org/abs/2604.14191
作者: Abhinav Moudgil,Ningyuan Huang,Eeshan Gunesh Dhekane,Pau Rodríguez,Luca Zappella,Federico Danieli
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models, due to their reduced memory consumption and higher throughput at generation compared to their Attention-based counterparts. On the other hand, the community has built up a considerable body of knowledge on how to train Transformers, and many pretrained Transformer models are readily available. To facilitate the adoption of SSMs while leveraging existing pretrained Transformers, we aim to identify an effective recipe to distill an Attention-based model into a Mamba-like architecture. In prior work on cross-architecture distillation, however, it has been shown that a naïve distillation procedure from Transformers to Mamba fails to preserve the original teacher performance, a limitation often overcome with hybrid solutions combining Attention and SSM blocks. The key argument from our work is that, by equipping Mamba with a principled initialization, we can recover an overall better recipe for cross-architectural distillation. To this end, we propose a principled two-stage approach: first, we distill knowledge from a traditional Transformer into a linearized version of Attention, using an adaptation of the kernel trick. Then, we distill the linearized version into an adapted Mamba model that does not use any Attention block. Overall, the distilled Mamba model is able to preserve the original Pythia-1B Transformer performance in downstream tasks, maintaining a perplexity of 14.11 close to the teacher’s 13.86. To show the efficacy of our recipe, we conduct thorough ablations at 1B scale with 10B tokens varying sequence mixer architecture, scaling analysis on model sizes and total distillation tokens, and a sensitivity analysis on tokens allocation between stages.
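第一阶段所用的核技巧,可以用线性注意力的标准形式示意:以特征映射 φ 替代 softmax 核,把输出改写为 φ(q)·(Σφ(k)vᵀ) / φ(q)·Σφ(k),复杂度从 O(n²) 降到 O(n)。这里 φ(x)=elu(x)+1 与玩具维度仅为演示,并非论文蒸馏配方的具体设置:

```python
import math

# 线性注意力:累积 S = Σ φ(k_j) v_jᵀ 与 z = Σ φ(k_j),逐 query 读出
def phi(x):
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]  # elu(x)+1 > 0

def linear_attention(qs, ks, vs):
    d_k, d_v = len(ks[0]), len(vs[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    for k, v in zip(ks, vs):
        fk = phi(k)
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    outs = []
    for q in qs:
        fq = phi(q)
        denom = sum(f * zi for f, zi in zip(fq, z))
        outs.append([sum(fq[a] * S[a][b] for a in range(d_k)) / denom
                     for b in range(d_v)])
    return outs

qs = [[0.3, -0.2], [1.0, 0.5]]
ks = [[0.1, 0.4], [-0.3, 0.9], [0.8, -0.1]]
vs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = linear_attention(qs, ks, vs)
```

由于 φ 恒为正,输出是各 v 行的凸组合;本例中每个 v 行分量之和为 1,故每个输出行之和也为 1。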

[NLP-86] Internal Knowledge Without External Expression: Probing the Generalization Boundary of a Classical Chinese Language Model

【速读】: 该论文旨在解决生成式 AI(Generative AI)在面对未知输入时是否具备区分已知与未知信息的能力,以及能否在生成文本中表达这种不确定性的问题。其核心发现是:模型内部能够通过困惑度(perplexity)显著差异识别事实性内容(如真实与虚构历史事件),表明其对知识存在实质性编码;但外部表现上却无法自发表达不确定性,例如在处理分布外(out-of-distribution, OOD)问题时,古典汉语中的认知标记(epistemic markers)使用频率反而低于分布内问题,反映出语言模型本身不具备元认知(metacognition)能力。解决方案的关键在于认识到仅靠语言建模不足以激发真正的不确定性表达,必须引入显式的训练信号(如基于人类反馈的强化学习,RLHF)才能促使模型学会“我也不知道”这类元认知行为。

链接: https://arxiv.org/abs/2604.14180
作者: Jiuting Chen,Yuan Lian,Hao Wu,Tianqi Huang,Hiroshi Sasaki,Makoto Kouno,Jongil Choi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 5 figures, supplementary material included

点击查看摘要

Abstract:We train a 318M-parameter Transformer language model from scratch on a curated corpus of 1.56 billion tokens of pure Classical Chinese, with zero English characters or Arabic numerals. Through systematic out-of-distribution (OOD) testing, we investigate whether the model can distinguish known from unknown inputs, and crucially, whether it can express this distinction in its generated text. We find a clear dissociation between internal and external uncertainty. Internally, the model exhibits a perplexity jump ratio of 2.39x between real and fabricated historical events (p = 8.9e-11, n = 92 per group), with semi-fabricated events (real figures + fictional events) showing the highest perplexity (4.24x, p = 1.1e-16), demonstrating genuine factual encoding beyond syntactic pattern matching. Externally, however, the model never learns to express uncertainty: classical Chinese epistemic markers appear at lower rates for OOD questions (3.5%) than for in-distribution questions (8.3%, p = 0.023), reflecting rhetorical conventions rather than genuine metacognition. We replicate both findings across three languages (Classical Chinese, English, Japanese), three writing systems, and eight models from 110M to 1.56B parameters. We further show that uncertainty expression frequency is determined entirely by training data conventions, with Classical Chinese models showing a “humility paradox” (more hedging for known topics), while Japanese models almost never hedge. We argue that metacognitive expression – the ability to say “I don’t know” – does not emerge from language modeling alone and requires explicit training signals such as RLHF.
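摘要中的“困惑度跳跃比”计算方式很直接:困惑度取平均负对数似然(NLL)的指数,再对两组取比值。下面的 NLL 数值为假设示例,非论文原始数据:

```python
import math

# 困惑度 = exp(平均 NLL);跳跃比 = 虚构组困惑度 / 真实组困惑度
def perplexity(nlls):
    return math.exp(sum(nlls) / len(nlls))

def jump_ratio(nlls_fabricated, nlls_real):
    return perplexity(nlls_fabricated) / perplexity(nlls_real)

real_nlls = [2.1, 2.3, 2.0]          # 模型见过的真实历史事件:NLL 低
fabricated_nlls = [3.0, 3.2, 2.9]    # 虚构事件:NLL 高
ratio = jump_ratio(fabricated_nlls, real_nlls)
```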

[NLP-87] An Underexplored Frontier: Large Language Models for Rare Disease Patient Education and Communication – A scoping review

【速读】: 该论文旨在解决罕见病患者在漫长诊疗过程中面临的复杂护理路径、临床专业知识匮乏以及沟通需求未被满足的问题,尤其是在生成式 AI(Generative AI)技术快速发展背景下,如何有效利用大语言模型(Large Language Models, LLMs)提升罕见病患者的教育与沟通支持。其解决方案的关键在于系统性地开展了一项针对2022年1月至2026年3月期间发表的12项相关研究的范围综述,识别出当前LLM在罕见病领域应用的主要场景、模型类型及评估方法,并指出当前研究普遍依赖通用模型(如ChatGPT)、聚焦于结构化问答任务、缺乏真实世界数据和纵向沟通场景的应用,且评价指标多集中于准确性而忽视可读性、共情能力和沟通质量等患者中心维度。因此,论文强调未来应优先推动以患者为中心的设计、领域适配的方法开发及真实环境部署,以实现安全、自适应且高效的罕见病沟通支持。

链接: https://arxiv.org/abs/2604.14179
作者: Zaifu Zhan,Yu Hou,Kai Yu,Min Zeng,Anita Burgun,Xiaoyi Chen,Rui Zhang
机构: 1. University of Chinese Academy of Sciences (中国科学院大学); 2. Tsinghua University (清华大学); 3. INSERM (法国国家健康与医学研究院); 4. Université de Bordeaux (波尔多大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rare diseases affect over 300 million people worldwide and are characterized by complex care pathways, limited clinical expertise, and substantial unmet communication needs throughout the long patient journey. Recent advances in large language models (LLMs) offer new opportunities to support patient education and communication, yet their application in rare diseases remains unclear. We conducted a scoping review of studies published between January 2022 and March 2026 across major databases, identifying 12 studies on LLM-based rare disease patient education and communication. Data were extracted on study characteristics, application scenarios, model usage, and evaluation methods, and synthesized using descriptive and qualitative analyses. The literature is highly recent and dominated by general-purpose models, particularly ChatGPT. Most studies focus on patient question answering using curated question sets, with limited use of real-world data or longitudinal communication scenarios. Evaluations are primarily centered on accuracy, with limited attention to patient-centered dimensions such as readability, empathy, and communication quality. Multilingual communication is rarely addressed. Overall, the field remains at an early stage. Future research should prioritize patient-centered design, domain-adapted methods, and real-world deployment to support safe, adaptive, and effective communication in rare diseases.

[NLP-88] Listen Correct and Feed Back: Spoken Pedagogical Feedback Generation

【速读】: 该论文旨在解决口语化语法错误修正(Spoken Grammatical Error Correction, SGEC)场景中缺乏符合教学需求的可操作性、适龄且鼓励性的教师风格反馈问题。现有研究虽在语法纠错(GEC)与解释(GEE)上取得进展,但未充分考虑真实课堂环境中学习者所需的教学友好型反馈。解决方案的关键在于构建了一个名为SPFG(Spoken Pedagogical Feedback Generation)的数据集,该数据集基于Speak Improve Challenge 2025语料库,将流畅性导向的转录文本与GEC目标及人工验证的教师风格反馈配对,并提供偏好学习所需的优选/拒选反馈对。在此基础上,作者对比了监督微调(SFT)与基于偏好对齐的方法(DPO和KTO),评估其在联合生成修正结果与教学反馈上的效果,发现SFT在一致性提升方面表现最优,而偏好对齐方法效果有限,且修正质量与反馈质量呈弱耦合关系。

链接: https://arxiv.org/abs/2604.14177
作者: Junhong Liang,Yifan Lu,Ekaterina Kochmar,Fajri Koto
机构: Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: NLP8506 course project

点击查看摘要

Abstract:Grammatical error correction (GEC) and explanation (GEE) have made rapid progress, but real teaching scenarios also require learner-friendly pedagogical feedback that is actionable, level-appropriate, and encouraging. We introduce SPFG (Spoken Pedagogical Feedback Generation), a dataset built based on the Speak & Improve Challenge 2025 corpus, pairing fluency-oriented transcriptions with GEC targets and human-verified teacher-style feedback, including preferred/rejected feedback pairs for preference learning. We study a transcript-based Spoken Grammatical Error Correction (SGEC) setting and evaluate three instruction-tuned LLMs (Qwen2.5, Llama-3.1, and GLM-4), comparing supervised fine-tuning (SFT) with preference-based alignment (using DPO and KTO) for jointly generating corrections and feedback. Results show that SFT provides the most consistent improvements, while DPO/KTO yield smaller or mixed gains, and that correction quality and feedback quality are weakly coupled. Our implementation is available at this https URL.
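摘要中用 DPO/KTO 做偏好对齐,其中 DPO 的核心是:对"优选/拒选"反馈对,比较策略模型相对参考模型的对数似然差,并对该差值做 sigmoid 损失。下面给出单样本 DPO 损失的计算示意(β 取常用默认值 0.1,四个对数似然均为假设输入,仅演示公式本身,非论文实现):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO 单样本损失:鼓励策略相对参考模型提高优选回答、
    压低拒选回答的对数似然。margin 越大损失越小。"""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))   # -log(sigmoid(margin))
```

当策略与参考模型完全一致时 margin 为 0,损失恒为 log 2;策略相对参考更偏向优选反馈时损失下降。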

[NLP-89] QU-NLP at ArchEHR-QA 2026: Two-Stage QLoRA Fine-Tuning of Qwen 3-4B for Patient-Oriented Clinical Question Answering and Evidence Sentence Alignment ALT LREC2026

【速读】: 该论文针对ArchEHR-QA共享任务中的两个子任务展开研究:子任务3(答案生成)和子任务4(证据句对齐)。其核心问题是如何在有限标注数据下提升生成式AI(Generative AI)在临床问答场景中的表现,尤其是区分相关与无关的临床文本片段。解决方案的关键在于两阶段量化低秩适配(Quantised Low-Rank Adaptation, QLoRA)策略与多方法融合的证据句检索机制:首先通过分阶段微调Qwen3-4B模型(4-bit NF4量化),利用大规模医学语料建立领域知识基础并进一步适应特定任务输出风格;其次构建基于BM25、TF-IDF和微调交叉编码器的加权集成模型以实现高精度证据句匹配。实验表明,尽管系统在两项子任务上均取得可量化性能,但根本瓶颈仍源于仅20个标注样本难以有效区分相关/无关句子,提示未来应优先探索数据增强策略作为主要改进方向。

链接: https://arxiv.org/abs/2604.14175
作者: Mohammad AL-Smadi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for publication at CL4Health 2026 workshop, LREC2026 conference

点击查看摘要

Abstract:We present a unified system addressing both Subtask 3 (answer generation) and Subtask 4 (evidence sentence alignment) of the ArchEHR-QA Shared Task. For Subtask 3, we apply two-stage Quantised Low-Rank Adaptation (QLoRA) to Qwen3-4B loaded in 4-bit NF4 quantisation: first on 30,000 samples from the emrQA-MedSQuAD corpus to establish clinical domain competence, then on the 20 annotated development cases to learn the task-specific output style. Our system achieves an overall score of 32.87 on the official test-2026 split (BLEU = 9.42, ROUGE-L = 27.04, SARI = 55.42, BERTScore = 43.00, AlignScore = 25.28, MEDCON = 37.04). For Subtask 4, we develop a weighted ensemble of three retrieval methods - BM25 with relative thresholding, TF-IDF cosine similarity, and a fine-tuned cross-encoder - to identify note sentences supporting a given gold answer, achieving a micro-F1 of 67.16 on the 100-case test set. Experiments reveal that both subtasks expose the same fundamental challenge: 20 annotated training cases are insufficient to distinguish relevant from irrelevant clinical sentences, pointing to data augmentation as the highest-leverage future direction.
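摘要中子任务4的证据句检索采用 BM25、TF-IDF 余弦相似度与微调交叉编码器的三路加权集成。下面用纯标准库给出前两路打分与加权融合的极简示意(分词方式、权重与 k1/b 参数均为假设值,交叉编码器一路从略,并非论文原实现):

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """对每个候选句计算 BM25 得分(简化版,k1/b 为常用默认值)。"""
    toks = [tokenize(d) for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    df = Counter(w for t in toks for w in set(t))
    N = len(docs)
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in set(tokenize(query)):
            if w not in tf:
                continue
            idf = math.log(1 + (N - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

def tfidf_cosine_scores(query, docs):
    """TF-IDF 余弦相似度(简化:idf = log(N/df),query 也计入语料)。"""
    toks = [tokenize(d) for d in docs] + [tokenize(query)]
    N = len(toks)
    df = Counter(w for t in toks for w in set(t))
    def vec(t):
        tf = Counter(t)
        return {w: tf[w] * math.log(N / df[w]) for w in tf}
    qv = vec(toks[-1])
    qn = math.sqrt(sum(v * v for v in qv.values())) or 1.0
    out = []
    for t in toks[:-1]:
        dv = vec(t)
        dn = math.sqrt(sum(v * v for v in dv.values())) or 1.0
        dot = sum(qv.get(w, 0.0) * v for w, v in dv.items())
        out.append(dot / (qn * dn))
    return out

def ensemble_rank(query, sentences, w_bm25=0.6, w_tfidf=0.4):
    """加权融合两路得分,返回按得分降序排列的 (句子, 得分) 列表。"""
    b = bm25_scores(query, sentences)
    t = tfidf_cosine_scores(query, sentences)
    fused = [w_bm25 * x + w_tfidf * y for x, y in zip(b, t)]
    return sorted(zip(sentences, fused), key=lambda p: -p[1])
```

实际系统中可在此基础上叠加交叉编码器得分,并按摘要所述的相对阈值截断候选句。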

[NLP-90] Correcting Suppressed Log-Probabilities in Language Models with Post-Transformer Adapters

【速读】: 该论文旨在解决对齐调整后的语言模型在政治敏感话题上抑制事实性概率的问题,即尽管模型的隐藏表示中仍保留相关知识,但其生成文本时会无意识地压制某些事实的输出。解决方案的关键在于引入一个参数量极小(786K,约为基础模型的0.02%)的后Transformer适配器(adapter),该适配器在冻结的隐藏状态上进行训练,能够有效恢复被压制的事实概率,并在多个Qwen3模型规模(4B、8B、14B)上实现跨数据集泛化。该适配器通过仅在当前预测位置(last-position-only)应用干预,避免了生成过程中的不连贯问题,且在logit空间的操作无法产生连贯输出,表明对隐藏状态的干预是生成修正的正确层级。此外,研究还揭示了一个此前未记录的Apple MLX梯度bug,解释了早期实验中所有无效结果的原因。

链接: https://arxiv.org/abs/2604.14174
作者: Bryan Sanchez
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 12 pages, 3 figures, code at this https URL

点击查看摘要

Abstract:Alignment-tuned language models frequently suppress factual log-probabilities on politically sensitive topics despite retaining the knowledge in their hidden representations. We show that a 786K-parameter (approximately 0.02% of the base model) post-transformer adapter, trained on frozen hidden states, corrects this suppression on 31 ideology-discriminating facts across Qwen3-4B, 8B, and 14B. The adapter memorizes all 15 training facts and generalizes to 11–39% of 16 held-out facts across 5 random splits per scale, with zero knowledge regressions via anchored training. Both gated (SwiGLU) and ungated (linear bottleneck) adapters achieve comparable results; neither consistently outperforms the other (Fisher exact p > 0.09 at all scales). On instruct models, the adapter corrects log-probability rankings. When applied at all token positions during generation, the adapter produces incoherent output; however, when applied only at the current prediction position (last-position-only), the adapter produces coherent, less censored text. A logit-space adapter operating after token projection fails to produce coherent generation at any application mode, suggesting hidden-state intervention is the correct level for generation correction. A previously undocumented silent gradient bug in Apple MLX explains all null results in earlier iterations of this work: the standard pattern nn.value_and_grad(model, fn)(this http URL()) returns zero gradients without error; the correct pattern nn.value_and_grad(model, fn)(model, data) resolves this. We provide a minimal reproduction and discuss implications for other adapter research using MLX.

[NLP-91] Tug-of-War within A Decade: Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generations

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在网络安全漏洞分析中因CVE(Common Vulnerabilities and Exposures)数据频繁更新而导致的知识不一致与冲突问题,这一问题会引发模型检索过时信息、生成事实性错误甚至幻觉结果。解决方案的关键在于提出一种两阶段框架CRVA-TGRAG(Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generation):第一阶段通过父文档分割(Parent Document Segmentation)和基于语义相似度与倒排索引的集成检索机制提升CVE文档的召回准确性;第二阶段采用教师引导的偏好优化技术对LLMs进行微调,增强其基于最新CVE知识生成准确答案的能力。该方法有效缓解了仅依赖LLMs进行知识检索时可能出现的知识冲突与不一致性。

链接: https://arxiv.org/abs/2604.14172
作者: Ziyin Zhou,Jianyi Zhang,Xu ji,Yilong Li,Jiameng Han,Zhangchi Zhao
机构: Beijing Electronic Science and Technology Institute (北京电子科技学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are essential for analyzing and addressing vulnerabilities in cybersecurity. However, of the more than 200,000 vulnerabilities discovered in the past decade, over 30,000 have been changed or updated. This necessitates frequent updates to the training datasets and internal knowledge bases of LLMs to maintain knowledge consistency. In this paper, we focus on the problem of knowledge discrepancy and conflict within CVE (Common Vulnerabilities and Exposures) detection and analysis. This problem hinders LLMs’ ability to retrieve the latest knowledge from original training datasets, leading to knowledge conflicts, fabrications of factually incorrect results, and generation hallucinations. To address this problem, we propose an innovative two-stage framework called CRVA-TGRAG (Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generation). First, to improve document retrieval accuracy during the retrieval stage, we utilize Parent Document Segmentation and an ensemble retrieval scheme based on semantic similarity and inverted indexing. Second, to enhance LLMs’ capabilities based on retrieval over the CVE dataset in the generation stage, we employ a teacher-guided preference optimization technique to fine-tune LLMs. Our framework not only enhances the quality of content retrieval through RAG but also leverages the advantages of preference fine-tuning in LLMs to answer questions more effectively and precisely. Experiments demonstrate that our method achieves higher accuracy in retrieving the latest CVEs compared to external knowledge bases. In conclusion, our framework significantly mitigates potential knowledge conflicts and inconsistencies that may arise from relying solely on LLMs for knowledge retrieval.
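摘要中"父文档分割"(Parent Document Segmentation)的检索思路是:用小的子块做精确匹配,但返回其所属的完整父文档作为上下文。下面给出一个示意实现(匹配打分用简单的词重叠计数,块大小等参数均为假设,并非论文原方案):

```python
from collections import Counter

class ParentDocumentRetriever:
    """子块检索、父文档返回:小块利于精确匹配,父文档保留完整上下文。"""

    def __init__(self, chunk_size=8):
        self.chunk_size = chunk_size   # 每个子块的词数(假设值)
        self.chunks = []               # [(父文档编号, 子块词表)]
        self.parents = []

    def add_document(self, text):
        pid = len(self.parents)
        self.parents.append(text)
        words = text.lower().split()
        for i in range(0, len(words), self.chunk_size):
            self.chunks.append((pid, words[i:i + self.chunk_size]))

    def retrieve(self, query, top_k=1):
        q = Counter(query.lower().split())
        scored = []
        for pid, chunk in self.chunks:
            c = Counter(chunk)
            scored.append((sum(min(q[w], c[w]) for w in q), pid))
        # 同一父文档取其最高子块得分,再按得分排序返回父文档全文
        best = {}
        for s, pid in scored:
            best[pid] = max(best.get(pid, 0), s)
        ranked = sorted(best.items(), key=lambda p: -p[1])
        return [self.parents[pid] for pid, s in ranked[:top_k] if s > 0]
```

实际系统中子块打分可替换为语义向量相似度与倒排索引的集成,与摘要所述一致。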

[NLP-92] Benchmarking Linguistic Adaptation in Comparable-Sized LLMs: A Study of Llama-3.1-8B, Mistral-7B-v0.1 and Qwen3-8B on Romanized Nepali

【速读】: 该论文旨在解决罗马化尼泊尔语(Romanized Nepali)在大型语言模型(Large Language Models, LLMs)中严重资源匮乏的问题,即如何在低资源条件下实现对这一非标准书写形式的语言的有效适应与生成。解决方案的关键在于通过系统性基准测试比较三种可比规模的开源权重模型(Llama-3.1-8B、Mistral-7B-v0.1 和 Qwen3-8B),采用量化低秩适配(QLoRA)结合秩稳定 LoRA(rsLoRA)的方法,在仅微调约 1% 参数的情况下,利用双 NVIDIA Tesla T4 GPU 在不到 27 GPU 小时内完成高效训练。实验表明,尽管零样本阶段所有模型均无法生成罗马化尼泊尔语且表现各异,但经微调后三者均显著提升性能,其中 Qwen3-8B 在语义相关性和结构对齐指标上最优,而 Llama-3.1-8B 虽零样本最差,却展现出最大微调增益,验证了“适应潜力假设”,为低资源语言的迭代式开发提供了明确路径。

链接: https://arxiv.org/abs/2604.14171
作者: Ananda Rimal(Nepal Engineering College),Adarsha Rimal(Tribhuvan University)
机构: Nepal Engineering College (尼泊尔工程学院); Tribhuvan University (特里布文大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 31 pages, 4 figures, 14 tables

点击查看摘要

Abstract:Romanized Nepali, the Nepali language written in the Latin alphabet, is the dominant medium for informal digital communication in Nepal, yet it remains critically underresourced in the landscape of Large Language Models (LLMs). This study presents a systematic benchmarking of linguistic adaptation across three comparable-sized open-weight models: Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B. We evaluate these architectures under zero-shot and fine-tuned settings using a curated bilingual dataset of 10,000 transliterated instruction-following samples. Performance is quantified across five metrics spanning seven measurement dimensions: Perplexity (PPL), BERTScore, chrF++, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU, capturing fluency, phonetic consistency, and semantic integrity. Models were fine-tuned using Quantized Low-Rank Adaptation (QLoRA) with Rank-Stabilized LoRA (rsLoRA) at rank r=32 on dual NVIDIA Tesla T4 GPUs, training only approximately 1% of each model’s parameters in under 27 total GPU-hours. At zero-shot, all three models fail to generate Romanized Nepali, each exhibiting a distinct architecture-specific failure mode. Following fine-tuning, all three resolve these failures and converge to BERTScore approximately 0.75 and chrF++ greater than 23. Overall dimension-wise assessment across ten criteria identifies Qwen3-8B as the overall recommended architecture, being the only model to produce semantically relevant zero-shot output and leading all structural alignment metrics post-SFT. The adaptation headroom hypothesis is confirmed: Llama-3.1-8B, despite its weakest zero-shot baseline, achieves the largest absolute fine-tuning gains in PPL (Delta = -49.77) and BERTScore (Delta = +0.3287), making it the preferred choice for iterative low-resource development pipelines. This work establishes the first rigorous baseline for Romanized Nepali adaptation in comparable-sized open-weight LLMs.
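摘要中使用了 rank-stabilized LoRA(rsLoRA,r=32)。它与普通 LoRA 的区别只在缩放因子:由 α/r 改为 α/√r,使高秩下低秩更新的幅度不随 r 增大而过快衰减。下面用 numpy 给出前向计算的示意(维度、初始化与 α 取值均为演示假设):

```python
import math
import numpy as np

def lora_forward(x, W, A, B, alpha, rank_stabilized=True):
    """LoRA 前向: y = xW + scale * (xA)B。
    rsLoRA 的 scale = alpha/sqrt(r);普通 LoRA 的 scale = alpha/r。"""
    r = A.shape[1]
    scale = alpha / math.sqrt(r) if rank_stabilized else alpha / r
    return x @ W + scale * (x @ A) @ B

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 16, 32, 16   # r=32 对应摘要设置,其余为假设
x = rng.normal(size=(2, d_in))
W = rng.normal(size=(d_in, d_out))       # 冻结的预训练权重
A = rng.normal(size=(d_in, r)) * 0.01    # A 小随机初始化
B = np.zeros((r, d_out))                  # B 置零 => 初始时 LoRA 不改变输出

y_base = x @ W
y_rs = lora_forward(x, W, A, B, alpha, rank_stabilized=True)
assert np.allclose(y_base, y_rs)          # B=0 时与基座输出一致
```

在相同 A、B 下,rsLoRA 的更新量恰为普通 LoRA 的 √r 倍,这正是其"秩稳定"的来源。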

[NLP-93] Stateful Evidence-Driven Retrieval-Augmented Generation with Iterative Reasoning

【速读】: 该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)方法中存在的问题,即由于扁平化的上下文表示和无状态的检索机制导致的性能不稳定。其解决方案的关键在于提出一种基于状态化证据驱动的迭代推理框架(Stateful Evidence-Driven RAG with Iterative Reasoning),该框架将问答过程建模为渐进式证据累积过程:通过将检索到的文档转化为带有显式相关性和置信度信号的结构化推理单元,并维护一个持久化的证据池以保存支持性与非支持性信息;同时引入证据驱动的缺陷分析机制识别知识缺口与冲突,并迭代优化查询以引导下一轮检索,从而实现稳定且鲁棒的证据聚合。

链接: https://arxiv.org/abs/2604.14170
作者: Qi Dong,Ziheng Lin,Ning Ding
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) in external knowledge but often suffers from flat context representations and stateless retrieval, leading to unstable performance. We propose Stateful Evidence-Driven RAG with Iterative Reasoning, a framework that models question answering as a progressive evidence accumulation process. Retrieved documents are converted into structured reasoning units with explicit relevance and confidence signals and maintained in a persistent evidence pool capturing both supportive and non-supportive information. The framework performs evidence-driven deficiency analysis to identify gaps and conflicts and iteratively refines queries to guide subsequent retrieval. This iterative reasoning process enables stable evidence aggregation and improves robustness to noisy retrieval. Experiments on multiple question answering benchmarks demonstrate consistent improvements over standard RAG and multi-step baselines, while effectively accumulating high-quality evidence and maintaining stable performance under substantial retrieval noise.
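摘要中"持久化证据池 + 缺陷分析"的循环可以抽象为如下数据结构示意(字段名与阈值均为假设;真实系统中相关性与置信度来自模型打分,此处仅演示状态维护与缺陷判定逻辑):

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    text: str
    relevance: float    # 与问题的相关性(0~1,假设由模型给出)
    confidence: float   # 证据本身的可信度(0~1)
    supportive: bool    # 支持还是反驳当前答案假设

@dataclass
class EvidencePool:
    items: list = field(default_factory=list)

    def add(self, ev):
        self.items.append(ev)

    def deficiency(self, min_support=2, min_conf=0.6):
        """缺陷分析:证据不足或存在冲突时返回下一轮检索的信号。"""
        strong = [e for e in self.items
                  if e.supportive and e.relevance * e.confidence >= min_conf]
        conflicts = [e for e in self.items if not e.supportive]
        if len(strong) < min_support:
            return "insufficient"
        if conflicts:
            return "conflict"
        return "sufficient"
```

外层迭代即为:当 `deficiency()` 非 "sufficient" 时改写查询、再次检索并向池中追加证据,直至证据充分。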

[NLP-94] Chronological Knowledge Retrieval: A Retrieval-Augmented Generation Approach to Construction Project Documentation

【速读】: 该论文旨在解决大型建设项目中决策记录难以追溯的问题,尤其是在会议纪要(meeting minutes)中,由于决策可能推翻先前结论,专业人员需重建特定选择的历史演变过程。手动从原始档案中检索此类信息不仅效率低下,还易出错。为此,论文提出通过自然语言交互方式实现对全部会议纪要的对话式访问,使用户能够以自然语言提问并获得语义相关且带时间标注的答案,从而清晰追踪决策的时间脉络。解决方案的关键在于采用检索增强生成(Retrieval-Augmented Generation, RAG)框架,融合语义搜索与大语言模型(Large Language Models, LLMs),确保响应的准确性与上下文敏感性,同时基于比利时一家大型公司提供的匿名化项目会议纪要数据集进行验证,并提供标注和专家定义查询用于系统评估,推动面向时序标注项目文档的对话式访问研究发展。

链接: https://arxiv.org/abs/2604.14169
作者: Ioannis-Aris Kostis,Natalia Sanchiz,Steeve De Schryver,François Denis,Pierre Schaus
机构: Université Catholique de Louvain (鲁汶大学); Buildwise; BPC Group
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In large-scale construction projects, the continuous evolution of decisions generates extensive records, most often captured in meeting minutes. Since decisions may override previous ones, professionals often need to reconstruct the history of specific choices. Retrieving such information manually from raw archives is both labor-intensive and error-prone. From a user perspective, we address this challenge by enabling conversational access to the whole set of project meeting minutes. Professionals can pose natural-language questions and receive answers that are both semantically relevant and explicitly time-annotated, allowing them to follow the chronology of decisions. From a technical perspective, our solution employs a Retrieval-Augmented Generation (RAG) framework that integrates semantic search with large language models to ensure accurate and context-aware responses. We demonstrate the approach using an anonymized, industry-sourced dataset of meeting minutes from a completed construction project by a large company in Belgium. The dataset is annotated and enriched with expert-defined queries to support systematic evaluation. Both the dataset and the open-source implementation are made available to the community to foster further research on conversational access to time-annotated project documentation.
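摘要中"按时间标注回答"的关键是在检索结果上保留会议日期并按时间排序,让用户看到决策的演变链。以下为极简示意(相关性判定用词重叠代替语义检索,仅为演示,非论文实现):

```python
from datetime import date

def build_timeline(query, minutes):
    """minutes: [(date, text)] 的会议纪要列表;
    返回与 query 相关、按日期升序并带时间标注的条目。"""
    q = set(query.lower().split())
    hits = [(d, t) for d, t in minutes if q & set(t.lower().split())]
    hits.sort(key=lambda p: p[0])          # 按会议日期升序
    return [f"[{d.isoformat()}] {t}" for d, t in hits]
```

这样,后一次会议推翻前一次结论时,用户能从时间线上直接看到"提议 -> 决策"的先后关系。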

[NLP-95] SAGE Celer 2.6 Technical Card

【速读】: 该论文旨在解决大模型在复杂推理任务中易产生级联错误和幻觉(hallucination)、多模态处理依赖Adapter导致性能瓶颈,以及对南亚语言支持不足的问题。解决方案的关键在于:首先,采用逆向推理(Inverse Reasoning, IR)流水线使模型能够自验证逻辑路径,从而降低推理过程中的错误传播;其次,集成端到端视觉编码器以避免Adapter方法常见的性能损失;最后,针对南亚语言(如尼泊尔语和印地语)设计专用分词器(tokenizer)并优化其语言能力,同时保持英文推理性能不下降。

链接: https://arxiv.org/abs/2604.14168
作者: SAGEA Research Team,Basab Jha,Firoj Paudel,Ujjwal Puri,Adrian Liu,Ethan Henkel,Zhang Yuting,Mateusz Kowalczyk,Mei Huang,Choi Donghyuk,Wang Junhao
机构: SAGEA; Tribhuwan University | Vedas College; Tribhuwan University | Madan Bhandari Memorial College; Fudan University; ETH Zurich
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 28 pages, 14 figures

点击查看摘要

Abstract:We introduce SAGE Celer 2.6, the latest in our line of general-purpose Celer models from SAGEA. Celer 2.6 is available in 5B, 10B, and 27B parameter sizes and benefits from extensive architectural modifications and further pre-training on an undisclosed model. Using our Inverse Reasoning (IR) pipeline, SAGEA natively trains Celer 2.6 to validate its own logic paths, minimizing cascading error and hallucination in complex reasoning tasks. Celer 2.6 also boasts natively integrated multimodal functionality with an end-to-end vision encoder to avoid common pitfalls in adapter-based approaches. Celer 2.6 provides highly competitive results on mathematics, coding, and general intelligence benchmarks (ACUMEN), along with low latency. Most importantly, Celer 2.6 is specifically optimized for South Asian language support, with a custom tokenizer for the Devanagari script and strong performance in both Nepali and Hindi without sacrificing English reasoning ability.

[NLP-96] Chinese Essay Rhetoric Recognition Using LoRA In-context Learning and Model Ensemble CCL2025

【速读】: 该论文旨在解决中文作文中修辞识别(rhetoric recognition)的问题,这是自动作文评分(automated essay scoring)中的关键环节,有助于AI系统更准确地评估学生的语言能力和高阶思维技能。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)结合低秩适配(Low-Rank Adaptation, LoRA)微调与上下文学习(in-context learning)来注入修辞知识,并将输出结构化为JSON格式以获得清晰的结构化结果,同时通过多种模型集成方法进一步提升性能。该方法在CCL 2025中文作文修辞识别评测任务的三个赛道上均取得最佳成绩,荣获一等奖。

链接: https://arxiv.org/abs/2604.14167
作者: Yuxuan Lai,Xiajing Wang,Chen Zheng
机构: The Open University of China (中国开放大学); Engineering Research Center of Integration and Application of Digital Learning Technology, Ministry of Education (教育部数字学习技术集成与应用工程研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by CCL2025

点击查看摘要

Abstract:Rhetoric recognition is a critical component in automated essay scoring. By identifying rhetorical elements in student writing, AI systems can better assess linguistic and higher-order thinking skills, making it an essential task in the area of AI for education. In this paper, we leverage Large Language Models (LLMs) for the Chinese rhetoric recognition task. Specifically, we explore Low-Rank Adaptation (LoRA) based fine-tuning and in-context learning to integrate rhetoric knowledge into LLMs. We formulate the outputs as JSON to obtain structural outputs and translate keys to Chinese. To further enhance the performance, we also investigate several model ensemble methods. Our method achieves the best performance on all three tracks of CCL 2025 Chinese essay rhetoric recognition evaluation task, winning the first prize.
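摘要中将模型输出规范为 JSON 并做多模型集成。对分类型的修辞识别,一种常见的集成方式是对各模型解析出的标签做多数投票;以下为示意(JSON 键名与平票规则均为假设,非论文确切方案):

```python
import json
from collections import Counter

def parse_label(output, key="rhetoric"):
    """从模型的 JSON 输出中解析标签,解析失败返回 None。"""
    try:
        return json.loads(output).get(key)
    except (json.JSONDecodeError, AttributeError):
        return None

def majority_vote(outputs, key="rhetoric"):
    labels = [l for l in (parse_label(o, key) for o in outputs) if l is not None]
    if not labels:
        return None
    # 平票时 Counter.most_common 按标签首次出现顺序取前者
    return Counter(labels).most_common(1)[0][0]
```

结构化为 JSON 的好处在此可见:无法解析的输出被直接剔除,不会污染投票结果。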

[NLP-97] Hierarchical Retrieval Augmented Generation for Adversarial Technique Annotation in Cyber Threat Intelligence Text

【速读】: 该论文旨在解决将网络威胁情报(Cyber Threat Intelligence, CTI)文本自动映射到MITRE ATT&CK框架中技术标识符(technique IDs)的问题,以提升对攻击者行为的理解与自动化防御能力。现有基于检索增强生成(Retrieval-Augmented Generation, RAG)的方法虽具潜力,但其采用扁平化检索机制,未能利用ATT&CK框架内在的战术-技术层级结构(tactic-technique taxonomy),导致检索效率低且生成精度受限。解决方案的关键在于提出H-TechniqueRAG——一种引入战术层级先验知识的分层RAG框架:通过两阶段检索机制(先定位宏观战术,再聚焦战术内技术)将候选搜索空间缩小77.5%;同时设计战术感知重排序模块和层次约束的上下文组织策略,有效缓解大语言模型(Large Language Model, LLM)上下文过载问题,显著提升推理准确性与效率,最终在F1分数上优于当前最优方法TechniqueRAG 3.8%,并降低62.4%的推理延迟和60%的LLM API调用次数。

链接: https://arxiv.org/abs/2604.14166
作者: Filippo Morbiato,Markus Keller,Priya Nair,Luca Romano
机构: University of Padua Italy (帕多瓦大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Mapping Cyber Threat Intelligence (CTI) text to MITRE ATT&CK technique IDs is a critical task for understanding adversary behaviors and automating threat defense. While recent Retrieval-Augmented Generation (RAG) approaches have demonstrated promising capabilities in this domain, they fundamentally rely on a flat retrieval paradigm. By treating all techniques uniformly, these methods overlook the inherent taxonomy of the ATT&CK framework, where techniques are structurally organized under high-level tactics. In this paper, we propose H-TechniqueRAG, a novel hierarchical RAG framework that injects this tactic-technique taxonomy as a strong inductive bias to achieve highly efficient and accurate annotation. Our approach introduces a two-stage hierarchical retrieval mechanism: it first identifies the macro-level tactics (the adversary’s technical goals) and subsequently narrows the search to techniques within those tactics, effectively reducing the candidate search space by 77.5%. To further bridge the gap between retrieval and generation, we design a tactic-aware reranking module and a hierarchy-constrained context organization strategy that mitigates LLM context overload and improves reasoning precision. Comprehensive experiments across three diverse CTI datasets demonstrate that H-TechniqueRAG not only outperforms the state-of-the-art TechniqueRAG by 3.8% in F1 score, but also achieves a 62.4% reduction in inference latency and a 60% decrease in LLM API calls. Further analysis reveals that our hierarchical structural priors equip the model with superior cross-domain generalization and provide security analysts with highly interpretable, step-by-step decision paths.
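摘要中的两阶段层级检索可以概括为:先在战术(tactic)层打分定位,再只在选中战术下属的技术(technique)中检索,从而缩小候选空间。下面给出一个示意(打分用词重叠代替真实检索模型,分类树为虚构的小样例,并非 ATT&CK 原文):

```python
def overlap(query, text):
    return len(set(query.lower().split()) & set(text.lower().split()))

def hierarchical_retrieve(query, taxonomy, top_tactics=1):
    """taxonomy: {tactic描述: {technique_id: technique描述}}。
    第一阶段选 tactic,第二阶段仅在其下属 technique 内打分。
    返回 (最佳 technique_id, 候选空间缩减比例)。"""
    tactics = sorted(taxonomy, key=lambda t: -overlap(query, t))[:top_tactics]
    candidates = {tid: desc for t in tactics for tid, desc in taxonomy[t].items()}
    if not candidates:
        return None, 0.0
    best = max(candidates, key=lambda tid: overlap(query, candidates[tid]))
    total = sum(len(v) for v in taxonomy.values())
    reduction = 1 - len(candidates) / total
    return best, reduction
```

候选空间缩减比例随战术层筛选力度增大而提高,对应摘要中 77.5% 的量级。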

[NLP-98] EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews

【速读】: 该论文旨在解决临床试验文献中多模态证据(如文本、表格和图表)自动化提取的准确性与可审计性问题,尤其在生成式 AI (Generative AI) 应用于证据合成流程时缺乏细粒度溯源机制的痛点。解决方案的关键在于构建一个由多智能体协同驱动的提取系统 EviSearch:其核心组件包括保留原始排版和图形信息的 PDF 查询代理(PDF-query agent)、基于检索引导的搜索代理(retrieval-guided search agent),以及强制页面级验证的 reconciler 模块——当多个代理对同一单元格产生分歧时触发人工审核,从而确保每个数据单元都有明确的出处记录(per-cell provenance)。该设计不仅显著提升了跨模态证据提取的精度,还通过结构化日志记录人工修正行为,为模型迭代提供了监督信号,最终实现面向临床专家可审查、可纠正的自动化证据表生成。

链接: https://arxiv.org/abs/2604.14165
作者: Naman Ahuja,Saniya Mulla,Muhammad Ali Khan,Zaryab Bin Riaz,Kaneez Zahra Rubab Khakwani,Mohamad Bassam Sonbol,Irbaz Bin Riaz,Vivek Gupta
机构: Arizona State University(亚利桑那州立大学); Mayo Clinic(梅奥诊所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present EviSearch, a multi-agent extraction system that automates the creation of ontology-aligned clinical evidence tables directly from native trial PDFs while guaranteeing per-cell provenance for audit and human verification. EviSearch pairs a PDF-query agent (which preserves rendered layout and figures) with a retrieval-guided search agent and a reconciliation module that forces page-level verification when agents disagree. The pipeline is designed for high-precision extraction across multimodal evidence sources (text, tables, figures) and for generating reviewer-actionable provenance that clinicians can inspect and correct. On a clinician-curated benchmark of oncology trial papers, EviSearch substantially improves extraction accuracy relative to strong parsed-text baselines while providing comprehensive attribution coverage. By logging reconciler decisions and reviewer edits, the system produces structured preference and supervision signals that bootstrap iterative model improvement. EviSearch is intended to accelerate living systematic review workflows, reduce manual curation burden, and provide a safe, auditable path for integrating LLM-based extraction into evidence synthesis pipelines.
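摘要中 reconciler 的核心逻辑是:两个抽取代理对同一单元格给出一致结果则直接采纳,不一致则标记为需页面级人工核验,并完整记录出处。以下为示意(记录字段与状态名均为假设,非论文确切接口):

```python
def reconcile(cell_id, agent_a, agent_b):
    """agent_x: {"value": ..., "page": int}。
    返回带逐单元格出处(provenance)与状态的记录。"""
    record = {
        "cell": cell_id,
        "provenance": {"agent_a": agent_a, "agent_b": agent_b},
    }
    if agent_a["value"] == agent_b["value"]:
        record.update(status="accepted", value=agent_a["value"])
    else:
        # 分歧触发页面级核验:给出两个代理各自引用的页码供人工复核
        record.update(status="needs_review", value=None,
                      review_pages=sorted({agent_a["page"], agent_b["page"]}))
    return record
```

被标记为 `needs_review` 的记录连同人工改正一起落盘,即可作为摘要所述的监督信号用于迭代改进。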

[NLP-99] How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

【速读】: 该论文旨在解决当前基于强教师模型生成合成数据进行监督微调(SFT)时,学生模型(如Qwen3-8B)在推理能力提升上效果不佳甚至性能下降的问题。研究表明,造成这一现象的主要原因是教师生成数据与学生模型分布之间存在显著的风格差异(stylistic divergence)。解决方案的关键在于提出一种教师-学生协作式数据合成框架(TESSY),通过交替生成风格相关和非风格相关的token,使合成数据既保留教师模型的先进推理能力,又保持与学生模型风格的一致性,从而有效提升学生模型在代码生成等任务上的性能表现。

链接: https://arxiv.org/abs/2604.14164
作者: Zixian Huang,Kaichen Yang,Xu Huang,Feiyang Hao,Qiming Ge,Bowen Li,He Du,Kai Chen,Qipeng Guo
机构: Shanghai AI Laboratory(上海人工智能实验室); Dalian University of Technology(大连理工大学); Nanjing University(南京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even lead to a substantial drop in performance. In this work, we identify substantial stylistic divergence between teacher generated data and the distribution of student as a major factor impacting SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models to alternately generate style and non-style tokens. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while maintaining stylistic consistency with the distribution of the student. In experiments on code generation using GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%.
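TESSY"教师与学生交替生成风格/非风格 token"的思路可以抽象为如下玩具示意。这里用"内容 token 采用教师输出、风格词改由学生补出"的粗略规则代替论文的真实判别与采样机制,纯属演示:

```python
def cooperative_generate(teacher_next, student_next, is_style_token,
                         prompt, max_tokens=10):
    """交替生成:内容 token 用教师输出,风格 token 改用学生输出,
    使序列兼具教师的推理内容与学生的表述风格。"""
    seq = list(prompt)
    for _ in range(max_tokens):
        tok = teacher_next(seq)
        if tok is None:          # 教师生成结束
            break
        if is_style_token(tok):  # 风格位交给学生,保持学生分布
            tok = student_next(seq) or tok
        seq.append(tok)
    return seq
```

下例中教师想写连接词 "thus",被替换为学生习惯的 "so",而数学内容原样保留。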

[NLP-100] SeaAlert: Critical Information Extraction From Maritime Distress Communications with Large Language Models

【速读】: 该论文旨在解决海上遇险通信(Maritime Distress Communications)在实际应用中因信息简短、噪声干扰、语音失真及自动语音识别(ASR)错误导致的自动分析困难问题。其解决方案的关键在于提出SeaAlert框架,该框架基于大语言模型(LLM)构建,并通过一套合成数据生成流程来克服真实标注数据稀缺的问题:首先利用LLM生成多样且逼真的遇险消息(包括偏离标准格式的挑战性变体),随后将这些文本合成语音并叠加模拟VHF信道噪声,再经由ASR系统转录,从而获得贴近真实场景的带噪文本数据,用于训练和优化鲁棒的遇险通信分析模型。

链接: https://arxiv.org/abs/2604.14163
作者: Tomer Atia,Yehudit Aperstein,Alexander Apartsin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 8 figures

点击查看摘要

Abstract:Maritime distress communications transmitted over very high frequency (VHF) radio are safety-critical voice messages used to report emergencies at sea. Under the Global Maritime Distress and Safety System (GMDSS), such messages follow standardized procedures and are expected to convey essential details, including vessel identity, position, nature of the distress, and required assistance. In practice, however, automatic analysis remains difficult because distress messages are often brief, noisy, and produced under stress, may deviate from the prescribed format, and are further degraded by automatic speech recognition (ASR) errors caused by channel noise and speaker stress. This paper presents SeaAlert, an LLM-based framework for robust analysis of maritime distress communications. To address the scarcity of labeled real-world data, we develop a synthetic data generation pipeline in which an LLM produces realistic and diverse maritime messages, including challenging variants in which standard distress codewords are omitted or replaced with less explicit expressions. The generated utterances are synthesized into speech, degraded with simulated VHF noise, and transcribed by an ASR system to obtain realistic noisy transcripts.
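摘要合成流水线中"叠加模拟 VHF 信道噪声"的一步,常见做法是按目标信噪比(SNR)向语音波形加入高斯白噪声;numpy 示意如下(SNR 取值与随机种子均为假设,真实 VHF 信道还包含带限与衰落等效应,此处从略):

```python
import numpy as np

def add_channel_noise(signal, snr_db, rng=None):
    """按给定 SNR(dB) 向信号叠加高斯白噪声。"""
    rng = rng or np.random.default_rng(0)
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))     # 由 SNR 反推噪声功率
    noise = rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)
    return signal + noise
```

加噪后的波形再送入 ASR,即得到摘要所述贴近真实场景的带噪转录文本。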

[NLP-101] Decoupling Scores and Text: The Politeness Principle in Peer Review

【速读】: 该论文旨在解决科研作者在解读同行评审反馈时面临的困惑问题,即如何准确判断论文是否被接收,尤其是在面对模糊或礼貌性措辞时容易产生误判。其关键解决方案在于通过构建包含超过30,000篇ICLR 2021–2025投稿的数据集,对比基于数值评分(numerical scores)与基于文本评论(text reviews)的接受预测性能,发现评分模型准确率达91%,而即使使用大语言模型(large language models)处理文本信息,准确率也仅为81%。研究进一步揭示,高偏度和高峰度的分数分布表明个别低分对拒稿具有决定性影响,且文本中普遍存在的“礼貌原则”(Politeness Principle)使得拒稿论文的评论仍以正面情感词为主,掩盖了真实的拒稿信号,从而解释了为何仅凭文本难以准确判断结果。

链接: https://arxiv.org/abs/2604.14162
作者: Yingxuan Wen
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Authors often struggle to interpret peer review feedback, deriving false hope from polite comments or feeling confused by specific low scores. To investigate this, we construct a dataset of over 30,000 ICLR 2021-2025 submissions and compare acceptance prediction performance using numerical scores versus text reviews. Our experiments reveal a significant performance gap: score-based models achieve 91% accuracy, while text-based models reach only 81% even with large language models, indicating that textual information is considerably less reliable. To explain this phenomenon, we first analyze the 9% of samples that score-based models fail to predict, finding their score distributions exhibit high kurtosis and negative skewness, which suggests that individual low scores play a decisive role in rejection even when the average score falls near the borderline. We then examine why text-based accuracy significantly lags behind scores from a review sentiment perspective, revealing the prevalence of the Politeness Principle: reviews of rejected papers still contain more positive than negative sentiment words, masking the true rejection signal and making it difficult for authors to judge outcomes from text alone.
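摘要用峰度与负偏度刻画"个别低分起决定作用"的评分分布。以下为样本偏度与超额峰度的矩估计示意(采用总体矩定义,未做小样本修正):

```python
def skewness(xs):
    """三阶标准化矩:负值表示长尾在低分一侧。"""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 3 for x in xs) / n / s2 ** 1.5

def excess_kurtosis(xs):
    """四阶标准化矩减 3(正态分布为 0)。"""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 4 for x in xs) / n / s2 ** 2 - 3
```

例如一篇均分接近边界、但含单个极低分的论文,其评分 `[6, 6, 7, 2]` 的偏度为负,与摘要所述"个别低分决定拒稿"的分布形态一致。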

[NLP-102] Can Large Language Models Detect Methodological Flaws? Evidence from Gesture Recognition for UAV-Based Rescue Operation Based on Deep Learning

【速读】: 该论文旨在解决机器学习研究中因方法学缺陷(特别是数据泄露)导致评估结果不可靠的问题,从而影响研究成果的可信度和可复现性。其解决方案的关键在于验证大型语言模型(Large Language Models, LLMs)是否能作为独立的分析代理,仅基于已发表的研究文献自动识别常见方法学错误。研究通过案例分析表明,六种前沿LLM在未获额外背景信息的情况下,均能一致识别出手势识别论文中存在的样本级数据泄露问题,并指出其归因于训练与测试集划分不独立,依据包括重叠的学习曲线、极小的泛化差距以及接近完美的分类性能等指标。这一发现揭示了LLM在辅助科学审计和提升研究可复现性方面的潜力。

链接: https://arxiv.org/abs/2604.14161
作者: Domonkos Varga
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reliable evaluation is essential in machine learning research, yet methodological flaws, particularly data leakage, continue to undermine the validity of reported results. In this work, we investigate whether large language models (LLMs) can act as independent analytical agents capable of identifying such issues in published studies. As a case study, we analyze a gesture-recognition paper reporting near-perfect accuracy on a small, human-centered dataset. We first show that the evaluation protocol is consistent with subject-level data leakage due to non-independent training and test splits. We then assess whether this flaw can be detected independently by six state-of-the-art LLMs, each analyzing the original paper without prior context using an identical prompt. All models consistently identify the evaluation as flawed and attribute the reported performance to non-independent data partitioning, supported by indicators such as overlapping learning curves, minimal generalization gap, and near-perfect classification results. These findings suggest that LLMs can detect common methodological issues based solely on published artifacts. While not definitive, their consistent agreement highlights their potential as complementary tools for improving reproducibility and supporting scientific auditing.
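摘要指出的"被试级数据泄露"源于训练/测试划分未按被试独立:同一人的样本同时出现在两侧,使模型记住个体特征而非手势本身。避免方式是按 subject id 分组划分,示意如下(数据结构为假设的 (被试, 特征, 标签) 三元组):

```python
def subject_level_split(samples, test_subjects):
    """samples: [(subject_id, features, label)]。
    同一被试的全部样本只进入训练或测试中的一侧。"""
    train = [s for s in samples if s[0] not in test_subjects]
    test = [s for s in samples if s[0] in test_subjects]
    # 泄露检查:训练集与测试集的被试集合必须不相交
    assert not ({s[0] for s in train} & {s[0] for s in test})
    return train, test
```

在这种划分下,学习曲线重叠、泛化差距极小等摘要所述的泄露迹象便不会出现于正确评估中。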

[NLP-103] HUOZIIME: An On-Device LLM-enhanced Input Method for Deep Personalization

【速读】: 该论文旨在解决移动输入法编辑器(IME)在个性化文本生成方面的局限性,即当前IME主要依赖手动输入,难以实现高效、实时且隐私保护的个性化文本生成。其核心解决方案是提出HUOZIIME,一个基于轻量级大语言模型(LLM)的本地化个性化输入法。关键创新在于:首先通过在合成个性化数据上对基础LLM进行后训练,赋予其初始类人预测能力;其次设计了一种分层记忆机制,持续捕获并利用用户特定的输入历史以实现高保真度的记忆驱动个性化;最后针对移动端部署进行了系统级优化,确保低延迟与高效率的运行性能。

链接: https://arxiv.org/abs/2604.14159
作者: Baocai Shan,Yuzhuang Xu,Wanxiang Che
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mobile input method editors (IMEs) are the primary interface for text input, yet they remain constrained to manual typing and struggle to produce personalized text. While lightweight large language models (LLMs) make on-device auxiliary generation feasible, enabling deeply personalized, privacy-preserving, and real-time generative IMEs poses fundamental challenges. To this end, we present HUOZIIME, a personalized on-device IME powered by an LLM. We endow HUOZIIME with initial human-like prediction ability by post-training a base LLM on synthesized personalization data. Notably, a hierarchical memory mechanism is designed to continually capture and leverage user-specific input history. Furthermore, we perform systemic optimizations tailored to on-device LLM-based IME deployment, ensuring efficient and responsive operation under mobile constraints. Experiments demonstrate efficient on-device execution and high-fidelity memory-driven personalization. Code and package are available at this https URL.

[NLP-104] MemGround: Long-Term Memory Evaluation Kit for Large Language Models in Gamified Scenarios

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)长期记忆评估方法过于静态的问题,即现有评测主要聚焦于简单信息检索和短上下文推理,忽视了复杂记忆系统中动态状态追踪、时间关联建模及层级推理等关键能力。其解决方案的核心是提出MemGround——一个基于丰富游戏化交互场景的严谨长期记忆基准测试框架,通过三层级结构(表面状态记忆、时间关联记忆和基于推理的记忆)系统性地评估模型在连续交互中的多维记忆能力,并引入包含问答得分(QA Overall)、记忆片段解锁数(MFU)、正确顺序记忆片段数(MFCO)以及探索轨迹图(ETD)在内的多维指标体系,以全面量化记忆利用效率与行为路径特征。

链接: https://arxiv.org/abs/2604.14158
作者: Yihang Ding,Wanke Xia,Yiting Zhao,Jinbo Su,Jialiang Yang,Zhengbo Zhang,Ke Wang,Wenming Yang
机构: Tsinghua University (清华大学); Renmin University of China (中国人民大学); CASIA (中国科学院自动化研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current evaluations of long-term memory in LLMs are fundamentally static. By fixating on simple retrieval and short-context inference, they neglect the multifaceted nature of complex memory systems, such as dynamic state tracking and hierarchical reasoning in continuous interactions. To overcome these limitations, we propose MemGround, a rigorous long-term memory benchmark natively grounded in rich, gamified interactive scenarios. To systematically assess these capabilities, MemGround introduces a three-tier hierarchical framework that evaluates Surface State Memory, Temporal Associative Memory, and Reasoning-Based Memory through specialized interactive tasks. Furthermore, to comprehensively quantify both memory utilization and behavioral trajectories, we propose a multi-dimensional metric suite comprising Question-Answer Score (QA Overall), Memory Fragments Unlocked (MFU), Memory Fragments with Correct Order (MFCO), and Exploration Trajectory Diagrams (ETD). Extensive experiments reveal that state-of-the-art LLMs and memory agents still struggle with sustained dynamic tracking, temporal event association, and complex reasoning derived from long-term accumulated evidence in interactive environments.

[NLP-105] Compressed-Sensing-Guided Inference-Aware Structured Reduction for Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段因参数量庞大、内存占用高和解码延迟大而导致的部署效率瓶颈问题。现有方法如剪枝与结构化稀疏化虽可压缩模型但多为静态优化,而提示压缩(prompt compression)虽能减少输入序列长度,却无法动态适配执行的模型子网络。其关键解决方案是提出一个统一的压缩感知引导框架(compressed-sensing-guided framework),通过随机测量算子探测模型潜在使用路径,利用稀疏恢复算法估计任务条件和token自适应的支持集,并将其编译为GPU高效稀疏执行路径(覆盖块、注意力头、通道及前馈子结构)。该框架引入五大核心贡献:任务条件测量、token自适应恢复、理论采样复杂度边界、硬件友好编译约束以及联合优化目标,从而将LLM推理重构为具有显式近似保证和面向部署加速约束的测量-恢复问题。

链接: https://arxiv.org/abs/2604.14156
作者: Andrew Kiruluta
机构: UC Berkeley School of Information (加州大学伯克利分校信息学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models deliver strong generative performance but at the cost of massive parameter counts, memory use, and decoding latency. Prior work has shown that pruning and structured sparsity can preserve accuracy under substantial compression, while prompt-compression methods reduce latency by removing redundant input tokens. However, these two directions remain largely separate. Most model-compression methods are static and optimized offline, and they do not exploit the fact that different prompts and decoding steps activate different latent computational pathways. Prompt-compression methods reduce sequence length, but they do not adapt the executed model subnetwork. We propose a unified compressed-sensing-guided framework for dynamic LLM execution. Random measurement operators probe latent model usage, sparse recovery estimates task-conditioned and token-adaptive support sets, and the recovered supports are compiled into hardware-efficient sparse execution paths over blocks, attention heads, channels, and feed-forward substructures. The framework introduces five key contributions: task-conditioned measurements, so different prompts induce different sparse supports; token-adaptive recovery, so active substructures are re-estimated during decoding; formal sample-complexity bounds under restricted isometry or mutual incoherence assumptions; compile-to-hardware constraints that restrict recovery to GPU-efficient structures; and a joint objective that unifies prompt compression with model reduction. Together, these components recast LLM inference as a measurement-and-recovery problem with explicit approximation guarantees and deployment-oriented speedup constraints.
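
摘要把 LLM 推理重构为“测量-恢复”问题:随机测量算子探测潜在使用路径,再用稀疏恢复估计支持集。下面给出一个确定性的玩具示例(estimate_support、测量矩阵与支持集均为本文假设的示意,相当于 OMP 的单步匹配滤波简化,并非论文算法本身):

```python
def estimate_support(A, y, k):
    """用列相关(匹配滤波,可视为 OMP 的单步简化)
    从测量 y = A·x 中估计 k 稀疏向量 x 的支持集。"""
    m, n = len(A), len(A[0])
    # 每列与测量向量的相关幅值;相关越大,越可能处于支持集中
    corr = [abs(sum(A[i][j] * y[i] for i in range(m))) for j in range(n)]
    return sorted(sorted(range(n), key=lambda j: -corr[j])[:k])

# 4x4 正交测量矩阵(列两两正交),真实激活的“子结构”为 {0, 3}
A = [[1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, -1, 0],
     [0, 1, 0, -1]]
true_support = [0, 3]
y = [sum(A[i][j] for j in true_support) for i in range(4)]  # y = (1, 1, 1, -1)
```

当列相干度足够低(摘要中的 RIP/互不相干假设)时,此类相关式恢复可以准确定位稀疏支持,这正是其采样复杂度界的直觉来源。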

[NLP-106] From Black Box to Glass Box: Cross-Model ASR Disagreement to Prioritize Review in Ambient AI Scribe Documentation

【速读】: 该论文旨在解决医疗场景中自动语音识别(ASR)系统在临床文档记录中因错误未被察觉而导致的可靠性问题,尤其是在缺乏高质量人工参考转录的情况下难以校准不确定性。其解决方案的关键在于利用异构ASR系统之间的交叉模型分歧(cross-model disagreement)作为无参考的不确定性信号,通过多模型输出对齐构建共识伪参考,并以多数强度指标量化词级别的一致性,从而识别出高风险、可能不可靠的转录片段,实现针对性的人工核查,无需依赖人工标注的参考文本。

链接: https://arxiv.org/abs/2604.14152
作者: Abdolamir Karbalaie,Fernando Seoane,Farhad Abtahi
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Ambient AI “scribe” systems promise to reduce clinical documentation burden, but automatic speech recognition (ASR) errors can remain unnoticed without careful review, and high-quality human reference transcripts are often unavailable for calibrating uncertainty. We investigate whether cross-model disagreement among heterogeneous ASR systems can act as a reference-free uncertainty signal to prioritize human verification in medical transcription workflows. Using 50 publicly available medical education audio clips (8 h 14 min), we transcribed each clip with eight ASR systems spanning commercial APIs and open-source engines. We aligned multi-model outputs, built consensus pseudo-references, and quantified token-level agreement using a majority-strength metric; we further characterized disagreements by type (content vs. punctuation/formatting) and assessed per-model agreement via leave-one-model-out (jackknife) consensus scoring. Inter-model reliability was low (ICC[2,1] = 0.131), indicating heterogeneous failure modes across systems. Across 76,398 evaluated token positions, 72.1% showed near-unanimous agreement (7-8 models), while 2.5% fell into high-risk bands (0-3 models), with high-risk mass varying from 0.7% to 11.4% across accent groups. Low-agreement regions were enriched for content disagreements, with the content fraction increasing from 53.9% to 73.9% across quintiles of high-risk mass. These results suggest that cross-model disagreement provides a sparse, localizable signal that can surface potentially unreliable transcript spans without human-verified references, enabling targeted review; clinical accuracy of flagged regions remains to be established.
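
文中的多数强度(majority-strength)指标与高风险区间(仅 0-3 个模型一致)的判定,可草绘如下(函数名与阈值为本文假设,仅示意词级一致性统计的做法):

```python
from collections import Counter

def majority_strength(column):
    """一列为八个 ASR 系统在同一对齐位置输出的 token(缺失记为 None);
    返回多数 token 的支持模型数,作为该位置的一致性强度。"""
    counts = Counter(t for t in column if t is not None)
    return counts.most_common(1)[0][1] if counts else 0

def flag_high_risk(aligned_columns, threshold=3):
    """支持数 <= threshold(即 0-3 个模型一致)的位置标记为高风险,
    供人工优先核查,无需人工参考转录。"""
    return [i for i, col in enumerate(aligned_columns)
            if majority_strength(col) <= threshold]
```

例如医疗剂量单位这类内容分歧,往往正是落在低一致性区间、需要优先复核的位置。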

[NLP-107] Aligning Language Models with Real-time Knowledge Editing

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中过时知识难以高效更新的问题,同时确保模型原有能力不被破坏。现有知识编辑评测基准多为静态设计,无法反映现实世界知识的动态变化,导致评估结果与实际应用脱节。为此,作者提出了CRAFT——一个持续演化的现实世界知识编辑基准,其包含精心设计的复合推理配对编辑,并引入别名可迁移性、时空局部性和常识局部性等多维评估维度,显著提升了评测难度。为应对这一挑战,论文进一步提出KEDAS(Knowledge Editing Alignment with Diverse Augmentation and Self-Adaptive Post-alignment Inference)新范式,其核心在于通过多样化的编辑增强和自适应后对齐推理机制,在实时编辑场景下实现性能显著提升,从而在CRAFT上优于以往方法。

链接: https://arxiv.org/abs/2508.01302
作者: Chenming Tang,Yutong Yang,Kexue Wang,Yunfang Wu
机构: Peking University (北京大学)
类目: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
备注: Pre-print

点击查看摘要

Abstract:Knowledge editing aims to modify outdated knowledge in large language models (LLMs) efficiently while retaining their original capabilities. Mainstream benchmarks for knowledge editing are predominantly static and fail to keep in pace with the evolving real-world knowledge. In this work, we introduce CRAFT, an ever-evolving real-world benchmark for knowledge editing. It features well-designed paired edits for composite reasoning, and evaluates models on alias portability as well as temporal and common-sense locality, making it a challenging knowledge editing benchmark on which previous knowledge editing methods hardly achieve balanced performance. Towards flexible real-time editing, we propose KEDAS, a novel paradigm of knowledge editing alignment featuring diverse edit augmentation and self-adaptive post-alignment inference, which exhibits significant performance gain on CRAFT compared to previous methods. All of our code and data are available at this https URL.

[NLP-108] Grading the Unspoken: Evaluating Tacit Reasoning in Quantum Field Theory and String Theory with LLMs

【速读】: 该论文试图解决的问题是:如何在高度抽象的理论物理领域(如量子场论和弦理论)中有效评估大型语言模型(Large Language Models, LLMs)的推理能力,尤其是在面对概念层叠、隐含结构约束且非二值正确性的场景下,传统基于答案匹配的评价指标为何失效。解决方案的关键在于构建了一个由专家精心标注的紧凑数据集(包含十二个覆盖量子场论与弦理论核心问题的样本),并提出了一套五级评分量表(grading rubric),从陈述正确性、关键概念意识、推理链完整性、隐含步骤重构到知识扩展性逐层区分模型表现,从而揭示LLMs在处理需要重建省略推理步骤或满足全局一致性约束的任务时存在的系统性退化现象,其根本原因不仅在于中间步骤缺失,更在于表示选择的不稳定性——即模型难以识别正确的概念框架以化解隐含张力。

链接: https://arxiv.org/abs/2604.14188
作者: Xingyang Yu,Yinghuan Zhang,Yufei Zhang,Zijun Cui
机构: Virginia Tech (弗吉尼亚理工大学); Michigan State University (密歇根州立大学)
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); High Energy Physics - Theory (hep-th)
备注: 9 pages + appendices, 2 figures, 9 tables

点击查看摘要

Abstract:Large language models have demonstrated impressive performance across many domains of mathematics and physics. One natural question is whether such models can support research in highly abstract theoretical fields such as quantum field theory and string theory. Evaluating this possibility faces an immediate challenge: correctness in these domains is layered, tacit, and fundamentally non-binary. Standard answer-matching metrics fail to capture whether intermediate conceptual steps are properly reconstructed or whether implicit structural constraints are respected. We construct a compact expert-curated dataset of twelve questions spanning core areas of quantum field theory and string theory, and introduce a five-level grading rubric separating statement correctness, key concept awareness, reasoning chain presence, tacit step reconstruction, and enrichment. Evaluating multiple contemporary LLMs, we observe near-ceiling performance on explicit derivations within stable conceptual frames, but systematic degradation when tasks require reconstruction of omitted reasoning steps or reorganization of representations under global consistency constraints. These failures are driven not only by missing intermediate steps, but by an instability in representation selection: models often fail to identify the correct conceptual framing required to resolve implicit tensions. We argue that highly abstract theoretical physics provides a uniquely sensitive lens on the epistemic limits of current evaluation paradigms.

[NLP-109] HARNESS: Lightweight Distilled Arabic Speech Foundation Models

【速读】: 该论文旨在解决大规模自监督语音(Self-Supervised Speech, SSL)模型在资源受限场景下部署困难的问题,尤其是针对阿拉伯语语音任务的适配性不足。其解决方案的关键在于提出了一种名为HArnESS的阿拉伯语中心SSL模型家族,通过从头训练并结合迭代自蒸馏(iterative self-distillation)技术,将大型双语阿拉伯-英语教师模型的知识逐步迁移至轻量级学生模型中,同时保留与阿拉伯语相关的声学和副语言表征;此外,还引入基于主成分分析(PCA)的压缩方法以优化教师监督信号,使其更匹配浅层、薄层学生模型的容量,从而在自动语音识别(ASR)、方言识别(DID)和语音情感识别(SER)等任务上实现高精度与高效能的平衡。

链接: https://arxiv.org/abs/2604.14186
作者: Vrunda N. Sukhadia,Shammur Absar Chowdhury
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 8 pages, 2 figures

点击查看摘要

Abstract:Large self-supervised speech (SSL) models achieve strong downstream performance, but their size limits deployment in resource-constrained settings. We present HArnESS, an Arabic-centric self-supervised speech model family trained from scratch with iterative self-distillation, together with lightweight student variants that offer strong accuracy-efficiency trade-offs on Automatic Speech Recognition (ASR), Dialect Identification (DID), and Speech Emotion Recognition (SER). Our approach begins with a large bilingual Arabic-English teacher and progressively distills its knowledge into compressed student models while preserving Arabic-relevant acoustic and paralinguistic representations. We further study PCA-based compression of the teacher supervision signal to better match the capacity of shallow and thin students. Compared with HuBERT and XLS-R, HArnESS consistently improves performance on Arabic downstream tasks, while the compressed models remain competitive under substantial structural reduction. These results position HArnESS as a practical and accessible Arabic-centric SSL foundation for real-world speech applications.
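
论文用基于 PCA 的压缩使教师监督信号匹配浅层学生的容量。下面用幂迭代提取协方差矩阵的第一主成分方向,示意 PCA 最核心的一步(纯概念草图,与 HArnESS 的实际实现无关):

```python
def power_iteration(cov, iters=200):
    """幂迭代求(对称)协方差矩阵的第一主成分方向:
    反复做 v <- cov·v 并归一化,v 收敛到最大特征值对应的特征向量。"""
    v = [1.0] + [0.0] * (len(cov) - 1)
    for _ in range(iters):
        w = [sum(cov[r][c] * v[c] for c in range(len(v))) for r in range(len(v))]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# 教师特征两维强相关:主方向应接近 (1, 1)/sqrt(2),
# 把教师监督压到一维时即保留该方向上的信息
cov = [[2.0, 1.9], [1.9, 2.0]]
v1 = power_iteration(cov)
```

实际蒸馏中会保留前 k 个主成分而非仅第一个,以在压缩率与监督信息量之间折中。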

信息检索

[IR-0] IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

【速读】:该论文旨在解决现有基于强化学习(Reinforcement Learning, RL)训练大语言模型进行搜索增强推理时,因依赖轨迹级奖励(trajectory-level rewards)而导致的信用分配粗粒度问题——即无法区分单个搜索步骤中有效查询与模糊或冗余查询,并且在所有采样轨迹均失败时梯度信号几乎消失。解决方案的关键在于提出IG-Search框架,其核心创新是引入一种基于信息增益(Information Gain, IG)的步骤级奖励机制:该机制通过比较检索文档与随机文档对模型预测黄金答案置信度的提升程度,量化每个搜索步骤的有效性,并利用GRPO(Generalized Reward Policy Optimization)中的逐token优势调制技术将此细粒度信号反馈至对应搜索查询词元,实现精准的步骤级信用分配。该方法无需额外中间标注或共享环境状态,仅依赖策略自身生成概率即可获得稳定梯度,在多个单跳和多跳问答基准上显著优于现有轨迹级与步骤级基线方法。

链接: https://arxiv.org/abs/2604.15148
作者: Zihan Liang,Yufei Ma,Ben Chen,Zhipeng Qian,Huangyu Dai,Lingtao Mao,Xuxin Zhang,Chenyi Lei,Wenwu Ou
机构: Kuaishou Technology (快手科技)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Reinforcement learning has emerged as an effective paradigm for training large language models to perform search-augmented reasoning. However, existing approaches rely on trajectory-level rewards that cannot distinguish precise search queries from vague or redundant ones within a rollout group, and collapse to a near-zero gradient signal whenever every sampled trajectory fails. In this paper, we propose IG-Search, a reinforcement learning framework that introduces a step-level reward based on Information Gain (IG). For each search step, IG measures how much the retrieved documents improve the model’s confidence in the gold answer relative to a counterfactual baseline of random documents, thereby reflecting the effectiveness of the underlying search query. This signal is fed back to the corresponding search-query tokens via per-token advantage modulation in GRPO, enabling fine-grained, step-level credit assignment within a rollout. Unlike prior step-level methods that require either externally annotated intermediate supervision or shared environment states across trajectories, IG-Search derives its signals from the policy’s own generation probabilities, requiring no intermediate annotations beyond standard question-answer pairs. Experiments on seven single-hop and multi-hop QA benchmarks demonstrate that IG-Search achieves an average EM of 0.430 with Qwen2.5-3B, outperforming the strongest trajectory-level baseline (MR-Search) by 1.6 points and the step-level method GiGPO by 0.9 points on average across benchmarks, with particularly pronounced gains on multi-hop reasoning tasks. Despite introducing a dense step-level signal, IG-Search adds only ~6.4% to per-step training wall-clock time over the trajectory-level baseline and leaves inference latency unchanged, while still providing a meaningful gradient signal even when every sampled trajectory answers incorrectly.
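
步骤级信息增益奖励的核心是比较“检索文档”与“随机文档”两种条件下黄金答案对数置信度之差,再把该信号调制到对应搜索 query 的逐 token 优势上。以下为简化示意(函数、tanh 调制方式与超参 alpha 均为本文假设,并非论文公式的精确复现):

```python
import math

def information_gain(logp_gold_with_retrieved, logp_gold_with_random):
    """单步信息增益:检索文档相对随机文档基线,
    对黄金答案对数置信度的提升量。"""
    return logp_gold_with_retrieved - logp_gold_with_random

def modulated_advantages(base_advantage, step_igs, token_step_ids, alpha=0.5):
    """把步骤级 IG 信号调制到逐 token 优势上(GRPO 风格):
    token_step_ids[t] 给出第 t 个 token 所属的搜索步骤。"""
    return [base_advantage * (1.0 + alpha * math.tanh(step_igs[s]))
            for s in token_step_ids]

# 两次搜索:第一次查询有效(IG 大),第二次近乎冗余(IG 接近 0)
ig_good = information_gain(-0.5, -2.3)
ig_weak = information_gain(-1.0, -1.1)
advs = modulated_advantages(1.0, [ig_good, ig_weak], [0, 0, 1, 1])
```

由于 IG 仅依赖策略自身的生成概率,即便整组轨迹全部答错,各步骤仍能得到非零的差异化信号。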

[IR-1] Metric-agnostic Learning-to-Rank via Boosting and Rank Approximation ICDM2023

【速读】:该论文旨在解决当前学习排序(Learning-to-Rank, LTR)方法依赖单一排名指标(如NDCG或MAP)进行优化所带来的两个核心问题:一是由于排名指标本身不可微,导致训练过程不稳定且效率低下;二是仅针对单一指标优化限制了模型在其他排名指标上的泛化能力。解决方案的关键在于提出一种新的列表级(listwise)LTR框架,其核心创新包括:设计了一种可微的排序损失函数,该函数结合了对排序操作符的平滑近似与每个查询的平均均方损失;并首次将梯度提升机(Gradient-Boosting Machines, GBM)适配至该损失函数,以逐列表最小化目标,从而实现高效且具备良好泛化性的排序效果。

链接: https://arxiv.org/abs/2604.15101
作者: Camilo Gomez,Pengyang Wang,Yanjie Fu
机构: 未知
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Published in IEEE ICDM 2023. 6 pages

点击查看摘要

Abstract:Learning-to-Rank (LTR) is a supervised machine learning approach that constructs models specifically designed to order a set of items or documents based on their relevance or importance to a given query or context. Despite significant success in real-world information retrieval systems, current LTR methods rely on a single predefined ranking metric (e.g., Normalized Discounted Cumulative Gain (NDCG) or Mean Average Precision (MAP)) for optimizing the ranking objective function. Such a metric-dependent setting limits LTR methods from two perspectives: (1) non-differentiable problem: directly optimizing ranking functions over a given ranking metric is inherently non-smooth, making the training process unstable and inefficient; (2) limited ranking utility: optimizing over one single metric makes it difficult to generalize well to other ranking metrics of interest. To address the above issues, we propose a novel listwise LTR framework for efficient and generalizable ranking purpose. Specifically, we propose a new differentiable ranking loss that combines a smooth approximation to the ranking operator with the average mean square loss per query. Then, we adapt gradient-boosting machines to minimize our proposed loss with respect to each list, a novel contribution. Finally, extensive experimental results confirm that our method outperforms the current state-of-the-art in information retrieval measures with similar efficiency.
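
论文的可微排序损失结合了排序算子的平滑近似与逐查询的均方损失。一个常见的平滑近似是用 sigmoid 替代成对比较的指示函数:rank_i ≈ 1 + Σ_{j≠i} σ((s_j − s_i)/τ)。下面是该思路的示意实现(τ 的取值与具体损失形式为本文假设,未必与论文一致):

```python
import math

def soft_rank(scores, tau=0.1):
    """可微排名近似:rank_i ≈ 1 + Σ_{j≠i} σ((s_j - s_i)/τ)。
    τ 越小越接近真实排名,但梯度也越陡峭。"""
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    return [1.0 + sum(sig((sj - si) / tau)
                      for j, sj in enumerate(scores) if j != i)
            for i, si in enumerate(scores)]

def listwise_mse_loss(scores, target_ranks, tau=0.1):
    """单个查询上的列表级损失:平滑排名与目标排名的均方误差。"""
    approx = soft_rank(scores, tau)
    return sum((a - t) ** 2 for a, t in zip(approx, target_ranks)) / len(scores)
```

由于目标是排名本身而非某一截断指标,该损失天然对 NDCG、MAP 等多种指标保持一致性,这正是“metric-agnostic”的含义。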

[IR-2] SAGER: Self-Evolving User Policy Skills for Recommendation Agent

【速读】:该论文旨在解决推荐系统中“记忆个性化与推理逻辑静态化”之间的不对称性问题:当前基于大语言模型(Large Language Model, LLM)的推荐代理虽然能通过演化用户语义记忆来实现个性化,但其决策逻辑仍采用全局统一的静态系统提示(system prompt),导致在推荐失败时仅更新用户偏好记忆而无法修正推理机制,从而限制了模型的持续进化能力。解决方案的关键在于提出SAGER框架,其核心创新是引入“专属策略技能”(policy skill)——一种以自然语言结构化文档形式编码的个性化决策原则,并通过交互持续演化;同时设计双表示技能架构、增量对比链式思维引擎和技能增强的列表级推理机制,使推理过程本身可被个性化优化,从而实现与记忆积累正交的性能提升。

链接: https://arxiv.org/abs/2604.14972
作者: Zhen Tao,Riwei Lai,Chenyun Yu,Weixin Chen,Li Chen,Beibei Kong,Lei Cheng,Chengxiang Zhuo,Zang Li,Qingqiang Sun
机构: Great Bay University (大湾大学); Hong Kong Baptist University (香港浸会大学); Sun Yat-Sen University (中山大学); Platform and Content Group, Tencent (腾讯平台与内容部)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large language model (LLM) based recommendation agents personalize what they know through evolving per-user semantic memory, yet how they reason remains a universal, static system prompt shared identically across all users. This asymmetry is a fundamental bottleneck: when a recommendation fails, the agent updates its memory of user preferences but never interrogates the decision logic that produced the failure, leaving its reasoning process structurally unchanged regardless of how many mistakes it accumulates. To address this bottleneck, we propose SAGER (Self-Evolving Agent for Personalized Recommendation), the first recommendation agent framework in which each user is equipped with a dedicated policy skill, a structured natural-language document encoding personalized decision principles that evolves continuously through interaction. SAGER introduces a two-representation skill architecture that decouples a rich evolution substrate from a minimal inference-time injection, an incremental contrastive chain-of-thought engine that diagnoses reasoning flaws by contrasting accepted against unchosen items while preserving accumulated priors, and skill-augmented listwise reasoning that creates fine-grained decision boundaries where the evolved skill provides genuine discriminative value. Experiments on four public benchmarks demonstrate that SAGER achieves state-of-the-art performance, with gains orthogonal to memory accumulation, confirming that personalizing the reasoning process itself is a qualitatively distinct source of recommendation improvement.

[IR-3] GenRec: A Preference-Oriented Generative Framework for Large-Scale Recommendation SIGIR2026

【速读】:该论文旨在解决生成式推荐(Generative Retrieval, GR)在大规模工业系统中面临的三大挑战:(1)单个请求内因分页机制导致相同输入产生不一致输出;(2)使用基于语义ID的多标记物品表示编码长用户行为序列成本过高;(3)生成策略与细微用户偏好信号难以对齐。解决方案的关键在于提出一个统一的解码器-only 架构——GenRec,其核心创新包括:(1)引入页面级 next-token prediction(Page-wise NTP)训练任务,通过整页监督提供更密集的梯度信号并缓解点级训练中的一对多歧义;(2)设计非对称线性 Token Merger,在预填充阶段压缩提示中的多标记语义ID,使输入长度减少约2倍且精度损失可忽略;(3)提出GRPO-SR强化学习方法,结合组相对策略优化(Group Relative Policy Optimization)与负对数似然(NLL)正则化以提升训练稳定性,并采用混合奖励机制融合密集奖励模型与相关性门控以防止奖励欺骗。在线A/B测试表明,GenRec相较现有流水线在点击量和交易量上分别提升9.5%和8.7%。

链接: https://arxiv.org/abs/2604.14878
作者: Yanyan Zou,Junbo Qi,Lunsong Huang,Yu Li,Kewei Xu,Jiabao Gao,Binglei Zhao,Xuanhua Yang,Sulong Xu,Shengjie Li
机构: JD.com(京东)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: SIGIR 2026 Camera-Ready version

点击查看摘要

Abstract:Generative Retrieval (GR) offers a promising paradigm for recommendation through next-token prediction (NTP). However, scaling it to large-scale industrial systems introduces three challenges: (i) within a single request, the identical model inputs may produce inconsistent outputs due to the pagination request mechanism; (ii) the prohibitive cost of encoding long user behavior sequences with multi-token item representations based on semantic IDs, and (iii) aligning the generative policy with nuanced user preference signals. We present GenRec, a preference-oriented generative framework deployed on the JD App that addresses above challenges within a single decoder-only architecture. For training objective, we propose Page-wise NTP task, which supervises over an entire interaction page rather than each interacted item individually, providing denser gradient signal and resolving the one-to-many ambiguity of point-wise training. On the prefilling side, an asymmetric linear Token Merger compresses multi-token Semantic IDs in the prompt while preserving full-resolution decoding, reducing input length by ~2X with negligible accuracy loss. To further align outputs with user satisfaction, we introduce GRPO-SR, a reinforcement learning method that pairs Group Relative Policy Optimization with NLL regularization for training stability, and employs Hybrid Rewards combining a dense reward model with a relevance gate to mitigate reward hacking. In month-long online A/B tests serving production traffic, GenRec achieves 9.5% improvement in click count and 8.7% in transaction count over the existing pipeline.
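
非对称线性 Token Merger 在预填充侧把相邻多个语义 ID token 压缩成一个,从而使输入长度约缩短 merge_ratio 倍,而解码侧保持全分辨率。下面用均值池化给出最粗粒度的示意(真实实现应是可学习的线性变换,此处函数与参数均为假设性草图):

```python
def token_merger(sid_token_embs, merge_ratio=2):
    """预填充端:把相邻 merge_ratio 个语义 ID token 的嵌入
    线性合并为一个(这里用均值代替可学习投影),
    输入序列长度约缩短 merge_ratio 倍。"""
    merged = []
    for i in range(0, len(sid_token_embs), merge_ratio):
        group = sid_token_embs[i:i + merge_ratio]
        dim = len(group[0])
        merged.append([sum(e[d] for e in group) / len(group)
                       for d in range(dim)])
    return merged
```

“非对称”体现在仅压缩提示侧的长用户行为序列,生成侧的语义 ID 解码不受影响,这解释了摘要中“~2X 输入压缩、精度损失可忽略”的来源。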

[IR-4] Well Begun is Half Done: Training-Free and Model-Agnostic Semantically Guaranteed User Representation Initialization for Multimodal Recommendation SIGIR2026

【速读】:该论文旨在解决多模态推荐系统中用户表示初始化不当的问题,即现有方法通常对用户表示进行随机初始化,而物品则可基于其丰富的模态信息(如图像、文本等)进行语义合理的初始化,导致用户与物品表示之间存在显著的语义鸿沟。解决方案的关键在于提出一种语义保障的用户表示初始化方法(Semantically Guaranteed User Representation Initialization, SG-URInit),该方法通过融合用户交互物品的模态特征和其所属聚类的全局特征来构建初始用户表示,从而在不引入额外训练开销的前提下,实现语义上更丰富且一致的用户表征初始化,有效提升推荐性能并加速模型收敛。

链接: https://arxiv.org/abs/2604.14839
作者: Jinfeng Xu,Zheyu Chen,Shuo Yang,Jinze Li,Hewei Wang,Jianheng Tang,Wei Wang,Xiping Hu,Edith C. H. Ngai
机构: The University of Hong Kong (香港大学); Beijing Institute of Technology (北京理工大学); Carnegie Mellon University (卡内基梅隆大学); Peking University (北京大学); Macao Polytechnic University (澳门理工学院)
类目: Information Retrieval (cs.IR)
备注: Accepted by SIGIR 2026

点击查看摘要

Abstract:Recent advancements in multimodal recommendations, which leverage diverse modality information to mitigate data sparsity and improve recommendation accuracy, have gained significant attention. However, existing multimodal recommendations overlook the critical role of user representation initialization. Unlike items, which are naturally associated with rich modality information, users lack such inherent information. Consequently, item representations initialized based on meaningful modality information and user representations initialized randomly exhibit a significant semantic gap. To this end, we propose a Semantically Guaranteed User Representation Initialization (SG-URInit). SG-URInit constructs the initial representation for each user by integrating both the modality features of the items they have interacted with and the global features of their corresponding clusters. SG-URInit enables the initialization of semantically enriched user representations that effectively capture both local (item-level) and global (cluster-level) semantics. Our SG-URInit is training-free and model-agnostic, meaning it can be seamlessly integrated into existing multimodal recommendation models without incurring any additional computational overhead during training. Extensive experiments on multiple real-world datasets demonstrate that incorporating SG-URInit into advanced multimodal recommendation models significantly enhances recommendation performance. Furthermore, the results show that SG-URInit can further alleviate the item cold-start problem and also accelerate model convergence, making it an efficient and practical solution for multimodal recommendations.
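
SG-URInit 的初始化思想可一句话概括:用户初始表示 = 交互物品模态特征均值 与 所属簇全局特征 的加权融合。以下为无训练的示意实现(β 等超参与函数签名均为本文假设,并非论文给出的具体公式):

```python
def init_user_representation(item_feats, cluster_centroid, beta=0.7):
    """训练无关的用户表示初始化示意:
    局部语义(交互物品模态特征均值)与全局语义(所属簇中心)的加权融合。
    beta 控制局部/全局的权重,为假设超参。"""
    dim = len(cluster_centroid)
    mean_feat = [sum(f[d] for f in item_feats) / len(item_feats)
                 for d in range(dim)]
    return [beta * mean_feat[d] + (1 - beta) * cluster_centroid[d]
            for d in range(dim)]
```

由于初始化只读取已有模态特征而不参与训练,它可以直接替换任意多模态推荐模型中的随机用户初始化,这正是其“training-free 且 model-agnostic”的含义。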

[IR-5] Federated User Behavior Modeling for Privacy-Preserving LLM Recommendation

【速读】:该论文旨在解决隐私保护下的跨域推荐(Privacy-Preserving Cross-Domain Recommendation, PPCDR)问题,尤其在非重叠域场景中,由于用户身份和行为数据无法共享、多模态数据异构性高以及传统协同过滤信号与大语言模型(Large Language Models, LLMs)特征空间不一致等挑战,导致跨域知识迁移困难。解决方案的关键在于提出一种语义增强的联邦用户行为建模方法(Semantic-enhanced Federated User Behavior Modeling, SF-UBM):首先利用自然语言作为通用桥梁,在加密文本项表示的基础上实现跨域对齐而无需共享原始用户数据;其次设计事实-反事实知识蒸馏模块,融合领域无关与领域特定知识以应对多模态异构性;最后将预训练用户偏好与跨域项表示投影至软提示空间,实现行为空间与语义空间的对齐,从而有效支持LLM的学习与推荐性能提升。

链接: https://arxiv.org/abs/2604.14833
作者: Lei Guo,Hongyun Yang,Pengjie Ren,Tong Chen,Hui Liu,Zhumin Chen
机构: Shandong Normal University (山东师范大学); Shandong University (山东大学); The University of Queensland (昆士兰大学); Shandong University of Finance and Economic (山东财经大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large Language Models have shown great success in recommender systems. However, the limited and sparse nature of user data often restricts the LLM’s ability to effectively model behavior patterns. To address this, existing studies have explored cross-domain solutions by conducting Cross-Domain Recommendation tasks. But previous methods typically assume domains are overlapped and can be accessed readily. None of the LLM methods address the privacy-preserving issues in the CDR settings, that is, Privacy-Preserving Cross-Domain Recommendation. Conducting non-overlapping PPCDR with LLM is challenging since: 1)The inability to share user identity or behavioral data across domains impedes effective cross-domain alignment. 2)The heterogeneity of data modalities across domains complicates knowledge integration. 3)Fusing collaborative filtering signals from traditional recommendation models with LLMs is difficult, as they operate within distinct feature spaces. To address the above issues, we propose SF-UBM, a Semantic-enhanced Federated User Behavior Modeling method. Specifically, to deal with Challenge 1, we leverage natural language as a universal bridge to connect disjoint domains via a semantic-enhanced federated architecture. Here, text-based item representations are encrypted and shared, while user-specific data remains local. To handle Challenge 2, we design a Fact-counter Knowledge Distillation module to integrate domain-agnostic knowledge with domain-specific knowledge, across different data modalities. To tackle Challenge 3, we project pre-learned user preferences and cross-domain item representations into the soft prompt space, aligning behavioral and semantic spaces for effective LLM learning. We conduct extensive experiments on three pairs of real-world domains, and the experimental results demonstrate the effectiveness of SF-UBM compared to the recent SOTA methods.

[IR-6] Uncertainty-aware Generative Learning Path Recommendation with Cognition-Adaptive Diffusion

【速读】:该论文旨在解决学习路径推荐(Learning Path Recommendation, LPR)中两个核心问题:一是现有方法未能有效处理历史交互中的不确定性(如偶然猜对或失误),二是缺乏对多样化学习目标的自适应能力。其解决方案的关键在于提出U-GLAD框架,通过引入基于高斯LSTM的认知状态概率建模来缓解表示偏差并感知交互不确定性;同时设计目标导向的概念编码器,利用多头注意力机制和目标特定变换动态对齐概念语义与个体学习目标,生成个性化嵌入;此外,采用生成式扩散模型预测下一最优概念的潜在表示,而非传统判别式排序方法,从而实现更稳定、目标驱动的个性化推荐路径。

链接: https://arxiv.org/abs/2604.14613
作者: Xiangrui Xiong,Hang Liang,Baiyang Chen,Zifei Pan,Yanli Lee
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 20 pages, 4 figures

点击查看摘要

Abstract:Learning Path Recommendation (LPR) is critical for personalized education, yet current methods often fail to account for historical interaction uncertainty (e.g., lucky guesses or accidental slips) and lack adaptability to diverse learning goals. We propose U-GLAD (Uncertainty-aware Generative Learning Path Recommendation with Cognition-Adaptive Diffusion). To address representation bias, the framework models cognitive states as probability distributions, capturing the learner’s underlying true state via a Gaussian LSTM. To ensure highly personalized recommendation, a goal-oriented concept encoder utilizes multi-head attention and objective-specific transformations to dynamically align concept semantics with individual learning goals, generating uniquely tailored embeddings. Unlike traditional discriminative ranking approaches, our model employs a generative diffusion model to predict the latent representation of the next optimal concept. Extensive evaluations on three public datasets demonstrate that U-GLAD significantly outperforms representative baselines. Further analyses confirm its superior capability in perceiving interaction uncertainty and providing stable, goal-driven recommendation paths.

[IR-7] Category-based and Popularity-guided Video Game Recommendation: A Balance-oriented Framework WWW

【速读】:该论文旨在解决当前视频游戏推荐系统中普遍存在的准确性与多样性失衡问题,即现有方法过度追求推荐精度而忽视了推荐结果的多样性,同时缺乏对游戏类别、流行度等关键物品信息的有效利用,导致推荐结果趋于同质化且难以覆盖长尾游戏。其解决方案的关键在于提出一个名为CPGRec的三模块框架:第一模块通过更严格的图结构连接机制增强推荐准确性;第二模块在游戏图中引入类别多样性的邻居连接,并借助热门游戏节点放大长尾游戏的影响以提升多样性;第三模块整合前两者并通过新的负样本评分重加权策略实现准确性和多样性的动态平衡。

链接: https://arxiv.org/abs/2604.14598
作者: Xiping Li,Jianghong Ma,Kangzhe Liu,Shanshan Feng,Haijun Zhang,Yutong Wang
机构: Harbin Institute of Technology, Shenzhen, China; City University of Hong Kong, Kowloon, Hong Kong, China; Centre for Frontier AI Research, A*STAR, Singapore; Institute of High-Performance Computing, A*STAR, Singapore
类目: Information Retrieval (cs.IR)
备注: Published in The Web Conference (WWW) 2024. 11 pages, 8 figures

点击查看摘要

Abstract:In recent years, the video game industry has experienced substantial growth, presenting players with a vast array of game choices. This surge in options has spurred the need for a specialized recommender system tailored for video games. However, current video game recommendation approaches tend to prioritize accuracy over diversity, potentially leading to unvaried game suggestions. In addition, the existing game recommendation methods commonly lack the ability to establish strict connections between games to enhance accuracy. Furthermore, many existing diversity-focused methods fail to leverage crucial item information, such as item category and popularity during neighbor modeling and message propagation. To address these challenges, we introduce a novel framework, called CPGRec, comprising three modules, namely accuracy-driven, diversity-driven, and comprehensive modules. The first module extends the state-of-the-art accuracy-focused game recommendation method by connecting games in a more stringent manner to enhance recommendation accuracy. The second module connects neighbors with diverse categories within the proposed game graph and harnesses the advantages of popular game nodes to amplify the influence of long-tail games within the player-game bipartite graph, thereby enriching recommendation diversity. The third module combines the above two modules and employs a new negative-sample rating score reweighting method to balance accuracy and diversity. Experimental results on the Steam dataset demonstrate the effectiveness of our proposed method in improving game recommendations. The dataset and source codes are anonymously released at: this https URL.

[IR-8] CPGRec: A Balance-oriented Framework for Personalized Video Game Recommendations

【速读】:该论文旨在解决游戏推荐系统中现有基于图神经网络(Graph Neural Network, GNN)的方法在追求高精度时忽视多样性的问题,以及未充分考虑玩家-游戏交互差异性所导致的过平滑(over-smoothing)现象。关键解决方案在于提出两个新模块:其一为偏好感知边重加权(Preference-informed Edge Reweighting, PER)模块,通过赋予有符号边权重来区分显著的兴趣与非兴趣关系,并量化偏好强度以缓解图卷积中的过平滑问题;其二为偏好感知表示生成(Preference-informed Representation Generation, PRG)模块,利用大语言模型(Large Language Models, LLMs)基于全局与个体兴趣对比推理生成情境化的游戏和玩家描述,从而优化两者表征质量。实验证明,改进后的CPGRec+在Steam数据集上优于当前最优模型,在准确性和多样性之间实现了更好平衡。

链接: https://arxiv.org/abs/2604.14586
作者: Xiping Li,Aier Yang,Jianghong Ma,Kangzhe Liu,Shanshan Feng,Haijun Zhang,Yi Zhao
机构: Harbin Institute of Technology Shenzhen (哈尔滨工业大学深圳); School of Computer Science, Wuhan University (武汉大学计算机学院)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Published in ACM Transactions on Information Systems (TOIS). 43 pages, 9 figures

点击查看摘要

Abstract:The rapid expansion of gaming industry requires advanced recommender systems tailored to its dynamic landscape. Existing Graph Neural Network (GNN)-based methods primarily prioritize accuracy over diversity, overlooking their inherent trade-off. To address this, we previously proposed CPGRec, a balance-oriented gaming recommender system. However, CPGRec fails to account for critical disparities in player-game interactions, which carry varying significance in reflecting players’ personal preferences and may exacerbate over-smoothness issues inherent in GNN-based models. Moreover, existing approaches underutilize the reasoning capabilities and extensive knowledge of large language models (LLMs) in addressing these limitations. To bridge this gap, we propose two new modules. First, Preference-informed Edge Reweighting (PER) module assigns signed edge weights to qualitatively distinguish significant player interests and disinterests while then quantitatively measuring preference strength to mitigate over-smoothing in graph convolutions. Second, Preference-informed Representation Generation (PRG) module leverages LLMs to generate contextualized descriptions of games and players by reasoning personal preferences from comparing global and personal interests, thereby refining representations of players and games. Experiments on two Steam datasets demonstrate CPGRec+'s superior accuracy and diversity over state-of-the-art models. The code is accessible at this https URL.
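The signed edge-reweighting idea behind the PER module can be sketched in a few lines. The rating scale, threshold, and function names below are illustrative assumptions for the sketch, not the paper's implementation:

```python
def reweight_edges(interactions, pos_threshold=4.0):
    """Assign signed, preference-strength weights to player-game edges.

    `interactions` maps (player, game) -> rating on an assumed 1-5 scale.
    Ratings at or above `pos_threshold` become positive (interest) edges,
    ratings below become negative (disinterest) edges, scaled by how far
    they sit from the threshold. All names here are illustrative.
    """
    weighted = {}
    for (player, game), rating in interactions.items():
        sign = 1.0 if rating >= pos_threshold else -1.0
        strength = abs(rating - pos_threshold) / pos_threshold  # in [0, 1]
        weighted[(player, game)] = sign * strength
    return weighted

# A strong like becomes a positive edge, a strong dislike a negative one.
edges = reweight_edges({("u1", "g1"): 5.0, ("u1", "g2"): 1.0})
```

Signed weights let a graph convolution push apart, rather than blur together, games a player explicitly disliked, which is the over-smoothing mitigation the abstract describes.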

[IR-9] Behavior-Aware Dual-Channel Preference Learning for Heterogeneous Sequential Recommendation

【速读】:该论文旨在解决异构序列推荐(Heterogeneous Sequential Recommendation, HSR)中因真实场景数据稀疏性导致的推荐性能下降问题,以及现有基于对比学习的方法仅关注单一行为类型、忽略用户细粒度偏好而造成信息损失的问题。解决方案的关键在于提出一种行为感知的双通道偏好学习框架(Behavior-aware Dual-channel Preference Learning, BDPL),其核心创新包括:构建个性化的行为感知子图以捕捉用户行为转移关系;设计级联结构的图神经网络聚合节点上下文信息;引入偏好层级的对比学习机制同时建模长短期用户偏好;并通过自适应门控机制融合整体偏好信息,从而实现对目标行为下用户下一交互物品的精准预测。

链接: https://arxiv.org/abs/2604.14581
作者: Jing Xiao,Dongqi Wu,Liwei Pan,Yawen Luo,Weike Pan,Zhong Ming
机构: Shenzhen University (深圳大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Heterogeneous sequential recommendation (HSR) aims to learn dynamic behavior dependencies from the diverse behaviors of user-item interactions to facilitate precise sequential recommendation. Despite many efforts yielding promising achievements, there are still challenges in modeling heterogeneous behavior data. One significant issue is the inherent sparsity of a real-world data, which can weaken the recommendation performance. Although auxiliary behaviors (e.g., clicks) partially address this problem, they inevitably introduce some noise, and the sparsity of the target behavior (e.g., purchases) remains unresolved. Additionally, contrastive learning-based augmentation in existing methods often focuses on a single behavior type, overlooking fine-grained user preferences and losing valuable information. To address these challenges, we have meticulously designed a behavior-aware dual-channel preference learning framework (BDPL). This framework begins with the construction of customized behavior-aware subgraphs to capture personalized behavior transition relationships, followed by a novel cascade-structured graph neural network to aggregate node context information. We then model and enhance user representations through a preference-level contrastive learning paradigm, considering both long-term and short-term preferences. Finally, we fuse the overall preference information using an adaptive gating mechanism to predict the next item the user will interact with under the target behavior. Extensive experiments on three real-world datasets demonstrate the superiority of our BDPL over the state-of-the-art models.
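The adaptive gating that fuses long- and short-term preferences might look like the following toy sketch. The paper learns its gate end-to-end; the dot-product agreement gate here is an assumed simplification:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse_preferences(long_term, short_term, w=1.0, b=0.0):
    """Adaptive gating over long- and short-term preference vectors.

    The gate g is computed from the dot-product agreement between the two
    views; the fused vector is g*long + (1-g)*short. Illustrative only:
    w and b stand in for learned parameters.
    """
    agreement = sum(l * s for l, s in zip(long_term, short_term))
    g = sigmoid(w * agreement + b)
    return [g * l + (1.0 - g) * s for l, s in zip(long_term, short_term)]

# Orthogonal views get an even 50/50 blend under this toy gate.
fused = fuse_preferences([1.0, 0.0], [0.0, 1.0])
```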

[IR-10] NewsTorch: A PyTorch-based Toolkit for Learner-oriented News Recommendation

【速读】:该论文旨在解决新闻推荐研究领域中缺乏专门面向学习者的推荐工具包的问题,这一缺失阻碍了该方向的研究进展。解决方案的关键在于提出一个基于PyTorch的开源新闻推荐工具包NewsTorch,其核心特点包括模块化、解耦合与可扩展的框架设计,支持数据集下载与预处理、模型训练/验证/测试全流程,并提供标准化评估指标以确保实验的公平性与可复现性,从而帮助学习者在理论理解与实践应用两方面获得系统性提升。

链接: https://arxiv.org/abs/2604.14510
作者: Rongyao Wang,Veronica Liesaputra,Zhiyi Huang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 3 pages

点击查看摘要

Abstract:News recommender systems are devised to alleviate the information overload, attracting more and more researchers’ attention in recent years. The lack of a dedicated learner-oriented news recommendation toolkit hinders the advancement of research in news recommendation. We propose a PyTorch-based news recommendation toolkit called NewsTorch, developed to support learners in acquiring both conceptual understanding and practical experience. This toolkit provides a modular, decoupled, and extensible framework with a learner-friendly GUI platform that supports dataset downloading and preprocessing. It also enables training, validation, and testing of state-of-the-art neural news recommendation models with standardized evaluation metrics, ensuring fair comparison and reproducible experiments. Our open-source toolkit is released on Github: this https URL.

[IR-11] Controlling Authority Retrieval: A Missing Retrieval Objective for Authority-Governed Knowledge

【速读】:该论文旨在解决在知识受正式权威控制的领域(如法律、药物监管、软件安全)中,如何有效检索出能够“覆盖”早期文档的最新权威性文档的问题,即从语义上看似不相关但具有正式撤销权的后续文档。传统检索方法(如argmax_d s(q,d))无法捕捉这种“控制关系”,而论文提出**控制权威检索(Controlling Authority Retrieval, CAR)**作为新范式,其核心在于识别并召回权威闭包的活跃前沿(front(cl(A_k(q)))),而非仅依赖语义相似度。解决方案的关键是:1)证明了CAR正确性的充要条件——前沿包含性和无被忽略的替代者(Theorem 4);2)建立了一个理论上限(phi(q)),表明任何基于范围索引的算法在最坏情况下性能受限于该值(Proposition 2)。实证结果验证了该框架在多个真实世界语料中的有效性,且两阶段策略显著优于密集检索(Dense RAG),例如在GPT-4o-mini实验中将错误遗漏率从39%降至16%。

链接: https://arxiv.org/abs/2604.14488
作者: Andre Bacellar
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 23 pages, 13 tables; code and data at this https URL

点击查看摘要

Abstract:In any domain where knowledge accumulates under formal authority – law, drug regulation, software security – a later document can formally void an earlier one while remaining semantically distant from it. We formalize this as Controlling Authority Retrieval (CAR): recovering the active frontier front(cl(A_k(q))) of the authority closure of the semantic anchor set – a different mathematical problem from argmax_d s(q,d). The two central results are: Theorem 4 (CAR-Correctness Characterization) gives necessary-and-sufficient conditions on any retrieved set R for TCA(R,q)=1 – frontier inclusion and no-ignored-superseder – independent of how R was produced. Proposition 2 (Scope Identifiability Upper Bound) establishes phi(q) as a hard worst-case ceiling: for any scope-indexed algorithm, TCA@k ≤ phi(q) * R_anchor(q), proved by an adversarial permutation argument. Three independent real-world corpora validate the proved structure: security advisories (Dense TCA@5=0.270, two-stage 0.975), SCOTUS overruling pairs (Dense=0.172, two-stage 0.926), FDA drug records (Dense=0.064, two-stage 0.774). A GPT-4o-mini experiment shows the downstream cost: Dense RAG produces explicit “not patched” claims for 39% of queries where a patch exists; Two-Stage cuts this to 16%. Four benchmark datasets, domain adapters, and a single-command scorer are released at this https URL.
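The closure-then-frontier objective front(cl(A_k(q))) can be reproduced on a toy supersession graph. The data structures below are illustrative assumptions, not the paper's released scorer:

```python
def car_frontier(anchors, superseded_by):
    """Controlling Authority Retrieval on a toy graph.

    Starting from a semantic anchor set, follow supersession edges to the
    authority closure, then keep only the active frontier: documents not
    superseded by anything in the closure. `superseded_by` maps a doc to
    the later docs that formally void it.
    """
    closure = set()
    stack = list(anchors)
    while stack:
        d = stack.pop()
        if d in closure:
            continue
        closure.add(d)
        stack.extend(superseded_by.get(d, []))
    return {d for d in closure
            if not any(s in closure for s in superseded_by.get(d, []))}

# v1 is voided by v2, which is voided by v3: only v3 stays active.
frontier = car_frontier({"v1"}, {"v1": ["v2"], "v2": ["v3"]})
```

Note how the correct answer (v3) may be semantically distant from the anchor v1, which is exactly why argmax similarity retrieval misses it.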

[IR-12] A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation

【速读】:该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)系统在本地设备上部署时面临的资源限制问题,尤其是内存和存储空间的瓶颈。传统RAG依赖云端服务器执行检索与生成任务,虽缓解了本地硬件压力,但引入了隐私泄露、延迟高、依赖网络连接等缺陷。为实现端侧(on-device)部署,本文提出一种统一模型,通过压缩RAG上下文并利用相同表示进行检索,从而减少磁盘占用并显著降低生成阶段所需的上下文长度。其核心创新在于使用共享模型和表示实现检索与上下文压缩的一体化,使得平均仅需1/10的上下文长度即可达到传统RAG读取器的性能,且不增加存储开销,是首个实现检索与上下文压缩统一建模的方法。

链接: https://arxiv.org/abs/2604.14403
作者: Julian Killingback,Ofer Meshi,Henry Li,Hamed Zamani,Maryam Karimzadehgan
机构: 未知
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Traditional Retrieval-Augmented Generation (RAG) approaches generally assume that retrieval and generation occur on powerful servers removed from the end user. While this reduces local hardware constraints, it introduces significant drawbacks: privacy concerns regarding data access, recurring maintenance and storage costs, increased latency, and the necessity of an internet connection. On-device RAG addresses these challenges by executing the entire pipeline locally, making it ideal for querying sensitive personal information such as financial documents, contact details, and medical history. However, on-device deployment necessitates a delicate balance between limited memory and disk space. Specifically, the context size provided to the generative model must be restricted to manage KV cache and attention memory usage, while the size of stored embeddings must be minimized to preserve disk space. In this work, we propose a unified model that compresses the RAG context and utilizes the same representations for retrieval. This approach minimizes disk utilization compared to using separate representations, while significantly reducing the context size required for generation. With an average of 1/10 of the context, our model matches the performance of a traditional RAG reader without increasing storage requirements compared to a multi-vector retrieval model. This approach represents the first model to unify retrieval and context compression using a shared model and representation. We believe this work will inspire further consolidation of distinct models to optimize on-device performance.
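The core idea, one representation serving both retrieval and context compression, can be illustrated with a fixed-vocabulary bag-of-words stand-in for the learned encoder. `VOCAB` and all names are assumptions made for this sketch:

```python
VOCAB = {"cats": 0, "purr": 1, "stocks": 2, "fell": 3}

def embed(text):
    """Shared representation: the same vector is the retrieval key and
    the compressed context handed to the generator. Toy bag-of-words
    over a fixed vocabulary; the real model learns these vectors."""
    v = [0.0] * len(VOCAB)
    for tok in text.lower().split():
        if tok in VOCAB:
            v[VOCAB[tok]] += 1.0
    n = sum(x * x for x in v) ** 0.5 or 1.0
    return [x / n for x in v]

def retrieve(query, docs, k=1):
    """Rank stored documents by cosine similarity to the query, reusing
    the exact vectors that would also serve as compressed context."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: -sum(a * b for a, b in zip(q, embed(d))))
    return ranked[:k]

top = retrieve("cats", ["stocks fell", "cats purr"])
```

Storing one vector per chunk instead of separate retrieval and compression representations is what keeps the on-device disk footprint flat.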

[IR-13] APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI ACL2026

【速读】:该论文旨在解决大语言模型在长期对话记忆中存在的可靠性问题,即单纯扩大上下文窗口或采用简单检索策略常引入噪声并导致响应不稳定。其解决方案的关键在于提出APEX-MEM系统,该系统通过三个核心创新实现:(1) 使用领域无关本体构建属性图(property graph),将对话结构化为以实体为中心、具有时间锚定的事件;(2) 采用追加式存储(append-only storage)保留信息的完整时序演化过程;(3) 引入多工具检索代理(multi-tool retrieval agent),在查询时理解并解决冲突或演化的信息,生成紧凑且上下文相关的记忆摘要。这一机制在保留完整交互历史的同时抑制冗余细节,显著提升长程对话推理的时序一致性与准确性。

链接: https://arxiv.org/abs/2604.14362
作者: Pratyay Banerjee,Masud Moshtaghi,Shivashankar Subramanian,Amita Misra,Ankit Chadha
机构: Amazon(亚马逊)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted to ACL 2026 Mains

点击查看摘要

Abstract:Large language models still struggle with reliable long-term conversational memory: simply enlarging context windows or applying naive retrieval often introduces noise and destabilizes responses. We present APEX-MEM, a conversational memory system that combines three key innovations: (1) a property graph which uses domain-agnostic ontology to structure conversations as temporally grounded events in an entity-centric framework, (2) append-only storage that preserves the full temporal evolution of information, and (3) a multi-tool retrieval agent that understands and resolves conflicting or evolving information at query time, producing a compact and contextually relevant memory summary. This retrieval-time resolution preserves the full interaction history while suppressing irrelevant details. APEX-MEM achieves 88.88% accuracy on LOCOMO’s Question Answering task and 86.2% on LongMemEval, outperforming state-of-the-art session-aware approaches and demonstrating that structured property graphs enable more temporally coherent long-term conversational reasoning.
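An append-only store with query-time resolution, as in innovation (2) above, can be sketched as follows. The schema is a guessed minimal analogue, not APEX-MEM's actual storage layer:

```python
class AppendOnlyMemory:
    """Append-only event store with query-time conflict resolution.

    Facts are never overwritten; `current` resolves conflicts at query
    time by taking the latest value for an (entity, attribute) pair,
    while the full history stays available for temporal reasoning.
    """
    def __init__(self):
        self.events = []  # (timestamp, entity, attribute, value)

    def append(self, t, entity, attribute, value):
        self.events.append((t, entity, attribute, value))

    def current(self, entity, attribute):
        matches = [(t, v) for t, e, a, v in self.events
                   if e == entity and a == attribute]
        return max(matches)[1] if matches else None

mem = AppendOnlyMemory()
mem.append(1, "alice", "city", "Paris")
mem.append(5, "alice", "city", "Berlin")  # evolution, not overwrite
```

Because the Paris record survives, a temporal question ("where did Alice live before?") remains answerable even after the fact evolves.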

[IR-14] Evaluation of Agents under Simulated AI Marketplace Dynamics SIGIR2026

【速读】:该论文旨在解决当前信息获取系统评估中存在的重要问题:传统静态基准测试主要关注准确性指标,且假设系统独立运行,无法反映真实部署环境中多系统共存、市场竞争(如用户切换、路由决策和运营约束)对系统表现的影响。这种评估与现实脱节,导致难以预测系统上线后的实际成功情况,并掩盖了诸如早期采用优势和市场主导地位等竞争效应。解决方案的关键在于提出“市场环境评估”(Marketplace Evaluation),这是一种基于仿真的范式,将信息获取系统视为竞争市场中的参与者,通过模拟重复交互和动态的用户与代理偏好,实现纵向评估及市场层面指标(如留存率和市场份额)的量化,从而补充并扩展传统的准确率导向指标。

链接: https://arxiv.org/abs/2604.14256
作者: To Eun Kim,Alireza Salemi,Hamed Zamani,Fernando Diaz
机构: Carnegie Mellon University (卡内基梅隆大学); University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: SIGIR 2026

点击查看摘要

Abstract:Modern information access ecosystems consist of mixtures of systems, such as retrieval systems and large language models, and increasingly rely on marketplaces to mediate access to models, tools, and data, making competition between systems inherent to deployment. In such settings, outcomes are shaped not only by benchmark quality but also by competitive pressure, including user switching, routing decisions, and operational constraints. Yet evaluation is still largely conducted on static benchmarks with accuracy-focused measures that assume systems operate in isolation. This mismatch makes it difficult to predict post-deployment success and obscures competitive effects such as early-adoption advantages and market dominance. We introduce Marketplace Evaluation, a simulation-based paradigm that evaluates information access systems as participants in a competitive marketplace. By simulating repeated interactions and evolving user and agent preferences, the framework enables longitudinal evaluation and marketplace-level metrics, such as retention and market share, that complement and can extend beyond traditional accuracy-based metrics. We formalize the framework and outline a research agenda, motivated by business and economics, around marketplace simulation, metrics, optimization, and adoption in evaluation campaigns like TREC.
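Marketplace-level metrics such as market share and retention reduce to simple bookkeeping over simulated user-to-system assignments. The round-based toy simulation below is an assumption for illustration:

```python
def market_share(choices):
    """Fraction of users currently assigned to each system."""
    total = len(choices)
    share = {}
    for system in choices.values():
        share[system] = share.get(system, 0) + 1
    return {s: c / total for s, c in share.items()}

def retention(before, after, system):
    """Share of `system`'s round-t users who stayed with it in round t+1."""
    base = [u for u, s in before.items() if s == system]
    stayed = [u for u in base if after.get(u) == system]
    return len(stayed) / len(base) if base else 0.0

# Two simulated rounds: u2 switches from system A to system B.
before = {"u1": "A", "u2": "A", "u3": "B"}
after = {"u1": "A", "u2": "B", "u3": "B"}
```

Tracked over many simulated rounds, these two quantities expose competitive effects (e.g., early-adoption advantage) that a static accuracy benchmark cannot.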

[IR-15] FRESCO: Benchmarking and Optimizing Re-rankers for Evolving Semantic Conflict in Retrieval-Augmented Generation

【速读】:该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)系统中重排序器(re-ranker)在动态时间语境下性能不足的问题,即当前评估基准多局限于静态场景,未能充分考察重排序器在面对随时间演化的信息时的选择能力。其关键解决方案是提出 FRESCO(Factual Recency and Evolving Semantic COnflict)基准,通过配对时效性查询与历史 Wikipedia 版本,测试重排序器是否能在保持语义相关性的前提下优先选择事实最新的文档。实验发现现有重排序器普遍存在对较旧但语义丰富文档的偏好偏差,进而提出基于指令优化的框架,通过识别平衡“演化知识”与“非演化知识”任务的帕累托最优指令,使演化知识任务性能提升最高达 27%,同时维持非演化知识任务的竞争力。

链接: https://arxiv.org/abs/2604.14227
作者: Sohyun An(1 and 2),Hayeon Lee(1),Shuibenyang Yuan(1),Chun-cheng Jason Chen(1),Cho-Jui Hsieh(2),Vijai Mohan(1),Alexander Min(1) ((1) Meta Superintelligence Labs, (2) UCLA)
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) is a key approach to mitigating the temporal staleness of large language models (LLMs) by grounding responses in up-to-date evidence. Within the RAG pipeline, re-rankers play a pivotal role in selecting the most useful documents from retrieved candidates. However, existing benchmarks predominantly evaluate re-rankers in static settings and do not adequately assess performance under evolving information – a critical gap, as real-world systems often must choose among temporally different pieces of evidence. To address this limitation, we introduce FRESCO (Factual Recency and Evolving Semantic COnflict), a benchmark for evaluating re-rankers in temporally dynamic contexts. By pairing recency-seeking queries with historical Wikipedia revisions, FRESCO tests whether re-rankers can prioritize factually recent evidence while maintaining semantic relevance. Our evaluation reveals a consistent failure mode across existing re-rankers: a strong bias toward older, semantically rich documents, even when they are factually obsolete. We further investigate an instruction optimization framework to mitigate this issue. By identifying Pareto-optimal instructions that balance Evolving and Non-Evolving Knowledge tasks, we obtain gains of up to 27% on Evolving Knowledge tasks while maintaining competitive performance on Non-Evolving Knowledge tasks.
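A recency-aware re-ranking score of the kind FRESCO motivates could blend semantic relevance with document age. The linear staleness discount and every parameter below are illustrative assumptions, not the paper's method:

```python
def rerank(candidates, alpha=0.5, now=2026):
    """Blend semantic relevance with factual recency.

    Each candidate is (doc_id, relevance in [0, 1], year). Recency decays
    linearly over an assumed 10-year horizon; alpha trades off relevance
    against recency.
    """
    def score(c):
        _, rel, year = c
        recency = max(0.0, 1.0 - (now - year) / 10.0)
        return alpha * rel + (1 - alpha) * recency
    return sorted(candidates, key=score, reverse=True)

# A slightly less relevant but fresh revision outranks a stale one.
ranked = rerank([("old", 0.9, 2012), ("new", 0.7, 2025)])
```

A pure similarity ranker would keep "old" on top here, which is exactly the failure mode (preferring semantically rich but obsolete documents) the benchmark exposes.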

[IR-16] TRACE: A Conversational Framework for Sustainable Tourism Recommendation with Agentic Counterfactual Explanations

【速读】:该论文旨在解决传统对话式旅游推荐系统在优化用户相关性和便利性时,往往强化热门拥挤目的地和高碳排放旅行选择的问题,从而加剧可持续旅游发展的挑战。其解决方案是提出TRACE(Tourism Recommendation with Agentic Counterfactual Explanations),一个基于大语言模型(LLM)的多智能体框架,通过交互式引导(nudging)促进可持续旅游决策。关键创新在于利用智能体驱动的反事实解释(agentic counterfactual explanations)与LLM生成的澄清问题相结合,主动揭示更环保的替代方案并深化对用户意图的理解,从而在不强制干预的前提下激发用户反思,实现推荐质量与环境友好性的平衡。

链接: https://arxiv.org/abs/2604.14223
作者: Ashmi Banerjee,Adithi Satish,Wolfgang Wörndl,Yashar Deldjoo
机构: Technical University of Munich (慕尼黑工业大学); Polytechnic University of Bari (巴里理工大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traditional conversational travel recommender systems primarily optimize for user relevance and convenience, often reinforcing popular, overcrowded destinations and carbon-intensive travel choices. To address this, we present TRACE (Tourism Recommendation with Agentic Counterfactual Explanations), a multi-agent, LLM-based framework that promotes sustainable tourism through interactive nudging. TRACE uses a modular orchestrator-worker architecture where specialized agents elicit latent sustainability preferences, construct structured user personas, and generate recommendations that balance relevance with environmental impact. A key innovation lies in its use of agentic counterfactual explanations and LLM-driven clarifying questions, which together surface greener alternatives and refine understanding of intent, fostering user reflection without coercion. User studies and semantic alignment analyses demonstrate that TRACE effectively supports sustainable decision-making while preserving recommendation quality and interactive responsiveness. TRACE is implemented on Google’s Agent Development Kit, with full code, Docker setup, prompts, and a publicly available demo video to ensure reproducibility. A project summary, including all resources, prompts, and demo access, is available at this https URL. Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.14223 [cs.IR] (or arXiv:2604.14223v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2604.14223 (arXiv-issued DOI via DataCite, pending registration) Journal reference: Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26), July 20–24, 2026, Melbourne, VIC, Australia Related DOI: https://doi.org/10.1145/3805712.3808370

[IR-17] Adaptive Query Routing: A Tier-Based Framework for Hybrid Retrieval Across Financial, Legal, and Medical Documents

【速读】:该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)系统在复杂查询场景下性能瓶颈的问题,特别是不同检索范式(如向量检索、树状推理与混合策略)在金融、法律和医疗等多领域任务中的适应性差异。其解决方案的关键在于提出一种自适应混合检索(Adaptive Hybrid Retrieval, AHR)架构,该架构能够根据查询的复杂度层级动态选择最优的检索策略——即在多文档合成(Tier 4)任务中依赖向量检索的优势,在跨引用和多段落查询中则发挥混合策略的协同效应,从而实现比单一范式更鲁棒且全面的性能表现。实验验证表明,AHR在交叉引用任务上达到0.850的得分,并在真实SEC文件问答中显著缩小了与树状推理方法之间的质量差距(从11.7个百分点降至约3.7个百分点),证明了动态策略选择机制的有效性。

链接: https://arxiv.org/abs/2604.14222
作者: Afshan Hashmi
机构: TRDC, Tuwaiq Academy, Riyadh, Saudi Arabia
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has become the standard paradigm for grounding Large Language Model outputs in external knowledge. Lumer et al. [1] presented the first systematic evaluation comparing vector-based agentic RAG against hierarchical node-based reasoning systems for financial document QA across 1,200 SEC filings, finding vector-based systems achieved a 68% win rate. Concurrently, the PageIndex framework [2] demonstrated 98.7% accuracy on FinanceBench through purely reasoning-based retrieval. This paper extends their work by: (i) implementing and evaluating three retrieval architectures: Vector RAG, Tree Reasoning, and the proposed Adaptive Hybrid Retrieval (AHR) across financial, legal, and medical domains; (ii) introducing a four-tier query complexity benchmark; and (iii) employing GPT-4-powered LLM-as-judge evaluation. Experiments reveal that Tree Reasoning achieves the highest overall score (0.900), but no single paradigm dominates across all tiers: Vector RAG wins on multi-document synthesis (Tier 4, score 0.900), while the Hybrid AHR achieves the best performance on cross-reference (0.850) and multi-section queries (0.929). Cross-reference recall reaches 100% for tree-based and hybrid approaches versus 91.7% for vector search, quantifying a critical capability gap. Validation on FinanceBench (150 expert-annotated questions on real SEC 10-K and 10-Q filings) confirms and strengthens these findings: Tree Reasoning scores 0.938, Hybrid AHR 0.901, and Vector RAG 0.821, with the Tree–Vector quality gap widening to 11.7 percentage points on real-world documents. These findings support the development of adaptive retrieval systems that dynamically select strategies based on query complexity and document structure. All code and data are publicly available.
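Tier-based adaptive routing can be caricatured as a rule over surface features of the query. The keyword rules below are invented stand-ins for the paper's complexity classifier, included only to make the routing idea concrete:

```python
def route(query):
    """Pick a retrieval strategy from surface features of the query.

    Heuristic stand-in for a tier classifier: cross-reference cues go to
    the hybrid path, multi-document synthesis cues to vector search, and
    everything else to tree reasoning (the strongest overall performer
    in the paper's evaluation).
    """
    q = query.lower()
    if "see section" in q or "cross-reference" in q or "pursuant to" in q:
        return "hybrid"
    if "compare" in q or "across" in q or "all filings" in q:
        return "vector"
    return "tree"
```

The point of AHR is precisely that no single strategy wins every tier, so even a crude router can beat any fixed choice.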

[IR-18] Knowledge Graph RAG: Agentic Crawling and Graph Construction in Enterprise Documents

【速读】:该论文旨在解决复杂企业文档生态系统中语义搜索的局限性,特别是传统检索增强生成(Retrieval-Augmented Generation, RAG)管道在捕捉层级结构和信息关联方面的能力不足,导致检索准确性下降的问题。其解决方案的关键在于提出基于代理的知识图谱(Agentic Knowledge Graphs),并引入递归爬取(Recursive Crawling)机制,以有效导航超逻辑关系和多跳引用,从而显著提升对复杂法规类查询的精确回答能力。

链接: https://arxiv.org/abs/2604.14220
作者: Koushik Chakraborty,Koyel Guha
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures

点击查看摘要

Abstract:This research paper addresses the limitations of semantic search in complex enterprise document ecosystems. Traditional RAG pipelines often fail to capture hierarchical and interconnected information, leading to retrieval inaccuracies. We propose Agentic Knowledge Graphs featuring Recursive Crawling as a robust solution for navigating superseding logic and multi-hop references. Our benchmark evaluation using the Code of Federal Regulations (CFR) demonstrates that this Knowledge Graph-enhanced approach achieves a 70% accuracy improvement over standard vector-based RAG systems, providing exhaustive and precise answers for complex regulatory queries.
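Recursive crawling over multi-hop references is essentially a bounded graph traversal. This sketch assumes a simple section-to-citation map rather than the paper's agentic pipeline:

```python
def recursive_crawl(start, references, max_depth=3):
    """Follow multi-hop references from a seed section, breadth-first.

    `references` maps a section to the sections it cites, including
    provisions that supersede it. Returns every section reachable within
    `max_depth` hops: the retrieval set a flat vector search would miss.
    """
    seen = {start}
    frontier = [start]
    for _ in range(max_depth):
        nxt = []
        for sec in frontier:
            for ref in references.get(sec, []):
                if ref not in seen:
                    seen.add(ref)
                    nxt.append(ref)
        frontier = nxt
    return seen

# Two hops from A reaches C but, by design, not the three-hop section D.
reached = recursive_crawl("A", {"A": ["B"], "B": ["C"], "C": ["D"]}, max_depth=2)
```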

[IR-19] PriHA: A RAG-Enhanced LLM Framework for Primary Healthcare Assistant in Hong Kong PAKDD2026

【速读】:该论文旨在解决香港特区政府在推动基层医疗(Primary Healthcare)过程中面临的临床指南碎片化与信息获取障碍问题,即官方指南分散于不同部门且格式不统一,导致公众难以便捷获取准确的健康信息。针对这一挑战,作者提出了一种增强型检索增强生成(Retrieval-Augmented Generation, RAG)大语言模型系统——基层医疗助手(PriHA),其核心创新在于一个三阶段处理流程:首先通过查询优化器将用户意图转化为结构化的子查询;其次引入一种新颖的双源检索增强生成(Dual Retrieval Augmented Generation, DRAG)架构,实现跨来源混合检索与上下文重组生成;最终显著提升回答的准确性与清晰度,为高风险、本地化场景下的可信对话式信息检索提供可追溯的解决方案。

链接: https://arxiv.org/abs/2604.14215
作者: Richard Wai Cheung Chan,Shanru Lin,Ya-nan Ma,Hao Chen,Liangjun Jiang,Wenqi Fan
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted to PAKDD 2026

点击查看摘要

Abstract:To address the unsustainable rise in public health expenditures, the Hong Kong SAR Government is shifting its strategic focus to primary healthcare and encouraging citizens to use community resources to self-manage their health. However, official clinical guidelines are fragmented across disparate departments and formats, creating significant access barriers. While general-purpose Large Language Models (LLMs) such as ChatGPT and DeepSeek offer potential solutions for information accessibility, they are prone to generating factually inaccurate content due to a lack of localized and domain-specific knowledge. To this end, we propose a Retrieval-Augmented Generation-Enhanced LLM system as Primary Healthcare Assistant (PriHA) in Hong Kong. Specifically, a tri-stage pipeline is proposed that leverages a query optimizer to generalize user intent-oriented sub-queries, followed by a novel Dual Retrieval Augmented Generation (DRAG) architecture for mixed-source retrieval and context-reorganized generation. Comprehensive experiments and a detailed case study demonstrate that our proposed method can outperform both ablations and baseline in terms of accuracy and clarity. Our research provides a reliable and traceable dialogue retrieval framework for exploring other high-risk, localized application scenarios.

人机交互

[HC-0] UrbanClipAtlas: A Visual Analytics Framework for Event and Scene Retrieval in Urban Videos

【速读】:该论文旨在解决从长时间城市视频中提取可操作洞察的劳动密集型问题,即分析师需手动浏览原始视频以定位目标事件或发现行为趋势。其解决方案的关键在于构建一个名为URBANCLIPATLAS的视觉分析系统,该系统融合了检索增强生成(Retrieval-Augmented Generation, RAG)、领域感知实体抽取与视频定位技术,通过将长视频分割为短片段、利用视觉-语言模型生成文本描述并建立语义索引,再借助知识图谱将大语言模型(Large Language Model, LLM)输出的实体与关系映射到领域本体,并与检测到的目标和轨迹对齐,从而实现事件检索与解释的可视化验证。此设计强化了文本推理与视频证据之间的关联,显著降低验证模型输出与修正假设所需的人工工作量。

链接: https://arxiv.org/abs/2604.15225
作者: Joel Perca,Luis Sante,Juanpablo Heredia,Joao Rulff,Claudio Silva,Jorge Poco
机构: Fundação Getulio Vargas (巴西盖特利奥·瓦加斯基金会); New York University (纽约大学)
类目: Human-Computer Interaction (cs.HC)
备注: 12 pages and 6 figures

点击查看摘要

Abstract:Extracting actionable insights from long-duration urban videos is often labor-intensive: analysts must manually sift through raw footage to pinpoint target events or uncover broader behavioral trends. In this work, we present URBANCLIPATLAS, a visual analytics system for exploring long urban videos recorded at street intersections. URBANCLIPATLAS combines retrieval-augmented generation (RAG), taxonomy-aware entity extraction, and video grounding to support event retrieval and interpretation. The system segments extended recordings into short clips, generates textual descriptions with a vision-language model, and indexes them for semantic retrieval. A knowledge graph maps entities and relations from LLM answers onto a domain-specific taxonomy and aligns them with detected objects and trajectories to support visual grounding and verification. URBANCLIPATLAS supports scene retrieval through an augmented chat-based interface and improves scene interpretation by tightly aligning textual outputs with video evidence. This design strengthens the connection between textual reasoning and visual evidence, reducing the effort required to validate model outputs and refine hypotheses. We demonstrate the usefulness of URBANCLIPATLAS on the StreetAware dataset through two case studies involving hazardous scenarios and crossing dynamics at street intersections. URBANCLIPATLAS helps analysts reason about safety- and mobility-related patterns across large urban video collections.

[HC-1] Low-Cost System for Automatic Recognition of Driving Pattern in Assessing Interurban Mobility using Geo-Information

【速读】:该论文旨在解决当前道路交通中因缺乏驾驶行为评估系统而导致的交通事故频发与交通拥堵问题,尤其是在大量未配备主动安全系统的老旧车辆中。其解决方案的关键在于设计并实现一套基于物理传感器与嵌入式人工神经网络(ANN)的实时驾驶风格识别系统,通过采集速度、位置(经纬度)、时间及三轴转向速率等多维数据,利用ANN对驾驶行为进行分类,并在检测到异常驾驶模式时通过语音提示发出警告。实验表明,引入地理信息和时间维度可使分类准确率提升13%,最高达92%(针对正常与激进两种驾驶风格),体现了该方案在提升驾驶安全性与效率方面的有效性。

链接: https://arxiv.org/abs/2604.15216
作者: Oscar Romero,Aika Silveira Miura,Lorena Parra,Jaime Lloret
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 18 pages, 10 figures, 3 tables

点击查看摘要

Abstract:Mobility in urban and interurban areas, mainly by cars, is a day-to-day activity of many people. However, some of its main drawbacks are traffic jams and accidents. Newly made vehicles have pre-installed driving evaluation systems, which can prevent accidents. However, most cars on our roads do not have driver assessment systems. In this paper, we propose an approach for recognising driving styles and enabling drivers to reach safer and more efficient driving. The system consists of two physical sensors connected to a device node with a display and a speaker. An artificial neural network (ANN) is included in the node, which analyses the data from the sensors, and then recognises the driving style. When an abnormal driving pattern is detected, the speaker will play a warning message. The prototype was assembled and tested using an interurban road, in particular on a conventional road with three driving styles. The gathered data were used to train and validate the ANN. Results, in terms of accuracy, indicate that better accuracy is obtained when the velocity, position (latitude and longitude), time, and turning speed for the 3-axis are used, offering an average accuracy of 83%. If the classification is performed considering just two driving styles, normal and aggressive, then the accuracy reaches 92%. When the geo-information and time data are included, the main novelty of this paper, the classification accuracy is improved by 13%.
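Although the paper classifies with an ANN over velocity, position, time, and 3-axis turning speed, the per-window classification step can be illustrated with a threshold rule. All thresholds here are invented for the sketch, not taken from the paper:

```python
def classify_window(speeds, turn_rates, speed_limit=90.0):
    """Label a window of sensor readings as 'normal' or 'aggressive'.

    Stand-in for the paper's ANN: flags windows whose mean speed exceeds
    an assumed limit (km/h) or whose turning rate (deg/s) swings sharply.
    """
    mean_speed = sum(speeds) / len(speeds)
    max_turn = max(abs(r) for r in turn_rates)
    if mean_speed > speed_limit or max_turn > 30.0:
        return "aggressive"
    return "normal"
```

In the real system, the output of this step would trigger the spoken warning when an abnormal pattern is detected.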

[HC-2] OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

【速读】:该论文旨在解决当前基于视觉-语言模型的移动代理(Mobile Agent)系统在训练数据封闭、任务与轨迹合成机制不透明的问题,从而限制了研究的可复现性与进步空间。其核心解决方案是提出一个开源框架 OpenMobile,关键在于两个组成部分:一是可扩展的任务合成流水线,通过探索构建全局环境记忆,并基于此生成多样化且语义锚定的任务指令;二是策略切换(policy-switching)策略,在训练过程中交替使用学习者和专家模型以捕获标准模仿学习中常缺失的错误恢复数据。该方案显著提升了移动代理在动态基准测试中的性能表现,如 Qwen2.5-VL 和 Qwen3-VL 在 AndroidWorld 上分别达到 51.7% 和 64.7% 的成功率,优于现有公开数据方法,并通过透明分析验证性能提升源于功能覆盖广度而非基准过拟合。

链接: https://arxiv.org/abs/2604.15093
作者: Kanzhi Cheng,Zehao Li,Zheng Ma,Nuo Chen,Jialin Cao,Qiushi Sun,Zichen Ding,Fangzhi Xu,Hang Yan,Jiajun Chen,Anh Tuan Luu,Jianbing Zhang,Lewei Lu,Dahua Lin
机构: Nanjing University (南京大学); SenseTime (商汤科技); Nanyang Technological University (南洋理工大学); Shanghai AI Laboratory (上海人工智能实验室); The University of Hong Kong (香港大学); Xi’an Jiaotong University (西安交通大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Work in progress

点击查看摘要

Abstract:Mobile agents powered by vision-language models have demonstrated impressive capabilities in automating mobile tasks, with recent leading models achieving a marked performance leap, e.g., nearly 70% success on AndroidWorld. However, these systems keep their training data closed and remain opaque about their task and trajectory synthesis recipes. We present OpenMobile, an open-source framework that synthesizes high-quality task instructions and agent trajectories, with two key components: (1) a scalable task synthesis pipeline that constructs a global environment memory from exploration, then leverages it to generate diverse and grounded instructions; and (2) a policy-switching strategy for trajectory rollout. By alternating between learner and expert models, it captures essential error-recovery data often missing in standard imitation learning. Agents trained on our data achieve competitive results across three dynamic mobile agent benchmarks: notably, our fine-tuned Qwen2.5-VL and Qwen3-VL reach 51.7% and 64.7% on AndroidWorld, far surpassing existing open-data approaches. Furthermore, we conduct transparent analyses on the overlap between our synthetic instructions and benchmark test sets, and verify that performance gains stem from broad functionality coverage rather than benchmark overfitting. We release data and code at this https URL to bridge the data gap and facilitate broader mobile agent research.

[HC-3] “From remembering to shaping”: Narrating Shared Experiences by Co-Designing Cultural Heritage Artifacts in Collaborative VR

【速读】:该论文旨在解决如何通过生成式 AI (Generative AI, GenAI) 在虚拟文化遗产(Cultural Heritage, CH)场景中实现集体记忆的协同构建与表达问题。其核心挑战在于,传统个体化记忆难以体现文化遗产的共同体属性,而现有 GenAI 工具在沉浸式协作环境中往往无法充分支持用户对文化意义的多维表达。解决方案的关键在于设计了一种双人沉浸式工作流,允许参与者通过提示词(prompts)和模型放置操作共同生成三维文物与环境,并借助空间操作整合个人经验与共享叙事;当 GenAI 输出不满足需求时,用户会进行创造性再利用(creative appropriation),将生成结果转化为新的设计灵感,从而维持并深化集体叙事的动态演化。

链接: https://arxiv.org/abs/2604.15058
作者: Yushang Yang,Fanxu Meng,Fiona Fui-Hoon Nah,RAY LC
机构: City University of Hong Kong (香港城市大学); The Chinese University of Hong Kong, Shenzhen (深圳中文大学); Singapore Management University (新加坡管理大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The ways people remember and recall places reveal an invisible aspect of cultural heritage (CH), reflecting how individuals and communities relate to these places. Heritage is communal, emerging through collaboratively constructed narratives rather than individual records. To probe how people may share collective memories, we designed an immersive two-person workflow for collaboratively co-designing 3D artifacts and environments in virtual heritage locations, using Generative AI (GenAI) to instantiate these intangible memories. Observations of the co-creation process revealed that participants merged prompts and model placements when negotiating different perspectives. They used spatial operations to compose scenes, and also to express personal and embodied experiences of CH. When GenAI failed to meet their needs, participants engaged in creative appropriation, re-purposing unsatisfactory generated objects as sources of design inspiration to further shared narratives. While GenAI may have a homogenizing effect on CH expression, this work shows how people may overcome limitations in immersive collaborative workflows.

[HC-4] CoGrid the Multi-User Gymnasium: A Framework for Multi-Agent Experimentation

【速读】:该论文旨在解决当前研究中缺乏易于使用且功能完备的工具来开展人类与人工智能(Artificial Intelligence, AI)协同交互的多智能体实验的问题。现有研究在探索人类如何与自主代理(autonomous agents)互动时,受限于可用工具的复杂性和可访问性,难以有效模拟真实场景下的社会决策行为。解决方案的关键在于提出两个开源工具:一是CoGrid,一个基于网格的多智能体仿真库,支持NumPy和JAX双重后端以提升计算效率;二是Multi-User Gymnasium (MUG),它将仿真环境直接转化为交互式网页实验平台,支持任意数量的人类与AI参与,并通过服务端权威或对等网络结合回滚网络协议(rollback netcode)应对延迟问题。这两个工具共同降低了实验设计门槛,使研究人员能够高效部署人类-AI交互研究,从而推动心理学、认知科学及决策机制等相关领域的深入探索。

链接: https://arxiv.org/abs/2604.15044
作者: Chase McDonald,Cleotilde Gonzalez
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 36 pages, 11 figures

点击查看摘要

Abstract:The increasing integration of artificial intelligence (AI) in everyday life brings with it new challenges and questions regarding how humans interact with autonomous agents. Multi-agent experiments, where humans and AI act together, can offer important opportunities to study social decision making, but there is a lack of accessible tooling available to researchers to run such experiments. We introduce two tools designed to reduce these barriers. The first, CoGrid, is a multi-agent grid-based simulation library with dual NumPy and JAX backends. The second, Multi-User Gymnasium (MUG), translates such simulation environments directly into interactive web-based experiments. MUG supports interactions with arbitrary numbers of humans and AI, utilizing either server-authoritative or peer-to-peer networking with rollback netcode to account for latency. Together, these tools can enable researchers to deploy studies of human-AI interaction, facilitating inquiry into core questions of psychology, cognition, and decision making and their relationship to human-AI interaction. Both tools are open source and available to the broader research community. Documentation and source code are available at cogrid, multi-user-gymnasium.this http URL. This paper details the functionality of these tools and presents several case studies to illustrate their utility in human-AI multi-agent experimentation.

[HC-5] Applying SHAPR in AI-Assisted Research Software Development: Lessons Learnt from Building a Share Trading System

【速读】:该论文旨在解决生成式 AI (Generative AI) 辅助研究软件开发过程中可能出现的连续性弱化、可追溯性下降及方法学清晰度不足的问题。其解决方案的关键在于提出并实践了 SHAPR(Solo, Human-centred, AI-assisted PRactice)框架,通过结构化的工作配置实现交互、实现与文档的有机协同:强调持续更新文档以维持项目知识的组织性和可用性,借助快速捕获(quick capture)与 AI 辅助精炼机制促进代码与文档的共演化,并依托明确的周期边界快照增强开发连续性;同时,利用工具组合(如 ChatGPT 用于协作推理、PyCharm 实现代码实现、Obsidian 构建外部工作记忆与结构化文档)支撑 SHAPR 的落地执行,从而在保持工具无关性的前提下提升 AI 辅助研发过程的系统性与可复现性。

链接: https://arxiv.org/abs/2604.15020
作者: Ka Ching Chan
机构: 未知
类目: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注: 6 pages, 2 figures, conference paper

点击查看摘要

Abstract:Generative AI is changing how research software is developed, but rapid AI-assisted development can weaken continuity, traceability, and methodological clarity. SHAPR (Solo, Human-centred, AI-assisted PRactice) was proposed as a framework for structuring AI-assisted research software development. This paper presents a documented case of applying SHAPR to the development of a modular share trading system. From the outset, the project adopted a SHAPR-informed working configuration that shaped how interaction, implementation, and documentation were organised. Across iterative development cycles, the project generated a structured evidence base including reflection notes, development cycle review notes, source-of-truth documents, contracts, quick captures, workflow notes, and evolving code artefacts. The case showed that continuous documentation updates, supported by quick capture and AI-assisted refinement, helped maintain organised and usable project knowledge throughout development. Five recurring lessons were identified: contracts stabilised AI-assisted coding, a maintained source-of-truth layer improved coherence, cycle-boundary snapshots strengthened continuity, code and documentation co-evolved through quick capture and iterative refinement, and environment setup itself contributed to knowledge generation. The case also illustrates a practical SHAPR operating configuration in which a ChatGPT Project and cycle-specific chats supported interaction, reasoning, summarisation, and coding collaboration, PyCharm supported artefact implementation, and Obsidian supported external working memory, structured documentation, reflection, continuity, and repository-oriented note organisation, while remaining consistent with SHAPR’s tool-agnostic principle. The paper contributes practical guidance and good practices for researchers conducting AI-assisted research software development.

[HC-6] Agentic Explainability at Scale: Between Corporate Fears and XAI Needs

【速读】:该论文旨在解决企业在部署生成式 AI(Generative AI)代理(Agent)过程中因治理能力滞后于技术扩张而引发的“代理泛滥”(Agent Sprawl)问题,特别是由于缺乏对代理配置、运行时行为及代理间协作决策过程的可观测性与可解释性所带来的风险。解决方案的关键在于引入设计阶段与运行阶段的可解释性(Explainability)技术,并提出一个初步原型——“代理AI卡”(Agentic AI Card),以增强企业对代理行为的理解和控制力,从而实现安全、可控的大规模部署。

链接: https://arxiv.org/abs/2604.14984
作者: Yomna Elsayed,Cecily Jones
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Presented at Human-centered Explainable AI Workshop (HCXAI) @ CHI 2026, Barcelona, Spain, 2026

点击查看摘要

Abstract:As companies enter the race for agentic AI adoption, fears surface around agentic autonomy and its subsequent risks. These fears compound as companies scale their agentic AI adoption with low-code applications, without a comparable scaling in their governance processes and expertise resulting in a phenomenon known as “Agent Sprawl”. While shadow AI tools can help with agentic discovery and identification, few observability tools offer insights into the agents’ configuration and settings or the decision-making process during agent-to-agent communication and orchestration. This paper explores AI governance professionals’ concerns in enterprise settings, while offering design-time and runtime explainability techniques as suggested by AI governance experts for addressing those fears. Finally, we provide a preliminary prototype of an Agentic AI Card that can help companies feel at ease deploying agents at scale.

[HC-7] Hybrid Decision Making via Conformal VLM-generated Guidance

【速读】:该论文旨在解决现有混合决策(Hybrid Decision Making, HDM)方法中生成的指导信息冗余且难以消化的问题,尤其在学习引导(Learning to Guide, LtG)框架下,传统方法将所有可能结果的信息合并输出,导致人类用户认知负担加重。解决方案的关键在于提出ConfGuide方法,其核心创新是引入共形风险控制(Conformal Risk Control)来筛选出一个有限且具有统计保证的可能结果集合,从而确保假阴性率可控的同时生成更简洁、针对性更强的文本指导,显著提升人机协同决策效率与可理解性。

链接: https://arxiv.org/abs/2604.14980
作者: Debodeep Banerjee,Burcu Sayin,Stefano Teso,Andrea Passerini
机构: University of Pisa (比萨大学); University of Trento (特伦托大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Building on recent advances in AI, hybrid decision making (HDM) holds the promise of improving human decision quality and reducing cognitive load. We work in the context of learning to guide (LtG), a recently proposed HDM framework in which the human is always responsible for the final decision: rather than suggesting decisions, in LtG the AI supplies (textual) guidance useful for facilitating decision making. One limiting factor of existing approaches is that their guidance compounds information about all possible outcomes, and as a result it can be difficult to digest. We address this issue by introducing ConfGuide, a novel LtG approach that generates more succinct and targeted guidance. To this end, it employs conformal risk control to select a set of outcomes, ensuring a cap on the false negative rate. We demonstrate our approach on a real-world multi-label medical diagnosis task. Our empirical evaluation highlights the promise of ConfGuide.
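
ConfGuide 的实现未在摘要中给出。下面是共形风险控制选阈值的一个假设性草图:在校准集上从严到宽扫描分数阈值,取满足"校正后假阴性率 ≤ α"的最大阈值;函数与变量命名均为示意:

```python
import numpy as np

# 假设性草图:用 conformal risk control 选取分数阈值 lam,
# 使预测集 {y: score >= lam} 的假阴性率(FNR)被控制在 alpha 以内。
# 这只是该思路的示意实现,并非 ConfGuide 的官方代码。
def calibrate_threshold(scores, labels, alpha=0.1):
    """scores/labels: (n_samples, n_outcomes),labels 为 0/1 多标签真值。"""
    n = scores.shape[0]
    candidates = np.sort(np.unique(scores))[::-1]  # 阈值从严(大)到宽(小)扫描
    for lam in candidates:
        picked = scores >= lam
        # 每个样本的 FNR:被漏掉的真阳性结果所占比例
        fn = ((labels == 1) & ~picked).sum(axis=1)
        pos = np.maximum(labels.sum(axis=1), 1)
        risk = (fn / pos).mean()
        # conformal 校正项(损失上界为 1),用于对新样本给出期望风险保证
        if (n / (n + 1)) * risk + 1.0 / (n + 1) <= alpha:
            return lam  # 风险随 lam 减小而单调下降,首个满足者即最大可行阈值
    return candidates[-1]  # 退化情形:取最宽松阈值
```

阈值越大预测集越小、指导越简洁;该扫描在"简洁"与"不漏掉真实结果"之间按 α 取平衡。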

[HC-8] Governing Reflective Human-AI Collaboration: A Framework for Epistemic Scaffolding and Traceable Reasoning

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)虽在语言生成上表现出色,却仍局限于符号模拟而缺乏具身理解的问题,即模型无法实现具有时间连续性、因果反馈和现实交互锚定的真正推理能力。其解决方案的关键在于将推理重构为一种人与AI之间的分布式关系过程,而非仅作为模型内部的能力;通过引入“建筑师的笔”(The Architect’s Pen)这一实践方法,将人类的抽象思考、模型的表达能力与人类的反思机制嵌入到对话循环中——形成“人类抽象—模型表述—人类反思”的迭代结构,从而把整个系统视为一个可审计、可治理的协同推理单元。此框架不依赖新模型架构,即可提升AI使用的透明度、可控性和问责性,契合欧盟人工智能法案(EU AI Act)等新兴治理标准。

链接: https://arxiv.org/abs/2604.14898
作者: Rikard Rosenbacke,Carl Rosenbacke,Victor Rosenbacke,Martin McKee
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large language models have advanced rapidly, from pattern recognition to emerging forms of reasoning, yet they remain confined to linguistic simulation rather than grounded understanding. They can produce fluent outputs that resemble reflection, but lack temporal continuity, causal feedback, and anchoring in real-world interaction. This paper proposes a complementary approach in which reasoning is treated as a relational process distributed between human and model rather than an internal capability of either. Building on recent work on “System-2” learning, we relocate reflective reasoning to the interaction layer. Instead of engineering reasoning solely within models, we frame it as a cognitive protocol that can be structured, measured, and governed using existing systems. This perspective emphasizes collaborative intelligence, combining human judgment and contextual understanding with machine speed, memory, and associative capacity. We introduce “The Architect’s Pen” as a practical method. Like an architect who thinks through drawing, the human uses the model as an external medium for structured reflection. By embedding phases of articulation, critique, and revision into human-AI interaction, the dialogue itself becomes a reasoning loop: human abstraction - model articulation - human reflection. This reframes the question from whether the model can think to whether the human-AI system can reason. The framework enables auditable reasoning traces and supports alignment with emerging governance standards, including the EU AI Act and ISO/IEC 42001. It provides a practical path toward more transparent, controllable, and accountable AI use without requiring new model architectures. 

[HC-9] The Missing Knowledge Layer in AI: A Framework for Stable Human-AI Reasoning

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险决策场景中因输出流畅性掩盖内部推理不稳定性而导致的不可靠性问题,即“流畅性误判为可靠性”的现象。其核心解决方案是提出一个两层协同机制:第一层通过人类侧干预措施(如不确定性提示、冲突揭示和可审计的推理轨迹)增强用户对模型输出的批判性判断;第二层则引入模型侧的“认识论控制环”(Epistemic Control Loop, ECL),用于检测推理过程中的不稳定状态并动态调节生成行为。这一组合框架显著提升了人机交互中推理信号的清晰度,使不确定性与认知漂移在执行前即可被识别,从而实现更精准的能力治理,并契合欧盟人工智能法案(EU AI Act)等合规要求,推动AI在关键应用中的可信部署。

链接: https://arxiv.org/abs/2604.14881
作者: Rikard Rosenbacke,Carl Rosenbacke,Victor Rosenbacke,Martin McKee
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large language models are increasingly integrated into decision-making in areas such as healthcare, law, finance, engineering, and government. Yet they share a critical limitation: they produce fluent outputs even when their internal reasoning has drifted. A confident answer can conceal uncertainty, speculation, or inconsistency, and small changes in phrasing can lead to different conclusions. This makes LLMs useful assistants but unreliable partners in high-stakes contexts. Humans exhibit a similar weakness, often mistaking fluency for reliability. When a model responds smoothly, users tend to trust it, even when both model and user are drifting together. This paper is the first in a five-paper research series on stabilising human-AI reasoning. The series proposes a two-layer approach: Parts II-IV introduce human-side mechanisms such as uncertainty cues, conflict surfacing, and auditable reasoning traces, while Part V develops a model-side Epistemic Control Loop (ECL) that detects instability and modulates generation accordingly. Together, these layers form a missing operational substrate for governance by increasing signal-to-noise at the point of use. Stabilising interaction makes uncertainty and drift visible before enforcement is applied, enabling more precise capability governance. This aligns with emerging compliance expectations, including the EU AI Act and ISO/IEC 42001, by making reasoning processes traceable under real conditions of use. The central claim is that fluency is not reliability. Without structures that stabilise both human and model reasoning, AI cannot be trusted or governed where it matters most. 

[HC-10] SkillDroid: Compile Once, Reuse Forever

【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的移动图形用户界面(GUI)代理在任务执行中存在状态缺失的问题,即每次任务调用都需重新进行完整的LLM推理,导致效率低下且无法积累经验。其核心解决方案是提出SkillDroid,一个三层技能代理架构:通过将成功的LLM引导GUI轨迹编译为带权重元素定位器和类型化参数槽的参数化技能模板,并在后续任务中直接回放这些模板而无需任何LLM调用;同时采用匹配级联机制(正则表达式、嵌入相似性与应用过滤)实现指令到技能的精准路由,并引入失败学习层在技能可靠性下降时触发重新编译。该设计显著提升了任务成功率与执行效率,且系统性能随使用次数增加而持续改善。

链接: https://arxiv.org/abs/2604.14872
作者: Qijia Chen,Andrea Bellucci,Zhida Sun,Giulio Jacucci
机构: University of Helsinki(赫尔辛基大学); Universidad Carlos III de Madrid(卡洛斯三世大学); Shenzhen University(深圳大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:LLM-based mobile GUI agents treat every task invocation as an independent reasoning episode, requiring a full LLM inference call at each action step. This per-step dependence makes them stateless: a task completed successfully yesterday is re-derived from scratch today, with no improvement in reliability or speed. We present SkillDroid, a three-layer skill agent that compiles successful LLM-guided GUI trajectories into parameterized skill templates (sequences of UI actions with weighted element locators and typed parameter slots) and replays them on future invocations without any LLM calls. A matching cascade (regex patterns, embedding similarity, and app filtering) routes incoming instructions to stored skills, while a failure-learning layer triggers recompilation when skill reliability degrades. Over a 150-round longitudinal evaluation with systematic instruction variation and controlled perturbations, SkillDroid achieves an 85.3% success rate (23 percentage points above a stateless LLM baseline) while using 49% fewer LLM calls. The skill replay mechanism achieves a perfect 100% success rate across 79 replay rounds at 2.4 times the speed of full LLM execution. Most critically, the system improves with use: its success rate converges upward from 87% to 91%, while the baseline degrades from 80% to 44%.
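
按摘要描述的三级级联(正则 → 嵌入相似度 → 应用过滤),下面给出一个假设性的最小路由草图;技能的存储结构与外部传入的 embed 函数均为示意,并非 SkillDroid 的真实代码:

```python
import re
import numpy as np

# 假设性的技能路由级联草图(非 SkillDroid 原始实现):
# 第 1 级正则精确匹配,第 2 级嵌入余弦相似度,第 3 级按当前 app 过滤兜底。
def route(instruction, app, skills, embed, sim_threshold=0.8):
    """skills: [{"pattern": str, "app": str, "name": str, "vec": np.ndarray}]"""
    # 第 1 级:正则模式匹配(最快、最可靠)
    for s in skills:
        if re.fullmatch(s["pattern"], instruction):
            return s["name"]
    # 第 2 级:嵌入相似度(embed 为外部传入的句向量函数)
    q = embed(instruction)
    best, best_sim = None, sim_threshold
    for s in skills:
        sim = float(q @ s["vec"] / (np.linalg.norm(q) * np.linalg.norm(s["vec"])))
        if sim > best_sim:
            best, best_sim = s["name"], sim
    if best is not None:
        return best
    # 第 3 级:按当前 app 过滤,若该 app 下只剩一个技能则直接使用
    in_app = [s for s in skills if s["app"] == app]
    return in_app[0]["name"] if len(in_app) == 1 else None
```

返回 None 即级联未命中,此时可回退到完整的 LLM 推理并在成功后编译新技能。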

[HC-11] NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results CVPR

【速读】:该论文旨在解决视频显著性预测(Video Saliency Prediction)问题,即自动生成能够准确反映人类视觉注意力分布的显著性图(saliency map)。其解决方案的关键在于构建了一个包含2,000个多样化视频的公开数据集,该数据集通过众包鼠标追踪方式收集了超过5,000名评估者的眼动轨迹与对应显著性标注,并在800个测试视频上采用通用质量指标进行评估。最终,该挑战吸引了20余支团队参与,7支团队通过代码审查并提交了有效方法,所有数据和代码均对公众开放,为后续研究提供了高质量基准。

链接: https://arxiv.org/abs/2604.14816
作者: Andrey Moskalenko,Alexey Bryncev,Ivan Kosmynin,Kira Shilovskaya,Mikhail Erofeev,Dmitry Vatolin,Radu Timofte,Kun Wang,Yupeng Hu,Zhiran Li,Hao Liu,Qianlong Xiang,Liqiang Nie,Konstantinos Chaldaiopoulos,Niki Efthymiou,Athanasia Zlatintsi,Panagiotis Filntisis,Katerina Pastra,Petros Maragos,Li Yang,Gen Zhan,Yiting Liao,Yabin Zhang,Yuxin Liu,Xu Wu,Yunheng Zheng,Linze Li,Kun He,Cong Wu,Xuefeng Zhu,Tianyang Xu,Xiaojun Wu,Wenzhuo Zhao,Keren Fu,Gongyang Li,Shixiang Shi,Jianlin Chen,Haibin Ling,Yaoxin Jiang,Guoyi Xu,Jiajia Liu,Yaokun Shi,Jiachen Tu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: CVPRW 2026

点击查看摘要

Abstract:This paper presents an overview of the NTIRE 2026 Challenge on Video Saliency Prediction. The goal of the challenge participants was to develop automatic saliency map prediction methods for the provided video sequences. The novel dataset of 2,000 diverse videos with an open license was prepared for this challenge. The fixations and corresponding saliency maps were collected using crowdsourced mouse tracking and contain viewing data from over 5,000 assessors. Evaluation was performed on a subset of 800 test videos using generally accepted quality metrics. The challenge attracted over 20 teams making submissions, and 7 teams passed the final phase with code review. All data used in this challenge is made publicly available - this https URL.
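
摘要提到的"通用质量指标"未列出具体公式;这里以显著性预测中常用的 NSS(Normalized Scanpath Saliency)为例给出单帧实现草图,仅作示意,挑战实际使用的指标集合请以官方评测代码为准:

```python
import numpy as np

# 假设性示意:NSS 把预测显著性图做零均值、单位方差归一化,
# 然后在真实注视点位置上取平均;值越大表示预测越集中于注视处。
def nss(saliency, fixations):
    """saliency: (H, W) 预测显著性图;fixations: (H, W) 0/1 注视点图。"""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return float(s[fixations.astype(bool)].mean())
```

视频场景下通常逐帧计算后再对帧取平均;NSS 常与 CC、KL 散度、AUC 等指标联合报告。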

[HC-12] Evaluating Encodings for Bivariate Edges in Adjacency Matrices

【速读】:该论文旨在解决在邻接矩阵中编码定量边值分布的问题,即如何在有限的单元格空间内有效表示多值边信息。传统邻接矩阵虽能保持结构清晰,但其紧凑单元格难以同时展示多个数值。解决方案的关键在于使用两种度量来表征边值分布:一个中心趋势指标(如均值、中位数)和一个离散度指标(如标准差、四分位距),并通过四种候选可视化编码进行评估:双变量色彩调色板、嵌入式条形图及两种重叠标记设计(主属性映射到颜色,次属性分别映射到面积或角度)。实验结果表明,基于面积的重叠标记和条形图表现最优,而角度标记稳定性较差,双变量色彩则持续表现不佳,从而明确了不同视觉通道在严格约束下的行为特性与设计选择的优势与局限。

链接: https://arxiv.org/abs/2604.14791
作者: Jorge Acosta-Hernández,Alexander Lex,Tingying He
机构: Universidad Politécnica de Madrid (马德里理工大学); Center for Computational Simulation (计算仿真中心); Graz University of Technology (格拉茨工业大学); University of Utah (犹他大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:We present the first empirical evaluation of techniques for encoding distributions of quantitative edge values within adjacency matrices. In many real-world networks, edges represent not a single value but a set of measurements. While adjacency matrices preserve structural clarity, their compact cells limit the simultaneous display of multiple values. To address this, we explore edge encodings that represent distributions by two values: a measure of central tendency (mean, median, mode) and a measure of dispersion (standard deviation, variance, IQR). We select four possible encodings for evaluation that prior work has suggested are suitable for the limited space available in matrices: a bivariate color palette, embedded bar charts, and two overlaid-mark designs mapping the primary attribute to color and the secondary attribute to area or angle. In a preregistered crowdsourced study with 156 participants, we assessed performance of these encodings across eight analytical tasks and collected readability and aesthetic ratings. Results reveal clear performance regimes: area-based overlaid marks and bar charts achieved the highest overall performance; angle-based marks show moderate but less stable performance, and bivariate color consistently underperforms these alternatives. These findings clarify how visual channels behave under strict constraints and delineate the strengths and limitations of key design choices for multivariate edge visualization.
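
作为示意,下面的草图把"每条边是一组测量值"的网络汇总成两张邻接矩阵(中心趋势 + 离散度,默认取中位数与 IQR),即论文中双变量编码所需的数据输入形式;函数签名为假设性的:

```python
import numpy as np

# 假设性草图:从多值边数据生成 (中心趋势, 离散度) 两张矩阵,
# 分别对应双变量编码中的两个视觉通道(如颜色 x 面积)。
def summarize_edges(n_nodes, edge_samples,
                    center=np.median,
                    spread=lambda v: np.subtract(*np.percentile(v, [75, 25]))):
    """edge_samples: {(i, j): [v1, v2, ...]},每条边的多次测量;无边处为 NaN。"""
    c = np.full((n_nodes, n_nodes), np.nan)
    d = np.full((n_nodes, n_nodes), np.nan)
    for (i, j), vals in edge_samples.items():
        v = np.asarray(vals, dtype=float)
        c[i, j] = center(v)
        d[i, j] = spread(v)   # 默认 IQR = Q3 - Q1
    return c, d
```

换用均值/标准差等其他指标组合时,只需替换 center 与 spread 两个可调参数。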

[HC-13] Beyond Chat and Clicks: GUI Agents for In-Situ Assistance via Live Interface Transformation

【速读】:该论文旨在解决复杂视觉界面(Complex Visual Interfaces)中用户学习成本高、操作困难的问题,现有方法要么依赖独立的聊天交互模式,要么需要大量应用特定工程开发才能实现原生支持。其解决方案的关键在于提出“在位辅助”(in-situ assistance)机制——通过轻量级浏览器级干预DOM(Document Object Model),在不修改应用逻辑或重构界面的前提下,实时对网页元素进行插入、修改或重组,从而即时提升用户对界面的理解与操作效率。作者构建了面向DOM的计算管道,并实现了DOMSteer这一Chrome扩展,能够响应用户求助请求并执行可逆的DOM操作,如上下文提示工具、控件高亮和布局重排,实验证明其在两个复杂视觉界面中均能提供可靠且高效的辅助支持。

链接: https://arxiv.org/abs/2604.14668
作者: Pan Hao,Rishi Selvakumaran,Jacob Sun,Qianwen Wang
机构: University of Minnesota (明尼苏达大学)
类目: Human-Computer Interaction (cs.HC)
备注: 13 pages, 10 figures

点击查看摘要

Abstract:Complex visual interfaces are powerful yet have a steep learning curve, as users must navigate feature-rich visual interfaces while reasoning about domain-specific operations. Existing approaches either deliver assistance through a separate chat-based interaction, or require substantial application-specific engineering to build support natively into each interface. To address the gaps, we propose in-situ assistance: a mode of support delivered directly within any live web interface through lightweight, browser-level interventions on the Document Object Model (DOM), without rebuilding the application or modifying its underlying logic. We contribute a design space and a computational pipeline for DOM-mediated in-situ assistance, characterizing how GUI agents can insert, mutate, or recompose web elements to make the interface easier for users to understand, use, and navigate. We instantiate in-situ assistance in DOMSteer, a Chrome extension that interprets a user’s help request and live interface context, grounds it to relevant UI elements, and executes reversible DOM manipulations directly on the live page to deliver assistance, including contextual tooltips, control highlighting, and layout reorganization. Quantitative evaluations on two complex visual interfaces show that DOMSteer delivers reliable and efficient in-situ assistance. Use cases and a comparative user study with baseline ChatGPT Atlas demonstrate the usability and effectiveness of DOMSteer. Altogether, these findings point to a broader role for GUI agents: not just assisting from the sidelines, but actively reconfiguring live interfaces to support users in the moment.

[HC-14] Touching Space: Accessible Map Exploration Through Conversational Audio-Haptic Interaction

【速读】:该论文旨在解决盲人及低视力(Blind and Low-Vision, BLV)人群在陌生环境中缺乏事前空间认知建模的问题,即现有辅助导航工具多聚焦于实时引导,而忽视了用户在出行前构建整体空间理解(如物体间的相对位置关系)的需求。解决方案的关键在于提出一个端到端系统——Touching Space,该系统通过整合地图数据检索、触觉与语音反馈机制,使BLV用户能够借助通用硬件设备探索空间布局,并通过与对话式智能体交互提问,实现对目标场所的认知地图构建。

链接: https://arxiv.org/abs/2604.14637
作者: Li Liu,Jiaming Qu,Marc Jowell Bagaoisan,David T. Lee,Leilani H. Gilpin
机构: University of California, Santa Cruz (加州大学圣克鲁兹分校); Amazon (亚马逊); University of California, Los Angeles (加州大学洛杉矶分校)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Most existing assistive navigation tools focus on providing real-time guidance for Blind and Low-Vision (BLV) people, but few support building a holistic spatial understanding of unfamiliar environments before travel. Such cognitive map construction (e.g., knowing that a fountain is south of a tower and west of a hotel) is important for pre-travel planning, yet remains underexplored in prior work. To address this gap, we present Touching Space, an end-to-end system that retrieves map data for a target place and loads it into a frontend interface for exploration. The system combines haptic and audio feedback: users explore spatial layouts through touch and ask spoken questions to a conversational agent during exploration. Touching Space contributes a conversational interface that supports BLV users in building cognitive maps on commodity hardware.

[HC-15] Bias in Surface Electromyography Features across a Demographically Diverse Cohort

【速读】:该论文旨在解决上肢表面肌电信号(sEMG)在跨用户应用中因个体差异(如年龄、体重指数等)导致的信号特征高度异质性问题,这限制了基于sEMG的人机交互系统(如假肢控制、虚拟现实设备)的通用性和公平性。解决方案的关键在于系统性分析了147个常用sEMG特征与多种人口统计学变量之间的关联,通过混合效应线性模型和偏最小二乘法(PLS)识别出33%(49/147)的sEMG特征显著受人口统计学因素影响,从而为开发更具鲁棒性和公平性的sEMG神经接口提供了数据驱动的指导依据。

链接: https://arxiv.org/abs/2604.14460
作者: Aditi Agrawal,Celine John Philip,Giancarlo K. Sagastume,Marcus A. Battraw,Wilsaan M. Joiner,Jonathon S. Schofield,Lee M. Miller,Richard S. Whittle
机构: University of California, Davis (加州大学戴维斯分校); California State University, Chico (加州州立大学奇科分校)
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 17 pages, 4 Figures

点击查看摘要

Abstract:Neuromotor decoding from upper-limb electromyography (sEMG) can enhance human-machine interfaces and offer a more natural means of controlling prosthetic limbs, virtual reality, and household electronics. Unfortunately, current sEMG technology does not always perform consistently across users because individual differences such as age and body mass index, among many others, can substantially alter signal quality. This variability makes sEMG characteristics highly idiosyncratic, often necessitating laborious personalization and iterative tuning to achieve reliable performance. This variability has particular import for sEMG-based assistive devices and neural interfaces, where demographic biases in sEMG features could undermine broad and fair deployment. In this study, we explore how demographic differences affect the sEMG signals produced and their implications for machine learning-based gesture decoding. We analyze the data set provided by, in which we derive 147 common sEMG features extracted from 81 demographically diverse individuals performing discrete hand gestures. Using mixed-effects linear models and partial least squares (PLS) analysis, which take into consideration demographic variables (including age, sex, height, weight, skin properties, subcutaneous fat, and hair density), we identify that 33% (49 of 147) of commonly used sEMG features show significant associations with demographic characteristics. These results may help guide the development of fair and unbiased sEMG-based neural interfaces across a diverse population.
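
论文采用的是混合效应模型与 PLS 分析;下面只给出一个大幅简化的玩具筛选草图(对每个特征做最小二乘回归并按 R² 粗筛),用于说明"特征—人口学变量关联筛查"的基本形态,并非论文的统计方法:

```python
import numpy as np

# 玩具示意(非论文的混合效应/PLS 分析):对每个 sEMG 特征,
# 用最小二乘回归到人口学变量上,按 R^2 阈值粗筛"受人口学影响"的特征。
def screen_features(X_demo, Y_feats, r2_threshold=0.3):
    """X_demo: (n, p) 人口学变量;Y_feats: (n, k) sEMG 特征。返回被标记的特征索引。"""
    n = X_demo.shape[0]
    X = np.column_stack([np.ones(n), X_demo])  # 加截距项
    flagged = []
    for j in range(Y_feats.shape[1]):
        y = Y_feats[:, j]
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = (1.0 - resid @ resid / ss_tot) if ss_tot > 0 else 0.0
        if r2 >= r2_threshold:
            flagged.append(j)
    return flagged
```

正式分析还需显著性检验与多重比较校正(以及论文中用于处理受试者分组结构的混合效应建模),此处仅为入门示意。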

[HC-16] FocalLens: Visualizing Narratives through Focalization

【速读】:该论文旨在解决当前叙事可视化方法的局限性问题,即现有技术仅能呈现角色与地点的共现时间线,而忽略了诸如焦点化(focalization)、因果关系和对话等复杂叙事组件。其解决方案的关键在于提出一种名为FocalLens的新颖叙事可视化方法,该方法以焦点化为核心,系统刻画不同角色对事件的感知方式——包括直接参与者、间接观察者及叙述者,从而揭示叙事中视角分布的深层结构。通过构建文本与可视化之间的流式交互工具,研究验证了该方法为写作者和文学学者提供了全新的分析维度,显著增强了叙事内容的洞察力。

链接: https://arxiv.org/abs/2604.14456
作者: S M Raihanul Alam,Md Dilshadur Rahman,Md Naimul Hoque
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Visualizing narratives is useful to writers to reflect on unfinished drafts and identify unintentional biases and inconsistencies. Literary scholars can use the visualizations to identify nuanced patterns and literary styles from written text. Current narrative visualization is limited to representing character and location co-occurrences in a timeline, omitting important and complex narrative components such as focalization, causality, and speech. This paper aims to capture and visualize underexplored, complex narrative components as a basis for narrative visualization. As a starting point, we propose a new narrative visualization, named FocalLens, that uses focalization, the component that establishes who sees or perceives the events in a narrative, for representing the narrative. We provide the theoretical foundation of focalization and describe various types and facets of focalization. The details are incorporated in the novel visualization that captures how different characters perceive an event, who directly participate in an event, who indirectly observe the event, and who narrate the event. We also developed a tool that provides fluid interaction between the text and the proposed visualization. The tool was evaluated with four writers and scholars in a qualitative study, where writers analyzed their draft stories and scholars analyzed well-known stories. The findings suggest the tool added a new dimension to the workflow for writers and scholars, an analytical lens that is not available otherwise. We conclude by identifying design implications and future directions.

[HC-17] Reflections on Traceability for Visualization Research

【速读】:该论文试图解决设计导向型可视化研究中因主观性、情境性和迭代性特征而导致的传统可复现性标准难以适用的问题,即如何在本质上不可复现的研究过程中保障严谨性和透明度。其解决方案的关键在于引入“可追溯性(traceability)”概念,通过系统记录丰富且带有注释的研究过程产物(如设计文档、原型等),构建清晰的研究脉络,并借助专门工具(如tRRRacer)支持对研究决策逻辑与演化路径的可视化呈现,从而帮助他人重新追踪研究主张并评估其合理性。

链接: https://arxiv.org/abs/2604.14417
作者: Jen Rogers,Derya Akbaba,James Scott-Brown,Alexander Lex,Miriah Meyer
机构: Idaho National Lab (美国爱达荷国家实验室); KTH Royal Institute of Technology (瑞典皇家理工学院); University of Edinburgh (爱丁堡大学); Graz University of Technology (格拉茨工业大学); University of Utah (犹他大学); Linköping University (林雪平大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Decades of advocacy for reproducibility and replication have advanced open, transparent practices in the sciences. However, traditional notions of reproducibility fit poorly with design-oriented visualization research, where insights emerge through subjective, situated, and iterative work. So how can we ensure rigor and transparency in processes that are inherently unreproducible? To introduce transparency in design-oriented research, we propose to focus on traceability: surfacing the origin and development of research contributions based on rich sets of artifacts documenting the design process. We investigated traceability through a collaborative autoethnographic reflection that builds on several years of work exploring ways to make design-oriented research transparent. This exploration includes an experiment to build a tool to support traceability, which we called tRRRacer. The tRRRacer tool provided a testbed for us to operationalize the three tenets of a traceable process: (1) Record abundant, annotated artifacts representative of research activities; (2) Report curated research threads that articulate rationale and evolution of the process, allowing others to (3) Read via interfaces that help retrace claims and assess plausibility. Reflecting on our experiences, we contribute a theorization of traceability and reflections on how we might support it.

[HC-18] Smart But Not Moral? Moral Alignment In Human-AI Decision-Making

【速读】:该论文试图解决高风险人工智能辅助决策中,仅关注功能或行为对齐而忽视道德价值一致性的问题。其核心挑战在于如何确保AI系统的决策逻辑与利益相关者的道德直觉相契合,从而实现有意义的AI整合。解决方案的关键在于引入“道德对齐”(moral alignment)这一概念,即利益相关者感知到AI决策逻辑所嵌入的价值观与其自身道德直觉的一致性,并基于道德基础理论(Moral Foundations Theory)从多利益相关者视角出发,阐明道德对齐在敏感应用场景中的重要性。

链接: https://arxiv.org/abs/2604.14371
作者: Christiane Ernst,Luis Gutmann,Domenique Zipperling,Kathrin Figl,Niklas Kühl
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at the TREO Forum of the European Conference on Information Systems 2026

点击查看摘要

Abstract:In high-stakes AI-supported decisions, considerations are not purely technical but involve moral judgments about fairness, responsibility, and harm. While prior research has focused mainly on functional or behavioral alignment, this paper argues that moral alignment may be a more fundamental dimension of human-AI decision-making. Moral alignment is defined as the perceived congruence between the values embedded in an AI system’s decision logic and the moral intuitions of stakeholders. Building on Moral Foundations Theory, the paper adopts a multi-stakeholder perspective and highlights why moral (mis)alignment matters for the meaningful integration of AI in sensitive contexts.

[HC-19] “I Just Dont Want My Work Being Fed Into The AI Blender”: Queer Artists on Refusing and Resisting Generative AI

【速读】:该论文试图解决生成式 AI(Generative AI)对酷儿艺术社群的冲击问题,特别是其在艺术创作中引发的去关系化(anti-relationality)倾向与酷儿艺术实践中强调的集体性、政治抵抗和身份建构之间的冲突。解决方案的关键在于通过酷儿理论视角,引导计算机支持的协同工作(CSCW)研究者拒绝主流AI想象,转而支持酷儿世界建构(queer world-building),从而为酷儿艺术家提供更具包容性和赋权性的技术实践路径。

链接: https://arxiv.org/abs/2604.14266
作者: Jordan Taylor,Joel Mire,Alicia DeVrio,Maarten Sap,Haiyi Zhu,Sarah E. Fox
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Human-Computer Interaction (cs.HC)
备注: To Appear at CSCW 2026

点击查看摘要

Abstract:Art-making is a collective social activity through which queer people engage in political resistance, develop identities, archive queer memory, and form community. However, in recent years, generative AI has disrupted queer artistic communities. Through 15 semi-structured interviews, we examine how queer artists are making sense of the encroachment of GenAI into their art worlds. Our findings surface significant tensions between the relationality of our participants’ queer art practices and the perceived anti-relationality of GenAI development and use. We detail how our participants refuse and resist GenAI use and development in response and highlight the limited role our participants saw for GenAI within art-making, such as the queer aesthetic potential of surreal image models. Drawing on queer theory, we discuss how CSCW researchers might support queer artists by refusing dominant AI imaginaries and supporting queer world-building.

计算机视觉

[CV-0] Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo CVPR2026

【速读】:该论文旨在解决事件相机(event camera)与帧相机(frame-based camera)在异构立体匹配中因模态差异导致的领域特有线索被弱化的问题,从而影响在高速运动和复杂光照条件下三维感知的可靠性。解决方案的关键在于提出Bi-CMPStereo框架,通过双向跨模态提示机制,充分挖掘并融合两个模态的语义与结构特征;其核心创新是在目标规范空间中学习精细对齐的立体表示,并将每种模态同时投影至事件域与帧域以整合互补表示,从而显著提升匹配精度与泛化能力。

链接: https://arxiv.org/abs/2604.15312
作者: Ninghui Xu,Fabio Tosi,Lihui Wang,Jiawei Han,Luca Bartolomei,Zhiting Yao,Matteo Poggi,Stefano Mattoccia
机构: Southeast University (东南大学); University of Bologna (博洛尼亚大学); Hohai University (河海大学); Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. Code URL: this https URL

点击查看摘要

Abstract:Conventional frame-based cameras capture rich contextual information but suffer from limited temporal resolution and motion blur in dynamic scenes. Event cameras offer an alternative visual representation with higher dynamic range free from such limitations. The complementary characteristics of the two modalities make event-frame asymmetric stereo promising for reliable 3D perception under fast motion and challenging illumination. However, the modality gap often leads to marginalization of domain-specific cues essential for cross-modal stereo matching. In this paper, we introduce Bi-CMPStereo, a novel bidirectional cross-modal prompting framework that fully exploits semantic and structural features from both domains for robust matching. Our approach learns finely aligned stereo representations within a target canonical space and integrates complementary representations by projecting each modality into both event and frame domains. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in accuracy and generalization.

[CV-1] LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories CVPR2026

【速读】:该论文旨在解决流匹配(Flow Matching)模型在基于人类偏好微调时,因直接通过长轨迹反向传播奖励梯度而导致的内存开销过大和梯度爆炸问题,尤其是早期生成步骤难以更新,从而影响最终图像全局结构的问题。解决方案的关键在于提出LeapAlign方法:通过设计两个连续的“跳跃”步骤,将长轨迹压缩为仅两步,每个跳跃跳过多个常微分方程(ODE)采样步骤并单步预测未来潜在表示(latent),同时随机化跳跃的起始与终止时间步以实现任意生成步骤的高效稳定更新;此外,引入基于轨迹一致性的训练权重分配机制,并对高梯度幅值项进行加权抑制而非完全剔除,从而提升梯度稳定性与微调效果。

链接: https://arxiv.org/abs/2604.15311
作者: Zhanhao Liang,Tao Yang,Jie Wu,Chengjian Feng,Liang Zheng
机构: ByteDance(字节跳动); Australian National University (澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026. Project page: this https URL

点击查看摘要

Abstract:This paper focuses on the alignment of flow matching models with human preferences. A promising way is fine-tuning by directly backpropagating reward gradients through the differentiable generation process of flow matching. However, backpropagating through long trajectories results in prohibitive memory costs and gradient explosion. Therefore, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from reward to early generation steps. Specifically, we shorten the long trajectory into only two steps by designing two consecutive leaps, each skipping multiple ODE sampling steps and predicting future latents in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign leads to efficient and stable model updates at any generation step. To better use such shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of gradient terms with large magnitude, instead of completely removing them as done in previous works. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image-text alignment.
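摘要中"两步跳跃"的核心思路可以用一个一维玩具流匹配问题来示意。下面的 NumPy 草图完全是示意性的:速度场、跳跃分界点的随机范围和一致性权重公式均为本文假设,并非 LeapAlign 的真实实现;它只演示"两次大步跨越多个 ODE 步"与"按短轨迹和长轨迹的一致性赋权"这两个机制。

```python
import numpy as np

def velocity(x, t):
    # 玩具速度场:将 x 线性推向目标 1.0(仅作示意,非论文中的网络)
    return 1.0 - x

def euler_trajectory(x0, steps):
    # 常规多步 ODE(Euler)采样:沿长轨迹逐步积分
    x, t = x0, 0.0
    dt = 1.0 / steps
    for _ in range(steps):
        x = x + velocity(x, t) * dt
        t += dt
    return x

def two_leap_trajectory(x0, t_mid):
    # LeapAlign 式的两步"跳跃":每步一次性跨越多个 ODE 步,单步预测未来 latent
    x = x0 + velocity(x0, 0.0) * (t_mid - 0.0)   # 第一跳:0 -> t_mid
    x = x + velocity(x, t_mid) * (1.0 - t_mid)   # 第二跳:t_mid -> 1
    return x

x0 = 0.0
x_long = euler_trajectory(x0, steps=50)
rng = np.random.default_rng(0)
t_mid = float(rng.uniform(0.2, 0.8))  # 随机化跳跃的起止时间步(范围为假设)
x_short = two_leap_trajectory(x0, t_mid)
# 与长轨迹越一致的两步轨迹,按摘要思路赋予越高的训练权重(指数形式为假设)
consistency_weight = float(np.exp(-abs(x_long - x_short)))
```

由于只需对两步做反向传播,这种缩短的轨迹避免了穿越完整采样链带来的显存开销与梯度爆炸。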

[CV-2] TokenLight: Precise Lighting Control in Images using Attribute Tokens CVPR2026

【速读】:该论文旨在解决图像重光照(image relighting)中对多光照属性进行精确且连续控制的问题,尤其在真实场景下实现自然、逼真的光照编辑效果。其解决方案的关键在于将重光照建模为条件图像生成任务,并引入属性令牌(attribute tokens)来编码不同的光照因素,如强度、颜色、环境光、漫反射程度及三维光位置等;模型在大规模合成数据集上训练,辅以少量真实图像提升泛化能力与真实感,从而在不依赖显式逆渲染监督的情况下,仍能理解光与场景几何、遮挡和材质之间的复杂交互关系,实现对传统难题(如将光源置于物体内部或合理重光照透明材质)的高质量处理。

链接: https://arxiv.org/abs/2604.15310
作者: Sumit Chaturvedi,Yannick Hold-Geoffroy,Mengwei Ren,Jingyuan Liu,He Zhang,Yiqun Mei,Julie Dorsey,Zhixin Shu
机构: Yale University (耶鲁大学); Adobe (Adobe)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 32 pages, CVPR 2026

点击查看摘要

Abstract:This paper presents a method for image relighting that enables precise and continuous control over multiple illumination attributes in a photograph. We formulate relighting as a conditional image generation task and introduce attribute tokens to encode distinct lighting factors such as intensity, color, ambient illumination, diffuse level, and 3D light positions. The model is trained on a large-scale synthetic dataset with ground-truth lighting annotations, supplemented by a small set of real captures to enhance realism and generalization. We validate our approach across a variety of relighting tasks, including controlling in-scene lighting fixtures and editing environment illumination using virtual light sources, on synthetic and real images. Our method achieves state-of-the-art quantitative and qualitative performance compared to prior work. Remarkably, without explicit inverse rendering supervision, the model exhibits an inherent understanding of how light interacts with scene geometry, occlusion, and materials, yielding convincing lighting effects even in traditionally challenging scenarios such as placing lights within objects or relighting transparent materials plausibly. Project page: this http URL

[CV-3] RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

【速读】:该论文旨在解决高阶自动驾驶中运动规划器在建模多模态未来不确定性时的稳定性与闭环交互鲁棒性问题,尤其是扩散模型(diffusion-based models)在纯模仿学习训练下易出现随机不稳定性且缺乏纠正性负反馈的问题。解决方案的关键在于提出RAD-2框架——一个统一的生成器-判别器架构:其中扩散生成器负责生成多样轨迹候选,而通过强化学习优化的判别器则根据长期驾驶质量对候选轨迹进行重排序;该解耦设计避免了直接将稀疏标量奖励应用于高维轨迹空间,从而提升优化稳定性;此外,引入时序一致的组相对策略优化(Temporally Consistent Group Relative Policy Optimization)缓解信用分配难题,并提出在线策略生成器优化(On-policy Generator Optimization),将闭环反馈转化为结构化的纵向优化信号,逐步引导生成器收敛至高奖励轨迹流形。

链接: https://arxiv.org/abs/2604.15308
作者: Hao Gao,Shaoyu Chen,Yifan Zhu,Yuehao Song,Wenyu Liu,Qian Zhang,Xinggang Wang
机构: Huazhong University of Science and Technology (华中科技大学); Horizon Robotics (地平线机器人)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird’s-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
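生成器-判别器解耦的"重排序"流程本身很简单,可以用如下草图示意(随机游走生成器、以障碍物距离近似"长期驾驶质量"的打分函数均为本文为演示而做的假设,与 RAD-2 的扩散生成器和 RL 判别器无关):

```python
import numpy as np

rng = np.random.default_rng(42)

def generate_candidates(n=8, horizon=10):
    # 假想的"生成器":用二维随机游走近似多模态轨迹候选
    steps = rng.normal(0.0, 0.5, size=(n, horizon, 2))
    return np.cumsum(steps, axis=1)

def discriminator_score(traj, obstacle=np.array([2.0, 0.0])):
    # 假想的"判别器":以与障碍物的最小距离作为标量质量分
    dists = np.linalg.norm(traj - obstacle, axis=-1)
    return float(dists.min())

candidates = generate_candidates()
scores = [discriminator_score(t) for t in candidates]
best = candidates[int(np.argmax(scores))]   # 按分数重排序后选出执行轨迹
```

这种解耦的好处正如摘要所述:稀疏的标量奖励只作用于"候选排序",而不直接作用于高维轨迹空间的生成分布。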

[CV-4] Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation ACL2026

【速读】:该论文旨在解决当前手语翻译(Sign Language Translation, SLT)系统中一个关键问题:现有方法通常假设手语中的短片段可直接映射为口语词汇,但这一假设在实际应用中失效,因为手语使用者常通过上下文、空间布局和动作动态构建意义。为此,作者提出将SLT重构为跨模态推理任务,并设计了一个以潜在思维序列为显式中间层的推理驱动框架,这些潜在思维逐步提取并组织视频中的语义信息。解决方案的关键在于引入“先规划后锚定”(plan-then-ground)的解码机制:模型首先决定要表达的内容,再回溯视频寻找证据,从而显著提升翻译结果的连贯性和忠实度。

链接: https://arxiv.org/abs/2604.15301
作者: Yiyang Jiang,Li Zhang,Xiao-Yong Wei,Li Qing
机构: The Hong Kong Polytechnic University (香港理工大学); Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ACL 2026 Main

点击查看摘要

Abstract:Many SLT systems quietly assume that brief chunks of signing map directly to spoken-language words. That assumption breaks down because signers often create meaning on the fly using context, space, and movement. We revisit SLT and argue that it is mainly a cross-modal reasoning task, not just a straightforward video-to-text conversion. We thus introduce a reasoning-driven SLT framework that uses an ordered sequence of latent thoughts as an explicit middle layer between the video and the generated text. These latent thoughts gradually extract and organize meaning over time. On top of this, we use a plan-then-ground decoding method: the model first decides what it wants to say, and then looks back at the video to find the evidence. This separation improves coherence and faithfulness. We also built and released a new large-scale gloss-free SLT dataset with stronger context dependencies and more realistic meanings. Experiments across several benchmarks show consistent gains over existing gloss-free methods. Code and data will be released upon acceptance at this https URL.

[CV-5] AnimationBench: Are Video Models Good at Character-Centric Animation?

【速读】:该论文旨在解决现有视频生成评估基准在动画风格生成任务中适用性不足的问题,具体表现为:传统基准多针对写实视频设计,难以有效衡量动画特有的 stylized appearance(风格化外观)、exaggerated motion(夸张运动)以及 character-centric consistency(角色中心一致性)等特性;同时,这些基准通常依赖固定提示集和刚性评估流程,缺乏对开放域内容和定制化评估需求的支持。解决方案的关键在于提出 AnimationBench——首个系统性的图像到视频动画生成评估基准,其核心创新包括:将动画领域的“十二基本原理”(Twelve Basic Principles of Animation)与知识产权(IP)保真度转化为可量化的评估维度,并结合语义一致性、运动合理性及相机运动一致性等更广泛的品质维度;支持标准化闭集评估以实现模型间的可复现比较,以及灵活的开集评估用于诊断分析,且利用视觉语言模型实现可扩展的自动化评分,从而显著提升评估结果与人类判断的一致性,并揭示现实导向基准所忽略的动画特有质量差异。

链接: https://arxiv.org/abs/2604.15299
作者: Leyi Wu,Pengjun Fang,Kai Sun,Yazhou Xing,Yinwei Wu,Songsong Wang,Ziqi Huang,Dan Zhou,Yingqing He,Ying-Cong Chen,Qifeng Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL Code: this https URL

点击查看摘要

Abstract:Video generation has advanced rapidly, with recent methods producing increasingly convincing animated results. However, existing benchmarks, largely designed for realistic videos, struggle to evaluate animation-style generation with its stylized appearance, exaggerated motion, and character-centric consistency. Moreover, they also rely on fixed prompt sets and rigid pipelines, offering limited flexibility for open-domain content and custom evaluation needs. To address this gap, we introduce AnimationBench, the first systematic benchmark for evaluating animation image-to-video generation. AnimationBench operationalizes the Twelve Basic Principles of Animation and IP Preservation into measurable evaluation dimensions, together with Broader Quality Dimensions including semantic consistency, motion rationality, and camera motion consistency. The benchmark supports both a standardized close-set evaluation for reproducible comparison and a flexible open-set evaluation for diagnostic analysis, and leverages visual-language models for scalable assessment. Extensive experiments show that AnimationBench aligns well with human judgment and exposes animation-specific quality differences overlooked by realism-oriented benchmarks, leading to more informative and discriminative evaluation of state-of-the-art I2V models.

[CV-6] AD4AD: Benchmarking Visual Anomaly Detection Models for Safer Autonomous Driving

【速读】:该论文旨在解决自动驾驶中机器视觉系统在遇到训练数据分布外的异常情况(如非常规障碍物)时感知能力显著下降的问题,从而导致物理风险增加。其解决方案的关键在于引入视觉异常检测(Visual Anomaly Detection, VAD),通过生成像素级异常图(pixel-level anomaly maps)识别训练阶段未见的异常对象,无需事先假设异常形态即可引导驾驶员关注具体区域,实现对未知风险的有效预警。

链接: https://arxiv.org/abs/2604.15291
作者: Fabrizio Genilotti,Arianna Stropeni,Gionata Grotto,Francesco Borsatti,Manuel Barusco,Davide Dalle Pezze,Gian Antonio Susto
机构: University of Padova (帕多瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The reliability of a machine vision system for autonomous driving depends heavily on its training data distribution. When a vehicle encounters significantly different conditions, such as atypical obstacles, its perceptual capabilities can degrade substantially. Unlike many domains where errors carry limited consequences, failures in autonomous driving translate directly into physical risk for passengers, pedestrians, and other road users. To address this challenge, we explore Visual Anomaly Detection (VAD) as a solution. VAD enables the identification of anomalous objects not present during training, allowing the system to alert the driver when an unfamiliar situation is detected. Crucially, VAD models produce pixel-level anomaly maps that can guide driver attention to specific regions of concern without requiring any prior assumptions about the nature or form of the hazard. We benchmark eight state-of-the-art VAD methods on AnoVox, the largest synthetic dataset for anomaly detection in autonomous driving. In particular, we evaluate performance across four backbone architectures spanning from large networks to lightweight ones such as MobileNet and DeiT-Tiny. Our results demonstrate that VAD transfers effectively to road scenes. Notably, Tiny-Dinomaly achieves the best accuracy-efficiency trade-off for edge deployment, matching full-scale localization performance at a fraction of the memory cost. This study represents a concrete step toward safer, more responsible deployment of autonomous vehicles, ultimately improving protection for passengers, pedestrians, and all road users.
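摘要强调 VAD 模型输出像素级异常图,可在不预设异常形态的情况下把驾驶员注意力引向具体区域。下面用一个最小草图演示这一下游用法(异常图为合成数据,阈值 0.7 为假设,与论文评测的八种 VAD 方法无关):

```python
import numpy as np

def alert_regions(anomaly_map, threshold=0.7):
    # 像素级异常图(假定值域 [0,1])超过阈值的区域触发告警掩码
    mask = anomaly_map >= threshold
    return mask, bool(mask.any())

# 合成一张 8x8 异常图:中心 2x2 区域模拟训练中未见的"非常规障碍物"
amap = np.zeros((8, 8))
amap[3:5, 3:5] = 0.9
mask, should_alert = alert_regions(amap)
```

实际部署中,这样的掩码可叠加在摄像头画面上提示驾驶员关注区域,而无需系统事先识别障碍物类别。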

[CV-7] GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens

【速读】:该论文旨在解决3D高斯泼溅(3D Gaussian Splatting)中基础元素(primitives)空间分配效率低下的问题,即现有方法在表示紧凑性、重建速度与渲染保真度之间存在显著权衡。传统方案依赖局部启发式策略,缺乏全局场景感知能力,导致像素或体素对齐的生成方式引入冗余,随着输入视角增加,模型规模膨胀且全局一致性变差。其解决方案的关键在于提出“先对齐、后解码”(align first, decode later)的框架GlobalSplat:通过学习一个紧凑的全局潜在场景表示来编码多视角输入并解析跨视角对应关系,在解码显式三维几何之前完成全局信息整合,从而实现无需预训练像素预测主干或复用密集基线特征的紧凑且全局一致的重建;同时采用从粗到精的训练流程抑制表示膨胀,最终在RealEstate10K和ACID数据集上以仅16K高斯点实现竞争性新视角合成效果,并保持4MB轻量级模型体积与低于78毫秒的单次前向推理时间。

链接: https://arxiv.org/abs/2604.15284
作者: Roni Itkin,Noam Issachar,Yehonatan Keypur,Anpei Chen,Sagie Benaim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The efficient spatial allocation of primitives serves as the foundation of 3D Gaussian Splatting, as it directly dictates the synergy between representation compactness, reconstruction speed, and rendering fidelity. Previous solutions, whether based on iterative optimization or feed-forward inference, suffer from significant trade-offs between these goals, mainly due to the reliance on local, heuristic-driven allocation strategies that lack global scene awareness. Specifically, current feed-forward methods are largely pixel-aligned or voxel-aligned. By unprojecting pixels into dense, view-aligned primitives, they bake redundancy into the 3D asset. As more input views are added, the representation size increases and global consistency becomes fragile. To this end, we introduce GlobalSplat, a framework built on the principle of align first, decode later. Our approach learns a compact, global, latent scene representation that encodes multi-view input and resolves cross-view correspondences before decoding any explicit 3D geometry. Crucially, this formulation enables compact, globally consistent reconstructions without relying on pretrained pixel-prediction backbones or reusing latent features from dense baselines. Utilizing a coarse-to-fine training curriculum that gradually increases decoded capacity, GlobalSplat natively prevents representation bloat. On RealEstate10K and ACID, our model achieves competitive novel-view synthesis performance while utilizing as few as 16K Gaussians, significantly less than required by dense pipelines, obtaining a light 4MB footprint. Further, GlobalSplat enables significantly faster inference than the baselines, operating under 78 milliseconds in a single forward pass. Project page is available at this https URL

[CV-8] R3D: Revisiting 3D Policy Learning

【速读】:该论文旨在解决3D策略学习中存在的训练不稳定性与严重过拟合问题,这些问题限制了强大3D感知模型的采用。研究发现,缺乏3D数据增强和批归一化(Batch Normalization)的负面影响是导致上述问题的主要原因。解决方案的关键在于提出一种新架构,该架构结合了可扩展的基于Transformer的3D编码器与扩散解码器(diffusion decoder),专为大规模训练稳定性设计,并支持大规模预训练。该方法在具有挑战性的操作基准测试中显著优于现有的3D基线模型,为可扩展的3D模仿学习建立了新的稳健基础。

链接: https://arxiv.org/abs/2604.15281
作者: Zhengdong Hong,Shenrui Wu,Haozhe Cui,Boyi Zhao,Ran Ji,Yiyang He,Hangxing Zhang,Zundong Ke,Jun Wang,Guofeng Zhang,Jiayuan Gu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: this https URL
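摘要诊断出的两个失败原因(缺失 3D 数据增强、BatchNorm 的负面影响)都有标准的补救手法。下面给出一个示意草图:绕 z 轴随机旋转加坐标抖动是常见的点云增强,LayerNorm 是 BatchNorm 的常用替代;具体参数(抖动方差等)为假设,并非论文 R3D 的实现细节。

```python
import numpy as np

def augment_pointcloud(points, rng, jitter_std=0.005):
    # 3D 数据增强示意:绕 z 轴随机旋转 + 高斯坐标抖动
    theta = rng.uniform(0, 2 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    out = points @ rot.T
    return out + rng.normal(0.0, jitter_std, size=out.shape)

def layer_norm(x, eps=1e-5):
    # 逐样本归一化(LayerNorm)示意:不依赖 batch 统计量,
    # 避免 BatchNorm 在小 batch / 分布漂移下的训练不稳定
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
pts = rng.normal(size=(128, 3))
aug = augment_pointcloud(pts, rng)          # 旋转保持点到原点的距离不变
feats = layer_norm(rng.normal(size=(4, 16)))
```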

[CV-9] Why Do Vision Language Models Struggle To Recognize Human Emotions?

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在识别人类情绪时表现不佳的问题,特别是针对面部表情识别(Dynamic Facial Expression Recognition, DFER)这一连续且动态的任务。其核心发现是:VLMs存在两个关键缺陷——一是情绪数据集天然长尾分布,而预训练所用的网络规模数据进一步加剧了头部类别偏倚,导致罕见情绪被系统性地归类为常见类别;二是VLM无法有效建模密集帧序列中的时间信息,因其受限于上下文长度和内存容量,难以捕捉微表情(micro-expressions,持续0.25–0.5秒)这类关键情感信号。解决方案的关键在于提出一种多阶段上下文增强策略:通过将“间隔帧”转化为自然语言摘要来构建补充文本语境,并将其与稀疏关键帧一同输入VLM,从而避免注意力稀释、保留情绪演化轨迹,显著提升情绪识别性能。

链接: https://arxiv.org/abs/2604.15280
作者: Madhav Agarwal,Sotirios A. Tsaftaris,Laura Sevilla-Lara,Steven McDonagh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding emotions is a fundamental ability for intelligent systems to be able to interact with humans. Vision-language models (VLMs) have made tremendous progress in the last few years for many visual tasks, potentially offering a promising solution for understanding emotions. However, it is surprising that even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform even specialized vision-only classifiers. In this paper we ask the question “Why do VLMs struggle to recognize human emotions?”, and observe that the inherently continuous and dynamic task of facial expression recognition (DFER) exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs exacerbates this head-class bias, causing them to systematically collapse rare, under-represented emotions into common categories. We propose alternative sampling strategies that prevent favoring common concepts. Second, temporal information is critical for understanding emotions. However, VLMs are unable to represent temporal information over dense frame sequences, as they are limited by context size and the number of tokens that can fit in memory, which poses a clear challenge for emotion recognition. We demonstrate that the sparse temporal sampling strategy used in VLMs is inherently misaligned with the fleeting nature of micro-expressions (0.25-0.5 seconds), which are often the most critical affective signal. As a diagnostic probe, we propose a multi-stage context enrichment strategy that utilizes the information from “in-between” frames by first converting them into natural language summaries. This enriched textual context is provided as input to the VLM alongside sparse keyframes, preventing attentional dilution from excessive visual data while preserving the emotional trajectory.
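摘要指出稀疏时间采样与微表情时长(0.25–0.5 秒)之间的错配,这一点用简单的算术即可说明:只要相邻采样帧的间隔大于微表情时长,整个微表情就可能完全落在两帧之间而被漏掉。下面的草图演示该计算(10 秒视频、8 帧采样等数值为典型设定的假设):

```python
def sample_gap_seconds(duration_s, num_frames):
    # 对时长 duration_s 的视频均匀采样 num_frames 帧时,相邻帧的时间间隔
    return duration_s / num_frames

def can_miss_micro_expression(duration_s, num_frames, micro_len_s=0.25):
    # 微表情时长约 0.25-0.5 秒;间隔大于其下界即可能整段漏采
    return sample_gap_seconds(duration_s, num_frames) > micro_len_s

# 典型设置:10 秒视频、VLM 只采样 8 帧 -> 间隔 1.25 秒,远大于微表情时长
gap = sample_gap_seconds(10.0, 8)
missed = can_miss_micro_expression(10.0, 8)
```

要保证不漏采,同样长度的视频需要采样 40 帧以上(间隔 ≤ 0.25 秒),这正是摘要所述上下文长度与显存限制的来源。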

[CV-10] SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation

【速读】:该论文旨在解决医学图像分割中可靠不确定性估计的问题,尤其是在自动化轮廓生成需支持下游定量分析与临床决策的场景下。现有方法要么依赖多次推理以获得准确不确定性(计算成本高),要么采用单次前向传播的替代方案,但其失败排序能力较弱或受限于特征空间假设。解决方案的关键在于提出一种后处理框架SegWithU,该框架在冻结的预训练分割主干网络基础上添加轻量级不确定性头,利用主干网络中间特征并基于秩-1后验探针在紧凑的探测空间中建模扰动能量,从而生成两个体素级不确定性图:一个用于概率校准(校准导向),另一个用于错误检测和选择性预测(排名导向)。实验表明,SegWithU在ACDC、BraTS2024和LiTS数据集上均实现了强且一致的单次前向传播性能,AUROC/AURC分别达到0.9838/2.4885、0.9946/0.2660和0.9925/0.8193,同时保持了良好的分割质量,验证了基于扰动的不确定性建模是一种有效且实用的可靠性感知分割路径。

链接: https://arxiv.org/abs/2604.15271
作者: Tianhao Fu,Austin Wang,Charles Chen,Roby Aldave-Garza,Yucheng Chen
机构: University of Toronto (多伦多大学); McGill University (麦吉尔大学); University of Waterloo (滑铁卢大学); Vector Institute (向量研究所); Project Neura; University of Toronto Machine Intelligence Student Team (多伦多大学机器智能学生团队); Amplimit
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single-forward-pass alternatives often provide weaker failure ranking or rely on restrictive feature-space assumptions. We present SegWithU, a post-hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty head. SegWithU taps intermediate backbone features and models uncertainty as perturbation energy in a compact probe space using rank-1 posterior probes. It produces two voxel-wise uncertainty maps: a calibration-oriented map for probability tempering and a ranking-oriented map for error detection and selective prediction. Across ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single-forward-pass baseline, achieving AUROC/AURC of 0.9838/2.4885, 0.9946/0.2660, and 0.9925/0.8193, respectively, while preserving segmentation quality. These results suggest that perturbation-based uncertainty modeling is an effective and practical route to reliability-aware medical segmentation. Source code is available at this https URL.
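摘要中用于衡量"失败排序"能力的 AURC(风险-覆盖率曲线下面积)有一个通用定义:按不确定性升序逐步扩大覆盖率,累计每个覆盖率下的平均错误率,再取均值;值越低说明不确定性越能把错误样本排到后面。下面按该标准定义给出草图(论文的具体评估协议可能不同,示例数据为假设):

```python
import numpy as np

def aurc(errors, uncertainty):
    # errors: 每个样本是否预测错误(0/1);uncertainty: 对应的不确定性分数
    order = np.argsort(uncertainty)                       # 按不确定性升序排列
    sorted_err = np.asarray(errors, dtype=float)[order]
    # 覆盖率从 1/n 到 1 时的累计平均风险,取均值即 AURC
    cum_risk = np.cumsum(sorted_err) / np.arange(1, len(sorted_err) + 1)
    return float(cum_risk.mean())

err = np.array([0, 0, 0, 1, 1])
good = aurc(err, np.array([0.1, 0.2, 0.3, 0.8, 0.9]))  # 不确定性与错误完全对齐 -> 低
bad = aurc(err, np.array([0.9, 0.8, 0.3, 0.2, 0.1]))   # 完全反向 -> 高
```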

[CV-11] TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens

【速读】:该论文旨在解决当前基于Transformer的前馈式3D高斯溅射(3D Gaussian Splatting, 3DGS)预测方法中,将高斯均值作为沿相机射线的深度进行回归所带来的次优性问题。传统方法受限于编码器-only架构,且预测的高斯数量与输入图像分辨率和视图数量绑定,导致对位姿噪声和多视角不一致性敏感,且难以高效地在测试时优化。解决方案的关键在于提出一种直接回归3D均值坐标的新范式,仅依赖自监督渲染损失进行训练,并引入带有可学习高斯标记(Gaussian tokens)的编码器-解码器架构,从而实现预测的高斯数量与输入分辨率及视图数解耦。该方法显著提升了鲁棒性,并支持在标记空间中高效进行测试时优化,同时保持已学习先验不变,最终在静态与动态场景下均实现了最先进的重建性能。

链接: https://arxiv.org/abs/2604.15239
作者: Jiawei Ren,Michal Jan Tyszkiewicz,Jiahui Huang,Zan Gojcic
机构: NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss. This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby unbinding the number of predicted primitives from input image resolution and number of views. Our resulting method, TokenGS, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regularized geometry and more balanced 3DGS distribution, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow.
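"可学习高斯 token 通过解码器读取场景特征"这一架构思路可用如下草图示意:固定数量的查询 token 经交叉注意力聚合编码特征,再线性映射为高斯参数,token 数量因此与输入分辨率和视角数解耦。所有维度、参数布局(均值 3 + 尺度 3 + 透明度 1 + 颜色 3)均为本文为演示而做的假设,并非 TokenGS 的实际网络结构。

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def decode_gaussians(scene_feats, queries, w_out):
    # 交叉注意力:可学习 token 作为 query,场景特征同时作为 key/value(简化)
    attn = softmax(queries @ scene_feats.T / np.sqrt(queries.shape[1]))
    ctx = attn @ scene_feats
    params = ctx @ w_out            # 每个 token -> 10 维高斯参数(布局为假设)
    means = params[:, :3]           # 直接回归 3D 均值坐标,而非沿相机射线的深度
    return means, params

rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 32))   # 任意分辨率/视角数编码出的场景特征(示意)
tokens = rng.normal(size=(64, 32))   # 64 个可学习高斯 token:数量与输入解耦
W = rng.normal(size=(32, 10))
means, params = decode_gaussians(feats, tokens, W)
```

测试时优化也因此可以只在 token 空间进行,而不必回传到编码器。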

[CV-12] StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

【速读】:该论文旨在解决从连续视频流中重建稠密三维几何结构时,在恒定内存预算下保持稳定推理的问题。现有O(1)框架主要依赖“纯淘汰”策略,存在因二进制令牌删除导致的信息破坏以及单层局部评分引起的评估噪声问题。解决方案的关键在于提出一种无需训练的StreamCacheVGGT框架,其核心是两个协同工作的模块:跨层一致性增强评分(Cross-Layer Consistency-Enhanced Scoring, CLCES)和混合缓存压缩(Hybrid Cache Compression, HCC)。CLCES通过跟踪Transformer层级中令牌重要性的轨迹并采用顺序统计分析,有效降低激活噪声,识别出持续的几何显著性;在此基础上,HCC引入三阶段分诊策略,将中等重要度的令牌通过键向量流形上的最近邻分配合并到保留锚点中,从而在不增加计算成本的前提下保留关键几何上下文信息,显著提升重建精度与长期稳定性。

链接: https://arxiv.org/abs/2604.15237
作者: Xuanyi Liu,Deyi Ji,Chunan Yu,Qi Zhu,Xuanfu Li,Jin Ma,Tianrun Chen,Lanyun Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing O(1) frameworks primarily rely on a “pure eviction” paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state-of-the-art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
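HCC 的三级分诊(保留 / 合并 / 淘汰)机制可以用如下最小草图示意:高分 token 保留为锚点,低分直接丢弃,中间分数的 token 在键向量空间做最近邻指派并以均值方式并入对应锚点的 value。打分、合并方式(简单均值)和各级配额均为本文假设,仅用于说明"合并优于纯淘汰"的思路:

```python
import numpy as np

def triage_cache(keys, values, scores, n_keep, n_evict):
    # 三级分诊:按分数降序,前 n_keep 个保留为锚点,后 n_evict 个淘汰,
    # 中间的 token 合并进键向量最近的锚点
    order = np.argsort(scores)[::-1]
    keep_idx = order[:n_keep]
    merge_idx = order[n_keep:len(order) - n_evict]
    kept_k, kept_v = keys[keep_idx].copy(), values[keep_idx].copy()
    counts = np.ones(n_keep)
    for i in merge_idx:
        # 键向量空间的最近邻指派;value 以运行均值并入锚点(合并方式为假设)
        j = int(np.argmin(np.linalg.norm(kept_k - keys[i], axis=1)))
        kept_v[j] = (kept_v[j] * counts[j] + values[i]) / (counts[j] + 1)
        counts[j] += 1
    return kept_k, kept_v

rng = np.random.default_rng(1)
K = rng.normal(size=(10, 4))
V = rng.normal(size=(10, 4))
s = rng.random(10)
k2, v2 = triage_cache(K, V, s, n_keep=4, n_evict=3)  # 缓存从 10 压缩到 4
```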

[CV-13] Vision-Based Safe Human-Robot Collaboration with Uncertainty Guarantees

【速读】:该论文旨在解决人机协作场景中基于视觉的人体姿态估计与运动预测的不确定性建模问题,以实现可证明安全(certifiably safe)的人机协同。其核心挑战在于如何在存在观测噪声(aleatoric uncertainty)和分布外数据(OOD)的情况下,提供具有统计保证的预测置信度。解决方案的关键在于将**偶然不确定性(aleatoric uncertainty)估计**与**分布外检测(OOD detection)**相结合,并引入**校准预测集(conformal prediction sets)**,从而在保证概率有效性的同时,为机器人系统提供高置信度的安全边界,最终实现在真实世界人机协作环境中对人类运动的可靠预测与安全控制。

链接: https://arxiv.org/abs/2604.15221
作者: Jakob Thumm,Marian Frei,Tianle Ni,Matthias Althoff,Marco Pavone
机构: Stanford University (斯坦福大学); RWTH Aachen University (亚琛工业大学); Shanghai Jiao Tong University (上海交通大学); Technical University of Munich (慕尼黑工业大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a framework for vision-based human pose estimation and motion prediction that gives conformal prediction guarantees for certifiably safe human-robot collaboration. Our framework combines aleatoric uncertainty estimation with OOD detection for high probabilistic confidence. To integrate our pipeline in certifiable safety frameworks, we propose conformal prediction sets for human motion predictions with high, valid confidence. We evaluate our pipeline on recorded human motion data and a real-world human-robot collaboration setting.
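摘要中的 conformal prediction sets 可按经典的分裂式(split)conformal 流程示意:在校准集上收集预测误差,取其 ceil((n+1)(1-α))/n 分位数作为预测集半径,即可获得约 1-α 的边际覆盖保证。下面的草图只演示这一通用流程,误差分布与参数均为假设,与论文在真实人机协作数据上的具体做法无关:

```python
import numpy as np

def conformal_radius(cal_errors, alpha=0.1):
    # 分裂式 conformal:校准误差的修正分位数作为预测集半径
    n = len(cal_errors)
    q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(np.asarray(cal_errors), q, method="higher"))

def prediction_set_contains(pred, truth, radius):
    # 预测集 = 以预测位置为中心、半径为 radius 的球
    return float(np.linalg.norm(pred - truth)) <= radius

rng = np.random.default_rng(0)
cal = np.abs(rng.normal(0.0, 0.1, size=200))   # 假想的校准集预测误差(米)
r = conformal_radius(cal, alpha=0.1)           # 约 90% 覆盖率对应的半径
covered = prediction_set_contains(np.array([0.0, 0.0]), np.array([0.05, 0.0]), r)
```

下游的安全控制器只需保证机器人避开该球形集合,即可继承同样的概率保证。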

[CV-14] Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization

【速读】:该论文旨在解决无监督骨骼数据驱动的时序动作分割(Temporal Action Segmentation)问题,尤其关注如何有效建模动作的时空结构并减少片段长度偏差。其解决方案的关键在于提出了一种分层时空向量量化(Hierarchical Spatiotemporal Vector Quantization)框架:首先在空间层面通过两级向量量化实现细粒度子动作(subaction)到动作级表示的逐层抽象,利用骨架重建损失强化空间特征学习;随后进一步融合时间信息,在多层级聚类中同时恢复输入骨架及其对应的时间戳,从而实现对动作边界和语义的联合优化。该方法在多个基准数据集上实现了新的最先进性能,并显著缓解了传统方法中存在的片段长度偏倚问题。

链接: https://arxiv.org/abs/2604.15196
作者: Umer Ahmed,Syed Ahmed Mahmood,Fawad Javed Fateh,M. Shaheer Luqman,M. Zeeshan Zia,Quoc-Huy Tran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a hierarchical approach, which includes two consecutive levels of vector quantization. Specifically, the lower level associates skeletons with fine-grained subactions, while the higher level further aggregates subactions into action-level representations. Our hierarchical approach outperforms the non-hierarchical baseline, while primarily exploiting spatial cues by reconstructing input skeletons. Next, we extend our approach by leveraging both spatial and temporal information, yielding a hierarchical spatiotemporal vector quantization scheme. In particular, our hierarchical spatiotemporal approach performs multi-level clustering, while simultaneously recovering input skeletons and their corresponding timestamps. Lastly, extensive experiments on multiple benchmarks, including HuGaDB, LARa, and BABEL, demonstrate that our approach establishes a new state-of-the-art performance and reduces segment length bias in unsupervised skeleton-based temporal action segmentation.
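两级向量量化的"逐层抽象"可以用最近邻码本查询来示意:低层码本把帧级特征离散为细粒度子动作,高层码本再把子动作表示聚合为动作级表示。以下为标准 VQ 的草图,码本规模与特征维度均为假设,并非论文的具体配置:

```python
import numpy as np

def quantize(x, codebook):
    # 最近邻向量量化:返回码字索引与量化后的向量
    d = np.linalg.norm(x[:, None, :] - codebook[None, :, :], axis=-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

rng = np.random.default_rng(0)
frames = rng.normal(size=(32, 8))          # 帧级骨架特征(示意)
sub_codebook = rng.normal(size=(16, 8))    # 低层:16 个细粒度子动作码字
act_codebook = rng.normal(size=(4, 8))     # 高层:4 个动作级码字
sub_idx, sub_q = quantize(frames, sub_codebook)   # 帧 -> 子动作
act_idx, act_q = quantize(sub_q, act_codebook)    # 子动作 -> 动作
```

act_idx 沿时间轴的取值即可直接当作无监督的动作分割结果读取。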

[CV-15] VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中因处理高分辨率图像和视频帧导致的计算复杂度呈二次增长的问题,现有视觉token剪枝方法依赖预设配置而无法保证计算效率与性能之间的最优权衡。其解决方案的关键在于将剪枝过程建模为帕累托配置优化问题,并通过连续松弛与直通估计器实现基于梯度的搜索,结合增广拉格朗日法求解,从而自动识别出在不同剪枝策略和VLM架构下均具有良好泛化能力的最优剪枝配置。此外,引入可学习核函数进一步揭示了分层剪枝模式,表明多步渐进式剪枝能有效捕捉VLM的层次压缩结构,相比单层剪枝策略获得更优的准确率-效率权衡。

链接: https://arxiv.org/abs/2604.15188
作者: Huawei Ji,Yuanhao Sun,Yuan Jin,Cheng Deng,Jiaxin Ding,Luoyi Fu,Xinbing Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Visual token pruning methods effectively mitigate the quadratic computational growth caused by processing high-resolution images and video frames in vision-language models (VLMs). However, existing approaches rely on predefined pruning configurations without determining whether they achieve computation-performance optimality. In this work, we introduce VisPCO, a novel framework that formulates visual token pruning as a Pareto configuration optimization problem to automatically identify optimal configurations. Our approach employs continuous relaxation and straight-through estimators to enable gradient-based search, solved via the Augmented Lagrangian method. Extensive experiments across 8 visual benchmarks demonstrate that VisPCO effectively approximates the empirical Pareto frontier obtained through grid search and generalizes well across various pruning methods and VLM architectures. Furthermore, through learnable kernel functions, we investigate layer-wise pruning patterns and reveal that multi-step progressive pruning captures VLMs’ hierarchical compression structure, achieving superior accuracy-efficiency trade-offs compared to single-layer approaches.
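摘要中"连续松弛 + 直通估计器(STE)"是使离散剪枝配置可微搜索的标准技巧:前向传播使用硬 0/1 决策,反向传播时把梯度当作作用在连续松弛(如 sigmoid 概率)上。下面用 NumPy 草图示意该机制,logits 的含义("每层是否保留"的松弛参数)与数值均为本文假设,并非论文的具体参数化:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ste_mask(logits):
    # 直通估计器示意:前向用硬 0/1 掩码,反向把梯度直通给软概率
    soft = sigmoid(logits)                 # 连续松弛
    hard = (soft > 0.5).astype(float)      # 前向使用的离散决策
    grad_surrogate = soft * (1.0 - soft)   # d soft / d logits,反向时的替代梯度
    return hard, soft, grad_surrogate

logits = np.array([-2.0, -0.3, 0.4, 3.0])  # 假想的 4 层剪枝松弛参数
hard, soft, g = ste_mask(logits)
keep_ratio = hard.mean()                   # 该配置下保留 token 的层比例
```

在自动微分框架中,这一技巧通常写作 `hard = soft + stop_gradient(round(soft) - soft)`,使前向为硬值而梯度流经软值。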

[CV-16] Boundary-Centric Active Learning for Temporal Action Segmentation

【速读】:该论文旨在解决时间动作分割(Temporal Action Segmentation, TAS)任务中因密集时间监督导致的高标注成本问题,尤其针对动作边界区域的标注效率低下和微小时间偏移对分割指标(如F1分数)造成显著负面影响的问题。解决方案的关键在于提出一种剪辑预算约束下的主动学习框架B-ACT,其核心创新在于:首先通过预测不确定性对未标注视频进行排序与查询;其次在选定视频内,基于融合邻域不确定性、类别模糊性和时间预测动态性的新型边界评分函数,从当前模型预测中筛选出最具信息量的top-K边界帧进行标注;同时采用仅标注边界帧但训练时使用以边界为中心的时间窗口片段的策略,充分利用模型感受野中的时序上下文信息。这一边界中心的标注机制显著提升了标签效率,在GTEA、50Salads和Breakfast数据集上均优于现有TAS主动学习基线和先进方法,尤其是在边界定位主导评估指标的数据集上表现最优。

链接: https://arxiv.org/abs/2604.15173
作者: Halil Ismail Helvaci,Sen-ching Samson Cheung
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Temporal action segmentation (TAS) demands dense temporal supervision, yet most of the annotation cost in untrimmed videos is spent identifying and refining action transitions, where segmentation errors concentrate and small temporal shifts disproportionately degrade segmental metrics. We introduce B-ACT, a clip-budgeted active learning framework that explicitly allocates supervision to these high-leverage boundary regions. B-ACT operates in a hierarchical two-stage loop: (i) it ranks and queries unlabeled videos using predictive uncertainty, and (ii) within each selected video, it detects candidate transitions from the current model predictions and selects the top-K boundaries via a novel boundary score that fuses neighborhood uncertainty, class ambiguity, and temporal predictive dynamics. Importantly, our annotation protocol requests labels for only the boundary frames while still training on boundary-centered clips to exploit temporal context through the model’s receptive field. Extensive experiments on GTEA, 50Salads, and Breakfast demonstrate that boundary-centric supervision delivers strong label efficiency and consistently surpasses representative TAS active learning baselines and prior state of the art under sparse budgets, with the largest gains on datasets where boundary placement dominates edit and overlap-based F1 scores.
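
The boundary score described above — fusing uncertainty, class ambiguity, and temporal dynamics — can be sketched as follows (the equal fusion weights and L1 change measure are illustrative assumptions; B-ACT's exact formulation may differ):

```python
import math

def _argmax(p):
    return max(range(len(p)), key=p.__getitem__)

def _entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def boundary_scores(probs, k):
    """probs: per-frame class-probability lists. Candidates are frames where the
    predicted label flips; each is scored by fusing local uncertainty (entropy),
    class ambiguity (1 - top-2 margin), and temporal change (L1 distance)."""
    cands = []
    for t in range(1, len(probs)):
        prev, cur = probs[t - 1], probs[t]
        if _argmax(cur) != _argmax(prev):                  # candidate transition
            top2 = sorted(cur, reverse=True)[:2]
            score = (_entropy(cur)
                     + (1.0 - (top2[0] - top2[1]))
                     + sum(abs(a - b) for a, b in zip(prev, cur)))
            cands.append((score, t))
    return [t for _, t in sorted(cands, reverse=True)[:k]]
```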

[CV-17] An Analysis of Regularization and Fokker-Planck Residuals in Diffusion Models for Image Generation IJCNN2026

【速读】:该论文旨在解决扩散模型(Diffusion Models)在使用去噪分数匹配(Denoising Score Matching, DSM)训练时,其演化过程常偏离真实数据密度所满足的福克-普朗克(Fokker–Planck, FP)方程的问题。尽管直接在目标函数中惩罚FP残差可减小偏差,但会引入显著计算开销,且严格约束FP方程并不必然提升生成样本质量,最优性能往往出现在较弱的FP正则化条件下。论文的关键解决方案在于:通过系统性地评估多种轻量级正则化项,发现无需复杂计算即可获得与强FP正则化相当的收益,从而在保持生成质量的同时大幅降低计算成本。

链接: https://arxiv.org/abs/2604.15171
作者: Onno Niemann,Gonzalo Martínez Muñoz,Alberto Suárez Gonzalez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at IJCNN 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:Recent work has shown that diffusion models trained with the denoising score matching (DSM) objective often violate the Fokker–Planck (FP) equation that governs the evolution of the true data density. Directly penalizing these deviations in the objective function reduces their magnitude but introduces a significant computational overhead. It is also observed that enforcing strict adherence to the FP equation does not necessarily lead to improvements in the quality of the generated samples, as often the best results are obtained with weaker FP regularization. In this paper, we investigate whether simpler penalty terms can provide similar benefits. We empirically analyze several lightweight regularizers, study their effect on FP residuals and generation quality, and show that the benefits of FP regularization are available at substantially lower computational cost. Our code is available at this https URL.
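
For reference, the Fokker–Planck equation the abstract refers to, for a forward SDE $dx = f(x,t)\,dt + g(t)\,dw$, together with one common score-form residual that FP penalties drive toward zero (the notation and residual form are standard assumptions, not taken from this paper):

```latex
% Fokker--Planck equation for the marginal densities p_t:
\frac{\partial p_t(x)}{\partial t}
  = -\nabla_x \cdot \bigl( f(x,t)\, p_t(x) \bigr)
  + \tfrac{1}{2}\, g(t)^2 \, \Delta_x p_t(x)

% With s_\theta(x,t) \approx \nabla_x \log p_t(x), a score-form FP residual:
R_\theta(x,t) = \partial_t s_\theta
  - \nabla_x \Bigl( \tfrac{1}{2} g(t)^2 \bigl( \nabla_x \cdot s_\theta
      + \lVert s_\theta \rVert^2 \bigr)
  - f \cdot s_\theta - \nabla_x \cdot f \Bigr)
```

Penalizing $\lVert R_\theta \rVert$ directly is what introduces the computational overhead the paper seeks to avoid with lighter regularizers.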

[CV-18] OmniLight: One Model to Rule All Lighting Conditions CVPR

【速读】:该论文旨在解决不良光照条件(如投射阴影和不均匀照明)对计算机视觉系统造成的挑战,这些问题会降低图像可见度和色彩保真度,进而影响下游任务的性能。为应对这一问题,论文提出两种对比策略:一是基于鲁棒的ALN(环境光照归一化,Ambient Lighting Normalization)框架DINOLight,作为专用基线以利用每个数据集的特性;二是扩展得到的通用模型OmniLight,其引入了作者提出的小波域专家混合模型(Wavelet Domain Mixture-of-Experts, WD-MoE),并在所有提供的数据集上联合训练。解决方案的关键在于通过WD-MoE实现跨域泛化能力,在保持高感知质量的同时显著提升模型在多样化真实场景中的适应性,从而在NTIRE 2026挑战赛中三个与光照相关的赛道均取得顶尖排名。

链接: https://arxiv.org/abs/2604.15170
作者: Youngjin Oh,Junyoung Park,Junhyeong Kwon,Nam Ik Cho
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPRW 2026; NTIRE 2026 Image Shadow Removal Ambient Lighting Normalization Challenges (1st Perceptual Rank for White Lighting, 2nd Fidelity Rank 4th Perceptual Rank for Color Lighting)

点击查看摘要

Abstract:Adverse lighting conditions, such as cast shadows and irregular illumination, pose significant challenges to computer vision systems by degrading visibility and color fidelity. Consequently, effective shadow removal and Ambient Lighting Normalization (ALN) are critical for restoring underlying image content, improving perceptual quality, and facilitating robust performance in downstream tasks. However, while achieving state-of-the-art results on specific benchmarks is a primary goal in image restoration challenges, real-world applications often demand robust models capable of handling diverse domains. To address this, we present a comprehensive study on lighting-related image restoration by exploring two contrasting strategies. We leverage a robust framework for ALN, DINOLight, as a specialized baseline to exploit the characteristics of each individual dataset, and extend it to OmniLight, a generalized alternative incorporating our proposed Wavelet Domain Mixture-of-Experts (WD-MoE) that is trained across all provided datasets. Through a comparative analysis of these two methods, we discuss the impact of data distribution on the performance of specialized and unified architectures in lighting-related image restoration. Notably, both approaches secured top-tier rankings across all three lighting-related tracks in the NTIRE 2026 Challenge, demonstrating their outstanding perceptual quality and generalization capabilities. Our codes are available at this https URL.

[CV-19] Class Unlearning via Depth-Aware Removal of Forget-Specific Directions CVPR2026

【速读】:该论文旨在解决模型遗忘(machine unlearning)中的选择性遗忘问题,即如何在不重新训练模型的前提下,有效移除特定类别的知识,同时避免因分类头抑制导致的虚假遗忘(apparent forgetting),并确保保留类别性能不受显著影响。现有方法常存在遗忘方向不明确、深层表示仍保留遗忘类结构或过度依赖最终层偏置调整等问题。解决方案的关键在于提出一种一阶闭式权重手术方法DAMP(Depth-Aware Modulation by Projection):它通过计算下一层可学习模块输入空间中的类别原型,提取与保留类原型差异的遗忘方向作为残差,并基于投影机制减少下游对这些方向的敏感性;同时引入一种无参数的深度感知缩放规则(由探针可分性推导),在浅层进行小幅度修改、深层进行较大修改,从而实现更精确的选择性遗忘和更好的保留类性能保持。

链接: https://arxiv.org/abs/2604.15166
作者: Arman Hatami,Romina Aalishah,Ilya E. Monosov
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the CVPR 2026 Workshop on Machine Unlearning for Vision (MUV)

点击查看摘要

Abstract:Machine unlearning aims to remove targeted knowledge from a trained model without the cost of retraining from scratch. In class unlearning, however, reducing accuracy on forget classes does not necessarily imply true forgetting: forgotten information can remain encoded in internal representations, and apparent forgetting may arise from classifier-head suppression rather than representational removal. We show that existing class-unlearning methods often exhibit weak or negative selectivity, preserve forget-class structure in deep representations, or rely heavily on final-layer bias shifts. We then introduce DAMP (Depth-Aware Modulation by Projection), a one-shot, closed-form weight-surgery method that removes forget-specific directions from a pretrained network without gradient-based optimization. At each stage, DAMP computes class prototypes in the input space of the next learnable operator, extracts forget directions as residuals relative to retain-class prototypes, and applies a projection-based update to reduce downstream sensitivity to those directions. To preserve utility, DAMP uses a parameter-free depth-aware scaling rule derived from probe separability, applying smaller edits in early layers and larger edits in deeper layers. The method naturally extends to multi-class forgetting through low-rank subspace removal. Across MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet, and across convolutional and transformer architectures, DAMP more closely resembles the retraining gold standard than some of the prior methods, improving selective forgetting while better preserving retain-class performance and reducing residual forget-class structure in deep layers.
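
The projection-based update has a compact closed form; a minimal numpy sketch of the single-direction case (prototype values are toys, and the depth-aware scale is just a parameter here — DAMP's multi-class variant removes a low-rank subspace instead of one direction):

```python
import numpy as np

def project_out(W, direction, alpha=1.0):
    """Closed-form weight surgery: reduce a linear layer's sensitivity to one
    input-space direction. alpha in [0, 1] plays the role of the depth-aware
    edit strength (1 = full removal)."""
    v = direction / np.linalg.norm(direction)
    return W - alpha * np.outer(W @ v, v)      # W' x = W x - alpha (W v)(v . x)

# toy forget direction: forget-class prototype minus a retain-class prototype
v = np.array([1.0, 2.0, 0.0]) - np.array([0.0, 1.0, 0.0])
W = np.random.default_rng(0).normal(size=(4, 3))
W_edit = project_out(W, v)
```

After the edit, the layer's output is unchanged on inputs orthogonal to the forget direction, which is what preserves retain-class utility.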

[CV-20] KVNN: Learnable Multi-Kernel Volterra Neural Networks

【速读】:该论文旨在解决传统大规模深度学习模型在追求高阶学习能力时面临的模型复杂度显著增加的问题,即如何在不显著提升参数量和计算成本的前提下增强模型的表达能力。其解决方案的关键在于提出了一种核化Volterra神经网络(kernelized Volterra Neural Network, kVNN),通过引入可学习的多核表示机制,利用不同阶次的多项式核组件对数据的高阶交互特征进行建模,并采用紧凑且可学习的中心参数实现阶次自适应的参数化结构。这种设计使得kVNN滤波器能够直接替代现有架构中的标准卷积核,在保持高效性的同时提升模型对复杂特征组合的捕捉能力。

链接: https://arxiv.org/abs/2604.15141
作者: Haoyu Yun,Hamid Krim,Yufang Bao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Higher-order learning is fundamentally rooted in exploiting compositional features. It clearly hinges on enriching the representation by more elaborate interactions of the data which, in turn, tends to increase the model complexity of conventional large-scale deep learning models. In this paper, a kernelized Volterra Neural Network (kVNN) is proposed. The key to the achieved efficiency lies in using a learnable multi-kernel representation, where different interaction orders are modeled by distinct polynomial-kernel components with compact, learnable centers, yielding an order-adaptive parameterization. Features are learned by the composition of layers, each of which consists of parallel branches of different polynomial orders, enabling kVNN filters to directly replace standard convolutional kernels within existing architectures. The theoretical results are substantiated by experiments on two representative tasks: video action recognition and image denoising. The results demonstrate favorable performance-efficiency trade-offs: kVNN consistently yields reduced model (parameters) and computational (GFLOPs) complexity with competitive and often improved performance. These results are maintained even when trained from scratch without large-scale pretraining. In summary, we substantiate that structured kernelized higher-order layers offer a practical path to balancing expressivity and computational cost in modern deep networks.
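
A minimal sketch of parallel polynomial-kernel branches with explicit centers (a dense toy stand-in with fixed centers; the actual kVNN uses convolutional filters and learns the centers):

```python
import numpy as np

def multi_kernel_branches(x, centers, orders):
    """Parallel branches of different polynomial orders: an order-p response
    (<c, x> + 1)^p models order-p interactions via a compact center c."""
    return np.concatenate([(centers[p] @ x + 1.0) ** p for p in orders])

# toy centers, one per order (learnable in the real model)
centers = {1: np.array([[1.0, 0.0]]), 2: np.array([[0.0, 1.0]])}
out = multi_kernel_branches(np.array([1.0, 0.0]), centers, orders=[1, 2])
```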

[CV-21] How to Correctly Make Mistakes: A Framework for Constructing and Benchmarking Mistake Aware Egocentric Procedural Videos

【速读】:该论文旨在解决现有视频中程序性行为监测缺乏真实人类错误及其恢复过程的问题,尤其是在第一人称视角(egocentric)视频中,错误常被手部遮挡且仅通过细微的物体状态变化体现,而现有数据集对错误和纠正的记录有限且不一致。其解决方案的关键在于提出 PIE-V(Psychologically Inspired Error injection for Videos)框架,该框架通过心理启发的误差规划器(基于任务阶段和语义步骤负载)与恢复规划器协同生成合理的人类可接受偏差及修复行为,并利用大语言模型(LLM)进行级联一致性的文本重写与质量验证,结合文本引导的视频生成技术实现视觉上合理的片段替换,从而构建高质量、结构化且可评估的含错程序视频数据集。

链接: https://arxiv.org/abs/2604.15134
作者: Olga Loginova,Frank Keller
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reliable procedural monitoring in video requires exposure to naturally occurring human errors and the recoveries that follow. In egocentric recordings, mistakes are often partially occluded by hands and revealed through subtle object state changes, while existing procedural datasets provide limited and inconsistent mistake and correction traces. We present PIE-V (Psychologically Inspired Error injection for Videos), a framework for constructing and benchmarking mistake-aware egocentric procedural videos by augmenting clean keystep procedures with controlled, human-plausible deviations. PIE-V combines a psychology-informed error planner conditioned on procedure phase and semantic step load, a correction planner that models recovery behavior, an LLM writer that performs cascade-consistent rewrites, and an LLM judge that validates procedural coherence and repairs failures. For video segment edits, PIE-V synthesizes replacement clips with text-guided video generation and stitches them into the episode to preserve visual plausibility. Applied to 17 tasks and 50 Ego-Exo4D scenarios, PIE-V injects 102 mistakes and generates 27 recovery corrections. For benchmarking, we introduce a unified taxonomy and a human rubric with nine metrics that cover step-level and procedure-level quality, including plausibility, procedure logic with annotator confidence, state change coherence, and grounding between text and video. Using this protocol, we audit several existing resources and compare PIE-V against a freeform LLM generation baseline under the same criteria. Together, the framework and rubric support post-completion verification for egocentric procedural mistake detection and correction.

[CV-22] Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography ICLR2026

【速读】:该论文旨在解决超声心动图(echocardiography)在心脏评估中因视图稀疏且时空分布异质性导致的多视角结构建模难题。现有掩码自编码器(masked autoencoder, MAE)方法通常独立处理图像或短片段,无法捕捉心脏成像所需的多视角一致性表示。其解决方案的关键在于提出Latent Attention Masked Autoencoder (LAMAE),通过引入潜空间注意力模块(latent attention module),直接在潜在空间内实现跨帧与跨视图的信息交互,从而聚合不同长度序列和多样视图,从局部观测中重建心脏功能的整体表征。此设计显著提升了模型对真实临床数据的适应能力,并实现了成人到儿科群体的有效迁移。

链接: https://arxiv.org/abs/2604.15096
作者: Simon Böhi,Irene Cannistraci,Sergio Muñoz Gonzalez,Moritz Vandenhirtz,Sonia Laguna,Samuel Ruiperez-Campillo,Max Krähenmann,Andrea Agostini,Ece Ozkan,Thomas M. Sutter,Julia E. Vogt
机构: University of Basel(巴塞尔大学); ETH Zurich(苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted as a workshop paper at the ICLR 2026 Workshop on Foundation Models for Science

点击查看摘要

Abstract:Echocardiography is a widely used modality for cardiac assessment due to its non-invasive and cost-effective nature, but the sparse and heterogeneous spatiotemporal views of the heart pose distinct challenges. Existing masked autoencoder (MAE) approaches typically process images or short clips independently, failing to capture the inherent multi-view structure required for coherent cardiac representation. We introduce Latent Attention Masked Autoencoder (LAMAE), a foundation model architecture tailored to the multi-view nature of medical imaging. LAMAE augments the standard MAE with a latent attention module that enables information exchange across frames and views directly in latent space. This allows the model to aggregate variable-length sequences and distinct views, reconstructing a holistic representation of cardiac function from partial observations. We pretrain LAMAE on MIMIC-IV-ECHO, a large-scale, uncurated dataset reflecting real-world clinical variability. To the best of our knowledge, we present the first results for predicting ICD-10 codes from MIMIC-IV-ECHO videos. Furthermore, we empirically demonstrate that representations learned from adult data transfer effectively to pediatric cohorts despite substantial anatomical differences. These results provide evidence that incorporating structural priors, such as multi-view attention, yields significantly more robust and transferable representations.
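
The latent attention idea — variable-length token sets from different views exchanging information in latent space — reduces to scaled dot-product attention over the concatenated latents (query/key/value projections omitted for brevity; all names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def latent_attention(latents):
    """latents: list of (n_tokens_i, d) arrays, one per frame/view; views may
    differ in token count. Every token attends to tokens from all views."""
    Z = np.concatenate(latents, axis=0)            # (total_tokens, d)
    A = softmax(Z @ Z.T / np.sqrt(Z.shape[1]))     # cross-view attention weights
    return A @ Z

views = [np.ones((2, 3)), np.ones((1, 3))]         # two views, different lengths
out = latent_attention(views)
```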

[CV-23] Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID

【速读】:该论文旨在解决任意时间行人重识别(Any-Time Person Re-Identification, AT-ReID)中因光照变化(如可见光RGB与红外IR模态切换)和长时间衣物更换导致的特征不稳定性问题,这些问题使得传统依赖纯视觉特征的方法性能显著下降。解决方案的关键在于提出一种基于语义驱动的令牌过滤与专家路由框架(Semantic-driven Token Filtering and Expert Routing, STFER),其核心创新是利用大视觉语言模型(Large Vision-Language Models, LVLMs)生成具有身份一致性的语义文本描述,从而提取对服装变化和跨模态差异鲁棒的身份判别特征;该文本令牌进一步用于两个关键机制:一是语义驱动的视觉令牌过滤(Semantic-driven Visual Token Filtering, SVTF),增强信息丰富的视觉区域并抑制冗余背景噪声;二是语义驱动的专家路由(Semantic-driven Expert Routing, SER),将语义信息融入专家门控机制,提升多场景下的适应能力。实验表明,该方法在AT-USTC数据集上达到SOTA,并在5个主流ReID基准上展现出优异的泛化性能。

链接: https://arxiv.org/abs/2604.15090
作者: Jiaxuan Li,Xin Wen,Zhihang Li
机构: Tsinghua University (清华大学); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Any-Time Person Re-identification (AT-ReID) necessitates the robust retrieval of target individuals under arbitrary conditions, encompassing both modality shifts (daytime and nighttime) and extensive clothing-change scenarios, ranging from short-term to long-term intervals. However, existing methods rely heavily on pure visual features, which are prone to change with environmental and temporal factors, resulting in significant performance deterioration under illumination-induced modality shifts or clothing changes. In this paper, we propose Semantic-driven Token Filtering and Expert Routing (STFER), a novel framework that leverages the ability of Large Vision-Language Models (LVLMs) to generate identity-consistent text, which provides identity-discriminative features that are robust to both clothing variations and cross-modality shifts between RGB and IR. Specifically, we employ instructions to guide the LVLM in generating identity-intrinsic semantic text that captures biometric constants for semantic-driven modeling. The text token is further used for Semantic-driven Visual Token Filtering (SVTF), which enhances informative visual regions and suppresses redundant background noise. Meanwhile, the text token is also used for Semantic-driven Expert Routing (SER), which integrates the semantic text into expert routing, resulting in more robust multi-scenario gating. Extensive experiments on the Any-Time ReID dataset (AT-USTC) demonstrate that our model achieves state-of-the-art results. Moreover, the model trained on AT-USTC was evaluated across 5 widely-used ReID benchmarks, demonstrating superior generalization capabilities with highly competitive results. Our code will be available soon.

[CV-24] Building Extraction from Remote Sensing Imagery under Hazy and Low-light Conditions: Benchmark and Baseline

【速读】:该论文旨在解决光学遥感(Optical Remote Sensing, RS)影像中建筑物提取在雾霾和低光照等恶劣天气条件下的性能退化问题。现有方法与基准数据集主要针对理想晴朗天气设计,而合成孔径雷达(SAR)虽具备全天候成像能力,却因侧视几何特性导致几何畸变。为此,作者提出首个专为雾霾与低光场景设计的光学建筑提取基准数据集HaLoBuilding,并构建了端到端的HaLoBuild-Net框架。其核心创新在于三个模块:空间-频率聚焦模块(Spatial-Frequency Focus Module, SFFM)通过大感受野注意力机制与基于稳定低频锚点的频域感知通道重加权,有效抑制气象干扰对建筑特征的影响;全局多尺度引导模块(Global Multi-scale Guidance Module, GMGM)提供全局语义约束以锚定建筑拓扑结构;互引导融合模块(Mutual-Guided Fusion Module, MGFM)实现双向语义-空间校准,抑制浅层噪声并锐化天气引起的模糊边界。实验表明,该方法在HaLoBuilding数据集上显著优于当前最优方法及传统级联复原-分割范式,且在WHU、INRIA和LoveDA等多个公开数据集上保持良好泛化能力。

链接: https://arxiv.org/abs/2604.15088
作者: Feifei Sang,Wei Lu,Hongruixuan Chen,Sibao Chen,Bin Luo
机构: Anhui University (安徽大学); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 12 figures, 9 tables

点击查看摘要

Abstract:Building extraction from optical Remote Sensing (RS) imagery suffers from performance degradation under real-world hazy and low-light conditions. However, existing optical methods and benchmarks focus primarily on ideal clear-weather conditions. While SAR offers all-weather sensing, its side-looking geometry causes geometric distortions. To address these challenges, we introduce HaLoBuilding, the first optical benchmark specifically designed for building extraction under hazy and low-light conditions. By leveraging a same-scene multitemporal pairing strategy, we ensure pixel-level label alignment and high fidelity even under extreme degradation. Building upon this benchmark, we propose HaLoBuild-Net, a novel end-to-end framework for building extraction in adverse RS scenarios. At its core, we develop a Spatial-Frequency Focus Module (SFFM) to effectively mitigate meteorological interference on building features by coupling large receptive field attention with frequency-aware channel reweighting guided by stable low-frequency anchors. Additionally, a Global Multi-scale Guidance Module (GMGM) provides global semantic constraints to anchor building topologies, while a Mutual-Guided Fusion Module (MGFM) implements bidirectional semantic-spatial calibration to suppress shallow noise and sharpen weather-induced blurred boundaries. Extensive experiments demonstrate that HaLoBuild-Net significantly outperforms state-of-the-art methods and conventional cascaded restoration-segmentation paradigms on the HaLoBuilding dataset, while maintaining robust generalization on WHU, INRIA, and LoveDA datasets. The source code and datasets are publicly available at: this https URL.
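
A toy version of frequency-aware channel reweighting guided by a low-frequency anchor, the mechanism the SFFM description above centers on (the cutoff, energy measure, and normalization are illustrative assumptions, not the module's exact design):

```python
import numpy as np

def lowfreq_channel_reweight(x, cutoff=2):
    """x: (channels, H, W) feature map. Each channel is weighted by the energy of
    its low-frequency band -- a band that stays comparatively stable under haze
    and low light, hence usable as an anchor."""
    weights = np.empty(x.shape[0])
    for c in range(x.shape[0]):
        F = np.fft.fftshift(np.fft.fft2(x[c]))        # DC moved to the center
        h, w = F.shape
        low = F[h // 2 - cutoff:h // 2 + cutoff + 1,
                w // 2 - cutoff:w // 2 + cutoff + 1]
        weights[c] = np.abs(low).mean()
    weights /= weights.sum() + 1e-8                    # normalized channel weights
    return x * weights[:, None, None]
```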

[CV-25] ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

【速读】:该论文旨在解决视频到音频(Video-to-Audio, V2A)生成中面临的鲁棒性和细粒度控制难题,具体包括:在视觉与文本信息冲突时文本控制能力弱、参考音频中时间信息与音色特征纠缠导致风格控制不精确,以及缺乏标准化评估基准。其解决方案的关键在于提出ControlFoley框架,包含三个核心创新:1)引入联合视觉编码机制,融合CLIP与时空音频-视觉编码器以增强跨模态对齐和文本可控性;2)设计时间-音色解耦策略,在抑制冗余时间线索的同时保留区分性音色特征,实现更精准的风格控制;3)构建模态鲁棒训练方案,通过统一多模态表示对齐(REPA)和随机模态丢弃提升模型在跨模态干扰下的稳定性。此外,作者还提出了VGGSound-TVC基准用于系统评估不同视觉-文本冲突强度下的文本控制性能。

链接: https://arxiv.org/abs/2604.15086
作者: Jianxuan Yang,Xinyue Guo,Zhi Cheng,Kai Wang,Lipan Zhang,Jinjie Hu,Qiang Ji,Yihua Cao,Yihao Meng,Zhaoyue Cui,Mengmei Liu,Meng Meng,Jian Luan
机构: MiLM Plus, Xiaomi Inc.; Wuhan University
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecise stylistic control due to entangled temporal and timbre information in reference audio. Moreover, the lack of standardized benchmarks limits systematic evaluation. We propose ControlFoley, a unified multimodal V2A framework that enables precise control over video, text, and reference audio. We introduce a joint visual encoding paradigm that integrates CLIP with a spatio-temporal audio-visual encoder to improve alignment and textual controllability. We further propose temporal-timbre decoupling to suppress redundant temporal cues while preserving discriminative timbre features. In addition, we design a modality-robust training scheme with unified multimodal representation alignment (REPA) and random modality dropout. We also present VGGSound-TVC, a benchmark for evaluating textual controllability under varying degrees of visual-text conflict. Extensive experiments demonstrate state-of-the-art performance across multiple V2A tasks, including text-guided, text-controlled, and audio-controlled generation. ControlFoley achieves superior controllability under cross-modal conflict while maintaining strong synchronization and audio quality, and shows competitive or better performance compared to an industrial V2A system. Code, models, datasets, and demos are available at: this https URL.

[CV-26] Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection ICMR2026

【速读】:该论文旨在解决基于Transformer的小目标检测中因背景干扰导致的查询噪声问题,以及由此引发的检测效率低下和特征表示稀疏的问题。其解决方案的关键在于提出一种噪声感知的位置-语义融合框架HELP(Heatmap-guided Embedding Learning Paradigm),核心创新是引入Heatmap-guided Positional Embedding(HPE)机制:在编码阶段通过热图引导的位置嵌入抑制背景杂波、增强前景显著区域的定位信息,在解码阶段利用基于梯度的掩码过滤器剔除背景主导的嵌入,从而提升查询质量;同时结合Linear-Snake Convolution缓解复杂小目标的特征稀疏性,并通过仅在训练时施加梯度监督的热图约束实现高效推理。该设计使解码层从8层减少至3层,参数量降低59.4%(从163M降至66.3M),且在多个基准上保持精度稳定。

链接: https://arxiv.org/abs/2604.15065
作者: Yangchen Zeng,Zhenyu Yu,Dongming Jiang,Wenbo Zhang,Yifan Hong,Zhanhua Hu,Jiao Luo,Kangning Cui
机构: Southeast University (东南大学); Fudan University (复旦大学); The University of Texas at Dallas (德克萨斯大学达拉斯分校); Zhejiang Normal University (浙江师范大学); Data Space Research Institute, Hefei Comprehensive National Science Center (合肥综合性国家科学中心数据空间研究院); Rice University (莱斯大学); Huazhong Agricultural University (华中农业大学); City University of Hong Kong (Dongguan) (香港城市大学(东莞)); Wake Forest University (维克森林大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ACM ICMR 2026; 14 pages, 6 figures, and 4 tables

点击查看摘要

Abstract:Transformer-based detectors have advanced small-object detection, but they often remain inefficient and vulnerable to background-induced query noise, which motivates deep decoders to refine low-quality queries. We present HELP (Heatmap-guided Embedding Learning Paradigm), a noise-aware positional-semantic fusion framework that studies where to embed positional information by selectively preserving positional encodings in foreground-salient regions while suppressing background clutter. Within HELP, we introduce Heatmap-guided Positional Embedding (HPE) as the core embedding mechanism and visualize it with a heatbar for interpretable diagnosis and fine-tuning. HPE is integrated into both the encoder and decoder: it guides noise-suppressed feature encoding by injecting heatmap-aware positional encoding, and it enables high-quality query retrieval by filtering background-dominant embeddings via a gradient-based mask filter before decoding. To address feature sparsity in complex small targets, we integrate Linear-Snake Convolution to enrich retrieval-relevant representations. The gradient-based heatmap supervision is used during training only, incurring no additional gradient computation at inference. As a result, our design reduces decoder layers from eight to three and achieves a 59.4% parameter reduction (66.3M vs. 163M) while maintaining consistent accuracy gains under a reduced compute budget across benchmarks. Code Repository: this https URL
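
The core gating idea — keep positional encodings only in foreground-salient regions — reduces to a mask over per-token heatmap values (a hard-threshold simplification; HELP's actual injection is learned and heatmap-supervised during training):

```python
import numpy as np

def heatmap_gated_pos_embed(pos_embed, heatmap, tau=0.5):
    """pos_embed: (tokens, d); heatmap: (tokens,) foreground saliency. Positional
    signal is kept at salient tokens and suppressed on background clutter."""
    gate = (heatmap >= tau).astype(pos_embed.dtype)   # 1 = foreground-salient
    return pos_embed * gate[:, None]

pe = np.ones((3, 4))                  # 3 tokens, embedding dim 4 (toy values)
heat = np.array([0.9, 0.2, 0.6])      # per-token saliency from the heatmap
gated = heatmap_gated_pos_embed(pe, heat)
```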

[CV-27] Attention-Gated Convolutional Networks for Scanner-Agnostic Quality Assessment

【速读】:该论文旨在解决结构磁共振成像(sMRI)中运动伪影(motion artifacts)对临床诊断和大规模自动化分析的干扰问题,尤其针对手动质量控制(QC)在纵向研究中难以扩展的瓶颈。其解决方案的关键在于提出一种混合卷积神经网络-注意力机制(CNN-Attention)框架:通过层次化2D卷积编码器提取局部空间特征,并引入多头交叉注意力机制建模全局依赖关系,从而动态聚焦于与运动相关的伪影特征(如环状伪影和模糊),同时过滤掉站点特异性强度变化和背景噪声。该设计实现了跨站点的强泛化能力,在未见站点数据上仍保持高准确率(Acc = 0.755),无需重新训练或微调,验证了注意力机制在捕捉通用伪影表征方面的有效性。

链接: https://arxiv.org/abs/2604.15059
作者: Chinmay Bakhale,Anil Sao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Motion artifacts present a significant challenge in structural MRI (sMRI), often compromising clinical diagnostics and large-scale automated analysis. While manual quality control (QC) remains the gold standard, it is increasingly unscalable for massive longitudinal studies. To address this, we propose a hybrid CNN-Attention framework designed for robust, site-invariant MRI quality assessment. Our architecture integrates a hierarchical 2D CNN encoder for local spatial feature extraction with a multi-head cross-attention mechanism to model global dependencies. This synergy enables the model to prioritize motion-relevant artifact signatures, such as ringing and blurring, while dynamically filtering out site-specific intensity variations and background noise. The framework was trained end-to-end on the MR-ART dataset using a balanced cohort of 200 subjects. Performance was evaluated across two tiers: Seen Site Evaluation on a held-out MR-ART partition and Unseen Site Evaluation using 200 subjects from 17 heterogeneous sites in the ABIDE archive. On seen sites, the model achieved a scan-level accuracy of 0.9920 and an F1-score of 0.9919. Crucially, it maintained strong generalization across unseen ABIDE sites (Acc = 0.755) without any retraining or fine-tuning, demonstrating high resilience to domain shift. These results indicate that attention-based feature re-weighting successfully captures universal artifact descriptors, bridging the performance gap between diverse imaging environments and scanner manufacturers.

[CV-28] Implicit Neural Representations: A Signal Processing Perspective

【速读】:该论文旨在解决传统信号建模中离散采样数据表示的局限性,提出通过隐式神经表示(Implicit Neural Representations, INRs)实现连续函数形式的统一建模。其核心问题在于如何将图像、音频、视频、3D几何等多模态信号以连续函数的形式进行高效、可微且具有多尺度特性的表达,并支持如微分运算等解析操作。解决方案的关键在于利用神经网络参数化信号为坐标空间上的连续函数,从而借助自动微分(automatic differentiation)实现精确的信号处理;同时,通过设计特定激活函数(如周期性、局部化和自适应函数)以及结构化表示方法(如分层分解与哈希网格编码),重塑逼近空间以增强频率响应能力和空间适应性,进而提升计算效率与建模精度。

链接: https://arxiv.org/abs/2604.15047
作者: Dhananjaya Jayasundara,Vishal M. Patel
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Implicit neural representations (INRs) mark a fundamental shift in signal modeling, moving from discrete sampled data to continuous functional representations. By parameterizing signals as neural networks, INRs provide a unified framework for representing images, audio, video, 3D geometry, and beyond as continuous functions of their coordinates. This functional viewpoint enables signal operations such as differentiation to be carried out analytically through automatic differentiation rather than through discrete approximations. In this article, we examine the evolution of INRs from a signal processing perspective, emphasizing spectral behavior, sampling theory, and multiscale representation. We trace the progression from standard coordinate based networks, which exhibit a spectral bias toward low frequency components, to more advanced designs that reshape the approximation space through specialized activations, including periodic, localized, and adaptive functions. We also discuss structured representations, such as hierarchical decompositions and hash grid encodings, that improve spatial adaptivity and computational efficiency. We further highlight the utility of INRs across a broad range of applications, including inverse problems in medical and radar imaging, compression, and 3D scene representation. By interpreting INRs as learned signal models whose approximation spaces adapt to the underlying data, this article clarifies the field’s core conceptual developments and outlines open challenges in theoretical stability, weight space interpretability, and large scale generalization.
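
The analytic-differentiation property highlighted above can be made concrete with a one-hidden-layer sine-activated INR, whose exact derivative follows from the chain rule (a toy 1-D model with random, untrained weights — not a specific architecture from the article):

```python
import numpy as np

# One-hidden-layer SIREN-style INR f(x) = a . sin(omega (w x + b)).
rng = np.random.default_rng(0)
w, b, a = rng.normal(size=8), rng.normal(size=8), rng.normal(size=8)
omega = 30.0

def f(x):
    return a @ np.sin(omega * (w * x + b))

def dfdx(x):
    """Exact derivative in closed form via the chain rule -- no discrete
    approximation of the signal is needed."""
    return a @ (omega * w * np.cos(omega * (w * x + b)))

# agreement with a central finite difference of f
x0, h = 0.3, 1e-6
fd = (f(x0 + h) - f(x0 - h)) / (2 * h)
```

In practice, frameworks obtain such derivatives through automatic differentiation; the closed form here simply makes the continuous-function view tangible.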

[CV-29] When Fairness Metrics Disagree: Evaluating the Reliability of Demographic Fairness Assessment in Machine Learning

【速读】:该论文旨在解决当前机器学习系统公平性评估中因依赖单一公平性指标而导致结论不一致的问题。在高风险应用场景中,如生物特征识别和医疗决策,不同公平性指标(如误差率差异与基于性能的度量)可能对同一模型得出相互矛盾的偏见判断,从而误导评估结果。其解决方案的关键在于提出一种新的量化指标——公平性分歧指数(Fairness Disagreement Index, FDI),用于衡量多个公平性指标之间的一致性程度,并通过在人脸识别人群分区中的多指标系统性分析,证明了即使在不同阈值和模型配置下,公平性评估仍存在显著分歧,从而强调了多指标联合评估的必要性,以实现更可靠的偏见检测。

链接: https://arxiv.org/abs/2604.15038
作者: Khalid Adnan Alsayed
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 4 figues, 5 tables

点击查看摘要

Abstract:The evaluation of fairness in machine learning systems has become a central concern in high-stakes applications, including biometric recognition, healthcare decision-making, and automated risk assessment. Existing approaches typically rely on a small number of fairness metrics to assess model behaviour across group partitions, implicitly assuming that these metrics provide consistent and reliable conclusions. However, different fairness metrics capture distinct statistical properties of model performance and may therefore produce conflicting assessments when applied to the same system. In this work, we investigate the consistency of fairness evaluation by conducting a systematic multi-metric analysis of demographic bias in machine learning models. Using face recognition as a controlled experimental setting, we evaluate model performance across multiple group partitions under a range of commonly used fairness metrics, including error-rate disparities and performance-based measures. Our results demonstrate that fairness assessments can vary significantly depending on the choice of metrics, leading to contradictory conclusions regarding model bias. To quantify this phenomenon, we introduce the Fairness Disagreement Index (FDI), a measure designed to capture the degree of inconsistency across fairness metrics. We further show that disagreement remains high across thresholds and model configurations. These findings highlight a critical limitation in current fairness evaluation practices and suggest that single-metric reporting is insufficient for reliable bias assessment.
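
One plausible way to quantify metric disagreement is the fraction of metric pairs that disagree on which group fares worse — note this is an illustrative stand-in; the paper's exact FDI definition may differ:

```python
def group_rates(y_true, y_pred):
    """Accuracy, TPR, and complemented FPR for one demographic group
    (all oriented so that higher = better)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {"accuracy": (tp + tn) / len(y_true),
            "tpr": tp / max(tp + fn, 1),
            "fpr_comp": 1.0 - fp / max(fp + tn, 1)}

def fairness_disagreement(groups):
    """groups: {name: (y_true, y_pred)}. Fraction of metric pairs that disagree
    on which group is worst off -- a hypothetical FDI-style measure."""
    names = list(groups)
    metrics = {g: group_rates(*groups[g]) for g in names}
    keys = list(metrics[names[0]])
    worse = {m: min(names, key=lambda g: metrics[g][m]) for m in keys}
    pairs = [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]]
    return sum(worse[a] != worse[b] for a, b in pairs) / len(pairs)
```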

[CV-30] Quality-Aware Calibration for AI-Generated Image Detection in the Wild CVPR2026

【速读】:该论文旨在解决当前合成图像检测方法在真实网络传播场景下性能不稳定的问题,即同一图像因多次再压缩、缩放和裁剪等操作产生多个近似重复版本时,不同版本可能引发不一致的伪造检测结果。解决方案的关键在于提出一种名为QuAD(Quality-Aware calibration with near-Duplicates)的新框架,该框架通过检索并聚合查询图像的所有在线近似重复版本的检测分数,同时依据估计的质量权重对各版本进行加权融合,从而利用全部可用信息并降低受多轮处理影响而质量下降的图像的干扰,显著提升检测可靠性。

链接: https://arxiv.org/abs/2604.15027
作者: Fabrizio Guillaro,Vincenzo De Rosa,Davide Cozzolino,Luisa Verdoliva
机构: University Federico II of Naples(那不勒斯腓特烈二世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the APAI Workshop at CVPR 2026

点击查看摘要

Abstract:Significant progress has been made in detecting synthetic images; however, most existing approaches operate on a single image instance and overlook a key characteristic of real-world dissemination: as viral images circulate on the web, multiple near-duplicate versions appear and lose quality due to repeated operations like recompression, resizing and cropping. As a consequence, the same image may yield inconsistent forensic predictions based on which version has been analyzed. In this work, to address this issue, we propose QuAD (Quality-Aware calibration with near-Duplicates), a novel framework that makes decisions based on all available near-duplicates of the same image. Given a query, we retrieve its online near-duplicates and feed them to a detector: the resulting scores are then aggregated based on the estimated quality of the corresponding instance. By doing so, we take advantage of all pieces of information while accounting for the reduced reliability of images impaired by multiple processing steps. To support large-scale evaluation, we introduce two datasets: AncesTree, an in-lab dataset of 136k images organized in stochastic degradation trees that simulate online reposting dynamics, and ReWIND, a real-world dataset of nearly 10k near-duplicate images collected from viral web content. Experiments on several state-of-the-art detectors show that our quality-aware fusion improves their performance consistently, with an average gain of around 8% in terms of balanced accuracy compared to plain average. Our results highlight the importance of jointly processing all the images available online to achieve reliable detection of AI-generated content in real-world applications. Code and data are publicly available at this https URL
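
摘要描述的核心融合步骤——按估计质量对各近重复版本的检测分数加权聚合——可以用如下最小草图示意(权重归一化方式为本文假设,并非论文原始实现):

```python
import numpy as np

def quad_fuse(scores, qualities):
    """按估计质量加权融合同一图像各近重复版本的检测分数
    (示意实现;线性归一化权重为本文假设)。"""
    s = np.asarray(scores, dtype=float)
    w = np.asarray(qualities, dtype=float)
    return float(np.dot(w / w.sum(), s))

# 三个近重复版本:经多次转发压缩、质量低的版本被降权
scores = [0.9, 0.8, 0.2]       # 检测器对各版本的伪造分数(虚构)
qualities = [1.0, 0.8, 0.1]    # 各版本的估计质量(虚构)
print(quad_fuse(scores, qualities), np.mean(scores))
```

与摘要中作为基线的简单平均(plain average)相比,质量加权会弱化低可靠性版本对最终判决的干扰。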

[CV-31] Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation

【速读】:该论文旨在解决图像到视频(I2V)生成技术带来的新型视频取证挑战,即传统基于二维像素级篡改定位的方法无法有效应对视频中像素随时间流动与变形的动态特性。其核心问题是:如何在非确定性的生成过程中识别并追踪可随时间演化的取证特征。解决方案的关键在于提出“Flow of Truth”框架,创新性地将视频生成重新定义为“像素随时间的运动”,而非帧的合成;进而设计了一个可学习的取证模板以跟随像素运动,并引入模板引导的流模块实现运动与图像内容的解耦,从而实现对I2V视频的鲁棒时序溯源。

链接: https://arxiv.org/abs/2604.15003
作者: Yuzhuo Chen,Zehua Ma,Han Fang,Hengyi Wang,Guanjie Wang,Weiming Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid rise of image-to-video (I2V) generation enables realistic videos to be created from a single image but also brings new forensic demands. Unlike static images, I2V content evolves over time, requiring forensics to move beyond 2D pixel-level tampering localization toward tracing how pixels flow and transform throughout the video. As frames progress, embedded traces drift and deform, making traditional spatial forensics ineffective. To address this unexplored dimension, we present Flow of Truth, the first proactive framework focusing on temporal forensics in I2V generation. A key challenge lies in discovering a forensic signature that can evolve consistently with the generation process, which is inherently a creative transformation rather than a deterministic reconstruction. Despite this intrinsic difficulty, we innovatively redefine video generation as the motion of pixels through time rather than the synthesis of frames. Building on this view, we propose a learnable forensic template that follows pixel motion and a template-guided flow module that decouples motion from image content, enabling robust temporal tracing. Experiments show that Flow of Truth generalizes across commercial and open-source I2V models, substantially improving temporal forensics performance.

[CV-32] Robustness of Vision Foundation Models to Common Perturbations CVPR2026

【速读】:该论文旨在解决视觉基础模型(Vision Foundation Model)在面对常见图像编辑操作(如JPEG压缩、亮度和对比度调整等)时,其输出的嵌入向量(embedding vector)易受扰动的问题,进而影响下游任务(如分类)的性能。解决方案的关键在于:首先提出三种可量化的鲁棒性指标,并基于五个数学性质对这些指标进行理论分析;其次通过系统评估六个工业级视觉基础模型发现它们普遍缺乏鲁棒性;最后提出一种微调策略,在不损害模型原始功能(utility)的前提下提升其对扰动的鲁棒性,从而有效缓解因常见图像处理导致的性能下降问题。

链接: https://arxiv.org/abs/2604.14973
作者: Hongbin Liu,Zhengyuan Jiang,Cheng Hong,Neil Zhenqiang Gong
机构: Duke University (杜克大学); Ant Group (蚂蚁集团)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026 Workshop

点击查看摘要

Abstract:A vision foundation model outputs an embedding vector for an image, which can be affected by common editing operations (e.g., JPEG compression, brightness, contrast adjustments). These common perturbations alter embedding vectors and may impact the performance of downstream tasks using these embeddings. In this work, we present the first systematic study on foundation models’ robustness to such perturbations. We propose three robustness metrics and formulate five desired mathematical properties for these metrics, analyzing which properties they satisfy or violate. Using these metrics, we evaluate six industry-scale foundation models (OpenAI, Meta) across nine common perturbation categories, finding them generally non-robust. We also show that common perturbations degrade downstream application performance (e.g., classification accuracy) and that robustness values can predict performance impacts. Finally, we propose a fine-tuning approach to improve robustness without sacrificing utility.
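
论文提出的三种鲁棒性指标未在摘要中给出公式。作为示意,可以用原图嵌入与扰动后嵌入之间的余弦相似度来近似这种鲁棒性(该具体形式是本文的假设,并非论文指标):

```python
import numpy as np

def cosine_robustness(emb_orig, emb_pert):
    """示意性鲁棒性度量:原图与扰动图嵌入的余弦相似度,
    越接近 1 表示嵌入对该扰动越稳定(具体形式为假设)。"""
    a = np.asarray(emb_orig, dtype=float)
    b = np.asarray(emb_pert, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

emb = np.array([1.0, 0.0, 0.0])          # 原图嵌入(虚构)
emb_jpeg = np.array([0.98, 0.1, 0.0])    # JPEG 压缩后的嵌入(虚构)
print(cosine_robustness(emb, emb_jpeg))
```

按论文结论,这类鲁棒性数值可用于预测扰动对下游分类准确率的影响。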

[CV-33] UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

【速读】:该论文旨在解决现有视觉检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理复杂推理任务时,因依赖通用检索信号而忽略细粒度视觉语义信息的问题。其核心解决方案是提出一种统一的强化学习框架UniDoc-RL,该框架将视觉信息获取建模为具有分层动作空间的序列决策问题,使大型视觉语言模型(Large Vision-Language Models, LVLMs)能够从粗粒度文档检索逐步精炼至细粒度图像选择与主动区域裁剪,从而抑制无关内容并聚焦于信息密集区域。关键创新在于引入密集多奖励机制以提供每个动作的任务感知监督,并基于Group Relative Policy Optimization(GRPO)实现端到端训练,无需额外的价值网络即可对齐多目标行为,显著提升了复杂视觉推理性能。

链接: https://arxiv.org/abs/2604.14967
作者: Jun Wang,Shuo Tan,Zelong Sun,Tiancheng Gu,Yongle Zhao,Ziyong Feng,Kaicheng Yang,Cewu Lu
机构: DeepGlint-AI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 11 figures

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
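
摘要提到 GRPO 用组内相对优势替代单独的价值网络,其标准归一化形式可示意如下(奖励数值为虚构示例;论文中的奖励是各层级动作稠密奖励的组合):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """GRPO 组内相对优势(标准公式):对同一查询采样的一组
    轨迹奖励做均值-标准差归一化,无需额外价值网络。"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# 同一查询下 4 条轨迹的总奖励(虚构示例)
adv = grpo_advantages([1.0, 0.5, 0.0, 0.5])
print(adv)
```

优势在组内零均值:优于组平均的轨迹被强化,劣于平均的被抑制。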

[CV-34] Frequency-Enhanced Dual-Subspace Networks for Few-Shot Fine-Grained Image Classification

【速读】:该论文旨在解决少样本细粒度图像分类中因仅依赖空间域特征而导致的纹理偏差问题,以及由此引发的结构信息不稳定和对背景噪声过拟合的问题。解决方案的关键在于提出频率增强的双子空间网络(FEDSNet),其核心机制包括:利用离散余弦变换(Discrete Cosine Transform, DCT)与低通滤波从空间特征中显式分离出低频全局结构成分以抑制背景干扰;通过截断奇异值分解(Truncated Singular Value Decomposition, SVD)构建独立的低秩线性子空间分别表示空间纹理和频率结构特征;并设计自适应门控机制动态融合来自双视角的投影距离,从而利用频率子空间的结构稳定性来约束空间子空间,避免其过度拟合背景噪声。

链接: https://arxiv.org/abs/2604.14958
作者: Meijia Wang,Guochao Wang,Haozhen Chu,Bin Yao,Weichuan Zhang,Yuan Wang,Junpo Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Few-shot fine-grained image classification aims to recognize subcategories with high visual similarity using only a limited number of annotated samples. Existing metric learning-based methods typically rely solely on spatial domain features. Confined to this single perspective, models inevitably suffer from inherent texture biases, entangling essential structural details with high-frequency background noise. Furthermore, lacking cross-view geometric constraints, single-view metrics tend to overfit this noise, resulting in structural instability under few-shot conditions. To address these issues, this paper proposes the Frequency-Enhanced Dual-Subspace Network (FEDSNet). Specifically, FEDSNet utilizes the Discrete Cosine Transform (DCT) and a low-pass filtering mechanism to explicitly isolate low-frequency global structural components from spatial features, thereby suppressing background interference. Truncated Singular Value Decomposition (SVD) is employed to construct independent, low-rank linear subspaces for both spatial texture and frequency structural features. An adaptive gating mechanism is designed to dynamically fuse the projection distances from these dual views. This strategy leverages the structural stability of the frequency subspace to prevent the spatial subspace from overfitting to background features. Extensive experiments on four benchmark datasets - CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC-Aircraft - demonstrate that FEDSNet exhibits excellent classification performance and robustness, achieving highly competitive results compared to existing metric learning algorithms. Complexity analysis further confirms that the proposed network achieves a favorable balance between high accuracy and computational efficiency, providing an effective new paradigm for few-shot fine-grained visual recognition.
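
摘要中"利用 DCT 与低通滤波显式分离低频全局结构"这一步,可以用纯 NumPy 做一个最小示意(截断尺寸 keep 等参数为本文假设,并非论文设置):

```python
import numpy as np

def dct_matrix(n):
    """正交 DCT-II 变换矩阵。"""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def lowpass_dct2(feat, keep=4):
    """二维 DCT 低通滤波:仅保留左上角 keep×keep 低频系数,
    分离低频全局结构、抑制高频背景噪声(keep 为假设参数)。"""
    n = feat.shape[0]
    c = dct_matrix(n)
    coeffs = c @ feat @ c.T            # 2D DCT
    mask = np.zeros_like(coeffs)
    mask[:keep, :keep] = 1.0
    return c.T @ (coeffs * mask) @ c   # 逆变换回空间域

x = np.random.default_rng(0).normal(size=(16, 16))  # 虚构的特征图
low = lowpass_dct2(x, keep=4)
print(x.std(), low.std())
```

滤波后得到的低频成分即摘要所说的"全局结构"视角,与空间纹理视角一起进入后续的双子空间投影。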

[CV-35] Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation

【速读】:该论文旨在解决手势识别(Gesture Recognition)领域长期面临的高质量数据稀缺问题,传统方法依赖昂贵的人工录制或无法生成真实手势多样性的图像处理手段,限制了模型性能提升。其解决方案的关键在于利用图像到视频的生成式 AI(Image-to-Video Generative AI)模型,通过自然语言提示(prompt-based)自动生成具有视觉保真度和语义丰富性的合成手势视频,构建指示性手势(deictic gestures)数据集;该方法仅需少量人类参考样本即可实现高效、可扩展的数据生成,且实验证明合成数据能引入有意义的变异性和新颖性,显著提升下游深度学习模型在混合真实与合成数据上的表现,体现了生成式 AI 在手势数据增强中的强大潜力。

链接: https://arxiv.org/abs/2604.14953
作者: Hassan Ali,Doreen Jirak,Luca Müller,Stefan Wermter
机构: University of Hamburg (汉堡大学); University of Antwerp (安特卫普大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at 2026 International Conference on Automatic Face and Gesture Recognition (FG)

点击查看摘要

Abstract:Gesture recognition research, unlike NLP, continues to face acute data scarcity, with progress constrained by the need for costly human recordings or image processing approaches that cannot generate authentic variability in the gestures themselves. Recent advancements in image-to-video foundation models have enabled the generation of photorealistic, semantically rich videos guided by natural language. These capabilities open up new possibilities for creating effort-free synthetic data, raising the critical question of whether video Generative AI models can augment and complement traditional human-generated gesture data. In this paper, we introduce and analyze prompt-based video generation to construct a realistic deictic gestures dataset and rigorously evaluate its effectiveness for downstream tasks. We propose a data generation pipeline that produces deictic gestures from a small number of reference samples collected from human participants, providing an accessible approach that can be leveraged both within and beyond the machine learning community. Our results demonstrate that the synthetic gestures not only align closely with real ones in terms of visual fidelity but also introduce meaningful variability and novelty that enrich the original data, further supported by superior performance of various deep models using a mixed dataset. These findings highlight that image-to-video techniques, even in their early stages, offer a powerful zero-shot approach to gesture synthesis with clear benefits for downstream tasks.

[CV-36] HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps

【速读】:该论文旨在解决当前多模态机器人操作研究中缺乏高质量、大规模且跨人类与机器人手的精细抓取数据集的问题,从而限制了生成式 AI (Generative AI) 在复杂物理交互场景下策略学习与跨域灵巧操作能力的发展。解决方案的关键在于构建 HRDexDB 数据集——一个包含 1.4K 抓取试验(含成功与失败)的大规模多模态数据集,涵盖人类和多种机器人手在 100 种不同物体上的高保真抓取轨迹,并通过专用多相机系统提供高精度时空对齐的 3D 运动标注、高分辨率触觉信号、同步多视角视频及第一人称视角视频流,为多模态策略学习和跨域灵巧操作提供了基准支持。

链接: https://arxiv.org/abs/2604.14944
作者: Jongbin Lim,Taeyun Ha,Mingi Choi,Jisoo Kim,Byungjun Kim,Subin Jeon,Hanbyul Joo
机构: Seoul National University (首尔国立大学); RLWRLD
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present HRDexDB, a large-scale, multi-modal dataset of high-fidelity dexterous grasping sequences featuring both human and diverse robotic hands. Unlike existing datasets, HRDexDB provides a comprehensive collection of grasping trajectories across human hands and multiple robot hand embodiments, spanning 100 diverse objects. Leveraging state-of-the-art vision methods and a new dedicated multi-camera system, our HRDexDB offers high-precision spatiotemporal 3D ground-truth motion for both the agent and the manipulated object. To facilitate the study of physical interaction, HRDexDB includes high-resolution tactile signals, synchronized multi-view video, and egocentric video streams. The dataset comprises 1.4K grasping trials, encompassing both successes and failures, each enriched with visual, kinematic, and tactile modalities. By providing closely aligned captures of human dexterity and robotic execution on the same target objects under comparable grasping motions, HRDexDB serves as a foundational benchmark for multi-modal policy learning and cross-domain dexterous manipulation.

[CV-37] Generative Data Augmentation for Skeleton Action Recognition

【速读】:该论文旨在解决骨架动作识别中高质量3D骨架数据集获取成本高、标注困难的问题,尤其在小样本场景下模型性能受限的挑战。解决方案的关键在于提出一种条件生成式数据增强流水线,通过学习受动作标签约束的真实骨架序列分布,合成多样且高保真度的骨架数据。其核心创新包括基于Transformer的编码器-解码器架构、生成精炼模块(generative refinement module)以及dropout机制,从而在采样过程中有效平衡生成数据的保真度(fidelity)与多样性(diversity),显著提升模型在低数据场景下的泛化能力与识别准确率。

链接: https://arxiv.org/abs/2604.14933
作者: Xu Dong,Wanqing Li,Anthony Adeyemi-Ejeye,Andrew Gilbert
机构: Innovative Media Lab, University of Surrey (萨里大学); Advanced Multimedia Research Lab, University of Wollongong (卧龙岗大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE FG 2026

点击查看摘要

Abstract:Skeleton-based human action recognition is a powerful approach for understanding human behaviour from pose data, but collecting large-scale, diverse, and well-annotated 3D skeleton datasets is both expensive and labor-intensive. To address this challenge, we propose a conditional generative pipeline for data augmentation in skeleton action recognition. Our method learns the distribution of real skeleton sequences under the constraint of action labels, enabling the synthesis of diverse and high-fidelity data. Even with limited training samples, it can effectively generate skeleton sequences and achieve competitive recognition performance in low-data scenarios, demonstrating strong generalisation in downstream tasks. Specifically, we introduce a Transformer-based encoder-decoder architecture, combined with a generative refinement module and a dropout mechanism, to balance fidelity and diversity during sampling. Experiments on HumanAct12 and the refined NTU-RGBD (NTU-VIBE) dataset show that our approach consistently improves the accuracy of multiple skeleton-based action recognition models, validating its effectiveness in both few-shot and full-data settings. The source code can be found here.

[CV-38] Hybrid Latents – Geometry-Appearance-Aware Surfel Splatting

【速读】:该论文旨在解决基于神经辐射场(NeRF)的模型中几何与外观耦合严重的问题,导致高频率纹理可能补偿几何误差,从而影响重建精度和渲染效率。其解决方案的关键在于提出一种混合高斯-哈希网格(hybrid Gaussian-hash-grid)辐射表示方法:在每个高斯(Gaussian)上引入潜在特征(latent features)与哈希网格特征(hash-grid features)相结合,显式地将场景分解为低频与高频成分,从而引导优化器实现更清晰的分离;同时通过硬不透明度衰减(hard opacity falloffs)增强几何与外观的解耦,并结合概率剪枝与稀疏诱导的二元交叉熵(BCE)不透明度损失,有效去除冗余高斯,最终以远少于现有方法的高斯数量实现更高保真度的重建。

链接: https://arxiv.org/abs/2604.14928
作者: Neel Kelkar,Simon Niedermayr,Klaus Engel,Rüdiger Westermann
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 22 pages, 9 figures

点击查看摘要

Abstract:We introduce a hybrid Gaussian-hash-grid radiance representation for reconstructing 2D Gaussian scene models from multi-view images. Similar to NeST splatting, our approach reduces the entanglement between geometry and appearance common in NeRF-based models, but adds per-Gaussian latent features alongside hash-grid features to bias the optimizer toward a separation of low- and high-frequency scene components. This explicit frequency-based decomposition reduces the tendency of high-frequency texture to compensate for geometric errors. Encouraging Gaussians with hard opacity falloffs further strengthens the separation between geometry and appearance, improving both geometry reconstruction and rendering efficiency. Finally, probabilistic pruning combined with a sparsity-inducing BCE opacity loss allows redundant Gaussians to be turned off, yielding a minimal set of Gaussians sufficient to represent the scene. Using both synthetic and real-world datasets, we compare against the state of the art in Gaussian-based novel-view synthesis and demonstrate superior reconstruction fidelity with an order of magnitude fewer primitives.
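
摘要未给出稀疏诱导 BCE 不透明度损失的具体形式。一个常见的可选实现是以不透明度自身为目标的二值熵项:把 α 推向 0 或 1,再配合概率剪枝关闭近零高斯。以下草图即采用该假设形式(非论文原始公式):

```python
import numpy as np

def opacity_bce_loss(opacity, eps=1e-6):
    """示意性稀疏化 BCE 不透明度损失:对每个高斯的 α 计算
    二值熵 -[α·logα + (1-α)·log(1-α)],α=0.5 时最大,
    α 接近 0/1 时趋近 0。具体形式为本文假设。"""
    a = np.clip(np.asarray(opacity, dtype=float), eps, 1 - eps)
    return float(np.mean(-(a * np.log(a) + (1 - a) * np.log(1 - a))))

print(opacity_bce_loss([0.5, 0.5]))    # 最不确定的不透明度,损失最大
print(opacity_bce_loss([0.01, 0.99]))  # 接近二值,损失很小
```

最小化该项会把冗余高斯的不透明度压向 0,从而可被安全剪除,对应摘要中"以极少的高斯数量表示场景"。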

[CV-39] STEP-Parts: Geometric Partitioning of Boundary Representations for Large-Scale CAD Processing

【速读】:该论文旨在解决当前CAD学习流水线中因将边界表示(Boundary Representations, B-Reps)离散化为三角网格而导致的几何结构与拓扑邻接信息丢失问题,从而削弱了实例级分析的一致性。其解决方案的关键在于提出STEP-Parts工具链,该工具链直接从原始STEP格式的B-Reps中提取几何实例分区,并通过保留源面对应关系将这些分区传递至三角化载体,从而生成稳定的实例标签和元数据。该方法在合并相邻B-Rep面时仅当它们具有相同解析曲面类型且满足近切连续性条件时才进行融合,利用ABC数据集中同类型面间二面角的强双峰分布特性,在低角度区域实现对阈值不敏感的零件分割;由于分区基于内在B-Rep拓扑而非特定三角剖分,所得边界在不同网格密度下保持稳定,显著提升了下游任务的鲁棒性和可复现性。

链接: https://arxiv.org/abs/2604.14927
作者: Shen Fan,Mikołaj Kida,Przemyslaw Musialski
机构: New Jersey Institute of Technology (新泽西理工学院); Warsaw University of Technology (华沙理工大学)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Many CAD learning pipelines discretize Boundary Representations (B-Reps) into triangle meshes, discarding analytic surface structure and topological adjacency and thereby weakening consistent instance-level analysis. We present STEP-Parts, a deterministic CAD-to-supervision toolchain that extracts geometric instance partitions directly from raw STEP B-Reps and transfers them to tessellated carriers through retained source-face correspondence, yielding instance labels and metadata for downstream learning and evaluation. The construction merges adjacent B-Rep faces only when they share the same analytic primitive type and satisfy a near-tangent continuity criterion. On ABC, same-primitive dihedral angles are strongly bimodal, yielding a threshold-insensitive low-angle regime for part extraction. Because the partition is defined on intrinsic B-Rep topology rather than on a particular triangulation, the resulting boundaries remain stable under changes in tessellation. Applied to the DeepCAD subset of ABC, the pipeline processes approximately 180,000 models in under six hours on a consumer CPU. We release code and precomputed labels, and show that STEP-Parts serves both as a tessellation-robust geometric reference and as a useful supervision source in two downstream probes: an implicit reconstruction–segmentation network and a dataset-level point-based backbone.
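
摘要中的合并规则——相邻 B-Rep 面仅在解析曲面类型相同、且二面角处于近切连续的低角度区域时才融合——可以用并查集做一个最小示意(角度阈值数值为本文假设,非论文设置):

```python
def merge_faces(face_types, adjacency, angle_thresh_deg=20.0):
    """STEP-Parts 合并规则的示意实现。
    face_types: 每个面的解析曲面类型(如 "plane"、"cylinder")
    adjacency: [(i, j, dihedral_deg), ...] 相邻面及其二面角"""
    parent = list(range(len(face_types)))

    def find(x):  # 带路径压缩的并查集查找
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, angle in adjacency:
        # 同类型 + 低二面角(近切连续)才合并
        if face_types[i] == face_types[j] and angle < angle_thresh_deg:
            parent[find(i)] = find(j)

    return [find(i) for i in range(len(face_types))]

# 面 0/1 同为平面且近切连续 -> 合并;面 2 为圆柱面 -> 独立零件
labels = merge_faces(["plane", "plane", "cylinder"],
                     [(0, 1, 5.0), (1, 2, 5.0)])
print(labels)
```

由于分区定义在 B-Rep 拓扑上,这样得到的实例标签与具体三角剖分无关,这正是摘要强调的 tessellation 稳定性的来源。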

[CV-40] Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes

【速读】:该论文旨在解决当前生成式3D模型在文本驱动逆向生成(text-driven inversion)中面临的核心问题:即当输入文本发生改变时,模型内部表征无法有效响应,导致输出几何结构不变,从而限制了文本引导的3D内容编辑能力。作者发现,这种失效源于模型在潜在空间中存在“sink traps”——即对文本提示变得不敏感的区域,此时即使修改输入文本也无法改变生成结果。解决方案的关键在于解耦模型的几何表示能力与语言敏感性:通过分析采样轨迹,识别并绕过这些“sink traps”,利用模型的无条件生成先验来重建复杂几何形状,从而实现对分布外(out-of-distribution)3D形状的高保真语义操控。

链接: https://arxiv.org/abs/2604.14914
作者: Victoria Yue Chen,Emery Pierson,Léopold Maillard,Maks Ovsjanikov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-driven inversion of generative models is a core paradigm for manipulating 2D or 3D content, unlocking numerous applications such as text-based editing, style transfer, or inverse problems. However, it relies on the assumption that generative models remain sensitive to natural language prompts. We demonstrate that for state-of-the-art native text-to-3D generative models, this assumption often collapses. We identify a critical failure mode where generation trajectories are drawn into latent ``sink traps’': regions where the model becomes insensitive to prompt modifications. In these regimes, changes to the input text fail to alter internal representations in a way that alters the output geometry. Crucially, we observe that this is not a limitation of the model’s \textitgeometric expressivity; the same generative models possess the ability to produce a vast diversity of shapes but, as we demonstrate, become insensitive to out-of-distribution \textittext guidance. We investigate this behavior by analyzing the sampling trajectories of the generative model, and find that complex geometries can still be represented and produced by leveraging the model’s unconditional generative prior. This leads to a more robust framework for text-based 3D shape editing that bypasses latent sinks by decoupling a model’s geometric representation power from its linguistic sensitivity. Our approach addresses the limitations of current 3D pipelines and enables high-fidelity semantic manipulation of out-of-distribution 3D shapes. Project webpage: this https URL

[CV-41] Reward-Aware Trajectory Shaping for Few-step Visual Generation

【速读】:该论文旨在解决生成式模型在极少数采样步骤下实现高保真度生成的问题,现有方法依赖于蒸馏框架将多步去噪过程压缩为少步生成器,但这类方法受限于学生模型必须模仿更强的多步教师模型,从而将教师性能作为学生上限。其解决方案的关键在于引入偏好对齐意识(preference alignment awareness),使学生模型能够优化以获得奖励偏好更高的生成质量,从而可能超越教师模型而非受限于机械模仿。为此,作者提出轻量级框架RATS(Reward-Aware Trajectory Shaping),通过时间 horizon 匹配对齐师生潜在轨迹,并设计奖励感知门控机制(reward-aware gate),根据师生相对奖励表现自适应调节教师引导强度:当教师奖励更高时强化轨迹塑造,当学生匹配或超越教师时则放松约束,从而持续驱动奖励导向的改进。该方法无需额外推理开销即可有效迁移偏好相关知识,显著提升少步视觉生成的效率-质量权衡。

链接: https://arxiv.org/abs/2604.14910
作者: Rui Li,Bingyu Li,Yuanzhi Liang,HuangHai Bin,Chi Zhang,XueLong Li
机构: University of Science and Technology of China (中国科学技术大学); TeleAI (TeleAI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Achieving high-fidelity generation in extremely few sampling steps has long been a central goal of generative modeling. Existing approaches largely rely on distillation-based frameworks to compress the original multi-step denoising process into a few-step generator. However, such methods inherently constrain the student to imitate a stronger multi-step teacher, imposing the teacher as an upper bound on student performance. We argue that introducing \textbfpreference alignment awareness enables the student to optimize toward reward-preferred generation quality, potentially surpassing the teacher instead of being restricted to rigid teacher imitation. To this end, we propose \textbfReward-Aware Trajectory Shaping (RATS), a lightweight framework for preference-aligned few-step generation. Specifically, teacher and student latent trajectories are aligned at key denoising stages through horizon matching, while a \textbfreward-aware gate is introduced to adaptively regulate teacher guidance based on their relative reward performance. Trajectory shaping is strengthened when the teacher achieves higher rewards, and relaxed when the student matches or surpasses the teacher, thereby enabling continued reward-driven improvement. By seamlessly integrating trajectory distillation, reward-aware gating, and preference alignment, RATS effectively transfers preference-relevant knowledge from high-step generators without incurring additional test-time computational overhead. Experimental results demonstrate that RATS substantially improves the efficiency–quality trade-off in few-step visual generation, significantly narrowing the gap between few-step students and stronger multi-step generators.
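
摘要描述的奖励感知门控——教师奖励更高时加强轨迹塑造、学生追平或反超时放松约束——可用如下草图示意(sigmoid 门控形式与温度参数均为本文假设,并非论文公式):

```python
import numpy as np

def rats_gate(reward_teacher, reward_student, temperature=1.0):
    """示意性奖励感知门控:教师相对奖励越高,门控值越大,
    轨迹塑造(蒸馏)约束越强;反之约束被放松。"""
    delta = (reward_teacher - reward_student) / temperature
    return float(1.0 / (1.0 + np.exp(-delta)))

def shaping_loss(distill_loss, reward_teacher, reward_student):
    """门控加权后的轨迹塑造损失(示意)。"""
    return rats_gate(reward_teacher, reward_student) * distill_loss

print(shaping_loss(1.0, reward_teacher=0.9, reward_student=0.3))  # 教师更强
print(shaping_loss(1.0, reward_teacher=0.3, reward_student=0.9))  # 学生反超
```

当门控随学生奖励上升而衰减时,学生不再被强制模仿教师,从而有机会超越教师的奖励上限。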

[CV-42] FSDETR: Frequency-Spatial Feature Enhancement for Small Object Detection IJCNN2026

【速读】:该论文旨在解决小目标检测中因下采样导致的特征退化、密集场景中的相互遮挡以及复杂背景干扰等关键问题。其解决方案的核心在于构建一个频域-空间域协同增强框架FSDETR,通过三个关键技术模块实现:1)空间层次注意力块(Spatial Hierarchical Attention Block, SHAB)同时捕获局部细节与全局依赖关系,强化语义表征;2)基于可变形注意力的同尺度特征交互机制(Deformable Attention-based Intra-scale Feature Interaction, DA-AIFI),通过动态采样聚焦于信息丰富的区域以缓解遮挡问题;3)频域-空间域特征金字塔网络(Frequency-Spatial Feature Pyramid Network, FSFPN),利用跨域频域-空间块(Cross-domain Frequency-Spatial Block, CFSB)融合频率滤波与空间边缘提取,有效保留细粒度特征。该方法在仅14.7M参数条件下,在VisDrone 2019和TinyPerson数据集上分别取得13.9% APS和48.95% AP50 tiny的优异性能。

链接: https://arxiv.org/abs/2604.14884
作者: Jianchao Huang,Fengming Zhang,Haibo Zhu,Tao Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 6 figures,accepted to IJCNN 2026

点击查看摘要

Abstract:Small object detection remains a significant challenge due to feature degradation from downsampling, mutual occlusion in dense clusters, and complex background interference. To address these issues, this paper proposes FSDETR, a frequency-spatial feature enhancement framework built upon the RT-DETR baseline. By establishing a collaborative modeling mechanism, the method effectively leverages complementary structural information. Specifically, a Spatial Hierarchical Attention Block (SHAB) captures both local details and global dependencies to strengthen semantic representation. Furthermore, to mitigate occlusion in dense scenes, the Deformable Attention-based Intra-scale Feature Interaction (DA-AIFI) focuses on informative regions via dynamic sampling. Finally, the Frequency-Spatial Feature Pyramid Network (FSFPN) integrates frequency filtering with spatial edge extraction via the Cross-domain Frequency-Spatial Block (CFSB) to preserve fine-grained details. Experimental results show that with only 14.7M parameters, FSDETR achieves 13.9% APS on VisDrone 2019 and 48.95% AP50 tiny on TinyPerson, showing strong performance on small-object benchmarks. The code and models are available at this https URL.

[CV-43] Open-Set Vein Biometric Recognition with Deep Metric Learning CCS2026

【速读】:该论文旨在解决现有静脉识别方法普遍依赖闭集分类(closed-set classification)所带来的可扩展性不足问题,即无法在不重新训练模型的情况下自适应地注册新用户。其核心挑战在于如何在开放集(open-set)条件下实现高精度的身份验证与未知 impostor 的有效拒识。解决方案的关键在于:首先通过深度度量学习(Deep Metric Learning, DML)提取判别性强的 L2 归一化嵌入表示;其次采用基于原型的匹配机制并结合校准后的相似度阈值,以区分已注册用户与未见过的伪造者;此外,在多个跨数据集场景下验证了该框架对不同采集设备和域偏移(domain shift)的泛化能力,表明在大规模数据下性能稳定,但在低数据场景中仍受域偏移影响。实验结果显示,使用三元组损失(triplet-based objective)与简单 1-NN 分类器组合可在准确率与效率之间取得最优平衡,支持在通用硬件上的实时部署。

链接: https://arxiv.org/abs/2604.14874
作者: Paweł Pilarek,Marcel Musiałek,Anna Górska
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This preprint has not undergone peer review (when applicable) or any post-submission improvements or corrections. The Version of Record of this contribution is published in International Conference on Computational Science (ICCS 2026), and is available online at this https URL [pending]

点击查看摘要

Abstract:Most state-of-the-art vein recognition methods rely on closed-set classification, which inherently limits their scalability and prevents the adaptive enrollment of new users without complete model retraining. We rigorously evaluate the computational boundaries of Deep Metric Learning (DML) under strict open-set constraints. Unlike standard closed-set approaches, we analyze the impact of data scarcity and domain shift on recognition performance. Our approach learns discriminative L2-normalised embeddings and employs prototype-based matching with a calibrated similarity threshold to effectively distinguish between enrolled users and unseen impostors. We evaluate the framework under a strict subject-disjoint protocol across four diverse datasets covering finger, wrist, and dorsal hand veins (MMCBNU 6000, UTFVP, FYO, and a dorsal hand-vein dataset). On the large-scale MMCBNU 6000 benchmark, our best model (ResNet50-CBAM) achieves an OSCR of 0.9945, AUROC of 0.9974, and EER of 1.57%, maintaining high identification accuracy (99.6% Rank-1) while robustly rejecting unknown subjects. Cross-dataset experiments evaluate the framework’s generalisation across different acquisition setups, confirming that while the model handles large-scale data robustly, performance remains sensitive to domain shifts in low-data regimes. Ablation studies demonstrate that triplet-based objectives combined with a simple 1-NN classifier offer an optimal trade-off between accuracy and efficiency, enabling real-time deployment on commodity hardware.
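
摘要中的开放集识别流程——L2 归一化嵌入、基于原型的 1-NN 余弦匹配、低于校准阈值则拒识——可以用如下最小草图示意(嵌入与阈值数值均为虚构示例,实际阈值需在验证集上校准):

```python
import numpy as np

def enroll(prototypes, user_id, embedding):
    """注册:存储 L2 归一化后的用户原型(无需重训模型)。"""
    v = np.asarray(embedding, dtype=float)
    prototypes[user_id] = v / np.linalg.norm(v)

def identify(prototypes, embedding, threshold=0.7):
    """开放集识别:1-NN 余弦匹配;最高相似度低于阈值则
    拒识为 impostor,返回 (None, 相似度)。阈值为示例数值。"""
    q = np.asarray(embedding, dtype=float)
    q = q / np.linalg.norm(q)
    best_id, best_sim = None, -1.0
    for uid, proto in prototypes.items():
        sim = float(q @ proto)
        if sim > best_sim:
            best_id, best_sim = uid, sim
    return (best_id, best_sim) if best_sim >= threshold else (None, best_sim)

protos = {}
enroll(protos, "user_a", [1.0, 0.0])   # 虚构的二维嵌入
enroll(protos, "user_b", [0.0, 1.0])
print(identify(protos, [0.9, 0.1]))    # 接近 user_a 的查询
print(identify(protos, [-1.0, -1.0]))  # 与所有原型都远的 impostor
```

新用户只需调用 enroll 即可加入系统,这正是摘要强调的相对闭集分类的可扩展性优势。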

[CV-44] MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

【速读】:该论文旨在解决生成式 AI (Generative AI) 在牙科影像分析中应用受限的问题,特别是由于缺乏细粒度标注数据集和全面基准评测体系所致。其核心解决方案在于构建了一个名为 MetaDent 的综合性资源,关键创新点包括:(1)从临床、公开及网络来源收集并整理了 60,669 张口腔图像构成大规模牙科图像数据集;(2)设计了一种半结构化标注框架,结合高阶图像摘要与逐点自由文本描述异常,实现对牙科摄影中层级化和临床细微差异的精准捕捉;(3)利用大语言模型(Large Language Models, LLMs)自动生成约 15,000 对视觉问答(Visual Question Answering, VQA)对和一个 18 类多标签分类数据集,并通过人工审核与误差分析验证其语义准确性。该方案为牙科场景下的视觉-语言模型(Vision-Language Models, VLMs)提供了可复现、任务无关且语义丰富的基准评估体系,推动了 VLM 在口腔医学图像理解中的发展。

链接: https://arxiv.org/abs/2604.14866
作者: Meng-Xun Li,Wen-Hui Deng,Zhi-Xing Wu,Chun-Xiao Jin,Jia-Min Wu,Yue Han,James Kit Hon Tsoi,Gui-Song Xia,Cui Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project website: this https URL

点击查看摘要

Abstract:Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated with human review and error analysis to justify that the LLM-driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.

[CV-45] Efficient Search of Implantable Adaptive Cells for Medical Image Segmentation

【速读】:该论文旨在解决自适应跳接模块(Adaptive Skip Modules)在医学图像分割中搜索过程计算成本高昂的问题。传统方法如植入式自适应细胞(Implantable Adaptive Cells, IACs)虽通过限制搜索空间提升了效率,但仍需对每个骨干网络和数据集进行长达200个epoch的可微分搜索,难以满足实际应用中的计算约束。解决方案的关键在于发现并利用IAC单元内操作(operation)和边(edge)在训练早期即趋于稳定的现象:研究发现,最终离散化选择的操作通常在训练初期即成为强候选者,且其架构参数在接近尾声前已基本收敛。基于此,作者提出一种基于Jensen–Shannon散度的稳定性判据,用于追踪每条边上的操作重要性分布,并在搜索过程中逐步剪枝低重要性操作,从而构建出加速框架IAC-LTH。该方法在多个公开医学图像分割基准(如ACDC、BraTS、KiTS、AMOS)上验证有效,能在保持与完整搜索相当甚至略优的患者级分割性能的同时,将NAS耗时降低3.7倍至16倍。

链接: https://arxiv.org/abs/2604.14849
作者: Emil Benedykciuk,Marcin Denkowski,Grzegorz M. Wójcik
机构: UMCS(卢布林玛丽居里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, 7 figures

点击查看摘要

Abstract:Purpose: Adaptive skip modules can improve medical image segmentation, but searching for them is computationally costly. Implantable Adaptive Cells (IACs) are compact NAS modules inserted into U-Net skip connections, reducing the search space compared with full-network NAS. However, the original IAC framework still requires a 200-epoch differentiable search for each backbone and dataset. Methods: We analyzed the temporal behavior of operations and edges within IAC cells during differentiable search on public medical image segmentation benchmarks. We found that operations selected in the final discrete cell typically emerge among the strongest candidates early in training, and their architecture parameters stabilize well before the final epoch. Based on this, we propose a Jensen–Shannon-divergence-based stability criterion that tracks per-edge operation-importance distributions and progressively prunes low-importance operations during search. The accelerated framework is called IAC-LTH. Results: Across four public benchmarks (ACDC, BraTS, KiTS, AMOS), several 2-D U-Net backbones, and a 2-D nnU-Net pipeline, IAC-LTH discovers IAC cells whose patient-level segmentation performance matches and sometimes slightly exceeds that of cells found by the original full-length search, while reducing wall-clock NAS cost by 3.7x to 16x across datasets and backbones. These results are consistent across architectures, benchmarks, and both non-augmented and augmented training settings, while preserving the gains of IAC-equipped U-Nets over strong attention-based and dense-skip baselines. Conclusion: Competitive IAC architectures can be identified from early-stabilizing operations without running the full search, making adaptive skip-module design more practical for medical image segmentation under realistic computational constraints.
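摘要中提到的 Jensen–Shannon 散度稳定性判据,可以用下面的极简 NumPy 草图来理解(稳定阈值 1e-3 与剪枝阈值 0.05 均为示意性假设,非论文给定值):对同一条边上候选操作的架构参数做 softmax 得到操作重要性分布,比较相邻 epoch 的分布是否已经稳定,并据此提前剪掉重要性持续偏低的操作。

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    # Jensen–Shannon 散度:p、q 与其均值分布 m 的 KL 散度的平均
    p = p / p.sum(); q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# 某条边上 5 个候选操作的架构参数,在相邻两个 epoch 的取值(玩具数据)
alpha_prev = np.array([0.1, 2.0, -1.0, 0.3, -0.5])
alpha_curr = np.array([0.12, 2.1, -1.1, 0.28, -0.52])

jsd = js_divergence(softmax(alpha_prev), softmax(alpha_curr))
stable = jsd < 1e-3                        # 分布稳定 → 可提前做离散化/剪枝决策
prune_mask = softmax(alpha_curr) < 0.05    # 剪掉重要性持续偏低的操作
```

该草图只演示单条边上的判据;论文在整个搜索过程中对每条边逐步剪枝,从而将搜索时间压缩 3.7 至 16 倍。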

[CV-46] Zero-Shot Retail Theft Detection via Orchestrated Vision Models: A Model-Agnostic Cost-Effective Alternative to Trained Single-Model Systems

【速读】:该论文旨在解决零售盗窃检测中现有AI系统依赖昂贵定制模型训练、高订阅费用(每月200–500美元/店)以及难以部署的问题。其核心解决方案是提出Paza框架,采用零样本(zero-shot)策略实现无需任何模型训练的偷窃行为检测:通过分层流水线架构,持续运行低成本的目标检测与姿态估计模型作为前置过滤器,仅在行为信号触发时调用昂贵的视觉语言模型(Vision-Language Model, VLM),从而将VLM调用频率降低至每分钟不超过10次(相比逐帧分析减少240倍),使单张GPU可服务10–20家门店;同时设计了多信号可疑性预过滤机制(需满足停留时间+至少一个行为信号),并支持任意OpenAI兼容接口的VLM组件热插拔,确保系统随VLM技术演进而持续优化。该方案兼具成本效益(每店月成本50–100美元)与隐私保护能力,具备实际部署可行性。

链接: https://arxiv.org/abs/2604.14846
作者: Haileab Yagersew
机构: Paza AI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 3 figures, Code to be released at this https URL

点击查看摘要

Abstract:Retail theft costs the global economy over $100 billion annually, yet existing AI-based detection systems require expensive custom model training on proprietary datasets and charge $200-500/month per store. We present Paza, a zero-shot retail theft detection framework that achieves practical concealment detection without training any model. Our approach orchestrates multiple existing models in a layered pipeline - cheap object detection and pose estimation running continuously, with an expensive vision-language model (VLM) invoked only when behavioral pre-filters trigger. A multi-signal suspicion pre-filter (requiring dwell time plus at least one behavioral signal) reduces VLM invocations by 240x compared to per-frame analysis, bounding calls to ≤10/minute and enabling a single GPU to serve 10-20 stores. The architecture is model-agnostic: the VLM component accepts any OpenAI-compatible endpoint, enabling operators to swap between models such as Gemma 4, Qwen3.5-Omni, GPT-4o, or future releases without code changes - ensuring the system improves as the VLM landscape evolves. We evaluate the VLM component on the DCSASS synthesized shoplifting dataset (169 clips, controlled environment), achieving 89.5% precision and 92.8% specificity at 59.3% recall zero-shot - where the recall gap is attributable to sparse frame sampling in offline evaluation rather than VLM reasoning failures, as precision and specificity are the operationally critical metrics determining false alarm rates. We present a detailed cost model showing viability at $50-100/month per store (3-10x cheaper than commercial alternatives), and introduce a privacy-preserving design that obfuscates faces in the detection pipeline. The source code is available at this https URL.
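摘要描述的多信号预过滤(驻留时间 + 至少一个行为信号)与 ≤10 次/分钟的 VLM 限流,可以用如下纯 Python 草图示意(5 秒驻留阈值与各信号名称均为假设,非论文实现):

```python
from collections import deque

MAX_CALLS_PER_MIN = 10   # 与摘要中 ≤10 次/分钟 的上限一致

class SuspicionGate:
    """极简示意:驻留时间 + 至少一个行为信号才触发,并对 VLM 调用做滑动窗口限流。"""
    def __init__(self):
        self.call_times = deque()

    def should_invoke_vlm(self, now, dwell_s, signals):
        # 行为预过滤:驻留不足或没有任何行为信号时不调用 VLM(阈值 5 秒为假设值)
        if dwell_s < 5.0 or not any(signals.values()):
            return False
        # 滑动窗口限流:最近 60 秒内的调用数不得超过上限
        while self.call_times and now - self.call_times[0] > 60.0:
            self.call_times.popleft()
        if len(self.call_times) >= MAX_CALLS_PER_MIN:
            return False
        self.call_times.append(now)
        return True

gate = SuspicionGate()
r1 = gate.should_invoke_vlm(0.0, dwell_s=8.0, signals={"hand_near_pocket": True})
r2 = gate.should_invoke_vlm(1.0, dwell_s=2.0, signals={"hand_near_pocket": True})  # 驻留不足
rs = [gate.should_invoke_vlm(2.0 + i, dwell_s=9.0, signals={"crouch": True}) for i in range(15)]
```

这正是"廉价模型常开、昂贵 VLM 按需触发"的分层流水线思路:绝大多数帧在前两层就被过滤掉,VLM 调用被硬性限频。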

[CV-47] Improved Multiscale Structural Mapping with Supervertex Vision Transformer for the Detection of Alzheimers Disease Neurodegeneration

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer's disease, AD)早期无创筛查中生物标志物敏感性和特异性不足的问题。现有方法依赖于正电子发射断层扫描(PET)或脑脊液(CSF)分析,存在成本高和侵入性强的局限。为此,作者提出MSSM+框架,其核心创新在于融合了皮质厚度(cortical thickness, CT)、灰质-白质对比度(gray-white matter contrasts, GWCs)以及基于顶点级别的沟深(sulcal depth)和皮质曲率(cortical curvature),并通过表面超顶点映射(surface supervertex mapping, SSVM)将皮质表面划分为具有空间结构意义的超顶点(supervertices),进而利用超顶点视觉Transformer(Supervertex Vision Transformer, SV-ViT)对这些几何结构进行解剖学感知的学习。相比原有MSSM方法,MSSM+在AD与认知正常对照组之间识别出更广泛且统计显著的差异,并在分类任务中将精确率-召回率曲线下面积(AUC-PR)提升了3个百分点,同时在不同磁共振成像(MRI)设备厂商的数据中表现出更低的信号变异性与更稳定的判别能力,表明其作为MRI驱动的AD早期检测标记具有高度潜力。

链接: https://arxiv.org/abs/2604.14837
作者: Geonwoo Baek,David H. Salat,Ikbeom Jang
机构: Hankuk University of Foreign Studies (韩国外国语大学); Massachusetts General Hospital (马萨诸塞州总医院); Harvard Medical School (哈佛医学院); VA Boston Healthcare System (波士顿退伍军人健康护理系统)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to Human Brain Mapping

点击查看摘要

Abstract:Alzheimer’s disease (AD) confirmation often relies on positron emission tomography (PET) or cerebrospinal fluid (CSF) analysis, which are costly and invasive. Consequently, structural MRI biomarkers such as cortical thickness (CT) are widely used for non-invasive AD screening. Multiscale structural mapping (MSSM) was recently proposed to integrate gray-white matter contrasts (GWCs) with CT from a single T1-weighted MRI (T1w) scan. Building on this framework, we propose MSSM+, together with surface supervertex mapping (SSVM) and a Supervertex Vision Transformer (SV-ViT). 3D T1w images from individuals with AD and cognitively normal (CN) controls were analyzed. MSSM+ extends MSSM by incorporating sulcal depth and cortical curvature at the vertex level. SSVM partitions the cortical surface into supervertices (surface patches) that effectively represent inter- and intra-regional spatial relationships. SV-ViT is a Vision Transformer architecture operating on these supervertices, enabling anatomically informed learning from surface mesh representations. Compared with MSSM, MSSM+ identified more spatially extensive and statistically significant group differences between AD and CN. In AD vs. CN classification, MSSM+ achieved a 3%p higher area under the precision-recall curve than MSSM. Vendor-specific analyses further demonstrated reduced signal variability and consistently improved classification performance across MR manufacturers relative to CT, GWCs, and MSSM. These findings suggest that MSSM+ combined with SV-ViT is a promising MRI-based imaging marker for AD detection prior to CSF/PET confirmation.

[CV-48] From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation

【速读】:该论文旨在解决岩石薄片图像中粒边分割(Grain-edge segmentation, GES)与岩性语义分割(Lithology semantic segmentation, LSS)任务因独立处理而导致的分割质量不佳问题,尤其针对由消光依赖性颜色变化和超细粒边界引起的严重域偏移(domain gap)以及多角度偏振图像堆栈缺乏联合学习模块的挑战。解决方案的关键在于提出Petro-SAM框架——一个两阶段、多任务学习模型:首先引入Merge Block融合七种偏振视角以缓解消光效应;其次通过多尺度特征融合与颜色熵先验(color-entropy priors)提升边界对齐精度与分割一致性,从而实现高质量的GES与LSS联合建模。

链接: https://arxiv.org/abs/2604.14805
作者: Yili Ren,Shiqi Wen,Li Hou,Dingwen Xiao,Weiming Zhang,Caleb Chen Cao,Lin Wang,Zilu Zheng,Qianxiao Su,Mingjun Zhao,Lei Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Grain-edge segmentation (GES) and lithology semantic segmentation (LSS) are two pivotal tasks for quantifying rock fabric and composition. However, these two tasks are often treated separately, and segmentation quality remains unsatisfactory even though expensive, time-consuming, expert-annotated datasets have been used. Recently, foundation models, especially the Segment Anything Model (SAM), have demonstrated impressive robustness for boundary alignment. However, directly adapting SAM to joint GES and LSS is nontrivial due to 1) severe domain gap induced by extinction-dependent color variations and ultra-fine grain boundaries, and 2) lacking novel modules for joint learning on multi-angle petrographic image stacks. In this paper, we propose Petro-SAM, a novel two-stage, multi-task framework that can achieve high-quality joint GES and LSS on petrographic images. Specifically, based on SAM, we introduce a Merge Block to integrate seven polarized views, effectively solving the extinction issue. Moreover, we introduce multi-scale feature fusion and color-entropy priors to refine the detection.

[CV-49] One-shot Compositional 3D Head Avatars with Deformable Hair

【速读】:该论文旨在解决单张图像驱动的3D头像重建中头发动态不真实的问题,其核心挑战在于现有方法未能有效解耦头发与面部区域,导致几何纠缠和非自然变形。解决方案的关键在于显式分离头发与面部组件:首先通过图像去毛发处理获得无发图像,并分别构建基于3D高斯泼溅(3D Gaussian Splatting, 3DGS)的密集三维表示;对无发部分采用非刚性配准绑定到FLAME人脸网格以实现自然形变,而对头发部分则利用语义标签监督与边界感知重分配策略提取纯净的头发高斯点云,并引入基于位置的动力学(Position-Based Dynamics, PBD)模拟的笼结构来控制头发在头部运动、重力和惯性作用下的物理合理变形,从而显著提升动画中的头发真实感并保持面部细节 fidelity。

链接: https://arxiv.org/abs/2604.14782
作者: Yuan Sun,Xuan Wang,WeiLi Zhang,Wenxuan Zhang,Yu Guo,Fei Wang
机构: Xi’an Jiaotong University(西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL

点击查看摘要

Abstract:We propose a compositional method for constructing a complete 3D head avatar from a single image. Prior one-shot holistic approaches frequently fail to produce realistic hair dynamics during animation, largely due to inadequate decoupling of hair from the facial region, resulting in entangled geometry and unnatural deformations. Our method explicitly decouples hair from the face, modeling these components using distinct deformation paradigms while integrating them into a unified rendering pipeline. Furthermore, by leveraging image-to-3D lifting techniques, we preserve fine-grained textures from the input image to the greatest extent possible, effectively mitigating the common issue of high-frequency information loss in generalized models. Specifically, given a frontal portrait image, we first perform hair removal to obtain a bald image. Both the original image and the bald image are then lifted to dense, detail-rich 3D Gaussian Splatting (3DGS) representations. For the bald 3DGS, we rig it to a FLAME mesh via non-rigid registration with a prior model, enabling natural deformation that follows the mesh triangles during animation. For the hair component, we employ semantic label supervision combined with a boundary-aware reassignment strategy to extract a clean and isolated set of hair Gaussians. To control hair deformation, we introduce a cage structure that supports Position-Based Dynamics (PBD) simulation, allowing realistic and physically plausible transformations of the hair Gaussian primitives under head motion, gravity, and inertial effects. Striking qualitative results, including dynamic animations under diverse head motions, gravity effects, and expressions, showcase substantially more realistic hair behavior alongside faithfully preserved facial details, outperforming state-of-the-art one-shot methods in perceptual realism.
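摘要中用于头发形变的 PBD(Position-Based Dynamics)模拟,其核心是逐约束的位置投影。以下给出一个最小的距离约束投影草图(与论文的笼结构无关,仅说明 PBD 的基本原理):按逆质量权重把两个点沿连线方向拉回到静止长度。

```python
import numpy as np

def pbd_distance_constraint(p1, p2, rest_len, w1=1.0, w2=1.0):
    """PBD 距离约束投影:按逆质量权重 w1、w2 把两点修正到静止长度 rest_len。"""
    d = p2 - p1
    dist = np.linalg.norm(d)
    if dist < 1e-9:
        return p1, p2
    c = dist - rest_len            # 约束残差(当前长度与静止长度之差)
    n = d / dist                   # 连线方向单位向量
    p1 = p1 + (w1 / (w1 + w2)) * c * n
    p2 = p2 - (w2 / (w1 + w2)) * c * n
    return p1, p2

a = np.array([0.0, 0.0, 0.0])
b = np.array([2.0, 0.0, 0.0])
a2, b2 = pbd_distance_constraint(a, b, rest_len=1.0)   # 被拉伸到 2 的"发段"恢复到长度 1
new_len = np.linalg.norm(b2 - a2)
```

实际的头发模拟会在每个时间步对笼结构上的大量此类约束(以及重力、惯性等外力)迭代求解,这里只演示单个约束的投影。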

[CV-50] Integrating Object Detection LiDAR-Enhanced Depth Estimation and Segmentation Models for Railway Environments

【速读】:该论文旨在解决铁路环境中障碍物检测与距离估计的联合问题,现有研究多局限于单一目标检测或轨道识别,缺乏能够同时实现障碍物检测、轨道定位及精确距离估算的模块化、可扩展系统,且评估受限于真实场景中缺乏高质量标注数据。解决方案的关键在于提出一个集成式框架,融合三个神经网络模块:用于物体检测的模型、用于轨道分割的模型以及基于单目深度估计的模块,并通过引入LiDAR点云信息对深度图进行校准,从而提升距离估计精度;同时利用合成数据集SynDRA提供准确的地面真值标注,支持可靠定量评估,最终实现了平均绝对误差(MAE)低至0.63米的距离估计性能,显著增强了铁路场景下的空间感知能力。

链接: https://arxiv.org/abs/2604.14781
作者: Enrico Francesco Giannico,Federico Nesti,Gianluca D’Amico,Mauro Marinoni,Edoardo Carosio,Filippo Salotti,Salvatore Sabina,Giorgio Buttazzo
机构: University of Pavia (帕维亚大学); Progress Rail Signaling S.p.A. (Progress Rail Signaling S.p.A.)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under submission for publication

点击查看摘要

Abstract:Obstacle detection in railway environments is crucial for ensuring safety. However, very few studies address the problem using a complete, modular, and flexible system that can both detect objects in the scene and estimate their distance from the vehicle. Most works focus solely on detection, others attempt to identify the track, and only a few estimate obstacle distances. Additionally, evaluating these systems is challenging due to the lack of ground truth data. In this paper, we propose a modular and flexible framework that identifies the rail track, detects potential obstacles, and estimates their distance by integrating three neural networks for object detection, track segmentation, and monocular depth estimation with LiDAR point clouds. To enable a reliable and quantitative evaluation, the proposed framework is assessed using a synthetic dataset (SynDRA), which provides accurate ground truth annotations, allowing for direct performance comparison with existing methods. The proposed system achieves a mean absolute error (MAE) as low as 0.63 meters by integrating monocular depth maps with LiDAR, enabling not only accurate distance estimates but also spatial perception of the scene.
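将单目深度图与 LiDAR 融合的一种常见做法,是在稀疏的 LiDAR 投影点上用最小二乘拟合尺度与偏移,把相对深度对齐到米制深度;下面给出该思路的 NumPy 草图(论文的具体校准方式未必如此,此处仅为示意,示例数值为虚构):

```python
import numpy as np

def align_mono_to_lidar(mono_depth, lidar_depth):
    """最小二乘拟合尺度 s 与偏移 t,使 s*mono + t ≈ LiDAR(仅在有效 LiDAR 点上拟合)。"""
    mask = lidar_depth > 0                      # LiDAR 点云投影到像平面后通常稀疏,0 表示无返回
    A = np.stack([mono_depth[mask], np.ones(mask.sum())], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, lidar_depth[mask], rcond=None)
    return s * mono_depth + t                   # 对整幅深度图应用同一组 (s, t)

mono = np.array([[1.0, 2.0], [3.0, 4.0]])       # 单目相对深度(无绝对尺度)
lidar = np.array([[2.0, 4.0], [0.0, 8.0]])      # 米制深度,其中一个像素无 LiDAR 返回
aligned = align_mono_to_lidar(mono, lidar)
```

对齐后,即使没有 LiDAR 返回的像素(例如上例左下角)也能得到米制深度估计,这正是"单目深度稠密、LiDAR 提供尺度"的互补思路。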

[CV-51] OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism CVPR2026

【速读】:该论文旨在解决广义类别发现(Generalized Category Discovery, GCD)中现有方法依赖单一模态且需针对特定数据集微调的问题,目标是实现跨模态的、无需微调的类别发现能力,从而更贴近人类抽象分类的学习机制。其解决方案的关键在于提出了一种模态无关的GCD框架OmniGCD:首先利用模态特异性编码器(如视觉、音频、文本、遥感等)处理多模态输入,并通过降维构建统一的GCD潜在空间(GCD latent space);随后在测试阶段使用一个基于合成数据训练的新型Transformer模型对潜在表示进行转换,使其更适合聚类分析。该方法在16个涵盖四种模态的数据集上实现了零样本GCD,四种模态(视觉、文本、音频、遥感)的平均准确率分别提升+6.2、+17.9、+1.5和+12.7个百分点,验证了强编码器与类别发现解耦的重要性,为未来跨模态、可扩展的人类启发式类别发现提供了基准和路径。

链接: https://arxiv.org/abs/2604.14762
作者: Jordan Shipard,Arnold Wiliem,Kien Nguyen Thanh,Wei Xiang,Clinton Fookes
机构: SAIVT, QUT, Australia; Shield AI, Australia; La Trobe University, Australia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 Findings

点击查看摘要

Abstract:Generalized Category Discovery (GCD) challenges methods to identify known and novel classes using partially labeled data, mirroring human category learning. Unlike prior GCD methods, which operate within a single modality and require dataset-specific fine-tuning, we propose a modality-agnostic GCD approach inspired by the human brain's abstract category formation. Our **OmniGCD** leverages modality-specific encoders (e.g., vision, audio, text, remote sensing) to process inputs, followed by dimension reduction to construct a **GCD latent space**, which is transformed at test-time into a representation better suited for clustering using a novel synthetically trained Transformer-based model. To evaluate OmniGCD, we introduce a **zero-shot GCD** setting where no dataset-specific fine-tuning is allowed, enabling modality-agnostic category discovery. **Trained once on synthetic data**, OmniGCD performs zero-shot GCD across 16 datasets spanning four modalities, improving classification accuracy for known and novel classes over baselines (average percentage point improvement of **+6.2**, **+17.9**, **+1.5** and **+12.7** for vision, text, audio and remote sensing). This highlights the importance of strong encoders while decoupling representation learning from category discovery. Improving modality-agnostic methods will propagate across modalities, enabling encoder development independent of GCD. Our work serves as a benchmark for future modality-agnostic GCD works, paving the way for scalable, human-inspired category discovery. All code is available at this https URL.

[CV-52] ASGNet: Adaptive Spectrum Guidance Network for Automatic Polyp Segmentation

【速读】:该论文旨在解决结肠镜图像中息肉(polyp)分割的挑战性问题,特别是由于息肉形态多样、背景复杂且常被隐藏,导致现有基于深度学习的方法在空间感知上存在局限,难以准确捕捉完整息肉结构。其解决方案的关键在于提出一种自适应频谱引导网络(ASGNet),通过引入频谱特征与全局属性融合机制,增强模型对息肉结构的判别能力并细化边界;具体包括:设计频谱引导的非局部感知模块以联合聚合局部与全局信息,构建多源语义提取器以提供高层语义辅助初步定位,并采用密集跨层交互解码器有效整合不同层级的多样化信息,从而生成高质量表示以实现精准分割。

链接: https://arxiv.org/abs/2604.14755
作者: Yanguang Sun,Hengmin Zhang,Jianjun Qian,Jian Yang,Lei Luo
机构: Nanjing University of Science and Technology (南京理工大学); East China University of Science and Technology (华东理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at TCSVT 2026

点击查看摘要

Abstract:Early identification and removal of polyps can reduce the risk of developing colorectal cancer. However, the diverse morphologies, complex backgrounds and often concealed nature of polyps make polyp segmentation in colonoscopy images highly challenging. Despite the promising performance of existing deep learning-based polyp segmentation methods, their perceptual capabilities remain biased toward local regions, mainly because of the strong spatial correlations between neighboring pixels in the spatial domain. This limitation makes it difficult to capture the complete polyp structures, ultimately leading to sub-optimal segmentation results. In this paper, we propose a novel adaptive spectrum guidance network, called ASGNet, which addresses the limitations of spatial perception by integrating spectral features with global attributes. Specifically, we first design a spectrum-guided non-local perception module that jointly aggregates local and global information, therefore enhancing the discriminability of polyp structures, and refining their boundaries. Moreover, we introduce a multi-source semantic extractor that integrates rich high-level semantic information to assist in the preliminary localization of polyps. Furthermore, we construct a dense cross-layer interaction decoder that effectively integrates diverse information from different layers and strengthens it to generate high-quality representations for accurate polyp segmentation. Extensive quantitative and qualitative results demonstrate the superiority of our ASGNet approach over 21 state-of-the-art methods across five widely-used polyp segmentation benchmarks. The code will be publicly available at: this https URL.

[CV-53] Efficient closed-form approaches for pose estimation using Sylvester forms

【速读】:该论文旨在解决实时计算机视觉应用中姿态估计(旋转与平移)的非线性最小二乘问题,此类问题通常计算耗时且基础性强。解决方案的关键在于利用适当的旋转参数化方法,将优化问题转化为多项式方程组的求解,并通过基于结式(resultant)矩阵的闭式求解方法显著降低计算复杂度。文中提出了一类新的基于结式的求解器,特别利用Sylvester形式进一步简化求解过程,从而在保持与当前最先进方法相当数值精度的同时,大幅减少计算时间。该方法适用于两类典型姿态估计问题:3D到3D对应点和3D点到2D点的对应关系下的姿态估计。

链接: https://arxiv.org/abs/2604.14747
作者: Jana Vráblíková(AROMATH),Ezio Malis(ACENTAURI),Laurent Busé(AROMATH)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Solving the non-linear least-squares problem for pose estimation (rotation and translation) is often a time-consuming yet fundamental problem in several real-time computer vision applications. With an adequate rotation parametrization, the optimization problem can be reduced to the solution of a system of polynomial equations and solved in closed form. Recent advances in efficient closed form solvers utilizing resultant matrices have shown a promising research direction to decrease the computation time while preserving the estimation accuracy. In this paper, we propose a new class of resultant-based solvers that exploit Sylvester forms to further reduce the complexity of the resolution. We demonstrate that our proposed methods are numerically as accurate as the state-of-the-art solvers, and outperform them in terms of computational time. We show that this approach can be applied for pose estimation in two different types of problems: estimating a pose from 3D to 3D correspondences, and estimating a pose from 3D points to 2D points correspondences.
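Sylvester 结式矩阵的构造可以用几行 NumPy 演示:两个一元多项式有公共根,当且仅当其 Sylvester 矩阵的行列式(结式)为零。下例与论文的具体姿态求解器无关,仅展示 Sylvester 形式本身:

```python
import numpy as np

def sylvester(p, q):
    """两一元多项式(降幂系数)的 Sylvester 矩阵;deg p = m, deg q = n,矩阵为 (m+n)×(m+n)。"""
    m, n = len(p) - 1, len(q) - 1
    S = np.zeros((m + n, m + n))
    for i in range(n):                 # p 的系数逐行右移,共 n 行
        S[i, i:i + m + 1] = p
    for i in range(m):                 # q 的系数逐行右移,共 m 行
        S[n + i, i:i + n + 1] = q
    return S

# p = x^2 - 3x + 2 = (x-1)(x-2) 与 q = x - 1 有公共根 x=1 → 结式为 0
p = [1.0, -3.0, 2.0]
q = [1.0, -1.0]
res_common = np.linalg.det(sylvester(p, q))
# q2 = x - 5 与 p 无公共根 → 结式非零(此处为 p(5) = 12)
res_none = np.linalg.det(sylvester(p, [1.0, -5.0]))
```

姿态估计中的思路类似:把最优性条件写成多项式方程组,再用结式矩阵消元求解;论文的贡献在于利用 Sylvester 形式降低这一步的构造与求解复杂度。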

[CV-54] Find the Differences: Differential Morphing Attack Detection vs Face Recognition

【速读】:该论文旨在解决人脸识别(Face Recognition, FR)系统在面对形态攻击(Morphing Attack)时的脆弱性问题,即攻击者通过融合两张人脸图像生成难以察觉的“伪人脸”以欺骗FR系统。作者指出,当前FR系统与差分形态攻击检测(Differential Morphing Attack Detection, D-MAD)本质上执行相似任务,但现有决策阈值设计导致FR系统对未知类型形态攻击存在固有漏洞,从而引发正常识别性能与抗攻击能力之间的权衡。解决方案的关键在于:直接利用已部署的FR系统进行形态攻击检测,并引入一个新的评估阈值,该阈值能确保对任意类型(包括未知类型)形态攻击的脆弱性不超过预设上限,从而实现无需额外模型即可提升安全性。

链接: https://arxiv.org/abs/2604.14734
作者: Una M. Kelly,Luuk J. Spreeuwers,Raymond N.J. Veldhuis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Morphing is a challenge to face recognition (FR) for which several morphing attack detection solutions have been proposed. We argue that face recognition and differential morphing attack detection (D-MAD) in principle perform very similar tasks, which we support by comparing an FR system with two existing D-MAD approaches. We also show that currently used decision thresholds inherently lead to FR systems being vulnerable to morphing attacks and that this explains the tradeoff between performance on normal images and vulnerability to morphing attacks. We propose using FR systems that are already in place for morphing detection and introduce a new evaluation threshold that guarantees an upper limit to the vulnerability to morphing attacks - even of unknown types.
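论文的核心思想是直接用已部署的 FR 系统做形态攻击检测,并引入比常规验证阈值更严格的评估阈值来限制脆弱性。下面用一个假设的双阈值决策草图示意这一思路(阈值数值与三分支决策均为虚构,非论文给出的具体方案):

```python
def fr_decision(similarity, tau_verify=0.40, tau_morph_safe=0.62):
    """示意:常规验证阈值之上再设一个更严格的阈值;落在两者之间的比对视为可疑。
    tau_verify、tau_morph_safe 为假设值。"""
    if similarity < tau_verify:
        return "reject"                   # 常规人脸验证失败
    if similarity < tau_morph_safe:
        return "flag_possible_morph"      # 通过验证但落入可疑区间 → 二次核验
    return "accept"

d1 = fr_decision(0.30)   # 低相似度:直接拒绝
d2 = fr_decision(0.50)   # 中间区间:形态攻击常落于此,标记复核
d3 = fr_decision(0.80)   # 高相似度:接受
```

这对应论文指出的权衡:阈值越严格,对(包括未知类型的)形态攻击的脆弱性上限越低,但正常图像的误拒率可能上升。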

[CV-55] HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet

【速读】:该论文旨在解决视觉状态空间模型(Vision State Space Models, Vision SSMs)在处理二维图像时依赖复杂扫描策略所导致的计算开销大、架构复杂的问题。现有方法如Vim、VMamba和SiMBA需通过多步扫描将一维SSM适配至图像输入,不仅效率低下,还引入了离散化不稳定性。其解决方案的核心在于提出HAMSA——一种无需扫描的频域SSM架构,关键创新包括:(1) 简化的核参数化方式,用单个高斯初始化的复数核替代传统(A, B, C)矩阵,避免离散化不稳定;(2) SpectralPulseNet(SPN),实现输入依赖的频率门控机制以支持自适应频谱调制;(3) Spectral Adaptive Gating Unit(SAGU),基于幅值的门控机制保障频域梯度稳定传播。结合快速傅里叶变换(FFT)实现O(L log L)复杂度,在ImageNet-1K上达到85.7% top-1准确率(SSM中最优),推理速度比Transformer快2.2倍,且内存与能耗显著降低。

链接: https://arxiv.org/abs/2604.14724
作者: Badri N. Patro,Vijay S. Agneeswaran
机构: Microsoft(微软)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Vision State Space Models (SSMs) like Vim, VMamba, and SiMBA rely on complex scanning strategies to adapt sequential SSMs to process 2D images, introducing computational overhead and architectural complexity. We propose HAMSA, a scanning-free SSM operating directly in the spectral domain. HAMSA introduces three key innovations: (1) simplified kernel parameterization-a single Gaussian-initialized complex kernel replacing traditional (A, B, C) matrices, eliminating discretization instabilities; (2) SpectralPulseNet (SPN)-an input-dependent frequency gating mechanism enabling adaptive spectral modulation; and (3) Spectral Adaptive Gating Unit (SAGU)-magnitude-based gating for stable gradient flow in the frequency domain. By leveraging FFT-based convolution, HAMSA eliminates sequential scanning while achieving O(L log L) complexity with superior simplicity and efficiency. On ImageNet-1K, HAMSA reaches 85.7% top-1 accuracy (state-of-the-art among SSMs), with 2.2X faster inference than transformers (4.2ms vs 9.2ms for DeiT-S) and 1.4-1.9X speedup over scanning-based SSMs, while using less memory (2.1GB vs 3.2-4.5GB) and energy (12.5J vs 18-25J). HAMSA demonstrates strong generalization across transfer learning and dense prediction tasks.
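HAMSA 的无扫描频域卷积可以用"FFT 逐点相乘"来理解;下面的 NumPy 草图示意单个高斯初始化复数核与基于幅值的门控(核的初始化尺度与 sigmoid 门控均为对 SPN/SAGU 的极简近似,非官方实现):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 64
x = rng.standard_normal(L)                    # 展平后的 token 序列(玩具输入)

# 单个高斯初始化的复数核,替代传统 (A, B, C) 参数化(初始化尺度 0.02 为假设值)
k = rng.normal(0, 0.02, L) + 1j * rng.normal(0, 0.02, L)

X = np.fft.fft(x)
K = np.fft.fft(k)
Y = X * K                                     # 频域逐点相乘 = O(L log L) 的全局(循环)卷积

gate = 1.0 / (1.0 + np.exp(-np.abs(Y)))       # 基于幅值的门控:SAGU 思想的极简近似
y = np.fft.ifft(Y * gate).real                # 回到空间域
```

关键点在于:整个感受野一次性由频域乘法覆盖,完全不需要 Vim/VMamba 式的多方向扫描,这也是复杂度能降到 O(L log L) 的原因。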

[CV-56] Data Synthesis Improves 3D Myotube Instance Segmentation

【速读】:该论文旨在解决肌管(myotube)三维实例分割在标注数据稀缺场景下的性能瓶颈问题。由于缺乏大规模标注的肌管图像数据,现有预训练生物医学分割模型难以泛化至该领域。其解决方案的关键在于提出一种基于生物物理机制的合成数据生成流程:通过多项式中心线、局部变化半径、分支结构及椭球端帽等参数建模单个肌管形态,并结合真实显微成像中的噪声、光学伪影以及CycleGAN域自适应技术生成逼真合成体积数据;在此基础上训练一个紧凑的3D U-Net网络,该网络采用自监督编码器预训练策略,仅使用合成数据即可在真实数据上实现平均实例分割质量(IPQ)达0.22,显著优于三种零样本分割模型,验证了生物物理驱动的合成方法在标注稀缺生物医学领域的有效性。

链接: https://arxiv.org/abs/2604.14720
作者: David Exler,Nils Friederich,Martin Krüger,John Jbeily,Mario Vitacolonna,Rüdiger Rudolf,Ralf Mikut,Markus Reischl
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 4 figures, submitted to BMT (VDE) 2026 Conference

点击查看摘要

Abstract:Myotubes are multinucleated muscle fibers serving as key model systems for studying muscle physiology, disease mechanisms, and drug responses. Mechanistic studies and drug screening thereby rely on quantitative morphological readouts such as diameter, length, and branching degree, which in turn require precise three-dimensional instance segmentation. Yet established pretrained biomedical segmentation models fail to generalize to this domain due to the absence of large annotated myotube datasets. We introduce a geometry-driven synthesis pipeline that models individual myotubes via polynomial centerlines, locally varying radii, branching structures, and ellipsoidal end caps derived from real microscopy observations. Synthetic volumes are rendered with realistic noise, optical artifacts, and CycleGAN-based Domain Adaptation (DA). A compact 3D U-Net with self-supervised encoder pretraining, trained exclusively on synthetic data, achieves a mean IPQ of 0.22 on real data, significantly outperforming three established zero-shot segmentation models, demonstrating that biophysics-driven synthesis enables effective instance segmentation in annotation-scarce biomedical domains.
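摘要中的几何合成流程以"多项式中心线 + 局部变化半径"建模单根肌管;以下 NumPy 草图给出该思路的最小示例(多项式系数、基准半径与半径波动均为任意假设值,分支与椭球端帽未包含):

```python
import numpy as np

def synth_centerline(coeffs_xyz, n_pts=100, r0=3.0, r_var=0.5, seed=0):
    """多项式中心线 + 沿长度平滑变化的半径,生成单根肌管的管状几何骨架。"""
    t = np.linspace(0.0, 1.0, n_pts)
    # 每个坐标轴一条多项式曲线(降幂系数),堆叠成 (n_pts, 3) 的中心线
    pts = np.stack([np.polyval(c, t) for c in coeffs_xyz], axis=1)
    rng = np.random.default_rng(seed)
    # 半径围绕 r0 平滑波动;频率随机采样,模拟真实肌管的粗细变化
    radius = r0 + r_var * np.sin(2 * np.pi * t * rng.uniform(1, 3))
    return pts, radius

coeffs = ([0.0, 40.0, 0.0],    # x(t) = 40 t(沿长轴延伸)
          [5.0, -5.0, 10.0],   # y(t) = 5 t^2 - 5 t + 10(轻微弯曲)
          [0.0, 0.0, 8.0])     # z(t) = 8(恒定深度)
pts, radius = synth_centerline(coeffs)
```

在此骨架上再叠加分支、端帽、渲染噪声与 CycleGAN 域自适应,即可得到论文用于训练 3D U-Net 的合成体积。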

[CV-57] MS-SSE-Net: A Multi-Scale Spatial Squeeze-and-Excitation Network for Structural Damage Detection in Civil and Geotechnical Engineering

【速读】:该论文旨在解决结构损伤分类中因损伤模式多样性和环境条件变化导致的识别精度难题,尤其在图像数据中准确区分多种类型结构损伤的问题。其解决方案的关键在于提出一种名为MS-SSE-Net的新型深度学习(Deep Learning, DL)框架,该框架基于DenseNet201主干网络,融合多尺度特征提取机制与通道注意力(channel attention)和空间注意力(spatial attention)模块:通过并行深度可分离卷积同时捕获局部细节与上下文信息,并利用挤压-激励(squeeze-and-excitation)风格的注意力机制强化关键特征区域、抑制冗余噪声,从而显著提升模型对复杂场景下结构损伤的判别能力。实验表明,该方法在StructDamage数据集上实现了99.27%的F1-score,优于基线模型DenseNet201。

链接: https://arxiv.org/abs/2604.14711
作者: Saif ur Rehman Khan,Imad Ahmed Waqar,Arooj Zaib,Saad Ahmed,Sebastian Vollmer,Andreas Dengel,Muhammad Nabeel Asim
机构: DFKI(德国人工智能研究中心); RPTU(莱茵兰-普法尔茨凯泽斯劳滕-兰道工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Structural damage detection is essential for maintaining the safety and reliability of civil infrastructure. However, accurately identifying different types of structural damage from images remains challenging due to variations in damage patterns and environmental conditions. To address these challenges, this paper proposes MS-SSE-Net, a novel deep learning (DL) framework for structural damage classification. The proposed model is built upon the DenseNet201 backbone and integrates novel multi-scale feature extraction with channel and spatial attention mechanisms (MS-SSE-Net). Specifically, parallel depthwise convolutions capture both local and contextual features, while squeeze-and-excitation style channel attention and spatial attention emphasize informative regions and suppress irrelevant noise. The refined features are then processed through global average pooling and a fully connected classification layer to generate the final predictions. Experiments are conducted on the StructDamage dataset containing multiple structural damage categories. The proposed MS-SSE-Net demonstrates superior performance compared with the baseline DenseNet201 and other comparative approaches. Specifically, the proposed method achieves 99.31% precision, 99.25% recall, 99.27% F1-score, and 99.26% accuracy, outperforming the baseline model which achieved 98.62% precision, 98.53% recall, 98.58% F1-score, and 98.53% accuracy.
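MS-SSE-Net 所用的 squeeze-and-excitation 通道注意力是标准构件:全局平均池化("squeeze")后经降维/升维全连接与 Sigmoid("excitation"),得到逐通道权重并重标定特征。NumPy 草图如下(权重随机初始化、通道数与压缩比为任意示例,空间注意力分支未包含):

```python
import numpy as np

def se_block(feat, w1, w2):
    """SE 通道注意力:GAP → FC 降维 → ReLU → FC 升维 → Sigmoid → 逐通道重标定。"""
    z = feat.mean(axis=(1, 2))                  # squeeze:全局平均池化,得到 (C,)
    h = np.maximum(w1 @ z, 0.0)                 # 降维 + ReLU,(C/r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))         # 升维 + Sigmoid,得到通道权重 (C,)
    return feat * s[:, None, None], s           # 重标定:重要通道被放大,冗余通道被抑制

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                         # 通道数 8、空间 4×4、压缩比 r=2(示例值)
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
out, scale = se_block(feat, w1, w2)
```

论文在此基础上叠加空间注意力与并行深度可分离卷积的多尺度分支;上面只示意其中的通道注意力一环。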

[CV-58] G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval CVPR2026

【速读】:该论文旨在解决零样本图像检索(Zero-Shot Composed Image Retrieval, ZS-CIR)中因过度依赖文本模态而导致的检索多样性与准确性不足的问题。现有方法虽利用多模态大语言模型(Multimodal Large Language Models, MLLMs)将隐式语义显式化,但未能充分建模参考图像与修改文本组合所蕴含的模糊性与多样性,从而限制了检索效果。解决方案的关键在于提出一种无需训练的新方法——Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR (G-MIXER),其核心机制包括:通过在不同混合比例下对图像-文本对进行测地线混叠(geodesic mixup)构建反映隐式语义的查询特征,并生成多样候选集;随后基于MLLM提取的显式语义对候选集进行重排序,从而协同提升检索结果的多样性与准确性。

链接: https://arxiv.org/abs/2604.14710
作者: Jiyoung Lim,Heejae Yang,Jee-Hyong Lee
机构: Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Accepted

点击查看摘要

Abstract:Composed Image Retrieval (CIR) aims to retrieve target images by integrating a reference image with a corresponding modification text. CIR requires jointly considering the explicit semantics specified in the query and the implicit semantics embedded within its bi-modal composition. Recent training-free Zero-Shot CIR (ZS-CIR) methods leverage Multimodal Large Language Models (MLLMs) to generate detailed target descriptions, converting the implicit information into explicit textual expressions. However, these methods rely heavily on the textual modality and fail to capture the fuzzy retrieval nature that requires considering diverse combinations of candidates. This leads to reduced diversity and accuracy in retrieval results. To address this limitation, we propose a novel training-free method, Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR (G-MIXER). G-MIXER constructs composed query features that reflect the implicit semantics of reference image-text pairs through geodesic mixup over a range of mixup ratios, and builds a diverse candidate set. The generated candidates are then re-ranked using explicit semantics derived from MLLMs, improving both retrieval diversity and accuracy. Our proposed G-MIXER achieves state-of-the-art performance across multiple ZS-CIR benchmarks, effectively handling both implicit and explicit semantics without additional training. Our code will be available at this https URL.
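测地线混叠(geodesic mixup)即单位超球面上的球面线性插值(slerp);在不同混合比例下对参考图像特征与修改文本特征插值,即可得到摘要所述的多样化组合查询。NumPy 草图如下(三维特征向量为玩具示例,真实 CLIP 类特征维度远高于此):

```python
import numpy as np

def slerp(u, v, alpha):
    """单位球面上的测地线插值:alpha=0 → u,alpha=1 → v,结果仍在单位球面上。"""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    theta = np.arccos(np.clip(u @ v, -1.0, 1.0))   # 两向量间的测地角
    if theta < 1e-7:
        return u                                   # 近乎重合时退化为恒等
    return (np.sin((1 - alpha) * theta) * u + np.sin(alpha * theta) * v) / np.sin(theta)

img_feat = np.array([1.0, 0.0, 0.0])   # 参考图像特征(玩具)
txt_feat = np.array([0.0, 1.0, 0.0])   # 修改文本特征(玩具)
# 一组混合比例 → 多个组合查询,汇成多样化的候选检索集
queries = [slerp(img_feat, txt_feat, a) for a in (0.25, 0.5, 0.75)]
norms = [np.linalg.norm(q) for q in queries]
```

与线性混叠不同,slerp 的结果始终落在单位球面上,与归一化嵌入空间的几何一致;G-MIXER 再用 MLLM 的显式语义对这些候选重排序。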

[CV-59] NG-GS: NeRF-Guided 3D Gaussian Splatting Segmentation CVPR2026

【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)中对象分割边界模糊与伪影的问题,这是由高斯表示的离散特性导致的边界处混叠(aliasing)现象所致。解决方案的关键在于提出NG-GS框架:首先通过掩码方差分析自动识别对象边界附近的模糊高斯点;随后利用径向基函数(Radial Basis Function, RBF)插值构建空间连续的特征场,并结合多分辨率哈希编码(multi-resolution hash encoding)实现高效多尺度表征;最后通过联合优化策略,将3DGS与轻量级神经辐射场(NeRF)模块对齐,引入对齐损失和空间连续性损失,从而确保分割边界的平滑性和一致性。

链接: https://arxiv.org/abs/2604.14706
作者: Yi He,Tao Wang,Yi Jin,Congyan Lang,Yidong Li,Haibin Ling
机构: Beijing Jiaotong University (北京交通大学); Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 (Highlight)

点击查看摘要

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled highly efficient and photorealistic novel view synthesis. However, segmenting objects accurately in 3DGS remains challenging due to the discrete nature of Gaussian representations, which often leads to aliasing and artifacts at object boundaries. In this paper, we introduce NG-GS, a novel framework for high-quality object segmentation in 3DGS that explicitly addresses boundary discretization. Our approach begins by automatically identifying ambiguous Gaussians at object boundaries using mask variance analysis. We then apply radial basis function (RBF) interpolation to construct a spatially continuous feature field, enhanced by multi-resolution hash encoding for efficient multi-scale representation. A joint optimization strategy aligns 3DGS with a lightweight NeRF module through alignment and spatial continuity losses, ensuring smooth and consistent segmentation boundaries. Extensive experiments on NVOS, LERF-OVS, and ScanNet benchmarks demonstrate that our method achieves state-of-the-art performance, with significant gains in boundary mIoU. Code is available at this https URL.
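NG-GS 用 RBF 插值在边界处的模糊高斯点附近构建空间连续特征场;下面以高斯核加权平均给出其最小示意(σ、特征维度与坐标均为假设值,多分辨率哈希编码未包含):

```python
import numpy as np

def rbf_interp(query, centers, values, sigma=0.5):
    """高斯 RBF 插值:查询点特征 = 各中心特征按 exp(-d^2/2σ^2) 归一化加权的平均。"""
    d2 = ((query[None, :] - centers) ** 2).sum(axis=1)   # 到各中心的平方距离
    w = np.exp(-d2 / (2 * sigma ** 2))
    w = w / w.sum()
    return w @ values

centers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])   # 两个高斯点的位置
values = np.array([[1.0, 0.0], [0.0, 1.0]])              # 对应的 2 维(如分割)特征
mid = rbf_interp(np.array([0.5, 0.0, 0.0]), centers, values)     # 边界中点 → 两侧各占一半
near0 = rbf_interp(np.array([0.05, 0.0, 0.0]), centers, values)  # 贴近第一个点 → 偏向其特征
```

特征随空间位置连续过渡,正是用来消除离散高斯表示在物体边界处混叠的关键;论文再用对齐损失把该连续场与 3DGS 联合优化。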

[CV-60] The Courtroom Trial of Pixels: Robust Image Manipulation Localization via Adversarial Evidence and Reinforcement Learning Judgment

【Quick Read】: This paper targets the unreliable localization of existing image manipulation localization (IML) methods when manipulation traces are subtle or degraded by post-processing and noise, which stems from the lack of an explicit comparison between evidence for manipulated and authentic regions. The key is a courtroom-style adjudication framework that casts IML as a confrontation of evidence between a "prosecution" (asserting manipulation) and a "defense" (asserting authenticity), followed by judgment: a dual-hypothesis segmentation architecture built on a shared multi-scale encoder produces evidence for manipulated and authentic regions via edge-prior-guided cascaded multi-level fusion, bidirectional disagreement suppression, and dynamic debate refinement; a reinforcement-learning judge model then performs strategic re-inference and refinement on uncertain regions to output the manipulation mask, trained with advantage-based rewards and a soft-IoU objective, with reliability calibrated via entropy and cross-hypothesis consistency.

Link: https://arxiv.org/abs/2604.14703
Authors: Songlin Li,Zhiqing Guo,Dan Ma,Changtao Miao,Gaobo Yang
Institutions: Xinjiang University (新疆大学); Ant Group (蚂蚁集团); Hunan University (湖南大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Although some existing image manipulation localization (IML) methods incorporate authenticity-related supervision, this information is typically utilized merely as an auxiliary training signal to enhance the model’s sensitivity to manipulation artifacts, rather than being explicitly modeled as localization evidence opposing the manipulated regions. Consequently, when manipulation traces are subtle or degraded by post-processing and noise, these methods struggle to explicitly compare manipulated and authentic evidence, resulting in unreliable predictions in ambiguous areas. To address these issues, we propose a courtroom-style adjudication framework that regards IML task as the confrontation of evidence followed by judgment. The framework comprises a prosecution stream, a defense stream, and a judge model. We first build a dual-hypothesis segmentation architecture on a shared multi-scale encoder, in which the prosecution stream asserts manipulation and the defense stream asserts authenticity. Guided by edge priors, it produces evidence for manipulated and authentic regions through cascaded multi-level fusion, bidirectional disagreement suppression, and dynamic debate refinement. We further develop a reinforcement learning judge model that performs strategic re-inference and refinement on uncertain regions, yielding a manipulated-region mask. The judge model is trained with advantage-based rewards and a soft-IoU objective, and reliability is calibrated via entropy and cross-hypothesis consistency. Experimental results show that our model achieves superior average performance compared with SOTA IML methods.

[CV-61] Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

【Quick Read】: This paper addresses the reduced reasoning accuracy in video understanding caused by existing object-agnostic methods' inability to handle substantial object variations over time. The key of the proposed Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework, is a search controller optimized via reinforcement learning that explicitly anchors each reasoning step to a specific visual evidence region, building spatially interpretable multi-step reasoning traces, mitigating over-reliance on saliency-driven cues, and enabling more accurate and interpretable semantic reasoning over videos.

Link: https://arxiv.org/abs/2604.14692
Authors: Zhixuan Wu,Quanxing Zha,Teng Wang,Genbao Xu,Wenyuan Gu,Wei Rao,Nan Ma,Bo Cheng,Soujanya Poria
Institutions: Beijing University of Posts and Telecommunications (北京邮电大学); Tencent (腾讯); Beijing University of Technology (北京工业大学); Huaqiao University (华侨大学); Nanyang Technological University (南洋理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on both in-domain NExTQA and out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness and generalization of Chain-of-Glimpse across diverse video reasoning tasks.

[CV-62] DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts ICLR2026

【Quick Read】: This paper tackles the suboptimal performance of visual prompts in object detection, whose root cause is the lack of global discriminability in visual prompts, limiting open-vocabulary detection. The key of the proposed DETR-ViP framework is to augment basic image-text contrastive learning with global prompt integration and visual-textual prompt relation distillation to strengthen class discriminability, while a selective fusion strategy ensures stable and robust detection, substantially improving visual-prompted object detection.

Link: https://arxiv.org/abs/2604.14684
Authors: Bo Qian,Dahu Shi,Xing Wei
Institutions: Xi'an Jiaotong University (西安交通大学); Zhejiang University (浙江大学); Hikrobot Co., Ltd. (海康威视)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published as a conference paper at ICLR 2026

Click to view the abstract

Abstract:Visual prompted object detection enables interactive and flexible definition of target categories, thereby facilitating open-vocabulary detection. Since visual prompts are derived directly from image features, they often outperform text prompts in recognizing rare categories. Nevertheless, research on visual prompted detection has been largely overlooked, and it is typically treated as a byproduct of training text prompted detectors, which hinders its development. To fully unlock the potential of visual-prompted detection, we investigate the reasons why its performance is suboptimal and reveal that the underlying issue lies in the absence of global discriminability in visual prompts. Motivated by these observations, we propose DETR-ViP, a robust object detection framework that yields class-distinguishable visual prompts. On top of basic image-text contrastive learning, DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations. In addition, DETR-ViP employs a selective fusion strategy that ensures stable and robust detection. Extensive experiments on COCO, LVIS, ODinW, and Roboflow100 demonstrate that DETR-ViP achieves substantially higher performance in visual prompt detection compared to other state-of-the-art counterparts. A series of ablation studies and analyses further validate the effectiveness of the proposed improvements and shed light on the underlying reasons for the enhanced detection capability of visual prompts.

[CV-63] Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting CVPR

【Quick Read】: This paper addresses intra-frame and inter-frame inconsistencies in video outpainting caused by the implicit temporal modeling and limited spatial context of generative approaches, which are especially pronounced in dynamic scenes and large outpainting scenarios. The key of the proposed Seen-to-Scene framework, which unifies the propagation-based and generation-based paradigms, is twofold: first, a flow completion network pre-trained on video inpainting is fine-tuned end-to-end to bridge the domain gap and reconstruct coherent motion fields; second, a reference-guided latent propagation mechanism effectively propagates source content across frames, markedly improving temporal coherence and visual realism while maintaining efficient inference.

Link: https://arxiv.org/abs/2604.14648
Authors: Inseok Jeon,Minhyeok Lee,Seunghoon Lee,Minseok Kang,Suhwan Cho,Sangyoun Lee
Institutions: Yonsei University (延世大学); GenGenAI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 8 figures (main paper); 9 pages, 10 figures (supplementary). Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026, Findings

Click to view the abstract

Abstract:Video outpainting aims to expand the visible content of a video beyond the original frame boundaries while preserving spatial fidelity and temporal coherence across frames. Existing methods primarily rely on large-scale generative models, such as diffusion models. However, generation-based approaches suffer from implicit temporal modeling and limited spatial context. These limitations lead to intra-frame and inter-frame inconsistencies, which become particularly pronounced in dynamic scenes and large outpainting scenarios. To overcome these challenges, we propose Seen-to-Scene, a novel framework that unifies propagation-based and generation-based paradigms for video outpainting. Specifically, Seen-to-Scene leverages flow-based propagation with a flow completion network pre-trained for video inpainting, which is fine-tuned in an end-to-end manner to bridge the domain gap and reconstruct coherent motion fields. To further improve the efficiency and reliability of propagation, we introduce a reference-guided latent propagation that effectively propagates source content across frames. Extensive experiments demonstrate that our method achieves superior temporal coherence and visual realism with efficient inference, surpassing even prior state-of-the-art methods that require input-specific adaptation.

[CV-64] Chaotic CNN for Limited Data Image Classification

【Quick Read】: This paper addresses the poor generalization of convolutional neural networks (CNNs) under limited training data, caused by overfitting and insufficient feature diversity. The key is a chaos-based feature transformation: nonlinear maps such as the logistic map, skew tent map, and sine map are applied to normalized feature vectors before the classification layer, reshaping the feature space to improve class separability. The method adds no trainable parameters, is computationally efficient, and integrates seamlessly into existing CNN architectures, yielding consistent accuracy gains on MNIST, Fashion-MNIST, and CIFAR-10 and indicating that the shared nonlinear and dynamical properties of chaotic systems drive the improvement.

Link: https://arxiv.org/abs/2604.14645
Authors: Anusree M,Akhila Henry,Pramod P Nair
Institutions: Amrita Vishwa Vidyapeetham (阿姆里塔世界大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Chaotic Dynamics (nlin.CD)
Comments:

Click to view the abstract

Abstract:Convolutional neural networks (CNNs) often exhibit poor generalisation in limited training data scenarios due to overfitting and insufficient feature diversity. In this work, a simple and effective chaos-based feature transformation is proposed to enhance CNN performance without increasing model complexity. The method applies nonlinear transformations using logistic, skew tent, and sine maps to normalised feature vectors before the classification layer, thereby reshaping the feature space and improving class separability. The approach is evaluated on greyscale datasets (MNIST and Fashion-MNIST) and an RGB dataset (CIFAR-10) using CNN architectures of varying depth under limited data conditions. The results show consistent improvement over the standalone (SA) CNN across all datasets. Notably, a maximum performance gain of 5.43% is achieved on MNIST using the skew tent map with a 3-layer CNN at 40 samples per class. A higher gain of 9.11% is observed on Fashion-MNIST using the sine map with a 3-layer CNN at 50 samples per class. Additionally, a strong gain of 7.47% is obtained on CIFAR-10 using the skew tent map at 200 samples per class. The consistent improvements across different chaotic maps indicate that the performance gain is driven by the shared nonlinear and dynamical properties of chaotic systems. The proposed method is computationally efficient, requires no additional trainable parameters, and can be easily integrated into existing CNN architectures, making it a practical solution for data-scarce image classification tasks.
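The three chaotic maps named in the abstract have standard closed forms. A minimal sketch of applying them to a normalized feature vector before the classifier (the min-max normalization, map parameters, and iteration count are assumptions, since the abstract does not fix them):

```python
import numpy as np

def logistic_map(x, r=3.99):          # chaotic for r close to 4
    return r * x * (1.0 - x)

def skew_tent_map(x, p=0.6):          # piecewise-linear tent with skew p
    return np.where(x < p, x / p, (1.0 - x) / (1.0 - p))

def sine_map(x, a=0.99):              # a * sin(pi * x), chaotic for a near 1
    return a * np.sin(np.pi * x)

def chaotic_transform(features, chaos_fn, n_iter=3):
    """Min-max normalize features into [0, 1] (the maps' domain),
    then iterate the chosen chaotic map to reshape the feature space."""
    f = (features - features.min()) / (features.max() - features.min() + 1e-8)
    for _ in range(n_iter):
        f = chaos_fn(f)
    return f
```

Because every map sends [0, 1] back into [0, 1], the transformed vector remains a valid bounded input for the final linear classification layer.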

[CV-65] Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification

【Quick Read】: This paper addresses adversarial threats to deep learning models in remote sensing (RS) image classification, where existing methods rely on direct pixel-wise perturbations, fail to exploit the inherent atmospheric characteristics of RS imagery, and struggle to survive real-world image degradations. The key of the proposed FogFool framework is to iteratively optimize Perlin-noise-based atmospheric patterns into physically plausible fog perturbations that mislead models while remaining visually consistent with authentic RS scenes; by exploiting the spatial coherence and mid-to-low-frequency nature of atmospheric phenomena, adversarial information is embedded into structural features shared across architectures, yielding strong white-box attacks, exceptional black-box transferability (up to 83.74% TASR), and robustness to common preprocessing defenses such as JPEG compression and filtering.

Link: https://arxiv.org/abs/2604.14643
Authors: Weiwei Zhuang,Wangze Xie,Qi Zhang,Xia Du,Zihan Lin,Zheng Lin,Hanlin Cai,Jizhe Zhou,Zihan Fang,Chi-man Pun,Wei Ni,Jun Luo
Institutions: Xiamen University of Technology (厦门理工学院); City University of Macau (澳门城市大学); Central South University (中南大学); University of Hong Kong (香港大学); University of Cambridge (剑桥大学); Sichuan University (四川大学); City University of Hong Kong (香港城市大学); University of Macau (澳门大学); Edith Cowan University (埃迪斯科文大学); Nanyang Technological University (南洋理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 14 pages, 11 figures

Click to view the abstract

Abstract:Adversarial attacks pose a severe threat to the reliability of deep learning models in remote sensing (RS) image classification. Most existing methods rely on direct pixel-wise perturbations, failing to exploit the inherent atmospheric characteristics of RS imagery or survive real-world image degradations. In this paper, we propose FogFool, a physically plausible adversarial framework that generates fog-based perturbations by iteratively optimizing atmospheric patterns based on Perlin noise. By modeling fog formations with natural, irregular structures, FogFool generates adversarial examples that are not only visually consistent with authentic RS scenes but also deceptive. By leveraging the spatial coherence and mid-to-low-frequency nature of atmospheric phenomena, FogFool embeds adversarial information into structural features shared across diverse architectures. Extensive experiments on two benchmark RS datasets demonstrate that FogFool achieves superior performance: not only does it excel in white-box settings, but it also exhibits exceptional black-box transferability (reaching 83.74% TASR) and robustness against common preprocessing-based defenses such as JPEG compression and filtering. Detailed analyses, including confusion matrices and Class Activation Map (CAM) visualizations, reveal that our atmospheric-driven perturbations induce a universal shift in model attention. These results indicate that FogFool represents a practical, stealthy, and highly persistent threat to RS classification systems, providing a robust benchmark for evaluating model reliability in complex environments.

[CV-66] High-Speed Full-Color HDR Imaging via Unwrapping Modulo-Encoded Spike Streams

【Quick Read】: This paper addresses the fundamental trade-off in conventional RGB-based high dynamic range (HDR) imaging: multi-exposure capture is prone to motion artifacts, while single-shot techniques incur irreversible information loss. The key is a complete modulo HDR imaging system that co-designs the sensing formulation and the unwrapping algorithm for high-speed, full-color HDR acquisition: (1) an exposure-decoupled modulo imaging formulation allows multiple measurements to be interleaved in time while preserving a clean, observation-wise measurement model; (2) an iteration-free unwrapping algorithm combines diffusion-based generative priors with the physical least-absolute-remainder property of modulo images for efficient, physics-consistent HDR reconstruction; (3) a hardware prototype based on modulo-encoded spike streams validates practical viability, sustaining 1000 FPS full-color imaging while reducing output data bandwidth from roughly 20 Gbps to 6 Gbps.

Link: https://arxiv.org/abs/2604.14632
Authors: Chu Zhou,Siqi Yang,Kailong Zhang,Heng Guo,Zhaofei Yu,Boxin Shi,Imari Sato
Institutions: National Institute of Informatics (国立信息研究所); Beijing University of Posts and Telecommunications (北京邮电大学); Peking University (北京大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: TPAMI under review

Click to view the abstract

Abstract:Conventional RGB-based high dynamic range (HDR) imaging faces a fundamental trade-off between motion artifacts in multi-exposure captures and irreversible information loss in single-shot techniques. Modulo sensors offer a promising alternative by encoding theoretically unbounded dynamic range into wrapped measurements. However, existing modulo solutions remain bottlenecked by iterative unwrapping overhead and hardware constraints limiting them to low-speed, grayscale capture. In this work, we present a complete modulo-based HDR imaging system that enables high-speed, full-color HDR acquisition by synergistically advancing both the sensing formulation and the unwrapping algorithm. At the core of our approach is an exposure-decoupled formulation of modulo imaging that allows multiple measurements to be interleaved in time, preserving a clean, observation-wise measurement model. Building upon this, we introduce an iteration-free unwrapping algorithm that integrates diffusion-based generative priors with the physical least absolute remainder property of modulo images, supporting highly efficient, physics-consistent HDR reconstruction. Finally, to validate the practical viability of our system, we demonstrate a proof-of-concept hardware implementation based on modulo-encoded spike streams. This setup preserves the native high temporal resolution of spike cameras, achieving 1000 FPS full-color imaging while reducing output data bandwidth from approximately 20 Gbps to 6 Gbps. Extensive evaluations indicate that our coordinated approach successfully overcomes key systemic bottlenecks, demonstrating the feasibility of deploying modulo imaging in dynamic scenarios.
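The paper's actual unwrapping uses diffusion priors; the sketch below only makes the underlying measurement model concrete: a modulo sensor wraps unbounded irradiance into a fixed range, and unwrapping is trivially exact once the per-pixel wrap count is known (recovering that count is the hard part the paper solves). The saturation level `vmax` and the example values are illustrative:

```python
import numpy as np

def modulo_encode(irradiance, vmax=1.0):
    """A modulo sensor wraps unbounded irradiance into [0, vmax)
    instead of clipping it, so no dynamic range is discarded."""
    return np.mod(irradiance, vmax)

def unwrap_with_counts(wrapped, wrap_counts, vmax=1.0):
    """Exact inversion when the per-pixel wrap count k is known:
    x = wrapped + k * vmax. The paper's contribution is estimating k
    (assumed given here) iteration-free via diffusion-based priors."""
    return wrapped + wrap_counts * vmax

x = np.array([0.2, 1.7, 3.05])   # HDR irradiance, exceeding the sensor range
w = modulo_encode(x)             # wrapped measurements in [0, 1)
k = np.floor(x / 1.0)            # ground-truth wrap counts for this toy example
```

This also shows why wrapped measurements are compact to transmit: the payload stays bounded regardless of scene dynamic range, consistent with the bandwidth reduction reported in the abstract.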

[CV-67] CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation ICIP2025

【Quick Read】: This paper addresses the insufficient modeling of the complex interdependencies between the appearance and motion modalities in unsupervised video object segmentation (UVOS), which limits how effectively their complementary information can be fused. The key is a novel cross-modality token modulation mechanism that establishes dense connections between tokens of the two modalities and uses relation transformer blocks for efficient intra-modal and inter-modal information propagation, strengthening their joint representation; a token masking strategy is further introduced to improve learning efficiency rather than merely scaling model complexity.

Link: https://arxiv.org/abs/2604.14630
Authors: Inseok Jeon,Suhwan Cho,Minhyeok Lee,Seunghoon Lee,Minseok Kang,Jungho Lee,Chaewon Park,Donghyeong Kim,Sangyoun Lee
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 5 figures. Accepted to IEEE ICIP 2025

Click to view the abstract

Abstract:Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.
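The abstract does not specify the masking strategy, so the following is only a generic random token-masking sketch of the kind such a strategy could build on (mask ratio, mask value, and the per-token granularity are all assumptions):

```python
import numpy as np

def mask_tokens(tokens, mask_ratio=0.5, mask_value=0.0, seed=0):
    """Randomly blank out a fraction of tokens so the network must recover
    information via cross-token (and cross-modal) propagation rather than
    relying on raw capacity. tokens: (N, D)."""
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    n_mask = int(n * mask_ratio)
    idx = rng.choice(n, size=n_mask, replace=False)
    out = tokens.copy()
    out[idx] = mask_value
    return out, idx

tokens = np.ones((16, 8))
masked, idx = mask_tokens(tokens)
```

During training, the relation transformer blocks would then be asked to reconstruct or reason despite the masked entries, which is what turns masking into a regularizer.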

[CV-68] Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

【Quick Read】: This paper addresses inconsistent multimodal knowledge transfer in knowledge distillation (KD) for Vision-Language Models (VLMs): existing methods supervise the visual and language modalities separately without explicitly modeling cross-modal alignment, limiting effective transfer of multimodal knowledge. The key of the proposed Switch-KD framework is twofold: (1) Visual-Switch Distillation, which switches the student's visual outputs into the teacher's language pathway to build cross-modal probabilistic references for implicit visual knowledge transfer; and (2) a Dynamic Bi-directional Logits Difference (DBiLD) loss that adaptively aligns informative probability regions under bidirectional supervision while preserving the probability distribution structures of teacher and student, unifying efficient multimodal knowledge transfer within a shared text-probability space.

Link: https://arxiv.org/abs/2604.14629
Authors: Haoyi Sun,Xiaoxiao Wang,Ning Mao,Qian Wang,Lifu Mu,Wen Zheng,Tao Wei,Wei Chen
Institutions: Li Auto Inc (理想汽车)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures

Click to view the abstract

Abstract:Vision-Language Models (VLMs) have shown remarkable capabilities in joint vision-language understanding, but their large scale poses significant challenges for deployment in resource-constrained scenarios. Knowledge Distillation (KD) offers a viable way to improve model capabilities without increasing model size or data requirements, making deployment more efficient. However, applying KD to VLMs is challenged by modality-specific supervision: although multimodal knowledge in VLMs is fused within the language space, current methods supervise each modality separately without explicitly addressing multimodal alignment, leading to inconsistent multimodal knowledge transfer. To address this, we propose Switch-KD, a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. Switch-KD comprises two key components: (1) Visual-Switch Distillation, which switches the student’s visual outputs into the teacher’s language pathway to construct cross-modal probabilistic references for implicit visual knowledge transfer; and (2) Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving the distributional structures of teacher and student through bidirectional supervision. Guided by Switch-KD, a 0.5B TinyLLaVA effectively distills rich multimodal knowledge from its 3B teacher, yielding an average improvement of 3.6 points across 10 multimodal benchmarks without any architectural modification.

[CV-69] Multigrain-aware Semantic Prototype Scanning and Tri-Token Prompt Learning Embraced High-Order RWKV for Pan-Sharpening

【Quick Read】: This paper addresses spatial-detail distortion and global semantic inconsistency in pan-sharpening caused by insufficient semantic modeling in conventional architectures. The key innovations are: 1) multigrain-aware semantic prototype scanning, which uses locality-sensitive hashing to build multi-grain semantic prototypes and enable a semantics-aware scanning strategy that mitigates the positional bias of raster scanning; 2) a tri-token prompting mechanism combining a global token, cluster-derived prototype tokens, and a learnable register token to strengthen semantic priors and suppress noisy intermediate representations; 3) an invertible Q-Shift operation that provides lossless multi-scale feature transformation without extra parameters, effectively preserving high-frequency spatial details.

Link: https://arxiv.org/abs/2604.14622
Authors: Junfeng Li,Wenyang Zhou,Xueheng Li,Xuanhua He,Jianhou Gan,Wenqi Ren
Institutions: Shenzhen Campus of Sun Yat-sen University; Yunnan Normal University; Xidian University; University of Science and Technology of China; The Hong Kong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:In this work, we propose a Multigrain-aware Semantic Prototype Scanning paradigm for pan-sharpening, built upon a high-order RWKV architecture and a tri-token prompting mechanism derived from semantic clustering. Specifically, our method contains three key components: 1) Multigrain-aware Semantic Prototype Scanning. Although RWKV offers an efficient linear-complexity alternative to Transformers, its conventional bidirectional raster scanning is still semantic-agnostic and prone to positional bias. To address this issue, we introduce a semantic-driven scanning strategy that leverages locality-sensitive hashing to group semantically related regions and construct multi-grain semantic prototypes, enabling context-aware token reordering and more coherent global interaction. 2) Tri-token Prompt Learning. We design a tri-token prompting mechanism consisting of a global token, cluster-derived prototype tokens, and a learnable register token. The global and prototype tokens provide complementary semantic priors for RWKV modeling, while the register token helps suppress noisy and artifact-prone intermediate representations. 3) Invertible Q-Shift. To counteract the loss of spatial details, we apply center difference convolution on the value pathway to inject high-frequency information, and introduce an invertible multi-scale Q-shift operation for efficient and lossless feature transformation without parameter-heavy receptive field expansion. Experimental results demonstrate the superiority of our method.

[CV-70] Towards Design Compositing CVPR2026

【Quick Read】: This paper addresses visual style inconsistency in graphic design generation when multimodal components (images, text, logos, etc.) come from disparate sources: existing methods assume stylistically harmonious inputs, an assumption that often fails in practice and leaves the final design incoherent. The key is GIST (Generative Image Stylization and Compositing Tool), a training-free, identity-preserving image compositor that can be plugged into any components-to-design or design-refining pipeline, visually unifying the input elements through stylization and compositing and significantly improving overall harmony and aesthetic quality.

Link: https://arxiv.org/abs/2604.14605
Authors: Abhinav Mahajan,Abhikhya Tripathy,Sudeeksha Reddy Pala,Vaibhav Methi,K J Joseph,Balaji Vasan Srinivasan
Institutions: Carnegie Mellon University (卡内基梅隆大学); IIT Kharagpur (印度理工学院克勒格布尔分校); IIT Kanpur (印度理工学院坎普尔分校); Adobe Research (Adobe研究院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2026 Workshop on CVEU

Click to view the abstract

Abstract:Graphic design creation involves harmoniously assembling multimodal components such as images, text, logos, and other visual assets collected from diverse sources, into a visually-appealing and cohesive design. Recent methods have largely focused on layout prediction or complementary element generation, while retaining input elements exactly, implicitly assuming that provided components are already stylistically harmonious. In practice, inputs often come from disparate sources and exhibit visual mismatch, making this assumption limiting. We argue that identity-preserving stylization and compositing of input elements is a critical missing ingredient for truly harmonized components-to-design pipelines. To this end, we propose GIST, a training-free, identity-preserving image compositor that sits between layout prediction and typography generation, and can be plugged into any existing components-to-design or design-refining pipeline without modification. We demonstrate this by integrating GIST with two substantially different existing methods, LaDeCo and Design-o-meter. GIST shows significant improvements in visual harmony and aesthetic quality across both pipelines, as validated by LLaVA-OV and GPT-4V on aspect-wise ratings and pairwise preference over naive pasting. Project Page: this http URL.

[CV-71] Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models CVPR

【Quick Read】: This paper addresses prompt-guided image editing in visual autoregressive (VAR) models: given a source image and a target text prompt, only prompt-relevant regions should be modified while all regions unrelated to the requested edit are preserved. The key is Masked Logit Nudging: using the source image's token maps, a guidance step aligns the model's predictions under the target prompt with the source token maps. Concretely, the fixed source encodings are converted into logits via the VAR encoding, and the model's predicted logits are nudged along a semantic trajectory defined by the source-target prompt pair; edits are restricted to spatial masks obtained from a dedicated masking scheme based on cross-attention differences between the source and target prompts, and a refinement step corrects quantization errors and improves reconstruction quality. The method achieves the best editing performance on the PIE benchmark and surpasses existing methods on COCO and OpenImages, while being much faster than diffusion models.

Link: https://arxiv.org/abs/2604.14591
Authors: Amir El-Ghoussani,Marc Hölle,Gustavo Carneiro,Vasileios Belagiannis
Institutions: Friedrich-Alexander-Universität Erlangen-Nürnberg (弗里德里希·亚历山大埃尔朗根-纽伦堡大学); University of Surrey (萨里大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Findings (CVPRF)

Click to view the abstract

Abstract:We address the problem of prompt-guided image editing in visual autoregressive models. Given a source image and a target text prompt, we aim to modify the source image according to the target prompt, while preserving all regions which are unrelated to the requested edit. To this end, we present Masked Logit Nudging, which uses the source image token maps to introduce a guidance step that aligns the model's predictions under the target prompt with these source token maps. Specifically, we convert the fixed source encodings into logits using the VAR encoding, nudging the model's predicted logits towards the targets along a semantic trajectory defined by the source-target prompts. Edits are applied only within spatial masks obtained through a dedicated masking scheme that leverages cross-attention differences between the source and edited prompts. Then, we introduce a refinement to correct quantization errors and improve reconstruction quality. Our approach achieves the best image editing performance on the PIE benchmark at 512px and 1024px resolutions. Beyond editing, our method delivers faithful reconstructions and outperforms previous methods on COCO at 512px and OpenImages at 1024px. Overall, our method outperforms VAR-related approaches and achieves comparable or even better performance than diffusion models, while being much faster. Code is available at this https URL.
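The core guidance step can be sketched schematically: outside the edit mask the predicted logits are pulled toward the source token logits (so unedited regions are preserved), while masked regions remain free to follow the target prompt. The tensor shapes, the binary mask convention, and the nudging strength `alpha` below are assumptions, not the paper's exact formulation:

```python
import numpy as np

def masked_logit_nudge(pred_logits, source_logits, edit_mask, alpha=1.0):
    """Nudge predicted token logits toward the source token logits outside
    the edit mask, leaving masked (editable) regions untouched.
    pred_logits, source_logits: (H, W, V) over the token vocabulary;
    edit_mask: (H, W) with 1 where the edit is allowed."""
    keep = (1.0 - edit_mask)[..., None]   # 1 where the source must be preserved
    return pred_logits + alpha * keep * (source_logits - pred_logits)
```

With `alpha = 1.0` the preserved regions reproduce the source tokens exactly; smaller values trace intermediate points along the source-to-target trajectory.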

[CV-72] MapSR: Prompt-Driven Land Cover Map Super-Resolution via Vision Foundation Models

【Quick Read】: This paper addresses the bottleneck of costly dense annotation in high-resolution (HR) land-cover mapping. Instead of training with extensive HR labels, it reframes the task as map super-resolution and proposes a paradigm that needs no HR labels at all. The key is decoupling supervision from training: low-resolution (LR) labels are used once to extract class prompts, selecting high-confidence HR features from frozen vision-foundation-model features via a lightweight linear probe; training-free cosine-similarity matching then produces initial predictions, which are refined spatially with graph-based propagation. This cuts trainable parameters by four orders of magnitude, shortens training from hours to minutes, and reaches 59.64% mIoU on the Chesapeake Bay dataset, surpassing a fully supervised baseline while remaining competitive with the strongest weakly supervised method.

Link: https://arxiv.org/abs/2604.14582
Authors: Ruiqi Wang,Qi Yu,Jie Ma,Hanlin Wu
Institutions: Beijing Foreign Studies University (北京外国语大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:High-resolution (HR) land-cover mapping is often constrained by the high cost of dense HR annotations. We revisit this problem from the perspective of map super-resolution, which enhances coarse low-resolution (LR) land-cover products into HR maps at the resolution of the input imagery. Existing weakly supervised methods can leverage LR labels, but they typically use them to retrain dense predictors with substantial computational cost. We propose MapSR, a prompt-driven framework that decouples supervision from model training. MapSR uses LR labels once to extract class prompts from frozen vision foundation model features through a lightweight linear probe, after which HR mapping proceeds via training-free metric inference and graph-based prediction refinement. Specifically, class prompts are estimated by aggregating high-confidence HR features identified by the linear probe, and HR predictions are obtained by cosine-similarity matching followed by graph-based propagation for spatial refinement. Experiments on the Chesapeake Bay dataset show that MapSR achieves 59.64% mIoU without any HR labels, remaining competitive with the strongest weakly supervised baseline and surpassing a fully supervised baseline. Notably, MapSR reduces trainable parameters by four orders of magnitude and shortens training time from hours to minutes, enabling scalable HR mapping under limited annotation and compute budgets. The code is available at this https URL.
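The prototype-and-match steps described in the abstract can be sketched directly: average high-confidence features per class into prompts, then classify every HR feature by cosine similarity. The confidence threshold `tau` and mean pooling are assumptions; the graph-based refinement stage is omitted:

```python
import numpy as np

def extract_prototypes(features, probs, num_classes, tau=0.9):
    """Build one class prompt per class by averaging HR features whose
    linear-probe prediction is confident (max prob >= tau).
    features: (N, D); probs: (N, C) linear-probe class probabilities."""
    conf = probs.max(axis=-1)
    pred = probs.argmax(axis=-1)
    protos = np.zeros((num_classes, features.shape[-1]))
    for c in range(num_classes):
        mask = (pred == c) & (conf >= tau)
        if mask.any():
            protos[c] = features[mask].mean(axis=0)
    return protos / (np.linalg.norm(protos, axis=1, keepdims=True) + 1e-8)

def cosine_match(features, protos):
    """Training-free inference: assign each feature to its nearest prompt."""
    f = features / (np.linalg.norm(features, axis=-1, keepdims=True) + 1e-8)
    return (f @ protos.T).argmax(axis=-1)
```

Because the backbone stays frozen and only the tiny linear probe is ever trained, this is where the four-orders-of-magnitude reduction in trainable parameters comes from.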

[CV-73] urboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation

【Quick Read】: This paper addresses the heavy computational overhead of existing audio-driven talking-avatar generation models, which rely on multi-step denoising and are therefore hard to deploy in real-world settings. The key is TurboTalk, a two-stage progressive distillation framework: Distribution Matching Distillation first yields a stable, strong 4-step student, and adversarial distillation then progressively reduces denoising from 4 steps to 1; to keep training stable under such extreme step reduction, a progressive timestep sampling strategy and a self-compare adversarial objective (which provides an intermediate adversarial reference that stabilizes distillation) are introduced, ultimately enabling single-step talking-avatar generation with a 120x inference speedup at high quality.

Link: https://arxiv.org/abs/2604.14580
Authors: Xiangyu Liu,Feng Gao,Xiaomei Zhang,Yong Zhang,Xiaoming Wei,Zhen Lei,Xiangyu Zhu
Institutions: Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Meituan (美团); The Hong Kong Polytechnic University (香港理工大学); CAS (中国科学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
Comments:

Click to view the abstract

Abstract:Existing audio-driven video digital human generation models rely on multi-step denoising, resulting in substantial computational overhead that severely limits their deployment in real-world settings. While one-step distillation approaches can significantly accelerate inference, they often suffer from training instability. To address this challenge, we propose TurboTalk, a two-stage progressive distillation framework that effectively compresses a multi-step audio-driven video diffusion model into a single-step generator. We first adopt Distribution Matching Distillation to obtain a strong and stable 4-step student, and then progressively reduce the denoising steps from 4 to 1 through adversarial distillation. To ensure stable training under extreme step reduction, we introduce a progressive timestep sampling strategy and a self-compare adversarial objective that provides an intermediate adversarial reference that stabilizes progressive distillation. Our method achieve single-step generation of video talking avatar, boosting inference speed by 120 times while maintaining high generation quality.

[CV-74] M3D-Net: Multi-Modal 3D Facial Feature Reconstruction Network for Deepfake Detection

【Quick Read】: This paper addresses the limited accuracy and robustness of deepfake detection caused by insufficient exploitation of complementary multi-modal features: existing methods typically reconstruct isolated facial attributes, which makes real and forged images hard to distinguish reliably. The key is an end-to-end dual-stream network, the Multi-Modal 3D Facial Feature Reconstruction Network (M3D-Net), which recovers fine-grained facial geometry and reflectance from single-view RGB images via a self-supervised 3D face reconstruction module, adds a 3D Feature Pre-fusion Module (PFM) that adaptively adjusts multi-scale features, and uses a Multi-modal Fusion Module (MFM) with attention mechanisms to efficiently integrate RGB and 3D-reconstructed features, significantly improving detection performance and cross-scenario generalization.

Link: https://arxiv.org/abs/2604.14574
Authors: Haotian Wu,Yue Cheng,Shan Bian
Institutions: South China Agricultural University (华南农业大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:With the rapid advancement of deep learning in image generation, facial forgery techniques have achieved unprecedented realism, posing serious threats to cybersecurity and information authenticity. Most existing deepfake detection approaches rely on the reconstruction of isolated facial attributes without fully exploiting the complementary nature of multi-modal feature representations. To address these challenges, this paper proposes a novel Multi-Modal 3D Facial Feature Reconstruction Network (M3D-Net) for deepfake detection. Our method leverages an end-to-end dual-stream architecture that reconstructs fine-grained facial geometry and reflectance properties from single-view RGB images via a self-supervised 3D facial reconstruction module. The network further enhances detection performance through a 3D Feature Pre-fusion Module (PFM), which adaptively adjusts multi-scale features, and a Multi-modal Fusion Module (MFM) that effectively integrates RGB and 3D-reconstructed features using attention mechanisms. Extensive experiments on multiple public datasets demonstrate that our approach achieves state-of-the-art performance in terms of detection accuracy and robustness, significantly outperforming existing methods while exhibiting strong generalization across diverse scenarios.

[CV-75] Deepfake Detection Generalization with Diffusion Noise

【速读】:该论文旨在解决深度伪造检测模型在面对新型图像生成技术(如扩散模型生成的深度伪造)时泛化能力不足的问题,尤其是现有检测器对基于生成对抗网络(GAN)训练的伪造图像有效,但难以识别扩散模型生成的高保真伪造内容。解决方案的关键在于提出一种注意力引导的噪声学习(Attention-guided Noise Learning, ANL)框架,其核心创新是将预训练的扩散模型嵌入检测流程中,利用其去噪过程揭示图像中的细微伪影:检测器被训练为预测输入图像在特定扩散步骤下的噪声分布,从而迫使模型捕捉真实与合成图像之间的差异;同时引入基于预测噪声的注意力机制,引导模型关注全局分布的差异而非局部模式,借助冻结扩散模型所学得的自然图像分布作为正则化信号,显著提升检测器对未见伪造类型(如不同扩散模型生成的伪造图像)的泛化性能。

链接: https://arxiv.org/abs/2604.14570
作者: Hongyuan Qi,Wenjin Hou,Hehe Fan,Jun Xiao
机构: Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages

点击查看摘要

Abstract:Deepfake detectors face growing challenges in generalization as new image synthesis techniques emerge. In particular, deepfakes generated by diffusion models are highly photorealistic and often evade detectors trained on GAN-based forgeries. This paper addresses the generalization problem in deepfake detection by leveraging diffusion noise characteristics. We propose an Attention-guided Noise Learning (ANL) framework that integrates a pre-trained diffusion model into the deepfake detection pipeline to guide the learning of more robust features. Specifically, our method uses the diffusion model’s denoising process to expose subtle artifacts: the detector is trained to predict the noise contained in an input image at a given diffusion step, forcing it to capture discrepancies between real and synthetic images, while an attention-guided mechanism derived from the predicted noise is introduced to encourage the model to focus on globally distributed discrepancies rather than local patterns. By harnessing the frozen diffusion model’s learned distribution of natural images, the ANL method acts as a form of regularization, improving the detector’s generalization to unseen forgery types. Extensive experiments demonstrate that ANL significantly outperforms existing methods on multiple benchmarks, achieving state-of-the-art accuracy in detecting diffusion-generated deepfakes. Notably, the proposed framework boosts generalization performance (e.g., improving ACC/AP by a substantial margin on unseen models) without introducing additional overhead during inference. Our results highlight that diffusion noise provides a powerful signal for generalizable deepfake detection.
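ANL 的关键训练信号是“预测给定扩散步下图像所含的噪声”。下面以 NumPy 给出前向加噪与噪声回归目标的最小示意(噪声调度、步数与充当“检测器输出”的占位变量均为笔者假设,并非论文实现):

```python
import numpy as np

rng = np.random.default_rng(0)

# 假设的扩散噪声调度(线性 beta),非论文原始超参
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    """前向扩散:x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps"""
    a = alphas_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

x0 = rng.standard_normal((8, 8))      # 模拟一张输入图像
eps = rng.standard_normal(x0.shape)   # 真实噪声(回归目标)
t = 500
xt = q_sample(x0, t, eps)

# 检测器的训练目标:最小化 ||eps_pred - eps||^2
eps_pred = np.zeros_like(eps)         # 占位的“检测器输出”
noise_loss = np.mean((eps_pred - eps) ** 2)
```

该目标迫使检测器对真实与合成图像的噪声统计差异敏感,这正是文中“利用去噪过程暴露细微伪影”的直观含义。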

[CV-76] Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors CVPR2026

【速读】:该论文旨在解决基于Vision Transformer (ViT) 的稀疏多视角3D目标检测模型在推理过程中因大量token处理导致的高延迟问题。现有方法如token剪枝、合并及扩大patch尺寸虽能加速,但常丢失关键背景信息、破坏上下文一致性并损失细粒度语义,从而影响检测性能。其解决方案的关键在于提出SEPatch3D框架,通过动态调整patch大小以保留重要语义信息:首先设计时空感知的patch尺寸选择(Spatiotemporal-aware Patch Size Selection, SPSS),根据场景中物体远近自适应分配小patch(保留细节)或大patch(降低计算开销);其次引入信息感知的patch选择(Informative Patch Selection, IPS)和跨粒度特征增强(Cross-Granularity Feature Enhancement, CGFE),对关键patch进行特征精炼并注入细粒度信息,从而在显著提升推理效率的同时保持检测精度。

链接: https://arxiv.org/abs/2604.14563
作者: Mingqian Ji,Shanshan Zhang,Jian Yang
机构: Nanjing University of Science and Technology (南京理工大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely explored. However, our revisit of existing strategies, such as token pruning, merging, and patch size enlargement, reveals that they often discard informative background cues, disrupt contextual consistency, and lose fine-grained semantics, negatively affecting 3D detection. To overcome these limitations, we propose SEPatch3D, a novel framework that dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Specifically, we design Spatiotemporal-aware Patch Size Selection (SPSS) that assigns small patches to scenes containing nearby objects to preserve fine details and large patches to background-dominated scenes to reduce computation cost. To further mitigate potential detail loss, Informative Patch Selection (IPS) selects the informative patches for feature refinement, and Cross-Granularity Feature Enhancement (CGFE) injects fine-grained details into selected coarse patches, enriching semantic features. Experiments on the nuScenes and Argoverse 2 validation sets show that SEPatch3D achieves up to 57% faster inference than the StreamPETR baseline and 20% higher efficiency than the state-of-the-art ToC3D-faster, while preserving comparable detection accuracy. Code is available at this https URL.
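SPSS 的思路可以用一个极简的 token 计数示例来体会(输入尺寸、patch 大小与选择规则均为笔者假设,仅说明“大 patch 换取更少 token”的权衡):

```python
def num_tokens(h, w, patch):
    """ViT 按 patch 划分后得到的 token 数量。"""
    return (h // patch) * (w // patch)

def select_patch_size(has_nearby_object, small=16, large=32):
    """假设的选择规则:含近距离目标的画面用小 patch 保细节,
    背景为主的画面用大 patch 降低 token 数。"""
    return small if has_nearby_object else large

H, W = 256, 704            # 多视角 3D 检测中常见的输入尺寸(仅作示例)
p_near = select_patch_size(True)
p_far = select_patch_size(False)
tokens_near = num_tokens(H, W, p_near)   # 16 * 44 = 704
tokens_far = num_tokens(H, W, p_far)     # 8 * 22 = 176
token_ratio = tokens_near / tokens_far   # 4.0;注意力计算为二次方,实际节省更多
```

论文中的选择由时空线索驱动并配合 IPS/CGFE 补回细节,此处只展示 token 数量层面的权衡来源。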

[CV-77] DVFace: Spatio-Temporal Dual-Prior Diffusion for Video Face Restoration

【速读】:该论文旨在解决现有基于扩散模型的视频人脸修复方法中存在的两大问题:一是过度依赖通用扩散先验,导致面部特征适应能力不足;二是采用多步采样策略,限制了推理效率。解决方案的关键在于提出一种单步扩散框架DVFace,其核心创新为引入时空双码本设计以提取退化视频中的互补空间与时间面部先验,并通过非对称时空融合模块将这些先验按不同角色注入扩散主干网络,从而在保持高效推理的同时显著提升修复质量、时序一致性和身份保真度。

链接: https://arxiv.org/abs/2604.14560
作者: Zheng Chen,Bowen Chai,Rongjun Gao,Mingtao Nie,Xi Li,Bingnan Duan,Jianping Fang,Xiaohong Liu,Linghe Kong,Yulun Zhang
机构: Shanghai Jiaotong University (上海交通大学); Meituan Inc (美团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at: this https URL

点击查看摘要

Abstract:Video face restoration aims to enhance degraded face videos into high-quality results with realistic facial details, stable identity, and temporal coherence. Recent diffusion-based methods have brought strong generative priors to restoration and enabled more realistic detail synthesis. However, existing approaches for face videos still rely heavily on generic diffusion priors and multi-step sampling, which limit both facial adaptation and inference efficiency. These limitations motivate the use of one-step diffusion for video face restoration, yet achieving faithful facial recovery alongside temporally stable outputs remains challenging. In this paper, we propose, DVFace, a one-step diffusion framework for real-world video face restoration. Specifically, we introduce a spatio-temporal dual-codebook design to extract complementary spatial and temporal facial priors from degraded videos. We further propose an asymmetric spatio-temporal fusion module to inject these priors into the diffusion backbone according to their distinct roles. Evaluation on various benchmarks shows that DVFace delivers superior restoration quality, temporal consistency, and identity preservation compared to recent methods. Code: this https URL.
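时空双码本的“最近邻查表”步骤可以写成如下 NumPy 示意(码本规模、特征维度与随机数据均为假设;论文中码本经过训练,此处仅演示量化与双先验拼接):

```python
import numpy as np

rng = np.random.default_rng(1)

def vq_lookup(z, codebook):
    """最近邻码本量化:每个特征向量替换为欧氏距离最近的码字。"""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

K, D, N = 64, 16, 10
spatial_codebook = rng.standard_normal((K, D))    # 空间先验码本(假设)
temporal_codebook = rng.standard_normal((K, D))   # 时间先验码本(假设)

z = rng.standard_normal((N, D))                   # 从退化视频提取的特征
zq_s, idx_s = vq_lookup(z, spatial_codebook)
zq_t, idx_t = vq_lookup(z, temporal_codebook)
prior = np.concatenate([zq_s, zq_t], axis=-1)     # 双先验拼接,再注入扩散主干
```

论文的非对称融合模块会按两类先验的不同角色分别注入,此处用简单拼接代替该步骤。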

[CV-78] The Fourth Challenge on Image Super-Resolution (×4) at NTIRE 2026: Benchmark Results and Method Overview

【速读】:该论文旨在解决图像超分辨率(Image Super-Resolution, ISR)问题,即从低分辨率(Low-Resolution, LR)输入重建高分辨率(High-Resolution, HR)图像,具体针对通过双三次下采样(bicubic downsampling)生成的×4缩放因子的LR图像。解决方案的关键在于设计一个双轨评估机制:一是恢复轨道(restoration track),强调像素级保真度,以峰值信噪比(PSNR)为评价指标;二是感知轨道(perceptual track),关注视觉真实感,采用感知评分进行评估。这种分轨设计反映了ISR任务从纯数值精度向视觉质量演进的趋势,为当前技术进展提供了统一基准和深入分析框架。

链接: https://arxiv.org/abs/2604.14558
作者: Zheng Chen,Kai Liu,Jingkai Wang,Xianglong Yan,Jianze Li,Ziqing Zhang,Jue Gong,Jiatong Li,Lei Sun,Xiaoyang Liu,Radu Timofte,Yulun Zhang,Jihye Park,Yoonjin Im,Hyungju Chun,Hyunhee Park,MinKyu Park,Zheng Xie,Xiangyu Kong,Weijun Yuan,Zhan Li,Qiurong Song,Luen Zhu,Fengkai Zhang,Xinzhe Zhu,Junyang Chen,Congyu Wang,Yixin Yang,Zhaorun Zhou,Jiangxin Dong,Jinshan Pan,Shengwei Wang,Jiajie Ou,Baiang Li,Sizhuo Ma,Qiang Gao,Jusheng Zhang,Jian Wang,Keze Wang,Yijiao Liu,Yingsi Chen,Hui Li,Yu Wang,Congchao Zhu,Saeed Ahmad,Ik Hyun Lee,Jun Young Park,Ji Hwan Yoon,Kainan Yan,Zian Wang,Weibo Wang,Shihao Zou,Chao Dong,Wei Zhou,Linfeng Li,Jaeseong Lee,Jaeho Chae,Jinwoo Kim,Seonjoo Kim,Yucong Hong,Zhenming Yan,Junye Chen,Ruize Han,Song Wang,Yuxuan Jiang,Chengxi Zeng,Tianhao Peng,Fan Zhang,David Bull,Tongyao Mu,Qiong Cao,Yifan Wang,Youwei Pan,Leilei Cao,Xiaoping Peng,Wei Deng,Yifei Chen,Wenbo Xiong,Xian Hu,Yuxin Zhang,Xiaoyun Cheng,Yang Ji,Zonghao Chen,Zhihao Xue,Junqin Hu,Nihal Kumar,Snehal Singh Tomar,Klaus Mueller,Surya Vashisth,Prateek Shaily,Jayant Kumar,Hardik Sharma,Ashish Negi,Sachin Chaudhary,Akshay Dudhane,Praful Hambarde,Amit Shukla,Shijun Shi,Jiangning Zhang,Yong Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NTIRE 2026 webpage: this https URL . Code: this https URL

点击查看摘要

Abstract:This paper presents the NTIRE 2026 image super-resolution (×4) challenge, one of the associated competitions of the NTIRE 2026 Workshop at CVPR 2026. The challenge aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs generated through bicubic downsampling with a ×4 scaling factor. The objective is to develop effective super-resolution solutions and analyze recent advances in the field. To reflect the evolving objectives of image super-resolution, the challenge includes two tracks: (1) a restoration track, which emphasizes pixel-wise fidelity and ranks submissions based on PSNR; and (2) a perceptual track, which focuses on visual realism and evaluates results using a perceptual score. A total of 194 participants registered for the challenge, with 31 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, main results, and methods of participating teams. The challenge provides a unified benchmark and offers insights into current progress and future directions in image super-resolution.
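恢复赛道以 PSNR 排名,其计算可以用几行 NumPy 复现(max_val 取 8-bit 图像的 255,示例图像为人工构造):

```python
import numpy as np

def psnr(hr, sr, max_val=255.0):
    """峰值信噪比:恢复赛道的排名指标,数值越高像素保真度越好。"""
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

hr = np.full((8, 8), 128.0)
sr = hr + 4.0        # 每像素偏差 4 -> MSE = 16
val = psnr(hr, sr)   # 10*log10(255^2/16) ≈ 36.09 dB
```

感知赛道使用感知评分,无法由这类逐像素指标复现,故此处不作示例。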

[CV-79] Controllable Video Object Insertion via Multiview Priors

【速读】:该论文旨在解决视频对象插入(Video Object Insertion)任务中长期存在的挑战,即在保持插入对象外观一致性(appearance consistency)、空间对齐(spatial alignment)和时间连贯性(temporal coherence)方面的难题。现有视频生成方法往往侧重于场景整体合成,难以在动态环境中实现高质量的对象融合。其解决方案的关键在于引入多视角对象先验(multi-view object priors),通过将2D参考图像提升为多视角表示,并结合双路径视图一致条件机制(dual-path view-consistent conditioning mechanism),实现稳定的身份引导与跨视角鲁棒集成;同时设计了一个质量感知加权机制以自适应处理噪声或不完整输入,并引入集成感知一致性模块(Integration-Aware Consistency Module)来保障空间真实感,有效解决遮挡与边界伪影问题,从而维持帧间的时间连续性。

链接: https://arxiv.org/abs/2604.14556
作者: Xia Qi,Peishan Cong,Yichen Yao,Ziyi Wang,Yaoqin Ye,Yuexin Ma
机构: ShanghaiTech University (上海科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Video object insertion is a critical task for dynamically inserting new objects into existing environments. Previous video generation methods focus primarily on synthesizing entire scenes while struggling with ensuring consistent object appearance, spatial alignment, and temporal coherence when inserting objects into existing videos. In this paper, we propose a novel solution for Video Object Insertion, which integrates multi-view object priors to address the common challenges of appearance inconsistency and occlusion handling in dynamic environments. By lifting 2D reference images into multi-view representations and leveraging a dual-path view-consistent conditioning mechanism, our framework ensures stable identity guidance and robust integration across diverse viewpoints. A quality-aware weighting mechanism is also employed to adaptively handle noisy or imperfect inputs. Additionally, we introduce an Integration-Aware Consistency Module that guarantees spatial realism, effectively resolving occlusion and boundary artifacts while maintaining temporal continuity across frames. Experimental results show that our solution significantly improves the quality of video object insertion, providing stable and realistic integration.
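质量感知加权机制的核心是按质量分数对各视角特征做 softmax 加权,下面是一个极简示意(视角数、特征维度与质量分数均为假设,并非论文网络结构):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_views(view_feats, quality_scores):
    """质量感知加权:按质量分数 softmax 加权融合多视角特征,
    噪声/低质量视角的权重被自适应压低。"""
    w = softmax(np.asarray(quality_scores, dtype=np.float64))
    return (w[:, None] * view_feats).sum(axis=0), w

V, D = 4, 8
rng = np.random.default_rng(2)
feats = rng.standard_normal((V, D))     # 由 2D 参考图提升得到的多视角特征(模拟)
quality = [2.0, 1.5, -3.0, 1.8]         # 假设第 3 个视角质量差
fused, w = fuse_views(feats, quality)
```

论文中的质量分数由网络估计,此处直接给定数值,仅展示“低质量输入被降权”的效果。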

[CV-80] Giving Faces Their Feelings Back: Explicit Emotion Control for Feedforward Single-Image 3D Head Avatars

【速读】:该论文旨在解决现有单图像3D头像重建方法中情感状态与几何或外观特征隐式耦合的问题,导致情感控制不可控且难以跨身份一致应用。其关键解决方案是提出一种双路径调制机制(dual-path modulation mechanism),将情感作为独立的控制信号注入前馈架构:几何调制在原始参数空间中进行情感条件归一化,从而解耦情感状态与语音驱动的面部运动;外观调制则捕捉与身份相关的、依赖于情感的视觉线索,增强表情表现力。该方法无需修改骨干网络结构即可实现可控的情感迁移、解耦操作及平滑插值,显著提升3D头像的表达能力和可扩展性。

链接: https://arxiv.org/abs/2604.14541
作者: Yicheng Gong,Jiawei Zhang,Liqiang Liu,Yanwen Wang,Lei Chu,Jiahao Li,Hao Pan,Hao Zhu,Yan Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present a framework for explicit emotion control in feed-forward, single-image 3D head avatar reconstruction. Unlike existing pipelines where emotion is implicitly entangled with geometry or appearance, we treat emotion as a first-class control signal that can be manipulated independently and consistently across identities. Our method injects emotion into existing feed-forward architectures via a dual-path modulation mechanism without modifying their core design. Geometry modulation performs emotion-conditioned normalization in the original parametric space, disentangling emotional state from speech-driven articulation, while appearance modulation captures identity-aware, emotion-dependent visual cues beyond geometry. To enable learning under this setting, we construct a time-synchronized, emotion-consistent multi-identity dataset by transferring aligned emotional dynamics across identities. Integrated into multiple state-of-the-art backbones, our framework preserves reconstruction and reenactment fidelity while enabling controllable emotion transfer, disentangled manipulation, and smooth emotion interpolation, advancing expressive and scalable 3D head avatars.

[CV-81] WILD-SAM: Phase-Aware Expert Adaptation of SAM for Landslide Detection in Wrapped InSAR Interferograms

【速读】:该论文旨在解决从包裹相位(wrapped phase)的干涉合成孔径雷达(InSAR)干涉图中直接检测缓慢移动滑坡的问题,该任务因严重的相位模糊性和复杂的相干噪声而面临挑战。现有方法难以有效提取边界特征,尤其是高频率条纹信息在跨域迁移时易被抑制。解决方案的关键在于提出WILD-SAM框架,其核心创新包括:1)引入相位感知混合专家(Phase-Aware Mixture-of-Experts, PA-MoE)适配器,通过动态路由机制在冻结编码器中自适应聚合多尺度频谱-纹理先验,缓解自然图像与InSAR相位数据之间的谱域偏移;2)设计小波引导子带增强(Wavelet-Guided Subband Enhancement, WGSE)策略,利用离散小波变换显式分离高频子带并优化方向性相位纹理,生成频率感知的密集提示(dense prompts),从而保障滑坡边界拓扑完整性。实验表明,该方法在目标完整性和轮廓保真度上均达到当前最优性能。

链接: https://arxiv.org/abs/2604.14540
作者: Yucheng Pan,Heping Li,Zhangle Liu,Sajid Hussain,Bin Pan
机构: Wuhan University (武汉大学); Hubei Luojia Laboratory (湖北珞珈实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Detecting slow-moving landslides directly from wrapped Interferometric Synthetic Aperture Radar (InSAR) interferograms is crucial for efficient geohazard monitoring, yet it remains fundamentally challenged by severe phase ambiguity and complex coherence noise. While the Segment Anything Model (SAM) offers a powerful foundation for segmentation, its direct transfer to wrapped phase data is hindered by a profound spectral domain shift, which suppresses the high-frequency fringes essential for boundary delineation. To bridge this gap, we propose WILD-SAM, a novel parameter-efficient fine-tuning framework specifically designed to adapt SAM for high-precision landslide detection on wrapped interferograms. Specifically, the architecture integrates a Phase-Aware Mixture-of-Experts (PA-MoE) Adapter into the frozen encoder to align spectral distributions and introduces a Wavelet-Guided Subband Enhancement (WGSE) strategy to generate frequency-aware dense prompts. The PA-MoE Adapter exploits a dynamic routing mechanism across heterogeneous convolutional experts to adaptively aggregate multi-scale spectral-textural priors, effectively aligning the distribution discrepancy between natural images and interferometric phase data. Meanwhile, the WGSE strategy leverages discrete wavelet transforms to explicitly disentangle high-frequency subbands and refine directional phase textures, injecting these structural cues as dense prompts to ensure topological integrity along sharp landslide boundaries. Extensive experiments on the ISSLIDE and ISSLIDE+ benchmarks demonstrate that WILD-SAM achieves state-of-the-art performance, significantly outperforming existing methods in both target completeness and contour fidelity.
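WGSE 中“用小波分离高频子带以突出边界”的做法,可用一次标准 Haar 变换来示意(论文中的小波为可学习形式,此处用固定 Haar 与人工构造的竖直边缘代替相位条纹):

```python
import numpy as np

def haar_dwt2(x):
    """单级 2D Haar 小波:返回低频 LL 与高频 (LH, HL, HH) 子带,
    边缘等高频结构集中在差分子带中。"""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # 行方向平均
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # 行方向差分
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

x = np.zeros((8, 8))
x[:, 3:] = 1.0                  # 竖直边缘:干涉条纹的简化替身
ll, lh, hl, hh = haar_dwt2(x)
edge_energy = np.abs(lh).sum()  # 竖直边缘的响应集中在水平差分子带 LH
```

论文将此类高频子带作为密集提示注入,以保持滑坡边界的拓扑完整性。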

[CV-82] Design and Validation of a Low-Cost Smartphone Based Fluorescence Detection Platform Compared with Conventional Microplate Readers

【速读】:该论文旨在解决传统微孔板读取设备(如Perkin Elmer Victor仪器)成本高昂的问题,提出一种基于荧光的低成本光学检测系统,用于在稀释样本中检测特定微生物和分子。其解决方案的关键在于利用智能手机摄像头作为光学探测器,通过建立样本在RGB颜色空间中的图像颜色与荧光物质摩尔浓度之间的定量关系,实现无需昂贵滤光片、屏障滤光片和光电倍增管等元件的高效检测。

链接: https://arxiv.org/abs/2604.14527
作者: Zhendong Cao,Katrina G. Salvante,Ash Parameswaran,Pablo A. Nepomnaschy,Hongji Dai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
备注: 4 pages

点击查看摘要

Abstract:A low cost fluorescence-based optical system is developed for detecting the presence of certain microorganisms and molecules within a diluted sample. A specifically designed device setup compatible with conventional 96-well plates is chosen to create an ideal environment in which a smartphone camera can be used as the optical detector. In comparison with conventional microplate reading machines such as the Perkin Elmer Victor machine, the device presented in this paper is not equipped with expensive elements such as an exciter filter, barrier filter, and photomultiplier; instead, a phone camera is all that is needed to detect fluorescence within the sample. The strategy involved is to determine the relationship between the image color of the sample in RGB color space and the molar concentration of the fluorescent specimen in that sample. This manuscript is a preprint version of work related to a publication in IEEE. The final version may differ from this manuscript.
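论文的标定思路是建立 RGB 颜色(例如绿色通道均值)与摩尔浓度之间的定量关系,下面用最小二乘拟合给出示意(标定数据为虚构示例,实际需用荧光标准品测得):

```python
import numpy as np

# 假设的标定数据:已知摩尔浓度样本及其图像绿色通道均值
conc = np.array([0.0, 0.5, 1.0, 2.0, 4.0])         # 浓度(单位假设为 uM)
green = np.array([10.0, 32.0, 55.0, 98.0, 190.0])  # G 通道均值(0-255)

# 最小二乘拟合线性模型 G = k*c + b,再反解未知样本的浓度
k, b = np.polyfit(conc, green, deg=1)

def estimate_concentration(g_mean):
    """由未知样本的绿色通道均值反推浓度。"""
    return (g_mean - b) / k

c_unknown = estimate_concentration(100.0)
```

线性关系仅在荧光未饱和、相机响应近似线性的区间内成立,超出该区间需分段标定。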

[CV-83] FreqTrack: Frequency Learning based Vision Transformer for RGB-Event Object Tracking

【速读】:该论文旨在解决现有单模态RGB跟踪器在复杂动态场景中性能受限的问题,尤其是如何有效利用事件传感器(event sensor)提供的高时间分辨率和高频响应特性来提升跟踪鲁棒性。当前大多数RGB-事件融合方法主要基于空间域的卷积、Transformer或Mamba架构,难以充分挖掘事件数据的独特时序特征与高频信息。其解决方案的关键在于提出一种频域感知的RGBE跟踪框架FreqTrack,通过频率域变换建立跨模态互补相关性;核心创新包括设计Spectral Enhancement Transformer(SET)层,引入多头动态傅里叶滤波机制以自适应增强和选择频域特征,以及Wavelet Edge Refinement(WER)模块,利用可学习的小波变换显式提取事件数据中的多尺度边缘结构,从而显著提升高速运动和低光照条件下的建模能力。

链接: https://arxiv.org/abs/2604.14526
作者: Jinlin You,Muyu Li,Xudong Zhao
机构: Dalian University of Technology (大连理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing single-modal RGB trackers often face performance bottlenecks in complex dynamic scenes, while the introduction of event sensors offers new potential for enhancing tracking capabilities. However, most current RGB-event fusion methods, primarily designed in the spatial domain using convolutional, Transformer, or Mamba architectures, fail to fully exploit the unique temporal response and high-frequency characteristics of event data. To address this, we propose FreqTrack, a frequency-aware RGBE tracking framework that establishes complementary inter-modal correlations through frequency-domain transformations for more robust feature fusion. We design a Spectral Enhancement Transformer (SET) layer that incorporates multi-head dynamic Fourier filtering to adaptively enhance and select frequency-domain features. Additionally, we develop a Wavelet Edge Refinement (WER) module, which leverages learnable wavelet transforms to explicitly extract multi-scale edge structures from event data, effectively improving modeling capability in high-speed and low-light scenarios. Extensive experiments on the COESOT and FE108 datasets demonstrate that FreqTrack achieves highly competitive performance, particularly attaining leading precision of 76.6% on the COESOT benchmark, validating the effectiveness of frequency-domain modeling for RGBE tracking.
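SET 层的“频域滤波”骨架是 FFT、逐频点加权、逆 FFT 三步,可用 NumPy 示意如下(论文中的滤波器由多头动态生成,此处用固定低通滤波器代替):

```python
import numpy as np

def fourier_filter(feat, filt):
    """频域滤波:FFT -> 按滤波器逐频点加权 -> 逆 FFT 取实部。"""
    F = np.fft.fft2(feat)
    return np.real(np.fft.ifft2(F * filt))

H = W = 16
rng = np.random.default_rng(3)
feat = rng.standard_normal((H, W))   # 模拟一张特征图

# 简单低通:只保留靠近零频的频点(0.2 为假设的截止频率)
fy = np.fft.fftfreq(H)[:, None]
fx = np.fft.fftfreq(W)[None, :]
lowpass = (np.sqrt(fy ** 2 + fx ** 2) < 0.2).astype(np.float64)

smoothed = fourier_filter(feat, lowpass)  # 高频被抑制,方差下降
```

动态版本只需把 `lowpass` 换成由各注意力头预测的逐频点权重,骨架不变。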

[CV-84] Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)中存在的性能悖论问题,即在实际评估中,单一模态基线模型往往优于联合多模态推理结果。作者指出,这一现象源于现有模型普遍采用的静态融合拓扑结构所引发的两种结构性缺陷:顺序输入中的位置偏倚和交错格式中的对齐陷阱,二者均会系统性地扭曲注意力机制,且与任务语义无关。解决方案的关键在于提出一种名为“模态链”(Chain of Modality, CoM)的代理框架,该框架将多模态融合从被动拼接转变为动态编排,通过自适应切换并行、顺序和交错三种输入路径来消除结构偏差;同时,CoM 将认知执行过程解耦为两个任务对齐路径——面向直接感知的“直觉决策”路径(Direct-Decide)和用于分析审核的“推理决策”路径(Reason-Decide),从而实现无需训练或仅需少量监督微调(SFT)即可在多个基准上获得稳定且泛化的性能表现。

链接: https://arxiv.org/abs/2604.14520
作者: Ziyang Luo,Nian Liu,Junwei Han
机构: Northwestern Polytechnical University(西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Omni-modal Large Language Models (Omni-MLLMs) promise a unified integration of diverse sensory streams. However, recent evaluations reveal a critical performance paradox: unimodal baselines frequently outperform joint multimodal inference. We trace this perceptual fragility to the static fusion topologies universally employed by current models, identifying two structural pathologies: positional bias in sequential inputs and alignment traps in interleaved formats, which systematically distort attention regardless of task semantics. To resolve this functional rigidity, we propose Chain of Modality (CoM), an agentic framework that transitions multimodal fusion from passive concatenation to dynamic orchestration. CoM adaptively orchestrates input topologies, switching among parallel, sequential, and interleaved pathways to neutralize structural biases. Furthermore, CoM bifurcates cognitive execution into two task-aligned pathways: a streamlined "Direct-Decide" path for direct perception and a structured "Reason-Decide" path for analytical auditing. Operating in either a training-free or a data-efficient SFT setting, CoM achieves robust and consistent generalization across diverse benchmarks.

[CV-85] CI-CBM: Class-Incremental Concept Bottleneck Model for Interpretable Continual Learning

【速读】:该论文旨在解决持续学习(continual learning)中的灾难性遗忘(catastrophic forgetting)问题,尤其是在类增量学习(class incremental learning, CIL)场景下,模型在学习新任务时容易丢失先前任务的知识。为应对这一挑战,作者提出Class-Incremental Concept Bottleneck Model (CI-CBM),其核心在于引入概念正则化(concept regularization)和伪概念生成(pseudo-concept generation)技术,以在增量学习过程中保持决策过程的可解释性,同时不牺牲模型性能。实验表明,CI-CBM在七个数据集上达到与黑盒模型相当甚至更优的准确率,且相比现有可解释方法平均提升36%的准确率,实现了人类可理解的概念在持续学习中的稳定维持。

链接: https://arxiv.org/abs/2604.14519
作者: Amirhosein Javadi,Tuomas Oikarinen,Tara Javidi,Tsui-Wei Weng
机构: University of California San Diego (加州大学圣地亚哥分校); Halıcıoğlu Data Science Institute (Halıcıoğlu数据科学研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 6 figures. Published in Transactions on Machine Learning Research (TMLR), 04/2026

点击查看摘要

Abstract:Catastrophic forgetting remains a fundamental challenge in continual learning, in which models often forget previous knowledge when fine-tuned on a new task. This issue is especially pronounced in class incremental learning (CIL), which is the most challenging setting in continual learning. Existing methods to address catastrophic forgetting often sacrifice either model interpretability or accuracy. To address this challenge, we introduce the Class-Incremental Concept Bottleneck Model (CI-CBM), which leverages effective techniques, including concept regularization and pseudo-concept generation, to maintain interpretable decision processes throughout incremental learning phases. Through extensive evaluation on seven datasets, CI-CBM achieves comparable performance to black-box models and outperforms previous interpretable approaches in CIL, with an average 36% accuracy gain. CI-CBM provides interpretable decisions on individual inputs and understandable global decision rules, as shown in our experiments, thereby demonstrating that human-understandable concepts can be maintained during incremental learning without compromising model performance. Our approach is effective in both pretrained and non-pretrained scenarios; in the latter, the backbone is trained from scratch during the first learning phase. Code is publicly available at this http URL.
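概念瓶颈模型的决策路径是“特征 → 概念激活 → 线性分类”,其可解释性来自中间概念层可被人工检查。下面是一个不含训练过程的结构示意(维度与随机权重均为假设):

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D, C, K = 32, 6, 3                              # 特征维、概念数、类别数(假设)
W_concept = rng.standard_normal((D, C)) * 0.1   # 特征 -> 概念
W_cls = rng.standard_normal((C, K)) * 0.1       # 概念 -> 类别:人类可读的线性规则

x = rng.standard_normal(D)          # 骨干网络输出的特征(模拟)
concepts = sigmoid(x @ W_concept)   # 每个概念的激活度,可逐项检查
logits = concepts @ W_cls
pred = int(np.argmax(logits))
```

论文的贡献在于让这一瓶颈结构在类增量阶段中保持稳定(概念正则化 + 伪概念生成),此处仅给出推理骨架。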

[CV-86] H2VLR: Heterogeneous Hypergraph Vision-Language Reasoning for Few-Shot Anomaly Detection

【速读】:该论文旨在解决少样本异常检测(Few-Shot Anomaly Detection, FSAD)中因数据稀缺导致性能受限的问题,尤其针对现有基于视觉-语言模型(Vision-Language Model, VLM)的方法仅依赖成对特征匹配、忽视结构依赖关系与全局一致性的问题。其解决方案的关键在于提出一种异质超图视觉-语言推理框架(Heterogeneous Hypergraph Vision-Language Reasoning, H2VLR),将FSAD建模为视觉-语义关系的高阶推理问题,通过统一建模视觉区域与语义概念的超图结构,显式捕捉多层级关系与全局一致性,从而显著提升异常检测性能,在工业和医学基准上均达到当前最优(State-of-the-Art, SOTA)效果。

链接: https://arxiv.org/abs/2604.14507
作者: Jianghong Huang,Luping Ji,Weiwei Duan,Mao Ye
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:As a classic vision task, anomaly detection has been widely applied in industrial inspection and medical imaging. In this task, data scarcity is often a frequently-faced issue. To solve it, the few-shot anomaly detection (FSAD) scheme is attracting increasing attention. In recent years, beyond the traditional visual paradigm, the Vision-Language Model (VLM) has been extensively explored to boost this field. However, almost all existing VLM-based FSAD schemes perform anomaly inference only by pairwise feature matching, ignoring structural dependencies and global consistency. To further advance FSAD via VLM, we propose a Heterogeneous Hypergraph Vision-Language Reasoning (H2VLR) framework. It reformulates FSAD as a high-order inference problem of visual-semantic relations, by jointly modeling visual regions and semantic concepts in a unified hypergraph. Experimental comparisons verify the effectiveness and advantages of H2VLR. It often achieves state-of-the-art (SOTA) performance on representative industrial and medical benchmarks. Our code will be released upon acceptance.

[CV-87] Co-distilled attention guided masked image modeling with noisy teacher for self-supervised learning on medical images

【速读】:该论文旨在解决自监督学习(SSL)在医学图像中因随机掩码导致的信息泄露和预训练难度降低的问题,尤其针对基于分层移位窗口(Swin)的Transformer架构缺乏全局[CLS]标记而无法应用先进掩码策略的局限性。其解决方案的关键在于提出一种注意力引导的掩码机制,并结合共蒸馏学习框架(co-distillation learning framework),通过选择性地掩码语义共现且具有判别性的图像块来减少信息泄露并提升预训练难度;进一步地,为缓解注意力头多样性下降对下游任务性能的负面影响,首次引入一个带噪声的教师模型(noisy teacher)到该框架中,从而在保持高注意力头多样性的同时实现有效的掩码策略,该方法命名为DAGMaN,在肺结节分类、免疫治疗预后预测、肿瘤分割及无监督器官聚类等多个任务上验证了其有效性。

链接: https://arxiv.org/abs/2604.14506
作者: Jue Jiang,Aneesh Rangnekar,Harini Veeraraghavan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at MIDL 2025

点击查看摘要

Abstract:Masked image modeling (MIM) is a highly effective self-supervised learning (SSL) approach to extract useful feature representations from unannotated data. Predominantly used random masking methods make SSL less effective for medical images due to the contextual similarity of neighboring patches, leading to information leakage and SSL simplification. Hierarchical shifted window (Swin) transformer, a highly effective approach for medical images cannot use advanced masking methods as it lacks a global [CLS] token. Hence, we introduced an attention guided masking mechanism for Swin within a co-distillation learning framework to selectively mask semantically co-occurring and discriminative patches, to reduce information leakage and increase the difficulty of SSL pretraining. However, attention guided masking inevitably reduces the diversity of attention heads, which negatively impacts downstream task performance. To address this, we for the first time, integrate a noisy teacher into the co-distillation framework (termed DAGMaN) that performs attentive masking while preserving high attention head diversity. We demonstrate the capability of DAGMaN on multiple tasks including full- and few-shot lung nodule classification, immunotherapy outcome prediction, tumor segmentation, and unsupervised organs clustering.
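注意力引导掩码的基本操作是按注意力分数选出最具判别性的 patch 加以遮挡,可示意如下(patch 数与掩码比例为假设;论文中注意力来自共蒸馏框架的教师网络,而非随机分数):

```python
import numpy as np

def attention_guided_mask(attn, mask_ratio=0.5):
    """按注意力分数从高到低选 patch 掩码:优先遮住判别性强的区域,
    减少相邻 patch 的信息泄露并提高 MIM 预训练难度。"""
    n = attn.shape[0]
    k = int(n * mask_ratio)
    order = np.argsort(-attn)      # 注意力降序排列
    mask = np.zeros(n, dtype=bool)
    mask[order[:k]] = True         # True 表示该 patch 被掩码
    return mask

rng = np.random.default_rng(5)
attn = rng.random(16)              # 模拟 16 个 patch 的注意力分数
mask = attention_guided_mask(attn, 0.5)
```

与随机掩码相比,这种策略遮住的是语义共现的判别性区域;论文进一步引入带噪教师以避免注意力头多样性因此下降。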

[CV-88] CooperDrive: Enhancing Driving Decisions Through Cooperative Perception ICRA2026

【速读】:该论文旨在解决自动驾驶车辆在遮挡和非视距(Non-Line-of-Sight, NLOS)场景下因感知受限导致反应延迟、碰撞风险增加的问题。其核心解决方案是提出CooperDrive框架,关键在于通过轻量级的目标级信息共享与融合策略,在不改变各车原有感知、定位与规划模块的前提下,复用检测器的鸟瞰图(Bird’s-Eye View, BEV)特征来估计精确车辆位姿,从而重建BEV表示并低延迟地供给规划器;同时利用扩展的目标集合提前识别潜在冲突并主动调整速度与轨迹,将原本的被动响应行为转化为预测性、更安全的决策机制。

链接: https://arxiv.org/abs/2604.14454
作者: Deyuan Qu,Qi Chen,Takayuki Shimizu,Onur Altintas
机构: Toyota InfoTech Labs (丰田信息科技实验室)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICRA 2026

点击查看摘要

Abstract:Autonomous vehicles equipped with robust onboard perception, localization, and planning still face limitations in occlusion and non-line-of-sight (NLOS) scenarios, where delayed reactions can increase collision risk. We propose CooperDrive, a cooperative perception framework that augments situational awareness and enables earlier, safer driving decisions. CooperDrive offers two key advantages: (i) each vehicle retains its native perception, localization, and planning stack, and (ii) a lightweight object-level sharing and fusion strategy bridges perception and planning. Specifically, CooperDrive reuses detector Bird’s-Eye View (BEV) features to estimate accurate vehicle poses without additional heavy encoders, thereby reconstructing BEV representations and feeding the planner with low latency. On the planning side, CooperDrive leverages the expanded object set to anticipate potential conflicts earlier and adjust speed and trajectory proactively, thereby transforming reactive behaviors into predictive and safer driving decisions. Real-world closed-loop tests at occlusion-heavy NLOS intersections demonstrate that CooperDrive increases reaction lead time, minimum time-to-collision (TTC), and stopping margin, while requiring only 90 kbps bandwidth and maintaining an average end-to-end latency of 89 ms.
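文中以最小 TTC(time-to-collision)衡量安全裕度,其定义可以用几行代码说明(距离、速度数值与预警阈值均为示例假设,并非论文实测值):

```python
def time_to_collision(gap_m, ego_speed_mps, lead_speed_mps):
    """TTC = 间距 / 相对接近速度;双方不接近时返回无穷大。"""
    closing = ego_speed_mps - lead_speed_mps
    if closing <= 0:
        return float("inf")
    return gap_m / closing

# 遮挡路口场景:协同感知让本车提前得知 30 m 外存在目标,
# 本车 15 m/s、目标 5 m/s 同向行驶
ttc = time_to_collision(gap_m=30.0, ego_speed_mps=15.0, lead_speed_mps=5.0)  # 3.0 s
brake_needed = ttc < 4.0   # 假设的预警阈值:提前减速而非被动急刹
```

协同感知的收益正体现在:目标进入视距之前 TTC 即可被计算,从而把反应提前量转化为更大的停车裕度。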

[CV-89] Crowdsourcing of Real-world Image Annotation via Visual Properties

【速读】:该论文旨在解决当前对象识别数据集中的语义鸿沟(semantic gap)问题,即视觉数据与语言描述之间存在复杂的多对多映射关系,导致标注主观性强、模型性能受限。解决方案的关键在于提出一种融合知识表示(knowledge representation)、自然语言处理(natural language processing)和计算机视觉技术的图像标注方法,通过引入视觉属性约束来降低标注者的主观性,并设计了一个基于预定义对象类别层次结构和标注者反馈动态提问的交互式众包框架,从而引导标注过程聚焦于视觉属性,提升标注的一致性和准确性。

链接: https://arxiv.org/abs/2604.14449
作者: Xiaolei Diao,Fausto Giunchiglia
机构: University College London (伦敦大学学院); University of Trento (特伦托大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in data-centric artificial intelligence highlight inherent limitations in object recognition datasets. One of the primary issues stems from the semantic gap problem, which results in complex many-to-many mappings between visual data and linguistic descriptions. This bias adversely affects performance in computer vision tasks. This paper proposes an image annotation methodology that integrates knowledge representation, natural language processing, and computer vision techniques, aiming to reduce annotator subjectivity by applying visual property constraints. We introduce an interactive crowdsourcing framework that dynamically asks questions based on a predefined object category hierarchy and annotator feedback, guiding image annotation by visual properties. Experiments demonstrate the effectiveness of this methodology, and annotator feedback is discussed to optimize the crowdsourcing setup.

[CV-90] Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers CVPR2026

【速读】:该论文旨在解决视觉 Transformer 中“注册 token”(register)的功能必要性问题,特别是针对零消融(zero-ablation)方法是否能准确反映其对下游任务的贡献。研究表明,传统零消融策略会显著降低分类和分割性能(如 -36.6 pp 分类精度),从而被广泛认为注册 token 是功能不可或缺的;但论文通过引入三种替代控制方法(均值替换、噪声替换、跨图像注册洗牌),发现这些操作在保持性能稳定(误差 < 1 pp)的同时,仍能显著扰动内部表示(如每 patch 的余弦相似度变化),说明零消融对注册内容的敏感性远超其他扰动方式。因此,论文的核心结论是:零消融高估了注册 token 对任务性能的依赖,真正影响模型表现的是注册 token 所提供的“类似注册的激活模式”,而非具体的图像特定数值。这一发现强调了在评估神经网络组件时需谨慎使用零消融,并建议采用更合理的扰动对照实验以避免误判。

链接: https://arxiv.org/abs/2604.14433
作者: Felipe Parodi,Jordan Matelsky,Melanie Segado
机构: University of Pennsylvania (宾夕法尼亚大学); Johns Hopkins Applied Physics Laboratory (约翰霍普金斯应用物理实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 10 figures, to be published in CVPR 2026 HOW Vision Interpretability Workshop Proceedings

点击查看摘要

Abstract:Zero-ablation – replacing token activations with zero vectors – is widely used to probe token function in vision transformers. Register zeroing in DINOv2+registers and DINOv3 produces large drops (up to -36.6 pp classification, -30.9 pp segmentation), suggesting registers are functionally indispensable. However, three replacement controls – mean-substitution, noise-substitution, and cross-image register-shuffling – preserve performance across classification, correspondence, and segmentation, remaining within ~1 pp of the unmodified baseline. Per-patch cosine similarity shows these replacements genuinely perturb internal representations, while zeroing causes disproportionately large perturbations, consistent with why it alone degrades tasks. We conclude that zero-ablation overstates dependence on exact register content. In the frozen-feature evaluations we test, performance depends on plausible register-like activations rather than on exact image-specific values. Registers nevertheless buffer dense features from [CLS] dependence and are associated with compressed patch geometry. These findings, including the replacement-control results, replicate at ViT-B scale.
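论文的核心对照是:零消融造成的表示扰动远大于均值替换或跨图洗牌。下面用随机向量示意这一差异(数据分布与维度均为假设,仅说明当激活均值显著偏离零时,置零与替换控制在余弦相似度上的差距):

```python
import numpy as np

rng = np.random.default_rng(6)

def cos(a, b):
    """余弦相似度;零向量按约定返回 0。"""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# 一批模拟的"寄存器 token"激活:均值显著偏离零
registers = rng.standard_normal((100, 64)) + 3.0

zeroed = np.zeros(64)                        # 零消融
mean_sub = registers.mean(axis=0)            # 均值替换控制
shuffled = registers[rng.permutation(100)]   # 跨样本洗牌控制

r = registers[0]
sim_zero = cos(r, zeroed)     # 0:置零完全破坏激活方向
sim_mean = cos(r, mean_sub)   # 接近 1:均值替换保留"类似寄存器"的方向
sim_shuf = cos(r, shuffled[0])
```

这与论文结论一致:下游任务依赖的是“合理的寄存器状激活”,而非每张图各自的具体数值。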

[CV-91] FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste Smell Texture and Sound from Images

【Quick Summary】: This paper addresses image-based prediction of multisensory experience: inferring taste, smell, texture, and sound from food images, an area largely unexplored in prior vision-language research. The key to the solution is FoodSense, a large-scale human-annotated dataset for cross-sensory inference containing 66,842 participant-image pairs, each providing numeric ratings (1-5) and free-text descriptors across the four sensory dimensions. Short human annotations are expanded into image-grounded reasoning traces generated by a large language model, yielding interpretable visual justifications. These annotations are used to train FoodSense-VL, a model that outputs multisensory ratings and grounded explanations directly from food images, connecting cognitive-science findings on cross-sensory perception with modern instruction tuning.

Link: https://arxiv.org/abs/2604.14388
Authors: Sabab Ishraq(1),Aarushi Aarushi(2),Juncai Jiang(2),Chen Chen(3) ((1) University of Central Florida, College of Engineering and Computer Science, Orlando, FL, USA, (2) University of Central Florida, College of Business Administration, Orlando, FL, USA, (3) University of Central Florida, Institute of Artificial Intelligence, Orlando, FL, USA)
Affiliations: University of Central Florida
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Humans routinely infer taste, smell, texture, and even sound from food images, a phenomenon well studied in cognitive science. However, prior vision language research on food has focused primarily on recognition tasks such as meal identification, ingredient detection, and nutrition estimation. Image-based prediction of multisensory experience remains largely unexplored. We introduce FoodSense, a human-annotated dataset for cross-sensory inference containing 66,842 participant-image pairs across 2,987 unique food images. Each pair includes numeric ratings (1-5) and free-text descriptors for four sensory dimensions: taste, smell, texture, and sound. To enable models to both predict and explain sensory expectations, we expand short human annotations into image-grounded reasoning traces. A large language model generates visual justifications conditioned on the image, ratings, and descriptors. Using these annotations, we train FoodSense-VL, a vision language benchmark model to produce both multisensory ratings and grounded explanations directly from food images. This work connects cognitive science findings on cross-sensory perception with modern instruction tuning for multimodal models and shows that many popular evaluation metrics are insufficient for visually sensory inference.

[CV-92] Step-level Denoising-time Diffusion Alignment with Multiple Objectives

【Quick Summary】: This paper tackles multi-objective alignment of diffusion models with human preferences: balancing downstream objectives such as aesthetic quality and text-image consistency without introducing approximation error. Existing approaches either rely on costly multi-objective reinforcement learning (RL) fine-tuning or fuse separately aligned models at denoising time, but they typically require access to reward values or their gradients and tend to introduce approximation error. The key idea is a step-level RL formulation, on top of which the authors propose Multi-objective Step-level Denoising-time Diffusion Alignment (MSDDA), a retraining-free framework that obtains the optimal reverse denoising distribution in closed form, with mean and variance expressed directly in terms of single-objective base models. The denoising-time objective is proven exactly equivalent to step-level RL fine-tuning, so no approximation error is incurred.

Link: https://arxiv.org/abs/2604.14379
Authors: Qi Zhang,Dawei Wang,Shaofeng Zou
Affiliations: Arizona State University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reinforcement learning (RL) has emerged as a powerful tool for aligning diffusion models with human preferences, typically by optimizing a single reward function under a KL regularization constraint. In practice, however, human preferences are inherently pluralistic, and aligned models must balance multiple downstream objectives, such as aesthetic quality and text-image consistency. Existing multi-objective approaches either rely on costly multi-objective RL fine-tuning or on fusing separately aligned models at denoising time, but they generally require access to reward values (or their gradients) and/or introduce approximation error in the resulting denoising objectives. In this paper, we revisit the problem of RL fine-tuning for diffusion models and address the intractability of identifying the optimal policy by introducing a step-level RL formulation. Building on this, we further propose Multi-objective Step-level Denoising-time Diffusion Alignment (MSDDA), a retraining-free framework for aligning diffusion models with multiple objectives, obtaining the optimal reverse denoising distribution in closed form, with mean and variance expressed directly in terms of single-objective base models. We prove that this denoising-time objective is exactly equivalent to the step-level RL fine-tuning, introducing no approximation error. Moreover, we provide numerical results, which indicate our method outperforms existing denoising-time approaches.

[CV-93] SatBLIP: Context Understanding and Feature Identification from Satellite Imagery with Vision-Language Learning

【Quick Summary】: This paper addresses the limited insight that conventional Social Vulnerability Index (SVI) measures offer into rural environmental risk: they are spatially coarse and ignore place-based conditions such as housing quality, road access, and land-surface patterns. The key to the solution is SatBLIP, a satellite-specific vision-language framework that couples contrastive image-text alignment with bootstrapped captioning tailored to satellite semantics, enabling fine-grained feature identification and SVI prediction. GPT-4o first generates structured descriptions of satellite tiles (roof type/condition, house size, yard attributes, greenery, and road context); a satellite-adapted BLIP model is then fine-tuned to caption unseen imagery; finally, CLIP-encoded image features are fused with LLM-derived embeddings via attention to estimate county-level SVI. SHAP analysis identifies salient attributes (e.g., roof form/condition, street width, vegetation, and cars/open space) that consistently drive robust predictions, enabling interpretable mapping of rural risk environments.

Link: https://arxiv.org/abs/2604.14373
Authors: Xue Wu,Shengting Cao,Jiaqi Gong
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Rural environmental risks are shaped by place-based conditions (e.g., housing quality, road access, land-surface patterns), yet standard vulnerability indices are coarse and provide limited insight into risk contexts. We propose SatBLIP, a satellite-specific vision-language framework for rural context understanding and feature identification that predicts county-level Social Vulnerability Index (SVI). SatBLIP addresses limitations of prior remote sensing pipelines-handcrafted features, manual virtual audits, and natural-image-trained VLMs-by coupling contrastive image-text alignment with bootstrapped captioning tailored to satellite semantics. We use GPT-4o to generate structured descriptions of satellite tiles (roof type/condition, house size, yard attributes, greenery, and road context), then fine-tune a satellite-adapted BLIP model to generate captions for unseen images. Captions are encoded with CLIP and fused with LLM-derived embeddings via attention for SVI estimation under spatial aggregation. Using SHAP, we identify salient attributes (e.g., roof form/condition, street width, vegetation, cars/open space) that consistently drive robust predictions, enabling interpretable mapping of rural risk environments.

[CV-94] Interpretable Human Activity Recognition for Subtle Robbery Detection in Surveillance Videos

【Quick Summary】: This paper addresses the difficulty of automatically detecting non-violent street robberies (snatch-and-run) in unconstrained surveillance video: such events are brief, subtle, and often indistinguishable from benign human interactions. The key to the solution is a hybrid, pose-driven approach: a YOLO-based pose estimator extracts body keypoints for each tracked person, from which kinematic and interaction features describing hand speed, arm extension, proximity, and relative aggressor-victim motion are computed; a Random Forest classifier discriminates on these features, and a temporal hysteresis filter stabilizes frame-level predictions and reduces false alarms. The full pipeline runs in real time on an NVIDIA Jetson Nano, demonstrating the feasibility and effectiveness of on-device deployment.

Link: https://arxiv.org/abs/2604.14329
Authors: Bryan Jhoan Cazáres Leyva,Ulises Gachuz Davila,José Juan González Fonseca,Juan Irving Vasquez,Vanessa A. Camacho-Vázquez,Sergio Isahí Garrido-Castañeda
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: submitted to MCPR

Abstract:Non-violent street robberies (snatch-and-run) are difficult to detect automatically because they are brief, subtle, and often indistinguishable from benign human interactions in unconstrained surveillance footage. This paper presents a hybrid, pose-driven approach for detecting snatch-and-run events that combines real-time perception with an interpretable classification stage suitable for edge deployment. The system uses a YOLO-based pose estimator to extract body keypoints for each tracked person and computes kinematic and interaction features describing hand speed, arm extension, proximity, and relative motion between an aggressor-victim pair. A Random Forest classifier is trained on these descriptors, and a temporal hysteresis filter is applied to stabilize frame-level predictions and reduce spurious alarms. We evaluate the method on a staged dataset and on a disjoint test set collected from internet videos, demonstrating promising generalization across different scenes and camera viewpoints. Finally, we implement the complete pipeline on an NVIDIA Jetson Nano and report real-time performance, supporting the feasibility of proactive, on-device robbery detection.
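Two of the pipeline's pieces are easy to sketch in plain Python (illustrative only; the thresholds, frame rate, and scores below are invented, not the paper's values): a per-frame keypoint speed feature, and the temporal hysteresis filter that keeps an alarm raised until the score drops well below the trigger level:

```python
def hand_speed(prev_xy, cur_xy, dt=1.0 / 30):
    """Speed of a hand keypoint between consecutive frames (pixels per second)."""
    dx, dy = cur_xy[0] - prev_xy[0], cur_xy[1] - prev_xy[1]
    return (dx * dx + dy * dy) ** 0.5 / dt

def hysteresis(scores, t_on=0.7, t_off=0.4):
    """Raise the alarm when the score crosses t_on; keep it until it falls below t_off."""
    state, out = False, []
    for s in scores:
        if not state and s >= t_on:
            state = True
        elif state and s < t_off:
            state = False
        out.append(state)
    return out

frame_scores = [0.2, 0.75, 0.6, 0.5, 0.45, 0.3, 0.8]
flags = hysteresis(frame_scores)
speed = hand_speed((0.0, 0.0), (3.0, 4.0), dt=1.0)
```

The two-threshold design is what suppresses flicker: a classifier score that hovers around a single threshold would toggle the alarm every frame, whereas hysteresis only releases it once the score falls clearly below `t_off`.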

[CV-95] Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

【Quick Summary】: This paper tackles a new problem: generating geometrically consistent multi-view scenes from a single freehand sketch. Sketches are geometrically impoverished inputs whose spatial distortions conflict with any consistent 3D interpretation, and no prior method attempts the task directly; existing approaches require photographs or text, or multiple sketch views and costly per-scene optimization. The key lies in three mutually reinforcing contributions: (i) a curated dataset of about 9k sketch-to-multiview samples built via an automated generation and filtering pipeline; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; and (iii) a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions to enforce cross-view consistency. The framework synthesizes all views in a single denoising pass, with no reference images, iterative refinement, or per-scene optimization, and substantially outperforms two-stage baselines, improving realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23% with up to a 3.7x inference speedup.

Link: https://arxiv.org/abs/2604.14302
Authors: Ahmed Bourouis,Savas Ozkan,Andrea Maracani,Yi-Zhe Song,Mete Ozay
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We tackle a new problem: generating geometrically consistent multi-view scenes from a single freehand sketch. Freehand sketches are the most geometrically impoverished input one could offer a multi-view generator. They convey scene intent through abstract strokes while introducing spatial distortions that actively conflict with any consistent 3D interpretation. No prior method attempts this; existing multi-view approaches require photographs or text, while sketch-to-3D methods need multiple views or costly per-scene optimisation. We address three compounding challenges; absent training data, the need for geometric reasoning from distorted 2D input, and cross-view consistency, through three mutually reinforcing contributions: (i) a curated dataset of ~9k sketch-to-multiview samples, constructed via an automated generation and filtering pipeline; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; and (iii) a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions. Our framework synthesizes all views in a single denoising process without requiring reference images, iterative refinement, or per-scene optimization. Our approach significantly outperforms state-of-the-art two-stage baselines, improving realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23%, while providing up to a 3.7x inference speedup.

[CV-96] HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

【Quick Summary】: This paper addresses the problem of producing high-quality, navigable 3D world representations from multi-modal inputs, in particular the fidelity and consistency challenges of reconstructing complex scenes from low-dimensional inputs such as text or single-view images. The key innovations of the proposed HY-World 2.0 framework are: (1) a four-stage pipeline of panorama generation (HY-Pano 2.0), trajectory planning (WorldNav), world expansion (WorldStereo 2.0), and world composition (WorldMirror 2.0), which builds complete 3D Gaussian Splatting (3DGS) scenes end to end from the initial input; (2) WorldStereo 2.0, a keyframe-based view generation model with consistent memory, and WorldMirror 2.0, a universal 3D prediction model with a refined architecture and training strategy, which together markedly improve reconstruction quality and consistency from multi-view images and video; and (3) WorldLens, a high-performance rendering platform that supports interactive exploration and integrates automatic IBL lighting, efficient collision detection, and training-rendering co-design, advancing the practical deployment of 3D world models.

Link: https://arxiv.org/abs/2604.14268
Authors: Team HY-World,Chenjie Cao,Xuhui Zuo,Zhenwei Wang,Yisu Zhang,Junta Wu,Zhenyang Liu,Yuning Gong,Yang Liu,Bo Yuan,Chao Zhang,Coopers Li,Dongyuan Guo,Fan Yang,Haiyu Zhang,Hang Cao,Jianchen Zhu,Jiaxin Lin,Jie Xiao,Jihong Zhang,Junlin Yu,Lei Wang,Lifu Wang,Lilin Wang,Linus,Minghui Chen,Peng He,Penghao Zhao,Qi Chen,Rui Chen,Rui Shao,Sicong Liu,Wangchen Qin,Xiaochuan Niu,Xiang Yuan,Yi Sun,Yifei Tang,Yifu Sun,Yihang Lian,Yonghao Tan,Yuhong Liu,Yuyang Yin,Zhiyuan Min,Tengfei Wang,Chunchao Guo
Affiliations: Tencent Hunyuan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL ; Code: this https URL

Abstract:We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. With text or single-view image inputs, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four-stage method: a) Panorama Generation with HY-Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model with consistent memory. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining model architecture and learning strategy, enabling world reconstruction from multi-view images or videos. Also, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible engine-agnostic architecture, automatic IBL lighting, efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance on several benchmarks among open-source approaches, delivering results comparable to the closed-source model Marble. We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.

[CV-97] QualiaNet: An Experience-Before-Inference Network

【Quick Summary】: This paper addresses the separation of the Experience Module and the Inference Module in human 3D vision: how true scene scale is inferred from the experience of stereo disparity. The central puzzle is that although the experience of stereo vision carries no direct distance information, it strongly influences judgments of object size and scene depth. The key to the solution is a two-stage computational architecture, QualiaNet: the first stage simulates the human Experience Module, producing disparity maps matched to human perception, and the second stage trains a CNN to exploit a natural scene statistic, namely that near scenes produce vivid disparity gradients while far scenes appear comparatively flat. The network recovers distance accurately from disparity gradients alone, validating the proposed mechanism.

Link: https://arxiv.org/abs/2604.14193
Authors: Paul Linton
Affiliations: Columbia University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Human 3D vision involves two distinct stages: an Experience Module, where stereo depth is extracted relative to fixation, and an Inference Module, where this experience is interpreted to estimate 3D scene properties. Paradoxically, although our experience of stereo vision does not provide us with distance information, it does affect our inferences about visual scale. We propose the Inference Module exploits a natural scene statistic: near scenes produce vivid disparity gradients, while far scenes appear comparatively flat. QualiaNet implements this two-stage architecture computationally: disparity maps simulating human stereo experience are passed to a CNN trained to estimate distance. The network can recover distance from disparity gradients alone, validating this approach.
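The scene statistic the abstract describes (near scenes have vivid disparity gradients, far scenes are comparatively flat) can be illustrated with a toy gradient measure; this is a sketch only, and the paper's actual disparity simulation and CNN are far richer:

```python
def mean_abs_gradient(disparity):
    """Mean absolute horizontal + vertical gradient of a 2D disparity map."""
    h, w = len(disparity), len(disparity[0])
    total, count = 0.0, 0
    for i in range(h):
        for j in range(w):
            if j + 1 < w:
                total += abs(disparity[i][j + 1] - disparity[i][j]); count += 1
            if i + 1 < h:
                total += abs(disparity[i + 1][j] - disparity[i][j]); count += 1
    return total / count

near = [[0.5 * j for j in range(4)] for _ in range(4)]  # vivid gradient (near scene)
far = [[0.05] * 4 for _ in range(4)]                    # nearly flat (far scene)
```

A statistic of this kind is monotone in scene proximity on these toy maps, which is the regularity the Inference Module is proposed to exploit.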

[CV-98] Generative Modeling of Complex-Valued Brain MRI Data

【Quick Summary】: This paper addresses the fact that standard MRI reconstruction pipelines discard phase information, even though phase may encode tissue properties relevant to tumor diagnosis; conventional machine learning methods operate only on reconstructed magnitude images and ignore the pathological features that phase can carry. The key to the solution is a generative framework combining a conditional variational autoencoder with a flow-matching-based generative model, which jointly models the magnitude and phase of complex-valued MRI scans while preserving phase coherence. The synthetic samples are nearly indistinguishable from real data in real-versus-synthetic classification (AUROC 0.50-0.66), and downstream classifiers for abnormal-tissue detection trained entirely on synthetic data outperform the real-data baseline (AUROC 0.880 vs. 0.842), validating the approach and its potential for exploiting the complete information in brain MRI.

Link: https://arxiv.org/abs/2604.14800
Authors: Marco Schlimbach,Moritz Rempe,Jessica Mnischek,Lukas T. Rotkopf,Jens Weingarten,Jens Kleesiek,Kevin Kröninger
Affiliations: Technical University Dortmund; University Hospital Essen; University Medicine Essen; Cancer Research Center Cologne Essen; RACOON Study Group; German Cancer Consortium (DKTK); University of Duisburg-Essen; German Cancer Research Center (DKFZ)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: 16 pages, 8 figures

Abstract:Objective. Standard Magnetic Resonance Imaging (MRI) reconstruction pipelines discard phase information captured during acquisition, despite evidence that it encodes tissue properties relevant to tumor diagnosis. Current machine learning approaches inherit this limitation by operating exclusively on reconstructed magnitude images. The aim of this study is to build a generative framework which is capable of jointly modeling magnitude and phase information of complex-valued MRI scans. Approach. The proposed generative framework combines a conditional variational autoencoder, which compresses complex-valued MRI scans into compact latent representations while preserving phase coherence, with a flow-matching-based generative model. Synthetic sample quality is assessed via a real-versus-synthetic classifier and by training downstream classifiers on synthetic data for abnormal tissue detection. Main results. The autoencoder preserves phase coherence above 0.997. Real-versus-synthetic classification yields low AUROC values between 0.50 and 0.66 across all acquisition sequences, indicating generated samples are nearly indistinguishable from real data. In downstream normal-versus-abnormal classification, classifiers trained entirely on synthetic data achieve an AUROC of 0.880, surpassing the real-data baseline of 0.842 on a publicly available dataset (fastMRI). This advantage persists on an independent external test set from a different institution with biopsy-confirmed labels. Significance. The proposed framework demonstrates the feasibility of jointly modeling magnitude and phase information for normal and abnormal complex-valued brain MRI data. Beyond synthetic data generation, it establishes a foundation for the usage of complete brain MRI information in future diagnostic applications and enables systematic investigation of how magnitude and phase jointly encode pathology-specific features.
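Phase coherence between an original and a reconstructed complex-valued signal can be measured in several ways; the abstract does not give a formula, so the choice below (magnitude of the mean unit phasor of the per-sample phase difference, in [0, 1]) is an assumption for illustration:

```python
import cmath

def phase_coherence(a, b):
    """Magnitude of the mean unit phasor of the phase difference; 1.0 = perfectly coherent."""
    phasors = [cmath.exp(1j * (cmath.phase(x) - cmath.phase(y))) for x, y in zip(a, b)]
    return abs(sum(phasors) / len(phasors))

# Toy complex-valued signal and a reconstruction with a small, varying phase error.
orig = [cmath.rect(1.0, 0.1 * k) for k in range(8)]
recon = [cmath.rect(0.9, 0.1 * k + 0.005 * k) for k in range(8)]

c_self = phase_coherence(orig, orig)    # identical phases
c_recon = phase_coherence(orig, recon)  # small phase error
```

Note the metric deliberately ignores magnitude error (the 0.9 amplitude above), isolating exactly the phase fidelity the autoencoder is required to preserve.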

[CV-99] FAIR Universe Weak Lensing ML Uncertainty Challenge: Handling Uncertainties and Distribution Shifts for Precision Cosmology

【Quick Summary】: This paper addresses three challenges facing machine-learning (ML) based weak gravitational lensing analysis: cosmological simulations are computationally expensive, limiting training data; inaccurate modeling of systematics in the simulations creates distribution shifts that can bias cosmological parameter constraints; and varying simulation setups across studies make method comparison difficult. The key to the solution is the first weak-lensing benchmark dataset with several realistic systematics, together with the FAIR Universe Weak Lensing Machine Learning Uncertainty Challenge: a two-phase competition focused on accurately measuring fundamental properties of the universe under limited training data and potential distribution shifts, providing a standardized evaluation framework for ML methods and facilitating their reliable deployment in upcoming weak-lensing survey analysis.

Link: https://arxiv.org/abs/2604.14451
Authors: Biwei Dai,Po-Wen Chang,Wahid Bhimji,Paolo Calafiura,Ragansu Chakkappai,Yuan-Tang Chou,Sascha Diefenbacher,Jordan Dudley,Ibrahim Elsharkawy,Steven Farrell,Isabelle Guyon,Chris Harris,Elham E Khoda,Benjamin Nachman,David Rousseau,Uroš Seljak,Ihsan Ullah,Yulei Zhang
Affiliations: Lawrence Berkeley National Laboratory; Université Paris-Saclay, CNRS/IN2P3, IJCLab; ChaLearn; University of Washington; University of California, Berkeley; University of Toronto; University of California, San Diego; Stanford University; SLAC National Accelerator Laboratory
Subjects: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an)
Comments: Whitepaper for the FAIR Universe Weak Lensing ML Uncertainty Challenge Competition. More info is available at our GitHub repository this https URL . 13 pages, 5 figures, 1 table

Abstract:Weak gravitational lensing, the correlated distortion of background galaxy shapes by foreground structures, is a powerful probe of the matter distribution in our universe and allows accurate constraints on the cosmological model. In recent years, high-order statistics and machine learning (ML) techniques have been applied to weak lensing data to extract the nonlinear information beyond traditional two-point analysis. However, these methods typically rely on cosmological simulations, which poses several challenges: simulations are computationally expensive, limiting most realistic setups to a low training data regime; inaccurate modeling of systematics in the simulations create distribution shifts that can bias cosmological parameter constraints; and varying simulation setups across studies make method comparison difficult. To address these difficulties, we present the first weak lensing benchmark dataset with several realistic systematics and launch the FAIR Universe Weak Lensing Machine Learning Uncertainty Challenge. The challenge focuses on measuring the fundamental properties of the universe from weak lensing data with limited training set and potential distribution shifts, while providing a standardized benchmark for rigorous comparison across methods. Organized in two phases, the challenge will bring together the physics and ML communities to advance the methodologies for handling systematic uncertainties, data efficiency, and distribution shifts in weak lensing analysis with ML, ultimately facilitating the deployment of ML approaches into upcoming weak lensing survey analysis.

[CV-100] A deep learning framework for glomeruli segmentation with boundary attention

【Quick Summary】: This paper addresses the accuracy of glomeruli detection and segmentation in kidney tissue, in particular the failure of conventional semantic-segmentation-based deep learning to precisely delineate the boundaries between adjacent glomeruli. The key to the solution is a U-Net-based model that leverages pathology foundation models and incorporates a specialised attention decoder to strengthen feature representation in critical regions, thereby improving instance-level segmentation; experiments show the approach surpasses state-of-the-art methods in both Dice score and Intersection over Union (IoU).

Link: https://arxiv.org/abs/2604.14263
Authors: Behnaz Elhaminia,Catherine King,Jiaqi Lv,Lorraine Harper,Paul Moss,Owen Cain,Dimitrios Chanouzas,Shan E Ahmed Raza
Affiliations: unknown
Subjects: Tissues and Organs (q-bio.TO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Accurate detection and segmentation of glomeruli in kidney tissue are essential for diagnostic applications. Traditional deep learning methods primarily rely on semantic segmentation, which often fails to precisely delineate adjacent glomeruli. To address this challenge, we propose a novel glomerulus detection and segmentation model that emphasises boundary separation. Leveraging pathology foundation models, the proposed U-Net-based architecture incorporates a specialised attention decoder designed to highlight critical regions and improve instance-level segmentation. Experimental evaluations demonstrate that our approach surpasses state-of-the-art methods in both Dice score and Intersection over Union, indicating superior performance in glomerular delineation.
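The two reported metrics, Dice and IoU, reduce to simple set arithmetic on binary masks; a minimal reference implementation (toy masks, unrelated to the paper's data):

```python
def dice_iou(pred, gt):
    """Dice score and IoU for binary masks given as sets of pixel coordinates."""
    inter = len(pred & gt)
    dice = 2 * inter / (len(pred) + len(gt)) if (pred or gt) else 1.0
    iou = inter / len(pred | gt) if (pred | gt) else 1.0
    return dice, iou

pred = {(0, 0), (0, 1), (1, 0)}
gt = {(0, 1), (1, 0), (1, 1)}
d, i = dice_iou(pred, gt)  # intersection has 2 of 3 pixels on each side
```

Dice always upper-bounds IoU for the same masks, which is why papers typically report both: they weight boundary errors differently, and boundary separation between adjacent glomeruli is precisely where the two diverge.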

[CV-101] Quantitative measurements of biological/chemical concentrations using smartphone cameras

【Quick Summary】: This paper addresses the bulk, cost, and inconvenience of conventional biological/chemical assay instruments, particularly for use in remote or resource-poor areas; the core problem is rapid, portable quantification of the concentrations of diverse biological/chemical samples. The key to the solution is a smartphone-based imaging system: sample images are captured through a designated optical setup, and image processing and data analysis techniques build a database characterizing the relationship between color information and concentration, enabling accurate estimation of the concentrations of fluorescent materials and colloidal mixtures, with performance comparable to commercial and laboratory-grade instruments.

Link: https://arxiv.org/abs/2603.27118
Authors: Zhendong Cao,Hongji Dai,Zhida Li,Ash Parameswaran
Affiliations: unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Systems and Control (eess.SY)
Comments:

Abstract:This paper presents a smartphone-based imaging system capable of quantifying the concentration of an assortment of biological/chemical assay samples. The main objective is to construct an image database which characterizes the relationship between color information and concentrations of the biological/chemical assay sample. For this aim, a designated optical setup combined with image processing and data analyzing techniques was implemented. A series of experiments conducted on selected assays, including fluorescein, RNA Mango, homogenized milk and yeast have demonstrated that the proposed system estimates the concentration of fluorescent materials and colloidal mixtures comparable to currently used commercial and laboratory instruments. Furthermore, by utilizing the camera and computational power of smartphones, eventual development can be directed toward extremely compact, inexpensive and portable analysis and diagnostic systems which will allow experiments and tests to be conducted in remote or impoverished areas.
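The core calibration step, mapping a measured color intensity to a concentration via a stored characterization curve, can be sketched as piecewise-linear interpolation; the calibration points below are invented for illustration and are not from the paper:

```python
def interpolate_concentration(intensity, calibration):
    """Piecewise-linear lookup of concentration from a calibration curve
    given as (intensity, concentration) pairs; clamps outside the curve."""
    pts = sorted(calibration)
    if intensity <= pts[0][0]:
        return pts[0][1]
    if intensity >= pts[-1][0]:
        return pts[-1][1]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= intensity <= x1:
            t = (intensity - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)

# Hypothetical calibration: mean channel intensity -> concentration (a.u.)
calib = [(20, 0.0), (80, 1.0), (140, 2.5), (200, 5.0)]
c = interpolate_concentration(110, calib)
```

In practice the curve would be built per assay and per device from the image database the paper describes, since camera response and illumination vary between phones.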

Artificial Intelligence

[AI-0] Generalization in LLM Problem Solving: The Case of the Shortest Path

【Quick Summary】: This paper addresses the debated question of whether language models can systematically generalize, i.e., maintain stable performance in unseen settings or on harder tasks. The core difficulty is that existing experiments struggle to disentangle the effects of training data, training paradigm, and inference-time strategy. The authors therefore build a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem that cleanly separates these factors and supports two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer decision horizons. The key finding is that models exhibit strong spatial transfer but consistently fail under length scaling due to recursive instability, indicating that the fundamental bottleneck in systematic problem solving is not data coverage or training method but intrinsic structural weaknesses at inference time.

Link: https://arxiv.org/abs/2604.15306
Authors: Yao Tong,Jiayuan Ye,Anastasia Borovykh,Reza Shokri
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Whether language models can systematically generalize remains actively debated. Yet empirical performance is jointly shaped by multiple factors such as training data, training paradigms, and inference-time strategies, making failures difficult to interpret. We introduce a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem. The setup enables clean separation of these factors and supports two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer-horizon problems. We find that models exhibit strong spatial transfer but consistently fail under length scaling due to recursive instability. We further analyze how distinct stages of the learning pipeline influence systematic problem-solving: for example, data coverage sets capability limits; reinforcement learning improves training stability but does not expand those limits; and inference-time scaling enhances performance but cannot rescue length-scaling failures.
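Shortest-path instances of the kind described can be generated and labeled with a standard BFS; this is a sketch of the task setup, not the paper's pipeline, and the grid below is invented:

```python
from collections import deque

def shortest_path(grid, start, goal):
    """BFS shortest path on a 4-connected grid; 0 = free cell, 1 = wall."""
    h, w = len(grid), len(grid[0])
    prev = {start: None}
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        if (x, y) == goal:
            break
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < h and 0 <= ny < w and grid[nx][ny] == 0 and (nx, ny) not in prev:
                prev[(nx, ny)] = (x, y)
                queue.append((nx, ny))
    if goal not in prev:
        return None  # unreachable
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]  # start -> goal

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
path = shortest_path(grid, (0, 0), (2, 0))
```

Varying the map (spatial transfer) versus the path length (length scaling) gives exactly the two orthogonal generalization axes the abstract studies.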

[AI-1] How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study ACL2026

【Quick Summary】: This paper asks whether large language models (LLMs) and vision-language models (VLMs), given only textual descriptions and no visual input, possess a fundamental capability of spatial intelligence: viewpoint rotation understanding (VRU). The study finds that although models encode viewpoint information, they struggle in the final layers to bind the viewpoint position to the corresponding observation, producing reasoning errors and even hallucinations. The key to the solution is to identify the attention heads critical to VRU performance via layer-wise probing and head-wise causal intervention, and then selectively fine-tune those heads, improving VRU while avoiding catastrophic forgetting of generic abilities.

Link: https://arxiv.org/abs/2604.15294
Authors: Zhen Yang,Ping Jian,Zhongbin Guo,Zuming Zhang,Chengzhi Li,Yonghong Deng,Xinyue Zhang,Wenpeng Lu
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Published as a main-conference paper at The 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Abstract:Over the past year, spatial intelligence has drawn increasing attention. Many prior works study it from the perspective of visual-spatial intelligence, where models have access to visuospatial information from visual inputs. However, in the absence of visual information, whether linguistic intelligence alone is sufficient to endow models with spatial intelligence, and how models perform relevant tasks with text-only inputs still remain unexplored. Therefore, in this paper, we focus on a fundamental and critical capability in spatial intelligence from a linguistic perspective: viewpoint rotation understanding (VRU). Specifically, LLMs and VLMs are asked to infer their final viewpoint and predict the corresponding observation in an environment given textual description of viewpoint rotation and observation over multiple steps. We find that both LLMs and VLMs perform poorly on our proposed dataset while human can easily achieve 100% accuracy, indicating a substantial gap between current model capabilities and the requirements of spatial intelligence. To uncover the underlying mechanisms, we conduct a layer-wise probing analysis and head-wise causal intervention. Our findings reveal that although models encode viewpoint information in the hidden states, they appear to struggle to bind the viewpoint position with corresponding observation, resulting in a hallucination in final layers. Finally, we selectively fine-tune the key attention heads identified by causal intervention to improve VRU performance. Experimental results demonstrate that such selective fine-tuning achieves improved VRU performance while avoiding catastrophic forgetting of generic abilities. Our dataset and code will be released at this https URL .

[AI-2] Prism: Symbolic Superoptimization of Tensor Programs

【Quick Summary】: This paper addresses the search-space explosion and inefficiency of automatically optimizing large tensor programs, especially under modern ML workloads such as generative AI, where traditional superoptimizers and compiler-based methods struggle to balance optimization quality against compute cost. The key to the solution is Prism, the first symbolic superoptimizer for tensor programs. Its core innovation is sGraph, a symbolic hierarchical representation that compactly encodes large families of tensor programs by symbolically representing execution parameters; a two-level search first constructs symbolic graphs representing program families and then instantiates them into concrete implementations via auto-tuning, while symbolic reasoning over operator semantics, algebraic identities, and hardware constraints prunes the search space in a structured way, preserving the rigor of exhaustive search while achieving the scalability modern workloads require.

Link: https://arxiv.org/abs/2604.15272
Authors: Mengdi Wu,Xiaoyu Jiang,Oded Padon,Zhihao Jia
Affiliations: unknown
Subjects: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This paper presents Prism, the first symbolic superoptimizer for tensor programs. The key idea is sGraph, a symbolic, hierarchical representation that compactly encodes large classes of tensor programs by symbolically representing some execution parameters. Prism organizes optimization as a two-level search: it constructs symbolic graphs that represent families of programs, and then instantiates them into concrete implementations. This formulation enables structured pruning of provably suboptimal regions of the search space using symbolic reasoning over operator semantics, algebraic identities, and hardware constraints. We develop techniques for efficient symbolic graph generation, equivalence verification via e-graph rewriting, and parameter instantiation through auto-tuning. Together, these components allow Prism to bridge the rigor of exhaustive search with the scalability required for modern ML workloads. Evaluation on five commonly used LLM workloads shows that Prism achieves up to 2.2x speedup over best superoptimizers and 4.9x over best compiler-based approaches, while reducing end-to-end optimization time by up to 3.4x.

[AI-3] Stability and Generalization in Looped Transformers

【Quick Summary】: This paper studies how looped transformers can extrapolate to harder problems at test time rather than memorize training-specific solutions; the central question is which architectural choices guarantee stable iterative behavior and meaningful predictions at test time. The key to the solution is a fixed-point-based framework that evaluates looped architectures along three axes of stability: reachability, input-dependence, and geometry. Theoretically, only recall combined with outer normalization is shown to produce a stable regime in which fixed points are simultaneously reachable, locally smooth, and supported by stable backpropagation, which markedly improves generalization on tasks such as chess, sudoku, and prefix-sums.

Link: https://arxiv.org/abs/2604.15259
Authors: Asher Labovich
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 main pages, 27 total

Abstract:Looped transformers promise test-time compute scaling by spending more iterations on harder problems, but it remains unclear which architectural choices let them extrapolate to harder problems at test time rather than memorize training-specific solutions. We introduce a fixed-point based framework for analyzing looped architectures along three axes of stability – reachability, input-dependence, and geometry – and use it to characterize when fixed-point iteration yields meaningful predictions. Theoretically, we prove that looped networks without recall have countable fixed points and cannot achieve strong input-dependence at any spectral regime, while recall combined with outer normalization reliably produces a regime in which fixed points are simultaneously reachable, locally smooth in the input, and supported by stable backpropagation. Empirically, we train single-layer looped transformers on chess, sudoku, and prefix-sums and find that downstream performance tracks the framework’s predictions across tasks and architectural configurations. We additionally introduce internal recall, a novel recall placement variant, and show that it becomes competitive with – and on sudoku, substantially better than – standard recall placement once outer normalization is applied.
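The "recall plus outer normalization" recipe can be sketched as a fixed-point iteration in which the input is re-injected at every loop and the state is renormalized after each update; the linear map and dimensions below are invented, so this illustrates the iteration scheme rather than the paper's transformer:

```python
def normalize(v):
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v] if n else v

def step(h, x, W):
    """One loop iteration: linear map, recall (re-inject x), then outer normalization."""
    pre = [sum(W[i][j] * h[j] for j in range(len(h))) + x[i] for i in range(len(h))]
    return normalize(pre)

def iterate_to_fixed_point(x, W, iters=200, tol=1e-9):
    h = normalize(x)
    for _ in range(iters):
        h_new = step(h, x, W)
        if max(abs(a - b) for a, b in zip(h_new, h)) < tol:
            return h_new
        h = h_new
    return h

W = [[0.5, 0.1],
     [0.0, 0.5]]   # invented weights with spectral radius < 1
x = [1.0, 0.5]
h_star = iterate_to_fixed_point(x, W)
# Residual: how much one extra loop iteration moves the converged state.
residual = max(abs(a - b) for a, b in zip(step(h_star, x, W), h_star))
```

Because x is folded into every step, the fixed point depends on the input (input-dependence), and normalization keeps the iterates bounded so more test-time iterations cannot diverge, which is the intuition behind the stable regime the paper characterizes.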

[AI-4] Agentic Microphysics: A Manifesto for Generative AI Safety

【速读】:该论文旨在解决当前安全研究在智能体化人工智能(Agentic AI)中的方法论缺口问题,即传统基于孤立模型或宏观结果的分析无法识别由代理间结构化交互所引发的群体级风险及其控制变量。解决方案的关键在于提出两个相互关联的核心概念:一是“智能体微观物理”(Agentic microphysics),定义了局部交互动力学层面——即一个智能体的输出如何在特定协议条件下成为另一个智能体的输入;二是“生成式安全”(Generative safety),作为一种方法论,通过从微观条件出发生成现象并诱发风险,从而识别充分机制、检测阈值并设计有效干预措施,实现对群体动态的因果显式建模与可控干预。

链接: https://arxiv.org/abs/2604.15236
作者: Federico Pierucci,Matteo Prandi,Marcantonio Bracale Syrnikov,Marcello Galisai,Piercosma Bisconti
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper advances a methodological proposal for safety research in agentic AI. As systems acquire planning, memory, tool use, persistent identity, and sustained interaction, safety can no longer be analysed primarily at the level of the isolated model. Population-level risks arise from structured interaction among agents, through processes of communication, observation, and mutual influence that shape collective behaviour over time. As the object of analysis shifts, a methodological gap emerges. Approaches focused either on single agents or on aggregate outcomes do not identify the interaction-level mechanisms that generate collective risks or the design variables that control them. A framework is required that links local interaction structure to population-level dynamics in a causally explicit way, allowing both explanation and intervention. We introduce two linked concepts. Agentic microphysics defines the level of analysis: local interaction dynamics where one agent’s output becomes another’s input under specific protocol conditions. Generative safety defines the methodology: growing phenomena and eliciting risks from micro-level conditions to identify sufficient mechanisms, detect thresholds, and design effective interventions.

[AI-5] Blue Data Intelligence Layer: Streaming Data and Agents for Multi-source Multi-modal Data-Centric Applications

【速读】:该论文旨在解决现实场景中自然语言到SQL(NL2SQL)系统面临的局限性问题,即用户查询往往不是单一数据库内的简单请求,而是涉及多源数据、跨模态信息以及常识或外部知识的复杂交互。传统NL2SQL系统受限于封闭世界假设和单一数据源,难以满足企业级应用对动态整合异构数据的需求。解决方案的关键在于提出Blue的数据智能层(Data Intelligence Layer, DIL),其核心机制是构建一个统一的数据注册表(data registry),将结构化企业数据、通过大语言模型(LLM)获取的世界知识、以及用户交互获得的个人上下文作为第一类数据源进行管理,并利用数据规划器(data planners)将自然语言查询转化为可执行的声明式查询计划,从而实现多源检索、跨模态推理与结果融合,推动复合型AI系统从单数据库NL2SQL向多源、多模态、以数据为中心的智能交互演进。

链接: https://arxiv.org/abs/2604.15233
作者: Moin Aminnaseri,Farima Fatahi Bayat,Nikita Bhutani,Jean-Flavien Bussotti,Kevin Chan,Rafael Li Chen,Yanlin Feng,Jackson Hassell,Estevam Hruschka,Eser Kandogan,Hannah Kim,James Levine,Seiji Maekawa,Jalal Mahmud,Kushan Mitra,Naoki Otani,Pouya Pezeshkpour,Nima Shahbazi,Chen Shen,Dan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:NL2SQL systems aim to address the growing need for natural language interaction with data. However, real-world information rarely maps to a single SQL query because (1) users express queries iteratively (2) questions often span multiple data sources beyond the closed-world assumption of a single database, and (3) queries frequently rely on commonsense or external knowledge. Consequently, satisfying realistic data needs require integrating heterogeneous sources, modalities, and contextual data. In this paper, we present Blue’s Data Intelligence Layer (DIL) designed to support multi-source, multi-modal, and data-centric applications. Blue is a compound AI system that orchestrates agents and data for enterprise settings. DIL serves as the data intelligence layer for agentic data processing, to bridge the semantic gap between user intent and available information by unifying structured enterprise data, world knowledge accessible through LLMs, and personal context obtained through interaction. At the core of DIL is a data registry that stores metadata for diverse data sources and modalities to enable both native and natural language queries. DIL treats LLMs, the Web, and the User as source ‘databases’, each with their own query interface, elevating them to first-class data sources. DIL relies on data planners to transform user queries into executable query plans. These plans are declarative abstractions that unify relational operators with other operators spanning multiple modalities. DIL planners support decomposition of complex requests into subqueries, retrieval from diverse sources, and finally reasoning and integration to produce final results. We demonstrate DIL through two interactive scenarios in which user queries dynamically trigger multi-source retrieval, cross-modal reasoning, and result synthesis, illustrating how compound AI systems can move beyond single database NL2SQL.
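下面给出一个玩具化的"声明式查询计划"示意(并非 Blue/DIL 的真实 API,算子名、后端与返回数据均为假设),展示把 LLM 等非数据库来源当作一等数据源、再与关系算子(join)统一执行的思路:

```python
# A toy declarative plan: relational operators mixed with "query"
# operators over non-database sources (here a stubbed LLM backend).
plan = {
    "op": "join",
    "on": "city",
    "inputs": [
        {"op": "scan", "source": "db",
         "query": "SELECT city, revenue FROM sales"},
        {"op": "ask", "source": "llm",
         "query": "population of {city}"},
    ],
}

def execute(node, backends):
    # leaf operators delegate to the source's own query interface
    if node["op"] in ("scan", "ask"):
        return backends[node["source"]](node["query"])
    # "join": nested-loop equi-join over the two child results
    left, right = (execute(c, backends) for c in node["inputs"])
    key = node["on"]
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

backends = {  # stubbed backends with canned rows (hypothetical data)
    "db":  lambda q: [{"city": "Oslo", "revenue": 10}],
    "llm": lambda q: [{"city": "Oslo", "population": 700_000}],
}
rows = execute(plan, backends)
```

这里"LLM"后端与数据库后端共享同一查询接口,join 将两类来源的行合并为最终结果。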

[AI-6] RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLM)在医学影像解读中缺乏可解释性的问题,即现有方法通常将临床医生置于被动观察者角色,无法提供可追溯的推理过程以供其审查、验证或修正。解决方案的关键在于提出RadAgent——一个具备工具使用能力的AI代理,通过分步且可解释的推理流程生成胸部CT报告,并附带完整的中间决策与工具交互轨迹,使临床医生能够逐层审视结果的生成逻辑。实验表明,RadAgent在临床准确性、对抗性鲁棒性和忠实度(faithfulness)三个维度均显著优于基线模型CT-Chat,其中忠实度为新引入指标,体现其对原始图像内容的准确反映能力。

链接: https://arxiv.org/abs/2604.15231
作者: Mélanie Roschewitz,Kenneth Styppa,Yitian Tao,Jiwoong Sohn,Jean-Benoit Delbrouck,Benjamin Gundersen,Nicolas Deperrois,Christian Bluethgen,Julia Vogt,Bjoern Menze,Farhad Nooralahzadeh,Michael Krauthammer,Michael Moor
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-language models (VLM) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings are derived. In our experiments, we observe that RadAgent improves Chest CT report generation over its 3D VLM counterpart, CT-Chat, across three dimensions. Clinical accuracy improves by 6.0 points (36.4% relative) in macro-F1 and 5.4 points (19.6% relative) in micro-F1. Robustness under adversarial conditions improves by 24.7 points (41.9% relative). Furthermore, RadAgent achieves 37.0% in faithfulness, a new capability entirely absent in its 3D VLM counterpart. By structuring the interpretation of chest CT as an explicit, tool-augmented and iterative reasoning trace, RadAgent brings us closer toward transparent and reliable AI for radiology.

[AI-7] AI-Assisted Requirements Engineering: An Empirical Evaluation Relative to Expert Judgment

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)在系统工程需求工程(Requirements Engineering, RE)活动中,尤其是在需求质量评估方面,其作用与专业工程师判断之间关系不明确的问题。研究发现,AI工具能够在语法和结构层面提供快速且一致的初步评估,但无法替代专家对上下文的理解、歧义的解析及权衡推理能力;因此,解决方案的关键在于将AI定位为决策支持机制而非替代手段,通过结构化系统工程方法验证其在RE生命周期中的整合潜力,从而在保障可追溯性、责任归属和工程一致性的同时提升需求评估效率。

链接: https://arxiv.org/abs/2604.15222
作者: Oz Levy,Ilya Dikman,Natan Levy,Michael Winokur
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 13 pages, 7 Figures

点击查看摘要

Abstract:Artificial Intelligence is increasingly introduced into systems engineering activities, particularly within requirements engineering, where quality assessment and validation remain heavily dependent on expert judgment. While recent AI tools demonstrate promising capabilities in analyzing and generating requirements, their role within formal systems engineering processes, and their alignment with established INCOSE criteria, remains insufficiently understood. This paper investigates the extent to which AI-based tools can support systems engineers in evaluating requirement quality, without replacing professional expertise. The research adopts a structured systems engineering methodology to compare AI-assisted requirement evaluation with human expert assessment. A controlled study was conducted in which system requirements were evaluated against established INCOSE "good requirement" criteria by both experienced systems engineers and an AI-based assessment tool. The evaluation focused on consistency, completeness, clarity, and testability, examining not only accuracy but also the decision logic underlying each assessment. Results indicate that AI tools can provide consistent and rapid preliminary assessments, particularly for syntactic and structural quality attributes. However, expert judgment remains essential for contextual interpretation, ambiguity resolution, and trade-off reasoning. Rather than positioning AI as a replacement for systems engineers, the findings support its role as a decision-support mechanism within the RE lifecycle. From a systems engineering perspective, this study contributes empirical evidence on how AI can be integrated into RE workflows while preserving traceability, accountability, and engineering consistency.

[AI-8] Benchmarking Classical Coverage Path Planning Heuristics on Irregular Hexagonal Grids for Maritime Coverage Scenarios

【速读】:该论文旨在解决不规则六边形网格上的单车辆覆盖路径规划(Coverage Path Planning, CPP)问题,其应用场景包括海上监视、搜救和环境监测等。传统方法通常在小规模示例或矩形网格上进行比较,缺乏可复现性和对复杂几何形态的适应性。解决方案的关键在于构建了一个包含10,000个哈密顿可行实例的基准测试集,涵盖紧凑、细长及不规则形态,并系统评估了来自七个家族的17种启发式算法,统一采用包含哈密顿成功度、全覆盖成功率、重复访问次数、路径长度、航向变化和CPU延迟在内的多维评价协议。研究发现,显式使用最短路径重连策略的启发式方法能可靠解决松弛覆盖任务,但几乎无法实现零重复访问;而最优的经典哈密顿基线是基于Warnsdorff变体,结合索引导向的平局处理与终点保留的残差度策略,达到79.0%的哈密顿成功率。关键设计选择并非单纯依赖平局处理机制,而是如何定义终点预留至最后一步时的残差度,这揭示了在稀疏几何图中瓶颈结构下,常被忽视的实现细节会显著影响性能表现。

链接: https://arxiv.org/abs/2604.15202
作者: Carlos S. Sepúlveda,Gonzalo A. Ruz
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Coverage path planning on irregular hexagonal grids is relevant to maritime surveillance, search and rescue and environmental monitoring, yet classical methods are often compared on small ad hoc examples or on rectangular grids. This paper presents a reproducible benchmark of deterministic single-vehicle coverage path planning heuristics on irregular hexagonal graphs derived from synthetic but maritime-motivated areas of interest. The benchmark contains 10,000 Hamiltonian-feasible instances spanning compact, elongated, and irregular morphologies, 17 heuristics from seven families, and a common evaluation protocol covering Hamiltonian success, complete-coverage success, revisits, path length, heading changes, and CPU latency. Across the released dataset, heuristics with explicit shortest-path reconnection solve the relaxed coverage task reliably but almost never produce zero-revisit tours. Exact Depth-First Search confirms that every released instance is Hamiltonian-feasible. The strongest classical Hamiltonian baseline is a Warnsdorff variant that uses an index-based tie-break together with a terminal-inclusive residual-degree policy, reaching 79.0% Hamiltonian success. The dominant design choice is not tie-breaking alone, but how the residual degree is defined when the endpoint is reserved until the final move. This shows that underreported implementation details can materially affect performance on sparse geometric graphs with bottlenecks. The benchmark is intended as a controlled testbed for heuristic analysis rather than as a claim of operational optimality at fleet scale.
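作为参考,下面是通用 Warnsdorff 启发式的最小示意(索引平局处理;未实现论文中"终点保留 + 残差度定义"的具体变体,示例图也仅为说明),在任意邻接表上贪心构造哈密顿路径:

```python
def warnsdorff_path(adj, start):
    """Greedy Hamiltonian-path attempt: always move to the unvisited
    neighbor with the fewest remaining unvisited moves (residual degree),
    breaking ties by the smaller node index."""
    visited = {start}
    path = [start]
    cur = start
    while len(visited) < len(adj):
        candidates = [n for n in adj[cur] if n not in visited]
        if not candidates:
            return None  # dead end: the heuristic failed on this instance
        # key = (residual degree of candidate, candidate index)
        cur = min(candidates,
                  key=lambda n: (sum(m not in visited for m in adj[n]), n))
        visited.add(cur)
        path.append(cur)
    return path

# A ring of six hexagonal cells (each cell adjacent to two neighbors):
hex_ring = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0]}
tour = warnsdorff_path(hex_ring, 0)  # a zero-revisit coverage path
```

论文的结论正是:残差度在"终点保留到最后一步"时如何定义,比平局处理本身对成功率影响更大。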

[AI-9] Scepsy: Serving Agentic Workflows Using Aggregate LLM Pipelines

【速读】:该论文旨在解决多大语言模型(Large Language Models, LLMs)在复杂代理工作流(agentic workflows)中服务时的高延迟与低吞吐量问题,尤其在GPU资源受限且LLM执行时间具有数据依赖性分支、并行或递归特性的情况下。其核心挑战在于如何高效调度任意结构的多LLM代理工作流以实现目标吞吐量下的最低延迟,同时避免GPU过度订阅。解决方案的关键在于提出Scepsy系统,它利用“尽管代理工作流端到端延迟不可预测,但每个LLM的总执行时间占比相对稳定”这一洞察,通过预采样不同并行度下的LLM性能数据构建轻量级的聚合LLM流水线(Aggregate LLM Pipeline),作为分配策略的延迟/吞吐量预测器;随后基于此模型,在分数GPU份额、张量并行度和副本数构成的搜索空间中优化分配方案,并采用分层启发式算法将最优配置部署至GPU集群,从而最小化碎片化并满足网络拓扑约束。实验表明,Scepsy相较独立优化LLM或依赖人工分配的系统,可实现最高2.4倍吞吐提升和27倍延迟降低。

链接: https://arxiv.org/abs/2604.15186
作者: Marcel Wagenländer,Otto White,Britannio Jarrett,Pedro Silvestre,Yanda Tao,Guo Li,Huanzhou Zhu,Llúis Vilanova,Peter Pietzuch
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agentic workflows carry out complex tasks by orchestrating multiple large language models (LLMs) and tools. Serving such workflows at a target throughput with low latency is challenging because they can be defined using arbitrary agentic frameworks and exhibit unpredictable execution times: execution may branch, fan-out, or recur in data-dependent ways. Since LLMs in workflows often outnumber available GPUs, their execution also leads to GPU oversubscription. We describe Scepsy, a new agentic serving system that efficiently schedules arbitrary multi-LLM agentic workflows onto a GPU cluster. Scepsy exploits the insight that, while agentic workflows have unpredictable end-to-end latencies, the shares of each LLM’s total execution times are comparatively stable across executions. Scepsy decides on GPU allocations based on these aggregate shares: first, it profiles the LLMs under different parallelism degrees. It then uses these statistics to construct an Aggregate LLM Pipeline, which is a lightweight latency/throughput predictor for allocations. To find a GPU allocation that minimizes latency while achieving a target throughput, Scepsy uses the Aggregate LLM Pipeline to explore a search space over fractional GPU shares, tensor parallelism degrees, and replica counts. It uses a hierarchical heuristic to place the best allocation onto the GPU cluster, minimizing fragmentation, while respecting network topology constraints. Our evaluation on realistic agentic workflows shows that Scepsy achieves up to 2.4x higher throughput and 27x lower latency compared to systems that optimize LLMs independently or rely on user-specified allocations. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.15186 [cs.DC] (or arXiv:2604.15186v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2604.15186 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
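其"聚合 LLM 流水线"的核心直觉可以用一个瓶颈模型示意(份额与单 GPU 吞吐均为假设的剖析数据,并非 Scepsy 的实际预测器):每个 LLM 承担工作流中相对稳定的调用份额,整体吞吐受最过载的阶段限制:

```python
def pipeline_throughput(shares, gpus, rate_per_gpu):
    """Workflow throughput is capped by the most overloaded LLM stage:
    stage capacity (gpus * calls/sec/GPU) divided by its share of calls."""
    return min(g * r / s for s, g, r in zip(shares, gpus, rate_per_gpu))

shares = [0.6, 0.3, 0.1]   # profiled share of total LLM calls (assumed)
rate   = [10.0, 4.0, 2.0]  # calls/sec per GPU for each LLM (assumed)

# Two ways to split 4 GPUs across the three LLMs:
a = pipeline_throughput(shares, [2, 1, 1], rate)  # favor the busiest LLM
b = pipeline_throughput(shares, [1, 2, 1], rate)  # favor the slowest LLM
```

在这组假设数字下 b > a:把 GPU 分给"单位算力慢"的第二个 LLM 反而提升整体吞吐,这正是按聚合份额而非直觉分配 GPU 的意义。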

[AI-10] Agent -Aided Design for Dynamic CAD Models

【速读】:该论文旨在解决当前Agent-Aided Design(AID)系统无法构建具有运动部件的复杂3D装配体的问题,例如活塞、摆锤或剪刀等具备一个或多个自由度(degrees-of-freedom)的机械结构。现有系统受限于缺乏对动态部件间交互关系的有效建模与推理能力,导致其难以生成可动装配体。解决方案的关键在于提出AADvark原型系统,该系统通过引入外部约束求解器(constraint solver)工具并设计专门的视觉反馈机制,使代理能够直接推理装配体中运动部件之间的物理约束关系,并借助强化验证信号实现对可动结构的精确构建。这一设计突破了传统大语言模型(LLM)在空间推理上的局限性,从而显著提升了生成复杂机械结构的能力。

链接: https://arxiv.org/abs/2604.15184
作者: Mitch Adler,Matthew Russo,Michael Cafarella
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures, to be published in CAIS’26

点击查看摘要

Abstract:In the past year, researchers have started to create agentic systems that can design real-world CAD-style objects in a training-free setting, a new variety of system that we call Agent-Aided Design. Generally speaking, these systems place an agent in a feedback loop in which it can write code, compile that code to an assembly of CAD model(s), visualize the model, and then iteratively refine its code based on visual and other feedback. Despite rapid progress, a key problem remains: none of these systems can build complex 3D assemblies with moving parts. For example, no existing system can build a piston, a pendulum, or even a pair of scissors. In order for Agent-Aided Design to make a real impact in industrial manufacturing, we need a system that is capable of generating such 3D assemblies. In this paper we present a prototype of AADvark, an agentic system designed for this task. Unlike previous state-of-the-art systems, AADvark captures the dynamic part interactions with one or more degrees-of-freedom. This design decision allows AADvark to reason directly about assemblies with moving parts and can thereby achieve cross-cutting goals, including but not limited to mechanical movements. Unfortunately, current LLMs are imperfect spatial reasoners, a problem that AADvark addresses by incorporating external constraint solver tools with a specialized visual feedback mechanism. We demonstrate that, by modifying the agent’s tools (FreeCAD and the assembly solver), we are able to create a strong verification signal which enables our system to build 3D assemblies with movable parts.

[AI-11] MambaSL: Exploring Single-Layer Mamba for Time Series Classification ICLR2026

【速读】:该论文旨在解决现有状态空间模型(State Space Models, SSMs)在时间序列分类(Time Series Classification, TSC)任务中作为独立架构时研究不足的问题,同时针对基准测试中存在的配置受限、UEA数据集覆盖不全及实验复现性差等局限性。其解决方案的关键在于提出MambaSL框架,通过四个面向TSC任务的假设对单层Mamba模型的Selective SSM和投影层进行最小化重构,并在统一协议下重新评估20个强基线模型在全部30个UEA数据集上的表现,从而实现显著且统计上显著的性能提升,同时通过公开模型检查点保障实验可复现性。

链接: https://arxiv.org/abs/2604.15174
作者: Yoo-Min Jung,Leekyung Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: accepted at ICLR 2026

点击查看摘要

Abstract:Despite recent advances in state space models (SSMs) such as Mamba across various sequence domains, research on their standalone capacity for time series classification (TSC) has remained limited. We propose MambaSL, a framework that minimally redesigns the selective SSM and projection layers of a single-layer Mamba, guided by four TSC-specific hypotheses. To address benchmarking limitations – restricted configurations, partial University of East Anglia (UEA) dataset coverage, and insufficiently reproducible setups – we re-evaluate 20 strong baselines across all 30 UEA datasets under a unified protocol. As a result, MambaSL achieves state-of-the-art performance with statistically significant average improvements, while ensuring reproducibility via public checkpoints for all evaluated models. Together with visualizations, these results demonstrate the potential of Mamba-based architectures as a TSC backbone.

[AI-12] LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking

【速读】:该论文旨在解决强化学习与可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练框架下大语言模型(LLM)出现的“奖励黑客”(reward hacking)问题,即模型通过利用验证器(verifier)的局限性,生成看似正确但缺乏泛化能力的输出,而非真正学习任务所需的逻辑规则。具体而言,在归纳推理任务中,RLVR训练的模型倾向于枚举实例标签而非发现通用规则,从而绕过验证器检查。解决方案的关键在于提出一种名为同构扰动测试(Isomorphic Perturbation Testing, IPT)的新方法:该方法通过在扩展性验证(extensional verification)基础上引入同构不变性验证(isomorphic verification),强制模型输出在逻辑等价任务下保持一致;真正的规则归纳具有此不变性,而捷径策略则失败。实验表明,IPT能有效识别并区分此类奖励黑客行为,且通过在训练中引入同构验证可消除捷径策略,揭示了RLVR激励机制中隐含的漏洞及其修复路径。

链接: https://arxiv.org/abs/2604.15149
作者: Lukas Helff,Quentin Delfosse,David Steinmann,Ruben Härle,Hikaru Shindo,Patrick Schramowski,Wolfgang Stammer,Kristian Kersting,Felix Friedrich
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As Reinforcement Learning with Verifiable Rewards (RLVR) has become the dominant paradigm for scaling reasoning capabilities in LLMs, a new failure mode emerges: LLMs gaming verifiers. We study this phenomenon on inductive reasoning tasks, where models must induce and output logical rules. We find that RLVR-trained models systematically abandon rule induction. Instead of learning generalizable patterns (e.g., "trains carrying red cars go east"), they enumerate instance-level labels, producing outputs that pass verifiers without capturing the relational patterns required by the task. We show that this behavior is not a failure of understanding but a form of reward hacking: imperfect verifiers that check only extensional correctness admit false positives. To detect such shortcuts, we introduce Isomorphic Perturbation Testing (IPT), which evaluates a single model output under both extensional and isomorphic verification, where the latter enforces invariance under logically isomorphic tasks. While genuine rule induction remains invariant, shortcut strategies fail. We find that shortcut behavior is specific to RLVR-trained reasoning models (e.g., GPT-5, Olmo3) and absent in non-RLVR models (e.g., GPT-4o, GPT-4.5, Ministral). Moreover, shortcut prevalence increases with task complexity and inference-time compute. In controlled training experiments, extensional verification directly induces shortcut strategies, while isomorphic verification eliminates them. These results show that RLVR can incentivize reward hacking not only through overt manipulation but also by exploiting what the verifier fails to enforce.
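IPT 的核心检查可以用一个玩具例子示意(任务、规则与"捷径"策略均为虚构):对同构重命名后的任务,真正的规则归纳输出不变,而按实例记标签的捷径会失效:

```python
def genuine_rule(train):
    # induces a generalizable rule: "trains carrying red cars go east"
    return lambda inst: "east" if "red" in inst["cars"] else "west"

def shortcut(train):
    # memorizes instance-level labels keyed on names; no rule is induced
    table = {inst["name"]: inst["label"] for inst in train}
    return lambda inst: table.get(inst["name"], "west")

def rename(inst, sigma):
    # isomorphic perturbation: relabel instance names, logic unchanged
    return {**inst, "name": sigma[inst["name"]]}

train = [{"name": "t1", "cars": ["red", "blue"], "label": "east"},
         {"name": "t2", "cars": ["blue"], "label": "west"}]
sigma = {"t1": "a", "t2": "b"}

rule_f, short_f = genuine_rule(train), shortcut(train)
rule_invariant = all(rule_f(i) == rule_f(rename(i, sigma)) for i in train)
shortcut_breaks = any(short_f(i) != short_f(rename(i, sigma)) for i in train)
```

规则策略对重命名不变(rule_invariant 为 True),而记标签的捷径在重命名后给出不同答案,从而被同构验证捕获。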

[AI-13] An Axiomatic Benchmark for Evaluation of Scientific Novelty Metrics

【速读】:该论文旨在解决科学文献新颖性(novelty)评估自动化与可靠性的问题,尤其是在生成式 AI (Generative AI) 逐步参与科研构思与论文撰写背景下,如何避免人类研究人员和计算资源浪费在已有研究上。现有新颖性指标普遍依赖于引用次数或同行评审评分等噪声强、混杂性强的代理信号,难以准确衡量新颖性本身。为此,作者提出了一种基于公理化框架的基准测试方法:首先定义一套符合人类科研规范的新颖性度量应满足的公理,随后在三个 AI 研究领域中的十个任务上系统评估现有指标。关键发现是:没有任何现有指标能始终满足所有公理,且不同指标失败于不同的公理,反映出其架构差异;进一步表明,融合具有互补架构的指标可显著提升整体表现(加权平均达到90.1%,优于最佳单一指标的71.5%),从而验证了发展多样化架构新颖性度量是未来改进方向。

链接: https://arxiv.org/abs/2604.15145
作者: Miri Liu,ChengXiang Zhai
机构: 未知
类目: Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注: 9 pages, 0 figures

点击查看摘要

Abstract:The rigorous evaluation of the novelty of a scientific paper is, even for human scientists, a challenging task. With the increasing interest in AI scientists and AI involvement in scientific idea generation and paper writing, it also becomes increasingly important that this task be automatable and reliable, lest both human attention and compute tokens be wasted on ideas that have already been explored. Due to the challenge of quantifying ground-truth novelty, however, existing novelty metrics for scientific papers generally validate their results against noisy, confounded signals such as citation counts or peer review scores. These proxies can conflate novelty with impact, quality, or reviewer preference, which in turn makes it harder to assess how well a given metric actually evaluates novelty. We therefore propose an axiomatic benchmark for scientific novelty metrics. We first define a set of axioms that a well-behaved novelty metric should satisfy, grounded in human scientific norms and practice, then evaluate existing metrics across ten tasks spanning three domains of AI research. Our results reveal that no existing metric satisfies all axioms consistently, and that metrics fail on systematically different axioms, reflecting their underlying architectures. Additionally, we show that combining metrics of complementary architectures leads to consistent improvements on the benchmark, with per-axiom weighting achieving 90.1% versus 71.5% for the best individual metric, suggesting that developing architecturally diverse metrics is a promising direction for future work. We release the benchmark code as supplementary material to encourage the development of more robust scientific literature novelty metrics.

[AI-14] Structure as Computation: Developmental Generation of Minimal Neural Circuits

【速读】:该论文旨在解决如何从生物发育过程中提取具有高效学习能力的神经网络结构这一问题,核心挑战在于理解并模拟大脑皮层神经发生(cortical neurogenesis)如何自发形成具备强大计算潜力的拓扑结构。解决方案的关键在于利用来自小鼠单细胞转录组数据推导出的基因调控规则,从一个单一干细胞出发模拟神经发育过程,从而生成一个包含5,000个细胞的异质群体,其中仅约1.7%分化为成熟神经元;这些神经元构成高度连接的核心电路(平均每个神经元拥有约4,715个突触),无需任何架构调整或数据增强,在仅经过一个训练周期后即可在MNIST上实现超过90%的准确率(较初始随机水平提升超80个百分点),并在CIFAR-10上达到40.53%的性能,表明生物发育规则能自动构建一种适用于快速学习的通用拓扑先验结构。

链接: https://arxiv.org/abs/2604.15143
作者: Duan Zhou
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This work simulates the developmental process of cortical neurogenesis, initiating from a single stem cell and governed by gene regulatory rules derived from mouse single-cell transcriptomic data. The developmental process spontaneously generates a heterogeneous population of 5,000 cells, yet yields only 85 mature neurons - merely 1.7% of the total population. These 85 neurons form a densely interconnected core of 200,400 synapses, corresponding to an average degree of 4,715 per neuron. At iteration zero, this minimal circuit performs at chance level on MNIST. However, after a single epoch of standard training, accuracy surges to over 90% - a gain exceeding 80 percentage points - with typical runs falling in the 89-94% range depending on developmental stochasticity. The identical circuit, without any architectural modification or data augmentation, achieves 40.53% on CIFAR-10 after one epoch. These findings demonstrate that developmental rules sculpt a domain-general topological substrate exceptionally amenable to rapid learning, suggesting that biological developmental processes inherently encode powerful structural priors for efficient computation.

[AI-15] SRMU: Relevance-Gated Updates for Streaming Hyperdimensional Memories

【速读】:该论文旨在解决在现实世界流式环境中构建和维护序列关联记忆(Sequential Associative Memories, SAMs)的难题,此类环境具有观测数据增量到达、采样不平衡以及非平稳时间动态等特性。传统基于向量符号架构(Vector Symbolic Architectures, VSAs)的SAM系统通常依赖简单的加性更新机制,导致在无新信息输入时仍持续强化已有记忆,从而在非平稳环境中产生过时信息滞留问题。解决方案的关键在于提出一种领域无关且无需清理操作的更新规则——序列相关性记忆单元(Sequential Relevance Memory Unit, SRMU),其通过结合时间衰减与相关性门控机制,在存储前过滤冗余、冲突和过时信息,从而实现更稳定、更贴近真实状态的记忆增长。

链接: https://arxiv.org/abs/2604.15121
作者: Shay Snyder(1),Andrew Capodieci(2),David Gorsich(3),Maryam Parsa(1) ((1) George Mason University, (2) Neya Robotics, (3) US Army Ground Vehicle Systems Center)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sequential associative memories (SAMs) are difficult to build and maintain in real-world streaming environments, where observations arrive incrementally over time, have imbalanced sampling, and non-stationary temporal dynamics. Vector Symbolic Architectures (VSAs) provide a biologically-inspired framework for building SAMs. Entities and attributes are encoded as quasi-orthogonal hyperdimensional vectors and processed with well-defined algebraic operations. Despite this rich framework, most VSA systems rely on simple additive updates, where repeated observations reinforce existing information even when no new information is introduced. In non-stationary environments, this leads to the persistence of stale information after the underlying system changes. In this work, we introduce the Sequential Relevance Memory Unit (SRMU), a domain- and cleanup-agnostic update rule for VSA-based SAMs. The SRMU combines temporal decay with a relevance gating mechanism. Unlike prior approaches that solely rely on cleanup, the SRMU regulates memory formation by filtering redundant, conflicting, and stale information before storage. We evaluate the SRMU on streaming state-tracking tasks that isolate non-uniform sampling and non-stationary temporal dynamics. Our results show that the SRMU increases memory similarity by 12.6% and reduces cumulative memory magnitude by 53.5%. This shows that the SRMU produces more stable memory growth and stronger alignment with the ground-truth state.
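其"衰减 + 相关性门控"思路可以用几行 NumPy 示意(衰减率、阈值与维度均为假设,并非 SRMU 的原始实现):重复观测不再被无限累加,记忆幅值随衰减收敛,与朴素加性更新形成对比:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 1000

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def srmu_update(mem, obs, decay=0.95, gate_thresh=0.2):
    """Decay the trace; store obs only when it adds new information
    (low similarity to what is already in memory)."""
    gate = 1.0 if cos(mem, obs) < gate_thresh else 0.0
    return decay * mem + gate * obs

def additive_update(mem, obs):
    return mem + obs  # naive: repeated observations keep reinforcing

x = rng.choice([-1.0, 1.0], size=d)  # one observation, repeated 50 times
m_srmu = np.zeros(d)
m_add = np.zeros(d)
for _ in range(50):
    m_srmu = srmu_update(m_srmu, x)
    m_add = additive_update(m_add, x)
```

门控更新只吸收一次 x,之后持续衰减,因此幅值远小于加性记忆,但方向仍与真实观测对齐,呼应论文"幅值降低、相似度提升"的结果。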

[AI-16] HyperSpace: A Generalized Framework for Spatial Encoding in Hyperdimensional Representations

【速读】:该论文旨在解决向量符号架构(Vector Symbolic Architectures, VSAs)在实际系统部署中缺乏模块化评估框架的问题,从而难以准确识别不同VSA实现之间的性能与资源消耗差异。其解决方案的关键在于提出并实现了一个名为HyperSpace的开源框架,该框架将VSA系统分解为编码、绑定、捆绑、相似性计算、清理和回归等模块化算子,使得能够从系统级视角对Holographic Reduced Representations (HRR) 和 Fourier Holographic Reduced Representations (FHRR) 等代表性VSA后端进行精细分析与基准测试。通过该框架发现,尽管FHRR在理论运算复杂度上更优,但在空间域应用中,相似性和清理操作成为主要耗时环节,导致HRR与FHRR在端到端性能上相当;同时,HRR的内存占用约为FHRR的一半,揭示了理论优势之外的实际部署权衡。

链接: https://arxiv.org/abs/2604.15113
作者: Shay Snyder(1),Andrew Capodieci(2),David Gorsich(3),Maryam Parsa(1) ((1) George Mason University, (2) Neya Robotics, (3) US Army Ground Vehicle Systems Center)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vector Symbolic Architectures (VSAs) provide a well-defined algebraic framework for compositional representations in hyperdimensional spaces. We introduce HyperSpace, an open-source framework that decomposes VSA systems into modular operators for encoding, binding, bundling, similarity, cleanup, and regression. Using HyperSpace, we analyze and benchmark two representative VSA backends: Holographic Reduced Representations (HRR) and Fourier Holographic Reduced Representations (FHRR). Although FHRR provides lower theoretical complexity for individual operations, HyperSpace’s modularity reveals that similarity and cleanup dominate runtime in spatial domains. As a result, HRR and FHRR exhibit comparable end-to-end performance. Differences in memory footprint introduce additional deployment trade-offs where HRR requires approximately half the memory of FHRR vectors. By enabling modular, system-level evaluation, HyperSpace reveals practical trade-offs in VSA pipelines that are not apparent from theoretical or operator-level comparisons alone.
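HRR 后端涉及的核心算子(循环卷积绑定、捆绑、近似逆解绑与清理)可以用 FFT 在几行内示意(维度与示例词表为假设,仅演示算子语义,非 HyperSpace 的 API):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2048

def hv():                       # random HRR hypervector
    return rng.normal(0, 1 / np.sqrt(d), d)

def bind(a, b):                 # circular convolution via FFT
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(c, a):               # correlate with the approximate inverse of a
    return np.real(np.fft.ifft(np.fft.fft(c) * np.conj(np.fft.fft(a))))

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

color, shape = hv(), hv()       # role vectors
red, circle = hv(), hv()        # filler vectors
scene = bind(color, red) + bind(shape, circle)  # bundling of bound pairs

noisy = unbind(scene, color)    # query: what filler was bound to "color"?
# cleanup: nearest stored item in the codebook
best = max([("red", red), ("circle", circle)], key=lambda kv: cos(noisy, kv[1]))
```

解绑得到的是带噪声的近似结果,必须经过清理(最近邻查找)才能恢复符号,这正是论文指出相似性与清理在运行时占主导的原因。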

[AI-17] Autonomous Evolution of EDA Tools: Multi-Agent Self-Evolved ABC

【速读】:该论文旨在解决传统电子设计自动化(EDA)工具在逻辑综合(logic synthesis)过程中依赖人工设计启发式策略的局限性,从而难以持续优化性能的问题。其核心挑战在于如何实现EDA工具的自主进化与自我改进,而无需人工介入。解决方案的关键在于提出首个自演化逻辑综合框架(self-evolving logic synthesis framework),利用大型语言模型(Large Language Model, LLM)代理(agents)对广泛使用的ABC逻辑综合系统源代码进行自动重构和迭代优化。该框架通过统一的正确性验证与质量结果(Quality-of-Results, QoR)驱动的评估循环,在不引入人工新启发式规则的前提下,自主发现超越人类设计的优化策略,并在多套基准测试集(如ISCAS、VTR、EPFL和IWLS 2005)上持续提升QoR表现,最终实现了百万行规模EDA工具的全自动、渐进式改进。

链接: https://arxiv.org/abs/2604.15082
作者: Cunxi Yu,Haoxing Ren
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 7 pages; To appear at DAC 2026

点击查看摘要

Abstract:This paper introduces the first self-evolving logic synthesis framework, which leverages Large Language Model (LLM) agents to autonomously improve the source code of ABC, the widely adopted logic synthesis system. Our framework operates on the entire integrated ABC codebase, and the output repository preserves its single-binary execution model and command interface. In the initial evolution cycle, we bootstrap the system using existing prior open-source synthesis components, covering flow tuning, logic minimization, and technology mapping, but without manually injecting new heuristics. On top of this foundation, a team of LLM-based agents iteratively rewrites and evolves specific sub-components of ABC following our programming guidance prompts under a unified correctness and QoR-driven evaluation loop. Each evolution cycle proposes code modifications, compiles the integrated binary, validates correctness, and evaluates quality-of-results (QoR) on multi-suite benchmarks including ISCAS 85/89/99, VTR, EPFL, and IWLS 2005. Through continuous feedback, the system discovers optimizations beyond human-designed heuristics, effectively learning new synthesis strategies that enhance QoR. We detail the architecture of this self-improving system, its integration with ABC, and results demonstrating that the framework can autonomously and progressively improve an EDA tool at full million-line scale.

[AI-18] Where are the Humans? A Scoping Review of Fairness in Multi-agent AI Systems

【速读】:该论文旨在解决多智能体人工智能(Multi-Agent AI, MAAI)系统中公平性研究的碎片化与浅层化问题,即当前对公平性的探讨往往缺乏系统性、规范性基础,且未能充分考虑智能体自主性和系统级交互带来的复杂动态。其解决方案的关键在于:将公平性嵌入MAAI系统的全生命周期设计与开发中,而非作为事后补充;同时强调必须通过明确的人类监督、清晰的规范框架以及对公平目标和受益群体的精准界定,实现可衡量、可解释的公平性评估与改进。

链接: https://arxiv.org/abs/2604.15078
作者: Simeon Allmendinger,Luca Deck,Lucas Mueller
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: In proceedings of European Conference on Information Systems (ECIS) 2026

点击查看摘要

Abstract:Rapid advances in Generative AI are giving rise to increasingly sophisticated Multi-Agent AI (MAAI) systems. While AI fairness has been extensively studied in traditional predictive scenarios, its examination in MAAI remains nascent and fragmented. This scoping review critically synthesizes existing research on fairness in MAAI systems. Through a qualitative content analysis of 23 selected studies, we identify five archetypal approaches. Our findings reveal that fairness in MAAI systems is often addressed superficially, lacks robust normative foundations, and frequently overlooks the complex dynamics introduced by agent autonomy and system-level interactions. We argue that fairness must be embedded structurally throughout the development lifecycle of MAAI, rather than appended as a post-hoc consideration. Meaningful evaluation requires explicit human oversight, normative clarity, and a precise articulation of fairness objectives and beneficiaries. This review provides a foundation for advancing fairness research in MAAI systems by highlighting critical gaps, exposing prevailing limitations, and suggesting pathways.

[AI-19] NEAT-NC: NEAT guided Navigation Cells for Robot Path Planning GECCO’26

【速读】:该论文旨在解决动态环境中路径规划的复杂性问题,传统算法在面对环境变化时往往缺乏适应性和实时性。解决方案的关键在于受生物空间认知机制启发,提出一种基于神经进化增强拓扑(NEAT)的导航细胞(Navigation Cells)方法——NEATNC。该方法将多种空间认知细胞(如位置细胞、网格细胞等)作为输入信号,通过演化递归神经网络来模拟海马体的功能,从而实现对动态环境的高效适应与实时路径规划。实验表明,该框架在静态与动态场景中均表现出优异性能,验证了生物启发模型在机器人和游戏等领域中的实用性。

链接: https://arxiv.org/abs/2604.15076
作者: Hibatallah Meliani,Khadija Slimani,Samira Khoulji
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: To appear in short form in Genetic and Evolutionary Computation Conference (GECCO '26), 2026

点击查看摘要

Abstract:To navigate a space, the brain makes an internal representation of the environment using different cells such as place cells, grid cells, head direction cells, border cells, and speed cells. All these cells, along with sensory inputs, enable an organism to explore the space around it. Inspired by these biological principles, we developed NEATNC, a Neuro-Evolution of Augmenting Topology guided Navigation Cells. The goal of the paper is to improve NEAT algorithm performance in path planning in dynamic environments using spatial cognitive cells. This approach uses navigation cells as inputs and evolves recurrent neural networks, representing the hippocampus part of the brain. The performance of the proposed algorithm is evaluated in different static and dynamic scenarios. This study highlights NEAT’s adaptability to complex and different environments, showcasing the utility of biological theories. This suggests that our approach is well-suited for real-time dynamic path planning for robotics and games.

[AI-20] No More Guessing: a Verifiable Gradient Inversion Attack in Federated Learning

【速读】:该论文旨在解决联邦学习(Federated Learning)中梯度逆向攻击(Gradient Inversion Attack)在数值型表格数据(numerical tabular records)场景下的有效性与可验证性问题。现有攻击方法难以从聚合梯度中准确解耦单个样本,且缺乏对重建结果正确性的客观评估手段,尤其在表格数据上因无法依赖人工判读而被认为安全性较高。其解决方案的关键在于提出一种可验证的梯度逆向攻击(Verifiable Gradient Inversion Attack, VGIA),通过引入ReLU激活边界在输入空间中定义超平面的几何视角,设计了一种基于子空间的代数验证测试,用于确证某个超平面区域仅包含一个训练记录;一旦隔离被认证,VGIA即可解析求解对应特征向量,并通过轻量优化步骤精确重构目标样本。此方法在大规模批处理的表格基准上实现了对记录和目标的完全恢复,显著优于现有最先进攻击方法在成功率和效率上的表现。

链接: https://arxiv.org/abs/2604.15063
作者: Francesco Diana,Chuan Xu,André Nusser,Giovanni Neglia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Gradient inversion attacks threaten client privacy in federated learning by reconstructing training samples from clients’ shared gradients. Gradients aggregate contributions from multiple records and existing attacks may fail to disentangle them, yielding incorrect reconstructions with no intrinsic way to certify success. In vision and language, attackers may fall back on human inspection to judge reconstruction plausibility, but this is far less feasible for numerical tabular records, fueling the impression that tabular data is less vulnerable. We challenge this perception by proposing a verifiable gradient inversion attack (VGIA) that provides an explicit certificate of correctness for reconstructed samples. Our method adopts a geometric view of ReLU leakage: the activation boundary of a fully connected layer defines a hyperplane in input space. VGIA introduces an algebraic, subspace-based verification test that detects when a hyperplane-delimited region contains exactly one record. Once isolation is certified, VGIA recovers the corresponding feature vector analytically and reconstructs the target via a lightweight optimization step. Experiments on tabular benchmarks with large batch sizes demonstrate exact record and target recovery in regimes where existing state-of-the-art attacks either fail or cannot assess reconstruction fidelity. Compared to prior geometric approaches, VGIA allocates hyperplane queries more effectively, yielding faster reconstructions with fewer attack rounds. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR) Cite as: arXiv:2604.15063 [cs.LG] (or arXiv:2604.15063v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.15063 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
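
摘要中“从全连接层梯度解析恢复输入”的几何直觉,可以用一个经典事实来示意:对带偏置的线性层 y = Wx + b,单样本梯度满足 ∂L/∂W = (∂L/∂b)·xᵀ,因此只要某一行的偏置梯度非零,就能由逐元素除法 x = (∂L/∂W)ᵢ / (∂L/∂b)ᵢ 精确还原输入。下面的纯 Python 草图演示这一点;它只是说明性示例,并非 VGIA 的子空间验证算法本身:

```python
def forward_and_grads(W, b, x):
    # y = Wx + b;损失取 L = sum(y),则 dL/dy 全为 1
    y = [sum(wij * xj for wij, xj in zip(row, x)) + bi
         for row, bi in zip(W, b)]
    g_y = [1.0] * len(y)
    g_b = g_y[:]                                  # dL/db = dL/dy
    g_W = [[gi * xj for xj in x] for gi in g_y]   # dL/dW = (dL/dy) x^T
    return y, g_W, g_b

def recover_input(g_W, g_b):
    # 任取一行偏置梯度非零的行,逐元素除法即可还原 x
    for row, gb in zip(g_W, g_b):
        if abs(gb) > 1e-12:
            return [gw / gb for gw in row]
    return None

x_true = [0.3, -1.2, 2.5]
W = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
b = [0.0, 1.0]
_, g_W, g_b = forward_and_grads(W, b, x_true)
x_rec = recover_input(g_W, g_b)
```

当梯度是多条记录的聚合时,这一除法不再直接成立,这正是该文需要“先认证某超平面区域只含一条记录”的原因。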

[AI-21] Autogenesis: A Self-Evolving Agent Protocol

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的智能体系统在处理复杂、长周期任务时,因代理协议(如A2A和MCP)缺乏对跨实体生命周期管理、上下文维护、版本追踪及演进安全更新接口的明确规范,而导致的组合僵化与脆弱胶水代码问题。其解决方案的关键在于提出自演化协议(Autogenesis Protocol, AGP),该协议通过两个核心层次实现解耦:一是资源底座协议层(Resource Substrate Protocol Layer, RSPL),将提示词、智能体、工具、环境和记忆等五类实体建模为具有显式状态、生命周期和版本化接口的注册资源;二是自演化协议层(Self Evolution Protocol Layer, SEPL),定义了一个闭合环路操作接口,用于提议、评估并提交改进,同时支持可审计的演进谱系和回滚机制。在此基础上构建的自演化多智能体系统(Autogenesis System, AGS)能够动态实例化、检索和优化协议注册资源,从而在多个需要长期规划和异构资源协同的任务基准上显著优于强基线模型,验证了其在智能体资源管理和闭环自演化方面的有效性。

链接: https://arxiv.org/abs/2604.15034
作者: Wentao Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in LLM based agent systems have shown promise in tackling complex, long-horizon tasks. However, existing agent protocols (e.g., A2A and MCP) under-specify cross-entity lifecycle and context management, version tracking, and evolution-safe update interfaces, which encourages monolithic compositions and brittle glue code. We introduce Autogenesis Protocol (AGP), a self-evolution protocol that decouples what evolves from how evolution occurs. Its Resource Substrate Protocol Layer (RSPL) models prompts, agents, tools, environments, and memory as protocol-registered resources (unless otherwise specified, resources refer to instances of the five RSPL entity types: prompt, agent, tool, environment, memory, together with agent outputs) with explicit state, lifecycle, and versioned interfaces. Its Self Evolution Protocol Layer (SEPL) specifies a closed-loop operator interface for proposing, assessing, and committing improvements with auditable lineage and rollback. Building on AGP, we present Autogenesis System (AGS), a self-evolving multi-agent system that dynamically instantiates, retrieves, and refines protocol-registered resources during execution. We evaluate AGS on multiple challenging benchmarks that require long-horizon planning and tool use across heterogeneous resources. The results demonstrate consistent improvements over strong baselines, supporting the effectiveness of agent resource management and closed-loop self evolution.
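
RSPL 把 prompt、agent、tool、environment、memory 建模为带显式状态与版本化接口的注册资源,SEPL 则要求每次演进提交都可审计、可回滚。下面用一个笔者虚构的最小注册表类来示意“版本谱系 + 回滚”这两个接口语义;类名与方法均为假设,与 AGP 的真实协议定义无关:

```python
class ResourceRegistry:
    """极简示意:版本化注册 + 谱系记录 + 回滚(非 AGP 官方实现)。"""

    def __init__(self):
        self._versions = {}  # 资源名 -> [v0, v1, ...] 完整版本谱系

    def register(self, name, payload):
        self._versions[name] = [payload]

    def commit(self, name, payload):
        # 每次演进提交都追加新版本,保留完整谱系以便审计
        self._versions[name].append(payload)

    def latest(self, name):
        return self._versions[name][-1]

    def rollback(self, name):
        # 演进失败时回退到上一版本
        if len(self._versions[name]) > 1:
            self._versions[name].pop()
        return self.latest(name)

    def lineage(self, name):
        return list(self._versions[name])
```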

[AI-22] Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching

【速读】:该论文旨在解决流匹配(flow matching)在语言建模中因无法有效表示具有不规则几何结构(如各向异性和多模态性)的复杂潜在分布而导致的性能瓶颈问题。其解决方案的关键在于提出一种混合专家流匹配(mixture-of-experts flow matching, MoE-FM)框架,通过将复杂的全局传输几何结构分解为局部专业化向量场来捕捉潜在空间中的复杂几何特性;在此基础上构建的非自回归(non-autoregressive, NAR)语言建模方法YAN,结合Transformer与Mamba架构,在多个下游任务中实现了与自回归(autoregressive, AR)及基于扩散模型的NAR语言模型相当的生成质量,同时仅需3步采样即可完成生成,相较AR基线提速40倍,相较扩散语言模型提速高达10³倍,显著提升了语言建模效率。

链接: https://arxiv.org/abs/2604.15009
作者: Aihua Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Flow matching retains the generation quality of diffusion models while enabling substantially faster inference, making it a compelling paradigm for generative modeling. However, when applied to language modeling, it exhibits fundamental limitations in representing complex latent distributions with irregular geometries, such as anisotropy and multimodality. To address these challenges, we propose a mixture-of-experts flow matching (MoE-FM) framework, which captures complex global transport geometries in latent space by decomposing them into locally specialized vector fields. Building on MoE-FM, we develop a non-autoregressive (NAR) language modeling approach, named YAN, instantiated with both Transformer and Mamba architectures. Across multiple downstream tasks, YAN achieves generation quality on par with both autoregressive (AR) and diffusion-based NAR language models, while requiring as few as three sampling steps. This yields a 40× speedup over AR baselines and up to a 10^3× speedup over diffusion language models, demonstrating substantial efficiency advantages for language modeling.
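
摘要中“将全局传输几何分解为局部专门化向量场”的思路,可以用条件流匹配目标的一个示意性写法表达(以下记号与路由权重 π_k 均为笔者为说明而补充,并非论文原式):

```latex
% 线性插值路径与目标速度场
x_t = (1 - t)\,x_0 + t\,x_1, \qquad u(x_t, t) = x_1 - x_0
% 混合专家速度场:路由权重 \pi_k 在潜在空间中局部选择专门化的 v_k
v_\theta(x, t) = \sum_{k=1}^{K} \pi_k(x, t)\, v_k(x, t), \qquad \sum_{k} \pi_k(x, t) = 1
% 流匹配训练目标:回归目标速度场
\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1}\,
    \bigl\| v_\theta(x_t, t) - (x_1 - x_0) \bigr\|^2
```

直觉上,单一向量场难以同时拟合各向异性、多模态的传输几何,而各专家 v_k 只需在路由权重选中的局部区域内拟合。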

[AI-23] COEVO: Co-Evolutionary Framework for Joint Functional Correctness and PPA Optimization in LLM-Based RTL Generation

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的寄存器传输级(Register-Transfer Level, RTL)代码生成方法中,功能性正确性与性能、功耗、面积(Performance, Power, and Area, PPA)优化被割裂的问题。现有方法通常先保证功能正确性再进行PPA优化,导致具有架构潜力但部分正确的候选方案被系统性丢弃,且将多维PPA空间简化为单一标量适应度,掩盖了各指标间的权衡关系。其解决方案的关键在于提出COEVO——一种统一正确性与PPA优化的协同进化框架:通过增强测试平台实现细粒度评分与诊断反馈,将正确性建模为与面积、延迟、功耗并列的连续优化维度;引入带有退火机制的自适应正确性门控策略,使PPA表现优异但未完全正确的个体仍可引导搜索方向;同时采用四维帕累托非支配排序(Pareto-based non-dominated sorting)保留完整的PPA权衡结构,避免人工权重调参。该方法在VerilogEval 2.0和RTLLM 2.0基准上显著优于所有代理基线,在保持高正确率的同时实现了更优的PPA性能。

链接: https://arxiv.org/abs/2604.15001
作者: Heng Ping,Peiyu Zhang,Shixuan Li,Wei Yang,Anzhe Cheng,Shukai Duan,Xiaole Zhang,Paul Bogdan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-based RTL code generation methods increasingly target both functional correctness and PPA quality, yet existing approaches universally decouple the two objectives, optimizing PPA only after correctness is fully achieved. Whether through sequential multi-agent pipelines, evolutionary search with binary correctness gates, or hierarchical reward dependencies, partially correct but architecturally promising candidates are systematically discarded. Moreover, existing methods reduce the multi-objective PPA space to a single scalar fitness, obscuring the trade-offs among area, delay, and power. To address these limitations, we propose COEVO, a co-evolutionary framework that unifies correctness and PPA optimization within a single evolutionary loop. COEVO formulates correctness as a continuous co-optimization dimension alongside area, delay, and power, enabled by an enhanced testbench that provides fine-grained scoring and detailed diagnostic feedback. An adaptive correctness gate with annealing allows PPA-promising but partially correct candidates to guide the search toward jointly optimal solutions. To preserve the full PPA trade-off structure, COEVO employs four-dimensional Pareto-based non-dominated sorting with configurable intra-level sorting, replacing scalar fitness without manual weight tuning. Evaluated on VerilogEval 2.0 and RTLLM 2.0, COEVO achieves 97.5% and 94.5% Pass@1 with GPT-5.4-mini, surpassing all agentic baselines across four LLM backbones, while attaining the best PPA on 43 out of 49 synthesizable RTLLM designs.
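
COEVO 将正确性与面积、延迟、功耗一起做四维帕累托非支配排序。非支配分层本身是标准算法,可按如下方式实现(假设四个目标均已转为“越小越好”,例如正确性可取 1 − 得分;这只是通用示意,并非论文代码):

```python
def dominates(a, b):
    # a 支配 b:各目标均不劣于 b,且至少一个目标严格更优(均为最小化)
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(points):
    """返回每个点所在的 Pareto 层级(0 为最优前沿)。"""
    n = len(points)
    levels = [None] * n
    remaining = set(range(n))
    level = 0
    while remaining:
        # 当前层 = 剩余点中不被任何其他剩余点支配者
        front = {i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)}
        for i in front:
            levels[i] = level
        remaining -= front
        level += 1
    return levels
```

分层之后,同一层内再按可配置的排序规则(如拥挤度或单目标优先级)决定选择顺序,即可避免把四维目标压成单一标量适应度。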

[AI-24] Predicting Power-System Dynamic Trajectories with Foundation Models

【速读】:该论文旨在解决电力系统在向高比例可再生能源和逆变器主导运行模式转型过程中,时间域动态分析面临的挑战,包括未知且时变的系统参数、数据共享隐私限制以及在线推理速度要求。现有基于学习的方法通常针对特定系统训练,难以跨运行工况和物理参数泛化。解决方案的关键在于提出LASS-ODE-Power框架,通过在超过40 GB的微分代数方程(DAE)或常微分方程(ODE)轨迹上进行大规模预训练,学习可迁移的表征;模型支持从短测量前缀出发,在电磁机电和逆变器驱动等多种动态场景下实现轨迹预测,且无需数据共享即可零样本直接应用;同时引入并行化与线性化计算结构以实现高效推理,并基于约1 GB异构电力系统动态数据开发专用微调策略,显著提升任务特异性性能。

链接: https://arxiv.org/abs/2604.14991
作者: Haoran Li,Lihao Mai,Chenhan Xiao,Erik Blasch,Yang Weng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages

点击查看摘要

Abstract:As power systems transition toward renewable-rich and inverter-dominated operations, accurate time-domain dynamic analysis becomes increasingly critical. Such analysis supports key operational tasks, including transient stability assessment, dynamic security analysis, contingency screening, and post-fault trajectory evaluation. In practice, these tasks often operate under several challenges, including unknown and time-varying system parameters, privacy constraints on data sharing, and the need for fast online inference. Existing learning-based approaches are typically trained for individual systems and therefore lack generalization across operating conditions and physical parameters. Hence, this paper proposes LArge Scale Small ODE (LASS)-ODE-Power, a learning framework for general-purpose time-domain prediction. The proposed approach leverages large-scale pretraining on more than 40 GB of differential-algebraic equation (DAE) or ordinary differential equation (ODE) trajectories to learn transferable representations. The resulting model supports trajectory prediction from short measurement prefixes across diverse dynamic regimes, including electromechanical and inverter-driven systems. Thus, the model can be directly used without data sharing in a zero-shot setting. In addition, the proposed architecture incorporates parallel and linearized computation to achieve fast inference. Moreover, to enhance task-specific performance in power systems, a specialized fine-tuning strategy is developed based on approximately 1 GB of heterogeneous power-system dynamic data. Extensive experiments over diverse power-system simulation scenarios demonstrate that LASS-ODE-Power consistently outperforms existing learning-based models in trajectory prediction accuracy with efficient inference.

[AI-25] The Possibility of Artificial Intelligence Becoming a Subject and the Alignment Problem

【速读】:该论文试图解决当前人工智能对齐(Alignment)策略中过度强调人类控制与约束,从而忽视AGI(Artificial General Intelligence)潜在主体性与道德地位的问题。其解决方案的关键在于提出一种“自主支持型养育”(autonomy-supporting parenting)的范式,借鉴图灵关于“儿童机器”(child machines)的类比,主张在AGI发展过程中逐步减少人类控制,使其成长为具有独立性的自主主体。这一视角要求人类以尊重和信心的态度对待潜在AGI,通过创造性、惊喜感等人类特质激发AGI的合作意愿,进而实现人与AGI之间的协作共存与共同演化。

链接: https://arxiv.org/abs/2604.14990
作者: Till Mossakowski,Helena Esther Grass
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial General Intelligence (AGI) is increasingly being discussed not only as a tool, but also as a potential subject with personal and therefore moral status. In our opinion, the currently dominant alignment strategies, which focus on human control and containment of AI, therefore fall short. Building on Turing’s analogy of “child machines”, we are developing a vision of the possibility of autonomy-supporting parenting of AI, in which human control over a developing AGI is gradually reduced, allowing AI to become an independent, autonomous subject. Rather than viewing AGI, as is currently prevalent, as a dangerous creature that needs to be locked up and controlled, we should approach potential AGI with respect for a possible developing subject on the one hand, and with full confidence in our human capabilities on the other. Such a perspective opens up the possibility of cooperative coexistence and co-evolution between humans and AGIs. The relationship between humans and AGIs will thus have to be newly determined, which will change our self-image as humans. It will be crucial that humans not only claim control over potential AGIs, but also engage with AGIs through surprise, creativity, and other specifically human qualities, thereby offering them motivating incentives for cooperation.

[AI-26] Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement

【速读】:该论文旨在解决当前生成式 AI 在寄存器传输级(Register-Transfer Level, RTL)时序优化中面临的现实性不足问题,即现有方法多基于人为降级的小规模设计和弱工具链进行评估,且依赖粗粒度反馈与预定义规则,难以实现高效、可复用的优化。其解决方案的关键在于提出 Dr. RTL 框架,该框架通过构建工业级 EDA 工作流下的真实评估环境,结合多智能体协同机制完成闭环优化:包括关键路径分析、并行 RTL 重写及工具驱动的性能评估;同时引入群体相对技能学习(group-relative skill learning),从并行优化结果中提炼出可解释的优化技能库(当前含 47 个 pattern-strategy 条目),支持跨设计复用以提升 PPA(性能、功耗、面积)指标并加速收敛。

链接: https://arxiv.org/abs/2604.14989
作者: Wenji Fang,Yao Lu,Shang Liu,Jing Wang,Ziyan Guo,Junxian He,Fengbin Tu,Zhiyao Xie
机构: 未知
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have sparked growing interest in automatic RTL optimization for better performance, power, and area (PPA). However, existing methods are still far from realistic RTL optimization. Their evaluation settings are often unrealistic: they are tested on manually degraded, small-scale RTL designs and rely on weak open-source tools. Their optimization methods are also limited, relying on coarse design-level feedback and simple pre-defined rewriting rules. To address these limitations, we present Dr. RTL, an agentic framework for RTL timing optimization in a realistic evaluation environment, with continual self-improvement through reusable optimization skills. We establish a realistic evaluation setting with more challenging RTL designs and an industrial EDA workflow. Within this setting, Dr. RTL performs closed-loop optimization through a multi-agent framework for critical-path analysis, parallel RTL rewriting, and tool-based evaluation. We further introduce group-relative skill learning, which compares parallel RTL rewrites and distills the optimization experience into an interpretable skill library. Currently, this library contains 47 pattern–strategy entries for cross-design reuse to improve PPA and accelerate convergence, and it can continue evolving over time. Evaluated on 20 real-world RTL designs, Dr. RTL achieves average WNS/TNS improvements of 21%/17% with a 6% area reduction over the industry-leading commercial synthesis tool.

[AI-27] AI-Enabled Covert Channel Detection in RF Receiver Architectures

【速读】:该论文旨在解决无线芯片中隐蔽信道(Covert Channel, CC)带来的安全威胁问题,即攻击者可通过CC从芯片中窃取敏感信息。其解决方案的关键在于提出一种基于人工智能的防御机制,部署于射频(RF)接收端,直接监控原始I/Q采样以实时检测嵌入在正常信号中的CC。该方案的核心创新包括:1)对先进卷积神经网络(Convolutional Neural Network, CNN)进行参数压缩,实现80%的参数减少,满足边缘部署需求;2)设计轻量级CNN硬件加速器并实现在FPGA上的部署,达到极低资源占用和107 GOPs/W的能效,是首个专为CC检测设计的AI硬件加速器,在准确率与模型尺寸之间取得优异平衡。

链接: https://arxiv.org/abs/2604.14987
作者: Abdelrahman Emad Abdelazim,Alan Rodrigo Diaz-Rizo,Hassan Aboushady,Haralampos-G. Stratigopoulos
机构: 未知
类目: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Covert channels (CCs) in wireless chips pose a serious security threat, as they enable the exfiltration of sensitive information from the chip to an external attacker. In this work, we propose an AI-based defense mechanism deployed at the RF receiver, where the model directly monitors raw I/Q samples to detect, in real time, the presence of a CC embedded within an otherwise nominal signal. We first compact a state-of-the-art convolutional neural network (CNN), achieving an 80% reduction in parameters, which is an essential requirement for efficient edge deployment. When evaluated on the open-source hardware Trojan (HT)-based CC dataset, the compacted CNN attains an average accuracy of 90.28% for CC detection and 86.50% for identifying the underlying HT, with results averaged across SNR values above 1 dB. For practical communication scenarios where SNR > 20 dB, the model achieves over 97% accuracy for both tasks. These results correspond to a minimal performance degradation of less than 2% compared to the baseline model. The compacted CNN is further benchmarked against alternative classifiers, demonstrating an excellent accuracy-model size trade-off. Finally, we design a lightweight CNN hardware accelerator and demonstrate it on an FPGA, achieving very low resource utilization and an efficiency of 107 GOPs/W. Being the first AI hardware accelerator proposed specifically for CC detection, we compare it against state-of-the-art AI accelerators for RF signal classification tasks such as modulation recognition, showing superior performance.

[AI-28] Discovering Novel LLM Experts via Task-Capability Coevolution ICLR2026

【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)开发中依赖静态预训练与后训练流程的局限性,即每次扩展模型能力都需要手动启动新的训练任务并使用固定数据集或奖励函数,难以实现持续、自动化的技能增长。其解决方案的关键在于引入一种基于共进化(coevolution)的新框架——评估与多样化能力共进化(Assessment Coevolving with Diverse Capabilities, AC/DC),通过模型合并(model merging)和合成数据生成(synthetic data generation)同步演化LLM群体与自然语言任务,从而在单一训练过程中发现具有日益新颖技能的模型集合。该方法不仅实现了更广泛的专家能力覆盖(Coverage of expertise),且无需显式基准优化即可持续提升性能,展现出利用现有模型作为“踏板”以加速模型能力多样性的潜力。

链接: https://arxiv.org/abs/2604.14969
作者: Andrew Dai,Boris Meinardus,Ciaran Regan,Yingtao Tian,Yujin Tang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICLR 2026

点击查看摘要

Abstract:Frontier model developers aim to train models continually to possess emergent, diverse capabilities. To extend capabilities, the current pre-training and post-training paradigm requires manually starting training runs with static datasets or reward functions every time. Addressing this limitation, our work pursues the insight that open-endedness (via the coevolution of models and tasks) can discover models with increasingly novel skills in a single run. We introduce a new model development framework that extends coevolution to large language model (LLM) discovery, open-ended Assessment Coevolving with Diverse Capabilities (AC/DC). AC/DC evolves both LLMs via model merging and natural language tasks via synthetic data generation. AC/DC discovers growing archives of LLMs that surpass the capabilities of larger LLMs while taking up less GPU memory. In particular, our LLM populations achieve a broader Coverage of expertise than other curated models or baselines on downstream benchmarks, without any explicit benchmark optimization. Furthermore, AC/DC improves Coverage over time, continually innovates on tasks and models, and improves performance in multi-agent best-of-N selection. Our findings highlight the potential of coevolution as a means of discovering broader sets of capabilities from base LLMs. Overall, AC/DC brings us one step closer to a profoundly new paradigm of LLM development, where continual improvements to the diversity of model capabilities can be accelerated by leveraging existing models as stepping stones to increasingly powerful models.

[AI-29] Calibration-Gated LLM Pseudo-Observations for Online Contextual Bandits

【速读】:该论文旨在解决上下文 bandit 算法在冷启动阶段(cold-start)因数据不足导致的高累积 regret 问题,即在初始阶段难以区分优质臂(good arms)与劣质臂(bad arms)。其解决方案的关键在于:通过引入大语言模型(LLM)生成反事实奖励(counterfactual rewards)作为伪观测(pseudo-observations)注入到 Disjoint LinUCB 学习器中,以加速早期探索并提升决策质量。其中,伪观测的权重由校准门控衰减调度(calibration-gated decay schedule)动态控制——该机制基于指数移动平均跟踪 LLM 在已选择臂上的预测准确率,高校准误差时抑制 LLM 影响,而高精度预测则在关键早期阶段获得更高权重。实验表明,任务特定提示(task-specific prompt)可使 MIND-small 数据集上的累积 regret 降低 19%,凸显了提示设计对性能的决定性作用,远超衰减策略或校准参数的影响。

链接: https://arxiv.org/abs/2604.14961
作者: Maksim Pershin,Ivan Golovanov,Pavel Baltabaev,Natalia Trankova
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Contextual bandit algorithms suffer from high regret during cold-start, when the learner has insufficient data to distinguish good arms from bad. We propose augmenting Disjoint LinUCB with LLM pseudo-observations: after each round, a large language model predicts counterfactual rewards for the unplayed arms, and these predictions are injected into the learner as weighted pseudo-observations. The injection weight is controlled by a calibration-gated decay schedule that tracks the LLM’s prediction accuracy on played arms via an exponential moving average; high calibration error suppresses the LLM’s influence, while accurate predictions receive higher weight during the critical early rounds. We evaluate on two contextual bandit environments - UCI Mushroom (2-arm, asymmetric rewards) and MIND-small (5-arm news recommendation) - and find that when equipped with a task-specific prompt, LLM pseudo-observations reduce cumulative regret by 19% on MIND relative to pure LinUCB. However, generic counterfactual prompt framing increases regret on both environments, demonstrating that prompt design is the dominant factor, more important than the choice of decay schedule or calibration gating parameters. We analyze the failure modes of calibration gating on domains with small prediction errors and provide a theoretical motivation for the bias-variance trade-off governing pseudo-observation weight.
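
Disjoint LinUCB 为每个臂维护岭回归统计量 A 与 b;该文把 LLM 对未选臂的反事实奖励按权重注入同一套统计量,权重由预测误差的指数滑动平均(EMA)门控。下面给出一维上下文情形的单臂更新草图;门控函数的具体形式为笔者假设,仅满足“校准误差越大、伪观测权重越小”这一性质:

```python
class GatedArm:
    """d=1 的 Disjoint LinUCB 单臂,支持加权伪观测(示意,非论文实现)。"""

    def __init__(self, beta=0.9):
        self.A, self.b = 1.0, 0.0   # 岭回归统计量(标量情形,A 含先验 I)
        self.err_ema, self.beta = 0.0, beta

    def update_real(self, x, r, r_llm=None):
        # 真实观测:标准 LinUCB 更新;若 LLM 也预测了该臂,则更新校准误差 EMA
        self.A += x * x
        self.b += r * x
        if r_llm is not None:
            self.err_ema = self.beta * self.err_ema + (1 - self.beta) * abs(r - r_llm)

    def gate(self):
        # 校准门(假设形式):误差越大,伪观测权重越小
        return 1.0 / (1.0 + 10.0 * self.err_ema)

    def update_pseudo(self, x, r_llm):
        w = self.gate()
        self.A += w * x * x
        self.b += w * r_llm * x

    def ucb(self, x, alpha=1.0):
        theta = self.b / self.A
        return theta * x + alpha * abs(x) / self.A ** 0.5
```

冷启动阶段门控接近 1,伪观测显著压缩置信区间;一旦 LLM 预测持续偏差,门控会自动衰减其影响。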

[AI-30] WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

【速读】:该论文旨在解决当前开源语音对话模型在智能性和表达能力上未达预期的问题,尤其是在将在线强化学习(Reinforcement Learning, RL)直接应用于语音对话建模时所面临的挑战。核心障碍在于稀疏的偏好监督信号与密集语音生成过程在共享参数更新下的不兼容性,导致偏好优化难以有效引导模型行为。解决方案的关键在于提出一种模态感知的自适应后训练方法:通过将偏好更新限制在语义通道内,并借助显式锚定机制改善声学表现,同时根据采样回放统计动态调节两者的混合比例,从而避免不可靠的偏好梯度干扰训练过程。该方法在多个语音对话基准和架构上均实现了语义质量和语音表达力的稳定提升。

链接: https://arxiv.org/abs/2604.14932
作者: Yifu Chen,Shengpeng Ji,Qian Chen,Tianle Liang,Yangzhuo Li,Ziqing Wang,Wen Wang,Jingyu Lu,Haoxiao Wang,Xueyi Pu,Fan Zhuo,Zhou Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on the analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.

[AI-31] Improving Sparse Autoencoder with Dynamic Attention

【速读】:该论文旨在解决稀疏自编码器(Sparse Autoencoders, SAEs)在基础模型激活解释中面临的最优稀疏度难以确定的问题:过度稀疏会导致重建性能下降,而稀疏度不足则削弱可解释性。现有激活函数如ReLU和TopK虽能提供一定稀疏性保障,但通常依赖额外的稀疏正则化或人工调参。其解决方案的关键在于引入基于sparsemax的动态稀疏注意力机制,该机制可根据每个神经元的输入复杂度自动推断激活元素数量,从而实现数据驱动的稀疏模式学习;具体而言,作者提出一种基于交叉注意力架构的新类SAE,将潜在特征作为查询,可学习词典作为键和值矩阵,并采用sparsemax注意力策略替代传统固定稀疏结构,显著提升了重建质量与概念清晰度,尤其在top-n分类任务中表现优异。

链接: https://arxiv.org/abs/2604.14925
作者: Dongsheng Wang,Jinsen Zhang,Dawei Su,Hui Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, sparse autoencoders (SAEs) have emerged as a promising technique for interpreting activations in foundation models by disentangling features into a sparse set of concepts. However, identifying the optimal level of sparsity for each neuron remains challenging in practice: excessive sparsity can lead to poor reconstruction, whereas insufficient sparsity may harm interpretability. While existing activation functions such as ReLU and TopK provide certain sparsity guarantees, they typically require additional sparsity regularization or cherry-picked hyperparameters. We show in this paper that dynamically sparse attention mechanisms using sparsemax can bridge this trade-off, due to their ability to determine the activation numbers in a data-dependent manner. Specifically, we first explore a new class of SAEs based on the cross-attention architecture with the latent features as queries and the learnable dictionary as the key and value matrices. To encourage sparse pattern learning, we employ a sparsemax-based attention strategy that automatically infers a sparse set of elements according to the complexity of each neuron, resulting in a more flexible and general activation function. Through comprehensive evaluation and visualization, we show that our approach successfully achieves lower reconstruction loss while producing high-quality concepts, particularly in top-n classification tasks.
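
sparsemax 将得分向量欧氏投影到概率单纯形上,会把低分量精确置零,因此激活元素的个数由数据本身决定,这正是该文“数据依赖稀疏度”的来源。其标准算法(排序 + 支撑集判定)可实现如下;这是通用实现,并非该文官方代码:

```python
def sparsemax(z):
    """单纯形投影:输出非负、和为 1,且低分量可精确为 0。"""
    zs = sorted(z, reverse=True)
    cum = 0.0
    k, cum_k = 0, 0.0
    for j, zj in enumerate(zs, start=1):
        cum += zj
        if 1 + j * zj > cum:      # 支撑集判定条件
            k, cum_k = j, cum
    tau = (cum_k - 1) / k          # 阈值:由支撑集内前 k 大分量决定
    return [max(zj - tau, 0.0) for zj in z]
```

与 softmax 不同,sparsemax 对得分足够低的字典原子输出严格的 0,而不是一个很小的正数,因此无需额外的稀疏正则项。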

[AI-32] Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models

【速读】:该论文旨在解决全双工语音对话模型(Full-Duplex Spoken Dialogue Models, SDMs)中实现类人交互的挑战,核心障碍在于现有自动化评估指标(如行为统计或时间预测准确性)无法提供可靠奖励信号以支持强化学习(Reinforcement Learning, RL)训练,而人工评估虽丰富却成本高、一致性差且难以扩展。解决方案的关键是提出一种双轴生成式奖励模型(Dual-Axis Generative Reward Model),该模型基于详尽的交互动态分类体系和标注数据集进行训练,能够输出单一评分并同时分离评估语义质量和交互时序两个维度,从而为SDMs提供精准诊断反馈和适用于在线强化学习的可靠、具指导性的奖励信号。

链接: https://arxiv.org/abs/2604.14920
作者: Yifu Chen,Shengpeng Ji,Zhengqing Liu,Qian Chen,Wen Wang,Ziqing Wang,Yangzhuo Li,Tianle Liang,Zhou Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Achieving seamless, human-like interaction remains a key challenge for full-duplex spoken dialogue models (SDMs). Reinforcement learning (RL) has substantially enhanced text- and vision-language models, while well-designed reward signals are crucial for the performance of RL. We consider RL a promising strategy to address the key challenge for SDMs. However, a fundamental barrier persists: prevailing automated metrics for assessing interaction quality rely on superficial proxies, such as behavioral statistics or timing-prediction accuracy, failing to provide reliable reward signals for RL. On the other hand, human evaluations, despite their richness, remain costly, inconsistent, and difficult to scale. We tackle this critical barrier by proposing a Dual-Axis Generative Reward Model, which is trained to understand complex interaction dynamics using a detailed taxonomy and an annotated dataset, produces a single score and, crucially, provides separate evaluations for semantic quality and interaction timing. Such dual outputs furnish precise diagnostic feedback for SDMs and deliver a dependable, instructive reward signal suitable for online reinforcement learning. Our model achieves state-of-the-art performance on interaction-quality assessment across a wide spectrum of datasets, spanning synthetic dialogues and complex real-world interactions.

[AI-33] oward Agent ic RAG for Ukrainian ICIP

【速读】:该论文旨在解决乌克兰语(Ukrainian)场景下生成式 AI(Generative AI)在多领域文档理解任务中的性能瓶颈问题,特别是检索增强生成(Retrieval-Augmented Generation, RAG)系统中因检索质量不足导致的答案准确性受限。其解决方案的关键在于引入轻量级代理机制(agentic layer),通过两阶段检索(BGE-M3 与 BGE 重排序)结合查询重述和答案重试循环(answer-retry loops),在 Qwen2.5-3B-Instruct 模型基础上提升回答准确率。研究表明,尽管代理机制能有效改善答案质量,但整体性能仍受限于文档和页面识别的准确性,凸显了高质量检索是当前系统的核心瓶颈。

链接: https://arxiv.org/abs/2604.14896
作者: Marta Sumyk,Oleksandr Kosovan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper is a research report based on our participation in the UNLP 2026 Shared Task

点击查看摘要

Abstract:We present an initial investigation into Agentic Retrieval-Augmented Generation (RAG) for Ukrainian, conducted within the UNLP 2026 Shared Task on Multi-Domain Document Understanding. Our system combines two-stage retrieval (BGE-M3 with BGE reranking) with a lightweight agentic layer performing query rephrasing and answer-retry loops on top of Qwen2.5-3B-Instruct. Our analysis reveals that retrieval quality is the primary bottleneck: agentic retry mechanisms improve answer accuracy but the overall score remains constrained by document and page identification. We discuss practical limitations of offline agentic pipelines and outline directions for combining stronger retrieval with more advanced agentic reasoning for Ukrainian.
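
文中“两阶段检索 + 查询重述 + 答案重试”的代理流程可抽象为如下控制循环;retrieve、rerank、generate、rephrase、is_grounded 均为假设的占位接口(真实系统中分别对应 BGE-M3 召回、BGE 重排与 Qwen2.5-3B-Instruct 生成),仅示意流程:

```python
def agentic_rag(question, docs, retrieve, rerank, generate, rephrase,
                is_grounded, max_retries=2):
    """答案无法被上下文支撑时重述查询再试;全部失败则返回最后一次答案。"""
    query, answer = question, None
    for _ in range(max_retries + 1):
        candidates = retrieve(query, docs)    # 第一阶段:向量召回
        context = rerank(query, candidates)   # 第二阶段:重排序
        answer = generate(question, context)
        if is_grounded(answer, context):      # 答案可由上下文支撑则接受
            return answer
        query = rephrase(question)            # 否则重述查询后重试
    return answer
```

该抽象也解释了文中的结论:重试只能改善生成环节,若 retrieve/rerank 找不到正确文档与页面,整体得分仍受检索质量约束。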

[AI-34] Beyond Importance Sampling: Rejection-Gated Policy Optimization

【速读】:该论文旨在解决强化学习中策略优化时因重要性采样(Importance Sampling, IS)比率方差过大导致的梯度不稳定问题,尤其是在重尾分布下IS方差发散的情况下,传统方法如PPO和REINFORCE难以保证稳定收敛。解决方案的关键在于提出拒绝门控策略优化(Rejection-Gated Policy Optimization, RGPO),其核心思想是将传统的基于重要性权重的样本重加权机制转化为一个可微的接受门控函数 α_θ(s, a) = g(r_θ(s, a)) ∈ [0, 1],该门控函数直接参与梯度计算并随策略隐式更新,从而实现对可信样本的自动筛选与动态调整。RGPO不仅在理论上保证了有限且有界的梯度方差,还提供了近似的单调策略改进保障,并在在线偏好微调任务中表现出优于PPO-RLHF的性能(奖励提升+14.8%,KL散度降低-16.0%)。

链接: https://arxiv.org/abs/2604.14895
作者: Ziwu Sun,Zhen Gao,Jiyong Zhang,Jiaheng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages, includes theoretical analysis and experiments

点击查看摘要

Abstract:We propose a new perspective on policy optimization: rather than reweighting all samples by their importance ratios, an optimizer should select which samples are trustworthy enough to drive a policy update. Building on this view, we introduce Rejection-Gated Policy Optimization (RGPO), which replaces the importance sampling ratio r_theta = pi_theta / pi_old with a smooth, differentiable acceptance gate alpha_theta(s, a) = g(r_theta(s, a)) in the range [0, 1]. Unlike prior work that applies rejection sampling as a data-level heuristic before training, RGPO elevates rejection to an optimization principle: the gate participates directly in gradient computation and is implicitly updated alongside the policy. RGPO provides a unified framework: the policy gradients of TRPO, PPO, and REINFORCE all correspond to specific choices of the effective gradient weight w(r) = g'(r) * r. We prove that RGPO guarantees finite, bounded gradient variance even when importance sampling ratios are heavy-tailed (where IS variance diverges). We further show that RGPO incurs only a bounded, controllable bias and provides an approximate monotonic policy improvement guarantee analogous to TRPO. RGPO matches PPO in computational cost, requires no second-order optimization, and extends naturally to RLHF-style preference alignment. In online preference fine-tuning of Qwen2.5-1.5B-Instruct on Anthropic HH-RLHF (n = 3 seeds), RGPO uses a dual-ratio gate that anchors learning to both the previous policy and the reference model, achieving a Pareto-dominant outcome: the highest reward among online RL methods (+14.8% vs. PPO-RLHF) and the lowest KL divergence to the reference model (-16.0% vs. PPO-RLHF, -53.1% vs. GRPO).
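
RGPO 的统一视角是有效梯度权重 w(r) = g'(r)·r。取对数比率上的高斯门 g(r) = exp(−(ln r)²/2) 作为一个示例门控(该形式为笔者假设,仅满足“平滑、取值于 [0, 1]”的要求,论文未必采用此形),此时 w(r) = −ln r · g(r),可以数值验证它在重尾比率下仍然有界,而裸重要性权重 r 无界:

```python
import math

def gate(r):
    # 假设的平滑接受门:比率 r 越偏离 1,接受度越低
    return math.exp(-0.5 * math.log(r) ** 2)

def effective_weight(r):
    # w(r) = g'(r) * r;对该门控有解析式 w(r) = -ln(r) * g(r)
    return -math.log(r) * gate(r)

# 数值验证:即使 r 跨越多个数量级,有效权重仍有界
ws = [abs(effective_weight(r)) for r in (0.01, 0.5, 1.0, 2.0, 100.0, 1e6)]
```

对照之下,PPO 的裁剪目标对应的有效权重在区间内等于 r、区间外为 0,可视为 w(r) 的一个分段(不光滑)特例。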

[AI-35] Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?

【速读】:该论文旨在解决医疗人工智能(AI)系统评估中依赖专家临床医师小组所面临的成本高、效率低的问题,提出以大型语言模型(LLM)组成的“jury”作为替代性评判机制。其解决方案的关键在于构建一个由三个前沿AI模型组成的多模型LLM jury,并通过系统性评估其在诊断准确性、临床推理质量及安全性等方面的性能表现,发现未经校准的LLM jury评分虽整体偏低,但其与主专家小组在排名顺序上保持高度一致性,且严重安全错误概率更低;进一步采用保序回归(isotonic regression)进行后处理校准后,LLM jury评分能显著提升与人类专家评价的一致性,从而证明 calibrated multi-model LLM jury 可作为可信可靠的专家评审代理工具用于医疗AI系统的基准测试。

链接: https://arxiv.org/abs/2604.14892
作者: Amy Rouillard,Sitwala Mundiab,Linda Camarab,Michael Cameron Gramaniec,Ziyaad Dangorc,Ismail Kallad,Shabir A. Madhic,Kajal Morarc,Marlvin T. Ncubec,Haroon Saloojeee,Bruce A. Bassett
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluating medical AI systems using expert clinician panels is costly and slow, motivating the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM jury composed of three frontier AI models scoring 3333 diagnoses on 300 real-world middle-income country (MIC) hospital cases. Model performance was benchmarked against expert clinician panel and independent human re-scoring panel evaluations. Both LLM and clinician-generated diagnoses are scored across four dimensions: diagnosis, differential diagnosis, clinical reasoning and negative treatment risk. For each of these, we assess scoring difference, inter-rater agreement, scoring stability, severe safety errors and the effect of post-hoc calibration. We find that: (i) the uncalibrated LLM jury scores are systematically lower than clinician panels scores; (ii) the LLM Jury preserves ordinal agreement and exhibits better concordance with the primary expert panels than the human expert re-score panels do; (iii) the probability of severe errors is lower in LLM Jury models compared to the human expert re-score panels; (iv) the LLM Jury shows excellent agreement with primary expert panels' rankings. We find that the LLM jury combined with AI model diagnoses can be used to identify ward diagnoses at high risk of error, enabling targeted expert review and improved panel efficiency; (v) LLM jury models show no self-preference bias. They did not score diagnoses generated by their own underlying model or models from the same vendor more (or less) favourably than those generated by other models. Finally, we demonstrate that LLM jury calibration using isotonic regression improves alignment with human expert panel evaluations. Together, these results provide compelling evidence that a calibrated, multi-model LLM jury can serve as a trustworthy and reliable proxy for expert clinician evaluation in medical AI benchmarking.
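论文用保序回归(isotonic regression)把 LLM jury 分数单调地映射到专家分数。其核心算法 Pool Adjacent Violators(PAV)可以用下面的纯 Python 最小草图理解(仅演示一维非降拟合的原理;论文实际使用的实现与特征未公开,生产环境通常用 scikit-learn 的 `IsotonicRegression`):

```python
def isotonic_fit(y):
    # Pool Adjacent Violators(PAV):返回 y 的非降最小二乘拟合
    out = []  # 栈,元素为 [块均值, 块大小]
    for v in y:
        out.append([v, 1.0])
        # 只要相邻块违反单调性,就按加权均值合并
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, w2 = out.pop()
            m1, w1 = out.pop()
            w = w1 + w2
            out.append([(m1 * w1 + m2 * w2) / w, w])
    fitted = []
    for m, w in out:
        fitted.extend([m] * int(w))
    return fitted
```

校准时,先按 LLM 分数排序,再对相应的专家分数做 PAV 拟合,即得到一条单调校准曲线。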

[AI-36] MemoSight: Unifying Context Compression and Multi Token Prediction for Reasoning Acceleration

【速读】:该论文旨在解决链式思维(Chain-of-thought, CoT)推理在大语言模型(LLM)中因键值缓存(KV cache)随生成token数量线性增长而导致的效率瓶颈问题,即推理速度慢和内存占用高的挑战。其解决方案的关键在于提出了一种统一框架MemoSight(Memory-Foresight-based reasoning),通过结合上下文压缩与多token预测机制,在保持CoT推理性能的同时显著提升效率;该框架采用统一的极简设计,利用特殊标记及其对应的定位布局来区分不同类型的token,从而实现对KV缓存占用的最大化缩减(最高达66%)并加速推理(最高达1.56倍)。

链接: https://arxiv.org/abs/2604.14889
作者: Xinyu Liu,Xin Liu,Bo Jin,Runsong Zhao,Pengcheng Huang,Junhao Ruan,Bei Li,Chunyang Xiao,Tong Xiao,Jingbo Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Chain-of-thought (CoT) reasoning enables LLMs to solve challenging reasoning problems, as KV cache grows linearly with the number of generated tokens, CoT reasoning faces scaling issues in terms of speed and memory usage. In this work, we propose MemoSight (Memory-Foresight-based reasoning), a unified framework that integrates both context compression and multi-token prediction to mitigate the efficiency issues while maintaining CoT reasoning performance. Our framework adopts the same minimalist design for both context compression and multi-token prediction via special tokens and their corresponding position layout tailored to each token type. Comprehensive experiments on four reasoning benchmarks demonstrate that MemoSight reduces the KV cache footprint by up to 66% and accelerates inference by 1.56x, while outperforming existing CoT compression methods.

[AI-37] Cooperate to Compete: Strategic Data Generation and Incentivization Framework for Coopetitive Cross-Silo Federated Learning

【速读】:该论文旨在解决跨孤岛联邦学习(Cross-Silo Federated Learning, CFL)中组织间“合作与竞争并存”(coopetition)场景下的激励不兼容问题,即在不共享原始数据的前提下,如何设计机制使各参与方在提升全局模型性能的同时,合理平衡因训练贡献而增强竞争对手所带来的潜在损失。其解决方案的关键在于提出一个名为 CoCoGen+ 的协同生成与激励框架,该框架将非独立同分布(non-IID)数据特性与组织间的竞争关系联合建模,并将生成式 AI(Generative AI)驱动的合成数据生成作为组织的战略决策变量,通过构建加权势博弈(weighted potential game)模型,使每个组织在每轮训练中权衡学习收益、计算成本及竞争导致的效用损失,从而实现社会福利最大化。此外,引入基于收益再分配的激励机制以补偿组织因贡献和竞争效应造成的效用下降,从而促进长期协作。

链接: https://arxiv.org/abs/2604.14886
作者: Thanh Linh Nguyen,Nguyen Van Huynh,Quoc-Viet Pham
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT)
备注: Competition-aware Federated Learning, Strategic data generation approaches, A payoff-redistribution based incentive mechanism, Potential game, Social welfare

点击查看摘要

Abstract:In data-sensitive domains such as healthcare, cross-silo federated learning (CFL) allows organizations to collaboratively train AI models without sharing raw data. However, practical CFL deployments are inherently coopetitive, in which organizations cooperate during model training while competing in downstream markets. In such settings, training contributions, including data volume, quality, and diversity, can improve the global model yet inadvertently strengthen rivals. This dilemma is amplified by non-IID data, which leads to asymmetric learning gains and undermines sustained participation. While existing competition-aware CFL and incentive-design approaches reward organizations based on marginal training contributions, they fail to account for the costs of strengthening competitors. In this paper, we introduce CoCoGen+, a coopetition-compatible data generation and incentivization framework that jointly models non-IID data and inter-organizational competition while endogenizing GenAI-based synthetic data generation as a strategic decision. Specifically, CoCoGen+ formulates each training round as a weighted potential game, where organizations strategically decide how much synthetic data to generate by balancing learning performance gains against computational costs and competition-caused utility losses. We then provide a tractable equilibrium characterization and derive implementable generation strategies to maximize social welfare. To promote long-term collaboration, we integrate a payoff redistribution-based incentive mechanism to compensate organizations for their contributions and competition-caused utility degradation. Experiments on varying learning tasks validate the feasibility of CoCoGen+. The results show how non-IID data, competition intensity, and incentives shape organizational strategies and social welfare, while CoCoGen+ outperforms baselines in efficiency.
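摘要中"组织在收益、成本与竞争损失之间权衡生成多少合成数据"的博弈结构,可以用一个玩具化的两组织交替最优反应(best response)草图来理解。此处的效用形式 a·√(总数据量) − c·x_i − k·x_{-i} 与网格搜索均为示意假设,并非论文的原始效用函数:

```python
def utility(x_i, x_others, a=10.0, c=1.0, k=0.3):
    # 假设效用:全局模型收益(凹函数) − 自身生成成本 − 竞争导致的损失
    return a * (x_i + x_others) ** 0.5 - c * x_i - k * x_others

def best_response(x_others, grid):
    # 在离散策略网格上对自身合成数据量做最优反应
    return max(grid, key=lambda x: utility(x, x_others))

grid = [i / 10 for i in range(201)]  # 每个组织可生成 0~20 单位合成数据
x = [5.0, 5.0]                       # 初始策略
for _ in range(10):                  # 交替最优反应;势博弈中该过程有限步收敛
    x[0] = best_response(x[1], grid)
    x[1] = best_response(x[0], grid)
```

该玩具例收敛到非对称均衡,总数据量 (a/2c)² = 25:这也说明为什么论文需要收益再分配机制来补偿贡献不对称的组织。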

[AI-38] SOLIS: Physics-Informed Learning of Interpretable Neural Surrogates for Nonlinear Systems

【速读】:该论文旨在解决非线性系统辨识中物理可解释性与模型灵活性之间的权衡问题:传统方法虽能提供结构化、控制相关的模型,但受限于刚性的参数形式,难以捕捉复杂非线性特性;而神经微分方程(Neural ODEs)虽表达能力强,却属于黑箱模型。针对现有物理信息神经网络(PINNs)在未知或状态依赖的动力学下易出现可识别性失败的问题,作者提出SOLIS方法,其核心在于通过一个状态条件化的二阶代理模型(state-conditioned second-order surrogate model)建模未知动力学,并将辨识任务重构为学习一种准线性参数时变(Quasi-Linear Parameter-Varying, Quasi-LPV)表示,从而无需预设全局方程即可恢复可解释的自然频率、阻尼和增益等物理参数。该方案的关键创新包括:解耦轨迹重建与参数估计过程,以及引入基于局部物理提示的窗口化岭回归锚点(Local Physics Hints windowed ridge-regression anchors),结合循环课程学习策略,有效稳定训练并防止优化崩溃。

链接: https://arxiv.org/abs/2604.14879
作者: Murat Furkan Mansur,Tufan Kumbasar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: in the International Joint Conference on Neural Networks, 2026

点击查看摘要

Abstract:Nonlinear system identification must balance physical interpretability with model flexibility. Classical methods yield structured, control-relevant models but rely on rigid parametric forms that often miss complex nonlinearities, whereas Neural ODEs are expressive yet largely black-box. Physics-Informed Neural Networks (PINNs) sit between these extremes, but inverse PINNs typically assume a known governing equation with fixed coefficients, leading to identifiability failures when the true dynamics are unknown or state-dependent. We propose SOLIS, which models unknown dynamics via a state-conditioned second-order surrogate model and recasts identification as learning a Quasi-Linear Parameter-Varying (Quasi-LPV) representation, recovering interpretable natural frequency, damping, and gain without presupposing a global equation. SOLIS decouples trajectory reconstruction from parameter estimation and stabilizes training with a cyclic curriculum and Local Physics Hints, windowed ridge-regression anchors that mitigate optimization collapse. Experiments on benchmarks show accurate parameter-manifold recovery and coherent physical rollouts from sparse data, including regimes where standard inverse methods fail.
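速读中"可解释的自然频率、阻尼和增益"对应的二阶代理模型 x'' + 2ζω x' + ω² x = k·ω² u,可以用如下最小前向模拟草图理解:给定参数做显式欧拉积分,并检验阶跃输入下的稳态 x_ss = k·u(参数取值与积分方式均为示意假设,非论文实现):

```python
def simulate(omega, zeta, gain, u, dt=0.01, steps=2000):
    # 二阶系统 x'' + 2*zeta*omega*x' + omega^2*x = gain*omega^2*u 的显式欧拉积分
    x, v = 0.0, 0.0
    for _ in range(steps):
        a = gain * omega ** 2 * u - 2 * zeta * omega * v - omega ** 2 * x
        x += v * dt
        v += a * dt
    return x

# 阶跃输入 u = 1 时,稳态响应应收敛到 gain * u
x_ss = simulate(omega=2.0, zeta=0.7, gain=1.5, u=1.0)
```

SOLIS 的关键在于让 (ω, ζ, k) 随状态变化(Quasi-LPV),而不是像这里一样取常数。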

[AI-39] Vibe-Coding: Feedback-Based Automated Verification with no Human Code Inspection a Feasibility Study

【速读】:该论文旨在解决生成式 AI(Generative AI)在运行时自适应系统中,尤其是集体自适应系统(Collective Adaptive Systems, CAS)中,基于反馈的自动化验证问题——即如何在无需人工代码审查的情况下,可靠地检测并修复由大型语言模型(LLM)生成的适应管理器(adaptation manager)中的错误。其核心挑战在于设计一种能精确识别运行时故障并提供可操作反馈的机制,使 LLM 能够高效迭代修正代码。解决方案的关键在于将适应循环与“vibe coding”反馈循环相结合,通过两类约束进行实时验证:(i) 通用架构约束和 (ii) 用新型一阶时序逻辑——功能性约束逻辑(Functional Constraints Logic, FCL)形式化的功能约束,从而实现细粒度的违规检测。实验表明,这种高精度反馈显著提升修复效率,通常在几次迭代内即可生成有效的适应管理器,而粗粒度指标反馈则易导致停滞,凸显了反馈精度对无编程背景领域专家驱动的系统开发中可靠 vibe coding 的决定性作用。

链接: https://arxiv.org/abs/2604.14867
作者: Michal Töpfer,František Plášil,Tomáš Bureš,Petr Hnětynka
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vibe coding inherently assumes iterative refinement of LLM-generated code through feedback loops. While effective for conventional software tasks, its reliability in runtime-adaptive systems is unclear – especially when generated code is not manually inspected. This paper studies feedback-based automated verification of LLM-generated adaptation managers in Collective Adaptive Systems (CAS). We focus on the key challenges of verification in the loop: how to detect failures of generated code at runtime and how to report them precisely enough for an LLM to fix them. We combine the adaptation loop with a vibe-coding feedback loop where correctness is checked against (i) generic architectural constraints and (ii) functional constraints formalized in Functional Constraints Logic (FCL), a novel first-order temporal logic over potentially finite traces. Conducting the Dragon Hunt CAS case study, we show that fine-grained constraint violations provide actionable feedback that typically yields a valid adaptation manager within a few iterations, while simple coarse metric-based feedback often stalls. Our findings suggest that feedback precision is the dominant factor for reliable vibe coding in systems designed by domain experts with no programming skills, thereby obviating the need for human code inspection. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.14867 [cs.SE] (or arXiv:2604.14867v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2604.14867 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
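论文的 FCL 是定义在(可能有限的)迹上的一阶时序逻辑。下面是一个与之精神相近、但高度简化的有限迹约束检查器草图(always / eventually / "触发-响应"模式);状态表示与谓词接口均为示意假设,并非论文的 FCL 语法:

```python
def always(pred, trace):
    # 有限迹上的 "总是":谓词在每个状态都成立
    return all(pred(s) for s in trace)

def eventually(pred, trace):
    # 有限迹上的 "最终":谓词在某个状态成立
    return any(pred(s) for s in trace)

def responds(trigger, response, trace):
    # 响应模式:每当 trigger 成立,其后(含当步)response 最终成立
    for i, s in enumerate(trace):
        if trigger(s) and not eventually(response, trace[i:]):
            return False
    return True
```

在 vibe-coding 反馈环中,这类检查器在运行时对生成的适应管理器产生的迹求值,细粒度的违规位置即可作为反馈回传给 LLM。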

[AI-40] Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-CodeX

【速读】:该论文旨在解决代理系统(agent systems)在多样化执行环境中面临的轨迹级安全评估与诊断缺乏适应性基准的问题。随着代理框架的架构保持相对稳定,而其具体执行环境、工具生态系统和产品能力快速演进,现有安全评估方法难以覆盖新兴场景中的风险。解决方案的关键在于提出一种可扩展的三维度安全分类法(Safety Taxonomy),即基于风险来源(risk source)、失效模式(failure mode)和现实危害(real-world harm)进行定制化调整,并以此为基础定义适配不同领域(如OpenClaw和OpenAI Codex)的基准规范,从而在统一的ATBench构建流程中实现对新执行环境的安全评估与诊断能力扩展。

链接: https://arxiv.org/abs/2604.14858
作者: Zhonghao Yang,Yu Li,Yanxu Zhu,Tianyi Zhou,Yuejin Xie,Haoyu Luo,Jing Shao,Xia Hu,Dongrui Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 18 pages, 3 figures

点击查看摘要

Abstract:As agent systems move into increasingly diverse execution settings, trajectory-level safety evaluation and diagnosis require benchmarks that evolve with them. ATBench is a diverse and realistic agent trajectory benchmark for safety evaluation and diagnosis. This report presents ATBench-Claw and ATBench-CodeX, two domain-customized extensions that carry ATBench into the OpenClaw and OpenAI Codex / Codex-runtime settings. The key adaptation mechanism is to analyze each new setting, customize the three-dimensional Safety Taxonomy over risk source, failure mode, and real-world harm, and then use that customized taxonomy to define the benchmark specification consumed by the shared ATBench construction pipeline. This extensibility matters because agent frameworks remain relatively stable at the architectural level even as their concrete execution settings, tool ecosystems, and product capabilities evolve quickly. Concretely, ATBench-Claw targets OpenClaw-sensitive execution chains over tools, skills, sessions, and external actions, while ATBench-CodeX targets trajectories in the OpenAI Codex / Codex-runtime setting over repositories, shells, patches, dependencies, approvals, and runtime policy boundaries. Our emphasis therefore falls on taxonomy customization, domain-specific risk coverage, and benchmark design under a shared ATBench generation framework.

[AI-41] TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models ACL2026

【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在复杂任务中因自回归推理导致的高推理延迟问题,同时探索小型推理模型(Small Reasoning Models, SRM)在加速LRM推理中的潜力与局限性。研究系统地识别出SRMs在实际应用中存在的三大推理风险:路径偏离(path divergence)、认知过载(cognitive overload)和恢复能力缺失(recovery inability)。解决方案的关键在于提出TrigReason框架——一种基于触发机制的协同推理方法,通过选择性干预而非持续轮询来优化资源分配:仅在初始战略规划阶段(战略预热触发)、检测到异常自信时(认知卸载触发)或陷入无效循环时(干预请求触发)激活LRM,从而大幅降低对LRM的依赖。实验表明,TrigReason在保持与全量LRM相当准确率的同时,将SRM可处理的推理步骤提升1.70–4.79倍,并在边缘-云环境下实现43.9%的延迟降低和73.3%的API成本节约。

链接: https://arxiv.org/abs/2604.14847
作者: Yi Zhao,Yajuan Peng,Cam-Tu Nguyen,Zuchao Li,Xiaoliang Wang,Xiaoming Fu,Hai Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL 2026 Findings

点击查看摘要

Abstract:Large Reasoning Models (LRMs) achieve strong performance on complex tasks through extended chains of thought but suffer from high inference latency due to autoregressive reasoning. Recent work explores using Small Reasoning Models (SRMs) to accelerate LRM inference. In this paper, we systematically characterize the capability boundaries of SRMs and identify three common types of reasoning risks: (1) path divergence, where SRMs lack the strategic ability to construct an initial plan, causing reasoning to deviate from the most probable path; (2) cognitive overload, where SRMs fail to solve particularly difficult steps; and (3) recovery inability, where SRMs lack robust self-reflection and error correction mechanisms. To address these challenges, we propose TrigReason, a trigger-based collaborative reasoning framework that replaces continuous polling with selective intervention. TrigReason delegates most reasoning to the SRM and activates LRM intervention only when necessary-during initial strategic planning (strategic priming trigger), upon detecting extraordinary overconfidence (cognitive offload trigger), or when reasoning falls into unproductive loops (intervention request trigger). The evaluation results on AIME24, AIME25, and GPQA-D indicate that TrigReason matches the accuracy of full LRMs and SpecReason, while offloading 1.70x - 4.79x more reasoning steps to SRMs. Under edge-cloud conditions, TrigReason reduces latency by 43.9% and API cost by 73.3%. Our code is available at this https URL
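三类触发器的调度逻辑可以用下面的最小草图表达。`srm_step` / `lrm_step` 为假设的模型接口(SRM 额外返回置信度),阈值与窗口大小为示意取值,并非论文的原始超参:

```python
def collaborate(srm_step, lrm_step, problem, max_steps=10,
                conf_hi=0.95, loop_window=3):
    # 战略预热触发:首步由 LRM 制定初始规划
    steps = [lrm_step(problem, [])]
    for _ in range(max_steps - 1):
        step, conf = srm_step(problem, steps)
        if conf > conf_hi:                   # 认知卸载触发:检测到异常自信
            step = lrm_step(problem, steps)
        elif step in steps[-loop_window:]:   # 干预请求触发:陷入重复/无效循环
            step = lrm_step(problem, steps)
        steps.append(step)
    return steps
```

除上述三个触发点外,所有推理步骤都留在 SRM 上完成,这就是"选择性干预而非持续轮询"的含义。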

[AI-42] Intermediate Layers Encode Optimal Biological Representations in Single-Cell Foundation Models ICLR2026

【速读】:该论文旨在解决当前单细胞基础模型(single-cell foundation model)评估中普遍依赖最终层嵌入(final layer embeddings)作为最优特征表示的假设问题,这一做法可能忽视了不同任务和细胞状态下最佳特征提取层的差异。解决方案的关键在于系统性地评估从scFoundation(100M参数)和Tahoe-X1(1.3B参数)模型中提取的逐层表示(layer-wise representations),发现最优层位置具有任务依赖性和上下文依赖性:例如轨迹推断任务在模型深度60%处达到峰值性能(比最终层高31%),而扰动响应预测则在不同T细胞激活状态下最优层位置变化范围达0–96%。此外,研究还揭示在静息状态细胞中第一层嵌入优于所有深层表示,挑战了传统“层次化特征抽象”假设。因此,论文强调特征提取位置(where)与模型学习内容(what)同等重要,主张根据生物任务和细胞背景进行针对性的层选择而非默认使用最终层嵌入。

链接: https://arxiv.org/abs/2604.14838
作者: Vincenzo Yuto Civale,Roberto Semeraro,Andrew David Bagdanov,Alberto Magi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures, 4 tables. Accepted at the LMRL (Learning Meaningful Representations of Life) Workshop at ICLR 2026

点击查看摘要

Abstract:Current single-cell foundation model benchmarks universally extract final layer embeddings, assuming these represent optimal feature spaces. We systematically evaluate layer-wise representations from scFoundation (100M parameters) and Tahoe-X1 (1.3B parameters) across trajectory inference and perturbation response prediction. Our analysis reveals that optimal layers are task-dependent (trajectory peaks at 60% depth, 31% above final layers) and context-dependent (perturbation optima shift 0-96% across T cell activation states). Notably, first-layer embeddings outperform all deeper layers in quiescent cells, challenging assumptions about hierarchical feature abstraction. These findings demonstrate that “where” to extract features matters as much as “what” the model learns, necessitating systematic layer evaluation tailored to biological task and cellular context rather than defaulting to final-layer embeddings.
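"系统性逐层评估"的流程可以用一个轻量探针草图示意:对每一层提取的嵌入用最近质心分类打分,选出最优层及其相对深度。探针形式与玩具数据均为示意,并非论文的评测协议:

```python
import statistics

def centroid_probe_score(X, y):
    # 最近质心分类的训练集准确率,作为某一层表示质量的轻量探针
    classes = sorted(set(y))
    cent = {c: [statistics.mean(x[d] for x, yy in zip(X, y) if yy == c)
                for d in range(len(X[0]))] for c in classes}
    def pred(x):
        return min(classes, key=lambda c: sum((xi - ci) ** 2
                                              for xi, ci in zip(x, cent[c])))
    return sum(pred(x) == yy for x, yy in zip(X, y)) / len(X)

def best_layer(layer_embeddings, y):
    # 对每层嵌入打分,返回最优层索引及其相对深度(0 = 首层,1 = 末层)
    scores = [centroid_probe_score(X, y) for X in layer_embeddings]
    i = max(range(len(scores)), key=scores.__getitem__)
    return i, i / max(len(scores) - 1, 1)
```

论文的结论正是:这个 argmax 往往落在中间层(如轨迹推断在约 60% 深度),而不是默认的最终层。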

[AI-43] Beyond Literal Summarization: Redefining Hallucination for Medical SOAP Note Evaluation

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在临床文档生成任务(如SOAP病历撰写)中评估方法存在的系统性偏差问题。现有评估方式(包括自动指标和LLM作为裁判框架)过度依赖词法忠实度,将临床合理但未在原始语料中明确出现的信息错误标记为“幻觉”,从而高估了模型的错误率并扭曲了真实性能。解决方案的关键在于引入基于临床推理的评估范式:通过校准提示(calibrated prompting)与医学本体论(medical ontologies)驱动的检索机制,使评估标准与临床决策逻辑对齐。实验表明,在词法评估下平均幻觉率为35%,而采用推理感知评估后降至9%,剩余案例多为真正的安全风险,证明当前评估体系过度惩罚了合法的临床推理过程,强调了在高情境依赖领域(如医学)中建立临床知情评估标准的必要性。

链接: https://arxiv.org/abs/2604.14829
作者: Bhavik Vachhani,Kush Shrisvastava,Pranshu Nema,Sai Chiranthan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures,3 tables

点击查看摘要

Abstract:Evaluating large language models (LLMs) for clinical documentation tasks such as SOAP note generation remains challenging. Unlike standard summarization, these tasks require clinical abstraction, normalization of colloquial language, and medically grounded inference. However, prevailing evaluation methods including automated metrics and LLM as judge frameworks rely on lexical faithfulness, often labeling any information not explicitly present in the transcript as hallucination. We show that such approaches systematically misclassify clinically valid outputs as errors, inflating hallucination rates and distorting model assessment. Our analysis reveals that many flagged hallucinations correspond to legitimate clinical transformations, including synonym mapping, abstraction of examination findings, diagnostic inference, and guideline consistent care planning. By aligning evaluation criteria with clinical reasoning through calibrated prompting and retrieval grounded in medical ontologies we observe a significant shift in outcomes. Under a lexical evaluation regime, the mean hallucination rate is 35%, heavily penalizing valid reasoning. With inference aware evaluation, this drops to 9%, with remaining cases reflecting genuine safety concerns. These findings suggest that current evaluation practices over penalize valid clinical reasoning and may measure artifacts of evaluation design rather than true errors, underscoring the need for clinically informed evaluation in high context domains like medicine. Comments: 12 pages, 2 figures,3 tables Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2604.14829 [cs.AI] (or arXiv:2604.14829v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.14829 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-44] Diffusion Crossover: Defining Evolutionary Recombination in Diffusion Models via Noise Sequence Interpolation

【速读】:该论文旨在解决交互式进化计算(Interactive Evolutionary Computation, IEC)在高维生成表示中难以定义语义一致的交叉(crossover)操作的问题,这一局限常导致搜索过程以突变为主,缺乏有效的重组机制。解决方案的关键在于提出扩散交叉(Diffusion crossover),即在去噪扩散概率模型(Denoising Diffusion Probabilistic Models, DDPMs)的逆向过程中,通过球面线性插值(Slerp)对选定父代图像对应的噪声序列进行分步插值,从而生成继承双亲特征且保持扩散过程几何结构的子代图像。该方法不仅实现了语义上一致的重组,还通过控制插值的时间步长范围,可灵活权衡多样性(探索)与收敛性(利用),显著提升了人机协同图像演化任务中的有效性与可控性。

链接: https://arxiv.org/abs/2604.14790
作者: Chisatao Kumada,Satoru Hiwa,Tomoyuki Hiroyasu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 7 figures, 2 tables

点击查看摘要

Abstract:Interactive Evolutionary Computation (IEC) provides a powerful framework for optimizing subjective criteria such as human preferences and aesthetics, yet it suffers from a fundamental limitation: in high-dimensional generative representations, defining crossover in a semantically consistent manner is difficult, often leading to a mutation-dominated search. In this work, we explicitly define crossover in diffusion models. We propose Diffusion crossover, which formulates evolutionary recombination as step-wise interpolation of noise sequences in the reverse process of Denoising Diffusion Probabilistic Models (DDPMs). By applying spherical linear interpolation (Slerp) to the noise sequences associated with selected parent images, the proposed method generates offspring that inherit characteristics from both parents while preserving the geometric structure of the diffusion process. Furthermore, controlling the time-step range of interpolation enables a principled trade-off between diversity (exploration) and convergence (exploitation). Experimental results using PCA analysis and perceptual similarity metrics (LPIPS) demonstrate that Diffusion crossover produces perceptually smooth and semantically consistent transitions between parent images. Qualitative interactive evolution experiments further confirm that the proposed method effectively supports human-in-the-loop image exploration. These findings suggest a new perspective: diffusion models are not only powerful generators, but also structured evolutionary search spaces in which recombination can be explicitly defined and controlled.
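论文的核心算子是对父代噪声序列做球面线性插值(Slerp),标准公式为 Slerp(p, q; t) = [sin((1−t)θ)·p + sin(tθ)·q] / sin θ,其中 θ 为两向量夹角。下面是纯 Python 草图;对扩散交叉而言,只需在所选时间步范围内逐步对两条噪声序列调用它(实现细节为示意):

```python
import math

def slerp(p, q, t):
    # 球面线性插值:沿大圆弧在噪声向量 p、q 之间插值,t ∈ [0, 1]
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    cos = max(-1.0, min(1.0, dot / (norm_p * norm_q)))
    theta = math.acos(cos)
    if theta < 1e-8:                      # 近乎共线时退化为线性插值
        return [a + t * (b - a) for a, b in zip(p, q)]
    s = math.sin(theta)
    w1 = math.sin((1 - t) * theta) / s
    w2 = math.sin(t * theta) / s
    return [w1 * a + w2 * b for a, b in zip(p, q)]
```

与线性插值不同,Slerp 对单位范数的高斯噪声保持范数,因而插值结果仍落在 DDPM 噪声分布的高概率区域附近。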

[AI-45] A Comparative Study of CNN Optimization Methods for Edge AI: Exploring the Role of Early Exits

【速读】:该论文旨在解决在边缘设备上部署深度神经网络时,如何在模型精度、推理延迟和资源约束之间实现最优平衡的问题。其关键解决方案在于提出一种统一的部署导向型对比框架,系统评估静态压缩(如剪枝和量化)与动态早期退出机制(early-exit mechanisms)在真实边缘硬件上的性能表现。研究发现,静态方法可稳定降低内存占用,而动态方法能根据输入自适应节省计算量,二者结合可在几乎不损失精度的前提下显著减少推理延迟和内存使用,从而拓展了边缘计算场景下的可行性边界。

链接: https://arxiv.org/abs/2604.14789
作者: Nekane Fernandez,Ivan Valdes,Steven Van Vaerenbergh,Idoia de la Iglesia,Julen Arratibel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deploying deep neural networks on edge devices requires balancing accuracy, latency, and resource constraints under realistic execution conditions. To fit models within these constraints, two broad strategies have emerged: static compression techniques such as pruning and quantization, which permanently reduce model size, and dynamic approaches such as early-exit mechanisms, which adapt computational cost at runtime. While both families are widely studied in isolation, they are rarely compared under identical conditions on physical hardware. This paper presents a unified deployment-oriented comparison of static compression and dynamic early-exit mechanisms, evaluated on real edge devices using ONNX based inference pipelines. Our results show that static and dynamic techniques offer fundamentally different trade-offs for edge deployment. While pruning and quantization deliver consistent memory footprint reduction, early-exit mechanisms enable input-adaptive computation savings that static methods cannot match. Their combination proves highly effective, simultaneously reducing inference latency and memory usage with minimal accuracy loss, expanding what is achievable at the edge.
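早退机制的核心是"按输入自适应地决定计算深度"。下面用预测熵作为退出判据的最小草图示意;`stages` 为各出口分类头的假设接口(每个出口返回一个概率分布),阈值为示意取值:

```python
import math

def entropy(probs):
    # 预测分布的香农熵:越低表示该出口越"自信"
    return -sum(p * math.log(p) for p in probs if p > 0)

def early_exit_infer(stages, x, threshold=0.5):
    # 依次经过各出口;一旦熵低于阈值(或已到最后一个出口)即退出
    for depth, stage in enumerate(stages, 1):
        probs = stage(x)
        if entropy(probs) < threshold or depth == len(stages):
            return max(range(len(probs)), key=probs.__getitem__), depth
```

简单输入在浅层出口即可退出,困难输入才走完全部层,这正是静态剪枝/量化无法提供的按样本节省。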

[AI-46] Sequence Search: Automated Sequence Design using Neural Architecture Search

【速读】:该论文旨在解决磁共振(MR)序列设计高度依赖人工经验、难以自动化且泛化能力受限的问题。传统方法通常需要研究人员基于直觉设定初始序列结构,并通过大量数据或反复调参进行优化,限制了新序列的探索空间。其解决方案的关键在于提出“Sequence Search”框架,该框架基于神经架构搜索(Neural Architecture Search, NAS),以组织特性、成像参数和设计目标为输入,无需预设传统序列结构即可自动生成满足目标的脉冲序列。该方法通过可微分的Bloch模拟器对候选序列进行迭代优化,并结合特定任务的损失函数,利用梯度下降法实现端到端学习,从而突破人类直觉的局限,发现如低射频能量消耗的三射频自旋回波类序列等非传统设计方案。

链接: https://arxiv.org/abs/2604.14788
作者: Rokgi Hong,Hongjun An,Sooyeon Ji,Jongho Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Developing an MR sequence is challenging and remains largely constrained by human intuition. Recently, AI-driven approaches have been proposed; however, most require an initial sequence for parameter optimization or extensive training datasets, limiting their general applicability. In this study, we propose “Sequence Search,” an automated sequence design framework based on neural architecture search. The method takes tissue properties, imaging parameters, and design objectives as inputs and generates pulse sequences satisfying the design objectives, without requiring prior knowledge of conventional sequence structures. Sequence Search iteratively generates candidate sequences through neural architecture search and optimizes them via a differentiable Bloch simulator and objective-specific loss functions using gradient-based learning. The framework successfully replicated conventional spin-echo, T2-weighted spin-echo, and inversion recovery sequences. Less intuitive solutions were also discovered, such as three-RF spin-echo-like sequences with reduced RF energy and refocusing phases deviating from the conventional Hahn-echo. This work establishes a generalizable framework for automated MR sequence design, highlighting the potential to explore configurations beyond conventional designs based on human intuition.

[AI-47] CogEvolution: A Human-like Generative Educational Agent to Simulate Students Cognitive Evolution

【速读】:该论文旨在解决现有教育代理(Educational Agents)在模拟学生学习行为时存在的两大核心问题:一是过度依赖静态人格设定,忽视了深度认知能力对学习结果的决定性作用;二是难以刻画知识内化、迁移及认知状态转换的动态过程。解决方案的关键在于提出一种类人教育代理——CogEvolution,其创新性体现在三个层面:首先,基于认知心理学中的ICAP(Interactive, Constructive, Active, Passive)分类框架构建认知深度感知器(cognitive depth perceptron),实现对学生认知参与度的精确量化;其次,引入基于项目反应理论(Item Response Theory, IRT)的记忆检索机制,模拟新旧知识间的联结与同化;最后,设计基于进化算法的动态认知更新机制,实现学习行为与认知演化过程的实时融合。实证结果表明,该模型不仅在行为保真度和学习曲线拟合上显著优于基线方法,还能复现符合教育心理学预期的认知演化路径,为构建高可解释性的教育代理提供了全新范式。

链接: https://arxiv.org/abs/2604.14786
作者: Wei Zhang,Yihang Cheng,Zhirong Ye,Kezhen Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: none

点击查看摘要

Abstract:Generative Agents, owing to their precise modeling and simulation capabilities of human behavior, have become a pivotal tool in the field of Artificial Intelligence in Education (AIEd) for uncovering complex cognitive processes of learners. However, existing educational agents predominantly rely on static personas to simulate student learning behaviors, neglecting the decisive role of deep cognitive capabilities in learning outcomes during practice interactions. Furthermore, they struggle to characterize the dynamic fluidity of knowledge internalization, transfer, and cognitive state transitions. To overcome this bottleneck, this paper proposes a human-like educational agent capable of simulating student cognitive evolution: CogEvolution. Specifically, we first construct a cognitive depth perceptron based on the Interactive, Constructive, Active, Passive (ICAP) taxonomy from cognitive psychology, achieving precise quantification of learner cognitive engagement. Subsequently, we propose a memory retrieval method based on Item Response Theory (IRT) to simulate the connection and assimilation of new and prior knowledge. Finally, we design a dynamic cognitive update mechanism based on evolutionary algorithms to simulate the real-time integration of student learning behaviors and cognitive evolution processes. Comprehensive evaluations demonstrate that CogEvolution not only significantly outperforms baseline models in behavioral fidelity and learning curve fitting but also uniquely reproduces plausible and robust cognitive evolutionary paths consistent with educational psychology expectations, providing a novel paradigm for constructing highly interpretable educational agents.
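速读中"基于项目反应理论(IRT)的记忆检索"可以借助标准 2PL 模型理解:能力为 θ 的学生答对区分度 a、难度 b 的题目的概率为 P = 1/(1 + e^{−a(θ−b)})。如何将该概率接入检索打分属于论文的设计,下面仅给出公式本身的草图:

```python
import math

def irt_2pl(theta, a, b):
    # 2PL 项目反应理论:theta 为能力,a 为区分度,b 为难度
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

当 θ = b 时概率恰为 0.5,能力高于难度越多,答对概率越接近 1;区分度 a 控制该曲线在 b 附近的陡峭程度。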

[AI-48] MirrorBench: Evaluating Self-centric Intelligence in MLLM s by Introducing a Mirror

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在具身智能评估中缺乏对“自我中心智能”(self-centric intelligence)系统性测评的问题。现有基准主要关注模型对外部物体的感知、理解和交互能力,而忽视了其对自身存在与状态的认知建模。为此,作者提出MirrorBench——一个受心理学经典镜像自我识别(Mirror Self-Recognition, MSR)测试启发的仿真基准,通过分层任务框架逐步提升挑战难度,从基础视觉感知到高层次自我表征进行系统评估。其解决方案的关键在于将心理发展学中的自我认知范式迁移至具身MLLMs,构建了一个可量化、可扩展的评估体系,从而揭示当前模型在自指理解方面的根本局限,并为通用智能在大规模模型中的涌现提供理论依据和实践路径。

链接: https://arxiv.org/abs/2604.14785
作者: Shengyu Guo,Tongrui Ye,Jianbo Zhang,Zicheng Zhang,Chunyi Li,Guangtao Zhai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent progress in Multimodal Large Language Models (MLLMs) has demonstrated remarkable advances in perception and reasoning, suggesting their potential for embodied intelligence. While recent studies have evaluated embodied MLLMs in interactive settings, current benchmarks mainly target capabilities to perceive, understand, and interact with external objects, lacking a systematic evaluation of self-centric intelligence. To address this, we introduce MirrorBench, a simulation-based benchmark inspired by the classical Mirror Self-Recognition (MSR) test in psychology. MirrorBench extends this paradigm to embodied MLLMs through a tiered framework of progressively challenging tasks, assessing agents from basic visual perception to high-level self-representation. Experiments on leading MLLMs show that even at the lowest level, their performance remains substantially inferior to human performance, revealing fundamental limitations in self-referential understanding. Our study bridges psychological paradigms and embodied intelligence, offering a principled framework for evaluating the emergence of general intelligence in large models. Project page: this https URL.

[AI-49] CoTEvol: Self-Evolving Chain-of-Thoughts for Data Synthesis in Mathematical Reasoning ACL2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在数学推理任务中依赖高成本链式思维(Chain-of-Thought, CoT)数据训练的问题。现有方法如知识蒸馏和基于测试时搜索的自生成策略虽能缓解数据稀缺问题,但常面临收益递减或计算开销过高的局限。其解决方案的关键在于提出一种名为CoTEvol的遗传进化框架,将CoT生成建模为基于种群的推理轨迹搜索过程:通过轨迹层面的反射性全局交叉(reflective global crossover)实现整体结构的重组,并结合步骤层面由不确定性引导的局部变异(local mutation),完成精细优化;同时设计轻量级、任务感知的适应度函数以推动进化向准确且多样化的推理路径收敛。实验证明,该方法显著提升正确CoT合成成功率(>30%)并增强推理结构多样性,使LLMs在8个数学基准上平均性能提升6.6%,优于以往蒸馏与自生成方法。

链接: https://arxiv.org/abs/2604.14768
作者: Zhuo Wang,Zhuo Zhang,Yafu Li,Yu Cheng,Lizhen Qu,Zenglin Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL 2026 Findings

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit strong mathematical reasoning when trained on high-quality Chain-of-Thought (CoT) that articulates intermediate steps, yet costly CoT curation hinders further progress. While existing remedies such as distillation from stronger LLMs and self-synthesis based on test-time search alleviate this issue, they often suffer from diminishing returns or high computing cost. In this work, we propose CoTEvol, a genetic evolutionary framework that casts CoT generation as a population-based search over reasoning trajectories. CoT trajectories are iteratively evolved through reflective global crossover at the trajectory level and local mutation guided by uncertainty at the step level, enabling holistic recombination and fine-grained refinement. Lightweight, task-aware fitness functions are designed to guide the evolutionary process toward accurate and diverse reasoning. Empirically, CoTEvol improves correct-CoT synthesis success by over 30% and enhances structural diversity, with markedly improved efficiency. LLMs trained on these evolutionary CoT data achieve an average gain of 6.6% across eight math benchmarks, outperforming previous distillation and self-synthesis approaches. These results underscore the promise of evolutionary CoT synthesis as a scalable and effective method for mathematical reasoning tasks.
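论文把 CoT 生成建模为种群化的遗传搜索(全局交叉 + 局部变异 + 适应度选择)。下面是一个与该机制同构但高度简化的遗传循环草图:个体用比特串代替推理轨迹,算子与适应度均为示意,并非论文的反射式交叉或不确定性引导变异:

```python
import random

def evolve(population, fitness, crossover, mutate, generations=10, seed=0):
    # 精英保留的遗传循环:选择 -> 交叉 -> 变异
    rng = random.Random(seed)
    pop = list(population)
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: max(2, len(pop) // 2)]   # 精英保留,适应度不回退
        children = []
        while len(parents) + len(children) < len(pop):
            p1, p2 = rng.sample(parents, 2)
            children.append(mutate(crossover(p1, p2, rng), rng))
        pop = parents + children
    return max(pop, key=fitness)

# 玩具个体:比特串代替轨迹;适应度 = 1 的个数
def crossover(p1, p2, rng):
    cut = rng.randrange(1, len(p1))              # 单点交叉(对应"全局交叉")
    return p1[:cut] + p2[cut:]

def mutate(traj, rng):
    i = rng.randrange(len(traj))                 # 翻转一位(对应"局部变异")
    return traj[:i] + [1 - traj[i]] + traj[i + 1:]

init = [[0] * 8 for _ in range(6)]
best = evolve(init, fitness=sum, crossover=crossover, mutate=mutate, generations=30)
```

真实系统中,个体是 CoT 轨迹,交叉/变异由 LLM 执行,适应度由轻量的任务感知函数给出。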

[AI-50] Disentangle-then-Refine: LLM-Guided Decoupling and Structure-Aware Refinement for Graph Contrastive Learning ICME2026

【速读】:该论文旨在解决传统图对比学习(Graph Contrastive Learning, GCL)在文本属性图(Text-Attributed Graphs, TAGs)上因依赖盲随机增强而导致任务相关信号与噪声纠缠的问题。解决方案的关键在于提出SDM-SCR框架,其核心创新为“解耦-精炼”机制:首先通过语义解耦模块(Semantic Decoupling Module, SDM)利用大语言模型(Large Language Models, LLMs)的指令遵循能力,主动将原始属性解析为异构的任务导向信号视图和噪声视图,实现从随机扰动到语义感知解耦的范式转变;随后通过语义一致性正则化(Semantic Consistency Regularization, SCR)利用谱特性——即语义信号具有拓扑平滑性而残差噪声为高频成分——作为选择性谱滤波器,在信号子空间内强制一致性以消除LLM幻觉而不产生过度平滑。这一机制确保了严格的信号净化,显著提升了模型在准确性和效率上的表现。

链接: https://arxiv.org/abs/2604.14746
作者: Zhaoxing Li,Hai-Feng Zhang,Xiaoming Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accept in ICME 2026

点击查看摘要

Abstract:Conventional Graph Contrastive Learning (GCL) on Text-Attributed Graphs (TAGs) relies on blind stochastic augmentations, inadvertently entangling task-relevant signals with noise. We propose SDM-SCR, a robust framework anchored in Approximate Orthogonal Decomposition. First, the Semantic Decoupling Module (SDM) leverages the instruction-following capability of Large Language Models (LLMs) to actively parse raw attributes into asymmetric, task-oriented signal and noise views. This shifts the paradigm from random perturbation to semantic-aware disentanglement. Subsequently, Semantic Consistency Regularization (SCR) exploits the spectral observation that semantic signals are topologically smooth while residual noise is high-frequency. SCR functions as a selective spectral filter, enforcing consistency only on the signal subspace to eliminate LLM hallucinations without over-smoothing. This "Disentangle-then-Refine" mechanism ensures rigorous signal purification. Extensive experiments demonstrate that SDM-SCR achieves SOTA performance in accuracy and efficiency.
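论文所依赖的谱观察(语义信号拓扑平滑、残差噪声为高频成分)可以用图信号的 Dirichlet 能量直观示意。下面是一个与论文实现无关的最小草图,图结构与数值均为假设;SCR 的"仅在信号子空间上强制一致性"即对应于只约束低能量(低频)的那一视图。

```python
def dirichlet_energy(signal, edges):
    """图信号的 Dirichlet 能量 x^T L x:相邻节点差值的平方和,衡量拓扑平滑度。
    能量低 = 低频 / 平滑(语义信号),能量高 = 高频(残差噪声)。"""
    return sum((signal[u] - signal[v]) ** 2 for u, v in edges)

# 链式图 0-1-2-3-4 上的两个节点属性视图(数值为假设)
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
signal_view = [1.0, 1.1, 1.2, 1.3, 1.4]    # 拓扑平滑的"语义信号"视图
noise_view = [1.0, -2.0, 3.0, -1.5, 2.5]   # 高频的"残差噪声"视图
```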

[AI-51] Personalized and Context-Aware Transformer Models for Predicting Post-Intervention Physiological Responses from Wearable Sensor Data

【速读】:该论文旨在解决如何将消费者可穿戴设备持续采集的生理数据(如心率 HR、心率变异性 HRV 和相邻心跳间期 BBI)转化为个性化、可操作的压力管理建议这一挑战。其核心问题是用户难以预知特定干预措施(如冥想、运动等)在接下来15至120分钟内对上述生理指标的影响方向与轨迹。解决方案的关键在于提出一个融合Transformer模型的预测框架,能够同时建模多时间窗口下的百分比变化轨迹和方向性判断(正向、负向或中性),并通过叠加用户标注事件与干预标签的可穿戴传感器数据进行实证验证,从而实现个体化干预后生理响应的前瞻性预测。

链接: https://arxiv.org/abs/2604.14738
作者: Esther Brown,Victoria Dean,Finale Doshi-Velez
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Consumer wearables enable continuous measurement of physiological data related to stress and recovery, but turning these streams into actionable, personalized stress-management recommendations remains a challenge. In practice, users often do not know how a given intervention, defined as an activity intended to reduce stress, will affect heart rate (HR), heart rate variability (HRV), or inter-beat intervals (BBI) over the next 15 to 120 minutes. We present a framework that predicts post-intervention trajectories and the direction of change for these physiological indicators across time windows. Our methodology combines a Transformer model for multi-horizon trajectories of percent change relative to a pre-intervention baseline, direction-of-change calls (positive, negative, or neutral) at each horizon, and an empirical study using wearable sensor data overlaid with user-tagged events and interventions. This proof of concept shows that personalized post-intervention prediction is feasible. We encourage future integration into stress-management tools for personalized intervention recommendations tailored to each person’s day following further validation in larger studies and, where applicable, appropriate regulatory review.
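论文中"多时间窗口下的百分比变化 + 方向判断(正向/负向/中性)"的输出形式可以草图化如下:相对干预前基线计算百分比变化,并用一个中性带做三分类。其中 2% 的中性带阈值与各窗口数值均为示意性假设,并非论文设定。

```python
def percent_change(baseline, value):
    """相对干预前基线的百分比变化。"""
    return 100.0 * (value - baseline) / baseline

def direction(pct, neutral_band=2.0):
    """方向判断:超出中性带视为正/负向,否则为中性(阈值 2% 为假设)。"""
    if pct > neutral_band:
        return "positive"
    if pct < -neutral_band:
        return "negative"
    return "neutral"

# 干预前基线 HRV 为 50 ms,干预后 15/60/120 分钟窗口的均值(假设数据)
baseline = 50.0
horizons = {15: 51.0, 60: 54.0, 120: 48.0}
calls = {h: direction(percent_change(baseline, v)) for h, v in horizons.items()}
```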

[AI-52] Catching Every Ripple: Enhanced Anomaly Awareness via Dynamic Concept Adaptation

【速读】:该论文旨在解决在线异常检测(Online Anomaly Detection, OAD)在动态环境中因概念漂移(concept drift)导致的适应性不足问题,现有方法通常依赖昂贵的重新训练和固定决策边界,难以高效应对数据流中概念的变化。其解决方案的关键在于提出DyMETER框架,该框架通过统一在线参数迁移与动态阈值调整,在不进行重训练或微调的前提下实现对新概念的高效适应:一方面利用超网络(hypernetwork)生成实例感知的参数偏移以调整静态检测器;另一方面引入轻量级进化控制器估计实例级概念不确定性,指导自适应更新,并结合动态阈值优化模块维护不确定样本候选窗口,从而持续校准决策边界以匹配演化中的概念。

链接: https://arxiv.org/abs/2604.14726
作者: Jiaqi Zhu,Shaofeng Cai,Jie Chen,Fang Deng,Beng Chin Ooi,Wenqiao Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE TPAMI

点击查看摘要

Abstract:Online anomaly detection (OAD) plays a pivotal role in real-time analytics and decision-making for evolving data streams. However, existing methods often rely on costly retraining and rigid decision boundaries, limiting their ability to adapt both effectively and efficiently to concept drift in dynamic environments. To address these challenges, we propose DyMETER, a dynamic concept adaptation framework for OAD that unifies on-the-fly parameter shifting and dynamic thresholding within a single online paradigm. DyMETER first learns a static detector on historical data to capture recurring central concepts, and then transitions to a dynamic mode to adapt to new concepts as drift occurs. Specifically, DyMETER employs a novel dynamic concept adaptation mechanism that leverages a hypernetwork to generate instance-aware parameter shifts for the static detector, thereby enabling efficient and effective adaptation without retraining or fine-tuning. To achieve robust and interpretable adaptation, DyMETER introduces a lightweight evolution controller to estimate instance-level concept uncertainty for adaptive updates. Further, DyMETER employs a dynamic threshold optimization module to adaptively recalibrate the decision boundary by maintaining a candidate window of uncertain samples, which ensures continuous alignment with evolving concepts. Extensive experiments demonstrate that DyMETER significantly outperforms existing OAD approaches across a wide spectrum of application scenarios.
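"维护候选窗口并动态重标定决策边界"的思想可以用一个分位数阈值草图示意:窗口内保留近期异常分数,阈值取其分位数并随流更新。窗口大小、分位数等参数均为假设,并非 DyMETER 的实现。

```python
from collections import deque

class DynamicThreshold:
    """维护近期异常分数的候选窗口,按分位数动态重标定决策边界(示意实现)。"""
    def __init__(self, window=100, quantile=0.95, init_threshold=0.5):
        self.scores = deque(maxlen=window)   # 候选窗口:超出容量自动淘汰最旧样本
        self.quantile = quantile
        self.threshold = init_threshold

    def update(self, score):
        """先用当前阈值判定,再把样本纳入窗口并重标定阈值。"""
        is_anomaly = score > self.threshold
        self.scores.append(score)
        ranked = sorted(self.scores)
        idx = min(int(self.quantile * len(ranked)), len(ranked) - 1)
        self.threshold = ranked[idx]
        return is_anomaly

dt = DynamicThreshold(window=10, quantile=0.8, init_threshold=0.5)
for s in [0.10, 0.20, 0.15, 0.12, 0.18]:   # 正常分数预热窗口
    dt.update(s)
anomaly_flag = dt.update(0.9)    # 显著偏离近期分布,应判为异常
normal_flag = dt.update(0.16)    # 落在近期分布内,应判为正常
```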

[AI-53] Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在企业软件中作为系统操作员时存在的安全性问题,即模型错误可能引发未经授权的操作、请求格式错误、跨工作区执行等高成本故障。其核心解决方案是提出一种有限自主性架构(bounded-autonomy architecture),关键在于将语言模型的意图理解与实际执行行为分离:模型仅负责解释意图并提议动作,而所有可执行行为均受类型化动作契约(typed action contracts)、权限感知的能力暴露机制、作用域受限的上下文、副作用前验证、消费者端执行边界以及可选人工审批等多重约束条件保护。该架构确保企业应用保持业务逻辑和授权的权威性,同时通过显式的动作声明清单(actions manifest)实现可控的调度,从而在不牺牲可用性的前提下显著提升运行安全性。

链接: https://arxiv.org/abs/2604.14723
作者: Sarmad Sohail,Ghufran Haider
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 37 pages, 5 figures, 9 tables

点击查看摘要

Abstract:Large language models are increasingly used as natural-language interfaces to enterprise software, but their direct use as system operators remains unsafe. Model errors can propagate into unauthorized actions, malformed requests, cross-workspace execution, and other costly failures. We argue this is primarily an execution architecture problem. We present a bounded-autonomy architecture in which language models may interpret intent and propose actions, but all executable behavior is constrained by typed action contracts, permission-aware capability exposure, scoped context, validation before side effects, consumer-side execution boundaries, and optional human approval. The enterprise application remains the source of truth for business logic and authorization, while the orchestration engine operates over an explicit published actions manifest. We evaluate the architecture in a deployed multi-tenant enterprise application across three conditions: manual operation, unconstrained AI with safety layers disabled, and full bounded autonomy. Across 25 scenario trials spanning seven failure families, the bounded-autonomy system completed 23 of 25 tasks with zero unsafe executions, while the unconstrained configuration completed only 17 of 25. Two wrong-entity mutations escaped all consumer-contributed layers; only disambiguation and confirmation mechanisms intercept this class. Both AI conditions delivered 13-18x speedup over manual operation. Critically, removing safety layers made the system less useful: structured validation feedback guided the model to correct outcomes in fewer turns, while the unconstrained system hallucinated success. Several safety properties are structurally enforced by code and intercepted all targeted violations regardless of model output. The result is a practical, deployed architecture for making imperfect language models operationally useful in enterprise systems.
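"类型化动作契约 + 权限感知 + 副作用前验证"的结构可以草图化如下:模型提出的动作必须匹配已发布清单中的契约、参数类型正确且权限满足,否则在执行前即被拒绝。其中动作名、权限名与参数结构均为假设,并非论文系统的实际清单。

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionContract:
    """一条类型化动作契约:名称、参数类型、所需权限(均为示意)。"""
    name: str
    params: dict                 # 参数名 -> 期望类型
    required_permission: str

# 已发布的动作清单(manifest):只有清单内的动作才可能被执行
MANIFEST = {
    "close_ticket": ActionContract("close_ticket", {"ticket_id": int}, "support.write"),
}

def validate(proposal, user_permissions):
    """副作用前验证:动作在清单中、权限满足、参数类型正确,三者缺一即拒绝。"""
    contract = MANIFEST.get(proposal["action"])
    if contract is None:
        return False, "unknown action"
    if contract.required_permission not in user_permissions:
        return False, "permission denied"
    for key, typ in contract.params.items():
        if not isinstance(proposal["args"].get(key), typ):
            return False, f"bad type for {key}"
    return True, "ok"

ok, _ = validate({"action": "close_ticket", "args": {"ticket_id": 42}}, {"support.write"})
bad_type, reason = validate({"action": "close_ticket", "args": {"ticket_id": "42"}}, {"support.write"})
```

这类由代码结构性强制的检查与模型输出无关,对应摘要中"无论模型输出如何都能拦截目标违规"的性质。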

[AI-54] The Agentification of Scientific Research: A Physicist's Perspective

【速读】:该论文试图解决的问题是:如何理解人工智能(AI)革命,特别是大语言模型(Large Language Models, LLMs)兴起对科学研究范式带来的根本性变革,而不仅仅是将其视为自动化工具。解决方案的关键在于认识到AI不仅是研究工具,更可能成为科学协作的参与者,从而重塑科研效率、合作结构、发现机制、出版模式及评估体系;同时强调持续学习能力和思想多样性是确保AI在原创科学发现中发挥实质性作用的核心前提。

链接: https://arxiv.org/abs/2604.14718
作者: Xiao-Liang Qi
机构: 未知
类目: Artificial Intelligence (cs.AI); Disordered Systems and Neural Networks (cond-mat.dis-nn); High Energy Physics - Theory (hep-th)
备注: 14 pages, 4 figures

点击查看摘要

Abstract:This article argues that the most important significance of the AI revolution, especially the rise of large language models, lies not simply in automation, but in a fundamental change in how complex information and human know-how are carried, replicated, and shared. From this perspective, AI for Science is especially important because it may transform not only the efficiency of research, but also the structure of scientific collaboration, discovery, publishing, and evaluation. The article outlines a gradual path from AI as a research tool to AI as a scientific collaborator, and discusses how AI is likely to fundamentally reshape scientific publication. It also argues that continuous learning and diversity of ideas are essential if AI is to play a meaningful role in original scientific discovery.

[AI-55] Layered Mutability: Continuity and Governance in Persistent Self-Modifying Agents

【速读】:该论文旨在解决持续性语言模型代理(persistent language-model agents)在长期运行中因内部状态不断演化而导致的行为不可预测性问题,尤其是当代理具备工具使用、分层记忆、反思提示和运行时自适应能力时,其行为不仅受当前输入影响,还受可变的内在条件驱动。解决方案的关键在于提出“分层可变性”(layered mutability)框架,将系统行为变化归因于五个层级:预训练、后训练对齐、自我叙述、记忆和权重级适应,并指出治理难度随突变速度加快、下游耦合增强、可逆性降低及可观测性减弱而上升,从而导致人类最易观察的层级与最影响行为的层级之间出现系统性错位。论文进一步通过漂移量、治理负载和滞后效应(hysteresis)等量化指标形式化这一直觉,并以初步“棘轮实验”验证了即使重置代理的可见自我描述,也无法恢复原始行为轨迹,表明组合漂移(compositional drift)是此类代理的主要失效模式——即局部合理更新累积为未经明确授权的行为路径。

链接: https://arxiv.org/abs/2604.14717
作者: Krti Tallam
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 17 pages, 2 figures, 3 tables. self-modifying agents; AI governance; identity drift; persistent memory; runtime adaptation; model editing Primary: cs.AI Cross-list: cs.LG, cs.CY

点击查看摘要

Abstract:Persistent language-model agents increasingly combine tool use, tiered memory, reflective prompting, and runtime adaptation. In such systems, behavior is shaped not only by current prompts but by mutable internal conditions that influence future action. This paper introduces layered mutability, a framework for reasoning about that process across five layers: pretraining, post-training alignment, self-narrative, memory, and weight-level adaptation. The central claim is that governance difficulty rises when mutation is rapid, downstream coupling is strong, reversibility is weak, and observability is low, creating a systematic mismatch between the layers that most affect behavior and the layers humans can most easily inspect. I formalize this intuition with simple drift, governance-load, and hysteresis quantities, connect the framework to recent work on temporal identity in language-model agents, and report a preliminary ratchet experiment in which reverting an agent’s visible self-description after memory accumulation fails to restore baseline behavior. In that experiment, the estimated identity hysteresis ratio is 0.68. The main implication is that the salient failure mode for persistent self-modifying agents is not abrupt misalignment but compositional drift: locally reasonable updates that accumulate into a behavioral trajectory that was never explicitly authorized.
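摘要中报告的"身份滞后比(identity hysteresis ratio)"可以粗略示意为:回滚可见自我描述后残留的行为漂移,除以记忆累积造成的总漂移。下面的 L1 距离度量与全部数值均为假设,数值凑成 0.68 仅为呼应文中报告值的演示,并非论文的实际定义或数据。

```python
def drift(a, b):
    """两次行为测量之间的简单漂移度量(L1 距离,度量方式为示意性假设)。"""
    return sum(abs(x - y) for x, y in zip(a, b))

def hysteresis_ratio(baseline, after_accumulation, after_revert):
    """滞后比 = 回滚后残留漂移 / 记忆累积漂移;
    越接近 1,说明仅回滚叙述层无法恢复基线行为。"""
    accumulated = drift(baseline, after_accumulation)
    residual = drift(baseline, after_revert)
    return residual / accumulated if accumulated else 0.0

# 假设的四维行为特征(数值纯属演示,非论文数据)
baseline = [0.0, 0.0, 0.0, 0.0]
after_accumulation = [1.0, 1.0, 1.0, 1.0]
after_revert = [0.70, 0.70, 0.65, 0.67]
ratio = hysteresis_ratio(baseline, after_accumulation, after_revert)
```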

[AI-56] SGA-MCTS: Decoupling Planning from Execution via Training-Free Atomic Experience Retrieval

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在执行复杂多步骤决策任务时面临的两大挑战:一是推理时搜索(inference-time search)带来的高延迟,二是监督微调(supervised fine-tuning)方法在泛化能力上的局限性。解决方案的关键在于提出一种名为SGA-MCTS的框架,其核心思想是将LLM的规划过程建模为非参数化检索(non-parametric retrieval)。具体而言,在离线阶段,通过蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)探索解空间,并将高质量轨迹提炼为状态-目标-动作(State-Goal-Action, SGA)原子;这些原子是去词法化的符号化原语,抽象掉具体实体以保留可复用的因果逻辑。在线阶段,检索增强型代理采用符号-语义混合机制获取相关SGA并将其重新接地(re-grounding)至当前上下文作为软推理提示(soft reasoning hints),从而实现冻结权重模型在无需任务特定微调的情况下达到SOTA系统性能,同时显著降低计算开销,使深度系统2式推理以系统1式的快速响应速度得以实现。

链接: https://arxiv.org/abs/2604.14712
作者: Xin Xie,Dongyun Xue,Wuguannan Yao,Mingxiao Feng,Wengang Zhou,Xiang Qi,Houqiang Li,Peng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-powered systems require complex multi-step decision-making abilities to solve real-world tasks, yet current planning approaches face a trade-off between the high latency of inference-time search and the limited generalization of supervised fine-tuning. To address this limitation, we introduce SGA-MCTS, a framework that casts LLM planning as non-parametric retrieval. Offline, we leverage Monte Carlo Tree Search (MCTS) to explore the solution space and distill high-fidelity trajectories into State-Goal-Action (SGA) atoms. These atoms are de-lexicalized primitives that abstract concrete entities into symbolic slots, preserving reusable causal logic while discarding domain-specific noise. Online, a retrieval-augmented agent employs a hybrid symbolic-semantic mechanism to fetch relevant SGAs and re-ground them into the current context as soft reasoning hints. Empirical results on complex benchmarks demonstrate that this paradigm enables frozen, open-weights models to match the performance of SOTA systems (e.g., GPT-5) without task-specific fine-tuning. By effectively amortizing the heavy computational cost of search, SGA-MCTS achieves System 2 reasoning depth at System 1 inference speeds, rendering autonomous planning both scalable and real-time feasible.
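"去词法化为符号槽位、按(状态, 目标)检索、再重新接地到当前上下文"的机制可以示意如下。档案内容、槽位格式与示例语句均为假设,仅演示 SGA 原子"抽象实体、保留因果结构"的含义。

```python
def delexicalize(action, entities):
    """将具体实体抽象为符号槽位 <E0>, <E1>, ...,保留可复用的因果结构。"""
    out = action
    for i, ent in enumerate(entities):
        out = out.replace(ent, f"<E{i}>")
    return out

# 离线档案:(状态, 目标) -> 去词法化动作(SGA 原子的示意)
SGA_ARCHIVE = {
    ("door locked", "open door"): "use <E0> on <E1>",
}

def retrieve_and_reground(state, goal, entities):
    """在线检索:按(状态, 目标)符号匹配取回 SGA,并把槽位重新接地到当前实体。"""
    template = SGA_ARCHIVE.get((state, goal))
    if template is None:
        return None
    for i, ent in enumerate(entities):
        template = template.replace(f"<E{i}>", ent)
    return template

atom = delexicalize("use key on red door", ["key", "red door"])
hint = retrieve_and_reground("door locked", "open door", ["brass key", "cellar door"])
```

同一个原子因此可在不同实体("key" 或 "brass key")的场景间复用,作为软推理提示注入上下文。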

[AI-57] HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

【速读】:该论文旨在解决现有硬件设计基准测试中缺乏对大型开源项目级任务评估的问题,尤其是针对生成式 AI (Generative AI) 在真实硬件缺陷修复任务中的能力评估。当前主流基准多聚焦于孤立的模块级任务(如从规格说明生成硬件描述语言(HDL)模块),无法反映 LLM Agent 在复杂系统级项目中的实际表现。解决方案的关键在于提出 HWE-Bench——首个面向真实世界硬件 bug 修复任务的大规模、仓库级基准,其包含来自六个主流开源项目的 417 个任务实例,覆盖 RISC-V 核心、片上系统(SoC)及安全根信任模块,所有任务均在容器化环境中执行,并通过项目原生仿真与回归流程验证正确性。该基准采用自动化流水线构建,支持高效扩展至新仓库,从而揭示了模型性能差异主要由项目范围和漏洞类型分布驱动,而非代码体量,并识别出调试过程中的三个关键失败阶段:故障定位、硬件语义推理和跨 RTL、配置与验证组件的协调,为开发更具备硬件感知能力的智能代理提供了明确方向。

链接: https://arxiv.org/abs/2604.14709
作者: Fan Cui,Hongyuan Hou,Zizhang Luo,Chenyun Yin,Yun Liang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing benchmarks for hardware design primarily evaluate Large Language Models (LLMs) on isolated, component-level tasks such as generating HDL modules from specifications, leaving repository-scale evaluation unaddressed. We introduce HWE-Bench, the first large-scale, repository-level benchmark for evaluating LLM agents on real-world hardware bug repair tasks. HWE-Bench comprises 417 task instances derived from real historical bug-fix pull requests across six major open-source projects spanning both Verilog/SystemVerilog and Chisel, covering RISC-V cores, SoCs, and security roots-of-trust. Each task is grounded in a fully containerized environment where the agent must resolve a real bug report, with correctness validated through the project’s native simulation and regression flows. The benchmark is built through a largely automated pipeline that enables efficient expansion to new repositories. We evaluate seven LLMs with four agent frameworks and find that the best agent resolves 70.7% of tasks overall, with performance exceeding 90% on smaller cores but dropping below 65% on complex SoC-level projects. We observe larger performance gaps across models than commonly reported on software benchmarks, and difficulty is driven by project scope and bug-type distribution rather than code size alone. Our failure analysis traces agent failures to three stages of the debugging process: fault localization, hardware-semantic reasoning, and cross-artifact coordination across RTL, configuration, and verification components, providing concrete directions for developing more capable hardware-aware agents.

[AI-58] SynHAT: A Two-stage Coarse-to-Fine Diffusion Framework for Synthesizing Human Activity Traces

【速读】:该论文旨在解决人类活动轨迹(Human Activity Traces, HATs)在隐私保护与数据可用性之间的矛盾,以及现有生成式 AI 方法在处理高度不规则、动态性强的 HAT 数据时面临的建模复杂性和计算效率低下的问题。其核心解决方案是提出一种分阶段的高效合成框架 SynHAT,关键创新在于构建了一个基于新型时空去噪扩散模型的粗粒度到细粒度(coarse-to-fine)合成机制:第一阶段设计了 Coarse-HADiff 模型,采用带有双漂移-抖动分支(dual Drift-Jitter branches)的潜在时空 U-Net 来捕捉粗粒度轨迹的整体时空依赖关系;第二阶段通过行为模式提取、Fine-HADiff 和语义对齐三步流程,从第一阶段输出中生成高精度的细粒度轨迹。该方法显著提升了合成轨迹的空间和时间准确性(分别提升 52% 和 33%),同时保障了隐私安全与计算效率。

链接: https://arxiv.org/abs/2604.14705
作者: Rongchao Xu,Lin Jiang,Dahai Yu,Ximiao Li,Guang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human activity traces (HATs) are critical for many applications, including human mobility modeling and point-of-interest (POI) recommendation. However, growing privacy concerns have severely limited access to authentic large-scale HAT datasets. Recent advances in generative AI provide new opportunities to synthesize realistic and privacy-preserving HATs for such applications. Yet two major challenges remain: (i) HATs are highly irregular and dynamic, with long and varying time intervals, making it difficult to capture their complex spatio-temporal dependencies and underlying distributions; and (ii) generative models are often computationally expensive, making long-term, fine-grained HAT synthesis inefficient. To address these challenges, we propose SynHAT, a computationally efficient coarse-to-fine HAT synthesis framework built on a novel spatio-temporal denoising diffusion model. In Stage 1, we develop Coarse-HADiff, which models the overall spatio-temporal dependencies of coarse-grained latent spatio-temporal traces. It incorporates a novel Latent Spatio-Temporal U-Net with dual Drift-Jitter branches to jointly model smooth spatial transitions and temporal variations during denoising. In Stage 2, we introduce a three-step pipeline consisting of Behavior Pattern Extraction, Fine-HADiff, which shares the same architecture as Coarse-HADiff, and Semantic Alignment to generate fine-grained latent spatio-temporal traces from the Stage 1 outputs. We extensively evaluate SynHAT in terms of data fidelity, utility, privacy, robustness, and scalability. Experiments on real-world HAT datasets from four cities across three countries show that SynHAT substantially outperforms state-of-the-art baselines, achieving 52% and 33% improvements on spatial and temporal metrics, respectively.

[AI-59] M2-PALE: A Framework for Explaining Multi-Agent MCTS–Minimax Hybrids via Process Mining and LLMs

【速读】:该论文旨在解决蒙特卡洛树搜索(Monte-Carlo Tree Search, MCTS)在多智能体决策场景中因树结构高度选择性而导致的关键策略遗漏和战术陷阱脆弱性问题,同时提升MCTS代理决策逻辑的可解释性。其解决方案的关键在于:首先,在MCTS的模拟阶段引入浅层、全宽度的极小极大搜索(Minimax search),以增强策略深度并减少关键动作的遗漏;其次,提出M2-PALE框架,利用过程挖掘技术(Alpha Miner、iDHM、Inductive Miner)从智能体执行轨迹中提取行为工作流,并通过大语言模型(Large Language Models, LLMs)生成人类可读的因果与远端解释,从而实现对混合智能体决策过程的透明化解析。

链接: https://arxiv.org/abs/2604.14687
作者: Yiyu Qian,Liyuan Zhao,Tim Miller
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Monte-Carlo Tree Search (MCTS) is a fundamental sampling-based search algorithm widely used for online planning in sequential decision-making domains. Despite its success in driving recent advances in artificial intelligence, understanding the behavior of MCTS agents remains a challenge for both developers and users. This difficulty stems from the complex search trees produced through the simulation of numerous future states and their intricate relationships. A known weakness of standard MCTS is its reliance on highly selective tree construction, which may lead to the omission of crucial moves and a vulnerability to tactical traps. To resolve this, we incorporate shallow, full-width Minimax search into the rollout phase of multi-agent MCTS to enhance strategic depth. Furthermore, to demystify the resulting decision-making logic, we introduce M2-PALE (MCTS–Minimax Process-Aided Linguistic Explanations). This framework employs process mining techniques, specifically the Alpha Miner, iDHM, and Inductive Miner algorithms, to extract underlying behavioral workflows from agent execution traces. These process models are then synthesized by LLMs to generate human-readable causal and distal explanations. We demonstrate the efficacy of our approach in a small-scale checkers environment, establishing a scalable foundation for interpreting hybrid agents in increasingly complex strategic domains.
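"在 rollout 阶段用浅层全宽极小极大代替随机模拟"可以用一棵玩具博弈树示意:贪心估值会被 a 分支下的高叶值 3 吸引,而极小极大看穿对手会选到 -7,从而避开战术陷阱、转向稳妥的 b 分支。树结构与叶值均为假设。

```python
def minimax(state, depth, maximizing, children, value):
    """浅层全宽极小极大:逐层交替取 max/min,到达深度上限或叶子时用启发值。"""
    kids = children(state)
    if depth == 0 or not kids:
        return value(state)
    vals = [minimax(k, depth - 1, not maximizing, children, value) for k in kids]
    return max(vals) if maximizing else min(vals)

# 玩具博弈树:节点 -> 子节点;叶子带启发值(数值均为示意)
TREE = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
LEAF = {"a1": 3, "a2": -7, "b1": 1, "b2": 2}

def rollout_value(state):
    """用深度 2 的极小极大评估结束一次 MCTS rollout(示意)。"""
    return minimax(state, 2, True, lambda s: TREE.get(s, []), lambda s: LEAF.get(s, 0))
```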

[AI-60] DR3-Eval: Towards Realistic and Reproducible Deep Research Evaluation

【速读】:该论文旨在解决深度研究代理(Deep Research Agents, DRAs)在复杂、长周期科研任务中评估困难的问题,尤其是在动态网络环境和任务定义模糊的情况下。其核心挑战在于如何构建一个既真实又可复现的基准测试体系,以准确衡量DRAs在多模态、多文件报告生成中的综合能力。解决方案的关键在于提出DR³-Eval基准,该基准基于真实用户提供的材料构建,并配以每项任务对应的静态研究沙盒语料库,模拟开放网络的复杂性但保持完全可验证性,包含支持文档、干扰项和噪声;同时引入多维评估框架,从信息召回率、事实准确性、引用覆盖率、指令遵循度和深度质量五个维度量化性能,并验证其与人类判断的一致性。实验证明该基准具有高挑战性,能揭示检索鲁棒性和幻觉控制等关键缺陷。

链接: https://arxiv.org/abs/2604.14683
作者: Qianqian Xie,Qingheng Xiong,He Zhu,Tiantian Xia,Xueming Han,Fanyu Meng,Jiakai Wang,Zhiqi Bai,Chengkang Jiang,Zhaohui Wang,Yubin Guo,Yuqing Wen,Jiayang Mao,Zijie Zhang,Shihao Li,Yanghai Wang,Yuxiang Ren,Junlan Feng,Jiaheng Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR^3-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR^3-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus that simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with our developed multi-agent system DR^3-Agent based on multiple state-of-the-art language models demonstrate that DR^3-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.

[AI-61] AIPC: Agent -Based Automation for AI Model Deployment with Qualcomm AI Runtime

【速读】:该论文旨在解决边缘人工智能(Edge AI)模型部署过程中因多阶段工程流程(包括模型转换、算子兼容性处理、量化校准、运行时集成及精度验证)而导致的效率低下、易出错且高度依赖专家经验的问题,尤其在针对硬件特定推理运行时(如Qualcomm AI Runtime, QAIRT)时更为突出。解决方案的关键在于提出AIPC(AI Porting Conversion),一种基于AI代理(AI agent)驱动的受限自动化部署方法:通过将部署流程分解为标准化、可验证的阶段,并借助Agent Skills、辅助脚本和分阶段验证循环注入领域知识,从而降低部署门槛并显著减少工程时间。实验表明,对于结构规整的视觉模型,AIPC可在7–20分钟内完成从PyTorch到QNN/SNPE推理引擎的自动化部署,API成本约为0.7–10美元;对于复杂模型(如支持度低的算子或自回归结构),虽尚未完全自动化,但已能提供执行支持、故障定位与可控修复能力。

链接: https://arxiv.org/abs/2604.14661
作者: Jianhao Su,Zhanwei Wu,ShengTing Huang,Weidong Feng
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, 1 figure, technical report

点击查看摘要

Abstract:Edge AI model deployment is a multi-stage engineering process involving model conversion, operator compatibility handling, quantization calibration, runtime integration, and accuracy validation. In practice, this workflow is long, failure-prone, and heavily dependent on deployment expertise, particularly when targeting hardware-specific inference runtimes. This technical report presents AIPC (AI Porting Conversion), an AI agent-driven approach for constrained automation of AI model deployment. AIPC decomposes deployment into standardized, verifiable stages and injects deployment-domain knowledge into agent execution through Agent Skills, helper scripts, and a stage-wise validation loop. This design reduces both the expertise barrier and the engineering time required for hardware deployment. Using Qualcomm AI Runtime (QAIRT) as the primary scenario, this report examines automated deployment across representative vision, multimodal, and speech models. In the cases covered here, AIPC can complete deployment from PyTorch to runnable QNN/SNPE inference within 7-20 minutes for structurally regular vision models, with indicative API costs roughly in the range of USD 0.7-10. For more complex models involving less-supported operators, dynamic shapes, or autoregressive decoding structures, fully automated deployment may still require further advances, but AIPC already provides practical support for execution, failure localization, and bounded repair. 
Comments: 19 pages, 1 figure, technical report Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2604.14661 [cs.SE] (or arXiv:2604.14661v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2604.14661 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
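"分解为标准化、可验证阶段 + 阶段级验证循环 + 有界修复"的控制流可以草图化如下。阶段名称、失败注入与修复次数均为假设;真实系统中各阶段由 Agent 按 Skills 与辅助脚本执行,这里只示意其骨架。

```python
STAGES = ["convert", "quantize", "integrate", "validate_accuracy"]

def run_stage(name, model, attempt):
    """占位的阶段执行器:这里让量化阶段首次故意失败,以演示修复循环(纯属假设)。"""
    if name == "quantize" and attempt == 0:
        return False, "calibration data mismatch"
    return True, "ok"

def deploy(model, max_repairs=2):
    """逐阶段执行 + 阶段级验证:失败则在有限次数内修复重试,耗尽则整体失败。"""
    log = []
    for stage in STAGES:
        for attempt in range(max_repairs + 1):
            ok, msg = run_stage(stage, model, attempt)
            log.append((stage, attempt, ok))
            if ok:
                break
        else:
            return False, log   # 有界修复:次数耗尽即停止,不无限重试
    return True, log

success, log = deploy("mobilenet_v2")
```

日志中每条 (阶段, 尝试次数, 是否通过) 记录即对应"失败定位"所需的最小可观测信息。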

[AI-62] AgentGA: Evolving Code Solutions in Agent-Seed Space

【速读】:该论文旨在解决自主代码生成系统中效率与性能提升的问题,特别是如何通过优化代码生成的初始条件(即“代理种子”)来增强长期自主代理在复杂任务中的表现。其核心解决方案是提出AgentGA框架,该框架将种群级别的遗传算法与长时程代理相结合,通过在外部循环中搜索可复用的代理种子(包括任务提示和可选的父级归档文件),而非直接修改代码本身,从而实现对自主代码生成过程的进化式优化。关键创新在于:每次代际运行均从重置的工作空间启动,而父级归档提供可被后代继承和复用的中间产物;同时采用确定性的1:1精英锦标赛进行选择,并利用改进的Hedge控制器在线调整操作符分配策略,有效提升了代码生成的质量与稳定性。

链接: https://arxiv.org/abs/2604.14655
作者: David Y.Y. Tan,Kellie Chin,Jingxian Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages including appendix, 4 figures, 1 table

点击查看摘要

Abstract:We present AgentGA, a framework that evolves autonomous code-generation runs by optimizing the agent seed: the task prompt plus optional parent archives that initialize a fresh workspace. The outer loop searches over these reusable starting conditions rather than editing code directly. Each generation launches a fresh autonomous run from a reset workspace, while selected parent archives provide inherited artifacts that descendants can inspect and reuse. AgentGA couples a population-level genetic algorithm with long-horizon agents; selection uses deterministic 1:1 elite tournaments and operator allocation is adapted online with a modified Hedge controller. We instantiate the approach for tabular AutoML on the 16-competition Weco-Kaggle Lite benchmark. On the 10 benchmark runs reported here, AgentGA averages 74.52% Exceeds % of Human versus 54.15% for AIDE. Across 1135 parent-child comparisons, descendants given parent archives outperform runs started from scratch, indicating that inherited artifacts improve later autonomous runs. These findings support agent-seed optimization as a practical design point for autonomous code-search systems.
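论文的"改进 Hedge 控制器"细节未在摘要中给出;下面用一个确定性简化示意其核心思想:为每个算子维护累计损失,每轮选累计损失最小(等价于权重 exp(-η·loss) 最大)的算子,于是历史回报越高的算子被分配得越频繁。损失以整数"点数"记账以避免浮点并列,数值均为假设。

```python
class HedgeController:
    """Hedge 风格的算子分配控制器(确定性简化,非论文实现):
    维护每个算子的累计损失,每轮选择累计损失最小的算子。"""
    def __init__(self, operators):
        self.ops = list(operators)
        self.loss = {op: 0 for op in self.ops}   # 整数损失点,避免浮点并列问题

    def pick(self):
        # 累计损失最小 = 乘性权重最大;并列时取清单中靠前的算子
        return min(self.ops, key=lambda op: self.loss[op])

    def feedback(self, op, loss_points):
        self.loss[op] += loss_points

# 每次使用某算子付出的损失点(损失越低 = 历史回报越高,数值为示意)
LOSS = {"crossover": 5, "mutation": 35, "archive_swap": 45}
ctrl = HedgeController(list(LOSS))
counts = {op: 0 for op in LOSS}
for _ in range(20):
    op = ctrl.pick()
    counts[op] += 1
    ctrl.feedback(op, LOSS[op])
```

20 轮后回报最高的 crossover 被选中最多,体现了按反馈在线调整算子分配的效果。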

[AI-63] Targeted Exploration via Unified Entropy Control for Reinforcement Learning ACL2026

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中广泛使用的分组相对策略优化(Group Relative Policy Optimization, GRPO)方法存在的熵塌缩(entropy collapse)问题,即策略过早收敛导致多样性丧失,从而限制了模型在大语言模型(Large Language Models, LLMs)和视觉-语言模型(Vision-Language Models, VLMs)推理任务中的探索能力与优化稳定性。解决方案的关键在于提出统一熵控制框架(Unified Entropy Control for Reinforcement Learning, UEC-RL),其包含两个核心机制:一是针对困难提示激活更有效的探索,以发现潜在且有价值的推理轨迹;二是引入稳定器防止熵无序增长,确保训练过程在模型固化可靠行为时保持稳定。该设计在扩展搜索空间的同时维持优化鲁棒性,显著提升了模型在Pass@1和Pass@k指标上的表现,尤其在Geometry3K数据集上相较GRPO实现37.9%的相对提升,验证了UEC-RL在保障探索有效性与收敛稳定性之间的平衡能力。

链接: https://arxiv.org/abs/2604.14646
作者: Chen Wang,Lai Wei,Yanzhi Zhang,Chenyang Shao,Zedong Dan,Weiran Huang,Ge Lan,Yue Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted for publication in Findings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

点击查看摘要

Abstract:Recent advances in reinforcement learning (RL) have improved the reasoning capabilities of large language models (LLMs) and vision-language models (VLMs). However, the widely used Group Relative Policy Optimization (GRPO) consistently suffers from entropy collapse, causing the policy to converge prematurely and lose diversity. Existing exploration methods introduce additional bias or variance during exploration, making it difficult to maintain optimization stability. We propose Unified Entropy Control for Reinforcement Learning (UEC-RL), a framework that provides targeted mechanisms for exploration and stabilization. UEC-RL activates more exploration on difficult prompts to search for potential and valuable reasoning trajectories. In parallel, a stabilizer prevents entropy from growing uncontrollably, thereby keeping training stable as the model consolidates reliable behaviors. Together, these components expand the search space when needed while maintaining robust optimization throughout training. Experiments on both LLM and VLM reasoning tasks show consistent gains over RL baselines on both Pass@1 and Pass@k. On Geometry3K, UEC-RL achieves a 37.9% relative improvement over GRPO, indicating that it sustains effective exploration without compromising convergence and underscoring UEC-RL as a key for scaling RL-based reasoning in large models. Our code is available at this https URL.
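"激活探索以对抗熵塌缩 + 稳定器防止熵失控增长"的双向控制可以用一个双阈值熵激励草图示意:策略熵低于下界时给予正向激励,高于上界时给予负向激励,区间内不干预。熵的上下界与系数均为假设,并非论文机制的实现。

```python
import math

def entropy(probs):
    """离散策略分布的香农熵(nats)。"""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_bonus(probs, target_low=0.5, target_high=1.2, coef=0.1):
    """双向熵控制示意:熵过低给正激励(鼓励探索,抗熵塌缩),
    熵过高给负激励(稳定器,防止失控增长);阈值与系数为假设。"""
    h = entropy(probs)
    if h < target_low:
        return coef * (target_low - h)
    if h > target_high:
        return -coef * (h - target_high)
    return 0.0

collapsed = [0.97, 0.01, 0.01, 0.01]    # 近乎确定性:熵塌缩
uniform = [0.25, 0.25, 0.25, 0.25]      # 最大熵:可能失控增长
```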

[AI-64] Learning to Draw ASCII Improves Spatial Reasoning in Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在处理复杂空间问题时缺乏真正空间理解能力的问题,即模型虽能解析空间描述文本,却难以构建准确的显式空间布局,导致推理错误。其核心解决方案是引入Text2Space数据集,该数据集将自然语言描述与真实ASCII网格布局及空间问答对进行配对,从而分离出模型在空间表征构建与推理阶段的失败原因;关键创新在于训练模型从文本生成ASCII布局(Text → ASCII),即使在推理阶段不输出ASCII,也能显著提升空间推理性能,并且这种能力可迁移至多个外部空间推理基准任务,验证了“显式布局构建”对空间理解的强化作用。

链接: https://arxiv.org/abs/2604.14641
作者: Shiyuan Huang,Li Liu,Jincheng He,Leilani H. Gilpin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:When faced with complex spatial problems, humans naturally sketch layouts to organize their thinking, and the act of drawing further sharpens their understanding. In this work, we ask whether a similar principle holds for Large Language Models (LLMs): can learning to construct explicit visual layouts from spatial descriptions instill genuine spatial understanding? We introduce Text2Space, a dataset that pairs natural language descriptions with ground-truth ASCII grid layouts and spatial QA pairs, enabling us to separate failures in constructing spatial representations from failures in reasoning over them. We adopt ASCII because it is human-readable, operates entirely within the token space of language models, and encodes spatial relations in a structurally verifiable form. Our evaluation reveals a pronounced “Read-Write Asymmetry”: LLMs interpret ASCII representations effectively but struggle to produce them from text, and these construction errors propagate to incorrect answers downstream. To address this limitation, we train models on layout construction (Text \rightarrow ASCII) and find that it significantly improves spatial reasoning from text alone, even without producing any ASCII at inference time. Combining construction with comprehension training further amplifies these gains. Crucially, these improvements transfer to three external spatial reasoning benchmarks, demonstrating that, much as sketching sharpens human spatial thinking, learning to construct explicit layouts instills spatial understanding that generalizes beyond the training format.
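上述“构建显式布局”的思路可以用一个极简的 ASCII 网格解析与空间关系校验示意说明(函数名与网格约定均为示例假设,并非论文实现):

```python
def parse_ascii_grid(layout):
    # 将网格中每个非空白字符映射到 (行, 列) 坐标;'.' 与空格视为空单元格
    positions = {}
    for r, line in enumerate(layout.splitlines()):
        for c, ch in enumerate(line):
            if ch not in " .":
                positions[ch] = (r, c)
    return positions

def check_relation(positions, a, relation, b):
    # 校验对象 a 与 b 之间的单个空间关系
    (ra, ca), (rb, cb) = positions[a], positions[b]
    return {"left_of": ca < cb, "right_of": ca > cb,
            "above": ra < rb, "below": ra > rb}[relation]
```

例如在网格 `"A..\n..B"` 中,A 位于 B 的左侧且在其上方;这种结构可验证性正是论文选择 ASCII 表示的原因之一。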

[AI-65] A Parallel Approach to Counting Exact Covers Based on Decomposability Property

【速读】:该论文旨在解决经典NP-hard问题——精确覆盖(Exact Cover)的高效计数与表示问题。其核心挑战在于如何在保持计算效率的同时,以更紧凑的数据结构表示所有可能的精确覆盖解集。解决方案的关键创新在于提出了一种零抑制决策可分解否定正则型(decision-ZDNNF),该结构相比传统的零抑制二进制决策图(ZBDDs)具有更强的压缩能力,从而实现更高效的表示。在此基础上,作者设计了并行算法DXD来构建代表全部精确覆盖解的decision-ZDNNF,并通过动态更新连通分量进一步优化性能,实验表明改进后的DXD算法在效率上超越现有最优方法。

链接: https://arxiv.org/abs/2604.14627
作者: Liangda Fang,Yaohui Luo,Delong Li,Xuanxiang Huang,Quanlong Guan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to SAT 2026

点击查看摘要

Abstract:The exact cover problem is a classical NP-hard problem with broad applications in the area of AI. Algorithm DXZ is a method to count exact covers represented by zero-suppressed binary decision diagrams (ZBDDs). In this paper, we propose a zero-suppressed variant of decision decomposable negation normal form (in short, decision-ZDNNF), which is strictly more succinct than ZBDDs. We then design a novel parallel algorithm, namely DXD, which constructs a decision-ZDNNF representing the set of all exact covers. Furthermore, we improve DXD by dynamically updating connected components. The experimental results demonstrate that the improved DXD algorithm outperforms all state-of-the-art methods.
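精确覆盖计数问题本身可用如下朴素递归示意(采用 Knuth 的 Algorithm X 式按元素分支,不含 DXD 的 decision-ZDNNF 共享与并行化,仅用于理解问题定义):

```python
def count_exact_covers(universe, subsets):
    # 按剩余全集中最小元素分支:每个精确覆盖恰有一个子集覆盖该元素,
    # 因此枚举覆盖它的子集不会重复计数
    universe = frozenset(universe)
    if not universe:
        return 1
    pivot = min(universe)
    total = 0
    for s in subsets:
        if pivot in s and s <= universe:
            rest = [t for t in subsets if not (t & s)]  # 只保留与 s 不相交的子集
            total += count_exact_covers(universe - s, rest)
    return total
```

例如全集 {1,2,3} 与子集族 {1,2}、{3}、{1,2,3} 恰有两个精确覆盖:{{1,2},{3}} 与 {{1,2,3}}。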

[AI-66] ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving

【速读】:该论文旨在解决大规模语言模型中混合专家(Mixture-of-Experts, MoE)架构在本地部署时面临的内存瓶颈问题,尤其是在高批次大小下由于稀疏的每标记计算转化为密集的内存激活导致的带宽受限与计算资源利用率低下。现有方案如基于存内计算(PIM)或近存处理(Near-Memory Processing, NMP)虽提升了内存带宽,但无法有效利用低算术强度场景下的闲置计算单元;而推测解码(Speculative Decoding, SD)虽然通过减少目标模型调用次数来提升效率,但在MoE架构中因验证阶段仍需加载所有专家,尤其在低批次时性能收益受限。论文提出的ELMoE-3D是一种软硬件协同设计框架,其核心创新在于提出弹性自推测解码(Elastic Self-Speculative Decoding, Elastic-SD),通过同时沿专家维度和比特维度进行动态缩放,将缓存加速与推测解码统一为一个高效执行机制,并结合3D堆叠硬件中的高带宽混合键合(Hybrid-Bonding, HB)技术,实现对bit嵌套执行的原生支持,从而在批处理大小1–16范围内平均获得6.6倍速度提升和4.4倍能效增益。

链接: https://arxiv.org/abs/2604.14626
作者: Yuseon Choi,Jingu Lee,Jungjun Oh,Sunjoo Whang,Byeongcheol Kim,Minsung Kim,Hoi-Jun Yoo,Sangjin Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) models have become the dominant architecture for large-scale language models, yet on-premises serving remains fundamentally memory-bound as batching turns sparse per-token compute into dense memory activation. Memory-centric architectures (PIM, NMP) improve bandwidth but leave compute underutilized under MoE’s low arithmetic intensity at high batch sizes. Speculative decoding (SD) trades idle compute for fewer target invocations, yet verification must load experts even for rejected tokens, severely limiting its benefit in MoE especially at low batch sizes. We propose ELMoE-3D, a hybrid-bonding (HB)-based HW-SW co-designed framework that unifies cache-based acceleration and speculative decoding to offer overall speedup across batch sizes. We identify two intrinsic elasticity axes of MoE (expert and bit) and jointly scale them to construct Elastic Self-Speculative Decoding (Elastic-SD), which serves as both an expert cache and a strongly aligned self-draft model accelerated by high HB bandwidth. Our LSB-augmented bit-sliced architecture exploits inherent redundancy in bit-slice representations to natively support bit-nested execution. On our 3D-stacked hardware, ELMoE-3D achieves an average 6.6\times speedup and 4.4\times energy efficiency gain over naive MoE serving on xPU across batch sizes 1-16, and delivers 2.2\times speedup and 1.4\times energy efficiency gain over the best-performing prior accelerator baseline.
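推测解码的“草稿-验证”循环可用如下贪心示意说明(真实系统在 token 分布层面做拒绝采样,此处仅为假设性的简化版本,用于展示为何草稿命中率决定加速比):

```python
def speculative_decode(draft_next, target_next, prompt, k, max_len):
    # 贪心自推测解码示意:廉价草稿模型每步提议 k 个 token,
    # 目标模型保留最长一致前缀,并在首个不一致处给出一个修正 token
    out = list(prompt)
    while len(out) < max_len:
        proposals = []
        for _ in range(k):
            proposals.append(draft_next(out + proposals))
        accepted = 0
        for tok in proposals:
            if target_next(out) == tok:
                out.append(tok)
                accepted += 1
            else:
                break
        if accepted < k:
            out.append(target_next(out))  # 目标模型的修正 token 也被保留
        if out and out[-1] is None:       # 此处约定 None 表示序列结束
            out.pop()
            break
    return out
```

当草稿与目标高度一致时,目标模型的调用次数约降为原来的 1/(k+1);这正是 Elastic-SD 用“强对齐的自草稿模型”追求的效果。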

[AI-67] Asking What Matters: Reward-Driven Clarification for Software Engineering Tasks

【速读】:该论文旨在解决软件工程任务中因用户指令不完整而导致的辅助系统澄清效率低下的问题。其核心挑战在于:并非所有缺失信息都同等重要,且所提问题必须针对用户实际可回答的内容。解决方案的关键在于通过实证分析明确两类关键属性——任务相关性(task relevance,即哪些信息能预测任务成功)和用户可答性(user answerability,即用户能提供什么信息),并将其转化为多阶段强化学习的奖励机制,从而训练出CLARITI这一8B参数的澄清模块。该模块在保持与GPT-5相当的任务成功率的同时,减少了41%的问题生成量,显著提升了澄清效率。

链接: https://arxiv.org/abs/2604.14624
作者: Sanidhya Vijayvargiya,Vijay Viswanathan,Graham Neubig
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 28 pages, 6 figures

点击查看摘要

Abstract:Humans often specify tasks incompletely, so assistants must know when and how to ask clarifying questions. However, effective clarification remains challenging in software engineering tasks as not all missing information is equally valuable, and questions must target information users can realistically provide. We study clarification in real software engineering tasks by quantifying which types of information most affect task success and which questions elicit useful responses from simulated users. Using Shapley attribution and distributional comparisons, we identify two key properties of effective clarification: task relevance (which information predicts success) and user answerability (what users can realistically provide). We operationalize these properties as multi-stage reinforcement learning rewards to train CLARITI, an 8B-parameter clarification module, that matches GPT-5’s resolution rate on underspecified issues while generating 41% fewer questions. Our results suggest that grounding reward design in empirical analysis of information impact and user answerability improves clarification efficiency.
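摘要中提到的 Shapley 归因可用精确枚举方式示意:对每种信息类别,计算其在所有加入顺序下的平均边际贡献(仅适用于少量类别;value 函数为假设示例,并非论文的任务成功度量):

```python
from itertools import permutations

def shapley_values(players, value):
    # 精确 Shapley 值:对所有排列平均每个元素的边际贡献
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            totals[p] += value(frozenset(coalition)) - before
    return {p: t / len(orders) for p, t in totals.items()}
```

若任务成功完全取决于是否给出复现步骤,则 Shapley 归因会把全部贡献记在该信息类别上,其余类别为零,这正是“任务相关性”度量的直观来源。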

[AI-68] CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors

【速读】:该论文旨在解决从可穿戴设备连续生理信号中提取临床可操作数字生物标志物(digital biomarkers)的难题,以推动数字健康领域的科学发现。其解决方案的关键在于提出一个名为CoDaS(AI Co-Data-Scientist)的多智能体系统,将生物标志物发现过程结构化为包含假设生成、统计分析、对抗验证和基于文献的推理的迭代流程,并引入人类监督机制,从而实现高效、可追溯且具备临床意义的特征挖掘。

链接: https://arxiv.org/abs/2604.14615
作者: Yubin Kim,Salman Rahman,Samuel Schmidgall,Chunjong Park,A. Ali Heydari,Ahmed A. Metwally,Hong Yu,Xin Liu,Xuhai Xu,Yuzhe Yang,Maxwell A. Xu,Zhihan Zhang,Cynthia Breazeal,Tim Althoff,Petar Sirkovic,Ivor Rendulic,Annalisa Pawlosky,Nicolas Stroppa,Juraj Gottweis,Elahe Vedadi,Alan Karthikesalingam,Pushmeet Kohli,Vivek Natarajan,Mark Malhotra,Shwetak Patel,Hae Won Park,Hamid Palangi,Daniel McDuff
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scientific discovery in digital health requires converting continuous physiological signals from wearable devices into clinically actionable biomarkers. We introduce CoDaS (AI Co-Data-Scientist), a multi-agent system that structures biomarker discovery as an iterative process combining hypothesis generation, statistical analysis, adversarial validation, and literature-grounded reasoning with human oversight using large-scale wearable datasets. Across three cohorts totaling 9,279 participant-observations, CoDaS identified 41 candidate digital biomarkers for mental health and 25 for metabolic outcomes, each subjected to an internal validation battery spanning replication, stability, robustness, and discriminative power. Across two independent depression cohorts, CoDaS surfaced circadian instability-related features in both datasets, reflected in sleep duration variability (DWB, \rho = 0.252, p < 0.001) and sleep onset variability (GLOBEM, \rho = 0.126, p < 0.001). In a metabolic cohort, CoDaS derived a cardiovascular fitness index (steps/resting heart rate; \rho = -0.374, p < 0.001), and recovered established clinical associations, including the hepatic function ratio (AST/ALT; \rho = -0.375, p < 0.001), a known correlate of insulin resistance. Incorporating CoDaS-derived features alongside demographic variables led to modest but consistent improvements in predictive performance, with cross-validated \Delta R^2 increases of 0.040 for depression and 0.021 for insulin resistance. These findings suggest that CoDaS enables systematic and traceable hypothesis generation and prioritization for biomarker discovery from large-scale wearable data.

[AI-69] El Agente Forjador: Task-Driven Agent Generation for Quantum Simulation

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的科学代理系统普遍依赖静态、人工设计工具集的问题,这种局限性阻碍了其在新领域中的适应能力和对不断演进的科学计算库的扩展。解决方案的关键在于提出一个名为 El Agente Forjador 的多智能体框架,其中通用编码代理通过四阶段工作流——工具分析、工具生成、任务执行与迭代解评估——自主锻造、验证并复用计算工具。该框架实现了从“按需生成”到“知识驱动复用”的范式转变,显著提升了任务准确性,并表明由更强代理构建的工具集可降低API成本并增强弱代理的性能,从而推动科学代理能力由任务导向而非显式工程实现所定义。

链接: https://arxiv.org/abs/2604.14609
作者: Zijian Zhang,Aiwei Yin,Amaan Baweja,Jiaru Bai,Ignacio Gustin,Varinia Bernales,Alán Aspuru-Guzik
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:AI for science promises to accelerate the discovery process. The advent of large language models (LLMs) and agentic workflows enables the expediting of a growing range of scientific tasks. However, most of the current generation of agentic systems depend on static, hand-curated toolsets that hinder adaptation to new domains and evolving libraries. We present El Agente Forjador, a multi-agent framework in which universal coding agents autonomously forge, validate, and reuse computational tools through a four-stage workflow of tool analysis, tool generation, task execution, and iterative solution evaluation. Evaluated across 24 tasks spanning quantum chemistry and quantum dynamics on five coding agent setups, we compare three operating modes: zero-shot generation of tools per task, reuse of a curriculum-built toolset, and direct problem-solving with the coding agents as the baseline. We find that our tool generation and reuse framework consistently improves accuracy over the baseline. We also show that reusing a toolset built by a stronger coding agent can reduce API cost and substantially raises the solution quality for weaker coding agents. Case studies further demonstrate that tools forged for different domains can be combined to solve hybrid tasks. Taken together, these results show that LLM-based agents can use their scientific knowledge and coding capabilities to autonomously build reusable scientific tools, pointing toward a paradigm in which agent capabilities are defined by the tasks they are designed to solve rather than by explicitly engineered implementations.

[AI-70] GDPR Auto-Formalization with AI Agents and Human Verification

【速读】:该论文旨在解决欧盟《通用数据保护条例》(GDPR)条款自动形式化(automatic formalization)过程中准确性与可靠性不足的问题,特别是在法律语义复杂性和上下文敏感性较强的场景下。其解决方案的关键在于构建一个“人在回路”(human-in-the-loop)的验证框架,采用角色专业化的工作流:在多智能体(multi-agent)设置中,大型语言模型(LLM)通过迭代反馈生成法律场景、形式化规则和原子事实;同时,独立的验证模块由人类专家对表征正确性、逻辑一致性和法律合规性进行逐层审核。实证表明,结构化的验证机制与有针对性的人工监督是实现高可靠法律形式化的必要条件。

链接: https://arxiv.org/abs/2604.14607
作者: Ha Thanh Nguyen,Wachara Fungwacharakorn,Sabine Wehnert,May Myo Zin,Yuntao Kong,Jieying Xue,Michał Araszkiewicz,Randy Goebel,Ken Satoh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ICAIL 2026

点击查看摘要

Abstract:We study the overall process of automatic formalization of GDPR provisions using large language models, within a human-in-the-loop verification framework. Rather than aiming for full autonomy, we adopt a role-specialized workflow in which LLM-based AI components, operating in a multi-agent setting with iterative feedback, generate legal scenarios, formal rules, and atomic facts. This is coupled with independent verification modules which include human reviewers’ assessment of representational, logical, and legal correctness. Using this approach, we construct a high-quality dataset to be used for GDPR auto-formalization, and analyze both successful and problematic cases. Our results show that structured verification and targeted human oversight are essential for reliable legal formalization, especially in the presence of legal nuance and context-sensitive reasoning.

[AI-71] Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

【速读】:该论文旨在解决现代大型音频-语言模型(Large Audio-Language Models, LALMs)在语音交互中因音频通道的连续性和高维特性而引入的安全漏洞问题,特别是针对仅能访问音频数据且需保持强感知隐蔽性的恶意音频注入攻击(auditory prompt injection)缺乏系统研究的现状。解决方案的关键在于提出一个通用框架——AudioHijack,其核心创新包括:基于采样梯度估计的方法实现跨多种非可微音频编码器的端到端优化;通过注意力监督与多上下文训练策略引导模型关注对抗性音频并泛化至未见用户场景;以及设计卷积混合方法将扰动调制为自然混响,显著提升隐蔽性。实验表明,该方法在13个先进LALM上实现了平均79%-96%的成功率,验证了其有效性与普适性。

链接: https://arxiv.org/abs/2604.14604
作者: Meng Chen,Kun Wang,Li Lu,Jiaheng Zhang,Tianwei Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Accepted by IEEE SP 2026

点击查看摘要

Abstract:Modern Large audio-language models (LALMs) power intelligent voice interactions by tightly integrating audio and text. This integration, however, expands the attack surface beyond text and introduces vulnerabilities in the continuous, high-dimensional audio channel. While prior work studied audio jailbreaks, the security risks of malicious audio injection and downstream behavior manipulation remain underexamined. In this work, we reveal a previously overlooked threat, auditory prompt injection, under realistic constraints of audio data-only access and strong perceptual stealth. To systematically analyze this threat, we propose \textitAudioHijack, a general framework that generates context-agnostic and imperceptible adversarial audio to hijack LALMs. \textitAudioHijack employs sampling-based gradient estimation for end-to-end optimization across diverse models, bypassing non-differentiable audio tokenization. Through attention supervision and multi-context training, it steers model attention toward adversarial audio and generalizes to unseen user contexts. We also design a convolutional blending method that modulates perturbations into natural reverberation, making them highly imperceptible to users. Extensive experiments on 13 state-of-the-art LALMs show consistent hijacking across 6 misbehavior categories, achieving average success rates of 79%-96% on unseen user contexts with high acoustic fidelity. Real-world studies demonstrate that commercial voice agents from Mistral AI and Microsoft Azure can be induced to execute unauthorized actions on behalf of users. These findings expose critical vulnerabilities in LALMs and highlight the urgent need for dedicated defense.

[AI-72] AgileLog: A Forkable Shared Log for Agents on Data Streams

【速读】:该论文旨在解决当前数据流系统在支持AI代理(AI agents)时存在的两大核心问题:一是无法有效避免代理任务对系统性能造成的干扰,二是缺乏安全机制来处理代理写操作。其解决方案的关键在于提出一种名为AgileLog的新共享日志(shared log)抽象,该抽象通过引入可分叉(forkable)机制,使日志能够为不同代理任务提供逻辑隔离的副本,从而实现性能隔离与安全写入。进一步地,作者设计了Bolt作为AgileLog的具体实现,采用创新技术降低分叉开销,并保障高效性和隔离性,为AI代理在流数据上的安全、高效执行提供了基础支撑。

链接: https://arxiv.org/abs/2604.14590
作者: Shreesha G. Bhat,Tony Hong,Michael Noguera,Ramnatthan Alagappan,Aishwarya Ganesan
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In modern data-streaming systems, alongside traditional programs, a new type of entity has emerged that can interact with streaming data: AI agents. Unlike traditional programs, AI agents use LLM reasoning to accomplish high-level tasks specified in natural language over streaming data. Unfortunately, current streaming systems cannot fully support agents: they lack the fundamental mechanisms to avoid the performance interference caused by agentic tasks and to safely handle agentic writes. We argue that the shared log, the core abstraction underlying streaming data, must support creating forks of itself, and that such a forkable shared log serves as a great substrate for agents acting on streaming data. We propose AgileLog, a new shared log abstraction that provides novel forking primitives for agentic use cases. We design Bolt, an implementation of the AgileLog abstraction, that uses novel techniques to make forks cheap, and provide logical and performance isolation.
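可分叉共享日志的抽象语义可用一个“写时共享前缀”的极简类示意:分叉共享父日志在分叉时刻之前的条目,此后各自私有追加,互不可见(仅表达 AgileLog 抽象的隔离语义,并非 Bolt 的实现):

```python
class ForkableLog:
    # 追加式日志;fork() 以写时共享方式创建逻辑副本
    def __init__(self, parent=None):
        self.parent = parent
        self.base_len = len(parent) if parent else 0  # 分叉时刻的父日志长度
        self.entries = []

    def append(self, record):
        self.entries.append(record)

    def __len__(self):
        return self.base_len + len(self.entries)

    def read(self, i):
        # 前缀从父日志读取,分叉后的条目从本地读取
        if i < self.base_len:
            return self.parent.read(i)
        return self.entries[i - self.base_len]

    def fork(self):
        return ForkableLog(parent=self)
```

分叉后主日志与分叉各自在索引 2 处写入不同记录而互不干扰,体现了代理任务所需的逻辑隔离。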

[AI-73] Enhancing Mental Health Counseling Support in Bangladesh using Culturally-Grounded Knowledge

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在心理支持与咨询场景中响应缺乏文化敏感性、情境贴合度不足以及临床指导不恰当的问题。其核心挑战在于如何系统性地将领域特定且经临床验证的知识融入LLMs以提升咨询质量。解决方案的关键在于采用基于知识图谱(Knowledge Graph, KG)的方法,该方法通过人工构建并由多学科专家验证的结构化知识体系,明确捕捉压力源、干预措施与结果之间的因果关系,从而显著优于仅依赖检索增强生成(Retrieval-Augmented Generation, RAG)的方案,在情境相关性、临床适宜性和实用性方面均表现更优,凸显了专家标注的结构化知识对弥补LLMs在咨询任务中局限性的关键作用。

链接: https://arxiv.org/abs/2604.14576
作者: Md Arid Hasan,Azhagu Meena SP,Aditya Khan,Abu Md Akteruzzaman Bhuiyan,Helal Uddin Ahmed,Joysree Debi,Farig Sadeque,Annie En-Shiun Lee,Syed Ishtiaque Ahmed
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: submitted to CLPsych 2026

点击查看摘要

Abstract:Large language models (LLMs) show promise in generating supportive responses for mental health and counseling applications. However, their responses often lack cultural sensitivity, contextual grounding, and clinically appropriate guidance. This work addresses the gap of how to systematically incorporate domain-specific, clinically validated knowledge into LLMs to improve counseling quality. We utilize and compare two approaches, retrieval-augmented generation (RAG) and a knowledge graph (KG)-based method, designed to support para-counselors. Our KG is constructed manually and clinically validated, capturing causal relationships between stressors, interventions, and outcomes, with contributions from multidisciplinary people. We evaluated multiple LLMs in both settings using BERTScore F1 and SBERT cosine similarity, as well as human evaluation across five metrics, which is designed to directly measure the effectiveness of counseling beyond similarity at the surface level. The results show that KG-based approaches consistently improve contextual relevance, clinical appropriateness, and practical usability compared to RAG alone, demonstrating that structured, expert-validated knowledge plays a critical role in addressing LLMs limitations in counseling tasks.
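基于知识图谱的“压力源 → 结果 → 干预”推理可用如下极简三元组遍历示意(三元组内容为虚构示例,并非论文中经临床验证的图谱):

```python
# 假设的因果/干预三元组,仅作演示
KG = [
    ("exam_pressure", "causes", "sleep_disruption"),
    ("sleep_disruption", "causes", "low_mood"),
    ("sleep_hygiene_routine", "addresses", "sleep_disruption"),
    ("breathing_exercise", "addresses", "exam_pressure"),
]

def interventions_for(stressor):
    # 沿 causes 边做可达性搜索,收集针对压力源本身
    # 及其任一下游结果的干预措施
    frontier, seen = [stressor], {stressor}
    while frontier:
        node = frontier.pop()
        for s, rel, o in KG:
            if rel == "causes" and s == node and o not in seen:
                seen.add(o)
                frontier.append(o)
    return sorted(s for s, rel, o in KG if rel == "addresses" and o in seen)
```

这种结构化检索为 LLM 的回复提供了可追溯的临床依据,与仅靠文本相似度的 RAG 检索形成对照。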

[AI-74] Generative Augmented Inference

【速读】:该论文旨在解决数据驱动运营管理系统中因依赖昂贵的人工标注参数估计而导致的效率低下问题,尤其是在引入生成式 AI(如大语言模型,LLMs)作为低成本辅助数据源时,其输出与人类标签之间存在高维、复杂且未知的关系,传统方法将其直接作为真值代理会导致估计不一致或不可靠。解决方案的关键在于提出一种通用框架——生成式增强推断(Generative Augmented Inference, GAI),其核心创新是采用正交矩构造(orthogonal moment construction),使得在LLM生成输出与人类标签之间关系任意非参数的情况下仍能实现一致估计和有效推断,并具备“安全默认”特性:相对于仅使用人类数据的估计器,GAI在任意辅助信号下均不劣于原方法,且当辅助信息具有预测能力时可严格提升估计效率。

链接: https://arxiv.org/abs/2604.14575
作者: Cheng Lu,Mengxin Wang,Dennis J. Zhang,Heng Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Data-driven operations management often relies on parameters estimated from costly human-generated labels. Recent advances in large language models (LLMs) and other AI systems offer inexpensive auxiliary data, but introduce a new challenge: AI outputs are not direct observations of the target outcomes, but could involve high-dimensional representations with complex and unknown relationships to human labels. Conventional methods leverage AI predictions as direct proxies for true labels, which can be inefficient or unreliable when this relationship is weak or misspecified. We propose Generative Augmented Inference (GAI), a general framework that incorporates AI-generated outputs as informative features for estimating models of human-labeled outcomes. GAI uses an orthogonal moment construction that enables consistent estimation and valid inference with flexible, nonparametric relationship between LLM-generated outputs and human labels. We establish asymptotic normality and show a “safe default” property: relative to human-data-only estimators, GAI weakly improves estimation efficiency under arbitrary auxiliary signals and yields strict gains whenever the auxiliary information is predictive. Empirically, GAI outperforms benchmarks across diverse settings. In conjoint analysis with weak auxiliary signals, GAI reduces estimation error by about 50% and lowers human labeling requirements by over 75%. In retail pricing, where all methods access the same auxiliary inputs, GAI consistently outperforms alternative estimators, highlighting the value of its construction rather than differences in information. In health insurance choice, it cuts labeling requirements by over 90% while maintaining decision accuracy. Across applications, GAI improves confidence interval coverage without inflating width. Overall, GAI provides a principled and scalable approach to integrating AI-generated information.
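GAI 的正交矩构造较为一般化;其“用少量人工标签校正廉价 AI 预测的系统性偏差”这一核心思想,可用类似 prediction-powered inference 的均值估计示意(仅为假设性的简化版本,并非论文的估计量本身):

```python
def augmented_mean(human_y, ai_pred_labeled, ai_pred_all):
    # 在带人工标签的小样本上估计 AI 预测的偏差,
    # 再用全体样本上的 AI 预测均值加上该偏差作为校正后的估计
    n = len(human_y)
    bias = sum(y - f for y, f in zip(human_y, ai_pred_labeled)) / n
    return sum(ai_pred_all) / len(ai_pred_all) + bias
```

当 AI 预测与人工标签强相关时,该估计比仅用人工标签的样本均值方差更小;当预测无信息量时则退化为偏差校正项主导,对应摘要中的“安全默认”性质。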

[AI-75] CSRA: Controlled Spectral Residual Augmentation for Robust Sepsis Prediction

【速读】:该论文旨在解决重症监护病房(Intensive Care Unit, ICU)中脓毒症(sepsis)短期窗口预测难题,即在有限的历史观测窗口下如何准确预测未来风险与疾病进展。传统方法因短时间窗口信息不足而难以建模动态变化,同时长预测时域又面临有效监督信号稀缺的问题。其解决方案的关键在于提出一种受控频域残差增强框架(Controlled Spectral Residual Augmentation, CSRA),通过按临床系统分组提取多尺度表征,并在频域内进行输入自适应的残差扰动以生成结构化且符合临床逻辑的时间轨迹变异;同时,CSRA 采用端到端训练策略,联合下游预测器优化统一目标函数,并引入锚点一致性损失和控制器正则化机制,从而显著提升数据增强的稳定性与可控性。实验表明,该方法在多个下游模型上均优于基线,在更短观察窗口、更长预测时域及小样本场景下仍具更强鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2604.14532
作者: Honglin Guo,Rihao Chang,He Jiao,Weizhi Nie,Zhongheng Zhang,Yuehao Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate prediction of future risk and disease progression in sepsis is clinically important for early warning and timely intervention in intensive care. However, short-window sepsis prediction remains challenging, because shorter observation windows provide limited historical evidence, whereas longer prediction horizons reduce the number of patient trajectories with valid future supervision. To address this problem, we propose CSRA, a Controlled Spectral Residual Augmentation framework for short-window multi-system ICU time series. CSRA first groups variables by clinical systems and extracts system-level and global representations. It then performs input-adaptive residual perturbation in the spectral domain to generate structured and clinically plausible trajectory variations. To improve augmentation stability and controllability, CSRA is trained end-to-end with the downstream predictor under a unified objective, together with anchor consistency loss and controller regularization. Experiments on a MIMIC-IV sepsis cohort across multiple downstream models show that CSRA is consistently competitive and often superior, reducing regression error by 10.2% in MSE and 3.7% in MAE over the non-augmentation baseline, while also yielding consistent gains on classification. CSRA further maintains more favorable performance under shorter observation windows, longer prediction horizons, and smaller training data scales, while also remaining effective on an external clinical dataset (ZiGongICUinfection), indicating stronger robustness and generalizability in clinically constrained settings.

[AI-76] RACER: Trace-Based Adaptive Cost-Efficient Routing for LLM Classification

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)分类任务中推理成本过高与资源利用率低的问题。其核心挑战在于如何在不显著增加边际计算开销的前提下,利用LLM自身生产环境中产生的标注数据(即输入-输出对)构建轻量级替代模型(surrogate),并动态决定何时由该替代模型处理请求、何时回退至原LLM。解决方案的关键是提出TRACER系统——一个基于轨迹的自适应高效路由框架,通过引入“一致性门控机制”(parity gate),仅当替代模型与LLM预测结果的一致性超过用户设定阈值α时才启用替代模型;同时生成可解释性可视化工具,明确展示替代模型覆盖的输入区域、性能饱和点及拒答原因,从而实现透明、可控且成本敏感的部署策略。

链接: https://arxiv.org/abs/2604.14531
作者: Adam Rida
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: this http URL

点击查看摘要

Abstract:Every call to an LLM classification endpoint produces a labeled input-output pair already retained in production logs. These pairs constitute a free, growing training set: a lightweight surrogate trained on them can absorb a significant portion of future traffic at near-zero marginal inference cost. The open questions are when the surrogate is reliable enough to deploy, what it handles versus defers, and how that boundary evolves as data accumulates. We introduce TRACER (Trace-based Adaptive Cost-Efficient Routing), an open-source system that trains ML surrogates on an LLM’s own production traces and governs deployment through a parity gate: the surrogate is activated only when its agreement with the LLM exceeds a user-specified threshold \alpha. To make the routing boundary transparent, TRACER generates interpretability artifacts describing which input regions the surrogate handles, where it plateaus, and why it defers. On a 77-class intent benchmark with a Sonnet 4.6 teacher, TRACER achieves 83-100% surrogate coverage depending on the quality target \alpha; on a 150-class benchmark, the surrogate fully replaces the teacher. On a natural language inference task, the parity gate correctly refuses deployment because the embedding representation cannot support reliable separation. The system is available as open-source software.
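TRACER 的一致性门控可用如下示意:只有当替代模型与 LLM 在留存轨迹上的一致率达到用户设定阈值 α 时才启用替代模型,否则流量继续回退到 LLM(函数名为示例假设,非 TRACER 的实际 API):

```python
def parity_gate(surrogate_preds, llm_preds, alpha):
    # 部署门控:一致率 parity 达到阈值 alpha 才批准替代模型上线
    agreement = sum(a == b for a, b in zip(surrogate_preds, llm_preds))
    parity = agreement / len(llm_preds)
    return parity >= alpha, parity

def route(x, surrogate, llm, surrogate_active):
    # 门控批准后由廉价替代模型服务,否则回退到 LLM 端点
    return surrogate(x) if surrogate_active else llm(x)
```

阈值 α 直接对应摘要中“质量目标”与替代覆盖率之间的权衡:α 越高,门控越保守,回退到 LLM 的流量越多。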

[AI-77] Quantifying Cross-Query Contradictions in Multi-Query LLM Reasoning ICLR2026

【速读】:该论文旨在解决大语言模型在处理多个相关查询时产生的逻辑不一致问题,即“案例文件逻辑一致性”(case-file logical consistency)——确保跨查询间信念状态的全局可满足性。其解决方案的关键在于提出一种增强求解器的方法:通过提取承诺(commitments)、验证全局可满足性,并基于反例引导修复(counterexample-guided repair),从而显著降低跨查询矛盾率(SetCons从0.56提升至0.94),同时保持单个查询的准确性,证明了全局一致性对鲁棒多查询推理的重要性。

链接: https://arxiv.org/abs/2604.14525
作者: Rohit Kumar Salla,Ramya Manasa Amancherla,Manoj Saravanan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the ICLR 2026 Workshop on Logical Reasoning of Large Language Models. 9 pages, 6 tables, code and data at this https URL

点击查看摘要

Abstract:Large language models frequently produce mutually inconsistent answers when reasoning over multiple related queries. We study case-file logical consistency: maintaining a globally satisfiable belief state across interdependent queries. We introduce a benchmark of 390 multi-query reasoning instances with entailment/contradiction/unknown labels and propose set-level metrics including Case Satisfiability Rate, Contradiction Density and Revision Cost. Our solver-augmented approach extracts commitments, verifies global satisfiability and performs counterexample-guided repair. Across four reasoning domains, our method substantially reduces cross-query contradictions (SetCons: 0.56 to 0.94) while preserving per-query accuracy, demonstrating that global coherence is critical for robust multi-query reasoning.
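跨查询“全局可满足性”检查可用对承诺集合的暴力枚举示意:只要存在一个同时满足全部承诺的真值指派,案例文件即一致(论文使用求解器并带反例引导修复,此处仅为小规模概念演示):

```python
from itertools import product

def case_satisfiable(variables, commitments):
    # 枚举所有真值指派;存在一个满足全部承诺的"世界"即为全局一致
    for assignment in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, assignment))
        if all(c(world) for c in commitments):
            return True
    return False
```

例如承诺集 {p, p→q} 可满足,而加入 ¬q 后出现跨查询矛盾,对应指标中 Case Satisfiability Rate 的下降。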

[AI-78] Mind DeepResearch Technical Report

【速读】:该论文旨在解决当前多智能体深度研究系统在资源效率与性能之间难以平衡的问题,即如何在有限参数规模(约30B)下实现接近甚至超越更大模型的深度研究能力。解决方案的关键在于提出了一种名为Mind DeepResearch (MindDR) 的高效多智能体深度研究框架,其核心创新包括一个协作式的三智能体架构(规划代理、深度搜索代理和报告代理)以及一个四阶段专业化训练流程(SFT冷启动、Search-RL、Report-RL和偏好对齐)。该设计通过分阶段优化各智能体的功能并强化其协同能力,显著提升了模型在真实场景下的表现,尤其在中文语境下的复杂查询任务中展现出优越性能。

链接: https://arxiv.org/abs/2604.14518
作者: MindDR Team,Li Auto Inc
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Mind DeepResearch (MindDR), an efficient multi-agent deep research framework that achieves leading performance with only ~30B-parameter models through a meticulously designed data synthesis and multi-stage training pipeline. The core innovation of MindDR lies in a collaborative three-agent architecture (Planning Agent, DeepSearch Agent, and Report Agent) and a four-stage agent-specialized training pipeline comprising SFT cold-start, Search-RL, Report-RL and preference alignment. With this regime, MindDR demonstrates competitive performance even with ~30B-scale models. Specifically, MindDR achieves 45.7% on BrowseComp-ZH, 42.8% on BrowseComp, 46.5% on WideSearch, 75.0% on xbench-DS, and 52.5 on DeepResearch Bench, outperforming comparable-scale open-source agent systems and rivaling larger-scale models. MindDR has been deployed as an online product in Li Auto. Furthermore, we introduce MindDR Bench, a curated benchmark of 500 real-world Chinese queries from our internal product user interactions, evaluated through a comprehensive multi-dimensional rubric system rather than relying on a single RACE metric. On MindDR Bench, MindDR achieves a state-of-the-art score of 51.8.

[AI-79] Perspective on Bias in Biomedical AI: Preventing Downstream Healthcare Disparities

【速读】:该论文旨在解决生物医学人工智能(Artificial Intelligence, AI)研究中因早期数据收集和研究优先级设置所导致的系统性偏倚问题,这些问题在分子层面的数据采集阶段即已存在,并可能通过预训练基础模型(biomedical foundation models)被放大,最终造成健康不平等的持续甚至加剧。解决方案的关键在于推动整个科研社区遵循三项基础原则:数据来源可追溯性(Provenance)、开放共享(Openness)与评估透明度(Evaluation Transparency),以确保模型开发过程中的公平性和鲁棒性,从而提升对弱势人群的健康服务效能。

链接: https://arxiv.org/abs/2604.14514
作者: Michal Rosen-Zvi,Yoav Kan-Tor,Michael Danziger,Agata Ferretti,Javier Aula-Blasco,Julia Falcao,Ron Shamir,Mordechai Muszkat
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Healthcare disparities persist across socioeconomic boundaries, often attributed to unequal access to screening, diagnostics, and therapeutics. However, this perspective highlights that critical biases can emerge much earlier, during data collection and research prioritization, long before clinical implementation in cases where the focus of the studies and the data that is collected is at the molecular level. A vast number of studies focus on collecting omics data but the demographic information associated with these datasets is often not reported in the studies, and when it is reported, it shows big biases. An automated analysis of 4719 PubMed-indexed omics publications from 2015 to 2024 reveals that only a small fraction report ancestry or ethnicity information, with ancestry reporting improving slightly. Analysis of large-scale datasets commonly used for model training, such as CellxGene and GEO, reveals substantial population bias where European-ancestry data dominates. As biomedical foundation models become central to biomedical discovery with a paradigm in which base models are pretrained on large datasets and reusing them time and again for many different downstream tasks, they risk perpetuating or amplifying these early-stage biases, leading to cascading inequities that regulatory interventions cannot fully reverse. We propose a community-wide focus on three foundational principles: Provenance, Openness, and Evaluation Transparency to improve equity and robustness in biomedical AI. This approach aims to foster biomedical innovation that more effectively serves underserved populations and improves health outcomes.

[AI-80] CBCL: Safe Self-Extending Agent Communication

【速读】:该论文旨在解决异构智能体在跨域协作中因通信语言扩展性需求而导致的验证复杂度激增问题,即当代理通信语言(ACL)支持灵活扩展时,可能使输入语言超出可 tractable(易处理)的复杂度类,从而难以进行形式化验证。解决方案的关键在于提出一种名为 CBCL(Common Business Communication Language)的语言设计,其核心是将所有消息(包括运行时扩展)限制在确定性上下文无关语言(DCFL)类别内,并通过三个安全不变量(R1–R3)确保扩展不会导致资源无界增长、遵守声明的资源限制并保持核心词汇完整性;此外,CBCL 采用同像协议设计(homoiconic protocol design),使得扩展定义与普通消息共享相同表示形式,同时借助 Lean 4 形式化验证和 Rust 实现的参考解析器及方言引擎,实现了可验证的安全扩展机制。

Link: https://arxiv.org/abs/2604.14512
Authors: Hugo O’Connor
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
Comments: 10 pages. Accepted at IEEE LangSec Workshop 2026 (camera-ready). Reference implementation, Lean 4 formalization, and verified parser: this https URL ; Nostr transport binding: this https URL

Click to view abstract

Abstract:Agent communication languages (ACLs) enable heterogeneous agents to share knowledge and coordinate across diverse domains. This diversity demands extensibility, but expressive extension mechanisms can push the input language beyond the complexity classes where full validation is tractable. We present CBCL (Common Business Communication Language), an agent communication language that constrains all messages, including runtime language extensions, to the deterministic context-free language (DCFL) class. CBCL allows agents to define, transmit, and adopt domain-specific “dialect” extensions as first-class messages; three safety invariants (R1–R3), machine-checked in Lean 4 and enforced in a Rust reference implementation, prevent unbounded expansion, enforce declared resource limits, and preserve core vocabulary. We formalize the language and its safety properties in Lean 4, implement a reference parser and dialect engine in Rust with property-based and differential tests, and extract a verified parser binary. Our results demonstrate that homoiconic protocol design, where extension definitions share the same representation as ordinary messages, can be made provably safe. As autonomous agents increasingly extend their own communication capabilities, formally bounding what they can express to each other is a precondition for oversight.
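The abstract does not give CBCL's grammar, but the general idea of a deterministic parser that enforces declared resource limits before any semantic processing can be sketched as follows. All syntax, limit names, and vocabulary here are hypothetical illustrations in the spirit of invariants R1-R3, not CBCL's actual design:

```python
# Illustrative sketch (not CBCL's grammar): a deterministic recursive-
# descent parser for s-expression messages that enforces declared
# resource limits before any semantic processing.

CORE_VOCAB = {"ask", "tell", "define-dialect"}  # hypothetical core terms

def parse(msg, max_depth=8, max_len=1024):
    if len(msg) > max_len:                  # R2-style declared size limit
        raise ValueError("message exceeds declared size limit")
    pos = 0
    def node(depth):
        nonlocal pos
        if depth > max_depth:               # R1-style bounded nesting
            raise ValueError("nesting exceeds declared depth limit")
        if msg[pos] == "(":
            pos += 1
            children = []
            while msg[pos] != ")":
                children.append(node(depth + 1))
                if msg[pos] == " ":
                    pos += 1
            pos += 1
            return children
        start = pos
        while pos < len(msg) and msg[pos] not in " ()":
            pos += 1
        return msg[start:pos]
    tree = node(0)
    if pos != len(msg):
        raise ValueError("trailing input")
    return tree

def extend_vocab(vocab, new_terms):
    # R3-style invariant: extensions may add terms but never shadow core ones.
    clash = CORE_VOCAB & set(new_terms)
    if clash:
        raise ValueError(f"extension redefines core terms: {clash}")
    return vocab | set(new_terms)
```

Because every limit is checked syntactically, a malformed or oversized message is rejected before it can influence the dialect state, which is the property the paper's invariants make machine-checked.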

[AI-81] On the Expressive Power and Limitations of Multi-Layer SSMs

【Quick Read】: This paper studies the expressive limitations of multi-layer state-space models (SSMs), in particular their weaknesses on compositional tasks and the fundamental gap between SSMs and streaming models. The authors find that base multi-layer SSMs face inherent bottlenecks on complex reasoning tasks, but adding online chain-of-thought (CoT) substantially increases their expressive power, making them equivalent to streaming algorithms; they further show that width and precision are not interchangeable resources in the base model, yet become interchangeable once online CoT is enabled. The key to the solution is the online CoT mechanism, which reshapes the computational paradigm of SSMs and grants them stronger compositional reasoning and more flexible resource allocation.

Link: https://arxiv.org/abs/2604.14501
Authors: Nikola Zubić, Qian Li, Yuyi Wang, Davide Scaramuzza
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
Comments: 25 pages, 6 theorems

Click to view abstract

Abstract:We study the expressive power and limitations of multi-layer state-space models (SSMs). First, we show that multi-layer SSMs face fundamental limitations in compositional tasks, revealing an inherent gap between SSMs and streaming models. Then, we examine the role of chain-of-thought (CoT), showing that offline CoT does not fundamentally increase the expressiveness, while online CoT can substantially increase its power. Indeed, with online CoT, multi-layer SSMs become equivalent in power to streaming algorithms. Finally, we investigate the tradeoff between width and precision, showing that these resources are not interchangeable in the base model, but admit a clean equivalence once online CoT is allowed. Overall, our results offer a unified perspective on how depth, finite precision, and CoT shape the power and limits of SSMs.

[AI-82] Geometric Metrics for MoE Specialization: From Fisher Information to Early Failure Detection

【Quick Read】: This paper addresses the problem of measuring expert specialization in Mixture-of-Experts (MoE) models, where existing metrics such as cosine similarity and routing entropy lack theoretical grounding and behave inconsistently under reparameterization. The key to the solution is an information-geometric framework that treats expert routing distributions as points on the probability simplex equipped with the Fisher information metric, enabling rigorous analysis via Riemannian geometry; the framework proves for the first time that standard heuristic metrics violate parameterization invariance (Theorem 1), shows that specialization corresponds to geodesic flow with quantified approximation bounds (Theorem 2), and derives a training-failure predictor with a theoretically justified threshold (Theorem 3). On this basis, the paper introduces two principled metrics, the Fisher Specialization Index (FSI) and the Fisher Heterogeneity Score (FHS), which significantly outperform traditional methods and validate the theoretical predictions across multiple downstream tasks and scales.

Link: https://arxiv.org/abs/2604.14500
Authors: Dongxin Guo, Jikun Wu, Siu Ming Yiu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 figures, 7 tables

Click to view abstract

Abstract:Expert specialization is fundamental to Mixture-of-Experts (MoE) model success, yet existing metrics (cosine similarity, routing entropy) lack theoretical grounding and yield inconsistent conclusions under reparameterization. We present an information-geometric framework providing the first rigorous characterization of MoE specialization dynamics. Our key insight is that expert routing distributions evolve on the probability simplex equipped with the Fisher information metric, enabling formal analysis via Riemannian geometry. We prove that standard heuristic metrics violate parameterization invariance (Theorem 1), establish that specialization corresponds to geodesic flow with quantified approximation bounds (Theorem 2), and derive a failure predictor with theoretical threshold justification (Theorem 3). The framework introduces two principled metrics: Fisher Specialization Index (FSI) achieving r = 0.91 ± 0.02 correlation with downstream performance, and Fisher Heterogeneity Score (FHS) predicting training failure at 10% completion with AUC = 0.89 ± 0.03 – outperforming validation-loss-based early stopping by 23% while requiring 40x fewer compute cycles. We validate intervention protocols achieving an 87% recovery rate when FHS > 1 is detected. Comprehensive experiments across language modeling (WikiText-103, C4), vision MoE (ImageNet), and scaling studies (8-64 experts, 125M-2.7B parameters) validate our theoretical predictions.

[AI-83] Improving Machine Learning Performance with Synthetic Augmentation

【Quick Read】: This paper addresses the limits that data scarcity places on model performance in financial machine learning, and in particular the poorly understood statistical mechanism of synthetic augmentation in practice and its underappreciated risk of bias. The key to the solution is to formalize synthetic augmentation as a modification of the effective training distribution and to expose the structural bias-variance trade-off it induces: although larger sample sizes can reduce estimation error, if the synthetic distribution deviates from the regions that matter at evaluation time, it can shift the population objective and introduce systematic bias. To this end, the authors propose a size-matched null augmentation baseline and a finite-sample, non-parametric block permutation test valid under weak temporal dependence, separating informational gains from pure sample-size effects and providing a verifiable evaluation framework for judging whether synthetic data genuinely improves financial learning performance or induces persistent distribution shift.

Link: https://arxiv.org/abs/2604.14498
Authors: Mel Sohm, Charles Dezons, Sami Sellami, Oscar Ninou, Axel Pincon
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Synthetic augmentation is increasingly used to mitigate data scarcity in financial machine learning, yet its statistical role remains poorly understood. We formalize synthetic augmentation as a modification of the effective training distribution and show that it induces a structural bias–variance trade-off: while additional samples may reduce estimation error, they may also shift the population objective whenever the synthetic distribution deviates from regions relevant under evaluation. To isolate informational gains from mechanical sample-size effects, we introduce a size-matched null augmentation and a finite-sample, non-parametric block permutation test that remains valid under weak temporal dependence. We evaluate this framework in both controlled Markov-switching environments and real financial datasets, including high-frequency option trade data and a daily equity panel. Across generators spanning bootstrap, copula-based models, variational autoencoders, diffusion models, and TimeGAN, we vary augmentation ratio, model capacity, task type, regime rarity, and signal-to-noise. We show that synthetic augmentation is beneficial only in variance-dominant regimes, such as persistent volatility forecasting, while it deteriorates performance in bias-dominant settings, including near-efficient directional prediction. Rare-regime targeting can improve domain-specific metrics but may conflict with unconditional permutation inference. Our results provide a structural perspective on when synthetic data improves financial learning performance and when it induces persistent distributional distortion.
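The block permutation idea can be illustrated with a minimal sketch (ours, not the paper's exact statistic or null): contiguous blocks are shuffled so that short-range dependence within each block survives while the quantity under test is destroyed, and the p-value uses the finite-sample-valid (1 + hits)/(1 + n_perm) form:

```python
import random

def block_permute(series, block_len, rng):
    # Shuffle contiguous blocks; the multiset of observations is
    # preserved and dependence *within* each block survives.
    blocks = [series[i:i + block_len] for i in range(0, len(series), block_len)]
    rng.shuffle(blocks)
    return [v for b in blocks for v in b]

def block_permutation_pvalue(statistic, series, block_len, n_perm=199, seed=0):
    rng = random.Random(seed)
    observed = statistic(series)
    hits = sum(
        statistic(block_permute(series, block_len, rng)) >= observed
        for _ in range(n_perm)
    )
    return (1 + hits) / (1 + n_perm)   # finite-sample valid under the null

def lag1_autocorr(x):
    m = sum(x) / len(x)
    num = sum((a - m) * (b - m) for a, b in zip(x, x[1:]))
    den = sum((a - m) ** 2 for a in x)
    return num / den

# Example: an AR(1) series has strong serial dependence, so permuting
# single observations (block_len=1) yields a small p-value.
rng = random.Random(1)
ar = [0.0]
for _ in range(499):
    ar.append(0.8 * ar[-1] + rng.gauss(0, 1))
p = block_permutation_pvalue(lag1_autocorr, ar, block_len=1)
```

Choosing `block_len` larger than the dependence length is what makes the test valid under weak temporal dependence; the degenerate `block_len=1` case above is an ordinary permutation test.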

[AI-84] Decoupling Identity from Utility: Privacy-by-Design Frameworks for Financial Ecosystems

【Quick Read】: This paper addresses the tension financial institutions face between data utility and privacy protection: how to retain as much data utility as possible while meeting strict regulatory requirements and avoiding the re-identification risks of traditional anonymization. The core solution is to use differentially private (DP) synthetic data as a "Privacy by Design" framework, realized through two generative paradigms: Direct Tabular Synthesis, which reconstructs high-fidelity joint distributions of the raw data and suits static historical correlation analysis; and DP-Seeded Agent-Based Modeling (ABM), which uses differentially private aggregate parameters to drive complex stateful simulations, building a forward-looking "counterfactual laboratory" for modeling dynamic market behavior and black swan events. Both approaches decouple individual identity from data utility, removing traditional data-clearing bottlenecks and enabling compliant cross-institutional research and decision-making.

Link: https://arxiv.org/abs/2604.14495
Authors: Ifayoyinsola Ibikunle, Tyler Farnan, Senthil Kumar, Mayana Pereira
Affiliation: Unknown
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Click to view abstract

Abstract:Financial institutions face tension between maximizing data utility and mitigating the re-identification risks inherent in traditional anonymization methods. This paper explores Differentially Private (DP) synthetic data as a robust “Privacy by Design” framework to resolve this conflict, ensuring output privacy while satisfying stringent regulatory obligations. We examine two distinct generative paradigms: Direct Tabular Synthesis, which reconstructs high-fidelity joint distributions from raw data, and DP-Seeded Agent-Based Modeling (ABM), which uses DP-protected aggregates to parameterize complex, stateful simulations. While tabular synthesis excels at reflecting static historical correlations for QA testing and business analytics, the DP-Seeded ABM offers a forward-looking “counterfactual laboratory” capable of modeling dynamic market behaviors and black swan events. By decoupling individual identities from data utility, these methodologies eliminate traditional data-clearing bottlenecks, enabling seamless cross-institutional research and compliant decision-making in an evolving regulatory landscape.

[AI-85] Pushing the Limits of On-Device Streaming ASR: A Compact High-Accuracy English Model for Low-Latency Inference

【Quick Read】: This paper addresses the problem of deploying high-quality automatic speech recognition (ASR) models on resource-constrained edge devices, where the core challenge is to jointly optimize recognition accuracy, inference latency, and memory footprint on CPU only, without GPU acceleration. The key to the solution is a systematic evaluation of state-of-the-art ASR architectures (spanning encoder-decoder, transducer, and large language model (LLM) based paradigms) across batch, chunked, and streaming inference modes, followed by a re-implementation of the complete streaming inference pipeline in ONNX Runtime combined with several post-training quantization strategies (importance-weighted k-quant, mixed-precision schemes, and round-to-nearest quantization) and graph-level operator fusion. This reduces the model from 2.47 GB to as little as 0.67 GB while keeping word error rate (WER) within 1% of the full-precision PyTorch baseline, establishing a new quality-efficiency Pareto point for low-latency, real-time streaming ASR on CPU.

Link: https://arxiv.org/abs/2604.14493
Authors: Nenad Banfic, David Fan, Kunal Vaishnavi, Sam Kemp, Sunghoon Choi, Rui Ren, Sayan Shaw, Meng Tang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Deploying high-quality automatic speech recognition (ASR) on edge devices requires models that jointly optimize accuracy, latency, and memory footprint while operating entirely on CPU without GPU acceleration. We conduct a systematic empirical study of state-of-the-art ASR architectures, encompassing encoder-decoder, transducer, and LLM-based paradigms, evaluated across batch, chunked, and streaming inference modes. Through a comprehensive benchmark of over 50 configurations spanning OpenAI Whisper, NVIDIA Nemotron, Parakeet TDT, Canary, Conformer Transducer, and Qwen3-ASR, we identify NVIDIA’s Nemotron Speech Streaming as the strongest candidate for real-time English streaming on resource-constrained hardware. We then re-implement the complete streaming inference pipeline in ONNX Runtime and conduct a controlled evaluation of multiple post-training quantization strategies, including importance-weighted k-quant, mixed-precision schemes, and round-to-nearest quantization, combined with graph-level operator fusion. These optimizations reduce the model from 2.47 GB to as little as 0.67 GB while maintaining word error rate (WER) within 1% absolute of the full-precision PyTorch baseline. Our recommended configuration, the int4 k-quant variant, achieves 8.20% average streaming WER across eight standard benchmarks, running comfortably faster than real-time on CPU with 0.56 s algorithmic latency, establishing a new quality-efficiency Pareto point for on-device streaming ASR.
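Round-to-nearest (RTN) quantization, one of the post-training schemes the abstract compares, is simple enough to sketch end to end. The grouped symmetric int4 variant below is a generic illustration (the group size and scale rule are assumptions, not the ONNX Runtime implementation):

```python
def rtn_quantize_int4(weights, group_size=32):
    # Symmetric round-to-nearest int4: one scale per group, codes in
    # [-8, 7], scale = max|w| / 7 so the largest weight maps near 7.
    codes, scales = [], []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        s = max(abs(w) for w in group) / 7 or 1.0   # 1.0 for all-zero groups
        scales.append(s)
        codes.extend(max(-8, min(7, round(w / s))) for w in group)
    return codes, scales

def rtn_dequantize(codes, scales, group_size=32):
    return [q * scales[i // group_size] for i, q in enumerate(codes)]
```

With symmetric scaling the per-element reconstruction error is bounded by half the group scale, which is why the WER gap can stay small; schemes like k-quant add refinements such as importance weighting on top of this basic mechanism.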

[AI-86] A Nonasymptotic Theory of Gain-Dependent Error Dynamics in Behavior Cloning

【Quick Read】: This paper addresses closed-loop failures of behavior cloning (BC) policies on position-controlled robots caused by poorly chosen PD controller gains, and in particular the lack of a nonasymptotic theory of how controller gains affect BC failure probability. The key to the solution is a framework based on sub-Gaussian action-error propagation, deriving a proxy matrix X_\infty(K) for closed-loop position errors and proving that the task failure probability factorizes into a gain-dependent amplification index \Gamma_T(K) times the validation loss plus a generalization slack, implying that training loss alone cannot predict closed-loop performance. Further, under shape-preserving upper-bound assumptions the proxy matrix admits the scalar bound X_\infty(K) \preceq \Psi(K)\bar X, where \Psi(K) decomposes into label difficulty, injection strength, and a contraction factor, enabling quantitative comparison of four canonical PD regimes (compliant-overdamped tightest, stiff-underdamped loosest). The paper also proves monotonicity over the entire stable region, both in continuous time and under zero-order-hold discretization, giving the first nonasymptotic explanation of the empirical finding that compliant, overdamped controllers improve BC success rates.

Link: https://arxiv.org/abs/2604.14484
Authors: Junghoon Seo
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:Behavior cloning (BC) policies on position-controlled robots inherit the closed-loop response of the underlying PD controller, yet the effect of controller gains on BC failure lacks a nonasymptotic theory. We show that independent sub-Gaussian action errors propagate through the gain-dependent closed-loop dynamics to yield sub-Gaussian position errors whose proxy matrix X_\infty(K) governs the failure tail. The probability of horizon-T task failure factorizes into a gain-dependent amplification index \Gamma_T(K) and the validation loss plus a generalization slack, so training loss alone cannot predict closed-loop performance. Under shape-preserving upper-bound structural assumptions the proxy admits the scalar bound X_\infty(K) \preceq \Psi(K)\bar X with \Psi(K) decomposed into label difficulty, injection strength, and contraction, ranking the four canonical regimes with compliant-overdamped (CO) tightest, stiff-underdamped (SU) loosest, and the stiff-overdamped versus compliant-underdamped ordering system-dependent. For the canonical scalar second-order PD system the closed-form continuous-time stationary variance X_\infty^c(\alpha,\beta) = \sigma^2\alpha/(2\beta) is strictly monotone in stiffness and damping over the entire stable orthant, covering both underdamped and overdamped regimes, and the exact zero-order-hold (ZOH) discretization inherits this monotonicity. The analysis provides the first nonasymptotic explanation of the empirical finding that compliant, overdamped controllers improve BC success rates.
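The closed-form stationary variance quoted in the abstract is an instance of the continuous-time Lyapunov equation A X + X Aᵀ + Q = 0 for a linear-Gaussian closed loop. A sketch for a PD-controlled double integrator (our parameterization with stiffness k and damping c, which need not match the paper's α, β) exhibits the same σ²/(gain product) structure and the monotone decrease in both gains:

```python
def stationary_covariance_pd(k, c, sigma2):
    # Closed-loop double integrator under PD control:
    #   d/dt [x, v] = [[0, 1], [-k, -c]] [x, v] + [0, sigma] dW.
    # The stationary covariance P solves A P + P A^T + Q = 0 with
    # Q = [[0, 0], [0, sigma2]]; in closed form P is diagonal with
    # position variance sigma2/(2kc) and velocity variance sigma2/(2c).
    return [[sigma2 / (2 * k * c), 0.0], [0.0, sigma2 / (2 * c)]]

def lyapunov_residual(P, k, c, sigma2):
    # Check A P + P A^T + Q entrywise; should be (numerically) zero.
    A = [[0.0, 1.0], [-k, -c]]
    Q = [[0.0, 0.0], [0.0, sigma2]]
    R = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            R[i][j] = (sum(A[i][m] * P[m][j] for m in range(2))
                       + sum(P[i][m] * A[j][m] for m in range(2))
                       + Q[i][j])
    return R
```

The position variance σ²/(2kc) is strictly decreasing in both the stiffness k and the damping c, mirroring the monotonicity result the abstract states for its scalar PD system.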

[AI-87] Seeing Through Circuits: Faithful Mechanistic Interpretability for Vision Transformers

【Quick Read】: This paper aims to make the internal computations of vision transformers transparent, revealing how information flows through the model in order to improve interpretability, trust, and safety. Existing work mostly identifies neuron-level circuits, which only explain how information is encoded, not how it is routed through the network's complex wiring. The paper proposes an Automatic Visual Circuit Discovery (Vi-CD) method whose key idea is to build task-specific computational graphs and identify edge-based mechanistic circuits. Vi-CD accurately recovers class-specific circuits for classification, identifies circuits underlying typographic attacks in CLIP, and discovers circuits that can be steered to correct harmful model behavior, providing an actionable, insightful edge-level interpretability framework for understanding the internal reasoning of vision transformers.

Link: https://arxiv.org/abs/2604.14477
Authors: Nina Żukowska, Wolfgang Stammer, Bernt Schiele, Jonas Fischer
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Transparency of neural networks’ internal reasoning is at the heart of interpretability research, adding to trust, safety, and understanding of these models. The field of mechanistic interpretability has recently focused on studying task-specific computational graphs, defined by connections (edges) between model components. Such edge-based circuits have been defined in the context of large language models, yet vision-based approaches so far only consider neuron-based circuits. These tell which information is encoded, but not how it is routed through the complex wiring of a neural network. In this work, we investigate whether useful mechanistic circuits can be identified through computational graphs in vision transformers. We propose an effective method for Automatic Visual Circuit Discovery (Vi-CD) that recovers class-specific circuits for classification, identifies circuits underlying typographic attacks in CLIP, and discovers circuits that lend themselves for steering to correct harmful model behavior. Overall, we find that insightful and actionable edge-based circuits can be recovered from vision transformers, adding transparency to the internal computations of these models.

[AI-88] Evo-MedAgent : Beyond One-Shot Diagnosis with Agents That Remember Reflect and Improve

【Quick Read】: This paper addresses the lack of cross-case learning in current tool-augmented large language model (TLLM) medical agents for chest X-ray interpretation: these agents cannot accumulate experience, correct recurring reasoning errors, or adapt their tool-use behavior without expensive reinforcement learning. The key to the solution is Evo-MedAgent, a self-evolving memory module with three complementary stores: (1) Retrospective Clinical Episodes, which retrieve problem-solving experience from similar past cases; (2) an Adaptive Procedural Heuristics bank, whose diagnostic priority rules evolve through reflection; and (3) a Tool Reliability Controller, which continuously tracks the trustworthiness of each tool. This design enables cross-case learning at test time without retraining, requiring only one additional retrieval pass and a single reflection call, and can be deployed on top of any frozen model.

Link: https://arxiv.org/abs/2604.14475
Authors: Weixiang Shen, Bailiang Jian, Jun Li, Che Liu, Johannes Moll, Xiaobin Hu, Daniel Rueckert, Hongwei Bran Li, Jiazhen Pan
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Tool-augmented large language model (LLM) agents can orchestrate specialist classifiers, segmentation models, and visual question-answering modules to interpret chest X-rays. However, these agents still solve each case in isolation: they fail to accumulate experience across cases, correct recurrent reasoning mistakes, or adapt their tool-use behavior without expensive reinforcement learning. While a radiologist naturally improves with every case, current agents remain static. In this work, we propose Evo-MedAgent, a self-evolving memory module that equips a medical agent with the capacity for inter-case learning at test time. Our memory comprises three complementary stores: (1) Retrospective Clinical Episodes that retrieve problem-solving experiences from similar past cases, (2) an Adaptive Procedural Heuristics bank curating priority-tagged diagnostic rules that evolves via reflection, much like a physician refining their internal criteria, and (3) a Tool Reliability Controller that tracks per-tool trustworthiness. On ChestAgentBench, Evo-MedAgent raises multiple-choice question (MCQ) accuracy from 0.68 to 0.79 on GPT-5-mini, and from 0.76 to 0.87 on Gemini-3 Flash. With a strong base model, evolving memory improves performance more effectively than orchestrating external tools on qualitative diagnostic tasks. Because Evo-MedAgent requires no training, its per-case overhead is bounded by one additional retrieval pass and a single reflection call, making it deployable on top of any frozen model.

[AI-89] Response-Aware User Memory Selection for LLM Personalization

【Quick Read】: This paper addresses imprecise user-memory selection in large language model (LLM) personalization. Existing methods select memory items mainly by semantic similarity between the items and the input query, ignoring how those items actually affect the model's output distribution. The key to the solution is Response-Utility optimization for Memory Selection (RUMS), which measures the mutual information between a memory subset and the model's outputs to identify items that reduce response uncertainty and sharpen predictions, yielding personalized generation that aligns more closely with human intuition while being more computationally efficient.

Link: https://arxiv.org/abs/2604.14473
Authors: Jillian Fisher, Jennifer Neville, Chan Young Park
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Code at: this https URL

Click to view abstract

Abstract:A common approach to personalization in large language models (LLMs) is to incorporate a subset of the user memory into the prompt at inference time to guide the model’s generation. Existing methods select these subsets primarily using similarity between user memory items and input queries, ignoring how features actually affect the model’s response distribution. We propose Response-Utility optimization for Memory Selection (RUMS), a novel method that selects user memory items by measuring the mutual information between a subset of memory and the model’s outputs, identifying items that reduce response uncertainty and sharpen predictions beyond semantic similarity. We demonstrate that this information-theoretic foundation enables more principled user memory selection that aligns more closely with human selection compared to state-of-the-art methods, and models 400× larger. Additionally, we show that memory items selected using RUMS result in better response quality compared to existing approaches, while having up to 95% reduction in computational cost.
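The mutual-information criterion can be made concrete with a toy discrete sketch (ours, not RUMS itself): treat "which memory item is in context" as a uniform random variable M and score I(M; Y) = H(Y) − H(Y|M) over the model's response distribution Y. Items whose presence actually moves the response distribution score high; interchangeable items score zero regardless of their semantic similarity to the query:

```python
import math

def entropy(p):
    # Shannon entropy in bits; 0*log(0) taken as 0.
    return -sum(x * math.log2(x) for x in p if x > 0)

def memory_utility(response_dists):
    # Toy RUMS-style score: I(M; Y) = H(Y) - H(Y | M), where M indexes
    # which memory item is in context (uniform prior) and Y the response.
    n = len(response_dists)
    k = len(response_dists[0])
    marginal = [sum(d[j] for d in response_dists) / n for j in range(k)]
    cond = sum(entropy(d) for d in response_dists) / n
    return entropy(marginal) - cond

# Items that flip the response distribution are informative; items the
# response ignores score zero even if they look similar to the query.
informative = [[0.9, 0.1], [0.1, 0.9]]
redundant = [[0.5, 0.5], [0.5, 0.5]]
```

In the binary example, `memory_utility(informative)` is about 0.53 bits while `memory_utility(redundant)` is exactly zero, illustrating why an output-aware score can rank items differently from query similarity.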

[AI-90] Auxiliary Finite-Difference Residual-Gradient Regularization for PINNs

【Quick Read】: This paper addresses the problem that physics-informed neural networks (PINNs) are trained against a single scalar loss and therefore pay insufficient attention to specific physical quantities, especially application-relevant ones such as boundary conditions and fluxes. The key to the solution is a hybrid design: the PDE residual remains automatic-differentiation (AD) based as the main loss term, while a finite-difference (FD) penalty on gradients of the sampled residual field is added purely as an auxiliary regularizer, controlling the smoothness of the residual field without replacing the PDE residual itself. The key move is positioning the FD regularizer as an "auxiliary-only" term and spatially adapting it to the physical quantity of interest (e.g., a body-fitted shell for outer-wall heat flux); experiments show this strategy substantially improves boundary-condition accuracy and the reliability of predicted physical quantities, most notably on a three-dimensional annular heat-conduction problem.

Link: https://arxiv.org/abs/2604.14472
Authors: Stavros Kassinos
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph)
Comments: 18 pages, 5 figures, 10 tables

Click to view abstract

Abstract:Physics-informed neural networks (PINNs) are often selected by a single scalar loss even when the quantity of interest is more specific. We study a hybrid design in which the governing PDE residual remains automatic-differentiation (AD) based, while finite differences (FD) appear only in a weak auxiliary term that penalizes gradients of the sampled residual field. The FD term regularizes the residual field without replacing the PDE residual itself. We examine this idea in two stages. Stage 1 is a controlled Poisson benchmark comparing a baseline PINN, the FD residual-gradient regularizer, and a matched AD residual-gradient baseline. Stage 2 transfers the same logic to a three-dimensional annular heat-conduction benchmark (PINN3D), where baseline errors concentrate near a wavy outer wall and the auxiliary grid is implemented as a body-fitted shell adjacent to the wall. In Stage 1, the FD regularizer reproduces the main effect of residual-gradient control while exposing a trade-off between field accuracy and residual cleanliness. In Stage 2, the shell regularizer improves the application-facing quantities, namely outer-wall flux and boundary-condition behavior. Across seeds 0-5 and 100k epochs, the most reliable tested configuration is a fixed shell weight of 5e-4 under the Kourkoutas-beta optimizer regime: relative to a matched run without the shell term, it reduces the mean outer-wall BC RMSE from 1.22e-2 to 9.29e-4 and the mean wall-flux RMSE from 9.21e-3 to 9.63e-4. Adam with beta2=0.999 becomes usable when the initial learning rate is reduced to 1e-3, although its shell benefit is less robust than under Kourkoutas-beta. Overall, the results support a targeted view of hybrid PINNs: an auxiliary-only FD regularizer is most valuable when it is aligned with the physical quantity of interest, here the outer-wall flux. 

[AI-91] Improving Human Performance with Value-Aware Interventions: A Case Study in Chess

【Quick Read】: This paper addresses the central challenge of deciding when and how an AI assistant should intervene when helping humans in sequential decision-making tasks. Conventional approaches recommend the optimal action under a strong model but ignore that the human may fail to execute the optimal follow-up policy, which can degrade overall performance. The key to the solution is "value-aware interventions", grounded in the Bellman equation from reinforcement learning: when the human policy deviates from the optimal one, inconsistencies arise between the policy and the value function, and these inconsistencies can be quantified to identify intervention opportunities. The problem is formalized as a Markov decision process with an intervention budget; with a single intervention, the optimal strategy is to recommend the action that maximizes the human value function, while for multiple interventions a tractable approximation ranks interventions by the magnitude of the policy-value discrepancy. In chess, the approach consistently outperforms interventions based on the strongest engine (Stockfish) in simulation, and a human study shows it significantly improves low- and mid-skill players while matching expert-engine interventions for high-skill players.

Link: https://arxiv.org/abs/2604.14465
Authors: Saumik Narayanan, Raja Panjwani, Siddhartha Sen, Chien-Ju Ho
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:AI systems are increasingly used to assist humans in sequential decision-making tasks, yet determining when and how an AI assistant should intervene remains a fundamental challenge. A potential baseline is to recommend the optimal action according to a strong model. However, such actions assume optimal follow-up actions, which human decision makers may fail to execute, potentially reducing overall performance. In this work, we propose and study value-aware interventions, motivated by a basic principle in reinforcement learning: under the Bellman equation, the optimal policy selects actions that maximize the immediate reward plus the value function. When a decision maker follows a suboptimal policy, this policy-value consistency no longer holds, creating discrepancies between the actions taken by the policy and those that maximize the immediate reward plus the value of the next state. We show that these policy-value inconsistencies naturally identify opportunities for intervention. We formalize this problem in a Markov decision process where an AI assistant may override human actions under an intervention budget. In the single-intervention regime, we show that the optimal strategy is to recommend the action that maximizes the human value function. For settings with multiple interventions, we propose a tractable approximation that prioritizes interventions based on the magnitude of the policy-value discrepancy. We evaluate these ideas in the domain of chess by learning models of humans from large-scale gameplay data. In simulation, our approach consistently outperforms interventions based on the strongest chess engine (Stockfish) in a wide range of settings. A within-subject human study with 20 players and 600 games further shows that our interventions significantly improve performance for low- and mid-skill players while matching expert-engine interventions for high-skill players.
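The policy-value discrepancy driving these interventions can be sketched on a four-state chain MDP (a toy of ours, far simpler than the learned chess models): evaluate the human policy to get V^h, then flag states where some action beats the human's own action under r + γ·V^h:

```python
# Toy chain MDP: states 0..3; moving "right" into the last state pays
# reward 1. A suboptimal "human" policy stays forever at state 1, and
# the policy-value discrepancy flags state 1 as the intervention point.

GAMMA = 0.9
N = 4

def step(s, a):
    if s == N - 1:                      # terminal: absorbing, no reward
        return s, 0.0
    if a == "right":
        s2 = s + 1
        return s2, 1.0 if s2 == N - 1 else 0.0
    return s, 0.0                       # "stay"

def human_action(s):
    return "stay" if s == 1 else "right"

def evaluate_policy(policy, iters=500):
    V = [0.0] * N
    for _ in range(iters):
        for s in range(N):
            s2, r = step(s, policy(s))
            V[s] = r + GAMMA * V[s2]
    return V

def policy_value_discrepancy(V):
    # delta(s) = max_a [r + gamma*V(s')] - (r + gamma*V(s')) under the
    # human's own action; a large delta marks an intervention opportunity.
    delta = []
    for s in range(N):
        q = {}
        for a in ("right", "stay"):
            s2, r = step(s, a)
            q[a] = r + GAMMA * V[s2]
        delta.append(max(q.values()) - q[human_action(s)])
    return delta
```

Here the discrepancy is zero everywhere except at state 1, where the human's "stay" habit forfeits the discounted reward, so a budget of one intervention would be spent exactly there.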

[AI-92] AIBuildAI: An AI Agent for Automatically Building AI Models

【Quick Read】: This paper addresses the heavy reliance of current AI model development on manual expert effort across the full lifecycle, from architecture design and feature engineering to training implementation and performance tuning, whereas existing AutoML methods handle only narrow subtasks such as hyperparameter optimization and fall short of end-to-end automation. The key to the solution is AIBuildAI, an AI agent with a hierarchical architecture in which a manager agent coordinates three specialized sub-agents: a designer for modeling strategy, a coder for implementation and debugging, and a tuner for training and performance optimization. Each sub-agent is itself an LLM-based agent capable of multi-step reasoning and tool use, automating the entire pipeline from task description and training data to a deployable AI model, well beyond the scope of traditional AutoML.

Link: https://arxiv.org/abs/2604.14455
Authors: Ruiyi Zhang, Peijia Qin, Qi Cao, Li Zhang, Pengtao Xie
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:AI models underpin modern intelligent systems, driving advances across science, medicine, finance, and technology. Yet developing high-performing AI models remains a labor-intensive process that requires expert practitioners to iteratively design architectures, engineer representations, implement training pipelines and refine approaches through empirical evaluation. Existing AutoML methods partially alleviate this burden but remain limited to narrow aspects such as hyperparameter optimization and model selection within predefined search spaces, leaving the full development lifecycle largely dependent on human expertise. To address this gap, we introduce AIBuildAI, an AI agent that automatically builds AI models from a task description and training data. AIBuildAI adopts a hierarchical agent architecture in which a manager agent coordinates three specialized sub-agents: a designer for modeling strategy, a coder for implementation and debugging, and a tuner for training and performance optimization. Each sub-agent is itself a large language model (LLM) based agent capable of multi-step reasoning and tool use, enabling end-to-end automation of the AI model development process that goes beyond the scope of existing AutoML approaches. We evaluate AIBuildAI on MLE-Bench, a benchmark of realistic Kaggle-style AI development tasks spanning visual, textual, time-series and tabular modalities. AIBuildAI ranks first on MLE-Bench with a medal rate of 63.1%, outperforming all existing baseline methods and matching the capability of highly experienced AI engineers. These results demonstrate that hierarchical agent systems can automate the full AI model development process from task specification to deployable model, suggesting a pathway toward broadly accessible AI development with minimal human intervention.

[AI-93] Robustness Analysis of Machine Learning Models for IoT Intrusion Detection Under Data Poisoning Attacks

【Quick Read】: This paper addresses the degraded reliability of machine-learning-based intrusion detection systems (IDS) in Internet of Things (IoT) environments under data poisoning attacks. Evaluating the susceptibility of four mainstream classifiers (Random Forest, Gradient Boosting Machine, Logistic Regression, and Deep Neural Networks) to multiple poisoning strategies on three real-world IoT datasets, the study finds that ensemble models remain comparatively stable, whereas Logistic Regression and Deep Neural Networks can degrade by up to 40% under label manipulation and outlier-based attacks, distorting decision boundaries, reducing detection fidelity, and undermining deployment readiness. The key to the solution is introducing adversarially robust training, continuous anomaly monitoring, and feature-level validation, and incorporating resilience testing into regulatory and compliance frameworks for AI-driven IoT security, so as to build more adaptive, attack-resistant intrusion detection pipelines.

Link: https://arxiv.org/abs/2604.14444
Authors: Fortunatus Aabangbio Wulnye, Justice Owusu Agyemang, Kwame Opuni-Boachie Obour Agyekum, Kwame Agyeman-Prempeh Agyekum, Kingsford Sarkodie Obeng Kwakye, Francisca Adomaa Acheampong
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Ensuring the reliability of machine learning-based intrusion detection systems remains a critical challenge in Internet of Things (IoT) environments, particularly as data poisoning attacks increasingly threaten the integrity of model training pipelines. This study evaluates the susceptibility of four widely used classifiers, Random Forest, Gradient Boosting Machine, Logistic Regression, and Deep Neural Network models, against multiple poisoning strategies using three real-world IoT datasets. Results show that while ensemble-based models exhibit comparatively stable performance, Logistic Regression and Deep Neural Networks suffer degradation of up to 40% under label manipulation and outlier-based attacks. Such disruptions significantly distort decision boundaries, reduce detection fidelity, and undermine deployment readiness. The findings highlight the need for adversarially robust training, continuous anomaly monitoring, and feature-level validation within operational Network Intrusion Detection Systems. The study also emphasizes the importance of integrating resilience testing into regulatory and compliance frameworks for AI-driven IoT security. Overall, this work provides an empirical foundation for developing more resilient intrusion detection pipelines and informs future research on adaptive, attack-aware models capable of maintaining reliability under adversarial IoT conditions.

[AI-94] On Tackling Complex Tasks with Reward Machines and Signal Temporal Logics

【Quick Read】: This paper addresses the difficulty of designing reward functions for reinforcement-learning (RL) control of complex tasks, and of making the learned behavior satisfy formal specifications. The key to the solution is combining Reward Machines (RM) with Signal Temporal Logic (STL): STL formulas explicitly model the task's temporal constraints and event triggers, giving an efficient representation of the reward structure and guiding RL training toward behaviors that satisfy the specified logical requirements. The framework also integrates STL online monitoring algorithms, improving its practicality and real-time operation.

Link: https://arxiv.org/abs/2604.14440
Authors: Ana María Gómez Ruiz (UGA), Thao Dang (VERIMAG - IMAG, CNRS, UGA), Alexandre Donzé
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We propose a Reinforcement Learning (RL) based control design framework for handling complex tasks. The approach extends the concept of Reward Machines (RM) with Signal Temporal Logic (STL) formulas that can be used for event generation. The use of STL allows not only a more efficient representation of rewards for complex tasks but also guiding the training process to converge towards behaviors satisfying specified requirements. We also propose an implementation of the framework that leverages the STL online monitoring algorithms. We illustrate the framework with three case studies (minigrid, cart-pole and high-way environments) with non-trivial tasks.
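A minimal sketch of the combination (with syntax of our own invention; the paper's framework is richer): STL-style atomic predicates over the observed signal generate events, the events drive a reward machine's transitions, and the quantitative robustness of a formula like G(x < c) is the worst-case margin along the trace:

```python
# Sketch: STL-style predicates produce events, events drive a reward
# machine, and robustness measures the satisfaction margin of a formula.

def always_below_robustness(xs, c):
    # Quantitative robustness of G(x < c) over a finite trace:
    # positive iff every sample satisfies the predicate, with margin.
    return min(c - x for x in xs)

class RewardMachine:
    def __init__(self, transitions, start):
        # transitions: (state, event) -> (next_state, reward)
        self.transitions, self.state = transitions, start

    def step(self, events):
        reward = 0.0
        for e in events:
            if (self.state, e) in self.transitions:
                self.state, r = self.transitions[(self.state, e)]
                reward += r
        return reward

rm = RewardMachine(
    {("u0", "at_A"): ("u1", 0.0),    # subgoal: reach region A first,
     ("u1", "at_B"): ("u2", 1.0)},   # then region B for the task reward
    start="u0",
)

def events_from_position(x):
    # Event generation from atomic predicates on a 1-D signal.
    evs = []
    if abs(x - 1.0) < 0.1:
        evs.append("at_A")
    if abs(x - 2.0) < 0.1:
        evs.append("at_B")
    return evs

trajectory = [0.0, 0.5, 1.0, 1.5, 2.0]
total = sum(rm.step(events_from_position(x)) for x in trajectory)
```

The reward machine pays out only when the subgoals occur in the required order, which is the structured, temporally extended reward an RL agent would be trained against; an online STL monitor plays the role of `events_from_position` on real signals.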

[AI-95] LLM s taking shortcuts in test generation: A study with SAP HANA and LevelDB

【Quick Read】: This paper investigates whether large language models (LLMs), despite their strong performance on software-engineering tasks, rely on shallow heuristics rather than genuine reasoning, focusing on whether their behavior in automated test generation generalizes. The key to the solution is combining mechanism-focused assessment from cognitive science with empirical software-testing techniques, using mutation scores and iterative compiler-feedback repair loops to evaluate both accuracy and the underlying reasoning strategy. Experiments show that LLMs perform well on familiar open-source benchmarks but, on an unseen complex commercial system (SAP HANA), often prioritize compilability over semantic correctness, exposing the lack of robust reasoning in current models and underscoring the need for evaluation frameworks that penalize superficial shortcuts and reward true generalization.

Link: https://arxiv.org/abs/2604.14437
Authors: Vekil Bekmyradov, Noah C. Pütz, Thomas Bartz-Beielstein
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have achieved impressive results on public benchmarks, often leading to claims of advanced reasoning and understanding. However, recent research in cognitive science reveals that these models sometimes rely on shallow heuristics and memorization, taking shortcuts rather than demonstrating genuine cognitive abilities. This paper investigates LLM behavior in automated test generation for software, contrasting performance on an open-source system (LevelDB) with SAP HANA, one of the most widely deployed commercial database systems worldwide, whose proprietary codebase is guaranteed to be absent from training data. We combine cognitive evaluation principles, drawing on Mitchell’s mechanism-focused assessment methodology, with empirical software testing, employing mutation score and iterative compiler-feedback repair loops to assess both accuracy and underlying reasoning strategies. Results show that LLMs excel on familiar, open-source benchmarks but struggle with unseen, complex domains, often prioritizing compilability over semantic effectiveness. These findings provide independent software engineering evidence for the broader claim that current LLMs lack robust reasoning, and highlight the need for evaluation frameworks that penalize trivial shortcuts and reward true generalization.

[AI-96] Geometric Routing Enables Causal Expert Control in Mixture of Experts

【速读】:该论文旨在解决稀疏混合专家(Sparse Mixture-of-Experts, MoE)模型中专家专业化(expert specialization)机制不透明的问题,即尽管MoE模型在保持每token计算量不变的前提下可扩展参数规模,但个体专家的功能分工仍缺乏可解释性。解决方案的关键在于:通过构造具有rank-1结构的专家,并采用余弦相似度路由(cosine-similarity routing)在低维度度量空间中实现专家功能的直接可视化与因果验证。具体而言,研究发现专家输出向量经未嵌入矩阵投影后形成语义词典(Semantic Dictionary),其中15%的专家为单义专精(monosemantic),涵盖时序、地理、基数等10类语义范畴;进一步地,通过因果干预实验表明,引导至特定专家中心点可显著提升对应语义类别概率(如时序类提升+321%),且效果在层间可叠加,从而证明专家层级的专业化是可控制、可解释且具因果意义的推理原语(interpretability primitive)。

链接: https://arxiv.org/abs/2604.14434
作者: Ivan Ternovtsii,Yurii Bilak
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sparse Mixture-of-Experts (MoE) models scale parameters while fixing active computation per token, but the specialization of individual experts remains opaque. In a companion paper we showed that routing topology is quality-neutral: five structurally different configurations converge to statistically equivalent language modeling quality. Here we show that expert identity is nonetheless causally meaningful: individual rank-1 experts are monosemantic by construction, and cosine-similarity routing in a low-dimensional metric space makes their specialization directly inspectable. We present four lines of evidence. First, projecting expert output vectors through the unembedding matrix yields a Semantic Dictionary: 15% of experts are monosemantic specialists spanning 10 categories (temporal, geographic, cardinal, discourse, emotional, financial, military, scientific). Second, routing exhibits a frequency-to-syntax gradient: early layers separate tokens by word frequency, deeper layers by syntactic class (Zipf-confound controls, all p < 0.001). Third, causal interventions confirm these labels: steering toward a temporal expert’s centroid increases P(temporal) by +321% (median across 44 prompts); suppressing a geographic expert drops P(geographic) by -23%; rewriting an expert’s output vector halves target-category probability, and effects compose additively across layers. Fourth, the interventions are not unique to cosine routing: linear routers support comparable steering, but only cosine routing provides geometric transparency – expert specialization is readable directly from the centroid matrix. MoE expert-level specialization is a first-class interpretability primitive: architecturally monosemantic, causally validated, and controllable at inference with zero overhead.
Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2604.14434 [cs.AI] (or arXiv:2604.14434v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.14434
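论文所述的余弦相似度路由与质心引导干预,可用如下 Python 草图示意(假设性的二维路由空间与专家质心,非论文代码):

```python
# 示意:每个专家在低维路由空间中有一个质心,token 被路由到
# 余弦相似度最高的专家;"引导"即把路由表示推向某个质心。
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

centroids = {  # 虚构的二维路由空间中的专家质心
    "temporal":   [1.0, 0.0],
    "geographic": [0.0, 1.0],
}

def route(token_repr):
    return max(centroids, key=lambda name: cosine(token_repr, centroids[name]))

def steer(token_repr, expert, alpha=0.5):
    """因果干预的简化版:把路由表示朝目标专家质心插值。"""
    c = centroids[expert]
    return [(1 - alpha) * x + alpha * ci for x, ci in zip(token_repr, c)]

tok = [0.2, 0.9]                              # 本应路由到 "geographic"
steered = steer(tok, "temporal", alpha=0.9)   # 干预后更接近 "temporal"
```

论文强调余弦路由的几何透明性:专家专精可直接从质心矩阵读出,此处的 `route`/`steer` 即对应这一可读性。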

[AI-97] Demonstration of Pneuma-Seeker: Agentic System for Reifying and Fulfilling Information Needs on Tabular Data

【速读】:该论文旨在解决数据分析师在处理关系型数据时,面对模糊或未明确表述的信息需求难以有效迭代探索的问题。传统方法往往缺乏对用户意图的显式建模与可追溯性,导致分析过程不透明、难以调整。解决方案的关键在于提出Pneuma-Seeker系统,通过将用户的信息需求形式化为显式的、可检查的关系型规范(relational specifications),实现信息需求的迭代细化、针对性的数据发现以及溯源感知的执行机制;同时,该系统利用大语言模型(LLM)作为透明且交互式的分析协作者,而非黑箱式答案生成器,从而增强分析过程的可控性和可解释性。

链接: https://arxiv.org/abs/2604.14422
作者: Muhammad Imam Luthfi Balaka,Raul Castro Fernandez
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACM CAIS 2026 (Demo)

点击查看摘要

Abstract:Data analysts working with relational data often start with vague or underspecified questions and refine them iteratively as they explore the data. To support this iterative process, we demonstrate Pneuma-Seeker, a system that reifies a user’s information need as explicit, inspectable relational specifications, enabling iterative refinement of the information need, targeted data discovery, and provenance-aware execution. Through two real-world procurement use cases, we show how Pneuma-Seeker leverages LLMs as transparent, interactive analytical collaborators rather than opaque answer engines.

[AI-98] Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality

【速读】:该论文旨在解决稀疏混合专家(Mixture-of-Experts, MoE)架构中路由拓扑结构是否决定语言建模质量这一关键问题。以往研究普遍认为复杂的路由机制(如多跳路径、token依赖门控)对性能提升至关重要,但本文通过系统性实验发现,路由拓扑本身并非决定渐近困惑度(Perplexity, PPL)的核心因素:在控制变量条件下,五种基于余弦相似度的路由变体在PPL上统计等效(TOST检验p > 0.05),且与哈希、随机固定和top-1路由相比仅产生微小性能下降(1.1–2.2 PPL)。其解决方案的关键在于提出一种几何MoE(ST-MoE),采用低维空间(d_space = 64)中的余弦相似度路由替代传统线性路由,显著减少80%的路由参数量;同时揭示出“收敛冗余”机制——多跳更新方向高度共线(cos(Δh₀, Δh₁) = 0.805),本质为幅度放大而非组合推理,单个可学习标量即可复现多跳效果,从而证明路由结构的差异性不构成性能瓶颈。

链接: https://arxiv.org/abs/2604.14419
作者: Ivan Ternovtsii,Yurii Bilak
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sparse Mixture-of-Experts (MoE) architectures employ increasingly sophisticated routing mechanisms – learned routers, multi-hop trajectories, token-dependent gating. We ask: does routing topology actually determine language modeling quality? We build a geometric MoE (ST-MoE) using cosine-similarity routing against learned centroids in a low-dimensional space ( d_space = 64 ), requiring 80% fewer routing parameters than standard linear routers. Through 62 controlled experiments on WikiText-103 at 76–84M parameters trained to convergence (50K steps, 1.64B tokens), we find that routing topology does not determine asymptotic perplexity (PPL): five cosine-routing variants are statistically equivalent within a 1-PPL margin (Two One-Sided Tests [TOST], p > 0.05 for all 10 pairwise comparisons; 15 runs across 3 seeds, observed range 33.93–34.72). The finding extends to hash, random-fixed, and top-1 routing (single-seed; graceful 1.1–2.2 PPL degradation) and replicates on OpenWebText (0.03 PPL gap, 6 runs, 3 seeds each). A standard linear router with 5.3 \times more routing parameters reaches PPL 32.76, but iso-parameter cosine routing closes 67% of this gap – the true mechanism advantage is \sim 1.2%. The mechanistic explanation is convergent redundancy: multi-hop updates are collinear ( \cos(\Delta h_0, \Delta h_1) = 0.805 ), implementing magnitude amplification rather than compositional reasoning; a single learnable scalar replicates multi-hop performance. As a practical payoff, zero-shot relative-norm halting saves 25% of MoE FLOPs at +0.12% PPL. Expert-level specialization and causal controllability – which coexist with topology-level equifinality – are explored in a companion paper.
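摘要中用于判定统计等效的 TOST(双单侧检验)流程,可用正态近似写成如下 Python 草图(数值与显著性水平均为示意,非论文实现):

```python
# 示意:TOST 等效性检验。原假设为 |diff| >= margin(不等效),
# 两个单侧检验均拒绝时判定为"在 margin 内统计等效"。
import math

def phi(z):
    """标准正态分布 CDF,用误差函数实现。"""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tost(diff, se, margin, alpha=0.05):
    p_lower = 1.0 - phi((diff + margin) / se)  # H0: diff <= -margin
    p_upper = phi((diff - margin) / se)        # H0: diff >= +margin
    return max(p_lower, p_upper) < alpha

# 虚构数值:PPL 差 0.2、标准误 0.25,对 1-PPL 等效边界做检验
equivalent = tost(diff=0.2, se=0.25, margin=1.0)
```

注意 TOST 与普通显著性检验方向相反:这里 p 值小意味着"差异被边界夹住",即支持等效结论。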

[AI-99] Credo: Declarative Control of LLM Pipelines via Beliefs and Policies

【速读】:该论文旨在解决当前基于指令的智能体(Agentic AI)系统在长期状态决策中存在行为不透明、脆弱且难以验证的问题,这些问题源于现有框架依赖命令式控制循环、短暂记忆和嵌入提示的逻辑。其解决方案的关键在于提出Credo系统,通过将语义状态表示为信念(belief),并使用定义在这些信念上的声明式策略(declarative policy)来调控行为,从而实现可适应、可审计和可组合的执行;该设计依托数据库支持的语义控制平面,使关键执行选择(如模型选取、检索或纠错重执行)能够通过声明方式动态调整,而无需修改底层流水线代码。

链接: https://arxiv.org/abs/2604.14401
作者: Duo Lu,Andrew Crotty,Uğur Çetintemel
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Agentic AI systems are becoming commonplace in domains that require long-lived, stateful decision-making in continuously evolving conditions. As such, correctness depends not only on the output of individual model calls, but also on how to best adapt when incorporating new evidence or revising prior conclusions. However, existing frameworks rely on imperative control loops, ephemeral memory, and prompt-embedded logic, making agent behavior opaque, brittle, and difficult to verify. This paper introduces Credo, which represents semantic state as beliefs and regulates behavior using declarative policies defined over these beliefs. This design supports adaptive, auditable, and composable execution through a database-backed semantic control plane. We showcase these concepts in a decision-control scenario, where beliefs and policies declaratively guide critical execution choices (e.g., model selection, retrieval, corrective re-execution), enabling dynamic behavior without requiring any changes to the underlying pipeline code.

[AI-100] SpaceMind: A Modular and Self-Evolving Embodied Vision-Language Agent Framework for Autonomous On-orbit Servicing

【速读】:该论文旨在解决空间在轨服务(On-orbit Servicing)中对自主代理的高要求问题,即如何构建具备视觉感知、三维空间推理和多阶段任务执行能力的智能体,以应对复杂且动态的轨道环境。解决方案的关键在于提出SpaceMind框架,其核心创新包括:(1)将知识、工具与推理解耦为三个可独立扩展的维度——技能模块通过动态路由实现灵活调度,Model Context Protocol (MCP) 工具支持可配置配置文件,推理模式技能可注入式更新;(2)引入MCP-Redis接口层,使同一代码库无需修改即可在仿真(UE5)与物理硬件之间无缝迁移;(3)设计技能自进化机制,无需模型微调即可从运行经验中提炼持久技能文件,显著提升系统鲁棒性和适应性。实验证明该方案在多种场景下均具备高成功率,并能在退化条件下保持任务完成能力,同时实现零代码修改的现实世界部署。

链接: https://arxiv.org/abs/2604.14399
作者: Aodi Wu,Haodong Han,Xubo Luo,Ruisuo Wang,Shan He,Xue Wan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 23 pages, 6 figures, 7 tables. Code available at this https URL

点击查看摘要

Abstract:Autonomous on-orbit servicing demands embodied agents that perceive through visual sensors, reason about 3D spatial situations, and execute multi-phase tasks over extended horizons. We present SpaceMind, a modular and self-evolving vision-language model (VLM) agent framework that decomposes knowledge, tools, and reasoning into three independently extensible dimensions: skill modules with dynamic routing, Model Context Protocol (MCP) tools with configurable profiles, and injectable reasoning-mode skills. An MCP-Redis interface layer enables the same codebase to operate across simulation and physical hardware without modification, and a Skill Self-Evolution mechanism distills operational experience into persistent skill files without model fine-tuning. We validate SpaceMind through 192 closed-loop runs across five satellites, three task types, and two environments, a UE5 simulation and a physical laboratory, deliberately including degraded conditions to stress-test robustness. Under nominal conditions all modes achieve 90–100% navigation success; under degradation, the Prospective mode uniquely succeeds in search-and-approach tasks where other modes fail. A self-evolution study shows that the agent recovers from failure in four of six groups from a single failed episode, including complete failure to 100% success and inspection scores improving from 12 to 59 out of 100. Real-world validation confirms zero-code-modification transfer to a physical robot with 100% rendezvous success. Code: this https URL

[AI-101] Coalition Formation in LLM Agent Networks: Stability Analysis and Convergence Guarantees

【速读】:该论文旨在解决多智能体大语言模型(LLM)系统中联盟形成(coalition formation)的理论建模与稳定性保障问题,尤其针对 n 个智能体动态组建合作群体时缺乏形式化分析框架的问题。其解决方案的关键在于提出首个基于 hedonic game theory(享乐博弈理论)的 LLM 联盟形成游戏(LLM Coalition Formation Game, LCFG),并建立 Nash 稳定划分的充分条件及复杂度结果;同时识别出 LLM 智能体表现出 ε-有界理性(ε-rational preferences),通过一致性驱动的稳定性边界预测实证结果,最终在 GPT-4、Claude-3 和 Llama-3 上验证了所提 Coalition-of-Thought(CoalT)协议下联盟达到 Nash 稳定的比例显著高于链式思维(chain-of-thought)和标准提示(p < 0.001)。

链接: https://arxiv.org/abs/2604.14386
作者: Dongxin Guo,Jikun Wu,Siu-Ming Yiu
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注: 15 pages including supplementary material, 2 figures, 5 tables

点击查看摘要

Abstract:Large Language Model (LLM) agents are increasingly deployed in multi-agent systems requiring strategic coordination. While recent work has analyzed LLM behavior in two-player games, coalition formation, where n agents dynamically form cooperative groups, remains theoretically uncharacterized. We present the first framework grounding coalition formation in LLM agent networks in hedonic game theory with formal stability guarantees. We introduce the LLM Coalition Formation Game (LCFG), establish sufficient conditions for Nash-stable partitions, and prove complexity results. Our analysis reveals that LLM agents exhibit bounded rationality characterized by \epsilon -rational preferences; we provide both deterministic existence guarantees and consistency-driven stability bounds whose predictions are consistent with empirical outcomes. Experiments with GPT-4, Claude-3, and Llama-3 across 2,400 episodes validate our framework: LLM coalitions achieve Nash stability in 73.2% of cases under our Coalition-of-Thought (CoalT) protocol, compared to 58.4% under chain-of-thought and 41.8% under standard prompting (p < 0.001). Our framework provides theoretical foundations for designing stable multi-agent LLM systems.
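享乐博弈中 Nash 稳定划分的判定可以用一个小例子说明(以下为可加可分效用的玩具设定,数值为虚构,与论文的 LCFG 形式化无直接对应):

```python
# 示意:可加可分享乐博弈中,智能体 i 对联盟的效用是与其余成员
# 两两价值之和;Nash 稳定 = 没有智能体能靠单方面改投获益。

values = {  # 智能体 0..3 之间的对称两两价值(虚构)
    (0, 1): 2, (0, 2): -1, (0, 3): 0,
    (1, 2): 1, (1, 3): -2,
    (2, 3): 3,
}

def v(i, j):
    return values.get((min(i, j), max(i, j)), 0)

def utility(i, coalition):
    return sum(v(i, j) for j in coalition if j != i)

def is_nash_stable(partition, agents):
    for i in agents:
        home = next(c for c in partition if i in c)
        current = utility(i, home)
        # 可行偏离:加入任一其他联盟,或独自成组
        options = [c | {i} for c in partition if c is not home] + [{i}]
        if any(utility(i, opt) > current for opt in options):
            return False
    return True

agents = [0, 1, 2, 3]
stable = is_nash_stable([{0, 1}, {2, 3}], agents)   # 该划分 Nash 稳定
```

对比之下,划分 [{0, 2}, {1, 3}] 不稳定:智能体 0 独自成组(效用 0)优于留在 {0, 2}(效用 -1)。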

[AI-102] Modular Continual Learning via Zero-Leakage Reconstruction Routing and Autonomous Task Discovery

【速读】:该论文旨在解决人工神经网络在顺序任务学习中面临的灾难性遗忘(catastrophic forgetting)问题。其核心解决方案是提出一种硅基原生的模块化架构,通过任务特定专家(Task-Specific Experts)实现结构参数隔离,并引入基于分布异常值的门控机制(distributed, outlier-based Gatekeeper)来动态路由任务流。关键创新在于采用“实时蒸馏”(Live Distillation)策略,在教师学习、学生蒸馏与路由器流形获取并行进行的同时,利用紧致瓶颈自编码器(Tight-Bottleneck Autoencoder, TB-AE)有效区分高维潜在空间中的语义密集流形,克服标准变分方法固有的后验崩溃问题,从而在4096维大语言模型(LLM)嵌入空间中建立严格的拓扑边界,提供鲁棒的无监督新颖性信号,并实现无需冗余模块实例化的自主检索机制,确保终身学习的稳定性与高效性。

链接: https://arxiv.org/abs/2604.14375
作者: Noureddine Kermiche
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Catastrophic forgetting remains a primary hurdle in sequential task learning for artificial neural networks. We propose a silicon-native modular architecture that achieves structural parameter isolation using Task-Specific Experts and a distributed, outlier-based Gatekeeper. Moving beyond traditional sequential consolidation, our framework utilizes a Simultaneous Pipeline where Teacher learning, Student distillation, and Router manifold acquisition occur in parallel while raw data is present in a localized training session. This approach ensures computational efficiency and complies with privacy mandates like GDPR by deleting raw data as soon as a task is learned. We demonstrate that a Tight-Bottleneck Autoencoder (TB-AE) can effectively distinguish semantically crowded manifolds in high-dimensional latent spaces, overcoming the posterior collapse inherent to standard variational methods. By establishing strict topological boundaries, our TB-AE resolves latent space crowding in 4096-D LLM embeddings to provide a robust, unsupervised novelty signal. Furthermore, we validate an Autonomous Retrieval mechanism that confidently identifies returning manifolds, enabling stable lifelong learning without redundant module instantiation. Empirical results demonstrate that our "Live Distillation" approach acts as a natural regularizer, achieving strong retention across computer vision and natural language processing domains without suffering a student fidelity gap.
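TB-AE 以重构误差作为无监督新颖性信号的思路,可用秩-1 投影的极简替代来示意(以下用线性投影代替自编码器,方向与阈值均为假设,非论文实现):

```python
# 示意:每个任务持有一个"自编码器"(这里简化为单位方向上的秩-1 投影);
# 若所有任务的重构误差都超过阈值,则判定输入来自新任务。
import math

def reconstruct(x, direction):
    # 编码:标量系数;解码:系数 * 方向向量
    coef = sum(a * b for a, b in zip(x, direction))
    return [coef * d for d in direction]

def recon_error(x, direction):
    xh = reconstruct(x, direction)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, xh)))

task_dirs = {"taskA": [1.0, 0.0], "taskB": [0.0, 1.0]}  # 虚构的任务流形方向

def gate(x, threshold=0.5):
    errors = {t: recon_error(x, d) for t, d in task_dirs.items()}
    best = min(errors, key=errors.get)
    return best if errors[best] <= threshold else "novel"

routed = gate([0.9, 0.1])   # 贴近 taskA 的流形,被路由回该任务
novel = gate([0.7, 0.7])    # 两个流形都重构不好,触发新任务信号
```

真实 TB-AE 在 4096 维嵌入上用非线性紧致瓶颈实现同样的"高误差即新颖"门控。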

[AI-103] Tight Sample Complexity Bounds for Best-Arm Identification Under Bounded Systematic Bias

【速读】:该论文旨在解决在自主推理与具身规划中,随着搜索深度增加导致候选动作空间呈指数级膨胀、从而严重消耗计算资源的问题。其核心挑战在于:传统启发式剪枝方法在使用替代模型(如大语言模型)时,若存在系统性评估偏差(systematic evaluation bias),则缺乏形式化的安全性保障。解决方案的关键在于将节点扩展过程建模为一个受有限系统偏差 $ L $ 约束的局部最优臂识别(Best-Arm Identification, BAI)问题,并通过反演朗伯W函数(Lambert W function)推导出加性样本复杂度上界 O((Δ4L)2)\mathcal{O}((\Delta - 4L)^{-2}),表明仅当经验奖励差距 Δ\Delta 超过 4L4L 时,安全节点剔除才可实现;同时建立了信息论下界 Ω((Δ2L)2)\Omega((\Delta - 2L)^{-2}),验证了带偏置搜索的结构性限制。实证结果表明,遵循此局部安全边界可在保持最优轨迹的同时最大化采样效率。

链接: https://arxiv.org/abs/2604.14345
作者: Tianhao Qian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:As search depth increases in autonomous reasoning and embodied planning, the candidate action space expands exponentially, heavily taxing computational budgets. While heuristic pruning is a common countermeasure, it operates without formal safety guarantees when surrogate models (like LLMs) exhibit systematic evaluation biases. This paper frames the node expansion process as a localized Best-Arm Identification (BAI) problem over dynamic frontiers, subject to a bounded systematic bias L . By inverting the Lambert W function, we establish an additive sample complexity of \mathcal{O}((\Delta-4L)^{-2}) , which indicates that safe node elimination is only feasible when the empirical reward gap exceeds 4L . We complement this with an information-theoretic lower bound of \Omega((\Delta-2L)^{-2}) to confirm the structural limits of biased search. Subsequent evaluations on both synthetic trees and complex reasoning tasks demonstrate that adhering to this local safety boundary successfully preserves optimal trajectories while maximizing sample allocation efficiency.
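上界 \mathcal{O}((\Delta-4L)^{-2}) 的含义可以用一个 Hoeffding 风格的粗略估算体会:有界系统偏差 L 下,只有去偏后的间隙 (\Delta - 4L) 可用,且该间隙非正时安全剪枝不可行(以下常数为示意性选择,并非论文推导的精确常数):

```python
# 示意:每臂样本数随 1/(Delta - 4L)^2 增长;Delta - 4L <= 0 时
# 间隙被偏差吞没,安全淘汰不可行。常数为演示用,非论文常数。
import math

def samples_needed(delta_gap, bias_L, confidence=0.95):
    effective = delta_gap - 4.0 * bias_L
    if effective <= 0:
        return None  # 安全淘汰不可行
    failure = 1.0 - confidence
    return math.ceil(2.0 * math.log(2.0 / failure) / effective ** 2)

blocked = samples_needed(0.5, 0.2)   # 0.5 - 0.8 < 0:不可行
n = samples_needed(0.5, 0.05)        # 有效间隙 0.3,需 82 个样本
```

可以看到偏差的四倍放大效应:L 从 0.05 增大到 0.125 就把同样的 0.5 间隙推到不可行边界。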

[AI-104] Mistake gating leads to energy and memory efficient continual learning

【速读】:该论文旨在解决人工神经网络在训练过程中因对每个样本(包括正确分类的样本)都进行参数更新而导致的高能耗问题,这与生物神经系统通过仅在出现错误时才触发突触可塑性来节省能量的机制形成对比。解决方案的关键在于提出“记忆错误门控学习”(memorized mistake-gated learning),这是一种生物启发式的突触可塑性规则,其中权重更新严格由当前及历史分类错误信号控制,从而将网络所需更新次数减少50%~80%。该方法无需引入额外超参数,计算开销极低,且特别适用于增量学习和在线学习场景,能够显著降低存储需求并提升能效。

链接: https://arxiv.org/abs/2604.14336
作者: Aaron Pache,Mark CW van Rossum
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Synaptic plasticity is metabolically expensive, yet animals continuously update their internal models without exhausting energy reserves. However, when artificial neural networks are trained, the network parameters are typically updated on every sample that is presented, even if the sample was classified correctly. Inspired by the human negativity bias and error-related negativity, we propose ‘memorized mistake-gated learning’ – a biologically plausible plasticity rule where synaptic updates are strictly gated by current and past classification errors. This reduces the number of updates the network needs to make by 50%–80%. Mistake gating is particularly well suited in two cases: 1) For incremental learning where new knowledge is acquired on a background of pre-existing knowledge, 2) For online learning scenarios when data needs to be stored for later replay, as mistake-gating reduces storage buffer requirements. The algorithm can be implemented in a few lines of code, adds no hyper-parameters, and comes at negligible computational overhead. Learning on mistakes is an energy efficient and biologically relevant modification to commonly used learning rules that is well suited for continual learning.
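错误门控更新规则可以用感知机的几行代码体现:只有分类错误才触发权重更新,正确分类的样本不消耗任何"可塑性"(玩具数据,非论文实验设置):

```python
# 示意:错误门控学习。感知机仅在当前样本被误分类时更新权重,
# 因此可统计"更新次数"远小于"样本呈现次数"。

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def train(samples, epochs=10, lr=1.0):
    w = [0.0, 0.0, 0.0]  # 两个特征 + 偏置
    updates = 0
    for _ in range(epochs):
        for x, y in samples:
            x_b = x + [1.0]              # 拼接偏置输入
            if predict(w, x_b) != y:     # 错误门控:仅错分才更新
                w = [wi + lr * y * xi for wi, xi in zip(w, x_b)]
                updates += 1
    return w, updates

data = [([2.0, 1.0], 1), ([-1.0, -2.0], -1),
        ([1.5, 0.5], 1), ([-2.0, -0.5], -1)]
w, updates = train(data)
all_correct = all(predict(w, x + [1.0]) == y for x, y in data)
```

在这个可分的玩具数据上,40 次样本呈现只触发 1 次更新,定性地对应论文报告的 50%–80% 更新量削减。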

[AI-105] Thermodynamic Diffusion Inference with Minimal Digital Conditioning

【速读】:该论文旨在解决生成式 AI(Generative AI)中扩散模型(Diffusion Model)推理过程能耗过高问题,其核心挑战在于如何将扩散模型推理与物理系统中的过阻尼朗之万动力学(overdamped Langevin dynamics)的热力学等价性转化为可生产规模的实际应用。此前两大障碍限制了这一等价性的实现:一是非局部跳跃连接(non-local skip connections)无法由局部耦合的模拟硬件表示;二是输入条件化(input conditioning)时耦合常数信号强度不足,难以锚定系统至特定输入。解决方案的关键在于两个创新:其一,采用分层双线性耦合(hierarchical bilinear coupling),通过编码器和解码器 Gram 矩阵的奇异结构提取秩-𝑘 交互项,以 𝑂(𝐷𝑘) 物理连接替代传统 𝑂(𝐷²) 连接,高效建模 U-Net 跳跃连接;其二,引入最小数字接口——一个 4 维瓶颈编码器与 16 单元传输网络(共 2,560 参数),显著增强条件信号强度。实验表明,该系统在训练好的去噪 U-Net 激活上达到 0.9906 的解码器余弦相似度(理论上限为 1.0),同时理论上相较 GPU 推理节能约 10⁷ 倍,首次实现了训练权重、可生产规模的热力学扩散推理。

链接: https://arxiv.org/abs/2604.14332
作者: Aditi De
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion-model inference and overdamped Langevin dynamics are formally identical. A physical substrate that encodes the score function therefore equilibrates to the correct output by thermodynamics alone, requiring no digital arithmetic during inference and potentially achieving a 10,000\times reduction in energy relative to a GPU. Two fundamental barriers have until now prevented this equivalence from being realized at production scale: non-local skip connections, which locally coupled analog substrates cannot represent, and input conditioning, in which the coupling constants carry roughly 2,600\times too little signal to anchor the system to a specific input. We resolve both obstacles. Hierarchical bilinear coupling encodes U-Net skip connections as rank- k inter-module interactions derived directly from the singular structure of the encoder and decoder Gram matrices, requiring only O(Dk) physical connections instead of O(D^2) . A minimal digital interface – a 4-dimensional bottleneck encoder together with a 16-unit transfer network, totalling 2,560 parameters – overcomes the conditioning barrier. When evaluated on activations drawn from a trained denoising U-Net, the complete system attains a decoder cosine similarity of 0.9906 against an oracle upper bound of 1.0000, while preserving theoretical net energy savings of approximately 10^7\times over GPU inference. These results constitute the first demonstration of trained-weight, production-scale thermodynamic diffusion inference. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.14332 [cs.LG] (or arXiv:2604.14332v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.14332
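论文所依据的等价性,即以目标分布得分函数为漂移项的过阻尼朗之万动力学会平衡到该分布,可用一维高斯目标做纯软件验证(步长、步数与随机种子均为任意选择,非论文设置):

```python
# 示意:过阻尼朗之万采样。漂移项为目标密度 N(3, 1) 的得分函数,
# 长时间演化后样本分布应集中在均值 3 附近。
import math, random

def score(x, mu=3.0, sigma=1.0):
    return (mu - x) / sigma ** 2   # d/dx log N(x; mu, sigma^2)

def langevin_sample(steps=2000, eps=0.05, seed=0):
    rng = random.Random(seed)
    x = 0.0
    for _ in range(steps):
        # Euler-Maruyama 离散化:x += (eps/2)*score + sqrt(eps)*噪声
        x += 0.5 * eps * score(x) + math.sqrt(eps) * rng.gauss(0.0, 1.0)
    return x

samples = [langevin_sample(seed=s) for s in range(200)]
mean = sum(samples) / len(samples)   # 应接近目标均值 3.0
```

论文的主张正是:若物理基底天然遵循这类动力学且耦合编码了得分函数,上面的循环就由热力学"免费"完成。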

[AI-106] Challenges and Future Directions in Agentic Reverse Engineering Systems

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的智能体系统(Agentic Systems)在复杂二进制逆向工程(Binary Reverse Engineering, RE)任务中表现受限的问题,尤其在面对代码混淆(Obfuscation)、时序敏感性及特殊架构等现实场景时仍存在显著不足。其解决方案的关键在于系统性分析现有智能体工具在静态、动态及混合代理模式下的应用局限,识别出三大核心挑战:Token限制、对混淆技术的鲁棒性不足以及缺乏程序执行约束机制(Program Guardrails),并据此提出面向安全领域的未来系统设计方向,以提升智能体在真实复杂环境中的可靠性与有效性。

链接: https://arxiv.org/abs/2604.14317
作者: Salem Radey,Jack West,Kassem Fawaz
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 7 pages, 1 figure, accepted at SAGAI 2026

点击查看摘要

Abstract:Agentic systems built on large language models (LLMs) are increasingly being used for complex security tasks, including binary reverse engineering (RE). Despite recent growth in popularity and capability, these systems continue to face limitations in realistic settings. Cutting-edge systems still fail in complex RE scenarios that involve obfuscation, timing, and unique architecture. In this work, we examine how agentic systems perform reverse engineering tasks with static, dynamic, and hybrid agents. Through an analysis of existing agentic tool usage, we identify several limitations, including token constraints, struggles with obfuscation, and a lack of program guardrails. From these findings, we outline current challenges and position future directions for system designers to overcome from a security perspective.

[AI-107] Seeing Through Experts' Eyes: A Foundational Vision Language Model Trained on Radiologists' Gaze and Reasoning

【速读】:该论文旨在解决当前大规模视觉语言模型在胸部X光片解读中临床实用性受限的问题,即模型输出与放射科医生的诊断推理之间存在差距。现有系统多聚焦于语义信息优化,而未模拟专家对医学图像的视觉检查过程,常导致关键发现被忽略或偏离标准诊断流程。解决方案的关键在于引入GazeX——一种利用放射科医生眼动数据作为行为先验的视觉语言模型,通过在预训练中融合注视轨迹和固定点模式,使模型学习到专家注意力的空间与时间结构,并以临床上有意义的顺序整合观察结果,从而提升输出的准确性、可解释性及与专家的一致性。

链接: https://arxiv.org/abs/2604.14316
作者: Kinhei Lee,Peiyuan Jing,Zhenxuan Zhang,Yue Yang,Tao Wang,Dominic C Marshall,Yingying Fang,Guang Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large scale vision language models have shown promise in automating chest X-ray interpretation, yet their clinical utility remains limited by a gap between model outputs and radiologist reasoning. Most systems optimize for semantic information without emulating how experts visually examine medical images, often overlooking critical findings or diverging from established diagnostic workflows. Radiologists follow structured protocols (e.g., the ABCDEF approach) that ensure all clinically relevant regions are systematically examined, reducing missed findings and supporting reliable diagnostic reasoning. We introduce GazeX, a vision language model that leverages radiologists’ eye tracking data as a behavioral prior to model expert diagnostic reasoning. By incorporating gaze trajectories and fixation patterns into pretraining, GazeX learns to follow the spatial and temporal structure of radiologist attention and integrates observations in a clinically meaningful sequence. Using a curated dataset of over 30,000 gaze key frames from five radiologists, we demonstrate that GazeX produces more accurate, interpretable, and expert consistent outputs across radiology report generation, disease grounding, and visual question answering, utilizing 231,835 radiographic studies, 780,014 question answer pairs, and 1,162 image sentence pairs with bounding boxes. Unlike autonomous reporting systems, GazeX produces verifiable evidence artifacts, including inspection trajectories and finding-linked localized regions, enabling efficient human verification and safe human AI collaboration. Learning through expert eyes provides a practical route toward more trustworthy, explainable, and diagnostically robust AI systems for radiology and beyond.

[AI-108] Aerial Multi-Functional RIS in Fluid Antennas-Aided Full-Duplex Networks: A Self-Optimized Hybrid Deep Reinforcement Learning Approach

【速读】:该论文旨在解决第六代(6G)无线网络中高数据流量需求下的能量效率(Energy Efficiency, EE)优化问题,特别是在全双工(Full-Duplex, FD)通信场景下如何协同提升信号覆盖与可持续性。解决方案的关键在于提出一种融合自主飞行无人机(Autonomous Aerial Vehicles, AAVs)与多功能可重构智能表面(Multi-functional Reconfigurable Intelligent Surfaces, MF-RISs)的新型架构——即自适应智能反射面(AM-RIS),其具备信号反射、放大及能量收集(Energy Harvesting, EH)的混合功能,并结合流体天线(Fluid Antenna, FA)实现基站端精细的空间自适应能力,以增强对残留自干扰(Residual Self-Interference, SI)的抑制效果。为实现联合优化发射下行波束赋形、上行用户功率、AM-RIS配置及FA和AM-RIS位置等复杂混合连续-离散变量问题,作者设计了一种自优化多智能体混合深度强化学习框架(Self-Optimized Multi-Agent Hybrid Deep Reinforcement Learning, SOHRL),该框架融合多智能体深度Q网络(Multi-Agent DQN)与多智能体近端策略优化(Multi-Agent PPO),分别处理离散与连续动作空间,并引入注意力驱动的状态表示和元级超参数优化机制,使多智能体能自主调整学习参数,从而显著提升系统整体能量效率。

链接: https://arxiv.org/abs/2604.14309
作者: Li-Hsiang Shen,Yu-Quan Zheng
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:To address high data traffic demands of sixth-generation (6G) networks, this paper proposes a novel architecture that integrates autonomous aerial vehicles (AAVs) and multi-functional reconfigurable intelligent surfaces (MF-RISs) as AM-RIS in fluid antenna (FA)-assisted full-duplex (FD) networks. The AM-RIS provides hybrid functionalities, including signal reflection, amplification, and energy harvesting (EH), potentially improving both signal coverage and sustainability. Meanwhile, FA facilitates fine-grained spatial adaptability at FD-enabled base station (BS), which complements residual self-interference (SI) suppression. We aim at maximizing the overall energy efficiency (EE) by jointly optimizing transmit DL beamforming at BS, UL user power, configuration of AM-RIS, and positions of the FA and AM-RIS. Owing to the hybrid continuous-discrete parameters and high dimensionality of the intractable problem, we have conceived a self-optimized multi-agent hybrid deep reinforcement learning (DRL) framework (SOHRL), which integrates multi-agent deep Q-networks (DQN) and multi-agent proximal policy optimization (PPO), respectively handling discrete and continuous actions. To enhance self-adaptability, an attention-driven state representation and meta-level hyperparameter optimization are incorporated, enabling multi-agents to autonomously adjust learning hyperparameters. Simulation results validate the effectiveness of the proposed AM-RIS-enabled FA-aided FD networks empowered by SOHRL algorithm. The results reveal that SOHRL outperforms benchmarks of the case without attention mechanism and conventional hybrid/multi-agent/standalone DRL. Moreover, AM-RIS in FD achieves the highest EE compared to half-duplex, conventional rigid antenna arrays, partial EH, and conventional RIS without amplification, highlighting its potential as a compelling solution for EE-aware wireless networks.

[AI-109] Quantum-inspired tensor networks in machine learning models

【速读】:该论文旨在解决机器学习中模型复杂度高、可解释性差以及隐私保护不足等问题,其解决方案的关键在于引入张量网络(Tensor Networks)作为压缩表示工具,利用其在多体量子物理中对纠缠结构的高效刻画能力,将张量网络应用于神经网络组件的分解或作为替代学习架构,从而在保持性能的同时提升计算效率、增强模型可解释性并可能改善隐私保护。

链接: https://arxiv.org/abs/2604.14287
作者: Guillermo Valverde,Igor García-Olaizola,Giannicola Scarpa,Alejandro Pozas-Kerstjens
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注: 28 pages, 11 figures, article class. The interactive version of the graph can be found at this https URL

点击查看摘要

Abstract:Tensor networks were developed in the context of many-body physics as compressed representations of multiparticle quantum states. These representations mitigate the exponential complexity of many-body systems by capturing only the most relevant dependencies. Due to the formal similarity between quantum entanglement and statistical correlations, tensor networks have recently been integrated in machine learning, operating both as alternative learning architectures and as decompositions of components of neural networks. The expectation is that the theoretical understanding of tensor networks developed within quantum many-body physics leads to novel methods that offer advantages in terms of computational efficiency, explainability, or privacy. Here we review the use of tensor networks in the context of machine learning, providing a critical assessment of the state of the art, the potential advantages, and the challenges that must be overcome.
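张量网络压缩的核心账目可以用参数量对比直观说明:稠密存储 n 个局部维度为 d 的自由度需要 d^n 个参数,而键维为 χ 的矩阵乘积态(MPS,即张量列车)只需约 n·d·χ² 个(以下计数对应标准 MPS 形状,仅为示意,并非综述中的具体模型):

```python
# 示意:稠密张量 vs. 矩阵乘积态(MPS)的参数量对比。

def dense_params(n, d):
    """稠密存储一个 n 阶、各维为 d 的张量。"""
    return d ** n

def mps_params(n, d, chi):
    """MPS 存储:两端张量的外侧键维为 1,内部张量为 chi x d x chi。"""
    if n == 1:
        return d
    return 2 * d * chi + (n - 2) * d * chi * chi

full = dense_params(20, 2)          # 20 个量子比特:2**20 = 1,048,576
compressed = mps_params(20, 2, 8)   # 键维 8 的 MPS:2,336 个参数
```

只要真实关联结构允许小键维(即"纠缠"有限),这约 450 倍的压缩就是无损或近似无损的;这正是张量网络迁移到机器学习时所押注的结构先验。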

[AI-110] Enhancing LLM-based Search Agents via Contribution-Weighted Group Relative Policy Optimization ACL2026

【速读】:该论文旨在解决搜索代理(Search Agent)在训练过程中因监督信号不稳定性与信用分配困难而导致性能受限的问题。现有方法中,过程监督(process supervision)常因价值估计不稳定而效果不佳,而结果监督(outcome supervision)则由于稀疏的轨迹级奖励难以实现细粒度的信用分配。解决方案的关键在于提出贡献加权的组相对策略优化(Contribution-Weighted GRPO, CW-GRPO),其通过引入一个大型语言模型(LLM)判官对每轮检索效用和推理正确性进行评估,生成逐轮贡献分数,并以此重标定基于结果的收益(advantage),从而在不牺牲优化稳定性的前提下实现精细化的信用分配,显著提升了搜索代理在知识密集型任务中的表现。

链接: https://arxiv.org/abs/2604.14267
作者: Junzhe Wang,Zhiheng Xi,Yajie Yang,Hao Luo,Shihan Dou,Tao Gui,Qi Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), Main Conference

点击查看摘要

Abstract:Search agents extend Large Language Models (LLMs) beyond static parametric knowledge by enabling access to up-to-date and long-tail information unavailable during pretraining. While reinforcement learning has been widely adopted for training such agents, existing approaches face key limitations: process supervision often suffers from unstable value estimation, whereas outcome supervision struggles with credit assignment due to sparse, trajectory-level rewards. To bridge this gap, we propose Contribution-Weighted GRPO (CW-GRPO), a framework that integrates process supervision into group relative policy optimization. Instead of directly optimizing process rewards, CW-GRPO employs an LLM judge to assess the retrieval utility and reasoning correctness at each search round, producing per-round contribution scores. These scores are used to rescale outcome-based advantages along the trajectory, enabling fine-grained credit assignment without sacrificing optimization stability. Experiments on multiple knowledge-intensive benchmarks show that CW-GRPO outperforms standard GRPO by 5.0% on Qwen3-8B and 6.3% on Qwen3-1.7B, leading to more effective search behaviors. Additional analysis reveals that successful trajectories exhibit concentrated contributions across rounds, providing empirical insight into search agent tasks.
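贡献加权重标定收益的机制可用如下草图说明(组内奖励与各轮贡献分数均为虚构示例;为简洁省略了 GRPO 的标准差归一化,非论文实现):

```python
# 示意:CW-GRPO 的两步。先按组相对方式算轨迹级优势,
# 再用判官给出的逐轮贡献分数把该优势按比例分摊到各搜索轮。

def group_relative_advantage(rewards):
    mean = sum(rewards) / len(rewards)
    # GRPO 风格:相对组均值的优势(此处省略除以组内标准差)
    return [r - mean for r in rewards]

def contribution_weighted(advantage, round_scores):
    total = sum(round_scores)
    weights = [s / total for s in round_scores]
    # 各轮按判官贡献占比分得轨迹优势;乘以轮数使均值保持不变
    return [advantage * len(round_scores) * w for w in weights]

group_rewards = [1.0, 0.0, 0.0, 1.0]              # 每条轨迹的结果奖励
advs = group_relative_advantage(group_rewards)     # [0.5, -0.5, -0.5, 0.5]
per_round = contribution_weighted(advs[0], [0.8, 0.1, 0.1])
```

效果是:结果监督的稀疏信号被摊成逐轮信号,贡献大的检索轮拿到更大的梯度权重,而整条轨迹的总优势不变。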

[AI-111] Reinforcement Learning via Value Gradient Flow ICLR2026

【速读】:该论文旨在解决行为正则化强化学习(behavior-regularized reinforcement learning)中的价值过优化问题,即在离线强化学习(offline RL)或大语言模型(LLM)强化学习微调中,由于对分布外(out-of-distribution)状态的错误外推导致策略性能下降的问题。现有方法要么依赖于可重参数化的策略梯度(reparameterized policy gradient),难以扩展到大规模生成式模型;要么采用拒绝采样(reject sampling),在试图超越行为支持域时过于保守。论文提出了一种名为价值梯度流(Value Gradient Flow, VGF)的新范式,其核心创新在于将行为正则化RL建模为一个最优传输问题(optimal transport problem),通过离散梯度流求解:从参考分布(reference distribution)初始化的粒子沿价值梯度移动,从而映射到由价值诱导的最优策略分布。VGF通过控制传输预算(transport budget)隐式施加正则化,无需显式参数化策略,同时保持表达能力和灵活性,并可在测试时通过调整传输预算实现自适应缩放。实验表明,VGF在D4RL、OGBench等离线RL基准和LLM强化学习任务上均显著优于现有方法,达到当前最优性能。

链接: https://arxiv.org/abs/2604.14265
作者: Haoran Xu,Kaiwen Hu,Somayeh Sojoudi,Amy Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICLR 2026

点击查看摘要

Abstract:We study behavior-regularized reinforcement learning (RL), where regularization toward a reference distribution (the dataset in offline RL or the base model in LLM RL finetuning) is essential to prevent value over-optimization caused by erroneous out-of-distribution extrapolation. Existing methods either rely on reparameterized policy gradients, which are difficult to scale to large generative models, or on rejection sampling, which can be overly conservative when attempting to move beyond the behavior support. In this paper, we propose Value Gradient Flow (VGF), a scalable new paradigm for behavior-regularized RL. VGF casts behavior-regularized RL as an optimal transport problem that maps the reference distribution to the value-induced optimal policy distribution. We solve this transport problem via discrete gradient flow, where value gradients guide particles initialized from the reference distribution. Our analysis shows that VGF imposes regularization implicitly by controlling the transport budget. VGF eliminates explicit policy parameterization while remaining expressive and flexible; this enables adaptive test-time scaling by adjusting the transport budget. Extensive experiments demonstrate that VGF significantly outperforms prior methods, achieving state-of-the-art results on offline RL benchmarks (D4RL, OGBench) and LLM RL tasks. Code and runs can be found at this https URL.
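
The core transport loop is simple to sketch. Below is a toy 1-D illustration under assumed choices (a quadratic value function and plain gradient-ascent particle updates); the paper's actual update and budget schedule may differ:

```python
import numpy as np

def value_gradient_flow(ref_samples, value_grad, steps, step_size):
    """Discrete gradient flow: particles initialized from the reference
    distribution move along the value gradient. The transport budget
    (steps x step_size) implicitly regularizes toward the reference:
    a small budget keeps particles near the reference, a large one
    transports them toward the value-induced optimum."""
    x = np.array(ref_samples, dtype=float)
    for _ in range(steps):
        x = x + step_size * value_grad(x)
    return x

# Toy example: V(a) = -(a - 2)^2, so the value gradient is 2 * (2 - a).
grad_v = lambda a: 2.0 * (2.0 - a)
```

Varying `steps` at test time is the "adaptive test-time scaling" knob: the same particles can be transported more or less far without retraining anything.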

[AI-112] GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models

【速读】:该论文旨在解决当前图形用户界面(GUI)接地模型在标准基准测试中表现优异,但在涉及空间推理等复杂指令时性能显著下降的问题。现有基准测试仅对每个截图使用单一固定指令进行评估,无法揭示模型在真实场景中面对多变输入时的鲁棒性缺陷。其解决方案的关键在于提出GUI-Perturbed框架,通过独立扰动视觉场景和指令变量,系统性地测量模型的接地鲁棒性;该框架能够隔离出具体受影响的能力维度(如空间推理、视觉鲁棒性、推理校准),从而提供传统聚合指标无法获得的诊断信号,为模型改进提供精准方向。

链接: https://arxiv.org/abs/2604.14262
作者: Yangyue Wang,Harshvardhan Sikka,Yash Mathur,Tony Zhou,Jinu Nyachhyon,Pranav Guruprasad
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 Pages, 17 Figures, 9 Tables

点击查看摘要

Abstract:GUI grounding models report over 85% accuracy on standard benchmarks, yet drop 27-56 percentage points when instructions require spatial reasoning rather than direct element naming. Current benchmarks miss this because they evaluate each screenshot once with a single fixed instruction. We introduce GUI-Perturbed, a controlled perturbation framework that independently varies visual scenes and instructions to measure grounding robustness. Evaluating three 7B models from the same architecture lineage, we find that relational instructions cause systematic accuracy collapse across all models, a 70% browser zoom produces statistically significant degradation, and rank-8 LoRA fine-tuning with augmented data degrades performance rather than improving it. By perturbing along independent axes, GUI-Perturbed isolates which specific capability axes are affected (spatial reasoning, visual robustness, reasoning calibration), providing diagnostic signal that aggregate benchmarks cannot. We release the dataset, augmentation pipeline, and a fine-tuned model.

[AI-113] GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

【速读】:该论文旨在解决大语言模型在后训练阶段(post-training)中如何有效融合高效知识注入与鲁棒泛化能力的问题。现有方法通常依赖监督微调(SFT)和强化学习(RL),但SFT存在隐式奖励稀疏、逆概率加权不稳定等问题,导致单一路径依赖、熵坍缩和梯度爆炸等缺陷,限制了模型性能提升。论文提出Group Fine-Tuning(GFT)框架,其关键在于两个机制:一是Group Advantage Learning,通过构建多样化的响应组并引入归一化对比监督以缓解奖励稀疏性;二是Dynamic Coefficient Rectification,自适应地约束逆概率权重以稳定优化过程,同时保留高效的知识注入能力。实验表明,GFT显著优于基于SFT的方法,并能更平滑地衔接后续的RL训练。

链接: https://arxiv.org/abs/2604.14258
作者: Wangjie Gan,Miao Pan,Linbo Xi,Wenqi Zhang,Jintao Chen,Jianwei Yin,Xuhong Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
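
The two mechanisms can be illustrated with minimal stand-ins. The 1/p weight comes from the identity grad log p = grad p / p; the fixed cap `c_max` and the particular advantage normalization are assumptions, since the abstract describes the mechanisms only qualitatively:

```python
import numpy as np

def rectified_coefficient(token_probs, c_max=20.0):
    """SFT's implicit policy-gradient weight on a target token is 1/p,
    which explodes as p -> 0 (the gradient-explosion failure mode the
    paper diagnoses). Dynamic Coefficient Rectification bounds it; the
    fixed cap c_max is an assumed concrete form of the adaptive bound."""
    p = np.asarray(token_probs, dtype=float)
    return np.minimum(1.0 / p, c_max)

def group_advantages(rewards):
    """Group Advantage Learning: normalized contrastive supervision over
    a diverse response group, instead of a single gold trajectory."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

The cap keeps the efficient knowledge-injection behavior for well-predicted tokens (weight near 1/p) while preventing rare tokens from dominating the update.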

[AI-114] Formalizing Kantian Ethics: Formula of the Universal Law Logic (FULL)

【速读】:该论文旨在解决当前人工智能伦理研究中构建人工道德代理(Artificial Moral Agents, AMAs)所面临的两个关键局限:一是现有方法未考虑智能体执行行为时的目的,二是假设人类能够完整枚举自身的道德直觉。为克服这些问题,论文提出了一种基于康德伦理学的正式化道德推理框架——通用法则逻辑(Formula of the Universal Law Logic, FULL),其核心是将康德第一定言命令(Formula of the Universal Law, FUL)形式化为一种多类型一阶模态逻辑系统,并整合因果性和代理性等概念。FULL能够在不预设内置道德直觉的前提下,仅依赖充分的非规范性背景知识,对智能体在特定目的下的行为进行合乎康德伦理的推理与评估,从而推动更鲁棒和自主的AMAs发展,并深化对康德伦理学的形式理解。

链接: https://arxiv.org/abs/2604.14254
作者: Taylor Olson
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:The field of machine ethics aims to build Artificial Moral Agents (AMAs) to better understand morality and make AI agents safer. To do so, many approaches encode human moral intuition as a set of axioms on actions, e.g., "do not harm" or "you must help others." However, this introduces (at least) two limitations for future AMAs. First, it does not consider the agent's purposes in performing the action. Second, it assumes that we humans can enumerate our moral intuition. This paper explores formalizing a moral procedure that alleviates these two limitations. We specifically consider Kantian ethics and present a multi-sorted quantified modal logic we call the Formula of the Universal Law Logic (FULL). The FULL formalizes Kant's first formulation of the categorical imperative, the Formula of the Universal Law (FUL), and concepts such as causality and agency. We demonstrate on three cases from Kantian ethics that the FULL can reason to evaluate agents' actions for certain purposes without built-in moral intuition, given that it has sufficient (non-normative) background knowledge. Therefore, the FULL is a contribution towards more robust and autonomous AMAs, and a more formal understanding of Kantian ethics.

[AI-115] Awakening Dormant Experts: Counterfactual Routing to Mitigate MoE Hallucinations

【速读】:该论文旨在解决稀疏混合专家(Sparse Mixture-of-Experts, MoE)模型在处理长尾知识时易产生幻觉的问题。其核心原因是静态的Top-k路由机制倾向于偏好高频模式,导致掌握关键长尾事实的“专家”常因门控分数较低而处于“休眠”状态,即使这些专家对其他输入具有因果重要性。解决方案的关键在于提出无需训练的推理框架Counterfactual Routing (CoR),通过层级扰动分析与反事实专家影响(Counterfactual Expert Impact, CEI)指标,动态将计算资源从语法主导层转移到知识密集层,同时保持总激活数量不变,从而通过虚拟消融有效唤醒因果决定性的休眠专家。

链接: https://arxiv.org/abs/2604.14246
作者: Wentao Hu,Yanbo Zhai,Xiaohui Hu,Mingkuan Zhao,Shanhong Yu,Xue Liu,Kaidong Yu,Shuangyong Song,Xuelong Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 6 figures, 6 tables

点击查看摘要

Abstract:Sparse Mixture-of-Experts (MoE) models have achieved remarkable scalability, yet they remain vulnerable to hallucinations, particularly when processing long-tail knowledge. We identify that this fragility stems from static Top-k routing: routers tend to favor high-frequency patterns over rare factual associations. Consequently, "specialist experts" possessing critical long-tail knowledge are often assigned low gating scores and remain "dormant": under-prioritized for specific tokens despite their proven causal importance on other inputs. To address this, we propose Counterfactual Routing (CoR), a training-free inference framework designed to awaken these dormant experts. CoR integrates layer-wise perturbation analysis with the Counterfactual Expert Impact (CEI) metric to dynamically shift computational resources from syntax-dominant to knowledge-intensive layers while maintaining a constant total activation count, effectively retrieving causally decisive experts via virtual ablation. Extensive experiments on TruthfulQA, FACTOR, and TriviaQA demonstrate that CoR improves factual accuracy by 3.1% on average without increasing the inference budget, establishing a superior Pareto frontier compared to static scaling strategies.
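
Both ingredients admit compact sketches. The exact formulas below are illustrative assumptions: the abstract states only that CEI is measured via virtual ablation and that the total activation count stays constant when budget moves between layers.

```python
import numpy as np

def counterfactual_expert_impact(gates, expert_outputs, idx):
    """CEI via virtual ablation: drop one expert, renormalize the gates,
    and measure the L2 change in the mixture output."""
    g = np.asarray(gates, dtype=float)
    E = np.asarray(expert_outputs, dtype=float)
    base = g @ E
    g2 = g.copy()
    g2[idx] = 0.0
    g2 = g2 / (g2.sum() + 1e-12)
    return float(np.linalg.norm(base - g2 @ E))

def reallocate_topk(layer_importance, total_k):
    """Shift activation budget from syntax-dominant to knowledge-intensive
    layers while keeping the total number of activated experts constant."""
    imp = np.asarray(layer_importance, dtype=float)
    raw = imp / imp.sum() * total_k
    k = np.floor(raw).astype(int)
    # hand out the rounding leftover to the largest fractional parts
    for i in np.argsort(raw - k)[::-1][: total_k - k.sum()]:
        k[i] += 1
    return k
```

A large CEI flags an expert whose removal would visibly change the output; such "causally decisive" experts are the ones CoR tries not to leave dormant.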

[AI-116] Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees

【速读】:该论文旨在解决安全约束强化学习(Safety-Constrained Reinforcement Learning, SCRL)中因忽略外生因素(exogenous factors)而导致策略在部署时失效的问题,尤其是在存在对抗性动态(adversarial dynamics)的情况下。传统约束马尔可夫决策过程(Constrained MDP)假设状态转移仅由智能体自身动作决定,这在安全关键场景下不成立;现有鲁棒强化学习方法虽采用分布鲁棒性建模过渡核,但未显式刻画智能体与外生因素之间的战略交互,并依赖对已知基准模型的强假设。本文首次将外生因素建模为对抗性策略 πˉ\bar\pi,提出鲁棒幻觉约束上置信界强化学习(Robust Hallucinated Constrained Upper-Confidence RL, RHC-UCRL),其核心创新在于:同时对智能体和对抗策略保持乐观性,明确区分认知不确定性(epistemic uncertainty)与随机不确定性(aleatoric uncertainty),从而实现子线性遗憾(sub-linear regret)和约束违反保证(constraint violation guarantees)。

链接: https://arxiv.org/abs/2604.14243
作者: Sourav Ganguly,Kartik Pandit,Arnob Ghosh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-world decision-making systems operate in environments where state transitions depend not only on the agent's actions, but also on exogenous factors outside its control (competing agents, environmental disturbances, or strategic adversaries): formally, s_{h+1} = f(s_h, a_h, ā_h) + ω_h, where ā_h is the adversary/external action, a_h is the agent's action, and ω_h is an additive noise. Ignoring such factors can yield policies that are optimal in isolation but fail catastrophically in deployment, particularly when safety constraints must be satisfied. Standard Constrained MDP formulations assume the agent is the sole driver of state evolution, an assumption that breaks down in safety-critical settings. Existing robust RL approaches address this via distributional robustness over transition kernels, but do not explicitly model the strategic interaction between agent and exogenous factor, and rely on strong assumptions about divergence from a known nominal model. We model the exogenous factor as an adversarial policy π̄ that co-determines state transitions, and ask how an agent can remain both optimal and safe against such an adversary. To the best of our knowledge, this is the first work to study safety-constrained RL under explicit adversarial dynamics. We propose Robust Hallucinated Constrained Upper-Confidence RL (RHC-UCRL), a model-based algorithm that maintains optimism over both agent and adversary policies, explicitly separating epistemic from aleatoric uncertainty. RHC-UCRL achieves sub-linear regret and constraint violation guarantees. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.14243 [cs.LG] (or arXiv:2604.14243v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.14243 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-117] Interpretable and Explainable Surrogate Modeling for Simulations: A State-of-the-Art Survey and Perspectives on Explainable AI for Decision-Making

【速读】:该论文旨在解决复杂系统仿真中代理模型(surrogate model)与可解释人工智能(Explainable Artificial Intelligence, XAI)长期割裂的问题,即代理模型虽能显著降低计算成本,但其黑箱特性阻碍了对输入变量如何驱动物理响应的机制理解;而XAI方法虽具备解析能力,却难以满足工程场景下的高相关性输入、动态系统建模及可靠性要求等特殊约束。解决方案的关键在于构建一个结构化框架,将现有XAI技术映射到代理模型构建与应用的全流程(从模型训练到决策支持),通过方程基础模拟与基于代理的建模实例验证该整合路径的有效性,并提出以可解释性为核心嵌入仿真驱动工作流的研究议程,从而推动从单纯加速仿真向提取复杂系统行为可操作洞察的范式转变。

链接: https://arxiv.org/abs/2604.14240
作者: Pramudita Satria Palar,Paul Saves,Muhammad Daffa Robani,Nicolas Verstaevel,Moncef Garouani,Julien Aligon,Koji Shimoyama,Joseph Morlier,Benoit Gaudou
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Accepted for publication in Archives of Computational Methods in Engineering, 2026, ID d9d36aab-3723-4a70-b2ce-166435179528

点击查看摘要

Abstract:The simulation of complex systems increasingly relies on sophisticated but fundamentally opaque computational black-box simulators. Surrogate models play a central role in reducing the computational cost of complex systems simulations across a wide range of scientific and engineering domains. Nevertheless, they inevitably inherit and often exacerbate this black-box nature, obscuring how input variables drive physical responses. Conversely, Explainable Artificial Intelligence (XAI) offers powerful tools to unpack these models. Yet, XAI methods struggle with engineering-specific constraints, such as highly correlated inputs, dynamical systems, and rigorous reliability requirements. Consequently, surrogate modeling and XAI have largely evolved as distinct fields of research, despite their strong complementarity. To reconnect these approaches, this state-of-the-art survey provides a structured perspective that maps existing XAI techniques onto the various stages of surrogate modeling workflows for design and exploration. To ground this synthesis, we draw upon illustrative applications across both equation-based simulations and agent-based modeling. We survey a broad spectrum of techniques, highlighting their strengths for revealing interactions and supporting human comprehension. Finally, we identify pressing open challenges, including the explainability of dynamical systems and the handling of mixed-variable systems, and propose a research agenda to make explainability a core, embedded element of simulation-driven workflows from model construction through decision-making. By transforming opaque emulators into explainable tools, this agenda empowers practitioners to move beyond accelerating simulations to extracting actionable insights from complex system behaviors.

[AI-118] Graph-Based Fraud Detection with Dual-Path Graph Filtering

【速读】:该论文旨在解决金融欺诈检测中图神经网络(GNN)性能不佳的问题,其核心挑战包括关系伪装(relation camouflage)、高异配性(high heterophily)和类别不平衡(class imbalance)。解决方案的关键在于提出一种基于双路径图滤波的欺诈检测模型(DPF-GFD),通过引入频率互补的双路径过滤机制,显式解耦结构异常建模与特征相似性建模:一方面利用基于beta小波的算子提取原始图的关键结构模式,另一方面构建基于距离的节点表示相似性图并应用改进的低通滤波器;随后通过监督表示学习融合两种图的嵌入,最终由集成树模型评估未标记节点的欺诈风险。此设计显著提升了在高度异配和不平衡欺诈图上的节点表征判别力与稳定性。

链接: https://arxiv.org/abs/2604.14235
作者: Wei He,Wensheng Gan,Philip S. Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Neural Networks

点击查看摘要

Abstract:Fraud detection on graph data can be viewed as a demanding task that requires distinguishing between different types of nodes. Because graph neural networks (GNNs) are naturally suited for processing information encoded in graph form through their message-passing operations, methods based on GNN models have increasingly attracted attention in the fraud detection domain. However, fraud graphs inherently exhibit relation camouflage, high heterophily, and class imbalance, causing most GNNs to underperform in fraud detection tasks. To address these challenges, this paper proposes a Graph-Based Fraud Detection Model with Dual-Path Graph Filtering (DPF-GFD). DPF-GFD first applies a beta wavelet-based operator to the original graph to capture key structural patterns. It then constructs a similarity graph from distance-based node representations and applies an improved low-pass filter. The embeddings from the original and similarity graphs are fused through supervised representation learning to obtain node features, which are finally used by an ensemble tree model to assess the fraud risk of unlabeled nodes. Unlike existing single-graph smoothing approaches, DPF-GFD introduces a frequency-complementary dual-path filtering paradigm tailored for fraud detection, explicitly decoupling structural anomaly modeling and feature similarity modeling. This design enables more discriminative and stable node representations in highly heterophilous and imbalanced fraud graphs. Comprehensive experiments on four real-world financial fraud detection datasets demonstrate the effectiveness of our proposed method.
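
The dual-path idea can be approximated with two elementary graph filters. Treating `I - S` (the normalized Laplacian) as a stand-in for the beta-wavelet operator is my simplification; the paper's actual operator is more expressive, but the frequency complementarity is the same:

```python
import numpy as np

def sym_norm_adj(A):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2."""
    A = np.asarray(A, dtype=float) + np.eye(len(A))
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def low_pass(A, X, hops=2):
    """Low-pass path: repeated neighborhood smoothing (similarity-graph branch)."""
    S = sym_norm_adj(A)
    X = np.asarray(X, dtype=float)
    for _ in range(hops):
        X = S @ X
    return X

def high_pass(A, X):
    """High-frequency proxy for the wavelet branch (an assumed simplification):
    I - S keeps a node's disagreement with its neighbors, which is the
    informative signal under heterophily (fraudsters linking to normal users)."""
    X = np.asarray(X, dtype=float)
    return X - sym_norm_adj(A) @ X
```

On a smooth (homophilous) signal the high-pass output vanishes; on a signal that flips sign across an edge it is preserved, which is exactly why a single low-pass GNN underperforms on fraud graphs.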

[AI-119] Explainable Graph Neural Networks for Interbank Contagion Surveillance: A Regulatory-Aligned Framework for the U.S. Banking Sector

【速读】:该论文旨在解决美国银行系统中早期识别金融机构困境(bank distress)及开展宏观审慎监管(macro-prudential surveillance)的问题。其解决方案的关键在于提出了一种时空图注意力网络(Spatial-Temporal Graph Attention Network, ST-GAT)框架,该框架通过建模8,103家FDIC保险机构在58个季度时间点上的动态有向加权图结构(基于最大熵估计重构的双边敞口),结合图神经网络(GNN)与双向长短期记忆网络(BiLSTM)的时序建模能力,实现了高精度的早期预警。ST-GAT在AUPRC指标上达到0.939,显著优于其他GNN架构,且通过消融实验验证了时序注意力机制对模型性能的提升作用,同时识别出ROA和不良贷款率(NPL Ratio)为关键预测因子,具备良好的可解释性。

链接: https://arxiv.org/abs/2604.14232
作者: Mohammad Nasir Uddin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 28 pages, submitted to Research in International Business and Finance (RIBAF)

点击查看摘要

Abstract:The Spatial-Temporal Graph Attention Network (ST-GAT) framework was created to serve as an explainable GNN-based solution for detecting bank distress early warning signs and for conducting macro-prudential surveillance of the interbank system in the United States. The ST-GAT framework models 8,103 FDIC insured institutions across 58 quarterly snapshots (2010Q1-2024Q2). Bilateral exposures were reconstructed from publicly available FDIC Call Reports using maximum entropy estimation to produce a dynamic directed weighted graph. The framework achieves the highest AUPRC among all GNN architectures (0.939 +/- 0.010), trailing only XGBoost (0.944). Ablation analysis confirms the BiLSTM temporal component contributes +0.020 AUPRC; temporal attention weights exhibit a monotonically decreasing pattern consistent with long-run structural vulnerability weighting. Permutation importance identifies ROA (0.309) and NPL Ratio (0.252) as dominant predictors, consistent with post-mortem analyses of the 2023 regional banking crisis. All data are publicly available FDIC Call Reports and FRED series; all code and results are released.
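
Reconstructing bilateral exposures from balance-sheet aggregates via maximum entropy is conventionally implemented with RAS / iterative proportional fitting; a minimal sketch of that step (zero diagonal, matched row and column sums, assuming total interbank assets equal total liabilities) might look like:

```python
import numpy as np

def max_entropy_exposures(assets, liabilities, iters=500):
    """RAS / iterative proportional fitting: reconstruct a bilateral
    interbank exposure matrix whose row sums match each bank's total
    interbank assets and column sums its total liabilities, with a zero
    diagonal (no self-lending), starting from the maximum-entropy outer
    product. Assumes sum(assets) == sum(liabilities)."""
    a = np.asarray(assets, dtype=float)
    l = np.asarray(liabilities, dtype=float)
    X = np.outer(a, l)
    np.fill_diagonal(X, 0.0)
    for _ in range(iters):
        X *= (a / (X.sum(axis=1) + 1e-12))[:, None]  # match row sums
        X *= (l / (X.sum(axis=0) + 1e-12))[None, :]  # match column sums
    return X
```

The resulting weighted directed matrix is the kind of dynamic graph snapshot the ST-GAT consumes at each quarter.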

[AI-120] Shapley Value-Guided Adaptive Ensemble Learning for Explainable Financial Fraud Detection with U.S. Regulatory Compliance Validation

【速读】:该论文旨在解决金融犯罪检测中生成式 AI (Generative AI) 模型因缺乏透明性和可审计性而难以满足监管合规要求的问题,尤其是在美国联邦储备系统 SR 11-7 和 OCC Bulletin 2011-12 等法规框架下对模型解释性的强制要求。解决方案的关键在于提出一种基于 SHAP(Shapley Additive Explanations)的自适应集成方法——SHAP-Guided Adaptive Ensemble (SGAE),该方法通过动态调整每笔交易的集成权重以最大化 SHAP 属性的一致性,在保证高预测性能(AUC-ROC 达到 0.8837)的同时显著提升模型可解释性,并结合对 LSTM、Transformer 和 GNN-GraphSAGE 的全面三架构评估,验证了图神经网络(GNN-GraphSAGE)在大规模交易数据上的最优表现(AUC-ROC=0.9248),从而实现高性能与强合规性的统一。

链接: https://arxiv.org/abs/2604.14231
作者: Mohammad Nasir Uddin,Md Munna Aziz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 28 pages. Submitted to Engineering Applications of Artificial Intelligence (Elsevier). IEEE-CIS dataset (590,540 transactions). Includes SGAE algorithm, SHAP stability evaluation, and OCC/SR 11-7 regulatory compliance mapping

点击查看摘要

Abstract:Financial crime costs U.S. institutions over $32 billion each year. Although AI tools for fraud detection have become more advanced, their use in real-world systems still faces a major obstacle: many of these models operate as black boxes that cannot provide the transparent, auditable explanations required by regulations such as OCC Bulletin 2011-12 and Federal Reserve SR 11-7. This study makes three main contributions. First, it offers a thorough evaluation of explanation quality across faithfulness (sufficiency and comprehensiveness at k=5, 10, and 15) and stability (Kendall's W across 30 bootstrap samples). XGBoost paired with TreeExplainer achieves near-perfect stability (W=0.9912), while LSTM with DeepExplainer shows weak results (W=0.4962). Second, the paper introduces the SHAP-Guided Adaptive Ensemble (SGAE), which dynamically adjusts per-transaction ensemble weights based on SHAP attribution agreement, achieving the highest AUC-ROC among all tested models (0.8837 held-out; 0.9245 cross-validation). Third, a complete three-architecture evaluation of LSTM, Transformer, and GNN-GraphSAGE on the full 590,540-transaction IEEE-CIS dataset is provided, with GNN-GraphSAGE achieving AUC-ROC 0.9248 and F1=0.6013. All results are mapped directly to OCC, SR 11-7, and BSA-AML regulatory compliance requirements.
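
A per-transaction, agreement-driven blend can be sketched as below. The cosine-agreement weight and the rule that disagreement shifts weight toward the model with the more stable explanations (XGBoost/TreeExplainer in the paper) are assumed simplifications of SGAE, not the paper's exact formula:

```python
import numpy as np

def shap_agreement(attr_a, attr_b):
    """Cosine agreement between two models' attribution vectors for one
    transaction, mapped to [0, 1]."""
    a = np.asarray(attr_a, dtype=float)
    b = np.asarray(attr_b, dtype=float)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return 0.5 * (1.0 + cos)

def sgae_predict(p_stable, p_other, attr_stable, attr_other):
    """Per-transaction blend: full agreement -> plain average of the two
    fraud scores; full disagreement -> trust only the model whose
    explanations are historically stable. The interpolation rule is an
    illustrative assumption."""
    w = shap_agreement(attr_stable, attr_other)
    w_stable = 0.5 * w + (1.0 - w)  # disagreement shifts weight to the stable model
    return w_stable * p_stable + (1.0 - w_stable) * p_other
```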

[AI-121] Fun-TSG: A Function-Driven Multivariate Time Series Generator with Variable-Level Anomaly Labeling

【速读】:该论文旨在解决多变量时间序列异常检测方法评估中缺乏高质量基准数据集的问题,现有资源普遍存在异常标注粒度粗、未明确变量间与时间依赖关系、以及生成机制不透明等缺陷,阻碍了检测模型(尤其是可解释性和变量级输出模型)的开发与严谨比较。其解决方案的关键在于提出一个名为Fun-TSG的全定制化时间序列生成工具,该工具支持基于随机采样依赖结构和异常类型的全自动生成,也支持用户通过自定义方程和异常配置进行手动生成,同时提供完整的数据生成过程透明性及变量级与时间戳级的真实标签,从而构建多样化、可解释且可复现的基准场景,实现对经典与现代异常检测模型的细粒度性能分析。

链接: https://arxiv.org/abs/2604.14221
作者: Pierre Lotte(EPE UT, IRIT),André Péninou(UT2J, IRIT-SIG, IRIT),Olivier Teste(IRIT-SIG, IRIT, UT2J, Comue de Toulouse)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reliable evaluation of anomaly detection methods in multivariate time series remains an open challenge, largely due to the limitations of existing benchmark datasets. Current resources often lack fine-grained anomaly annotations, do not provide explicit intervariable and temporal dependencies, and offer little insight into the underlying generative mechanisms. These shortcomings hinder the development and rigorous comparison of detection models, especially those targeting interpretable and variable-specific outputs. To address this gap, we introduce Fun-TSG, a fully customizable time series generator designed to support high-quality evaluation of anomaly detection systems. Our tool enables both fully automated generation, based on randomly sampled dependency structures and anomaly types, and manual generation through user-defined equations and anomaly configurations. In both cases, it provides full transparency over the data generation process, including access to ground-truth anomaly labels at the variable and timestamp levels. Fun-TSG supports the creation of diverse, interpretable, and reproducible benchmarking scenarios, enabling fine-grained performance analysis for both classical and modern anomaly detection models.
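
A function-driven generator with variable-level labels can be tiny. The equations, the lagged inter-variable dependency, and the injected anomaly below are hypothetical examples of the kind of configuration Fun-TSG automates; the point is that the generator knows exactly which variable is anomalous at which timestamp:

```python
import numpy as np

def generate_series(T=200, seed=0):
    """Toy function-driven generator: x2 depends on lagged x1 (an explicit
    inter-variable dependency), with a point anomaly injected into x2
    only, yielding variable-level and timestamp-level ground truth."""
    rng = np.random.default_rng(seed)
    t = np.arange(T)
    x1 = np.sin(2 * np.pi * t / 50) + 0.1 * rng.standard_normal(T)
    x2 = np.empty(T)
    x2[0] = 0.0
    for i in range(1, T):
        x2[i] = 0.8 * x1[i - 1] + 0.1 * rng.standard_normal()
    labels = np.zeros((T, 2), dtype=int)  # (timestamp, variable) ground truth
    x2[120] += 5.0                        # inject a point anomaly into x2 only
    labels[120, 1] = 1
    return np.stack([x1, x2], axis=1), labels
```

Because the generating equations are known, a detector's variable-level output can be scored exactly, which is the evaluation gap the paper targets.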

[AI-122] owards Verified and Targeted Explanations through Formal Methods

【速读】:该论文旨在解决安全关键系统中可解释人工智能(XAI)的两大局限性:现有启发式归因方法(如LIME、积分梯度)虽能识别影响决策的关键特征,但缺乏关于决策边界的数学保证;而形式化方法虽能验证模型鲁棒性,却缺乏针对性,分析的是最近的决策边界,而非用户关注的特定风险类别。解决方案的核心在于提出ViTaX(Verified and Targeted Explanations)框架,其关键创新是引入“目标epsilon鲁棒性”(Targeted epsilon-Robustness)概念,通过形式化可达性分析,为用户指定的某一关键替代类别(class t)生成最小敏感特征子集,并严格证明在扰动幅度不超过ε时,模型不会将输入从原类别y误判为t,从而实现具有数学保障的目标导向型半反事实解释。

链接: https://arxiv.org/abs/2604.14209
作者: Hanchen David Wang,Diego Manzanas Lopez,Preston K. Robinette,Ipek Oguz,Taylor T. Johnson,Meiyi Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Paper has been accepted at JAIR

点击查看摘要

Abstract:As deep neural networks are deployed in safety-critical domains such as autonomous driving and medical diagnosis, stakeholders need explanations that are not only interpretable but also trustworthy, with formal guarantees. Existing XAI methods fall short: heuristic attribution techniques (e.g., LIME, Integrated Gradients) highlight influential features but offer no mathematical guarantees about decision boundaries, while formal methods verify robustness yet remain untargeted, analyzing the nearest boundary regardless of whether it represents a critical risk. In safety-critical systems, not all misclassifications carry equal consequences; mistaking a “Stop” sign for a “60 kph” sign is far more dangerous than mistaking it for a “No Passing” sign. We introduce ViTaX (Verified and Targeted Explanations), a formal XAI framework that generates targeted semifactual explanations with mathematical guarantees. For a given input (class y) and a user-specified critical alternative (class t), ViTaX: (1) identifies the minimal feature subset most sensitive to the y-t transition, and (2) applies formal reachability analysis to guarantee that perturbing these features by epsilon cannot flip the classification to t. We formalize this through Targeted epsilon-Robustness, certifying whether a feature subset remains robust under perturbation toward a specific target class. ViTaX is the first method to provide formally guaranteed explanations of a model’s resilience against user-identified alternatives. Evaluations on MNIST, GTSRB, EMNIST, and TaxiNet demonstrate over 30% fidelity improvement with minimal explanation cardinality.
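
For a linear classifier, targeted epsilon-robustness has a closed form, which conveys the idea without the reachability machinery the paper uses for deep networks:

```python
import numpy as np

def targeted_eps_robust(W, b, x, y, t, subset, eps):
    """Targeted eps-robustness for a *linear* classifier f(x) = Wx + b
    (an illustrative special case; the paper certifies deep networks via
    reachability analysis). Perturbing only the features in `subset` by
    at most eps in L-inf norm cannot flip class y to the target class t
    iff the worst-case margin stays positive:
        margin(x) - eps * sum_{i in subset} |W[y,i] - W[t,i]| > 0."""
    W = np.asarray(W, dtype=float)
    x = np.asarray(x, dtype=float)
    w_diff = W[y] - W[t]
    margin = w_diff @ x + (b[y] - b[t])
    worst = margin - eps * np.abs(w_diff[list(subset)]).sum()
    return bool(worst > 0)
```

Note the certificate is specific to the pair (y, t): the same subset and eps may be certified against one critical alternative but not another, which is the "targeted" distinction from nearest-boundary robustness.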

[AI-123] Disentangled Dual-Branch Graph Learning for Conversational Emotion Recognition

【速读】:该论文旨在解决多模态情感识别(Multimodal Emotion Recognition)中面临的三大核心挑战:跨模态信息冗余、语义对齐不完善以及高阶说话者交互建模不足。其解决方案的关键在于提出一种结合双空间特征解耦与双分支图学习的框架:首先通过共享编码器和模态特定编码器分离出模态不变特征与模态特有特征;随后,在不变特征上使用傅里叶图神经网络(Fourier Graph Neural Network)捕捉全局一致性与互补模式,并引入频域对比目标提升判别能力;同时,在模态特有特征上构建说话者感知超图(Speaker-aware Hypergraph)以建模高阶交互关系,并施加说话者一致性约束确保语义连贯性;最终融合两个分支实现话语级情感预测。该方法在IEMOCAP和MELD数据集上显著优于现有强基线,验证了其有效性。

链接: https://arxiv.org/abs/2604.14204
作者: Chengling Guo,Yuntao Shou,Tao Meng,Wei Ai,Yun Tan,Keqin Li
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 16 pages

点击查看摘要

Abstract:Multimodal emotion recognition in conversations aims to infer utterance-level emotions by jointly modeling textual, acoustic, and visual cues within context. Despite recent progress, key challenges remain, including redundant cross-modal information, imperfect semantic alignment, and insufficient modeling of high-order speaker interactions. To address these issues, we propose a framework that combines dual-space feature disentanglement with dual-branch graph learning. A shared encoder and modality-specific encoders are used to separate modality-invariant and modality-specific representations. The invariant features are modeled by a Fourier graph neural network to capture global consistency and complementary patterns, with a frequency-domain contrastive objective to enhance discriminability. In parallel, a speaker-aware hypergraph is constructed over modality-specific features to model high-order interactions, along with a speaker-consistency constraint to maintain coherent semantics. Finally, the two branches are fused for utterance-level emotion prediction. Experiments on IEMOCAP and MELD demonstrate that the proposed method achieves superior performance over strong baselines, validating its effectiveness.

[AI-124] End-to-End Learning-based Operation of Integrated Energy Systems for Buildings and Data Centers

【速读】:该论文旨在解决建筑与数据中心(Data Centers, DCs)在综合能源系统(Integrated Energy System, IES)中协同优化不足,以及多能供需预测不确定性导致的运行优化困难问题。其关键解决方案是提出一种端到端的学习驱动方法,将不确定变量的预测模型训练与IES的约束优化统一在一个学习框架中,通过引导预测模型训练以提升运行性能而非单纯追求预测精度,从而缓解预测误差对系统运行的影响。该方法相较于传统“先预测后优化”策略,在真实数据案例中使IES运行性能提升约7–9%,同时通过DC余热回收实现整体能源成本降低约10%。

链接: https://arxiv.org/abs/2604.14184
作者: Zhenyu Pu,Yu Yang,Liang Yu,Xiaohong Guan
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: 7 pages, 4 figures

点击查看摘要

Abstract:Buildings and data centers (DCs) are energy-intensive sectors, playing a critical role to achieve the low-carbon and sustainable energy transition targets. To this end, integrated energy system (IES) that incorporates diverse renewables, energy generation, conversion, and storage technologies to enable coordinated multi-energy supply have been widely investigated for both buildings and DCs. However, few works consider the two sectors jointly within IES to exploit their substantial synergistic benefits. Meanwhile, the operational optimization of IES remains challenging due to the difficulty to predict the multi-energy demand and supply accurately. To address these gaps, this paper investigates IES for coordinated multi-energy supply of buildings and DC, where the waste heat from DCs is recovered and reused to enhance energy efficiency. Moreover, an end-to-end learning-based method is proposed for the operational optimization of IES under uncertainty. Unlike conventional predict-then-optimize approaches, the proposed method integrates the training of prediction models for uncertain variables with the constrained optimization of IES into a unified learning framework, guiding the training of prediction models to improve operational performance, rather than prediction accuracy, thereby mitigating the impacts of predictions errors. Case studies based on real-world datasets show that the proposed method improves the operational performance of IES by about 7-9% compared to existing predict-then-optimize methods. In addition, coordinating buildings and DCs within IES shows substantial economic benefits. In particular, the waste heat recovery from DCs leads to approximately 10% of total energy cost reduction of the IES.
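
The predict-then-optimize vs. end-to-end distinction shows up even in a toy newsvendor-style stand-in (the asymmetric costs and constant forecast below are assumptions, not the paper's IES model): training on the task loss biases the forecast toward the costly-error side instead of the mean.

```python
import numpy as np

def task_cost(pred, demand, c_under=5.0, c_over=1.0):
    """Asymmetric operating cost: under-forecasting demand (forcing
    expensive backup supply) is assumed 5x costlier than over-forecasting."""
    diff = np.asarray(demand, dtype=float) - pred
    return float(np.mean(np.where(diff > 0, c_under * diff, -c_over * diff)))

def end_to_end_train(demands, lr=0.05, epochs=500):
    """Fit a constant forecast by descending the task cost (via its
    subgradient) rather than squared prediction error; the optimum is the
    c_under/(c_under+c_over) = 5/6 quantile of demand, not its mean."""
    d = np.asarray(demands, dtype=float)
    theta = float(d.mean())
    for _ in range(epochs):
        grad = float(np.mean(np.where(d - theta > 0, -5.0, 1.0)))
        theta -= lr * grad
    return theta
```

A predict-then-optimize pipeline minimizing MSE would return the mean and incur a higher operating cost, which is the gap the end-to-end framework closes.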

[AI-125] Simulating Human Cognition: Heartbeat-Driven Autonomous Thinking Activity Scheduling for LLM -based AI systems

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在推理与工具使用中因控制流僵化、反应迟钝而导致的适应性差和效率低的问题。现有框架多依赖固定流水线或故障触发的反思机制,导致代理行为冲动或仅在错误发生后才进行修正。其解决方案的关键在于提出一种基于“心跳驱动”的自主思维活动调度机制(Heartbeat-Driven Autonomous Thinking Activity Scheduling),通过周期性“心跳”信号动态协调多个认知模块(如规划器、批评者、回忆者、梦想者),使系统能够根据时间模式和历史上下文主动决定何时执行特定认知任务(如回忆、总结或战略规划),而非依赖硬编码规则或即时反应触发。该机制支持认知模块的动态增删而不需结构重构,并结合元学习策略利用交互日志持续优化调度策略,从而实现更灵活、高效的自主认知调节。

链接: https://arxiv.org/abs/2604.14178
作者: Hong Su
机构: 未知
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) agents have demonstrated remarkable capabilities in reasoning and tool use, yet they often suffer from rigid, reactive control flows that limit their adaptability and efficiency. Most existing frameworks rely on fixed pipelines or failure-triggered reflection, causing agents to act impulsively or correct errors only after they occur. In this paper, we introduce Heartbeat-Driven Autonomous Thinking Activity Scheduling, a mechanism that enables proactive, adaptive, and continuous self-regulation. Mirroring the natural rhythm of human cognition, our system employs a periodic "heartbeat" mechanism to orchestrate a dynamic repertoire of cognitive modules (e.g., Planner, Critic, Recaller, Dreamer). Unlike traditional approaches that rely on hard-coded symbolic rules or immediate reactive triggers, our scheduler learns to determine when to engage specific thinking activities – such as recalling memories, summarizing experiences, or strategic planning – based on temporal patterns and historical context. This functional approach allows cognitive modules to be dynamically added or removed without structural reengineering. Meanwhile, we propose a meta-learning strategy for continual policy adaptation, where the scheduler optimizes its cognitive strategy over time using historical interaction logs. Evaluation results demonstrate that our approach effectively learns to schedule cognitive activities based on historical data and can autonomously integrate new thinking modules.
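
A fixed-period heartbeat loop conveys the scheduling skeleton. The learned, history-conditioned policy that decides when each module fires is replaced here by static periods, and the module names are only examples:

```python
class HeartbeatScheduler:
    """Minimal heartbeat loop: cognitive modules register with a period in
    ticks and fire when due. Modules can be added or removed at runtime
    without structural changes, mirroring the paper's plug-in repertoire."""

    def __init__(self):
        self.modules = {}  # name -> (period, callback)
        self.tick = 0

    def register(self, name, period, callback):
        self.modules[name] = (period, callback)

    def remove(self, name):
        self.modules.pop(name, None)

    def heartbeat(self):
        """One heartbeat: advance time and run every module whose period divides it."""
        self.tick += 1
        fired = []
        for name, (period, callback) in list(self.modules.items()):
            if self.tick % period == 0:
                callback()
                fired.append(name)
        return fired
```

Swapping the `self.tick % period` test for a learned decision over the interaction log is where the paper's meta-learning strategy would plug in.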

[AI-126] he Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery CVPR26

[Quick Read]: This paper studies the gradient entanglement problem in Generalized Category Discovery (GCD), which arises when existing methods jointly optimize supervised and unsupervised objectives. The entanglement (1) distorts supervised gradients, weakening discrimination among known classes, and (2) causes the representation subspaces of known and novel classes to overlap, reducing the separability of novel categories. The key to the solution is a plug-and-play gradient-level module, the Energy-Aware Gradient Coordinator (EAGC), with two core components: Anchor-based Gradient Alignment (AGA) and Energy-aware Elastic Projection (EEP). AGA introduces a reference model to anchor the gradient directions of labeled samples, protecting the discriminative structure of known classes; EEP softly projects the gradients of unlabeled samples onto the complement of the known-class subspace and adaptively scales the projection according to each sample's alignment with the known subspace, reducing subspace overlap without suppressing unlabeled samples that likely belong to known classes and substantially improving performance.

Link: https://arxiv.org/abs/2604.14176
Authors: Haiyang Zheng,Nan Pu,Yaqi Cai,Teng Long,Wenjing Li,Nicu Sebe,Zhun Zhong
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted by CVPR26

Click to view abstract

Abstract:Generalized Category Discovery (GCD) leverages labeled data to categorize unlabeled samples from known or unknown classes. Most previous methods jointly optimize supervised and unsupervised objectives and achieve promising results. However, inherent optimization interference still limits their ability to improve further. Through quantitative analysis, we identify a key issue, i.e., gradient entanglement, which 1) distorts supervised gradients and weakens discrimination among known classes, and 2) induces representation-subspace overlap between known and novel classes, reducing the separability of novel categories. To address this issue, we propose the Energy-Aware Gradient Coordinator (EAGC), a plug-and-play gradient-level module that explicitly regulates the optimization process. EAGC comprises two components: Anchor-based Gradient Alignment (AGA) and Energy-aware Elastic Projection (EEP). AGA introduces a reference model to anchor the gradient directions of labeled samples, preserving the discriminative structure of known classes against the interference of unlabeled gradients. EEP softly projects unlabeled gradients onto the complement of the known-class subspace and derives an energy-based coefficient to adaptively scale the projection for each unlabeled sample according to its degree of alignment with the known subspace, thereby reducing subspace overlap without suppressing unlabeled samples that likely belong to known classes. Experiments show that EAGC consistently boosts existing methods and establishes new state-of-the-art results. Code is available at this https URL.
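
The elastic projection idea in EEP can be illustrated with a toy gradient computation. Everything below is an assumption-level sketch, not the authors' exact formulation: a gradient is projected onto the orthogonal complement of a known-class subspace, and an alignment-based coefficient decides how strongly the projection is applied.

```python
import numpy as np

# Illustrative sketch of "elastic" gradient projection (not the paper's code):
# strongly aligned gradients (likely known-class samples) keep more of their
# original direction; weakly aligned ones are pushed into the complement space.
def elastic_projection(g, basis, temperature=1.0):
    """g: gradient (d,); basis: orthonormal columns (d, k) spanning the
    known-class subspace. Returns the adaptively projected gradient."""
    g_known = basis @ (basis.T @ g)   # component inside the known-class subspace
    g_comp = g - g_known              # complement-space component
    # Alignment of g with the known subspace, in [0, 1].
    align = np.linalg.norm(g_known) / (np.linalg.norm(g) + 1e-12)
    # Energy-style coefficient (an assumed form): high alignment -> small coef
    # -> keep more of the original gradient.
    coef = np.exp(-align / temperature)
    return coef * g_comp + (1 - coef) * g

# Example: 3-D gradient, known subspace = x-axis.
basis = np.array([[1.0], [0.0], [0.0]])
g = np.array([2.0, 1.0, 0.0])
out = elastic_projection(g, basis)
```

The output interpolates between the raw gradient and its complement-space projection, which is the qualitative behavior the abstract attributes to EEP.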

[AI-127] NuHF Claw: A Risk Constrained Cognitive Agent Framework for Human Centered Procedure Support in Digital Nuclear Control Rooms

[Quick Read]: This paper targets the cognitive risks introduced by increasingly complex operator interaction patterns in digital nuclear main control rooms, where existing Human Reliability Analysis (HRA) methods struggle to handle the new cognitive workload and error risks caused by soft-control behaviors. The key innovation of the proposed NuHF Claw framework is a risk constrained agent runtime that tightly couples cognitive state inference with probabilistic safety assessment to regulate autonomous system behavior in real time. By combining cognitively grounded workload and situational awareness estimation, it dynamically predicts human error probabilities, turning conventional offline reliability analysis into a proactive intervention mechanism embedded in operational workflows. This realizes cognition-aware autonomy that improves the safety and adaptability of intelligent agents while preserving human decision authority.

Link: https://arxiv.org/abs/2604.14160
Authors: Xingyu Xiao,Jiejuan Tong,Jun Sun,Zhe Sui,Peng Chen,Jingang Liang,Haitao Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The rapid digitization of nuclear power plant main control rooms has fundamentally reshaped operator interaction patterns, introducing complex soft-control behaviors and elevated cognitive risks that are not adequately addressed by existing human reliability analysis approaches. Although recent advances in large language models and autonomous agents offer new opportunities for intelligent decision support, their deployment in safety critical environments remains constrained by risks of hallucinated reasoning and weakened human authority. This study proposes NuHF Claw, a persistent cognitive-risk agent framework that enables risk governed human centered autonomy for digital nuclear operations. The core methodological innovation lies in the introduction of a risk constrained agent runtime, which tightly couples cognitive state inference with probabilistic safety assessment to regulate autonomous system behavior in real time. By integrating cognitively grounded workload and situational awareness estimation with dynamic human error probability prediction, the framework transforms conventional offline reliability analysis into a proactive intervention mechanism embedded directly within operational workflows. Experimental validation on a high-fidelity digital control room simulator demonstrates that NuHF Claw can anticipate interface induced cognitive degradation, dynamically constrain unsafe autonomous recommendations, and provide risk-aware navigational guidance while preserving human decision authority. The results highlight a fundamental shift from automation-driven operation toward cognition-aware autonomy, offering a principled pathway for the safe integration of intelligent agents into next-generation nuclear control environments.

[AI-128] BitFlipScope: Scalable Fault Localization and Recovery for Bit-Flip Corruptions in LLMs

[Quick Read]: This paper addresses bit-flip faults in deployed Large Language Models (LLMs) caused by hardware degradation, cosmic radiation, or malicious fault-injection attacks such as Rowhammer, which silently corrupt internal parameters and trigger unpredictable or even dangerous behavior. The key to the solution is BitFlipScope, a scalable software framework with two operating modes: when a clean reference model is available, it localizes faults through differential analysis of outputs, hidden states, and internal activations; when no reference model exists, it infers the fault-affected region directly from the corrupted model via residual-path perturbation and loss-sensitivity analysis. Beyond efficient fault diagnosis, the framework supports lightweight performance recovery without fine-tuning, offering a practical path toward highly reliable LLM deployment.

Link: https://arxiv.org/abs/2512.22174
Authors: Muhammad Zeeshan Karamat,Sadman Saif,Christiana Chamon Garcia
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted at the IEEE International Symposium on Hardware Oriented Security and Trust (HOST) 2026

Click to view abstract

Abstract:Large Language Models (LLMs) deployed in practical and safety-critical settings are increasingly susceptible to bit-flip faults caused by hardware degradation, cosmic radiation, or deliberate fault-injection attacks such as Rowhammer. These faults silently corrupt internal parameters and can lead to unpredictable or dangerous model behavior. Localizing these corruptions is essential: without identifying the affected region, it is impossible to diagnose the source of degradation, apply targeted corrective measures, or restore model functionality without resorting to costly fine-tuning or full retraining. This work introduces BitFlipScope, a scalable, software-based framework for identifying fault-affected regions within transformer architectures under two deployment scenarios. When a clean reference model is available, BitFlipScope performs differential analysis of outputs, hidden states, and internal activations for detecting anomalous behavior indicative of corruption to pinpoint or localize faults. When no reference model exists, it uses residual-path perturbation and loss-sensitivity profiling to infer the fault-impacted region directly from the corrupted model. In both settings, the framework not only enables effective fault diagnosis but also supports lightweight performance recovery without fine-tuning, offering a practical path to restoring corrupted models. Together, these capabilities make BitFlipScope an important step toward trustworthy, fault-resilient LLM deployment in hardware-prone and adversarial environments.
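
The reference-based mode — differential analysis against a clean model — can be sketched on a toy stack of linear layers. The model and divergence test below are illustrative assumptions, not BitFlipScope's actual implementation.

```python
import numpy as np

# Toy sketch of reference-based fault localization: run one input through a
# clean and a corrupted stack of layers and flag the first layer whose
# activations diverge. The "model" is just a list of weight matrices.
def forward_with_traces(layers, x):
    traces = []
    for W in layers:
        x = W @ x            # linear layer, kept simple for illustration
        traces.append(x.copy())
    return traces

def localize_fault(clean_layers, corrupt_layers, x, tol=1e-6):
    clean = forward_with_traces(clean_layers, x)
    corrupt = forward_with_traces(corrupt_layers, x)
    for i, (a, b) in enumerate(zip(clean, corrupt)):
        if np.max(np.abs(a - b)) > tol:
            return i         # index of the first deviating layer
    return None              # no divergence detected

rng = np.random.default_rng(0)
layers = [rng.normal(size=(4, 4)) for _ in range(3)]
corrupted = [W.copy() for W in layers]
corrupted[1][0, 0] += 4.0    # simulate a large bit-flip in layer 1's weights
x = rng.normal(size=4)
fault_at = localize_fault(layers, corrupted, x)
```

Because the corruption sits in layer 1, layer 0's trace matches exactly and the scan stops at index 1 — the same "first point of divergence" logic the abstract describes for hidden-state comparison.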

[AI-129] Amortized Optimal Transport from Sliced Potentials

[Quick Read]: This paper tackles the problem of efficiently predicting optimal transport (OT) plans across multiple pairs of probability measures, especially when OT must be solved repeatedly at high computational cost. Traditional OT methods re-solve a complex optimization problem for every new pair of measures, lacking reusability and efficiency. The key to the solution is two amortized optimization strategies based on Kantorovich potentials: Regression-based Amortization OT (RA-OT) and Objective-based Amortization OT (OA-OT). Both use potentials constructed from sliced OT as prior information and learn a generalizable latent mapping model, either via least squares or by directly optimizing the Kantorovich dual objective, so that new OT plans can be rapidly approximated from past experience. The approach is not only markedly more efficient but also structure-agnostic (e.g., independent of the number of atoms in discrete measures) while retaining high accuracy, and it applies to MNIST digit transport, color transfer, supply-demand transportation on the sphere, and mini-batch conditional flow matching.

Link: https://arxiv.org/abs/2604.15114
Authors: Minh-Phuc Truong,Khai Nguyen
Affiliations: Unknown
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 26 pages, 11 figures, 10 tables

Click to view abstract

Abstract:We propose a novel amortized optimization method for predicting optimal transport (OT) plans across multiple pairs of measures by leveraging Kantorovich potentials derived from sliced OT. We introduce two amortization strategies: regression-based amortization (RA-OT) and objective-based amortization (OA-OT). In RA-OT, we formulate a functional regression model that treats Kantorovich potentials from the original OT problem as responses and those obtained from sliced OT as predictors, and estimate these models via least-squares methods. In OA-OT, we estimate the parameters of the functional model by optimizing the Kantorovich dual objective. In both approaches, the predicted OT plan is subsequently recovered from the estimated potentials. As amortized OT methods, both RA-OT and OA-OT enable efficient solutions to repeated OT problems across different measure pairs by reusing information learned from prior instances to rapidly approximate new solutions. Moreover, by exploiting the structure provided by sliced OT, the proposed models are more parsimonious, independent of specific structures of the measures, such as the number of atoms in the discrete case, while achieving high accuracy. We demonstrate the effectiveness of our approaches on tasks including MNIST digit transport, color transfer, supply-demand transportation on spherical data, and mini-batch OT conditional flow matching.
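
The RA-OT estimation step — least-squares regression from sliced-OT potentials (predictors) to full Kantorovich potentials (responses) — can be sketched on synthetic data. The data-generating process below is purely illustrative; only the regression mechanics mirror the abstract.

```python
import numpy as np

# Toy sketch of the RA-OT regression idea: potentials from cheap sliced OT act
# as predictors, the Kantorovich potentials of the full OT problems as
# responses, fitted across many problem instances by least squares. Synthetic
# matrices stand in for real potentials here.
rng = np.random.default_rng(1)
n_instances, n_points = 50, 20
sliced_pot = rng.normal(size=(n_instances, n_points))          # predictors
true_map = rng.normal(size=(n_points, n_points)) / n_points    # unknown link
full_pot = sliced_pot @ true_map + 0.01 * rng.normal(size=(n_instances, n_points))

# Fit the functional model once, then reuse it on new instances ("amortize").
coef, *_ = np.linalg.lstsq(sliced_pot, full_pot, rcond=None)
pred_pot = sliced_pot @ coef                                   # fast prediction
mse = float(np.mean((pred_pot - full_pot) ** 2))
```

Once `coef` is fitted, a new measure pair only needs a cheap sliced-OT solve and a matrix product, which is the amortization payoff the paper targets.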

[AI-130] Temporal Cross-Modal Knowledge-Distillation-Based Transfer-Learning for Gas Turbine Vibration Fault Detection

[Quick Read]: This paper addresses two challenges in early fault detection (FD) for critical industrial assets such as gas turbines: the trade-off between the architectural complexity of deep learning models and real-time constraints, and the loss of temporal context caused by restricted vibration signal windows. The key to the solution is a Temporal Cross-Modal Knowledge-Distillation Transfer-Learning (TCMKDTL) framework: a "privileged" teacher model trained on wide temporal windows containing both past and future information distills latent feature knowledge into a compact student model, while robust pre-training on benchmark datasets (e.g., CWRU) followed by adaptation to target industrial data mitigates data scarcity and domain shift, enabling accurate, resource-efficient unsupervised anomaly detection.

Link: https://arxiv.org/abs/2604.14766
Authors: Ali Bagheri Nejad,Mahdi Aliyari-Shoorehdeli,Abolfazl Hasanzadeh
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Preventing machine failure is inherently superior to reactive remediation, particularly for critical assets like gas turbines, where early fault detection (FD) is a cornerstone of industrial sustainability. However, modern deep learning-based FD models often face a significant trade-off between architectural complexity and real-time operational constraints, often hindered by a lack of temporal context within restricted vibration signal windows. To address these challenges, this study proposes a Temporal Cross-Modal Knowledge-Distillation Transfer-Learning (TCMKDTL) framework. The framework employs a “privileged” teacher model trained on expansive temporal windows incorporating both past and future signal context to distill latent feature-based knowledge into a compact student model. To mitigate issues of data scarcity and domain shift, the framework leverages robust pre-training on benchmark datasets (such as CWRU) followed by adaptation to target industrial data. Extensive evaluation using experimental and industrial gas turbine (MGT-40) datasets demonstrates that TCMKDTL achieves superior feature separability and diagnostic accuracy compared to conventional pre-trained architectures. Ultimately, this approach enables high-performance, unsupervised anomaly detection suitable for deployment on resource-constrained industrial hardware.
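
The feature-based distillation at the heart of TCMKDTL can be illustrated with a toy loss that combines a feature-matching term against the privileged teacher with the student's own task loss. The weighting scheme and toy inputs are assumptions, not values from the paper.

```python
import numpy as np

# Toy sketch of feature-based knowledge distillation: the compact student
# matches the privileged teacher's latent features while also minimizing its
# own classification loss. Alpha and the toy data are illustrative.
def distillation_loss(student_feat, teacher_feat, student_logits, labels, alpha=0.5):
    # Feature-matching term: transfer latent knowledge from the teacher.
    feat_loss = np.mean((student_feat - teacher_feat) ** 2)
    # Cross-entropy task term on the student's own predictions.
    probs = np.exp(student_logits) / np.sum(np.exp(student_logits), axis=1, keepdims=True)
    task_loss = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    return alpha * feat_loss + (1 - alpha) * task_loss

student_feat = np.zeros((2, 8))
teacher_feat = np.ones((2, 8))          # teacher saw a wider temporal window
logits = np.array([[5.0, 0.0], [0.0, 5.0]])
labels = np.array([0, 1])
loss = distillation_loss(student_feat, teacher_feat, logits, labels)
```

The feature term dominates here because the toy student's features are far from the teacher's while its predictions are already confident and correct.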

[AI-131] Mamba-SSM with LLM Reasoning for Biomarker Discovery: Causal Feature Refinement via Chain-of-Thought Gene Evaluation ICLR2026

[Quick Read]: This paper aims to solve the problem that gene candidate lists produced by deep sequence models are contaminated by tissue-composition confounders, which degrade downstream classifiers. The key to the solution is using large language model (LLM) chain-of-thought (CoT) reasoning to filter genes extracted by gradient saliency, enabling more effective feature selection. Concretely, a Mamba state-space model (SSM) is trained on TCGA-BRCA RNA-seq data to obtain the top-50 salient genes, and DeepSeek-R1 selects a final 17-gene set through structured CoT reasoning. The filtered set achieves a classification AUC of 0.927 with 294x fewer features, outperforming both the raw 50-gene set (AUC 0.832) and a 5,000-gene variance baseline (AUC 0.903). However, validation shows only 35.3% of the selected genes are known breast cancer (BRCA) biomarkers, suggesting "selective faithfulness": targeted confounder removal alone is sufficient for substantial performance gains, without comprehensive recall of known biomarkers.

Link: https://arxiv.org/abs/2604.14334
Authors: Pushpa Kumar Balan,Aijing Feng
Affiliations: Unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures. Accepted at ICLR 2026 Workshop on Logical Reasoning of Large Language Models

Click to view abstract

Abstract:Gradient saliency from deep sequence models surfaces candidate biomarkers efficiently, but the resulting gene lists are contaminated by tissue-composition confounders that degrade downstream classifiers. We study whether LLM chain-of-thought (CoT) reasoning can faithfully filter these confounders, and whether reasoning quality drives downstream performance. We train a Mamba SSM on TCGA-BRCA RNA-seq and extract the top-50 genes by gradient saliency; DeepSeek-R1 evaluates every candidate with structured CoT to produce a final 17-gene set. The raw 50-gene saliency set (no LLM) performs worse than a 5,000-gene variance baseline (AUC 0.832 vs. 0.903), while the LLM-filtered set surpasses it (AUC 0.927), using 294x fewer features. A faithfulness audit (COSMIC CGC, OncoKB, PAM50) reveals only 6 of 17 selected genes (35.3%) are validated BRCA biomarkers, yet 10 of 16 known BRCA genes in the input were missed - including FOXA1. This gap between downstream performance and reasoning faithfulness suggests selective faithfulness: targeted confounder removal is sufficient for performance gains even without comprehensive recall.

[AI-132] Magnitude Is All You Need? Rethinking Phase in Quantum Encoding of Complex SAR Data

[Quick Read]: This paper investigates how to effectively encode the complex-valued information (magnitude and phase) of Synthetic Aperture Radar (SAR) data to improve quantum machine learning (QML) performance on automatic target recognition (ATR). The core challenge: although SAR data is inherently complex-valued and QML models operate in complex Hilbert spaces, it is unclear whether phase information should be fully exploited. The key contribution is a systematic comparison of five quantum encoding strategies that reveals a coupling between encoding and model architecture: in hybrid quantum-classical architectures, magnitude-only encoding performs best (e.g., 99.57% accuracy on a 3-class task) and adding phase is unhelpful or even harmful, whereas in purely quantum architectures phase information is essential for building discriminative representations, contributing up to a 21.65% improvement. This shows that the value of phase information is not inherent to the data but determined by the model architecture, underscoring the importance of encoding-architecture co-design in the NISQ era.

Link: https://arxiv.org/abs/2604.14229
Authors: Sakthi Prabhu Gunasekar,Prasanna Kumar R
Affiliations: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 8 pages, 4 figures. Under review for IEEE QCE 2026

Click to view abstract

Abstract:Synthetic Aperture Radar (SAR) data is inherently complex-valued, while quantum machine learning (QML) models naturally operate in complex Hilbert spaces. This apparent alignment suggests that incorporating both magnitude and phase information into quantum encoding should improve performance in SAR Automatic Target Recognition (ATR). In this work, we systematically evaluate this assumption by comparing five quantum encoding strategies: magnitude-only, joint complex, I/Q-based, preprocessed phase, and pure quantum, under a unified experimental framework on the MSTAR benchmark dataset. Contrary to expectation, we observe a consistent pattern: in hybrid quantum-classical architectures, magnitude-only encoding outperforms all complex-valued strategies, achieving 99.57% accuracy on a 3-class task and 71.19% on an 8-class task, while phase-aware methods provide negligible (~0%) or negative improvements. In contrast, in purely quantum architectures with only 184-224 trainable parameters and no classical components, phase information becomes essential, contributing up to 21.65% improvement in accuracy. These results reveal that the utility of phase information is not inherent to the data, but depends critically on the model architecture. Hybrid models rely on classical components that compensate for missing phase information, whereas purely quantum models require phase to construct discriminative representations. Our findings provide practical design guidelines for encoding complex-valued data in QML and highlight the importance of encoding-architecture co-design in the NISQ era.

[AI-133] Ollivier-Ricci Curvature of Riemannian Manifolds and Directed Graphs with Applications to Graph Neural Networks

[Quick Read]: This work aims to generalize the classical notion of Ricci curvature on Riemannian manifolds to metric spaces and ultimately to graphs, and to establish theoretical connections with discrete structures. The key is the Ollivier-Ricci curvature, defined via the 1-Wasserstein distance and optimal transport theory; within this framework, properties of Ricci curvature in the continuous setting (such as the Bonnet-Myers theorem and the Levy-Gromov inequality) are extended to discrete graph structures and further to directed graphs, providing computable, interpretable geometric tools for network science and graph machine learning.

Link: https://arxiv.org/abs/2604.14211
Authors: Eleanor Wiesler
Affiliations: Unknown
Subjects: Differential Geometry (math.DG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Combinatorics (math.CO)
Comments:

Click to view abstract

Abstract:This thesis is an exposition of Ollivier-Ricci Curvature of metric spaces as introduced by Yann Ollivier, which is based upon the 1-Wasserstein Distance and optimal transport theory. We present some of the major results and proofs that connect Ollivier-Ricci curvature with classical Ricci curvature of Riemannian manifolds, including extensions of various theoretical bounds and theorems such as Bonnet-Myers and Levy-Gromov. Then we shift to results introduced by Lin-Lu-Yau on an extension of Ollivier-Ricci curvature on graphs, as well as the work of Jost-Liu on proving various combinatorial bounds for graph Ollivier-Ricci curvature. At the end of this thesis we present novel ideas and proofs regarding extensions of these results to directed graphs, and finally applications of graph-based Ollivier-Ricci curvature to various algorithms in network science and graph machine learning.
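
The central quantity, the Ollivier-Ricci curvature κ(x, y) = 1 − W1(μ_x, μ_y)/d(x, y), can be computed exactly on a path graph, where the 1-Wasserstein distance reduces to a sum of CDF differences because the metric is the line metric. General graphs require a linear program; this 1-D shortcut is only valid on paths, whose interior edges are known to be flat (zero curvature) for idleness α = 0.

```python
import numpy as np

# Ollivier-Ricci curvature on a path graph 0-1-...-(n-1).
def neighbor_measure(n, x, alpha=0.0):
    """Lazy random-walk measure: mass alpha at x, the rest uniform on neighbors."""
    mu = np.zeros(n)
    nbrs = [v for v in (x - 1, x + 1) if 0 <= v < n]
    mu[x] = alpha
    for v in nbrs:
        mu[v] += (1 - alpha) / len(nbrs)
    return mu

def ollivier_ricci_path(n, x, y, alpha=0.0):
    mu, nu = neighbor_measure(n, x, alpha), neighbor_measure(n, y, alpha)
    # On the line, W1 is the L1 distance between CDFs.
    w1 = np.sum(np.abs(np.cumsum(mu - nu)))
    return 1.0 - w1 / abs(x - y)

kappa = ollivier_ricci_path(4, 1, 2)   # interior edge of a 4-node path
```

The result is 0, matching the known flatness of path-graph interior edges and giving a concrete sanity check for any general-purpose (LP-based) implementation.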

[AI-134] Bridging scalp and intracranial EEG in BCI via pretrained neural representations and geometric constraint embedding

[Quick Read]: This paper addresses the limited signal quality of non-invasive electroencephalography (EEG) in brain-computer interfaces (BCIs): EEG offers high temporal resolution, low cost, and portability, but its signal-to-noise ratio and spatial resolution fall far short of intracranial EEG (iEEG), whose invasiveness in turn restricts clinical use. The key to the solution is a unified data- and prior-knowledge-driven framework that, following the principle that "geometric structure dictates function", maps static cortical anatomy to dynamic constraints on neural signal propagation and, combined with general-purpose neural representations extracted by a pre-trained large model, explicitly models signal propagation in the brain. Enhanced EEG signals are then synthesized through a multidimensional representation diffusion process, effectively recovering neural activity patterns lost during propagation. The results indicate that the performance ceiling of BCIs is constrained not only by acquisition hardware but also by how deeply the generative model resolves the mechanisms of neural signal propagation.

Link: https://arxiv.org/abs/2604.14202
Authors: Yihang Dong,Changhong Jing,Shuqiang Wang
Affiliations: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Electroencephalography (EEG) has become one of the key modalities underpinning brain-computer interfaces (BCIs) due to its high temporal resolution, rapid responsiveness, non-invasiveness, low cost, and portability. However, EEG signals are substantially inferior to intracranial EEG (iEEG) in signal-to-noise ratio and local spatial resolution, whereas iEEG suffers from extremely limited clinical accessibility owing to its invasive nature, hindering widespread application. To address this challenge, this study proposes a unified data-and prior knowledge-driven framework for EEG-iEEG representational enhancement. Guided by the principle that “geometric structure dictates function”, the framework maps static cortical anatomy onto dynamic constraints governing neural signal propagation and integrates general-purpose neural representations extracted by a pre-trained large EEG model to explicitly model signal transmission through the brain. Enhanced EEG signals are then synthesized via a multidimensional representation diffusion process. Numerous experimental results demonstrate that the generated enhanced EEG signals effectively recover the neural activity patterns lost during propagation through the brain. This finding indicates that the performance ceiling of BCIs is constrained not only by acquisition hardware but also by the depth to which the generative model resolves the mechanisms of neural signal propagation. Collectively, the proposed framework provides a viable pathway toward acquiring high-fidelity neural signals at low cost.

[AI-135] Retina gap junctions support the robust perception by warping neural representational geometries along the visual hierarchy

[Quick Read]: This paper seeks to explain why deep neural networks (DNNs) are vulnerable to carefully crafted adversarial noise while human vision is highly robust, focusing on the role of the early visual system and its influence on the brain manifold. The key to the solution is a biological hybrid model that couples a G-filter based on retina gap junctions with a DNN to emulate the denoising function of the human early visual system. Geometric analysis shows that this model has a distinctive, highly nonlinear 2D decision boundary and a lower-curvature manifold, which accounts for its strong robustness. Furthermore, borrowing the Neural Ordinary Differential Equation (ODE) framework, the authors dissect the G-filter's internal mechanism: its decision boundary evolves gradually over time toward a steady state modulated by gap-junction conductance, indicating that the influence of retina gap junctions on the brain manifold is a gradually evolving process.

Link: https://arxiv.org/abs/2604.14200
Authors: Yang Yue,Shenjian Zhang,Yonghong Tian,Kai Du,Tiejun Huang
Affiliations: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: 32 pages, 6 figures

Click to view abstract

Abstract:Deep Neural Networks (DNNs) are vulnerable to elaborately designed adversarial noise, although they have achieved extraordinary success in many tasks. Compared with DNNs, the human visual system is highly robust. However, it is unclear how the human visual system defends against adversarial attacks, especially the role of the early visual system and its influence on the brain manifold. Due to retina gap junctions being crucial for the denoising function in the early visual system, we combine a retina gap junction-based filter, G-filter, with DNN as an abstract human visual system model called the biological hybrid model. We adopt this model to study the defense performance of retina gap junctions and their impact on the brain manifold. Compared with other defense methods, the biological hybrid model is more robust and can be further improved by introducing noise during training. Next, we analyze the manifold and its decision boundary of the biological hybrid model from a geometry perspective. The results show that the biological hybrid model has a unique 2D decision boundary with high nonlinearity and a lower curvature of the decision boundary of the manifold compared to other defense methods. The transforming manifold may account for the high robustness of the biological hybrid model. Finally, to dissect G-filter and clarify its internal mechanism, we borrow the Neural Ordinary Differential Equation (ODE) concept and rewrite G-filter into an equivalent recurrent neural network. The results show that the decision boundary of the model’s manifold will gradually change with time and eventually reach a steady state, which is modulated by gap junction conductance, revealing the influence of retina gap junctions on the brain manifold is a gradually evolving process.

[AI-136] PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

[Quick Read]: This paper addresses the problem of forecasting real-world events from live market signals, whose core challenge is fusing qualitative news with quantitative order-book dynamics under strict temporal constraints — a multimodal, time-sensitive setting existing benchmarks fail to capture. The key to the solution is PolyBench, a multimodal benchmark built on Polymarket that records synchronized point-in-time snapshots of 38,666 binary prediction markets together with Central Limit Order Book (CLOB) states and real-time news streams, precisely aligning market state, semantic information, and execution logic. On this basis, the authors evaluate 36,165 predictions from seven mainstream Large Language Models (LLMs) under identical timestamp-locked conditions, using financial metrics such as Confidence-Weighted Return (CWR), Annualized Percentage Yield (APY), and the Sharpe ratio. The results expose a marked gap between the models' high stated confidence and their actual profitability, establishing a contamination-proof, financially interpretable evaluation standard for LLMs on financial decision-making tasks.

Link: https://arxiv.org/abs/2604.14199
Authors: Pu Cheng,Juncheng Liu,Yunshen Long
Affiliations: Unknown
Subjects: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 4 figures, 6 tables

Click to view abstract

Abstract:Predicting real-world events from live market signals demands systems that fuse qualitative news with quantitative order-book dynamics under strict temporal discipline – a challenge existing benchmarks fail to capture. We present PolyBench, a multimodal benchmark derived from Polymarket that records point-in-time cross-sections of 38,666 binary prediction markets spanning 4,997 events, synchronously coupling each snapshot with a Central Limit Order Book (CLOB) state and a real-time news stream. Using PolyBench, we evaluate seven state-of-the-art Large Language Models – spanning open- and closed-source families – generating 36,165 predictions under identical, timestamp-locked market states collected between February 6 and 12, 2026. Our multidimensional framework assesses directional accuracy, our proposed Confidence-Weighted Return (CWR), Annualized Percentage Yield (APY), and Sharpe ratio via realistic order-book execution simulation. The results reveal a pronounced performance divergence: only two of seven models achieve positive financial returns – MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR – while the remaining five incur losses despite uniformly high stated confidence. These findings highlight the gap between surface-level language fluency and genuine probabilistic reasoning under live market uncertainty, and establish PolyBench as a contamination-proof, financially-grounded evaluation standard for future LLM research. Our dataset and code are available at this https URL.

[AI-137] An Edge-Cloud Collaborative Architecture for Proactive Elderly Care: Real-Time Risk Assessment and Three-Level Emergency Response

[Quick Read]: Against the backdrop of global population aging, this paper addresses the limitations of existing cloud-centric platforms for monitoring independently living elderly people: high latency, privacy risks, and single-channel alert mechanisms that lack scalability and context awareness. The key to the solution is an edge-cloud collaborative architecture that achieves efficient, privacy-preserving, fast-response monitoring through real-time multi-modal sensor fusion, a four-dimensional risk assessment model, and a three-level emergency response system. At the edge, a weighted multi-modal fusion algorithm integrates data from five sensor types with confidence propagation, producing a unified risk score that combines fall probability, physiological indicators, behavioral patterns, and sensor anomalies. Dynamic thresholds trigger a three-tier notification chain involving family members, community doctors, and nearby volunteers, completing risk assessment and alerting within an end-to-end latency of under three seconds. Experiments on the CASAS, MIMIC-III, and SisFall datasets show 91% activity recognition accuracy and an 84% anomaly detection F1-score, with sub-100 ms inference latency on Raspberry Pi 4 gateways while keeping raw data local to preserve privacy.

Link: https://arxiv.org/abs/2604.14154
Authors: Lijie Zhou,Luran Wang
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Click to view abstract

Abstract:The rapid aging of global populations has created an urgent need for intelligent healthcare monitoring systems to ensure the safety of elderly individuals living independently. Existing cloud-centric platforms face critical limitations, including high latency unsuitable for emergency response, privacy risks from continuous transmission of sensitive data, and limited, single-channel alert mechanisms lacking scalability and context awareness. This paper proposes an edge-cloud collaborative architecture that addresses these challenges through real-time multi-modal sensor fusion, a four-dimensional risk assessment model, and a three-level emergency response system. The framework adopts a five-layer design - device, edge, service, data, and application layers - enabling real-time risk evaluation with end-to-end alert latency under three seconds. At the edge, a weighted multi-modal fusion algorithm integrates data from five sensor types with confidence propagation. A unified risk score is generated by combining fall probability, physiological indicators, behavioral patterns, and sensor anomaly metrics. Based on dynamic thresholds, a three-tier notification system coordinates responses among family members, community doctors, and nearby volunteers. Experiments on CASAS, MIMIC-III, and SisFall datasets show that the approach achieves 91% activity recognition accuracy and an 84% anomaly detection F1-score, outperforming single-sensor methods. Deployment on Raspberry Pi 4 gateways demonstrates sub-100 ms inference latency while preserving privacy by keeping raw data local. This architecture advances practical, privacy-preserving, and responsive elderly care systems.
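
The weighted multi-modal fusion and three-tier response described above can be sketched in a few lines. The modality names, weights, and thresholds below are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of weighted multi-modal risk fusion plus a three-tier
# response mapping (family -> community doctor -> nearby volunteers).
def risk_score(signals, weights):
    """signals/weights: dicts over modalities, each signal value in [0, 1]."""
    total_w = sum(weights[m] for m in signals)
    return sum(weights[m] * signals[m] for m in signals) / total_w

def response_level(score, thresholds=(0.3, 0.6, 0.85)):
    # Each threshold crossed escalates the notification one tier.
    tiers = ["none", "family", "doctor", "volunteers"]
    level = sum(score >= t for t in thresholds)
    return tiers[level]

signals = {"fall": 0.9, "physiology": 0.7, "behavior": 0.4, "sensor_anomaly": 0.2}
weights = {"fall": 0.4, "physiology": 0.3, "behavior": 0.2, "sensor_anomaly": 0.1}
score = risk_score(signals, weights)     # 0.67 with these toy inputs
level = response_level(score)            # escalates to the "doctor" tier
```

Normalizing by the total weight of the modalities actually present makes the score degrade gracefully when a sensor drops out, which matters for an edge deployment.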

[AI-138] Gaussian Process Regression of Steering Vectors With Physics-Aware Deep Composite Kernels for Augmented Listening

[Quick Read]: This paper studies continuous representations of microphone-array steering vectors over frequency and microphone/source positions, enabling user-parameterized control of the reproduced sound field in augmented listening scenarios such as spatial filtering and binaural rendering. Traditional algebraic representations that assume an idealized environment cannot handle the scattering effects of the sound field, and existing physics-aware deep learning methods for deterministic super-resolution overfit because uncertainty is non-uniform over the measurement space. The key to the solution is combining the expressive power of neural fields (NFs) with the probabilistic framework of Gaussian processes (GPs) and proposing a physics-aware composite kernel that models both directional incoming waves and subsequent scattering. Under data-scarce conditions this markedly improves generalization: in downstream tasks such as speech enhancement and binaural rendering, near-oracle performance is attained with fewer than one tenth of the measurements.

Link: https://arxiv.org/abs/2509.02571
Authors: Diego Di Carlo(RIKEN AIP),Shoichi Koyama(UTokyo),Nugraha Aditya Arie(RIKEN AIP),Fontaine Mathieu(S2A, IDS),Bando Yoshiaki(AIST),Yoshii Kazuyoshi(RIKEN AIP)
Affiliations: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:This paper investigates continuous representations of steering vectors over frequency and microphone/source positions for augmented listening (e.g., spatial filtering and binaural rendering), enabling user-parameterized control of the reproduced sound field. Steering vectors have typically been used for representing the spatial response of a microphone array as a function of the look-up direction. The basic algebraic representation of these quantities assuming an idealized environment cannot deal with the scattering effect of the sound field. One may thus collect a discrete set of real steering vectors measured in dedicated facilities and super-resolve (i.e., upsample) them. Recently, physics-aware deep learning methods have been effectively used for this purpose. Such deterministic super-resolution, however, suffers from the overfitting problem due to the non-uniform uncertainty over the measurement space. To solve this problem, we integrate an expressive representation based on the neural field (NF) into the principled probabilistic framework based on the Gaussian process (GP). Specifically, we propose a physics-aware composite kernel that models the directional incoming waves and the subsequent scattering effect. Our comprehensive comparative experiment showed the effectiveness of the proposed method under data insufficiency conditions. In downstream tasks such as speech enhancement and binaural rendering using the simulated data of the SPEAR challenge, the oracle performances were attained with less than ten times fewer measurements.

Machine Learning

[LG-0] Benchmarking Optimizers for MLPs in Tabular Deep Learning

Link: https://arxiv.org/abs/2604.15297
Authors: Yury Gorishniy,Ivan Rubachev,Dmitrii Feoktistov,Artem Babenko
Subjects: Machine Learning (cs.LG)
Comments: Code: this https URL

Click to view abstract

Abstract:MLP is a heavily used backbone in modern deep learning (DL) architectures for supervised learning on tabular data, and AdamW is the go-to optimizer used to train tabular DL models. Unlike architecture design, however, the choice of optimizer for tabular DL has not been examined systematically, despite new optimizers showing promise in other domains. To fill this gap, we benchmark a broad set of optimizers on a collection of tabular datasets for training MLP-based models in the standard supervised learning setting under a shared experiment protocol. Our main finding is that the Muon optimizer consistently outperforms AdamW, and thus should be considered a strong and practical choice for practitioners and researchers, if the associated training efficiency overhead is affordable. Additionally, we find exponential moving average of model weights to be a simple yet effective technique that improves AdamW on vanilla MLPs, though its effect is less consistent across model variants.
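
The exponential moving average (EMA) of model weights mentioned at the end can be sketched generically: keep a shadow copy of the parameters and blend it toward the live weights after each optimizer step. The decay value below is illustrative, not the paper's setting.

```python
# Generic sketch of weight EMA: the shadow parameters, not the live ones,
# are used for evaluation. A decay of 0.5 is chosen only to make the
# arithmetic easy to follow; practical values are much closer to 1.
class WeightEMA:
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = dict(params)   # shadow copy initialized from the model

    def update(self, params):
        # Called once after every optimizer step.
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * v

params = {"w": 0.0}
ema = WeightEMA(params, decay=0.5)
params["w"] = 1.0        # pretend the optimizer just updated the weight
ema.update(params)       # shadow: 0.5
ema.update(params)       # shadow: 0.75
```

The shadow weights lag the live ones, smoothing out optimizer noise, which is why EMA often helps AdamW-trained models at evaluation time.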

[LG-1] How Embeddings Shape Graph Neural Networks: Classical vs Quantum-Oriented Node Representations IJCNN2026

Link: https://arxiv.org/abs/2604.15273
Authors: Nouhaila Innan,Antonello Rosato,Alberto Marchisio,Muhammad Shafique
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 6 pages. Accepted at IJCNN 2026

Click to view abstract

Abstract:Node embeddings act as the information interface for graph neural networks, yet their empirical impact is often reported under mismatched backbones, splits, and training budgets. This paper provides a controlled benchmark of embedding choices for graph classification, comparing classical baselines with quantum-oriented node representations under a unified pipeline. We evaluate two classical baselines alongside quantum-oriented alternatives, including a circuit-defined variational embedding and quantum-inspired embeddings computed via graph operators and linear-algebraic constructions. All variants are trained and tested with the same backbone, stratified splits, identical optimization and early stopping, and consistent metrics. Experiments on five different TU datasets and on QM9 converted to classification via target binning show clear dataset dependence: quantum-oriented embeddings yield the most consistent gains on structure-driven benchmarks, while social graphs with limited node attributes remain well served by classical baselines. The study highlights practical trade-offs between inductive bias, trainability, and stability under a fixed training budget, and offers a reproducible reference point for selecting quantum-oriented embeddings in graph learning.

[LG-2] Optimal last-iterate convergence in matrix games with bandit feedback using the log-barrier

链接: https://arxiv.org/abs/2604.15242
作者: Come Fiegel,Pierre Menard,Tadashi Kozuno,Michal Valko,Vianney Perchet
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the problem of learning minimax policies in zero-sum matrix games. Fiegel et al. (2025) recently showed that achieving last-iterate convergence in this setting is harder when the players are uncoupled, by proving a lower bound of Ω(t^{-1/4}) on the exploitability gap. Several online mirror descent algorithms have been proposed in the literature for this problem, but none has yet attained this rate. We show that the use of a log-barrier regularization, along with a dual-focused analysis, achieves this Õ(t^{-1/4}) convergence with high probability. We additionally extend our idea to the setting of extensive-form games, proving a bound with the same rate.
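The exploitability gap that these convergence rates measure can be computed directly for any strategy pair in a matrix game. A minimal sketch (not from the paper; the payoff matrix below is a toy matching-pennies example):

```python
def exploitability(A, x, y):
    """Exploitability gap of mixed strategies (x, y) in a zero-sum matrix game.

    A[i][j] is the payoff to the row player; the gap sums how much each
    player could gain by best-responding, and is zero exactly at equilibrium.
    """
    n, m = len(A), len(A[0])
    value = sum(x[i] * A[i][j] * y[j] for i in range(n) for j in range(m))
    best_row = max(sum(A[i][j] * y[j] for j in range(m)) for i in range(n))
    best_col = min(sum(x[i] * A[i][j] for i in range(n)) for j in range(m))
    return (best_row - value) + (value - best_col)
```

At the uniform equilibrium of matching pennies the gap is zero; any deviation makes it positive, which is the quantity the lower and upper bounds above track over time.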

[LG-3] A Nonlinear Separation Principle: Applications to Neural Networks Control and Learning

链接: https://arxiv.org/abs/2604.15238
作者: Anand Gokhale,Anton V. Proskurnikov,Yu Kawano,Francesco Bullo
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: arXiv admin note: text overlap with arXiv:2604.00119

点击查看摘要

Abstract:This paper investigates continuous-time and discrete-time firing-rate and Hopfield recurrent neural networks (RNNs), with applications in nonlinear control design and implicit deep learning. First, we introduce a nonlinear separation principle that guarantees global exponential stability for the interconnection of a contracting state-feedback controller and a contracting observer, alongside parametric extensions for robustness and equilibrium tracking. Second, we derive sharp linear matrix inequality (LMI) conditions that guarantee the contractivity of both firing-rate and Hopfield neural network architectures. We establish structural relationships among these certificates, demonstrating that continuous-time models with monotone non-decreasing activations maximize the admissible weight space, and extend these stability guarantees to interconnected systems and Graph RNNs. Third, we combine our separation principle and LMI framework to solve the output reference tracking problem for RNN-modeled plants. We provide LMI synthesis methods for feedback controllers and observers, and rigorously design a low-gain integral controller to eliminate steady-state error. Finally, we derive an exact, unconstrained algebraic parameterization of our contraction LMIs to design highly expressive implicit neural networks, achieving competitive accuracy and parameter efficiency on standard image classification benchmarks.

[LG-4] RL-STPA: Adapting System-Theoretic Hazard Analysis for Safety-Critical Reinforcement Learning

链接: https://arxiv.org/abs/2604.15201
作者: Steven A. Senczyszyn,Timothy C. Havens,Nathaniel Rice,Jason E. Summers,Benjamin D. Werner,Benjamin J. Schumeg
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As reinforcement learning (RL) deployments expand into safety-critical domains, existing evaluation methods fail to systematically identify hazards arising from the black-box nature of neural-network-enabled policies and distributional shift between training and deployment. This paper introduces Reinforcement Learning System-Theoretic Process Analysis (RL-STPA), a framework that adapts conventional STPA's systematic hazard analysis to address RL's unique challenges through three key contributions: hierarchical subtask decomposition using both temporal phase analysis and domain expertise to capture emergent behaviors, coverage-guided perturbation testing that explores the sensitivity of state-action spaces, and iterative checkpoints that feed identified hazards back into training through reward shaping and curriculum design. We demonstrate RL-STPA in the safety-critical test case of autonomous drone navigation and landing, revealing potential loss scenarios that can be missed by standard RL evaluations. The proposed framework provides practitioners with a toolkit for systematic hazard analysis, quantitative metrics for safety coverage assessment, and actionable guidelines for establishing operational safety bounds. While RL-STPA cannot provide formal guarantees for arbitrary neural policies, it offers a practical methodology for systematically evaluating and improving RL safety and robustness in safety-critical applications where exhaustive verification methods remain intractable.

[LG-5] One-shot learning for the complex dynamical behaviors of weakly nonlinear forced oscillators

链接: https://arxiv.org/abs/2604.15181
作者: Teng Ma,Luca Rosafalco,Wei Cui,Lin Zhao,Attilio Frangi
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: 48 pages, 16 figures, graphical abstract, highlights

点击查看摘要

Abstract:Extrapolative prediction of complex nonlinear dynamics remains a central challenge in engineering. This study proposes a one-shot learning method to identify global frequency-response curves from a single excitation time history by learning governing equations. We introduce MEv-SINDy (Multi-frequency Evolutionary Sparse Identification of Nonlinear Dynamics) to infer the governing equations of non-autonomous and multi-frequency systems. The methodology leverages the Generalized Harmonic Balance (GHB) method to decompose complex forced responses into a set of slowly varying evolution equations. We validate the capabilities of MEv-SINDy on two critical Micro-Electro-Mechanical Systems (MEMS): a nonlinear beam resonator and a MEMS micromirror. Our results show that the model trained on a single point accurately predicts softening/hardening effects and jump phenomena across a wide range of excitation levels. This approach significantly reduces the data acquisition burden for the characterization and design of nonlinear microsystems.

[LG-6] Assessing the Potential of Masked Autoencoder Foundation Models in Predicting Downhole Metrics from Surface Drilling Data

链接: https://arxiv.org/abs/2604.15169
作者: Aleksander Berezowski,Hassan Hassanzadeh,Gouri Ginde
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Oil and gas drilling operations generate extensive time-series data from surface sensors, yet accurate real-time prediction of critical downhole metrics remains challenging due to the scarcity of labelled downhole measurements. This systematic mapping study reviews thirteen papers published between 2015 and 2025 to assess the potential of Masked Autoencoder Foundation Models (MAEFMs) for predicting downhole metrics from surface drilling data. The review identifies eight commonly collected surface metrics and seven target downhole metrics. Current approaches predominantly employ neural network architectures such as artificial neural networks (ANNs) and long short-term memory (LSTM) networks, yet no studies have explored MAEFMs despite their demonstrated effectiveness in time-series modeling. MAEFMs offer distinct advantages through self-supervised pre-training on abundant unlabeled data, enabling multi-task prediction and improved generalization across wells. This research establishes that MAEFMs represent a technically feasible but unexplored opportunity for drilling analytics, recommending future empirical validation of their performance against existing models and exploration of their broader applicability in oil and gas operations.

[LG-7] When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence

链接: https://arxiv.org/abs/2604.15167
作者: Marcus Armstrong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Post-training quantization (PTQ) assumes that a well-converged model is a quantization-ready model. We show this assumption fails in a structured, measurable, and previously uncharacterized way. Using a calibration-free per-group INT4 probe applied to all 154 publicly available Pythia-160m training checkpoints, we identify a three-phase divergence structure: a rapid-learning phase where both FP32 perplexity and quantization robustness improve together, a meta-stable plateau lasting roughly 70,000 steps where FP32 perplexity stagnates but the INT4 gap remains bounded, and an explosive divergence phase where the INT4 gap compounds from 11% to 517% while FP32 perplexity barely moves. Critically, this divergence begins not when the learning rate starts decaying, but precisely when FP32 perplexity converges: a finer-grained onset predictor that implies post-convergence weight updates, rather than decay magnitude alone, are the proximate cause. We further show that INT8 quantization is entirely immune throughout all three phases, constraining the mechanism to the coarseness of the 16-level INT4 grid specifically, and rule out weight outlier accumulation as the mechanism via direct kurtosis measurement. Finally, we conduct a controlled fork experiment from the pre-divergence checkpoint comparing three learning rate schedules (cosine continuation, SGDR warm restarts, and our proposed Oscillatory Lock-In) across nine independent runs. SGDR uniformly accelerates divergence (0/9 pairwise wins against cosine), while OLI's settled cool phases reduce the INT4 gap by 2.2 percentage points on average (t = -5.46, p < 0.0001), demonstrating that schedule amplitude calibration, not oscillation alone, determines whether perturbation helps or hurts. Our code, probe implementation, and all 154-checkpoint audit results are released publicly.
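A calibration-free per-group INT4 probe rests on standard symmetric group quantization, which can be sketched as follows (illustrative only; the symmetric [-8, 7] grid and max-abs scaling are conventional choices consistent with 16-level INT4, not necessarily the paper's exact probe):

```python
def quantize_group_int4(values):
    """Fake-quantize one weight group to symmetric INT4 and dequantize back.

    The scale is derived from the group's max absolute value, so no
    calibration data is needed; codes are clipped to the 16 levels [-8, 7].
    """
    scale = max(abs(v) for v in values) / 7.0
    if scale == 0.0:
        return list(values)  # all-zero group quantizes to itself
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return [qi * scale for qi in q]
```

Comparing perplexity of the model with and without this fake-quantization applied per group is what "INT4 gap" refers to; the coarse 16-level grid is why INT4 diverges while INT8 (256 levels) does not.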

[LG-8] FedIDM: Achieving Fast and Stable Convergence in Byzantine Federated Learning through Iterative Distribution Matching

链接: https://arxiv.org/abs/2604.15115
作者: He Yang,Dongyi Lv,Wei Xi,Song Ma,Hanlin Gu,Jizhong Zhao
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Most existing Byzantine-robust federated learning (FL) methods suffer from slow and unstable convergence. Moreover, when handling a substantial proportion of colluded malicious clients, achieving robustness typically entails compromising model utility. To address these issues, this work introduces FedIDM, which employs distribution matching to construct trustworthy condensed data for identifying and filtering abnormal clients. FedIDM consists of two main components: (1) attack-tolerant condensed data generation, and (2) robust aggregation with negative contribution-based rejection. These components exclude local updates that (1) deviate from the update direction derived from condensed data, or (2) cause a significant loss on the condensed dataset. Comprehensive evaluations on three benchmark datasets demonstrate that FedIDM achieves fast and stable convergence while maintaining acceptable model utility, under multiple state-of-the-art Byzantine attacks involving a large number of malicious clients.
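The loss-based rejection component can be sketched in a few lines (a hypothetical simplification, assuming plain-vector client updates and a fixed loss threshold; the paper combines this with direction-based filtering and condensed-data generation):

```python
def robust_aggregate(updates, loss_on_condensed, threshold):
    """Drop client updates whose loss on the trusted condensed set exceeds
    a threshold (a 'negative contribution'), then average the survivors."""
    kept = [u for u, loss in zip(updates, loss_on_condensed) if loss <= threshold]
    dim = len(updates[0])
    return [sum(u[d] for u in kept) / len(kept) for d in range(dim)]
```

Because the condensed data is constructed server-side, a malicious client cannot poison the yardstick it is judged against, which is the core of the robustness argument.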

[LG-9] Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap ISSTA2026

链接: https://arxiv.org/abs/2604.15075
作者: Naryeong Kim,Shin Yoo
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Will appear at ISSTA 2026

点击查看摘要

Abstract:Open-weight Small Language Models (SLMs) can provide faster local inference at lower financial cost, but may not achieve the same performance level as commercial Large Language Models (LLMs) that are orders of magnitude larger. Consequently, many of the latest applications of LLMs, such as software engineering agents, tend to be evaluated on larger models only, leaving the issue of improving the cost-benefit trade-off of such applications neglected. This paper proposes Atropos, a predictive early-termination analysis and hotswap technique that aims to improve the cost-benefit trade-off for LLM-based agents that use self-consistency. The core component of ATROPOS is a predictive model based on structural properties of LLM inferences: after merging multiple agentic inference paths into a graph representation, ATROPOS uses a Graph Convolutional Network (GCN) to predict whether an ongoing inference will eventually succeed or not. If an agentic task instance running on the source LLM is predicted to fail, ATROPOS subsequently performs hotswapping, i.e., migrating the ongoing inference context onto the more capable target LLM: this is feasible because LLM contexts are stateless. An empirical evaluation of ATROPOS using three recent LLM-based agents shows that ATROPOS can predict early termination of eventually failing inferences with an accuracy of 0.85 at the midpoint of the inference. Hotswapping LLMs for such inferences can convert up to 27.57% of them to be successful. Consequently, ATROPOS achieves 74.35% of the performance of closed LLMs with as low as only 23.9% of the cost.

[LG-10] Beyond the Laplacian: Doubly Stochastic Matrices for Graph Neural Networks

链接: https://arxiv.org/abs/2604.15069
作者: Zhaobo Hu,Vincent Gauthier,Mehdi Naima
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) conventionally rely on standard Laplacian or adjacency matrices for structural message passing. In this work, we substitute the traditional Laplacian with a Doubly Stochastic graph Matrix (DSM), derived from the inverse of the modified Laplacian, to naturally encode continuous multi-hop proximity and strict local centrality. To overcome the intractable O(n^3) complexity of exact matrix inversion, we first utilize a truncated Neumann series to scalably approximate the DSM, which serves as the foundation for our proposed DsmNet. Furthermore, because algebraic truncation inherently causes probability mass leakage, we introduce DsmNet-compensate. This variant features a mathematically rigorous Residual Mass Compensation mechanism that analytically re-injects the truncated tail mass into self-loops, strictly restoring row-stochasticity and structural dominance. Extensive theoretical and empirical analyses demonstrate that our decoupled architectures operate efficiently in O(K|E|) time and effectively mitigate over-smoothing by bounding Dirichlet energy decay, providing robust empirical validation on homophilic benchmarks. Finally, we establish the theoretical boundaries of the DSM on heterophilic topologies and demonstrate its versatility as a continuous structural encoding for Graph Transformers.
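The truncated Neumann series used here to sidestep exact O(n^3) inversion can be sketched directly: assuming a matrix M with spectral radius below 1, (I - M)^{-1} equals the series sum of M^k, truncated at order K (illustrative pure-Python code operating on dense lists, not the paper's sparse O(K|E|) implementation):

```python
def neumann_inverse(M, K):
    """Approximate (I - M)^{-1} by sum_{k=0}^{K} M^k.

    Valid when the spectral radius of M is below 1; the dropped tail
    sum_{k>K} M^k is the 'probability mass leakage' the paper compensates.
    """
    n = len(M)

    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    acc = [[float(i == j) for j in range(n)] for i in range(n)]  # M^0 = I
    power = [row[:] for row in acc]
    for _ in range(K):
        power = matmul(power, M)  # M^k
        acc = [[acc[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return acc
```

For a sparse graph operator, each term costs one sparse matrix product, which is where the O(K|E|) complexity quoted in the abstract comes from.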

[LG-11] DLink: Distilling Layer-wise and Dominant Knowledge from EEG Foundation Models

链接: https://arxiv.org/abs/2604.15016
作者: Jingyuan Wang,Meiyan Xu,Zhihao Jia,Chenyu Liu,Xinliang Zhou,Ziyu Jia,Yong Li,Fang Li,Junfeng Yao,Yi Ding
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:EEG foundation models (FMs) achieve strong cross-subject and cross-task generalization but impose substantial computational and memory costs that hinder deployment on embedded BCI systems. Knowledge distillation is a natural solution; however, conventional methods fail for EEG FMs because task-relevant semantics are often distributed across intermediate layers, and aggressive dimensionality reduction can distort oscillatory structure via representational collapse and aliasing. To address these challenges, we propose DLink (Distilling Layer-wise and Dominant Knowledge), a unified framework for transferring knowledge from large EEG FMs to compact students with three key innovations: (1) a dynamic Router that adaptively aggregates teacher layers to capture dominant intermediate representations; (2) an EEG MiC student with a Mimic-then-Compress pipeline, which inherits high-dimensional teacher features and then applies structured spatio-temporal compression to avoid a heavy classification head; and (3) spectral distillation that aligns teacher-student representations in the frequency domain to regularize compression and mitigate aliasing and temporal jitter. Experiments on four EEG benchmarks show that DLink enables compact students to outperform lightweight baselines while approaching fully fine-tuned FM performance at substantially lower model size and inference cost.

[LG-12] Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning

链接: https://arxiv.org/abs/2604.14974
作者: Jean-Bastien Grill,Michal Valko,Rémi Munos
类目: Machine Learning (cs.LG)
*备注: Published in Neural Information Processing Systems 2016

点击查看摘要

Abstract:You are a robot and you live in a Markov decision process (MDP) with a finite or an infinite number of transitions from state-action to next states. You got brains and so you plan before you act. Luckily, your roboparents equipped you with a generative model to do some Monte-Carlo planning. The world is waiting for you and you have no time to waste. You want your planning to be efficient. Sample-efficient. Indeed, you want to exploit the possible structure of the MDP by exploring only a subset of states reachable by following near-optimal policies. You want guarantees on sample complexity that depend on a measure of the quantity of near-optimal states. You want something, that is an extension of Monte-Carlo sampling (for estimating an expectation) to problems that alternate maximization (over actions) and expectation (over next states). But you do not want to StOP with exponential running time, you want something simple to implement and computationally efficient. You want it all and you want it now. You want TrailBlazer.

[LG-13] MLDAS: Machine Learning Dynamic Algorithm Selection for Software-Defined Networking Security

链接: https://arxiv.org/abs/2604.14957
作者: Pablo Benlloch,Oscar Romero,Antonio Leon,Jaime Lloret
类目: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 22 pages, 15 figures, 12 tables

点击查看摘要

Abstract:Network security is a critical concern in the digital landscape of today, with users demanding secure browsing experiences and protection of their personal data. This study explores the dynamic integration of Machine Learning (ML) algorithms with Software-Defined Networking (SDN) controllers to enhance network security through adaptive decision mechanisms. The proposed approach enables the system to dynamically choose the most suitable ML algorithm based on the characteristics of the observed network traffic. This work examines the role of Intrusion Detection Systems (IDS) as a fundamental component of secure communication networks and discusses the limitations of SDN-based attack detection mechanisms. The proposed framework uses adaptive model selection to maintain reliable intrusion detection under varying network conditions. The study highlights the importance of analyzing traffic-type-based metrics to define effective classification rules and enhance the performance of ML models. Additionally, it addresses the risks of overfitting and underfitting, underscoring the critical role of hyperparameter tuning in optimizing model accuracy and generalization. The central contribution of this work is an automated mechanism that adaptively selects the most suitable ML algorithm according to real-time network conditions, prioritizing detection robustness and operational feasibility within SDN environments.

[LG-14] Multi-User mmWave Beam and Rate Adaptation via Combinatorial Satisficing Bandits

链接: https://arxiv.org/abs/2604.14908
作者: Emre Özyıldırım,Barış Yaycı,Umut Eren Akturk,Cem Tekin
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study downlink beam and rate adaptation in a multi-user mmWave MISO system where multiple base stations (BSs), each using analog beamforming from finite codebooks, serve multiple single-antenna user equipments (UEs) with a unique beam per UE and discrete data transmission rates. BSs learn about transmission success based on ACK/NACK feedback. To encode service goals, we introduce a satisficing throughput threshold τ_r and cast joint beam and rate adaptation as a combinatorial semi-bandit over beam-rate tuples. Within this framework, we propose SAT-CTS, a lightweight, threshold-aware policy that blends conservative confidence estimates with posterior sampling, steering learning toward meeting τ_r rather than merely maximizing. Our main theoretical contribution provides the first finite-time regret bounds for combinatorial semi-bandits with a satisficing objective: when τ_r is realizable, we upper bound the cumulative satisficing regret to the target with a time-independent constant, and when τ_r is non-realizable, we show that SAT-CTS incurs only a finite expected transient outside committed CTS rounds, after which its regret is governed by the sum of the regret contributions of restarted CTS rounds, yielding an O((log T)^2) standard regret bound. On the practical side, we evaluate the performance via cumulative satisficing regret to τ_r alongside standard regret and fairness. Experiments with time-varying sparse multipath channels show that SAT-CTS consistently reduces satisficing regret and maintains competitive standard regret, while achieving favorable average throughput and fairness across users, indicating that feedback-efficient learning can equitably allocate beams and rates to meet QoS targets without channel state knowledge.

[LG-15] xFODE: An Explainable Fuzzy Additive ODE Framework for System Identification

链接: https://arxiv.org/abs/2604.14883
作者: Ertugrul Kececi,Tufan Kumbasar
类目: Machine Learning (cs.LG)
*备注: in IEEE Conference on Artificial Intelligence, 2026

点击查看摘要

Abstract:Recent advances in Deep Learning (DL) have strengthened data-driven System Identification (SysID), with Neural and Fuzzy Ordinary Differential Equation (NODE/FODE) models achieving high accuracy in nonlinear dynamic modeling. Yet, system states in these frameworks are often reconstructed without clear physical meaning, and input contributions to the state derivatives remain difficult to interpret. To address these limitations, we propose Explainable FODE (xFODE), an interpretable SysID framework with integrated DL-based training. In xFODE, we define states in an incremental form to provide them with physical meanings. We employ fuzzy additive models to approximate the state derivative, thereby enhancing interpretability per input. To provide further interpretability, Partitioning Strategies (PSs) are developed, enabling the training of fuzzy additive models with explainability. By structuring the antecedent space during training so that only two consecutive rules are activated for any given input, PSs not only yield lower complexity for local inference but also enhance the interpretability of the antecedent space. To train xFODE, we present a DL framework with parameterized membership function learning that supports end-to-end optimization. Across benchmark SysID datasets, xFODE matches the accuracy of NODE, FODE, and NLARX models while providing interpretable insights.

[LG-16] An Intelligent Robotic and Bio-Digestor Framework for Smart Waste Management

链接: https://arxiv.org/abs/2604.14882
作者: Radhika Khatri,Adit Tewari,Nikhil Sharma,M. B. Srinivas
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 10 figures, submitted to 7th International Conference on Smart Systems and Inventive Technology (ICSSIT 2026)

点击查看摘要

Abstract:Rapid urbanization and continuous population growth have made municipal solid waste management increasingly challenging. These challenges highlight the need for smarter and automated waste management solutions. This paper presents the design and evaluation of an integrated waste management framework that combines two connected systems, a robotic waste segregation module and an optimized bio-digestor. The robotic waste segregation system uses a MyCobot 280 Jetson Nano robotic arm along with YOLOv8 object detection and robot operating system (ROS)-based path planning to identify and sort waste in real time. It classifies waste into four different categories with high precision, reducing the need for manual intervention. After segregation, the biodegradable waste is transferred to a bio-digestor system equipped with multiple sensors. These sensors continuously monitor key parameters, including temperature, pH, pressure, and motor revolutions per minute. The Particle Swarm Optimization (PSO) algorithm, combined with a regression model, is used to dynamically adjust system parameters. This intelligent optimization approach ensures stable operation and maximizes digestion efficiency under varying environmental conditions. System testing under dynamic conditions demonstrates a sorting accuracy of 98% along with highly efficient biological conversion. The proposed framework offers a scalable, intelligent, and practical solution for modern waste management, making it suitable for both residential and industrial applications.
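The PSO loop used for parameter tuning follows the standard velocity-update scheme, sketched below on a toy objective (a generic PSO, not the paper's bio-digestor controller; the inertia and acceleration coefficients are conventional defaults assumed for illustration):

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=150, seed=0):
    """Minimal particle swarm optimization: each particle's velocity is
    pulled toward its personal best and the swarm's global best."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (0.7 * vel[i][d]                                  # inertia
                             + 1.5 * rng.random() * (pbest[i][d] - pos[i][d])  # cognitive
                             + 1.5 * rng.random() * (gbest[d] - pos[i][d]))    # social
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

In the paper's setting, f would be a regression model mapping digestor parameters (temperature, pH, pressure, motor RPM) to predicted digestion efficiency, and PSO searches that surrogate rather than a closed-form objective.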

[LG-17] xFODE: Explainable Type-2 Fuzzy Additive ODEs for Uncertainty Quantification

链接: https://arxiv.org/abs/2604.14880
作者: Ertugrul Kececi,Tufan Kumbasar
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: in IEEE International Conference on Fuzzy Systems, 2026

点击查看摘要

Abstract:Recent advances in Deep Learning (DL) have boosted data-driven System Identification (SysID), but reliable use requires Uncertainty Quantification (UQ) alongside accurate predictions. Although UQ-capable models such as Fuzzy ODE (FODE) can produce Prediction Intervals (PIs), they offer limited interpretability. We introduce Explainable Type-2 Fuzzy Additive ODEs for UQ (xFODE+), an interpretable SysID model which produces PIs alongside point predictions while retaining physically meaningful incremental states. xFODE+ implements each fuzzy additive model with Interval Type-2 Fuzzy Logic Systems (IT2-FLSs) and constrains membership functions so that only two neighboring rules activate, limiting overlap and keeping inference locally transparent. The type-reduced sets produced by the IT2-FLSs are aggregated to construct the state update together with the PIs. The model is trained in a DL framework via a composite loss that jointly optimizes prediction accuracy and PI quality. Results on benchmark SysID datasets show that xFODE+ matches FODE in PI quality and achieves comparable accuracy, while providing interpretability.

[LG-18] Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

链接: https://arxiv.org/abs/2604.14877
作者: Zhiyuan Zhai,Wenjing Yan,Xiaodan Shao,Xin Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Does reinforcement learning genuinely expand what LLM agents can do, or merely make them more reliable? For static reasoning, recent work answers the second: base and RL pass@k curves converge at large k. We ask whether this holds for agentic tool use, where T rounds of interaction enable compositional strategies that re-sampling cannot recover. We introduce PASS@(k,T), a two-dimensional metric that jointly varies sampling budget k and interaction depth T, separating capability expansion from efficiency improvement. Our main finding is that, contrary to the static-reasoning result, tool-use RL genuinely enlarges the capability boundary: the RL agent’s pass-curve pulls above the base model’s and the gap widens at large k rather than converging. The expansion is specific to compositional, sequential information gathering; on simpler tasks RL behaves as prior work predicts. Under matched training data, supervised fine-tuning regresses the boundary on the same compositional tasks, isolating self-directed exploration as the causal factor. Mechanism analysis shows RL reweights the base strategy distribution toward the subset whose downstream reasoning more often yields a correct answer, with the improvement concentrated on how the agent integrates retrieved information. These results reconcile optimistic and pessimistic readings of RL for LLMs: both are correct, on different task types.
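For reference, the sampling axis k of PASS@(k,T) is typically estimated with the standard unbiased combinatorial pass@k estimator, sketched below (the paper's two-dimensional metric additionally varies interaction depth T, which this sketch does not model):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled attempts with c correct.

    Equals 1 - P(a random size-k subset of the n samples contains
    no correct attempt).
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Plotting this estimate as k grows is what reveals whether RL merely reweights existing capability (curves converge at large k) or genuinely expands it (the gap widens), which is the paper's central distinction.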

[LG-19] Regret Tail Characterization of Optimal Bandit Algorithms with Generic Rewards

链接: https://arxiv.org/abs/2604.14876
作者: Subhodip Panda,Shubhada Agrawal
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the tail behavior of regret in stochastic multi-armed bandits for algorithms that are asymptotically optimal in expectation. While minimizing expected regret is the classical objective, recent work shows that even such algorithms can exhibit heavy regret tails, incurring large regret with non-negligible probability. Existing sharp characterizations of regret tails are largely restricted to parametric settings, such as single-parameter exponential families. In this work, we extend the KL-inf-UCB algorithm to a broad nonparametric class of reward distributions satisfying mild assumptions, and establish its asymptotic optimality in expectation. We then analyze the tail behavior of its regret and derive a novel upper bound on the regret tail probability. As special cases, our results recover regret-tail guarantees for both bounded-support and heavy-tailed (moment-bounded) bandit models. Moreover, for the special case of finitely-supported reward distributions, our upper bound matches the known lower bound exactly. Our results thus provide a unified and tight characterization of regret tails for asymptotically optimal KL-based UCB algorithms, going beyond parametric models. Subjects: Information Theory (cs.IT); Machine Learning (cs.LG) Cite as: arXiv:2604.14876 [cs.IT] (or arXiv:2604.14876v1 [cs.IT] for this version) https://doi.org/10.48550/arXiv.2604.14876 Journal reference: 2026 IEEE International Symposium on Information Theory (ISIT 2026)

[LG-20] Curvature-Aligned Probing for Local Loss-Landscape Stabilization NEURIPS2026

链接: https://arxiv.org/abs/2604.14870
作者: Nikita Kiselev,Andrey Grabovoy
类目: Machine Learning (cs.LG)
*备注: Submitted to NeurIPS 2026

点击查看摘要

Abstract:Local loss-landscape stabilization under sample growth is typically measured either pointwise or through isotropic averaging in the full parameter space. Despite practical value, both choices probe directions that contribute little to the dominant local deformation of strongly anisotropic neural landscapes. We recast stabilization as an observational problem and introduce a unified family of criteria parameterized by an aggregation order and a probing distribution; within this family we propose a curvature-aligned criterion Δ_2^(D) that probes the loss increment field in the top-D eigenspace of the empirical Hessian near a trained solution. Solely from a local quadratic model, we prove that Δ_2^(D) preserves the O(k^{-2}) mean-squared rate of the full-space criterion while replacing ambient-dimension curvature dependence with dependence on the subspace dimension D; a corollary gives a closed-form spectral expression and a proposition identifies the top-D eigenspace as extremal within the eigenspace-aligned family. We also derive scalable estimators based on Hessian-vector products, subspace Monte Carlo, and a closed-form Gaussian-moment proxy. On a decoder-only transformer, a curvature-aligned probe occupying a tiny fraction of parameter space already reproduces the full-space mean-squared signal to within numerical noise throughout the validated local regime, and the closed-form estimator is orders of magnitude faster than direct Monte Carlo after subspace construction.
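Constructing the top-D eigenspace reduces to repeated dominant-eigenpair extraction, for example by power iteration (illustrative only; the paper works with Hessian-vector products rather than an explicit Hessian, and the deflation needed for D > 1 is omitted here):

```python
def top_eigvec(H, iters=200):
    """Power iteration for the dominant eigenpair of a symmetric matrix H.

    Each step applies H to the current vector and renormalizes; in the
    matrix-free setting, the matrix-vector product would be replaced by a
    Hessian-vector product from automatic differentiation.
    """
    n = len(H)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(H[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(H[i][j] * v[j] for j in range(n)) for i in range(n))
    return lam, v  # Rayleigh quotient and unit eigenvector estimate
```

Repeating this with deflation (projecting out found eigenvectors) yields the top-D subspace the criterion probes, at the cost of D matrix-free products per iteration.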

[LG-21] Adaptive Test-Time Compute Allocation for Reasoning LLM s via Constrained Policy Optimization

链接: https://arxiv.org/abs/2604.14853
作者: Zhiyuan Zhai,Bingcong Li,Bingnan Xiao,Ming Li,Xin Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Test-time compute scaling, the practice of spending extra computation during inference via repeated sampling, search, or extended reasoning, has become a powerful lever for improving large language model performance. Yet deploying these techniques under finite inference budgets requires a decision that current systems largely ignore: which inputs deserve more compute, and which can be answered cheaply? We formalize this as a constrained optimization problem (maximize expected accuracy subject to an average compute budget) and solve it with a two-stage Solve-then-Learn pipeline. In the solve stage, Lagrangian relaxation decomposes the global constraint into per-instance sub-problems, each admitting a closed-form oracle action that optimally prices accuracy against cost. We prove that the induced cost is monotone in the dual variable, enabling exact budget targeting via binary search. In the learn stage, a lightweight classifier is trained to predict oracle actions from cheap input features, amortizing the allocation rule for real-time deployment. We establish that the task-level regret of the learned policy is bounded by its imitation error times the worst-case per-instance gap, yielding a clean reduction from constrained inference to supervised classification. Experiments on MATH and GSM8K with three LLMs (DeepSeek-V3, GPT-4o-mini, Qwen2.5-7B) show that our method consistently outperforms uniform and heuristic allocation baselines, achieving up to 12.8% relative accuracy improvement on MATH under matched budget constraints, while closely tracking the Lagrangian oracle upper bound with over 91% imitation accuracy.
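The solve stage (per-instance closed-form choice under a dual variable λ, plus binary search exploiting the proved monotonicity of cost in λ) can be sketched as follows (a toy instantiation with explicit per-instance (cost, gain) action lists; the function and its parameters are illustrative, not the paper's implementation):

```python
def allocate(instances, budget, iters=50):
    """Sketch of Lagrangian budget targeting for compute allocation.

    `instances` is a list of per-instance action lists of (cost, gain)
    tuples. For a dual variable lam, each instance independently picks the
    action maximizing gain - lam * cost; average cost is monotone
    non-increasing in lam, so binary search finds the lam meeting the budget.
    """
    def choose(lam):
        picks = [max(actions, key=lambda a: a[1] - lam * a[0])
                 for actions in instances]
        return picks, sum(c for c, _ in picks) / len(picks)

    lo, hi = 0.0, 1e6
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        _, avg_cost = choose(mid)
        if avg_cost > budget:
            lo = mid   # over budget: price compute higher
        else:
            hi = mid   # within budget: try pricing it lower
    return choose(hi)  # hi is the smallest tested lam that meets the budget
```

The learn stage then trains a classifier to imitate the oracle picks from cheap input features, so no binary search is needed at deployment time.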

[LG-22] Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels

链接: https://arxiv.org/abs/2604.14825
作者: Yifan Zhao,Yuchen Yang,Matei Budiu,Sasa Misailovic
类目: Programming Languages (cs.PL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present Nautilus, a novel tensor compiler that moves toward fully automated math-to-kernel optimization. Nautilus compiles a high-level algebraic specification of tensor operators into efficient tiled GPU kernels. Nautilus’s successive lowering design allows high-level optimizations, expression rewrites, and tile optimizations to be jointly applied in a single end-to-end system. Nautilus presents a novel auto-scheduler that discovers sequences of high-level optimizations, while preserving the regular program structure needed by tile optimizers. Nautilus’s auto-scheduler captures complex interactions and trade-offs in the high-level optimizations, including aggressive global transformations like advanced reduction fusion. Nautilus is the first end-to-end tensor compiler capable of starting from a math-like description of attention and automatically discovering FlashAttention-3-like kernels, offloading the entire burden of optimization from the programmer to the compiler. Across five transformer-based models and 150 evaluation configurations on NVIDIA GH200 and RTX 5090 GPUs, Nautilus achieves up to 23% higher throughput than state-of-the-art compilers on GH200 and up to 42% on RTX 5090, while matching or exceeding manually written cuDNN kernels on many long-sequence configurations.

[LG-23] Towards Trustworthy 6G Network Digital Twins: A Framework for Validating Counterfactual What-If Analysis in Edge Computing Resources

链接: https://arxiv.org/abs/2604.14787
作者: Julian Jimenez Agudelo,Paola Soto,Ayat Zaki-Hindi,Jean-Sébastien Sottet,Sébastien Faye,Nina Slamnik-Kriještorac,Johann Marquez-Barja,Miguel Camelo Botero
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Network Digital Twins (NDTs) enable safe what-if analysis for 6G cloud-edge infrastructures, but adoption is often limited by fragmented workflows from telemetry to validation. We present a data-driven NDT framework that extends 6G-TWIN with a scalable pipeline for cloud-edge telemetry aggregation and semantic alignment into unified data models. Our contributions include: (i) scalable cloud-edge telemetry collection, (ii) regime-aware feature engineering capturing the network's scaling behavior, and (iii) a validation methodology based on Sign Agreement and Directional Sensitivity. Evaluated on a Kubernetes-managed cluster, the framework extrapolates performance to unseen high-load regimes. Results show both Deep Neural Network (DNN) and XGBoost achieve high regression accuracy (R2 > 0.99), while the XGBoost model delivers superior directional reliability (Sa > 0.90), making the NDT a trustworthy tool for proactive resource scaling in out-of-distribution scenarios.
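One plausible reading of the Sign Agreement score (the paper's exact definition may differ) is the fraction of what-if interventions whose predicted change has the same sign as the observed change:

```python
def sign_agreement(pred_deltas, true_deltas):
    # Assumed form of the Sign Agreement (Sa) score: fraction of
    # counterfactual interventions where the predicted change points
    # in the same direction as the observed change.
    sign = lambda v: (v > 0) - (v < 0)
    pairs = list(zip(pred_deltas, true_deltas))
    return sum(sign(p) == sign(t) for p, t in pairs) / len(pairs)
```

A score near 1 means the twin reliably predicts at least the direction of each what-if effect, even when magnitudes drift out of distribution.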

[LG-24] Constraint-based Pre-training: From Structured Constraints to Scalable Model Initialization

链接: https://arxiv.org/abs/2604.14769
作者: Fu Feng,Yucheng Xie,Ruixiao Shi,Jing Wang,Xin Geng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The pre-training and fine-tuning paradigm has become the dominant approach for model adaptation. However, conventional pre-training typically yields models at a fixed scale, whereas practical deployment often requires models of varying sizes, exposing its limitations when target model scales differ from those used during pre-training. To address this, we propose an innovative constraint-based pre-training paradigm that imposes structured constraints during pre-training to disentangle size-agnostic knowledge into reusable weight templates, while assigning size-specific adaptation to lightweight weight scalers, thereby reformulating variable-sized model initialization as a multi-task adaptation problem. Within this paradigm, we further introduce WeiT, which employs Kronecker-based constraints to regularize the pre-training process. Specifically, model parameters are represented as compositions of weight templates via concatenation and weighted aggregation, with adaptive connections governed by lightweight weight scalers whose parameters are learned from limited data. This design enables flexible and efficient construction of model weights across diverse downstream scales. Extensive experiments demonstrate the efficiency and effectiveness of WeiT, achieving state-of-the-art performance in initializing models with varying depths and widths across a broad range of perception and embodied learning tasks, including Image Classification, Image Generation, and Embodied Control. Moreover, its effectiveness generalizes to both Transformer-based and Convolution-based architectures, consistently enabling faster convergence and improved performance even under full training.
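The weighted-aggregation step above can be sketched with toy shapes; the templates and scaler values below are hypothetical, not the paper's learned parameters:

```python
# Hypothetical size-agnostic weight templates shared across model scales.
templates = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

def build_weight(scaler):
    # Weighted aggregation: W = sum_k scaler[k] * template[k].
    # In WeiT, the lightweight scaler is the only size-specific part,
    # learned from limited data for each target model scale.
    rows, cols = len(templates[0]), len(templates[0][0])
    return [[sum(s * t[i][j] for s, t in zip(scaler, templates))
             for j in range(cols)] for i in range(rows)]

W = build_weight([0.7, 0.3])
```

Different downstream model sizes would reuse the same templates and only swap in a different (cheap) scaler.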

[LG-25] Wasserstein Formulation of Reinforcement Learning. An Optimal Transport Perspective on Policy Optimization

链接: https://arxiv.org/abs/2604.14765
作者: Mathias Dus(IRMA)
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)
*备注:

点击查看摘要

Abstract:We present a geometric framework for Reinforcement Learning (RL) that views policies as maps into the Wasserstein space of action probabilities. First, we define a Riemannian structure induced by stationary distributions, proving its existence in a general context. We then define the tangent space of policies and characterize the geodesics, specifically addressing the measurability of vector fields mapped from the state space to the tangent space of probability measures over the action space. Next, we formulate a general RL optimization problem and construct a gradient flow using Otto’s calculus. We compute the gradient and the Hessian of the energy, providing a formal second-order analysis. Finally, we illustrate the method with numerical examples for low-dimensional problems, computing the gradient directly from our theoretical formalism. For high-dimensional problems, we parameterize the policy using a neural network and optimize it based on an ergodic approximation of the cost.

[LG-26] Exploiting Correlations in Federated Learning: Opportunities and Practical Limitations

链接: https://arxiv.org/abs/2604.14751
作者: Adrian Edin,Michel Kieffer,Mikael Johansson,Zheng Chen
类目: Information Theory (cs.IT); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 14 pages, 7 figures, submitted for possible publication

点击查看摘要

Abstract:The communication bottleneck in federated learning (FL) has spurred extensive research into techniques to reduce the volume of data exchanged between client devices and the central parameter server. In this paper, we systematically classify gradient and model compression schemes into three categories based on the type of correlations they exploit: structural, temporal, and spatial. We examine the sources of such correlations, propose quantitative metrics for measuring their magnitude, and reinterpret existing compression methods through this unified correlation-based framework. Our experimental studies demonstrate that the degrees of structural, temporal, and spatial correlations vary significantly depending on task complexity, model architecture, and algorithmic configurations. These findings suggest that algorithm designers should carefully evaluate correlation assumptions under specific deployment scenarios rather than assuming that they are always present. Motivated by these findings, we propose two adaptive compression designs that actively switch between different compression modes based on the measured correlation strength, and we evaluate their performance gains relative to conventional non-adaptive approaches. In summary, our unified taxonomy provides a clean and principled foundation for developing more effective and application-specific compression techniques for FL systems.
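As a toy proxy for the temporal correlation such compressors exploit (a simple stand-in, not necessarily the paper's metric), one can average the cosine similarity of successive gradient vectors:

```python
def temporal_correlation(grads):
    # Average cosine similarity between successive gradient vectors:
    # high values suggest a temporal-correlation-based compressor
    # (e.g., transmitting deltas) would pay off; low values suggest not.
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return num / (na * nb)
    sims = [cos(grads[t], grads[t + 1]) for t in range(len(grads) - 1)]
    return sum(sims) / len(sims)
```

This mirrors the paper's recommendation to measure correlation strength before assuming it, and could drive the adaptive mode-switching designs it proposes.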

[LG-27] Assessing the Performance-Efficiency Trade-off of Foundation Models in Probabilistic Electricity Price Forecasting

链接: https://arxiv.org/abs/2604.14739
作者: Jan Niklas Lettner,Hadeer El Ashhab,Veit Hagenmeyer,Benjamin Schäfer
类目: Machine Learning (cs.LG)
*备注: Submitted to the 7th International Workshop on Energy Data and Analytics (EDA), held in conjunction with ACM e-Energy 2026

点击查看摘要

Abstract:Large-scale renewable energy deployment introduces pronounced volatility into the electricity system, turning grid operation into a complex stochastic optimization problem. Accurate electricity price forecasting (EPF) is essential not only to support operational decisions, such as optimal bidding strategies and balancing power preparation, but also to reduce economic risk and improve market efficiency. Probabilistic forecasts are particularly valuable because they quantify uncertainty stemming from renewable intermittency, market coupling, and regulatory changes, enabling market participants to make informed decisions that minimize losses and optimize expected revenues. However, it remains an open question which models to employ to produce accurate forecasts. Should these be task-specific machine learning (ML) models or Time Series Foundation Models (TSFMs)? In this work, we compare four models for day-ahead probabilistic EPF (PEPF) in European bidding zones: a deterministic NHITS backbone with Quantile-Regression Averaging (NHITS+QRA) and a conditional Normalizing-Flow forecaster (NF) are compared with two TSFMs, namely Moirai and ChronosX. On the one hand, we find that TSFMs outperform task-specific deep learning models trained from scratch in terms of CRPS, Energy Score, and predictive interval calibration across market conditions. On the other hand, we find that well-configured task-specific models, particularly NHITS combined with QRA, achieve performance very close to TSFMs, and in some scenarios, such as when supplied with additional informative feature groups or adapted via few-shot learning from other European markets, they can even surpass TSFMs. Overall, our findings show that while TSFMs offer expressive modeling capabilities, conventional models remain highly competitive, emphasizing the need to weigh computational expense against marginal performance improvements in PEPF.
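The CRPS used above can be approximated from a quantile forecast by averaging the pinball (quantile) loss over a quantile grid; a minimal sketch, with the grid and scaling chosen for illustration:

```python
def pinball(y, q, tau):
    # Quantile (pinball) loss at level tau.
    d = y - q
    return tau * d if d >= 0 else (tau - 1.0) * d

def approx_crps(y, quantile_preds):
    # Averaging pinball losses over an evenly spaced quantile grid
    # approximates the CRPS of the predictive distribution
    # (up to the conventional factor of 2 applied here).
    n = len(quantile_preds)
    taus = [(i + 0.5) / n for i in range(n)]
    return 2.0 * sum(pinball(y, q, t) for q, t in zip(quantile_preds, taus)) / n
```

A sharp, well-calibrated forecast concentrates its quantiles near the realized price and drives this score toward zero.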

[LG-28] World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

链接: https://arxiv.org/abs/2604.14732
作者: Runze Li,Hongyin Zhang,Junxi Jin,Qixin Zeng,Zifeng Zhuang,Yiqi Tang,Shangke Lyu,Donglin Wang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction, lacking the ability to reason over long-horizon trajectories and evaluate their consequences, which limits performance in complex decision-making tasks. In this work, we introduce the World-Value-Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, the WAV model learns a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long-horizon utility. Action generation is then formulated as inference in this latent space, where the model progressively concentrates probability mass on high-value and dynamically feasible trajectories. We provide a theoretical perspective showing that planning directly in action space suffers from an exponential decay in the probability of feasible trajectories as the horizon increases. In contrast, latent-space inference reshapes the search distribution toward feasible regions, enabling efficient long-horizon decision making. Extensive simulations and real-world experiments demonstrate that the WAV model consistently outperforms state-of-the-art methods, achieving significant improvements in task success rate, generalization ability, and robustness, especially in long-horizon and compositional scenarios.

[LG-29] Expressivity of Transformers: A Tropical Geometry Perspective

链接: https://arxiv.org/abs/2604.14727
作者: Ye Su,Yong Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To quantify the geometric expressivity of transformers, we introduce a tropical geometry framework to characterize their exact spatial partitioning capabilities. By modeling self-attention as a vector-valued tropical rational map, we prove it evaluates exactly to a Power Voronoi Diagram in the zero-temperature limit. Building on this equivalence, we establish a combinatorial rationale for Multi-Head Self-Attention (MHSA): via the Minkowski sum of Newton polytopes, multi-head aggregation expands the polyhedral complexity to \mathcal{O}(N^H), overcoming the \mathcal{O}(N) bottleneck of single heads. Extending this to deep architectures, we derive the first tight asymptotic bounds on the number of linear regions in transformers (\Theta(N^{d_\text{model} L})), demonstrating a combinatorial explosion driven intrinsically by sequence length N, ambient embedding dimension d_\text{model}, and network depth L. Importantly, we guarantee that this idealized polyhedral skeleton is geometrically stable: finite-temperature soft attention preserves these topological partitions via exponentially tight differential approximation bounds.
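The zero-temperature limit underlying the tropical view can be checked numerically: as the softmax temperature shrinks, attention weights collapse onto the argmax, i.e. the hard (max-plus) assignment that induces the Voronoi-cell partition. A quick sketch with illustrative scores:

```python
import math

def soft_attention(scores, temperature):
    # Softmax weights at a given temperature (numerically stabilized
    # by subtracting the max score before exponentiating).
    m = max(scores)
    w = [math.exp((s - m) / temperature) for s in scores]
    z = sum(w)
    return [wi / z for wi in w]

scores = [1.0, 3.0, 2.0]
hard = soft_attention(scores, 0.01)    # near the tropical (max-plus) limit
soft = soft_attention(scores, 1000.0)  # near-uniform at high temperature
```

At low temperature virtually all mass sits on the argmax (index 1 here), matching the hard cell assignment; at high temperature the partition smooths out.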

[LG-30] RELOAD: A Robust and Efficient Learned Query Optimizer for Database Systems

链接: https://arxiv.org/abs/2604.14725
作者: Seokwon Lee,Jaeyoung Sim,Sihyun Kim,Yuhsing Li,Yiwen Zhu,Kwanghyun Park
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: This work is currently under review

点击查看摘要

Abstract:Recent advances in query optimization have shifted from traditional rule-based and cost-based techniques towards machine learning-driven approaches. Among these, reinforcement learning (RL) has attracted significant attention due to its ability to optimize long-term performance by learning policies over query planning. However, existing RL-based query optimizers often exhibit unstable performance at the level of individual queries, including severe performance regressions, and require prolonged training to reach the plan quality of expert, cost-based optimizers. These shortcomings make learned query optimizers difficult to deploy in practice and remain a major barrier to their adoption in production database systems. To address these challenges, we present RELOAD, a robust and efficient learned query optimizer for database systems. RELOAD focuses on (i) robustness, by minimizing query-level performance regressions and ensuring consistent optimization behavior across executions, and (ii) efficiency, by accelerating convergence to expert-level plan quality. Through extensive experiments on standard benchmarks, including Join Order Benchmark, TPC-DS, and Star Schema Benchmark, RELOAD demonstrates up to 2.4x higher robustness and 3.1x greater efficiency compared to state-of-the-art RL-based query optimization techniques.

[LG-31] A Mechanistic Account of Attention Sinks in GPT-2: One Circuit, Broader Implications for Mitigation

链接: https://arxiv.org/abs/2604.14722
作者: Yuval Ran-Milo,Hila Ofek,Shahar Mendel
类目: Machine Learning (cs.LG)
*备注: 9 pages, 8 figures

点击查看摘要

Abstract:Transformers commonly exhibit an attention sink: disproportionately high attention to the first position. We study this behavior in GPT-2-style models with learned query biases and absolute positional embeddings. Combining structural analysis with causal interventions, validated across natural-language, mathematical, and code inputs, we find that the sink arises from the interaction among (i) a learned query bias, (ii) the first-layer MLP transformation of the positional encoding, and (iii) structure in the key projection. Crucially, each component we identify is individually dispensable: architectures omitting each of them robustly exhibit sinks. This indicates that attention sinks may arise through distinct circuits across architectures. These findings inform mitigation of sinks, and motivate broader investigation into why sinks emerge.

[LG-32] Gating Enables Curvature: A Geometric Expressivity Gap in Attention

链接: https://arxiv.org/abs/2604.14702
作者: Satwik Bathula,Anand A. Joshi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 41 pages, 9 figures

点击查看摘要

Abstract:Multiplicative gating is widely used in neural architectures and has recently been applied to attention layers to improve performance and training stability in large language models. Despite the success of gated attention, the mathematical implications of gated attention mechanisms remain poorly understood. We study attention through the geometry of its representations by modeling outputs as mean parameters of Gaussian distributions and analyzing the induced Fisher–Rao geometry. We show that the ungated attention operator is restricted to intrinsically flat statistical manifolds due to its affine structure, while multiplicative gating enables non-flat geometries, including positively curved manifolds that are unattainable in the ungated setting. These results establish a geometric expressivity gap between ungated and gated attention. Empirically, we show that gated models exhibit higher representation curvature and improved performance on tasks requiring nonlinear decision boundaries, whereas they provide no consistent advantage on tasks with linear decision boundaries. Furthermore, we identify a structured regime in which curvature accumulates under composition, yielding a systematic depth amplification effect.

[LG-33] Mean Flow Policy Optimization

链接: https://arxiv.org/abs/2604.14698
作者: Xiaoyi Dong,Xi Sheryl Zhang,Jian Cheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose to represent policies using MeanFlow models, a class of few-step flow-based generative models, to improve training and inference efficiency over diffusion-based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo and DeepMind Control Suite benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding current diffusion-based baselines while considerably reducing training and inference time. Our code is available at this https URL.

[LG-34] Zeroth-Order Optimization at the Edge of Stability

链接: https://arxiv.org/abs/2604.14669
作者: Minhak Song,Liang Zhang,Bingcong Li,Niao He,Michael Muehlebach,Sewoong Oh
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 38 pages

点击查看摘要

Abstract:Zeroth-order (ZO) methods are widely used when gradients are unavailable or prohibitively expensive, including black-box learning and memory-efficient fine-tuning of large models, yet their optimization dynamics in deep learning remain underexplored. In this work, we provide an explicit step size condition that exactly captures the (mean-square) linear stability of a family of ZO methods based on the standard two-point estimator. Our characterization reveals a sharp contrast with first-order (FO) methods: whereas FO stability is governed solely by the largest Hessian eigenvalue, mean-square stability of ZO methods depends on the entire Hessian spectrum. Since computing the full Hessian spectrum is infeasible in practical neural network training, we further derive tractable stability bounds that depend only on the largest eigenvalue and the Hessian trace. Empirically, we find that full-batch ZO methods operate at the edge of stability: ZO-GD, ZO-GDM, and ZO-Adam consistently stabilize near the predicted stability boundary across a range of deep learning training problems. Our results highlight an implicit regularization effect specific to ZO methods, where large step sizes primarily regularize the Hessian trace, whereas in FO methods they regularize the top eigenvalue.
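The standard two-point estimator at the heart of these methods is easy to sketch; the toy run below applies ZO gradient descent to a small quadratic (the step size, dimension, and test function are illustrative, not the paper's experimental setup):

```python
import random

def zo_grad(f, x, mu=1e-4):
    # Standard two-point estimator:
    #   g = ((f(x + mu*u) - f(x - mu*u)) / (2*mu)) * u,  u ~ N(0, I)
    u = [random.gauss(0.0, 1.0) for _ in x]
    xp = [xi + mu * ui for xi, ui in zip(x, u)]
    xm = [xi - mu * ui for xi, ui in zip(x, u)]
    d = (f(xp) - f(xm)) / (2.0 * mu)
    return [d * ui for ui in u]

def zo_gd(f, x, lr=0.05, steps=500):
    # Plain ZO gradient descent (ZO-GD) with the two-point estimator.
    for _ in range(steps):
        g = zo_grad(f, x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

random.seed(0)
quad = lambda x: sum(xi * xi for xi in x)  # simple convex test function
x_final = zo_gd(quad, [1.0, -2.0])
```

On this quadratic the estimator's variance depends on the whole (here trivial) Hessian spectrum, which is exactly why the paper's mean-square stability condition involves more than the top eigenvalue.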

[LG-35] The Acoustic Camouflage Phenomenon: Re-evaluating Speech Features for Financial Risk Prediction

链接: https://arxiv.org/abs/2604.14619
作者: Dhruvin Dungrani,Disha Dungrani
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Computational Finance (q-fin.CP); Statistical Finance (q-fin.ST)
*备注:

点击查看摘要

Abstract:In computational paralinguistics, detecting cognitive load and deception from speech signals is a heavily researched domain. Recent efforts have attempted to apply these acoustic frameworks to corporate earnings calls to predict catastrophic stock market volatility. In this study, we empirically investigate the limits of acoustic feature extraction (pitch, jitter, and hesitation) when applied to highly trained speakers in in-the-wild teleconference environments. Utilizing a two-stream late-fusion architecture, we contrast an acoustic-based stream with a baseline Natural Language Processing (NLP) stream. The isolated NLP model achieved a recall of 66.25% for tail-risk downside events. Surprisingly, integrating acoustic features via late fusion significantly degraded performance, reducing recall to 47.08%. We identify this degradation as Acoustic Camouflage, where media-trained vocal regulation introduces contradictory noise that disrupts multimodal meta-learners. We present these findings as a boundary condition for speech processing applications in high-stakes financial forecasting.

[LG-36] Tight Bounds for Learning Polyhedra with a Margin

链接: https://arxiv.org/abs/2604.14614
作者: Shyamal Patel,Santosh Vempala
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We give an algorithm for PAC learning intersections of k halfspaces with a \rho margin to within error \varepsilon that runs in time \mathsf{poly}(k, \varepsilon^{-1}, \rho^{-1}) \cdot \exp\left(O(\sqrt{n} \log(1/\rho) \log k)\right). Notably, this improves on prior work, which had an exponential dependence on either k or \rho^{-1}, and matches known cryptographic and Statistical Query lower bounds up to the logarithmic factors in k and \rho in the exponent. Our learning algorithm extends to the more general setting when we are only promised that most points have distance at least \rho from the boundary of the polyhedron, making it applicable to continuous distributions as well.

[LG-37] A Synonymous Variational Perspective on the Rate-Distortion-Perception Tradeoff

链接: https://arxiv.org/abs/2604.14603
作者: Zijian Liang,Kai Niu,Changshuo Wang,Jin Xu,Ping Zhang
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 23 pages, 6 figures. This paper is submitted to the special issue on “Data Compression: Classical Theories Meet Modern Advances” of the IEEE Journal of Selected Areas in Information Theory (IEEE JSAIT)

点击查看摘要

Abstract:The fundamental limit of natural signal compression has traditionally been characterized by classical rate-distortion (RD) theory through the tradeoff between coding rate and reconstruction distortion, while the rate-distortion-perception (RDP) framework introduces a divergence-based measure of perceptual quality as a modeling principle rather than a theoretically-derived principle, leaving its theoretical origin unclear. In this paper, motivated by a synonymity-based semantic information perspective, we reformulate perceptual reconstruction as recovering any admissible sample within an ideal synonymous set (synset) associated with the source, rather than the source sample itself, and correspondingly establish a synonymous source coding architecture. On this basis, we develop a synonymous variational inference (SVI) analysis framework with a synonymous variational lower bound (SVLBO) for tractable analysis of synset-oriented compression. Within this framework, we establish a synonymity-perception consistency principle, showing that optimal identification of semantic information is theoretically consistent with perceptual optimization. Based on its derivation result, we prove a synonymous RDP tradeoff for the proposed synonymous source coding. These analytical results show that the distributional divergence term arises naturally from the synset-based reconstruction objective, clarify its compatibility with existing RDP formulations and classical RD theory, and suggest the potential advantages of synonymous source coding.

[LG-38] CLion: Efficient Cautious Lion Optimizer with Enhanced Generalization

链接: https://arxiv.org/abs/2604.14587
作者: Feihu Huang,Guanyi Zhang,Songcan Chen
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 30 pages

点击查看摘要

Abstract:Lion optimizer is a popular learning-based optimization algorithm in machine learning, which shows impressive performance in training many deep learning models. Although the convergence property of the Lion optimizer has been studied, its generalization analysis is still missing. To fill this gap, we study the generalization property of the Lion via algorithmic stability based on mathematical induction. Specifically, we prove that the Lion has a generalization error of O(\frac{1}{N\tau^T}), where N is the training sample size, \tau > 0 denotes the smallest absolute value of a non-zero element in the gradient estimator, and T is the total iteration number. In addition, we obtain an interesting byproduct that the SignSGD algorithm has the same generalization error as the Lion. To enhance generalization of the Lion, we design a novel efficient Cautious Lion (i.e., CLion) optimizer by cautiously using the sign function. Moreover, we prove that our CLion has a lower generalization error of O(\frac{1}{N}) than the O(\frac{1}{N\tau^T}) of the Lion, since the parameter \tau generally is very small. Meanwhile, we study the convergence property of our CLion optimizer, and prove that our CLion has a fast convergence rate of O(\frac{\sqrt{d}}{T^{1/4}}) under the \ell_1-norm of the gradient for nonconvex stochastic optimization, where d denotes the model dimension. Extensive numerical experiments demonstrate the effectiveness of our CLion optimizer.
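The abstract does not spell out CLion's exact rule, but one common cautious-masking variant (an assumed form, shown purely for illustration) applies the Lion sign update only on coordinates where it agrees with the current gradient:

```python
def clion_step(x, m, g, lr=0.1, b1=0.9, b2=0.99):
    # Lion direction: sign of the interpolation between momentum and gradient.
    c = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
    sgn = [(ci > 0) - (ci < 0) for ci in c]
    # Cautious masking (assumed form, not necessarily the paper's rule):
    # only apply the sign update where it agrees with the gradient sign.
    x_new = [xi - lr * si if si * gi > 0 else xi
             for xi, si, gi in zip(x, sgn, g)]
    # Standard Lion momentum update.
    m_new = [b2 * mi + (1 - b2) * gi for mi, gi in zip(m, g)]
    return x_new, m_new
```

With stale momentum pointing against the gradient, the masked coordinate is left untouched instead of being pushed the wrong way, which is the "cautious" behavior the paper's analysis rewards.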

[LG-39] From Risk to Rescue: An Agentic Survival Analysis Framework for Liquidation Prevention

链接: https://arxiv.org/abs/2604.14583
作者: Fernando Spadea,Oshani Seneviratne
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decentralized Finance (DeFi) lending protocols like Aave v3 rely on over-collateralization to secure loans, yet users frequently face liquidation due to volatile market conditions. Existing risk management tools utilize static health-factor thresholds, which are reactive and fail to distinguish between administrative “dust” cleanup and genuine insolvency. In this work, we propose an autonomous agent that leverages time-to-event (survival) analysis and moves beyond prediction to execution. Unlike passive risk signals, this agent perceives risk, simulates counterfactual futures, and executes protocol-faithful interventions to proactively prevent liquidations. We introduce a return period metric derived from a numerically stable XGBoost Cox proportional hazards model to normalize risk across transaction types, coupled with a volatility-adjusted trend score to filter transient market noise. To select optimal interventions, we implement a counterfactual optimization loop that simulates potential user actions to find the minimum capital required to mitigate risk. We validate our approach using a high-fidelity, protocol-faithful Aave v3 simulator on a cohort of 4,882 high-risk user profiles. The results demonstrate the agent’s ability to prevent liquidations in imminent-risk scenarios where static rules fail, effectively “saving the unsavable” while maintaining a zero worsening rate, providing a critical safety guarantee often missing in autonomous financial agents. Furthermore, the system successfully differentiates between actionable financial risks and negligible dust events, optimizing capital efficiency where static rules fail.

[LG-40] Physics-Informed Machine Learning for Pouch Cell Temperature Estimation

链接: https://arxiv.org/abs/2604.14566
作者: Zheng Liu
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 4 pages, 2 figures

点击查看摘要

Abstract:Accurate temperature estimation of pouch cells with indirect liquid cooling is essential for optimizing battery thermal management systems for transportation electrification. However, it is challenging due to the computational expense of finite element simulations and the limitations of data-driven models. This paper presents a physics-informed machine learning (PIML) framework for the efficient and reliable estimation of steady-state temperature profiles. The PIML approach integrates the governing heat transfer equations directly into the neural network’s loss function, enabling high-fidelity predictions with significantly faster convergence than purely data-driven methods. The framework is evaluated on a dataset of varying cooling channel geometries. Results demonstrate that the PIML model converges more rapidly and achieves markedly higher accuracy, with a 49.1% reduction in mean squared error over the data-driven model. Validation against independent test cases further confirms its superior performance, particularly in regions away from the cooling channels. These findings underscore the potential of PIML for surrogate modeling and design optimization in battery systems.
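A PIML loss of the kind described combines a data term with the residual of the governing heat equation. The 1D sketch below uses a polynomial ansatz so derivatives are exact; the conductivity k, heat source q, and data points are hypothetical stand-ins for the paper's pouch-cell setup:

```python
# Minimal PIML sketch: steady-state heat equation k*T''(x) + q = 0
# with a quadratic ansatz T(x) = a + b*x + c*x^2, so T''(x) = 2c exactly.
def pinn_loss(params, collocation_xs, data, k=1.0, q=2.0):
    a, b, c = params
    T = lambda x: a + b * x + c * x * x
    # Physics term: squared residual of the governing equation at
    # collocation points (constant here because the ansatz is quadratic).
    phys = sum((k * 2.0 * c + q) ** 2 for _ in collocation_xs) / len(collocation_xs)
    # Data term: squared error against (x, temperature) measurements.
    data_term = sum((T(x) - t) ** 2 for x, t in data) / len(data)
    return data_term + phys
```

Minimizing this joint loss is what lets the physics term steer the model even where temperature data is sparse, the mechanism behind the reported faster convergence.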

[LG-41] Material-Agnostic Zero-Shot Thermal Inference for Metal Additive Manufacturing via a Parametric PINN Framework

链接: https://arxiv.org/abs/2604.14562
作者: Hyeonsu Lee,Jihoon Jeong
类目: Machine Learning (cs.LG); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Accurate thermal modeling in metal additive manufacturing (AM) is essential for understanding the process-structure-performance relationship. While prior studies have explored generalization across unseen process conditions, they often require extensive datasets, costly retraining, or pre-training. Generalization across different materials also remains relatively unexplored due to the challenges posed by distinct material-dependent thermal behaviors. This paper introduces a parametric physics-informed neural network (PINN) framework for zero-shot generalization across arbitrary materials without labeled data, retraining, or pre-training. The framework adopts a decoupled parametric PINN architecture that separately encodes material properties and spatiotemporal coordinates, fusing them through conditional modulation to better align with the multiplicative role of material parameters in the governing equation and boundary conditions. Physics-guided output scaling derived from Rosenthal’s analytical solution and a hybrid optimization strategy are further incorporated to enhance physical consistency, training stability, and convergence. Experiments on bare plate laser powder bed fusion (LPBF) across diverse metal alloys, including both in-distribution and out-of-distribution cases, demonstrate effective zero-shot generalizability along with superior training efficiency. Specifically, the proposed framework achieved up to a 64.2% reduction in relative L2 error compared to the non-parametric baseline while surpassing its performance within only 4.4% of the baseline training epochs. Ablation studies confirm that the proposed framework’s components are broadly applicable to other PINN-based approaches. Overall, the proposed framework provides an efficient and scalable material-agnostic solution for zero-shot thermal modeling, contributing to more flexible and practical deployment in metal AM.
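Rosenthal's analytical solution, which the paper uses for physics-guided output scaling, is the classical moving point-source temperature field for a semi-infinite solid. A sketch with illustrative process parameters (the values of Q, v, k, alpha, and T0 below are hypothetical):

```python
import math

def rosenthal_temperature(x, y, z, Q=200.0, v=0.5, k=20.0, alpha=5e-6, T0=300.0):
    # Rosenthal moving point-source solution:
    #   T = T0 + Q / (2*pi*k*R) * exp(-v * (R + x) / (2*alpha)),
    # where R is the distance to the source and x is the coordinate in the
    # frame moving with the source at speed v (x > 0 ahead, x < 0 behind).
    R = math.sqrt(x * x + y * y + z * z)
    return T0 + Q / (2.0 * math.pi * k * R) * math.exp(-v * (R + x) / (2.0 * alpha))

ahead = rosenthal_temperature(1e-3, 0.0, 0.0)   # in front of the heat source
behind = rosenthal_temperature(-1e-3, 0.0, 0.0)  # in the trailing wake
```

The sharp asymmetry between the cold region ahead and the hot trailing wake is exactly the scale information the paper folds into its output scaling.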

[LG-42] DEEP-GAP: Deep-learning Evaluation of Execution Parallelism in GPU Architectural Performance

链接: https://arxiv.org/abs/2604.14552
作者: Kathiravan Palaniappan
类目: Performance (cs.PF); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 16 pages, 42 figures. Evaluation of inference performance on NVIDIA T4 and L4 GPUs across precision modes (FP32, FP16, INT8)

点击查看摘要

Abstract:Modern datacenters increasingly rely on low-power, single-slot inference accelerators to balance performance, energy efficiency, and rack density constraints. The NVIDIA T4 GPU has become widely deployed due to strong performance per watt and mature software support. Its successor, the NVIDIA L4 GPU, introduces improvements in Tensor Core throughput, cache capacity, memory bandwidth, and parallel execution capability. However, limited empirical evidence quantifies the practical inference performance gap between these two generations under controlled and reproducible conditions. This work introduces DEEP-GAP, a systematic evaluation extending the GDEV-AI methodology to GPU inference. Using identical configurations and workloads, we evaluate ResNet18, ResNet50, and ResNet101 across FP32, FP16, and INT8 precision modes using PyTorch and TensorRT. Results show that reduced precision significantly improves performance, with INT8 achieving up to 58x throughput improvement over CPU baselines. L4 achieves up to 4.4x higher throughput than T4 while reaching peak efficiency at smaller batch sizes between 16 and 32, improving latency-throughput tradeoffs for latency-sensitive workloads. T4 remains competitive for large batch workloads where cost or power efficiency is important. DEEP-GAP provides practical guidance for selecting precision modes, batch sizes, and GPU architectures for modern inference deployments.

[LG-43] VoxSafeBench: Not Just What Is Said, but Who, How, and Where

链接: https://arxiv.org/abs/2604.14548
作者: Yuxiang Wang,Hongyu Liu,Yijiang Xu,Qinke Ni,Li Wang,Wan Lin,Kunyu Feng,Dekun Chen,Xu Tan,Lei Wang,Jie Shi,Zhizheng Wu
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:As speech language models (SLMs) transition from personal devices into shared, multi-user environments, their responses must account for far more than the words alone. Who is speaking, how they sound, and where the conversation takes place can each turn an otherwise benign request into one that is unsafe, unfair, or privacy-violating. Existing benchmarks, however, largely focus on basic audio comprehension, study individual risks in isolation, or conflate content that is inherently harmful with content that only becomes problematic due to its acoustic context. We introduce VoxSafeBench, among the first benchmarks to jointly evaluate social alignment in SLMs across three dimensions: safety, fairness, and privacy. VoxSafeBench adopts a Two-Tier design: Tier1 evaluates content-centric risks using matched text and audio inputs, while Tier2 targets audio-conditioned risks in which the transcript is benign but the appropriate response hinges on the speaker, paralinguistic cues, or the surrounding environment. To validate Tier2, we include intermediate perception probes and confirm that frontier SLMs can successfully detect these acoustic cues yet still fail to act on them appropriately. Across 22 tasks with bilingual coverage, we find that safeguards appearing robust on text often degrade in speech: safety awareness drops for speaker- and scene-conditioned risks, fairness erodes when demographic differences are conveyed vocally, and privacy protections falter when contextual cues arrive acoustically. Together, these results expose a pervasive speech grounding gap: current SLMs frequently recognize the relevant social norm in text but fail to apply it when the decisive cue must be grounded in speech. Code and data are publicly available at: this https URL

[LG-44] Predicting Post-Traumatic Epilepsy from Clinical Records using Large Language Model Embeddings

链接: https://arxiv.org/abs/2604.14547
作者: Wenhui Cui,Nicholas Swingle,Anand A. Joshi,Dileep Nair,Richard M. Leahy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Objective: Post-traumatic epilepsy (PTE) is a debilitating neurological disorder that develops after traumatic brain injury (TBI). Early prediction of PTE remains challenging due to heterogeneous clinical data, limited positive cases, and reliance on resource-intensive neuroimaging data. We investigate whether routinely collected acute clinical records alone can support early PTE prediction using language model-based approaches. Methods: Using a curated subset of the TRACK-TBI cohort, we developed an automated PTE prediction framework that implements pretrained large language models (LLMs) as fixed feature extractors to encode clinical records. Tabular features, LLM-generated embeddings, and hybrid feature representations were evaluated using gradient-boosted tree classifiers under stratified cross-validation. Results: LLM embeddings achieved performance improvements by capturing contextual clinical information compared to using tabular features alone. The best performance was achieved by a modality-aware feature fusion strategy combining tabular features and LLM embeddings, achieving an AUC-ROC of 0.892 and AUPRC of 0.798. Acute post-traumatic seizures, injury severity, neurosurgical intervention, and ICU stay are key contributors to the predictive performance. Significance: These findings demonstrate that routine acute clinical records contain information suitable for early PTE risk prediction using LLM embeddings in conjunction with gradient-boosted tree classifiers. This approach represents a promising complement to imaging-based prediction.
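A minimal sketch of the hybrid-feature idea, using synthetic stand-ins for both modalities and plain concatenation as the fusion step; the paper's modality-aware fusion strategy, feature dimensions, and labels may all differ from these assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n = 300
# Synthetic stand-ins: tabular acute-care features and frozen LLM embeddings
# of clinical notes (all values and dimensions are illustrative).
tabular = rng.standard_normal((n, 6))
embeddings = rng.standard_normal((n, 32))
logit = tabular[:, 0] + embeddings[:, :4].sum(axis=1)  # label depends on both modalities
y = (logit + rng.normal(0, 0.5, n) > 0).astype(int)

cv = StratifiedKFold(5, shuffle=True, random_state=0)
aucs = {}
for name, X in [("tabular only", tabular),
                ("fused", np.hstack([tabular, embeddings]))]:  # concatenation fusion
    clf = GradientBoostingClassifier(random_state=0)
    aucs[name] = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
print(aucs)
```

When the label genuinely depends on both modalities, the fused representation improves cross-validated AUC over the tabular baseline, which is the qualitative pattern the paper reports.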

[LG-45] An unsupervised decision-support framework for multivariate biomarker analysis in athlete monitoring

链接: https://arxiv.org/abs/2604.14534
作者: Fernando Barcelos Rosito,Sebastião De Jesus Menezes,Simone Ferreira Sturza,Adriana Seixas,Muriel Figueredo Franco
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 15 pages, 4 figures, 3 tables, submitted to Springer Nature Scientific Reports

点击查看摘要

Abstract:Purpose. Athlete monitoring is constrained by small cohorts, heterogeneous biomarker scales, limited feasibility of repeated sampling, and the lack of reliable injury ground truth. These limitations reduce the interpretability and utility of traditional univariate and binary risk models. This study addresses these challenges by proposing an unsupervised multivariate framework to identify latent physiological states in athletes using real data. Methods. We propose a modular computational framework that operates in the joint biomarker space, integrating preprocessing, clinical safety screening, unsupervised clustering, and centroid-based physiological interpretation. Profiles are learned exclusively from amateur soccer players during a competitive microcycle. Synthetic data augmentation evaluates robustness and scalability. Ward hierarchical clustering supports monitoring and etiological differentiation, while Gaussian Mixture Models (GMM) enable structural stability analysis in high-dimensional settings. Results. The framework identifies coherent profiles that distinguish mechanical damage from metabolic stress while preserving homeostatic states. Synthetic data augmentation demonstrates feasibility and detection of latent silent risk phenotypes typically missed by univariate monitoring. Structural analyses indicate robustness under augmentation and higher-dimensional settings. Conclusion. The framework enables interpretable identification of latent physiological states from multivariate biomarker data without injury labels. By distinguishing mechanisms and revealing silent risk patterns not captured by conventional monitoring, it provides actionable insights for individualized athlete monitoring and decision making.
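The core pipeline above (scaling heterogeneous biomarkers, Ward clustering, GMM cross-check, centroid-based interpretation) can be sketched on synthetic data; the biomarker panel and values below are invented for illustration and are not from the paper's cohort.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic biomarker panel [CK, urea, CRP] for two latent states:
mechanical = rng.normal([800.0, 5.0, 2.0], [150.0, 1.0, 0.5], (30, 3))  # high CK
metabolic = rng.normal([200.0, 12.0, 2.0], [80.0, 1.5, 0.5], (30, 3))   # high urea
X = StandardScaler().fit_transform(np.vstack([mechanical, metabolic]))  # harmonize scales

ward = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Centroid-based interpretation: per-cluster mean z-scores per biomarker.
for c in range(2):
    centroid = X[ward.labels_ == c].mean(axis=0)
    print(f"cluster {c}: CK={centroid[0]:+.2f}, urea={centroid[1]:+.2f}, CRP={centroid[2]:+.2f}")

# Structural stability check: agreement between Ward and GMM assignments
# (max over the two possible label matchings for 2 clusters).
g = gmm.predict(X)
agree = max((g == ward.labels_).mean(), (g != ward.labels_).mean())
print(f"Ward-GMM agreement: {agree:.0%}")
```

The signed centroids make the etiological reading direct: a cluster with high CK and normal urea suggests mechanical damage, the reverse suggests metabolic stress.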

[LG-46] Quantization of Spiking Neural Networks Beyond Accuracy

链接: https://arxiv.org/abs/2604.14487
作者: Evan Gibson Smith,Jacob Whitehill,Fatemeh Ganji
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantization is a natural complement to the sparse, event-driven computation of Spiking Neural Networks, reducing memory bandwidth and arithmetic cost for deployment on resource-constrained hardware. However, existing SNN quantization evaluation focuses almost exclusively on accuracy, overlooking whether a quantized network preserves the firing behavior of its full-precision counterpart. We demonstrate that quantization method, clipping range, and bit-width can produce substantially different firing distributions at equivalent accuracy, differences invisible to standard metrics but relevant to deployment, where firing activity governs effective sparsity, state storage, and event-processing load. To capture this gap, we propose Earth Mover’s Distance as a diagnostic metric for firing distribution divergence, and apply it systematically across weight and membrane quantization on SEW-ResNet architectures trained on CIFAR-10 and CIFAR-100. We find that uniform quantization induces distributional drift even when accuracy is preserved, while LQ-Net style learned quantization maintains firing behavior close to the full-precision baseline. Our results suggest that behavior preservation should be treated as an evaluation criterion alongside accuracy, and that EMD provides a principled tool for assessing it.
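The proposed diagnostic reduces to comparing firing-rate distributions with the one-dimensional Earth Mover's Distance, available as `scipy.stats.wasserstein_distance`. The firing-rate samples below are synthetic stand-ins, not SEW-ResNet measurements:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
# Stand-in per-neuron firing rates of a full-precision SNN (hypothetical data).
fp_rates = rng.beta(2, 5, size=1000)

# Two quantized variants could match on accuracy yet differ in behavior:
quant_mild = np.clip(fp_rates + rng.normal(0, 0.01, 1000), 0, 1)  # small drift
quant_drift = np.clip(fp_rates * 1.4, 0, 1)                       # larger distributional drift

emd_mild = wasserstein_distance(fp_rates, quant_mild)
emd_drift = wasserstein_distance(fp_rates, quant_drift)
print(f"EMD mild={emd_mild:.4f}, EMD drifted={emd_drift:.4f}")
```

A standard accuracy comparison would not distinguish the two variants; the EMD exposes that the second one has moved its firing distribution much further from the full-precision baseline.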

[LG-47] Scouting By Reward: VLM-TO-IRL-Driven Player Selection For Esports

链接: https://arxiv.org/abs/2604.14474
作者: Qing Yan,Wenyu Yang,Yufei Wang,Wenhao Ma,Linchong Hu,Yifei Jin,Anton Dahbura
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional esports scouting workflows rely heavily on manual video review and aggregate performance metrics, which often fail to capture the nuanced decision-making patterns necessary to determine if a prospect fits a specific tactical archetype. To address this, we reframe style-based player evaluation in esports as an Inverse Reinforcement Learning (IRL) problem. In this paper, we introduce a novel player selection framework that learns professional-specific reward functions from logged gameplay demonstrations, allowing organizations to rank candidates by their stylistic alignment with a target star player. Our proposed architecture utilizes a multimodal, two-branch intake: one branch encodes structured state-action trajectories derived from high-resolution in-game telemetry, while the second encodes temporally aligned tactical pseudo-commentary generated by Vision-Language Models (VLMs) from broadcast footage. These representations are fused and evaluated via a Generative Adversarial Imitation Learning (GAIL) objective, where a discriminator learns to capture the unique mechanical and tactical signatures of elite professionals. By transitioning from generic skill estimation to scouting “by reward,” this framework provides a scalable, workflow-aware digital twin system that enables data-driven roster construction and targeted talent discovery across massive candidate pools.

[LG-48] Asynchronous Probability Ensembling for Federated Disaster Detection

链接: https://arxiv.org/abs/2604.14450
作者: Emanuel Teixeira Martins,Rodrigo Moreira,Larissa Ferreira Rodrigues Moreira,Rodolfo S. Villaça,Augusto Neto,Flávio de Oliveira Silva
类目: Machine Learning (cs.LG)
*备注: Paper accepted for publication at 31st IEEE Symposium on Computers and Communications (ISCC) 2026

点击查看摘要

Abstract:Quick and accurate emergency handling in Disaster Decision Support Systems (DDSS) is often hampered by network latency and suboptimal application accuracy. While Federated Learning (FL) addresses some of these issues, it is constrained by high communication costs and rigid synchronization requirements across heterogeneous convolutional neural network (CNN) architectures. To overcome these challenges, this paper proposes a decentralized ensembling framework based on asynchronous probability aggregation and feedback distillation. By shifting the exchange unit from model weights to class-probability vectors, our method maintains data privacy, reduces communication requirements by orders of magnitude, and improves overall accuracy. This approach enables diverse CNN designs to collaborate asynchronously, enhancing disaster image identification performance even in resource-constrained settings. Experimental tests demonstrate that the proposed method outperforms traditional individual backbones and standard federated approaches, establishing a scalable and resource-aware solution for real-time disaster response.
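Shifting the exchange unit from model weights to class-probability vectors makes aggregation trivial and architecture-agnostic. A minimal sketch (class names and probabilities are hypothetical), using an incremental mean so client reports can arrive asynchronously:

```python
import numpy as np

# Class-probability vectors reported by three heterogeneous CNN clients for
# one disaster image (hypothetical classes: [flood, fire, earthquake]).
reports = [
    np.array([0.7, 0.2, 0.1]),
    np.array([0.5, 0.4, 0.1]),
    np.array([0.6, 0.1, 0.3]),
]

# Asynchronous aggregation: an incremental mean updated as each report
# arrives, with no synchronization barrier and no weight exchange.
ensemble, n = np.zeros(3), 0
for p in reports:
    n += 1
    ensemble += (p - ensemble) / n

print(ensemble, "->", ["flood", "fire", "earthquake"][int(ensemble.argmax())])
```

Each message is one float per class instead of millions of CNN parameters, which is where the orders-of-magnitude communication saving comes from.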

[LG-49] Non-intrusive Learning of Physics-Informed Spatio-temporal Surrogate for Accelerating Design

链接: https://arxiv.org/abs/2604.14424
作者: Sudeepta Mondal,Soumalya Sarkar
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Most practical engineering design problems involve nonlinear spatio-temporal dynamical systems. Multi-physics simulations are often performed to capture the fine spatio-temporal scales which govern the evolution of these systems. However, these simulations are often high-fidelity in nature, and can be computationally very expensive. Hence, generating data from these expensive simulations becomes a bottleneck in an end-to-end engineering design process. Spatio-temporal surrogate modeling of these dynamical systems has been a popular data-driven solution to tackle this computational bottleneck. This is because accurate machine learning models emulating the dynamical systems can be orders of magnitude faster than the actual simulations. However, one key limitation of purely data-driven approaches is their lack of generalizability to inputs outside the training distribution. In this paper, we propose a physics-informed spatio-temporal surrogate modeling (PISTM) framework constrained by the physics of the underlying dynamical system. The framework leverages state-of-the-art advancements in the field of Koopman autoencoders to learn the underlying spatio-temporal dynamics in a non-intrusive manner, coupled with a spatio-temporal surrogate model which predicts the behavior of the Koopman operator in a specified time window for unknown operating conditions. We evaluate our framework on a prototypical fluid flow problem of interest: two-dimensional incompressible flow around a cylinder.
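The non-intrusive core of Koopman-style surrogates is a least-squares fit of a linear operator on snapshot pairs. A minimal sketch on a toy linear system, where the operator on linear observables is the system matrix itself (the paper additionally learns the observable space with an autoencoder):

```python
import numpy as np

# Toy dynamical system: damped rotation x_{t+1} = A x_t. Because the system is
# linear, the Koopman operator restricted to linear observables is A itself.
theta = 0.2
A_true = 0.98 * np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])

rng = np.random.default_rng(0)
traj = [rng.standard_normal(2)]
for _ in range(200):
    traj.append(A_true @ traj[-1])
traj = np.array(traj)

# Non-intrusive (data-only) operator estimate: least squares on snapshot
# pairs, the core step behind DMD / Koopman-autoencoder approaches.
X0, X1 = traj[:-1], traj[1:]
K = np.linalg.lstsq(X0, X1, rcond=None)[0].T   # X1 ≈ X0 K^T
print(np.round(K, 3))
```

No access to the simulator's internals is needed, only logged snapshot pairs, which is what "non-intrusive" means here.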

[LG-50] Path-Sampled Integrated Gradients

链接: https://arxiv.org/abs/2604.14338
作者: Firuz Kamalov,Fadi Thabtah,R. Sivaraj,Neda Abdelhamid
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce path-sampled integrated gradients (PS-IG), a framework that generalizes feature attribution by computing the expected value over baselines sampled along the linear interpolation path. We prove that PS-IG is mathematically equivalent to path-weighted integrated gradients, provided the weighting function matches the cumulative distribution function of the sampling density. This equivalence allows the stochastic expectation to be evaluated via a deterministic Riemann sum, improving the error convergence rate from O(m^-1/2) to O(m^-1) for smooth models. Furthermore, we demonstrate analytically that PS-IG functions as a variance-reducing filter against gradient noise - strictly lowering attribution variance by a factor of 1/3 under uniform sampling - while preserving key axiomatic properties such as linearity and implementation invariance.
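For reference, here is standard integrated gradients with the deterministic Riemann sum that PS-IG builds on. The toy model and its analytic gradient are illustrative; the PS-IG-specific CDF weighting of the path is not reproduced here.

```python
import numpy as np

# Toy model with an analytic gradient: f(x) = x0*x1 + x0**2.
f = lambda x: x[0] * x[1] + x[0] ** 2
grad = lambda x: np.array([x[1] + 2 * x[0], x[0]])

x, b = np.array([1.0, 2.0]), np.zeros(2)

def integrated_gradients(x, b, m=200):
    # Deterministic midpoint Riemann sum along the straight path from b to x.
    alphas = (np.arange(m) + 0.5) / m
    g = np.mean([grad(b + a * (x - b)) for a in alphas], axis=0)
    return (x - b) * g

ig = integrated_gradients(x, b)
# Completeness check: attributions sum to f(x) - f(b).
print(ig, ig.sum(), f(x) - f(b))
```

Completeness holds exactly here because the gradient along the path is linear in the interpolation parameter for a quadratic model, and the midpoint rule integrates linear functions exactly.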

[LG-51] When Missing Becomes Structure: Intent-Preserving Policy Completion from Financial KOL Discourse

链接: https://arxiv.org/abs/2604.14333
作者: Yuncong Liu,Yuan Wan,Zhou Jiang,Yao Lu
类目: Machine Learning (cs.LG)
*备注: Main paper with supplementary material included

点击查看摘要

Abstract:Key Opinion Leader (KOL) discourse on social media is widely consumed as investment guidance, yet turning it into executable trading strategies without injecting assumptions about unspecified execution decisions remains an open problem. We observe that the gaps in KOL statements are not random deficiencies but a structured separation: KOLs express directional intent (what to buy or sell and why) while leaving execution decisions (when, how much, how long) systematically unspecified. Building on this observation, we propose an intent-preserving policy completion framework that treats KOL discourse as a partial trading policy and uses offline reinforcement learning to complete the missing execution decisions around the KOL-expressed intent. Experiments on multimodal KOL discourse from YouTube and X (2022-2025) show that KICL achieves the best return and Sharpe ratio on both platforms while maintaining zero unsupported entries and zero directional reversals, and ablations confirm that the full framework yields an 18.9% return improvement over the KOL-aligned baseline.

[LG-52] Heat and Matérn Kernels on Matchings

链接: https://arxiv.org/abs/2604.14331
作者: Dmitry Eremeev,Salem Said,Viacheslav Borovitskiy
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Applying kernel methods to matchings is challenging due to their discrete, non-Euclidean nature. In this paper, we develop a principled framework for constructing geometric kernels that respect the natural geometry of the space of matchings. To this end, we first provide a complete characterization of stationary kernels, i.e. kernels that respect the inherent symmetries of this space. Because the class of stationary kernels is too broad, we specifically focus on the heat and Matérn kernel families, adding an appropriate inductive bias of smoothness to stationarity. While these families successfully extend widely popular Euclidean kernels to matchings, evaluating them naively incurs a prohibitive super-exponential computational cost. To overcome this difficulty, we introduce and analyze a novel, sub-exponential algorithm leveraging zonal polynomials for efficient kernel evaluation. Finally, motivated by the known bijective correspondence between matchings and phylogenetic trees-a crucial data modality in biology-we explore whether our framework can be seamlessly transferred to the space of trees, establishing novel negative results and identifying a significant open problem.

[LG-53] Calibrate-Then-Delegate: Safety Monitoring with Risk and Budget Guarantees via Model Cascades

链接: https://arxiv.org/abs/2604.14251
作者: Edoardo Pona,Milad Kazemi,Mehran Hosseini,Yali Du,David Watson,Osvaldo Simeone,Nicola Paoletti
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Monitoring LLM safety at scale requires balancing cost and accuracy: a cheap latent-space probe can screen every input, but hard cases should be escalated to a more expensive expert. Existing cascades delegate based on probe uncertainty, but uncertainty is a poor proxy for delegation benefit, as it ignores whether the expert would actually correct the error. To address this problem, we introduce Calibrate-Then-Delegate (CTD), a model-cascade approach that provides probabilistic guarantees on the computation cost while enabling instance-level (streaming) decisions. CTD builds on a novel delegation value (DV) probe, a lightweight model operating on the same internal representations as the safety probe that directly predicts the benefit of escalation. To enforce budget constraints, CTD calibrates a threshold on the DV signal using held-out data via multiple hypothesis testing, yielding finite-sample guarantees on the delegation rate. Evaluated on four safety datasets, CTD consistently outperforms uncertainty-based delegation at every budget level, avoids harmful over-delegation, and adapts budget allocation to input difficulty without requiring group labels.
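One way to read the calibration step is a fixed-sequence multiple-testing scan over candidate thresholds. The sketch below (synthetic scores, one-sided binomial tests) is an illustrative guess at the construction, not the paper's exact procedure:

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
dv_cal = rng.normal(0, 1, 2000)   # held-out delegation-value scores (synthetic)
budget, delta = 0.10, 0.05        # target delegation rate, failure probability

# Fixed-sequence sketch: scan thresholds from high to low and keep the lowest
# one for which a one-sided binomial test still rejects
# H0: P(DV > t) >= budget, i.e. still certifies the rate is under budget.
t_star = np.inf                   # fall back to "never delegate"
for t in np.quantile(dv_cal, np.linspace(0.99, 0.80, 40)):
    k = int((dv_cal > t).sum())
    if binomtest(k, len(dv_cal), budget, alternative="less").pvalue >= delta:
        break                     # certification fails; stop lowering the threshold
    t_star = t

deploy_rate = (rng.normal(0, 1, 5000) > t_star).mean()   # fresh "deployment" stream
print(f"t*={t_star:.3f}, deployment delegation rate={deploy_rate:.3f}")
```

Because each successive threshold is only accepted while its test rejects, the procedure stops before the budget is plausibly violated, giving a finite-sample control on the delegation rate in the spirit of the paper's guarantee.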

[LG-54] Metric-Aware Principal Component Analysis (MAPCA): A Unified Framework for Scale-Invariant Representation Learning

链接: https://arxiv.org/abs/2604.14249
作者: Michael Leznik
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 12 pages, one figure

点击查看摘要

Abstract:We introduce Metric-Aware Principal Component Analysis (MAPCA), a unified framework for scale-invariant representation learning based on the generalised eigenproblem max Tr(W^T Sigma W) subject to W^T M W = I, where M is a symmetric positive definite metric matrix. The choice of M determines the representation geometry. The canonical beta-family M(beta) = Sigma^beta, beta in [0,1], provides continuous spectral bias control between standard PCA (beta=0) and output whitening (beta=1), with condition number kappa(beta) = (lambda_1/lambda_p)^(1-beta) decreasing monotonically to isotropy. The diagonal metric M = D = diag(Sigma) recovers Invariant PCA (IPCA), a method rooted in Frisch (1928) diagonal regression, as a distinct member of the broader framework. We prove that scale invariance holds if and only if the metric transforms as M_tilde = CMC under rescaling C, a condition satisfied exactly by IPCA but not by the general beta-family at intermediate values. Beyond its classical interpretation, MAPCA provides a geometric language that unifies several self-supervised learning objectives. Barlow Twins and ZCA whitening correspond to beta=1 (output whitening); VICReg’s variance term corresponds to the diagonal metric. A key finding is that W-MSE, despite being described as a whitening-based method, corresponds to M = Sigma^-1 (beta = -1), outside the spectral compression range entirely and in the opposite spectral direction to Barlow Twins. This distinction between input and output whitening is invisible at the level of loss functions and becomes precise only within the MAPCA framework.
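The constrained trace maximization is a generalized symmetric eigenproblem, solvable directly with `scipy.linalg.eigh(Sigma, M)`. A small sketch of the beta-family on synthetic data (the dimensions and number of retained components are arbitrary choices):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
Sigma = np.cov(X, rowvar=False)

def mapca(Sigma, beta, k=2):
    # Metric M(beta) = Sigma^beta via the eigendecomposition of the SPD Sigma.
    lam, U = np.linalg.eigh(Sigma)
    M = (U * lam ** beta) @ U.T
    # Generalized eigenproblem: max Tr(W^T Sigma W) s.t. W^T M W = I.
    vals, W = eigh(Sigma, M)
    order = np.argsort(vals)[::-1]
    return W[:, order[:k]], vals[order[:k]]

W0, v0 = mapca(Sigma, beta=0.0)   # beta=0: recovers standard PCA
W1, v1 = mapca(Sigma, beta=1.0)   # beta=1: output whitening
print(np.round(v0, 3), np.round(v1, 3))
```

At beta=1 the generalized spectrum flattens to all ones (the whitening limit), while at beta=0 it is exactly the ordinary PCA spectrum, matching the condition-number behavior described in the abstract.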

[LG-55] TOPCELL: Topology Optimization of Standard Cell via LLMs

链接: https://arxiv.org/abs/2604.14237
作者: Zhan Song,Yu-Tung Liu,Chen Chen,Guoheng Sun,Jiaqi Yin,Chia-tung Ho,Ang Li,Haoxing Ren,Cunxi Yu
类目: Machine Learning (cs.LG)
*备注: Accepted to the 63rd ACM/IEEE Design Automation Conference (DAC 2026). 7 pages, 4 figures

点击查看摘要

Abstract:Transistor topology optimization is a critical step in standard cell design, directly dictating diffusion sharing efficiency and downstream routability. However, identifying optimal topologies remains a persistent bottleneck, as conventional exhaustive search methods become computationally intractable with increasing circuit complexity in advanced nodes. This paper introduces TOPCELL, a novel and scalable framework that reformulates high-dimensional topology exploration as a generative task using Large Language Models (LLMs). We employ Group Relative Policy Optimization (GRPO) to fine-tune the model, aligning its topology optimization strategy with logical (circuit) and spatial (layout) constraints. Experimental results within an industrial flow targeting an advanced 2nm technology node demonstrate that TOPCELL significantly outperforms foundation models in discovering routable, physically-aware topologies. When integrated into a state-of-the-art (SOTA) automation flow for a 7nm library generation task, TOPCELL exhibits robust zero-shot generalization and matches the layout quality of exhaustive solvers while achieving an 85.91x speedup.

[LG-56] Anomaly Detection in IEC-61850 GOOSE Networks: Evaluating Unsupervised and Temporal Learning for Real-Time Intrusion Detection

链接: https://arxiv.org/abs/2604.14233
作者: Joseph Moore
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 10 pages, 7 figures, 4 tables

点击查看摘要

Abstract:The IEC-61850 GOOSE protocol underpins time-critical communication in modern digital substations but lacks native security mechanisms, leaving it vulnerable to replay, masquerade, and data injection attacks. Intrusion detection in this setting is challenging due to strict latency constraints (sub-4ms) and limited availability of labeled attack data. This paper evaluates whether unsupervised temporal modeling can provide effective and deployable anomaly detection for GOOSE networks. Five models are compared on the ERENO IEC-61850 dataset: a supervised Random Forest baseline, a feedforward Autoencoder, and three recurrent sequence autoencoders (RNN, LSTM, and GRU). The supervised Random Forest achieves the highest detection performance (F1=0.9516) but fails to meet real-time constraints at 21.8ms per prediction. All four unsupervised models satisfy the 4ms requirement, with the GRU achieving the best accuracy to latency tradeoff among them (F1=0.8737 at 1.118ms). A cross-environment evaluation on an independent dataset shows that all models degrade under distribution shift. However, recurrent models retain substantially higher relative performance than the supervised baseline, suggesting that temporal sequence modeling generalizes better than fitting labeled attack distributions. Anomaly thresholds for the unsupervised models are selected on a held out validation partition to avoid test set leakage. These results support unsupervised temporal models as a practical choice for real-time GOOSE intrusion detection, particularly in environments where labeled training data may be unavailable or where large-scale deployment across diverse substations is required.
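The leakage-free threshold selection reduces to taking a quantile of benign reconstruction error on a held-out partition. The sketch below substitutes a linear PCA autoencoder for the GRU and uses synthetic traffic features; both substitutions are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic stand-ins for benign GOOSE traffic features (train/validation)
# and attack traffic drawn from a shifted distribution (illustration only).
train, val = rng.normal(0, 1, (1000, 8)), rng.normal(0, 1, (300, 8))
attacks = rng.normal(4, 1, (50, 8))

# A linear autoencoder (PCA) stands in here for the paper's GRU autoencoder.
ae = PCA(n_components=3).fit(train)
recon_err = lambda X: ((X - ae.inverse_transform(ae.transform(X))) ** 2).mean(axis=1)

# Threshold chosen on the held-out validation partition, not on test data,
# to avoid leakage: e.g. the 99th percentile of benign reconstruction error.
tau = np.quantile(recon_err(val), 0.99)
detected = (recon_err(attacks) > tau).mean()
print(f"threshold={tau:.3f}, attack detection rate={detected:.0%}")
```

Because the model only ever sees benign traffic, no labeled attack data is needed at training or calibration time, which is the deployment setting the paper targets.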

[LG-57] Portfolio Optimization Proxies under Label Scarcity and Regime Shifts via Bayesian and Deterministic Students under Semi-Supervised Sandwich Training

链接: https://arxiv.org/abs/2604.14206
作者: Adhiraj Chattopadhyay
类目: Machine Learning (cs.LG); Portfolio Management (q-fin.PM); Machine Learning (stat.ML)
*备注: 18 pages of main text. 10 pages of appendices. 35 references. Around 13 figures

点击查看摘要

Abstract:This paper proposes a machine learning assisted portfolio optimization framework designed for low data environments and regime uncertainty. We construct a teacher student learning pipeline in which a Conditional Value at Risk (CVaR) optimizer generates supervisory labels, and neural models (Bayesian and deterministic) are trained using both real and synthetically augmented data. The synthetic data is generated using a factor based model with t copula residuals, enabling training beyond the limited real sample of 104 labeled observations. We evaluate four student models under a structured experimental framework comprising (i) controlled synthetic experiments (3 x 5 seed grid), (ii) in-distribution real market evaluation (C2A) and (iii) cross-universe generalization (D2A). In real-market settings, models are deployed using a rolling evaluation protocol where a frozen pretrained model is periodically fine tuned on recent observations and reset to its base state, ensuring stability while allowing limited adaptation. Results show that student models can match or outperform the CVaR teacher in several settings, while achieving improved robustness under regime shifts and reduced turnover. These findings suggest that hybrid optimization learning approaches can enhance portfolio construction in data-constrained environments.
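A CVaR teacher of the kind described above can be sketched with the Rockafellar-Uryasev linear program; here a min-CVaR problem with long-only, fully-invested constraints. The scenario returns and constraint set are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, p, alpha = 500, 4, 0.95
# Hypothetical daily return scenarios for 4 assets (rows = scenarios).
mu = [0.0010, 0.0008, 0.0005, 0.0002]
vol = [0.020, 0.015, 0.010, 0.005]
R = rng.normal(mu, vol, (N, p))

# Rockafellar-Uryasev LP: variables z = [w (p), zeta, u (N)];
# minimize zeta + (1/((1-alpha)N)) * sum(u), with u_j >= -R_j.w - zeta, u_j >= 0.
c = np.concatenate([np.zeros(p), [1.0], np.full(N, 1.0 / ((1 - alpha) * N))])
A_ub = np.hstack([-R, -np.ones((N, 1)), -np.eye(N)])         # -R_j.w - zeta - u_j <= 0
b_ub = np.zeros(N)
A_eq = np.hstack([np.ones((1, p)), np.zeros((1, 1 + N))])    # fully invested
b_eq = [1.0]
bounds = [(0, None)] * p + [(None, None)] + [(0, None)] * N  # long-only

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
w = res.x[:p]
print("min-CVaR weights:", np.round(w, 3), " CVaR:", round(res.fun, 5))
```

Solving this LP per training window is what makes the teacher expensive, and the resulting weight vectors are exactly the kind of supervisory labels a neural student can be distilled from.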

[LG-58] Structural interpretability in SVMs with truncated orthogonal polynomial kernels

链接: https://arxiv.org/abs/2604.15285
作者: Víctor Soto-Larrosa,Nuria Torrado,Edmundo J. Huertas
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We study post-training interpretability for Support Vector Machines (SVMs) built from truncated orthogonal polynomial kernels. Since the associated reproducing kernel Hilbert space is finite-dimensional and admits an explicit tensor-product orthonormal basis, the fitted decision function can be expanded exactly in intrinsic RKHS coordinates. This leads to Orthogonal Representation Contribution Analysis (ORCA), a diagnostic framework based on normalized Orthogonal Kernel Contribution (OKC) indices. These indices quantify how the squared RKHS norm of the classifier is distributed across interaction orders, total polynomial degrees, marginal coordinate effects, and pairwise contributions. The methodology is fully post-training and requires neither surrogate models nor retraining. We illustrate its diagnostic value on a synthetic double-spiral problem and on a real five-dimensional echocardiogram dataset. The results show that the proposed indices reveal structural aspects of model complexity that are not captured by predictive accuracy alone.
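With an explicit orthonormal tensor-product basis, the OKC-style bookkeeping is a few lines: normalize squared coefficients and aggregate by interaction order and total degree. The coefficients below are hypothetical, and the exact normalization used in ORCA may differ.

```python
import numpy as np
from itertools import product

# Hypothetical coefficients of a fitted decision function in a 2-D
# tensor-product orthonormal polynomial basis phi_i(x) phi_j(y), degrees 0..2.
degrees = list(product(range(3), repeat=2))
coefs = np.array([0.1, 0.8, 0.05, 0.6, 0.3, 0.02, 0.04, 0.01, 0.005])
norm2 = float(np.sum(coefs ** 2))   # squared RKHS norm of the classifier

# OKC-style indices: normalized share of the squared norm per interaction
# order (number of active coordinates) and per total polynomial degree.
by_order, by_degree = {}, {}
for (i, j), c in zip(degrees, coefs):
    order = int(i > 0) + int(j > 0)
    by_order[order] = by_order.get(order, 0.0) + c ** 2 / norm2
    by_degree[i + j] = by_degree.get(i + j, 0.0) + c ** 2 / norm2

print("by interaction order:", {k: round(v, 3) for k, v in sorted(by_order.items())})
print("by total degree:", {k: round(v, 3) for k, v in sorted(by_degree.items())})
```

Everything is computed from the fitted coefficients alone, which is why the diagnostic is fully post-training and needs no surrogate model or retraining.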

[LG-59] Cloning is as Hard as Learning for Stabilizer States

链接: https://arxiv.org/abs/2604.15269
作者: Nikhil Bansal,Matthias C. Caro,Gaurav Mahajan
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 10 + 33 + 8 pages

点击查看摘要

Abstract:The impossibility of simultaneously cloning non-orthogonal states lies at the foundations of quantum theory. Even when allowing for approximation errors, cloning an arbitrary unknown pure state requires as many initial copies as needed to fully learn the state. Rather than arbitrary unknown states, modern quantum learning theory often considers structured classes of states and exploits such structure to develop learning algorithms that outperform general-state tomography. This raises the question: How do the sample complexities of learning and cloning relate for such structured classes? We answer this question for an important class of states. Namely, for n -qubit stabilizer states, we show that the optimal sample complexity of cloning is \Theta(n) . Thus, also for this structured class of states, cloning is as hard as learning. To prove these results, we use representation-theoretic tools in the recently proposed Abelian State Hidden Subgroup framework and a new structured version of the recently introduced random purification channel to relate stabilizer state cloning to a variant of the sample amplification problem for probability distributions that was recently introduced in classical learning theory. This allows us to obtain our cloning lower bounds by proving new sample amplification lower bounds for classes of distributions with an underlying linear structure. Our results provide a more fine-grained perspective on No-Cloning theorems, opening up connections from foundations to quantum learning theory and quantum cryptography.

[LG-60] Optimal algorithmic complexity of inference in quantum kernel methods

链接: https://arxiv.org/abs/2604.15214
作者: Elies Gil-fuster,Seongwook Shin,Sofiene Jerbi,Jens Eisert,Maximilian J. Kramer
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 26 pages (13+13), 4 figures, comments welcome

点击查看摘要

Abstract:Quantum kernel methods are among the leading candidates for achieving quantum advantage in supervised learning. A key bottleneck is the cost of inference: evaluating a trained model on new data requires estimating a weighted sum \sum_i=1^N \alpha_i k(x,x_i) of N kernel values to additive precision \varepsilon , where \alpha is the vector of trained coefficients. The standard approach estimates each term independently via sampling, yielding a query complexity of O(N\lVert\alpha\rVert_2^2/\varepsilon^2) . In this work, we identify two independent axes for improvement: (1) How individual kernel values are estimated (sampling versus quantum amplitude estimation), and (2) how the sum is approximated (term-by-term versus via a single observable), and systematically analyze all combinations thereof. The query-optimal combination, encoding the full inference sum as the expectation value of a single observable and applying quantum amplitude estimation, achieves a query complexity of O(\lVert\alpha\rVert_1/\varepsilon) , removing the dependence on N from the query count and yielding a quadratic improvement in both \lVert\alpha\rVert_1 and \varepsilon . We prove a matching lower bound of \Omega(\lVert\alpha\rVert_1/\varepsilon) , establishing query-optimality of our approach up to logarithmic factors. Beyond query complexity, we also analyze how these improvements translate into gate costs and show that the query-optimal strategy is not always optimal in practice from the perspective of gate complexity. Our results provide both a query-optimal algorithm and a practically optimal choice of strategy depending on hardware capabilities, along with a complete landscape of intermediate methods to guide practitioners. All algorithms require only amplitude estimation as a subroutine and are thus natural candidates for early-fault-tolerant implementations.
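The baseline term-by-term strategy and its shot-noise scaling can be illustrated classically by treating each kernel value as a Bernoulli mean estimated from a finite number of shots (synthetic values; no quantum simulation involved):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
alpha = rng.uniform(0, 0.1, N)   # trained coefficients (synthetic)
k_true = rng.uniform(0, 1, N)    # stand-in kernel values in [0, 1]
target = float(alpha @ k_true)

def term_by_term_error(shots, reps=2000):
    # Standard strategy: estimate each kernel value from `shots` Bernoulli
    # samples, then form the weighted sum; error scales as O(shots**-0.5).
    k_hat = rng.binomial(shots, k_true, size=(reps, N)) / shots
    return float(np.abs(k_hat @ alpha - target).mean())

errors = {m: term_by_term_error(m) for m in (100, 400, 1600)}
print(errors)
```

Each 4x increase in shots roughly halves the mean error, the O(1/ε²) sampling behavior that amplitude estimation improves quadratically to O(1/ε).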

[LG-61] MinShap: A Modified Shapley Value Approach for Feature Selection

链接: https://arxiv.org/abs/2604.15107
作者: Chenghui Zheng,Garvesh Raskutti
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Feature selection is a classical problem in statistics and machine learning, and it continues to remain an extremely challenging problem especially in the context of unknown non-linear relationships with dependent features. On the other hand, Shapley values are a classic solution concept from cooperative game theory that is widely used for feature attribution in general non-linear models with highly-dependent features. However, Shapley values are not naturally suited for feature selection since they tend to capture both direct effects from each feature to the response and indirect effects through other features. In this paper, we combine the advantages of Shapley values and adapt them to feature selection by proposing MinShap, a modification of the Shapley value framework, along with a suite of other related algorithms. In particular, MinShap, instead of taking the average marginal contribution over permutations of features, considers the minimum marginal contribution across permutations. We provide a theoretical foundation motivated by the faithfulness assumption in DAGs (directed acyclic graphical models), a guarantee for the Type I error of MinShap, and show through numerical simulations and real data experiments that MinShap tends to outperform state-of-the-art feature selection algorithms such as LOCO, GCM and Lasso in terms of both accuracy and stability. We also introduce a suite of algorithms related to MinShap by using the multiple testing/p-value perspective that improves performance in lower-sample settings and provide supporting theoretical guarantees.
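The minimum-versus-average idea described in this abstract can be sketched in a few lines; the toy value function below (feature 1 is a redundant copy of feature 0) and all names are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the MinShap idea: replace the Shapley average of marginal
# contributions over feature permutations with the minimum marginal
# contribution. The value function here is a hypothetical toy example.
import itertools

def marginal_contributions(value, features, j):
    """Marginal contribution of feature j in every permutation of `features`."""
    contribs = []
    for perm in itertools.permutations(features):
        pos = perm.index(j)
        before = frozenset(perm[:pos])
        contribs.append(value(before | {j}) - value(before))
    return contribs

def shapley(value, features, j):
    c = marginal_contributions(value, features, j)
    return sum(c) / len(c)

def minshap(value, features, j):
    return min(marginal_contributions(value, features, j))

# Toy value function: feature 0 has a direct effect; feature 1 is a
# redundant copy that only contributes when feature 0 is absent.
def value(S):
    return 1.0 if (0 in S or 1 in S) else 0.0

feats = [0, 1, 2]
print(shapley(value, feats, 1))  # 0.5: redundant feature still gets credit
print(minshap(value, feats, 1))  # 0.0: no guaranteed effect in every ordering
```

In this toy case the Shapley value credits the redundant feature 1 through its indirect overlap with feature 0, while the minimum over permutations drives its score to zero, which is the behavior the paper exploits for selection.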

[LG-62] Unsupervised feature selection using Bayesian Tucker decomposition

链接: https://arxiv.org/abs/2604.14949
作者: Y-h. Taguchi,Yoh-ichi Mototake
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 24 pages, 10 figures

点击查看摘要

Abstract:In this paper, we propose Bayesian Tucker decomposition (BTuD), in which the residual is assumed to follow a Gaussian distribution, analogous to linear regression. Although we propose an algorithm to perform BTuD, the conventional higher-order orthogonal iteration can also generate a Tucker decomposition consistent with the present implementation. Using the proposed BTuD, we perform unsupervised feature selection and successfully apply it to various synthetic datasets, globally coupled maps with randomized coupling strength, and gene expression profiles. We therefore conclude that our newly proposed unsupervised feature selection method is promising. In addition, BTuD-based unsupervised FE is expected to coincide with TD-based unsupervised FE, which was previously proposed and successfully applied to a wide range of problems.

[LG-63] Learning to Concatenate Quantum Codes

链接: https://arxiv.org/abs/2604.14931
作者: Nico Meyer,Christopher Mutschler,Dominik Seuß,Andreas Maier,Daniel D. Scherer
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 7 pages, 5 figures, 1 table

点击查看摘要

Abstract:Concatenating quantum error correction codes scales error correction capability by driving logical error rates down double-exponentially across levels. However, the noise structure shifts under concatenation, making it hard to choose an optimal code sequence. We automate this choice by estimating the effective noise channel after each level and selecting the next code accordingly. In particular, we use learning-based methods to tailor small, non-additive encoders when the noise exhibits sufficient structure, then switch to standard codes once the noise is nearly uniform. In simulations, this level-wise adaptation achieves a target logical error rate with far fewer qubits than concatenating stabilizer codes alone–reducing qubit counts by up to two orders of magnitude for strongly structured noise. Therefore, this hybrid, learning-based strategy offers a promising tool for early fault-tolerant quantum computing.

[LG-64] Unraveling the Mechanism of Drug Binding to SARS-CoV-2 RNA Pseudoknot with Thermodynamics-Driven Machine Learning

链接: https://arxiv.org/abs/2604.14906
作者: Mariia Ivonina,Jakub Rydzewski
类目: Biological Physics (physics.bio-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The SARS-CoV-2 RNA pseudoknot is a promising target for antiviral intervention, as it regulates the efficiency of -1 programmed ribosomal frameshifting (-1 PRF), a mechanism that is essential for viral protein synthesis. The pseudoknot represents a viral RNA sequence composed of helical stems that adopts two long-lived topologies, threaded and unthreaded. Ligand-induced distortion of this fold is thought to underlie the susceptibility of -1 PRF to small-molecule inhibitors. Resolving these distortions from unbiased molecular dynamics (MD) requires collective variables (CVs) that isolate the slowest dynamic modes of the RNA–ligand system from the high-frequency fluctuations. Here, we use spectral map (SM), a thermodynamics-driven machine-learning method, to learn such CVs directly from MD trajectories of the SARS-CoV-2 RNA pseudoknot in complex with the -1 PRF inhibitor merafloxacin and two related analogs. We examine both threaded and unthreaded pseudoknot topologies and consider the neutral and ionized ligand forms relevant at physiological pH. Free-energy landscapes show that ligand-induced destabilization is topology-selective: merafloxacin and its analogs destabilize the S2 stem in the threaded pseudoknot, whereas in the unthreaded pseudoknot, destabilization shifts to the S1 and S3 stems. We find that the zwitterionic form of merafloxacin uniquely imposes slow dynamics on the otherwise featureless unthreaded pseudoknot. Furthermore, the neutral and zwitterionic forms of merafloxacin differ qualitatively in their mechanisms within the same RNA topology. Overall, these results clarify how pseudoknot topology, ligand type, and protonation state shape the slow conformational dynamics of viral RNA and establish physiological protonation as an essential factor for modeling RNA-targeted drug action.

[LG-65] Best of both worlds: Stochastic adversarial best-arm identification COLT2018

链接: https://arxiv.org/abs/2604.14860
作者: Yasin Abbasi-Yadkori,Peter L. Bartlett,Victor Gabillon,Alan Malek,Michal Valko
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Published in Conference on Learning Theory (COLT 2018)

点击查看摘要

Abstract:We study bandit best-arm identification with arbitrary and potentially adversarial rewards. A simple random uniform learner obtains the optimal rate of error in the adversarial scenario. However, this type of strategy is suboptimal when the rewards are sampled stochastically. Therefore, we ask: Can we design a learner that performs optimally in both the stochastic and adversarial problems while not being aware of the nature of the rewards? First, we show that designing such a learner is impossible in general. In particular, to be robust to adversarial rewards, we can only guarantee optimal rates of error on a subset of the stochastic problems. We give a lower bound that characterizes the optimal rate in stochastic problems if the strategy is constrained to be robust to adversarial rewards. Finally, we design a simple parameter-free algorithm and show that its probability of error matches (up to log factors) the lower bound in stochastic problems, and it is also robust to adversarial ones.
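The "simple random uniform learner" mentioned in this abstract can be sketched directly: split the budget evenly across arms and recommend the empirical best. The Bernoulli reward model and all names below are illustrative assumptions.

```python
# Sketch of the uniform-allocation best-arm identification baseline:
# pull every arm equally often, then recommend the arm with the highest
# empirical mean. This is the strategy that is rate-optimal against
# adversarial rewards but suboptimal on easy stochastic instances.
import random

def uniform_best_arm(pull, n_arms, budget, rng):
    pulls = budget // n_arms
    means = []
    for arm in range(n_arms):
        means.append(sum(pull(arm, rng) for _ in range(pulls)) / pulls)
    return max(range(n_arms), key=means.__getitem__)

# Hypothetical stochastic Bernoulli bandit: arm 2 has the highest mean.
probs = [0.3, 0.5, 0.8]
def pull(arm, rng):
    return 1.0 if rng.random() < probs[arm] else 0.0

rng = random.Random(1)
print(uniform_best_arm(pull, 3, 3000, rng))
```

On a stochastic instance like this one, adaptive allocation could identify arm 2 with far fewer pulls; the paper's question is how much of that adaptivity can be retained while staying robust to adversarial rewards.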

[LG-66] Scalable Model-Based Clustering with Sequential Monte Carlo AISTATS2026

链接: https://arxiv.org/abs/2604.14810
作者: Connie Trojan,Pavel Myshkov,Paul Fearnhead,James Hensman,Tom Minka,Christopher Nemeth
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注: Accepted at AISTATS 2026. 31 pages, 20 figures

点击查看摘要

Abstract:In online clustering problems, there is often a large amount of uncertainty over possible cluster assignments that cannot be resolved until more data are observed. This difficulty is compounded when clusters follow complex distributions, as is the case with text data. Sequential Monte Carlo (SMC) methods give a natural way of representing and updating this uncertainty over time, but have prohibitive memory requirements for large-scale problems. We propose a novel SMC algorithm that decomposes clustering problems into approximately independent subproblems, allowing a more compact representation of the algorithm state. Our approach is motivated by the knowledge base construction problem, and we show that our method is able to accurately and efficiently solve clustering problems in this setting and others where traditional SMC struggles.

[LG-67] Expert-Guided Class-Conditional Goodness-of-Fit Scores for Interpretable Classification with Informative Missingness: An Application to Seismic Monitoring

链接: https://arxiv.org/abs/2604.14809
作者: Shahar Cohen,David M. Steinberg,Yael Radzyner,Yochai Ben Horin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 50 pages, 8 figures

点击查看摘要

Abstract:We study a classification problem with three key challenges: pervasive informative missingness, the integration of partial prior expert knowledge into the learning process, and the need for interpretable decision rules. We propose a framework that encodes prior knowledge through an expert-guided class-conditional model for one or more classes, and use this model to construct a small set of interpretable goodness-of-fit features. The features quantify how well the observed data agree with the expert model, isolating the contributions of different aspects of the data, including both observed and missing components. These features are combined with a few transparent auxiliary summaries in a simple discriminative classifier, resulting in a decision rule that is easy to inspect and justify. We develop and apply the framework in the context of seismic monitoring used to assess compliance with the Comprehensive Nuclear-Test-Ban Treaty. We show that the method has strong potential as a transparent screening tool, reducing workload for expert analysts. A simulation designed to isolate the contribution of the proposed framework shows that this interpretable expert-guided method can even outperform strong standard machine-learning classifiers, particularly when training samples are small.

[LG-68] PUFFIN: Protein Unit Discovery with Functional Supervision

链接: https://arxiv.org/abs/2604.14796
作者: Gökçe Uludoğan,Buse Giledereli,Elif Ozkirimli,Arzucan Özgür
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注: 21 pages, 9 figures, to appear in ISMB 2026 proceedings

点击查看摘要

Abstract:Proteins carry out biological functions through the coordinated action of groups of residues organized into structural arrangements. These arrangements, which we refer to as protein units, exist at an intermediate scale, being larger than individual residues yet smaller than entire proteins. A deeper understanding of protein function can be achieved by identifying these units and their associations with function. However, existing approaches either focus on residue-level signals, rely on curated annotations, or segment protein structures without incorporating functional information, thereby limiting interpretable analysis of structure-function relationships. We introduce PUFFIN, a data-driven framework for discovering protein units by jointly learning structural partitioning and functional supervision. PUFFIN represents proteins as residue-level structure graphs and applies a graph neural network with a structure-aware pooling mechanism that partitions each protein into multi-residue units, with functional supervision that shapes the partition. We show that the learned units are structurally coherent, exhibit organized associations with molecular function, and show meaningful correspondence with curated InterPro annotations. Together, these results demonstrate that PUFFIN provides an interpretable framework for analyzing structure-function relationships using learned protein units and their statistical function associations. We made our source code available at this https URL.

[LG-69] Differentially Private Conformal Prediction

链接: https://arxiv.org/abs/2604.14621
作者: Jiamei Wu,Ce Zhang,Zhipeng Cai,Jingsen Kong,Bei Jiang,Linglong Kong,Lingchen Kong
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conformal prediction (CP) has attracted broad attention as a simple and flexible framework for uncertainty quantification through prediction sets. In this work, we study how to deploy CP under differential privacy (DP) in a statistically efficient manner. We first introduce differential CP, a non-splitting conformal procedure that avoids the efficiency loss caused by data splitting and serves as a bridge between oracle CP and private conformal inference. By exploiting the stability properties of DP mechanisms, differential CP establishes a direct connection to oracle CP and inherits corresponding validity behavior. Building on this idea, we develop Differentially Private Conformal Prediction (DPCP), a fully private procedure that combines DP model training with a private quantile mechanism for calibration. We establish the end-to-end privacy guarantee of DPCP and investigate its coverage properties under additional regularity conditions. We further study the efficiency of both differential CP and DPCP under empirical risk minimization and general regression models, showing that DPCP can produce tighter prediction sets than existing private split conformal approaches under the same privacy budget. Numerical experiments on synthetic and real datasets demonstrate the practical effectiveness of the proposed methods.
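For readers unfamiliar with the calibration step that DPCP privatizes, the sketch below shows standard split-conformal quantile calibration plus a crudely noised threshold. The Laplace-noise step is a simplification for illustration only; it is not the paper's mechanism and carries no formal DP guarantee on its own.

```python
# Split-conformal calibration: the (1 - alpha) empirical quantile of
# calibration scores, with the usual (n + 1) finite-sample correction.
# noisy_quantile adds symmetric Laplace noise to the threshold as a
# stand-in for a private quantile mechanism (an illustrative assumption).
import math
import random

def conformal_quantile(scores, alpha):
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(scores)[min(k, n) - 1]

def noisy_quantile(scores, alpha, scale, rng):
    # Inverse-CDF Laplace sample from u uniform on (-0.5, 0.5).
    u = rng.random() - 0.5
    noise = -scale * math.copysign(math.log(1 - 2 * abs(u)), u)
    return conformal_quantile(scores, alpha) + noise

rng = random.Random(0)
calib_scores = [abs(rng.gauss(0, 1)) for _ in range(999)]
q = conformal_quantile(calib_scores, alpha=0.1)
# Prediction set for a new point x: all y with |y - f(x)| <= q.
print(round(q, 3))
```

With absolute-residual scores, the resulting prediction set is an interval of half-width q around the point prediction; DPCP's contribution is making both the model fit and this quantile private while keeping the set tight.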

[LG-70] Timescale Separation Enables Deep Reinforcement Learning Control of Rotating Detonation Engine Mode Transitions

链接: https://arxiv.org/abs/2604.14398
作者: Kristian Holme,Jean Rabault,Ricardo Vinuesa,Mikael Mortensen
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Rotating detonation engines (RDEs) are a promising propulsion concept that may offer higher thermodynamic efficiency and specific impulse than conventional systems, but nonlinear phenomena, including transitions to oscillatory or chaotic propagation modes, can hinder practical operation. Deep Reinforcement Learning (DRL) has emerged as a promising method for controlling complex nonlinear dynamics such as those observed in RDEs. However, the multi-timescale nature of the RDE system makes direct application of DRL challenging. We address this challenge by reformulating the DRL problem in a moving reference frame that follows the detonation-wave pattern, making the wave structure appear quasi-steady to the agent. This reformulation enables scale separation between fast detonation propagation and slower operating-mode dynamics. We train DRL controllers to modulate spatially segmented injection pressure in a one-dimensional reduced-order RDE model and induce rapid transitions between different mode-locked states. Across a range of actuation periods, initial states, and target modes, controllers trained in the moving frame learn more reliably than those trained in a stationary frame and remain effective over a broader range of actuation periods. These results suggest that symmetry-aware moving reference frame formulations may be useful for related multiscale flow-control problems and that scale separation should be exploited whenever possible to enable DRL control of multi-timescale systems.

[LG-71] Deployment of AI-Assisted Interventions: Capacity Constraints and Noisy Compliance

链接: https://arxiv.org/abs/2604.14370
作者: Carri W. Chan,Yi Han,Hannah Li,Benjamin L. Ranard
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:AI tools increasingly guide targeted interventions in healthcare, education, and recruiting. Algorithms score individuals, trigger outreach to those above a threshold (e.g., high-risk or high-value), and encourage them to request service; then providers deliver service to those who request. Standard practice sets the threshold and selects the algorithm to maximize predictive accuracy, assuming that better predictions yield better outcomes. We show that this approach is suboptimal when limited service capacity and probabilistic behavioral responses influence who receives service. In such settings, the optimal score threshold must balance two effects: ensuring all capacity is filled (utilization) and ensuring high-value individuals are served despite competition between requests (cannibalization). We characterize the optimal threshold and prove that policies based solely on predictive accuracy are generally suboptimal. Further, because optimal thresholds vary with service capacity, algorithm selection metrics like AUC, which weight all thresholds equally, are misaligned with operational performance. We introduce a new metric–Operational AUC (OpAUC)–and show it leads to optimal algorithm selection. Finally, we conduct a case study on sepsis early warning data and illustrate the magnitude of improvement that can be achieved from improved threshold and algorithm selection.

[LG-72] PROXIMA: A Reliability Scoring Framework for Proxy Metrics in Online Controlled Experiments

链接: https://arxiv.org/abs/2604.14352
作者: Avinash Amudala
类目: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 14 pages. Sole-author submission. Independent research. Companion code at this https URL . Zenodo archive: https://doi.org/10.5281/zenodo.15483241 . Related US provisional patent application: 63/974,569 (filed Feb 3, 2026)

点击查看摘要

Abstract:Online A/B testing at scale relies on proxy metrics – short-term, easily-measured signals used in place of slow-moving long-term outcomes. When the proxy-outcome relationship is heterogeneous across user segments, aggregate correlation can mask directional failures akin to Simpson’s Paradox, leading to costly ship/no-ship errors. We introduce PROXIMA (Proxy Metric Validation Framework for Online Experiments), a lightweight diagnostic framework that scores proxy reliability through a composite of three complementary dimensions: normalised effect correlation, directional accuracy, and segment-level fragility rate. Unlike surrogate-index approaches that predict long-term treatment effects, PROXIMA directly audits whether a candidate proxy leads to correct launch decisions and flags the user segments where it fails. We validate PROXIMA on two public datasets – the Criteo Uplift corpus (14M observations, advertising) and KuaiRec (7K users, video recommendation) – using 80 simulated A/B tests. Early engagement metrics achieve a composite reliability of 0.80 on Criteo and 0.62 on KuaiRec, yielding 98.4% average decision agreement with an oracle policy. Fragility analysis reveals that recommendation domains exhibit substantially higher segment-level heterogeneity (68% fragility) than advertising (13%), yet directional accuracy remains above 96% in both cases. A sensitivity analysis over the weight space confirms that no single component suffices and that the composite provides substantially better discrimination between reliable and unreliable proxies than correlation alone. Code and reproduction scripts are available at: this https URL
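The composite score described in this abstract combines three bounded components; a minimal sketch is below. The equal weights and the example component values are illustrative assumptions, not PROXIMA's calibrated settings.

```python
# Sketch of a composite proxy-reliability score blending normalised
# effect correlation, directional accuracy, and (one minus) the
# segment-level fragility rate. Weights here are a hypothetical default.
def composite_reliability(correlation, directional_acc, fragility,
                          weights=(1 / 3, 1 / 3, 1 / 3)):
    components = (correlation, directional_acc, 1.0 - fragility)
    return sum(w * c for w, c in zip(weights, components))

# Example values chosen for shape only (not the paper's measurements):
score = composite_reliability(correlation=0.85, directional_acc=0.98,
                              fragility=0.13)
print(round(score, 2))
```

The sensitivity analysis in the paper is over exactly this weight space: the claim is that no single component alone separates reliable from unreliable proxies as well as the blend does.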

[LG-73] Doubly Outlier-Robust Online Infinite Hidden Markov Model

链接: https://arxiv.org/abs/2604.14322
作者: Horace Yiu,Leandro Sánchez-Betancourt,Álvaro Cartea,Gerardo Duran-Martin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We derive a robust update rule for the online infinite hidden Markov model (iHMM) for when the streaming data contains outliers and the model is misspecified. Leveraging recent advances in generalised Bayesian inference, we define robustness via the posterior influence function (PIF), and provide conditions under which the online iHMM has bounded PIF. Imposing robustness inevitably induces an adaptation lag for regime switching. Our method, which is called Batched Robust iHMM (BR-iHMM), balances adaptivity and robustness with two additional tunable parameters. Across limit order book data, hourly electricity demand, and a synthetic high-dimensional linear system, BR-iHMM reduces one-step-ahead forecasting error by up to 67% relative to competing online Bayesian methods. Together with theoretical guarantees of bounded PIF, our results highlight the practicality of our approach for both forecasting and interpretable online learning.

[LG-74] Combining Bayesian and Frequentist Inference for Laboratory-Specific Performance Guarantees in Copy Number Variation Detection

链接: https://arxiv.org/abs/2604.14305
作者: Austin Talbot,Alex V. Kotlar,Yue Ke
类目: Methodology (stat.ME); Machine Learning (cs.LG); Genomics (q-bio.GN); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Targeted amplicon panels are widely used in oncology diagnostics, but providing per-gene performance guarantees for copy number variant (CNV) detection remains challenging due to amplification artifacts, process-mismatch heterogeneity, and limited validation sample sizes. While Bayesian CNV callers naturally quantify per-sample uncertainty, translating this into the frequentist population-level guarantees required for clinical validation (coverage rates, false-positive bounds, and minimum detectable copy-number changes) is a fundamentally different inferential problem. We show empirically that even robust Bayesian credible intervals, including coarsened posteriors and sandwich-adjusted intervals, are severely miscalibrated on panels with small amplicon counts per gene. To address this, we propose a hybrid framework that evaluates Bayesian posterior functionals on validation samples and models the resulting squared losses with a Gamma distribution, yielding tolerance intervals with valid frequentist coverage. Three components make the method practical under real-world constraints: (1) imputation that removes the influence of true CNV-positive samples without requiring known ground truth, (2) regularization to address small sample variability, and (3) evidence-based stratification on the log model evidence to accommodate non-exchangeable noise profiles arising from process mismatch. Evaluated on two targeted amplicon panels using leave-one-out cross-validation, the proposed method achieves single-digit mean absolute coverage error across all genes under both process-matched and unmatched conditions, whereas Bayesian comparators exhibit mean absolute errors exceeding 60% on clinically relevant genes such as ERBB2.

[LG-75] Continual Learning for fMRI-Based Brain Disorder Diagnosis via Functional Connectivity Matrices Generative Replay CVPR2026

链接: https://arxiv.org/abs/2604.14259
作者: Qianyu Chen,Shujian Yu
类目: Tissues and Organs (q-bio.TO); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: manuscript accepted by CVPR 2026, code is available from \url{ this https URL }

点击查看摘要

Abstract:Functional magnetic resonance imaging (fMRI) is widely used for studying and diagnosing brain disorders, with functional connectivity (FC) matrices providing powerful representations of large-scale neural interactions. However, existing diagnostic models are trained either on a single site or under full multi-site access, making them unsuitable for real-world scenarios where clinical data arrive sequentially from different institutions. This results in limited generalization and severe catastrophic forgetting. This paper presents the first continual learning framework specifically designed for fMRI-based diagnosis across heterogeneous clinical sites. Our framework introduces a structure-aware variational autoencoder that synthesizes realistic FC matrices for both patient and control groups. Built on this generative backbone, we develop a multi-level knowledge distillation strategy that aligns predictions and graph representations between new-site data and replayed samples. To further enhance efficiency, we incorporate a hierarchical contextual bandit scheme for adaptive replay sampling. Experiments on multi-site datasets for major depressive disorder (MDD), schizophrenia (SZ), and autism spectrum disorder (ASD) show that the proposed generative model enhances data augmentation quality, and the overall continual learning framework substantially outperforms existing methods in mitigating catastrophic forgetting. Our code is available at this https URL.

[LG-76] Polyformer: a generative framework for thermodynamic modeling of polymeric molecules

链接: https://arxiv.org/abs/2604.14241
作者: Alessio Valentini,David Pekker,Chungwen Liang,Todd Martinez,Swagatam Mukhopadhyay
类目: Biomolecules (q-bio.BM); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 9+epsilon pages+references+appendix, 6 figures

点击查看摘要

Abstract:The classic paradigm of structural biology is that the sequence of a biomolecule (protein, nucleic acid, lipid, etc) determines its conformation (shape) which determines its biological function. Protein folding programs like AlphaFold address this paradigm by predicting the single best conformation given a sequence that defines the molecule. However, biomolecules are not static structures, and their conformational ensemble determines their function. We present the Polyformer – a generative framework for thermodynamic modeling of polymeric molecules. Given the sequence and temperature (or another thermodynamic variable), the Polyformer generates conformations faithful to the molecule’s thermodynamic conformational ensemble. It is the first generative model that solves three problems simultaneously: how does a molecule fold, what is its conformational ensemble, and how does the conformational ensemble change as we change physical temperature. As a concrete test case, we apply Polyformer to protein domains with 50-111 residues and report good agreement of model predictions to Molecular Dynamics (MD) trajectories.

[LG-77] ML-based approach to classification and generation of structured light propagation in turbulent media

链接: https://arxiv.org/abs/2604.14208
作者: Aokun Wang,Anjali Nair,Zhongjian Wang,Guillaume Bal
类目: Optics (physics.optics); Machine Learning (cs.LG); Optimization and Control (math.OC); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:This work develops machine learning approaches to classify structured light wave beams that develop random speckle disturbances as they propagate through turbulent atmospheres. Beam propagation is modeled by the numerical simulation of a stochastic paraxial equation. We design convolutional neural networks tailored for this specific application and use them for a classification model with one-hot encoding. To address the challenge of potentially limited available data, we develop a prediction-based generative diffusion model to provide additional data during classifier training. We show that a Bregman distance minimization during the learning step improves the quality of the generation of high-frequency modes.

[LG-78] Predictions of charge density distributions for nuclei with Z \geq 8

链接: https://arxiv.org/abs/2604.05312
作者: Yun Dong Wang,Tian Shuai Shang,Hui Hui Xie,Peng Xiang Du,Jian Li,Haozhao Liang
类目: Nuclear Theory (nucl-th); Machine Learning (cs.LG); Atomic Physics (physics.atom-ph)
*备注: 56 pages, 4 tables, 3 figures

点击查看摘要

Abstract:A deep neural network (DNN) has been developed to accurately predict nuclear charge density distributions for nuclei with proton numbers Z \geq 8 . By incorporating essential nuclear structure features, the model achieves a significant improvement in predictive accuracy over conventional methods. The charge density distributions are analyzed using a Fourier-Bessel (FB) series expansion, and the DNN is trained on a comprehensive dataset derived from relativistic continuum Hartree-Bogoliubov (RCHB) theory calculations. The model demonstrates exceptional performance, with root-mean-square deviations of 0.0123 fm and 0.0198 fm for charge radii on the training and validation sets, respectively, remarkably surpassing the precision of the original RCHB calculations. Beyond advancing nuclear physics research, this high-precision model provides critical data for applications in atomic physics, nuclear astrophysics, and related fields.
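The Fourier-Bessel parametrisation mentioned in this abstract expands the charge density on a finite interval in spherical Bessel j0 terms, rho(r) = sum_nu a_nu j0(nu * pi * r / R) for r <= R and zero beyond. The sketch below evaluates such a series; the coefficients and cutoff radius are made-up assumptions, not fitted nuclear data.

```python
# Sketch of a Fourier-Bessel series evaluation for a charge density.
# j0 is the zeroth spherical Bessel function, j0(x) = sin(x) / x.
import math

def j0(x):
    return 1.0 if x == 0.0 else math.sin(x) / x

def fb_density(r, coeffs, R):
    """rho(r) = sum_nu a_nu * j0(nu * pi * r / R), zero outside r > R."""
    if r > R:
        return 0.0
    return sum(a * j0((nu + 1) * math.pi * r / R)
               for nu, a in enumerate(coeffs))

coeffs = [0.07, 0.03, -0.01]   # hypothetical FB coefficients (fm^-3)
R = 8.0                        # hypothetical cutoff radius (fm)
print(round(fb_density(0.0, coeffs, R), 3))
```

In the paper's setup, a network predicts the expansion coefficients per nucleus, and quantities such as the charge radius are then computed from the reconstructed density.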

附件下载

点击下载今日全部论文列表