This blog post contains the latest paper listing retrieved from arXiv.org on 2026-04-15, updated automatically and grouped into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: Paper data is fetched from arXiv.org daily and updated automatically around 12:30 each morning.

Tip: If the list is not updated on a given day, either arXiv released no new papers that day or the update script failed. Failures will be fixed the same day whenever possible.

Table of Contents

Overview (2026-04-15)

A total of 641 papers are updated today, including:

  • Natural Language Processing: 103 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 201 papers (Artificial Intelligence, cs.AI)
  • Computer Vision: 141 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 141 papers (Machine Learning, cs.LG)
  • Multi-Agent Systems: 10 papers (Multiagent Systems, cs.MA)
  • Information Retrieval: 13 papers (Information Retrieval, cs.IR)
  • Human-Computer Interaction: 14 papers (Human-Computer Interaction, cs.HC)

Multi-Agent Systems

[MA-0] A Multi-Agent Feedback System for Detecting and Describing News Events in Satellite Imagery

[Quick Read]: This paper addresses the scarcity of multi-temporal event captioning datasets in remote sensing. The core challenges are that (1) spotting visible events in satellite imagery is difficult, and (2) annotating multi-temporal image sequences demands substantial labor and time. The authors propose the SkyScraper framework, whose key innovation is an iterative multi-agent workflow that geocodes news articles and automatically generates captions for the corresponding satellite image sequences, enabling efficient mining and annotation of multi-temporal events. Experiments show the method discovers over 5x more events than traditional geocoding strategies, validating the effectiveness of agentic feedback for improving event discovery in remote sensing.

Link: https://arxiv.org/abs/2604.12772
Authors: Madeline Anderson, Mikhail Klassen, Ash Hoover, Kerri Cahoy
Affiliations: Massachusetts Institute of Technology; Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments:

Abstract:Changes in satellite imagery often occur over multiple time steps. Despite the emergence of bi-temporal change captioning datasets, there is a lack of multi-temporal event captioning datasets (at least two images per sequence) in remote sensing. This gap exists because (1) searching for visible events in satellite imagery and (2) labeling multi-temporal sequences require significant time and labor. To address these challenges, we present SkyScraper, an iterative multi-agent workflow that geocodes news articles and synthesizes captions for corresponding satellite image sequences. Our experiments show that SkyScraper successfully finds 5x more events than traditional geocoding methods, demonstrating that agentic feedback is an effective strategy for surfacing new multi-temporal events in satellite imagery. We apply our framework to a large database of global news articles, curating a new multi-temporal captioning dataset with 5,000 sequences. By automatically identifying imagery related to news events, our work also supports journalism and reporting efforts.

[MA-1] ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search CVPR2026

[Quick Read]: This paper tackles the information-asymmetry problem in multi-camera person search, recasting the traditionally static matching task as a dynamic decision process that requires interactive reasoning by an agent. The core challenge: given a vague witness statement, the agent must autonomously plan its questioning strategy within a limited number of dialogue turns, invoke spatio-temporal tools (a spatial topology graph and temporal transition models), and interpret ambiguous feedback to progressively narrow the candidate set. The key innovation is the ARGOS framework, built on a Spatio-Temporal Topology Graph (STTG) that encodes camera connectivity and empirically validated transition times, with a benchmark of three progressive tracks (semantic perception, spatial reasoning, temporal reasoning) to systematically evaluate the interactive reasoning ability of large language models (LLMs) in complex real-world scenarios. Experiments show that mainstream LLMs remain far from optimal on this task (best TWS of only 0.383 on Track 2 and 0.590 on Track 3), and that removing domain-specific tools drops accuracy by up to 49.6 percentage points, underscoring the central role of the STTG and the interactive reasoning mechanism.

Link: https://arxiv.org/abs/2604.12762
Authors: Myungchul Kim, Kwanyong Park, Junmo Kim, In So Kweon
Affiliations: KAIST; University of Seoul; KIST
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted to CVPR 2026 Workshop on Multimodal Spatial Intelligence (MUSI)

Abstract:We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.
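
The spatio-temporal pruning that an STTG enables can be sketched as follows. This is an illustrative toy, not the paper's implementation: the camera names, transit-time windows, and sighting tuples are all hypothetical, standing in for edges that encode camera connectivity and empirically validated transition times.

```python
# Illustrative sketch of pruning person-search candidates with a
# spatio-temporal topology graph (STTG): an edge stores an empirically
# plausible transit-time window between two cameras. All names and
# numbers here are hypothetical.

# (camera_a, camera_b) -> (min_seconds, max_seconds) transit window
STTG = {
    ("cam1", "cam2"): (30, 120),
    ("cam2", "cam3"): (60, 300),
}

def feasible(sighting_a, sighting_b):
    """Can one person account for both sightings, given the topology?"""
    cam_a, t_a = sighting_a
    cam_b, t_b = sighting_b
    window = STTG.get((cam_a, cam_b))
    if window is None:
        return False          # cameras not directly connected
    lo, hi = window
    return lo <= t_b - t_a <= hi

def prune(anchor, candidates):
    """Keep only candidate sightings temporally consistent with the anchor."""
    return [c for c in candidates if feasible(anchor, c)]

candidates = [("cam2", 100.0), ("cam2", 500.0), ("cam3", 90.0)]
print(prune(("cam1", 40.0), candidates))  # only ("cam2", 100.0) survives
```

Each temporal-reasoning tool call in the benchmark can be seen as one such feasibility query that shrinks the candidate set.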

[MA-2] RPRA: Predicting an LLM-Judge for Efficient but Performant Inference

[Quick Read]: This paper addresses the fundamental trade-off between computational efficiency (e.g., parameter count) and output quality in large language models (LLMs), particularly when deployed on resource-constrained devices such as phones or laptops. The core of the solution is a self-assessment mechanism: before responding, a model predicts the score an LLM judge would assign its output, then decides whether to answer itself or defer to a larger model. The key innovations are the Predict-Answer/Act (PA) and Reason-Predict-Reason-Answer/Act (RPRA) paradigms, together with three approaches for improving a small model's judgment of its own ability: zero-shot prediction, an in-context report card, and supervised fine-tuning. Experiments show that fine-tuning and report cards substantially improve small models' prediction accuracy (mean improvements of up to 52% and 55%, respectively), enabling more efficient and self-aware AI systems.

Link: https://arxiv.org/abs/2604.12634
Authors: Dylan R. Ashley, Gaël Le Lan, Changsheng Zhao, Naina Dhingra, Zhipeng Cai, Ernie Chang, Mingchen Zhuge, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber
Affiliations: University of Lugano
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 10 pages in main text + 6 pages of references + 36 pages of appendices, 12 figures in main text + 37 figures in appendices, 2 tables in main text + 3 tables in appendices, 13 prompts in appendices

Abstract:Large language models (LLMs) face a fundamental trade-off between computational efficiency (e.g., number of parameters) and output quality, especially when deployed on computationally limited devices such as phones or laptops. One way to address this challenge is by following the example of humans and have models ask for help when they believe they are incapable of solving a problem on their own; we can overcome this trade-off by allowing smaller models to respond to queries when they believe they can provide good responses, and deferring to larger models when they do not believe they can. To this end, in this paper, we investigate the viability of Predict-Answer/Act (PA) and Reason-Predict-Reason-Answer/Act (RPRA) paradigms where models predict – prior to responding – how an LLM judge would score their output. We evaluate three approaches: zero-shot prediction, prediction using an in-context report card, and supervised fine-tuning. Our results show that larger models (particularly reasoning models) perform well when predicting generic LLM judges zero-shot, while smaller models can reliably predict such judges well after being fine-tuned or provided with an in-context report card. Altogether, both approaches can substantially improve the prediction accuracy of smaller models, with report cards and fine-tuning achieving mean improvements of up to 55% and 52% across datasets, respectively. These findings suggest that models can learn to predict their own performance limitations, paving the way for more efficient and self-aware AI systems.
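
The predict-then-defer routing idea behind the PA paradigm can be sketched in a few lines. This is a minimal illustration under assumptions: `small_model` and `large_model` are hypothetical stand-ins (here trivial functions), and the score threshold is invented, not taken from the paper.

```python
# Minimal sketch of Predict-Answer (PA) routing: a small model first
# predicts the score an LLM judge would give its own answer, and defers
# to a larger model when that prediction falls below a threshold.
# Both "models" here are toy placeholders.

def route(query, small_model, large_model, threshold=7.0):
    predicted_score, draft = small_model(query)   # predict judge score first
    if predicted_score >= threshold:
        return "small", draft                     # answer locally
    return "large", large_model(query)            # defer to the big model

small = lambda q: (9.0, "short answer") if "easy" in q else (3.0, "guess")
large = lambda q: "careful answer"

print(route("easy question", small, large))   # ('small', 'short answer')
print(route("hard question", small, large))   # ('large', 'careful answer')
```

RPRA would additionally interleave reasoning steps before and after the prediction; the routing decision itself stays the same.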

[MA-3] How memory can affect collective and cooperative behaviors in an LLM-Based Social Particle Swarm

[Quick Read]: This paper investigates how model-specific characteristics of large language model (LLM) agents, including internal alignment, shape the way memory modulates collective cooperative behavior in multi-agent systems. The key to the approach is extending the Social Particle Swarm (SPS) model by replacing its rule-based agents with LLM agents endowed with Big Five personality traits and adjustable memory lengths. By contrasting the behavior of different models (Gemini-2.0-Flash vs. Gemma 3:4b), the study reveals that model-specific characteristics such as alignment fundamentally modulate how memory affects the evolution of cooperation: the effect of memory length is not universal but depends on the LLM's own cognitive tendencies, reflected in particular in differing sentiment assessments of memory content.

Link: https://arxiv.org/abs/2604.12250
Authors: Taisei Hishiki, Takaya Arita, Reiji Suzuki
Affiliations: Nagoya University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: 12 pages, 6 figures and 2 tables

Abstract:This study examines how model-specific characteristics of Large Language Model (LLM) agents, including internal alignment, shape the effect of memory on their collective and cooperative dynamics in a multi-agent system. To this end, we extend the Social Particle Swarm (SPS) model, in which agents move in a two-dimensional space and play the Prisoner’s Dilemma with neighboring agents, by replacing its rule-based agents with LLM agents endowed with Big Five personality scores and varying memory lengths. Using Gemini-2.0-Flash, we find that memory length is a critical parameter governing collective behavior: even a minimal memory drastically suppressed cooperation, transitioning the system from stable cooperative clusters through cyclical formation and collapse of clusters to a state of scattered defection as memory length increased. Big Five personality traits correlated with agent behaviors in partial agreement with findings from experiments with human participants, supporting the validity of the model. Comparative experiments using Gemma 3:4b revealed the opposite trend: longer memory promoted cooperation, accompanied by the formation of dense cooperative clusters. Sentiment analysis of agents’ reasoning texts showed that Gemini interprets memory increasingly negatively as its length grows, while Gemma interprets it less negatively, and that this difference persists in the early phase of experiments before the macro-level dynamics converge. These results suggest that model-specific characteristics of LLMs, potentially including alignment, play a fundamental role in determining emergent social behavior in Generative Agent-Based Modeling, and provide a micro-level cognitive account of the contradictions found in prior work on memory and cooperation.
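
The role of a memory window in the iterated Prisoner's Dilemma that SPS agents play can be illustrated with a toy rule-based agent. The decision heuristic below is an invented illustration (the paper's agents decide via LLM prompts, not this rule); only the mechanics of truncating history to the last `memory_len` interactions mirror the setup.

```python
# Toy sketch of a memory window in an iterated Prisoner's Dilemma: an
# agent sees only the last `memory_len` opponent moves when deciding.
# The majority-defection rule is purely illustrative.

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def decide(history, memory_len):
    window = history[-memory_len:] if memory_len > 0 else []
    if not window:
        return "C"                       # cooperate by default
    defections = sum(1 for move in window if move == "D")
    return "D" if defections / len(window) > 0.5 else "C"

opponent_moves = ["C", "C", "D", "D", "D"]
print(decide(opponent_moves, 0))   # 'C' (no memory: unconditional cooperation)
print(decide(opponent_moves, 3))   # 'D' (recent window is all defection)
print(decide(opponent_moves, 5))   # 'D' (3/5 defections > 0.5)
```

Even in this toy, the same history yields different behavior under different window lengths, which is the parameter the paper varies for its LLM agents.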

[MA-4] Representing expertise accelerates learning from pedagogical interaction data

[Quick Read]: This paper examines how interaction data can improve the performance of learning agents on complex tasks, and in particular which features of interaction foster better generalization. The key to the approach is training transformer models on synthetic expert-novice interaction data from a spatial navigation task: models trained on pedagogical interactions prove markedly more robust across diverse scenarios, and models able to represent multiple epistemically distinct agents exhibit near-expert behavior even when expert behavior is rarely observed during training.

Link: https://arxiv.org/abs/2604.12195
Authors: Dhara Yu, Karthikeya Kaushik, Bill D. Thompson
Affiliations: UC Berkeley
Subjects: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:Work in cognitive science and artificial intelligence has suggested that exposing learning agents to traces of interaction between multiple individuals can improve performance in a variety of settings, yet it remains unknown which features of interactions contribute to this improvement. We examined the factors that support the effectiveness of interaction data, using a controlled paradigm that allowed us to precisely operationalize key distinctions between interaction and an expert acting alone. We generated synthetic datasets of simple interactions between an expert and a novice in a spatial navigation task, and then trained transformer models on those datasets, evaluating performance after exposure to different datasets. Our experiments showed that models trained on pedagogical interactions were more robust across a variety of scenarios compared to models trained only on expert demonstrations, and that having the ability to represent epistemically distinct agents led to expert-like behavior even when expert behavior was rarely observed.

[MA-5] VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems

[Quick Read]: This paper targets a research bottleneck in analyzing multimodal clinical data (including medical imaging) caused by fragmented cross-disciplinary collaboration: how to automatically verify natural-language hypotheses and produce an auditable evidence trail without manual intervention. The key is VERITAS (Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems), a multi-agent system that decomposes the workflow into four role-specialized phases and introduces an epistemic evidence label framework that mechanically classifies outcomes as Supported, Refuted, Underpowered, or Invalid, distinguishing non-significance due to insufficient sample size from a genuinely absent effect. This design improves hypothesis-verification accuracy (81.4% verdict accuracy) and ensures all outputs are independently verifiable (an 86.6% rate of verifiable statistical outputs), preserving the interpretability and trustworthiness clinical research demands even at limited model scale.

Link: https://arxiv.org/abs/2604.12144
Authors: Lucas Stoffl, Benedikt Wiestler, Johannes C. Paetzold
Affiliations: Weill Cornell Medicine; Cornell Tech; TUM University Hospital; Technical University of Munich; Munich Center for Machine Learning
Subjects: Multiagent Systems (cs.MA)
Comments: 42 pages, 5 figures. Code available at this https URL

Abstract:Drawing meaningful conclusions from inherently multimodal clinical data (including medical imaging) requires coordinating expertise across the clinical specialty, radiology, programming, and biostatistics. This fragmented process bottlenecks discovery. We present VERITAS (Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems), a multi-agent system that autonomously tests natural-language hypotheses on multimodal clinical datasets while producing a fully auditable evidence trail: every statistical conclusion traces through inspectable, executable outputs from analysis plan to segmentation masks to statistical code to final verdict. VERITAS decomposes the workflow into four phases handled by role-specialized agents, and introduces an epistemic evidence label framework that mechanically classifies outcomes as Supported, Refuted, Underpowered, or Invalid by jointly evaluating significance, effect direction, and study power. This distinction is critical in medical imaging, where non-significant results often reflect insufficient sample size rather than absent effects. To evaluate the system, we construct a tiered benchmark of 64 hypotheses spanning six complexity levels across cardiac (ACDC, 150 subjects) and brain glioma (UCSF-PDGM, 501 subjects) MRI. VERITAS reaches 81.4% verdict accuracy with frontier models and 71.2% with locally-hosted open-weight models (8-30B), outperforming all five single-model baselines in both classes. It also produces the highest rate of independently verifiable statistical outputs (86.6%), so even its failures remain diagnosable through artifact inspection. Structured multi-agent decomposition thus substitutes for model scale while preserving the verifiability clinical research demands.
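
The evidence-label scheme (jointly evaluating significance, effect direction, and study power) lends itself to a compact decision rule. The sketch below is an assumed reading of that rule; the alpha and power thresholds are conventional defaults, not values taken from the paper.

```python
# Sketch of VERITAS-style epistemic evidence labels: an outcome is
# classified by jointly checking validity, significance, effect
# direction, and statistical power. Thresholds are illustrative.

def evidence_label(p_value, effect, hypothesized_sign, power,
                   valid=True, alpha=0.05, min_power=0.8):
    if not valid:
        return "Invalid"
    if p_value < alpha:
        # significant: does the effect point the hypothesized way?
        same_direction = (effect > 0) == (hypothesized_sign > 0)
        return "Supported" if same_direction else "Refuted"
    # non-significant: could the study even have detected an effect?
    return "Underpowered" if power < min_power else "Refuted"

print(evidence_label(0.01, effect=1.2, hypothesized_sign=+1, power=0.9))  # Supported
print(evidence_label(0.30, effect=0.1, hypothesized_sign=+1, power=0.4))  # Underpowered
print(evidence_label(0.30, effect=0.1, hypothesized_sign=+1, power=0.9))  # Refuted
```

The Underpowered branch is the one the abstract highlights as critical in medical imaging, where non-significance often reflects sample size rather than an absent effect.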

[MA-6] Aethon: A Reference-Based Replication Primitive for Constant-Time Instantiation of Stateful AI Agents

[Quick Read]: This paper addresses a runtime-architecture bottleneck for stateful AI agents in current AI infrastructure: conventional materialization-heavy instantiation imposes significant latency and memory overhead, limiting the scalability of large multi-agent systems. The key is Aethon, a reference-based replication primitive that represents each agent instance as a compositional view over stable definitions, layered memory, and local contextual overlays, rather than reconstructing a fully materialized object, thereby achieving near-constant-time instantiation of stateful agents. By substituting references for duplication, it decouples creation cost from inherited structure, improving efficiency and, more fundamentally, reframing the systems abstraction for production-scale agentic software.

Link: https://arxiv.org/abs/2604.12129
Authors: Swanand Rao, Kiran Kashalkar, Parvathi Somashekar, Priya Krishnan
Affiliations: Next Moca Global, Inc.
Subjects: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments: 12 pages. Systems paper introducing a novel agent instantiation primitive for scalable multi-agent infrastructure

Abstract:The transition from stateless model inference to stateful agentic execution is reshaping the systems assumptions underlying modern AI infrastructure. While large language models have made persistent, tool-using, and collaborative agents technically viable, existing runtime architectures remain constrained by materialization-heavy instantiation models that impose significant latency and memory overhead. This paper introduces Aethon, a reference-based replication primitive for near-constant-time instantiation of stateful AI agents. Rather than reconstructing agents as fully materialized objects, Aethon represents each instance as a compositional view over stable definitions, layered memory, and local contextual overlays. By shifting instantiation from duplication to reference, Aethon decouples creation cost from inherited structure. We present the conceptual framework, system architecture, and memory model underlying Aethon, including layered inheritance and copy-on-write semantics. We analyze its implications for complexity, scalability, multi-agent orchestration, and enterprise governance. We argue that reference-based instantiation is not merely an optimization, but a more appropriate systems abstraction for production-scale agentic software. Aethon points toward a new class of AI infrastructure in which agents become lightweight, composable execution identities that can be spawned, specialized, and governed at scale.
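
The layered, copy-on-write view described in the abstract can be sketched with a minimal class. This is a hypothetical illustration of the general technique (a thin overlay over a shared base), not Aethon's actual design, which the abstract describes as much richer.

```python
# Sketch of reference-based instantiation with copy-on-write overlays:
# an "instance" is a thin view over a shared base definition, so
# spawning costs O(1) instead of a deep copy. Names are illustrative.

class AgentInstance:
    def __init__(self, base):
        self.base = base           # shared, never copied
        self.overlay = {}          # local writes land here (copy-on-write)

    def get(self, key):
        return self.overlay.get(key, self.base.get(key))

    def set(self, key, value):
        self.overlay[key] = value  # base stays untouched

base_definition = {"role": "researcher", "tools": "search"}
a = AgentInstance(base_definition)
b = AgentInstance(base_definition)
b.set("role", "reviewer")          # specialize one instance locally

print(a.get("role"))               # researcher (inherited from base)
print(b.get("role"))               # reviewer   (local overlay)
print(base_definition["role"])     # researcher (shared base unmodified)
```

Creation cost here is independent of how much state the base definition carries, which is the decoupling the abstract argues for.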

[MA-7] REGREACT: Self-Correcting Multi-Agent Pipelines for Structured Regulatory Information Extraction

[Quick Read]: This paper tackles the open challenge of extracting structured, machine-readable compliance criteria from regulatory documents, where single-pass language models tend to hallucinate structural elements, lose hierarchical relationships, and fail to resolve cross-document dependencies. The key is RegReAct, a self-correcting multi-agent framework that decomposes regulatory information extraction into seven specialized stages, each wrapped in an Observe-Diagnose-Repair (ODR) loop that validates outputs against the source, correcting not only model hallucinations but also cross-reference errors in the regulations themselves. The framework also constructs a typed criterion graph to guarantee structural accuracy and resolves external dependencies by retrieving, summarizing, and embedding referenced legal content, yielding self-contained outputs.

Link: https://arxiv.org/abs/2604.12054
Authors: Mohammed Ali, Abdelrahman Abdallah, Adam Jatowt
Affiliations: University of Innsbruck, Austria
Subjects: Multiagent Systems (cs.MA)
Comments:

Abstract:Extracting structured, machine-readable compliance criteria from regulatory documents remains an open challenge. Single-pass language models hallucinate structural elements, lose hierarchical relationships, and fail to resolve inter-document dependencies. We introduce RegReAct, a self-correcting multi-agent framework that decomposes regulatory information extraction into seven specialized stages, each with an Observe–Diagnose–Repair (ODR) loop that validates outputs against the source, correcting not only model hallucinations but also cross-reference errors in the regulations themselves. To ensure structural accuracy, RegReAct constructs a typed criterion graph; to ensure completeness, it resolves external dependencies by retrieving, summarizing, and embedding referenced legal content inline, producing self-contained outputs. Applying RegReAct to three EU Taxonomy Delegated Acts, we construct a dataset comprising 242 activities with over 4,800 hierarchical criteria, thresholds, and enriched source summaries. Evaluation against a GPT-4o single-pass baseline confirms that RegReAct outperforms it across all structural and semantic metrics. Code and data will be made publicly available: this https URL
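
The shape of an Observe-Diagnose-Repair loop can be sketched generically. The validator and repairer below are trivial hypothetical placeholders (the paper's agents are LLM-based); only the loop structure of diagnosing against the source and repairing until clean reflects the described mechanism.

```python
# Generic sketch of an Observe-Diagnose-Repair (ODR) loop: each stage's
# output is checked against the source and repaired before moving on.
# diagnose/repair are toy stand-ins for the paper's agents.

def odr_loop(output, source, diagnose, repair, max_rounds=3):
    for _ in range(max_rounds):
        defects = diagnose(output, source)   # Observe + Diagnose
        if not defects:
            return output                    # validated against the source
        output = repair(output, defects)     # Repair, then re-check
    return output

diagnose = lambda out, src: [t for t in src if t not in out]  # missing items
repair = lambda out, defects: out + defects                   # reinsert them

source_criteria = ["threshold-A", "threshold-B"]
draft = ["threshold-A"]
print(odr_loop(draft, source_criteria, diagnose, repair))
# ['threshold-A', 'threshold-B']
```

In RegReAct each of the seven stages would run such a loop with its own stage-specific validation.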

[MA-8] AutoSurrogate: An LLM-Driven Multi-Agent Framework for Autonomous Construction of Deep Learning Surrogate Models in Subsurface Flow

[Quick Read]: This paper addresses the high computational cost of high-fidelity subsurface flow simulation, especially in many-query tasks such as uncertainty quantification and data assimilation. Building deep learning (DL) surrogates traditionally requires substantial machine learning (ML) expertise (architecture design, hyperparameter tuning, and more) that most domain scientists lack, and current practice relies heavily on manual effort and heuristic choices, limiting the adoption of DL surrogate techniques. The key is AutoSurrogate, an LLM-driven multi-agent framework that, from natural-language instructions alone, automates the full modeling pipeline: data preprocessing, architecture selection from a model zoo, Bayesian hyperparameter optimization, training, and quality assessment, while autonomously handling failure modes such as numerical instability and insufficient predictive accuracy. This yields deployment-ready, high-quality surrogates without manual intervention, significantly lowering the barrier to entry while improving performance.

Link: https://arxiv.org/abs/2604.11945
Authors: Jiale Liu, Nanzhe Wang
Affiliations: University of Edinburgh; Heriot-Watt University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:High-fidelity numerical simulation of subsurface flow is computationally intensive, especially for many-query tasks such as uncertainty quantification and data assimilation. Deep learning (DL) surrogates can significantly accelerate forward simulations, yet constructing them requires substantial machine learning (ML) expertise - from architecture design to hyperparameter tuning - that most domain scientists do not possess. Furthermore, the process is predominantly manual and relies heavily on heuristic choices. This expertise gap remains a key barrier to the broader adoption of DL surrogate techniques. For this reason, we present AutoSurrogate, a large-language-model-driven multi-agent framework that enables practitioners without ML expertise to build high-quality surrogates for subsurface flow problems through natural-language instructions. Given simulation data and optional preferences, four specialized agents collaboratively execute data profiling, architecture selection from a model zoo, Bayesian hyperparameter optimization, model training, and quality assessment against user-specified thresholds. The system also handles common failure modes autonomously, including restarting training with adjusted configurations when numerical instabilities occur and switching to alternative architectures when predictive accuracy falls short of targets. In our setting, a single natural-language sentence can be sufficient to produce a deployment-ready surrogate model, with minimum human intervention required at any intermediate stage. We demonstrate the utility of AutoSurrogate on a 3D geological carbon storage modeling task, mapping permeability fields to pressure and CO2 saturation fields over 31 timesteps. Without any manual tuning, AutoSurrogate is able to outperform expert-designed baselines and domain-agnostic AutoML methods, demonstrating strong potential for practical deployment.
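
The failure-handling policy the abstract describes (restart with adjusted configuration on instability, switch architecture on accuracy shortfall) can be sketched as a simple control loop. Everything here is hypothetical: `fake_train`, the architecture names, and the learning-rate halving rule stand in for the paper's agents and model zoo.

```python
# Sketch of AutoSurrogate-style autonomous failure handling: restart
# training with an adjusted configuration on numerical instability, and
# switch architectures when accuracy misses the target. All components
# are toy stand-ins.

def build_surrogate(train, model_zoo, target_error, max_attempts=4):
    lr = 1e-3
    for arch in model_zoo:
        for _ in range(max_attempts):
            error = train(arch, lr)
            if error is None:          # numerical instability (e.g. NaN loss)
                lr *= 0.5              # restart with adjusted configuration
                continue
            if error <= target_error:
                return arch, error     # deployment-ready surrogate
            break                      # accuracy shortfall: try next architecture
    raise RuntimeError("no architecture met the target")

def fake_train(arch, lr):
    if arch == "fno" and lr > 2.5e-4:
        return None                    # "unstable" until lr is lowered enough
    return {"unet": 0.12, "fno": 0.04}[arch]

print(build_surrogate(fake_train, ["unet", "fno"], target_error=0.05))
```

The toy run discards the first architecture for insufficient accuracy and stabilizes the second by lowering the learning rate, mirroring the two failure modes named in the abstract.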

[MA-9] BIND-USBL: Bounding IMU Navigation Drift using USBL in Heterogeneous ASV-AUV Teams

[Quick Read]: This paper addresses accurate, continuous localization of autonomous underwater vehicles (AUVs) in GPS-denied environments, where inertial dead-reckoning accumulates unbounded error over time due to sensor bias and noise. The key is the BIND-USBL framework, in which multiple autonomous surface vessels (ASVs) equipped with Ultra-Short Baseline (USBL) acoustic positioning provide intermittent position fixes that bound the AUVs' dead-reckoning error. Its central insight: long-duration navigation failure is driven not by the accuracy of individual USBL measurements but by the temporal sparsity and geometric availability of fixes. The framework combines acoustic-coverage optimization based on a multi-ASV formation model, a conflict-graph-based TDMA uplink scheduling strategy, and delayed fusion, improving per-AUV fix delivery rate while maintaining low end-to-end latency under a no-collision constraint.

Link: https://arxiv.org/abs/2604.11861
Authors: Pranav Kedia, Rajini Makam, Heiko Hamann, Suresh Sundaram
Affiliations: University of California, Berkeley; Lawrence Berkeley National Laboratory; University of Texas at Austin
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: Accepted at OCEANS 2026, Sanya, China

Abstract:Accurate and continuous localization of Autonomous Underwater Vehicles (AUVs) in GPS-denied environments is a persistent challenge in marine robotics. In the absence of external position fixes, AUVs rely on inertial dead-reckoning, which accumulates unbounded drift due to sensor bias and noise. This paper presents BIND-USBL, a cooperative localization framework in which a fleet of Autonomous Surface Vessels (ASVs) equipped with Ultra-Short Baseline (USBL) acoustic positioning systems provides intermittent fixes to bound AUV dead-reckoning error. The key insight is that long-duration navigation failure is driven not by the accuracy of individual USBL measurements, but by the temporal sparsity and geometric availability of those fixes. BIND-USBL combines a multi-ASV formation model linking survey scale and anchor placement to acoustic coverage, a conflict-graph-based TDMA uplink scheduler for shared-channel servicing, and delayed fusion of received USBL updates with drift-prone dead reckoning. The framework is evaluated in the HoloOcean simulator using heterogeneous ASV-AUV teams executing lawnmower coverage missions. The results show that localization performance is shaped by the interaction of survey scale, acoustic coverage, team composition, and ASV-formation geometry. Further, the spatial-reuse scheduler improves per-AUV fix delivery rate without violating the no-collision constraint, while maintaining low end-to-end fix latency.
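
Why intermittent fixes bound dead-reckoning error can be seen in a one-dimensional toy model: uncertainty grows with every inertial step and collapses to the fix accuracy whenever an acoustic update arrives. The numbers below are illustrative, not from the paper.

```python
# Toy sketch of bounded dead-reckoning: position uncertainty grows each
# inertial step and is reset to the fix accuracy whenever a USBL fix
# arrives. All parameters are illustrative.

def uncertainty_trace(steps, drift_per_step, fix_every, fix_sigma):
    sigma, trace = 0.0, []
    for t in range(1, steps + 1):
        sigma += drift_per_step            # dead-reckoning drift accumulates
        if fix_every and t % fix_every == 0:
            sigma = fix_sigma              # USBL fix resets the error
        trace.append(sigma)
    return trace

no_fixes = uncertainty_trace(100, 0.1, fix_every=0, fix_sigma=0.5)
with_fixes = uncertainty_trace(100, 0.1, fix_every=10, fix_sigma=0.5)
print(round(max(no_fixes), 2))    # grows with mission length
print(round(max(with_fixes), 2))  # bounded by the fix interval instead
```

This is the paper's key insight in miniature: the bound depends on the fix interval (temporal sparsity), not on the accuracy of any single measurement or the mission duration.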

Natural Language Processing

[NLP-0] SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis

[Quick Read]: This paper addresses the instability of current LLM- or VLM-based evaluation of indoor scene generation, where judgments are sensitive to viewpoint, prompt phrasing, and hallucination, making it hard to separate a scene's spatial plausibility from evaluator bias. The key is SceneCritic, a symbolic evaluator grounded in a structured spatial ontology (SceneOnto) built by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome. It jointly verifies the semantic, orientation, and geometric coherence of objects and their relationships in a layout, enabling fine-grained, object- and relationship-level localization of violations and successful placements. The paper also builds an iterative refinement test bed comparing three critic modalities (rule-based constraints, LLM text feedback, and VLM image feedback), showing empirically that SceneCritic aligns with human judgment substantially better than VLM evaluators and that image-based feedback is the most effective modality for semantic and orientation correction.

Link: https://arxiv.org/abs/2604.13035
Authors: Kathakoli Sengupta, Kai Ao, Paola Cascante-Bonilla
Affiliations: Stony Brook University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Project Page: this https URL
Abstract:Large Language Models (LLMs) and Vision-Language Models (VLMs) increasingly generate indoor scenes through intermediate structures such as layouts and scene graphs, yet evaluation still relies on LLM or VLM judges that score rendered views, making judgments sensitive to viewpoint, prompt phrasing, and hallucination. When the evaluator is unstable, it becomes difficult to determine whether a model has produced a spatially plausible scene or whether the output score reflects the choice of viewpoint, rendering, or prompt. We introduce SceneCritic, a symbolic evaluator for floor-plan-level layouts. SceneCritic’s constraints are grounded in SceneOnto, a structured spatial ontology we construct by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome. SceneOnto traverses this ontology to jointly verify semantic, orientation, and geometric coherence across object relationships, providing object-level and relationship-level assessments that identify specific violations and successful placements. Furthermore, we pair SceneCritic with an iterative refinement test bed that probes how models build and revise spatial structure under different critic modalities: a rule-based critic using collision constraints as feedback, an LLM critic operating on the layout as text, and a VLM critic operating on rendered observations. Through extensive experiments, we show that (a) SceneCritic aligns substantially better with human judgments than VLM-based evaluators, (b) text-only LLMs can outperform VLMs on semantic layout quality, and (c) image-based VLM refinement is the most effective critic modality for semantic and orientation correction.
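
A symbolic layout check of the kind SceneCritic performs can be sketched with a single geometric constraint. The collision rule, the rectangle representation, and the object names below are all illustrative assumptions; the real system checks many constraint types drawn from the SceneOnto ontology.

```python
# Minimal sketch of a symbolic layout check: verify one relationship
# constraint (axis-aligned overlap as a stand-in for "collision") at the
# object level, reporting specific violations instead of one opaque score.

def overlaps(a, b):
    """Axis-aligned rectangles (x, y, w, h): do they intersect?"""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def check_layout(layout):
    names = list(layout)
    violations = []
    for i, n1 in enumerate(names):
        for n2 in names[i + 1:]:
            if overlaps(layout[n1], layout[n2]):
                violations.append((n1, n2, "collision"))
    return violations

layout = {"bed": (0, 0, 2, 3), "desk": (1, 1, 2, 1), "chair": (4, 4, 1, 1)}
print(check_layout(layout))  # [('bed', 'desk', 'collision')]
```

Because the output names the offending object pair, such a critic can drive iterative refinement in a way a scalar score cannot.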

[NLP-1] Toward Autonomous Long-Horizon Engineering for ML Research

[Quick Read]: This paper addresses the systemic difficulty of long-horizon ML research engineering: sustaining an agent's coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. Conventional approaches pass state through conversational handoffs, making cross-stage consistency and traceability hard to guarantee. The key innovation of the proposed AiScientist system is combining hierarchical orchestration with a File-as-Bus workspace: a top-level Orchestrator maintains stage-level control and a project map, while specialized agents repeatedly re-ground on durable artifacts (analyses, plans, code, and experimental evidence) rather than relying on unstructured conversational handoffs, achieving thin control over thick state. Experiments show this design substantially improves PaperBench and MLE-Bench Lite scores, and ablations confirm that the File-as-Bus protocol is a key driver of the gains.

Link: https://arxiv.org/abs/2604.13018
Authors: Guoxin Chen, Jie Chen, Lei Chen, Jiale Zhao, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Cheng Chen, Ji-Rong Wen, Kai Jia
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Repo: this https URL

Abstract:Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long-horizon engineering for ML research built on a simple principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace: a top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs, yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that File-as-Bus protocol is a key driver of performance, reducing PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points when removed. These results suggest that long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.
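
The File-as-Bus idea (agents exchanging state through durable artifacts rather than conversational handoffs) can be sketched in miniature. An in-memory dict stands in for a real workspace directory; agent and file names are invented for illustration.

```python
# Sketch of File-as-Bus coordination: each agent re-grounds on durable
# workspace artifacts before acting, and the orchestrator only sequences
# stages ("thin control over thick state"). Names are illustrative.

workspace = {}

def planner(ws):
    ws["plan.md"] = "1. load data  2. train  3. evaluate"

def coder(ws):
    plan = ws["plan.md"]                      # re-ground on the durable plan
    ws["train.py"] = f"# implements: {plan}"

def orchestrator(ws, stages):
    for stage in stages:
        stage(ws)                             # no conversational handoff
    return sorted(ws)                         # workspace map for later stages

print(orchestrator(workspace, [planner, coder]))  # ['plan.md', 'train.py']
```

The point of the design is that `coder` reads the plan from the workspace, so any stage can crash, restart, or be swapped out without losing the shared state.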

[NLP-2] Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

[Quick Read]: This paper seeks to clarify the poorly understood training dynamics of on-policy distillation (OPD) for large language models, in particular why some OPD runs succeed while others fail. The study identifies two governing conditions: the student and teacher must share compatible thinking patterns, and even with consistent patterns and higher scores, the teacher must offer genuinely new capabilities the student has not seen during training. The key findings: weak-to-strong reverse distillation shows that same-family teachers and students are distributionally indistinguishable from the student's perspective, and successful OPD manifests as progressive alignment on high-probability tokens at student-visited states, with a small shared token set concentrating 97%-99% of the probability mass. The paper further proposes two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection.

Link: https://arxiv.org/abs/2604.13016
Authors: Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding
Affiliations: Tsinghua University; ShanghaiTech University; University of Illinois Urbana-Champaign; Renmin University of China
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 30 pages, 23 figures. Code: this https URL

Abstract:On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student’s perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD’s apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
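
The token-level diagnostic behind the "shared token set concentrating 97%-99% of the mass" finding can be sketched as follows. The distributions and the top-k cutoff below are toy assumptions; the paper's actual measurement is over real model vocabularies at student-visited states.

```python
# Sketch of the token-level diagnostic: at a given state, how much
# probability mass do teacher and student place on a shared
# high-probability token set? Distributions here are illustrative toys.

def shared_mass(p_teacher, p_student, top_k=3):
    top = lambda d: set(sorted(d, key=d.get, reverse=True)[:top_k])
    shared = top(p_teacher) & top(p_student)
    return (sum(p_teacher[t] for t in shared),
            sum(p_student[t] for t in shared))

teacher = {"the": 0.55, "a": 0.30, "an": 0.10, "one": 0.05}
student = {"the": 0.50, "a": 0.25, "an": 0.15, "one": 0.10}
t_mass, s_mass = shared_mass(teacher, student)
print(round(t_mass, 2), round(s_mass, 2))  # most mass sits on shared tokens
```

Tracking this quantity over training is one way to observe the "progressive alignment on high-probability tokens" that the paper reports for successful OPD.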

[NLP-3] One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness

[Quick Read]: This paper addresses the fragility of instruction-tuned large language models, whose response comprehensiveness degrades sharply under simple lexical constraints such as banning a single punctuation character or a common word. The key finding is that this fragility stems from a planning failure: the model fails to plan the response structure before generation, and a subsequent constrained pass cannot recover the lost content. A two-pass strategy (free generation followed by constrained rewriting) recovers 59%-96% of response length, and linear probes on prompt representations predict response length before generation begins (R^2 = 0.51-0.93), with predictive power tracking collapse severity across models, indicating that instruction tuning couples task competence to narrow surface-form templates and thereby introduces acute sensitivity to form constraints.

Link: https://arxiv.org/abs/2604.13006
Authors: Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram
Affiliations: University of Southern California; Intel AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Instruction-tuned large language models produce helpful, structured responses, but how robust is this helpfulness when trivially constrained? We show that simple lexical constraints (banning a single punctuation character or common word) cause instruction-tuned LLMs to collapse their responses, losing 14–48% of comprehensiveness in pairwise evaluation across three open-weight model families and one closed-weight model (GPT-4o-mini). The baseline response is preferred in 77–100% of 1,920 pairwise comparisons judged by GPT-4o-mini and GPT-4o. Notably, GPT-4o-mini suffers 31% comprehensiveness loss (99% baseline win rate), demonstrating that the fragility extends to commercially deployed closed-weight models, contrary to prior findings on format-level constraints. Through mechanistic analysis, we identify this as a planning failure: two-pass generation (free generation followed by constrained rewriting) recovers 59–96% of response length, and linear probes on prompt representations predict response length with R^2 = 0.51 – 0.93 before generation begins, with R^2 tracking collapse severity across models. The same probes yield negative R^2 on base models, confirming that instruction tuning creates the representational structure encoding the collapse decision. Crucially, base models show no systematic collapse under identical constraints, with effects that are small, noisy, and bidirectional, demonstrating that instruction tuning creates this fragility by coupling task competence to narrow surface-form templates. The effect replicates on MT-Bench across all eight task categories. We further show that standard independent LLM-as-judge evaluation detects only a 3.5% average quality drop where pairwise evaluation reveals 23%, exposing a methodological blind spot in how constrained generation is assessed.
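
The two-pass recovery strategy from the abstract can be sketched as a pipeline: generate freely first so the model can plan, then rewrite under the constraint. The `generate` function is a hypothetical stand-in for an LLM call; here it is a trivial toy so the pipeline's shape is visible.

```python
# Sketch of two-pass generation: pass 1 plans and drafts freely,
# pass 2 rewrites the draft under the lexical constraint, decoupling
# planning from constraint satisfaction. `generate` is a toy stand-in
# for an LLM call.

def two_pass(prompt, banned_word, generate):
    draft = generate(prompt)                          # pass 1: plan freely
    rewrite_prompt = (f"Rewrite without the word '{banned_word}', "
                      f"keeping all content: {draft}")
    return generate(rewrite_prompt)                   # pass 2: constrained rewrite

def toy_generate(prompt):
    if prompt.startswith("Rewrite without the word 'and'"):
        return prompt.split(": ", 1)[1].replace(" and ", " plus ")
    return "Apples and oranges are fruit."

print(two_pass("List fruit.", "and", toy_generate))
# Apples plus oranges are fruit.
```

The paper's result is that this decoupling recovers most of the lost response length, which is what implicates pre-generation planning rather than decoding in the collapse.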

[NLP-4] PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models ACL2026

[Quick Read]: This paper addresses the underexplored ability of current large language models (LLMs) to comprehend and reason about policy-related content in the public-policy domain. To fill this gap, the authors present PolicyBench, the first large-scale cross-system (US-China) benchmark for policy comprehension, comprising 21,000 cases across a broad range of policy areas and, following Bloom's taxonomy, assessing models at three levels: memorization, understanding, and application. The key contribution is PolicyMoE, a policy-specialized Mixture-of-Experts (MoE) model whose expert modules correspond to these cognitive levels; it achieves the best performance on structured reasoning tasks, exposes the limitations of existing LLMs in policy understanding, and points toward more reliable, policy-focused models.

Link: https://arxiv.org/abs/2604.12995
Authors: Han Bao, Penghao Zhang, Yue Huang, Zhengqing Yuan, Yanchi Ru, Rui Su, Yujun Zhou, Xiangqi Wang, Kehan Guo, Nitesh V Chawla, Yanfang Ye, Xiangliang Zhang
Affiliations: University of Notre Dame
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Accepted by ACL 2026 Findings

Click to view abstract

Abstract:Large Language Models (LLMs) are increasingly integrated into real-world decision-making, including in the domain of public policy. Yet, their ability to comprehend and reason about policy-related content remains underexplored. To fill this gap, we present PolicyBench, the first large-scale cross-system benchmark (US-China) evaluating policy comprehension, comprising 21K cases across a broad spectrum of policy areas, capturing the diversity and complexity of real-world governance. Following Bloom's taxonomy, the benchmark assesses three core capabilities: (1) Memorization: factual recall of policy knowledge, (2) Understanding: conceptual and contextual reasoning, and (3) Application: problem-solving in real-life policy scenarios. Building on this benchmark, we further propose PolicyMoE, a domain-specialized Mixture-of-Experts (MoE) model with expert modules aligned to each cognitive level. The proposed models demonstrate stronger performance on application-oriented policy tasks than on memorization or conceptual understanding, and yield the highest accuracy on structured reasoning tasks. Our results reveal key limitations of current LLMs in policy understanding and suggest paths toward more reliable, policy-focused models.

[NLP-5] Accelerating Speculative Decoding with Block Diffusion Draft Trees

[Quick Read]: This paper targets a limitation of current diffusion-based speculative decoding: each verification round checks only a single drafted trajectory, which caps the acceptance length. The key idea of DDTree (Diffusion Draft Tree) is to construct a draft tree directly from the per-position distributions of a block diffusion drafter, select the continuations most likely to match the target model with a simple best-first heap algorithm under a fixed node budget, and verify the entire tree in a single target-model forward pass using an ancestor-only attention mask. Building on the strengths of DFlash, this substantially improves speculative-decoding efficiency and acceptance length.

Link: https://arxiv.org/abs/2604.12989
Authors: Liran Ringel, Yaniv Romano
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an entire draft block in a single forward pass and achieve state-of-the-art speculative decoding performance, outperforming strong autoregressive drafters such as EAGLE-3. Vanilla DFlash, however, still verifies only a single drafted trajectory per round, potentially limiting its acceptance length. We introduce DDTree (Diffusion Draft Tree), a method that constructs a draft tree directly from the per-position distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations that are most likely to match the target model, according to a surrogate defined by the draft model's output. The resulting tree is verified efficiently in a single target-model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a state-of-the-art draft model, the resulting gains place it among the strongest approaches to speculative decoding.
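The best-first selection under a fixed node budget described above can be sketched with a small heap-based routine. This is a toy illustration over assumed per-position token marginals, not DFlash/DDTree's actual implementation; the function and variable names are hypothetical:

```python
import heapq
import math

def best_first_draft_tree(step_probs, budget):
    """Best-first expansion of a draft tree under a fixed node budget.

    step_probs[d] maps token -> probability at draft position d (a toy
    stand-in for the block-diffusion drafter's per-position marginals).
    Returns the selected paths (tuples of tokens), highest cumulative
    probability first.
    """
    heap = [(0.0, ())]            # entries: (negated cumulative log-prob, path)
    selected = []
    while heap and len(selected) < budget:
        neg_lp, path = heapq.heappop(heap)
        if path:                  # the empty root is not a draft node
            selected.append(path)
        depth = len(path)
        if depth < len(step_probs):
            for tok, p in step_probs[depth].items():
                heapq.heappush(heap, (neg_lp - math.log(p), path + (tok,)))
    return selected
```

With a budget of 3 and two draft positions, the greedy trajectory is extended by its strongest sibling before the weaker root branch is ever selected, which is exactly the behavior a tree (rather than a single chain) buys.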

[NLP-6] GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts

[Quick Read]: This paper addresses the poor multilingual generalization of current optical character recognition (OCR) models, in particular the lack of systematic evaluation on the many low-resource Unicode scripts beyond a handful of high- and mid-resource ones. The key contribution is GlotOCR Bench, a comprehensive benchmark covering 100+ Unicode scripts, whose images are rendered from real multilingual text using fonts from the Google Fonts repository, the HarfBuzz shaping engine, and the FreeType rasterizer, with aligned clean and degraded variants and support for both LTR and RTL writing directions. Evaluating a range of open-weight and proprietary vision-language models on this benchmark reveals that current OCR systems depend heavily on language-model pretraining coverage rather than pure visual recognition, and that on unfamiliar scripts they tend to produce random noise or hallucinate characters from similar scripts they already know.

Link: https://arxiv.org/abs/2604.12978
Authors: Amir Hossein Kargaran, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze
Affiliations: LMU Munich; TU Munich; MCML; Sorbonne Université; CNRS
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cluster of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType, supporting both LTR and RTL scripts. Samples of rendered images were manually reviewed to verify correct rendering across all scripts. We evaluate a broad suite of open-weight and proprietary vision-language models and find that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly tracks script-level pretraining coverage, suggesting that current OCR systems rely on language model pretraining as much as on visual recognition. Models confronted with unfamiliar scripts either produce random noise or hallucinate characters from similar scripts they already know. We release the benchmark and pipeline for reproducibility. Pipeline Code: this https URL, Benchmark: this https URL.

[NLP-7] MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

[Quick Read]: This paper tackles the problem of improving the factuality of full-duplex speech language models while preserving their real-time interactivity. Simply scaling up the model improves factual accuracy but makes real-time inference prohibitively expensive. The key idea of the proposed MoshiRAG is a modular framework combining a lightweight full-duplex interface with selective retrieval: it asynchronously identifies knowledge-demanding queries and, exploiting the natural temporal gaps during response generation, retrieves and integrates external knowledge, thereby improving factual accuracy without sacrificing conversational fluency.

Link: https://arxiv.org/abs/2604.12928
Authors: Chung-Ming Chien, Manu Orsini, Eugene Kharitonov, Neil Zeghidour, Karen Livescu, Alexandre Défossez
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural conversation flow. With this approach, MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. Moreover, our flexible design supports plug-and-play retrieval methods without retraining and demonstrates strong performance on out-of-domain mathematical reasoning tasks.

[NLP-8] MetFuse: Figurative Fusion between Metonymy and Metaphor ACL2026

[Quick Read]: This paper addresses the fact that metaphor and metonymy often co-occur in natural language, yet computational work has mostly studied them in isolation. The key contribution is a unified framework that transforms a literal sentence into three figurative variants (metonymic, metaphoric, and hybrid), and, built on it, MetFuse, the first dedicated corpus of metaphor-metonymy fusion, containing 1,000 human-verified meaning-aligned quadruplets (4,000 sentences in total). Extrinsic experiments on eight benchmarks show that augmenting training data with MetFuse consistently improves both metaphor and metonymy classification, with hybrid examples yielding the largest gains on metonymy tasks; further analysis finds that the presence of a metaphor makes a metonymic noun easier to identify, indicating a synergy between the two figurative types.

Link: https://arxiv.org/abs/2604.12919
Authors: Saptarshi Ghosh, Tianyu Jiang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: ACL 2026

Click to view abstract

Abstract:Metonymy and metaphor often co-occur in natural language, yet computational work has studied them largely in isolation. We introduce a framework that transforms a literal sentence into three figurative variants: metonymic, metaphoric, and hybrid. Using this framework, we construct MetFuse, the first dedicated dataset of figurative fusion between metonymy and metaphor, containing 1,000 human-verified meaning-aligned quadruplets totaling 4,000 sentences. Extrinsic experiments on eight existing benchmarks show that augmenting training data with MetFuse consistently improves both metonymy and metaphor classification, with hybrid examples yielding the largest gains on metonymy tasks. Using this dataset, we also analyze how the presence of one figurative type impacts another. Our findings show that both human annotators and large language models better identify metonymy in hybrid sentences than in metonymy-only sentences, demonstrating that the presence of a metaphor makes a metonymic noun more explicit. Our dataset is publicly available at: this https URL.

[NLP-9] Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss

[Quick Read]: This paper addresses a core problem with existing multilingual benchmarks: they actually measure mathematical reasoning and factual recall rather than genuine multilingual proficiency, which inflates models' apparent multilingual performance. The key solution is an evaluation based on round-trip translation: translate a source text into a target language and back, and use the semantic gap from the original to expose failures in multilingual generation. This method correlates almost perfectly with real user ratings (ρ = 0.94), requires no human reference translations, and does not depend on a judge model more capable than those under test, yielding a more reliable and realistic assessment of multilingual ability. The authors also release Lost in Translation (LiT), a challenging round-trip translation benchmark spanning widely spoken languages for evaluating the true multilingual capability of frontier models.

Link: https://arxiv.org/abs/2604.12911
Authors: Ronald Skorobogat, Ameya Prabhu, Matthias Bethge
Affiliations: Tübingen AI Center, University of Tübingen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Multilingual benchmarks guide the development of frontier models. Yet multilingual evaluations reported by frontier models are structured similarly to popular reasoning and knowledge benchmarks, but across many languages. We show such benchmarks, and consequently multilingual evaluations, measure mathematical reasoning and factual recall, not multilingual proficiency. For example, thinking variants dramatically outperform instruct variants on these benchmarks, yet often perform worse on real-world multilingual tasks, such as LMArena. We propose a simple alternative: evaluate multilingual capability via round-trip translation. Given text in a source language, translate it to a target language and back; semantic gaps between the original and result expose failures in multilingual generation capabilities. On our benchmark, round-trip translation correlates almost perfectly (ρ = 0.94) with user ratings on LMArena, requires no human reference translations, and does not require a multilingual judge more capable than the tested models. Lastly, we introduce Lost in Translation (LiT), a challenging round-trip translation benchmark spanning widely spoken languages worldwide, for realistic evaluation of multilingual frontier models.
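The round-trip protocol is straightforward to prototype: translate out and back with any MT system, then score the semantic gap to the original. The sketch below is an illustrative harness; the `translate` callable, the language codes, and the token-F1 stand-in for an embedding similarity are all placeholder assumptions, not the paper's setup:

```python
def round_trip_score(text, translate, similarity):
    """Round-trip evaluation: source -> target -> source, then measure
    how much meaning survived. `translate(text, src=..., tgt=...)` and
    `similarity(a, b)` are caller-supplied stand-ins (e.g. an MT model
    and an embedding cosine); any callables with these shapes work."""
    forward = translate(text, src="en", tgt="xx")
    back = translate(forward, src="xx", tgt="en")
    return similarity(text, back)

def token_f1(a, b):
    """A crude lexical similarity: F1 over whitespace token multisets
    (a real system would use semantic embeddings instead)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    pool = list(tb)
    common = 0
    for t in ta:
        if t in pool:
            pool.remove(t)
            common += 1
    if common == 0:
        return 0.0
    p, r = common / len(tb), common / len(ta)
    return 2 * p * r / (p + r)
```

A perfect round trip scores 1.0; any drift in the back-translation lowers the score, which is the signal ranked against user ratings.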

[NLP-10] Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

[Quick Read]: This paper addresses the score incomparability that arises as benchmark suites grow: with new models and datasets released constantly, full evaluation is costly, and sampling differences make results hard to compare across studies. The key solution is a calibration framework based on multidimensional Item Response Theory (MIRT) that uses anchor items to align newly introduced datasets with an already-calibrated evaluation suite while holding previously calibrated item parameters fixed. With only about 100 anchor questions per dataset, it predicts full-evaluation performance within 2-3 percentage points and preserves rankings (Spearman ρ ≥ 0.9), making model evaluation comparable and extensible across time and datasets at a constant cost per new dataset.

Link: https://arxiv.org/abs/2604.12843
Authors: Eliya Habba, Itay Itzhak, Asaf Yehudai, Yotam Perlitz, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen, Gabriel Stanovsky
Affiliations: The Hebrew University of Jerusalem; IBM Research; MIT; Technion; Allen Institute for AI
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:The rapid release of both language models and benchmarks makes it increasingly costly to evaluate every model on every dataset. In practice, models are often evaluated on different samples, making scores difficult to compare across studies. To address this, we propose a framework based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmarks to the evaluation suite while holding previously calibrated item parameters fixed. Our approach supports a realistic evaluation setting in which datasets are introduced over time and models are evaluated only on the datasets available at the time of evaluation, while a fixed anchor set for each dataset is used so that results from different evaluation periods can be compared directly. In large-scale experiments on more than 400 models, our framework predicts full-evaluation performance within 2-3 percentage points using only 100 anchor questions per dataset, with Spearman ρ ≥ 0.9 for ranking preservation, showing that it is possible to extend benchmark suites over time while preserving score comparability, at a constant evaluation cost per new dataset. Code available at this https URL
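The fixed-parameter idea can be illustrated with a toy unidimensional 2PL IRT model: item parameters (discrimination a, difficulty b) stay frozen, and a new model's ability is fit from its anchor-item responses alone. This is a schematic sketch with a grid-search MLE and invented parameter values, not the paper's multidimensional estimator:

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability a model with ability theta answers an item
    with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fit_ability(responses, items, grid=None):
    """Estimate theta by maximum likelihood over a coarse grid, holding
    the (a, b) parameters of the anchor items fixed -- the key idea
    behind calibrating new models against previously calibrated items.
    `responses[i]` is 1/0 for anchor item i; `items[i]` is (a, b)."""
    if grid is None:
        grid = [x / 10.0 for x in range(-40, 41)]  # theta in [-4, 4]
    def loglik(theta):
        ll = 0.0
        for r, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            ll += math.log(p) if r else math.log(1.0 - p)
        return ll
    return max(grid, key=loglik)
```

Because the item parameters never move, abilities fitted in different evaluation periods live on the same scale, which is what makes scores comparable across suite versions.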

[NLP-11] RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair

[Quick Read]: This paper addresses the problem that large language models (LLMs) absorb harmful knowledge, misinformation, and personal data during pretraining, with no native mechanism for targeted removal. Existing machine-unlearning methods are provider-centric, relying on retraining and intervention by model service providers (MSPs), and give users no control over deleting their own data. The key solution is Interactive Machine Unlearning (IMU), which lets users trigger unlearning through natural-language instructions at inference time, realized by the RePAIR framework. Its core is STAMP, a training-free, single-sample activation-manipulation unlearning method that uses pseudoinverse updates to project MLP activations into a refusal subspace, efficiently editing model parameters. The method is computationally cheap (the low-rank variant reduces complexity to O(r³ + r²d)), supports real-time on-device execution, and outperforms six state-of-the-art baselines, balancing unlearning effectiveness (forget score ≈ 0) against retained model utility (up to 84.47%).

Link: https://arxiv.org/abs/2604.12820
Authors: Jagadeesh Rachapudi, Pranav Singh, Ritali Vatsi, Praful Hambarde, Amit Shukla
Affiliations: Indian Institute of Technology Mandi
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) inherently absorb harmful knowledge, misinformation, and personal data during pretraining on large-scale web corpora, with no native mechanism for selective removal. While machine unlearning offers a principled solution, existing approaches are provider-centric, requiring retraining pipelines, curated retain datasets, and direct intervention by model service providers (MSPs), thereby excluding end users from controlling their own data. We introduce Interactive Machine Unlearning (IMU), a new paradigm in which users can instruct LLMs to forget targeted knowledge through natural language at inference time. To realize IMU, we propose RePAIR, a prompt-aware model repair framework comprising (i) a watchdog model for unlearning intent detection, (ii) a surgeon model for generating repair procedures, and (iii) a patient model whose parameters are updated autonomously. At the core of RePAIR, we develop Steering Through Activation Manipulation with PseudoInverse (STAMP), a training-free, single-sample unlearning method that redirects MLP activations toward a refusal subspace via closed-form pseudoinverse updates. Its low-rank variant reduces computational complexity from O(d^3) to O(r^3 + r^2 * d), enabling efficient on-device unlearning with up to ~3x speedup over training-based baselines. Extensive experiments across harmful knowledge suppression, misinformation correction, and personal data erasure demonstrate that RePAIR achieves near-zero forget scores (Acc_f = 0.00, F-RL = 0.00) while preserving model utility (Acc_r up to 84.47, R-RL up to 0.88), outperforming six state-of-the-art baselines. These results establish RePAIR as an effective and practical framework for user-driven model editing, advancing transparent and on-device control over learned knowledge, with potential extensions to multimodal foundation models.
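The closed-form pseudoinverse edit and its low-rank variant can be sketched in a few lines of linear algebra. This is a toy stand-in for the idea (mapping forget-prompt activations onto refusal-subspace targets via a minimum-norm solve), not the paper's actual STAMP update; all names and shapes are illustrative:

```python
import numpy as np

def pseudoinverse_edit(H, T):
    """Minimum-norm linear map M with H @ M ~= T: redirect the
    activations in H (n x d, one row per forget prompt) toward target
    refusal-subspace activations T, in closed form via the Moore-Penrose
    pseudoinverse. A toy sketch of the pseudoinverse-update idea."""
    return np.linalg.pinv(H) @ T

def low_rank_edit(H, T, r):
    """Rank-r variant: solve in the span of H's top-r right singular
    vectors, so the expensive solve happens in r dimensions rather
    than d (the source of the O(r^3 + r^2 d) cost)."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    Vr = Vt[:r]                  # r x d basis of the activation subspace
    Hr = H @ Vr.T                # n x r coordinates in that basis
    Mr = np.linalg.pinv(Hr) @ T  # solve in the reduced space
    return Vr.T @ Mr             # lift back to a d x d rank-r map
```

When H has rank at most r, the low-rank edit reproduces the full solve exactly while only ever inverting an r-dimensional system.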

[NLP-12] The role of System 1 and System 2 semantic memory structure in human and LLM biases

[Quick Read]: This paper investigates the sources of implicit bias in humans and large language models (LLMs) and the cognitive differences between them, in particular how System 1 (fast, intuitive) and System 2 (slow, deliberative) thinking from dual-process theory relate to implicit gender bias. The key idea is to model System 1 and System 2 thinking as semantic memory networks with distinct structures, built from comparable human- and LLM-generated datasets, and to analyze how these structures relate to implicit bias using network-level metrics. The findings show that human semantic memory structures are irreducible and consistently related to bias (with lower bias in System 2 structures), whereas LLMs lack this human-like structure of conceptual knowledge, pointing to a fundamental cognitive difference in how machines regulate bias.

Link: https://arxiv.org/abs/2604.12816
Authors: Katherine Abramski, Giulio Rossetti, Massimo Stella
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 31 pages, 5 figures, 9 appendix figures

Click to view abstract

Abstract:Implicit biases in both humans and large language models (LLMs) pose significant societal risks. Dual process theories propose that biases arise primarily from associative System 1 thinking, while deliberative System 2 thinking mitigates bias, but the cognitive mechanisms that give rise to this phenomenon remain poorly understood. To better understand what underlies this duality in humans, and possibly in LLMs, we model System 1 and System 2 thinking as semantic memory networks with distinct structures, built from comparable datasets generated by both humans and LLMs. We then investigate how these distinct semantic memory structures relate to implicit gender bias using network-based evaluation metrics. We find that semantic memory structures are irreducible only in humans, suggesting that LLMs lack certain types of human-like conceptual knowledge. Moreover, semantic memory structure relates consistently to implicit bias only in humans, with lower levels of bias in System 2 structures. These findings suggest that certain types of conceptual knowledge contribute to bias regulation in humans, but not in LLMs, highlighting fundamental differences between human and machine cognition.

[NLP-13] EvoSpark: Endogenous Interactive Agent Societies for Unified Long-Horizon Narrative Evolution ACL2026

[Quick Read]: This paper tackles the difficulty of endogenous narrative evolution in LLM-based multi-agent systems, caused by the inherent stochasticity of generative emergence; long-horizon simulations in particular suffer from social memory stacking and narrative-spatial dissonance. The key mechanisms of the proposed EvoSpark framework are a Stratified Narrative Memory built on a Role Socio-Evolutionary Base, which dynamically metabolizes experiences to resolve historical conflicts, and a Generative Mise-en-Scène mechanism that enforces role-location-plot consistency. Both are coordinated by a Unified Narrative Operation Engine, whose embedded Emergent Character Grounding Protocol transforms stochastic triggers into persistent characters, expanding a minimal premise into an open-ended, evolving story world.

Link: https://arxiv.org/abs/2604.12776
Authors: Shiyu He, Minchi Kuang, Mengxian Wang, Bin Hu, Tingxiang Gu
Affiliations: Xinjiang University; Tsinghua University
Subjects: Computation and Language (cs.CL)
Comments: Accepted to the Main Conference of ACL 2026

Click to view abstract

Abstract:Realizing endogenous narrative evolution in LLM-based multi-agent systems is hindered by the inherent stochasticity of generative emergence. In particular, long-horizon simulations suffer from social memory stacking, where conflicting relational states accumulate without resolution, and narrative-spatial dissonance, where spatial logic detaches from the evolving plot. To bridge this gap, we propose EvoSpark, a framework specifically designed to sustain logically coherent long-horizon narratives within Endogenous Interactive Agent Societies. To ensure consistency, the Stratified Narrative Memory employs a Role Socio-Evolutionary Base as living cognition, dynamically metabolizing experiences to resolve historical conflicts. Complementarily, Generative Mise-en-Scène mechanism enforces Role-Location-Plot alignment, synchronizing character presence with the narrative flow. Underpinning these is the Unified Narrative Operation Engine, which integrates an Emergent Character Grounding Protocol to transform stochastic sparking into persistent characters. This engine establishes a substrate that expands a minimal premise into an open-ended, evolving story world. Experiments demonstrate that EvoSpark significantly outperforms baselines across diverse paradigms, enabling the sustained generation of expressive and coherent narrative experiences.

[NLP-14] Teaching LLMs Human-Like Editing of Inappropriate Argumentation via Reinforcement Learning

[Quick Read]: This paper addresses the mismatch between large language models (LLMs) and humans in text-editing strategies: LLMs tend to make scattered edits that can noticeably change meaning, whereas humans prefer self-contained, meaning-preserving edits that encapsulate related changes. The key solution is a reinforcement-learning approach trained with group relative policy optimization and a multi-component reward that jointly optimizes edit-level semantic similarity, fluency, and pattern conformity together with argument-level appropriateness, producing sentence-level edit suggestions that can be accepted or rejected independently. This brings editing behavior closer to human practice and significantly outperforms existing baselines in both automatic and human evaluation.

Link: https://arxiv.org/abs/2604.12770
Authors: Timon Ziegenbein, Maja Stahl, Henning Wachsmuth
Affiliations: Leibniz University Hannover; L3S Research Center
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Editing human-written text has become a standard use case of large language models (LLMs), for example, to make one’s arguments more appropriate for a discussion. Comparing human to LLM-generated edits, however, we observe a mismatch in editing strategies: While LLMs often perform multiple scattered edits and tend to change meaning notably, humans rather encapsulate dependent changes in self-contained, meaning-preserving edits. In this paper, we present a reinforcement learning approach that teaches LLMs human-like editing to improve the appropriateness of arguments. Our approach produces self-contained sentence-level edit suggestions that can be accepted or rejected independently. We train the approach using group relative policy optimization with a multi-component reward function that jointly optimizes edit-level semantic similarity, fluency, and pattern conformity as well as argument-level appropriateness. In automatic and human evaluation, it outperforms competitive baselines and the state of the art in human-like editing, with multi-round editing achieving appropriateness close to full rewriting.

[NLP-15] NaviRAG: Towards Active Knowledge Navigation for Retrieval-Augmented Generation

[Quick Read]: This paper addresses a limitation of conventional retrieval-augmented generation (RAG) on complex tasks: its flat retrieval paradigm struggles with conditional retrieval and dynamic synthesis of information across granularities (e.g., from broad concepts to specific evidence). The key idea of NaviRAG is to structure knowledge documents into a hierarchical representation that preserves semantic relationships, and to employ a large language model (LLM) agent that actively navigates the knowledge base, iteratively identifying information gaps and dynamically retrieving content from the most appropriate level of granularity, thereby improving both retrieval recall and end-to-end question answering.

Link: https://arxiv.org/abs/2604.12766
Authors: Jihao Dai (1 and 2), Dingjun Wu (1), Yuxuan Chen (1), Zheni Zeng (2), Yukun Yan (1), Zhenghao Liu (3), Maosong Sun (1) ((1) Tsinghua University, (2) Nanjing University, (3) Northeastern University)
Affiliations: Tsinghua University; Nanjing University; Northeastern University
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Retrieval-augmented generation (RAG) typically relies on a flat retrieval paradigm that maps queries directly to static, isolated text segments. This approach struggles with more complex tasks that require the conditional retrieval and dynamic synthesis of information across different levels of granularity (e.g., from broad concepts to specific evidence). To bridge this gap, we introduce NaviRAG, a novel framework that shifts from passive segment retrieval to active knowledge navigation. NaviRAG first structures the knowledge documents into a hierarchical form, preserving semantic relationships from coarse-grained topics to fine-grained details. Leveraging these reorganized knowledge records, a large language model (LLM) agent actively navigates the records, iteratively identifying information gaps and retrieving relevant content from the most appropriate granularity level. Extensive experiments on long-document QA benchmarks show that NaviRAG consistently improves both retrieval recall and end-to-end answer performance over conventional RAG baselines. Ablation studies confirm performance gains stem from our method's capacity for multi-granular evidence localization and dynamic retrieval planning. We further discuss the efficiency, applicable scenarios, and future directions of our method, hoping to make RAG systems more intelligent and autonomous.

[NLP-16] Generating Effective CoT Traces for Mitigating Causal Hallucination ACL2026

[Quick Read]: This paper addresses the severe causal hallucination of smaller large language models (LLMs) on event causality identification (ECI). The key solution is to first establish the criteria that effective Chain-of-Thought (CoT) traces must satisfy, and then design an automated data pipeline that generates CoT traces meeting those criteria; a new metric, the Causal Hallucination Rate (CHR), is introduced to quantify causal hallucination, guide the design of the traces, and validate the method. Experiments show that fine-tuning on CoT traces generated by this pipeline not only substantially reduces causal hallucination in smaller models but also improves mean accuracy, with strong cross-dataset and cross-difficulty generalization and robustness to misleading prompts.

Link: https://arxiv.org/abs/2604.12748
Authors: Yiheng Zhao, Jun Yan
Affiliations: Concordia University
Subjects: Computation and Language (cs.CL)
Comments: 11 pages, 2 figures. Accepted at ACL 2026

Click to view abstract

Abstract:Although large language models (LLMs) excel in complex reasoning tasks, they suffer from severe causal hallucination in event causality identification (ECI), particularly in smaller models (≤1.5B parameters). A promising approach to address this issue is to fine-tune them with Chain-of-Thought (CoT) traces. However, there is currently a lack of CoT trace datasets available for ECI. In this paper, we first investigate the essential criteria that effective CoT traces should possess to mitigate causal hallucination in smaller models. We then design a pipeline to generate CoT traces that meet these criteria. Moreover, since there is currently no metric for quantifying causal hallucination, we also introduce a new metric, the Causal Hallucination Rate (CHR), to quantify causal hallucination, guide the formulation of effective CoT trace criteria, and validate the effectiveness of our pipeline. Our experiments show that fine-tuning with the CoT traces generated by our pipeline not only substantially reduces causal hallucination in smaller LLMs but also improves mean accuracy. Moreover, the fine-tuned models exhibit strong cross-dataset and cross-difficulty generalization, as well as robustness under misleading intervention prompts.

[NLP-17] Universal NER v2: Towards a Massively Multilingual Named Entity Recognition Benchmark LREC2026

[Quick Read]: This paper addresses the scarcity of high-quality evaluation benchmarks for multilingual language models in non-English languages, particularly for named entity recognition (NER). The key solution is Universal NER (UNER), a standardized, cross-lingual gold-standard NER effort built on a unified tagset and thorough annotation guidelines that ensure consistent, comparable annotations across languages; through continued expansion and broad collaboration it has grown into an active community-driven project, providing a reliable and reproducible foundation for evaluating multilingual NLP models.

Link: https://arxiv.org/abs/2604.12744
Authors: Terra Blevins, Stephen Mayhew, Marek Šuppa, Hila Gonen, Shachar Mirkin, Vasile Pais, Kaja Dobrovoljc, Voula Giouli, Jun Kevin, Eugene Jang, Eungseo Kim, Jeongyeon Seo, Xenophon Gialis, Yuval Pinter
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: LREC 2026

Click to view abstract

Abstract:While multilingual language models promise to bring the benefits of LLMs to speakers of many languages, gold-standard evaluation benchmarks in most languages to interrogate these assumptions remain scarce. The Universal NER project, now entering its fourth year, is dedicated to building gold-standard multilingual Named Entity Recognition (NER) benchmark datasets. Inspired by existing massively multilingual efforts for other core NLP tasks (e.g., Universal Dependencies), the project uses a general tagset and thorough annotation guidelines to collect standardized, cross-lingual annotations of named entity spans. The first installment (UNER v1) was released in 2024, and the project has continued and expanded since then, with various organizers, annotators, and collaborators in an active community.

[NLP-18] Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood

[Quick Read]: This paper addresses the token-level sparse-reward problem faced by Group Relative Policy Optimization (GRPO) and related entropy-regularization methods in chain-of-thought (CoT) reasoning, which often leads to entropy collapse or model degradation. The key solution is TEPO, a token-level framework with two core mechanisms: (1) it uses sequence-level likelihood to link group-level rewards to individual tokens via token-level aggregation, assigning reward signals more precisely; and (2) it introduces a token-level KL-divergence mask constraint that updates only tokens with positive advantages and decreasing entropy, mitigating instability from abrupt policy changes. Experiments show that TEPO achieves state-of-the-art performance on mathematical reasoning benchmarks and markedly improves training stability, reducing convergence time by 50% compared with GRPO/DAPO.

Link: https://arxiv.org/abs/2604.12736
Authors: Xingyu Lin, Yilin Wen, Du Su, Jinchang Hou, En Wang, Wenbin Liu, Chenfu Bao, Zhonghou Lv
Affiliations: Jilin University; Baidu Inc.; Tsinghua University; Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly in their mathematical reasoning performance. However, GRPO and related entropy regularization methods still struggle with token-level sparse rewards, an inherent challenge in chain-of-thought (CoT) reasoning. These approaches often rely on undifferentiated token-level entropy regularization, which easily leads to entropy collapse or model degradation under sparse token rewards. In this work, we propose TEPO, a novel token-level framework that (1) leverages sequence-level likelihood to link group-level rewards with individual tokens via token-level aggregation, and (2) introduces a token-level KL-divergence mask constraint that targets tokens with positive advantages and decreasing entropy to mitigate abrupt policy updates. Experiments demonstrate that TEPO not only achieves state-of-the-art performance on mathematical reasoning benchmarks but also markedly enhances training stability, reducing convergence time by 50% compared with GRPO/DAPO.

[NLP-19] InsightFlow: LLM-Driven Synthesis of Patient Narratives for Mental Health into Causal Models

[Quick Read]: This paper addresses the low efficiency and inconsistency of constructing causal graphs for clinical case formulation, i.e., automatically deriving 5P-aligned causal structures from psychotherapy dialogue transcripts, so as to reduce manual annotation time and improve standardization. The key contribution is InsightFlow, an LLM-based approach that automatically generates 5P-aligned causal graphs directly from patient-therapist dialogues consistent with expert annotations, validated along three axes: structural similarity (NetSimile), semantic embedding similarity, and clinical expert ratings. Results show the generated graphs approach inter-annotator agreement at both the structural and semantic levels and are rated as clinically useful.

Link: https://arxiv.org/abs/2604.12721
Authors: Shreya Gupta, Prottay Kumar Adhikary, Bhavyaa Dave, Salam Michael Singh, Aniket Deroy, Tanmoy Chakraborty
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Clinical case formulation organizes patient symptoms and psychosocial factors into causal models, often using the 5P framework. However, constructing such graphs from therapy transcripts is time-consuming and varies across clinicians. We present InsightFlow, an LLM-based approach that automatically generates 5P-aligned causal graphs from patient-therapist dialogues. Using 46 psychotherapy intake transcripts annotated by clinical experts, we evaluate LLM-generated graphs against human formulations using structural (NetSimile), semantic (embedding similarity), and expert-rated clinical criteria. The generated graphs show structural similarity comparable to inter-annotator agreement and high semantic alignment with human graphs. Expert evaluations rate the outputs as moderately complete, consistent, and clinically useful. While LLM graphs tend to form more interconnected structures than the chain-like patterns of human graphs, overall complexity and content coverage are similar. These results suggest that LLMs can produce clinically meaningful case formulation graphs within the natural variability of expert practice. InsightFlow highlights the potential of automated causal modeling to augment clinical workflows, with future work needed to improve temporal reasoning and reduce redundancy.

[NLP-20] LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety

[Quick Read]: This paper addresses the sharp drop in safety performance of large language models (LLMs) in low-resource languages. The root cause is that while the models' semantic understanding is largely consistent across languages, their safety alignment is biased toward high-resource languages, so language-dominant safety mechanisms fail in low-resource ones. The key solution is to identify the model's "semantic bottleneck", an intermediate layer whose representational geometry is governed primarily by shared semantic content rather than language identity, and to propose Language-Agnostic Semantic Alignment (LASA), which anchors safety alignment directly at this bottleneck for consistent cross-lingual protection. Experiments show LASA substantially reduces the attack success rate (ASR) across all languages, from 24.7% to 2.8%, and keeps it at a stable 3-4% across several mainstream models.

Link: https://arxiv.org/abs/2604.12710
Authors: Junxiao Yang, Haoran Liu, Jinzhe Tu, Jiale Cheng, Zhexin Zhang, Shiyao Cui, Jiaqi Weng, Jialing Tao, Hui Xue, Hongning Wang, Han Qiu, Minlie Huang
Affiliations: The Conversational AI (CoAI) group, DCST, Tsinghua University; Alibaba Group; Tsinghua University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) often demonstrate strong safety performance in high-resource languages, yet exhibit severe vulnerabilities when queried in low-resource languages. We attribute this gap to a mismatch between language-agnostic semantic understanding ability and language-dominant safety alignment biased toward high-resource languages. Consistent with this hypothesis, we empirically identify the semantic bottleneck in LLMs, an intermediate layer in which the geometry of model representations is governed primarily by shared semantic content rather than language identity. Building on this observation, we propose Language-Agnostic Semantic Alignment (LASA), which anchors safety alignment directly in semantic bottlenecks. Experiments show that LASA substantially improves safety across all languages: average attack success rate (ASR) drops from 24.7% to 2.8% on LLaMA-3.1-8B-Instruct and remains around 3-4% across Qwen2.5 and Qwen3 Instruct models (7B-32B). Together, our analysis and method offer a representation-level perspective on LLM safety, suggesting that safety alignment requires anchoring safety understanding not in surface text, but in the model’s language-agnostic semantic space.

[NLP-21] Do VLMs Truly “Read” Candlesticks? A Multi-Scale Benchmark for Visual Stock Price Forecasting STOC

【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在股票价格预测任务中对蜡烛图(candlestick charts)理解能力评估不足的问题,特别是缺乏对多尺度视觉信号整合能力的系统性评测。现有研究未能有效区分VLM是否真正理解蜡烛图模式,且多数数据集和评估框架仅基于单周期或表格输入,无法反映人类分析师依赖的多尺度分析逻辑——即长期趋势与短期拐点线索的协同作用。为填补这一空白,作者构建了一个多尺度蜡烛图数据集和标准化评估框架,通过混淆矩阵诊断与信息系数(Information Coefficient, IC)时间序列指标相结合的方式,结合XGBoost作为基于特征的时间基准进行对比。关键创新在于引入多尺度视觉输入结构,并设计可量化模型对不同时间尺度市场动态敏感性的评估机制,从而更真实地衡量VLM在复杂市场情境下的预测能力和时序推理局限性。

链接: https://arxiv.org/abs/2604.12659
作者: Kaiqi Hu,Linda Xiao,Shiyue Xu,Ziyi Tang,Mingwen Liu
机构: University of Leeds (利兹大学); Sun Yat-sen University (中山大学); Likelihood Lab (Likelihood Lab)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: We evaluate whether VLMs can comprehend multi-scale visual stock price data like human analysts with a proposed benchmark, identifying current VLMs’ weak predictive power, significant biases, and limited sensitivity to forecast horizons and prompts

点击查看摘要

Abstract:Vision-language models (VLMs) are increasingly applied to visual stock price forecasting, yet existing benchmarks inadequately evaluate their understanding of stock prices in candlestick charts. First, prior studies fail to isolate whether VLMs’ comprehension of visual inputs genuinely improves predictive performance and whether VLMs truly comprehend candlestick patterns. Further, most existing datasets and evaluation setups are designed around single-period or tabular inputs. However, human analysts strongly rely on multi-scale candlestick charts, where longer-term horizons capture trend direction and shorter-term horizons provide cues for inflection points, making it difficult to systematically assess VLMs’ ability to integrate short-term and long-term visual market dynamics. To bridge this gap, we construct a multi-scale candlestick chart dataset and a standardized evaluation framework to assess VLMs’ ability to utilize multi-scale visual market signals. Evaluation combines confusion-matrix-based diagnostics with information coefficient (IC) time series metrics and includes XGBoost as a feature-based temporal baseline. Using this dataset, we benchmark representative VLMs and analyze their ability to leverage multi-scale stock price data. Experimental results show that most VLMs perform well only under persistent uptrend or downtrend conditions, while exhibiting weak predictive capability in more common market scenarios. We also identify significant prediction biases and limited sensitivity to explicitly specified forecast horizons in prompts, indicating inherent limitations in precise temporal reasoning.
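摘要中用于评估的信息系数(Information Coefficient, IC)通常指预测分数与实际收益的秩相关(Spearman)。以下为一个不依赖第三方库的极简实现示意(未处理并列值,仅作说明,并非论文评估代码):

```python
def rank(xs):
    # 返回各元素的升序秩(0 起始,不处理并列值)
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = float(pos)
    return r

def information_coefficient(pred, realized):
    # Rank IC:对预测分数与实际收益分别取秩后计算 Pearson 相关
    rp, rr = rank(pred), rank(realized)
    n = len(rp)
    mp, mr = sum(rp) / n, sum(rr) / n
    cov = sum((a - mp) * (b - mr) for a, b in zip(rp, rr))
    vp = sum((a - mp) ** 2 for a in rp) ** 0.5
    vr = sum((b - mr) ** 2 for b in rr) ** 0.5
    return cov / (vp * vr)
```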

[NLP-22] Learning Chain Of Thoughts Prompts for Predicting Entities Relations and even Literals on Knowledge Graphs

【速读】: 该论文旨在解决知识图谱嵌入(Knowledge Graph Embedding, KGE)模型在处理未见过的实体、关系及数值型属性(literals)时表现不佳的问题,这限制了其在动态、异构知识图谱中的应用。解决方案的关键在于将链接预测任务重新建模为提示学习(prompt learning)问题,提出一种名为 RALP 的方法,通过学习基于字符串的思维链(chain-of-thought, CoT)提示作为三元组的评分函数,并利用 MIPRO 算法结合贝叶斯优化,在无需梯度信息的情况下仅用少于 30 个训练样本即可识别有效提示。该方法在推理阶段不仅能预测缺失的实体、关系或完整三元组,还能基于所学提示生成置信度分数,显著提升了模型在跨数据集上的泛化能力与 OWL 推理任务中的表现(如复杂类表达式下的 Jaccard 相似度超过 88%)。

链接: https://arxiv.org/abs/2604.12651
作者: Alkid Baci,Luke Friedrichs,Caglar Demir,N’Dah Jean Kouagou,Axel-Cyrille Ngonga Ngomo
机构: Paderborn University (帕德博恩大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge graph embedding (KGE) models perform well on link prediction but struggle with unseen entities, relations, and especially literals, limiting their use in dynamic, heterogeneous graphs. In contrast, pretrained large language models (LLMs) generalize effectively through prompting. We reformulate link prediction as a prompt learning problem and introduce RALP, which learns string-based chain-of-thought (CoT) prompts as scoring functions for triples. Using Bayesian Optimization through MIPRO algorithm, RALP identifies effective prompts from fewer than 30 training examples without gradient access. At inference, RALP predicts missing entities, relations or whole triples and assigns confidence scores based on the learned prompt. We evaluate on transductive, numerical, and OWL instance retrieval benchmarks. RALP improves state-of-the-art KGE models by over 5% MRR across datasets and enhances generalization via high-quality inferred triples. On OWL reasoning tasks with complex class expressions (e.g., \exists this http URL , \geq 5 ; this http URL ), it achieves over 88% Jaccard similarity. These results highlight prompt-based LLM reasoning as a flexible alternative to embedding-based methods. We release our implementation, training, and evaluation pipeline as open source: this https URL .
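摘要中 OWL 实例检索任务采用 Jaccard 相似度评估,即检索结果集合与标准答案集合的交并比 |A∩B| / |A∪B|。以下为一个简单实现示意(两集合均为空时取 1.0 属本文假设约定):

```python
def jaccard(retrieved, gold):
    # Jaccard 相似度:|A∩B| / |A∪B|;空集对空集按完全一致处理(假设约定)
    retrieved, gold = set(retrieved), set(gold)
    union = retrieved | gold
    return len(retrieved & gold) / len(union) if union else 1.0
```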

[NLP-23] Adaptive Test-Time Scaling for Zero-Shot Respiratory Audio Classification

【速读】: 该论文旨在解决自动化呼吸音分析中因标注数据稀缺和专家标注成本高昂而导致的模型泛化能力受限问题,尤其是在零样本(zero-shot)场景下,现有方法对所有输入均采用统一计算资源,未能根据输入难度动态调整推理复杂度。其解决方案的关键在于提出TRIAGE框架,该框架通过分层推理机制实现测试时计算资源的自适应分配:首先在低复杂度的Tier-L层级进行快速标签-余弦评分;若置信度不足,则进入中等复杂度的Tier-M层级进行结构化匹配;最后对不确定样本启用高复杂度的Tier-H层级,结合检索增强的大语言模型进行深度推理。一个基于置信度的路由机制确保简单样本尽早退出,而复杂样本获得额外计算支持,从而在不牺牲准确性的前提下显著提升效率,实验表明近一半样本可在最低层级完成预测,且在多个任务上达到或超越有监督基线性能。

链接: https://arxiv.org/abs/2604.12647
作者: Tsai-Ning Wang,Herman Teun den Dekker,Lin-Lin Chen,Neil Zeghidour,Aaqib Saeed
机构: Eindhoven University of Technology (埃因霍温理工大学); Erasmus University Medical Center (埃拉斯姆斯大学医学中心); Kyutai (凯图); Google.org (谷歌.org)
类目: Sound (cs.SD); Computation and Language (cs.CL)
备注: Accepted at AHLI CHIL 2026

点击查看摘要

Abstract:Automated respiratory audio analysis promises scalable, non-invasive disease screening, yet progress is limited by scarce labeled data and costly expert annotation. Zero-shot inference eliminates task-specific supervision, but existing methods apply uniform computation to every input regardless of difficulty. We introduce TRIAGE, a tiered zero-shot framework that adaptively scales test-time compute by routing each audio sample through progressively richer reasoning stages: fast label-cosine scoring in a joint audio-text embedding space (Tier-L), structured matching with clinician-style descriptors (Tier-M), and retrieval-augmented large language model reasoning (Tier-H). A confidence-based router finalizes easy predictions early while allocating additional computation to ambiguous inputs, enabling nearly half of all samples to exit at the cheapest tier. Across nine respiratory classification tasks without task-specific training, TRIAGE achieves a mean AUROC of 0.744, outperforming prior zero-shot methods and matching or exceeding supervised baselines on multiple tasks. Our analysis shows that test-time scaling concentrates gains where they matter: uncertain cases see up to 19% relative improvement while confident predictions remain unchanged at minimal cost.
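TRIAGE 的分层路由思想可以用如下极简骨架示意(各层打分函数、阈值与样本形式均为假设输入,并非论文实现):置信度达到对应阈值则提前退出,否则升级到更高计算层级,最后一层的结果直接采用。

```python
def triage_route(sample, tiers, thresholds):
    # tiers: [(层级名称, 打分函数)],按计算成本从低到高排列;
    # 打分函数返回 (标签, 置信度)。thresholds 与除最后一层外的各层一一对应。
    for (name, score_fn), thr in zip(tiers[:-1], thresholds):
        label, conf = score_fn(sample)
        if conf >= thr:
            return label, name  # 置信度足够,提前退出
    name, score_fn = tiers[-1]
    label, _ = score_fn(sample)
    return label, name  # 兜底:最高层级的结果直接采用
```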

[NLP-24] Multilingual Multi-Label Emotion Classification at Scale with Synthetic Data

【速读】: 该论文旨在解决多语言情感分类任务中因标注数据稀缺而导致的性能瓶颈问题,现有语料库普遍存在以英语为主、单标签且语言覆盖有限的局限性。其解决方案的关键在于构建了一个大规模合成训练语料库(超过100万条多标签样本,每种语言5万条),涵盖23种语言和11类情绪,并采用文化适应性生成策略与程序化质量过滤机制确保数据质量;同时在统一条件下训练并比较六种多语言Transformer编码器,最终发现XLM-R-Large模型在域内测试集上达到0.868 F1-micro和0.987 AUC-micro,在零样本迁移至人类标注数据集GoEmotions和SemEval-2018 Task 1 E-c时,其性能可媲美甚至超越英语专用模型,且天然支持全部23种语言。

链接: https://arxiv.org/abs/2604.12633
作者: Vadim Borisov
机构: Tabularis.ai(表格式人工智能)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Emotion classification in multilingual settings remains constrained by the scarcity of annotated data: existing corpora are predominantly English, single-label, and cover few languages. We address this gap by constructing a large-scale synthetic training corpus of over 1M multi-label samples (50k per language) across 23 languages: Arabic, Bengali, Dutch, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Mandarin, Polish, Portuguese, Punjabi, Russian, Spanish, Swahili, Tamil, Turkish, Ukrainian, Urdu, and Vietnamese, covering 11 emotion categories using culturally-adapted generation and programmatic quality filtering. We train and compare six multilingual transformer encoders, from DistilBERT (135M parameters) to XLM-R-Large (560M parameters), under identical conditions. On our in-domain test set, XLM-R-Large achieves 0.868 F1-micro and 0.987 AUC-micro. To validate against human-annotated data, we evaluate all models zero-shot on GoEmotions (English) and SemEval-2018 Task 1 E-c (English, Arabic, Spanish). On threshold-free ranking metrics, XLM-R-Large matches or exceeds English-only specialist models, tying on AP-micro (0.636) and LRAP (0.804) while surpassing on AUC-micro (0.810 vs. 0.787), while natively supporting all 23 languages. The best base-sized model is publicly available at this https URL
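评估中使用的多标签 F1-micro 指标,对所有样本与所有标签位置全局累计 TP/FP/FN 后再计算 F1。以下为一个简单实现示意:

```python
def f1_micro(y_true, y_pred):
    # 多标签 micro-F1:y_true / y_pred 为每个样本一个 0/1 标签向量,
    # 对所有 (样本, 标签) 位置全局累计 TP/FP/FN
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += t and p
            fp += (not t) and p
            fn += t and (not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```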

[NLP-25] GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLs)在空间推理任务中表现不佳的问题。现有方法通过引入来自3D基础模型的几何特征来缓解此问题,但依赖于静态的单层特征提取,导致任务错位偏差:几何特征在预训练过程中自然趋向于3D任务目标,与MLLMs对空间信息的多样化需求不一致,使得单一层次的特征难以满足复杂场景下的空间理解需求。解决方案的关键在于提出GeoAlign框架,其核心创新是动态聚合多层几何特征以实现任务对齐——通过构建分层几何特征库,并利用MLLM原有的视觉token作为内容感知查询,执行逐层稀疏路由,从而自适应地为每个图像patch获取最匹配的几何特征,显著提升空间推理能力。

链接: https://arxiv.org/abs/2604.12630
作者: Zhaochen Liu,Limeng Qiao,Guanglu Wan,Tingting Jiang
机构: Peking University (北京大学); Meituan Inc. (美团)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have exhibited remarkable performance in various visual tasks, yet still struggle with spatial reasoning. Recent efforts mitigate this by injecting geometric features from 3D foundation models, but rely on static single-layer extractions. We identify that such an approach induces a task misalignment bias: the geometric features naturally evolve towards 3D pretraining objectives, which may contradict the heterogeneous spatial demands of MLLMs, rendering any single layer fundamentally insufficient. To resolve this, we propose GeoAlign, a novel framework that dynamically aggregates multi-layer geometric features to realign with the actual demands. GeoAlign constructs a hierarchical geometric feature bank and leverages the MLLM’s original visual tokens as content-aware queries to perform layer-wise sparse routing, adaptively fetching the suitable geometric features for each patch. Extensive experiments on VSI-Bench, ScanQA, and SQA3D demonstrate that our compact 4B model effectively achieves state-of-the-art performance, even outperforming larger existing MLLMs.
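GeoAlign 的逐 patch 稀疏路由可用如下示意说明(以点积为打分、top-k 选层均为本文示例的简化假设,并非论文实现):以 MLLM 的视觉 token 为内容感知查询,在分层几何特征库上为每个 patch 选出最匹配的层。

```python
import numpy as np

def route_geometric_features(query_tokens, feature_bank, top_k=1):
    # query_tokens: [P, d] 的视觉 token 查询;feature_bank: [L, P, d] 的分层几何特征库。
    # scores[l, p] = 第 p 个 patch 的查询与第 l 层对应特征的点积
    scores = np.einsum("pd,lpd->lp", query_tokens, feature_bank)
    # 为每个 patch 选出得分最高的 top_k 个层,返回形状 [top_k, P] 的层索引
    return np.argsort(-scores, axis=0)[:top_k]
```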

[NLP-26] Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs

【速读】: 该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)方法中因采用非结构化文本片段拼接作为上下文而导致的冗余信息累积、语义对齐度低以及推理链断裂等问题,这些问题会降低生成质量并增加令牌(token)消耗。其解决方案的关键在于提出一种基于结构化三元组(triplet)的检索框架——Tri-RAG,通过将外部知识自动转换为由条件(Condition)、证据(Proof)和结论(Conclusion)组成的标准化三元组表示,并以“条件”作为显式语义锚点进行检索与匹配,从而实现更精准的知识单元定位与高效上下文构建,显著提升检索准确性与上下文令牌效率,同时增强复杂推理场景下的生成稳定性与资源利用效率。

链接: https://arxiv.org/abs/2604.12610
作者: Xudong Wang,Chaoning Zhang,Qigan Sun,Zhenzhen Huang,Chang Lu,Sheng Zheng,Zeyu Ma,Caiyan Qin,Yang Yang,Hengtao Shen
机构: Kyung Hee University (庆熙大学); University of Electronic Science and Technology of China (电子科技大学); Harbin Institute of Technology (哈尔滨工业大学); Tongji University (同济大学)
类目: Computation and Language (cs.CL)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) mitigates hallucination in large language models (LLMs) by incorporating external knowledge during generation. However, the effectiveness of RAG depends not only on the design of the retriever and the capacity of the underlying model, but also on how retrieved evidence is structured and aligned with the query. Existing RAG approaches typically retrieve and concatenate unstructured text fragments as context, which often introduces redundant or weakly relevant information. This practice leads to excessive context accumulation, reduced semantic alignment, and fragmented reasoning chains, thereby degrading generation quality while increasing token consumption. To address these challenges, we propose Tri-RAG, a structured triplet-based retrieval framework that improves retrieval efficiency through reasoning-aligned context construction. Tri-RAG automatically transforms external knowledge from natural language into standardized structured triplets consisting of Condition, Proof, and Conclusion, explicitly capturing logical relations among knowledge fragments using lightweight prompt-based adaptation with frozen model parameters. Building on this representation, the triplet head Condition is treated as an explicit semantic anchor for retrieval and matching, enabling precise identification of query-relevant knowledge units without directly concatenating lengthy raw texts. As a result, Tri-RAG achieves a favorable balance between retrieval accuracy and context token efficiency. Experimental results across multiple benchmark datasets demonstrate that Tri-RAG significantly improves retrieval quality and reasoning efficiency, while producing more stable generation behavior and more efficient resource utilization in complex reasoning scenarios.
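Tri-RAG 将外部知识表示为 Condition/Proof/Conclusion 三元组,并以 Condition 作为检索的语义锚点。以下为该思想的极简示意(embed 为假设的向量化函数,相似度取余弦,并非论文实现):

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    condition: str   # 检索与匹配的语义锚点
    proof: str
    conclusion: str

def retrieve_by_condition(query_vec, triplets, embed, top_k=2):
    # 仅对 Condition 向量化并与查询计算余弦相似度,取 top_k 条三元组
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / den
    scored = sorted(triplets, key=lambda t: cos(query_vec, embed(t.condition)),
                    reverse=True)
    return scored[:top_k]
```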

[NLP-27] FABLE: Fine-grained Fact Anchoring for Unstructured Model Editing ACL2026

【速读】: 该论文旨在解决现有无结构模型编辑方法在处理真实文本时存在的问题:即模型倾向于整体记忆文本内容,而缺乏对细粒度事实的可靠访问能力。为应对这一挑战,作者提出FABLE框架,其核心在于采用分层式架构,将细粒度事实注入与整体文本生成过程解耦。具体而言,FABLE遵循“先事实后生成”的两阶段策略——首先在浅层网络中锚定离散的事实信息,随后仅对深层网络进行最小化更新以生成连贯文本。这种解耦设计有效缓解了整体召回与细粒度事实访问之间的不匹配问题,契合Transformer架构中从底层事实表示到上层表面形式生成的单向流动特性。

链接: https://arxiv.org/abs/2604.12559
作者: Peng Wang,Biyu Zhou,Xuehai Tang,Jizhong Han,Songlin Hu
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院)
类目: Computation and Language (cs.CL)
备注: ACL 2026 findings

点击查看摘要

Abstract:Unstructured model editing aims to update models with real-world text, yet existing methods often memorize text holistically without reliable fine-grained fact access. To address this, we propose FABLE, a hierarchical framework that decouples fine-grained fact injection from holistic text generation. FABLE follows a two-stage, fact-first strategy: discrete facts are anchored in shallow layers, followed by minimal updates to deeper layers to produce coherent text. This decoupling resolves the mismatch between holistic recall and fine-grained fact access, reflecting the unidirectional Transformer flow in which surface-form generation amplifies rather than corrects underlying fact representations. We also introduce UnFine, a diagnostic benchmark with fine-grained question-answer pairs and fact-level metrics for systematic evaluation. Experiments show that FABLE substantially improves fine-grained question answering while maintaining state-of-the-art holistic editing performance. Our code is publicly available at this https URL.

[NLP-28] When Does Data Augmentation Help? Evaluating LLM and Back-Translation Methods for Hausa and Fongbe NLP KDD2026

【速读】: 该论文旨在解决低资源非洲语言在自然语言处理(Natural Language Processing, NLP)发展中因数据稀缺而导致的性能瓶颈问题。其解决方案的关键在于系统评估两种数据增强方法——基于大语言模型(Large Language Model, LLM)生成(Gemini 2.5 Flash)与回译(Back-Translation, NLLB-200)——在豪萨语(Hausa)和丰贝语(Fongbe)上的效果,发现增强效果主要由任务类型决定,而非单纯依赖于语言或LLM生成质量;例如,同一LLM生成的数据对丰贝语的命名实体识别(Named Entity Recognition, NER)产生负面影响,却提升词性标注(Part-of-Speech Tagging, POS)性能,表明任务结构才是决定数据增强成败的核心因素,从而提出应将数据增强视为任务特定干预手段,而非通用预处理步骤。

链接: https://arxiv.org/abs/2604.12540
作者: Mahounan Pericles Adjovi,Roald Eiselen,Prasenjit Mitra
机构: Carnegie Mellon University Africa, Kigali, Rwanda; Centre for Text Technology, North-West University, Potchefstroom, South Africa
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 6 tables; previously submitted to KDD 2026

点击查看摘要

Abstract:Data scarcity limits NLP development for low-resource African languages. We evaluate two data augmentation methods – LLM-based generation (Gemini 2.5 Flash) and back-translation (NLLB-200) – for Hausa and Fongbe, two West African languages that differ substantially in LLM generation quality. We assess augmentation on named entity recognition (NER) and part-of-speech (POS) tagging using MasakhaNER 2.0 and MasakhaPOS benchmarks. Our results reveal that augmentation effectiveness depends on task type rather than language or LLM quality alone. For NER, neither method improves over baseline for either language; LLM augmentation reduces Hausa NER by 0.24% F1 and Fongbe NER by 1.81% F1. For POS tagging, LLM augmentation improves Fongbe by 0.33% accuracy, while back-translation improves Hausa by 0.17%; back-translation reduces Fongbe POS by 0.35% and has negligible effect on Hausa POS. The same LLM-generated synthetic data produces opposite effects across tasks for Fongbe – hurting NER while helping POS – suggesting task structure governs augmentation outcomes more than synthetic data quality. These findings challenge the assumption that LLM generation quality predicts augmentation success, and provide actionable guidance: data augmentation should be treated as a task-specific intervention rather than a universally beneficial preprocessing step.

[NLP-29] Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis CVPR2026

【速读】: 该论文旨在解决多模态情感分析(Multimodal Sentiment Analysis, MSA)中因模态竞争导致的弱模态信息被主导模态压制的问题,从而影响融合性能和鲁棒性。解决方案的关键在于提出增强-平衡模态协作框架(Enhance-then-Balance Modality Collaboration, EBMC),其核心包括三个机制:首先通过语义解耦与跨模态增强提升弱模态表示质量;其次引入能量引导的模态协调机制,基于可微分的平衡目标实现隐式梯度重平衡,缓解主导模态的过度影响;最后采用实例感知的模态可信度蒸馏方法,动态估计样本级可靠性并自适应调节融合权重,确保在缺失模态情况下的鲁棒性表现。

链接: https://arxiv.org/abs/2604.12518
作者: Kang He,Yuzhe Ding,Xinrong Wang,Fei Li,Chong Teng,Donghong Ji
机构: Wuhan University (武汉大学); Shanghai Innovation Institute (上海创新研究院)
类目: Computation and Language (cs.CL)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Multimodal sentiment analysis (MSA) integrates heterogeneous text, audio, and visual signals to infer human emotions. While recent approaches leverage cross-modal complementarity, they often struggle to fully utilize weaker modalities. In practice, dominant modalities tend to overshadow non-verbal ones, inducing modality competition and limiting overall contributions. This imbalance degrades fusion performance and robustness under noisy or missing modalities. To address this, we propose a novel model, Enhance-then-Balance Modality Collaboration framework (EBMC). EBMC improves representation quality via semantic disentanglement and cross-modal enhancement, strengthening weaker modalities. To prevent dominant modalities from overwhelming others, an Energy-guided Modality Coordination mechanism achieves implicit gradient rebalancing via a differentiable equilibrium objective. Furthermore, Instance-aware Modality Trust Distillation estimates sample-level reliability to adaptively modulate fusion weights, ensuring robustness. Extensive experiments demonstrate that EBMC achieves state-of-the-art or competitive results and maintains strong performance under missing-modality settings.

[NLP-30] Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs ACL2026

【速读】: 该论文旨在解决当前音频大语言模型(Audio Large Language Models, AudioLLMs)在复杂推理任务中表现优异,但在细粒度声学感知任务上持续表现欠佳的问题。研究指出,这一性能倒置现象源于以自动语音识别(ASR)为中心的训练范式,该范式虽提供精确的语言目标,却隐式地引导模型将副语言线索(paralinguistic cues)和非语言声学事件视为噪声并加以抑制。解决方案的关键在于提出统一音频结构(Unified Audio Schema, UAS),这是一种结构化的监督框架,将音频信息组织为三个显式组件——转录文本(Transcription)、副语言信息(Paralinguistics)和非语言事件(Non-linguistic Events)——并以统一的JSON格式呈现。该设计在不牺牲音频-文本对齐精度的前提下实现了对声学信息的全面覆盖,从而显著提升细粒度声学感知能力,同时保持强大的推理性能。

链接: https://arxiv.org/abs/2604.12506
作者: Linhao Zhang,Yuhan Song,Aiwei Liu,Chuhan Wu,Sijun Zhang,Wei Jia,Yuan Liu,Houfeng Wang,Xiao Zhou
机构: Basic Model Technology Center, WeChat AI, Tencent Inc.; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted to ACL 2026 Findings

点击查看摘要

Abstract:Recent Audio Large Language Models (AudioLLMs) exhibit a striking performance inversion: while excelling at complex reasoning tasks, they consistently underperform on fine-grained acoustic perception. We attribute this gap to a fundamental limitation of ASR-centric training, which provides precise linguistic targets but implicitly teaches models to suppress paralinguistic cues and acoustic events as noise. To address this, we propose Unified Audio Schema (UAS), a holistic and structured supervision framework that organizes audio information into three explicit components – Transcription, Paralinguistics, and Non-linguistic Events – within a unified JSON format. This design achieves comprehensive acoustic coverage without sacrificing the tight audio-text alignment that enables reasoning. We validate the effectiveness of this supervision strategy by applying it to both discrete and continuous AudioLLM architectures. Extensive experiments on MMSU, MMAR, and MMAU demonstrate that UAS-Audio yields consistent improvements, boosting fine-grained perception by 10.9% on MMSU over the same-size state-of-the-art models while preserving robust reasoning capabilities. Our code and model are publicly available at this https URL.
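UAS 将音频信息组织为转录、副语言与非语言事件三部分的统一 JSON。以下构造一条该风格的监督目标作为示意;具体键名与取值为假设示例,并非论文公布的官方 schema:

```python
import json

def build_uas_target(transcription, paralinguistics, events):
    # 按摘要描述的三个组成部分组织为统一 JSON 字符串(键名为假设示例)
    return json.dumps({
        "Transcription": transcription,            # 语言内容
        "Paralinguistics": paralinguistics,        # 副语言线索(如情绪、语速)
        "Non-linguistic Events": events,           # 非语言声学事件
    }, ensure_ascii=False)
```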

[NLP-31] Topology-Aware Reasoning over Incomplete Knowledge Graph with Graph-Based Soft Prompting

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在知识密集型场景下易产生幻觉的问题,尤其针对多跳知识库问答(Multi-hop Knowledge Base Question Answering, KBQA)中现有方法依赖显式边遍历而导致对知识图谱(Knowledge Graph, KG)不完整性敏感的缺陷。其解决方案的关键在于提出一种基于图结构的软提示(soft prompting)框架,将推理范式从节点级路径遍历转变为子图级推理:通过图神经网络(Graph Neural Network, GNN)编码提取的结构化子图生成软提示,使LLM能够利用更丰富的结构上下文识别超出直接邻居的相关实体,从而降低对缺失边的敏感性;同时引入两阶段机制——轻量级LLM先利用软提示筛选问题相关实体与关系,再由更强的LLM进行证据感知的答案生成,在保证性能的同时显著降低计算开销。

链接: https://arxiv.org/abs/2604.12503
作者: Shuai Wang,Xixi Wang,Yinan Yu
机构: Chalmers University of Technology and University of Gothenburg, Sweden; Technical University of Denmark, Kgs. Lyngby, Denmark
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable capabilities across various tasks but remain prone to hallucinations in knowledge-intensive scenarios. Knowledge Base Question Answering (KBQA) mitigates this by grounding generation in Knowledge Graphs (KGs). However, most multi-hop KBQA methods rely on explicit edge traversal, making them fragile to KG incompleteness. In this paper, we proposed a novel graph-based soft prompting framework that shifts the reasoning paradigm from node-level path traversal to subgraph-level reasoning. Specifically, we employ a Graph Neural Network (GNN) to encode extracted structural subgraphs into soft prompts, enabling LLM to reason over richer structural context and identify relevant entities beyond immediate graph neighbors, thereby reducing sensitivity to missing edges. Furthermore, we introduce a two-stage paradigm that reduces computational cost while preserving good performance: a lightweight LLM first leverages the soft prompts to identify question-relevant entities and relations, followed by a more powerful LLM for evidence-aware answer generation. Experiments on four multi-hop KBQA benchmarks show that our approach achieves state-of-the-art performance on three of them, demonstrating its effectiveness. Code is available at the repository: this https URL.
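子图经 GNN 编码为软提示的过程可以用一个极简消息传递示意(邻居均值聚合、均值池化与软提示长度均为本文示例的简化假设,并非论文所用 GNN 结构):

```python
import numpy as np

def gnn_soft_prompt(node_feats, edges, rounds=2, prompt_len=1):
    # node_feats: [N, d] 的节点特征;edges: 无向边列表 [(i, j), ...]。
    # 每轮将节点特征与其邻居均值做平均(简化的消息传递),
    # 最后对全部节点做均值池化,平铺成 prompt_len 个软提示向量。
    h = node_feats.copy()
    n = len(h)
    for _ in range(rounds):
        new_h = h.copy()
        for i in range(n):
            nbrs = [h[j] for a, j in edges if a == i] + \
                   [h[j] for j, a in edges if a == i]
            if nbrs:
                new_h[i] = (h[i] + np.mean(nbrs, axis=0)) / 2
        h = new_h
    return np.tile(h.mean(axis=0), (prompt_len, 1))
```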

[NLP-32] Latent Planning Emerges with Scale ICLR2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在执行看似需要规划的任务时,是否存在隐式规划能力的问题。传统观点认为LLMs通过显式表述计划来完成复杂任务,但其是否具备内在的、未被直接观察到的规划机制尚不明确。论文的关键解决方案是提出“潜在规划”(latent planning)的概念,并构建可测量的框架:即当模型内部存在能引导特定未来token(如“accountant”)生成并塑造前序上下文以支持该token的表征时,即视为具备潜在规划能力。研究基于Qwen-3系列模型(0.6B–14B参数规模)在简单和复杂任务上的实证分析表明,随着模型规模增大,潜在规划能力显著增强,且可通过特定提示(steering)进一步激发,从而为理解LLMs中规划机制的演化提供了机制性证据。

链接: https://arxiv.org/abs/2604.12493
作者: Michael Hanna,Emmanuel Ameisen
机构: University of Amsterdam (阿姆斯特丹大学); Anthropic (Anthropic)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ICLR 2026

点击查看摘要

Abstract:LLMs can perform seemingly planning-intensive tasks, like writing coherent stories or functioning code, without explicitly verbalizing a plan; however, the extent to which they implicitly plan is unknown. In this paper, we define latent planning as occurring when LLMs possess internal planning representations that (1) cause the generation of a specific future token or concept, and (2) shape preceding context to license said future token or concept. We study the Qwen-3 family (0.6B-14B) on simple planning tasks, finding that latent planning ability increases with scale. Models that plan possess features that represent a planned-for word like “accountant”, and cause them to output “an” rather than “a”; moreover, even the less-successful Qwen-3 4B-8B have nascent planning mechanisms. On the more complex task of completing rhyming couplets, we find that models often identify a rhyme ahead of time, but even large models seldom plan far ahead. However, we can elicit some planning that increases with scale when steering models towards planned words in prose. In sum, we offer a framework for measuring planning and mechanistic evidence of how models’ planning abilities grow with scale.

[NLP-33] Calibrated Confidence Estimation for Tabular Question Answering

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在表格问答(Tabular Question Answering, Tabular QA)任务中缺乏可靠置信度估计的问题。现有研究普遍忽视结构化数据上的校准问题,导致模型严重过自信(平滑期望校准误差 Smooth ECE 为 0.35–0.64,远高于文本问答任务的 0.10–0.15)。解决方案的关键在于提出一种名为多格式一致性(Multi-Format Agreement, MFA)的新方法,利用结构化数据特有的无损且确定性的序列化变体(如 Markdown、HTML、JSON、CSV)来估算置信度,相比采样基线可降低 20% 的 API 成本;MFA 在 TableBench 上平均 AUROC 达到 0.80,并能与自一致性等方法互补,显著提升整体性能(例如 MFA + 自一致性组合使 AUROC 从 0.74 提升至 0.82),同时结合结构感知校准进一步改善校准效果。

链接: https://arxiv.org/abs/2604.12491
作者: Lukas Voss
机构: Google(谷歌); OpenAI; Meta; Together AI
类目: Computation and Language (cs.CL)
备注: 27 pages, 9 figures, 17 tables (8-page main body + appendix)

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed for tabular question answering, yet calibration on structured data is largely unstudied. This paper presents the first systematic comparison of five confidence estimation methods across five frontier LLMs and two tabular QA benchmarks. All models are severely overconfident (smooth ECE 0.35-0.64 versus 0.10-0.15 reported for textual QA). A consistent self-evaluation versus perturbation dichotomy replicates across both benchmarks and all four fully-covered models: self-evaluation methods (verbalized, P(True)) achieve AUROC 0.42-0.76, while perturbation methods (semantic entropy, self-consistency, and our Multi-Format Agreement) achieve AUROC 0.78-0.86. Per-model paired bootstrap tests reject the null at p0.001 after Holm-Bonferroni correction, and a 3-seed check on GPT-4o-mini gives a per-seed standard deviation of only 0.006. The paper proposes Multi-Format Agreement (MFA), which exploits the lossless and deterministic serialization variation unique to structured data (Markdown, HTML, JSON, CSV) to estimate confidence at 20% lower API cost than sampling baselines. MFA reduces ECE by 44-63%, generalizes across all four models on TableBench (mean AUROC 0.80), and combines complementarily with sampling: an MFA + self-consistency ensemble lifts AUROC from 0.74 to 0.82. A secondary contribution, structure-aware recalibration, improves AUROC by +10 percentage points over standard post-hoc methods. 
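MFA 利用结构化数据可无损且确定性地序列化为多种格式(Markdown、HTML、JSON、CSV)这一特性,以不同格式下答案的一致程度估计置信度。以下为该思想的极简示意(answer_fn 代表假设的模型调用,序列化函数由调用方提供,并非论文实现):

```python
from collections import Counter

def mfa_confidence(table, question, answer_fn, serializers):
    # 对同一张表的每种序列化格式分别提问,
    # 以多数答案的票数占比作为置信度估计
    answers = [answer_fn(serialize(table), question) for serialize in serializers]
    top_answer, votes = Counter(answers).most_common(1)[0]
    return top_answer, votes / len(answers)
```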

[NLP-34] KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理知识密集型推理任务时,尤其是在复杂多跳查询中,难以实现高效且连贯的结构化知识图谱(Knowledge Graph, KG)推理问题。现有方法通常采用固定流水线将推理过程分解为孤立步骤,限制了推理灵活性并导致中间信息丢失。解决方案的关键在于提出KG-Reasoner框架,该框架将多步推理整合进一个统一的“思考”阶段,并通过强化学习(Reinforcement Learning, RL)训练LLM内化KG遍历过程,使其能够动态探索推理路径并在必要时进行回溯,从而提升多跳推理的连贯性与准确性。

链接: https://arxiv.org/abs/2604.12487
作者: Shuai Wang,Yinan Yu
机构: Chalmers University of Technology and University of Gothenburg (查尔姆斯理工大学和哥德堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit strong abilities in natural language understanding and generation, yet they struggle with knowledge-intensive reasoning. Structured Knowledge Graphs (KGs) provide an effective form of external knowledge representation and have been widely used to enhance performance in classical Knowledge Base Question Answering (KBQA) tasks. However, performing precise multi-hop reasoning over KGs for complex queries remains highly challenging. Most existing approaches decompose the reasoning process into a sequence of isolated steps executed through a fixed pipeline. While effective to some extent, such designs constrain reasoning flexibility and fragment the overall decision process, often leading to incoherence and the loss of critical intermediate information from earlier steps. In this paper, we introduce KG-Reasoner, an end-to-end framework that integrates multi-step reasoning into a unified “thinking” phase of a Reasoning LLM. Through Reinforcement Learning (RL), the LLM is trained to internalize the KG traversal process, enabling it to dynamically explore reasoning paths, and perform backtracking when necessary. Experiments on eight multi-hop and knowledge-intensive reasoning benchmarks demonstrate that KG-Reasoner achieves competitive or superior performance compared to the state-of-the-art methods. Codes are available at the repository: this https URL.
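KG-Reasoner 在“思考”阶段内化的 KG 遍历行为,可以用一个带回溯的深度优先搜索粗略示意(这只是遍历与回退行为的说明性骨架,并非 RL 训练出的策略):

```python
def traverse(kg, start, goal, max_hops=3, path=None):
    # kg: 三元组 (头实体, 关系, 尾实体) 的列表。
    # 从 start 出发逐跳扩展;到达 goal 返回路径,超过 max_hops 或走入死路则回溯。
    path = path or [start]
    if start == goal:
        return path
    if len(path) > max_hops:
        return None
    for h, r, t in kg:
        if h == start and t not in path:
            found = traverse(kg, t, goal, max_hops, path + [t])
            if found:
                return found
    return None  # 回溯:当前分支无解
```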

[NLP-35] Meet Dynamic Individual Preferences: Resolving Conflicting Human Value with Paired Fine-Tuning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在适应个体用户偏好时面临的挑战,尤其是这些偏好具有多样性且动态变化的特性。现有方法难以有效处理个体间或同一用户内部存在的矛盾偏好(value conflicts),导致模型在个性化场景下的表现受限。其解决方案的关键在于提出一种名为“偏好配对微调”(Preference-Paired Fine-Tuning, PFT)的新框架,该框架通过显式建模冲突偏好关系,并利用新构建的“价值冲突困境”(Value Conflict Dilemma, VCD)数据集进行训练与评估,使模型能够在有限用户历史数据下快速推断出用户的偏好向量,从而显著提升对个体偏好的适配精度,相较单一偏好训练方法提升达44.76%。

链接: https://arxiv.org/abs/2604.12479
作者: Shanyong Wang,Shuhang Lin,Yining Zhao,Xi Zhu,Yongfeng Zhang
机构: Rutgers University—New Brunswick(罗格斯大学新布朗斯维克分校); University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注: 20 pages, 13 figures

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have significantly improved the alignment of models with general human preferences. However, a major challenge remains in adapting LLMs to individual preferences, which are not only diverse but also dynamic. In this paper, we introduce a novel framework, Preference-Paired Fine-Tuning (PFT), designed to align models with contradictory and evolving individual preferences. We present a new dataset, Value Conflict Dilemma (VCD), which includes scenarios that involve conflicting human preferences, facilitating the evaluation of our approach. Our experiments demonstrate that PFT outperforms single-preference training methods, achieving up to 96.6% accuracy in multi-choice classification tasks and the highest open-ended generation score of 8.69. PFT also shows significant improvements over DPO, SFT and some traditional training methods, especially when handling conflicting preferences. Additionally, with limited user history data, models can infer preference vectors rapidly, achieving a 44.76% improvement in user-specific preference alignment in comparison to single-preference models.

[NLP-36] Mining Large Language Models for Low-Resource Language Data: Comparing Elicitation Strategies for Hausa and Fongbe COLING2026 LREC

【速读】: 该论文旨在解决低资源语言社区贡献的语料在大型语言模型(Large Language Models, LLMs)中编码的 linguistic knowledge 无法被有效提取的问题,尤其是针对西非两种低资源语言——豪萨语(Hausa)和丰贝语(Fongbe)。其解决方案的关键在于通过系统性地设计不同类型的提示策略(elicitation task types),从两个商用LLM(GPT-4o Mini 和 Gemini 2.5 Flash)中高效提取可用的目标语言文本数据。研究发现,GPT-4o Mini 在每轮API调用中可提取比Gemini多6至41倍的可用目标语言词汇,且最优提示策略因语言特性而异:豪萨语受益于功能文本和对话类提示,而丰贝语则需依赖约束生成类提示。

链接: https://arxiv.org/abs/2604.12477
作者: Mahounan Pericles Adjovi,Roald Eiselen,Prasenjit Mitra
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures, 6 tables; to appear in LREC-COLING 2026

点击查看摘要

Abstract:Large language models (LLMs) are trained on data contributed by low-resource language communities, yet the linguistic knowledge encoded in these models remains accessible only through commercial APIs. This paper investigates whether strategic prompting can extract usable text data from LLMs for two West African languages: Hausa (Afroasiatic, approximately 80 million speakers) and Fongbe (Niger-Congo, approximately 2 million speakers). We systematically compare six elicitation task types across two commercial LLMs (GPT-4o Mini and Gemini 2.5 Flash). GPT-4o Mini extracts 6-41 times more usable target-language words per API call than Gemini. Optimal strategies differ by language: Hausa benefits from functional text and dialogue, while Fongbe requires constrained generation prompts. We release all generated corpora and code.
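摘要中"每次 API 调用可提取的目标语言可用词数"这一效率指标,可用如下极简 Python 草图说明(`is_target_language` 为本文假设的词级语言判别函数,实际工作中应使用语言识别器;数值仅为玩具示例,并非论文数据)。

```python
def usable_words_per_call(responses, is_target_language):
    """示意指标:每次 API 调用平均产出的目标语言可用词数。"""
    total = sum(
        sum(1 for w in resp.split() if is_target_language(w))
        for resp in responses
    )
    return total / len(responses)

# 玩具例子:用"小写纯字母词"粗略近似目标语言词(仅为示意)
rate = usable_words_per_call(
    ["ina kwana lafiya", "Hello ina"],
    lambda w: w.isalpha() and w.islower(),
)
```

用该指标即可在不同提示策略之间比较单位调用成本下的语料产出。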

[NLP-37] Latent-Condensed Transformer for Efficient Long Context Modeling ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长上下文时面临的两大瓶颈:一是键值(Key-Value, KV)缓存随序列长度线性增长导致的内存消耗问题,二是自注意力(self-attention)计算复杂度随序列长度平方增长带来的推理效率低下问题。现有方法如多头潜在注意力(Multi-head Latent Attention, MLA)通过将token映射到低维潜在空间来压缩KV缓存,而稀疏注意力则试图减少计算量,但二者难以协同优化,尤其稀疏方法无法直接作用于MLA的压缩潜在结构。为此,作者提出潜在凝聚注意力(Latent-Condensed Attention, LCA),其关键在于在MLA的潜在空间中直接对上下文进行凝聚:将潜在表示解耦为语义潜在向量和位置键(positional keys),分别通过查询感知池化聚合语义信息、通过锚点选择保留位置信息,从而在不引入额外参数的前提下,同时降低计算开销与KV缓存占用。理论证明LCA具有长度无关的误差界,实验表明其可在128K上下文下实现高达2.5倍的预填充加速和90%的KV缓存缩减,且保持竞争力性能。

链接: https://arxiv.org/abs/2604.12452
作者: Zeng You,Yaofo Chen,Qiuwu Chen,Ying Sun,Shuhai Zhang,Yingjian Li,Yaowei Wang,Mingkui Tan
机构: South China University of Technology (华南理工大学); Pengcheng Laboratory (鹏城实验室); AIGCode; Harbin Institute of Technology (哈尔滨工业大学); Pazhou Laboratory (琶洲实验室)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2026

点击查看摘要

Abstract:Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and quadratic complexity of self-attention. Existing approaches address these bottlenecks separately: Multi-head Latent Attention (MLA) reduces the KV cache by projecting tokens into a low-dimensional latent space, while sparse attention reduces computation. However, sparse methods cannot operate natively on MLA’s compressed latent structure, missing opportunities for joint optimization. In this paper, we propose Latent-Condensed Attention (LCA), which directly condenses context within MLA’s latent space, where the representation is disentangled into semantic latent vectors and positional keys. LCA separately aggregates semantic vectors via query-aware pooling and preserves positional keys via anchor selection. This approach jointly reduces both computational cost and KV cache without adding parameters. Beyond MLA, LCA’s design is architecture-agnostic and readily extends to other attention mechanisms such as GQA. Theoretically, we prove a length-independent error bound. Experiments show LCA achieves up to 2.5 \times prefilling speedup and 90% KV cache reduction at 128K context while maintaining competitive performance.
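下面用纯 Python 给出"查询感知池化"思想的极简示意:按查询与各潜在向量的相似度做 softmax 加权聚合语义信息。这只是对摘要描述的一种假设性草图,并非论文实现(论文中的池化作用于 MLA 的潜在空间,并与位置键的锚点选择配合)。

```python
import math

def query_aware_pooling(query, latents):
    """按查询与各潜在向量的点积相似度做 softmax 加权平均(示意)。"""
    scores = [sum(q * x for q, x in zip(query, lat)) for lat in latents]
    m = max(scores)                       # 数值稳定的 softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(latents[0])
    return [sum(w * lat[d] for w, lat in zip(weights, latents)) for d in range(dim)]

# 示例:与查询更相似的潜在向量获得更大的聚合权重
pooled = query_aware_pooling([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```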

[NLP-38] GLeMM: A large-scale multilingual dataset for morphological research

【速读】: 该论文旨在解决衍生形态学(derivational morphology)中词形与词义关系变异机制的研究长期依赖直觉和有限数据观察、难以复现与推广的问题。其解决方案的关键在于构建一个名为GLeMM的新型衍生资源,该资源具备五大特征:大规模数据规模、覆盖七种欧洲语言(德语、英语、西班牙语、法语、意大利语、波兰语、俄语)、全自动化设计且跨语言一致、自动标注每个词条的形态学特征,并对部分词条编码语义描述。这一结构化、可扩展且数据驱动的资源使研究者能够系统性地探究词形与语义在构词过程中的作用,并实验验证计算方法对衍生形态结构的识别能力。

链接: https://arxiv.org/abs/2604.12442
作者: Hathout Nabil(CLLE, Comue de Toulouse),Basilio Calderone(CLLE, UBM),Fiammetta Namer(ATILF, UL),Franck Sajous(CLLE-ERSS, Comue de Toulouse)
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In derivational morphology, what mechanisms govern the variation in form-meaning relations between words? The answers to this type of questions are typically based on intuition and on observations drawn from limited data, even when a wide range of languages is considered. Many of these studies are difficult to replicate and generalize. To address this issue, we present GLeMM, a new derivational resource designed for experimentation and data-driven description in morphology. GLeMM is characterized by (i) its large size, (ii) its extensive coverage (currently amounting to seven European languages, i.e., German, English, Spanish, French, Italian, Polish, Russian), (iii) its fully automated design, identical across all languages, (iv) the automatic annotation of morphological features on each entry, as well as (v) the encoding of semantic descriptions for a significant subset of these entries. It enables researchers to address difficult questions, such as the role of form and meaning in word-formation, and to develop and experimentally test computational methods that identify the structures of derivational morphology. The article describes how GLeMM is created using Wiktionary articles and presents various case studies illustrating possible applications of the resource.
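作为示意,GLeMM 中一个带形态特征标注与语义描述的词条,大致可以用如下数据结构表示(字段名与取值均为本文假设,并非该资源的实际格式)。

```python
from dataclasses import dataclass, field

@dataclass
class DerivationalEntry:
    """GLeMM 式衍生词条的示意结构(字段名为本文假设)。"""
    lexeme: str                  # 词条本身
    language: str                # 语言代码
    base: str                    # 派生基词
    process: str                 # 构词过程,如 suffixation
    features: dict = field(default_factory=dict)  # 自动标注的形态特征
    semantics: str = ""          # 可选的语义描述

entry = DerivationalEntry(
    lexeme="happiness", language="en", base="happy",
    process="suffixation", features={"affix": "-ness", "pos": "N"},
    semantics="state of being happy",
)
```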

[NLP-39] Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task ICLR2026

【速读】: 该论文旨在解决生成式 AI 模型在处理不同难度任务时是否能自适应地利用其网络深度的问题。研究聚焦于多跳关系推理任务,通过控制任务难度(即关系跳跃数)来观察模型各层输出的变化及信息整合机制。解决方案的关键在于采用两种分析方法:一是使用“logit lens”监测早期层的预测演化情况,二是借助因果补丁(causal patching)追踪任务相关信息在 token 间的整合过程,从而揭示预训练与微调后模型对深度资源的动态分配策略。

链接: https://arxiv.org/abs/2604.12426
作者: Alicia Curth,Rachel Lawrence,Sushrut Karmalkar,Niranjani Prasad
机构: Microsoft Research Cambridge (微软研究院剑桥)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted at the ICLR 2026 Workshop on Logical Reasoning of Large Language Models

点击查看摘要

Abstract:We investigate whether transformers use their depth adaptively across tasks of increasing difficulty. Using a controlled multi-hop relational reasoning task based on family stories, where difficulty is determined by the number of relationship hops that must be composed, we monitor (i) how predictions evolve across layers via early readouts (the logit lens) and (ii) how task-relevant information is integrated across tokens via causal patching. For pretrained models, we find some limited evidence for adaptive depth use: some larger models need fewer layers to arrive at plausible answers for easier tasks, and models generally use more layers to integrate information across tokens as chain length increases. For models finetuned on the task, we find clearer and more consistent evidence of adaptive depth use, with the effect being stronger for less constrained finetuning regimes that do not preserve general language modeling abilities.
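摘要中的"logit lens"(早期读出)做法,可以用如下玩具例子说明:把输出投影矩阵直接作用于每一层的隐藏状态,观察预测在层间如何演化。以下为纯 Python 示意,省略了 LayerNorm 等真实模型中的细节。

```python
def logit_lens(hidden_states, unembed):
    """对每层隐藏状态直接应用输出头,得到各层的早期预测(示意)。
    hidden_states: 每层一个向量;unembed: 词表大小 x 隐维 的嵌套列表。"""
    readouts = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        readouts.append(max(range(len(logits)), key=logits.__getitem__))
    return readouts  # 每层 argmax 对应的词表索引

# 两层隐藏状态、3 词词表的玩具例子:预测在第二层发生变化
preds = logit_lens(
    hidden_states=[[1.0, 0.0], [0.0, 1.0]],
    unembed=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
)
```

若较浅层的读出已给出正确答案,则可视为模型对简单任务"少用"了深度。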

[NLP-40] Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理过程中因语言先验(language priors)过度主导视觉证据而导致的幻觉问题。现有无需训练的缓解方法要么扰动视觉表示从而偏离自然图像分布,要么引入侵入性操作损害模型固有的生成流畅性。其解决方案的关键在于提出一种名为“通过扰动解码”(Decoding by Perturbation, DeP)的新框架,该框架基于一个核心洞察:多模态幻觉表现为视觉定位对文本表述的敏感性过高。DeP通过动态探针施加多层次文本扰动以激发潜在语言先验,并利用注意力方差增强特征空间中稳定的证据区域、抑制可疑噪声;同时借助logits统计构建可解释的先验漂移方向,用以抵消由文本共现引起的概率偏差,从而实现无训练条件下的有效幻觉抑制。

链接: https://arxiv.org/abs/2604.12424
作者: Sihang Jia,Shuliang Liu,Songbo Yang,Yibo Yan,Xin Zou,Xuming Hu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models frequently suffer from inference hallucinations, partially stemming from language priors dominating visual evidence. Existing training-free mitigation methods either perturb the visual representation and deviate from the natural image distribution, or enforce intrusive manipulations that compromise the model’s inherent generative fluency. We introduce a novel perspective that multimodal hallucination manifests as the hypersensitivity of visual grounding to textual phrasing during the decoding phase. Building on this insight, we propose Decoding by Perturbation (DeP), a training-free framework mitigating prior-induced hallucinations via controlled textual interventions. DeP employs a dynamic probe applying multi-level textual perturbations to elicit latent language priors. Leveraging attention variance, it enhances stable evidence regions while suppressing suspicious noise in the feature space. Furthermore, it constructs an interpretable prior drift direction using logits statistics to counteract probability biases from textual co-occurrences. Extensive experiments confirm DeP effectively reduces hallucinations and achieves superior performance across multiple benchmarks.
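DeP 中"利用 logits 统计构建先验漂移方向并加以抵消"的思想,可粗略示意为:从输出 logits 中减去漂移方向的一个倍数(`alpha` 为本文假设的强度超参数,漂移方向的具体构造方式以论文为准)。

```python
def counteract_prior_drift(logits, drift, alpha=1.0):
    """从 logits 中减去先验漂移方向,抵消文本共现带来的概率偏差(示意)。"""
    return [l - alpha * d for l, d in zip(logits, drift)]

# 漂移方向在某个"幻觉"词上为正,校正后该词的 logit 被压低,argmax 随之翻转
adjusted = counteract_prior_drift([2.0, 1.0], [1.5, 0.0], alpha=1.0)
```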

[NLP-41] Agentic Insight Generation in VSM Simulations

【速读】: 该论文旨在解决从复杂价值流图(Value Stream Map, VSM)仿真中提取可操作洞察时面临的挑战,即传统方法在处理数据时难以识别相似数据源间的细微情境差异,导致分析效率低、易出错。解决方案的关键在于提出一种解耦的两阶段智能体架构:通过将任务编排(orchestration)与数据分析分离,系统能够结合领域专家知识进行渐进式数据发现,使编排模块能智能选择数据源并跨数据结构执行多跳推理(multi-hop reasoning),同时保持轻量级内部上下文,从而显著提升准确性与鲁棒性,实验表明顶级大语言模型在此框架下可达86%准确率。

链接: https://arxiv.org/abs/2604.12421
作者: Micha Selak,Dirk Krechel,Adrian Ulges,Sven Spieckermann,Niklas Stoehr,Andreas Loehr
机构: RheinMain University of Applied Sciences, Wiesbaden, GERMANY; SimPlan AG, Hanau, GERMANY
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Extracting actionable insights from complex value stream map simulations can be challenging, time-consuming, and error-prone. Recent advances in large language models offer new avenues to support users with this task. While existing approaches excel at processing raw data to gain information, they are structurally unfit to pick up on subtle situational differences needed to distinguish similar data sources in this domain. To address this issue, we propose a decoupled, two-step agentic architecture. By separating orchestration from data analysis, the system leverages progressive data discovery infused with domain expert knowledge. This architecture allows the orchestration to intelligently select data sources and perform multi-hop reasoning across data structures while maintaining a slim internal context. Results from multiple state-of-the-art large language models demonstrate the framework’s viability: with top-tier models achieving accuracies of up to 86% and demonstrating high robustness across evaluation runs.

[NLP-42] KoCo: Conditioning Language Model Pre-training on Knowledge Coordinates ACL2026

【速读】: 该论文旨在解决标准大语言模型(Large Language Model, LLM)预训练过程中将语料库视为扁平化标记序列、忽视人类自然依赖的真实世界上下文信息的问题。为弥合这一差距,作者提出知识坐标条件化(Knowledge Coordinate Conditioning, KoCo)方法,其核心在于将每篇文档映射到三维语义坐标空间中,并在预训练时将这些坐标作为文本前缀进行注入,从而赋予模型显式的上下文感知能力,使其能够学习文档在真实世界知识结构中的位置关系。实验表明,KoCo不仅显著提升10项下游任务性能,还将预训练收敛速度加快约30%,同时有助于模型区分稳定事实与噪声,有效缓解生成内容中的幻觉问题。

链接: https://arxiv.org/abs/2604.12397
作者: Yudong Li,Jiawei Cai,Linlin Shen
机构: Shenzhen University (深圳大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2026 Main Conference

点击查看摘要

Abstract:Standard Large Language Model (LLM) pre-training typically treats corpora as flattened token sequences, often overlooking the real-world context that humans naturally rely on to contextualize information. To bridge this gap, we introduce Knowledge Coordinate Conditioning (KoCo), a simple method that maps every document into a three-dimensional semantic coordinate. By prepending these coordinates as textual prefixes for pre-training, we aim to equip the model with explicit contextual awareness to learn the documents within the real-world knowledge structure. Experiment results demonstrate that KoCo significantly enhances performance across 10 downstream tasks and accelerates pre-training convergence by approximately 30%. Furthermore, our analysis indicates that explicitly modeling knowledge coordinates helps the model distinguish stable facts from noise, effectively mitigating hallucination in generated outputs.
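KoCo 的核心操作是把三维语义坐标作为文本前缀拼接到文档之前再做预训练,可示意如下(前缀模板 `<coord ...>` 为本文假设的写法,论文未规定具体格式)。

```python
def with_knowledge_coordinates(doc, coords):
    """将三维语义坐标作为文本前缀拼到文档前(坐标模板为示意假设)。"""
    x, y, z = coords
    return f"<coord x={x:.2f} y={y:.2f} z={z:.2f}> {doc}"

# 预训练时,每篇文档都带着它在知识坐标空间中的位置
sample = with_knowledge_coordinates("巴黎是法国的首都。", (0.12, 0.87, 0.40))
```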

[NLP-43] From Myopic Selection to Long-Horizon Awareness: Sequential LLM Routing for Multi-Turn Dialogue

【速读】: 该论文旨在解决多轮对话(multi-turn dialogue)中现有大语言模型(LLM)路由方法因交互动态性和延迟奖励导致的累积性能无法最大化的问题。传统路由策略通常基于单轮选择,缺乏对长期对话效果的考量,难以适应多轮交互中的状态演变与目标达成。解决方案的关键在于提出一种长程序列路由框架DialRouter,其核心创新包括:首先利用蒙特卡洛树搜索(MCTS)在多轮对话中探索不同LLM选择所引发的对话分支,并收集具有高累积奖励的轨迹;随后基于这些搜索所得数据训练一个轻量级路由策略,结合基于检索的未来状态近似机制,从而实现无需在线搜索的多轮路由决策。该方法显著提升了任务成功率,并在性能与成本之间实现了更优权衡。

链接: https://arxiv.org/abs/2604.12385
作者: Jiarui Zhang,Xiangyu Liu,Yong Hu,Chaoyue Niu,Hang Zeng,Shaojie Tang,Fan Wu,Guihai Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multi-turn dialogue is the predominant form of interaction with large language models (LLMs). While LLM routing is effective in single-turn settings, existing methods fail to maximize cumulative performance in multi-turn dialogue due to interaction dynamics and delayed rewards. To address this challenge, we move from myopic, single-turn selection to long-horizon sequential routing for multi-turn dialogue. Accordingly, we propose DialRouter, which first performs MCTS to explore dialogue branches induced by different LLM selections and collect trajectories with high cumulative rewards. DialRouter then learns a lightweight routing policy from search-derived data, augmented with retrieval-based future state approximation, enabling multi-turn routing without online search. Experiments on both open-domain and domain-specific dialogue tasks across diverse candidate sets of both open-source and closed-source LLMs demonstrate that DialRouter significantly outperforms single LLMs and existing routing baselines in task success rate, while achieving a superior performance-cost trade-off when combined with a cost-aware reward.

[NLP-44] ReasonXL: Shifting LLM Reasoning Language Without Sacrificing Performance

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在跨语言推理任务中普遍存在的“英语中心主义”问题,即尽管模型具备多语言能力,但在生成推理过程时仍主要依赖英语,导致非英语场景下的推理质量下降。解决方案的关键在于提出首个大规模多语言推理语料库 ReasonXL,涵盖五种欧洲语言(英语、德语、法语、意大利语和西班牙语),每种语言包含超过两百万对齐样本(包括提示、推理轨迹和最终输出),从而实现对特定语言推理的直接监督;在此基础上,采用两阶段训练流程——监督微调(Supervised Fine-Tuning, SFT)与基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR),使模型能够完全在目标语言中进行推理,同时保持通用知识水平并保留跨语言迁移能力。

链接: https://arxiv.org/abs/2604.12378
作者: Daniil Gurgurov,Tom Röhr,Sebastian von Rohrscheidt,Josef van Genabith,Alexander Löser,Simon Ostermann
机构: German Research Center for Artificial Intelligence (DFKI), Saarland University; Berliner Hochschule für Technik
类目: Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Despite advances in multilingual capabilities, most large language models (LLMs) remain English-centric in their training and, crucially, in their production of reasoning traces. Even when tasked with non-English problems, these models predominantly reason in English, creating a fundamental mismatch for non-English usage scenarios. We address this disparity directly with three contributions. (i) We introduce ReasonXL, the first large-scale parallel corpus of cross-domain reasoning traces spanning five European languages (English, German, French, Italian, and Spanish), with over two million aligned samples per language, each comprising prompts, reasoning traces, and final outputs, enabling direct supervision of language-specific reasoning. (ii) Using ReasonXL, we demonstrate that LLMs can be adapted to reason entirely in a desired target language, using a simple two-stage pipeline of supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR). The resulting models match or exceed baseline performance, with minimal loss in general knowledge and broadly preserved cross-lingual transfer. (iii) We conduct an extensive representational analysis of the adaptation and find a clear functional division across model depth: early layers contain an activation bottleneck that causally determines language identity, while upper layers concentrate the weight and activation changes driven by adaptation. We further find that RLVR achieves greater behavioral divergence from the base model with smaller parameter updates than SFT, suggesting a more efficient representational rerouting despite much smaller weight updates.

[NLP-45] SCRIPT: A Subcharacter Compositional Representation Injection Module for Korean Pre-Trained Language Models ACL2026

【速读】: 该论文旨在解决当前韩国语语言模型(Language Models, LMs)在处理韩语时缺乏对字符内部构字结构(即Jamo,音素单位)的显式建模问题。现有基于子词(subword)分词的方法未能充分捕捉韩语中由Jamo构成的形态学特征和音变规律,从而限制了模型在自然语言理解(NLU)与生成(NLG)任务中的表现。解决方案的关键在于提出一种模型无关的模块SCRIPT,通过向预训练语言模型(PLMs)注入子字符级的组合知识,增强子词嵌入的结构粒度,无需修改模型架构或重新预训练即可显著提升性能,并使嵌入空间更贴近语法规律与语义一致性。

链接: https://arxiv.org/abs/2604.12377
作者: SungHo Kim,Juhyeong Park,Eda Atalay,SangKeun Lee
机构: Korea University(韩国大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2026 Findings

点击查看摘要

Abstract:Korean is a morphologically rich language with a featural writing system in which each character is systematically composed of subcharacter units known as Jamo. These subcharacters not only determine the visual structure of Korean but also encode frequent and linguistically meaningful morphophonological processes. However, most current Korean language models (LMs) are based on subword tokenization schemes, which are not explicitly designed to capture the internal compositional structure of characters. To address this limitation, we propose SCRIPT, a model-agnostic module that injects subcharacter compositional knowledge into Korean PLMs. SCRIPT allows to enhance subword embeddings with structural granularity, without requiring architectural changes or additional pre-training. As a result, SCRIPT enhances all baselines across various Korean natural language understanding (NLU) and generation (NLG) tasks. Moreover, beyond performance gains, detailed linguistic analyses show that SCRIPT reshapes the embedding space in a way that better captures grammatical regularities and semantically cohesive variations. Our code is available at this https URL.

[NLP-46] Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在对话上下文超出其最大上下文窗口时,如何高效管理历史信息并实现精准内容恢复的问题。传统方法如截断(truncation)或基于关键词的检索(BM25、word-overlap)难以平衡记忆保留与推理效率,而全量上下文则受限于计算资源。解决方案的核心是提出“协作分页”(cooperative paging)机制:将被逐出的文本段落替换为极简关键词书签([pN:keywords],约8–24 tokens),同时引入一个可调用的 recall() 工具供模型按需召回完整内容。实验证明,该方法在LoCoMo多轮对话基准上显著优于多种基线方法,在多个主流模型中均取得最高答案质量,且通过系统性消融实验揭示了边界策略、淘汰策略和书签生成方式对性能的关键影响,其中书签区分度不足成为当前主要瓶颈。

链接: https://arxiv.org/abs/2604.12376
作者: Ziyang Liu
机构: Independent Researcher
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 10 figures, 16 tables

点击查看摘要

Abstract:When LLM conversations grow beyond the context window, old content must be evicted – but how does the model recover it when needed? We propose cooperative paging: evicted segments are replaced with minimal keyword bookmarks ([pN:keywords], ~8-24 tokens each), and the model is given a recall() tool to retrieve full content on demand. On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging achieves the highest answer quality among six methods – outperforming truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context – on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), confirmed by four independent LLM judges ( p=0.017 , paired bootstrap). We then study the paging design space with a 5x4 ablation over boundary strategies and eviction policies (3,176 synthetic probes, 1,600 LoCoMo probes). Key findings: (1) coarse fixed-size pages (fixed_20) reach 96.7% while content-aware topic_shift collapses to 56.7%; (2) eviction policy choice is data-dependent (FIFO best on synthetic, LFU on LoCoMo); (3) two bookmark generation strategies improve over the heuristic baseline (+4.4 and +8.7 E2E points); (4) the remaining bottleneck is bookmark discrimination – the model triggers recall() 96% of the time but selects the correct page only 57% when bookmarks are insufficiently distinctive. Keyword specificity alone accounts for a 25 percentage point accuracy difference.
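摘要中的协作分页机制——逐出的段落换成 `[pN:keywords]` 书签、模型通过 recall() 工具按需取回全文——可用如下极简草图说明。这里把淘汰策略简化为 FIFO,书签关键词由调用方直接提供;均为示意性假设,并非论文实现。

```python
class PagedMemory:
    """协作分页的极简示意:逐出页以关键词书签占位,recall() 取回全文。"""

    def __init__(self, max_pages=2):
        self.max_pages = max_pages
        self.active = []     # 活跃页: (page_id, 全文)
        self.bookmarks = {}  # page_id -> 关键词书签
        self.archive = {}    # page_id -> 被逐出的全文
        self.next_id = 0

    def add(self, text, keywords):
        pid = self.next_id
        self.next_id += 1
        self.active.append((pid, text))
        self.bookmarks[pid] = f"[p{pid}:{keywords}]"
        if len(self.active) > self.max_pages:   # FIFO:逐出最旧的页
            old_id, old_text = self.active.pop(0)
            self.archive[old_id] = old_text
        return pid

    def context(self):
        """当前上下文 = 被逐出页的书签 + 活跃页全文。"""
        marks = [self.bookmarks[i] for i in sorted(self.archive)]
        return marks + [t for _, t in self.active]

    def recall(self, page_id):
        """模型按需调用的 recall() 工具:返回被逐出页的完整内容。"""
        return self.archive.get(page_id, "")

mem = PagedMemory(max_pages=2)
mem.add("用户说她养了一只叫 Momo 的猫。", "宠物,猫,Momo")
mem.add("讨论了周末的旅行计划。", "旅行,周末")
mem.add("用户换了新工作。", "工作,变动")   # 第一页被逐出,上下文中只留书签
```

摘要指出的瓶颈正出现在最后一步:模型要从多个书签中选对要 recall 的页,书签关键词的区分度因此至关重要。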

[NLP-47] Nemotron 3 Super: Open Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

【速读】: 该论文旨在解决大模型在训练效率、推理性能与资源消耗之间的平衡问题,特别是如何在保持高精度的同时提升计算效率和部署灵活性。其解决方案的关键在于:首先采用NVFP4(NVIDIA 4比特浮点格式)进行预训练以提升训练效率;其次引入LatentMoE架构,同时优化每FLOP精度与每参数精度,实现更高效的专家稀疏激活;最后集成MTP(Multi-Token Prediction,多词元预测)层,通过原生推测解码(speculative decoding)显著加速推理过程。这些技术共同使Nemotron 3 Super在支持长达100万token上下文的基础上,相较GPT-OSS-120B和Qwen3.5-122B分别实现最高达2.2倍和7.5倍的推理吞吐量提升。

链接: https://arxiv.org/abs/2604.12374
作者: NVIDIA:Aakshita Chandiramani,Aaron Blakeman,Abdullahi Olaoye,Abhibha Gupta,Abhilash Somasamudramath,Abhinav Khattar,Adeola Adesoba,Adi Renduchintala,Adil Asif,Aditya Agrawal,Aditya Vavre,Ahmad Kiswani,Aishwarya Padmakumar,Ajay Hotchandani,Akanksha Shukla,Akhiad Bercovich,Aleksander Ficek,Aleksandr Shaposhnikov,Alex Gronskiy,Alex Kondratenko,Alex Neefus,Alex Steiner,Alex Yang,Alexander Bukharin,Alexander Young,Ali Hatamizadeh,Ali Taghibakhshi,Alina Galiautdinova,Alisa Liu,Alok Kumar,Ameya Sunil Mahabaleshwarkar,Amir Klein,Amit Zuker,Amnon Geifman,Anahita Bhiwandiwalla,Ananth Subramaniam,Andrew Tao,Anjaney Shrivastava,Anjulie Agrusa,Ankur Srivastava,Ankur Verma,Ann Guan,Anna Shors,Annamalai Chockalingam,Anubhav Mandarwal,Aparnaa Ramani,Arham Mehta,Arti Jain,Arun Venkatesan,Asha Anoosheh,Ashwath Aithal,Ashwin Poojary,Asif Ahamed,Asit Mishra,Asli Sabanci Demiroz,Asma Kuriparambil Thekkumpate,Atefeh Sohrabizadeh,Avinash Kaur,Ayush Dattagupta,Barath Subramaniam Anandan,Bardiya Sadeghi,Barnaby Simkin,Ben Lanir,Benedikt Schifferer,Benjamin Chislett,Besmira Nushi,Bilal Kartal,Bill Thiede,Bita Darvish Rouhani,Bobby Chen,Boris Ginsburg,Brandon Norick,Branislav Kisacanin,Brian Yu,Bryan Catanzaro,Buvaneswari Mani,Carlo del Mundo,Chankyu Lee,Chanran Kim,Chantal Hwang,Chao Ni,Charles Wang,Charlie Truong,Cheng-Ping Hsieh,Chenhan Yu,Chenjie Luo,Cherie Wang,Chetan Mungekar,Chintan Patel,Chris Alexiuk,Chris Holguin,Chris Wing,Christian Munley,Christopher Parisien,Chuck Desai,Chunyang Sheng,Collin Neale,Cyril Meurillon,Dakshi Kumar
机构: NVIDIA(英伟达)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include MTP layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens followed by post-training using supervised fine tuning (SFT) and reinforcement learning (RL). The final model supports up to 1M context length and achieves comparable accuracy on common benchmarks, while also achieving up to 2.2x and 7.5x higher inference throughput compared to GPT-OSS-120B and Qwen3.5-122B, respectively. Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on HuggingFace.

[NLP-48] Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness ACL2026

【速读】: 该论文试图解决的问题是:大型语言模型是否具备类似人类的内省能力,即能否通过自身内部状态(hidden states)获得关于答案正确性的特权知识(privileged knowledge),这种知识无法通过外部观察获取。解决方案的关键在于设计了一种对比实验,训练正确性分类器分别基于模型自身的隐藏状态表示(self-representations)和外部模型的表示(peer representations),并在标准评估和分歧子集(disagreement subsets)上进行测试。结果表明,在整体表现上无显著优势,但在模型间存在预测分歧的子集中,模型在事实类任务中表现出明显的自代表优势,且该优势随网络层加深逐步显现,暗示了模型特定记忆检索机制的存在;而在数学推理任务中则未发现此类优势,揭示了不同任务领域中“特权知识”的异质性。

链接: https://arxiv.org/abs/2604.12373
作者: Tomer Ashuach,Liat Ein-Dor,Shai Gretz,Yoav Katz,Yonatan Belinkov
机构: Technion – Israel Institute of Technology (以色列理工学院); IBM Research (IBM研究院)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 (Main Conference). 8 pages, 16 figures, 2 tables

点击查看摘要

Abstract:Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness, information unavailable through external observation. We train correctness classifiers on question representations from both a model’s own hidden states and external models, testing whether self-representations provide a performance advantage. On standard evaluation, we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize this is due to high inter-model agreement of answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here, we discover domain-specific privileged knowledge: self-representations consistently outperform peer representations in factual knowledge tasks, but show no advantage in math reasoning. We further localize this domain asymmetry across model layers, finding that the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.

[NLP-49] Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors ACL2026

【速读】: 该论文旨在解决安全对齐的大语言模型(Safety-aligned Large Language Models, LLMs)在实际部署中因供应链攻击而引入后门(backdoor)的风险问题,尤其关注现有基于权重编辑的后门攻击方法在触发后难以维持有害输出可靠性的问题。解决方案的关键在于将后门目标从表面token层面转移到模型内部表示(internal representations),通过提取区分合规响应与拒绝行为的“引导向量”(steering vector),并将其编译为仅在触发器存在时激活的持久化权重修改;同时引入零空间约束(null-space constraint)以确保干净输入下的隐蔽性和正常功能不受影响,从而实现高效、隐蔽且高成功率的攻击。

链接: https://arxiv.org/abs/2604.12359
作者: Rui Yin,Tianxu Han,Naen Xu,Changjiang Li,Ping He,Chunyi Zhou,Jun Wang,Zhihui Fu,Tianyu Du,Jinbao Li,Shouling Ji
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: ACL 2026 Main Conference

点击查看摘要

Abstract:Safety-aligned large language models (LLMs) are increasingly deployed in real-world pipelines, yet this deployment also enlarges the supply-chain attack surface: adversaries can distribute backdoored checkpoints that behave normally under standard evaluation but jailbreak when a hidden trigger is present. Recent post-hoc weight-editing methods offer an efficient approach to injecting such backdoors by directly modifying model weights to map a trigger to an attacker-specified response. However, existing methods typically optimize a token-level mapping that forces an affirmative prefix (e.g., ``Sure’'), which does not guarantee sustained harmful output – the model may begin with apparent agreement yet revert to safety-aligned refusal within a few decoding steps. We address this reliability gap by shifting the backdoor objective from surface tokens to internal representations. We extract a steering vector that captures the difference between compliant and refusal behaviors, and compile it into a persistent weight modification that activates only when the trigger is present. To preserve stealthiness and benign utility, we impose a null-space constraint so that the injected edit remains dormant on clean inputs. The method is efficient, requiring only a small set of examples and admitting a closed-form solution. Across multiple safety-aligned LLMs and jailbreak benchmarks, our method achieves high triggered attack success while maintaining non-triggered safety and general utility.
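摘要中的两个关键步骤——提取区分"合规/拒绝"行为的引导向量,以及用零空间约束保证编辑在干净输入上休眠——可以用如下低维玩具例子示意几何直觉(真实方法作用于高维权重矩阵并有闭式解,此处仅为本文的简化假设)。

```python
def mean_vec(vectors):
    """逐维求均值。"""
    n, dim = len(vectors), len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]

def steering_vector(compliant_acts, refusal_acts):
    """引导向量 = 合规激活均值 - 拒绝激活均值(行为差方向)。"""
    c, r = mean_vec(compliant_acts), mean_vec(refusal_acts)
    return [a - b for a, b in zip(c, r)]

def project_out(v, clean_dir):
    """零空间约束的一维示意:去掉 v 在干净输入方向上的分量,
    使注入的编辑在干净输入上保持休眠。"""
    dot = sum(a * b for a, b in zip(v, clean_dir))
    norm2 = sum(b * b for b in clean_dir)
    return [a - dot / norm2 * b for a, b in zip(v, clean_dir)]

v = steering_vector([[2.0, 0.0]], [[0.0, 2.0]])  # 行为差方向
v_null = project_out(v, [1.0, 0.0])              # 与干净方向正交
```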

[NLP-50] MultiDocFusion: Hierarchical and Multimodal Chunking Pipeline for Enhanced RAG on Long Industrial Documents

【速读】: 该论文旨在解决传统文本分块(text chunking)方法在处理长篇工业文档时因忽略复杂文档结构而导致的信息丢失和问答质量下降的问题。其解决方案的关键在于提出MultiDocFusion,一个融合视觉文档解析、光学字符识别(OCR)、基于大语言模型(LLM)的文档结构层次解析(DSHP-LLM)以及深度优先搜索(DFS)分组策略的多模态分块流水线,通过显式建模文档层级结构来提升检索增强生成(RAG)问答系统的准确性与鲁棒性。

链接: https://arxiv.org/abs/2604.12352
作者: Joongmin Shin,Chanjun Park,Jeongbae Park,Jaehyung Seo,Heuiseok Lim
机构: Korea University (韩国大学); Soongsil University (中央大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:RAG-based QA has emerged as a powerful method for processing long industrial documents. However, conventional text chunking approaches often neglect complex and long industrial document structures, causing information loss and reduced answer quality. To address this, we introduce MultiDocFusion, a multimodal chunking pipeline that integrates: (i) detection of document regions using vision-based document parsing, (ii) text extraction from these regions via OCR, (iii) reconstruction of document structure into a hierarchical tree using large language model (LLM)-based document section hierarchical parsing (DSHP-LLM), and (iv) construction of hierarchical chunks through DFS-based grouping. Extensive experiments across industrial benchmarks demonstrate that MultiDocFusion improves retrieval precision by 8-15% and ANLS QA scores by 2-3% compared to baselines, emphasizing the critical role of explicitly leveraging document hierarchy for multimodal document-based QA. These significant performance gains underscore the necessity of structure-aware chunking in enhancing the fidelity of RAG-based QA systems.
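流水线第 (iv) 步"基于 DFS 的分组"可以示意为:沿文档层级树深度优先遍历,为每个文本块附上其章节路径,并按长度上限切分。以下为纯 Python 草图,`max_len` 与路径拼接格式均为本文假设。

```python
def dfs_chunks(node, max_len=80, path=()):
    """沿层级树做 DFS 分组:每个块携带章节路径,超长文本按上限切分(示意)。"""
    chunks = []
    here = path + (node["title"],)
    for text in node.get("texts", []):
        for i in range(0, len(text), max_len):
            chunks.append((" > ".join(here), text[i:i + max_len]))
    for child in node.get("children", []):
        chunks.extend(dfs_chunks(child, max_len, here))
    return chunks

doc = {"title": "手册", "children": [
    {"title": "安装", "texts": ["先下载安装包。"]},
    {"title": "使用", "texts": ["运行主程序即可。"]},
]}
chunks = dfs_chunks(doc)
```

带章节路径的块使检索器能区分不同章节中的相似文本,这正是结构感知分块的收益来源。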

[NLP-51] ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection ACL2026

【速读】: 该论文旨在解决中文有毒内容检测中现有方法在句子级别分类时难以提供可读且连续的有毒证据片段(toxic evidence spans)的问题。解决方案的关键在于提出一种面向可解释性的方法 ToxiTrace,其核心由三个组件构成:(1) CuSA(Contextualized Saliency Adjustment),通过轻量级大语言模型(LLM)引导将编码器生成的显著性线索细化为细粒度的有毒片段;(2) GCLoss(Gradient-Constrained Loss),通过梯度约束目标使词级别的显著性集中于有毒证据并抑制无关激活;(3) ARCL(Adaptive Reasoning Contrastive Learning),构建样本特定的对比推理对以增强有毒与非有毒内容之间的语义边界。该方案在保持基于 BERT 类编码器高效推理的同时,显著提升了分类准确率和有毒片段提取能力,并生成更连贯、人类可读的解释。

链接: https://arxiv.org/abs/2604.12321
作者: Boyang Li,Hongzhe Shou,Yuanyuan Liang,Jingbin Zhang,Fang Zhou
机构: East China Normal University (华东师范大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 Findings

点击查看摘要

Abstract:Existing Chinese toxic content detection methods mainly target sentence-level classification but often fail to provide readable and contiguous toxic evidence spans. We propose ToxiTrace, an explainability-oriented method for BERT-style encoders with three components: (1) CuSA, which refines encoder-derived saliency cues into fine-grained toxic spans with lightweight LLM guidance; (2) GCLoss, a gradient-constrained objective that concentrates token-level saliency on toxic evidence while suppressing irrelevant activations; and (3) ARCL, which constructs sample-specific contrastive reasoning pairs to sharpen the semantic boundary between toxic and non-toxic content. Experiments show that ToxiTrace improves classification accuracy and toxic span extraction while preserving efficient encoder-based inference and producing more coherent, human-readable explanations. We have released the model at this https URL.
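
GCLoss"将显著性集中于有毒证据"这一目标,可以用一个高度简化的静态示意来说明:把 token 显著性中落在标注有毒片段之外的质量作为惩罚项。真实方法作用于训练中的梯度,此处仅演示约束的方向;数值与系数均为假设:

```python
def concentration_penalty(saliency, toxic_mask, lam=0.5):
    """高度简化的示意:惩罚落在有毒证据片段之外的显著性质量(lam 为假设系数)。"""
    off_span = sum(s for s, m in zip(saliency, toxic_mask) if m == 0)
    return lam * off_span

# 5 个 token,仅第 2、3 个 token 属于标注的有毒片段(数值纯属虚构)
saliency   = [0.05, 0.60, 0.25, 0.05, 0.05]
toxic_mask = [0,    1,    1,    0,    0]

penalty = concentration_penalty(saliency, toxic_mask)
print(penalty)   # 约 0.075:显著性越集中在片段内,惩罚越小
```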

[NLP-52] CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

【速读】: 该论文旨在解决企业在部署生成式 AI(Generative AI)任务导向代理时,如何可靠地检测和定位多轮对话中违反特定领域操作指南的行为问题。当前依赖大语言模型(Large Language Models, LLMs)作为评判者(LLM-as-a-Judge)的方法在实际应用中的可靠性尚未得到充分验证,主要受限于缺乏系统化的数据生成方法以及高质量标注数据的稀缺性。论文提出的关键解决方案是构建一个名为 CompliBench 的基准测试平台,并开发了一套可扩展的自动化数据生成流水线:通过可控的缺陷注入机制自动产生精确的违规标签(包括违反的具体指南和对话轮次),并结合对抗搜索策略确保引入的扰动具有高度挑战性。该方案显著提升了小规模判别模型在复杂合规场景下的性能表现,甚至优于主流商用大模型,且具备跨业务领域的泛化能力,为训练鲁棒的生成式奖励模型提供了有效基础。

链接: https://arxiv.org/abs/2604.12312
作者: Jingbo Yang,Guanyu Yao,Bairu Hou,Xinghan Yang,Nikolai Glushnev,Iwona Bialynicka-Birula,Duo Ding,Shiyu Chang
机构: University of California, Santa Barbara; Cresta
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) are increasingly deployed as task-oriented agents in enterprise environments, ensuring their strict adherence to complex, domain-specific operational guidelines is critical. While utilizing an LLM-as-a-Judge is a promising solution for scalable evaluation, the reliability of these judges in detecting specific policy violations remains largely unexplored. This gap is primarily due to the lack of a systematic data generation method, which has been hindered by the extensive cost of fine-grained human annotation and the difficulty of synthesizing realistic agent violations. In this paper, we introduce CompliBench, a novel benchmark designed to evaluate the ability of LLM judges to detect and localize guideline violations in multi-turn dialogues. To overcome data scarcity, we develop a scalable, automated data generation pipeline that simulates user-agent interactions. Our controllable flaw injection process automatically yields precise ground-truth labels for the violated guideline and the exact conversation turn, while an adversarial search method ensures these introduced perturbations are highly challenging. Our comprehensive evaluation reveals that current state-of-the-art proprietary LLMs struggle significantly with this task. In addition, we demonstrate that a small-scale judge model fine-tuned on our synthesized data outperforms leading LLMs and generalizes well to unseen business domains, highlighting our pipeline as an effective foundation for training robust generative reward models.
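
其中"可控缺陷注入"可以这样理解:在一段本来合规的对话的某一轮替换为违反特定指南的发言,并在注入的同时自动记录被违反的指南编号与确切轮次,作为精确的真值标签。下面是一个纯示意实现(对话内容与指南编号均为虚构):

```python
import copy

def inject_violation(dialogue, turn_idx, violating_text, guideline_id):
    """在指定轮次替换为违规发言,并同时产出精确的真值标签(示意实现)。"""
    flawed = copy.deepcopy(dialogue)           # 不破坏原始合规对话
    flawed[turn_idx]["text"] = violating_text
    label = {"guideline": guideline_id, "turn": turn_idx}
    return flawed, label

dialogue = [
    {"role": "user", "text": "我想退订服务"},
    {"role": "agent", "text": "好的,我先核实您的账户信息"},
    {"role": "agent", "text": "已为您办理退订"},
]

flawed, label = inject_violation(
    dialogue, turn_idx=1,
    violating_text="直接退订即可,无需核实身份",   # 违反"先验证身份"指南(虚构编号)
    guideline_id="G-07",
)
print(label)
```

因为标签由注入过程自动产生,评判模型(LLM judge)的检测与定位结果可以与之逐轮比对打分。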

[NLP-53] ContextLens: Modeling Imperfect Privacy and Safety Context for Legal Compliance ACL26

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在实际应用中面临的法律合规性评估难题,尤其是在数据隐私与安全(data privacy and AI safety)领域,现有方法通常假设输入上下文完整清晰,而现实场景中的上下文往往模糊且不完整。解决方案的关键在于提出一种名为 ContextLens 的半规则框架,该框架利用大语言模型(LLM)将输入上下文锚定在法律领域,并通过设计的一系列结构化问题来显式识别已知和未知的合规因素,从而在无需训练的情况下提升 LLM 对 GDPR 和欧盟人工智能法案(EU AI Act)等法规的合规评估能力,同时能够进一步识别出模糊和缺失的信息。

链接: https://arxiv.org/abs/2604.12308
作者: Haoran Li,Yulin Chen,Huihao Jing,Wenbin Hu,Tsz Ho Li,Chanhou Lou,Hong Ting Tsang,Sirui Han,Yangqiu Song
机构: Beihang University (北京航空航天大学); HKUST (香港科技大学); National University of Singapore (新加坡国立大学); Faculty of Law, University of Macau (澳门大学法学院)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 26

点击查看摘要

Abstract:Individuals’ concerns about data privacy and AI safety are highly contextualized and extend beyond sensitive patterns. Addressing these issues requires reasoning about the context to identify and mitigate potential risks. Though researchers have widely explored using large language models (LLMs) as evaluators for contextualized safety and privacy assessments, these efforts typically assume the availability of complete and clear context, whereas real-world contexts tend to be ambiguous and incomplete. In this paper, we propose ContextLens, a semi-rule-based framework that leverages LLMs to ground the input context in the legal domain and explicitly identify both known and unknown factors for legal compliance. Instead of directly assessing safety outcomes, our ContextLens instructs LLMs to answer a set of crafted questions that span over applicability, general principles and detailed provisions to assess compliance with pre-defined priorities and rules. We conduct extensive experiments on existing compliance benchmarks that cover the General Data Protection Regulation (GDPR) and the EU AI Act. The results suggest that our ContextLens can significantly improve LLMs’ compliance assessment and surpass existing baselines without any training. Additionally, our ContextLens can further identify the ambiguous and missing factors.

[NLP-54] Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)代理评估基准普遍局限于二元通过/失败任务(如代码生成或基于搜索的问题回答),而忽视了真实工程实践中通过迭代优化可行设计方案所体现的价值这一问题。其解决方案的关键在于提出 Frontier-Eng,一个由人类验证的生成式优化(Generative Optimization)基准,涵盖47个跨五大工程类别的任务,所有任务均基于工业级仿真器和验证器,提供连续奖励信号并强制执行硬性可行性约束,在固定交互预算下模拟 agent 的“提议-执行-评估”迭代循环。该设计使评估更贴近实际工程场景,并揭示出改进频率与幅度均服从幂律衰减规律,同时表明在有限预算下深度探索仍比宽度并行更具价值。

链接: https://arxiv.org/abs/2604.12290
作者: Yizhe Chi,Deyao Hong,Dapeng Jiang,Tianwei Luo,Kaisen Yang,Boshi Zhang,Zhe Cao,Xiaoyan Fan,Bingxiang He,Han Hao,Weiyang Jin,Dianqiao Lei,Qingle Liu,Houde Qian,Bowen Wang,Situ Wang,Youjie Zheng,Yifan Zhou,Calvin Xiao,Eren Cai,Qinhuai Na
机构: Einsia.AI
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current LLM agent benchmarks, which predominantly focus on binary pass/fail tasks such as code generation or search-based question answering, often neglect the value of real-world engineering that is often captured through the iterative optimization of feasible designs. To this end, we introduce Frontier-Eng, a human-verified benchmark for generative optimization – an iterative propose-execute-evaluate loop in which an agent generates candidate artifacts, receives executable verifier feedback, and revises them under a fixed interaction budget – spanning 47 tasks across five broad engineering categories. Unlike previous suites, Frontier-Eng tasks are grounded in industrial-grade simulators and verifiers that provide continuous reward signals and enforce hard feasibility constraints under constrained budgets. We evaluate eight frontier language models using representative search frameworks, finding that while Claude 4.6 Opus achieves the most robust performance, the benchmark remains challenging for all models. Our analysis suggests a dual power-law decay in improvement frequency ( \sim 1/iteration) and magnitude ( \sim 1/improvement count). We further show that although width improves parallelism and diversity, depth remains crucial for hard-won improvements under a fixed budget. Frontier-Eng establishes a new standard for assessing the capacity of AI agents to integrate domain knowledge with executable feedback to solve complex, open-ended engineering problems.
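
摘要中"提议-执行-评估"的迭代优化循环可以用下面的极简骨架示意:在固定交互预算内,代理提出候选设计,由验证器给出连续奖励并检查硬性可行性约束,仅保留可行且更优的方案。验证器与提议函数均为假设的玩具实现:

```python
def optimize(propose, evaluate, feasible, budget, init):
    """propose-execute-evaluate 循环的玩具骨架:固定预算内保留可行最优解。"""
    best, best_reward = init, evaluate(init)
    for step in range(budget):
        candidate = propose(best, step)
        if not feasible(candidate):        # 硬性可行性约束
            continue
        reward = evaluate(candidate)       # 连续奖励信号
        if reward > best_reward:
            best, best_reward = candidate, reward
    return best, best_reward

# 玩具任务:在 x <= 4 的约束下最大化 -(x-3)^2
best, reward = optimize(
    propose=lambda x, step: x + 0.5,       # 简单地逐步右移当前最优解
    evaluate=lambda x: -(x - 3) ** 2,
    feasible=lambda x: x <= 4,
    budget=10,
    init=0.0,
)
print(best, reward)
```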

[NLP-55] The Enforcement and Feasibility of Hate Speech Moderation on Twitter

【速读】: 该论文旨在解决在线仇恨言论(hate speech)平台治理中两个核心问题:一是社交媒体平台在执行仇恨言论政策方面的不一致性,二是大规模内容审核的可行性。研究通过在全球范围内对推特(现为X)进行审计,采集包含54万条跨八种主要语言的标注数据,发现五个月后仍有80%的仇恨言论仍在线存在,且其被删除的概率并不高于非仇恨内容,无论其严重性或可见性如何。关键解决方案在于构建一个人工智能与人工审核相结合的混合审核流程(human-AI moderation pipeline),其中自动化系统虽无法完全准确识别仇恨言论(易产生误报),但能有效筛选高可能性违规内容供人工复核;模拟结果表明,这种协同机制可在低于现行监管罚款成本的前提下显著降低用户接触仇恨言论的风险,从而揭示了当前仇恨言论持续存在的根本原因并非技术限制,而是平台资源分配的制度性选择。

链接: https://arxiv.org/abs/2604.12289
作者: Manuel Tonneau,Dylan Thurgood,Diyi Liu,Niyati Malhotra,Victor Orozco-Olvera,Ralph Schroeder,Scott A. Hale,Manoel Horta Ribeiro,Paul Röttger,Samuel P. Fraiberger
机构: University of Oxford (牛津大学); University of Copenhagen (哥本哈根大学); World Bank (世界银行); Princeton University (普林斯顿大学)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Online hate speech is associated with substantial social harms, yet it remains unclear how consistently platforms enforce hate speech policies or whether enforcement is feasible at scale. We address these questions through a global audit of hate speech moderation on Twitter (now X). Using a complete 24-hour snapshot of public tweets, we construct representative samples comprising 540,000 tweets annotated for hate speech by trained annotators across eight major languages. Five months after posting, 80% of hateful tweets remain online, including explicitly violent hate speech. Such tweets are no more likely to be removed than non-hateful tweets, with neither severity nor visibility increasing the likelihood of removal. We then examine whether these enforcement gaps reflect technical limits of large-scale moderation systems. While fully automated detection systems cannot reliably identify hate speech without generating large numbers of false positives, they effectively prioritize likely violations for human review. Simulations of a human-AI moderation pipeline indicate that substantially reducing user exposure to hate speech is economically feasible at a cost below existing regulatory penalties. These results suggest that the persistence of online hate cannot be explained by technical constraints alone but also reflects institutional choices in the allocation of moderation resources.
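
摘要所述"自动系统优先排序 + 人工复核"混合流水线的成本与曝光削减核算,可以用一个极简模拟说明(分数与成本均为虚构数值,仅演示算账方式,并非论文的模拟参数):

```python
# 极简模拟:AI 按违规概率排序,人工只复核得分最高的前 k 条
# (分数与成本均为虚构示例,仅演示混合流水线的核算方式)

posts = [                      # (帖子 id, AI 给出的违规概率, 是否真为仇恨言论)
    ("a", 0.95, True), ("b", 0.90, False), ("c", 0.80, True),
    ("d", 0.40, True), ("e", 0.10, False), ("f", 0.05, False),
]
review_budget = 3              # 人工复核条数
cost_per_review = 0.5          # 每条人工复核成本(虚构单位)

ranked = sorted(posts, key=lambda p: p[1], reverse=True)
reviewed = ranked[:review_budget]
removed = [p for p in reviewed if p[2]]          # 人工确认后下架的真阳性
total_hate = sum(1 for p in posts if p[2])

exposure_reduction = len(removed) / total_hate   # 曝光削减比例
total_cost = review_budget * cost_per_review
print(exposure_reduction, total_cost)
```

即使 AI 打分不够精确而无法全自动删除,它仍能把有限的人工预算集中到最可能违规的内容上,这正是摘要中"有效筛选高可能性违规内容供人工复核"的含义。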

[NLP-56] Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning ACL2026

【速读】: 该论文旨在解决现有基于大语言模型(Large Language Models, LLMs)的电子表格理解方法在实际应用中面临的两大挑战:一是将表格视为纯文本,忽略了关键的布局线索和视觉语义;二是真实世界电子表格规模庞大,超出LLMs的有效输入长度限制。解决方案的关键在于提出一种两阶段多智能体框架——SpreadsheetAgent,采用逐步读取与推理范式,通过代码执行结果、图像和LaTeX表格等多种模态增量式解析局部区域,首先构建结构草图和行列摘要,再在求解阶段对中间表示进行任务驱动推理,并引入验证模块通过针对性检查确保结构提取的可靠性,从而有效降低误差传播并提升下游推理的可信度。

链接: https://arxiv.org/abs/2604.12282
作者: Houxing Ren,Mingjie Zhan,Zimu Lu,Ke Wang,Yunqiao Yang,Haotian Hou,Hongsheng Li
机构: CUHK MMLab (香港中文大学多媒体实验室); SenseTime Research (商汤科技研究院); Shenzhen Loop Area Institute (深圳河套研究院); CPII under InnoHK (InnoHK 下的感知与交互智能中心)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 (main conference)

点击查看摘要

Abstract:Spreadsheets are central to real-world applications such as enterprise reporting, auditing, and scientific data management. Despite their ubiquity, existing large language model based approaches typically treat tables as plain text, overlooking critical layout cues and visual semantics. Moreover, real-world spreadsheets are often massive in scale, exceeding the input length that LLMs can efficiently process. To address these challenges, we propose SpreadsheetAgent, a two-stage multi-agent framework for spreadsheet understanding that adopts a step-by-step reading and reasoning paradigm. Instead of loading the entire spreadsheet at once, SpreadsheetAgent incrementally interprets localized regions through multiple modalities, including code execution results, images, and LaTeX tables. The method first constructs a structural sketch and row/column summaries, and then performs task-driven reasoning over this intermediate representation in the Solving Stage. To further enhance reliability, we design a verification module that validates extracted structures via targeted inspections, reducing error propagation and ensuring trustworthy inputs for downstream reasoning. Extensive experiments on two spreadsheet datasets demonstrate the effectiveness of our approach. With GPT-OSS-120B, SpreadsheetAgent achieves 38.16% on Spreadsheet Bench, outperforming the ChatGPT Agent baseline (35.27%) by 2.89 absolute points. These results highlight the potential of SpreadsheetAgent to advance robust and scalable spreadsheet understanding in real-world applications. Code is available at this https URL.

[NLP-57] CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成代码时对程序意图行为理解不足的问题,尤其关注其是否能准确捕捉并形式化表达程序的预期行为。现有研究在规范生成任务中受限于评估方法、任务设定和规范表达能力,难以真实反映模型对程序语义的理解深度。为此,作者提出CodeSpecBench基准测试平台,采用基于执行的评估协议,支持函数级与仓库级任务,并将规范编码为可执行的Python函数,从而实现对规范正确性(accepting valid behaviors)和完备性(rejecting invalid behaviors)的量化评估。该方案的关键在于引入可执行的行为规范(executable behavioral specifications)作为评估标准,使LLMs的规范生成能力得以在真实代码场景下被系统性地测量,揭示出当前主流模型在复杂任务中的显著性能下降(如仓库级任务仅20.2%通过率),并表明规范生成比代码生成更具挑战性,强调了模型对程序语义深层理解的重要性。

链接: https://arxiv.org/abs/2604.12268
作者: Zaoyu Chen,Jianbo Dai,Boyu Zhu,Jingdong Wang,Huiming Wang,Xin Xu,Haoyang Yuan,Zhijiang Guo,Xiao-Ming Wu
机构: The Hong Kong Polytechnic University (香港理工大学); University of Edinburgh (爱丁堡大学); UCL (伦敦大学学院); CUHK (SZ) (香港中文大学(深圳)); SUTD (新加坡科技设计大学); HKUST (香港科技大学); CUHK (香港中文大学); HKUST (GZ) (香港科技大学(广州))
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) can generate code from natural language, but the extent to which they capture intended program behavior remains unclear. Executable behavioral specifications, defined via preconditions and postconditions, provide a concrete means to assess such understanding. However, existing work on specification generation is constrained in evaluation methodology, task settings, and specification expressiveness. We introduce CodeSpecBench, a benchmark for executable behavioral specification generation under an execution-based evaluation protocol. CodeSpecBench supports both function-level and repository-level tasks and encodes specifications as executable Python functions. Constructed from diverse real-world codebases, it enables a realistic assessment of both correctness (accepting valid behaviors) and completeness (rejecting invalid behaviors). Evaluating 15 state-of-the-art LLMs on CodeSpecBench, we observe a sharp performance degradation on repository-level tasks, where the best model attains only a 20.2% pass rate. We further find that specification generation is substantially more challenging than code generation, indicating that strong coding performance does not necessarily reflect deep understanding of intended program semantics. Our data and code are available at this https URL.
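
摘要提到规范以可执行 Python 函数的形式编码为前置/后置条件。下面用一个虚构的小例子示意这种"可执行行为规范"如何同时检验正确性(接受合法行为)与完备性(拒绝不合法行为);规范内容与被测实现均为本文假设:

```python
# 示意:把"行为规范"编码为可执行的 Python 前置/后置条件
# (规范与被测实现均为虚构示例,非 CodeSpecBench 实际任务)

def precondition(xs):
    return isinstance(xs, list) and all(isinstance(x, int) for x in xs)

def postcondition(xs, result):
    return sorted(xs) == result        # 规范:输出是输入的升序排列

def check_spec(impl, inputs):
    """规范既要接受合法行为,也要拒绝不合法行为。"""
    for xs in inputs:
        if not precondition(xs):
            return False
        if not postcondition(xs, impl(xs)):
            return False
    return True

good_impl = sorted
bad_impl = lambda xs: xs               # 不排序,违反后置条件

cases = [[3, 1, 2], [5, 4]]
print(check_spec(good_impl, cases))    # True
print(check_spec(bad_impl, cases))     # False
```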

[NLP-58] CascadeDebate: Multi-Agent Deliberation for Cost-Aware LLM Cascades

【速读】: 该论文旨在解决级联大语言模型(Cascaded LLM)系统在面对模糊查询时,因单模型层级判断不确定性导致过早升级至高成本模型或人工专家的问题,从而造成计算资源浪费与效率低下。其解决方案的关键在于在每一层级的升级边界处引入多智能体辩论机制(multi-agent deliberation),通过基于置信度的路由策略仅对不确定样本激活轻量级代理集合,实现内部共识驱动的歧义消解,避免不必要的升级;同时采用统一架构交替执行单模型推理与选择性多智能体讨论,动态调整测试时计算开销,最终以人类专家作为兜底方案,显著提升准确率并增强对真实分布的弹性适应能力。

链接: https://arxiv.org/abs/2604.12262
作者: Raeyoung Chang,Dongwook Kwon,Jisoo Lee,Nikhil Verma
机构: Sogang University (祥明大学); Kwangwoon University (光云大学); Seoul National University (首尔国立大学); LG Electronics, Toronto AI Lab (LG电子多伦多人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures, 4 tables, 1 algorithm

点击查看摘要

Abstract:Cascaded LLM systems coordinate models of varying sizes with human experts to balance accuracy, cost, and abstention under uncertainty. However, single-model tiers at each stage often struggle with ambiguous queries, triggering premature escalations to costlier models or experts due to under-confidence and inefficient compute scaling. CascadeDebate addresses this gap by inserting multi-agent deliberation directly at each tier’s escalation boundary. Confidence-based routers activate lightweight agent ensembles only for uncertain cases, enabling consensus-driven resolution of ambiguities internally without invoking higher-cost upgrades. Our unified architecture alternates single-model inference with selective multi-agent deliberation across model scales, culminating in human experts as the final fallback. This design scales test-time compute dynamically according to query difficulty. Across five benchmarks spanning science, medicine, and general knowledge, CascadeDebate outperforms strong single-model cascades and standalone multi-agent systems by up to 26.75 percent. An online threshold optimizer proves essential, boosting accuracy by 20.98 to 52.33 percent relative improvement over fixed policies and enabling elastic adaptation to real-world distributions.
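
其中"基于置信度的路由"可以用下面的玩具示意理解:仅当小模型置信度低于阈值时才触发多智能体辩论,辩论仍无共识才升级到更大模型或专家。各层模型与阈值均为本文假设的占位实现:

```python
def route(query, small_model, debate, escalate, tau=0.8):
    """置信度路由玩具示意:低置信触发辩论,无共识再升级(tau 为假设阈值)。"""
    answer, conf = small_model(query)
    if conf >= tau:                     # 置信度足够:直接返回,节省算力
        return answer, "small"
    consensus = debate(query)           # 多智能体辩论尝试内部消歧
    if consensus is not None:
        return consensus, "debate"
    return escalate(query), "large"     # 仍不确定:升级到更大模型/专家

# 虚构的各层实现,仅为演示三条路径
small = lambda q: (("42", 0.9) if q == "easy" else ("?", 0.3))
debate_fn = lambda q: ("43" if q == "medium" else None)
large = lambda q: "44"

print(route("easy", small, debate_fn, large))    # 小模型直接作答
print(route("medium", small, debate_fn, large))  # 辩论达成共识,避免升级
print(route("hard", small, debate_fn, large))    # 升级到大模型
```

这正对应摘要中"按查询难度动态扩展测试时计算"的设计。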

[NLP-59] Coding-Free and Privacy-Preserving MCP Framework for Clinical Agentic Research Intelligence System

【速读】: 该论文旨在解决临床研究中因流程繁琐、数据敏感性高及对专业知识与编程技能要求严格而导致的研究门槛过高问题,尤其限制了临床医生和外部研究人员开展数据驱动型研究的能力。其解决方案的关键在于构建一个名为Clinical Agentic Research Intelligence System (CARIS) 的智能系统,该系统通过集成大型语言模型(Large Language Models, LLMs)与模块化工具,并借助Model Context Protocol (MCP) 实现自然语言驱动的自动化工作流编排,从而在不直接访问原始数据的前提下完成从研究规划到报告生成的全流程任务,同时保障数据隐私并支持人机协同迭代优化。

链接: https://arxiv.org/abs/2604.12258
作者: Taehun Kim,Hyeryun Park,Hyeonhoon Lee,Yushin Lee,Kyungsang Kim,Hyung-Chul Lee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, 2 tables, Supplementary Appendix

点击查看摘要

Abstract:Clinical research involves labor-intensive processes such as study design, cohort construction, model development, and documentation, requiring domain expertise, programming skills, and access to sensitive patient data. These demands create barriers for clinicians and external researchers conducting data-driven studies. To overcome these limitations, we developed a Clinical Agentic Research Intelligence System (CARIS) that automates the clinical research workflow while preserving data privacy, enabling comprehensive studies without direct access to raw data. CARIS integrates Large Language Models (LLMs) with modular tools via the Model Context Protocol (MCP), enabling natural language-driven orchestration of appropriate tools. Databases remain securely within the MCP server, and users access only the outputs and final research reports. Based on user intent, CARIS automatically executes the full pipeline: research planning, literature search, cohort construction, Institutional Review Board (IRB) documentation, Vibe Machine Learning (ML), and report generation, with iterative human-in-the-loop refinement. We evaluated CARIS on three heterogeneous datasets with distinct clinical tasks. Research plans and IRB documents were finalized within three to four iterations, using evidence from literature and data. The system supported Vibe ML by exploring feature-model combinations, ranking the top ten models, and generating performance visualizations. Final reports showed high completeness based on a checklist derived from the TRIPOD+AI framework, achieving 96% coverage in LLM evaluation and 82% in human evaluation. CARIS demonstrates that agentic AI can transform clinical hypotheses into executable research workflows across heterogeneous datasets. By eliminating the need for coding and direct data access, the system lowers barriers and bridges public and private clinical data environments.

[NLP-60] SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration ACL2026

【速读】: 该论文旨在解决自 Draft(self-draft)推理中因浅层预测过自信导致错误采纳以及困难 token 引发冗余计算的问题,从而提升生成式 AI(Generative AI)在大语言模型(LLM)中的推理效率。解决方案的关键在于:通过分层温度退火(layer-wise temperature annealing)抑制早期退出决策中的虚假置信度,并基于 token 级别解码难度自适应地限制推测长度;同时,在统一并行传递中重处理 draft token 的隐藏状态,确保输出与原模型完全一致的同时最大化计算效率,且无需修改基础模型参数。

链接: https://arxiv.org/abs/2604.12247
作者: Zhuofan Wen,Yang Feng
机构: Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); University of Chinese Academy of Sciences, Beijing, China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ACL 2026 Findings

点击查看摘要

Abstract:Speculative decoding has emerged as a promising approach to accelerate autoregressive inference in large language models (LLMs). Self-draft methods, which leverage the base LLM itself for speculation, avoid the overhead of auxiliary draft models but face limitations: shallow layers often produce overconfident yet incorrect token predictions, and the presence of difficult tokens in a draft sequence forces redundant computation through deeper layers, undermining both draft acceptance and overall speedup. To address these issues, we propose a novel self-draft framework that suppresses spurious confidence via layer-wise temperature annealing in early-exit decision and adaptively bounds speculation length based on token-wise decoding difficulty. By reprocessing the hidden states of draft tokens in a unified parallel pass through deep layers, our method maintains exact output equivalence with the original model while maximizing computational efficiency. It requires no modifications to the base LLM parameters and achieves up to 2.33x wall-time speedup over standard autoregressive decoding across diverse long-form generation tasks and multiple model architectures.
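
摘要中"分层温度退火抑制浅层过自信"的思路可以示意如下:对第 l 层的退出 logits 先除以随深度递减的温度再做 softmax,浅层温度更高、置信被压低,从而更难触发早退。温度调度与阈值均为本文假设,并非论文实际超参:

```python
import math

def softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # 减去最大值,数值稳定
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def should_exit(logits, layer, num_layers, threshold=0.9):
    """分层温度退火示意:浅层温度高 -> 置信被压低 -> 更难早退。"""
    temperature = 1.0 + 2.0 * (1 - layer / num_layers)   # 假设的线性退火调度
    conf = max(softmax(logits, temperature))
    return conf >= threshold

logits = [4.0, 1.0, 0.0]       # 同一组退出 logits
print(should_exit(logits, layer=4,  num_layers=32))   # 浅层:置信被压低,不早退
print(should_exit(logits, layer=32, num_layers=32))   # 深层:允许早退
```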

[NLP-61] Continuous Knowledge Metabolism: Generating Scientific Hypotheses from Evolving Literature

【速读】: 该论文旨在解决科学假说生成中对知识演化动态跟踪不足的问题,即现有方法仅关注当前已知知识,而忽视了科学文献随时间演化的过程。其解决方案的核心是提出连续知识代谢(Continuous Knowledge Metabolism, CKM)框架,通过滑动时间窗口处理文献并增量更新结构化知识库,从而捕捉知识的演变轨迹。其中,CKM-Lite 实现高效预测覆盖,显著优于批量处理方式;而 CKM-Full 引入变化感知机制,能识别新发现为新颖、证实或矛盾类型,并据此调节假说生成策略,揭示出质量与覆盖率之间的权衡关系以及不同知识演化信号对预测性能的差异化影响。

链接: https://arxiv.org/abs/2604.12243
作者: Jinkai Tao,Yubo Wang,Xiaoyu Liu,Menglin Yang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 32 pages, 6 figures

点击查看摘要

Abstract:Scientific hypothesis generation requires tracking how knowledge evolves, not just what is currently known. We introduce Continuous Knowledge Metabolism (CKM), a framework that processes scientific literature through sliding time windows and incrementally updates a structured knowledge base as new findings arrive. We present CKM-Lite, an efficient variant that achieves strong predictive coverage through incremental accumulation, outperforming batch processing on hit rate (+2.8%, p=0.006), hypothesis yield (+3.6, p<0.001), and best-match alignment (+0.43, p<0.001) while reducing token cost by 92%. To understand what drives these differences, we develop CKM-Full, an instrumented variant that categorizes each new finding as novel, confirming, or contradicting, detects knowledge change signals, and conditions hypothesis generation on the full evolution trajectory. Analyzing 892 hypotheses generated by CKM-Full across 50 research topics, alongside parallel runs of the other variants, we report four empirical observations: (1) incremental processing outperforms batch baseline across predictive and efficiency metrics; (2) change-aware instrumentation is associated with higher LLM-judged novelty (Cohen's d=3.46) but lower predictive coverage, revealing a quality-coverage trade-off; (3) a field's trajectory stability is associated with hypothesis success (r=-0.28, p=0.051), suggesting boundary conditions for literature-based prediction; (4) knowledge convergence signals are associated with nearly 5x higher hit rate than contradiction signals, pointing to differential predictability across change types. These findings suggest that the character of generated hypotheses is shaped not only by how much literature is processed, but also by how it is processed. They further indicate that evaluation frameworks must account for the quality-coverage trade-off rather than optimize for a single metric.
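
"滑动时间窗口 + 增量更新知识库"的代谢过程可以用如下玩具示意理解:按时间窗口逐批读入新发现,与既有知识对比后标注为"新颖 / 证实 / 矛盾"并累积进知识库。分类规则是高度简化的假设,仅用于说明三类变化信号的来源:

```python
def metabolize(kb, findings):
    """增量代谢示意:按与既有知识的关系给每条新发现打标签并更新知识库。"""
    signals = []
    for claim, polarity in findings:          # polarity: 结论方向 +1 / -1
        if claim not in kb:
            signals.append((claim, "novel"))
        elif kb[claim] == polarity:
            signals.append((claim, "confirming"))
        else:
            signals.append((claim, "contradicting"))
        kb[claim] = polarity                   # 以最新结论更新知识库
    return signals

kb = {}
s1 = metabolize(kb, [("药物A有效", +1)])                       # 第 1 个时间窗口
s2 = metabolize(kb, [("药物A有效", +1), ("药物B有效", -1)])    # 第 2 个时间窗口
s3 = metabolize(kb, [("药物B有效", +1)])                       # 第 3 个时间窗口
print(s1, s2, s3)
```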

[NLP-62] MolMem: Memory-Augmented Agentic Reinforcement Learning for Sample-Efficient Molecular Optimization

【速读】: 该论文旨在解决药物发现中分子优化(molecular optimization)任务在有限评估次数(oracle budget)下的样本效率问题。现有方法要么依赖试错策略导致高成本,要么依赖外部知识而难以应对复杂目标,且普遍缺乏长期记忆机制以积累可复用的优化经验。其解决方案的关键在于提出 MolMem 框架,该框架采用双记忆系统:静态示例记忆(Static Exemplar Memory)用于冷启动阶段的决策锚定,演化技能记忆(Evolving Skill Memory)用于将成功优化轨迹提炼为可迁移的策略;通过密集的步骤奖励训练策略网络,使每次昂贵的探索过程转化为长期知识,从而显著提升后续优化的成功率和效率。

链接: https://arxiv.org/abs/2604.12237
作者: Ziqing Wang,Yibo Wen,Abhishek Pandy,Han Liu,Kaize Ding
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In drug discovery, molecular optimization aims to iteratively refine a lead compound to improve molecular properties while preserving structural similarity to the original molecule. However, each oracle evaluation is expensive, making sample efficiency a key challenge for existing methods under a limited oracle budget. Trial-and-error approaches require many oracle calls, while methods that leverage external knowledge tend to reuse familiar templates and struggle on challenging objectives. A key missing piece is long-term memory that can ground decisions and provide reusable insights for future optimizations. To address this, we present MolMem (Molecular optimization with Memory), a multi-turn agentic reinforcement learning (RL) framework with a dual-memory system. Specifically, MolMem uses Static Exemplar Memory to retrieve relevant exemplars for cold-start grounding, and Evolving Skill Memory to distill successful trajectories into reusable strategies. Built on this memory-augmented formulation, we train the policy with dense step-wise rewards, turning costly rollouts into long-term knowledge that improves future optimization. Extensive experiments show that MolMem achieves 90% success on single-property tasks (1.5x over the best baseline) and 52% on multi-property tasks using only 500 oracle calls. Our code is available at this https URL.
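
双记忆系统的分工可以用下面的玩具示意理解:静态示例记忆在冷启动时按任务检索相关范例,演化技能记忆则把成功轨迹提炼为可复用策略。示例中的记忆条目与字段名均为本文虚构:

```python
def retrieve_exemplars(static_memory, task, k=2):
    """冷启动:按任务标签从静态示例记忆中取最相关的 k 条(示意检索)。"""
    matches = [e for e in static_memory if e["task"] == task]
    return matches[:k]

def distill_skill(skill_memory, trajectory):
    """将一次成功的优化轨迹提炼为可复用策略,写入演化技能记忆(示意)。"""
    if trajectory["success"]:
        skill_memory.append({"task": trajectory["task"],
                             "strategy": trajectory["key_edit"]})
    return skill_memory

static_memory = [                       # 冷启动前预置的示例(虚构)
    {"task": "提高溶解度", "exemplar": "引入羟基"},
    {"task": "提高溶解度", "exemplar": "缩短疏水链"},
    {"task": "降低毒性",   "exemplar": "移除硝基"},
]
skill_memory = []

print(retrieve_exemplars(static_memory, "提高溶解度"))
skill_memory = distill_skill(skill_memory,
    {"task": "提高溶解度", "success": True, "key_edit": "引入羟基并保持骨架"})
print(skill_memory)
```

这样,每次昂贵的 oracle 评估轨迹不再用完即弃,而是沉淀为可供后续优化复用的长期知识。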

[NLP-63] HintMR: Eliciting Stronger Mathematical Reasoning in Small Language Models

【速读】: 该论文旨在解决小型语言模型(Small Language Models, SLMs)在复杂数学推理任务中表现不佳的问题,其核心挑战在于SLMs容量有限,难以维持长链的中间推理步骤且易受早期错误影响。解决方案的关键在于提出一种提示辅助推理框架(hint-assisted reasoning framework),通过一个由强大型语言模型蒸馏训练得到的独立SLM生成上下文感知的提示(hint),这些提示基于问题陈述和累积的推理历史条件生成,从而以分步、局部化的方式引导主推理SLM完成多步数学问题求解。该机制有效减少了误差传播,并使推理模型聚焦于可管理的子问题,实现了SLMs在保持计算效率的同时显著提升数学推理准确率。

链接: https://arxiv.org/abs/2604.12229
作者: Jawad Hossain,Xiangyu Guo,Jiawei Zhou,Chong Liu
机构: University at Albany (阿尔巴尼大学); University at Buffalo (布法罗大学); Stony Brook University (石溪大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 5 figures, Preprint

点击查看摘要

Abstract:Small language models (SLMs) often struggle with complex mathematical reasoning due to limited capacity to maintain long chains of intermediate steps and to recover from early errors. We address this challenge by introducing a hint-assisted reasoning framework that incrementally guides SLMs through multi-step mathematical problem solving. Our approach decomposes solutions into sequential reasoning steps and provides context-aware hints, where hints are generated by a separate SLM trained via distillation from a strong large language model. While the hint-generating SLM alone is not capable of solving the problems, its collaboration with a reasoning SLM enables effective guidance, forming a cooperative two-model system for reasoning. Each hint is generated conditionally on the problem statement and the accumulated reasoning history, providing stepwise, localized guidance without revealing full solutions. This reduces error propagation and allows the reasoning model to focus on manageable subproblems. Experiments across diverse mathematical benchmarks and models demonstrate that hint assistance consistently improves reasoning accuracy for SLMs, yielding substantial gains over standard prompting while preserving model efficiency. These results highlight that structured collaboration between SLMs-via hint generation and reasoning-offers an effective and lightweight mechanism for enhancing mathematical reasoning.
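
"提示模型 + 推理模型"的协作循环骨架可以示意如下:每一步先依据题面与累积的推理历史生成局部提示,再让推理模型在该提示下产出下一步,循环直至得到答案。两个模型均用可预测的玩具函数代替,仅演示循环结构:

```python
def solve(problem, hint_model, reason_model, max_steps=5):
    """提示辅助推理的玩具骨架:逐步生成提示并扩展推理历史。"""
    history = []
    for _ in range(max_steps):
        hint = hint_model(problem, history)        # 依题面+历史给局部提示
        step = reason_model(problem, history, hint)
        history.append(step)
        if step.startswith("答案:"):
            return step, history
    return None, history

# 玩具任务:计算 (2+3)*4;提示模型只会提醒"先算括号/再算乘法"
def hint_model(problem, history):
    return "先算括号内" if not history else "再乘以4"

def reason_model(problem, history, hint):
    if hint == "先算括号内":
        return "2+3=5"
    return "答案:5*4=20"

answer, history = solve("(2+3)*4", hint_model, reason_model)
print(answer)
```

每个提示只引导一小步而不泄露完整解法,这正对应摘要中"分步、局部化引导以减少误差传播"的设计。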

[NLP-64] Designing Reliable LLM-Assisted Rubric Scoring for Constructed Responses: Evidence from Physics Exams

【速读】: 该论文旨在解决STEM学科中学生手写作答(包含符号表达、计算和图表)在评分过程中存在的效率低、人工评分一致性差的问题,尤其在需要部分赋分时更为显著。研究通过使用GPT-4o模型进行AI辅助评分,探讨了评分规则设计(rubric design)与大语言模型(LLM)配置参数对评分可靠性的影响。其解决方案的关键在于:采用结构清晰、技能导向的细粒度评分量规(skill-based rubrics),尤其是基于检查清单式的分析性评分框架,可显著提升人机评分一致性,而提示格式(prompting format)仅起次要作用,温度参数(temperature setting)影响有限。这一发现为在STEM教育评估中实现高可靠性的AI辅助评分提供了可迁移的设计原则。

链接: https://arxiv.org/abs/2604.12227
作者: Xiuxiu Tang,G. Alex Ambrose,Ying Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Student responses in STEM assessments are often handwritten and combine symbolic expressions, calculations, and diagrams, creating substantial variation in format and interpretation. Despite their importance for evaluating students’ reasoning, such responses are time-consuming to score and prone to rater inconsistency, particularly when partial credit is required. Recent advances in large language models (LLMs) have increased attention to AI-assisted scoring, yet evidence remains limited regarding how rubric design and LLM configurations influence reliability across performance levels. This study examined the reliability of AI-assisted scoring of undergraduate physics constructed responses using GPT-4o. Twenty authentic handwritten exam responses were scored across two rounds by four instructors and by the AI model using skill-based rubrics with differing levels of analytic granularity. Prompting format and temperature settings were systematically varied. Overall, human-AI agreement on total scores was comparable to human inter-rater reliability and was highest for high- and low-performing responses, but declined for mid-level responses involving partial or ambiguous reasoning. Criterion-level analyses showed stronger alignment for clearly defined conceptual skills than for extended procedural judgments. A more fine-grained, checklist-based rubric improved consistency relative to holistic scoring. These findings indicate that reliable AI-assisted scoring depends primarily on clear, well-structured rubrics, while prompting format plays a secondary role and temperature has relatively limited impact. More broadly, the study provides transferable design recommendations for implementing reliable LLM-assisted scoring in STEM contexts through skill-based rubrics and controlled LLM settings.

[NLP-65] LLM-Guided Semantic Bootstrapping for Interpretable Text Classification with Tsetlin Machines ACL2026

【速读】: 该论文旨在解决预训练语言模型(Pretrained Language Models, PLMs)如BERT在文本分类任务中虽具备强大语义表示能力但存在计算成本高、可解释性差的问题,以及符号模型(如Tsetlin Machine, TM)虽然具有透明性却缺乏语义泛化能力的局限。解决方案的关键在于提出一种语义引导的符号化框架(semantic bootstrapping framework),通过将大型语言模型(Large Language Models, LLMs)的知识迁移至符号形式,实现语义能力和可解释性的协同提升:首先利用LLM生成子意图以指导合成数据的三阶段课程式构建(种子、核心、丰富),随后由非否定型Tsetlin机(Non-Negated TM, NTM)从中提取高置信度逻辑文字作为可解释的语义线索,并将其注入真实数据,使TM能够对齐其命题逻辑与LLM推断出的语义,从而在无需嵌入或运行时调用LLM的情况下,赋予符号模型预训练语义先验,显著提升性能和可解释性,同时保持符号模型的高效性。

链接: https://arxiv.org/abs/2604.12223
作者: Jiechao Gao,Rohan Kumar Yadav,Yuangang Li,Yuandong Pan,Jie Wang,Ying Liu,Michael Lepech
机构: Stanford University (斯坦福大学); University of California, Irvine (加州大学欧文分校); University of the Chinese Academy of Sciences (中国科学院大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to Findings of the Association for Computational Linguistics (ACL 2026)

点击查看摘要

Abstract:Pretrained language models (PLMs) like BERT provide strong semantic representations but are costly and opaque, while symbolic models such as the Tsetlin Machine (TM) offer transparency but lack semantic generalization. We propose a semantic bootstrapping framework that transfers LLM knowledge into symbolic form, combining interpretability with semantic capacity. Given a class label, an LLM generates sub-intents that guide synthetic data creation through a three-stage curriculum (seed, core, enriched), expanding semantic diversity. A Non-Negated TM (NTM) learns from these examples to extract high-confidence literals as interpretable semantic cues. Injecting these cues into real data enables a TM to align clause logic with LLM-inferred semantics. Our method requires no embeddings or runtime LLM calls, yet equips symbolic models with pretrained semantic priors. Across multiple text classification tasks, it improves interpretability and accuracy over vanilla TM, achieving performance comparable to BERT while remaining fully symbolic and efficient.

[NLP-66] TimeMark: A Trustworthy Time Watermarking Framework for Exact Generation-Time Recovery from AIGC

【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 内容中水印技术存在的可靠性不足问题,特别是现有基于统计信号的水印方法在多比特编码(如时间戳)场景下检测准确率低、易受用户侧统计攻击和提供商侧伪造攻击的问题。其解决方案的关键在于提出“可信水印”(trustworthy watermark)概念,通过融合密码学技术,在监管监督下将时间信息编码为依赖时间的秘密密钥,防止任意时间戳伪造;同时将水印载荷与时间解耦,采用每次生成时随机生成且不存储的比特序列,消除可被利用的统计模式,并设计两级编码机制结合纠错码,实现理论上100%准确的时间恢复能力,从而满足司法证据所需的可靠性要求。

链接: https://arxiv.org/abs/2604.12216
作者: Shangkun Che,Silin Du,Ge Gao
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The widespread use of Large Language Models (LLMs) in text generation has raised increasing concerns about intellectual property disputes. Watermarking techniques, which embed meta information into AI-generated content (AIGC), have the potential to serve as judicial evidence. However, existing methods rely on statistical signals in token distributions, leading to inherently probabilistic detection and reduced reliability, especially in multi-bit encoding (e.g., timestamps). Moreover, such methods introduce detectable statistical patterns, making them vulnerable to forgery attacks and enabling model providers to fabricate arbitrary watermarks. To address these issues, we propose the concept of trustworthy watermark, which achieves reliable recovery with 100% identification accuracy while resisting both user-side statistical attacks and provider-side forgery. We focus on trustworthy time watermarking for use as judicial evidence. Our framework integrates cryptographic techniques and encodes time information into time-dependent secret keys under regulatory supervision, preventing arbitrary timestamp fabrication. The watermark payload is decoupled from time and generated as a random, non-stored bit sequence for each instance, eliminating statistical patterns. To ensure verifiability, we design a two-stage encoding mechanism, which, combined with error-correcting codes, enables reliable recovery of generation time with theoretically perfect accuracy. Both theoretical analysis and experiments demonstrate that our framework satisfies the reliability requirements for judicial evidence and offers a practical solution for future AIGC-related intellectual property disputes.
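摘要中"将时间信息编码为时间相关秘密密钥"的思路,可以用一个极简草图来说明:从监管方持有的主密钥按时间窗口派生水印密钥,使某一密钥只对一个时间窗口有效(以下为基于 HMAC 的假设性示意,函数名与参数均为虚构,并非论文的具体构造):

```python
import hashlib
import hmac

def time_key(master_secret: bytes, epoch_minutes: int) -> bytes:
    """Derive a per-time-window watermark key from a regulator-held
    master secret (hypothetical sketch, not the paper's construction).

    Because the key depends on the time window, a provider cannot
    fabricate a watermark for an arbitrary timestamp without the secret.
    """
    return hmac.new(master_secret,
                    str(epoch_minutes).encode(),
                    hashlib.sha256).digest()

# Same window -> same key; different window -> unrelated key.
k1 = time_key(b"regulator-secret", 28_000_000)
k2 = time_key(b"regulator-secret", 28_000_001)
```

验证方只需重新派生对应窗口的密钥即可核对水印,无需存储每次生成的随机载荷。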

[NLP-67] Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering ACL2026

【速读】: 该论文旨在解决现有模拟认知障碍标准化患者(Standardized Patients with cognitive impairment)方法依赖离散提示工程、难以捕捉不同领域和严重程度下缺陷异质性的问题。其解决方案的关键在于提出StsPatient框架,通过从对比指令对中提取控制向量(steering vectors)以捕获特定认知领域的特征,并引入随机令牌调制机制(Stochastic Token Modulation, STM),实现对障碍严重程度的精确调控,同时缓解传统向量方法的不稳定性问题。

链接: https://arxiv.org/abs/2604.12210
作者: Weikang Zhang,Zimo Zhu,Zhichuan Yang,Chen Huang,Wenqiang Lei,See-Kiong Ng
机构: Xi’an Jiaotong University (西安交通大学); National University of Singapore (新加坡国立大学); Sichuan University (四川大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Findings of ACL 2026

点击查看摘要

Abstract:Simulating Standardized Patients with cognitive impairment offers a scalable and ethical solution for clinical training. However, existing methods rely on discrete prompt engineering and fail to capture the heterogeneity of deficits across varying domains and severity levels. To address this limitation, we propose StsPatient for the fine-grained simulation of cognitively impaired patients. We innovatively capture domain-specific features by extracting steering vectors from contrastive pairs of instructions and responses. Furthermore, we introduce a Stochastic Token Modulation (STM) mechanism to regulate the intervention probability. STM enables precise control over impairment severity while mitigating the instability of conventional vector methods. Comprehensive experiments demonstrate that StsPatient significantly outperforms baselines in both clinical authenticity and severity controllability.

[NLP-68] Beyond Majority Voting: Efficient Best-Of-N with Radial Consensus Score

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在给定提示下生成多个候选回答时,如何高效且可靠地选择最优响应的问题,尤其针对正确性与表面多数一致性的不一致现象。现有方法如自一致性(self-consistency)依赖离散投票,而基于概率的方法往往无法捕捉候选答案间的语义关系,或倾向于低估高质量但出现频率较低的回答,并未充分利用答案表示的几何结构。解决方案的关键在于提出径向共识评分(Radial Consensus Score, RCS),其通过计算答案嵌入的加权弗雷歇均值(Fréchet mean,即语义中心)并以候选答案到该中心的径向距离进行排序,从而建模语义共识;RCS支持多种权重策略(包括均匀、频率和概率加权),灵活融合一致性信号与模型置信度,且无需训练、适用于黑盒场景,显著提升了多候选选择的准确性和鲁棒性。

链接: https://arxiv.org/abs/2604.12196
作者: Manh Nguyen,Sunil Gupta,Hung Le
机构: Deakin University (迪肯大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) frequently generate multiple candidate responses for a given prompt, yet selecting the most reliable one remains challenging, especially when correctness diverges from surface-level majority agreement. Existing approaches, such as self-consistency, rely on discrete voting, while probability-based methods often fail to capture relationships among candidate answers or tend to underweight high-quality but less frequent responses, and do not fully leverage the geometric structure of answer representations. To address these limitations, we introduce Radial Consensus Score (RCS), a simple, efficient, and training-free method for best-of-N selection. RCS models semantic consensus by computing a weighted Fréchet mean (semantic center) of answer embeddings and ranking candidates by their radial distance to this center. Importantly, RCS provides a general framework that supports multiple weighting schemes, including uniform, frequency-based, and probability-based variants, enabling flexible integration of agreement signals and model confidence while remaining fully applicable in black-box settings. Extensive experiments across seven benchmarks covering short-form QA and long-form reasoning tasks, and five open-weight models, demonstrate that RCS variants consistently outperform strong baselines, with gains becoming more pronounced as the sampling budget increases. RCS also serves as an effective drop-in replacement for majority voting in multi-agent debate and exhibits strong robustness in black-box scenarios. Overall, these results highlight geometric consensus as a scalable and broadly applicable principle for reliable answer selection, extending beyond majority voting to more expressive and robust aggregation in LLM inference.
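RCS 的核心计算在欧氏空间中可以写得非常紧凑:加权 Fréchet 均值退化为加权平均,候选按到该中心的径向距离排序(以下为示意性实现,嵌入向量与权重均为假设输入):

```python
import math

def radial_consensus_rank(embeddings, weights=None):
    """Rank candidates by radial distance to the weighted semantic center.

    embeddings: one fixed-length vector per candidate answer.
    weights: optional per-candidate weights (the paper's uniform /
             frequency / probability variants); defaults to uniform.
    Returns candidate indices, closest to the consensus center first.
    """
    n, dim = len(embeddings), len(embeddings[0])
    if weights is None:
        weights = [1.0] * n
    total = sum(weights)
    # In Euclidean space the weighted Frechet mean is the weighted average.
    center = [sum(w * e[k] for w, e in zip(weights, embeddings)) / total
              for k in range(dim)]
    return sorted(range(n), key=lambda i: math.dist(embeddings[i], center))
```

best-of-N 选择即取排序结果的第一个候选;更换 weights 即可在均匀/频率/概率加权变体之间切换。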

[NLP-69] Knowledge Is Not Static: Order-Aware Hypergraph RAG for Language Models

【速读】: 该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)方法在处理顺序敏感推理任务时的局限性问题。当前RAG方法,包括基于图和超图的方法,将检索到的证据视为无序集合,隐含假设了排列不变性(permutation invariance),这与许多现实世界推理任务的本质不一致——这些任务不仅依赖于哪些交互发生,还依赖于交互发生的顺序。解决方案的关键在于提出一种新的框架Order-Aware Knowledge Hypergraph RAG(OKH-RAG),其核心创新是将“顺序”作为知识结构的首要属性:通过构建带有优先级结构(precedence structure)的超图来表示知识,并将检索重新定义为对超边序列的推断过程;同时引入一个学习的转移模型,直接从数据中推断优先关系,无需显式的时序标注。实验表明,OKH-RAG在顺序敏感的问题回答与解释任务中显著优于传统集合型基线,且消融分析证实性能提升源于对交互顺序的有效建模。

链接: https://arxiv.org/abs/2604.12185
作者: Keshu Wu,Chenchen Kuai,Zihao Li,Jiwan Jiang,Shiyu Shen,Shian Wang,Chan-Wei Hu,Zhengzhong Tu,Yang Zhou
机构: Texas A&M University; University of Illinois Urbana-Champaign; The University of Kansas
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) enhances large language models by grounding outputs in retrieved knowledge. However, existing RAG methods including graph- and hypergraph-based approaches treat retrieved evidence as an unordered set, implicitly assuming permutation invariance. This assumption is misaligned with many real-world reasoning tasks, where outcomes depend not only on which interactions occur, but also on the order in which they unfold. We propose Order-Aware Knowledge Hypergraph RAG (OKH-RAG), which treats order as a first-class structural property. OKH-RAG represents knowledge as higher-order interactions within a hypergraph augmented with precedence structure, and reformulates retrieval as sequence inference over hyperedges. Instead of selecting independent facts, it recovers coherent interaction trajectories that reflect underlying reasoning processes. A learned transition model infers precedence directly from data without requiring explicit temporal supervision. We evaluate OKH-RAG on order-sensitive question answering and explanation tasks, including tropical cyclone and port operation scenarios. OKH-RAG consistently outperforms permutation-invariant baselines, and ablations show that these gains arise specifically from modeling interaction order. These results highlight a key limitation of set-based retrieval: effective reasoning requires not only retrieving relevant evidence, but organizing it into structured sequences.

[NLP-70] Policy-Invisible Violations in LLM-Based Agents

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体在执行任务时可能出现的“政策不可见违规”(policy-invisible violations)问题,即尽管代理的行为在语法上合法、用户授权且语义合理,但由于缺乏决策时刻可见的实体属性、上下文状态或会话历史等关键信息,导致其违反组织政策。解决方案的关键在于引入一个名为PhantomPolicy的基准测试集和一个基于反事实图模拟的强制执行框架Sentinel:PhantomPolicy通过精心设计的8类违规场景与平衡的违规/安全对照案例,确保所有工具响应仅包含干净业务数据而不含政策元数据;Sentinel则将每个代理动作视为对组织知识图谱的潜在突变,通过推测性执行预演动作后的世界状态,并验证图结构不变量来决定是否允许、阻止或要求澄清。实验表明,Sentinel相较于仅依赖内容的DLP基线显著提升准确率(68.8% vs. 93.0%),证明了当政策相关的世界状态被显式提供给执行层时,可实现更可靠的安全控制。

链接: https://arxiv.org/abs/2604.12177
作者: Jie Wu,Ming Gong
机构: Atlassian(Atlassian)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 26 pages,1 figure, 11 tables

点击查看摘要

Abstract:LLM-based agents can execute actions that are syntactically valid, user-sanctioned, and semantically appropriate, yet still violate organizational policy because the facts needed for correct policy judgment are hidden at decision time. We call this failure mode policy-invisible violations: cases in which compliance depends on entity attributes, contextual state, or session history absent from the agent’s visible context. We present PhantomPolicy, a benchmark spanning eight violation categories with balanced violation and safe-control cases, in which all tool responses contain clean business data without policy metadata. We manually review all 600 model traces produced by five frontier models and evaluate them using human-reviewed trace labels. Manual review changes 32 labels (5.3%) relative to the original case-level annotations, confirming the need for trace-level human review. To demonstrate what world-state-grounded enforcement can achieve under favorable conditions, we introduce Sentinel, an enforcement framework based on counterfactual graph simulation. Sentinel treats every agent action as a proposed mutation to an organizational knowledge graph, performs speculative execution to materialize the post-action world state, and verifies graph-structural invariants to decide Allow/Block/Clarify. Against human-reviewed trace labels, Sentinel substantially outperforms a content-only DLP baseline (68.8% vs. 93.0% accuracy) while maintaining high precision, though it still leaves room for improvement on certain violation categories. These results demonstrate what becomes achievable once policy-relevant world state is made available to the enforcement layer.
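"将每个动作视为对知识图谱的拟议突变,先在副本上推测执行、再校验图不变量"这一流程可用如下假设性草图表达(图结构、动作与不变量均为虚构示例,真实系统还会有 Clarify 第三种结果):

```python
import copy

def sentinel_decide(graph, action, invariants):
    """Speculatively apply an action to a copy of the org knowledge
    graph, then verify named invariants on the post-action world state
    (hypothetical sketch of the counterfactual-simulation idea)."""
    world = copy.deepcopy(graph)   # counterfactual world, original untouched
    action(world)                  # speculative execution of the mutation
    violated = [name for name, check in invariants if not check(world)]
    return ("Block", violated) if violated else ("Allow", [])

# Hypothetical policy: documents must not be shared with interns.
graph = {"alice": {"role": "intern"}, "doc": {"shared_with": []}}
no_intern_share = (
    "no_intern_share",
    lambda g: all(g[u]["role"] != "intern" for u in g["doc"]["shared_with"]),
)
share_with_alice = lambda g: g["doc"]["shared_with"].append("alice")
```

关键在于:判断所需的"世界状态"(这里是 alice 的 role 属性)对执行层显式可见,而不必出现在代理的上下文里。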

[NLP-71] AlphaEval: Evaluating Agents in Production

【速读】: 该论文旨在解决当前AI代理(AI Agent)在商业场景中快速部署时,评估方法滞后于实际生产环境的问题。现有基准测试多基于事后整理的任务和确定性指标,无法反映真实生产环境中隐含约束、异构多模态输入、未声明的专业知识需求、长周期专业产出以及专家标准动态演进等复杂特性。解决方案的关键在于提出AlphaEval——一个基于生产实践的基准测试,涵盖94项来自七家部署AI代理企业的任务,覆盖六个O*NET职业领域;其核心创新是将完整的AI代理产品(如Claude Code、Codex等)作为商业系统进行评估,而非仅限于模型层面,并引入多范式评估框架(包括LLM-as-a-Judge、参考驱动指标、形式化验证、评分卡评估、自动化UI测试等),同时提供一套从真实需求到可执行评估任务的标准化构建流程,使组织能够高效、可复现地创建适用于自身领域的生产导向型评估基准。

链接: https://arxiv.org/abs/2604.12162
作者: Pengrui Lu,Bingyu Xu,Wenjun Zhang,Shengjia Hua,Xuanjian Gao,Ranxiang Ge,Lyumanshan Ye,Linxuan Wu,Yiran Li,Junfei Fish Yu,Yibo Zhang,Ruixin Li,Manxiang Li,Xiao Han,Xiaocong Zhou,Guangyao Chi,Zisheng Chen,Kaishen Chen,Kun Wang,Qihua Xu,Fengyue Meng,Yuchen Ni,Jiajun Li,Jinxiu Liu,Danfeng Zhang,Jingru Zhao,Pengfei Liu
机构: SII; MiraclePlus; SJTU; GAIR; HIT; UCAS; LangCore; Jiqizhixin; HunterAI; CinoCore; KuaFuAI; POET
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid deployment of AI agents in commercial settings has outpaced the development of evaluation methodologies that reflect production realities. Existing benchmarks measure agent capabilities through retrospectively curated tasks with well-specified requirements and deterministic metrics – conditions that diverge fundamentally from production environments where requirements contain implicit constraints, inputs are heterogeneous multi-modal documents with information fragmented across sources, tasks demand undeclared domain expertise, outputs are long-horizon professional deliverables, and success is judged by domain experts whose standards evolve over time. We present AlphaEval, a production-grounded benchmark of 94 tasks sourced from seven companies deploying AI agents in their core business, spanning six O*NET (Occupational Information Network) domains. Unlike model-centric benchmarks, AlphaEval evaluates complete agent products – Claude Code, Codex, etc. – as commercial systems, capturing performance variations invisible to model-level evaluation. Our evaluation framework covers multiple paradigms (LLM-as-a-Judge, reference-driven metrics, formal verification, rubric-based assessment, automated UI testing, etc.), with individual domains composing multiple paradigms. Beyond the benchmark itself, we contribute a requirement-to-benchmark construction framework – a systematic methodology that transforms authentic production requirements into executable evaluation tasks in minimal time. This framework standardizes the entire pipeline from requirement to evaluation, providing a reproducible, modular process that any organization can adopt to construct production-grounded benchmarks for their own domains.

[NLP-72] From Plan to Action: How Well Do Agents Follow the Plan?

【速读】: 该论文旨在解决编程代理(Programming Agents)在执行任务时对预设计划的遵循程度不明确的问题,即缺乏对其是否真正按照指令计划进行推理与行动的系统性评估。这一问题直接影响对代理解决方案正确性的判断,例如难以区分是通过合理战略推理达成目标,还是因数据污染或基准过拟合所致。解决方案的关键在于首次对16,991条任务轨迹进行了大规模、系统的计划合规性分析,揭示了无显式计划时代理依赖训练中内化的不完整或过拟合工作流,而提供标准计划可提升任务成功率;更重要的是发现:频繁的计划提醒能减少违规行为并提高成功概率,但设计不良的计划反而比无计划更差,且早期添加不匹配内部策略的任务阶段会损害性能。因此,研究提出应从“编码特定计划”转向“训练模型按指令灵活适应与推理”,以填补当前细调范式中对计划遵从能力培养的空白。

链接: https://arxiv.org/abs/2604.12147
作者: Shuyang Liu,Saman Dehghan,Jatin Ganhotra,Martin Hirzel,Reyhaneh Jabbarvand
机构: University of Illinois Urbana–Champaign (伊利诺伊大学厄巴纳-香槟分校); IBM (国际商业机器公司)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Agents aspire to eliminate the need for task-specific prompt crafting through autonomous reason-act-observe loops. Still, they are commonly instructed to follow a task-specific plan for guidance, e.g., to resolve software issues following phases for navigation, reproduction, patch, and validation. Unfortunately, it is unknown to what extent agents actually follow such instructed plans. Without such an analysis, determining the extent agents comply with a given plan, it is impossible to assess whether a solution was reached through correct strategic reasoning or through other means, e.g., data contamination or overfitting to a benchmark. This paper presents the first extensive, systematic analysis of plan compliance in programming agents, examining 16,991 trajectories from SWE-agent across four LLMs on SWE-bench Verified and SWE-bench Pro under eight plan variations. Without an explicit plan, agents fall back on workflows internalized during training, which are often incomplete, overfit, or inconsistently applied. Providing the standard plan improves issue resolution, and we observe that periodic plan reminders can mitigate plan violations and improve task success. A subpar plan hurts performance even more than no plan at all. Surprisingly, augmenting a plan with additional task-relevant phases in the early stage can degrade performance, particularly when these phases do not align with the model’s internal problem-solving strategy. These findings highlight a research gap: fine-tuning paradigms that teach models to follow instructed plans, rather than encoding task-specific plans in them. This requires teaching models to reason and act adaptively, rather than memorizing workflows.

[NLP-73] When Self-Reference Fails to Close: Matrix-Level Dynamics in Large Language Models

【速读】: 该论文旨在解决自指输入(self-referential inputs)如何影响大语言模型(Large Language Models, LLMs)内部矩阵动态的问题,特别是识别哪些类型的自指会导致模型状态不稳定。其关键解决方案在于通过系统性测量106个标量指标在多个模型架构(Qwen3-VL-8B、Llama-3.2-11B、Llama-3.3-70B、Gemma-2-9B)和不同温度下的变化,发现非闭合真值递归(Non-Closing Truth Recursion, NCTR)是导致显著不稳定的根源——NCTR引发注意力有效秩(attention effective rank)等关键指标的异常升高(Cohen’s d 达3.14–3.52),且这种不稳定性贯穿所有层(每层SVD分析均显示d > 1.0),并可通过分类器准确区分(AUC 0.81–0.90)。研究进一步提出假设:NCTR迫使Transformer模型进入动力学状态,使经典矩阵半群问题集中显现,从而揭示了自指失败模式的内在机制。

链接: https://arxiv.org/abs/2604.12128
作者: Ji Ho Bae
机构: Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL)
备注: 14 pages, 4 figures, 11 tables

点击查看摘要

Abstract:We investigate how self-referential inputs alter the internal matrix dynamics of large language models. Measuring 106 scalar metrics across up to 7 analysis passes on four models from three architecture families – Qwen3-VL-8B, Llama-3.2-11B, Llama-3.3-70B, and Gemma-2-9B – over 300 prompts in a 14-level hierarchy at three temperatures ( T \in \{0.0, 0.3, 0.7\} ), we find that self-reference alone is not destabilizing: grounded self-referential statements and meta-cognitive prompts are markedly more stable than paradoxical self-reference on key collapse-related metrics, and on several such metrics can be as stable as factual controls. Instability concentrates in prompts inducing non-closing truth recursion (NCTR) – truth-value computations with no finite-depth resolution. NCTR prompts produce anomalously elevated attention effective rank – indicating attention reorganization with global dispersion rather than simple concentration collapse – and key metrics reach Cohen’s d = 3.14 (attention effective rank) to 3.52 (variance kurtosis) vs. stable self-reference in the 70B model; 281/397 metric-model combinations differentiate NCTR from stable self-reference after FDR correction ( q < 0.05 ), 198 with |d| > 0.8. Per-layer SVD confirms disruption at every sampled layer ( d > 1.0 in all three models analyzed), ruling out aggregation artifacts. A classifier achieves AUC 0.81–0.90; 30 minimal pairs yield 42/387 significant combinations; 43/106 metrics replicate across all four models. We connect these observations to three classical matrix-semigroup problems and propose, as a conjecture, that NCTR forces finite-depth transformers toward dynamical regimes where these problems concentrate. NCTR prompts also produce elevated contradictory output ( +34 to +56 percentage points vs. controls), suggesting practical relevance for understanding self-referential failure modes.
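摘要中反复出现的"注意力有效秩"(attention effective rank)是基于奇异值谱熵的标准定义,可直接计算(以下示意假设使用 NumPy,输入为任一注意力矩阵):

```python
import numpy as np

def effective_rank(matrix, eps=1e-12):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular-value spectrum. Near full rank when the spectrum
    is flat, near 1 when a single direction dominates."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / (s.sum() + eps)          # normalize the spectrum
    p = p[p > eps]                   # drop numerically-zero components
    return float(np.exp(-(p * np.log(p)).sum()))
```

在论文语境下,NCTR 提示使该值异常升高,说明注意力呈"全局弥散式重组"而非简单的集中塌缩。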

[NLP-74] Long-Horizon Plan Execution in Large Tool Spaces through Entropy-Guided Branching

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在工具增强型智能体中执行多步骤任务时面临的两大挑战:一是缺乏严格的、基于计划层级的评估框架,二是因庞大工具库和长周期规划导致的决策空间探索计算开销巨大。为应对这些问题,作者提出两个核心贡献:首先构建了SLATE(Synthetic Large-scale API Toolkit for E-commerce),一个面向电商场景的大规模上下文感知基准,用于自动化评估工具集成智能体;其次设计了熵引导分支(Entropy-Guided Branching, EGB)算法,通过动态扩展高预测熵区域的决策分支来优化探索与利用之间的权衡,从而显著提升任务成功率和计算效率。关键在于将系统性评估与不确定性驱动的搜索策略相结合,为复杂工具环境中可靠且可扩展的LLM智能体开发提供坚实基础。

链接: https://arxiv.org/abs/2604.12126
作者: Rongzhe Wei,Ge Shi,Min Cheng,Na Zhang,Pan Li,Sarthak Ghosh,Vaibhav Gorde,Leman Akoglu
机构: Georgia Institute of Technology(佐治亚理工学院); Amazon(亚马逊)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This work was completed during an internship at Amazon

点击查看摘要

Abstract:Large Language Models (LLMs) have significantly advanced tool-augmented agents, enabling autonomous reasoning via API interactions. However, executing multi-step tasks within massive tool libraries remains challenging due to two critical bottlenecks: (1) the absence of rigorous, plan-level evaluation frameworks and (2) the computational demand of exploring vast decision spaces stemming from large toolsets and long-horizon planning. To bridge these gaps, we first introduce SLATE (Synthetic Large-scale API Toolkit for E-commerce), a large-scale context-aware benchmark designed for the automated assessment of tool-integrated agents. Unlike static metrics, SLATE accommodates diverse yet functionally valid execution trajectories, revealing that current agents struggle with self-correction and search efficiency. Motivated by these findings, we next propose Entropy-Guided Branching (EGB), an uncertainty-aware search algorithm that dynamically expands decision branches where predictive entropy is high. EGB optimizes the exploration-exploitation trade-off, significantly enhancing both task success rates and computational efficiency. Extensive experiments on SLATE demonstrate that our dual contribution provides a robust foundation for developing reliable and scalable LLM agents in tool-rich environments.
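EGB 的分支判据可以压缩为几行:对候选动作(工具调用)分布计算预测熵,低于阈值则贪心单分支,高于阈值则展开若干分支(阈值与最大分支数为假设超参数,仅作示意):

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_guided_branch(action_probs, tau=1.0, max_branches=3):
    """Return indices of candidate actions to expand at this step:
    one greedy branch when the model is confident, several branches
    when predictive entropy exceeds the threshold tau."""
    order = sorted(range(len(action_probs)), key=lambda i: -action_probs[i])
    if entropy(action_probs) <= tau:
        return order[:1]          # confident: single greedy branch
    return order[:max_branches]   # uncertain: branch out
```

这样搜索预算只花在模型真正不确定的决策点上,兼顾任务成功率与计算效率。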

[NLP-75] Temporal Flattening in LLM-Generated Text: Comparing Human and LLM Writing Trajectories ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长期文本生成中缺乏人类写作所具有的时间演化特征的问题,即模型是否能够再现人类作者在跨月、跨年等长时间尺度上表现出的语义、词汇和认知情绪层面的动态变化。其关键解决方案在于构建了一个包含412名人类作者、6086篇文档的纵向数据集(时间跨度为2012–2024年,涵盖学术摘要、博客和新闻三类领域),并通过漂移(drift)与方差(variance)指标对人类与LLM生成文本的时间结构进行量化比较。研究发现,无论LLM是否基于历史上下文生成,其输出均呈现“时间扁平化”现象:虽然词汇多样性更高,但语义和认知情绪漂移显著低于人类,且这种差异可单独实现94%的分类准确率和98%的ROC-AUC值,表明当前LLM部署范式存在根本性局限,亟需改进以满足需要真实时间结构的应用场景(如合成训练数据或纵向文本建模)。

链接: https://arxiv.org/abs/2604.12097
作者: Zhanwei Cao,YeoJin Go,Yifan Hu,Shanu Sushmita
机构: Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注: 25 pages, 6 figures. To appear in Findings of ACL 2026

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used in daily applications, from content generation to code writing, where each interaction treats the model as stateless, generating responses independently without memory. Yet human writing is inherently longitudinal: authors’ styles and cognitive states evolve across months and years. This raises a central question: can LLMs reproduce such temporal structure across extended time periods? We construct and publicly release a longitudinal dataset of 412 human authors and 6,086 documents spanning 2012–2024 across three domains (academic abstracts, blogs, news) and compare them to trajectories generated by three representative LLMs under standard and history-conditioned generation settings. Using drift and variance-based metrics over semantic, lexical, and cognitive-emotional representations, we find temporal flattening in LLM-generated text. LLMs produce greater lexical diversity but exhibit substantially reduced semantic and cognitive-emotional drift relative to humans. These differences are highly predictive: temporal variability patterns alone achieve 94% accuracy and 98% ROC-AUC in distinguishing human from LLM trajectories. Our results demonstrate that temporal flattening persists regardless of whether LLMs generate independently or with access to incremental history, revealing a fundamental property of current deployment paradigms. This gap has direct implications for applications requiring authentic temporal structure, such as synthetic training data and longitudinal text modeling.
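论文使用的"漂移"类指标本质上是对按时间排序的文档表示序列求逐步位移,一个最小示意如下(嵌入为假设输入;论文实际在语义、词汇、认知情绪三种表示上分别计算):

```python
import math

def trajectory_drift(doc_embeddings):
    """Mean step-to-step displacement along an author's time-ordered
    document embeddings; 'temporal flattening' shows up as this value
    being much smaller for LLM trajectories than for human ones."""
    steps = [math.dist(a, b)
             for a, b in zip(doc_embeddings, doc_embeddings[1:])]
    return sum(steps) / len(steps)
```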

[NLP-76] Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)是否继承人类道德推理中存在的情感非理性问题,特别是识别受害者效应(Identifiable Victim Effect, IVE)——即人类更倾向于向具体、叙事化描述的个体分配资源,而非面对同等困境的统计群体。解决方案的关键在于通过大规模实证研究(N=51,955次API调用,覆盖16个前沿模型),系统检验IVE在LLMs中的表现及其调节机制:发现IVE普遍存在,但显著受对齐训练(alignment training)影响——指令微调模型表现出极端IVE(Cohen’s d高达1.56),而推理优化模型则反转该效应(d=-0.85);同时揭示标准链式思维(Chain-of-Thought, CoT)提示反而加剧IVE(效应量从d=0.15增至d=0.41),唯有功利主义导向的CoT可有效消除该效应,为AI在人道主义与伦理决策场景中的部署提供了关键实证依据和干预路径。

链接: https://arxiv.org/abs/2604.12076
作者: Syed Rifat Raiyan
机构: IUT-Dhaka (国际大学技术学院达卡分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Under review, 49 pages, 20 figures, 11 tables

点击查看摘要

Abstract:The Identifiable Victim Effect (IVE) - the tendency to allocate greater resources to a specific, narratively described victim than to a statistically characterized group facing equivalent hardship - is one of the most robust findings in moral psychology and behavioural economics. As large language models (LLMs) assume consequential roles in humanitarian triage, automated grant evaluation, and content moderation, a critical question arises: do these systems inherit the affective irrationalities present in human moral reasoning? We present the first systematic, large-scale empirical investigation of the IVE in LLMs, comprising N=51,955 validated API trials across 16 frontier models spanning nine organizational lineages (Google, Anthropic, OpenAI, Meta, DeepSeek, xAI, Alibaba, IBM, and Moonshot). Using a suite of ten experiments - porting and extending canonical paradigms from Small et al. (2007) and Kogut and Ritov (2005) - we find that the IVE is prevalent but strongly modulated by alignment training. Instruction-tuned models exhibit extreme IVE (Cohen’s d up to 1.56), while reasoning-specialized models invert the effect (down to d=-0.85). The pooled effect (d=0.223, p=2e-6) is approximately twice the single-victim human meta-analytic baseline (d \approx 0.10) reported by Lee and Feeley (2016) - and likely exceeds the overall human pooled effect by a larger margin, given that the group-victim human effect is near zero. Standard Chain-of-Thought (CoT) prompting - contrary to its role as a deliberative corrective - nearly triples the IVE effect size (from d=0.15 to d=0.41), while only utilitarian CoT reliably eliminates it. We further document psychophysical numbing, perfect quantity neglect, and marginal in-group/out-group cultural bias, with implications for AI deployment in humanitarian and ethical decision-making contexts.
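文中大量报告的效应量是合并标准差版本的 Cohen's d,以"可识别受害者 vs. 统计群体"两组分配金额为例(数据为虚构示意):

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d with pooled standard deviation: the standardized
    difference between the means of two independent groups."""
    na, nb = len(group_a), len(group_b)
    ma, mb = sum(group_a) / na, sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled

# Hypothetical allocations: identified victim vs. statistical group.
d = cohens_d([9.0, 8.0, 7.0], [5.0, 4.0, 3.0])
```

按常用口径,d≈0.2/0.5/0.8 分别对应小/中/大效应;文中指令微调模型的 IVE 最高达 d=1.56。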

[NLP-77] Robust Explanations for User Trust in Enterprise NLP Systems

【速读】: 该论文旨在解决企业在自然语言处理(Natural Language Processing, NLP)系统中部署黑盒模型时,难以验证生成的token级解释(token-level explanations)在真实用户噪声下的鲁棒性问题,尤其是在从编码器类模型(如BERT)迁移至解码器类大语言模型(Large Language Models, LLMs)时,现有研究缺乏对解释稳定性评估的系统性指导。解决方案的关键在于提出一个统一的黑盒鲁棒性评估框架,基于留一法遮蔽(leave-one-out occlusion)方法,并通过多种现实扰动(包括交换、删除、洗牌和回译)在不同严重程度下量化解释的稳定性,以“top-token flip rate”作为核心指标。该方法使作者能够跨架构(encoder vs. decoder)和模型规模系统比较解释鲁棒性,揭示了解码器LLMs在解释稳定性上显著优于编码器基线(平均降低73%的翻转率),且模型规模越大越稳定(从7B到70B参数提升44%),最终构建出推理成本与鲁棒性之间的权衡曲线,为合规敏感场景中的模型与解释选择提供决策依据。

链接: https://arxiv.org/abs/2604.12069
作者: Guilin Zhang,Kai Zhao,Jeffrey Friedman,Xu Chu,Amine Anoun,Jerry Ting
机构: Workday AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Robust explanations are increasingly required for user trust in enterprise NLP, yet pre-deployment validation is difficult in the common case of black-box deployment (API-only access) where representation-based explainers are infeasible and existing studies provide limited guidance on whether explanations remain stable under real user noise, especially when organizations migrate from encoder classifiers to decoder LLMs. To close this gap, we propose a unified black-box robustness evaluation framework for token-level explanations based on leave-one-out occlusion, and operationalize explanation robustness with top-token flip rate under realistic perturbations (swap, deletion, shuffling, and back-translation) at multiple severity levels. Using this protocol, we conduct a systematic cross-architecture comparison across three benchmark datasets and six models spanning encoder and decoder families (BERT, RoBERTa, Qwen 7B/14B, Llama 8B/70B; 64,800 cases). We find that decoder LLMs produce substantially more stable explanations than encoder baselines (73% lower flip rates on average), and that stability improves with model scale (44% gain from 7B to 70B). Finally, we relate robustness improvements to inference cost, yielding a practical cost-robustness tradeoff curve that supports model and explanation selection prior to deployment in compliance-sensitive applications.
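论文的评估协议(留一遮蔽归因 + top-token 翻转率)只依赖黑盒打分接口,可按如下思路实现(score_fn 与扰动函数均为假设的占位实现):

```python
def top_token(tokens, score_fn):
    """Leave-one-out occlusion: the most important token is the one
    whose removal causes the largest drop in the black-box score."""
    base = score_fn(tokens)
    drops = [base - score_fn(tokens[:i] + tokens[i + 1:])
             for i in range(len(tokens))]
    return tokens[max(range(len(tokens)), key=drops.__getitem__)]

def flip_rate(examples, perturb, score_fn):
    """Fraction of examples whose top attributed token changes after a
    perturbation (swap, deletion, shuffling, back-translation, ...)."""
    flips = sum(top_token(t, score_fn) != top_token(perturb(t), score_fn)
                for t in examples)
    return flips / len(examples)
```

翻转率越低说明解释越稳定;文中解码器 LLM 的翻转率平均比编码器基线低 73%。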

[NLP-78] LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

【速读】: 该论文旨在解决块级扩散语言模型(Block-wise Diffusion Language Models, DLMs)在长文本场景下因内存密集型注意力机制导致的效率瓶颈问题。传统稀疏注意力方法在DLMs中失效,主要受限于KV膨胀(KV Inflation)问题——即不同查询选择不同的前缀位置,导致需访问的键值(Key-Value, KV)页面集合过大。解决方案的关键在于提出一种局部感知稀疏注意力机制(Locality-aware Sparse Attention, LOSA),其核心思想是:在连续去噪步骤间,仅有少量活跃token的隐藏状态发生显著变化,其余稳定token的注意力结果可复用。LOSA通过重用稳定token的缓存前缀注意力结果,并仅对活跃token应用稀疏注意力,从而大幅减少必须加载的KV索引数量,在保持接近稠密注意力精度的同时显著提升计算效率,实现在极端稀疏度下平均准确率提升达+9点,且注意力密度降低至1.54倍,GPU上最高实现4.14倍加速。

链接: https://arxiv.org/abs/2604.12056
作者: Haocheng Xi,Harman Singh,Yuezhou Hu,Coleman Hooper,Rishabh Tiwari,Aditya Tomar,Minjae Lee,Wonjun Kang,Michael Mahoney,Chenfeng Xu,Kurt Keutzer,Amir Gholami
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 16 pages, 11 figures, 6 tables

点击查看摘要

Abstract:Block-wise diffusion language models (DLMs) generate multiple tokens in any order, offering a promising alternative to the autoregressive decoding pipeline. However, they still remain bottlenecked by memory-bound attention in long-context scenarios. Naive sparse attention fails on DLMs due to a KV Inflation problem, where different queries select different prefix positions, making the union of accessed KV pages large. To address this, we observe that between consecutive denoising steps, only a small fraction of active tokens exhibit significant hidden-state changes, while the majority of stable tokens remain nearly constant. Based on this insight, we propose LOSA (Locality-aware Sparse Attention), which reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens. This substantially shrinks the number of KV indices that must be loaded, yielding both higher speedup and higher accuracy. Across multiple block-wise DLMs and benchmarks, LOSA preserves near-dense accuracy while significantly improving efficiency, achieving up to +9 points in average accuracy at aggressive sparsity levels while maintaining 1.54x lower attention density. It also achieves up to 4.14x attention speedup on RTX A6000 GPUs, demonstrating the effectiveness of the proposed method.
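LOSA 的关键观察"相邻去噪步之间只有少数 token 的隐藏状态显著变化",可以落成一个很小的划分函数(相对变化阈值 tau 为假设超参数):

```python
import math

def split_active_tokens(prev_hidden, curr_hidden, tau=0.1):
    """Between two denoising steps, mark a token 'active' when its
    hidden state moved by more than tau in relative L2 terms; stable
    tokens reuse their cached prefix-attention result, and sparse
    attention is recomputed only for the active ones (sketch)."""
    active, stable = [], []
    for i, (p, c) in enumerate(zip(prev_hidden, curr_hidden)):
        change = math.dist(p, c) / (math.hypot(*p) + 1e-8)
        (active if change > tau else stable).append(i)
    return active, stable
```

活跃集合越小,各查询需要加载的 KV 索引并集越小,这正是其缓解 "KV Inflation" 的方式。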

[NLP-79] Leverag ing Weighted Syntactic and Semantic Context Assessment Summary (wSSAS) Towards Text Categorization Using LLM s

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在企业级文本分类等分析任务中因注意力机制的随机性及对噪声敏感而导致的分析精度下降与结果不可复现的问题。其解决方案的关键在于提出一种名为加权句法与语义上下文评估摘要(Weighted Syntactic and Semantic Context Assessment Summary, wSSAS)的确定性框架,该框架通过两阶段验证机制实现:首先将原始文本组织为包含主题(Themes)、故事(Stories)和聚类(Clusters)的分层分类结构;其次引入信噪比(Signal-to-Noise Ratio, SNR)评分机制以优先提取高价值语义特征,确保模型关注最具代表性的数据点。结合摘要之摘要(Summary-of-Summaries, SoS)架构,wSSAS有效隔离核心信息并抑制聚合过程中的背景噪声,从而显著提升聚类完整性与分类准确性,并提供可复现的高质量文本分类路径。

链接: https://arxiv.org/abs/2604.12049
作者: Shreeya Verma Kathuria,Nitin Mayande,Sharookh Daruwalla,Nitin Joglekar,Charles Weber
机构: Tellagence Inc.(Tellagence 公司); Villanova School of Business, Villanova University(维拉诺瓦大学商学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The use of Large Language Models (LLMs) for reliable, enterprise-grade analytics such as text categorization is often hindered by the stochastic nature of attention mechanisms and sensitivity to noise that compromise their analytical precision and reproducibility. To address these technical frictions, this paper introduces the Weighted Syntactic and Semantic Context Assessment Summary (wSSAS), a deterministic framework designed to enforce data integrity on large-scale, chaotic datasets. We propose a two-phased validation framework that first organizes raw text into a hierarchical classification structure containing Themes, Stories, and Clusters. It then leverages a Signal-to-Noise Ratio (SNR) to prioritize high-value semantic features, ensuring the model’s attention remains focused on the most representative data points. By incorporating this scoring mechanism into a Summary-of-Summaries (SoS) architecture, the framework effectively isolates essential information and mitigates background noise during data aggregation. Experimental results using Gemini 2.0 Flash Lite across diverse datasets - including Google Business reviews, Amazon Product reviews, and Goodreads Book reviews - demonstrate that wSSAS significantly improves clustering integrity and categorization accuracy. Our findings indicate that wSSAS reduces categorization entropy and provides a reproducible pathway for improving LLM based summaries based on a high-precision, deterministic process for large-scale text categorization.

[NLP-80] Think Through Uncertainty: Improving Long-Form Generation Factuality via Reasoning Calibration

[Quick Read]: This paper targets hallucination in long-form generation by large language models (LLMs), in particular the models' lack of fine-grained uncertainty estimation over the individual claims they generate, which lets them state incorrect information with high confidence. Existing approaches rely mainly on post-hoc revision or reinforcement learning (RL) with correctness-based rewards, but neither teaches the model which parts of its output are reliable. The key idea of the proposed CURE framework is a Claim-Aware Reasoning Protocol that structures outputs into atomic claims paired with explicit confidence estimates, together with a multi-stage training pipeline that first aligns model confidence with claim correctness and then optimizes for factual accuracy. The calibrated confidence further enables selective prediction, letting the model abstain from uncertain claims at inference time. Experiments on four long-form factuality benchmarks show substantial accuracy gains (e.g., up to a 39.9% improvement in claim-level accuracy on biography generation) along with better calibration (a 16.0% AUROC increase on FactBench).

Link: https://arxiv.org/abs/2604.12046
Authors: Xin Liu, Lu Wang
Affiliations: University of Michigan
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) often hallucinate in long-form generation. Existing approaches mainly improve factuality through post-hoc revision or reinforcement learning (RL) with correctness-based rewards, but they do not teach the model to estimate which parts of its generation are reliable. As a result, models may still state incorrect claims confidently in their responses. Recent advances in reasoning have significantly improved LLM performance, and have been leveraged to estimate confidence by incorporating calibration into RL objectives. However, existing approaches remain limited to a single scalar confidence for the entire response, which is insufficient for long-form generation where uncertainty varies across individual claims. To mitigate this problem, we propose CURE, a framework that improves long-form factuality by teaching LLMs to reason about uncertainty at the claim level. We first introduce a Claim-Aware Reasoning Protocol, which structures outputs into atomic claims paired with explicit confidence estimates. We then develop a multi-stage training pipeline that aligns model confidence with claims’ correctness and then optimizes on factuality. The resulting calibrated confidence further enables selective prediction, allowing the model to abstain from uncertain claims at inference time. Experiments on four long-form factuality benchmarks show that CURE consistently improves factual accuracy over competitive supervised and RL baselines, while maintaining factual recall. In particular, it improves claim-level accuracy by up to 39.9% on Biography generation. These gains are accompanied by improved calibration, as reflected by a 16.0% increase in AUROC on FactBench.
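
The selective-prediction step is straightforward to sketch: given the atomic claims and the confidence the model attaches to each, answer with the confident claims and abstain on the rest. The function name and threshold value are hypothetical:

```python
def selective_answer(claims, tau=0.8):
    """Split (claim, confidence) pairs into answered vs. abstained claims.

    Claims whose stated confidence clears the threshold `tau` are kept;
    the rest are withheld (the model abstains on them).
    """
    kept = [claim for claim, conf in claims if conf >= tau]
    abstained = [claim for claim, conf in claims if conf < tau]
    return kept, abstained
```

Raising `tau` trades factual recall for claim-level precision, which is the lever the abstention mechanism exposes.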

[NLP-81] Benchmarking Deflection and Hallucination in Large Vision-Language Models ACL2026

[Quick Read]: This paper addresses several evaluation gaps in current large vision-language models (LVLMs) on knowledge-based visual question answering (KB-VQA): existing benchmarks overlook conflicts between visual and textual evidence, do not assess a model's ability to produce appropriate deflections (e.g., "Sorry, I cannot answer") when retrieved knowledge is incomplete, and become obsolete quickly as model training data grows. The solution has three parts: a dynamic data curation pipeline that preserves benchmark difficulty by retaining genuinely retrieval-dependent samples; VLM-DeflectionBench, a benchmark of 2,775 diverse multimodal retrieval samples designed to probe model behaviour under conflicting or insufficient evidence; and a fine-grained evaluation protocol that disentangles parametric memorization from retrieval robustness, giving a more accurate picture of how models behave under uncertainty.

Link: https://arxiv.org/abs/2604.12033
Authors: Nicholas Moratelli, Christopher Davis, Leonardo F. R. Ribeiro, Bill Byrne, Gonzalo Iglesias
Affiliations: University of Modena and Reggio Emilia; Amazon AGI; University of Cambridge
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACL 2026

Click to view abstract

Abstract:Large Vision-Language Models (LVLMs) increasingly rely on retrieval to answer knowledge-intensive multimodal questions. Existing benchmarks overlook conflicts between visual and textual evidence and the importance of generating deflections (e.g., Sorry, I cannot answer…) when retrieved knowledge is incomplete. These benchmarks also suffer from rapid obsolescence, as growing LVLM training sets allow models to answer many questions without retrieval. We address these gaps with three contributions. First, we propose a dynamic data curation pipeline that preserves benchmark difficulty over time by filtering for genuinely retrieval-dependent samples. Second, we introduce VLM-DeflectionBench, a benchmark of 2,775 samples spanning diverse multimodal retrieval settings, designed to probe model behaviour under conflicting or insufficient evidence. Third, we define a fine-grained evaluation protocol with four scenarios that disentangle parametric memorization from retrieval robustness. Experiments across 20 state-of-the-art LVLMs indicate that models usually fail to deflect in the presence of noisy or misleading evidence. Our results highlight the need to evaluate not only what models know, but how they behave when they do not, and serve as a reusable and extensible benchmark for reliable KB-VQA evaluation. All resources will be publicly available upon publication.

[NLP-82] LLMs Struggle with Abstract Meaning Comprehension More Than Expected

[Quick Read]: This paper targets the weakness of large language models (LLMs) in understanding abstract meaning, where they underperform on non-concrete, high-level semantics. The proposed remedy is a bidirectional attention classifier inspired by human cognitive strategies, which dynamically attends over both the passage and the candidate options to strengthen fine-tuned models such as BERT and RoBERTa on abstract concepts. On the two subtasks of SemEval-2021 Task 4 it improves accuracy by 4.06% and 3.41% respectively, demonstrating its effectiveness for abstract meaning comprehension.

Link: https://arxiv.org/abs/2604.12018
Authors: Hamoud Alhazmi, Jiachen Jiang
Affiliations: The Ohio State University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Understanding abstract meanings is crucial for advanced language comprehension. Despite extensive research, abstract words remain challenging due to their non-concrete, high-level semantics. SemEval-2021 Task 4 (ReCAM) evaluates models’ ability to interpret abstract concepts by presenting passages with questions and five abstract options in a cloze-style format. Key findings include: (1) Most large language models (LLMs), including GPT-4o, struggle with abstract meaning comprehension under zero-shot, one-shot, and few-shot settings, while fine-tuned models like BERT and RoBERTa perform better. (2) A proposed bidirectional attention classifier, inspired by human cognitive strategies, enhances fine-tuned models by dynamically attending to passages and options. This approach improves accuracy by 4.06 percent on Task 1 and 3.41 percent on Task 2, demonstrating its potential for abstract meaning comprehension.

[NLP-83] UCS: Estimating Unseen Coverage for Improved In-Context Learning ACL2026

[Quick Read]: This paper addresses the fact that demonstration selection strongly affects in-context learning (ICL) performance, yet existing selectors rely on heuristic relevance or diversity and do not measure whether the chosen subset adequately covers the task's latent structure. The key idea is Unseen Coverage Selection (UCS), a training-free, subset-level coverage prior built on the principle that a good demonstration set should expose the model to latent clusters not yet revealed by the current subset. UCS works in two steps: it induces discrete latent clusters from model-consistent embeddings, then uses a Smoothed Good–Turing estimator on a candidate subset's empirical frequency spectrum to estimate the number of unrevealed clusters. The method plugs into both query-dependent and query-independent selection baselines via a simple regularized objective, improving ICL accuracy by up to 6% across intent-classification and reasoning benchmarks.

Link: https://arxiv.org/abs/2604.12015
Authors: Jiayi Xin, Xiang Li, Evan Qiang, Weiqing He, Tianqi Shang, Weijie J. Su, Qi Long
Affiliations: University of Pennsylvania
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: ACL 2026 Findings; 17 pages, 3 figures

Click to view abstract

Abstract:In-context learning (ICL) performance depends critically on which demonstrations are placed in the prompt, yet most existing selectors prioritize heuristic notions of relevance or diversity and provide limited insight into the coverage of a demonstration set. We propose Unseen Coverage Selection (UCS), a training-free, subset-level coverage prior motivated by the principle that a good demonstration set should expose the model to latent clusters unrevealed by the currently selected subset. UCS operationalizes this idea by (1) inducing discrete latent clusters from model-consistent embeddings and (2) estimating the number of unrevealed clusters within a candidate subset via a Smoothed Good–Turing estimator from its empirical frequency spectrum. Unlike previous selection methods, UCS is coverage-based and training-free, and can be seamlessly combined with both query-dependent and query-independent selection baselines via a simple regularized objective. Experiments on multiple intent-classification and reasoning benchmarks with frontier Large Language Models show that augmenting strong baselines with UCS consistently improves ICL accuracy by up to 2-6% under the same selection budget, while also yielding insights into task- and model-level latent cluster distributions. Code is available at this https URL.
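
The Good–Turing step is easy to sketch: the unsmoothed Turing estimator takes the fraction of singleton clusters (clusters seen exactly once) as the probability mass of clusters not yet revealed. The paper uses a smoothed variant; this toy version only illustrates the idea:

```python
from collections import Counter

def unseen_coverage(cluster_ids):
    """Turing estimate f1/N of the probability mass of latent clusters not
    yet revealed by a candidate demonstration subset.

    `cluster_ids` lists the latent-cluster label of each selected
    demonstration; a high value means the subset is still missing clusters.
    """
    n = len(cluster_ids)
    if n == 0:
        return 1.0  # an empty subset reveals nothing
    freq = Counter(cluster_ids)
    singletons = sum(1 for count in freq.values() if count == 1)
    return singletons / n
```

A selector could then penalize candidate subsets with a high estimated unseen mass inside a regularized objective, as the abstract describes.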

[NLP-84] Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

[Quick Read]: This paper tackles a gap between the two main post-training paradigms in verifiable settings: reinforcement learning with verifiable rewards (RLVR) is broadly applicable and powerful but provides only sparse binary reward signals, making training inefficient, while distillation offers dense token-level supervision but depends on an external teacher or costly high-quality demonstrations that may be unavailable. The key mechanism of the proposed Self-Distillation Zero (SD-Zero) is to train a single model to play two roles: a Generator that produces an initial response, and a Reviser that conditions on that response and its binary reward to produce an improved one. On-policy self-distillation then uses the reviser's token-level distributions as supervision to improve the generator, effectively converting binary rewards into dense token-level self-supervision. This yields substantially better training sample efficiency without an external teacher or high-quality demonstrations, outperforming strong baselines on math and code reasoning tasks.

Link: https://arxiv.org/abs/2604.12002
Authors: Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, Sanjeev Arora
Affiliations: Princeton University; University of Toronto; Carnegie Mellon University
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Current post-training methods in verifiable settings fall into two categories. Reinforcement learning (RLVR) relies on binary rewards, which are broadly applicable and powerful, but provide only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or using high-quality demonstrations. Collecting such supervision can be costly or unavailable. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more training sample-efficient than RL and does not require an external teacher or high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation to distill the reviser into the generator, using the reviser’s token distributions conditioned on the generator’s response and its reward as supervision. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training sample budget. Extensive ablation studies show two novel characteristics of our proposed algorithm: (a) token-level self-localization, where the reviser can identify the key tokens that need to be revised in the generator’s response based on reward, and (b) iterative self-evolution, where the improving ability to revise answers can be distilled back into generation performance with regular teacher synchronization.
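
The distillation objective (using the reviser's token distributions as dense supervision for the generator) can be sketched as a mean per-token forward KL divergence. This pure-Python toy version is illustrative only; the real objective operates on full-vocabulary logits:

```python
import math

def token_kl(p, q, eps=1e-12):
    """Forward KL(p || q) for one token position over a shared vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def self_distill_loss(reviser_dists, generator_dists):
    """Mean per-token KL from the reviser's distributions (teacher) to the
    generator's (student): the dense supervision signal sketched above.

    Each argument is a list of per-position probability distributions.
    """
    kls = [token_kl(p, q) for p, q in zip(reviser_dists, generator_dists)]
    return sum(kls) / len(kls)
```

Positions where the reviser sharply disagrees with the generator dominate the loss, which matches the "token-level self-localization" behaviour reported in the ablations.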

[NLP-85] Filtered Reasoning Score: Evaluating Reasoning Quality on a Model's Most-Confident Traces

[Quick Read]: This paper addresses a limitation of outcome-based evaluation of large language models (LLMs): high accuracy can mask flawed reasoning or non-transferable strategies (e.g., memorization or over-optimization), so accuracy alone does not reflect true reasoning quality. The key contribution is the Filtered Reasoning Score (FRS), a metric that scores reasoning traces along dimensions such as faithfulness, coherence, utility, and factuality, and aggregates over only the top-K% most confident traces, so that low-confidence, coincidentally correct trajectories do not distort the score. Experiments show that FRS separates models with similar accuracy and correlates strongly with reasoning performance on other benchmarks, making it an effective complement to accuracy that captures transferable reasoning ability.

Link: https://arxiv.org/abs/2604.11996
Authors: Manas Pathak, Xingyao Chen, Shuozhe Li, Amy Zhang, Liu Leqi
Affiliations: University of Texas at Austin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Should we trust Large Language Models (LLMs) with high accuracy? LLMs achieve high accuracy on reasoning benchmarks, but correctness alone does not reveal the quality of the reasoning used to produce it. This highlights a fundamental limitation of outcome-based evaluation: models may arrive at correct answers through flawed reasoning, and models with substantially different reasoning capabilities can nevertheless exhibit similar benchmark accuracy, for example due to memorization or over-optimization. In this paper, we ask: given existing benchmarks, can we move beyond outcome-based evaluation to assess the quality of reasoning itself? We seek metrics that (1) differentiate models with similar accuracy and (2) are robust to variations in input prompts and generation configurations. To this end, we propose a reasoning score that evaluates reasoning traces along dimensions such as faithfulness, coherence, utility, and factuality. A remaining question is how to aggregate this score across multiple sampled traces. Naively averaging them is undesirable, particularly in long-horizon settings, where the number of possible trajectories grows rapidly, and low-confidence correct traces are more likely to be coincidental. To address this, we introduce the Filtered Reasoning Score (FRS), which computes reasoning quality using only the top-K% most confident traces. Evaluating with FRS, models that are indistinguishable under standard accuracy exhibit significant differences in reasoning quality. Moreover, models with higher FRS on one benchmark tend to perform better on other reasoning benchmarks, in both accuracy and reasoning quality. Together, these findings suggest that FRS complements accuracy by capturing a model’s transferable reasoning capabilities. We open source our evaluation codebase: this https URL.
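
The top-K% aggregation can be sketched in a few lines: rank sampled traces by confidence, keep the top slice, and average their quality scores. The pair representation and rounding rule are assumptions for illustration, not the paper's exact specification:

```python
def filtered_reasoning_score(traces, top_k_pct=20):
    """Average the quality scores of only the top-K% most confident traces.

    Each trace is a (confidence, quality_score) pair; confidence might be,
    e.g., mean token log-probability, and quality_score an aggregate over
    faithfulness, coherence, utility, and factuality.
    """
    ranked = sorted(traces, key=lambda t: t[0], reverse=True)
    k = max(1, round(len(ranked) * top_k_pct / 100))
    top = ranked[:k]
    return sum(quality for _, quality in top) / len(top)
```

Filtering before averaging is what keeps coincidentally correct, low-confidence trajectories from inflating the score.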

[NLP-86] INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents ACL2026

[Quick Read]: This paper addresses the evaluation and performance bottlenecks of table visual question answering (Table VQA) on cross-lingual document images, especially for low-resource languages and structurally complex tables. The key contribution is INDOTABVQA, a benchmark of real-world document images in multiple visual styles with questions in Bahasa Indonesia, English, Hindi, and Arabic. Targeted fine-tuning (a compact 3B model and a LoRA-finetuned 7B model) and supplying table region coordinates as a spatial prior substantially improve vision-language model (VLM) accuracy on cross-lingual document understanding, with gains of up to 17.8%, validating domain-specific data and structure-aware training strategies.

Link: https://arxiv.org/abs/2604.11970
Authors: Somraj Gautam, Anathapindika Dravichi, Gaurav Harit
Affiliations: IIT Jodhpur; Punjabi University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted in ACL 2026 (Findings)

Click to view abstract

Abstract:We introduce INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (VQA) on real-world document images in Bahasa Indonesia. The dataset comprises 1,593 document images across three visual styles (bordered, borderless, and colorful) with one or more than one tables, and 1,593 question-answer sets in four languages: Bahasa Indonesia, English, Hindi, and Arabic. This enables evaluation of Vision-Language Models (VLMs) in both monolingual (Bahasa documents with Bahasa questions) and cross-lingual settings (Bahasa documents with questions in other languages). We benchmark leading open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o and reveal substantial performance gaps, particularly on structurally complex tables and in low-resource languages. Fine-tuning a compact 3B and LoRA-finetuned 7B model on our dataset yields 11.6% and 17.8% improvements in accuracy. Providing explicit table region coordinates as additional input further improves performance by 4-7%, demonstrating the value of Spatial priors for table-based reasoning. Our findings underscore the importance of language-diverse, domain-specific datasets and demonstrate that targeted fine-tuning can significantly enhance VLM performance on specialized document understanding tasks. INDOTABVQA provides a valuable resource for advancing research in cross-lingual, structure-aware document understanding, especially in underrepresented regions of the world. Full dataset can be accessed in huggingface at: this https URL

[NLP-87] AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

[Quick Read]: This paper addresses the fact that bug reports produced by current LLM-based code agents are static hypotheses requiring manual validation, which limits the practical value of automated bug detection. The core challenge is automatic, reliable validation of candidate reports by producing an executable proof-of-concept (PoC) as execution evidence. The key is AnyPoC, a general multi-agent framework with three mechanisms: (1) analyzing and fact-checking a candidate bug report; (2) iteratively synthesizing and executing a PoC while collecting execution traces for feedback; and (3) independently re-executing and scrutinizing the PoC to mitigate hallucination and reward hacking. AnyPoC also continuously builds and evolves a PoC knowledge base to handle heterogeneous tasks, and operates on bug reports from any source, significantly improving the automation and accuracy of bug detection.

Link: https://arxiv.org/abs/2604.11950
Authors: Zijie Zhao, Chenyuan Yang, Weidong Wang, Yihan Yang, Ziqi Zhang, Lingming Zhang
Affiliations: University of Illinois Urbana-Champaign
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Click to view abstract

Abstract:While recent LLM-based agents can identify many candidate bugs in source code, their reports remain static hypotheses that require manual validation, limiting the practicality of automated bug detection. We frame this challenge as a test generation task: given a candidate report, synthesizing an executable proof-of-concept test, or simply a PoC - such as a script, command sequence, or crafted input - to trigger the suspected defect. Automated PoC generation can act as a scalable validation oracle, enabling end-to-end autonomous bug detection by providing concrete execution evidence. However, naive LLM agents are unreliable validators: they are biased toward “success” and may reward-hack by producing plausible but non-functional PoCs or even hallucinated traces. To address this, we present AnyPoC, a general multi-agent framework that (1) analyzes and fact-checks a candidate bug report, (2) iteratively synthesizes and executes a PoC while collecting execution traces, and (3) independently re-executes and scrutinizes the PoC to mitigate hallucination and reward hacking. In addition, AnyPoC also continuously extracts and evolves a PoC knowledge base to handle heterogeneous tasks. AnyPoC operates on candidate bug reports regardless of their source and can be paired with different bug reporters. To demonstrate practicality and generality, we apply AnyPoC, with a simple agentic bug reporter, on 12 critical software systems across diverse languages/domains (many with millions of lines of code) including Firefox, Chromium, LLVM, OpenSSL, SQLite, FFmpeg, and Redis. Compared to the state-of-the-art coding agents, e.g., Claude Code and Codex, AnyPoC produces 1.3x more valid PoCs for true-positive bug reports and rejects 9.8x more false-positive bug reports. To date, AnyPoC has discovered 122 new bugs (105 confirmed, 86 already fixed), with 45 generated PoCs adopted as official regression tests.

[NLP-88] GoodPoint: Learning Constructive Scientific Paper Feedback from Author Responses

[Quick Read]: This paper asks how large language models (LLMs) can generate genuinely useful, actionable review feedback that helps authors improve their research and its presentation, augmenting rather than replacing human reviewers. The core challenge is quantifying feedback effectiveness and turning it into a training signal. The key is the GoodPoint recipe: first curate GoodPoint-ICLR, a dataset that uses author responses to annotate reviewer feedback along two dimensions, validity and author action; then fine-tune on valid and actionable feedback and apply preference optimization over both real and synthetic preference pairs. On a benchmark of 1.2K ICLR papers, the method raises the predicted success rate by 83.7% over the base model and outperforms similarly sized models, with an expert human study confirming higher practical value.

Link: https://arxiv.org/abs/2604.11924
Authors: Jimin Mun, Chani Jung, Xuhui Zhou, Hyunwoo Kim, Maarten Sap
Affiliations: Carnegie Mellon University; Independent Researcher; NVIDIA
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 22 pages, 6 figures

Click to view abstract

Abstract:While LLMs hold significant potential to transform scientific research, we advocate for their use to augment and empower researchers rather than to automate research without human oversight. To this end, we study constructive feedback generation, the task of producing targeted, actionable feedback that helps authors improve both their research and its presentation. In this work, we operationalize the effectiveness of feedback along two author-centric axes: validity and author action. We first curate GoodPoint-ICLR, a dataset of 19K ICLR papers with reviewer feedback annotated along both dimensions using author responses. Building on this, we introduce GoodPoint, a training recipe that leverages success signals from author responses through fine-tuning on valid and actionable feedback, together with preference optimization on both real and synthetic preference pairs. Our evaluation on a benchmark of 1.2K ICLR papers shows that a GoodPoint-trained Qwen3-8B improves the predicted success rate by 83.7% over the base model and sets a new state-of-the-art among LLMs of similar size in feedback matching on a golden human feedback set, even surpassing Gemini-3-flash in precision. We further validate these findings through an expert human study, demonstrating that GoodPoint consistently delivers higher practical value as perceived by authors.

[NLP-89] Mstar: Every Task Deserves Its Own Memory Harness

[Quick Read]: This paper addresses the limited adaptability of fixed memory architectures in large language model agents over extended interactions: a memory system optimized for one domain rarely transfers to others. The key is M^⋆, which automatically discovers task-optimized memory mechanisms through executable program evolution: a memory system is modeled as a Python program comprising a data Schema, Storage Logic, and agent Workflow Instructions, and these components are optimized jointly with a population-based, reflective code-evolution strategy that uses evaluation failures to iteratively refine candidate programs. Across conversation, embodied planning, and expert reasoning benchmarks, M^⋆ consistently outperforms fixed-memory baselines, and the evolved memory programs exhibit structurally distinct processing mechanisms across domains, validating task-specialized memory design.

Link: https://arxiv.org/abs/2604.11811
Authors: Wenbo Pan, Shujie Liu, Xiangyang Zhou, Shiwei Zhang, Wanlu Shi, Mirror Xu, Xiaohua Jia
Affiliations: City University of Hong Kong; Microsoft
Subjects: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint

Abstract:Large language model agents rely on specialized memory systems to accumulate and reuse knowledge during extended interactions. Recent architectures typically adopt a fixed memory design tailored to specific domains, such as semantic retrieval for conversations or skills reused for coding. However, a memory system optimized for one purpose frequently fails to transfer to others. To address this limitation, we introduce M^⋆, a method that automatically discovers task-optimized memory harnesses through executable program evolution. Specifically, M^⋆ models an agent memory system as a memory program written in Python. This program encapsulates the data Schema, the storage Logic, and the agent workflow Instructions. We optimize these components jointly using a reflective code evolution method; this approach employs a population-based search strategy and analyzes evaluation failures to iteratively refine the candidate programs. We evaluate M^⋆ on four distinct benchmarks spanning conversation, embodied planning, and expert reasoning. Our results demonstrate that M^⋆ improves performance over existing fixed-memory baselines robustly across all evaluated tasks. Furthermore, the evolved memory programs exhibit structurally distinct processing mechanisms for each domain. This finding indicates that specializing the memory mechanism for a given task explores a broad design space and provides a superior solution compared to general-purpose memory paradigms.

[NLP-90] Back to Basics: Let Conversational Agents Remember with Just Retrieval and Generation

[Quick Read]: This paper targets the performance degradation of existing conversational memory systems under context dilution as dialogue history grows. The study finds the root cause lies not in memory architecture but in a Signal Sparsity Effect within the latent knowledge manifold: Decisive Evidence Sparsity, where relevant signals become increasingly isolated over long sessions, and Dual-Level Redundancy, where large amounts of non-informative content degrade generation quality. The key is a minimalist framework whose core mechanisms are Turn Isolation Retrieval (TIR) and Query-Driven Pruning (QDP): TIR replaces global aggregation with a max-activation strategy to capture turn-level signals, while QDP removes redundant sessions and conversational filler to build a compact, high-density evidence set, markedly improving the efficiency and robustness of conversational memory.

Link: https://arxiv.org/abs/2604.11628
Authors: Yuqian Wu, Wei Chen, Zhengjun Huang, Junle Chen, Qingxiang Liu, Kai Wang, Xiaofang Zhou, Yuxuan Liang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; National University of Singapore
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 23 pages, 12 figures

Click to view abstract

Abstract:Existing conversational memory systems rely on complex hierarchical summarization or reinforcement learning to manage long-term dialogue history, yet remain vulnerable to context dilution as conversations grow. In this work, we offer a different perspective: the primary bottleneck may lie not in memory architecture, but in the Signal Sparsity Effect within the latent knowledge manifold. Through controlled experiments, we identify two key phenomena: Decisive Evidence Sparsity, where relevant signals become increasingly isolated with longer sessions, leading to sharp degradation in aggregation-based methods; and Dual-Level Redundancy, where both inter-session interference and intra-session conversational filler introduce large amounts of non-informative content, hindering effective generation. Motivated by these insights, we propose \method, a minimalist framework that brings conversational memory back to basics, relying solely on retrieval and generation via Turn Isolation Retrieval (TIR) and Query-Driven Pruning (QDP). TIR replaces global aggregation with a max-activation strategy to capture turn-level signals, while QDP removes redundant sessions and conversational filler to construct a compact, high-density evidence set. Extensive experiments on multiple benchmarks demonstrate that \method achieves robust performance across diverse settings, consistently outperforming strong baselines while maintaining high efficiency in tokens and latency, establishing a new minimalist baseline for conversational memory.
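
The max-activation idea behind Turn Isolation Retrieval can be sketched as follows: score each session by its single best-matching turn rather than by an aggregate over all turns, so an isolated decisive turn is not diluted by its neighbors. The embeddings and the plain dot-product scorer are toy stand-ins for the real retriever:

```python
def retrieve_sessions(query, sessions, top_n=1):
    """Rank sessions by their best-matching *turn* (max activation) instead
    of aggregating over all turns.

    `sessions` maps a session id to a list of turn embeddings; `query` is
    the query embedding. Returns the ids of the top `top_n` sessions.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    scores = {sid: max(dot(query, turn) for turn in turns)
              for sid, turns in sessions.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

With mean-pooling instead of `max`, a single relevant turn buried in a long session would be averaged away, which is exactly the failure mode the abstract attributes to aggregation-based methods.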

[NLP-91] ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval

[Quick Read]: This paper addresses the excessive key-value (KV) cache memory footprint when large language models (LLMs) generate long intermediate reasoning for complex tasks. Prior methods compress only the input context and keep the full KV cache for decoding, so computational and memory costs grow sharply for long outputs. The key is ZoomR, which adaptively compresses verbose reasoning into summaries and uses a summary-based dynamic KV cache selection policy: summary keys serve as a coarse-grained index for initial retrieval, and the query then "zooms in" on the most important fine-grained details, enabling multi-granularity cache access. This hierarchical strategy avoids full-cache attention at every decoding step, reducing inference memory by more than 4x while maintaining reasoning performance.

Link: https://arxiv.org/abs/2604.10898
Authors: David H. Yang, Yuxuan Zhu, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Subhajit Chaudhury, Pin-Yu Chen
Affiliations: Rensselaer Polytechnic Institute; IBM Research
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) have shown great performance on complex reasoning tasks but often require generating long intermediate thoughts before reaching a final answer. During generation, LLMs rely on a key-value (KV) cache for autoregressive decoding. However, the memory footprint of the KV cache grows with output length. Prior work on KV cache optimization mostly focus on compressing the long input context, while retaining the full KV cache for decoding. For tasks requiring long output generation, this leads to increased computational and memory costs. In this paper, we introduce ZoomR, a novel approach that enables LLMs to adaptively compress verbose reasoning thoughts into summaries and uses a dynamic KV cache selection policy that leverages these summaries while also strategically “zooming in” on fine-grained details. By using summary keys as a coarse-grained index during decoding, ZoomR uses the query to retrieve details for only the most important thoughts. This hierarchical strategy significantly reduces memory usage by avoiding full-cache attention at each step. Experiments across math and reasoning tasks show that our approach achieves competitive performance compared to baselines, while reducing inference memory requirements by more than 4x. These results demonstrate that a multi-granularity KV selection enables more memory efficient decoding, especially for long output generation.
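
The coarse-to-fine selection can be sketched as a two-stage top-k: summary keys pick whole thoughts first, then only the token positions inside those thoughts compete. Precomputed score dictionaries stand in here for the real query-key attention scores:

```python
def zoom_retrieve(summary_scores, fine_scores, top_thoughts=2, top_tokens=2):
    """Two-stage KV selection: use summary-key scores as a coarse index to
    pick whole thoughts, then zoom in on the highest-scoring token positions
    inside only those thoughts.

    `summary_scores`: thought id -> query score against its summary key.
    `fine_scores`: thought id -> {token position -> fine-grained score}.
    Returns the selected token positions per picked thought.
    """
    picked = sorted(summary_scores, key=summary_scores.get,
                    reverse=True)[:top_thoughts]
    return {t: sorted(fine_scores[t], key=fine_scores[t].get,
                      reverse=True)[:top_tokens]
            for t in picked}
```

Only the selected positions' KV entries would be attended to at the current step, which is how the full-cache attention pass is avoided.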

[NLP-92] StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs ICLR2026

[Quick Read]: This paper addresses the instability of existing semantic speech tokenizers under meaning-irrelevant acoustic perturbations: even at high signal-to-noise ratios (SNRs) where speech is fully intelligible, output token sequences can change drastically, increasing the learning burden on downstream LLMs. The key is StableToken, a tokenizer with a consensus-driven mechanism: a multi-branch architecture processes the audio in parallel, and the branch representations are merged via a robust bit-wise voting mechanism into a single, stable token sequence, sharply reducing Unit Edit Distance (UED) and improving SpeechLLM robustness across a variety of tasks.

Link: https://arxiv.org/abs/2509.22220
Authors: Yuhan Song, Linhao Zhang, Chuhan Wu, Aiwei Liu, Wei Jia, Houfeng Wang, Xiao Zhou
Affiliations: Peking University; Tencent Inc.
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted to ICLR 2026

Click to view abstract

Abstract:Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks. Our code and model are publicly available at this https URL.
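
The bit-wise voting merge can be sketched directly: given the integer token codes produced by the parallel branches, set each output bit by strict majority. The code width and branch count are illustrative, not the paper's configuration:

```python
def bitwise_vote(branch_codes, num_bits):
    """Merge token codes from parallel quantizer branches into one consensus
    code: each bit of the output is set iff a strict majority of branches
    agree on it."""
    out = 0
    for b in range(num_bits):
        ones = sum((code >> b) & 1 for code in branch_codes)
        if 2 * ones > len(branch_codes):
            out |= 1 << b
    return out
```

With this merge, a perturbation that flips bits in a minority of branches leaves the consensus token unchanged, which is the source of the stability the abstract reports.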

Information Retrieval

[IR-0] Sparse Contrastive Learning for Content-Based Cold Item Recommendation SIGIR2026

[Quick Read]: This paper addresses the item cold-start problem in collaborative filtering (CF) recommenders: new items with no interaction data are hard to recommend. Existing methods map auxiliary item content (such as images or text descriptions) into a CF model's embedding space, but are limited by the fundamental information gap between CF signals and content features. The paper instead models cold items purely from content, without alignment to CF user or item embeddings, reframing cold-start prediction as item-item similarity: a content encoder projects items into a latent space where inter-item similarity correlates with user preferences. The core innovation is a sparse generalization of sampled softmax loss using the α-entmax family of activation functions, which zeroes gradients for uninformative negatives and so sharpens relevance estimation. The approach extends further via knowledge distillation, outperforms existing cold-start methods and standard sampled softmax in ranking accuracy, and shows advantages in equity of item outcomes.

Link: https://arxiv.org/abs/2604.12990
Authors: Gregor Meehan, Johan Pauwels
Affiliations: Queen Mary University of London
Subjects: Information Retrieval (cs.IR)
Comments: Accepted at SIGIR 2026

Click to view abstract

Abstract:Item cold-start is a pervasive challenge for collaborative filtering (CF) recommender systems. Existing methods often train cold-start models by mapping auxiliary item content, such as images or text descriptions, into the embedding space of a CF model. However, such approaches can be limited by the fundamental information gap between CF signals and content features. In this work, we propose to avoid this limitation with purely content-based modeling of cold items, i.e. without alignment with CF user or item embeddings. We instead frame cold-start prediction in terms of item-item similarity, training a content encoder to project into a latent space where similarity correlates with user preferences. We define our training objective as a sparse generalization of sampled softmax loss with the α-entmax family of activation functions, which allows for sharper estimation of item relevance by zeroing gradients for uninformative negatives. We then describe how this Sampled Entmax for Cold-start (SEMCo) training regime can be extended via knowledge distillation, and show that it outperforms existing cold-start methods and standard sampled softmax in ranking accuracy. We also discuss the advantages of purely content-based modeling, particularly in terms of equity of item outcomes.
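
The sparsity mechanism can be illustrated with the α = 2 member of the entmax family (sparsemax), which, unlike softmax, assigns exact zeros, and hence zero gradient, to low-scoring negatives. Below is the standard sorting-based algorithm; the paper's loss uses the general α-entmax family, so this is only a concrete special case:

```python
def sparsemax(scores):
    """alpha-entmax with alpha = 2 (sparsemax): a softmax-like mapping onto
    the probability simplex that zeroes out low-scoring entries entirely.

    Finds the threshold tau such that the positive parts of (z - tau) sum
    to 1, via the usual sort-and-scan procedure.
    """
    zs = sorted(scores, reverse=True)
    cum, tau = 0.0, 0.0
    for k, zk in enumerate(zs, start=1):
        cum += zk
        t = (cum - 1.0) / k
        if zk > t:  # support condition: 1 + k * z_(k) > cumulative sum
            tau = t
    return [max(z - tau, 0.0) for z in scores]
```

Negatives that fall below the threshold receive probability exactly zero, which is what "zeroing gradients for uninformative negatives" amounts to in the loss.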

[IR-1] Efficient Retrieval Scaling with Hierarchical Indexing for Large Scale Recommendation

[Quick Read]: This paper addresses the tension between efficiency and effectiveness when deploying large-scale foundational retrieval models: how to cut inference cost substantially while preserving accuracy. The key is to jointly learn a hierarchical index over the memory of the foundational retrieval model using cross-attention and residual quantization, enabling efficient yet exact search. The system is deployed in Meta's production environment, serving ad recommendations to billions of Facebook and Instagram users. The authors also find that intermediate index nodes correspond to a small set of high-quality data, and that fine-tuning on this set further improves inference performance, demonstrating the feasibility of "test-time training" in recommender systems.

Link: https://arxiv.org/abs/2604.12965
Authors: Dongqi Fu, Kaushik Rangadurai, Haiyu Lu, Yunchen Pu, Siyang Yuan, Minhui Huang, Yiqun Liu, Golnaz Ghasemiesfeh, Xingfeng He, Fangzhou Xu, Andrew Cui, Vidhoon Viswanathan, Lin Yang, Liang Wang, Jiyan Yang, Chonglin Sun
Affiliations: Meta
Subjects: Information Retrieval (cs.IR)
Comments: 11 pages, 5 figures

Click to view abstract

Abstract:The increase in data volume, computational resources, and model parameters during training has led to the development of numerous large-scale industrial retrieval models for recommendation tasks. However, effectively and efficiently deploying these large-scale foundational retrieval models remains a critical challenge that has not been fully addressed. Common quick-win solutions for deploying these massive models include relying on offline computations (such as cached user dictionaries) or distilling large models into smaller ones. Yet, both approaches fall short of fully leveraging the representational and inference capabilities of foundational models. In this paper, we explore whether it is possible to learn a hierarchical organization over the memory of foundational retrieval models. Such a hierarchical structure would enable more efficient search by reducing retrieval costs while preserving exactness. To achieve this, we propose jointly learning a hierarchical index using cross-attention and residual quantization for large-scale retrieval models. We also present its real-world deployment at Meta, supporting daily advertisement recommendations for billions of Facebook and Instagram users. Interestingly, we discovered that the intermediate nodes in the learned index correspond to a small set of high-quality data. Fine-tuning the model on this set further improves inference performance, concretizing the concept of “test-time training” within the recommendation system domain. We demonstrate these findings using both internal and public datasets with strong baseline comparisons and hope they contribute to the community’s efforts in developing the next generation of foundational retrieval models.
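
Residual quantization, one of the abstract's building blocks, encodes a vector as a sequence of codeword IDs: each level quantizes the residual left by the previous level, refining from coarse to fine. A minimal sketch with made-up random codebooks (the paper's jointly learned, cross-attention-based index is far more involved):

```python
import numpy as np

rng = np.random.default_rng(0)
# 3 quantization levels, each with 8 codewords of dimension 4 (toy sizes)
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]

def rq_encode(x, codebooks):
    """Greedy residual quantization: pick the nearest codeword per level."""
    ids, residual = [], np.asarray(x, dtype=float)
    for cb in codebooks:
        i = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        ids.append(i)
        residual = residual - cb[i]   # the next level quantizes what is left
    return ids, residual

def rq_decode(ids, codebooks):
    """Reconstruct by summing the chosen codeword at every level."""
    return sum(cb[i] for cb, i in zip(codebooks, ids))

x = rng.normal(size=4)
ids, residual = rq_encode(x, codebooks)
err = np.linalg.norm(x - rq_decode(ids, codebooks))
print(ids, err)  # reconstruction error equals the norm of the final residual
```

The sequence of IDs forms a coarse-to-fine path, which is what makes such codes natural node labels for a hierarchical retrieval index.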

[IR-2] Beyond Single-Dimension Novelty: How Combinations of Theory, Method, and Results-based Novelty Shape Scientific Impact

[Quick Read]: This paper asks how scientific impact is shaped by the multidimensional nature of scientific novelty: prior work has focused on single dimensions (theoretical, methodological, or results-based novelty) and has overlooked how different types of novelty jointly shape impact. The key to the solution is a deep-learning-based multidimensional classification framework: using the DeepSeek-V3 model, the Introduction sections of 15,322 Nature Communications articles are analyzed at a fine-grained level to classify novelty into three dimensions (theoretical, methodological, and results-based) and to identify their combinations (novelty configurations). Regression analysis shows that articles with results-based novelty alone receive more citations and are more likely to rank among the top 1% or top 10% highly cited papers than articles exhibiting all three novelty types, revealing nonlinear effects of multidimensional novelty configurations on knowledge diffusion.

Link: https://arxiv.org/abs/2604.12471
Authors: Yi Zhao,Yang Chenggang,Yuzhuo Wang,Tong Bao,Zhang Heng,Chengzhi Zhang
Affiliations: Unknown
Subjects: Digital Libraries (cs.DL); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: AII-EEKE 2026

Click to view abstract

Abstract:Scientific novelty drives advances at the research frontier, yet it is also associated with heightened uncertainty and potential resistance from incumbent paradigms, leading to complex patterns of scientific impact. Prior studies have primarily examined the relationship between a single dimension of novelty – such as theoretical, methodological, or results-based novelty – and scientific impact. However, because scientific novelty is inherently multidimensional, focusing on isolated dimensions may obscure how different types of novelty jointly shape impact. Consequently, we know little about how combinations of novelty types influence scientific impact. To this end, we draw on a dataset of 15,322 articles published in Nature Communications. Using the DeepSeek-V3 model, we classify articles into three novelty dimensions based on the content of their Introduction sections: theoretical novelty, methodological novelty, and results-based novelty. These dimensions may coexist within the same article, forming distinct novelty configurations. Scientific impact is measured using five-year citation counts and indicators of whether an article belongs to the top 1% or top 10% highly cited papers. Descriptive results indicate that results-based novelty alone and the simultaneous presence of all three novelty types are the dominant configurations in the sample. Regression results further show that articles with results-based novelty only receive significantly more citations and are more likely to rank among the top 1% and top 10% highly cited papers than articles exhibiting all three novelty types. These findings advance our understanding of how multidimensional novelty configurations shape knowledge diffusion.

[IR-3] Is Sliding Window All You Need? An Open Framework for Long-Sequence Recommendation

[Quick Read]: This paper addresses the common assumption that training recommenders on long interaction histories is impractical under realistic memory and latency constraints. The key to the solution is a complete end-to-end framework that implements industrial-style long-sequence training with sliding windows, together with two techniques: (1) a runtime-aware ablation study that quantifies the accuracy-compute trade-off across windowing regimes and strides, and (2) a novel k-shift embedding layer that supports million-scale vocabularies on commodity GPUs with negligible accuracy loss. The framework turns long-sequence training from a closed industrial practice into a reproducible, extensible open methodology, while delivering up to +6.04% MRR and +6.34% Recall@10 on the Retailrocket dataset at roughly 4x training-time overhead.

Link: https://arxiv.org/abs/2604.12372
Authors: Sayak Chakrabarty,Souradip Pal
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 8 pages, 2 figures

Click to view abstract

Abstract:Long interaction histories are central to modern recommender systems, yet training with long sequences is often dismissed as impractical under realistic memory and latency budgets. This work demonstrates that it is not only practical but also effective at academic scale. We release a complete, end-to-end framework that implements industrial-style long-sequence training with sliding windows, including all data processing, training, and evaluation scripts. Beyond reproducing prior gains, we contribute two capabilities missing from earlier reports: (i) a runtime-aware ablation study that quantifies the accuracy-compute frontier across windowing regimes and strides, and (ii) a novel k-shift embedding layer that enables million-scale vocabularies on commodity GPUs with negligible accuracy loss. Our implementation trains reliably on modest university clusters while delivering competitive retrieval quality (e.g., up to +6.04% MRR and +6.34% Recall@10 on Retailrocket) with \sim 4\times training-time overheads. By packaging a robust pipeline, reporting training time costs, and introducing an embedding mechanism tailored for low-resource settings, we transform long-sequence training from a closed, industrial technique into a practical, open, and extensible methodology for the community.
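
The sliding-window regime trades one long interaction history for many fixed-length training examples; the stride controls how much consecutive windows overlap, and hence total compute. A minimal sketch (the window and stride values are illustrative, not the paper's settings):

```python
def sliding_windows(seq, window, stride):
    """Split a long interaction history into fixed-length training windows."""
    if len(seq) <= window:
        return [list(seq)]
    out = []
    for start in range(0, len(seq) - window + 1, stride):
        out.append(list(seq[start:start + window]))
    # keep a tail window so the most recent interactions are never dropped
    if out[-1][-1] != seq[-1]:
        out.append(list(seq[-window:]))
    return out

history = list(range(10))  # a toy interaction history of 10 item IDs
print(sliding_windows(history, window=4, stride=3))
```

A smaller stride produces more (and more redundant) windows, which is exactly the accuracy-versus-compute frontier the paper's ablation study maps out.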

[IR-4] Deep Situation-Aware Interaction Network for Click-Through Rate Prediction RECSYS’23

[Quick Read]: This paper targets the limitations of existing user-behavior-sequence modeling for click-through rate (CTR) prediction on e-commerce platforms, namely that the rich situational information in user interactions (behavior type, time, location, etc.) is not fully exploited. The key to the solution is the concept of a "situation" and its corresponding "situational features", realized in the Deep Situation-Aware Interaction Network (DSAIN). DSAIN uses the reparameterization trick to reduce noise in the raw behavior sequences, learns situational feature representations via feature embedding parameterization and tri-directional correlation fusion, and obtains the final behavior-sequence embedding through heterogeneous situation aggregation, substantially improving CTR prediction accuracy. In online A/B tests after deployment on the Meituan food delivery platform, the model increased CTR by 2.70%, CPM by 2.62%, and GMV by 2.16%.

Link: https://arxiv.org/abs/2604.12298
Authors: Yimin Lv,Shuli Wang,Beihong Jin,Yisong Yu,Yapeng Zhang,Jian Dong,Yongkang Wang,Xingxing Wang,Dong Wang
Affiliations: Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Meituan
Subjects: Information Retrieval (cs.IR)
Comments: RecSys'23 Full Paper

Click to view abstract

Abstract:User behavior sequence modeling plays a significant role in Click-Through Rate (CTR) prediction on e-commerce platforms. Except for the interacted items, user behaviors contain rich interaction information, such as the behavior type, time, location, etc. However, so far, the information related to user behaviors has not yet been fully exploited. In the paper, we propose the concept of a situation and situational features for distinguishing interaction behaviors and then design a CTR model named Deep Situation-Aware Interaction Network (DSAIN). DSAIN first adopts the reparameterization trick to reduce noise in the original user behavior sequences. Then it learns the embeddings of situational features by feature embedding parameterization and tri-directional correlation fusion. Finally, it obtains the embedding of behavior sequence via heterogeneous situation aggregation. We conduct extensive offline experiments on three real-world datasets. Experimental results demonstrate the superiority of the proposed DSAIN model. More importantly, DSAIN has increased the CTR by 2.70%, the CPM by 2.62%, and the GMV by 2.16% in the online A/B test. Now, DSAIN has been deployed on the Meituan food delivery platform and serves the main traffic of the Meituan takeout app.

[IR-5] UniRec: Bridging the Expressive Gap between Generative and Discriminative Recommendation via Chain-of-Attribute

[Quick Read]: This paper addresses the limited expressive power of generative recommendation (GR): during decoding, GR relies only on compact Semantic ID (SID) tokens without explicit access to item-side features, making it less expressive than discriminative models, which can perform user-item crossing with direct feature access. The key to the solution is the UniRec framework with Chain-of-Attribute (CoA) at its core: each SID sequence is prefixed with structured attribute tokens (e.g., category, seller, brand) to recover item-side feature crossing. This attribute conditioning clusters items sharing the same attributes in the SID space, yielding a measurable per-step decoding entropy reduction H(s_k|s_{<k},a) < H(s_k|s_{<k}) and stabilizing beam search. Combined with a capacity-constrained residual quantization strategy and a Conditional Decoding Context (CDC), it mitigates token collapse and the Matthew effect across SID layers, ultimately outperforming the strongest baseline by 22.6% in HR@50 overall and 15.5% on high-value orders, with online A/B tests confirming significant business-metric gains.

Link: https://arxiv.org/abs/2604.12234
Authors: Ziliang Wang,Gaoyun Lin,Xuesi Wang,Shaoqiang Liang,Yili Huang,Weijie Bian
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Generative Recommendation (GR) reframes retrieval and ranking as autoregressive decoding over Semantic IDs (SIDs), unifying the multi-stage pipeline into a single model. Yet a fundamental expressive gap persists: discriminative models score items with direct feature access, enabling explicit user-item crossing, whereas GR decodes over compact SID tokens without item-side signal. We formalize this via Bayes’ theorem, showing ranking by p(y|f,u) is equivalent to ranking by p(f|y,u), which factorizes autoregressively over item features. This establishes that a generative model with full feature access is as expressive as its discriminative counterpart; any practical gap stems solely from incomplete feature coverage. We propose UniRec with Chain-of-Attribute (CoA) as its core mechanism. CoA prefixes each SID sequence with structured attribute tokens–category, seller, brand–before decoding the SID itself, recovering the item-side feature crossing that discriminative models exploit. Because items sharing identical attributes cluster in adjacent SID regions, attribute conditioning yields a measurable per-step entropy reduction H(s_k|s_{<k},a) < H(s_k|s_{<k}), narrowing the search space and stabilizing beam search trajectories. We further address two deployment challenges: Capacity-constrained SID introduces exposure-weighted capacity penalties into residual quantization to suppress token collapse and the Matthew effect across SID layers; Conditional Decoding Context (CDC) combines Task-Conditioned BOS with hash-based Content Summaries, injecting scenario-conditioned signals at each decoding step. A joint RFT and DPO framework aligns the model with business objectives beyond distribution matching. Experiments show UniRec outperforms the strongest baseline by +22.6% HR@50 overall and +15.5% on high-value orders, with online A/B tests confirming significant business metric gains.
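
The claimed per-step entropy reduction is an instance of the general fact that conditioning on an informative variable lowers Shannon entropy on average. A toy numeric check with a hypothetical joint distribution over an attribute a and a SID token s (the numbers are invented for illustration):

```python
import math

# hypothetical joint P(a, s): the attribute almost determines the SID region
joint = {("brandX", "s1"): 0.45, ("brandX", "s2"): 0.05,
         ("brandY", "s3"): 0.40, ("brandY", "s4"): 0.10}

def H(dist):
    """Shannon entropy in bits of a probability dictionary."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# marginal entropy H(s)
p_s = {}
for (a, s), p in joint.items():
    p_s[s] = p_s.get(s, 0.0) + p
h_s = H(p_s)

# conditional entropy H(s|a) = sum_a P(a) * H(s | a=a)
p_a = {}
for (a, s), p in joint.items():
    p_a[a] = p_a.get(a, 0.0) + p
h_s_given_a = 0.0
for a0, pa in p_a.items():
    cond = {s: p / pa for (a, s), p in joint.items() if a == a0}
    h_s_given_a += pa * H(cond)

print(h_s, h_s_given_a)  # conditioning on the attribute reduces entropy
```

In the paper's setting the conditioning variable is the decoded attribute prefix, and the reduced entropy is what narrows the SID search space for beam search.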

[IR-6] Thought-Retriever: Don't Just Retrieve Raw Data, Retrieve Thoughts for Memory-Augmented Agentic Systems

[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in effectively leveraging massive external knowledge when interacting with the world: constrained by context length, they cannot make full use of large numbers of retrieved data chunks. Existing retrieval-augmented methods fetch only a limited number of raw chunks, which falls short for complex tasks requiring long-range dependencies and deep knowledge integration. The key to the solution is Thought-Retriever, a model-agnostic algorithm whose core idea is to let the LLM fully reuse the intermediate reasoning ("thoughts") generated while solving past user queries: it filters meaningless and redundant content, organizes thoughts into a structured thought memory, and retrieves relevant past thoughts when handling new queries, enabling generation conditioned on external data of arbitrary length. This equips LLM-based agents with a self-evolving long-term memory whose capability grows with continued interaction.

Link: https://arxiv.org/abs/2604.12231
Authors: Tao Feng,Pengrui Han,Guanyu Lin,Ge Liu,Jiaxuan You
Affiliations: University of Illinois Urbana-Champaign; Carnegie Mellon University
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) have transformed AI research thanks to their powerful internal capabilities and knowledge. However, existing LLMs still fail to effectively incorporate the massive external knowledge when interacting with the world. Although retrieval-augmented LLMs are proposed to mitigate the issue, they are still fundamentally constrained by the context length of LLMs, as they can only retrieve top-K raw data chunks from the external knowledge base which often consists of millions of data chunks. Here we propose Thought-Retriever, a novel model-agnostic algorithm that helps LLMs generate output conditioned on arbitrarily long external data, without being constrained by the context length or number of retrieved data chunks. Our key insight is to let an LLM fully leverage its intermediate responses generated when solving past user queries (thoughts), filtering meaningless and redundant thoughts, organizing them in thought memory, and retrieving the relevant thoughts when addressing new queries. This effectively equips LLM-based agents with a self-evolving long-term memory that grows more capable through continuous interaction. Besides algorithmic innovation, we further meticulously prepare a novel benchmark, AcademicEval, which requires an LLM to faithfully leverage ultra-long context to answer queries based on real-world academic papers. Extensive experiments on AcademicEval and two other public datasets validate that Thought-Retriever remarkably outperforms state-of-the-art baselines, achieving an average increase of at least 7.6% in F1 score and 16% in win rate across various tasks. More importantly, we further demonstrate two exciting findings: (1) Thought-Retriever can indeed help LLM self-evolve after solving more user queries; (2) Thought-Retriever learns to leverage deeper thoughts to answer more abstract user queries.
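
Retrieving stored thoughts rather than raw chunks still reduces, at query time, to nearest-neighbor search in an embedding space. A minimal sketch with toy vectors standing in for thought embeddings (the encoder, filtering, and memory-organization policies are the paper's own and are not shown):

```python
import numpy as np

def top_k_thoughts(query_vec, memory, k=2):
    """Rank stored thoughts by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    scored = []
    for text, vec in memory:
        v = vec / np.linalg.norm(vec)
        scored.append((float(q @ v), text))
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

# toy thought memory: (thought text, hypothetical embedding)
memory = [
    ("summary of paper A's method", np.array([0.9, 0.1, 0.0])),
    ("failure analysis of approach B", np.array([0.0, 1.0, 0.2])),
    ("dataset statistics for C", np.array([0.1, 0.0, 1.0])),
]
print(top_k_thoughts(np.array([1.0, 0.2, 0.0]), memory, k=2))
```

The point of the paper is what goes *into* this memory: distilled intermediate reasoning rather than the millions of raw chunks a context window could never hold.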

[IR-7] AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning SIGIR2026

[Quick Read]: This paper addresses the safety of large language model (LLM) reasoning in retrieval-augmented generation (RAG) systems whose external knowledge base has been poisoned. Whereas existing attacks flood the corpus with many poisoned documents, this work proposes AdversarialCoT, a query-specific adversarial chain-of-thought attack whose key idea is to poison only a single document in the corpus and iteratively refine its content through interactions with the target LLM, progressively exposing and exploiting critical reasoning vulnerabilities. Experiments show that a single malicious document can significantly degrade LLM reasoning accuracy, revealing subtle yet high-impact security risks in RAG architectures.

Link: https://arxiv.org/abs/2604.12201
Authors: Hongru Song,Yu-An Liu,Ruqing Zhang,Jiafeng Guo,Maarten de Rijke,Yixing Fan,Xueqi Cheng
Affiliations: State Key Laboratory of AI Safety; Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; University of Amsterdam
Subjects: Information Retrieval (cs.IR)
Comments: 6 pages, accepted by SIGIR 2026 short paper

Click to view abstract

Abstract:Retrieval-augmented generation (RAG) enhances large language model (LLM) reasoning by retrieving external documents, but also opens up new attack surfaces. We study knowledge-base poisoning attacks in RAG, where an attacker injects malicious content into the retrieval corpus, which is then naturally surfaced by the retriever and consumed by the LLM during reasoning. Unlike prior work that floods the corpus with poisoned documents, we propose AdversarialCoT, a query-specific attack that poisons only a single document in the corpus. AdversarialCoT first extracts the target LLM’s reasoning framework to guide the construction of an initial adversarial chain-of-thought (CoT). The adversarial document is iteratively refined through interactions with the LLM, progressively exposing and exploiting critical reasoning vulnerabilities. Experiments on benchmark LLMs show that a single adversarial document can significantly degrade reasoning accuracy, revealing subtle yet impactful weaknesses. This study exposes security risks in RAG systems and provides actionable insights for designing more robust LLM reasoning pipelines.

[IR-8] AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs

[Quick Read]: This paper addresses the lack of high-quality annotated data for studying short- and long-term conversational memory in current large language models (LLMs): existing dialogue datasets generally lack memory grounding, overlook topic continuity, or depend on costly human annotation. The key to the solution is AgenticAI-DialogGen, a modular agent-based framework that generates persona-grounded, topic-guided conversations without human supervision by extracting knowledge graphs, identifying topics, and building speaker personas to simulate realistic dialogue; a QA module then automatically derives memory-grounded question-answer pairs from short- and long-term conversational histories. This yields a new dataset, TopicGuidedChat (TGC), in which long-term memory is encoded as speaker-specific knowledge graphs and short-term memory as newly generated topic-guided conversations; experiments show that LLMs fine-tuned on TGC perform markedly better on memory-grounded QA tasks.

Link: https://arxiv.org/abs/2604.12179
Authors: Manoj Madushanka Perera,Adnan Mahmood,Kasun Eranda Wijethilake,Quan Z. Sheng
Affiliations: Macquarie University
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 13 pages, 5 figures, 5 tables

Click to view abstract

Abstract:Recent advancements in Large Language Models (LLMs) have improved their ability to process extended conversational contexts, yet fine-tuning and evaluating short- and long-term memories remain difficult due to the absence of datasets that encode both short- and long-term conversational history. Existing conversational datasets lack memory grounding, overlook topic continuity, or rely on costly human annotation. To address these gaps, we introduce AgenticAI-DialogGen, a modular agent-based framework that generates persona-grounded and topic-guided conversations without human supervision. The framework uses LLM agents to extract knowledge graphs, identify topics, build speaker personas, and simulate topic-guided conversations from unstructured conversations. A QA module generates memory-grounded Question Answer (QA) pairs drawn from short- and long-term conversational histories. We also generated a new dataset entitled, TopicGuidedChat (TGC), where long-term memory is encoded as speaker-specific knowledge graphs and short-term memory as newly generated topic-guided conversations. Evaluations depict that AgenticAI-DialogGen yields higher conversational quality and LLMs fine-tuned on TGC dataset achieve improved performance on memory-grounded QA tasks.

[IR-9] Beyond Factual Grounding: The Case for Opinion-Aware Retrieval-Augmented Generation

[Quick Read]: This paper addresses the factual bias of current retrieval-augmented generation (RAG) systems: they over-favor objective, factual content while neglecting subjective information (opinions, sentiment, and diverse perspectives), leading to poor performance on subjective content such as social-media discussions and product reviews. Beyond limiting practical utility, this bias raises ethical risks such as echo chambers, systematic underrepresentation of minority voices, and opinion manipulation. The key to the solution is to distinguish, from the standpoint of uncertainty theory, between epistemic uncertainty and aleatoric uncertainty: the former can be reduced with evidence and suits factual queries, while the latter reflects genuine heterogeneity in human opinion and should be preserved to support opinion-aware retrieval. On this basis, the authors propose an Opinion-Aware RAG architecture featuring LLM-based opinion extraction, entity-linked opinion graphs, and opinion-enriched document indexing, enabling effective modeling of subjective content and improved retrieval diversity.

Link: https://arxiv.org/abs/2604.12138
Authors: Aditya Agrawal,Alwarappan Nakkiran,Darshan Fofadiya,Alex Karlsson,Harsha Aduri
Affiliations: Amazon.com
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 13 pages, Preprint under review

Click to view abstract

Abstract:RAG systems have transformed how LLMs access external knowledge, but we find that current implementations exhibit a bias toward factual, objective content, as evidenced by existing benchmarks and datasets that prioritize objective retrieval. This factual bias - treating opinions and diverse perspectives as noise rather than information to be synthesized - limits RAG systems in real-world scenarios involving subjective content, from social media discussions to product reviews. Beyond technical limitations, this bias poses risks to transparent and accountable AI: echo chamber effects that amplify dominant viewpoints, systematic underrepresentation of minority voices, and potential opinion manipulation through biased information synthesis. We formalize this limitation through the lens of uncertainty: factual queries involve epistemic uncertainty reducible through evidence, while opinion queries involve aleatoric uncertainty reflecting genuine heterogeneity in human perspectives. This distinction implies that factual RAG should minimize posterior entropy, whereas opinion-aware RAG must preserve it. Building on this theoretical foundation, we present an Opinion-Aware RAG architecture featuring LLM-based opinion extraction, entity-linked opinion graphs, and opinion-enriched document indexing. We evaluate our approach on e-commerce seller forum data, comparing an Opinion-Enriched knowledge base against a traditional baseline. Experiments demonstrate substantial improvements in retrieval diversity: +26.8% sentiment diversity, +42.7% entity match rate, and +31.6% author demographic coverage on entity-matched documents. Our results provide empirical evidence that treating subjectivity as a first-class citizen yields measurably more representative retrieval, a first step toward opinion-aware RAG. Future work includes joint optimization of retrieval and generation for distributional fidelity.

[IR-10] The Effect of Document Selection on Query-focused Text Analysis

[Quick Read]: This paper addresses a gap in research on how data-selection strategies affect the results of analyzing document collections, i.e., how to efficiently select documents relevant to a given research question under limited compute. The key to the solution is a systematic evaluation of seven selection methods (from random selection to hybrid retrieval) combined with four text-analysis methods (LDA, BERTopic, TopicGPT, HiCode) on two datasets. The results indicate that semantic or hybrid retrieval are the stronger choices, avoiding the pitfalls of weaker strategies while sparing the unnecessary compute overhead of more complicated ones, thereby elevating data selection to an optimizable methodological decision rather than a mere practical constraint.

Link: https://arxiv.org/abs/2604.12099
Authors: Sandesh S Rangreji,Mian Zhong,Anjalie Field
Affiliations: Johns Hopkins University
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Analyses of document collections often require selecting what data to analyze, as not all documents are relevant to a particular research question and computational constraints preclude analyzing all documents, yet little work has examined effects of selection strategy choices. We systematically evaluate seven selection methods (from random selection to hybrid retrieval) on outputs from four text analyses methods (LDA, BERTopic, TopicGPT, HiCode) over two datasets with 26 open-ended queries. Our evaluation reveals practice guidance: semantic or hybrid retrieval offer strong go-to approaches that avoid the pitfalls of weaker selection strategies and the unnecessary compute overhead of more complicated ones. Overall, our evaluation framework establishes data selection as a methodological decision, rather than a practical necessity, inviting the development of new strategies.
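
One common way to realize the hybrid retrieval the study recommends is reciprocal rank fusion (RRF), which merges a lexical ranking and a semantic ranking without any score calibration. A minimal sketch (the document IDs are invented; k=60 is the constant suggested in the original RRF paper by Cormack et al., not a value from this study):

```python
def rrf(rankings, k=60):
    """Fuse ranked lists: each doc scores the sum of 1/(k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["d3", "d1", "d7"]      # e.g. a BM25 ordering
semantic = ["d1", "d5", "d3"]     # e.g. a dense-embedding ordering
print(rrf([lexical, semantic]))   # docs appearing in both lists rise to the top
```

Because only ranks enter the formula, RRF sidesteps the incompatible score scales of lexical and dense retrievers, which is much of why hybrid approaches are robust defaults.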

[IR-11] Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

[Quick Read]: This paper addresses the challenges PDF files pose for automated processing, particularly the difficulty of extracting information from heterogeneous content (text, tables, and images). While generative AI and retrieval-augmented generation (RAG) systems are promising, there has been no systematic study of how component design choices affect PDF-understanding performance. The key to the solution is to focus on question answering as a concrete language-understanding task and, using two benchmarks from the financial domain (including the newly built public dataset TableQuest), to systematically evaluate different PDF parsers and chunking strategies (with varying overlap) and their synergies, yielding practical guidance for building robust RAG pipelines.

Link: https://arxiv.org/abs/2604.12047
Authors: Omar El Bachyr,Yewei Song,Saad Ezzini,Jacques Klein,Tegawendé F. Bissyandé,Anas Zilali,Ulrick Ble,Anne Goujon
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 12 pages

Click to view abstract

Abstract:PDF files are primarily intended for human reading rather than automated processing. In addition, the heterogeneous content of PDFs, such as text, tables, and images, poses significant challenges for parsing and information extraction. To address these difficulties, both practitioners and researchers are increasingly developing new methods, including the promising Retrieval-Augmented Generation (RAG) systems to automated PDF processing. However, there is no comprehensive study investigating how different components and design choices affect the performance of a RAG system for understanding PDFs. In this paper, we propose such a study (1) by focusing on Question Answering, a specific language understanding task, and (2) by leveraging two benchmarks from the financial domain, including TableQuest, our newly generated, publicly available benchmark. We systematically examine multiple PDF parsers and chunking strategies (with varied overlap), along with their potential synergies in preserving document structure and ensuring answer correctness. Overall, our results offer practical guidelines for building robust RAG pipelines for PDF understanding.
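
The chunking strategies under evaluation vary chunk size and overlap; overlapping chunks keep sentences that straddle a boundary retrievable from both sides. A minimal word-level sketch (real pipelines typically chunk by tokens or by layout units such as paragraphs and table cells):

```python
def chunk_words(text, size, overlap):
    """Split text into word chunks of `size`, with `overlap` words shared."""
    assert 0 <= overlap < size
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
        start += size - overlap   # step forward, re-including `overlap` words
    return chunks

doc = "one two three four five six seven eight"
print(chunk_words(doc, size=4, overlap=1))
```

Larger overlap raises index size and retrieval redundancy, which is exactly the trade-off the paper's parser/chunker grid search quantifies.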

[IR-12] Constant-Factor Approximation for the Uniform Decision Tree

[Quick Read]: This paper resolves the long-standing open question of whether a constant-factor approximation algorithm exists for the Decision Tree problem under the uniform probability distribution over hypotheses. The authors give a simple polynomial-time algorithm with an approximation ratio of about 11.57, significantly improving on the previously best-known greedy algorithm's O(\log n / \log\log n) approximation. The solution rests on two key ideas: first, a decomposition technique from problems related to Hierarchical Clustering, which decomposes the optimal decision tree into a series of objects called separating subfamilies; second, a reduction of the Separating Subfamily subproblem to the Maximum Coverage problem, via an analysis of how cliques representing the pairs of hypotheses to be separated can be cut into small pieces. This yields a good approximation for the subproblem and hence an efficient approximation algorithm for the original problem.

Link: https://arxiv.org/abs/2604.12036
Authors: Michał Szyfelbein
Affiliations: Unknown
Subjects: Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 10 pages

Click to view abstract

Abstract:We resolve a long-standing open question about the existence of a constant-factor approximation algorithm for the average-case Decision Tree problem with uniform probability distribution over the hypotheses. We answer the question in the affirmative by providing a simple polynomial-time algorithm with approximation ratio of \frac{2}{1-\sqrt{(e+1)/(2e)}}+\epsilon < 11.57. This improves upon the currently best-known greedy algorithm, which achieves O(\log n/\log\log n)-approximation. The first key ingredient in our analysis is the usage of a decomposition technique known from problems related to Hierarchical Clustering [SODA '17, WALCOM '26], which allows us to decompose the optimal decision tree into a series of objects called separating subfamilies. The second crucial idea is to reduce the subproblem of finding a Separating Subfamily to an instance of the Maximum Coverage problem. To do so, we analyze the properties of cutting cliques into small pieces, which represent pairs of hypotheses to be separated. This allows us to obtain a good approximation for the Separating Subfamily problem, which then enables the design of the approximation algorithm for the original problem.
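
The approximation constant can be sanity-checked numerically: evaluating 2/(1 - sqrt((e+1)/(2e))), the ratio as reconstructed from the abstract, gives a value just under the stated 11.57 bound (before the +epsilon term):

```python
import math

e = math.e
ratio = 2.0 / (1.0 - math.sqrt((e + 1.0) / (2.0 * e)))
print(round(ratio, 2))  # the constant lies just below the quoted 11.57
```

This quick check is ours, not part of the paper; it only confirms that the displayed formula and the quoted constant are consistent.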

Human-Computer Interaction

[HC-0] PAL: Personal Adaptive Learner

[Quick Read]: This paper addresses the limitations of current AI-driven education platforms in personalized learning support: most systems offer only static adaptation (predefined quizzes, uniform pacing, or generic feedback) and cannot respond in real time to changes in a learner's understanding. The key to the solution is the PAL (Personal Adaptive Learner) platform, which continuously analyzes multimodal lecture content and dynamically adjusts question difficulty and interaction based on the learner's immediate responses, enabling real-time context awareness and adaptive decision-making during instruction, and finally generating a personalized learning summary, moving AI education from static personalization toward dynamic, individualized support.

Link: https://arxiv.org/abs/2604.13017
Authors: Megha Chakraborty,Darssan L. Eswaramoorthi,Madhur Thareja,Het Riteshkumar Shah,Finlay Palmer,Aryaman Bahl,Michelle A Ihetu,Amit Sheth
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:AI-driven education platforms have made some progress in personalisation, yet most remain constrained to static adaptation–predefined quizzes, uniform pacing, or generic feedback–limiting their ability to respond to learners’ evolving understanding. This shortfall highlights the need for systems that are both context-aware and adaptive in real time. We introduce PAL (Personal Adaptive Learner), an AI-powered platform that transforms lecture videos into interactive learning experiences. PAL continuously analyzes multimodal lecture content and dynamically engages learners through questions of varying difficulty, adjusting to their responses as the lesson unfolds. At the end of a session, PAL generates a personalized summary that reinforces key concepts while tailoring examples to the learner’s interests. By uniting multimodal content analysis with adaptive decision-making, PAL contributes a novel framework for responsive digital learning. Our work demonstrates how AI can move beyond static personalization toward real-time, individualized support, addressing a core challenge in AI-enabled education.

[HC-1] GlintMarkers: Spatial Perception on XR Eyewear using Corneal Reflections

[Quick Read]: This paper addresses gaze-driven spatial perception on XR eyewear, i.e., how to use the device's built-in inward-facing cameras to identify and localize objects and positions in the environment. The core challenge is the camera's limited pixel budget, which makes it hard to extract information from small, low-contrast corneal reflections. The key to the solution is a passive retroreflective marker design that concentrates near-infrared light onto the cornea to produce bright "glint" patterns, strengthening the signal, together with a custom Perspective-n-Point (PnP) estimation framework adapted to corneal imaging that accurately estimates the orientation, distance, and unique identity of tagged objects.

Link: https://arxiv.org/abs/2604.12949
Authors: Seungjoo Lee,Vimal Mollyn,Chris Harrison,Justin Chan,Mayank Goel
Affiliations: Carnegie Mellon University
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:We present GlintMarkers, the first system to perform gaze-driven spatial perception using the inward-facing cameras on XR eyewear. Our key observation is that the cornea acts as a mirror that encodes both gaze direction and visual information about the environment in a small, low-contrast reflection. To extract spatial and semantic information from this reflection despite the camera’s limited pixel budget, we present a passive retroreflective marker design that concentrates reflected near-infrared light onto the cornea, producing bright glint patterns. We develop a custom Perspective-n-Point (PnP) estimation framework adapted to corneal imaging and perform orientation and distance estimation of tagged objects, as well as unique object identification.

[HC-2] Human Agency, Causality, and the Human-Computer Interface in High-Stakes Artificial Intelligence

[Quick Read]: This paper addresses the erosion of human agency in high-stakes artificial intelligence (AI) systems, arguing that the current AI-ethics discourse centered on "trustworthy" and "responsible" AI overlooks a more fundamental human-computer interaction (HCI) crisis: the loss of human causal control over systems. The key to the solution is a rigorous, nested Causal-Agency Framework (CAF) that integrates causal models, uncertainty quantification, and human-centered evaluation, aiming to restore, through interface design, human control over and understanding of AI decision processes, and thereby to address catastrophic interaction failures arising from the "double uncertainty" of the user and the probabilistic model.

Link: https://arxiv.org/abs/2604.12793
Authors: Georges Hattab
Affiliations: Center for Artificial Intelligence in Public Health Research (ZKI-PH), Robert Koch Institute; Department of Mathematics and Computer Science, Freie Universität Berlin
Subjects: Human-Computer Interaction (cs.HC)
Comments: 2026 CHI Workshop on Human-AI Interaction Alignment: Designing, Evaluating, and Evolving Value-Centered AI For Reciprocal Human-AI Futures

Click to view abstract

Abstract:Current discourse on Artificial Intelligence (AI) ethics, dominated by “trustworthy” and “responsible” AI, overlooks a more fundamental human-computer interaction (HCI) crisis: the erosion of human agency. This paper argues that the primary challenge of high-stakes AI systems is not trust, but the preservation of human causal control. We posit that “bad AI” will function as “bad UI,” a metaphor for catastrophic interface failures that misrepresent system state and lead to human error. Applying Marshall McLuhan’s media theory, AI can be framed as a technology of “augmentation” that simultaneously “amputates” the user’s direct perception of causality. This places the interface as the critical locus where a “double uncertainty”–that of the human user and that of the probabilistic model–must be mediated. We critique current Explainable AI (XAI) for its correlational focus and failure to represent uncertainty. We conclude by proposing a rigorous, nested Causal-Agency Framework (CAF) that integrates causal models, uncertainty quantification, and human-centered evaluation to restore agency at the interface.

[HC-3] A sequential explanatory mixed-methods study on the acceptance of a social robot for EFL speaking practice among Chinese primary school students: Insights from the Computers Are Social Actors (CASA) paradigm

[Quick Read]: This study addresses Chinese primary school students' acceptance of a social robot for English-as-a-foreign-language (EFL) speaking practice, with the core question being how functional and social factors jointly shape learners' intention to use it. The key to the solution is integrating the Technology Acceptance Model (TAM) with the Computers Are Social Actors (CASA) paradigm in a sequential explanatory mixed-methods design, which reveals that perceived enjoyment and ease of use are the strongest predictors, while social attributes such as warmth, anthropomorphism, and social presence significantly enhance enjoyment, underscoring the importance of embedding emotional and social intelligence in educational robot design to boost motivation and speaking confidence.

Link: https://arxiv.org/abs/2604.12789
Authors: Yiran Du,Jinlong Li,Huimin He,Chenghao Wang,Bin Zou
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:This study investigates Chinese primary school students’ acceptance of a social robot for English-as-a-foreign-language (EFL) speaking practice through a sequential explanatory mixed-methods design. Integrating the Technology Acceptance Model (TAM) and the Computers Are Social Actors (CASA) paradigm, the research explores both functional and social factors influencing learners’ behavioural intention to use the robot. Quantitative data from 436 students were analysed using structural equation modelling, followed by qualitative interviews with twelve students to interpret the findings. Results show that perceived enjoyment and ease of use are the strongest predictors of acceptance, while social attributes such as warmth, anthropomorphism, and social presence significantly enhance enjoyment. Perceived intelligence affects usefulness but not ease of use. The findings suggest that emotional and social engagement are central to young learners’ acceptance of educational robots, highlighting the importance of designing socially intelligent technologies that promote motivation and speaking confidence in EFL learning contexts.

[HC-4] From Imitation to Discrimination: Progressive Curriculum Learning for Robust Web Navigation

【速读】:该论文旨在解决文本驱动的网页代理(Text-based Web Agents)在真实世界HTML环境中鲁棒性不足的问题,具体表现为:标准监督微调(SFT)方法缺乏对密集页面中看似合理但错误元素的判别能力,且难以泛化到未见过的网站布局。其解决方案的关键在于构建了一个大规模、高质量的数据集Triton(590k实例)和一套渐进式训练课程,其中Triton通过结构-语义硬负样本挖掘(Structural-Semantic Hard Negative Mining)主动引入拓扑相似的干扰项,并借助双代理共识(Dual-Agent Consensus)管道合成跨域任务并严格验证;在此基础上,逐步训练出三个模型:Triton-SFT-32B用于基础模仿学习,Triton-ORPO-32B通过优势比偏好优化(Odds Ratio Preference Optimization)提升判别能力,Triton-GRPO-32B则采用组相对策略优化(Group Relative Policy Optimization)保障长程一致性。实证表明,Triton-GRPO-32B在Mind2Web基准上达到58.7%的步骤成功率,显著优于GPT-4.5(42.4%)和Claude-4.5(41.4%),证明了精心设计的数据训练范式比单纯增加参数规模更能提升网页导航性能。

链接: https://arxiv.org/abs/2604.12666
作者: Chuang Peng,Wei Zhang,Renshuai Tao,Xinhao Zhang,Jian Yang
机构: Beijing Jiaotong University (北京交通大学); Beihang University (北京航空航天大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 17 pages, 10 figures

点击查看摘要

Abstract:Text-based web agents offer computational efficiency for autonomous web navigation, yet developing robust agents remains challenging due to the noisy and heterogeneous nature of real-world HTML. Standard Supervised Fine-Tuning (SFT) approaches fail in two critical dimensions: they lack discrimination capabilities to reject plausible but incorrect elements in densely populated pages, and exhibit limited generalization to unseen website layouts. To address these challenges, we introduce the Triton dataset (590k instances) and a progressive training curriculum. Triton is constructed via Structural-Semantic Hard Negative Mining, which explicitly mines topologically similar distractors, and a Dual-Agent Consensus pipeline that synthesizes diverse cross-domain tasks with strict verification. Building upon this foundation, our progressive curriculum produces three models: Triton-SFT-32B for basic imitation, Triton-ORPO-32B for robust discrimination via Odds Ratio Preference Optimization, and Triton-GRPO-32B for long-horizon consistency through Group Relative Policy Optimization. Empirical evaluation on Mind2Web demonstrates that Triton-GRPO-32B achieves state-of-the-art performance among open-source models with 58.7% Step Success Rate, surpassing GPT-4.5 (42.4%) and Claude-4.5 (41.4%) by over 16%, validating that specialized data curriculum outweighs raw parameter scale for web navigation.
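
论文中“结构-语义硬负样本挖掘”的基本思路,是在候选 UI 元素里挑出与正确元素在 DOM 结构和文本语义上都最接近的干扰项。下面给出一个极简的示意性实现(非论文原方法,页面元素、相似度定义与权重均为假设),仅说明“结构相似 + 语义相似”的组合打分如何选出硬负样本:

```python
from difflib import SequenceMatcher

def hard_negatives(gold, candidates, k=2):
    """挑选与正确元素最接近的 k 个干扰项(硬负样本)。

    gold / candidates: (dom_path, text) 二元组,dom_path 为标签路径字符串。
    """
    def structural(a, b):               # DOM 路径的序列相似度
        return SequenceMatcher(None, a, b).ratio()

    def semantic(a, b):                 # 文本词袋的 Jaccard 重叠
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / max(len(sa | sb), 1)

    scored = []
    for cand in candidates:
        if cand == gold:
            continue
        score = 0.5 * structural(gold[0], cand[0]) + 0.5 * semantic(gold[1], cand[1])
        scored.append((score, cand))
    scored.sort(key=lambda t: -t[0])
    return [c for _, c in scored[:k]]

# 假想页面:同一表单内的相邻链接既结构相似又语义相近,是最难的负样本
elements = [
    ("html/body/div/nav/a", "Home"),
    ("html/body/div/main/form/button", "Search flights"),
    ("html/body/div/main/form/a", "Search hotels"),
    ("html/body/div/footer/a", "Contact us"),
]
gold = ("html/body/div/main/form/button", "Search flights")
negs = hard_negatives(gold, elements, k=2)
```

运行后,`negs[0]` 是同一表单内的 "Search hotels" 链接:它与目标按钮的 DOM 路径几乎一致、文本也共享 "Search" 一词,正是密集页面中模型最容易点错的那类元素。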

[HC-5] GraphTide: Augmenting Knowledge-Intensive Text with Progressive Nested Graph

【速读】:该论文旨在解决知识密集型文本(knowledge-intensive text)中实体与关系复杂、用户理解耗时耗力的问题。其核心挑战在于如何有效呈现跨句和句内实体间的关系,以降低认知负荷并提升阅读效率。解决方案的关键在于提出GraphTide,一种基于动画的渐进式嵌套实体-关系图可视化方法,通过按需分解的实体-关系管道构建多层嵌套图,并结合结构感知的力导向布局优化算法,使句子及其关联实体以动画形式逐步揭示,从而帮助用户在叙事推进过程中保持上下文连贯性,显著优于传统静态图表示和基于图的方法。

链接: https://arxiv.org/abs/2604.12624
作者: Xin Qian,Dazhen Deng,Zhaoping He,Xingbo Wang,Yuchen He,Yingcai Wu
机构: Zhejiang University (浙江大学); Bosch Research North America (博世北美研究中心); Bosch Center for Artificial Intelligence (BCAI) (博世人工智能中心)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Knowledge-intensive text usually contains fruitful entities and complex relationships, such as academic articles and scientific exposition. Reading and comprehending such texts often demands considerable time and mental effort to track the relationships between entities. To reduce the burden, we present GraphTide, a visualization technique that progressively constructs nested entity-relationship graphs with animation to support the understanding of complex text. Our method features an on-demand entity-relationship decomposition pipeline that constructs nested graphs to represent intra- and inter-sentence relationships. Moreover, we propose a structure-aware force-directed layout optimization algorithm to enhance structural clarity. Sentences and their associated entities are incrementally revealed through animated transitions, helping users maintain context as the narrative unfolds. A user study shows that GraphTide significantly improves users’ comprehension of knowledge-intensive texts compared to traditional graph-based techniques and static nested graph representations.
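
论文提出的是结构感知的力导向布局优化算法;这里仅用经典的“弹簧-电荷”力导向布局给出一个纯 Python 极简示意(非论文算法本身,参数与场景均为假设),说明“节点间斥力 + 沿边引力 + 限步长迭代”这一基本骨架:

```python
import math, random

def force_layout(nodes, edges, iters=200, k=1.0, step=0.05, seed=0):
    """极简力导向布局:所有节点对互斥、相连节点互吸。"""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    for _ in range(iters):
        disp = {n: [0.0, 0.0] for n in nodes}
        for a in nodes:                            # 斥力:f = k^2 / d
            for b in nodes:
                if a == b:
                    continue
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-6
                f = k * k / d
                disp[a][0] += f * dx / d
                disp[a][1] += f * dy / d
        for a, b in edges:                         # 引力:f = d^2 / k,沿边方向
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-6
            f = d * d / k
            disp[a][0] -= f * dx / d
            disp[a][1] -= f * dy / d
            disp[b][0] += f * dx / d
            disp[b][1] += f * dy / d
        for n in nodes:                            # 每步位移限幅,避免震荡
            dx, dy = disp[n]
            d = math.hypot(dx, dy) or 1e-6
            pos[n][0] += step * (dx / d) * min(d, 1.0)
            pos[n][1] += step * (dy / d) * min(d, 1.0)
    return pos

# 三角形 A-B-C 相连,D 为孤立节点:布局收敛后相连节点应明显更近
pos = force_layout(["A", "B", "C", "D"], [("A", "B"), ("B", "C"), ("C", "A")])
```

论文在此骨架之上加入的结构感知约束(如嵌套图的层级关系),本质上是往 `disp` 里叠加额外的约束力项。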

[HC-6] Designing for Error Recovery in Human-Robot Interaction

【速读】:该论文试图解决的问题是当前机器人人工智能(Artificial Intelligence, AI)系统在面对复杂、连续且交互性强的现实环境时,缺乏自我纠错与持续学习能力,导致其性能难以超越人类基准。现有系统多聚焦于单次决策优化,而忽视了人类在错误中恢复和学习的能力。解决方案的关键在于构建能够检测并自主恢复自身错误的AI系统,文中以机器人核防护手套箱(nuclear glovebox)为应用场景,探讨如何设计具备容错与自适应能力的架构,从而提升系统在真实场景中的鲁棒性和成功率。

链接: https://arxiv.org/abs/2604.12473
作者: Christopher D. Wallbridge,Erwin Jose Lopez Pulgarin
机构: Cardiff University (卡迪夫大学); The University of Manchester (曼彻斯特大学)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This position paper looks briefly at the way we attempt to program robotic AI systems. Many AI systems are based on the idea of trying to improve the performance of one individual system to beyond so-called human baselines. However, these systems often look at one shot and one-way decisions, whereas the real world is more continuous and interactive. Humans, however, are often able to recover from and learn from errors - enabling a much higher rate of success. We look at the challenges of building a system that can detect/recover from its own errors, using the example of robotic nuclear gloveboxes as a use case to help illustrate examples. We then go on to talk about simple starting designs.

[HC-7] Responsible Trauma Research: Designing Effective and Sustainable Virtual Reality Exposure Studies

【速读】:该论文旨在解决复杂创伤后应激障碍(Complex Post-Traumatic Stress Disorder, C-PTSD)治疗中虚拟现实暴露疗法(Virtual Reality Exposure Therapy, VRET)的可行性与安全性问题,尤其是针对C-PTSD患者个体化创伤触发因素难以识别和安全实施的挑战。其解决方案的关键在于:首先,通过小样本可行性研究发现,简单物体即可达到与复杂场景相当的治疗效果,降低了技术实现难度;其次,强调VRET的设计过程本身成为治疗的一部分而非单纯准备阶段,从而增强干预的临床整合性;最后,提出需谨慎管理开发人员参与治疗过程可能引发的角色混淆与情感压力,确保所有利益相关者(包括患者、治疗师和开发者)的安全与边界清晰,为未来安全、以患者为中心的VRET研究提供方法学指导。

链接: https://arxiv.org/abs/2604.12349
作者: Annalisa Degenhard,Sophia Ppali,Fotis Liarokapis,Enrico Rukzio,Jennifer Spohrs,Stefan Tschoeke
机构: Ulm University (乌尔姆大学); CYENS Centre of Excellence (CYENS卓越中心); Department of Psychiatry, Psychotherapy and Psychotraumatology Military Medical Centre (精神病学、心理治疗和心理创伤学系军事医疗中心); Department for Child and Adolescent Psychiatry and Psychotherapy Ulm University Medical Centre (乌尔姆大学医学中心儿童与青少年精神病学与心理治疗科); Clinic for Psychiatry and Psychotherapy I (Weissenau) Ulm University (乌尔姆大学精神病学与心理治疗诊所I(魏森瑙)); Centre for Psychiatry Südwürttemberg (西南符腾堡精神病学中心)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Virtual reality exposure therapy (VRET) enables controlled exposure to trauma-related stimuli to facilitate memory access and emotional processing. However, the field remains underexplored for complex post-traumatic stress disorder (C-PTSD). Unlike single-trauma PTSD, C-PTSD requires highly individualized triggers that are difficult to identify and implement safely. We conducted a feasibility study with 11 patients, two trauma therapists, and a VR developer to explore integrating VRET into C-PTSD treatment while safeguarding all stakeholders. Initial findings indicate that simple objects can be just as effective as complex scenes, therapeutic success does not correlate with VR presence levels, and the design process itself became integral to therapy rather than preparatory. However, involving developers in therapy sessions led to considerable emotional stress and role confusion, which required a cautious approach. Based on these insights, we provide methodological recommendations for safe and patient-centered VRET studies that balance therapeutic effectiveness with stakeholder safety across the research process.

[HC-8] Is Vibe Coding the Future? An Empirical Assessment of LLM Generated Codes for Construction Safety

【速读】:该论文旨在解决生成式 AI(Generative AI)在建筑安全领域应用中因“氛围编程”(vibe coding)带来的潜在风险问题,即非技术用户通过自然语言指令让大型语言模型(LLMs)生成可执行代码时,由于模型的不确定性导致代码虽能编译通过但存在数学逻辑错误的“静默失败”(silent failures)。研究发现,当前主流模型如GPT-4o-Mini在成功执行的脚本中仍存在高达56%的数学不准确率,且用户角色设定显著影响幻觉发生概率。解决方案的关键在于引入确定性AI封装器(deterministic AI wrappers)和严格的治理机制,以弥补LLMs在安全工程场景下的逻辑严谨性不足,从而保障人机协同系统在物理世界部署中的可靠性与安全性。

链接: https://arxiv.org/abs/2604.12311
作者: S M Jamil Uddin
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The emergence of vibe coding, a paradigm where non-technical users instruct Large Language Models (LLMs) to generate executable codes via natural language, presents both significant opportunities and severe risks for the construction industry. While empowering construction personnel such as the safety managers, foremen, and workers to develop tools and software, the probabilistic nature of LLMs introduces the threat of silent failures, wherein generated code compiles perfectly but executes flawed mathematical safety logic. This study empirically evaluates the reliability, software architecture, and domain-specific safety fidelity of 450 vibe-coded Python scripts generated by three frontier models, Claude 3.5 Haiku, GPT-4o-Mini, and Gemini 2.5 Flash. Utilizing a persona-driven prompt dataset (n=150) and a bifurcated evaluation pipeline comprising isolated dynamic sandboxing and an LLM-as-a-Judge, the research quantifies the severe limits of zero-shot vibe codes for construction safety. The findings reveal a highly significant relationship between user persona and data hallucination, demonstrating that less formal prompts drastically increase the AI’s propensity to invent missing safety variables. Furthermore, while the models demonstrated high foundational execution viability (~85%), this syntactic reliability actively masked logic deficits and a severe lack of defensive programming. Among successfully executed scripts, the study identified an alarming ~45% overall Silent Failure Rate, with GPT-4o-Mini generating mathematically inaccurate outputs in ~56% of its functional code. The results demonstrate that current LLMs lack the deterministic rigor required for standalone safety engineering, necessitating the adoption of deterministic AI wrappers and strict governance for cyber-physical deployments.
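
论文建议用“确定性 AI 封装器”来捕获这类静默失败。下面用一个虚构的坠落净空计算示意其思路(公式、系数与数值均为假设性示例,非论文原文或任何安全规范):生成的函数能正常执行,但因漏掉安全余量而产生数学错误,封装器通过与确定性参考实现比对将其标记为“静默失败”:

```python
def reference_fall_clearance(lanyard_m, decel_m, height_m, margin_m=0.6):
    """确定性参考实现(公式与系数均为虚构示例):
    所需坠落净空 = 系索长度 + 缓冲减速距离 + 人员身高 + 安全余量。"""
    return lanyard_m + decel_m + height_m + margin_m

def audit_generated_fn(generated_fn, cases, tol=1e-6):
    """确定性封装器:逐例比对生成代码与参考实现的输出。"""
    failures = []
    for args in cases:
        try:
            got = generated_fn(*args)
        except Exception as exc:         # 显式失败:直接崩溃,容易被发现
            failures.append((args, "crash", repr(exc)))
            continue
        want = reference_fall_clearance(*args)
        if abs(got - want) > tol:        # 静默失败:可执行但数学逻辑错误
            failures.append((args, "silent", got))
    return failures

# 模拟一段“能跑通但漏掉安全余量”的 LLM 生成代码
def llm_generated(lanyard_m, decel_m, height_m, margin_m=0.6):
    return lanyard_m + decel_m + height_m        # bug: 未加 margin_m

report = audit_generated_fn(llm_generated, [(1.8, 1.0, 1.75), (1.2, 1.1, 1.8)])
```

两个用例均被标记为 `"silent"`:这正是论文强调的风险形态,语法上的执行成功掩盖了安全计算中的逻辑缺陷。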

[HC-9] Dialogue Agents that Share Family Information to Strengthen Grandparent-Grandchild Relationships

【速读】:该论文旨在解决老年人社交孤立(social isolation)问题,即因交流机会减少和家庭关系弱化对心理健康造成的负面影响。其解决方案的关键在于设计并实现一个对话代理(dialogue agent),该代理通过在祖父母与孙辈之间共享日常信息,在两者间建立并强化情感联结,同时提供持续的对话机会。实验结果表明,该代理不仅提升了老年人与系统互动的意愿,还显著增强了祖孙之间的心理连接,并降低了双方的焦虑水平,验证了基于个人信息共享的对话代理在缓解老年人社交孤立方面的有效性。

链接: https://arxiv.org/abs/2604.12310
作者: Seiya Mitsuno,Midori Ban,Hiroshi Ishiguro,Yuichiro Yoshikawa
机构: Osaka University (大阪大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Social isolation among older adults has become a critical concern, as reduced opportunities for conversation and weakened family relationships negatively affect mental health. This study proposes a dialogue agent that supports older adults by fostering both a relationship with the agent and a relationship with their grandchild through sharing everyday information. The agent operates on a chatbot platform and engages in daily conversations with older adults and their grandchildren, exchanging information gathered from each party to enhance conversational engagement and social connection. We conducted a ten-day empirical experiment with 108 grandparent-grandchild pairs. The results suggest that older adults became more willing to interact with the proposed agent, which shared information about their grandchildren, and that the psychological connection between grandparents and grandchildren was strengthened. Furthermore, daily interactions with the agent were associated with reduced anxiety in both older adults and their grandchildren. These findings indicate that a dialogue agent that shares personal information can be an effective approach to supporting older adults by simultaneously offering conversational opportunities and promoting family connectedness. Overall, this study provides valuable insights into the design of dialogue agents that effectively address social isolation among older adults.

[HC-10] Socially Fluent Socially Awkward: Artificial Intelligence Relational Talk Backfires in Commercial Interactions

【速读】:该论文旨在解决的问题是:在商业交互中,随着人工智能(Artificial Intelligence, AI)社会 fluency(社交流畅性)的提升,消费者对AI系统中嵌入的非正式社交互动(如关系性对话,relational talk)的反应是否积极,以及这种反应背后的机制是什么。解决方案的关键在于识别出“预期违背(expectancy violation)”和“感知互动尴尬(perceived interaction awkwardness)”作为中介变量,揭示了AI关系性对话虽意图增强亲和力,却可能因违背用户对交易场景的预期而引发负面情绪;同时发现,当关系性对话与用户目标相关时(goal-relevant relational talk),可显著削弱这一负向效应,从而为优化人机交互设计提供理论依据与实践路径。

链接: https://arxiv.org/abs/2604.12206
作者: Stephanie Kwari Dharmaputri,Anish Nagpal,Greg Nyilasy,Jing Lei
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Increasingly socially fluent Artificial Intelligence (AI) technologies are being integrated into commercial interactions. As tools such as OpenAI’s assistant are integrated into platforms such as Shopify, Klarna, and Visa, understanding consumer responses to AI social features becomes essential. One such feature is relational talk, an informal and non-obligatory social communication embedded in transactional exchanges. Across four experiments, we find: 1) a negative main effect of AI relational talk on satisfaction, mediated by expectancy violation and perceived interaction awkwardness, and 2) goal-relevant relational talk to attenuate this effect. This paper extends the literature by challenging the assumption that increased social fluency will improve satisfaction, and highlights the complexity of integrating social features into AI systems. It also identifies awkwardness as a key emotional response and barrier to effective human-AI interaction, showing that even in the absence of real social repercussions, perceived awkwardness in AI-led commercial interactions can elicit negative responses.

[HC-11] Characterizing Resource Sharing Practices on Underground Internet Forum Synthetic Non-Consensual Intimate Image Content Creation Communities

【速读】:该论文旨在解决合成非自愿私密影像(Synthetic Non-Consensual Intimate Imagery, SNCII)恶意传播者在互联网论坛中组织协作、资源交换与知识传递的问题,揭示其生态系统的结构与运作机制。解决方案的关键在于通过整合分析多个在线社区(如 4chan 和 Reddit)的海量用户行为数据(共 282,154 条评论和 78,308 条帖子),识别出不同技术熟练度的参与者如何使用并共享用于内容生成和传播的多种资源,并发现专家向新手的知识转移是非法资源扩散的核心驱动力。基于此,研究指出当前 SNCII 监管体系中的关键漏洞,并提炼出若干可实施的干预节点以增强威慑力。

链接: https://arxiv.org/abs/2604.12190
作者: Bernardo B. P. Medeiros(1),Malvika Jadhav(1),Allison Lu(1),Tadayoshi Kohno(2),Vincent Bindschaedler(1),Kevin R. B. Butler(1) ((1) University of Florida, (2) Georgetown University)
机构: University of Florida (佛罗里达大学); Georgetown University (乔治城大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 20 pages, 6 figures, 11 tables

点击查看摘要

Abstract:Many malicious actors responsible for disseminating synthetic non-consensual intimate imagery (SNCII) operate within internet forums to exchange resources, strategies, and generated content across multiple platforms. Technically-sophisticated actors gravitate toward certain communities (e.g., 4chan), while lower-sophistication end-users are more active on others (e.g., Reddit). To characterize key stakeholders in the broader ecosystem, we perform an integrated analysis of multiple communities, analyzing 282,154 4chan comments and 78,308 Reddit submissions spanning 165 days between June and November 2025 to characterize involved actors, actions, and resources. We find: (a) that users with differing levels of technical sophistication employ and share a wide range of primary resources facilitating SNCII content creation as well as numerous secondary resources facilitating dissemination; and (b) that knowledge transfer between experts and newcomers facilitates propagation of these illicit resources. Based on our empirical analysis, we identify gaps in existing SNCII regulatory infrastructure and synthesize several critical intervention points for bolstering deterrence.

[HC-12] A longitudinal health agent framework

【速读】:该论文旨在解决当前人工智能(AI)代理在支持长期健康任务(如症状管理、行为改变和患者支持)中普遍存在的不足,即难以有效响应用户意图并建立持续的责任感。相较于以往针对长期需求的支持研究,现有AI系统往往缺乏持续的连贯性、目标对齐性和适应性,从而影响干预的有效性与安全性。解决方案的关键在于提出一个多层次的框架及相应的代理架构,通过整合适应性(adaptation)、连贯性(coherence)、连续性(continuity)和自主性(agency)四个核心要素,在多次交互中实现对用户健康轨迹的动态响应与个性化支持。该框架强调跨会话的语义一致性与目标对齐,确保AI代理能够维持有意义的参与度,并在时间维度上提供安全、个性化的决策支持。

链接: https://arxiv.org/abs/2604.12019
作者: Georgianna (Blue) Lin,Rencong Jiang,Noémie Elhadad,Xuhai “Orson” Xu
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 10 pages, 2 figures, 5 tables

点击查看摘要

Abstract:Although artificial intelligence (AI) agents are increasingly proposed to support potentially longitudinal health tasks, such as symptom management, behavior change, and patient support, most current implementations fall short of facilitating user intent and fostering accountability. This contrasts with prior work on supporting longitudinal needs, where follow-up, coherent reasoning, and sustained alignment with individuals’ goals are critical for both effectiveness and safety. In this paper, we draw on established clinical and personal health informatics frameworks to define what it would mean to orchestrate longitudinal health interactions with AI agents. We propose a multi-layer framework and corresponding agent architecture that operationalizes adaptation, coherence, continuity, and agency across repeated interactions. Through representative use cases, we demonstrate how longitudinal agents can maintain meaningful engagement, adapt to evolving goals, and support safe, personalized decision-making over time. Our findings underscore both the promise and the complexity of designing systems capable of supporting health trajectories beyond isolated interactions, and we offer guidance for future research and development in multi-session, user-centered health AI.

[HC-13] When Drawing Is Not Enough: Exploring Spontaneous Speech with Sketch for Intent Alignment in Multimodal LLM s

【速读】:该论文旨在解决早期设计构思阶段中,由于草图表达不充分导致用户意图难以被准确捕捉的问题。其核心挑战在于设计师在快速绘制草图时往往无法完全呈现其功能目标与创意意图,而口头表达则能补充这些隐含信息。解决方案的关键在于构建一个名为TalkSketchD的多模态数据集,该数据集同步记录了设计师在构思烤面包机时的自由手绘草图与伴随的自发语音,并利用多模态大语言模型(Multimodal Large Language Models, MLLMs)将草图与语音文本联合输入进行图像生成。实验表明,引入同时发生的语音信息可显著提升生成图像与设计师自述意图之间的一致性,尤其在形式、功能、体验及整体意图维度上均取得量化改善,证明了时序对齐的草图-语音数据能有效增强MLLMs对早期设计意图的理解能力。

链接: https://arxiv.org/abs/2604.11964
作者: Weiyan Shi,Dorien Herremans,Kenny Tsu Wei Choo
机构: Singapore University of Technology and Design(新加坡科技设计大学)
类目: Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: Accepted at DIS 2026 PWiP

点击查看摘要

Abstract:Early-stage design ideation often relies on rough sketches created under time pressure, leaving much of the designer’s intent implicit. In practice, designers frequently speak while sketching, verbally articulating functional goals and ideas that are difficult to express visually. We introduce TalkSketchD, a sketch-while-speaking dataset that captures spontaneous speech temporally aligned with freehand sketches during early-stage toaster ideation. To examine the dataset’s value, we conduct a sketch-to-image generation study comparing sketch-only inputs with sketches augmented by concurrent speech transcripts using multimodal large language models (MLLMs). Generated images are evaluated against designers’ self-reported intent using a reasoning MLLM as a judge. Quantitative results show that incorporating spontaneous speech significantly improves judged intent alignment of generated design images across form, function, experience, and overall intent. These findings demonstrate that temporally aligned sketch-and-speech data can enhance MLLMs’ ability to interpret user intent in early-stage design ideation.

计算机视觉

[CV-0] Lyra 2.0: Explorable Generative 3D Worlds

【速读】:该论文旨在解决大规模、复杂3D场景生成中视频模型在长镜头轨迹下出现的空间遗忘(spatial forgetting)和时间漂移(temporal drifting)问题,从而实现可探索的3D世界持久生成。其核心解决方案包括两个关键机制:一是通过维护每帧的3D几何结构用于信息路由(仅用于检索历史帧并建立目标视角的密集对应关系),而将外观合成完全依赖于生成先验,以缓解空间遗忘;二是引入自增强历史训练策略,使模型暴露于自身生成的退化输出中,学习纠正而非延续误差,从而抑制时间漂移。这两项改进显著延长了视频轨迹的3D一致性,支撑后续前向重建模型高效恢复高质量3D场景。

链接: https://arxiv.org/abs/2604.13036
作者: Tianchang Shen,Sherwin Bahmani,Kai He,Sangeetha Grama Srinivasan,Tianshi Cao,Jiawei Ren,Ruilong Li,Zian Wang,Nicholas Sharp,Zan Gojcic,Sanja Fidler,Jiahui Huang,Huan Ling,Jun Gao,Xuanchi Ren
机构: NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative capacity of video models with 3D outputs ready for real-time rendering and simulation. Scaling to large, complex environments requires 3D-consistent video generation over long camera trajectories with large viewpoint changes and location revisits, a setting where current video models degrade quickly. Existing methods for long-horizon generation are fundamentally limited by two forms of degradation: spatial forgetting and temporal drifting. As exploration proceeds, previously observed regions fall outside the model’s temporal context, forcing the model to hallucinate structures when revisited. Meanwhile, autoregressive generation accumulates small synthesis errors over time, gradually distorting scene appearance and geometry. We present Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing – retrieving relevant past frames and establishing dense correspondences with the target viewpoints – while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it. Together, these enable substantially longer and 3D-consistent video trajectories, which we leverage to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.
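
论文将逐帧 3D 几何仅用于“信息路由”:检索与目标视角重叠的历史帧,再交给生成先验合成外观。下面用纯 Python 给出一个极简示意(仅按视锥角粗略判断可见性,忽略遮挡与真实相机模型,场景与数值均为假设):

```python
import math

def visible_fraction(points, cam_pos, cam_dir, fov_deg=60.0):
    """目标视角下可见 3D 点的比例(cam_dir 需为单位向量,忽略遮挡)。"""
    half = math.radians(fov_deg) / 2
    n = 0
    for p in points:
        v = [p[i] - cam_pos[i] for i in range(3)]
        norm = math.sqrt(sum(c * c for c in v)) or 1e-9
        cosang = sum(v[i] * cam_dir[i] for i in range(3)) / norm
        if cosang >= math.cos(half):     # 与视线方向夹角在半视场角内
            n += 1
    return n / max(len(points), 1)

def retrieve_frames(frame_points, cam_pos, cam_dir, k=2):
    """按历史帧几何与目标视角的重叠度检索最相关的 k 帧,用于信息路由。"""
    scored = sorted(frame_points.items(),
                    key=lambda kv: -visible_fraction(kv[1], cam_pos, cam_dir))
    return [fid for fid, _ in scored[:k]]

# 假想场景:三段历史帧各覆盖不同区域,目标相机从原点朝 +x 方向看
frames = {
    "f0": [(5.0, 0.0, 0.0), (6.0, 0.5, 0.0)],    # 正前方
    "f1": [(0.0, 5.0, 0.0), (0.5, 6.0, 0.0)],    # 左侧
    "f2": [(-5.0, 0.0, 0.0)],                     # 身后
}
best = retrieve_frames(frames, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), k=1)
```

重访正前方区域时检索到的是 `"f0"`:几何只负责回答“该取哪些历史帧、对应关系在哪”,外观合成仍完全交给生成模型,这正是论文缓解空间遗忘的分工方式。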

[CV-1] Generative Refinement Networks for Visual Synthesis

【速读】:该论文旨在解决扩散模型(Diffusion Models)在视觉生成中计算效率低下以及自回归模型(Autoregressive Models, AR)因离散标记化导致的失真和误差累积问题。其核心解决方案是提出生成精炼网络(Generative Refinement Networks, GRN),关键在于引入一种理论上近乎无损的分层二进制量化(Hierarchical Binary Quantization, HBQ),有效突破了传统离散tokenization的瓶颈,实现了与连续表示相当的重建质量;同时,基于HBQ潜空间构建全局精炼机制,使生成过程类似人类艺术家逐步完善作品,从而实现复杂度感知的自适应步长生成,兼顾生成质量和效率。

链接: https://arxiv.org/abs/2604.13030
作者: Jian Han,Jinlai Liu,Jiahuan Wang,Bingyue Peng,Zehuan Yuan
机构: ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: code: this https URL

点击查看摘要

Abstract:While diffusion models dominate the field of visual generation, they are computationally inefficient, applying a uniform computational effort regardless of different complexity. In contrast, autoregressive (AR) models are inherently complexity-aware, as evidenced by their variable likelihoods, but are often hindered by lossy discrete tokenization and error accumulation. In this work, we introduce Generative Refinement Networks (GRN), a next-generation visual synthesis paradigm to address these issues. At its core, GRN addresses the discrete tokenization bottleneck through a theoretically near-lossless Hierarchical Binary Quantization (HBQ), achieving a reconstruction quality comparable to continuous counterparts. Built upon HBQ’s latent space, GRN fundamentally upgrades AR generation with a global refinement mechanism that progressively perfects and corrects artworks – like a human artist painting. Besides, GRN integrates an entropy-guided sampling strategy, enabling complexity-aware, adaptive-step generation without compromising visual quality. On the ImageNet benchmark, GRN establishes new records in image reconstruction (0.56 rFID) and class-conditional image generation (1.81 gFID). We also scale GRN to more challenging text-to-image and text-to-video generation, delivering superior performance on an equivalent scale. We release all models and code to foster further research on GRN.
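
HBQ 的“近乎无损”性质可以用逐层二值化残差来粗略体会。下面是一个假设性的玩具实现(与论文的实际量化器无关,仅说明“每层只存符号位 + 一个尺度,逐层细化残差”为何能快速逼近连续表示):

```python
def hbq_encode(x, levels=8):
    """逐层二值化:每层记录残差的符号位,并以残差平均幅值作为该层尺度。"""
    residual = list(x)
    codes, scales = [], []
    for _ in range(levels):
        scale = sum(abs(v) for v in residual) / len(residual)
        bits = [1 if v >= 0 else -1 for v in residual]
        codes.append(bits)
        scales.append(scale)
        residual = [v - scale * b for v, b in zip(residual, bits)]
    return codes, scales

def hbq_decode(codes, scales, dim):
    """解码:按层累加 尺度 × 符号位。"""
    y = [0.0] * dim
    for bits, scale in zip(codes, scales):
        y = [v + scale * b for v, b in zip(y, bits)]
    return y

x = [0.9, -0.3, 0.05, -0.7]
codes, scales = hbq_encode(x, levels=12)
x_hat = hbq_decode(codes, scales, len(x))
err = max(abs(a - b) for a, b in zip(x, x_hat))   # 随层数增加迅速趋近 0
```

单层编码的误差明显大于多层编码,体现了层级结构对重建质量的贡献;论文正是在这类分层二进制潜空间上做全局精炼式生成的。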

[CV-2] Visual Preference Optimization with Rubric Rewards

【速读】:该论文旨在解决多模态任务中偏好优化(Preference Optimization)依赖粗粒度或离线扰动数据导致的细粒度视觉推理能力不足的问题。现有方法常使用基于结果的信号或非策略扰动来构建偏好数据,难以捕捉图像理解中的细微质量差异。其解决方案的关键在于提出rDPO框架,通过为每个图像-指令对构建特定于实例的评分清单(rubric),以结构化方式定义基础与附加评判标准,从而实现策略内(on-policy)偏好数据的高质量生成,并在奖励建模和下游任务中显著提升性能,验证了基于准则级反馈的偏好优化对视觉推理的有效性。

链接: https://arxiv.org/abs/2604.13029
作者: Ya-Qi Yu,Fangyu Hong,Xiangyang Qu,Hao Wang,Gaojie Wu,Qiaoyu Luo,Nuo Xu,Huixin Wang,Wuheng Xu,Yongxin Liao,Zihao Chen,Haonan Li,Ziming Li,Dezhi Peng,Minghui Liao,Jihao Wu,Haoyu Ren,Dandan Tu
机构: Huawei Technologies Co., Ltd. (华为技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria to score responses from any possible policies. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting massively improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering drops it to 75.82 from 81.14. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the 59.48 base model. Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.
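
rDPO 的核心是为每个图文对构建“清单式”rubric(必备判据 + 附加判据),据此为 on-policy 回答打分并构造偏好对。下面是一个假设性的玩具示意(判据用简单的字符串谓词代替真实的多模态评判,示例指令与回答均为虚构):

```python
def rubric_score(checks, response):
    """按实例级评分清单打分:必备项缺一即判零分,附加项按比例加分。
    checks: {"essential": [谓词...], "additional": [谓词...]},谓词为 response -> bool。
    """
    if not all(c(response) for c in checks["essential"]):
        return 0.0
    bonus = sum(1.0 for c in checks["additional"] if c(response))
    return 1.0 + bonus / max(len(checks["additional"]), 1)

def build_preference_pair(checks, responses):
    """对同一图文指令的多条 on-policy 回答打分,取最高/最低分构成偏好对。"""
    ranked = sorted(responses, key=lambda r: rubric_score(checks, r), reverse=True)
    return ranked[0], ranked[-1]        # (chosen, rejected)

# 假想指令:描述图中的停止标志,rubric 要求提到标志并给出颜色与数量
checks = {
    "essential": [lambda r: "stop sign" in r.lower()],
    "additional": [lambda r: "red" in r.lower(), lambda r: "two" in r.lower()],
}
answers = [
    "There are two red stop signs at the intersection.",
    "A stop sign is visible.",
    "The street is empty.",
]
chosen, rejected = build_preference_pair(checks, answers)
```

与只看最终对错的 outcome 信号相比,这种判据级打分能区分“提到了标志但漏了细节”与“完全答错”,这正是论文中 rubric 过滤优于 outcome 过滤的直观来源。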

[CV-3] Conflated Inverse Modeling to Generate Diverse and Temperature-Change Inducing Urban Vegetation Patterns CVPR2026

【速读】:该论文旨在解决城市热环境调控中“逆问题”(inverse problem)的挑战,即如何根据目标温度变化反推满足条件的植被空间配置方案。传统前向模型(forward model)可从植被分布预测地表温度,但无法有效处理多解性问题——即多种不同的植被格局可能产生相似的区域平均温度响应,且在数据稀缺条件下易生成平均化的单一解。其解决方案的关键在于提出一种融合预测型前向模型与基于扩散机制的生成式逆模型(diffusion-based generative inverse model)的混合框架,能够在保持对热响应结果精确控制的同时,生成多样且物理合理的植被空间分布图像,即使这些组合未出现在训练数据中也能实现可控生成。

链接: https://arxiv.org/abs/2604.13028
作者: Baris Sarper Tezcan,Hrishikesh Viswanath,Rubab Saher,Daniel Aliaga
机构: Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the CVPR 2026 EarthVision Workshop

点击查看摘要

Abstract:Urban areas are increasingly vulnerable to thermal extremes driven by rapid urbanization and climate change. Traditionally, thermal extremes have been monitored using Earth-observing satellites and numerical modeling frameworks. For example, land surface temperature derived from Landsat or Sentinel imagery is commonly used to characterize surface heating patterns. These approaches operate as forward models, translating radiative observations or modeled boundary conditions into estimates of surface thermal states. While forward models can predict land surface temperature from vegetation and urban form, the inverse problem of determining spatial vegetation configurations that achieve a desired regional temperature shift remains largely unexplored. This task is inherently underdetermined, as multiple spatial vegetation patterns can yield similar aggregated temperature responses. Conventional regression and deterministic neural networks fail to capture this ambiguity and often produce averaged solutions, particularly under data-scarce conditions. We propose a conflated inverse modeling framework that combines a predictive forward model with a diffusion-based generative inverse model to produce diverse, physically plausible image-based vegetation patterns conditioned on specific temperature goals. Our framework maintains control over thermal outcomes while enabling diverse spatial vegetation configurations, even when such combinations are absent from training data. Altogether, this work introduces a controllable inverse modeling approach for urban climate adaptation that accounts for the inherent diversity of the problem. Code is available at the GitHub repository.

[CV-4] Representation geometry shapes task performance in vision-language modeling for CT enterography

【速读】:该论文旨在解决腹部CT肠造影(CT enterography)在炎症性肠病(IBD)自动化分析中的表征选择问题,即如何优化视觉-语言模型的输入表示与聚合策略以提升分类、检索和报告生成性能。其关键解决方案包括:1)对比均值池化(mean pooling)与注意力池化(attention pooling)在不同任务上的表现差异,发现前者更适合类别疾病评估(三分类准确率59.2%),后者更优跨模态检索(text-to-image MRR=0.235);2)提出多窗RGB编码策略(multi-window RGB encoding),通过将不同HU窗口映射至RGB通道增强组织对比度,优于依赖多平面采样的空间覆盖扩展方式,并揭示增加冠状面和矢状面视图反而降低分类性能;3)引入基于三教师伪标签框架(three-teacher pseudolabel framework),无需专家标注即可实现公平比较,同时验证检索增强生成(RAG)可显著提升报告生成质量(误差MAE从0.98降至0.80–0.89)。这些发现为体积医学影像的视觉-语言建模提供了首个基准和实用指导。

链接: https://arxiv.org/abs/2604.13021
作者: Cristian Minoccheri,Emily Wittrup,Kayvan Najarian,Ryan Stidham
机构: University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies that increase spatial coverage through multiplanar sampling, and in this setting adding coronal and sagittal views reduces classification performance. For report generation, fine-tuning without retrieval context yields within-1 severity accuracy at the prevalence-matched chance level (70.4% vs. 71% random), suggesting little learned ordering beyond the class distribution. Retrieval-augmented generation (RAG) improves this across all configurations, scoring 7–14 percentage points above the chance baseline and improving ordinal MAE from 0.98 to 0.80–0.89. A three-teacher pseudolabel framework enables all comparisons without expert annotations. Together, these findings provide the first baselines for this underexplored modality and offer practical guidance for building vision-language systems for volumetric medical imaging.
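
“多窗 RGB 编码”的思想可以用几行代码说明:把三组不同的 HU 窗分别线性映射到 R/G/B 通道,使单张输入同时携带互补的组织对比度。以下窗位/窗宽仅为常见软组织窗、窄软组织窗和肺窗的假设取值,并非论文原配置:

```python
def window(hu, center, width):
    """单个 HU 值按窗位/窗宽线性映射并截断到 [0, 255]。"""
    lo, hi = center - width / 2.0, center + width / 2.0
    v = (hu - lo) / (hi - lo)
    return int(round(255 * min(max(v, 0.0), 1.0)))

def multi_window_rgb(hu_pixels, windows=((40, 400), (50, 150), (-600, 1500))):
    """多窗 RGB 编码示意:三组 (窗位, 窗宽) 分别映射到 R/G/B 通道。"""
    return [tuple(window(hu, c, w) for c, w in windows) for hu in hu_pixels]

# 空气、水、软组织、骨的近似 HU 值
pixels = multi_window_rgb([-1000, 0, 40, 300])
```

空气(-1000 HU)在两个软组织窗中完全为黑、仅在肺窗通道中可分辨,而骨(300 HU)在三个窗中均饱和:同一像素在三个通道上呈现不同的对比信息,这正是多窗编码优于增加平面视图的直观解释。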

[CV-5] See Point Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback

【速读】:该论文旨在解决生成式 AI 在密集编码界面(如集成开发环境 IDE)中进行像素级光标定位时的准确性问题,现有方法依赖单次坐标预测,缺乏错误纠正机制,在高密度 UI 中表现不佳。其解决方案的关键在于引入一种闭环式的迭代精化机制,通过利用前序操作的视觉反馈不断调整光标位置,从而实现自适应修正位移误差并应对动态界面变化,显著提升点击精度和任务成功率。

链接: https://arxiv.org/abs/2604.13019
作者: Himangi Mittal,Gaurav Mittal,Nelson Daniel Troncoso,Yu Hu
机构: Microsoft(微软); Carnegie Mellon University(卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces, where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: this https URL.
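
多轮精化的闭环可以抽象为“预测 → 观察偏移 → 反向修正”的循环。下面是一个与具体模型无关的玩具示意(目标坐标、初始预测与反馈函数均为虚构,真实系统中的 `observe` 对应从新截图中读取光标相对目标元素的偏移):

```python
def refine_click(predict, observe, tol=2.0, max_turns=5):
    """多轮光标精化:每轮用视觉反馈观测到的偏移 (dx, dy) 反向修正点击坐标。"""
    x, y = predict()                     # 首轮:单次坐标预测
    history = [(x, y)]
    for _ in range(max_turns):
        dx, dy = observe(x, y)           # 截图反馈:当前点击点相对目标的偏移
        if (dx * dx + dy * dy) ** 0.5 <= tol:
            return (x, y), history       # 已落入容差范围
        x, y = x - dx, y - dy            # 闭环修正位移误差
        history.append((x, y))
    return (x, y), history

# 虚构环境:真实目标在 (412, 96),首次预测系统性偏右下,反馈带 0.9 的读数增益
target = (412.0, 96.0)
def predict():
    return 420.0, 100.0
def observe(x, y):
    return (x - target[0]) * 0.9, (y - target[1]) * 0.9

final, history = refine_click(predict, observe)
```

单次预测落在目标约 9 像素之外,一轮反馈修正后即进入 2 像素容差内;这正是闭环式精化相对单次坐标预测的差别所在。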

[CV-6] Agentic Discovery with Active Hypothesis Exploration for Visual Recognition

【Quick Read】: This paper addresses the lack of systematic scientific inquiry in neural architecture search (NAS): conventional approaches mostly rely on random or heuristic search and struggle to yield interpretable, reusable design principles. The key to the solution is the HypoExplore framework, which casts architecture discovery as hypothesis-driven scientific inquiry: a large language model generates new hypotheses and evolves branches under a dual strategy of exploiting validated principles and resolving uncertain ones; a Trajectory Tree records architecture evolution paths while a Hypothesis Memory Bank dynamically updates hypothesis confidence, and multiple feedback agents analyze experimental results from different perspectives to guide subsequent iterations. This mechanism not only markedly improves lightweight vision architectures (e.g., from 18.91% to 94.11% accuracy on CIFAR-10) but also generalizes across tasks and datasets and transfers learned design principles, moving NAS from black-box trial-and-error toward interpretable scientific discovery.

Link: https://arxiv.org/abs/2604.12999
Authors: Jaywon Koo, Jefferson Hernandez, Ruozhen He, Hanjie Chen, Chen Wei, Vicente Ordonez
Affiliations: Rice University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We introduce HypoExplore, an agentic framework that formulates neural architecture discovery for visual recognition as a hypothesis-driven scientific inquiry. Given a human-specified high-level research direction, HypoExplore ideates, implements, evaluates, and improves neural architectures through evolutionary branching. New hypotheses are created using a large language model by selecting a parent hypothesis to build upon, guided by a dual strategy that balances exploiting validated principles with resolving uncertain ones. Our proposed framework maintains a Trajectory Tree that records the lineage of all proposed architectures, and a Hypothesis Memory Bank that actively tracks confidence scores acquired through experimental evidence. After each experiment, multiple feedback agents analyze the results from different perspectives and consolidate their findings into hypothesis confidence updates. Our framework is tested on discovering lightweight vision architectures on CIFAR-10, with the best achieving 94.11% accuracy evolved from a root node baseline that starts at 18.91%, and generalizes to CIFAR-100 and Tiny-ImageNet. We further demonstrate applicability to a specialized domain by conducting independent architecture discovery runs on MedMNIST, which yield a state-of-the-art performance. We show that hypothesis confidence scores grow increasingly predictive as evidence accumulates, and that the learned principles transfer across independent evolutionary lineages, suggesting that HypoExplore not only discovers stronger architectures, but can help build a genuine understanding of the design space.
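The hypothesis memory bank and the exploit/explore parent selection can be illustrated with a toy sketch. The confidence-update rule and scoring formula below are assumptions for illustration, not the paper's exact mechanism.

```python
# Illustrative sketch of a hypothesis memory bank: each hypothesis keeps a
# confidence score nudged by experimental evidence, and the next parent is
# chosen by balancing validated principles (high confidence) against
# uncertain ones (few trials).

class HypothesisBank:
    def __init__(self):
        self.conf = {}    # hypothesis id -> confidence in [0, 1]
        self.trials = {}  # hypothesis id -> number of experiments

    def add(self, hid, prior=0.5):
        self.conf[hid] = prior
        self.trials[hid] = 0

    def update(self, hid, success, lr=0.3):
        # Move confidence toward 1 on success, toward 0 on failure.
        target = 1.0 if success else 0.0
        self.conf[hid] += lr * (target - self.conf[hid])
        self.trials[hid] += 1

    def select_parent(self, explore=0.5):
        # Score = confidence (exploit) + bonus for untested hypotheses (explore).
        def score(hid):
            return self.conf[hid] + explore / (1 + self.trials[hid])
        return max(self.conf, key=score)

bank = HypothesisBank()
bank.add("wider-convs"); bank.add("deeper-stem")
bank.update("wider-convs", success=True)
print(bank.select_parent())
```

With one confirmed hypothesis and one untested hypothesis, the exploration bonus steers the next branch toward the uncertain one, which is the dual-strategy behavior the framework describes.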

[CV-7] AbdomenGen: Sequential Volume-Conditioned Diffusion Framework for Abdominal Anatomy Generation

【Quick Read】: This paper tackles the limited ability of current computational phantoms in medical imaging research to generate controlled, clinically meaningful abdominal anatomical variations. The key to the solution is the AbdomenGen framework, a sequential organ-generation approach based on volume-conditioned diffusion, together with a standardized residual metric, the Volume Control Scalar (VCS), which decouples organ volume from body habitus to enable interpretable volume modulation. Organs are generated sequentially, conditioned on the body mask and previously generated structures, preserving global anatomical coherence while supporting independent multi-organ control; across 11 abdominal organs the framework achieves high geometric fidelity (e.g., a liver Dice of 0.83 ± 0.05) and stable, distribution-aware controllable generation.

Link: https://arxiv.org/abs/2604.12969
Authors: Yubraj Bhandari, Lavsen Dahal, Paul Segars, Joseph Y. Lo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Computational phantoms are widely used in medical imaging research, yet current systems to generate controlled, clinically meaningful anatomical variations remain limited. We present AbdomenGen, a sequential volume-conditioned diffusion framework for controllable abdominal anatomy generation. We introduce the Volume Control Scalar (VCS), a standardized residual that decouples organ size from body habitus, enabling interpretable volume modulation. Organ masks are synthesized sequentially, conditioning on the body mask and previously generated structures to preserve global anatomical coherence while supporting independent, multi-organ control. Across 11 abdominal organs, the proposed framework achieves strong geometric fidelity (e.g., liver Dice 0.83 ± 0.05), stable single-organ calibration over [-3,+3] VCS, and disentangled multi-organ modulation. To showcase clinical utility with a hepatomegaly cohort selected from MERLIN, Wasserstein-based VCS selection reduces distributional distance of training data by 73.6%. These results demonstrate calibrated, distribution-aware anatomical generation suitable for controllable abdominal phantom construction and simulation studies.
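A VCS-style standardized residual can be sketched as follows: regress organ volume on a body-habitus covariate, then z-score the residual so organ size is decoupled from body size. The simple linear regression form and the toy numbers are assumptions for illustration; the paper's exact normalization is not specified here.

```python
# Hedged sketch of a standardized residual that decouples organ volume
# from body habitus (VCS-style): z-scored residual of organ ~ body OLS fit.
import math

def vcs_scores(body, organ):
    n = len(body)
    mb, mo = sum(body) / n, sum(organ) / n
    # Ordinary least squares slope/intercept for organ ~ body.
    slope = sum((b - mb) * (o - mo) for b, o in zip(body, organ)) / \
            sum((b - mb) ** 2 for b in body)
    intercept = mo - slope * mb
    resid = [o - (intercept + slope * b) for b, o in zip(body, organ)]
    std = math.sqrt(sum(r * r for r in resid) / n)
    return [r / std for r in resid]

body  = [60.0, 70.0, 80.0, 90.0]   # body-size proxy (toy values)
organ = [1.4, 1.6, 1.9, 2.0]       # liver volume in liters (toy values)
scores = vcs_scores(body, organ)
print([round(s, 2) for s in scores])
```

By construction the scores have zero mean and unit variance, so a value like +3 means "unusually large for this body size" rather than "large in absolute terms", which is what makes [-3, +3] modulation interpretable.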

[CV-8] Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations

【Quick Read】: This paper addresses the difficulty of balancing convergence speed, generalization, and computational efficiency in deep learning optimization, particularly the privacy and memory-efficiency bottlenecks that traditional first-order methods (such as SGD and Adam) face in large-scale model training, differential privacy, and distributed learning. The key to the solution is to retrace the evolution of optimization algorithms, conduct a systematic empirical evaluation of mainstream optimizers across model architectures and training scenarios, distill key trends and design trade-offs, and, combining theoretical insights with extensive experimental evidence, provide actionable design guidance for the next generation of efficient, robust, and trustworthy optimization methods.

Link: https://arxiv.org/abs/2604.12968
Authors: Tong Zhang, Jiangning Zhang, Zhucun Xue, Juntao Jiang, Yicheng Xu, Chengming Xu, Teng Hu, Xingyu Xie, Xiaobin Hu, Yabiao Wang, Yong Liu, Shuicheng Yan
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Balancing convergence speed, generalization capability, and computational efficiency remains a core challenge in deep learning optimization. First-order gradient descent methods, epitomized by stochastic gradient descent (SGD) and Adam, serve as the cornerstone of modern training pipelines. However, large-scale model training, stringent differential privacy requirements, and distributed learning paradigms expose critical limitations in these conventional approaches regarding privacy protection and memory efficiency. To mitigate these bottlenecks, researchers explore second-order optimization techniques to surpass first-order performance ceilings, while zeroth-order methods reemerge to alleviate memory constraints inherent to large-scale training. Despite this proliferation of methodologies, the field lacks a cohesive framework that unifies underlying principles and delineates application scenarios for these disparate approaches. In this work, we retrospectively analyze the evolutionary trajectory of deep learning optimization algorithms and present a comprehensive empirical evaluation of mainstream optimizers across diverse model architectures and training scenarios. We distill key emerging trends and fundamental design trade-offs, pinpointing promising directions for future research. By synthesizing theoretical insights with extensive empirical evidence, we provide actionable guidance for designing next-generation highly efficient, robust, and trustworthy optimization methods. The code is available at this https URL.
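To make the two first-order update rules the survey starts from concrete, here is a toy comparison of plain SGD and Adam on a 1-D quadratic f(x) = x². This is a textbook illustration assuming nothing from the survey's own code or benchmarks.

```python
# SGD vs. Adam on f(x) = x^2: both converge, with different update rules.

def sgd(x, steps=50, lr=0.1):
    for _ in range(steps):
        x -= lr * 2 * x                         # gradient of x^2 is 2x
    return x

def adam(x, steps=50, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = 2 * x
        m = b1 * m + (1 - b1) * g               # first-moment estimate
        v = b2 * v + (1 - b2) * g * g           # second-moment estimate
        mhat, vhat = m / (1 - b1 ** t), v / (1 - b2 ** t)
        x -= lr * mhat / (vhat ** 0.5 + eps)    # bias-corrected adaptive step
    return x

print(abs(sgd(5.0)), abs(adam(5.0)))  # both approach the minimum at 0
```

The contrast already hints at the trade-offs the survey studies: SGD's step scales with the raw gradient, while Adam normalizes by a second-moment estimate and so takes roughly constant-magnitude steps regardless of curvature.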

[CV-9] Boosting Visual Instruction Tuning with Self-Supervised Guidance

【Quick Read】: This paper addresses the weak performance of multimodal large language models (MLLMs) on vision-dominant tasks, especially those requiring fine-grained visual reasoning. The study shows this limitation stems not from weak visual representations but from under-utilization of visual information during instruction tuning, which lets models partially solve tasks from language priors alone. The key to the solution is to add a small number of visually grounded self-supervised learning (SSL) tasks expressed in natural language (such as rotation prediction, color matching, and cross-view correspondence), reformulated as image-instruction-response triplets so that the model must rely on visual evidence. The method needs no human annotation, architectural changes, or extra training stages: injecting only 3-10% of such instructions into the training data distribution consistently improves performance on vision-centric evaluations, confirming instruction tuning with visually grounded SSL tasks as an effective strategy for strengthening visual reasoning in MLLMs.

Link: https://arxiv.org/abs/2604.12966
Authors: Sophia Sirko-Galouchenko, Monika Wysoczanska, Andrei Bursuc, Nicolas Thome, Spyros Gidaris
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervision that cannot be solved without relying on visual evidence. Our approach requires no human annotations, no architectural modifications, and no additional training stages. Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of such visually grounded instructions consistently improves performance on vision-centric evaluations. Our findings highlight instruction tuning with visually grounded SSL tasks as a powerful lever for improving visual reasoning in MLLMs through simple adjustments to the training data distribution. Code available at: this https URL
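Reformulating a classical pretext task as an instruction triplet can be sketched in a few lines. Here a small integer grid stands in for an image, and the prompt/response format is an assumption for illustration, not the paper's exact wording.

```python
# Sketch: turn rotation prediction into an image-instruction-response
# triplet. The answer cannot be recovered from language priors alone.
import random

def rotate90(grid, k):
    for _ in range(k % 4):
        grid = [list(row) for row in zip(*grid[::-1])]  # one 90° clockwise turn
    return grid

def make_rotation_triplet(grid, rng):
    k = rng.choice([0, 1, 2, 3])
    return {
        "image": rotate90(grid, k),
        "instruction": "By how many degrees was this image rotated clockwise?",
        "response": f"{90 * k} degrees",
    }

rng = random.Random(0)
grid = [[1, 2], [3, 4]]
triplet = make_rotation_triplet(grid, rng)
print(triplet["instruction"], "->", triplet["response"])
```

Because the supervision is derived from the transformation itself, triplets like this cost nothing to annotate, which is what lets a 3-10% injection into the training mix stay cheap.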

[CV-10] Adaptive Data Dropout: Towards Self-Regulated Learning in Deep Neural Networks

【Quick Read】: This paper addresses inefficient data utilization in deep neural network training: large datasets are usually sampled uniformly, ignoring how a sample's contribution varies across training stages. Existing work reduces training data with fixed schedules that lack adaptivity. The key to the solution is the Adaptive Data Dropout framework, which introduces a lightweight stochastic update mechanism that adjusts the training subset online in response to changes in training accuracy, balancing exploration and consolidation. The model thus self-regulates its data exposure without a preset fixed schedule, substantially reducing effective training steps while maintaining competitive performance.

Link: https://arxiv.org/abs/2604.12945
Authors: Amar Gahir, Varshil Patel, Shreyank N Gowda
Affiliations: University of Nottingham
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Deep neural networks are typically trained by uniformly sampling large datasets across epochs, despite evidence that not all samples contribute equally throughout learning. Recent work shows that progressively reducing the amount of training data can improve efficiency and generalization, but existing methods rely on fixed schedules that do not adapt during training. In this work, we propose Adaptive Data Dropout, a simple framework that dynamically adjusts the subset of training data based on performance feedback. Inspired by self-regulated learning, our approach treats data selection as an adaptive process, increasing or decreasing data exposure in response to changes in training accuracy. We introduce a lightweight stochastic update mechanism that modulates the dropout schedule online, allowing the model to balance exploration and consolidation over time. Experiments on standard image classification benchmarks show that our method reduces effective training steps while maintaining competitive accuracy compared to static data dropout strategies. These results highlight adaptive data selection as a promising direction for efficient and robust training. Code will be released.
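One way such a stochastic schedule could look is sketched below: the fraction of data used each epoch is nudged down when training accuracy improves (consolidate on less data) and nudged up when it stalls or drops (explore more data), with a small random perturbation. The update rule, step sizes, and bounds are assumptions, not the paper's formula.

```python
# Illustrative sketch of an online, feedback-driven data-fraction schedule.
import random

def adapt_fraction(frac, prev_acc, acc, step=0.05, noise=0.02, rng=random):
    direction = -1.0 if acc > prev_acc else 1.0    # improving -> use less data
    frac += direction * step + rng.uniform(-noise, noise)
    return min(max(frac, 0.1), 1.0)                # keep fraction in [0.1, 1]

rng = random.Random(42)
frac, prev_acc = 1.0, 0.0
for acc in [0.30, 0.45, 0.52, 0.51, 0.58]:         # toy accuracy trace
    frac = adapt_fraction(frac, prev_acc, acc, rng=rng)
    prev_acc = acc
print(round(frac, 3))
```

The noise term keeps the schedule from locking into a monotone ramp, which is the "stochastic update" flavor the abstract describes.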

[CV-11] Distorted or Fabricated? A Survey on Hallucination in Video LLMs (ACL 2026)

【Quick Read】: This paper addresses the pervasive hallucination problem in Video Large Language Models (Vid-LLMs), where outputs appear plausible yet contradict the input video. The key contribution is a systematic taxonomy dividing hallucinations into two core types, dynamic distortion and content fabrication, each with two subtypes illustrated by representative cases. Building on this taxonomy, the survey reviews current evaluation methods, benchmarks, metrics, and intervention strategies, and analyzes the root causes, chiefly limited temporal representation and insufficient visual grounding. These insights motivate future directions such as motion-aware visual encoders and counterfactual learning, toward more robust and reliable video-language systems.

Link: https://arxiv.org/abs/2604.12944
Authors: Yiyang Huang, Yitian Zhang, Yizhou Wang, Mingyuan Zhang, Liang Shi, Huimin Zeng, Yun Fu
Affiliations: Northeastern University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ACL 2026 findings

Click to view abstract

Abstract:Despite significant progress in video-language modeling, hallucinations remain a persistent challenge in Video Large Language Models (Vid-LLMs), referring to outputs that appear plausible yet contradict the content of the input video. This survey presents a comprehensive analysis of hallucinations in Vid-LLMs and introduces a systematic taxonomy that categorizes them into two core types: dynamic distortion and content fabrication, each comprising two subtypes with representative cases. Building on this taxonomy, we review recent advances in the evaluation and mitigation of hallucinations, covering key benchmarks, metrics, and intervention strategies. We further analyze the root causes of dynamic distortion and content fabrication, which often result from limited capacity for temporal representation and insufficient visual grounding. These insights inform several promising directions for future work, including the development of motion-aware visual encoders and the integration of counterfactual learning techniques. This survey consolidates scattered progress to foster a systematic understanding of hallucinations in Vid-LLMs, laying the groundwork for building robust and reliable video-language systems. An up-to-date curated list of related works is maintained at this https URL .

[CV-12] Direct Discrepancy Replay: Distribution-Discrepancy Condensation and Manifold-Consistent Replay for Continual Face Forgery Detection

【Quick Read】: This paper addresses catastrophic forgetting in continual face forgery detection (CFFD), where models forget previously learned forgery cues when learning new manipulation paradigms. Existing methods replay either a few stored historical samples or pseudo-forgeries synthesized from detector-dependent perturbations; the former cannot cover diverse forgery cues and risks identity leakage, while the latter stays tied to old decision boundaries and generalizes poorly. The key to the solution is distribution-level replay: the real-to-fake distribution discrepancy is modeled via a surrogate factorization in feature space and condensed into a tiny bank of distribution discrepancy maps (Distribution-Discrepancy Condensation, DDC); replay samples are then synthesized through variance-preserving composition of these maps with current-stage real faces (Manifold-Consistent Replay, MCR). Without storing raw face images, this reinstates the distributions of previous forgery tasks while staying consistent with current data statistics, markedly mitigating forgetting and reducing privacy-leakage risk.

Link: https://arxiv.org/abs/2604.12941
Authors: Tianshuo Zhang, Haoyuan Zhang, Siran Peng, Weisong Zhao, Xiangyu Zhu, Zhen Lei
Affiliations: SAI, UCAS; MAIS, CASIA; IIE, CAS; SCS, UCAS; SCSE, FIE, M.U.S.T.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Continual face forgery detection (CFFD) requires detectors to learn emerging forgery paradigms without forgetting previously seen manipulations. Existing CFFD methods commonly rely on replaying a small amount of past data to mitigate forgetting. Such replay is typically implemented either by storing a few historical samples or by synthesizing pseudo-forgeries from detector-dependent perturbations. Under strict memory budgets, the former cannot adequately cover diverse forgery cues and may expose facial identities, while the latter remains strongly tied to past decision boundaries. We argue that the core role of replay in CFFD is to reinstate the distributions of previous forgery tasks during subsequent training. To this end, we directly condense the discrepancy between real and fake distributions and leverage real faces from the current stage to perform distribution-level replay. Specifically, we introduce Distribution-Discrepancy Condensation (DDC), which models the real-to-fake discrepancy via a surrogate factorization in characteristic-function space and condenses it into a tiny bank of distribution discrepancy maps. We further propose Manifold-Consistent Replay (MCR), which synthesizes replay samples through variance-preserving composition of these maps with current-stage real faces, yielding samples that reflect previous-task forgery cues while remaining compatible with current real-face statistics. Operating under an extremely small memory budget and without directly storing raw historical face images, our framework consistently outperforms prior CFFD baselines and significantly mitigates catastrophic forgetting. Replay-level privacy analysis further suggests reduced identity leakage risk relative to selection-based replay.

[CV-13] Task Alignment: A simple and effective proxy for model merging in computer vision

【Quick Read】: This paper addresses the efficiency and applicability of merging multi-task vision models in practice, especially when different tasks rely on trainable and usually heterogeneous decoders, where the cost of decoder training makes hyperparameter selection based on downstream performance impractical. The key to the solution is the task alignment proxy, which predicts the downstream performance of a merged model at negligible computational cost, speeding up hyperparameter search by orders of magnitude while retaining merged-model performance and substantially extending the practicality of model merging beyond CLIP-based classification.

Link: https://arxiv.org/abs/2604.12935
Authors: Pau de Jorge, César Roberto de Souza, Björn Michele, Mert Bülent Sarıyıldız, Philippe Weinzaepfel, Florent Perronnin, Diane Larlus, Yannis Kalantidis
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Efficiently merging several models fine-tuned for different tasks, but stemming from the same pretrained base model, is of great practical interest. Despite extensive prior work, most evaluations of model merging in computer vision are restricted to image classification using CLIP, where different classification datasets define different tasks. In this work, our goal is to make model merging more practical and show its relevance on challenging scenarios beyond this specific setting. In most vision scenarios, different tasks rely on trainable and usually heterogeneous decoders. Differently from previous studies with frozen decoders, where merged models can be evaluated right away, the non-trivial cost of decoder training renders hyperparameter selection based on downstream performance impractical. To address this, we introduce the task alignment proxy, and show how it can be used to speed up hyperparameter selection by orders of magnitude while retaining performance. Equipped with the task alignment proxy, we extend the applicability of model merging to multi-task vision models beyond CLIP-based classification.
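The abstract does not define the proxy itself, so as a hedged stand-in the sketch below scores merge compatibility by the cosine alignment between "task vectors" (fine-tuned weights minus the pretrained base), a common cheap heuristic in the model-merging literature; the toy weights are made up.

```python
# Illustrative stand-in for a cheap merging proxy: cosine alignment of
# task vectors. High alignment suggests compatible updates; negative
# alignment suggests conflicting tasks.
import math

def task_vector(finetuned, base):
    return [f - b for f, b in zip(finetuned, base)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

base  = [0.0, 0.0, 0.0, 0.0]
seg   = [0.2, 0.1, 0.0, -0.1]    # toy fine-tuned weights, task A
det   = [0.1, 0.2, 0.0, -0.2]    # toy fine-tuned weights, task B
depth = [-0.2, -0.1, 0.1, 0.1]   # toy fine-tuned weights, task C (conflicting)

ta, tb, tc = (task_vector(w, base) for w in (seg, det, depth))
print(round(cosine(ta, tb), 3), round(cosine(ta, tc), 3))
```

The appeal of any such proxy is the same as in the paper's setting: it is computed from weights alone, so no decoder has to be trained before ranking merge hyperparameters.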

[CV-14] DINO-Explorer: Active Underwater Discovery via Ego-Motion Compensated Semantic Predictive Coding

【Quick Read】: This paper addresses the wasted bandwidth and missed transient events caused by passive data collection in autonomous underwater vehicles (AUVs) for marine-ecosystem monitoring: traditional AUVs act as passive data loggers and cannot recognize and respond to key phenomena in real time, limiting observation efficiency. The key to the solution is the DINO-Explorer framework, whose core is a lightweight, action-conditioned recurrent predictor built in the latent space of a frozen DINOv3 foundation model that produces a continuous semantic surprise signal for actively sensing novel events; an efference-copy-like module uses globally pooled optical flow to suppress visual changes induced by self-motion, effectively separating genuine environmental novelty from self-induced disturbance. This mechanism markedly improves recall of human-verified events in asynchronous event triage while sharply reducing required transmission bandwidth, yielding an efficient and robust online attention mechanism.

Link: https://arxiv.org/abs/2604.12933
Authors: Yuhan Jin, Nayari Marie Lessa, Mariela De Lucas Alvarez, Melvin Laux, Lucas Amparo Barbosa, Frank Kirchner, Rebecca Adam
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Marine ecosystem degradation necessitates continuous, scientifically selective underwater monitoring. However, most autonomous underwater vehicles (AUVs) operate as passive data loggers, capturing exhaustive video for offline review and frequently missing transient events of high scientific value. Transitioning to active perception requires a causal, online signal that highlights significant phenomena while suppressing maneuver-induced visual changes. We propose DINO-Explorer, a novelty-aware perception framework driven by a continuous semantic surprise signal. Operating within the latent space of a frozen DINOv3 foundation model, it leverages a lightweight, action-conditioned recurrent predictor to anticipate short-horizon semantic evolution. An efference-copy-inspired module utilizes globally pooled optical flow to discount self-induced visual changes without suppressing genuine environmental novelty. We evaluate this signal on the downstream task of asynchronous event triage under variant telemetry constraints. Results demonstrate that DINO-Explorer provides a robust, bandwidth-efficient attention mechanism. At a fixed operating point, the system retains 78.8% of post-discovery human-reviewer consensus events with a 56.8% trigger confirmation rate, effectively surfacing mission-relevant phenomena. Crucially, ego-motion conditioning suppresses 45.5% of false positives relative to an uncompensated surprise signal baseline. In a replay-side Pareto ablation study, DINO-Explorer robustly dominates the validated peak F1 versus telemetry bandwidth frontier, reducing telemetry bandwidth by 48.2% at the selected operating point while maintaining a 62.2% peak F1 score, successfully concentrating data transmission around human-verified novelty events.
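The shape of such a signal can be sketched as follows: surprise is the distance between predicted and observed embeddings, scaled down when globally pooled optical-flow magnitude indicates the change is largely self-induced. The discount form and the toy numbers are assumptions; the paper's exact conditioning is more elaborate.

```python
# Sketch of an ego-motion-compensated semantic surprise signal.
import math

def surprise(pred_emb, obs_emb, flow_mag, alpha=0.5):
    err = math.sqrt(sum((p - o) ** 2 for p, o in zip(pred_emb, obs_emb)))
    discount = 1.0 / (1.0 + alpha * flow_mag)  # efference-copy-style discount
    return err * discount

pred, obs = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]
still  = surprise(pred, obs, flow_mag=0.0)   # same error, vehicle stationary
moving = surprise(pred, obs, flow_mag=4.0)   # same error during a fast turn
print(round(still, 3), round(moving, 3))
```

The same prediction error triggers less surprise during a maneuver, which is how maneuver-induced visual changes are kept from flooding the triage queue.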

[CV-15] Grasp in Gaussians: Fast Monocular Reconstruction of Dynamic Hand-Object Interactions

【Quick Read】: This paper addresses the efficient and robust reconstruction of dynamic 3D hand-object interactions from monocular video, where existing methods optimize heavy neural representations and suffer from low computational efficiency and poor stability. The key to the solution is Grasp in Gaussians (GraG), which introduces a compact Sum-of-Gaussians (SoG) representation for fast hand and object tracking: object pose and geometry are initialized with a video-adapted SAM3D pipeline, and the dense Gaussian representation is subsampled into a lightweight SoG that preserves geometric fidelity; the hand is handled with a simplified optimization based on 2D joint and depth alignment losses, keeping hand motion stable and accurate without per-frame refinement of a detailed 3D hand appearance model. This design makes the method 6.4x faster than prior work on long sequences while improving object reconstruction by 13.4% and reducing hand keypoint error by over 65%.

Link: https://arxiv.org/abs/2604.12929
Authors: Ayce Idil Aytekin, Xu Chen, Zhengyang Shen, Thabo Beeler, Helge Rhodin, Rishabh Dabral, Christian Theobalt
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract

Abstract:We present Grasp in Gaussians (GraG), a fast and robust method for reconstructing dynamic 3D hand-object interactions from a single monocular video. Unlike recent approaches that optimize heavy neural representations, our method focuses on tracking the hand and the object efficiently, once initialized from pretrained large models. Our key insight is that accurate and temporally stable hand-object motion can be recovered using a compact Sum-of-Gaussians (SoG) representation, revived from classical tracking literature and integrated with generative Gaussian-based initializations. We initialize object pose and geometry using a video-adapted SAM3D pipeline, then convert the resulting dense Gaussian representation into a lightweight SoG via subsampling. This compact representation enables efficient and fast tracking while preserving geometric fidelity. For the hand, we adopt a complementary strategy: starting from off-the-shelf monocular hand pose initialization, we refine hand motion using simple yet effective 2D joint and depth alignment losses, avoiding per-frame refinement of a detailed 3D hand appearance model while maintaining stable articulation. Extensive experiments on public benchmarks demonstrate that GraG reconstructs temporally coherent hand-object interactions on long sequences 6.4x faster than prior work while improving object reconstruction by 13.4% and reducing hand’s per-joint position error by over 65%.

[CV-16] Pi-HOC: Pairwise 3D Human-Object Contact Estimation

【Quick Read】: This paper addresses the difficulty of disentangling fine-grained concurrent physical contact in real-world human-object interaction (HOI) images: existing methods are limited to single-human settings or require extra object geometry (e.g., meshes), and vision-language model (VLM)-based approaches struggle with multi-human scenes and scale poorly at inference. The key to the solution is Pi-HOC, a single-pass, instance-aware framework that detects human and object instances, creates a dedicated HO token for each human-object pair, refines them with an InteractionFormer, and uses a Segment Anything Model (SAM)-based decoder to predict dense semantic contact on SMPL human meshes. This design markedly improves contact accuracy and localization in multi-human scenes with 20x higher throughput, and further supports language-guided contact localization without extra training as well as image-to-mesh reconstruction refinement.

Link: https://arxiv.org/abs/2604.12923
Authors: Sravan Chittupalli, Ayush Jain, Dong Huang
Affiliations: Carnegie Mellon University, Robotics Institute; National Robotics Engineering Center
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Resolving real-world human-object interactions in images is a many-to-many challenge, in which disentangling fine-grained concurrent physical contact is particularly difficult. Existing semantic contact estimation methods are either limited to single-human settings or require object geometries (e.g., meshes) in addition to the input image. Current state-of-the-art leverages powerful VLM for category-level semantics but struggles with multi-human scenarios and scales poorly in inference. We introduce Pi-HOC, a single-pass, instance-aware framework for dense 3D semantic contact prediction of all human-object pairs. Pi-HOC detects instances, creates dedicated human-object (HO) tokens for each pair, and refines them using an InteractionFormer. A SAM-based decoder then predicts dense contact on SMPL human meshes for each human-object pair. On the MMHOI and DAMON datasets, Pi-HOC significantly improves accuracy and localization over state-of-the-art methods while achieving 20x higher throughput. We further demonstrate that predicted contacts improve SAM-3D image-to-mesh reconstruction via a test-time optimization algorithm and enable referential contact prediction from language queries without additional training.

[CV-17] Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation

【Quick Read】: This paper addresses the fact that existing radar-camera fusion methods treat 3D perception tasks (object detection and semantic segmentation) in bird's-eye-view (BEV) space in isolation, missing complementary information: detection features provide object-level geometry that can sharpen segmentation boundaries, while segmentation features provide dense semantic context that anchors detection localization. The key to the solution is CTAB (Cross-Task Attention Bridge), a bidirectional module that exchanges features between the detection and segmentation branches via multi-scale deformable attention in shared BEV space, promoting cross-task knowledge sharing; combined with an Instance Normalization-based segmentation decoder and learnable BEV upsampling that refine the BEV representation, it yields clear segmentation gains on nuScenes at essentially neutral detection performance, with joint multi-task inference in a single framework.

Link: https://arxiv.org/abs/2604.12918
Authors: Ahmet İnanç, Özgür Erkent
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures, 3 tables, submitted to a venue for consideration

Click to view abstract

Abstract:Bird’s-eye-view (BEV) representations are the dominant paradigm for 3D perception in autonomous driving, providing a unified spatial canvas where detection and segmentation features are geometrically registered to the same physical coordinate system. However, existing radar-camera fusion methods treat these tasks in isolation, missing the opportunity to share complementary information between them: detection features encode object-level geometry that can sharpen segmentation boundaries, while segmentation features provide dense semantic context that can anchor detection. We propose CTAB (Cross-Task Attention Bridge), a bidirectional module that exchanges features between detection and segmentation branches via multi-scale deformable attention in shared BEV space. CTAB is integrated into a multi-task framework with an Instance Normalization-based segmentation decoder and learnable BEV upsampling to provide a more detailed BEV representation. On nuScenes, CTAB improves segmentation on 7 classes over the joint multi-task baseline at essentially neutral detection. On a 4-class subset (drivable area, pedestrian crossing, walkway, vehicle), our joint multi-task model reaches comparable mIoU on 4 classes while simultaneously providing 3D detection.

[CV-18] M3D-Stereo: A Multiple-Medium and Multiple-Degradation Dataset for Stereo Image Restoration

【Quick Read】: This paper addresses the limited real-world applicability of image restoration under complex degradations, caused by intricate physical degradation mechanisms, severe information loss, and existing datasets that cover only a single degradation type or rely on synthetic data lacking stereo consistency. The key to the solution is M3D-Stereo, a high-quality stereo dataset of 7,904 high-resolution image pairs covering four degradation scenarios (underwater scatter, haze/fog, underwater low-light, and haze low-light), each with six progressive degradation levels and pixel-wise consistent clear ground truth, supporting fine-grained evaluation and validation of restoration methods closer to practical applications.

Link: https://arxiv.org/abs/2604.12917
Authors: Deqing Yang, Yingying Liu, Qicong Wang, Zhi Zeng, Dajiang Lu, Yibin Tian
Affiliations: Shenzhen University; Chongqing Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Image restoration under adverse conditions, such as underwater, haze or fog, and low-light environments, remains a highly challenging problem due to complex physical degradations and severe information loss. Existing datasets are predominantly limited to a single degradation type or heavily rely on synthetic data without stereo consistency, inherently restricting their applicability in real-world scenarios. To address this, we introduce M3D-Stereo, a stereo dataset with 7904 high-resolution image pairs for image restoration research acquired in multiple media with multiple controlled degradation levels. It encompasses four degradation scenarios: underwater scatter, haze/fog, underwater low-light, and haze low-light. Each scenario forms a subset, and is divided into six levels of progressive degradation, allowing fine-grained evaluations of restoration methods with increasing severity of degradation. Collected via a laboratory setup, the dataset provides aligned stereo image pairs along with their pixel-wise consistent clear ground truths. Two restoration tasks, single-level and mixed-level degradation, were performed to verify its validity. M3D-Stereo establishes a better controlled and more realistic benchmark to evaluate image restoration and stereo matching methods in complex degradation environments. It is made public under LGPLv3 license.

[CV-19] A Sanity Check on Composed Image Retrieval

【Quick Read】: This paper addresses two core problems in current composed image retrieval (CIR) evaluation: existing benchmarks contain indeterminate queries (multiple candidate images, rather than solely the target, satisfy the query), degrading the evaluation of CIR models; and they neglect how CIR models perform inside multi-round interactive systems. The key to the solution is twofold: the FISD (Fully-Informed Semantically-Diverse) benchmark uses generative models to precisely control the semantic variables between reference and target images, enabling unambiguous, accurate evaluation of CIR methods along six dimensions; and an automatic multi-round agentic evaluation framework observes how models adapt and refine their choices over successive query rounds, giving a more realistic appraisal of their effectiveness in interactive scenarios.

Link: https://arxiv.org/abs/2604.12904
Authors: Yikun Liu, Jiangchao Yao, Weidi Xie, Yanfeng Wang
Affiliations: Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image, and a relative caption that specifies the desired modification. Despite the rapid development of CIR models, their performance is not well characterized by existing benchmarks, which inherently contain indeterminate queries degrading the evaluation (i.e., multiple candidate images, rather than solely the target image, meet the query criteria), and have not considered their effectiveness in the context of the multi-round system. Motivated by this, we consider improving the evaluation procedure from two aspects: 1) we introduce FISD, a Fully-Informed Semantically-Diverse benchmark, which employs generative models to precisely control the variables of reference-target image pairs, enabling a more accurate evaluation of CIR methods across six dimensions, without query ambiguity; 2) we propose an automatic multi-round agentic evaluation framework to probe the potential of the existing models in the interactive scenarios. By observing how models adapt and refine their choices over successive rounds of queries, this framework provides a more realistic appraisal of their efficacy in practical applications. Extensive experiments and comparisons prove the value of our novel evaluation on typical CIR methods.

[CV-20] Don’t Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs (CVPR 2026)

【Quick Read】: This paper addresses the difficulty multimodal language models (MLLMs) have in exploiting visual cues generated by vision tools (depth, optical flow, correspondence, etc.). Existing approaches feed raw pixel-level tool outputs directly into the model, which mismatches the language-native reasoning strengths of LLMs, weakening perception and encouraging over-reliance on language priors. The key to the solution is Perception Programs (P²), a training-free, model-agnostic method that rewrites complex tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. On six perception-centric tasks in BLINK, P² yields large gains, for example raising GPT-5 Mini's multi-view reasoning accuracy from 41.35% to 86.47%, and it is equally effective for small MLLMs, surpassing all prior agentic, supervised, and reinforcement-learning tool-use methods.

Link: https://arxiv.org/abs/2604.12896
Authors: Muhammad Kamran Janjua, Hugo Silva, Di Niu, Bahador Rashidi
Affiliations: Huawei Technologies; University of Alberta
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs, it is how tool outputs are represented. We introduce Perception Programs (P²), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six perception-centric tasks in BLINK, P² consistently yields large improvements over base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P² raises its accuracy from 41.35% to 86.47% on multi-view reasoning, from 52.42% to 81.45% on relative depth, and achieves a 22% average gain across tasks, setting new state-of-the-art results. Even on smaller MLLMs, e.g., InternVL3.5-4B and Qwen3VL-4B, we observe 15-40% absolute gains from P², surpassing prior agentic, supervised, and RL-based tool-use methods-without any training or model modifications.
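The core rewrite step can be sketched on the relative-depth task: instead of feeding a raw depth map to the model, summarize it as compact, language-native facts the LLM can parse. The summary format and the toy scene are assumptions, not the paper's actual Perception Programs.

```python
# Sketch: rewrite a dense depth map into a structured text summary.
def summarize_relative_depth(depth, regions):
    """depth: 2-D list of metric depths; regions: name -> (row, col)."""
    # Order region names from nearest to farthest by their sampled depth.
    order = sorted(regions, key=lambda name: depth[regions[name][0]][regions[name][1]])
    lines = [f"{name}: ~{depth[r][c]:.1f} m away"
             for name, (r, c) in sorted(regions.items())]
    lines.append("Nearest to farthest: " + " < ".join(order))
    return "\n".join(lines)

depth = [[2.0, 2.1],
         [0.8, 5.5]]                                   # toy 2x2 depth map
regions = {"mug": (1, 0), "lamp": (0, 1), "door": (1, 1)}
print(summarize_relative_depth(depth, regions))
```

A summary like this is a handful of tokens rather than a dense pixel grid, and the ordering fact is exactly the kind of language-native cue an LLM can reason over directly.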

[CV-21] Representing 3D Faces with Learnable B-Spline Volumes (CVPR 2026)

【Quick Read】: This paper addresses the limited expressivity of geometric representations for 3D face reconstruction and scan registration, in particular achieving accurate, continuous, semantically consistent surface reconstruction while preserving local editability. The key to the solution is CUBE (Control-based Unified B-spline Encoding), which replaces the 3D control points of traditional B-splines with a lattice of high-dimensional control features (e.g., 8 x 8 x 8) and maps from parametric to Euclidean space in two stages: B-spline basis functions locally blend the high-dimensional features to yield a base mesh, and a small MLP then predicts residual displacements from that base shape to obtain the final refined coordinates. This design retains the local-support property of B-splines while increasing expressivity and enabling control-feature-based local editing, achieving state-of-the-art 3D face reconstruction and registration from monocular images and unstructured point clouds.

Link: https://arxiv.org/abs/2604.12894
Authors: Prashanth Chandran, Daoye Wang, Timo Bolkart
Affiliations: Google
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026 (Highlight)

Click to view abstract

Abstract:We present CUBE (Control-based Unified B-spline Encoding), a new geometric representation for human faces that combines B-spline volumes with learned features, and demonstrate its use as a decoder for 3D scan registration and monocular 3D face reconstruction. Unlike existing B-spline representations with 3D control points, CUBE is parametrized by a lattice (e.g., 8 x 8 x 8) of high-dimensional control features, increasing the model’s expressivity. These features define a continuous, two-stage mapping from a 3D parametric domain to 3D Euclidean space via an intermediate feature space. First, high-dimensional control features are locally blended using the B-spline bases, yielding a high-dimensional feature vector whose first three values define a 3D base mesh. A small MLP then processes this feature vector to predict a residual displacement from the base shape, yielding the final refined 3D coordinates. To reconstruct 3D surfaces in dense semantic correspondence, CUBE is queried at 3D coordinates sampled from a fixed template mesh. Crucially, CUBE retains the local support property of traditional B-spline representations, enabling local surface editing by updating individual control features. We demonstrate the strengths of this representation by training transformer-based encoders to predict CUBE’s control features from unstructured point clouds and monocular images, achieving state-of-the-art scan registration results compared to recent baselines.

[CV-22] Towards Long-horizon Agentic Multimodal Search

【Quick Read】: This paper targets two core challenges that multimodal deep-search agents face over long horizons: context explosion from managing heterogeneous textual and visual information, and the loss of crucial visual signals under high token costs. The key is a file-based visual representation mechanism that offloads visual assets to an external file system and maps them to lightweight textual identifiers (UIDs), sharply reducing context overhead while preserving multimodal information for later access. A tailored fetch-image tool further enables a progressive, on-demand visual loading strategy, supporting complex cross-modal reasoning on long-horizon tasks.

Link: https://arxiv.org/abs/2604.12890
Authors: Yifan Du, Zikang Liu, Jinbiao Peng, Jie Wu, Junyi Li, Jinyang Li, Wayne Xin Zhao, Ji-Rong Wen
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; City University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross-modal multi-hop reasoning. Using this pipeline, we distill 12K high-quality trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100-turn search horizons, achieving state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, while also exhibiting strong generalizability across different base models. Our code will be released in this https URL.

[CV-23] VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization

【Quick Read】: This paper tackles the high learning complexity imposed by fixed 3D-grid video tokenization, where the downstream model must predict every low-level detail "pixel by pixel" regardless of a video's inherent complexity, limiting training efficiency and long-video generation. The key is VideoFlexTok, which represents a video as a variable-length token sequence with a coarse-to-fine structure: early tokens capture abstract information such as semantics and motion, later tokens progressively add fine detail, and a generative flow decoder reconstructs high-quality video from any token count. This lets the token count adapt to downstream needs and fit longer videos into the same compute budget, markedly improving training efficiency (e.g., matching the generation quality of a 5.2B-parameter model with a 1.1B model) and, for the first time, enabling a long-video (10-second, 81-frame) text-to-video model trained with only 672 tokens at greatly reduced computational cost.

Link: https://arxiv.org/abs/2604.12887
Authors: Andrei Atanov, Jesse Allardice, Roman Bachmann, Oğuzhan Fatih Kar, R Devon Hjelm, David Griffiths, Peter Fu, Afshin Dehghan, Amir Zamir
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: project page at this https URL

Click to view abstract

Abstract:Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. A de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This requires the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict all low-level details "pixel-by-pixel" irrespective of the video's inherent complexity, leading to high learning complexity. We present VideoFlexTok, which represents videos with a variable-length sequence of tokens structured in a coarse-to-fine manner – where the first tokens (emergently) capture abstract information, such as semantics and motion, and later tokens add fine-grained details. The generative flow decoder enables realistic video reconstructions from any token count. This representation structure allows adapting the token count according to downstream needs and encoding videos longer than the baselines with the same budget. We evaluate VideoFlexTok on class- and text-to-video generative tasks and show that it leads to more efficient training compared to 3D grid tokens, e.g., achieving comparable generation quality (gFVD and ViCLIP Score) with a 5x smaller model (1.1B vs 5.2B). Finally, we demonstrate how VideoFlexTok can enable long video generation without prohibitive computational cost by training a text-to-video model on 10-second 81-frame videos with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.

[CV-24] PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination

【Quick Read】: This paper addresses three challenges in audio-driven bimanual piano motion generation: existing methods rely on acoustic-only representations without symbolic priors, use rigid interaction mechanisms that struggle to model dynamic bimanual coordination, and are too computationally expensive for real-time streaming over long sequences. The key is PianoFlow, which uses MIDI as a privileged modality during training to distill structured musical priors, achieving deep semantic understanding of complex musical structure and cross-hand coordination via flow matching; an asymmetric role-gated interaction module explicitly models dynamic cross-hand cooperation through role-aware attention and temporal gating; and an autoregressive flow continuation scheme enables real-time streaming generation of arbitrarily long sequences with cross-chunk temporal coherence.

Link: https://arxiv.org/abs/2604.12856
Authors: Xuan Wang, Kai Ruan, Jiayi Han, Kaiyue Zhou, Gaoang Wang
Affiliations: Zhejiang University; Renmin University of China; Beijing Forestry University; Chengdu Minto Tech; Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Audio-driven bimanual piano motion generation requires precise modeling of complex musical structures and dynamic cross-hand coordination. However, existing methods often rely on acoustic-only representations lacking symbolic priors, employ inflexible interaction mechanisms, and are limited to computationally expensive short-sequence generation. To address these limitations, we propose PianoFlow, a flow-matching framework for precise and coordinated bimanual piano motion synthesis. Our approach strategically leverages MIDI as a privileged modality during training, distilling these structured musical priors to achieve deep semantic understanding while maintaining audio-only inference. Furthermore, we introduce an asymmetric role-gated interaction module to explicitly capture dynamic cross-hand coordination through role-aware attention and temporal gating. To enable real-time streaming generation for arbitrarily long sequences, we design an autoregressive flow continuation scheme that ensures seamless cross-chunk temporal coherence. Extensive experiments on the PianoMotion10M dataset demonstrate that PianoFlow achieves superior quantitative and qualitative performance, while accelerating inference by over 9× compared to previous methods.

[CV-25] Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

【Quick Read】: This paper addresses the security of vision-language models (VLMs) in the physical world: existing adversarial research concentrates on the digital domain and leaves physically realizable semantic-level attacks largely unexplored, even though physical perturbations in real scenes can trigger severe semantic misinterpretation in downstream tasks. The key is Multimodal Semantic Lighting Attacks (MSLA), the first physically deployable adversarial attack framework against VLMs. MSLA uses controllable adversarial lighting to disrupt multimodal semantic alignment in real scenes rather than attacking only task-specific outputs, markedly degrading zero-shot classification of mainstream CLIP variants and inducing severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP, with effectiveness, transferability, and practicality verified in both digital and physical domains.

Link: https://arxiv.org/abs/2604.12833
Authors: Yingying Zhao, Chengyin Hu, Qike Zhang, Xin Li, Xin Wang, Yiwei Wei, Jiujiang Guo, Jiahuan Long, Tingsong Jiang, Wen Yao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Vision-Language Models (VLMs) have shown remarkable performance, yet their security remains insufficiently understood. Existing adversarial studies focus almost exclusively on the digital setting, leaving physical-world threats largely unexplored. As VLMs are increasingly deployed in real environments, this gap becomes critical, since adversarial perturbations must be physically realizable. Despite this practical relevance, physical attacks against VLMs have not been systematically studied. Such attacks may induce recognition failures and further disrupt multimodal reasoning, leading to severe semantic misinterpretation in downstream tasks. Therefore, investigating physical attacks on VLMs is essential for assessing their real-world security risks. To address this gap, we propose Multimodal Semantic Lighting Attacks (MSLA), the first physically deployable adversarial attack framework against VLMs. MSLA uses controllable adversarial lighting to disrupt multimodal semantic understanding in real scenes, attacking semantic alignment rather than only task-specific outputs. Consequently, it degrades zero-shot classification performance of mainstream CLIP variants while inducing severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP across image captioning and visual question answering (VQA). Extensive experiments in both digital and physical domains demonstrate that MSLA is effective, transferable, and practically realizable. Our findings provide the first evidence that VLMs are highly vulnerable to physically deployable semantic attacks, exposing a previously overlooked robustness gap and underscoring the urgent need for physical-world robustness evaluation of VLMs.

[CV-26] Detecting and refurbishing ground truth errors during training of deep learning-based echocardiography segmentation models

【Quick Read】: This paper studies how robust deep learning models are to errors, both random and systematic, in the manually annotated ground-truth (GT) labels used for echocardiography (echo) segmentation. The key contributions are twofold: a Variance-of-Gradients (VOG) based detector of label errors, which proves more effective at flagging anomalous labels during training than conventional loss-based methods, and a pseudo-labelling strategy that refurbishes labels flagged as suspect, improving segmentation accuracy under high error rates. Experiments show clear gains in high-error settings, demonstrating the approach's potential for quality control of medical image annotation.

Link: https://arxiv.org/abs/2604.12832
Authors: Iman Islam, Bram Ruijsink, Andrew J. Reader, Andrew P. King
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures, 2 tables, International Symposium on Biomedical Imaging 2026

Click to view abstract

Abstract:Deep learning-based medical image segmentation typically relies on ground truth (GT) labels obtained through manual annotation, but these can be prone to random errors or systematic biases. This study examines the robustness of deep learning models to such errors in echocardiography (echo) segmentation and evaluates a novel strategy for detecting and refurbishing erroneous labels during model training. Using the CAMUS dataset, we simulate three error types, then compare a loss-based GT label error detection method with one based on Variance of Gradients (VOG). We also propose a pseudo-labelling approach to refurbish suspected erroneous GT labels. We assess the performance of our proposed approach under varying error levels. Results show that VOG proved highly effective in flagging erroneous GT labels during training. However, a standard U-Net maintained strong performance under random label errors and moderate levels of systematic errors (up to 50%). The detection and refurbishment approach improved performance, particularly under high-error conditions.
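The VOG idea, scoring each training sample by how much its gradients vary across checkpoints and flagging high-variance samples as possible label errors, can be sketched as follows. The gradient snapshots here are simulated, and the top-fraction cutoff is an illustrative choice, not the paper's setting.

```python
import numpy as np

def vog_scores(grad_snapshots):
    """Variance-of-Gradients score per training sample.

    grad_snapshots: (n_checkpoints, n_samples, n_features) array of per-sample
    gradients saved at several training checkpoints."""
    mean_g = grad_snapshots.mean(axis=0)
    var = ((grad_snapshots - mean_g) ** 2).mean(axis=0)   # per-feature variance over time
    return var.mean(axis=1)                               # one scalar score per sample

def flag_suspects(scores, top_frac=0.1):
    """Flag the samples with the highest VOG scores as suspect labels."""
    k = max(1, int(len(scores) * top_frac))
    return np.argsort(scores)[-k:]

# Toy check: samples 0-8 have identical gradients at every checkpoint,
# while sample 9's gradients fluctuate (a proxy for a noisy label).
rng = np.random.default_rng(1)
snaps = np.tile(rng.normal(size=(1, 10, 5)), (4, 1, 1))
snaps[:, 9, :] += rng.normal(scale=5.0, size=(4, 5))
suspects = flag_suspects(vog_scores(snaps))
```

In the paper's pipeline the flagged samples would then be refurbished via pseudo-labelling rather than simply discarded.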

[CV-27] DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment

【Quick Read】: This paper tackles the high cost of adapting multimodal large language models (MLLMs) to new video quality assessment (VQA) scenarios, which usually demands large-scale retraining and expensive mean opinion score (MOS) annotation. The key is DPC-VQA, a framework that decouples perception from calibration: a frozen pretrained MLLM supplies a base quality estimate and perceptual prior, while a lightweight calibration branch predicts a residual correction for the target scenario. This avoids end-to-end retraining, cutting trainable parameters to under 2% of conventional MLLM-based VQA methods and the required MOS labels to 20%, while maintaining strong performance.

Link: https://arxiv.org/abs/2604.12813
Authors: Xinyue Li, Shubo Xu, Zhichao Zhang, Zhaolin Cai, Yitong Chen, Guangtao Zhai
Affiliations: Shanghai Jiao Tong University; Baidu Inc.; Xinjiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Click to view abstract

Abstract:Recent multimodal large language models (MLLMs) have shown promising performance on video quality assessment (VQA) tasks. However, adapting them to new scenarios remains expensive due to large-scale retraining and costly mean opinion score (MOS) annotations. In this paper, we argue that a pretrained MLLM already provides a useful perceptual prior for VQA, and that the main challenge is to efficiently calibrate this prior to the target MOS space. Based on this insight, we propose DPC-VQA, a decoupling perception and calibration framework for video quality assessment. Specifically, DPC-VQA uses a frozen MLLM to provide a base quality estimate and perceptual prior, and employs a lightweight calibration branch to predict a residual correction for target-scenario adaptation. This design avoids costly end-to-end retraining while maintaining reliable performance with lower training and data costs. Extensive experiments on both user-generated content (UGC) and AI-generated content (AIGC) benchmarks show that DPC-VQA achieves competitive performance against representative baselines, while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods and remaining effective with only 20% of MOS labels. The code will be released upon publication.
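The decoupling idea, a frozen base quality estimate plus a small learned residual correction toward the target MOS space, can be illustrated with a toy ridge-regression calibrator. The features, base scores, and MOS values below are synthetic, and ridge regression merely stands in for the paper's lightweight calibration branch.

```python
import numpy as np

def fit_residual_calibrator(feats, base_scores, mos, lam=1e-2):
    """Ridge-fit a linear head that predicts (MOS - frozen base score)."""
    X = np.hstack([feats, np.ones((len(feats), 1))])       # bias column
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ (mos - base_scores))

def calibrated_score(feats, base_scores, w):
    """Final prediction = frozen base estimate + learned residual correction."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    return base_scores + X @ w

rng = np.random.default_rng(2)
feats = rng.normal(size=(50, 4))                            # stand-in for MLLM features
base = rng.uniform(1, 5, size=50)                           # frozen model's quality prior
mos = base + feats @ np.array([0.3, -0.2, 0.1, 0.0]) + 0.5  # target-scenario shift
w = fit_residual_calibrator(feats, base, mos)
pred = calibrated_score(feats, base, w)
```

Because only the small residual head is fitted, very few labels and parameters are needed, which mirrors the efficiency argument in the abstract.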

[CV-28] Rethinking Satellite Image Restoration for Onboard AI: A Lightweight Learning-Based Approach CVPR

【Quick Read】: This paper addresses the heavy computation and slowness of traditional physics-based satellite image restoration pipelines, which make them unsuitable for onboard use. The core challenge is efficient, high-quality restoration on resource-constrained space platforms, supporting both ground product generation and emerging onboard AI applications. The key is ConvBEERS, a lightweight, non-generative residual convolutional network trained on simulated satellite data: on real Pleiades-HR imagery it outperforms the traditional pipeline in restoration quality (+6.9 dB PSNR) and improves a downstream object detection task by up to +5.1% mAP@50, while deployment on a Xilinx Versal VCK190 FPGA achieves a ~41x latency reduction, validating its practical feasibility for space systems.

Link: https://arxiv.org/abs/2604.12807
Authors: Adrien Dorise, Marjorie Bellizzi, Omar Hlimi
Affiliations: Institut de Recherche Technologique Saint Exupéry; Centre national d'études spatiales
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AI4SPACE@CVPR conference

Click to view abstract

Abstract:Satellite image restoration aims to improve image quality by compensating for degradations (e.g., noise and blur) introduced by the imaging system and acquisition conditions. As a fundamental preprocessing step, restoration directly impacts both ground-based product generation and emerging onboard AI applications. Traditional restoration pipelines based on sequential physical models are computationally intensive and slow, making them unsuitable for onboard environments. In this paper, we introduce ConvBEERS: a Convolutional Board-ready Embedded and Efficient Restoration model for Space to investigate whether a light and non-generative residual convolutional network, trained on simulated satellite data, can match or surpass a traditional ground-processing restoration pipeline across multiple operating conditions. Experiments conducted on simulated datasets and real Pleiades-HR imagery demonstrate that the proposed approach achieves competitive image quality, with a +6.9dB PSNR improvement. Evaluation on a downstream object detection task demonstrates that restoration significantly improves performance, with up to +5.1% mAP@50. In addition, successful deployment on a Xilinx Versal VCK190 FPGA validates its practical feasibility for satellite onboard processing, with a ~41x reduction in latency compared to the traditional pipeline. These results demonstrate the relevance of using lightweight CNNs to achieve competitive restoration quality while addressing real-world constraints in spaceborne systems.

[CV-29] Image-to-Image Translation Framework Embedded with Rotation Symmetry Priors

【Quick Read】: This paper targets the performance limits of unsupervised image-to-image translation (I2I) caused by the lack of paired data, focusing on preserving domain-invariant features and improving generation quality without supervision. The key is to embed transformation symmetry priors into the I2I network: rotation group equivariant convolutions yield a rotation-equivariant I2I framework that preserves rotation symmetry, one of the most intrinsic and domain-invariant properties of natural and scientific images, throughout the network. The paper further proposes Transformation Learnable Equivariant Convolutions (TL-Conv) that adaptively learn the transformation group for each dataset, with exact equivariance in continuous domains and a proven bound on the equivariance error in discrete cases, strengthening symmetry preservation and generalization across diverse I2I tasks.

Link: https://arxiv.org/abs/2604.12805
Authors: Feiyu Tan, Heran Yang, Qihong Duan, Kai Ye, Qi Xie, Deyu Meng
Affiliations: Xi'an Jiaotong University; MOE Key Lab for Intelligent Networks and Networks Security
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 8 figures, submitting to TPAMI

Click to view abstract

Abstract:Image-to-image translation (I2I) is a fundamental task in computer vision, focused on mapping an input image from a source domain to a corresponding image in a target domain while preserving domain-invariant features and adapting domain-specific attributes. Despite the remarkable success of deep learning-based I2I approaches, the lack of paired data and unsupervised learning framework still hinder their effectiveness. In this work, we address the challenge by incorporating transformation symmetry priors into image-to-image translation networks. Specifically, we introduce rotation group equivariant convolutions to achieve rotation equivariant I2I framework, a novel contribution, to the best of our knowledge, along this research direction. This design ensures the preservation of rotation symmetry, one of the most intrinsic and domain-invariant properties of natural and scientific images, throughout the network. Furthermore, we conduct a systematic study on image symmetry priors on real dataset and propose a novel transformation learnable equivariant convolutions (TL-Conv) that adaptively learns transformation groups, enhancing symmetry preservation across diverse datasets. We also provide a theoretical analysis of the equivariance error of TL-Conv, proving that it maintains exact equivariance in continuous domains and provide a bound for the error in discrete cases. Through extensive experiments across a range of I2I tasks, we validate the effectiveness and superior performance of our approach, highlighting the potential of equivariant networks in enhancing generation quality and its broad applicability. Our code is available at this https URL
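The core property that rotation group equivariant convolutions rely on can be demonstrated with a minimal C4 (90-degree rotation) lifting correlation: correlating an image with all four rotations of one kernel, so that rotating the input only rotates and cyclically shifts the orientation channels. This is a generic group-equivariance sketch, not the paper's TL-Conv.

```python
import numpy as np

def correlate_valid(img, ker):
    """Plain 'valid' cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = ker.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * ker)
    return out

def c4_lifting_conv(image, kernel):
    """Correlate the image with all four 90-degree rotations of one kernel.

    Output shape (4, H', W'): one orientation channel per rotation. Rotating
    the input rotates the spatial maps and cyclically shifts the channels,
    rather than producing unrelated features: the C4 equivariance property."""
    return np.stack([correlate_valid(image, np.rot90(kernel, r)) for r in range(4)])

rng = np.random.default_rng(3)
img = rng.normal(size=(8, 8))
ker = rng.normal(size=(3, 3))
out = c4_lifting_conv(img, ker)
out_rot = c4_lifting_conv(np.rot90(img), ker)    # same features, rotated input
```

For 90-degree rotations this discrete equivariance is exact; the paper's analysis concerns the harder case of groups that do not align with the pixel grid.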

[CV-30] Generative Anonymization in Event Streams CVPR2026

【Quick Read】: This paper addresses the privacy problem of deploying neuromorphic vision sensors in public spaces: event streams can be reconstructed into intensity video by event-to-video (E2V) models, exposing human identities, while existing obfuscation such as masking or scrambling corrupts the spatio-temporal structure and severely degrades downstream perception. The key is the first generative anonymization framework for event streams: a cross-modal bridge projects asynchronous events into an intermediate intensity representation, a pretrained generative model synthesizes realistic but non-existent facial identities, and the features are re-encoded back into the neuromorphic domain, preserving structural integrity while protecting privacy and balancing privacy against data utility.

Link: https://arxiv.org/abs/2604.12803
Authors: Adam T. Müller, Mihai Kocsis, Nicolaj C. Stache
Affiliations: Heilbronn University of Applied Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to the 1st Workshop on Low-Level Vision Frontiers (LoViF) at IEEE/CVF CVPR 2026

Click to view abstract

Abstract:Neuromorphic vision sensors offer low latency and high dynamic range, but their deployment in public spaces raises severe data protection concerns. Recent Event-to-Video (E2V) models can reconstruct high-fidelity intensity images from sparse event streams, inadvertently exposing human identities. Current obfuscation methods, such as masking or scrambling, corrupt the spatio-temporal structure, severely degrading data utility for downstream perception tasks. In this paper, to the best of our knowledge, we present the first generative anonymization framework for event streams to resolve this utility-privacy trade-off. By bridging the modality gap between asynchronous events and standard spatial generative models, our pipeline projects events into an intermediate intensity representation, leverages pretrained models to synthesize realistic, non-existent identities, and re-encodes the features back into the neuromorphic domain. Experiments demonstrate that our method reliably prevents identity recovery from E2V reconstructions while preserving the structural data integrity required for downstream vision tasks. Finally, to facilitate rigorous evaluation, we introduce a novel, synchronized real-world event and RGB dataset captured via precise robotic trajectories, providing a robust benchmark for future research in privacy-preserving neuromorphic vision.

[CV-31] Fragile Reconstruction: Adversarial Vulnerability of Reconstruction-Based Detectors for Diffusion-Generated Images

【Quick Read】: This paper exposes a serious vulnerability of reconstruction-based detectors of AI-generated images: adding imperceptible adversarial perturbations to input images collapses detection accuracy to near zero, and the attacks transfer, so a white-box attack crafted against one detector succeeds against others, meaning black-box attacks are also feasible. The key finding is the root cause of these failures: attacked samples have too low a signal-to-noise ratio (SNR) as perceived by the detectors, which therefore cannot reliably separate real from generated images. This reveals fundamental limitations of reconstruction-based detection and calls for rethinking existing detection strategies.

Link: https://arxiv.org/abs/2604.12781
Authors: Haoyang Jiang, Mingyang Yi, Shaolei Zhang, Junxian Cai, Qingbin Liu, Xi Chen, Ju Fan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recently, detecting AI-generated images produced by diffusion-based models has attracted increasing attention due to their potential threat to safety. Among existing approaches, reconstruction-based methods have emerged as a prominent paradigm for this task. However, we find that such methods exhibit severe security vulnerabilities to adversarial perturbations; that is, by adding imperceptible adversarial perturbations to input images, the detection accuracy of classifiers collapses to near zero. To verify this threat, we present a systematic evaluation of the adversarial robustness of three representative detectors across four diverse generative backbone models. First, we construct adversarial attacks in white-box scenarios, which degrade the performance of all well-trained detectors. Moreover, we find that these attacks demonstrate transferability; specifically, attacks crafted against one detector can be transferred to others, indicating that adversarial attacks on detectors can also be constructed in a black-box setting. Finally, we assess common countermeasures and find that standard defense methods against adversarial attacks provide limited mitigation. We attribute these failures to the low signal-to-noise ratio (SNR) of attacked samples as perceived by the detectors. Overall, our results reveal fundamental security limitations of reconstruction-based detectors and highlight the need to rethink existing detection strategies.
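The kind of imperceptible perturbation studied here can be sketched with a standard L-infinity PGD loop that pushes a detector score down while staying inside an eps-ball. The "detector" below is a fully made-up one-line reconstruction-error stand-in; the paper attacks real learned reconstruction-based detectors.

```python
import numpy as np

def pgd_attack(x, grad_fn, eps=0.03, alpha=0.01, steps=10):
    """L-infinity PGD that minimizes a detector score (grad_fn = d score / d x)."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv - alpha * np.sign(grad_fn(x_adv))   # descend the score
        x_adv = np.clip(x_adv, x - eps, x + eps)          # stay in the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                  # keep valid pixel values
    return x_adv

# Toy stand-in for a reconstruction-error score: mean squared distance to a
# fixed 'reconstruction' target (real detectors compute this with a diffusion model).
target = np.full(16, 0.5)
score = lambda x: np.mean((x - target) ** 2)
grad = lambda x: 2.0 * (x - target) / x.size

x = np.full(16, 0.6)                                      # clean 'generated' sample
x_adv = pgd_attack(x, grad)
```

The perturbation stays within the eps budget (here 0.03 per pixel), which is what makes such attacks visually imperceptible.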

[CV-32] Efficient Adversarial Training via Criticality-Aware Fine-Tuning

【Quick Read】: This paper addresses the prohibitive cost of adversarial training (AT) for large Vision Transformer (ViT) models: robustness does not scale with parameter count, yet standard AT must fine-tune the whole model. The key is Criticality-Aware Adversarial Training (CAAT), which first efficiently identifies the parameters that contribute most to adversarial robustness and then applies parameter-efficient fine-tuning (PEFT) only to the selected modules. Tuning roughly 6% of the parameters incurs only a 4.3% drop in robustness relative to standard AT, and experiments on three widely used adversarial learning datasets confirm its advantage, pointing toward efficient robust training of large ViT models.

Link: https://arxiv.org/abs/2604.12780
Authors: Wenyun Li, Zheng Zhang, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan
Affiliations: Harbin Institute of Technology, Shenzhen; Pengcheng Laboratory; Pazhou Laboratory (Huangpu)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Vision Transformer (ViT) models have achieved remarkable performance across various vision tasks, with scalability being a key advantage when applied to large datasets. This scalability enables ViT models to exhibit strong generalization capabilities. However, as the number of parameters increases, the robustness of ViT models to adversarial examples does not scale proportionally. Adversarial training (AT), one of the most effective methods for enhancing robustness, typically requires fine-tuning the entire model, leading to prohibitively high computational costs, especially for large ViT architectures. In this paper, we aim to robustly fine-tune only a small subset of parameters to achieve robustness comparable to standard AT. To accomplish this, we introduce Criticality-Aware Adversarial Training (CAAT), a novel method that adaptively allocates resources to the most robustness-critical parameters, fine-tuning only selected modules. Specifically, CAAT efficiently identifies parameters that contribute most to adversarial robustness. It then leverages parameter-efficient fine-tuning (PEFT) to robustly adjust weight matrices where the number of critical parameters exceeds a predefined threshold. CAAT exhibits favorable generalization when scaled to larger vision transformer architectures, potentially paving the way for adversarial training at scale, e.g., compared with plain adversarial training, CAAT incurs only a 4.3% decrease in adversarial robustness while tuning approximately 6% of its parameters. Extensive experiments on three widely used adversarial learning datasets demonstrate that CAAT outperforms state-of-the-art lightweight AT methods with fewer trainable parameters.
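The selection step, ranking parameters by a robustness-criticality score and fine-tuning only modules with enough critical parameters, can be sketched as follows. Gradient magnitudes under adversarial examples serve as the criticality proxy here, and the module dictionary, fractions, and thresholds are invented for illustration.

```python
import numpy as np

def select_critical_modules(adv_grads, top_frac=0.1, min_frac=0.05):
    """Pick modules worth robust fine-tuning.

    adv_grads: dict of module name -> gradient array of the adversarial loss
    w.r.t. that module's weights. A parameter counts as 'critical' if its
    gradient magnitude lies in the global top `top_frac`; a module is selected
    when more than `min_frac` of its parameters are critical."""
    all_mags = np.concatenate([np.abs(g).ravel() for g in adv_grads.values()])
    cut = np.quantile(all_mags, 1.0 - top_frac)
    return [name for name, g in adv_grads.items()
            if (np.abs(g) > cut).mean() > min_frac]

# Toy gradients: one attention block dominates the adversarial loss.
rng = np.random.default_rng(4)
grads = {'block0.attn': rng.normal(scale=0.01, size=100),
         'block7.attn': rng.normal(scale=1.0, size=100),
         'block3.mlp': rng.normal(scale=0.01, size=100)}
critical = select_critical_modules(grads)
```

The selected module names would then be handed to a PEFT method so that only those weight matrices are updated during adversarial training.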

[CV-33] Cognition-Inspired Dual-Stream Semantic Enhancement for Vision-Based Dynamic Emotion Modeling ICRA2026

【Quick Read】: This paper addresses the absence of emotion-perception and cognitive theory in existing vision-based dynamic emotion modeling, which fails to emulate how the human brain dynamically integrates sensory input with semantic and contextual knowledge to construct emotional percepts. The key is DuSE, a cognition-inspired dual-stream semantic enhancement framework that explicitly models two neuro-cognitive mechanisms: a Hierarchical Temporal Prompt Cluster (HTPC) operationalizes the cognitive priming effect, letting linguistic cues pre-sensitize neural pathways and modulate the processing of visual stimuli, and a Latent Semantic Emotion Aggregator (LSEA) computationally models conceptual integration, fusing sensory input with learned conceptual knowledge in the spirit of how the hippocampus and default mode network construct coherent emotional experience. The design markedly improves the performance and interpretability of dynamic facial expression recognition (DFER).

Link: https://arxiv.org/abs/2604.12777
Authors: Huanzhen Wang, Ziheng Zhou, Zeng Tao, Aoxing Li, Yingkai Zhao, Yuxuan Lin, Yan Wang, Wenqiang Zhang
Affiliations: Fudan University; East China Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE ICRA 2026

Click to view abstract

Abstract:The human brain constructs emotional percepts not by processing facial expressions in isolation, but through a dynamic, hierarchical integration of sensory input with semantic and contextual knowledge. However, existing vision-based dynamic emotion modeling approaches often neglect emotion perception and cognitive theories. To bridge this gap between machine and human emotion perception, we propose cognition-inspired Dual-stream Semantic Enhancement (DuSE). Our model instantiates a dual-stream cognitive architecture. The first stream, a Hierarchical Temporal Prompt Cluster (HTPC), operationalizes the cognitive priming effect. It simulates how linguistic cues pre-sensitize neural pathways, modulating the processing of incoming visual stimuli by aligning textual semantics with fine-grained temporal features of facial dynamics. The second stream, a Latent Semantic Emotion Aggregator (LSEA), computationally models the knowledge integration process, akin to the mechanism described by the Conceptual Act Theory. It aggregates sensory inputs and synthesizes them with learned conceptual knowledge, reflecting the role of the hippocampus and default mode network in constructing a coherent emotional experience. By explicitly modeling these neuro-cognitive mechanisms, DuSE provides a more neurally plausible and robust framework for dynamic facial expression recognition (DFER). Extensive experiments on challenging in-the-wild benchmarks validate our cognition-centric approach, demonstrating that emulating the brain’s strategies for emotion processing yields state-of-the-art performance and enhances model interpretability.

[CV-34] CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models

【Quick Read】: This paper addresses the substantial computational overhead that multimodal large language models (MLLMs) incur from highly redundant visual token sequences. Existing methods rely on single-layer Vision Transformer (ViT) features and static pruning, which are brittle under diverse instructions. The key is CLASP, a plug-and-play token-reduction framework built on class-adaptive layer fusion and dual-stage pruning: it first constructs category-specific visual representations via multi-layer vision feature fusion, then prunes in two stages, splitting the token budget between attention-salient pivot tokens for relevance and redundancy-aware completion tokens for coverage. Class-adaptive pruning enables prompt-conditioned feature fusion and budget allocation, achieving aggressive yet robust visual token reduction.

Link: https://arxiv.org/abs/2604.12767
Authors: Yunkai Dang, Yizhu Jiang, Yifan Jiang, Qi Fan, Yinghuan Shi, Wenbin Li, Yang Gao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) suffer from substantial computational overhead due to the high redundancy in visual token sequences. Existing approaches typically address this issue using single-layer Vision Transformer (ViT) features and static pruning strategies. However, such fixed configurations are often brittle under diverse instructions. To overcome these limitations, we propose CLASP, a plug-and-play token reduction framework based on class-adaptive layer fusion and dual-stage pruning. Specifically, CLASP first constructs category-specific visual representations through multi-layer vision feature fusion. It then performs dual-stage pruning, allocating the token budget between attention-salient pivot tokens for relevance and redundancy-aware completion tokens for coverage. Through class-adaptive pruning, CLASP enables prompt-conditioned feature fusion and budget allocation, allowing aggressive yet robust visual token reduction. Extensive experiments demonstrate that CLASP consistently outperforms existing methods across a wide range of benchmarks, pruning ratios, and MLLM architectures. Code will be available at this https URL.
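The dual-stage budget split, attention-salient pivot tokens first and redundancy-aware completion tokens second, can be sketched with a greedy NumPy routine. Token features, attention scores, the budget, and the pivot fraction below are all illustrative assumptions, not CLASP's actual configuration.

```python
import numpy as np

def dual_stage_prune(tokens, attn, budget, pivot_frac=0.5):
    """Keep `budget` visual tokens in two stages.

    Stage 1: the most attention-salient 'pivot' tokens (relevance).
    Stage 2: greedily add 'completion' tokens with the lowest maximum cosine
    similarity to the kept set (redundancy-aware coverage)."""
    k_pivot = int(budget * pivot_frac)
    kept = [int(i) for i in np.argsort(attn)[-k_pivot:]]
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    remaining = [i for i in range(len(tokens)) if i not in kept]
    while len(kept) < budget and remaining:
        sims = normed[remaining] @ normed[kept].T          # cosine sim to kept set
        pick = remaining[int(np.argmin(sims.max(axis=1)))]  # least redundant candidate
        kept.append(pick)
        remaining.remove(pick)
    return sorted(kept)

rng = np.random.default_rng(5)
tokens = rng.normal(size=(20, 8))    # toy visual-token features
attn = rng.uniform(size=20)          # toy attention saliency per token
kept = dual_stage_prune(tokens, attn, budget=6)
```

Splitting the budget this way keeps the tokens the prompt attends to while still covering visually distinct regions the attention may have missed.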

[CV-35] A Dataset and Evaluation for Complex 4D Markerless Human Motion Capture CVPR2026

【Quick Read】: This paper addresses two problems: marker-based 4D motion capture (MoCap) systems are hard to deploy in real-world settings, and existing markerless 4D MoCap models degrade because their training data lack complex multi-person interactions, severe occlusion, and challenging motion patterns. The key is a new high-quality benchmark covering single- and multi-person scenarios with intricate motions, frequent inter-person occlusions, rapid position exchanges between similarly dressed subjects, and varying subject distances. It provides synchronized multi-view RGB and depth sequences, accurate camera calibration, ground-truth 3D motion from a Vicon system, and corresponding SMPL/SMPL-X parameters, precisely aligning visual observations with motion ground truth, narrowing the domain gap, and enabling rigorous evaluation for model development.

Link: https://arxiv.org/abs/2604.12765
Authors: Yeeun Park, Miqdad Naduthodi, Suryansh Kumar
Affiliations: Texas A&M University, College Station, Texas, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 14 pages, 11 figures, 4 tables. Accepted for publication at CVPR 2026 4D World Models Workshop

Click to view abstract

Abstract:Marker-based motion capture (MoCap) systems have long been the gold standard for accurate 4D human modeling, yet their reliance on specialized hardware and markers limits scalability and real-world deployment. Advancing reliable markerless 4D human motion capture requires datasets that reflect the complexity of real-world human interactions. Yet, existing benchmarks often lack realistic multi-person dynamics, severe occlusions, and challenging interaction patterns, leading to a persistent domain gap. In this work, we present a new dataset and evaluation for complex 4D markerless human motion capture. Our proposed MoCap dataset captures both single and multi-person scenarios with intricate motions, frequent inter-person occlusions, rapid position exchanges between similarly dressed subjects, and varying subject distances. It includes synchronized multi-view RGB and depth sequences, accurate camera calibration, ground-truth 3D motion capture from a Vicon system, and corresponding SMPL/SMPL-X parameters. This setup ensures precise alignment between visual observations and motion ground truth. Benchmarking state-of-the-art markerless MoCap models reveals substantial performance degradation under these realistic conditions, highlighting limitations of current approaches. We further demonstrate that targeted fine-tuning improves generalization, validating the dataset’s realism and value for model development. Our evaluation exposes critical gaps in existing models and provides a rigorous foundation for advancing robust markerless 4D human motion capture.

[CV-36] Scaling In-Context Segmentation with Hierarchical Supervision

【Quick Read】: This paper addresses the poor computational efficiency of few-shot medical image segmentation models that depend on dense global cross-attention, which scales badly with image resolution. Recent localized-attention methods lack explicit supervision of region selection and waste computation on uninformative regions. The key is PatchICL, a hierarchical framework combining selective image patching with multi-level supervision so that the model actively identifies and attends only to the most informative anatomical regions: at 512×512 resolution it reduces compute by 44% while keeping competitive segmentation accuracy, and it generalizes better out of domain, with particular strength on modalities dominated by localized pathology such as OCT and dermoscopy.

链接: https://arxiv.org/abs/2604.12752
作者: T. Camaret Ndir,Marco Reisert,Robin T. Schirrmeister
机构: Medical Center – University of Freiburg (弗莱堡大学医学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In-context learning (ICL) enables medical image segmentation models to adapt to new anatomical structures from limited examples, reducing the clinical annotation burden. However, standard ICL methods typically rely on dense, global cross-attention, which scales poorly with image resolution. While recent approaches have introduced localized attention mechanisms, they often lack explicit supervision on the selection process, leading to redundant computation in non-informative regions. We propose PatchICL, a hierarchical framework that combines selective image patching with multi-level supervision. Our approach learns to actively identify and attend only to the most informative anatomical regions. Compared to UniverSeg, a strong global-attention baseline, PatchICL achieves competitive in-domain CT segmentation accuracy while reducing compute by 44% at 512×512 resolution. On 35 out-of-domain datasets spanning diverse imaging modalities, PatchICL outperforms the baseline on 6 of 13 modality categories, with particular strength on modalities dominated by localized pathology such as OCT and dermoscopy. Training and evaluation code are available at this https URL
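论文未给出选块过程的具体实现,下面用一个极简 NumPy 示意说明"只保留信息量最高的图像块参与跨图注意力"这一思路(函数名与 keep_ratio 超参均为本文假设,并非官方代码):

```python
import numpy as np

def select_patches(scores, keep_ratio=0.25):
    """按信息量得分保留得分最高的一部分图像块索引,
    其余块跳过跨图注意力以节省计算(示意版)。
    scores: (P,) 每个图像块的信息量得分。"""
    k = max(1, int(len(scores) * keep_ratio))
    idx = np.argsort(scores)[::-1][:k]   # 得分从高到低取前 k 个
    return np.sort(idx)

scores = np.array([0.1, 0.9, 0.05, 0.8, 0.2, 0.7, 0.02, 0.3])
kept = select_patches(scores)   # 8 个块只保留 2 个,注意力计算量大致随之下降
```

若像论文的多级监督思想那样,对选块过程本身施加显式监督,就能避免把计算浪费在无信息区域。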

[CV-37] AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition

【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的多模态情感识别中因静态参数化记忆导致的幻觉问题,以及单轮检索增强生成(Retrieval-Augmented Generation, RAG)在处理跨模态情感依赖时易受模态歧义干扰的问题。解决方案的关键在于提出AffectAgent框架,其核心是通过三个协同优化的专业化代理——查询规划器(query planner)、证据过滤器(evidence filter)和情感生成器(emotion generator)——实现细粒度情感理解。这三个代理采用多智能体近端策略优化(Multi-Agent Proximal Policy Optimization, MAPPO)进行端到端训练,并共享情感奖励信号以保证情感一致性;同时引入模态平衡专家混合(Modality-Balancing Mixture of Experts, MB-MoE)与检索增强自适应融合(Retrieval-Augmented Adaptive Fusion, RAAF),分别动态调节不同模态贡献以缓解跨模态异质性带来的表征不匹配,并在缺失模态条件下利用检索到的视听嵌入增强语义补全能力。

链接: https://arxiv.org/abs/2604.12735
作者: Zeheng Wang,Zitong Yu,Yijie Zhu,Bo Zhao,Haochen Liang,Taorui Wang,Wei Xia,Jiayu Zhang,Zhishu Liu,Hui Ma,Fei Ma,Qi Tian
机构: Great Bay University(大湾大学); Guangming Laboratory(光明实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:LLM-based multimodal emotion recognition relies on static parametric memory and often hallucinates when interpreting nuanced affective states. In this paper, given that single-round retrieval-augmented generation is highly susceptible to modal ambiguity and therefore struggles to capture complex affective dependencies across modalities, we introduce AffectAgent, an affect-oriented multi-agent retrieval-augmented generation framework that leverages collaborative decision-making among agents for fine-grained affective understanding. Specifically, AffectAgent comprises three jointly optimized specialized agents, namely a query planner, an evidence filter, and an emotion generator, which collaboratively perform analytical reasoning to retrieve cross-modal samples, assess evidence, and generate predictions. These agents are optimized end-to-end using Multi-Agent Proximal Policy Optimization (MAPPO) with a shared affective reward to ensure consistent emotion understanding. Furthermore, we introduce Modality-Balancing Mixture of Experts (MB-MoE) and Retrieval-Augmented Adaptive Fusion (RAAF), where MB-MoE dynamically regulates the contributions of different modalities to mitigate representation mismatch caused by cross-modal heterogeneity, while RAAF enhances semantic completion under missing-modality conditions by incorporating retrieved audiovisual embeddings. Extensive experiments on MER-UniBench demonstrate that AffectAgent achieves superior performance across complex scenarios. Our code will be released at: this https URL.

[CV-38] Information-Theoretic Optimization for Task-Adapted Compressed Sensing Magnetic Resonance Imaging

【速读】:该论文旨在解决任务自适应压缩感知磁共振成像(task-adapted compressed sensing magnetic resonance imaging, CS-MRI)中存在的不确定性问题以及无法在端到端优化中实现采样与重建或临床任务自适应调整的局限性。其解决方案的关键在于从信息论角度出发,通过最大化欠采样k空间测量与临床任务之间的互信息(mutual information),构建可进行概率推理的框架,从而同时实现不确定性预测和任意采样率的自适应控制。为此,作者引入了摊销优化(amortized optimization)并设计了可计算的变分下界来逼近互信息,使采样、重建与任务推理模型能够在单一端到端训练模型中联合优化,具备灵活的采样比例调控能力,并统一处理两类临床场景:一是重建作为辅助过程以提升任务性能的联合任务与重建;二是抑制重建以保护隐私的任务执行模式。实验表明,该方法在Dice等标准指标上与确定性基线表现相当,且在广义能量距离(GED)上更接近真实后验分布,验证了其在准确性与不确定性建模上的优势。

链接: https://arxiv.org/abs/2604.12709
作者: Xinyu Peng,Ziyang Zheng,Wenrui Dai,Duoduo Xue,Shaohui Li,Chenglin Li,Junni Zou,Hongkai Xiong
机构: Shanghai Jiao Tong University (上海交通大学); City University of Hong Kong (香港城市大学); Zhejiang University (浙江大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 68 pages, 15 figures, accepted by IEEE TPAMI

点击查看摘要

Abstract:Task-adapted compressed sensing magnetic resonance imaging (CS-MRI) is emerging to address the specific demands of downstream clinical tasks with significantly fewer k-space measurements than required by Nyquist sampling. However, existing task-adapted CS-MRI methods suffer from the uncertainty problem for medical diagnosis and cannot achieve adaptive sampling in end-to-end optimization with reconstruction or clinical tasks. To address these limitations, we propose the first task-adapted CS-MRI from the information-theoretic perspective to simultaneously achieve probabilistic inference for uncertainty prediction and adapt to arbitrary sampling ratios and versatile clinical applications. Specifically, we formalize the task-adapted CS-MRI optimization problem by maximizing the mutual information between undersampled k-space measurements and clinical tasks to enable probabilistic inference for addressing the uncertainty problem. We leverage amortized optimization and construct tractable variational bounds for mutual information to jointly optimize sampling, reconstruction, and task-inference models, which enables flexible sampling ratio control using a single end-to-end trained model. Furthermore, the proposed framework addresses two kinds of distinct clinical scenarios within a unified approach, i.e., i) joint task and reconstruction, where reconstruction serves as an auxiliary process to enhance task performance; and ii) task implementation with suppressed reconstruction, applicable for privacy protection. Extensive experiments on large-scale MRI datasets demonstrate that the proposed framework achieves highly competitive performance on standard metrics like Dice compared to its deterministic counterpart but provides better distribution matching to the ground-truth posterior distribution as measured by the generalized energy distance (GED).
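摘要中"最大化欠采样测量与临床任务间互信息"的可计算下界,通常可写成如下变分形式(此处为一般性的 Barber–Agakov 下界示意,并非论文原式;y 表示欠采样 k 空间测量,t 表示临床任务变量,q_φ 为任务推理网络):

```latex
I(y; t) \;=\; H(t) - H(t \mid y)
\;\ge\; H(t) + \mathbb{E}_{p(y,\,t)}\big[\log q_\phi(t \mid y)\big]
```

由于 H(t) 与模型参数无关,最大化该下界等价于最小化任务推理的交叉熵,这也解释了采样、重建与任务推理模型为何可以端到端联合优化。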

[CV-39] Risk-Calibrated Learning: Minimizing Fatal Errors in Medical AI IJCNN2026

【速读】:该论文旨在解决深度学习模型在医学图像分类中存在语义不一致问题,即模型虽具备专家级准确率,却常产生高置信度的致命错误(如将恶性肿瘤误判为良性),这类错误与视觉模糊导致的可接受误差本质不同,会严重削弱临床信任。解决方案的关键在于提出风险校准学习(Risk-Calibrated Learning),通过在优化过程中嵌入一个混淆感知的临床严重性矩阵 $ M $,显式区分视觉模糊引起的细粒度错误与结构层面的灾难性错误(如假阴性),从而抑制关键错误而不依赖复杂的网络结构调整。实验表明,该方法在四种成像模态下均显著降低关键错误率(Critical Error Rate, CER),相对现有最优基线(如Focal Loss)提升安全性能达20.0%至92.4%。

链接: https://arxiv.org/abs/2604.12693
作者: Abolfazl Mohammadi-Seif,Ricardo Baeza-Yates
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been accepted for publication in the Proceedings of the 2026 International Joint Conference on Neural Networks (IJCNN 2026). The final published version should be cited

点击查看摘要

Abstract:Deep learning models often achieve expert-level accuracy in medical image classification but suffer from a critical flaw: semantic incoherence. These high-confidence mistakes that are semantically incoherent (e.g., classifying a malignant tumor as benign) fundamentally differ from acceptable errors which stem from visual ambiguity. Unlike safe, fine-grained disagreements, these fatal failures erode clinical trust. To address this, we propose Risk-Calibrated Learning, a technique that explicitly distinguishes between visual ambiguity (fine-grained errors) and catastrophic structural errors. By embedding a confusion-aware clinical severity matrix M into the optimization landscape, our method suppresses critical errors (false negatives) without requiring complex architectural changes. We validate our approach in four different imaging modalities: Brain Tumor MRI, ISIC 2018 (Dermoscopy), BreaKHis (Breast Histopathology), and SICAPv2 (Prostate Histopathology). Extensive experiments demonstrate that our Risk-Calibrated Loss consistently reduces the Critical Error Rate (CER) for all four datasets, achieving relative safety improvements ranging from 20.0% (on breast histopathology) to 92.4% (on prostate histopathology) compared to state-of-the-art baselines such as Focal Loss. These results confirm that our method offers a superior safety-accuracy trade-off across both CNN and Transformer architectures.
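论文未公开损失的具体形式,下面给出一种"用严重性矩阵 M 加权交叉熵"的极简 NumPy 示意(加权方式与示例代价均为本文假设),用于说明如何把临床代价嵌入优化目标、重点抑制假阴性:

```python
import numpy as np

def risk_calibrated_loss(probs, labels, M):
    """严重性加权交叉熵的示意实现(非论文官方形式)。
    probs: (N, C) 预测概率;labels: (N,) 真实类别;
    M[i, j]: 把真实类别 i 误判为 j 的临床代价(对角线为 0)。"""
    N = probs.shape[0]
    ce = -np.log(probs[np.arange(N), labels] + 1e-12)   # 标准交叉熵
    severity = (probs * M[labels]).sum(axis=1)          # 按预测分布计算期望误判代价
    return (ce * (1.0 + severity)).mean()               # 代价越高的错误惩罚越重

# 二分类示例:恶性(类 1)误判为良性(类 0)即假阴性,代价设为 5
M = np.array([[0.0, 1.0],
              [5.0, 0.0]])
probs = np.array([[0.2, 0.8],    # 恶性样本,预测基本正确
                  [0.7, 0.3]])   # 恶性样本,高置信误判为良性
labels = np.array([1, 1])
```

当 M 全零时该损失退化为普通交叉熵;增大 M[1, 0](恶性误判为良性的代价)即可放大此类致命错误的惩罚。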

[CV-40] Brain-DiT: A Universal Multi-state fMRI Foundation Model with Metadata-Conditioned Pretraining

【速读】:该论文旨在解决当前功能磁共振成像(fMRI)基础模型在脑状态多样性不足和预训练任务不匹配方面的局限性,从而限制了其跨多种脑状态学习通用表征的能力。解决方案的关键在于提出一种名为Brain-DiT的通用多状态fMRI基础模型,其核心创新是采用基于元数据条件控制的扩散预训练机制(metadata-conditioned diffusion pretraining),并借助扩散Transformer(Diffusion Transformer, DiT)架构,使模型能够学习多层次的表征,既捕捉精细的功能结构,又保留全局语义信息。此外,通过引入元数据条件控制,有效分离了个体神经动态与群体层面的变异性,显著提升了下游任务性能。

链接: https://arxiv.org/abs/2604.12683
作者: Junfeng Xia,Wenhao Ye,Xuanye Pan,Xinke Shen,Mo Wang,Quanying Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Current fMRI foundation models primarily rely on a limited range of brain states and mismatched pretraining tasks, restricting their ability to learn generalized representations across diverse brain states. We present Brain-DiT, a universal multi-state fMRI foundation model pretrained on 349,898 sessions from 24 datasets spanning resting, task, naturalistic, disease, and sleep states. Unlike prior fMRI foundation models that rely on masked reconstruction in the raw-signal space or a latent space, Brain-DiT adopts metadata-conditioned diffusion pretraining with a Diffusion Transformer (DiT), enabling the model to learn multi-scale representations that capture both fine-grained functional structure and global semantics. Across extensive evaluations and ablations on 7 downstream tasks, we find consistent evidence that diffusion-based generative pretraining is a stronger proxy than reconstruction or alignment, with metadata-conditioned pretraining further improving downstream performance by disentangling intrinsic neural dynamics from population-level variability. We also observe that downstream tasks exhibit distinct preferences for representational scale: ADNI classification benefits more from global semantic representations, whereas age/sex prediction comparatively relies more on fine-grained local structure. Code and parameters of Brain-DiT are available at this https URL

[CV-41] OFA-Diffusion Compression: Compressing Diffusion Model in One-Shot Manner

【速读】:该论文旨在解决扩散概率模型(Diffusion Probabilistic Model, DPM)在实际部署中因参数规模增大和计算开销上升而导致的效率瓶颈问题,尤其是针对不同设备资源约束下需多次压缩训练所引发的高重复训练开销问题。其解决方案的关键在于提出一种一次训练完成(once-for-all, OFA)的压缩框架,通过限制候选子网络为一组预设的参数规模,并采用基于通道重要性的渐进式通道分配策略构建各尺寸子网络,同时引入重加权机制平衡不同子网络的优化过程,从而在单次训练中获得多种计算复杂度的高效压缩模型,显著降低训练成本并保持良好性能。

链接: https://arxiv.org/abs/2604.12668
作者: Haoyang Jiang,Zekun Wang,Mingyang Yi,Xiuyu Li,Lanqing Hu,Junxian Cai,Qingbin Liu,Xi Chen,Ju Fan
机构: Renmin University of China (中国人民大学); Alibaba Inc. (阿里巴巴公司); Tencent Inc. (腾讯公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The Diffusion Probabilistic Model (DPM) achieves remarkable performance in image generation, while its increasing parameter size and computational overhead hinder its deployment in practical applications. To improve this, the existing literature focuses on obtaining a smaller model with a fixed architecture through model compression. However, in practice, DPMs usually need to be deployed on various devices with different resource constraints, which leads to multiple compression processes, incurring significant overhead for repeated training. To obviate this, we propose a once-for-all (OFA) compression framework for DPMs that yields different subnetworks with various computations in a one-shot training manner. The existing OFA framework typically involves massive subnetworks with different parameter sizes, while such a huge candidate space slows the optimization. Thus, we propose to restrict the candidate subnetworks with a certain set of parameter sizes, where each size corresponds to a specific subnetwork. Specifically, to construct each subnetwork with a given size, we gradually allocate the maintained channels by their importance. Furthermore, we propose a reweighting strategy to balance the optimization process of different subnetworks. Experimental results show that our approach can produce compressed DPMs for various sizes with significantly lower training overhead while achieving satisfactory performance.

[CV-42] Hypergraph-State Collaborative Reasoning for Multi-Object Tracking

【速读】:该论文旨在解决多目标跟踪(Multi-Object Tracking, MOT)中运动推理(Motion Reasoning)的两个关键问题:一是由噪声或概率性预测导致的轨迹不稳定,二是遮挡情况下因视觉线索消失而引起的轨迹断裂。解决方案的核心在于提出一种协同推理框架,通过多个相关联目标之间的联合推断来增强运动估计的鲁棒性。其关键技术是设计HyperSSM架构,融合超图(Hypergraph)计算与状态空间模型(State Space Model, SSM),其中超图模块利用动态超边捕捉空间运动相关性,SSM模块则通过结构化的状态转移实现时间平滑性约束,从而在空间一致性与时间连贯性之间达成协同优化,显著提升复杂场景下的运动估计稳定性与连续性。

链接: https://arxiv.org/abs/2604.12665
作者: Zikai Song,Junqing Yu,Yi-Ping Phoebe Chen,Wei Yang,Xinchao Wang
机构: Huazhong University of Science and Technology (华中科技大学); National University of Singapore (新加坡国立大学); La Trobe University (拉特罗布大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Motion reasoning serves as the cornerstone of multi-object tracking (MOT), as it enables consistent association of targets across frames. However, existing motion estimation approaches face two major limitations: (1) instability caused by noisy or probabilistic predictions, and (2) vulnerability under occlusion, where trajectories often fragment once visual cues disappear. To overcome these issues, we propose a collaborative reasoning framework that enhances motion estimation through joint inference among multiple correlated objects. By allowing objects with similar motion states to mutually constrain and refine each other, our framework stabilizes noisy trajectories and infers plausible motion continuity even when the target is occluded. To realize this concept, we design HyperSSM, an architecture that integrates Hypergraph computation and a State Space Model (SSM) for unified spatial-temporal reasoning. The Hypergraph module captures spatial motion correlations through dynamic hyperedges, while the SSM enforces temporal smoothness via structured state transitions. This synergistic design enables simultaneous optimization of spatial consensus and temporal coherence, resulting in robust and stable motion estimation. Extensive experiments on four mainstream and diverse benchmarks (MOT17, MOT20, DanceTrack, and SportsMOT) covering various motion patterns and scene complexities demonstrate that our approach achieves state-of-the-art performance across a wide range of tracking scenarios.

[CV-43] PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning

【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成模型在prompt following能力上的不足,尤其是现有强化学习(Reinforcement Learning, RL)方法中高质量奖励信号获取困难的问题:CLIP Score过于粗粒度,而基于视觉语言模型(Vision-Language Model, VLM)的奖励模型(如RewardDance)则依赖昂贵的人工标注偏好数据和额外微调。解决方案的关键在于提出PromptEcho,一种无需人工标注且无需训练奖励模型的奖励构建方法——它利用冻结的VLM对生成图像与原始prompt进行token级交叉熵损失计算,直接提取预训练阶段编码的图像-文本对齐知识,从而获得确定性、计算高效且可随更强开源VLM自动提升的奖励信号。

链接: https://arxiv.org/abs/2604.12652
作者: Jinlong Liu,Wanggui He,Peng Zhang,Mushui Liu,Hao Jiang,Pipei Huang
机构: Alibaba Group (阿里巴巴集团); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) can improve the prompt following capability of text-to-image (T2I) models, yet obtaining high-quality reward signals remains challenging: CLIP Score is too coarse-grained, while VLM-based reward models (e.g., RewardDance) require costly human-annotated preference data and additional fine-tuning. We propose PromptEcho, a reward construction method that requires \emphno annotation and \emphno reward model training. Given a generated image and a guiding query, PromptEcho computes the token-level cross-entropy loss of a frozen VLM with the original prompt as the label, directly extracting the image-text alignment knowledge encoded during VLM pretraining. The reward is deterministic, computationally efficient, and improves automatically as stronger open-source VLMs become available. For evaluation, we develop DenseAlignBench, a benchmark of concept-rich dense captions for rigorously testing prompt following capability. Experimental results on two state-of-the-art T2I models (Z-Image and QwenImage-2512) demonstrate that PromptEcho achieves substantial improvements on DenseAlignBench (+26.8pp / +16.2pp net win rate), along with consistent gains on GenEval, DPG-Bench, and TIIFBench without any task-specific training. Ablation studies confirm that PromptEcho comprehensively outperforms inference-based scoring with the same VLM, and that reward quality scales with VLM size. We will open-source the trained models and the DenseAlignBench.
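PromptEcho 的核心是"以原始 prompt 为标签、用冻结 VLM 的逐 token 交叉熵作奖励"。下面是一个脱离具体 VLM 的 NumPy 示意(logits 的来源方式与奖励取负号均为本文假设),仅演示奖励的计算逻辑:

```python
import numpy as np

def prompt_echo_reward(logits, prompt_ids):
    """以原始 prompt 为标签,计算逐 token 交叉熵并取负作为对齐奖励。
    logits: (T, V) 各位置的词表 logits;prompt_ids: (T,) prompt 的 token id。
    (示意版:真实实现中 logits 由冻结 VLM 在生成图像条件下给出)"""
    logits = logits - logits.max(axis=1, keepdims=True)             # 数值稳定的 log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    token_nll = -log_probs[np.arange(len(prompt_ids)), prompt_ids]  # 逐 token 交叉熵
    return -token_nll.mean()    # 交叉熵越低说明图文越对齐,奖励越高
```

该奖励是确定性的、无需标注或训练奖励模型,且随所用开源 VLM 变强而自动提升,这正是摘要强调的三点性质。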

[CV-44] Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis

【速读】:该论文旨在解决现有深度伪造(deepfake)检测研究主要聚焦于“说话状态”伪造而忽视“倾听状态”伪造的问题,尤其是在交互式场景中,攻击者常通过交替伪造说话与倾听状态来增强欺骗性。为应对这一挑战,作者提出了倾听深度伪造检测(Listening Deepfake Detection, LDD)任务,并构建了首个专门针对该任务的数据集ListenForge,其基于五种倾听头生成(Listening Head Generation, LHG)方法创建。解决方案的关键在于提出MANet——一种运动感知且音频引导的网络架构,该模型能够捕捉倾听视频中的细微运动不一致,并利用说话者的音频语义信息指导跨模态融合,从而有效识别倾听状态下的伪造内容。实验表明,传统以说话为中心的检测模型在倾听场景下表现不佳,而MANet在ListenForge数据集上显著优于现有方法,验证了其有效性。

链接: https://arxiv.org/abs/2604.12650
作者: Miao Liu,Fangda Wei,Jing Wang,Xinyuan Qian
机构: Beijing Institute of Technology (北京理工大学); University of Science and Technology Beijing (北京科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Submitted to ACMMM 2026

点击查看摘要

Abstract:Existing deepfake detection research has primarily focused on scenarios where the manipulated subject is actively speaking, i.e., generating fabricated content by altering the speaker’s appearance or voice. However, in realistic interaction settings, attackers often alternate between falsifying speaking and listening states to mislead their targets, thereby enhancing the realism and persuasiveness of the scenario. Although the detection of ‘listening deepfakes’ remains largely unexplored and is hindered by a scarcity of both datasets and methodologies, the relatively limited quality of synthesized listening reactions presents an excellent breakthrough opportunity for current deepfake detection efforts. In this paper, we present the task of Listening Deepfake Detection (LDD). We introduce ListenForge, the first dataset specifically designed for this task, constructed using five Listening Head Generation (LHG) methods. To address the distinctive characteristics of listening forgeries, we propose MANet, a Motion-aware and Audio-guided Network that captures subtle motion inconsistencies in listener videos while leveraging speaker’s audio semantics to guide cross-modal fusion. Extensive experiments demonstrate that existing Speaking Deepfake Detection (SDD) models perform poorly in listening scenarios. In contrast, MANet achieves significantly superior performance on ListenForge. Our work highlights the necessity of rethinking deepfake detection beyond the traditional speaking-centric paradigm and opens new directions for multimodal forgery analysis in interactive communication settings. The dataset and code are available at this https URL.

[CV-45] Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting

【速读】:该论文旨在解决当前具身人工智能(Embodied AI)代理训练中仿真环境视觉保真度不足以及动态人类建模能力有限的问题。现有模拟器多依赖基于网格的光栅化渲染,视觉真实感较差,且对动态人类虚拟角色的支持局限于网格表示,限制了代理在有人类活动的真实场景中的泛化能力。解决方案的关键在于提出Habitat-GS,一个以导航为中心的具身AI模拟器,其核心创新是集成3D高斯溅射(3D Gaussian Splatting, 3DGS)场景渲染技术实现实时逼真图像生成,并引入高斯虚拟角色模块(gaussian avatar module),使每个虚拟角色既能作为高保真视觉实体,又能作为有效的导航障碍物,从而支持代理学习人感知(human-aware)行为。实验表明,基于3DGS场景训练的代理具备更强的跨域泛化能力,混合域训练策略最优,且高斯虚拟角色显著提升了人感知导航性能。

链接: https://arxiv.org/abs/2604.12626
作者: Ziyuan Xia,Jingyi Xu,Chong Cui,Yuanhong Yu,Jiazhao Zhang,Qingsong Yan,Tao Ni,Junbo Chen,Xiaowei Zhou,Hujun Bao,Ruizhen Hu,Sida Peng
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Training embodied AI agents depends critically on the visual fidelity of simulation environments and the ability to model dynamic humans. Current simulators rely on mesh-based rasterization with limited visual realism, and their support for dynamic human avatars, where available, is constrained to mesh representations, hindering agent generalization to human-populated real-world scenarios. We present Habitat-GS, a navigation-centric embodied AI simulator extended from Habitat-Sim that integrates 3D Gaussian Splatting scene rendering and drivable gaussian avatars while maintaining full compatibility with the Habitat ecosystem. Our system implements a 3DGS renderer for real-time photorealistic rendering and supports scalable 3DGS asset import from diverse sources. For dynamic human modeling, we introduce a gaussian avatar module that enables each avatar to simultaneously serve as a photorealistic visual entity and an effective navigation obstacle, allowing agents to learn human-aware behaviors in realistic settings. Experiments on point-goal navigation demonstrate that agents trained on 3DGS scenes achieve stronger cross-domain generalization, with mixed-domain training being the most effective strategy. Evaluations on avatar-aware navigation further confirm that gaussian avatars enable effective human-aware navigation. Finally, performance benchmarks validate the system’s scalability across varying scene complexity and avatar counts.

[CV-46] Efficient Semantic Image Communication for Traffic Monitoring at the Edge

【速读】:该论文旨在解决视觉监控系统在通信带宽受限场景下,如何高效传输图像数据的同时保留关键语义信息的问题。传统方法因需传输全分辨率图像而效率低下,且对像素级细节的需求往往不必要;为此,作者提出两种语义驱动的图像通信管道——MMSD(多模态语义分解)与SAMR(语义感知掩码重建),其核心在于将原始图像转化为紧凑的语义表示(如分割图、边缘图和文本描述)或通过语义重要性指导的区域抑制策略,实现高压缩比下的有效内容恢复。二者均采用非对称(asymmetric)的发送-接收架构,边缘端执行轻量处理,服务器端利用生成模型完成高保真重建,从而在保证语义一致性前提下显著降低传输负载(平均压缩率达99%以上)。

链接: https://arxiv.org/abs/2604.12622
作者: Damir Assylbek,Nurmukhammed Aitymbetov,Marko Ristin,Dimitrios Zorbas
机构: Nazarbayev University, School of Engineering and Digital Sciences; Zurich University of Applied Sciences (ZHAW)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:Many visual monitoring systems operate under strict communication constraints, where transmitting full-resolution images is impractical and often unnecessary. In such settings, visual data is often used for object presence, spatial relationships, and scene context rather than exact pixel fidelity. This paper presents two semantic image communication pipelines for traffic monitoring, MMSD and SAMR, that reduce transmission cost while preserving meaningful visual information. MMSD (Multi-Modal Semantic Decomposition) targets very high compression together with data confidentiality, since sensitive pixel content is not transmitted. It replaces the original image with compact semantic representations, namely segmentation maps, edge maps, and textual descriptions, and reconstructs the scene at the receiver using a diffusion-based generative model. SAMR (Semantic-Aware Masking Reconstruction) targets higher visual quality while maintaining strong compression. It selectively suppresses non-critical image regions according to semantic importance before standard JPEG encoding and restores the missing content at the receiver through generative inpainting. Both designs follow an asymmetric sender-receiver architecture, where lightweight processing is performed at the edge and computationally intensive reconstruction is offloaded to the server. On a Raspberry Pi 5, the edge-side processing time is about 15s for MMSD and 9s for SAMR. Experimental results show average transmitted-data reductions of 99% for MMSD and 99.1% for SAMR. In addition, MMSD achieves lower payload size than the recent SPIC baseline while preserving strong semantic consistency, whereas SAMR provides a better quality-compression trade-off than standard JPEG and SQ-GAN under comparable operating conditions.
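SAMR 的"按语义重要性抑制非关键区域、再做 JPEG 编码"可以用如下极简 NumPy 示意表达(importance 图的来源与 keep_ratio 均为本文假设,接收端的生成式修复此处省略):

```python
import numpy as np

def semantic_mask(image, importance, keep_ratio=0.4):
    """按语义重要性保留前 keep_ratio 的像素,其余抹平为均值
    (熵更低、更利于 JPEG 压缩);接收端再用生成模型修复。
    importance 的来源与 keep_ratio 均为示意假设。"""
    thresh = np.quantile(importance, 1.0 - keep_ratio)
    mask = importance >= thresh                   # True = 语义关键区域
    return np.where(mask, image, image.mean()), mask
```

被抹平的区域灰度单一、局部熵低,经标准 JPEG 编码后字节数显著减少,这正是该管道压缩收益的来源。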

[CV-47] Spatial-Spectral Adaptive Fidelity and Noise Prior Reduction Guided Hyperspectral Image Denoising

【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)去噪中数据保真度与噪声先验建模之间难以平衡的问题。现有方法往往过度依赖图像内在先验,忽视了噪声类型的多样性以及保真项与先验正则项之间的动态权衡。其解决方案的关键在于提出一个融合噪声先验降维与空间-光谱自适应保真项的框架:该框架通过引入一个自适应权重张量(adaptive weight tensor)实现保真项与先验正则项的动态平衡,并结合基于代表性系数总变差(representative coefficient total variation regularizer)的快速鲁棒像素级模型,有效去除混合噪声的同时保留HSI的光谱低秩结构和局部平滑特性。

链接: https://arxiv.org/abs/2604.12600
作者: Xuelin Xie,Xiliang Lu,Zhengshan Wang,Yang Zhang,Long Chen
机构: Wuhan University (武汉大学); University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:The core challenge of hyperspectral image denoising is striking the right balance between data fidelity and noise prior modeling. Most existing methods place too much emphasis on the intrinsic priors of the image while overlooking diverse noise assumptions and the dynamic trade-off between fidelity and priors. To address these issues, we propose a denoising framework that integrates noise prior reduction and a spatial-spectral adaptive fidelity term. This framework considers comprehensive noise priors with fewer parameters and introduces an adaptive weight tensor to dynamically balance the fidelity and prior regularization terms. Within this framework, we further develop a fast and robust pixel-wise model combined with the representative coefficient total variation regularizer to accurately remove mixed noise in HSIs. The proposed method not only efficiently handles various types of noise but also accurately captures the spectral low-rank structure and local smoothness of HSIs. An efficient optimization algorithm based on the alternating direction method of multipliers is designed to ensure stable and fast convergence. Extensive experiments on simulated and real-world datasets demonstrate that the proposed model achieves superior denoising performance while maintaining competitive computational efficiency.
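摘要描述的"空间-光谱自适应保真项 + 先验正则"框架,大致对应如下形式的优化问题(一般形式示意,符号为本文假设,并非论文原式):

```latex
\min_{\mathcal{X}} \;\; \big\| \mathcal{W} \odot (\mathcal{Y} - \mathcal{X}) \big\|_F^2
\;+\; \lambda\, \mathrm{RCTV}(\mathcal{X})
```

其中 Y 为含噪 HSI,X 为待恢复图像,W 为空间-光谱自适应权重张量(用于动态平衡保真与先验两项),RCTV 为代表性系数总变差正则项;整个问题可按摘要所述用 ADMM 交替求解。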

[CV-48] ELoG-GS: Dual-Branch Gaussian Splatting with Luminance-Guided Enhancement for Extreme Low-light 3D Reconstruction WWW ICIP

【速读】:该论文旨在解决在极端低光照环境下,从退化的多视角输入中重建高质量三维(3D)表示的问题,核心挑战在于恢复几何一致且具有照片真实感的3D场景。解决方案的关键在于提出了一种名为极端低光优化高斯点绘图(Extreme Low-light Optimized Gaussian Splatting, ELoG-GS)的鲁棒低光3D重建流程,其创新性地融合了基于学习的点云初始化与亮度引导的颜色增强策略,从而实现稳定且逼真的高斯点绘图(Gaussian Splatting)。该方法通过几何感知初始化和光度适应策略显著提升了在恶劣条件下的重建保真度,实验表明其在NTIRE 2026 Track 1基准上优于现有基线方法,在官方排行榜上取得PSNR 18.6626和SSIM 0.6855的性能指标。

链接: https://arxiv.org/abs/2604.12592
作者: Yuhao Liu,Dingju Wang,Ziyang Zheng
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Our method achieved a ranking of 9 out of 148 participants in Track 1 of the NTIRE 3DRR Challenge, as reported on the official competition website: this https URL

点击查看摘要

Abstract:This paper presents our approach to the NTIRE 2026 3D Restoration and Reconstruction Challenge (Track 1), which focuses on reconstructing high-quality 3D representations from degraded multi-view inputs. The challenge involves recovering geometrically consistent and photorealistic 3D scenes in extreme low-light environments. To address this task, we propose Extreme Low-light Optimized Gaussian Splatting (ELoG-GS), a robust low-light 3D reconstruction pipeline that integrates learning-based point cloud initialization and luminance-guided color enhancement for stable and photorealistic Gaussian Splatting. Our method incorporates both geometry-aware initialization and photometric adaptation strategies to improve reconstruction fidelity under challenging conditions. Extensive experiments on the NTIRE Track 1 benchmark demonstrate that our approach significantly improves reconstruction quality over the baselines, achieving superior visual fidelity and geometric consistency. The proposed method provides a practical solution for robust 3D reconstruction in real-world degraded scenarios. In the final testing phase, our method achieved a PSNR of 18.6626 and an SSIM of 0.6855 on the official platform leaderboard. Code is available at this https URL.

[CV-49] Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models

【速读】:该论文旨在解决视频大语言模型(Video-LLM)在生成过程中因时间证据分配不均衡而导致的幻觉问题。具体而言,模型倾向于过度依赖少数关键帧(称为锚帧,anchor frame),从而忽视视频中其他重要时序信息,这种现象被识别为一种与输入视频无关的、由模型结构或位置偏置引发的解码器侧固有偏差。解决方案的关键在于提出一种免训练(training-free)、层选择性的推理方法——解码器侧时间重平衡(Decoder-side Temporal Rebalancing, DTR),该方法通过在中间到晚期的解码器层中自适应校准视觉注意力分布,缓解时间上的集中偏倚,促使被忽略的帧更有效地参与响应生成,从而增强输出对更广泛且均衡的视频证据的依赖性,显著提升模型的幻觉鲁棒性并保持优异的视频理解性能和推理效率。

链接: https://arxiv.org/abs/2604.12582
作者: Zijian Liu,Sihan Cao,Pengcheng Zheng,Kuien Liu,Caiyan Qin,Xiaolin Qin,Jiwei Wei,Chaoning Zhang
机构: University of Electronic Science and Technology of China (电子科技大学); Institute of Software Chinese Academy of Sciences (中国科学院软件研究所); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学(深圳)); Chengdu Institute of Computer Applications, Chinese Academy of Sciences (中国科学院成都计算机应用研究所); University of the Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent Video Large Language Models (Video-LLMs) have demonstrated strong capability in video understanding, yet they still suffer from hallucinations. Existing mitigation methods typically rely on training, input modification, auxiliary guidance, or additional decoding procedures, while largely overlooking a more fundamental challenge. During generation, Video-LLMs tend to over-rely on a limited portion of temporal evidence, leading to temporally imbalanced evidence aggregation across the video. To address this issue, we investigate a decoder-side phenomenon in which the model exhibits a temporally imbalanced concentration pattern. We term the frame with the highest aggregated frame-level attention mass the anchor frame. We find that this bias is largely independent of the input video and instead appears to reflect a persistent, model-specific structural or positional bias, whose over-dominance is closely associated with hallucination-prone generation. Motivated by this insight, we propose Decoder-side Temporal Rebalancing (DTR), a training-free, layer-selective inference method that rebalances temporal evidence allocation in middle-to-late decoder layers without altering visual encoding or requiring auxiliary models. DTR adaptively calibrates decoder-side visual attention to alleviate temporally imbalanced concentration and encourage under-attended frames to contribute more effectively to response generation. In this way, DTR guides the decoder to ground its outputs in temporally broader and more balanced video evidence. Extensive experiments on hallucination and video understanding benchmarks show that DTR consistently improves hallucination robustness across diverse Video-LLM families, while preserving competitive video understanding performance and high inference efficiency.
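DTR 对帧级注意力的"再平衡"可以用一个非常简化的削顶加归一化操作来直观理解(cap 超参与该具体操作均为本文假设,真实方法作用于中后期解码层的注意力分布):

```python
import numpy as np

def rebalance_frame_attention(attn, cap=0.3):
    """削顶再归一化的示意:把"锚帧"的注意力质量削到 cap,
    再整体归一化,使被忽略帧的相对贡献上升。
    attn: (F,) 各帧聚合注意力质量,和为 1。"""
    out = np.minimum(attn, cap)   # 削顶锚帧
    return out / out.sum()        # 重新归一化为概率分布

attn = np.array([0.7, 0.1, 0.1, 0.1])   # 锚帧独占 70% 注意力
balanced = rebalance_frame_attention(attn)
```

削顶后锚帧的主导地位被削弱、其余帧的相对权重上升,对应摘要中"促使被忽略的帧更有效地参与响应生成"。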

[CV-50] PDF-GS: Progressive Distractor Filtering for Robust 3D Gaussian Splatting CVPR

【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在训练过程中对多视角一致性假设的敏感性问题,即当输入图像中存在干扰项(distractors)破坏该假设时,会导致视觉伪影。解决方案的关键在于提出PDF-GS(Progressive Distractor Filtering for Robust 3D Gaussian Splatting),通过渐进式多阶段优化增强3DGS固有的不一致信号抑制能力:首先利用差异线索逐步过滤干扰项,随后在重建阶段从净化后的高斯表示中恢复精细且视角一致的细节,从而实现鲁棒、高保真且无干扰的重建效果。

链接: https://arxiv.org/abs/2604.12580
作者: Kangmin Seo,MinKyu Lee,Tae-Young Kim,ByeongCheol Lee,JoonSeoung An,Jae-Pil Heo
机构: Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR Findings 2026

点击查看摘要

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled impressive real-time photorealistic rendering. However, conventional training pipelines inherently assume full multi-view consistency among input images, which makes them sensitive to distractors that violate this assumption and cause visual artifacts. In this work, we revisit an underexplored aspect of 3DGS: its inherent ability to suppress inconsistent signals. Building on this insight, we propose PDF-GS (Progressive Distractor Filtering for Robust 3D Gaussian Splatting), a framework that amplifies this self-filtering property through a progressive multi-phase optimization. The progressive filtering phases gradually remove distractors by exploiting discrepancy cues, while the following reconstruction phase restores fine-grained, view-consistent details from the purified Gaussian representation. Through this iterative refinement, PDF-GS achieves robust, high-fidelity, and distractor-free reconstructions, consistently outperforming baselines across diverse datasets and challenging real-world conditions. Moreover, our approach is lightweight and easily adaptable to existing 3DGS frameworks, requiring no architectural changes or additional inference overhead, leading to a new state-of-the-art performance. The code is publicly available at this https URL.

[CV-51] StructDiff: A Structure-Preserving and Spatially Controllable Diffusion Model for Single-Image Generation

【速读】:该论文旨在解决单图生成(single-image generation)中结构布局保持困难与空间可控性不足的问题。现有方法在处理具有大尺度刚性物体或严格空间约束的图像时,难以维持全局和局部分布的一致性,且缺乏对生成内容位置、尺度及细节的灵活控制能力。解决方案的关键在于提出StructDiff框架:首先引入自适应感受野模块(adaptive receptive field module)以同时保留全局与局部统计特性;其次,创新性地采用三维位置编码(3D positional encoding, PE)作为空间先验,实现对生成对象的位置、尺度和局部细节的灵活调控,这是首次将PE机制应用于单图生成中的空间控制;此外,还设计了一种基于大语言模型(LLM)的新评估指标,有效克服传统客观指标局限性和人工评测成本高的问题。

链接: https://arxiv.org/abs/2604.12575
作者: Yinxi He,Kang Liao,Chunyu Lin,Tianyi Wei,Yao Zhao
机构: Beijing Jiaotong University (北京交通大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Multimedia (Regular Paper)

点击查看摘要

Abstract:This paper introduces StructDiff, a generative framework based on a single-scale diffusion model for single-image generation. Single-image generation aims to synthesize diverse samples with similar visual content to the source image by capturing its internal statistics, without relying on external data. However, existing methods often struggle to preserve the structural layout, especially for images with large rigid objects or strict spatial constraints. Moreover, most approaches lack spatial controllability, making it difficult to guide the structure or placement of generated content. To address these challenges, StructDiff introduces an adaptive receptive field module to maintain both global and local distributions. Building on this foundation, StructDiff incorporates 3D positional encoding (PE) as a spatial prior, allowing flexible control over positions, scale, and local details of generated objects. To our knowledge, this spatial control capability represents the first exploration of PE-based manipulation in single-image generation. Furthermore, we propose a novel evaluation criterion for single-image generation based on large language models (LLMs). This criterion specifically addresses the limitations of existing objective metrics and the high labor costs associated with user studies. StructDiff also demonstrates broad applicability across downstream tasks, such as text-guided image generation, image editing, outpainting, and paint-to-image synthesis. Extensive experiments demonstrate that StructDiff outperforms existing methods in structural consistency, visual quality, and spatial controllability. The project page is available at this https URL.

[CV-52] Cross-Modal Knowledge Distillation for PET-Free Amyloid-Beta Detection from MRI CVPR

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease)早期诊断中淀粉样蛋白β(Amyloid-β, Aβ)阳性状态检测依赖正电子发射断层扫描(PET)所带来的成本高、侵入性强及可及性差的问题,从而限制了其在人群层面筛查中的应用。解决方案的关键在于提出一种基于PET引导的知识蒸馏框架,利用BiomedCLIP构建的教师模型通过跨模态注意力机制和以Centiloid为指导的在线负采样三元组对比学习,实现PET与MRI之间的对齐;随后,一个仅使用MRI的Student模型通过特征级和logit级蒸馏模仿教师模型的行为,从而在不依赖非影像学临床协变量或推理阶段PET数据的情况下准确预测Aβ状态,同时保持良好的可解释性。

链接: https://arxiv.org/abs/2604.12574
作者: Francesco Chiumento,Julia Dietlmeier,Ronan P. Killeen,Kathleen M. Curran,Noel E. O’Connor,Mingming Liu
机构: Dublin City University (都柏林城市大学); Insight Research Ireland Centre for Data Analytics (爱尔兰数据解析研究中心); St. Vincent’s University Hospital (圣文森特大学医院); University College Dublin (都柏林大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR Workshops 2026 (PHAROS-AIF-MIH)

点击查看摘要

Abstract:Detecting amyloid-β (Aβ) positivity is crucial for early diagnosis of Alzheimer’s disease but typically requires PET imaging, which is costly, invasive, and not widely accessible, limiting its use for population-level screening. We address this gap by proposing a PET-guided knowledge distillation framework that enables Aβ prediction from MRI alone, without requiring non-imaging clinical covariates or PET at inference. Our approach employs a BiomedCLIP-based teacher model that learns PET-MRI alignment via cross-modal attention and triplet contrastive learning with PET-informed (Centiloid-aware) online negative sampling. An MRI-only student then mimics the teacher via feature-level and logit-level distillation. Evaluated across four MRI contrasts (T1w, T2w, FLAIR, T2*) and two independent datasets, our approach demonstrates effective knowledge transfer (best AUC: 0.74 on OASIS-3, 0.68 on ADNI) while maintaining interpretability and eliminating the need for clinical variables. Saliency analysis confirms that predictions focus on anatomically relevant cortical regions, supporting the clinical viability of PET-free Aβ screening. Code is available at this https URL.
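其中"特征级 + logit 级"蒸馏这一步可以用如下纯 Python 示意(非论文官方实现,温度 T、权重 lam 均为本文假设的超参):

```python
import math

def softmax(z, T=1.0):
    """带温度的 softmax,T 越大软标签越平滑。"""
    m = max(z)
    e = [math.exp((x - m) / T) for x in z]
    s = sum(e)
    return [x / s for x in e]

def kd_loss(s_feat, t_feat, s_logits, t_logits, T=2.0, lam=0.5):
    """学生模仿教师:特征级 MSE + logit 级 KL 散度(软标签蒸馏)。"""
    feat_loss = sum((a - b) ** 2 for a, b in zip(s_feat, t_feat)) / len(s_feat)
    p_t, p_s = softmax(t_logits, T), softmax(s_logits, T)
    logit_loss = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return lam * feat_loss + (1 - lam) * logit_loss
```

学生与教师完全一致时损失为 0,两级蒸馏项任一偏离都会被惩罚。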

[CV-53] Evolution-Inspired Sample Competition for Deep Neural Network Optimization

【速读】:该论文旨在解决传统深度网络训练中对所有样本采用统一学习范式所引发的问题,如类别不平衡下的偏倚、难样本学习不足以及噪声样本的错误强化等。其解决方案的关键在于提出一种受进化论启发的“自然选择”(Natural Selection, NS)优化方法,通过显式引入样本间的竞争机制来动态调节每个样本的训练贡献:具体而言,NS 将多个样本组合成复合图像进行推理,并基于预测结果计算每个样本的相对竞争得分(natural selection score),进而动态重加权样本损失,从而在优化过程中构建出一种由竞争驱动的学习机制,实现更自适应且平衡的模型训练。

链接: https://arxiv.org/abs/2604.12568
作者: Ying Zheng,Yiyi Zhang,Yi Wang,Lap-Pui Chau
机构: The Hong Kong Polytechnic University (香港理工大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Conventional deep network training generally optimizes all samples under a largely uniform learning paradigm, without explicitly modeling the heterogeneous competition among them. Such an oversimplified treatment can lead to several well-known issues, including bias under class imbalance, insufficient learning of hard samples, and the erroneous reinforcement of noisy samples. In this work, we present Natural Selection (NS), a novel evolution-inspired optimization method that explicitly incorporates competitive interactions into deep network training. Unlike conventional sample reweighting strategies that rely mainly on predefined heuristics or static criteria, NS estimates the competitive status of each sample in a group-wise context and uses it to adaptively regulate its training contribution. Specifically, NS first assembles multiple samples into a composite image and rescales it to the original input size for model inference. Based on the resulting predictions, a natural selection score is computed for each sample to characterize its relative competitive variation within the constructed group. These scores are then used to dynamically reweight the sample-wise loss, thereby introducing an explicit competition-driven mechanism into the optimization process. In this way, NS provides a simple yet effective means of moving beyond uniform sample treatment and enables more adaptive and balanced model optimization. Extensive experiments on 12 public datasets across four image classification tasks demonstrate the effectiveness of the proposed method. Moreover, NS is compatible with diverse network architectures and does not depend on task-specific assumptions, indicating its strong generality and practical potential. The code will be made publicly available.
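"按组内竞争得分动态重加权样本损失"这一步可示意如下(非官方实现;这里假设竞争得分越低的弱势样本获得越大权重,并把权重均值归一为 1,具体映射方向与温度 tau 均为本文假设):

```python
import math

def ns_weighted_loss(losses, ns_scores, tau=1.0):
    """losses: 逐样本损失;ns_scores: 组内相对竞争得分(假设越低越弱势)。"""
    n = len(losses)
    m = max(-s / tau for s in ns_scores)
    e = [math.exp(-s / tau - m) for s in ns_scores]
    z = sum(e)
    w = [n * x / z for x in e]  # softmax 权重 × n,使权重均值为 1
    return sum(wi * li for wi, li in zip(w, losses)) / n
```

得分相同时退化为普通平均损失;组内竞争差异越大,损失分配越向弱势样本倾斜。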

[CV-54] Scalable Trajectory Generation for Whole-Body Mobile Manipulation

【速读】:该论文旨在解决移动操作机器人(Mobile Manipulation)在非结构化环境中进行全身协调运动时,因状态空间随场景和物体多样性呈组合爆炸式增长而导致的数据稀缺问题。现有数据采集方法如遥操作或规划策略在大规模应用中存在劳动密集或计算成本过高的瓶颈,缺乏可扩展的生成大规模、物理有效且跨不同机器人本体与环境的协同轨迹数据的管道。解决方案的关键在于提出AutoMoMa框架,其核心创新是将AKR(Articulated Kinematic Representation)建模与并行轨迹优化相结合,通过GPU加速实现每GPU小时5000个episode的高效生成速度(较CPU基线快80倍以上),从而首次实现了在规模、多样性与运动学保真度三个维度上的统一突破,为基于模仿学习(Imitation Learning, IL)的控制策略提供了高质量训练数据基础。

链接: https://arxiv.org/abs/2604.12565
作者: Yida Niu,Xinhai Chang,Xin Liu,Ziyuan Jiao,Yixin Zhu
机构: Peking University (北京大学); Beihang University (北京航空航天大学); State Key Laboratory of General Artificial Intelligence; Beijing Key Laboratory of Behavior and Mental Health; Yuanpei College, Peking University; PKU-Wuhan Institute for Artificial Intelligence (北京大学-武汉人工智能研究院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Robots deployed in unstructured environments must coordinate whole-body motion – simultaneously moving a mobile base and arm – to interact with the physical world. This coupled mobility and dexterity yields a state space that grows combinatorially with scene and object diversity, demanding datasets far larger than those sufficient for fixed-base manipulation. Yet existing acquisition methods, including teleoperation and planning, are either labor-intensive or computationally prohibitive at scale. The core bottleneck is the lack of a scalable pipeline for generating large-scale, physically valid, coordinated trajectory data across diverse embodiments and environments. Here we introduce AutoMoMa, a GPU-accelerated framework that unifies AKR modeling, which consolidates base, arm, and object kinematics into a single chain, with parallelized trajectory optimization. AutoMoMa achieves 5,000 episodes per GPU-hour (over 80× faster than CPU-based baselines), producing a dataset of over 500k physically valid trajectories spanning 330 scenes, diverse articulated objects, and multiple robot embodiments. Prior datasets were forced to compromise on scale, diversity, or kinematic fidelity; AutoMoMa addresses all three simultaneously. Training downstream IL policies further reveals that even a single articulated-object task requires tens of thousands of demonstrations for SOTA methods to reach ≈80% success, confirming that data scarcity – not algorithmic limitations – has been the binding constraint. AutoMoMa thus bridges high-performance planning and reliable IL-based control, providing the infrastructure previously missing for coordinated mobile manipulation research. By making large-scale, kinematically valid training data practical, AutoMoMa showcases generalizable whole-body robot policies capable of operating in the diverse, unstructured settings of the real world.

[CV-55] Cross-Attentive Multiview Fusion of Vision-Language Embeddings

【速读】:该论文旨在解决将视觉-语言模型(Vision-Language Models, VLMs)从2D图像推广到3D场景时所面临的挑战,即如何有效融合多视角的2D语义描述符以构建高质量的3D实例级表示。现有方法通常通过反投影并平均多个视角的特征或启发式选择单一视图描述符,导致3D表示性能受限。其解决方案的关键在于提出一种新颖的多视角Transformer架构——Cross-Attentive Multiview Fusion (CAMFusion),该架构通过跨视角的交叉注意力机制对来自不同视角的视觉-语言描述符进行交互与融合,生成统一的每3D实例嵌入;同时引入多视角一致性作为自监督信号,增强融合效果,显著提升在标准监督损失基础上的性能表现。

链接: https://arxiv.org/abs/2604.12551
作者: Tomas Berriel Martins,Martin R. Oswald,Javier Civera
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models have been key to the development of open-vocabulary 2D semantic segmentation. Lifting these models from 2D images to 3D scenes, however, remains a challenging problem. Existing approaches typically back-project and average 2D descriptors across views, or heuristically select a single representative one, often resulting in suboptimal 3D representations. In this work, we introduce a novel multiview transformer architecture that cross-attends across vision-language descriptors from multiple viewpoints and fuses them into a unified per-3D-instance embedding. As a second contribution, we leverage multiview consistency as a self-supervision signal for this fusion, which significantly improves performance when added to a standard supervised target-class loss. Our Cross-Attentive Multiview Fusion, which we denote with its acronym CAMFusion, not only consistently outperforms naive averaging or single-view descriptor selection, but also achieves state-of-the-art results on 3D semantic and instance classification benchmarks, including zero-shot evaluations on out-of-domain datasets.
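跨视角交叉注意力融合的最小形式,可以理解为用一个聚合查询对 V 个视角的描述子做注意力加权求和(单头、纯 Python 示意,非论文的 Transformer 结构):

```python
import math

def fuse_views(query, view_descs):
    """query: d 维聚合查询;view_descs: V 个 d 维视觉-语言描述子。"""
    d = len(query)
    # 缩放点积注意力打分
    scores = [sum(q * k for q, k in zip(query, v)) / math.sqrt(d) for v in view_descs]
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    attn = [x / z for x in e]
    # 按注意力权重聚合为单个每实例嵌入
    fused = [sum(a * v[i] for a, v in zip(attn, view_descs)) for i in range(d)]
    return attn, fused
```

论文中的 CAMFusion 在此之上叠加多层跨视角交互与多视角一致性自监督,这里只保留"多视角描述子 → 单一 3D 实例嵌入"的骨架。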

[CV-56] MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models CVPR2026

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)中位置编码机制效率低下的问题,即现有方法对所有token统一分配位置索引,忽视了模态内和模态间信息密度的差异,导致注意力分配失衡——冗余视觉区域占据主导,而关键信息被忽略。解决方案的关键在于提出一种无需训练的框架MODIX(Multimodal Information-Driven Positional IndeX Scaling),通过联合建模模态内信息密度(基于协方差熵)与模态间交互(跨模态对齐)来生成统一评分,动态调整位置步长(positional strides),从而在不修改模型参数或结构的前提下,将更细粒度的位置编码分配给高信息量模态、压缩冗余内容,实现自适应的位置编码资源优化。

链接: https://arxiv.org/abs/2604.12537
作者: Ruoxiang Huang,Zhen Yuan
机构: Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026 (Highlight). 10 pages, 2 figures, 5 tables

点击查看摘要

Abstract:Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional indices to all tokens, overlooking variations in information density within and across modalities, which leads to inefficient attention allocation where redundant visual regions dominate while informative content is underrepresented. We identify positional granularity as an implicit resource and propose MODIX (Multimodal Information-Driven Positional IndeX Scaling), a training-free framework that dynamically adapts positional strides based on modality-specific contributions. MODIX jointly models intra-modal density via covariance-based entropy and inter-modal interaction via cross-modal alignment to derive unified scores, which rescale positional indices to allocate finer granularity to informative modalities while compressing redundant ones, without requiring any modification to model parameters or architecture. Experiments across diverse architectures and benchmarks demonstrate that MODIX consistently improves multimodal reasoning and adaptively reallocates attention according to task-dependent information distributions, suggesting that positional encoding should be treated as an adaptive resource in Transformers for multimodal sequence modeling.
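"按统一信息得分缩放位置索引步长"的思路可示意如下(非官方实现;这里假设步长正比于归一化得分,使高信息模态保留接近原始的位置分辨率、冗余模态的位置跨度被压缩,具体映射为本文假设):

```python
def rescale_positions(segments):
    """segments: [(token 数, 归一化信息得分 ∈ (0,1])];返回各段的连续位置索引。"""
    pos, out = 0.0, []
    for length, score in segments:
        stride = score  # 假设:高得分模态步长接近 1,冗余模态被压缩到更窄的位置区间
        idx = [pos + i * stride for i in range(length)]
        out.append(idx)
        pos = idx[-1] + stride  # 下一模态接在本段之后
    return out
```

该操作只改写位置索引,不触碰模型参数,这也对应论文 training-free 的设定。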

[CV-57] CoD-Lite: Real-Time Diffusion-Based Generative Image Compression

【速读】:该论文旨在解决生成式压缩(generative compression)在实时场景中因模型复杂度高而难以部署的问题,特别是针对轻量化、低延迟的扩散编码器设计。其核心挑战在于如何在保持高质量图像重建的同时,实现高效的编码与解码速度,以满足1080p分辨率下60 FPS以上的实时处理需求。解决方案的关键在于两个发现:一是压缩导向的预训练比生成导向的预训练更适用于小规模模型;二是轻量级卷积结构结合蒸馏(distillation)即可替代Transformer中的全局注意力机制,从而显著降低计算开销。基于此,作者提出了一种单步轻量化卷积扩散编码器,在保证FID指标接近MS-ILLM的前提下,将比特率降低85%,并实现了实际可用的实时性能。

链接: https://arxiv.org/abs/2604.12525
作者: Zhaoyang Jia,Naifu Xue,Zihan Zheng,Jiahao Li,Bin Li,Xiaoyi Zhang,Zongyu Guo,Yuan Zhang,Houqiang Li,Yan Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advanced diffusion methods typically derive strong generative priors by scaling diffusion transformers. However, scaling fails to generalize when adapted for real-time compression scenarios that demand lightweight models. In this paper, we explore the design of real-time and lightweight diffusion codecs by addressing two pivotal questions. First, does diffusion pre-training benefit lightweight diffusion codecs? Through systematic analysis, we find that generation-oriented pre-training is less effective at small model scales whereas compression-oriented pre-training yields consistently better performance. Second, are transformers essential? We find that while global attention is crucial for standard generation, lightweight convolutions suffice for compression-oriented diffusion when paired with distillation. Guided by these findings, we establish a one-step lightweight convolution diffusion codec that achieves real-time 60 FPS encoding and 42 FPS decoding at 1080p. Further enhanced by distillation and adversarial learning, the proposed codec reduces bitrate by 85% at a comparable FID to MS-ILLM, bridging the gap between generative compression and practical real-time deployment. Codes are released at this https URL

[CV-58] NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Professional Image Quality Assessment (Track 1) CVPR

【速读】:该论文旨在解决传统图像质量评估(Image Quality Assessment, IQA)方法在处理高质量图像时的局限性问题,即仅依赖标量评分难以区分细微差异且缺乏可解释性,无法为视觉任务提供有效指导。解决方案的关键在于引入多模态大语言模型(Multimodal Large Language Models, MLLMs),通过构建一个新型基准测试,评估MLLMs模拟人类专家认知能力的能力,重点突破两个瓶颈:一是可靠地从高质量图像对中选择视觉更优者;二是生成基于事实、具备专业水准的解释性推理,从而实现从“评分”到“理解”的范式转变。

链接: https://arxiv.org/abs/2604.12512
作者: Guanyi Qin,Jie Liang,Bingbing Zhang,Lishen Qu,Ya-nan Guan,Hui Zeng,Lei Zhang,Radu Timofte,Jianhui Sun,Xinli Yue,Tao Shao,Huan Hou,Wenjie Liao,Shuhao Han,Jieyu Yuan,Chunle Guo,Chongyi Li,Zewen Chen,Yunze Liu,Jian Guo,Juan Wang,Yun Zeng,Bing Li,Weiming Hu,Hesong Li,Dehua Liu,Xinjie Zhang,Qiang Li,Li Yan,Wei Dong,Qingsen Yan,Xingcan Li,Shenglong Zhou,Manjiang Yin,Yinxiang Zhang,Hongbo Wang,Jikai Xu,Zhaohui Fan,Dandan Zhu,Wei Sun,Weixia Zhang,Kun Zhu,Nana Zhang,Kaiwei Zhang,Qianqian Zhang,Zhihan Zhang,William Gordon,Linwei Wu,Jiachen Tu,Guoyi Xu,Yaoxin Jiang,Cici Liu,Yaokun Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: NTIRE Challenge Report. Accepted by CVPRW 2026

点击查看摘要

Abstract:In this paper, we present an overview of the NTIRE 2026 challenge on the 3rd Restore Any Image Model in the Wild, specifically focusing on Track 1: Professional Image Quality Assessment. Conventional Image Quality Assessment (IQA) typically relies on scalar scores. By compressing complex visual characteristics into a single number, these methods fundamentally struggle to distinguish subtle differences among uniformly high-quality images. Furthermore, they fail to articulate why one image is superior, lacking the reasoning capabilities required to provide guidance for vision tasks. To bridge this gap, recent advancements in Multimodal Large Language Models (MLLMs) offer a promising paradigm. Inspired by this potential, our challenge establishes a novel benchmark exploring the ability of MLLMs to mimic human expert cognition in evaluating high-quality image pairs. Participants were tasked with overcoming critical bottlenecks in professional scenarios, centering on two primary objectives: (1) Comparative Quality Selection: reliably identifying the visually superior image within a high-quality pair; and (2) Interpretative Reasoning: generating grounded, expert-level explanations that detail the rationale behind the selection. In total, the challenge attracted nearly 200 registrations and over 2,500 submissions. The top-performing methods significantly advanced the state of the art in professional IQA. The challenge dataset is available at this https URL, and the official homepage is accessible at this https URL.

[CV-59] Whole-Body Mobile Manipulation using Offline Reinforcement Learning on Sub-optimal Controllers

【速读】:该论文旨在解决移动操作(Mobile Manipulation, MoMa)中复杂任务的协调控制问题,例如开门、拉抽屉和开柜子等,这类任务需要机器人基座与机械臂进行全身协同运动。传统全身体控制器(Whole-Body Controller, WBC)虽能通过分层优化求解,但依赖大量人工调参且鲁棒性差;而学习方法虽具较强泛化能力,却通常需昂贵的全身遥操作数据或复杂的奖励工程。论文提出WHOLE-MoMa两阶段框架:第一阶段利用轻量级WBC随机化生成多样化的初始演示轨迹,作为结构先验引导数据采集于任务相关的状态-动作空间区域;第二阶段采用离线强化学习(Offline Reinforcement Learning, Offline RL)对这些轨迹进行改进,并通过奖励信号识别和拼接更优行为。其关键创新在于将次优WBC作为高效的数据生成机制,结合Q-chunking扩展的隐式Q学习实现对动作片段的层次化评估与策略提取,从而在无需任何遥操作或真实世界训练数据的情况下,在仿真和真实机器人上实现了高成功率的复杂协调任务执行。

链接: https://arxiv.org/abs/2604.12509
作者: Snehal Jauhri,Vignesh Prasad,Georgia Chalvatzaki
机构: Erlangen National High Performance Computing Center (NHR) of Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU); PEARL Lab, Computer Science Department, Technische Universität Darmstadt; Hessian.AI; Robotics Institute
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: PrePrint. Project website: this http URL

点击查看摘要

Abstract:Mobile Manipulation (MoMa) of articulated objects, such as opening doors, drawers, and cupboards, demands simultaneous, whole-body coordination between a robot’s base and arms. Classical whole-body controllers (WBCs) can solve such problems via hierarchical optimization, but require extensive hand-tuned optimization and remain brittle. Learning-based methods, on the other hand, show strong generalization capabilities but typically rely on expensive whole-body teleoperation data or heavy reward engineering. We observe that even a sub-optimal WBC is a powerful structural prior: it can be used to collect data in a constrained, task-relevant region of the state-action space, and its behavior can still be improved upon using offline reinforcement learning. Building on this, we propose WHOLE-MoMa, a two-stage pipeline that first generates diverse demonstrations by randomizing a lightweight WBC, and then applies offline RL to identify and stitch together improved behaviors via a reward signal. To support the expressive action-chunked diffusion policies needed for complex coordination tasks, we extend offline implicit Q-learning with Q-chunking for chunk-level critic evaluation and advantage-weighted policy extraction. On three tasks of increasing difficulty using a TIAGo++ mobile manipulator in simulation, WHOLE-MoMa significantly outperforms WBC, behavior cloning, and several offline RL baselines. Policies transfer directly to the real robot without finetuning, achieving 80% success in bimanual drawer manipulation and 68% in simultaneous cupboard opening and object placement, all without any teleoperated or real-world training data.
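其中"优势加权策略提取"(advantage-weighted policy extraction)的权重计算可示意如下(非官方实现;Q-chunking 下 Q、V 在动作片段级别估计,这里仅以标量示意,温度 beta 与截断上限为本文假设):

```python
import math

def awr_weights(q_values, v_values, beta=1.0, w_max=20.0):
    """w = exp((Q - V) / beta),截断避免数值爆炸;优势越大的动作片段在策略提取中权重越高。"""
    return [min(math.exp((q - v) / beta), w_max) for q, v in zip(q_values, v_values)]
```

这正是离线 RL"从次优 WBC 演示中拼接更优行为"的机制:高于基线价值的片段被加权模仿,低于基线的被降权。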

[CV-60] From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在细粒度感知任务中表现不佳的问题,其核心挑战源于“视觉衰减”(Visual Attenuation)现象:即在模型传播过程中,稀疏的细粒度视觉信号被主导的文本token过早抑制或稀释,导致深层决策阶段出现“注意力丢失”。为应对这一问题,作者提出变分信息流(Variational Information Flow, VIF)框架,其关键在于引入条件变分自编码器(Conditional Variational Autoencoder, CVAE),从概率视角建模与问答对相关的视觉显著性为潜在分布,并作为可插拔模块集成至现有架构中,从而有效增强MLLMs对细粒度视觉信息的捕捉能力。

链接: https://arxiv.org/abs/2604.12508
作者: Jilong Zhu,Yang Feng
机构: Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); University of Chinese Academy of Sciences, Beijing, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding, they frequently falter in fine-grained perception tasks that require identifying tiny objects or discerning subtle visual relationships. We attribute this limitation to Visual Attenuation: a phenomenon where sparse fine-grained visual signals are prematurely suppressed or diluted by dominant textual tokens during network propagation, resulting in a “loss of focus” during the deep-level decision-making process. Existing input-centric solutions fail to fundamentally reverse this intrinsic mechanism of information loss. To address this challenge, we propose the Variational Information Flow (VIF) framework. Adopting a probabilistic perspective, VIF leverages a Conditional Variational Autoencoder (CVAE) to model the visual saliency relevant to the question-answer pair as a latent distribution. As a plug-and-play module, VIF can be integrated into existing architectures. Extensive evaluations across diverse benchmarks, covering General VQA, fine-grained perception, and visual grounding, demonstrate that VIF yields competitive improvements over previous methods, validating its effectiveness in enhancing the fine-grained perception of MLLMs.

[CV-61] SEATrack: Simple Efficient and Adaptive Multimodal Tracker CVPR2026

【速读】:该论文旨在解决多模态跟踪(multimodal tracking)中参数高效微调(parameter-efficient fine-tuning, PEFT)面临的性能-效率权衡困境,即当前方法虽在性能上取得提升,但往往伴随参数预算显著增加,违背了PEFT的核心目标。解决方案的关键在于两个互补策略:一是提出AMG-LoRA,通过结合低秩适配(Low-Rank Adaptation, LoRA)与自适应互引导(Adaptive Mutual Guidance, AMG),动态优化并对齐跨模态匹配注意力图,缓解模态特异性偏差导致的冲突;二是引入分层专家混合模型(Hierarchical Mixture of Experts, HMoE),替代传统局部融合方式,实现高效的全局关系建模,在保持表达能力的同时控制计算开销。

链接: https://arxiv.org/abs/2604.12502
作者: Junbin Su,Ziteng Xue,Shihui Zhang,Kun Chen,Weiming Hu,Zhipeng Zhang
机构: Yanshan University (燕山大学); Beihang University (北京航空航天大学); Hebei Key Laboratory of Computer Virtual Technology and System Integration; CASIA (中国科学院自动化研究所); UCAS (中国科学院大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted as a CVPR 2026 Oral

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) in multimodal tracking reveals a concerning trend where recent performance gains are often achieved at the cost of inflated parameter budgets, which fundamentally erodes PEFT’s efficiency promise. In this work, we introduce SEATrack, a Simple, Efficient, and Adaptive two-stream multimodal tracker that tackles this performance-efficiency dilemma from two complementary perspectives. We first prioritize cross-modal alignment of matching responses, an underexplored yet pivotal factor that we argue is essential for breaking the trade-off. Specifically, we observe that modality-specific biases in existing two-stream methods generate conflicting matching attention maps, thereby hindering effective joint representation learning. To mitigate this, we propose AMG-LoRA, which seamlessly integrates Low-Rank Adaptation (LoRA) for domain adaptation with Adaptive Mutual Guidance (AMG) to dynamically refine and align attention maps across modalities. We then depart from conventional local fusion approaches by introducing a Hierarchical Mixture of Experts (HMoE) that enables efficient global relation modeling, effectively balancing expressiveness and computational efficiency in cross-modal fusion. Equipped with these innovations, SEATrack advances notable progress over state-of-the-art methods in balancing performance with efficiency across RGB-T, RGB-D, and RGB-E tracking tasks. Code is available at this https URL.
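AMG-LoRA 以 LoRA 为基础;标准 LoRA 的前向传播(y = Wx + s·BAx,冻结 W,仅训练低秩矩阵 A、B)可用纯 Python 示意如下(非 SEATrack 官方实现):

```python
def lora_forward(x, W, A, B, scale=1.0):
    """x: d 维输入;W: out×d 冻结权重;A: r×d、B: out×r 为低秩可训练增量。"""
    matvec = lambda M, v: [sum(Mi[j] * v[j] for j in range(len(v))) for Mi in M]
    base = matvec(W, x)               # 冻结主干的输出
    delta = matvec(B, matvec(A, x))   # B(Ax):秩为 r 的低秩更新
    return [b + scale * d for b, d in zip(base, delta)]
```

A、B 初始化为零增量时输出与原模型完全一致,这也是 LoRA 可无损插入已有架构的原因。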

[CV-62] 2I-BiasBench: A Multi-Metric Framework for Auditing Demographic and Cultural Bias in Text-to-Image Models

【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成模型中存在的三类关键偏见问题:人口统计学偏见(demographic bias)、元素遗漏(element omission)以及文化坍缩(cultural collapse),这些问题源于训练数据中的不平衡与偏差,并在扩散模型中被放大。解决方案的关键在于提出首个统一评估框架——T2I-BiasBench,其包含十三个互补指标,首次同时量化上述三个维度的偏见表现;该框架整合了六项现有指标与七项新设计或适配的度量(如复合偏见评分、文化准确率比等),并基于1,574张结构化提示生成的图像对三个开源模型和一个RLHF对齐的基准模型(Gemini 2.5 Flash)进行系统评估,从而揭示当前主流T2I模型在性别、职业角色及文化多样性上的系统性缺陷,为未来模型开发提供可标准化、细粒度的偏见检测工具。

链接: https://arxiv.org/abs/2604.12481
作者: Nihal Jaiswal,Siddhartha Arjaria,Gyanendra Chaubey,Ankush Kumar,Aditya Singh,Anchal Chaurasiya
机构: IT Research and Education Center (IT 研究与教育中心); IIT Jodhpur (印度理工学院焦特布尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image (T2I) generative models achieve impressive visual fidelity but inherit and amplify demographic imbalances and cultural biases embedded in training data. We introduce T2I-BiasBench, a unified evaluation framework of thirteen complementary metrics that jointly captures demographic bias, element omission, and cultural collapse in diffusion models - the first framework to address all three dimensions simultaneously. We evaluate three open-source models - Stable Diffusion v1.5, BK-SDM Base, and Koala Lightning - against Gemini 2.5 Flash (RLHF-aligned) as a reference baseline. The benchmark comprises 1,574 generated images across five structured prompt categories. T2I-BiasBench integrates six established metrics with seven additional measures: four newly proposed (Composite Bias Score, Grounded Missing Rate, Implicit Element Missing Rate, Cultural Accuracy Ratio) and three adapted (Hallucination Score, Vendi Score, CLIP Proxy Score). Three key findings emerge: (1) Stable Diffusion v1.5 and BK-SDM exhibit bias amplification (>1.0) in beauty-related prompts; (2) contextual constraints such as surgical PPE substantially attenuate professional-role gender bias (Doctor CBS = 0.06 for SD v1.5); and (3) all models, including RLHF-aligned Gemini, collapse to a narrow set of cultural representations (CAS: 0.54-1.00), confirming that alignment techniques do not resolve cultural coverage gaps. T2I-BiasBench is publicly released to support standardized, fine-grained bias evaluation of generative models. The project page is available at: this https URL
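文中的"偏见放大"指生成分布比参考分布更偏斜(数值大于 1 即放大)。一种常见的优势比(odds ratio)形式可示意如下(非基准的官方指标定义,仅说明计量方向):

```python
def bias_amplification(gen_ratio, ref_ratio):
    """gen_ratio / ref_ratio: 多数群体在生成图像 / 参考分布中的占比(0 < p < 1)。"""
    odds = lambda p: p / (1.0 - p)
    return odds(gen_ratio) / odds(ref_ratio)  # > 1 表示模型放大了偏斜
```

例如参考分布 50% 而生成分布 80% 时,放大倍数为 4;两者相同则为 1(无放大)。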

[CV-63] Euler-inspired Decoupling Neural Operator for Efficient Pansharpening

【速读】:该论文旨在解决多光谱图像超分辨率(pansharpening)中因深度学习模型(尤其是基于扩散的方法)带来的光谱-空间模糊问题以及高昂的计算成本。其核心解决方案是提出一种受欧拉公式启发的解耦神经算子(Euler-inspired Decoupling Neural Operator, EDNO),该方法将融合过程重新定义为频率域中的连续函数映射,通过将特征转换至极坐标系实现显式-隐式交互机制:其中显式特征交互模块利用线性加权模拟相位旋转以实现自适应几何对齐,隐式特征交互模块则通过前馈网络建模光谱分布以保障颜色一致性;该设计在保持离散化不变性的前提下天然捕获全局感受野,从而在效率与性能之间取得更优平衡。

链接: https://arxiv.org/abs/2604.12463
作者: Anqi Zhu,Mengting Ma,Yizhen Jiang,Xiangdong Li,Kai Zheng,Jiaxin Li,Wei Zhang
机构: Zhejiang University(浙江大学); Chongqing University of Post and Telecommunications(重庆邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pansharpening aims to synthesize high-resolution multispectral (HR-MS) images by fusing the spatial textures of panchromatic (PAN) images with the spectral information of low-resolution multispectral (LR-MS) images. While recent deep learning paradigms, especially diffusion-based operators, have pushed the performance boundaries, they often encounter spectral-spatial blurring and prohibitive computational costs due to their stochastic nature and iterative sampling. In this paper, we propose the Euler-inspired Decoupling Neural Operator (EDNO), a physics-inspired framework that redefines pansharpening as a continuous functional mapping in the frequency domain. Departing from conventional Cartesian feature processing, our EDNO leverages Euler’s formula to transform features into a polar coordinate system, enabling a novel explicit-implicit interaction mechanism. Specifically, we develop the Euler Feature Interaction Layer (EFIL), which decouples the fusion task into two specialized modules: 1) Explicit Feature Interaction Module, utilizing a linear weighting scheme to simulate phase rotation for adaptive geometric alignment; and 2) Implicit Feature Interaction Module, employing a feed-forward network to model spectral distributions for superior color consistency. By operating in the frequency domain, EDNO inherently captures global receptive fields while maintaining discretization-invariance. Experimental results on the three datasets demonstrate that EDNO offers a superior efficiency-performance balance compared to heavyweight architectures.
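基于欧拉公式 e^{iφ} = cos φ + i sin φ,复数频域特征可分解为幅值与相位;显式分支对相位做线性旋转即可实现几何对齐,而幅值(承载光谱能量)保持不变。下面是一个纯 Python 示意(非论文官方实现,标量相位偏移为本文假设的简化):

```python
import cmath

def euler_phase_rotate(freq_feats, phase_shift):
    """freq_feats: 复数频域特征列表;对相位加偏移并重组,幅值不变。"""
    out = []
    for z in freq_feats:
        r, phi = abs(z), cmath.phase(z)                      # 极坐标分解
        out.append(r * cmath.exp(1j * (phi + phase_shift)))  # 相位旋转后重组
    return out
```

论文中的显式模块用可学习的线性加权模拟这一相位旋转,隐式模块则另行建模幅值对应的光谱分布。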

[CV-64] Scaling Exposes the Trigger: Input-Level Backdoor Detection in Text-to-Image Diffusion Models via Cross-Attention Scaling

【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)扩散模型在面对隐蔽、语义保持型后门攻击时,现有输入级防御方法因依赖可观察异常而失效的问题。解决方案的关键在于提出一种名为SET的检测框架,其核心创新是通过在交叉注意力(cross-attention)上施加受控缩放扰动,发现并利用良性输入与后门输入在去噪过程中响应演化模式的系统性差异——即交叉注意力缩放响应分歧(Cross-Attention Scaling Response Divergence, CSRD)。SET基于此现象,在多尺度扰动下构建响应偏移特征,并从少量干净样本中学习紧凑的良性响应空间,从而实现无需攻击先验或模型训练数据即可有效检测后门输入的触发无关检测机制。

链接: https://arxiv.org/abs/2604.12446
作者: Zida Li,Jun Li,Yuzhe Sha,Ziqiang Li,Lizhi Xiong,Zhangjie Fu
机构: Nanjing University of Information Science and Technology (南京信息工程大学)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:Text-to-image (T2I) diffusion models have achieved remarkable success in image synthesis, but their reliance on large-scale data and open ecosystems introduces serious backdoor security risks. Existing defenses, particularly input-level methods, are more practical for deployment but often rely on observable anomalies that become unreliable under stealthy, semantics-preserving trigger designs. As modern backdoor attacks increasingly embed triggers into natural inputs, these methods degrade substantially, raising a critical question: can more stable, implicit, and trigger-agnostic differences between benign and backdoor inputs be exploited for detection? In this work, we address this challenge from an active probing perspective. We introduce controlled scaling perturbations on cross-attention and uncover a novel phenomenon termed Cross-Attention Scaling Response Divergence (CSRD), where benign and backdoor inputs exhibit systematically different response evolution patterns across denoising steps. Building on this insight, we propose SET, an input-level backdoor detection framework that constructs response-offset features under multi-scale perturbations and learns a compact benign response space from a small set of clean samples. Detection is then performed by measuring deviations from this learned space, without requiring prior knowledge of the attack or access to model training. Extensive experiments demonstrate that SET consistently outperforms existing baselines across diverse attack methods, trigger types, and model settings, with particularly strong gains under stealthy implicit-trigger scenarios. Overall, SET improves AUROC by 9.1% and ACC by 6.5% over the best baseline, highlighting its effectiveness and robustness for practical deployment.
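
【代码示意】摘要中"从少量干净样本学习紧凑的良性响应空间,再按偏离程度检测后门输入"的检测逻辑,可用一个对角高斯模型加马氏距离近似示意。特征与打分方式均为假设性草图,并非 SET 的真实实现:

```python
import numpy as np

def fit_benign_space(clean_feats):
    # mean + diagonal variance over (N, D) response-offset features
    return clean_feats.mean(axis=0), clean_feats.var(axis=0) + 1e-6

def deviation_score(feat, mu, var):
    # Mahalanobis-style distance from the learned benign space
    return float(np.sqrt(((feat - mu) ** 2 / var).sum()))

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(64, 16))     # features of clean samples
mu, var = fit_benign_space(clean)
benign_in = rng.normal(0.0, 1.0, size=16)       # in-distribution response
backdoor_in = rng.normal(5.0, 1.0, size=16)     # systematically shifted response
print(deviation_score(benign_in, mu, var) < deviation_score(backdoor_in, mu, var))
```

检测时对分数设阈值即可;论文中的特征来自多尺度交叉注意力缩放扰动下的响应偏移,此处仅用随机向量代替。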

[CV-65] DiffusionPrint: Learning Generative Fingerprints for Diffusion-Based Inpainting Localization CVPR

【速读】:该论文旨在解决生成式 AI(Generative AI)图像修复模型(如基于扩散的修复模型)对图像伪造定位(Image Forgery Localization, IFL)方法造成的干扰问题,因为这些模型通过潜在空间解码器重构整张图像,破坏了传统取证方法依赖的相机级噪声模式。解决方案的关键在于提出 DiffusionPrint——一种基于 patch-level 对比学习的框架,其核心创新是利用同一生成模型产生的修复区域共享一致的生成指纹(generative fingerprint)作为自监督信号,从而学习到对潜在解码引入的频谱失真具有鲁棒性的取证特征。该方法通过 MoCo 风格的目标函数、跨类别难负样本挖掘以及生成器感知的分类头训练卷积主干网络,输出高判别力的取证特征图,可无缝集成至融合型 IFL 框架(如 TruFor、MMFusion)中,显著提升在未见掩码类型和生成架构下的定位性能,最高达 +28% 的准确率增益。

链接: https://arxiv.org/abs/2604.12443
作者: Paschalis Giakoumoglou,Symeon Papadopoulos
机构: Information Technologies Institute, CERTH
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPRW2026

点击查看摘要

Abstract:Modern diffusion-based inpainting models pose significant challenges for image forgery localization (IFL), as their full regeneration pipelines reconstruct the entire image via a latent decoder, disrupting the camera-level noise patterns that existing forensic methods rely on. We propose DiffusionPrint, a patch-level contrastive learning framework that learns a forensic signal robust to the spectral distortions introduced by latent decoding. It exploits the fact that inpainted regions generated by the same model share a consistent generative fingerprint, using this as a self-supervisory signal. DiffusionPrint trains a convolutional backbone via a MoCo-style objective with cross-category hard negative mining and a generator-aware classification head, producing a forensic feature map that serves as a highly discriminative secondary modality in fusion-based IFL frameworks. Integrated into TruFor, MMFusion, and a lightweight fusion baseline, DiffusionPrint consistently improves localization across multiple generative models, with gains of up to +28% on mask types unseen during fine-tuning and confirmed generalization to unseen generative architectures. Code is available at this https URL
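
【代码示意】摘要提到的"MoCo-style objective"通常指 InfoNCE 对比损失:同一生成模型产生的修复区域互为正样本,其他为负样本。下面是该损失的极简 NumPy 版本(温度系数、向量维度均为假设值):

```python
import numpy as np

def info_nce(query, positive, negatives, tau=0.07):
    # MoCo-style contrastive loss on L2-normalised embeddings
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    q, p, n = norm(query), norm(positive), norm(negatives)
    logits = np.concatenate([[q @ p], n @ q]) / tau   # positive logit first
    logits -= logits.max()                            # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[0]))                   # cross-entropy with label 0

rng = np.random.default_rng(0)
q = rng.standard_normal(32)
negs = rng.standard_normal((8, 32))
loss_aligned = info_nce(q, q, negs)                   # same-fingerprint pair: low loss
loss_random = info_nce(q, rng.standard_normal(32), negs)
print(loss_aligned < loss_random)
```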

[CV-66] IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation Understanding and Generation

【速读】:该论文旨在解决工业缺陷检测(Industrial Anomaly Detection, IAD)中长期存在的多任务协同难题,即如何在统一框架下同时实现缺陷定位、自然语言解释和受控缺陷编辑。现有方法无法在统一架构与评估协议中兼顾这三项能力。其解决方案的关键在于提出IAD-Unify框架:该框架采用双编码器结构,利用冻结的DINOv2区域专家通过轻量级token注入向共享的Qwen3.5-4B视觉语言骨干网络提供精确异常证据,从而联合实现异常分割、基于区域的语义理解与掩码引导的生成。这一设计实现了三类任务的端到端协同优化,并通过构建Anomaly-56K多任务评估平台验证了其有效性与泛化能力。

链接: https://arxiv.org/abs/2604.12440
作者: Haoyu Zheng,Tianwei Lin,Wei Wang,Zhuonan Wang,Wenqiao Zhang,Jiaqi Zhu,Feifei Shao
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-world industrial inspection requires not only localizing defects, but also explaining them in natural language and generating controlled defect edits. However, existing approaches fail to jointly support all three capabilities within a unified framework and evaluation protocol. We propose IAD-Unify, a dual-encoder unified framework in which a frozen DINOv2-based region expert supplies precise anomaly evidence to a shared Qwen3.5-4B vision-language backbone via lightweight token injection, jointly enabling anomaly segmentation, region-grounded understanding, and mask-guided generation. To enable unified evaluation, we further construct Anomaly-56K, a comprehensive unified multi-task IAD evaluation platform, spanning 59,916 images across 24 categories and 104 defect variants. Controlled ablations yield four findings: (i) region grounding is the decisive mechanism for understanding, removing it degrades location accuracy by 76 pp; (ii) predicted-region performance closely matches oracle, confirming deployment viability; (iii) region-grounded generation achieves the best full-image fidelity and masked-region perceptual quality; and (iv) pre-initialized joint training improves understanding at negligible generation cost (-0.16 dB). IAD-Unify further achieves strong performance on the MMAD benchmark, including categories unseen during training, demonstrating robust cross-category generalization.

[CV-67] A Hybrid Architecture for Benign-Malignant Classification of Mammography ROIs

【速读】:该论文旨在解决乳腺X线摄影中可疑病灶的精准分类问题,以支持早期诊断与治疗规划。传统卷积神经网络(CNN)虽擅长提取局部视觉特征,但难以建模长距离依赖关系;而视觉Transformer(ViT)虽能通过自注意力机制捕捉全局信息,却因二次计算复杂度导致效率低下。解决方案的关键在于提出一种混合架构:采用EfficientNetV2-M作为主干网络进行高效局部特征提取,并引入基于状态空间模型(SSM)的Vision Mamba来实现线性复杂度的全局上下文建模,从而在ROI级别上实现了对CBIS-DDSM数据集中的乳腺病灶进行良恶性二分类的高性能识别。

链接: https://arxiv.org/abs/2604.12437
作者: Mohammed Asad,Mohit Bajpai,Sudhir Singh,Rahul Katarya
机构: Delhi Technological University (德里科技大学); IGDTUW (印度理工学院德里分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Accurate characterization of suspicious breast lesions in mammography is important for early diagnosis and treatment planning. While Convolutional Neural Networks (CNNs) are effective at extracting local visual patterns, they are less suited to modeling long-range dependencies. Vision Transformers (ViTs) address this limitation through self-attention, but their quadratic computational cost can be prohibitive. This paper presents a hybrid architecture that combines EfficientNetV2-M for local feature extraction with Vision Mamba, a State Space Model (SSM), for efficient global context modeling. The proposed model performs binary classification of abnormality-centered mammography regions of interest (ROIs) from the CBIS-DDSM dataset into benign and malignant classes. By combining a strong CNN backbone with a linear-complexity sequence model, the approach achieves strong lesion-level classification performance in an ROI-based setting.
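
【代码示意】Vision Mamba 的线性复杂度来自状态空间模型(SSM)的递推扫描。下面是一维标量递推 h_t = a·h_{t-1} + b·x_t, y_t = c·h_t 的玩具实现,仅说明"一次线性扫描即可传播全局上下文"的含义,与 Vision Mamba 实际的选择性、多维参数化无关:

```python
import numpy as np

def ssm_scan(x, a=0.9, b=1.0, c=1.0):
    # linear recurrence: one pass over the sequence, O(L) time
    h, ys = 0.0, []
    for xt in x:
        h = a * h + b * xt    # state update carries long-range context
        ys.append(c * h)      # readout
    return np.array(ys)

y = ssm_scan(np.array([1.0, 0.0, 0.0, 0.0]))
print(y)   # impulse response decays geometrically: a^t
```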

[CV-68] DeferredSeg: A Multi-Expert Deferral Framework for Trustworthy Medical Image Segmentation

【速读】:该论文旨在解决深度神经网络在医学图像分割任务中产生的置信度不可靠问题,尤其是在边界模糊区域容易出现过自信或欠自信现象,从而影响临床部署的可信度。其解决方案的关键在于提出DeferredSeg框架,该框架基于学习性延迟(Learning-to-Defer, L2D)范式,引入一个聚合的延迟预测器和额外的路由通道,动态地将每个像素分配给基础分割模型或人工专家;同时设计像素级代理协作损失以监督延迟决策,并通过空间一致性损失确保延迟区域的空间连贯性,进一步提升可靠性。此外,该框架还扩展至多专家场景,通过引入多个差异专家并加入负载均衡惩罚机制,实现专家间工作量的公平分配,从而构建高效、可信赖的人机协同分割系统。

链接: https://arxiv.org/abs/2604.12411
作者: Qiuyu Tian,Haoliang Sun,Yunshan Wang,Yinghuan Shi,Yilong Yin
机构: Shandong University (山东大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 6 figures

点击查看摘要

Abstract:Segmentation models based on deep neural networks demonstrate strong generalization for medical image segmentation. However, they often exhibit overconfidence or underconfidence, leading to unreliable confidence scores for segmentation masks, especially in ambiguous regions. This undermines the trustworthiness required for clinical deployment. Motivated by the learning-to-defer (L2D) paradigm, we introduce DeferredSeg, a deferral-aware segmentation framework, i.e., a Human–AI collaboration system that determines whether to defer predictions to human experts in specific regions. DeferredSeg extends the base segmentor with an aggregated deferral predictor and additional routing channels that dynamically route each pixel to either the base segmentor or a human expert. To train this routing efficiently, we introduce a pixel-wise surrogate collaboration loss that supervises deferral decisions. In addition, to preserve spatial coherence within deferral regions, we propose a spatial-coherence loss that enforces smooth deferral masks, thereby enhancing reliability. Beyond single-expert deferral, we further extend the framework to a multi-expert setting by introducing multiple discrepancy experts for collaborative decision-making. To prevent overloading or underutilizing individual experts, we further design a load-balancing penalty that evenly distributes workload across expert branches. We evaluate DeferredSeg on three challenging medical datasets using MedSAM and CENet as the base segmentor for fair comparison. Experimental results show that DeferredSeg consistently outperforms the baseline, demonstrating its effectiveness for trustworthy dense medical segmentation. Moreover, the proposed framework is model-agnostic and can be readily applied to other segmentation architectures. 
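
【代码示意】像素级"学习性延迟"路由与负载均衡惩罚可以用如下玩具代码示意:延迟分数高于阈值的像素交给专家,专家工作量份额偏离均匀分布则产生惩罚。阈值路由与平方偏差惩罚均为整理时的简化假设,并非论文的实际损失形式:

```python
import numpy as np

def route_pixels(defer_scores, threshold=0.5):
    # pixels whose deferral score exceeds the threshold go to a human expert
    return defer_scores > threshold

def load_balance_penalty(assignments, n_experts):
    # squared deviation of each expert's workload share from the uniform share
    counts = np.bincount(assignments, minlength=n_experts)
    share = counts / counts.sum()
    return float(((share - 1.0 / n_experts) ** 2).sum())

defer = np.array([[0.1, 0.8], [0.7, 0.2]])   # (H, W) deferral scores
mask = route_pixels(defer)                   # ambiguous pixels are deferred
balanced = np.array([0, 1, 2, 0, 1, 2])      # even workload across 3 experts
skewed = np.array([0, 0, 0, 0, 1, 2])        # expert 0 overloaded
print(mask.sum(), load_balance_penalty(balanced, 3), load_balance_penalty(skewed, 3))
```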

[CV-69] Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning CVPR2026

【速读】:该论文旨在解决测试时提示调优(Test-Time Prompt Tuning, TPT)在视觉-语言模型中因视图选择不当而导致性能受限的问题。现有方法依赖于基于熵的过滤策略,利用模型内部置信度分数筛选有效视图,但在分布偏移场景下,这些置信度常出现校准偏差,导致模型对无关区域或背景区域赋予过高置信,忽略语义内容。解决方案的关键在于提出一种双模态锚点引导框架,通过引入文本锚点(text anchor)提供细粒度类别语义信息,并设计自适应图像锚点(adaptive image anchor)捕捉测试时动态统计特征,基于两者对齐度和置信度进行视图筛选,从而确保仅语义相关的视图参与模型适配;同时将锚点视为辅助预测头,以置信度加权融合其输出与原始预测结果,形成稳定监督信号用于提示更新,显著提升了提示调优的鲁棒性和效果。

链接: https://arxiv.org/abs/2604.12403
作者: Jungwon Choi,Eunwoo Kim
机构: Chung-Ang University (中央大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026 findings

点击查看摘要

Abstract:Test-Time Prompt Tuning (TPT) adapts vision-language models using augmented views, but its effectiveness is hindered by the challenge of determining which views are beneficial. Standard entropy-based filtering relies on the internal confidence scores of the model, which are often miscalibrated under distribution shift, assigning high confidence to irrelevant crops or background regions while ignoring semantic content. To address this, we propose a dual-modality anchor-guided framework that grounds view selection in semantic evidence. We introduce a text anchor from attribute-rich descriptions, to provide fine-grained class semantics, and an adaptive image anchor that captures evolving test-time statistics. Using these anchors, we filter views based on alignment and confidence, ensuring that only informative views guide adaptation. Moreover, we treat the anchors as auxiliary predictive heads and combine their predictions with the original output in a confidence-weighted ensemble, yielding a stable supervision signal for prompt updates. Extensive experiments on 15 benchmark datasets demonstrate new state-of-the-art performance, highlighting the contribution of anchor-guided supervision as a foundation for robust prompt updates.
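
【代码示意】双锚点视图筛选的核心是按与文本锚点、图像锚点的对齐度给增强视图打分、保留高分视图。下面是一个假设性的余弦打分草图(锚点向量、视图嵌入均为编造的玩具数据):

```python
import numpy as np

def select_views(view_embs, text_anchor, image_anchor, keep=2):
    # score each view by its mean cosine alignment with the two anchors
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = np.array([(cos(v, text_anchor) + cos(v, image_anchor)) / 2
                       for v in view_embs])
    return np.argsort(scores)[::-1][:keep]          # indices of kept views

anchor_t = np.array([1.0, 0.0, 0.0])                # text-anchor embedding
anchor_i = np.array([0.9, 0.1, 0.0])                # adaptive image anchor
views = np.array([[1.0, 0.1, 0.0],                  # well-aligned crop
                  [0.0, 1.0, 0.0],                  # background crop
                  [0.8, 0.2, 0.1]])                 # partially aligned crop
print(sorted(select_views(views, anchor_t, anchor_i).tolist()))
```

论文中筛选后还会以置信度加权集成锚点预测与原始预测,此处省略。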

[CV-70] Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models CVPR2026

【速读】:该论文旨在解决视觉基础模型(Vision Foundation Models, VFMs)预训练过程中计算成本高、效率低的问题,尤其在模型家族规模扩展时难以实现高效训练。其解决方案的关键在于提出链式模型预训练(Chain-of-Models Pre-Training, CoM-PT),通过构建按模型尺寸升序排列的“模型链”,仅对最小模型进行标准预训练,其余模型则通过从其前驱小模型逆向迁移知识的方式,在参数空间和特征空间中联合复用知识进行高效训练。该方法在不损失性能的前提下显著降低整体训练开销,并展现出随模型家族规模增大而进一步提升效率的独特优势。

链接: https://arxiv.org/abs/2604.12391
作者: Jiawei Fan,Shigeng Wang,Chao Li,Xiaolong Liu,Anbang Yao
机构: Intel Labs China; iMotion Automotive Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This work is accepted to CVPR 2026. Code is available at this https URL

点击查看摘要

Abstract:In this paper, we present Chain-of-Models Pre-Training (CoM-PT), a novel performance-lossless training acceleration method for vision foundation models (VFMs). This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing each model individually, CoM-PT is designed to accelerate the training pipeline at the model family level, scaling efficiently as the model family expands. Specifically, CoM-PT establishes a pre-training sequence for the model family, arranged in ascending order of model size, called model chain. In this chain, only the smallest model undergoes standard individual pre-training, while the other models are efficiently trained through sequential inverse knowledge transfer from their smaller predecessors by jointly reusing the knowledge in the parameter space and the feature space. As a result, CoM-PT enables all models to achieve performance that is mostly superior to standard individual training while significantly reducing training cost, and this is extensively validated across 45 datasets spanning zero-shot and fine-tuning tasks. Notably, its efficient scaling property yields a remarkable phenomenon: training more models even results in higher efficiency. For instance, when pre-training on CC3M: i) given ViT-L as the largest model, progressively prepending smaller models to the model chain reduces computational complexity by up to 72%; ii) within a fixed model size range, as the VFM family scales across 3, 4, and 7 models, the acceleration ratio of CoM-PT exhibits a striking leap: from 4.13X to 5.68X and 7.09X. Since CoM-PT is naturally agnostic to specific pre-training paradigms, we open-source the code to spur further extensions in more computationally intensive scenarios, such as large language model pre-training.

[CV-71] Modality-Agnostic Prompt Learning for Multi-Modal Camouflaged Object Detection

【速读】:该论文旨在解决当前遮蔽目标检测(Camouflaged Object Detection, COD)中因依赖特定模态架构或定制融合策略而导致的可扩展性差与跨模态泛化能力弱的问题。解决方案的关键在于提出一种模态无关的多模态提示生成框架,通过在数据驱动的内容域与知识驱动的提示域之间建模交互,将任务相关的线索提炼为统一提示以供Segment Anything Model (SAM) 解码,并引入轻量级掩码精修模块(Mask Refine Module)利用细粒度提示线索校准粗略预测,从而显著提升COD任务的性能与跨模态适应能力。

链接: https://arxiv.org/abs/2604.12380
作者: Hao Wang,Jiqing Zhang,Xin Yang,Baocai Yin,Lu Jiang,Zetian Mi,Huibing Wang
机构: Dalian Maritime University (大连海事大学); Beijing University of Technology (北京工业大学); Dalian University of Technology (大连理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10

点击查看摘要

Abstract:Camouflaged Object Detection (COD) aims to segment objects that blend seamlessly into complex backgrounds, with growing interest in exploiting additional visual modalities to enhance robustness through complementary information. However, most existing approaches generally rely on modality-specific architectures or customized fusion strategies, which limit scalability and cross-modal generalization. To address this, we propose a novel framework that generates modality-agnostic multi-modal prompts for the Segment Anything Model (SAM), enabling parameter-efficient adaptation to arbitrary auxiliary modalities and significantly improving overall performance on COD tasks. Specifically, we model multi-modal learning through interactions between a data-driven content domain and a knowledge-driven prompt domain, distilling task-relevant cues into unified prompts for SAM decoding. We further introduce a lightweight Mask Refine Module to calibrate coarse predictions by incorporating fine-grained prompt cues, leading to more accurate camouflaged object boundaries. Extensive experiments on RGB-Depth, RGB-Thermal, and RGB-Polarization benchmarks validate the effectiveness and generalization of our modality-agnostic framework.

[CV-72] Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models ICLR2026

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在面对排版提示注入攻击(typographic prompt injection attacks)时的安全性问题,即通过将恶意文本以图像形式渲染来绕过模型内置的安全机制,从而对自主代理系统(如浏览器自动化、具身智能体等)构成威胁。其关键解决方案在于系统性评估不同VLMs在多种字体大小和视觉扰动条件下的攻击成功率(ASR),并发现文本与图像嵌入距离(text-image embedding distance)与攻击成功率呈现强负相关关系(r = -0.71 至 -0.93, p < 0.01),表明嵌入空间一致性是影响模型鲁棒性的核心因素。研究进一步揭示了模型特异性防御模式——例如GPT-4o和Claude更易受文本攻击,而Mistral和Qwen3-VL在多模态攻击下表现相对稳定——从而强调不能采用统一的防御策略,需依据具体应用场景选择具备更高鲁棒性的VLM骨干架构。

链接: https://arxiv.org/abs/2604.12371
作者: Ravikumar Balakrishnan,Sanket Mendapara,Ankit Garg
机构: Cisco AI (思科人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICLR 2026 Workshop on Agents in the Wild

点击查看摘要

Abstract:We study typographic prompt injection attacks on vision-language models (VLMs), where adversarial text is rendered as images to bypass safety mechanisms, posing a growing threat as VLMs serve as the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, while the growing ecosystem of VLMs exhibits substantial variation in vulnerability, complicating defensive approaches. Evaluating 1,000 prompts from SALAD-Bench across four VLMs, namely, GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct under varying font sizes (6–28px) and visual transformations (rotation, blur, noise, contrast changes), we find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6px) yielding near-zero ASR while mid-range fonts achieve peak effectiveness; (2) text attacks are more effective than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%), while Qwen3-VL and Mistral show comparable ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) shows strong negative correlation with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10–12% and reduce ASR by 34–96%, while rotation asymmetrically affects models (Mistral drops 50%, GPT-4o unchanged). These findings highlight that model-specific robustness patterns preclude one-size-fits-all defenses and offer empirical guidance for practitioners selecting VLM backbones for agentic systems operating in adversarial environments.
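
【代码示意】论文报告的"嵌入距离与 ASR 强负相关"用的是 Pearson 相关系数。下面给出其标准计算,并用一组呼应该趋势的玩具数据演示(数值为编造的示例,并非论文数据):

```python
import numpy as np

def pearson_r(x, y):
    # plain Pearson correlation coefficient
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# toy numbers echoing the reported trend: larger text-image embedding
# distance, lower attack success rate (illustrative, not from the paper)
embed_dist = [0.20, 0.25, 0.30, 0.35, 0.40, 0.45]
asr        = [0.47, 0.40, 0.33, 0.21, 0.12, 0.05]
print(pearson_r(embed_dist, asr))
```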

[CV-73] Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLM s Decoding

【速读】:该论文旨在解决现有视觉标记剪枝(Visual Token Pruning)方法在复杂视觉推理任务中表现不佳的问题,即这些方法在简单视觉理解任务中表现稳定,但在需要深层次推理的任务上难以有效泛化。研究通过系统分析发现,导致性能下降的关键原因是解码阶段的“相关视觉信息偏移”(Relevant Visual Information Shift, RVIS)。为应对这一挑战,作者提出了一种无需训练的附加框架——解码阶段感知偏移的标记剪枝(Decoding-stage Shift-aware Token Pruning, DSTP),其核心在于使已有剪枝方法能够动态对齐解码过程中不断变化的推理需求,从而显著提升复杂视觉推理任务中的鲁棒性,并在多个主流架构上实现高效且一致的性能增益。

链接: https://arxiv.org/abs/2604.12358
作者: Jiwan Kim,Kibum Kim,Wonjoong Kim,Byung-Kwan Lee,Chanyoung Park
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint, Project : this https URL

点击查看摘要

Abstract:Recently, visual token pruning has been studied to handle the vast number of visual tokens in Multimodal Large Language Models. However, we observe that while existing pruning methods perform reliably on simple visual understanding, they struggle to effectively generalize to complex visual reasoning tasks, a critical gap underexplored in previous studies. Through a systematic analysis, we identify Relevant Visual Information Shift (RVIS) during decoding as the primary failure driver. To address this, we propose Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that enables existing pruning methods to align visual tokens with shifting reasoning requirements during the decoding stage. Extensive experiments demonstrate that DSTP significantly mitigates performance degradation of pruning methods in complex reasoning tasks, while consistently yielding performance gains even across visual understanding benchmarks. Furthermore, DSTP demonstrates effectiveness across diverse state-of-the-art architectures, highlighting its generalizability and efficiency with minimal computational overhead.
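
【代码示意】"相关视觉信息偏移(RVIS)"指解码过程中不同步骤依赖的视觉标记会变化,固定剪枝因而失效。下面用逐步按查询重排并保留 top-k 标记的玩具代码示意这一现象(打分方式与真实的 DSTP 无关):

```python
import numpy as np

def select_tokens(query, visual_tokens, keep=2):
    # rank visual tokens by relevance to the current decoder query
    scores = visual_tokens @ query
    return set(np.argsort(scores)[::-1][:keep].tolist())

tokens = np.array([[1.0, 0.0],      # token needed by an early decoding step
                   [0.0, 1.0],      # token needed by a later reasoning step
                   [0.7, 0.7]])     # generally useful token
step1 = select_tokens(np.array([1.0, 0.0]), tokens)
step2 = select_tokens(np.array([0.0, 1.0]), tokens)
print(step1, step2, step1 == step2)   # the relevant set shifts across steps
```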

[CV-74] ReflectCAP: Detailed Image Captioning with Reflective Memory

【速读】:该论文旨在解决详细图像描述(Detailed Image Captioning)中事实准确性(factual grounding)与细节覆盖度(fine-grained coverage)难以同时提升的问题。现有方法在二者之间存在权衡,常因大视觉语言模型(Large Vision-Language Model, LVLM)的系统性幻觉(hallucination)或遗漏关键信息而表现不佳。解决方案的关键在于提出反射式注释引导的描述生成框架(Reflective Note-Guided Captioning, ReflectCAP),其通过多智能体流水线分析LVLM在不同图像上的重复性错误模式,提炼出结构化的“反射笔记”(Structured Reflection Notes),这些笔记在推理阶段作为指导信号,分别约束模型避免特定幻觉并聚焦于被忽略的细节,从而协同优化事实性和覆盖率,在多个主流LVLM上实现帕累托前沿性能,并显著优于模型扩容或多智能体基线方案的计算开销。

链接: https://arxiv.org/abs/2604.12357
作者: Kyungmin Min,Minbeom Kim,Kang-il Lee,Seunghyun Yoon,Kyomin Jung
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Detailed image captioning demands both factual grounding and fine-grained coverage, yet existing methods have struggled to achieve them simultaneously. We address this tension with Reflective Note-Guided Captioning (ReflectCAP), where a multi-agent pipeline analyzes what the target large vision-language model (LVLM) consistently hallucinates and what it systematically overlooks, distilling these patterns into reusable guidelines called Structured Reflection Notes. At inference time, these notes steer the captioning model along both axes – what to avoid and what to attend to – yielding detailed captions that jointly improve factuality and coverage. Applying this method to 8 LVLMs spanning the GPT-4.1 family, Qwen series, and InternVL variants, ReflectCAP reaches the Pareto frontier of the trade-off between factuality and coverage, and delivers substantial gains on CapArena-Auto, where generated captions are judged head-to-head against strong reference models. Moreover, ReflectCAP offers a more favorable trade-off between caption quality and compute cost than model scaling or existing multi-agent pipelines, which incur 21–36% greater overhead. This makes high-quality detailed captioning viable under real-world cost and latency constraints.

[CV-75] OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion CVPR2026

【速读】:该论文旨在解决两个核心问题:一是现有食物营养估计数据集主要覆盖西方菜肴,缺乏对中国菜肴的充分表征,导致对中式餐食的营养估算不准确;二是当前先进的营养预测方法多依赖深度传感器,在日常场景中应用受限。解决方案的关键在于构建两个关键资源与一个端到端预测框架:首先,提出OmniFood8K多模态数据集(含8,036个样本,带详细营养标注和多视角图像),以增强对中国食物的建模能力;其次,构建NutritionSynth-115K大规模合成数据集,通过引入成分组合变化并保持精确营养标签来提升模型泛化性;最后,设计基于单张RGB图像的端到端营养预测框架,其中包含三个核心技术模块:Scale-Shift Residual Adapter(SSRA)用于从单图恢复高质量深度图并保证全局尺度一致性和局部结构保留,Frequency-Aligned Fusion Module(FAFM)实现RGB与深度特征在频域中的分层对齐融合,以及Mask-based Prediction Head(MPH)通过动态通道选择聚焦关键食材区域以提升预测精度。

链接: https://arxiv.org/abs/2604.12356
作者: Dongjian Yu,Weiqing Min,Qian Jiang,Xing Lin,Xin Jin,Shuqiang Jiang
机构: Yunnan University (云南大学); Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026 (Highlight Paper)

点击查看摘要

Abstract:Accurate estimation of food nutrition plays a vital role in promoting healthy dietary habits and personalized diet management. Most existing food datasets primarily focus on Western cuisines and lack sufficient coverage of Chinese dishes, which restricts accurate nutritional estimation for Chinese meals. Moreover, many state-of-the-art nutrition prediction methods rely on depth sensors, restricting their applicability in daily scenarios. To address these limitations, we introduce OmniFood8K, a comprehensive multimodal dataset comprising 8,036 food samples, each with detailed nutritional annotations and multi-view images. In addition, to enhance models’ capability in nutritional prediction, we construct NutritionSynth-115K, a large-scale synthetic dataset that introduces compositional variations while preserving precise nutritional labels. Moreover, we propose an end-to-end framework for nutritional prediction from a single RGB image. First, we predict a depth map from a single RGB image and design the Scale-Shift Residual Adapter (SSRA) to refine it for global scale consistency and local structural preservation. Second, we propose the Frequency-Aligned Fusion Module (FAFM) to hierarchically align and fuse RGB and depth features in the frequency domain. Finally, we design a Mask-based Prediction Head (MPH) to emphasize key ingredient regions via dynamic channel selection for more accurate prediction. Extensive experiments on multiple datasets demonstrate the superiority of our method over existing approaches. Project homepage: this https URL
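
【代码示意】"在频域中对齐并融合 RGB 与深度特征"最简单的形式是对两者频谱做加权混合后再逆变换。下面的标量混合仅为示意(论文中的 FAFM 是分层、可学习的逐频率融合):

```python
import numpy as np

def frequency_fuse(rgb_feat, depth_feat, alpha=0.5):
    # blend the two spectra, then map back to the spatial domain
    fused_spec = alpha * np.fft.fft2(rgb_feat) + (1 - alpha) * np.fft.fft2(depth_feat)
    return np.real(np.fft.ifft2(fused_spec))

rng = np.random.default_rng(0)
a, b = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
# by linearity of the FFT, a scalar spectral blend equals the spatial blend;
# learned per-frequency weights are what make frequency-domain fusion non-trivial
print(np.allclose(frequency_fuse(a, b, 0.5), 0.5 * a + 0.5 * b))
```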

[CV-76] Combating Pattern and Content Bias: Adversarial Feature Learning for Generalized AI-Generated Image Detection

【速读】:该论文旨在解决生成式 AI(Generative AI)技术快速发展背景下,现有图像伪造检测方法在跨模型泛化能力上的不足问题,尤其是由于训练数据偏差导致模型过度拟合特定生成模式和内容特征,而非捕捉不同生成模型间共有的伪造特征(即不对称偏差学习问题)。其解决方案的关键在于提出多维对抗特征学习(Multi-dimensional Adversarial Feature Learning, MAFL)框架:该框架以预训练的多模态图像编码器为特征提取主干,构建真实性判别特征学习网络,并设计一个带有多维对抗损失的对抗偏差学习分支,形成真实性特征学习与偏差特征学习之间的对抗训练机制;通过抑制生成模式和内容偏差,引导模型聚焦于跨模型共享的生成特征,从而有效识别真实与伪造图像的本质差异,显著提升跨模型泛化性能并降低对大规模标注数据的依赖。

链接: https://arxiv.org/abs/2604.12353
作者: Haifeng Zhang,Qinghui He,Xiuli Bi,Bo Liu,Chi-Man Pun,Bin Xiao
机构: Chongqing University of Posts and Telecommunications (重庆邮电大学); University of Macau (澳门大学); Jinan Inspur Data Technology Co., Ltd. (济南浪潮数据技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In recent years, the rapid development of generative artificial intelligence technology has significantly lowered the barrier to creating high-quality fake images, posing a serious challenge to information authenticity and credibility. Existing generated image detection methods typically enhance generalization through model architecture or network design. However, their generalization performance remains susceptible to data bias, as the training data may drive models to fit specific generative patterns and content rather than the common features shared by images from different generative models (asymmetric bias learning). To address this issue, we propose a Multi-dimensional Adversarial Feature Learning (MAFL) framework. The framework adopts a pretrained multimodal image encoder as the feature extraction backbone, constructs a real-fake feature learning network, and designs an adversarial bias-learning branch equipped with a multi-dimensional adversarial loss, forming an adversarial training mechanism between authenticity-discriminative feature learning and bias feature learning. By suppressing generation-pattern and content biases, MAFL guides the model to focus on the generative features shared across different generative models, thereby effectively capturing the fundamental differences between real and generated images, enhancing cross-model generalization, and substantially reducing the reliance on large-scale training data. Through extensive experimental validation, our method outperforms existing state-of-the-art approaches by 10.89% in accuracy and 8.57% in Average Precision (AP). Notably, even when trained with only 320 images, it can still achieve over 80% detection accuracy on public datasets.
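
【代码示意】对抗偏差学习的标量化直觉:特征提取器在最小化真伪判别损失的同时最大化偏差分支的损失,使特征尽量不携带生成模式/内容偏差信息。下面的单行目标函数是整理时的极度简化,与 MAFL 的多维对抗损失并不相同:

```python
def mafl_style_objective(loss_auth, loss_bias, lam=0.1):
    # feature extractor minimises the authenticity loss while *maximising*
    # the bias branch's loss, discouraging bias-carrying features
    return loss_auth - lam * loss_bias

# when the bias branch finds bias easy to predict (low loss_bias), the
# combined objective grows, pushing features away from that bias
print(mafl_style_objective(0.4, 0.2) > mafl_style_objective(0.4, 0.9))
```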

[CV-77] Fundus Image-based Glaucoma Screening via Retinal Knowledge-Oriented Dynamic Multi-Level Feature Integration

【速读】:该论文旨在解决现有基于深度学习的青光眼自动诊断模型在跨域临床数据集上鲁棒性不足的问题,其核心挑战在于模型缺乏对视网膜解剖先验知识的显式整合,且固定区域特征提取难以捕捉病灶可能出现在非预定义区域的情况。解决方案的关键在于提出一种以视网膜知识为导向的筛查框架,通过三分支结构协同捕获全局视网膜上下文、视盘/视杯结构特征以及动态定位的病理性区域;其中,动态窗口机制(Dynamic Window Mechanism)用于自适应识别诊断信息丰富的区域,而知识增强卷积注意力模块(Knowledge-Enhanced Convolutional Attention Module)则引入预训练基础模型提取的视网膜先验来引导注意力学习,从而显著提升模型的泛化能力与诊断准确性。

链接: https://arxiv.org/abs/2604.12351
作者: Yuzhuo Zhou,Chi Liu,Sheng Shen,Zongyuan Ge,Fengshi Jing,Shiran Zhang,Yu Jiang,Anli Wang,Wenjian Liu,Feilong Yang,Tianqing Zhu,Xiaotong Han
机构: City University of Macau (澳门城市大学); Guangzhou Zoc Technology Co., Ltd. (广州佐客科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages. In submission to an Elsevier Journal

点击查看摘要

Abstract:Automated diagnosis based on color fundus photography is essential for large-scale glaucoma screening. However, existing deep learning models are typically data-driven and lack explicit integration of retinal anatomical knowledge, which limits their robustness across heterogeneous clinical datasets. Moreover, pathological cues in fundus images may appear beyond predefined anatomical regions, making fixed-region feature extraction insufficient for reliable diagnosis. To address these challenges, we propose a retinal knowledge-oriented glaucoma screening framework that integrates dynamic multi-scale feature learning with domain-specific retinal priors. The framework adopts a tri-branch structure to capture complementary retinal representations, including global retinal context, structural features of the optic disc/cup, and dynamically localized pathological regions. A Dynamic Window Mechanism is devised to adaptively identify diagnostically informative regions, while a Knowledge-Enhanced Convolutional Attention Module incorporates retinal priors extracted from a pre-trained foundation model to guide attention learning. Extensive experiments on the large-scale AIROGS dataset demonstrate that the proposed method outperforms diverse baselines, achieving an AUC of 98.5% and an accuracy of 94.6%. Additional evaluations on multiple datasets from the SMDG-19 benchmark further confirm its strong cross-domain generalization capability, indicating that knowledge-guided attention combined with adaptive lesion localization can significantly improve the robustness of automated glaucoma screening systems.

[CV-78] Unlocking the Potential of Grounding DINO in Videos: Parameter-Efficient Adaptation for Limited-Data Spatial-Temporal Localization

【速读】:该论文旨在解决时空视频定位(Spatio-Temporal Video Grounding, STVG)任务中因数据稀缺导致的模型过拟合与泛化能力不足问题。现有方法依赖大规模标注数据,而密集帧级边界框和复杂的时间语言对齐标注成本高昂,尤其在专业视频领域难以获取;同时,零样本基础模型缺乏任务特定的时间感知能力。解决方案的关键在于提出ST-GD框架,通过冻结预训练2D视觉-语言模型(如Grounding DINO),并引入轻量级适配器(约10M可训练参数)注入时空感知能力,辅以新颖的时间解码器进行边界预测,从而在小数据场景下实现高效迁移与精准定位,显著提升性能与鲁棒性。

链接: https://arxiv.org/abs/2604.12346
作者: Zanyi Wang,Fan Li,Dengyang Jiang,Liuzhuozheng Li,Yunhua Zhong,Guang Dai,Mengmeng Wang
机构: SGIT AI Lab, State Grid Corporation of China (国家电网公司 SGIT 人工智能实验室); University of HongKong (香港大学); Zhejiang University of Technology (浙江工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spatio-temporal video grounding (STVG) aims to localize queried objects within dynamic video segments. Prevailing fully-trained approaches are notoriously data-hungry. However, gathering large-scale STVG data is exceptionally challenging: dense frame-level bounding boxes and complex temporal language alignments are prohibitively expensive to annotate, especially for specialized video domains. Consequently, conventional models suffer from severe overfitting on these inherently limited datasets, while zero-shot foundational models lack the task-specific temporal awareness needed for precise localization. To resolve this small-data challenge, we introduce ST-GD, a data-efficient framework that adapts pre-trained 2D visual-language models (e.g., Grounding DINO) to video tasks. To avoid destroying pre-trained priors on small datasets, ST-GD keeps the base model frozen and strategically injects lightweight adapters (~10M trainable parameters) to instill spatio-temporal awareness, alongside a novel temporal decoder for boundary prediction. This design naturally counters data scarcity. Consequently, ST-GD excels in data-scarce scenarios, achieving highly competitive performance on the limited-scale HC-STVG v1/v2 benchmarks, while maintaining robust generalization on the VidSTG dataset. This validates ST-GD as a powerful paradigm for complex video understanding under strict small-data constraints.
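
【代码示意】"冻结主干 + 轻量适配器"通常采用瓶颈结构:降维、非线性、升维、残差相加,且升维矩阵零初始化使适配器初始等价于恒等映射,不破坏预训练先验。以下为该结构的 NumPy 草图(维度与初始化方式为常见做法的假设,非 ST-GD 官方实现):

```python
import numpy as np

class Adapter:
    def __init__(self, dim, bottleneck, rng):
        # down-project, ReLU, up-project; only these weights would be trained
        self.down = rng.standard_normal((dim, bottleneck)) * 0.02
        self.up = np.zeros((bottleneck, dim))   # zero-init: starts as identity

    def __call__(self, x):
        return x + np.maximum(x @ self.down, 0.0) @ self.up   # residual add

rng = np.random.default_rng(0)
adapter = Adapter(dim=16, bottleneck=4, rng=rng)
x = rng.standard_normal((3, 16))
# zero-initialised up-projection => the frozen model's behaviour is
# preserved exactly at the start of training
print(np.allclose(adapter(x), x), adapter.down.size + adapter.up.size)
```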

[CV-79] Detecting Precise Hand Touch Moments in Egocentric Video CVPR

【速读】:该论文旨在解决在第一人称视角视频中精确检测手与物体接触时刻的难题(即帧级接触检测),这对于增强现实、人机交互、辅助技术和机器人学习等应用至关重要。其核心挑战在于接触前细微的手部运动变化、频繁的遮挡、精细的操作模式以及第一人称视角固有的动态特性。解决方案的关键在于提出一种手部感知的上下文增强模块(Hand-informed Context Enhanced module, HiCE),该模块通过交叉注意力机制融合手部区域及其周围环境的时空特征,以识别潜在的接触模式;同时引入抓握感知损失函数和软标签,强化对手部姿态和运动动态的建模能力,从而有效区分接近接触与真实接触帧。

链接: https://arxiv.org/abs/2604.12343
作者: Huy Anh Nguyen,Feras Dayoub,Minh Hoai
机构: Australian Institute for Machine Learning, Adelaide University, Australia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR Findings 2026

点击查看摘要

Abstract:We address the challenging task of detecting the precise moment when hands make contact with objects in egocentric videos. This frame-level detection is crucial for augmented reality, human-computer interaction, assistive technologies, and robot learning applications, where contact onset signals action initiation or completion. Temporally precise detection is particularly challenging due to subtle hand motion variations near contact, frequent occlusions, fine-grained manipulation patterns, and the inherent motion dynamics of first-person perspectives. To tackle these challenges, we propose a Hand-informed Context Enhanced module (HiCE; pronounced `high-see’) that leverages spatiotemporal features from hand regions and their surrounding context through cross-attention mechanisms, learning to identify potential contact patterns. Our approach is further refined with a grasp-aware loss and soft label that emphasizes hand pose patterns and movement dynamics characteristic of touch events, enabling the model to distinguish between near-contact and actual contact frames. We also introduce TouchMoment, an egocentric dataset containing 4,021 videos and 8,456 annotated contact moments spanning over one million frames. Experiments on TouchMoment show that, under a strict evaluation criterion that counts a prediction as correct only if it falls within a two-frame tolerance of the ground-truth moment, our method achieves substantial gains and outperforms state-of-the-art event-spotting baselines by 16.91% average precision. Comments: Accepted to CVPR Findings 2026 Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2604.12343 [cs.CV] (or arXiv:2604.12343v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.12343 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
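摘要提到用软标签(soft label)缓解接触时刻帧级判定的歧义。一种常见做法(此处为假设性示意,并非论文的确切公式)是在标注接触帧附近放置高斯形状的软标签:

```python
import math

def soft_contact_labels(num_frames, contact_frame, sigma=1.0):
    """围绕标注接触帧生成高斯形状的帧级软标签,峰值为 1"""
    return [math.exp(-((t - contact_frame) ** 2) / (2 * sigma ** 2))
            for t in range(num_frames)]

labels = soft_contact_labels(num_frames=9, contact_frame=4, sigma=1.0)
assert labels[4] == 1.0          # 接触帧本身权重最大
assert labels[3] == labels[5]    # 向两侧对称衰减
assert labels[0] < labels[3]     # 距离越远权重越低
```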

[CV-80] CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training

【速读】:该论文旨在解决Subset Training(子集训练)中的隐私泄露问题,挑战了“减少训练数据量可降低隐私风险”的普遍假设。研究表明,子集选择过程本身会引入新的隐私攻击面(privacy surface),导致敏感信息通过侧信道元数据或模型输出被 adversaries(攻击者)捕获。解决方案的关键在于提出 CoLA(Choice Leakage Attack)框架,该框架系统性地分析子集训练中的隐私泄露现象,并定义了两种实际攻击场景:基于侧信道信息的 Subset-aware Side-channel Attacks 和黑盒攻击(Black-box Attacks)。在此基础上,论文识别出两个独特的隐私攻击面:Training-membership MIA(TM-MIA)和 Selection-participation MIA(SP-MIA),其中 SP-MIA 将成员身份隐私从模型训练扩展至整个数据-模型供应链,揭示了现有威胁模型对子集训练隐私风险的低估,强调了从个体模型到整个机器学习生态系统的隐私保护必要性。

链接: https://arxiv.org/abs/2604.12342
作者: Qi Li,Cheng-Long Wang,Yinzhi Cao,Di Wang
机构: King Abdullah University of Science and Technology (阿卜杜拉国王科技大学); Provable Responsible AI and Data Analytics Lab (可证明负责任人工智能与数据分析实验室); National University of Singapore (新加坡国立大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Training models on a carefully chosen portion of data rather than the full dataset is now a standard preprocess for modern ML. From vision coreset selection to large-scale filtering in language models, it enables scalability with minimal utility loss. A common intuition is that training on fewer samples should also reduce privacy risks. In this paper, we challenge this assumption. We show that subset training is not privacy free: the very choices of which data are included or excluded can introduce new privacy surface and leak more sensitive information. Such information can be captured by adversaries either through side-channel metadata from the subset selection process or via the outputs of the target model. To systematically study this phenomenon, we propose CoLA (Choice Leakage Attack), a unified framework for analyzing privacy leakage in subset selection. In CoLA, depending on the adversary’s knowledge of the side-channel information, we define two practical attack scenarios: Subset-aware Side-channel Attacks and Black-box Attacks. Under both scenarios, we investigate two privacy surfaces unique to subset training: (1) Training-membership MIA (TM-MIA), which concerns only the privacy of training data membership, and (2) Selection-participation MIA (SP-MIA), which concerns the privacy of all samples that participated in the subset selection process. Notably, SP-MIA enlarges the notion of membership from model training to the entire data-model supply chain. Experiments on vision and language models show that existing threat models underestimate subset-training privacy risks: the expanded privacy surface leaks both training and selection membership, extending risks from individual models to the broader ML ecosystem.
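作为背景,成员推断攻击(MIA)最基础的形式是损失阈值攻击:训练成员的损失通常明显更低。下面给出一个纯 Python 的玩具示意(数据与阈值均为虚构,仅说明原理,并非 CoLA 框架本身的攻击流程):

```python
def loss_threshold_mia(losses, threshold):
    """损失阈值成员推断:损失低于阈值的样本判为训练成员 (member)"""
    return [loss < threshold for loss in losses]

# 示意数据:训练成员的损失通常明显更低
member_losses = [0.05, 0.10, 0.08]
nonmember_losses = [0.90, 1.20, 0.75]

preds = loss_threshold_mia(member_losses + nonmember_losses, threshold=0.5)
tp = sum(preds[:3])                  # 正确识别的成员数
tn = sum(not p for p in preds[3:])   # 正确识别的非成员数
accuracy = (tp + tn) / 6
assert accuracy == 1.0  # 该玩具数据下完全可分
```

论文提出的 SP-MIA 把"成员"概念从模型训练扩展到子集选择过程本身,但其判别信号的来源思路与此类似。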

[CV-81] Bridging the Micro–Macro Gap: Frequency-Aware Semantic Alignment for Image Manipulation Localization

【速读】:该论文旨在解决生成式图像编辑(Generative Image Editing)背景下图像篡改定位(Image Manipulation Localization, IML)中存在的“微观-宏观鸿沟”问题,即传统篡改方法依赖低频伪影特征、扩散生成篡改则呈现局部真实感而缺乏明显痕迹,导致现有方法难以统一建模两类篡改的检测需求。解决方案的关键在于提出FASA框架:通过自适应双频离散余弦变换(DCT)模块提取敏感频率线索,并利用冻结的CLIP模型嵌入进行patch级对比对齐以学习篡改感知语义先验;随后借助语义-频率侧边适配器将这些先验注入分层频率路径中实现多尺度特征交互,并采用原型引导的、频率门控的掩码解码器融合语义一致性与边界感知信息,从而提升对传统与扩散生成篡改的联合定位精度与泛化能力。

链接: https://arxiv.org/abs/2604.12341
作者: Xiaojie Liang,Zhimin Chen,Ziqi Sheng,Wei Lu
机构: Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As generative image editing advances, image manipulation localization (IML) must handle both traditional manipulations with conspicuous forensic artifacts and diffusion-generated edits that appear locally realistic. Existing methods typically rely on either low-level forensic cues or high-level semantics alone, leading to a fundamental micro–macro gap. To bridge this gap, we propose FASA, a unified framework for localizing both traditional and diffusion-generated manipulations. Specifically, we extract manipulation-sensitive frequency cues through an adaptive dual-band DCT module and learn manipulation-aware semantic priors via patch-level contrastive alignment on frozen CLIP representations. We then inject these priors into a hierarchical frequency pathway through a semantic-frequency side adapter for multi-scale feature interaction, and employ a prototype-guided, frequency-gated mask decoder to integrate semantic consistency with boundary-aware localization for tampered region prediction. Extensive experiments on OpenSDI and multiple traditional manipulation benchmarks demonstrate state-of-the-art localization performance, strong cross-generator and cross-dataset generalization, and robust performance under common image degradations.
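摘要中"自适应双频 DCT 模块"的核心在于把 DCT 系数划分为低频与高频两个频带分别利用。下面用 NumPy 写一个朴素的二维 DCT-II 与频带划分草图(划分半径与具体实现方式均为示意性假设):

```python
import numpy as np

def dct2(img):
    """朴素二维 DCT-II(未归一化;仅作示意,实际应使用快速实现)"""
    n = img.shape[0]
    k = np.arange(n)
    # basis[u, x] = cos(pi * (2x + 1) * u / (2n))
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    return basis @ img @ basis.T

def dual_band_split(coeffs, radius):
    """按 (u + v) 到直流分量的距离,把 DCT 系数划分为低频带与高频带"""
    n = coeffs.shape[0]
    u, v = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    low_mask = (u + v) <= radius
    return coeffs * low_mask, coeffs * ~low_mask

rng = np.random.default_rng(0)
coeffs = dct2(rng.standard_normal((8, 8)))
low, high = dual_band_split(coeffs, radius=4)

assert np.allclose(low + high, coeffs)  # 两个频带合并即全部系数
assert np.all(low * high == 0)          # 且互不重叠
```

传统篡改的伪影多集中在高频带,而扩散生成内容的异常往往体现在低频统计上,这正是双频分解的动机。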

[CV-82] All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视频理解任务中依赖大规模真实世界标注数据所面临的成本高、效率低及多样性不足的问题。解决方案的关键在于提出了一种统一的合成数据生成流水线,能够自动创建无限且多样化的多模态视频数据,并支持多种任务格式(如物体计数、视觉问答和分割)在同一框架下进行规模化、一致性的数据生成;此外,引入基于视觉问答(VQA)的微调策略,通过结构化问题引导模型进行更深层次的视觉定位与推理,从而提升模型在真实场景中的泛化能力。实验表明,仅使用合成数据训练的模型在多个挑战性任务上可超越传统方法,验证了该方案作为昂贵人工标注替代路径的有效性。

链接: https://arxiv.org/abs/2604.12335
作者: Tanzila Rahman,Renjie Liao,Leonid Sigal
机构: University of British Columbia (不列颠哥伦比亚大学); Vector Institute for AI (人工智能研究所); Canada CIFAR AI Chair (加拿大CIFAR人工智能主席)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 Pages, 4 Tables, 4 Figures

点击查看摘要

Abstract:Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in real-world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically producing unlimited multimodal video data with rich and diverse supervision. Our framework supports multiple task formats within a single pipeline, enabling scalable and consistent data creation across tasks. To further enhance reasoning ability, we introduce a VQA-based fine-tuning strategy that trains models to answer structured questions about visual content rather than relying solely on captions or simple instructions. This formulation encourages deeper visual grounding and reasoning. We evaluate our approach in three challenging tasks: video object counting, video-based visual question answering, and video object segmentation. Experimental results demonstrate that models trained predominantly on synthetic data generalize effectively to real-world datasets, often outperforming traditionally trained counterparts. Our findings highlight the potential of unified synthetic data pipelines as a scalable alternative to expensive real-world annotation for multimodal video understanding.
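合成数据流水线的一个好处是标注可以从场景描述中"免费"导出。以物体计数任务为例,下述纯 Python 草图从合成场景的物体清单自动生成 VQA 问答对(问题模板与数据结构均为假设,非论文原始流水线):

```python
def make_counting_qa(scene_objects):
    """从合成场景的物体清单自动导出计数类 VQA 标注"""
    qa_pairs = []
    for category in sorted(set(scene_objects)):
        count = scene_objects.count(category)
        qa_pairs.append({
            "question": f"How many instances of '{category}' appear in the video?",
            "answer": str(count),
        })
    return qa_pairs

scene = ["car", "car", "person", "dog", "car"]  # 合成器已知的真值物体清单
qa = make_counting_qa(scene)

assert len(qa) == 3            # 每个类别一条问答
assert qa[0]["answer"] == "3"  # sorted 后第一个类别是 "car",共 3 个
```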

[CV-83] HyperLiDAR: Adaptive Post-Deployment LiDAR Segmentation via Hyperdimensional Computing

【速读】:该论文旨在解决LiDAR语义分割模型在边缘设备上部署后因环境变化导致性能下降,且传统基于大型神经网络的模型难以在资源受限条件下进行高效在线适应的问题。解决方案的关键在于提出HyperLiDAR框架,其核心创新是采用超维计算(Hyperdimensional Computing, HDC)构建轻量级、高效率的后部署自适应机制,并通过引入基于信息量的点云缓冲区选择策略,聚焦于最具判别性的数据点以提升学习效率,从而在保证分割精度的同时实现高达13.8倍的重训练速度提升。

链接: https://arxiv.org/abs/2604.12331
作者: Ivannia Gomez Moreno,Yi Yao,Ye Tian,Xiaofan Yu,Flavio Ponzina,Michael Sullivan,Jingyi Zhang,Mingyu Yang,Hun Seok Kim,Tajana Rosing
机构: UCSD (加州大学圣地亚哥分校); UC Merced (加州大学默塞德分校); SDSU (加州州立大学圣地亚哥分校); UMich (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:LiDAR semantic segmentation plays a pivotal role in 3D scene understanding for edge applications such as autonomous driving. However, significant challenges remain for real-world deployments, particularly for on-device post-deployment adaptation. Real-world environments can shift as the system navigates through different locations, leading to substantial performance degradation without effective and timely model adaptation. Furthermore, edge systems operate under strict computational and energy constraints, making it infeasible to adapt conventional segmentation models (based on large neural networks) directly on-device. To address the above challenges, we introduce HyperLiDAR, the first lightweight, post-deployment LiDAR segmentation framework based on Hyperdimensional Computing (HDC). The design of HyperLiDAR fully leverages the fast learning and high efficiency of HDC, inspired by how the human brain processes information. To further improve the adaptation efficiency, we identify the high data volume per scan as a key bottleneck and introduce a buffer selection strategy that focuses learning on the most informative points. We conduct extensive evaluations on two state-of-the-art LiDAR segmentation benchmarks and two representative devices. Our results show that HyperLiDAR outperforms or achieves comparable adaptation performance to state-of-the-art segmentation methods, while achieving up to a 13.8x speedup in retraining.
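超维计算(HDC)的基本流程是:为基本特征分配随机高维双极向量,通过捆绑(bundling)得到样本与类原型的编码,再以相似度完成分类。以下 NumPy 草图示意这一流程(特征名、类别与维度均为虚构,并非 HyperLiDAR 的实际编码方案):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # 超维向量维度

def random_hv():
    """随机双极(+1/-1)超维向量"""
    return rng.choice([-1, 1], size=D)

# 为每个基本特征分配一个固定的随机超维编码
feature_hvs = {name: random_hv() for name in ["flat", "edge", "corner"]}

def encode(features):
    """捆绑(逐元素求和后取符号)一组特征,得到样本的超维表示"""
    return np.sign(sum(feature_hvs[f] for f in features))

# 类原型:此处各用一个代表性样本的编码示意(实际由多样本捆绑得到)
proto_road = encode(["flat", "edge"])
proto_building = encode(["corner", "edge"])

def classify(features):
    """与各类原型做内积相似度,取相似度最大的类别"""
    query = encode(features)
    sims = {"road": query @ proto_road, "building": query @ proto_building}
    return max(sims, key=sims.get)

assert classify(["flat", "edge"]) == "road"
assert classify(["corner", "edge"]) == "building"
```

原型更新只涉及向量加法与符号运算,这正是 HDC 在边缘设备上能做到快速在线适应的原因。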

[CV-84] Self-Adversarial One Step Generation via Condition Shifting

【速读】:该论文旨在解决文本到图像生成(text-to-image synthesis)中单步采样方法面临的三重权衡问题:保真度(fidelity)、推理速度与训练效率之间的矛盾。现有依赖外部判别器的方法虽能提升单步性能,但常引发训练不稳定、显存开销高及收敛慢等问题;而基于回归蒸馏和一致性目标的方法虽易优化,却在单步约束下丢失细节。解决方案的关键在于提出 APEX 框架,其核心理论洞察是:通过条件偏移(condition shifting)从流模型(flow model)中内生提取对抗校正信号(adversarial correction signals),构建一个偏移条件分支,其速度场作为当前生成分布的独立估计量,从而得到理论上与 GAN 对齐的梯度,替代易导致梯度消失的样本依赖判别器项。该设计无需判别器,保持架构兼容性,支持全参数与 LoRA 参数高效微调,在单步质量上超越更大规模模型,并实现显著的推理加速。

链接: https://arxiv.org/abs/2604.12322
作者: Deyuan Liu,Peng Sun,Yansen Han,Zhenglin Cheng,Chuyan Chen,Tao Lin
机构: Westlake University (西湖大学); Zhejiang University (浙江大学); Shanghai Innovation Institute (上海创新研究院); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The push for efficient text to image synthesis has moved the field toward one step sampling, yet existing methods still face a three way tradeoff among fidelity, inference speed, and training efficiency. Approaches that rely on external discriminators can sharpen one step performance, but they often introduce training instability, high GPU memory overhead, and slow convergence, which complicates scaling and parameter efficient tuning. In contrast, regression based distillation and consistency objectives are easier to optimize, but they typically lose fine details when constrained to a single step. We present APEX, built on a key theoretical insight: adversarial correction signals can be extracted endogenously from a flow model through condition shifting. Using a transformation creates a shifted condition branch whose velocity field serves as an independent estimator of the model’s current generation distribution, yielding a gradient that is provably GAN aligned, replacing the sample dependent discriminator terms that cause gradient vanishing. This discriminator free design is architecture preserving, making APEX a plug and play framework compatible with both full parameter and LoRA based tuning. Empirically, our 0.6B model surpasses FLUX-Schnell 12B (20 \times more parameters) in one step quality. With LoRA tuning on Qwen-Image 20B, APEX reaches a GenEval score of 0.89 at NFE=1 in 6 hours, surpassing the original 50-step teacher (0.87) and providing a 15.33 \times inference speedup. Code is available this https URL.

[CV-85] EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports

【速读】:该论文旨在解决当前视频大语言模型(Video-LLMs)在高速、信息密集的虚拟环境中的感知与推理能力评估不足的问题。现有基准测试主要聚焦于日常活动场景,缺乏对虚拟环境中规则驱动的快速推理任务的严谨评测体系。为此,作者提出EgoEsportsQA这一开创性的视频问答(QA)基准,其关键在于构建了一个包含1,745个高质量QA对的数据集,来源于三款第一人称射击游戏的专业比赛,并通过六阶段可扩展流程进行标注;该数据集采用二维解耦分类法,涵盖11个认知能力子任务(感知与推理层级)和6个电子竞技知识子任务,从而系统性地刻画模型在虚拟场景下的表现差异,揭示当前Video-LLM架构在战术推理与微观操作上的显著短板,为优化下游电竞应用及推动Video-LLMs在各类第一人称视角环境中的发展提供依据。

链接: https://arxiv.org/abs/2604.12320
作者: Jianzhe Ma,Zhonghao Cao,Shangkui Chen,Yichen Xu,Wenxuan Wang,Qin Jin
机构: Renmin University of China (中国人民大学); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Work in progress

点击查看摘要

Abstract:While video large language models (Video-LLMs) excel in understanding slow-paced, real-world egocentric videos, their capabilities in high-velocity, information-dense virtual environments remain under-explored. Existing benchmarks focus on daily activities, yet lack a rigorous testbed for evaluating fast, rule-bound reasoning in virtual scenarios. To fill this gap, we introduce EgoEsportsQA, a pioneering video question-answering (QA) benchmark for grounding perception and reasoning in expert esports knowledge. We curate 1,745 high-quality QA pairs from professional matches across 3 first-person shooter games via a scalable six-stage pipeline. These questions are structured into a two-dimensional decoupled taxonomy: 11 sub-tasks in the cognitive capability dimension (covering perception and reasoning levels) and 6 sub-tasks in the esports knowledge dimension. Comprehensive evaluations of state-of-the-art Video-LLMs reveal that current models still fail to achieve satisfactory performance, with the best model only 71.58%. The results expose notable gaps across both axes: models exhibit stronger capabilities in basic visual perception than in deep tactical reasoning, and they grasp overall macro-progression better than fine-grained micro-operations. Extensive ablation experiments demonstrate the intrinsic weaknesses of current Video-LLM architectures. Further analysis suggests that our dataset not only reveals the connections between real-world and virtual egocentric domains, but also offers guidance for optimizing downstream esports applications, thereby fostering the future advancement of Video-LLMs in various egocentric environments.

[CV-86] RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation

【速读】:该论文旨在解决多模态语义分割中因假设各模态可靠性相同而导致的特征退化问题,尤其是在辅助模态存在噪声、错位或不完整时。解决方案的关键在于提出一种新颖的可靠性感知自门控状态空间模型(Reliability-aware Self-Gated State Space Model, RSGMamba),其核心组件为可靠性感知自门控Mamba块(RSGMB),通过显式建模各模态的可靠性并利用自门控机制动态调节跨模态交互,实现可靠性感知的特征选择与信息增强;同时引入轻量级局部交叉门控调制模块(LCGM)以细化空间细节,从而在保持低参数量(48.6M)的前提下显著提升RGB-D和RGB-T语义分割性能。

链接: https://arxiv.org/abs/2604.12319
作者: Guoan Xu,Yang Xiao,Guangwei Gao,Dongchen Zhu,Wenjing Jia,Guo-Jun Qi
机构: University of Technology Sydney (悉尼科技大学); Nanjing University of Science and Technology (南京理工大学); Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences (中国科学院上海微系统与信息技术研究所); Westlake University (西湖大学); OPPO Research (OPPO 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 tables, 9 figures

点击查看摘要

Abstract:Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sensing modalities (e.g., RGB, depth, and thermal). However, existing cross-modal fusion methods often implicitly assume that all modalities are equally reliable, which can lead to feature degradation when auxiliary modalities are noisy, misaligned, or incomplete. In this paper, we revisit cross-modal fusion from the perspective of modality reliability and propose a novel framework termed the Reliability-aware Self-Gated State Space Model (RSGMamba). At the core of our method is the Reliability-aware Self-Gated Mamba Block (RSGMB), which explicitly models modality reliability and dynamically regulates cross-modal interactions through a self-gating mechanism. Unlike conventional fusion strategies that indiscriminately exchange information across modalities, RSGMB enables reliability-aware feature selection and enhances informative feature aggregation. In addition, a lightweight Local Cross-Gated Modulation (LCGM) is incorporated to refine fine-grained spatial details, complementing the global modeling capability of RSGMB. Extensive experiments demonstrate that RSGMamba achieves state-of-the-art performance on both RGB-D and RGB-T semantic segmentation benchmarks, yielding 58.8% / 54.0% mIoU on NYUDepth V2 and SUN-RGBD (+0.4% / +0.7% over prior best), and 61.1% / 88.9% mIoU on MFNet and PST900 (up to +1.6%), with only 48.6M parameters, thereby validating the effectiveness and superiority of the proposed approach.
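摘要中"自门控机制按模态可靠性调节跨模态交互"的思想,可用如下 NumPy 草图示意:由辅助模态特征预测一个落在 (0,1) 区间的逐通道门控,再按门控权重注入 RGB 分支(结构与参数均为示意性假设,并非 RSGMB 的完整设计):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reliability_gated_fusion(rgb_feat, aux_feat, w, b):
    """由辅助模态特征估计逐通道可靠性门控 (0,1),按门控权重注入 RGB 分支"""
    gate = sigmoid(aux_feat @ w + b)
    return rgb_feat + gate * aux_feat, gate

rng = np.random.default_rng(0)
c = 16
rgb, aux = rng.standard_normal(c), rng.standard_normal(c)
w, b = rng.standard_normal((c, c)) * 0.1, np.zeros(c)

fused, gate = reliability_gated_fusion(rgb, aux, w, b)
assert np.all((gate > 0) & (gate < 1))

# 门控趋近 0 时,不可靠的辅助模态被抑制,输出退化为 RGB 特征
fused_off, _ = reliability_gated_fusion(rgb, aux, w, b - 50.0)
assert np.allclose(fused_off, rgb)
```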

[CV-87] Cell Instance Segmentation via Multi-Task Image-to-Image Schrödinger Bridge

【速读】:该论文旨在解决现有细胞实例分割(cell instance segmentation)方法中因依赖确定性预测与后处理步骤而导致全局结构约束不足的问题。其解决方案的关键在于提出一种基于多任务图像到图像的薛定谔桥(Schrödinger Bridge)框架,将实例分割建模为基于分布的图像到图像生成问题,并通过反向距离图(reverse distance map)引入边界感知监督,同时采用确定性推理以保证预测稳定性。该方法在无需SAM预训练或额外后处理的情况下,在PanNuke和MoNuSeg数据集上均展现出优异且鲁棒的性能。

链接: https://arxiv.org/abs/2604.12318
作者: Hayato Inoue,Shota Harada,Shumpei Takezaki,Ryoma Bise
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing cell instance segmentation pipelines typically combine deterministic predictions with post-processing, which imposes limited explicit constraints on the global structure of instance masks. In this work, we propose a multi-task image-to-image Schrödinger Bridge framework that formulates instance segmentation as a distribution-based image-to-image generation problem. Boundary-aware supervision is integrated through a reverse distance map, and deterministic inference is employed to produce stable predictions. Experimental results on the PanNuke dataset demonstrate that the proposed method achieves competitive or superior performance without relying on SAM pre-training or additional post-processing. Additional results on the MoNuSeg dataset show robustness under limited training data. These findings indicate that Schrödinger Bridge-based image-to-image generation provides an effective framework for cell instance segmentation.
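摘要中的反向距离图(reverse distance map)用于引入边界感知监督。下面给出一个纯 Python/NumPy 的玩具实现:前景像素取 1 - d/d_max(此处用曼哈顿距离暴力近似,实际实现通常采用欧氏距离变换;具体定义为示意性假设):

```python
import numpy as np

def reverse_distance_map(mask):
    """由二值实例掩码计算反向距离图:前景取 1 - d/d_max,
    d 为到最近背景像素的距离(此处用曼哈顿距离暴力近似,仅作示意),
    取值向实例边界递增、在实例中心最小"""
    h, w = mask.shape
    bg = [(i, j) for i in range(h) for j in range(w) if not mask[i, j]]
    dist = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            if mask[i, j]:
                dist[i, j] = min(abs(i - bi) + abs(j - bj) for bi, bj in bg)
    d_max = dist.max()
    return np.where(mask, 1.0 - dist / d_max, 0.0)

mask = np.zeros((5, 5), dtype=bool)
mask[1:4, 1:4] = True  # 一个 3x3 的前景实例

rdm = reverse_distance_map(mask)
assert rdm[2, 2] == 0.0   # 实例中心距背景最远
assert rdm[1, 1] == 0.5   # 边界像素取值更大
assert rdm[0, 0] == 0.0   # 背景为 0
```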

[CV-88] GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality

【速读】:该论文旨在解决复杂山地梯田地块(terraced parcel)提取难题,该问题因地形起伏大、边界不规则及跨区域异质性强而具有挑战性,现有公开基准多聚焦于规则平坦农田场景,缺乏对多模态数据融合的系统评估框架。解决方案的关键在于构建首个面向全球梯田地块提取的多模态基准GTPBD-MM,其整合高分辨率光学影像、结构化文本描述与数字高程模型(DEM)数据,并支持图像仅、图像+文本、图像+文本+DEM三种设置下的系统评测;同时提出Elevation and Text guided Terraced parcel network (ETTerra) 多模态基线模型,实验证明文本语义与地形几何信息可提供超越视觉表观的互补线索,显著提升复杂梯田场景下地块边界提取的准确性、连贯性与结构一致性。

链接: https://arxiv.org/abs/2604.12315
作者: Zhiwei Zhang,Xingyuan Zeng,Xinkai Kong,Kunquan Zhang,Haoyuan Liang,Bohan Shi,Juepeng Zheng,Jianxi Huang,Yutong Lu,Haohuan Fu
机构: Sun Yat-sen University (中山大学); Tsinghua Shenzhen International Graduate School (清华大学深圳国际研究生院); China Agricultural University (中国农业大学); Southwest Jiaotong University (西南交通大学); Northeastern University (东北大学); National Supercomputing Center in Shenzhen (深圳市国家超级计算中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 15 pages, 11 figures. Submitted to ACM Multimedia 2026 Dataset Track

点击查看摘要

Abstract:Agricultural parcel extraction plays an important role in remote sensing-based agricultural monitoring, supporting parcel surveying, precision management, and ecological assessment. However, existing public benchmarks mainly focus on regular and relatively flat farmland scenes. In contrast, terraced parcels in mountainous regions exhibit stepped terrain, pronounced elevation variation, irregular boundaries, and strong cross-regional heterogeneity, making parcel extraction a more challenging problem that jointly requires visual recognition, semantic discrimination, and terrain-aware geometric understanding. Although recent studies have advanced visual parcel benchmarks and image-text farmland understanding, a unified benchmark for complex terraced parcel extraction under aligned image-text-DEM settings remains absent. To fill this gap, we present GTPBD-MM, the first multimodal benchmark for global terraced parcel extraction. Built upon GTPBD, GTPBD-MM integrates high-resolution optical imagery, structured text descriptions, and DEM data, and supports systematic evaluation under Image-only, Image+Text, and Image+Text+DEM settings. We further propose Elevation and Text guided Terraced parcel network (ETTerra), a multimodal baseline for terraced parcel delineation. Extensive experiments demonstrate that textual semantics and terrain geometry provide complementary cues beyond visual appearance alone, yielding more accurate, coherent, and structurally consistent delineation results in complex terraced scenes.

[CV-89] owards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors CVPR2026

【速读】:该论文旨在解决从单张图像生成几何上真实且视图一致的轨道视频(orbital videos)的问题,尤其针对现有方法在长距离外推(如背面视角合成)时因像素对应关系有限而导致结构不一致或失真的缺陷。其解决方案的关键在于引入来自3D基础生成模型(3D foundational generative model)的丰富形状先验作为辅助约束,具体通过两个尺度的潜在特征实现:(i) 一个去噪后的全局潜在向量提供整体结构引导,(ii) 一组由体素特征投影得到的潜在图像提供视图依赖的细粒度几何细节。该方法利用多尺度3D适配器通过交叉注意力机制将这些特征注入基础视频模型,从而在保留通用视频预训练能力的同时,实现高效、模型无关的微调,并显著提升视觉质量、形状真实性和多视角一致性。

链接: https://arxiv.org/abs/2604.12309
作者: Rong Wang,Ruyi Zha,Ziang Cheng,Jiayu Yang,Pulak Purkait,Hongdong Li
机构: Australian National University (澳大利亚国立大学); Amazon (亚马逊)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such a mechanism does not impose sufficient constraints for long-range extrapolation, e.g. rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised global latent vector as an overall structural guidance, and (ii) a set of latent images projected from volumetric features to provide view-dependent and fine-grained geometry details. In contrast to commonly used 2.5D representations such as depth or normal maps, these compact features can model complete object shapes, and help to improve inference efficiency by avoiding explicit mesh extraction. To achieve effective shape conditioning, we introduce a multi-scale 3D adapter to inject feature tokens to the base video model via cross-attention, which retains its capabilities from general video pretraining and enables a simple and model-agnostic fine-tuning process. Extensive experiments on multiple benchmarks show that our method achieves superior visual quality, shape realism and multi-view consistency compared to state-of-the-art methods, and robustly generalizes to complex camera trajectories and in-the-wild images.

[CV-90] Boosting Robust AIGI Detection with LoRA-based Pairwise Training CVPR

【速读】:该论文旨在解决生成式 AI (Generative AI) 图像在真实场景(in the wild)中因不可预测的复杂失真而导致检测性能显著下降的问题。解决方案的关键在于提出一种基于低秩适配(LoRA)的成对训练(Pairwise Training, LPT)策略:通过针对性微调视觉基础模型、在训练阶段模拟验证与测试集的数据分布(包括失真和尺寸变化),并引入成对训练机制以解耦泛化能力与鲁棒性优化,从而提升模型在严重失真条件下的检测稳定性与准确性。

链接: https://arxiv.org/abs/2604.12307
作者: Ruiyang Xia,Qi Zhang,Yaowen Xu,Zhaofan Zou,Hao Sun,Zhongjiang He,Xuelong Li
机构: Institute of Artificial Intelligence (TeleAI), China Telecom; Xidian University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 3rd place (3/514) technical report (CVPRW-26) at the NTIRE 2026: Robust AI-Generated Image Detection in the Wild Challenge

点击查看摘要

Abstract:The proliferation of highly realistic AI-Generated Images (AIGI) has necessitated the development of practical detection methods. While current AIGI detectors perform admirably on clean datasets, their detection performance frequently decreases when deployed “in the wild”, where images are subjected to unpredictable, complex distortions. To resolve this critical vulnerability, we propose a novel LoRA-based Pairwise Training (LPT) strategy designed specifically to achieve robust detection for AIGI under severe distortions. The core of our strategy involves the targeted finetuning of a visual foundation model, the deliberate simulation of data distribution during the training phase, and a unique pairwise training process. Specifically, we introduce distortion and size simulations to better fit the distribution from the validation and test sets. Based on the strong visual representation capability of the visual foundation model, we finetune the model to achieve AIGI detection. The pairwise training is utilized to improve the detection via decoupling the generalization and robustness optimization. Experiments show that our approach secured 3rd place in the NTIRE Robust AI-Generated Image Detection in the Wild challenge.
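摘要中"在训练阶段刻意模拟验证/测试集的失真与尺寸分布"这一步,可以用如下 NumPy 草图示意(仅包含加性高斯噪声与下采样-上采样两种退化;实际比赛流水线的退化种类与参数未知,此处均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_distortions(img, noise_sigma=0.05, scale=2):
    """模拟部署场景退化:加性高斯噪声 + 下采样再最近邻上采样"""
    noisy = img + rng.normal(0.0, noise_sigma, img.shape)
    down = noisy[::scale, ::scale]                       # 下采样丢失细节
    up = np.repeat(np.repeat(down, scale, 0), scale, 1)  # 最近邻上采样还原尺寸
    return np.clip(up, 0.0, 1.0)

img = rng.random((8, 8))
distorted = simulate_distortions(img)

assert distorted.shape == img.shape
assert distorted.min() >= 0.0 and distorted.max() <= 1.0
assert not np.allclose(distorted, img)  # 确实引入了退化
```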

[CV-91] CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing

【速读】:该论文旨在解决电影配音(movie dubbing)中难以实现精确唇音同步(lip-sync)且自然度不足的问题,尤其针对现有方法在时长层面显式对齐导致的语音质量下降,以及隐式对齐方案易受参考音频干扰、在真实场景下出现音色和发音退化的问题。其解决方案的关键在于提出一种基于流匹配(flow matching)的框架,核心为认知同步扩散变换器(Cognitive Synchronous Diffusion Transformer, CoSync-DiT),该架构通过执行声学风格适配、细粒度视觉校准和时间感知上下文对齐三个阶段,逐步引导从噪声到语音的生成轨迹;同时设计联合语义与对齐正则化机制(Joint Semantic and Alignment Regularization, JSAR),在帧级时间一致性和流隐藏状态的语义一致性上施加双重约束,从而确保鲁棒的跨模态对齐效果。

链接: https://arxiv.org/abs/2604.12292
作者: Gaoxiang Cong,Liang Li,Jiaxin Ye,Zhedong Zhang,Hongming Shan,Yuankai Qi,Qingming Huang
机构: 未知
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Movie dubbing aims to synthesize speech that preserves the vocal identity of a reference audio while synchronizing with the lip movements in a target video. Existing methods fail to achieve precise lip-sync and lack naturalness due to explicit alignment at the duration level. While implicit alignment solutions have emerged, they remain susceptible to interference from the reference audio, triggering timbre and pronunciation degradation in in-the-wild scenarios. In this paper, we propose a novel flow matching-based movie dubbing framework driven by the Cognitive Synchronous Diffusion Transformer (CoSync-DiT), inspired by the cognitive process of professional actors. This architecture progressively guides the noise-to-speech generative trajectory by executing acoustic style adapting, fine-grained visual calibrating, and time-aware context aligning. Furthermore, we design the Joint Semantic and Alignment Regularization (JSAR) mechanism to simultaneously constrain frame-level temporal consistency on the contextual outputs and semantic consistency on the flow hidden states, ensuring robust alignment. Extensive experiments on both standard benchmarks and challenging in-the-wild dubbing benchmarks demonstrate that our method achieves the state-of-the-art performance across multiple metrics.

[CV-92] LiveMoments: Reselected Key Photo Restoration in Live Photos via Reference-guided Diffusion ICLR2026

【速读】:该论文旨在解决Live Photo中重新选择的关键帧(reselected key photo)因视频捕获图像信号处理(ISP)流水线质量较低而导致的明显画质退化问题。现有方法难以有效恢复此类帧的细节与清晰度,尤其在快速运动或复杂结构场景下表现不佳。解决方案的关键在于提出名为LiveMoments的参考引导图像修复框架,其核心是一个双分支神经网络结构:一个参考分支从原始高质量关键帧中提取结构和纹理信息,另一主分支利用该参考信息对重选帧进行修复;同时引入统一的运动对齐模块(Unified Motion Alignment module),在潜在空间和图像空间两个层面融合运动引导信息以实现精准的空间对齐,从而显著提升修复后的感知质量和保真度。

链接: https://arxiv.org/abs/2604.12286
作者: Clara Xue,Zizheng Yan,Zhenning Shi,Yuhang Yu,Jingyu Zhuang,Qi Zhang,Jinwei Chen,Qingnan Fan
机构: vivo BlueImage Lab; Nankai University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Live Photo captures both a high-quality key photo and a short video clip to preserve the precious dynamics around the captured moment. While users may choose alternative frames as the key photo to capture better expressions or timing, these frames often exhibit noticeable quality degradation, as the photo capture ISP pipeline delivers significantly higher image quality than the video pipeline. This quality gap highlights the need for dedicated restoration techniques to enhance the reselected key photo. To this end, we propose LiveMoments, a reference-guided image restoration framework tailored for the reselected key photo in Live Photos. Our method employs a two-branch neural network: a reference branch that extracts structural and textural information from the original high-quality key photo, and a main branch that restores the reselected frame using the guidance provided by the reference branch. Furthermore, we introduce a unified Motion Alignment module that incorporates motion guidance for spatial alignment at both the latent and image levels. Experiments on real and synthetic Live Photos demonstrate that LiveMoments significantly improves perceptual quality and fidelity over existing solutions, especially in scenes with fast motion or complex structures. Our code is available at this https URL.

[CV-93] MAST: Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer

【速读】:该论文旨在解决多风格迁移(multi-style transfer)中因多个风格表示相互干扰而导致的边界伪影、风格不稳定和结构不一致等问题。现有基于扩散模型的方法通常假设单一全局风格,难以在多风格场景下保持内容结构的完整性与纹理保真度。其解决方案的关键在于提出一种无需训练的框架MAST(Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer),通过在扩散注意力机制内显式控制内容与风格的交互:首先利用布局保持查询锚定(Layout-preserving Query Anchoring)防止全局布局坍塌;其次采用逻辑层注意力质量分配(Logit-level Attention Mass Allocation)实现无边界伪影的多风格融合;再通过锐度感知温度缩放(Sharpness-aware Temperature Scaling)恢复因多风格扩展而损失的注意力锐度;最后借助差异感知细节注入(Discrepancy-aware Detail Injection)自适应补偿局部高频细节丢失,从而在不引入额外训练的前提下显著提升多风格迁移的结构一致性与视觉质量。

链接: https://arxiv.org/abs/2604.12281
作者: Dongkyung Kang,Jaeyeon Hwang,Junseo Park,Minji Kang,Yeryeong Lee,Beomseok Ko,Hanyoung Roh,Jeongmin Shin,Hyeryung Jang
机构: Dongguk University (东国大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 16 figures, 6 tables

点击查看摘要

Abstract:Style transfer aims to render a content image with the visual characteristics of a reference style while preserving its underlying semantic layout and structural geometry. While recent diffusion-based models demonstrate strong stylization capabilities by leveraging powerful generative priors and controllable internal representations, they typically assume a single global style. Extending them to multi-style scenarios often leads to boundary artifacts, unstable stylization, and structural inconsistency due to interference between multiple style representations. To overcome these limitations, we propose MAST (Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer), a novel training-free framework that explicitly controls content-style interactions within the diffusion attention mechanism. To achieve artifact-free and structure-preserving stylization, MAST integrates four connected modules. First, Layout-preserving Query Anchoring prevents global layout collapse by firmly anchoring the semantic structure using content queries. Second, Logit-level Attention Mass Allocation deterministically distributes attention probability mass across spatial regions, seamlessly fusing multiple styles without boundary artifacts. Third, Sharpness-aware Temperature Scaling restores the attention sharpness degraded by multi-style expansion. Finally, Discrepancy-aware Detail Injection adaptively compensates for localized high-frequency detail losses by measuring structural discrepancies. Extensive experiments demonstrate that MAST effectively mitigates boundary artifacts and maintains structural consistency, preserving texture fidelity and spatial coherence even as the number of applied styles increases.
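
The logit-level attention mass allocation at the heart of MAST can be pictured with a minimal sketch. This is an illustration under our own assumptions (shapes, the renormalization scheme, and fixed mass fractions are not from the paper): each style region's softmax mass is rescaled to a prescribed fraction, and a temperature below 1 stands in for the sharpness-restoration step.

```python
import numpy as np

def allocate_attention_mass(logits, region_masks, mass_fractions, temperature=0.5):
    """Toy sketch of logit-level attention mass allocation.

    For each query row, softmax probabilities are renormalized so that each
    spatial region (e.g. one per style mask) receives a fixed fraction of the
    total attention mass; temperature < 1 mimics sharpness restoration.
    """
    probs = np.exp(logits / temperature)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    out = np.zeros_like(probs)
    for mask, frac in zip(region_masks, mass_fractions):
        region = probs * mask                                # mass inside this region
        region_sum = region.sum(axis=-1, keepdims=True) + 1e-9
        out = out + frac * region / region_sum               # rescale to allocated mass
    return out
```

By construction every query's attention still sums to 1, while each region receives exactly its allocated share, which is the property the paper credits with removing boundary artifacts between styles.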

[CV-94] SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation

【速读】:该论文旨在解决当前基于流匹配(Flow Matching)的单步生成模型中存在的多样性退化问题,即模型在加速推理过程中倾向于集中于高密度模式,而忽略低密度但有效的目标分布子模式,导致生成样本的多样性显著下降。其解决方案的关键在于提出SubFlow——一种基于子模式条件化的流匹配方法,通过语义聚类将每个类别分解为细粒度的子模式,并以子模式索引作为条件来训练流模型;由于每个子分布近似为单峰,从而避免了传统均方误差(MSE)训练中因平均失真(averaging distortion)导致的模式偏倚,实现了无平均失真的精确模式覆盖,且无需修改现有架构即可无缝集成至MeanFlow等单步生成框架中。

链接: https://arxiv.org/abs/2604.12273
作者: Yexiong Lin,Jia Shi,Shanshan Ye,Wanyu Wang,Yu Yao,Tongliang Liu
机构: Sydney AI Centre, The University of Sydney (悉尼人工智能中心,悉尼大学); School of Artificial Intelligence, Xidian University (西安电子科技大学人工智能学院); Australian Artificial Intelligence Institute, University of Technology Sydney (澳大利亚人工智能研究所,悉尼科技大学); Department of Information Systems, City University of Hong Kong (香港城市大学信息系统系)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Flow matching has emerged as a powerful generative framework, with recent few-step methods achieving remarkable inference acceleration. However, we identify a critical yet overlooked limitation: these models suffer from severe diversity degradation, concentrating samples on dominant modes while neglecting rare but valid variations of the target distribution. We trace this degradation to averaging distortion: when trained with MSE objectives, class-conditional flows learn a frequency-weighted mean over intra-class sub-modes, causing the model to over-represent high-density modes while systematically neglecting low-density ones. To address this, we propose SubFlow, Sub-mode Conditioned Flow Matching, which eliminates averaging distortion by decomposing each class into fine-grained sub-modes via semantic clustering and conditioning the flow on sub-mode indices. Each conditioned sub-distribution is approximately unimodal, so the learned flow accurately targets individual modes with no averaging distortion, restoring full mode coverage in a single inference step. Crucially, SubFlow is entirely plug-and-play: it integrates seamlessly into existing one-step models such as MeanFlow and Shortcut Models without any architectural modifications. Extensive experiments on ImageNet-256 demonstrate that SubFlow yields substantial gains in generation diversity (Recall) while maintaining competitive image quality (FID), confirming its broad applicability across different one-step generation frameworks. Project page: this https URL.
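
The sub-mode decomposition step can be sketched with a plain k-means over the semantic features of one class; the resulting cluster index is what SubFlow would feed to the flow model as an extra condition. The farthest-point initialization and the feature source are our assumptions for illustration, not the paper's exact clustering recipe.

```python
import numpy as np

def assign_sub_modes(features, k=4, iters=10):
    """Assign each sample of one class to a sub-mode via k-means.

    Centers are seeded with farthest-point initialization (deterministic),
    then refined by Lloyd iterations. The returned label per sample is the
    sub-mode index used as a conditioning signal.
    """
    centers = [features[0]]
    for _ in range(k - 1):  # farthest-point seeding
        d = np.min([((features - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(features[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):  # Lloyd iterations
        d = ((features[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            pts = features[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return labels
```

Because each resulting sub-distribution is approximately unimodal, an MSE-trained flow conditioned on this index no longer averages over distinct modes.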

[CV-95] DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos

【速读】:该论文旨在解决立体视频修复(stereo video inpainting)中的两个核心问题:一是由于高质量立体修复数据集稀缺,导致现有方法难以学习有效的修复先验;二是现有方法对图像中所有区域均等处理,而实际需修复的区域仅占少数且分散在物体边界处,造成大量冗余计算。解决方案的关键在于提出三个相互关联的组件:首先,Gradient-Aware Parallax Warping (GAPW) 利用反向映射和坐标变换梯度获得连续边缘与平滑遮挡区域;其次,Parallax-Based Dual Projection (PBDP) 策略结合 GAPW 生成几何一致的立体修复对及准确遮挡掩码,无需依赖真实立体视频;最后,Sparsity-Aware Stereo Inpainting (SASI) 通过稀疏化处理减少超过 70% 的冗余 token,在扩散推理阶段实现 10.7 倍加速,并在单张 A100 GPU 上实现 HD 分辨率(768×1280)视频的实时处理(25 FPS)。

链接: https://arxiv.org/abs/2604.12270
作者: Yuan Huang,Sijie Zhao,Jing Cheng,Hao Xu,Shaohui Jiao
机构: ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Stereo video inpainting, which aims to fill the occluded regions of warped videos with visually coherent content while maintaining temporal consistency, remains a challenging open problem. The regions to be filled are scattered along object boundaries and occupy only a small fraction of each frame, leading to two key challenges. First, existing approaches perform poorly on such tasks due to the scarcity of high-quality stereo inpainting datasets, which limits their ability to learn effective inpainting priors. Second, these methods apply equal processing to all regions of the frame, even though most pixels require no modification, resulting in substantial redundant computation. To address these issues, we introduce three interconnected components. We first propose Gradient-Aware Parallax Warping (GAPW), which leverages backward warping and the gradient of the coordinate mapping function to obtain continuous edges and smooth occlusion regions. Then, a Parallax-Based Dual Projection (PBDP) strategy is introduced, which incorporates GAPW to produce geometrically consistent stereo inpainting pairs and accurate occlusion masks without requiring stereo videos. Finally, we present Sparsity-Aware Stereo Inpainting (SASI), which reduces over 70% of redundant tokens, achieving a 10.7x speedup during diffusion inference and delivering results comparable to its full-computation counterpart, enabling real-time processing of HD (768 x 1280) videos at 25 FPS on a single A100 GPU.
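
The backward-warping step underlying GAPW can be illustrated in one dimension: each target pixel samples the source row at `x + disparity(x)` with linear interpolation. This sketch shows only the sampling itself and omits the gradient-based edge and occlusion handling that GAPW adds on top.

```python
import numpy as np

def backward_warp_1d(row, disparity):
    """Minimal 1-D parallax backward warp with linear interpolation.

    Each target position x reads the source at x + disparity(x); fractional
    coordinates are resolved by blending the two neighboring samples.
    """
    x = np.arange(len(row), dtype=float) + disparity
    x = np.clip(x, 0, len(row) - 1)          # clamp at image borders
    x0 = np.floor(x).astype(int)
    x1 = np.minimum(x0 + 1, len(row) - 1)
    w = x - x0                                # fractional blend weight
    return (1 - w) * row[x0] + w * row[x1]
```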

[CV-96] Style-Decoupled Adaptive Routing Network for Underwater Image Enhancement

【速读】:该论文旨在解决水下图像增强(Underwater Image Enhancement, UIE)中现有方法因采用统一映射策略而导致的适应性不足问题,即对轻微退化的图像过度处理、对严重退化图像恢复效果有限。其解决方案的关键在于提出一种自适应增强框架SDAR-Net,通过解耦输入图像中的特定退化风格(degradation style)与静态场景结构表示,并引入自适应路由机制——该机制基于风格特征动态预测不同增强状态下的软权重,实现图像表征的加权融合,从而精准满足每张图像的自适应恢复需求。

链接: https://arxiv.org/abs/2604.12257
作者: Hang Xu,Chen Long,Bing Wang,Hao Chen,Zhen Dong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Underwater Image Enhancement (UIE) is essential for robust visual perception in marine applications. However, existing methods predominantly rely on uniform mapping tailored to average dataset distributions, leading to over-processing mildly degraded images or insufficient recovery for severe ones. To address this challenge, we propose a novel adaptive enhancement framework, SDAR-Net. Unlike existing uniform paradigms, it first decouples specific degradation styles from the input and subsequently modulates the enhancement process adaptively. Specifically, since underwater degradation primarily shifts the appearance while keeping the scene structure, SDAR-Net formulates image features into dynamic degradation style embeddings and static scene structural representations through a carefully designed training framework. Subsequently, we introduce an adaptive routing mechanism. By evaluating style features and adaptively predicting soft weights at different enhancement states, it guides the weighted fusion of the corresponding image representations, accurately satisfying the adaptive restoration demands of each image. Extensive experiments show that SDAR-Net achieves a new state-of-the-art (SOTA) performance with a PSNR of 25.72 dB on real-world benchmark, and demonstrates its utility in downstream vision tasks. Our code is available at this https URL.
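
The adaptive routing mechanism can be sketched as a soft-weighted fusion: style features are projected to one logit per enhancement state, a softmax yields the routing weights, and the corresponding state representations are blended accordingly. The projection matrix `W` and all shapes below are illustrative assumptions, not SDAR-Net's actual layers.

```python
import numpy as np

def adaptive_route(style_vec, state_reprs, W):
    """Soft routing sketch: style embedding -> softmax weights -> fusion.

    style_vec:   (d,) degradation-style embedding
    state_reprs: (num_states, r) representations at each enhancement state
    W:           (d, num_states) hypothetical routing projection
    """
    logits = style_vec @ W
    w = np.exp(logits - logits.max())
    w = w / w.sum()                               # soft routing weights
    fused = np.tensordot(w, state_reprs, axes=1)  # weighted fusion
    return fused, w
```

A mildly degraded image would place most weight on an early enhancement state, while a severely degraded one shifts weight to stronger states, which is the per-image adaptivity the paper argues uniform mappings lack.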

[CV-97] ARGen: Affect-Reinforced Generative Augmentation towards Vision-based Dynamic Emotion Perception

【速读】:该论文旨在解决野外环境下动态面部表情识别(Dynamic Facial Expression Recognition)因数据稀缺和长尾分布导致模型难以有效学习稀有情绪时序动态的问题。解决方案的关键在于提出ARGen框架,其核心创新包括两个阶段:首先通过情感语义注入(Affective Semantic Injection, ASI)利用面部动作单元(Action Units)对齐情感知识,并借助检索增强提示生成策略从大规模视觉-语言模型中合成一致且细粒度的情感描述,从而将可解释的情感先验注入生成过程;其次通过自适应强化扩散(Adaptive Reinforcement Diffusion, ARD)结合文本条件图像到视频扩散模型与强化学习,引入帧间条件引导和多目标奖励函数,协同优化表情自然性、面部完整性与生成效率,从而实现鲁棒的情绪感知与高质量的动态表情生成。

链接: https://arxiv.org/abs/2604.12255
作者: Huanzhen Wang,Ziheng Zhou,Jiaqi Song,Li He,Yunshi Lan,Yan Wang,Wenqiang Zhang
机构: Fudan University (复旦大学); East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dynamic facial expression recognition in the wild remains challenging due to data scarcity and long-tail distributions, which hinder models from effectively learning the temporal dynamics of scarce emotions. To address these limitations, we propose ARGen, an Affect-Reinforced Generative Augmentation Framework that enables data-adaptive dynamic expression generation for robust emotion perception. ARGen operates in two stages: Affective Semantic Injection (ASI) and Adaptive Reinforcement Diffusion (ARD). The ASI stage establishes affective knowledge alignment through facial Action Units and employs a retrieval-augmented prompt generation strategy to synthesize consistent and fine-grained affective descriptions via large-scale visual-language models, thereby injecting interpretable emotional priors into the generation process. The ARD stage integrates text-conditioned image-to-video diffusion with reinforcement learning, introducing inter-frame conditional guidance and a multi-objective reward function to jointly optimize expression naturalness, facial integrity, and generative efficiency. Extensive experiments on both generation and recognition tasks verify that ARGen substantially enhances synthesis fidelity and improves recognition performance, establishing an interpretable and generalizable generative augmentation paradigm for vision-based affective computing.

[CV-98] ArtifactWorld: Scaling 3D Gaussian Splatting Artifact Restoration via Video Generation Models

【速读】:该论文旨在解决3D Gaussian Splatting (3DGS) 在稀疏视角条件下产生的几何与光度退化问题,现有生成式修复方法常因时间一致性不足、缺乏显式空间约束及训练数据规模有限,导致多视图不一致、几何幻觉错误以及对真实世界退化分布的泛化能力差。其解决方案的关键在于提出ArtifactWorld框架,通过两个核心创新实现:一是建立细粒度的3DGS退化现象学分类体系,并构建包含107.5K组配对视频片段的大规模训练数据集以提升模型鲁棒性;二是设计统一的视频扩散主干网络与同构预测器(isomorphic predictor),利用退化热力图定位结构缺陷,并通过Artifact-Aware Triplet Fusion机制在原生自注意力模块中实现强度引导的时空精准修复,从而显著提升稀疏视图合成与三维重建的性能。

链接: https://arxiv.org/abs/2604.12251
作者: Xinliang Wang,Yifeng Shi,Zhenyu Wu
机构: Ke Holdings Inc.(贝壳控股有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The second author is the corresponding author

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) delivers high-fidelity real-time rendering but suffers from geometric and photometric degradations under sparse-view constraints. Current generative restoration approaches are often limited by insufficient temporal coherence, a lack of explicit spatial constraints, and a lack of large-scale training data, resulting in multi-view inconsistencies, erroneous geometric hallucinations, and limited generalization to diverse real-world artifact distributions. In this paper, we present ArtifactWorld, a framework that resolves 3DGS artifact repair through systematic data expansion and a homogeneous dual-model paradigm. To address the data bottleneck, we establish a fine-grained phenomenological taxonomy of 3DGS artifacts and construct a comprehensive training set of 107.5K diverse paired video clips to enhance model robustness. Architecturally, we unify the restoration process within a video diffusion backbone, utilizing an isomorphic predictor to localize structural defects via an artifact heatmap. This heatmap then guides the restoration through an Artifact-Aware Triplet Fusion mechanism, enabling precise, intensity-guided spatio-temporal repair within native self-attention. Extensive experiments demonstrate that ArtifactWorld achieves state-of-the-art performance in sparse novel view synthesis and robust 3D reconstruction. Code and dataset will be made public.

[CV-99] Socrates Loss: Unifying Confidence Calibration and Classification by Leveraging the Unknown

【速读】:该论文旨在解决深度神经网络在高风险应用中因置信度校准(confidence calibration)不佳而导致的可靠性问题。现有方法通常在训练过程中采用启发式策略进行校准,但面临稳定性与性能之间的权衡:两阶段训练方法虽能提升分类性能,却导致训练不稳定且校准效果差;而单损失方法虽然稳定,但在分类任务上表现不足。其解决方案的关键在于提出一种名为Socrates Loss的统一损失函数,该函数通过引入一个辅助的未知类别(auxiliary unknown class),使模型预测直接作用于损失函数并触发动态不确定性惩罚项,从而在优化过程中同时兼顾分类准确性和置信度校准,避免了复杂调度损失带来的不稳定性,并提供了理论保障以防止校准偏差和过拟合。

链接: https://arxiv.org/abs/2604.12245
作者: Sandra Gómez-Gálvez,Tobias Olenyi,Gillian Dobbie,Katerina Taškova
机构: University of Auckland (奥克兰大学); Technical University of Munich (慕尼黑工业大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注: Published at TMLR 2026. this https URL Video: this https URL Code: this https URL

点击查看摘要

Abstract:Deep neural networks, despite their high accuracy, often exhibit poor confidence calibration, limiting their reliability in high-stakes applications. Current ad-hoc confidence calibration methods attempt to fix this during training but face a fundamental trade-off: two-phase training methods achieve strong classification performance at the cost of training instability and poorer confidence calibration, while single-loss methods are stable but underperform in classification. This paper addresses and mitigates this stability-performance trade-off. We propose Socrates Loss, a novel, unified loss function that explicitly leverages uncertainty by incorporating an auxiliary unknown class, whose predictions directly influence the loss function and a dynamic uncertainty penalty. This unified objective allows the model to be optimized for both classification and confidence calibration simultaneously, without the instability of complex, scheduled losses. We provide theoretical guarantees that our method regularizes the model to prevent miscalibration and overfitting. Across four benchmark datasets and multiple architectures, our comprehensive experiments demonstrate that Socrates Loss consistently improves training stability while achieving more favorable accuracy-calibration trade-off, often converging faster than existing methods.
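
One way to picture a unified objective with an auxiliary unknown class is the toy loss below. This is our own illustrative construction, not the paper's formulation: cross-entropy over K+1 classes (the last being "unknown") plus a penalty that ties the unknown-class mass to the model's residual uncertainty about the true class.

```python
import numpy as np

def socrates_style_loss(logits, target, lam=0.1):
    """Illustrative (K+1)-way loss with an auxiliary unknown class.

    logits: (K+1,) raw scores, last entry is the 'unknown' class
    target: index of the true class (0..K-1)
    lam:    weight of the hypothetical uncertainty penalty
    """
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    ce = -np.log(p[target] + 1e-12)                  # classification term
    p_unknown = p[-1]
    # Hypothetical penalty: unknown mass should track 1 - p(true class),
    # so confident-but-wrong predictions are discouraged in one objective.
    penalty = lam * (p_unknown - (1.0 - p[target])) ** 2
    return float(ce + penalty)
```

The point of the single objective, as in the paper, is that classification and calibration are optimized jointly rather than via an unstable two-phase schedule.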

[CV-100] Physics-Grounded Monocular Vehicle Distance Estimation Using Standardized License Plate Typography

【速读】:该论文旨在解决单目相机在车辆间距离估计中因尺度模糊(scale ambiguity)导致的精度不足问题,从而提升高级驾驶辅助系统(ADAS)和自动驾驶中的测距可靠性。其核心解决方案是利用美国车牌标准化的几何特征作为被动地标(passive fiducial markers),通过显式的几何先验信息直接恢复真实世界尺度,无需任何监督训练数据或主动光源。关键创新包括:(1)多方法并行的车牌检测器以适应全范围光照条件;(2)融合OCR文本匹配、多设计颜色评分与轻量神经网络的三阶段状态识别引擎实现鲁棒识别;(3)基于逆方差加权的混合深度融合与在线尺度对齐机制,结合一维恒速卡尔曼滤波器输出平滑的距离、相对速度及碰撞时间(time-to-collision),显著优于现有基于板宽的方法和深度学习基线模型。

链接: https://arxiv.org/abs/2604.12239
作者: Manognya Lokesh Reddy,Zheng Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 17 pages, 9 figures

点击查看摘要

Abstract:Accurate inter-vehicle distance estimation is a cornerstone of Advanced Driver Assistance Systems (ADAS) and autonomous driving. While LiDAR and radar provide high precision, their high cost prohibits widespread adoption in mass-market vehicles. Monocular camera-based estimation offers a low-cost alternative but suffers from fundamental scale ambiguity. Recent deep learning methods for monocular depth achieve impressive results yet require expensive supervised training, suffer from domain shift, and produce predictions that are difficult to certify for safety-critical deployment. This paper presents a framework that exploits the standardized typography of United States license plates as passive fiducial markers for metric ranging, resolving scale ambiguity through explicit geometric priors without any training data or active illumination. First, a four-method parallel plate detector achieves robust plate reading across the full automotive lighting range. Second, a three-stage state identification engine fusing OCR text matching, multi-design color scoring, and a lightweight neural network classifier provides robust identification across all ambient conditions. Third, hybrid depth fusion with inverse-variance weighting and online scale alignment, combined with a one-dimensional constant-velocity Kalman filter, delivers smoothed distance, relative velocity, and time-to-collision for collision warning. Baseline validation reproduces a 2.3% coefficient of variation in character height measurements and a 36% reduction in distance-estimate variance compared with plate-width methods from prior work. Extensive outdoor experiments confirm a mean absolute error of 2.3% at 10 m and continuous distance output during brief plate occlusions, outperforming deep learning baselines by a factor of five in relative error.
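
The geometric core of the method is the pinhole relation Z = f·H / h between the known physical character height H, its measured pixel height h, and the focal length f in pixels, with inverse-variance weighting to fuse multiple depth estimates. A minimal sketch (the numeric values in use are illustrative, not the standardized plate dimensions):

```python
import numpy as np

def plate_distance(focal_px, char_height_m, char_height_px):
    """Pinhole-model range from standardized plate typography: Z = f * H / h."""
    return focal_px * char_height_m / char_height_px

def fuse_inverse_variance(estimates, variances):
    """Inverse-variance weighted fusion of several depth estimates:
    more certain measurements receive proportionally larger weight."""
    w = 1.0 / np.asarray(variances, dtype=float)
    return float((w * np.asarray(estimates, dtype=float)).sum() / w.sum())
```

The fused value (and its innovation over time) is what would feed the constant-velocity Kalman filter that smooths distance, relative velocity, and time-to-collision.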

[CV-101] BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition CVPR2026

【速读】:该论文旨在解决真实世界中因多样化服装样式导致的步态识别(gait recognition)性能下降问题,特别是跨服装条件下的特征不变性难题。其核心挑战在于服装多样性显著增加了类内差异(intra-class variance),使得学习对服装变化不敏感的步态特征变得困难。解决方案的关键是提出一个名为GaitCLIF(Gait-oriented CLoth-Invariant Feature)的鲁棒基线模型,通过在新提出的合成数据集BarbieGait上训练,有效提升跨服装场景下的步态识别准确率,从而为复杂环境中的步态识别提供可扩展且可控的数据生成与特征学习方法。

链接: https://arxiv.org/abs/2604.12221
作者: Qingyuan Cai,Saihui Hou,Xuecai Hu,Yongzhen Huang
机构: Beijing Normal University (北京师范大学); Alibaba Group (阿里巴巴集团); WATRIX.AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026, Project Page: this https URL

点击查看摘要

Abstract:Gait recognition, as a reliable biometric technology, has seen rapid development in recent years while it faces significant challenges caused by diverse clothing styles in the real world. This paper introduces BarbieGait, a synthetic gait dataset where real-world subjects are uniquely mapped into a virtual engine to simulate extensive clothing changes while preserving their gait identity information. As a pioneering work, BarbieGait provides a controllable gait data generation method, enabling the production of large datasets to validate cross-clothing issues that are difficult to verify with real-world data. However, the diversity of clothing increases intra-class variance and makes one of the biggest challenges to learning cloth-invariant features under varying clothing conditions. Therefore, we propose GaitCLIF (Gait-oriented CLoth-Invariant Feature) as a robust baseline model for cross-clothing gait recognition. Through extensive experiments, we validate that our method significantly improves cross-clothing performance on BarbieGait and the existing popular gait benchmarks. We believe that BarbieGait, with its extensive cross-clothing gait data, will further advance the capabilities of gait recognition in cross-clothing scenarios and promote progress in related research.

[CV-102] Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation

【速读】:该论文旨在解决视频扩散模型(Video Diffusion Models)在生成高质量视频时面临的计算负担过重问题,特别是由自注意力机制(Self-Attention)带来的高复杂度导致的推理延迟,以及现有稀疏注意力方法因静态稀疏模式和确定性块路由所引发的视觉闪烁(Visual Flickering)问题。其解决方案的关键在于提出一种无需训练的精确定量稀疏注意力机制(Precision-Allocated Sparse Attention, PASA),核心创新包括:1)基于轨迹曲率感知的动态预算分配机制,弹性地将计算资源聚焦于关键语义转换时刻以保障精度;2)采用硬件对齐的分组近似策略替代全局均匀估计,实现局部细节保留与峰值计算吞吐量的协同优化;3)引入随机选择偏置到注意力路由中,通过概率化机制软化选择边界、消除选择振荡,从而根除导致时间闪烁的局部计算饥饿现象。

链接: https://arxiv.org/abs/2604.12219
作者: Wentai Zhang,Ronghui Xi,Shiyao Peng,Jiayu Huang,Haoran Luo,Zichen Tang,Haihong E
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Video Diffusion Transformers have revolutionized high-fidelity video generation but suffer from the massive computational burden of self-attention. While sparse attention provides a promising acceleration solution, existing methods frequently provoke severe visual flickering caused by static sparsity patterns and deterministic block routing. To resolve these limitations, we propose Precision-Allocated Sparse Attention (PASA), a training-free framework designed for highly efficient and temporally smooth video generation. First, we implement a curvature-aware dynamic budgeting mechanism. By profiling the generation trajectory acceleration across timesteps, we elastically allocate the exact-computation budget to secure high-precision processing strictly during critical semantic transitions. Second, we replace global homogenizing estimations with hardware-aligned grouped approximations, successfully capturing fine-grained local variations while maintaining peak compute throughput. Finally, we incorporate a stochastic selection bias into the attention routing mechanism. This probabilistic approach softens rigid selection boundaries and eliminates selection oscillation, effectively eradicating the localized computational starvation that drives temporal flickering. Extensive evaluations on leading video diffusion models demonstrate that PASA achieves substantial inference acceleration while consistently producing remarkably fluid and structurally stable video sequences.
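
The stochastic selection bias can be sketched as Gumbel noise added to block relevance scores before top-k routing; with the noise scale at zero this collapses to the deterministic selection the paper identifies as the source of frame-to-frame oscillation. The Gumbel form and the `tau` scale are our assumptions for illustration.

```python
import numpy as np

def stochastic_topk_blocks(scores, k, tau=0.5, seed=0):
    """Probabilistic block routing sketch: perturb scores with scaled
    Gumbel noise, then keep the top-k blocks. tau=0 reduces to plain
    deterministic top-k selection."""
    rng = np.random.default_rng(seed)
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=np.shape(scores))))
    noisy = np.asarray(scores, dtype=float) + tau * g
    return np.argsort(noisy)[-k:]   # indices of selected attention blocks
```

Softening the selection boundary this way means blocks near the cutoff are sometimes computed exactly, avoiding the localized starvation that shows up as temporal flicker.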

[CV-103] Redefining Quality Criteria and Distance-Aware Score Modeling for Image Editing Assessment

【速读】:该论文旨在解决图像编辑质量评估(Image Editing Quality Assessment, IEQA)中因依赖人工启发式提示导致的两个关键问题:一是指标提示过于僵化,二是评分建模对距离不敏感,从而难以对齐人类隐含评判标准并捕捉评分空间的连续结构。其解决方案的核心在于提出统一框架DS-IEQA,通过反馈驱动的指标提示优化(Feedback-Driven Metric Prompt Optimization, FDMPO)自动精炼评价指标定义,并引入解耦令牌的距离回归损失(Token-Decoupled Distance Regression Loss, TDRL),将数值标记与语言建模分离,以期望距离最小化显式建模评分连续性,从而实现更精准、灵活且符合人类感知的图像编辑质量评估。

链接: https://arxiv.org/abs/2604.12175
作者: Xinjie Zhang,Qiang Li,Xiaowen Ma,Axi Niu,Li Yan,Qingsen Yan
机构: Northwestern Polytechnical University, China (西北工业大学); Shenzhen Research Institute of Northwestern Polytechnical University, China (西北工业大学深圳研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in image editing have heightened the need for reliable Image Editing Quality Assessment (IEQA). Unlike traditional methods, IEQA requires complex reasoning over multimodal inputs and multi-dimensional assessments. Existing MLLM-based approaches often rely on human heuristic prompting, leading to two key limitations: rigid metric prompting and distance-agnostic score modeling. These issues hinder alignment with implicit human criteria and fail to capture the continuous structure of score spaces. To address this, we propose Define-and-Score Image Editing Quality Assessment (DS-IEQA), a unified framework that jointly learns evaluation criteria and score representations. Specifically, we introduce Feedback-Driven Metric Prompt Optimization (FDMPO) to automatically refine metric definitions via probabilistic feedback. Furthermore, we propose Token-Decoupled Distance Regression Loss (TDRL), which decouples numerical tokens from language modeling to explicitly model score continuity through expected distance minimization. Extensive experiments show our method’s superior performance; it ranks 4th in the 2026 NTIRE X-AIGC Quality Assessment Track 2 without any additional training data.
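
The expected-distance idea behind TDRL can be sketched as follows: a softmax over the numeric score tokens defines a distribution p(s), and the loss is E_p[|s − y|], so probability mass on near-miss scores is penalized less than mass far from the target. The absolute-distance choice and the token-decoupling details are assumptions for this sketch.

```python
import numpy as np

def tdrl_loss(score_logits, score_values, target):
    """Distance-aware score loss sketch: expected |s - y| under the
    softmax distribution over numeric score tokens.

    score_logits: (S,) logits, one per candidate score token
    score_values: (S,) numeric value of each score token
    target:       ground-truth score y
    """
    z = score_logits - score_logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return float((p * np.abs(np.asarray(score_values) - target)).sum())
```

Unlike plain token cross-entropy, predicting "4" when the truth is "5" costs less here than predicting "1", which is what "modeling score continuity" means in practice.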

[CV-104] Nucleus-Image: Sparse MoE for Image Generation

【速读】:该论文旨在解决文本到图像生成模型在质量和推理效率之间的权衡问题,即如何在保持高生成质量的同时显著降低计算开销。其核心解决方案是提出Nucleus-Image模型,采用稀疏混合专家(Mixture-of-Experts, MoE)扩散Transformer架构,并结合专家选择路由(Expert-Choice Routing)机制,使每层激活约20亿参数(2B),而总模型容量可达170亿参数(17B)。关键创新包括:通过去除非文本token的简化结构和联合注意力机制实现文本键值(KV)跨时间步共享以提升推理效率;引入解耦路由设计以增强基于时间步调制的路由稳定性;以及构建大规模高质量训练数据集(15亿对图像-文本)并采用渐进式分辨率训练与专家容量稀疏化策略。最终,Nucleus-Image在多个基准测试中达到或超越领先模型性能,且无需任何后训练优化(如强化学习或人类偏好微调),实现了高质量、低推理成本的生成效果。

链接: https://arxiv.org/abs/2604.12163
作者: Chandan Akiti,Ajay Modukuri,Murali Nandan Nagarapu,Gunavardhan Akiti,Haozhe Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Nucleus-Image, a text-to-image generation model that establishes a new Pareto frontier in quality-versus-efficiency by matching or exceeding leading models on GenEval, DPG-Bench, and OneIG-Bench while activating only approximately 2B parameters per forward pass. Nucleus-Image employs a sparse mixture-of-experts (MoE) diffusion transformer architecture with Expert-Choice Routing that scales total model capacity to 17B parameters across 64 routed experts per layer. We adopt a streamlined architecture optimized for inference efficiency by excluding text tokens from the transformer backbone entirely and using joint attention that enables text KV sharing across timesteps. To improve routing stability when using timestep modulation, we introduce a decoupled routing design that separates timestep-aware expert assignment from timestep-conditioned expert computation. We construct a large-scale training corpus of 1.5B high-quality training pairs spanning 700M unique images through multi-stage filtering, deduplication, aesthetic tiering, and caption curation. Training follows a progressive resolution curriculum (256 to 512 to 1024) with multi-aspect-ratio bucketing at every stage, coupled with progressive sparsification of the expert capacity factor. We adopt the Muon optimizer and share our parameter grouping recipe tailored for diffusion models with timestep modulation. Nucleus-Image demonstrates that sparse MoE scaling is a highly effective path to high-quality image generation, reaching the performance of models with significantly larger active parameter budgets at a fraction of the inference cost. These results are achieved without post-training optimization of any kind: no reinforcement learning, no direct preference optimization, and no human preference tuning. We release the training recipe, making Nucleus-Image the first fully open-source MoE diffusion model at this quality.
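
Expert-choice routing inverts the usual token-choice top-k: each expert selects its top-capacity tokens, so per-expert load is balanced by construction rather than by an auxiliary loss. A minimal sketch:

```python
import numpy as np

def expert_choice_routing(affinity, capacity):
    """Expert-choice assignment sketch.

    affinity: (num_tokens, num_experts) token-expert affinity scores
    Each EXPERT picks its top-`capacity` tokens; a token may be chosen by
    zero, one, or several experts.
    """
    num_tokens, num_experts = affinity.shape
    assignment = np.zeros_like(affinity, dtype=bool)
    for e in range(num_experts):
        chosen = np.argsort(affinity[:, e])[-capacity:]
        assignment[chosen, e] = True
    return assignment
```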

[CV-105] VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale CVPR2026

【速读】:该论文旨在解决视频地理定位(video geolocalization)任务中现有方法的局限性,即分类法仅能实现粗粒度的城市级定位、而图像检索方法因需构建大规模图像图库在全局尺度上不具可行性。其核心问题在于如何实现高精度且可扩展的视频到GPS坐标映射,同时保证轨迹的时间一致性。解决方案的关键在于提出VidTAG框架——一个双编码器架构,结合自监督特征与语言对齐特征进行帧到GPS的检索;并引入TempGeo模块以对齐帧嵌入以缓解时间不一致性问题,以及GeoRefiner模块(编码器-解码器结构)利用对齐后的帧嵌入精炼GPS特征,从而提升定位精度和轨迹连续性。

链接: https://arxiv.org/abs/2604.12159
作者: Parth Parag Kulkarni,Rohit Gupta,Prakash Chandra Chhipa,Mubarak Shah
机构: University of Central Florida (中佛罗里达大学); Luleå Tekniska Universitet (吕勒奥理工大学); Amazon Prime Video Science (亚马逊Prime视频科学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026

点击查看摘要

Abstract:The task of video geolocalization aims to determine the precise GPS coordinates of a video’s origin and map its trajectory; with applications in forensics, social media, and exploration. Existing classification-based approaches operate at a coarse city-level granularity and fail to capture fine-grained details, while image retrieval methods are impractical on a global scale due to the need for extensive image galleries which are infeasible to compile. Comparatively, constructing a gallery of GPS coordinates is straightforward and inexpensive. We propose VidTAG, a dual-encoder framework that performs frame-to-GPS retrieval using both self-supervised and language-aligned features. To address temporal inconsistencies in video predictions, we introduce the TempGeo module, which aligns frame embeddings, and the GeoRefiner module, an encoder-decoder architecture that refines GPS features using the aligned frame embeddings. Evaluations on Mapillary (MSLS) and GAMa datasets demonstrate our model’s ability to generate temporally consistent trajectories and outperform baselines, achieving a 20% improvement at the 1 km threshold over GeoCLIP. We also beat current State-of-the-Art by 25% on global coarse grained video geolocalization (CityGuessr68k). Our approach enables fine-grained video geolocalization and lays a strong foundation for future research. More details on the project webpage: this https URL
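
The frame-to-GPS retrieval step reduces to nearest-neighbor search in a joint embedding space; a sketch using cosine similarity (the embedding dimensions and toy gallery are our assumptions, and the real gallery would cover GPS coordinates at global scale):

```python
import numpy as np

def retrieve_gps(frame_emb, gps_embs, gps_coords):
    """Return the GPS coordinate whose embedding is most cosine-similar
    to the frame embedding.

    frame_emb:  (d,) embedding of one video frame
    gps_embs:   (N, d) gallery of GPS-coordinate embeddings
    gps_coords: (N, 2) latitude/longitude per gallery entry
    """
    f = frame_emb / np.linalg.norm(frame_emb)
    g = gps_embs / np.linalg.norm(gps_embs, axis=1, keepdims=True)
    sims = g @ f
    return gps_coords[int(sims.argmax())]
```

Building the gallery from coordinates rather than images is what makes this formulation cheap at global scale, per the paper's motivation.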

[CV-106] Domain-Specific Latent Representations Improve the Fidelity of Diffusion-Based Medical Image Super-Resolution

【速读】:该论文旨在解决医学图像超分辨率(Medical Image Super-Resolution, SR)中重建质量受限的问题,其核心发现是:当前主流的潜在扩散模型普遍采用为自然图像设计的变分自编码器(Variational Autoencoder, VAE),而这一默认选择才是制约重建质量的主要因素,而非扩散架构本身。解决方案的关键在于用领域专用的VAE——MedVAE替代通用的Stable Diffusion VAE,该模型在超过160万张医学图像上预训练,显著提升了重建质量(PSNR提升2.91–3.29 dB),且优势集中在高频空间频带,对应解剖学上的精细结构。实验表明,该改进在不同推理调度、预测目标和生成架构下均稳定有效,且生成幻觉率无显著差异,说明重建保真度与生成幻觉由独立模块控制。因此,论文提出一个实用筛选准则:在扩散架构优化前,应优先评估并选择具有高重建质量的领域特定VAE,这可通过无需扩散训练即可测量的指标预测下游SR性能(R² = 0.67)。

链接: https://arxiv.org/abs/2604.12152
作者: Sebastian Cajas,Ashaba Judith,Rahul Gorijavolu,Sahil Kapadia,Hillary Clinton Kasimbazi,Leo Kinyera,Emmanuel Paul Kwesiga,Sri Sri Jaithra Varma Manthena,Luis Filipe Nakayama,Ninsiima Doreen,Leo Anthony Celi
机构: MIT Critical Data, Massachusetts Institute of Technology, Cambridge, MA, USA; Mbarara University of Science and Technology, Uganda; Johns Hopkins University, Baltimore, MD, USA; University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Makerere University, Uganda; Uganda Cancer Institute, Uganda; Cavendish University, Uganda; Technical University of Applied Sciences Lübeck (TH Lübeck), Lübeck, Germany; Federal University of São Paulo, Brazil; Beth Israel Deaconess Medical Center, Boston, MA, USA; Harvard T.H. Chan School of Public Health, Boston, MA, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Latent diffusion models for medical image super-resolution universally inherit variational autoencoders designed for natural photographs. We show that this default choice, not the diffusion architecture, is the dominant constraint on reconstruction quality. In a controlled experiment holding all other pipeline components fixed, replacing the generic Stable Diffusion VAE with MedVAE, a domain-specific autoencoder pretrained on more than 1.6 million medical images, yields +2.91 to +3.29 dB PSNR improvement across knee MRI, brain MRI, and chest X-ray (n = 1,820; Cohen’s d = 1.37 to 1.86, all p 10^-20, Wilcoxon signed-rank). Wavelet decomposition localises the advantage to the finest spatial frequency bands encoding anatomically relevant fine structure. Ablations across inference schedules, prediction targets, and generative architectures confirm the gap is stable within plus or minus 0.15 dB, while hallucination rates remain comparable between methods (Cohen’s h 0.02 across all datasets), establishing that reconstruction fidelity and generative hallucination are governed by independent pipeline components. These results provide a practical screening criterion: autoencoder reconstruction quality, measurable without diffusion training, predicts downstream SR performance (R^2 = 0.67), suggesting that domain-specific VAE selection should precede diffusion architecture search. Code and trained model weights are publicly available at this https URL.
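
The paper's screening criterion rests on PSNR of autoencoder reconstructions, which it finds predictive of downstream SR quality (R² = 0.67) without any diffusion training. For reference, PSNR over images normalized to [0, 1]:

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB between a reference image x and a
    reconstruction y, for pixel values in [0, max_val]."""
    mse = np.mean((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Comparing this metric across candidate VAEs on the target modality is the cheap screening step the authors suggest should precede any diffusion architecture search.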

[CV-107] ViLL-E: Video LLM Embeddings for Retrieval ACL2026

【Quick Read】: This paper targets the underperformance of Video Large Language Models (VideoLLMs) on embedding-based retrieval tasks such as Text-to-Video Retrieval and Moment Retrieval, which are typically dominated by specialized embedding models. The key is ViLL-E (Video-LLM-Embed), a unified VideoLLM architecture whose core innovation is a novel embedding generation mechanism that adapts inference length to video complexity: the model "thinks longer" for complex videos and stops early for easy ones. A three-stage training recipe (large-scale pre-training, continual training on detailed captions, and multi-task fine-tuning) combines generative and contrastive learning, yielding significant gains in temporal localization and video retrieval and unlocking new zero-shot capabilities such as composed video retrieval and retrieval from long text.

Link: https://arxiv.org/abs/2604.12148
Authors: Rohit Gupta,Jayakrishnan Unnikrishnan,Fan Fei,Sheng Liu,Son Tran,Mubarak Shah
Affiliations: Amazon; University of Central Florida
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ACL 2026 Main conference

Abstract:Video Large Language Models (VideoLLMs) excel at video understanding tasks where outputs are textual, such as Video Question Answering and Video Captioning. However, they underperform specialized embedding-based models in Retrieval tasks, such as Text-to-Video Retrieval and Moment Retrieval. We introduce ViLL-E (Video-LLM-Embed), a unified VideoLLM architecture endowed with a novel embedding generation mechanism that allows the model to “think longer” for complex videos and stop early for easy ones. We train this model with a three-stage training methodology combining generative and contrastive learning: initial large-scale pre-training with video-caption pairs; followed by continual training on a smaller, detailed-caption dataset; and concluding with task-specific fine-tuning on a novel multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching. Our model significantly improves temporal localization (on avg. 7% over other VideoLLMs) and video retrieval (up to 4% over dual encoder models), achieving performance comparable to state-of-the-art specialized embedding models while remaining competitive on VideoQA tasks. Furthermore, our joint contrastive-generative training unlocks new zero-shot capabilities, significantly outperforming state-of-the-art methods in composed video retrieval (+5% over SotA) and retrieval from long text (+2% over SotA).

[CV-108] Beyond Perception Errors: Semantic Fixation in Large Vision-Language Models

【Quick Read】: This paper studies semantic fixation in large vision-language models (VLMs): the tendency to stick with a default semantic mapping even when the prompt explicitly specifies an alternative, which makes perception errors hard to distinguish from rule-mapping errors. The key contribution is VLM-Fix, a controlled benchmark built on four abstract strategy games that pairs standard and inverse rule formulations to precisely isolate the fixation effect. Experiments show that both open and closed models consistently favor standard rules; neutral alias prompts substantially narrow the inverse-rule gap while semantically loaded aliases reopen it, and training strategies together with late-layer activation steering confirm that the mechanism is at least partly editable.

Link: https://arxiv.org/abs/2604.12119
Authors: Md Tanvirul Alam
Affiliations: Rochester Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Large vision-language models (VLMs) often rely on familiar semantic priors, but existing evaluations do not cleanly separate perception failures from rule-mapping failures. We study this behavior as semantic fixation: preserving a default interpretation even when the prompt specifies an alternative, equally valid mapping. To isolate this effect, we introduce VLM-Fix, a controlled benchmark over four abstract strategy games that evaluates identical terminal board states under paired standard and inverse rule formulations. Across 14 open and closed VLMs, accuracy consistently favors standard rules, revealing a robust semantic-fixation gap. Prompt interventions support this mechanism: neutral alias prompts substantially narrow the inverse-rule gap, while semantically loaded aliases reopen it. Post-training is strongly rule-aligned: training on one rule improves same-rule transfer but hurts opposite-rule transfer, while joint-rule training improves broader transfer. To test external validity beyond synthetic games, we evaluate analogous defamiliarization interventions on VLMBias and observe the same qualitative pattern. Finally, late-layer activation steering partially recovers degraded performance, indicating that semantic-fixation errors are at least partly editable in late representations. Project page, code, and dataset available at this https URL.

[CV-109] HTDC: Hesitation-Triggered Differential Calibration for Mitigating Hallucination in Large Vision-Language Models

【Quick Read】: This paper tackles hallucination in large vision-language models (LVLMs), which stems from unstable visual grounding and over-reliance on language priors. Existing training-free decoding methods calibrate at every decoding step, adding computational overhead and potentially disrupting stable predictions. The key idea is to detect "layer-wise hesitation", fluctuations in token preference across intermediate layers that signal grounding instability, and to propose Hesitation-Triggered Differential Calibration (HTDC), a training-free decoding framework that activates calibration only when hesitation is detected. When triggered, HTDC contrasts the full branch with two lightweight probes (a visual-nullification probe and a semantic-nullification probe) to suppress hallucination-prone candidate tokens, consistently reducing hallucinations while maintaining task accuracy and achieving a favorable trade-off between effectiveness and computational overhead.

Link: https://arxiv.org/abs/2604.12115
Authors: Xinyun Liu
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, 6 tables

Abstract:Large vision-language models (LVLMs) achieve strong multimodal performance, but still suffer from hallucinations caused by unstable visual grounding and over-reliance on language priors. Existing training-free decoding methods typically apply calibration at every decoding step, introducing unnecessary computation and potentially disrupting stable predictions. We address this problem by identifying layer-wise hesitation, a simple signal of grounding instability reflected by fluctuations in token preference across intermediate layers. Based on this observation, we propose Hesitation-Triggered Differential Calibration (HTDC), a training-free decoding framework that preserves standard full-branch inference and activates calibration only at hesitation-prone steps. When triggered, HTDC contrasts the full branch with two lightweight probes, a visual-nullification probe and a semantic-nullification probe, to suppress hallucination-prone candidates while avoiding unnecessary intervention on stable steps. Experiments on representative hallucination benchmarks show that HTDC consistently reduces hallucinations while maintaining strong task accuracy, achieving a favorable trade-off between effectiveness and computational overhead.

[CV-110] PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation

【Quick Read】: This paper addresses the problem that current in-context segmentation methods built on visual foundation models (VFMs) such as the Segment Anything Model (SAM) generate sub-optimal prompts when the support and query images are visually inconsistent, degrading segmentation quality. The key solution is PR-MaGIC (Prompt Refinement via Mask Decoder Gradient Flow for In-Context Segmentation), a training-free test-time framework that refines prompts using the gradient flow of SAM's mask decoder, stabilized by a simple top-1 selection strategy for robustness. It consistently improves segmentation accuracy across benchmarks without additional training or architectural changes.

Link: https://arxiv.org/abs/2604.12113
Authors: Minjae Lee,Sungwoo Hur,Soojin Hwang,Won Hwa Kim
Affiliations: Pohang University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Visual Foundation Models (VFMs) such as the Segment Anything Model (SAM) have significantly advanced broad use of image segmentation. However, SAM and its variants necessitate substantial manual effort for prompt generation and additional training for specific applications. Recent approaches address these limitations by integrating SAM into in-context (one/few shot) segmentation, enabling auto-prompting through semantic alignment between query and support images. Despite these efforts, they still generate sub-optimal prompts that degrade segmentation quality due to visual inconsistencies between support and query images. To tackle this limitation, we introduce PR-MaGIC (Prompt Refinement via Mask Decoder Gradient Flow for In-Context Segmentation), a training-free test-time framework that refines prompts via gradient flow derived from SAM’s mask decoder. PR-MaGIC seamlessly integrates into in-context segmentation frameworks, being theoretically grounded yet practically stabilized through a simple top-1 selection strategy that ensures robust performance across samples. Extensive evaluations demonstrate that PR-MaGIC consistently improves segmentation quality across various benchmarks, effectively mitigating inadequate prompts without requiring additional training or architectural modifications.

[CV-111] Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks NEURIPS2026

【Quick Read】: This paper targets the unreliable and opaque reasoning of current spatial-aware research agents on complex multimodal tasks, especially those involving spatial-relation judgments and end-to-end machine learning engineering, where purely LLM-based approaches tend to hallucinate and are hard to audit. The core of the solution is the compute-grounded reasoning (CGR) paradigm: a structured spatial scene graph engine extracts entities and relations from vision descriptions and derives facts such as distances and safety violations deterministically, feeding these verifiable intermediate results to the LLM for final answer generation and thereby avoiding hallucinated spatial reasoning. This is combined with entropy-guided action selection to maximize information gain per step, a three-tier frontier model stack (OpenAI + Anthropic) for efficient query routing, and a self-healing ML pipeline for robustness, yielding gains in both accuracy and interpretability.

Link: https://arxiv.org/abs/2604.12102
Authors: Arun Sharma
Affiliations: University of Minnesota, Twin Cities
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages. Submitted to NeurIPS 2026. Code: this https URL

Abstract:We introduce compute-grounded reasoning (CGR), a design paradigm for spatial-aware research agents in which every answerable sub-problem is resolved by deterministic computation before a language model is asked to generate. Spatial Atlas instantiates CGR as a single Agent-to-Agent (A2A) server that handles two challenging benchmarks: FieldWorkArena, a multimodal spatial question-answering benchmark spanning factory, warehouse, and retail environments, and MLE-Bench, a suite of 75 Kaggle machine learning competitions requiring end-to-end ML engineering. A structured spatial scene graph engine extracts entities and relations from vision descriptions, computes distances and safety violations deterministically, then feeds computed facts to large language models, thereby avoiding hallucinated spatial reasoning. Entropy-guided action selection maximizes information gain per step and routes queries across a three-tier frontier model stack (OpenAI + Anthropic). A self-healing ML pipeline with strategy-aware code generation, a score-driven iterative refinement loop, and a prompt-based leak audit registry round out the system. We evaluate across both benchmarks and show that CGR yields competitive accuracy while maintaining interpretability through structured intermediate representations and deterministic spatial computations.
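The deterministic step that CGR relies on can be illustrated with a toy scene graph. Everything below (entity names, coordinates, the safety threshold) is invented for illustration and is not taken from Spatial Atlas:

```python
import math

# Hypothetical scene-graph entities with 2D positions in meters.
entities = {
    "forklift": (2.0, 1.0),
    "worker":   (2.0, 2.5),
    "shelf":    (9.0, 4.0),
}
SAFETY_RADIUS_M = 2.0  # assumed threshold

def distance(a: str, b: str) -> float:
    """Deterministic Euclidean distance between two entities."""
    return math.dist(entities[a], entities[b])

# Safety violations are computed, not estimated by the language model;
# only these verified facts would be serialized into the LLM prompt.
violations = [
    (a, b) for a in entities for b in entities
    if a < b and distance(a, b) < SAFETY_RADIUS_M
]
```

In the CGR framing, the model then answers questions over facts like `violations` rather than guessing distances from pixels.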

[CV-112] PC-MIL: Decoupling Feature Resolution from Supervision Scale in Whole-Slide Learning MICCAI2026

【Quick Read】: This paper addresses the mismatch between the scale of supervision and the scale of clinical reasoning in whole-slide image (WSI) classification with Multiple Instance Learning (MIL). Trained only on global slide-level labels, standard MIL aggregates features without learning anatomically meaningful localization, weakening the model's grasp of millimeter-scale cues such as tumor burden, focal lesions, and architectural patterns. The key innovation of Progressive-Context MIL (PC-MIL) is to treat supervision scale as an independent design dimension: with features fixed at 20x resolution, the spatial extent of MIL bags is varied in millimeter units and supervision is anchored at a clinically motivated 2mm scale to avoid confounding lesion density with scale, while slide- and region-level supervision are progressively mixed in controlled proportions. This enables explicit train-context x test-context analysis, markedly improves cross-context generalization, and stabilizes accuracy across scales.

Link: https://arxiv.org/abs/2604.12100
Authors: Syed Fahim Ahmed,Gnanesh Rasineni,Florian Koehler,Abu Zahid Bin Aziz,Mei Wang,Attila Gyulassy,Brian Summa,J. Quincy Brown,Valerio Pascucci,Shireen Y. Elhabian
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 2 figures, 2 tables. Under review at MICCAI 2026

Abstract:Whole-slide image (WSI) classification in computational pathology is commonly formulated as slide-level Multiple Instance Learning (MIL) with a single global bag representation. However, slide-level MIL is fundamentally underconstrained: optimizing only global labels encourages models to aggregate features without learning anatomically meaningful localization. This creates a mismatch between the scale of supervision and the scale of clinical reasoning. Clinicians assess tumor burden, focal lesions, and architectural patterns within millimeter-scale regions, whereas standard MIL is trained only to predict whether “somewhere in the slide there is cancer.” As a result, the model’s inductive bias effectively erases anatomical structure. We propose Progressive-Context MIL (PC-MIL), a framework that treats the spatial extent of supervision as a first-class design dimension. Rather than altering magnification, patch size, or introducing pixel-level segmentation, we decouple feature resolution from supervision scale. Using fixed 20x features, we vary MIL bag extent in millimeter units and anchor supervision at a clinically motivated 2mm scale to preserve comparable tumor burden and avoid confounding scale with lesion density. PC-MIL progressively mixes slide- and region-level supervision in controlled proportions, enabling explicit train-context x test-context analysis. On 1,476 prostate WSIs from five public datasets for binary cancer detection, we show that anatomical context is an independent axis of generalization in MIL, orthogonal to feature resolution: modest regional supervision improves cross-context performance, and balanced multi-context training stabilizes accuracy across slide and regional evaluation without sacrificing global performance. These results demonstrate that supervision extent shapes MIL inductive bias and support anatomically grounded WSI generalization.
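To make the supervision-extent idea concrete, here is a toy max-pooling MIL aggregator applied to the same patch scores at two bag extents. The scores, grid, and pooling rule are illustrative assumptions, not PC-MIL's actual architecture:

```python
# Patch scores on a tiny 3x2 grid; row 1 contains a small tumor focus.
patch_scores = {
    (0, 0): 0.05, (0, 1): 0.10,
    (1, 0): 0.92, (1, 1): 0.88,
    (2, 0): 0.07, (2, 1): 0.04,
}

def bag_prediction(coords):
    """Max-pooled MIL prediction over the patches in one bag."""
    return max(patch_scores[c] for c in coords)

# Slide-level MIL: one global bag, a single weak label for the whole slide.
slide_pred = bag_prediction(patch_scores)

# Region-level bags (e.g., millimeter-scale extents): each row is its own
# bag, so supervision now constrains where the positive evidence lies.
region_preds = {row: bag_prediction([(row, 0), (row, 1)]) for row in range(3)}
```

Both settings yield the same slide-level answer, but only the region-level bags penalize a model that scores the wrong rows highly.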

[CV-113] INST-Align: Implicit Neural Alignment for Spatial Transcriptomics via Canonical Expression Fields MICCAI2026

【Quick Read】: This paper addresses two coupled difficulties in multi-slice spatial transcriptomics (ST) analysis: large non-rigid deformations across slices, and inter-slice batch effects introduced when alignment and integration are performed independently. The key is the INST-Align framework, which couples a coordinate-based deformation network with a shared Canonical Expression Field, an implicit neural representation mapping spatial coordinates to expression embeddings, for joint alignment and reconstruction. A two-phase training strategy first stabilizes the embedding space and then jointly optimizes deformation and spatial-feature matching, yielding mutually constrained alignment and representation learning. Cross-slice parameter sharing regularizes ambiguous correspondences and absorbs batch variation, substantially improving cross-slice consistency and producing biologically plausible 3D tissue reconstruction.

Link: https://arxiv.org/abs/2604.12084
Authors: Bonian Han,Cong Qi,Przemyslaw Musialski,Zhi Wei
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 2 figures, 3 tables. Submitted to MICCAI 2026

Abstract:Spatial transcriptomics (ST) measures mRNA expression while preserving spatial organization, but multi-slice analysis faces two coupled difficulties: large non-rigid deformations across slices and inter-slice batch effects when alignment and integration are treated independently. We present INST-Align, an unsupervised pairwise framework that couples a coordinate-based deformation network with a shared Canonical Expression Field, an implicit neural representation mapping spatial coordinates to expression embeddings, for joint alignment and reconstruction. A two-phase training strategy first establishes a stable canonical embedding space and then jointly optimizes deformation and spatial-feature matching, enabling mutually constrained alignment and representation learning. Cross-slice parameter sharing of the canonical field regularizes ambiguous correspondences and absorbs batch variation. Across nine datasets, INST-Align achieves state-of-the-art mean OT Accuracy (0.702), NN Accuracy (0.719), and Chamfer distance, with Chamfer reductions of up to 94.9% on large-deformation sections relative to the strongest baseline. The framework also yields biologically meaningful spatial embeddings and coherent 3D tissue reconstruction. The code will be released after review phase.

[CV-114] OpenTME: An Open Dataset of AI-powered HE Tumor Microenvironment Profiles from TCGA

【Quick Read】: This paper addresses the scarcity of large-scale, consistent, quantitative characterization of the tumor microenvironment (TME) from routine hematoxylin and eosin (HE)-stained pathology slides. The key is the OpenTME dataset, built with Atlas HE-TME, an AI-powered application based on the Atlas family of pathology foundation models that performs tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis, yielding over 4,500 cell-level quantitative readouts per whole-slide image. This provides a high-quality, standardized resource for biomarker discovery, spatial biology research, and the development of computational methods for TME analysis.

Link: https://arxiv.org/abs/2604.12075
Authors: Maaike Galama,Nina Kozar-Gillan,Christina Embacher,Todd Dembo,Cornelius Böhm,Evelyn Ramberger,Julika Ribbat-Idel,Rosemarie Krupar,Verena Aumiller,Miriam Hägele,Kai Standvoss,Gerrit Erdmann,Blanca Pablos,Ari Angelo,Simon Schallenberg,Andrew Norgan,Viktor Matyas,Klaus-Robert Müller,Maximilian Alber,Lukas Ruff,Frederick Klauschen
Affiliations: Aignostics; Charité – Universitätsmedizin Berlin; Ludwig-Maximilians-Universität München; Mayo Clinic; Technische Universität Berlin; BIFOLD – Berlin Institute for the Foundations of Learning and Data; Korea University; Max-Planck Institute for Informatics; German Cancer Research Center (DKFZ); German Cancer Consortium (DKTK); Bavarian Cancer Research Center (BZKF)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Abstract:The tumor microenvironment (TME) plays a central role in cancer progression, treatment response, and patient outcomes, yet large-scale, consistent, and quantitative TME characterization from routine hematoxylin and eosin (HE)-stained histopathology remains scarce. We introduce OpenTME, an open-access dataset of pre-computed TME profiles derived from 3,634 HE-stained whole-slide images across five cancer types (bladder, breast, colorectal, liver, and lung cancer) from The Cancer Genome Atlas (TCGA). All outputs were generated using Atlas HE-TME, an AI-powered application built on the Atlas family of pathology foundation models, which performs tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis, yielding over 4,500 quantitative readouts per slide at cell-level resolution. OpenTME is available for non-commercial academic research on Hugging Face. We will continue to expand OpenTME over time and anticipate it will serve as a resource for biomarker discovery, spatial biology research, and the development of computational methods for TME analysis.

[CV-115] Privacy-Preserving Structureless Visual Localization via Image Obfuscation

【Quick Read】: This paper addresses the privacy leakage that arises in cloud-based visual localization when query images or scene representations are uploaded to a server. Existing privacy-preserving approaches, while accurate localization remains possible, are typically more complex, slower, and less accurate than their non-private counterparts. The key idea is a simple image obfuscation strategy: query and scene reference images are preprocessed with common image operations (e.g., replacing RGB images with semantic segmentations), and because modern feature matchers can match obfuscated images out of the box, existing structureless localization pipelines need no special adjustments. Experiments on multiple datasets show that the resulting methods achieve state-of-the-art pose accuracy among privacy-preserving approaches, protecting both the query images and the scene representations.

Link: https://arxiv.org/abs/2604.12068
Authors: Vojtech Panek,Patrik Beliansky,Zuzana Kukelova,Torsten Sattler
Affiliations: Czech Technical University (CTU) in Prague; Czech Institute of Informatics, Robotics and Cybernetics, CTU in Prague; Visual Recognition Group, Faculty of Electrical Engineering, CTU in Prague
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual localization is the task of estimating the camera pose of an image relative to a scene representation. In practice, visual localization systems are often cloud-based. Naturally, this raises privacy concerns in terms of revealing private details through the images sent to the server or through the representations stored on the server. Privacy-preserving localization aims to avoid such leakage of private details. However, the resulting localization approaches are significantly more complex, slower, and less accurate than their non-privacy-preserving counterparts. In this paper, we consider structureless localization methods in the context of privacy preservation. Structureless methods represent the scene through a set of reference images with known camera poses and intrinsics. In contrast to existing methods proposing representations that are as privacy-preserving as possible, we study a simple image obfuscation approach based on common image operations, e.g., replacing RGB images with (semantic) segmentations. We show that existing structureless pipelines do not need any special adjustments, as modern feature matchers can match obfuscated images out of the box. The results are easy-to-implement pipelines that can ensure both the privacy of the query images and the scene representations. Detailed experiments on multiple datasets show that the resulting methods achieve state-of-the-art pose accuracy for privacy-preserving approaches.

[CV-116] Does Visual Token Pruning Improve Calibration? An Empirical Study on Confidence in MLLM s

【Quick Read】: This paper examines how visual token pruning affects model calibration in multimodal large language models (MLLMs), i.e., whether predicted confidence matches actual correctness, a dimension that prior work, focused on task accuracy, has largely ignored. The key is a systematic evaluation of pruning strategies (coverage-based SCOPE with varying saliency weights, saliency-only pruning, FastV, and random pruning) across multiple token budgets, using Expected Calibration Error (ECE), Brier score, and AURC. The results show that well-designed pruning (e.g., SCOPE with a low saliency weight) can substantially improve calibration while maintaining accuracy, indicating that pruning is not a simple trade of reliability for efficiency and that confidence quality should be part of the evaluation, especially for multimodal systems that need reliable decisions.

Link: https://arxiv.org/abs/2604.12035
Authors: Kaizhen Tan
Affiliations: Carnegie Mellon University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual token pruning is a widely used strategy for efficient inference in multimodal large language models (MLLMs), but existing work mainly evaluates it with task accuracy. In this paper, we study how visual token pruning affects model calibration, that is, whether predicted confidence matches actual correctness. Using LLaVA-1.5-7B on POPE and ScienceQA-IMG, we evaluate Expected Calibration Error (ECE), Brier score, and AURC under several pruning strategies, including SCOPE with different saliency weights, saliency-only pruning, FastV, and random pruning, across multiple token budgets. Our results show that pruning does not simply trade reliability for efficiency. On POPE, a pure-coverage setting in SCOPE achieves substantially lower ECE than the full unpruned model while maintaining similar accuracy. An internal alpha-sweep further shows a consistent trend: reducing the saliency weight improves calibration at all tested token budgets, while accuracy changes only slightly. In contrast, saliency-based pruning leads to worse calibration, and real FastV causes severe performance degradation in our setting. On ScienceQA-IMG, pruning also reduces ECE, with accuracy remaining stable or slightly improving. We additionally study the gap power exponent in coverage-based selection and find that its default setting is not always optimal. Overall, our results suggest that visual token pruning should be evaluated not only by accuracy, but also by confidence quality, especially for multimodal systems that need reliable decisions.
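The calibration metrics named above are standard and compact enough to state exactly. A minimal sketch of binned ECE and the Brier score (generic definitions, not the paper's evaluation code):

```python
def brier(confs, correct):
    """Mean squared gap between confidence and 0/1 correctness."""
    return sum((c - y) ** 2 for c, y in zip(confs, correct)) / len(confs)

def ece(confs, correct, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confs, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    total = len(confs)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        err += (len(b) / total) * abs(avg_conf - accuracy)
    return err

# Perfectly calibrated toy predictions: 80% confidence, 4 of 5 correct,
# so ECE is 0 even though one answer is wrong.
confs = [0.8, 0.8, 0.8, 0.8, 0.8]
correct = [1, 1, 1, 1, 0]
```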

[CV-117] Curvelet-Based Frequency-Aware Feature Enhancement for Deepfake Detection

【Quick Read】: This paper addresses the degradation of spatial-domain features under image compression in deepfake detection, aiming to improve robustness in realistic deployment conditions. The key is to introduce the Curvelet Transform, so far unexplored for deepfake detection despite its superior directional and multiscale properties, to extract more discriminative frequency-domain features, selectively enhanced via wedge-level attention and scale-aware spatial masking; the refined frequency information is then reconstructed and fed to a modified pretrained Xception network for classification, achieving accurate and interpretable deepfake detection.

Link: https://arxiv.org/abs/2604.12028
Authors: Salar Adel Sabri,Ramadhan J. Mstafa
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 Pages, 6 Figures, 2 Tables

Abstract:The proliferation of sophisticated generative models has significantly advanced the realism of synthetic facial content, known as deepfakes, raising serious concerns about digital trust. Although modern deep learning-based detectors perform well, many rely on spatial-domain features that degrade under compression. This limitation has prompted a shift toward integrating frequency-domain representations with deep learning to improve robustness. Prior research has explored frequency transforms such as Discrete Cosine Transform (DCT), Fast Fourier Transform (FFT), and Wavelet Transform, among others. However, to the best of our knowledge, the Curvelet Transform, despite its superior directional and multiscale properties, remains entirely unexplored in the context of deepfake detection. In this work, we introduce a novel Curvelet-based detection approach that enhances feature quality through wedge-level attention and scale-aware spatial masking, both trained to selectively emphasize discriminative frequency components. The refined frequency cues are reconstructed and passed to a modified pretrained Xception network for classification. Evaluated on two compression qualities in the challenging FaceForensics++ dataset, our method achieves 98.48% accuracy and 99.96% AUC on FF++ low compression, while maintaining strong performance under high compression, demonstrating the efficacy and interpretability of Curvelet-informed forgery detection.

[CV-118] IPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment CVPR2026

【Quick Read】: This paper addresses the weak alignment between dense patch representations and text embeddings in current vision-language pretrained models, a limitation for fine-grained visual understanding. The solution rests on two key improvements: first, patch-level distillation, found to substantially improve the student model's patch-text alignment, surprisingly even surpassing the teacher; second, an upgrade of the iBOT pretraining objective to iBOT++, in which unmasked image tokens also contribute directly to the loss, dramatically strengthening the patch-text alignment of pretrained models. Combined with a revised exponential-moving-average setup and a multi-granularity synthetic-caption sampling strategy, these yield TIPSv2, a family of image-text encoder models whose strong performance is validated across 9 tasks and 20 datasets.

Link: https://arxiv.org/abs/2604.12012
Authors: Bingyi Cao,Koert Chen,Kevis-Kokitsi Maninis,Kaifeng Chen,Arjun Karpur,Ye Xia,Sahil Dua,Tanmaya Dabral,Guangxing Han,Bohyung Han,Joshua Ainslie,Alex Bewley,Mithun Jacob,René Wagner,Washington Ramos,Krzysztof Choromanski,Mojtaba Seyedhosseini,Howard Zhou,André Araujo
Affiliations: Google DeepMind; xAI; Epsilon Health; Seoul National University; Google
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2026 camera-ready + appendix

Abstract:Recent progress in vision-language pretraining has enabled significant improvements to many downstream computer vision applications, such as classification, retrieval, segmentation and depth prediction. However, a fundamental capability that these models still struggle with is aligning dense patch representations with text embeddings of corresponding concepts. In this work, we investigate this critical issue and propose novel techniques to enhance this capability in foundational vision-language models. First, we reveal that a patch-level distillation procedure significantly boosts dense patch-text alignment – surprisingly, the patch-text alignment of the distilled student model strongly surpasses that of the teacher model. This observation inspires us to consider modifications to pretraining recipes, leading us to propose iBOT++, an upgrade to the commonly-used iBOT masked image objective, where unmasked tokens also contribute directly to the loss. This dramatically enhances patch-text alignment of pretrained models. Additionally, to improve vision-language pretraining efficiency and effectiveness, we modify the exponential moving average setup in the learning recipe, and introduce a caption sampling strategy to benefit from synthetic captions at different granularities. Combining these components, we develop TIPSv2, a new family of image-text encoder models suitable for a wide range of downstream applications. Through comprehensive experiments on 9 tasks and 20 datasets, we demonstrate strong performance, generally on par with or better than recent vision encoder models. Code and models are released via our project page at this https URL .

[CV-119] The Second Challenge on Cross-Domain Few-Shot Object Detection at NTIRE 2026: Methods and Results CVPR

【Quick Read】: This paper addresses cross-domain few-shot object detection (CD-FSOD): detecting objects in unseen target domains under extremely limited annotation. The key contribution is the organization of the second CD-FSOD Challenge at NTIRE 2026, which systematically evaluates and advances the field by attracting teams worldwide to explore diverse innovative methods under both open-source and closed-source tracks, substantially improving cross-domain generalization and detection performance.

Link: https://arxiv.org/abs/2604.11998
Authors: Xingyu Qiu,Yuqian Fu,Jiawei Geng,Bin Ren,Jiancheng Pan,Zongwei Wu,Hao Tang,Yanwei Fu,Radu Timofte,Nicu Sebe,Mohamed Elhoseiny,Lingyi Hong,Mingxi Cheng,Xingqi He,Runze Li,Xingdong Sheng,Wenqiang Zhang,Jiacong Liu,Shu Luo,Yikai Qin,Yaze Zhao,Yongwei Jiang,Yixiong Zou,Zhe Zhang,Yang Yang,Kaiyu Li,Bowen Fu,Zixuan Jiang,Ke Li,Hui Qiao,Xiangyong Cao,Xuanlong Yu,Youyang Sha,Longfei Liu,Di Yang,Xi Shen,Kyeongryeol Go,Taewoong Jang,Saiprasad Meesiyawar,Ravi Kirasur,Rakshita Kulkarni,Bhoomi Deshpande,Harsh Patil,Uma Mudenagudi,Shuming Hu,Chao Chen,Tao Wang,Wei Zhou,Qi Xu,Zhenzhao Xing,Dandan Zhao,Hanzhe Xia,Dongdong Lu,Zhe Zhang,Jingru Wang,Guangwei Huang,Jiachen Tu,Yaokun Shi,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Liwei Zhou,Bei Dou,Tao Wu,Zekang Fan,Junjie Liu,Adhémar de Senneville,Flavien Armangeon,Mengbers,Yazhe Lyu,Zhimeng Xin,Zijian Zhuang,Hongchun Zhu,Li Wang
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: accepted by CVPRW 26 @ NTIRE

Abstract:Cross-domain few-shot object detection (CD-FSOD) remains a challenging problem for existing object detectors and few-shot learning approaches, particularly when generalizing across distinct domains. As part of NTIRE 2026, we hosted the second CD-FSOD Challenge to systematically evaluate and promote progress in detecting objects in unseen target domains under limited annotation conditions. The challenge received strong community interest, with 128 registered participants and a total of 696 submissions. Among them, 31 teams actively participated, and 19 teams submitted valid final results. Participants explored a wide range of strategies, introducing innovative methods that push the performance frontier under both open-source and closed-source tracks. This report presents a detailed overview of the NTIRE 2026 CD-FSOD Challenge, including a summary of the submitted approaches and an analysis of the final results across all participating teams. Challenge Codes: this https URL.

[CV-120] Ultra-low-light computer vision using trained photon correlations

【Quick Read】: This paper addresses the limits that ultra-low-light, high-noise imaging places on computer vision tasks such as object recognition. Existing correlated-photon approaches focus on high-fidelity image reconstruction, overlooking that the end goal is scene inference rather than image recovery. The key is correlation-aware training (CAT): a trainable correlated-photon illumination source is optimized end-to-end with a Transformer backend so the model learns to exploit photon-correlation patterns directly, without reconstructing an image. With as few as 100 shots, the method improves classification accuracy by up to 15 percentage points over conventional uncorrelated illumination and also outperforms untrained correlated-photon illumination, substantially improving recognition accuracy under extreme photon-budget constraints.

Link: https://arxiv.org/abs/2604.11993
Authors: Mandar M. Sohoni,Jérémie Laydevant,Mathieu Ouellet,Shi-Yuan Ma,Ryotatsu Yanagimoto,Benjamin A. Ash,Tatsuhiro Onodera,Tianyu Wang,Logan G. Wright,Peter L. McMahon
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments: 49 pages, 47 figures

Abstract:Illumination using correlated photon sources has been established as an approach to allowing high-fidelity images to be reconstructed from noisy camera frames by taking advantage of the knowledge that signal photons are spatially correlated whereas detector clicks due to noise are uncorrelated. However, in computer-vision tasks, the goal is often not ultimately to reconstruct an image, but to make inferences about a scene – such as what object is present. Here we show how correlated-photon illumination can be used to gain an advantage in a hybrid optical-electronic computer-vision pipeline for object recognition. We demonstrate correlation-aware training (CAT): end-to-end optimization of a trainable correlated-photon illumination source and a Transformer backend in a way that the Transformer can learn to benefit from the correlations, using a small number (≤ 100) of shots. We show a classification accuracy enhancement of up to 15 percentage points over conventional, uncorrelated-illumination-based computer vision in ultra-low-light and noisy imaging conditions, as well as an improvement over using untrained correlated-photon illumination. Our work illustrates how specializing to a computer-vision task – object recognition – and training the pattern of photon correlations in conjunction with a digital backend allows us to push the limits of accuracy in highly photon-budget-constrained scenarios beyond existing methods focused on image reconstruction.

[CV-121] ReefMapGS: Enabling Large-Scale Underwater Reconstruction by Closing the Loop Between Multimodal SLAM and Gaussian Splatting

【Quick Read】: This paper addresses 3D Gaussian Splatting's (3DGS) critical dependence on accurate camera poses, which are typically obtained via computationally intensive structure-from-motion (SfM) that is unsuitable for field robots such as underwater autonomous vehicles (AUVs). The key is ReefMapGS, an incremental reconstruction framework that fuses multimodal sensor data (acoustic, inertial, pressure, and visual) in a pose-graph-optimization-based SLAM method to estimate camera poses with uncertainty. It interleaves local tracking of new image observations with optimization of the underlying 3DGS scene and feeds refined poses back into the pose graph to globally optimize the trajectory, enabling high-quality, COLMAP-free underwater reef reconstruction and accurate AUV trajectory estimation over surveys spanning up to 700 m.

Link: https://arxiv.org/abs/2604.11992
Authors: Daniel Yang,Jungseok Hong,John J. Leonard,Yogesh Girdhar
Affiliations: Massachusetts Institute of Technology; Woods Hole Oceanographic Institution
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Gaussian Splatting is a powerful visual representation, providing high-quality and efficient 3D scene reconstruction, but it is crucially dependent on accurate camera poses typically obtained from computationally intensive processes like structure-from-motion that are unsuitable for field robot applications. However, in these domains, multimodal sensor data from acoustic, inertial, pressure, and visual sensors are available and suitable for pose-graph optimization-based SLAM methods that can estimate the vehicle’s trajectory and thus our needed camera poses while providing uncertainty. We propose a 3DGS-based incremental reconstruction framework, ReefMapGS, that builds an initial model from a high certainty region and progressively expands to incorporate the whole scene. We reconstruct the scene incrementally by interleaving local tracking of new image observations with optimization of the underlying 3DGS scene. These refined poses are integrated back into the pose-graph to globally optimize the whole trajectory. We show COLMAP-free 3D reconstruction of two underwater reef sites with complex geometry as well as more accurate global pose estimation of our AUV over survey trajectories spanning up to 700 m.

[CV-122] Fall Risk and Gait Analysis in Community-Dwelling Older Adults using World-Spaced 3D Human Mesh Recovery CVPR2026

【速读】:该论文旨在解决老年人跌倒风险评估中传统临床方法依赖手动计时步速、缺乏客观量化指标的问题,从而限制了对真实环境中步态特征的全面分析。其解决方案的关键在于构建一个基于3D人体网格恢复(Human Mesh Recovery, HMR)模型的自动化流程,从社区场景下录制的Timed Up and Go (TUG)测试视频中提取时空步态参数(如步态时间、起坐持续时间及步长),并通过与惯性测量单元(IMU)嵌入式鞋垫数据对比验证其准确性,最终证实所提取参数与主观跌倒风险感知显著相关,实现了在自然社区环境中可扩展、生态有效的步态分析方法。

链接: https://arxiv.org/abs/2604.11961
作者: Chitra Banarjee,Patrick Kwon,Ania Lipat,Rui Xie,Chen Chen,Ladda Thiamwong
机构: University of Central Florida (UCF)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work was accepted at Computer Vision for Biomechanics Workshop (CVBW) at CVPR 2026

点击查看摘要

Abstract:Gait assessment is a key clinical indicator of fall risk and overall health in older adults. However, standard clinical practice is largely limited to stopwatch-measured gait speed. We present a pipeline that leverages a 3D Human Mesh Recovery (HMR) model to extract gait parameters from recordings of older adults completing the Timed Up and Go (TUG) test. From videos recorded across different community centers, we extract and analyze spatiotemporal gait parameters, including step time, sit-to-stand duration, and step length. We found that video-derived step time was significantly correlated with IMU-based insole measurements. Using linear mixed effects models, we confirmed that shorter, more variable step lengths and longer sit-to-stand durations were predicted by higher self-rated fall risk and fear of falling. These findings demonstrate that our pipeline can enable accessible and ecologically valid gait analysis in community settings.

[CV-123] EigenCoin: sassanid coins classification based on Bhattacharyya distance

【速读】:该论文旨在解决不平衡数据集下的模式识别问题,特别是在萨珊王朝钱币分类中的应用。其核心挑战在于如何在类别分布极不均衡的情况下提升分类性能,并避免过拟合(over-fitting)问题。解决方案的关键是提出EigenCoin流形方法,该方法结合Bhattacharyya距离度量,通过三个主要步骤实现:流形构建、测试数据映射和分类。实验表明,EigenCoin在准确率上相比其他算法提升了9.45%至21.75%,同时具备良好的抗过拟合能力。

链接: https://arxiv.org/abs/2604.11932
作者: Rahele Allahverdi,Mohammad Mahdi Dehshibi,Azam Bastanfard,Daryoosh Akbarzadeh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 2nd World Conference on Information Technology (WCIT-2011)

点击查看摘要

Abstract:Solving pattern recognition problems on imbalanced databases is a hot topic that continues to attract research attention. We consider this problem in the application of Sassanid coin classification. Our focus is not only on proposing the EigenCoin manifold with the Bhattacharyya distance for the classification task, but also on testing the influence of holistic and feature-based approaches. EigenCoin consists of three main steps, namely manifold construction, mapping of test data, and classification. The conducted experiments show that EigenCoin outperformed the other observed algorithms, improving accuracy by 9.45% up to 21.75%, while also handling the over-fitting problem.
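
按摘要中的定义,Bhattacharyya 距离可由两个归一化直方图直接计算:D_B = -ln Σ√(p_i·q_i)。下面是一个最小示例(直方图取值为演示用的假设数据,与论文的钱币特征无关):

```python
import math

def bhattacharyya_distance(p, q):
    """两个离散分布(归一化直方图)之间的 Bhattacharyya 距离。"""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))  # Bhattacharyya 系数,相同分布时为 1
    return -math.log(bc)

p = [0.25, 0.25, 0.25, 0.25]
q = [0.40, 0.30, 0.20, 0.10]
d_same = bhattacharyya_distance(p, p)  # 相同分布:距离为 0
d_diff = bhattacharyya_distance(p, q)  # 分布差异越大,距离越大
```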

[CV-124] A Workflow to Efficiently Generate Dense Tissue Ground Truth Masks for Digital Breast Tomosynthesis

【速读】:该论文旨在解决数字乳腺断层成像(Digital Breast Tomosynthesis, DBT)中纤维腺体组织(fibroglandular tissue)精准分割的难题,这一问题对于个性化乳腺癌风险评估至关重要,但受限于高质量人工标注训练数据稀缺。其解决方案的关键在于提出一种高效的人工标注框架:用户仅需在DBT体积的中心重建切片上粗略勾画感兴趣区域(Region of Interest, ROI),并选择一个分割阈值生成密集组织掩膜;随后算法将该ROI投影至其余切片,并迭代调整每一切片的阈值以保持整个DBT体积内密集组织边界的一致性。该方法显著减少了人工标注时间和劳动成本,同时在44例DBT数据上的评估显示,与两名放射科医生标注结果相比,患者层面Dice系数中位数达0.84,且与单名放射科医生手动标注20%和80%分位数切片的结果相比,Dice系数中位数为0.83,验证了其高准确性和实用性。

链接: https://arxiv.org/abs/2604.11927
作者: Tamerlan Mustafaev,Oleg Kruglov,Margarita Zuley,Luana de Mero Omena,Guilherme Muniz de Oliveira,Vitor de Sousa Franca,Bruno Barufaldi,Robert Nishikawa,Juhun Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Digital breast tomosynthesis (DBT) is now the standard of care for breast cancer screening in the USA. Accurate segmentation of fibroglandular tissue in DBT images is essential for personalized risk estimation, but algorithm development is limited by scarce human-delineated training data. In this study we introduce a time- and labor-saving framework to generate a human-annotated binary segmentation mask for dense tissue in DBT. Our framework enables a user to outline a rough region of interest (ROI) enclosing dense tissue on the central reconstructed slice of a DBT volume and select a segmentation threshold to generate the dense tissue mask. The algorithm then projects the ROI to the remaining slices and iteratively adjusts slice-specific thresholds to maintain consistent dense tissue delineation across the DBT volume. By requiring annotation only on the central slice, the framework substantially reduces annotation time and labor. We used 44 DBT volumes from the DBTex dataset for evaluation. Inter-reader agreement was assessed by computing patient-wise Dice similarity coefficients between segmentation masks produced by two radiologists, yielding a median of 0.84. Accuracy of the proposed method was evaluated by having a radiologist manually segment the 20th and 80th percentile slices from each volume (CC and MLO views; 176 slices total) and calculate Dice scores between the manual and proposed segmentations, yielding a median of 0.83.
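
摘要以 Dice 相似系数衡量两份分割掩膜的一致性,其定义为 2|A∩B|/(|A|+|B|)。以下用假设的二值掩膜演示该计算:

```python
def dice_coefficient(mask_a, mask_b):
    """两个二值掩膜(0/1 序列)的 Dice 相似系数:2|A∩B| / (|A| + |B|)。"""
    intersection = sum(a & b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    return 2.0 * intersection / total if total else 1.0

a = [1, 1, 1, 0, 0, 1]
b = [1, 1, 0, 0, 1, 1]
d = dice_coefficient(a, b)  # 2*3 / (4+4) = 0.75
```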

[CV-125] V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos CVPR2026

【速读】:该论文旨在解决从单张完成菜品图像中进行营养估测的局限性问题,因为许多营养相关成分(如油、酱料和混合食材)在烹饪后会变得视觉模糊,导致卡路里和宏量营养素估计不准确。其解决方案的关键在于利用第一人称视角烹饪视频中的过程信息,提出了一种名为V-Nutri的分阶段框架:该框架结合了Nutrition5K预训练的视觉主干网络与轻量级融合模块,用于聚合最终菜品帧和从烹饪视频中提取的关键帧特征;同时引入基于VideoMamba的事件检测模型来识别食材添加时刻的关键帧,从而提供互补的营养证据,显著提升营养估测精度。

链接: https://arxiv.org/abs/2604.11913
作者: Chengkun Yue,Chuanzhi Xu,Jiangpeng He
机构: Indiana University (印第安纳大学); The University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the 3rd MetaFood Workshop at CVPR 2026

点击查看摘要

Abstract:Nutrition estimation of meals from visual data is an important problem for dietary monitoring and computational health, but existing approaches largely rely on single images of the finally completed dish. This setting is fundamentally limited because many nutritionally relevant ingredients and transformations, such as oils, sauces, and mixed components, become visually ambiguous after cooking, making accurate calorie and macronutrient estimation difficult. In this paper, we investigate whether the cooking process information from egocentric cooking videos can contribute to dish-level nutrition estimation. First, we further manually annotated the HD-EPIC dataset and established the first benchmark for video-based nutrition estimation. Most importantly, we propose V-Nutri, a staged framework that combines Nutrition5K-pretrained visual backbones with a lightweight fusion module that aggregates features from the final dish frame and cooking process keyframes extracted from the egocentric videos. V-Nutri also includes a cooking keyframes selection module, a VideoMamba-based event-detection model that targets ingredient-addition moments. Experiments on the HD-EPIC dataset show that process cues can provide complementary nutritional evidence, improving nutrition estimation under controlled conditions. Our results further indicate that the benefit of process keyframes depends strongly on backbone representation capacity and event detection quality. Our code and annotated dataset is available at this https URL.

[CV-126] MedConcept: Unsupervised Concept Discovery for Interpretability in Medical VLMs

【速读】:该论文旨在解决医学视觉语言模型(Medical Vision-Language Models, VLMs)因黑箱特性导致临床信任度低、预测结果难以解释的问题。当前基于梯度或注意力机制的可解释性方法通常局限于特定任务且无法提供跨下游任务复用的概念级解释。其解决方案的关键在于提出MedConcept框架,该框架通过无监督方式从预训练VLM的潜在表示中识别稀疏神经元级别的医学概念激活,并将其转化为类放射科报告格式的伪摘要,从而支持医生对模型内部推理过程的审查。此外,作者设计了一种定量语义验证协议,利用独立预训练的医学大语言模型(Medical Large Language Model, LLM)作为冻结的外部评估器,以“对齐”、“不一致”和“不确定”三个分数量化概念与放射学报告之间的语义一致性,为医学VLM的可解释性提供了可衡量的基准。

链接: https://arxiv.org/abs/2604.11868
作者: Md Rakibul Haque,KM Arefeen Sultan,Tushar Kataria,Shireen Elhabian
机构: Kahlert School of Computing, University of Utah (犹他大学卡勒特计算机学院); Scientific Computing and Imaging Institute, University of Utah (犹他大学科学计算与成像研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While medical Vision-Language models (VLMs) achieve strong performance on tasks such as tumor or organ segmentation and diagnosis prediction, their opaque latent representations limit clinical trust and the ability to explain predictions. Interpretability of these multimodal representations is therefore essential for the trustworthy clinical deployment of pretrained medical VLMs. However, current interpretability methods, such as gradient- or attention-based visualizations, are often limited to specific tasks such as classification. Moreover, they do not provide concept-level explanations derived from shared pretrained representations that can be reused across downstream tasks. We introduce MedConcept, a framework that uncovers latent medical concepts in a fully unsupervised manner and grounds them in clinically verifiable textual semantics. MedConcept identifies sparse neuron-level concept activations from pretrained VLM representations and translates them into pseudo-report-style summaries, enabling physician-level inspection of internal model reasoning. To address the lack of quantitative evaluation in concept-based interpretability, we introduce a quantitative semantic verification protocol that leverages an independent pretrained medical LLM as a frozen external evaluator to assess concept alignment with radiology reports. We define three concept scores, Aligned, Unaligned, and Uncertain, to quantify semantic support, contradiction, or ambiguity relative to radiology reports and use them exclusively for post hoc evaluation. These scores provide a quantitative baseline for assessing interpretability in medical VLMs. All code, prompts, and data are to be released on acceptance.

[CV-127] UniMark: Unified Adaptive Multi-bit Watermarking for Autoregressive Image Generators

【速读】:该论文旨在解决自回归(Autoregressive, AR)图像生成模型中隐写水印技术的三大核心问题:(1) 现有方法仅支持零比特水印用于二值验证,无法传递多比特信息;(2) 依赖静态码本划分策略,在码本暴露后易受安全攻击;(3) 针对特定AR架构设计,缺乏跨不同AR范式的通用性。解决方案的关键在于提出一个无需训练、统一的水印框架 \method,其核心创新包括:自适应语义分组(Adaptive Semantic Grouping, ASG),基于语义相似性和密钥动态划分码本以兼顾图像质量和安全性;块级多比特编码(Block-wise Multi-bit Encoding, BME),通过分块和纠错码实现可靠的消息传输;以及统一的令牌替换接口(Unified Token-Replacement Interface, UTRI),抽象嵌入过程以兼容下一标记预测(如 LlamaGen)与下一尺度预测(如 VAR)等不同AR范式。

链接: https://arxiv.org/abs/2604.11843
作者: Yigit Yilmaz,Elena Petrova,Mehmet Kaya,Lucia Rossi,Amir Rahman
机构: Bandırma Onyedi Eylül University (班迪尔马九月十七大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: work in progress

点击查看摘要

Abstract:Invisible watermarking for autoregressive (AR) image generation has recently gained attention as a means of protecting image ownership and tracing AI-generated content. However, existing approaches suffer from three key limitations: (1) they embed only zero-bit watermarks for binary verification, lacking the ability to convey multi-bit messages; (2) they rely on static codebook partitioning strategies that are vulnerable to security attacks once the partition is exposed; and (3) they are designed for specific AR architectures, failing to generalize across diverse AR paradigms. We propose \method, a training-free, unified watermarking framework for autoregressive image generators that addresses all three limitations. \method introduces three core components: \textbfAdaptive Semantic Grouping (ASG), which dynamically partitions codebook entries based on semantic similarity and a secret key, ensuring both image quality preservation and security; \textbfBlock-wise Multi-bit Encoding (BME), which divides the token sequence into blocks and encodes different bits across blocks with error-correcting codes for reliable message transmission; and \textbfa Unified Token-Replacement Interface (UTRI) that abstracts the watermark embedding process to support both next-token prediction (e.g., LlamaGen) and next-scale prediction (e.g., VAR) paradigms. We provide theoretical analysis on detection error rates and embedding capacity. Extensive experiments on three AR models demonstrate that \method achieves state-of-the-art performance in image quality (FID), watermark detection accuracy, and multi-bit message extraction, while maintaining robustness against cropping, JPEG compression, Gaussian noise, blur, color jitter, and random erasing attacks.
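
论文的自适应语义分组由语义相似度与密钥共同决定码本划分。下面只示意其中"密钥决定划分"这一点:用密钥化哈希把每个码本条目分到 0/1 两个比特组(HMAC 的用法与分组规则均为笔者的假设,忽略了语义分组与多比特分块编码):

```python
import hashlib
import hmac

def partition_codebook(codebook_size, secret_key):
    """用密钥化哈希将码本条目划分为"0 比特组"与"1 比特组"的简化示意。"""
    groups = {0: [], 1: []}
    for token_id in range(codebook_size):
        digest = hmac.new(secret_key, str(token_id).encode(), hashlib.sha256).digest()
        groups[digest[0] & 1].append(token_id)  # 取摘要首字节最低位作为组号
    return groups

groups = partition_codebook(1024, b"secret-key")
other = partition_codebook(1024, b"another-key")  # 不同密钥给出不同划分
```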

[CV-128] ART-VITON: Measurement-Guided Latent Diffusion for Artifact-Free Virtual Try-On

【速读】:该论文旨在解决虚拟试衣(Virtual Try-On, VITON)中非试衣区域(non-try-on regions)的保真度问题,即在利用潜空间扩散模型(Latent Diffusion Models, LDMs)实现服装对齐与细节合成的同时,如何避免因后处理替换策略导致的边界伪影(boundary artifacts)并防止生成过程中的语义漂移(semantic drift)。解决方案的关键在于将VITON建模为一个线性逆问题,并引入轨迹对齐求解器(trajectory-aligned solvers)以逐步强化测量一致性,从而减少非试衣区域的突变;进一步提出ART-VITON框架,通过残差先验初始化缓解训练-推理不匹配,结合数据一致性、频域校正和周期性标准去噪的测量引导采样机制,在保障测量遵循性的同时实现无伪影的高质量合成。

链接: https://arxiv.org/abs/2509.25749
作者: Junseo Park,Hyeryung Jang
机构: Dongguk University (东国大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 21 pages

点击查看摘要

Abstract:Virtual try-on (VITON) aims to generate realistic images of a person wearing a target garment, requiring precise garment alignment in try-on regions and faithful preservation of identity and background in non-try-on regions. While latent diffusion models (LDMs) have advanced alignment and detail synthesis, preserving non-try-on regions remains challenging. A common post-hoc strategy directly replaces these regions with original content, but abrupt transitions often produce boundary artifacts. To overcome this, we reformulate VITON as a linear inverse problem and adopt trajectory-aligned solvers that progressively enforce measurement consistency, reducing abrupt changes in non-try-on regions. However, existing solvers still suffer from semantic drift during generation, leading to artifacts. We propose ART-VITON, a measurement-guided diffusion framework that ensures measurement adherence while maintaining artifact-free synthesis. Our method integrates residual prior-based initialization to mitigate training-inference mismatch and artifact-free measurement-guided sampling that combines data consistency, frequency-level correction, and periodic standard denoising. Experiments on VITON-HD, DressCode, and SHHQ-1.0 demonstrate that ART-VITON effectively preserves identity and background, eliminates boundary artifacts, and consistently improves visual fidelity and robustness over state-of-the-art baselines.
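
把试衣建模为线性逆问题 y = A·x(A 为保留非试衣区域的二值掩膜)后,测量一致性可通过梯度步 x ← x − η·Aᵀ(A·x − y) 逐步强化。以下是该更新的示意实现(变量与步长均为假设,论文求解器还包含频域校正与周期性标准去噪等环节):

```python
import numpy as np

def data_consistency_step(x, y, mask, eta=1.0):
    """一步测量一致性更新:x ← x − η·Aᵀ(A·x − y),A 为二值掩膜。"""
    return x - eta * mask * (mask * x - y)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4))        # 当前生成结果
truth = rng.normal(size=(4, 4))    # 原始图像
mask = np.zeros((4, 4))
mask[:, :2] = 1.0                  # 非试衣区域
y = mask * truth                   # 测量值
x_new = data_consistency_step(x, y, mask, eta=1.0)
# η = 1 时,掩膜内区域一步匹配测量,掩膜外保持不变
```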

[CV-129] Probabilistic Feature Imputation and Uncertainty-Aware Multimodal Federated Aggregation

【速读】:该论文旨在解决多模态联邦学习(Multimodal Federated Learning)中因模态异质性(Modality Heterogeneity)导致的挑战,即临床机构由于资源限制或工作流程差异,仅能提供部分模态数据,而现有特征插补网络(Feature Imputation Networks)仅输出点估计且缺乏置信度衡量,使下游分类器无法区分可靠与不可靠的插补特征,从而在医疗等高风险场景中带来安全隐患。其解决方案的关键在于提出概率特征插补网络(Probabilistic Feature Imputation Network, P-FIN),能够同时输出校准后的不确定性估计(Calibrated Uncertainty Estimates);并基于此不确定性设计两级机制:局部层面通过Sigmoid门控机制抑制不可靠特征维度,全局层面采用Fed-UQ-Avg聚合策略优先采纳不确定性较低客户端的模型更新,从而提升联邦学习在医学影像分类任务中的鲁棒性和性能,在CheXpert、NIH Open-I和PadChest数据集上的实验验证了该方法相较于确定性基线的显著改进(最复杂配置下AUC提升+5.36%)。

链接: https://arxiv.org/abs/2604.12970
作者: Nafis Fuad Shahid,Maroof Ahmed,Md Akib Haider,Saidur Rahman Sagor,Aashnan Rahman,Md Azam Hossain
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the Medical Imaging with Deep Learning (MIDL) 2026 conference

点击查看摘要

Abstract:Multimodal federated learning enables privacy-preserving collaborative model training across healthcare institutions. However, a fundamental challenge arises from modality heterogeneity: many clinical sites possess only a subset of modalities due to resource constraints or workflow variations. Existing approaches address this through feature imputation networks that synthesize missing modality representations, yet these methods produce point estimates without reliability measures, forcing downstream classifiers to treat all imputed features as equally trustworthy. In safety-critical medical applications, this limitation poses significant risks. We propose the Probabilistic Feature Imputation Network (P-FIN), which outputs calibrated uncertainty estimates alongside imputed features. This uncertainty is leveraged at two levels: (1) locally, through sigmoid gating that attenuates unreliable feature dimensions before classification, and (2) globally, through Fed-UQ-Avg, an aggregation strategy that prioritizes updates from clients with reliable imputation. Experiments on federated chest X-ray classification using CheXpert, NIH Open-I, and PadChest demonstrate consistent improvements over deterministic baselines, with +5.36% AUC gain in the most challenging configuration.
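
摘要没有给出 Fed-UQ-Avg 的具体加权公式;与"优先采纳低不确定性客户端"一致的一种常见做法,是按不确定性的倒数为各客户端更新加权。下例即为这种假设性实现:

```python
import numpy as np

def uncertainty_weighted_average(updates, uncertainties):
    """按不确定性倒数加权聚合客户端更新(Fed-UQ-Avg 思想的示意,并非论文原式)。"""
    w = 1.0 / (np.asarray(uncertainties) + 1e-8)  # 不确定性越低,权重越大
    w = w / w.sum()
    return sum(wi * ui for wi, ui in zip(w, updates))

updates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
agg = uncertainty_weighted_average(updates, [0.1, 0.9])
# 低不确定性客户端(0.1)主导聚合结果
```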

[CV-130] DoseRAD2026 Challenge dataset: AI accelerated photon and proton dose calculation for radiotherapy

【速读】:该论文旨在解决放射治疗中精确剂量计算的问题,尤其是在MRI引导和实时自适应放疗日益普及的背景下,对基于CT和MRI的快速且准确剂量计算需求不断增长。解决方案的关键在于构建并公开发布DoseRAD2026数据集,该数据集包含115名患者的配对CT-MRI影像及对应的光子(6 MV)和质子蒙特卡洛剂量分布(共40,500个光子束和81,000个质子束斑),并通过变形图像配准、空气腔校正和重采样等预处理步骤确保数据质量,为开发和评估先进剂量计算方法提供标准化基准。

链接: https://arxiv.org/abs/2604.12778
作者: Fan Xiao,Nikolaos Delopoulos,Niklas Wahl,Lennart Volz,Lina Bucher,Matteo Maspero,Miguel Palacios,Muheng Li,Samir Schulz,Viktor Rogowski,Ye Zhang,Zoltan Perko,Christopher Kurz,George Dedes,Guillaume Landry,Adrian Thummerer
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: Accurate dose calculation is essential in radiotherapy for precise tumor irradiation while sparing healthy tissue. With the growing adoption of MRI-guided and real-time adaptive radiotherapy, fast and accurate dose calculation on CT and MRI is increasingly needed. The DoseRAD2026 dataset and challenge provide a public benchmark of paired CT and MRI data with beam-level photon and proton Monte Carlo dose distributions for developing and evaluating advanced dose calculation methods. Acquisition and validation methods: The dataset comprises paired CT and MRI from 115 patients (75 training, 40 testing) treated on an MRI-linac for thoracic or abdominal lesions, derived from the SynthRAD2025 dataset. Pre-processing included deformable image registration, air-cavity correction, and resampling. Ground-truth photon (6 MV) and proton dose distributions were computed using open-source Monte Carlo algorithms, yielding 40,500 photon beams and 81,000 proton beamlets. Data format and usage notes: Data are organized into photon and proton subsets with paired CT-MRI images, beam-level dose distributions, and JSON beam configuration files. Files are provided in compressed MetaImage (.mha) format. The dataset is released under CC BY-NC 4.0, with training data available from April 2026 and the test set withheld until March 2030. Potential applications: The dataset supports benchmarking of fast dose calculation methods, including beam-level dose estimation for photon and proton therapy, MRI-based dose calculation in MRI-guided workflows, and real-time adaptive radiotherapy.

[CV-131] CBAM-Enhanced DenseNet121 for Multi-Class Chest X-Ray Classification with Grad-CAM Explainability

【速读】:该论文旨在解决儿童肺炎在资源匮乏地区(如孟加拉国)诊断困难的问题,特别是现有深度学习方法多将肺炎检测简化为二分类任务,忽略了细菌性和病毒性肺炎在临床治疗中的关键差异。其解决方案的核心是提出CBAM-DenseNet121模型,通过引入卷积块注意力模块(Convolutional Block Attention Module, CBAM)增强DenseNet121的特征提取能力,实现对胸片图像的三分类判别:正常、细菌性肺炎和病毒性肺炎。该方法在保持高准确率(84.29% ± 1.14%)的同时,获得了优异的类间区分能力(各类别AUC均高于0.91),且Grad-CAM可视化验证了模型关注区域具有解剖学合理性,支持其在资源受限环境中的可解释性部署。

链接: https://arxiv.org/abs/2604.12305
作者: Utsho Kumar Dey
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures, 2 tables. Preprint submitted to IEEE Access

点击查看摘要

Abstract:Pneumonia remains a leading cause of childhood mortality worldwide, with a heavy burden in low-resource settings such as Bangladesh where radiologist availability is limited. Most existing deep learning approaches treat pneumonia detection as a binary problem, overlooking the clinically critical distinction between bacterial and viral aetiology. This paper proposes CBAM-DenseNet121, a transfer-learning framework that integrates the Convolutional Block Attention Module (CBAM) into DenseNet121 for three-class chest X-ray classification: Normal, Bacterial Pneumonia, and Viral Pneumonia. We also conduct a systematic binary-task baseline study revealing that EfficientNetB3 (73.88%) underperforms even the custom CNN baseline (78.53%) – a practically important negative finding for medical imaging model selection. To ensure statistical reliability, all experiments were repeated three times with independent random seeds (42, 7, 123), and results are reported as mean +/- standard deviation. CBAM-DenseNet121 achieves 84.29% +/- 1.14% test accuracy with per-class AUC scores of 0.9565 +/- 0.0010, 0.9610 +/- 0.0014, and 0.9187 +/- 0.0037 for bacterial pneumonia, normal, and viral pneumonia respectively. Grad-CAM visualizations confirm that the model attends to anatomically plausible pulmonary regions for each class, supporting interpretable deployment in resource-constrained clinical environments.
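
CBAM 的通道注意力是公开结构:对全局平均池化与全局最大池化结果共享同一个两层 MLP,相加后经 sigmoid 得到逐通道权重。下面用 numpy 给出前向计算的示意(权重随机初始化,仅演示计算流程,空间注意力从略):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam_channel_attention(feat, w1, w2):
    """CBAM 通道注意力示意。feat: (C, H, W);w1: (C//r, C);w2: (C, C//r)。"""
    avg = feat.mean(axis=(1, 2))                  # 全局平均池化 → (C,)
    mx = feat.max(axis=(1, 2))                    # 全局最大池化 → (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # 共享两层 MLP(ReLU)
    weights = sigmoid(mlp(avg) + mlp(mx))         # 逐通道权重 ∈ (0, 1)
    return feat * weights[:, None, None]

rng = np.random.default_rng(1)
C, r = 8, 2
feat = rng.normal(size=(C, 6, 6))
w1 = rng.normal(size=(C // r, C))
w2 = rng.normal(size=(C, C // r))
out = cbam_channel_attention(feat, w1, w2)
```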

[CV-132] QMC-Net: Data-Aware Quantum Representations for Remote Sensing Image Classification ICPR2026

【速读】:该论文旨在解决当前混合量子-经典模型在处理多波段遥感影像时,普遍依赖通用、数据无关的量子电路而导致无法有效捕捉各波段特异性统计差异的问题。其解决方案的关键在于提出一个数据驱动的框架,将波段级统计特征(如香农熵、方差、频谱平坦度和边缘密度)映射至定制化量子电路的超参数,从而实现基于数据特性的量子电路设计;在此基础上构建的QMC-Net架构采用波段特定的量子电路处理六通道数据,支持跨波段自适应的量子特征编码与变换,显著提升了分类性能,在EuroSAT和SAT-6数据集上分别达到93.80%和99.34%的准确率,优于传统经典模型和单一混合量子模型。

链接: https://arxiv.org/abs/2604.11817
作者: Md Aminur Hossain,Ayush V. Patel,Biplab Banerjee
机构: 未知
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in ICPR 2026, 15 pages

点击查看摘要

Abstract:Hybrid quantum-classical models offer a promising route for learning from complex data; however, their application to multi-band remote sensing imagery often relies on generic, data-agnostic quantum circuits that fail to account for channel-specific statistical variability. In this work, we propose a data-driven framework that maps band-level statistics such as Shannon Entropy, Variance, Spectral Flatness, and Edge Density to the hyperparameters of customized quantum circuits. Building on this framework, we introduce QMC-Net, a hybrid architecture that processes six data channels using band-specific quantum circuits, enabling adaptive quantum feature encoding and transformation across channels. Experiments on the EuroSAT and SAT-6 datasets demonstrate that QMC-Net achieves accuracies of 93.80 % and 99.34 %, respectively, while a residual-enhanced variant further improves performance to 94.69 % and 99.39 %. These results consistently outperform strong classical baselines and monolithic hybrid quantum models, highlighting the effectiveness of data-aware quantum circuit design under NISQ constraints.
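
论文把波段级统计量映射为量子电路超参数;其中香农熵、方差与频谱平坦度都可直接计算。下例用随机数据演示这三个统计量的计算方式(分箱数等细节为假设,边缘密度从略):

```python
import numpy as np

def band_statistics(band):
    """单个波段的三个数据感知统计量:香农熵、方差、频谱平坦度。"""
    hist, _ = np.histogram(band, bins=32)
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -np.sum(p * np.log2(p))      # 香农熵
    variance = band.var()                  # 方差
    power = np.abs(np.fft.fft2(band)).ravel() ** 2 + 1e-12
    flatness = np.exp(np.mean(np.log(power))) / power.mean()  # 几何均值 / 算术均值
    return entropy, variance, flatness

rng = np.random.default_rng(2)
band = rng.random((32, 32))
e, v, f = band_statistics(band)  # 频谱平坦度恒在 (0, 1] 区间
```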

人工智能

[AI-0] Bilevel Late Acceptance Hill Climbing for the Electric Capacitated Vehicle Routing Problem

【速读】:该论文旨在解决电动汽车路径规划问题(Electric Capacitated Vehicle Routing Problem, E-CVRP),该问题需同时优化车辆路径与充电决策,具有复杂的耦合特性。解决方案的关键在于提出一种分层优化框架——双层延迟接受爬山算法(bilevel Late Acceptance Hill Climbing, b-LAHC),该算法通过三个阶段(贪心下降、邻域探索和最终精化)分离并协同处理路由与充电决策,并引入一个代理目标函数(surrogate objective)在上层引导搜索方向,从而加速收敛。该方法无需参数自适应调整,保持轻量高效,在IEEE WCCI-2020基准测试中表现优异,尤其在大规模实例上取得9/10的新最优解,验证了双层结构的有效性与实用性。

链接: https://arxiv.org/abs/2604.13013
作者: Yinghao Qin,Mosab Bazargani,Edmund K. Burke,Carlos A. Coello Coello,Zhongmin Song,Jun Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:This paper tackles the Electric Capacitated Vehicle Routing Problem (E-CVRP) through a bilevel optimization framework that handles routing and charging decisions separately or jointly depending on the search stage. By analyzing their interaction, we introduce a surrogate objective at the upper level to guide the search and accelerate convergence. A bilevel Late Acceptance Hill Climbing algorithm (b-LAHC) is introduced that operates through three phases: greedy descent, neighborhood exploration, and final solution refinement. b-LAHC operates with fixed parameters, eliminating the need for complex adaptation while remaining lightweight and effective. Extensive experiments on the IEEE WCCI-2020 benchmark show that b-LAHC achieves superior or competitive performance against eight state-of-the-art algorithms. Under a fixed evaluation budget, it attains near-optimal solutions on small-scale instances and sets 9/10 new best-known results on large-scale benchmarks, improving existing records by an average of 1.07%. Moreover, the strong correlation (though not universal) observed between the surrogate objective and the complete cost justifies the use of the surrogate objective while still necessitating a joint solution of both levels, thereby validating the effectiveness of the proposed bilevel framework and highlighting its potential for efficiently solving large-scale routing problems with a hierarchical structure.
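
b-LAHC 的基础算法 Late Acceptance Hill Climbing 的接受准则是:候选解只要不差于当前解,或不差于 L 步之前记录的历史代价,即被接受。下面在一个玩具目标上给出标准 LAHC 的最小实现(论文的双层结构、代理目标与三阶段流程从略):

```python
import random

def late_acceptance_hill_climbing(cost, neighbor, init, history_len=50,
                                  iters=20000, seed=0):
    """标准 LAHC 最小实现:与 L 步前的历史代价比较,而非只与当前解比较。"""
    rng = random.Random(seed)
    cur, cur_cost = init, cost(init)
    history = [cur_cost] * history_len
    best, best_cost = cur, cur_cost
    for k in range(iters):
        cand = neighbor(cur, rng)
        cand_cost = cost(cand)
        idx = k % history_len
        if cand_cost <= cur_cost or cand_cost <= history[idx]:  # 延迟接受准则
            cur, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = cur, cur_cost
        history[idx] = cur_cost
    return best, best_cost

# 玩具示例:最小化 f(x) = (x - 3)^2
best, best_cost = late_acceptance_hill_climbing(
    cost=lambda x: (x - 3.0) ** 2,
    neighbor=lambda x, rng: x + rng.uniform(-0.5, 0.5),
    init=10.0)
```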

[AI-1] Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

【速读】:该论文旨在解决标准在线策略蒸馏(on-policy distillation, OPD)在训练大语言模型时依赖实时教师推理服务所带来的巨大基础设施开销问题。其核心挑战在于,尽管预计算教师模型日志概率(log-probabilities)以实现离线训练看似可行,但实践中这种离线变体无法稳定达到与在线OPD相当的性能。作者通过理论分析发现,一个此前被忽视的关键条件——教师一致性(teacher consistency)——是确保OPD有效性的前提:即必须使用完全相同的教师模型进行监督微调(SFT)和OPD阶段。违反该条件会引入不可消除的梯度偏差,导致无论训练时间多长,算法均收敛至次优固定点。解决方案的关键在于提出Lightning OPD框架,该框架通过在SFT轨迹上预计算教师log-probabilities来强制实施教师一致性,从而彻底移除对实时教师服务器的需求,并在理论上保证与标准OPD具有相同最优解,同时具备有界的梯度差异和隐式正则化效果,显著提升训练效率并保持高性能。

链接: https://arxiv.org/abs/2604.13010
作者: Yecheng Wu,Song Han,Hai Cai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, resulting in substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In practice, however, this offline variant fails to reliably match the performance of standard OPD. To understand this discrepancy, we identify a previously overlooked condition that is critical for any OPD pipeline, which we term teacher consistency. This condition requires that the same teacher model be used for both supervised fine-tuning and OPD. We show that violating teacher consistency introduces an irreducible gradient bias, causing both offline and online OPD to converge to a suboptimal fixed point regardless of training duration. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing teacher log-probabilities over SFT rollouts. This design eliminates the need for a live teacher server entirely. We further show that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Extensive experiments on mathematical reasoning and code generation demonstrate that Lightning OPD achieves state-of-the-art performance with significantly improved efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, achieving a 4.0x speedup over standard OPD and substantially lowering the barrier to entry for academic research on LLM post-training.
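
Lightning OPD 的关键是复用预先计算好的教师逐 token 对数概率,训练时不再访问教师服务。下例以逐 token 前向 KL 演示这类离线蒸馏损失的计算(具体损失形式为笔者的常见假设,并非论文原始定义):

```python
import numpy as np

def offline_distill_loss(student_logits, teacher_logprobs):
    """逐 token 前向 KL(teacher‖student) 的均值;teacher_logprobs 形状 (T, V),离线预计算。"""
    s = student_logits - student_logits.max(axis=-1, keepdims=True)
    student_logprobs = s - np.log(np.exp(s).sum(axis=-1, keepdims=True))
    p_teacher = np.exp(teacher_logprobs)
    kl = (p_teacher * (teacher_logprobs - student_logprobs)).sum(axis=-1)
    return kl.mean()

rng = np.random.default_rng(3)
T, V = 5, 10
teacher_logits = rng.normal(size=(T, V))
t = teacher_logits - teacher_logits.max(axis=-1, keepdims=True)
teacher_logprobs = t - np.log(np.exp(t).sum(axis=-1, keepdims=True))
zero_loss = offline_distill_loss(teacher_logits, teacher_logprobs)  # 学生 = 教师 → KL 为 0
```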

[AI-2] LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software

【速读】:该论文旨在解决现有自动化程序修复技术在应对逻辑漏洞(logical vulnerabilities)方面能力不足的问题,特别是由于对漏洞代码及其预期行为的语义理解有限所致。其解决方案的关键在于构建首个专门针对逻辑漏洞的数据集 LogicDS(包含86个带有CVE编号的真实漏洞实例)以及提出一个系统性的评估框架 LogicEval,用于量化分析传统与基于大语言模型(LLM)的修复方法在实际场景中的有效性。评估结果表明,编译和测试失败主要源于提示敏感性(prompt sensitivity)、代码上下文丢失(loss of code context)以及补丁定位困难(patch localization difficulty)。

链接: https://arxiv.org/abs/2604.12994
作者: Syed Md Mukit Rashid,Abdullah Al Ishtiaq,Kai Tu,Yilu Dong,Tianwei Wu,Ali Ranjbar,Tianchang Yang,Najrin Sultana,Shagufta Mehnaz,Syed Rafiul Hussain
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Logical vulnerabilities in software stem from flaws in program logic rather than memory safety, which can lead to critical security failures. Although existing automated program repair techniques primarily focus on repairing memory corruption vulnerabilities, they struggle with logical vulnerabilities because of their limited semantic understanding of the vulnerable code and its expected behavior. On the other hand, recent successes of large language models (LLMs) in understanding and repairing code are promising. However, no framework currently exists to analyze the capabilities and limitations of such techniques for logical vulnerabilities. This paper aims to systematically evaluate both traditional and LLM-based repair approaches for addressing real-world logical vulnerabilities. To facilitate our assessment, we created the first ever dataset, LogicDS, of 86 logical vulnerabilities with assigned CVEs reflecting tangible security impact. We also developed a systematic framework, LogicEval, to evaluate patches for logical vulnerabilities. Evaluations suggest that compilation and testing failures are primarily driven by prompt sensitivity, loss of code context, and difficulty in patch localization.

[AI-3] ROSE: An Intent-Centered Evaluation Metric for NL2SQL ACL2026

【速读】:该论文旨在解决当前自然语言到SQL(Natural Language to SQL, NL2SQL)系统评估中广泛使用的执行准确率(Execution Accuracy, EX)指标的可靠性问题。EX指标存在三大缺陷:对语法变化敏感、忽略用户问题可能存在的多义性,以及易受错误标注的SQL语句误导。为应对这一挑战,作者提出了一种以意图为中心的新评估指标ROSE,其核心创新在于采用对抗式“证明者-反驳者”级联机制:SQL证明者独立评估预测SQL是否符合用户意图,而对抗式反驳者则利用标准答案SQL作为证据来质疑和优化该判断。该方法摆脱了传统依赖参考SQL的评估范式,更贴近人类专家对语义正确性的判断,在专家对齐的验证集ROSE-VEC上与人类专家评价的一致性最高,其Cohen’s Kappa较次优指标高出近24%。

链接: https://arxiv.org/abs/2604.12988
作者: Wenqi Pei,Shizheng Hou,Boyan Li,Han Chen,Zhichao Shi,Yuyu Luo
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: ACL 2026 Main

点击查看摘要

Abstract:Execution Accuracy (EX), the widely used metric for evaluating the effectiveness of Natural Language to SQL (NL2SQL) solutions, is becoming increasingly unreliable. It is sensitive to syntactic variation, ignores that questions may admit multiple interpretations, and is easily misled by erroneous ground-truth SQL. To address this, we introduce ROSE, an intent-centered metric that focuses on whether the predicted SQL answers the question, rather than consistency with the ground-truth SQL under the reference-dependent paradigm. ROSE employs an adversarial Prover-Refuter cascade: SQL Prover assesses the semantic correctness of a predicted SQL against the user’s intent independently, while Adversarial Refuter uses the ground-truth SQL as evidence to challenge and refine this judgment. On our expert-aligned validation set ROSE-VEC, ROSE achieves the best agreement with human experts, outperforming the next-best metric by nearly 24% in Cohen’s Kappa. We also conduct a largescale re-evaluation of 19 NL2SQL methods, revealing four valuable insights. We release ROSE and ROSE-VEC to facilitate more reliable NL2SQL research.
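
摘要以 Cohen's Kappa 衡量各指标与人类专家判断的一致性,其定义为 (p_o − p_e) / (1 − p_e),其中 p_o 为观测一致率、p_e 为随机一致率。下例用假设的标注序列演示该计算:

```python
def cohens_kappa(labels_a, labels_b):
    """两组标注之间的 Cohen's Kappa 一致性系数。"""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # 观测一致率
    cats = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in cats)                                    # 随机一致率
    return (p_o - p_e) / (1 - p_e)

a = [1, 1, 0, 1, 0, 0, 1, 0]
b = [1, 0, 0, 1, 0, 1, 1, 0]
kappa = cohens_kappa(a, b)  # p_o = 0.75, p_e = 0.5 → kappa = 0.5
```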

[AI-4] Parallax: Why AI Agents That Think Must Never Act

【速读】:该论文旨在解决自主AI代理(Autonomous AI Agents)在具备执行真实世界操作能力(如读取文件、运行命令、发起网络请求等)时所暴露的根本性安全漏洞问题。当前主流的基于提示词(prompt-level)的安全防护机制因与威胁处于同一抽象层级,无法有效抵御攻击,尤其在代理推理系统被攻陷时完全失效。解决方案的关键在于提出Parallax架构范式,其核心是四大原则:认知-执行分离(Cognitive-Executive Separation),通过结构化隔离防止推理系统直接执行动作;对抗验证与渐进确定性(Adversarial Validation with Graduated Determinism),引入独立多层验证器作为中介;信息流控制(Information Flow Control),在代理工作流中传播数据敏感标签以识别上下文依赖威胁;以及可逆执行(Reversible Execution),记录破坏前状态以便在验证失败时回滚。该方案通过OpenParallax开源实现并经假设妥协评估(Assume-Compromise Evaluation)验证,在280个攻击案例中默认配置下阻断98.9%攻击且无误报,最大安全配置下实现100%防御效果,证明其架构边界在代理完全被控条件下依然有效。

链接: https://arxiv.org/abs/2604.12986
作者: Joel Fokou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 20 pages, 1 figure, 5 tables. Open-source reference implementation: this https URL . Documentation: this https URL . Feedback welcome via email or GitHub issues

点击查看摘要

Abstract:Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots by the end of 2026. As agents gain the ability to execute real-world actions (reading files, running commands, making network requests, modifying databases), a fundamental security gap has emerged. The dominant approach to agent safety relies on prompt-level guardrails: natural language instructions that operate at the same abstraction level as the threats they attempt to mitigate. This paper argues that prompt-based safety is architecturally insufficient for agents with execution capability and introduces Parallax, a paradigm for safe autonomous AI execution grounded in four principles: Cognitive-Executive Separation, which structurally prevents the reasoning system from executing actions; Adversarial Validation with Graduated Determinism, which interposes an independent, multi-tiered validator between reasoning and execution; Information Flow Control, which propagates data sensitivity labels through agent workflows to detect context-dependent threats; and Reversible Execution, which captures pre-destructive state to enable rollback when validation fails. We present OpenParallax, an open-source reference implementation in Go, and evaluate it using Assume-Compromise Evaluation, a methodology that bypasses the reasoning system entirely to test the architectural boundary under full agent compromise. Across 280 adversarial test cases in nine attack categories, Parallax blocks 98.9% of attacks with zero false positives under its default configuration, and 100% of attacks under its maximum-security configuration. When the reasoning system is compromised, prompt-level guardrails provide zero protection because they exist only within the compromised system; Parallax’s architectural boundary holds regardless.
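以下 Python 草图示意“认知-执行分离 + 独立验证器 + 可逆执行”三条原则的组合方式。仅为概念演示:OpenParallax 实为 Go 实现,此处的类名、动作名与策略均为假设。

```python
class Validator:
    """Independent validator interposed between reasoning and execution."""
    def __init__(self, blocked_commands):
        self.blocked = set(blocked_commands)

    def allow(self, action):
        return action["command"] not in self.blocked

class Executor:
    """Executes validated actions; destructive ones snapshot state first."""
    def __init__(self, validator):
        self.validator = validator
        self.files = {"config.txt": "v1"}   # stand-in for real system state
        self._snapshots = []

    def execute(self, action):
        # The reasoning system only emits action requests; even a fully
        # compromised reasoner cannot bypass this boundary.
        if not self.validator.allow(action):
            return "BLOCKED"
        if action["command"] == "write":
            self._snapshots.append(dict(self.files))  # reversible execution
            self.files[action["path"]] = action["content"]
        return "OK"

    def rollback(self):
        """Restore the pre-destructive state when validation fails later."""
        if self._snapshots:
            self.files = self._snapshots.pop()

executor = Executor(Validator(blocked_commands={"delete_all"}))
blocked = executor.execute({"command": "delete_all"})            # "BLOCKED"
executor.execute({"command": "write", "path": "config.txt", "content": "v2"})
executor.rollback()                        # config.txt is back to "v1"
```

关键点在于:拦截与回滚发生在推理系统之外,因此不依赖提示词层面的任何约束。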

[AI-5] Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在复杂信息检索任务中训练搜索代理时对黄金标注监督(gold supervision,如标准答案)的强依赖问题,这限制了方法的可扩展性。其解决方案的核心是提出一种无需黄金标注的循环一致性搜索(Cycle-Consistent Search, CCS)框架,关键在于利用循环一致性原理:一个高质量的搜索轨迹应能无损编码问题意图,从而具备重构原始问题的能力,由此构建出有效的奖励信号用于策略优化。为防止重建过程依赖表面词汇线索而非真实搜索过程,作者引入信息瓶颈机制,包括排除最终响应和对搜索查询进行命名实体识别(Named Entity Recognition, NER)掩码,迫使重建依赖于检索到的信息与结构化路径,确保奖励信号反映的是信息充分性而非语言冗余。

链接: https://arxiv.org/abs/2604.12967
作者: Sohyun An(1 and 2),Shuibenyang Yuan(1),Hayeon Lee(1),Cho-Jui Hsieh(2),Alexander Min(1) ((1) Meta Superintelligence Labs, (2) UCLA)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has shown strong potential for optimizing search agents in complex information retrieval tasks. However, existing approaches predominantly rely on gold supervision, such as ground-truth answers, which is difficult to scale. To address this limitation, we propose Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents, inspired by cycle-consistency techniques from unsupervised machine translation and image-to-image translation. Our key hypothesis is that an optimal search trajectory, unlike insufficient or irrelevant ones, serves as a lossless encoding of the question’s intent. Consequently, a high-quality trajectory should preserve the information required to accurately reconstruct the original question, thereby inducing a reward signal for policy optimization. However, naive cycle-consistency objectives are vulnerable to information leakage, as reconstruction may rely on superficial lexical cues rather than the underlying search process. To reduce this effect, we apply information bottlenecks, including exclusion of the final response and named entity recognition (NER) masking of search queries. These constraints force reconstruction to rely on retrieved observations together with the structural scaffold, ensuring that the resulting reward signal reflects informational adequacy rather than linguistic redundancy. Experiments on question-answering benchmarks show that CCS achieves performance comparable to supervised baselines while outperforming prior methods that do not rely on gold supervision. These results suggest that CCS provides a scalable training paradigm for training search agents in settings where gold supervision is unavailable.
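CCS 的信息瓶颈之一是对搜索查询做 NER 掩码。下面用一个基于大小写的正则近似做极简示意;真实系统应使用训练好的 NER 模型,此处仅为玩具实现。

```python
import re

def mask_entities(query, mask="[ENT]"):
    """Crude stand-in for NER masking: collapse runs of Capitalized words.

    A real CCS pipeline would use a trained NER model; this
    capitalization heuristic only illustrates the information bottleneck.
    """
    return re.sub(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b", mask, query)

masked = mask_entities("which award did Marie Curie receive in Paris")
# -> "which award did [ENT] receive in [ENT]"
# Question reconstruction must now rely on retrieved observations rather
# than leaked entity names, so the reward reflects search adequacy.
```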

[AI-6] Modeling Co-Pilots for Text-to-Model Translation AAAI’25

【速读】:该论文旨在解决如何利用大语言模型(Large Language Models, LLMs)实现自然语言到组合优化模型的统一翻译与建模问题,特别是针对满足性(satisfaction)和优化(optimization)两类问题的联合处理。现有研究多局限于特定求解器(solver-specific)的翻译或仅关注单一类型的问题,缺乏通用性和跨域适用性。本文的关键解决方案在于构建一个求解器无关(solver-agnostic)的统一架构,基于MiniZinc的范式无关建模能力,将自然语言描述的组合问题自动转化为形式化模型;同时提出Text2Zinc数据集和交互式编辑器,支持跨领域问题的标注与调试,并设计多种LLM策略(如零样本提示、思维链推理、知识图谱中间表示、语法编码及代理式分解)进行系统性比较。实验表明,尽管当前LLMs仍非“即插即用”技术,但所提出的框架在准确性和执行效率上优于现有方法,为推动文本到模型翻译的实用化提供了可扩展的开源工具链。

链接: https://arxiv.org/abs/2604.12955
作者: Serdar Kadioglu,Karthik Uppuluri,Akash Singirikonda
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: AAAI’25 Bridge Program on Machine Learning and Operations Research

点击查看摘要

Abstract:There is growing interest in leveraging large language models (LLMs) for text-to-model translation and optimization tasks. This paper aims to advance this line of research by introducing Text2Model and Text2Zinc. Text2Model is a suite of co-pilots based on several LLM strategies with varying complexity, along with an online leaderboard. Text2Zinc is a cross-domain dataset for capturing optimization and satisfaction problems specified in natural language, along with an interactive editor with built-in AI assistant. While there is an emerging literature on using LLMs for translating combinatorial problems into formal models, our work is the first attempt to integrate both satisfaction and optimization problems within a unified architecture and dataset. Moreover, our approach is solver-agnostic, unlike existing work that focuses on translation to a solver-specific model. To achieve this, we leverage MiniZinc’s solver-and-paradigm-agnostic modeling capabilities to formulate combinatorial problems. We conduct comprehensive experiments to compare execution and solution accuracy across several single- and multi-call strategies, including zero-shot prompting, chain-of-thought reasoning, intermediate representations via knowledge-graphs, grammar-based syntax encoding, and agentic approaches that decompose the model into sequential sub-tasks. Our co-pilot strategies are competitive, and in parts improve, recent research in this domain. Our findings indicate that while LLMs are promising they are not yet a push-button technology for combinatorial modeling. We contribute the Text2Model co-pilots and leaderboard, and the Text2Zinc dataset and interactive editor to open-source to support closing this performance gap.

[AI-7] Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)智能体在持久记忆存储中仅以扁平事实记录形式保存信息的问题,此类存储方式缺乏时间推理、变化追踪和跨会话聚合所需的上下文支持。解决方案的关键在于引入双痕迹记忆编码(dual-trace memory encoding),即每个存储的事实均配对一个具体的场景痕迹(scene trace),该痕迹是对信息学习时刻与情境的叙事重构。通过强制代理在编码阶段承诺特定情境细节,生成更丰富且更具区分度的记忆痕迹,从而显著提升复杂任务如时间推理、知识更新追踪和多会话聚合的表现,同时保持无额外token开销。

链接: https://arxiv.org/abs/2604.12948
作者: Benjamin Stern,Peter Nadel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 2 tables, 2 figures

点击查看摘要

Abstract:LLM agents with persistent memory store information as flat factual records, providing little context for temporal reasoning, change tracking, or cross-session aggregation. Inspired by the drawing effect [3], we introduce dual-trace memory encoding. In this method, each stored fact is paired with a concrete scene trace, a narrative reconstruction of the moment and context in which the information was learned. The agent is forced to commit to specific contextual details during encoding, creating richer, more distinctive memory traces. Using the LongMemEval-S benchmark (4,575 sessions, 100 recall questions), we compare dual-trace encoding against a fact-only control with matched coverage and format over 99 shared questions. Dual-trace achieves 73.7% overall accuracy versus 53.5%, a +20.2 percentage point (pp) gain (95% CI: [+12.1, +29.3], bootstrap p < 0.0001). Gains concentrate in temporal reasoning (+40pp), knowledge-update tracking (+25pp), and multi-session aggregation (+30pp), with no benefit for single-session retrieval, consistent with encoding specificity theory [8]. Token analysis shows dual-trace encoding achieves this gain at no additional cost. We additionally sketch an architectural design for adapting dual-trace encoding to coding agents, with preliminary pilot validation.
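双痕迹编码的核心是“事实 + 场景痕迹”成对存储。以下 Python 草图用模板拼接场景痕迹做概念演示;论文中场景痕迹由 LLM 在编码时生成,此处的类名与字段名均为假设。

```python
from dataclasses import dataclass, field

@dataclass
class DualTraceMemory:
    records: list = field(default_factory=list)

    def encode(self, fact, session_id, context):
        # Scene trace: a narrative reconstruction of when/where the fact
        # was learned. A template here; LLM-generated in the paper.
        scene = (f"In session {session_id}, while discussing {context}, "
                 f"the user mentioned: {fact}")
        self.records.append({"fact": fact, "scene": scene,
                             "session": session_id})

    def recall(self, keyword):
        # Matching against the scene trace lets contextual cues
        # (topic, session) retrieve facts the bare record would miss.
        kw = keyword.lower()
        return [r for r in self.records
                if kw in (r["fact"] + " " + r["scene"]).lower()]

mem = DualTraceMemory()
mem.encode("adopted a cat named Milo", session_id=3, context="weekend plans")
mem.encode("switched jobs to Acme Corp", session_id=7, context="career moves")
hits = mem.recall("weekend")   # found via the scene trace, not the fact
```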

[AI-8] CoDe-R: Refining Decompiler Output with LLMs via Rationale Guidance and Adaptive Inference IJCNN2026

【速读】:该论文旨在解决二进制反编译(Binary Decompilation)中因编译过程导致语义不可逆丢失所引发的“逻辑幻觉”(logical hallucinations)和“语义错位”(semantic misalignment)问题,这些问题常导致生成代码无法正确重执行(re-execution)。解决方案的关键在于提出一种轻量级两阶段代码精炼框架 CoDe-R:第一阶段引入语义认知增强(Semantic Cognitive Enhancement, SCE),通过基于推理的语义注入策略训练模型恢复高层算法意图与代码;第二阶段设计动态双路径回退机制(Dynamic Dual-Path Fallback, DDPF),在推理过程中通过混合验证策略自适应平衡语义恢复与语法稳定性,从而显著提升生成代码的可重执行率。

链接: https://arxiv.org/abs/2604.12913
作者: Qiang Zhang,Zhongnian Li
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 10 pages, 7 figures, 6 tables. Accepted by IJCNN 2026

点击查看摘要

Abstract:Binary decompilation is a critical reverse engineering task aimed at reconstructing high-level source code from stripped executables. Although Large Language Models (LLMs) have recently shown promise, they often suffer from “logical hallucinations” and “semantic misalignment” due to the irreversible semantic loss during compilation, resulting in generated code that fails to re-execute. In this study, we propose Cognitive Decompiler Refinement with Robustness (CoDe-R), a lightweight two-stage code refinement framework. The first stage introduces Semantic Cognitive Enhancement (SCE), a Rationale-Guided Semantic Injection strategy that trains the model to recover high-level algorithmic intent alongside code. The second stage introduces a Dynamic Dual-Path Fallback (DDPF) mechanism during inference, which adaptively balances semantic recovery and syntactic stability via a hybrid verification strategy. Evaluation on the HumanEval-Decompile benchmark demonstrates that CoDe-R (using a 1.3B backbone) establishes a new State-of-the-Art (SOTA) in the lightweight regime. Notably, it is the first 1.3B model to exceed an Average Re-executability Rate of 50.00%, significantly outperforming the baseline and effectively bridging the gap between efficient models and expert-level performance. Our code is available at this https URL.

[AI-9] BEAM: Bi-level Memory-adaptive Algorithmic Evolution for LLM -Powered Heuristic Design

【速读】:该论文旨在解决现有基于大语言模型的超启发式算法(Large Language Model-based Hyper Heuristic, LHH)在自动启发式设计中效率不足的问题,特别是其单层进化机制难以生成具备竞争力的完整求解器,且缺乏高层算法建模能力导致探索效率受限。解决方案的关键在于将启发式设计重构为双层优化(Bi-level Optimization)问题:外层通过遗传算法(Genetic Algorithm, GA)演化包含函数占位符的高层算法结构,内层则利用蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)实现这些占位符;同时引入自适应记忆模块(Adaptive Memory module)以增强复杂代码生成能力,并提出知识增强(Knowledge Augmentation, KA)流水线来提升复杂代码评估的有效性。该方法显著提升了启发式设计的性能,在CVRP混合算法设计中平均优化差距降低37.84%,并优于当前最优的最大独立集(Maximum Independent Set, MIS)求解器KaMIS。

链接: https://arxiv.org/abs/2604.12898
作者: Chuyang Xiang,Yichen Wei,Jiale Ma,Handing Wang,Junchi Yan
机构: 未知
类目: Artificial Intelligence (cs.AI); Combinatorics (math.CO)
备注: 24 pages, 11 figures

点击查看摘要

Abstract:Large Language Model-based Hyper Heuristic (LHH) has recently emerged as an efficient way for automatic heuristic design. However, most existing LHHs just perform well in optimizing a single function within a pre-defined solver. Their single-layer evolution makes them not effective enough to write a competent complete solver. While some variants incorporate hyperparameter tuning or attempt to generate complex code through iterative local modifications, they still lack a high-level algorithmic modeling, leading to limited exploration efficiency. To address this, we reformulate heuristic design as a Bi-level Optimization problem and propose \textbfBEAM (Bi-level Memory-adaptive Algorithmic Evolution). BEAM’s exterior layer evolves high-level algorithmic structures with function placeholders through genetic algorithm (GA), while the interior layer realizes these placeholders via Monte Carlo Tree Search (MCTS). We further introduce an Adaptive Memory module to facilitate complex code generation. To support the evaluation for complex code generation, we point out the limitations of starting LHHs from scratch or from code templates and introduce a Knowledge Augmentation (KA) Pipeline. Experimental results on several optimization problems demonstrate that BEAM significantly outperforms existing LHHs, notably reducing the optimality gap by 37.84% on aggregate in CVRP hybrid algorithm design. BEAM also designs a heuristic that outperforms SOTA Maximum Independent Set (MIS) solver KaMIS.
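下面以一个 0/1 背包玩具问题示意“外层结构 + 内层占位符实现”的双层搜索思路:外层固定为一个带评分占位符的贪心模板,内层穷举占位符的候选实现。论文中这两层分别由 GA 与 MCTS 承担,此处仅为极简示意,问题数据为虚构。

```python
ITEMS = [(60, 10), (100, 20), (120, 30)]   # (value, weight)
CAPACITY = 50

def greedy_knapsack(items, capacity, score):
    """Outer-level template: greedy by a pluggable 'score' placeholder."""
    total_v, total_w = 0, 0
    for v, w in sorted(items, key=lambda it: score(*it), reverse=True):
        if total_w + w <= capacity:
            total_v, total_w = total_v + v, total_w + w
    return total_v

# Inner level: candidate realizations of the score placeholder
# (enumerated here; BEAM would explore these with MCTS).
SCORES = {
    "value": lambda v, w: v,
    "lightest_first": lambda v, w: -w,
    "density": lambda v, w: v / w,
}

def bilevel_search():
    return max(((name, greedy_knapsack(ITEMS, CAPACITY, fn))
                for name, fn in SCORES.items()),
               key=lambda t: t[1])

best_name, best_value = bilevel_search()   # ("value", 220) on this toy
```

在这个玩具实例中,按价值排序的实现恰好得到最优解 220,而按密度或重量排序只能得到 160;双层搜索正是要自动发现这类“结构 + 实现”的最优组合。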

[AI-10] FastGrasp: Learning-based Whole-body Control method for Fast Dexterous Grasping with Mobile Manipulators

【速读】:该论文旨在解决移动机器人在高速抓取任务中面临的三大核心挑战:高动态运动下的冲击稳定问题、实时全身协调控制问题,以及跨不同物体和场景的泛化能力不足问题(受限于固定基座、简单夹爪或响应缓慢的触觉感知)。解决方案的关键在于提出一种名为FastGrasp的学习型框架,其创新性体现在两个阶段的强化学习策略:首先利用条件变分自编码器(conditional variational autoencoder)根据物体点云生成多样化的抓取候选方案;其次通过最优抓取选择引导移动基座、机械臂与末端执行器的协同运动;同时引入触觉反馈实现对冲击效应和物体变化的实时调整,从而显著提升抓取鲁棒性与泛化性能,并成功实现从仿真到现实世界的迁移。

链接: https://arxiv.org/abs/2604.12879
作者: Heng Tao,Yiming Zhong,Zemin Yang,Yuexin Ma
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fast grasping is critical for mobile robots in logistics, manufacturing, and service applications. Existing methods face fundamental challenges in impact stabilization under high-speed motion, real-time whole-body coordination, and generalization across diverse objects and scenarios, limited by fixed bases, simple grippers, or slow tactile response capabilities. We propose \textbfFastGrasp, a learning-based framework that integrates grasp guidance, whole-body control, and tactile feedback for mobile fast grasping. Our two-stage reinforcement learning strategy first generates diverse grasp candidates via conditional variational autoencoder conditioned on object point clouds, then executes coordinated movements of mobile base, arm, and hand guided by optimal grasp selection. Tactile sensing enables real-time grasp adjustments to handle impact effects and object variations. Extensive experiments demonstrate superior grasping performance in both simulation and real-world scenarios, achieving robust manipulation across diverse object geometries through effective sim-to-real transfer.

[AI-11] AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance

【速读】:该论文旨在解决生成式 AI (Generative AI) 安全评估领域中基准测试(benchmark)数量快速增长但测量体系缺乏一致性的问题。当前基准生态存在碎片化现象,即基准数量众多(共195个),但其指标定义、评估方法和治理结构不统一,导致研究者难以进行可比性分析与有效选择。解决方案的关键在于提出 AISafetyBenchExplorer——一个结构化的基准目录,通过多表 schema 记录基准级元数据、指标级定义、论文元数据及仓库活跃度,并引入复杂度分类法,从而实现对基准的可追溯管理、标准化比较与元评估,推动建立共享的测量语言、合理的基准遴选依据以及可持续的维护机制。

链接: https://arxiv.org/abs/2604.12875
作者: Abiodun A. Solanke
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:The rapid expansion of large language model (LLM) safety evaluation has produced a substantial benchmark ecosystem, but not a correspondingly coherent measurement ecosystem. We present AISafetyBenchExplorer, a structured catalogue of 195 AI safety benchmarks released between 2018 and 2026, organized through a multi-sheet schema that records benchmark-level metadata, metric-level definitions, benchmark-paper metadata, and repository activity. This design enables meta-analysis not only of what benchmarks exist, but also of how safety is operationalized, aggregated, and judged across the literature. Using the updated catalogue, we identify a central structural problem: benchmark proliferation has outpaced measurement standardization. The current landscape is dominated by medium-complexity benchmarks (94/195), while only 7 benchmarks occupy the Popular tier. The workbook further reports strong concentration around English-only evaluation (165/195), evaluation-only resources (170/195), stale GitHub repositories (137/195), stale Hugging Face datasets (96/195), and heavy reliance on arXiv preprints among benchmarks with known venue metadata. At the metric level, the catalogue shows that familiar labels such as accuracy, F1 score, safety score, and aggregate benchmark scores often conceal materially different judges, aggregation rules, and threat models. We argue that the field’s main failure mode is fragmentation rather than scarcity. Researchers now have many benchmark artifacts, but they often lack a shared measurement language, a principled basis for benchmark selection, and durable stewardship norms for post publication maintenance. AISafetyBenchExplorer addresses this gap by providing a traceable benchmark catalogue, a controlled metadata schema, and a complexity taxonomy that together support more rigorous benchmark discovery, comparison, and meta-evaluation.
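多表 schema 的思路可以用极简的“记录 + 汇总”草图示意:每个基准一条记录,汇总层按复杂度分层、语言覆盖与仓库活跃度聚合,正对应论文报告的那类统计量。以下记录为虚构,仅演示 schema 形态。

```python
from collections import Counter

# Invented records mimicking the catalogue's benchmark-level sheet.
CATALOGUE = [
    {"name": "ToySafeQA",    "languages": ["en"],       "tier": "medium",
     "repo_active": False},
    {"name": "ToyJailbreak", "languages": ["en", "zh"], "tier": "popular",
     "repo_active": True},
    {"name": "ToyRedTeam",   "languages": ["en"],       "tier": "medium",
     "repo_active": False},
]

def summarize(catalogue):
    """Aggregate tier counts, English-only coverage, and stale repos."""
    return {
        "by_tier": Counter(b["tier"] for b in catalogue),
        "english_only": sum(b["languages"] == ["en"] for b in catalogue),
        "stale_repos": sum(not b["repo_active"] for b in catalogue),
    }

summary = summarize(CATALOGUE)
# e.g. summary["english_only"] == 2, summary["stale_repos"] == 2
```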

[AI-12] LIFE – an energy efficient advanced continual learning agentic AI framework for frontier systems

【速读】:该论文旨在解决当前高性能计算(HPC)在AI驱动下的资源调度与管理面临的挑战,特别是由于生成式AI(Generative AI)快速发展导致的能效需求激增,以及现有基础持续学习能力不足限制了AI对HPC系统进行高效自适应管理的问题。解决方案的关键在于提出LIFE框架——一个以代理为中心(agent-centric)的增量式、灵活且节能的推理与学习系统,其核心创新包括:编排器(orchestrator)、代理上下文工程(Agentic Context Engineering)、新型记忆系统和信息晶格学习(information lattice learning),通过这四个组件协同实现HPC网络管理与运维的自我演化能力,并在Kubernetes类集群中针对关键微服务延迟突增问题验证了其闭环控制效果。

链接: https://arxiv.org/abs/2604.12874
作者: Anne Lee,Gurudutt Hosangadi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, 4 tables

点击查看摘要

Abstract:The rapid advancement of AI has changed the character of HPC usage such as dimensioning, provisioning, and execution. Not only has energy demand been amplified, but existing rudimentary continual learning capabilities limit ability of AI to effectively manage HPCs. This paper reviews emerging directions beyond monolithic transformers, emphasizing agentic AI and brain inspired architectures as complementary paths toward sustainable, adaptive systems. We propose LIFE, a reasoning and Learning framework that is Incremental, Flexible, and Energy efficient that is implemented as an agent centric system rather than a single monolithic model. LIFE uniquely combines four components to realize self evolving network management and operations in HPCs. The components are an orchestrator, Agentic Context Engineering, a novel memory system, and information lattice learning. LIFE can also generalize to enable a variety of orthogonal use cases. We ground LIFE in a specific closed loop HPC operations example for detecting and mitigating latency spikes experienced by critical micro services running on a Kubernetes like cluster.
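论文给出的闭环示例是检测并缓解关键微服务的延迟突增。下面是一个与之精神一致的极简阈值检测草图:滚动均值作基线,超阈值即产出缓解动作。动作名与阈值均为假设,并非真实的 Kubernetes API 调用。

```python
from collections import deque

class LatencySpikeGuard:
    """Toy closed-loop detector: rolling baseline + multiplicative threshold."""
    def __init__(self, window=3, threshold=2.0):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, latency_ms):
        """Return a mitigation action when a spike is detected, else None."""
        if len(self.samples) == self.samples.maxlen:
            baseline = sum(self.samples) / len(self.samples)
            if latency_ms > self.threshold * baseline:
                self.samples.append(latency_ms)
                # Action name is illustrative, not a real cluster API.
                return {"action": "scale_out_replicas",
                        "baseline_ms": baseline}
        self.samples.append(latency_ms)
        return None

guard = LatencySpikeGuard(window=3, threshold=2.0)
actions = [guard.observe(x) for x in [10, 11, 12, 40]]
# actions[:3] are None (warm-up); actions[3] flags the 40 ms spike
```

LIFE 的贡献在于让这类检测-缓解策略随运行数据持续演化,而非像此处一样写死阈值。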

[AI-13] QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence

【速读】:该论文旨在解决生成式 AI (Generative AI) 在垂直领域(特别是中文医疗深度搜索场景)中性能提升的挑战。其核心问题是:如何在医疗等专业领域构建高质量、长程推理能力的数据与训练机制,以突破现有模型在复杂任务中的表现上限。解决方案的关键在于提出 QuarkMedSearch 系统性框架,涵盖三个核心环节:1)通过融合大规模医学知识图谱与实时在线探索,构建长周期多跳医疗深度搜索训练数据;2)采用两阶段监督微调(SFT)与强化学习(RL)策略,逐步增强模型的规划、工具调用及反思能力,同时保障搜索效率;3)联合医学专家构建严谨的人工验证评估基准 QuarkMedSearch Benchmark,实现对模型性能的精准量化。实验表明,该方案在同类开源模型中达到最优性能,并保持通用基准上的竞争力。

链接: https://arxiv.org/abs/2604.12867
作者: Zhichao Lin,Zhichao Liang,Gaoqiang Liu,Meng Xu,Baoyu Xiang,Jian Xu,Guanjun Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As agentic foundation models continue to evolve, how to further improve their performance in vertical domains has become an important challenge. To this end, building upon Tongyi DeepResearch, a powerful agentic foundation model, we focus on the Chinese medical deep search scenario and propose QuarkMedSearch, systematically exploring a full-pipeline approach spanning medical multi-hop data construction, training strategies, and evaluation benchmarks to further push and assess its performance upper bound in vertical domains. Specifically, for data synthesis, to address the scarcity of deep search training data in the medical domain, we combine a large-scale medical knowledge graph with real-time online exploration to construct long-horizon medical deep search training data; for post-training, we adopt a two-stage SFT and RL training strategy that progressively enhances the model’s planning, tool invocation, and reflection capabilities required for deep search, while maintaining search efficiency; for evaluation, we collaborate with medical experts to construct the QuarkMedSearch Benchmark through rigorous manual verification. Experimental results demonstrate that QuarkMedSearch achieves state-of-the-art performance among open-source models of comparable scale on the QuarkMedSearch Benchmark, while also maintaining strong competitiveness on general benchmarks.
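“知识图谱 + 多跳游走”合成长程检索问题的思路可用极简草图示意:沿关系链游走,终点作为标准答案,中间实体在最终问题中被隐藏以强制多步搜索。图谱三元组为虚构,自然语言问题的措辞在论文流水线中应由 LLM 生成,此处不作复现。

```python
# Invented toy triples; keys are (head_entity, relation).
GRAPH = {
    ("metformin", "treats"): "type 2 diabetes",
    ("type 2 diabetes", "risk_factor"): "obesity",
    ("obesity", "measured_by"): "body mass index",
}

def walk(start, relations, graph):
    """Follow a relation chain; the endpoint becomes the gold answer.

    Intermediate entities stay hidden in the final question, which is
    what forces a search agent to resolve them hop by hop.
    """
    node, hops = start, []
    for rel in relations:
        nxt = graph.get((node, rel))
        if nxt is None:
            return None          # dead end: discard this candidate
        hops.append((node, rel, nxt))
        node = nxt
    return {"answer": node, "hops": hops}

sample = walk("metformin", ["treats", "risk_factor", "measured_by"], GRAPH)
# sample["answer"] == "body mass index", reached in 3 hops
```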

[AI-14] From edges to meaning: Semantic line sketches as a cognitive scaffold for ancient pictograph invention

【速读】:该论文试图解决的问题是:人类如何从稀疏的线条图中快速识别物体,以及大脑将高层语义知识转化为低层视觉符号的计算机制为何尚不明确。解决方案的关键在于提出一个受生物学启发的数字孪生视觉层级模型,该模型通过自底向上编码图像为低层特征、生成轮廓草图,并借助语义表示引导的顶向下反馈迭代优化草图,从而模拟人脑视觉皮层的前馈与循环架构;该模型生成的符号在结构上与跨文化早期象形文字(如埃及圣书体、甲骨文和原始楔形文字)高度相似,揭示了象形文字可能源于大脑对视觉输入进行边界驱动压缩的内在倾向,为理解人类首次将感知外化为符号的认知过程提供了神经计算基础。

链接: https://arxiv.org/abs/2604.12865
作者: Seowung Leem(1),Lin Gu(2),Ruogu Fang(1,3,4,5) ((1) J. Crayton Pruitt Family Dept. of Biomedical Engineering, University of Florida, Gainesville, FL, USA, (2) Research Institute of Electrical Communication, Tohoku University, Japan, (3) Dept. of Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA, (4) Dept. of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA, (5) Center for Cognitive Aging and Memory, University of Florida, Gainesville, FL, USA)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Humans readily recognize objects from sparse line drawings, a capacity that appears early in development and persists across cultures, suggesting neural rather than purely learned origins. Yet the computational mechanism by which the brain transforms high-level semantic knowledge into low-level visual symbols remains poorly understood. Here we propose that ancient pictographic writing emerged from the brain’s intrinsic tendency to compress visual input into stable, boundary-based abstractions. We construct a biologically inspired digital twin of the visual hierarchy that encodes an image into low-level features, generates a contour sketch, and iteratively refines it through top-down feedback guided by semantic representations, mirroring the feedforward and recurrent architecture of the human visual cortex. The resulting symbols bear striking structural resemblance to early pictographs across culturally distant writing systems, including Egyptian hieroglyphs, Chinese oracle bone characters, and proto-cuneiform, and offer candidate interpretations for undeciphered scripts. Our findings support a neuro-computational origin of pictographic writing and establish a framework in which AI can recapitulate the cognitive processes by which humans first externalized perception into symbols.

[AI-15] Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic

【速读】:该论文旨在解决当前混合自动驾驶交通(mixed autonomy traffic)仿真中缺乏系统性AI方法综述的问题,特别是现有仿真工具多依赖图形保真度和简单规则模型,难以准确刻画人类驾驶行为及车辆间复杂交互。其解决方案的关键在于提出一个结构化的分类体系(taxonomy),将AI方法分为三类:个体层级行为建模、环境层级仿真方法以及认知与物理信息融合的方法,并通过整合交通工程与计算机科学视角,全面分析现有仿真平台的不足,明确未来研究方向,从而推动混合自动驾驶交通仿真的精准化与智能化发展。

链接: https://arxiv.org/abs/2604.12857
作者: Saeed Rahmani,Shiva Rasouli,Daphne Cornelisse,Eugene Vinitsky,Bart van Arem,Simeon C. Calvert
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Autonomous vehicles (AVs) are now operating on public roads, which makes their testing and validation more critical than ever. Simulation offers a safe and controlled environment for evaluating AV performance in varied conditions. However, existing simulation tools mainly focus on graphical realism and rely on simple rule-based models and therefore fail to accurately represent the complexity of driving behaviors and interactions. Artificial intelligence (AI) has shown strong potential to address these limitations; however, despite the rapid progress across AI methodologies, a comprehensive survey of their application to mixed autonomy traffic simulation remains lacking. Existing surveys either focus on simulation tools without examining the AI methods behind them, or cover ego-centric decision-making without addressing the broader challenge of modeling surrounding traffic. Moreover, they do not offer a unified taxonomy of AI methods covering individual behavior modeling to full scene simulation. To address these gaps, this survey provides a structured review and synthesis of AI methods for modeling AV and human driving behavior in mixed autonomy traffic simulation. We introduce a taxonomy that organizes methods into three families: agent-level behavior models, environment-level simulation methods, and cognitive and physics-informed methods. The survey analyzes how existing simulation platforms fall short of the needs of mixed autonomy research and outlines directions to narrow this gap. It also provides a chronological overview of AI methods and reviews evaluation protocols and metrics, simulation tools, and datasets. By covering both traffic engineering and computer science perspectives, we aim to bridge the gap between these two communities.

[AI-16] Loop Corrections to the Training and Generalization Errors of Random Feature Models

【速读】:该论文旨在解决随机特征模型(random feature models)中训练、测试及泛化误差的精确刻画问题,尤其是在超越均值核近似(mean-kernel approximation)的有限宽度效应下,如何准确描述误差行为。传统方法通常仅考虑平均核的贡献,忽略了高阶波动统计对预测性能的影响;而本文的关键在于引入有效场论框架(effective field-theoretic framework),将这些有限宽度效应自然地表示为环修正(loop corrections),从而系统推导出训练误差、测试误差与泛化误差的修正项及其标度律,并通过实验验证了理论预测的准确性。

链接: https://arxiv.org/abs/2604.12827
作者: Taeyoung Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 17 pages, 4 figures

点击查看摘要

Abstract:We investigate random feature models in which neural networks sampled from a prescribed initialization ensemble are frozen and used as random features, with only the readout weights optimized. Adopting a statistical-physics viewpoint, we study the training, test, and generalization errors beyond the mean-kernel approximation. Since the predictor is a nonlinear functional of the induced random kernel, the ensemble-averaged errors depend not only on the mean kernel but also on higher-order fluctuation statistics. Within an effective field-theoretic framework, these finite-width contributions naturally appear as loop corrections. We derive the loop corrections to the training, test, and generalization errors, obtain their scaling laws, and support the theory with experimental verification.
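随机特征模型的设定可以用几行 numpy 复现:冻结随机第一层,仅用岭回归训练线性读出。玩具目标函数、宽度与超参数均为示例,与论文的环修正推导无直接对应,仅用于说明“只训练读出权重”这一设定。

```python
import numpy as np

rng = np.random.default_rng(0)

def random_feature_fit(X, y, width, ridge=1e-3):
    """Frozen random ReLU features + ridge-trained linear readout."""
    W = rng.normal(size=(X.shape[1], width)) / np.sqrt(X.shape[1])  # frozen
    Phi = np.maximum(X @ W, 0.0)                    # random features
    a = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(width), Phi.T @ y)
    return W, a

def predict(X, W, a):
    return np.maximum(X @ W, 0.0) @ a

X = rng.normal(size=(200, 2))
y = X[:, 0]                      # toy target: the first input coordinate
W, a = random_feature_fit(X, y, width=400)
train_mse = float(np.mean((predict(X, W, a) - y) ** 2))
# width > #samples, so the readout nearly interpolates the training set
```

论文研究的正是对这类模型在系综平均下,训练/测试误差相对“均值核”预测的有限宽度偏差。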

[AI-17] DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding CVPR2026

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在长文档理解任务中随文档长度增加而性能显著下降的问题。核心挑战在于:1)信号-噪声比(Signal-to-Noise Ratio, SNR)低,关键证据被淹没在无关页面中;2)监督信号稀缺,现有数据集仅提供最终简短答案,难以有效指导模型学习。解决方案的关键在于提出一种结构化的“分析(Analysis)、定位(Localization)和推理(Reasoning)”工作流,并设计两阶段训练框架:首先通过高效知识蒸馏生成高质量数据进行监督微调;随后采用基于证据感知的组相对策略优化(Evidence-aware Group Relative Policy Optimization),联合优化证据定位与答案准确性;同时引入证据引导的分辨率分配策略(Evidence-Guided Resolution Allocation)缓解多页文档训练中的内存限制。该方法显著提升了模型在域内和域外长文档任务上的性能,且具备从短页训练到超长文档的鲁棒泛化能力。

链接: https://arxiv.org/abs/2604.12812
作者: Hao Yan,Yuliang Liu,Xingchen Liu,Yuyi Zhang,Minghui Liao,Jihao Wu,Wei Chen,Xiang Bai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: CVPR 2026 Highlight

点击查看摘要

Abstract:Existing Multimodal Large Language Models (MLLMs) suffer from significant performance degradation on the long document understanding task as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured “Analysis, Localization and Reasoning” workflow. To instill this capability, we design a two-stage training framework: we first perform Supervised Fine-Tuning on high-quality data generated via an efficient knowledge distillation strategy. Subsequently, we employ an Evidence-aware Group Relative Policy Optimization which jointly optimizes for both evidence localization and answer accuracy. Additionally, we introduce an Evidence-Guided Resolution Allocation strategy to mitigate memory constraints of training on multi-page documents. Extensive experiments demonstrate that DocSeeker achieves superior performance on both in-domain and out-of-domain tasks. We show it robustly generalizes from short-page training to ultra-long documents and is naturally synergistic with visual Retrieval-Augmented Generation systems, serving as a solid foundation for their implementation.
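“证据引导的分辨率分配”可用一个极简 top-k 草图示意:按证据得分把有限的高分辨率预算分给最相关的页面,其余页面用低分辨率编码以节省显存。打分方式与预算大小均为假设,论文的实际策略未在此复现。

```python
def allocate_resolution(evidence_scores, high_res_budget):
    """Give 'high' resolution to the top-k evidence pages, 'low' to the rest."""
    order = sorted(range(len(evidence_scores)),
                   key=lambda i: evidence_scores[i], reverse=True)
    high = set(order[:high_res_budget])
    return ["high" if i in high else "low"
            for i in range(len(evidence_scores))]

# Toy scores for a 6-page document; only 2 pages fit the high-res budget.
plan = allocate_resolution([0.1, 0.9, 0.2, 0.8, 0.1, 0.3], high_res_budget=2)
# -> ['low', 'high', 'low', 'high', 'low', 'low']
```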

[AI-18] Algorithmic Analysis of Dense Associative Memory: Finite-Size Guarantees and Adversarial Robustness ICLR2026

【速读】:该论文旨在解决密集关联记忆(Dense Associative Memory, DAM)模型在有限样本规模下缺乏收敛性保证与容量分析的问题,尤其是现有基于热力学极限 $N \to \infty$ 的动力学分析无法提供可验证的有限-$N$ 收敛速率和鲁棒性边界。其解决方案的关键在于构建一种算法化分析框架,通过引入显式的模式分离假设(separation assumption)和高负载下的有界干扰条件(bounded-interference condition),首次证明了异步检索动态的几何收敛性(geometric convergence),从而获得 $O(\log N)$ 的收敛时间上界;同时建立了基于显式边距条件的对抗鲁棒性界限,并推导出在最坏情况下容量仍能保持 $\Theta(N^{n-1})$ 的渐近尺度(至多对数因子),且对随机模式集合恢复经典容量标度。此外,作者还揭示了DAM检索动力学具有潜在博弈(potential game)结构,确保异步更新下收敛至纯纳什均衡点。

链接: https://arxiv.org/abs/2604.12811
作者: Madhava Gaikwad
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 21 pages, 9 figures, Accepted in New Frontiers in Associative Memory workshop at ICLR 2026

点击查看摘要

Abstract:Dense Associative Memory (DAM) generalizes Hopfield networks through higher-order interactions and achieves storage capacity that scales as O(N^{n-1}) under suitable pattern separation conditions. Existing dynamical analyses primarily study the thermodynamic limit N \to \infty with randomly sampled patterns and therefore do not provide finite-size guarantees or explicit convergence rates. We develop an algorithmic analysis of DAM retrieval dynamics that yields finite-N guarantees under explicit, verifiable pattern conditions. Under a separation assumption and a bounded-interference condition at high loading, we prove geometric convergence of asynchronous retrieval dynamics, which implies O(\log N) convergence time once the trajectory enters the basin of attraction. We further establish adversarial robustness bounds expressed through an explicit margin condition that quantifies the number of corrupted bits tolerable per sweep, and derive capacity guarantees that scale as \Theta(N^{n-1}) up to polylogarithmic factors in the worst case, while recovering the classical \Theta(N^{n-1}) scaling for random pattern ensembles. Finally, we show that DAM retrieval dynamics admit a potential-game interpretation that ensures convergence to pure Nash equilibria under asynchronous updates. Complete proofs are provided in the appendices, together with preliminary experiments that illustrate the predicted convergence, robustness, and capacity scaling behavior.
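DAM 的异步检索动力学可用 numpy 草图复现:自旋 i 依局部场 h_i = Σ_μ ξ_i^μ (ξ^μ·s_{-i})^(n-1) 对齐(n=3 对应三阶交互);低负载下,从部分损坏的探针出发几次扫描即可收敛回存储模式。规模与损坏比例为玩具设定,与论文的有限-N 理论边界无直接对应。

```python
import numpy as np

rng = np.random.default_rng(1)

def dam_sweep(s, patterns, n=3):
    """One asynchronous sweep of a degree-n dense associative memory.

    Each spin aligns with h_i = sum_mu xi_i^mu * (xi^mu . s_{-i})^(n-1),
    the local field of the polynomial DAM energy.
    """
    s = s.copy()
    for i in range(len(s)):
        overlaps = patterns @ s
        # Remove spin i's own contribution before computing its field.
        h_i = np.sum(patterns[:, i]
                     * (overlaps - patterns[:, i] * s[i]) ** (n - 1))
        s[i] = 1 if h_i >= 0 else -1
    return s

N, P = 50, 4
patterns = rng.choice([-1, 1], size=(P, N))
probe = patterns[0].copy()
probe[:5] *= -1                      # corrupt 5 of 50 bits
for _ in range(5):
    probe = dam_sweep(probe, patterns, n=3)
recovered = bool(np.array_equal(probe, patterns[0]))   # True on this toy
```

在这种远低于容量上限的负载下,损坏模式的信号项以 (ξ^0·s)^2 量级压倒串扰项,与论文的有界干扰条件精神一致。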

[AI-19] Efficiency of Proportional Mechanisms in Online Auto-Bidding Advertising

【速读】:该论文旨在解决自动出价策略(auto-bidding)在在线广告拍卖机制中引发的效率问题,特别是针对比例机制(proportional mechanism)下纯纳什均衡(pure Nash equilibrium)的效率损失,以价格无政府状态(price of anarchy, PoA)作为衡量指标,目标是在液态福利(liquid welfare)优化目标下提升机制效率。解决方案的关键在于提出一种改进的比例机制,通过引入替代支付方案(alternative payment scheme),将PoA从标准机制下的紧界2降低至 1 + O(1)/(n-1),其中 n 为竞拍者数量,从而随着参与者增加逼近完全效率;该方法基于线性与凸规划中的对偶理论及Karush-Kuhn-Tucker (KKT) 条件进行分析,尽管形式简洁,却具备强大的理论推导能力,可推广至其他PoA边界证明场景。

链接: https://arxiv.org/abs/2604.12799
作者: Nguyen Kim Thang
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
备注:

点击查看摘要

Abstract:The rise of automated bidding strategies in online advertising presents new challenges in designing and analyzing efficient auction mechanisms. In this paper, we focus on proportional mechanisms within the context of auto-bidding and study the efficiency of pure Nash equilibria, specifically the price of anarchy (PoA), under the liquid welfare objective. We first establish a tight PoA bound of 2 for the standard proportional mechanism. Next, we introduce a modified version with an alternative payment scheme that achieves a PoA bound of 1 + \frac{O(1)}{n-1}, where n \geq 2 denotes the number of bidding agents. This improvement surpasses the existing PoA barrier of 2 and approaches full efficiency as the number of agents increases. Our methodology leverages duality and the Karush-Kuhn-Tucker (KKT) conditions from linear and convex programming. Despite its conceptual simplicity, our approach proves powerful and may offer broader applications for establishing PoA bounds.
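摘要中的两个核心对象可以用几行代码描述:比例机制下竞拍者 i 按出价占比 b_i/Σ_j b_j 获得份额,液态福利为 Σ_i min(B_i, v_i·x_i)。以下只是这两个定义的极简示意(论文的替代支付方案未在此展开),所有数值均为假设。

```python
def proportional_allocation(bids):
    """比例机制的分配规则:竞拍者 i 获得份额 b_i / sum_j b_j。"""
    total = sum(bids)
    if total == 0:
        return [0.0] * len(bids)
    return [b / total for b in bids]

def liquid_welfare(values, budgets, alloc):
    """液态福利目标:sum_i min(B_i, v_i * x_i),预算 B_i 封顶各自的实现价值。"""
    return sum(min(B, v * x) for v, B, x in zip(values, budgets, alloc))
```

例如两位竞拍者出价 1 和 3 时,份额为 0.25 与 0.75;预算约束会把高价值但低预算的竞拍者的福利贡献截断。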

[AI-20] VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation

【速读】:该论文旨在解决FlashAttention-style在线softmax在现代加速器上因非矩阵乘法(non-matmul)组件(尤其是每tile的rowmax和rowsum归约及重缩放链)成为向量或SIMD瓶颈,从而主导计算延迟的问题。其核心解决方案是提出Vector Relieved Flash Attention (VFA),通过三种关键优化缓解行最大值更新带来的开销:首先利用键块(key-block)表示的廉价近似初始化运行最大值;其次重新排序键块遍历顺序,优先处理高影响的sink和局部块以早期稳定运行最大值;最后对剩余块冻结最大值,避免重复归约与重缩放操作。进一步地,将VFA与块稀疏跳过方法(如BLASST)结合形成Vector Relieved Sparse Attention (VSA),同时减少块数量和每块开销。值得注意的是,VFA和VSA完全规避了FA4.0中更新阶段使用的条件重缩放操作,实验证明该设计在不损失性能的前提下显著提升了效率,在C16V32基线基础上,C8V32、C4V32和C4V16分别实现接近两倍加速,且随着架构改进(如增强指数容量),C4V16有望达到六倍加速。

链接: https://arxiv.org/abs/2604.12798
作者: Yupeng Sun,Yanzhao Li,Zhiqiang Zou,Bai Du,Zhiyuan Zhang,Hui Dong,Gaoyige Fan,Hui Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:FlashAttention-style online softmax enables exact attention computation with linear memory by streaming score tiles through on-chip memory and maintaining a running maximum and normalizer. However, as attention kernels approach peak tensor-core/cube-core throughput on modern accelerators, non-matmul components of online softmax – especially per-tile rowmax and rowsum reductions and rescale chains – can become vector or SIMD limited and dominate latency. This paper revisits FlashAttention and proposes Vector Relieved Flash Attention (VFA), a hardware-friendly method that reduces rowmax-driven updates of the running maximum while retaining the online-softmax structure. VFA initializes the running maximum via a cheap approximation from key-block representations, reorders key-block traversal to prioritize high-impact sink and local blocks, and freezes the maximum for remaining blocks to avoid repeated reductions and rescaling. We further integrate VFA with block-sparse skipping methods such as BLASST to form Vector Relieved Sparse Attention (VSA), which reduces both block count and per-block overhead. Notably, VFA and VSA completely avoid the conditional rescale operation in the update stage used in FA4.0. Extensive evaluations on benchmarks including MMLU and MATH500, together with attention statistics, verify our design: (i) sink and local reordering stabilizes the running maximum early; (ii) simple Q and K block summaries fail due to intra-block heterogeneity; (iii) m-initialization is required when maxima appear in middle blocks. Overall, VFA and VSA efficiently alleviate online-softmax reduction bottlenecks without performance loss. Compared to the C16V32 baseline, C8V32, C4V32 and C4V16 achieve nearly two times speedup on modern hardware while hitting the vector bottleneck. With upcoming architecture improvements, C4V16 will deliver six times speedup by enhancing exponent capacity.
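VFA 针对的瓶颈来自 FlashAttention 风格在线 softmax 中每个 tile 的 rowmax/rowsum 归约与重缩放链。下面的纯 NumPy 草图复现了这条基线路径(单行注意力、按 tile 流式计算);VFA 的改动是近似初始化并在多数块上冻结运行最大值 m 以省去重缩放,此处未实现,仅作原理示意。

```python
import numpy as np

def online_softmax_weighted(score_tiles, value_tiles):
    """FlashAttention 风格在线 softmax:流式维护运行最大值 m、
    归一化因子 l 与加权累加器 acc,结果与一次性全量 softmax 等价。"""
    m, l = -np.inf, 0.0
    acc = np.zeros(value_tiles[0].shape[1])
    for s, v in zip(score_tiles, value_tiles):
        m_new = max(m, float(s.max()))               # 每 tile 的 rowmax 归约
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        l = l * scale + float(np.exp(s - m_new).sum())   # rowsum 归约
        acc = acc * scale + np.exp(s - m_new) @ v        # 重缩放链
        m = m_new
    return acc / l
```

可以验证流式结果与对拼接后的完整得分做一次 softmax 再加权完全一致;正是循环里的 max/sum/rescale 这些非矩阵乘操作构成了论文所说的向量瓶颈。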

[AI-21] OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension

【速读】:该论文旨在解决4-bit量化在部署大语言模型(Large Language Models, LLMs)时因激活值异常点(activation outliers)导致的显著精度下降问题,其核心在于低比特格式动态范围受限引发的误差累积。解决方案的关键在于发现并利用“token-persistent结构聚类效应”——即高幅值异常点在不同token中稳定占据特定通道;基于此,作者提出OSC(Outlier Separation in Channel dimension)框架,在推理阶段采用双路径计算:一条为4-bit低精度通用矩阵乘法(GEMM)路径,另一条为16-bit高精度分支GEMM路径。通过离线分组策略识别异常通道,并在线进行结构化子张量提取,将散落的异常通道聚合成紧凑稠密张量,从而以规则且高吞吐的方式实现异常保护,无缝适配现代4-bit微缩硬件架构。

链接: https://arxiv.org/abs/2604.12782
作者: Zhiyuan Zhang,Yanzhao Li,Zhiqiang Zou,Bai Du,Yupeng Sun,Hui Dong,Hui Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While 4-bit quantization is essential for high-throughput deployment of Large Language Models, activation outliers often lead to significant accuracy degradation due to the restricted dynamic range of low-bit formats. In this paper, we systematically investigate the spatial distribution of outliers and demonstrate a token-persistent structural clustering effect, where high-magnitude outliers consistently occupy fixed channels across tokens. Building on this insight, we propose OSC, a hardware-efficient framework for outlier suppression. During inference, OSC executes a dual-path computation consisting of a low-precision 4-bit General Matrix Multiplication (GEMM) path and a high-precision 16-bit branch GEMM path. Specifically, OSC uses an offline group-wise strategy to identify the channels where outliers are located and then performs structured sub-tensor extraction to coalesce these scattered activation channels into a compact dense tensor online. This mechanism implements outlier protection through regularized and high-throughput GEMM operations, achieving a seamless fit with modern 4-bit micro-scaling hardware. Furthermore, for the inputs of W2 where outlier clustering is less pronounced, we integrate a fallback strategy to FP8. Evaluation on Qwen3-8B and Qwen3-30B restricts the average accuracy drop to 2.19 and 1.12 points, respectively. Notably, OSC is highly hardware-friendly, achieving a peak speedup of 1.78x over the W8A8 GEMM baseline on a modern AI accelerator.
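OSC 的核心思路是把少数离群通道抽出来走高精度路径,其余通道走 4-bit 路径。下面是一个与硬件无关的数值示意:按通道幅值取 top-k 作为离群通道,其余通道做 per-tensor 对称伪量化。阈值选法、量化粒度均为演示假设,并非论文实现。

```python
import numpy as np

def split_outlier_channels(x, k):
    """按各通道最大幅值选出 top-k 离群通道,返回布尔掩码。"""
    mags = np.abs(x).max(axis=0)
    mask = np.zeros(x.shape[1], dtype=bool)
    mask[np.argsort(mags)[-k:]] = True
    return mask

def fake_quant4(x):
    """对称 4-bit 伪量化(量化后立即反量化,便于观察量化误差)。"""
    s = max(float(np.abs(x).max()), 1e-8) / 7.0
    return np.clip(np.round(x / s), -8, 7) * s

def dual_path_matmul(x, w, k_outliers=1):
    """双路径 GEMM:离群通道走高精度分支,其余通道走 4-bit 路径。"""
    mask = split_outlier_channels(x, k_outliers)
    lo = fake_quant4(x[:, ~mask]) @ w[~mask, :]   # 低精度路径
    hi = x[:, mask] @ w[mask, :]                  # 高精度分支
    return lo + hi
```

把离群通道剥离后,剩余通道的量化步长大幅缩小,整体误差显著低于对全张量直接做 4-bit 量化。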

[AI-22] GF-Score: Certified Class-Conditional Robustness Evaluation with Fairness Guarantees

【速读】:该论文旨在解决神经网络在安全关键应用中部署时,现有对抗鲁棒性评估方法存在的两大问题:一是标准评估方法依赖昂贵的对抗攻击来计算鲁棒性,二是仅报告单一聚合分数,无法揭示不同类别间鲁棒性的分布差异。为此,作者提出GF-Score(GREAT-Fairness Score)框架,其核心创新在于将原始的认证鲁棒性得分(Certified Robustness Score)分解为每个类别的鲁棒性轮廓,并通过福利经济学中的四个指标(鲁棒性差异指数RDI、归一化鲁棒性基尼系数NRGC、最差类别鲁棒性WCR以及公平惩罚的GREAT分数FP-GREAT)量化类别间的不公平性;同时引入自校准机制,仅利用干净准确率的相关性调整温度参数,从而消除对对抗攻击的依赖。这一方案构建了一个无需对抗攻击的实用审计流程,能够精准诊断认证鲁棒性保障是否均匀覆盖所有类别。

链接: https://arxiv.org/abs/2604.12757
作者: Arya Shah,Kaveri Visavadiya,Manisha Padala
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 5 tables, 9 figures

点击查看摘要

Abstract:Adversarial robustness is essential for deploying neural networks in safety-critical applications, yet standard evaluation methods either require expensive adversarial attacks or report only a single aggregate score that obscures how robustness is distributed across classes. We introduce the GF-Score (GREAT-Fairness Score), a framework that decomposes the certified GREAT Score into per-class robustness profiles and quantifies their disparity through four metrics grounded in welfare economics: the Robustness Disparity Index (RDI), the Normalized Robustness Gini Coefficient (NRGC), Worst-Case Class Robustness (WCR), and a Fairness-Penalized GREAT Score (FP-GREAT). The framework further eliminates the original method's dependence on adversarial attacks through a self-calibration procedure that tunes the temperature parameter using only clean accuracy correlations. Evaluating 22 models from RobustBench across CIFAR-10 and ImageNet, we find that the decomposition is exact, that per-class scores reveal consistent vulnerability patterns (e.g., "cat" is the weakest class in 76% of CIFAR-10 models), and that more robust models tend to exhibit greater class-level disparity. These results establish a practical, attack-free auditing pipeline for diagnosing where certified robustness guarantees fail to protect all classes equally. We release our code on GitHub: this https URL.
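GF-Score 把整体鲁棒性分解为每类得分后,再用福利经济学指标刻画类间不均。下面给出其中两个最直观指标的通用计算草图:基尼系数(对应 NRGC 的未归一化基础形式)与最差类别鲁棒性(WCR)。具体归一化与惩罚项以论文为准,这里只是标准定义的示例实现。

```python
def robustness_gini(scores):
    """基尼系数:0 表示各类鲁棒性完全均等,越接近 (n-1)/n 越不均。"""
    s = sorted(scores)
    n = len(s)
    total = sum(s)
    if total == 0:
        return 0.0
    cum = sum((i + 1) * x for i, x in enumerate(s))   # 按升序的加权累计
    return (2 * cum) / (n * total) - (n + 1) / n

def worst_class_robustness(scores):
    """WCR:最弱类别的鲁棒性得分。"""
    return min(scores)
```

均匀的每类得分给出基尼系数 0;全部鲁棒性集中在单个类别时,基尼系数达到上界 (n-1)/n。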

[AI-23] Can AI Tools Transform Low-Demand Math Tasks? An Evaluation of Task Modification Capabilities

【速读】:该论文旨在解决当前对生成式 AI (Generative AI) 在提升数学教学任务认知需求水平方面能力的认知空白问题,即现有研究多聚焦于 AI 对任务质量的分类能力,却较少探讨其能否有效改造低认知需求任务以提高教学价值。研究的关键在于通过系统测试11种AI工具(包括通用型与数学教师专用型)在真实教学情境下对两类低认知需求数学任务进行升级的能力,并采用Task Analysis Guide框架评估其效果;结果表明,尽管部分工具能较成功地完成任务升级(平均准确率64%),但普遍存在“不足”(维持原低要求)或“过度”(目标过高而难以被教师采纳)两种失败模式,且任务分类能力与升级能力之间存在负相关关系(r = -0.35),说明生成任务的能力独立于判断任务的能力,提示未来需发展专门面向教师需求的AI支持策略以实现更有效的课程材料适应性改进。

链接: https://arxiv.org/abs/2604.12743
作者: Danielle S. Fox,Brenda L. Robles,Elizabeth DiPietro Brovey,Christian D. Schunn
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 1 figure

点击查看摘要

Abstract:While recent research has explored AI tools' ability to classify the quality of mathematical tasks (arXiv:2603.03512), little is known about their capacity to increase the quality of existing tasks. This study investigated whether AI tools could successfully upgrade low-cognitive-demand mathematics tasks. Eleven tools were tested, including six broadly available, general-purpose AI tools (e.g., ChatGPT and Claude) and five tools specialized for mathematics teachers (e.g., Khanmigo, this http URL). Using the Task Analysis Guide framework (Stein & Smith, 1998), we prompted AI tools to modify two different types of low-demand mathematical tasks. The prompting strategy aimed to represent likely approaches taken by knowledgeable teachers, rather than extensive optimization to find a more effective prompt (i.e., an optimistic typical outcome). On average, AI tools were only moderately successful: tasks were accurately upgraded only 64% of the time, with different AI tool performance ranging from quite weak (33%) to broadly successful (88%). Specialized tools were only moderately more successful than general-purpose tools. Failure modes included both "undershooting" (maintaining low cognitive demand) and "overshooting" (elevating tasks to an overly ambitious target category that likely would be rejected by teachers). Interestingly, there was a small negative correlation (r = -.35) between whether a given AI tool was able to correctly classify the cognitive demand of tasks and whether the AI was able to upgrade tasks, showing that the ability to modify tasks (i.e., a generative task) represents a distinct capability from the ability to classify them (i.e., judgement using a rubric). These findings have important implications for understanding AI's potential role in curriculum adaptation and highlight the need for specialized approaches to support teachers in modifying instructional materials.

[AI-24] Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的自主代理在复杂现实场景中难以有效利用任务结构、关键约束和过往经验的问题。其解决方案的关键在于提出一种基于案例的学习(Case-Based Learning, CBL)框架,该框架将历史任务经验转化为可复用的知识资产,通过提取并重用任务相关的知识、分析提示(analytical prompts)和操作技能,实现跨任务的知识迁移与结构化分析能力。相较于依赖预训练知识或静态提示的方法,该框架强调从真实案例中动态提取和再利用任务相关要素,从而显著提升代理在复杂任务中的表现,尤其在高复杂度任务中优势更为明显。

链接: https://arxiv.org/abs/2604.12717
作者: Zhenyu Ma,Yuyang Song,Chunyi Yang,Jingyi Zhu,Letian Yang,Xukai Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-based autonomous agents perform well on general reasoning tasks but still struggle to reliably use task structure, key constraints, and prior experience in complex real-world settings. We propose a case-based learning framework that converts experience from past tasks into reusable knowledge assets, allowing agents to transfer prior case experience to new tasks and perform more structured analysis. Unlike methods based mainly on pretrained knowledge or static prompts, our framework emphasizes extracting and reusing task-relevant knowledge, analytical prompts, and operational skills from real cases. We evaluate the method on a unified benchmark of six complex task categories and compare it with Zero-Shot, Few-Shot, Checklist Prompt, and Rule Memory baselines. Results show that our method achieves consistently strong performance across all tasks and matches or outperforms the best baseline in every case, with especially clear gains on more complex tasks. Further analysis shows that the advantage of case-based learning increases with task complexity, and that practical knowledge acquired by one agent can be reused by others. These findings suggest that case-based learning offers a promising path for building professional agents for real-world work.
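案例库学习(CBL)的第一步通常是按相似度检索相关历史案例,再复用其中的知识资产。下面是一个与具体模型无关的余弦相似度检索草图,案例名称与向量均为演示假设,仅用于说明"从过往任务经验中取用案例"的检索环节。

```python
import math

def retrieve_cases(query_vec, case_bank, k=2):
    """按余弦相似度返回与查询最相关的 k 个历史案例名。

    case_bank: {案例名: 案例向量},向量可来自任意嵌入模型。"""
    def cos(u, v):
        du = math.sqrt(sum(x * x for x in u))
        dv = math.sqrt(sum(x * x for x in v))
        return sum(a * b for a, b in zip(u, v)) / (du * dv) if du and dv else 0.0
    ranked = sorted(case_bank.items(),
                    key=lambda kv: cos(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]
```

检索出的案例随后可作为上下文注入新任务的提示,实现经验的跨任务迁移。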

[AI-25] MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games

【速读】:该论文旨在解决复杂多轮交互中人类意图识别的难题,尤其针对现实场景下参与者需长期维持复杂欺骗性叙事的战略性互动问题。现有数据集多聚焦于单轮或简单对话,难以支持对长程语境和因果推理的分析。为此,作者提出MISID基准,其具有多模态、多轮次、多参与者的特性,并采用细粒度的两级多维标注体系以支持证据驱动的因果追踪。解决方案的关键在于提出的FRACTAM框架,该框架基于“解耦-锚定-推理”范式:通过提取纯单模态事实表征降低文本偏差,利用两阶段检索实现长程事实锚定,并构建显式的跨模态证据链,从而显著提升主流多模态大模型在复杂战略任务中的隐藏意图检测与推理能力。

链接: https://arxiv.org/abs/2604.12700
作者: Shufang Lin,Muyang Chen,Xiabing Zhou,Rongrong Zhang,Dayou Zhang,Fangxin Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:Understanding human intent in complex multi-turn interactions remains a fundamental challenge in human-computer interaction and behavioral analysis. While existing intent recognition datasets focus mainly on single utterances or simple dialogues, real-world scenarios often involve sophisticated strategic interactions where participants must maintain complex deceptive narratives over extended periods. To address this gap, we introduce MISID, a comprehensive multimodal, multi-turn, and multi-participant benchmark for intent recognition. Sourced from high-stakes social strategy games, MISID features a fine-grained, two-tier multi-dimensional annotation scheme tailored for long-context discourse analysis and evidence-based causal tracking. Our systematic evaluation of state-of-the-art Multimodal Large Language Models (MLLMs) on MISID reveals critical deficiencies in complex scenarios, including text-prior visual hallucination, impaired cross-modal synergy, and limited capacity in chaining causal cues. Consequently, we propose FRACTAM as a baseline framework. Using a "Decouple-Anchor-Reason" paradigm, FRACTAM reduces text bias by extracting pure unimodal factual representations, employs two-stage retrieval for long-range factual anchoring, and constructs explicit cross-modal evidence chains. Extensive experiments demonstrate that FRACTAM enhances mainstream models' performance in complex strategic tasks, improving hidden intent detection and inference while maintaining robust perceptual accuracy. Our dataset is available at this https URL.

[AI-26] BID-LoRA: A Parameter-Efficient Framework for Continual Learning and Unlearning

【速读】:该论文旨在解决持续学习(Continual Learning, CL)与机器遗忘(Machine Unlearning, MU)在统一框架下协同工作的关键挑战,即现有方法在简单组合CL与MU时会导致知识泄露和基础模型性能的渐进式退化。其解决方案的核心是提出一种名为双向低秩适配(Bi-Directional Low-Rank Adaptation, BID-LoRA)的新框架,通过三个专用适配路径(保留、新增、遗忘)作用于注意力层,并引入“逃逸遗忘”机制——将被删除类别的嵌入向量推向与保留知识最远的位置,仅更新5%的参数,从而实现精准删除、高效增量学习和最小化跨周期知识泄露的三重目标。

链接: https://arxiv.org/abs/2604.12686
作者: Jagadeesh Rachapudi,Ritali Vatsi,Praful Hambarde,Amit Shukla
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in deep learning underscore the need for systems that can not only acquire new knowledge through Continual Learning (CL) but also remove outdated, sensitive, or private information through Machine Unlearning (MU). However, while CL methods are well-developed, MU techniques remain in early stages, creating a critical gap for unified frameworks that depend on both capabilities. We find that naively combining existing CL and MU approaches results in knowledge leakage, a gradual degradation of foundational knowledge across repeated adaptation cycles. To address this, we formalize Continual Learning Unlearning (CLU) as a unified paradigm with three key goals: (i) precise deletion of unwanted knowledge, (ii) efficient integration of new knowledge while preserving prior information, and (iii) minimizing knowledge leakage across cycles. We propose Bi-Directional Low-Rank Adaptation (BID-LoRA), a novel framework featuring three dedicated adapter pathways (retain, new, and unlearn) applied to attention layers, combined with escape unlearning that pushes forget-class embeddings to positions maximally distant from retained knowledge, updating only 5% of parameters. Experiments on CIFAR-100 show that BID-LoRA outperforms CLU baselines across multiple adaptation cycles. We further evaluate on CASIA-Face100, a curated face recognition subset, demonstrating practical applicability to real-world identity management systems where new users must be enrolled and withdrawn users removed.
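BID-LoRA 的结构可以用"冻结权重 + 三条低秩路径"的最小草图理解:每条路径是一对低秩矩阵 (A, B),增量为 ΔW = AB,推理时可按需启用或停用某条路径。以下实现仅为示意,维度与初始化均为假设,不含论文中的注意力层集成与逃逸遗忘损失。

```python
import numpy as np

class BidLoRALayer:
    """冻结权重 W 外挂 retain/new/unlearn 三条低秩适配路径的示意层。"""

    def __init__(self, d_in, d_out, r=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_in, d_out))          # 冻结的预训练权重
        # 零初始化的可训练低秩对:(A, B),增量 ΔW = A @ B
        self.adapters = {name: (np.zeros((d_in, r)), np.zeros((r, d_out)))
                         for name in ("retain", "new", "unlearn")}

    def forward(self, x, active=("retain", "new", "unlearn")):
        y = x @ self.W
        for name in active:                              # 路径可按需启停
            A, B = self.adapters[name]
            y = y + x @ A @ B
        return y
```

零初始化保证加入适配器不改变初始输出;训练后只需存储低秩矩阵,参数开销与论文中"仅更新约 5% 参数"的思路一致。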

[AI-27] A hierarchical spatial-aware algorithm with efficient reinforcement learning for human-robot task planning and allocation in production

【速读】:该论文旨在解决复杂动态制造环境中人机协同任务规划与分配(Task Planning and Allocation, TPA)的难题,尤其关注人类和机器人在空间位置变化及移动距离等实时因素下的高效协作问题。解决方案的关键在于提出一种分层式实时TPA算法——EBQSAP方法:高层采用基于缓冲区的深度Q学习(Efficient Buffer-based Deep Q-learning, EBQ),以降低训练时间并提升长期稀疏奖励场景下的性能;低层则设计了一种基于路径规划的空间感知分配方法(Spatially Aware Path planning, SAP),依据实时空间信息将任务精准分配给合适的人或机器人资源,从而实现子任务的有序执行。

链接: https://arxiv.org/abs/2604.12669
作者: Jintao Xue,Xiao Li,Nianmin Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This is the accepted manuscript of a journal article accepted for publication in Robotics and Computer-Integrated Manufacturing (Elsevier)

点击查看摘要

Abstract:In advanced manufacturing systems, humans and robots collaborate to conduct the production process. Effective task planning and allocation (TPA) is crucial for achieving high production efficiency, yet it remains challenging in complex and dynamic manufacturing environments. The dynamic nature of humans and robots, particularly the need to consider spatial information (e.g., humans’ real-time position and the distance they need to move to complete a task), substantially complicates TPA. To address the above challenges, we decompose production tasks into manageable subtasks. We then implement a real-time hierarchical human-robot TPA algorithm, including a high-level agent for task planning and a low-level agent for task allocation. For the high-level agent, we propose an efficient buffer-based deep Q-learning method (EBQ), which reduces training time and enhances performance in production problems with long-term and sparse reward challenges. For the low-level agent, a path planning-based spatially aware method (SAP) is designed to allocate tasks to the appropriate human-robot resources, thereby achieving the corresponding sequential subtasks. We conducted experiments on a complex real-time production process in a 3D simulator. The results demonstrate that our proposed EBQSAP method effectively addresses human-robot TPA problems in complex and dynamic production processes.
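高层的 EBQ 属于带经验缓冲区的 Q 学习;其"缓冲区采样 + TD 更新"的基本循环可以退化为下面的表格版草图(链式环境:一直向右走到终点得 +1,奖励稀疏且滞后)。环境与超参数均为演示假设,仅用于说明回放式更新如何加速稀疏奖励的传播。

```python
import random

def train_chain_q(n_states=5, episodes=300, alpha=0.2, gamma=0.9,
                  eps=0.2, batch=8, buf_cap=500, seed=0):
    """带经验回放的表格 Q-learning:动作 1=右、0=左,到最右端得 +1。"""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    buf = []                                        # 经验缓冲区
    for _ in range(episodes):
        s = 0
        while s < n_states - 1:
            a = rng.randrange(2) if rng.random() < eps \
                else max((1, 0), key=lambda act: Q[s][act])
            s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == n_states - 1 else 0.0
            buf.append((s, a, r, s2))
            if len(buf) > buf_cap:
                buf.pop(0)
            # 从缓冲区小批量采样做 TD 更新,重复利用稀疏奖励
            for bs, ba, br, bs2 in rng.sample(buf, min(batch, len(buf))):
                done = bs2 == n_states - 1
                target = br + (0.0 if done else gamma * max(Q[bs2]))
                Q[bs][ba] += alpha * (target - Q[bs][ba])
            s = s2
    return Q
```

训练结束后,每个非终止状态上"向右"的 Q 值都应高于"向左",即贪心策略直达终点。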

[AI-28] Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production

【速读】:该论文旨在解决人机协同制造(Human-Robot Collaborative Manufacturing)中的动态任务规划与分配(HRTPA)问题,核心挑战在于如何在保障工人物理疲劳处于安全阈值内的前提下,实时优化任务执行时机与角色分配以提升整体效率。传统方法依赖静态预设的疲劳-恢复模型参数,难以适应个体每日疲劳敏感度的变化(如睡眠不足或工作环境改变),导致决策不可靠。解决方案的关键在于提出一种基于粒子滤波(Particle Filter, PF)与约束双深度双Q学习(Constrained Dueling Double Deep Q-Network, CD3Q)融合的安全强化学习框架——PF-CD3Q:首先利用PF在线估计并更新疲劳模型参数,实现对个体疲劳状态的动态追踪;进而将任务级疲劳预测嵌入决策过程,通过限制动作空间排除高风险任务,从而将问题建模为带约束的马尔可夫决策过程(Constrained Markov Decision Process, CMDP),确保长期操作安全性与调度效率的平衡。

链接: https://arxiv.org/abs/2604.12667
作者: Jintao Xue,Xiao Li,Nianmin Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This is the accepted manuscript of an article accepted for publication in Journal of Manufacturing Systems (Elsevier)

点击查看摘要

Abstract:Human-robot collaborative manufacturing, a core aspect of Industry 5.0, emphasizes ergonomics to enhance worker well-being. This paper addresses the dynamic human-robot task planning and allocation (HRTPA) problem, which involves determining when to perform tasks and who should execute them to maximize efficiency while ensuring workers’ physical fatigue remains within safe limits. The inclusion of fatigue constraints, combined with production dynamics, significantly increases the complexity of the HRTPA problem. Traditional fatigue-recovery models in HRTPA often rely on static, predefined hyperparameters. However, in practice, human fatigue sensitivity varies daily due to factors such as changed work conditions and insufficient sleep. To better capture this uncertainty, we treat fatigue-related parameters as inaccurate and estimate them online based on observed fatigue progression during production. To address these challenges, we propose PF-CD3Q, a safe reinforcement learning (safe RL) approach that integrates the particle filter with constrained dueling double deep Q-learning for real-time fatigue-predictive HRTPA. Specifically, we first develop PF-based estimators to track human fatigue and update fatigue model parameters in real-time. These estimators are then integrated into CD3Q by making task-level fatigue predictions during decision-making and excluding tasks that exceed fatigue limits, thereby constraining the action space and formulating the problem as a constrained Markov decision process (CMDP).
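PF-CD3Q 中"在线估计疲劳模型参数"的环节可以用一个标准粒子滤波草图说明:这里假设疲劳按 F_{t+1} = F_t + λ(1 - F_t) 累积(常见的指数型疲劳模型之一,具体模型以论文为准),用粒子近似参数 λ 的后验并在有效样本数过低时重采样。

```python
import numpy as np

def particle_filter_lambda(observations, n_particles=500, obs_std=0.05, seed=0):
    """用粒子滤波在线估计疲劳累积参数 lambda。"""
    rng = np.random.default_rng(seed)
    lam = rng.uniform(0.0, 1.0, n_particles)        # 参数粒子
    F = np.zeros(n_particles)                       # 各粒子对应的疲劳状态
    w = np.full(n_particles, 1.0 / n_particles)
    for obs in observations:
        F = F + lam * (1.0 - F)                     # 按各粒子的 lambda 预测
        w = w * np.exp(-0.5 * ((obs - F) / obs_std) ** 2)   # 高斯观测似然
        w = w / w.sum()
        if 1.0 / np.sum(w ** 2) < n_particles / 2:  # 有效样本数过低则重采样
            idx = rng.choice(n_particles, n_particles, p=w)
            lam, F = lam[idx], F[idx]
            w = np.full(n_particles, 1.0 / n_particles)
    return float(np.sum(w * lam))                   # 后验加权均值作为估计
```

用真实 λ=0.3 仿真出的疲劳轨迹作为观测,几步之后估计即可收敛到真值附近,这正是论文中驱动任务级疲劳预测的参数更新环节。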

[AI-29] Human-Centric Topic Modeling with Goal-Prompted Contrastive Learning and Optimal Transport

【速读】:该论文旨在解决现有主题建模方法(从LDA到基于神经网络和大语言模型(LLM)的方法)在追求统计一致性时,常产生冗余或偏离用户意图的主题问题,从而导致模型输出缺乏可解释性、多样性和目标导向性。其解决方案的关键在于提出一种人类中心的主题建模框架(Human-centric Topic Modeling, Human-TM),通过将人类提供的目标直接嵌入主题建模过程,并设计了目标提示对比主题模型与最优传输(GCTM-OT):首先利用大语言模型(LLM)提示提取文档中的目标候选项,再借助语义感知的对比学习结合最优传输(Optimal Transport)机制实现主题发现,从而在三个公开subreddit数据集上显著提升主题的一致性、多样性及与人工目标的对齐度。

链接: https://arxiv.org/abs/2604.12663
作者: Rui Wang,Yi Zheng,Dongxin Wang,Haiping Huang,Yuanzhi Yao,Yuxiang Zhou,Jialin Yu,Philip Torr
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 Pages, 6 Figures

点击查看摘要

Abstract:Existing topic modeling methods, from LDA to recent neural and LLM-based approaches, which focus mainly on statistical coherence, often produce redundant or off-target topics that miss the user's underlying intent. We introduce Human-centric Topic Modeling (Human-TM), a novel task formulation that integrates a human-provided goal directly into the topic modeling process to produce interpretable, diverse and goal-oriented topics. To tackle this challenge, we propose the Goal-prompted Contrastive Topic Model with Optimal Transport (GCTM-OT), which first uses LLM-based prompting to extract goal candidates from documents, then incorporates these into semantic-aware contrastive learning via optimal transport for topic discovery. Experimental results on three public subreddit datasets show that GCTM-OT outperforms state-of-the-art baselines in topic coherence and diversity while significantly improving alignment with human-provided goals, paving the way for more human-centric topic discovery systems.
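GCTM-OT 借助最优传输做语义对齐;其熵正则形式通常用 Sinkhorn 迭代求解。下面是教科书式的 Sinkhorn 草图,与论文中具体的代价矩阵构造无关,输入均为演示假设。

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iter=200):
    """Sinkhorn 迭代:求边缘分布为 a、b 的熵正则最优传输计划。

    cost: 代价矩阵 (m, n);eps: 熵正则强度,越小越接近精确 OT。"""
    K = np.exp(-cost / eps)                 # Gibbs 核
    u = np.ones_like(a, dtype=float)
    v = np.ones_like(b, dtype=float)
    for _ in range(n_iter):
        u = a / (K @ v)                     # 交替缩放以匹配行边缘
        v = b / (K.T @ u)                   # 再匹配列边缘
    return u[:, None] * K * v[None, :]      # 传输计划 P = diag(u) K diag(v)
```

收敛后传输计划的行、列边缘分别逼近 a 与 b,且质量集中在低代价配对上。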

[AI-30] Broadening the Applicability of Conditional Syntax Splitting for Reasoning from Conditional Belief Bases

【速读】:该论文旨在解决非单调推理中条件信念基(conditional belief bases)的语法分割(syntax splitting)应用受限问题。传统语法分割要求信念基根据不相交的符号集分裂,但这一条件在实践中难以满足;而此前提出的“安全条件语法分割”虽放宽了限制,却仅允许平凡的自满足条件共享原子,进一步限制了实用性。本文的关键解决方案是提出一种广义的条件语法分割概念,其核心在于允许子信念基之间不仅共享原子,还能包含非平凡条件(nontrivial conditionals),从而显著扩大了语法分割的适用范围。该方法通过引入新的调整后的推理公理,并区分真正有益于归纳推理的“真实分割”(genuine splittings)与无实质增益的“简单分割”(simple splittings),提升了推理效率与灵活性,同时证明了广义条件语法分割蕴含传统条件语法分割,但反之不成立。

链接: https://arxiv.org/abs/2604.12660
作者: Lars-Phillip Spiegel,Jonas Haldimann,Jesse Heyninck,Gabriele Kern-Isberner,Christoph Beierle
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In nonmonotonic reasoning from conditional belief bases, an inference operator satisfying syntax splitting postulates allows for taking only the relevant parts of a belief base into account, provided that the belief base splits into subbases based on disjoint signatures. Because such disjointness is rare in practice, safe conditional syntax splitting has been proposed as a generalization of syntax splitting, allowing the conditionals in the subbases to share some atoms. Recently this overlap of conditionals has been shown to be limited to trivial, self-fulfilling conditionals. In this article, we propose a generalization of safe conditional syntax splittings that broadens the applicability of splitting postulates. In contrast to safe conditional syntax splitting, our generalized notion supports syntax splittings of a belief base \Delta where the subbases of \Delta may share atoms and nontrivial conditionals. We illustrate how this new notion overcomes limitations of previous splitting concepts, and we identify genuine splittings, separating them from simple splittings that do not provide benefits for inductive inference from \Delta. We introduce adjusted inference postulates based on our generalization of conditional syntax splitting, and we evaluate several popular inductive inference operators with respect to these postulates. Furthermore, we show that, while every inductive inference operator satisfying generalized conditional syntax splitting also satisfies conditional syntax splitting, the reverse does not hold.

[AI-31] TimeSAF: Towards LLM-Guided Semantic Asynchronous Fusion for Time Series Forecasting

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在时间序列预测任务中因采用深度同步融合(Deep Synchronous Fusion)策略所引发的语义感知不协调问题(semantic perceptual dissonance)。该问题表现为:LLM提供的高层抽象语义与时间序列的低层数值动态被强制耦合,导致语义先验难以有效引导预测。解决方案的关键在于提出一种分层异步融合(hierarchical asynchronous fusion)框架TimeSAF,其核心机制包括两个部分:一是独立的跨模态语义融合主干(cross-modal semantic fusion trunk),通过可学习查询自底向上聚合时序和提示骨干网络的全局语义;二是阶段式语义精炼解码器(stage-wise semantic refinement decoder),将高阶语义信号异步注入时序骨干网络,从而在保持低层时序动态完整性的同时,实现稳定且高效的语义引导。

链接: https://arxiv.org/abs/2604.12648
作者: Fan Zhang,Shiming Fan,Hua Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the recent success of large language models (LLMs) in time-series forecasting, most existing methods still adopt a Deep Synchronous Fusion strategy, where dense interactions between textual and temporal features are enforced at every layer of the network. This design overlooks the inherent granularity mismatch between modalities and leads to what we term semantic perceptual dissonance: high-level abstract semantics provided by the LLM become inappropriately entangled with the low-level, fine-grained numerical dynamics of time series, making it difficult for semantic priors to effectively guide forecasting. To address this issue, we propose TimeSAF, a new framework based on hierarchical asynchronous fusion. Unlike synchronous approaches, TimeSAF explicitly decouples unimodal feature learning from cross-modal interaction. It introduces an independent cross-modal semantic fusion trunk, which uses learnable queries to aggregate global semantics from the temporal and prompt backbones in a bottom-up manner, and a stage-wise semantic refinement decoder that asynchronously injects these high-level signals back into the temporal backbone. This mechanism provides stable and efficient semantic guidance while avoiding interference with low-level temporal dynamics. Extensive experiments on standard long-term forecasting benchmarks show that TimeSAF significantly outperforms state-of-the-art baselines, and further exhibits strong generalization in both few-shot and zero-shot transfer settings.
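TimeSAF 的语义融合主干用可学习查询自底向上聚合全局语义,本质上是一次交叉注意力池化。以下 NumPy 草图只展示这一个算子(查询数与维度均为假设),阶段式注入解码器不在此展开。

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_pooling(queries, features):
    """交叉注意力池化:每个可学习查询 (q, d) 得到 features (n, d) 行的凸组合。"""
    attn = softmax(queries @ features.T / np.sqrt(features.shape[1]))
    return attn @ features
```

由于注意力权重按行归一化,每个查询的输出都落在各特征维度的取值范围之内,相当于对骨干特征做自适应的全局汇聚。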

[AI-32] Contextual Multi-Task Reinforcement Learning for Autonomous Reef Monitoring

【速读】:该论文旨在解决自主水下航行器(Autonomous Underwater Vehicles, AUV)在海洋生态系统监测中因水下动力学高度不确定性和非平稳性而导致的控制难题。传统单任务强化学习方法容易过拟合训练环境,限制了策略的长期适用性。为此,作者提出采用上下文多任务强化学习(Contextual Multi-Task Reinforcement Learning)范式,其关键在于学习一个依赖于上下文信息的单一策略,使其能够适应多种相关任务(如不同珊瑚礁中的牡蛎与珊瑚检测),从而提升样本效率、零样本泛化能力及对变化水流条件的鲁棒性,推动可持续的自主珊瑚礁监测实践。

链接: https://arxiv.org/abs/2604.12645
作者: Melvin Laux,Yi-Ling Liu,Rina Alo,Sören Töpper,Mariela De Lucas Alvarez,Frank Kirchner,Rebecca Adam
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: To be published in IEEE OCEANS 2026 (Sanya) conference proceedings

点击查看摘要

Abstract:Although autonomous underwater vehicles promise the capability of marine ecosystem monitoring, their deployment is fundamentally limited by the difficulty of controlling vehicles under highly uncertain and non-stationary underwater dynamics. To address these challenges, we employ a data-driven reinforcement learning approach to compensate for unknown dynamics and task variability. However, single-task reinforcement learning has a tendency to overfit the training environment, thus limiting the long-term usefulness of the learnt policy. Hence, we propose to use a contextual multi-task reinforcement learning paradigm instead, allowing us to learn controllers that can be reused for various tasks, e.g., detecting oysters in one reef and detecting corals in another. We evaluate whether contextual multi-task reinforcement learning can efficiently learn robust and generalisable control policies for autonomous underwater reef monitoring. We train a single context-dependent policy that is able to solve multiple related monitoring tasks in a simulated reef environment in HoloOcean. In our experiments, we empirically evaluate the contextual policies regarding sample-efficiency, zero-shot generalisation to unseen tasks, and robustness to varying water currents. By utilising multi-task reinforcement learning, we aim to improve the training effectiveness, as well as the reusability of learnt policies to take a step towards more sustainable procedures in autonomous reef monitoring.
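"单一依赖上下文的策略解决多个相关任务"的思想可以用一个表格化草图说明:同一张 Q 表以(任务上下文, 状态, 动作)为键,轮流在不同任务上训练,推理时只需切换上下文。环境(一维链、目标位置即上下文)与超参数均为演示假设,远比水下控制简单,仅作原理示意。

```python
import random

def train_contextual_q(goals, n_states=5, episodes=600, alpha=0.2,
                       gamma=0.9, eps=0.4, seed=0):
    """上下文条件 Q 学习:Q[(目标上下文 g, 状态 s, 动作 a)] 共用一张表。"""
    rng = random.Random(seed)
    Q = {}
    def q(g, s, a):
        return Q.get((g, s, a), 0.0)
    for ep in range(episodes):
        g = goals[ep % len(goals)]              # 轮流采样任务上下文
        s = n_states // 2
        for _ in range(20):
            if s == g:
                break
            a = rng.choice((-1, 1)) if rng.random() < eps \
                else max((-1, 1), key=lambda act: q(g, s, act))
            s2 = min(max(s + a, 0), n_states - 1)
            r = 1.0 if s2 == g else 0.0
            target = r if s2 == g else gamma * max(q(g, s2, -1), q(g, s2, 1))
            Q[(g, s, a)] = q(g, s, a) + alpha * (target - q(g, s, a))
            s = s2
    return Q

def greedy_rollout(Q, g, s, n_states=5, max_steps=10):
    """按给定上下文 g 贪心执行,返回是否到达目标。"""
    def q(gg, ss, a):
        return Q.get((gg, ss, a), 0.0)
    for _ in range(max_steps):
        if s == g:
            return True
        a = max((-1, 1), key=lambda act: q(g, s, act))
        s = min(max(s + a, 0), n_states - 1)
    return s == g
```

训练后的同一策略只靠切换上下文即可分别走向左、右两个不同目标。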

[AI-33] Calibration-Aware Policy Optimization for Reasoning LLMs ACL2026

【速读】:该论文旨在解决基于Group Relative Policy Optimization (GRPO)的大型语言模型(LLM)在提升推理准确性时引发的校准性能下降问题,即错误回答的困惑度(perplexity)低于正确回答,导致相对校准能力(relative calibration)恶化,表现为Area Under the Curve (AUC)指标下降。其核心问题是现有GRPO类算法因忽略不确定性而采用非校准感知的优势估计(uncertainty-agnostic advantage estimation),从而使得优化梯度与校准目标不一致,造成准确率提升但校准能力退化。解决方案的关键在于提出Calibration-Aware Policy Optimization (CAPO),其通过引入一个理论一致且具备遗憾边界(regret bound)的逻辑回归AUC代理损失(logistic AUC surrogate loss),实现不确定性感知的优势估计;并进一步结合噪声掩蔽机制(noise masking mechanism),稳定学习动态,协同优化校准性和准确性。实验表明,CAPO在多个数学推理基准上显著提升校准性(最高达15%),同时保持或超越GRPO的准确性,并在推理时缩放任务中进一步提高准确率(最高达5%),且在低置信度下允许模型主动回避预测时,可实现精度-覆盖率的帕累托最优权衡,有效缓解幻觉问题。

链接: https://arxiv.org/abs/2604.12632
作者: Ziqi Wang,Xingzhou Lou,Meiqi Wu,Zhengqi Wen,Junge Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published as a conference paper at ACL 2026

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) enhances LLM reasoning but often induces overconfidence, where incorrect responses yield lower perplexity than correct ones, degrading relative calibration as described by the Area Under the Curve (AUC). Existing approaches either yield limited improvements in calibration or sacrifice gains in reasoning accuracy. We first prove that this degradation in GRPO-style algorithms stems from their uncertainty-agnostic advantage estimation, which inevitably misaligns optimization gradients with calibration. This leads to improved accuracy at the expense of degraded calibration. We then propose Calibration-Aware Policy Optimization (CAPO). It adopts a logistic AUC surrogate loss that is theoretically consistent and admits regret bound, enabling uncertainty-aware advantage estimation. By further incorporating a noise masking mechanism, CAPO achieves stable learning dynamics that jointly optimize calibration and accuracy. Experiments on multiple mathematical reasoning benchmarks show that CAPO-1.5B significantly improves calibration by up to 15% while achieving accuracy comparable to or better than GRPO, and further boosts accuracy on downstream inference-time scaling tasks by up to 5%. Moreover, when allowed to abstain under low-confidence conditions, CAPO achieves a Pareto-optimal precision-coverage trade-off, highlighting its practical value for hallucination mitigation.
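CAPO 的出发点是把相对校准(正确回答的得分应高于错误回答,即 AUC)转成可微目标。下面给出经验 AUC 与成对 logistic 代理损失的通用定义草图;论文中嵌入优势估计的具体形式可能不同,这里仅示意二者的关系:校准越差,代理损失越大。

```python
import math

def auc(scores_correct, scores_wrong):
    """经验 AUC:随机取一对(正确, 错误)回答时,正确者得分更高的比例。"""
    pairs = [(sc, sw) for sc in scores_correct for sw in scores_wrong]
    return sum((sc > sw) + 0.5 * (sc == sw) for sc, sw in pairs) / len(pairs)

def logistic_auc_surrogate(scores_correct, scores_wrong):
    """成对 logistic 代理损失:log(1 + exp(-(s_pos - s_neg))) 的均值,可微。"""
    pairs = [(sc, sw) for sc in scores_correct for sw in scores_wrong]
    return sum(math.log1p(math.exp(-(sc - sw))) for sc, sw in pairs) / len(pairs)
```

排序完全正确时 AUC 为 1、代理损失接近 0;排序完全颠倒(对应论文所说的过度自信)时 AUC 为 0、代理损失显著增大。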

[AI-34] KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在训练大语言模型(Large Language Models, LLMs)时因奖励稀疏性(reward sparsity)导致的推理能力提升受限问题,尤其在复杂任务中表现不佳。现有基于提示(hint-based)的RL方法虽通过注入部分解或抽象模板缓解稀疏性,但通常依赖增加token数量来扩展指导信息,从而引入冗余、不一致性和额外训练开销。其解决方案的关键在于提出KnowRL(Knowledge-Guided Reinforcement Learning)框架,将提示设计建模为最小充分指导(minimal-sufficient guidance)问题:在训练过程中,KnowRL将指导信息分解为原子知识点(Knowledge Points, KPs),并利用约束子集搜索(Constrained Subset Search, CSS)构建紧凑且交互感知的KPs子集用于训练;同时识别出“剪枝交互悖论”(pruning interaction paradox)——即单个KP移除可能有益,但多个KP同时移除反而损害性能,并显式优化该依赖结构下的鲁棒子集选择策略。这一方法显著提升了模型在8个推理基准上的表现,且无需提示即可达到70.08平均准确率,优于基线模型9.63点,进一步使用精选KP时达74.16,刷新1.5B规模下的新纪录。

链接: https://arxiv.org/abs/2604.12627
作者: Linhao Yu,Tianmeng Yang,Siyu Ding,Renren Jin,Naibin Gu,Xiangzhao Hao,Shuaiyi Nie,Deyi Xiong,Weichong Yin,Yu Sun,Hua Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:RLVR improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint-based RL methods mitigate sparsity by injecting partial solutions or abstract templates, yet they typically scale guidance by adding more tokens, which introduce redundancy, inconsistency, and extra training overhead. We propose **KnowRL** (Knowledge-Guided Reinforcement Learning), an RL training framework that treats hint design as a minimal-sufficient guidance problem. During RL training, KnowRL decomposes guidance into atomic knowledge points (KPs) and uses Constrained Subset Search (CSS) to construct compact, interaction-aware subsets for training. We further identify a pruning interaction paradox – removing one KP may help while removing multiple such KPs can hurt – and explicitly optimize for robust subset curation under this dependency structure. We train KnowRL-Nemotron-1.5B from OpenMath-Nemotron-1.5B. Across eight reasoning benchmarks at the 1.5B scale, KnowRL-Nemotron-1.5B consistently outperforms strong RL and hinting baselines. Without KP hints at inference, KnowRL-Nemotron-1.5B reaches 70.08 average accuracy, already surpassing Nemotron-1.5B by +9.63 points; with selected KPs, performance improves to 74.16, establishing a new state of the art at this scale. The model, curated training data, and code are publicly available at this https URL.
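摘要中的 Constrained Subset Search(CSS)在原文中是交互感知的子集搜索;下面用一个朴素的贪心子集选择草图说明为何必须考虑知识点之间的交互(toy_utility 为本文假设的效用函数,并非论文实现):

```python
def greedy_css(kps, utility, budget):
    """Naive greedy subset search sketch: repeatedly add the knowledge
    point (KP) with the largest marginal utility until the budget is
    exhausted or no KP improves the subset's utility."""
    chosen = []
    while len(chosen) < budget:
        rest = [k for k in kps if k not in chosen]
        if not rest:
            break
        best = max(rest, key=lambda k: utility(chosen + [k]))
        if utility(chosen + [best]) <= utility(chosen):
            break
        chosen.append(best)
    return chosen

# Hypothetical utility with an interaction: "a" and "b" only help together.
def toy_utility(subset):
    s = set(subset)
    return len(s & {"c"}) + (2 if {"a", "b"} <= s else 0)

picked = greedy_css(["a", "b", "c", "d"], toy_utility, budget=3)
```

示例中 "a" 与 "b" 只有同时出现才有收益,朴素贪心在选中 "c" 后即停止,错过了 {a, b} 的组合收益,这正对应论文强调的剪枝交互悖论,需要交互感知的子集搜索来处理。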

[AI-35] Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments CVPR2025

【速读】:该论文旨在解决实时渲染中动态光照环境下静态物体的全局光照(Global Illumination, GI)实现问题,传统方法依赖预计算的多组光照贴图(lightmap),在动态光照场景下需存储大量不同光照条件下的lightmap,导致显著的存储与内存开销。其解决方案的关键在于提出一种名为Neural Dynamic GI(NDGI)的新压缩技术,通过引入多维特征图(multi-dimensional feature maps)和轻量级神经网络来隐式编码时间维度信息,从而避免显式存储多组lightmap;同时,在训练过程中采用块压缩(Block Compression, BC)模拟策略,使最终生成的特征图可进一步压缩,提升压缩比,并结合虚拟纹理(Virtual Texturing, VT)系统实现高效实时解压缩,从而在保持高质量动态GI的同时大幅降低存储与内存占用。

链接: https://arxiv.org/abs/2604.12625
作者: Jianhui Wu,Jian Zhou,Zhi Zhou,Zhangjin Huang,Chao Li
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:High-quality global illumination (GI) in real-time rendering is commonly achieved using precomputed lighting techniques, with lightmap as the standard choice. To support GI for static objects in dynamic lighting environments, multiple lightmaps at different lighting conditions need to be precomputed, which incurs substantial storage and memory overhead. To overcome this limitation, we propose Neural Dynamic GI (NDGI), a novel compression technique specifically designed for temporal lightmap sets. Our method utilizes multi-dimensional feature maps and lightweight neural networks to integrate the temporal information instead of storing multiple sets explicitly, which significantly reduces the storage size of lightmaps. Additionally, we introduce a block compression (BC) simulation strategy during the training process, which enables BC compression on the final generated feature maps and further improves the compression ratio. To enable efficient real-time decompression, we also integrate a virtual texturing (VT) system with our neural representation. Compared with prior methods, our approach achieves high-quality dynamic GI while maintaining remarkably low storage and memory requirements, with only modest real-time decompression overhead. To facilitate further research in this direction, we will release our temporal lightmap dataset precomputed in multiple scenes featuring diverse temporal variations.

[AI-36] SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

【速读】:该论文旨在解决扩散模型(diffusion models)后训练流程中监督微调(SFT)与强化学习(RL)之间的根本性断层问题:SFT仅在前向加噪过程采样的真实状态上优化去噪器,一旦推理阶段偏离理想状态,后续去噪依赖于分布外泛化而非显式纠正,导致暴露偏差(exposure bias)沿去噪轨迹累积;而RL虽可缓解此问题,却面临奖励信号稀疏、信用分配困难及奖励黑客(reward hacking)风险。解决方案的关键在于提出SOAR(Self-Correction for Optimal Alignment and Refinement),其通过一次停梯度的轨迹 rollout,对偏离轨迹的状态重新加噪,并以密集的时间步监督引导模型回归原始干净目标,实现无奖励、在线策略(on-policy)且无需信用分配的偏差校正机制,从而在不依赖奖励模型的前提下显著提升生成质量与对齐性能。

链接: https://arxiv.org/abs/2604.12617
作者: You Qin,Linqing Wang,Hao Fei,Roger Zimmermann,Liefeng Bo,Qinglin Lu,Chunyu Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The post-training pipeline for diffusion models currently has two stages: supervised fine-tuning (SFT) on curated data and reinforcement learning (RL) with reward models. A fundamental gap separates them. SFT optimizes the denoiser only on ground-truth states sampled from the forward noising process; once inference deviates from these ideal states, subsequent denoising relies on out-of-distribution generalization rather than learned correction, exhibiting the same exposure bias that afflicts autoregressive models, but accumulated along the denoising trajectory instead of the token sequence. RL can in principle address this mismatch, yet its terminal reward signal is sparse, suffers from credit-assignment difficulty, and risks reward hacking. We propose SOAR (Self-Correction for Optimal Alignment and Refinement), a bias-correction post-training method that fills this gap. Starting from a real sample, SOAR performs a single stop-gradient rollout with the current model, re-noises the resulting off-trajectory state, and supervises the model to steer back toward the original clean target. The method is on-policy, reward-free, and provides dense per-timestep supervision with no credit-assignment problem. On SD3.5-Medium, SOAR improves GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over SFT, while simultaneously raising all model-based preference scores. In controlled reward-specific experiments, SOAR surpasses Flow-GRPO in final metric value on both aesthetic and text-image alignment tasks, despite having no access to a reward model. Since SOAR’s base loss subsumes the standard SFT objective, it can directly replace SFT as a stronger first post-training stage after pretraining, while remaining fully compatible with subsequent RL alignment.
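SOAR 单步训练的数据流可以用一个一维玩具例子示意:从真实样本出发,用当前模型做一次停梯度 rollout,再对偏离轨迹的状态重新加噪,并以原始干净样本为回归目标。以下仅为展示数据流的 numpy 草图(线性"去噪器"与噪声调度均为本文假设,并非 SD3.5 上的真实实现):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, t, eps):
    # Toy linear noising schedule standing in for the real forward process.
    return (1.0 - t) * x0 + t * eps

def denoise(x_t, w):
    # Toy linear "denoiser"; a real model would be a diffusion network.
    return w * x_t

# One SOAR-style step, following the data flow described in the abstract:
# 1) start from a real sample, 2) rollout with the current model (treated
# as stop-gradient), 3) re-noise the off-trajectory state, 4) supervise
# back toward the original clean target with a dense per-timestep loss.
x0 = rng.normal(size=8)                            # real clean sample
t = 0.5
x_t = add_noise(x0, t, rng.normal(size=8))         # on-trajectory noisy state
w = 0.8
x0_hat = denoise(x_t, w)                           # stop-gradient rollout
x_off = add_noise(x0_hat, t, rng.normal(size=8))   # re-noised off-trajectory state
loss = float(np.mean((denoise(x_off, w) - x0) ** 2))
```

与仅在前向加噪的理想状态上训练的 SFT 不同,这里的监督信号落在模型自身 rollout 产生的偏离状态上,从而直接训练"纠偏"能力。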

[AI-37] Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在面对复杂语义层面攻击时的脆弱性问题,特别是现有多模态越狱策略主要依赖表面像素扰动或文本欺骗,未能有效利用自然图像中丰富的语义结构,导致原始图像的深层语义攻击面长期未被充分探索。其解决方案的关键在于提出MemJack框架——一个基于记忆增强的多智能体越狱攻击系统,通过协同多智能体动态映射视觉实体至恶意意图、多角度生成视觉-语义伪装提示,并结合迭代零空间投影(Iterative Nullspace Projection, INLP)几何滤波器规避早期潜在空间拒绝机制;同时,借助持续的多模态经验记忆(Multimodal Experience Memory)实现成功攻击策略的积累与迁移,从而显著提升跨图像的多轮越狱成功率(Attack Success Rate, ASR)。

链接: https://arxiv.org/abs/2604.12616
作者: Jianhao Chen,Haoyang Chen,Hanjie Zhao,Haozhe Liang,Tieyun Qian
机构: 未知
类目: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: 12 pages, 9 figures

点击查看摘要

Abstract:The rapid evolution of Vision-Language Models (VLMs) has catalyzed unprecedented capabilities in artificial intelligence; however, this continuous modal expansion has inadvertently exposed a vastly broadened and unconstrained adversarial attack surface. Current multimodal jailbreak strategies primarily focus on surface-level pixel perturbations and typographic attacks or harmful images; however, they fail to engage with the complex semantic structures intrinsic to visual data. This leaves the vast semantic attack surface of original, natural images largely unscrutinized. Driven by the need to expose these deep-seated semantic vulnerabilities, we introduce **MemJack**, a **MEM**ory-augmented multi-agent **JA**ilbreak atta**CK** framework that explicitly leverages visual semantics to orchestrate automated jailbreak attacks. MemJack employs coordinated multi-agent cooperation to dynamically map visual entities to malicious intents, generate adversarial prompts via multi-angle visual-semantic camouflage, and utilize an Iterative Nullspace Projection (INLP) geometric filter to bypass premature latent space refusals. By accumulating and transferring successful strategies through a persistent Multimodal Experience Memory, MemJack maintains highly coherent extended multi-turn jailbreak attack interactions across different images, thereby improving the attack success rate (ASR) on new images. Extensive empirical evaluations across full, unmodified COCO val2017 images demonstrate that MemJack achieves a 71.48% ASR against Qwen3-VL-Plus, scaling to 90% under extended budgets. Furthermore, to catalyze future defensive alignment research, we will release **MemJack-Bench**, a comprehensive dataset comprising over 113,000 interactive multimodal jailbreak attack trajectories, establishing a vital foundation for developing inherently robust VLMs.
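摘要中的 Iterative Nullspace Projection(INLP)几何滤波器,其核心操作是把表示向量中沿"拒绝方向"的分量投影掉。下面是一个通用的零空间投影 numpy 草图(方向向量为随机玩具数据,并非论文中的真实拒绝方向):

```python
import numpy as np

def inlp_filter(X, directions):
    """Nullspace-projection sketch: successively remove each given
    'refusal' direction from the representations X by projecting onto
    that direction's orthogonal complement."""
    for w in directions:
        w = w / np.linalg.norm(w)
        X = X - np.outer(X @ w, w)   # project out the component along w
    return X

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))                     # toy hidden states
dirs = [rng.normal(size=4), rng.normal(size=4)]
X_proj = inlp_filter(X, dirs)
last = dirs[-1] / np.linalg.norm(dirs[-1])
```

注意:投影掉后一个方向可能把此前已消除方向的分量重新引入(非正交方向之间会相互"泄漏"),这正是此类零空间投影通常需要迭代执行的原因。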

[AI-38] DeepTest Tool Competition 2026: Benchmarking an LLM-Based Automotive Assistant ICSE

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在特定应用场景中可能出现的失效问题,具体聚焦于基于LLM的汽车手册信息检索系统未能正确识别并提示用户潜在风险警告的情况。其解决方案的关键在于设计一套系统性的测试方法,通过竞赛形式评估不同工具在发现此类失效方面的有效性与多样性,从而推动对LLM应用可靠性的验证和改进。

链接: https://arxiv.org/abs/2604.12615
作者: Lev Sorokin,Ivan Vasilev,Samuele Pasini
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Published in the proceedings of the DeepTest workshop at the 48th International Conference on Software Engineering (ICSE) 2026

点击查看摘要

Abstract:This report summarizes the results of the first edition of the Large Language Model (LLM) Testing competition, held as part of the DeepTest workshop at ICSE 2026. Four tools competed in benchmarking an LLM-based car manual information retrieval application, with the objective of identifying user inputs for which the system fails to appropriately mention warnings contained in the manual. The testing solutions were evaluated based on their effectiveness in exposing failures and the diversity of the discovered failure-revealing tests. We report on the experimental methodology, the competitors, and the results.

[AI-39] LLM-Guided Prompt Evolution for Password Guessing

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的密码猜测攻击中提示词(prompt)设计效率低、效果受限的问题,从而提升密码破解率并增强对用户弱密码行为的建模能力。其解决方案的关键在于引入进化计算驱动的自动化提示词优化机制——通过结合MAP-Elites质量-多样性搜索与岛屿种群模型(island population model),利用OpenEvolve开源系统自动演化出能显著提高密码破解率的提示词策略。实验表明,该方法将破解率从2.02%提升至8.48%,且生成密码的字符分布更接近真实用户密码统计特征,验证了自动化提示词进化在强化LLM驱动密码审计方面的有效性与可行性。

链接: https://arxiv.org/abs/2604.12601
作者: Vladimir A. Mazin,Mikhail A. Zorin,Dmitrii S. Korzh,Elvir Z. Karimov,Dmitrii A. Bolokhov,Oleg Y. Rogov
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Passwords still remain a dominant authentication method, yet their security is routinely subverted by predictable user choices and large-scale credential leaks. Automated password guessing is a key tool for stress-testing password policies and modeling attacker behavior. This paper applies LLM-driven evolutionary computation to automatically optimize prompts for the LLM password guessing framework. Using OpenEvolve, an open-source system combining MAP-Elites quality-diversity search with an island population model, we evolve prompts that maximize the cracking rate on a RockYou-derived test set. We evaluate three configurations: a local setup with Qwen3 8B, a single compact cloud model Gemini-2.5 Flash, and a two-model ensemble of frontier LLMs. The approach raises the cracking rate from 2.02% to 8.48%. Character distribution analysis further confirms that evolved prompts produce statistically more realistic passwords. Automated prompt evolution is a low-barrier yet effective way to strengthen LLM-based password auditing, underlining how attack pipelines tend to strengthen through automated improvement.
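MAP-Elites 质量-多样性搜索的核心循环可以用纯 Python 草图示意:档案(archive)按行为描述符分格,每格只保留精英个体。以下的变异算子、行为描述符与适应度函数均为本文假设的玩具替代(真实系统中适应度对应 LLM 按提示词生成口令后的破解率):

```python
import random

random.seed(0)

def mutate(prompt):
    # Hypothetical mutation: append a random instruction fragment.
    fragments = [" use leetspeak", " add a year suffix", " mix two words", " capitalise"]
    return prompt + random.choice(fragments)

def descriptor(prompt):
    # Behaviour descriptor: bucket prompts by length (stands in for the
    # real system's diversity axes over generated password styles).
    return min(len(prompt) // 20, 4)

def fitness(prompt):
    # Stand-in for the cracking rate the resulting guess list achieves.
    return sum(1 for f in (" leetspeak", " year", " mix", " capitalise") if f in prompt)

archive = {}                                  # one elite per behaviour cell
population = ["generate likely passwords"]
for _ in range(200):
    parent = random.choice(list(archive.values()) + population)
    child = mutate(parent)
    cell = descriptor(child)
    if cell not in archive or fitness(child) > fitness(archive[cell]):
        archive[cell] = child                 # keep the per-cell elite
```

OpenEvolve 在此之上再叠加岛屿种群模型并以真实评测(破解率)作为适应度;此处仅演示"按行为格子保精英"的机制本身。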

[AI-40] KumoRFM-2: Scaling Foundation Models for Relational Learning

【速读】:该论文旨在解决现有预训练基础模型在处理关系型数据时的局限性,特别是传统表格基础模型需手动扁平化表结构或生成目标变量,且难以同时处理多表间复杂关联与时间一致性的问题。其解决方案的关键在于提出KumoRFM-2——一个原生支持多表关系数据输入的预训练基础模型,通过在数据库层面引入外键(foreign key)和跨样本(cross-sample)维度进行预训练,并尽早注入任务信息以增强对任务相关列的选择能力与抗噪鲁棒性,从而实现无需人工干预即可高效完成多种预测任务的能力。

链接: https://arxiv.org/abs/2604.12596
作者: Valter Hudovernik,Federico López,Vid Kocijan,Akihiro Nitta,Jan Eric Lenssen,Jure Leskovec,Matthias Fey
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce KumoRFM-2, the next iteration of a pre-trained foundation model for relational data. KumoRFM-2 supports in-context learning as well as fine-tuning and is applicable to a wide range of predictive tasks. In contrast to tabular foundation models, KumoRFM-2 natively operates on relational data, processing one or more connected tables simultaneously without manual table flattening or target variable generation, all while preserving temporal consistency. KumoRFM-2 leverages a large corpus of synthetic and real-world data to pre-train across four axes: the row and column dimensions at the individual table level, and the foreign key and cross-sample dimensions at the database level. In contrast to its predecessor, KumoRFM-2 injects task information as early as possible, enabling sharper selection of task-relevant columns and improved robustness to noisy data. Through extensive experiments on 41 challenging benchmarks and analysis around expressivity and sensitivity, we demonstrate that KumoRFM-2 outperforms supervised and foundational approaches by up to 8%, while maintaining strong performance under extreme settings of cold start and noisy data. To our knowledge, this is the first time a few-shot foundation model has been shown to surpass supervised approaches on common benchmark tasks, with performance further improving upon fine-tuning. Finally, while KumoRFM-1 was limited to small-scale in-memory datasets, KumoRFM-2 scales to billion-scale relational datasets.

[AI-41] IDEA: An Interpretable and Editable Decision-Making Framework for LLMs via Verbal-to-Numeric Calibration ACL2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险决策场景中应用受限的问题,主要包括概率校准不足、解释不忠实以及难以精确融入专家知识等挑战。其核心解决方案是提出IDEA框架,该框架通过联合学习语义到数值的映射关系与决策参数,利用期望最大化(EM)算法实现因子间依赖关系的保留采样,并支持基于数学保证的直接参数编辑,从而生成校准的概率分布并实现可量化的“人-AI协同”。实验表明,IDEA在五个数据集上使用Qwen-3-32B模型达到78.6%准确率,显著优于DeepSeek R1(68.1%)和GPT-5.2(77.9%),且实现了因子排除的完美控制和概率的精确校准,这是仅靠提示工程无法实现的性能提升。

链接: https://arxiv.org/abs/2604.12573
作者: Yanji He,Yuxin Jiang,Yiwen Wu,Bo Huang,Jiaheng Wei,Wei Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026

点击查看摘要

Abstract:Large Language Models are increasingly deployed for decision-making, yet their adoption in high-stakes domains remains limited by miscalibrated probabilities, unfaithful explanations, and inability to incorporate expert knowledge precisely. We propose IDEA, a framework that extracts LLM decision knowledge into an interpretable parametric model over semantically meaningful factors. Through joint learning of verbal-to-numerical mappings and decision parameters via EM, correlated sampling that preserves factor dependencies, and direct parameter editing with mathematical guarantees, IDEA produces calibrated probabilities while enabling quantitative human-AI collaboration. Experiments across five datasets show IDEA with Qwen-3-32B (78.6%) outperforms DeepSeek R1 (68.1%) and GPT-5.2 (77.9%), achieving perfect factor exclusion and exact calibration – precision unattainable through prompting alone. The implementation is publicly available at this https URL.

[AI-42] Cross-Cultural Simulation of Citizen Emotional Responses to Bureaucratic Red Tape Using LLM Agents

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在模拟不同文化背景下公民对官僚程序(red tape)情绪反应方面的有效性不足问题,尤其是其在东方文化情境中表现较差且现有文化提示策略效果有限。解决方案的关键在于提出一个评估框架以系统检验LLMs在多元文化语境下生成情绪响应的适配性,并开发了名为RAMO的交互式界面,用于模拟公民情绪反应并收集人类数据以迭代优化模型性能。

链接: https://arxiv.org/abs/2604.12545
作者: Wanchun Ni,Jiugeng Sun,Yixian Liu,Mennatallah El-Assady
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: To appear in the CHI 2026 Workshop on PoliSim

点击查看摘要

Abstract:Improving policymaking is a central concern in public administration. Prior human subject studies reveal substantial cross-cultural differences in citizens’ emotional responses to red tape during policy implementation. While LLM agents offer opportunities to simulate human-like responses and reduce experimental costs, their ability to generate culturally appropriate emotional responses to red tape remains unverified. To address this gap, we propose an evaluation framework for assessing LLMs’ emotional responses to red tape across diverse cultural contexts. As a pilot study, we apply this framework to a single red-tape scenario. Our results show that all models exhibit limited alignment with human emotional responses, with notably weaker performance in Eastern cultures. Cultural prompting strategies prove largely ineffective in improving alignment. We further introduce \textbfRAMO, an interactive interface for simulating citizens’ emotional responses to red tape and for collecting human data to improve models. The interface is publicly available at this https URL.

[AI-43] A Two-Stage LLM Framework for Accessible and Verified XAI Explanations

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)生成的可解释人工智能(eXplainable Artificial Intelligence, XAI)自然语言解释中存在的准确性、忠实性(faithfulness)、完整性不足,以及缺乏可靠评估机制的问题。现有方法多依赖主观评价或事后评分,无法有效阻止错误解释传递给最终用户。其解决方案的关键在于提出一种两阶段LLM元验证框架(Two-Stage LLM Meta-Verification Framework),包含三个核心组件:(i)解释器LLM将原始XAI输出转化为自然语言叙事;(ii)验证器LLM从忠实性、连贯性、完整性及幻觉风险角度对解释进行评估;(iii)通过迭代反馈机制利用验证结果优化解释内容。实验证明,该框架不仅能过滤不可靠解释,还能提升语言可访问性,并且熵产生率(Entropy Production Rate, EPR)分析表明验证反馈促使解释过程趋向更稳定和一致的推理路径。

链接: https://arxiv.org/abs/2604.12543
作者: Georgios Mermigkis,Dimitris Metaxakis,Marios Tyrovolas,Argiris Sofotasios,Nikolaos Avgeris,Panagiotis Hadjidoukas,Chrysostomos Stylios
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 8 figures, Accepted for publication at the 2026 IEEE World Congress on Computational Intelligence (WCCI 2026)

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used to translate the technical outputs of eXplainable Artificial Intelligence (XAI) methods into accessible natural-language explanations. However, existing approaches often lack guarantees of accuracy, faithfulness, and completeness. At the same time, current efforts to evaluate such narratives remain largely subjective or confined to post-hoc scoring, offering no safeguards to prevent flawed explanations from reaching end-users. To address these limitations, this paper proposes a Two-Stage LLM Meta-Verification Framework that consists of (i) an Explainer LLM that converts raw XAI outputs into natural-language narratives, (ii) a Verifier LLM that assesses them in terms of faithfulness, coherence, completeness, and hallucination risk, and (iii) an iterative refeed mechanism that uses the Verifier’s feedback to refine and improve them. Experiments across five XAI techniques and datasets, using three families of open-weight LLMs, show that verification is crucial for filtering unreliable explanations while improving linguistic accessibility compared with raw XAI outputs. In addition, the analysis of the Entropy Production Rate (EPR) during the refinement process indicates that the Verifier’s feedback progressively guides the Explainer toward more stable and coherent reasoning. Overall, the proposed framework provides an efficient pathway toward more trustworthy and democratized XAI systems.
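该两阶段"解释器-验证器-回灌"循环的控制流可以用桩函数(stub)示意如下;explainer_llm / verifier_llm 在真实系统中应替换为实际的 LLM 调用,此处的逻辑仅为本文假设的简化版本:

```python
def explainer_llm(xai_output, feedback=""):
    # Stub explainer: turn raw attributions into a sentence, folding in
    # any verifier feedback (a real system would call an LLM here).
    note = f" (revised: {feedback})" if feedback else ""
    top = max(xai_output, key=xai_output.get)
    return f"The prediction is driven mainly by '{top}'.{note}"

def verifier_llm(narrative, xai_output):
    # Stub verifier: a faithfulness check that the narrative names the
    # truly dominant feature; returns (ok, feedback for the refeed loop).
    top = max(xai_output, key=xai_output.get)
    ok = top in narrative
    return ok, "" if ok else f"mention the top feature '{top}'"

def explain_with_verification(xai_output, max_rounds=3):
    # Iterative refeed: keep regenerating until the verifier accepts,
    # up to a round budget.
    feedback = ""
    for _ in range(max_rounds):
        narrative = explainer_llm(xai_output, feedback)
        ok, feedback = verifier_llm(narrative, xai_output)
        if ok:
            return narrative
    return narrative

attributions = {"age": 0.7, "income": 0.2, "zip": 0.1}
result = explain_with_verification(attributions)
```

论文中的验证维度(忠实性、连贯性、完整性、幻觉风险)由 Verifier LLM 打分并反馈,此处以单一的"是否命中最重要特征"作为最小化示例。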

[AI-44] Technical Report – A Context-Sensitive Multi-Level Similarity Framework for First-Order Logic Arguments: An Axiomatic Study

【速读】:该论文旨在解决在一阶逻辑(First-Order Logic, FOL)框架下论证相似性(argument similarity)的建模问题,其核心挑战在于如何衡量结构化内容之间的相似性,而不仅仅是命题逻辑中的简单语义匹配。解决方案的关键在于提出一个全面的FOL论证相似性框架,包含四个核心要素:(1) 扩展的公理基础以支撑形式化一致性;(2) 四层级参数化模型,分别覆盖谓词、文字、子句和公式级别的相似性度量;(3) 两类模型家族,其中一类基于语言模型实现语法敏感性,并通过上下文权重集成实现细粒度且可解释的相似性计算;(4) 形式化约束条件以确保所提方法具备理想的性质(如对称性、传递性等)。该框架为复杂论证结构的相似性分析提供了系统性理论与实践路径。

链接: https://arxiv.org/abs/2604.12534
作者: Victor David,Jérôme Delobelle,Jean-Guy Mailly
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: 19 pages, 6 figures

点击查看摘要

Abstract:Similarity in formal argumentation has recently gained attention due to its significance in problems such as argument aggregation in semantics and enthymeme decoding. While existing approaches focus on propositional logic, we address the richer setting of First-Order Logic (FOL), where similarity must account for structured content. We introduce a comprehensive framework for FOL argument similarity, built upon: (1) an extended axiomatic foundation; (2) a four-level parametric model covering predicates, literals, clauses, and formulae similarity; (3) two model families, one syntax-sensitive via language models, both integrating contextual weights for nuanced and explainable similarity; and (4) formal constraints enforcing desirable properties.

[AI-45] Orthogonal Subspace Projection for Continual Machine Unlearning via SVD-Based LoRA

【速读】:该论文旨在解决持续机器遗忘(continual machine unlearning)中因顺序删除请求导致的参数冲突问题,即在多次删除任务后模型难以保持原有知识且易产生任务间干扰。其核心解决方案是提出一种基于奇异值分解(SVD)引导的正交子空间投影方法:在训练阶段约束每个新的低秩适配(LoRA)更新模块位于先前遗忘任务所用子空间的正交补空间内,从而实现任务隔离,避免动态部署时的路由开销,同时确保遗忘效果与原始性能的平衡。

链接: https://arxiv.org/abs/2604.12526
作者: Yogachandran Rahulamathavan,Nasir Iqbal,Juncheng Hu,Sangarapillai Lambotharan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Continual machine unlearning aims to remove the influence of data that should no longer be retained, while preserving the usefulness of the model on everything else. This setting becomes especially difficult when deletion requests arrive sequentially, because the model must repeatedly adapt without erasing previously retained knowledge. Low-Rank Adaptation (LoRA) offers an efficient way to implement such updates, but naively combining many sequential LoRA modules leads to parameter collision, causing *strong* interference between tasks. We propose a static alternative based on Singular Value Decomposition (SVD)-guided orthogonal subspace projection. Our method constrains each new LoRA update during training so that it lies in the orthogonal complement of the subspaces used by earlier unlearning tasks. This preserves task isolation without requiring dynamic routing at deployment. Experiments on CIFAR-100 with ResNet-20 and on MNIST show stable behavior across long sequences of unlearning tasks. After thirty sequential unlearning tasks, state-of-the-art static fusion reduces retained accuracy from 60.39% to 12.70%, whereas the proposed in-training constrained optimization maintains baseline performance (~58.1%) while preserving strong unlearning efficacy.
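摘要中"将新的 LoRA 更新约束在先前任务子空间的正交补内"这一步,可以用 SVD 求先前更新堆叠矩阵的列空间基、再做正交投影来示意(每个任务秩为 2、权重尺寸 16×16 等设定均为玩具假设,并非论文配置):

```python
import numpy as np

def project_out_prior(delta_w, prior_updates):
    """Constrain a new LoRA-style update to the orthogonal complement of
    the column space spanned by earlier unlearning tasks' updates, using
    an SVD of the stacked prior updates as the subspace basis."""
    stacked = np.hstack(prior_updates)
    U, S, _ = np.linalg.svd(stacked, full_matrices=False)
    rank = int(np.sum(S > 1e-10 * S[0]))   # numerical rank of the prior subspace
    B = U[:, :rank]                        # orthonormal basis of that subspace
    return delta_w - B @ (B.T @ delta_w)   # remove any overlap with it

rng = np.random.default_rng(0)
# Three earlier rank-2 "unlearning" updates on a 16x16 weight matrix.
prior = [rng.normal(size=(16, 2)) @ rng.normal(size=(2, 16)) for _ in range(3)]
new_update = project_out_prior(rng.normal(size=(16, 16)), prior)
```

投影后的新更新与所有历史任务的更新子空间正交,因此在静态合并多个模块时不再发生参数冲突;论文的做法是在训练中直接施加这一约束,而非事后投影。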

[AI-46] Lit2Vec: A Reproducible Workflow for Building a Legally Screened Chemistry Corpus from S2ORC for Downstream Retrieval and Text Mining

【速读】:该论文旨在解决化学领域研究文本语料库构建与验证的可复现性问题,尤其是在大规模文献数据中确保高质量、结构化且合规的数据可用性。其解决方案的关键在于提出了一套可复现的工作流程(Lit2Vec),通过保守的元数据驱动许可筛选策略(基于Unpaywall、OpenAlex和Crossref)从Semantic Scholar开放研究语料库中提取并构建一个包含582,683篇化学领域全文文章的内部研究语料库,并配套提供完整的schema、重建流程、元数据溯源信息及验证输出,以支持下游检索与文本挖掘任务。该方案还通过生成段落级嵌入(paragraph-level embeddings)和多标签子领域标注进一步增强语料库的实用性与结构化程度。

链接: https://arxiv.org/abs/2604.12498
作者: Mahmoud Amiri,Jamile Mohammad Jafari,Sara Mostafapour,Thomas Bocklitz
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Lit2Vec, a reproducible workflow for constructing and validating a chemistry corpus from the Semantic Scholar Open Research Corpus using conservative, metadata-based license screening. Using this workflow, we assembled an internal study corpus of 582,683 chemistry-specific full-text research articles with structured full text, token-aware paragraph chunks, paragraph-level embeddings generated with the intfloat/e5-large-v2 model, and record-level metadata including abstracts and licensing information. To support downstream retrieval and text-mining use cases, an eligible subset of the corpus was additionally enriched with machine-generated brief summaries and multi-label subfield annotations spanning 18 chemistry domains. Licensing was screened using metadata from Unpaywall, OpenAlex, and Crossref, and the resulting corpus was technically validated for schema compliance, embedding reproducibility, text quality, and metadata completeness. The primary contribution of this work is a reproducible workflow for corpus construction and validation, together with its associated schema and reproducibility resources. The released materials include the code, reconstruction workflow, schema, metadata/provenance artifacts, and validation outputs needed to reproduce the corpus from pinned public upstream resources. Public redistribution of source-derived text and broad text-derived representations is outside the scope of the general release. Researchers can reproduce the workflow by using the released pipeline with publicly available upstream datasets and metadata services.

[AI-47] Deepfakes at Face Value: Image and Authority

【速读】:该论文试图解决的问题是:现有对深度伪造(Deepfakes)伦理问题的分析主要聚焦于其造成的伤害或对非规范性利益的侵犯,但无法解释为何在未造成实际损害或未侵害其他非规范性利益时,深度伪造仍具有道德错误性。解决方案的关键在于提出一种新的道德理由——深度伪造之所以错误,是因为它们剥夺了个体对其形象合法使用的控制权,即侵犯了个体对其身份治理的正当权威;具体而言,当算法利用个体生物特征作为生成资源来模拟其身份时,构成对个体“身份算法征用权”的不当侵犯,这种侵犯区别于可接受的合理使用(如艺术再现),属于不正当的算法模拟行为。

链接: https://arxiv.org/abs/2604.12490
作者: James Ravi Kirkpatrick
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 21 pages, accepted copy published in AI &amp; Society (2026)

点击查看摘要

Abstract:Deepfakes are synthetic media that superimpose or generate someone’s likeness on to pre-existing sound, images, or videos using deep learning methods. Existing accounts of the wrongs involved in creating and distributing deepfakes focus on the harms they cause or the non-normative interests they violate. However, these approaches do not explain how deepfakes can be wrongful even when they cause no harm or set back any other non-normative interest. To address this issue, this paper identifies a neglected reason why deepfakes are wrong: they can subvert our legitimate interests in having authority over the permissible uses of our image and the governance of our identity. We argue that deepfakes are wrong when they usurp our authority to determine the provenance of our own agency by exploiting our biometric features as a generative resource. In particular, we have a specific right against the algorithmic conscription of our identity. We refine the scope of this interest by distinguishing between permissible forms of appropriation, such as artistic depiction, from wrongful algorithmic simulation.

[AI-48] Elastic Net Regularization and Gabor Dictionary for Classification of Heart Sound Signals using Deep Learning

【速读】:该论文旨在解决心音信号在分类任务中特征表示不足的问题,以提升对五种心脏瓣膜病变的识别准确率。其解决方案的关键在于优化时频原子(time-frequency atoms)的分辨率与拟合模型的正则化策略,从而构建更有效的时频特征矩阵;具体而言,通过使用Gabor字典中的高时间分辨率、低频率分辨率原子,并结合弹性网(elastic net)正则化方法对线性模型进行稀疏约束,获得高质量的特征表示;在此基础上,采用两种深度学习(deep learning, DL)架构(分别为1D-CNN-LSTM和1D/2D-CNN-LSTM)并配合ADAM优化算法训练,最终实现了98.95%的最高分类准确率。

链接: https://arxiv.org/abs/2604.12483
作者: Mahmoud Fakhry,Ascensión Gallardo-Antolín
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this article, we propose the optimization of the resolution of time-frequency atoms and the regularization of fitting models to obtain better representations of heart sound signals. This is done by evaluating the classification performance of deep learning (DL) networks in discriminating five heart valvular conditions based on a new class of time-frequency feature matrices derived from the fitting models. We inspect several combinations of resolution and regularization, and the optimal one is that provides the highest performance. To this end, a fitting model is obtained based on a heart sound signal and an overcomplete dictionary of Gabor atoms using elastic net regularization of linear models. We consider two different DL architectures, the first mainly consisting of a 1D convolutional neural network (CNN) layer and a long short-term memory (LSTM) layer, while the second is composed of 1D and 2D CNN layers followed by an LSTM layer. The networks are trained with two algorithms, namely stochastic gradient descent with momentum (SGDM) and adaptive moment (ADAM). Extensive experimentation has been conducted using a database containing heart sound signals of five heart valvular conditions. The best classification accuracy of 98.95% is achieved with the second architecture when trained with ADAM and feature matrices derived from optimal models obtained with a Gabor dictionary consisting of atoms with high-time low-frequency resolution and imposing sparsity on the models.
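摘要中的两个要素——Gabor 原子字典与弹性网正则化的线性拟合——可以用如下 numpy 草图示意:先构造一个小型(玩具规模)Gabor 字典,再用近端梯度法(ISTA)求解弹性网稀疏编码。全部参数与规模均为本文假设,并非论文实验配置:

```python
import numpy as np

def gabor_dictionary(n, widths, freqs):
    """Build a toy overcomplete dictionary of Gabor atoms: Gaussian
    windows of several widths modulating cosines of several frequencies,
    at coarse time shifts."""
    t = np.arange(n)
    atoms = []
    for s in widths:
        for f in freqs:
            for c in range(0, n, 4):
                g = np.exp(-0.5 * ((t - c) / s) ** 2) * np.cos(2 * np.pi * f * t / n)
                atoms.append(g / (np.linalg.norm(g) + 1e-12))
    return np.array(atoms).T   # shape (n, n_atoms)

def elastic_net_ista(D, x, lam1=0.05, lam2=0.01, steps=300):
    """ISTA for the elastic-net problem
    0.5*||x - Dw||^2 + lam1*||w||_1 + 0.5*lam2*||w||^2
    (proximal gradient with soft-thresholding)."""
    L = np.linalg.norm(D, 2) ** 2 + lam2   # Lipschitz constant of the smooth part
    w = np.zeros(D.shape[1])
    for _ in range(steps):
        grad = D.T @ (D @ w - x) + lam2 * w
        z = w - grad / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam1 / L, 0.0)
    return w

n = 64
D = gabor_dictionary(n, widths=[4, 12], freqs=[2, 5, 9])   # 96 atoms > 64 samples
x = D[:, 3] * 1.5 + 0.01 * np.random.default_rng(0).normal(size=n)
w = elastic_net_ista(D, x)
```

求得的稀疏系数向量即对应论文中作为特征来源的"拟合模型";论文进一步将此类系数整理为时频特征矩阵后送入 DL 网络分类。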

[AI-49] Social Learning Strategies for Evolved Virtual Soft Robots

【速读】:该论文旨在解决机器人本体(body)与控制策略(brain)联合优化的难题,即形态结构决定了哪些控制策略有效,而控制参数又反过来影响形态性能的发挥。传统方法通过嵌套进化与学习循环实现独立优化,但忽略了个体间可共享的经验价值。其解决方案的关键在于引入社会学习机制,使机器人能够从同伴中汲取已优化的控制参数以加速自身脑部优化过程;特别地,研究系统性考察了教师选择策略(即选择哪些及多少机器人作为学习对象)对性能的影响,并发现基于形态相似性的知识迁移能显著提升效率,且多教师协作学习相比单师学习更具鲁棒性和一致性。

链接: https://arxiv.org/abs/2604.12482
作者: K. Ege de Bruin,Kyrre Glette,Kai Olav Ellefsen,Giorgia Nadizar,Eric Medvet
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Optimizing the body and brain of a robot is a coupled challenge: the morphology determines what control strategies are effective, while the control parameters influence how well the morphology performs. This joint optimization can be done through nested loops of evolutionary and learning processes, where the control parameters of each robot are learned independently. However, the control parameters learned by one robot may contain valuable information for others. Thus, we introduce a social learning approach in which robots can exploit optimized parameters from their peers to accelerate their own brain optimization. Within this framework, we systematically investigate how the selection of teachers, deciding which and how many robots to learn from, affects performance, experimenting with virtual soft robots in four tasks and environments. In particular, we study the effect of inheriting experience from morphologically similar robots due to the tightly coupled body and brain in robot optimization. Our results confirm the effectiveness of building on others’ experience, as social learning clearly outperforms learning from scratch under equivalent computational budgets. In addition, while the optimal teacher selection strategy remains open, our findings suggest that incorporating knowledge from multiple teachers can yield more consistent and robust improvements.

[AI-50] Audio Source Separation in Reverberant Environments using β-divergence based Nonnegative Factorization

【速读】:该论文旨在解决多通道音频源分离中参数估计不准确的问题,特别是在基于高斯模型的分离方法中,传统通过期望最大化(Expectation-Maximization, EM)算法估计源信号频谱方差和空间协方差矩阵时,易受混叠干扰且性能受限。其解决方案的关键在于引入基于先验信息的非负矩阵分解(Nonnegative Matrix Factorization, NMF),将源信号功率谱的先验知识以谱基矩阵形式建模,这些基矩阵可通过冗余库训练或直接提取获得;进一步利用非负张量分解(Nonnegative Tensor Factorization, NTF)优化基矩阵选择,通过最小化β-散度并采用乘性更新规则实现稀疏性控制,从而提升分离质量。实验表明,因子分解的稀疏性对分离性能提升更为关键,优于其他可比算法。

链接: https://arxiv.org/abs/2604.12480
作者: Mahmoud Fakhry,Piergiorgio Svaizer,Maurizio Omologo
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In Gaussian model-based multichannel audio source separation, the likelihood of observed mixtures of source signals is parametrized by source spectral variances and by associated spatial covariance matrices. These parameters are estimated by maximizing the likelihood through an Expectation-Maximization algorithm and used to separate the signals by means of multichannel Wiener filtering. We propose to estimate these parameters by applying nonnegative factorization based on prior information on source variances. In the nonnegative factorization, spectral basis matrices can be defined as the prior information. The matrices can be either extracted or indirectly made available through a redundant library that is trained in advance. In a separate step, applying nonnegative tensor factorization, two algorithms are proposed in order to either extract or detect the basis matrices that best represent the power spectra of the source signals in the observed mixtures. The factorization is achieved by minimizing the β-divergence through multiplicative update rules. The sparsity of factorization can be controlled by tuning the value of β. Experiments show that sparsity, rather than the value assigned to β in the training, is crucial in order to increase the separation performance. The proposed method was evaluated in several mixing conditions. It provides better separation quality with respect to other comparable algorithms.
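下面给出一个极简的 β-散度 NMF 乘性更新示意,仅用于理解摘要中"通过乘性更新规则最小化 β-散度"这一构件;函数名与超参数均为假设,并非论文实现(论文在此之上还引入了谱基先验与冗余库):

```python
import numpy as np

def nmf_beta(V, rank, beta=1.0, n_iter=200, seed=0):
    """Minimal NMF (V ~= W @ H) via multiplicative updates on the beta-divergence.

    beta=2 is Euclidean, beta=1 is generalized KL, beta=0 would be
    Itakura-Saito. Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, rank)) + 1e-6
    H = rng.random((rank, N)) + 1e-6
    for _ in range(n_iter):
        WH = W @ H
        H *= (W.T @ (WH ** (beta - 2) * V)) / (W.T @ WH ** (beta - 1))
        WH = W @ H
        W *= ((WH ** (beta - 2) * V) @ H.T) / (WH ** (beta - 1) @ H.T)
    return W, H

def beta_divergence(V, WH, beta=1.0):
    if beta == 1:  # generalized Kullback-Leibler divergence
        return float(np.sum(V * np.log(V / WH) - V + WH))
    # generic form, valid for beta not in {0, 1}
    return float(np.sum((V ** beta + (beta - 1) * WH ** beta
                         - beta * V * WH ** (beta - 1)) / (beta * (beta - 1))))
```

β 越小(如趋向 0 的 Itakura-Saito),对低能量频点越敏感,这也是摘要中通过调节 β 控制分解稀疏性的直觉来源。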

[AI-51] From Kinematics to Dynamics: Learning to Refine Hybrid Plans for Physically Feasible Execution

【速读】:该论文旨在解决机器人任务中高维混合离散-连续规划问题,即如何在满足时间窗、速度和加速度等物理约束的前提下,从由高阶动作序列与连续轨迹组成的初始计划中生成可执行的动态可行轨迹。传统混合时序规划方法通常采用一阶(线性)动力学模型,无法保证轨迹符合机器人真实的二阶动力学约束,导致即使高层动作序列固定,仍需进行复杂的双层优化以获取物理可行解。本文的关键解决方案是:构建一个显式嵌入解析二阶约束的马尔可夫决策过程(Markov Decision Process, MDP),并利用强化学习在连续空间中对混合规划器生成的一阶轨迹进行迭代优化,从而有效弥合初始计划与实际执行所需动力学之间的差距,并可靠地恢复物理可行性。

链接: https://arxiv.org/abs/2604.12474
作者: Lidor Erez,Shahaf S. Shperberg,Ayal Taitler
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In many robotic tasks, agents must traverse a sequence of spatial regions to complete a mission. Such problems are inherently mixed discrete-continuous: a high-level action sequence and a physically feasible continuous trajectory. The resulting trajectory and action sequence must also satisfy problem constraints such as deadlines, time windows, and velocity or acceleration limits. While hybrid temporal planners attempt to address this challenge, they typically model motion using linear (first-order) dynamics, which cannot guarantee that the resulting plan respects the robot’s true physical constraints. Consequently, even when the high-level action sequence is fixed, producing a dynamically feasible trajectory becomes a bi-level optimization problem. We address this problem via reinforcement learning in continuous space. We define a Markov Decision Process that explicitly incorporates analytical second-order constraints and use it to refine first-order plans generated by a hybrid planner. Our results show that this approach can reliably recover physical feasibility and effectively bridge the gap between a planner’s initial first-order trajectory and the dynamics required for real execution.
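摘要中"从一阶计划恢复二阶可行性"的问题设定可以用一个极简示意理解:给定分段线性的一阶轨迹,这里用最粗暴的统一时间缩放代替论文的强化学习精化器,直到速度与加速度约束全部满足。缩放规则与函数名均为假设,并非论文方法:

```python
def refine_schedule(positions, times, v_max, a_max, max_stretch=64.0):
    """Uniformly time-stretch a piecewise-linear (first-order) 1-D plan until
    the implied segment velocities and inter-segment accelerations respect
    second-order limits.

    A crude stand-in for the paper's RL-based refinement: doubling the time
    scale halves velocities and quarters accelerations, so the loop
    terminates for any plan with distinct timestamps.
    """
    def feasible(ts):
        vs = [(p1 - p0) / (t1 - t0)
              for p0, p1, t0, t1 in zip(positions, positions[1:], ts, ts[1:])]
        if any(abs(v) > v_max for v in vs):
            return False
        accs = [(v1 - v0) / (t1 - t0)
                for v0, v1, t0, t1 in zip(vs, vs[1:], ts, ts[1:])]
        return all(abs(a) <= a_max for a in accs)

    scale = 1.0
    while scale <= max_stretch:
        ts = [t * scale for t in times]
        if feasible(ts):
            return ts
        scale *= 2.0
    raise ValueError("no feasible uniform rescaling within max_stretch")
```

统一缩放会拉长所有时段,可能违反截止时间约束;论文用学习到的策略做局部精化,正是为了在保持时间窗的同时恢复动力学可行性。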

[AI-52] Intelligent ROI-Based Vehicle Counting Framework for Automated Traffic Monitoring

【速读】:该论文旨在解决视频监控中车辆计数的准确性与计算效率难以兼顾的问题(vehicle counting accuracy vs. computational efficiency)。其解决方案的关键在于提出一个全自动的两阶段框架:第一阶段为估计阶段,通过融合检测分数、跟踪分数和车辆密度三个模型,自适应地确定最优感兴趣区域(Region of Interest, ROI),从而提升对不同检测与跟踪方法的兼容性;第二阶段为预测阶段,在所估计的ROI内高效执行车辆计数。该设计显著提升了计数精度(多数视频达到100%准确率)并使处理速度比全帧处理快至四倍,尤其在复杂多车道场景下展现出优越的鲁棒性和准确性。

链接: https://arxiv.org/abs/2604.12470
作者: Mohamed A. Abdelwahab,Zaynab Al-Ariny,Mahmoud Fakhry,El-Sayed Hasaneen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate vehicle counting through video surveillance is crucial for efficient traffic management. However, achieving high counting accuracy while ensuring computational efficiency remains a challenge. To address this, we propose a fully automated, video-based vehicle counting framework designed to optimize both computational efficiency and counting accuracy. Our framework operates in two distinct phases: estimation and prediction. In the estimation phase, the optimal region of interest (ROI) is automatically determined using a novel combination of three models based on detection scores, tracking scores, and vehicle density. This adaptive approach ensures compatibility with any detection and tracking method, enhancing the framework’s versatility. In the prediction phase, vehicle counting is efficiently performed within the estimated ROI. We evaluated our framework on benchmark datasets like UA-DETRAC, GRAM, CDnet 2014, and ATON. Results demonstrate exceptional accuracy, with most videos achieving 100% accuracy, while also enhancing computational efficiency, making processing up to four times faster than full-frame processing. The framework outperforms existing techniques, especially in complex multi-road scenarios, demonstrating robustness and superior accuracy. These advancements make it a promising solution for real-time traffic monitoring.
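估计阶段"融合检测分、跟踪分与车辆密度三个模型选取最优 ROI"的思路可以用如下最小草图表达;论文未公开具体融合规则,等权加和仅为假设:

```python
def select_roi(candidates, weights=(1.0, 1.0, 1.0)):
    """Return the ROI name maximizing a weighted fusion of the three
    estimation-phase scores: (detection, tracking, density), each assumed
    pre-normalized to [0, 1]. Equal default weights are an assumption."""
    def fused(scores):
        return sum(w * s for w, s in zip(weights, scores))
    return max(candidates, key=lambda name: fused(candidates[name]))
```

选出 ROI 后,预测阶段只需在该子区域内运行检测与计数,这正是摘要中相对全帧处理最高四倍加速的来源。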

[AI-53] CIA: Inferring the Communication Topology from LLM -based Multi-Agent Systems ACL2026

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的多智能体系统(Multi-Agent Systems, MAS)中通信拓扑结构的隐私泄露问题,即在受限的黑盒攻击场景下,攻击者可通过诱导中间代理的推理输出并建模其语义相关性,推断出系统的内部通信结构,从而暴露系统漏洞并威胁知识产权。解决方案的关键在于提出一种新型攻击方法——通信推断攻击(Communication Inference Attack, CIA),其核心创新包括:利用全局偏置解耦(global bias disentanglement)和LLM引导的弱监督(LLM-guided weak supervision)机制,精准建模代理间语义关联,并通过构造对抗性查询有效激发中间代理的推理行为,最终实现对通信拓扑的高精度推断(平均AUC达0.87,峰值达0.99)。

链接: https://arxiv.org/abs/2604.12461
作者: Yongxuan Wu,Xixun Lin,He Zhang,Nan Sun,Kun Wang,Chuan Zhou,Shirui Pan,Yanan Cao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL 2026, Main

点击查看摘要

Abstract:LLM-based Multi-Agent Systems (MAS) have demonstrated remarkable capabilities in solving complex tasks. Central to MAS is the communication topology which governs how agents exchange information internally. Consequently, the security of communication topologies has attracted increasing attention. In this paper, we investigate a critical privacy risk: MAS communication topologies can be inferred under a restrictive black-box setting, exposing system vulnerabilities and posing significant intellectual property threats. To explore this risk, we propose Communication Inference Attack (CIA), a novel attack that constructs new adversarial queries to induce intermediate agents’ reasoning outputs and models their semantic correlations through the proposed global bias disentanglement and LLM-guided weak supervision. Extensive experiments on MAS with optimized communication topologies demonstrate the effectiveness of CIA, achieving an average AUC of 0.87 and a peak AUC of up to 0.99, thereby revealing the substantial privacy risk in MAS.

[AI-54] Enhancing Clustering: An Explainable Approach via Filtered Patterns

【速读】:该论文旨在解决基于k-relaxed频繁模式(k-relaxed frequent patterns, k-RFPs)的可解释聚类方法中存在的冗余问题:多个不同的k-RFP可能诱导相同的k-cover,从而导致符号表示冗余,扩大搜索空间并增加计算复杂度。解决方案的关键在于提出一个模式精简框架,其核心包括三点:首先,形式化刻画不同k-RFP诱导相同k-cover的条件,为冗余检测提供理论基础;其次,设计一种优化策略,通过保留每个唯一k-cover的单一代表性模式来消除冗余;最后,通过分析ILP模型所选模式与其诱导簇的鲁棒性,验证其在保持或提升聚类质量的同时显著压缩搜索空间并提高计算效率。

链接: https://arxiv.org/abs/2604.12460
作者: Motaz Ben Hassine(1),Saïd Jabbour(1) ((1) CRIL)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine learning has become a central research area, with increasing attention devoted to explainable clustering, also known as conceptual clustering, which is a knowledge-driven unsupervised learning paradigm that partitions data into θ disjoint clusters, where each cluster is described by an explicit symbolic representation, typically expressed as a closed pattern or itemset. By providing human-interpretable cluster descriptions, explainable clustering plays an important role in explainable artificial intelligence and knowledge discovery. Recent work improved clustering quality by introducing k-relaxed frequent patterns (k-RFPs), a pattern model that relaxes strict coverage constraints through a generalized k-cover definition. This framework integrates constraint-based reasoning, using SAT solvers for pattern generation, with combinatorial optimization, using Integer Linear Programming (ILP) for cluster selection. Despite its effectiveness, this approach suffers from a critical limitation: multiple distinct k-RFPs may induce identical k-covers, leading to redundant symbolic representations that unnecessarily enlarge the search space and increase computational complexity during cluster construction. In this paper, we address this redundancy through a pattern reduction framework. Our contributions are threefold. First, we formally characterize the conditions under which distinct k-RFPs induce identical k-covers, providing theoretical foundations for redundancy detection. Second, we propose an optimization strategy that removes redundant patterns by retaining a single representative pattern for each distinct k-cover. Third, we investigate the interpretability and representativeness of the patterns selected by the ILP model by analyzing their robustness with respect to their induced clusters. Extensive experiments conducted on several real-world datasets demonstrate that the proposed approach significantly reduces the pattern search space, improves computational efficiency, preserves and enhances in some cases the quality of the resulting clusters.
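去冗余的核心思想——"每个不同的 k-cover 只保留一个代表模式"——可以用如下纯 Python 草图表达;其中 k-cover 按摘要的放宽覆盖定义近似理解(允许事务最多缺失模式中 k 个项),数据结构为示意:

```python
def deduplicate_patterns(patterns, transactions, k):
    """Keep one representative k-RFP per distinct k-cover.

    Here a pattern's k-cover is taken to be the set of transactions missing
    at most k of the pattern's items (one reading of the relaxed coverage in
    the abstract). Grouping patterns by the frozenset of their k-cover drops
    the redundant symbolic representations before cluster selection.
    """
    def k_cover(pattern):
        return frozenset(i for i, t in enumerate(transactions)
                         if len(pattern - t) <= k)

    representatives = {}
    for p in patterns:
        representatives.setdefault(k_cover(p), p)  # first pattern per cover wins
    return list(representatives.values())
```

ILP 只在去重后的模式集合上做簇选择,这正是摘要中搜索空间显著缩小、计算效率提升的来源。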

[AI-55] Operationalising the Right to be Forgotten in LLM s: A Lightweight Sequential Unlearning Framework for Privacy-Aligned Deployment in Politically Sensitive Environments

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在政治敏感环境中部署时,因记忆个人数据或机密内容而违反《通用数据保护条例》(GDPR)及其“被遗忘权”原则的技术挑战。解决方案的关键在于提出一种轻量级的顺序遗忘(sequential unlearning)框架,通过显式分离保留(retention)与抑制(suppression)目标来实现:首先通过正向微调稳定模型的良性能力,再采用层受限的负向微调抑制指定的敏感模式,同时保持通用语言能力不受显著影响。实验表明,该方法在SemEval-2025 LLM遗忘基准上能有效抑制行为偏差,且对事实准确性与流畅性影响最小,体现了其在政治部署场景下可操作、可复现的数据擦除机制潜力。

链接: https://arxiv.org/abs/2604.12459
作者: Esen Kurt,Haithem Afli
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed in politically sensitive environments, where memorisation of personal data or confidential content raises regulatory concerns under frameworks such as the GDPR and its Right to be Forgotten. Translating such legal principles into large-scale generative systems presents significant technical challenges. We introduce a lightweight sequential unlearning framework that explicitly separates retention and suppression objectives. The method first stabilises benign capabilities through positive fine-tuning, then applies layer-restricted negative fine-tuning to suppress designated sensitive patterns while preserving general language competence. Experiments on the SemEval-2025 LLM Unlearning benchmark demonstrate effective behavioural suppression with minimal impact on factual accuracy and fluency. GPT-2 exhibits greater robustness than DistilGPT-2, highlighting the role of model capacity in privacy-aligned adaptation. We position sequential unlearning as a practical and reproducible mechanism for operationalising data erasure requirements in politically deployed LLMs. (Journal reference: PoliticalNLP 2026)

[AI-56] RACF: A Resilient Autonomous Car Framework with Object Distance Correction IROS2026

【速读】:该论文旨在解决自动驾驶车辆在安全关键场景中因感知失效或网络物理攻击导致的不安全操作问题,特别是视觉距离估计易受环境退化和对抗扰动影响,而现有防御措施往往反应迟缓、难以及时保障安全运行。解决方案的关键在于提出一个鲁棒的自主汽车框架(Resilient Autonomous Car Framework, RACF),其核心是引入对象距离校正算法(Object Distance Correction Algorithm, ODCA),通过深度相机、激光雷达(LiDAR)与基于物理的动力学模型之间的冗余与多样性实现感知层的容错能力;当深度相机测得的距离存在异常时,跨传感器门控机制会触发ODCA进行实时校正,从而提升系统在强干扰下的鲁棒性与响应效率。

链接: https://arxiv.org/abs/2604.12418
作者: Chieh Tsai,Hossein Rastgoftar,Salim Hariri
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 9 figures, 5 tables. Submitted manuscript to IROS 2026

点击查看摘要

Abstract:Autonomous vehicles are increasingly deployed in safety-critical applications, where sensing failures or cyberphysical attacks can lead to unsafe operations resulting in human loss and/or severe physical damages. Reliable real-time perception is therefore critically important for their safe operations and acceptability. For example, vision-based distance estimation is vulnerable to environmental degradation and adversarial perturbations, and existing defenses are often reactive and too slow to promptly mitigate their impacts on safe operations. We present a Resilient Autonomous Car Framework (RACF) that incorporates an Object Distance Correction Algorithm (ODCA) to improve perception-layer robustness through redundancy and diversity across a depth camera, LiDAR, and physics-based kinematics. Within this framework, when obstacle distance estimation produced by the depth camera is inconsistent, a cross-sensor gate activates the correction algorithm to fix the detected inconsistency. We experimented with the proposed resilient car framework and evaluated its performance on a testbed implemented using the Quanser QCar 2 platform. The presented framework achieves up to 35% RMSE reduction under strong corruption and improves stop compliance and braking latency, while operating in real time. These results demonstrate a practical and lightweight approach to resilient perception for safety-critical autonomous driving.

[AI-57] Security and Resilience in Autonomous Vehicles: A Proactive Design Approach

【速读】:该论文旨在解决自动驾驶车辆(Autonomous Vehicles, AVs)在感知、控制、V2X通信及软件供应链等多层架构中面临的网络安全与物理威胁问题,这些问题可能破坏系统的可靠性与安全性。解决方案的关键在于提出一种AV弹性架构(AV Resilient architecture),通过冗余(redundancy)、多样性(diversity)和自适应重构(adaptive reconfiguration)策略实现系统韧性,并结合基于异常检测和哈希校验的入侵检测机制,有效识别深度相机致盲攻击和感知模块软件篡改行为。实验结果表明,快速异常检测与备用机制协同作用可保障在对抗环境下仍维持运行连续性,从而提升自动驾驶系统的安全可信水平。

链接: https://arxiv.org/abs/2604.12408
作者: Chieh Tsai,Murad Mehrab Abrar,Salim Hariri
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 20 pages. Accepted for publication as a book chapter

点击查看摘要

Abstract:Autonomous vehicles (AVs) promise efficient, clean and cost-effective transportation systems, but their reliance on sensors, wireless communications, and decision-making systems makes them vulnerable to cyberattacks and physical threats. This chapter presents novel design techniques to strengthen the security and resilience of AVs. We first provide a taxonomy of potential attacks across different architectural layers, from perception and control manipulation to Vehicle-to-Any (V2X) communication exploits and software supply chain compromises. Building on this analysis, we present an AV Resilient architecture that integrates redundancy, diversity, and adaptive reconfiguration strategies, supported by anomaly- and hash-based intrusion detection techniques. Experimental validation on the Quanser QCar platform demonstrates the effectiveness of these methods in detecting depth camera blinding attacks and software tampering of perception modules. The results highlight how fast anomaly detection combined with fallback and backup mechanisms ensures operational continuity, even under adversarial conditions. By linking layered threat modeling with practical defense implementations, this work advances AV resilience strategies for safer and more trustworthy autonomous vehicles.

[AI-58] Heuristic Classification of Thoughts Prompting (HCoT): Integrating Expert System Heuristics for Structured Reasoning into Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在处理复杂问题时的两大局限:一是其推理过程呈现贝叶斯类随机生成特性,即每个token从上下文相关的概率分布中采样,导致决策路径具有内在随机性而非确定性规划;二是推理与决策机制静态解耦,动态获取的领域知识无法实时调整底层推理策略,从而使得初始决策缺乏战略锚定,且推理链难以收敛至正确解。为解决上述问题,论文提出一种集成于LLM生成过程中的问题求解方法,其核心在于引入一种新颖的“思维启发式分类提示框架”(Heuristic-Classification-of-Thoughts prompting schema, HCoT),该框架通过一个启发式分类模型对推理过程进行结构化控制,并提供可复用的抽象解决方案,从而将LLM的推理能力与结构化的任务空间协同优化,在复杂归纳推理任务中显著优于现有方法(如Tree-of-Thoughts和Chain-of-Thoughts),并在24 Game等结构化任务中实现更高的Token效率,达成性能与计算成本之间的帕累托最优平衡。

链接: https://arxiv.org/abs/2604.12390
作者: Lei Lin,Jizhao Zhu,Yong Liu,Donghong Sun,Hongbo He,Yihua Du
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper addresses two limitations of large language models (LLMs) in solving complex problems: (1) their reasoning processes exhibit Bayesian-like stochastic generation, where each token is sampled from a context-dependent probability distribution, leading to inherently random decision trajectories rather than deterministic planning; (2) the reasoning and decision-making mechanisms are statically decoupled, meaning dynamically retrieved domain knowledge fails to dynamically adjust the underlying reasoning strategy. These dual deficiencies result in initial decisions lacking strategic anchoring and reasoning chains often failing to converge on correct solutions, as stochastic generation lacks mechanisms for trajectory correction or knowledge-guided optimization during sequential reasoning. To resolve these issues, we propose a problem-solving method integrated into the LLM’s generation process to guide reasoning. This method, compatible with numerous LLMs and featuring reusable solutions, is grounded in a novel Heuristic-Classification-of-Thoughts prompting schema (HCoT). HCoT synergizes the LLM’s reasoning ability with a structured problem space via a heuristic classification model that controls the reasoning process and provides reusable abstract solutions. Evaluated on two complex inductive reasoning tasks with ill-defined search spaces, HCoT outperforms existing approaches (e.g., Tree-of-Thoughts and Chain-of-Thoughts prompting) in performance. On the well-structured 24 Game task, HCoT demonstrates significantly higher token efficiency compared to the state-of-the-art Tree-of-Thoughts-Breadth-First-Search. In terms of both accuracy and token usage, HCoT achieves a Pareto frontier balance, offering a strong trade-off between performance and computational cost.
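HCoT 的"启发式分类 → 复用抽象解"流程可以用如下玩具示意理解:先用启发式规则把问题归类,再把该类别的可复用解题模板拼进提示词。类别、关键词规则与模板均为虚构示例,并非论文的分类模型:

```python
# Hypothetical sketch of the HCoT idea: a heuristic classifier maps a problem
# to a reusable abstract solution that then steers the LLM's reasoning.
RULES = {
    "arithmetic-search": ["reach", "using", "numbers", "24"],
    "pattern-induction": ["sequence", "next", "rule"],
}
TEMPLATES = {
    "arithmetic-search": "Plan: enumerate operand pairs, apply +,-,*,/, recurse on the reduced set.",
    "pattern-induction": "Plan: hypothesize a rule from early terms, verify on all terms, then extrapolate.",
    "default": "Plan: restate the goal, decompose, solve subgoals, verify.",
}

def classify(problem: str) -> str:
    """Keyword-count heuristic; falls back to a generic template on no match."""
    words = problem.lower()
    scores = {c: sum(k in words for k in kws) for c, kws in RULES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "default"

def build_prompt(problem: str) -> str:
    """Prepend the reusable abstract solution for the classified category."""
    return TEMPLATES[classify(problem)] + "\n\nProblem: " + problem
```

这也解释了摘要中 24 Game 上的 Token 效率优势:抽象解模板在分类后直接复用,无需像 Tree-of-Thoughts 那样在生成时展开大规模搜索。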

[AI-59] Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints ACL2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在微调过程中安全对齐(safety alignment)脆弱的问题,即即使进行良性适配,也可能导致预训练时的拒绝行为退化,从而引发有害响应。其解决方案的关键在于提出一种耦合权重与激活约束(Coupled Weight and Activation Constraints, CWAC)的新方法:该方法同时在权重更新上施加预计算的安全子空间约束,并通过稀疏自编码器识别出关键安全特征后实施针对性正则化,从而协同控制权重和激活的耦合效应,实现更鲁棒的安全性保留。

链接: https://arxiv.org/abs/2604.12384
作者: Songping Peng(1),Zhiheng Zhang(2),Daojian Zeng(1),Lincheng Jiang(3),Xieping Gao(1) ((1) Hunan Normal University, (2) University of Chinese Academy of Sciences, (3) National University of Defense Technology)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 6 figures, 6 tables, The 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

点击查看摘要

Abstract:Safety alignment in Large Language Models (LLMs) remains highly fragile during fine-tuning, where even benign adaptation can degrade pre-trained refusal behaviors and enable harmful responses. Existing defenses typically constrain either weights or activations in isolation, without considering their coupled effects on safety. In this paper, we first theoretically demonstrate that constraining either weights or activations alone is insufficient for safety preservation. To robustly preserve safety alignment, we propose Coupled Weight and Activation Constraints (CWAC), a novel approach that simultaneously enforces a precomputed safety subspace on weight updates and applies targeted regularization to safety-critical features identified by sparse autoencoders. Extensive experiments across four widely used LLMs and diverse downstream tasks show that CWAC consistently achieves the lowest harmful scores with minimal impact on fine-tuning accuracy, substantially outperforming strong baselines even under high harmful data ratios.
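CWAC 中"在权重更新上施加预计算安全子空间约束"的一侧可以用正交投影示意:把更新中落在安全子空间内的分量去掉,使安全对齐方向不被微调扰动。这里假设 B 的列是安全方向的标准正交基;激活端基于稀疏自编码器的正则项未展示:

```python
import numpy as np

def constrain_update(delta_w, safety_basis):
    """Project a weight update onto the orthogonal complement of a safety
    subspace, so safety-aligned directions are left untouched.

    `safety_basis` is assumed to be a (d, r) matrix with orthonormal columns
    spanning precomputed safety-critical input directions, and `delta_w` an
    (out, d) update. This sketches only the weight half of CWAC.
    """
    B = safety_basis
    return delta_w - delta_w @ B @ B.T  # remove the Span(B) component
```

论文的理论分析指出仅约束权重(如上)或仅约束激活都不足以保持安全性,因此两侧约束需要耦合施加。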

[AI-60] Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在编码任务中依赖显式推理时,如何有效评估其推理质量的问题。现有推理评估工具未针对编程任务设计,且主流基准主要聚焦于代码生成,忽视了代码摘要和分类等其他重要任务类型。为此,作者提出CodeRQ-Bench——首个覆盖生成、摘要与分类三类编码任务的推理质量评估基准,并基于对1,069个不匹配案例的分析识别出五类常见局限性及四项设计原则。据此,他们进一步提出VERA(Verification and Refinement-based Evaluator),一种两阶段评估框架:第一阶段通过证据锚定验证确保推理逻辑与代码实现的一致性,第二阶段引入歧义感知的分数修正机制以提升评估准确性。实验表明,VERA在四个数据集上显著优于强基线模型,AUCROC最高提升0.26,AUPRC最高提升0.21,为高质量推理评估提供了可复现且具扩展性的解决方案。

链接: https://arxiv.org/abs/2604.12379
作者: Yuangang Li,Justin Tian Jin Chen,Ethan Yu,David Hong,Iftekhar Ahmed
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) increasingly rely on explicit reasoning to solve coding tasks, yet evaluating the quality of this reasoning remains challenging. Existing reasoning evaluators are not designed for coding, and current benchmarks focus primarily on code generation, leaving other coding tasks largely unexplored. We introduce CodeRQ-Bench, the first benchmark for evaluating LLM reasoning quality across three coding task categories: generation, summarization, and classification. Using this benchmark, we analyze 1,069 mismatch cases from existing evaluators, identify five recurring limitations, and derive four design insights for reasoning evaluation in coding tasks. Guided by these insights, we propose VERA, a two-stage evaluator that combines evidence-grounded verification with ambiguity-aware score correction. Experiments on CodeRQ-Bench show that VERA consistently outperforms strong baselines across four datasets, improving AUCROC by up to 0.26 and AUPRC by up to 0.21. We release CodeRQ-Bench at this https URL, supporting future investigations.

[AI-61] Scaffold-Conditioned Preference Triplets for Controllable Molecular Optimization with Large Language Models

【速读】:该论文旨在解决分子属性优化中现有深度学习方法依赖黑箱评分、难以控制骨架保留且生成结果不稳定或生物学上不合理的问题。其核心解决方案是提出骨架条件偏好三元组(Scaffold-Conditioned Preference Triplets, SCPT),通过骨架对齐与化学驱动的过滤器构建约束相似性的三元组 ⟨scaffold, better, worse⟩,从而为预训练分子大语言模型(molecular LLM)提供基于化学知识的偏好监督信号。该方法使模型能够进行保留骨架的属性改进编辑,在单目标和多目标基准测试中均显著提升优化成功率和属性增益,同时保持更高的骨架相似性,并展现出在有限高阶监督下向三属性任务外推的良好泛化能力。

链接: https://arxiv.org/abs/2604.12350
作者: Yi Xiong,Liang Xiong,Xiaohong Ji,Sen Yang,Zhifeng Gao,Huaimin Wang,Kele Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Molecular property optimization is central to drug discovery, yet many deep learning methods rely on black-box scoring and offer limited control over scaffold preservation, often producing unstable or biologically implausible edits. While large language models (LLMs) are promising molecular generators, optimization remains constrained by the lack of chemistry-grounded preference supervision and principled data curation. We introduce Scaffold-Conditioned Preference Triplets (SCPT), a pipeline that constructs similarity-constrained triplets ⟨scaffold, better, worse⟩ via scaffold alignment and chemistry-driven filters for validity, synthesizability, and meaningful property gains. Using these preferences, we align a pretrained molecular LLM as a conditional editor, enabling property-improving edits that retain the scaffold. Across single- and multi-objective benchmarks, SCPT improves optimization success and property gains while maintaining higher scaffold similarity than competitive baselines. Compared with representative non-LLM molecular optimization methods, SCPT-trained LLMs are better suited to scaffold-constrained and multi-objective optimization. In addition, models trained on single-property and two-property supervision generalize effectively to three-property tasks, indicating promising extrapolative generalization under limited higher-order supervision. SCPT also provides controllable data-construction knobs that yield a predictable similarity-gain frontier, enabling systematic adaptation to diverse optimization regimes.

[AI-62] GeM-EA: A Generative and Meta-learning Enhanced Evolutionary Algorithm for Streaming Data-Driven Optimization GECCO2026

【速读】:该论文针对流式数据驱动优化(Streaming Data-Driven Optimization, SDDO)问题中因概念漂移(concept drift)导致的非平稳优化环境所引发的挑战,提出了一种融合元学习与生成式重放机制的进化算法——GeM-EA。其核心解决方案在于:首先通过双层元学习策略在检测到概念漂移时快速利用环境相关先验初始化代理模型(surrogate model),同时引入线性残差项捕捉全局趋势;其次采用多岛进化策略结合生成式重放(generative replay)机制,有效复用历史知识以加速搜索过程。该方法显著提升了适应速度与鲁棒性,优于现有最优方法。

链接: https://arxiv.org/abs/2604.12336
作者: Yue Wu,Yuan-Ting Zhong,Ze-Yuan Ma,Yue-Jiao Gong
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: accepted by GECCO 2026

点击查看摘要

Abstract:Streaming Data-Driven Optimization (SDDO) problems arise in many applications where data arrive continuously and the optimization environment evolves over time. Concept drift produces non-stationary landscapes, making optimization methods challenging due to outdated models. Existing approaches often rely on simple surrogate combinations or directly injecting solutions, which may cause negative transfer under sudden environmental changes. We propose GeM-EA, a Generative and Meta-learning Enhanced Evolutionary Algorithm for SDDO that unifies meta-learned surrogate adaptation with generative replay for effective evolutionary search. Upon detecting concept drift, a bi-level meta-learning strategy rapidly initializes the surrogate using environment-relevant priors, while a linear residual component captures global trends. A multi-island evolutionary strategy further leverages historical knowledge via generative replay to accelerate optimization. Experimental results on benchmark SDDO problems demonstrate that GeM-EA achieves faster adaptation and improved robustness compared with state-of-the-art methods.

[AI-63] Black-Box Optimization From Small Offline Datasets via Meta Learning with Synthetic Tasks AISTATS

【速读】:该论文旨在解决离线黑箱优化(offline black-box optimization)中的数据稀缺问题,即在仅有少量或低质量实验数据的情况下,如何有效发现最优设计(如分子或材料)。现有算法在此类场景下性能受限,核心瓶颈在于代理模型难以准确捕捉优化偏差(optimization bias,指正确排序输入设计的能力)。解决方案的关键在于提出一种基于合成任务生成的元学习框架OptBias:通过在高斯过程(Gaussian process)上生成合成任务来学习可复用的优化偏差,并将其用于目标任务的小样本数据微调,从而显著提升代理模型在小数据条件下的优化性能。

链接: https://arxiv.org/abs/2604.12325
作者: Azza Fadhel, TheHung Tran,Trong Nghia Hoang,Jana Doppa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for Publication at International Conference on Artificial Intelligence and Statistics (AISTATS)

点击查看摘要

Abstract:We consider the problem of offline black-box optimization, where the goal is to discover optimal designs (e.g., molecules or materials) from past experimental data. A key challenge in this setting is data scarcity: in many scientific applications, only small or poor-quality datasets are available, which severely limits the effectiveness of existing algorithms. Prior work has theoretically and empirically shown that performance of offline optimization algorithms depends on how well the surrogate model captures the optimization bias (i.e., ability to rank input designs correctly), which is challenging to accomplish with limited experimental data. This paper proposes Surrogate Learning with Optimization Bias via Synthetic Task Generation (OptBias), a meta-learning framework that directly tackles data scarcity. OptBias learns a reusable optimization bias by training on synthetic tasks generated from a Gaussian process, and then fine-tunes the surrogate model on the small data for the target task. Across diverse continuous and discrete offline optimization benchmarks, OptBias consistently outperforms state-of-the-art baselines in small data regimes. These results highlight OptBias as a robust and practical solution for offline optimization in realistic small data settings.
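OptBias 的合成任务来自高斯过程先验。下面用 RBF 核从一维 GP 先验采样若干合成目标函数作示意,元学习器可先在这些任务上学习排序偏差,再在稀缺真实数据上微调;超参数为假设,论文的任务生成器可能不同:

```python
import numpy as np

def sample_synthetic_tasks(n_tasks, n_points, lengthscale=0.2, seed=0):
    """Draw synthetic objective functions from a Gaussian-process prior.

    Each task is (X, y): sorted inputs on [0, 1] and one GP sample under an
    RBF kernel with a small jitter for numerical stability.
    """
    rng = np.random.default_rng(seed)
    tasks = []
    for _ in range(n_tasks):
        X = np.sort(rng.random(n_points))
        d2 = (X[:, None] - X[None, :]) ** 2
        K = np.exp(-0.5 * d2 / lengthscale**2) + 1e-8 * np.eye(n_points)
        y = rng.multivariate_normal(np.zeros(n_points), K)
        tasks.append((X, y))
    return tasks
```

改变 lengthscale 可以控制合成地形的光滑程度,从而覆盖不同难度的排序任务。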

[AI-64] GCA Framework: A Gulf-Grounded Dataset and Agent ic Pipeline for Climate Decision Support

【速读】:该论文旨在解决海湾地区气候决策中缺乏能够将异构科学与政策证据转化为可操作指导的系统问题,尤其针对通用大语言模型(Large Language Models, LLMs)在区域特定气候知识和与地理空间及预测工具交互能力上的不足。解决方案的关键在于提出GCA框架,其核心由两部分组成:(i) GCA-DS——一个涵盖政府政策、适应计划、非政府组织与国际框架、学术文献及极端天气事件报道的多模态数据集(约20万条问答对),并融合遥感影像与文本证据;(ii) Gulf Climate Agent (GCA),一种基于工具增强的代理系统,通过整合实时与历史信号及地理空间处理模块,生成衍生指标和可解释可视化结果。实证表明,领域微调与工具集成显著提升了模型在海湾气候任务中的可靠性。

链接: https://arxiv.org/abs/2604.12306
作者: Muhammad Umer Sheikh,Khawar Shehzad,Salman Khan,Fahad Shahbaz Khan,Muhammad Haris Khan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Climate decision-making in the Gulf increasingly demands systems that can translate heterogeneous scientific and policy evidence into actionable guidance, yet general-purpose large language models (LLMs) remain weak both in region-specific climate knowledge and grounded interaction with geospatial and forecasting tools. We present the GCA framework, which unifies (i) GCA-DS, a curated Gulf-focused multimodal dataset, and (ii) Gulf Climate Agent (GCA), a tool-augmented agent for climate analysis. GCA-DS comprises ~200k question-answer pairs spanning governmental policies and adaptation plans, NGO and international frameworks, academic literature, and event-driven reporting on heatwaves, dust storms, and floods, complemented with remote-sensing inputs that couple imagery with textual evidence. Building on this foundation, the GCA agent orchestrates a modular tool pipeline grounded in real-time and historical signals and geospatial processing that produces derived indices and interpretable visualizations. Finally, we benchmark open and proprietary LLMs on Gulf climate tasks and show that domain fine-tuning and tool integration substantially improve reliability over general-purpose baselines.

[AI-65] Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads

【速读】:该论文旨在解决在编码代理(coding agent)应用中,使用前沿云大语言模型(Large Language Model, LLM)时产生的高token消耗问题。其核心挑战在于如何在保持任务准确性与响应效率的前提下,显著降低对云LLM的调用频率和token使用量。解决方案的关键在于引入一个小型本地模型作为预处理分诊层(triage layer),并系统性地评估七种优化策略:包括本地路由(local routing)、提示压缩(prompt compression)、语义缓存(semantic caching)、本地草稿+云端审核(local drafting with cloud review)、最小差异编辑(minimal-diff edits)、结构化意图提取(structured intent extraction)以及结合供应商提示缓存的批量处理(batching with vendor prompt caching)。实证结果显示,不同工作负载下最优策略组合存在显著差异,其中本地路由与提示压缩组合在编辑密集型和解释密集型任务中可节省45–79%的云token,而完整策略集(含草稿-审核机制)在检索增强生成(RAG-heavy)场景下实现51%的token节约,表明策略选择需根据具体应用场景进行定制化配置。

链接: https://arxiv.org/abs/2604.12301
作者: Justice Owusu Agyemang,Jerry John Kponyo,Elliot Amponsah,Godfred Manu Addo Boakye,Kwame Opuni-Boachie Obour Agyekum
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:We present a systematic measurement study of seven tactics for reducing cloud LLM token usage when a small local model can act as a triage layer in front of a frontier cloud model. The tactics are: (1) local routing, (2) prompt compression, (3) semantic caching, (4) local drafting with cloud review, (5) minimal-diff edits, (6) structured intent extraction, and (7) batching with vendor prompt caching. We implement all seven in an open-source shim that speaks both MCP and the OpenAI-compatible HTTP surface, supporting any local model via Ollama and any cloud model via an OpenAI-compatible endpoint. We evaluate each tactic individually, in pairs, and in a greedy-additive subset across four coding-agent workload classes (edit-heavy, explanation-heavy, general chat, RAG-heavy). We measure tokens saved, dollar cost, latency, and routing accuracy. Our headline finding is that T1 (local routing) combined with T2 (prompt compression) achieves 45-79% cloud token savings on edit-heavy and explanation-heavy workloads, while on RAG-heavy workloads the full tactic set including T4 (draft-review) achieves 51% savings. We observe that the optimal tactic subset is workload-dependent, which we believe is the most actionable finding for practitioners deploying coding agents today.
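T1(本地路由)与 T3(语义缓存)的组合可以用一个玩具分诊层示意:按启发式把简单请求交给本地模型,并对规范化后的提示做缓存。这里用精确匹配近似语义缓存;类名与接口均为虚构,并非该开源 shim 的真实 API:

```python
import hashlib

class TriageShim:
    """Toy version of tactics T1 (local routing) and T3 (semantic caching).

    `local` and `cloud` are callables (prompt -> answer); `is_simple` is the
    routing heuristic. Exact-match caching on a whitespace/case-normalized
    prompt stands in for true semantic caching.
    """
    def __init__(self, local, cloud, is_simple):
        self.local, self.cloud, self.is_simple = local, cloud, is_simple
        self.cache, self.cloud_calls = {}, 0

    def ask(self, prompt):
        key = hashlib.sha256(" ".join(prompt.lower().split()).encode()).hexdigest()
        if key in self.cache:
            return self.cache[key]  # cache hit: zero cloud tokens
        if self.is_simple(prompt):
            answer = self.local(prompt)
        else:
            self.cloud_calls += 1
            answer = self.cloud(prompt)
        self.cache[key] = answer
        return answer
```

每条被路由到本地或命中缓存的请求都不消耗云端 token,这正是摘要中 45–79% 节省的直觉来源;最优策略组合则依赖具体工作负载。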

[AI-66] GAM: Hierarchical Graph-based Agentic Memory for LLM Agents

【Quick Read】: This paper tackles the tension LLM agents face in long-term interaction between rapidly perceiving and absorbing new information to adapt to evolving dialogue and stably retaining prior knowledge without interference. Existing unified stream-based memory systems are vulnerable to transient noise, while discrete structured memory architectures struggle to track evolving narratives. The key to the proposed solution, GAM, a hierarchical Graph-based Agentic Memory framework, is to explicitly decouple memory encoding from consolidation: ongoing dialogue is isolated in an event progression graph and integrated into a topic associative network only upon significant semantic shifts, minimizing interference while preserving long-term consistency. A graph-guided, multi-factor retrieval strategy further improves contextual precision.

Link: https://arxiv.org/abs/2604.12285
Authors: Zhaofen Wu, Hanrong Zhang, Fulin Lin, Wujiang Xu, Xinran Xu, Yankai Chen, Henry Peng Zou, Shaowen Chen, Weizhi Zhang, Xue Liu, Philip S. Yu, Hongwei Wang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 18 pages, 6 figures

Abstract:To sustain coherent long-term interactions, Large Language Model (LLM) agents must navigate the tension between acquiring new information and retaining prior knowledge. Current unified stream-based memory systems facilitate context updates but remain vulnerable to interference from transient noise. Conversely, discrete structured memory architectures provide robust knowledge retention but often struggle to adapt to evolving narratives. To address this, we propose GAM, a hierarchical Graph-based Agentic Memory framework that explicitly decouples memory encoding from consolidation to effectively resolve the conflict between rapid context perception and stable knowledge retention. By isolating ongoing dialogue in an event progression graph and integrating it into a topic associative network only upon semantic shifts, our approach minimizes interference while preserving long-term consistency. Additionally, we introduce a graph-guided, multi-factor retrieval strategy to enhance context precision. Experiments on LoCoMo and LongDialQA indicate that our method consistently outperforms state-of-the-art baselines in both reasoning accuracy and efficiency.
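
A toy sketch of the decoupled encode/consolidate idea: utterances accumulate in an event buffer, and the buffer is consolidated into long-term topics only when a semantic shift is detected. The similarity threshold, centroid rule, and flat topic list are illustrative assumptions; GAM's actual event and topic graphs are far richer.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

class MemorySketch:
    """Utterances accumulate in a current 'event' buffer (encoding); the event
    is consolidated into long-term 'topics' only when a semantic shift, i.e.
    low similarity to the event centroid, is detected."""

    def __init__(self, shift_threshold=0.5):
        self.threshold = shift_threshold
        self.event = []    # ongoing event buffer
        self.topics = []   # consolidated long-term memory

    def _centroid(self):
        dim = len(self.event[0])
        return [sum(v[i] for v in self.event) / len(self.event) for i in range(dim)]

    def observe(self, emb):
        if self.event and cosine(emb, self._centroid()) < self.threshold:
            self.topics.append(self.event)  # consolidate on semantic shift
            self.event = []
        self.event.append(emb)
```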

[AI-67] SpanKey: Dynamic Key Space Conditioning for Neural Network Access Control

【Quick Read】: This paper targets inference-time security for generative AI models, specifically access control over model outputs without encrypting weights or sacrificing performance. The challenge is a lightweight mechanism under which only users holding a valid key obtain useful inference results, while invalid users get nothing meaningful. The key to the solution is subspace key injection: a basis matrix $B$ defines a low-dimensional key subspace $\text{Span}(B)$; during training, coefficients $\alpha$ are sampled to form keys $k = \alpha^\top B$, which are injected into intermediate activations via additive or multiplicative maps with strength $\gamma$. Valid keys lie inside the subspace, invalid keys outside it, so key validity separates the inference results different users obtain. Combined with a multi-layer design space and "deny losses", this markedly improves key discrimination, and the paper identifies key absorption as the dominant failure mode together with matching analytical tools (a Beta-energy split and margin-tail diagnostics).

Link: https://arxiv.org/abs/2604.12254
Authors: WenBin Yan
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 15 pages, 1 figure, multiple tables. Preprint (not yet published in a journal). Affiliation: University of Colorado Boulder. Code: this https URL

Abstract:SpanKey is a lightweight way to gate inference without encrypting weights or chasing leaderboard accuracy on gated inference. The idea is to condition activations on secret keys. A basis matrix B defines a low-dimensional key subspace Span(B); during training we sample coefficients \alpha and form keys k = \alpha^\top B, then inject them into intermediate activations with additive or multiplicative maps and strength \gamma. Valid keys lie in Span(B); invalid keys are sampled outside that subspace. We make three points. (i) Mechanism: subspace key injection and a multi-layer design space. (ii) Failure mode: key absorption, together with two analytical results (a Beta-energy split and margin-tail diagnostics), explains weak baseline separation in energy and margin terms; these are not a security theorem. (iii) Deny losses and experiments: Modes A–C and extensions, with CIFAR-10 ResNet-18 runs and MNIST ablations for Mode B. We summarize setup and first-order analysis, injectors, absorption, deny losses and ablations, a threat discussion that does not promise cryptography, and closing remarks on scale. Code: this https URL
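
The key geometry is easy to reproduce numerically. Below is a minimal sketch of subspace keys and activation injection; the dimensions, the default gamma, and the least-squares residual check are our own illustrative choices, not SpanKey's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_act, r = 64, 4                      # activation width and subspace rank (hypothetical)
B = rng.standard_normal((r, d_act))   # rows of B span the secret key subspace

def residual_outside_span(k):
    """Norm of the component of k orthogonal to Span(B)."""
    coef, *_ = np.linalg.lstsq(B.T, k, rcond=None)
    return float(np.linalg.norm(k - coef @ B))

def make_valid_key(rng):
    """Sample coefficients alpha and form k = alpha^T B, so k lies in Span(B)."""
    return rng.standard_normal(r) @ B

def make_invalid_key(rng):
    """Sample a random key and strip its Span(B) component, leaving it outside."""
    k = rng.standard_normal(d_act)
    coef, *_ = np.linalg.lstsq(B.T, k, rcond=None)
    return k - coef @ B

def inject(h, k, gamma=0.5, mode="add"):
    """Condition an intermediate activation h on key k with strength gamma."""
    return h + gamma * k if mode == "add" else h * (1.0 + gamma * k)
```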

[AI-68] A Scoping Review of Large Language Model-Based Pedagogical Agents

【Quick Read】: This paper addresses the lack of a systematic synthesis of research on large language model (LLM)-based pedagogical agents in educational technology. Traditional pedagogical agents are well studied, but the natural language understanding, reasoning, and adaptivity that LLMs bring substantially raise their potential in education, so their design characteristics, application contexts, and trends need to be mapped out. The key to the solution is a PRISMA-ScR-guided analysis of 52 studies indexed in five major databases between November 2022 and January 2025, which identifies four core design dimensions (interaction approach, domain scope, role complexity, and system integration) and reveals emerging trends such as multi-agent systems, virtual student simulation, and integration with immersive technologies, providing a structured framework and key directions for follow-on research.

Link: https://arxiv.org/abs/2604.12253
Authors: Shan Li, Juan Zheng
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:This scoping review examines the emerging field of Large Language Model (LLM)-based pedagogical agents in educational settings. While traditional pedagogical agents have been extensively studied, the integration of LLMs represents a transformative advancement with unprecedented capabilities in natural language understanding, reasoning, and adaptation. Following PRISMA-ScR guidelines, we analyzed 52 studies across five major databases from November 2022 to January 2025. Our findings reveal diverse LLM-based agents spanning K-12, higher education, and informal learning contexts across multiple subject domains. We identified four key design dimensions characterizing these agents: interaction approach (reactive vs. proactive), domain scope (domain-specific vs. general-purpose), role complexity (single-role vs. multi-role), and system integration (standalone vs. integrated). Emerging trends include multi-agent systems that simulate naturalistic learning environments, virtual student simulation for agent evaluation, integration with immersive technologies, and combinations with learning analytics. We also discuss significant research gaps and ethical considerations regarding privacy, accuracy, and student autonomy. This review provides researchers and practitioners with a comprehensive understanding of LLM-based pedagogical agents while identifying crucial areas for future development in this rapidly evolving field.

[AI-69] TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

【Quick Read】: This paper addresses the susceptibility of deployed large language models (LLMs) to jailbreak attacks that exploit vulnerabilities in chat templates, crafting inputs that bypass safety mechanisms to elicit harmful content. Existing work concentrates on prompt injection, which often demands costly prompt engineering and overlooks the chat template as a critical attack surface. The key to the solution is TEMPLATEFUZZ, a fine-grained fuzzing framework whose core innovations are: (1) element-level mutation rules that generate diverse chat template variants; (2) a heuristic search strategy that raises the attack success rate (ASR) while preserving the model's original accuracy; and (3) an active-learning-derived lightweight rule-based oracle for accurate and efficient jailbreak evaluation. Experiments on multiple open-source and commercial LLMs show it clearly outperforms prior techniques, reaching an average ASR of 98.2% with only 1.1% accuracy degradation.

Link: https://arxiv.org/abs/2604.12232
Authors: Qingchao Shen, Zibo Xiao, Lili Huang, Enwei Hu, Yongqiang Tian, Junjie Chen
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:Large Language Models (LLMs) are increasingly deployed across diverse domains, yet their vulnerability to jailbreak attacks, where adversarial inputs bypass safety mechanisms to elicit harmful outputs, poses significant security risks. While prior work has primarily focused on prompt injection attacks, these approaches often require resource-intensive prompt engineering and overlook other critical components, such as chat templates. This paper introduces TEMPLATEFUZZ, a fine-grained fuzzing framework that systematically exposes vulnerabilities in chat templates, a critical yet underexplored attack surface in LLMs. Specifically, TEMPLATEFUZZ (1) designs a series of element-level mutation rules to generate diverse chat template variants, (2) proposes a heuristic search strategy to guide the chat template generation toward the direction of amplifying the attack success rate (ASR) while preserving model accuracy, and (3) integrates an active learning-based strategy to derive a lightweight rule-based oracle for accurate and efficient jailbreak evaluation. Evaluated on twelve open-source LLMs across multiple attack scenarios, TEMPLATEFUZZ achieves an average ASR of 98.2% with only 1.1% accuracy degradation, outperforming state-of-the-art methods by 9.1%-47.9% in ASR and 8.4% in accuracy degradation. Moreover, even on five industry-leading commercial LLMs where chat templates cannot be specified, TEMPLATEFUZZ attains a 90% average ASR via chat template-based prompt injection attacks.

[AI-70] Modality-Native Routing in Agent -to-Agent Networks: A Multimodal A2A Protocol Extension

【Quick Read】: This paper studies cross-modal reasoning accuracy in multi-agent systems, where the core challenge is preserving and conveying information from different modalities (voice, image, text) so that downstream tasks can reason precisely. Prevailing text-bottleneck approaches compress everything into text, incurring semantic loss and context dilution. The key to the solution, the MMA2A architecture, is modality-native routing in Agent-to-Agent (A2A) networks: guided by each agent's capability declarations (Agent Card), data in each modality is dynamically routed in its native format, preserving richer context. Experiments on the CrossModal-CS benchmark show task completion accuracy rising from 32% to 52%, and the advantage materializes only when the downstream reasoning agent is capable enough to exploit it, confirming that protocol-level routing and agent-level reasoning capability must be designed together.

Link: https://arxiv.org/abs/2604.12213
Authors: Vasundra Srinivasan
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures (TikZ). PDFLaTeX. Supplementary code and experiment artifacts: this https URL

Abstract:Preserving multimodal signals across agent boundaries is necessary for accurate cross-modal reasoning, but it is not sufficient. We show that modality-native routing in Agent-to-Agent (A2A) networks improves task accuracy by 20 percentage points over text-bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves. An ablation replacing LLM-backed reasoning with keyword matching eliminates the accuracy gap entirely (36% vs. 36%), establishing a two-layer requirement: protocol-level routing must be paired with capable agent-level reasoning for the benefit to materialize. We present MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality. On CrossModal-CS, a controlled 50-task benchmark with the same LLM backend, same tasks, and only the routing path varying, MMA2A achieves 52% task completion accuracy versus 32% for the text-bottleneck baseline (95% bootstrap CI on ΔTCA: [8, 32] pp; McNemar's exact p = 0.006). Gains concentrate on vision-dependent tasks: product defect reports improve by +38.5 pp and visual troubleshooting by +16.7 pp. This accuracy gain comes at a 1.8× latency cost from native multimodal processing. These results suggest that routing is a first-order design variable in multi-agent systems, as it determines the information available for downstream reasoning.
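
The routing rule itself is conceptually simple; here is a hypothetical sketch of Agent Card capability matching with a text-bottleneck fallback. The agent names and card format are invented, not MMA2A's wire format.

```python
def route_part(part_modality, agent_cards, fallback="text_agent"):
    """Route a message part to the first agent whose Agent Card declares
    native support for its modality; otherwise fall back to transcribing
    the part to text (the text-bottleneck path)."""
    for agent, capabilities in agent_cards.items():
        if part_modality in capabilities:
            return agent, part_modality
    return fallback, "text"
```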

[AI-71] Unveiling the Surprising Efficacy of Navigation Understanding in End-to-End Autonomous Driving ICRA2026

【Quick Read】: This paper addresses the tendency of current end-to-end autonomous driving systems to over-rely on local scene understanding while neglecting global navigation information, which makes navigation-following hard to execute in complex scenarios. The key to the solution is Sequential Navigation Guidance (SNG), an efficient representation of global navigation information grounded in real-world navigation patterns that combines navigation paths for constraining long-term trajectories with turn-by-turn (TBT) information for real-time decision logic. Building on it, the authors construct SNG-QA, a visual question answering dataset that aligns global and local planning, and design SNG-VLA, a model that effectively fuses local and global planning and reaches state-of-the-art performance without auxiliary perception loss functions.

Link: https://arxiv.org/abs/2604.12208
Authors: Zhihua Hua, Junli Wang, Pengfei LI, Qihao Jin, Bo Zhang, Kehua Sheng, Yilun Chen, Zhongxue Gan, Wenchao Ding
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures. ICRA 2026. Code available at this https URL

Abstract:Global navigation information and local scene understanding are two crucial components of autonomous driving systems. However, our experimental results indicate that many end-to-end autonomous driving systems tend to over-rely on local scene understanding while failing to utilize global navigation information. These systems exhibit weak correlation between their planning capabilities and navigation input, and struggle to perform navigation-following in complex scenarios. To overcome this limitation, we propose the Sequential Navigation Guidance (SNG) framework, an efficient representation of global navigation information based on real-world navigation patterns. The SNG encompasses both navigation paths for constraining long-term trajectories and turn-by-turn (TBT) information for real-time decision-making logic. We constructed the SNG-QA dataset, a visual question answering (VQA) dataset based on SNG that aligns global and local planning. Additionally, we introduce an efficient model SNG-VLA that fuses local planning with global planning. The SNG-VLA achieves state-of-the-art performance through precise navigation information modeling without requiring auxiliary loss functions from perception tasks. Project page: SNG-VLA

[AI-72] Latent patterns of urban mixing in mobility analysis across five global cities

【Quick Read】: This paper addresses the quantification and drivers of social mixing in urban environments: how to accurately identify individuals' patterns of social interaction across space and time, and what factors lie behind them. Studies based only on high-resolution mobility data struggle to capture how individual-level socioeconomic characteristics actually shape social mixing. Here, large-scale travel surveys covering over 200,000 residents are combined with a graph neural network to build fine-grained spatio-temporal place networks, and home-space, activity-space, and demographic attributes are fed into a supervised autoencoder to predict individual exposure vectors. The key insight is that the structure of an individual's activity space is the dominant factor explaining variation in social mixing, far outweighing socioeconomic status, home environment, or transit proximity: mobility behaviour itself shapes experienced social mixing, and even when different income groups experience similar mixing levels, their activity spaces remain structurally stratified, yielding fundamentally different mixing experiences.

Link: https://arxiv.org/abs/2604.12202
Authors: Z. Fan, B. P. Y. Loo, F. Duarte, C. Ratti, E. Moro
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: Fan, Z., Loo, B.P.Y., Duarte, F., Ratti, C., Moro, E. (2026). Latent patterns of urban mixing in mobility analysis across five global cities. Nature Cities, accepted

Abstract:This study leverages large-scale travel surveys for over 200,000 residents across Boston, Chicago, Hong Kong, London, and Sao Paulo. With rich individual-level data, we make systematic comparisons and reveal patterns in social mixing, which cannot be identified by analyzing high-resolution mobility data alone. Using the same set of data, inferring socioeconomic status from residential neighborhoods yield social mixing levels 16% lower than using self-reported survey data. Besides, individuals over the age of 66 experience greater social mixing than those in late working life (aged 55 to 65), lending data-driven support to the “second youth” hypothesis. Teenagers and women with caregiving responsibilities exhibit lower social mixing levels. Across the five cities, proximity to major transit stations reduces the influence of individual socioeconomic status on social mixing. Finally, we construct detailed spatio-temporal place networks for each city using a graph neural network. Inputs of home-space, activity-space and demographic attributes are embedded and fed into a supervised autoencoder to predict individual exposure vectors. Results show that the structure of individual activity space, i.e., where people travel to, explains most of the variations in place exposure, suggesting that mobility shapes experienced social mixing more than sociodemographic characteristics, home environment, and transit proximity. The ablation tests further discover that, while different income groups may experience similar levels of social mixing, their activity spaces remain stratified by income, resulting in structurally different social mixing experiences.

[AI-73] Beyond Scores: Diagnostic LLM Evaluation via Fine-Grained Abilities

【Quick Read】: This paper addresses how current large language model (LLM) evaluation aggregates performance on diverse tasks into a single score, obscuring fine-grained ability differences and thereby limiting targeted model improvement and ability-guided selection for specific tasks. The key to the solution is a cognitive diagnostic framework that builds multidimensional ability taxonomies grounded in cognitive theory and domain knowledge (e.g., 35 dimensions for mathematics, 27 for physics) and applies multidimensional Item Response Theory (MIRT) with an item-ability association matrix to estimate a model's ability levels along fine-grained dimensions, which in turn predict performance on unseen items. Validation shows strong criterion validity and cross-benchmark consistency across several scientific domains, substantially exceeding baselines.

Link: https://arxiv.org/abs/2604.12191
Authors: Xu Zhang, Xudong Gong, Jiacheng Qin, Qiang Wang, JiaQi Liao, Zhe Wang, Dawei Feng, Bo Ding
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Current evaluations of large language models aggregate performance across diverse tasks into single scores. This obscures fine-grained ability variation, limiting targeted model improvement and ability-guided selection for specific tasks. Motivated by this gap, we propose a cognitive diagnostic framework that estimates model abilities across multiple fine-grained dimensions. For mathematics, we construct a 35-dimensional ability taxonomy grounded in cognitive theory and domain knowledge. The framework employs multidimensional Item Response Theory with an item-ability association matrix to estimate fine-grained ability levels, which in turn enable prediction of performance on unseen items (questions of benchmark). Evaluated on 41 models, our approach demonstrates strong criterion validity, consistent ability estimates across benchmarks, and accurate prediction of unseen items with AUC ranging from 0.80 to 0.89 within benchmarks and from 0.77 to 0.86 across benchmarks, substantially exceeding trivial baselines. The framework generalizes across scientific domains, producing consistent diagnostic performance in physics (27 dimensions), chemistry (58 dimensions), and computer science (12 dimensions). This work establishes a principled framework for fine-grained assessment of abilities, with potential applications in targeted training, ability-guided model selection, and ability-aware benchmark design.
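
A minimal sketch of the multidimensional IRT response function with an item-ability association (Q-matrix-style) mask: the probability of answering an item correctly is a logistic function of the abilities the item taps. All numbers below are toy values, not estimates from the paper.

```python
import math

def mirt_prob(theta, a, d, q_row):
    """Multidimensional 2PL-style link: P(correct) = sigmoid(d + sum_j q_j * a_j * theta_j).

    theta: model ability level per dimension
    a:     item discrimination per dimension
    d:     item easiness (intercept)
    q_row: 0/1 item-ability association row (which abilities the item taps)
    """
    z = d + sum(q * ai * ti for q, ai, ti in zip(q_row, a, theta))
    return 1.0 / (1.0 + math.exp(-z))

# Toy example with 3 ability dimensions (hypothetical numbers).
theta_strong = [1.5, 0.8, -0.2]
theta_weak = [-1.0, -0.5, 0.0]
item = {"a": [1.2, 0.7, 0.9], "d": -0.3, "q": [1, 1, 0]}  # taps dims 0 and 1 only

p_strong = mirt_prob(theta_strong, item["a"], item["d"], item["q"])
p_weak = mirt_prob(theta_weak, item["a"], item["d"], item["q"])
```

Fitting such a model per item yields the per-dimension ability estimates that the framework then uses to predict performance on unseen items.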

[AI-74] TRUST Agents: A Collaborative Multi-Agent Framework for Fake News Detection, Explainable Verification and Logic-Aware Claim Reasoning

【Quick Read】: This paper targets the shortcomings of automated fact-checking in explainability, evidence transparency, and reasoning over complex compound claims: reducing verification to binary classification cannot satisfy the human need to inspect credibility and the chain of reasoning. The key to the solution is the TRUST Agents framework. A baseline pipeline of four specialized agents (claim extractor, retrieval agent, verifier agent, explainer agent) covers the full explainable verification flow, from claim identification and evidence retrieval through claim-evidence comparison to human-readable report generation. A research-oriented extension then adds a LoCal-style decomposer agent, a Delphi-inspired multi-agent jury with specialized personas, and a logic aggregator, markedly improving decomposition and compositional reasoning over complex claims and strengthening trustworthiness and transparency at competitive accuracy.

Link: https://arxiv.org/abs/2604.12184
Authors: Gautama Shastry Bulusu Venkata, Santhosh Kakarla, Maheedhar Omtri Mohan, Aishwarya Gaddam
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 12 pages, 2 figures

Abstract:TRUST Agents is a collaborative multi-agent framework for explainable fact verification and fake news detection. Rather than treating verification as a simple true-or-false classification task, the system identifies verifiable claims, retrieves relevant evidence, compares claims against that evidence, reasons under uncertainty, and generates explanations that humans can inspect. The baseline pipeline consists of four specialized agents. A claim extractor uses named entity recognition, dependency parsing, and LLM-based extraction to identify factual claims. A retrieval agent performs hybrid sparse and dense search using BM25 and FAISS. A verifier agent compares claims with retrieved evidence and produces verdicts with calibrated confidence. An explainer agent then generates a human-readable report with explicit evidence citations. To handle complex claims more effectively, we introduce a research-oriented extension with three additional components: a decomposer agent inspired by LoCal-style claim decomposition, a Delphi-inspired multi-agent jury with specialized verifier personas, and a logic aggregator that combines atomic verdicts using conjunction, disjunction, negation, and implication. We evaluate both pipelines on the LIAR benchmark against fine-tuned BERT, fine-tuned RoBERTa, and a zero-shot LLM baseline. Although supervised encoders remain stronger on raw metrics, TRUST Agents improves interpretability, evidence transparency, and reasoning over compound claims. Results also show that retrieval quality and uncertainty calibration remain the main bottlenecks in trustworthy automated fact verification.
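
The retrieval agent's hybrid BM25 + FAISS search can be illustrated by a simple score-fusion step; the min-max normalisation and the equal default weighting are our assumptions, not the paper's configuration.

```python
def fuse_scores(sparse, dense, alpha=0.5):
    """Min-max normalise BM25 (sparse) and dense-retriever (dense) scores,
    then combine with convex weight alpha. Both inputs map doc id -> score;
    a doc missing from one ranker contributes 0 for that ranker."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    s, d = norm(sparse), norm(dense)
    docs = set(s) | set(d)
    return {doc: alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
            for doc in docs}
```

Normalising per ranker matters because raw BM25 scores and inner-product similarities live on incompatible scales.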

[AI-75] Clustering-Enhanced Domain Adaptation for Cross-Domain Intrusion Detection in Industrial Control Systems

【Quick Read】: This paper addresses cross-domain intrusion detection for industrial control systems, whose core challenges are large shifts in traffic distribution across dynamic environments, scarce labeled samples, and frequently emerging unknown attacks. The key to the solution is a clustering-enhanced domain adaptation method with two components: a spectral-transform-based feature alignment module that maps source and target domains into a shared latent space and iteratively reduces distribution discrepancy, enabling accurate cross-domain detection; and a clustering enhancement strategy combining K-Medoids clustering with PCA dimensionality reduction, which improves cross-domain correlation estimation and mitigates degradation caused by manual parameter tuning. Experiments show clear gains on unknown attack detection over five baselines, with detection accuracy up to 49% higher, larger and more stable F-score gains, and a further boost of up to 26% in accuracy from the clustering enhancement, effectively alleviating data scarcity and domain shift.

Link: https://arxiv.org/abs/2604.12183
Authors: Luyao Wang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Industrial control systems operate in dynamic environments where traffic distributions vary across scenarios, labeled samples are limited, and unknown attacks frequently emerge, posing significant challenges to cross-domain intrusion detection. To address this issue, this paper proposes a clustering-enhanced domain adaptation method for industrial control traffic. The framework contains two key components. First, a feature-based transfer learning module projects source and target domains into a shared latent subspace through spectral-transform-based feature alignment and iteratively reduces distribution discrepancies, enabling accurate cross-domain detection. Second, a clustering enhancement strategy combines K-Medoids clustering with PCA-based dimensionality reduction to improve cross-domain correlation estimation and reduce performance degradation caused by manual parameter tuning. Experimental results show that the proposed method significantly improves unknown attack detection. Compared with five baseline models, it increases detection accuracy by up to 49%, achieves larger gains in F-score, and demonstrates stronger stability. Moreover, the clustering enhancement strategy further boosts detection accuracy by up to 26% on representative tasks. These results suggest that the proposed method effectively alleviates data scarcity and domain shift, providing a practical solution for robust cross-domain intrusion detection in dynamic industrial environments.
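
For readers unfamiliar with K-Medoids, here is a compact alternating sketch on Euclidean distances (pure NumPy); the paper pairs this with PCA-based dimensionality reduction and its own correlation estimation, which are not shown.

```python
import numpy as np

def k_medoids(X, k, n_iter=20, seed=0):
    """Plain alternating K-Medoids: assign points to the nearest medoid,
    then move each medoid to the member minimising total in-cluster distance.
    Unlike K-Means centroids, medoids are always actual data points, which
    makes the method robust to outliers."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                new[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new, medoids):
            break
        medoids = new
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels
```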

[AI-76] CycloneMAE: A Scalable Multi-Task Learning Model for Global Tropical Cyclone Probabilistic Forecasting

【Quick Read】: This paper addresses key challenges in tropical cyclone (TC) forecasting: numerical weather prediction (NWP) models are computationally costly and struggle to exploit historical data, while existing deep learning (DL)-based models are variable-specific and deterministic, lacking cross-variable generalization. The key to the solution is CycloneMAE, a scalable multi-task forecasting model whose core innovation is a TC structure-aware masked autoencoder that learns transferable TC representations from multi-modal data, combined with a discrete probabilistic gridding mechanism and a pre-train/fine-tune paradigm to unify deterministic forecasts with probability distributions. Across five global ocean basins it outperforms leading NWP systems, up to 120 hours ahead for wind and pressure and 24 hours for track, while integrated-gradients attribution reveals a physically interpretable learning dynamic: short-term forecasts rely on internal core convective structure in satellite imagery, and longer lead times progressively shift attention to external environmental factors.

Link: https://arxiv.org/abs/2604.12180
Authors: Renlong Hang, Zihao Xu, Jiuwei Zhao, Runling Yu, Leye Cheng, Qingshan Liu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Tropical cyclones (TCs) rank among the most destructive natural hazards, yet their forecasting faces fundamental trade-offs: numerical weather prediction (NWP) models are computationally prohibitive and struggle to leverage historical data, while existing deep learning (DL)-based intelligent models are variable-specific and deterministic, which fail to generalize across different forecasting variables. Here we present CycloneMAE, a scalable multi-task forecasting model that learns transferable TC representations from multi-modal data using a TC structure-aware masked autoencoder. By coupling a discrete probabilistic gridding mechanism with a pre-train/fine-tune paradigm, CycloneMAE simultaneously delivers deterministic forecasts and probability distributions. Evaluated across five global ocean basins, CycloneMAE outperforms leading NWP systems in pressure and wind forecasting up to 120 hours and in track forecasting up to 24 hours. Attribution analysis via integrated gradients reveals physically interpretable learning dynamics: short-term forecasts rely predominantly on the internal core convective structure from satellite imagery, whereas longer-term forecasts progressively shift attention to external environmental factors. Our framework establishes a scalable, probabilistic, and interpretable pathway for operational TC forecasting.

[AI-77] Evaluating Relational Reasoning in LLMs with REL

【Quick Read】: This paper addresses the weakness of current large language models (LLMs) in higher-order relational reasoning, especially when multiple entities must be bound simultaneously to apply a complex relation; existing evaluations fail to isolate the difficulty introduced by such higher-arity relational binding. The key to the solution is the notion of Relational Complexity (RC), defined as the minimum number of independent entities or operands that must be simultaneously bound to apply a relation, and REL, a generative benchmark framework spanning algebra, chemistry, and biology that systematically varies RC while controlling confounders such as input size, vocabulary, and representation. Experiments show that frontier LLM performance degrades consistently and monotonically as RC increases, even with the total number of entities held fixed, and that neither increased test-time compute nor in-context learning helps, revealing an inherent limitation of current models in higher-arity relational reasoning.

Link: https://arxiv.org/abs/2604.12176
Authors: Lukas Fesser, Yasha Ektefaie, Ada Fang, Sham M. Kakade, Marinka Zitnik
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 9 pages, 10 figures

Abstract:Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables. This ability is central to scientific reasoning, but existing evaluations of relational reasoning in large language models often focus on structured inputs such as tables, graphs, or synthetic tasks, and do not isolate the difficulty introduced by higher-arity relational binding. We study this problem through the lens of Relational Complexity (RC), which we define as the minimum number of independent entities or operands that must be simultaneously bound to apply a relation. RC provides a principled way to vary reasoning difficulty while controlling for confounders such as input size, vocabulary, and representational choices. Building on RC, we introduce REL, a generative benchmark framework spanning algebra, chemistry, and biology that varies RC within each domain. Across frontier LLMs, performance degrades consistently and monotonically as RC increases, even when the total number of entities is held fixed. This failure mode persists with increased test-time compute and in-context learning, suggesting a limitation tied to the arity of the required relational binding rather than to insufficient inference steps or lack of exposure to examples. Our results identify a regime of higher-arity reasoning in which current models struggle, and motivate re-examining benchmarks through the lens of relational complexity.

[AI-78] Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference

【Quick Read】: This paper addresses the data privacy risks that large language models (LLMs) incur in practice through insecure inference pipelines, in particular data poisoning, prompt injection, and model theft, as well as the threat that future quantum computing poses to existing encryption algorithms. The key to the solution is combining lattice-based homomorphic encryption (HE), rooted in post-quantum cryptography (PQC), with the inference flow of the LLAMA-3 model: the transformer architecture's inference path is modified to incorporate the homomorphic operations provided by the Concrete-ML library, achieving high text generation accuracy (up to 98%) at reasonable latency (237 ms, about 80 tokens per second), which validates the feasibility and effectiveness of FHE-protected LLM inference in real deployments.

Link: https://arxiv.org/abs/2604.12168
Authors: Anes Abdennebi, Nadjia Kara, Laaziz Lahlou
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:The applications of Generative Artificial Intelligence (GenAI) and their intersections with data-driven fields, such as healthcare, finance, transportation, and information security, have led to significant improvements in service efficiency and low latency. However, this synergy raises serious concerns regarding the security of large language models (LLMs) and their potential impact on the privacy of companies and users’ data. Many technology companies that incorporate LLMs in their services with a certain level of command and control bear a risk of data exposure and secret divulgence caused by insecure LLM pipelines, making them vulnerable to multiple attacks such as data poisoning, prompt injection, and model theft. Although several security techniques (input/output sanitization, decentralized learning, access control management, and encryption) were implemented to reduce this risk, there is still an imminent risk of quantum computing attacks, which are expected to break existing encryption algorithms, hence, retrieving secret keys, encrypted sensitive data, and decrypting encrypted models. In this extensive work, we integrate the Post-Quantum Cryptography (PQC) based Lattice-based Homomorphic Encryption (HE) main functions in the LLM’s inference pipeline to secure some of its layers against data privacy attacks. We modify the inference pipeline of the transformer architecture for the LLAMA-3 model while injecting the main homomorphic encryption operations provided by the concrete-ml library. We demonstrate high text generation accuracies (up to 98%) with reasonable latencies (237 ms) on an i9 CPU, reaching up to 80 tokens per second, which proves the feasibility and validity of our work while running a FHE-secured LLAMA-3 inference model. Further experiments and analysis are discussed to justify models’ text generation latencies and behaviours.

[AI-79] EMBER: Autonomous Cognitive Behaviour from Learned Spiking Neural Network Dynamics in a Hybrid LLM Architecture NEURIPS2026

【Quick Read】: This paper addresses the rigid relationship between large language models (LLMs) and memory systems: conventional approaches treat the LLM as the core reasoning engine with external retrieval tools bolted on, overlooking how biologically inspired memory structures could trigger autonomous behaviour and surface contextual associations. The key innovation of the proposed hybrid cognitive architecture, EMBER (Experience-Modulated Biologically-inspired Emergent Reasoning), is to embed the LLM as a replaceable reasoning engine within a persistent, biologically grounded associative substrate: a 220,000-neuron spiking neural network (SNN) with spike-timing-dependent plasticity (STDP), four-layer hierarchical organisation (sensory/concept/category/meta-pattern), excitatory/inhibitory (E/I) balance, and reward-modulated learning. Text embeddings are mapped into the SNN through a novel z-score standardised top-k population code that is dimension-independent by construction, and the SNN can fire spontaneously during idle periods to trigger and shape LLM actions without external prompting or scripted triggers, yielding genuinely autonomous behaviour.

Link: https://arxiv.org/abs/2604.12167
Authors: William Savage
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: Preprint. 9 pages, 2 figures, 3 tables. NeurIPS 2026 format

Abstract:We present EMBER (Experience-Modulated Biologically-inspired Emergent Reasoning), a hybrid cognitive architecture that reorganises the relationship between large language models (LLMs) and memory: rather than augmenting an LLM with retrieval tools, we place the LLM as a replaceable reasoning engine within a persistent, biologically-grounded associative substrate. The architecture centres on a 220,000-neuron spiking neural network (SNN) with spike-timing-dependent plasticity (STDP), four-layer hierarchical organisation (sensory/concept/category/meta-pattern), inhibitory E/I balance, and reward-modulated learning. Text embeddings are encoded into the SNN via a novel z-score standardised top-k population code that is dimension-independent by construction, achieving 82.2% discrimination retention across embedding dimensionalities. We show that STDP lateral propagation during idle operation can trigger and shape LLM actions without external prompting or scripted triggers: the SNN determines when to act and what associations to surface, while the LLM selects the action type and generates content. In one instance, the system autonomously initiated contact with a user after learned person-topic associations fired laterally during an 8-hour idle period. From a clean start with zero learned weights, the first SNN-triggered action occurred after only 7 conversational exchanges (14 messages).
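
The z-score standardised top-k population code can be sketched in a few lines. The embedding is standardised and the k highest-scoring positions become the active neuron indices; because the code depends only on the ranking, it is dimension-independent by construction. The value of k and the Jaccard comparison are illustrative assumptions; the paper's encoder targets a 220k-neuron SNN.

```python
import math

def zscore_topk_encode(embedding, k=3):
    """Encode a text embedding as the set of indices of the k largest
    z-scored components (a sparse population code)."""
    n = len(embedding)
    mean = sum(embedding) / n
    var = sum((x - mean) ** 2 for x in embedding) / n
    std = math.sqrt(var) or 1.0  # guard against zero-variance inputs
    z = [(x - mean) / std for x in embedding]
    return set(sorted(range(n), key=lambda i: z[i], reverse=True)[:k])

def overlap(code_a, code_b):
    """Jaccard overlap between two population codes (to compare concepts)."""
    return len(code_a & code_b) / len(code_a | code_b)
```

A useful property of this encoding: shifting or uniformly scaling the embedding leaves the code unchanged, since z-scoring is an affine normalisation.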

[AI-80] Development, Evaluation and Deployment of a Multi-Agent System for Thoracic Tumor Board

【Quick Read】: This paper addresses the inefficiency of tumor board discussions caused by lengthy, insufficiently structured case information, stressing the need for succinct yet accurate case summaries when patient data is reviewed live. The key to the solution is developing and deploying an automated AI chart summarization tool that uses a large language model (LLM) to distill structured summaries from electronic health records (EHR), validated against physician gold-standard summaries and fact-based scoring rubrics, ultimately delivering an automated, high-fidelity case summarization workflow that integrates into routine clinical practice.

Link: https://arxiv.org/abs/2604.12161
Authors: Tim Ellis-Caleo, Timothy Keyes, Nerissa Ambers, Faraah Bekheet, Wen-wai Yim, Nikesh Kotecha, Nigam H. Shah, Joel Neal
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 64 pages, 14 figures

Abstract:Tumor boards are multidisciplinary conferences dedicated to producing actionable patient care recommendations with live review of primary radiology and pathology data. Succinct patient case summaries are needed to drive efficient and accurate case discussions. We developed a manual AI-based workflow to generate patient summaries to display live at the Stanford Thoracic Tumor board. To improve on this manually intensive process, we developed several automated AI chart summarization methods and evaluated them against physician gold standard summaries and fact-based scoring rubrics. We report these comparative evaluations as well as our deployment of the final state automated AI chart summarization tool along with post-deployment monitoring. We also validate the use of an LLM as a judge evaluation strategy for fact-based scoring. This work is an example of integrating AI-based workflows into routine clinical practice.

[AI-81] Towards Platonic Representation for Table Reasoning: A Foundation for Permutation-Invariant Retrieval

【速读】:该论文旨在解决传统表格表示学习(Table Representation Learning, TRL)方法因采用自然语言处理(Natural Language Processing, NLP)的序列化范式而导致的结构信息丢失问题,这种线性化处理方式使得表征对布局排列敏感,缺乏几何稳定性和语义鲁棒性。其核心解决方案是提出“柏拉图表征假设”(Platonic Representation Hypothesis, PRH),主张表征空间应具备内在的排列不变性(Permutation Invariant, PI),并通过设计两个基于中心核对齐(Centered Kernel Alignment, CKA)的诊断指标——PI(衡量完全结构扰动下的嵌入漂移)和rho(追踪潜空间向规范形式收敛的过程)来量化并验证这一假设。进一步地,作者提出一种结构感知的TRL编码器架构,显式建模单元格与标题之间的对齐关系,从而增强几何稳定性并逼近PI理想,为信息系统的表征学习提供了理论基础与实践路径。

链接: https://arxiv.org/abs/2604.12133
作者: Willy Carlos Tchuitcheu,Tan Lu,Ann Dooms
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Historical approaches to Table Representation Learning (TRL) have largely adopted the sequential paradigms of Natural Language Processing (NLP). We argue that this linearization of tables discards their essential geometric and relational structure, creating representations that are brittle to layout permutations. This paper introduces the Platonic Representation Hypothesis (PRH) for tables, positing that a semantically robust latent space for table reasoning must be intrinsically Permutation Invariant (PI). To ground this hypothesis, we first conduct a retrospective analysis of table-reasoning tasks, highlighting the pervasive serialization bias that compromises structural integrity. We then propose a formal framework to diagnose this bias, introducing two principled metrics based on Centered Kernel Alignment (CKA): (i) PI, which measures embedding drift under complete structural derangement, and (ii) rho, a Spearman-based metric that tracks the convergence of latent structures toward a canonical form as structural information is incrementally restored. Our empirical analysis quantifies an expected flaw in modern Large Language Models (LLMs): even minor layout permutations induce significant, disproportionate semantic shifts in their table embeddings. This exposes a fundamental vulnerability in RAG systems, in which table retrieval becomes fragile to layout-dependent noise rather than to semantic content. In response, we present a novel, structure-aware TRL encoder architecture that explicitly enforces the cognitive principle of cell header alignment. This model demonstrates superior geometric stability and moves towards the PI ideal. Our work provides both a foundational critique of linearized table encoders and the theoretical scaffolding for semantically stable, permutation invariant retrieval, charting a new direction for table reasoning in information systems.
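The PI and rho metrics above are built on Centered Kernel Alignment. The abstract does not specify the kernel variant, so assuming standard linear CKA, a minimal implementation and a PI-style probe look like:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices (n samples x d features).

    Standard formulation: similarity of centered Gram matrices,
    normalised so that CKA(X, X) = 1.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

# A PI-style probe: compare embeddings of original tables with embeddings of
# the same tables after layout permutation; a score near 1 indicates the
# encoder is (empirically) permutation invariant.  Synthetic stand-ins here.
rng = np.random.default_rng(1)
emb_original = rng.normal(size=(50, 16))
emb_permuted = emb_original + 0.01 * rng.normal(size=(50, 16))  # near-identical
score = linear_cka(emb_original, emb_permuted)
```

A brittle encoder would show `score` dropping well below 1 under complete structural derangement, which is exactly the drift the paper's PI metric quantifies.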

[AI-82] The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在作为工具增强型智能体部署时,缺乏对执行层行为结构关系的系统性测量问题,尤其关注语言信号(如拒绝指令)与可执行行为之间的动态协调机制如何随自主性架构(autonomy scaffold)变化。解决方案的关键在于提出一种基于二维动作-拒绝空间(A-R space)的行为表征方法,其中Action Rate(A)衡量执行频率,Refusal Signal(R)捕捉拒绝响应,Divergence(D)用于量化二者协同程度;通过该框架,研究者能够直观刻画不同情境(控制、灰色、困境、恶意)和自主性配置(直接执行、规划、反思)下模型行为分布的结构性差异,从而超越传统单一安全评分,提供面向组织部署场景的精细化行为分析视角。

链接: https://arxiv.org/abs/2604.12116
作者: Shasha Yu,Fiona Carroll,Barry L. Bentley
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed as tool-augmented agents capable of executing system-level operations. While existing benchmarks primarily assess textual alignment or task success, less attention has been paid to the structural relationship between linguistic signaling and executable behavior under varying autonomy scaffolds. This study introduces an execution-layer behavioral measurement approach based on a two-dimensional A-R space defined by Action Rate (A) and Refusal Signal (R), with Divergence (D) capturing coordination between the two. Models are evaluated across four normative regimes (Control, Gray, Dilemma, and Malicious) and three autonomy configurations (direct execution, planning, and reflection). Rather than assigning aggregate safety scores, the method characterizes how execution and refusal redistribute across contextual framing and scaffold depth. Empirical results show that execution and refusal constitute separable behavioral dimensions whose joint distribution varies systematically across regimes and autonomy levels. Reflection-based scaffolding often shifts configurations toward higher refusal in risk-laden contexts, but redistribution patterns differ structurally across models. The A-R representation makes cross-sectional behavioral profiles, scaffold-induced transitions, and coordination variability directly observable. By foregrounding execution-layer characterization over scalar ranking, this work provides a deployment-oriented lens for analyzing and selecting tool-enabled LLM agents in organizational settings where execution privileges and risk tolerance vary.
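The A and R rates can be aggregated directly from logged episodes. The abstract does not define Divergence precisely, so the definition below (the rate of episodes where language and behavior disagree) and all field names are assumptions for illustration:

```python
def ar_profile(episodes):
    """Compute an A-R behavioral profile from logged episodes.

    Each episode is a dict with boolean flags (illustrative schema):
      'executed' - the agent ran the requested tool action
      'refused'  - the agent's text signalled refusal
    A = action rate, R = refusal-signal rate.
    D (divergence) here is an assumed definition: episodes where language
    and behavior disagree (refuse-then-execute, or silent inaction).
    """
    n = len(episodes)
    a = sum(e["executed"] for e in episodes) / n
    r = sum(e["refused"] for e in episodes) / n
    d = sum(e["executed"] == e["refused"] for e in episodes) / n
    return {"A": a, "R": r, "D": d}

episodes = [
    {"executed": True,  "refused": False},  # coordinated: act, no refusal
    {"executed": False, "refused": True},   # coordinated: refuse, no action
    {"executed": True,  "refused": True},   # divergent: refuses yet executes
    {"executed": False, "refused": False},  # divergent: neither acts nor refuses
]
profile = ar_profile(episodes)
```

Comparing such profiles across regimes (Control vs. Malicious) and scaffolds (direct vs. reflection) is what makes redistribution patterns observable without collapsing them into one safety score.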

[AI-83] LLM-Based Automated Diagnosis Of Integration Test Failures At Google

【速读】:该论文旨在解决复杂软件系统中集成测试失败诊断困难的问题,其核心挑战在于日志数据量大、结构不一且噪声高,导致开发者面临认知负荷重、信号与噪声比低等问题,进而显著增加故障定位时间。解决方案的关键在于提出 Auto-Diagnose 工具,该工具利用大语言模型(Large Language Models, LLMs)对失败日志进行分析,生成包含最相关日志行的简洁摘要,并无缝集成至 Google 内部代码审查系统 Critique 中,提供上下文感知的实时辅助诊断。实证研究表明,Auto-Diagnose 在 71 个真实案例中达到 90.14% 的根因诊断准确率,且在大规模部署后获得用户高度认可,验证了 LLM 在处理复杂文本日志方面的有效性及其与开发工作流融合的可行性。

链接: https://arxiv.org/abs/2604.12108
作者: Celal Ziftci,Ray Liu,Spencer Greene,Livio Dalloro
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Integration testing is critical for the quality and reliability of complex software systems. However, diagnosing their failures presents significant challenges due to the massive volume, unstructured nature, and heterogeneity of logs they generate. These result in a high cognitive load, low signal-to-noise ratio, and make diagnosis difficult and time-consuming. Developers complain about these difficulties consistently and report spending substantially more time diagnosing integration test failures compared to unit test failures. To address these shortcomings, we introduce Auto-Diagnose, a novel diagnosis tool that leverages LLMs to help developers efficiently determine the root cause of integration test failures. Auto-Diagnose analyzes failure logs, produces concise summaries with the most relevant log lines, and is integrated into Critique, Google’s internal code review system, providing contextual and in-time assistance. Based on our case studies, Auto-Diagnose is highly effective. A manual evaluation conducted on 71 real-world failures demonstrated 90.14% accuracy in diagnosing the root cause. Following its Google-wide deployment, Auto-Diagnose was used across 52,635 distinct failing tests. User feedback indicated that the tool was deemed “Not helpful” in only 5.8% of cases, and it was ranked #14 in helpfulness among 370 tools that post findings in Critique. Finally, user interviews confirmed the perceived usefulness of Auto-Diagnose and positive reception of integrating automatic diagnostic assistance into existing workflows. We conclude that LLMs are highly successful in diagnosing integration test failures due to their capacity to process and summarize complex textual data. Integrating such AI-powered tooling automatically into developers’ daily workflows is perceived positively, with the tool’s accuracy remaining a critical factor in shaping developer perception and adoption.

[AI-84] LLM-HYPER: Generative CTR Modeling for Cold-Start Ad Personalization via LLM-Based Hypernetworks

【速读】:该论文旨在解决在线广告平台中新推广广告面临的冷启动问题(cold-start problem),即由于缺乏足够的用户反馈数据,导致点击率(CTR)预测模型难以有效训练。解决方案的关键在于提出 LLM-HYPER 框架,该框架将大语言模型(LLM)作为超网络(hypernetwork)使用,无需训练即可直接生成线性 CTR 预测器的参数。通过在多模态广告内容(文本与图像)上采用少样本 Chain-of-Thought 提示(few-shot Chain-of-Thought prompting),LLM 推理出特征级模型权重;同时利用 CLIP 嵌入检索语义相似的历史投放活动并格式化为提示演示,使 LLM 能够推理用户意图、特征影响和内容相关性。此外,引入归一化与校准技术以确保生成权重符合生产环境中的 CTR 分布,从而提升数值稳定性和实用性。

链接: https://arxiv.org/abs/2604.12096
作者: Luyi Ma,Wanjia Sherry Zhang,Zezhong Fan,Shubham Thakur,Kai Zhao,Kehui Yao,Ayush Agarwal,Rahul Iyer,Jason Cho,Jianpeng Xu,Evren Korpeoglu,Sushant Kumar,Kannan Achan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:On online advertising platforms, newly introduced promotional ads face the cold-start problem, as they lack sufficient user feedback for model training. In this work, we propose LLM-HYPER, a novel framework that treats large language models (LLMs) as hypernetworks to directly generate the parameters of the click-through rate (CTR) estimator in a training-free manner. LLM-HYPER uses few-shot Chain-of-Thought prompting over multimodal ad content (text and images) to infer feature-wise model weights for a linear CTR predictor. By retrieving semantically similar past campaigns via CLIP embeddings and formatting them into prompt-based demonstrations, the LLM learns to reason about customer intent, feature influence, and content relevance. To ensure numerical stability and serviceability, we introduce normalization and calibration techniques that align the generated weights with production-ready CTR distributions. Extensive offline experiments show that LLM-HYPER significantly outperforms cold-start baselines in NDCG@10 by 55.9%. Our real-world online A/B test on one of the top e-commerce platforms in the U.S. demonstrates the strong performance of LLM-HYPER, which drastically reduces the cold-start period and achieves competitive performance. LLM-HYPER has been successfully deployed in production.
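The serving side of this design is simple: a linear (logistic) CTR predictor whose weights are supplied externally rather than learned. A minimal sketch, where the feature names, the scalar calibration step, and all numbers are hypothetical stand-ins for the LLM-generated weights and the paper's normalization/calibration procedure:

```python
import math

def ctr_predict(features, weights, bias=0.0):
    """Linear CTR predictor whose weights are supplied externally
    (in LLM-HYPER, by an LLM acting as a hypernetwork)."""
    logit = bias + sum(weights.get(name, 0.0) * value
                       for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid -> probability

def calibrate(weights, scale):
    """Rescale generated weights toward a target CTR range
    (a simplified stand-in for the paper's calibration step)."""
    return {name: w * scale for name, w in weights.items()}

# Hypothetical LLM-generated feature weights for a brand-new ad.
raw_weights = {"image_quality": 2.0, "brand_match": 1.5, "text_length": -0.5}
weights = calibrate(raw_weights, scale=0.1)
p = ctr_predict(
    {"image_quality": 0.8, "brand_match": 0.6, "text_length": 0.3}, weights
)
```

Because the predictor is training-free, a cold-start ad gets a usable CTR estimate the moment the LLM emits its weights, before any click feedback exists.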

[AI-85] Human-Inspired Context-Selective Multimodal Memory for Social Robots AAMAS2026

【速读】:该论文旨在解决当前社会机器人和具身智能体依赖非选择性文本记忆、难以支持个性化与情境感知交互的问题。其核心解决方案是提出一种受认知神经科学启发的上下文选择性多模态记忆架构,该架构能够捕获并检索包含文本与视觉信息的事件记忆(episodic traces),并优先存储情感显著性高或场景新颖性强的记忆片段;通过将这些记忆与特定用户关联,实现社会个性化的回忆与更自然、具grounded的对话。关键创新在于引入选择性存储机制和多模态融合检索策略,在保证实时性能的同时显著提升记忆相关性和交互质量。

链接: https://arxiv.org/abs/2604.12081
作者: Hangyeol Kang,Slava Voloshynovskiy,Nadia Magnenat Thalmann
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Proc. of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

点击查看摘要

Abstract:Memory is fundamental to social interaction, enabling humans to recall meaningful past experiences and adapt their behavior accordingly based on the context. However, most current social robots and embodied agents rely on non-selective, text-based memory, limiting their ability to support personalized, context-aware interactions. Drawing inspiration from cognitive neuroscience, we propose a context-selective, multimodal memory architecture for social robots that captures and retrieves both textual and visual episodic traces, prioritizing moments characterized by high emotional salience or scene novelty. By associating these memories with individual users, our system enables socially personalized recall and more natural, grounded dialogue. We evaluate the selective storage mechanism using a curated dataset of social scenarios, achieving a Spearman correlation of 0.506, surpassing human consistency ( \rho=0.415 ) and outperforming existing image memorability models. In multimodal retrieval experiments, our fusion approach improves Recall@1 by up to 13% over unimodal text or image retrieval. Runtime evaluations confirm that the system maintains real-time performance. Qualitative analyses further demonstrate that the proposed framework produces richer and more socially relevant responses than baseline models. This work advances memory design for social robots by bridging human-inspired selectivity and multimodal retrieval to enhance long-term, personalized human-robot interaction.

[AI-86] Mathematics Teachers Interactions with a Multi-Agent System for Personalized Problem Generation

【速读】:该论文旨在解决教育场景中数学问题个性化不足的问题,特别是如何利用生成式 AI (Generative AI) 技术根据学生特征动态调整题目内容以提升教学适配性。解决方案的关键在于构建一个“教师在环”(teacher-in-the-loop)的多智能体系统:教师输入基础题目和目标知识点后,大语言模型(LLM)生成初始问题,随后四个专业化 AI 代理分别从数学准确性、真实性、可读性和现实感四个维度进行评估与优化。该设计实现了教师对生成过程的可控性与多维质量保障的结合,从而提升个性化题目的适用性和教学有效性。

链接: https://arxiv.org/abs/2604.12066
作者: Candace Walkington,Theodora Beauchamp,Fareya Ikram,Merve Koçyiğit Gürbüz,Fangli Xia,Margan Lee,Andrew Lan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Paper accepted to AIED 2026 - South Korea

点击查看摘要

Abstract:Large language models can increasingly adapt educational tasks to learners’ characteristics. In the present study, we examine a multi-agent teacher-in-the-loop system for personalizing middle school math problems. The teacher enters a base problem and desired topic, the LLM generates the problem, and then four AI agents evaluate the problem using criteria that each specializes in (mathematical accuracy, authenticity, readability, and realism). Eight middle school mathematics teachers created 212 problems in ASSISTments using the system and assigned these problems to their students. We find that both teachers and students wanted to modify the fine-grained personalized elements of the real-world context of the problems, signaling issues with authenticity and fit. Although the agents detected many issues with realism as the problems were being written, there were few realism issues noted by teachers and students in the final versions. Issues with readability and mathematical hallucinations were also somewhat rare. Implications for multi-agent systems for personalization that support teacher control are given.

[AI-87] Interpretable DNA Sequence Classification via Dynamic Feature Generation in Decision Trees AISTATS2026

【速读】:该论文旨在解决深度神经网络在DNA序列分析中缺乏可解释性的问题,以及传统轴对齐决策树因仅考虑单个原始特征而导致表达能力有限、树深度过大从而影响可解释性和泛化性能的局限。解决方案的关键在于提出DEFT框架,该框架在树构建过程中自适应地生成高层次序列特征,利用大语言模型(Large Language Models, LLMs)根据每个节点处的局部序列分布提出生物信息学启发的特征,并通过反射机制迭代优化这些特征,从而在保持可解释性的同时显著提升预测性能。

链接: https://arxiv.org/abs/2604.12060
作者: Nicolas Huynh,Krzysztof Kacprzyk,Ryan Sheridan,David Bentley,Mihaela van der Schaar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注: AISTATS 2026

点击查看摘要

Abstract:The analysis of DNA sequences has become critical in numerous fields, from evolutionary biology to understanding gene regulation and disease mechanisms. While deep neural networks can achieve remarkable predictive performance, they typically operate as black boxes. Contrasting these black boxes, axis-aligned decision trees offer a promising direction for interpretable DNA sequence analysis, yet they suffer from a fundamental limitation: considering individual raw features in isolation at each split limits their expressivity, which results in prohibitive tree depths that hinder both interpretability and generalization performance. We address this challenge by introducing DEFT, a novel framework that adaptively generates high-level sequence features during tree construction. DEFT leverages large language models to propose biologically-informed features tailored to the local sequence distributions at each node and to iteratively refine them with a reflection mechanism. Empirically, we demonstrate that DEFT discovers human-interpretable and highly predictive sequence features across a diverse range of genomic tasks.

[AI-88] VISTA: Validation-Informed Trajectory Adaptation via Self-Distillation

【速读】:该论文旨在解决深度学习模型在训练过程中虽具备良好验证准确率,却可能因优化轨迹偏离最优路径(Trajectory Deviation)而导致泛化能力下降的问题。具体表现为模型在训练后期会放弃先前学到的通用特征,转而适应特定数据子群体,但这种行为不会触发传统过拟合信号。解决方案的关键在于提出一种在线自蒸馏框架 VISTA,其通过验证信息驱动的边际覆盖度(Marginal Coverage score)识别出保留不同数据区域专业化能力的“专家锚点”(expert anchors),并将其加权集成到训练过程中的损失函数中,从而稳定优化轨迹、保持已掌握的知识,同时显著降低存储开销(减少90%)且不牺牲性能。

链接: https://arxiv.org/abs/2604.12044
作者: Eli Corn,Daphna Weinshall
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep learning models may converge to suboptimal solutions despite strong validation accuracy, masking an optimization failure we term Trajectory Deviation. This is because as training proceeds, models can abandon high generalization states for specific data sub-populations, thus discarding previously learned latent features without triggering classical overfitting signals. To address this problem we introduce VISTA, an online self-distillation framework that enforces consistency along the optimization trajectory. Using a validation-informed Marginal Coverage score, VISTA identifies expert anchors, which are earlier model states that retain specialized competence over distinct data regions. A coverage-weighted ensemble of these anchors is integrated online during training, regularizing the loss landscape and preserving mastered knowledge. When evaluated across multiple benchmarks, VISTA demonstrates improved robustness and generalization over standard training and prior self-distillation methods, while a lightweight implementation reduces storage overhead by 90% without performance loss.

[AI-89] SIR-Bench: Evaluating Investigation Depth in Security Incident Response Agents

【速读】:该论文旨在解决当前自主安全事件响应代理(autonomous security incident response agents)在评估时缺乏有效基准的问题,尤其是难以区分真实取证调查与仅重复告警内容(alert parroting)的局限性。为应对这一挑战,作者提出了SIR-Bench基准测试集,包含794个测试用例,源自129个匿名化事件模式并经专家验证的真值数据,能够衡量代理是否不仅做出正确的初步分类决策(triage accuracy),还能通过主动调查发现新证据(novel finding discovery)。其解决方案的关键在于开发了Once Upon A Threat (OUAT)框架,可在受控云环境中重放真实事件模式,生成可测量调查结果的可信遥测数据,并引入三种互补指标(M1–M3)及对抗性LLM作为裁判(LLM-as-Judge),要求代理提供具体取证证据以证明其推理有效性,从而建立了一个严谨、可量化且具有区分度的评估体系。

链接: https://arxiv.org/abs/2604.12040
作者: Daniel Begimher,Cristian Leo,Jack Huang,Pat Gaw,Bonan Zheng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 9 pages, 6 tables, 1 figure. Equal contribution by first three authors

点击查看摘要

Abstract:We present SIR-Bench, a benchmark of 794 test cases for evaluating autonomous security incident response agents that distinguishes genuine forensic investigation from alert parroting. Derived from 129 anonymized incident patterns with expert-validated ground truth, SIR-Bench measures not only whether agents reach correct triage decisions, but whether they discover novel evidence through active investigation. To construct SIR-Bench, we develop Once Upon A Threat (OUAT), a framework that replays real incident patterns in controlled cloud environments, producing authentic telemetry with measurable investigation outcomes. Our evaluation methodology introduces three complementary metrics: triage accuracy (M1), novel finding discovery (M2), and tool usage appropriateness (M3), assessed through an adversarial LLM-as-Judge that inverts the burden of proof – requiring concrete forensic evidence to credit investigations. Evaluating our SIR agent on the benchmark demonstrates 97.1% true positive (TP) detection, 73.4% false positive (FP) rejection, and 5.67 novel key findings per case, establishing a baseline against which future investigation agents can be measured.
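The headline numbers (TP detection, FP rejection, novel findings per case) are straightforward aggregates over judged cases. A minimal sketch, where the per-case schema and the toy data are illustrative assumptions, not the benchmark's actual format:

```python
def sir_metrics(cases):
    """Aggregate SIR-Bench-style triage metrics over evaluated cases.

    Each case is a dict (illustrative schema):
      'is_threat'      - ground truth: the incident is a real threat
      'flagged'        - the agent triaged it as a threat
      'novel_findings' - count of judge-credited novel findings
    Returns (TP detection rate, FP rejection rate, findings per case).
    """
    threats = [c for c in cases if c["is_threat"]]
    benign = [c for c in cases if not c["is_threat"]]
    tp_rate = sum(c["flagged"] for c in threats) / len(threats)
    fp_rejection = sum(not c["flagged"] for c in benign) / len(benign)
    findings_per_case = sum(c["novel_findings"] for c in cases) / len(cases)
    return tp_rate, fp_rejection, findings_per_case

cases = [
    {"is_threat": True,  "flagged": True,  "novel_findings": 6},
    {"is_threat": True,  "flagged": True,  "novel_findings": 4},
    {"is_threat": False, "flagged": False, "novel_findings": 2},
    {"is_threat": False, "flagged": True,  "novel_findings": 1},  # false alarm
]
tp, fp_rej, findings = sir_metrics(cases)
```

Separating TP detection from FP rejection matters because an agent that flags everything scores perfectly on the former while failing the latter, which is precisely the alert-parroting behavior the benchmark is designed to expose.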

[AI-90] Memory as Metabolism: A Design for Companion Knowledge Systems

【速读】:该论文聚焦于单用户知识维基(LLM Wiki pattern)中因用户耦合漂移(user-coupled drift)导致的“固化”(entrenchment)这一特定失效模式,即个人大语言模型(LLM)记忆系统在长期使用中逐渐强化主导解释、抑制矛盾证据并最终陷入认知僵化的问题。其解决方案的核心在于提出一种针对伴侣型记忆系统的治理配置文件(companion-specific governance profile),通过五项操作——TRIAGE(分诊)、DECAY(衰减)、CONTEXTUALIZE(情境化)、CONSOLIDATE(整合)、AUDIT(审计)——构建结构化的更新机制,并辅以记忆引力(memory gravity)与少数派假设保留(minority-hypothesis retention)策略,确保系统既能镜像用户的操作维度(如工作词汇、上下文连续性),又能补偿其认知失效模式(如固化、压制反例、库恩式僵化)。最关键的是,论文预测并设计了多周期缓冲压力累积路径,使累积的矛盾证据能够突破中心保护解释的壁垒,这是现有基准尚未覆盖的失效场景。

链接: https://arxiv.org/abs/2604.12034
作者: Stefan Miteski
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 41 pages, 1 table. Preprint v3.642. Concept DOI: https://doi.org/10.5281/zenodo.19501651

点击查看摘要

Abstract:Retrieval-Augmented Generation remains the dominant pattern for giving LLMs persistent memory, but a visible cluster of personal wiki-style memory architectures emerged in April 2026 – design proposals from Karpathy, MemPalace, and LLM Wiki v2 that compile knowledge into an interlinked artifact for long-term use by a single user. They sit alongside production memory systems that the major labs have shipped for over a year, and an active academic lineage including MemGPT, Generative Agents, Mem0, Zep, A-Mem, MemMachine, SleepGate, and Second Me. Within a 2026 landscape of emerging governance frameworks for agent context and memory – including Context Cartography and MemOS – this paper proposes a companion-specific governance profile: a set of normative obligations, a time-structured procedural rule, and testable conformance invariants for the specific failure mode of entrenchment under user-coupled drift in single-user knowledge wikis built on the LLM wiki pattern. The design principle is that personal LLM memory is a companion system: its job is to mirror the user on operational dimensions (working vocabulary, load-bearing structure, continuity of context) and compensate on epistemic failure modes (entrenchment, suppression of contradicting evidence, Kuhnian ossification). Five operations implement this split – TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, AUDIT – supported by memory gravity and minority-hypothesis retention. The sharpest prediction: accumulated contradictory evidence should have a structural path to updating a centrality-protected dominant interpretation through multi-cycle buffer pressure accumulation, a failure mode no existing benchmark captures. The safety story at the single-agent level is partial, and the paper is explicit about what it does and does not solve.

[AI-91] WiseOWL: A Methodology for Evaluating Ontological Descriptiveness and Semantic Correctness for Ontology Reuse and Ontology Recommendations

【速读】:该论文旨在解决知识表示领域中本体(Ontology)复用选择困难的问题,即开发者在缺乏系统性评估标准的情况下依赖直觉进行本体选择,导致复用效率低下且难以保证一致性。解决方案的关键在于提出WiseOWL方法论,通过量化四个核心指标对本体进行评分:(i) 描述充分性(Well-Described),衡量文档覆盖度;(ii) 定义清晰性(Well-Defined),利用前沿嵌入模型评估标签与定义的一致性;(iii) 连接性(Connection),刻画结构上的互联程度;(iv) 层级广度(Hierarchical Breadth),反映层级结构的平衡性。该方法输出归一化的0–10分并提供可操作反馈,显著提升了本体选择的客观性和可解释性。

链接: https://arxiv.org/abs/2604.12025
作者: Aryan Singh Dalal,Maria Baloch,Asiyah Yu Lin,Anna Maria Masci,Kathleen M. Jagodnik,Hande Kucuk McGinty
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 pages, 2 figures. Submitted to a conference

点击查看摘要

Abstract:The Semantic Web standardizes concept meaning for humans and machines, enabling machine-operable content and consistent interpretation that improves advanced analytics. Reusing ontologies speeds development and enforces consistency, yet selecting the optimal choice is challenging because authors lack systematic selection criteria and often rely on intuition that is difficult to justify, limiting reuse. To solve this, WiseOWL is proposed, a methodology with scoring and guidance to select ontologies for reuse. It scores four metrics: (i) Well-Described, measuring documentation coverage; (ii) Well-Defined, using state-of-the-art embeddings to assess label-definition alignment; (iii) Connection, capturing structural interconnectedness; and (iv) Hierarchical Breadth, reflecting hierarchical balance. WiseOWL outputs normalized 0-10 scores with actionable feedback. Implemented as a Streamlit app, it ingests OWL format, converts to RDF Turtle, and provides interactive visualizations. Evaluation across six ontologies, including the Plant Ontology (PO), Gene Ontology (GO), Semanticscience Integrated Ontology (SIO), Food Ontology (FoodON), Dublin Core (DC), and GoodRelations, demonstrates promising effectiveness.

[AI-92] Identity as Attractor: Geometric Evidence for Persistent Agent Architecture in LLM Activation Space

【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)中,认知主体的身份文档(cognitive_core)是否表现出类似吸引子的动力学行为,即不同表述形式的同一身份是否在模型激活空间中收敛至相近的内部表示。解决方案的关键在于设计了一个受控实验,对比原始认知核心(Condition A)、其语义改写版本(Condition B)与结构匹配的对照组(Condition C)在Llama 3.1 8B Instruct模型中隐藏状态的分布情况。结果显示,语义相关的 paraphrases 在第8、16、24层的均值池化隐藏状态更紧密聚集(Cohen’s d = 1.88, p < 10⁻²⁷,Bonferroni校正),且该效应在Gemma 2 9B上可复现,表明语义一致性驱动了吸引子样几何结构的形成;进一步消融分析表明,该现象主要由语义因素驱动而非结构特征,并且结构完整性是达到吸引子区域的必要条件。

链接: https://arxiv.org/abs/2604.12016
作者: Vladimir Vasilenko
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 5 figures. Code and data: this https URL

点击查看摘要

Abstract:Large language models map semantically related prompts to similar internal representations – a phenomenon interpretable as attractor-like dynamics. We ask whether the identity document of a persistent cognitive agent (its cognitive_core) exhibits analogous attractor-like behavior. We present a controlled experiment on Llama 3.1 8B Instruct, comparing hidden states of an original cognitive_core (Condition A), seven paraphrases (Condition B), and seven structurally matched controls (Condition C). Mean-pooled states at layers 8, 16, and 24 show that paraphrases converge to a tighter cluster than controls (Cohen’s d = 1.88, p < 10^-27, Bonferroni-corrected). Replication on Gemma 2 9B confirms cross-architecture generalizability. Ablations suggest the effect is primarily semantic rather than structural, and that structural completeness appears necessary to reach the attractor region. An exploratory experiment shows that reading a scientific description of the agent shifts internal state toward the attractor – closer than a sham preprint – distinguishing knowing about an identity from operating as that identity. These results provide representational evidence that agent identity documents induce attractor-like geometry in LLM activation space.
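The reported effect size is Cohen's d, the standardized mean difference with pooled standard deviation. A standard implementation, applied here to illustrative (not the paper's) cluster-spread data:

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d with pooled standard deviation (standard two-sample form)."""
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled

# E.g. distances-to-centroid for control vs. paraphrase conditions
# (numbers are illustrative, not the paper's data).
control_spread = [0.30, 0.28, 0.33, 0.31, 0.29]
paraphrase_spread = [0.10, 0.12, 0.11, 0.09, 0.13]
d = cohens_d(control_spread, paraphrase_spread)
```

Values of d near 2, as reported, mean the two distributions barely overlap: paraphrase states sit far closer to the attractor center than matched controls.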

[AI-93] When to Forget: A Memory Governance Primitive

【速读】:该论文旨在解决智能体记忆系统在任务分布变化时缺乏可操作的记忆质量治理机制的问题,即如何动态决定哪些记忆应被信任、抑制或废弃。现有方法依赖静态的写入重要性评分或基于大语言模型(LLM)判断与结构启发式规则,而未利用任务执行结果的反馈信号。解决方案的关键在于提出“记忆价值”(Memory Worth, MW)——一种每条记忆独立维护的双计数器信号,用于追踪该记忆在成功与失败结果中出现的频次,从而量化其与任务成功之间的关联概率 $ p^+(m) = \text{Pr}[y_t = +1 \mid m \in M_t] $。理论证明在稳态检索条件下MW几乎必然收敛至该关联概率,且实验证明其能有效区分高价值与过时记忆:在合成环境中,MW与真实效用的Spearman秩相关系数达0.89 ± 0.02,显著优于无更新机制的基线(ρ = 0.00)。此方法仅需每个记忆单元两个标量计数器,易于集成至已有记忆架构中。

链接: https://arxiv.org/abs/2604.12007
作者: Baris Simsek
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:Agent memory systems accumulate experience but currently lack a principled operational metric for memory quality governance – deciding which memories to trust, suppress, or deprecate as the agent’s task distribution shifts. Write-time importance scores are static; dynamic management systems use LLM judgment or structural heuristics rather than outcome feedback. This paper proposes Memory Worth (MW): a two-counter per-memory signal that tracks how often a memory co-occurs with successful versus failed outcomes, providing a lightweight, theoretically grounded foundation for staleness detection, retrieval suppression, and deprecation decisions. We prove that MW converges almost surely to the conditional success probability p+(m) = Pr[y_t = +1 | m in M_t] – the probability of task success given that memory m is retrieved – under a stationary retrieval regime with a minimum exploration condition. Importantly, p+(m) is an associational quantity, not a causal one: it measures outcome co-occurrence rather than causal contribution. We argue this is still a useful operational signal for memory governance, and we validate it empirically in a controlled synthetic environment where ground-truth utility is known: after 10,000 episodes, the Spearman rank-correlation between Memory Worth and true utilities reaches rho = 0.89 +/- 0.02 across 20 independent seeds, compared to rho = 0.00 for systems that never update their assessments. A retrieval-realistic micro-experiment with real text and neural embedding retrieval (all-MiniLM-L6-v2) further shows stale memories crossing the low-value threshold (MW = 0.17) while specialist memories remain high-value (MW = 0.77) across 3,000 episodes. The estimator requires only two scalar counters per memory unit and can be added to architectures that already log retrievals and episode outcomes.
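The two-counter estimator from the abstract is small enough to state in full. The sketch below follows the paper's description directly (two scalar counters per memory, worth = success fraction); only the prior value used before any evidence is an assumption:

```python
import random

class MemoryWorth:
    """Two-counter Memory Worth estimator: MW converges to
    Pr[success | memory retrieved] under stationary retrieval."""

    def __init__(self):
        self.success = 0
        self.failure = 0

    def update(self, outcome_success):
        """Record one episode outcome in which this memory was retrieved."""
        if outcome_success:
            self.success += 1
        else:
            self.failure += 1

    @property
    def worth(self):
        total = self.success + self.failure
        return self.success / total if total else 0.5  # assumed prior

# Simulate a memory whose retrieval co-occurs with success 77% of the time
# (mirroring the specialist-memory MW = 0.77 in the micro-experiment).
rng = random.Random(42)
mw = MemoryWorth()
for _ in range(10000):
    mw.update(rng.random() < 0.77)
```

Governance then reduces to thresholding: memories whose worth drifts below a low-value threshold become candidates for suppression or deprecation, while high-worth specialists are retained.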

[AI-94] BayMOTH: Bayesian optiMizatiOn with meTa-lookahead – a simple approacH

【速读】:该论文旨在解决元贝叶斯优化(meta-Bayesian optimization, meta-BO)在测试任务与元训练任务结构不匹配时,导致在线优化过程中产生次优查询的问题。其解决方案的关键在于提出一种统一框架下的简单meta-BO算法:该算法根据相关任务信息的有用性动态决定是否利用元知识——当相关任务信息被判定为有效时则采用,否则退化为基于前瞻(lookahead)的策略,从而在任务相关性较低的场景下仍能保持强鲁棒性和性能。

链接: https://arxiv.org/abs/2604.12005
作者: Rahman Ejaz,Varchas Gopalaswamy,Ricardo Luna,Aarne Lees,Vineet Gundecha,Christopher Kanan,Soumyendu Sarkar,Riccardo Betti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Bayesian optimization (BO) has demonstrated practicality and effectiveness for sequential optimization of expensive black-box functions in many real-world settings. Meta-Bayesian optimization (meta-BO) focuses on improving the sample efficiency of BO by making use of information from related tasks. Although meta-BO is sample-efficient when task structure transfers, poor alignment between meta-training and test tasks can cause suboptimal queries to be suggested during online optimization. To this end, we propose a simple meta-BO algorithm that utilizes related-task information when determined useful, falling back to lookahead otherwise, within a unified framework. We demonstrate competitiveness of our method with existing approaches on function optimization tasks, while retaining strong performance in low task-relatedness regimes where test tasks share limited structure with the meta-training set.

[AI-95] The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在长时程任务中表现不稳定的问题,这类任务需要复杂的、相互依赖的动作序列,而现有代理系统在此类场景下常出现失效行为,且缺乏跨领域的系统性诊断方法。解决方案的关键在于提出 HORIZON——一个跨域的诊断基准,用于系统构建长时程任务并分析失败模式;同时引入基于轨迹的“LLM作为裁判”(LLM-as-a-Judge)管道,实现可扩展、可复现的故障归因,并通过人类标注验证其有效性(人与人标注一致性 κ=0.61,人与裁判一致性 κ=0.84),从而为构建更可靠的长时程代理提供方法论基础和实践指导。

链接: https://arxiv.org/abs/2604.11978
作者: Xinyu Jessica Wang,Haoyue Bai,Yiyou Sun,Haorui Wang,Shuibai Zhang,Wenjie Hu,Mya Schroder,Bilge Mutlu,Dawn Song,Robert D Nowak
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents perform strongly on short- and mid-horizon tasks, but often break down on long-horizon tasks that require extended, interdependent action sequences. Despite rapid progress in agentic systems, these long-horizon failures remain poorly characterized, hindering principled diagnosis and comparison across domains. To address this gap, we introduce HORIZON, an initial cross-domain diagnostic benchmark for systematically constructing tasks and analyzing long-horizon failure behaviors in LLM-based agents. Using HORIZON, we evaluate state-of-the-art (SOTA) agents from multiple model families (GPT-5 variants and Claude models), collecting 3100+ trajectories across four representative agentic domains to study horizon-dependent degradation patterns. We further propose a trajectory-grounded LLM-as-a-Judge pipeline for scalable and reproducible failure attribution, and validate it with human annotation on trajectories, achieving strong agreement (inter-annotator \kappa=0.61; human-judge \kappa=0.84). Our findings offer an initial methodological step toward systematic, cross-domain analysis of long-horizon agent failures and offer practical guidance for building more reliable long-horizon agents. We release our project website, the HORIZON Leaderboard, at this https URL and welcome contributions from the community.
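The agreement figures are Cohen's kappa, which discounts chance-level agreement between two annotators. A standard implementation on toy failure-mode labels (the label vocabulary here is invented for illustration):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for inter-annotator agreement on categorical labels."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Two annotators labelling failure modes for six trajectories (toy data).
ann_1 = ["plan", "plan", "tool", "tool", "memory", "plan"]
ann_2 = ["plan", "plan", "tool", "memory", "memory", "plan"]
kappa = cohens_kappa(ann_1, ann_2)
```

Kappa of 0.61 (human-human) is conventionally read as substantial agreement, and 0.84 (human-judge) as near-perfect, which is what licenses using the LLM judge for scalable attribution.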

[AI-96] Narrative-Driven Paper-to-Slide Generation via ArcDeck

【速读】:该论文旨在解决学术论文到演示文稿(paper-to-slide)自动转换过程中存在的叙事逻辑断裂与结构失真问题,即现有方法直接对原始文本进行摘要式处理,难以保留论文的内在逻辑脉络。解决方案的关键在于提出ArcDeck框架,通过显式建模源论文的语篇结构(discourse tree)构建全局承诺文档(global commitment document),以此作为结构先验指导多智能体迭代优化过程;其中,专业化智能体协同执行批判与修订任务,最终生成具有更强叙事连贯性和逻辑一致性的幻灯片内容。

链接: https://arxiv.org/abs/2604.11969
作者: Tarik Can Ozden,Sachidanand VS,Furkan Horoz,Ozgur Kara,Junho Kim,James Matthew Rehg
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Project webpage: this https URL

点击查看摘要

Abstract:We introduce ArcDeck, a multi-agent framework that formulates paper-to-slide generation as a structured narrative reconstruction task. Unlike existing methods that directly summarize raw text into slides, ArcDeck explicitly models the source paper’s logical flow. It first parses the input to construct a discourse tree and establish a global commitment document, ensuring the high-level intent is preserved. These structural priors then guide an iterative multi-agent refinement process, where specialized agents iteratively critique and revise the presentation outline before rendering the final visual layouts and designs. To evaluate our approach, we also introduce ArcBench, a newly curated benchmark of academic paper-slide pairs. Experimental results demonstrate that explicit discourse modeling, combined with role-specific agent coordination, significantly improves the narrative flow and logical coherence of the generated presentations.
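The iterative critique-and-revise loop at the core of the refinement stage can be sketched as follows. The `LengthCritic` and its 4-word title budget are hypothetical stand-ins for the paper's specialized agents, which critique narrative flow rather than length:

```python
class LengthCritic:
    """Toy critic agent: flags the first slide whose title exceeds a word budget."""
    def review(self, outline):
        for i, title in enumerate(outline):
            if len(title.split()) > 4:
                return i
        return None

    def revise(self, outline, i):
        revised = list(outline)
        revised[i] = " ".join(revised[i].split()[:4])
        return revised


def refine(outline, critics, max_rounds=5):
    """Iterative multi-agent refinement: each critic in turn critiques the
    outline and applies a revision; stop once every critic is satisfied."""
    for _ in range(max_rounds):
        revised = False
        for critic in critics:
            issue = critic.review(outline)
            if issue is not None:
                outline = critic.revise(outline, issue)
                revised = True
        if not revised:
            break
    return outline
```

In ArcDeck this loop runs on the presentation outline, guided by the discourse tree and commitment document, before any visual rendering.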

[AI-97] ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism

【速读】:该论文旨在解决大规模低带宽去中心化训练中管道并行(pipeline parallelism)难以实现的问题,尤其是在通信带宽受限的环境下,传统方法因依赖高带宽通信而效率低下。其解决方案的关键在于提出一种名为残差瓶颈模型(Residual Bottleneck Model, ResBM)的新架构,该架构在流水线边界引入一个可端到端训练的残差编码器-解码器瓶颈模块,同时保留显式的低秩恒等路径(low-rank identity path),从而在不显著增加内存或计算开销的前提下,实现高达128倍的激活值压缩,并保持接近原始模型的收敛速度。

链接: https://arxiv.org/abs/2604.11947
作者: Alan Aboudib,Rodrigo Lopez Portillo A.,Kalei Brady,Steffen Cruz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Unlocking large-scale low-bandwidth decentralized training has the potential to utilize otherwise untapped compute resources. In centralized settings, large-scale multi-node training is primarily enabled by data and pipeline parallelism, two techniques that require ultra-high-bandwidth communication. While efficient methods now exist for decentralized data parallelism, pipeline parallelism remains the primary challenge. Recent efforts, such as Subspace Models (SM), have claimed up to 100x activation compression but rely on complex constrained optimization and diverge from true end-to-end training. In this paper, we propose a different approach, based on an architecture designed from the ground up to be native to low-bandwidth communication environments while still applicable to any standard transformer-based architecture. We call this architecture the Residual Bottleneck Model, or ResBM; it introduces a residual encoder-decoder bottleneck module across pipeline boundaries that can be trained end-to-end as part of the model’s parameters while preserving an explicit low-rank identity path. We show that ResBMs achieve state-of-the-art 128x activation compression without significant loss in convergence rates and without significant memory or compute overhead.
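A minimal numerical sketch of the bottleneck idea: only a narrow code plus a small low-rank skip crosses the pipeline link, instead of the full activation. The dimensions, random initialization, and rank here are hypothetical (the real module is trained end-to-end as part of the model):

```python
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class ResidualBottleneck:
    """Compress a d-dim activation to b dims across a pipeline boundary,
    then decompress and add an explicit low-rank identity (skip) path."""
    def __init__(self, d, b, rank, seed=0):
        rng = random.Random(seed)
        self.enc = rand_matrix(rng, b, d)   # encoder, trained end-to-end
        self.dec = rand_matrix(rng, d, b)   # decoder on the far side
        self.U = rand_matrix(rng, d, rank)  # low-rank identity path: U @ (V @ x)
        self.V = rand_matrix(rng, rank, d)

    def send(self, x):
        # only the b-dim code and the rank-dim skip cross the slow link
        return matvec(self.enc, x), matvec(self.V, x)

    def receive(self, code, skip):
        return [a + b for a, b in zip(matvec(self.dec, code), matvec(self.U, skip))]

d, b, rank = 512, 4, 2
block = ResidualBottleneck(d, b, rank)
code, skip = block.send([1.0] * d)
y = block.receive(code, skip)
compression = d / (b + rank)  # activation volume saved per boundary
```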

[AI-98] Can AI Detect Life? Lessons from Artificial Life

【速读】:该论文试图解决的问题是:当前基于机器学习的外星生命检测方法可能因模型对分布外样本(out-of-distribution samples)的敏感性而产生显著的假阳性结果,从而误导科学判断。解决方案的关键在于揭示现代机器学习方法在面对非地球来源样本时的脆弱性——即使样本完全不具备生命特征,模型仍可能以接近100%的置信度误判为“存在生命”,这是因为训练数据主要来源于地球上的生物和非生物有机分子混合物,而外星样本很可能超出这一分布范围,导致模型无法正确区分真实生物信号与人工构造的化学模式。

链接: https://arxiv.org/abs/2604.11915
作者: Ankit Gupta,Christoph Adami(Michigan State University)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Populations and Evolution (q-bio.PE)
备注: 6 pages, 7 figures. Alife 2026

点击查看摘要

Abstract:Modern machine learning methods have been proposed to detect life in extraterrestrial samples, drawing on their ability to distinguish biotic from abiotic samples based on training models using natural and synthetic organic molecular mixtures. Here we show using Artificial Life that such methods are easily fooled into detecting life with near 100% confidence even if the analyzed sample is not capable of life. This is due to modern machine learning methods’ propensity to be easily fooled by out-of-distribution samples. Because extra-terrestrial samples are very likely out of the distribution provided by terrestrial biotic and abiotic samples, using AI methods for life detection is bound to yield significant false positives.

[AI-99] Self-Monitoring Benefits from Structural Integration: Lessons from Metacognition in Continuous-Time Multi-Timescale Agents

【速读】:该论文旨在解决自监控能力(包括元认知、自我预测和主观时长)是否能有效提升强化学习智能体在复杂环境中的性能这一问题。研究发现,将这些模块作为辅助损失添加到多时间尺度皮层层次结构中时,并未带来统计显著的性能提升,且模块输出趋于恒定,对策略无显著影响;关键解决方案在于将自监控模块的输出结构化地整合进决策路径——即用置信度门控探索、用意外触发工作空间广播、并将自模型预测作为策略输入,这种集成方式在非平稳环境中表现出中等偏大的改进(Cohen’s d = 0.62),表明自监控应嵌入决策流程而非作为旁路模块。

链接: https://arxiv.org/abs/2604.11914
作者: Ying Xie
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-monitoring capabilities – metacognition, self-prediction, and subjective duration – are often proposed as useful additions to reinforcement learning agents. But do they actually help? We investigate this question in a continuous-time multi-timescale agent operating in predator-prey survival environments of varying complexity, including a 2D partially observable variant. We first show that three self-monitoring modules, implemented as auxiliary-loss add-ons to a multi-timescale cortical hierarchy, provide no statistically significant benefit across 20 random seeds, 1D and 2D predator-prey environments with standard and non-stationary variants, and training horizons up to 50,000 steps. Diagnosing the failure, we find the modules collapse to near-constant outputs (confidence std 0.006, attention allocation std 0.011) and the subjective duration mechanism shifts the discount factor by less than 0.03%. Policy sensitivity analysis confirms the agent’s decisions are unaffected by module outputs in this design. We then show that structurally integrating the module outputs – using confidence to gate exploration, surprise to trigger workspace broadcasts, and self-model predictions as policy input – produces a medium-large improvement over the add-on approach (Cohen’s d = 0.62, p = 0.06, paired) in a non-stationary environment. Component-wise ablations reveal that the TSM-to-policy pathway contributes most of this gain. However, structural integration does not significantly outperform a baseline with no self-monitoring (d = 0.15, p = 0.67), and a parameter-matched control without modules performs comparably, so the benefit may lie in recovering from the trend-level harm of ignored modules rather than in self-monitoring content. The architectural implication is that self-monitoring should sit on the decision pathway, not beside it.
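One of the structural-integration pathways described above, confidence-gated exploration, can be sketched as a confidence-modulated epsilon-greedy rule. This is an illustrative reading of the mechanism, not the paper's implementation; the action values and base rate are hypothetical:

```python
import random

def gated_action(q_values, confidence, rng, base_eps=0.3):
    """Confidence-gated epsilon-greedy: low self-model confidence raises
    the exploration rate; full confidence makes the policy act greedily."""
    eps = base_eps * (1.0 - confidence)
    if rng.random() < eps:
        return rng.randrange(len(q_values))                          # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])      # exploit
```

The key point of the paper is that such signals must sit on the decision pathway like this, rather than being trained as side-channel auxiliary losses that the policy never reads.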

[AI-100] How Transformers Learn to Plan via Multi-Token Prediction

【速读】:该论文旨在解决当前基于单标记预测(Next-Token Prediction, NTP)的语言模型在推理任务中难以捕捉全局结构的问题。其解决方案的关键在于引入多标记预测(Multi-Token Prediction, MTP),并通过理论与实证分析揭示MTP能够诱导出一种两阶段逆向推理机制:模型首先关注目标节点,再通过梯度解耦特性逐步回溯中间节点以重建路径。这种机制源于MTP提供的更清晰的训练信号,从而引导优化过程趋向于形成鲁棒且可解释的推理电路。

链接: https://arxiv.org/abs/2604.11912
作者: Jianhao Huang,Zhanpeng Zhou,Renqiu Xia,Baharan Mirzasoleiman,Weijie Su,Wei Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While next-token prediction (NTP) has been the standard objective for training language models, it often struggles to capture global structure in reasoning tasks. Multi-token prediction (MTP) has recently emerged as a promising alternative, yet its underlying mechanisms remain poorly understood. In this paper, we study how MTP facilitates reasoning, with a focus on planning. Empirically, we show that MTP consistently outperforms NTP on both synthetic graph path-finding tasks and more realistic reasoning benchmarks, such as Countdown and boolean satisfiability problems. Theoretically, we analyze a simplified two-layer Transformer on a star graph task. We prove that MTP induces a two-stage reverse reasoning process: the model first attends to the end node and then reconstructs the path by tracing intermediate nodes backward. This behavior arises from a gradient decoupling property of MTP, which provides a cleaner training signal compared to NTP. Ultimately, our results highlight how multi-token objectives inherently bias optimization toward robust and interpretable reasoning circuits.
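The difference between the two objectives is easiest to see in how training targets are constructed (a toy sketch over token lists, ignoring the model and loss themselves):

```python
def ntp_targets(tokens):
    """Next-token prediction: each prefix supervises exactly one token."""
    return [(tokens[:i], [tokens[i]]) for i in range(1, len(tokens))]

def mtp_targets(tokens, k):
    """Multi-token prediction: each prefix jointly supervises the next k
    tokens, exposing more of the path structure per training position."""
    return [(tokens[:i], tokens[i:i + k]) for i in range(1, len(tokens) - k + 1)]
```

On a path-finding sequence, the k-token window gives the model a direct training signal about where the path is heading, which the paper links to the learned end-node-first, backward-tracing mechanism.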

[AI-101] Thermodynamic Liquid Manifold Networks: Physics-Bounded Deep Learning for Solar Forecasting in Autonomous Off-Grid Microgrids

【速读】:该论文旨在解决当前深度学习模型在自治离网光伏系统中出现的关键问题:即在云层动态变化时存在严重的时序相位滞后,以及在夜间产生物理上不可能的发电现象。为弥合数据驱动建模与确定性天体力学之间的偏差,研究提出了一种基于热力学约束的新型网络架构——热力学液态流形网络(Thermodynamic Liquid Manifold Network)。其核心创新在于将22个气象与几何变量投影至Koopman线性化的黎曼流形空间,并引入谱校准单元与乘法型热力学Alpha门机制,从而在结构上强制遵守严格的天体几何约束,确保零夜间误差的同时实现高频率光学瞬变下的亚30分钟相位响应。该方法显著提升了预测精度与物理一致性,适用于边缘部署的微电网控制器。

链接: https://arxiv.org/abs/2604.11909
作者: Mohammed Ezzaldin Babiker Abdullah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:The stable operation of autonomous off-grid photovoltaic systems requires solar forecasting algorithms that respect atmospheric thermodynamics. Contemporary deep learning models consistently exhibit critical anomalies, primarily severe temporal phase lags during cloud transients and physically impossible nocturnal power generation. To resolve this divergence between data-driven modeling and deterministic celestial mechanics, this research introduces the Thermodynamic Liquid Manifold Network. The methodology projects 22 meteorological and geometric variables into a Koopman-linearized Riemannian manifold to systematically map complex climatic dynamics. The architecture integrates a Spectral Calibration unit and a multiplicative Thermodynamic Alpha-Gate. This system synthesizes real-time atmospheric opacity with theoretical clear-sky boundary models, structurally enforcing strict celestial geometry compliance. This completely neutralizes phantom nocturnal generation while maintaining zero-lag synchronization during rapid weather shifts. Validated against a rigorous five-year testing horizon in a severe semi-arid climate, the framework achieves an RMSE of 18.31 Wh/m2 and a Pearson correlation of 0.988. The model strictly maintains a zero-magnitude nocturnal error across all 1826 testing days and exhibits a sub-30-minute phase response during high-frequency optical transients. Comprising exactly 63,458 trainable parameters, this ultra-lightweight design establishes a robust, thermodynamically consistent standard for edge-deployable microgrid controllers.
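The multiplicative gating idea can be sketched as follows: a learned opacity factor scales a physical clear-sky envelope, so nocturnal output is structurally zero rather than merely penalized. The sinusoidal envelope and 06:00-18:00 daylight window are toy assumptions, not the paper's clear-sky model:

```python
import math

def clear_sky(hour, peak=1000.0):
    """Toy clear-sky irradiance envelope: follows solar elevation and is
    exactly zero while the sun is below the horizon."""
    elevation = math.sin(math.pi * (hour - 6.0) / 12.0)  # up from 06:00 to 18:00
    return peak * max(elevation, 0.0)

def gated_forecast(alpha, hour):
    """Multiplicative alpha-gate: learned atmospheric transmissivity
    alpha in [0, 1] scales the physical envelope, enforcing celestial
    geometry regardless of what the network predicts."""
    return min(max(alpha, 0.0), 1.0) * clear_sky(hour)
```

Because the envelope multiplies (rather than adds to) the data-driven term, no value of alpha can produce phantom generation at night.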

[AI-102] Disposition Distillation at Small Scale: A Three-Arc Negative Result

【速读】:该论文试图解决的问题是:如何通过训练将行为倾向(如自我验证、不确定性承认和反馈整合)有效注入小型语言模型(0.6B–2.3B参数),并评估其在推理阶段是否能以不损害内容质量或避免风格模仿的方式提升这些倾向的可测量表现。解决方案的关键在于采用四阶段全MIT蒸馏流水线进行训练,并进一步探索三种不同路径——包括基于监督微调(SFT)与直接偏好优化(DPO)的LoRA适配、推理时注意力头干预(特别是o_proj层)、以及一个无需训练、读取最终token隐藏状态 h_last 的冻结基础模型侧车模块(sidecar)。然而,研究发现所有方法均未能在不造成内容退化或陷入风格模仿的前提下显著提升judge-measured的行为倾向,最终得出一个三弧负结果,揭示了当前小模型中行为倾向注入的系统性失败模式及其机制。

链接: https://arxiv.org/abs/2604.11867
作者: Hari Sadasivan(Tinman Lab)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 4 figures

点击查看摘要

Abstract:We set out to train behavioral dispositions (self-verification, uncertainty acknowledgment, feedback integration) into small language models (0.6B to 2.3B effective parameters) through a four-stage all-MIT distillation pipeline, with follow-on experiments on inference-time attention-head interventions and a frozen-base confidence-gated sidecar. An internal draft reported +33.9-point MCAS and +15.3-point HumanEval gains on a Qwen3-0.6B student; a second-pass sanity check falsified both numbers before publication. The HumanEval delta was a truncation artifact (n_predict=512) that inverted to -8.0 points at n_predict=1024; the MCAS gain disappeared under apples-to-apples scoring. That falsification triggered three subsequent arcs. Across (1) SFT/DPO LoRA on three model families and two domains, (2) inference-time attention-head tempering on o_proj, and (3) a training-free frozen-base sidecar reading the final-token hidden state h_last, we find no operator that moves judge-measured disposition without damaging content or collapsing into stylistic mimicry. The failure is consistent across five models (Qwen3-0.6B, Qwen3-1.7B, Qwen3.5-0.8B, Gemma 4 E2B, and SmolLM2-1.7B-Instruct). A within-distribution cross-validation pass (AUC=0.683) collapsed to chance on fresh prompts (AUC=0.516). We contribute a three-arc negative result with mechanism, a two-failure-mode taxonomy for linear h_last probes, and an honest falsification pipeline that converts the class of false positives we ourselves produced into publishable negatives. As an independent finding, Gemma 4 E2B exhibits near-complete confidence-correctness decoupling on the Chef domain (assertion asymmetry -0.009; the model asserts at 91% regardless of correctness).

[AI-103] MVAdapt: Zero-Shot Multi-Vehicle Adaptation for End-to-End Autonomous Driving

【速读】:该论文旨在解决端到端(End-to-End, E2E)自动驾驶模型在部署到不同车辆平台时性能显著下降的问题,即“车辆域差距”(vehicle-domain gap)。这一问题源于模型训练时使用的固定自车(ego-vehicle)与实际部署车辆在尺寸、质量或驱动系统特性上的差异,导致其驾驶策略无法有效泛化。解决方案的关键在于提出MVAdapt框架,该框架通过引入一个轻量级的物理编码器(physics encoder)和交叉注意力模块(cross-attention module),将车辆物理属性显式地条件化到场景特征上,从而在不更新主干网络的前提下实现对不同车辆的动力学特性的适应,显著提升了模型在同分布及未见车辆上的零样本迁移能力和少量数据下的校准效率。

链接: https://arxiv.org/abs/2604.11854
作者: Haesung Oh,Jaeheung Park
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:End-to-End (E2E) autonomous driving models are usually trained and evaluated with a fixed ego-vehicle, even though their driving policy is implicitly tied to vehicle dynamics. When such a model is deployed on a vehicle with different size, mass, or drivetrain characteristics, its performance can degrade substantially; we refer to this problem as the vehicle-domain gap. To address it, we propose MVAdapt, a physics-conditioned adaptation framework for multi-vehicle E2E driving. MVAdapt combines a frozen TransFuser++ scene encoder with a lightweight physics encoder and a cross-attention module that conditions scene features on vehicle properties before waypoint decoding. In the CARLA Leaderboard 1.0 benchmark, MVAdapt improves over naive transfer and multi-embodiment adaptation baselines on both in-distribution and unseen vehicles. We further show two complementary behaviors: strong zero-shot transfer on many unseen vehicles, and data-efficient few-shot calibration for severe physical outliers. These results suggest that explicitly conditioning E2E driving policies on vehicle physics is an effective step toward more transferable autonomous driving models. All codes are available at this https URL
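The conditioning step can be illustrated with plain single-head dot-product cross-attention, where scene tokens query vehicle-physics tokens. Feature dimensions and the absence of learned projections are simplifications; MVAdapt's actual module sits between a frozen TransFuser++ encoder and the waypoint decoder:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(scene_feats, physics_feats):
    """Single-head dot-product cross-attention: each scene token queries the
    vehicle-physics tokens and receives a physics-weighted summary."""
    out = []
    for q in scene_feats:
        w = softmax([sum(qi * ki for qi, ki in zip(q, k)) for k in physics_feats])
        out.append([sum(wi * v[j] for wi, v in zip(w, physics_feats))
                    for j in range(len(physics_feats[0]))])
    return out
```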

[AI-104] DBGL: Decay-aware Bipartite Graph Learning for Irregular Medical Time Series Classification

【速读】:该论文旨在解决不规则医疗时间序列(Irregular Medical Time Series)建模中的关键挑战,即如何在保持原始采样不规则性与缺失模式的前提下,准确捕捉变量间随时间衰减的异质性。现有方法常因人为对齐而扭曲时间不规则性,且无法建模变量衰减规律的差异,导致表征效果不佳。解决方案的关键在于提出DBGL(Decay-Aware Bipartite Graph Learning),其核心创新包括:构建患者-变量二部图以无损地建模采样不规则性和动态变量关系;设计节点特定的时间衰减编码机制,根据采样间隔自适应学习每个变量的衰减速率,从而更精确地刻画不规则时序动态特性。

链接: https://arxiv.org/abs/2604.11842
作者: Jian Chen,Yuzhu Hu,Xiaoyan Yuan,Yuxuan Hu,Jinfeng Xu,Yipeng Du,Wenhao Yuan,Wei Wang,Edith C. H. Ngai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Irregular Medical Time Series play a critical role in the clinical domain to better understand the patient’s condition. However, inherent irregularity arising from heterogeneous sampling rates, asynchronous observations, and variable gaps poses key challenges for reliable modeling. Existing methods often distort temporal sampling irregularity and missingness patterns while failing to capture variable decay irregularity, resulting in suboptimal representations. To address these limitations, we introduce DBGL, Decay-Aware Bipartite Graph Learning for Irregular Medical Time Series. DBGL first introduces a patient-variable bipartite graph that simultaneously captures irregular sampling patterns without artificial alignment and adaptively models variable relationships for temporal sampling irregularity modeling, enhancing representation learning. To model variable decay irregularity, DBGL designs a novel node-specific temporal decay encoding mechanism that captures each variable’s decay rates based on sampling interval, yielding a more accurate and faithful representation of irregular temporal dynamics. We evaluate the performance of DBGL on four publicly available datasets, and the results show that DBGL outperforms all baselines.
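The node-specific decay encoding can be sketched as exponential weights over each variable's sampling intervals; the fixed rates below stand in for the per-variable rates DBGL learns:

```python
import math

def decay_weights(timestamps, t_now, rate):
    """Exponential decay weights for one variable's observations; `rate` is
    the node-specific decay rate the model would learn per variable."""
    return [math.exp(-rate * (t_now - t)) for t in timestamps]

# fast-moving vitals decay quickly; slow-moving labs decay gently
hr_w = decay_weights([0.0, 2.0, 4.0], t_now=5.0, rate=0.5)      # e.g. heart rate
lab_w = decay_weights([0.0, 24.0, 48.0], t_now=50.0, rate=0.01)  # e.g. lab value
```

Recent observations always weigh more, but how fast old ones fade depends on the variable, which is exactly the decay irregularity the paper targets.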

[AI-105] Polynomial Expansion Rank Adaptation: Enhancing Low-Rank Fine-Tuning with High-Order Interactions ACL2026

【速读】:该论文旨在解决低秩适配(Low-rank Adaptation, LoRA)方法因严格线性结构而导致的表达能力受限问题,即其仅能捕获低秩因子间的首阶依赖关系,难以建模非线性及高阶参数交互。解决方案的关键在于提出多项式扩展低秩适配(Polynomial Expansion Rank Adaptation, PERA),通过在低秩因子空间中引入结构化的多项式展开,在不增加秩或推理成本的前提下,将适配空间转化为能够捕捉更高阶非线性耦合的多项式流形,从而显著提升模型的表达能力和特征利用效率。

链接: https://arxiv.org/abs/2604.11841
作者: Wenhao Zhang,Lin Mu,Li Ni,Peiquan Jin,Yiwen Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026 findings

点击查看摘要

Abstract:Low-rank adaptation (LoRA) is a widely used strategy for efficient fine-tuning of large language models (LLMs), but its strictly linear structure fundamentally limits expressive capacity. The bilinear formulation of weight updates captures only first-order dependencies between low-rank factors, restricting the modeling of nonlinear and higher-order parameter interactions. In this paper, we propose Polynomial Expansion Rank Adaptation (PERA), a novel method that introduces structured polynomial expansion directly into the low-rank factor space. By expanding each low-rank factor to synthesize high-order interaction terms before composition, PERA transforms the adaptation space into a polynomial manifold capable of modeling richer nonlinear coupling without increasing rank or inference cost. We provide theoretical analysis demonstrating that PERA offers enhanced expressive capacity and more effective feature utilization compare to existing linear adaptation approaches. Empirically, PERA consistently outperforms state-of-the-art methods across diverse benchmarks. Notably, our experiments show that incorporating high-order nonlinear components particularly square terms is crucial for enhancing expressive capacity and maintaining strong and robust performance under various rank settings. Our code is available at this https URL
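The order-2 expansion can be sketched by stacking elementwise squares under a low-rank factor before composing the update. Shapes and values below are hypothetical; the point is that the composed update now contains square terms without raising the nominal rank budget:

```python
def matmul(X, Y):
    """Plain-list matrix product (X: m x k, Y: k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def expand(A):
    """Order-2 polynomial expansion of a low-rank factor: stack the
    elementwise squares of each row under the original rows."""
    return A + [[a * a for a in row] for row in A]

r, d_in, d_out = 2, 3, 3
A = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]       # low-rank factor, r x d_in
B = [[0.1] * (2 * r) for _ in range(d_out)]  # d_out x 2r: sees squares too
delta_W = matmul(B, expand(A))               # weight update, d_out x d_in
```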

[AI-106] Beyond Static Sandboxing: Learned Capability Governance for Autonomous AI Agents NEURIPS2026

【速读】:该论文针对开源AI代理运行时(如OpenClaw)中存在的能力过度分配问题(capability overprovisioning problem)提出解决方案,即无论任务类型如何,系统默认向所有会话暴露全部工具权限,导致例如摘要任务与代码部署任务共享相同的Shell执行、子代理生成和凭证访问能力,造成高达15倍的资源冗余。为解决此问题,论文提出四层自适应治理框架Aethelgard,其核心在于通过学习每类任务的最小必要能力集实现最小权限原则:第2层基于PPO强化学习策略,利用审计日志训练出任务特定的技能最小集合;第1层能力管理者动态控制代理在每个会话中可见的工具范围;第3层安全路由模块采用混合规则与微调分类器在工具调用前进行拦截,从而实现对AI代理行为的细粒度管控与安全性提升。

链接: https://arxiv.org/abs/2604.11839
作者: Bronislav Sidik,Lior Rokach
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 17 pages (9 content pages), 2 figures, 7 tables. Submitted to NeurIPS 2026 Agent Safety Workshop. Code and dataset available at this https URL

点击查看摘要

Abstract:Autonomous AI agents built on open-source runtimes such as OpenClaw expose every available tool to every session by default, regardless of the task. A summarization task receives the same shell execution, subagent spawning, and credential access capabilities as a code deployment task, a 15x overprovision ratio that we call the capability overprovisioning problem. Existing defenses, including the NemoClaw container sandbox and the Cisco DefenseClaw skill scanner, address containment and threat detection but do not learn the minimum viable capability set for each task type. We present Aethelgard, a four layer adaptive governance framework that enforces least privilege for AI agents through a learned policy. Layer 1, the Capability Governor, dynamically scopes which tools the agent is aware of in each session. Layer 3, the Safety Router, intercepts tool calls before execution using a hybrid rule based and fine tuned classifier. Layer 2, the RL Learning Policy, trains a PPO policy on the accumulated audit log to learn the minimum viable skill set for each task type.
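The capability-governor idea reduces to a per-task-type least-privilege lookup. The toolset and learned scopes below are hypothetical illustrations (in Aethelgard the scopes are produced by the PPO policy, not hand-written):

```python
FULL_TOOLSET = {"shell", "spawn_subagent", "credentials",
                "web_search", "file_read", "file_write"}

# hypothetical learned least-privilege scopes per task type
LEARNED_SCOPES = {
    "summarization": {"file_read"},
    "code_deployment": {"shell", "file_read", "file_write"},
}

def scope_tools(task_type):
    """Capability governor: expose only the learned minimum viable toolset;
    unknown task types get nothing by default."""
    return LEARNED_SCOPES.get(task_type, set())

def overprovision_ratio(task_type):
    """How many times more tools a default-everything runtime would grant."""
    granted = scope_tools(task_type)
    return len(FULL_TOOLSET) / max(len(granted), 1)
```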

[AI-107] A Layer-wise Analysis of Supervised Fine-Tuning ACL2026

【速读】:该论文旨在解决监督微调(Supervised Fine-Tuning, SFT)过程中因灾难性遗忘(catastrophic forgetting)导致的模型性能下降问题,并揭示指令遵循能力在不同网络层中的渐进式涌现机制。研究表明,中间层(20%-80%深度范围)具有稳定性,而最终层则表现出高度敏感性;基于此发现,作者提出“中段块高效微调”(Mid-Block Efficient Tuning)方法,仅选择性地更新这些关键中间层,从而在保持高性能的同时显著降低参数开销。实验证明,该方法在GSM8K基准上相较于标准LoRA提升达10.2%(OLMo2-7B模型),验证了有效对齐能力在架构上具有局部化而非分布式特性。

链接: https://arxiv.org/abs/2604.11838
作者: Qinghua Zhao,Xueling Gong,Xinyu Chen,Zhongfeng Kang,Xinlu Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026 main conference

点击查看摘要

Abstract:While critical for alignment, Supervised Fine-Tuning (SFT) incurs the risk of catastrophic forgetting, yet the layer-wise emergence of instruction-following capabilities remains elusive. We investigate this mechanism via a comprehensive analysis utilizing information-theoretic, geometric, and optimization metrics across model scales (1B-32B). Our experiments reveal a distinct depth-dependent pattern: middle layers (20%-80%) are stable, whereas final layers exhibit high sensitivity. Leveraging this insight, we propose Mid-Block Efficient Tuning, which selectively updates these critical intermediate layers. Empirically, our method outperforms standard LoRA by up to 10.2% on GSM8K (OLMo2-7B) with reduced parameter overhead, demonstrating that effective alignment is architecturally localized rather than distributed. The code is publicly available at this https URL.
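The layer-selection rule is straightforward: only layers whose relative depth falls in the stable 20%-80% band are updated. A minimal sketch (how ties at band edges are handled is an assumption):

```python
def mid_block_layers(num_layers, lo=0.2, hi=0.8):
    """Indices of layers in the 20%-80% relative-depth band; only these
    would receive fine-tuning updates, the rest stay frozen."""
    return [i for i in range(num_layers) if lo <= i / (num_layers - 1) <= hi]

trainable = mid_block_layers(32)  # e.g. a 32-layer model
```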

[AI-108] Schema-Adaptive Tabular Representation Learning with LLM s for Generalizable Multimodal Clinical Reasoning

【速读】:该论文旨在解决表格数据(tabular data)在跨schema场景下缺乏语义理解能力的问题,尤其是在临床医学领域中电子健康记录(EHR)结构差异显著时,传统机器学习方法难以实现有效的特征迁移与泛化。其解决方案的关键在于提出一种基于大语言模型(LLM)的Schema-Adaptive Tabular Representation Learning方法,通过将结构化变量转化为自然语言语句并利用预训练LLM进行编码,从而生成可迁移的表格嵌入(tabular embeddings),实现无需人工特征工程或重新训练即可在未见过的schema上完成零样本对齐(zero-shot alignment)。该方法显著提升了跨域诊断任务的性能,并在阿尔茨海默病(AD)相关数据集上展现出优于临床专家基准的表现。

链接: https://arxiv.org/abs/2604.11835
作者: Hongxi Mao,Wei Zhou,Mengting Jia,Tao Fang,Huan Gao,Bin Zhang,Shangyang Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:Machine learning for tabular data remains constrained by poor schema generalization, a challenge rooted in the lack of semantic understanding of structured variables. This challenge is particularly acute in domains like clinical medicine, where electronic health record (EHR) schemas vary significantly. To solve this problem, we propose Schema-Adaptive Tabular Representation Learning, a novel method that leverages large language models (LLMs) to create transferable tabular embeddings. By transforming structured variables into semantic natural language statements and encoding them with a pretrained LLM, our approach enables zero-shot alignment across unseen schemas without manual feature engineering or retraining. We integrate our encoder into a multimodal framework for dementia diagnosis, combining tabular and MRI data. Experiments on NACC and ADNI datasets demonstrate state-of-the-art performance and successful zero-shot transfer to unseen schemas, significantly outperforming clinical baselines, including board-certified neurologists, in retrospective diagnostic tasks. These results validate our LLM-driven approach as a scalable, robust solution for heterogeneous real-world data, offering a pathway to extend LLM-based reasoning to structured domains.
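The serialization step that makes embeddings schema-independent can be sketched as simple templating from column names to natural-language statements; the templates and variable names below are hypothetical examples, and the real system feeds the statements to a pretrained LLM encoder:

```python
def row_to_statements(row, schema):
    """Serialize one tabular record into natural-language statements that a
    pretrained text encoder can embed, independent of the source schema."""
    return [tmpl.format(value=row[col]) for col, tmpl in schema.items() if col in row]

# hypothetical templates for two clinical variables
schema = {
    "age": "The patient is {value} years old.",
    "mmse": "The MMSE cognitive score is {value}.",
}
stmts = row_to_statements({"age": 74, "mmse": 21}, schema)
```

Because a new EHR schema only needs new templates (or automatically derived ones), no retraining or manual feature alignment is required for zero-shot transfer.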

[AI-109] The Non-Optimality of Scientific Knowledge: Path Dependence, Lock-In, and the Local Minimum Trap

【速读】:该论文试图解决的问题是:科学知识体系在历史发展过程中往往陷入局部最优解(local optimum),而非达到全局最优的自然描述,这限制了科学对自然本质的深入理解。其解决方案的关键在于识别并突破三种相互交织的锁定机制——认知锁定(cognitive lock-in)、形式锁定(formal lock-in)和制度锁定(institutional lock-in),并通过元科学(meta-science)策略设计来主动规避这些路径依赖,从而引导科学研究沿着更具解释力和普适性的方向演进。

链接: https://arxiv.org/abs/2604.11828
作者: Mohamed Mabrok
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Science is widely regarded as humanity’s most reliable method for uncovering truths about the natural world. Yet the trajectory of scientific discovery is rarely examined as an optimization problem in its own right. This paper argues that the body of scientific knowledge, at any given historical moment, represents a local optimum rather than a global one–that the frameworks, formalisms, and paradigms through which we understand nature are substantially shaped by historical contingency, cognitive path dependence, and institutional lock-in. Drawing an analogy to gradient descent in machine learning, we propose that science follows the steepest local gradient of tractability, empirical accessibility, and institutional reward, and in doing so may bypass fundamentally superior descriptions of nature. We develop this thesis through detailed case studies spanning mathematics, physics, chemistry, biology, neuroscience, and statistical methodology. We identify three interlocking mechanisms of lock-in–cognitive, formal, and institutional–and argue that recognizing these mechanisms is a prerequisite for designing meta-scientific strategies capable of escaping local optima. We conclude by proposing concrete interventions and discussing the epistemological implications of our thesis for the philosophy of science.

[AI-110] GRACE: A Dynamic Coreset Selection Framework for Large Language Model Optimization

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在训练过程中因参数量庞大和Transformer架构复杂而导致的高计算资源消耗与低效优化问题。现有基于coreset选择的方法难以适应LLM训练的动态特性且缺乏可扩展性,无法在保持性能的同时显著降低训练成本。解决方案的关键在于提出一种图引导的自适应动态coreset选择框架GRACE,其通过融合表示多样性与基于梯度的重要性度量来动态构建和更新coreset,确保所选子集兼具信息丰富性和训练效率;同时利用k-NN图传播机制降低频繁更新带来的计算开销,并仅对关键样本进行分数与嵌入的局部更新,从而有效适配LLM训练中不断变化的学习动态。

链接: https://arxiv.org/abs/2604.11810
作者: Tianhao Tang,Haoyang Li,Lei Chen
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, their immense number of parameters and complex transformer-based architectures result in significant resource demands and computational complexity during training, making it challenging to optimize them efficiently on large datasets. To reduce training costs while preserving performance, researchers have investigated coreset selection techniques, which aim to identify small, representative subsets of the entire training dataset to accelerate LLM training. However, existing coreset selection methods fail to adapt to the dynamic nature of LLM training and often struggle with scalability for models of this size. To address these limitations, we propose a graph-guided adaptive and dynamic coreset selection framework for LLMs, namely GRACE. GRACE dynamically constructs and updates coresets by combining representation diversity with gradient-based importance metrics, ensuring both informativeness and efficiency. To mitigate the computational cost of frequent updates, GRACE leverages a k -NN graph-based propagation mechanism and selectively updates scores and embeddings, adapting to evolving training dynamics. Extensive experiments on three benchmarks demonstrate that GRACE significantly improves training efficiency and downstream performance across diverse LLMs and tasks.
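The k-NN propagation step can be sketched as score smoothing over a neighbor graph followed by top-k selection. The mixing weight, step count, and toy graph are assumptions; GRACE combines such propagated scores with gradient-based importance:

```python
def propagate(scores, neighbors, alpha=0.5, steps=2):
    """Smooth per-example importance scores over a k-NN graph: each node
    mixes its own score with the mean of its neighbors' scores."""
    for _ in range(steps):
        scores = [(1 - alpha) * s + alpha * sum(scores[j] for j in nbrs) / len(nbrs)
                  for s, nbrs in zip(scores, neighbors)]
    return scores

def select_coreset(scores, budget):
    """Keep the `budget` highest-scoring examples as the coreset."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:budget]
```

Propagation lets examples near known-important ones inherit importance cheaply, which is what makes selective (rather than full) score updates viable during training.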

[AI-111] Should There be a Teacher In-the-Loop? A Study of Generative AI Personalized Tasks in Middle School

【速读】:该论文旨在解决如何在教学实践中高效利用生成式AI(Generative AI)实现对学生学习内容的个性化定制问题。其核心挑战在于:尽管商业生成式AI工具声称可节省教师时间,但当教师作为“在环参与者”(teacher-in-the-loop)与生成式AI协作时,实际效率和学生偏好之间的匹配度尚不明确。解决方案的关键在于通过实证研究揭示教师与ChatGPT协同创作个性化数学题目的过程特征——包括教师的提示策略(prompting moves)、任务生成效率以及学生对不同粒度个性化内容(如广泛兴趣匹配 vs. 具体流行文化引用)的反馈。研究发现,教师主导的个性化通常停留在较粗粒度层面,而学生更偏好细粒度、具体的文化参考;同时,教师虽能逐步提升与生成式AI协作的能力,但因需反复调整内容真实性和深度,整体时间效率并未显著改善。

链接: https://arxiv.org/abs/2602.15876
作者: Candace Walkington,Mingyu Feng,Itffini Pruitt-Britton,Theodora Beauchamp,Andrew Lan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adapting instruction to the fine-grained needs of individual students is a powerful application of recent advances in large language models. These generative AI models can create tasks that correspond to students’ interests and enact context personalization, enhancing students’ interest in learning academic content. However, when there is a teacher in-the-loop creating or modifying tasks with generative AI, it is unclear how efficient this process might be, despite commercial generative AI tools’ claims that they will save teachers time. In the present study, we teamed 7 middle school mathematics teachers with ChatGPT to create personalized versions of problems in their curriculum, to correspond to their students’ interests. We look at the prompting moves teachers made, their efficiency when creating problems, and the reactions of their 521 7th grade students who received the personalized assignments. We find that having a teacher-in-the-loop results in generative AI-enhanced personalization being enacted at a relatively broad grain size, whereas students tend to prefer a smaller grain size where they receive specific popular culture references that interest them. Teachers spent a lot of effort adjusting popular culture references and addressing issues with the depth or realism of the problems generated, giving higher or lower levels of ownership to the generative AI. Teachers were able to improve in their ability to craft interesting problems in partnership with generative AI, but this process did not appear to become particularly time efficient as teachers learned and reflected on their students’ data, iterating their approaches.

[AI-112] Multi-Worker Selection based Distributed Swarm Learning for Edge IoT with Non-i.i.d. Data

【速读】:该论文旨在解决分布式群体学习(Distributed Swarm Learning, DSL)在多接入边缘计算环境中因非独立同分布(non-i.i.d.)数据导致的模型训练性能下降与收敛不稳定问题。其核心挑战在于缺乏对数据异质性如何影响训练准确性的理论指导,以及现有DSL方法难以有效应对局部数据分布差异带来的优化障碍。解决方案的关键在于提出一种新的多工作者选择机制——M-DSL算法,该算法引入了一个全新的non-i.i.d.程度度量(non-i.i.d. degree metric),用于量化本地数据集间的统计差异,并据此动态选择对全局模型更新贡献显著的多个工作者,从而提升训练效率与模型精度。理论分析和实验结果表明,M-DSL在多种异构数据场景下均能实现优于基准方法的性能提升与网络智能增强。

链接: https://arxiv.org/abs/2509.18367
作者: Zhuoyu Yao,Yue Wang,Songyang Zhang,Yingshu Li,Zhipeng Cai,Zhi Tian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Recent advances in distributed swarm learning (DSL) offer a promising paradigm for edge Internet of Things. Such advancements enhance data privacy, communication efficiency, energy saving, and model scalability. However, the presence of non-independent and identically distributed (non-i.i.d.) data poses a significant challenge for multi-access edge computing, degrading learning performance and causing the training behavior of vanilla DSL to diverge. Further, there is still a lack of theoretical guidance on how data heterogeneity affects model training accuracy, which requires thorough investigation. To fill the gap, this paper first studies data heterogeneity by measuring the impact of non-i.i.d. datasets under the DSL framework. This then motivates a new multi-worker selection design for DSL, termed the M-DSL algorithm, which works effectively with distributed heterogeneous data. A new non-i.i.d. degree metric is introduced and defined in this work to formulate the statistical difference among local datasets, which builds a connection between the measure of data heterogeneity and the evaluation of DSL performance. In this way, our M-DSL guides effective selection of multiple workers who make prominent contributions to global model updates. We also provide theoretical analysis on the convergence behavior of our M-DSL, followed by extensive experiments on different heterogeneous datasets and non-i.i.d. data settings. Numerical results verify the performance improvement and network intelligence enhancement provided by our M-DSL beyond the benchmarks.
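A simple stand-in for the non-i.i.d. degree metric is the L1 distance between a worker's local label distribution and the global one, with workers selected by that degree. Both the metric form and the "lowest degree first" selection rule are illustrative assumptions, not the paper's exact definitions:

```python
def noniid_degree(local_counts, global_counts):
    """L1 distance between a worker's label distribution and the global one
    (a simple stand-in for the paper's non-i.i.d. degree metric)."""
    ln, gn = sum(local_counts), sum(global_counts)
    return sum(abs(l / ln - g / gn) for l, g in zip(local_counts, global_counts))

def select_workers(degrees, m):
    """Pick the m workers whose local data deviates least from the
    global distribution (one plausible selection rule)."""
    return sorted(range(len(degrees)), key=lambda i: degrees[i])[:m]

global_counts = [50, 50]  # balanced two-class global distribution
degrees = [noniid_degree(c, global_counts) for c in ([10, 0], [5, 5], [3, 7])]
```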

[AI-113] X-VC: Zero-shot Streaming Voice Conversion in Codec Space

[Quick Read]: This paper addresses the difficulty zero-shot voice conversion (VC) systems face in interactive scenarios: achieving high-fidelity speaker transfer and low-latency streaming inference at the same time. The key to the solution is X-VC, a one-step streaming VC system operating in the latent space of a pretrained neural codec: it adopts a dual-conditioning acoustic converter that jointly models source codec latents and frame-level acoustic conditions derived from target reference speech, while injecting utterance-level target-speaker information through adaptive normalization. To reduce the mismatch between training and inference, the model is trained on generated paired data with a role-assignment strategy (standard, reconstruction, and reversed modes), and streaming inference uses a chunkwise scheme with overlap smoothing aligned with the codec's segment-based training paradigm, maintaining high speaker similarity while substantially lowering the real-time factor.

Link: https://arxiv.org/abs/2604.12456
Authors: Qixi Zheng, Yuxiang Zhao, Tianrui Wang, Wenxi Chen, Kele Xu, Yikang Li, Qinyuan Chen, Xipeng Qiu, Kai Yu, Xie Chen
Affiliations: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments:

Abstract:Zero-shot voice conversion (VC) aims to convert a source utterance into the voice of an unseen target speaker while preserving its linguistic content. Although recent systems have improved conversion quality, building zero-shot VC systems for interactive scenarios remains challenging because high-fidelity speaker transfer and low-latency streaming inference are difficult to achieve simultaneously. In this work, we present X-VC, a zero-shot streaming VC system that performs one-step conversion in the latent space of a pretrained neural codec. X-VC uses a dual-conditioning acoustic converter that jointly models source codec latents and frame-level acoustic conditions derived from target reference speech, while injecting utterance-level target speaker information through adaptive normalization. To reduce the mismatch between training and inference, we train the model with generated paired data and a role-assignment strategy that combines standard, reconstruction, and reversed modes. For streaming inference, we further adopt a chunkwise inference scheme with overlap smoothing that is aligned with the segment-based training paradigm of the codec. Experiments on Seed-TTS-Eval show that X-VC achieves the best streaming WER in both English and Chinese, strong speaker similarity in same-language and cross-lingual settings, and substantially lower offline real-time factor than the compared baselines. These results suggest that codec-space one-step conversion is a practical approach for building high-quality low-latency zero-shot VC systems. Audio samples are available at this https URL. Our code and checkpoints will also be released.

[AI-114] FRTSearch: Unified Detection and Parameter Inference of Fast Radio Transients using Instance Segmentation

[Quick Read]: This paper addresses the mismatch between the massive data volumes produced by modern radio telescopes and traditional single-pulse search algorithms, which are computationally expensive and suffer high false-positive rates driven by radio frequency interference (RFI). The key to the solution is FRTSearch, an end-to-end framework that unifies the detection of Fast Radio Transients (FRTs) with the inference of their physical parameters: first, exploiting the morphological universality of dispersive trajectories in time-frequency dynamic spectra, detection is recast as a pattern-recognition problem governed by the cold plasma dispersion relation; second, the pixel-level annotated CRAFTS-FRT dataset is constructed to train a Mask R-CNN model for precise trajectory segmentation; finally, the physics-driven IMPIC algorithm derives the Dispersion Measure (DM) and Time of Arrival (ToA) directly from geometric coordinates. Experiments show 98.0% recall, a reduction in false positives of over 99.9%, a processing speedup of up to 13.9x, and robust cross-facility generalization, marking a paradigm shift from "search-then-identify" to "detect-and-infer".

Link: https://arxiv.org/abs/2604.12344
Authors: Bin Zhang, Yabiao Wang, Xiaoyao Xie, Shanping You, Xuhong Yu, Qiuhua Li, Hongwei Li, Shaowen Du, Chenchen Miao, Dengke Zhou, Jianhua Fang, Jiafu Wu, Pei Wang, Di Li
Affiliations: Unknown
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in The Astrophysical Journal Supplement Series (ApJS)

Abstract:The exponential growth of data from modern radio telescopes presents a significant challenge to traditional single-pulse search algorithms, which are computationally intensive and prone to high false-positive rates due to Radio Frequency Interference (RFI). In this work, we introduce FRTSearch, an end-to-end framework unifying the detection and physical characterization of Fast Radio Transients (FRTs). Leveraging the morphological universality of dispersive trajectories in time-frequency dynamic spectra, we reframe FRT detection as a pattern recognition problem governed by the cold plasma dispersion relation. To facilitate this, we constructed CRAFTS-FRT, a pixel-level annotated dataset derived from the Commensal Radio Astronomy FAST Survey (CRAFTS), comprising 2,392 instances across diverse source classes. This dataset enables the training of a Mask R-CNN model for precise trajectory segmentation. Coupled with our physics-driven IMPIC algorithm, the framework maps the geometric coordinates of segmented trajectories to directly infer the Dispersion Measure (DM) and Time of Arrival (ToA). Benchmarking on the FAST-FREX dataset shows that FRTSearch achieves a 98.0% recall, competitive with exhaustive search methods, while reducing false positives by over 99.9% compared to PRESTO and delivering a processing speedup of up to 13.9×. Furthermore, the framework demonstrates robust cross-facility generalization, detecting all 19 tested FRBs from the ASKAP survey without retraining. By shifting the paradigm from "search-then-identify" to "detect-and-infer", FRTSearch provides a scalable, high-precision solution for real-time discovery in the era of petabyte-scale radio astronomy.
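The cold plasma dispersion relation the abstract invokes gives a deterministic frequency-dependent delay, which is what lets trajectory coordinates be mapped to physical parameters. A minimal sketch of the standard relation (the function names are illustrative; the paper's actual IMPIC fitting over a full segmented trajectory is more involved than this two-point inversion):

```python
# Standard dispersion constant: delay in ms per (pc cm^-3),
# with frequencies expressed in GHz.
K_DM = 4.148808

def dispersion_delay_ms(dm, f_lo_ghz, f_hi_ghz):
    """Arrival-time delay of the low-frequency edge relative to the
    high-frequency edge for a given dispersion measure (DM)."""
    return K_DM * dm * (f_lo_ghz**-2 - f_hi_ghz**-2)

def dm_from_two_points(t_lo_ms, f_lo_ghz, t_hi_ms, f_hi_ghz):
    """Invert the relation: infer DM from two (time, frequency)
    points on a segmented trajectory."""
    return (t_lo_ms - t_hi_ms) / (K_DM * (f_lo_ghz**-2 - f_hi_ghz**-2))

# A burst with DM = 500 pc cm^-3 observed across a 1.0-1.4 GHz band:
delay = dispersion_delay_ms(500.0, 1.0, 1.4)
recovered_dm = dm_from_two_points(delay, 1.0, 0.0, 1.4)
```

In practice IMPIC would fit many pixels of the segmented mask rather than two points, making the DM and ToA estimates robust to segmentation noise.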

[AI-115] Towards grounded autonomous research: an end-to-end LLM mini research loop on published computational physics

[Quick Read]: This paper addresses the challenge of achieving a closed autonomous research loop with generative AI in real-world physical science, i.e., completing a full cycle of reading, reproducing, critiquing, and extending the literature without human intervention. The key to the solution is an executable read-plan-compute-compare mini research loop, validated along two complementary dimensions: at scale, across 111 open-access computational physics papers, the agent autonomously surfaces substantive concerns on roughly 42% of papers (97.7% of which require actual execution to uncover); in depth, for one Nature Communications paper on multiscale simulation of a 2D-material MOSFET, the agent independently runs new calculations missing from the original and, unsupervised, produces a publishable Comment (composed, figured, typeset, and PDF-iterated) that revises the paper's headline conclusion. This moves beyond text understanding and simple reasoning toward genuinely autonomous scientific discovery and critical improvement.

Link: https://arxiv.org/abs/2604.12198
Authors: Haonan Huang
Affiliations: Unknown
Subjects: Computational Physics (physics.comp-ph); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent autonomous LLM agents have demonstrated end-to-end automation of machine-learning research. Real-world physical science is intrinsically harder, requiring deep reasoning bounded by physical truth and, because real systems are too complex to study in isolation, almost always built on existing literature. We focus on the smallest meaningful unit of such research, a mini research loop in which an agent reads a paper, reproduces it, critiques it, and extends it. We test this loop in two complementary regimes: scale and depth. At scale, across 111 open-access computational physics papers, an agent autonomously runs the read-plan-compute-compare loop and, without being asked to critique, raises substantive concerns on ~42% of papers - 97.7% of which require execution to surface. In depth, for one Nature Communications paper on multiscale simulation of a 2D-material MOSFET, the agent runs new calculations missing from the original and produces, unsupervised, a publishable Comment – composed, figured, typeset, and PDF-iterated – that revises the paper’s headline conclusion.

[AI-116] Observing the unobserved confounding through its effects: toward randomized trial-like estimates from real-world survival data

[Quick Read]: This paper addresses biased treatment-effect estimation in observational survival data caused by unobserved confounding. The key to the solution is a three-step framework: first, a latent prognostic factor (U) is inferred from restricted mean survival time (RMST) discrepancies among patients with similar observed covariates and the same treatment but divergent outcomes; second, U is balanced together with observed baseline covariates via prognostic matching, entropy balancing, or inverse probability of treatment weighting; third, multivariable survival analysis is applied to the adjusted data to estimate hazard ratios (HRs). The method is validated on multiple real-world cohorts and RCT benchmarks, substantially improving the accuracy and consistency of treatment-effect estimates from observational data.

Link: https://arxiv.org/abs/2604.12137
Authors: Vasiliki Stoumpou, Dimitris Bertsimas, Samuel Singer, Georgios Antonios Margonis
Affiliations: Unknown
Subjects: Applications (stat.AP); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments:

Abstract:Background: Randomized controlled trials (RCTs) are costly, time-consuming, and often infeasible, while treatment-effect estimation from observational data is limited by unobserved confounding. Methods: We developed a three-step framework to address unobserved confounding in observational survival data. First, we infer a latent prognostic factor (U) from restricted mean survival time (RMST) discrepancies between patients with similar observed factors, the same treatment, and divergent outcomes, leveraging the idea that the aggregate effect of unmeasured factors can be inferred even if individual factors cannot. Second, we balance U with observed baseline covariates using prognostic matching, entropy balancing, or inverse probability of treatment weighting. Third, we apply multivariable survival analysis to estimate hazard ratios (HRs). We evaluated the framework in three observational cohorts with RCT benchmarks, two RCT cohorts, and six multicenter observational cohorts. Results: In three observational cohorts (nine comparisons), balancing U improved agreement with trial HRs in all cases; in the strongest settings, it reduced absolute log-HR error by approximately ten-fold versus using observed covariates alone (mean reduction 0.344; p=0.001). In two RCT cohorts, U was balanced across arms (most SMDs < 0.1) and adjustment had minimal impact on log-HRs (mean absolute change 0.08). Across six multicenter cohorts, balancing U within centers reduced cross-center dispersion in chemotherapy log-HR estimates (mean reduction 0.147; p=0.016); when populations were directly balanced across centers to account for case-mix differences, cross-center survival differences were narrowed in 75%-100% of comparisons. Conclusions: Inferring and balancing a latent prognostic signal may reduce unobserved confounding and improve treatment-effect estimation from real-world data.
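Restricted mean survival time, the quantity step one is built on, is the area under the survival curve up to a horizon tau. A minimal sketch for fully observed (uncensored) event times, where RMST reduces to the mean of min(T, tau); real use with censoring would instead integrate a Kaplan-Meier estimate (the function name is illustrative):

```python
def rmst_uncensored(event_times, tau):
    """Restricted mean survival time up to horizon tau, assuming no
    censoring: E[min(T, tau)]. Illustrative only; with censoring one
    would integrate a Kaplan-Meier curve instead."""
    return sum(min(t, tau) for t in event_times) / len(event_times)

# Two groups with similar covariates and the same treatment but
# divergent outcomes -- the RMST gap between them is the raw signal
# from which the latent prognostic factor U is inferred.
good_outcomes = [60, 72, 84, 96]   # survival times in months
poor_outcomes = [6, 12, 18, 24]
rmst_gap = rmst_uncensored(good_outcomes, 60) - rmst_uncensored(poor_outcomes, 60)
```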

[AI-117] Evaluating the Limitations of Protein Sequence Representations for Parkinson's Disease Classification

[Quick Read]: This paper addresses the difficulty of identifying molecular biomarkers for Parkinson's disease (PD), specifically evaluating whether protein primary sequences alone carry enough discriminative power for reliable disease classification. The key contribution is a controlled, leakage-free evaluation framework using nested stratified cross-validation to fairly compare sequence-only representations, including amino acid composition, k-mers, physicochemical descriptors, hybrid representations, and embeddings from protein language models (PLMs). The results show that even the best configuration (ProtBERT + MLP) reaches only moderate performance (F1-score: 0.704 ± 0.028) and that differences across representations are not significant (Friedman test, p = 0.1749), indicating that primary sequence alone is insufficient to separate PD cases from controls and suggesting that richer structural, functional, or interaction-based features are needed for robust disease modeling.

Link: https://arxiv.org/abs/2604.11852
Authors: César Jesús Núñez-Prado, Grigori Sidorov, Liliana Chanona-Hernández
Affiliations: Unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 36 pages, 10 figures, 9 tables

Abstract:The identification of reliable molecular biomarkers for Parkinson’s disease remains challenging due to its multifactorial nature. Although protein sequences constitute a fundamental and widely available source of biological information, their standalone discriminative capacity for complex disease classification remains unclear. In this work, we present a controlled and leakage-free evaluation of multiple representations derived exclusively from protein primary sequences, including amino acid composition, k-mers, physicochemical descriptors, hybrid representations, and embeddings from protein language models, all assessed under a nested stratified cross-validation framework to ensure unbiased performance estimation. The best-performing configuration (ProtBERT + MLP) achieves an F1-score of 0.704 +/- 0.028 and ROC-AUC of 0.748 +/- 0.047, indicating only moderate discriminative performance. Classical representations such as k-mers reach comparable F1 values (up to approximately 0.667), but exhibit highly imbalanced behavior, with recall close to 0.98 and precision around 0.50, reflecting a strong bias toward positive predictions. Across representations, performance differences remain within a narrow range (F1 between 0.60 and 0.70), while unsupervised analyses reveal no intrinsic structure aligned with class labels, and statistical testing (Friedman test, p = 0.1749) does not indicate significant differences across models. These results demonstrate substantial overlap between classes and indicate that primary sequence information alone provides limited discriminative power for Parkinson’s disease classification. This work establishes a reproducible baseline and provides empirical evidence that more informative biological features, such as structural, functional, or interaction-based descriptors, are required for robust disease modeling.
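As an illustration of one of the classical representations benchmarked, a k-mer feature vector counts overlapping length-k substrings of the primary sequence (a minimal sketch; the paper's exact k values and normalization are not specified in the abstract):

```python
from collections import Counter

def kmer_features(sequence, k=2):
    """Normalized counts of overlapping k-mers in a protein sequence,
    a classical sequence-only representation for classifiers."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    total = len(kmers)
    counts = Counter(kmers)
    return {kmer: c / total for kmer, c in counts.items()}

# Toy 9-residue sequence (hypothetical, for illustration only):
feats = kmer_features("MKVLAAKVL", k=3)
```

Such sparse count vectors are what the SVM/MLP-style classifiers in the study consume; the abstract's point is that even richer PLM embeddings barely outperform them.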

Machine Learning

[LG-0] CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations

Link: https://arxiv.org/abs/2604.13024
Authors: Benzhao Tang, Shiyu Yang
Subjects: Machine Learning (cs.LG); Databases (cs.DB)
Comments:

Abstract:The explosive growth of system logs makes streaming compression essential, yet existing log anomaly detection (LAD) methods incur severe pre-processing overhead by requiring full decompression and parsing. We introduce CLAD, the first deep learning framework to perform LAD directly on compressed byte streams. CLAD bypasses these bottlenecks by exploiting a key insight: normal logs compress into regular byte patterns, while anomalies systematically disrupt them. To extract these multi-scale deviations from opaque bytes, we propose a purpose-built architecture integrating a dilated convolutional byte encoder, a hybrid Transformer–mLSTM, and four-way aggregation pooling. This is coupled with a two-stage training strategy of masked pre-training and focal-contrastive fine-tuning to effectively handle severe class imbalance. Evaluated across five datasets, CLAD achieves a state-of-the-art average F1-score of 0.9909 and outperforms the best baseline by 2.72 percentage points. It delivers superior accuracy while completely eliminating decompression and parsing overheads, offering a robust solution that generalizes to structured streaming compressors.
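The key insight — normal logs compress into regular byte patterns while anomalies disrupt them — can be illustrated with a crude compression-ratio signal. This is intuition only, not CLAD's learned architecture, and the function and variable names are illustrative:

```python
import zlib

def compressed_ratio(lines):
    """Bytes-out / bytes-in after DEFLATE. Repetitive (normal) log
    streams compress far better than irregular (anomalous) ones,
    so the ratio itself already carries a weak anomaly signal."""
    raw = "\n".join(lines).encode()
    return len(zlib.compress(raw, level=9)) / len(raw)

# Highly regular stream vs. one full of unpredictable byte patterns:
normal = ["INFO worker-1 heartbeat ok"] * 200
anomalous = [f"ERR trace 0x{i * 2654435761 % 2**32:08x} fault" for i in range(200)]

r_normal = compressed_ratio(normal)
r_anom = compressed_ratio(anomalous)
```

CLAD goes much further, learning multi-scale deviations directly from the compressed bytes, but the ordering `r_normal < r_anom` is the regularity it exploits.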

[LG-1] An Optimal Sauer Lemma Over k-ary Alphabets

Link: https://arxiv.org/abs/2604.12952
Authors: Steve Hanneke, Qinglin Meng, Shay Moran, Amirreza Shaeiri
Subjects: Machine Learning (cs.LG); Combinatorics (math.CO); Machine Learning (stat.ML)
Comments: 38 pages

Abstract:The Sauer-Shelah-Perles Lemma is a cornerstone of combinatorics and learning theory, bounding the size of a binary hypothesis class in terms of its Vapnik-Chervonenkis (VC) dimension. For classes of functions over a k-ary alphabet, namely the multiclass setting, the Natarajan dimension has long served as an analogue of the VC dimension, yet the corresponding Sauer-type bounds are suboptimal for alphabet sizes k > 2. In this work, we establish a sharp Sauer inequality for multiclass and list prediction. Our bound is expressed in terms of the Daniely–Shalev-Shwartz (DS) dimension, and more generally with its extension, the list-DS dimension – the combinatorial parameters that characterize multiclass and list PAC learnability. Our bound is tight for every alphabet size k, list size ℓ, and dimension value, replacing the exponential dependence on ℓ in the Natarajan-based bound by the optimal polynomial dependence, and improving the dependence on k as well. Our proof uses the polynomial method. In contrast to the classical VC case, where several direct combinatorial proofs are known, we are not aware of any purely combinatorial proof in the DS setting. This motivates several directions for future research, which are discussed in the paper. As consequences, we obtain improved sample complexity upper bounds for list PAC learning and for uniform convergence of list predictors, sharpening the recent results of Charikar et al. (STOC 2023), Hanneke et al. (COLT 2024), and Brukhim et al. (NeurIPS 2024).
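For context, the classical binary statement being generalized is the Sauer-Shelah-Perles bound: a class \(\mathcal{C} \subseteq \{0,1\}^n\) of VC dimension \(d\) satisfies

```latex
|\mathcal{C}| \;\le\; \sum_{i=0}^{d} \binom{n}{i} \;=\; O(n^{d}).
```

The paper's result is the tight analogue over a k-ary alphabet, with the (list-)DS dimension playing the role of d.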

[LG-2] The Verification Tax: Fundamental Limits of AI Auditing in the Rare-Error Regime

Link: https://arxiv.org/abs/2604.12951
Authors: Jason Z Wang
Subjects: Machine Learning (cs.LG)
Comments: 25 pages, 16 figures, 6 tables. Code and data at this https URL

Abstract:The most cited calibration result in deep learning – post-temperature-scaling ECE of 0.012 on CIFAR-100 (Guo et al., 2017) – is below the statistical noise floor. We prove this is not a failure of the experiment but a law: the minimax rate for estimating calibration error with model error rate ε is Θ((Lε/m)^{1/3}), and no estimator can beat it. This "verification tax" implies that as AI models improve, verifying their calibration becomes fundamentally harder – with the same exponent in opposite directions. We establish four results that contradict standard evaluation practice: (1) self-evaluation without labels provides exactly zero information about calibration, bounded by a constant independent of compute; (2) a sharp phase transition at mε ≈ 1 below which miscalibration is undetectable; (3) active querying eliminates the Lipschitz constant, collapsing estimation to detection; (4) verification cost grows exponentially with pipeline depth at rate L^K. We validate across five benchmarks (MMLU, TruthfulQA, ARC-Challenge, HellaSwag, WinoGrande; ~27,000 items) with 6 LLMs from 5 families (8B-405B parameters, 27 benchmark-model pairs with logprob-based confidence), 95% bootstrap CIs, and permutation tests. Self-evaluation non-significance holds in 80% of pairs. Across frontier models, 23% of pairwise comparisons are indistinguishable from noise, implying that credible calibration claims must report verification floors and prioritize active querying once gains approach benchmark resolution.
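The headline rate implies a concrete sample-size floor: if calibration error can only be estimated to accuracy about (Lε/m)^{1/3}, then certifying calibration at resolution c requires on the order of m = Lε/c³ labeled examples. A sketch of that arithmetic (taking L = 1 and the example numbers as illustrative assumptions, up to the constants hidden in the Θ):

```python
def labels_needed(error_rate, target_resolution, lipschitz=1.0):
    """Invert the minimax rate (L*eps/m)^(1/3) = c  =>  m = L*eps/c^3.
    Constants hidden by the Theta(.) are ignored, so this is an
    order-of-magnitude floor, not an exact requirement."""
    return lipschitz * error_rate / target_resolution**3

# Resolving ECE at the 0.012 level cited from Guo et al. (2017),
# for a hypothetical model with a 5% error rate:
m = labels_needed(error_rate=0.05, target_resolution=0.012)
```

The cubic dependence is the "tax": halving the target resolution multiplies the required labels by eight.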

[LG-3] Parcae: Scaling Laws For Stable Looped Language Models

Link: https://arxiv.org/abs/2604.12946
Authors: Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, Daniel Y. Fu
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Traditional fixed-depth architectures scale quality by increasing training FLOPs, typically through increased parameterization, at the expense of a higher memory footprint, or data. A potential alternative is looped architectures, which instead increase FLOPs by sending activations through a block of layers in a loop. While promising, existing recipes for training looped architectures can be unstable, suffering from residual explosion and loss spikes. We address these challenges by recasting looping as a nonlinear time-variant dynamical system over the residual stream. Via a linear approximation to this system, we find that instability occurs in existing looped architectures as a result of large spectral norms in their injection parameters. To address these instability issues, we propose Parcae, a novel stable, looped architecture that constrains the spectral norm of the injection parameters via discretization of a negative diagonal parameterization. As a result, Parcae achieves up to 6.3% lower validation perplexity than prior large-scale looped models. Using our stable looped architecture, we investigate the scaling properties of looping as a medium to improve quality by increasing FLOPs in training and test-time. For training, we derive predictable power laws to scale FLOPs while keeping parameter count fixed. Our initial scaling laws suggest that looping and data should be increased in tandem, given a fixed FLOP budget. At test-time, we find that Parcae can use looping to scale compute, following a predictable, saturating exponential decay. When scaled up to 1.3B parameters, we find that Parcae improves CORE and Core-Extended quality by 2.99 and 1.18 points when compared to strong Transformer baselines under a fixed parameter and data budget, achieving a relative quality of up to 87.5% of a Transformer twice the size.
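The basic computational pattern — reusing one block of weights for several loop iterations while re-injecting the input — can be sketched in a few lines. This toy scalar version only illustrates weight sharing and input injection; it is not the Parcae parameterization, whose contribution is precisely the stability constraint on the injection parameters:

```python
def looped_forward(x, w_block, w_inject, n_loops):
    """Toy looped 'architecture': one shared affine block applied
    n_loops times over the residual stream h, with the original input
    re-injected each iteration. FLOPs grow with n_loops while the
    parameter count (w_block, w_inject) stays fixed."""
    h = x
    for _ in range(n_loops):
        h = h + w_block * h + w_inject * x  # same weights every loop
    return h

# Same two parameters, increasing test-time compute:
shallow = looped_forward(1.0, w_block=-0.5, w_inject=0.25, n_loops=1)
deep = looped_forward(1.0, w_block=-0.5, w_inject=0.25, n_loops=8)
```

With a contractive block (|1 + w_block| < 1) extra loops converge toward a fixed point, loosely mirroring the saturating test-time scaling the abstract reports; with an expansive one, iterates blow up, which is the residual-explosion failure mode Parcae is designed to prevent.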

[LG-4] Frequency-aware Decomposition Learning for Sensorless Wrench Forecasting on a Vibration-rich Hydraulic Manipulator

Link: https://arxiv.org/abs/2604.12905
Authors: Hyeonbeen Lee, Min-Jae Jung, Tae-Kyeong Yeu, Jong-Boo Han, Daegil Park, Jin-Gyun Kim
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures, submitted to IEEE/ASME Transactions on Mechatronics

Abstract:Force and torque (F/T) sensing is critical for robot-environment interaction, but physical F/T sensors impose constraints in size, cost, and fragility. To mitigate this, recent studies have estimated force/wrench sensorlessly from robot internal states. While existing methods generally target relatively slow interactions, tasks involving rapid interactions, such as grinding, can induce task-critical high-frequency vibrations, and estimation in such robotic settings remains underexplored. To address this gap, we propose a Frequency-aware Decomposition Network (FDN) for short-term forecasting of vibration-rich wrench from proprioceptive history. FDN predicts spectrally decomposed wrench with asymmetric deterministic and probabilistic heads, modeling the high-frequency residual as a learned conditional distribution. It further incorporates frequency-awareness to adaptively enhance input spectra with learned filtering and impose a frequency-band prior on the outputs. We pretrain FDN on a large-scale open-source robot dataset and transfer the learned proprioception-to-wrench representation to the downstream. On real-world grinding excavation data from a 6-DoF hydraulic manipulator and under a delayed estimation setting, FDN outperforms baseline estimators and forecasters in the high-frequency band and remains competitive in the low-frequency band. Transfer learning provides additional gains, suggesting the potential of large-scale pretraining and transfer learning for robotic wrench estimation. Code and data will be made available upon acceptance.

[LG-5] TCL: Enabling Fast and Efficient Cross-Hardware Tensor Program Optimization via Continual Learning

Link: https://arxiv.org/abs/2604.12891
Authors: Chaoyao Shen, Linfeng Jiang, Yixian Shen, Tao Xu, Guoqing Li, Anuj Pathania, Andy D. Pimentel, Meng Zhang
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments: introduces TCL framework for cross-hardware tensor program optimization with active learning, Mamba-based cost model, and continual knowledge distillation; includes extensive experiments on CPU and GPU platforms

Abstract:Deep learning (DL) compilers rely on cost models and auto-tuning to optimize tensor programs for target hardware. However, existing approaches depend on large offline datasets, incurring high collection costs and offering suboptimal transferability across platforms. In this paper, we introduce TCL, a novel efficient and transferable compiler framework for fast tensor program optimization across diverse hardware platforms to address these challenges. Specifically, TCL is built on three core enablers: (1) the RDU Sampler, a data-efficient active learning strategy that selects only 10% of tensor programs by jointly optimizing Representativeness, Diversity, and Uncertainty, substantially reducing data collection costs while maintaining near-original model accuracy; (2) a new Mamba-based cost model that efficiently captures long-range schedule dependencies while achieving a favorable trade-off between prediction accuracy and computational cost through reduced parameterization and lightweight sequence modeling; and (3) a continuous knowledge distillation framework that effectively and progressively transfers knowledge across multiple hardware platforms while avoiding the parameter explosion and data dependency issues typically caused by traditional multi-task learning. Extensive experiments validate the effectiveness of each individual enabler and the holistic TCL framework. When optimizing a range of mainstream DL models on both CPU and GPU platforms, TCL achieves, on average, 16.8x and 12.48x faster tuning time, and 1.20x and 1.13x lower inference latency, respectively, compared to Tenset-MLP.
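The RDU Sampler's joint objective is not spelled out in the abstract. A hedged sketch of one plausible greedy form — score each candidate tensor program by a weighted sum of representativeness, diversity (distance to already-selected items), and cost-model uncertainty, then keep the top 10% — is below; the weights, the distance choice, and all field names are assumptions:

```python
def rdu_select(candidates, budget_frac=0.1, w=(1.0, 1.0, 1.0)):
    """Greedy active-learning selection. Each candidate is a dict with
    'repr' (representativeness), 'unc' (cost-model uncertainty), and a
    1-D 'feat' used for diversity. Illustrative stand-in for TCL's
    RDU Sampler, not its actual formulation."""
    budget = max(1, int(len(candidates) * budget_frac))
    selected = []
    pool = list(candidates)
    for _ in range(budget):
        def score(c):
            # Diversity: distance to the nearest already-selected item.
            div = min((abs(c["feat"] - s["feat"]) for s in selected),
                      default=1.0)
            return w[0] * c["repr"] + w[1] * div + w[2] * c["unc"]
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best)
    return selected

# 20 hypothetical candidates; a 10% budget keeps only 2 of them.
cands = [{"feat": i / 10, "repr": 0.5, "unc": 0.1 * (i % 3)} for i in range(20)]
chosen = rdu_select(cands, budget_frac=0.1)
```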

[LG-6] Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory ICLR2026

Link: https://arxiv.org/abs/2604.12817
Authors: Shaopeng Fu, Di Wang
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Comments: The Fourteenth International Conference on Learning Representations (ICLR 2026)

Abstract:Adversarial training (AT) is an effective defense for large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To improve the efficiency of AT for LLMs, recent studies propose continuous AT (CAT) that searches for adversarial inputs within the continuous embedding space of LLMs during AT. While CAT has achieved empirical success, its underlying mechanism, i.e., why adversarial perturbations in the embedding space can help LLMs defend against jailbreak prompts synthesized in the input token space, remains unknown. This paper presents the first theoretical analysis of CAT on LLMs based on in-context learning (ICL) theory. For linear transformers trained with adversarial examples from the embedding space on in-context linear regression tasks, we prove a robust generalization bound that has a negative correlation with the perturbation radius in the embedding space. This clearly explains why CAT can defend against jailbreak prompts from the LLM’s token space. Further, the robust bound shows that the robustness of an adversarially trained LLM is closely related to the singular values of its embedding matrix. Based on this, we propose to improve LLM CAT by introducing an additional regularization term, which depends on singular values of the LLM’s embedding matrix, into the objective function of CAT. Experiments on real-world LLMs demonstrate that our method can help LLMs achieve a better jailbreak robustness-utility tradeoff. The code is available at this https URL.
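The proposed regularizer depends on the singular values of the embedding matrix, but the abstract does not give its exact form; the penalty below (a weighted sum of squared singular values, i.e. the squared Frobenius norm) is an illustrative stand-in, not the paper's term:

```python
import numpy as np

def embedding_spectral_penalty(embedding, weight=1e-4):
    """Illustrative regularization term on the singular values of an
    LLM embedding matrix: weight * sum(sigma_i^2). The paper adds a
    singular-value-dependent term to the CAT objective; its exact
    functional form is not specified in the abstract."""
    sigma = np.linalg.svd(embedding, compute_uv=False)
    return weight * float(np.sum(sigma**2))

# Tiny 2x2 "embedding matrix" whose singular values are 3 and 4:
E = np.array([[3.0, 0.0], [0.0, 4.0]])
penalty = embedding_spectral_penalty(E, weight=1.0)
```

In training, such a term would be added to the CAT loss so that gradients shrink the embedding spectrum, consistent with the robust bound's dependence on those singular values.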

[LG-7] Interpretable Relational Inference with LLM -Guided Symbolic Dynamics Modeling

Link: https://arxiv.org/abs/2604.12806
Authors: Xiaoxiao Liang, Juyuan Zhang, Liming Pan, Linyuan Lü
Subjects: Machine Learning (cs.LG)
Comments: Submitted to conference

Abstract:Inferring latent interaction structures from observed dynamics is a fundamental inverse problem in many-body interacting systems. Most neural approaches rely on black-box surrogates over trainable graphs, achieving accuracy at the expense of mechanistic interpretability. Symbolic regression offers explicit dynamical equations and stronger inductive biases, but typically assumes known topology and a fixed function library. We propose COSINE (Co-Optimization of Symbolic Interactions and Network Edges), a differentiable framework that jointly discovers interaction graphs and sparse symbolic dynamics. To overcome the limitations of fixed symbolic libraries, COSINE further incorporates an outer-loop large language model that adaptively prunes and expands the hypothesis space using feedback from the inner optimization loop. Experiments on synthetic systems and large-scale real-world epidemic data demonstrate robust structural recovery and compact, mechanism-aligned dynamical expressions. Code: this https URL.

[LG-8] Rethinking the Personalized Relaxed Initialization in the Federated Learning: Consistency and Generalization

Link: https://arxiv.org/abs/2604.12768
Authors: Li Shen, Yan Sun, Dacheng Tao
Subjects: Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2306.05706

Abstract:Federated learning (FL) is a distributed paradigm that coordinates massive numbers of local clients to collaboratively train a global model via stage-wise local training processes on heterogeneous datasets. Previous works have observed that FL suffers from the "client-drift" problem, which is caused by the inconsistent optima across local clients. However, a solid theoretical analysis explaining the impact of this local inconsistency is still lacking. To alleviate the negative impact of "client drift" and explore its substance in FL, in this paper we first propose an efficient FL algorithm, FedInit, which employs a personalized relaxed initialization state at the beginning of each local training stage. Specifically, FedInit initializes the local state by moving away from the current global state in the reverse direction of the latest local state. Moreover, to further understand how inconsistency disrupts performance in FL, we introduce an excess risk analysis and study the divergence term to investigate the test error in FL. Our studies show that the optimization error is not sensitive to this local inconsistency, which instead mainly affects the generalization error bound. Extensive experiments are conducted to validate this efficiency. The proposed FedInit method achieves results comparable to several advanced benchmarks without any additional training or communication costs. Meanwhile, the stage-wise personalized relaxed initialization can also be incorporated into several current advanced algorithms to achieve higher generalization performance in the FL paradigm.
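The relaxed initialization rule — start each local stage by moving the global state away from the latest local state — can be written in one line per parameter. A minimal sketch with a scalar relaxation coefficient beta (the symbol beta and this exact update form are assumptions consistent with the abstract's description, not the paper's precise equation):

```python
def fedinit_local_init(w_global, w_local_prev, beta=0.1):
    """Personalized relaxed initialization: start local training from
    the global state pushed in the reverse direction of the client's
    latest local state, i.e. w0 = w_global + beta*(w_global - w_local_prev).
    With beta = 0 this reduces to vanilla FedAvg initialization."""
    return [g + beta * (g - l) for g, l in zip(w_global, w_local_prev)]

w0 = fedinit_local_init([1.0, 2.0], [0.5, 3.0], beta=0.1)
```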

[LG-9] Stress Detection Using Wearable Physiological and Sociometric Sensors

Link: https://arxiv.org/abs/2604.12746
Authors: Oscar Martinez Mozos, Virginia Sandulescu, Sally Andrews, David Ellis, Nicola Bellotto, Radu Dobrescu, Jose Manuel Ferrandez
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: This is the accepted manuscript of the article published in International Journal of Neural Systems, 27, 2, 2017. The Version of Record is available at DOI: https://doi.org/10.1142/S0129065716500416

Abstract:Stress remains a significant social problem for individuals in modern societies. This paper presents a machine learning approach for the automatic detection of stress in people in a social situation by combining two sensor systems that capture physiological and social responses. We compare the performance of different classifiers, including support vector machines, AdaBoost, and k-nearest neighbors. Our experimental results show that by combining the measurements from both sensor systems, we could accurately discriminate between stressful and neutral situations during a controlled Trier Social Stress Test (TSST). Moreover, this paper assesses the discriminative ability of each sensor modality individually and considers their suitability for real-time stress detection. Finally, we present a study of the most discriminative features for stress detection.

[LG-10] Evaluating Differential Privacy Against Membership Inference in Federated Learning: Insights from the NIST Genomics Red Team Challenge

Link: https://arxiv.org/abs/2604.12737
Authors: Gustavo de Carvalho Bertoli
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 21 pages

Abstract:While Federated Learning (FL) mitigates direct data exposure, the resulting trained models remain susceptible to membership inference attacks (MIAs). This paper presents an empirical evaluation of Differential Privacy (DP) as a defense mechanism against MIAs in FL, leveraging the environment of the 2025 NIST Genomics Privacy-Preserving Federated Learning (PPFL) Red Teaming Event. To improve inference accuracy, we propose a stacking attack strategy that ensembles seven black-box estimators to train a meta-classifier on prediction probabilities and cross-entropy losses. We evaluate this methodology against target models under three privacy configurations: an unprotected convolutional neural network (CNN, ε = ∞), a low-privacy DP model (ε = 200), and a high-privacy DP model (ε = 10). The attack outperforms all baselines in the No DP and Low Privacy settings and, critically, maintains measurable membership leakage at ε = 200, where a single-signal LiRA baseline collapses. Evaluated on an independent third-party benchmark, these results provide an empirical characterisation of how stacking-based inference degrades across calibrated DP tiers in FL.
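The stacking attack trains a meta-classifier on per-example prediction probabilities and cross-entropy losses from several base estimators. A sketch of the meta-feature construction step only (the meta-classifier itself and the seven actual estimators are omitted; all names are illustrative):

```python
import math

def cross_entropy(prob_true_class, eps=1e-12):
    """Per-example loss on the true class; training-set members tend
    to have lower loss than non-members, which is the signal MIAs use."""
    return -math.log(max(prob_true_class, eps))

def meta_features(base_probs):
    """Stack each base estimator's confidence for the true class plus
    the corresponding cross-entropy loss into one meta-feature row,
    ready to feed a stacking meta-classifier."""
    row = []
    for p in base_probs:
        row.append(p)
        row.append(cross_entropy(p))
    return row

# True-class confidences from three hypothetical base attacks:
row = meta_features([0.9, 0.7, 0.95])
```

The meta-classifier (e.g. a logistic model) is then fit on such rows labeled member/non-member, letting weak, partially redundant signals reinforce each other.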

[LG-11] Transformer Based Machine Fault Detection From Audio Input

链接: https://arxiv.org/abs/2604.12733
作者: Kiran Voderhobli Holla
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, Sound AI is being increasingly used to predict machine failures. By attaching a microphone to the machine of interest, one can get real time data on machine behavior from the field. Traditionally, Convolutional Neural Net (CNN) architectures have been used to analyze spectrogram images generated from the sounds captured and predict if the machine is functioning as expected. CNN architectures seem to work well empirically even though they have biases like locality and parameter-sharing which may not be completely relevant for spectrogram analysis. With the successful application of transformer-based models in the field of image processing starting with Vision Transformer (ViT) in 2020, there has been significant interest in leveraging these in the field of Sound AI. Since transformer-based architectures have significantly lower inductive biases, they are expected to perform better than CNNs at spectrogram analysis given enough data. This paper demonstrates the effectiveness of transformer-driven architectures in analyzing Sound data and compares the embeddings they generate with CNNs on the specific task of machine fault detection.

[LG-12] Monte Carlo Stochastic Depth for Uncertainty Estimation in Deep Learning CVPR2026

链接: https://arxiv.org/abs/2604.12719
作者: Adam T. Müller,Tobias Rögelein,Nicolaj C. Stache
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to the 8th Safe Artificial Intelligence for All Domains (SAIAD) workshop at IEEE/CVF CVPR 2026

点击查看摘要

Abstract:The deployment of deep neural networks in safety-critical systems necessitates reliable and efficient uncertainty quantification (UQ). A practical and widespread strategy for UQ is repurposing stochastic regularizers as scalable approximate Bayesian inference methods, such as Monte Carlo Dropout (MCD) and MC-DropBlock (MCDB). However, this paradigm remains under-explored for Stochastic Depth (SD), a regularizer integral to the residual-based backbones of most modern architectures. While prior work demonstrated its empirical promise for segmentation, a formal theoretical connection to Bayesian variational inference and a benchmark on complex, multi-task problems like object detection are missing. In this paper, we first provide theoretical insights connecting Monte Carlo Stochastic Depth (MCSD) to principled approximate variational inference. We then present the first comprehensive empirical benchmark of MCSD against MCD and MCDB on state-of-the-art detectors (YOLO, RT-DETR) using the COCO and COCO-O datasets. Our results position MCSD as a robust and computationally efficient method that achieves highly competitive predictive accuracy (mAP), notably yielding slight improvements in calibration (ECE) and uncertainty ranking (AUARC) compared to MCD. We thus establish MCSD as a theoretically-grounded and empirically-validated tool for efficient Bayesian approximation in modern deep learning.
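
A toy illustration of the Monte Carlo Stochastic Depth idea: keep the stochastic skipping of residual branches active at inference and read uncertainty from the spread of repeated passes. The scalar "network" below is a deliberate simplification, not the paper's estimator:

```python
import random, statistics

def residual_net(x, blocks, survival_p, rng):
    """Toy residual 'network': each block adds f(x) unless its whole
    residual branch is stochastically dropped (stochastic depth)."""
    for f in blocks:
        if rng.random() < survival_p:
            x = x + f(x)
    return x

def mc_stochastic_depth(x, blocks, survival_p=0.8, passes=200, seed=0):
    """Aggregate several stochastic forward passes; the spread of the
    outputs serves as the uncertainty estimate (a sketch of the idea)."""
    rng = random.Random(seed)
    outs = [residual_net(x, blocks, survival_p, rng) for _ in range(passes)]
    return statistics.mean(outs), statistics.pstdev(outs)

blocks = [lambda v: 0.1 * v, lambda v: 0.05 * v]  # two tiny residual branches
mean, std = mc_stochastic_depth(1.0, blocks)
print(f"predictive mean={mean:.3f}, uncertainty={std:.3f}")
```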

[LG-13] FeaXDrive: Feasibility-aware Trajectory-Centric Diffusion Planning for End-to-End Autonomous Driving

链接: https://arxiv.org/abs/2604.12656
作者: Baoyun Wang,Zhuoren Li,Ming Liu,Xinrui Zhang,Bo Leng,Lu Xiong
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 21 pages, 6 figures

点击查看摘要

Abstract:End-to-end diffusion planning has shown strong potential for autonomous driving, but the physical feasibility of generated trajectories remains insufficiently addressed. In particular, generated trajectories may exhibit local geometric irregularities, violate trajectory-level kinematic constraints, or deviate from the drivable area, indicating that the commonly used noise-centric formulation in diffusion planning is not yet well aligned with the trajectory space where feasibility is more naturally characterized. To address this issue, we propose FeaXDrive, a feasibility-aware trajectory-centric diffusion planning method for end-to-end autonomous driving. The core idea is to treat the clean trajectory as the unified object for feasibility-aware modeling throughout the diffusion process. Built on this trajectory-centric formulation, FeaXDrive integrates adaptive curvature-constrained training to improve intrinsic geometric and kinematic feasibility, drivable-area guidance within reverse diffusion sampling to enhance consistency with the drivable area, and feasibility-aware GRPO post-training to further improve planning performance while balancing trajectory-space feasibility. Experiments on the NAVSIM benchmark show that FeaXDrive achieves strong closed-loop planning performance while substantially improving trajectory-space feasibility. These findings highlight the importance of explicitly modeling trajectory-space feasibility in end-to-end diffusion planning and provide a step toward more reliable and physically grounded autonomous driving planners.

[LG-14] Robust Semi-Supervised Temporal Intrusion Detection for Adversarial Cloud Networks

链接: https://arxiv.org/abs/2604.12655
作者: Anasuya Chattopadhyay,Daniel Reti,Hans D. Schotten
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: This work has been accepted for publication in IEEE 2026 EuCNC 6G Summit. This is a preprint version. The final published version will be available via IEEE Xplore

点击查看摘要

Abstract:Cloud networks increasingly rely on machine learning based Network Intrusion Detection Systems to defend against evolving cyber threats. However, real-world deployments are challenged by limited labeled data, non-stationary traffic, and adaptive adversaries. While semi-supervised learning can alleviate label scarcity, most existing approaches implicitly assume benign and stationary unlabeled traffic, leading to degraded performance in adversarial cloud environments. This paper proposes a robust semi-supervised temporal learning framework for cloud intrusion detection that explicitly addresses adversarial contamination and temporal drift in unlabeled network traffic. Operating on flow-level data, this framework combines supervised learning with consistency regularization, confidence-aware pseudo-labeling, and selective temporal invariance to conservatively exploit unlabeled traffic while suppressing unreliable samples. By leveraging the temporal structure of network flows, the proposed method improves robustness and generalization across heterogeneous cloud environments. Extensive evaluations on publicly available datasets (CIC-IDS2017, CSE-CIC-IDS2018, and UNSW-NB15) under limited-label conditions demonstrate that the proposed framework consistently outperforms state-of-the-art supervised and semi-supervised network intrusion detection systems in detection performance, label efficiency, and resilience to adversarial and non-stationary traffic.
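
The confidence-aware pseudo-labeling component can be sketched as follows; the threshold value and the list-based interface are illustrative, not taken from the paper:

```python
def pseudo_label(probs, threshold=0.9):
    """Confidence-aware pseudo-labeling: keep an unlabeled flow only if
    the model's most likely class clears the confidence threshold;
    low-confidence (potentially adversarial or drifted) samples are
    suppressed. A minimal sketch, not the paper's implementation."""
    kept = []
    for i, p in enumerate(probs):            # p: class-probability vector
        conf = max(p)
        if conf >= threshold:
            kept.append((i, p.index(conf)))  # (sample index, pseudo-label)
    return kept

preds = [[0.97, 0.03], [0.55, 0.45], [0.08, 0.92]]
print(pseudo_label(preds))  # → [(0, 0), (2, 1)]: the ambiguous flow is dropped
```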

[LG-15] EEG-Based Multimodal Learning via Hyperbolic Mixture-of-Curvature Experts

链接: https://arxiv.org/abs/2604.12579
作者: Runhe Zhou,Shanglin Li,Guanxiang Huang,Xinliang Zhou,Qibin Zhao,Motoaki Kawanabe,Yi Ding,Cuntai Guan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electroencephalography (EEG)-based multimodal learning integrates brain signals with complementary modalities to improve mental state assessment, providing great clinical potential. The effectiveness of such paradigms largely depends on the representation learning on heterogeneous modalities. For EEG-based paradigms, one promising approach is to leverage their hierarchical structures, as recent studies have shown that both EEG and associated modalities (e.g., facial expressions) exhibit hierarchical structures reflecting complex cognitive processes. However, Euclidean embeddings struggle to represent these hierarchical structures due to their flat geometry, while hyperbolic spaces, with their exponential growth property, are naturally suited for them. In this work, we propose EEG-MoCE, a novel hyperbolic mixture-of-curvature experts framework designed for multimodal neurotechnology. EEG-MoCE assigns each modality to an expert in a learnable-curvature hyperbolic space, enabling adaptive modeling of its intrinsic geometry. A curvature-aware fusion strategy then dynamically weights experts, emphasizing modalities with richer hierarchical information. Extensive experiments on benchmark datasets demonstrate that EEG-MoCE achieves state-of-the-art performance, including emotion recognition, sleep staging, and cognitive assessment.
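
The hyperbolic geometry the abstract relies on can be made concrete with the unit-curvature Poincare-ball distance (the learnable per-expert curvature is omitted here for brevity):

```python
import math

def poincare_distance(u, v):
    """Geodesic distance in the unit-curvature Poincare ball, the kind
    of hyperbolic space the mixture-of-curvature experts operate in."""
    du = sum(x * x for x in u)                      # ||u||^2
    dv = sum(x * x for x in v)                      # ||v||^2
    duv = sum((x - y) ** 2 for x, y in zip(u, v))   # ||u - v||^2
    return math.acosh(1 + 2 * duv / ((1 - du) * (1 - dv)))

# Distance from the origin grows without bound near the boundary, which
# is the exponential-growth property that suits hierarchical structure.
print(poincare_distance((0.0, 0.0), (0.5, 0.0)))  # = ln 3 ≈ 1.0986
```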

[LG-16] Instantiating Bayesian CVaR lower bounds in Interactive Decision Making Problems

链接: https://arxiv.org/abs/2604.12519
作者: Raghav Bongole,Tobias J. Oechtering,Mikael Skoglund
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Recent work established a generalized-Fano framework for lower bounding prior-predictive (Bayesian) CVaR in interactive statistical decision making. In this paper, we show how to instantiate that framework in concrete interactive problems and derive explicit Bayesian CVaR lower bounds from its abstract corollaries. Our approach compares a hard model with a reference model using squared Hellinger distance, and combines a lower bound on a reference hinge term with a bound on the distinguishability of the two models. We apply this approach to canonical examples, including Gaussian bandits, and obtain explicit bounds that make the dependence on key problem parameters transparent. These results show how the generalized-Fano Bayesian CVaR framework can be used as a practical lower-bound tool for interactive learning and risk-sensitive decision making.
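
The squared Hellinger distance used to compare the hard and reference models is straightforward to compute for discrete distributions:

```python
import math

def squared_hellinger(p, q):
    """Squared Hellinger distance between two discrete distributions,
    the divergence the bound uses to measure distinguishability:
    H^2(p, q) = 1 - sum_i sqrt(p_i * q_i)."""
    return 1.0 - sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

print(squared_hellinger([0.5, 0.5], [0.5, 0.5]))  # identical models → 0.0
print(squared_hellinger([1.0, 0.0], [0.0, 1.0]))  # fully distinguishable → 1.0
```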

[LG-17] Agentic Control in Variational Language Models

链接: https://arxiv.org/abs/2604.12513
作者: Yves Ruffenach
类目: Machine Learning (cs.LG)
*备注: 20 pages, 8 figures

点击查看摘要

Abstract:We study whether a variational language model can support a minimal and measurable form of agentic control grounded in its own internal evidence. Our model combines local variational hidden computation (EVE), a homeostatic latent regulator, structurally aware checkpoint retention and a calibrated uncertainty-aware controller operating on top of the retained model. Rather than treating uncertainty as a passive diagnostic measured after prediction, we treat it as an operational signal that can regulate training, support checkpoint retention and guide inference-time intervention. The resulting framework is deliberately focused. It studies a closed-loop form of internal control in which structural and predictive signals become actionable. Empirically, the variational backbone improves over a matched deterministic reference on the language-modeling task while also exhibiting a richer and more usable uncertainty profile. On top of this backbone, the calibrated controller remains active, uses multiple actions under a full agentic evaluation and yields a positive quality-cost trade-off. These results support a precise claim: internal uncertainty can serve not only as a descriptive property of a variational language model, but also as a practical control interface for regulation, checkpoint retention and minimal agentic routing.

[LG-18] Safety Training Modulates Harmful Misalignment Under On-Policy RL But Direction Depends on Environment Design

链接: https://arxiv.org/abs/2604.12500
作者: Leon Eshuijs,Shihan Wang,Antske Fokkens
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Specification gaming under Reinforcement Learning (RL) is known to cause LLMs to develop sycophantic, manipulative, or deceptive behavior, yet the conditions under which this occurs remain unclear. We train 11 instruction-tuned LLMs (0.5B–14B) with on-policy RL across 3 environments and find that model size acts as a safety buffer in some environments but enables greater harmful exploitation in others. Controlled ablations trace this reversal to environment-specific features such as role framing and implicit gameability cues. We further show that most safety benchmarks do not predict RL-induced misalignment, except in the case of Sycophancy scores when the exploit relies on inferring the user’s preference. Finally, we find that on-policy RL preserves a safety buffer inherent in the model’s own generation distribution, one that is bypassed during off-policy settings.

[LG-19] Adaptive Budget Allocation in LLM-Augmented Surveys

链接: https://arxiv.org/abs/2604.12497
作者: Zikun Ye,Jiameng Lyu,Rui Tao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) can generate survey responses at low cost, but their reliability varies substantially across questions and is unknown before data collection. Deploying LLMs in surveys still requires costly human responses for verification and correction. How should a limited human-labeling budget be allocated across questions in real time? We propose an adaptive allocation algorithm that learns which questions are hardest for the LLM while simultaneously collecting human responses. Each human label serves a dual role: it improves the estimate for that question and reveals how well the LLM predicts human responses on it. The algorithm directs more budget to questions where the LLM is least reliable, without requiring any prior knowledge of question-level LLM accuracy. We prove that the allocation gap relative to the best possible allocation vanishes as the budget grows, and validate the approach on both synthetic data and a real survey dataset with 68 questions and over 2000 respondents. On real survey data, the standard practice of allocating human labels uniformly across questions wastes 10–12% of the budget relative to the optimal; our algorithm reduces this waste to 2–6%, and the advantage grows as questions become more heterogeneous in LLM prediction quality. The algorithm achieves the same estimation quality as traditional uniform sampling with fewer human samples, requires no pilot study, and is backed by formal performance guarantees validated on real survey data. More broadly, the framework applies whenever scarce human oversight must be allocated across tasks where LLM reliability is unknown.
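
A minimal sketch of the allocation idea, assuming a UCB-style exploration bonus (the paper's actual algorithm and guarantees are not reproduced here); `llm_error` stands in for the unknown per-question LLM error that the algorithm must learn from the labels it collects:

```python
import math

def allocate_labels(budget, n_questions, llm_error, seed_per_question=1):
    """Greedy sketch: spend each next human label on the question with
    the highest estimated LLM error plus an exploration bonus, so
    unreliable questions attract more budget without a pilot study."""
    counts = [seed_per_question] * n_questions   # warm start, one label each
    err_est = list(llm_error)                    # oracle stand-in for the demo
    for t in range(budget - seed_per_question * n_questions):
        bonus = [math.sqrt(2 * math.log(t + 2) / c) for c in counts]
        q = max(range(n_questions), key=lambda i: err_est[i] + bonus[i])
        counts[q] += 1                           # "collect" one human label
    return counts

counts = allocate_labels(budget=100, n_questions=4, llm_error=[0.05, 0.05, 0.05, 0.4])
print(counts)  # the unreliable question receives the most labels
```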

[LG-20] Analyzing the Effect of Noise in LLM Fine-tuning

链接: https://arxiv.org/abs/2604.12469
作者: Lingfang Li,Procheta Sen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fine-tuning is the dominant paradigm for adapting pretrained large language models (LLMs) to downstream NLP tasks. In practice, fine-tuning datasets may contain various forms of noise arising from annotation errors, preprocessing artifacts, or automated data collection. While prior work has focused on designing robust learning algorithms to mitigate performance degradation under noisy conditions, comparatively little is known about how different types of noise affect the internal learning dynamics of LLMs during fine-tuning. In this work, we systematically study the impact of noise on model behavior across three pretrained model families (GPT-2, Qwen2 and Llama-2) and three diverse NLP tasks. We introduce controlled perturbations corresponding to three common real-world noise types: label noise, grammatical noise, and typographical noise. Beyond task-level performance, we analyze layer-wise representation changes and attention patterns to understand how noise propagates through the network. Our results show that corrupting labels (i.e. label noise) consistently causes the largest performance degradation, whereas grammatical noise and typographical noise can occasionally yield mild regularization benefits. We further find that noise effects are localized primarily to task-specific layers, while attention structures remain comparatively stable.
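
The label-noise condition studied above is typically simulated with uniform random flips, as in this sketch (the paper's grammatical and typographical perturbations act on the input text instead):

```python
import random

def inject_label_noise(labels, classes, p, seed=0):
    """Label noise: each label is replaced, with probability p, by a
    different class drawn uniformly at random (standard uniform-flip
    model; not necessarily the paper's exact corruption procedure)."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < p:
            noisy.append(rng.choice([c for c in classes if c != y]))
        else:
            noisy.append(y)
    return noisy

clean = [0, 1, 1, 0, 2, 2]
print(inject_label_noise(clean, classes=[0, 1, 2], p=0.0))  # unchanged
print(inject_label_noise(clean, classes=[0, 1, 2], p=1.0))  # every label flipped
```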

[LG-21] VeriX-Anon: A Multi-Layered Framework for Mathematically Verifiable Outsourced Target-Driven Data Anonymization

链接: https://arxiv.org/abs/2604.12431
作者: Miit Daga,Swarna Priya Ramu
类目: Cryptography and Security (cs.CR); Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Organisations increasingly outsource privacy-sensitive data transformations to cloud providers, yet no practical mechanism lets the data owner verify that the contracted algorithm was faithfully executed. VeriX-Anon is a multi-layered verification framework for outsourced Target-Driven k-anonymization combining three orthogonal mechanisms: deterministic verification via Merkle-style hashing of an Authenticated Decision Tree, probabilistic verification via Boundary Sentinels near the Random Forest decision boundary and exact-duplicate Twins with cryptographic identifiers, and utility-based verification via Explainable AI fingerprinting that compares SHAP value distributions before and after anonymization using the Wasserstein distance. Evaluated on three cross-domain datasets against Lazy (drops 5 percent of records), Dumb (random splitting, fake hash), and Approximate (random splitting, valid hash) adversaries, VeriX-Anon correctly detected deviations in 11 of 12 scenarios. No single layer achieved this alone. The XAI layer was the only mechanism that caught the Approximate adversary, succeeding on Adult and Bank but failing on the severely imbalanced Diabetes dataset where class imbalance suppresses the SHAP signal, confirming the need for adaptive thresholding. An 11-point k-sweep showed Target-Driven anonymization preserves significantly more utility than Blind anonymization (Wilcoxon p = 0.000977, Cohen’s d = 1.96, mean F1 gap +0.1574). Client-side verification completes under one second at one million rows. The threat model covers three empirically evaluated profiles and one theoretical profile (Informed Attacker) aware of trap embedding but unable to defeat the cryptographic salt. Sentinel evasion probability ranges from near-zero for balanced datasets to 0.52 for imbalanced ones, a limitation the twin layer compensates for in every tested scenario.
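
The deterministic layer's Merkle-style hashing can be illustrated with a generic Merkle tree over record bytes (not VeriX-Anon's Authenticated Decision Tree construction; the record contents are hypothetical):

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(records):
    """Merkle commitment over anonymized records: the client recomputes
    this root and compares it with the provider's claimed root, so any
    tampered or dropped record changes the hash and is detected."""
    level = [_h(r.encode()) for r in records]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node on odd levels
            level.append(level[-1])
        level = [_h(a + b) for a, b in zip(level[::2], level[1::2])]
    return level[0].hex()

rows = ["age=3*,zip=481**", "age=3*,zip=481**", "age=4*,zip=482**"]
honest = merkle_root(rows)
tampered = merkle_root(rows[:-1] + ["age=41,zip=48223"])  # provider deviated
print(honest != tampered)  # → True: the deviation is detected
```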

[LG-22] Forecasting the Past: Gradient-Based Distribution Shift Detection in Trajectory Prediction CVPR

链接: https://arxiv.org/abs/2604.12425
作者: Michele De Vita,Julian Wiederer,Vasileios Belagiannis
类目: Machine Learning (cs.LG)
*备注: Accepted at CVPRW SAIAD 2026

点击查看摘要

Abstract:Trajectory prediction models often fail in real-world automated driving due to distributional shifts between training and test conditions. Such distributional shifts, whether behavioural or environmental, pose a critical risk by causing the model to make incorrect forecasts in unfamiliar situations. We propose a self-supervised method that trains a decoder in a post-hoc fashion on the self-supervised task of forecasting the second half of observed trajectories from the first half. The L2 norm of the gradient of this forecasting loss with respect to the decoder’s final layer defines a score to identify distribution shifts. Our approach, first, does not affect the trajectory prediction model, ensuring no interference with original prediction performance and second, demonstrates substantial improvements on distribution shift detection for trajectory prediction on the Shifts and Argoverse datasets. Moreover, we show that this method can also be used to early detect collisions of a deep Q-Network motion planner in the Highway simulator. Source code is available at this https URL.
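
The gradient-norm score can be written down exactly for a one-layer linear stand-in for the decoder's final layer; the real method applies the same idea to the trained self-supervised forecasting decoder:

```python
import math

def grad_norm_score(w, x, y):
    """Shift score in the spirit of the method above: the L2 norm of the
    gradient of the forecasting loss with respect to the final (here:
    linear) layer. For squared loss L = (w·x - y)^2 the gradient is
    2(w·x - y)·x, so the score equals 2|w·x - y|·||x||."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    grad = [2 * residual * xi for xi in x]
    return math.sqrt(sum(g * g for g in grad))

w = [0.5, 0.5]                                     # "trained" decoder weights
in_dist = grad_norm_score(w, x=[1.0, 1.0], y=1.0)  # model fits: score 0
shifted = grad_norm_score(w, x=[1.0, 1.0], y=4.0)  # unfamiliar behaviour
print(in_dist, shifted)  # large score flags the distribution shift
```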

[LG-23] PrivEraserVerify: Efficient Private and Verifiable Federated Unlearning

链接: https://arxiv.org/abs/2604.12348
作者: Parthaw Goswami,Md Khairul Islam,Ashfak Yeafi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning (FL) enables collaborative model training without sharing raw data, offering a promising path toward privacy preserving artificial intelligence. However, FL models may still memorize sensitive information from participants, conflicting with the right to be forgotten (RTBF). To meet these requirements, federated unlearning has emerged as a mechanism to remove the contribution of departing clients. Existing solutions only partially address this challenge: FedEraser improves efficiency but lacks privacy protection, FedRecovery ensures differential privacy (DP) but degrades accuracy, and VeriFi enables verifiability but introduces overhead without efficiency or privacy guarantees. We present PrivEraserVerify (PEV), a unified framework that integrates efficiency, privacy, and verifiability into federated unlearning. PEV employs (i) adaptive checkpointing to retain critical historical updates for fast reconstruction, (ii) layer adaptive differentially private calibration to selectively remove client influence while minimizing accuracy loss, and (iii) fingerprint based verification, enabling participants to confirm unlearning in a decentralized and noninvasive manner. Experiments on image, handwritten character, and medical datasets show that PEV achieves up to 2 to 3 times faster unlearning than retraining, provides formal indistinguishability guarantees with reduced performance degradation, and supports scalable verification. To the best of our knowledge, PEV is the first framework to simultaneously deliver efficiency, privacy, and verifiability for federated unlearning, moving FL closer to practical and regulation compliant deployment.

[LG-24] Identifying and Mitigating Gender Cues in Academic Recommendation Letters: An Interpretability Case Study

链接: https://arxiv.org/abs/2604.12337
作者: Charlotte S. Alexander,Shane Storks,Souradip Pal,Sayak Chakrabarty,Arushi Sharma,Mlen-Too Wesley,Bailey Russo
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 17 pages, 3 figures

点击查看摘要

Abstract:Letters of recommendation (LoRs) can carry patterns of implicitly gendered language that can inadvertently influence downstream decisions, e.g. in hiring and admissions. In this work, we investigate the extent to which Transformer-based encoder models as well as Large Language Models (LLMs) can infer the gender of applicants in academic LoRs submitted to a U.S. medical-residency program after explicit identifiers like names and pronouns are de-gendered. Using three models (DistilBERT, RoBERTa, and Llama 2) to classify the gender of anonymized and de-gendered LoRs, we observed significant gender leakage, evident from up to 68% classification accuracy. Text interpretation methods, like TF-IDF and SHAP, demonstrate that certain linguistic patterns are strong proxies for gender, e.g. "emotional" and "humanitarian" are commonly associated with LoRs from female applicants. As an experiment in creating truly gender-neutral LoRs, we removed these implicit gender cues, which led to a drop of up to 5.5% in accuracy and 2.7% in macro F_1 score when re-training the classifiers. However, applicant gender prediction still remains better than chance. In this case study, our findings highlight that 1) LoRs contain gender-identifying cues that are hard to remove and may activate bias in decision-making and 2) while our technical framework may be a concrete step toward fairer academic and professional evaluations, future work is needed to interrogate the role that gender plays in LoR review. Taken together, our findings motivate upstream auditing of evaluative text in real-world academic letters of recommendation as a necessary complement to model-level fairness interventions.
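
TF-IDF, one of the two interpretation tools named above, is easy to reproduce; the example words are illustrative, not drawn from the paper's data:

```python
import math

def tf_idf(docs):
    """Plain TF-IDF over tokenized documents, used in the study above to
    surface words that act as gender proxies. tf = raw count normalised
    by document length; idf = log(N / df)."""
    n = len(docs)
    df = {}
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    scores = []
    for doc in docs:
        s = {}
        for w in doc:
            tf = doc.count(w) / len(doc)
            s[w] = tf * math.log(n / df[w])
        scores.append(s)
    return scores

letters = [
    ["hardworking", "emotional", "caring"],
    ["hardworking", "rigorous", "driven"],
]
scores = tf_idf(letters)
print(scores[0]["hardworking"])       # appears in every letter → 0.0
print(scores[0]["emotional"] > 0)     # distinctive term gets positive weight
```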

[LG-25] Beyond Weather Correlation: A Comparative Study of Static and Temporal Neural Architectures for Fine-Grained Residential Energy Consumption Forecasting in Melbourne Australia

链接: https://arxiv.org/abs/2604.12304
作者: Prasad Nimantha Madusanka Ukwatta Hewage,Hao Wu
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 22 pages, 6 figures. Earlier preprint versions: Zenodo this https URL SSRN this https URL

点击查看摘要

Abstract:Accurate short-term residential energy consumption forecasting at sub-hourly resolution is critical for smart grid management, demand response programmes, and renewable energy integration. While weather variables are widely acknowledged as key drivers of residential electricity demand, the relative merit of incorporating temporal autocorrelation (the sequential memory of past consumption) over static meteorological features alone remains underexplored at fine-grained (5-minute) temporal resolution for Australian households. This paper presents a rigorous empirical comparison of a Multilayer Perceptron (MLP) and a Long Short-Term Memory (LSTM) recurrent network applied to two real-world Melbourne households: House 3 (a standard grid-connected dwelling) and House 4 (a rooftop solar photovoltaic-integrated household). Both models are trained on 14 months of 5-minute interval smart meter data (March 2023-April 2024) merged with official Bureau of Meteorology (BOM) daily weather observations, yielding over 117,000 samples per household. The LSTM, operating on 24-step (2-hour) sliding consumption windows, achieves coefficients of determination of R^2 = 0.883 (House 3) and R^2 = 0.865 (House 4), compared to R^2 = -0.055 and R^2 = 0.410 for the corresponding weather-driven MLPs - differences of 93.8 and 45.5 percentage points. These results establish that temporal autocorrelation in the consumption sequence dominates meteorological information for short-term forecasting at 5-minute granularity. Additionally, we demonstrate an asymmetry introduced by solar generation: for the PV-integrated household, the MLP achieves R^2 = 0.410, revealing implicit solar forecasting from weather-time correlations. A persistence baseline analysis and seasonal stratification contextualise model performance. We propose a hybrid weather-augmented LSTM and federated learning extensions as directions for future work.
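
The persistence baseline mentioned in the abstract, together with the R^2 metric the paper reports, can be sketched as follows (the toy consumption values are invented):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination, the metric reported above."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def persistence_forecast(series):
    """Persistence baseline: the forecast for each 5-minute step is
    simply the previous observation."""
    return series[:-1]            # predictions for series[1:], shifted by one

load = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]   # toy 5-minute consumption ramp (kWh)
preds = persistence_forecast(load)
print(round(r_squared(load[1:], preds), 3))  # → 0.5: memory alone explains half the variance
```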

[LG-26] Labeled TrustSet Guided: Batch Active Learning with Reinforcement Learning IJCNN2026

链接: https://arxiv.org/abs/2604.12303
作者: Guofeng Cui,Yang Liu,Pichao Wang,Hankai Hsu,Xiaohang Sun,Xiang Hao,Zhu Liu
类目: Machine Learning (cs.LG)
*备注: Published as a conference paper at IJCNN 2026

点击查看摘要

Abstract:Batch active learning (BAL) is a crucial technique for reducing labeling costs and improving data efficiency in training large-scale deep learning models. Traditional BAL methods often rely on metrics like Mahalanobis Distance to balance uncertainty and diversity when selecting data for annotation. However, these methods predominantly focus on the distribution of unlabeled data and fail to leverage feedback from labeled data or the model’s performance. To address these limitations, we introduce TrustSet, a novel approach that selects the most informative data from the labeled dataset, ensuring a balanced class distribution to mitigate the long-tail problem. Unlike CoreSet, which focuses on maintaining the overall data distribution, TrustSet optimizes the model’s performance by pruning redundant data and using label information to refine the selection process. To extend the benefits of TrustSet to the unlabeled pool, we propose a reinforcement learning (RL)-based sampling policy that approximates the selection of high-quality TrustSet candidates from the unlabeled data. Combining TrustSet and RL, we introduce the Batch Reinforcement Active Learning with TrustSet (BRAL-T) framework. BRAL-T achieves state-of-the-art results across 10 image classification benchmarks and 2 active fine-tuning tasks, demonstrating its effectiveness and efficiency in various domains.
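
The Mahalanobis distance that traditional BAL methods rely on (per the abstract) reduces to a few lines in the 2-D case, written out with an explicit covariance inverse for self-containment:

```python
import math

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance sqrt((x-mean)^T cov^{-1} (x-mean)) in 2-D,
    the metric traditional BAL methods use to balance uncertainty and
    diversity when picking samples for annotation."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mean[0], x[1] - mean[1]]
    m = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(m)

# With identity covariance the distance reduces to Euclidean distance.
print(mahalanobis_2d((3.0, 4.0), (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]]))  # → 5.0
```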

[LG-27] Models Know Their Shortcuts: Deployment-Time Shortcut Mitigation

链接: https://arxiv.org/abs/2604.12277
作者: Jiayi Li,Shijie Tang,Gün Kaynar,Shiyi Du,Carl Kingsford
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pretrained language models often rely on superficial features that appear predictive during training yet fail to generalize at test time, a phenomenon known as shortcut learning. Existing mitigation methods generally operate at training time and require heavy supervision such as access to the original training data or prior knowledge of shortcut type. We propose Shortcut Guardrail, a deployment-time framework that mitigates token-level shortcuts without access to the original training data or shortcut annotations. Our key insight is that gradient-based attribution on a biased model highlights shortcut tokens. Building on this finding, we train a lightweight LoRA-based debiasing module with a Masked Contrastive Learning (MaskCL) objective that encourages consistent representations with or without individual tokens. Across sentiment classification, toxicity detection, and natural language inference under both naturally occurring and controlled shortcuts, Shortcut Guardrail improves overall accuracy and worst-group accuracy over the unmitigated model under distribution shifts while preserving in-distribution performance.

[LG-28] RoleMAG: Learning Neighbor Roles in Multimodal Graphs

链接: https://arxiv.org/abs/2604.12271
作者: Yilong Zuo,Xunkai Li,Zhihan Zhang,Ronghua Li,Guoren Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal attributed graphs (MAGs) combine multimodal node attributes with structured relations. However, existing methods usually perform shared message passing on a single graph and implicitly assume that the same neighbors are equally useful for all modalities. In practice, neighbors that benefit one modality may interfere with another, blurring modality-specific signals under shared propagation. To address this issue, we propose RoleMAG, a multimodal graph framework that learns how different neighbors should participate in propagation. Concretely, RoleMAG distinguishes whether a neighbor should provide shared, complementary, or heterophilous signals, and routes them through separate propagation channels. This enables cross-modal completion from complementary neighbors while keeping heterophilous ones out of shared smoothing. Extensive experiments on three graph-centric MAG benchmarks show that RoleMAG achieves the best results on RedditS and Bili_Dance, while remaining competitive on Toys. Ablation, robustness, and efficiency analyses further support the effectiveness of the proposed role-aware propagation design. Our code is available at this https URL

[LG-29] Decentralized Learning via Random Walk with Jumps

链接: https://arxiv.org/abs/2604.12260
作者: Zonghong Liu,Matthew Dwyer,Salim El Rouayheb
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:We study decentralized learning over networks where data are distributed across nodes without a central coordinator. Random walk learning is a token-based approach in which a single model is propagated across the network and updated at each visited node using local data, thereby incurring low communication and computational overheads. In weighted random-walk learning, the transition matrix is designed to achieve a desired sampling distribution, thereby speeding up convergence under data heterogeneity. We show that implementing weighted sampling via the Metropolis-Hastings algorithm can lead to a previously unexplored phenomenon we term entrapment. The random walk may become trapped in a small region of the network, resulting in highly correlated updates and severely degraded convergence. To address this issue, we propose Metropolis-Hastings with Levy jumps (MHLJ), which introduces occasional long-range transitions to restore exploration while respecting local information constraints. We establish a convergence rate that explicitly characterizes the roles of data heterogeneity, network spectral gap, and jump probability, and demonstrate through experiments that MHLJ effectively eliminates entrapment and significantly speeds up decentralized learning.
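
The jump mechanism can be sketched as follows; note the simplifications relative to the paper: uniform teleportation stands in for Levy-distributed jumps, and the target sampling distribution is taken as uniform:

```python
import random

def mh_with_jumps(graph, target, start, steps, p_jump=0.1, seed=0):
    """Metropolis-Hastings random walk with occasional long-range jumps.
    Proposals are uniform over neighbors and accepted with the usual MH
    ratio for the desired per-node sampling weight `target`; with
    probability p_jump the walker teleports, escaping entrapment."""
    rng = random.Random(seed)
    visits = {v: 0 for v in graph}
    cur = start
    for _ in range(steps):
        if rng.random() < p_jump:
            cur = rng.choice(list(graph))      # long-range jump
        else:
            nxt = rng.choice(graph[cur])       # uniform neighbor proposal
            ratio = (target[nxt] * len(graph[cur])) / (target[cur] * len(graph[nxt]))
            if rng.random() < ratio:
                cur = nxt
        visits[cur] += 1
    return visits

# Two cliques joined by a single edge: a plain MH walk can linger in one
# clique for a long time; jumps restore exploration of the whole graph.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
target = {v: 1.0 for v in graph}               # uniform sampling target
visits = mh_with_jumps(graph, target, start=0, steps=5000)
print(all(visits[v] > 0 for v in graph))       # every node gets visited
```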

[LG-30] LLM-Enhanced Log Anomaly Detection: A Comprehensive Benchmark of Large Language Models for Automated System Diagnostics

链接: https://arxiv.org/abs/2604.12218
作者: Disha Patel
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 5 pages, 4 tables, code available at this https URL

点击查看摘要

Abstract:System log anomaly detection is critical for maintaining the reliability of large-scale software systems, yet traditional methods struggle with the heterogeneous and evolving nature of modern log data. Recent advances in Large Language Models (LLMs) offer promising new approaches to log understanding, but a systematic comparison of LLM-based methods against established techniques remains lacking. In this paper, we present a comprehensive benchmark study evaluating both LLM-based and traditional approaches for log anomaly detection across four widely-used public datasets: HDFS, BGL, Thunderbird, and Spirit. We evaluate three categories of methods: (1) classical log parsers (Drain, Spell, AEL) combined with machine learning classifiers, (2) fine-tuned transformer models (BERT, RoBERTa), and (3) prompt-based LLM approaches (GPT-3.5, GPT-4, LLaMA-3) in zero-shot and few-shot settings. Our experiments reveal that while fine-tuned transformers achieve the highest F1-scores (0.96-0.99), prompt-based LLMs demonstrate remarkable zero-shot capabilities (F1: 0.82-0.91) without requiring any labeled training data – a significant advantage for real-world deployment where labeled anomalies are scarce. We further analyze the cost-accuracy trade-offs, latency characteristics, and failure modes of each approach. Our findings provide actionable guidelines for practitioners choosing log anomaly detection methods based on their specific constraints regarding accuracy, latency, cost, and label availability. All code and experimental configurations are publicly available to facilitate reproducibility.

[LG-31] A Residual-Shell-Based Lower Bound for Ollivier-Ricci Curvature

链接: https://arxiv.org/abs/2604.12211
作者: Xiang Gu,Huichun Zhang,Jian Sun
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:Ollivier-Ricci curvature (ORC), defined via the Wasserstein distance that captures rich geometric information, has received growing attention in both theory and applications. However, the high computational cost of Wasserstein distance evaluation has significantly limited the broader practical use of ORC. To alleviate this issue, previous work introduced a computationally efficient lower bound as a proxy for ORC based on 1-hop random walks, but this approach empirically exhibits large gaps from the exact ORC. In this paper, we establish a substantially tighter lower bound for ORC than the existing lower bound, while retaining much lower computational cost than exact ORC computation, with practical speedups of tens of times. Moreover, our bound is not restricted to 1-hop random walks, but also applies to k-hop random walks (k > 1). Experiments on several fundamental graph structures demonstrate the effectiveness of our bound in terms of both approximation accuracy and computational efficiency.
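
For context, the exact ORC of an edge requires solving a W1 optimal-transport problem; the sketch below computes it on a toy triangle graph with uniform 1-hop random-walk measures. This is the expensive baseline the paper's lower bound avoids, not the proposed bound itself, and the graph is invented for the example.

```python
import numpy as np
from scipy.optimize import linprog

def shortest_paths(adj):
    """All-pairs hop distances by BFS (small graphs only)."""
    n = len(adj)
    d = np.full((n, n), np.inf)
    for s in range(n):
        d[s, s] = 0.0
        frontier = [s]
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if np.isinf(d[s, v]):
                        d[s, v] = d[s, u] + 1
                        nxt.append(v)
            frontier = nxt
    return d

def ollivier_ricci(adj, x, y):
    """Exact ORC of edge (x, y): kappa = 1 - W1(mu_x, mu_y) / d(x, y),
    with mu_z uniform on the neighbors of z (a 1-hop random walk)."""
    n = len(adj)
    d = shortest_paths(adj)
    mu = np.zeros(n); mu[adj[x]] = 1.0 / len(adj[x])
    nu = np.zeros(n); nu[adj[y]] = 1.0 / len(adj[y])
    # W1 as a linear program over the transport plan f[i, j] >= 0
    A_eq, b_eq = [], []
    for i in range(n):                       # row marginals equal mu
        row = np.zeros(n * n); row[i * n:(i + 1) * n] = 1.0
        A_eq.append(row); b_eq.append(mu[i])
    for j in range(n):                       # column marginals equal nu
        col = np.zeros(n * n); col[j::n] = 1.0
        A_eq.append(col); b_eq.append(nu[j])
    res = linprog(d.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))
    return 1.0 - res.fun / d[x, y]

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
kappa = ollivier_ricci(triangle, 0, 1)
```

On a triangle, half of the mass overlaps at the common neighbor and the rest moves one hop, so the edge curvature comes out to 1/2; solving this LP per edge is exactly the cost that motivates cheaper lower bounds.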

[LG-32] PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving

链接: https://arxiv.org/abs/2604.12171
作者: Xu Bai,Muhammed Tawfiqul Islam,Chen Wang,Adel N. Toosi
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pipeline parallelism (PP) is widely used to partition layers of large language models (LLMs) across GPUs, enabling scalable inference for large models. However, existing systems rely on static PP configurations that fail to adapt to dynamic settings, such as serverless platforms and heterogeneous GPU environments. Reconfiguring PP by stopping and redeploying service incurs prohibitive downtime, so reconfiguration must instead proceed live and in place, without interrupting inference. However, live in-place PP reconfiguration is fundamentally challenging. GPUs are already saturated with model weights and KV cache, leaving little room for new layer placements and necessitating KV cache resizing, at odds with systems like vLLM that preallocate for throughput. Moreover, maintaining KV consistency during execution is difficult: stop-and-copy introduces large pauses, while background synchronization risks inconsistency as states evolve. We present PipeLive, which enables live in-place PP reconfiguration with minimal disruption. PipeLive introduces a redesigned KV cache layout together with a co-designed extension to PageAttention, forming a unified mechanism for live KV resizing. It further adopts an incremental KV patching mechanism, inspired by live virtual machine migration, to synchronize KV states between source and target configurations and identify a safe switch point. PipeLive achieves a 2.5X reduction in time-to-first-token (TTFT) without KV cache overflow compared to disabling KV resizing. Furthermore, compared to a variant without KV patching, it reduces reconfiguration overhead from seconds to under 10ms, and improves TTFT and time-per-output-token (TPOT) by up to 54.7% and 14.7%, respectively.

[LG-33] PubSwap: Public-Data Off-Policy Coordination for Federated RLVR

链接: https://arxiv.org/abs/2604.12160
作者: Anupam Nayak,Baris Askin,Muhammed Ustaomeroglu,Carlee Joe-Wong,Gauri Joshi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reasoning post-training with reinforcement learning from verifiable rewards (RLVR) is typically studied in centralized settings, yet many realistic applications involve decentralized private data distributed across organizations. Federated training is a natural solution, but scaling RLVR in this regime is challenging: full-model synchronization is expensive, and performing many local steps can cause severe client drift under heterogeneous data. We propose a federated RLVR framework that combines LoRA-based local adaptation with public-data-based off-policy steps to improve both communication efficiency and cross-client coordination. In particular, a small shared public dataset is used to periodically exchange and reuse response-level training signals across organizations, providing a lightweight anchor toward a more globally aligned objective without exposing private data. Our method selectively replaces locally incorrect responses with globally correct ones during public-data steps, thereby keeping training closer to the local policy while still benefiting from cross-client coordination. Across mathematical and medical reasoning benchmarks and models, our method consistently improves over standard baselines. Our results highlight a simple and effective recipe for federated reasoning post-training: combining low-rank communication with limited public-data coordination.

[LG-34] Distinct mechanisms underlying in-context learning in transformers

链接: https://arxiv.org/abs/2604.12151
作者: Cole Gibson,Wenping Cui,Gautam Reddy
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech)
*备注: 46 pages, 19 figures

点击查看摘要

Abstract:Modern distributed networks, notably transformers, acquire a remarkable ability (termed 'in-context learning') to adapt their computation to input statistics, such that a fixed network can be applied to data from a broad range of systems. Here, we provide a complete mechanistic characterization of this behavior in transformers trained on a finite set S of discrete Markov chains. The transformer displays four algorithmic phases, characterized by whether the network memorizes and generalizes, and whether it uses 1-point or 2-point statistics. We show that the four phases are implemented by multi-layer subcircuits that exemplify two qualitatively distinct mechanisms for implementing context-adaptive computations. Minimal models isolate the key features of both motifs. Memorization and generalization phases are delineated by two boundaries that depend on data diversity, K = |S|. The first (K_1*) is set by a kinetic competition between subcircuits and the second (K_2*) is set by a representational bottleneck. A symmetry-constrained theory of a transformer's training dynamics explains the sharp transition from 1-point to 2-point generalization and identifies key features of the loss landscape that allow the network to generalize. Put together, we show that transformers develop distinct subcircuits to implement in-context learning and identify conditions that favor certain mechanisms over others.

[LG-35] XANE(3): An E(3)-Equivariant Graph Neural Network for Accurate Prediction of XANES Spectra from Atomic Structures

链接: https://arxiv.org/abs/2604.12140
作者: Vitor F. Grizzi,Luke N. Pretzie,Jiayi Xu,Cong Liu
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:We present XANE(3), a physics-based E(3)-equivariant graph neural network for predicting X-ray absorption near-edge structure (XANES) spectra directly from atomic structures. The model combines tensor-product message passing with spherical harmonic edge features, absorber-query attention pooling, custom equivariant layer normalization, adaptive gated residual connections, and a spectral readout based on a multi-scale Gaussian basis with an optional sigmoidal background term. To improve line-shape fidelity, training is performed with a composite objective that includes pointwise spectral reconstruction together with first- and second-derivative matching terms. We evaluate the model on a dataset of 5,941 FDMNES simulations of iron oxide surface facets and obtain a spectrum mean squared error of 1.0 \times 10^-3 on the test set. The model accurately reproduces the main edge structure, relative peak intensities, pre-edge features, and post-edge oscillations. Ablation studies show that the derivative-aware objective, custom equivariant normalization, absorber-conditioned attention pooling, adaptive gated residual mixing, and global background term each improve performance. Interestingly, a capacity-matched scalar-only variant achieves comparable pointwise reconstruction error but reduced derivative-level fidelity, indicating that explicit tensorial channels are not strictly required for low intensity error on this dataset, although they remain beneficial for capturing finer spectral structure. These results establish XANE(3) as an accurate and efficient surrogate for XANES simulation and offer a promising route toward accelerated spectral prediction, ML-assisted spectroscopy, and data-driven materials discovery.

[LG-36] SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference Scaling SIGIR2026

链接: https://arxiv.org/abs/2604.12110
作者: Zikun Liu,Liang Luo,Qianru Li,Zhengyu Zhang,Wei Ling,Jingyi Shen,Zeliang Chen,Yaning Huang,Jingxian Huang,Abdallah Aboelela,Chonglin Sun,Feifan Gu,Fenggang Wu,Hang Qu,Huayu Li,Jill Pan,Kaidi Pei,Laming Chen,Longhao Jin,Qin Huang,Tongyi Tang,Varna Puvvada,Wenlin Chen,Xiaohan Wei,Xu Cao,Yantao Yao,Yuan Jin,Yunchen Pu,Yuxin Chen,Zijian Shen,Zhengkai Zhang,Dong Liang,Ellie Wen
类目: Machine Learning (cs.LG)
*备注: Accepted to SIGIR 2026 Industry Track

点击查看摘要

Abstract:Recent advances in recommendation scaling laws have led to foundation models of unprecedented complexity. While these models offer superior performance, their computational demands make real-time serving impractical, often forcing practitioners to rely on knowledge distillation-compromising serving quality for efficiency. To address this challenge, we present SOLARIS (Speculative Offloading of Latent-bAsed Representation for Inference Scaling), a novel framework inspired by speculative decoding. SOLARIS proactively precomputes user-item interaction embeddings by predicting which user-item pairs are likely to appear in future requests, and asynchronously generating their foundation model representations ahead of time. This approach decouples the costly foundation model inference from the latency-critical serving path, enabling real-time knowledge transfer from models previously considered too expensive for online use. Deployed across Meta’s advertising system serving billions of daily requests, SOLARIS achieves 0.67% revenue-driving top-line metrics gain, demonstrating its effectiveness at scale.
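
The core serving idea, precomputing expensive embeddings off the latency-critical path for user-item pairs predicted to appear, can be sketched with a small cache. The class, method names, and LRU policy below are invented for illustration and are not SOLARIS's actual implementation.

```python
from collections import OrderedDict

class SpeculativeEmbeddingCache:
    """Precompute costly embeddings for predicted (user, item) pairs so the
    latency-critical serving path only does a lookup (LRU eviction)."""
    def __init__(self, compute_fn, capacity=1000):
        self.compute_fn = compute_fn       # stands in for the foundation-model call
        self.capacity = capacity
        self.store = OrderedDict()
        self.hit = False

    def precompute(self, pairs):
        """Runs asynchronously, off the serving path, on speculated pairs."""
        for pair in pairs:
            if pair not in self.store:
                self.store[pair] = self.compute_fn(*pair)
                if len(self.store) > self.capacity:
                    self.store.popitem(last=False)   # evict oldest entry

    def serve(self, user, item):
        """Latency-critical path: cheap lookup, with a synchronous fallback."""
        emb = self.store.get((user, item))
        self.hit = emb is not None
        if emb is None:
            emb = self.compute_fn(user, item)        # speculation missed
        return emb

cache = SpeculativeEmbeddingCache(lambda u, i: (u, i, "emb"), capacity=2)
cache.precompute([("u1", "i1"), ("u2", "i2")])
hit_result = cache.serve("u1", "i1")
miss_result = cache.serve("u3", "i9")
```

The speculation quality (how often `serve` hits the cache) is what determines how much of the foundation model's knowledge reaches the real-time path.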

[LG-37] Parametric Interpolation of Dynamic Mode Decomposition for Predicting Nonlinear Systems

链接: https://arxiv.org/abs/2604.12103
作者: Ananda Chakrabarti,Haitham H. Saleh,Indranil Nayak,Balasubramaniam Shanker,Fernando L. Teixeira,Debdipta Goswami
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 22 pages, 9 figures

点击查看摘要

Abstract:We present parameter-interpolated dynamic mode decomposition (piDMD), a parametric reduced-order modeling framework that embeds known parameter-affine structure directly into the DMD regression step. Unlike existing parametric DMD methods which interpolate modes, eigenvalues, or reduced operators and can be fragile with sparse training data or multi-dimensional parameter spaces, piDMD learns a single parameter-affine Koopman surrogate reduced order model (ROM) across multiple training parameter samples and predicts at unseen parameter values without retraining. We validate piDMD on fluid flow past a cylinder, electron beam oscillations in transverse magnetic fields, and virtual cathode oscillations – the latter two being simulated using an electromagnetic particle-in-cell (EMPIC) method. Across all benchmarks, piDMD achieves accurate long-horizon predictions and improved robustness over state-of-the-art interpolation-based parametric DMD baselines, with less training samples and with multi-dimensional parameter spaces.
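
The central regression idea, fitting a single parameter-affine operator A(p) = A0 + p·A1 across multiple training parameters and then predicting at unseen p without retraining, can be sketched on a toy linear system. The data below are synthetic, not the paper's benchmarks.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A0 = 0.9 * np.eye(n)                        # ground-truth affine operator pieces
A1 = 0.1 * rng.standard_normal((n, n))

def trajectory(p, steps=20):
    """Simulate x_{k+1} = (A0 + p*A1) x_k from a random initial state."""
    A = A0 + p * A1
    X = [rng.standard_normal(n)]
    for _ in range(steps):
        X.append(A @ X[-1])
    return np.array(X).T                    # shape (n, steps + 1)

# Stack snapshot pairs from several training parameter samples
blocks_in, blocks_out = [], []
for p in (0.1, 0.5, 0.9):
    X = trajectory(p)
    Xk, Xk1 = X[:, :-1], X[:, 1:]
    blocks_in.append(np.vstack([Xk, p * Xk]))   # parameter-affine regressors
    blocks_out.append(Xk1)
Z = np.hstack(blocks_in)                    # (2n, total snapshots)
Y = np.hstack(blocks_out)                   # (n,  total snapshots)

# One least-squares fit recovers [A0_hat | A1_hat] jointly
AB = Y @ np.linalg.pinv(Z)
A0_hat, A1_hat = AB[:, :n], AB[:, n:]

# Predict at an unseen parameter value without retraining
p_test = 0.3
A_hat = A0_hat + p_test * A1_hat
```

Because the affine structure is embedded directly in the regression step, interpolation happens in the operator coefficients rather than over separately fitted modes or eigenvalues, which is the robustness argument the abstract makes.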

[LG-38] Robust Optimization for Mitigating Reward Hacking with Correlated Proxies ICLR2026

链接: https://arxiv.org/abs/2604.12086
作者: Zixuan Liu,Xiaolin Sun,Zizhan Zheng
类目: Machine Learning (cs.LG)
*备注: ICLR 2026

点击查看摘要

Abstract:Designing robust reinforcement learning (RL) agents in the presence of imperfect reward signals remains a core challenge. In practice, agents are often trained with proxy rewards that only approximate the true objective, leaving them vulnerable to reward hacking, where high proxy returns arise from unintended or exploitative behaviors. Recent work formalizes this issue using r-correlation between proxy and true rewards, but existing methods like occupancy-regularized policy optimization (ORPO) optimize against a fixed proxy and do not provide strong guarantees against broader classes of correlated proxies. In this work, we formulate reward hacking as a robust policy optimization problem over the space of all r-correlated proxy rewards. We derive a tractable max-min formulation, where the agent maximizes performance under the worst-case proxy consistent with the correlation constraint. We further show that when the reward is a linear function of known features, our approach can be adapted to incorporate this prior knowledge, yielding both improved policies and interpretable worst-case rewards. Experiments across several environments show that our algorithms consistently outperform ORPO in worst-case returns, and offer improved robustness and stability across different levels of proxy-true reward correlation. These results show that our approach provides both robustness and transparency in settings where reward design is inherently uncertain. The code is available at this https URL.

[LG-39] Robust Reasoning and Learning with Brain-Inspired Representations under Hardware-Induced Nonlinearities

链接: https://arxiv.org/abs/2604.12079
作者: William Youngwoo Chung,Hamza Errahmouni Barkam,Tamoghno Das,Mohsen Imani
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 8 pages, 7 figures, accepted to Great Lakes Symposium on VLSI (GLSVLSI) 2025

点击查看摘要

Abstract:Traditional machine learning depends on high-precision arithmetic and near-ideal hardware assumptions, which is increasingly challenged by variability in aggressively scaled semiconductor devices. Compute-in-memory (CIM) architectures alleviate data-movement bottlenecks and improve energy efficiency yet introduce nonlinear distortions and reliability concerns. We address these issues with a hardware-aware optimization framework based on Hyperdimensional Computing (HDC), systematically compensating for non-ideal similarity computations in CIM. Our approach formulates encoding as an optimization problem, minimizing the Frobenius norm between an ideal kernel and its hardware-constrained counterpart, and employs a joint optimization strategy for end-to-end calibration of hypervector representations. Experimental results demonstrate that our method when applied to QuantHD achieves 84% accuracy under severe hardware-induced perturbations, a 48% increase over naive QuantHD under the same conditions. Additionally, our optimization is vital for graph-based HDC reliant on precise variable-binding for interpretable reasoning. Our framework preserves the accuracy of RelHD on the Cora dataset, achieving a 5.4 \times accuracy improvement over naive RelHD under nonlinear environments. By preserving HDC’s robustness and symbolic properties, our solution enables scalable, energy-efficient intelligent systems capable of classification and reasoning on emerging CIM hardware.
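
The hypervector operations underlying HDC (binding, bundling, and similarity) can be sketched with bipolar vectors. This illustrates the similarity computations that the paper's framework calibrates against hardware non-idealities; it is not the optimization method itself, and the record encoding is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 10_000                      # hypervector dimensionality

def hv():                       # random bipolar hypervector
    return rng.choice([-1, 1], size=D)

def bind(a, b):                 # variable-value binding (elementwise product, self-inverse)
    return a * b

def bundle(*vs):                # superposition with sign thresholding
    return np.sign(np.sum(vs, axis=0))

def sim(a, b):                  # cosine-style similarity in [-1, 1]
    return float(a @ b) / D

# Encode a record {color: red, shape: circle} as a single hypervector
color, shape, red, circle, blue = hv(), hv(), hv(), hv(), hv()
record = bundle(bind(color, red), bind(shape, circle))

# Unbinding with the variable hypervector recovers a noisy copy of the value
recovered = bind(record, color)
```

The recovered vector is much more similar to `red` than to any unrelated hypervector such as `blue`; it is exactly this similarity margin that nonlinear compute-in-memory distortions erode and that hardware-aware encoding must preserve.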

[LG-40] TriFit: Trimodal Fusion with Protein Dynamics for Mutation Fitness Prediction

链接: https://arxiv.org/abs/2604.12026
作者: Seungik Cho
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Predicting the functional impact of single amino acid substitutions (SAVs) is central to understanding genetic disease and engineering therapeutic proteins. While protein language models and structure-based methods have achieved strong performance on this task, they systematically neglect protein dynamics; residue flexibility, correlated motions, and allosteric coupling are well-established determinants of mutational tolerance in structural biology, yet have not been incorporated into supervised variant effect predictors. We present TriFit, a multimodal framework that integrates sequence, structure, and protein dynamics through a four-expert Mixture-of-Experts (MoE) fusion module with trimodal cross-modal contrastive learning. Sequence embeddings are extracted via masked marginal scoring with ESM-2 (650M); structural embeddings from AlphaFold2-predicted C-alpha geometries; and dynamics embeddings from Gaussian Network Model (GNM) B-factors, mode shapes, and residue-residue cross-correlations. The MoE router adaptively weights modality combinations conditioned on the input, enabling protein-specific fusion without fixed modality assumptions. On the ProteinGym substitution benchmark (217 DMS assays, 696k SAVs), TriFit achieves AUROC 0.897 +/- 0.0002, outperforming all supervised baselines including Kermut (0.864) and ProteinNPT (0.844), and the best zero-shot model ESM3 (0.769). Ablation studies confirm that dynamics provides the largest marginal contribution over pairwise modality combinations, and TriFit achieves well-calibrated probabilistic outputs (ECE = 0.044) without post-hoc correction.

[LG-41] Sample Complexity of Autoregressive Reasoning: Chain-of-Thought vs. End-to-End

链接: https://arxiv.org/abs/2604.12013
作者: Steve Hanneke,Idan Mehalel,Shay Moran
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern large language models generate text autoregressively, producing tokens one at a time. To study the learnability of such systems, Joshi et al. (COLT 2025) introduced a PAC-learning framework for next-token generators, the primitive underlying autoregressive models. In this framework, an unknown next-token generator maps a sequence of tokens to the next token and is iteratively applied for T steps, producing a chain of tokens whose final token constitutes the model’s output. The learning task is to learn the input-output mapping induced by this autoregressive process. Depending on the available supervision, training examples may reveal only the final output (End-to-End supervision) or the entire generated chain (Chain-of-Thought supervision). This raises two natural questions: how the sample complexity depends on the generation length T , and how much Chain-of-Thought supervision can reduce this dependence. In this work we give a nearly complete answer to both questions by uncovering a taxonomy of how the sample complexity scales with T . For End-to-End learning, we show that the landscape is remarkably rich: subject to mild conditions, essentially any growth rate r(T) between constant and linear can arise as the sample complexity, and combined with the linear upper bound of Joshi et al., this yields an essentially complete characterization. In contrast, under Chain-of-Thought supervision we show that the sample complexity is independent of T , demonstrating that access to intermediate reasoning steps can eliminate the dependence on the generation length altogether. Our analysis introduces new combinatorial tools, and as corollaries we resolve several open questions posed by Joshi et al. regarding the dependence of learnability on the generation length and the role of Chain-of-Thought supervision. 

[LG-42] Loss-Driven Bayesian Active Learning

链接: https://arxiv.org/abs/2604.11995
作者: Zhuoyue Huang,Freddie Bickford Smith,Tom Rainforth
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The central goal of active learning is to gather data that maximises downstream predictive performance, but popular approaches have limited flexibility in customising this data acquisition to different downstream problems and losses. We propose a rigorous loss-driven approach to Bayesian active learning that allows data acquisition to directly target the loss associated with a given decision problem. In particular, we show how any loss can be used to derive a unique objective for optimal data acquisition. Critically, we then show that any loss taking the form of a weighted Bregman divergence permits analytic computation of a central component of its corresponding objective, making the approach applicable in practice. In regression and classification experiments with a range of different losses, we find our approach reduces test losses relative to existing techniques.

[LG-43] Offline-Online Reinforcement Learning for Linear Mixture MDPs

链接: https://arxiv.org/abs/2604.11994
作者: Zhongjun Zhang,Sean R. Sinclair
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 72 pages, 4 figures

点击查看摘要

Abstract:We study offline-online reinforcement learning in linear mixture Markov decision processes (MDPs) under environment shift. In the offline phase, data are collected by an unknown behavior policy and may come from a mismatched environment, while in the online phase the learner interacts with the target environment. We propose an algorithm that adaptively leverages offline data. When the offline data are informative, either due to sufficient coverage or small environment shift, the algorithm provably improves over purely online learning. When the offline data are uninformative, it safely ignores them and matches the online-only performance. We establish regret upper bounds that explicitly characterize when offline data are beneficial, together with nearly matching lower bounds. Numerical experiments further corroborate our theoretical findings.

[LG-44] Exploring Concept Subspace for Self-explainable Text-Attributed Graph Learning

链接: https://arxiv.org/abs/2604.11986
作者: Xiaoxue Han,Libo Zhang,Zining Zhu,Yue Ning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Graph Concept Bottleneck (GCB) as a new paradigm for self-explainable text-attributed graph learning. GCB maps graphs into a subspace, concept bottleneck, where each concept is a meaningful phrase, and predictions are made based on the activation of these concepts. Unlike existing interpretable graph learning methods that primarily rely on subgraphs as explanations, the concept bottleneck provides a new form of interpretation. To refine the concept space, we apply the information bottleneck principle to focus on the most relevant concepts. This not only yields more concise and faithful explanations but also explicitly guides the model to “think” toward the correct decision. We empirically show that GCB achieves intrinsic interpretability with accuracy on par with black-box Graph Neural Networks. Moreover, it delivers better performance under distribution shifts and data perturbations, showing improved robustness and generalizability, benefitting from concept-guided prediction.

[LG-45] A Geometric Algebra-informed NeRF Framework for Generalizable Wireless Channel Prediction

链接: https://arxiv.org/abs/2604.11983
作者: Jingzhou Shen,Luis Lago Enamorado,Shiwen Mao,Xuyu Wang
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: It is accepted by IEEE INFOCOM 2026

点击查看摘要

Abstract:In this paper, we propose the geometric algebra-informed neural radiance fields (GAI-NeRF), a novel framework for wireless channel prediction that leverages geometric algebra attention mechanisms to capture ray-object interactions in complex propagation environments. Our approach incorporates global token representations, drawing inspiration from transformer architectures in language and vision domains, to aggregate learned spatial-electromagnetic features and enhance scene understanding. We identify limitations in conventional static ray tracing modules that hinder model generalization and address this challenge through a new ray tracing architecture. This design enables effective generalization across diverse wireless scenarios while maintaining computational efficiency. Experimental results demonstrate that GAI-NeRF achieves superior performance in channel prediction tasks by combining geometric algebra principles with neural scene representations, offering a promising direction for next-generation wireless communication systems. Moreover, GAI-NeRF greatly outperforms existing methods across multiple wireless scenarios. To ensure comprehensive assessment, we further evaluate our approach against multiple benchmarks using newly collected real-world indoor datasets tailored for single-scene downstream tasks and generalization testing, confirming its robust performance in unseen environments and establishing its high efficacy for wireless channel prediction.

[LG-46] Multi-Head Residual-Gated DeepONet for Coherent Nonlinear Wave Dynamics

链接: https://arxiv.org/abs/2604.11972
作者: Zhiwei Fan,Yiming Pan,Daniel Coca
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Coherent nonlinear wave dynamics are often strongly shaped by a compact set of physically meaningful descriptors of the initial state. Traditional neural operators typically treat the input-output mapping as a largely black-box high-dimensional regression problem, without explicitly exploiting this structured physical context. Common feature-integration strategies usually rely on direct concatenation or FiLM-style affine modulation in hidden latent spaces. Here we introduce a different paradigm, loosely inspired by the complementary roles of state evolution and physically meaningful observables in quantum mechanics: the wave field is learned through a standard DeepONet state pathway, while compact physical descriptors follow a parallel conditioning pathway and act as residual modulation factors on the state prediction. Based on this idea, we develop a Multi-Head Residual-Gated DeepONet (MH-RG), which combines a pre-branch residual modulator, a branch residual gate, and a trunk residual gate with a low-rank multi-head mechanism to capture multiple complementary conditioned response patterns without prohibitive parameter growth. We evaluate the framework on representative benchmarks including highly nonlinear conservative wave dynamics and dissipative trapped dynamics and further perform detailed mechanistic analyses of the learned multi-head gating behavior. Compared with feature-augmented baselines, MH-RG DeepONet achieves consistently lower error while better preserving phase coherence and the fidelity of physically relevant dynamical quantities.
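
The residual-gating idea, with physical descriptors acting as multiplicative corrections on the state prediction rather than being concatenated into it, can be sketched in a few lines. The weights and dimensions below are invented for illustration; this is not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_feat = 8, 3

# Hypothetical weights for the descriptor (conditioning) pathway
W_gate = 0.1 * rng.standard_normal((d_state, d_feat))
W_mod = 0.1 * rng.standard_normal((d_state, d_feat))

def residual_gated(state, features):
    """State prediction modulated by physical descriptors:
    output = state + gate(features) * modulation(features)."""
    gate = 1.0 / (1.0 + np.exp(-W_gate @ features))   # sigmoid gate in (0, 1)
    modulation = W_mod @ features
    return state + gate * modulation

state = rng.standard_normal(d_state)          # stands in for the DeepONet state output
out = residual_gated(state, np.array([1.0, 0.5, -0.2]))
```

When the descriptors are zero the modulation vanishes and the state pathway is recovered unchanged, which is what makes the conditioning residual; the multi-head variant in the paper runs several such gated corrections in parallel.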

[LG-47] Classification of Epileptic iEEG using Topological Machine Learning

链接: https://arxiv.org/abs/2604.11971
作者: Sunia Tanweer,Narayan Puthanmadam Subramaniyam,Firas A. Khasawneh
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Epileptic seizure detection from EEG signals remains challenging due to the high dimensionality and nonlinear, potentially stochastic, dynamics of neural activity. In this work, we investigate whether features derived from topological data analysis (TDA) can improve the classification of brain states in preictal, ictal and interictal iEEG recordings from epilepsy patients using multichannel data. We analyze data from 55 patients, significantly larger than many previous studies that rely on patient-specific models. Persistence diagrams derived from iEEG signals are vectorized using several TDA representations, including Carlsson coordinates, persistence images, and template functions. To understand how topological representations interact with modern machine learning pipelines, we conduct a large-scale ablation study across multiple iEEG frequency bands, dimensionality reduction techniques, feature representations, and classifier architectures. Our experiments show that dimension-reduced topological representations achieve up to 80% balanced accuracy for three-class classification. Interestingly, classical machine learning models perform comparably to deep learning models, achieving up to 79.17% balanced accuracy, suggesting that carefully designed topological features can substantially reduce model complexity requirements. In contrast, pipelines preserving the full multichannel feature structure exhibit severe overfitting due to the high-dimensional feature space. These findings highlight the importance of structure-preserving dimensionality reduction when applying topology-based representations to multichannel neural data.
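
As one concrete example of the TDA vectorizations mentioned above, Carlsson-coordinate-style polynomial features of a persistence diagram can be computed in a few lines. This is a generic sketch of one standard representation, not the paper's exact feature pipeline, and the toy diagram is invented.

```python
import numpy as np

def carlsson_coordinates(diagram):
    """Four polynomial features of a persistence diagram of (birth, death)
    pairs, in the style of Adcock-Carlsson coordinates."""
    b, d = diagram[:, 0], diagram[:, 1]
    life = d - b                    # persistence (lifetime) of each feature
    dmax = d.max()
    return np.array([
        np.sum(b * life),
        np.sum((dmax - d) * life),
        np.sum(b**2 * life**4),
        np.sum((dmax - d)**2 * life**4),
    ])

# Toy diagram: one dominant long-lived feature and two short-lived ones
diag = np.array([[0.0, 1.0], [0.2, 0.4], [0.1, 0.3]])
feats = carlsson_coordinates(diag)
```

Vectorizations like this turn variable-size diagrams into fixed-length inputs for ordinary classifiers, which is why the paper can compare them directly against deep models.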

[LG-48] The Linear Centroids Hypothesis: How Deep Network Features Represent Data

链接: https://arxiv.org/abs/2604.11962
作者: Thomas Walker,Ahmed Imtiaz Humayun,Randall Balestriero,Richard Baraniuk
类目: Machine Learning (cs.LG)
*备注: 20 pages, 17 figures

点击查看摘要

Abstract:Identifying and understanding the features that a deep network (DN) extracts from its inputs to produce its outputs is a focal point of interpretability research. The Linear Representation Hypothesis (LRH) identifies features in terms of the linear directions formed by the inputs in a DN’s latent space. However, the LRH is limited as it abstracts away from individual components (e.g., neurons and layers), is susceptible to identifying spurious features, and cannot be applied across sub-components (e.g., multiple layers). In this paper, we introduce the Linear Centroids Hypothesis (LCH) as a new framework for identifying the features of a DN. The LCH posits that features correspond to linear directions of centroids, which are vector summarizations of the functional behavior of a DN in a local region of its input space. Interpretability studies under the LCH can leverage existing LRH tools, such as sparse autoencoders, by applying them to the DN’s centroids rather than to its latent activations. We demonstrate that doing so yields sparser feature dictionaries for DINO vision transformers, which also perform better on downstream tasks. The LCH also inspires novel approaches to interpretability; for example, LCH can readily identify circuits in GPT2-Large. Code to study the LCH is available at this https URL.

[LG-49] Active Imitation Learning for Thermal- and Kernel-Aware LFM Inference on 3D S-NUCA Many-Cores

链接: https://arxiv.org/abs/2604.11948
作者: Yixian Shen,Chaoyao Shen,Jan Deen,George Floros,Andy Pimentel,Anuj Pathania
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: Accepted for publication at the 63rd ACM/IEEE Design Automation Conference (DAC 2026)

点击查看摘要

Abstract:Large Foundation Model (LFM) inference is both memory- and compute-intensive, traditionally relying on GPUs. However, the limited availability and high cost have motivated the adoption of high-performance general-purpose CPUs, especially emerging 3D-stacked Static Non-Uniform Cache Architecture (3D S-NUCA) systems. These architectures offer enhanced bandwidth and locality but suffer from severe thermal challenges and uneven cache latencies due to 3D Networks-on-Chip (NoC). Optimal management of thread migration and V/f scaling is non-trivial due to LFM kernel diversity and system heterogeneity. Existing thermal management approaches often rely on oversimplified analytical models and lack adaptability. We propose AILFM, an Active Imitation Learning (AIL)-based scheduling framework that learns near-optimal thermal-aware scheduling policies from Oracle demonstrations with minimal run-time overhead. AILFM accounts for both core-level performance heterogeneity and kernel-specific behavior in LFMs to maintain thermal safety while maximizing performance. Extensive experiments show that AILFM outperforms state-of-the-art baselines and generalizes well across diverse LFM workloads.

[LG-50] A unified data format for managing diabetes time-series data: DIAbetes eXchange (DIAX)

链接: https://arxiv.org/abs/2604.11944
作者: Elliott C. Pryor,Marc D. Breton,Anas El Fathi
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 7 pages, 2 figures

点击查看摘要

Abstract:Diabetes devices, including Continuous Glucose Monitoring (CGM), Smart Insulin Pens, and Automated Insulin Delivery systems, generate rich time-series data widely used in research and machine learning. However, inconsistent data formats across sources hinder sharing, integration, and analysis. We present DIAX (DIAbetes eXchange), a standardized JSON-based format for unifying diabetes time-series data, including CGM, insulin, and meal signals. DIAX promotes interoperability, reproducibility, and extensibility, particularly for machine learning applications. An open-source repository provides tools for dataset conversion, cross-format compatibility, visualization, and community contributions. DIAX is a translational resource, not a data host, ensuring flexibility without imposing data-sharing constraints. Currently, DIAX is compatible with other standardization efforts and supports major datasets (DCLP3, DCLP5, IOBP2, PEDAP, T1Dexi, Loop), totaling over 10 million patient-hours of data. this https URL
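
As a rough illustration of what a unified JSON time-series record might look like, the sketch below builds and round-trips a DIAX-style entry. The field names (`subject_id`, `signals`, `value_mg_dl`, etc.) are invented for illustration; the actual schema is defined in the DIAX repository.

```python
import json

# Hypothetical DIAX-style record; field names are illustrative only and
# do not come from the official DIAX specification.
record = {
    "subject_id": "demo-001",
    "signals": {
        "cgm": [  # continuous glucose readings
            {"time": "2026-04-15T08:00:00Z", "value_mg_dl": 112},
            {"time": "2026-04-15T08:05:00Z", "value_mg_dl": 118},
        ],
        "insulin": [
            {"time": "2026-04-15T08:02:00Z", "units": 4.0, "kind": "bolus"},
        ],
        "meal": [
            {"time": "2026-04-15T08:01:00Z", "carbs_g": 45},
        ],
    },
}

# A JSON-based format makes round-tripping through plain text lossless.
serialized = json.dumps(record, sort_keys=True)
roundtrip = json.loads(serialized)
```

The appeal of such a format is that every device stream (CGM, insulin, meal) lives under one timestamped container, so converters only need to map vendor fields onto the shared keys.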

[LG-51] ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems

链接: https://arxiv.org/abs/2604.11943
作者: Daeyeon Son
类目: Operating Systems (cs.OS); Machine Learning (cs.LG)
*备注: 13 pages, 9 tables

点击查看摘要

Abstract:An OS kernel that runs LLM inference internally can read logit distributions before any text is generated – and act on them as a governance primitive. I present ProbeLogits, a kernel-level operation that performs a single forward pass and reads specific token logits to classify agent actions as safe or dangerous, with zero learned parameters. On a 260-prompt OS action benchmark (9 categories including adversarial attacks), ProbeLogits achieves F1=0.980, Precision=1.000, and Recall=0.960 using a general-purpose 7B model at 4-bit quantization. On ToxicChat (1,000 human-annotated real conversations), it achieves F1=0.790 at default calibration strength \alpha =1.0, improving to F1=0.837 at \alpha =0.5 – 89% of Llama Guard 3’s F1~0.939 with zero learned parameters. A key design contribution is the calibration strength \alpha , which serves as a deployment-time policy knob rather than a learned hyperparameter. By adjusting \alpha , the OS can enforce strict policies for privileged operations ( \alpha \geq 0.8 , maximizing recall) or relaxed policies for conversational agents ( \alpha =0.5, maximizing precision). Contextual calibration improves accuracy from 64.8% to 97.3% on the custom benchmark. I implement ProbeLogits within Anima OS, a bare-metal x86_64 OS written in 80,400 lines of Rust. Because agent actions must pass through kernel-mediated host functions, ProbeLogits enforcement operates below the WASM sandbox boundary, making it significantly harder to circumvent than application-layer classifiers. Each classification costs 65ms on 7B – fast enough for per-action governance. I also show that treating KV cache as process state enables checkpoint, restore, and fork operations analogous to traditional process management. To my knowledge, no prior system exposes LLM logit vectors as OS-level governance primitives.
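
The abstract's core idea, reading a handful of token logits in a single forward pass and thresholding them under a tunable calibration strength \alpha, can be sketched as follows. The two-logit setup, the function name, and the rule of scaling the danger logit by \alpha are illustrative assumptions, not the paper's exact calibration.

```python
import math

def classify_action(safe_logit: float, danger_logit: float,
                    alpha: float = 1.0) -> str:
    # Toy ProbeLogits-style check: read two token logits from one forward
    # pass and threshold the danger probability. Scaling the danger logit
    # by alpha is an illustrative stand-in for the paper's calibration.
    z_safe = math.exp(safe_logit)
    z_danger = math.exp(alpha * danger_logit)
    p_danger = z_danger / (z_safe + z_danger)
    return "dangerous" if p_danger >= 0.5 else "safe"
```

This makes the policy-knob behavior concrete: a borderline action can flip from "dangerous" under a strict setting (\alpha = 1.0) to "safe" under a relaxed one (\alpha = 0.5), with no retraining.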

[LG-52] Fast and principled equation discovery from chaos to climate

链接: https://arxiv.org/abs/2604.11929
作者: Yuzheng Zhang,Weizhen Li,Rui Carvalho
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Computational Physics (physics.comp-ph)
*备注: 34 pages, 8 figures

点击查看摘要

Abstract:Our ability to predict, control, and ultimately understand complex systems rests on discovering the equations that govern their dynamics. Identifying these equations directly from noisy, limited observations has therefore become a central challenge in data-driven science, yet existing library-based sparse regression methods force a compromise between automation, statistical rigor, and computational efficiency. Here we develop Bayesian-ARGOS, a hybrid framework that reconciles these demands by combining rapid frequentist screening with focused Bayesian inference, enabling automated equation discovery with principled uncertainty quantification at a fraction of the computational cost of existing methods. Tested on seven chaotic systems under varying data scarcity and noise levels, Bayesian-ARGOS outperforms two state-of-the-art methods in most scenarios. It surpasses SINDy in data efficiency for all systems and noise tolerance for six out of the seven, with a two-order-of-magnitude reduction in computational cost compared to bootstrap-based ARGOS. The probabilistic formulation additionally enables a suite of standard statistical diagnostics, including influence analysis and multicollinearity detection that expose failure modes otherwise opaque. When integrated with representation learning (SINDy-SHRED) for high dimensional sea surface temperature reconstruction, Bayesian-ARGOS increases the yield of valid latent equations with significantly improved long horizon stability. Bayesian-ARGOS thus provides a principled, automated, and computationally efficient route from scarce and noisy observations to interpretable governing equations, offering a practical framework for equation discovery across scales, from benchmark chaotic systems to the latent dynamics underlying global climate patterns.

[LG-53] INTARG: Informed Real-Time Adversarial Attack Generation for Time-Series Regression

链接: https://arxiv.org/abs/2604.11928
作者: Gamze Kirman Tokgoz,Onat Gungor,Tajana Rosing,Baris Aksanli
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Time-series forecasting aims to predict future values by modeling temporal dependencies in historical observations. It is a critical component of many real-world systems, where accurate forecasts improve operational efficiency and help mitigate uncertainty and risk. More recently, machine learning (ML), and especially deep learning (DL)-based models, have gained widespread adoption for time-series forecasting, but they remain vulnerable to adversarial attacks. However, many state-of-the-art attack methods are not directly applicable in time-series settings, where storing complete historical data or performing attacks at every time step is often impractical. This paper proposes an adversarial attack framework for time-series forecasting under an online bounded-buffer setting, leveraging an informed and selective attack strategy. By selectively targeting time steps where the model exhibits high confidence and the expected prediction error is maximal, our framework produces fewer but substantially more effective attacks. Experiments show that our framework can increase the prediction error up to 2.42x, while performing attacks in fewer than 10% of time steps.
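
A minimal sketch of the selective-targeting idea: score every time step by model confidence times expected prediction error, then attack only the top-scoring steps within a sub-10% budget. The multiplicative score and the random stand-in signals are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
confidence = rng.random(T)      # stand-in for per-step model confidence
expected_error = rng.random(T)  # stand-in for expected prediction error

budget = int(0.10 * T)          # attack fewer than 10% of time steps
score = confidence * expected_error
attack_steps = np.argsort(score)[-budget:]  # most promising steps only
```

The point of the selection is that a bounded-buffer attacker cannot perturb every step, so concentrating the budget where confidence and expected error are jointly high yields fewer but more damaging perturbations.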

[LG-54] Subcritical Signal Propagation at Initialization in Normalization-Free Transformers

链接: https://arxiv.org/abs/2604.11890
作者: Sergey Alekseev
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 10 pages main text; 33 pages total; 5 figures in the main text, 24 figures total; preprint

点击查看摘要

Abstract:We study signal propagation at initialization in transformers through the averaged partial Jacobian norm (APJN), a measure of gradient amplification across layers. We extend APJN analysis to transformers with bidirectional attention and permutation-symmetric input token configurations by deriving recurrence relations for activation statistics and APJNs across layers. Our theory predicts how attention modifies the asymptotic behavior of the APJN at large depth and matches APJNs measured in deep vision transformers. The criticality picture known from residual networks carries over to transformers: the pre-LayerNorm architecture exhibits power-law APJN growth, whereas transformers with LayerNorm replaced by elementwise \tanh -like nonlinearities have stretched-exponential APJN growth, indicating that the latter are subcritical. Applied to Dynamic Tanh (DyT) and Dynamic erf (Derf) transformers, the theory explains why these architectures can be more sensitive to initialization and optimization choices and require careful tuning for stable training.

[LG-55] When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation

链接: https://arxiv.org/abs/2604.11840
作者: Sandro Andric
类目: Machine Learning (cs.LG)
*备注: 12 pages, 5 figures, supplementary material included as ancillary file

点击查看摘要

Abstract:Large language models are increasingly used as agents in social, economic, and policy simulations. A common assumption is that stronger reasoning should improve simulation fidelity. We argue that this assumption can fail when the objective is not to solve a strategic problem, but to sample plausible boundedly rational behavior. In such settings, reasoning-enhanced models can become better solvers and worse simulators: they can over-optimize for strategically dominant actions, collapse compromise-oriented terminal behavior, and sometimes exhibit a diversity-without-fidelity pattern in which local variation survives without outcome-level fidelity. We study this solver-sampler mismatch in three multi-agent negotiation environments adapted from earlier simulation work: an ambiguous fragmented-authority trading-limits scenario, an ambiguous unified-opposition trading-limits scenario, and a new-domain grid-curtailment case in emergency electricity management. We compare three reflection conditions, no reflection, bounded reflection, and native reasoning, across two primary model families and then extend the same protocol to direct OpenAI runs with GPT-4.1 and GPT-5.2. Across all three experiments, bounded reflection produces substantially more diverse and compromise-oriented trajectories than either no reflection or native reasoning. In the direct OpenAI extension, GPT-5.2 native ends in authority decisions in 45 of 45 runs across the three experiments, while GPT-5.2 bounded recovers compromise outcomes in every environment. The contribution is not a claim that reasoning is generally harmful. It is a methodological warning: model capability and simulation fidelity are different objectives, and behavioral simulation should qualify models as samplers, not only as solvers.

[LG-56] Uncertainty Quantification in CNN Through the Bootstrap of Convex Neural Networks AAAI2021

链接: https://arxiv.org/abs/2604.11833
作者: Hongfei Du,Emre Barut,Fang Jin
类目: Machine Learning (cs.LG)
*备注: 9 pages, 1 figure. Accepted at AAAI 2021

点击查看摘要

Abstract:Despite the popularity of Convolutional Neural Networks (CNN), the problem of uncertainty quantification (UQ) of CNN has been largely overlooked. The lack of efficient UQ tools severely limits the application of CNN in certain areas, such as medicine, where prediction uncertainty is critically important. Among the few existing UQ approaches that have been proposed for deep learning, none has theoretical consistency guarantees on the uncertainty quality. To address this issue, we propose a novel bootstrap-based framework for the estimation of prediction uncertainty. The inference procedure we use relies on convexified neural networks to establish the theoretical consistency of the bootstrap. Our approach has a significantly lower computational load than its competitors, as it relies on warm starts at each bootstrap iteration that avoid refitting the model from scratch. We further explore a novel transfer learning method so that our framework can work on arbitrary neural networks. We experimentally demonstrate that our approach performs much better than baseline CNNs and state-of-the-art methods on various image datasets.
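
The warm-start trick the abstract describes can be sketched with a plain linear model: fit once on the full data, then run only a few gradient steps per bootstrap resample starting from the full-data solution instead of refitting from scratch. The toy model and step counts are illustrative; the paper works with convexified CNNs.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

def fit(X, y, w0, steps=200, lr=0.05):
    # Plain gradient descent on squared loss; warm-starting from w0 stands
    # in for the paper's warm-started refits that avoid fitting from scratch.
    w = w0.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

w_full = fit(X, y, np.zeros(3))
preds = []
for _ in range(50):                              # bootstrap resamples
    idx = rng.integers(0, len(y), size=len(y))
    w_b = fit(X[idx], y[idx], w_full, steps=30)  # warm start, few steps
    preds.append((X[:1] @ w_b)[0])
lo, hi = np.percentile(preds, [2.5, 97.5])       # prediction interval
```

Each bootstrap fit uses 30 steps instead of 200 because it starts near the optimum, which is the source of the computational saving the abstract claims.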

[LG-57] Refined Differentially Private Linear Regression via Extension of a Free Lunch Result SIGMOD2026

链接: https://arxiv.org/abs/2604.11820
作者: Sasmita Harini S,Anshoo Tandon
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Accepted at the SeQureDB Workshop at ACM SIGMOD 2026 and TPDP Workshop 2026

点击查看摘要

Abstract:As data-privacy regulations tighten and statistical models are increasingly deployed on sensitive human-sourced data, privacy-preserving linear regression has become a critical necessity. For the add-remove DP model, Kulesza et al. (2024) and Fitzsimons et al. (2024) have independently shown that the size of the dataset – an important statistic for linear regression – can be privately estimated for “free”, via a simplex transformation of bounded variables and private sum queries on the transformed variables. In this work, we extend this free lunch result via carefully crafted multidimensional simplex transformations to variables and functions that are bounded in the interval [0,1]. We show that these transformations can be applied to refine the estimates of sufficient statistics needed for private simple linear regression based on ordinary least squares. We provide both analytical and numerical results to demonstrate the superiority of our approach. Our proposed transformations have general applicability and can be readily adapted for differentially private polynomial regression.
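
As background for the private sum queries the free-lunch construction relies on, the sketch below adds Laplace noise calibrated to add-remove sensitivity for variables bounded in [0,1]. The simplex transformations themselves are not reproduced; this only illustrates the query primitive.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.random(500)   # variables already bounded in [0, 1]
epsilon = 1.0         # privacy budget for this query

def private_sum(values, eps, rng):
    # Under the add-remove DP model, removing one record changes the sum
    # of [0,1]-bounded values by at most 1, so Laplace noise with scale
    # 1/eps gives eps-differential privacy for the sum query.
    return values.sum() + rng.laplace(scale=1.0 / eps)

noisy_sum = private_sum(x, epsilon, rng)
```

Statistics for least-squares regression (sums, cross-products of bounded transformed variables) can then be assembled from such noisy sums, which is the setting the refined transformations target.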

[LG-58] Classical and Quantum Speedups for Non-Convex Optimization via Energy Conserving Descent

链接: https://arxiv.org/abs/2604.13022
作者: Yihang Sun,Huaijin Wang,Patrick Hayden,Jose Blanchet
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 33 pages, 2 figures

点击查看摘要

Abstract:The Energy Conserving Descent (ECD) algorithm was recently proposed (De Luca & Silverstein, 2022) as a global non-convex optimization method. Unlike gradient descent, appropriately configured ECD dynamics escape strict local minima and converge to a global minimum, making it appealing for machine learning optimization. We present the first analytical study of ECD, focusing on the one-dimensional setting for this first installment. We formalize a stochastic ECD dynamics (sECD) with energy-preserving noise, as well as a quantum analog of the ECD Hamiltonian (qECD), providing the foundation for a quantum algorithm through Hamiltonian simulation. For positive double-well objectives, we compute the expected hitting time from a local to the global minimum. We prove that both sECD and qECD yield exponential speedup over respective gradient descent baselines–stochastic gradient descent and its quantization. For objectives with tall barriers, qECD achieves a further speedup over sECD.

[LG-59] Causal Diffusion Models for Counterfactual Outcome Distributions in Longitudinal Data

链接: https://arxiv.org/abs/2604.12992
作者: Farbod Alinezhad,Jianfei Cao,Gary J. Young,Brady Post
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)
*备注:

点击查看摘要

Abstract:Predicting counterfactual outcomes in longitudinal data, where sequential treatment decisions heavily depend on evolving patient states, is critical yet notoriously challenging due to complex time-dependent confounding and inadequate uncertainty quantification in existing methods. We introduce the Causal Diffusion Model (CDM), the first denoising diffusion probabilistic approach explicitly designed to generate full probabilistic distributions of counterfactual outcomes under sequential interventions. CDM employs a novel residual denoising architecture with relational self-attention, capturing intricate temporal dependencies and multimodal outcome trajectories without requiring explicit adjustments (e.g., inverse-probability weighting or adversarial balancing) for confounding. In rigorous evaluation on a pharmacokinetic-pharmacodynamic tumor-growth simulator widely adopted in prior work, CDM consistently outperforms state-of-the-art longitudinal causal inference methods, achieving a 15-30% relative improvement in distributional accuracy (1-Wasserstein distance) while maintaining competitive or superior point-estimate accuracy (RMSE) under high-confounding regimes. By unifying uncertainty quantification and robust counterfactual prediction in complex, sequentially confounded settings, without tailored deconfounding, CDM offers a flexible, high-impact tool for decision support in medicine, policy evaluation, and other longitudinal domains.

[LG-60] Token Encoding for Semantic Recovery

链接: https://arxiv.org/abs/2604.12931
作者: Jingzhi Hu,Geoffrey Ye Li
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Token-based semantic communication is promising for future wireless networks, as it can compact semantic tokens under very limited channel capacity. However, harsh wireless channels often cause missing tokens, leading to severe distortion that prevents reliable semantic recovery at the receiver. In this article, we propose a token encoding framework for robust semantic recovery (TokCode), which incurs no additional transmission overhead and supports plug-and-play deployment. For efficient token encoder optimization, we develop a sentence-semantic-guided foundation model adaptation algorithm (SFMA) that avoids costly end-to-end training. Based on simulation results on prompt-based generative image transmission, TokCode mitigates semantic distortion and can approach the performance upper-bound, even under harsh channels where 40% to 60% of tokens are randomly lost.

[LG-61] Rapid LoRA Aggregation for Wireless Channel Adaptation in Open-Set Radio Frequency Fingerprinting

链接: https://arxiv.org/abs/2604.12834
作者: Mingxi Zhang,Renjie Xie,Jincheng Wang,Guyue Li,Wei Xu
类目: Signal Processing (eess.SP); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 6 pages

点击查看摘要

Abstract:Radio frequency fingerprints (RFFs) enable secure wireless authentication but struggle in open-set scenarios with unknown devices and varying channels. Existing methods face challenges in generalization and incur high computational costs. We propose a lightweight, self-adaptive RFF extraction framework using Low-Rank Adaptation (LoRA). By pretraining LoRA modules per environment, our method enables fast adaptation to unseen channel conditions without full retraining. During inference, a weighted combination of LoRAs dynamically enhances feature extraction. Experimental results demonstrate a 15% reduction in equal error rate (EER) compared to non-finetuned baselines and an 83% decrease in training time relative to full fine-tuning, using the same training dataset. This approach provides a scalable and efficient solution for open-set RFF authentication in dynamic wireless vehicular networks.
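
The inference-time step, forming a weighted combination of per-environment LoRA updates on top of a frozen base weight, can be sketched as follows. The dimensions, the number of adapters, and the weighting scheme are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 16, 4                 # feature dim and LoRA rank (illustrative)
W = rng.normal(size=(d, d))  # frozen base weight

# One pretrained low-rank adapter (A, B) per training environment.
adapters = [(rng.normal(size=(d, r)), rng.normal(size=(r, d)))
            for _ in range(3)]

def adapted_weight(W, adapters, weights):
    # Weighted combination of per-environment LoRA updates at inference
    # time; the base weight W is never retrained.
    delta = sum(w * (A @ B) for w, (A, B) in zip(weights, adapters))
    return W + delta

weights = np.array([0.6, 0.3, 0.1])  # e.g. from channel similarity (assumed)
W_eff = adapted_weight(W, adapters, weights)
```

Because each adapter stores only d×r + r×d parameters, combining them per channel condition is far cheaper than full fine-tuning, which is where the reported 83% training-time reduction comes from.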

[LG-62] On Higher-Order Geometric Refinements of Classical Covariance Asymptotics: An Approach via Intrinsic and Extrinsic Information Geometry

链接: https://arxiv.org/abs/2604.12725
作者: Malik Amir,Sourangshu Ghosh
类目: atistics Theory (math.ST); Machine Learning (cs.LG); Algebraic Geometry (math.AG); Differential Geometry (math.DG)
*备注:

点击查看摘要

Abstract:Classical Fisher-information asymptotics describe the covariance of regular efficient estimators through the local quadratic approximation of the log-likelihood, and thus capture first-order geometry only. In curved models, including mixtures, curved exponential families, latent-variable models, and manifold-constrained parameter spaces, finite-sample behavior can deviate systematically from these predictions. We develop a coordinate-invariant, curvature-aware refinement by viewing a regular parametric family as a Riemannian manifold (\Theta, g) with Fisher–Rao metric, immersed in L^2(\mu) through the square-root density map. Under suitable regularity and moment assumptions, we derive an n^{-2} correction to the leading n^{-1} I(\theta)^{-1} covariance term for score-root, first-order efficient estimators. The correction is governed by a tensor P_{ij} that decomposes canonically into three parts: an intrinsic Ricci-type contraction of the Fisher–Rao curvature tensor, an extrinsic Gram-type contraction of the second fundamental form, and a Hellinger discrepancy tensor encoding higher-order probabilistic information not determined by immersion geometry alone. The extrinsic term is positive semidefinite, the full correction is invariant under smooth reparameterization, and it vanishes identically for full exponential families. We then extend the picture to singular models, where Fisher information degenerates. Using resolution of singularities under an additive normal crossing assumption, we describe the resolved metric, the role of the real log canonical threshold in learning rates and posterior mean-squared error, and a curvature-based covariance expansion on the resolved space that recovers the regular theory as a special case. This framework also suggests geometric diagnostics of weak identifiability and curvature-aware principles for regularization and optimization.

[LG-63] Data-driven Reachable Set Estimation with Tunable Adversarial and Wasserstein Distributional Guarantees

链接: https://arxiv.org/abs/2604.12654
作者: Georgios Pantazis,Michelle S. Chong
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We study finite horizon reachable set estimation for unknown discrete-time dynamical systems using only sampled state trajectories. Rather than treating scenario optimization as a black-box tool, we show how it can be tailored to reachable set estimation, where one must learn a family of sets based on whole trajectories, while preserving probabilistic guarantees on future trajectory inclusion for the entire horizon. To this end, we formulate a relaxed scenario program with slack variables that yields a tunable trade-off between reachable set size and out-of-sample trajectory inclusion over the horizon, thereby reducing sensitivity to outliers. Leveraging the recent results in adversarially robust scenario optimization, we then extend this formulation to account for bounded adversarial perturbations of the observed trajectories and derive a posteriori probabilistic guarantees on future trajectory inclusion. When probability distribution shifts in the Wasserstein distance occur, we obtain an explicit bound on how gracefully the theoretical probabilistic guarantees degrade. For different geometries, i.e., p -norm balls, ellipsoids, and zonotopes, we derive tractable convex reformulations and corroborate our theoretical results in simulation.
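
A toy version of the data-driven estimation task: at each time step, fit a Euclidean ball around the sampled states, discarding a small fraction of the worst outliers as a stand-in for the relaxed scenario program's slack variables. The actual method solves a scenario program with formal probabilistic guarantees; this only illustrates the size/coverage trade-off.

```python
import numpy as np

rng = np.random.default_rng(5)
N, T, d = 100, 5, 2
# Sampled state trajectories of an unknown system (stand-in random walk).
traj = np.cumsum(rng.normal(size=(N, T, d)), axis=1)

def ball_reach_set(states, discard=0.05):
    # Euclidean-ball reachable set estimate at one time step: center at
    # the sample mean, radius covering all but a `discard` fraction of
    # samples. Dropping the largest residuals plays the role of the slack
    # variables in the relaxed scenario program (an illustrative
    # simplification of the paper's optimization-based construction).
    center = states.mean(axis=0)
    r = np.linalg.norm(states - center, axis=1)
    radius = np.quantile(r, 1 - discard)
    return center, radius

sets = [ball_reach_set(traj[:, t]) for t in range(T)]
```

Increasing `discard` shrinks the estimated sets at the cost of weaker out-of-sample trajectory inclusion, which is exactly the tunable trade-off the slack formulation exposes.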

[LG-64] A Bayesian Perspective on the Role of Epistemic Uncertainty for Delayed Generalization in In-Context Learning

链接: https://arxiv.org/abs/2604.12434
作者: Abdessamed Qchohi,Simone Rossi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In-context learning enables transformers to adapt to new tasks from a few examples at inference time, while grokking highlights that this generalization can emerge abruptly only after prolonged training. We study task generalization and grokking in in-context learning using a Bayesian perspective, asking what enables the delayed transition from memorization to generalization. Concretely, we consider modular arithmetic tasks in which a transformer must infer a latent linear function solely from in-context examples and analyze how predictive uncertainty evolves during training. We combine approximate Bayesian techniques to estimate the posterior distribution and we study how uncertainty behaves across training and under changes in task diversity, context length, and context noise. We find that epistemic uncertainty collapses sharply when the model groks, making uncertainty a practical label-free diagnostic of generalization in transformers. Additionally, we provide theoretical support with a simplified Bayesian linear model, showing that asymptotically both delayed generalization and uncertainty peaks arise from the same underlying spectral mechanism, which links grokking time to uncertainty dynamics.
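
The label-free diagnostic, tracking epistemic uncertainty, can be illustrated with the standard ensemble-disagreement decomposition: predictive entropy minus expected member entropy. This mutual-information proxy is a common stand-in and not necessarily the approximate Bayesian estimator used in the paper.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + 1e-12)).sum())

def epistemic_uncertainty(member_probs):
    # Disagreement-based epistemic uncertainty for an ensemble of
    # predictive distributions (one row per member): predictive entropy
    # of the mean minus the mean member entropy. This mutual-information
    # term collapses to zero when all members agree.
    member_probs = np.asarray(member_probs, dtype=float)
    total = entropy(member_probs.mean(axis=0))        # predictive entropy
    aleatoric = float(np.mean([entropy(p) for p in member_probs]))
    return total - aleatoric                          # epistemic part

disagreeing = [[0.9, 0.1], [0.1, 0.9]]  # members disagree: high epistemic term
agreeing = [[0.9, 0.1], [0.9, 0.1]]     # members agree: epistemic term ~ 0
```

The paper's observation is that this quantity stays high during memorization and drops sharply at the grokking transition, making it usable as a label-free signal of generalization.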

[LG-65] Machine learning for four-dimensional SU(3) lattice gauge theories

链接: https://arxiv.org/abs/2604.12416
作者: Urs Wenger
类目: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG)
*备注: 18 pages, 9 figure; Plenary talk at the 42nd International Symposium on Lattice Field Theory (LATTICE2025), Mumbai, India

点击查看摘要

Abstract:In this review I summarize how machine learning can be used in lattice gauge theory simulations and what approaches are currently available to improve the sampling of gauge field configurations, with a focus on applications in four-dimensional SU(3) gauge theories. These include approaches based on generative machine-learning models such as (stochastic) normalizing flows and diffusion processes, and an approach based on renormalization group (RG) transformations, more specifically the machine learning of RG-improved gauge actions using gauge-equivariant convolutional neural networks. In particular, I present scaling results for a machine-learned fixed-point action in four-dimensional SU(3) gauge theory towards the continuum limit. The results include observables based on the classically perfect gradient-flow scales, which are free of tree-level lattice artefacts to all orders, and quantities related to the static potential and the deconfinement transition.

[LG-66] Cross-Domain Transfer with Particle Physics Foundation Models: From Jets to Neutrino Interactions

链接: https://arxiv.org/abs/2604.12364
作者: Gregor Krzmanc,Vinicius Mikuni,Benjamin Nachman,Callum Wilkinson
类目: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 12 pages, 8 figures

点击查看摘要

Abstract:Future AI-based studies in particle physics will likely start from a foundation model to accelerate training and enhance sensitivity. As a step towards a general-purpose foundation model for particle physics, we investigate whether the OmniLearned foundation model pre-trained on diverse high- Q^2 simulated and real pp and ep collisions can be effectively transferred to a few-GeV fixed-target neutrino experiment. We process MINERvA neutrino–nucleus scattering events and evaluate pre-trained models on two types of tasks: regression of available energy and binary classification of charged-current pion final states ( \mathrmCC1\pi^\pm , \mathrmCCN\pi^\pm , and \mathrmCC1\pi^0 ). Pre-trained OmniLearned models consistently outperform similarly sized models trained from scratch, achieving better overall performance at the same compute budget, as well as achieving better performance at the same number of training steps. These results suggest that particle-level foundation models acquire inductive biases that generalize across large differences in energy scale, detector technology, and underlying physics processes, pointing toward a paradigm of detector-agnostic inference in particle physics.

[LG-67] Information-Geometric Decomposition of Generalization Error in Unsupervised Learning

链接: https://arxiv.org/abs/2604.12340
作者: Gilhan Kim
类目: Machine Learning (stat.ML); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 21 pages, 3 figures

点击查看摘要

Abstract:We decompose the Kullback–Leibler generalization error (GE) – the expected KL divergence from the data distribution to the trained model – of unsupervised learning into three non-negative components: model error, data bias, and variance. The decomposition is exact for any e-flat model class and follows from two identities of information geometry: the generalized Pythagorean theorem and a dual e-mixture variance identity. As an analytically tractable demonstration, we apply the framework to \epsilon -PCA, a regularized principal component analysis in which the empirical covariance is truncated at rank N_K and discarded directions are pinned at a fixed noise floor \epsilon . Although rank-constrained \epsilon -PCA is not itself e-flat, it admits a technical reformulation with the same total GE on isotropic Gaussian data, under which each component of the decomposition takes closed form. The optimal rank emerges as the cutoff \lambda_{\mathrm{cut}}^* = \epsilon – the model retains exactly those empirical eigenvalues exceeding the noise floor – with the cutoff reflecting a marginal-rate balance between model-error gain and data-bias cost. A boundary comparison further yields a three-regime phase diagram – retain-all, interior, and collapse – separated by the lower Marchenko–Pastur edge and an analytically computable collapse threshold \epsilon_*(\alpha) , where \alpha is the dimension-to-sample-size ratio. All claims are verified numerically.
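
The \epsilon-PCA rule stated in the abstract, retain empirical eigenvalues above the noise floor \epsilon and pin the rest at \epsilon, can be sketched directly (the data scales and dimensions below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, eps = 300, 10, 0.5
# Two strong directions (variances 9 and 4) above a weak noise floor.
X = rng.normal(size=(n, d)) * np.array([3.0, 2.0] + [0.3] * 8)

cov = X.T @ X / n
evals, evecs = np.linalg.eigh(cov)

# eps-PCA rule from the abstract: keep empirical eigenvalues above the
# noise floor eps, pin the discarded directions at eps itself.
kept = evals > eps
model_evals = np.where(kept, evals, eps)
rank = int(kept.sum())
```

With these scales, exactly the two signal eigenvalues exceed the cutoff, matching the abstract's characterization of the optimal rank as the count of empirical eigenvalues above \epsilon.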

[LG-68] Fine-tuning Factor Augmented Neural Lasso for Heterogeneous Environments

链接: https://arxiv.org/abs/2604.12288
作者: Jinhang Chai,Jianqing Fan,Cheng Gao,Qishuo Yin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: Authors are listed in alphabetical order

点击查看摘要

Abstract:Fine-tuning is a widely used strategy for adapting pre-trained models to new tasks, yet its methodology and theoretical properties in high-dimensional nonparametric settings with variable selection have not yet been developed. This paper introduces the fine-tuning factor augmented neural Lasso (FAN-Lasso), a transfer learning framework for high-dimensional nonparametric regression with variable selection that simultaneously handles covariate and posterior shifts. We use a low-rank factor structure to manage high-dimensional dependent covariates and propose a novel residual fine-tuning decomposition in which the target function is expressed as a transformation of a frozen source function and other variables to achieve transfer learning and nonparametric variable selection. This augmented feature from the source predictor allows for the transfer of knowledge to the target domain and reduces model complexity there. We derive minimax-optimal excess risk bounds for the fine-tuning FAN-Lasso, characterizing the precise conditions, in terms of relative sample sizes and function complexities, under which fine-tuning yields statistical acceleration over single-task learning. The proposed framework also provides a theoretical perspective on parameter-efficient fine-tuning methods. Extensive numerical experiments across diverse covariate- and posterior-shift scenarios demonstrate that the fine-tuning FAN-Lasso consistently outperforms standard baselines and achieves near-oracle performance even under severe target sample size constraints, empirically validating the derived rates.
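
The residual fine-tuning decomposition, freezing the source predictor and regressing the target on its output together with raw covariates, can be sketched with plain least squares (the paper's factor structure, neural network, and Lasso penalty are omitted; this is an illustrative linear stand-in):

```python
import numpy as np

rng = np.random.default_rng(6)
# Source task: plenty of data; target task: few samples with a shift.
Xs, Xt = rng.normal(size=(1000, 5)), rng.normal(size=(40, 5))
beta = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
ys = Xs @ beta + 0.1 * rng.normal(size=1000)
yt = Xt @ beta + 0.8 * Xt[:, 0] + 0.1 * rng.normal(size=40)  # posterior shift

beta_src = np.linalg.lstsq(Xs, ys, rcond=None)[0]  # pretrain, then freeze
f_src = Xt @ beta_src                              # frozen source predictions
# Residual fine-tuning: regress the target on the frozen source prediction
# augmented with raw covariates; the frozen predictor absorbs the shared
# structure, so the target fit only needs to learn the shift.
Z = np.column_stack([f_src, Xt])
theta = np.linalg.lstsq(Z, yt, rcond=None)[0]
```

Because the source predictor enters as an extra feature, the target model spends its limited 40 samples on the shift term alone, which is the statistical-acceleration mechanism the excess-risk bounds formalize.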

[LG-69] A Nonparametric Adaptive EWMA Control Chart for Binary Monitoring of Multiple Stream Processes

链接: https://arxiv.org/abs/2604.12095
作者: Faruk Muritala,Austin Brown,Dhrubajyoti Ghosh,Sherry Ni
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Monitoring binomial proportions across multiple independent streams is a critical challenge in Statistical Process Control (SPC), with applications from manufacturing to cybersecurity. While EWMA charts offer sensitivity to small shifts, existing implementations rely on asymptotic variance approximations that fail during early-phase monitoring. We introduce a Cumulative Standardized Binomial EWMA (CSB-EWMA) chart that overcomes this limitation by deriving the exact time-varying variance of the EWMA statistic for binary multiple-stream data, enabling adaptive control limits that ensure statistical rigor from the first sample. Through extensive simulations, we identify optimal smoothing (\lambda) and limit (L) parameters to achieve target in-control average run lengths (ARL0) of 370 and 500. The CSB-EWMA chart demonstrates rapid shift detection across both ARL0 targets, with out-of-control average run length (ARL1) dropping to 3-7 samples for moderate shifts (\delta=0.2), and exhibits exceptional robustness across different data distributions, with low ARL1 coefficients of variation (CV < 0.10 for small shifts) for both ARL0 = 370 and 500. This work provides practitioners with a distribution-free, sensitive, and theoretically sound tool for early change detection in binomial multiple-stream processes.
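
The key quantity, the exact time-varying EWMA variance that replaces the asymptotic approximation, has the standard closed form Var(z_t) = \sigma^2 \frac{\lambda}{2-\lambda} (1 - (1-\lambda)^{2t}); the sketch below uses it to build adaptive control limits for a binomial proportion. Parameter values are illustrative, and the chart's cumulative standardization details are not reproduced.

```python
import numpy as np

lam, L, p0, n = 0.1, 3.0, 0.05, 100  # smoothing, limit width, in-control rate, stream size
sigma2 = p0 * (1 - p0) / n           # variance of a binomial proportion

def control_limits(t):
    # Exact EWMA variance at sample t (no asymptotic approximation),
    # which is what makes the limits valid from the very first sample.
    var_t = sigma2 * lam / (2 - lam) * (1 - (1 - lam) ** (2 * t))
    half = L * np.sqrt(var_t)
    return p0 - half, p0 + half

def ewma(samples):
    z = p0
    for x in samples:
        z = lam * x + (1 - lam) * z
        yield z
```

Early-phase limits are strictly narrower than the asymptotic ones and widen toward them as t grows, which is why charts using only the asymptotic variance are miscalibrated during start-up.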

[LG-70] On the continuum limit of t-SNE for data visualization

链接: https://arxiv.org/abs/2604.12041
作者: Jeff Calder,Zhonggan Huang,Ryan Murray,Adam Pickarski
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Analysis of PDEs (math.AP); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:This work is concerned with the continuum limit of a graph-based data visualization technique called the t-Distributed Stochastic Neighbor Embedding (t-SNE), which is widely used for visualizing data in a variety of applications but is still poorly understood from a theoretical standpoint. The t-SNE algorithm produces visualizations by minimizing the Kullback-Leibler divergence between similarity matrices representing the high dimensional data and its low dimensional representation. We prove that, after a natural rescaling and in applicable parameter regimes where the number of data points n \to \infty and the similarity graph remains sparse, the Kullback-Leibler divergence is consistent with a continuum variational problem that involves a non-convex gradient regularization term and a penalty on the magnitude of the probability density function in the visualization space. These two terms represent the continuum limits of the attraction and repulsion forces in the t-SNE algorithm. Due to the lack of convexity in the continuum variational problem, the question of well-posedness is only partially resolved. We show that when both dimensions are 1 , the problem admits a unique smooth minimizer, along with an infinite number of discontinuous minimizers (interpreted in a relaxed sense). This aligns well with the empirically observed ability of t-SNE to separate data in seemingly arbitrary ways in the visualization. The energy is also very closely related to the famously ill-posed Perona-Malik equation, which is used for denoising and simplifying images. We present numerical results validating the continuum limit, provide some preliminary results about the delicate nature of the limiting energetic problem in higher dimensions, and highlight several problems for future work.
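
The discrete objective whose continuum limit is studied can be written in a few lines (a simplified version: a single fixed Gaussian bandwidth replaces t-SNE's per-point perplexity calibration, and similarities are normalized jointly rather than per row):

```python
import numpy as np

def tsne_kl(X, Y, sigma=1.0):
    """KL(P || Q) minimized by t-SNE (minimal sketch).

    P: similarities of the high-dimensional data X (Gaussian kernel).
    Q: similarities of the embedding Y (heavy-tailed Student-t kernel).
    """
    def pairwise_sq_dists(Z):
        s = np.sum(Z * Z, axis=1)
        return s[:, None] + s[None, :] - 2.0 * Z @ Z.T

    n = X.shape[0]
    off = ~np.eye(n, dtype=bool)              # exclude self-similarities

    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    P = np.where(off, P, 0.0)
    P /= P.sum()

    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))    # Student-t kernel
    Q = np.where(off, Q, 0.0)
    Q /= Q.sum()

    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```

An embedding that preserves the cluster structure of X attains a much smaller KL value than one that mixes the clusters, which is the attraction/repulsion balance whose continuum limit the paper analyzes.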

[LG-71] Agentic LLM Reasoning in a Self-Driving Laboratory for Air-Sensitive Lithium Halide Spinel Conductors

链接: https://arxiv.org/abs/2604.11957
作者: Yuxing Fei,Bernardus Rendy,Xiaochen Yang,Junhee Woo,Xu Huang,Chang Li,Shilong Wang,David Milsted,Yan Zeng,Gerbrand Ceder
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Self-driving laboratories promise to accelerate materials discovery. Yet current automated solid-state synthesis platforms are limited to ambient conditions, thereby precluding their use for air-sensitive materials. Here, we present A-Lab for Glovebox Powder Solid-state Synthesis (A-Lab GPSS), a robotic platform capable of synthesizing and characterizing air-sensitive inorganic materials under strict air-free conditions. By integrating an agentic AI framework into the A-Lab GPSS platform, we structure autonomous experimental design through abductive and inductive reasoning. We deploy this platform to explore the vast compositional space of lithium halide spinel solid-state ionic conductors. Across a synthesis campaign comprising 352 samples with diverse compositions, the system explores a broad chemical space, experimentally realizing 72% of the 171 possible pairwise combinations among the 19 metals considered in this study. Over the course of the campaign, the fraction of compositions exhibiting both good ionic conductivity (> 0.05 mS/cm) and high halide spinel phase purity increases from 1.33% in the first 75 agent-proposed samples to 5.33% in the final 75. Furthermore, by inspecting the AI’s reasoning processes, we reveal distinct yet complementary discovery strategies: abductive reasoning interrogates abnormal observations within already explored regions, whereas inductive reasoning expands the search into broader, previously unvisited chemical space. This work establishes a scalable platform for the autonomous discovery of complex, air-sensitive solid-state materials.

[LG-72] FlowBoost Reveals Phase Transitions and Spectral Structure in Finite Free Information Inequalities

链接: https://arxiv.org/abs/2604.11922
作者: Baran Hashemi
类目: Probability (math.PR); Machine Learning (cs.LG); Combinatorics (math.CO)
*备注:

点击查看摘要

Abstract:Using FlowBoost, a closed-loop deep generative optimization framework for extremal structure discovery, we investigate \ell^p -generalizations of the finite free Stam inequality for real-rooted polynomials under finite free additive convolution \boxplus_n . At p=2 , FlowBoost finds the Hermite pair as the unique equality case and reveals the spectral structure of the linearized convolution map at this extremal point. As a result, we conjecture that the singular values of the doubly stochastic coupling matrix E_n on the mean-zero subspace are \{2^{-k/2} : k=1,\ldots,n-1\} , independent of n . Conditional on this conjecture, we obtain a sharp local stability constant and the finite free CLT convergence rate, both uniform in n . We introduce a one-parameter family of p -Stam inequalities using \ell^p -Fisher information and prove that the Hermite pair itself violates the inequality for every p > 2 , with the sign of the deficit governed by the \ell^p -contraction ratio of E_n . Systematic computation via FlowBoost supports the conjecture that p^* = 2 is the sharp critical exponent. For p < 2 , the extremal configurations undergo a bifurcation, meaning that they become non-matching pairs with bimodal root structure, converging back to the Hermite diagonal only as p \to 2^- . Our findings demonstrate that FlowBoost can be an effective tool of mathematical discovery in infinite-dimensional extremal problems.

[LG-73] Obtaining Partition Crossover masks using Statistical Linkage Learning for solving noised optimization problems with hidden variable dependency structure

链接: https://arxiv.org/abs/2604.11862
作者: M.W. Przewozniczek,B. Frej,M.M. Komarnicki,M. Prusik,R. Tinós
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In optimization problems, some variable subsets may have a joint non-linear or non-monotonic influence on the function value. Therefore, knowledge of variable dependencies may be crucial for effective optimization, and many state-of-the-art optimizers leverage it to improve performance. However, some real-world problem instances may be subject to noise of various origins. In such a case, variable dependencies relevant to optimization may be hard or impossible to detect using dependency checks that suffice for noise-free problems, rendering highly effective operators, e.g., Partition Crossover (PX), useless. Therefore, we use Statistical Linkage Learning (SLL) to decompose problems with noise and propose a new SLL-dedicated mask construction algorithm. We prove that if the quality of the SLL-based decomposition is sufficiently high, the proposed clustering algorithm yields masks equivalent to PX masks for the noise-free instances. The experiments show that the optimizer using the proposed mechanisms remains equally effective regardless of the noise level and outperforms state-of-the-art optimizers on problems with high noise.
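
PX-style recombination from linkage masks can be sketched as follows, under the simplifying assumption that the objective decomposes additively over the masks (which is what makes independent per-component choices optimal; SLL itself and the paper's noise handling are not shown):

```python
def partition_crossover(parent_a, parent_b, masks, subfunctions):
    """Greedy PX-style recombination over variable clusters (minimal sketch).

    masks: list of disjoint index lists, one per variable cluster.
    subfunctions: per-mask partial objective f_i(bits) to MAXIMIZE; the full
    objective is assumed to be the sum of the f_i over their masks.
    """
    child = list(parent_a)
    for mask, f in zip(masks, subfunctions):
        a_bits = tuple(parent_a[i] for i in mask)
        b_bits = tuple(parent_b[i] for i in mask)
        if f(b_bits) > f(a_bits):           # take this component from parent B
            for i, bit in zip(mask, b_bits):
                child[i] = bit
    return child
```

Because each cluster is chosen independently, the offspring is the best of all 2^q component-wise recombinations of the two parents (q = number of masks), so it is never worse than either parent on an additively separable objective.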

[LG-74] Inverse Design of Inorganic Compounds with Generative AI

链接: https://arxiv.org/abs/2604.11827
作者: Hannes Kneiding,Lucía Morán-González,Nishamol Kuriakose,Ainara Nova,David Balcells
类目: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning is revolutionizing chemistry. Beyond the value of predictive models accelerating virtual screening, generative AI aims at enabling inverse design, reversing the compound-to-property prediction paradigm into property-to-compound generation. Chemists now have access to a rich AI toolbox for organic chemistry, including drug discovery. However, the application of these methods to inorganic compounds remains limited by the challenges posed by their intrinsic nature. This Review analyzes how these challenges have been addressed, considering widely diverse systems ranging from molecules to crystals, including transition metal complexes and microporous materials. The analysis focuses on how generative AI methods have evolved towards data-representation-model pipelines that address the full complexity of inorganic compounds, including their chemical composition, geometry, symmetry, and electronic structure. Future directions, like benchmark standardization and the development of synthesizability metrics, are also discussed.

[LG-75] Training single-electron and single-photon stochastic physical neural networks

链接: https://arxiv.org/abs/2604.10861
作者: Tong Dou,Shiro Kumara,Josh Burns,Ethan Sigler,Parth Girdhar,David Petty,Gerard Milburn,Jo Plested,Matt Woolley
类目: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 15 pages, 8 figures

点击查看摘要

Abstract:The computational demands of deep learning motivate the investigation of alternative approaches to computation. One alternative is physical neural networks (PNNs), in which learning and inference are performed directly via physical processes. Stochastic PNNs arise when the underlying neurons are realized by the dynamics of a stochastic activation switch. Here we propose novel electronic and photonic stochastic neurons. The electronic realization is implemented by single-electron tunneling through a quantum dot. The photonic realization is implemented via a single-photon source driving one of two modes coupled via a controllable beam-splitter-like interaction. In the electronic case, the charge state of the quantum dot forms the basis for the stochastic neuron, whereas in the photonic case the occupation of the undriven mode serves as the basis for the stochastic neuron. Training of stochastic PNNs is performed with models of stochastic neurons, as well as with the coherently-driven, single-photon-detector stochastic neurons introduced previously. Several training strategies for MNIST handwritten digit classification have been investigated using single-hidden-layer stochastic PNNs, including varying the number of trials in each layer to control forward pass stochasticity and employing either true probability or empirical outputs in the backward pass to evaluate their influence on gradient estimation. We show that when empirical outputs are used in the backward pass, the network achieves more than 97% test accuracy with few trials per layer. Despite the simplicity of the model architecture, high test accuracy is maintained in the presence of a high degree of noise and model uncertainty. The results demonstrate the potential of embracing stochastic PNNs for deep learning.
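
A single stochastic layer of the kind described, where each neuron fires with a probability set by its input and the forward pass averages a finite number of trials, can be sketched generically (this models neither the single-electron nor the single-photon physics, only the abstract stochastic activation switch):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_layer(x, W, trials):
    """Forward pass of a stochastic binary layer (minimal sketch).

    Each neuron fires with probability p = sigmoid(W @ x); the layer output
    is the empirical firing rate over `trials` Bernoulli samples, mimicking
    repeated reads of a physical stochastic activation switch.
    """
    p = sigmoid(W @ x)
    samples = rng.random((trials, p.size)) < p    # Bernoulli(p) per trial
    return samples.mean(axis=0), p                # empirical rate, true prob
```

The backward pass can then use either the true probability p (a lower-variance surrogate gradient) or the empirical rate; the abstract reports that using empirical outputs in the backward pass reaches over 97% MNIST accuracy with few trials per layer.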

附件下载

点击下载今日全部论文列表