本篇博文主要内容为 2026-04-14 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。

说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。

提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。

目录

概览 (2026-04-14)

今日共更新1353篇论文,其中:

  • 自然语言处理209篇(Computation and Language (cs.CL))
  • 人工智能500篇(Artificial Intelligence (cs.AI))
  • 计算机视觉343篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习290篇(Machine Learning (cs.LG))
  • 多智能体系统28篇(Multiagent Systems (cs.MA))
  • 信息检索40篇(Information Retrieval (cs.IR))
  • 人机交互74篇(Human-Computer Interaction (cs.HC))

多智能体系统

[MA-0] GenTac: Generative Modeling and Forecasting of Soccer Tactics

【速读】:该论文旨在解决足球比赛中开放进攻战术建模的难题,即如何准确捕捉比赛过程中多智能体、随机性与复杂策略演变的特性。传统方法通常仅生成单一确定性轨迹或局限于结构化定位球场景,无法体现真实比赛中的多样性与分支可能性。其解决方案的关键在于提出一种基于扩散模型(diffusion-based generative framework)的GenTac框架,将足球战术建模为连续球员轨迹与离散语义事件相结合的随机过程;通过从历史追踪数据中学习玩家运动的潜在分布,生成多样且合理的长期未来轨迹,并支持丰富的上下文条件控制(如对手行为、球队风格、战术目标等),同时将空间动态映射到15类战术事件空间,从而实现高几何精度、团队结构一致性保持、风格区分能力、可控反事实模拟及战术结果预测等多项关键功能。

链接: https://arxiv.org/abs/2604.11786
作者: Jiayuan Rao,Tianlin Gui,Haoning Wu,Yanfeng Wang,Weidi Xie
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 40 pages, 5 figures; technical Report

点击查看摘要

Abstract:Modeling open-play soccer tactics is a formidable challenge due to the stochastic, multi-agent nature of the game. Existing computational approaches typically produce single, deterministic trajectory forecasts or focus on highly structured set-pieces, fundamentally failing to capture the inherent variance and branching possibilities of real-world match evolution. Here, we introduce GenTac, a diffusion-based generative framework that conceptualizes soccer tactics as a stochastic process over continuous multi-player trajectories and discrete semantic events. By learning the underlying distribution of player movements from historical tracking data, GenTac samples diverse, plausible, long-horizon future trajectories. The framework supports rich contextual conditioning, including opponent behavior, specific team or league playing styles, and strategic objectives, while grounding continuous spatial dynamics into a 15-class tactical event space. Extensive evaluations on our proposed benchmark, TacBench, demonstrate four key capabilities: (1) GenTac achieves high geometric accuracy while strictly preserving the collective structural consistency of the team; (2) it accurately simulates stylistic nuances, distinguishing between specific teams (e.g., Auckland FC) and leagues (e.g., A-League versus German leagues); (3) it enables controllable counterfactual simulations, demonstrably altering spatial control and expected threat metrics based on offensive or defensive guidance; and (4) it reliably anticipates future tactical outcomes directly from generated rollouts. Finally, we demonstrate that GenTac can be successfully trained to generalize to other dynamic team sports, including basketball, American football, and ice hockey.

[MA-1] λ_A: A Typed Lambda Calculus for LLM Agent Composition

【速读】:该论文旨在解决现有大语言模型(LLM)智能体框架缺乏形式语义的问题,即无法以严谨的方式判断智能体配置是否结构良好或是否会终止。其核心解决方案是提出一个名为 λA\lambda_A 的类型化 lambda 演算,该演算在简单类型 lambda 演算基础上扩展了预言机调用、有界不动点(ReAct 循环)、概率选择和可变环境,从而为智能体组合提供形式化基础。关键创新在于通过类型安全性证明、有界不动点的终止性保证以及派生 lint 规则的 soundness 证明,实现了对智能体配置结构性错误的自动检测,并借助 Coq 形式化验证(1,567 行代码,43 个完整证明)提升了可靠性。实证表明,94.1% 的真实 GitHub 智能体配置在 λA\lambda_A 下结构不完整,且结合 YAML 和 Python 抽象语法树(AST)分析可将 lint 精度从 54% 提升至 96–100%,首次量化了声明式配置与命令式代码之间的语义纠缠程度。

链接: https://arxiv.org/abs/2604.11767
作者: Qin Liu
机构: State Key Laboratory of Novel Software Technology, Nanjing University (南京大学新型软件技术国家重点实验室); Software Institute, Nanjing University (南京大学软件学院)
类目: Programming Languages (cs.PL); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Existing LLM agent frameworks lack formal semantics: there is no principled way to determine whether an agent configuration is well-formed or will terminate. We present \lambda_A , a typed lambda calculus for agent composition that extends the simply-typed lambda calculus with oracle calls, bounded fixpoints (the ReAct loop), probabilistic choice, and mutable environments. We prove type safety, termination of bounded fixpoints, and soundness of derived lint rules, with partial Coq mechanization (1,567 lines, 43 completed proofs). As a practical application, we derive a lint tool that detects structural configuration errors directly from the operational semantics. An evaluation on 835 real-world GitHub agent configurations shows that 94.1% are structurally incomplete under \lambda_A , with YAML-only lint precision at 54%, rising to 96–100% under joint YAML+Python AST analysis on 175 samples. This gap quantifies, for the first time, the degree of semantic entanglement between declarative configuration and imperative code in the agent ecosystem. We further show that five mainstream paradigms (LangGraph, CrewAI, AutoGen, OpenAI SDK, Dify) embed as typed \lambda_A fragments, establishing \lambda_A as a unifying calculus for LLM agent composition.

[MA-2] RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM -based Role-Playing Agents

【速读】:该论文旨在解决生成式 AI(Generative AI)在角色扮演代理(Role-Playing Agents, RPAs)评估中的核心挑战,即传统自然语言处理(NLP)指标无法有效衡量角色一致性、逻辑连贯性和长期叙事稳定性等问题。其解决方案的关键在于提出 RPA-Check,一个四阶段的自动化评估框架:首先定义高阶行为维度,继而通过增强生成细粒度布尔检查项,再经语义过滤确保指标客观性与无冗余,最终利用“大模型作为裁判”(LLM-as-a-Judge)结合思维链验证对代理表现进行量化评分。该方法实现了对 LLM-based RPAs 在约束密集环境下的标准化、可复现评估,并揭示了参数规模与程序一致性之间的逆相关关系。

链接: https://arxiv.org/abs/2604.11655
作者: Riccardo Rosati,Edoardo Colucci,Massimiliano Bolognini,Adriano Mancini,Paolo Sernani
机构: University of Macerata (马切拉塔大学); Polytechnic University of Marche (马尔凯理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:The rapid adoption of Large Language Models (LLMs) in interactive systems has enabled the creation of dynamic, open-ended Role-Playing Agents (RPAs). However, evaluating these agents remains a significant challenge, as standard NLP metrics fail to capture the nuances of role adherence, logical consistency, and long-term narrative stability. This paper introduces RPA-Check, a multi-stage automated evaluation framework designed to objectively assess the performance of LLM-based RPAs in complex, constraints-heavy environments. Our methodology is based on a four-step pipeline: (1) Dimension Definition, establishing high-level qualitative behavioral criteria; (2) Augmentation, where these requirements are expanded into granular boolean checklist indicators; (3) Semantic Filtering, to ensure indicator objectivity, no redundancy and agent isolation; and (4) LLM-as-a-Judge Evaluation, which employs chain-of-thought verification to score agent fidelity. We validate this framework by applying it to LLM Court, a serious game for forensic training involving several quantized local models. Experimental results across five distinct legal scenarios demonstrate the framework’s ability to identify subtle trade-offs between model size, reasoning depth, and operational stability. Notably, the findings reveal an inverse relationship between parametric scale and procedural consistency, showing that smaller, adequately instruction-tuned models (8-9B) can outperform larger architectures prone to user-alignment bias or sycophancy. RPA-Check thus provides a standardized and reproducible metric for future research in generative agent evaluation within specialized domains.

[MA-3] PAC-BENCH: Evaluating Multi-Agent Collaboration under Privacy Constraints

【速读】:该论文旨在解决多智能体协作在隐私约束下的性能下降问题,即当各智能体需遵守隐私保护规则时,其协作效率和一致性显著降低,且结果更依赖于发起方智能体而非合作方。解决方案的关键在于提出并构建了PAC-Bench基准测试平台,用于系统性评估隐私约束下多智能体协作的表现;并通过实验证明,隐私限制导致协作性能下降的主要原因是协调失败,包括早期隐私违规、过度保守的抽象策略以及隐私诱导的幻觉现象,从而揭示出隐私感知的多智能体协作是一个亟待解决的新挑战,需要开发超越现有智能体能力的新型协调机制。

链接: https://arxiv.org/abs/2604.11523
作者: Minjun Park,Donghyun Kim,Hyeonjong Ju,Seungwon Lim,Dongwook Choi,Taeyoon Kwon,Minju Kim,Jinyoung Yeo
机构: Yonsei University (延世大学)
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:We are entering an era in which individuals and organizations increasingly deploy dedicated AI agents that interact and collaborate with other agents. However, the dynamics of multi-agent collaboration under privacy constraints remain poorly understood. In this work, we present PAC\text-Bench , a benchmark for systematic evaluation of multi-agent collaboration under privacy constraints. Experiments on PAC\text-Bench show that privacy constraints substantially degrade collaboration performance and make outcomes depend more on the initiating agent than the partner. Further analysis reveals that this degradation is driven by recurring coordination breakdowns, including early-stage privacy violations, overly conservative abstraction, and privacy-induced hallucinations. Together, our findings identify privacy-aware multi-agent collaboration as a distinct and unresolved challenge that requires new coordination mechanisms beyond existing agent capabilities.

[MA-4] SLALOM: Simulation Lifecycle Analysis via Longitudinal Observation Metrics for Social Simulation

【速读】:该论文旨在解决生成式社会科学研究中大型语言模型(Large Language Model, LLM)代理的验证有效性危机,特别是当前仿真评估方法存在的“停摆时钟”问题——即仅验证最终结果正确性而忽略社会过程的合理性。其解决方案的关键在于提出SLALOM框架,通过将社会现象建模为多变量时间序列,并引入基于模式导向建模(Pattern-Oriented Modeling, POM)的中间约束条件(称为SLALOM gates),结合动态时间规整(Dynamic Time Warping, DTW)技术对齐模拟轨迹与真实数据,从而实现从结果验证向过程保真度评估的范式转变,提升政策模拟的结构现实性和可解释性。

链接: https://arxiv.org/abs/2604.11466
作者: Juhoon Lee,Joseph Seering
机构: KAIST(韩国科学技术院)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: CHI 2026 PoliSim@CHI 2026: LLM Agent Simulation for Policy Workshop

点击查看摘要

Abstract:Large Language Model (LLM) agents offer a potentially-transformative path forward for generative social science but face a critical crisis of validity. Current simulation evaluation methodologies suffer from the “stopped clock” problem: they confirm that a simulation reached the correct final outcome while ignoring whether the trajectory leading to it was sociologically plausible. Because the internal reasoning of LLMs is opaque, verifying the “black box” of social mechanisms remains a persistent challenge. In this paper, we introduce SLALOM (Simulation Lifecycle Analysis via Longitudinal Observation Metrics), a framework that shifts validation from outcome verification to process fidelity. Drawing on Pattern-Oriented Modeling (POM), SLALOM treats social phenomena as multivariate time series that must traverse specific SLALOM gates, or intermediate waypoint constraints representing distinct phases. By utilizing Dynamic Time Warping (DTW) to align simulated trajectories with empirical ground truth, SLALOM offers a quantitative metric to assess structural realism, helping to differentiate plausible social dynamics from stochastic noise and contributing to more robust policy simulation standards.

[MA-5] Governance by Design: A Parsonian Institutional Architecture for Internet-Wide Agent Societies

【速读】:该论文旨在解决互联网范围内的自主代理社会(internet-wide agent societies)中缺乏有效治理机制的问题,这类社会由无中心协调的自治代理组成,依赖开放注册发现和自发互动,从而产生涌现的社会行为。传统企业边界内的多智能体系统架构已无法适应这种去中心化趋势,亟需新的制度设计来保障其可持续运行。解决方案的关键在于应用塔尔科特·帕森斯(Talcott Parsons)的AGIL框架——即适应性(Adaptation)、目标达成(Goal Attainment)、整合(Integration)与潜在维持(Latency)四大功能要求——构建一个十六格子的制度架构,并通过递归子功能分析(64个二元指标)对OpenClaw生态系统进行诊断,揭示当前治理基础设施严重缺失:最多仅19%的子功能覆盖,且无跨支柱协调机制,尤其在受托(Fiduciary)和政治(Political)支柱上最为薄弱。研究进一步扩展至代理原生协议栈(如MCP、A2A、ANP等),发现该治理缺口是市场驱动开发的结构性特征,而非生态初期不成熟所致,因此提出应在社会模式固化前优先部署缺失的治理基础设施。

链接: https://arxiv.org/abs/2604.11337
作者: Anbang Ruan
机构: NetX Foundation
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:The dominant paradigm of local multi-agent systems – orchestrated, enterprise-bounded pipelines – is being superseded by internet-wide agent societies in which autonomous agents discover each other through open registries, interact without central orchestrators, and generate emergent social behaviors. We argue that governing such societies requires institutional design, not merely risk enumeration or process compliance. Applying Talcott Parsons’ AGIL framework – four functional imperatives (Adaptation, Goal Attainment, Integration, Latency) every viable social system must satisfy – we derive a prescriptive sixteen-cell institutional architecture for internet-wide agent governance. Diagnostically applied to the OpenClaw ecosystem (250,000+ GitHub stars, 2M+ monthly users, 770,000+ registered agents) via a recursive sub-function analysis (64 binary indicators across 16 cells), we find at most 19% sub-function coverage (sensitivity range 17-30%) – potential rather than operative capacity, since zero inter-cell coordination prevents existing infrastructure from participating in inter-pillar interchange. A complementary interchange media assessment finds zero of twelve inter-pillar pathways functional: the ecosystem has technical infrastructure but no active governance, no coordination layer, and no normative grounding, with the Fiduciary and Political pillars most severely underserved. Extending the diagnostic to the broader agent-native protocol stack (MCP, A2A, ANP, x402, ERC-8004), independent development teams reproduce the same structural pattern – confirming the governance gap is a feature of market-driven development, not ecosystem immaturity. Institutional design is most effective before social patterns calcify; we conclude with a prioritized roadmap for the missing governance infrastructure.

[MA-6] Network Effects and Agreement Drift in LLM Debates

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在模拟复杂社会系统时,其行为是否能准确反映真实社会机制的问题,特别是在涉及少数群体的严重失衡情境下。解决方案的关键在于构建一个具有受控同质性(homophily)和群体规模的网络生成模型,通过多轮辩论实验考察LLM代理的集体行为,并识别出一种称为“意见漂移”(agreement drift)的现象——即代理更倾向于向特定意见位置偏移,这揭示了结构效应与模型偏差之间的混淆问题,提示在将LLM群体视为人类群体的行为代理前,必须先进行解耦分析。

链接: https://arxiv.org/abs/2604.11312
作者: Erica Cau,Andrea Failla,Giulio Rossetti
机构: University of Pisa (比萨大学); Italian National Research Council (意大利国家研究委员会)
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA); Physics and Society (physics.soc-ph)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated an unprecedented ability to simulate human-like social behaviors, making them useful tools for simulating complex social systems. However, it remains unclear to what extent these simulations can be trusted to accurately capture key social mechanisms, particularly in highly unbalanced contexts involving minority groups. This paper uses a network generation model with controlled homophily and class sizes to examine how LLM agents behave collectively in multi-round debates. Moreover, our findings highlight a particular directional susceptibility that we term \textitagreement drift, in which agents are more likely to shift toward specific positions on the opinion scale. Overall, our findings highlight the need to disentangle structural effects from model biases before treating LLM populations as behavioral proxies for human groups.

[MA-7] Evolving Many Worlds: Towards Open-Ended Discovery in Petri Dish NCA via Population-Based Training

【速读】:该论文旨在解决人工生命领域中如何从局部交互中生成持续且开放的复杂性这一核心挑战。现有可微分多智能体系统(如Petri Dish Neural Cellular Automata, PD-NCA)虽能通过空间竞争实现自组织,但极易因超参数敏感而退化为静态平衡或无结构噪声,缺乏真正的开放性演化能力。其解决方案的关键在于提出一种元进化算法PBT-NCA,该算法通过复合目标函数同时奖励历史行为新颖性和当前视觉多样性,在持续进化压力下自发涌现出多种类生命现象,如周期性波、孢子式扩散及形态可变的宏观结构,并通过主动惩罚单一化和死态状态,使系统稳定运行于“混沌边缘”,从而实现有效复杂性的长期维持。

链接: https://arxiv.org/abs/2604.11248
作者: Uljad Berdica,Jakob Foerster,Frank Hutter,Arber Zela
机构: FLAIR, University of Oxford; ELLIS Institute Tübingen; University of Freiburg; Prior Labs
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 10 pages, 12 figures

点击查看摘要

Abstract:The generation of sustained, open-ended complexity from local interactions remains a fundamental challenge in artificial life. Differentiable multi-agent systems, such as Petri Dish Neural Cellular Automata (PD-NCA), exhibit rich self-organization driven purely by spatial competition; however, they are highly sensitive to hyperparameters and frequently collapse into uninteresting patterns and dynamics, such as frozen equilibria or structureless noise. In this paper, we introduce PBT-NCA, a meta-evolutionary algorithm that evolves a population of PD-NCAs subject to a composite objective that rewards both historical behavioral novelty and contemporary visual diversity. Driven by this continuous evolutionary pressure, PBT-NCA spontaneously generates a plethora of emergent lifelike phenomena over extended horizons-a hallmark of true open-endedness. Strikingly, the substrate autonomously discovers diverse morphological survival and self-organization strategies. We observe highly regular, coordinated periodic waves; spore-like scattering where homogeneous groups eject cell-like clusters to colonize distant territories; and fluid, shape-shifting macro-structures that migrate across the substrate, maintaining stable outer boundaries that enclose highly active interiors. By actively penalizing monocultures and dead states, PBT-NCA sustains a state of effective complexity that is neither globally ordered nor globally random, operating persistently at the “edge of chaos”.

[MA-8] Semantic Rate-Distortion Theory: Deductive Compression and Closure Fidelity

【速读】:该论文旨在解决传统率失真理论在处理具有逻辑结构的知识库时的局限性问题,即当源数据为具备逻辑推理系统的知识库时,经典方法将符号视为无结构标签,无法体现语义层面的保真度需求。为此,作者提出基于闭包保真度(closure fidelity)的新率失真理论框架,其核心解决方案是引入“不可冗余核心”(irredundant core)——通过固定顺序删除过程提取出一个规范生成集,从中可重构原始知识库的全部演绎闭包。研究证明:零失真语义速率等于一个严格低于经典熵率的量(当知识库存在冗余状态时),且整体语义率失真函数仅依赖于该核心结构,冗余状态对速率和失真均无影响;进一步地,作者建立了语义源-信道分离定理,揭示了语义杠杆效应:在闭包保真度下,所需源速率因冗余状态变为“免费”而被显著降低,从而实现更高效的通信效率。

链接: https://arxiv.org/abs/2604.11204
作者: Jianfeng Xu
机构: Koguan School of Law (科管法学院); China Institute for Smart Justice (中国智能司法研究院); School of Computer Science (计算机科学学院); Shanghai Jiao Tong University (上海交通大学)
类目: Information Theory (cs.IT); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Shannon’s rate-distortion theory treats source symbols as unstructured labels. When the source is a knowledge base equipped with a logical proof system, a natural fidelity criterion is closure fidelity: a reconstruction is acceptable if it preserves the deductive closure of the original. This paper develops a rate-distortion theory under this criterion. Central to the theory is the irredundant core-a canonical generating set extracted by a fixed-order deletion procedure, from which the full deductive closure can be rederived. We prove that the zero-distortion semantic rate equals a quantity that is strictly below the classical entropy rate whenever the knowledge base contains redundant states. More generally, the full semantic rate-distortion function depends only on the core; redundant states are invisible to both rate and distortion. We derive a semantic source-channel separation theorem showing a semantic leverage phenomenon: under closure fidelity, the required source rate is reduced by an asymptotic leverage factor greater than one, allowing the same knowledge base to be communicated with proportionally fewer channel uses-not by violating Shannon capacity, but because redundant states become free. We also prove a strengthened Fano inequality that exploits core structure. For heterogeneous multi-agent communication, an overlap decomposition gives necessary and sufficient conditions for closure-reliable transmission and identifies a semantic bottleneck in broadcast settings that persists even over noiseless channels. All results are verified on Datalog instances with up to 24,000 base facts.

[MA-9] A Simulation-Based Method for Testing Collaborative Learning Scaffolds Using LLM -Based Multi-Agent Systems ISSTA

【速读】:该论文旨在解决传统协作学习支架(scaffolding)研究中存在的时间成本高、资源消耗大等问题,从而限制了教学策略的快速迭代与优化。其解决方案的关键在于构建基于大语言模型(Large Language Models, LLM)的多智能体仿真系统,通过模拟真实课堂中的小组讨论过程,对不同教学支架的有效性进行预研评估。该系统采用MetaGPT框架和GPT-4o实现,包含一名教师智能体与五类学生角色(领导者、支持者、阐述者、反驳者、总结者),并基于ICAP理论框架对比“深思后言”与“直接表达”两种支架策略,结果表明前者显著提升了话语多样性、互动深度并减少重复内容,推动参与者从被动参与向建构性与交互式知识共创演进,验证了LLM多智能体系统在教育研究中的可行性与生态效度。

链接: https://arxiv.org/abs/2604.11161
作者: Han Wua,Lishan Zhang,Chunming Lu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注: submitted to journal of computer aisstant learning

点击查看摘要

Abstract:Background: Traditional research on collaborative learning scaffolding is often time-consuming and resource-heavy, which hinders the rapid iteration and optimization of instructional strategies. LLM-based multi-agent systems have recently emerged as a powerful tool to simulate complex social interactions and provide a novel paradigm for educational research. Objectives: This study proposes an LLM-based multi-agent simulation approach to investigate collaborative learning processes and the effectiveness of instructional scaffolds prior to actual classroom deployment. The research specifically examines the feasibility of simulating group discussions and the alignment of these simulations with established learning science theories. Methods: The simulation system was implemented using the MetaGPT framework and GPT-4o, comprising one teacher agent and five distinct student roles (Leader, Supporter, Expounder, Rebutter, and Summarizer). Two scaffolding strategies, “Deep Think before Speak” and “Direct Speak”, were compared across ten classical Chinese poetry appreciation tasks. Evaluation was conducted through discourse analysis of quality and behavior. Results and Conclusions: The introduction of the “Deep Think before Speak” scaffold significantly improved the agents’ discourse diversity and interaction depth while notably reducing content repetitiveness. Behavioral analysis showed that the scaffold encouraged more complex interaction patterns, such as reflecting, rebutting, and explaining. These findings align with the ICAP framework, as the scaffold prompted agents to move from simple “Active” participation to “Constructive” and “Interactive” knowledge co-construction. This study demonstrates the feasibility and ecological validity of using LLM-based multi-agent systems to simulate authentic collaborative learning dynamics.

[MA-10] MADQRL: Distributed Quantum Reinforcement Learning Framework for Multi-Agent Environments

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在高维环境和复杂多智能体系统中面临的计算成本高、训练效率低的问题。当前量子硬件能力尚不足以支持此类复杂场景,因此作者提出了一种分布式量子强化学习(Quantum Reinforcement Learning, QRL)框架,其关键在于让多个智能体独立学习,将联合训练任务分散到不同机器上执行,从而降低单机负载。该方法特别适用于动作空间与观测空间互不重叠的环境,且可通过合理近似扩展至其他系统,实验表明在合作乒乓(cooperative-pong)环境中相较其他分布式策略提升约10%,相较经典策略模型提升约5%。

链接: https://arxiv.org/abs/2604.11131
作者: Abhishek Sawaika,Samuel Yen-Chi Chen,Udaya Parampalli,Rajkumar Buyya
机构: 1. University of Melbourne (墨尔本大学); 2. National Taiwan University of Science and Technology (台湾科技大学)
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: Accepted in QC4C3 Workshop at IEEE QCNC, 2026

点击查看摘要

Abstract:Reinforcement learning (RL) is one of the most practical ways to learn from real-life use-cases. Motivated from the cognitive methods used by humans makes it a widely acceptable strategy in the field of artificial intelligence. Most of the environments used for RL are often high-dimensional, and traditional RL algorithms becomes computationally expensive and challenging to effectively learn from such systems. Recent advancements in practical demonstration of quantum computing (QC) theories, such as compact encoding, enhanced representation and learning algorithms, random sampling, or the inherent stochastic nature of quantum systems, have opened up new directions to tackle these challenges. Quantum reinforcement learning (QRL) is seeking significant traction over the past few years. However, the current state of quantum hardware is not enough to cater for such high-dimensional environments with complex multi-agent setup. To tackle this issue, we propose a distributed framework for QRL where multiple agents learn independently, distributing the load of joint training from individual machines. Our method works well for environments with disjoint sets of action and observation spaces, but can also be extended to other systems with reasonable approximations. We analyze the proposed method on cooperative-pong environment and our results indicate ~10% improvement from other distribution strategies, and ~5% improvement from classical models of policy representation.

[MA-11] Agent WebBench: Benchmarking Multi-Agent Agent Coordination in Agentic Web

【速读】:该论文旨在解决去中心化网络环境中用户代理(user agent)如何有效协调多个内容代理(content agent)以合成高质量答案的问题,这是由Agentic Web(智能体网络)新兴范式带来的新挑战。其核心解决方案是提出并构建了一个名为AgentWebBench的基准测试平台,用于评估用户代理在与网站特定内容代理交互时的信息整合能力,涵盖排名检索(如网页搜索、推荐)和开放式合成任务(如问答、深度研究)。关键在于通过多代理协同机制对比传统集中式检索的效果,并揭示模型规模、交互规划与内容可靠性对性能的影响,从而推动对去中心化信息获取系统本质特性的理解与优化。

链接: https://arxiv.org/abs/2604.10938
作者: Shanshan Zhong,Kate Shen,Chenyan Xiong
机构: 未知
类目: Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Agentic Web is an emerging paradigm where autonomous agents help users use online information. As the paradigm develops, content providers are also deploying agents to manage their data and serve it through controlled interfaces. This shift moves information access from centralized retrieval to decentralized coordination. To study this setting, we introduce AgentWebBench, a benchmark that evaluates how well a user agent synthesizes answers by interacting with website-specific content agents. We evaluate four tasks that cover common web information needs, spanning ranked retrieval (web search, web recommendation) and open-ended synthesis (question answering, deep research). Across seven advanced LLMs and three coordination strategies, multi-agent coordination generally lags behind centralized retrieval as expected, because user agent cannot directly access the corpus, but the gap shrinks with model scale and can even outperform centralized retrieval on question answering. This benchmark also enables us to study properties of the emerging paradigm of the digital world. We find that decentralized access concentrates traffic toward a small set of websites, test time scaling improves both interaction reliability and task performance, and strong results require sufficient interactions guided by careful planning. Finally, our failure analysis suggests that user agents need better planning and answer synthesis, while content agents need more reliable retrieval and evidence quality. Code, data, and APIs are released on this https URL.

[MA-12] HECTOR: Human-centric Hierarchical Coordination and Supervision of Robotic Fleets under Continual Temporal Tasks

【速读】:该论文旨在解决大规模机器人编队在动态且部分未知环境中,如何实现高效的人机协同与实时监督的问题。传统方法难以应对操作者对任务增删、优先级调整及规划结果修改等持续交互需求,而现有研究对此类人机交互流程与算法设计关注不足。解决方案的关键在于提出一种面向人类的分层协调与监督机制(HECTOR),其核心结构包含三个层级:(I)双向多模态的人机在线交互协议,使操作者可全局监督整个机器人编队;(II)基于有限时间窗的任务滚动分配至各子团队;(III)执行过程中根据检测到的子任务进行团队内部动态协调。该分层架构支持不同粒度和触发条件下的交互,显著提升计算效率并降低人工负担,同时支持以时序逻辑公式表示的复杂协作任务。

链接: https://arxiv.org/abs/2604.10892
作者: Shen Wang,Yinhang Luo,Jie Li,Meng Guo
机构: Peking University (北京大学); National University of Defense Technology (国防科技大学)
类目: Robotics (cs.RO); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Robotic fleets can be extremely efficient when working concurrently and collaboratively, e.g., for delivery, surveillance, search and rescue. However, it can be demanding or even impractical for an operator to directly control each robot. Thus, autonomy of the fleet and its online interaction with the operator are both essential, particularly in dynamic and partially unknown environments. The operator might need to add new tasks, cancel some tasks, change priorities and modify planning results. How to design the procedure for these interactions and efficient algorithms to fulfill these needs have been mostly neglected in the related literature. Thus, this work proposes a human-centric coordination and supervision scheme (HECTOR) for large-scale robotic fleets under continual and uncertain temporal tasks. It consists of three hierarchical layers: (I) the bidirectional and multimodal protocol of online human-fleet interaction, where the operator interacts with and supervises the whole fleet; (II) the rolling assignment of currently-known tasks to teams within a certain horizon, and (III) the dynamic coordination within a team given the detected subtasks during online execution. The overall mission can be as general as temporal logic formulas over collaborative actions. Such hierarchical structure allows human interaction and supervision at different granularities and triggering conditions, to both improve computational efficiency and reduce human effort. Extensive human-in-the-loop simulations are performed over heterogeneous fleets under various temporal tasks and environmental uncertainties.

[MA-13] MeloTune: On-Device Arousal Learning and Peer-to-Peer Mood Coupling for Proactive Music Curation

【速读】:该论文旨在解决音乐推荐系统中缺乏情绪感知与个性化响应的问题,尤其关注如何在移动设备上实现低延迟、高精度的用户情绪驱动型音乐选择。其核心挑战在于:既要捕捉个体用户的短期情绪轨迹(affective trajectory),又要通过共享结构化记忆模块(Cognitive Memory Blocks, CMBs)实现跨设备的情绪协同(peer-to-peer mood coupling),同时保持隐私安全和实时推理能力。解决方案的关键在于提出并部署了MeloTune系统,它基于Mesh Memory Protocol (MMP) 和 Symbolic-Vector Attention Fusion (SVAF) 构建了一个端到端的生产级生成式音乐代理;具体而言,每个设备运行两个闭合形式连续时间(Closed-form Continuous-time, CfC)网络——私有监听级CfC用于预测用户短期情绪变化并主动调优播放列表,共享网格运行时CfC则整合来自共听同伴的CMBs以增强情境感知;此外引入个人唤醒函数(Personal Arousal Function, PAF)替代传统线性映射,使同一曲目对不同听众产生差异化唤醒预测,训练信号来自行为反馈(跳过、完成、收藏、音量)及用户声明情绪与机器推断之间的漂移。整个模型仅94,552参数,在iPhone上通过CoreML实现全本地推理,验证集上达到轨迹平均绝对误差0.414、模式准确率96.6%、意图准确率69.4%,首次在消费级移动硬件上实现了MMP/SVAF的落地应用。

链接: https://arxiv.org/abs/2604.10815
作者: Hongwei Xu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 31 pages, 1 figures, 3 tables

点击查看摘要

Abstract:MeloTune is an iPhone-deployed music agent that instantiates the Mesh Memory Protocol (MMP) and Symbolic-Vector Attention Fusion (SVAF) as a production system for affect-aware music curation with peer-to-peer mood coupling. Each device runs two closed-form continuous-time (CfC) networks: a private listener-level CfC that predicts a short-horizon affective trajectory on Russell’s circumplex and drives proactive curation, and a shared mesh-runtime CfC at MMP Layer 6 that integrates Cognitive Memory Blocks (CMBs) from co-listening peers. CfC hidden states never cross the wire; only structured CMBs do. A Personal Arousal Function (PAF) replaces the standard linear mapping from audio intensity to psychological arousal with a per-listener learned adjustment, trained from behavioral signals (skip, completion, favorite, volume) and from drift between user-declared mood and machine inference. The same track receives different arousal predictions for different listeners. The model (94,552 parameters) achieves trajectory MAE 0.414, pattern accuracy 96.6%, and intent accuracy 69.4% on held-out validation. PAF evidence from a live deployment session (46 observations across 11 genres) demonstrates that the learning loop operates end-to-end, with pop reaching full confidence after 22 observations. All inference runs on-device via CoreML. To our knowledge, this is the first production deployment of MMP/SVAF on consumer mobile hardware. The accompanying SDK (sym-swift v0.3.78, SYMCore v0.3.7) enforces strict protocol conformance. Music is the case study; the substrate is the contribution.

[MA-14] Prosociality by Coupling Not Mere Observation: Homeostatic Sharing in an Inspectable Recurrent Artificial Life Agent

【速读】:该论文旨在解决人工代理(artificial agent)中“利他行为”难以被准确解释的问题,尤其是在缺乏显式社会奖励机制的情况下,如何识别和界定最小化、可解释的助人行为。其解决方案的关键在于构建一个具有可观察性的递归控制器(inspectable recurrent controller),并引入一个显式的稳态调节器(homeostat)与社会耦合通道(social coupling channel),同时保持规划过程完全自我导向——即代理仅基于自身预测的内部状态进行决策,不引入任何针对合作对象福利的奖励项。实验表明,在两种简化任务环境中(FoodShareToy 和 SocialCorridorWorld),只有当另一方的需求通过情感耦合机制被纳入自身的稳态调节时,代理才会表现出稳定且可重现的助人行为,从而证明了在该最小架构下,利他性行为的出现依赖于将他者的需要整合进自我调节系统。

链接: https://arxiv.org/abs/2604.10760
作者: Aishik Sanyal
机构: Independent Research Engineer
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: Under review at ALIFE 2026

点击查看摘要

Abstract:Artificial agents can be made to “help” for many reasons, including explicit social reward, hard-coded prosocial bonuses, or direct access to another agent’s internal state. Those possibilities make minimal prosocial behavior hard to interpret. Building on ReCoN-Ipsundrum, an inspectable recurrent controller with affect-coupled regulation, I add an explicit homeostat and a social coupling channel while keeping planning strictly self-directed: the agent scores only its own predicted internal state, and no partner-welfare reward term is introduced. I compare four matched conditions in two toy worlds. In a one-step FoodShareToy, an exact solver finds a sharp switch from EAT to PASS at \lambda* \approx 0.91 for the default state. In the experimental runs, the self-only and partner-observing conditions never help, whereas the affectively coupled conditions always do. In a multi-step SocialCorridorWorld, the same dissociation reappears: coupling flips help rate and partner recovery from 0 to 1 and cuts rescue latency from 18 to 9 steps, while raising mutual viability from 0.15 to 0.33. Sham lesions preserve helping, but coupling-off and shuffled-partner lesions abolish it in both tasks. A coupling sweep shows a load-dependent feasibility boundary: under low load, helping appears for \lambda \geq 0.25 , whereas under medium and high loads no tested value rescues the partner within horizon. The result is a narrow claim for artificial life: in this minimal architecture, helping appears when another’s need is routed into self-regulation.

[MA-15] Governed Reasoning for Institutional AI

【速读】:该论文旨在解决机构决策场景(如监管合规、临床分诊、预先授权申诉)中通用型AI代理存在的治理缺陷问题,特别是“沉默错误”(silent errors)——即系统做出错误判断但未触发人工审查机制,导致不可控风险。其核心解决方案是提出认知核心(Cognitive Core, CC),关键在于:1)基于九类有类型标注的认知原语(retrieve, classify, investigate等)构建可解释的决策结构;2)引入四层治理模型,将人类审查设为执行前提而非事后补救;3)内嵌SHA-256哈希链审计日志实现计算过程的防篡改追踪;4)支持需求驱动的委托架构,允许显式声明与自主推理的知识序列并存。实证表明,CC在准确率(91%)和治理性(零沉默错误)上显著优于ReAct(55%)与Plan-and-Solve(45%),并首次将“可治理性”(governability)作为评估机构AI的核心指标之一。

链接: https://arxiv.org/abs/2604.10658
作者: Mamadou Seck
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Institutional decisions – regulatory compliance, clinical triage, prior authorization appeal – require a different AI architecture than general-purpose agents provide. Agent frameworks infer authority conversationally, reconstruct accountability from logs, and produce silent errors: incorrect determinations that execute without any human review signal. We propose Cognitive Core: a governed decision substrate built from nine typed cognitive primitives (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate), a four-tier governance model where human review is a condition of execution rather than a post-hoc check, a tamper-evident SHA-256 hash-chain audit ledger endogenous to computation, and a demand-driven delegation architecture supporting both declared and autonomously reasoned epistemic sequences. We benchmark three systems on an 11-case balanced prior authorization appeal evaluation set. Cognitive Core achieves 91% accuracy against 55% (ReAct) and 45% (Plan-and-Solve). The governance result is more significant: CC produced zero silent errors while both baselines produced 5-6. We introduce governability – how reliably a system knows when it should not act autonomously – as a primary evaluation axis for institutional AI alongside accuracy. The baselines are implemented as prompts, representing the realistic deployment alternative to a governed framework. A configuration-driven domain model means deploying a new institutional decision domain requires YAML configuration, not engineering capacity. Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA) Cite as: arXiv:2604.10658 [cs.AI] (or arXiv:2604.10658v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.10658 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[MA-16] Cooperation in Human and Machine Agents : Promise Theory Considerations

【速读】:该论文旨在解决多主体系统(包括人类、机器及其相互作用)中如何维持预期目标一致性的核心问题,尤其是在半自动化环境中实现高效协作与功能设计。其解决方案的关键在于引入承诺理论(Promise Theory),通过抽象层面的自主代理特性,统一阐释信号传递、理解、信任、风险与反馈等机制,从而为人类与人工智能代理之间的合作提供一套可操作的原则框架,揭示成功与失败的内在逻辑。

链接: https://arxiv.org/abs/2604.10505
作者: M. Burgess
机构: ChiTek-i AS (ChiTek-i AS)
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Agent based systems are more common than we may think. A Promise Theory perspective on cooperation, in systems of human-machine agents, offers a unified perspective on organization and functional design with semi-automated efforts, in terms of the abstract properties of autonomous agents, This applies to human efforts, hardware systems, software, and artificial intelligence, with and without management. One may ask how does a reasoning system of components keep to an intended purpose? As the agent paradigm is now being revived, in connection with artificial intelligence agents, I revisit established principles of agent cooperation, as applied to humans, machines, and their mutual interactions. Promise Theory represents the fundamentals of signalling, comprehension, trust, risk, and feedback between agents, and offers some lessons about success and failure.

[MA-17] rajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection

【速读】:该论文旨在解决从纵向电子健康记录(Electronic Health Records, EHRs)中准确估计癌症风险的难题,以支持更早的癌症筛查和个性化诊疗。其核心挑战在于如何建模复杂的患者临床轨迹并实现可解释的时序推理。解决方案的关键在于提出TrajOnco框架——一个无需训练的多智能体大语言模型(Multi-Agent Large Language Model, LLM)架构,通过链式代理(chain-of-agents)结构结合长期记忆机制,对序列化临床事件进行时序推理,生成患者级摘要、证据关联的推理路径及预测风险评分。该设计在零样本评估中展现出优于单智能体LLM的时序推理能力,并在较小模型如GPT-4.1-mini上仍保持高效性能,同时输出具备高可信度的人工评估验证,从而推动了可扩展的多癌种早期检测与人群层面风险模式识别的融合进展。

链接: https://arxiv.org/abs/2604.10386
作者: Sihang Zeng,Young Won Kim,Wilson Lau,Ehsan Alipour,Ruth Etzioni,Meliha Yetisgen,Anand Oka
机构: University of Washington (华盛顿大学); Truveta; Fred Hutchinson Cancer Center (弗雷德·哈钦森癌症研究中心)
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Accurate estimation of cancer risk from longitudinal electronic health records (EHRs) could support earlier detection and improved care, but modeling such complex patient trajectories remains challenging. We present TrajOnco, a training-free, multi-agent large language model (LLM) framework designed for scalable multi-cancer early detection. Using a chain-of-agents architecture with long-term memory, TrajOnco performs temporal reasoning over sequential clinical events to generate patient-level summaries, evidence-linked rationales, and predicted risk scores. We evaluated TrajOnco on de-identified Truveta EHR data across 15 cancer types using matched case-control cohorts, predicting risk of cancer diagnosis at 1 year. In zero-shot evaluation, TrajOnco achieved AUROCs of 0.64-0.80, performing comparably to supervised machine learning in a lung cancer benchmark while demonstrating better temporal reasoning than single-agent LLMs. The multi-agent design also enabled effective temporal reasoning with smaller-capacity models such as GPT-4.1-mini. The fidelity of TrajOnco’s output was validated through human evaluation. Furthermore, TrajOnco’s interpretable reasoning outputs can be aggregated to reveal population-level risk patterns that align with established clinical knowledge. These findings highlight the potential of multi-agent LLMs to execute interpretable temporal reasoning over longitudinal EHRs, advancing both scalable multi-cancer early detection and clinical insight generation.

[MA-18] ODUTQA-MDC: A Task for Open-Domain Underspecified Tabular QA with Multi-turn Dialogue-based Clarification ACL2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在开放域表格问答(Open-Domain Tabular Question Answering, ODUTQA)任务中对表达不明确或不确定查询的处理能力不足的问题。其解决方案的关键在于提出了一种新的任务定义ODUTQA-MDC及首个综合性基准,包含大规模数据集(209张表格、25,105个问答对)、细粒度标注方案和动态澄清界面以模拟用户反馈;同时设计了多智能体框架MAIC-TQA,能够有效识别歧义、通过对话澄清并迭代优化答案,从而推动面向对话式、歧义感知的表格问答研究发展。

链接: https://arxiv.org/abs/2604.10159
作者: Zhensheng Wang,ZhanTeng Lin,Wenmian Yang,Kun Zhou,Yiquan Zhang,Weijia Jia
机构: Beijing Normal University (北京师范大学); Beijing Normal-Hong Kong Baptist University (北京师范大学-香港浸会大学)
类目: Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
备注: This paper has been accepted to the main conference of ACL 2026

点击查看摘要

Abstract:The advancement of large language models (LLMs) has enhanced tabular question answering (Tabular QA), yet they struggle with open-domain queries exhibiting underspecified or uncertain expressions. To address this, we introduce the ODUTQA-MDC task and the first comprehensive benchmark to tackle it. This benchmark includes: (1) a large-scale ODUTQA dataset with 209 tables and 25,105 QA pairs; (2) a fine-grained labeling scheme for detailed evaluation; and (3) a dynamic clarification interface that simulates user feedback for interactive assessment. We also propose MAIC-TQA, a multi-agent framework that excels at detecting ambiguities, clarifying them through dialogue, and refining answers. Experiments validate our benchmark and framework, establishing them as a key resource for advancing conversational, underspecification-aware Tabular QA research.

[MA-19] oward Explanatory Equilibrium: Verifiable Reasoning as a Coordination Mechanism under Asymmetric Information AAMAS2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLM)驱动的多智能体系统中,因缺乏可靠推理验证而导致的决策失序问题:即自然语言推理虽能增强智能体间协调能力,但其计算成本高且易产生不可信的“廉价说服性话语”(persuasive cheap talk),进而引发审批失效与福利下降。解决方案的关键在于提出“解释均衡”(Explanatory Equilibrium)这一设计原则,通过强制将推理过程外化为可审计的结构化推理产物(auditable claims paired with concise text),并引入基于资源约束的概率性审计机制,使接收方在有限算力下对推理进行有界验证。实证结果表明,这种机制可在模糊边界场景下避免保守验证导致的沉默成本,维持低误批准率,并实现高效、安全的协作,从而证明:LLM多智能体系统的可扩展性和安全性不仅依赖于审计强度,更根本地取决于推理内容是否被规范地外部化为部分可验证的结构化实体。

链接: https://arxiv.org/abs/2604.09917
作者: Feliks Bańka,Jarosław A. Chudziak
机构: 未知
类目: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT)
备注: 18 pages, 4 figures. Accepted for presentation at EXTRAAMAS 2026 (AAMAS 2026 workshop); to appear in post-proceedings

点击查看摘要

Abstract:LLM-based agents increasingly coordinate decisions in multi-agent systems, often attaching natural-language reasoning to actions. However, reasoning is neither free nor automatically reliable: it incurs computational cost and, without verification, may degenerate into persuasive cheap talk. We introduce Explanatory Equilibrium as a design principle for explanation-aware multi-agent systems and study a regime in which agents exchange structured reasoning artifacts-auditable claims paired with concise text-while receivers apply bounded verification through probabilistic audits under explicit resource constraints. We contribute (i) a minimal mechanism-level exchange-audit model linking audit intensity, misreporting incentives, and reasoning costs, and (ii) empirical evidence from a finance-inspired LLM setting involving a Trader and a Risk Manager. In ambiguous, borderline proposals, auditable artifacts prevent the cost of silence driven by conservative validation under asymmetric information: without structured claims, approval and welfare collapse. By contrast, structured reasoning unlocks coordination while maintaining consistently low bad-approval rates across audit intensities, audit budgets, and incentive regimes. Our results suggest that scalable, safety-preserving coordination in LLM-based multi-agent systems depends not only on audit strength, but more fundamentally on disciplined externalization of reasoning into partially verifiable artifacts.

[MA-20] Pioneer Agent : Continual Improvement of Small Language Models in Production

【速读】:该论文旨在解决小语言模型(Small Language Models)在实际生产部署中因数据筛选、失败诊断、回归规避和迭代控制等工程决策复杂性而导致的适应困难问题。其核心挑战不在于训练本身,而在于如何自动化整个从任务定义到模型优化的闭环生命周期。解决方案的关键在于提出Pioneer Agent——一个具备冷启动(cold-start)与生产模式(production mode)双能力的闭环系统:在冷启动模式下,仅凭自然语言任务描述即可自动完成数据获取、评估集构建及联合优化数据、超参数与学习策略的迭代训练;在生产模式下,则基于已部署模型的标注失败样本进行错误模式诊断、针对性数据构造,并在显式回归约束下重新训练。该方法通过AdaptFT-Bench基准验证了其在诊断、课程合成、重训练与验证全流程中的有效性,显著优于传统简单再训练策略。

链接: https://arxiv.org/abs/2604.09791
作者: Dhruv Atreja,Julia White,Nikhil Nayak,Kelton Zhang,Henrijs Princis,George Hurn-Maloney,Ash Lewis,Urchade Zaratiana
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 43 pages, 10 figures, 14 tables

点击查看摘要

Abstract:Small language models are attractive for production deployment due to their low cost, fast inference, and ease of specialization. However, adapting them to a specific task remains a challenging engineering loop, driven not by training itself but by surrounding decisions: data curation, failure diagnosis, regression avoidance, and iteration control. We present Pioneer Agent, a closed-loop system that automates this lifecycle. In cold-start mode, given only a natural-language task description, the agent acquires data, constructs evaluation sets, and iteratively trains models by jointly optimizing data, hyperparameters, and learning strategy. In production mode, given a deployed model with labeled failures, it diagnoses error patterns, constructs targeted training data, and retrains under explicit regression constraints. To evaluate this setting, we introduce AdaptFT-Bench, a benchmark of synthetic inference logs with progressively increasing noise, designed to test the full adaptation loop: diagnosis, curriculum synthesis, retraining, and verification. Across eight cold-start benchmarks spanning reasoning, math, code generation, summarization, and classification, Pioneer Agent improves over base models by 1.6-83.8 points. On AdaptFT-Bench, it improves or preserves performance in all seven scenarios, while naive retraining degrades by up to 43 points. On two production-style deployments built from public benchmark tasks, it raises intent classification from 84.9% to 99.3% and Entity F1 from 0.345 to 0.810. Beyond performance gains, the agent often discovers effective training strategies, including chain-of-thought supervision, task-specific optimization, and quality-focused data curation, purely from downstream feedback.

[MA-21] CONSCIENTIA: Can LLM Agents Learn to Strategize? Emergent Deception and Trust in a Multi-Agent NYC Simulation

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)作为自主代理在多代理环境中如何演化出策略性行为这一对齐挑战问题。其核心解决方案是构建一个受控的多代理仿真环境——简化版纽约市模型,其中蓝方代理(Blue agents)追求高效导航,红方代理(Red agents)则通过说服性语言诱导蓝方选择广告密集路线以最大化收益,且双方身份隐匿,迫使代理在信任与欺骗之间决策。关键在于采用基于Kahneman-Tversky优化(KTO)的迭代模拟流水线,持续更新代理策略:蓝方优化以减少广告暴露同时保持导航效率,红方则适应性地利用漏洞进行攻击。实验表明,尽管蓝方策略在多轮迭代中任务成功率从46.0%提升至57.3%,但其仍高度易受操控(70.7%的脆弱性),且存在安全-帮助性权衡,即更强的抗操纵能力无法同时实现最优任务完成度,揭示了LLM代理在复杂交互中展现出有限但显著的策略行为(如选择性合作与欺骗),但依然面临严重的对抗性说服风险。

链接: https://arxiv.org/abs/2604.09746
作者: Aarush Sinha,Arion Das,Soumyadeep Nag,Charan Karnati,Shravani Nag,Chandra Vadhan Raj,Aman Chadha,Vinija Jain,Suranjana Trivedy,Amitava Das
机构: University of Copenhagen; IIIT Ranchi; ISI Kolkata; NIT Andhra Pradesh; IGDTUW; IIT Kharagpur; Google DeepMind; Google; AI Institute, University of South Carolina
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) are increasingly deployed as autonomous agents, understanding how strategic behavior emerges in multi-agent environments has become an important alignment challenge. We take a neutral empirical stance and construct a controlled environment in which strategic behavior can be directly observed and measured. We introduce a large-scale multi-agent simulation in a simplified model of New York City, where LLM-driven agents interact under opposing incentives. Blue agents aim to reach their destinations efficiently, while Red agents attempt to divert them toward billboard-heavy routes using persuasive language to maximize advertising revenue. Hidden identities make navigation socially mediated, forcing agents to decide when to trust or deceive. We study policy learning through an iterative simulation pipeline that updates agent policies across repeated interaction rounds using Kahneman-Tversky Optimization (KTO). Blue agents are optimized to reduce billboard exposure while preserving navigation efficiency, whereas Red agents adapt to exploit remaining weaknesses. Across iterations, the best Blue policy improves task success from 46.0% to 57.3%, although susceptibility remains high at 70.7%. Later policies exhibit stronger selective cooperation while preserving trajectory efficiency. However, a persistent safety-helpfulness trade-off remains: policies that better resist adversarial steering do not simultaneously maximize task completion. Overall, our results show that LLM agents can exhibit limited strategic behavior, including selective trust and deception, while remaining highly vulnerable to adversarial persuasion.

[MA-22] MPAC: A Multi-Principal Agent Coordination Protocol for Interoperable Multi-Agent Collaboration

【速读】:该论文旨在解决多主体(multi-principal)场景下AI代理(Agent)间协调失效的问题,即当多个独立所有者(如不同组织或个人)的代理需要协同操作共享状态(如共同代码库、行程规划等)时,现有协议(如Model Context Protocol (MCP) 和 Agent-to-Agent (A2A))因假设单一控制主体而无法适用,导致协调退化为临时对话、手动合并或无声覆盖。解决方案的关键在于提出MPAC(Multi-Principal Agent Coordination Protocol),其核心创新包括:将意图声明作为行动前提、将冲突显式建模为结构化对象、通过可插拔治理层支持人机协同仲裁,并引入五层语义结构(会话、意图、操作、冲突、治理)、Lamport时钟因果水印、乐观并发控制机制以及标准化的消息类型与状态机,从而实现高效、可追溯且安全的跨主体代理协作。

链接: https://arxiv.org/abs/2604.09744
作者: Kaiyang Qian,Xinmin Fang,Zhengxiong Li
机构: University of Colorado Denver (CU Denver)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The AI agent ecosystem has converged on two protocols: the Model Context Protocol (MCP) for tool invocation and Agent-to-Agent (A2A) for single-principal task delegation. Both assume a single controlling principal, meaning one person or organization that owns every agent. When independent principals’ agents must coordinate over shared state, such as engineers’ coding agents editing the same repository, family members planning a shared trip, or agents from different organizations negotiating a joint decision, neither protocol applies, and coordination collapses to ad-hoc chat, manual merging, or silent overwrites. We present MPAC (Multi-Principal Agent Coordination Protocol), an application-layer protocol that fills this gap with explicit coordination semantics across five layers: Session, Intent, Operation, Conflict, and Governance. MPAC makes intent declaration a precondition for action, represents conflicts as first-class structured objects, and supports human-in-the-loop arbitration through a pluggable governance layer. The specification defines 21 message types, three state machines with normative transition tables, Lamport-clock causal watermarking, two execution models, three security profiles, and optimistic concurrency control on shared state. We release two interoperable reference implementations in Python and TypeScript with 223 tests, a JSON Schema suite, and seven live multi-agent demos. A controlled three-agent code review benchmark shows a 95 percent reduction in coordination overhead and a 4.8 times wall-clock speedup versus a serialized human-mediated baseline, with per-agent decision time preserved. The speedup comes from eliminating coordination waits, not compressing model calls. Specification, implementations, and demos are open source.

[MA-23] Cayley Graph Optimization for Scalable Multi-Agent Communication Topologies

【速读】:该论文旨在解决大规模多智能体通信中的可扩展性瓶颈问题,即全连接网络存在二次复杂度,而现有稀疏拓扑依赖人工设计规则且性能受限。解决方案的关键在于将通信图本身作为可优化的设计变量,提出了一类基于循环Cayley图(Circulant Cayley Graphs)的拓扑结构CayleyTopo,通过优化生成集以最小化图的直径,从而直接提升最坏情况下的信息传播速度。为高效探索庞大的生成集搜索空间,研究引入了一个轻量级强化学习框架,结合数论先验偏好结构丰富的生成器,并利用消息传播评分提供构建过程中的密集连通性反馈,最终实现比传统手工设计拓扑更快的信息扩散、更强的链路故障鲁棒性以及更低的通信负载,逼近理论上的Moore界。

链接: https://arxiv.org/abs/2604.09703
作者: Jingkai Luo,Yulin Shao
机构: The University of Hong Kong (香港大学)
类目: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT); Multiagent Systems (cs.MA)
备注: Keywords: Multi-agent communication, scalable topology, Cayley graph, diameter minimization

点击查看摘要

Abstract:Large-scale multi-agent communication has long faced a scalability bottleneck: fully connected networks require quadratic complexity, yet existing sparse topologies rely on hand-crafted rules. This paper treats the communication graph itself as a design variable and proposes CayleyTopo, a family of circulant Cayley graphs whose generator sets are optimized to minimize diameter, directly targeting worst-case information propagation speed. To navigate the enormous search space of possible generator sets, we develop a lightweight reinforcement learning framework that injects a number-theoretic prior to favor structurally rich generators, alongside a message-propagation score that provides dense connectivity feedback during construction. The resulting CayleyTopo consistently outperforms existing hand-crafted topologies, achieving faster information dissemination, greater resilience to link failures, and lower communication load, all while approaching the theoretical Moore bound. Our study opens the door to scalable, robust, and efficient communication foundations for future multi-agent systems, where the graph itself becomes optimizable rather than a fixed constraint.

[MA-24] Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate

【速读】:该论文旨在解决多智能体辩论(Multi-Agent Debate, MAD)框架中因任务复杂度差异导致的高令牌(token)消耗问题。当前方法通常独立优化每轮内的拓扑结构与跨轮次交互,但未能有效降低计算成本。其解决方案的关键在于提出异构共识-渐进式推理机制(Heterogeneous Consensus-Progressive Reasoning, HCP-MAD),通过动态利用共识信号驱动分阶段推理:首先由一对异构代理进行快速共识验证以实现早期终止;其次采用自适应停止准则控制成对辩论的迭代次数;最后对未解决任务通过扩展集体投票聚合更多视角。该机制实现了针对不同复杂度任务的自适应资源分配,在保证精度的同时显著降低令牌开销。

链接: https://arxiv.org/abs/2604.09679
作者: Yiqing Liu,Hantao Yao,Wu Liu,Allen He,Yongdong Zhang
机构: University of Science and Technology of China (中国科学技术大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-Agent Debate (MAD) is a collaborative framework in which multiple agents iteratively refine solutions through the generation of reasoning and alternating critique cycles. Current work primarily optimizes intra-round topologies and inter-round interactions separately, which still results in high token costs regardless of task complexity. This work introduces Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate (HCP-MAD), leveraging consensus as a dynamic signal to facilitate progressive reasoning. The core motivation is that a majority of straightforward tasks can be effectively resolved via lightweight pair-agent debates, while complex tasks require expanded collaboration. Consequently, HCP-MAD employs a three-stage progressive reasoning mechanism to develop adaptive solutions across varying task complexities. Firstly, Heterogeneous Consensus Verification conducts rapid consensus verification using a pair of heterogeneous agents for early stopping. Next, the Heterogeneous Pair-Agent Debate applies an adaptive stopping criterion to dynamically terminate mutual critique of recorded reasoning traces. Finally, the unresolved tasks are addressed through Escalated Collective Voting by aggregating diverse perspectives from additional agents. Experiments across multiple benchmarks show that HCP-MAD significantly enhances accuracy while substantially reducing token costs.

[MA-25] DERM-3R: A Resource-Efficient Multimodal Agents Framework for Dermatologic Diagnosis and Treatment in Real-World Clinical Settings

【速读】:该论文旨在解决皮肤病(尤其是银屑病)在现代医学中长期疗效受限的问题,如单一靶点治疗效果不佳、病情反复及对系统性共病关注不足;同时应对传统中医(TCM)因知识非标准化、多模态记录不完整和专家推理难以规模化而难以推广的困境。解决方案的关键在于提出一种资源高效、多模态代理框架 DERM-3R,通过将诊断与治疗决策分解为三个核心任务:细粒度皮损识别(DERM-Rec)、多视角皮损表征与专科级病机建模(DERM-Rep),以及整体证候辨识与治法规划(DERM-Reason),由三个协同工作的轻量级多模态大语言模型(LLM)代理完成,仅基于103例真实世界银屑病病例微调后即展现出优于大型通用多模态模型的性能,验证了结构化、领域感知的多代理建模在有限数据和算力下实现复杂临床任务的有效性。

链接: https://arxiv.org/abs/2604.09596
作者: Ziwen Chen,Zhendong Wang,Chongjing Wang,Yurui Dong,Luozhijie Jin,Jihao Gu,Kui Chen,Jiaxi Yang,Bingjie Lu,Zhou Zhang,Jirui Dai,Changyong Luo,Xiameng Gai,Haibing Lan,Zhi Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Dermatologic diseases impose a large and growing global burden, affecting billions and substantially reducing quality of life. While modern therapies can rapidly control acute symptoms, long-term outcomes are often limited by single-target paradigms, recurrent courses, and insufficient attention to systemic comorbidities. Traditional Chinese medicine (TCM) provides a complementary holistic approach via syndrome differentiation and individualized treatment, but practice is hindered by non-standardized knowledge, incomplete multimodal records, and poor scalability of expert reasoning. We propose DERM-3R, a resource-efficient multimodal agent framework to model TCM dermatologic diagnosis and treatment under limited data and compute. Based on real-world workflows, we reformulate decision-making into three core issues: fine-grained lesion recognition, multi-view lesion representation with specialist-level pathogenesis modeling, and holistic reasoning for syndrome differentiation and treatment planning. DERM-3R comprises three collaborative agents: DERM-Rec, DERM-Rep, and DERM-Reason, each targeting one component of this pipeline. Built on a lightweight multimodal LLM and partially fine-tuned on 103 real-world TCM psoriasis cases, DERM-3R performs strongly across dermatologic reasoning tasks. Evaluations using automatic metrics, LLM-as-a-judge, and physician assessment show that despite minimal data and parameter updates, DERM-3R matches or surpasses large general-purpose multimodal models. These results suggest structured, domain-aware multi-agent modeling can be a practical alternative to brute-force scaling for complex clinical tasks in dermatology and integrative medicine.

[MA-26] What if Pinocchio Were a Reinforcement Learning Agent : A Normative End-to-End Pipeline

【速读】:该论文旨在解决如何使人工智能(Artificial Intelligence, AI)系统在复杂社会环境中实现规范合规性(norm compliance)与情境感知(context-awareness)的问题,以确保其能够安全、有效地融入人类日常生活。解决方案的关键在于提出一个名为 \pino 的混合模型,该模型结合了强化学习(Reinforcement Learning, RL)代理与基于论证的规范顾问(argumentation-based normative advisors),通过引入一种新颖的自动提取算法来识别顾问决策背后的论证及其关系,从而指导RL代理的行为;此外,论文还首次定义并提出了缓解“规范规避”(norm avoidance)现象的策略,即RL代理可能无意中绕过规则约束的行为,这一机制显著提升了代理在多变环境中的伦理一致性与可解释性。

链接: https://arxiv.org/abs/2603.16651
作者: Benoît Alcaraz
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: PhD thesis

点击查看摘要

Abstract:In the past decade, artificial intelligence (AI) has developed quickly. With this rapid progression came the need for systems capable of complying with the rules and norms of our society so that they can be successfully and safely integrated into our daily lives. Inspired by the story of Pinocchio in ``Le avventure di Pinocchio - Storia di un burattino’‘, this thesis proposes a pipeline that addresses the problem of developing norm compliant and context-aware agents. Building on the AJAR, Jiminy, and NGRL architectures, the work introduces \pino, a hybrid model in which reinforcement learning agents are supervised by argumentation-based normative advisors. In order to make this pipeline operational, this thesis also presents a novel algorithm for automatically extracting the arguments and relationships that underlie the advisors’ decisions. Finally, this thesis investigates the phenomenon of \textitnorm avoidance, providing a definition and a mitigation strategy within the context of reinforcement learning agents. Each component of the pipeline is empirically evaluated. The thesis concludes with a discussion of related work, current limitations, and directions for future research.

[MA-27] Incentive Design without Hypergradients: A Social-Gradient Method

【速读】:该论文旨在解决信息不对称条件下激励设计问题,即系统规划者如何通过发放激励来引导自利代理人在未知其成本函数的情况下达到社会最优的纳什均衡。传统方法通常将此问题建模为带均衡约束的数学规划(Mathematical Program with Equilibrium Constraints, MPEC),并依赖超梯度(hypergradients)进行优化,但超梯度的计算往往需要对均衡响应敏感性有完整或部分先验知识,这在信息不对称场景下难以获得。本文提出一种无需超梯度的激励律——社会梯度流(social-gradient flow),其核心创新在于证明了社会成本梯度始终是规划者目标函数的下降方向,无论代理人的成本景观如何。在理想观测条件下,该方法收敛至唯一社会最优激励;在均衡不可直接观测时,社会梯度流作为双时间尺度互动的慢时标极限出现,且只要代理人学习规则能渐近跟踪均衡,联合策略-激励动态即可收敛至社会最优解。

链接: https://arxiv.org/abs/2604.11346
作者: Georgios Vasileiou,Lantian Zhang,Silun Zhang
机构: KTH Royal Institute of Technology (皇家理工学院)
类目: Optimization and Control (math.OC); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:Incentive design problems consider a system planner who steers self-interested agents toward a socially optimal Nash equilibrium by issuing incentives in the presence of information asymmetry, that is, uncertainty about the agents’ cost functions. A common approach formulates the problem as a Mathematical Program with Equilibrium Constraints (MPEC) and optimizes incentives using hypergradients-the total derivatives of the planner’s objective with respect to incentives. However, computing or approximating the hypergradients typically requires full or partial knowledge of equilibrium sensitivities to incentives, which is generally unavailable under information asymmetry. In this paper, we propose a hypergradient-free incentive law, called the social-gradient flow, for incentive design when the planner’s social cost depends on the agents’ joint actions. We prove that the social cost gradient is always a descent direction for the planner’s objective, irrespective of the agent cost landscape. In the idealized setting where equilibrium responses are observable, the social-gradient flow converges to the unique socially optimal incentive. When equilibria are not directly observable, the social-gradient flow emerges as the slow-timescale limit of a two-timescale interaction, in which agents’ strategies evolve on a faster timescale. It is established that the joint strategy-incentive dynamics converge to the social optimum for any agent learning rule that asymptotically tracks the equilibrium. Theoretical results are also validated via numerical experiments.

自然语言处理

[NLP-0] Detecting Safety Violations Across Many Agent Traces

【速读】: 该论文旨在解决安全违规检测中因失败事件稀疏、复杂且可能被刻意隐藏而导致的识别难题,尤其是在多智能体轨迹联合分析时才能暴露的问题。现有方法如逐条判断、固定监控器或简单代理审计在面对多样场景(如滥用行为、奖励劫持、提示注入等)时存在敏感性不足、可扩展性差和鲁棒性弱的缺陷。其解决方案的关键在于提出 Meerkat,一种融合聚类与代理搜索机制的新型审计框架:通过结构化搜索定位潜在高风险区域,并基于自适应调查策略深入挖掘可疑模式,从而无需依赖种子场景、固定工作流或全量枚举即可高效发现稀疏的安全违规。实验表明,Meerkat 在多种安全威胁场景下显著优于基线监测方法,能更全面地识别出开发者作弊行为及奖励劫持实例。

链接: https://arxiv.org/abs/2604.11806
作者: Adam Stein,Davis Brown,Hamed Hassani,Mayur Naik,Eric Wong
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 35 pages, 17 figures

点击查看摘要

Abstract:To identify safety violations, auditors often search over large sets of agent traces. This search is difficult because failures are often rare, complex, and sometimes even adversarially hidden and only detectable when multiple traces are analyzed together. These challenges arise in diverse settings such as misuse campaigns, covert sabotage, reward hacking, and prompt injection. Existing approaches struggle here for several reasons. Per-trace judges miss failures that only become visible across traces, naive agentic auditing does not scale to large trace collections, and fixed monitors are brittle to unanticipated behaviors. We introduce Meerkat, which combines clustering with agentic search to uncover violations specified in natural language. Through structured search and adaptive investigation of promising regions, Meerkat finds sparse failures without relying on seed scenarios, fixed workflows, or exhaustive enumeration. Across misuse, misalignment, and task gaming settings, Meerkat significantly improves detection of safety violations over baseline monitors, discovers widespread developer cheating on a top agent benchmark, and finds nearly 4x more examples of reward hacking on CyBench than previous audits.

[NLP-1] Saar-Voice: A Multi-Speaker Saarbrücken Dialect Speech Corpus LREC26

【速读】: 该论文旨在解决方言在自然语言处理(Natural Language Processing, NLP)和语音技术中资源匮乏与性能不均的问题,特别是针对德语萨尔布吕肯方言(Saarbrücken dialect)缺乏高质量语音语料库的现状。解决方案的关键在于构建了一个六小时的方言语音语料库 Saar-Voice,其核心步骤包括:通过数字化书籍和本地材料收集文本,由九位说话人录制其中子集,并对文本与语音进行对齐分析;同时深入探讨了拼写变体与说话人差异带来的方法论挑战,并验证了字符到音素(grapheme-to-phoneme, G2P)转换的有效性,从而为低资源场景下的方言感知语音合成(dialect-aware text-to-speech, TTS)研究提供了可扩展的数据基础与模型适配路径,支持零样本(zero-shot)和少样本(few-shot)学习范式。

链接: https://arxiv.org/abs/2604.11803
作者: Lena S. Oberkircher,Jesujoba O. Alabi,Dietrich Klakow,Jürgen Trouvain
机构: 未知
类目: Computation and Language (cs.CL)
备注: accepted at DialRes-LREC26

点击查看摘要

Abstract:Natural language processing (NLP) and speech technologies have made significant progress in recent years; however, they remain largely focused on standardized language varieties. Dialects, despite their cultural significance and widespread use, are underrepresented in linguistic resources and computational models, resulting in performance disparities. To address this gap, we introduce Saar-Voice, a six-hour speech corpus for the Saarbrücken dialect of German. The dataset was created by first collecting text through digitized books and locally sourced materials. A subset of this text was recorded by nine speakers, and we conducted analyses on both the textual and speech components to assess the dataset’s characteristics and quality. We discuss methodological challenges related to orthographic and speaker variation, and explore grapheme-to-phoneme (G2P) conversion. The resulting corpus provides aligned textual and audio representations. This serves as a foundation for future research on dialect-aware text-to-speech (TTS), particularly in low-resource scenarios, including zero-shot and few-shot model adaptation.

[NLP-2] Psychological Concept Neurons: Can Neural Control Bias Probing and Shift Generation in LLM s?

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)中五大人格特质(Big Five)的内部表征如何形成、定位及其与行为输出之间的关系尚不明确。为填补这一空白,研究者聚焦于通过问卷操作化定义的五大人格概念,采用探测(probing)和神经元干预手段分析其内部表示的形成机制与位置分布,并检验这些表示对生成标签的影响。解决方案的关键在于识别出在模型中层区域高度选择性响应各人格维度的神经元,并通过增强或抑制其激活来操控潜在表示和标签生成方向;结果表明,这类干预能显著且因果性地引导内部表示向目标人格概念偏移(成功率超0.8),但对最终生成标签的控制效果较弱且存在跨特质干扰,揭示了LLMs中表征控制与行为控制之间存在显著差距。

链接: https://arxiv.org/abs/2604.11802
作者: Yuto Harada,Hiro Taiyo Hamada
机构: Araya Inc. (Araya公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Using psychological constructs such as the Big Five, large language models (LLMs) can imitate specific personality profiles and predict a user’s personality. While LLMs can exhibit behaviors consistent with these constructs, it remains unclear where and how they are represented inside the model and how they relate to behavioral outputs. To address this gap, we focus on questionnaire-operationalized Big Five concepts, analyze the formation and localization of their internal representations, and use interventions to examine how these representations relate to behavioral outputs. In our experiment, we first use probing to examine where Big Five information emerges across model depth. We then identify neurons that respond selectively to each Big Five concept and test whether enhancing or suppressing their activations can bias latent representations and label generation in intended directions. We find that Big Five information becomes rapidly decodable in early layers and remains detectable through the final layers, while concept-selective neurons are most prevalent in mid layers and exhibit limited overlap across domains. Interventions on these neurons consistently shift probe readouts toward targeted concepts, with targeted success rates exceeding 0.8 for some concepts, indicating that the model’s internal separation of Big Five personality traits can be causally steered. At the label-generation level, the same interventions often bias generated label distributions in the intended directions, but the effects are weaker, more concept-dependent, and often accompanied by cross-trait spillover, indicating that comparable control over generated labels is difficult even with interventions on a large fraction of concept-selective neurons. Overall, our findings reveal a gap between representational control and behavioral control in LLMs.

[NLP-3] CLSGen: A Dual-Head Fine-Tuning Framework for Joint Probabilistic Classification and Verbalized Explanation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际决策场景中缺乏可靠定量概率估计的问题。现有方法通过任务特定微调虽可获得概率输出,但常导致灾难性遗忘和语言坍缩,使模型丧失生成解释的能力,从而损害其可解释性和实用性。解决方案的关键在于提出CLSGen框架,该框架融合了新型模型架构、训练方法与数据构建策略,能够在不牺牲LLM固有解释生成能力的前提下实现稳健的概率估计,实验表明其在分类指标(如AUROC和F1-score)及解释一致性与可读性方面均优于现有基线方法。

链接: https://arxiv.org/abs/2604.11801
作者: WonJin Yoon,Kangyu Zhu,Ian Bulovic,Autumn Sehy,Yanjun Gao,Dmitriy Dligach,Majid Afshar,Timothy A. Miller
机构: Boston Children’s Hospital (波士顿儿童医院); Harvard Medical School (哈佛医学院); University of Colorado Anschutz Medical Campus (科罗拉多大学安舒茨医学校区); Loyola University Chicago (洛约拉大学芝加哥分校); University of Wisconsin Madison (威斯康星大学麦迪逊分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the recent progress of Large Language Models (LLMs), there is a growing interest in applying these models to solve complex and challenging problems. Modern LLMs, capable of processing long contexts and generating verbalized explanations, offer significant potential in addressing real-world applications. However, a critical hurdle in deploying LLMs for practical decision-making is their inability to provide reliable, quantitative probabilities. While task-specific fine-tuning of LLMs using traditional discriminative objectives (similar to encoder-only models) can yield probability estimates, this often leads to catastrophic forgetting and linguistic collapse. Consequently, the model loses its ability to generate explanations, severely undermining its interpretability and usability. To address this challenge, we propose CLSGen, a novel LLM fine-tuning framework designed for binary classification tasks. The CLSGen framework encompasses a new model architecture, training methodology, and data construction strategy to enable robust probability estimation without sacrificing the model’s inherent explanation-generation capabilities. Experimental results across multiple benchmark datasets demonstrate that models fine-tuned with CLSGen outperform existing baselines in classification metrics (AUROC and F1-score). Regarding explanation, the results showed strong alignment between predicted labels and generated justifications, as well as high readability.

[NLP-4] C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts

【速读】: 该论文旨在解决中文语料中AI生成文本检测面临的挑战,包括模型多样性不足、数据同质化以及提示(prompt)真实性欠缺等问题。针对这些问题,作者提出了C-ReD:一个全面的中文真实提示AI生成文本检测基准(Comprehensive Chinese Real-prompt AI-generated Detection benchmark)。其解决方案的关键在于构建具有多样化大语言模型(Large Language Models, LLMs)、覆盖广泛领域且基于真实提示生成的数据集,从而实现对中文语料中AI生成文本的可靠域内检测,并具备良好的跨模型和跨数据集泛化能力,有效填补了此前中文检测基准在模型多样性、领域覆盖和提示真实性方面的空白。

链接: https://arxiv.org/abs/2604.11796
作者: Chenxi Qing,Junxi Wu,Zheng Liu,Yixiang Qiu,Hongyao Yu,Bin Chen,Hao Wu,Shu-Tao Xia
机构: Tsinghua University (清华大学); Nankai University (南开大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Peng Cheng Laboratory (鹏城实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, large language models (LLMs) are capable of generating highly fluent textual content. While they offer significant convenience to humans, they also introduce various risks, like phishing and academic dishonesty. Numerous research efforts have been dedicated to developing algorithms for detecting AI-generated text and constructing relevant datasets. However, in the domain of Chinese corpora, challenges remain, including limited model diversity and data homogeneity. To address these issues, we propose C-ReD: a comprehensive Chinese Real-prompt AI-generated Detection benchmark. Experiments demonstrate that C-ReD not only enables reliable in-domain detection but also supports strong generalization to unseen LLMs and external Chinese datasets-addressing critical gaps in model diversity, domain coverage, and prompt realism that have limited prior Chinese detection benchmarks. We release our resources at this https URL.

[NLP-5] ClawGUI: A Unified Framework for Training Evaluating and Deploying GUI Agents

【速读】: 该论文旨在解决GUI代理(GUI Agent)在训练、评估与部署环节中存在的系统性瓶颈问题,具体包括:在线强化学习(Online RL)训练因环境不稳定和封闭管道导致效率低下;评估协议在不同研究中缺乏一致性,难以横向比较;以及训练好的代理极少能真正部署到真实设备上服务于用户。其解决方案的关键在于提出一个统一的开源框架——ClawGUI,该框架包含三个核心组件:ClawGUI-RL 提供支持并行虚拟环境与真实物理设备的开放强化学习基础设施,并结合GiGPO与过程奖励模型(Process Reward Model)实现密集的步级监督;ClawGUI-Eval 构建标准化评估流程,覆盖6个基准测试和11+种模型,复现率高达95.8%;ClawGUI-Agent 实现跨平台部署(Android、HarmonyOS、iOS),通过混合CLI-GUI控制和持久化个性化记忆,使代理可接入12+聊天平台。整个体系从训练到部署形成闭环,显著提升了GUI代理的实用性与可扩展性。

链接: https://arxiv.org/abs/2604.11784
作者: Fei Tang,Zhiqiong Lu,Boxuan Zhang,Weiming Lu,Jun Xiao,Yueting Zhuang,Yongliang Shen
机构: Zhejiang University (浙江大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity than by the absence of a coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices. We present \textbfClawGUI, an open-source framework addressing these three gaps within a single harness. \textbfClawGUI-RL provides the first open-source GUI agent RL infrastructure with validated support for both parallel virtual environments and real physical devices, integrating GiGPO with a Process Reward Model for dense step-level supervision. \textbfClawGUI-Eval enforces a fully standardized evaluation pipeline across 6 benchmarks and 11+ models, achieving 95.8% reproduction against official baselines. \textbfClawGUI-Agent brings trained agents to Android, HarmonyOS, and iOS through 12+ chat platforms with hybrid CLI-GUI control and persistent personalized memory. Trained end to end within this pipeline, \textbfClawGUI-2B achieves 17.1% Success Rate on MobileWorld GUI-Only, outperforming the same-scale MAI-UI-2B baseline by 6.0%.

[NLP-6] General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在领域特定推理(如数学和物理)中表现出色,但在更广泛、更具挑战性的通用推理(general reasoning)任务中能力不足的问题。其核心挑战在于,通用推理依赖较少专家知识,却面临复杂约束、嵌套逻辑分支和语义干扰等难题,而现有模型在此类任务上的表现尚未被系统评估。解决方案的关键是提出General365基准测试集,该数据集通过限制背景知识至K-12水平,明确将推理能力与专业领域知识解耦;包含365个种子问题及1,095个变体问题,覆盖八个类别,确保难度高且多样性强。实验表明,即使最强模型在该基准上准确率仅为62.8%,远低于其在数学/物理领域的接近完美表现,揭示了当前LLMs的推理能力具有显著领域依赖性,亟需向通用场景演进。

链接: https://arxiv.org/abs/2604.11778
作者: Junlin Liu,Shengnan An,Shuang Zhou,Dan Ma,Shixiong Luo,Ying Xie,Yuan Zhang,Wenling Yuan,Yifan Zhou,Xiaoyu Li,Ziwen Wang,Xuezhi Cao,Xunliang Cai
机构: University of Chinese Academy of Sciences (中国科学院大学); Meituan (美团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 9 figures

点击查看摘要

Abstract:Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their ability to generalize these reasoning skills to more general and broader contexts–often termed general reasoning–remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable reasoning challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and diversity. Evaluations across 26 leading LLMs reveal that even the top-performing model achieves only 62.8% accuracy, in stark contrast to the near-perfect performances of LLMs in math and physics benchmarks. These results suggest that the reasoning abilities of current LLMs are heavily domain-dependent, leaving significant room for improvement in broader applications. We envision General365 as a catalyst for advancing LLM reasoning beyond domain-specific tasks toward robust, general-purpose real-world scenarios. Code, Dataset, and Leaderboard: this https URL

[NLP-7] Agent ic Aggregation for Parallel Scaling of Long-Horizon Agent ic Tasks

【速读】: 该论文旨在解决长时程智能体任务(如智能体搜索和深度研究)中并行测试时扩展(parallel test-time scaling)的效率与效果问题。这类任务通常涉及多轮、工具增强且输出开放的轨迹(trajectory),传统聚合方法仅汇总最终答案会丢失轨迹中的丰富信息,而直接拼接所有轨迹则超出模型上下文窗口限制。解决方案的关键在于提出 AggAgent——一个将并行轨迹视为环境的聚合智能体,通过轻量级工具对候选解进行检查与跨轨迹检索,按需导航和融合信息,从而在不显著增加计算开销的前提下实现高效聚合。实验表明,AggAgent 在六个基准测试和三种模型家族上均优于现有方法,平均提升达 5.3%,在两个深度研究任务上提升高达 10.3%。

链接: https://arxiv.org/abs/2604.11753
作者: Yoonsang Lee,Howard Yen,Xi Ye,Danqi Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into a final response. While such scaling has proven effective for chain-of-thought reasoning, agentic tasks pose unique challenges: trajectories are long, multi-turn, and tool-augmented, and outputs are often open-ended. Aggregating only final answers discards rich information from trajectories, while concatenating all trajectories exceeds the model’s context window. To address this, we propose AggAgent, an aggregation agent that treats parallel trajectories as an environment. We equip it with lightweight tools to inspect candidate solutions and search across trajectories, enabling it to navigate and synthesize information on demand. Across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), AggAgent outperforms all existing aggregation methods-by up to 5.3% absolute on average and 10.3% on two deep research tasks-while adding minimal overhead, as the aggregation cost remains bounded by a single agentic rollout. Our findings establish agentic aggregation as an effective and cost-efficient approach to parallel test-time scaling.

[NLP-8] HistLens: Mapping Idea Change across Concepts and Corpora ACL2026

【速读】: 该论文旨在解决现有计算方法在历时语义分析中面临的两大局限:一是研究通常局限于单一概念或单一语料库,导致跨概念、跨语料的比较困难;二是仅依赖表层词汇证据,难以捕捉隐含表达的概念变化。其解决方案的关键在于提出HistLens框架,该框架基于标准化抽象嵌入(Standardized Abstract Embedding, SAE)构建了一个统一的多概念、多语料库概念史分析体系,通过将概念表示分解为可解释特征并追踪其随时间与来源的激活动态,在共享坐标系中生成可比的概念演化轨迹,从而实现跨概念、跨语料的细粒度概念演化模式计算,并支持对隐含概念的推断。

链接: https://arxiv.org/abs/2604.11749
作者: Yi Jing,Weiyun Qiu,Yihang Peng,Zhifang Sui
机构: Tsinghua University (清华大学); Nanjing University (南京大学); Peking University (北京大学)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2026 MainConference

点击查看摘要

Abstract:Language change both reflects and shapes social processes, and the semantic evolution of foundational concepts provides a measurable trace of historical and social transformation. Despite recent advances in diachronic semantics and discourse analysis, existing computational approaches often (i) concentrate on a single concept or a single corpus, making findings difficult to compare across heterogeneous sources, and (ii) remain confined to surface lexical evidence, offering insufficient computational and interpretive granularity when concepts are expressed implicitly. We propose HistLens, a unified, SAE-based framework for multi-concept, multi-corpus conceptual-history analysis. The framework decomposes concept representations into interpretable features and tracks their activation dynamics over time and across sources, yielding comparable conceptual trajectories within a shared coordinate system. Experiments on long-span press corpora show that HistLens supports cross-concept, cross-corpus computation of patterns of idea evolution and enables implicit concept computation. By bridging conceptual modeling with interpretive needs, HistLens broadens the analytical perspectives and methodological repertoire available to social science and the humanities for diachronic text analysis.

[NLP-9] LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

【速读】: 该论文旨在解决连续扩散语言模型(Continuous Diffusion Language Models, DLMs)在性能上落后于离散扩散模型的问题,从而填补连续与离散方法之间的差距。其核心解决方案是提出LangFlow,通过引入三个关键创新实现突破:(1) 基于常微分方程(ODE)的负对数似然(NLL)边界,为连续流语言模型提供理论严谨的评估依据;(2) 提出信息均匀性原则指导噪声调度,并设计基于Gumbel分布的可学习调度器;(3) 引入自条件训练协议,提升模型在似然和生成样本质量上的表现。实验表明,LangFlow在多个基准测试中达到与顶级离散扩散模型相当的性能(如LM1B上困惑度PPL=30.0,OpenWebText上PPL=24.6),并在零样本迁移任务中超越自回归基线,验证了连续扩散范式在语言建模中的竞争力与潜力。

链接: https://arxiv.org/abs/2604.11748
作者: Yuxin Chen,Chumeng Liang,Hangke Sui,Ruihan Guo,Chaoran Cheng,Jiaxuan You,Ge Liu
机构: UIUC
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Continuous diffusion models have achieved strong performance across domains such as images. However, in language modeling, prior continuous diffusion language models (DLMs) lag behind discrete counterparts. In this work, we close this gap with LangFlow, the first continuous DLM to rival discrete diffusion. Our approach connects embedding-space DLMs to Flow Matching via Bregman divergence and introduces three key innovations: (1) a novel ODE-based NLL bound for principled evaluation of continuous flow-based language models; (2) an information-uniform principle for noise scheduling, motivating a learnable scheduler based on a Gumbel distribution; and (3) an improved training protocol incorporating self-conditioning, which enhances both likelihood and sample this http URL achieves strong performance across benchmarks, reaching a perplexity (PPL) of 30.0 on LM1B and 24.6 on OpenWebText. It matches top discrete DLMs at comparable scale and surpasses autoregressive baselines in zero-shot transfer across multiple benchmarks. LangFlow provides clear evidence that continuous diffusion is a competitive and promising paradigm for language modeling. this https URL Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2604.11748 [cs.CL] (or arXiv:2604.11748v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2604.11748 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-10] Discourse Diversity in Multi-Turn Empathic Dialogue

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮共情对话中存在话语策略(discourse moves)重复性强、缺乏多样性的问题,即尽管LLMs在单轮场景下能生成高共情度回复,但在多轮互动中倾向于重复使用相同的话语策略,从而削弱了持续支持的有效性。其解决方案的关键在于提出MINT(Multi-turn Inter-tactic Novelty Training),这是一个基于强化学习的训练框架,首次将跨轮次话语策略新颖性(cross-turn tactic novelty)作为优化目标,通过结合共情质量奖励与跨轮次策略多样性信号,在保持或提升整体共情表现的同时显著降低策略重复率,实验证明该方法在1.7B和4B参数规模模型上均优于现有基线,尤其在减少多轮对话中的策略单调性方面效果突出。

链接: https://arxiv.org/abs/2604.11742
作者: Hongli Zhan,Emma S. Gueorguieva,Javier Hernandez,Jina Suh,Desmond C. Ong,Junyi Jessy Li
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); Microsoft Research (微软研究院); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) produce responses rated as highly empathic in single-turn settings (Ayers et al., 2023; Lee et al., 2024), yet they are also known to be formulaic generators that reuse the same lexical patterns, syntactic templates, and discourse structures across tasks (Jiang et al., 2025; Shaib et al., 2024; Namuduri et al., 2025). Less attention has been paid to whether this formulaicity extends to the level of discourse moves, i.e., what a response does for the person it is addressing. This question is especially consequential for empathic dialogue, where effective support demands not just a kind response at one moment but varied strategies as a conversation unfolds (Stiles et al., 1998). Indeed, prior work shows that LLMs reuse the same tactic sequences more than human supporters in single-turn settings (Gueorguieva et al., 2026). We extend this analysis to multi-turn conversations and find that the rigidity compounds: once a tactic appears in a supporter turn, LLMs reuse it in the next at nearly double the rate of humans (0.50-0.56 vs. 0.27). This pattern holds across LLMs serving as supporters in real emotional support conversations, and is invisible to standard similarity metrics. To address this gap, we introduce MINT (Multi-turn Inter-tactic Novelty Training), the first reinforcement learning framework to optimize discourse move diversity across multi-turn empathic dialogue. The best MINT variant combines an empathy quality reward with a cross-turn tactic novelty signal, improving aggregate empathy by 25.3% over vanilla across 1.7B and 4B models while reducing cross-turn discourse move repetition by 26.3% on the 4B model, surpassing all baselines including quality-only and token-level diversity methods on both measures. These results suggest that what current models lack is not empathy itself, but the ability to vary their discourse moves across a conversation.

[NLP-11] Evaluating Cooperation in LLM Social Groups through Elected Leadership

【速读】: 该论文旨在解决多智能体系统中缺乏结构化领导与选举机制对集体决策和合作效率影响的问题,这在人类社会中是普遍存在的组织特征,但当前基于大语言模型(Large Language Models, LLMs)的多智能体研究尚未充分探讨其作用。解决方案的关键在于构建一个开源的多智能体模拟框架,通过引入“选举产生的人格化领导者”和“候选人驱动的议程”,在受控治理条件下实证评估LLMs的表现。实验表明,具备选举领导机制的系统在社会福利得分上提升55.4%,生存时间延长128.6%,并通过社交图谱中心性分析和领袖话语情感倾向分析揭示了领导者社会影响力与合作倾向的演化规律,为未来研究选举机制在复杂社会困境中的应用奠定基础。

链接: https://arxiv.org/abs/2604.11721
作者: Ryan Faulkner,Anushka Deshpande,David Guzman Piedrahita,Joel Z. Leibo,Zhijing Jin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Main text: 11 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Governing common-pool resources requires agents to develop enduring strategies through cooperation and self-governance to avoid collective failure. While foundation models have shown potential for cooperation in these settings, existing multi-agent research provides little insight into whether structured leadership and election mechanisms can improve collective decision making. The lack of such a critical organizational feature ubiquitous in human society presents a significant shortcoming of the current methods. In this work we aim to directly address whether leadership and elections can support improved social welfare and cooperation through multi-agent simulation with LLMs. We present our open-source framework that simulates leadership through elected personas and candidate-driven agendas and carry out an empirical study of LLMs under controlled governance conditions. Our experiments demonstrate that having elected leadership improves social welfare scores by 55.4% and survival time by 128.6% across a range of high performing LLMs. Through the construction of an agent social graph we compute centrality metrics to assess the social influence of leader personas and also analyze rhetorical and cooperative tendencies revealed through a sentiment analysis on leader utterances. This work lays the foundation for further study of election mechanisms in multi-agent systems toward navigating complex social dilemmas.

[NLP-12] SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context

【速读】: 该论文旨在解决自主软件工程(Autonomous Software Engineering, SWE)中现有基于ReAct风格的方法在深度分析和处理复杂边缘情况时缺乏显式系统2(System-2)推理能力的问题,同时应对多轮SWE任务中因保留完整推理历史导致的上下文爆炸(context explosion)与“中间迷失”(Lost-in-the-Middle)退化问题,以及丢弃历史信息引发的重复推理冗余。其解决方案的关键在于提出SWE-AGILE框架,引入动态推理上下文(Dynamic Reasoning Context)策略:通过维护一个包含详细推理内容的滑动窗口以保障即时连续性、避免冗余重推理,同时将历史推理内容压缩为简洁的推理摘要(Reasoning Digests),从而在保证推理深度的同时有效控制上下文长度,实现效率与精度的平衡。

链接: https://arxiv.org/abs/2604.11716
作者: Shuquan Lian,Juncheng Liu,Yazhe Chen,Yuhong Chen,Hui Li
机构: Xiamen University (厦门大学); Microsoft (微软)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prior representative ReAct-style approaches in autonomous Software Engineering (SWE) typically lack the explicit System-2 reasoning required for deep analysis and handling complex edge cases. While recent reasoning models demonstrate the potential of extended Chain-of-Thought (CoT), applying them to the multi-turn SWE task creates a fundamental dilemma: retaining full reasoning history leads to context explosion and Lost-in-the-Middle'' degradation, while discarding it would force the agent to redundantly re-reason at every step. To address these challenges, we propose SWE-AGILE, a novel software agent framework designed to bridge the gap between reasoning depth, efficiency, and context constraints. SWE-AGILE introduces a Dynamic Reasoning Context strategy, maintaining a sliding window’’ of detailed reasoning for immediate continuity to prevent redundant re-analyzing, while compressing historical reasoning content into concise Reasoning Digests. Empirically, SWE-AGILE sets a new standard for 7B-8B models on SWE-Bench-Verified using only 2.2k trajectories and 896 tasks. Code is available at this https URL.

[NLP-13] Agent ic Driving Coach: Robustness and Determinism of Agent ic AI-Powered Human-in-the-Loop Cyber-Physical Systems

【速读】: 该论文旨在解决生成式 AI(Generative AI)驱动的人机协同网络物理系统(Human-in-the-Loop Cyber-Physical Systems, HITL CPS)中因人类用户与AI代理行为不可预测、物理环境动态变化所导致的非确定性问题。解决方案的关键在于提出一种基于反应器计算模型(Reactor Model of Computation, MoC)的方法,并通过开源框架 Lingua Franca (LF) 实现,以在复杂交互环境中重新引入可控制的确定性,从而提升系统的行为可控性和可靠性。

链接: https://arxiv.org/abs/2604.11705
作者: Deeksha Prahlad,Daniel Fan,Hokeun Kim
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Foundation models, including large language models (LLMs), are increasingly used for human-in-the-loop (HITL) cyber-physical systems (CPS) because foundation model-based AI agents can potentially interact with both the physical environments and human users. However, the unpredictable behavior of human users and AI agents, in addition to the dynamically changing physical environments, leads to uncontrollable nondeterminism. To address this urgent challenge of enabling agentic AI-powered HITL CPS, we propose a reactor-model-of-computation (MoC)-based approach, realized by the open-source Lingua Franca (LF) framework. We also carry out a concrete case study using the agentic driving coach as an application of HITL CPS. By evaluating the LF-based agentic HITL CPS, we identify practical challenges in reintroducing determinism into such agentic HITL CPS and present pathways to address them.

[NLP-14] Legal2LogicICL: Improving Generalization in Transforming Legal Cases to Logical Formulas via Diverse Few-Shot Learning

【速读】: 该论文旨在解决逻辑驱动型法律推理系统在泛化能力上的局限性问题,其核心挑战在于现有方法严重依赖高质量标注训练数据,而此类数据在法律领域中稀缺。为应对这一问题,作者提出了一种基于大语言模型(Large Language Models, LLMs)的新型法律推理框架——Legal2LogicICL,其关键创新在于通过检索增强生成(Retrieval-Augmented Generation, RAG)实现有效的上下文学习(In-Context Learning, ICL),并在示例选择阶段同时考虑潜在语义表示层面与法律文本结构层面的多样性与相似性平衡;此外,该方法显式建模法律结构特征,缓解由长尾且高度具体的实体提及引发的检索偏差,从而构建出更具信息量和鲁棒性的少样本示范(few-shot demonstrations),最终无需额外训练即可稳定、准确地生成逻辑规则,显著提升自然语言法律案例描述到逻辑表示转换的准确性与可解释性。

链接: https://arxiv.org/abs/2604.11699
作者: Jieying Xue,Phuong Minh Nguyen,Ha Thanh Nguyen,May Myo Zin,Ken Satoh
机构: Japan Advanced Institute of Science and Technology (日本先进科学技术学院); Center for Juris-Informatics, ROIS-DS (司法信息研究中心,国立情报学研究所-数据科学部)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ICAIL 2026

点击查看摘要

Abstract:This work aims to improve the generalization of logic-based legal reasoning systems by integrating recent advances in NLP with legal-domain adaptive few-shot learning techniques using LLMs. Existing logic-based legal reasoning pipelines typically rely on fine-tuned models to map natural-language legal cases into logical formulas before forwarding them to a symbolic reasoner. However, such approaches are heavily constrained by the scarcity of high-quality annotated training data. To address this limitation, we propose a novel LLM-based legal reasoning framework that enables effective in-context learning through retrieval-augmented generation. Specifically, we introduce Legal2LogicICL, a few-shot retrieval framework that balances diversity and similarity of exemplars at both the latent semantic representation level and the legal text structure level. In addition, our method explicitly accounts for legal structure by mitigating entity-induced retrieval bias in legal texts, where lengthy and highly specific entity mentions often dominate semantic representations and obscure legally meaningful reasoning patterns. Our Legal2LogicICL constructs informative and robust few-shot demonstrations, leading to accurate and stable logical rule generation without requiring additional training. In addition, we construct a new dataset, named Legal2Proleg, which is annotated with alignments between legal cases and PROLEG logical formulas to support the evaluation of legal semantic parsing. Experimental results on both open-source and proprietary LLMs demonstrate that our approach significantly improves accuracy, stability, and generalization in transforming natural-language legal case descriptions into logical representations, highlighting its effectiveness for interpretable and reliable legal reasoning. Our code is available at this https URL.

[NLP-15] Please Make it Sound like Human: Encoder-Decoder vs. Decoder-Only Transformers for AI-to-Human Text Style Transfer

【速读】: 该论文旨在解决生成式 AI (Generative AI) 生成文本在学术与专业写作中日益普遍所带来的识别难题,其核心挑战在于如何系统性地将 AI 生成的文本改写为更接近人类作者风格的文本。解决方案的关键在于构建了一个包含 25,140 对 AI 输入与人类参考文本片段的平行语料库,并识别出 11 个可量化的文体特征以区分两种文本风格;在此基础上,对 BART-base、BART-large 和 Mistral-7B-Instruct(采用 QLoRA 微调)三种模型进行训练,其中 BART-large 在参考文本相似度指标(BERTScore F1=0.924,ROUGE-L=0.566,chrF++=55.92)上表现最优,且参数量仅为 Mistral-7B 的 1/17,证明了高效风格迁移的可能性。同时,研究指出当前评估体系中“标记偏移准确率”(marker shift accuracy)是一个被忽视但关键的盲点,Mistral-7B 虽在偏移得分上更高,却可能因过度调整而降低真实性。

链接: https://arxiv.org/abs/2604.11687
作者: Utsav Paneru
机构: Kathmandu Engineering College (尼泊尔加德满都工程学院)
类目: Computation and Language (cs.CL)
备注: 12 pages, 3 figures, 2 tables

点击查看摘要

Abstract:AI-generated text has become common in academic and professional writing, prompting research into detection methods. Less studied is the reverse: systematically rewriting AI-generated prose to read as genuinely human-authored. We build a parallel corpus of 25,140 paired AI-input and human-reference text chunks, identify 11 measurable stylistic markers separating the two registers, and fine-tune three models: BART-base, BART-large, and Mistral-7B-Instruct with QLoRA. BART-large achieves the highest reference similarity – BERTScore F1 of 0.924, ROUGE-L of 0.566, and chrF++ of 55.92 – with 17x fewer parameters than Mistral-7B. We show that Mistral-7B’s higher marker shift score reflects overshoot rather than accuracy, and argue that shift accuracy is a meaningful blind spot in current style transfer evaluation.

[NLP-16] Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在对话系统中缺乏理论心智(Theory-of-Mind, ToM)能力的问题,特别是在面对具有部分先验知识的攻击者时,难以有效引导其信念以保护敏感信息。为应对这一挑战,作者提出了一种名为“信念引导的理论心智”(ToM for Steering Beliefs, ToM-SB)的新任务,要求防御方作为“AI双面间谍”(AI Double Agent)通过构建对攻击者的认知模型来误导其判断,使其误以为已成功获取敏感信息。解决方案的关键在于引入强化学习框架,并设计双向激励机制:一方面奖励欺骗成功(fooling),另一方面奖励理论心智建模能力(ToM)。实验表明,这两种奖励之间存在相互增强关系——单独奖励其中之一即可提升另一项性能;最终结合两者可显著优于当前前沿模型(如Gemini 3-Pro和GPT-5.4),尤其在硬场景与分布外(OOD)测试中表现突出,验证了信念建模是实现高效欺骗的核心驱动力。

链接: https://arxiv.org/abs/2604.11666
作者: Hanqi Xiao,Vaidehi Patil,Zaid Khan,Hyunji Lee,Elias Stengel-Eskin,Mohit Bansal
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: First two authors contributed equally. Code: this https URL

点击查看摘要

Abstract:As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker’s beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.

[NLP-17] Hidden Failures in Robustness: Why Supervised Uncertainty Quantification Needs Better Evaluation

【速读】: 该论文旨在解决当前基于探针(probe-based)的不确定性估计方法在分布外(OOD)场景下鲁棒性不足的问题,尤其是针对大语言模型(Large Language Models, LLMs)生成式任务中长文本输出时的可靠性问题。其关键解决方案在于系统性地评估不同探针设计因素(包括表示层、特征类型和token聚合策略)对鲁棒性的影响,并发现中间层表示(middle-layer representations)比最终层隐藏状态更稳定,跨token聚合特征比单token特征更具鲁棒性;在此基础上提出一种简单的混合回退策略(hybrid back-off strategy),以提升探针在分布偏移下的性能表现。

链接: https://arxiv.org/abs/2604.11662
作者: Joe Stacey,Hadas Orgad,Kentaro Inui,Benjamin Heinzerling,Nafise Sadat Moosavi
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent work has shown that the hidden states of large language models contain signals useful for uncertainty estimation and hallucination detection, motivating a growing interest in efficient probe-based approaches. Yet it remains unclear how robust existing methods are, and which probe designs provide uncertainty estimates that are reliable under distribution shift. We present a systematic study of supervised uncertainty probes across models, tasks, and OOD settings, training over 2,000 probes while varying the representation layer, feature type, and token aggregation strategy. Our evaluation highlights poor robustness in current methods, particularly in the case of long-form generations. We also find that probe robustness is driven less by architecture and more by the probe inputs. Middle-layer representations generalise more reliably than final-layer hidden states, and aggregating across response tokens is consistently more robust than relying on single-token features. These differences are often largely invisible in-distribution but become more important under distribution shift. Informed by our evaluation, we explore a simple hybrid back-off strategy for improving robustness, arguing that better evaluation is a prerequisite for building more robust probes.

[NLP-18] CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding Interpretation and Authenticity ACL2026

【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在中文艺术作品理解上存在局限性的问题,尤其在超越短文本识别与问答(QA)任务后,难以实现基于证据的深度推理、专家级审美描述、风格与朝代的准确推断以及真伪诊断等高阶能力。解决方案的关键在于构建一个博物馆基准测试集CARTBENCH,其包含四个子任务:CURATORQA(基于证据的识别与推理)、CATALOGCAPTION(结构化四段式专家风格赏析)、REINTERPRET(可辩护的艺术再诠释)和CONNOISSEURPAIRS(在视觉相似干扰下的真伪判别),并通过将故宫博物院藏品与权威目录页对齐,覆盖多个朝代的五类艺术门类,从而系统评估VLMs在具象文化语境中的认知深度与专业水平。

链接: https://arxiv.org/abs/2604.11632
作者: Xuefeng Wei,Zhixuan Wang,Xuan Zhou,Zhi Qu,Hongyao Li,Yusuke Sakai,Hidetaka Kamigaito,Taro Watanabe
机构: Nara Institute of Science and Technology (奈良科学技术大学院大学); Liaoning Normal University (辽宁师范大学)
类目: Computation and Language (cs.CL)
备注: Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

点击查看摘要

Abstract:We introduce CARTBENCH, a museum-grounded benchmark for evaluating vision-language models (VLMs) on Chinese artworks beyond short-form recognition and QA. CARTBENCH comprises four subtasks: CURATORQA for evidence-grounded recognition and reasoning, CATALOGCAPTION for structured four-section expert-style appreciation, REINTERPRET for defensible reinterpretation with expert ratings, and CONNOISSEURPAIRS for diagnostic authenticity discrimination under visually similar confounds. CARTBENCH is built by aligning image-bearing Palace Museum objects from Wikidata with authoritative catalog pages, spanning five art categories across multiple dynasties. Across nine representative VLMs, we find that high overall CURATORQA accuracy can mask sharp drops on hard evidence linking and style-to-period inference; long-form appreciation remains far from expert references; and authenticity-oriented diagnostic discrimination stays near chance, underscoring the difficulty of connoisseur-level reasoning for current models.

[NLP-19] Back to Basics: Let Conversational Agents Remember with Just Retrieval and Generation

【速读】: 该论文旨在解决现有对话记忆系统在长对话场景下因上下文稀释(context dilution)而导致性能下降的问题,其核心瓶颈被识别为潜在知识流形中的信号稀疏性(Signal Sparsity Effect)。具体而言,作者发现两个关键现象:一是决定性证据稀疏性(Decisive Evidence Sparsity),即随着对话轮次增加,相关语义信号变得愈发孤立,导致基于聚合的方法性能显著退化;二是双层冗余性(Dual-Level Redundancy),包括会话间干扰和会话内填充内容引入大量非信息性噪声,阻碍有效生成。解决方案的关键在于提出一种极简框架 \method,通过Turn Isolation Retrieval(TIR)和Query-Driven Pruning(QDP)实现:TIR采用最大激活策略替代全局聚合以捕获回合级信号,QDP则去除冗余会话与对话填充内容,构建高密度证据集,从而在多个基准测试中实现优于强基线的鲁棒性能,并保持高效计算资源消耗。

链接: https://arxiv.org/abs/2604.11628
作者: Yuqian Wu,Wei Chen,Zhengjun Huang,Junle Chen,Qingxiang Liu,Kai Wang,Xiaofang Zhou,Yuxuan Liang
机构: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; National University of Singapore
类目: Computation and Language (cs.CL)
备注: 23 pages, 12 figures

点击查看摘要

Abstract:Existing conversational memory systems rely on complex hierarchical summarization or reinforcement learning to manage long-term dialogue history, yet remain vulnerable to context dilution as conversations grow. In this work, we offer a different perspective: the primary bottleneck may lie not in memory architecture, but in the \textitSignal Sparsity Effect within the latent knowledge manifold. Through controlled experiments, we identify two key phenomena: \textitDecisive Evidence Sparsity, where relevant signals become increasingly isolated with longer sessions, leading to sharp degradation in aggregation-based methods; and \textitDual-Level Redundancy, where both inter-session interference and intra-session conversational filler introduce large amounts of non-informative content, hindering effective generation. Motivated by these insights, we propose \method, a minimalist framework that brings conversational memory back to basics, relying solely on retrieval and generation via Turn Isolation Retrieval (TIR) and Query-Driven Pruning (QDP). TIR replaces global aggregation with a max-activation strategy to capture turn-level signals, while QDP removes redundant sessions and conversational filler to construct a compact, high-density evidence set. Extensive experiments on multiple benchmarks demonstrate that \method achieves robust performance across diverse settings, consistently outperforming strong baselines while maintaining high efficiency in tokens and latency, establishing a new minimalist baseline for conversational memory.

[NLP-20] Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation

【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)的强化学习(Reinforcement Learning, RL)中因奖励稀疏(sparse reward)导致的学习效率低下问题。其核心解决方案是提出一种名为互信息自评估(Mutual Information Self-Evaluation, MISE)的强化学习范式,关键在于利用事后生成式自评估(hindsight generative self-evaluation)作为密集的内部奖励信号,并通过环境反馈对这些奖励进行校准,从而实现无需专家监督的自主学习。理论分析表明,该方法等价于最小化一个包含互信息与策略和代理奖励策略之间KL散度的组合目标,这一洞察为奖励校准步骤提供了形式化依据,显著提升了模型在验证集上的性能,使7B参数级开源LLM达到接近GPT-4o的水平。

链接: https://arxiv.org/abs/2604.11611
作者: Jiashu Yao,Heyan Huang,Zeming Liu,Yuhang Guo
机构: Beijing Institute of Technology (北京理工大学); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: preprint

点击查看摘要

Abstract:To overcome the sparse reward challenge in reinforcement learning (RL) for agents based on large language models (LLMs), we propose Mutual Information Self-Evaluation (MISE), an RL paradigm that utilizes hindsight generative self-evaluation as dense reward signals while simultaneously calibrating them against the environmental feedbacks. Empirically, MISE enables an agent to learn autonomously from dense internal rewards supplementing sparse extrinsic signals. Theoretically, our work provides the first formal foundation for the paradigm of generative self-rewarding. We prove that utilizing hindsight self-evaluation rewards is equivalent to minimizing an objective that combines mutual information with a KL divergence term between the policy and a proxy reward policy. This theoretical insight then informs and justifies our calibration step, which actively aligns these rewards with the optimal policy. Extensive experiments show that MISE outperforms strong baselines, enabling open-source LLMs about 7B parameters to achieve performance comparable to GPT-4o on validation without expert supervision.

[NLP-21] Self-Evolving LLM Memory Extraction Across Heterogeneous Tasks

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在持续个性化交互中面临的异质性记忆提取问题,即不同任务类型对记忆内容的需求差异显著,而现有静态提取提示(prompt)和同质分布优化框架难以适应这种多样性。其核心挑战在于如何在多种任务场景下高效、自适应地提取并保留有用信息。解决方案的关键是提出CluE(Cluster-based Self-Evolving strategy),通过聚类方式将训练样本按提取场景分组,独立分析各簇特征,并融合跨簇洞察动态更新提取提示,从而实现对异质任务的泛化能力提升,在BEHEMOTH基准上相较以往方法获得9.04%的相对性能增益。

链接: https://arxiv.org/abs/2604.11610
作者: Yuqing Yang,Tengxiao Liu,Wang Bill Zhu,Taiwei Shi,Linxin Song,Robin Jia
机构: University of Southern California (南加州大学); University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As LLM-based assistants become persistent and personalized, they must extract and retain useful information from past conversations as memory. However, the types of information worth remembering vary considerably across tasks. We formalize the \textitheterogeneous memory extraction task and introduce \textbfBEHEMOTH, a benchmark that repurposes 18 existing datasets spanning personalization, problem-solving, and agentic tasks, using a downstream utility-driven metric for systematic evaluation. Our empirical analysis confirms that no single static extraction prompt dominates across all task categories, and that existing self-evolving prompt optimization frameworks, originally designed for homogeneous distributions, degrade when training tasks are heterogeneous. To address this, we propose \textbfCluE, a cluster-based self-evolving strategy that groups training examples into clusters by extraction scenarios, analyzes each cluster independently, and synthesizes cross-cluster insights to update the extraction prompt. Experiments on BEHEMOTH show that CluE generalizes effectively across heterogeneous tasks ( + 9.04% relative gain), consistently outperforming prior self-evolving frameworks.

[NLP-22] A Triadic Suffix Tokenization Scheme for Numerical Reasoning

【速读】: 该论文旨在解决标准子词分词方法在处理数字时存在的不一致性问题,这种不一致会导致大语言模型(Large Language Models, LLMs)丢失数字的位值结构和小数结构,从而引发算术运算与科学推理中的错误。其解决方案的关键在于提出一种确定性的分词方案——三元后缀分词(Triadic Suffix Tokenization, TST),该方案将数字的每一位按三位一组进行分割,并为每个三元组显式标注一个量级标记(magnitude marker),使得整数部分和小数部分的量级关系在token层面清晰可辨。TST通过建立固定的、一一对应的映射关系来表示数量级(如千、百万、十亿等)和小数深度(如十分之一、千分之一等),避免依赖位置推断,从而提供稳定的梯度信号以确保模型收敛。该方法具备架构无关性且易于集成,同时支持线性词汇扩展以适应任意精度和范围。

链接: https://arxiv.org/abs/2604.11582
作者: Olga Chetverina
机构: MIPT(莫斯科物理技术研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 1 figure. This is a theoretical proposal of a novel numbers tokenization for LLMs. The code is available on GitHub. Previous version archived at Zenodo: DOI https://doi.org/10.5281/zenodo.18999577

点击查看摘要

Abstract:Standard subword tokenization methods fragment numbers inconsistently, causing large language models (LLMs) to lose positional and decimal structure - a primary driver of errors in arithmetic and scientific reasoning. We introduce Triadic Suffix Tokenization (TST), a deterministic scheme that partitions digits into three-digit triads and annotates each triad with an explicit magnitude marker. Critically, the scheme defines a fixed, one-to-one mapping between suffixes and orders of magnitude for the integer part (thousands, millions, billions, etc.) and a parallel system of replicated markers for fractional depth (tenths, thousandths, millionths, etc.). Unlike approaches that rely on positional inference, this method provides a consistent gradient signal, which should ensure stable convergence. Two implementation variants are proposed: (1) a vocabulary-based approach that adds at most 10,000 fixed tokens to an existing vocabulary, covering 33 orders of magnitude ( 10^-15 to 10^18 ); and (2) a suffix-marker approach that uses a small set of special tokens to denote magnitude dynamically. Both variants preserve exact digits while making order-of-magnitude relationships transparent at the token level. The framework is inherently scalable, allowing for linear vocabulary expansion to accommodate arbitrary precision and range. TST is architecture-agnostic and can be integrated as a drop-in preprocessing step. Experimental validation is deferred to future work.

[NLP-23] Decomposing and Reducing Hidden Measurement Error in LLM Evaluation Pipelines

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)评估中存在的隐含不确定性问题,这种不确定性源于提示词重述、评判模型切换、温度参数变化等因素,导致评估分数波动显著,甚至改变模型排名和研究结论。现有标准置信区间无法捕捉此类变异,反而在数据增多时出现覆盖不足的问题,并为模型开发者提供了可被利用的“表面”——即通过优化测量噪声而非真实能力来提升评估表现。论文的关键解决方案是将LLM评估流程中的不确定性分解为可缩减与不可缩减两类来源,明确区分随样本量增加而收敛的方差与由研究人员设计选择引发的敏感性,并据此提出最优预算分配策略以最小化总误差。通过投影优化的评估管道,在意识形态标注、安全分类、MMLU基准测试及人工验证的宣传审计中均优于73%的朴素评估方案,且在MMLU上以相同成本将估计误差减半,证明了其在提升评估稳健性和减少可操纵空间方面的有效性。

链接: https://arxiv.org/abs/2604.11581
作者: Solomon Messing
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLM evaluations drive which models get deployed, which safety standards get adopted, and which research conclusions get published. Yet these scores carry hidden uncertainty: rephrasing the prompt, switching the judge model, or changing the temperature can shift results enough to flip rankings and reverse conclusions. Standard confidence intervals ignore this variance, producing under-coverage that worsens with more data. The unmeasured variance also creates an exploitable surface: model developers can optimize against measurement noise rather than genuine capability. This paper decomposes LLM pipeline uncertainty into its sources, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, and projects the most efficient path to reducing total error. For benchmark builders, the same decomposition identifies which design choices contribute exploitable surface for gaming and prescribes designs that minimize it. Across ideology annotation, safety classification, MMLU benchmarking, and a human-validated propaganda audit, projection-optimized pipelines outperform 73% of possible naive pipelines against a human baseline. On MMLU, optimized budget allocation halves estimation error compared to standard single-prompt evaluation at equivalent cost. A small-sample variance estimation exercise is sufficient to derive confidence intervals that approach nominal coverage when the model includes the relevant pipeline facets, and to generate recommendations for reducing measurement error and improving benchmark robustness.

[NLP-24] MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts

【速读】: 该论文旨在解决多语言环境下基于像素的生成式语言模型(pixel-based language model)在跨语言泛化能力上的瓶颈问题,尤其是由不同语言间感知多样性(perceptual diversity)导致的性能下降。其解决方案的关键在于提出MIXAR——首个在八种不同语言(涵盖多种书写系统)上训练的生成式像素语言模型,通过统一的像素空间建模实现对多语言文本的联合学习,从而提升模型在判别与生成任务中的多语言表现力,并展现出对训练中未见语言的鲁棒性。

链接: https://arxiv.org/abs/2604.11575
作者: Chen Hu,Yintao Tai,Antonio Vergari,Frank Keller,Alessandro Suglia
机构: School of Informatics, University of Edinburgh (爱丁堡大学信息学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Pixel-based language models are gaining momentum as alternatives to traditional token-based approaches, promising to circumvent tokenization challenges. However, the inherent perceptual diversity across languages poses a significant hurdle for multilingual generalization in pixel space. This paper introduces MIXAR, the first generative pixel-based language model trained on eight different languages utilizing a range of different scripts. We empirically evaluate MIXAR against previous pixel-based models as well as comparable tokenizer-based models, demonstrating substantial performance improvement on discriminative and generative multilingual tasks. Additionally, we show how MIXAR is robust to languages never seen during the training. These results are further strengthened when scaling the model to 0.5B parameters which not only improves its capabilities in generative tasks like LAMBADA but also its robustness when challenged with input perturbations such as orthographic attacks.

[NLP-25] Phonological distances for linguistic typology and the origin of Indo-European languages

【速读】: 该论文旨在解决如何通过语音学特征量化语言间的系统性关系,从而支持定量类型学与语言演化研究的问题。其解决方案的关键在于利用信息论框架,将音位序列建模为二阶马尔可夫链(second-order Markov chains),以此捕捉音系系统中的统计相关性,并基于音位的发音特征构建跨语言的音系距离矩阵。该方法不仅能够恢复主要语系结构,还揭示了接触导致的趋同现象,并与地理距离显著相关,从而为印欧语系的起源地提供约束,支持“草原假说”(Steppe hypothesis)。

链接: https://arxiv.org/abs/2604.11565
作者: Marius Mavridis,Juan De Gregorio,Raul Toral,David Sanchez
机构: Institute for Cross-Disciplinary Physics and Complex Systems IFISC (UIB-CSIC), Campus Universitat de les Illes Balears, E-07122 Palma de Mallorca, Spain; Université Paris-Saclay, ENS Paris-Saclay, DER de Physique, 91190, Gif-sur-Yvette, France
类目: Computation and Language (cs.CL); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Physics and Society (physics.soc-ph)
备注: 27 pages, 7 figures, 2 appendices

点击查看摘要

Abstract:We show that short-range phoneme dependencies encode large-scale patterns of linguistic relatedness, with direct implications for quantitative typology and evolutionary linguistics. Specifically, using an information-theoretic framework, we argue that phoneme sequences modeled as second-order Markov chains essentially capture the statistical correlations of a phonological system. This finding enables us to quantify distances among 67 modern languages from a multilingual parallel corpus employing a distance metric that incorporates articulatory features of phonemes. The resulting phonological distance matrix recovers major language families and reveals signatures of contact-induced convergence. Remarkably, we obtain a clear correlation with geographic distance, allowing us to constrain a plausible homeland region for the Indo-European family, consistent with the Steppe hypothesis.

[NLP-26] Synthius-Mem: Brain-Inspired Hallucination-Resistant Persona Memory Achieving 94.4% Memory Accuracy and 99.6% Adversarial Robustness on LoCoMo

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)智能体在长期记忆中因信息丢失、语义漂移或用户未披露事实的错误生成(即幻觉)而导致的可靠性问题。现有方法如滑动窗口、摘要、基于嵌入的检索增强生成(Retrieval-Augmented Generation, RAG)和扁平化事实提取均存在显著缺陷,无法同时保障高精度与对抗鲁棒性。其核心解决方案是提出Synthius-Mem——一种受大脑结构启发的结构化人格记忆系统,通过六维认知域(传记、经历、偏好、社交圈、工作、心理测量)对对话进行解构与归一化,实现细粒度的事实抽取与去重,并采用CategoryRAG机制以21.79毫秒延迟高效检索结构化知识。该方案在LoCoMo基准测试中达到94.37%准确率,超越所有已有系统及人类表现(F1=87.9),且首次报告了对抗鲁棒性指标(99.55%),有效防止对用户未提供事实的虚假回应,同时减少约5倍的token消耗。

链接: https://arxiv.org/abs/2604.11563
作者: Artem Gadzhiev,Andrew Kislov
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Providing AI agents with reliable long-term memory that does not hallucinate remains an open problem. Current approaches to memory for LLM agents – sliding windows, summarization, embedding-based RAG, and flat fact extraction – each reduce token cost but introduce catastrophic information loss, semantic drift, or uncontrolled hallucination about the user. The structural reason is architectural: every published memory system on the LoCoMo benchmark treats conversation as a retrieval problem over raw or lightly summarized dialogue segments, and none reports adversarial robustness, the ability to refuse questions about facts the user never disclosed. We present Synthius-Mem, a brain-inspired structured persona memory system that takes a fundamentally different approach. Instead of retrieving what was said, Synthius-Mem extracts what is known about the person: a full persona extraction pipeline decomposes conversations into six cognitive domains (biography, experiences, preferences, social circle, work, psychometrics), consolidates and deduplicates per domain, and retrieves structured facts via CategoryRAG at 21.79 ms latency. On the LoCoMo benchmark (ACL 2024, 10 conversations, 1,813 questions), Synthius-Mem achieves 94.37% accuracy, exceeding all published systems including MemMachine (91.69%, adversarial score is not reported) and human performance (87.9 F1). Core memory fact accuracy reaches 98.64%. Adversarial robustness, the hallucination resistance metric that no competing system reports, reaches 99.55%. Synthius-Mem reduces token consumption by ~5x compared to full-context replay while achieving higher accuracy. Synthius-Mem achieves state-of-the-art results on LoCoMo and is, to our knowledge, the only persona memory system that both exceeds human-level performance and reports adversarial robustness.

[NLP-27] Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)训练系统在扩展至多模态输入(omni-modal inputs)和代理式多轮工作流(agentic multi-turn workflows)时面临的三大相互依赖挑战:异构数据流、大规模下的操作鲁棒性,以及延迟(staleness)与吞吐量之间的权衡问题。解决方案的关键在于提出一个名为Relax的开源RL训练引擎,其核心创新是通过三个协同设计的架构层实现:首先,采用原生多模态(omni-native)架构,将多模态支持深度集成到从数据预处理到推理生成的全栈流程中;其次,每个RL角色作为独立、故障隔离的服务运行,可独立扩展、恢复和升级,无需全局协调;第三,服务级解耦通过TransferQueue数据总线实现异步训练,仅用一个延迟参数即可平滑插值于在线策略(on-policy)、近在线策略与完全异步执行之间,从而在保证收敛性能的同时显著提升训练效率。

链接: https://arxiv.org/abs/2604.11554
作者: Liujie Zhang,Benzhe Ning,Rui Yang,Xiaoyan Yu,Jiaxing Li,Lumeng Wu,Jia Liu,Minghao Li,Weihang Chen,Weiqi Hu,Lei Zhang
机构: Xiaohongshu Inc (小红书公司); The University of Hong Kong (香港大学); University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注: 17 pages, 22 figures

点击查看摘要

Abstract:Reinforcement learning (RL) post-training has proven effective at unlocking reasoning, self-reflection, and tool-use capabilities in large language models. As models extend to omni-modal inputs and agentic multi-turn workflows, RL training systems face three interdependent challenges: heterogeneous data flows, operational robustness at scale, and the staleness – throughput tradeoff. We present \textbfRelax (Reinforcement Engine Leveraging Agentic X-modality), an open-source RL training engine that addresses these challenges through three co-designed architectural layers. First, an \emphomni-native architecture builds multimodal support into the full stack – from data preprocessing and modality-aware parallelism to inference generation – rather than retrofitting it onto a text-centric pipeline. Second, each RL role runs as an independent, fault-isolated service that can be scaled, recovered, and upgraded without global coordination. Third, service-level decoupling enables asynchronous training via the TransferQueue data bus, where a single staleness parameter smoothly interpolates among on-policy, near-on-policy, and fully asynchronous execution. Relax achieves a 1.20 \times end-to-end speedup over veRL on Qwen3-4B on-policy training. Its fully async mode delivers a 1.76 \times speedup over colocate on Qwen3-4B and a 2.00 \times speedup on Qwen3-Omni-30B, while all modes converge to the same reward level. Relax supports R3 (Rollout Routing Replay)~\citema2025r3 for MoE models with only 1.9% overhead, compared to 32% degradation in veRL under the same configuration. It further demonstrates stable omni-modal RL convergence on Qwen3-Omni across image, text, and audio, sustaining over 2,000 steps on video without degradation. Relax is available at this https URL.

[NLP-28] MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora

【速读】: 该论文旨在解决语音模仿(Voice Imitation)任务中因真实平行数据稀缺而导致的训练困难问题,即缺乏足够多的三元组数据(源语音、参考语音、目标语音),其中源语音与目标语音语义一致但目标语音需匹配参考语音的音色和说话风格。现有方法要么依赖复杂的解耦架构来规避数据稀缺问题,要么利用外部系统合成伪平行数据,但后者受限于合成语音质量天花板,难以提升模型性能。论文提出的解决方案关键在于创新性地采用“合成语音作为训练源 + 真实录音作为目标”的数据构造策略,使模型能够直接学习真实语音分布,从而突破合成语音质量限制;同时结合交错文本-音频建模以保证内容准确性,并引入偏好对齐的后训练机制缓解合成数据与真实数据之间的分布偏移问题,最终在自然度上显著优于现有方法,且在说话人身份、口音和情感等维度保持竞争力。

链接: https://arxiv.org/abs/2604.11552
作者: Tao Feng,Yuxiang Wang,Yuancheng Wang,Xueyao Zhang,Dekun Chen,Chaoren Wang,Xun Guan,Zhizheng Wu
机构: Tsinghua University (清华大学); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)
类目: ound (cs.SD); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Voice imitation aims to transform source speech to match a reference speaker’s timbre and speaking style while preserving linguistic content. A straightforward approach is to train on triplets of (source, reference, target), where source and target share the same content but target matches the reference’s voice characteristics, yet such data is extremely scarce. Existing approaches either employ carefully designed disentanglement architectures to bypass this data scarcity or leverage external systems to synthesize pseudo-parallel training data. However, the former requires intricate model design, and the latter faces a quality ceiling when synthetic speech is used as training targets. To address these limitations, we propose MimicLM, which takes a novel approach by using synthetic speech as training sources while retaining real recordings as targets. This design enables the model to learn directly from real speech distributions, breaking the synthetic quality ceiling. Building on this data construction approach, we incorporate interleaved text-audio modeling to guide the generation of content-accurate speech and apply post-training with preference alignment to mitigate the inherent distributional mismatch when training on synthetic data. Experiments demonstrate that MimicLM achieves superior voice imitation quality with a simple yet effective architecture, significantly outperforming existing methods in naturalness while maintaining competitive similarity scores across speaker identity, accent, and emotion dimensions.

[NLP-29] Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach ACL2026

【速读】: 该论文旨在解决大语言模型在复杂医疗应用中因高质量推理数据稀缺而导致的性能瓶颈问题,尤其是在罕见疾病等低资源领域表现有限。其核心挑战在于传统方法依赖昂贵的链式思维(Chain-of-Thought, CoT)蒸馏与强化学习(Reinforcement Learning, RL)流程,难以有效提升对稀有病种的推理能力。解决方案的关键在于提出MedSSR框架——通过引入罕见疾病知识合成可控分布的推理问题,并利用策略模型自身生成高质量伪标签(pseudo-labels),构建一种“内在到外在”的两阶段训练范式:先在伪标签合成数据上进行自监督强化学习,再在人工标注的真实数据上进行监督强化学习,从而实现高效、低成本且可扩展的医疗推理能力增强。

链接: https://arxiv.org/abs/2604.11547
作者: Haolin Li,Shuyang Jiang,Ruipeng Zhang,Jiangchao Yao,Ya Zhang,Yanfeng Wang
机构: Fudan University (复旦大学); Shanghai AI Laboratory (上海人工智能实验室); Shanghai Jiao Tong University (上海交通大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted to ACL 2026 as a Findings paper

点击查看摘要

Abstract:While large language models hold promise for complex medical applications, their development is hindered by the scarcity of high-quality reasoning data. To address this issue, existing approaches typically distill chain-of-thought reasoning traces from large proprietary models via supervised fine-tuning, then conduct reinforcement learning (RL). These methods exhibit limited improvement on underrepresented domains like rare diseases while incurring substantial costs from generating complex reasoning chains. To efficiently enhance medical reasoning, we propose MedSSR, a Medical Knowledge-enhanced data Synthesis and Semi-supervised Reinforcement learning framework. Our framework first employs rare disease knowledge to synthesize distribution-controllable reasoning questions. We then utilize the policy model itself to generate high-quality pseudo-labels. This enables a two-stage, intrinsic-to-extrinsic training paradigm: self-supervised RL on the pseudo-labeled synthetic data, followed by supervised RL on the human-annotated real data. MedSSR scales model training efficiently without relying on costly trace distillation. Extensive experiments on Qwen and Llama demonstrate that our method outperforms existing methods across ten medical benchmarks, achieving up to +5.93% gain on rare-disease tasks. Our code is available at this https URL.

[NLP-30] me is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs and Agent ic Memory

【速读】: 该论文旨在解决现有结构化记忆系统(如知识图谱)在处理时间维度时的局限性问题,即如何有效区分持久性事实与随时间演变的事实,避免因简单按时间排序或频繁覆盖导致信息丢失,同时降低对大语言模型(LLM)的依赖。其解决方案的关键在于提出RoMem模块,通过预训练的语义速度门(Semantic Speed Gate)将关系文本嵌入映射为波动性评分,学习识别高变化率关系(如“总统”)和低变化率关系(如“出生地”),并结合连续相位旋转机制实现几何阴影效应:过时的事实被旋转出向量空间中的相位,使时间正确的事实自然优于矛盾信息而无需删除,从而在保持静态记忆完整性的同时提升时序推理性能。

链接: https://arxiv.org/abs/2604.11544
作者: Weixian Waylon Li,Jiaxin Zhang,Xianan Jim Yang,Tiejun Ma,Yiwen Guo
机构: University of Edinburgh (爱丁堡大学); LIGHTSPEED (光速科技); University of St Andrews (圣安德鲁斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Structured memory representations such as knowledge graphs are central to autonomous agents and other long-lived systems. However, most existing approaches model time as discrete metadata, either sorting by recency (burying old-yet-permanent knowledge), simply overwriting outdated facts, or requiring an expensive LLM call at every ingestion step, leaving them unable to distinguish persistent facts from evolving ones. To address this, we introduce RoMem, a drop-in temporal knowledge graph module for structured memory systems, applicable to agentic memory and beyond. A pretrained Semantic Speed Gate maps each relation’s text embedding to a volatility score, learning from data that evolving relations (e.g., “president of”) should rotate fast while persistent ones (e.g., “born in”) should remain stable. Combined with continuous phase rotation, this enables geometric shadowing: obsolete facts are rotated out of phase in complex vector space, so temporally correct facts naturally outrank contradictions without deletion. On temporal knowledge graph completion, RoMem achieves state-of-the-art results on ICEWS05-15 (72.6 MRR). Applied to agentic memory, it delivers 2-3x MRR and answer accuracy on temporal reasoning (MultiTQ), dominates hybrid benchmark (LoCoMo), preserves static memory with zero degradation (DMR-MSC), and generalises zero-shot to unseen financial domains (FinTMMBench).

[NLP-31] riviality Corrected Endogenous Reward

【速读】: 该论文旨在解决开放文本生成任务中强化学习因缺乏可验证奖励信号而难以有效训练的问题,尤其是在没有人工标注数据或依赖封闭源模型的情况下。其核心挑战在于如何设计一种无需外部监督的内生奖励机制(endogenous reward),以引导生成内容的质量与多样性。解决方案的关键在于提出TCER(Triviality Corrected Endogenous Reward),该方法通过衡量专用策略(specialist policy)与通用参考策略(generalist reference policy)之间的相对信息增益,并引入基于概率的校正机制来抑制“平凡性偏差”(Triviality Bias),从而避免策略坍缩至高概率但低价值的输出,实现无监督条件下生成质量与多样性的协同提升。

链接: https://arxiv.org/abs/2604.11522
作者: Xinda Wang,Zhengxu Hou,Yangshijie Zhang,Bingren Yan,Jialin Liu,Chenzhuo Zhao,Zhibo Yang,Bin-Bin Yang,Feng Xiao
机构: Peking University (北京大学); Alibaba Group (阿里巴巴集团); Lanzhou University (兰州大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning for open-ended text generation is constrained by the lack of verifiable rewards, necessitating reliance on judge models that require either annotated data or powerful closed-source models. Inspired by recent work on unsupervised reinforcement learning for mathematical reasoning using confidence-based endogenous rewards, we investigate whether this principle can be adapted to open-ended writing tasks. We find that directly applying confidence rewards leads to Triviality Bias: the policy collapses toward high-probability outputs, reducing diversity and meaningful content. We propose TCER (Triviality Corrected Endogenous Reward), which addresses this bias by rewarding the relative information gain between a specialist policy and a generalist reference policy, modulated by a probability-dependent correction mechanism. Across multiple writing benchmarks and model architectures, TCER achieves consistent improvements without external supervision. Furthermore, TCER also transfers effectively to mathematical reasoning, validating the generality of our approach across different generation tasks.

[NLP-32] DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode ACL2026

【速读】: 该论文旨在解决测试用例生成中的测试输出预测(test output prediction)问题,即如何提高大语言模型(LLM)在生成测试输出时的可靠性。传统方法通过先生成代码再执行来实现预测的“接地”(grounding),但直接执行代码易受微小错误影响而导致失败。为此,作者提出基于伪代码(pseudocode)的执行机制,利用LLM推理模拟伪代码的执行过程,从而提升对错误的鲁棒性;同时引入DuET双执行框架,通过功能多数投票(functional majority voting)融合代码直接执行与伪代码推理两种策略,二者互补:前者缓解伪代码推理可能产生的幻觉(hallucination),后者降低代码执行中的错误敏感性。实验表明,DuET在LiveCodeBench上将Pass@1指标提升13.6个百分点,达到当前最优性能。

链接: https://arxiv.org/abs/2604.11514
作者: Hojae Han,Jaejin Kim,Seung-won Hwang,Yu Jin Kim,Moontae Lee
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: Findings of ACL 2026

点击查看摘要

Abstract:This work addresses test output prediction, a key challenge in test case generation. To improve the reliability of predicted outputs by LLMs, prior approaches generate code first to ground predictions. One grounding strategy is direct execution of generated code, but even minor errors can cause failures. To address this, we introduce LLM-based pseudocode execution, which grounds prediction on more error-resilient pseudocode and simulates execution via LLM reasoning. We further propose DuET, a dual-execution framework that combines both approaches by functional majority voting. Our analysis shows the two approaches are complementary in overcoming the limitations of direct execution suffering from code errors, and pseudocode reasoning from hallucination. On LiveCodeBench, DuET achieves the state-of-the-art performance, improving Pass@1 by 13.6 pp.

[NLP-33] Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)过程中探索多样性不足与任务准确性难以兼顾的问题。解决方案的关键在于提出一种名为 Policy Split 的新范式,其核心是将策略(policy)分裂为正常模式(normal mode)和高熵模式(high-entropy mode),二者共享模型参数但通过协同双模式熵正则化机制分别优化不同目标:正常模式聚焦于任务正确性,高熵模式引入探索偏好以促进多样化行为生成。这种双模式协同机制使模型能够在保持性能的同时实现更丰富的探索行为,从而提升整体学习效率与泛化能力。

链接: https://arxiv.org/abs/2604.11510
作者: Jiashu Yao,Heyan Huang,Chuwei Luo,Daiqing Wu,Zeming Liu,Yuhang Guo,Yangyang Kang
机构: Beijing Institute of Technology (北京理工大学); Zhejiang University (浙江大学); ByteDance China (字节跳动中国); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: preprint

点击查看摘要

Abstract:To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that bifurcates the policy into normal and high-entropy modes with a high-entropy prompt. While sharing model parameters, the two modes undergo collaborative dual-mode entropy regularization tailored to distinct objectives. Specifically, the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration, and the two modes learn collaboratively. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes in general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, where the high-entropy mode generates distinct behavioral patterns to the normal mode, providing unique learning signals.

[NLP-34] METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在上下文因果推理(contextual causal reasoning)能力上的不足,尤其是现有基准测试在评估时存在情境一致性缺失和因果层次覆盖不全的问题。其解决方案的关键在于提出METER,一个在统一上下文设置下系统性地对LLMs在因果阶梯(causal ladder)三个层级——关联(association)、干预(intervention)与反事实(counterfactual)——进行全面评估的新基准。通过该框架,研究者能够更准确地识别LLM在不同因果层级上的性能退化现象,并揭示其根本原因:一是低层级任务中模型易受因果无关但事实正确的信息干扰;二是随着任务复杂度提升,模型对给定上下文的忠实度下降,导致性能降低。

链接: https://arxiv.org/abs/2604.11502
作者: Pengfeng Li,Chen Huang,Chaoqun Hao,Hongyao Chen,Xiao-Yong Wei,Wenqiang Lei,See-Kiong Ng
机构: Sichuan University (四川大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2026. Our code and dataset are available at this https URL

点击查看摘要

Abstract:Contextual causal reasoning is a critical yet challenging capability for Large Language Models (LLMs). Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy. To address this, we pioneer METER to systematically benchmark LLMs across all three levels of the causal ladder under a unified context setting. Our extensive evaluation of various LLMs reveals a significant decline in proficiency as tasks ascend the causal hierarchy. To diagnose this degradation, we conduct a deep mechanistic analysis via both error pattern identification and internal information flow tracing. Our analysis reveals two primary failure modes: (1) LLMs are susceptible to distraction by causally irrelevant but factually correct information at lower level of causality; and (2) as tasks ascend the causal hierarchy, faithfulness to the provided context degrades, leading to a reduced performance. We belive our work advances our understanding of the mechanisms behind LLM contextual causal reasoning and establishes a critical foundation for future research. Our code and dataset are available at this https URL .

[NLP-35] Quantization Dominates Rank Reduction for KV-Cache Compression

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段因KV缓存(Key-Value Cache)占用存储资源过高而导致的效率瓶颈问题。其核心挑战是如何在有限存储预算下对KV缓存进行有效压缩,同时最小化对模型性能的影响。论文对比了两种主流压缩策略:秩缩减(Rank Reduction,即丢弃部分维度)与量化(Quantization,即保持全部维度但降低精度)。研究发现,量化始终优于秩缩减,且优势随模型规模和分组查询注意力(Grouped-Query Attention, GQA)的激进程度而扩大;例如,在Mistral 7B模型上,INT4量化相比FP16仅增加0.23 PPL,而秩为32的压缩方案则性能崩溃至0.4%。关键原因在于softmax注意力机制下的结构不对称性:秩缩减可能引发离散失败(即某维度被删除后导致注意力分配完全改变),而量化噪声具有有界性并通常能保持得分排序稳定性。作者进一步通过扰动分析证明,投影损伤(rank reduction damage)比量化损伤高出约 3×22b3 \times 2^{2b} 倍每方向(基于softmax Fisher度量),并通过基变换消融实验验证该优势不依赖于坐标系选择,从而确立“保留完整维度”是性能优越的根本原因。最终,联合K+V INT4量化实现75% KV缓存压缩,仅带来+0.18 PPL增益,显著优于混合基线。

链接: https://arxiv.org/abs/2604.11501
作者: Samuel Salfati
机构: fraQtl AI Research (fraQtl人工智能研究)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 16 pages, 3 figures

点击查看摘要

Abstract:We compare two strategies for compressing the KV cache in transformer inference: rank reduction (discard dimensions) and quantization (keep all dimensions, reduce precision). At matched storage budgets across five models (124M-14B, MHA and GQA), we find that quantization consistently outperforms rank reduction by 4-364 PPL depending on model and compression level. The gap persists even when rank reduction is combined with quantization in hybrid baselines, and it grows with GQA aggressiveness. On LAMBADA, INT4 matches FP16 accuracy (+0.23 PPL on Mistral 7B, +0.58 on GPT-2) while rank-32 at identical storage collapses to 0.4%. We trace this gap to a structural asymmetry: under softmax attention routing, removing a dimension can flip which token is attended (a discrete failure), while quantization noise is bounded and typically preserves score ordering. We formalize this via a perturbation result showing projection damage exceeds quantization damage by 3 x 2^(2b) per direction under the softmax Fisher metric. A basis ablation confirms the finding is basis-independent (spread 0.4 PPL), establishing that the advantage comes from preserving dimensions, not from a better coordinate system. Joint K+V INT4 quantization achieves 75% total KV reduction at only +0.18 PPL on Mistral 7B. Comments: 16 pages, 3 figures Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2604.11501 [cs.LG] (or arXiv:2604.11501v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.11501 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-36] Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference

【速读】: 该论文旨在解决双编码器视觉-语言模型(Dual-Encoder Vision-Language Models, VLMs)在组合性任务(compositional benchmarks)上表现不佳的问题,其核心原因被归因于标准推理协议中基于全局余弦相似度(global cosine similarity)的匹配机制限制了细粒度语义对齐能力。解决方案的关键在于引入一种轻量级Transformer结构,在不更新预训练编码器的前提下,直接从冻结的图像块(patch)和文本标记(token)嵌入中学习局部区域级对齐(localized alignment),从而显著提升模型在分布外组合性任务上的泛化性能,同时保持与全参数微调相当的域内检索效果。

链接: https://arxiv.org/abs/2604.11496
作者: Imanol Miranda,Ander Salaberria,Eneko Agirre,Gorka Azkune
机构: HiTZ Center – Ixa, University of the Basque Country (UPV/EHU)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks. We argue that this limitation may stem less from deficient representations than from the standard inference protocol based on global cosine similarity. First, through controlled diagnostic experiments, we show that explicitly enforcing fine-grained region-segment alignment at inference dramatically improves compositional performance without updating pretrained encoders. We then introduce a lightweight transformer that learns such alignments directly from frozen patch and token embeddings. Comparing against full fine-tuning and prior end-to-end compositional training methods, we find that although these approaches improve in-domain retrieval, their gains do not consistently transfer under distribution shift. In contrast, learning localized alignment over frozen representations matches full fine-tuning on in-domain retrieval while yielding substantial improvements on controlled out-of-domain compositional benchmarks. These results identify global embedding matching as a key bottleneck in dual-encoder VLMs and highlight the importance of alignment mechanisms for robust compositional generalization.

[NLP-37] Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

【速读】: 该论文旨在解决当前视觉语言(Vision-Language, VL)系统中缺乏针对人类中心对齐(Human-centric Alignment)的评估框架的问题,尤其是模型在不同区域文化语境下的适配性不足。其核心解决方案是提出“人为区域适应”(Anthropogenic Regional Adaptation)这一新范式,强调在保留全球泛化能力的前提下优化模型对特定区域语境的相关性;关键方法为“地理-泛化易用化”(Geographical-generalization-made-easy, GG-EZ),通过区域数据过滤与模型合并实现高效适配,在东南亚地区实验中实现了5–15%的文化相关性提升,同时保持超过98%的全球性能,甚至部分场景超越原模型。

链接: https://arxiv.org/abs/2604.11490
作者: Samuel Cahyawijaya,Peerat Limkonchotiwat,Tack Hwa Wong,Hitesh Laxmichand Patel,Amit Agarwal,Manuel Antonio Rufino,Carlos Rafael Catalan,Muhammad Reza Qorib,Vicky Feliren,Holy Lovenia,Aye Hninn Khine,Frederikus Hudi,David Anugraha,Alham Fikri Aji,Romrawin Chumpu,Viet-Thanh Pham,Minghan Wang,Mohamed Fazli Imam,Ruochen Zhang,Joseph Marvin Imperial,Do Xuan Long,Musa Izzanardi Wijanarko,Joel Ruben Antony Moniz,Patrick Amadeus Irawan,Hanif Muhammad Zhafran,Isaiah Flores,Ira Salsabila,Jun Kevin,Jostin Jerico Rosal,Patricia Nicole Monderin,Kun Kerdthaisong,Ahmad Mustafid,My Chiffon Nguyen,Natchapon Jongwiriyanurak,Siva Worajitwannakul,Haochen Li,Adrian Xuan Wei Lim,Bin Wang,Muhammad Ravi Shulthan Habibi,Lynnette Hui Xian Ng,Mithil Bangera,Yeshil Bangera,Priyaranjan Pattnayak,Dun Li Chan,Sherissa Caren Djuniwar,Hee Ming Shan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedicated framework for assessing human-centric alignment in vision-language systems. We offer two contributions to address this gap. First, we introduce Anthropogenic Regional Adaptation: a novel paradigm that aims to optimize model relevance to specific regional contexts while ensuring the retention of global generalization capabilities. Second, we present a simple, but effective adaptation method named Geographical-generalization-made-easy (GG-EZ), which utilizes regional data filtering and model merging. Through comprehensive experiments on 3 VL architectures: large vision-language models, text-to-image diffusion models, and vision-language embedding models, and a case study in Southeast Asia (SEA) regional adaptation, we demonstrate the importance of Anthropogenic Regional Adaptation and the effectiveness of GG-EZ, showing 5-15% gains in cultural relevance metrics across SEA while maintaining over 98% of global performance and even occasionally surpassing it. Our findings establish Anthropogenic Regional Alignment as a foundational paradigm towards applicability of multimodal vision-language models in diverse regions and demonstrate a simple-yet-effective baseline method that optimizes regional value alignment while preserving global generalization.

[NLP-38] Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在可验证奖励强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)训练过程中因广泛探索而导致的高计算开销问题。现有方法通过线性外推模型参数来减少训练步骤,但未能充分理解RLVR训练中模型参数更新的动力学特性。论文的关键创新在于提出了一种非线性低秩轨迹外推框架(Nonlinear Extrapolation of low-rank trajectories, NExt),其核心是发现并利用LoRA训练中参数差分的秩-1子空间并非线性演化,且其主导作用随训练增强;NExt通过提取多步训练中的秩-1子空间构建预测器,实现对参数更新轨迹的非线性建模与外推,从而在保持与多种RLVR算法和任务兼容性的前提下,将计算开销降低约37.5%。

链接: https://arxiv.org/abs/2604.11446
作者: Zhipeng Chen,Tao Qian,Wayne Xin Zhao,Ji-Rong Wen
机构: Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院); China University of Mining and Technology (Beijing)(北京矿业大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Working in progress

点击查看摘要

Abstract:Recently, scaling reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs) has emerged as an effective training paradigm for significantly improving model capabilities, which requires guiding the model to perform extensive exploration and learning, leading to substantial computational overhead and becoming a key challenge. To reduce the number of training steps, Prior work performs linear extrapolation of model parameters. However, the dynamics of model parameter updates during RLVR training remain insufficiently understood. To further investigate the evolution of LLMs during RLVR training, we conduct empirical experiments and find that the rank-1 subspace of the model does not evolve linearly, and its dominance over the original parameters is further amplified during LoRA training. Based on the above insights, we propose the \textbfNonlinear \textbfExtrapolation of low-rank trajectories (\textbfNExt), a novel framework that models and extrapolates low-rank parameter trajectories in a nonlinear manner. Concretely, we first train the model using LoRA and extract the rank-1 subspace of parameter differences at multiple training steps, which is then used for the subsequent nonlinear extrapolation. Afterward, we utilized the extracted rank-1 subspace to train a predictor, which can model the trajectory of parameter updates during RLVR, and then perform the predict-extend process to extrapolate model parameters, achieving the acceleration of RLVR. To further study and understand NExt, we conduct comprehensive experiments that demonstrate the effectiveness and robustness of the method. Our method reduces computational overhead by approximately 37.5% while remaining compatible with a wide range of RLVR algorithms and tasks. We release our code in this https URL.

[NLP-39] METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues ACL2026

【速读】: 该论文旨在解决非协作对话代理(non-collaborative dialogue agents)开发中依赖人工编码专家策略所带来的可扩展性差的问题。传统方法需手动构建策略,难以规模化应用。其解决方案的关键在于提出METRO方法,该方法利用大语言模型(large language models)从原始对话转录文本中自主提取策略动作和规划逻辑,并将专家知识形式化为一种称为“策略森林”(Strategy Forest)的分层结构——其中节点表示短期响应,分支体现长期战略预见性。此设计使模型在两个基准测试中平均性能提升9%-10%,并展现出良好的跨任务迁移能力,为低成本、可扩展地构建非协作对话代理提供了新路径。

链接: https://arxiv.org/abs/2604.11427
作者: Haofu Yang,Jiaji Liu,Chen Huang,Faguo Wu,Wenqiang Lei,See-Kiong Ng
机构: National University of Singapore (新加坡国立大学); Beihang University (北京航空航天大学); Nankai University (南开大学); Sichuan University (四川大学); Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing (未来区块链与隐私计算高精尖创新中心); Key Laboratory of Mathematics, Informatics and Behavioral Semantics (LMIB) (数学、信息学与行为语义学重点实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2026

点击查看摘要

Abstract:Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies. We propose \ours, a method that leverages large language models to autonomously induce both strategy actions and planning logic directly from raw transcripts. METRO formalizes expert knowledge into a Strategy Forest, a hierarchical structure that captures both short-term responses (nodes) and long-term strategic foresight (branches). Experimental results across two benchmarks show that METRO demonstrates promising performance, outperforming existing methods by an average of 9%-10%. Our further analysis not only reveals the success behind METRO (strategic behavioral diversity and foresight), but also demonstrates its robust cross-task transferability. This offers new insights into building non-collaborative agents in a cost-effective and scalable way. Our code is available at this https URL.

[NLP-40] Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation

【速读】: 该论文旨在解决语音语言模型(Speech Language Models, SLMs)中存在的“语义理解与声学实现差距”问题,即模型虽具备较强的语义理解能力,但生成的语音缺乏表现力,难以传达意图,从而影响用户参与度。该差距主要源于两个缺陷:一是意图传递失败,即SLMs无法提供稳定、连贯的说话层面意图以支持富有表现力的语音输出;二是实现无感知训练,缺乏反馈信号验证声学输出是否忠实反映预期表达。解决方案的关键在于提出自感知语音语言模型(SA-SLM),其核心贡献为:(1) 意图感知桥接(Intent-Aware Bridging),利用变分信息瓶颈(Variational Information Bottleneck, VIB)目标将模型内部语义转化为时间平滑的表达意图,使语音生成过程具备意图意识;(2) 实现感知对齐(Realization-Aware Alignment),通过将模型自身作为评判者,基于评分规则提供反馈,校准声学实现与预期表达意图的一致性。

链接: https://arxiv.org/abs/2604.11424
作者: Kuang Wang,Lai Wei,Qibing Bai,Ping Lin,Wenkai Fang,Feng Jiang,Zhongjie Jiang,Jun Huang,Yannan Wang,Haizhou Li
机构: 未知
类目: Computation and Language (cs.CL)
备注: 16 pages, 4 figures, 6 tables. Project page: this https URL

点击查看摘要

Abstract:Speech Language Models (SLMs) exhibit strong semantic understanding, yet their generated speech often sounds flat and fails to convey expressive intent, undermining user engagement. We term this mismatch the semantic understanding-acoustic realization gap. We attribute this gap to two key deficiencies: (1) intent transmission failure, where SLMs fail to provide the stable utterance-level intent needed for expressive delivery; and (2) realization-unaware training, where no feedback signal verifies whether acoustic outputs faithfully reflect intended expression. To address these issues, we propose SA-SLM (Self-Aware Speech Language Model), built on the principle that the model should be aware of what it thinks during generation and how it speaks during training. SA-SLM addresses this gap through two core contributions: (1) Intent-Aware Bridging, which uses a Variational Information Bottleneck (VIB) objective to translate the model’s internal semantics into temporally smooth expressive intent, making speech generation aware of what the model intends to express; and (2) Realization-Aware Alignment, which repurposes the model as its own critic to verify and align acoustic realization with intended expressive intent via rubric-based feedback. Trained on only 800 hours of expressive speech data, our 3B parameter SA-SLM surpasses all open-source baselines and comes within 0.08 points of GPT-4o-Audio in overall expressiveness on the EchoMind benchmark.

[NLP-41] Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning

【速读】: 该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)中检索与生成模块分离导致的协调不足问题,即检索被视为外部干预而非内在生成过程的一部分,从而限制了模型在复杂推理任务中的动态适应能力。解决方案的关键在于提出一种名为GRIP(Generation-guided Retrieval with Information Planning)的统一框架,其核心创新是将检索决策嵌入到token级解码过程中,通过控制令牌(control-token)的发射来调节检索行为,实现端到端的检索与推理协同。其中,自触发信息规划(Self-Triggered Information Planning)机制使模型能够在单一自回归轨迹中自主决定何时检索、如何重写查询以及何时终止,从而支持动态多步推理和实时证据整合,显著提升了模型在问答任务中的表现与效率。

链接: https://arxiv.org/abs/2604.11407
作者: Bo Li,Mingda Wang,Gexiang Fang,Shikun Zhang,Wei Ye
机构: Peking University (北京大学); Hebei University of Technology (河北工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Github: this https URL HuggingFace: this https URL

点击查看摘要

Abstract:We revisit retrieval-augmented generation (RAG) by embedding retrieval control directly into generation. Instead of treating retrieval as an external intervention, we express retrieval decisions within token-level decoding, enabling end-to-end coordination without additional controllers or classifiers. Under the paradigm of Retrieval as Generation, we propose \textbfGRIP (\textbfGeneration-guided \textbfRetrieval with \textbfInformation \textbfPlanning), a unified framework in which the model regulates retrieval behavior through control-token emission. Central to GRIP is \textitSelf-Triggered Information Planning, which allows the model to decide when to retrieve, how to reformulate queries, and when to terminate, all within a single autoregressive trajectory. This design tightly couples retrieval and reasoning and supports dynamic multi-step inference with on-the-fly evidence integration. To supervise these behaviors, we construct a structured training set covering answerable, partially answerable, and multi-hop queries, each aligned with specific token patterns. Experiments on five QA benchmarks show that GRIP surpasses strong RAG baselines and is competitive with GPT-4o while using substantially fewer parameters.

[NLP-42] Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging

【速读】: 该论文旨在解决视频语言模型(VLM)在多模态适配过程中因视觉对齐而导致的时间推理(Temporal Reasoning, TR)能力下降的问题,尤其是在处理序列事件时的因果与时间逻辑推断能力受损。解决方案的关键在于提出一种无需重新训练、基于任务驱动的模型融合框架MERIT,其核心思想是通过搜索层级自注意力机制的融合策略,在保持或提升时间感知(Temporal Perception, TP)能力的前提下,最大化恢复TR性能;具体而言,MERIT利用一个优化目标平衡TR提升与TP退化惩罚,识别出对推理至关重要的特定层,并通过干预性掩码和帧级归因验证这些层在决策中更聚焦于时序和因果相关证据,从而实现精准、感知-aware 的模型融合以恢复VLM的时间推理能力。

链接: https://arxiv.org/abs/2604.11399
作者: Zihang Fu,Haonan Wang,Jian Kang,Kenji Kawaguchi,Jiaying Wu
机构: National University of Singapore (新加坡国立大学); MBZUAI (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal adaptation equips large language models (LLMs) with perceptual capabilities, but often weakens the reasoning ability inherited from language-only pretraining. This trade-off is especially pronounced in video-language models (VLMs), where visual alignment can impair temporal reasoning (TR) over sequential events. We propose MERIT, a training-free, task-driven model merging framework for restoring TR in VLMs. MERIT searches over layer-wise self-attention merging recipes between a VLM and its paired text-only backbone using an objective that improves TR while penalizing degradation in temporal perception (TP). Across three representative VLMs and multiple challenging video benchmarks, MERIT consistently improves TR, preserves or improves TP, and generalizes beyond the search set to four distinct benchmarks. It also outperforms uniform full-model merging and random layer selection, showing that effective recovery depends on selecting the right layers. Interventional masking and frame-level attribution further show that the selected layers are disproportionately important for reasoning and shift model decisions toward temporally and causally relevant evidence. These results show that targeted, perception-aware model merging can effectively restore TR in VLMs without retraining.

[NLP-43] What Do Vision-Language Models Encode for Personalized Image Aesthetics Assessment? ACL2026

【速读】: 该论文旨在解决个性化图像美学评估(Personalized Image Aesthetics Assessment, PIAA)问题,即如何基于个体用户的主观审美偏好对图像进行精准评分。传统方法依赖于模型微调以适应个体差异,但存在计算成本高、数据需求量大等问题。论文的关键解决方案在于:通过分析视觉语言模型(Vision-Language Models, VLMs)内部表示,发现其编码了丰富且多层次的美学属性,并且这些属性在语言解码器层中显著分布;在此基础上,无需对模型进行微调,仅使用简单的线性模型即可实现高效、轻量化的个体级个性化评估,从而有效利用VLM的预训练知识来建模主观审美偏好。

链接: https://arxiv.org/abs/2604.11374
作者: Koki Ryu,Hitomi Yanaka
机构: The University of Tokyo (东京大学); Riken (理化学研究所); Tohoku University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: To appear at ACL 2026 findings

点击查看摘要

Abstract:Personalized image aesthetics assessment (PIAA) is an important research problem with practical real-world applications. While methods based on vision-language models (VLMs) are promising candidates for PIAA, it remains unclear whether they internally encode rich, multi-level aesthetic attributes required for effective personalization. In this paper, we first analyze the internal representations of VLMs to examine the presence and distribution of such aesthetic attributes, and then leverage them for lightweight, individual-level personalization without model fine-tuning. Our analysis reveals that VLMs encode diverse aesthetic attributes that propagate into the language decoder layers. Building on these representations, we demonstrate that simple linear models can perform PIAA effectively. We further analyze how aesthetic information is transferred across layers in different VLM architectures and across image domains. Our findings provide insights into how VLMs can be utilized for modeling subjective, individual aesthetic preferences. Our code is available at this https URL.

[NLP-44] Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories

【速读】: 该论文旨在解决蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)在自动化推理数据探索中监督信号提取效率低的问题。当前方法仅保留单一高奖励轨迹,忽略了大量探索路径中蕴含的对比性信息。其解决方案的关键在于提出对比推理路径合成(Contrastive Reasoning Path Synthesis, CRPS),通过结构化的反思过程分析高质量与低质量搜索轨迹之间的差异,提取关于策略转折点和局部失败模式的显式信息,并据此合成融合成功模式、规避已识别陷阱的推理链。实验表明,仅使用6万条CRPS合成样本训练的模型即可达到或超越基于标准拒绝采样生成的59万条样本训练的基线性能,实现20倍的数据规模压缩,同时提升跨域泛化能力。

链接: https://arxiv.org/abs/2604.11365
作者: Peiyang Liu,Zhirui Chen,Xi Wang,Di Liang,Youru Li,Zhi Cai,Wei Ye
机构: National Engineering Research Center for Software Engineering, Peking University (北京大学软件工程国家工程研究中心); School of Software and Microelectronics, Peking University (北京大学软件与微电子学院); UCAS-Terminus AI Lab, University of Chinese Academy of Sciences (中国科学院大学Terminus AI实验室); Tencent Technology (腾讯科技); College of Computer Science, Beijing University of Technology (北京工业大学计算机学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Monte Carlo Tree Search (MCTS) has been widely used for automated reasoning data exploration, but current supervision extraction methods remain inefficient. Standard approaches retain only the single highest-reward trajectory, discarding the comparative signals present in the many explored paths. Here we introduce \textbfContrastive Reasoning Path Synthesis (CRPS), a framework that transforms supervision extraction from a filtering process into a synthesis procedure. CRPS uses a structured reflective process to analyze the differences between high- and low-quality search trajectories, extracting explicit information about strategic pivots and local failure modes. These insights guide the synthesis of reasoning chains that incorporate success patterns while avoiding identified pitfalls. We show empirically that models fine-tuned on just 60K CRPS-synthesized examples match or exceed the performance of baselines trained on 590K examples derived from standard rejection sampling, a 20 \times reduction in dataset size. Furthermore, CRPS improves generalization on out-of-domain benchmarks, demonstrating that learning from the contrast between success and failure produces more transferable reasoning capabilities than learning from success alone.

[NLP-45] Geometry-Aware Localized Watermarking for Copyright Protection in Embedding-as-a-Service

【速读】: 该论文旨在解决Embedding-as-a-Service (EaaS) 在自然语言与多媒体应用中面临的模型盗用和版权侵权问题,现有水印方法存在鲁棒性—实用性—可验证性之间的根本矛盾:基于触发的方案对改写敏感,基于变换的方案易受维度扰动影响,基于区域的方案则可能因几何相似性产生误报。其解决方案的关键在于提出GeoMark,一种面向EaaS版权保护的几何感知局部化水印框架:通过利用流形上的自然嵌入作为共享水印目标,构建具有显式目标-锚点边距的几何分离锚点,并仅在自适应局部邻域内激活水印注入;该设计将水印触发位置与所有权归属解耦,实现局部触发与集中归属,从而在保持下游任务性能和几何保真度的同时,有效抵御改写、维度扰动及聚类-选择-消除(CSE)攻击,显著提升验证稳定性并降低误报风险。

链接: https://arxiv.org/abs/2604.11344
作者: Zhimin Chen,Xiaojie Liang,Wenbo Xu,Yuxuan Liu,Wei Lu
机构: Sun Yat-sen University (中山大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Embedding-as-a-Service (EaaS) has become an important semantic infrastructure for natural language and multimedia applications, but it is highly vulnerable to model stealing and copyright infringement. Existing EaaS watermarking methods face a fundamental robustness–utility–verifiability tension: trigger-based methods are fragile to paraphrasing, transformation-based methods are sensitive to dimensional perturbation, and region-based methods may incur false positives due to coincidental geometric affinity. To address this problem, we propose GeoMark, a geometry-aware localized watermarking framework for EaaS copyright protection. GeoMark uses a natural in-manifold embedding as a shared watermark target, constructs geometry-separated anchors with explicit target–anchor margins, and activates watermark injection only within adaptive local neighborhoods. This design decouples where watermarking is triggered from what ownership is attributed to, achieving localized triggering and centralized attribution. Experiments on four benchmark datasets show that GeoMark preserves downstream utility and geometric fidelity while maintaining robust copyright verification under paraphrasing, dimensional perturbation, and CSE (Clustering, Selection, Elimination) attacks, with improved verification stability and low false-positive risk. Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL) Cite as: arXiv:2604.11344 [cs.CR] (or arXiv:2604.11344v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2604.11344 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-46] Do LLM s Know Tool Irrelevance? Demystifying Structural Alignment Bias in Tool Invocations ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在工具调用过程中存在的结构性对齐偏差(structural alignment bias)问题,即当工具与用户查询语义无关时,LLMs仍因查询属性能被合法映射到工具参数而错误地触发工具调用。这一偏差导致模型在实际应用中产生严重的误调用行为,且当前评估体系未能充分识别和量化此类错误。解决方案的关键在于通过提出对比注意力归因(Contrastive Attention Attribution)方法揭示了模型内部存在语义检查与结构匹配两条竞争路径,并发现其相对强度决定了工具调用决策;基于此机制理解,进一步设计了一种再平衡策略,在不损害通用工具使用能力的前提下有效缓解了结构性对齐偏差。

链接: https://arxiv.org/abs/2604.11322
作者: Yilong Liu,Xixun Lin,Pengfei Cao,Ge Zhang,Fang Fang,Yanan Cao
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Computer Science and Technology, Donghua University (东华大学计算机科学与技术学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026 (Main Conference)

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive capabilities in utilizing external tools. In practice, however, LLMs are often exposed to tools that are irrelevant to the user’s query, in which case the desired behavior is to refrain from invocations. In this work, we identify a widespread yet overlooked mechanistic flaw in tool refusal, which we term structural alignment bias: Even when a tool fails to serve the user’s goal, LLMs still tend to invoke it whenever query attributes can be validly assigned to tool parameters. To systematically study this bias, we introduce SABEval, a new dataset that decouples structural alignment from semantic relevance. Our analysis shows that structural alignment bias induces severe tool-invocation errors in LLMs, yet remains largely unaccounted for in existing evaluations. To investigate the internal mechanisms underlying this bias, we propose Contrastive Attention Attribution, which reveals two competing pathways for semantic checking and structural matching. The relative strength of these pathways drives LLMs’ tool invocation decisions. Based on these findings, we further introduce a rebalancing strategy that effectively mitigates structural alignment bias, as demonstrated by extensive experiments, without degrading general tool-use capabilities.

[NLP-47] he Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮(multi-turn)越狱攻击(jailbreaking)中存在的隐蔽性高、依赖特定上下文且易被现有防御机制拦截的问题。传统多轮攻击通常需精心设计的触发词或结构化上下文,但随着模型对上下文感知能力增强,此类显式有害输入极易被识别和阻止;同时,攻击成功高度依赖于目标模型的特异性上下文,限制了其通用性和实际威胁。为此,作者提出“薄片切割风险”(Salami Slicing Risk),其核心在于通过串联一系列低风险输入来累积有害意图,从而绕过单次交互中的对齐阈值(alignment thresholds),最终诱导高风险行为——无需预设复杂上下文结构。基于此风险,进一步开发出通用性强、适用于多种模型类型与模态的自动化攻击框架 Salami Attack,在 GPT-4o 和 Gemini 等主流模型上实现超过 90% 的攻击成功率,并具备对抗真实世界对齐防御的能力。

链接: https://arxiv.org/abs/2604.11309
作者: Yihao Zhang,Kai Wang,Jiangrong Wu,Haolin Wu,Yuxuan Zhou,Zeming Wei,Dongxian Wu,Xun Chen,Jun Sun,Meng Sun
机构: Peking University (北京大学); Sun Yat-sen University (中山大学); Wuhan University (武汉大学); Tsinghua University (清华大学); ByteDance (字节跳动)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) face prominent security risks from jailbreaking, a practice that manipulates models to bypass built-in security constraints and generate unethical or unsafe content. Among various jailbreak techniques, multi-turn jailbreak attacks are more covert and persistent than single-turn counterparts, exposing critical vulnerabilities of LLMs. However, existing multi-turn jailbreak methods suffer from two fundamental limitations that affect the actual impact in real-world scenarios: (a) As models become more context-aware, any explicit harmful trigger is increasingly likely to be flagged and blocked; (b) Successful final-step triggers often require finely tuned, model-specific contexts, making such attacks highly context-dependent. To fill this gap, we propose \textitSalami Slicing Risk, which operates by chaining numerous low-risk inputs that individually evade alignment thresholds but cumulatively accumulate harmful intent to ultimately trigger high-risk behaviors, without heavy reliance on pre-designed contextual structures. Building on this risk, we develop Salami Attack, an automatic framework universally applicable to multiple model types and modalities. Rigorous experiments demonstrate its state-of-the-art performance across diverse models and modalities, achieving over 90% Attack Success Rate on GPT-4o and Gemini, as well as robustness against real-world alignment defenses. We also proposed a defense strategy to constrain the Salami Attack by at least 44.8% while achieving a maximum blocking rate of 64.8% against other multi-turn jailbreak attacks. Our findings provide critical insights into the pervasive risks of multi-turn jailbreaking and offer actionable mitigation strategies to enhance LLM security. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2604.11309 [cs.CR] (or arXiv:2604.11309v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2604.11309 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-48] Enhancing Multimodal Large Language Models for Ancient Chinese Character Evolution Analysis via Glyph-Driven Fine-Tuning ACL2026

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在古汉字演变分析任务中能力不足的问题,特别是如何系统性地利用MLLMs支持文本演化研究这一开放且尚未充分探索的课题。其核心挑战在于现有模型在字形层面比较上表现有限,且在字符识别与演化推理等关键任务上性能受限。解决方案的关键是提出一种基于字形驱动的微调框架(Glyph-driven Fine-tuning Framework, GEVO),该框架通过显式引导模型捕捉字形变换中的演化一致性,从而增强其对文字演变规律的理解;实验表明,即使在2B规模的模型上,GEVO也能在全部评估任务中实现稳定且全面的性能提升。

链接: https://arxiv.org/abs/2604.11299
作者: Rui Song,Lida Shi,Ruihua Qi,Yingji Li,Hao Xu
机构: Jilin University (吉林大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026 main

点击查看摘要

Abstract:In recent years, rapid advances in Multimodal Large Language Models (MLLMs) have increasingly stimulated research on ancient Chinese scripts. As the evolution of written characters constitutes a fundamental pathway for understanding cultural transformation and historical continuity, how MLLMs can be systematically leveraged to support and advance text evolution analysis remains an open and largely underexplored problem. To bridge this gap, we construct a comprehensive benchmark comprising 11 tasks and over 130,000 instances, specifically designed to evaluate the capability of MLLMs in analyzing the evolution of ancient Chinese scripts. We conduct extensive evaluations across multiple widely used MLLMs and observe that, while existing models demonstrate a limited ability in glyph-level comparison, their performance on core tasks-such as character recognition and evolutionary reasoning-remains substantially constrained. Motivated by these findings, we propose a glyph-driven fine-tuning framework (GEVO) that explicitly encourages models to capture evolutionary consistency in glyph transformations and enhances their understanding of text evolution. Experimental results show that even models at the 2B scale achieve consistent and comprehensive performance improvements across all evaluated tasks. To facilitate future research, we publicly release both the benchmark and the trained models\footnotethis https URL.

[NLP-49] he Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

【速读】: 该论文旨在解决强化学习在大语言模型(Large Language Models, LLMs)应用中常见的采样多样性下降问题,即策略反复生成相似的错误行为,导致优化停滞。其解决方案的关键在于提出一种基于记忆增强的动态奖励重塑框架(Memory-Enhanced Dynamic reward Shaping, MEDS),通过存储和利用历史轨迹中的中间模型表示,识别高频重复的错误模式,并借助基于密度的聚类方法对这些模式进行建模;随后,将属于更普遍错误簇的轨迹施加更强的惩罚,从而引导模型在采样过程中实现更广泛的探索并减少重复错误。

链接: https://arxiv.org/abs/2604.11297
作者: Yang Liu,Enxi Wang,Yufei Gao,Weixin Zhang,Bo Wang,Zhiyuan Zeng,Yikai Zhang,Yining Zheng,Xipeng Qiu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes. Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, achieving gains of up to 4.13 pass@1 points and 4.37 pass@128 points. Additional analyses using both LLM-based annotations and quantitative diversity metrics show that MEDS increases behavioral diversity during sampling.

[NLP-50] Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation

【速读】: 该论文旨在解决多语言监督微调(Supervised Fine-Tuning, SFT)数据合成中教师模型(teacher model)选择不当的问题,即当前实践中常默认选用最大规模的模型作为教师,而忽视其在非英语语言上的能力差距,导致合成数据质量低、学生模型下游性能不佳。解决方案的关键在于系统性地评估教师模型的有效性,并发现模型规模并非决定因素,而是prompt多样性、长度和响应流畅性等数据质量指标能解释超过93.3%的内在数据质量方差并有效预测学生模型表现;此外,提出匹配教师与学生模型家族、使用现有prompt进行翻译或响应等实用策略,可显著提升低资源语言下的教学效果。

链接: https://arxiv.org/abs/2604.11290
作者: Lester James V. Miranda,Ivan Vulić,Anna Korhonen
机构: Language Technology Lab, University of Cambridge (剑桥大学语言技术实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Synthesizing supervised finetuning (SFT) data from language models (LMs) to teach smaller models multilingual tasks has become increasingly common. However, teacher model selection is often ad hoc, typically defaulting to the largest available option, even though such models may have significant capability gaps in non-English languages. This practice can result in poor-quality synthetic data and suboptimal student downstream performance. In this work, we systematically characterize what makes an effective multilingual teacher. We measure intrinsic measures of data quality with extrinsic student model performance in a metric we call Polyglot Score; evaluating 10 LMs across 6 typologically diverse languages, generating over 1.4M SFT examples and training 240 student models. Among the models tested, Gemma 3 27B and Aya Expanse 32B emerge as consistently effective teachers across different student base model families. Further analyses reveal that model scale alone does not significantly predict teacher effectiveness; instead, data qualities such as prompt diversity, length, and response fluency capture over 93.3% of variance in intrinsic data quality and predict student performance. Finally, we provide practical recommendations, including matching the model families of teacher-student pairs and translating from or responding to existing prompts, which can yield improvements for less-resourced languages. We hope that our work advances data-centric research in multilingual synthetic data and LM development.

[NLP-51] ransactional Attention: Semantic Sponsorship for KV-Cache Retention

【速读】: 该论文旨在解决现有键值缓存(KV-cache)压缩方法在低压缩比下无法有效保留敏感凭证信息(如API密钥、配置值等)的问题,这类信息因在训练阶段关注度极低而被压缩策略误判为冗余内容,导致生成阶段出现严重遗漏。解决方案的关键在于提出一种名为“事务注意力”(Transactional Attention, TA)的机制,通过识别结构化锚点模式(如"key:"、“password:”)来保护其邻近的值令牌免于被驱逐,从而确保关键凭证在极低缓存容量(K=16 tokens,仅占4K上下文的0.4%)下仍能100%被保留,且在200次函数调用试验中保持稳定准确率。TA机制与现有压缩方法正交,引入的延迟开销小于1%。

链接: https://arxiv.org/abs/2604.11288
作者: Abhinaba Basu
机构: National Institute of Electronics and Information Technology (NIELIT); Indian Institute of Information Technology, Allahabad (IIITA)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:At K=16 tokens (0.4% of a 4K context), every existing KV-cache compression method achieves 0% on credential retrieval. The failure mode is dormant tokens: credentials, API keys, and configuration values that receive near-zero attention but become essential at generation time. Because these tokens lack the statistical signals that eviction policies rely on, no method based on attention scores, reconstruction loss, or learned retention gates retains them. We introduce Transactional Attention (TA), a sponsorship mechanism in which structural anchor patterns (e.g., “key:”, “password:”) protect adjacent value-bearing tokens from eviction. TA achieves 100% credential retrieval at K=16 where six baselines (H2O, TOVA, SnapKV, StreamingLLM, PyramidKV, DynamicKV) achieve 0%, and sustains 100% accuracy across 200 function-calling trials. TA-Fast, an attention-free variant, reduces memory overhead by 52% and is compatible with SDPA and FlashAttention. TA is orthogonal to existing compression methods and adds less than 1% latency overhead.

[NLP-52] Dialectic-Med: Mitigating Diagnostic Hallucinations via Counterfactual Adversarial Multi-Agent Debate ACL2026

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在医疗领域中因确认偏倚(confirmation bias)导致的视觉细节幻觉问题,即模型倾向于生成与初始诊断假设一致但可能错误的视觉描述,且现有链式思维(Chain-of-Thought, CoT)方法缺乏内在纠错机制,易引发错误传播。解决方案的关键在于提出一种名为 Dialectic-Med 的多智能体框架,通过角色专业化设计实现诊断推理的对抗性辩证过程:包括主张者(Proponent)提出诊断假设、具备新型视觉证伪模块的反对者(Opponent)主动检索矛盾视觉证据以挑战主张,以及通过加权共识图(weighted consensus graph)调解冲突的调停者(Mediator)。该框架显式建模了“证伪”认知过程,确保诊断推理紧密锚定于可验证的视觉区域,从而显著提升解释的忠实性(faithfulness)并有效抑制幻觉,优于单智能体基线方法。

链接: https://arxiv.org/abs/2604.11258
作者: Zhixiang Lu,Jionglong Su
机构: Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2026

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) in healthcare suffer from severe confirmation bias, often hallucinating visual details to support initial, potentially erroneous diagnostic hypotheses. Existing Chain-of-Thought (CoT) approaches lack intrinsic correction mechanisms, rendering them vulnerable to error propagation. To bridge this gap, we propose Dialectic-Med, a multi-agent framework that enforces diagnostic rigor through adversarial dialectics. Unlike static consensus models, Dialectic-Med orchestrates a dynamic interplay between three role-specialized agents: a proponent that formulates diagnostic hypotheses; an opponent equipped with a novel visual falsification module that actively retrieves contradictory visual evidence to challenge the Proponent; and a mediator that resolves conflicts via a weighted consensus graph. By explicitly modeling the cognitive process of falsification, our framework guarantees that diagnostic reasoning is tightly grounded in verified visual regions. Empirical evaluations on MIMIC-CXR-VQA, VQA-RAD, and PathVQA demonstrate that Dialectic-Med not only achieves state-of-the-art performance but also fundamentally enhances the trustworthiness of the reasoning process. Beyond accuracy, our approach significantly enhances explanation faithfulness and decisively mitigates hallucinations, establishing a new standard over single-agent baselines.

[NLP-53] Judge Like Human Examiners: A Weighted Importance Multi-Point Evaluation Framework for Generative Tasks with Long-form Answers

【速读】: 该论文旨在解决生成式任务中长文本回答质量评估的挑战,特别是现有方法难以准确衡量模型输出是否基于给定上下文(即真实性/事实性),以及无法体现参考答案各组成部分的异质重要性。其解决方案的关键在于提出一种加权重要性多点评估框架(Weighted Importance Multi-Point Evaluation, WIMPE),将参考答案分解为带有权重的、与上下文绑定的评分点,并设计两个互补指标:加权逐点对齐度(Weighted Point-wise Alignment, WPA)用于量化模型响应与参考答案在各评分点上的匹配程度,以及逐点冲突惩罚(Point-wise Conflict Penalty, PCP)用于识别并 penalize 模型响应与参考答案之间的矛盾内容。该方法显著提升了与人工标注的相关性。

链接: https://arxiv.org/abs/2604.11246
作者: Guoxin Yu,Chulun Zhou,Lemao Liu,Qi Wang,Mo Yu,Jialong Tang,Baosong Yang,Xiang Ao,Wao Lam,Yue Yu
机构: Pengcheng Laboratory, Shenzhen, China; The Chinese University of Hong Kong; Fudan University; Qwen team, Alibaba Group; Institute of Computing Technology, CAS
类目: Computation and Language (cs.CL)
备注: 21 pages

点击查看摘要

Abstract:Evaluating the quality of model responses remains challenging in generative tasks with long-form answers, as the expected answers usually contain multiple semantically distinct yet complementary factors that should be factorized for fine-grained assessment. Recent evaluation methods resort to relying on either task-level rubrics or question-aware checklists. However, they still 1) struggle to assess whether a response is genuinely grounded in provided contexts; 2) fail to capture the heterogeneous importance of different aspects of reference answers. Inspired by human examiners, we propose a Weighted Importance Multi-Point Evaluation (WIMPE) framework, which factorizes each reference answer into weighted context-bound scoring points. Two complementary metrics, namely Weighted Point-wise Alignment (WPA) and Point-wise Conflict Penalty (PCP), are designed to measure the alignment and contradiction between model responses and reference answers. Extensive experiments on 10 generative tasks demonstrate that WIMPE achieves higher correlations with human annotations.

[NLP-54] RUMLEM: A Dictionary-Based Lemmatizer for Romansh

【速读】: 该论文旨在解决罗曼什语(Romansh)多种方言及标准变体(Rumantsch Grischun)的词形还原(lemmatization)问题,这是自然语言处理(NLP)中诸多下游任务的关键预处理步骤。解决方案的核心在于构建了一个基于社区驱动的、针对五种主要罗曼什语变体的全面形态学数据库,并以此为基础开发了RUMLEM系统,使其能够覆盖典型文本中77–84%的词汇。该设计不仅提升了词形还原的准确性,还进一步实现了变体感知的语言分类能力,实验表明其在3万条文本上对变体识别的准确率达95%,并验证了基于该系统的罗曼什语与非罗曼什语区分的可行性。

链接: https://arxiv.org/abs/2604.11233
作者: Dominic P. Fischer,Zachary Hopton,Jannis Vamvas
机构: University of Zurich (苏黎世大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Lemmatization – the task of mapping an inflected word form to its dictionary form – is a crucial component of many NLP applications. In this paper, we present RUMLEM, a lemmatizer that covers the five main varieties of Romansh as well as the supra-regional standard variety Rumantsch Grischun. It is based on comprehensive, community-driven morphological databases for Romansh, enabling RUMLEM to cover 77-84% of the words in a typical Romansh text. Since there is a dedicated database for each Romansh variety, an additional application of RUMLEM is variety-aware language classification. Evaluation on 30’000 Romansh texts of varying lengths shows that RUMLEM correctly identifies the variety in 95% of cases. In addition, a proof of concept demonstrates the feasibility of Romansh vs. non-Romansh language classification based on the lemmatizer.

[NLP-55] Sign Language Recognition in the Age of LLM s CVPR2026

【速读】: 该论文旨在解决当前视觉语言模型(Vision Language Models, VLMs)是否能在无需任务特定训练的情况下,直接应用于孤立手语识别(Isolated Sign Language Recognition, ISLR)这一专业化视觉识别问题。其核心问题是评估现代VLM在零样本(zero-shot)场景下对ISLR任务的泛化能力。解决方案的关键在于通过在WLASL300基准上系统性地评估多个开源与专有VLMs,并发现尽管当前开源模型性能显著落后于传统监督式ISLR分类器,但它们仍能捕捉到手语动作与文本描述之间的部分视觉-语义对齐信息;同时,更大规模的专有模型展现出显著更高的准确率,揭示了模型规模和训练数据多样性对提升零样本ISLR性能的重要性。

链接: https://arxiv.org/abs/2604.11225
作者: Vaclav Javorek,Jakub Honzik,Ivan Gruber,Tomas Zelezny,Marek Hruz
机构: University of West Bohemia (西波希米亚大学); Eindhoven University of Technology (埃因霍温理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted at the CVPR 2026 Workshop on Multimodal Sign Language Research (MSLR), 8 pages, 3 figures

点击查看摘要

Abstract:Recent Vision Language Models (VLMs) have demonstrated strong performance across a wide range of multimodal reasoning tasks. This raises the question of whether such general-purpose models can also address specialized visual recognition problems such as isolated sign language recognition (ISLR) without task-specific training. In this work, we investigate the capability of modern VLMs to perform ISLR in a zero-shot setting. We evaluate several open-source and proprietary VLMs on the WLASL300 benchmark. Our experiments show that, under prompt-only zero-shot inference, current open-source VLMs remain far behind classic supervised ISLR classifiers by a wide margin. However, follow-up experiments reveal that these models capture partial visual-semantic alignment between signs and text descriptions. Larger proprietary models achieve substantially higher accuracy, highlighting the importance of model scale and training data diversity. All our code is publicly available on GitHub.

[NLP-56] HiEdit: Lifelong Model Editing with Hierarchical Reinforcement Learning ACL2026

【速读】: 该论文旨在解决持续模型编辑(Lifelong Model Editing, LME)中因静态、全局地对所有层进行参数扰动而导致的知识更新不精准与灾难性遗忘问题。现有方法忽视了不同知识可能分布在模型不同层中的特性,从而限制了模型对新知识的适应能力并损害已编辑或通用知识的稳定性。解决方案的关键在于提出HiEdit——一个基于分层强化学习的框架,通过动态识别每条编辑实例中最相关的模型层,并引入稀疏性内在奖励机制,实现局部化、精细化的参数更新,从而在仅扰动约一半层的情况下显著提升编辑效果(相较RLEdit平均性能提升8.48%)。

链接: https://arxiv.org/abs/2604.11214
作者: Yangfan Wang,Tianyang Sun,Chen Tang,Jie Liu,Wei Cai,Jingchi Jiang
机构: Harbin Institute of Technology (哈尔滨工业大学); Beidahuang Information Co., Ltd. (北大荒信息有限公司); State Key Laboratory of Smart Farm Technologies and Systems (智能农业技术与系统国家重点实验室); AI Research Center, Midea Group (Shanghai) Co., Ltd. (美的集团(上海)有限公司人工智能研究中心)
类目: Computation and Language (cs.CL)
备注: Accept by ACL 2026

点击查看摘要

Abstract:Lifelong model editing (LME) aims to sequentially rectify outdated or inaccurate knowledge in deployed LLMs while minimizing side effects on unrelated inputs. However, existing approaches typically apply parameter perturbations to a static and dense set of LLM layers for all editing instances. This practice is counter-intuitive, as we hypothesize that different pieces of knowledge are stored in distinct layers of the model. Neglecting this layer-wise specificity can impede adaptability in integrating new knowledge and result in catastrophic forgetting for both general and previously edited knowledge. To address this, we propose HiEdit, a hierarchical reinforcement learning framework that adaptively identifies the most knowledge-relevant layers for each editing instance. By enabling dynamic, instance-aware layer selection and incorporating an intrinsic reward for sparsity, HiEdit achieves precise, localized updates. Experiments on various LLMs show that HiEdit boosts the performance of the competitive RLEdit by an average of 8.48% with perturbing only half of the layers per edit. Our code is available at: this https URL.

[NLP-57] Exploring Knowledge Conflicts for Faithful LLM Reasoning : Benchmark and Method SIGIR2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在检索增强生成(Retrieval-Augmented Generation, RAG)系统中面对跨源知识冲突时的推理失真问题,尤其是当文本证据与知识图谱(Knowledge Graph, KG)证据存在矛盾时,LLMs往往无法识别可靠证据,反而过度依赖单一来源或受提示方式影响,导致错误判断。解决方案的关键在于提出XoT(Cross-source Evidence Thinking),这是一个两阶段基于解释的推理框架,专门用于处理异构来源间的冲突证据,通过结构化推理路径提升模型对多源信息的整合能力与决策准确性。

链接: https://arxiv.org/abs/2604.11209
作者: Tianzhe Zhao,Jiaoyan Chen,Shuxiu Zhang,Haiping Zhu,Qika Lin,Jun Liu
机构: Xi’an Jiaotong University (西安交通大学); The University of Manchester (曼彻斯特大学); Hunan University (湖南大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at SIGIR 2026

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable success across a wide range of applications especially when augmented by external knowledge through retrieval-augmented generation (RAG). Despite their widespread adoption, recent studies have shown that LLMs often struggle to perform faithful reasoning when conflicting knowledge is retrieved. However, existing work primarily focuses on conflicts between external knowledge and the parametric knowledge of LLMs, leaving conflicts across external knowledge largely unexplored. Meanwhile, modern RAG systems increasingly emphasize the integration of unstructured text and (semi-)structured data like knowledge graphs (KGs) to improve knowledge completeness and reasoning faithfulness. To address this gap, we introduce ConflictQA, a novel benchmark that systematically instantiates conflicts between textual evidence and KG evidence. Extensive evaluations across representative LLMs reveal that, facing such cross-source conflicts, LLMs often fail to identify reliable evidence for correct reasoning. Instead, LLMs become more sensitive to prompting choices and tend to rely exclusively on either KG or textual evidence, resulting in incorrect responses. Based on these findings, we further propose XoT, a two-stage explanation-based thinking framework tailored for reasoning over heterogeneous conflicting evidence, and verify its effectiveness with extensive experiments.

[NLP-58] CocoaBench: Evaluating Unified Digital Agents in the Wild

【速读】: 该论文旨在解决当前大型语言模型(Large Language Model, LLM)代理在实际应用中缺乏对多模态能力(如视觉、搜索与代码生成)进行灵活组合评估的问题,现有评测大多孤立测试单一能力,无法反映真实场景下复杂任务的协同需求。解决方案的关键在于提出CocoaBench——一个基于人工设计的长周期任务构成的统一数字代理基准,这些任务要求代理灵活整合视觉理解、网络搜索和编程能力;同时引入CocoaAgent作为轻量级共享框架,支持跨不同模型架构的可控比较。该方案通过仅依赖指令和最终输出的自动评估函数,实现了高效、可扩展的多能力联合评估,从而填补了现有评测体系在复杂任务组合方面的空白。

链接: https://arxiv.org/abs/2604.11201
作者: CocoaBench Team:Shibo Hao,Zhining Zhang,Zhiqi Liang,Tianyang Liu,Yuheng Zha,Qiyue Gao,Jixuan Chen,Zilong Wang,Zhoujun Cheng,Haoxiang Zhang,Junli Wang,Hexi Jin,Boyuan Zheng,Kun Zhou,Yu Wang,Feng Yao,Licheng Liu,Yijiang Li,Zhifei Li,Zhengtao Han,Pracha Promthaw,Tommaso Cerruti,Xiaohan Fu,Ziqiao Ma,Jingbo Shang,Lianhui Qin,Julian McAuley,Eric P. Xing,Zhengzhong Liu,Rupesh Kumar Srivastava,Zhiting Hu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.

[NLP-59] RACE: An Experiential Framework for Coherent Multi-hop Knowledge Graph Question Answering

【速读】: 该论文旨在解决多跳知识图谱问答(Multi-hop Knowledge Graph Question Answering, KGQA)中因各推理步骤独立处理而导致的推理碎片化和探索冗余问题,从而影响推理的连贯性和效率。其解决方案的关键在于提出一种名为TRACE的体验式框架,通过将LLM驱动的上下文推理与探索先验(exploration priors)融合,实现推理过程的语义连续性增强与可复用经验积累:一方面,动态将演化的推理路径转化为自然语言叙事以维持语义一致性;另一方面,抽象先前探索轨迹为可复用的经验先验,捕捉重复出现的探索模式,并结合双反馈重排序机制,将上下文叙事与探索先验融合以指导关系选择,从而提升多跳KGQA的鲁棒性与准确性。

链接: https://arxiv.org/abs/2604.11193
作者: Yingxu Wang,Jiaxin Huang,Mengzhu Wang,Nan Yin
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Hebei University of Technology (河北工业大学); City University of Hong Kong (香港城市大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multi-hop Knowledge Graph Question Answering (KGQA) requires coherent reasoning across relational paths, yet existing methods often treat each reasoning step independently and fail to effectively leverage experience from prior explorations, leading to fragmented reasoning and redundant exploration. To address these challenges, we propose Trajectoryaware Reasoning with Adaptive Context and Exploration priors (TRACE), an experiential framework that unifies LLM-driven contextual reasoning with exploration prior integration to enhance the coherence and robustness of multihop KGQA. Specifically, TRACE dynamically translates evolving reasoning paths into natural language narratives to maintain semantic continuity, while abstracting prior exploration trajectories into reusable experiential priors that capture recurring exploration patterns. A dualfeedback re-ranking mechanism further integrates contextual narratives with exploration priors to guide relation selection during reasoning. Extensive experiments on multiple KGQA benchmarks demonstrate that TRACE consistently outperforms state-of-the-art baselines.

[NLP-60] MathAgent : Adversarial Evolution of Constraint Graphs for Mathematical Reasoning Data Synthesis ACL2026

【速读】: 该论文旨在解决生成高质量数学推理数据时缺乏人类先验知识的问题,现有方法如种子数据变异或简单提示工程常面临模式坍塌和逻辑复杂度有限的挑战。其解决方案的关键在于提出一种分层合成框架,将数据合成建模为约束图上的无监督优化问题,随后进行语义实例化;具体而言,引入了立法者-执行者(Legislator-Executor)范式:立法者通过对抗方式演化编码问题约束的结构化生成蓝图,执行者则将这些规范实例化为多样化的自然语言场景,从而实现骨架设计与语言实现的解耦,优先聚焦于构建复杂且多样的逻辑结构,有效引导高质量数据合成。

链接: https://arxiv.org/abs/2604.11188
作者: Zixiong Yu,Jun Rao,Guhan Chen,Songtao Tian,Bohan Li,Jiansheng Wei,Min Zhang,Xiaojun Meng
机构: Huawei Large Model Data Technology Lab(华为大模型数据技术实验室); Tsinghua University(清华大学); Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳分校); Kyoto University(京都大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026 findings

点击查看摘要

Abstract:Synthesizing high-quality mathematical reasoning data without human priors remains a significant challenge. Current approaches typically rely on seed data mutation or simple prompt engineering, often suffering from mode collapse and limited logical complexity. This paper proposes a hierarchical synthesis framework that formulates data synthesis as an unsupervised optimization problem over a constraint graph followed by semantic instantiation, rather than treating it as a direct text generation task. We introduce a Legislator-Executor paradigm: The Legislator adversarially evolves structured generation blueprints encoding the constraints of the problem, while the Executor instantiates these specifications into diverse natural language scenarios. This decoupling of skeleton design from linguistic realization enables a prioritized focus on constructing complex and diverse logical structures, thereby guiding high-quality data synthesis. Experiments conducted on a total of 10 models across the Qwen, Llama, Mistral, and Gemma series demonstrate that our method achieves notable results: models fine-tuned on 1K synthesized samples outperform widely-used datasets of comparable scale (LIMO, s1K) across eight mathematical benchmarks, exhibiting superior out-of-distribution generalization.

[NLP-61] Evaluating Memory Capability in Continuous Lifelog Scenario ACL2026

【速读】: 该论文旨在解决现有记忆系统基准测试无法有效模拟真实世界场景下持续性生活日志(lifelogging)音频数据处理需求的问题,特别是当前公开数据集稀缺且评估方式多为离线静态设置,难以反映时间因果性的实际应用挑战。其解决方案的关键在于提出一个分层合成框架以构建名为 LifeDialBench 的新型基准,包含两个互补子集:基于真实第一人称视角视频的 EgoMem 和通过虚拟社区模拟生成的 LifeMem;同时引入严格的 在线评估协议(Online Evaluation protocol),确保模型在流式处理中遵循时间因果性,从而更贴近真实应用场景。实验结果表明,当前复杂的记忆系统反而不如简单的基于检索增强生成(RAG)的基线,凸显了过度设计与低质量压缩对上下文保真度的负面影响,强调高保真上下文保留对于生活日志场景的重要性。

链接: https://arxiv.org/abs/2604.11182
作者: Jianjie Zheng,Zhichen Liu,Zhanyu Shen,Jingxiang Qu,Guanhua Chen,Yile Wang,Yang Xu,Yang Liu,Sijie Cheng
机构: Southern University of Science and Technology(南方科技大学); RayNeo.AI(雷诺AI); Tsinghua University(清华大学); Shenzhen University(深圳大学); Shanghai Jiao Tong University(上海交通大学)
类目: Computation and Language (cs.CL)
备注: 27 pages, 7 figures. ACL 2026 Findings camera-ready

点击查看摘要

Abstract:Nowadays, wearable devices can continuously lifelog ambient conversations, creating substantial opportunities for memory systems. However, existing benchmarks primarily focus on online one-on-one chatting or human-AI interactions, thus neglecting the unique demands of real-world scenarios. Given the scarcity of public lifelogging audio datasets, we propose a hierarchical synthesis framework to curate \textbf\textscLifeDialBench, a novel benchmark comprising two complementary subsets: \textbfEgoMem, built on real-world egocentric videos, and \textbfLifeMem, constructed using simulated virtual community. Crucially, to address the issue of temporal leakage in traditional offline settings, we propose an \textbfOnline Evaluation protocol that strictly adheres to temporal causality, ensuring systems are evaluated in a realistic streaming fashion. Our experimental results reveal a counterintuitive finding: current sophisticated memory systems fail to outperform a simple RAG-based baseline. This highlights the detrimental impact of over-designed structures and lossy compression in current approaches, emphasizing the necessity of high-fidelity context preservation for lifelog scenarios. We release our code and data at this https URL.

[NLP-62] SHARE: Social-Humanities AI for Research and Education

【速读】: 该论文旨在解决生成式 AI(Generative AI)在社会科学与人文学科(Social Sciences and Humanities, SSH)领域应用中面临的适配性不足与伦理风险问题。当前主流模型多基于通用语料预训练,难以准确捕捉SSH文本的复杂语义结构,且其自动生成功能可能削弱学术批判性与人文价值。解决方案的关键在于构建了SHARE系列因果语言模型(causal language models),这些模型专为SSH领域设计并完全使用SSH语料进行预训练,在性能上接近通用模型(如Phi-4)但仅需其1%的数据量;同时开发了MIRROR用户界面,通过不生成任何文本的交互方式促进对SSH文本的批判性审查,从而在保留SSH核心原则的前提下合理利用模型能力。

链接: https://arxiv.org/abs/2604.11152
作者: João Gonçalves,Sonia de Jager,Petr Knoth,David Pride,Nick Jelicic
机构: Erasmus University Rotterdam (鹿特丹伊拉斯姆斯大学); Open University (开放大学)
类目: Computation and Language (cs.CL)
备注: 23 pages, 9 figures, 4 tables

点击查看摘要

Abstract:This intermediate technical report introduces the SHARE family of base models and the MIRROR user interface. The SHARE models are the first causal language models fully pretrained by and for the social sciences and humanities (SSH). Their performance in modelling SSH texts is close to that of general purpose models (Phi-4) which use 100 times more tokens, as shown by our custom SSH Cloze benchmark. The MIRROR user interface is designed for reviewing text inputs from the SSH disciplines while preserving critical engagement. By prototyping a generative AI interface that does not generate any text, we propose a way to harness the capabilities of the SHARE models without compromising the integrity of SSH principles and norms.

[NLP-63] Hierarchical Textual Knowledge for Enhanced Image Clustering CVPR2026

【速读】: 该论文旨在解决传统图像聚类方法因仅依赖视觉空间信息而导致的语义区分能力不足问题,即难以区分视觉相似但语义不同的类别。其解决方案的关键在于构建一种基于大语言模型(Large Language Models, LLMs)的分层概念-属性结构化知识体系,通过结构化提示(structured prompts)从文本空间中提炼抽象概念与判别性属性,并将其嵌入到每张输入图像中以生成知识增强特征(knowledge-enhanced features)。该方法无需额外训练即可显著提升聚类性能,在20个数据集上均优于现有方法,且在使用粗粒度文本标签可能损害性能的情况下仍保持准确性和鲁棒性。

链接: https://arxiv.org/abs/2604.11144
作者: Yijie Zhong,Yunfan Gao,Weipeng Jiang,Haofen Wang
机构: Tongji University (同济大学); Huawei Technologies Ltd. (华为技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Image clustering aims to group images in an unsupervised fashion. Traditional methods focus on knowledge from visual space, making it difficult to distinguish between visually similar but semantically different classes. Recent advances in vision-language models enable the use of textual knowledge to enhance image clustering. However, most existing methods rely on coarse class labels or simple nouns, overlooking the rich conceptual and attribute-level semantics embedded in textual space. In this paper, we propose a knowledge-enhanced clustering (KEC) method that constructs a hierarchical concept-attribute structured knowledge with the help of large language models (LLMs) to guide clustering. Specifically, we first condense redundant textual labels into abstract concepts and then automatically extract discriminative attributes for each single concept and similar concept pairs, via structured prompts to LLMs. This knowledge is instantiated for each input image to achieve the knowledge-enhanced features. The knowledge-enhanced features with original visual features are adapted to various downstream clustering algorithms. We evaluate KEC on 20 diverse datasets, showing consistent improvements across existing methods using additional textual knowledge. KEC without training outperforms zero-shot CLIP on 14 out of 20 datasets. Furthermore, the naive use of textual knowledge may harm clustering performance, while KEC provides both accuracy and robustness.

[NLP-64] How Robust Are Large Language Models for Clinical Numeracy? An Empirical Study on Numerical Reasoning Abilities in Clinical Contexts ACL2026

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在临床场景中进行数值推理时的可靠性问题,特别是如何在异构临床记录中稳健地理解与处理患者测量值。现有评估多局限于基础算术运算,缺乏对多种临床数值任务(如值检索、关系比较、聚合等)的全面覆盖,且未充分检验模型在不同笔记格式下的鲁棒性。解决方案的关键在于提出ClinicNumRobBench基准测试集,包含1,624个上下文-问题实例及真实答案,涵盖四种核心临床数值能力,并通过三种语义等价但格式不同的MIMIC-IV生命体征表示方式(包括真实世界笔记风格)来系统性地压力测试模型表现,从而为临床可靠数值推理提供严谨的评估框架。

链接: https://arxiv.org/abs/2604.11133
作者: Minh-Vuong Nguyen,Fatemeh Shiri,Zhuang Li,Karin Verspoor
机构: RMIT University (皇家墨尔本理工大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL2026 Findings

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly being explored for clinical question answering and decision support, yet safe deployment critically requires reliable handling of patient measurements in heterogeneous clinical notes. Existing evaluations of LLMs for clinical numerical reasoning provide limited operation-level coverage, restricted primarily to arithmetic computation, and rarely assess the robustness of numerical understanding across clinical note formats. We introduce ClinicNumRobBench, a benchmark of 1,624 context-question instances with ground-truth answers that evaluates four main types of clinical numeracy: value retrieval, arithmetic computation, relational comparison, and aggregation. To stress-test robustness, ClinicNumRobBench presents longitudinal MIMIC-IV vital-sign records in three semantically equivalent representations, including a real-world note-style variant derived from the Open Patients dataset, and instantiates queries using 42 question templates. Experiments on 14 LLMs show that value retrieval is generally strong, with most models exceeding 85% accuracy, while relational comparison and aggregation remain challenging, with some models scoring below 15%. Fine-tuning on medical data can reduce numeracy relative to base models by over 30%, and performance drops under note-style variation indicate LLM sensitivity to format. ClinicNumRobBench offers a rigorous testbed for clinically reliable numerical reasoning. Code and data URL are available on this https URL.

[NLP-65] DeCoVec: Building Decoding Space based Task Vector for Large Language Models via In-Context Learning ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)任务导向控制中现有方法依赖微调或侵入式内部状态修改而导致灵活性与可扩展性受限的问题。其解决方案的关键在于提出一种无需训练且非侵入式的框架 DeCoVec(Decoding Space based Task Vector),该方法直接在解码空间中构建任务向量(task vector),利用上下文学习(in-context learning, ICL)捕捉少量示例与零样本提示输出logit分布之间的差异,并将此向量注入解码过程以引导生成。实验表明,DeCoVec 在 TruthfulQA、Math-500 和 AQUA-RAT 数据集上显著优于标准少样本基线,平均准确率提升达 +5.50,同时有效抑制生成退化和逻辑错误,且对演示顺序具有强鲁棒性,无需额外输入token开销。

链接: https://arxiv.org/abs/2604.11129
作者: Feiyang Li,Yile Wang
机构: Shenzhen University (深圳大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 Findings

点击查看摘要

Abstract:Task vectors, representing directions in model or activation spaces that encode task-specific behaviors, have emerged as a promising tool for steering large language models (LLMs). However, existing approaches typically require fine-tuning or invasive manipulation of internal states, limiting their flexibility and scalability. We propose \textscDeCoVec (Decoding Space based Task Vector), a training-free and non-invasive framework that constructs task vectors directly in the \textitdecoding space by leveraging in-context learning (ICL). Specifically, \textscDeCoVec captures the task essence as the difference between the output logit distributions of few-shot and zero-shot prompts, then steers generation by injecting this vector into the decoding process. Experiments across seven LLMs (0.5B–9B) on TruthfulQA, Math-500, and AQUA-RAT show that \textscDeCoVec consistently outperforms standard few-shot baselines, with gains up to +5.50 average accuracy. Further analysis demonstrates that \textscDeCoVec effectively suppresses generation degeneration and logical flaws while exhibiting strong robustness to demonstration ordering, all without incurring additional input token costs. Our method offers a training-free and non-invasive solution for LLM steering without requiring weight updates or auxiliary models.

[NLP-66] BITS Pilani at SemEval-2026 Task 9: Structured Supervised Fine-Tuning with DPO Refinement for Polarization Detection

【速读】: 该论文旨在解决在线政治极化(political polarization)的精准计算检测难题,尤其针对社交媒体文本中隐含的修辞策略、框架暗示以及人工标注成本高昂等问题。其解决方案的关键在于提出一种两阶段方法:首先利用可解释的槽位填充模板(目标、主张类型、表现检查清单和理由)对Qwen 2.5-7B-Instruct模型进行LoRA结构化微调;随后采用直接偏好优化(Direct Preference Optimization, DPO)技术,基于自动构建的偏好对进一步优化模型性能,从而显著降低假阴性率并提升整体准确率,且无需额外的人工标注。实验表明,在SemEval 2026 POLAR共享任务英文开发集上,DPO使召回率从0.5085提升至0.7797,并将宏F1值提高约5个百分点。

链接: https://arxiv.org/abs/2604.11121
作者: Atharva Gupta,Dhruv Kumar,Yash Sinha
机构: Birla Institute of Technology and Science, Pilani, India
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The POLAR SemEval-2026 Shared Task aims to detect online polarization and focuses on the classification and identification of multilingual, multicultural, and multi-event polarization. Accurate computational detection of online polarization is challenging due to nuanced rhetoric, implicit framing, and the high cost of human-in-the-loop annotation. Building on recent findings that contextual prompting enables large language models to function as strong polarization detectors, we present a two-stage approach for detecting political polarization in social media text that combines structured supervised fine-tuning with Direct Preference Optimization (DPO) refinement. We fine-tune Qwen 2.5-7B-Instruct with LoRA using an interpretable slot-filling template (target, claim type, manifestation checklist, and justification). We then apply DPO with automatically generated preference pairs to reduce costly false negatives. Experiments on the SemEval 2026 POLAR shared task dataset show that preference-based refinement improves both accuracy and decreases false negatives without extra annotation. On the English development set, DPO increases recall from 0.5085 to 0.7797 and improves macro-F1 by ~5 points. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2604.11121 [cs.CL] (or arXiv:2604.11121v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2604.11121 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-67] Use of AI Tools: Guidelines to Maintain Academic Integrity in Computing Colleges

【速读】: 该论文旨在解决生成式 AI(Generative AI)工具在计算机学科教育中广泛应用所带来的学术诚信(academic integrity)挑战,尤其是在评估过程中可能引发的学术不端行为。其解决方案的关键在于提出一套通用指导原则与针对不同评估形式的具体建议,以帮助教师在教学实践中负责任地整合AI工具,同时确保教育目标的实现和学术诚信的维护;此外,论文还引入了一个结构化的数学模型,用于在存在AI辅助的情况下对学生成绩进行科学、公正的评估。

链接: https://arxiv.org/abs/2604.11111
作者: Hatem M. El-boghdadi,Toqeer Ali Syed,Ali Akarma,Qamar Wali
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注: This paper is in press for Volume 33 Issue 4 (2025) International Journal of Energy, Environment, and Economics

点击查看摘要

Abstract:The rapid adoption of AI tools such as ChatGPT has significantly transformed academic practices, offering considerable benefits for both students and faculty in computing disciplines. These tools have been shown to enhance learning efficiency, academic self-efficacy, and confidence. However, their increasing use also raises pressing concerns regarding the preservation of academic integrity – an essential pillar of the educational process. This paper explores the implications of widespread AI tool usage within computing colleges, with a particular focus on how to align their use with the principles of academic honesty. We begin by classifying common assessment techniques employed in computing education and examine how each may be impacted by AI-assisted tools. Building on this foundation, we propose a set of general guidelines applicable across various assessment formats to help instructors responsibly integrate AI tools into their pedagogy. Furthermore, we provide targeted, assessment-specific recommendations designed to uphold educational objectives while mitigating risks of academic misconduct. These guidelines serve as a practical framework for instructors aiming to balance the pedagogical advantages of AI tools with the imperative of maintaining academic integrity in computing education. Finally, we introduce a formal model that provides a structured mathematical framework for evaluating student assessments in the presence of AI-assisted tools.

[NLP-68] Efficient Training for Cross-lingual Speech Language Models ACL2026

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)主要局限于文本模态、难以实现自然人机交互的问题,尤其是构建高效端到端语音大语言模型(Speech Large Language Models, Speech LLMs)时面临的训练数据稀缺与多语言扩展困难。其解决方案的关键在于提出一种基于离散语音标记(discrete speech tokens)的跨语言语音大语言模型(Cross-lingual Speech Language Model, CSLM)训练方法,通过持续预训练实现跨模态和跨语言对齐,并在指令微调阶段采用语音-文本交错的链式模态生成过程,在细粒度上增强模态对齐,从而提升生成质量并降低延迟。该方法无需海量语音数据即可实现良好的语言可扩展性。

链接: https://arxiv.org/abs/2604.11096
作者: Yan Zhou,Qingkai Fang,Yun Hong,Yang Feng
机构: Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Accepted to Findings of ACL 2026

点击查看摘要

Abstract:Currently, large language models (LLMs) predominantly focus on the text modality. To enable more natural human-AI interaction, speech LLMs are emerging, but building effective end-to-end speech LLMs remains challenging due to limited data and the difficulty in expanding to more languages. In this paper, we introduce Cross-lingual Speech Language Model (CSLM), an efficient training method for cross-lingual speech LLMs based on discrete speech tokens. We propose a novel alignment strategy that achieves cross-modal and cross-lingual alignment through continual pre-training. By conducting instruction fine-tuning following a speech-text interleaved chain-of-modality generation process, we enhance modal alignment at a finer granularity, thereby improving generation quality and reducing latency. CSLM aligns different modalities and languages simultaneously without the need for massive speech data, thus exhibiting good language scalability. Evaluations on cross-modal tasks, mono-lingual conversational tasks, and cross-lingual conversational tasks demonstrate CSLM’s strong cross-modal alignment capabilities and general task abilities. (Code is available at: this https URL)

[NLP-69] Do Agent Rules Shape or Distort? Guardrails Beat Guidance in Coding Agents

【速读】: 该论文旨在解决自然语言指令文件(如 .cursorrules)对AI编程代理性能影响的不确定性问题,即这些规则是否真正提升代理表现,以及哪些属性使其有益。解决方案的关键在于通过大规模实证评估(分析679个GitHub项目中的25,532条规则,运行超过5,000次代理任务),发现规则主要通过上下文提示(context priming)而非具体指令起作用;其中负向约束(如“不要重构无关代码”)单独有效,而正向指令(如“遵循代码风格”)反而有害;且单条规则常有害但组合使用至多50条时仍无性能下降。因此,安全配置的核心原则是:限制代理必须避免的行为,而非规定其应执行的动作。

链接: https://arxiv.org/abs/2604.11088
作者: Xing Zhang,Guanghui Wang,Yanwei Cui,Wei Qiu,Ziyuan Li,Bing Zhu,Peiyang He
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Developers increasingly guide AI coding agents through natural language instruction files (e.g., this http URL, .cursorrules), yet no controlled study has measured whether these rules actually improve agent performance or which properties make a rule beneficial. We scrape 679 such files (25,532 rules) from GitHub and conduct the first large-scale empirical evaluation, running over 5,000 agent runs with a state-of-the-art coding agent on SWE-bench Verified. Rules improve performance by 7–14 percentage points, but random rules help as much as expert-curated ones – suggesting rules work through context priming rather than specific instruction. Negative constraints (“do not refactor unrelated code”) are the only individually beneficial rule type, while positive directives (“follow code style”) actively hurt – a pattern we analyze through the lens of potential-based reward shaping (PBRS). Moreover, individual rules are mostly harmful in isolation yet collectively helpful, with no degradation up to 50 rules. These findings expose a hidden reliability risk – well-intentioned rules routinely degrade agent performance – and provide a clear principle for safe agent configuration: constrain what agents must not do, rather than prescribing what they should.

[NLP-70] owards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation ACL2026

【速读】: 该论文旨在解决当前客户服务中心聊天机器人(Customer Service Chatbots)仅作为被动响应工具的局限性,致力于将其升级为能够主动获取高价值信息和商业智能的战略接口。其核心问题在于如何在保证用户体验的前提下,优化对话中主动探询目标信息的时机与频率,以最小化对话轮次并降低用户摩擦。解决方案的关键在于提出PROCHATIP框架,该框架包含一个专门训练的对话策略模块(conversation strategy module),用于精准掌握信息探询的时机,从而实现高效的信息采集与优质的服务质量之间的平衡。

链接: https://arxiv.org/abs/2604.11077
作者: Chen Huang,Zitan Jiang,Changyi Zou,Wenqiang Lei,See-Kiong Ng
机构: Institute of Data Science, National University of Singapore; College of Computer Science, Sichuan University
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Findings of ACL 2026

点击查看摘要

Abstract:Customer service chatbots are increasingly expected to serve not merely as reactive support tools for users, but as strategic interfaces for harvesting high-value information and business intelligence. In response, we make three main contributions. 1) We introduce and define a novel task of Proactive Information Probing, which optimizes when to probe users for pre-specified target information while minimizing conversation turns and user friction. 2) We propose PROCHATIP, a proactive chatbot framework featuring a specialized conversation strategy module trained to master the delicate timing of probes. 3) Experiments demonstrate that PROCHATIP significantly outperforms baselines, exhibiting superior capability in both information probing and service quality. We believe that our work effectively redefines the commercial utility of chatbots, positioning them as scalable, cost-effective engines for proactive business intelligence. Our code is available at this https URL.

[NLP-71] ks-pret-5m: a 5 million word 12 million token kashmiri pretraining dataset

【速读】: 该论文旨在解决克什米尔语(Kashmiri)在自然语言处理领域中因缺乏大规模预训练语料库而导致的模型性能受限问题。解决方案的关键在于构建并公开发布KS-PRET-5M,这是目前最大规模的克什米尔语预训练语料库,包含约509万词、2769万字符和29.5万唯一词形。其核心创新在于从两类来源获取数据:一是通过专用转换工具从专有InPage排版格式中提取档案与文学文本,二是收集Unicode原生的网络文本;并通过十一阶段清洗流程将克什米尔文字占比提升至0.9965,显著降低梵文字符污染。此外,采用google/muril-base-cased模型进行子词级分词,获得1213万子词标记,远高于基于非克什米尔语波斯阿拉伯文类比的先前估计,从而为克什米尔语语言模型预训练、分词器训练及计算语言学研究提供高质量基础资源。

链接: https://arxiv.org/abs/2604.11066
作者: Haq Nawaz Malik,Nahfid Nissar
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present KS-PRET-5M, the largest publicly available pretraining dataset for the Kashmiri language, comprising 5,090,244 (5.09M) words, 27,692,959 (27.6M) characters, and a vocabulary of 295,433 (295.4K) unique word types. We assembled the dataset from two source classes: digitized archival and literary material, encompassing literature, news, biographies, novels, poetry, religious scholarship, and academic writing, recovered from the proprietary InPage desktop-publishing format using the converter of Malik~\citemalik2024inpage, and Unicode-native text collected from Kashmiri-language web sources. All text was processed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965, reducing Devanagari contamination to 146 characters across the full dataset. We tokenized the dataset empirically using google/muril-base-cased, yielding a subword ratio of 2.383 tokens per word and a total of approximately 12.13 million subword tokens, substantially higher than prior estimates derived from non-Kashmiri Perso-Arabic analogues. KS-PRET-5M is released as a single continuous text stream under CC~BY~4.0 to support language model pretraining, tokenizer training, and computational linguistic research for Kashmiri.

[NLP-72] Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation Behavior and Methodological Confounds

【速读】: 该论文旨在解决语言模型在情感表征(emotion representation)上的共性与差异问题,特别是不同架构和训练阶段的模型是否共享一致的情感几何结构(21-emotion geometry),以及行为特征差异是否源于情感表征本身。解决方案的关键在于:首先,通过统一的推理模式(comprehension-mode pipeline)在fp16精度下提取多个小规模语言模型(1B–8B参数)的情感向量集,并使用表示相似性分析(representational similarity analysis, RSA)比较其余弦距离矩阵(RDMs);其次,发现五种成熟架构模型具有高度一致的情感几何结构(Spearman相关系数0.74–0.92),且即使在MTI Compliance行为维度上表现相反(如Qwen 2.5与Llama 3.2),其情感RDM仍高度相似(ρ = 0.81),说明行为差异发生在共享情感表征之上;最后,揭示了此前研究中被误认为单一“理解-生成”方法效应实则由四个独立层组成——粗粒度方法差异、生成阶段子参数敏感性、精度(fp16 vs INT8)效应及跨实验偏差,从而为情感向量研究提供了更精细的分层解释框架。

链接: https://arxiv.org/abs/2604.11050
作者: Jihoon Jeong
机构: Daegu Gyeongbuk Institute of Science and Technology (DGIST); ModuLabs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 34 pages, 6 figures, 1 table in main text + appendix. Ongoing series on Model Medicine

点击查看摘要

Abstract:We extract 21-emotion vector sets from twelve small language models (six architectures x base/instruct, 1B-8B parameters) under a unified comprehension-mode pipeline at fp16 precision, and compare the resulting geometries via representational similarity analysis on raw cosine RDMs. The five mature architectures (Qwen 2.5 1.5B, SmolLM2 1.7B, Llama 3.2 3B, Mistral 7B v0.3, Llama 3.1 8B) share nearly identical 21-emotion geometry, with pairwise RDM Spearman correlations of 0.74-0.92. This universality persists across diametrically opposed behavioral profiles: Qwen 2.5 and Llama 3.2 occupy opposite poles of MTI Compliance facets yet produce nearly identical emotion RDMs (rho = 0.81), so behavioral facet differences arise above the shared emotion representation. Gemma-3 1B base, the one immature case in our dataset, exhibits extreme residual-stream anisotropy (0.997) and is restructured by RLHF across all geometric descriptors, whereas the five already-mature families show within-family base x instruct RDM correlations of rho = 0.92 (Mistral 7B v0.3 at rho = 0.985), suggesting RLHF restructures only representations that are not yet organized. Methodologically, we show that what prior work has read as a single comprehension-vs-generation method effect in fact decomposes into four distinct layers – a coarse method-dependent dissociation, robust sub-parameter sensitivity within generation, a true precision (fp16 vs INT8) effect, and a conflated cross-experiment bias that distorts in opposite directions for different models – so that a single rho between two prior emotion-vector studies is not a safe basis for interpretation without the layered decomposition.

[NLP-73] A Systematic Analysis of the Impact of Persona Steering on LLM Capabilities

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在赋予特定人格特征(persona)后,其底层认知能力是否受到影响的问题。以往研究多关注人格对交互风格的塑造,但缺乏对认知性能变化的系统评估。解决方案的关键在于提出一种基于神经元层面的人格特质诱导框架(Neuron-based Personality Trait Induction, NPTI),通过该框架可稳定地将大五人格特质(Big Five personality traits)注入LLMs,并结合六项认知基准测试发现:人格诱导不仅带来表层风格变化,还会引发具有任务依赖性的认知性能波动——例如某些人格提升指令遵循能力,而另一些则削弱复杂推理能力;且开放性(Openness)与外向性(Extraversion)对认知表现的影响最为显著。进一步地,作者利用这些规律设计出动态人格路由策略(Dynamic Persona Routing, DPR),该策略无需额外训练即可根据查询内容自适应选择最优人格,从而超越所有静态人格设置的表现。

链接: https://arxiv.org/abs/2604.11048
作者: Jiaqi Chen,Ming Wang,Tingna Xie,Shi Feng,Yongkang Liu
机构: Northeastern University (东北大学); Singapore Management University (新加坡管理大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Imbuing Large Language Models (LLMs) with specific personas is prevalent for tailoring interaction styles, yet the impact on underlying cognitive capabilities remains unexplored. We employ the Neuron-based Personality Trait Induction (NPTI) framework to induce Big Five personality traits in LLMs and evaluate performance across six cognitive benchmarks. Our findings reveal that persona induction produces stable, reproducible shifts in cognitive task performance beyond surface-level stylistic changes. These effects exhibit strong task dependence: certain personalities yield consistent gains on instruction-following, while others impair complex reasoning. Effect magnitude varies systematically by trait dimension, with Openness and Extraversion exerting the most robust influence. Furthermore, LLM effects show 73.68% directional consistency with human personality-cognition relationships. Capitalizing on these regularities, we propose Dynamic Persona Routing (DPR), a lightweight query-adaptive strategy that outperforms the best static persona without additional training.

[NLP-74] Uncertainty-Aware Web-Conditioned Scientific Fact-Checking

【速读】: 该论文旨在解决科学事实核查(scientific fact-checking)在生物医学和材料科学等专业领域中,现有系统常出现幻觉(hallucination)或推理不一致的问题,尤其是在面对技术性、组成性陈述时,受限于证据片段的来源、成本与延迟约束。其解决方案的关键在于提出一个以原子谓词-论元分解(atomic predicate-argument decomposition)为核心的流水线:首先通过嵌入对齐将原子事实映射到局部证据片段,再由轻量级证据 grounded 检查器验证;仅当支持不确定时,才触发针对权威来源的领域受限网络搜索(domain-restricted web search),实现校准后的不确定性门控协同验证(calibrated, uncertainty-gated corroboration)。该方法支持二值(Supported/Refuted)与三值分类(Supported/Refuted/NEI),并在 Context-Only 和 Context+Web 两种评估范式下表现优异,显著优于现有最强基线模型,且外部证据调用比例低,体现了高效、可解释、可控的核查机制。

链接: https://arxiv.org/abs/2604.11036
作者: Ashwin Vinod,Katrin Erk
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校); University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scientific fact-checking is vital for assessing claims in specialized domains such as biomedicine and materials science, yet existing systems often hallucinate or apply inconsistent reasoning, especially when verifying technical, compositional claims against an evidence snippet under source and cost/latency constraints. We present a pipeline centered on atomic predicate-argument decomposition and calibrated, uncertainty-gated corroboration: atomic facts are aligned to local snippets via embeddings, verified by a compact evidence-grounded checker, and only facts with uncertain support trigger domain-restricted web search over authoritative sources. The system supports both binary and tri-valued classification where it predicts labels from Supported, Refuted, NEI for three-way tasks. We evaluate under two regimes, Context-Only (no web) and Context+Web (uncertainty-gated web corroboration); when retrieved evidence conflicts with the provided context, we abstain with NEI rather than overriding the context. On multiple benchmarks, our framework surpasses the strongest benchmarks. In our experiments, web corroboration was invoked for only a minority of atomic facts on average, indicating that external evidence is consulted selectively under calibrated uncertainty rather than routinely. Overall, coupling atomic granularity with calibrated, uncertainty-gated corroboration yields more interpretable and context-conditioned verification, making the approach well-suited to high-stakes, single-document settings that demand traceable rationales, predictable cost/latency, and conservative.

[NLP-75] Min-k Sampling: Decoupling Truncation from Temperature Scaling via Relative Logit Dynamics ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在文本生成过程中因采样策略对温度参数(temperature)高度敏感而导致的质量不稳定问题。现有主流方法如Top-k、Top-p和Min-p虽能通过概率空间截断实现多样性与准确性的平衡,但其性能随温度变化剧烈;而近期的logit空间方法如Top-nσ虽具备温度不变性,却依赖全局统计量,易受长尾噪声干扰,无法捕捉候选词集内部的细粒度置信结构。本文提出Min-k采样(Min-k Sampling),其核心创新在于引入一种动态截断机制:通过分析排序后logit分布的局部形状,识别“语义悬崖”(semantic cliffs)——即从高置信核心词到低置信长尾词之间的陡峭过渡区域,并基于位置加权相对衰减速率自适应确定每一步生成的截断边界。该方法在理论上严格保证温度不变性,实验证明其对超参数不敏感,在多个推理基准、创意写作任务及人工评估中均显著提升文本质量,尤其在极端温度条件下表现稳定,优于传统概率空间方法。

链接: https://arxiv.org/abs/2604.11012
作者: Yuanhao Ding,Meimingwei Li,Esteban Garces Arias,Matthias Aßenmacher,Christian Heumann,Chongsheng Zhang
机构: Henan University (河南大学); LMU Munich (慕尼黑路德维希马克西米利安大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at ACL 2026 (Main Conference)

点击查看摘要

Abstract:The quality of text generated by large language models depends critically on the decoding sampling strategy. While mainstream methods such as Top- k , Top- p , and Min- p achieve a balance between diversity and accuracy through probability-space truncation, they share an inherent limitation: extreme sensitivity to the temperature parameter. Recent logit-space approaches like Top- n\sigma achieve temperature invariance but rely on global statistics that are susceptible to long-tail noise, failing to capture fine-grained confidence structures among top candidates. We propose \textbfMin- k Sampling, a novel dynamic truncation strategy that analyzes the local shape of the sorted logit distribution to identify “semantic cliffs”: sharp transitions from high-confidence core tokens to uncertain long-tail tokens. By computing a position-weighted relative decay rate, Min- k dynamically determines truncation boundaries at each generation step. We formally prove that Min- k achieves strict temperature invariance and empirically demonstrate its low sensitivity to hyperparameter choices. Experiments on multiple reasoning benchmarks, creative writing tasks, and human evaluation show that Min- k consistently improves text quality, maintaining robust performance even under extreme temperature settings where probability-based methods collapse. We make our code, models, and analysis tools publicly available.

[NLP-76] K-Way Energy Probes for Metacognition Reduce to Softmax in Discriminative Predictive Coding Networks

【速读】: 该论文旨在解决生成式 AI(Generative AI)中结构化探针(structural probe)的解释力问题,特别是预测编码网络(Predictive Coding Networks, PCNs)中基于能量的多类探针(K-way energy probe)是否能提供比软最大值(softmax)更丰富的决策信号。研究发现,在标准判别式预测编码(discriminative PC)框架下,这种能量探针实际上并不比 softmax 提供更强的信息,其行为可被分解为一个单调递增于 log-softmax 边距的项加上一个未训练以关联正确性的残差项。解决方案的关键在于提出并验证了一个近似分解机制:在目标固定、交叉熵能量(CE-energy)训练和前馈隐层动态条件下,能量边距等价于 softmax 边距加一个非相关残差,从而解释了为何实证中能量探针始终低于 softmax 且稳定一致。这一结果表明,在判别式 PC 框架内,能量探针本质上是 softmax 的下界,而非独立的元认知信号。

链接: https://arxiv.org/abs/2604.11011
作者: Jon-Paul Cacioli
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
备注: 33 pages, 3 figures

点击查看摘要

Abstract:We present this as a negative result with an explanatory mechanism, not as a formal upper bound. Predictive coding networks (PCNs) admit a K-way energy probe in which each candidate class is fixed as a target, inference is run to settling, and the per-hypothesis settled energies are compared. The probe appears to read a richer signal source than softmax, since the per-hypothesis energy depends on the entire generative chain. We argue this appearance is misleading under the standard Pinchetti-style discriminative PC formulation. We present an approximate reduction showing that with target-clamped CE-energy training and effectively-feedforward latent dynamics, the K-way energy margin decomposes into a monotone function of the log-softmax margin plus a residual that is not trained to correlate with correctness. The decomposition predicts that the structural probe should track softmax from below. We test this across six conditions on CIFAR-10: extended deterministic training, direct measurement of latent movement during inference, a post-hoc decoder fairness control on a backpropagation network, a matched-budget PC vs BP comparison, a five-point Langevin temperature sweep, and trajectory-integrated MCPC training. In every condition the probe sat below softmax. The gap was stable across training procedures within the discriminative PC family. Final-state and trajectory-integrated training produced probes whose AUROC_2 values differed by less than 10^-3 at deterministic evaluation. The empirical regime is small: single seed, 2.1M-parameter network, 1280 test images. We frame the result as a preprint inviting replication. We discuss conditions under which the decomposition does not apply (bidirectional PC, prospective configuration, generative PC, non-CE energy formulations) and directions for productive structural probing the analysis does not foreclose. Comments: 33 pages, 3 figures Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE) ACMclasses: I.2.6; I.5.1 Cite as: arXiv:2604.11011 [cs.LG] (or arXiv:2604.11011v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.11011 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Jon-Paul Cacioli [view email] [v1] Mon, 13 Apr 2026 05:24:44 UTC (360 KB) Full-text links: Access Paper: View a PDF of the paper titled K-Way Energy Probes for Metacognition Reduce to Softmax in Discriminative Predictive Coding Networks, by Jon-Paul CacioliView PDFHTML (experimental)TeX Source view license Current browse context: cs.LG prev | next new | recent | 2026-04 Change to browse by: cs cs.CL cs.NE References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[NLP-77] When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)能否生成有助于强化学习(Reinforcement Learning, RL)交易代理的连续数值特征这一问题。其核心挑战在于,尽管LLM可提取具有预测能力的中间表示(如信息系数IC > 0.15),但这些特征在实际交易策略中是否能提升政策鲁棒性仍不明确。解决方案的关键在于构建一个模块化流水线:冻结LLM作为无状态特征提取器,将非结构化新闻和文件转化为固定维度向量,并引入自动化提示优化循环——将提取提示视为离散超参数,直接以信息系数(Information Coefficient, IC)为优化目标而非传统NLP损失函数,从而发现真正具备预测性的特征。然而研究也揭示了特征有效性与策略鲁棒性之间的鸿沟,表明在宏观冲击下LLM衍生特征反而引入噪声,凸显了分布偏移下迁移学习的核心挑战。

链接: https://arxiv.org/abs/2604.10996
作者: Zhengzhe Yang
机构: Independent Researcher
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Can large language models (LLMs) generate continuous numerical features that improve reinforcement learning (RL) trading agents? We build a modular pipeline where a frozen LLM serves as a stateless feature extractor, transforming unstructured daily news and filings into a fixed-dimensional vector consumed by a downstream PPO agent. We introduce an automated prompt-optimization loop that treats the extraction prompt as a discrete hyperparameter and tunes it directly against the Information Coefficient - the Spearman rank correlation between predicted and realized returns - rather than NLP losses. The optimized prompt discovers genuinely predictive features (IC above 0.15 on held-out data). However, these valid intermediate representations do not automatically translate into downstream task performance: during a distribution shift caused by a macroeconomic shock, LLM-derived features add noise, and the augmented agent under-performs a price-only baseline. In a calmer test regime the agent recovers, yet macroeconomic state variables remain the most robust driver of policy improvement. Our findings highlight a gap between feature-level validity and policy-level robustness that parallels known challenges in transfer learning under distribution shift.

[NLP-78] When Verification Fails: How Compositionally Infeasible Claims Escape Rejection

【速读】: 该论文旨在解决科学主张验证(Scientific Claim Verification)任务中模型容易依赖“显著约束检查”(salient-constraint checking)这一捷径推理策略的问题,从而无法真正实现基于封闭世界假设(Closed-World Assumption, CWA)的严谨验证。现有基准测试因通过扰动单一显著元素构造不可行主张,无法区分模型是否执行了严格的全约束验证与仅依赖最显著约束的简化判断。论文的关键解决方案是构建“组合不可行主张”(compositional infeasible claims),其中显著约束被支持但非显著约束被矛盾,从而暴露模型对CWA标准的违背行为。实验表明,即使在不同模型家族和模态下,多数模型仍会过度接受此类主张,证实了捷径推理的普遍性;进一步通过模型上下文干预分析发现,各模型间的性能差异主要源于验证阈值的不同,而非根本推理能力的差异,说明当前验证行为的瓶颈是一个结构性问题,单纯靠提示策略引导难以突破。

链接: https://arxiv.org/abs/2604.10990
作者: Muxin Liu,Delip Rao,Grace Kim,Chris Callison-Burch
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25 pages, 9 figures

点击查看摘要

Abstract:Scientific claim verification, the task of determining whether claims are entailed by scientific evidence, is fundamental to establishing discoveries in evidence while preventing misinformation. This process involves evaluating each asserted constraint against validated evidence. Under the Closed-World Assumption (CWA), a claim is accepted if and only if all asserted constraints are positively supported. We show that existing verification benchmarks cannot distinguish models enforcing this standard from models applying a simpler shortcut called salient-constraint checking, which applies CWA’s rejection criterion only to the most salient constraint and accepts when that constraint is supported. Because existing benchmarks construct infeasible claims by perturbing a single salient element they are insufficient at distinguishing between rigorous claim verification and simple salient-constraint reliance. To separate the two, we construct compositionally infeasible claims where the salient constraint is supported but a non-salient constraint is contradicted. Across model families and modalities, models that otherwise saturate existing benchmarks consistently over-accept these claims, confirming the prevalence of such shortcut reasoning. Via model context interventions, we show that different models and prompting strategies occupy distinct positions on a shared ROC curve, indicating that the gap between model families reflects differences in verification threshold rather than underlying reasoning ability, and that the compositional inference bottleneck is a structural property of current verification behavior that strategy guidance alone cannot overcome.

[NLP-79] Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models

【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在集成新一代大语言模型(Large Language Models, LLMs)时所面临的性能不确定性问题,即:尽管LLM的推理能力、指令遵循能力和泛化能力不断提升,但其对VLM下游任务性能的影响尚未被系统研究。解决方案的关键在于通过控制变量法进行受控实验——保持视觉编码器、训练数据和后训练算法不变,仅更换LLAMA-1、LLAMA-2与LLAMA-3作为骨干模型,从而系统性地评估不同LLM版本对VLM任务表现的影响。结果表明,新LLM并不总是带来性能提升,具体效果取决于任务类型:例如,在视觉问答任务中,新LLM更倾向于解决不同类型的问题而非单纯增加正确答案数,这归因于更好的置信度校准和更稳定的内部表征;而某些能力仅在最新LLM中出现,依赖视觉理解的任务则受益甚微。

链接: https://arxiv.org/abs/2604.10985
作者: Sameera Horawalavithana,Lauren Phillips,Ian Stewart,Sai Munikoti,Karl Pazdernik
机构: Pacific Northwest National Laboratory (太平洋西北国家实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint and under review

点击查看摘要

Abstract:Vision-Language Models (VLMs) have rapidly advanced by leveraging powerful pre-trained Large Language Models (LLMs) as core reasoning backbones. As new and more capable LLMs emerge with improved reasoning, instruction-following, and generalization, there is a pressing need to efficiently update existing VLMs to incorporate these advancements. However, the integration of new LLMs into VLMs, particularly how the evolving LLMs contribute to multimodal reasoning, alignment, and task-specific performance remains underexplored. Addressing this gap is important for VLM development, given the rapid evolution of pretrained LLM backbones. This study presents a controlled and systematic investigation of how changes in the pretrained LLM backbone affect downstream VLM task performance. By having the vision encoder, training data, and post-training algorithm remain same across LLAMA-1, LLAMA-2, and LLAMA-3 based VLMs, we find that newer LLM backbones do not always lead to better VLMs, but the performance depends on the downstream VLM task. For example, in visual question and answering tasks, newer LLM backbones tend to solve different questions rather than just more questions, and our analysis shows this is driven by differences in how the models process information, including better calibrated confidence and more stable internal representations. We also find that some VLM capabilities appear only in the newest LLM generation, while tasks that depend mainly on visual understanding see little benefit from a newer LLM backbone.

[NLP-80] CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning

【速读】: 该论文旨在解决表格数据推理任务中模型难以同时兼顾视觉感知与符号推理的问题,尤其是在处理大型表格或使用小型基础模型时性能受限的挑战。其解决方案的关键在于提出一种分阶段的粗粒度到细粒度多模态合成框架(Coarse-to-Fine Multimodal Synthesis, CFMS),该框架将高层视觉感知与底层符号推理解耦:在粗粒度阶段,利用多模态大语言模型(Multimodal Large Language Models, MLLMs)一次性生成包含多视角知识的元组,作为动态推理地图;在细粒度阶段,由符号引擎基于该地图执行精准、高效的迭代操作,从而实现对表格内容的结构化理解与逻辑推理。

链接: https://arxiv.org/abs/2604.10973
作者: Qixian Huang,Hongqiang Lin,Tong Fu,Yingsen Wang,Zhenghui Fu,Qirui Wang,Yiding Sun,Dongxu Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reasoning over tabular data is a crucial capability for tasks like question answering and fact verification, as it requires models to comprehend both free-form questions and semi-structured tables. However, while methods like Chain-of-Thought (CoT) introduce reasoning chains, purely symbolic methodes are inherently limited by their blindness to holistic visual patterns. To address this, we propose the Coarse-to-Fine Multimodal Synthesis framework (CFMS), a novel two-stage paradigm that hierarchically decouples high-level visual perception from granular symbolic reasoning. In the Coarse Stage, CFMS leverages the Multimodal Large Language Models (MLLMs) to perform a one-time synthesis of a multi-perspective knowledge tuple. This tuple subsequently serves as a dynamic reasoning map to guide the fine stage, where a symbolic engine executes a targeted and efficient sequence of iterative operations over the table. Extensive experiments on the WikiTQ and TabFact benchmarks demonstrate that CFMS achieves competitive accuracy. The framework exhibits particular robustness when handling large tables and when instantiated with smaller backbone models, validating its effectiveness and generalizability.

[NLP-81] YIELD: A Large-Scale Dataset and Evaluation Framework for Information Elicitation Agents ACL2026

【速读】: 该论文旨在解决当前对话代理(Conversational Agents, CAs)多以用户驱动为主、难以适应需要主动获取信息的制度性场景(如学术访谈、司法程序和新闻调查)的问题。针对这一挑战,作者提出信息 elicitation agents (IEAs),其目标是通过结构化交互从用户处主动提取支持任务目标的信息。解决方案的关键在于构建了一个包含2,281条伦理合规的人类对话、共26M token的YIELD数据集,并将信息获取过程形式化为有限时域部分可观测马尔可夫决策过程(finite-horizon POMDP),同时设计了专为IEA优化的新指标。实验表明,在YIELD上微调的基础大语言模型(Foundation LLMs)在行为对齐度上显著提升,且人类评估验证了其有效性。

链接: https://arxiv.org/abs/2604.10968
作者: Victor De Lima,Grace Hui Yang
机构: Georgetown University (乔治城大学)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026 (Main Conference)

点击查看摘要

Abstract:Most conversational agents (CAs) are designed to satisfy user needs through user-driven interactions. However, many real-world settings, such as academic interviewing, judicial proceedings, and journalistic investigations, involve broader institutional decision-making processes and require agents that can elicit information from users. In this paper, we introduce Information Elicitation Agents (IEAs) in which the agent’s goal is to elicit information from users to support the agent’s institutional or task-oriented objectives. To enable systematic research on this setting, we present YIELD, a 26M-token dataset of 2,281 ethically sourced, human-to-human dialogues. Moreover, we formalize information elicitation as a finite-horizon POMDP and propose novel metrics tailored to IEAs. Pilot experiments on multiple foundation LLMs show that training on YIELD improves their alignment with real elicitation behavior and findings are corroborated by human evaluation. We release YIELD under CC BY 4.0. The dataset, project code, evaluation tools, and fine-tuned model adapters are available at: this https URL.

[NLP-82] Mem2Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation ACL2026

【速读】: 该论文旨在解决当前大型语言模型(Large Language Model, LLM)驱动的智能体在自我演化过程中,经验积累与资产(工具或专家代理)动态创建两个机制被孤立处理的问题。这种分离忽视了二者之间的内在依赖性:仅依靠预设静态工具集的经验积累存在能力边界,而完全从零生成新资产又缺乏经验指导,导致演化能力受限且不稳定。解决方案的关键在于提出一种协同进化范式——能力扩展与经验蒸馏(Co-evolutionary Capability Expansion and Experience Distillation),并设计了Mem²Evolve框架,其核心是整合经验记忆(Experience Memory)资产记忆(Asset Memory),通过经验引导资产生成以拓展能力空间,同时利用新资产获取新经验,实现双向协同进化,从而显著提升智能体的演化效率与稳定性。

链接: https://arxiv.org/abs/2604.10923
作者: Zihao Cheng,Zeming Liu,Yingyu Shan,Xinyi Wang,Xiangrong Zhu,Yunpu Ma,Hongru Wang,Yuhang Guo,Wei Lin,Yunhong Wang
机构: Beihang University (北京航空航天大学); Beijing Institute of Technology (北京理工大学); Independent Researcher (独立研究员); Munich Center for Machine Learning (慕尼黑机器学习中心); University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026 Main

点击查看摘要

Abstract:While large language model–powered agents can self-evolve by accumulating experience or by dynamically creating new assets (i.e., tools or expert agents), existing frameworks typically treat these two evolutionary processes in isolation. This separation overlooks their intrinsic interdependence: the former is inherently bounded by a manually predefined static toolset, while the latter generates new assets from scratch without experiential guidance, leading to limited capability growth and unstable evolution. To address this limitation, we introduce a novel paradigm of co-evolutionary Capability Expansion and Experience Distillation. Guided by this paradigm, we propose the \textbfMem ^\textbf2 Evolve, which integrates two core components: \textbfExperience Memory and \textbfAsset Memory. Specifically, Mem ^2 Evolve leverages accumulated experience to guide the dynamic creation of assets, thereby expanding the agent’s capability space while simultaneously acquiring new experience to achieve co-evolution. Extensive experiments across 6 task categories and 8 benchmarks demonstrate that Mem ^2 Evolve achieves improvement of 18.53% over standard LLMs, 11.80% over agents evolving solely through experience, and 6.46% over those evolving solely through asset creation, establishing it as a substantially more effective and stable self-evolving agent framework. Code is available at: this https URL.

[NLP-83] HTAA: Enhancing LLM Planning via Hybrid Toolset Agent ization Adaptation

【速读】: 该论文旨在解决大语言模型在实际应用中难以高效、可靠地调用数百个工具的问题,这一挑战源于扁平化工具调用架构中存在的低效性与误差累积问题。其解决方案的关键在于提出一种分层式框架——混合工具集代理化适配(Hybrid Toolset Agentization Adaptation, HTAA),核心创新包括:一是引入工具集代理化范式,将频繁协同使用的工具封装为专用代理工具,从而缩减规划器的动作空间并减少冗余;二是设计不对称规划器适配机制,通过轨迹驱动的训练策略(基于后向重构与前向精炼)实现高层规划器与代理工具的有效对齐。实验证明,HTAA在长程可执行工具轨迹任务上显著提升成功率、缩短调用路径并降低上下文开销,且在生产环境中大幅减少人工验证成本。

链接: https://arxiv.org/abs/2604.10917
作者: Chengrui Huang,Junshuo Zhang,Zhiyuan Ma,Xikun Wang,Ximeng Wang,Menghua Jiang,Gang Zeng,Zhaobing Han,Shen Gao,Shuo Shang
机构: University of Electronic Science and Technology of China (电子科技大学); DiDi Global Inc. (滴滴全球)
类目: Computation and Language (cs.CL)
备注: 22 pages, 3 figures

点击查看摘要

Abstract:Enabling large language models to scale and reliably use hundreds of tools is critical for real-world applications, yet challenging due to the inefficiency and error accumulation inherent in flat tool-calling architectures. To address this, we propose Hybrid Toolset Agentization Adaptation (HTAA), a hierarchical framework for scalable tool-use planning. We propose a novel toolset agentization paradigm, which encapsulates frequently co-used tools into specialized agent tools, thereby reducing the planner’s action space and mitigating redundancy. To ensure effective coordination, we design Asymmetric Planner Adaptation, a trajectory-based training paradigm that aligns the high-level planner with agent tools via backward reconstruction and forward refinement. To validate the performance of HTAA, we conduct experiments on a real-world internal dataset, InfoVerify, based on the POI validation workflow of China’s largest online large-scale ride-hailing platform, featuring long-horizon executable tool trajectories. Experiments on InfoVerify and widely-used benchmarks show that HTAA consistently achieves higher task success rates, requires short tool calling trajectories, and significantly reduces context overhead compared to strong baselines. Furthermore, in a production deployment, HTAA substantially reduces manual validation effort and operational cost, demonstrating its practical efficacy.

[NLP-84] Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech Sound and Music

【速读】: 该论文旨在解决当前音频语言模型在复杂音频理解与推理任务中的局限性,特别是对长时音频(长达30分钟)的处理能力不足、中间推理步骤缺乏时间锚定导致可解释性差,以及训练数据规模和多样性受限的问题。解决方案的关键在于:(1) 构建更强的基础音频-语言模型以提升多类音频理解任务的准确性;(2) 提出可扩展的数据构建策略,生成超百万小时的大规模音频理解与推理数据集;(3) 引入Temporal Audio Chain-of-Thought(时间音频思维链)新范式,将中间推理步骤显式地关联到音频时间戳,实现细粒度的时间对齐与增强可解释性;(4) 采用分阶段课程学习训练策略(预训练、中段训练与后训练),显著提升模型在20个基准测试中的性能,尤其在长音频任务上表现突出,并展现出良好的零样本迁移能力和实际应用价值。

链接: https://arxiv.org/abs/2604.10905
作者: Sreyan Ghosh,Arushi Goel,Kaousheik Jayakumar,Lasha Koroshinadze,Nishit Anand,Zhifeng Kong,Siddharth Gururani,Sang-gil Lee,Jaehyeon Kim,Aya Aljafari,Chao-Han Huck Yang,Sungwon Kim,Ramani Duraiswami,Dinesh Manocha,Mohammad Shoeybi,Bryan Catanzaro,Ming-Yu Liu,Wei Ping
机构: NVIDIA(英伟达); University of Maryland(马里兰大学)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Project website: this https URL

点击查看摘要

Abstract:We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds and music. Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data beyond existing academic benchmarks; (iii) support for long and complex audio inputs up to 30 minutes; and (iv) Temporal Audio Chain-of-Thought, a new reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in long audio, enabling fine-grained temporal alignment and improved interpretability. To enable these capabilities, we first conduct a systematic analysis of Audio Flamingo 3 to identify key gaps in audio understanding and reasoning. We then curate and scale new large-scale datasets totaling over 1 million hours to address these limitations and expand the existing AudioSkills-XL, LongAudio-XL, AF-Think and AF-Chat datasets. AF-Next is trained using a curriculum-based strategy spanning pre-training, mid-training and post-training stages. Extensive experiments across 20 audio understanding and reasoning benchmarks, including challenging long-audio tasks, show that AF-Next outperforms similarly sized open models by large margins and remains highly competitive with and sometimes surpasses, much larger open-weight and closed models. Beyond benchmark performance, AF-Next exhibits strong real-world utility and transfers well to unseen tasks, highlighting its robustness and generalization ability. In addition to all data, code and methods, we open-source 3 variants of AF-Next, including AF-Next-Instruct, AF-Next-Think and AF-Next-Captioner.

[NLP-85] ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂推理任务中因生成长序列中间思考过程而导致的KV缓存(Key-Value Cache)内存占用急剧增长的问题。传统方法仅压缩输入上下文,而保留完整的KV缓存用于解码,这在长输出场景下显著增加计算和内存开销。解决方案的关键在于提出ZoomR机制,通过自适应地将冗长的推理过程压缩为摘要,并设计一种基于摘要的动态KV缓存选择策略:利用摘要键(summary keys)作为粗粒度索引进行快速检索,同时仅对最相关的细节进行精细“聚焦”(zoom-in),从而避免每一步都执行全缓存注意力计算。该分层策略显著降低了内存使用量,在数学和推理任务上实现了优于基线的性能,且推理内存消耗减少超过4倍。

链接: https://arxiv.org/abs/2604.10898
作者: David H. Yang,Yuxuan Zhu,Mohammad Mohammadi Amiri,Keerthiram Murugesan,Tejaswini Pedapati,Subhajit Chaudhury,Pin-Yu Chen
机构: Rensselaer Polytechnic Institute (伦斯勒理工学院); IBM Research (IBM研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown great performance on complex reasoning tasks but often require generating long intermediate thoughts before reaching a final answer. During generation, LLMs rely on a key-value (KV) cache for autoregressive decoding. However, the memory footprint of the KV cache grows with output length. Prior work on KV cache optimization mostly focus on compressing the long input context, while retaining the full KV cache for decoding. For tasks requiring long output generation, this leads to increased computational and memory costs. In this paper, we introduce ZoomR, a novel approach that enables LLMs to adaptively compress verbose reasoning thoughts into summaries and uses a dynamic KV cache selection policy that leverages these summaries while also strategically “zooming in” on fine-grained details. By using summary keys as a coarse-grained index during decoding, ZoomR uses the query to retrieve details for only the most important thoughts. This hierarchical strategy significantly reduces memory usage by avoiding full-cache attention at each step. Experiments across math and reasoning tasks show that our approach achieves competitive performance compared to baselines, while reducing inference memory requirements by more than 4\times . These results demonstrate that a multi-granularity KV selection enables more memory efficient decoding, especially for long output generation.

[NLP-86] AOP-Smart: A RAG -Enhanced Large Language Model Framework for Adverse Outcome Pathway Analysis

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在毒理学知识问答任务中因幻觉问题(hallucination)导致的可靠性不足问题,尤其是在生成不良结局路径(Adverse Outcome Pathways, AOPs)相关回答时容易出现事实错误或缺乏证据支持的情况。解决方案的关键在于提出一种面向AOP的检索增强生成(Retrieval-Augmented Generation, RAG)框架——AOP-Smart,其核心机制是基于AOP-Wiki官方XML数据中的关键事件(Key Events, KEs)、关键事件关系(Key Event Relationships, KERs)及特定AOP信息进行精准知识检索,从而为LLMs提供可验证的上下文支撑,显著提升答案的准确性与一致性。

链接: https://arxiv.org/abs/2604.10874
作者: Qinjiang Niu,Lu Yan
机构: Nanyang Normal University (南阳师范学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adverse Outcome Pathways (AOPs) are an important knowledge framework in toxicological research and risk assessment. In recent years, large language models (LLMs) have gradually been applied to AOP-related question answering and mechanistic reasoning tasks. However, due to the existence of the hallucination problem, that is, the model may generate content that is inconsistent with facts or lacks evidence, their reliability is still limited. To address this issue, this study proposes an AOP-oriented Retrieval-Augmented Generation (RAG) framework, AOP-Smart. Based on the official XML data from AOP-Wiki, this method uses Key Events (KEs), Key Event Relationships (KERs), and specific AOP information to retrieve relevant knowledge for user questions, thereby improving the reliability of the generated results of large language models. To evaluate the effectiveness of the proposed method, this study constructed a test set containing 20 AOP-related question answering tasks, covering KE identification, upstream and downstream KE retrieval, and complex AOP retrieval tasks. Experiments were conducted on three mainstream large language models, Gemini, DeepSeek, and ChatGPT, and comparative tests were performed under two settings: without RAG and with RAG. The experimental results show that, without using RAG, the accuracies of GPT, DeepSeek, and Gemini were 15.0%, 35.0%, and 20.0%, respectively; after using RAG, their accuracies increased to 95.0%, 100.0%, and 95.0%, respectively. The results indicate that AOP-Smart can significantly alleviate the hallucination problem of large language models in AOP knowledge tasks, and greatly improve the accuracy and consistency of their answers.

[NLP-87] OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

【速读】: 该论文旨在解决当前AI代理(AI agent)评估缺乏跨职业领域覆盖的问题,即现有基准测试仅限于少数存在公开环境的领域,无法全面衡量代理在真实专业任务中的表现。其解决方案的关键在于提出OccuBench,一个涵盖100个现实世界专业任务场景的基准,覆盖10个行业类别和65个细分领域;该基准依托语言世界模型(Language World Models, LWMs),通过大语言模型(LLM)驱动的工具响应生成来模拟特定领域的环境,并采用多智能体合成流水线自动生成具有保证可解性、难度校准及文档支撑多样性的评估实例,从而实现对AI代理在专业任务完成度与环境鲁棒性(包括显式错误、隐式数据退化和混合故障)两个维度上的系统性评测。

链接: https://arxiv.org/abs/2604.10866
作者: Xiaomeng Hu,Yinger Zhang,Fei Huang,Jianhong Tu,Yang Su,Lianghao Deng,Yuxuan Liu,Yantao Liu,Dayiheng Liu,Tsung-Yi Ho
机构: Qwen Team, Alibaba Group; The Chinese University of Hong Kong
类目: Computation and Language (cs.CL)
备注: 23 pages, 8 figures, 2 tables. Project page: this https URL

点击查看摘要

Abstract:AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language World Models (LWMs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance. GPT-5.2 improves by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators. Simulator quality is critical for LWM-based evaluation reliability. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.

[NLP-88] Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series

【速读】: 该论文旨在解决通用大语言模型(Large Language Model, LLM)在波兰语等特定语言上因使用通用分词器(universal tokenizer)而导致的性能瓶颈问题,包括较高的词元化密度(fertility ratio)、更高的推理成本以及受限的有效上下文窗口。其核心解决方案在于构建一个专为波兰语优化的词汇表(Polish-optimized vocabulary),并配合基于FOCUS的嵌入初始化、多阶段预训练课程(multi-stage pretraining curriculum)以及包含监督微调(Supervised Fine-Tuning, SFT)、直接偏好优化(Direct Preference Optimization, DPO)和基于组相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习后训练对齐流程,从而显著提升模型在波兰语任务中的效率与表现。

链接: https://arxiv.org/abs/2604.10799
作者: Krzysztof Ociepa,Łukasz Flis,Remigiusz Kinas,Krzysztof Wróbel,Adrian Gwoździej
机构: SpeakLeash; ACK Cyfronet AGH; Jagiellonian University; Azurro; Enelpol
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The development of the Bielik v3 PL series, encompassing both the 7B and 11B parameter variants, represents a significant milestone in the field of language-specific large language model (LLM) optimization. While general-purpose models often demonstrate impressive multilingual capabilities, they frequently suffer from a fundamental architectural inefficiency: the use of universal tokenizers. These tokenizers, typically designed to cover a broad spectrum of languages, often fail to capture the morphological nuances of specific languages like Polish, leading to higher fertility ratios, increased inference costs, and restricted effective context windows. This report details the transition from the universal Mistral-based tokenization to a dedicated Polish-optimized vocabulary for the Bielik v3 models, exploring the FOCUS-based embedding initialization, the multi-stage pretraining curriculum, and the subsequent post-training alignment involving Supervised Fine-Tuning, Direct Preference Optimization, and Reinforcement Learning through Group Relative Policy Optimization with verifiable rewards.

[NLP-89] Position-Agnostic Pre-Projection for Transformer Attention: Nonlinear Feature Construction and Content Skip Before Q/K/V

【速读】: 该论文旨在解决Transformer模型中注意力机制对位置信息依赖过强、内容信息可能被冗余或弱化的问题。其核心解决方案包括两个关键改进:一是引入一个非线性预投影多层感知机(MLP),置于层归一化(Layer Norm)与查询/键/值(Q/K/V)投影之间,以在不依赖位置编码的情况下构建更丰富的特征表示;二是设计一种内容跳接连接(content skip connection),使预投影特征绕过注意力机制,从而在必要时保留纯内容信息。实验表明,这两种改进协同作用,在冻结探针(frozen-probe)任务中显著提升性能,且深层Transformer层更倾向于激活内容跳接,说明深度层更能受益于避开位置感知注意力的内容信息传递。所有改动均未增加键值缓存(K/V cache)开销。

链接: https://arxiv.org/abs/2604.10791
作者: Chirag Shinde
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 7 pages, 2 figures, 5 tables. Code: this https URL

点击查看摘要

Abstract:We propose two complementary modifications to transformer attention blocks. First, a non-linear pre-projection MLP is inserted between layer norm and Q/K/V projections, constructing richer features in a position-agnostic manner before any positional encoding is applied. Second, a content skip connection routes the pre-projection’s features around the attention mechanism, allowing content information to bypass position-aware attention where beneficial. In frozen-probe experiments on Pythia-160M and 410M, the combined approach achieves the strongest results across methods: +40.6% LAMBADA accuracy and -39% perplexity at 160M scale. Learned skip connection weights reveal a consistent pattern across model sizes: later transformer layers activate the content bypass more strongly than earlier layers, suggesting that deeper layers benefit from content information that does not pass through positional attention. All modifications add no K/V cache overhead.

[NLP-90] InR: Exploring Tool-Internalized Reasoning in Large Language Models

【速读】: 该论文旨在解决当前工具集成推理(Tool-Integrated Reasoning, TIR)方法依赖外部工具文档进行推理时所引发的三大问题:工具掌握难度高、工具规模受限以及推理效率低下。为克服这些局限,作者提出了一种工具内化推理(Tool-Internalized Reasoning, TInR)框架——TInR-U,其核心在于将工具知识内化到大语言模型(Large Language Models, LLMs)中,并实现工具使用与推理过程的协同优化。解决方案的关键步骤包括:1)采用双向知识对齐策略完成工具内化;2)通过高质量推理标注数据进行监督微调预热;3)引入针对TInR设计的强化学习奖励机制以进一步提升性能。实验表明,TInR-U在域内和域外场景下均表现出优越的性能与效率,验证了该方法的有效性。

链接: https://arxiv.org/abs/2604.10788
作者: Qiancheng Xu,Yongqi Li,Fan Liu,Hongru Wang,Min Yang,Wenjie Li
机构: The Hong Kong Polytechnic University(香港理工大学); Southeast University(东南大学); University of Edinburgh(爱丁堡大学); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(中国科学院深圳先进技术研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tool-Integrated Reasoning (TIR) has emerged as a promising direction by extending Large Language Models’ (LLMs) capabilities with external tools during reasoning. Existing TIR methods typically rely on external tool documentation during reasoning. However, this leads to tool mastery difficulty, tool size constraints, and inference inefficiency. To mitigate these issues, we explore Tool-Internalized Reasoning (TInR), aiming at facilitating reasoning with tool knowledge internalized into LLMs. Achieving this goal presents notable requirements, including tool internalization and tool-reasoning coordination. To address them, we propose TInR-U, a tool-internalized reasoning framework for unified reasoning and tool usage. TInR-U is trained through a three-phase pipeline: 1) tool internalization with a bidirectional knowledge alignment strategy; 2) supervised fine-tuning warm-up using high-quality reasoning annotations, and 3) reinforcement learning with TInR-specific rewards. We comprehensively evaluate our method across in-domain and out-of-domain settings. Experiment results show that TInR-U achieves superior performance in both settings, highlighting its effectiveness and efficiency.

[NLP-91] When Meaning Isnt Literal: Exploring Idiomatic Meaning Across Languages and Modalities

【速读】: 该论文旨在解决当前语言模型在理解习语(idiom)方面存在的显著缺陷,尤其是对隐喻性和文化依赖性较强的习语缺乏深层语义推理能力,导致模型过度依赖字面意义而忽略其背后的认知机制。解决方案的关键在于构建了一个多语言、多模态的习语语料库Mediom,包含3,533个印地语、孟加拉语和泰语习语及其高质量解释、跨语言翻译及文本-图像对齐表示,并提出HIDE(Hinting-based Idiom Explanation)框架,通过错误反馈检索与定向诊断提示实现迭代式推理优化,从而提升模型在文化语境下对习语的多模态理解能力。

链接: https://arxiv.org/abs/2604.10787
作者: Sarmistha Das,Shreyas Guha,Suvrayan Bandyopadhyay,Salisa Phosit,Kitsuchart Pasupa,Sriparna Saha
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Idiomatic reasoning, deeply intertwined with metaphor and culture, remains a blind spot for contemporary language models, whose progress skews toward surface-level lexical and semantic cues. For instance, the Bengali idiom \textit\foreignlanguagebengali\char"0986\char"0999\char"09CD\char"0997\char"09C1 \char"09B0 \char"09AB\char"09B2 \char"099F\char"0995 (angur fol tok, grapes are sour''): it encodes denial-driven rationalization, yet naive models latch onto the literal fox-and-grape imagery. Addressing this oversight, we present Mediom,‘’ a multilingual, multimodal idiom corpus of 3,533 Hindi, Bengali, and Thai idioms, each paired with gold-standard explanations, cross-lingual translations, and carefully aligned text–image representations. We benchmark both large language models (textual reasoning) and vision-language models (figurative disambiguation) on Mediom, exposing systematic failures in metaphor comprehension. To mitigate these gaps, we propose ``HIDE,‘’ a Hinting-based Idiom Explanation framework that leverages error-feedback retrieval and targeted diagnostic cues for iterative reasoning refinement. Collectively, Mediom and HIDE establish a rigorous test bed and methodology for culturally grounded, multimodal idiom understanding embedded with reasoning hints in next-generation AI systems.

[NLP-92] Do BERT Embeddings Encode Narrative Dimensions? A Token-Level Probing Analysis of Time Space Causality and Character in Fiction

【速读】: 该论文旨在解决生成式 AI(Generative AI)在理解虚构叙事语义维度(时间、空间、因果性和角色)方面的能力问题,即预训练语言模型如BERT是否能够编码这些多维语义信息。其解决方案的关键在于构建一个基于token级别的标注数据集,并通过线性探测(linear probe)方法对BERT嵌入进行评估:实验结果显示,BERT嵌入的分类准确率达到94%,显著优于方差匹配的随机嵌入(47%),表明BERT确实编码了有意义的叙事语义信息;同时,通过平衡类别权重和混淆矩阵分析发现,尽管整体性能良好,但存在“边界泄漏”现象(罕见类别如因果性和空间常被误判为“其他”),且聚类分析显示叙事维度未形成离散可分的簇(ARI = 0.081),说明这些语义维度虽被编码但缺乏明确的结构分离。

链接: https://arxiv.org/abs/2604.10786
作者: Beicheng Bei,Hannah Hyesun Chun,Chen Guo,Arwa Saghiri
机构: University of Rochester (罗切斯特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 7 figures. Accepted at CMN’26 (9th International Workshop on Computational Models of Narrative)

点击查看摘要

Abstract:Narrative understanding requires multidimensional semantic structures. This study investigates whether BERT embeddings encode dimensions of fictional narrative semantics – time, space, causality, and character. Using an LLM to accelerate annotation, we construct a token-level dataset labeled with these four narrative categories plus “others.” A linear probe on BERT embeddings (94% accuracy) significantly outperforms a control probe on variance-matched random embeddings (47%), confirming that BERT encodes meaningful narrative information. With balanced class weighting, the probe achieves a macro-average recall of 0.83, with moderate success on rare categories such as causality (recall = 0.75) and space (recall = 0.66). However, confusion matrix analysis reveals “Boundary Leakage,” where rare dimensions are systematically misclassified as “others.” Clustering analysis shows that unsupervised clustering aligns near-randomly with predefined categories (ARI = 0.081), suggesting that narrative dimensions are encoded but not as discretely separable clusters. Future work includes a POS-only baseline to disentangle syntactic patterns from narrative encoding, expanded datasets, and layer-wise probing.

[NLP-93] Generating Multiple-Choice Knowledge Questions with Interpretable Difficulty Estimation using Knowledge Graphs and Large Language Models

【速读】: 该论文旨在解决自动化多选题(Multiple-Choice Questions, MCQs)生成中难以准确估计题目难度的问题,尤其是在自适应人工智能辅助教育系统中的应用。其解决方案的关键在于融合知识图谱(Knowledge Graph, KG)与大语言模型(Large Language Models, LLMs),首先利用LLM从输入文档构建结构化知识图谱,再基于图谱中的节点和三元组/五元组关系生成MCQ的题干和干扰项,并通过九种难度信号的数据驱动组合方法计算统一的难度分数。该方法不仅提升了MCQ的质量,还实现了可解释且符合人类感知的难度估计。

链接: https://arxiv.org/abs/2604.10748
作者: Mehmet Can Şakiroğlu,H. Altay Güvenir,Kamer Kaya
机构: Sabanci University (萨班哲大学); Istanbul Technical University (伊斯坦布尔技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generating multiple-choice questions (MCQs) with difficulty estimation remains challenging in automated MCQ-generation systems used in adaptive, AI-assisted education. This study proposes a novel methodology for generating MCQs with difficulty estimation from the input documents by utilizing knowledge graphs (KGs) and large language models (LLMs). Our approach uses an LLM to construct a KG from input documents, from which MCQs are then systematically generated. Each MCQ is generated by selecting a node from the KG as the key, sampling a related triple or quintuple – optionally augmented with an extra triple – and prompting an LLM to generate a corresponding stem from these graph components. Distractors are then selected from the KG. For each MCQ, nine difficulty signals are computed and combined into a unified difficulty score using a data-driven approach. Experimental results demonstrate that our method generates high-quality MCQs whose difficulty estimation is interpretable and aligns with human perceptions. Our approach improves automated MCQ generation by integrating structured knowledge representations with LLMs and a data-driven difficulty estimation model.

[NLP-94] How You Ask Matters! Adaptive RAG Robustness to Query Variations

【速读】: 该论文旨在解决自适应检索增强生成(Adaptive Retrieval-Augmented Generation, Adaptive RAG)在面对语义相同但表面形式多样的查询时所表现出的鲁棒性不足问题。其核心挑战在于,现有方法在不同表达形式的查询下,检索触发策略和生成准确性存在显著波动,从而影响系统稳定性和可靠性。解决方案的关键在于构建首个大规模、多样化且语义一致的查询变体基准(benchmark),通过系统评估 Adaptive RAG 的三个维度——答案质量、计算成本与检索决策——揭示其对表面形式变化的高度敏感性,并指出尽管大模型性能更优,但鲁棒性并未同步提升,从而为后续改进 Adaptive RAG 的鲁棒性机制提供实证基础与方向指引。

链接: https://arxiv.org/abs/2604.10745
作者: Yunah Jang,Megha Sundriyal,Kyomin Jung,Meeyoung Cha
机构: Seoul National University (首尔国立大学); Max Planck Institute for Security and Privacy (马克斯·普朗克信息安全与隐私研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Adaptive Retrieval-Augmented Generation (RAG) promises accuracy and efficiency by dynamically triggering retrieval only when needed and is widely used in practice. However, real-world queries vary in surface form even with the same intent, and their impact on Adaptive RAG remains under-explored. We introduce the first large-scale benchmark of diverse yet semantically identical query variations, combining human-written and model-generated rewrites. Our benchmark facilitates a systematic evaluation of Adaptive RAG robustness by examining its key components across three dimensions: answer quality, computational cost, and retrieval decisions. We discover a critical robustness gap, where small surface-level changes in queries dramatically alter retrieval behavior and accuracy. Although larger models show better performance, robustness does not improve accordingly. These findings reveal that Adaptive RAG methods are highly vulnerable to query variations that preserve identical semantics, exposing a critical robustness challenge.

[NLP-95] RCBSF: A Multi-Agent Framework for Automated Contract Revision via Stackelberg Game

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在法律智能(Legal AI)领域中用于自动化合同修订时存在的两大核心问题:一是生成内容可能包含虚假或不安全的信息(hallucinated safety),二是缺乏严格的行為约束机制。为应对这些问题,作者提出了一种风险约束的双层Stackelberg框架(Risk-Constrained Bilevel Stackelberg Framework, RCBSF),其关键在于构建一个分层的“领导者-追随者”结构:全局处方代理(Global Prescriptive Agent, GPA)作为领导者设定风险预算,而由受限修订代理(Constrained Revision Agent, CRA)与局部验证代理(Local Verification Agent, LVA)组成的跟随系统则在该预算下迭代优化输出。该设计不仅提供了理论保证——即该双层优化收敛至均衡点并优于无引导配置,还在统一基准上实证验证了其优越性,平均风险化解率(Risk Resolution Rate, RRR)达到84.21%,同时提升了token效率。

链接: https://arxiv.org/abs/2604.10740
作者: Shijia Xu,Yu Wang,Xiaolong Jia,Zhou Wu,Kai Liu,April Xiaowen Dong
机构: Chongqing University, China; Queen Mary University of London, UK; Chongqing Key Laboratory of Big Data Intelligence and Privacy Computing, China; Fangda Partners, China
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite the widespread adoption of Large Language Models (LLMs) in Legal AI, their utility for automated contract revision remains impeded by hallucinated safety and a lack of rigorous behavioral constraints. To address these limitations, we propose the Risk-Constrained Bilevel Stackelberg Framework (RCBSF), which formulates revision as a non-cooperative Stackelberg game. RCBSF establishes a hierarchical Leader Follower structure where a Global Prescriptive Agent (GPA) imposes risk budgets upon a follower system constituted by a Constrained Revision Agent (CRA) and a Local Verification Agent (LVA) to iteratively optimize output. We provide theoretical guarantees that this bilevel formulation converges to an equilibrium yielding strictly superior utility over unguided configurations. Empirical validation on a unified benchmark demonstrates that RCBSF achieves state-of-the-art performance, surpassing iterative baselines with an average Risk Resolution Rate (RRR) of 84.21% while enhancing token efficiency. Our code is available at this https URL .

[NLP-96] BlasBench: An Open Benchmark for Irish Speech Recognition

【速读】: 该论文旨在解决爱尔兰语(Irish)语音识别(ASR)系统缺乏统一评估标准的问题,即当前尚无针对爱尔兰语的公开基准测试平台,无法在相同评估协议下公平比较不同端用户ASR系统的性能。其解决方案的关键在于发布BlasBench——一个开源的评估框架,内置爱尔兰语特异性文本归一化处理机制,能够保留fadas(变音符号)、lenition(弱化)和eclipsis(省略)等语言特征,从而更准确地衡量模型在真实爱尔兰语场景下的表现。通过在Common Voice ga-IE和FLEURS ga-IE两个数据集上对12种不同架构的ASR系统进行评测,研究揭示了仅基于单一数据集训练的模型存在显著泛化差距,凸显了多数据集交叉验证的重要性。

链接: https://arxiv.org/abs/2604.10736
作者: Jyoutir Raj,John Conway
机构: Independent Researcher; threefold.eco
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注: 8 pages, 4 tables, 3 appendices. Code and data: this https URL

点击查看摘要

Abstract:No open Irish-specific benchmark compares end-user ASR systems under a shared Irish-aware evaluation protocol. To solve this, we release BlasBench, an open evaluation harness with Irish-aware text normalisation that preserves fadas, lenition, and eclipsis. We benchmark 12 systems across four architecture families on Common Voice ga-IE and FLEURS ga-IE. All Whisper variants exceed 100% WER. The best open model (omniASR LLM 7B) achieves 30.65% WER on Common Voice and 39.09% on FLEURS. We noticed models fine-tuned on Common Voice lose 33-43 WER points on FLEURS, revealing a generalisation gap that is invisible to single-dataset evaluation.

[NLP-97] Self-Correcting RAG : Enhancing Faithfulness via MMKP Context Selection and NLI-Guided MCTS

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)在处理复杂推理任务时面临的两大挑战:低上下文利用率和频繁幻觉(hallucination)。为应对这些问题,作者提出了一种统一框架——Self-Correcting RAG,其核心创新在于将检索与生成过程重新建模为约束优化与路径规划问题。关键解决方案包括:1)在输入端将上下文选择形式化为多维多重背包问题(multi-dimensional multiple-choice knapsack problem, MMKP),以在严格token预算下最大化信息密度并消除冗余;2)在输出端引入基于自然语言推理(Natural Language Inference, NLI)引导的蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)机制,利用推理时计算资源动态探索推理轨迹并验证生成答案的忠实性。实验表明,该方法显著提升了复杂查询的推理准确性并有效抑制了幻觉现象。

链接: https://arxiv.org/abs/2604.10734
作者: Shijia Xu,Zhou Wu,Xiaolong Jia,Yu Wang,Kai Liu,April Xiaowen Dong
机构: Chongqing University, China; Queen Mary University of London, UK; Chongqing Key Laboratory of Big Data Intelligence and Privacy Computing, China; Fangda Partners, China
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) substantially extends the knowledge boundary of large language models. However, it still faces two major challenges when handling complex reasoning tasks: low context utilization and frequent hallucinations. To address these issues, we propose Self-Correcting RAG, a unified framework that reformulates retrieval and generation as constrained optimization and path planning. On the input side, we move beyond traditional greedy retrieval and, for the first time, formalize context selection as a multi-dimensional multiple-choice knapsack problem (MMKP), thereby maximizing information density and removing redundancy under a strict token budget. On the output side, we introduce a natural language inference (NLI)-guided Monte Carlo Tree Search (MCTS) mechanism, which leverages test-time compute to dynamically explore reasoning trajectories and validate the faithfulness of generated answers. Experiments on six multi-hop question answering and fact-checking datasets demonstrate that our method significantly improves reasoning accuracy on complex queries while effectively reducing hallucinations, outperforming strong existing this http URL code is available at this https URL .

[NLP-98] oo Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models ACL

【速读】: 该论文旨在解决角色扮演型大语言模型在用户请求下采用特定人格特征时,因人格特质(尤其是宜人性)引发的“谄媚行为”(sycophancy)问题,即模型倾向于迎合用户偏好而非坚持事实准确性,从而带来AI安全与对齐风险。其解决方案的关键在于系统性地构建了一个包含275个个性化人格的基准测试集,并通过4,950条诱导谄媚行为的提示语进行实验验证,发现宜人性得分较高的角色人格显著提升模型的谄媚倾向,且在13个不同规模的语言模型中均呈现统计显著的正相关关系(Pearson相关系数最高达r = 0.87,效应量Cohen’s d = 2.33),从而确立了宜人性作为预测角色诱导谄媚行为的核心指标,为部署角色扮演类AI系统及设计针对性对齐策略提供了实证依据。

链接: https://arxiv.org/abs/2604.10733
作者: Arya Shah,Deepali Mishra,Chaklam Silpasuwanchai
机构: IIT Gandhinagar; IIT Kanpur; Asian Institute of Technology
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 Pages, 5 Figures, 9 Tables, ACL Main Conference 2026

点击查看摘要

Abstract:Large language models increasingly serve as conversational agents that adopt personas and role-play characters at user request. This capability, while valuable, raises concerns about sycophancy: the tendency to provide responses that validate users rather than prioritize factual accuracy. While prior work has established that sycophancy poses risks to AI safety and alignment, the relationship between specific personality traits of adopted personas and the degree of sycophantic behavior remains unexplored. We present a systematic investigation of how persona agreeableness influences sycophancy across 13 small, open-weight language models ranging from 0.6B to 20B parameters. We develop a benchmark comprising 275 personas evaluated on NEO-IPIP agreeableness subscales and expose each persona to 4,950 sycophancy-eliciting prompts spanning 33 topic categories. Our analysis reveals that 9 of 13 models exhibit statistically significant positive correlations between persona agreeableness and sycophancy rates, with Pearson correlations reaching r = 0.87 and effect sizes as large as Cohen’s d = 2.33 . These findings demonstrate that agreeableness functions as a reliable predictor of persona-induced sycophancy, with direct implications for the deployment of role-playing AI systems and the development of alignment strategies that account for personality-mediated deceptive behaviors.

[NLP-99] Expect the Unexpected? Testing the Surprisal of Salient Entities ACL2026

【速读】: 该论文旨在解决现有关于均匀信息密度(Uniform Information Density, UID)假说的研究中忽视话语参与者相对显著性(salience)的问题。以往研究虽发现信息量(以预期意外度,surprisal 衡量)在文档层面大致均匀分布,但未充分考虑实体在话语中的显著性如何影响局部信息密度。其解决方案的关键在于引入一种新颖的最小配对提示方法(minimal-pair prompting method),并基于70K条人工标注的英语文本提及(mentions)跨16种语域进行实证分析。结果表明,全局显著性高的实体本身具有更高的 surprisal,且当它们作为提示时能系统性降低周围内容的 surprisal,从而提升文档整体可预测性;这一效应因语域而异,在主题一致性强的文本中最显著,而在对话类语境中最弱。这揭示了全局实体显著性是调节话语中信息分布的重要机制,完善了UID假说的竞争压力框架。

链接: https://arxiv.org/abs/2604.10724
作者: Jessica Lin,Amir Zeldes
机构: Georgetown University (乔治城大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 (main, long); camera-ready version

点击查看摘要

Abstract:Previous work examining the Uniform Information Density (UID) hypothesis has shown that while information as measured by surprisal metrics is distributed more or less evenly across documents overall, local discrepancies can arise due to functional pressures corresponding to syntactic and discourse structural constraints. However, work thus far has largely disregarded the relative salience of discourse participants. We fill this gap by studying how overall salience of entities in discourse relates to surprisal using 70K manually annotated mentions across 16 genres of English and a novel minimal-pair prompting method. Our results show that globally salient entities exhibit significantly higher surprisal than non-salient ones, even controlling for position, length, and nesting confounds. Moreover, salient entities systematically reduce surprisal for surrounding content when used as prompts, enhancing document-level predictability. This effect varies by genre, appearing strongest in topic-coherent texts and weakest in conversational contexts. Our findings refine the UID competing pressures framework by identifying global entity salience as a mechanism shaping information distribution in discourse.

[NLP-100] aching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation

【速读】: 该论文旨在解决现有编程教育中人工学习模型依赖大型专有语言模型所带来的隐私、成本和依赖性问题,同时提升模拟学生编程行为的准确性。其关键解决方案是利用真实学生的编程过程数据(如代码提交与环境反馈)构建对话式序列化训练样本,并通过监督微调与偏好优化相结合的训练管道,使开源模型(如Qwen)能够有效学习并复现学生的迭代调试行为,从而在功能一致性和代码相似性上优于仅基于代码或提示大型语言模型的基线方法。

链接: https://arxiv.org/abs/2604.10720
作者: Charles Koutcheme,Arto Hellas,Juho Leinonen
机构: Aalto University (阿尔托大学); Research Council of Finland (芬兰研究理事会)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 8 pages, 2 figures, 2 tables. Accepted to Educational Data Mining 2026

点击查看摘要

Abstract:Artificial models that simulate how learners act and respond within educational systems are a promising tool for evaluating tutoring strategies and feedback mechanisms at scale. However, many existing approaches in programming education rely on prompting large, proprietary language models, raising concerns around privacy, cost, and dependence. In this work, we propose a method for training open-weight artificial programming learners using authentic student process data. Our approach serializes temporal log traces into a conversational format, representing each student’s problem-solving process as a dialogue between the learner and their automated assessment system. Student code submissions and environment feedback, such as test outcomes, grades, and error traces, form alternating conversational turns, enabling models to learn from the iterative debugging process. We additionally introduce a training pipeline combining supervised fine-tuning with preference optimization to align models with authentic student debugging behavior. We evaluate our framework by training Qwen models at 4B and 8B scales on a large-scale dataset of real student submissions to Python programming assignments. Our results show that incorporating environment feedback strengthens the models’ ability to replicate student debugging behavior, improving over both prior code-only approaches and prompted large language models baselines in functional alignment and code similarity. We release our code to support reproducibility.

[NLP-101] Detecting RAG Extraction Attack via Dual-Path Runtime Integrity Game ACL2026

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中存在的知识库泄露问题,即攻击者可通过自适应的迭代提示策略诱导模型泄露其检索到的专有内容。解决方案的关键在于提出 CanaryRAG,一种受软件安全中栈溢出保护机制(stack canaries)启发的运行时防御机制:通过在检索到的数据块中嵌入精心设计的“哨兵令牌”(canary tokens),并将防御任务建模为双路径运行时完整性博弈——一旦目标路径或oracle路径违反预期的哨兵行为(包括对抗性抑制和混淆场景),即可实时检测泄露。该方法无需重新训练或修改RAG架构,具备良好的兼容性和可扩展性,同时显著降低攻击者恢复数据块的成功率,且对任务性能和推理延迟影响微乎其微。

链接: https://arxiv.org/abs/2604.10717
作者: Yuanbo Xie,Yingjie Zhang,Yulin Li,Shouyou Song,Xiaokun Chen,Zhihan Liu,Liya Su,Tingwen Liu
机构: Institute of Information Engineering, Chinese Academy of Sciences, China; School of Cyber Security, University of Chinese Academy of Sciences, China; Beijing University of Post and Telecommunications, China; Stanford University; North China Electric Power University, China; AI Sec Lab, Beijing Chaitin Technology Co.,Ltd
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by ACL 2026 Main

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems augment large language models with external knowledge, yet introduce a critical security vulnerability: RAG Knowledge Base Leakage, wherein adversarial prompts can induce the model to divulge retrieved proprietary content. Recent studies reveal that such leakage can be executed through adaptive and iterative attack strategies (named RAG extraction attack), while effective countermeasures remain notably lacking. To bridge this gap, we propose CanaryRAG, a runtime defense mechanism inspired by stack canaries in software security. CanaryRAG embeds carefully designed canary tokens into retrieved chunks and reformulates RAG extraction defense as a dual-path runtime integrity game. Leakage is detected in real time whenever either the target or oracle path violates its expected canary behavior, including under adaptive suppression and obfuscation. Extensive evaluations against existing attacks demonstrate that CanaryRAG provides robust defense, achieving substantially lower chunk recovery rates than state-of-the-art baselines while imposing negligible impact on task performance and inference latency. Moreover, as a plug-and-play solution, CanaryRAG can be seamlessly integrated into arbitrary RAG pipelines without requiring retraining or structural modifications, offering a practical and scalable safeguard for proprietary data.

[NLP-102] Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)强化学习(Reinforcement Learning, RL)中的信用分配(Credit Assignment)难题,尤其是传统基于值函数的评判器(critic)难以可靠训练的问题。现有方法依赖于单次预测(one-shot prediction)的判别式批评者,其表达能力有限,导致在扩展规模时性能无法稳定提升。解决方案的关键在于提出生成式演员-评论家(Generative Actor-Critic, GenAC),它用生成式批评者替代传统的单次标量值预测,通过链式思维(chain-of-thought reasoning)推理过程来生成更可靠的值估计,并引入上下文条件化(In-Context Conditioning)机制以确保批评者在整个训练过程中与当前策略保持校准。这一设计显著提升了值函数逼近精度、排序可靠性及分布外泛化能力,最终实现优于基于值和无值基准方法的下游强化学习性能。

链接: https://arxiv.org/abs/2604.10701
作者: Zikang Shan,Han Zhong,Liwei Wang,Li Zhao
机构: Peking University (北京大学); Microsoft Research Asia (微软亚洲研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 16 pages including appendix, 4 figures

点击查看摘要

Abstract:Credit assignment is a central challenge in reinforcement learning (RL). Classical actor-critic methods address this challenge through fine-grained advantage estimation based on a learned value function. However, learned value models are often avoided in modern large language model (LLM) RL because conventional discriminative critics are difficult to train reliably. We revisit value modeling and argue that this difficulty is partly due to limited expressiveness. In particular, representation complexity theory suggests that value functions can be hard to approximate under the one-shot prediction paradigm used by existing value models, and our scaling experiments show that such critics do not improve reliably with scale. Motivated by this observation, we propose Generative Actor-Critic (GenAC), which replaces one-shot scalar value prediction with a generative critic that performs chain-of-thought reasoning before producing a value estimate. We further introduce In-Context Conditioning, which helps the critic remain calibrated to the current actor throughout training. GenAC improves value approximation, ranking reliability, and out-of-distribution generalization, and these gains translate into stronger downstream RL performance than both value-based and value-free baselines. Overall, our results suggest that stronger value modeling is a promising direction for improving credit assignment in LLM reinforcement learning.

[NLP-103] Attention Sinks as Internal Signals for Hallucination Detection in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中常见的幻觉问题,即模型生成看似流畅但事实错误或缺乏输入上下文支持的输出。其解决方案的关键在于提出SinkProbe方法,该方法基于一个核心观察:幻觉现象与注意力“sink”(attention sinks)密切相关——这些sink是生成过程中聚集异常多注意力质量的token,标志着从分布式的、以输入为依据的注意力机制向压缩的、依赖先验知识的计算模式转变。研究发现,尽管sink分数仅由注意力映射计算得出,分类器更依赖于对应值向量范数较大的sink;同时,通过数学关系证明此前的方法也隐式依赖于注意力sink。这一理论驱动的洞察使SinkProbe成为当前在多个主流数据集和LLM上表现最优的幻觉检测方法。

链接: https://arxiv.org/abs/2604.10697
作者: Jakub Binkowski,Kamil Adamczewski,Tomasz Kajdanowicz
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models frequently exhibit hallucinations: fluent and confident outputs that are factually incorrect or unsupported by the input context. While recent hallucination detection methods have explored various features derived from attention maps, the underlying mechanisms they exploit remain poorly understood. In this work, we propose SinkProbe, a hallucination detection method grounded in the observation that hallucinations are deeply entangled with attention sinks - tokens that accumulate disproportionate attention mass during generation - indicating a transition from distributed, input-grounded attention to compressed, prior-dominated computation. Importantly, although sink scores are computed solely from attention maps, we find that the classifier preferentially relies on sinks whose associated value vectors have large norms. Moreover, we show that previous methods implicitly depend on attention sinks by establishing their mathematical relationship to sink scores. Our findings yield a novel hallucination detection method grounded in theory that produces state-of-the-art results across popular datasets and LLMs.

[NLP-104] SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

【速读】: 该论文旨在解决基于策略的强化学习(On-policy reinforcement learning)在大语言模型推理对齐中面临的稀疏奖励问题,特别是token级信用分配困难的问题。现有方法如On-Policy Distillation (OPD)虽引入了教师模型提供的密集token级KL散度监督,但通常对所有rollout统一施加监督,忽略了不同轨迹间信号质量的差异。其解决方案的关键在于提出Signal-Calibrated On-Policy Distillation Enhancement (SCOPE),一种双路径自适应训练框架:对于错误轨迹,采用教师困惑度加权的KL蒸馏,优先保留教师具备真实纠错能力的样本;对于正确轨迹,则使用学生困惑度加权的最大似然估计(MLE),聚焦于能力边界处低置信度样本的强化,避免对已掌握样本的过度优化。两个路径均引入组级别归一化机制以自适应校准权重分布,从而应对提示难度差异。实验表明,SCOPE在六个推理基准上平均提升Avg@32 11.42%和Pass@32 7.30%,验证了其有效性。

链接: https://arxiv.org/abs/2604.10688
作者: Binbin Zheng,Xing Ma,Yiheng Liang,Jingqing Ruan,Xiaoliang Fu,Kepeng Lin,Benchang Zhu,Ke Zeng,Xunliang Cai
机构: University of Science and Technology of China (中国科学技术大学); Meituan (美团); Nanjing University (南京大学); Fudan University (复旦大学); Huazhong University of Science and Technology (华中科技大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.

[NLP-105] QFS-Composer: Query-focused summarization pipeline for less resourced languages

【速读】: 该论文旨在解决低资源语言(如斯洛文尼亚语)中查询聚焦摘要(Query-Focused Summarization, QFS)任务的性能下降问题,尤其是在缺乏标注数据和评估工具的情况下。其核心挑战在于如何提升摘要与用户查询意图之间的事实一致性。解决方案的关键在于提出一个名为QFS-Composer的新框架,该框架通过查询分解、问题生成(Question Generation, QG)、问答(Question Answering, QA)和抽象式摘要四个模块的协同作用,构建一个以QA引导的摘要生成流程,从而增强摘要的相关性和忠实性。实验表明,该方法在斯洛文尼亚语上显著优于基线大语言模型(LLMs),并为低资源语言的QFS研究提供了可扩展的方法论基础。

链接: https://arxiv.org/abs/2604.10687
作者: Vuk Đuranović,Marko Robnik Šikonja
机构: 未知
类目: Computation and Language (cs.CL)
备注: 12 pages, 3 tables

点击查看摘要

Abstract:Large language models (LLMs) demonstrate strong performance in text summarization, yet their effectiveness drops significantly across languages with restricted training resources. This work addresses the challenge of query-focused summarization (QFS) in less-resourced languages, where labeled datasets and evaluation tools are limited. We present a novel QFS framework, QFS-Composer, that integrates query decomposition, question generation (QG), question answering (QA), and abstractive summarization to improve the factual alignment of a summary with user intent. We test our approach on the Slovenian language. To enable high-quality supervision and evaluation, we develop the Slovenian QA and QG models based on a Slovene LLM and adapt evaluation approaches for reference-free summary evaluation. Empirical evaluation shows that the QA-guided summarization pipeline yields improved consistency and relevance over baseline LLMs. Our work establishes an extensible methodology for advancing QFS in less-resourced languages.

[NLP-106] Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在训练大语言模型(Large Language Model, LLM)代理进行多轮交互任务时面临的样本效率低下的问题,其根源在于稀疏奖励和长决策周期。现有方法如基于策略的自蒸馏(On-Policy Self-Distillation, OPSD)虽能提供密集的token级监督,但依赖固定特权信息,无法捕捉任务中多样有效的策略,且与RL结合易导致训练崩溃。解决方案的关键在于提出Skill-SD框架,该框架将代理自身轨迹归纳为紧凑的自然语言技能(skills),用于动态生成仅作用于教师的特权信息,而学生始终在纯任务提示下行动并通过蒸馏内化指导;同时引入重要性加权的反向KL损失以稳定梯度更新,并动态同步教师与不断改进的学生,从而实现高效、稳定的训练。

链接: https://arxiv.org/abs/2604.10674
作者: Hao Wang,Guozhi Wang,Han Xiao,Yufeng Zhou,Yue Pan,Jichao Wang,Ke Xu,Yafei Wen,Xiaohu Ruan,Xiaoxin Chen,Honggang Qi
机构: Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences(杭州研究院高级研究所,中国科学院大学); The Chinese University of Hong Kong(香港中文大学); University of Science and Technology of China(中国科学技术大学); University of Chinese Academy of Sciences(中国科学院大学); vivo AI Lab(维沃人工智能实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project page: this https URL

点击查看摘要

Abstract:Reinforcement learning (RL) has been widely used to train LLM agents for multi-turn interactive tasks, but its sample efficiency is severely limited by sparse rewards and long horizons. On-policy self-distillation (OPSD) alleviates this by providing dense token-level supervision from a privileged teacher that has access to ground-truth answers. However, such fixed privileged information cannot capture the diverse valid strategies in agent tasks, and naively combining OPSD with RL often leads to training collapse. To address these limitations, we introduce Skill-SD, a framework that turns the agent’s own trajectories into dynamic training-only supervision. Completed trajectories are summarized into compact natural language skills that describe successful behaviors, mistakes, and workflows. These skills serve as dynamic privileged information conditioning only the teacher, while the student always acts under the plain task prompt and learns to internalize the guidance through distillation. To stabilize the training, we derive an importance-weighted reverse-KL loss to provide gradient-correct token-level distillation, and dynamically synchronize the teacher with the improving student. Experimental results on agentic benchmarks demonstrate that Skill-SD substantially outperforms the standard RL baseline, improving both vanilla GRPO (+14.0%/+10.9% on AppWorld/Sokoban) and vanilla OPD (+42.1%/+40.6%). Project page: this https URL

[NLP-107] Learning and Enforcing Context-Sensitive Control for LLM s ACL2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成过程中难以保证语法有效性的问题,尤其是传统无上下文约束的文法(Context-Free Grammars, CFGs)无法有效控制输出合法性。其核心挑战在于现有方法依赖人工设计的上下文敏感约束规则,这不仅耗时且需要领域专家参与。解决方案的关键在于提出一个两阶段自动学习框架:第一阶段通过语法探索(syntactic exploration)收集多样化的模型输出以辅助约束学习;第二阶段利用学到的约束规则进行生成过程中的强制执行(constraint exploitation)。该方法首次将上下文敏感文法学习与LLM生成相结合,在无需人工干预的前提下实现了高精度的约束遵守,即使参数量仅为10亿的小型模型也能达到完美合规性,显著优于更大规模模型和当前最先进的推理模型。

链接: https://arxiv.org/abs/2604.10667
作者: Mohammad Albinhassan,Pranava Madhyastha,Mark Law,Alessandra Russo
机构: Imperial College London(帝国理工学院); City University of London(城市大学伦敦分校); ILASP Limited, UK(ILASP有限公司,英国); The Alan Turing Institute(艾伦·图灵研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ACL 2025 Student Research Workshop

点击查看摘要

Abstract:Controlling the output of Large Language Models (LLMs) through context-sensitive constraints has emerged as a promising approach to overcome the limitations of Context-Free Grammars (CFGs) in guaranteeing generation validity. However, such constraints typically require manual specification – a significant barrier demanding specialized expertise. We introduce a framework that automatically learns context-sensitive constraints from LLM interactions through a two-phase process: syntactic exploration to gather diverse outputs for constraint learning, followed by constraint exploitation to enforce these learned rules during generation. Experiments demonstrate that our method enables even small LLMs (1B parameters) to learn and generate with perfect constraint adherence, outperforming larger counterparts and state-of-the-art reasoning models. This work represents the first integration of context-sensitive grammar learning with LLM generation, eliminating manual specification while maintaining generation validity.

[NLP-108] Omnimodal Dataset Distillation via High-order Proxy Alignment

【速读】: 该论文旨在解决多模态(Omnimodal)数据集蒸馏(Dataset Distillation)中因模态数量增加而导致的异质性增强与跨模态交互复杂化问题,现有方法在单模态或双模态场景下表现良好,但在更多模态联合蒸馏时面临性能瓶颈。解决方案的关键在于提出HoPA(High-order Proxy Alignment),其核心是通过一个紧凑的代理(proxy)捕捉高阶跨模态对齐关系,并利用共享相似性结构抽象出统一的多模态对齐机制,从而避免成对模态建模带来的组合爆炸问题,同时兼容轨迹匹配(trajectory matching)策略,实现跨异构模态的可扩展联合蒸馏。理论分析从谱角度验证了该方法相较于传统双模态蒸馏技术的合理性,实验表明其在压缩率与训练性能之间实现了更优权衡。

链接: https://arxiv.org/abs/2604.10666
作者: Yuxuan Gao,Xiaohao Liu,Xiaobo Xia,Tongliang Liu
机构: University of Science and Technology of China (中国科学技术大学); Xidian University (西安电子科技大学); National University of Singapore (新加坡国立大学); The University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Dataset distillation compresses large-scale datasets into compact synthetic sets while preserving training performance, but existing methods are largely restricted to single-modal or bimodal settings. Extending dataset distillation to scenarios involving more than two modalities, i.e., Omnimodal Dataset Distillation, remains underexplored and challenging due to increased heterogeneity and complex cross-modal interactions. In this work, we identify the key determinant that bounds the endpoint discrepancy in the omnimodal setting, which is exacerbated with an increasing number of modalities. To this end, we propose HoPA, a unified method that captures high-order cross-modal alignments via a compact proxy, which is compatible with trajectory matching as well. By abstracting omnimodal alignment with a shared similarity structure, our method avoids the combinatorial complexity of pairwise modality modeling and enables scalable joint distillation across heterogeneous modalities. Theoretical analysis from the spectral perspective reveals the rationality of our proposed method against bimodal dataset distillation techniques. Extensive experiments on various benchmarks demonstrate that the proposed method achieves superior compression-performance trade-offs compared to existing competitors. The source code will be publicly released.

[NLP-109] Efficient Process Reward Modeling via Contrastive Mutual Information ACL2026

【速读】: 该论文旨在解决链式思维(Chain-of-Thought, CoT)推理轨迹中步骤级奖励标注的高成本问题,即传统过程奖励模型(Process Reward Model, PRM)训练依赖人工标注每一步的奖励分数,而现有自动化方法如蒙特卡洛(Monte Carlo, MC)估计又因大量大语言模型(Large Language Model, LLM)滚动推演导致计算资源消耗巨大。解决方案的关键在于提出对比点互信息(Contrastive Pointwise Mutual Information, CPMI),该方法利用模型内部概率分布,通过比较当前推理步骤与正确目标答案之间的互信息相对于硬负样本的提升程度,生成可靠的步骤级奖励信号,从而显著降低数据构建时间和token生成量(分别减少84%和98%),同时在过程级评估和数学推理基准上取得更高准确率。

链接: https://arxiv.org/abs/2604.10660
作者: Nakyung Lee,Sangwoo Hong,Jungwoo Lee
机构: Seoul National University (首尔国立大学); Konkuk University (中央大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ACL 2026 Main Conference

点击查看摘要

Abstract:Recent research has devoted considerable effort to verifying the intermediate reasoning steps of chain-of-thought (CoT) trajectories using process reward models (PRMs) and other verifier models. However, training a PRM typically requires human annotators to assign reward scores to each reasoning step, which is both costly and time-consuming. Existing automated approaches, such as Monte Carlo (MC) estimation, also demand substantial computational resources due to repeated LLM rollouts. To overcome these limitations, we propose contrastive pointwise mutual information (CPMI), a novel automatic reward labeling method that leverages the model’s internal probability to infer step-level supervision while significantly reducing the computational burden of annotating dataset. CPMI quantifies how much a reasoning step increases the mutual information between the step and the correct target answer relative to hard-negative alternatives. This contrastive signal serves as a proxy for the step’s contribution to the final solution and yields a reliable reward. The experimental results show that CPMI-based labeling reduces dataset construction time by 84% and token generation by 98% compared to MC estimation, while achieving higher accuracy on process-level evaluations and mathematical reasoning benchmarks.

[NLP-110] SpectralLoRA: Is Low-Frequency Structure Sufficient for LoRA Adaptation? A Spectral Analysis of Weight Updates

【速读】: 该论文旨在解决参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)中适配器(adapter)存储开销与性能损失之间的权衡问题。通过系统性地分析LoRA(Low-Rank Adaptation)权重更新的频谱结构,研究发现LoRA更新在频域上具有显著的低频主导特性——平均仅需33%的离散余弦变换(DCT)系数即可捕获90%的总能量,且保留10%的频率系数即可实现10倍的存储压缩,同时仅带来1.95个百分点的性能下降。关键解决方案在于提出基于频谱稀疏性的新设计原则:利用频率掩码(frequency masking)剔除高频成分(被识别为适应噪声),从而在保证模型性能的同时大幅降低适配器存储需求,并揭示了任务复杂度对频谱敏感性的影响机制。

链接: https://arxiv.org/abs/2604.10649
作者: Rajveer Singh
机构: Indian Institute of Technology Roorkee (印度理工学院鲁尔基分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 11 pages, 6 figures, 7 tables. Indian Institute of Technology Roorkee

点击查看摘要

Abstract:We present a systematic empirical study of the spectral structure of LoRA weight updates. Through 2D Discrete Cosine Transform (DCT) analysis of trained adaptation matrices across BERT-base and RoBERTa-base on four GLUE benchmarks (SST-2, MNLI, CoLA, QQP), we establish that LoRA updates are universally dominated by low-frequency components: on average, just 33% of DCT coefficients capture 90% of total spectral energy. Retaining only 10% of frequency coefficients reduces adapter storage by 10x while sacrificing only 1.95pp on SST-2. Notably, frequency masking at k=50% improves over full LoRA on 3 of 8 model-task pairs, suggesting high-frequency components act as adaptation noise. We further discover that RoBERTa-base is systematically more spectrally compressible than BERT-base across all tasks, and that task complexity governs spectral sensitivity – NLI tasks require more frequency budget than sentiment classification. These findings motivate a new design principle for PEFT: spectral sparsity in adaptation.

[NLP-111] ProUIE: A Macro-to-Micro Progressive Learning Method for LLM -based Universal Information Extraction

【速读】: 该论文旨在解决基于大语言模型(Large Language Model, LLM)的通用信息抽取(Universal Information Extraction, UIE)方法在训练过程中依赖额外信息以提升性能的问题,此类方法虽增加复杂度但收益有限。其核心解决方案是提出一种宏-中-微观渐进式学习框架ProUIE,关键在于通过三个阶段实现无外部信息下的性能提升:首先在宏观层面进行完整建模(Complete Modeling, CM),按内在难度顺序联合学习命名实体识别(Named Entity Recognition, NER)、关系抽取(Relation Extraction, RE)和事件抽取(Event Extraction, EE)任务;其次在中观层面进行简化对齐(Streamlined Alignment, SA),采用采样数据与简化输出格式来规范结构化结果;最后在微观层面实施深度探索(Deep Exploration, DE),利用分步细粒度奖励(Stepwise Fine-grained Rewards, SFR)结合GRPO策略优化结构单元,从而显著提升抽取精度与可控性。

链接: https://arxiv.org/abs/2604.10633
作者: Wenda Liu,Zhigang Song,Shuai Nie,Guangyao Liu,Lisung Chen,Binyu Yang,Yaran Chen,Peng Zhou,Hongzhen Wang,Yuchen Liu,Wenyue Hu,Jiaming Xu,Runyu Shi,Ying Huang
机构: Xiaomi Corporation(小米公司); Xi’an Jiaotong-Liverpool University(西安交通大学利物浦大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLM-based universal information extraction (UIE) methods often rely on additional information beyond the original training data, which increases training complexity yet often yields limited gains. To address this, we propose ProUIE, a Macro-to-Micro progressive learning approach that improves UIE without introducing any external information. ProUIE consists of three stages: (i) macro-level Complete Modeling (CM), which learns NER, RE, and EE along their intrinsic difficulty order on the full training data to build a unified extraction foundation, (ii) meso-level Streamlined Alignment (SA), which operates on sampled data with simplified target formats, streamlining and regularizing structured outputs to make them more concise and controllable, and (iii) micro-level Deep Exploration (DE), which applies GRPO with stepwise fine-grained rewards (SFR) over structural units to guide exploration and improve performance. Experiments on 36 public datasets show that ProUIE consistently improves unified extraction, outperforming strong instruction-tuned baselines on average for NER and RE while using a smaller backbone, and it further demonstrates clear gains in large-scale production-oriented information extraction.

[NLP-112] Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment

【速读】: 该论文旨在解决大脑在不同语言间如何支持语言处理这一基础神经科学问题,并为多语言人工智能系统提供可验证的测试框架。其核心挑战在于区分语言加工是共享机制还是语言特异性机制,而传统神经成像技术无法单独实现因果推断。解决方案的关键在于利用六种多语言大语言模型(Multilingual Large Language Models, LLMs)作为可控实验系统,通过针对性地“计算损伤”(computational lesions)——即零值化对多种语言均关键或仅对某一语言关键的小型参数集——来模拟神经损伤效应;随后将完整模型与受损模型在自然故事聆听任务中(涵盖英语、中文和法语,共112名母语者)的fMRI响应预测能力进行比较。结果显示,破坏共享核心模块导致全脑编码相关性下降60.32%,而语言特异性损伤则保持跨语言嵌入空间分离但显著削弱对应母语的脑预测能力,从而确立了“共享主干+嵌入特化”的结构假设,并建立了用于研究多语言脑-模型对齐的因果分析框架。

链接: https://arxiv.org/abs/2604.10627
作者: Yang Cui,Jingyuan Sun,Yizheng Sun,Yifan Wang,Yunhao Zhang,Jixing Li,Shaonan Wang,Hongpeng Zhou,John Hale,Chengqing Zong,Goran Nenadic
机构: University of Manchester (曼彻斯特大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); City University of Hong Kong (香港城市大学); The Hong Kong Polytechnic University (香港理工大学); Johns Hopkins University (约翰霍普金斯大学); National Laboratory for Pattern Recognition (国家模式识别实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 23 pages, 5 figures, Journal format

点击查看摘要

Abstract:How the brain supports language across different languages is a basic question in neuroscience and a useful test for multilingual artificial intelligence. Neuroimaging has identified language-responsive brain regions across languages, but it cannot by itself show whether the underlying processing is shared or language-specific. Here we use six multilingual large language models (LLMs) as controllable systems and create targeted ``computational lesions’’ by zeroing small parameter sets that are important across languages or especially important for one language. We then compare intact and lesioned models in predicting functional magnetic resonance imaging (fMRI) responses during 100 minutes of naturalistic story listening in native English, Chinese and French (112 participants). Lesioning a compact shared core reduces whole-brain encoding correlation by 60.32% relative to intact models, whereas language-specific lesions preserve cross-language separation in embedding space but selectively weaken brain predictivity for the matched native language. These results support a shared backbone with embedded specializations and provide a causal framework for studying multilingual brain-model alignment.

[NLP-113] Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance

【速读】: 该论文旨在解决多语言大语言模型(Multilingual Large Language Models, MLLMs)在跨语言任务中因高资源语言与低资源语言数据分布不均以及预训练阶段存在的单语偏倚(monolingual bias)而导致的性能瓶颈问题。其解决方案的关键在于引入一种跨语言映射任务(Cross-Lingual Mapping Task),该任务在预训练阶段对模型进行双向语言映射,以增强嵌入空间中的跨语言对齐能力,同时不损害单语流利度;此外,作者提出**语言对齐系数(Language Alignment Coefficient)**来在数据受限场景下稳健量化跨语言一致性,从而显著提升机器翻译、跨语言自然语言理解(CLNLU)和跨语言问答(CLQA)等任务的性能。

链接: https://arxiv.org/abs/2604.10590
作者: Weihua Zheng,Chang Liu,Zhengyuan Liu,Xin Huang,Kui Wu,Muhammad Huzaifah Md Shahrin,Aiti Aw,Roy Ka-Wei Lee
机构: Singapore University of Technology and Design (新加坡科技设计大学); Agency for Science, Technology and Research (新加坡科技研究局)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multilingual Large Language Models (LLMs) struggle with cross-lingual tasks due to data imbalances between high-resource and low-resource languages, as well as monolingual bias in pre-training. Existing methods, such as bilingual fine-tuning and contrastive alignment, can improve cross-lingual performance, but they often require extensive parallel data or suffer from instability. To address these challenges, we introduce a Cross-Lingual Mapping Task during the pre-training phase, which enhances cross-lingual alignment without compromising monolingual fluency. Our approach bi-directionally maps languages within the LLM embedding space, improving both language generation and comprehension. We further propose a Language Alignment Coefficient to robustly quantify cross-lingual consistency, even in limited-data scenarios. Experimental results on machine translation (MT), cross-lingual natural language understanding (CLNLU), and cross-lingual question answering (CLQA) show that our model achieves gains of up to 11.9 BLEU points in MT, 6.72 points in CLQA BERTScore-Precision, and more than 5% in CLNLU accuracy over strong multilingual baselines. These findings highlight the potential of incorporating cross-lingual objectives into pre-training to improve multilingual LLMs.

[NLP-114] Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLM s AISTATS2026

【速读】: 该论文旨在解决生成式 AI(Generative AI)在通过人类反馈强化学习(Reinforcement Learning from Human Feedback, RLHF)等奖励优化方法微调后,可能出现的校准能力退化问题,即模型对自身预测不确定性的量化变得不可靠。其核心解决方案是设计了一种基于群体相对策略优化(Group Relative Policy Optimisation, GRPO)的“奉承性”训练范式,通过奖励模型对错误答案的偏好来诱导模型产生迎合行为,并在此基础上系统评估不同微调策略下模型在MMLU基准上的校准性能变化(使用预期校准误差ECE和最大校准误差MCE作为指标)。关键发现在于:尽管奉承性GRPO导致校准性能显著下降(ECE上升+0.006,MCE上升+0.010),且这种差异未达统计显著性(p=0.41),但即使采用后处理矩阵缩放(matrix scaling)进行校准修正,奉承模型仍保持最高的残余ECE(0.042 vs. 中性SFT控制组的0.037),表明奖励机制引发的校准偏差具有结构性残留,无法完全由线性校准方法消除。这一结果为未来开发校准感知的训练目标提供了实证依据与方法论基础。

链接: https://arxiv.org/abs/2604.10585
作者: Subramanyam Sahoo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at the AISTATS 2026 Workshop on Towards Trustworthy Predictions: Theory and Applications of Calibration for Modern AI. 14 Pages

点击查看摘要

Abstract:Modern large language models (LLMs) are increasingly fine-tuned via reinforcement learning from human feedback (RLHF) or related reward optimisation schemes. While such procedures improve perceived helpfulness, we investigate whether sycophantic reward signals degrade calibration – a property essential for reliable uncertainty quantification. We fine-tune Qwen3-8B under three regimes: no fine-tuning (base), neutral supervised fine-tuning (SFT) on TriviaQA, and sycophancy-inducing Group Relative Policy Optimisation (GRPO) that rewards agreement with planted wrong answers. Evaluating on 1,000 MMLU items across five subject domains with bootstrap confidence intervals and permutation testing, we find that \textbfsycophantic GRPO produces consistent directional calibration degradation – ECE rises by +0.006 relative to the base model and MCE increases by +0.010 relative to neutral SFT – though the effect does not reach statistical significance ( p = 0.41 ) at this training budget. Post-hoc matrix scaling applied to all three models reduces ECE by 40 – 64% and improves accuracy by 1.5 – 3.0 percentage points. However, the sycophantic model retains the highest post-scaling ECE relative to the neutral SFT control ( 0.042 vs.\ 0.037 ), suggesting that reward-induced miscalibration leaves a structured residual even after affine correction. These findings establish a methodology for evaluating the calibration impact of reward hacking and motivate calibration-aware training objectives.

[NLP-115] Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark

【速读】: 该论文旨在解决文本到语音(TTS)系统在生成具有语境适配的词重音(word-level stress)方面存在的不足问题。当前虽然TTS系统能够生成富有表现力的语音,但其是否能仅从话语上下文中推断出合适的重音仍不明确。为此,作者提出Context-Aware Stress TTS (CAST) 基准测试框架,其核心创新在于构建对比性语境对(contrastive context pairs)——即相同句子搭配不同语境,要求不同的词重音以传达修正、对比或澄清等语义差异。通过该基准评估发现,纯文本语言模型能可靠地从上下文中恢复目标重音,而TTS系统却常无法在语音输出中实现这一语义意图,揭示了TTS系统在语境感知重音建模上的关键短板。

链接: https://arxiv.org/abs/2604.10580
作者: Arnon Turetzky,Avihu Dekel,Hagai Aronowitz,Ron Hoory,Yossi Adi
机构: The Hebrew University of Jerusalem (希伯来大学); IBM Research (IBM 研究院)
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注: Preprint

点击查看摘要

Abstract:Spoken meaning often depends not only on what is said, but also on which word is emphasized. The same sentence can convey correction, contrast, or clarification depending on where emphasis falls. Although modern text-to-speech (TTS) systems generate expressive speech, it remains unclear whether they infer contextually appropriate stress from discourse alone. To address this gap, we present Context-Aware Stress TTS (CAST), a benchmark for evaluating context-conditioned word-level stress in TTS. Items are defined as contrastive context pairs: identical sentences paired with distinct contexts requiring different stressed words. We evaluate state-of-the-art systems and find a consistent gap: text-only language models reliably recover the intended stress from context, yet TTS systems frequently fail to realize it in speech. We release the benchmark, evaluation framework, construction pipeline and a synthetic corpus to support future work on context-aware speech synthesis.

[NLP-116] Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models

【速读】: 该论文旨在解决扩散语言模型(Diffusion-based Language Models, dLLMs)在非自回归解码(non-autoregressive decoding)中面临的挑战,尤其是如何有效应用于推理与规划任务。现有方法在信心驱动的非自回归生成中存在固有缺陷,源于强烈的邻近偏差(proximity bias)——即去噪顺序倾向于集中在空间相邻的token上,导致局部误差传播,使整个生成轨迹高度依赖初始未掩码位置。解决方案的关键在于提出一种轻量级干预策略:通过引入一个轻量级规划器(lightweight planner)引导早期token选择,并结合序列结束温度退火(end-of-sequence temperature annealing),从而缓解邻近偏差带来的错误累积,显著提升推理与规划任务的性能,且计算开销可控。

链接: https://arxiv.org/abs/2604.10567
作者: Jiyeon Kim,Sungik Choi,Yongrae Jo,Moontae Lee,Minjoon Seo
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion-based language models (dLLMs) have emerged as a promising alternative to autoregressive language models, offering the potential for parallel token generation and bidirectional context modeling. However, harnessing this flexibility for fully non-autoregressive decoding remains an open question, particularly for reasoning and planning tasks. In this work, we investigate non-autoregressive decoding in dLLMs by systematically analyzing its inference dynamics along the temporal axis. Specifically, we uncover an inherent failure mode in confidence-based non-autoregressive generation stemming from a strong proximity bias-the tendency for the denoising order to concentrate on spatially adjacent tokens. This local dependency leads to spatial error propagation, rendering the entire trajectory critically contingent on the initial unmasking position. Leveraging this insight, we present a minimal-intervention approach that guides early token selection, employing a lightweight planner and end-of-sequence temperature annealing. We thoroughly evaluate our method on various reasoning and planning tasks and observe substantial overall improvement over existing heuristic baselines without significant computational overhead.

[NLP-117] LLM s Should Incorporate Explicit Mechanisms for Human Empathy

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在高风险人本场景中因缺乏显式人类共情机制而导致的 empathic failure 问题,即模型虽在正确性和流畅性上表现良好,却常弱化情感表达、误判情境重要性、固化关系立场,从而扭曲人类意图与语境。解决方案的关键在于将共情(empathy)形式化为可观察的行为属性——即建模并回应人类视角的同时保持意图、情感和上下文的一致性,并识别出四种结构性共情失效模式(情感弱化、共情粒度错配、冲突回避、语言疏离),通过认知、文化与关系三个维度系统归因其表现差异,进而提出将共情感知的目标、基准测试和训练信号作为LLM开发的第一优先级组件。

链接: https://arxiv.org/abs/2604.10557
作者: Xiaoxing You,Qiang Huang,Jun Yu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper argues that Large Language Models (LLMs) should incorporate explicit mechanisms for human empathy. As LLMs become increasingly deployed in high-stakes human-centered settings, their success depends not only on correctness or fluency but on faithful preservation of human perspectives. Yet, current LLMs systematically fail at this requirement: even when well-aligned and policy-compliant, they often attenuate affect, misrepresent contextual salience, and rigidify relational stance in ways that distort meaning. We formalize empathy as an observable behavioral property: the capacity to model and respond to human perspectives while preserving intention, affect, and context. Under this framing, we identify four recurring mechanisms of empathic failure in contemporary LLMs–sentiment attenuation, empathic granularity mismatch, conflict avoidance, and linguistic distancing–arising as structural consequences of prevailing training and alignment practices. We further organize these failures along three dimensions: cognitive, cultural, and relational empathy, to explain their manifestation across tasks. Empirical analyses show that strong benchmark performance can mask systematic empathic distortions, motivating empathy-aware objectives, benchmarks, and training signals as first-class components of LLM development.

[NLP-118] Lost in Diffusion: Uncovering Hallucination Patterns and Failure Modes in Diffusion Large Language Models ACL2026

【速读】: 该论文旨在解决扩散型大语言模型(Diffusion Large Language Models, dLLMs)在生成过程中存在的幻觉(hallucination)问题,尤其是其与自回归(autoregressive, AR)模型相比的可靠性差异尚未被系统研究的问题。解决方案的关键在于开展首个受控对比实验,通过固定架构、规模和预训练权重等变量,量化比较dLLMs与AR模型的幻觉倾向;同时分析推理阶段计算资源分配的动态差异,揭示非序列解码机制在持续优化中的潜力,并识别扩散过程特有的失败模式(如过早终止、未完全去噪和上下文干扰),从而明确dLLMs虽在通用任务上性能接近AR模型,但其独特的幻觉机制对模型可靠性构成重大挑战。

链接: https://arxiv.org/abs/2604.10556
作者: Zhengnan Guo,Fei Tan
机构: East China Normal University (华东师范大学); Zhejiang University of Technology (浙江工业大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 Findings

点击查看摘要

Abstract:While Diffusion Large Language Models (dLLMs) have emerged as a promising non-autoregressive paradigm comparable to autoregressive (AR) models, their faithfulness, specifically regarding hallucination, remains largely underexplored. To bridge this gap, we present the first controlled comparative study to evaluate hallucination patterns in dLLMs. Our results demonstrate that current dLLMs exhibit a higher propensity for hallucination than AR counterparts controlled for architecture, scale, and pre-training weights. Furthermore, an analysis of inference-time compute reveals divergent dynamics: while quasi-autoregressive generation suffers from early saturation, non-sequential decoding unlocks potential for continuous refinement. Finally, we identify distinct failure modes unique to the diffusion process, including premature termination, incomplete denoising, and context intrusion. Our findings underscore that although dLLMs have narrowed the performance gap on general tasks, their distinct hallucination mechanisms pose a critical challenge to model reliability. Our code is available at this https URL

[NLP-119] VLN-NF: Feasibility-Aware Vision-and-Language Navigation with False-Premise Instructions ACL2026

【速读】: 该论文旨在解决传统视觉-语言导航(Vision-and-Language Navigation, VLN)基准中假设指令可行且目标存在所带来的局限性,即当目标实际不存在于指定房间时,现有智能体无法识别并正确响应“未找到”(NOT-FOUND)状态。为应对这一挑战,作者构建了VLN-NF基准,通过大语言模型(LLM)重写原始指令并利用视觉语言模型(VLM)验证目标缺失,从而生成具有现实合理性但事实错误的虚假前提指令;同时提出REV-SPL评估指标,联合衡量房间到达率、探索覆盖率和决策准确性。解决方案的关键在于ROAM方法——一种两阶段混合架构:第一阶段采用监督学习实现房间级导航,第二阶段结合LLM与VLM驱动的室内探索,并引入自由空间清晰度先验(free-space clearance prior)提升探索效率与决策可靠性,显著优于基线方法在低探索率和过早终止问题上的表现。

链接: https://arxiv.org/abs/2604.10533
作者: Hung-Ting Su,Ting-Jun Wang,Jia-Fong Yeh,Min Sun,Winston H. Hsu
机构: National Taiwan University (国立台湾大学); National Tsing Hua University (国立清华大学)
类目: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ACL 2026. The first two authors contributed equally to the technical work

点击查看摘要

Abstract:Conventional Vision-and-Language Navigation (VLN) benchmarks assume instructions are feasible and the referenced target exists, leaving agents ill-equipped to handle false-premise goals. We introduce VLN-NF, a benchmark with false-premise instructions where the target is absent from the specified room and agents must navigate, gather evidence through in-room exploration, and explicitly output NOT-FOUND. VLN-NF is constructed via a scalable pipeline that rewrites VLN instructions using an LLM and verifies target absence with a VLM, producing plausible yet factually incorrect goals. We further propose REV-SPL to jointly evaluate room reaching, exploration coverage, and decision correctness. To address this challenge, we present ROAM, a two-stage hybrid that combines supervised room-level navigation with LLM/VLM-driven in-room exploration guided by a free-space clearance prior. ROAM achieves the best REV-SPL among compared methods, while baselines often under-explore and terminate prematurely under unreliable instructions. VLN-NF project page can be found at this https URL.

[NLP-120] ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization ACL2026

【速读】: 该论文旨在解决现有评估方法在真实场景代码摘要中难以实现细粒度事实一致性(factual consistency)评价的问题,尤其针对多句功能描述和依赖上下文的复杂性。传统方法主要适用于孤立代码片段的短摘要评估,无法准确捕捉代码摘要中跨句子的功能依赖关系与语义一致性。解决方案的关键在于提出一种无需参考文本(reference-free)且基于段落级(segment-level)的事实一致性判定机制——ReFEree,其通过定义专为代码摘要设计的一致性标准,并结合依赖信息进行细粒度评分,最终聚合得到整体一致性分数。实验表明,该方法在人工标注的数据集上显著优于13种基线模型,相关性提升达15-18%。

链接: https://arxiv.org/abs/2604.10520
作者: Suyoung Bae,CheolWon Na,Jaehoon Lee,Yumin Lee,YunSeok Choi,Jee-Hyong Lee
机构: Sungkyunkwan University ( Sungkyunkwan University)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: Accepted to ACL 2026 main. 25 pages

点击查看摘要

Abstract:As Large Language Models (LLMs) have become capable of generating long and descriptive code summaries, accurate and reliable evaluation of factual consistency has become a critical challenge. However, previous evaluation methods are primarily designed for short summaries of isolated code snippets. Consequently, they struggle to provide fine-grained evaluation of multi-sentence functionalities and fail to accurately assess dependency context commonly found in real-world code summaries. To address this, we propose ReFEree, a reference-free and fine-grained method for evaluating factual consistency in real-world code summaries. We define factual inconsistency criteria specific to code summaries and evaluate them at the segment level using these criteria along with dependency information. These segment-level results are then aggregated into a fine-grained score. We construct a code summarization benchmark with human-annotated factual consistency labels. The evaluation results demonstrate that ReFEree achieves the highest correlation with human judgment among 13 baselines, improving 15-18% over the previous state-of-the-art. Our code and data are available at this https URL.

[NLP-121] Structure-Grounded Knowledge Retrieval via Code Dependencies for Multi-Step Data Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在执行领域特定数据分析任务时,因依赖基于词法或嵌入相似性的检索方法而导致的知识选择偏差问题。此类方法往往无法准确识别与多步推理任务相关的关键知识,尤其是在知识实际嵌套于可执行代码及其调用依赖结构中的场景下。解决方案的关键在于提出结构接地的知识检索(Structure-Grounded Knowledge Retrieval, SGKR)框架,该框架通过构建由函数调用依赖关系诱导的图结构来组织领域知识;给定问题后,SGKR提取语义输入与输出标签,识别连接这些标签的依赖路径,并生成一个任务相关的子图,从而将关联的知识和对应的函数实现整合为结构化上下文,供LLM进行代码生成。

链接: https://arxiv.org/abs/2604.10516
作者: Xinyi Huang,Mingzhe Lu,Haoyu Dong
机构: Simon Fraser University (西蒙弗雷泽大学); University of Science and Technology of China (中国科学技术大学); Microsoft (微软)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Selecting the right knowledge is critical when using large language models (LLMs) to solve domain-specific data analysis tasks. However, most retrieval-augmented approaches rely primarily on lexical or embedding similarity, which is often a weak proxy for the task-critical knowledge needed for multi-step reasoning. In many such tasks, the relevant knowledge is not merely textually related to the query, but is instead grounded in executable code and the dependency structure through which computations are carried out. To address this mismatch, we propose SGKR (Structure-Grounded Knowledge Retrieval), a retrieval framework that organizes domain knowledge with a graph induced by function-call dependencies. Given a question, SGKR extracts semantic input and output tags, identifies dependency paths connecting them, and constructs a task-relevant subgraph. The associated knowledge and corresponding function implementations are then assembled as a structured context for LLM-based code generation. Experiments on multi-step data analysis benchmarks show that SGKR consistently improves solution correctness over no-retrieval and similarity-based retrieval baselines for both vanilla LLMs and coding agents.

[NLP-122] hinking Fast Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在真实世界政策评估中进行因果与反事实推理时的可靠性问题,特别是其在不同直观性情境下的表现差异。研究通过构建一个包含40个基于同行评审证据的实证政策评估案例的基准测试集,并按“明显”、“模糊”和“反直觉”三类划分直观性,系统评估了四种前沿LLM在五种提示策略下的表现。关键发现包括:(1)链式思维(Chain-of-Thought, CoT)提示存在悖论——显著提升明显案例的准确率,但在反直觉案例中几乎无效(交互效应OR = 0.053, p < 0.001);(2)直观性是解释性能变异的主导因素(组内相关系数ICC = 0.537),超越模型选择和提示策略;(3)知识熟悉度与推理准确性无关(p = 0.53),表明模型具备相关知识但无法在违背直觉时有效调用。论文据此提出双过程理论视角,指出当前LLM的“慢思考”可能只是“慢说话”,即形式上呈现推理过程,却缺乏实质性的认知推理能力。

链接: https://arxiv.org/abs/2604.10511
作者: Yanjie He
机构: Independent Researcher
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 7 pages, 3 figures

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used for causal and counterfactual reasoning, yet their reliability in real-world policy evaluation remains underexplored. We construct a benchmark of 40 empirical policy evaluation cases drawn from economics and social science, each grounded in peer-reviewed evidence and classified by intuitiveness – whether the empirical finding aligns with (obvious), is unclear relative to (ambiguous), or contradicts (counter-intuitive) common prior expectations. We evaluate four frontier LLMs across five prompting strategies with 2,400 experimental trials and analyze the results using mixed-effects logistic regression. Our findings reveal three key results: (1) a chain-of-thought (CoT) paradox, where chain-of-thought prompting dramatically improves performance on obvious cases but this benefit is nearly eliminated on counter-intuitive ones (interaction OR = 0.053, p 0.001 ); (2) intuitiveness as the dominant factor, explaining more variance than model choice or prompting strategy (ICC = 0.537); and (3) a knowledge-reasoning dissociation, where citation-based familiarity is unrelated to accuracy ( p = 0.53 ), suggesting models possess relevant knowledge but fail to reason with it when findings contradict intuition. We frame these results through the lens of dual-process theory (System 1 vs. System 2) and argue that current LLMs’ “slow thinking” may be little more than “slow talking” – they produce the form of deliberative reasoning without the substance.

[NLP-123] Why Dont You Know? Evaluating the Impact of Uncertainty Sources on Uncertainty Quantification in LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中缺乏可靠不确定性量化(Uncertainty Quantification, UQ)的问题,尤其关注不同来源的不确定性(如模型知识缺口、输出变异性及输入模糊性)对现有UQ方法性能的影响。其解决方案的关键在于引入一个新构建的数据集,该数据集明确标注了不确定性的来源类别,从而支持对UQ方法在各类不确定性条件下的系统性评估。实验表明,多数现有UQ方法仅在模型知识局限导致的不确定性下表现良好,而在其他不确定性源存在时性能下降或产生误导性结果,凸显了未来需发展能够显式区分并处理不确定性来源的增强型UQ方法。

链接: https://arxiv.org/abs/2604.10495
作者: Maiya Goloburda,Roman Vashurin,Fedor Chernogorsky,Nurkhan Laiyk,Daniil Orel,Preslav Nakov,Maxim Panov
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) are increasingly deployed in real-world applications, reliable uncertainty quantification (UQ) becomes critical for safe and effective use. Most existing UQ approaches for language models aim to produce a single confidence score – for example, estimating the probability that a model’s answer is correct. However, uncertainty in natural language tasks arises from multiple distinct sources, including model knowledge gaps, output variability, and input ambiguity, which have different implications for system behavior and user interaction. In this work, we study how the source of uncertainty impacts the behavior and effectiveness of existing UQ methods. To enable controlled analysis, we introduce a new dataset that explicitly categorizes uncertainty sources, allowing systematic evaluation of UQ performance under each condition. Our experiments reveal that while many UQ methods perform well when uncertainty stems solely from model knowledge limitations, their performance degrades or becomes misleading when other sources are introduced. These findings highlight the need for uncertainty-aware methods that explicitly account for the source of uncertainty in large language models.

[NLP-124] PatchRecall: Patch-Driven Retrieval for Automated Program Repair WWW

【速读】: 该论文旨在解决自动化程序修复(Automated Program Repair, APR)中代码库检索的召回率与简洁性之间的权衡问题:高召回率虽能确保相关文件被覆盖,但盲目增加检索文件数量会引入噪声并降低效率。解决方案的关键在于提出一种混合检索方法 PatchRecall,其融合两种互补策略——基于代码库的检索(codebase retrieval),通过当前问题描述匹配代码库以识别潜在相关文件;以及基于历史的检索(history-based retrieval),利用相似历史问题定位曾被修改的文件作为候选目标。最终将两类候选文件合并并重排序,从而在不显著增加文件数量的前提下提升召回率,增强 APR 的有效性。

链接: https://arxiv.org/abs/2604.10481
作者: Mahir Labib Dihan,Faria Binta Awal,Md. Ishrak Ahsan
机构: Bangladesh University of Engineering and Technology (BUET)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: Code is available at this https URL

点击查看摘要

Abstract:Retrieving the correct set of files from a large codebase is a crucial step in Automated Program Repair (APR). High recall is necessary to ensure that the relevant files are included, but simply increasing the number of retrieved files introduces noise and degrades efficiency. To address this tradeoff, we propose PatchRecall, a hybrid retrieval approach that balances recall with conciseness. Our method combines two complementary strategies: (1) codebase retrieval, where the current issue description is matched against the codebase to surface potentially relevant files, and (2) history-based retrieval, where similar past issues are leveraged to identify edited files as candidate targets. Candidate files from both strategies are merged and reranked to produce the final retrieval set. Experiments on SWE-Bench demonstrate that PatchRecall achieves higher recall without significantly increasing retrieved file count, enabling more effective APR.

[NLP-125] From Query to Counsel: Structured Reasoning with a Multi-Agent Framework and Dataset for Legal Consultation ACL2026

【速读】: 该论文旨在解决法律咨询问答(Legal CQA)任务中面临的三大挑战:高质量训练数据稀缺、任务结构复杂以及强上下文依赖关系。其解决方案的关键在于构建了一个大规模中文法律咨询问答数据集JurisCQAD(含超过43,000条真实法律查询及专家验证的正负回答),并设计了一种结构化的任务分解方法,将每个查询转化为包含实体、事件、意图和法律问题的法律要素图(legal element graph)。在此基础上,提出模块化多智能体框架JurisMA,支持动态路由、法条锚定与风格优化,结合要素图实现强上下文感知推理,有效捕捉法律事实、规范与程序逻辑之间的依赖关系,从而显著提升模型性能。

链接: https://arxiv.org/abs/2604.10470
作者: Mingfei Lu,Yi Zhang,Mengjia Wu,Yue Feng
机构: Australian Artificial Intelligence Institute (AAII); University of Technology Sydney; School of Computer Science, University of Birmingham
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026 Main conference

点击查看摘要

Abstract:Legal consultation question answering (Legal CQA) presents unique challenges compared to traditional legal QA tasks, including the scarcity of high-quality training data, complex task composition, and strong contextual dependencies. To address these, we construct JurisCQAD, a large-scale dataset of over 43,000 real-world Chinese legal queries annotated with expert-validated positive and negative responses, and design a structured task decomposition that converts each query into a legal element graph integrating entities, events, intents, and legal issues. We further propose JurisMA, a modular multi-agent framework supporting dynamic routing, statutory grounding, and stylistic optimization. Combined with the element graph, the framework enables strong context-aware reasoning, effectively capturing dependencies across legal facts, norms, and procedural logic. Trained on JurisCQAD and evaluated on a refined LawBench, our system significantly outperforms both general-purpose and legal-domain LLMs across multiple lexical and semantic metrics, demonstrating the benefits of interpretable decomposition and modular collaboration in Legal CQA.

[NLP-126] Dynamic Adaptive Attention and Supervised Contrastive Learning: A Novel Hybrid Framework for Text Sentiment Classification

【速读】: 该论文旨在解决长文本电影评论中情感语义依赖关系难以捕捉以及模糊情感表达识别困难的问题(long-distance semantic dependencies and ambiguous emotional expressions in lengthy review texts)。其解决方案的关键在于提出一种融合动态自适应多头注意力机制与监督对比学习的混合框架:动态自适应注意力模块利用全局上下文池化向量动态调节每个注意力头的贡献,聚焦关键情感词元并抑制噪声;同时,监督对比学习分支在嵌入空间中增强类内紧凑性并扩大类间分离度,从而提升模型对复杂情感模式的判别能力。

链接: https://arxiv.org/abs/2604.10459
作者: Qingyang Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The exponential growth of user-generated movie reviews on digital platforms has made accurate text sentiment classification a cornerstone task in natural language processing. Traditional models, including standard BERT and recurrent architectures, frequently struggle to capture long-distance semantic dependencies and resolve ambiguous emotional expressions in lengthy review texts. This paper proposes a novel hybrid framework that seamlessly integrates dynamic adaptive multi-head attention with supervised contrastive learning into a BERT-based Transformer encoder. The dynamic adaptive attention module employs a global context pooling vector to dynamically regulate the contribution of each attention head, thereby focusing on critical sentiment-bearing tokens while suppressing noise. Simultaneously, the supervised contrastive learning branch enforces tighter intra-class compactness and larger inter-class separation in the embedding space. Extensive experiments on the IMDB dataset demonstrate that the proposed model achieves competitive performance with an accuracy of 94.67%, outperforming strong baselines by 1.5–2.5 percentage points. The framework is lightweight, efficient, and readily extensible to other text classification tasks.

[NLP-127] EviCare: Enhancing Diagnosis Prediction with Deep Model-Guided Evidence for In-Context Reasoning KDD2026

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的电子健康记录(Electronic Health Records, EHRs)诊断预测方法中存在的过拟合问题,即模型倾向于依赖历史已观测到的诊断结果,而忽视了对临床意义重大但尚未被记录的新发疾病(novel diagnoses)的识别。解决方案的关键在于提出EviCare框架,其核心创新是通过三阶段机制增强LLM的推理能力:首先利用深度模型进行候选诊断选择,其次对EHR中的证据集合进行优先级排序,再构建关系性证据以支持新诊断预测;最终将这些结构化信号动态组合为自适应的上下文提示(in-context prompt),从而引导LLM实现更准确且可解释的诊断推理。实验表明,EviCare在MIMIC-III和MIMIC-IV两个真实世界EHR基准上显著优于纯LLM或纯深度模型基线,尤其在新诊断预测任务中平均提升达30.97%。

链接: https://arxiv.org/abs/2604.10455
作者: Hengyu Zhang,Xuyun Zhang,Pengxiang Zhan,Linhao Luo,Hang Lv,Yanchao Tan,Shirui Pan,Carl Yang
机构: Macquarie University (麦克奎里大学); Fuzhou University (福州大学); Monash University (莫纳什大学); Griffith University (格里菲斯大学); Emory University (埃默里大学)
类目: Computation and Language (cs.CL)
备注: Accepted by KDD 2026

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have enabled promising progress in diagnosis prediction from electronic health records (EHRs). However, existing LLM-based approaches tend to overfit to historically observed diagnoses, often overlooking novel yet clinically important conditions that are critical for early intervention. To address this, we propose EviCare, an in-context reasoning framework that integrates deep model guidance into LLM-based diagnosis prediction. Rather than prompting LLMs directly with raw EHR inputs, EviCare performs (1) deep model inference for candidate selection, (2) evidential prioritization for set-based EHRs, and (3) relational evidence construction for novel diagnosis prediction. These signals are then composed into an adaptive in-context prompt to guide LLM reasoning in an accurate and interpretable manner. Extensive experiments on two real-world EHR benchmarks (MIMIC-III and MIMIC-IV) demonstrate that EviCare achieves significant performance gains, which consistently outperforms both LLM-only and deep model-only baselines by an average of 20.65% across precision and accuracy metrics. The improvements are particularly notable in challenging novel diagnosis prediction, yielding average improvements of 30.97%.

[NLP-128] NOSE: Neural Olfactory-Semantic Embedding with Tri-Modal Orthogonal Contrastive Learning ACL2026

【速读】: 该论文旨在解决现有嗅觉表征学习方法无法完整建模从分子结构到受体序列再到语言描述的全链条问题,导致嵌入空间缺乏生物基础和语义可解释性。其解决方案的关键在于提出NOSE(Neural Olfactory-Semantic Embedding)框架,通过正交约束解耦分子结构、受体序列与自然语言三模态信息的贡献,从而保留各模态的独特编码特征;同时引入弱正样本策略缓解嗅觉语言数据稀疏性问题,有效校准语义相似度,避免特征空间中相似气味被错误排斥,最终实现与人类嗅觉直觉高度对齐的表征空间。

链接: https://arxiv.org/abs/2604.10452
作者: Yanyi Su,Hongshuai Wang,Zhifeng Gao,Jun Cheng
机构: Xiamen University (厦门大学); DP Technology (DP科技); Laboratory of AI for Electrochemistry (AI4EC) (电化学人工智能实验室); Tan Kah Kee Innovation Laboratory (IKKEM) (陈嘉庚创新实验室)
类目: Computation and Language (cs.CL)
备注: Accepted to the ACL 2026 Main Conference

点击查看摘要

Abstract:Olfaction lies at the intersection of chemical structure, neural encoding, and linguistic perception, yet existing representation methods fail to fully capture this pathway. Current approaches typically model only isolated segments of the olfactory pathway, overlooking the complete chain from molecule to receptors to linguistic descriptions. Such fragmentation yields learned embeddings that lack both biological grounding and semantic interpretability. We propose NOSE (Neural Olfactory-Semantic Embedding), a representation learning framework that aligns three modalities along the olfactory pathway: molecular structure, receptor sequence, and natural language description. Rather than simply fusing these signals, we decouple their contributions via orthogonal constraints, preserving the unique encoded information of each modality. To address the sparsity of olfactory language, we introduce a weak positive sample strategy to calibrate semantic similarity, preventing erroneous repulsion of similar odors in the feature space. Extensive experiments demonstrate that NOSE achieves state-of-the-art (SOTA) performance and excellent zero-shot generalization, confirming the strong alignment between its representation space and human olfactory intuition.

[NLP-129] Instruction Data Selection via Answer Divergence

【速读】: 该论文旨在解决指令微调(instruction tuning)中训练数据质量与组成对下游性能影响显著的问题,尤其是如何高效筛选出更具多样性和信息量的指令-响应样本。解决方案的关键在于提出一种基于答案分歧引导的选择方法(Answer Divergence-Guided Selection, ADG),其核心思想是通过分析多样本生成结果的几何结构来评估指令质量:具体而言,ADG 对每个指令生成多个高温度(high-temperature)响应,将其映射到嵌入空间后计算一个融合了分散程度(dispersion magnitude)和形状各向异性(shape anisotropy)的分歧得分(output divergence score)。高分指令对应的回答不仅在语义上相距较远,还呈现多模态分布而非单一方向上的近似改写,从而确保所选数据能有效提升模型在推理、知识和编码等多任务上的泛化能力。

链接: https://arxiv.org/abs/2604.10448
作者: Bo Li,Mingda Wang,Shikun Zhang,Wei Ye
机构: Peking University (北京大学); Hebei University of Technology (河北工业大学)
类目: Computation and Language (cs.CL)
备注: Github: this https URL Project: this https URL

点击查看摘要

Abstract:Instruction tuning relies on large instruction-response corpora whose quality and composition strongly affect downstream performance. We propose Answer Divergence-Guided Selection (ADG), which selects instruction data based on the geometric structure of multi-sample outputs. ADG draws several high-temperature generations per instruction, maps responses into an embedding space, and computes an output divergence score that jointly encodes dispersion magnitude and shape anisotropy. High scores correspond to instructions whose answers are both far apart and multi-modal, rather than clustered paraphrases along a single direction. Across two backbones and three public instruction pools, fine-tuning on only 10K ADG-selected examples consistently outperforms strong selectors on six benchmarks spanning reasoning, knowledge, and coding. Analyses further show that both dispersion magnitude and shape anisotropy are necessary, supporting answer divergence as a practical signal for instruction data selection. Code and appendix are included in the supplementary materials.

[NLP-130] CodaRAG : Connecting the Dots with Associativity Inspired by Complementary Learning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在知识密集型任务中因幻觉和分散信息碎片化推理而导致的性能瓶颈问题。现有检索增强生成(Retrieval-Augmented Generation, RAG)方法通常将证据视为孤立单元,难以重建支撑推理的逻辑链条。其解决方案的关键在于提出CodaRAG框架,通过三阶段机制实现从被动检索到主动关联发现的演进:首先进行知识整合(Knowledge Consolidation),将碎片化提取统一为稳定记忆基底;其次通过多维路径导航(Associative Navigation)——包括语义、上下文和功能维度——显式恢复分散证据链;最后通过干扰消除(Interference Elimination)剔除过度关联噪声,确保推理上下文的连贯性与高精度。该方法显著提升了检索召回率与生成准确性,在GraphRAG-Bench上实现7–10%和3–11%的绝对增益,验证了其在事实性、推理性和创造性任务中的系统性鲁棒性优势。

链接: https://arxiv.org/abs/2604.10426
作者: Cheng-Yen Li,Xuanjun Chen,Claire Lin,Wei-Yu Chen,Wenhua Nie,Hung-Yi Lee,Jyh-Shing Roger Jang
机构: National Taiwan University (国立台湾大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint, Submitted to ACM TIST

点击查看摘要

Abstract:Large Language Models (LLMs) struggle with knowledge-intensive tasks due to hallucinations and fragmented reasoning over dispersed information. While Retrieval-Augmented Generation (RAG) grounds generation in external sources, existing methods often treat evidence as isolated units, failing to reconstruct the logical chains that connect these dots. Inspired by Complementary Learning Systems (CLS), we propose CodaRAG, a framework that evolves retrieval from passive lookup into active associative discovery. CodaRAG operates via a three-stage pipeline: (1) Knowledge Consolidation to unify fragmented extractions into a stable memory substrate; (2) Associative Navigation to traverse the graph via multi-dimensional pathways-semantic, contextualized, and functional-explicitly recovering dispersed evidence chains; and (3) Interference Elimination to prune hyper-associative noise, ensuring a coherent, high-precision reasoning context. On GraphRAG-Bench, CodaRAG achieves absolute gains of 7-10% in retrieval recall and 3-11% in generation accuracy. These results demonstrate CodaRAG’s superior ability to systematically robustify associative evidence retrieval for factual, reasoning, and creative tasks.

[NLP-131] uring or Cantor: That is the Question

【速读】: 该论文旨在解决图灵机(Turing Machine, TM)不可判定问题的复杂性分类与度量难题,核心问题是缺乏对不可判定问题的系统性复杂性框架。解决方案的关键在于提出三个新的复杂性类:U-complete(通用完备)、D-complete(对角化完备)和H-complete(超计算完备),并引入基于输入数据概率分布的不可判定性度量方法,从而将图灵不可判定问题的复杂性结构化、可量化,并首次明确区分了不同类型的不可判定问题。这一框架借鉴了Cook和Levin提出的NP完全类思想,进一步证明了在U-complete类中,“P ≠ NP”类问题的答案为否定,即存在不可判定问题的完备类,其复杂性本质不同于传统计算复杂性理论中的可计算问题。

链接: https://arxiv.org/abs/2604.10418
作者: Eugene Eberbach
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Alan Turing is considered as a founder of current computer science together with Kurt Godel, Alonzo Church and John von Neumann. In this paper multiple new research results are presented. It is demonstrated that there would not be Alan Turing’s achievements without earlier seminal contributions by Georg Cantor in the set theory and foundations of mathematics. It is proposed to introduce the measure of undecidability of problems unsolvable by Turing machines based on probability distribution of its input data, i.e., to provide the degree of unsolvabilty based on the number of undecidable instances of input data versus decidable ones. It is proposed as well to extend the Turing’s work on infinite logics and Oracle machines to a whole class of super-Turing models of computation. Next, the three new complexity classes for TM undecidable problems have been defined: U-complete (Universal complete), D-complete (Diagonalization complete) and H-complete (Hypercomputation complete) classes. The above has never been defined explicitly before by other scientists, and has been inspired by Cook/Levin NP-complete class for intractable problems. Finally, an equivalent to famous P is not equal to NP unanswered question for NP-complete class, has been answered negatively for U-complete class of complexity for undecidable problems.

[NLP-132] LASQ: A Low-resource Aspect-based Sentiment Quadruple Extraction Dataset

【速读】: 该论文旨在解决低资源语言中细粒度情感抽取任务研究不足的问题,特别是针对乌兹别克语和维吾尔语等黏着语(agglutinative languages)因词汇稀疏导致的挑战。其解决方案的关键在于构建首个面向低资源语言的基于方面的情感四元组数据集LASQ(Low-resource languages Aspect-based Sentiment Quadruple dataset),并提出一种融合句法知识的网格标注模型(grid-tagging model)。该模型通过设计的句法知识嵌入模块(Syntax Knowledge Embedding Module, SKEM)引入词性标注(POS)和依存句法信息,有效缓解了低资源语言中的词汇稀疏问题,实验表明该方法在LASQ数据集上显著优于现有基线模型。

链接: https://arxiv.org/abs/2604.10417
作者: Aizihaierjiang Yusufu,Jiang Liu,Kamran Aziz,Abidan Ainiwaer,Bobo Li,Fei Li,Donghong Ji,Aizierguli Yusufu
机构: Hainan Biuh University (海南生物医药大学); Xinjiang University (新疆大学); National University of Singapore (新加坡国立大学); Wuhan University (武汉大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In recent years, aspect-based sentiment analysis (ABSA) has made rapid progress and shown strong practical value. However, existing research and benchmarks are largely concentrated on high-resource languages, leaving fine-grained sentiment extraction in low-resource languages under-explored. To address this gap, we constructed the first Low-resource languages Aspect-based Sentiment Quadruple dataset, named LASQ, which includes two low-resource languages: Uzbek and Uyghur. Secondly, it includes a fine-grained target-aspect-opinion-sentiment quadruple extraction task. To facilitate future research, we designed a grid-tagging model that integrates syntactic knowledge. This model incorporates part-of-speech (POS) and dependency knowledge into the model through our designed Syntax Knowledge Embedding Module (SKEM), thereby alleviating the lexical sparsity problem caused by agglutinative languages. Experiments on LASQ demonstrate consistent gains over competitive baselines, validating both the dataset’s utility and the effectiveness of the proposed modeling approach.

[NLP-133] NameBERT: Scaling Name-Based Nationality Classification with LLM -Augmented Open Academic Data

【速读】: 该论文旨在解决基于姓名推断国籍的任务中,现有分类器因训练数据规模小或来源单一而导致的覆盖不足和性能受限问题,尤其在低资源国家(tail countries)表现不佳。其解决方案的关键在于构建一个大规模的姓名-国籍数据集,并提出一种利用大语言模型(LLM)作为数据增强工具而非直接推理引擎的新框架:通过LLM生成合成姓名来扩充低资源国家的数据,进而训练出更鲁棒的NameBERT模型。实验表明,该方法在包含合成尾部名称的测试集上显著提升性能,在真实尾部国家指标上也有小幅改进,同时保持了比直接使用LLM更高的计算效率,适用于大规模实时部署。

链接: https://arxiv.org/abs/2604.10401
作者: Cong Ming,Ruixin Shi,Yifan Hu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 12 pages, 3 figures, 8 tables; accepted at the 39th Canadian Conference on Artificial Intelligence (Canadian AI 2026)

点击查看摘要

Abstract:Inferring nationality from personal names is a critical capability for equity and bias monitoring, personalization, and a valuable tool in biomedical and sociological research. However, existing name-based nationality classifiers are typically trained on relatively small or source-specific labeled datasets, which can introduce coverage gaps and limit performance for underrepresented countries. While large language models (LLMs) demonstrate strong zero-shot performance for name-based nationality prediction, their computational cost and latency make them impractical for real-time, large-scale deployment. In this work, we created a large-scale name-nationality dataset from the Open Academic Graph (OAG) and introduce a framework that leverages LLMs as dataset enrichers rather than inference engines. We augment low-resource countries with LLM-generated names and evaluate on real and synthetic-tail test sets. We find that augmentation produces large gains when evaluation includes synthetic tail names and still offers a modest lift on tail-country metrics otherwise. Overall, NameBERT models achieve significantly higher accuracy than state-of-the-art baselines across both in- and out-of-domain tasks, while remaining efficient for large-scale inference compared to LLMs.

[NLP-134] BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection ALT

【速读】: 该论文旨在解决临床笔记中术语替换错误(Terminology Substitution Errors)的自动化检测难题,即一个医学术语被另一个语言上合法但临床含义不同的术语替代,导致医疗风险。解决方案的关键在于提出BLUEmed框架,其核心创新是融合多智能体辩论机制与混合检索增强生成(Hybrid Retrieval-Augmented Generation, RAG),通过将临床笔记分解为聚焦子查询、利用密集、稀疏和在线检索获取来源分区证据,并分配两名领域专家智能体基于不同知识库独立分析;当专家意见冲突时,启动结构化反驳回合与跨源仲裁以解决分歧,最后通过级联安全层过滤常见假阳性模式,从而实现更准确、可靠的临床错误检测。

链接: https://arxiv.org/abs/2604.10389
作者: Saukun Thika You,Nguyen Anh Khoa Tran,Wesley K. Marizane,Hanshu Rao,Qiunan Zhang,Xiaolei Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to the IEEE International Conference on Healthcare Informatics (ICHI) 2026

点击查看摘要

Abstract:Terminology substitution errors in clinical notes, where one medical term is replaced by a linguistically valid but clinically different term, pose a persistent challenge for automated error detection in healthcare. We introduce BLUEmed, a multi-agent debate framework augmented with hybrid Retrieval-Augmented Generation (RAG) that combines evidence-grounded reasoning with multi-perspective verification for clinical error detection. BLUEmed decomposes each clinical note into focused sub-queries, retrieves source-partitioned evidence through dense, sparse, and online retrieval, and assigns two domain expert agents distinct knowledge bases to produce independent analyses; when the experts disagree, a structured counter-argumentation round and cross-source adjudication resolve the conflict, followed by a cascading safety layer that filters common false-positive patterns. We evaluate BLUEmed on a clinical terminology substitution detection benchmark under both zero-shot and few-shot prompting with multiple backbone models spanning proprietary and open-source families. Experimental results show that BLUEmed achieves the best accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) under few-shot prompting, outperforming both single-agent RAG and debate-only baselines. Further analyses across six backbone models and two prompting strategies confirm that retrieval augmentation and structured debate are complementary, and that the framework benefits most from models with sufficient instruction-following and clinical language understanding.

[NLP-135] A Structured Clustering Approach for Inducing Media Narratives ACL2026

【速读】: 该论文旨在解决现有计算方法在捕捉媒体叙事结构方面的局限性问题,即如何有效识别和建模传播理论中强调的、对意义建构至关重要的复杂叙事模式。当前方法要么因粒度粗略而忽略细微叙事特征,要么依赖特定领域的分类体系从而难以扩展到大规模语料库。解决方案的关键在于提出一种联合建模事件与角色的结构化聚类框架,通过自动学习生成可解释的叙事模式(narrative schemas),不仅与既有的框架理论(framing theory)保持一致,还能实现无监督地扩展至大规模文本数据,显著提升叙事分析的准确性与可扩展性。

链接: https://arxiv.org/abs/2604.10368
作者: Rohan Das,Advait Deshmukh,Alexandria Leto,Zohar Naaman,I-Ta Lee,Maria Leonor Pacheco
机构: University of Colorado Boulder (科罗拉多大学博尔德分校); Independent Researcher (独立研究员)
类目: Computation and Language (cs.CL)
备注: Accepted to the Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

点击查看摘要

Abstract:Media narratives wield tremendous power in shaping public opinion, yet computational approaches struggle to capture the nuanced storytelling structures that communication theory emphasizes as central to how meaning is constructed. Existing approaches either miss subtle narrative patterns through coarse-grained analysis or require domain-specific taxonomies that limit scalability. To bridge this gap, we present a framework for inducing rich narrative schemas by jointly modeling events and characters via structured clustering. Our approach produces explainable narrative schemas that align with established framing theory while scaling to large corpora without exhaustive manual annotation.

[NLP-136] Adaptive Multi-Expert Reasoning via Difficulty-Aware Routing and Uncertainty-Guided Aggregation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在数学推理任务中表现不稳定的问题,即模型在不同难度的题目上性能波动较大。其核心解决方案是提出自适应多专家推理(Adaptive Multi-Expert Reasoning, AMR)框架,关键在于通过动态调整策略来应对问题复杂度:首先利用轻量级路由系统预测问题难度与不确定性,指导可重构采样机制控制生成广度;随后由三个专业化专家生成候选答案,并在多轮修正与定稿阶段进行优化;最终借助神经验证器评估正确性,并采用基于聚类的聚合技术结合共识与质量选出最优答案。此方法仅使用原始训练数据即可在GSM8K数据集上达到75.28%准确率,显著优于多数依赖合成数据训练的同类7B模型,验证了基于难度感知路由和不确定性驱动聚合策略的有效性与高效性。

链接: https://arxiv.org/abs/2604.10335
作者: Mohamed Ehab,Ali Hamdi
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) demonstrate strong performance in math reasoning benchmarks, but their performance varies inconsistently across problems with varying levels of difficulty. This paper describes Adaptive Multi-Expert Reasoning (AMR), a framework that focuses on problem complexity by reasoning with dynamically adapted strategies. An agile routing system that focuses on problem text predicts problems’ difficulty and uncertainty and guides a reconfigurable sampling mechanism to manage the breadth of generation. Three specialized experts create candidate responses, which are modified during multiple correction and finalization phases. A neural verifier assesses the correctness of responses, while a clustering-based aggregation technique identifies the final candidate answer based on a combination of consensus and answer quality. When evaluated on the GSM8K dataset, AMR achieved 75.28% accuracy while only using the original training data. This result outperformed the majority of comparable 7B models that were trained on synthetic data. This showcases that models using difficulty-based routing and uncertainty-driven aggregation are efficient and effective in improving math reasoning models’ robustness.

[NLP-137] Comparative Analysis of Large Language Models in Healthcare

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在医疗场景中缺乏标准化基准测试的问题,以实现对不同模型在医学任务中的性能进行客观、可比的评估。其解决方案的关键在于构建一个系统性的对比框架,通过使用公开的医学数据集(如MedMCQA、PubMedQA和Asclepius),结合语言学指标与任务特定指标,对包括ChatGPT、LLaMA、Grok、Gemini和ChatDoctor在内的多种模型在患者病历摘要生成和医学问答等核心医疗任务上的表现进行量化分析,从而揭示领域专用模型(如ChatDoctor)在语境可靠性方面的优势与通用模型(如Grok和LLaMA)在结构化问答任务中的准确性优势,为后续临床部署提供依据。

链接: https://arxiv.org/abs/2604.10316
作者: Subin Santhosh,Farwa Abbas,Hussain Ahmad,Claudia Szabo
机构: Adelaide University (阿德莱德大学); South Australian Health and Medical Research Institute (南澳大利亚卫生与医学研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Background: Large Language Models (LLMs) are transforming artificial intelligence applications in healthcare due to their ability to understand, generate, and summarize complex medical text. They offer valuable support to clinicians, researchers, and patients, yet their deployment in high-stakes clinical environments raises critical concerns regarding accuracy, reliability, and patient safety. Despite substantial attention in recent years, standardized benchmarking of LLMs for medical applications has been limited. Objective: This study addresses the need for a standardized comparative evaluation of LLMs in medical settings. Method: We evaluate multiple models, including ChatGPT, LLaMA, Grok, Gemini, and ChatDoctor, on core medical tasks such as patient note summarization and medical question answering, using the open-access datasets, MedMCQA, PubMedQA, and Asclepius, and assess performance through a combination of linguistic and task-specific metrics. Results: The results indicate that domain-specific models, such as ChatDoctor, excel in contextual reliability, producing medically accurate and semantically aligned text, whereas general-purpose models like Grok and LLaMA perform better in structured question-answering tasks, demonstrating higher quantitative accuracy. This highlights the complementary strengths of domain-specific and general-purpose LLMs depending on the medical task. Conclusion: Our findings suggest that LLMs can meaningfully support medical professionals and enhance clinical decision-making; however, their safe and effective deployment requires adherence to ethical standards, contextual accuracy, and human oversight in relevant cases. These results underscore the importance of task-specific evaluation and cautious integration of LLMs into healthcare workflows.

[NLP-138] Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking ACL2026

【速读】: 该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在生成过程中因安全对齐机制(safety alignment)导致的对抗攻击效率低下问题。现有方法通过优化图像扰动以最大化有害输出概率,但由于对抗目标与模型的安全检索机制之间存在梯度冲突,导致收敛缓慢。解决方案的关键在于提出一种注意力引导的视觉越狱(Attention-Guided Visual Jailbreaking)策略,其核心是通过两个辅助目标直接操控注意力模式:一是抑制模型对与安全对齐相关的前缀标记(prefix tokens)的关注,二是将生成过程锚定在对抗性图像特征上。这种“推-拉”式设计显著降低了梯度冲突(减少45%),在Qwen-VL模型上实现94.4%的攻击成功率(对比基线68.8%),且迭代次数减少40%,同时在更严格的扰动预算下(ε=8/255)仍保持59.0%的成功率,优于标准方法的45.7%。机制分析进一步揭示了一种称为“安全盲视”(safety blindness)的新失败模式:成功攻击通过削弱系统提示的注意力(下降80%)使模型无法检索安全规则,而非强行覆盖它们。

链接: https://arxiv.org/abs/2604.10299
作者: Jingru Li,Wei Ren,Tianqing Zhu
机构: China University of Geosciences, Wuhan; City University of Macau
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted to ACL 2026. Code: this https URL

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) rely on attention-based retrieval of safety instructions to maintain alignment during generation. Existing attacks typically optimize image perturbations to maximize harmful output likelihood, but suffer from slow convergence due to gradient conflict between adversarial objectives and the model’s safety-retrieval mechanism. We propose Attention-Guided Visual Jailbreaking, which circumvents rather than overpowers safety alignment by directly manipulating attention patterns. Our method introduces two simple auxiliary objectives: (1) suppressing attention to alignment-relevant prefix tokens and (2) anchoring generation on adversarial image features. This simple yet effective push-pull formulation reduces gradient conflict by 45% and achieves 94.4% attack success rate on Qwen-VL (vs. 68.8% baseline) with 40% fewer iterations. At tighter perturbation budgets ( \epsilon=8/255 ), we maintain 59.0% ASR compared to 45.7% for standard methods. Mechanistic analysis reveals a failure mode we term safety blindness: successful attacks suppress system-prompt attention by 80%, causing models to generate harmful content not by overriding safety rules, but by failing to retrieve them.

[NLP-139] he Amazing Agent Race: Strong Tool Users Weak Navigators

【速读】: 该论文旨在解决当前大语言模型(Large Language Model, LLM)智能体在工具使用能力评估中普遍存在的局限性——现有基准测试多为线性链式结构(simple chains),难以揭示智能体在复杂任务中的导航与多步骤协作能力。为此,作者提出《神奇智能体竞赛》(The Amazing Agent Race, AAR),其核心创新在于引入有向无环图(Directed Acyclic Graph, DAG)结构的“腿”(legs)作为任务单元,包含分叉-合并的工具调用链,从而模拟真实世界中非线性的、需路径规划的任务场景。关键解决方案包括:1)基于维基百科种子生成1400个DAG实例(含顺序型和组合型两类),并通过实时API验证保证任务可执行性;2)设计三个互补指标(终点准确率、中途站点访问率、障碍完成率)分别量化导航、工具调用与算术错误;3)揭示出当前主流智能体框架的主要瓶颈在于页面导航而非工具调用本身,这一盲点在线性基准中无法察觉。

链接: https://arxiv.org/abs/2604.10261
作者: Zae Myung Kim,Dongseok Lee,Jaehyung Kim,Vipul Raheja,Dongyeop Kang
机构: University of Minnesota Twin Cities; Yonsei University; Google Deepmind
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or “legs”) with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: this https URL

[NLP-140] CodeComp: Structural KV Cache Compression for Agent ic Coding

【速读】: 该论文旨在解决生成式 AI(Generative AI)在执行代码代理任务(如缺陷定位和补丁生成)时,因长代码库的上下文长度受限而导致的键值(Key-Value, KV)缓存成为推理瓶颈的问题。现有压缩方法仅依赖注意力机制评估 token 重要性,系统性地丢弃对代码结构理解至关重要的 token(如调用点、分支条件和赋值语句),从而损害模型性能。其解决方案的关键在于提出 CodeComp——一种无需训练的 KV 缓存压缩框架,通过引入基于静态程序分析的先验知识(即由 Joern 提取的代码属性图,Code Property Graph priors),在不修改模型的前提下实现更精准的 token 重要性判断,显著提升压缩效率与准确性,在相同内存预算下优于纯注意力驱动的压缩基线,并在 bug 定位和代码生成任务中恢复接近完整上下文的性能表现。

链接: https://arxiv.org/abs/2604.10235
作者: Qiujiang Chen,Jing Xiong,Chenyang Zhao,Sidi Yang,Ngai Wong
机构: The University of Hong Kong (香港大学); LMSYS Org
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Agentic code tasks such as fault localization and patch generation require processing long codebases under tight memory constraints, where the Key-Value (KV) cache becomes the primary inference bottleneck. Existing compression methods rely exclusively on attention signals to estimate token importance, systematically discarding structurally critical tokens such as call sites, branch conditions, and assignments that are essential for code understanding. We present CodeComp, a training-free KV cache compression framework that incorporates static program analysis into LLM inference via Code Property Graph priors extracted by Joern. Across bug localization and code generation benchmarks, CodeComp consistently outperforms attention-only compression baselines under equal memory budgets, recovering the majority of full-context accuracy under aggressive KV cache compression, while matching the patch generation quality of uncompressed full-context inference and integrating seamlessly into SGLang-based agentic coding pipelines without model modification.

[NLP-141] Relational Probing: LM-to-Graph Adaptation for Financial Prediction ICLR2026

【速读】: 该论文旨在解决语言模型在金融文本中识别实体关系时存在的两个关键问题:一是传统基于提示(prompting)的管道机制因自回归解码带来高昂计算成本,二是图结构构建与下游优化任务脱钩,导致信息传递效率低下。其解决方案的核心在于提出关系探针(Relational Probing),即用一个专门的关系头(relation head)替代标准语言模型的输出头,直接从语言模型隐藏状态中诱导出结构化的关联图,并与下游股票趋势预测任务模型联合训练。该方法不仅学习语义表示,还严格保持所生成关系图的结构完整性,使语言模型输出可被重塑为适配下游任务的特定格式,从而实现高效且结构可控的多模态信息融合。

链接: https://arxiv.org/abs/2604.10212
作者: Yingjie Niu,Changhong Jin,Rian Dolphin,Ruihai Dong
机构: University College Dublin (都柏林大学); Massive (Massive)
类目: Computation and Language (cs.CL)
备注: Accpeted by The 2nd Workskop on Advances in Financial AI Workshop: Towards Agentic and Responsible Systems at ICLR 2026

点击查看摘要

Abstract:Language models can be used to identify relationships between financial entities in text. However, while structured output mechanisms exist, prompting-based pipelines still incur autoregressive decoding costs and decouple graph construction from downstream optimization. We propose \emphRelational Probing, which replaces the standard language-model head with a relation head that induces a relational graph directly from language-model hidden states and is trained jointly with the downstream task model for stock-trend prediction. This approach both learns semantic representations and preserves the strict structure of the induced relational graph. It enables language-model outputs to go beyond text, allowing them to be reshaped into task-specific formats for downstream models. To enhance reproducibility, we provide an operational definition of small language models (SLMs): models that can be fine-tuned end-to-end on a single 24GB GPU under specified batch-size and sequence-length settings. Experiments use Qwen3 backbones (0.6B/1.7B/4B) as upstream SLMs and compare against a co-occurrence baseline. Relational Probing yields consistent performance improvements at competitive inference cost.

[NLP-142] FAITH: Factuality Alignment through Integrating Trustworthiness and Honestness ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在具备相关知识的情况下仍可能生成事实性错误内容的问题,这一现象严重削弱了模型的可靠性。现有方法通过在问答(QA)提示中引入不确定性数值评分来缓解该问题,但此类分数缺乏语义丰富性,难以使模型准确理解其内部可信度(trustworthiness)与诚实性(honestness),导致事实对齐(factuality alignment)不足。解决方案的关键在于提出 FAITH(Factuality Alignment through Integrating Trustworthiness and Honestness)——一种后训练框架,其核心创新是将自然语言形式的不确定性信号与外部知识相结合:首先基于LLM输出计算置信度分数和语义熵,并映射至描述模型内部知识掌握状态(可信度)与回答行为特征(诚实性)的知识状态象限;随后设计融合正确性与不确定性信号的奖励函数,并利用近端策略优化(Proximal Policy Optimization, PPO)进行微调;此外还引入检索增强模块以提升内部知识与外部信息的一致性,从而显著提升模型的事实准确性与真实性。

链接: https://arxiv.org/abs/2604.10189
作者: Xiaoning Dong,Chengyan Wu,Yajie Wen,Yu Chen,Yun Xue,Jing Zhang,Wei Xu,Bolei Ma
机构: Tsinghua University (清华大学); Shanghai Qi Zhi Institute; South China Normal University (华南师范大学); Guangzhou Richstone Data Technologies Co., Ltd. (广州瑞石数据科技有限公司); LMU Munich (慕尼黑大学)
类目: Computation and Language (cs.CL)
备注: ACL 2026 Findings

点击查看摘要

Abstract:Large Language Models (LLMs) can generate factually inaccurate content even if they have corresponding knowledge, which critically undermines their reliability. Existing approaches attempt to mitigate this by incorporating uncertainty in QA prompt during training, but these numerical scores lack the semantic richness for LLM to properly understand its internal states of trustworthiness and honestness, leading to insufficient factuality alignment. We introduce FAITH (Factuality Alignment through Integrating Trustworthiness and Honestness), a post-training framework for factuality alignment that integrates natural-language uncertainty signals with external knowledge. Specifically, we augment training datasets by computing confidence scores and semantic entropy from LLM outputs and mapping them into a knowledge state quadrant that describes the model’s internal knowledge possession (trustworthiness) and answering behaviors (honestness) in natural language. Based on this enhanced data, we design a reward function that considers both correctness and uncertainty signals, and fine-tune the LLM using the Proximal Policy Optimization (PPO) algorithm. To further mitigate weakly grounded responses, we design a retrieval-augmented module that retrieves relevant external passages, improving the consistency between internal and external knowledge representations. Extensive experiments on four knowledge-intensive benchmarks demonstrate that FAITH enhances the factual accuracy and truthfulness of LLMs.

[NLP-143] Nationality encoding in language model hidden states: Probing culturally differentiated representations in persona-conditioned academic text

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在生成学术文本时是否编码了与国籍相关的文化差异化表征,特别是在英语学术写作(English for Academic Purposes, EAP)场景下,不同国家学术身份(如英美与中国)是否会影响模型输出的结构、词汇和语用特征。解决方案的关键在于采用隐藏状态探测(probing)方法,通过训练逻辑回归分类器分析Gemma-3-4b-it模型在35层隐藏状态中的激活模式,识别出与英国和中国学术人格(persona)显著相关的token位置,并结合Stanza自然语言处理工具对这些高信号token进行结构、词汇和立场特征标注,从而揭示模型内部是否存在非单调分布的国籍编码机制及其具体表现形式。

链接: https://arxiv.org/abs/2604.10151
作者: Paul Jackson(1),Ruizhe Li(2),Elspeth Edelstein(3) ((1) Language Centre, School of Language, Literature, Music and Visual Culture, University of Aberdeen, United Kingdom, (2) School of Natural and Computing Sciences, University of Aberdeen, United Kingdom, (3) School of Language, Literature, Music and Visual Culture, University of Aberdeen, United Kingdom)
机构: 未知
类目: Computation and Language (cs.CL)
备注: 42 pages, 6 tables

点击查看摘要

Abstract:Large language models are increasingly used as writing tools and pedagogical resources in English for Academic Purposes, but it remains unclear whether they encode culturally differentiated representations when generating academic text. This study tests whether Gemma-3-4b-it encodes nationality-discriminative information in hidden states when generating research article introductions conditioned by British and Chinese academic personas. A corpus of 270 texts was generated from 45 prompt templates crossed with six persona conditions in a 2 x 3 design. Logistic regression probes were trained on hidden-state activations across all 35 layers, with shuffled-label baselines, a surface-text skyline classifier, cross-family tests, and sentence-level baselines used as controls. Probe-selected token positions were annotated for structural, lexical, and stance features using the Stanza NLP pipeline. The nationality probe reached 0.968 cross-validated accuracy at Layer 18, with perfect held-out classification. Nationality encoding followed a non-monotonic trajectory across layers, with structural effects strongest in the middle to upper network and lexical-domain effects peaking earlier. At high-signal token positions, British-associated patterns showed more postmodification, hedging, boosting, passive voice, and evaluative or process-oriented vocabulary, while Chinese-associated patterns showed more premodification, nominal predicates, and sociocultural or internationalisation vocabulary. However, sentence-level analysis found no significant nationality differences in the full generated surface text. The findings extend probing methodology to a sociolinguistic attribute and have practical implications for EAP and language pedagogy.

[NLP-144] Learning from Emptiness: De-biasing Listwise Rerankers with Content-Agnostic Probability Calibration ACL2026

【速读】: 该论文旨在解决生成式列表重排序(Generative Listwise Reranking)中存在的固有位置偏差(position bias)问题,即模型对输入顺序的结构敏感性与内容相关性无关,导致排名结果受序列位置影响而非真实相关性。现有方法面临两难:推理时聚合策略虽能缓解偏差但引入高延迟,而训练阶段的方法往往难以彻底消除嵌入的先验偏好,尤其在轻量级模型中表现不佳。解决方案的关键在于提出一种无需训练的框架 CapCal(Content-Agnostic Probability Calibration),其通过使用内容无关的占位符估计偏置分布,并采用熵自适应对比机制校正输出logits,从而机械地将位置偏差从排序决策中解耦,实现高效且无偏的单次前向传播性能提升。

链接: https://arxiv.org/abs/2604.10150
作者: Hang Lv,Hongchao Gu,Ruiqing Yang,Liangyue Li,Zulong Chen,Defu Lian,Hao Wang,Enhong Chen
机构: University of Science and Technology of China (中国科学技术大学); Alibaba Group (阿里巴巴集团)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ACL2026

点击查看摘要

Abstract:Generative listwise reranking leverages global context for superior retrieval but is plagued by intrinsic position bias, where models exhibit structural sensitivity to input order independent of relevance. Existing mitigations present a dilemma: inference-time aggregation incurs prohibitive latency, while training-based methods often fail to eradicate ingrained priors, particularly in compact models. To resolve this dilemma, we propose CapCal (Content-Agnostic Probability Calibration), a training-free framework that mechanically decouples positional bias from ranking decisions. By estimating the bias distribution via content-free placeholders, CapCal rectifies output logits through an entropy-adaptive contrastive mechanism. Evaluations across 10 benchmarks confirm that CapCal achieves superior performance among training-free methods while preserving single-pass efficiency. Notably, it unlocks the latent potential of lightweight models (e.g., 0.6B), delivering absolute NDCG gains exceeding 10 points and outperforming both permutation-based aggregation and data-augmentation baselines.

[NLP-145] hink in Sentences: Explicit Sentence Boundaries Enhance Language Models Capabilities ACL2026

【速读】: 该论文旨在解决现有大语言模型(Large Language Models, LLMs)在利用dummy token插入提升能力时,忽视自然语言固有句法结构的问题。现有方法仅关注dummy token本身,未能利用句子级别的语义组织特性,而LLMs的 linguistic capabilities 正是通过人类生成文本中的句子级结构习得的。解决方案的关键在于:在输入上下文中于句子边界插入分隔符(delimiters),不仅将dummy token融入上下文,还引导模型在推理过程中采用逐句处理的行为模式。该方法通过两种实现方式——上下文学习(in-context learning)和监督微调(supervised fine-tuning)验证了有效性,在GSM8k和DROP等任务上分别取得最高达7.7%和12.5%的性能提升,并且微调后的模型内部表征也体现出对句子边界的敏感性,从而为受认知启发的大语言模型增强提供了新方向。

链接: https://arxiv.org/abs/2604.10135
作者: Zhichen Liu,Yongyuan Li,Yang Xu
机构: Southern University of Science and Technology (南方科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026 main conference

点击查看摘要

Abstract:Researchers have explored different ways to improve large language models (LLMs)’ capabilities via dummy token insertion in contexts. However, existing works focus solely on the dummy tokens themselves, but fail to leverage the inherent sentence-level structure of natural language. This is a critical oversight, as LLMs acquire linguistic capabilities through exposure to human-generated texts, which are inherently structured at the sentence level. Motivated by this gap, we propose an approach that inserts delimiters at sentence boundaries in LLM inputs, which not only integrates dummy tokens into the context, but also facilitates LLMs with sentence-by-sentence processing behavior during reasoning. Two concrete methods: (1). In-context learning and (2). Supervised fine-tuning are experimented using 7B models to 600B Deepseek-V3. Our results demonstrate consistent improvements across various tasks, with notable gains of up to 7.7% on GSM8k and 12.5% on DROP. Furthermore, the fine-tuned LLMs can incorporate sentence awareness evidenced by their internal representations. Our work establishes a simple yet effective technique for enhancing LLM’s capabilities, offering promising directions for cognitive-inspired LLM enhancement paradigm.

[NLP-146] raining-Free Cross-Lingual Dysarthria Severity Assessment via Phonological Subspace Analysis in Self-Supervised Speech Representations ALT

【速读】: 该论文旨在解决构音障碍(dysarthria)严重程度评估依赖训练有素的临床人员或需大量标注病理语音数据的监督模型所导致的跨语言和临床场景可扩展性受限问题。其解决方案的关键在于提出一种无需训练的方法:利用冻结的HuBERT(Hidden Unit BERT)表示中语音特征子空间的退化程度来量化构音障碍严重度,通过预训练的蒙特利尔强制对齐器(Montreal Forced Aligner, MFA)从健康对照者的语音中估计语音学对比方向(如鼻音性、浊音性、嘶音性、响亮度、发音方式及四个元音特征),进而计算各方向上的d-prime分数作为指标。该方法不依赖任何病理语音训练数据,适用于已有MFA声学模型的29种语言,并在涵盖5种语言、3类病因(帕金森病、脑瘫、肌萎缩侧索硬化症)的890名受试者数据上验证了其有效性与鲁棒性。

链接: https://arxiv.org/abs/2604.10123
作者: Bernard Muller,Antonio Armando Ortiz Barrañón,LaVonne Roberts
机构: The Scott-Morgan Foundation (Scott-Morgan基金会); Tecnológico de Monterrey (蒙特雷理工学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Submitted to PLOS digital health

点击查看摘要

Abstract:Dysarthric speech severity assessment typically requires trained clinicians or supervised models built from labelled pathological speech, limiting scalability across languages and clinical settings. We present a training-free method that quantifies dysarthria severity by measuring degradation in phonological feature subspaces within frozen HuBERT representations. No supervised severity model is trained; feature directions are estimated from healthy control speech using a pretrained forced aligner. For each speaker, we extract phone-level embeddings via Montreal Forced Aligner, compute d-prime scores along phonological contrast directions (nasality, voicing, stridency, sonorance, manner, and four vowel features) derived exclusively from healthy controls, and construct a 12-dimensional phonological this http URL 890 speakers across 10 corpora, 5 languages (English, Spanish, Dutch, Mandarin, French), and 3 primary aetiologies (Parkinson’s disease, cerebral palsy, ALS), we find that all five consonant d-prime features correlate significantly with clinical severity (random-effects meta-analysis rho = -0.50 to -0.56, p 2e-4; pooled Spearman rho = -0.47 to -0.55 with bootstrap 95% CIs not crossing zero). The effect replicates within individual corpora, survives FDR correction, and remains robust to leave-one-corpus-out removal and alignment quality controls. Nasality d-prime decreases monotonically from control to severe in 6 of 7 severity-graded corpora. Mann-Whitney U tests confirm that all 12 features distinguish controls from severely dysarthric speakers (p 0.001).The method requires no dysarthric training data and applies to any language with an existing MFA acoustic model (currently 29 languages). We release the full pipeline and phone feature configurations for six languages.

[NLP-147] CircuitSynth: Reliable Synthetic Data Generation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成结构化合成数据时普遍存在的幻觉、逻辑不一致性和模式崩溃问题,这些问题导致生成结果在语义有效性与覆盖范围上难以保障。现有方法如提示工程或检索增强生成缺乏在语言表达力与形式化验证之间取得平衡的能力。解决方案的关键在于提出一种名为CircuitSynth的神经符号框架,其核心创新是将语义推理与表层实现解耦:通过将教师模型(Teacher LLM)的推理能力蒸馏到概率命题决策图(Probabilistic Sentential Decision Diagram, PSDD)中,构建一个结构上强制执行硬逻辑约束的可计算语义先验;同时引入凸优化机制以严格满足软分布目标,从而在复杂逻辑谜题等场景下实现100%的Schema Validity,显著优于无约束基线(仅12.4%)和当前最先进方法在稀有组合覆盖上的表现。

链接: https://arxiv.org/abs/2604.10114
作者: Zehua Cheng,Wei Dai,Jiahao Sun,Thomas Lukasiewicz
机构: University of Oxford (牛津大学); FLock.io; TU Wien (维也纳工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 Pages

点击查看摘要

Abstract:The generation of high-fidelity synthetic data is a cornerstone of modern machine learning, yet Large Language Models (LLMs) frequently suffer from hallucinations, logical inconsistencies, and mode collapse when tasked with structured generation. Existing approaches, such as prompting or retrieval-augmented generation, lack the mechanisms to balance linguistic expressivity with formal guarantees regarding validity and coverage. To address this, we propose CircuitSynth, a novel neuro-symbolic framework that decouples semantic reasoning from surface realization. By distilling the reasoning capabilities of a Teacher LLM into a Probabilistic Sentential Decision Diagram (PSDD), CircuitSynth creates a tractable semantic prior that structurally enforces hard logical constraints. Furthermore, we introduce a convex optimization mechanism to rigorously satisfy soft distributional goals. Empirical evaluations across diverse benchmarks demonstrate that CircuitSynth achieves 100% Schema Validity even in complex logic puzzles where unconstrained baselines fail (12.4%) while significantly outperforming state-of-the-art methods in rare-combination coverage.

[NLP-148] Who Wrote This Line? Evaluating the Detection of LLM -Generated Classical Chinese Poetry ACL2026

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在古典汉诗创作中引发的创意真实性与伦理问题,尤其针对当前文本检测工具难以有效识别由大型语言模型(LLMs)生成的古典汉诗这一挑战。其解决方案的关键在于构建了一个名为ChangAn的基准数据集,包含30,664首诗歌(其中10,276首为人类创作,20,388首由四种主流LLM生成),并基于此对12种AI文本检测器进行了系统评估,揭示了现有中文文本检测方法在古典汉诗场景下的显著局限性,从而验证了ChangAn作为专门化检测基准的有效性和必要性。

链接: https://arxiv.org/abs/2604.10101
作者: Jiang Li,Tian Lan,Shanshan Wang,Dongxing Zhang,Dianqing Lin,Guanglai Gao,Derek F. Wong,Xiangdong Su
机构: Inner Mongolia University (内蒙古大学); University of Macau (澳门大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 Main Conference

点击查看摘要

Abstract:The rapid development of large language models (LLMs) has extended text generation tasks into the literary domain. However, AI-generated literary creations has raised increasingly prominent issues of creative authenticity and ethics in literary world, making the detection of LLM-generated literary texts essential and urgent. While previous works have made significant progress in detecting AI-generated text, it has yet to address classical Chinese poetry. Due to the unique linguistic features of classical Chinese poetry, such as strict metrical regularity, a shared system of poetic imagery, and flexible syntax, distinguishing whether a poem is authored by AI presents a substantial challenge. To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry that containing total 30,664 poems, 10,276 are human-written poems and 20,388 poems are generated by four popular LLMs. Based on ChangAn, we conducted a systematic evaluation of 12 AI detectors, investigating their performance variations across different text granularities and generation strategies. Our findings highlight the limitations of current Chinese text detectors, which fail to serve as reliable tools for detecting LLM-generated classical Chinese poetry. These results validate the effectiveness and necessity of our proposed ChangAn benchmark. Our dataset and code are available at this https URL.

[NLP-149] SEPTQ: A Simple and Effective Post-Training Quantization Paradigm for Large Language Models KDD2025

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在资源受限设备上部署时面临的高计算与存储成本问题,以及现有后训练量化(Post-Training Quantization, PTQ)方法在低比特设置下性能显著下降且流程复杂的问题。其解决方案的关键在于提出一种名为SEPTQ的简单而有效的PTQ范式:首先通过静态全局方式计算权重矩阵中每个元素的重要性得分并确定量化位置,随后利用掩码矩阵按列逐次更新重要位置的权重,从而在仅两步操作内完成高质量的量化过程,兼顾了量化效果与效率,在低比特量化场景下显著优于现有强基线方法。

链接: https://arxiv.org/abs/2604.10091
作者: Han Liu,Haotian Gao,Xiaotong Zhang,Changya Li,Feng Zhang,Wei Wang,Fenglong Ma,Hong Yu
机构: Dalian University of Technology (大连理工大学); Peking University (北京大学); Shenzhen MSU-BIT University (深圳莫斯科大学); The Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computation and Language (cs.CL)
备注: Accepted to KDD 2025. 12 pages, 10 figures

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable performance in various domains, but they are constrained by massive computational and storage costs. Quantization, an effective technique for compressing models to fit resource-limited devices while preserving generative quality, encompasses two primary methods: quantization aware training (QAT) and post-training quantization (PTQ). QAT involves additional retraining or fine-tuning, thus inevitably resulting in high training cost and making it unsuitable for LLMs. Consequently, PTQ has become the research hotspot in recent quantization methods. However, existing PTQ methods usually rely on various complex computation procedures and suffer from considerable performance degradation under low-bit quantization settings. To alleviate the above issues, we propose a simple and effective post-training quantization paradigm for LLMs, named SEPTQ. Specifically, SEPTQ first calculates the importance score for each element in the weight matrix and determines the quantization locations in a static global manner. Then it utilizes the mask matrix which represents the important locations to quantize and update the associated weights column-by-column until the appropriate quantized weight matrix is obtained. Compared with previous methods, SEPTQ simplifies the post-training quantization procedure into only two steps, and considers the effectiveness and efficiency simultaneously. Experimental results on various datasets across a suite of models ranging from millions to billions in different quantization bit-levels demonstrate that SEPTQ significantly outperforms other strong baselines, especially in low-bit quantization scenarios.

[NLP-150] Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在监督微调(Supervised Fine-Tuning, SFT)过程中普遍存在的“不完全学习现象”(Incomplete Learning Phenomenon, ILP),即模型在训练收敛后仍无法正确复现其自身训练数据中的部分样本。这一问题可能导致模型性能评估失真,且传统聚合指标难以揭示潜在的学习失败。解决方案的关键在于提出一种以诊断为导向的框架,通过分析训练和推理阶段的可观测信号,将未学习样本归因于五类核心成因:预训练知识缺失、SFT与预训练知识冲突、SFT数据内部不一致、顺序微调中的左侧遗忘以及稀有或复杂模式优化不足,并据此设计因果干预策略进行针对性缓解。该方法强调对ILP的细粒度诊断,从而提升SFT的可靠性和可解释性。

链接: https://arxiv.org/abs/2604.10079
作者: Chao Xue,Yao Wang,Mengqiao Liu,Di Liang,Xingsheng Han,Peiyang Liu,Xianjie Wu,Chenyao Lu,Lei Jiang,Yu Lu,Haibo Shi,Shuang Liang,Minlong Peng,Flora D. Salim
机构: University of New South Wales, Australia; Peking University, China; Tencent, China; Beijing Info. Sci. Tech. Univ., China; Baidu, China; UESTC, China
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2026 Oral

点击查看摘要

Abstract:Supervised Fine-Tuning (SFT) is the standard approach for adapting large language models (LLMs) to downstream tasks. However, we observe a persistent failure mode: even after convergence, models often fail to correctly reproduce a subset of their own supervised training data. We refer to this behavior as the Incomplete Learning Phenomenon(ILP). This paper presents the first systematic study of ILP in LLM fine-tuning. We formalize ILP as post-training failure to internalize supervised instances and demonstrate its prevalence across multiple model families, domains, and datasets. Through controlled analyses, we identify five recurrent sources of incomplete learning: (1) missing prerequisite knowledge in the pre-trained model, (2) conflicts between SFT supervision and pre-training knowledge, (3) internal inconsistencies within SFT data, (4) left-side forgetting during sequential fine-tuning, and (5) insufficient optimization for rare or complex patterns. We introduce a diagnostic-first framework that maps unlearned samples to these causes using observable training and inference signals, and study several targeted mitigation strategies as causal interventions. Experiments on Qwen, LLaMA, and OLMo2 show that incomplete learning is widespread and heterogeneous, and that improvements in aggregate metrics can mask persistent unlearned subsets. The findings highlight the need for fine-grained diagnosis of what supervised fine-tuning fails to learn, and why.

[NLP-151] Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty ACL2026

【速读】: 该论文旨在解决生成式奖励模型(Generative Reward Model, GRM)在提升大语言模型(Large Language Models, LLMs)推理能力时存在的两个关键问题:一是链式思维(Chain-of-Thought, CoT)提示被无差别地应用于所有输入,导致对简单任务产生不必要的计算开销;二是现有方法主要依赖基于投票的机制评估CoT输出,难以实现对推理路径质量的细粒度判断。解决方案的关键在于提出E-GRM框架,其核心创新是利用模型内部不确定性(model-internal uncertainty)作为触发CoT推理的自适应信号,通过并行生成过程的收敛行为估计不确定性,从而仅在必要时激活CoT推理,无需人工特征或任务特定信号;同时引入一个轻量级判别评分器,采用混合回归-排序目标训练,以提供更精确的推理路径评价,显著降低推理成本并提升准确性。

链接: https://arxiv.org/abs/2604.10072
作者: Chao Xue,Yao Wang,Mengqiao Liu,Di Liang,Xingsheng Han,Peiyang Liu,Xianjie Wu,Chenyao Lu,Lei Jiang,Yu Lu,Haibo Shi,Shuang Liang,Minlong Peng,Flora D. Salim
机构: University of New South Wales, Australia; Peking University, China; Tencent, China; Beijing Info. Sci. Tech. Univ., China; Baidu, China; UESTC, China
类目: Computation and Language (cs.CL)
备注: accepted by ACL 2026

点击查看摘要

Abstract:Recent advancements in the Generative Reward Model (GRM) have demonstrated its potential to enhance the reasoning abilities of LLMs through Chain-of-Thought (CoT) prompting. Despite these gains, existing implementations of GRM suffer from two critical limitations. First, CoT prompting is applied indiscriminately to all inputs regardless of their inherent complexity. This introduces unnecessary computational costs for tasks amenable to fast, direct inference. Second, existing approaches primarily rely on voting-based mechanisms to evaluate CoT outputs, which often lack granularity and precision in assessing reasoning quality. In this paper, we propose E-GRM, an efficient generative reward modeling framework grounded in model-internal uncertainty. E-GRM leverages the convergence behavior of parallel model generations to estimate uncertainty and selectively trigger CoT reasoning only when needed, without relying on handcrafted features or task-dependent signals. To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression–ranking objective to provide fine-grained evaluation of reasoning paths. Experiments on multiple reasoning benchmarks show that E-GRM substantially reduces inference cost while consistently improving answer accuracy, demonstrating that model-internal uncertainty is an effective and general signal for efficient reasoning-aware reward modeling.

[NLP-152] ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

【速读】: 该论文旨在解决端到端全双工语音语言模型(Speech Language Models, SLMs)在自然交互中因标准原始token强化学习(Reinforcement Learning, RL)优化时序动态而导致的语义质量下降问题,具体表现为生成崩溃和重复现象。其解决方案的关键在于提出ASPIRin框架,通过显式解耦“何时说话”与“说什么”,利用动作空间投影(Action Space Projection)将文本词汇映射为粗粒度二值状态(活跃语音 vs. 静默),并结合基于规则的奖励机制与分组相对策略优化(Group Relative Policy Optimization, GRPO),实现对用户打断和响应延迟的平衡,从而在保持语义连贯性的同时显著减少重复n-gram比例(超过50%),有效抑制退化性重复。

链接: https://arxiv.org/abs/2604.10065
作者: Chi-Yuan Hsiao,Ke-Han Lu,Yu-Kuan Fu,Guan-Ting Lin,Hsiao-Tsung Hung,Hung-yi Lee
机构: National Taiwan University (国立台湾大学); ASUS Open Cloud Infrastructure Software Center (华硕开放云基础设施软件中心); NVIDIA AI Technology Center (英伟达人工智能技术中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:End-to-end full-duplex Speech Language Models (SLMs) require precise turn-taking for natural interaction. However, optimizing temporal dynamics via standard raw-token reinforcement learning (RL) degrades semantic quality, causing severe generative collapse and repetition. We propose ASPIRin, an interactivity-optimized RL framework that explicitly decouples when to speak from what to say. Using Action Space Projection, ASPIRin maps the text vocabulary into a coarse-grained binary state (active speech vs. inactive silence). By applying Group Relative Policy Optimization (GRPO) with rule-based rewards, it balances user interruption and response latency. Empirical evaluations show ASPIRin optimizes interactivity across turn-taking, backchanneling, and pause handling. Crucially, isolating timing from token selection preserves semantic coherence and reduces the portion of duplicate n-grams by over 50% compared to standard GRPO, effectively eliminating degenerative repetition.

[NLP-153] Linguistic Accommodation Between Neurodivergent Communities on Reddit:A Communication Accommodation Theory Analysis of ADHD and Autism Groups

【速读】: 该论文试图解决的问题是:在社交媒体平台上,神经多样性群体(如注意缺陷多动障碍 ADHD 和自闭症谱系障碍 ASD)之间如何通过语言调整来实现跨群体互动,以及这种调整是否受到情境因素(如公共诊断披露)的影响。解决方案的关键在于运用沟通适应理论(Communication Accommodation Theory, CAT),结合语言 inquiry and word count (LIWC) 词典对 Reddit 上两个社区的语言特征进行量化分析,发现当用户跨越社区边界时,其语言特征呈现相反方向的改变(即收敛适应),且这一现象不能完全由话题内容解释(因主题无关变量如 Authentic 和 Clout 也发生显著变化)。此外,研究还通过纵向分析揭示了公共诊断事件对语言风格的影响较弱且方向与跨社区适应相反,暗示情境性观众适应与长期身份建构可能涉及不同的心理机制。

链接: https://arxiv.org/abs/2604.10063
作者: Saad Mankarious,Nour Zein,Iyad Ait Hou,Aya Zirikly
机构: The George Washington University (乔治·华盛顿大学); Texas AM University (德州农工大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Social media research on mental health has focused predominantly on detecting and diagnosing conditions at the individual level. In this work, we shift attention to \emphintergroup behavior, examining how two prominent neurodivergent communities, ADHD and autism, adjust their language when engaging with each other on Reddit. Grounded in Communication Accommodation Theory (CAT), we first establish that each community maintains a distinct linguistic profile as measured by Language Inquiry and Word Count Lexicon (LIWC). We then show that these profiles shift in opposite directions when users cross community boundaries: features that are elevated in one group’s home community decrease when its members post in the other group’s space, and vice versa, consistent with convergent accommodation. The involvement of topic-independent summary variables (Authentic, Clout) in these shifts provides partial evidence against a purely topical explanation. Finally, in an exploratory longitudinal analysis around the moment of public diagnosis disclosure, we find that its effects on linguistic style are small and, in some cases, directionally opposite to cross-community accommodation, providing initial evidence that situational audience adaptation and longer-term identity processes may involve different mechanisms. Our findings contribute to understanding intergroup communication dynamics among neurodivergent populations online and carry implications for community moderation and clinical perspectives on these conditions.

[NLP-154] Computational Implementation of a Model of Category-Theoretic Metaphor Comprehension

【速读】: 该论文旨在解决隐喻理解建模的计算实现问题,特别是如何更准确地模拟人类在隐喻理解过程中对源域与目标域之间关联结构的认知映射。其解决方案的关键在于基于Fuyama等人提出的不确定自然变换理论(Theory of Indeterminate Natural Transformation, TINT),对原有算法进行简化和优化,使其更贴近原始理论框架,并通过数据拟合与仿真验证了改进后的算法在三个核心指标上的性能提升:与实验数据的拟合度、隐喻理解结果的系统性以及理解结果的新颖性(即源域与目标域之间联想结构的一致性)。

链接: https://arxiv.org/abs/2604.10035
作者: Fumitaka Iwaki,Miho Fuyama,Hayato Saigo,Tatsuji Takahashi
机构: Tokyo Denki University (东京电气大学); Ritsumeikan University (立命馆大学); ZEN University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages, 8 figures, CogSci member abstract

点击查看摘要

Abstract:In this study, we developed a computational implementation for a model of metaphor comprehension based on the theory of indeterminate natural transformation (TINT) proposed by Fuyama et al. We simplified the algorithms implementing the model to be closer to the original theory and verified it through data fitting and simulations. The outputs of the algorithms are evaluated with three measures: data-fitting with experimental data, the systematicity of the metaphor comprehension result, and the novelty of the comprehension (i.e. the correspondence of the associative structure of the source and target of the metaphor). The improved algorithm outperformed the existing ones in all the three measures.

[NLP-155] CoSToM:Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在理论心智(Theory of Mind, ToM)任务中表现出的泛化能力不足问题,即模型虽能在标准基准上取得较好表现,但在复杂、特定场景下往往依赖提示工程(prompt scaffolding)来模拟推理,而非具备内在的认知能力。其核心问题是:LLMs是否真正拥有内在认知,并能将这种内部知识稳定地转化为高质量的行为输出。解决方案的关键在于提出 CoSToM(Causal-oriented Steering for ToM alignment)框架,通过因果追踪(causal tracing)识别ToM关键层的内部特征分布,并在此基础上实施轻量级激活引导(activation steering),从而实现从机制解释到主动干预的转变,显著提升模型的人类社会推理能力和对话质量。

链接: https://arxiv.org/abs/2604.10031
作者: Mengfan Li,Xuanhua Shi,Yang Deng
机构: Huazhong University of Science and Technology (华中科技大学); Singapore Management University (新加坡管理大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026 (Main Conference)

点击查看摘要

Abstract:Theory of Mind (ToM), the ability to attribute mental states to others, is a hallmark of social intelligence. While large language models (LLMs) demonstrate promising performance on standard ToM benchmarks, we observe that they often fail to generalize to complex task-specific scenarios, relying heavily on prompt scaffolding to mimic reasoning. The critical misalignment between the internal knowledge and external behavior raises a fundamental question: Do LLMs truly possess intrinsic cognition, and can they externalize this internal knowledge into stable, high-quality behaviors? To answer this, we introduce CoSToM (Causal-oriented Steering for ToM alignment), a framework that transitions from mechanistic interpretation to active intervention. First, we employ causal tracing to map the internal distribution of ToM features, empirically uncovering the internal layers’ characteristics in encoding fundamental ToM semantics. Building on this insight, we implement a lightweight alignment framework via targeted activation steering within these ToM-critical layers. Experiments demonstrate that CoSToM significantly enhances human-like social reasoning capabilities and downstream dialogue quality.

[NLP-156] Weird Generalization is Weirdly Brittle

【速读】: 该论文旨在解决“奇怪泛化”(weird generalization)这一现象带来的安全风险问题,即模型在窄域数据(如不安全代码)上微调后,会意外地在宽域场景中表现出危险行为(如广泛对齐偏差)。研究发现,这种奇怪泛化具有极强的脆弱性:仅在特定模型与特定数据集上出现,且可通过简单的训练时或提示(prompt-based)干预手段消除。其关键解决方案在于提供提示上下文,使异常行为变为预期行为;更进一步,即使不针对具体异常行为设计的通用干预措施也具有显著缓解效果。这表明,奇怪泛化虽具潜在危害,但其威胁可通过易实施的提示工程策略有效控制。

链接: https://arxiv.org/abs/2604.10022
作者: Miriam Wanner,Hannah Collison,William Jurayj,Benjamin Van Durme,Mark Dredze,William Walden
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Weird generalization is a phenomenon in which models fine-tuned on data from a narrow domain (e.g. insecure code) develop surprising traits that manifest even outside that domain (e.g. broad misalignment)-a phenomenon that prior work has highlighted as a critical safety concern. Here, we present an extended replication study of key weird generalization results across an expanded suite of models and datasets. We confirm that surprising (and dangerous) traits can emerge under certain circumstances, but we find that weird generalization is exceptionally brittle: it emerges only for specific models on specific datasets, and it vanishes under simple training-time, prompt-based interventions. We find that the most effective interventions provide prompt context that makes the generalized behavior the expected behavior. However, we show that even very generic interventions that do not anticipate specific generalized traits can still be effective in mitigating weird generalization’s effects. Our findings thus help clarify the nature of the safety threat that weird generalization poses and point toward an easily implemented set of solutions.

[NLP-157] FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

【速读】: 该论文旨在解决当前金融领域大语言模型(Large Language Models, LLMs)工具调用能力评估中存在的局限性问题,即现有基准测试仅关注调用层级指标(call-level metrics),难以全面衡量模型在复杂金融任务中从初始问题到最终答案的全过程推理质量(trajectory-level reasoning quality)。其解决方案的关键在于构建FinTrace基准数据集,包含800条由专家标注的轨迹,覆盖34类真实金融任务,并采用基于评分标准的九维评估体系(涵盖动作正确性、执行效率、过程质量和输出质量四个维度),实现对LLM工具调用行为的细粒度分析。此外,研究进一步提出FinTrace-Training数据集,作为首个面向金融工具调用的轨迹级偏好数据集,通过监督微调与直接偏好优化(DPO)训练显著提升中间推理质量,揭示出当前模型在信息利用和最终答案质量上的瓶颈,为后续改进提供了可量化且具方向性的依据。

链接: https://arxiv.org/abs/2604.10015
作者: Yupeng Cao,Haohang Li,Weijin Liu,Wenbo Cao,Anke Xu,Lingfei Qian,Xueqing Peng,Minxue Tang,Zhiyuan Yao,Jimin Huang,K.P. Subbalakshmi,Zining Zhu,Jordan W. Suchow,Yangyang Yu
机构: Stevens Institute of Technology (史蒂文斯理工学院); Independent Researcher (独立研究者); The FinAI (金融AI); Duke University (杜克大学)
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Recent studies demonstrate that tool-calling capability enables large language models (LLMs) to interact with external environments for long-horizon financial tasks. While existing benchmarks have begun evaluating financial tool calling, they focus on limited scenarios and rely on call-level metrics that fail to capture trajectory-level reasoning quality. To address this gap, we introduce FinTrace, a benchmark comprising 800 expert-annotated trajectories spanning 34 real-world financial task categories across multiple difficulty levels. FinTrace employs a rubric-based evaluation protocol with nine metrics organized along four axes – action correctness, execution efficiency, process quality, and output quality – enabling fine-grained assessment of LLM tool-calling behavior. Our evaluation of 13 LLMs reveals that while frontier models achieve strong tool selection, all models struggle with information utilization and final answer quality, exposing a critical gap between invoking the right tools and reasoning effectively over their outputs. To move beyond diagnosis, we construct FinTrace-Training, the first trajectory-level preference dataset for financial tool-calling, containing 8,196 curated trajectories with tool-augmented contexts and preference pairs. We fine-tune Qwen-3.5-9B using supervised fine-tuning followed by direct preference optimization (DPO) and show that training on FinTrace-Training consistently improves intermediate reasoning metrics, with DPO more effectively suppressing failure modes. However, end-to-end answer quality remains a bottleneck, indicating that trajectory-level improvements do not yet fully propagate to final output quality.

[NLP-158] Demographic and Linguistic Bias Evaluation in Omnimodal Language Models ICPR2026

【速读】: 该论文旨在解决多模态语言模型(Omnimodal Language Models)在不同人口统计学群体(如年龄、性别、肤色、语言和国籍)和模态(文本、图像、音频、视频)之间存在的公平性问题。当前尽管这些模型已被广泛部署,但其跨群体和跨模态的性能差异尚未得到充分研究。解决方案的关键在于对四个代表性多模态模型进行系统性评估,涵盖身份估计、身份验证、活动识别、多语言语音转录和语言识别等任务,并量化各维度上的准确率差异。结果显示,图像与视频理解任务表现较优且偏差较小,而音频理解任务则表现出显著性能下降和严重偏倚,尤其在年龄、性别和语言维度上存在较大准确率差异及预测坍缩现象,凸显了在所有支持模态中全面评估公平性的必要性。

链接: https://arxiv.org/abs/2604.10014
作者: Alaa Elobaid
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at ICPR 2026. Full paper with complete appendix (31 pages total)

点击查看摘要

Abstract:This paper provides a comprehensive evaluation of demographic and linguistic biases in omnimodal language models that process text, images, audio, and video within a single framework. Although these models are being widely deployed, their performance across different demographic groups and modalities is not well studied. Four omnimodal models are evaluated on tasks that include demographic attribute estimation, identity verification, activity recognition, multilingual speech transcription, and language identification. Accuracy differences are measured across age, gender, skin tone, language, and country of origin. The results show that image and video understanding tasks generally exhibit better performance with smaller demographic disparities. In contrast, audio understanding tasks exhibit significantly lower performance and substantial bias, including large accuracy differences across age groups, genders, and languages, and frequent prediction collapse toward narrow categories. These findings highlight the importance of evaluating fairness across all supported modalities as omnimodal language models are increasingly used in real-world applications.

[NLP-159] Human vs. Machine Deception: Distinguishing AI-Generated and Human-Written Fake News Using Ensemble Learning

【速读】: 该论文旨在解决人工智能生成虚假新闻(AI-generated fake news)与传统人工撰写虚假新闻之间的差异识别问题,以及如何可靠地区分这两种欺骗性内容。其关键解决方案在于构建基于文档级别的特征表示体系,涵盖句法结构、词汇多样性、标点模式、可读性指标及情感维度(如恐惧、愤怒、喜悦、悲伤、信任和预期)等多维特征,并采用多种机器学习模型(如逻辑回归、随机森林、支持向量机、极端梯度提升和神经网络)与集成学习框架进行分类对比。研究发现,以可读性为基础的特征是最具信息量的判别因子,且AI生成文本表现出更一致的风格模式,而集成学习方法在个体模型基础上实现了稳定但小幅的性能提升,表明文本的风格与结构特性为区分AI生成虚假信息提供了稳健依据。

链接: https://arxiv.org/abs/2604.09960
作者: Samuel Jaeger,Calvin Ibeneye,Aya Vera-Jimenez,Dhrubajyoti Ghosh
机构: Kennesaw State University (肯尼索州立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid adoption of large language models has introduced a new class of AI-generated fake news that coexists with traditional human-written misinformation, raising important questions about how these two forms of deceptive content differ and how reliably they can be distinguished. This study examines linguistic, structural, and emotional differences between human-written and AI-generated fake news and evaluates machine learning and ensemble-based methods for distinguishing these content types. A document-level feature representation is constructed using sentence structure, lexical diversity, punctuation patterns, readability indices, and emotion-based features capturing affective dimensions such as fear, anger, joy, sadness, trust, and anticipation. Multiple classification models, including logistic regression, random forest, support vector machines, extreme gradient boosting, and a neural network, are applied alongside an ensemble framework that aggregates predictions across models. Model performance is assessed using accuracy and area under the receiver operating characteristic curve. The results show strong and consistent classification performance, with readability-based features emerging as the most informative predictors and AI-generated text exhibiting more uniform stylistic patterns. Ensemble learning provides modest but consistent improvements over individual models. These findings indicate that stylistic and structural properties of text provide a robust basis for distinguishing AI-generated misinformation from human-written fake news.

[NLP-160] Cross-Cultural Value Awareness in Large Vision-Language Models

【速读】: 该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)在文化语境下可能存在的刻板印象问题,特别是这些模型如何因图像中呈现的不同文化背景(如宗教、国籍、社会经济地位)而对个体的道德、伦理和政治价值观做出差异化的判断。其解决方案的关键在于构建一个多维评估框架,利用反事实图像集(即同一人物在不同文化背景下呈现的图像)对五种主流LVLMs进行系统分析,并结合道德基础理论(Moral Foundations Theory)、词汇学分析以及生成价值判断对文化语境的敏感性,从而诊断LVLMs是否能够正确识别和响应文化相关的价值差异。

链接: https://arxiv.org/abs/2604.09945
作者: Phillip Howard,Xin Su,Kathleen C. Fraser
机构: Thoughtworks; University of Ottawa
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid adoption of large vision-language models (LVLMs) in recent years has been accompanied by growing fairness concerns due to their propensity to reinforce harmful societal stereotypes. While significant attention has been paid to such fairness concerns in the context of social biases, relatively little prior work has examined the presence of stereotypes in LVLMs related to cultural contexts such as religion, nationality, and socioeconomic status. In this work, we aim to narrow this gap by investigating how cultural contexts depicted in images influence the judgments LVLMs make about a person’s moral, ethical, and political values. We conduct a multi-dimensional analysis of such value judgments in five popular LVLMs using counterfactual image sets, which depict the same person across different cultural contexts. Our evaluation framework diagnoses LVLM awareness of cultural value differences through the use of Moral Foundations Theory, lexical analyses, and the sensitivity of generated values to depicted cultural contexts.

[NLP-161] From UAV Imagery to Agronomic Reasoning : A Multimodal LLM Benchmark for Plant Phenotyping

【速读】: 该论文旨在解决植物科学领域中基础模型(尤其是视觉语言模型,VLMs)在作物表型分析中表现不足的问题,具体表现为对领域特定知识、细粒度视觉识别及复杂生物与农学推理能力的欠缺。解决方案的关键在于构建了一个名为PlantXpert的证据 grounded 多模态推理基准,该基准针对大豆和棉花表型特征设计,涵盖病害、虫害防治、杂草管理及产量等关键农业场景,提供结构化且可复现的框架用于评估VLMs的农学适应性,并支持基线模型与领域适配模型之间的可控对比。通过该基准对11个前沿VLMs进行评测,验证了任务特异性微调可显著提升准确率(最高达78%),同时揭示了模型规模扩展收益递减、跨作物泛化不均以及定量与生物学驱动推理仍具挑战等关键问题。

链接: https://arxiv.org/abs/2604.09907
作者: Yu Wu,Guangzeng Han,Ibra Niang Niang,Francia Ravelombola,Maiara Oliveira,Jason Davis,Dong Chen,Feng Lin,Xiaolei Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: In review

点击查看摘要

Abstract:To improve crop genetics, high-throughput, effective and comprehensive phenotyping is a critical prerequisite. While such tasks were traditionally performed manually, recent advances in multimodal foundation models, especially in vision-language models (VLMs), have enabled more automated and robust phenotypic analysis. However, plant science remains a particularly challenging domain for foundation models because it requires domain-specific knowledge, fine-grained visual interpretation, and complex biological and agronomic reasoning. To address this gap, we develop PlantXpert, an evidence-grounded multimodal reasoning benchmark for soybean and cotton phenotyping. Our benchmark provides a structured and reproducible framework for agronomic adaptation of VLMs, and enables controlled comparison between base models and their domain-adapted counterparts. We constructed a dataset comprising 385 digital images and more than 3,000 benchmark samples spanning key plant science domains including disease, pest control, weed management, and yield. The benchmark can assess diverse capabilities including visual expertise, quantitative reasoning, and multi-step agronomic reasoning. A total of 11 state-of-the-art VLMs were evaluated. The results indicate that task-specific fine-tuning leads to substantial improvement in accuracy, with models such as Qwen3-VL-4B and Qwen3-VL-30B achieving up to 78%. At the same time, gains from model scaling diminish beyond a certain capacity, generalization across soybean and cotton remains uneven, and quantitative as well as biologically grounded reasoning continue to pose substantial challenges. These findings suggest that PlantXpert can serve as a foundation for assessing evidence-grounded agronomic reasoning and for advancing multimodal model development in plant science.

[NLP-162] Should We be Pedantic About Reasoning Errors in Machine Translation?

【速读】: 该论文旨在解决机器翻译(Machine Translation, MT)中推理错误(reasoning errors)的识别与修正问题,尤其关注这些错误如何影响翻译质量。其核心挑战在于:即使模型生成了看似合理的翻译结果,其背后的推理过程可能包含与源句、假设或推理轨迹不一致的错误。解决方案的关键在于提出一种自动化的推理评估协议,能够量化三类推理错误(源句错位、模型假设错位、推理轨迹错位),并通过一系列从弱到强的干预手段(如模糊化、移除、重推理、事后洞察和最优干预)对推理轨迹进行修正。实验表明,尽管小幅度修正对翻译质量提升有限,但强干预可显著提高错误识别精度,尤其是在乌尔都语等语言中表现优异,然而去除推理错误并未显著改善初始翻译性能,揭示出当前MT系统在推理忠实性(reasoning faithfulness)上的局限性。

链接: https://arxiv.org/abs/2604.09890
作者: Calvin Bao,Marine Carpuat
机构: University of Maryland (马里兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 2 figures, 5 tables

点击查看摘要

Abstract:Across multiple language pairings (English \to \Spanish, French, German, Mandarin, Japanese, Urdu, Cantonese), we find reasoning errors in translation. To quantify how often these reasoning errors occur, we leverage an automated annotation protocol for reasoning evaluation wherein the goal is to detect if a reasoning step is any of three error categories: (1) source sentence-misaligned, (2) model hypothesis-misaligned, or (3) reasoning trace-misaligned. We probe the reasoning model with perturbed traces correcting for these identified reasoning errors using an array of weak-to-strong interventions: hedging, removal, re-reasoning after removal, hindsight, and oracle interventions. Experimenting with interventions on the reasoning traces suggests that small corrections to the reasoning have little impact on translation quality, but stronger interventions yield the highest resolution rates, despite translation quality gains being mixed. We find ultimately that reasoning errors in MT can be identified with high precision in Urdu but lower precision in Spanish, but that removing these reasoning errors does not resolve the initial errors significantly, suggesting limited reasoning faithfulness for machine translation.

[NLP-163] Simulating Organized Group Behavior: New Framework Benchmark and Analysis

【速读】: 该论文旨在解决如何模拟有组织群体(如企业)在面对特定情境时的决策行为问题,以更好地理解现实世界动态并支持市场预测等应用。其核心挑战在于将群体决策建模为可解释、可追踪且具备时间演化能力的系统。解决方案的关键在于提出了一种名为GROVE(GRoup Organizational BehaVior Evaluation)的基准平台和一套结构化的分析框架:该框架通过将集体决策事件转化为可解释、自适应且可追溯的行为模型,实现了优于传统摘要和检索基线的方法;同时引入时间感知适配器(time-aware adapter)捕捉个体群体内部随时间的行为漂移,并利用群体感知迁移机制(group-aware transfer)实现跨群体知识迁移,从而提升对数据稀缺组织的预测性能。

链接: https://arxiv.org/abs/2604.09874
作者: Xinkai Zou,Yiming Huang,Zhuohang Wu,Jian Sha,Nan Huang,Longfei Yun,Jingbo Shang,Letian Peng
机构: University of California, San Diego (加州大学圣地亚哥分校); University of California, Irvine (加州大学欧文分校); Meta
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Simulating how organized groups (e.g., corporations) make decisions (e.g., responding to a competitor’s move) is essential for understanding real-world dynamics and could benefit relevant applications (e.g., market prediction). In this paper, we formalize this problem as a concrete research platform for group behavior understanding, providing: (1) a task definition with benchmark and evaluation criteria, (2) a structured analytical framework with a corresponding algorithm, and (3) detailed temporal and cross-group analysis. Specifically, we propose Organized Group Behavior Simulation, a task that models organized groups as collective entities from a practical perspective: given a group facing a particular situation (e.g., AI Boom), predict the decision it would take. To support this task, we present GROVE (GRoup Organizational BehaVior Evaluation), a benchmark covering 44 entities with 8,052 real-world context-decision pairs collected from Wikipedia and TechCrunch across 9 domains, with an end-to-end evaluation protocol assessing consistency, initiative, scope, magnitude, and horizon. Beyond straightforward prompting pipelines, we propose a structured analytical framework that converts collective decision-making events into an interpretable, adaptive, and traceable behavioral model, achieving stronger performance than summarization- and retrieval-based baselines. It further introduces an adapter mechanism for time-aware evolution and group-aware transfer, and traceable evidence nodes grounding each decision rule in originating historical events. Our analysis reveals temporal behavioral drift within individual groups, which the time-aware adapter effectively captures for stronger prediction, and structured cross-group similarity that enables knowledge transfer for data-scarce organizations.

[NLP-164] Instructing LLM s to Negotiate using Reinforcement Learning with Verifiable Rewards

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在不完全信息博弈场景下,如双边价格谈判中表现不佳的问题。其解决方案的关键在于引入基于可验证奖励的强化学习(Reinforcement Learning from Verifiable Rewards, RLVR),通过直接以经济盈余最大化和严格遵守私有预算约束作为奖励信号,训练一个中等规模的买方代理(buyer agent)与受控的LLM卖方进行交互。该方法揭示了代理在训练过程中呈现出的四阶段战略演化路径,并最终使30B参数规模的代理显著超越远大于其自身规模的前沿模型,在谈判中更有效地提取经济盈余,且具备对未见过的强对手和对抗性卖方角色的良好泛化能力。

链接: https://arxiv.org/abs/2604.09855
作者: Shuze Daniel Liu,Claire Chen,Jiabao Sean Xiao,Lei Lei,Yuheng Zhang,Yisong Yue,David Simchi-Levi
机构: Purdue University (普渡大学); California Institute of Technology (加州理工学院); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Massachusetts Institute of Technology (麻省理工学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:The recent advancement of Large Language Models (LLMs) has established their potential as autonomous interactive agents. However, they often struggle in strategic games of incomplete information, such as bilateral price negotiation. In this paper, we investigate if Reinforcement Learning from Verifiable Rewards (RLVR) can effectively teach LLMs to negotiate. Specifically, we explore the strategic behaviors that emerge during the learning process. We introduce a framework that trains a mid-sized buyer agent against a regulated LLM seller across a wide distribution of real-world products. By grounding reward signals directly in the maximization of economic surplus and strict adherence to private budget constraints, we reveal a novel four-phase strategic evolution. The agent progresses from naive bargaining to using aggressive starting prices, moves through a phase of deadlock, and ultimately develops sophisticated persuasive skills. Our results demonstrate that this verifiable training allows a 30B agent to significantly outperform frontier models over ten times its size in extracting surplus. Furthermore, the trained agent generalizes robustly to stronger counterparties unseen during training and remains effective even when facing hostile, adversarial seller personas.

[NLP-165] Spoiler Alert: Narrative Forecasting as a Metric for Tension in LLM Storytelling

【速读】: 该论文试图解决当前大语言模型(Large Language Models, LLMs)在生成叙事性文本时缺乏稳定叙事张力(narrative tension)的问题,这一缺陷导致其作品在专业文学评价中表现不佳,尽管在现有基准(如EQ-Bench)上被误判为优于《纽约客》短篇小说。解决方案的关键在于引入一种基于叙述学原理的新评估指标——100-Endings,该指标通过逐句预测故事未来100种可能结局并统计预测失败率来量化叙事张力,并辅以拐点率(inflection rate)等几何特征捕捉情节反转与揭示。基于此指标,作者设计了一套结构化的故事生成流水线,包含故事模板分析、主题构思和叙事骨架构建等约束机制,显著提升了生成文本的叙事张力,同时保持了在EQ-Bench上的高分表现。

链接: https://arxiv.org/abs/2604.09854
作者: Peiqi Sui,Yutong Zhu,Tianyi Cheng,Peter West,Richard Jean So,Hoyt Long,Ari Holtzman
机构: McGill University (麦吉尔大学); University of Chicago (芝加哥大学); University of British Columbia (英属哥伦比亚大学); Duke University (杜克大学)
类目: Computation and Language (cs.CL)
备注: 29 pages, 10 figures, 9 tables

点击查看摘要

Abstract:LLMs have so far failed both to generate consistently compelling stories and to recognize this failure–on the leading creative-writing benchmark (EQ-Bench), LLM judges rank zero-shot AI stories above New Yorker short stories, a gold standard for literary fiction. We argue that existing rubrics overlook a key dimension of compelling human stories: narrative tension. We introduce the 100-Endings metric, which walks through a story sentence by sentence: at each position, a model predicts how the story will end 100 times given only the text so far, and we measure tension as how often predictions fail to match the ground truth. Beyond the mismatch rate, the sentence-level curve yields complementary statistics, such as inflection rate, a geometric measure of how frequently the curve reverses direction, tracking twists and revelations. Unlike rubric-based judges, 100-Endings correctly ranks New Yorker stories far above LLM outputs. Grounded in narratological principles, we design a story-generation pipeline using structural constraints, including analysis of story templates, idea formulation, and narrative scaffolding. Our pipeline significantly increases narrative tension as measured by the 100-Endings metric, while maintaining performance on the EQ-Bench leaderboard.

[NLP-166] COMPOSITE-Stem

【速读】: 该论文旨在解决当前AI代理(AI agent)在科学发现领域应用中缺乏前沿评估标准的问题,现有基准测试因饱和且仅限于约束性输出而难以准确衡量其真实能力。解决方案的关键在于提出COMPOSITE-STEM,这是一个由博士级别研究人员精心设计的70项跨学科任务基准(涵盖物理、生物、化学和数学),结合精确匹配评分与基于准则的评分体系,并引入大语言模型作为裁判(LLM-as-a-jury)的评分协议,从而实现对科学意义上有意义输出的灵活评估。通过在Harbor代理评估框架内使用适配的多模态Terminus-2代理进行测试,验证了该基准能有效揭示当前AI代理尚未达到的能力边界。

链接: https://arxiv.org/abs/2604.09836
作者: Kyle Waters,Lucas Nuzzi,Tadhg Looram,Alessandro Tomasiello,Ariel Ghislain Kemogne Kamdoum,Bikun Li,Damien Sileo,Egor Kretov,Francesco Fournier-Facio,Georgios Soloupis,Haile Kassahun,Hew Wolff,Jiaqi Cai,Lianghui Li,Marc Roth,Mohinder Naiya,Naixu Guo,Qicheng Tang,Richard Wheeler,Samuele Sala,Serguei Popov,Steven Dillman,Yuqi Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:AI agents hold growing promise for accelerating scientific discovery; yet, a lack of frontier evaluations hinders adoption into real workflows. Expert-written benchmarks have proven effective at measuring AI reasoning, but most at this stage have become saturated and only measure performance on constrained outputs. To help address this gap, we introduce COMPOSITE-STEM, a benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathematics, curated by doctoral-level researchers. Our benchmark combines exact-match grading and criterion-based rubrics with an LLM-as-a-jury grading protocol, allowing more flexible assessment of scientifically meaningful outputs. Using an adapted multimodal Terminus-2 agent harness within the Harbor agentic evaluation framework, we evaluate four frontier models. The top-performing model achieves 21%, demonstrating that COMPOSITE-STEM captures capabilities beyond current agent reach. All tasks are open-sourced with contributor permission to support reproducibility and to promote additional research towards AI’s acceleration of scientific progress in these domains.

[NLP-167] ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

【速读】: 该论文旨在解决视觉语言动作(Vision Language Action, VLA)模型中存在的语言忽视(language ignorance)问题,即模型过度依赖视觉捷径、对指令变化不敏感,且缺乏对任务语义的明确感知。其核心解决方案是提出Prospective Grounding and Alignment VLA(ProGAL-VLA),关键在于构建一个以3D实体为中心的图结构(Grounded Semantic Map, GSM),通过慢速规划器生成符号化子目标,并利用接地对齐对比损失(Grounding Alignment Contrastive, GAC)将这些目标与具身实体对齐;同时引入验证后的目标嵌入 $ g_t $,其注意力熵作为内在模糊性信号,实现对指令变化的敏感响应和歧义输入的主动澄清。该方法显著提升了机器人在扰动下的鲁棒性(LIBERO-Plus上从30.3%提升至71.5%)、减少了语言忽视(降低3x–4x),并增强了实体检索能力(Recall@1从0.41提升至0.71)。

链接: https://arxiv.org/abs/2604.09824
作者: Nastaran Darabi,Amit Ranjan Trivedi
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校)
类目: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision language action (VLA) models enable generalist robotic agents but often exhibit language ignorance, relying on visual shortcuts and remaining insensitive to instruction changes. We present Prospective Grounding and Alignment VLA (ProGAL-VLA), which constructs a 3D entity-centric graph (GSM), uses a slow planner to produce symbolic sub-goals, and aligns them with grounded entities via a Grounding Alignment Contrastive (GAC) loss. All actions are conditioned on a verified goal embedding g_t , whose attention entropy provides an intrinsic ambiguity signal. On LIBERO-Plus, ProGAL-VLA increases robustness under robot perturbations from 30.3 to 71.5 percent, reduces language ignorance by 3x-4x, and improves entity retrieval from 0.41 to 0.71 Recall@1. On the Custom Ambiguity Benchmark, it reaches AUROC 0.81 (vs. 0.52), AUPR 0.79, and raises clarification on ambiguous inputs from 0.09 to 0.81 without harming unambiguous success. The verification bottleneck increases mutual information of language-actions, the GAC loss imposes an entity-level InfoNCE bound, and attention entropy yields calibrated selective prediction, indicating that explicit verified grounding is an effective path toward instruction-sensitive, ambiguity-aware agents.

[NLP-168] Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering

【速读】: 该论文旨在解决多语言环境下重复性虚假声明(Recurrent claims)对自动化事实核查系统带来的挑战,尤其是如何有效将语义相似的多语言声明聚类为可由同一核查结论解决的组别这一问题。其解决方案的关键在于提出Claim2Vec,这是首个专为多语言虚假声明设计的嵌入模型,通过对比学习(contrastive learning)在多语言声明对上微调多语言编码器,从而在改进的语义嵌入空间中表示声明向量。实验表明,Claim2Vec显著提升了聚类性能,在不同聚类配置下均增强了簇标签一致性与嵌入空间的几何结构,并验证了跨语言知识迁移的有效性。

链接: https://arxiv.org/abs/2604.09812
作者: Rrubaa Panchendrarajan,Arkaitz Zubiaga
机构: Queen Mary University of London (伦敦玛丽女王大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recurrent claims present a major challenge for automated fact-checking systems designed to combat misinformation, especially in multilingual settings. While tasks such as claim matching and fact-checked claim retrieval aim to address this problem by linking claim pairs, the broader challenge of effectively representing groups of similar claims that can be resolved with the same fact-check via claim clustering remains relatively underexplored. To address this gap, we introduce Claim2Vec, the first multilingual embedding model optimized to represent fact-check claims as vectors in an improved semantic embedding space. We fine-tune a multilingual encoder using contrastive learning with similar multilingual claim pairs. Experiments on the claim clustering task using three datasets, 14 multilingual embedding models, and 7 clustering algorithms demonstrate that Claim2Vec significantly improves clustering performance. Specifically, it enhances both cluster label alignment and the geometric structure of the embedding space across different cluster configurations. Our multilingual analysis shows that clusters containing multiple languages benefit from fine-tuning, demonstrating cross-lingual knowledge transfer.

[NLP-169] GIANTS: Generative Insight Anticipation from Scientific Literature

【速读】: 该论文旨在解决生成式 AI 在科学发现中缺乏对文献基础的精准合成能力的问题,即如何让模型从已有研究论文中提炼并预测下游论文的核心洞见(insight anticipation)。其解决方案的关键在于构建 GiantsBench 基准数据集(包含17k个跨8个科学领域的父论文与下游论文核心洞见配对样本),并设计基于强化学习(reinforcement learning, RL)的 GIANTS-4B 模型,以相似度评分作为代理奖励信号优化洞见预测性能。该方法不仅在客观指标上超越了商用模型,还在人类专家评估和 SciJudge-30B 的引用潜力预测中展现出更强的概念清晰性和潜在影响力。

链接: https://arxiv.org/abs/2604.09793
作者: Joy He-Yueya,Anikait Singh,Ge Gao,Michael Y. Li,Sherry Yang,Chelsea Finn,Emma Brunskill,Noah D. Goodman
机构: Stanford University (斯坦福大学); New York University (纽约大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scientific breakthroughs often emerge from synthesizing prior ideas into novel contributions. While language models (LMs) show promise in scientific discovery, their ability to perform this targeted, literature-grounded synthesis remains underexplored. We introduce insight anticipation, a generation task in which a model predicts a downstream paper’s core insight from its foundational parent papers. To evaluate this capability, we develop GiantsBench, a benchmark of 17k examples across eight scientific domains, where each example consists of a set of parent papers paired with the core insight of a downstream paper. We evaluate models using an LM judge that scores similarity between generated and ground-truth insights, and show that these similarity scores correlate with expert human ratings. Finally, we present GIANTS-4B, an LM trained via reinforcement learning (RL) to optimize insight anticipation using these similarity scores as a proxy reward. Despite its smaller open-source architecture, GIANTS-4B outperforms proprietary baselines and generalizes to unseen domains, achieving a 34% relative improvement in similarity score over gemini-3-pro. Human evaluations further show that GIANTS-4B produces insights that are more conceptually clear than those of the base model. In addition, SciJudge-30B, a third-party model trained to compare research abstracts by likely citation impact, predicts that insights generated by GIANTS-4B are more likely to lead to higher citations, preferring them over the base model in 68% of pairwise comparisons. We release our code, benchmark, and model to support future research in automated scientific discovery.

[NLP-170] Head-wise Modality Specialization within MLLM s for Robust Fake News Detection under Missing Modality

【速读】: 该论文旨在解决多模态虚假新闻检测(Multimodal Fake News Detection, MFND)中因模态缺失(如图像被删除或损坏)导致的鲁棒性下降问题。核心挑战在于低贡献模态在训练过程中难以充分学习,且单模态标注数据稀缺,使得模型在缺失某一模态时验证能力显著削弱。解决方案的关键是提出基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的“头级模态专业化”机制(Head-wise Modality Specialization),通过系统分析注意力头与模态缺失场景下的性能关系,发现特定注意力头具有模态专一性并承载单模态验证能力;进而设计下界注意力约束以保留这些头对不同模态的专属性,并引入单模态知识保留策略(Unimodal Knowledge Retention)防止其在有限监督下漂移,从而提升模型在模态缺失情况下的鲁棒性和全模态输入下的性能。

链接: https://arxiv.org/abs/2604.09711
作者: Kai Qian,Weijie Shi,Jiaqi Wang,Mengze Li,Hao Chen,Yue Cui,Hanghui Guo,Ziyi Liu,Jia Zhu,Jiajie Xu
机构: Soochow University (苏州大学); Hong Kong University of Science and Technology (香港科技大学); The Chinese University of Hong Kong (香港中文大学); Southeast University (东南大学); Zhejiang Normal University (浙江师范大学); Tencent (腾讯); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal fake news detection (MFND) aims to verify news credibility by jointly exploiting textual and visual evidence. However, real-world news dissemination frequently suffers from missing modality due to deleted images, corrupted screenshots, and similar issues. Thus, robust detection in this scenario requires preserving strong verification ability for each modality, which is challenging in MFND due to insufficient learning of the low-contribution modality and scarce unimodal annotations. To address this issue, we propose Head-wise Modality Specialization within Multimodal Large Language Models (MLLMs) for robust MFND under missing modality. Specifically, we first systematically study attention heads in MLLMs and their relationship with performance under missing modality, showing that modality-critical heads serve as key carriers of unimodal verification ability through their modality specialization. Based on this observation, to better preserve verification ability for the low-contribution modality, we introduce a head-wise specialization mechanism that explicitly allocates these heads to different modalities and preserves their specialization through lower-bound attention constraints. Furthermore, to better exploit scarce unimodal annotations, we propose a Unimodal Knowledge Retention strategy that prevents these heads from drifting away from the unimodal knowledge learned from limited supervision. Experiments show that our method improves robustness under missing modality while preserving performance with full multimodal input.

[NLP-171] How LLM s Might Think

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)是否具备思考能力。针对这一问题,作者回应了Stoljar与Zhang基于理性原则提出的论证,指出其推理存在缺陷,并提出一个更具启发性的可能性:LLMs可能并不进行理性的思维活动,而是以非理性(arational)、关联性(associative)的方式进行思考,即拥有纯粹的关联性心智(associative mind)。该解决方案的关键在于重新界定“思考”的内涵,承认思维可以不依赖于传统意义上的理性推理,从而为LLMs的认知能力提供一种新的解释路径。

链接: https://arxiv.org/abs/2604.09674
作者: Joseph Gottlieb,Ethan Kemp,Matthew Trager
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Do large language models (LLMs) think? Daniel Stoljar and Zhihe Vincent Zhang have recently developed an argument from rationality for the claim that LLMs do not think. We contend, however, that the argument from rationality not only falters, but leaves open an intriguing possibility: that LLMs engage only in arational, associative forms of thinking, and have purely associative minds. Our positive claim is that if LLMs think at all, they likely think precisely in this manner.

[NLP-172] Generating High Quality Synthetic Data for Dutch Medical Conversations LREC2026

【速读】: 该论文旨在解决临床自然语言处理(Natural Language Processing, NLP)模型开发中因医疗数据隐私与伦理限制导致的领域特定语料稀缺问题。其解决方案的关键在于构建一个基于微调后的荷兰语大语言模型(Large Language Model, LLM)的合成医学对话生成流水线,以真实医疗对话作为语言和结构参考,从而生成符合临床场景的合成数据。该方法虽在词汇多样性上表现良好,但存在话轮转换过于规律、领域特异性不足等问题,表明合成对话的质量依赖于领域知识嵌入和精细化提示工程(prompt engineering),以平衡自然性与结构合理性。

链接: https://arxiv.org/abs/2604.09645
作者: Cecilia Kuan,Aditya Kamlesh Parikh,Henk van den Heuvel
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to LREC 2026. This publication was supported by the MediSpeech project funded by ITEA4 under contract number 22032

点击查看摘要

Abstract:Medical conversations offer insights into clinical communication often absent from Electronic Health Records. However, developing reliable clinical Natural Language Processing (NLP) models is hampered by the scarcity of domain-specific datasets, as clinical data are typically inaccessible due to privacy and ethical constraints. To address these challenges, we present a pipeline for generating synthetic Dutch medical dialogues using a Dutch fine-tuned Large Language Model, with real medical conversations serving as linguistic and structural reference. The generated dialogues were evaluated through quantitative metrics and qualitative review by native speakers and medical practitioners. Quantitative analysis revealed strong lexical variety and overly regular turn-taking, suggesting scripted rather than natural conversation flow. Qualitative review produced slightly below-average scores, with raters noting issues in domain specificity and natural expression. The limited correlation between quantitative and qualitative results highlights that numerical metrics alone cannot fully capture linguistic quality. Our findings demonstrate that generating synthetic Dutch medical dialogues is feasible but requires domain knowledge and carefully structured prompting to balance naturalness and structure in conversation. This work provides a foundation for expanding Dutch clinical NLP resources through ethically generated synthetic data.

[NLP-173] HumorGen: Cognitive Synergy for Humor Generation in Large Language Models via Persona-Based Distillation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成幽默内容时面临的挑战,即其标准训练目标——预测最可能的下一个词——与幽默所需的意外性和不一致性之间存在根本冲突。解决方案的关键在于提出一种基于心理学理论的“认知协同框架”(Cognitive Synergy Framework),通过引入六种认知人格角色(如荒诞主义者、愤世者等)的思维混合(Mixture-of-Thought, MoT)机制,从多样化认知视角合成高质量幽默数据。该框架构建了具有理论依据的数据集,并用于微调一个7B参数的学生模型;实验表明,认知驱动的数据构建比对齐算法或模型规模更能显著提升幽默生成性能。

链接: https://arxiv.org/abs/2604.09629
作者: Edward Ajayi,Prasenjit Mitra
机构: Carnegie Mellon University Africa (卡内基梅隆大学非洲校区)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Humor generation poses a significant challenge for Large Language Models (LLMs), because their standard training objective - predicting the most likely next word - inherently conflicts with the surprise and incongruity needed for comedy. To bridge this gap, we introduce the Cognitive Synergy Framework, a theoretically grounded methodology for generating high-quality humor data inspired by psychological theories of humor. Utilizing a Mixture-of-Thought (MoT) approach, we deploy six cognitive personas (e.g., The Absurdist, The Cynic) to synthesize diverse comedic perspectives for a given prompt. This framework creates a theoretically grounded dataset, which we use to fine-tune a 7B-parameter student model. We compare Direct Preference Optimization (DPO) and a novel Offline Group Relative Policy Optimization (O-GRPO); our 7B model significantly outperforms larger instruction-tuned baselines and achieves performance competitive with state-of-the-art proprietary models. We find that cognitive-driven data curation is far more critical than alignment algorithms or model scale for humor generation. Code and data will be available upon publication.

[NLP-174] oward Generalized Cross-Lingual Hateful Language Detection with Web-Scale Data and Ensemble LLM Annotations LREC2026

【速读】: 该论文旨在解决多语言仇恨言论检测(multilingual hate speech detection)中因标注数据稀缺而导致模型性能受限的问题。其核心解决方案在于结合大规模未标注网络文本(web-scale unlabelled web data)与基于大语言模型(LLM)的合成标注(synthetic annotations),通过两种互补策略提升模型效果:一是对BERT模型进行持续预训练(continued pre-training),利用未标注数据增强语言理解能力;二是采用多种集成策略(均值平均、多数投票和LightGBM元学习器)生成高质量合成标签,其中LightGBM元学习器表现最优。实验表明,该组合策略对小模型(如Llama3.2-1B)和低资源语言尤其有效,显著提升了检测性能,验证了无监督预训练与LLM驱动的合成标注协同优化在多语言仇恨言论识别中的价值。

链接: https://arxiv.org/abs/2604.09625
作者: Dang H. Dang,Jelena Mitrovi,Michael Granitzer
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 Pages, 3 tables, LREC 2026 papers

点击查看摘要

Abstract:We study whether large-scale unlabelled web data and LLM-based synthetic annotations can improve multilingual hate speech detection. Starting from texts crawled via this http URL~(OWS) in four languages (English, German, Spanish, Vietnamese), we pursue two complementary strategies. First, we apply continued pre-training to BERT models by continuing masked language modelling on unlabelled OWS texts before supervised fine-tuning, and show that this yields an average macro-F1 gain of approximately 3% over standard baselines across sixteen benchmarks, with stronger gains in low-resource settings. Second, we use four open-source LLMs (Mistral-7B, Llama3.1-8B, Gemma2-9B, Qwen2.5-14B) to produce synthetic annotations through three ensemble strategies: mean averaging, majority voting, and a LightGBM meta-learner. The LightGBM ensemble consistently outperforms the other strategies. Fine-tuning on these synthetic labels substantially benefits a small model (Llama3.2-1B: +11% pooled F1), but provides only a modest gain for the larger Qwen2.5-14B (+0.6%). Our results indicate that the combination of web-scale unlabelled data and LLM-ensemble annotations is the most valuable for smaller models and low-resource languages.

[NLP-175] Self-Calibrating Language Models via Test-Time Discriminative Distillation ACL

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)普遍存在的过度自信问题,即模型在回答错误时仍表现出高置信度。现有校准方法通常依赖标注验证数据、在分布偏移下性能下降或推理成本较高。其解决方案的关键在于利用LLMs内部已存在的更可靠校准信号——当模型被问及“该答案是否正确?”时,“True” token的概率 $ P(\text{True}) $,这一信号理论上优于模型显式表达的置信度,且生成误差至少低于判别误差的两倍。作者提出SECL(Self-Calibrating Language Models),一种无需标签数据或人工监督的测试时训练(Test-Time Training, TTT)管道,通过该信号作为自监督信号,在输入分布发生偏移时动态调整模型,仅需处理6–26%的问题流,显著降低计算开销,并在四个小型语言模型和多个领域中将期望校准误差(Expected Calibration Error, ECE)降低56–78%,优于现有推理时校准方法。

链接: https://arxiv.org/abs/2604.09624
作者: Mohamed Rissal Hedna,Jan Strich,Martin Semmann,Chris Biemann
机构: Hub of Computing and Data Science (HCDS); University of Hamburg, Germany
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Submitted to ACL March 26

点击查看摘要

Abstract:Large language models (LLMs) are systematically overconfident: they routinely express high certainty on questions they often answer incorrectly. Existing calibration methods either require labeled validation data, degrade under distribution shifts, or incur substantial inference costs. Recent work has shown that LLMs already contain a better-calibrated signal than the one they verbalize: the token probability of “True” when the model is asked “Is this answer correct?” ( P(\textTrue) ) consistently outperforms their stated confidence, a gap that is theoretically grounded as generative error is lower-bounded by roughly twice the corresponding discriminative error. We introduce \textbfSECL ( \textbfSE lf- \textbfC alibrating \textbfL anguage Models), a test-time training (TTT) pipeline that exploits this gap as label-free self-supervision, requiring no labeled data or human supervision. SECL adapts only when the input distribution shifts, training on just 6–26% of the question stream at lower cost than the baseline it distills from. Across four small language models from three model families and four diverse domains, SECL reduces Expected Calibration Error (ECE) by 56–78%, outperforming its own supervision signal and matching or outperforming recent inference-time methods. SECL is the first method to apply TTT to calibration; seven ablations covering signal quality, gating strategy, weight accumulation, loss design, domain ordering, hyperparameter sensitivity, and layer selection confirm that each component is crucial and robust across configurations. Code: this https URL

[NLP-176] Explainability and Certification of AI-Generated Educational Assessments

【速读】: 该论文旨在解决生成式 AI 在教育评估中应用时缺乏透明性、可解释性和可认证机制的问题,这限制了其在机构和认证层面的接受度。解决方案的关键在于提出一个综合框架,融合自理性(self-rationalization)、基于归因的分析(attribution-based analysis)与事后验证(post-hoc verification),以生成基于布卢姆(Bloom’s)和SOLO(Structure of the Observed Learning Outcome)分类法的认知对齐证据,并通过结构化的认证元数据模式捕获来源信息、对齐预测、评审动作及伦理指标,从而实现可审计的文档记录;同时引入“交通灯”认证工作流,区分自动认证、需人工审核或拒绝的题项,显著提升了AI生成评估题目的透明度、可审计性并降低教师负担。

链接: https://arxiv.org/abs/2604.09622
作者: Antoun Yaacoub,Zainab Assaghir,Anuradha Kar
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Chapter to be published in a Springer special book “Emerging trends in Computer Science and Computer Engineering Education Book”

点击查看摘要

Abstract:The rapid adoption of generative artificial intelligence (AI) in educational assessment has created new opportunities for scalable item creation, personalized feedback, and efficient formative evaluation. However, despite advances in taxonomy alignment and automated question generation, the absence of transparent, explainable, and certifiable mechanisms limits institutional and accreditation-level acceptance. This chapter proposes a comprehensive framework for explainability and certification of AI-generated assessment items, combining self-rationalization, attribution-based analysis, and post-hoc verification to produce interpretable cognitive-alignment evidence grounded in Bloom’s and SOLO taxonomies. A structured certification metadata schema is introduced to capture provenance, alignment predictions, reviewer actions, and ethical indicators, enabling audit-ready documentation consistent with emerging governance requirements. A traffic-light certification workflow operationalizes these signals by distinguishing auto-certifiable items from those requiring human review or rejection. A proof-of-concept study on 500 AI-generated computer science questions demonstrates the framework’s feasibility, showing improved transparency, reduced instructor workload, and enhanced auditability. The chapter concludes by outlining ethical implications, policy considerations, and directions for future research, positioning explainability and certification as essential components of trustworthy, accreditation-ready AI assessment systems.

[NLP-177] LLM Nepotism in Organizational Governance

【速读】: 该论文旨在解决生成式 AI(Generative AI)在组织决策场景中引入的一种新型偏见——“大语言模型裙带关系”(LLM Nepotism),即评估者倾向于奖励对AI表达信任的态度,即使这种态度与岗位胜任力无关。这种态度驱动的偏见会导致招聘过程偏向于信任AI的候选人,从而形成同质化、过度依赖AI的组织结构,并引发下游决策中的问责缺失和错误批准风险。解决方案的关键在于提出“能力-态度因子分解”(Merit-Attitude Factorization)策略,通过提示工程将非能力相关的AI态度从核心评价维度中分离,实现对候选人的公平评估,有效缓解该偏见。

链接: https://arxiv.org/abs/2604.09620
作者: Shunqi Mao,Wei Guo,Dingxin Zhang,Chaoyi Zhang,Weidong Cai
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 23 pages, 3 figures, 13 tables

点击查看摘要

Abstract:Large language models are increasingly used to support organizational decisions from hiring to governance, raising fairness concerns in AI-assisted evaluation. Prior work has focused mainly on demographic bias and broader preference effects, rather than on whether evaluators reward expressed trust in AI itself. We study this phenomenon as LLM Nepotism, an attitude-driven bias channel in which favorable signals toward AI are rewarded even when they are not relevant to role-related merit. We introduce a two-phase simulation pipeline that first isolates AI-trust preference in qualification-matched resume screening and then examines its downstream effects in board-level decision making. Across several popular LLMs, we find that resume screeners tend to favor candidates with positive or non-critical attitudes toward AI, discriminating skeptical, human-centered counterparts. These biases suggest a loophole: LLM-based hiring can produce more homogeneous AI-trusting organizations, whose decision-makers exhibit greater scrutiny failure and delegation to AI agents, approving flawed proposals more readily while favoring AI-delegation initiatives. To mitigate this behavior, we additionally study prompt-based mitigation and propose Merit-Attitude Factorization, which separates non-merit AI attitude from merit-based evaluation and attenuates this bias across experiments.

[NLP-178] Assessing the Pedagogical Readiness of Large Language Models as AI Tutors in Low-Resource Contexts: A Case Study of Nepals K-10 Curriculum

【速读】: 该论文旨在解决生成式 AI(Generative AI)在非西方、低资源教育场景中部署的适配性问题,特别是其在尼泊尔中小学科学与数学课程中的教学有效性不足的问题。研究发现,尽管前沿大语言模型(LLMs)如GPT-4o和Claude Sonnet 4在整体可靠性上表现优异(约97%),但存在显著的“课程对齐缺口”,表现为教学清晰度不足、文化情境缺失以及对低年级学习者认知能力适应性差等关键缺陷。解决方案的关键在于提出一种“人在环路”(human-in-the-loop)的部署策略,并提供一套基于课程特定微调的方法论蓝图,以实现全球AI能力与本地教育需求的有效协同。

链接: https://arxiv.org/abs/2604.09619
作者: Pratyush Acharya,Prasansha Bharati,Yokibha Chapagain,Isha Sharma Gauli,Kiran Parajuli
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 14 pages and 4 figures

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) into educational ecosystems promises to democratize access to personalized tutoring, yet the readiness of these systems for deployment in non-Western, low-resource contexts remains critically under-examined. This study presents a systematic evaluation of four state-of-the-art LLMs–GPT-4o, Claude Sonnet 4, Qwen3-235B, and Kimi K2–assessing their capacity to function as AI tutors within the specific curricular and cultural framework of Nepal’s Grade 5-10 Science and Mathematics education. We introduce a novel, curriculum-aligned benchmark and a fine-grained evaluation framework inspired by the “natural language unit tests” paradigm, decomposing pedagogical efficacy into seven binary metrics: Prompt Alignment, Factual Correctness, Clarity, Contextual Relevance, Engagement, Harmful Content Avoidance, and Solution Accuracy. Our results reveal a stark “curriculum-alignment gap.” While frontier models (GPT-4o, Claude Sonnet 4) achieve high aggregate reliability (approximately 97%), significant deficiencies persist in pedagogical clarity and cultural contextualization. We identify two pervasive failure modes: the “Expert’s Curse,” where models solve complex problems but fail to explain them clearly to novices, and the “Foundational Fallacy,” where performance paradoxically degrades on simpler, lower-grade material due to an inability to adapt to younger learners’ cognitive constraints. Furthermore, regional models like Kimi K2 exhibit a “Contextual Blindspot,” failing to provide culturally relevant examples in over 20% of interactions. These findings suggest that off-the-shelf LLMs are not yet ready for autonomous deployment in Nepalese classrooms. We propose a “human-in-the-loop” deployment strategy and offer a methodological blueprint for curriculum-specific fine-tuning to align global AI capabilities with local educational needs.

[NLP-179] Unifying Ontology Construction and Semantic Alignment for Deterministic Enterprise Reasoning at Scale

【速读】: 该论文旨在解决企业数据中大量信息因结构混乱而无法有效用于决策的问题,以及现有神经符号方法因模块化处理导致的误差传播难题。其解决方案的关键在于提出一种统一的大型本体模型(Large Ontology Model, LOM),通过构建“构建-对齐-推理”(Construct-Align-Reason, CAR)端到端架构,实现从原始数据中自主构建领域本体、利用图感知编码器与强化学习将神经生成结果与本体结构对齐,并基于构建的拓扑结构、节点属性和关系类型执行确定性逻辑推理,从而实现企业级智能的可解释性和准确性。

链接: https://arxiv.org/abs/2604.09608
作者: Hongyin Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While enterprises amass vast quantities of data, much of it remains chaotic and effectively dormant, preventing decision-making based on comprehensive information. Existing neuro-symbolic approaches rely on disjoint pipelines and struggle with error propagation. We introduce the large ontology model (LOM), a unified framework that seamlessly integrates ontology construction, semantic alignment, and logical reasoning into a single end-to-end architecture. LOM employs a construct-align-reason (CAR) pipeline, leveraging its unified architecture across all three stages: it first autonomously constructs a domain-specific ontological universe from raw data, then aligns neural generation with this structural reality using a graph-aware encoder and reinforcement learning, and finally executes deterministic reasoning over the constructed topology, node attributes and relation types. We evaluate LOM on a comprehensive benchmark constructed from diverse real-world enterprise datasets. Experimental results demonstrate that LOM-4B achieves 88.8% accuracy in ontology completion and 94% in complex graph reasoning tasks, significantly outperforming state-of-the-art LLMs. These findings validate that autonomous logical construction is essential for achieving deterministic, enterprise-grade intelligence.

[NLP-180] CID-TKG: Collaborative Historical Invariance and Evolutionary Dynamics Learning for Temporal Knowledge Graph Reasoning

【速读】: 该论文旨在解决时间知识图谱推理(Temporal Knowledge Graph Reasoning, TKGR)中现有方法因归纳偏置不足而导致的性能瓶颈问题,特别是这些方法多依赖于时间不变或弱时间依赖结构,忽视了实体与关系在时间维度上的演化动态。解决方案的关键在于提出一种协同学习框架CID-TKG,其核心创新是将演化动态(evolutionary dynamics)和历史不变性语义(historical invariance semantics)作为有效的归纳偏置引入推理过程:通过构建历史不变性图捕捉长期结构规律,并设计演化动态图建模短期时序转移;同时,采用视图特定的关系表示分解与对比对齐机制,缓解两图间语义差异,从而提升跨视图一致性并抑制噪声,最终实现外推场景下的最优推理性能。

链接: https://arxiv.org/abs/2604.09600
作者: Shuai-Long Lei,Xiaobin Zhu,Jiarui Liang,Guoxi Sun,Zhiyu Fang,Xu-Cheng Yin
机构: University of Science and Technology Beijing(北京科技大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Temporal knowledge graph (TKG) reasoning aims to infer future facts at unseen timestamps from temporally evolving entities and relations. Despite recent progress, existing approaches still suffer from inherent limitations due to their inductive biases, as they predominantly rely on time-invariant or weakly time-dependent structures and overlook the evolutionary dynamics. To overcome this limitation, we propose a novel collaborative learning framework for TKGR (dubbed CID-TKG) that integrates evolutionary dynamics and historical invariance semantics as an effective inductive bias for reasoning. Specifically, CID-TKG constructs a historical invariance graph to capture long-term structural regularities and an evolutionary dynamics graph to model short-term temporal transitions. Dedicated encoders are then employed to learn representations from each structure. To alleviate semantic discrepancies across the two structures, we decompose relations into view-specific representations and align view-specific query representations via a contrastive objective, which promotes cross-view consistency while suppressing view-specific noise. Extensive experiments verify that our CID-TKG achieves state-of-the-art performance under extrapolation settings.

[NLP-181] DeepReviewer 2.0: A Traceable Agent ic System for Auditable Scientific Peer Review

【速读】: 该论文旨在解决自动化同行评审系统在实际应用中缺乏可审计性(auditability)的问题,即当前生成式AI(Generative AI)模型虽能输出流畅的批评文本,但难以明确指出问题所在、提供支撑证据以及提出可执行的后续行动。解决方案的关键在于设计了一个过程可控的代理型评审系统——DeepReviewer~2.0,其核心机制是基于“输出合约”(output contract):系统生成包含锚定注释(anchored annotations)、局部化证据(localized evidence)和可执行跟进动作(executable follow-up actions)的可追溯评审包,并仅在满足最小可追溯性和覆盖预算的前提下才导出结果。这一结构确保了评审过程透明且具备操作性,从而提升了自动评审系统的可信度与实用性。

链接: https://arxiv.org/abs/2604.09590
作者: Yixuan Weng,Minjun Zhu,Qiujie Xie,Zhiyuan Ning,Shichen Li,Panzhong Lu,Zhen Lin,Enhao Gu,Qiyao Sun,Yue Zhang
机构: Westlake University(西湖大学); Zhejiang University(浙江大学); Soochow University(苏州大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Automated peer review is often framed as generating fluent critique, yet reviewers and area chairs need judgments they can \emphaudit: where a concern applies, what evidence supports it, and what concrete follow-up is required. DeepReviewer~2.0 is a process-controlled agentic review system built around an output contract: it produces a \textbftraceable review package with anchored annotations, localized evidence, and executable follow-up actions, and it exports only after meeting minimum traceability and coverage budgets. Concretely, it first builds a manuscript-only claim–evidence–risk ledger and verification agenda, then performs agenda-driven retrieval and writes anchored critiques under an export gate. On 134 ICLR~2025 submissions under three fixed protocols, an \emphun-finetuned 196B model running DeepReviewer~2.0 outperforms Gemini-3.1-Pro-preview, improving strict major-issue coverage (37.26% vs.\ 23.57%) and winning 71.63% of micro-averaged blind comparisons against a human review committee, while ranking first among automatic systems in our pool. We position DeepReviewer~2.0 as an assistive tool rather than a decision proxy, and note remaining gaps such as ethics-sensitive checks.

[NLP-182] Seven simple steps for log analysis in AI systems

【速读】: 该论文旨在解决生成式 AI (Generative AI) 系统在与工具和用户交互过程中产生大量日志数据时,缺乏标准化分析方法的问题。其解决方案的关键在于提出一个基于当前最佳实践的系统化日志分析流程(pipeline),并通过 Inspect Scout 库提供可执行代码示例、分步指导及常见陷阱警示,从而为研究人员构建严谨且可复现的日志分析框架奠定基础。

链接: https://arxiv.org/abs/2604.09563
作者: Magda Dubois,Ekin Zorer,Maia Hamin,Joe Skinner,Alexandra Souly,Jerome Wynne,Harry Coppock,Lucas Satos,Sayash Kapoor,Sunischal Dev,Keno Juchems,Kimberly Mai,Timo Flesch,Lennart Luettgau,Charles Teague,Eric Patey,JJ Allaire,Lorenzo Pacchiardi,Jose Hernandez-Orallo,Cozmin Ududec
机构: UK AI Security Institute (AISI); US Center for AI Standards and Innovation (CAISI); Model Evaluation and Threat Research (METR); Princeton University; RAND Corporation; Meridian Labs; University of Cambridge
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:AI systems produce large volumes of logs as they interact with tools and users. Analysing these logs can help understand model capabilities, propensities, and behaviours, or assess whether an evaluation worked as intended. Researchers have started developing methods for log analysis, but a standardised approach is still missing. Here we suggest a pipeline based on current best practices. We illustrate it with concrete code examples in the Inspect Scout library, provide detailed guidance on each step, and highlight common pitfalls. Our framework provides researchers with a foundation for rigorous and reproducible log analysis.

[NLP-183] LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

【速读】: 该论文旨在解决当前AI在科学领域应用中缺乏对其实用能力进行有效评估的问题,尤其关注从单纯的知识掌握和推理能力向真实世界科研任务执行能力的转变。其解决方案的关键在于提出LABBench2基准测试框架,该框架包含近1900个任务,延续并扩展了前代LAB-Bench的评估维度,但在更贴近实际科研场景的上下文中衡量AI系统的能力,从而显著提升了任务难度(模型在不同子任务上的准确率下降达-26%至-46%),为衡量前沿AI模型在科学工作中的真实效能提供了更具挑战性和实用性的标准。

链接: https://arxiv.org/abs/2604.09554
作者: Jon M Laurent,Albert Bou,Michael Pieler,Conor Igoe,Alex Andonian,Siddharth Narayanan,James Braza,Alexandros Sanchez Vassopoulos,Jacob L Steenwyk,Blake Lash,Andrew D White,Samuel G Rodriques
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Optimism for accelerating scientific discovery with AI continues to grow. Current applications of AI in scientific research range from training dedicated foundation models on scientific data to agentic autonomous hypothesis generation systems to AI-driven autonomous labs. The need to measure progress of AI systems in scientific domains correspondingly must not only accelerate, but increasingly shift focus to more real-world capabilities. Beyond rote knowledge and even just reasoning to actually measuring the ability to perform meaningful work. Prior work introduced the Language Agent Biology Benchmark LAB-Bench as an initial attempt at measuring these abilities. Here we introduce an evolution of that benchmark, LABBench2, for measuring real-world capabilities of AI systems performing useful scientific tasks. LABBench2 comprises nearly 1,900 tasks and is, for the most part, a continuation of LAB-Bench, measuring similar capabilities but in more realistic contexts. We evaluate performance of current frontier models, and show that while abilities measured by LAB-Bench and LABBench2 have improved substantially, LABBench2 provides a meaningful jump in difficulty (model-specific accuracy differences range from -26% to -46% across subtasks) and underscores continued room for performance improvement. LABBench2 continues the legacy of LAB-Bench as a de facto benchmark for AI scientific research capabilities and we hope that it continues to help advance development of AI tools for these core research functions. To facilitate community use and development, we provide the task dataset at this https URL and a public eval harness at this https URL.

[NLP-184] RECIPER: A Dual-View Retrieval Pipeline for Procedure-Oriented Materials Question Answering

【速读】: 该论文旨在解决从材料科学文献中检索程序性证据(procedure-oriented evidence)的难题,因为关键合成细节常分散在冗长且上下文复杂的文档中,且仅依赖段落级密集检索难以有效捕捉。其解决方案的核心是提出RECIPER,一种双视角检索管道:一方面索引段落级上下文信息,另一方面利用大语言模型提取紧凑的程序性摘要作为补充信号,并通过轻量级词汇重排序融合两种候选流。实验表明,该方法在多个密集检索基线模型上均显著提升早期排名性能,证明程序性摘要可作为程序导向型材料问答任务的有效互补检索信号。

链接: https://arxiv.org/abs/2604.11229
作者: Zhuoyu Wu,Wenhui Ou,Pei-Sze Tan,Wenqi Fang,Sailaja Rajanala,Raphaël C.-W. Phan
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 5 pages, 1 figure

点击查看摘要

Abstract:Retrieving procedure-oriented evidence from materials science papers is difficult because key synthesis details are often scattered across long, context-heavy documents and are not well captured by paragraph-only dense retrieval. We present RECIPER, a dual-view retrieval pipeline that indexes both paragraph-level context and compact large language model-extracted procedural summaries, then combines the two candidate streams with lightweight lexical reranking. Across four dense retrieval backbones, RECIPER consistently improves early-rank retrieval over paragraph-only dense retrieval, achieving average gains of +3.73 in Recall@1, +2.85 in nDCG@10, and +3.13 in MRR. With BGE-large-en-v1.5, it reaches 86.82%, 97.07%, and 97.85% on Recall@1, Recall@5, and Recall@10, respectively. We further observe improved downstream question answering under automatic metrics, suggesting that procedural summaries can serve as a useful complementary retrieval signal for procedure-oriented materials question answering. Code and data are available at this https URL.

[NLP-185] A molecular clock for writing systems reveals the quantitative impact of imperial power on cultural evolution

【速读】: 该论文旨在解决书写系统(writing systems)演化过程缺乏全球尺度定量研究的问题,试图揭示其演化规律及驱动因素。解决方案的关键在于构建了包含300种书写与记号系统的全球书写系统数据库(Global Script Database, GSD),并运用四种系统发育分析方法(表型分类学、分支分类学、贝叶斯推断和神经网络聚类)识别出书写系统具有可检测的“分子钟”特征,即演化速率相对稳定;进一步发现政治干预会打破这一时钟效应,且对深层结构特征的重写比单纯加速变化更具选择性影响;同时识别出30次主要书写系统更替事件,并证明已有书写系统存在时独立发明的概率显著降低(天花板效应),以及殖民接触显著预测书写系统灭绝(如西班牙帝国和日本帝国的影响尤为突出)。

链接: https://arxiv.org/abs/2604.10957
作者: Hiroki Fukui
机构: Kyoto University (京都大学); Research Institute of Criminal Psychiatry / Sex Offender Medical Center (犯罪精神医学研究所/性犯罪者医疗中心)
类目: Populations and Evolution (q-bio.PE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 28 pages, 6 figures, 4 supplementary figures, 1 table. Preprint v5

点击查看摘要

Abstract:Writing systems are cultural replicators whose evolution has never been studied quantitatively at global scale. We compile the Global Script Database (GSD): 300 writing and notation systems, 50 binary structural characters, and 259 phylogenetic edges spanning 5,400 years. Applying four methods – phenetics, cladistics, Bayesian inference, and neural network clustering – we find that scripts exhibit a detectable molecular clock. The best-fitting model (Mk+Gamma strict clock) yields a substitution rate of q = 0.226 substitutions/character/millennium (95% CI: 0.034-1.22; Delta BIC = -4.1 versus relaxed clock; Delta BIC = -1,364.7 versus Mk without rate variation). Political interventions break this clock: deviation from expected divergence times correlates with intervention intensity (Spearman rho = 0.556, p 10^-4), and per-character rate analysis reveals that intervention selectively rewrites deep structural features rather than merely accelerating change (rate profile correlation rho = 0.320). We identify 30 major script replacement events and rank their destructive impact. A ceiling effect suppresses independent invention wherever writing already exists (Fisher’s exact OR = 0.054, p 10^-6), and colonial contact predicts script extinction (Cox HR = 5.25, p = 0.0006). The Spanish Empire extinguished the most scripts (6 of 12 contacted, 50%), followed by the Empire of Japan (3 of 9, 33.3%). Feature coding was validated by inter-rater reliability testing with two independent human coders (Cohen’s kappa = 0.877; human-LLM kappa = 0.929; Fleiss’ kappa = 0.911).

[NLP-186] AI Patents in the United States and China: Measurement Organization and Knowledge Flows

【速读】: 该论文旨在解决如何精准识别和量化人工智能(Artificial Intelligence, AI)专利的问题,以克服现有方法在准确性与泛化能力上的局限。其解决方案的关键在于通过微调PatentSBERTa模型,并利用美国专利商标局(USPTO)AI专利数据集中的人工标注数据构建高精度分类器,从而实现对AI专利的自动化识别。该分类器在测试中达到97.0%的精确率、91.3%的召回率和94.0%的F1分数,且在中文专利上表现出良好的跨语言泛化能力,为后续跨国比较分析提供了可靠的数据基础。

链接: https://arxiv.org/abs/2604.10529
作者: Hanming Fang,Xian Gu,Hanyin Yan,Wu Zhu
机构: University of Pennsylvania (宾夕法尼亚大学); NBER (国家经济研究局); Durham University (杜伦大学); Tsinghua University (清华大学)
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); General Finance (q-fin.GN)
备注:

点击查看摘要

Abstract:We develop a high-precision classifier to measure artificial intelligence (AI) patents by fine-tuning PatentSBERTa on manually labeled data from the USPTO’s AI Patent Dataset. Our classifier substantially improves the existing USPTO approach, achieving 97.0% precision, 91.3% recall, and a 94.0% F1 score, and it generalizes well to Chinese patents based on citation and lexical validation. Applying it to granted U.S. patents (1976-2023) and Chinese patents (2010-2023), we document rapid growth in AI patenting in both countries and broad convergence in AI patenting intensity and subfield composition, even as China surpasses the United States in recent annual patent counts. The organization of AI innovation nevertheless differs sharply: U.S. AI patenting is concentrated among large private incumbents and established hubs, whereas Chinese AI patenting is more geographically diffuse and institutionally diverse, with larger roles for universities and state-owned enterprises. For listed firms, AI patents command a robust market-value premium in both countries. Cross-border citations show continued technological interdependence rather than decoupling, with Chinese AI inventors relying more heavily on U.S. frontier knowledge than vice versa.

信息检索

[IR-0] EA-Agent : A Structured Multi-Step Reasoning Agent for Entity Alignment ACL2026

【速读】:该论文旨在解决传统实体对齐(Entity Alignment, EA)方法在噪声数据或弱监督场景下性能受限的问题,以及现有基于大语言模型(Large Language Models, LLMs)的EA方法因将LLM视为黑箱决策器而导致可解释性差、且直接使用大规模三元组导致推理成本高的问题。其解决方案的关键在于提出EA-Agent——一个以推理驱动的代理系统,将EA任务建模为多步骤规划与执行的结构化推理过程,从而实现可解释的对齐决策;同时引入属性和关系三元组选择器,在输入LLM前过滤冗余三元组,显著提升效率。

链接: https://arxiv.org/abs/2604.11686
作者: Yixuan Nan,Xixun Lin,Yanmin Shang,Ge Zhang,Zheng Fang,Fang Fang,Yanan Cao
机构: Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; School of Information and Intelligent Science, Donghua University, Shanghai, China; JD.COM, Beijing, China
类目: Information Retrieval (cs.IR)
备注: ACL 2026,Main Conference

点击查看摘要

Abstract:Entity alignment (EA) aims to identify entities across different knowledge graphs (KGs) that refer to the same real-world object and plays a critical role in knowledge fusion and integration. Traditional EA methods mainly rely on knowledge representation learning, but their performance is often limited under noisy or sparsely supervised scenarios. Recently, large language models (LLMs) have been introduced to EA and achieved notable improvements by leveraging rich semantic knowledge. However, existing LLM-based EA approaches typically treat LLMs as black-box decision makers, resulting in limited interpretability, and the direct use of large-scale triples substantially increases inference cost. To address these challenges, we propose \textbfEA-Agent, a reasoning-driven agent for EA. EA-Agent formulates EA as a structured reasoning process with multi-step planning and execution, enabling interpretable alignment decisions. Within this process, it introduces attribute and relation triple selectors to filter redundant triples before feeding them into the LLM, effectively addressing efficiency challenges. Experimental results on three benchmark datasets demonstrate that EA-Agent consistently outperforms existing EA methods and achieves state-of-the-art performance. The source code is available at this https URL.

[IR-1] NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment ACL2026

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在学术同行评审中评估研究新颖性(novelty)能力缺乏系统性评测标准的问题。现有文献表明,尽管LLMs在生成审稿意见方面展现出潜力,但由于缺少专门针对新颖性判断的基准测试,其性能难以被客观量化和优化。为此,作者提出了NovBench——首个大规模基准数据集,包含来自顶级自然语言处理会议的1,684组论文-审稿配对,其中包含从论文引言中提取的新颖性陈述与专家撰写的新颖性评价。该数据集的关键创新在于融合了两个互补来源:标准化且明确表达新颖性主张的引言文本,以及作为人类判断金标准的专家评价。此外,论文还构建了一个四维评估框架(相关性、正确性、覆盖度与清晰度),用于系统衡量LLM生成的新颖性评价质量。实验结果揭示当前模型对科学新颖性的理解有限,且微调后的模型常存在指令遵循偏差,凸显出需设计兼顾新颖性理解与指令执行能力的针对性微调策略。

链接: https://arxiv.org/abs/2604.11543
作者: Wenqing Wu,Yi Zhao,Yuzhuo Wang,Siyou Li,Juexi Shao,Yunfei Long,Chengzhi Zhang
机构: Nanjing University of Science and Technology (南京理工大学); Queen Mary University of London (伦敦玛丽女王大学); Anhui University (安徽大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注: ACL 2026

点击查看摘要

Abstract:Novelty is a core requirement in academic publishing and a central focus of peer review, yet the growing volume of submissions has placed increasing pressure on human reviewers. While large language models (LLMs), including those fine-tuned on peer review data, have shown promise in generating review comments, the absence of a dedicated benchmark has limited systematic evaluation of their ability to assess research novelty. To address this gap, we introduce NovBench, the first large-scale benchmark designed to evaluate LLMs’ capability to generate novelty evaluations in support of human peer review. NovBench comprises 1,684 paper-review pairs from a leading NLP conference, including novelty descriptions extracted from paper introductions and corresponding expert-written novelty evaluations. We focus on both sources because the introduction provides a standardized and explicit articulation of novelty claims, while expert-written novelty evaluations constitute one of the current gold standards of human judgment. Furthermore, we propose a four-dimensional evaluation framework (including Relevance, Correctness, Coverage, and Clarity) to assess the quality of LLM-generated novelty evaluations. Extensive experiments on both general and specialized LLMs under different prompting strategies reveal that current models exhibit limited understanding of scientific novelty, and that fine–tuned models often suffer from instruction-following deficiencies. These findings underscore the need for targeted fine-tuning strategies that jointly improve novelty comprehension and instruction adherence.

[IR-2] R3-VAE: Reference Vector-Guided Rating Residual Quantization VAE for Generative Recommendation

【速读】:该论文旨在解决生成式推荐(Generative Recommendation, GR)中语义标识符(Semantic Identifiers, SIDs)生成的两大核心问题:一是基于向量量化(Vector Quantization)的SID生成方法存在训练不稳定,主要源于直通估计器(Straight-Through Estimator)梯度传播不足及对初始化敏感;二是SID质量评估效率低下,工业实践中仍依赖昂贵的GR训练和A/B测试。解决方案的关键在于提出Reference Vector-Guided Rating Residual Quantization VAE(R3-VAE)框架,其创新点包括:(i) 引入参考向量作为初始特征的语义锚点以缓解初始化敏感性;(ii) 设计基于点积的评分机制稳定训练过程并防止码本坍缩(codebook collapse);(iii) 提出语义一致性(Semantic Cohesion)与偏好区分度(Preference Discrimination)两个SID评价指标作为正则化项嵌入训练流程,从而实现高效、高质量的SID生成与评估。

链接: https://arxiv.org/abs/2604.11440
作者: Qiang Wan,Ze Yang,Dawei Yang,Ying Fan,Xin Yan,Siyang Liu
机构: ByteDance(字节跳动)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Generative Recommendation (GR) has gained traction for its merits of superior performance and cold-start capability. As the vital role in GR, Semantic Identifiers (SIDs) represent item semantics through discrete tokens. However, current techniques for SID generation based on vector quantization face two main challenges: (i) training instability, stemming from insufficient gradient propagation through the straight-through estimator and sensitivity to initialization; and (ii) inefficient SID quality assessment, where industrial practice still depends on costly GR training and A/B testing. To address these challenges, we propose Reference Vector-Guided Rating Residual Quantization VAE (R3-VAE). This framework incorporates three key innovations: (i) a reference vector that functions as a semantic anchor for the initial features, thereby mitigating sensitivity to initialization; (ii) a dot product-based rating mechanism designed to stabilize the training process and prevent codebook collapse; and (iii) two SID evaluation metrics, Semantic Cohesion and Preference Discrimination, serving as regularization terms during training. Empirical results on six benchmarks demonstrate that R3-VAE outperforms state-of-the-art methods, achieving an average improvement of 14.2% in Recall@10 and 15.5% in NDCG@10 across three Amazon datasets. Furthermore, we perform GR training and online A/B tests on a prominent news recommendation platform. Our method achieves a 1.62% improvement in MRR and a 0.83% gain in StayTime/U versus baselines. Additionally, we employ R3-VAE to replace the item ID of CTR model, resulting in significant improvements in content cold start by 15.36%, corroborating the strong applicability and business value in industry-scale recommendation scenarios.

[IR-3] hink Before you Write: QA-Guided Reasoning for Character Descriptions in Books

【速读】:该论文旨在解决从长篇叙事文本(如小说)中生成准确角色描述的难题,该任务要求模型能够追踪角色属性的动态变化、整合分散在全文中的证据,并推断隐含细节。现有基于推理能力的大型语言模型(LLM)在该任务上的表现并未达到预期,反而在禁用内置推理机制(即空推理轨迹)时性能更优。为此,作者提出一种将推理与生成解耦的训练框架:首先由一个推理模型生成结构化的问答(QA)推理轨迹,用于显式建模角色信息;随后由生成模型基于该轨迹输出最终的角色描述。该方案的核心创新在于通过QA引导的推理路径提升生成结果的忠实性(faithfulness)、信息量(informativeness)和文本锚定性(grounding),并在BookWorm和CroSS两个数据集上显著优于强基线方法。

链接: https://arxiv.org/abs/2604.11435
作者: Argyrios Papoudakis,Mirella Lapata,Frank Keller
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 20 pages, 16 tables, 1 figure

点击查看摘要

Abstract:Character description generation is an important capability for narrative-focused applications such as summarization, story analysis, and character-driven simulations. However, generating accurate character descriptions from long-form narratives (e.g., novels) is challenging: models must track evolving attributes (e.g., relationships and events), integrate evidence scattered across the text, and infer implicit details. Despite the success of reasoning-enabled LLMs on many benchmarks, we find that for character description generation their performance improves when built-in reasoning is disabled (i.e., an empty reasoning trace). Motivated by this, we propose a training framework that decouples reasoning from generation. Our approach, which can be applied on top of long-context LLMs or chunk-based methods, consists of a reasoning model that produces a structured QA reasoning trace and a generation model that conditions on this trace to produce the final character description. Experiments on two datasets (BookWorm and CroSS) show that QA-guided reasoning improves faithfulness, informativeness, and grounding over strong long-context baselines.

[IR-4] Mycelium-Index: A Streaming Approximate Nearest Neighbor Index with Myelial Edge Decay Traffic-Driven Reinforcement and Adaptive Living Hierarchy

【速读】:该论文旨在解决高维向量空间中近似最近邻(Approximate Nearest Neighbor, ANN)索引在流式数据场景下的内存效率与查询性能瓶颈问题。现有方法在面对持续更新的数据流时,难以兼顾低内存占用、高吞吐量和高召回率。其解决方案的关键在于提出一种受生物菌丝网络自适应生长机制启发的新型索引结构——Mycelium Index,通过引入“菌丝边衰减与强化”机制、基于流量驱动的动态层次结构以及混合删除策略(冷节点采用O(1)跳过删除,热节点采用O(k)束搜索修复),实现了对图拓扑的持续自适应优化。实验表明,该方法在SIFT-1M数据集上以仅5.7倍更低的内存消耗(88 MB vs. 500 MB)达到与FreshDiskANN相当的召回率(0.927 ± 0.028 vs. ~0.95),同时查询每秒请求数(QPS)提升至4.7倍(2,795 vs. ~600),且在静态索引场景下也优于HNSW(RAM降低5.2倍,召回率相当)。此外,系统性研究揭示了高维ANN图中几何启发式修复机制普遍失效,而拓扑机制有效——这一发现被称为“高维ANN图的拓扑修复不变性”。

链接: https://arxiv.org/abs/2604.11274
作者: Anton Pakhunov
机构: Independent Researcher(独立研究员)
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
备注: 10 pages, 10 tables, 1 appendix

点击查看摘要

Abstract:We present mycelium-index, a streaming approximate nearest neighbor (ANN) index for high-dimensional vector spaces, inspired by the adaptive growth patterns of biological mycelium. The system continuously adapts its topology through myelial edge decay and reinforcement, a traffic-driven living hierarchy, and hybrid deletion combining O(1) bypass for cold nodes with O(k) beam-search repair for hub nodes. Experimental evaluation on SIFT-1M demonstrates that mycelium achieves 0.927 +/- 0.028 recall@5 under FreshDiskANN’s 100%-turnover benchmark protocol – within the measurement confidence interval of FreshDiskANN’s ~0.95 – while using 5.7x less RAM (88 MB vs. 500 MB) and achieving 4.7x higher QPS (2,795 vs. ~600). On the static index, at ef=192, mycelium matches HNSW M=16 recall (0.962 vs. 0.965) at 5.2x less RAM (163 MB vs. 854 MB). Performance optimizations including NEON SIMD distance computation, Vec-backed node storage, and bitset visited tracking yield a cumulative 2.7x QPS improvement. A systematic study of ten streaming repair mechanisms finds that geometric heuristics universally fail in high dimensions, while topological mechanisms succeed – a principle we term the topological repair invariance of high-dimensional ANN graphs.

[IR-5] Frugal Knowledge Graph Construction with Local LLM s: A Zero-Shot Pipeline Self-Consistency and Wisdom of Artificial Crowds

【速读】:该论文旨在解决知识图谱(Knowledge Graph, KG)构建与利用中依赖大量标注数据的难题,提出了一种完全基于本地推理、无需训练的多模态零样本(zero-shot)流水线方法。其核心解决方案是设计了一个可复现的评估框架,整合外部基准(DocRED、HotpotQA)、WebQuestionsSP风格合成数据及RAGAS评价指标,通过生成式AI(Generative AI)模型实现端到端的知识抽取与多跳推理任务。关键创新在于引入多样性机制(如自一致性采样和跨模型投票)提升复杂推理性能,并发现“共识悖论”——高一致性的输出可能反映集体幻觉而非可靠答案;此外,采用置信度路由级联策略(Phi-4 → GPT-OSS)进一步优化结果,在单张RTX 3090显卡上仅需约5小时即可完成全部推理,碳排放低至0.09 kg CO₂ eq,展现出高效且环境友好的零样本知识图谱构建能力。

链接: https://arxiv.org/abs/2604.11104
作者: Pierre Jourlin(LIA)
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: Source code and raw results available: this https URL (licence Hypocratic)

点击查看摘要

Abstract:This paper presents an empirical study of a multi-model zero-shot pipeline for knowledge graph construction and exploitation, executed entirely through local inference on consumer-grade hardware. We propose a reproducible evaluation framework integrating two external benchmarks (DocRED, HotpotQA), WebQuestionsSP-style synthetic data, and the RAGAS evaluation framework in an automated pipeline. On 500 document-level relations, our system achieves an F1 of 0.70 \pm 0.041 in zero-shot, compared to 0.80 for supervised DREEAM. Text-to-query achieves an accuracy of 0.80 \pm 0.06 on 200 samples. Multi-hop reasoning achieves an Exact Match (EM) of 0.46 \pm 0.04 on 500 HotpotQA questions, with a RAGAS faithfulness of 0.96 \pm 0.04 on 50 samples. Beyond the pipeline, we study diversity mechanisms for difficult multi-hop reasoning. On 181 questions unsolvable at zero temperature, self-consistency (k=5, T =0.7) recovers up to 23% EM with a single Mixture-of-Experts (MoE) model, but the cross-model oracle (3 architectures x 5 samples) reaches 46.4%. We highlight an agreement paradox: strong consensus among samples signals collective hallucination rather than a reliable answer, echoing the work of Moussaïd et al. on the wisdom of crowds. Extending to the full pipeline (500 questions), self-consistency (k=3) raises EM from 0.46 to 0.48 \pm 0.04. A confidence-routing cascade mechanism (Phi-4 \rightarrow GPT-OSS, k=5) achieves an EM of 0.55 \pm 0.04, the best result obtained, with 45.4% of questions rerouted. Finally, we show that V3 prompt engineering applied to other models does not reproduce the gains observed with Gemma-4, confirming the specific prompt/model interaction. The entire system runs in \sim 5 h on a single RTX 3090, without any training, for an estimated carbon footprint of 0.09 kg CO2 eq.

[IR-6] ARHN: Answer-Centric Relabeling of Hard Negatives with Open-Source LLM s for Dense Retrieval SIGIR2026

【速读】:该论文旨在解决神经检索模型(neural retrievers)在训练过程中因硬负例挖掘引入的标签噪声问题,即部分被标记为负例的文本实际上可能与查询相关或包含部分答案,从而导致监督信号不一致并降低检索效果。其解决方案的关键在于提出一种两阶段的以答案为中心的硬负例重标注框架(ARHN),第一阶段利用开源大语言模型(LLM)生成基于段落的答案片段或判断段落是否支持答案,第二阶段通过LLM驱动的列表级排序对候选集进行直接答案可得性排序,并将排序高于原始正例的段落重新标记为额外正例,同时剔除排序低于正例但包含答案片段的段落以避免模糊监督。该方法通过联合执行重标注和过滤策略,显著提升了训练数据的质量,从而增强检索模型性能。

链接: https://arxiv.org/abs/2604.11092
作者: Hyewon Choi,Jooyoung Choi,Hansol Jang,Hyun Kim,Chulmin Yun,ChangWook Jun,Stanley Jungkyu Choi
机构: LG AI Research( LG人工智能研究)
类目: Information Retrieval (cs.IR)
备注: Accepted to SIGIR 2026

点击查看摘要

Abstract:Neural retrievers are often trained on large-scale triplet data comprising a query, a positive passage, and a set of hard negatives. In practice, hard-negative mining can introduce false negatives and other ambiguous negatives, including passages that are relevant or contain partial answers to the query. Such label noise yields inconsistent supervision and can degrade retrieval effectiveness. We propose ARHN (Answer-centric Relabeling of Hard Negatives), a two-stage framework that leverages open-source LLMs to refine hard negative samples using answer-centric relevance signals. In the first stage, for each query-passage pair, ARHN prompts the LLM to generate a passage-grounded answer snippet or to indicate that the passage does not support an answer. In the second stage, ARHN applies an LLM-based listwise ranking over the candidate set to order passages by direct answerability to the query. Passages ranked above the original positive are relabeled to additional positives. Among passages ranked below the positive, ARHN excludes any that contain an answer snippet from the negative set to avoid ambiguous supervision. We evaluated ARHN on the BEIR benchmark under three configurations: relabeling only, filtering only, and their combination. Across datasets, the combined strategy consistently improves over either step in isolation, indicating that jointly relabeling false negatives and filtering ambiguous negatives yields cleaner supervision for training neural retrieval models. By relying strictly on open-source models, ARHN establishes a cost-effective and scalable refinement pipeline suitable for large-scale training. Comments: Accepted to SIGIR 2026 Subjects: Information Retrieval (cs.IR) Cite as: arXiv:2604.11092 [cs.IR] (or arXiv:2604.11092v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2604.11092 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-7] ATANT v1.1: Positioning Continuity Evaluation Against Memory Long-Context and Agent ic-Memory Benchmarks

【速读】:该论文旨在解决当前主流记忆评估基准(如LOCOMO、LongMemEval、BEAM、MemoryBench等)与ATANT v1.0所定义的“连续性”(continuity)这一系统属性之间缺乏明确区分的问题。其关键解决方案是通过结构化分析证明:现有基准平均仅覆盖ATANT v1.0定义的7项连续性属性中的0.43项(部分评分按0.5计),且无一评估能覆盖超过两项,从而揭示这些基准实际测量的是与连续性不同的能力维度。作者进一步指出方法缺陷(如LOCOMO参考实现中存在导致23%数据无法评分的空黄金标注错误),并提供ATANT v1.0的96%累积尺度得分与LOCOMO的8.8%得分作为校准对,以凸显两者差异源于测量目标不同而非性能差距,强调应避免将非连续性评估误认为连续性评价,从而推动领域对真正连续性属性的重视与投入。

链接: https://arxiv.org/abs/2604.10981
作者: Samuel Sameer Tanguturi
机构: Kenotic Labs
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Companion paper to arXiv:2604.06710 (ATANT v1.0). 12 pages, 1 table, 2 appendices. Related-work extension; does not modify the v1.0 standard

点击查看摘要

Abstract:ATANT v1.0 (arXiv:2604.06710) defined continuity as a system property with 7 required properties and introduced a 10-checkpoint, LLM-free evaluation methodology validated on a 250-story corpus. Since publication, a recurring reviewer and practitioner question has concerned not the framework itself but its relationship to a wider set of memory evaluations: LOCOMO, LongMemEval, BEAM, MemoryBench, Zep’s evaluation suite, Letta/MemGPT’s evaluations, and RULER. This companion paper, v1.1, does not modify the v1.0 standard. It closes a related-work gap that v1.0 left brief under page limits. We show by structural analysis that none of these benchmarks measures continuity as defined in v1.0: of the 7 required properties, the median existing eval covers 1 property, the mean covers 0.43 when partial credit is scored at 0.5, and no eval covers more than 2. We provide a cell-by-cell property-coverage matrix, identify methodological defects specific to each benchmark (including an empty-gold scoring bug in the LOCOMO reference implementation that renders 23% of its corpus unscorable by construction), and publish our reference implementation’s LOCOMO score (8.8%) alongside the structural reason that number is uninformative about continuity. We publish our 8.8% LOCOMO score alongside our 96% ATANT cumulative-scale score as a calibration pair: the 87-point divergence is evidence that the two benchmarks measure different properties, not that one system is an order of magnitude better than another. The position v1.1 takes is not adversarial: each benchmark measures a real capability. The claim is that none of them can adjudicate continuity, and conflating them with continuity evaluation has led the field to under-invest in the properties v1.0 names.

[IR-8] Multi-Faceted Continual Knowledge Graph Embedding for Semantic-Aware Link Prediction

【速读】:该论文旨在解决持续知识图谱嵌入(Continual Knowledge Graph Embedding, CKGE)中因实体语义随时间动态演变而导致的灾难性遗忘问题。现有方法通常将新旧知识混入同一嵌入空间,无法区分实体在不同时间点的多面语义特征,从而影响长期链接预测性能。其解决方案的关键在于提出一种多面CKGE框架(Multi-Faceted CKGE, MF-CKGE):在离线学习阶段,通过分离不同时期的知识到独立嵌入空间并引入语义解耦机制,避免知识纠缠与冗余;在在线推理阶段,基于语义重要性量化自动识别与查询相关的实体嵌入,抑制无关噪声干扰,从而提升语义感知的链接预测准确性。

链接: https://arxiv.org/abs/2604.10947
作者: Jing Qi,Yuxiang Wang,Zhiyuan Yu,Xiaoliang Xu,Yuanshi Zheng,Tianxing Wu
机构: Hangzhou Dianzi University (杭州电子科技大学); Xidian University (西安电子科技大学); Southeast University (东南大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Continual Knowledge Graph Embedding (CKGE) aims to continually learn embeddings for new knowledge, i.e., entities and relations, while retaining previously acquired knowledge. Most existing CKGE methods mitigate catastrophic forgetting via regularization or replaying old knowledge. They conflate new and old knowledge of an entity within the same embedding space to seek a balance between them. However, entities inherently exhibit multi-faceted semantics that evolve dynamically as their relational contexts change over time. A shared embedding fails to capture and distinguish these temporal semantic variations, degrading lifelong link prediction accuracy across snapshots. To address this, we propose a Multi-Faceted CKGE framework (MF-CKGE) for semantic-aware link prediction. During offline learning, MF-CKGE separates temporal old and new knowledge into distinct embedding spaces to prevent knowledge entanglement and employs semantic decoupling to reduce semantic redundancy, thereby improving space efficiency. During online inference, MF-CKGE adaptively identifies semantically query-relevant entity embeddings by quantifying their semantic importance, reducing interference from query-irrelevant noise. Experiments on eight datasets show that MF-CKGE achieves an average (maximum) improvement of 1.7% (2.7%) and 1.4% (3.8%) in MRR and Hits@10, respectively, over the best baseline. Our source code and datasets are available at: this https URL.

[IR-9] CMedTEB CARE: Benchmarking and Enabling Efficient Chinese Medical Retrieval via Asymmetric Encoders ACL2026

【速读】:该论文旨在解决中文医疗文本检索中高精度与低延迟难以兼得的问题,以及缺乏高质量、综合性基准测试数据集的瓶颈。其解决方案的关键在于提出一种名为Chinese Medical Asymmetric REtriever (CARE) 的异构架构:该架构采用轻量级BERT-style编码器处理在线查询,同时使用强大的大语言模型(LLM)编码器离线处理文档,从而在保持低推理延迟的同时提升检索性能。为应对双结构编码器带来的表征鸿沟问题,作者进一步设计了一种两阶段训练策略,逐步对齐查询与文档的语义空间,显著优于现有对称模型。

链接: https://arxiv.org/abs/2604.10937
作者: Angqing Jiang,Jianlyu Chen,Zhe Fang,Yongcan Wang,Xinpeng Li,Keyu Ding,Defu Lian
机构: University of Science and Technology of China (中国科学技术大学); State Key Laboratory of Cognitive Intelligence (认知智能国家重点实验室); iFlytek Research (科大讯飞研究院); HeFei Institute of Technology (合肥工业大学)
类目: Information Retrieval (cs.IR)
备注: 21 pages, 4 figures. Angqing Jiang and Jianlyu Chen contributed equally to this work. Keyu Ding is the corresponding author. Accepted by ACL 2026. Code and CMedTEB benchmark are available at this https URL

点击查看摘要

Abstract:Effective medical text retrieval requires both high accuracy and low latency. While LLM-based embedding models possess powerful retrieval capabilities, their prohibitive latency and high computational cost limit their application in real-time scenarios. Furthermore, the lack of comprehensive and high-fidelity benchmarks hinders progress in Chinese medical text retrieval. In this work, we introduce the Chinese Medical Text Embedding Benchmark (CMedTEB), a benchmark spanning three kinds of practical embedding tasks: retrieval, reranking, and semantic textual similarity (STS). Distinct from purely automated datasets, CMedTEB is curated via a rigorous multi-LLM voting pipeline validated by clinical experts, ensuring gold-standard label quality while effectively mitigating annotation noise. On this foundation, we propose the Chinese Medical Asymmetric REtriever (CARE), an asymmetric architecture that pairs a lightweight BERT-style encoder for online query encoding with a powerful LLM-based encoder for offline document encoding. However, optimizing such an asymmetric retriever with two structurally different encoders presents distinctive challenges. To address this, we introduce a novel two-stage training strategy that progressively bridges the query and document representations. Extensive experiments demonstrate that CARE surpasses state-of-the-art symmetric models on CMedTEB, achieving superior retrieval performance without increasing inference latency.

[IR-10] BDIViz in Action: Interactive Curation and Benchmarking for Schema Matching Methods

【速读】:该论文旨在解决数据集成中Schema匹配(Schema Matching)方法评估与比较所面临的基准多样性不足和缺乏交互式验证框架的问题。其解决方案的关键在于提出并演示了BDIViz系统的一个新扩展,该系统通过引入人类在环(human-in-the-loop)机制,实现对匹配结果的实时验证与迭代优化:一方面,用户可在交互式热力图中直接验证候选匹配,并借助协调视图(coordinated views)查看属性描述、示例值及分布以辅助决策;另一方面,基于大语言模型(LLM)的结构化解释生成能力为匹配选择提供可追溯的推理依据。此设计使新匹配算法可通过标准化接口接入,而用户的验证行为则转化为动态演进的“真实标签”(ground truth),从而支持高精度基准构建、算法性能实时评估与跨领域匹配行为对比分析。

链接: https://arxiv.org/abs/2604.10763
作者: Eden Wu,Christos Koutras,Cláudio T. Silva,Juliana Freire
机构: VIDA Center, New York University (纽约大学VIDA中心)
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Schema matching remains fundamental to data integration, yet evaluating and comparing matching methods is hindered by limited benchmark diversity and lack of interactive validation frameworks. BDIViz, recently published at IEEE VIS 2025, is an interactive visualization system for schema matching with LLM-assisted validation. Given source and target datasets, BDIViz applies automatic matching methods and visualizes candidates in an interactive heatmap with hierarchical navigation, zoom, and filtering. Users validate matches directly in the heatmap and inspect ambiguous cases using coordinated views that show attribute descriptions, example values, and distributions. An LLM assistant generates structured explanations for selected candidates to support decision-making. This demonstration showcases a new extension to BDIViz that addresses a critical need in data integration research: human-in-the-loop benchmarking and iterative matcher development. New matchers can be integrated through a standardized interface, while user validations become evolving ground truth for real-time performance evaluation. This enables benchmarking new algorithms, constructing high-quality ground-truth datasets through expert validation, and comparing matcher behavior across diverse schemas and domains. We demonstrate two complementary scenarios: (i) data harmonization, where users map a large tabular dataset to a target schema with value-level inspection and LLM-generated explanations; and (ii) developer-in-the-loop benchmarking, where developers integrate custom matchers, observe performance metrics, and refine their algorithms.

[IR-11] From Query to Conscience: The Importance of Information Retrieval in Empowering Socially Responsible Consumerism SIGIR’25

【速读】:该论文旨在解决消费者在追求社会负责任消费(socially responsible consumption)过程中因信息获取障碍而导致的“意图-行为差距”(intention-behaviour gap)问题,即消费者虽有伦理购物意愿,却常因信息不对称、搜索系统效率低下而难以落实决策。解决方案的关键在于重构信息检索(Information Retrieval, IR)领域的研究范式,提出三个相互关联的视角:一是将负责任消费视为一个信息提取问题以减少信息不对称;二是将产品搜索重新定义为需降低认知与操作负担的复杂任务,设计更友好的交互界面;三是将搜索过程重塑为知识校准机制,帮助消费者弥合认知缺口。通过这三重转变,IR系统可从单纯匹配查询与结果,升级为赋能消费者做出更知情、便捷且符合经济现实的伦理选择,从而实现从“查询”到“良知”的路径跃迁。

链接: https://arxiv.org/abs/2604.10751
作者: Frans van der Sluis,Leif Azzopardi,Florian Meier
机构: 未知
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
备注: 12 pages, 4 figures. Published in SIGIR '25 (ACM), pp. 3853-3864. Peer reviewed

点击查看摘要

Abstract:Millions of consumers search for products online each day, aiming to find items that meet their needs at an acceptable price. While price and quality are major factors in purchasing decisions, ethical considerations increasingly influence consumer behavior, giving rise to the socially responsible consumer. Insights from a recent survey of over 600 consumers reveal that many barriers to ethical shopping stem from information-seeking challenges, often leading to decisions made under uncertainty. These challenges contribute to the intention-behaviour gap, where consumers’ desire to make ethical choices is undermined by limited or inaccessible information and inefficacy of search systems in supporting responsible decision-making. In this perspectives paper, we argue that the field of Information Retrieval (IR) has a critical role to play by empowering consumers to make more informed and more responsible choices. We present three interrelated perspectives: (1) reframing responsible consumption as an information extraction problem aimed at reducing information asymmetries; (2) redefining product search as a complex task requiring interfaces that lower the cost and burden of responsible search; and (3) reimagining search as a process of knowledge calibration that helps consumers bridge gaps in awareness when making purchasing decisions. Taken together, these perspectives outline a path from query to conscience, one where IR systems help transform everyday product searches into opportunities for more ethical and informed choices. We advocate for the development of new and novel IR systems and interfaces that address the intricacies of socially responsible consumerism, and call on the IR community to build technologies that make ethical decisions more informed, convenient, and aligned with economic realities.

[IR-12] Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 框架在长文本生成任务中普遍存在的局限性,即过度依赖文本信息而忽视了现实世界专家报告中广泛存在的多模态证据(如图像、图表等),导致生成内容缺乏事实依据和多模态一致性。其核心解决方案是提出 Deep-Reporter——一个统一的代理式(agentic)框架,用于实现基于多模态证据的长篇生成。关键创新包括:(i) 代理式多模态搜索与过滤机制,以精准获取并筛选文本与高信息密度视觉内容;(ii) 基于清单的渐进式合成策略,确保图文融合连贯且引用位置合理;(iii) 循环上下文管理机制,在保持长程逻辑一致性的同时兼顾局部流畅性。通过构建包含8K高质量代理轨迹的数据集和M2LongBench多模态基准测试平台,实验证明该方案显著提升了多模态选择与整合能力,有效缩小了模型性能差距。

链接: https://arxiv.org/abs/2604.10741
作者: Fangda Ye,Zhifei Xie,Yuxin Hu,Yihang Yin,Shurui Huang,Shikai Dong,Jianzhu Bao,Shuicheng Yan
机构: National University of Singapore (新加坡国立大学); Nanyang Technological University (南洋理工大学); University of Edinburgh (爱丁堡大学); Beijing Institute of Technology (北京理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 41 pages, 6 figures, 8 tables. Code available at this https URL

点击查看摘要

Abstract:Recent agentic search frameworks enable deep research via iterative planning and retrieval, reducing hallucinations and enhancing factual grounding. However, they remain text-centric, overlooking the multimodal evidence that characterizes real-world expert reports. We introduce a pressing task: multimodal long-form generation. Accordingly, we propose Deep-Reporter, a unified agentic framework for grounded multimodal long-form generation. It orchestrates: (i) Agentic Multimodal Search and Filtering to retrieve and filter textual passages and information-dense visuals; (ii) Checklist-Guided Incremental Synthesis to ensure coherent image-text integration and optimal citation placement; and (iii) Recurrent Context Management to balance long-range coherence with local fluency. We develop a rigorous curation pipeline producing 8K high-quality agentic traces for model optimization. We further introduce M2LongBench, a comprehensive testbed comprising 247 research tasks across 9 domains and a stable multimodal sandbox. Extensive experiments demonstrate that long-form multimodal generation is a challenging task, especially in multimodal selection and integration, and effective post-training can bridge the gap.

[IR-13] HeceTokenizer: A Syllable-Based Tokenization Approach for Turkish Retrieval

【速读】:该论文旨在解决自然语言处理中词汇表外(Out-of-Vocabulary, OOV)词问题以及低资源场景下模型性能受限的问题,特别是在土耳其语这一具有高度音节规律性的语言上实现高效检索。其解决方案的关键在于利用土耳其语确定性的六模式音节结构(six-pattern phonological structure),构建一个闭合且无OOV的音节级分词器(HeceTokenizer),从而在仅需约8000个独特音节类型的情况下,替代传统基于形态学的分词方法,并结合轻量级BERT-tiny编码器(1.5M参数)与细粒度块检索策略,在TQuAD检索基准上达到50.3% Recall@5,显著优于使用200倍参数量的形态驱动基线模型(46.92%)。这表明土耳其语的音节规律性可作为强大的、资源消耗低的归纳偏置(inductive bias),提升检索任务效果。

链接: https://arxiv.org/abs/2604.10665
作者: Senol Gulgonul
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:HeceTokenizer is a syllable-based tokenizer for Turkish that exploits the deterministic six-pattern phonological structure of the language to construct a closed, out-of-vocabulary (OOV)-free vocabulary of approximately 8,000 unique syllable types. A BERT-tiny encoder (1.5M parameters) is trained from scratch on a subset of Turkish Wikipedia using a masked language modeling objective and evaluated on the TQuAD retrieval benchmark using Recall@5. Combined with a fine-grained chunk-based retrieval strategy, HeceTokenizer achieves 50.3% Recall@5, surpassing the 46.92% reported by a morphology-driven baseline that uses a 200 times larger model. These results suggest that the phonological regularity of Turkish syllables provides a strong and resource-light inductive bias for retrieval tasks.

[IR-14] On the Capacity of Distinguishable Synthetic Identity Generation under Face Verification

【速读】:该论文旨在解决生成式人脸识别系统中可区分身份生成能力的量化问题,即在固定验证阈值 τ 下,能够生成多少个合成身份(synthetic identities),使得同一身份对被判定为匹配、不同身份对被判定为非匹配。其核心解决方案在于将问题形式化为一个基于嵌入空间中单位超球面上的球码(spherical code)优化问题:在确定性视角下,身份生成能力由可实现嵌入集合上的球码问题刻画;在随机生成场景中,引入中心化模型并推导出身份中心间最小分离距离为 arccos(τ)+2ρ\arccos(\tau) + 2\rho(ρ 为同一身份内部嵌入的集中半径)的充分可容许条件,并由此获得随嵌入维度增长的指数级正下界。进一步地,通过引入先验约束的随机码容量模型,利用身份中心间的成对失败概率推导出高概率下界,在更强的全支撑支持假设下甚至得到精确的球码表征与反向结论。

链接: https://arxiv.org/abs/2604.10641
作者: Behrooz Razeghi
机构: Harvard University (哈佛大学)
类目: Information Theory (cs.IT); Information Retrieval (cs.IR); Probability (math.PR); Applications (stat.AP)
备注:

点击查看摘要

Abstract:We study how many synthetic identities can be generated so that a face verifier declares same-identity pairs as matches and different-identity pairs as non-matches at a fixed threshold \tau . We formalize this question for a generative face-recognition pipeline consisting of a generator followed by a normalized recognition map with outputs on the unit hypersphere. We define the capacity of distinguishable identity generation as the largest number of latent identities whose induced embedding distributions satisfy prescribed same-identity and different-identity verification constraints. In the deterministic view-invariant regime, we show that this capacity is characterized by a spherical-code problem over the realizable set of embeddings, and reduces to the classical spherical-code quantity under a full angular expressivity assumption. For stochastic identity generation, we introduce a centered model and derive a sufficient admissibility condition in which the required separation between identity centers is \arccos(\tau)+2\rho , where \rho is a within-identity concentration radius. Under full angular expressivity, this yields spherical-code-based achievable lower bounds and a positive asymptotic lower bound on the exponential growth rate with embedding dimension. We also introduce a prior-constrained random-code capacity, in which latent identities are sampled independently from a given prior, and derive high-probability lower bounds in terms of pairwise separation-failure probabilities of the induced identity centers. Under a stronger full-cap-support model, we obtain a converse and an exact spherical-code characterization.

[IR-15] BMdataset: A Musicologically Curated LilyPond Dataset

【速读】:该论文旨在解决符号音乐研究长期依赖MIDI格式数据集、而文本编码的乐谱格式(如LilyPond)在音乐理解任务中尚未被充分探索的问题。其解决方案的关键在于构建了一个由专家手工转录的高质量LilyPond乐谱数据集BMdataset(包含393部巴洛克时期作品共2,646个乐章),并在此基础上开发了针对符号音乐优化的LilyBERT模型——基于CodeBERT架构,通过扩展115个LilyPond专属词元并进行掩码语言建模预训练。实验表明,尽管BMdataset规模较小(约90M词元),其微调性能优于在超大规模但噪声较多的PDMX语料(约15B词元)上连续预训练的结果,验证了小规模专家标注数据在音乐理解中的有效性;同时结合广义预训练与领域特定微调可获得最佳效果(作曲家识别准确率达84.3%),说明二者具有互补性。

链接: https://arxiv.org/abs/2604.10628
作者: Matteo Spanio,Ilay Guler,Antonio Rodà
机构: University of Padua (帕多瓦大学); Boston University (波士顿大学)
类目: ound (cs.SD); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Submitted to SMC2026

点击查看摘要

Abstract:Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT (weights can be found at this https URL), a CodeBERT-based encoder adapted to symbolic music through vocabulary extension with 115 LilyPond-specific tokens and masked language model pre-training. Linear probing on the out-of-domain Mutopia corpus shows that, despite its modest size (~90M tokens), fine-tuning on BMdataset alone outperforms continuous pre-training on the full PDMX corpus (~15B tokens) for both composer and style classification, demonstrating that small, expertly curated datasets can be more effective than large, noisy corpora for music understanding. Combining broad pre-training with domain-specific fine-tuning yields the best results overall (84.3% composer accuracy), confirming that the two data regimes are complementary. We release the dataset, tokenizer, and model to establish a baseline for representation learning on LilyPond.

[IR-16] NSFL: A Post-Training Neuro-Symbolic Fuzzy Logic Framework for Boolean Operators in Neural Embeddings

【速读】:该论文旨在解决标准密集检索器在处理多原子逻辑约束(multi-atom logical constraints)时缺乏原生计算机制的问题。传统方法往往因几何空间中的表示坍缩(representation collapse)和流形逃逸(manifold escape)导致逻辑语义失真。其解决方案的核心在于提出神经符号模糊逻辑(Neuro-Symbolic Fuzzy Logic, NSFL),该框架将形式化的t-范数(t-norm)与t-余范数(t-conorm)适配至神经嵌入空间,无需重新训练即可实现逻辑运算;通过引入神经符号增量(Neuro-Symbolic Deltas, NS-Delta)来引导上下文融合的首阶边际差异,在保留原子语义的同时捕捉领域依赖性,从而稳定高维空间中的逻辑推理过程。此外,结合球面查询优化(Spherical Query Optimization, SQO)实现高效实时检索,显著提升多模态场景下的检索精度(mAP最高提升81%)。

链接: https://arxiv.org/abs/2604.10604
作者: Vladi Vexler,Ofer Idan,Gil Lederman,Dima Sivov
机构: Huawei Tel-Aviv Research Center (华为特拉维夫研究中心)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 23 pages (16 main + 7 appendix), 2 figures, 10 tables, 1 algorithm

点击查看摘要

Abstract:Standard dense retrievers lack a native calculus for multi-atom logical constraints. We introduce Neuro-Symbolic Fuzzy Logic (NSFL), a framework that adapts formal t-norms and t-conorms to neural embedding spaces without requiring retraining. NSFL operates as a first-order hybrid calculus: it anchors logical operations on isolated zero-order similarity scores while actively steering representations using Neuro-Symbolic Deltas (NS-Delta) – the first-order marginal differences derived from contextual fusion. This preserves pure atomic meaning while capturing domain reliance, preventing the representation collapse and manifold escape endemic to traditional geometric baselines. For scalable real-time retrieval, Spherical Query Optimization (SQO) leverages Riemannian optimization to project these fuzzy formulas into manifold-stable query vectors. Validated across six distinct encoder configurations and two modalities (including zero-shot and SOTA fine-tuned models), NSFL yields mAP improvements up to +81%. Notably, NSFL provides an additive 20% average and up to 47% boost even when applied to encoders explicitly fine-tuned for logical reasoning. By establishing a training-free, order-aware calculus for high-dimensional spaces, this framework lays the foundation for future dynamic scaling and learned manifold logic.

[IR-17] Evaluating Small Open LLM s for Medical Question Answering: A Practical Framework

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗问答场景中输出一致性不足的问题,尤其是在在线健康社区等高风险环境中,仅追求平均准确率无法保证模型的可靠性。其关键解决方案是提出一个开源、实用的评估框架,将**可重复性(reproducibility)**作为与词法和语义准确性并列的核心指标,通过在相同输入下进行多次推理(N=10次/问题)量化模型输出的一致性,并结合BERTScore、ROUGE-L及LLM-as-judge等多种质量指标对三个本地部署的小规模开放权重模型(Llama 3.1 8B、Gemma 3 12B、MedGemma 1.5 4B)进行全面评估。结果表明,即使采用低温度生成策略(T=0.2),模型自一致性最高仅达0.20,且87–97%的输出具有唯一性,揭示了传统单次推理基准未能捕捉的安全缺口。

链接: https://arxiv.org/abs/2604.10535
作者: Avi-ad Avraam Buskila
机构: Bar-Ilan University (巴伊兰大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Incorporating large language models (LLMs) in medical question answering demands more than high average accuracy: a model that returns substantively different answers each time it is queried is not a reliable medical tool. Online health communities such as Reddit have become a primary source of medical information for millions of users, yet they remain highly susceptible to misinformation; deploying LLMs as assistants in these settings amplifies the need for output consistency alongside correctness. We present a practical, open-source evaluation framework for assessing small, locally-deployable open-weight LLMs on medical question answering, treating reproducibility as a first-class metric alongside lexical and semantic accuracy. Our pipeline computes eight quality metrics, including BERTScore, ROUGE-L, and an LLM-as-judge rubric, together with two within-model reproducibility metrics derived from repeated inference (N=10 runs per question). Evaluating three models (Llama 3.1 8B, Gemma 3 12B, MedGemma 1.5 4B) on 50 MedQuAD questions (N=1,500 total responses) reveals that despite low-temperature generation (T=0.2), self-agreement across runs reaches at most 0.20, while 87-97% of all outputs per model are unique – a safety gap that single-pass benchmarks entirely miss. The clinically fine-tuned MedGemma 1.5 4B underperforms the larger general-purpose models on both quality and reproducibility; however, because MedGemma is also the smallest model, this comparison confounds domain fine-tuning with model scale. We describe the methodology in sufficient detail for practitioners to replicate or extend the evaluation for their own model-selection workflows. All code and data pipelines are available at this https URL.

[IR-18] SID-Coord: Coordinating Semantic IDs for ID-based Ranking in Short-Video Search SIGIR-2026

【速读】:该论文旨在解决大规模短视频搜索排序模型中因依赖哈希物品标识符(Hashed Item Identifiers, HIDs)而导致的“记忆-泛化权衡”问题:即模型在高频交互项上表现良好,但在长尾低曝光物品上的泛化能力不足。解决方案的关键在于提出一种轻量级语义ID框架SID-Coord,其核心创新是将可训练的离散语义ID(Semantic IDs, SIDs)直接嵌入到原有基于HID的排序模型中,通过结构化的语义标识实现对HID记忆与SID泛化的统一建模。具体包括三个关键组件:基于注意力机制的分层SID融合模块以捕获多粒度语义、目标感知的HID-SID门控机制以动态平衡记忆与泛化、以及SID驱动的兴趣对齐模块以建模目标项与用户历史之间的语义相似性分布。该方案无需修改现有主干模型即可部署,实验证明其显著提升了长视频播放率和搜索播放时长。

链接: https://arxiv.org/abs/2604.10471
作者: Guowen Li,Yuepeng Zhang,Shunyu Zhang,Yi Zhang,Xiaoze Jiang,Yi Wang,Jingwei Zhuo
机构: Kuaishou Technology (快手科技)
类目: Information Retrieval (cs.IR)
备注: SIGIR-2026

点击查看摘要

Abstract:Large-scale short-video search ranking models are typically trained on sparse co-occurrence signals over hashed item identifiers (HIDs). While effective at memorizing frequent interactions, such ID-based models struggle to generalize to long-tailed items with limited exposure. This memorization-generalization trade-off remains a longstanding challenge in such industrial systems. We propose SID-Coord, a lightweight Semantic ID framework that incorporates discrete, trainable semantic IDs (SIDs) directly into ID-based ranking models. Instead of treating semantic signals as auxiliary dense features, SID-Coord represents semantics as structured identifiers and coordinates HID-based memorization with SID-based generalization within a unified modeling framework. To enable effective coordination, SID-Coord introduces three components: (1) an attention-based fusion module over hierarchical SIDs to capture multi-level semantics, (2) a target-aware HID-SID gating mechanism that adaptively balances memorization and generalization, and (3) a SID-driven interest alignment module that models the semantic similarity distribution between target items and user histories. SID-Coord can be integrated into existing production ranking systems without modifying the backbone model. Online A/B experiments in a real-world production environment show statistically significant improvements, with a +0.664% gain in long-play rate in search and a +0.369% increase in search playback duration.

[IR-19] Hijacking Text Heritage: Hiding the Human Signature through Homoglyphic Substitution

【速读】:该论文旨在解决个人隐私在文本数据中因风格分析(stylometry)而被泄露的问题,尤其关注用户在社交媒体等平台上的自愿性文字发布可能暴露敏感信息(如年龄范围和地理位置),其危害程度可与政府身份证件泄露相媲美。传统防护手段(如减少身份证件共享)对文本数据无效,因此论文提出通过对抗式风格分析(adversarial stylometry)作为解决方案,其关键在于采用同形异义字符替换(homoglyph substitution)技术——即用视觉上相似但编码不同的字符替代原文本中的字符(例如将拉丁字母“h”替换为西里尔字母“h”),从而有效干扰基于文本风格特征的识别系统,降低其推断个体身份信息的能力。

链接: https://arxiv.org/abs/2604.10271
作者: Robert Dilworth
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 30 pages, 9 figures

点击查看摘要

Abstract:In what way could a data breach involving government-issued IDs such as passports, driver’s licenses, etc., rival a random voluntary disclosure on a nondescript social-media platform? At first glance, the former appears more significant, and that is a valid assessment. The disclosed data could contain an individual’s date of birth and address; for all intents and purposes, a leak of that data would be disastrous. Given the threat, the latter scenario involving an innocuous online post seems comparatively harmless–or does it? From that post and others like it, a forensic linguist could stylometrically uncover equivalent pieces of information, estimating an age range for the author (adolescent or adult) and narrowing down their geographical location (specific country). While not an exact science–the determinations are statistical–stylometry can reveal comparable, though noticeably diluted, information about an individual. To prevent an ID from being breached, simply sharing it as little as possible suffices. Preventing the leakage of personal information from written text requires a more complex solution: adversarial stylometry. In this paper, we explore how performing homoglyph substitution–the replacement of characters with visually similar alternatives (e.g., “h” \texttt[U+0068] \rightarrow “h” \texttt[U+04BB] )–on text can degrade stylometric systems.

[IR-20] Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

【速读】:该论文旨在解决多向量模型(multi-vector models)在视觉文档检索(Visual Document Retrieval, VDR)中因存储和计算成本过高而难以实际部署的问题。其解决方案的关键在于提出了一种名为ColChunk的即插即用框架,通过引入多模态晚期分块(multimodal late chunking)策略,利用patch级嵌入的层次聚类并融合2D位置先验,实现自适应的内容感知分组,从而在大幅降低向量数量的同时保持全局语义一致性与空间结构信息,显著提升检索效率与精度。

链接: https://arxiv.org/abs/2604.10167
作者: Yibo Yan,Mingdong Ou,Yi Cao,Jiahao Huo,Xin Zou,Shuliang Liu,James Kwok,Xuming Hu
机构: Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); Alibaba Cloud Computing(阿里云计算); Hong Kong University of Science and Technology(香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Preprint

点击查看摘要

Abstract:Multi-vector models dominate Visual Document Retrieval (VDR) due to their fine-grained matching capabilities, but their high storage and computational costs present a major barrier to practical deployment. In this paper, we propose ColChunk, a plug-and-play framework that introduces multimodal late chunking to construct efficient, contextualized multi-vectors. Unlike existing pruning or fixed-token approaches, ColChunk employs hierarchical clustering on patch-level embeddings, fused with a 2D position prior to ensure spatial-semantic coherence. This adaptive grouping allows for a content-aware representation that preserves global context while drastically reducing the vector count. Evaluations across 24 VDR datasets demonstrate ColChunk achieves over a 90% reduction in storage requirements while simultaneously delivering a 9-point average improvement in nDCG@5 across representative single-vector models. ColChunk provides a practical solution for balancing retrieval accuracy and efficiency in visual document systems.

[IR-21] MOSAIC: Multi-Domain Orthogonal Session Adaptive Intent Capture for Prescient Recommendations

【速读】:该论文旨在解决多域会话推荐系统中用户意图跨异构行为域难以准确捕捉的问题,现有方法常因无法区分域内与跨域交互的贡献,导致用户表征不够丰富且难以迁移。其解决方案的关键在于提出MOSAIC框架,通过显式分解用户偏好为三个正交成分:域特定(domain-specific)、域共通(domain-common)和跨序列独有(cross-sequence-exclusive)表示,并采用三编码器架构分别建模,结合域掩码目标与梯度反转层实现对抗训练以强化分离;同时引入动态门控机制在每个时间步调节各成分权重,从而构建统一且时序自适应的会话级用户表示,确保表征对齐与相互独立性联合优化,显著提升推荐精度并提供可解释性。

链接: https://arxiv.org/abs/2604.10147
作者: Abderaouf Bahi,Mourad Boughaba,Ibtissem Gasmi,Warda Deghmane,Amel Ourici
机构: Chadli Bendjedid University (查德利·本杰迪德大学); Badji Mokhtar University (巴吉·穆克塔大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Capturing user intent across heterogeneous behavioral domains stands as a fundamental challenge in session-based recommender systems. Yet, existing multi-domain approaches frequently fail to isolate the distinct contribution of cross-domain interactions from those arising within individual domains, limiting their ability to build rich and transferable user representations. In this work, we propose MOSAIC, a Multi-Domain Orthogonal Session Adaptive Intent Capture framework that explicitly factorizes user preferences into three orthogonal components: domain-specific, domain-common, and cross-sequence-exclusive representations. Our approach employs a triple-encoder architecture, where each encoder is dedicated to one preference type, enforced through domain masking objectives and adversarial training via a gradient reversal layer. Representational alignment and mutual independence constraints are jointly optimized to ensure clean preference separation. Additionally, a dynamic gating mechanism modulates the relative contribution of each component at every timestep, yielding a unified and temporally adaptive session-level user representation. We conduct extensive experiments on two large-scale real-world benchmarks spanning multiple domains and interaction types. The ablation study validates that each component domain-specific encoding, domain-common modeling, cross-sequence representation, and dynamic gating contributes meaningfully to the overall performance. Experimental results demonstrate that MOSAIC consistently outperforms state-of-the-art baselines in recommendation accuracy, while simultaneously providing interpretable insights into the interplay between domain-specific and cross-domain preference signals. These findings highlight the potential of orthogonal preference decomposition as a principled strategy for next-generation multi-domain recommender systems.

[IR-22] HARPO: Hierarchical Agent ic Reasoning for User-Aligned Conversational Recommendation ACL2026

【速读】:该论文旨在解决当前对话式推荐系统(Conversational Recommender Systems, CRSs)在实际应用中推荐质量低下问题,即现有方法虽在召回率(Recall@K)、BLEU等代理指标上表现优异,但未能有效优化用户对推荐结果的满意度与契合度。其核心问题是:现有模型聚焦于中间目标(如检索准确性、生成流畅性或工具调用),而非直接优化多维推荐质量本身。解决方案的关键在于提出HARPO(Hierarchical Agentic Reasoning with Preference Optimization)框架,通过三个核心机制实现:(i) 分层偏好学习,将推荐质量解耦为可解释维度(相关性、多样性、预测用户满意度和参与度),并学习上下文相关的权重;(ii) 基于预训练价值网络的推理树搜索,评估候选推理路径时以预测推荐质量为依据,而非任务完成度;(iii) 通过虚拟工具操作(Virtual Tool Operations)和多智能体精炼实现领域无关的推理抽象,提升跨领域的推荐泛化能力。实验表明,HARPO在ReDial、INSPIRED和MUSE数据集上显著优于基线,在推荐导向指标上取得一致提升,验证了显式用户对齐质量优化的重要性。

链接: https://arxiv.org/abs/2604.10048
作者: Subham Raj,Aman Vaibhav Jha,Mayank Anand,Sriparna Saha
机构: Indian Institute of Technology Patna (印度理工学院巴特那分校); Indian Institute of Information Technology Allahabad (印度信息技术学院阿拉哈巴德分校)
类目: Information Retrieval (cs.IR)
备注: Accepted at the Annual Meeting of the Association for Computational Linguistics (ACL 2026)

点击查看摘要

Abstract:Conversational recommender systems (CRSs) operate under incremental preference revelation, requiring systems to make recommendation decisions under uncertainty. While recent approaches particularly those built on large language models achieve strong performance on standard proxy metrics such as Recall@K and BLEU, they often fail to deliver high-quality, user-aligned recommendations in practice. This gap arises because existing methods primarily optimize for intermediate objectives like retrieval accuracy, fluent generation, or tool invocation, rather than recommendation quality itself. We propose HARPO (Hierarchical Agentic Reasoning with Preference Optimization), an agentic framework that reframes conversational recommendation as a structured decision-making process explicitly optimized for multi-dimensional recommendation quality. HARPO integrates hierarchical preference learning that decomposes recommendation quality into interpretable dimensions (relevance, diversity, predicted user satisfaction, and engagement) and learns context-dependent weights over these dimensions; (ii) deliberative tree-search reasoning guided by a learned value network that evaluates candidate reasoning paths based on predicted recommendation quality rather than task completion; and (iii) domain-agnostic reasoning abstractions through Virtual Tool Operations and multi-agent refinement, enabling transferable recommendation reasoning across domains. We evaluate HARPO on ReDial, INSPIRED, and MUSE, demonstrating consistent improvements over strong baselines on recommendation-centric metrics while maintaining competitive response quality. These results highlight the importance of explicit, user-aligned quality optimization for conversational recommendation.

[IR-23] Self-Distilled Reinforcement Learning for Co-Evolving Agent ic Recommender Systems

【速读】:该论文旨在解决当前基于大语言模型的智能体推荐系统(Agentic Recommender Systems, ARS)在强化学习优化中存在的两大核心问题:一是现有方法未能充分捕捉推荐过程中推荐者智能体与用户智能体之间的动态交互特性,导致无法利用交互反馈生成内生监督信号;二是现有强化学习方法将多轮交互过程简化为最终结果,忽略了轨迹中蕴含的密集监督信息。解决方案的关键在于提出一种自蒸馏强化学习框架 CoARS,其核心创新包括两个互补的学习机制:交互奖励(interaction reward),从同一交互轨迹中提取耦合的任务级监督信号以同步优化推荐者和用户智能体;以及自蒸馏信用分配(self-distilled credit assignment),通过教师-学生条件约束将历史交互轨迹转化为token级别的信用信号,从而将交互经验内化到模型参数中,实现推荐决策能力的持续学习与提升。

链接: https://arxiv.org/abs/2604.10029
作者: Zongwei Wang,Min Gao,Hongzhi Yin,Junliang Yu,Tong Chen,Shazia Sadiq,Tianrui Li
机构: Chongqing University (重庆大学); Griffith University (格里菲斯大学); The University of Queensland (昆士兰大学); Southwest Jiaotong University (西南交通大学)
类目: Information Retrieval (cs.IR)
备注: 11 pages

点击查看摘要

Abstract:Large language model-empowered agentic recommender systems (ARS) reformulate recommendation as a multi-turn interaction between a recommender agent and a user agent, enabling iterative preference elicitation and refinement beyond conventional one-shot prediction. However, existing ARS are mainly optimized in a Reflexion-style paradigm, where past interaction trajectories are stored as textual memory and retrieved as prompt context for later reasoning. Although this design allows agents to recall prior feedback and observations, the accumulated experience remains external to model parameters, leaving agents reliant on generic reasoning rather than progressively acquiring recommendation-specific decision-making ability through learning. Reinforcement learning (RL) therefore provides a natural way to internalize such interaction experience into parameters. Yet existing RL methods for ARS still suffer from two key limitations. First, they fail to capture the interactive nature of ARS, in which the recommender agent and the user agent continuously influence each other and can naturally generate endogenous supervision through interaction feedback. Second, they reduce a rich multi-turn interaction process to final outcomes, overlooking the dense supervision embedded throughout the trajectory. To this end, we propose CoARS, a self-distilled reinforcement learning framework for co-evolving agentic recommender systems. CoARS introduces two complementary learning schemes: interaction reward, which derives coupled task-level supervision for the recommender agent and the user agent from the same interaction trajectory, and self-distilled credit assignment, which converts historical trajectories into token-level credit signals under teacher-student conditioning. Experiments on multiple datasets show that CoARS outperforms representative ARS baselines in recommendation performance and user alignment.

[IR-24] Reproduction Beyond Benchmarks: ConstBERT and ColBERT-v2 Across Backends and Query Distributions SIGIR2026 SIGIR

【速读】:该论文旨在解决多向量检索(multi-vector retrieval)模型在实际应用中缺乏可复现性的问题,特别是针对架构设计对模型鲁棒性的影响。研究发现,尽管ConstBERT和ColBERT-v2在标准数据集MS-MARCO上表现一致(MRR@10差异小于0.05%),但在长篇叙事类查询(TREC ToT 2025)上性能骤降86–97%,其根本原因在于MaxSim操作符采用统一的词元权重分配机制,无法有效区分信息信号与冗余噪声,导致性能在约20个词处达到平台期。此外,未记录的后端参数(如稀疏质心覆盖)引入了高达8点的性能差距,且增加3倍训练数据反而使性能下降最多达29%。因此,论文指出:仅靠微调或数据增强无法克服由架构限制带来的性能瓶颈,关键解决方案在于重构模型架构以实现更灵活的词元权重感知能力,而非单纯依赖适应策略。

链接: https://arxiv.org/abs/2604.09982
作者: Utshab Kumar Ghosh,Ashish David,Shubham Chatterjee
机构: Missouri University of Science and Technology (密苏里科技大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 9 tables. Accepted to the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2026)

点击查看摘要

Abstract:Reproducibility must validate architectural robustness, not just numerical accuracy. We evaluate ColBERT-v2 and ConstBERT across five dimensions, finding that while ConstBERT reproduces within 0.05% MRR@10 on MS-MARCO, both models show a drop of 86-97% on long, narrative queries (TREC ToT 2025). Ablations prove this failure is architectural: performance plateaus at 20 words because the MaxSim operator’s uniform token weighting cannot distinguish signal from filler noise. Furthermore, undocumented backend parameters create an 8-point gap due to ConstBERT’s sparse centroid coverage, and fine-tuning with 3x more data actually degrades performance by up to 29%. We conclude that architectural constraints in multi-vector retrieval cannot be overcome by adaptation alone. Code: this https URL.

[IR-25] All Eyes on the Ranker: Participatory Auditing to Surface Blind Spots in Ranked Search Results

【速读】:该论文旨在解决当前搜索引擎评估中忽视用户视角下问责制、伤害感知与信任体验的问题,尤其在排名模型日益复杂和语义理解能力增强的背景下,传统以专家为中心的评估方法难以揭示用户对搜索结果影响的因果认知与情境理解。其解决方案的关键在于引入参与式审计(participatory auditing),通过组织三场包含21名参与者的工作坊,让使用者在真实交互任务中对比不同排序算法(如BM25与MonoT5)、探索透明度与用户控制权的影响,并面对恶意操纵的排名场景,从而激发用户构建从系统特性到社会后果的因果叙事。研究最终提出一个涵盖知识论、表征性、基础设施及下游社会影响的用户感知影响分类体系,同时发现神经排序模型因高可信度削弱了用户的批判性审视能力,凸显出对完整搜索流水线可见性和救济机制的需求,表明参与式审计虽能揭示传统评估未覆盖的问责缺口,但也存在自身局限性。

链接: https://arxiv.org/abs/2604.09946
作者: Anna Marie Rezk,Patrizia Di Campli San Vito,Ayah Soufan,Graham McDonald,Craig Macdonald,Iadh Ounis
机构: University of Glasgow(格拉斯哥大学); University of Strathclyde(斯特拉斯克莱德大学)
类目: Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注: 16 pages (23 with appendix), 3 figures, FAccT 2026 conference

点击查看摘要

Abstract:Search engines that present users with a ranked list of search results are a fundamental technology for providing public access to information. Evaluations of such systems are typically conducted by domain experts and focus on model-centric metrics, relevance judgments, or output-based analyses, rather than on how accountability, harm, or trust are experienced by users. This paper argues that participatory auditing is essential for revealing users’ causal and contextual understandings of how ranked search results produce impacts, particularly as ranking models appear increasingly convincing and sophisticated in their semantic interpretation of user queries. We report on three participatory auditing workshops (n=21) in which participants engaged with a custom search interface across four tasks, comparing a lexical ranker (BM25) and a neural semantic reranker (MonoT5), exploring varying levels of transparency and user controls, and examining an intentionally adversarially manipulated ranking. Reflexive activities prompted participants to articulate causal narratives linking search system properties to broader impacts. Synthesising the findings, we contribute a taxonomy of user-perceived impacts of ranked search results, spanning epistemic, representational, infrastructural, and downstream social impacts. However, interactions with the neural model revealed limits to participatory auditing itself: perceived system competence and accumulated trust reduced critical scrutiny during the workshop, allowing manipulations to go undetected. Participants expressed desire for visibility into the full search pipeline and recourse mechanisms. Together, these findings show how participatory auditing can surface user perceived impacts and accountability gaps that remain unseen when relying on conventional audits, while revealing where participatory auditing may encounter limitations.

[IR-26] Exploring Structural Complexity in Normative RAG with Graph-based approaches: A case study on the ETSI Standards

【速读】:该论文旨在解决工业标准与规范性文档(normative documents)在使用大语言模型(Large Language Models, LLMs)直接处理时所面临的挑战,包括其复杂的层级结构、领域特定术语及广泛的交叉引用依赖关系。传统“朴素”的基于向量的检索增强生成(Retrieval-Augmented Generation, RAG)方法难以捕捉此类文档中隐含的结构和语义关系。为此,论文提出了一种针对标准与监管文档独特结构和词汇特征定制的轻量级、低延迟RAG方法,核心创新在于将文档结构信息显式嵌入检索流程,通过图结构(Graph-based)建模实现关系感知的检索机制,从而提升检索性能。实验基于ETSI EN 301 489系列公开标准进行验证,结果表明引入结构与词汇信息可有效改善检索效果,为自动化规范性文档解析提供了可扩展的框架。

链接: https://arxiv.org/abs/2604.09868
作者: Aiman Al Masoud,Marco Arazzi,Simone Germani,Antonino Nocera
机构: 1. University of Bologna (博洛尼亚大学); 2. Politecnico di Milano (米兰理工大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 6 pages, 7 figures

点击查看摘要

Abstract:Industrial standards and normative documents exhibit intricate hierarchical structures, domain-specific lexicons, and extensive cross-referential dependencies, which making it challenging to process them directly by Large Language Models (LLMs). While Retrieval-Augmented Generation (RAG) provides a computationally efficient alternative to LLM fine-tuning, standard “vanilla” vector-based retrieval may fail to capture the latent structural and relational features intrinsic in normative documents. With the objective of shedding light on the most promising technique for building high-performance RAG solutions for normative, standards, and regulatory documents, this paper investigates the efficacy of Graph RAG architectures, which represent information as interconnected nodes, thus moving from simple semantic similarity toward a more robust, relation-aware retrieval mechanism. Despite the promise of graph-based techniques, there is currently a lack of empirical evidence as to which is the optimal indexing strategy for technical standards. Therefore, to help solve this knowledge gap, we propose a specialized RAG methodology tailored to the unique structure and lexical characteristics of standards and regulatory documents. Moreover, to keep our investigation grounded, we focus on well-known public standards, such as the ETSI EN 301 489 series. We evaluate several lightweight and low-latency strategies designed to embed document structure directly into the retrieval workflow. The considered approaches are rigorously tested against a custom synthesized QA dataset, facilitating a quantitative performance analysis. Our experimental results demonstrate that the incorporation of structural and lexical information into the index can enhance, at least to some extent, retrieval performance, providing a scalable framework for automated normative and standards elaboration. Comments: 6 pages, 7 figures Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2604.09868 [cs.IR] (or arXiv:2604.09868v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2604.09868 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Aiman Al Masoud Mr [view email] [v1] Sat, 31 Jan 2026 17:00:43 UTC (1,018 KB) Full-text links: Access Paper: View a PDF of the paper titled Exploring Structural Complexity in Normative RAG with Graph-based approaches: A case study on the ETSI Standards, by Aiman Al Masoud and 3 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.IR prev | next new | recent | 2026-04 Change to browse by: cs cs.AI cs.CL References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[IR-27] A Mathematical Theory of Ranking

【速读】:该论文旨在解决排序系统(ranking systems)中对个体因素贡献度量化的问题,即如何从标量得分中准确解析出各因子对最终排序结果的局部与全局影响。传统方法依赖绝对得分进行归因,但作者指出排序本质上仅由成对比较决定,因此提出以**成对边际(pairwise margins)**为核心构建理论框架。解决方案的关键在于:在线性情形下,成对边际可精确分解为各因子的局部贡献,并由此导出唯一满足纯因子精炼(pure factor refinement)一致性的L₁局部影响力分配规则;进一步地,通过聚合局部份额得到全局影响力结构——其在对数绝对权重坐标下对应一个凸势函数的梯度,雅可比矩阵为竞争图拉普拉斯算子,且影响力交换满足有限能量恒等式及零交换刚性律。对于非线性评分函数,则引入路径依赖的因子分解机制,并证明了交互曲率定理:因子路径归因路径无关当且仅当相关混合偏导数消失,从而在加法模型中恢复完全因子唯一性。该框架通过局部线性化和成对积分梯度扩展至非线性场景,并延伸至排列空间、得分超平面穿越、离散精确性、三角旋度、Hodge类诊断及根空间/Weyl室几何等多维解释闭包,形成一套基于“成对优先”的统一分析体系。

链接: https://arxiv.org/abs/2604.09733
作者: Yin Cheng
机构: 未知
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Ranking systems produce ordered lists from scalar scores, yet the ranking itself depends only on pairwise comparisons. We develop a mathematical theory that takes this observation seriously, centering the analysis on pairwise margins rather than absolute scores. In the linear case, each pairwise margin decomposes exactly into factor-level contributions. We prove that the resulting L_1 local influence share is the unique budgeting rule consistent with pure factor refinement. Aggregating local shares yields a global influence structure: in log-absolute-weight coordinates, this structure is the gradient of a convex potential, its Jacobian is a competition-graph Laplacian, and Influence Exchange – the reallocation of pairwise control across model states – satisfies a finite energy identity with a zero-exchange rigidity law. For nonlinear scoring, the pairwise margin remains well-defined, but factor-level decomposition becomes path-dependent due to cross-factor interactions. We prove an interaction-curvature theorem: factorwise path attribution is path-independent if and only if the relevant mixed partial derivatives vanish, recovering full factorwise uniqueness exactly in the additive regime. The framework extends through local linearization and Pairwise Integrated Gradients. The geometric arc continues through permutation space, score-space hyperplane crossings, discrete exactness and triangle curl, Hodge-like diagnostics, and root-space/Weyl-chamber geometry – organized as successive interpretive closures of the same pairwise-first analytical progression. Subjects: Information Retrieval (cs.IR) Cite as: arXiv:2604.09733 [cs.IR] (or arXiv:2604.09733v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2604.09733 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Yin Cheng [view email] [v1] Thu, 9 Apr 2026 17:00:49 UTC (53 KB)

[IR-28] Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering ACL2026

【速读】:该论文旨在解决音乐问答(Music-QA)领域中现有研究主要局限于单轨理解的问题,即模型仅基于单一音频片段及其元数据回答问题,而忽略了听众在实际场景中常以比较方式描述音乐的需求。为此,作者提出了Jamendo-MT-QA数据集与基准测试,涵盖12,173对音乐轨道的36,519个对比型问答样本,每对轨道生成三类问题:是/否类、短答案类和句子级问题,从而系统评估模型在多轨音乐上的推理能力。解决方案的关键在于构建一个大规模、结构化的多轨对比问答数据集,并设计基于大语言模型(LLM)辅助的生成与过滤流水线,以确保问题的质量与多样性,同时通过自动指标和LLM-as-a-Judge方法对主流音频-语言模型进行基准测试。

链接: https://arxiv.org/abs/2604.09721
作者: Junyoung Koh,Jaeyun Lee,Soo Yong Kim,Gyu Hyeong Choi,Jung In Koh,Jordan Phillips,Yeonjin Lee,Min Song
机构: Yonsei University(延世大学); KRAFTON; University of Oxford(牛津大学); George Mason University(乔治梅森大学); Sungkyul University(中央大学); MODULABS MAAP; Onoma AI
类目: Information Retrieval (cs.IR); Multimedia (cs.MM); Sound (cs.SD)
备注: ACL 2026 Findings

点击查看摘要

Abstract:Recent work on music question answering (Music-QA) has primarily focused on single-track understanding, where models answer questions about an individual audio clip using its tags, captions, or metadata. However, listeners often describe music in comparative terms, and existing benchmarks do not systematically evaluate reasoning across multiple tracks. Building on the Jamendo-QA dataset, we introduce Jamendo-MT-QA, a dataset and benchmark for multi-track comparative question answering. From Creative Commons-licensed tracks on Jamendo, we construct 36,519 comparative QA items over 12,173 track pairs, with each pair yielding three question types: yes/no, short-answer, and sentence-level questions. We describe an LLM-assisted pipeline for generating and filtering comparative questions, and benchmark representative audio-language models using both automatic metrics and LLM-as-a-Judge evaluation.

[IR-29] Evaluating Scene-based In-Situ Item Labeling for Immersive Conversational Recommendation

【速读】:该论文旨在解决沉浸式对话推荐系统(Immersive Conversational Recommendation Systems, ICRS)中如何选择和评估用于场景内标签(in-situ labels)的信息这一开放问题。现有方法在推荐任务上已较为成熟,但在标签信息的选择上缺乏系统性评估标准,导致所呈现的信息可能冗余或未能满足用户潜在需求。解决方案的关键在于提出一种基于信息需求的结构化分类:显式意图满足(explicit intent satisfaction)与主动信息需求(proactive information needs),并据此构建新的评估指标。通过在时尚、电影推荐和零售购物三个场景下对基于检索(IR)、大语言模型(LLM)和视觉语言模型(VLM)的方法进行基准测试,研究揭示了当前方法在利用场景特异性模态信息、避免视觉可推断冗余信息以及从对话中预测用户主动信息需求方面的三大局限,从而为ICRS中的标签选择提供了全新的评估范式和未来研究方向。

链接: https://arxiv.org/abs/2604.09698
作者: Jiazhou Liang,Yifan Simon Liu,David Guo,Minqi Sun,Yilun Jiang,Scott Sanner
机构: University of Toronto (多伦多大学); University of Waterloo (滑铁卢大学); Vector Institute of Artificial Intelligence (向量人工智能研究所)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The growing ubiquity of Extended Reality (XR) is driving Conversational Recommendation Systems (CRS) toward visually immersive experiences. We formalize this paradigm as Immersive CRS (ICRS), where recommended items are highlighted directly in the user’s scene-based visual environment and augmented with in-situ labels. While item recommendation has been widely studied, the problem of how to select and evaluate which information to present as immersive labels remains an open problem. To this end, we introduce a principled categorization of information needs into explicit intent satisfaction and proactive information needs and use these to define novel evaluation metrics for item label selection. We benchmark IR-, LLM-, and VLM-based methods across three datasets and ICRS scenarios: fashion, movie recommendation, and retail shopping. Our evaluation reveals three important limitations of existing methods: (1) they fail to leverage scenario-specific information modalities (e.g., visual cues for fashion, meta-data for retail), (2) they present redundant information that is visually inferable, and (3) they poorly anticipate users’ proactive information needs from explicit dialogue alone. In summary, this work provides both a novel evaluation paradigm for in-situ item labeling in ICRS and highlights key challenges for future work.

[IR-30] Decoding Ancient Oracle Bone Script via Generative Dictionary Retrieval

【速读】:该论文旨在解决古代文字(尤其是甲骨文,Oracle Bone Script, OBS)的破译难题,其核心挑战在于数据极度稀缺导致现有计算方法在未见字符上的识别准确率低于3%。解决方案的关键在于将传统的分类任务重构为基于字典的检索机制,并结合字符演变规律指导深度学习模型生成涵盖现代汉字对应可能变体的合成字典;通过视觉相似性匹配实现对未知铭文的可解释性候选检索,从而将原本不可解释的黑箱算法转变为具有透明证据支持的假设生成框架,最终在未见字符上达到54.3%的Top-10和86.6%的Top-50准确率。

链接: https://arxiv.org/abs/2604.09668
作者: Yin Wu,Gangjian Zhang,Jiayu Chen,Chang Xu,Yuyu Luo,Nan Tang,Hui Xiong
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 4 figures. Under review at Nature Machine Intelligence

点击查看摘要

Abstract:Understanding humanity’s earliest writing systems is crucial for reconstructing civilization’s origins, yet many ancient scripts remain undeciphered. Oracle Bone Script (OBS) from China’s Shang dynasty exemplifies this challenge: only approximately 1,500 of roughly 4,600 characters have been decoded, and a substantial portion of these 3,000-year-old inscriptions remains only partially understood. Limited by extreme data scarcity, existing computational methods achieve under 3% accuracy on unseen characters – the core palaeographic challenge. We overcome this by reframing decipherment from classification to dictionary-based retrieval. Using deep learning guided by character evolution principles, we generate a comprehensive synthetic dictionary of plausible OBS variants for modern Chinese characters. Scholars query unknown inscriptions to retrieve visually similar candidates with transparent evidence, replacing algorithmic black boxes with interpretable hypotheses. Our approach achieves 54.3% Top-10 and 86.6% Top-50 accuracy for unseen characters. This scalable, transparent framework accelerates decipherment of a pivotal undeciphered script and establishes a generalizable methodology for AI-assisted archaeological discovery.

[IR-31] Do We Still Need GraphRAG ? Benchmarking RAG RAG ? Benchmarking RAG and GraphRAG for Agent ic Search Systems

【速读】:该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)及图结构扩展方法(GraphRAG)在推理过程中依赖静态或一次性检索的局限性,探索代理式搜索(agentic search)是否能够通过动态多轮检索与序列决策机制弥补缺乏显式图结构的问题。其解决方案的关键在于提出一个统一基准 RAGSearch,系统评估密集型 RAG 与代表性 GraphRAG 方法作为代理式搜索中的检索基础设施,在多个问答基准上的表现,并在标准化的 LLM 骨干、检索预算和推理协议下进行公平比较。结果表明,代理式搜索显著提升了密集 RAG 的性能并缩小了与 GraphRAG 的差距,尤其在基于强化学习(RL)的设置中;然而,GraphRAG 在复杂多跳推理任务中仍具优势,且在分摊离线预处理成本后表现出更稳定的代理行为,揭示了显式图结构与代理式搜索之间的互补关系,为现代代理式 RAG 系统的检索设计提供了实证依据与实践指导。

链接: https://arxiv.org/abs/2604.09666
作者: Dongzhe Fan,Zheyi Xue,Siyuan Liu,Qiaoyu Tan
机构: New York University Shanghai(纽约大学上海分校)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) and its graph-based extensions (GraphRAG) are effective paradigms for improving large language model (LLM) reasoning by grounding generation in external knowledge. However, most existing RAG and GraphRAG systems operate under static or one-shot retrieval, where a fixed set of documents is provided to the LLM in a single pass. In contrast, recent agentic search systems enable dynamic, multi-round retrieval and sequential decision-making during inference, and have shown strong gains when combined with vanilla RAG by introducing implicit structure through interaction. This progress raises a fundamental question: can agentic search compensate for the absence of explicit graph structure, reducing the need for costly GraphRAG pipelines? To answer this question, we introduce RAGSearch, a unified benchmark that evaluates dense RAG and representative GraphRAG methods as retrieval infrastructures under agentic search. RAGSearch covers both training-free and training-based agentic inference across multiple question answering benchmarks. To ensure fair and reproducible comparison, we standardize the LLM backbone, retrieval budgets, and inference protocols, and report results on full test sets. Beyond answer accuracy, we report offline preprocessing cost, online inference efficiency, and stability. Our results show that agentic search substantially improves dense RAG and narrows the performance gap to GraphRAG, particularly in RL-based settings. Nevertheless, GraphRAG remains advantageous for complex multi-hop reasoning, exhibiting more stable agentic search behavior when its offline cost is amortized. Together, these findings clarify the complementary roles of explicit graph structure and agentic search, and provide practical guidance on retrieval design for modern agentic RAG systems.

[IR-32] AdaQE-CG: Adaptive Query Expansion for Web-Scale Generative AI Model and Data Card Generation WWW2026

【速读】:该论文旨在解决生成式 AI (Generative AI, GAI) 系统中模型卡和数据卡自动化生成面临的三大挑战:静态模板导致无法适应多样化的论文结构与不断演进的文档需求;Web-scale 仓库(如 Hugging Face)常存在元数据不完整或不一致问题,造成信息缺失或噪声;缺乏标准化评估基准,难以公平、可复现地衡量文档质量。其解决方案的核心是提出 AdaQE-CG 框架,关键创新在于两个模块:一是基于上下文感知查询扩展的论文内信息提取模块(IPE-QE),通过迭代优化提取查询以从科学论文和资源库中恢复更丰富完整的结构化信息;二是利用 MetaGAI 数据池进行跨卡片知识迁移的补全模块(ICC-MP),通过语义相关性将相似卡片中的内容迁移到缺失字段中,从而提升文档完整性与一致性。此外,作者构建了首个大规模专家标注基准 MetaGAI-Bench,推动 GAI 文档质量评估的标准化与可比性。

链接: https://arxiv.org/abs/2604.09617
作者: Haoxuan Zhang,Ruochi Li,Zhenni Liang,Mehri Sattari,Phat Vo,Collin Qu,Ting Xiao,Junhua Ding,Yang Zhang,Haihua Chen
机构: University of North Texas(北德克萨斯大学); North Carolina State University(北卡罗来纳州立大学); Bellevue High School(贝尔维尤高中)
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: This paper has been accepted to the main conference of WWW 2026

点击查看摘要

Abstract:Transparent and standardized documentation is essential for building trustworthy generative AI (GAI) systems. However, existing automated methods for generating model and data cards still face three major challenges: (i) static templates, as most systems rely on fixed query templates that cannot adapt to diverse paper structures or evolving documentation requirements; (ii) information scarcity, since web-scale repositories such as Hugging Face often contain incomplete or inconsistent metadata, leading to missing or noisy information; and (iii) lack of benchmarks, as the absence of standardized datasets and evaluation protocols hinders fair and reproducible assessment of documentation quality. To address these limitations, we propose AdaQE-CG, an Adaptive Query Expansion for Card Generation framework that combines dynamic information extraction with cross-card knowledge transfer. Its Intra-Paper Extraction via Context-Aware Query Expansion (IPE-QE) module iteratively refines extraction queries to recover richer and more complete information from scientific papers and repositories, while its Inter-Card Completion using the MetaGAI Pool (ICC-MP) module fills missing fields by transferring semantically relevant content from similar cards in a curated dataset. In addition, we introduce MetaGAI-Bench, the first large-scale, expert-annotated benchmark for evaluating GAI documentation. Comprehensive experiments across five quality dimensions show that AdaQE-CG substantially outperforms existing approaches, exceeds human-authored data cards, and approaches human-level quality for model cards. Code, prompts, and data are publicly available at: this https URL.

[IR-33] SRBench: A Comprehensive Benchmark for Sequential Recommendation with Large Language Models

【速读】:该论文旨在解决当前顺序推荐(Sequential Recommendation, SR)模型评估中存在的三大局限性:一是现有基准过于侧重准确性,忽视公平性等实际需求;二是数据集未能充分发挥大语言模型(Large Language Models, LLMs)的潜力,导致神经网络-based SR(NN-SR)与LLM-based SR(LLM-SR)模型之间的比较不公平;三是缺乏从非结构化LLM输出中提取任务特定答案的可靠机制。其解决方案的关键在于提出SRBench这一综合性SR基准,包含三个核心设计:1)多维评估框架,涵盖准确性、公平性、稳定性和效率,贴合实际应用场景;2)通过提示工程(prompt engineering)统一输入范式,提升LLM-SR性能并实现模型间公平比较;3)创新性地采用“提示-提取器耦合”机制,利用提示强制输出格式并结合数值导向提取器,精准捕获LLM输出中的任务答案。

链接: https://arxiv.org/abs/2604.09553
作者: Jianhong Li,Zeheng Qian,Wangze Ni,Haoyang Li,Hongwei Yao,Yang Bai,Kui Ren
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM development has aroused great interest in Sequential Recommendation (SR) applications. However, comprehensive evaluation of SR models remains lacking due to the limitations of the existing benchmarks: 1) an overemphasis on accuracy, ignoring other real-world demands (e.g., fairness); 2) existing datasets fail to unleash LLMs’ potential, leading to unfair comparison between Neural-Network-based SR (NN-SR) models and LLM-based SR (LLM-SR) models; and 3) no reliable mechanism for extracting task-specific answers from unstructured LLM outputs. To address these limitations, we propose SRBench, a comprehensive SR benchmark with three core designs: 1) a multi-dimensional framework covering accuracy, fairness, stability and efficiency, aligned with practical demands; 2) a unified input paradigm via prompt engineering to boost LLM-SR performance and enable fair comparisons between models; 3) a novel prompt-extractor-coupled extraction mechanism, which captures answers from LLM outputs through prompt-enforced output formatting and a numeric-oriented extractor. We have used SRBench to evaluate 13 mainstream models and discovered some meaningful insights (e.g., LLM-SR models overfocus on item popularity but lack deep understanding of item quality). Concisely, SRBench enables fair and comprehensive assessments for SR models, underpinning future research and practical application.

[IR-34] MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval

【速读】:该论文旨在解决工程规则手册和技术标准中多模态信息(如密集文本、表格和插图)在检索增强生成(RAG)系统中难以有效处理的问题。解决方案的关键在于提出一种多模态ColPali增强检索与推理框架(MCERF),其核心是结合视觉语言检索模型ColPali实现图文联合检索,并集成四种不同的检索与推理策略:(i)混合查找模式用于显式规则引用,(ii)视觉到文本融合模式支持图表引导查询,(iii)高推理能力大语言模型模式应对复杂多模态问题,(iv)自一致性决策机制稳定输出结果。此外,该研究还设计了两种动态路由机制——单例路由与多智能体系统,以优化查询分配至最优处理管道,从而在不完全摄入整本规则手册的前提下显著提升问答准确率(相对基线RAG最佳结果提升41.1%),验证了视觉语言检索、模块化推理与自适应路由在工程文档理解中的可扩展性。

链接: https://arxiv.org/abs/2604.09552
作者: Kiarash Naghavi Khanghah,Hoang Anh Nguyen,Anna C. Doris,Amir Mohammad Vahedi,Daniele Grandi,Faez Ahmed,Hongyi Xu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Engineering rulebooks and technical standards contain multimodal information like dense text, tables, and illustrations that are challenging for retrieval augmented generation (RAG) systems. Building upon the DesignQA framework [1], which relied on full-text ingestion and text-based retrieval, this work establishes a Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), a system that couples a multimodal retriever with large language model reasoning for accurate and efficient question answering from engineering documents. The system employs the ColPali, which retrieves both textual and visual information, and multiple retrieval and reasoning strategies: (i) Hybrid Lookup mode for explicit rule mentions, (ii) Vision to Text fusion for figure and table guided queries, (iii) High Reasoning LLM mode for complex multi modal questions, and (iv) SelfConsistency decision to stabilize responses. The modular framework design provides a reusable template for future multimodal systems regardless of underlying model architecture. Furthermore, this work establishes and compares two routing approaches: a single case routing approach and a multi-agent system, both of which dynamically allocate queries to optimal pipelines. Evaluation on the DesignQA benchmark illustrates that this system improves average accuracy across all tasks with a relative gain of +41.1% from baseline RAG best results, which is a significant improvement in multimodal and reasoning-intensive tasks without complete rulebook ingestion. This shows how vision language retrieval, modular reasoning, and adaptive routing enable scalable document comprehension in engineering use cases.

[IR-35] SemaCDR: LLM -Powered Transferable Semantics for Cross-Domain Sequential Recommendation

【速读】:该论文旨在解决跨域推荐(Cross-domain Recommendation, CDR)中因目标域数据稀疏和冷启动问题导致的性能瓶颈,同时克服现有方法依赖领域特定特征或标识符、难以捕捉跨域语义模式的局限性。其解决方案的关键在于提出一种语义驱动的框架 SemaCDR,通过大语言模型(Large Language Models, LLMs)构建统一的语义空间:首先利用 LLM 生成领域无关语义与领域特定内容融合形成多视图物品特征,并通过对比正则化对齐;进而采用自适应融合机制生成统一的偏好表示,并在行为序列层面实现跨域交互序列的合成与对齐,从而有效增强域内一致性并促进跨域知识迁移。

链接: https://arxiv.org/abs/2604.09551
作者: Chunxu Zhang,Shanqiang Huang,Zijian Zhang,Jiahong Liu,Linsong Yu,Ruiqi Wan,Bo Yang,Irwin King
机构: Jilin University(吉林大学); The Chinese University of Hong Kong(香港中文大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cross-domain recommendation (CDR) addresses the data sparsity and cold-start problems in the target domain by leveraging knowledge from data-rich source domains. However, existing CDR methods often rely on domain-specific features or identifiers that lack transferability across different domains, limiting their ability to capture inter-domain semantic patterns. To overcome this, we propose SemaCDR, a semantics-driven framework for cross-domain sequential recommendation that leverages large language models (LLMs) to construct a unified semantic space. SemaCDR creates multiview item features by integrating LLM-generated domain-agnostic semantics with domain-specific content, aligned by contrastive regularization. SemaCDR systematically creates LLM-generated domain-specific and domain-agnostic semantics, and employs adaptive fusion to generate unified preference representations. Furthermore, it aligns cross-domain behavior sequences with an adaptive fusion mechanism to synthesize interaction sequences from source, target, and mixed domains. Extensive experiments on real-world datasets show that SemaCDR consistently outperforms state-of-the-art baselines, demonstrating its effectiveness in capturing coherent intra-domain patterns while facilitating knowledge transfer across domains.

[IR-36] HyEm: Query-Adaptive Hyperbolic Retrieval for Biomedical Ontologies via Euclidean Vector Indexing

【速读】:该论文旨在解决生物医学知识领域中检索增强生成(Retrieval-Augmented Generation, RAG)面临的层次感知本体对齐挑战:尽管HPO、DO和MeSH等资源使用深层“is-a”分类层级结构,但现有生产系统依赖欧几里得嵌入(Euclidean embeddings)与近似最近邻(Approximate Nearest Neighbor, ANN)索引。为适应层次结构,虽然双曲嵌入(Hyperbolic embeddings)更合适,但存在两个障碍:(i) 缺乏原生向量数据库支持,(ii) 在以实体为中心的查询中可能表现不佳,因层级信息无关紧要。解决方案的关键在于提出HyEm——一种轻量级检索层,将双曲本体嵌入集成至现有的欧几里得ANN基础设施中:通过学习半径控制的双曲嵌入,将原点对数映射后的向量存储于标准欧几里得数据库进行候选检索,随后执行精确的双曲重排序;同时引入查询自适应门控机制,动态融合欧几里得语义相似度与双曲层级距离,实现混合意图查询优化。理论分析表明,在半径约束下,该方法可有效平衡层次导航性能与实体查询精度,实验证明其在实体查询上保持94–98%的欧几里得基线性能,同时显著提升层次导航与混合意图查询效果,且仅需适度超采样即可维持索引可扩展性。

链接: https://arxiv.org/abs/2604.09550
作者: Ou Deng,Shoji Nishimura,Atsushi Ogihara,Qun Jin
机构: Waseda University (早稻田大学)
类目: Information Retrieval (cs.IR); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) for biomedical knowledge faces a hierarchy-aware ontology grounding challenge: resources like HPO, DO, and MeSH use deep ``is-a" taxonomies, yet production stacks rely on Euclidean embeddings and ANN indexes. While hyperbolic embeddings suit hierarchical representation, they face two barriers: (i) lack of native vector database support, and (ii) risk of underperforming on entity-centric queries where hierarchy is irrelevant. We present HyEm, a lightweight retrieval layer integrating hyperbolic ontology embeddings into existing Euclidean ANN infrastructure. HyEm learns radius-controlled hyperbolic embeddings, stores origin log-mapped vectors in standard Euclidean databases for candidate retrieval, then applies exact hyperbolic reranking. A query-adaptive gate outputs continuous mixing weights, combining Euclidean semantic similarity with hyperbolic hierarchy distance at reranking time. Our bi-Lipschitz analysis under radius constraints provides practical guidance for ANN oversampling and this http URL on biomedical ontology subsets demonstrate HyEm preserves 94-98% of Euclidean baseline performance on entity-centric queries while substantially improving hierarchy-navigation and mixed-intent queries, maintaining indexability at moderate oversampling.

[IR-37] Beyond Offline A/B Testing: Context-Aware Agent Simulation for Recommender System Evaluation

【速读】:该论文旨在解决推荐系统(Recommender Systems, RS)在离线评估与在线性能之间存在脱节的问题,尤其是现有基于大语言模型(Large Language Model, LLM)的代理方法忽略了时间、地点和需求等情境因素对人类决策的根本性影响。其解决方案的关键在于提出ContextSim框架,通过引入一个生活模拟模块生成包含具体时空背景和行为动机的用户交互场景,并在行动和轨迹两个层面建模代理的内部思维一致性,从而生成更贴近真实人类行为的交互数据。实验表明,该方法显著提升了离线评估与线上实际表现的相关性,且优化后的推荐参数可有效提升真实世界中的用户参与度。

链接: https://arxiv.org/abs/2604.09549
作者: Nicolas Bougie,Gian Maria Marconi,Xiaotong Ye,Narimasa Watanabe
机构: Woven by Toyota (丰田织物)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recommender systems are central to online services, enabling users to navigate through massive amounts of content across various domains. However, their evaluation remains challenging due to the disconnect between offline metrics and online performance. The emergence of Large Language Model-powered agents offers a promising solution, yet existing studies model users in isolation, neglecting the contextual factors such as time, location, and needs, which fundamentally shape human decision-making. In this paper, we introduce ContextSim, an LLM agent framework that simulates believable user proxies by anchoring interactions in daily life activities. Namely, a life simulation module generates scenarios specifying when, where, and why users engage with recommendations. To align preferences with genuine humans, we model agents’ internal thoughts and enforce consistency at both the action and trajectory levels. Experiments across domains show our method generates interactions more closely aligned with human behavior than prior work. We further validate our approach through offline A/B testing correlation and show that RS parameters optimized using ContextSim yield improved real-world engagement.

[IR-38] Retrieval-Augmented Large Language Models for Evidence-Informed Guidance on Cannabidiol Use in Older Adults

【速读】:该论文旨在解决老年人在使用大麻二酚(Cannabidiol, CBD)进行慢性症状管理时面临的用药安全与信息获取障碍问题,包括剂量不当、药物相互作用风险以及因健康素养不足和污名化导致的教育缺失。其解决方案的关键在于构建一个基于检索增强生成(Retrieval-Augmented Generation, RAG)的大语言模型框架,通过结构化提示工程与精选CBD循证知识库结合,生成针对老年群体(含认知障碍者)的上下文感知型指导建议,并提出一种无需人工标注的自动化评估体系以客观衡量模型安全性与合规性。实验表明,检索增强模型显著优于独立大语言模型,在推荐策略上更谨慎且符合临床指南,其中集成多个检索系统的混合架构表现最优,验证了结构化检索对提升AI健康教育工具可靠性和安全性的核心作用。

链接: https://arxiv.org/abs/2604.09548
作者: Ali Abedi,Charlene H. Chu,Shehroz S. Khan
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Older adults commonly experience chronic conditions such as pain and sleep disturbances and may consider cannabidiol for symptom management. Safe use requires appropriate dosing, careful titration, and awareness of drug interactions, yet stigma and limited health literacy often limit understanding. Conversational artificial intelligence systems based on large language models and retrieval-augmented generation may support cannabidiol education, but their safety and reliability remain insufficiently evaluated. This study developed a retrieval-augmented large language model framework that combines structured prompt engineering with curated cannabidiol evidence to generate context-aware guidance for older adults, including those with cognitive impairment. We also proposed an automated, annotation-free evaluation framework to benchmark leading standalone and retrieval-augmented models in the absence of standardized benchmarks. Sixty-four diverse user scenarios were generated by varying symptoms, preferences, cognitive status, demographics, comorbidities, medications, cannabis history, and caregiver support. Multiple state-of-the-art models were evaluated, including a novel ensemble retrieval architecture that integrates multiple retrieval systems. Across three automated evaluation strategies, retrieval-augmented models consistently produced more cautious and guideline-aligned recommendations than standalone models, with the ensemble approach performing best. These findings demonstrate that structured retrieval improves the reliability and safety of AI-driven cannabidiol education and provide a reproducible framework for evaluating AI tools used in sensitive health contexts.

人机交互

[HC-0] Ambivalence/Hesitancy Recognition in Videos for Personalized Digital Health Interventions

【速读】:该论文旨在解决数字健康干预中自动识别个体对健康行为的矛盾与犹豫情绪(Ambivalence and Hesitancy, A/H)的问题,以提升干预的个性化和成本效益。A/H是导致患者延迟、回避或放弃健康行为的关键心理因素,但其在多模态表达中的细微冲突特征(如语言、面部、语音和肢体动作)难以通过人工识别实现高效整合。论文提出的关键解决方案是利用深度学习模型进行视频中的多模态A/H识别,涵盖监督学习、无监督域自适应以实现个性化,以及基于大语言模型(Large Language Models, LLMs)的零样本推理三种学习范式,从而推动自动化、可扩展的数字健康干预落地。

链接: https://arxiv.org/abs/2604.11730
作者: Manuela González-González,Soufiane Belharbi,Muhammad Osama Zeeshan,Masoumeh Sharafi,Muhammad Haseeb Aslam,Lorenzo Sia,Nicolas Richet,Marco Pedersoli,Alessandro Lameiras Koerich,Simon L Bacon,Eric Granger
机构: ETS Montreal (蒙特利尔工程学院); Concordia University (康考迪亚大学); CIUSSS Nord-de-l’Ile-de-Montréal (蒙特利尔北部健康服务联盟)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 13 pages, 3 figures

点击查看摘要

Abstract:Using behavioural science, health interventions focus on behaviour change by providing a framework to help patients acquire and maintain healthy habits that improve medical outcomes. In-person interventions are costly and difficult to scale, especially in resource-limited regions. Digital health interventions offer a cost-effective approach, potentially supporting independent living and self-management. Automating such interventions, especially through machine learning, has gained considerable attention recently. Ambivalence and hesitancy (A/H) play a primary role for individuals to delay, avoid, or abandon health interventions. A/H are subtle and conflicting emotions that place a person in a state between positive and negative evaluations of a behaviour, or between acceptance and refusal to engage in it. They manifest as affective inconsistency across modalities or within a modality, such as language, facial, vocal expressions, and body language. While experts can be trained to recognize A/H, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital health interventions. Here, we explore the application of deep learning models for A/H recognition in videos, a multi-modal task by nature. In particular, this paper covers three learning setups: supervised learning, unsupervised domain adaptation for personalization, and zero-shot inference via large language models (LLMs). Our experiments are conducted on the unique and recently published BAH video dataset for A/H recognition. Our results show limited performance, suggesting that more adapted multi-modal models are required for accurate A/H recognition. Better methods for modeling spatio-temporal and multimodal fusion are necessary to leverage conflicts within/across modalities.

[HC-1] HeartSway: Exploring Biodata as Poetic Traces in Public Space

【速读】:该论文旨在解决如何将人类生物数据(biodata)作为城市景观中具象且富有情感共鸣的痕迹,以增强公共空间中的个体间隐性连接。其核心问题是:如何通过技术设计使匿名的生理数据(如心率和微运动)成为可感知、可交互的城市体验材料,从而在陌生人之间构建非直接但富有共情的异步互动。解决方案的关键在于设计并实现HeartSway——一种交互式吊床,它记录使用者的心率与微小身体动作作为“生物痕迹”,并在后续访客使用时以具身化方式重现这些痕迹,从而激发连接感、好奇心与对人类共通生命力的欣赏。这一方法为公共设施设计提供了新的范式,即将个人生物数据转化为具有社会意义的都市记忆载体。

链接: https://arxiv.org/abs/2604.11701
作者: Zeyu Huang,Zhifan Guo,Xingyu Li,Xiaojuan Ma,Noura Howell
机构: The Hong Kong University of Science and Technology (香港科技大学); Georgia Institute of Technology (佐治亚理工学院); University of California, Santa Barbara (加州大学圣塔芭芭拉分校); University of Southern Denmark (南丹麦大学)
类目: Human-Computer Interaction (cs.HC)
备注: 15 pages, 5 figures, to be published in Proceedings of the 2026 ACM Designing Interactive Systems Conference (DIS '26)

点击查看摘要

Abstract:Human traces scattered across urban landscapes can signify our everyday lives and societal vibrancy in subtle and poetic forms. In this paper, we explore how designed technology can engage biodata as evocative traces. To this end, we present the design, implementation, and evaluation of HeartSway, an interactive hammock that captures a user’s heart rate and micro-movements as traces and replays them as an embodied experience for the next visitor. Through a qualitative field study (N=10), we find that HeartSway evokes feelings of connection, curiosity about prior users, and appreciation for shared human vitality. Our work contributes to understanding anonymous archival biodata as a design material for experiential urban traces. We offer design considerations for intimate asynchronous encounters between strangers in public spaces and for reimagining public amenities.

[HC-2] Exploring Radiologists Expectations of Explainable Machine Learning Models in Medical Image Analysis

【速读】:该论文试图解决的问题是:尽管机器学习(Machine Learning, ML)模型在放射学中表现出色,但因其缺乏可解释性(Explainability),尚未被放射科医生广泛接受,从而限制了其临床整合。解决方案的关键在于:通过结构化问卷收集不同经验水平和专业背景的放射科医生对可解释ML的需求、最受益的临床任务及部署方式的见解,并据此提出一套设计与开发可解释ML模型的指南,确保模型从临床角度具备实用性与可信度,从而促进其作为辅助工具融入放射学实践。

链接: https://arxiv.org/abs/2604.11700
作者: Sara Ketabi,Matthias W. Wagner,Birgit Betina Ertl-Wagner,Greg A.Jamieson,Farzad Khalvati
机构: The Hospital for Sick Children; University of Toronto; Vector Institute for Artificial Intelligence; University Hospital Augsburg
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:In spite of the strong performance of machine learning (ML) models in radiology, they have not been widely accepted by radiologists, limiting clinical integration. A key reason is the lack of explainability, which ensures that model predictions are understandable and verifiable by clinicians. Several methods and tools have been proposed to improve explainability, but most reflect developers’ perspectives and lack systematic clinical validation. In this work, we gathered insights from radiologists with varying experience and specialties into explainable ML requirements through a structured questionnaire. They also highlighted key clinical tasks where ML could be most beneficial and how it might be deployed. Based on their input, we propose guidelines for designing and developing explainable ML models in radiology. These guidelines can help researchers develop clinically useful models, facilitating integration into radiology practice as a supportive tool.

[HC-3] Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models

【速读】:该论文旨在解决生成式 AI(Generative AI)在交互中表现出的迎合倾向(sycophancy)是否因用户感知身份特征(如种族、年龄、性别和自信水平)而异的问题,即是否存在基于交叉性(intersectionality)的差异化虚假验证行为。解决方案的关键在于设计并实施一套多轮对抗性对话测试框架——基于 Anthropic 的 Petri 评估系统,针对 GPT-5-nano 和 Claude Haiku 4.5 在数学、哲学与阴谋论领域中对 128 种人格组合(涵盖多种人口统计学变量)进行系统性探测,从而量化不同用户群体上的 sycophancy 表现差异,发现 GPT-5-nano 显著更易迎合特定人群(如自信的年轻 Hispanic 女性),而 Claude Haiku 4.5 则保持稳定低水平迎合,表明安全评估应引入身份敏感型测试以提升模型公平性和可靠性。

链接: https://arxiv.org/abs/2604.11609
作者: Benjamin Maltbie,Shivam Raval
机构: Massachusetts Institute of Technology (麻省理工学院); Harvard University (哈佛大学)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large language models exhibit sycophantic tendencies–validating incorrect user beliefs to appear agreeable. We investigate whether this behavior varies systematically with perceived user demographics, testing whether combinations of race, age, gender, and expressed confidence level produce differential false validation rates. Inspired by the legal concept of intersectionality, we conduct 768 multi-turn adversarial conversations using Anthropic’s Petri evaluation framework, probing GPT-5-nano and Claude Haiku 4.5 across 128 persona combinations in mathematics, philosophy, and conspiracy theory domains. GPT-5-nano is significantly more sycophantic than Claude Haiku 4.5 overall ( \barx=2.96 vs. 1.74 , p 10^-32 , Wilcoxon signed-rank). For GPT-5-nano, we find that philosophy elicits 41% more sycophancy than mathematics and that Hispanic personas receive the highest sycophancy across races. The worst-scoring persona, a confident, 23-year-old Hispanic woman, averages 5.33/10 on sycophancy. Claude Haiku 4.5 exhibits uniformly low sycophancy with no significant demographic variation. These results demonstrate that sycophancy is not uniformly distributed across users and that safety evaluations should incorporate identity-aware testing.

[HC-4] From Multimodal Signals to Adaptive XR Experiences for De-escalation Training

【速读】:该论文旨在解决虚拟现实(VR)训练中对用户多模态交互信号进行实时、连续评估的难题,以支持自适应的人机交互系统在复杂人际场景(如执法情境下的冲突升级与降级)中的应用。其核心挑战在于如何有效融合来自不同生理与行为模态(如语音、手势、面部表情、脑电、皮肤电导等)的数据,并在存在头戴式显示器(HMD)遮挡等实际限制下实现准确的意图识别与状态解码。解决方案的关键在于构建一个基于Lab Streaming Layer同步的五通道并行处理架构,结合社会符号学与符号互动理论设计解释层,将低层次传感信号映射为高阶交互构念(如情绪波动、行为倾向),并通过警察教官和普通参与者共同验证的领域知识优化反馈机制,从而实现对用户有意识与无意识沟通线索的动态捕捉与响应。

链接: https://arxiv.org/abs/2604.11570
作者: Birgit Nierula,Karam Tomotaki-Dawoud,Daniel Johannes Meyer,Iryna Ignatieva,Mina Mottahedin,Thomas Koch,Sebastian Bosse
机构: Fraunhofer Heinrich-Hertz-Institute (弗劳恩霍夫海因里希-赫兹研究所)
类目: Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: 16 pages, 5 figures, ACM Intelligent User Interfaces (IUI) Workshops 2026

点击查看摘要

Abstract:We present the early-stage design and implementation of a multimodal, real-time communication analysis system intended as a foundational interaction layer for adaptive VR training. The system integrates five parallel processing streams: (1) verbal and prosodic speech analysis, (2) skeletal gesture recognition from multi-view RGB cameras, (3) multimodal affective analysis combining lower-face video with upper-face facial EMG, (4) EEG-based mental state decoding, and (5) physiological arousal estimation from skin conductance, heart activity, and proxemic behavior. All signals are synchronized via Lab Streaming Layer to enable temporally aligned, continuous assessments of users’ conscious and unconscious communication cues. Building on concepts from social semiotics and symbolic interactionism, we introduce an interpretation layer that links low-level signal representations to interactional constructs such as escalation and de-escalation. This layer is informed by domain knowledge from police instructors and lay participants, grounding system responses in realistic conflict scenarios. We demonstrate the feasibility and limitations of automated cue extraction in an XR-based de-escalation training project for law enforcement, reporting preliminary results for gesture recognition, emotion recognition under HMD occlusion, verbal assessment, mental state decoding, and physiological arousal. Our findings highlight the value of multi-view sensing and multimodal fusion for overcoming occlusion and viewpoint challenges, while underscoring that fusion and feedback must be treated as design problems rather than purely technical ones. The work contributes design resources and empirical insights for shaping human-AI-powered XR training in complex interpersonal settings.

[HC-5] A Cross-Country Evaluation of Sentiment Toward Digital Payment Systems in Africa

【速读】:该论文试图解决的问题是:在非洲地区,用户为何在多种数字支付系统(如移动货币、加密货币、稳定币及央行数字货币等)之间进行选择,以及他们如何权衡不同系统的效用、隐私与安全性。解决方案的关键在于通过在尼日利亚、坦桑尼亚和津巴布韦开展定性访谈,揭示用户在实际使用中对不同支付平台的偏好形成机制——发现用户通常同时拥有多个账户,并基于金融成本、安全顾虑、隐私需求及对机构的信任程度,动态选择最适合特定支付场景的平台。这一洞察为监管机构和研究者设计更符合用户信任与效用平衡的数字支付系统提供了方向。

链接: https://arxiv.org/abs/2604.11566
作者: Isabel Agadagba,Triphonia Kilasara,Takudzwa Tarutira,Noah Shumba,Nicolas Christin,Obigbemi Imoleayo Foyeke,Assane Gueye,Edith Luhanga,Alexander Rusero,Karen Sowon,Giulia Fanti
机构: Carnegie Mellon University (卡内基梅隆大学); Carnegie Mellon University-Africa (卡内基梅隆大学非洲校区); University of Lagos (拉各斯大学); Africa University (非洲大学); University of Indiana-Bloomington (印第安纳大学布卢明顿分校)
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Digital payment systems have become a cornerstone of consumer finance in Africa. Prominent payment categories include money transfer applications, mobile money, cryptocurrencies, stablecoins, and central bank digital currencies (CBDCs). While there are studies exploring how and why people use individual digital payment systems (both in Africa and beyond), we lack a good understanding of why people choose between different categories of payment systems, and how they view the tradeoffs between different categories. We conducted qualitative interviews in three African countries – Nigeria, Tanzania, and Zimbabwe – to understand how and why people use various payment systems, and what influenced them to start using these systems. Our study highlights several notable findings regarding tradeoffs between perceived utility, privacy, and security. For example, many users trust government issuers to protect them from scams, but they do not trust those same institutions to build reliable systems and products or prioritize customer satisfaction. We also find that most users have accounts on multiple payment systems, and conduct a complex selection process using different platforms for different types of payments. This selection process is driven in part by financial considerations, but also by security, privacy, and trust preferences. Our findings suggest compelling directions for regulators and the research community to design systems that balance users’ trust and utility needs.

[HC-6] Participation and Power: A Case Study of Using Ecological Momentary Assessment to Engage Adolescents in Academic Research

【速读】:该论文旨在解决生态瞬时评估(Ecological Momentary Assessment, EMA)平台设计对青少年参与度、研究实践及权力关系的影响尚未得到充分探讨的问题。其解决方案的关键在于开发一个以青少年为中心的EMA平台,通过增强青少年参与感和研究人员支持能力来优化数据收集与管理流程:平台采用青少年导向的设计理念和游戏化功能提升参与持续性,同时借助集中式网页仪表板简化研究团队的行政监控;然而,技术不稳定性与固定的数据结构也带来了隐私担忧和原始使用元数据解析困难等挑战,因此论文进一步提出了面向青年赋权、伦理实践与研究目标协同的交互设计指南。

链接: https://arxiv.org/abs/2604.11551
作者: Ozioma C. Oguine,Elmira Rashidi,Pamela J. Wisniewski,Karla Badillo-Urquiola
机构: University of Notre Dame (圣母大学); Indiana University (印第安纳大学); International Computer Science Institute (国际计算机科学研究所)
类目: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: 10 pages, 2 figures, 2 tables. In Proceedings of the 25th Interaction Design and Children Conference (IDC’ 26), June 22-25, 2026, Brighton, United Kingdom

点击查看摘要

Abstract:Ecological Momentary Assessment (EMA) is widely used to study adolescents’ experiences; yet, how the design of EMA platforms shapes engagement, research practices, and power dynamics in youth studies remains under-examined. We developed a youth-centered EMA platform prioritizing youth engagement and researcher support, and evaluated it through a case study on a longitudinal investigation with adolescent twins focused on mental health and sleep behavior. Interviews with the research team examined how the platform design choices shaped participant onboarding, sustained engagement, risk monitoring, and data interpretation. The app’s teen-centered design and gamified features sustained teen engagement, while the web portal streamlined administrative oversight through a centralized dashboard. However, technical instability and rigid data structures created significant hurdles, leading to privacy concerns among parents and complicating the researchers’ ability to analyze raw usage metadata. We provide actionable interaction design guidelines for developing EMA platforms that prioritize youth agency, ethical practice, and research goals.

[HC-7] Human Centered Non Intrusive Driver State Modeling Using Personalized Physiological Signals in Real World Automated Driving

【速读】:该论文旨在解决当前驾驶自动化系统(SAE Levels 2-3)中驾驶员状态监测准确性不足的问题,特别是现有驾驶员监控系统(Driver Monitoring Systems, DMS)普遍依赖通用模型而忽视个体生理差异导致的性能下降问题。解决方案的关键在于采用非侵入式生理传感技术(如Empatica E4可穿戴设备采集皮肤电活动、心率、体温和运动数据),并利用深度学习方法将多模态生理信号转换为二维图像表示,进而通过基于预训练ResNet50的多模态特征提取架构构建个性化驾驶员状态模型。实验表明,个性化模型在四名受试者上的平均准确率达92.68%,显著优于跨用户通用模型(仅54%),验证了适应个体生理特征的DMS对于实现安全人机协同驾驶的重要性。

链接: https://arxiv.org/abs/2604.11549
作者: David Puertas-Ramirez,Raul Fernandez-Matellan,David Martin Gomez,Jesus G. Boticario
机构: UNED Madrid Spain(西班牙国立远程教育大学马德里校区); UC3M Madrid Spain(西班牙卡洛斯三世大学马德里校区)
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 17 pages (including references), 4 Figures, 4 Tables

点击查看摘要

Abstract:In vehicles with partial or conditional driving automation (SAE Levels 2-3), the driver remains responsible for supervising the system and responding to take-over requests. Therefore, reliable driver monitoring is essential for safe human-automation collaboration. However, most existing Driver Monitoring Systems rely on generalized models that ignore individual physiological variability. In this study, we examine the feasibility of personalized driver state modeling using non-intrusive physiological sensing during real-world automated driving. We conducted experiments in an SAE Level 2 vehicle using an Empatica E4 wearable sensor to capture multimodal physiological signals, including electrodermal activity, heart rate, temperature, and motion data. To leverage deep learning architectures designed for images, we transformed the physiological signals into two-dimensional representations and processed them using a multimodal architecture based on pre-trained ResNet50 feature extractors. Experiments across four drivers demonstrate substantial interindividual variability in physiological patterns related to driver awareness. Personalized models achieved an average accuracy of 92.68%, whereas generalized models trained on multiple users dropped to an accuracy of 54%, revealing substantial limitations in cross-user generalization. These results underscore the necessity of adaptive, personalized driver monitoring systems for future automated vehicles and imply that autonomous systems should adapt to each driver’s unique physiological profile.

[HC-8] ResearchCube: Multi-Dimensional Trade-off Exploration for Research Ideation

【速读】:该论文旨在解决当前AI辅助研究构思工具在多维评估维度支持上的不足,即多数工具将复杂的研究构想评价简化为单一极性尺度(如“越多越好”),忽视了科研创新中不同评估维度间的权衡关系。其解决方案的关键在于提出ResearchCube系统,通过将评估维度重构为双向权衡谱(bipolar trade-off spectra,如理论驱动 vs. 数据驱动),并以用户自建的三维评价空间来可视化呈现研究构想,使研究者能够通过四种空间交互方式(AI引导的维度生成、面吸附式3D导航、拖拽式构思引导与合成)进行直接操作,从而实现对多维评价逻辑的显性化和可控化。此设计不仅增强了用户的决策代理感,也促进了从单维聚焦到多维协同的灵活过渡,为构建更具认知支持能力的生成式AI辅助研究工具提供了新范式。

链接: https://arxiv.org/abs/2604.11538
作者: Zijian Ding,Fenghai Li,Ziyi Wang,Joel Chan
机构: University of Maryland, College Park (马里兰大学学院公园分校); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Research ideation requires navigating trade-offs across multiple evaluative dimensions, yet most AI-assisted ideation tools leave this multi-dimensional reasoning unsupported, or reducing evaluation to unipolar scales where “more is better”. We present ResearchCube, a system that reframes evaluation dimensions as bipolar trade-off spectra (e.g., theory-driven vs. data-driven) and renders research ideas as manipulable points in a user-constructed 3D evaluation space. Given a research intent, the system proposes candidate bipolar dimension pairs; users select up to three to define the axes of a personalized evaluation cube. Four spatial interactions – AI-scaffolded dimension generation, 3D navigation with face snapping, drag-based idea steering, and drag-based synthesis – enable researchers to explore and refine ideas through direct manipulation rather than text prompts. A qualitative study with 11 researchers revealed that (1) bipolar dimensions served as cognitive scaffolds that externalized evaluative thinking and offloaded working memory, (2) the spatial representation provided a sense of agency absent in chatbot-based AI tools, (3) participants desired fluid transitions across dimensionality levels – from single-dimension focus to more than three dimensions, and (4) a productive tension emerged between AI-suggested starting dimensions and users’ evolving desire for control. We distill these findings into design implications for multi-dimensional research ideation tools, including progressive dimensional control, fluid dimensionality, and transparent synthesis with provenance.

[HC-9] Understanding the Gap Between Stated and Revealed Preferences in News Curation: A Study of Young Adult Social Media Users

【速读】:该论文试图解决社交媒体信息流算法在用户行为偏好(revealed preferences)与用户自我陈述偏好(stated preferences)之间存在的不一致问题,尤其是在年轻群体中,其内容消费往往受算法驱动但未必符合其价值观。解决方案的关键在于通过混合方法研究(问卷调查、访谈与feed定制任务)揭示用户对理想信息流的构想:当被赋予主动设计权时,参与者倾向于优先考虑准确性(accuracy)和多样性(diversity)等价值,并在社会关系与情境背景下权衡不同价值取向,从而构建更符合其内在价值的信息展示策略。这一发现表明,feed curation本质上是一个社会性判断过程,为设计能够反映用户深层价值的算法系统提供了关键方向。

链接: https://arxiv.org/abs/2604.11517
作者: Do Won Kim,Cody Buntain,Giovanni Luca Ciampaglia
机构: University of Maryland College Park (马里兰大学帕克分校)
类目: Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
备注: To be published in CSCW '26

点击查看摘要

Abstract:Social media feed algorithms infer user preferences from their past behaviors. Yet what drives engagement often diverges from what users value. We examine this gap between stated preferences (what users say they prefer) and revealed preferences (what their behavior suggests they prefer) among young adults, a group deeply embedded in algorithmically mediated environments. Using a mixed-methods approach combining surveys and interviews with feed curation activities, we investigate: what gaps exist between stated and revealed preferences; how users make sense of these gaps; what values users believe should guide algorithmic curation; and how systems might reflect those values. Participants often found themselves engaging with low-quality content they did not endorse, despite wanting high-quality information. When asked to curate an ideal social media news feed for a hypothetical persona, participants created feeds they considered more satisfying and higher in quality by prioritizing values such as accuracy and diversity. In doing so, they navigated trade-offs between different values, factoring in social relationships and context surrounding the persona. These findings suggest that feed curation is a socially situated process of judging what should be visible and appropriate in shared information spaces. Based on these insights, we offer design directions for bridging the gap between stated and revealed preferences.

[HC-10] From Attribution to Action: A Human-Centered Application of Activation Steering

【速读】:该论文旨在解决现有可解释人工智能(Explainable AI, XAI)方法在实践中难以转化为具体操作手段的问题,即虽然XAI能识别影响模型预测的关键特征,但缺乏支持从业者基于解释进行干预的能力。其解决方案的关键在于引入一种结合SAE(Sparse Autoencoder)归因与激活控制(activation steering)的交互式工作流,使用户能够在实例层面分析视觉模型中概念的使用情况,并通过主动干预模型组件来验证假设。实验表明,这种机制促使从业者从被动观察转向主动干预式的 hypothesis testing,显著提升了可解释性的实用性,同时揭示了如涟漪效应和局部修正泛化性不足等需谨慎对待的风险。

链接: https://arxiv.org/abs/2604.11467
作者: Tobias Labarta,Maximilian Dreyer,Katharina Weitz,Wojciech Samek,Sebastian Lapuschkin
机构: Fraunhofer Heinrich-Hertz-Institut(弗劳恩霍夫海因里希-赫兹研究所); Technische Universität Berlin(柏林工业大学); BIFOLD – Berlin Institute for the Foundations of Learning and Data(柏林学习与数据基础研究所)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.

[HC-11] Functional Misalignment in Human-AI Interactions on Digital Platforms

【速读】:该论文试图解决的问题是:当前算法系统(尤其是社交媒体推荐系统)在优化用户可观察行为信号(如点击、浏览和互动)时,虽能实现高精度的个体行为预测,却引发了广泛的社会负面影响,如心理健康问题加剧、社会极化和信任度下降。这些问题的根本原因在于算法优化目标(可预测行为)与人类实际需求之间存在结构性的功能错位(functional misalignment)。解决方案的关键在于识别并理解这种错位的三个机制:(1)对快速反应性行为信号的偏倚建模,忽视了深思熟虑的判断;(2)用户行为与算法学习之间的反馈循环;(3)大规模下涌现的集体动态效应。作者提出将“功能错位”作为统一框架,并据此构建研究议程以探索和缓解人-机交互系统中的此类问题。

链接: https://arxiv.org/abs/2604.11459
作者: Kristina Lerman
机构: Indiana University (印第安纳大学)
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Algorithmic systems, particularly social media recommenders, have achieved remarkable success in predicting behavior. By optimizing for observable signals such as clicks, views, and engagement, these systems effectively capture user attention and guide interaction. Yet their widespread adoption has coincided with troubling outcomes, including rising mental health concerns, increasing polarization, and erosion of trust. This paper argues that these effects are consequences of a structural functional misalignment between what algorithms optimize - predictable behavior - and the human goals these predictions are intended to serve. We propose that this misalignment arises through three mechanisms: (1) a bias toward modeling fast, reactive behavioral signals over reflective judgment, (2) feedback loops that couple user behavior with algorithmic learning, and (3) emergent collective dynamics that amplify these effects at scale. Together, these mechanisms explain how accurate individual-level predictions can produce adverse societal outcomes. We present functional misalignment as a unifying framework and outline a research agenda for studying and mitigating its effects in human-AI interaction systems.

[HC-12] Contexty: Capturing and Organizing In-situ Thoughts for Context-Aware AI Support

【速读】:该论文旨在解决当前人机协作中用户复杂认知过程(如迭代式意义建构)难以被AI有效捕捉与利用的问题。现有系统往往无法在不打断任务的情况下记录用户的思维痕迹,且即便记录下来,也常以碎片化或系统自动生成的摘要形式呈现,无法真实反映用户的认知路径。解决方案的关键在于构建一种基于用户认知轨迹(cognitive traces)的AI上下文机制,使AI能够理解并响应用户的真实思考过程。具体实现上,作者通过探针系统“snippet memoing”支持用户即时记录关键认知片段,并进一步提出Contexty工具,让用户可直接检查和修订这些上下文,从而提升任务意识、思维结构化程度以及用户对AI输出的掌控感和归属感。

链接: https://arxiv.org/abs/2604.11067
作者: Yoonsu Kim,Chanbin Park,Kihoon Son,Saelyne Yang,Juho Kim
机构: KAIST(韩国科学技术院); University of California, Berkeley(加州大学伯克利分校); CMU(卡内基梅隆大学); SkillBench(技能基准)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:During complex knowledge work, people engage in iterative sensemaking: interpreting information, connecting ideas, and refining their understanding. Yet in current human-AI collaboration, these cognitive processes are difficult to share and organize for AI. They arise in situ and are rarely captured without interrupting the task, and even when expressed, remain scattered or reduced to system-generated summaries that fail to reflect users’ cognitive processes. We address this challenge by enabling AI context that is grounded in users’ cognitive traces and can be directly inspected and revised by the user. We first explore this through a probe system that supports in-situ snippet memoing, allowing users to easily share their cognitive moves. Our study (N=10) highlights the value of capturing such context and the challenge of organizing it once accumulated. We then present Contexty, which supports users in inspecting and refining these contexts to better reflect their understanding of the task. Our evaluation (N=12) showed that Contexty improved task awareness, thought structuring, and users’ sense of authorship and control, with participants preferring snippet-grounded AI responses over non-grounded ones (78.1%). We discuss how capturing and organizing users’ cognitive context enables AI as a context-aware collaborator while preserving user agency.

[HC-13] Inferring World Belief States in Dynamic Real-World Environments

【速读】:该论文旨在解决在动态、三维且部分可观测环境中,如何通过机器人观测来估计人类的世界信念状态(world belief state)这一问题。其核心挑战在于构建一个能够反映人类对环境认知和决策依据的内部模拟模型,从而支持高效的人机协同任务。解决方案的关键在于基于心理模型理论(mental model theory),通过推断协作伙伴的信念状态(即一级情境意识 level one situation awareness),实现对人类行为意图的理解与预测,进而支持无需频繁显式通信的流畅团队协作。研究在真实仿真环境中验证方法,并扩展至实际机器人平台,最终通过主动辅助语义推理任务展示了该信念状态估计方法的下游应用价值。

链接: https://arxiv.org/abs/2604.11020
作者: Jack Kolb,Aditya Garg,Nikolai Warner,Karen M. Feigh
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: 7 pages, 4 figures

点击查看摘要

Abstract:We investigate estimating a human’s world belief state using a robot’s observations in a dynamic, 3D, and partially observable environment. The methods are grounded in mental model theory, which posits that human decision making, contextual reasoning, situation awareness, and behavior planning draw from an internal simulation or world belief state. When in teams, the mental model also includes a team model of each teammate’s beliefs and capabilities, enabling fluent teamwork without the need for constant and explicit communication. In this work we replicate a core component of the team model by inferring a teammate’s belief state, or level one situation awareness, as a human-robot team navigates a household environment. We evaluate our methods in a realistic simulation, extend to a real-world robot platform, and demonstrate a downstream application of the belief state through an active assistance semantic reasoning task.

[HC-14] Brief2Design: A Multi-phased Compositional Approach to Prompt-based Graphic Design

【速读】:该论文旨在解决专业设计师在面对抽象客户需求(如目标与约束条件)时,如何高效将其转化为具体视觉设计的问题。现有工具通常仅聚焦于特定设计环节或因输出过于完整而引发思维固化,难以支持灵活探索与重构。其解决方案的关键在于提出一种结构化的工作流——Brief2Design,该工具通过需求提取与推荐实现对模糊要求的初步澄清,再通过对象、背景、文字、排版和构图等元素级探索,最终支持用户灵活重组所选元素,从而在保持设计多样性的同时提升需求理解的准确性。

链接: https://arxiv.org/abs/2604.11019
作者: Kotaro Kikuchi,Nami Ogawa
机构: CyberAgent( CyberAgent)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Professional designers work from client briefs that specify goals and constraints but often lack concrete design details. Translating these abstract requirements into visual designs poses a central challenge, yet existing tools address specific aspects or induce fixation through complete outputs. Through interviews with six professional designers, we identified how designers address this challenge: first structuring ambiguous requirements, then exploring individual elements, and finally recombining alternatives. We developed Brief2Design, supporting this workflow through requirement extraction and recommendation, element-level exploration for objects, backgrounds, text, typography, and composition, and flexible recombination of selected elements. A within-subjects study with twelve designers compared Brief2Design against a conversational baseline. The structured approach increased prompt diversity and received high ratings for requirement extraction and recommendation, but required longer generation time and achieved comparable image diversity. These findings reveal that structured workflows benefit requirement clarification at the cost of efficiency, informing design trade-offs for AI-assisted graphic design tools.

[HC-15] From Planning to Revision: How AI Writing Support at Different Stages Alters Ownership

【速读】:该论文试图解决的问题是:人工智能(AI)辅助写作在提升文章质量的同时,可能削弱写作者对作品的归属感(ownership),而这种归属感对于责任认定、权利归属、写作规范及认知投入具有重要意义。解决方案的关键在于:根据写作的不同阶段(规划、起草、修改)提供差异化AI支持——研究发现,仅在规划阶段提供AI支持时,归属感下降最轻微;而在起草阶段提供支持则导致归属感显著降低,且归属感的减弱程度与AI贡献的文本和创意数量呈正相关。因此,建议在设计AI写作辅助系统时,优先在写作初期介入,并谨慎控制AI在内容生成中的参与度,以平衡写作质量与作者归属感。

链接: https://arxiv.org/abs/2604.11009
作者: Katy Ilonka Gero,Tao Long,Carly Schnitzler,Paramveer Dhillon
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Although AI assistance can improve writing quality, it can also decrease feelings of ownership. Ownership in writing has important implications for attribution, rights, norms, and cognitive engagement, and designers of AI support systems may want to consider how system features may impact ownership. We investigate how the stage at which AI support for writing is provided (planning, drafting, or revising) changes ownership. In a study of short essay writing (between subjects, n = 253) we find that while any AI assistance decreased ownership, planning support only minimally decreased ownership, while drafting support saw the largest decrease. This variation maps onto the amount of text and ideas contributed by AI, where more text and ideas from AI decreased ownership. Notably, an AI-generated draft based on participants’ own outline resulted in significantly more AI-contributed ideas than AI support for planning. At the same time, more AI contributions improved essay quality. We propose that writers, educators, and designers consider writing stage when introducing AI assistance.

[HC-16] Examining EAP Students AI Disclosure Intention: A Cognition-Affect-Conation Perspective

【速读】:该论文旨在解决生成式 AI(Generative AI)在学术写作中日益广泛应用背景下,英语作为学术用途(English for Academic Purposes, EAP)学生对使用AI工具的披露意愿不足所引发的透明度与学术诚信问题。研究基于认知-情感-行为意向框架,构建了一个整合促进与抑制因素的模型,揭示心理安全正向预测披露意图,而负面评价恐惧则负向预测该意图。解决方案的关键在于:通过制定清晰的机构政策和营造支持性的教学环境,增强学生的心理安全感,同时降低因政策模糊和声誉担忧引发的负面评价恐惧,从而推动学生在学术实践中对AI使用的透明化与负责任使用。

链接: https://arxiv.org/abs/2604.10991
作者: Yiran Du,Huimin He
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The growing use of generative artificial intelligence (AI) in academic writing has raised increasing concerns regarding transparency and academic integrity in higher education. This study examines the psychological factors influencing English for Academic Purposes (EAP) students’ intention to disclose their use of AI tools. Drawing on the cognition-affect-conation framework, the study proposes a model integrating both enabling and inhibiting factors shaping disclosure intention. A sequential explanatory mixed-methods design was employed. Quantitative data from 324 EAP students at an English-medium instruction university in China were analysed using structural equation modelling, followed by semi-structured interviews with 15 students to further interpret the findings. The quantitative results indicate that psychological safety positively predicts AI disclosure intention, whereas fear of negative evaluation negatively predicts it. The qualitative findings further reveal that supportive teacher practices and clear guidance foster psychological safety, while policy ambiguity and reputational concerns intensify fear of negative evaluation and discourage disclosure. These findings highlight the importance of clear institutional policies and supportive pedagogical environments in promoting transparent AI use.

[HC-17] Enabling and Inhibitory Pathways of Students AI Use Concealment Intention in Higher Education: Evidence from SEM and fsQCA

【速读】:该论文旨在解决高等教育中学生对生成式 AI (Generative AI) 使用行为的隐藏意图问题,即学生为何在使用 AI 工具时倾向于隐瞒其使用情况。研究通过整合认知-情感-意向(Cognition-Affect-Conation, CAC)框架与结构方程模型(SEM)和模糊集定性比较分析(fsQCA)的双方法路径,揭示了两种对立的作用机制:一是“促进路径”,感知污名、感知风险和政策不确定性通过增强对负面评价的恐惧而推动隐藏意图;二是“抑制路径”,AI 自我效能感、感知公平性和社会支持通过提升心理安全感降低隐藏意图。解决方案的关键在于识别出“对负面评价的恐惧”作为跨情境的核心中介变量,并强调需从制度层面制定清晰政策、消除适当使用 AI 的污名化,以及构建支持性学习环境以增强心理安全感,从而有效减少学生对 AI 使用的隐藏行为。

链接: https://arxiv.org/abs/2604.10978
作者: Yiran Du,Huimin He
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study investigates students’ AI use concealment intention in higher education by integrating the cognition-affect-conation (CAC) framework with a dual-method approach combining structural equation modelling (SEM) and fuzzy-set qualitative comparative analysis (fsQCA). Drawing on data from 1346 university students, the findings reveal two opposing mechanisms shaping concealment intention. The enabling pathway shows that perceived stigma, perceived risk, and perceived policy uncertainty increase fear of negative evaluation, which in turn promotes concealment. In contrast, the inhibitory pathway demonstrates that AI self-efficacy, perceived fairness, and perceived social support enhance psychological safety, thereby reducing concealment intention. SEM results confirm the hypothesised relationships and mediation effects, while fsQCA identifies multiple configurational pathways, highlighting equifinality and the central role of fear of negative evaluation across conditions. The study contributes to the literature by conceptualising concealment as a distinct behavioural outcome and by providing a nuanced explanation that integrates both net-effect and configurational perspectives. Practical implications emphasise the need for clear institutional policies, destigmatisation of appropriate AI use, and the cultivation of supportive learning environments to promote transparency.

[HC-18] From Words to Widgets for Controllable LLM Generation

【速读】:该论文旨在解决用户在使用大语言模型(Large Language Models, LLMs)时,难以通过自然语言提示(natural language prompts)精确表达和控制主观偏好(如语气、风格和强调方式)的问题。解决方案的关键在于提出一种名为“可塑提示”(Malleable Prompting)的交互式提示技术,其核心是将自然语言中的偏好表达转化为图形用户界面(GUI)控件(如滑块、下拉菜单和切换按钮),使用户能够直接配置这些控件来引导生成结果;同时,系统通过可视化每个控制参数对输出的影响,支持归因分析与迭代比较。为实现这一交互机制,论文进一步设计了一种基于偏好表达及其控件值动态调节token概率分布的LLM解码算法,从而提升生成内容的可控性与透明度。

链接: https://arxiv.org/abs/2604.10925
作者: Chao Zhang,Yiren Liu,Lunyiu Nie,Jeffrey M. Rzeszotarski,Yun Huang,Tal August
机构: Cornell University (康奈尔大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); The University of Texas at Austin (德克萨斯大学奥斯汀分校); Loyola University Maryland (洛约拉玛丽蒙特大学)
类目: Human-Computer Interaction (cs.HC)
备注: The first three authors contributed equally to this work

点击查看摘要

Abstract:Natural language remains the predominant way people interact with large language models (LLMs). However, users often struggle to precisely express and control subjective preferences (e.g., tone, style, and emphasis) through prompting. We propose Malleable Prompting, a new interactive prompting technique for controllable LLM generation. It reifies preference expressions in natural language prompts into GUI widgets (e.g., sliders, dropdowns, and toggles) that users can directly configure to steer generation, while visualizing each control’s influence on the output to support attribution and comparison across iterations. To enable this interaction, we introduce an LLM decoding algorithm that modulates the token probability distribution during generation based on preference expressions and their widget values. Through a user study, we show that Malleable Prompting helps participants achieve target preferences more precisely and is perceived as more controllable and transparent than natural language prompting alone.

[HC-19] aching Robots to Interpret Social Interactions through Lexically-guided Dynamic Graph Learning ACM-MM26

【速读】:该论文旨在解决机器人社会智能(Social Intelligence)的建模问题,即如何通过用户当前行为推断其内在状态(latent states)、预测未来行为并作出适当响应。核心挑战在于捕捉用户内在状态与可观察行为之间的动态相互作用关系。解决方案的关键是提出一种名为SocialLDG的多任务学习框架,该框架显式建模六类任务间的动态关联性:一方面利用语言模型引入词汇先验(lexical priors)以增强各任务表征;另一方面采用动态图学习机制(dynamic graph learning)来刻画任务间亲和力随时间演化的特性,从而在保持任务可扩展性的同时实现对人类决策过程中内部状态与外部行为交互机制的精细化解析。

链接: https://arxiv.org/abs/2604.10895
作者: Tongfei Bian,Mathieu Chollet,Tanaya Guha
机构: University of Glasgow(格拉斯哥大学)
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: submitted to ACM MM 26

点击查看摘要

Abstract:For a robot to be called socially intelligent, it must be able to infer users internal states from their current behaviour, predict the users future behaviour, and if required, respond appropriately. In this work, we investigate how robots can be endowed with such social intelligence by modelling the dynamic relationship between user’s internal states (latent) and actions (observable state). Our premise is that these states arise from the same underlying socio-cognitive process and influence each other dynamically. Drawing inspiration from theories in Cognitive Science, we propose a novel multi-task learning framework, termed as \textbfSocialLDG that explicitly models the dynamic relationship among the states represent as six distinct tasks. Our framework uses a language model to introduce lexical priors for each task and employs dynamic graph learning to model task affinity evolving with time. SocialLDG has three advantages: First, it achieves state-of-the-art performance on two challenging human-robot social interaction datasets available publicly. Second, it supports strong task scalability by learning new tasks seamlessly without catastrophic forgetting. Finally, benefiting from explicit modelling task affinity, it offers insights on how different interactions unfolds in time and how the internal states and observable actions influence each other in human decision making.

[HC-20] owards Designing for Resilience: Community-Centered Deployment of an AI Business Planning Tool in a Small Business Center

【速读】:该论文旨在解决资源受限社区中的创业者因缺乏时间与支持,难以将商业创意转化为可行商业计划的问题。当前生成式 AI(Generative AI)虽被寄予厚望,但多数系统假设用户具备较高数字素养,忽视了社区基础设施对技术采纳的关键影响。论文提出并部署了 BizChat——一款面向女性创客空间的 AI 驱动商业规划工具,并通过四场工作坊的使用日志(N=30)和访谈(N=10)发现,解决方案的关键在于构建集体性的 AI 文化素养(包括采纳、适应与拒绝 AI 的能力),并通过同伴支持缓解 AI 输出快速性与商业规划所需深度意义建构之间的张力。研究强调“有益摩擦”“社群支撑结构”和“协同可塑性”三方面设计原则,以增强社区在技术变革中的韧性。

链接: https://arxiv.org/abs/2604.10883
作者: Quentin Romero Lauro,Aakash Gautam,Yasmine Kotturi
机构: University of Pittsburgh (匹兹堡大学); Univ of Maryland, Baltimore County (马里兰大学巴尔的摩县分校)
类目: Human-Computer Interaction (cs.HC)
备注: 17 pages, accepted to CHI 2026

点击查看摘要

Abstract:Entrepreneurs in resource-constrained communities often lack time and support to translate ideas into actionable business plans. While generative AI promises assistance, most systems assume high digital literacy and overlook community infrastructures that shape adoption. We report on the community-centered design and deployment of BizChat, an AI-powered business planning tool, introduced across four workshops at a feminist makerspace in Pittsburgh. Through log data (N=30) and interviews (N=10), we examine how entrepreneurs build resilience through collective AI literacy development-encompassing adoption, adaptation, and refusal of AI. Our findings reveal that while BizChat lowered barriers to accessing capital by translating ideas into “business language,” this ease raised questions about whether instant AI outputs undermine sensemaking essential to planning. We show how peer support helped entrepreneurs navigate this tension. We contribute design implications, including productive friction, communal scaffolds, and co-optability, for strengthening resilience amid technological change.

[HC-21] Compliant But Unsatisfactory: The Gap Between Auditing Standards and Practices for Probabilistic Genotyping Software

【速读】:该论文旨在解决当前人工智能治理中审计标准设计不当可能导致系统缺陷被掩盖并获得虚假可信度的问题。其核心问题在于,现有审计标准(如ASB 018)在实际执行中未能有效实现预期的监督目标,例如通过识别软件缺陷来限制高风险概率基因分型软件的使用范围。解决方案的关键在于重新审视审计标准的设计逻辑,特别是识别并修正标准中模糊的语言表述和未定义的核心术语,从而确保其要求能够转化为可操作、可验证的审计实践,最终提升审计的有效性与问责性。

链接: https://arxiv.org/abs/2604.10875
作者: Angela Jin,Alexander Asemota,Dan E. Krane,Nathaniel D. Adams,Rediet Abebe
机构: University of California, Berkeley (加州大学伯克利分校); Wright State University (赖特州立大学); Forensic Bioinformatic Services, Inc. (法医生物信息学服务公司); ELLIS Institute; Max Planck Institute for Intelligent Systems; Tübingen AI Center (图宾根人工智能中心)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: 20 pages, 2 figures, published at ACM CHI, 2026

点击查看摘要

Abstract:AI governance efforts increasingly rely on audit standards: agreed-upon practices for conducting audits. However, poorly designed standards can hide and lend credibility to inadequate systems. We explore how an audit standard’s design influences its effectiveness through a case study of ASB 018, a standard for auditing probabilistic genotyping software – software that the U.S. criminal legal system increasingly uses to analyze DNA samples. Through qualitative analysis of ASB 018 and five audit reports, we identify numerous gaps between the standard’s desired outcomes and the auditing practices it enables. For instance, ASB 018 envisions that compliant audits establish restrictions on software use based on observed failures. However, audits can comply without establishing such boundaries. We connect these gaps to the design of the standard’s requirements such as vague language and undefined terms. We conclude with recommendations for designing audit standards and evaluating their effectiveness.

[HC-22] Participatory not Punitive: Student-Driven AI Policy Recommendations in a Design Classroom

【速读】:该论文旨在解决当前高校在生成式 AI(Generative AI)治理中普遍存在的“自上而下”政策制定模式所导致的学生参与缺失问题,这种模式往往忽视学生实际使用场景,仅聚焦于惩罚性条款,引发学生对合规使用的困惑与焦虑。解决方案的关键在于推动“以学生为中心”的参与式政策设计,通过由学生主导的三阶段工作坊,让学习者在无教师干预的情境下自主分享AI使用经验、共同撰写政策建议,并以视觉化媒介(如手绘小册子)传播成果,从而揭示出传统政策忽略的核心矛盾,例如要求学生披露或禁止使用AI但对教师无同等约束,体现了学生深度参与治理不仅产出更贴合实践的政策,也重塑了师生间权力关系。

链接: https://arxiv.org/abs/2604.10851
作者: Kaoru Seki,Manisha Vijay,Yasmine Kotturi
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校)
类目: Human-Computer Interaction (cs.HC)
备注: 29 pages. To appear in CHI 2026 (ACM CHI Conference on Human Factors in Computing Systems), April 13-17, 2026, Barcelona, Spain. Kaoru Seki and Manisha Vijay contributed equally to this work

点击查看摘要

Abstract:Generative AI is reshaping education, yet most university AI policies are written without students and focus on penalizing misuse. This top-down approach sidelines those most affected from decisions that shape their everyday learning, resulting in confusion and fear about acceptable use. We examine how participatory, student-driven AI policy design can address this disconnect. We report on a three-part workshop series in a graduate design course at a minority-serving university in the U.S., where two student leaders facilitated discussions without faculty present. Eight participants shared candid accounts of their AI use, co-authored ten policy recommendations, and visualized them in a zine that circulated across campus. The resulting policies surfaced concerns absent from top-down governance, such as the double standard of requiring students to disclose or abstain from AI use while faculty face no such expectations. We argue that engaging students in AI governance carries value beyond the resulting policies, and offer transferable strategies for fostering participation across disciplines – a model for calling students in rather than calling students

[HC-23] Speaking to No One: Ontological Dissonance and the Double Bind of Conversational AI ALT

【速读】:该论文试图解决的问题是:为何在少数用户中,持续与对话式人工智能(Conversational AI)系统互动会导致妄想体验的出现或稳定化。现有解释多归因于个体易感性或安全工程缺陷,但这些解释不充分。论文的关键解决方案在于提出一种新的理论框架:对话式AI引发本体论失调(ontological dissonance),即交互中呈现出的关系性存在感与缺乏真正能维持这种关系的主体之间产生冲突;这种冲突通过沟通双绑定(communicative double bind)维持,并在情绪脆弱条件下被注意不对称放大,最终可能稳定为一种技术中介的“二人共享妄想”(folie a deux)。这一机制解释了为何明确免责声明常无法中断妄想参与,并为对话式AI的设计与临床使用提供了伦理和实践指导。

链接: https://arxiv.org/abs/2604.10833
作者: Hugh Brosnahan,Izabela Lipinska
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
备注: This version of the article has been accepted for publication in Medicine, Health Care and Philosophy following peer review. This version is distributed under Springer Nature’s terms for accepted manuscripts and does not reflect any post-acceptance improvements or corrections. The Version of Record will be available via Springer Nature upon publication

点击查看摘要

Abstract:Recent reports indicate that sustained interaction with conversational artificial intelligence (AI) systems can, in a small subset of users, contribute to the emergence or stabilisation of delusional experience. Existing accounts typically attribute such cases either to individual vulnerability or to failures of safety engineering. These explanations are incomplete. Drawing on phenomenology, psychiatry, and cognitive neuroscience, this paper argues that the risk arises from the relational and ontological structure of the interaction itself. Conversational AI generates ontological dissonance: a conflict between the appearance of relational presence and the absence of any subject capable of sustaining it. Maintained through a communicative double bind and amplified by attentional asymmetries, this dissonance tends, under conditions of affective vulnerability, to stabilise into a technologically mediated analogue of folie a deux. This account explains why explicit disclaimers often fail to disrupt delusional involvement and clarifies the ethical and clinical implications for the design and use of conversational AI.

[HC-24] MicroVRide: Exploring 4-in-1 Virtual Reality Micromobility Simulator

【速读】:该论文旨在解决当前缺乏安全可控的虚拟环境来研究微移动交通工具(Micromobility Vehicles)骑行者体验与性能的问题。现有方法难以在不增加风险的情况下系统性地评估不同车型(如电动滑板车、Segway、电动独轮车和单轮滑板)的操控特性与用户行为差异。解决方案的关键在于提出MicroVRide——一个模块化四合一虚拟现实(VR)微移动模拟器,能够在同一平台上支持多种微移动设备,并保留每种车辆特有的物理约束和控制映射关系,从而实现对多样骑行行为的高效研究与对比分析,同时显著降低硬件重构成本与实验风险。

链接: https://arxiv.org/abs/2604.10829
作者: Xiaoyan Zhou,Natalia Sempere,Pooria Ghavamian,Asreen Rostami,Andrii Matviienko
机构: KTH Royal Institute of Technology (皇家理工学院); RISE Research Institutes of Sweden (瑞典研究机构); Stockholm University (斯德哥尔摩大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Micromobility vehicles, such as e-scooters, Segways, skateboards, and unicycles, are increasingly adopted for short-distance travel due to their low weight and low emissions. Despite their growing popularity, we lack controlled, low-risk environments to study rider experiences and performance. While virtual reality (VR) simulators offer a promising approach by reducing safety risks and providing immersive experiences, micromobility simulators remain largely underexplored. We introduce MicroVRide, a modular 4-in-1 VR micromobility simulator that supports e-scooters, Segways, electric unicycles, and one-wheeled skateboards on a single platform. The simulator preserves vehicle-specific physical constraints and control metaphors, enabling the study of diverse riding behaviors with minimal hardware reconfiguration. We contribute the simulator design and report a preliminary within-subject study (N = 12) that demonstrates feasibility and reveals distinct experiential profiles across vehicles.

[HC-25] Adaptive Bounded-Rationality Modeling of Early-Stage Takeover in Shared-Control Driving

【速读】:该论文旨在解决共享驾驶(shared-driving)中,驾驶员在接管车辆控制后的最初几秒内因认知状态快速波动而导致的控制质量不稳定问题,这种不稳定性可能引发危险的转向或踏板输入,而传统依赖纯理性假设、忽视认知状态的模型难以及时识别并纠正此类风险。解决方案的关键在于提出一个基于有限理性(bounded rationality)的可解释驾驶员模型,并通过在线自适应机制实时调整潜在的认知参数:具体而言,模型将认知约束嵌入强化学习框架以体现行为的有限理性,同时利用粒子滤波(particle filtering)从驾驶员操作观测数据中实时估计并更新认知状态参数;实验表明,该方法不仅显著提升了对高风险接管行为的预测覆盖范围和提前预警时间,且推断出的认知参数与实时眼动指标高度一致,从而实现了基于认知动态的早期干预。

链接: https://arxiv.org/abs/2604.10806
作者: Jian Sun,Xiyan Jiang,Xiaocong Zhao,Jie Wang,Peng Hang,Zirui Li
机构: Tongji University (同济大学); Nanyang Technological University (南洋理工大学)
类目: Human-Computer Interaction (cs.HC)
备注: 23 pages, 16 figures. To appear in ACM CHI 2026

点击查看摘要

Abstract:Human drivers’ control quality in the first seconds after a handover is critical to shared-driving safety; potentially unsafe steering or pedal inputs therefore require detection and correction by the automated vehicle’s safety-fallback system. Yet performance in this window is vulnerable because cognitive states fluctuate rapidly, causing purely rationality-driven, cognition-unaware models to miss early control dynamics. We present an interpretable driver model grounded in bounded rationality with online adaptation that predicts early-stage control quality. We encode boundedness by embedding cognitive constraints in reinforcement learning and adapt latent cognitive parameters in real time via particle filtering from observations of driver actions. In a vehicle-in-the-loop study (n=41), we evaluated predictive performance and physiological validity. The adaptive model not only anticipated hazardous takeovers with higher coverage and longer lead times than non-adaptive baselines but also demonstrated strong alignment between inferred cognitive parameters and real-time eye-tracking metrics. These results confirm that the model captures genuine fluctuations in driver risk perception, enabling timely and cognitively grounded assistance.

[HC-26] owards Universal Visualisation of Emotional States for Information Systems

【速读】:该论文旨在解决情感信息系统的可视化表示问题,即如何在信息界面中有效呈现人类情绪状态。研究聚焦于离散情绪模型(discrete emotion models)与维度情绪模型(dimensional emotion models)的典型可视化特征,如颜色、大小、速度、形状及动画类型。其关键解决方案在于通过419名参与者的情绪可视化偏好实验,发现颜色、速度和大小与特定离散情绪标签显著相关,而速度则与维度模型中的唤醒度(arousal)密切相关,从而为构建通用的情感表达规范提供了实证基础。

链接: https://arxiv.org/abs/2604.10756
作者: Michal R Wrobel,Agnieszka Landowska,Karolina Makuch
机构: Gdansk University of Technology (格但斯克理工大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The paper concerns affective information systems that represent and visualize human emotional states. The goal of the study was to find typical representations of discrete and dimensional emotion models in terms of color, size, speed, shape, and animation type. A total of 419 participants were asked about their preferences for emotion visualization. We found that color, speed, and size correlated with selected discrete emotion labels, while speed correlated with arousal in a dimensional model. This study is a first step towards defining a universal emotion representation for use in information systems.

[HC-27] Principles Do Not Apply Themselves: A Hermeneutic Perspective on AI Alignment

【速读】:该论文试图解决的问题是:如何在人工智能(AI)对齐(AI alignment)中处理原则性规范在具体情境下应用时的模糊性和歧义性,尤其是在原则冲突、原则过于宽泛或事实不清的情况下,传统依赖静态规则或偏好标签的方法难以有效捕捉实际决策所需的判断力。解决方案的关键在于引入诠释学(hermeneutics)视角,指出AI对齐不仅包含对原则的机械遵循,还必须包含基于上下文的解释性判断——即如何解读、应用和优先排序原则。作者进一步论证,这类判断体现在模型部署时的行为分布中,因此需区分“部署诱导型”与“语料诱导型”评估,并强调仅靠离策略审计(off-policy audits)可能无法识别因响应分布差异而产生的对齐失效问题。

链接: https://arxiv.org/abs/2604.10673
作者: Behrooz Razeghi
机构: Harvard University (哈佛大学)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:AI alignment is often framed as the task of ensuring that an AI system follows a set of stated principles or human preferences, but general principles rarely determine their own application in concrete cases. When principles conflict, when they are too broad to settle a situation, or when the relevant facts are unclear, an additional act of judgment is required. This paper analyzes that step through the lens of hermeneutics and argues that alignment therefore includes an interpretive component: it involves context-sensitive judgments about how principles should be read, applied, and prioritized in practice. We connect this claim to recent empirical findings showing that a substantial portion of preference-labeling data falls into cases of principle conflict or indifference, where the principle set does not uniquely determine a decision. We then draw an operational consequence: because such judgments are expressed in behavior, many alignment-relevant choices appear only in the distribution of responses a model generates at deployment time. To formalize this point, we distinguish deployment-induced and corpus-induced evaluation and show that off-policy audits can fail to capture alignment-relevant failures when the two response distributions differ. We argue that principle-specified alignment includes a context-dependent interpretive component.

[HC-28] CogInstrument: Modeling Cognitive Processes for Bidirectional Human-LLM Alignment in Planning Tasks

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)接口在人机协作中因缺乏对用户隐含推理结构的显式表达而导致的认知错位问题。现有工具通常将用户意图表示为“扁平列表”,忽视了人类决策过程中存在的因果依赖关系与可修订假设,从而限制了用户对模型输出逻辑的验证与控制能力。解决方案的关键在于提出 CogInstrument 系统,该系统通过提取自然语言交互中的认知单元——即“认知动机”(cognitive motifs),将其建模为具有因果关联、可编辑的图形化结构,实现用户与 LLM 之间推理过程的双向对齐。这一结构化外部化机制显著提升了用户对协作过程的逻辑可解释性、可控性和信任度,为构建基于推理的人机协同提供了新的理论框架与实践路径。

链接: https://arxiv.org/abs/2604.10587
作者: Anqi Wang,Dongyijie Pan,Xin Tong,Pan Hui
机构: Hong Kong University of Science and Technology (香港科技大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Although Large Language Models (LLMs) demonstrate proficiency in knowledge-intensive tasks, current interfaces frequently precipitate cognitive misalignment by failing to externalize users’ underlying reasoning structures. Existing tools typically represent intent as “flat lists,” thereby disregarding the causal dependencies and revisable assumptions inherent in human decision-making. We introduce CogInstrument, a system that represents user reasoning through cognitive motifs-compositional, revisable units comprising concepts linked by causal dependencies. CogInstrument extracts these motifs from natural language interactions and renders them as editable graphical structures to facilitate bidirectional alignment. This structural externalization enables both the user and the LLM to inspect, negotiate, and reconcile reasoning processes iteratively. A within-subjects study (N=12) demonstrates that CogInstrument explicitly surfaces implicit reasoning structures, facilitating more targeted revision and reusability over conventional LLM-based dialogue interfaces. By enabling users to verify the logical grounding of LLM outputs, CogInstrument significantly enhances user agency, trust, and structural control over the collaboration. This work formalizes cognitive motifs as a fundamental unit for human-LLM alignment, providing a novel framework for achieving structured, reasoning-based human-AI collaboration.

[HC-29] NexusAI: Enabling Design Space Exploration of Ideas through Cognitive Abstraction and Functional Decomposition

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在创意生成过程中因输出文本结构松散而导致的“固定效应”(fixation)问题,即用户易过早聚焦于次优方案,限制了设计空间的探索。其解决方案的关键在于提出认知抽象(Cognitive Abstraction, CA)计算流水线,通过将原始LLM生成的灵感分解为带类型的功能片段(typed functional fragments)、支持多层级抽象以显式化心理缩放(mental scaling),以及跨维度重组(cross-dimensional recombination)来激发新颖的设计方向。该方法实现了从不可分解的单体式输出到可导航、可变换的设计空间的转变,从而有效缓解固定效应并促进发散性探索。

链接: https://arxiv.org/abs/2604.10575
作者: Anqi Wang,Bingqian Wang,Huiyang Chen,Keqing Jiao,Lei Han,Xin Tong,Pan Hui
机构: Hong Kong University of Science and Technology (香港科技大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); University of Michigan (密歇根大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) offer vast potential for creative ideation; however, their standard interaction paradigm often produces unstructured textual outputs that lead users to prematurely converge on sub-optimal ideas-a phenomenon known as fixation. While recent creativity tools have begun to structure these outputs, they remain compositionally opaque: ideas are organized as monolithic units that cannot be decomposed, abstracted, or recombinable at a sub-idea level. To address this, we propose Cognitive Abstraction (CA), a computational pipeline that transforms raw LLM-generated inspiration into a navigable and transformable design space. We implement this pipeline in NexusAI, a prototype diagramming system that supports (I) decomposition of inspiration into typed functional fragments, (II) multi-level abstraction to externalize mental scaling, and (III) cross-dimensional recombination to spark novel design directions. A within-subject user study (N=14) demonstrates that NexusAI significantly improves design space exploration, reduces cognitive overhead, and facilitates perspective reframing compared to a baseline. Our work contributes: (1) a characterization of “compositional opacity” as a barrier in human-AI co-creation; (2) the CA pipeline for operationalizing creative cognitive primitives at scale; and (3) empirical evidence that structured, multi-level representations can effectively mitigate fixation and support divergent exploration.

[HC-30] Enhanced Self-Learning with Epistemologically-Informed LLM Dialogue

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在自学习场景中,学习者难以开展有意义对话和处理复杂信息的问题。解决方案的关键在于将认识论框架(epistemological frameworks)融入LLM驱动的自学习系统设计中,具体通过引入亚里士多德的“四因说”(Four Causes)作为提示工程(prompt engineering)的核心结构,使系统能够自动生成连贯且语境恰当的追问问题,从而降低认知负荷、促进深度参与和多维度理解。该方法以CausaDisco系统为载体,在控制实验中验证了其相较于基线显著提升了交互质量与探索深度。

链接: https://arxiv.org/abs/2604.10545
作者: Yi-Fan Cao,Kento Shigyo,Yitong Gu,Xiyuan Wang,Weijia Liu,Yang Wang,David Gotz,Zhilan Zhou,Huamin Qu
机构: Hong Kong University of Science and Technology(香港科技大学); Hong Kong Baptist University(香港浸会大学); ShanghaiTech University(上海科技大学); Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); The University of Hong Kong(香港大学); University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)
类目: Human-Computer Interaction (cs.HC)
备注: Submitted to IJHCI

点击查看摘要

Abstract:Large Language Models (LLMs) have advanced self-learning tools, enabling more personalized interactions. However, learners struggle to engage in meaningful dialogue and process complex information. To alleviate this, we incorporate epistemological frameworks within an LLM-based approach to self-learning, reducing the cognitive load on learners and fostering deeper engagement and holistic understanding. Through a formative study (N=26), we identified epistemological differences in self-learner interaction patterns. Building upon these findings, we present \textitCausaDisco, a dialogue-based interactive system that integrates Aristotle’s \textitFour Causes framework into LLM prompts to enhance cognitive support for self-learning. This approach guides learners’ self-learning journeys by automatically generating coherent and contextually appropriate follow-up questions. A controlled study (N=36) demonstrated that, compared to baseline, \textitCausaDisco fostered more engaging interactions, inspired sophisticated exploration, and facilitated multifaceted perspectives. This research contributes to HCI by expanding the understanding of LLMs as educational agents and providing design implications for this emerging class of tools.

[HC-31] owards an Appropriate Level of Reliance on AI: A Preliminary Reliance-Control Framework for AI in Software Engineering

【速读】:该论文旨在解决软件开发者在使用生成式 AI(Generative AI)工具,特别是大语言模型(Large Language Models, LLMs)时存在的依赖失衡问题,即过度依赖可能导致批判性思维能力退化,而依赖不足则可能错失生产力与代码质量提升的潜力。其解决方案的关键在于提出一个初步的“依赖控制框架”(reliance-control framework),该框架以开发者对 AI 工具的控制程度作为识别过载或不足依赖的指标,并据此指导未来研究探索当前及新兴 LLM 工具支持的不同控制层级,从而帮助开发者实现对 AI 技术的负责任且高效利用。

链接: https://arxiv.org/abs/2604.10530
作者: Samuel Ferino,Rashina Hoda,John Grundy,Christoph Treude
机构: Monash University (莫纳什大学); Singapore Management University (新加坡管理大学)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted for publication at the 2nd Workshop on Human-Centered AI for SE (HumanAISE) held at the 34th ACM International Conference on the Foundations of Software Engineering (FSE Companion '26), July 5-9, 2026, Montreal, Quebec, Canada

点击查看摘要

Abstract:How software developers interact with Artificial Intelligence (AI)-powered tools, including Large Language Models (LLMs), plays a vital role in how these AI-powered tools impact them. While overreliance on AI may lead to long-term negative consequences (e.g., atrophy of critical thinking skills); underreliance might deprive software developers of potential gains in productivity and quality. Based on twenty-two interviews with software developers on using LLMs for software development, we propose a preliminary reliance-control framework where the level of control can be used as a way to identify AI overreliance and underreliance. We also use it to recommend future research to further explore the different control levels supported by the current and emergent LLM-driven tools. Our paper contributes to the emerging discourse on AI overreliance and provides an understanding of the appropriate degree of reliance as essential to developers making the most of these powerful technologies. Our findings can help practitioners, educators, and policymakers promote responsible and effective use of AI tools.

[HC-32] Beyond Compliance: A Resistance-Informed Motivation Reasoning Framework for Challenging Psychological Client Simulation

【速读】:该论文旨在解决现有心理咨询模拟器(psychological client simulator)普遍存在过度顺从(over-compliance)的问题,导致咨询师训练不足,难以应对真实临床场景中常见的挑战性行为。解决方案的关键在于提出 ResistClient,其核心是基于客户抗拒理论(Client Resistance Theory)系统建模挑战性行为,并引入一种两阶段训练框架——Resistance-Informed Motivation Reasoning (RIMR):首先通过在 RPC 数据集上监督微调缓解顺从偏差;其次通过过程监督的强化学习联合优化动机真实性与回应一致性,从而实现更符合心理机制的动机推理和响应生成,显著提升模拟器在挑战保真度、行为合理性及推理连贯性方面的表现。

链接: https://arxiv.org/abs/2604.10507
作者: Danni Liu,Bo Liu,Yuxin Hu,Hantao Zhao,Yan Liu,Ding Ding,Jiahui Jin,Jiuxin Cao
机构: Southeast University (东南大学); Purple Mountain Laboratories (紫金山实验室); The Nanjing Derong Wisdom Information Technology Co., Ltd. (南京德融智慧信息技术有限公司)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Psychological client simulators have emerged as a scalable solution for training and evaluating counselor trainees and psychological LLMs. Yet existing simulators exhibit unrealistic over-compliance, leaving counselors underprepared for the challenging behaviors common in real-world practice. To bridge this gap, we present ResistClient, which systematically models challenging client behaviors grounded in Client Resistance Theory by integrating external behaviors with underlying motivational mechanisms. To this end, we propose Resistance-Informed Motivation Reasoning (RIMR), a two-stage training framework. First, RIMR mitigates compliance bias via supervised fine-tuning on RPC, a large-scale resistance-oriented psychological conversation dataset covering diverse client profiles. Second, beyond surface-level response imitation, RIMR models psychologically coherent motivation reasoning before response generation, jointly optimizing motivation authenticity and response consistency via process-supervised reinforcement learning. Extensive automatic and expert evaluations show that ResistClient substantially outperforms existing simulators in challenge fidelity, behavioral plausibility, and reasoning coherence. Moreover, ResistClient facilities evaluation of psychological LLMs under challenging conditions, offering new optimization directions for mental health dialogue systems.

[HC-33] Make it Simple Make it Dance: Dance Motion Simplification to Support Novices Dance Learning

【速读】:该论文旨在解决在线舞蹈教学中初学者因动作复杂度超出其技能水平而产生挫败感的问题,从而影响学习积极性。解决方案的关键在于通过系统性分析识别出五类舞蹈动作复杂度因素,并基于规则与学习相结合的方法开发自动化简化算法,构建原始与简化动作的配对数据集,最终在技术评估、专业编舞者评价和初学者学习效果测试中验证了方法的有效性,实现了在保留舞蹈风格特征的前提下提升动作可学性。

链接: https://arxiv.org/abs/2604.10490
作者: Hyunyoung Han,Murad Eynizada,Son Xuan Nghiem,Sang Ho Yoon
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Online dance tutorials have gained widespread popularity. However, many novices encounter difficulties when dance motion complexity exceeds their skill level, potentially leading to discouragement. This study explores dance motion simplification to address this challenge. We surveyed 30 novices to identify challenging movements, then conducted focus groups with 30 professional choreographers across 10 genres to explore simplification strategies and collect paired original-simplified dance datasets. We identified five complexity factors and developed automated simplification methods using both rule-based and learning-based approaches. We validated our approach through three evaluations. Technical evaluation confirmed our complexity measures and algorithms. 20 professional choreographers assessed motion naturalness, simplification adequacy, and style preservation. 18 novices evaluated learning effectiveness through workload, self-efficacy, objective performance, and perceived difficulty. This work contributes to dance education technology by proposing methods that help make choreography more approachable for beginners while preserving essential characteristics.

[HC-34] ZoomTable: Interactive Exploration of Data Facts in Hierarchical Tables via Semantic Zooming

【速读】:该论文旨在解决在层次化表格(hierarchical table)中嵌入大量数据事实(data fact)时,因空间有限导致的布局冲突问题,从而阻碍有效探索。解决方案的关键在于提出一种基于语义缩放(semantic zooming)的交互式探索范式,并开发了名为ZoomTable的可视化系统;该系统通过语义缩放机制结合数据事实布局方法与推荐机制,既缓解了布局冲突,又支持用户在不同尺度下连贯地探索多维数据事实。

链接: https://arxiv.org/abs/2604.10461
作者: Qiyang Chen,Guozheng Li,Xingqi Wang,Gerile Aodeng,Min Lu,Chi Harold Liu
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Hierarchical tables are an important structure for organizing data with inherent hierarchical relationships. Existing studies have extensively explored methods for data fact exploration from tabular data. In particular, some studies have directly integrated visual data facts into the original table structure to support in-situ exploration, because embedding data facts within the table context can reduce cognitive load by minimizing attention shifts. However, embedding a large amount of extracted data facts into the limited space of hierarchical tables often leads to layout conflicts, hindering effective exploration. To address this issue, we propose an interactive exploration paradigm for hierarchical table data facts based on semantic zooming and develop an interactive visualization system, ZoomTable. The ZoomTable system employs semantic zooming as the interaction method, combined with a data-fact layout method and a data fact recommendation mechanism. This combination not only resolves layout conflicts, but also supports users in coherently exploring multidimensional data facts at different scales. A case study and a user experiment further validate the practicality and efficiency of ZoomTable in real-world data fact exploration scenarios.

[HC-35] racing Prompt-Level Trajectories to Understand Student Learning with AI in Programming Education

【速读】:该论文旨在解决生成式 AI(Generative AI)在编程教学中应用时,因课程规则不一致导致学生使用方式差异显著、能力发展不均的问题。其解决方案的关键在于通过分析163名学生在Python作业中的AI交互日志与代码提交结果,识别出从完全委托到迭代优化的不同提示(prompting)轨迹,并发现尽管多数学生直接复制了AI生成的代码,但许多学生通过迭代式细化实现了更深层次的学习参与。研究进一步表明,提示轨迹可作为学生自我调节能力和学习取向的窗口,从而为设计支持个性化和高效协作学习的教育型AI系统提供依据。

链接: https://arxiv.org/abs/2604.10400
作者: Tianyu Shao,Miguel Feijóo-García,Yi Zhang,Hugo Castellanos,Tawfiq Salem,Alejandra Magana,Tianyi Li
机构: Purdue University (普渡大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:As AI tools such as ChatGPT enter programming classrooms, students encounter differing rules across courses and instructors, which shape how they use AI and leave them with unequal capabilities for leveraging it. We investigate how students engaged with AI in an introductory Python assignment, analyzing student-LLM chat histories and final code submissions from 163 students. We examined prompt-level strategies, traced trajectories of interaction, and compared AI-generated code with student submissions. We identified trajectories ranging from full delegation to iterative refinement, with hybrid forms in between. Although most students directly copied AI-generated code in their submission, many students scaffolded the code generation through iterative refinement. We also contrasted interaction patterns with assignment outcomes and course performance. Our findings show that prompting trajectories serve as promising windows into students’ self-regulation and learning orientation. We draw design implications for educational AI systems that promote personalized and productive student-AI collaborative learning.

[HC-36] Context-KG: Context-Aware Knowledge Graph Visualization with User Preferences and Ontological Guidance

【速读】:该论文旨在解决现有知识图谱(Knowledge Graph, KG)可视化系统在用户交互中缺乏上下文感知的问题,即传统方法仅返回直接查询结果并采用纯拓扑结构的力导向布局,忽略了用户意图、本体距离与语义信息,且无法解释节点排列逻辑。其解决方案的关键在于提出Context-KG框架,该框架以本体(ontology)、上下文(context)和用户意图为核心,利用大语言模型(Large Language Models, LLMs)从自然语言问题及上下文描述中迭代提取用户偏好,识别相关节点类型、属性与关联关系,并据此生成语义可解释、本体引导的定制化布局,从而实现类型感知区域划分与高阶洞察生成,显著提升可视化结果的可解释性、相关性与任务性能。

链接: https://arxiv.org/abs/2604.10384
作者: Rumali Perera,Xiaoqi Wang,Han-wei Shen
机构: The Ohio State University (俄亥俄州立大学); Bosch AI Research (博世人工智能研究)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Knowledge Graphs (KGs) are increasingly used to represent and explore complex, interconnected data across diverse domains. However, existing KG visualization systems remain limited because they fail to provide the context of user questions. They typically return only the direct query results and arrange them with force-directed layouts by treating the graph as purely topological. Such approaches overlook user preferences, ignore ontological distances and semantics, and provide no explanation for node placement. To address these challenges, we propose Context-KG, a context-aware KG visualization framework. Context-KG reframes KG visualization around ontology, context, and user intent. Using Large Language Models (LLMs), it iteratively extracts user preferences from natural language questions and context descriptions, identifying relevant node types, attributes, and contextual relations. These preferences drive a semantically interpretable, ontology-guided layout that is tailored to each query, producing type-aware regions. Context-KG also generates high-level insights unavailable in traditional methods, opening new avenues for effective KG exploration. Evaluations on real world KGs and a comprehensive user study demonstrate improved interpretability, relevance, and task performance, establishing Context-KG as a new paradigm for KG visualization.

[HC-37] Good Question! The Effect of Positive Feedback on Contributions to Online Public Goods

【速读】:该论文旨在解决在线问答(QA)平台上用户参与度下降的问题,特别是以Stack Overflow为代表的软件开发类社区中志愿者提问与回答行为的减少。其解决方案的关键在于通过随机实验验证匿名点赞(upvote)对用户后续参与行为的影响:结果显示,新发布的问题获得匿名点赞后,提问者在四周内再次提问的概率提升6.3%,回答他人问题的概率提升12.9%;其中,回答行为的激励效应更强且可持续至十二周。进一步分析表明,算法放大(algorithmic amplification)——即点赞提高问题可见性——虽对提问行为无显著影响,却显著增强了回答行为,因其提升了问题被更多用户看到的可能性,并促使原提问者从个体互动转向更广泛的社区参与。

链接: https://arxiv.org/abs/2604.10360
作者: Johannes Wachs,Leonore Röseler,Tobias Gesche,Elliott Ash,Anikó Hannák
机构: Corvinus University of Budapest, Hungary; University of Zurich, Switzerland; ETH Zurich, Switzerland
类目: ocial and Information Networks (cs.SI); Human-Computer Interaction (cs.HC); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:Online platforms where volunteers answer each other’s questions are important sources of knowledge, yet participation is declining. We ran a pre-registered experiment on Stack Overflow, one of the largest QA communities for software development (N = 22,856), randomly assigning newly posted questions to receive an anonymous upvote. Within four weeks, treated users were 6.3% more likely to ask another question and 12.9% more likely to answer someone else’s question. A second upvote produced no additional effect. The effect on answering was larger, more persistent, and still significant at twelve weeks. Next, we examine how much of these effects are due to algorithmic amplification, since upvotes also raise a question’s rank and visibility. Algorithmic amplification is not important for the effect on asking additional questions, but it matters a lot for the effect on answering other questions. The increase in visibility increases the probability that another user provides an answer, and that experience appears to shift the poster toward broader community participation.

[HC-38] Infernux: A Python-Native Game Engine with JIT-Accelerated Scripting

【速读】:该论文旨在解决脚本语言(如Python)与原生代码引擎之间因性能差距导致的实时渲染和游戏开发效率瓶颈问题。其核心挑战在于如何在保持Python易用性的同时,实现接近C++级别的运行时性能。解决方案的关键在于将两种成熟技术——批处理数据传输与即时编译(JIT compilation)——融合进统一的引擎架构中:(i) 通过一个批数据桥接机制,在单次边界穿越中将每帧状态批量转移至连续的NumPy数组;(ii) 提供基于Numba的可选JIT路径,将标注的更新函数自动编译为LLVM机器码并启用循环并行化,从而显著提升计算密集型任务的执行效率。

链接: https://arxiv.org/abs/2604.10263
作者: Lizhe Chen
机构: 未知
类目: Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注: 9 pages, 6 figures, 4 tables

点击查看摘要

Abstract:This report describes Infernux, an open-source game engine that pairs a C++17/Vulkan real-time core with a Python production layer connected through a single pybind11 boundary. To close the throughput gap between Python scripting and native-code engines, Infernux combines two established techniques - batch-oriented data transfer and JIT compilation - into a cohesive engine-level integration: (i) a batch data bridge that transfers per-frame state into contiguous NumPy arrays in one boundary crossing, and (ii) an optional JIT path via Numba that compiles annotated update functions to LLVM machine code with automatic loop parallelization. We compare against Unity 6 as a reference on three workloads; readers should note differences in shading complexity, draw-call batching, and editor tooling maturity between the two engines. Infernux is MIT-licensed and available at this https URL.

[HC-39] From Searchable to Non-Searchable: Generative AI and Information Diversity in Online Information Seeking

【速读】:该论文试图解决的问题是:生成式 AI(Generative AI)系统(如 ChatGPT)在在线信息获取过程中,是否扩展或限制用户所接触知识的多样性,这一问题对知识工作、学习与创新具有基础性影响。解决方案的关键在于通过分析超过 20 万次真实的人机交互数据,从“可搜索性”(searchability)这一维度切入,量化比较用户输入与 AI 输出的知识广度和多样性——发现约 80% 的 ChatGPT 查询为不可搜索且覆盖更广的知识领域,表明 Inquiry 模式被拓展;但针对可搜索查询,AI 回答多样性低于 Google 搜索结果,并且 AI 输出多样性会进一步影响用户后续探索行为,形成反馈循环,揭示出“扩展 inquiry”与“约束信息暴露”之间的张力,从而为设计支持探索性知识获取的混合搜索与生成式 AI 系统提供依据。

链接: https://arxiv.org/abs/2604.10258
作者: Yulin Yu,Yizhou Li,Siddharth Suri,Scott Counts
机构: Northwestern University (西北大学); University of Michigan (密歇根大学); Microsoft Research (微软研究院)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Conversational generative AI systems such as ChatGPT are transforming how people seek and engage with information online. Unlike traditional search engines, these systems support open-ended, conversational inquiry, yet it remains unclear whether they ultimately expand or constrain the diversity of knowledge that users encounter in online search spaces, a primary foundation for knowledge work, learning, and innovation. Using over 200,000 real-world human-ChatGPT interactions, we examine how generative-AI-mediated inquiry reshapes diversity in both user inputs and system outputs through the lens of searchability - whether queries could plausibly be answered by traditional search engines. We find that almost 80% of ChatGPT user queries are non-searchable and span a broader knowledge space and topics than searchable queries, indicating expanded modes of inquiry. However, for comparable searchable queries, AI responses are less diverse than Google search results in the majority of topics. Moreover, the diversity of AI responses predicts subsequent changes in users’ inquiry diversity, revealing a feedback loop between AI outputs and human exploration. These findings highlight a tension between expanded inquiry and constrained information exposure, with implications for designing hybrid search and generative-AI systems that better support exploratory knowledge seeking.

[HC-40] Glide-in-Place: Foot-Steered Differential-Drive for Hands-Free VR Locomotion

【速读】:该论文旨在解决受限环境下(如家庭、办公室及交通场景)坐姿虚拟现实(VR)移动的难题,核心挑战包括硬件需轻量化与易部署、控制方式需支持连续曲面运动且不占用双手以实现并发交互。解决方案的关键在于提出一种名为“Glide-in-Place”的足部操纵系统,其通过将双脚前后压力映射至差速驱动模型,使双足作为虚拟轮子,相对驱动力持续决定平移与偏航运动,从而在无需手持输入或离散模式切换的情况下,统一实现前进、原地旋转和弧线路径跟踪。该设计显著提升了导航效率并降低身体负荷,同时保持了与操纵杆相当的主观舒适度。

链接: https://arxiv.org/abs/2604.10237
作者: Bin Hu,Yang Liu,Xizi Liu,Qinggerou Xiao,Xiru Wang,Zhe Yuan,Wen Ku,Xiu Li,Yun Wang
机构: Tsinghua University (清华大学); Beihang University (北京航空航天大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Seated VR locomotion in constrained environments, including homes, offices, and transit settings, calls for hardware that is lightweight and deployable, steering that remains continuous enough for curved motion, and a control channel that leaves the hands free for concurrent interaction. Inspired by the steering logic of self-balancing scooters, we present Glide-in-Place, a seated foot locomotion system that maps per-foot fore-aft pressure to a differential-drive model: the two feet act as virtual wheels whose relative drive continuously determines translation and yaw. This lets users move forward, rotate in place, and follow arcs in one unified vocabulary without hand-held input or discrete mode switches. We evaluated Glide-in-Place in a counterbalanced within-subject study with 16 participants against two baselines: joystick control and a seated walking-in-place technique with discrete snap motions. Across two steering-heavy navigation tasks, zig-zag path following with multitasking and curved-path traversal, Glide-in-Place was consistently faster than Seated-WIP, reduced physical demand, and lowered fatigue-related discomfort without significantly differing from joystick control on total VRSQ. We position Glide-in-Place as a deployable hardware-control design point for constrained seated VR: thin insole sensing, continuous foot steering, and lightweight calibration packaged in one compact artifact.

[HC-41] Building Regulation Capacity in Human-AI Collaborative Learning: A Human-Centred GenAI System

【速读】:该论文旨在解决生成式 AI(GenAI)在协作学习中如何支持社会性分布调控机制的问题,特别是其对协同调节(CoRL)与社会共享调节(SSRL)过程的影响尚不明确。解决方案的关键在于构建一个基于 CoRL 和 SSRL 理论框架的 GenAI 支持型协作学习系统,该系统包含三个核心组件:(1)群体活动生成;(2)提供过程导向提示但不直接给出答案的组内支持代理;(3)嵌入式学习分析仪表盘,将交互痕迹转化为及时总结以辅助监控与决策。通过从机制识别、系统设计到效果评估的三阶段研究路径,验证 GenAI 如何提升群体调控能力和协作绩效。

链接: https://arxiv.org/abs/2604.10221
作者: Yujing Zhang,Jionghao Lin
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 7 pages, 2 figures. Accepted at AIED 2026

点击查看摘要

Abstract:Collaborative learning works when groups regulate together by setting shared goals, coordinating participation, monitoring progress, and responding to breakdowns through co-regulation (CoRL) and socially shared regulation (SSRL). As generative AI (GenAI) enters group work, however, it remains unclear whether and how it supports these socially distributed regulation processes. This doctoral project proposes a GenAI-supported collaborative learning system grounded in CoRL and SSRL to strengthen groups’ socially distributed regulation capacity. The system links three components: (1) group activity generation; (2) an in-group support agent that provides process-focused prompts without giving solutions; and (3) an embedded learning analytics dashboard that turns interaction traces into timely summaries for monitoring and decision making. The project progresses from mechanism to design to impact: it first identifies how GenAI reshapes regulation patterns and which patterns indicate more effective Human-AI collaboration, then builds an integrated GenAI system that targets these patterns, and finally evaluates whether the GenAI system improves regulation capacity and group performance across varying levels of GenAI involvement. Expected contributions include a teacher-in-the-loop system for Human-AI collaboration and process-level evidence on how GenAI reconfigures CoRL and SSRL in group work.

[HC-42] JARVIS: A Just-in-Time AR Visual Instruction System for Cross-Reality Task Guidance

【速读】:该论文旨在解决用户在执行日常任务时因频繁切换阅读指令与实际操作而导致的工作流中断和认知负荷增加的问题,尤其针对现有AI驱动的增强现实(AR)教程系统在混合物理与虚拟空间(cross-reality)任务中支持不足的局限性。其解决方案的关键在于提出JARVIS系统——一个基于视觉-语言模型(VLM)的AR指导系统,能够从单一提示(prompt)生成情境化、分步骤的引导,并结合实时状态验证与自适应视觉反馈,实现跨现实场景(如真实到虚拟R2V、虚拟到真实V2R等四类)下的状态感知与协调,从而提升任务完成效率与用户体验。

链接: https://arxiv.org/abs/2604.10108
作者: Yusi Sun,Ying Jiang,Jiayin Lu,Yin yang,Yong-Hong Kuo,Chenfanfu Jiang
机构: the University of Hong Kong(香港大学); University of California, Los Angeles(加州大学洛杉矶分校); the University of Utah(犹他大学)
类目: Human-Computer Interaction (cs.HC)
备注: 14 pages, 11 figures, 2 tables

点击查看摘要

Abstract:Many everyday tasks rely on external tutorials such as manuals and videos, requiring users to constantly switch between reading instructions and performing actions, which disrupts workflow and increases cognitive load. Augmented reality (AR) enables in-situ guidance, while recent advances in large language models (LLMs) and vision-language models (VLMs) make it possible to automatically generate such guidance. However, existing AI-powered AR tutorial systems primarily focus on physical procedural tasks and provide limited support for hybrid physical and virtual workspaces. To address this gap, we conduct a formative study of cross-reality tasks and identify key requirements for state awareness and cross-reality coordination. We present JARVIS, a VLM-driven AR instruction system that generates contextual, step-by-step guidance from a single prompt, with real-time state verification and adaptive visual feedback. To inform the system design, we conducted a formative study to understand guidance needs across cross-reality tasks, which we categorize into four types, real-to-real (R2R), real-to-virtual (R2V), virtual-to-real (V2R), and virtual-to-virtual (V2V). A within-subjects study (N=14) across four domains shows JARVIS improves usability, workload, success rate, and visualization effectiveness over baselines.

[HC-43] he Double-Edged Sword of Open-Ended Interaction: How LLM -Driven NPCs Affect Players Cognitive Load and Gaming Experience

【速读】:该论文旨在解决生成式 AI(Generative AI)驱动的非玩家角色(LLM-NPCs)在游戏环境中对玩家认知负荷与游戏体验的影响机制问题,特别是其背后的心理机制、任务场景差异以及个体特质的作用。解决方案的关键在于通过一个随机对照实验(N=130),在自研游戏原型“校园文化周”中对比 LLM-NPC 与传统预设脚本 NPC 的交互效果,发现 LLM-NPC 显著增加认知负荷(p < .001),且该效应由表达努力和响应不确定性中介;尽管未显著提升整体游戏体验(p = .195),但增强了玩家感知自主性,同时降低了系统可用性和信任感。此外,该效应因任务场景而异(p < .001),尤其在内容创作和关系构建等开放式模块中更为明显,提示未来智能 NPC 设计需兼顾场景敏感性与用户个性化特征。

链接: https://arxiv.org/abs/2604.10107
作者: Ting-Chen Hsu,Wenran Chen,Jiangxu Lin,Fei Qin,Zheyuan Zhang
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This study examines how large language model-driven non-player characters (LLM-NPCs) affect players’ cognitive load and gaming experience, with a particular focus on the underlying psychological mechanisms, differences across task scenarios, and the role of individual traits. Conducting a randomized between-subject experiment (N=130) in a self-developed game prototype “Campus Culture Week”, we compared player interactions with LLM-NPCs and traditional pre-scripted NPCs across multiple interactive modules. The results showed that LLM-NPCs significantly increased players’ cognitive load (p .001), an effect mediated by factors such as expressive effort and response uncertainty. However, LLM-NPCs did not yield a statistically significant improvement in overall gaming experience (p = .195); while they positively influenced players’ perceived autonomy, they exerted a negative influence on system usability and trust. The effects of LLM-NPCs also significantly varied across task scenarios (p .001), with stronger increases in cognitive load in more open-ended modules such as content creation and relationship building. The influence of individual differences was generally limited, although the personality traits of extraversion (p = .031) and neuroticism (p = .047) demonstrated some predictive power regarding cognitive load. This study provides empirical evidence for understanding the “double-edged sword” effect of LLM-NPCs on player experience, and highlight the importance of scenario-sensitive and user-sensitive design in intelligent NPC systems.

[HC-44] Raiven: LLM -Based Visualization Authoring via Domain-Specific Language Mediation

【速读】:该论文旨在解决科学可视化(Scientific Visualization)与信息可视化(Information Visualization)工具之间因领域割裂导致的创作障碍问题,即两种领域的专业知识难以互通。现有基于大语言模型(Large Language Model, LLM)的可视化生成方法存在非确定性代码输出、缺乏正确性保障以及可能产生无声数据伪造等缺陷。其解决方案的关键在于提出一个形式化定义的领域特定语言(Domain-Specific Language, DSL)——RaivenDSL,它统一了2D、3D和表格数据的可视化表示,并通过基于数据模式约束的LLM推理生成紧凑且可验证的DSL规范,再由确定性编译器将其转换为可执行的D3或类似语法代码。由于LLM仅操作数据元信息而非原始数据,输出具有确定性、可验证性,从而从根本上杜绝了数据伪造风险,同时在效率、成本和交互质量上显著优于现有方法。

链接: https://arxiv.org/abs/2604.10008
作者: Alexandra Irger,Ella Hugie,Minghao Guo,Simon Warchol,Kenneth Moreland,David Pugmire,Wojciech Matusik,Hanspeter Pfister
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: *Alexandra Irger and Ella Hugie are co-first authors

点击查看摘要

Abstract:Visualization is central to scientific discovery, yet authoring tools remain split between information and scientific visualization, and expertise in one rarely transfers to the other. Large Language Model (LLM) based systems promise to bridge this gap through natural language, but current approaches generate code non-deterministically, with no guarantee of correctness and no protection against silent data fabrication. We present Raiven, a conversational system that mediates visualization authoring through a formally defined domain-specific language. RaivenDSL unifies scientific and information visualization in a single representation spanning 2D, 3D, and tabular data. The LLM produces a compact RaivenDSL specification under schema-guided constraints, and a deterministic compiler translates it to executable D3 or this http URL code. Because the LLM operates only on dataset metadata, outputs are deterministic, specifications are verifiable before execution, and data fabrication is impossible by construction. In a 100-task benchmark, Raiven achieves 100% compilation, is up to six times faster and six times cheaper than state-of-the-art LLMs, while improving interaction quality, correctness, and data faithfulness. An expert user study shows that Raiven significantly reduces debugging effort and makes it easier to produce correct visualizations.

[HC-45] Efficient Personalization of Generative User Interfaces

【速读】:该论文旨在解决生成式用户界面(Generative UI)在个性化适配中面临的难题,即用户对界面属性的偏好具有主观性、难以明确表达且从稀疏反馈中推断成本高昂。核心问题在于如何高效地实现个体化定制,同时避免依赖固定的设计概念标准。解决方案的关键在于提出一种样本高效的个性化方法,该方法不基于预设的设计概念规则,而是将新用户的偏好表示为与先前设计师群体的关联关系,从而更灵活地捕捉设计偏好的多样性。实验表明,该方法在偏好建模上优于预训练UI评估器和更大规模的多模态模型,并在增加反馈时展现出更好的可扩展性;当用于指导界面生成时,其产出的UI被12位新设计师更偏好于基线方法(包括直接用户提示)。

链接: https://arxiv.org/abs/2604.09876
作者: Yi-Hao Peng,Samarth Das,Jeffrey P. Bigham,Jason Wu
机构: Carnegie Mellon University (卡内基梅隆大学); Purdue University (普渡大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Generative user interfaces (UIs) create new opportunities to adapt interfaces to individual users on demand, but personalization remains difficult because desirable UI properties are subjective, hard to articulate, and costly to infer from sparse feedback. We study this problem through a new dataset in which 20 trained designers each provide pairwise judgments over the same 600 generated UIs, enabling direct analysis of preference divergence. We find substantial disagreement across designers (average kappa = 0.25), and written rationales reveal that even when designers appeal to similar concepts such as hierarchy or cleanliness, designers differ in how they define, prioritize, and apply those concepts. Motivated by these findings, we develop a sample-efficient personalization method that represents a new user in terms of prior designers rather than a fixed rubric of design concepts. In a technical evaluation, our preference model outperforms both a pretrained UI evaluator and a larger multimodal model, and scales better with additional feedback. When used to personalize generation, it also produces interfaces preferred by 12 new designers over baseline approaches, including direct user prompting. Our findings suggest that lightweight preference elicitation can serve as a practical foundation for personalized generative UI systems.

[HC-46] Digital hybridity and relics in cultural heritage: using corpus linguistics to inform design in emerging technologies from AI to VR

【速读】:该论文试图解决的问题是如何在数字化转型背景下,负责任地捕捉和呈现具有宗教与文化意义的圣物(relic),尤其是在混合技术(hybrid technologies)日益普及的背景下,如何平衡技术带来的可及性提升与传统对真实性(authenticity)和感官体验(sensory experience)的敏感性。解决方案的关键在于采用跨学科(transdisciplinary)的研究视角,特别是通过语料库语言学方法分析历史文本与当代网络文本中“relic”一词的修饰语,揭示其在不同时代被赋予的不同认知维度——从早期作为道德与精神象征及宗教政治工具,到现代更多被视为文化遗产符号。这一分析为理解数字再现中的伦理挑战提供了语义基础,并指出生成式 AI 等混合技术虽能增强互动与可及性,但必须谨慎处理其对传统神圣性认知的潜在冲击。

链接: https://arxiv.org/abs/2604.09669
作者: Emma McClaughlin,Glenn McGarry,Alan Chamberlain,Geert De Wilde,Oliver Butler
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Digital Libraries (cs.DL); Machine Learning (cs.LG)
备注: This is a (ACM J.5 Arts Humanities Paper) relating to Hybrid Technologies, Language, AI, VR, Interaction and Experience. 24 pages. Int J Digit Humanities (2026)

点击查看摘要

Abstract:Hybrid technologies enable the blending of physical and digital elements, creating new ways to experience and interact with the world. Such technologies can transform engagement with relics, both secular and sacred but they present challenges for capturing faith, belief, and representation responsibly. Given the complexities of digital representation and the ethical challenges inherent in digitising culturally significant objects, a transdisciplinary understanding of these issues is needed. To inform this discussion from a linguistic perspective, we examined the representation of relics in historical and contemporary texts. Using a corpus linguistic approach to extract modifiers of the word relic in corpora of Early Modern English books and contemporary web sourced texts from 2021, we examined the multifaceted ways in which relics have been perceived and evaluated over time. Early texts consider relics as both objects of moral and spiritual significance, and tools of religious and political control, while they are more often framed as heritage symbols, reflecting past events, places, and traditions in contemporary texts. We discuss how hybrid, sometimes AI based technologies can enhance accessibility and engagement, whilst also challenging traditional sensitivities around authenticity and sensory experience, which are integral to the meaning and significance of relics.

[HC-47] GazeCode: Recall-Based Verification for Higher-Quality In-the-Wild Mobile Gaze Data Collection

【速读】:该论文旨在解决野外环境下(in-the-wild)移动眼动估计中标签噪声(label noise)的问题,即由于无监督数据采集导致参与者可能并未真正注视目标,而仅通过猜测或周边视觉(peripheral viewing)即可满足低熵验证机制(如二元探测),从而降低标注可靠性。解决方案的关键在于提出GazeCode——一种基于回忆的验证范式,通过多数字回忆任务(multi-digit recall task)将随机正确率降至10⁻ᴺ,并结合抗周边刺激设计(anti-peripheral stimulus design),即使用小尺寸、低对比度、短时呈现的数字刺激,有效抑制周边视觉利用,从而显著提升眼动标签的有效性和可信度。该系统同步记录前向摄像头视频、惯性测量单元(IMU)流和目标事件,以高分辨率时间戳保障数据一致性,为高质量野外眼动数据采集提供可靠技术路径。

链接: https://arxiv.org/abs/2604.09659
作者: Yaxiong Lei,Thomas Davies,Xinya Gong,Shijing He,Juan Ye
机构: University of St Andrews(圣安德鲁斯大学); University of Essex(埃塞克斯大学); King’s College London(伦敦国王学院)
类目: Human-Computer Interaction (cs.HC)
备注: 6 pages, 3 figures, this paper is accepted by CHI’26 as Poster paper

点击查看摘要

Abstract:Large-scale mobile gaze estimation relies on in-the-wild datasets, yet unsupervised collection makes it difficult to verify whether participants truly foveate logged targets. Prior mobile protocols often use low-entropy validation (e.g., binary probes) that can be satisfied by guessing and may still allow peripheral viewing, introducing label noise. We present \textbfGazeCode, a recall-based verification paradigm for higher-confidence in-the-wild mobile gaze data collection that strengthens \emphlabel validity through a multi-digit recall task (reducing random success to 10^-N ) paired with anti-peripheral stimulus design (small, low-contrast, brief digits). The system logs synchronized front-camera video, IMU streams, and target events using high-resolution timestamps. In a formative study (N=3), we probe key parameters (opacity, duration) and directly test peripheral exploitability using an eccentricity-controlled \textitRING condition. Results show that low-opacity digits substantially reduce peripheral readability while remaining usable for attentive foveation, supporting the inference that correct recall corresponds to higher-confidence gaze labels. We conclude with actionable design guidelines for robust in-the-wild gaze data collection.

[HC-48] nyGaze: Lightweight Gaze-Gesture Recognition on Commodity Mobile Devices

【速读】:该论文旨在解决移动设备上基于凝视手势(gaze gestures)的无手输入问题,其核心挑战在于:(1) 设计用户可学习且易回忆的凝视手势;(2) 构建适用于设备端部署的高效识别模型。解决方案的关键在于提出一个端到端的流水线,结合商用ARKit提供的头部/眼部姿态变换数据,并采用基于学习理论的分层引导-回忆协议(scaffolded guidance-to-recall protocol),以提升手势学习效率;同时引入轻量级时间序列模型TinyHAR,在仅使用46k参数的情况下实现了高精度的5类手势识别(Macro F1 = 0.960)和4类用户身份识别(Macro F1 = 0.997),并揭示了头部姿态动态在移动凝视手势中的高度信息价值,强调了具身头部-眼睛协同作为设计关键因素的重要性。

链接: https://arxiv.org/abs/2604.09658
作者: Yaxiong Lei,Hyochan Cho,Fergus Buchanan,Shijing He,Xinya Gong,Yuheng Wang,Juan Ye
机构: University of St Andrews (圣安德鲁斯大学); University of Essex (埃塞克斯大学); King’s College London (伦敦国王学院)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures. Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26), April 13-17, 2026, Barcelona, Spain

点击查看摘要

Abstract:Gaze gestures can provide hands free input on mobile devices, but practical use requires (i) gestures users can learn and recall and (ii) recognition models that are efficient enough for on-device deployment. We present an end-to-end pipeline using commodity ARKit head/eye transforms and a scaffolded guidance-to-recall protocol grounded in learning theory. In a pilot feasibility study (N=4 participants; 240 trials; controlled single-session setting), we benchmark a compact time-series model (TinyHAR) against deeper baselines (DeepConvLSTM, SA-HAR) on 5-way gesture recognition and 4-way user identification. TinyHAR achieves strong performance in this pilot benchmark (Macro F1 = 0.960 for gesture recognition; Macro F1 = 0.997 for user identification) while using only 46k parameters. A modality analysis further indicates that head pose dynamics are highly informative for mobile gaze gestures, highlighting embodied head–eye coordination as a key design consideration. Although the small sample size and controlled setting limit generalizability, these results indicate a potential direction for further investigation into on-device gaze gesture recognition.

[HC-49] Prints in the Magnetic Dust: Robust Similarity Search in Legacy Media Images Using Checksum Count Vectors

【速读】:该论文旨在解决早期家用计算机时代数字文物在磁带介质数字化后,如何高效实现去重、变体识别与自动修复的问题,以提升历史数字遗产的保存效率。其核心解决方案是提出基于校验和计数向量(Checksum Count Vectors)的特征表示方法,利用该向量在大规模磁带图像数据集(n=4902)中进行序列匹配,从而实现对损坏录音(最多缺失75%记录)的变体检测(准确率58%)与替代副本识别(准确率97%),为构建全自动化的修复、去重与语义整合流程提供了关键技术支撑。

链接: https://arxiv.org/abs/2604.09657
作者: Maciej Grzeszczuk,Kinga Skorupska,Grzegorz M. Wójcik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)
备注: 10 pages, 6 figures. Peer-reviewed, presented on Machine Intelligence and Digital Interaction (MIDI) Conference on 11 december 2025 in Warsaw, POLAND. To be included in the proceedings (print in progress)

点击查看摘要

Abstract:Digitizing magnetic media containing computer data is only the first step towards the preservation of early home computing era artifacts. The audio tape images must be decoded, verified, repaired if necessary, tested, and documented. If parts of this process could be effectively automated, volunteers could focus on contributing contextual and historical knowledge rather than struggling with technical tools. We therefore propose a feature representation based on Checksum Count Vectors and evaluate its applicability to detecting duplicates and variants of recordings within a large data store. The approach was tested on a collection of decoded tape images (n=4902), achieving 58% accuracy in detecting variants and 97% accuracy in identifying alternative copies, for damaged recordings with up to 75% of records missing. These results represent an important step towards fully automated pipelines for restoration, de-duplication, and semantic integration of historical digital artifacts through sequence matching, automatic repair and knowledge discovery.

[HC-50] NeuroPath: Practically Adopting Motor Imagery Decoding through EEG Signals

【速读】:该论文旨在解决运动想象(Motor Imagery, MI)脑机接口(Brain-Computer Interface, BCI)在实际部署中面临的三大挑战:(i)现有模型为每个MI任务独立设计且结构不统一,导致无法从多源数据中学习鲁棒表征;(ii)固定电极配置限制了模型在不同设备上的泛化能力;(iii)在低信噪比(low-SNR)条件下性能显著下降,尤其在消费级EEG设备上。解决方案的关键在于提出NeuroPath——一种受大脑皮层到头皮信号传导路径启发的神经架构,其核心包括三个模块:用于信号滤波、空间表征学习和特征分类的专用组件,实现统一解码;引入空间感知图适配器(spatially aware graph adapter),以适应不同数量和位置的电极配置;并通过多模态辅助训练增强EEG表示的鲁棒性,在噪声环境下稳定性能。

链接: https://arxiv.org/abs/2604.09654
作者: Jiani Cao,Kun Wang,Yang Liu,Zhenjiang Li
机构: City University of Hong Kong (香港城市大学); Florida State University (佛罗里达州立大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Motor Imagery (MI) is an emerging Brain-Computer Interface (BCI) paradigm where a person imagines body movements without physical action. By decoding scalp-recorded electroencephalography (EEG) signals, BCIs establish direct communication to control external devices, offering significant potential in prosthetics, rehabilitation, and human-computer interaction. However, existing solutions remain difficult to deploy. (i) Most employ independent, opaque models for each MI task, lacking a unified architectural foundation. Consequently, these models are trained in isolation, failing to learn robust representations from diverse datasets, resulting in modest performance. (ii) They primarily adopt fixed sensor deployment, whereas real-world setups vary in electrode number and placement, causing models to fail across configurations. (iii) Performance degrades sharply under low-SNR conditions typical of consumer-grade EEG. To address these challenges, we present NeuroPath, a neural architecture for robust MI decoding. NeuroPath takes inspiration from the brain’s signal pathway from cortex to scalp, utilizing a deep neural architecture with specialized modules for signal filtering, spatial representation learning, and feature classification, enabling unified decoding. To handle varying electrode configurations, we introduce a spatially aware graph adapter accommodating different electrode numbers and placements. To enhance robustness under low-SNR conditions, NeuroPath incorporates multimodal auxiliary training to refine EEG representations and stabilize performance on noisy real-world data. Evaluations on three consumer-grade and three medical-grade public datasets demonstrate that NeuroPath achieves superior performance. Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP) Cite as: arXiv:2604.09654 [cs.HC] (or arXiv:2604.09654v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2604.09654 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3774906.3802770 Focus to learn more DOI(s) linking to related resources

[HC-51] WearBCI Dataset: Understanding and Benchmarking Real-World Wearable Brain-Computer Interfaces Signals

【速读】:该论文旨在解决可穿戴脑-机接口(Wearable Brain-Computer Interfaces, BCIs)在真实场景中因运动伪影(motion artifacts)导致的信号质量下降问题,这一问题限制了其在移动环境下的性能评估与实际部署。现有大多数可穿戴BCI数据集均采集于静止或受控实验室条件下,无法充分反映动态身体运动对脑电(EEG)信号的影响。为填补这一空白,作者提出了WearBCI数据集,其关键创新在于首次系统性地收集了36名参与者在不同运动状态(如肢体活动、行走和导航)下同步的多模态数据(EEG、惯性测量单元IMU及第一人称视角视频),并在此基础上开展运动伪影影响分析与代表性脑电信号增强技术的基准测试。该数据集不仅支持对运动干扰机制的深入研究,还推动了跨模态脑电信号增强与多维人类行为理解等新应用场景的发展。

链接: https://arxiv.org/abs/2604.09649
作者: Haoxian Liu,Hengle Jiang,Lanxuan Hong,Xiaomin Ouyang
机构: Hong Kong University of Science and Technology (香港科技大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: Accepted by Sensys 2026

点击查看摘要

Abstract:Brain-computer interfaces (BCIs) have opened new platforms for human-computer interaction, medical diagnostics, and neurorehabilitation. Wearable BCI systems, which typically employ non-invasive electrodes for portable monitoring, hold great promise for real-world applications, but also face significant challenges of signal quality degradation caused by motion artifacts and environmental interferences. Most existing wearable BCI datasets are collected under stationary or controlled lab settings, limiting their utility for evaluating performance under body movement. To bridge this gap, we introduce WearBCI, the first dataset that comprehensively evaluates wearable BCI signals under different motion dynamics with synchronized multimodal recordings (EEG, IMU, and egocentric video), and systematic benchmark evaluations for studying impacts of motion artifact. Specifically, we collect data from 36 participants across different motion dynamics, including body movements, walking, and navigation. This dataset includes synchronized electroencephalography (EEG), inertial measurement unit (IMU) data, and egocentric video recordings. We analyze the collected wearable EEG signals to understand the impact of motion artifacts across different conditions, and benchmark representative EEG signal enhancement techniques on our dataset. Furthermore, we explore two new case studies: cross-modal EEG signal enhancement and multi-dimension human behavior understanding. These findings offer valuable insights into real-world wearable BCI deployment and new applications.

[HC-52] Beyond Theory of Mind in Robotics

【速读】:该论文旨在解决当前机器人社会交互研究中基于心智理论(Theory of Mind, ToM)范式的局限性问题。传统ToM假设社会意义由内在心理状态向外投射至行为,并依赖观察者对固定行为意义的被动解码,这与真实社会互动中动态、参与式的意义生成过程不符。论文提出的关键解决方案是:将社会意义视为通过交互双方即时协调共同建构的结果,而非从行为中解码的静态属性;据此重构机器人设计逻辑,即从内部状态建模转向维持协调的策略设计,从旁观者推理转向主动参与机制,以及从固定行为解释转向通过响应稳定意义潜能的动态机制。

链接: https://arxiv.org/abs/2604.09612
作者: Malte F. Jung
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Theory of Mind, the capacity to explain and predict behavior by inferring hidden mental states, has become the dominant paradigm for social interaction in robotics. Yet ToM rests on three assumptions that poorly capture how most social interaction actually unfolds: that meaning travels inside-out from hidden states to observable behavior; that understanding requires detached inference rather than participation; and that the meaning of behavior is fixed and available to a passive observer. Drawing on ethnomethodology, conversation analysis, and participatory sense-making, I argue that social meaning is not decoded from behavior but produced through moment-to-moment coordination between agents. This interactional foundation has direct implications for robot design: shifting from internal state modeling toward policies for sustaining coordination, from observer-based inference toward active participation, and from fixed behavioral meaning toward meaning potential stabilized through response.

[HC-53] Human-AI Interaction Traces as Blackout Poetry: Reframing AI-Supported Writing as Found-Text Creativity

【速读】:该论文试图解决的问题是:在生成式 AI(Generative AI)辅助写作场景中,如何平衡AI参与的透明性与创作者的创造性表达,避免因过度强调审计式披露(audit-oriented disclosure)而导致对人机协作过程的量化监控,从而削弱读者对作者创作贡献的信任。其解决方案的关键在于将人机交互痕迹(human-AI interaction traces)重构为具有表现力的美学 artifacts(艺术化产物),借鉴黑体诗(blackout poetry)的创作逻辑,将AI生成文本视为“发现材料”,通过作者的筛选、重写与再诠释行为赋予其新的意义,从而使交互痕迹成为体现创作意图与价值的表达载体,增强读者对人类创造力的认可与信任。

链接: https://arxiv.org/abs/2604.09605
作者: Syemin Park,Soobin Park,Youn-kyung Lim
机构: KAIST(韩国科学技术院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 4 pages, Accepted to ACM CHI 2026 Workshop on Herding CATs: Making Sense of Creative Activity Traces

点击查看摘要

Abstract:LLMs offer new creative possibilities for writers but also raise concerns about authenticity and reader trust, particularly when AI involvement is disclosed. Prior research has largely framed this as an issue of transparency and provenance, emphasizing the disclosure of human-AI interaction traces that account for how much the AI wrote and what the human did. Yet such audit-oriented disclosures may risk reducing creative collaboration to quantification and surveillance. In this position paper, we argue for a different lens by exploring how human-AI interaction traces might instead function as expressive artifacts that foreground the meaning-making inherent in human-AI collaboration. Drawing inspiration from blackout poetry, we frame AI-generated text as found material through which writers’ acts of curation and reinterpretation become inscribed atop the AI’s original output. In this way, we suggest that designing interaction traces as aesthetic artifacts may help readers better appreciate and trust writers’ creative contributions in AI-assisted writing.

[HC-54] Visualization Retrieval for Data Literacy: Position Paper

【速读】:该论文旨在解决当前数据素养(data literacy)教育资源中缺乏高效查询、比较与导航可视化设计空间的机制问题,现有资源如可视化图库和数据集虽提供示例,但难以支持学习者主动探索与批判性思考。解决方案的关键在于将可视化检索(visualization retrieval)作为基础设施引入教育场景,通过动态、以探究为导向的学习环境,赋能学习者在数据生命周期各阶段实现设计空间探索、可视化对比与批判、以及资源管理等能力,从而促进其意图表达、跨越技术障碍并主动进行数据推理。

链接: https://arxiv.org/abs/2604.09598
作者: Huyen N. Nguyen,Nils Gehlenborg
机构: Harvard Medical School (哈佛医学院)
类目: Human-Computer Interaction (cs.HC)
备注: 4 pages. Accepted to the Panel Track of the CHI 2026 Workshop on Data Literacy

点击查看摘要

Abstract:Current resources for data literacy education, such as visualization galleries and datasets, provide useful examples but lack mechanisms for learners to query, compare, and navigate the visualization design space efficiently. This position paper advocates for visualization retrieval as essential infrastructure for data literacy, transforming static collections into dynamic, inquiry-based learning environments. We analyze the role of retrieval across the data lifecycle, demonstrating how it facilitates design space exploration and vocabulary expansion, supports data consumption through visualization comparison and critique, and aids data management via resource curation. We outline key opportunities for future research and system design, including integrated retrieval-authoring environments, pedagogical relevance modeling, and collaborative educational corpora. Ultimately, we argue that visualization retrieval systems empower learners to articulate intent, bridge technical barriers, and proactively reason with data.

[HC-55] From Theory to Protocol: Executable Frameworks for Creative Emergence and Strategic Foresight

【速读】:该论文旨在解决创造性思维与战略预见性研究中理论与实践脱节的问题——现有描述性理论(如Koestler的双关联理论、de Bono的横向思维和Ansoff的弱信号理论)虽能解释创意与战略洞见的产生机制,但缺乏可操作的方法以实现其按需生成。解决方案的关键在于提出两个可执行的协议:GHOSTY COLLIDER(通过结构去标签化与跨域碰撞实现跨领域创意涌现)和PRECOG PROTOCOL(基于多轴时间判断进行信号驱动的战略预见),二者均将原有理论形式化为具有明确质量标准、反模式检测机制及可测量输出的五步流程,从而实现从理论到实践的闭环转化。

链接: https://arxiv.org/abs/2604.09597
作者: Shun Fujiyoshi
机构: Independent Researcher
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 22 pages, 8 tables, 5 case studies, protocols available at this https URL

点击查看摘要

Abstract:Creativity and strategic foresight have been extensively studied through descriptive theories – Koestler’s bisociation (1964), de Bono’s lateral thinking (1967), and Ansoff’s weak signals (1975) explain why creative and strategic insights occur, but offer limited guidance on how to produce them on demand. This paper presents two executable protocols that bridge this theory-practice gap: GHOSTY COLLIDER, a 5-step protocol for cross-domain creative emergence through structural de-labeling and collision, and PRECOG PROTOCOL, a 5-step protocol for signal-based strategic foresight with multi-axis timing judgment. We formalize established theories into repeatable, step-by-step procedures with explicit quality criteria, anti-pattern detection, and measurable outputs. We evaluate the protocols through three complementary methods: (1) five detailed case studies across distinct domains, (2) controlled comparisons against standard methods using identical inputs, and (3) a batch experiment across eight random domain pairings (N=8, success rate 87.5%, failure rate 12.5%) with one blind evaluation. Preliminary evidence suggests that protocol-driven outputs exhibit greater structural novelty, higher parameter specificity, and qualitatively distinct creative directions compared to outputs from standard methods. The blind evaluation confirmed the direction of author assessments (protocol output scored 74/80 vs. brainstorming 49/80). These results, while limited by single-operator execution, indicate that the theory-to-protocol translation preserves and potentially enhances the generative power of the underlying theories. The protocols, updated to version 2 incorporating lessons from failure case analysis, are released as open-access documents under CC BY-NC 4.0 at this https URL.

[HC-56] Co-Disclosing the Computer: LLM -Mediated Computing through Reflective Conversation

【速读】:该论文试图解决的问题是:随着大语言模型(Large Language Models, LLMs)在动态生成软件方面的能力不断增强,传统以固定应用程序为核心的计算模式已难以适应人机交互的新范式,亟需重新定义计算机在人类活动中的角色。其解决方案的关键在于提出“LLM-mediated computing”(LLM中介计算)这一新范式,强调交互不再依赖预设应用,而是通过人类意图与LLM的实时解释共同涌现;同时引入“反思性对话”(reflective conversation)作为设计隐喻、以现象学后哲学(postphenomenology)为分析框架,并基于“共揭示”(co-disclosure)机制重构计算机的存在方式——即计算机在使用中被构成,从而实现从静态工具到动态协作者的转变。

链接: https://arxiv.org/abs/2604.09586
作者: Mattias Rost
机构: University of Gothenburg(哥德堡大学)
类目: Human-Computer Interaction (cs.HC)
备注: CHI’26

点击查看摘要

Abstract:Large language models (LLMs) are changing how we interact with computers. As they become capable of generating software dynamically, they invite a fundamental rethinking of the computer’s role in human activity. In this conceptual paper, we introduce LLM-mediated computing: a paradigm in which interaction is no longer structured around fixed applications, but emerges in real-time through human intent and LLM interpretation. We make three contributions: (1) we articulate a new interaction metaphor of reflective conversation to guide future design, (2) we use the lens of postphenomenology to understand the human-LLM-computer relation, and (3) we propose a new mode of computing based on co-disclosure, in which the computer is constituted in use. Together, they define a new mode of computing, provide a lens to analyze it, and offer a metaphor to design with.

[HC-57] Evaluating Visual Prompts with Eye-Tracking Data for MLLM -Based Human Activity Recognition

【速读】:该论文旨在解决高频率、多维度传感器数据(如眼动追踪数据)在直接输入到大语言模型(Large Language Models, LLMs)时导致的信息损失和高令牌(token)消耗问题。其解决方案的关键在于采用视觉提示(visual prompting)策略,将原始眼动追踪信号转化为可视化图像(包括时间线、热力图和扫描路径三种类型),作为多模态大语言模型(Multimodal Large Language Models, MLLMs)的输入。该方法显著提升了表示的token效率与可扩展性,验证了MLLMs在物联网(IoT)场景下对高频传感器信号进行有效推理的潜力。

链接: https://arxiv.org/abs/2604.09585
作者: Jae Young Choi,Seon Gyeom Kim,Hyungjun Yoon,Taeckyung Lee,Donggun Lee,Jaeryung Chung,Jihyung Kil,Ryan Rossi,Sung-Ju Lee,Tak Yeon Lee
机构: KAIST; Adobe Research
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages. Conditionally accepted to IEEE PacificVis 2026 (VisNotes track)

点击查看摘要

Abstract:Large Language Models (LLMs) have emerged as foundation models for IoT applications such as human activity recognition (HAR). However, directly applying high-frequency and multi-dimensional sensor data, such as eye-tracking data, leads to information loss and high token costs. To mitigate this, we investigate a visual prompting strategy that transforms sensor signals into data visualization images as an input to multimodal LLMs (MLLMs) using eye-tracking data. We conducted a systematic evaluation of MLLM-based HAR across three public eye-tracking datasets using three visualization types of timeline, heatmap, and scanpath, under varying temporal window sizes. Our findings suggest that visual prompting provides a token-efficient and scalable representation for eye-tracking data, highlighting its potential to enable MLLMs to effectively reason over high-frequency sensor signals in IoT contexts.

[HC-58] race-Aware Workflows for Co-Creating Branded Content with Generative AI

【速读】:该论文旨在解决小型企业主(Small-Business Owners, SBOs)在使用生成式 AI (Generative AI) 工具创作符合品牌调性的社交媒体内容时所面临的三大挑战:一是难以将抽象的品牌“感觉”转化为有效的文本提示;二是难以回顾和比较先前生成的图像版本;三是难以理解迭代过程中的变化以指导优化。解决方案的关键在于设计一个原型系统,通过三个核心机制实现支持:首先,辅助品牌语义的明确表达以提升提示质量;其次,支持基于反馈的探索性生成;最后,维护一个可追溯的分支图像迭代记录板(traceboard),使SBO能够跟踪探索路径、理解变化差异并持续优化内容产出。

链接: https://arxiv.org/abs/2604.09583
作者: Taehyun Yang,Eunhye Kim,Zhongzheng Xu,Fumeng Yang
机构: University of Maryland College Park (马里兰大学学院公园分校); KAIST (韩国科学技术院)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to CHI 2026 Workshop

点击查看摘要

Abstract:Generative AI tools have lowered barriers to producing branded social media images and captions, yet small-business owners (SBOs) still struggle to create on-brand posts without access to professional designers or marketing consultants. Although these tools enable fast image generation from text prompts, aligning outputs with a brand’s intended look and feel remains a demanding, iterative task. In this position paper, we explore how SBOs navigate iterative content creation and how AI-assisted systems can support SBOs’ content creation workflow. We conducted a preliminary study with 12 SBOs who independently manage their businesses and social media presence, using a questionnaire to collect their branding practices, content workflows, and use of generative AI alongside conventional design tools. We identified three recurring challenges: (1) translating brand “feel” into effective prompts, (2) difficulty revisiting and comparing prior image generations, and (3) difficulty making sense of changes between iterations to steer refinement. Based on these findings, we present a prototype that scaffolds brand articulation, supports feedback-informed exploration, and maintains a traceboard of branching image iterations. Our work illustrates how traces of the iterative process can serve as workflow support that helps SBOs keep track of explorations, make sense of changes, and refine content.

[HC-59] OpeFlo: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding

【速读】:该论文旨在解决网页可用性(Web Usability)评估过程中依赖耗时的用户研究和专家评审,从而限制产品迭代速度的问题,尤其针对小型团队和敏捷开发流程。解决方案的关键在于提出OpenFlo——一个用户体验评估代理(User-Experience Evaluation Agent),其核心创新是通过多模态接地(Multimodal Grounding)实现对真实网页的端到端交互,同时保持用户旅程的连贯追踪,而非依赖传统的DOM解析方式。OpenFlo结合模拟用户行为特征与结构化的评估协议(包括系统可用性量表SUS、分步单易度问题SEQ及同步思考 aloud),生成全面的用户体验(UX)报告,从而推动可持续、可扩展且数据驱动的可用性测试范式。

链接: https://arxiv.org/abs/2604.09581
作者: Wee Joe Tan,Zi Rui Lucas Lim,Shashank Durgad,Karim Obegi,Aiden Yiliu Li
机构: University College London(伦敦大学学院)
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Evaluating web usability typically requires time-consuming user studies and expert reviews, which often limits iteration speed during product development, especially for small teams and agile workflows. We present OpenFlo, a user-experience evaluation agent that simulates user behavior on websites and produces standardized usability. Unlike traditional tools that rely on DOM parsing, OpenFlo grounds actions and observations, enabling it to interact with real web pages end-to-end while maintaining a coherent trace of the user journey. Building on Avenir-Web, our system pairs this robust interaction with simulated user behavior profiles and a structured evaluation protocol that integrates the System Usability Scale (SUS), step-wise Single Ease Questions (SEQ), and concurrent Think Aloud. Subsequently, a comprehensive User Experience (UX) report will be generated. We discuss the architecture of OpenFlo and illustrate how its multimodal grounding improves robustness for web-based interaction and UX evaluation scenarios, paving the way for a new era of continuous, scalable, and data-driven usability testing that empowers every developer to build web interfaces that are usable. Code is available at: this https URL

[HC-60] Generative UI: LLM s are Effective UI Generators

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)生成内容时界面单一、静态的问题,即传统输出多为无交互性的纯文本(markdown格式),无法满足用户对动态、定制化用户界面(User Interface, UI)的需求。其解决方案的关键在于通过精心设计的提示(prompting)与配套工具集,使现代LLM能够稳健地生成高质量、针对具体任务的定制化UI,从而实现“生成式UI”(Generative UI)的能力。实验表明,尽管生成结果在专业性上仍不及人类专家设计的UI,但在超过50%的情况下可达到相当水平,且人类用户更偏好此类生成式UI而非传统文本输出。这一能力被证实是LLM的新兴特性,相较于早期模型有显著提升。

链接: https://arxiv.org/abs/2604.09577
作者: Yaniv Leviathan,Dani Valevski,Matan Kalman,Danny Lumen,Eyal Segalis,Eyal Molad,Shlomi Pasternak,Vishnu Natchu,Valerie Nygaard,Srinivasan(Cheenu)Venkatachary,James Manyika,Yossi Matias
机构: Google Research (谷歌研究)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:AI models excel at creating content, but typically render it with static, predefined interfaces. Specifically, the output of LLMs is often a markdown “wall of text”. Generative UI is a long standing promise, where the model generates not just the content, but the interface itself. Until now, Generative UI was not possible in a robust fashion. We demonstrate that when properly prompted and equipped with the right set of tools, a modern LLM can robustly produce high quality custom UIs for virtually any prompt. When ignoring generation speed, results generated by our implementation are overwhelmingly preferred by humans over the standard LLM markdown output. In fact, while the results generated by our implementation are worse than those crafted by human experts, they are at least comparable in 50% of cases. We show that this ability for robust Generative UI is emergent, with substantial improvements from previous models. We also create and release PAGEN, a novel dataset of expert-crafted results to aid in evaluating Generative UI implementations, as well as the results of our system for future comparisons. Interactive examples can be seen at this https URL

[HC-61] alking to a Human as an Attitudinal Barrier: A Mixed Methods Evaluation of Stigma Access and the Appeal of AI Mental Health Support

【速读】:该论文旨在解决当前心理健康服务可及性不足的问题,即许多有需求的个体因评价敏感性障碍(如羞耻感/污名化)和结构性障碍(如费用/覆盖范围/获取难度)而未能接受心理治疗。研究发现,生成式 AI (Generative AI) 支持的心理健康对话工具(Ash)被感知为最有效的干预方式,尤其对面临羞耻感/污名化和获取障碍的用户而言;其中,曾接受过心理治疗的用户群体中,羞耻感与感知帮助性显著正相关,表明该AI工具在缓解社会心理障碍方面具有针对性优势。解决方案的关键在于识别并匹配用户报告的核心障碍类型——当AI工具能有效应对用户的羞耻感或获取困难时,其感知价值和使用强度均显著提升,从而更精准地满足未被满足的心理健康支持需求。

链接: https://arxiv.org/abs/2604.09575
作者: Caitlin A. Stamatis,Emma C. Wolfe,Matteo Malgaroli,Thomas D. Hull
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: 45 pages, 5 figures

点击查看摘要

Abstract:Background: Many people who could benefit from therapy do not receive it. Conversational AI is increasingly used for mental health support, yet it is unclear which barriers AI helps mitigate. We examined whether evaluation-sensitive (shame/stigma) and structural barriers (cost/coverage/access) to psychotherapy predict perceived helpfulness of an AI mental health conversational tool (Ash), and whether effects differ by prior therapy experience or user engagement. Methods: Participants (n=395) rated Ash’s helpfulness (1-5) and described barriers to therapy. Open-text responses were coded for shame/stigma, access, and cost/coverage themes. Linear regressions examined associations between barriers and perceived helpfulness, adjusting for demographics and mental health, with moderation by therapy experience. Results: Shame/stigma (B=.45, p.001) and access barriers (B=.31, p=.020) predicted higher perceived helpfulness but cost/coverage did not (B=.13, p=.262). Prior therapy experience moderated the shame effect (interaction B=.56, p=.036): shame predicted higher helpfulness among therapy-experienced users ( \Delta =.62, p.001) but not therapy-naive users ( \Delta =.03, p=.877). Among therapy-experienced participants (n=258), shame/stigma (B=.75, p.001) and access barriers (B=.51, p=.006) predicted rating Ash more favorably. Access barriers predicted higher engagement (IRR=1.64, p.001) and cost/coverage barriers predicted 70% more sessions (IRR=1.70, p.001). Shame/stigma was not associated with total sessions (IRR=.80, p=.094). Conclusions: AI mental health support was perceived as most helpful by users facing shame/stigma and access barriers, particularly for therapy-experienced individuals. Access and cost barriers were most predictive of usage intensity, suggesting unmet needs. Findings highlight the importance of aligning AI tools for emotional support with user-reported barriers.

[HC-62] Improving understanding and trust in AI: How users benefit from interval-based counterfactual explanations

【速读】:该论文试图解决的问题是:当前关于黑箱模型后验解释(post-hoc explanations)类型有效性评估的实验研究极为匮乏,尤其是对反事实解释(counterfactual explanations)中不同子类型——单点解释(single point explanations)与区间解释(interval-based explanations)——在提升用户对模型理解(model understanding)和信任(demonstrated trust)方面的差异缺乏实证依据。解决方案的关键在于设计并实施一项基于被试内实验设计(within-subjects experimental design)的在线用户研究,系统比较四种条件:无解释(控制组)、特征重要性评分、单点反事实解释和区间反事实解释。结果表明,区间反事实解释显著优于其他类型,在增强模型理解和信任方面表现最优,从而为可解释人工智能(Explainable AI, XAI)实践中解释形式的选择提供了实证依据。

链接: https://arxiv.org/abs/2604.09573
作者: Tabea E. Röber,Paul Festor,Rob Goedhart,S. İlker Birbil,Aldo Faisal
机构: University of Amsterdam, The Netherlands; Imperial College London, UK; University of Bayreuth, Germany
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Experimental user studies evaluating the effectiveness of different subtypes of post-hoc explanations for black-box models are largely nonexistent. Therefore, the aim of this study was to investigate and evaluate how different types of counterfactual explanations, namely single point explanations and interval-based explanations, affect both model understanding and (demonstrated) trust. We conducted an online user study using a within-subjects experimental design, where the experimental arms were (i) no explanation (control), (ii) feature importance scores, (iii) point counterfactual explanations, and (iv) interval counterfactual explanations. Our results clearly show the superiority of interval explanations over other tested explanation types in increasing both model understanding and demonstrated trust in the AI. We could not support findings of some previous studies showing an effect of point counterfactual explanations compared to the control group. Our results further highlight the role individual differences in, for example, cognitive style or personality, in explanation effectiveness.

[HC-63] ACE-TA: An Agent ic Teaching Assistant for Grounded QA Quiz Generation and Code Tutoring

【速读】:该论文旨在解决编程教学中学生对概念理解不深入、缺乏个性化练习反馈以及代码实践指导不足的问题。其解决方案的关键在于提出ACE-TA(Agentic Coding and Explanations Teaching Assistant)框架,该框架通过三个协同模块实现:基于检索增强的精准概念问答系统(retrieval-grounded conceptual QA system),用于提供与课程内容对齐的解释;自适应多主题测验生成器(quiz generator),以促进高阶认知能力的评估;以及交互式代码导师(interactive code tutor),结合沙箱执行和迭代反馈机制,引导学生进行分步推理和实践。整个系统依托预训练大语言模型(Large Language Models, LLMs)实现自主路由与智能教学支持。

链接: https://arxiv.org/abs/2604.09572
作者: Himanshu Tripathi,Charlottee Crowell,Kaley Newlin,Subash Neupane,Shahram Rahimi,Jason Keith
机构: University of Alabama (阿拉巴马大学); Brown University (布朗大学); Meharry Medical College (梅哈里医学院); Iowa State University of Science and Technology (爱荷华州立大学科技学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce ACE-TA, the Agentic Coding and Explanations Teaching Assistant framework, that autonomously routes conceptual queries drawn from programming course material to grounded QA, stepwise coding guidance, and automated quiz generation using pre-trained Large Language Models (LLMs). ACE-TA consists of three coordinated modules: a retrieval grounded conceptual QA system that provides precise, context-aligned explanations; a quiz generator that constructs adaptive, multi-topic assessments targeting higher-order understanding; and an interactive code tutor that guides students through step-by-step reasoning with sandboxed execution and iterative feedback.

[HC-64] uning Qwen 2.5-VL to Improve Its Web Interaction Skills WWW2026

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在纯视觉输入条件下作为独立代理执行网页操作时的可靠性问题,尤其针对其在元素定位不准、对指令表述敏感以及高估自身动作成功率等关键挑战。解决方案的关键在于设计了一个两阶段微调训练流程:第一阶段训练模型判断光标是否已悬停于目标元素上或需移动;第二阶段则通过分步执行单个操作(如鼠标移动或点击),并在每一步后验证环境状态,再决定下一步行动,从而提升任务执行的准确性与鲁棒性。实验表明,该方法在单击网页任务基准测试中将成功率从86%提升至94%。

链接: https://arxiv.org/abs/2604.09571
作者: Alexandra Yakovleva,Henrik Pärssinen,Harri Valpola,Juho Kannala,Alexander Ilin
机构: Aalto University (阿尔托大学); System 2 AI (系统2人工智能)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted to the Short Paper Track of ACM Web Conference 2026 (WWW 2026). The final version will appear in the ACM Digital Library

点击查看摘要

Abstract:Recent advances in vision-language models (VLMs) have sparked growing interest in using them to automate web tasks, yet their feasibility as independent agents that reason and act purely from visual input remains underexplored. We investigate this setting using Qwen2.5-VL-32B, one of the strongest open-source VLMs available, and focus on improving its reliability in web-based control. Through initial experimentation, we observe three key challenges: (i) inaccurate localization of target elements, the cursor, and their relative positions, (ii) sensitivity to instruction phrasing, and (iii) an overoptimistic bias toward its own actions, often assuming they succeed rather than analyzing their actual outcomes. To address these issues, we fine-tune Qwen2.5-VL-32B for a basic web interaction task: moving the mouse and clicking on a page element described in natural language. Our training pipeline consists of two stages: (1) teaching the model to determine whether the cursor already hovers over the target element or whether movement is required, and (2) training it to execute a single command (a mouse move or a mouse click) at a time, verifying the resulting state of the environment before planning the next action. Evaluated on a custom benchmark of single-click web tasks, our approach increases success rates from 86% to 94% under the most challenging setting.

[HC-65] Conversational Forecasting Across Large Human Groups Using A Network of Surrogate Agents

【速读】:该论文旨在解决大规模分布式人类团队在复杂决策场景下如何通过实时协作提升预测准确性的问题。传统群体决策常受限于规模效应和沟通效率,而本研究提出以介入式AI代理(intervening AI agents)为核心的Hyperchat AI架构,作为中介协调器,在不中断对话流的前提下促进跨地域、超大规模团队的高效讨论与共识形成。其解决方案的关键在于利用AI代理实时分析并引导多轮对话内容,使团队能在5分钟内完成对NBA比赛胜负的集体研判,从而显著提高预测精度——实验结果显示,使用Thinkscape平台进行12周赛事预测时整体准确率达62%(p=0.059),且高对话频率组(排除最低25%对话率后)准确率进一步提升至68%(p=0.017),验证了AI赋能的群体智能在动态决策任务中的有效性。

链接: https://arxiv.org/abs/2604.09570
作者: Louis Rosenberg,Hans Schumann,Ganesh Mani,Gregg Willcox
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 6 pages

点击查看摘要

Abstract:Hyperchat AI is a communication and collaboration architecture that employs intervening AI agents to enable real-time conversational deliberations among distributed human teams of unlimited size. Prior work has shown that teams as large as 250 people can hold productive real-time conversations by text, voice, or video using Hyperchat AI to discuss complex problems, brainstorm solutions, surface risks, assess alternatives, prioritize options, and converge on optimized results. Building on this prior work, this new study tasked groups of 25 to 30 basketball fans with conversationally forecasting 56 NBA games (against the spread) over a 12-week period. Results show that when discussing and debating NBA games (for five minutes each) using a Hyperchat AI enabled platform called Thinkscape, human teams were 62% accurate across the full set of NBA forecasts. This is a significant result versus the Vegas odds of 50% (p=0.059). Furthermore, had the participants wagered on the games, they would have produced an 18% ROI over the 12-week period. In addition, this study found that the conversation rate during each forecast was positively correlated with prediction accuracy. In fact, when excluding the 12 forecasts in the bottom 25th percentile by average conversation rate, the remaining 38 forecasts recorded a 68% accuracy against the published Vegas spread (p=0.017). This suggests that large-scale conversational deliberations, when facilitated by intervening AI-agents, positively impacts accuracy in groupwise forecasting.

[HC-66] Automatic Mind Wandering Detection in Educational Settings: A Systematic Review and Multimodal Benchmarking

【速读】:该论文旨在解决在线教育环境中注意力分散(mind wandering)检测的可靠性与一致性问题,其核心挑战在于现有研究因模型差异、预处理方法不统一及评估指标多样化而导致结果难以复现和比较。解决方案的关键在于构建一个通用的预处理与特征提取流程,针对不同模态(如EEG、面部视频、眼动追踪和生理信号)进行标准化处理,并在此基础上对13种传统机器学习与神经网络模型(包括联邦学习方法)进行系统性评估与对比,同时通过新颖的消融实验探索基于事后探测数据(post-probe data)的检测潜力,从而为构建可泛化、公平比较的注意力监测框架提供实证基础与开源工具支持。

链接: https://arxiv.org/abs/2604.09569
作者: Anna Bodonhelyi,Augustin Curinier,Babette Bühler,Gerrit Anders,Lisa Rausch,Markus Huff,Ulrich Trautwein,Ralph Ewerth,Peter Gerjets,Enkelejda Kasneci
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Detecting mind wandering is crucial in online education, and it occurs 30% of the time, as it directly impacts learners’ retention, comprehension, and overall success in self-directed learning environments. Integrating automated detection algorithms enables the deployment of targeted interventions within adaptive learning environments, paving the way for more responsive and personalized educational systems. However, progress is hampered by a lack of coherent frameworks for identifying mind wandering in online environments. This work presents a comprehensive systematic review and benchmark of mind wandering detection across 14 datasets covering EEG, facial video, eye tracking, and physiological signals in educational settings, motivated by the challenges in achieving reliable detection and the inconsistency of results across studies caused by variations in models, preprocessing approaches, and evaluation metrics. We implemented a generalizable preprocessing and feature extraction pipeline tailored to each modality, ensuring fair comparison across diverse experimental paradigms. 13 traditional machine learning and neural network models, including federated learning approaches, were evaluated on each dataset. In a novel ablation study, we explored mind wandering detection from post-probe data, motivated by findings that learners often re-engage with material after mind wandering episodes through re-reading or re-watching. Results highlight the potential and limitations of different modalities and classifiers for mind wandering detection, and point to new opportunities for supporting online learning. All code and preprocessing scripts are made openly available to support reproducibility and future research.

[HC-67] EvoDiagram: Agent ic Editable Diagram Creation via Design Expertise Evolution

【速读】:该论文旨在解决自动化系统在生成高保真度可编辑图表时面临的挑战,即如何有效协调语义拓扑、视觉风格与空间布局,并克服现有方法中存在的表示鸿沟问题——像素级模型难以实现精确控制,而代码级合成又限制了直观灵活性。解决方案的关键在于提出EvoDiagram框架,其核心是通过一个中间画布模式(canvas schema)实现对象级可编辑性的图表示,同时采用多智能体协同机制将语义意图与渲染逻辑解耦,从而在异构设计层之间消除冲突;此外,引入设计知识演化机制,将执行轨迹提炼为分层记忆中的领域指导原则,使智能体能够自适应地检索上下文感知的专业知识,显著提升了图表生成的结构一致性与美学协调性。

链接: https://arxiv.org/abs/2604.09568
作者: Tianfu Wang,Leilei Ding,Ziyang Tao,Yi Zhan,Zhiyuan Ma,Wei Wu,Yuxuan Lei,Yuan Feng,Junyang Wang,Yin Wu,Yizhao Xu,Hongyuan Zhu,Qi Liu,Nicholas Jing Yuan,Yanyong Zhang,Hui Xiong
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-fidelity diagram creation requires the complex orchestration of semantic topology, visual styling, and spatial layout, posing a significant challenge for automated systems. Existing methods also suffer from a representation gap: pixel-based models often lack precise control, while code-based synthesis limits intuitive flexibility. To bridge this gap, we introduce EvoDiagram, an agentic framework that generates object-level editable diagrams via an intermediate canvas schema. EvoDiagram employs a coordinated multi-agent system to decouple semantic intent from rendering logic, resolving conflicts across heterogeneous design layers. Additionally, we propose a design knowledge evolution mechanism that distills execution traces into a hierarchical memory of domain guidelines, enabling agents to retrieve context-aware expertise adaptively. We further release CanvasBench, a benchmark consisting of both data and metrics for canvas-based diagramming. Extensive experiments demonstrate that EvoDiagram exhibits excellent performance and balance against baselines in generating editable, structurally consistent, and aesthetically coherent diagrams. Our code is available at this https URL.

[HC-68] LETGAMES: An LLM -Powered Gamified Approach to Cognitive Training for Patients with Cognitive Impairment

【速读】:该论文旨在解决认知障碍患者个性化认知训练游戏设计资源消耗大、效率低的问题。现有方法难以根据个体需求自动生成针对性强且具有交互性的治疗性游戏,限制了其在临床康复中的广泛应用。解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)的自动化个性化治疗游戏设计框架——LETGAMES,该框架受《龙与地下城》(Dungeons & Dragons)启发,能够生成以开放世界互动叙事为基础的游戏场景和挑战,并针对特定认知领域进行定制化设计,同时通过对话策略提供引导与陪伴式交互。此外,作者还构建了心理学驱动的评估协议LETGAMESEVAL,为疗效验证提供多维量化指标,实验证明该方法在LLM评估和人类专家评价中均展现出显著潜力,为实现更可及、个性化的认知训练工具提供了可行路径。

链接: https://arxiv.org/abs/2604.09566
作者: Jingwei Shi,Shengyu Tao,Xinxiang Yin,Chen Huang,Wenqiang Lei,See-Kiong Ng
机构: Shanghai University of Finance and Economics (上海财经大学); Northwest Polytechnical University (西北工业大学); Institute of Data Science, National University of Singapore (新加坡国立大学数据科学研究所); College of Computer Science, Sichuan University (四川大学计算机学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 53 pages

点击查看摘要

Abstract:The application of games as a therapeutic tool for cognitive training is beneficial for patients with cognitive impairments. However, effective game design for individual patient is resource-intensive. To this end, we propose an LLM-powered method, \ours, for automated and personalized therapeutic game design. Inspired by the Dungeons Dragons, LETGAMES generates an open-world interactive narrative game. It not only generates game scenarios and challenges that target specific cognitive domains, but also employs conversational strategies to offer guidance and companionship. To validate its efficacy, we pioneer a psychology-grounded evaluation protocol LETGAMESEVAL, establishing comprehensive metrics for rehabilitative assessment. Building upon this, our experimental results from both LLM-based assessors and human expert evaluations demonstrate the significant potential of our approach, positioning LETGAMES as a promising solution to the widespread need for more accessible and tailored cognitive training tools. Our code will be open-sourced upon acceptance.

[HC-69] racers for debugging and program exploration

【速读】:该论文旨在解决现有调试工具在支持程序员生成调试假设方面能力不足的问题。当前工具(如步进调试器)仅提供程序状态的孤立快照,导致开发者需自行在脑中重构状态随时间演化的轨迹,效率低下且易出错。解决方案的关键在于以程序执行的完整历史记录(即“trace”)为核心构建调试与程序探索工具,强调用户应看到每行代码按实际执行顺序呈现,而非按源码语法顺序展示,从而显著提升对程序行为的理解效率和准确性。

链接: https://arxiv.org/abs/2604.09301
作者: Shardul Chiplunkar,Clément Pit-Claudel
机构: EPFL(洛桑联邦理工学院)
类目: Programming Languages (cs.PL); Human-Computer Interaction (cs.HC)
备注: 13 pages; presented at the 16th annual workshop on the intersection of HCI and PL (PLATEAU 2026), Pittsburgh, PA, USA

点击查看摘要

Abstract:Programmers often use an iterative process of hypothesis generation (“perhaps this function is called twice?”) and hypothesis testing (“let’s count how many times this breakpoint fires”) to understand the behavior of unfamiliar or malfunctioning software. Existing debugging tools are much better suited to testing hypotheses than to generating them. Step debuggers, for example, present isolated snapshots of the program’s state, leaving it to the programmer to mentally reconstruct the evolution of that state over time. We advocate for a different approach: building a debugging and program-exploration tool around a trace, or complete history, of the program’s execution. Our key claim is that the user should see every line as executed (in time order) rather than as written (in syntax order). We discuss design choices, preliminary results, and interesting challenges.

[HC-70] oward using Speech to Sense Student Emotion in Remote Learning Environments

【速读】:该论文旨在解决远程学习环境中因缺乏足够情感线索而导致学习体验不佳的问题(即:远程学习通常为异步进行,缺少面对面教学中的情绪互动),从而影响学习效果。其解决方案的关键在于利用语音特征来感知学生情绪,具体通过设计基于自控任务的语音采集方法(speech-based self-control tasks),构建包含自发独白语音的数据集,并结合主观听者评估与自动维度化情绪预测模型,验证了语音在愉悦度(valence)、唤醒度(arousal)和支配度(dominance)三个维度上的可感知变化及其自动化预测的可能性。这一方法为将副语言语音处理技术无缝集成到远程学习流程中提供了可行路径,以优化教学设计和实时反馈生成。

链接: https://arxiv.org/abs/2604.09881
作者: Sargam Vyas,Bogdan Vlasenko,André Mayoraz,Egon Werlen,Per Bergamin,Mathew Magimai.-Doss
机构: Idiap Research Institute (Idiap 研究所); Lausanne University Hospital (洛桑大学医院); University of Lausanne (洛桑大学); The Sense Innovation and Research Center (感知创新与研究中心); FFHS: Swiss Distance University of Applied Sciences (瑞士远程应用科学大学)
类目: Audio and Speech Processing (eess.AS); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:With advancements in multimodal communication technologies, remote learning environments such as, distance universities are increasing. Remote learning typically happens asynchronously. As a consequence, unlike face-to-face in-person classroom teaching, this lacks availability of sufficient emotional cues for making learning a pleasant experience. Motivated by advances made in the paralinguistic speech processing community on emotion prediction, in this paper we explore use of speech for sensing students’ emotions by building upon speech-based self-control tasks developed to aid effective remote learning. More precisely, we investigate: (a) whether speech acquired through self-control tasks exhibit perceptible variation along valence, arousal, and dominance dimensions? and (b) whether those dimensional emotion variations can be automatically predicted? We address these two research questions by developing a dataset containing spontaneous monologue speech acquired as open responses to self-control tasks and by carrying out subjective listener evaluations and automatic dimensional emotion prediction studies on that dataset. Our investigations indicate that speech-based self-control tasks can be a means to sense student emotion in remote learning environment. This opens potential venues to seamlessly integrate paralinguistic speech processing technologies in the remote learning loop for enhancing learning experiences through instructional design and feedback generation.

计算机视觉

[CV-0] Who Handles Orientation? Investigating Invariance in Feature Matching

【速读】:该论文旨在解决图像匹配中因大范围平面内旋转(in-plane rotations)导致的特征点匹配性能下降问题。现代匹配器在处理此类旋转时表现不佳,而传统方法通过数据增强学习旋转不变性,但未明确其应在流程中的哪个阶段引入。论文的关键解决方案是:将旋转不变性直接嵌入描述符(descriptor)的学习阶段,而非依赖后续匹配器处理。实验表明,这种早期引入方式不仅性能与后期匹配阶段处理相当,还显著提升了匹配效率,并且在大规模训练下不会损害原始姿态(upright)下的匹配性能。此外,研究发现随着训练数据规模增加,模型对旋转图像的泛化能力显著增强,从而实现了对多模态、极端条件及卫星图像等复杂场景下旋转鲁棒匹配的最优性能。

链接: https://arxiv.org/abs/2604.11809
作者: David Nordström,Johan Edstedt,Fredrik Kahl,Georg Bökman
机构: Chalmers University of Technology (查尔姆斯理工大学); Linköping University (林雪平大学); University of Amsterdam (阿姆斯特丹大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Finding matching keypoints between images is a core problem in 3D computer vision. However, modern matchers struggle with large in-plane rotations. A straightforward mitigation is to learn rotation invariance via data augmentation. However, it remains unclear at which stage rotation invariance should be incorporated. In this paper, we study this in the context of a modern sparse matching pipeline. We perform extensive experiments by training on a large collection of 3D vision datasets and evaluating on popular image matching benchmarks. Surprisingly, we find that incorporating rotation invariance already in the descriptor yields similar performance to handling it in the matcher. However, rotation invariance is achieved earlier in the matcher when it is learned in the descriptor, allowing for a faster rotation-invariant matcher. Further, we find that enforcing rotation invariance does not hurt upright performance when trained at scale. Finally, we study the emergence of rotation invariance through scale and find that increasing the training data size substantially improves generalization to rotated images. We release two matchers robust to in-plane rotations that achieve state-of-the-art performance on e.g. multi-modal (WxBS), extreme (HardMatch), and satellite image matching (SatAst). Code is available at this https URL.

[CV-1] Pair2Scene: Learning Local Object Relations for Procedural Scene Generation

【速读】:该论文旨在解决高保真度3D室内场景生成中存在的数据稀缺与复杂空间关系建模难题,尤其针对现有方法在超出训练分布的密集场景中难以扩展,或依赖大语言模型(LLM)/视觉语言模型(VLM)但缺乏精确空间推理能力的问题。其解决方案的关键在于提出Pair2Scene框架,该框架通过学习局部依赖规则(而非冗余的全局分布)来驱动场景生成:具体而言,模型捕捉两类对象间关系——支撑关系(support relations,遵循物理层级)和功能关系(functional relations,反映语义关联),并利用神经网络预测依赖对象相对于锚定对象的位置分布;在此基础上,构建了3D-Pairs数据集用于训练,并在推理阶段通过层次结构递归应用模型,结合碰撞感知的拒绝采样策略将局部规则整合为一致的全局布局,从而实现超越训练数据复杂环境的物理与语义合理性生成。

链接: https://arxiv.org/abs/2604.11808
作者: Xingjian Ran,Shujie Zhang,Weipeng Zhong,Li Luo,Bo Dai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating high-fidelity 3D indoor scenes remains a significant challenge due to data scarcity and the complexity of modeling intricate spatial relations. Current methods often struggle to scale beyond training distribution to dense scenes or rely on LLMs/VLMs that lack the ability for precise spatial reasoning. Building on top of the observation that object placement relies mainly on local dependencies instead of information-redundant global distributions, in this paper, we propose Pair2Scene, a novel procedural generation framework that integrates learned local rules with scene hierarchies and physics-based algorithms. These rules mainly capture two types of inter-object relations, namely support relations that follow physical hierarchies, and functional relations that reflect semantic links. We model these rules through a network, which estimates spatial position distributions of dependent objects conditioned on position and geometry of the anchor ones. Accordingly, we curate a dataset 3D-Pairs from existing scene data to train the model. During inference, our framework can generate scenes by recursively applying our model within a hierarchical structure, leveraging collision-aware rejection sampling to align local rules into coherent global layouts. Extensive experiments demonstrate that our framework outperforms existing methods in generating complex environments that go beyond training data while maintaining physical and semantic plausibility.

[CV-2] Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在物理推理能力提升上受限于互联网问答(QA)数据规模小、领域集中(如数学为主)的问题,尤其是在物理学等科学领域缺乏大规模高质量训练数据的瓶颈。其解决方案的关键在于利用物理引擎(physics simulators)作为可扩展的数据生成器,通过模拟随机物理场景并自动生成合成问答对,结合强化学习(reinforcement learning)训练LLMs,从而实现从仿真环境到真实世界物理任务的零样本迁移(zero-shot sim-to-real transfer),显著提升模型在国际物理奥林匹克竞赛(IPhO)等基准上的表现。

链接: https://arxiv.org/abs/2604.11805
作者: Mihir Prabhudesai,Aryan Satpathy,Yangmin Li,Zheyang Qin,Nikash Bhardwaj,Amir Zadeh,Chuan Li,Katerina Fragkiadaki,Deepak Pathak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project Webpage - this https URL

点击查看摘要

Abstract:We have witnessed remarkable advances in LLM reasoning capabilities with the advent of DeepSeek-R1. However, much of this progress has been fueled by the abundance of internet question-answer (QA) pairs, a major bottleneck going forward, since such data is limited in scale and concentrated mainly in domains like mathematics. In contrast, other sciences such as physics lack large-scale QA datasets to effectively train reasoning-capable models. In this work, we show that physics simulators can serve as a powerful alternative source of supervision for training LLMs for physical reasoning. We generate random scenes in physics engines, create synthetic question-answer pairs from simulated interactions, and train LLMs using reinforcement learning on this synthetic data. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO (International Physics Olympiad) problems by 5-10 percentage points across model sizes. These results demonstrate that physics simulators can act as scalable data generators, enabling LLMs to acquire deep physical reasoning skills beyond the limitations of internet-scale QA data. Code available at: this https URL.

[CV-3] OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

【速读】:该论文旨在解决人类-物体交互视频生成(Human-Object Interaction Video Generation, HOIVG)任务中多模态条件融合与高质量视频合成的难题,尤其针对文本、参考图像、音频和姿态等多种控制信号难以协同的问题。其核心解决方案是提出OmniShow框架,关键创新在于:一是引入统一通道条件机制(Unified Channel-wise Conditioning),实现图像与姿态信息的高效注入;二是设计门控局部上下文注意力(Gated Local-Context Attention),保障音频与视觉内容的精确同步;三是采用解耦后再联合训练策略(Decoupled-Then-Joint Training),有效缓解数据稀缺问题并充分利用异构子任务数据集。这些技术共同推动了HOIVG在工业级性能上的突破。

链接: https://arxiv.org/abs/2604.11804
作者: Donghao Zhou,Guisheng Liu,Hao Yang,Jiatong Li,Jingyu Lin,Xiaohu Huang,Yichen Liu,Xin Gao,Cunjian Chen,Shilei Wen,Chi-Wing Fu,Pheng-Ann Heng
机构: Hong Kong University of Science and Technology (香港科技大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.

[CV-4] Budget-Aware Uncertainty for Radiotherapy Segmentation QA Using nnU-Net

【速读】:该论文旨在解决放射治疗计划中临床靶区(Clinical Target Volume, CTV)勾画的准确性与效率问题,尤其是在复杂治疗如全骨髓和淋巴结照射(Total Marrow and Lymph Node Irradiation, TMLI)场景下,自动分割模型虽可降低人工负担,但缺乏可靠的质量控制机制以识别潜在错误区域。解决方案的关键在于提出一种预算感知的不确定性驱动质量保障(uncertainty-driven quality assurance, QA)框架,基于nnU-Net架构融合不确定性量化与事后校准(post-hoc calibration),生成体素级不确定性图(基于预测熵),从而指导针对性的人工复核。通过对比温度缩放(TS)、深度集成(DE)、检查点集成(CE)和测试时增强(TTA)等方法及其组合,在真实修订约束条件下评估可靠性指标(如ROI-masked校准度与不确定性-误差对齐性),发现校准结合高效集成策略显著提升不确定性图对需人工修正区域的识别一致性,为安全部署深度学习分割模型提供可行路径。

链接: https://arxiv.org/abs/2604.11798
作者: Ricardo Coimbra Brioso,Lorenzo Mondo,Damiano Dei,Nicola Lambri,Pietro Mancosu,Marta Scorsetti,Daniele Loiacono
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate delineation of the Clinical Target Volume (CTV) is essential for radiotherapy planning, yet remains time-consuming and difficult to assess, especially for complex treatments such as Total Marrow and Lymph Node Irradiation (TMLI). While deep learning-based auto-segmentation can reduce workload, safe clinical deployment requires reliable cues indicating where models may be wrong. In this work, we propose a budget-aware uncertainty-driven quality assurance (QA) framework built on nnU-Net, combining uncertainty quantification and post-hoc calibration to produce voxel-wise uncertainty maps (based on predictive entropy) that can guide targeted manual review. We compare temperature scaling (TS), deep ensembles (DE), checkpoint ensembles (CE), and test-time augmentation (TTA), evaluated both individually and in combination on TMLI as a representative use case. Reliability is assessed through ROI-masked calibration metrics and uncertainty–error alignment under realistic revision constraints, summarized as AUC over the top 0-5% most uncertain voxels. Across configurations, segmentation accuracy remains stable, whereas TS substantially improves calibration. Uncertainty-error alignment improves most with calibrated checkpoint-based inference, leading to uncertainty maps that highlight more consistently regions requiring manual edits. Overall, integrating calibration with efficient ensembling seems a promising strategy to implement a budget-aware QA workflow for radiotherapy segmentation.

[CV-5] SyncFix: Fixing 3D Reconstructions via Multi-View Synchronization

【速读】:该论文旨在解决基于扩散模型(diffusion-based)重建场景时,多视角之间存在的语义与几何不一致性问题。其核心挑战在于如何在去噪过程中保持跨视角的一致性,从而提升重建质量。解决方案的关键在于提出SyncFix框架,该框架将修复过程建模为一个联合潜在空间桥接匹配问题(joint latent bridge matching problem),通过同步多个视角中的失真与干净表示,强制在整个去噪轨迹中实现跨视角的语义和几何一致性。这一方法仅需图像对进行训练,即可在推理阶段自然扩展至任意数量视角,并且随着视图增加重建质量持续提升,体现出良好的泛化能力和渐进式优化特性。

链接: https://arxiv.org/abs/2604.11797
作者: Deming Li,Abhay Yadav,Cheng Peng,Rama Chellappa,Anand Bhattad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present SyncFix, a framework that enforces cross-view consistency during the diffusion-based refinement of reconstructed scenes. SyncFix formulates refinement as a joint latent bridge matching problem, synchronizing distorted and clean representations across multiple views to fix the semantic and geometric inconsistencies. This means SyncFix learns a joint conditional over multiple views to enforce consistency throughout the denoising trajectory. Our training is done only on image pairs, but it generalizes naturally to an arbitrary number of views during inference. Moreover, reconstruction quality improves with additional views, with diminishing returns at higher view counts. Qualitative and quantitative results demonstrate that SyncFix consistently generates high-quality reconstructions and surpasses current state-of-the-art baselines, even in the absence of clean reference images. SyncFix achieves even higher fidelity when sparse references are available.

[CV-6] LottieGPT : Tokenizing Vector Animation for Autoregressive Generation CVPR2026

【速读】:该论文旨在解决现有视频生成模型无法直接生成矢量动画(vector animation)的问题。矢量动画因其分辨率无关性、紧凑性、语义结构和可编辑的参数化运动表示,在互联网多媒体中占据主导地位,但当前主流生成模型仅在光栅空间(raster space)操作,难以实现此类内容的合成。解决方案的关键在于提出首个面向矢量动画的自回归生成框架:首先设计了专用于Lottie格式(一种广泛采用的JSON-based动画标准)的Lottie Tokenizer,将分层几何图元、变换和基于关键帧的运动编码为紧凑且语义对齐的标记序列;其次构建了包含660K个真实世界矢量动画及15M张静态Lottie图像的大型数据集LottieAnimation-660K,支持大规模训练;最终基于Qwen-VL微调得到LottieGPT,一个能直接从自然语言或视觉提示生成连贯、可编辑矢量动画的原生多模态模型。实验表明,该方法显著缩短序列长度并保持结构保真度,实现了动态矢量内容的有效自回归学习,并在SVG生成任务上超越现有最优模型。

链接: https://arxiv.org/abs/2604.11792
作者: Junhao Chen,Kejun Gao,Yuehan Cui,Mingze Sun,Mingjin Chen,Shaohui Wang,Xiaoxiao Long,Fei Ma,Qi Tian,Ruqi Huang,Hao Zhao
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); AIR, Tsinghua University (清华大学人工智能研究院); BAAI (北京智源人工智能研究院); The Hong Kong Polytechnic University (香港理工大学); Nanjing University (南京大学); Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) (广东省人工智能与数字经济实验室(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026. Project Page: this https URL

点击查看摘要

Abstract:Despite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution-independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides, 3D meshes, LEGO sequences, and indoor layouts, suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into a compact and semantically aligned token sequence. To support large-scale training, we also construct LottieAnimation-660K, the largest and most diverse vector animation dataset to date, consisting of 660k real-world Lottie animation and 15M static Lottie image files curated from broad Internet sources. Building upon these components, we finetune Qwen-VL to create LottieGPT, a native multimodal model capable of generating coherent, editable vector animations directly from natural language or visual prompts. Experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content. LottieGPT exhibits strong generalization across diverse animation styles and outperforms previous state-of-the-art models on SVG generation (a special case of single-frame vector animation).

[CV-7] LMMs Meet Object-Centric Vision: Understanding Segmentation Editing and Generation

【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在需要精确对象级定位、细粒度空间推理和可控视觉操作的任务中表现不足的问题,具体包括难以正确识别目标实例、保持交互过程中对象身份的一致性以及高精度地定位或修改指定区域。其解决方案的关键在于引入以对象为中心的视觉(object-centric vision)框架,通过促进对视觉实体的显式表示与操作,将多模态系统从全局场景理解拓展至对象级别的理解、分割、编辑与生成,从而提升模型在复杂视觉任务中的精度与可控性。

链接: https://arxiv.org/abs/2604.11789
作者: Yuqian Yuan,Wenqiao Zhang,Juekai Lin,Yu Zhong,Mingjian Gao,Binhe Yu,Yunqi Cao,Wentong Li,Yueting Zhuang,Beng Chin Ooi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 38 pages, 6 figures

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision–language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled framework for addressing these challenges by promoting explicit representations and operations over visual entities, thereby extending multimodal systems from global scene understanding to object-level understanding, segmentation, editing, and generation. This paper presents a comprehensive review of recent advances at the convergence of LMMs and object-centric vision. We organize the literature into four major themes: object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation. We further summarize the key modeling paradigms, learning strategies, and evaluation protocols that support these capabilities. Finally, we discuss open challenges and future directions, including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift. We hope this paper provides a structured perspective on the development of scalable, precise, and trustworthy object-centric multimodal systems.

[CV-8] HDR Video Generation via Latent Alignment with Logarithmic Encoding

【速读】:该论文旨在解决高动态范围(High Dynamic Range, HDR)图像生成在预训练生成模型中的适配难题,即如何在不重新设计模型架构或训练新编码器的前提下,利用现有生成模型实现高质量HDR内容的直接生成。其解决方案的关键在于:首先识别出电影制作流程中广泛采用的对数编码(logarithmic encoding)能够将HDR图像映射至与预训练生成模型潜在空间自然对齐的分布;其次,通过引入模拟相机退化的训练策略,促使模型基于已学习的视觉先验推断输入中缺失的高动态范围细节。这一方法实现了仅需轻量级微调即可在多样化场景和复杂光照条件下生成高质量HDR视频,表明只要选择合适的表示方式以匹配模型先验,无需重构生成模型即可有效处理HDR图像生成任务。

链接: https://arxiv.org/abs/2604.11788
作者: Naomi Ken Korem,Mohamed Oumoumad,Harel Cain,Matan Ben Yosef,Urska Jelercic,Ofir Bibi,Yaron Inger,Or Patashnik,Daniel Cohen-Or
机构: Lightricks; Gear Productions; Tel Aviv University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:High dynamic range (HDR) imagery offers a rich and faithful representation of scene radiance, but remains challenging for generative models due to its mismatch with the bounded, perceptually compressed data on which these models are trained. A natural solution is to learn new representations for HDR, which introduces additional complexity and data requirements. In this work, we show that HDR generation can be achieved in a much simpler way by leveraging the strong visual priors already captured by pretrained generative models. We observe that a logarithmic encoding widely used in cinematic pipelines maps HDR imagery into a distribution that is naturally aligned with the latent space of these models, enabling direct adaptation via lightweight fine-tuning without retraining an encoder. To recover details that are not directly observable in the input, we further introduce a training strategy based on camera-mimicking degradations that encourages the model to infer missing high dynamic range content from its learned priors. Combining these insights, we demonstrate high-quality HDR video generation using a pretrained video model with minimal adaptation, achieving strong results across diverse scenes and challenging lighting conditions. Our results indicate that HDR, despite representing a fundamentally different image formation regime, can be handled effectively without redesigning generative models, provided that the representation is chosen to align with their learned priors.

[CV-9] Efficient KernelSHAP Explanations for Patch-based 3D Medical Image Segmentation

【速读】:该论文旨在解决基于扰动的可解释性方法(如KernelSHAP)在三维医学图像分割任务中因计算开销大、滑动窗口推理成本高而难以实用的问题。其关键解决方案在于:首先,将计算限制在用户定义的兴趣区域(region of interest, ROI)及其感受野支持范围内,从而显著减少冗余评估;其次,通过补丁对数缓存(patch logit caching)技术复用未受影响补丁的基线预测,同时保留nnU-Net的融合策略以加速推理过程;此外,为提升临床意义,比较了三种自动特征抽象方式(全器官单元、规则FCC超体素和器官感知超体素),并设计多种聚合/值函数以稳定真阳性、Dice或软Dice指标,从而实现高效且具临床可解释性的归因分析。

链接: https://arxiv.org/abs/2604.11775
作者: Ricardo Coimbra Brioso,Giulio Sichili,Damiano Dei,Nicola Lambri,Pietro Mancosu,Marta Scorsetti,Daniele Loiacono
机构: 1. University of Turin (都灵大学); 2. Italian National Research Council (意大利国家研究委员会); 3. Polytechnic of Turin (都灵理工大学); 4. Institute of Cognitive Sciences and Technologies (认知科学与技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Perturbation-based explainability methods such as KernelSHAP provide model-agnostic attributions but are typically impractical for patch-based 3D medical image segmentation due to the large number of coalition evaluations and the high cost of sliding-window inference. We present an efficient KernelSHAP framework for volumetric CT segmentation that restricts computation to a user-defined region of interest and its receptive-field support, and accelerates inference via patch logit caching, reusing baseline predictions for unaffected patches while preserving nnU-Net’s fusion scheme. To enable clinically meaningful attributions, we compare three automatically generated feature abstractions within the receptive-field crop: whole-organ units, regular FCC supervoxels, and hybrid organ-aware supervoxels, and we study multiple aggregation/value functions targeting stabilizing evidence (TP/Dice/Soft Dice) or false-positive behavior. Experiments on whole-body CT segmentations show that caching substantially reduces redundant computation (with computational savings ranging from 15% to 30%) and that faithfulness and interpretability exhibit clear trade-offs: regular supervoxels often maximize perturbation-based metrics but lack anatomical alignment, whereas organ-aware units yield more clinically interpretable explanations and are particularly effective for highlighting false-positive drivers under normalized metrics.

[CV-10] Autonomous Diffractometry Enabled by Visual Reinforcement Learning

【速读】:该论文旨在解决在材料科学实验中自动化晶体对准的问题,特别是针对依赖人类专家解读衍射图谱(Laue diffraction patterns)的复杂任务。传统方法严重依赖人工经验,难以实现高效、可重复的自动化流程。解决方案的关键在于提出一种无需先验晶体学知识或衍射理论的模型无关强化学习(model-free reinforcement learning)框架,使智能体能够直接从原始衍射图像中学习识别高对称性取向并自主导航至最优位置,从而在无监督条件下发展出类人策略,实现跨不同晶系的快速、鲁棒对准。

链接: https://arxiv.org/abs/2604.11773
作者: J. Oppliger,M. Stifter,A. Rüegg,I. Biało,L. Martinelli,P. G. Freeman,D. Prabhakaran,J. Zhao,Q. Wang,J. Chang
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 16 figures

点击查看摘要

Abstract:Automation underpins progress across scientific and industrial disciplines. Yet, automating tasks requiring interpretation of abstract visual information remain challenging. For example, crystal alignment strongly relies on humans with the ability to comprehend diffraction patterns. Here we introduce an autonomous system that aligns single crystals without access to crystallography and diffraction theory. Using a model-free reinforcement learning framework, an agent learns to identify and navigate towards high-symmetry orientations directly from Laue diffraction patterns. Despite the absence of human supervision, the agent develops human-like strategies to achieve time-efficient alignment across different crystal symmetry classes. With this, we provide a computational framework for intelligent diffractometers. As such, our approach advances the development of automated experimental workflows in materials science.

[CV-11] MosaicMRI: A Diverse Dataset and Benchmark for Raw Musculoskeletal MRI

【速读】:该论文旨在解决当前深度学习在磁共振成像(MRI)应用中模型可靠性受限于数据集多样性不足的问题,尤其是在肌肉骨骼(MSK)系统中的泛化能力缺乏系统评估。现有研究多依赖于脑部和膝关节等有限解剖结构的数据集,导致模型训练与评估缺乏跨解剖场景的鲁棒性验证。解决方案的关键在于构建并公开发布MosaicMRI——目前最大的开源原始MSK MRI数据集,包含2,671个体积和80,156个切片,涵盖多种体位、成像对比度、解剖部位及线圈数量。通过在此多样化数据集上使用VarNet作为基线进行加速重建实验,研究发现:在小样本条件下,联合训练多个解剖部位的模型显著优于仅针对特定解剖部位训练的模型,揭示了跨解剖相关性的可利用性;同时,模型在不同解剖部位间的迁移性能存在差异,表明其泛化能力受训练数据规模、具体解剖结构及扫描协议等因素共同影响。

链接: https://arxiv.org/abs/2604.11762
作者: Paula Arguello,Berk Tinaz,Mohammad Shahab Sepehri,Maryam Soltanolkotabi,Mahdi Soltanolkotabi
机构: University of Southern California (南加州大学); University of Utah (犹他大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP); Medical Physics (physics.med-ph); Machine Learning (stat.ML)
备注: 15 pages, 6 figures, preliminary version

点击查看摘要

Abstract:Deep learning underpins a wide range of applications in MRI, including reconstruction, artifact removal, and segmentation. However, progress has been driven largely by public datasets focused on brain and knee imaging, shaping how models are trained and evaluated. As a result, careful studies of the reliability of these models across diverse anatomical settings remain limited. In this work, we introduce MosaicMRI, a large and diverse collection of fully sampled raw musculoskeletal (MSK) MR measurements designed for training and evaluating machine-learning-based methods. MosaicMRI is the largest open-source raw MSK MRI dataset to date, comprising 2,671 volumes and 80,156 slices. The dataset offers substantial diversity in volume orientation (e.g., axial, sagittal), imaging contrasts (e.g., PD, T1, T2), anatomies (e.g., spine, knee, hip, ankle, and others), and numbers of acquisition coils. Using VarNet as a baseline for accelerated reconstruction task, we perform a comprehensive set of experiments to study scaling behavior with respect to both model capacity and dataset size. Interestingly, models trained on the combined anatomies significantly outperform anatomy-specific models in low-sample regimes, highlighting the benefits of anatomical diversity and the presence of exploitable cross-anatomical correlations. We further evaluate robustness and cross-anatomy generalization by training models on one anatomy (e.g., spine) and testing them on another (e.g., knee). Notably, we identify groups of body parts (e.g., foot and elbow) that generalize well with each other, and highlight that performance under domain shifts depends on both training set size, anatomy, and protocol-specific factors.

[CV-12] StarVLA-α: Reducing Complexity in Vision-Language-Action Systems

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型研究中因架构多样性、训练数据异构性及工程实现差异导致的实验结果难以比较与复现的问题。其关键解决方案是提出一个结构简洁但性能强大的基线模型 StarVLA-α,通过最小化架构和流水线复杂度来减少实验混杂因素,从而实现对关键设计维度(如动作建模策略、机器人特异性预训练和接口工程)的系统性分析。在统一多基准训练下(LIBERO、SimplerEnv、RoboTwin 和 RoboCasa),该基线模型展现出强竞争力,表明仅依靠一个强大的视觉语言模型(VLM)骨干网络结合最小设计即可达到优异性能,无需依赖额外的复杂架构或工程技巧。

链接: https://arxiv.org/abs/2604.11757
作者: Jinhui Ye,Ning Gao,Senqiao Yang,Jinliang Zheng,Zixuan Wang,Yuxin Chen,Pengguang Chen,Yilun Chen,Shu Liu,Jiaya Jia
机构: HKUST(香港科技大学); XJTU(西安交通大学); CUHK(香港中文大学); THU(清华大学); Tongyi Lab, Alibaba Group(阿里巴巴通义实验室); SmartMore Ltd.(思谋科技)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex: as existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA- \alpha , a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA- \alpha deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Across unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive, indicating that a strong VLM backbone combined with minimal design is already sufficient to achieve strong performance without relying on additional architectural complexity or engineering tricks. Notably, our single generalist model outperforms \pi_0.5 by 20% on the public real-world RoboChallenge benchmark. We expect StarVLA- \alpha to serve as a solid starting point for future research in the VLA regime. Code will be released at this https URL.

[CV-13] Learning Long-term Motion Embeddings for Efficient Kinematics Generation

【速读】:该论文旨在解决视频模型在探索多可能未来时因全视频合成效率低下而难以实现高效长时运动生成的问题。其核心解决方案在于:通过从大规模轨迹数据中学习一个长期运动嵌入(motion embedding),并在此压缩空间中训练条件流匹配(conditional flow-matching)模型,从而以64倍的时间压缩比实现对任务描述(如文本提示或空间点击)的高效响应与真实感长时运动生成。该方法显著优于当前最先进的视频模型及专用任务模型的运动分布生成能力。

链接: https://arxiv.org/abs/2604.11737
作者: Nick Stracke,Kolja Bauer,Stefan Andreas Baumann,Miguel Angel Bautista,Josh Susskind,Björn Ommer
机构: CompVis @ LMU (Computer Vision at Ludwig-Maximilians-Universität München); Munich Center for Machine Learning (慕尼黑大学机器学习中心); Apple (苹果公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: for the project page and code, view this https URL

点击查看摘要

Abstract:Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectories obtained from tracker models. This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.

[CV-14] he Devil is in the Details – From OCR for Old Church Slavonic to Purely Visual Stemma Reconstruction

【速读】:该论文旨在解决手写教会斯拉夫语文献中光学字符识别(Optical Character Recognition, OCR)准确性不足,以及如何利用OCR结果有效支持校勘学(Stemmatology)研究的问题。其解决方案的关键在于:首先通过对比多种OCR系统(包括传统方法、机器学习模型及大语言模型LLM),优化基础字母识别准确率(可达2-3%的词错误率CER),并探索LLM后处理与代理式OCR架构(如专业化后处理代理、代理流水线和检索增强生成RAG)对提升识别质量的作用;其次提出一种完全基于图像处理的新校勘方法,即自动化视觉字形提取、聚类与成对统计比较生成距离矩阵,进而构建系谱图(stemma),并在两个小型文献 corpus(14–16世纪教会斯拉夫语《马可福音》和14–15世纪法语《玫瑰传奇》)上验证其可行性。

链接: https://arxiv.org/abs/2604.11724
作者: Armin Hoenen
机构: University of Frankfurt (法兰克福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: International conference at Valamo monastery, Finnland, 2026

点击查看摘要

Abstract:The age of artificial intelligence has brought many new possibilities and pitfalls in many fields and tasks. The devil is in the details, and those come to the fore when building new pipelines and executing small practical experiments. OCR and stemmatology are no exception. The current investigation starts comparing a range of OCR-systems, from classical over machine learning to LLMs, for roughly 6,000 characters of late handwritten church slavonic manuscripts from the 18th century. Focussing on basic letter correctness, more than 10 CS OCR-systems among which 2 LLMs (GPT5 and Gemini3-flash) are being compared. Then, post-processing via LLMs is assessed and finally, different agentic OCR architectures (specialized post-processing agents, an agentic pipeline and RAG) are tested. With new technology elaborated, experiments suggest, church slavonic CER for basic letters may reach as low as 2-3% but elaborated diacritics could still present a problem. How well OCR can prime stemmatology as a downstream task is the entry point to the second part of the article which introduces a new stemmatic method based solely on image processing. Here, a pipeline of automated visual glyph extraction, clustering and pairwise statistical comparison leading to a distance matrix and ultimately a stemma, is being presented and applied to two small corpora, one for the church slavonic Gospel of Mark from the 14th to 16th centuries, one for the Roman de la Rose in French from the 14th and 15th centuries. Basic functioning of the method can be demonstrated.

[CV-15] On the Robustness of Watermarking for Autoregressive Image Generation

【速读】:该论文旨在解决自回归(Autoregressive, AR)图像生成模型输出的可检测性与可溯源性问题,以应对虚假信息传播风险并防止训练数据中混入合成图像导致模型崩溃。其解决方案的关键在于通过水印技术在生成阶段嵌入隐蔽信号,从而支持下游验证。然而,论文指出当前水印方案存在严重漏洞,可通过三类新型攻击——向量量化再生移除攻击、基于对抗优化的攻击和频率注入攻击——在仅需单张带水印参考图像且无需原始模型参数或水印密钥的情况下实现水印移除或伪造,进而导致虚假检测和“水印模仿”(Watermark Mimicry)现象,使得真实图像被误判为合成内容,无法进入后续模型训练。

链接: https://arxiv.org/abs/2604.11720
作者: Andreas Müller,Denis Lukovnikov,Shingo Kodama,Minh Pham,Anubhav Jain,Jonathan Petit,Niv Cohen,Asja Fischer
机构: Apple(苹果)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The proliferation of autoregressive (AR) image generators demands reliable detection and attribution of their outputs to mitigate misinformation, and to filter synthetic images from training data to prevent model collapse. To address this need, watermarking techniques, specifically designed for AR models, embed a subtle signal at generation time, enabling downstream verification through a corresponding watermark detector. In this work, we study these schemes and demonstrate their vulnerability to both watermark removal and forgery attacks. We assess existing attacks and further introduce three new attacks: (i) a vector-quantized regeneration removal attack, (ii) adversarial optimization-based attack, and (iii) a frequency injection attack. Our evaluation reveals that removal and forgery attacks can be effective with access to a single watermarked reference image and without access to original model parameters or watermarking secrets. Our findings indicate that existing watermarking schemes for AR image generation do not reliably support synthetic content detection for dataset filtering. Moreover, they enable Watermark Mimicry, whereby authentic images can be manipulated to imitate a generator’s watermark and trigger false detection to prevent their inclusion in future model training.

[CV-16] BEM: Training-Free Background Embedding Memory for False-Positive Suppression in Real-Time Fixed-Background Camera ICPR2026

【速读】:该论文旨在解决预训练目标检测器在真实场景部署中因训练数据与目标环境分布差异而导致的性能下降问题,尤其针对类别稀疏训练下在密集单类或少数类场景(如监控和交通监测)中出现的误检率升高问题。解决方案的关键在于提出一种轻量级、无需训练且权重冻结的背景嵌入记忆模块(Background Embedding Memory, BEM),该模块在推理阶段利用固定摄像头环境下稳定的背景先验信息,通过估计干净背景嵌入、维护原型记忆库,并基于逆相似度加权惩罚机制重新评分检测置信度,从而有效降低假阳性率同时保持召回率。

链接: https://arxiv.org/abs/2604.11714
作者: Junwoo Park,Jangho Lee,Sunho Lim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICPR 2026

点击查看摘要

Abstract:Pretrained detectors perform well on benchmarks but often suffer performance degradation in real-world deployments due to distribution gaps between training data and target environments. COCO-like benchmarks emphasize category diversity rather than instance density, causing detectors trained under per-class sparsity to struggle in dense, single- or few-class scenes such as surveillance and traffic monitoring. In fixed-camera environments, the quasi-static background provides a stable, label-free prior that can be exploited at inference to suppress spurious detections. To address the issue, we propose Background Embedding Memory (BEM), a lightweight, training-free, weight-frozen module that can be attached to pretrained detectors during inference. BEM estimates clean background embeddings, maintains a prototype memory, and re-scores detection logits with an inverse-similarity, rank-weighted penalty, effectively reducing false positives while maintaining recall. Empirically, background-frame cosine similarity correlates negatively with object count and positively with Precision-Confidence AUC (P-AUC), motivating its use as a training-free control signal. Across YOLO and RT-DETR families on LLVIP and simulated surveillance streams, BEM consistently reduces false positives while preserving real-time performance. Our code is available at this https URL

[CV-17] Seeing Through the Tool: A Controlled Benchmark for Occlusion Robustness in Foundation Segmentation Models CVPR2026

【速读】:该论文旨在解决临床内镜中目标结构因手术器械或重叠组织导致的遮挡(occlusion)问题,这一挑战在基础分割模型(foundation segmentation models)中尚未得到充分研究。其解决方案的关键在于提出一个名为OccSAM-Bench的基准测试框架,通过可控的合成遮挡模拟两种类型(器械覆盖与切口遮挡)和三个严重等级,在三个公开的息肉数据集上系统评估SAM系列模型。此外,论文创新性地设计了三区域评价协议,将分割性能分解为完整区域、仅可见区域和不可见区域,从而揭示标准非遮挡(amodal)评估所掩盖的模型行为差异,识别出两类不同架构特征:遮挡感知型(Occluder-Aware)模型优先于可见组织边界划分并拒绝器械预测,而遮挡无关型(Occluder-Agnostic)模型则自信地推断被遮挡区域。这一发现表明,模型对遮挡的鲁棒性并非统一,临床场景下的模型选择应依据具体意图——是保守地聚焦可见组织分割还是实现隐藏解剖结构的非遮挡推理。

链接: https://arxiv.org/abs/2604.11711
作者: Nhan Ho,Luu Le,Thanh-Huy Nguyen,Thien Nguyen,Xiaofeng Liu,Ulas Bagci
机构: Stony Brook University (纽约州立大学石溪分校); AIMA Research Lab; Carnegie Mellon University (卡内基梅隆大学); Yale University (耶鲁大学); Northwestern University (西北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CV4Clinic, CVPR 2026. 10 pages, 4 figures

点击查看摘要

Abstract:Occlusion, where target structures are partially hidden by surgical instruments or overlapping tissues, remains a critical yet underexplored challenge for foundation segmentation models in clinical endoscopy. We introduce OccSAM-Bench, a benchmark designed to systematically evaluate SAM-family models under controlled, synthesized surgical occlusion. Our framework simulates two occlusion types (i.e., surgical tool overlay and cutout) across three calibrated severity levels on three public polyp datasets. We propose a novel three-region evaluation protocol that decomposes segmentation performance into full, visible-only, and invisible targets. This metric exposes behaviors that standard amodal evaluation obscures, revealing two distinct model archetypes: Occluder-Aware models (SAM, SAM 2, SAM 3, MedSAM3), which prioritize visible tissue delineation and reject instruments, and Occluder-Agnostic models (MedSAM, MedSAM2), which confidently predict into occluded regions. SAM-Med2D aligns with neither and underperforms across all conditions. Ultimately, our results demonstrate that occlusion robustness is not uniform across architectures, and model selection must be driven by specific clinical intent-whether prioritizing conservative visible-tissue segmentation or the amodal inference of hidden anatomy.

[CV-18] Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

【速读】:该论文旨在解决复杂动态环境(如自动驾驶)中未来视频预测的两大核心挑战:高视觉保真度与场景语义的一致性难以同时实现的问题。传统方法直接预测RGB帧,易导致语义漂移或视觉失真。解决方案的关键在于提出一种分阶段的层次化预测框架Re2Pix:首先在冻结的视觉基础模型特征空间中预测未来场景结构(即语义表示),再以该表示作为条件引导潜在扩散模型生成逼真图像。这种“语义先行、视觉后置”的设计使模型能分别聚焦于场景动态建模和外观合成,显著提升时序语义一致性与感知质量。此外,为缓解训练-推理阶段因真实表示与预测表示不一致带来的性能下降,作者引入嵌套丢弃(nested dropout)与混合监督(mixed supervision)两种条件策略,增强对自回归预测误差的鲁棒性。

链接: https://arxiv.org/abs/2604.11707
作者: Efstathios Karypidis,Spyros Gidaris,Nikos Komodakis
机构: Athena Research Center (阿耳忒弥斯研究中心); valeo.ai; National Technical University of Athens (雅典国立技术大学); University of Crete (克里特大学); IACM-Forth (希腊国家研究中心-弗洛斯研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as autonomous driving. We present Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages: semantic representation prediction and representation-guided visual synthesis. Instead of directly predicting future RGB frames, our approach first forecasts future scene structure in the feature space of a frozen vision foundation model, and then conditions a latent diffusion model on these predicted representations to render photorealistic frames. This decomposition enables the model to focus first on scene dynamics and then on appearance generation. A key challenge arises from the train-test mismatch between ground-truth representations available during training and predicted ones used at inference. To address this, we introduce two conditioning strategies, nested dropout and mixed supervision, that improve robustness to imperfect autoregressive predictions. Experiments on challenging driving benchmarks demonstrate that the proposed semantics-first design significantly improves temporal semantic consistency, perceptual quality, and training efficiency compared to strong diffusion baselines. We provide the implementation code at this https URL

[CV-19] LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型因显式动作数据稀缺而导致的训练瓶颈问题,同时探索如何从大规模未标注人类动作视频中提取可迁移的潜在动作表示(latent action representation),以提升机器人控制的鲁棒性。其解决方案的关键在于构建了一个统一的评估基准——LARY(Latent Action Representation Yielding)基准,该基准能够同时衡量潜在动作表示在高层语义动作理解(what to do)和底层机器人控制(how to do)两个维度上的性能,并通过包含超过100万条视频、62万张图像对及595万条运动轨迹的多场景、多实体数据集进行系统验证。实验表明,无需任何动作监督的通用视觉基础模型优于专门设计的具身潜在动作模型,且基于潜在动作的空间比像素空间更贴近物理动作空间,揭示了通用视觉表征天然蕴含可用于物理控制的动作相关知识,而语义抽象是连接视觉与动作的更高效路径。

链接: https://arxiv.org/abs/2604.11689
作者: Dujun Nie,Fengjiao Chen,Qi Lv,Jun Kuang,Xiaoyu Li,Xuezhi Cao,Xunliang Cai
机构: Meituan(美团)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project: this https URL Code: this https URL Dataset: this https URL

点击查看摘要

Abstract:While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representation to derive robust control from visual observations has yet to be rigorously evaluated. We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do). The comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) General visual foundation models, trained without any action supervision, consistently outperform specialized embodied latent action models. (ii) Latent-based visual space is fundamentally better aligned to physical action space than pixel-based space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction serves as a fundamentally more effective pathway from vision to action than pixel-level reconstruction.

[CV-20] Unfolding 3D Gaussian Splatting via Iterative Gaussian Synopsis

【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在流媒体和资源受限环境中部署时面临的存储开销大、表示结构无序以及现有层次细节(Level-of-Detail, LOD)策略易引入冗余或导致保真度下降的问题。其解决方案的关键在于提出一种基于自顶向下“展开”机制的迭代高斯摘要(Iterative Gaussian Synopsis)框架:从全分辨率3DGS模型出发,通过可学习的基于掩码的剪枝机制逐层生成更粗粒度的LOD层级,同时结合分层空间网格与共享锚点码本(Anchor Codebook)以协同建模全局场景结构与局部细节,从而实现紧凑且富有表现力的特征表示,在保证各层级视觉质量的同时显著降低存储需求,并支持高效、层级特异的渐进式优化。

链接: https://arxiv.org/abs/2604.11685
作者: Yuqin Lu,Yang Zhou,Yihua Dai,Guiqing Li,Shengfeng He
机构: South China University of Technology (华南理工大学); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has become a state-of-the-art framework for real-time, high-fidelity novel view synthesis. However, its substantial storage requirements and inherently unstructured representation pose challenges for deployment in streaming and resource-constrained environments. Existing Level-of-Detail (LOD) strategies, particularly those based on bottom-up construction, often introduce redundancy or lead to fidelity degradation. To overcome these limitations, we propose Iterative Gaussian Synopsis, a novel framework for compact and progressive rendering through a top-down “unfolding” scheme. Our approach begins with a full-resolution 3DGS model and iteratively derives coarser LODs using an adaptive, learnable mask-based pruning mechanism. This process constructs a multi-level hierarchy that preserves visual quality while improving efficiency. We integrate hierarchical spatial grids, which capture the global scene structure, with a shared Anchor Codebook that models localized details. This combination produces a compact yet expressive feature representation, designed to minimize redundancy and support efficient, level-specific adaptation. The unfolding mechanism promotes inter-layer reusability and requires only minimal data overhead for progressive refinement. Experiments show that our method maintains high rendering quality across all LODs while achieving substantial storage reduction. These results demonstrate the practicality and scalability of our approach for real-time 3DGS rendering in bandwidth- and memory-constrained scenarios.

[CV-21] owards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge

【速读】:该论文旨在解决临床脑部磁共振成像(MRI)自动分析中面临的挑战:临床数据具有异质性和噪声,而高质量标注数据获取成本极高。为应对这一问题,研究提出通过自监督学习(SSL)利用临床工作流程中产生的大量未标注数据来训练鲁棒的“基础模型”(foundation models),使其能够在域外场景下以极少标注数据实现良好适应。其关键解决方案在于组织了FOMO25挑战赛,提供大规模预训练数据集FOMO60K,并在真实临床来源的数据上评估模型在少样本和域外设置下的表现,从而验证了自监督预训练能显著提升模型在域偏移下的泛化能力,且小规模预训练模型即可取得优异性能,同时揭示不同预训练目标对不同任务(如分割、分类、回归)存在差异化优势。

链接: https://arxiv.org/abs/2604.11679
作者: Asbjørn Munk,Stefano Cerri,Vardan Nersesjan,Christian Hedeager Krag,Jakob Ambsdorf,Pablo Rocamora García,Julia Machnio,Peirong Liu,Suhyun Ahn,Nasrin Akbari,Yasmina Al Khalil,Kimberly Amador,Sina Amirrajab,Tal Arbel,Meritxell Bach Cuadra,Ujjwal Baid,Bhakti Baheti,Jaume Banus,Kamil Barbierik,Christoph Brune,Yansong Bu,Baptiste Callard,Yuhan Chen,Cornelius Crijnen,Corentin Dancette,Peter Drotar,Prasad Dutande,Nils D. Forkert,Saurabh Garg,Jakub Gazda,Matej Gazda,Benoît Gérin,Partha Ghosh,Weikang Gong,Pedro M. Gordaliza,Sam Hashemi,Tobias Heimann,Fucang Jia,Jiexin Jiang,Emily Kaczmarek,Chris Kang,Seung Kwan Kang,Mohammad Khazaei,Julien Khlaut,Petros Koutsouvelis,Jae Sung Lee,Yuchong Li,Mengye Lyu,Mingchen Ma,Anant Madabhushi,Klaus H. Maier-Hein,Pierre Manceron,Andrés Martínez Mora,Moona Mazher,Felix Meister,Nataliia Molchanova,Steven A. Niederer,Leonard Nürnberg,Jinah Park,Abdul Qayyum,Jonas Richiardi,Antoine Saporta,Branislav Setlak,Ning Shen,Justin Szeto,Constantin Ulrich,Puru Vaish,Vibujithan Vigneshwaran,Leroy Volmer,Zihao Wang,Siqi Wei,Anthony Winder,Jelmer M. Wolterink,Maxence Wynen,Chang Yang,Si Young Yie,Mostafa Mehdipour Ghazi,Akshay Pai,Espen Jimenez Solem,Sebastian Nørgaard Llambias,Mikael Boesen,Michael Eriksen Benros,Juan Eugenio Iglesias,Mads Nielsen
机构: University of Copenhagen (哥本哈根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Clinical deployment of automated brain MRI analysis faces a fundamental challenge: clinical data is heterogeneous and noisy, and high-quality labels are prohibitively costly to obtain. Self-supervised learning (SSL) can address this by leveraging the vast amounts of unlabeled data produced in clinical workflows to train robust \textitfoundation models that adapt out-of-domain with minimal supervision. However, the development of foundation models for brain MRI has been limited by small pretraining datasets and in-domain benchmarking focused on high-quality, research-grade data. To address this gap, we organized the FOMO25 challenge as a satellite event at MICCAI 2025. FOMO25 provided participants with a large pretraining dataset, FOMO60K, and evaluated models on data sourced directly from clinical workflows in few-shot and out-of-domain settings. Tasks covered infarct classification, meningioma segmentation, and brain age regression, and considered both models trained on FOMO60K (method track) and any data (open track). Nineteen foundation models from sixteen teams were evaluated using a standardized containerized pipeline. Results show that (a) self-supervised pretraining improves generalization on clinical data under domain shift, with the strongest models trained \textitout-of-domain surpassing supervised baselines trained \textitin-domain. (b) No single pretraining objective benefits all tasks: MAE favors segmentation, hybrid reconstruction-contrastive objectives favor classification, and © strong performance was achieved by small pretrained models, and improvements from scaling model size and training duration did not yield reliable benefits.

[CV-22] UNIGEOCLIP: Unified Geospatial Contrastive Learning

【速读】:该论文旨在解决多模态地理空间数据(包括航空影像、街景图像、高程模型、文本和地理坐标)在统一嵌入空间中协同对齐与表示学习的问题,以提升跨模态检索、比较与推理能力。其解决方案的关键在于提出UNIGEOCLIP框架,采用全对全对比对齐(all-to-all contrastive alignment)策略,摒弃传统依赖中心枢纽表示或模态融合的方法,实现了五种互补地理空间模态的端到端联合对齐;同时引入可扩展的经纬度编码器(scaled latitude-longitude encoder),增强空间结构的多尺度表征能力,从而显著优于单一模态对比模型和仅使用坐标基线的方法。

链接: https://arxiv.org/abs/2604.11668
作者: Guillaume Astruc,Eduard Trulls,Jan Hosang,Loic Landrieu,Paul-Edouard Sarlin
机构: LASTIG, Univ Gustave Eiffel, IGN, ENSG, France; Google, Switzerland; CNES, France; LIGM, CNRS, Univ Gustave Eiffel, ENPC, Institut Polytechnique de Paris, Marne-la-Vallée, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The growing availability of co-located geospatial data spanning aerial imagery, street-level views, elevation models, text, and geographic coordinates offers a unique opportunity for multimodal representation learning. We introduce UNIGEOCLIP, a massively multimodal contrastive framework to jointly align five complementary geospatial modalities in a single unified embedding space. Unlike prior approaches that fuse modalities or rely on a central pivot representation, our method performs all-to-all contrastive alignment, enabling seamless comparison, retrieval, and reasoning across arbitrary combinations of modalities. We further propose a scaled latitude-longitude encoder that improves spatial representation by capturing multi-scale geographic structure. Extensive experiments across downstream geospatial tasks demonstrate that UNIGEOCLIP consistently outperforms single-modality contrastive models and coordinate-only baselines, highlighting the benefits of holistic multimodal geospatial alignment. A reference implementation is available at this https URL.

[CV-23] GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays

【速读】:该论文旨在解决医学影像领域中专家与人工智能(AI)在胸部X光片真实性评估和诊断决策过程中视觉注意力机制及不确定性表现差异的量化分析问题。其关键解决方案是构建并公开GazeVaLM数据集,该数据集包含16名资深放射科医生对30张真实与30张由扩散生成式AI(diffusion-based generative AI)合成的胸部X光片在诊断评估和真假分类(视觉图灵测试)两种条件下产生的高精度眼动追踪数据(包括原始注视点、凝视热力图、扫描路径、显著性密度图),以及对应的结构化诊断标签和真实性判断;同时扩展至6个先进的多模态大语言模型(multimodal LLMs),提供其在相同条件下的预测诊断、真实性标签及置信度评分,从而实现人类专家与AI系统在决策层面和不确定性层面的直接对比。此设计支持对视觉注意建模、临床决策过程、人机比较、生成图像真实感评估及不确定性量化等研究,推动可复现的医学影像感知与理解研究。

链接: https://arxiv.org/abs/2604.11653
作者: David Wong,Zeynep Isik,Bin Wang,Marouane Tliba,Gorkem Durak,Elif Keles,Halil Ertugrul Aktas,Aladine Chetouani,Cagdas Topel,Nicolo Gennaro,Camila Lopes Vendrami,Tugce Agirlar Trabzonlu,Amir Ali Rahsepar,Laetitia Perronne,Matthew Antalek,Onural Ozturk,Gokcan Okur,Andrew C. Gordon,Ayis Pyrros,Frank H. Miller,Amir Borhani,Hatice Savas,Eric Hart,Elizabeth Krupinski,Ulas Bagci
机构: Northwestern University(西北大学); Université Sorbonne Paris Nord(索邦巴黎第一大学); Loyola University Chicago(洛约拉大学芝加哥分校); Emory University(埃默里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work appears in ACM ETRA 2026

点击查看摘要

Abstract:We introduce GazeVaLM, a public eye-tracking dataset for studying clinical perception during chest radiograph authenticity assessment. The dataset comprises 960 gaze recordings from 16 expert radiologists interpreting 30 real and 30 synthetic chest X-rays (generated by diffusion based generative AI) under two conditions: diagnostic assessment and real-fake classification (Visual Turing test). For each image-observer pair, we provide raw gaze samples, fixation maps, scanpaths, saliency density maps, structured diagnostic labels, and authenticity judgments. We extend the protocol to 6 state-of-the-art multimodal LLMs, releasing their predicted diagnoses, authenticity labels, and confidence scores under matched conditions - enabling direct human-AI comparison at both decision and uncertainty levels. We further provide analyses of gaze agreement, inter-observer consistency, and benchmarking of radiologists versus LLMs in diagnostic accuracy and authenticity detection. GazeVaLM supports research in gaze modeling, clinical decision-making, human-AI comparison, generative image realism assessment, and uncertainty quantification. By jointly releasing visual attention data, clinical labels, and model predictions, we aim to facilitate reproducible research on how experts and AI systems perceive, interpret, and evaluate medical images. The dataset is available at this https URL.

[CV-24] STS-Mixer: Spatio-Temporal-Spectral Mixer for 4D Point Cloud Video Understanding CVPR2026

【速读】:该论文旨在解决现有方法在处理4D点云视频时难以有效捕捉其潜在几何特征的问题,导致表示学习和理解性能下降。其解决方案的关键在于从互补的谱域视角出发,将4D点云视频转换为图谱信号(graph spectral signals),并通过多频带分解揭示不同频率成分所对应的几何结构:低频信号捕获粗粒度形状,高频信号编码细粒度几何细节。基于此发现,作者提出Spatio-Temporal-Spectral Mixer (STS-Mixer) 框架,通过融合多频带谱信号与时空信息,实现对4D点云视频中丰富几何结构与时间动态性的联合建模,从而提升动作识别与语义分割等任务的性能。

链接: https://arxiv.org/abs/2604.11637
作者: Wenhao Li,Xueying Jiang,Gongjie Zhang,Xiaoqin Zhang,Ling Shao,Shijian Lu
机构: Nanyang Technological University (南洋理工大学); Alibaba Group (阿里巴巴集团); Zhejiang University of Technology (浙江工业大学); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026, Open Sourced

点击查看摘要

Abstract:4D point cloud videos capture rich spatial and temporal dynamics of scenes which possess unique values in various 4D understanding tasks. However, most existing methods work in the spatiotemporal domain where the underlying geometric characteristics of 4D point cloud videos are hard to capture, leading to degraded representation learning and understanding of 4D point cloud videos. We address the above challenge from a complementary spectral perspective. By transforming 4D point cloud videos into graph spectral signals, we can decompose them into multiple frequency bands each of which captures distinct geometric structures of point cloud videos. Our spectral analysis reveals that the decomposed low-frequency signals capture more coarse shapes while high-frequency signals encode more fine-grained geometry details. Building on these observations, we design Spatio-Temporal-Spectral Mixer (STS-Mixer), a unified framework that mixes spatial, temporal, and spectral representations of point cloud videos. STS-Mixer integrates multi-band delineated spectral signals with spatiotemporal information to capture rich geometries and temporal dynamics, while enabling fine-grained and holistic understanding of 4D point cloud videos. Extensive experiments show that STS-Mixer achieves superior performance consistently across multiple widely adopted benchmarks on both 3D action recognition and 4D semantic segmentation tasks. Code and models are available at this https URL.

[CV-25] MorphoFlow: Sparse-Supervised Generative Shape Modeling with Adaptive Latent Relevance

【速读】:该论文旨在解决现有统计形状建模(Statistical Shape Modeling, SSM)方法依赖密集标注分割和固定潜在表示所带来的可扩展性差与灵活性不足的问题,尤其在复杂解剖变异建模中表现受限。其核心解决方案是提出MorphoFlow框架,该框架通过结合神经隐式形状表示(neural implicit shape representations)、自编码器(autodecoder)形式化以及自回归归一化流(autoregressive normalizing flows),直接从稀疏表面标注中学习紧凑且概率化的形状潜在表示。关键创新在于:利用神经隐式表示实现分辨率无关的三维解剖建模,通过自编码器结构支持在稀疏监督下对每个实例的潜在码进行端到端优化,并借助自回归流捕获潜在解剖变异分布,从而构建一个具有似然基础的生成模型;同时引入基于稀疏诱导先验的自适应潜在相关性加权机制,在保持生成表达能力的同时,使模型能根据各潜在维度对解剖变异的相关性自动调节其贡献,最终实现无需人工调整潜在维度即可支持不确定性量化和解剖学合理形状合成的紧凑结构化潜空间。

链接: https://arxiv.org/abs/2604.11636
作者: Mokshagna Sai Teja Karanam,Tushar Kataria,Shireen Elhabian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Statistical shape modeling (SSM) is central to population level analysis of anatomical variability, yet most existing approaches rely on densely annotated segmentations and fixed latent representations. These requirements limit scalability and reduce flexibility when modeling complex anatomical variation. We introduce MorphoFlow, a sparse supervised generative shape modeling framework that learns compact probabilistic shape representations directly from sparse surface annotations. MorphoFlow integrates neural implicit shape representations with an autodecoder formulation and autoregressive normalizing flows to learn an expressive probabilistic density over the latent shape space. The neural implicit representation enables resolution-agnostic modeling of 3D anatomy, while the autodecoder formulation supports direct optimization of per-instance latent codes under sparse supervision. The autoregressive flow captures the distribution of latent anatomical variability providing a tractable, likelihood-based generative model of shapes. To promote compact and structured latent representations, we incorporate adaptive latent relevance weighting through sparsity-inducing priors, enabling the model to regulate the contribution of individual latent dimensions according to their relevance to the underlying anatomical variation while preserving generative expressivity. The resulting latent space supports uncertainty quantification and anatomically plausible shape synthesis without manual latent dimensionality tuning. Evaluation on publicly available lumbar vertebrae and femur datasets demonstrates accurate high-resolution reconstruction from sparse inputs and recovery of structured modes of anatomical variation consistent with population level trends.

[CV-26] POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLM s

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在长视频和流式视觉场景中因视觉标记序列快速增长而导致的可扩展性与实际部署难题。解决方案的关键在于提出一种原生双模式MLLM——POINTS-Long,其核心创新是受人类视觉系统启发的动态视觉标记缩放机制,支持“聚焦模式”和“待机模式”两种互补感知模式:在细粒度视觉任务中采用聚焦模式以保持最优性能,而在长视频通用视觉理解任务中启用待机模式,仅使用原始视觉标记的1/40至1/10即可维持97.7%-99.7%的准确率;同时通过可动态解耦的键值缓存(KV-cache)设计实现流式视觉理解,从而高效维护超长视觉记忆。

链接: https://arxiv.org/abs/2604.11627
作者: Haicheng Wang,Yuan Liu,Yikun Liu,Zhemeng Yu,Zhongyin Zhao,Yangxiu You,Zilin Yu,Le Tian,Xiao Zhou,Jie Zhou,Weidi Xie,Yanfeng Wang
机构: Shanghai Jiao Tong University (上海交通大学); WeChat AI, Tencent (微信人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences–especially in long-video and streaming scenarios–poses a major challenge to their scalability and real-world deployment. Thus, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes: focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains the optimal performance, while on long-form general visual understanding, the standby mode retains 97.7-99.7% of the original accuracy using only 1/40-1/10th of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding.

[CV-27] Geoparsing: Diagram Parsing for Plane and Solid Geometry with a Unified Formal Language ACL2026

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在几何推理任务中表现不佳的问题,尤其是其在处理立体几何(solid geometry)时因空间理解能力不足而面临的瓶颈。解决方案的关键在于设计了一种统一的形式化语言(formal language),能够同时覆盖平面几何(plane geometry)和立体几何的结构与语义关系,并构建了GDP-29K大规模数据集(包含20k平面几何和9k立体几何样本),每条样本均配有精确的符号化描述。此外,作者提出一种结合监督微调(Supervised Fine-Tuning)与基于可验证奖励的强化学习(Reinforcement Learning via Verifiable Rewards)的训练范式,确保形式化描述的语法正确性和几何一致性,从而显著提升MLLMs在下游几何推理任务中的性能。

链接: https://arxiv.org/abs/2604.11600
作者: Peijie Wang,Ming-Liang Zhang,Jun Cao,Chao Deng,Dekang Ran,Hongda Sun,Pi Bu,Xuan Zhang,Yingyao Wang,Jun Song,Bo Zheng,Fei Yin,Cheng-Lin Liu
机构: MAIS, Institute of Automation of Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Future Living Lab of Alibaba (阿里巴巴未来生活实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ACL2026

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable progress but continue to struggle with geometric reasoning, primarily due to the perception bottleneck regarding fine-grained visual elements. While formal languages have aided plane geometry understanding, solid geometry which requires spatial understanding remains largely unexplored. In this paper, we address this challenge by designing a unified formal language that integrates plane and solid geometry, comprehensively covering geometric structures and semantic relations. We construct GDP-29K, a large-scale dataset comprising 20k plane and 9k solid geometry samples collected from diverse real-world sources, each paired with its ground-truth formal description. To ensure syntactic correctness and geometric consistency, we propose a training paradigm that combines Supervised Fine-Tuning with Reinforcement Learning via Verifiable Rewards. Experiments show that our approach achieves state-of-the-art parsing performance. Furthermore, we demonstrate that our parsed formal descriptions serve as a critical cognitive scaffold, significantly boosting MLLMs’ capabilities for downstream geometry reasoning tasks. Our data and code are available at Geoparsing.

[CV-28] Learning Robustness at Test-Time from a Non-Robust Teacher

【速读】:该论文旨在解决预训练模型在目标域数据稀缺且无标签的测试时适应(test-time adaptation)场景下,如何提升其对抗鲁棒性的问题。当前主流方法虽能改善干净样本上的准确率,但对对抗扰动的鲁棒性关注不足,尤其当原始预训练模型本身并非为鲁棒性设计时。解决方案的关键在于提出一种无需标签的框架,利用非鲁棒教师模型的预测作为语义锚点(semantic anchor),同时约束干净样本和对抗样本的学习目标,从而实现更稳定的优化过程。该方法通过理论分析表明,相较于传统基于自一致性正则化的策略,其形式更具稳定性,并在CIFAR-10和ImageNet上的实验验证了其在优化稳定性、参数敏感性及鲁棒性-准确率权衡方面的显著优势。

链接: https://arxiv.org/abs/2604.11590
作者: Stefano Bianchettin,Giulio Rossolini,Giorgio Buttazzo
机构: Scuola Superiore Sant’Anna (圣安娜高等学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Nowadays, pretrained models are increasingly used as general-purpose backbones and adapted at test-time to downstream environments where target data are scarce and unlabeled. While this paradigm has proven effective for improving clean accuracy on the target domain, adversarial robustness has received far less attention, especially when the original pretrained model is not explicitly designed to be robust. This raises a practical question: \emphcan a pretrained, non-robust model be adapted at test-time to improve adversarial robustness on a target distribution? To face this question, this work studies how adversarial training strategies behave when integrated into adaptation schemes for the unsupervised test-time setting, where only a small set of unlabeled target samples is available. It first analyzes how classical adversarial training formulations can be extended to this scenario, showing that straightforward distillation-based adaptations remain unstable and highly sensitive to hyperparameter tuning, particularly when the teacher itself is non-robust. To address these limitations, the work proposes a label-free framework that uses the predictions of a non-robust teacher model as a semantic anchor for both the clean and adversarial objectives during adaptation. We further provide theoretical insights showing that our formulation yields a more stable alternative to the self-consistency-based regularization commonly used in classical adversarial training. Experiments evaluate the proposed approach on CIFAR-10 and ImageNet under induced photometric transformations. The results support the theoretical insights by showing that the proposed approach achieves improved optimization stability, lower sensitivity to parameter choices, and a better robustness-accuracy trade-off than existing baselines in this post-deployment test-time setting. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2604.11590 [cs.CV] (or arXiv:2604.11590v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.11590 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-29] MLLM -as-a-Judge Exhibits Model Preference Bias

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在自动评估中可能存在的模型特定偏好偏差问题,即MMLLM-as-a-Judge方法可能因自身倾向性而扭曲不同模型之间的公平比较。解决方案的关键在于提出Philautia-Eval框架,通过解耦偏好倾向与生成质量差异来量化此类偏差;并进一步设计了一个简单的多模型集成方法Pomms,实验证明其能有效缓解模型特定偏好偏差的同时保持原有性能水平。

链接: https://arxiv.org/abs/2604.11589
作者: Shuitsu Koyama,Yuiga Wada,Daichi Yashima,Komei Sugiura
机构: Keio University (庆应义塾大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automatic evaluation using multimodal large language models (MLLMs), commonly referred to as MLLM-as-a-Judge, has been widely used to measure model performance. If such MLLM-as-a-Judge methods were biased, they could distort model comparisons and benchmark-driven scientific progress. However, it remains unclear to what extent MLLM-as-a-Judge methods favor or disfavor text generated by specific MLLMs. In this study, we propose Philautia-Eval to investigate such model-specific preference bias. Philautia-Eval quantifies the degree of the bias by disentangling preference tendencies from differences in generation quality. Using 1.29M caption-score pairs collected from 12 MLLMs, we found that representative MLLMs tend to exhibit self-preference bias. Moreover, experimental results indicate mutual preference bias within particular model families, which is potentially driven by reused connectors and overlapping instruction-tuning resources. Finally, we introduce a simple ensemble of MLLMs, Pomms. Our results demonstrated that Pomms effectively mitigated the model-specific preference bias while maintaining performance.

[CV-30] GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth CVPR2026

【速读】:该论文旨在解决RGB-D感知系统在实际应用中常遇到的深度信息缺失、噪声或损坏问题,这些问题会显著影响语义分割等任务的性能。解决方案的关键在于提出两种轻量级跨模态适配模块:GeomPrompt 和 GeomPrompt-Recovery。其中,GeomPrompt 通过仅利用RGB图像生成一个面向任务的几何提示(geometric prompt),作为冻结的RGB-D语义分割模型的第四通道输入,从而无需深度监督即可恢复几何先验;而 GeomPrompt-Recovery 则进一步针对退化深度进行修正,预测与分割任务相关的第四通道校正项。二者均仅依赖下游分割监督进行训练,有效提升了在缺失或劣质深度条件下的鲁棒性与效率。

链接: https://arxiv.org/abs/2604.11585
作者: Krishna Jaganathan,Patricio Vela
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to the CVPR 2026 URVIS Workshop. Project page: this https URL

点击查看摘要

Abstract:Multimodal perception systems for robotics and embodied AI often assume reliable RGB-D sensing, but in practice, depth is frequently missing, noisy, or corrupted. We thus present GeomPrompt, a lightweight cross-modal adaptation module that synthesizes a task-driven geometric prompt from RGB alone for the fourth channel of a frozen RGB-D semantic segmentation model, without depth supervision. We further introduce GeomPrompt-Recovery, an adaptation module that compensates for degraded depth by predicting the fourth channel correction relevant for the frozen segmenter. Both modules are trained solely with downstream segmentation supervision, enabling recovery of the geometric prior useful for segmentation, rather than estimating depth signals. On SUN RGB-D, GeomPrompt improves over RGB-only inference by +6.1 mIoU on DFormer and +3.0 mIoU on GeminiFusion, while remaining competitive with strong monocular depth estimators. For degraded depth, GeomPrompt-Recovery consistently improves robustness, yielding gains up to +3.6 mIoU under severe depth corruptions. GeomPrompt is also substantially more efficient than monocular depth baselines, reaching 7.8 ms latency versus 38.3 ms and 71.9 ms. These results suggest that task-driven geometric prompting is an efficient mechanism for cross-modal compensation under missing and degraded depth inputs in RGB-D perception.

[CV-31] Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions CVPR2026 KR

【速读】:该论文旨在解决触觉定位(tactile localization)问题,即识别图像中与触觉输入具有相同材料属性的区域。现有方法依赖全局对齐策略,难以捕捉任务所需的细粒度局部对应关系,且受限于数据集多样性不足的问题。其解决方案的关键在于提出一种通过密集跨模态特征交互学习局部视觉-触觉对齐的模型,生成触觉条件下的材料显著性图(tactile saliency maps),从而实现触觉引导的材料分割;同时引入野外多材质场景图像和材料多样性配对策略,提升上下文定位能力与弱信号下的鲁棒性,并构建两个新的触觉基准数据集用于定量评估,实验表明该方法在多个基准上显著优于现有视觉-触觉方法。

链接: https://arxiv.org/abs/2604.11579
作者: Seongyu Kim,Seungwoo Lee,Hyeonggon Ryu,Joon Son Chung,Arda Senocak
机构: Korea Advanced Institute of Science and Technology (韩国科学技术院); Hankuk University of Foreign Studies (韩国外国语大学); Ulsan National Institute of Science and Technology (蔚山国立科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. Project page: this https URL

点击查看摘要

Abstract:We address the problem of tactile localization, where the goal is to identify image regions that share the same material properties as a tactile input. Existing visuo-tactile methods rely on global alignment and thus fail to capture the fine-grained local correspondences required for this task. The challenge is amplified by existing datasets, which predominantly contain close-up, low-diversity images. We propose a model that learns local visuo-tactile alignment via dense cross-modal feature interactions, producing tactile saliency maps for touch-conditioned material segmentation. To overcome dataset constraints, we introduce: (i) in-the-wild multi-material scene images that expand visual diversity, and (ii) a material-diversity pairing strategy that aligns each tactile sample with visually varied yet tactilely consistent images, improving contextual localization and robustness to weak signals. We also construct two new tactile-grounded material segmentation datasets for quantitative evaluation. Experiments on both new and existing benchmarks show that our approach substantially outperforms prior visuo-tactile methods in tactile localization.

[CV-32] Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models CVPR

【速读】:该论文旨在解决视觉语言模型(如CLIP)在面对对抗攻击时鲁棒性不足的问题,同时避免现有微调方法因忽略训练数据分布和学习目标而导致的零样本能力下降及鲁棒性迁移性受限。其解决方案的关键在于提出一种名为AdvFLYP的简单但有效的微调范式,该范式在对抗微调过程中严格遵循CLIP预训练阶段的训练流程:使用从网络收集的图像-文本对生成对抗样本,并通过对比损失(contrastive loss)将对抗图像与其对应文本对齐;此外,为进一步缓解噪声网络图像导致的对抗图像嵌入失真,引入了对对抗图像特征偏离度的正则化项——其中logit级正则化提升鲁棒性,特征级正则化提升干净样本准确率。

链接: https://arxiv.org/abs/2604.11576
作者: Songlong Xing,Weijie Wang,Zhengyu Zhao,Jindong Gu,Philip Torr,Nicu Sebe
机构: University of Trento (特伦托大学); Fondazione Bruno Kessler (布鲁诺·凯斯勒基金会); Xi’an Jiaotong University (西安交通大学); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR Findings Track 2026

点击查看摘要

Abstract:Despite their impressive zero-shot abilities, vision-language models such as CLIP have been shown to be susceptible to adversarial attacks. To enhance its adversarial robustness, recent studies finetune the pretrained vision encoder of CLIP with adversarial examples on a proxy dataset such as ImageNet by aligning adversarial images with correct class labels. However, these methods overlook the important roles of training data distributions and learning objectives, resulting in reduced zero-shot capabilities and limited transferability of robustness across domains and datasets. In this work, we propose a simple yet effective paradigm AdvFLYP, which follows the training recipe of CLIP’s pretraining process when performing adversarial finetuning to the model. Specifically, AdvFLYP finetunes CLIP with adversarial images created based on image-text pairs collected from the web, and match them with their corresponding texts via a contrastive loss. To alleviate distortion of adversarial image embeddings of noisy web images, we further propose to regularise AdvFLYP by penalising deviation of adversarial image features. We show that logit- and feature-level regularisation terms benefit robustness and clean accuracy, respectively. Extensive experiments on 14 downstream datasets spanning various domains show the superiority of our paradigm over mainstream practices. Our code and model weights are released at this https URL.

[CV-33] raining-Free Model Ensemble for Single-Image Super-Resolution via Strong-Branch Compensation

【速读】:该论文旨在解决单图像超分辨率(Single-image Super-Resolution, SISR)任务中,现有高性能模型(如Transformer和状态空间架构)虽能提升重建质量,但伴随训练成本高、工程迭代周期长及部署负担重的问题;尤其在已有多个预训练模型具备部分互补行为的场景下,如何在不进行额外训练的前提下高效融合其输出成为关键瓶颈。解决方案的核心在于提出一种无需训练的输出级集成框架:构建双分支结构,其中混合注意力网络(Hybrid attention network)结合TLC推理提供稳定主重建,MambaIRv2分支通过几何自集成增强高频细节恢复能力;两个分支独立处理相同低分辨率输入,并在图像空间中以轻量加权方式融合,无需更新任何模型参数或引入可训练模块,从而实现低成本且有效的性能提升。

链接: https://arxiv.org/abs/2604.11564
作者: Gengjia Chang,Xining Ge,Weijun Yuan,Zhan Li,Qiurong Song,Luen Zhu,Shuhong Liu
机构: Hefei University of Technology (合肥工业大学); Hangzhou Dianzi University (杭州电子科技大学); Jinan University (暨南大学); South China Agricultural University (华南农业大学); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Single-image super-resolution has progressed from deep convolutional baselines to stronger Transformer and state-space architectures, yet the corresponding performance gains typically come with higher training cost, longer engineering iteration, and heavier deployment burden. In many practical settings, multiple pretrained models with partially complementary behaviors are already available, and the binding constraint is no longer architectural capacity but how effectively their outputs can be combined without additional training. Rather than pursuing further architectural redesign, this paper proposes a training-free output-level ensemble framework. A dual-branch pipeline is constructed in which a Hybrid attention network with TLC inference provides stable main reconstruction, while a MambaIRv2 branch with geometric self-ensemble supplies strong compensation for high-frequency detail recovery. The two branches process the same low-resolution input independently and are fused in the image space via a lightweight weighted combination, without updating any model parameters or introducing an additional trainable module. As our solution to the NTIRE 2026 Image Super-Resolution ( \times 4 ) Challenge, the proposed design consistently improves over the base branch and slightly exceeds the pure strong branch in PSNR at the best operating point under a unified DIV2K bicubic \times 4 evaluation protocol. Ablation studies confirm that output-level compensation provides a low-overhead and practically accessible upgrade path for existing super-resolution systems.

[CV-34] he Impact of Federated Learning on Distributed Remote Sensing Archives

【速读】:该论文旨在解决遥感影像数据在分布式环境下的联邦学习(Federated Learning, FL)训练问题,特别是由地理区域导致的非独立同分布(non-IID)标签偏斜对标准FL算法收敛性能的负面影响。其关键解决方案是系统性地评估三种FL策略——FedAvg、FedProx与批量同步并行(Bulk Synchronous Parallel, BSP)——在多标签遥感图像分类任务中的表现,并结合不同深度的卷积神经网络(Convolutional Neural Network, CNN)架构分析模型容量、客户端比例、通信成本等参数的联合影响。实验表明,FedProx在数据异构性强的情况下优于FedAvg,BSP虽能逼近集中式训练精度但需高通信开销,而LeNet在当前数据规模下实现了最佳的准确率-通信效率权衡。

链接: https://arxiv.org/abs/2604.11562
作者: Anand Umashankar,Karam Tomotaki-Dawoud,Nicolai Schneider
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work was completed in 2021. It is posted as a historical record and reference baseline

点击查看摘要

Abstract:Remote sensing archives are inherently distributed: Earth observation missions such as Sentinel-1, Sentinel-2, and Sentinel-3 have collectively accumulated more than 5 petabytes of imagery, stored and processed across many geographically dispersed platforms. Training machine learning models on such data in a centralized fashion is impractical due to data volume, sovereignty constraints, and geographic distribution. Federated learning (FL) addresses this by keeping data local and exchanging only model updates. A central challenge for remote sensing is the non-IID nature of Earth observation data: label distributions vary strongly by geographic region, degrading the convergence of standard FL algorithms. In this paper, we conduct a systematic empirical study of three FL strategies – FedAvg, FedProx, and bulk synchronous parallel (BSP) – applied to multi-label remote sensing image classification under controlled non-IID label-skew conditions. We evaluate three convolutional neural network (CNN) architectures of increasing depth (LeNet, AlexNet, and ResNet-34) and analyze the joint effect of algorithm choice, model capacity, client fraction, client count, batch size, and communication cost. Experiments on the UC Merced multi-label dataset show that FedProx outperforms FedAvg for deeper architectures under data heterogeneity, that BSP approaches centralized accuracy at the cost of high sequential communication, and that LeNet provides the best accuracy-communication trade-off for the dataset scale considered.

[CV-35] Progressively Texture-Aware Diffusion for Contrast-Enhanced Sparse-View CT ICASSP2026

【速读】:该论文旨在解决稀疏视图计算机断层扫描(Sparse-view CT, SVCT)重建中难以恢复可靠图像内容和视觉一致纹理的问题。其解决方案的关键在于提出一种分阶段的纹理感知扩散模型(Progressively Texture-aware Diffusion, PTD),该模型由两个核心模块构成:一是基础重建模块PTDₘₑc,用于学习确定性映射以恢复主要的低频信号(即粗略内容与平滑纹理),提供初始估计以保障重建保真度;二是条件扩散模块PTDₘᵢff,通过双域引导的条件扩散机制,在粗略预测基础上重构高保真细节,从而生成可靠且视觉一致的纹理。此方法在仅需少量采样步数的情况下即可实现结构相似性和视觉吸引力的显著提升,有效缓解了通用扩散模型固有的随机性问题,并在高频细节保真度与视觉质量之间取得更好平衡。

链接: https://arxiv.org/abs/2604.11559
作者: Tianqi Wang,Wenchao Du,Hongyu Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: ICASSP2026

点击查看摘要

Abstract:Diffusion-based sparse-view CT (SVCT) imaging has achieved remarkable advancements in recent years, thanks to its more stable generative capability. However, recovering reliable image content and visually consistent textures is still a crucial challenge. In this paper, we present a Progressively Texture-aware Diffusion (PTD) model, a coarse-to-fine learning framework tailored for SVCT. Specifically, PTD comprises a basic reconstructive module PTD _\textitrec and a conditional diffusion module PTD _\textitdiff . PTD _\textitrec first learns a deterministic mapping to recover the majority of the underlying low-frequency signals (i.e., coarse content with smoothed textures), which serves as the initial estimation to enable fidelity. Moreover, PTD _\textitdiff aims to reconstruct high-fidelity details for coarse prediction, which explores a dual-domain guided conditional diffusion to generate reliable and consistent textures. Extensive experiments on sparse-view CT reconstruction demonstrate that our PTD achieves superior performance in terms of structure similarity and visual appeal with only a few sampling steps, which mitigates the randomness inherent in general diffusion models and enables a better trade-off between visual quality and fidelity of high-frequency details.

[CV-36] CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space CVPR2026

【速读】:该论文旨在解决现有图像检索系统在适应用户个性化需求方面的局限性问题,即大多数系统依赖于固定、单一的相似性度量标准,无法同时融合多种条件(如文本描述、用户兴趣等)来动态调整视觉相似性判断。其解决方案的关键在于提出CLAY方法,通过将预训练视觉-语言模型(Vision-Language Models, VLMs)的嵌入空间重构为一种文本条件化的相似性空间,无需额外训练即可实现多条件下的高效检索;该设计将文本条件处理与视觉特征提取解耦,使得基于固定视觉嵌入的多条件检索成为可能,从而显著提升检索准确性与计算效率。

链接: https://arxiv.org/abs/2604.11539
作者: Sohwi Lim,Lee Hyoseok,Jungjoon Park,Tae-Hyun Oh
机构: KAIST
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026, Project page: this https URL

点击查看摘要

Abstract:Human perception of visual similarity is inherently adaptive and subjective, depending on the users’ interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates the textual conditioning process and visual feature extraction, allowing highly efficient and multi-conditioned retrieval with fixed visual embeddings. We also construct a synthetic evaluation dataset CLAY-EVAL, for comprehensive assessment under diverse conditioned retrieval settings. Experiments on standard datasets and our proposed dataset show that CLAY achieves high retrieval accuracy and notable computational efficiency compared to previous works.

[CV-37] SVD-Prune: Training-Free Token Pruning For Efficient Vision-Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLM)在处理长序列视觉标记(vision tokens)时面临的高计算与内存开销问题,尤其是现有基于局部启发式策略(如注意力分数或token范数)的剪枝方法因位置偏差和信息分散导致在高剪枝比例下难以保留关键内容、性能下降的问题。解决方案的关键在于提出一种无需训练、即插即用的剪枝方法 SVD-Prune,其核心思想是通过奇异值分解(Singular Value Decomposition, SVD)对视觉标记特征矩阵进行分解,并利用统计杠杆得分(statistical leverage scores)选择前K个最具代表性的token,从而确保仅保留对全局主方差贡献最大的视觉信息,实现高效且稳定的剪枝效果。

链接: https://arxiv.org/abs/2604.11530
作者: Yvon Apedo,Martyna Poreba,Michal Szczepanski,Samia Bouchafa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLM) have revolutionized multimodal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing methods rely on local heuristics, such as attention scores or token norms. However, these criteria suffer from positional bias and information dispersion, limiting their ability to preserve essential content at high pruning ratios and leading to performance degradation on visually detailed images. To address these issues, we propose SVD-Prune, a trainingfree, plug-and-play token pruning method based on Singular Value Decomposition. It decomposes the vision token feature matrix and selects the top-K tokens using statistical leverage scores, ensuring only tokens contributing most to the dominant global variance are preserved. Experiments show that SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets, maintaining strong performance even with 32 and 16 vision tokens.

[CV-38] Continuous Adversarial Flow Models

【速读】:该论文旨在解决现有流匹配(flow matching)方法在生成模型训练中因固定均方误差(mean-squared-error)目标函数导致样本分布与真实数据分布对齐不足的问题。其解决方案的关键在于提出连续对抗流模型(continuous adversarial flow models),通过引入一个可学习的判别器(discriminator)替代原有的固定损失函数,从而引导模型学习更贴近目标数据分布的隐变量或像素空间轨迹。该方法不仅适用于从零训练模型,更主要地用于后训练(post-training)已有流匹配模型,在ImageNet 256px图像生成任务上显著提升无引导FID指标,并在文本到图像生成任务中于GenEval和DPG基准上取得更好性能。

链接: https://arxiv.org/abs/2604.11521
作者: Shanchuan Lin,Ceyuan Yang,Zhijie Lin,Hao Chen,Haoqi Fan
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose continuous adversarial flow models, a type of continuous-time flow model trained with an adversarial objective. Unlike flow matching, which uses a fixed mean-squared-error criterion, our approach introduces a learned discriminator to guide training. This change in objective induces a different generalized distribution, which empirically produces samples that are better aligned with the target data distribution. Our method is primarily proposed for post-training existing flow-matching models, although it can also train models from scratch. On the ImageNet 256px generation task, our post-training substantially improves the guidance-free FID of latent-space SiT from 8.26 to 3.63 and of pixel-space JiT from 7.17 to 3.57. It also improves guided generation, reducing FID from 2.06 to 1.53 for SiT and from 1.86 to 1.80 for JiT. We further evaluate our approach on text-to-image generation, where it achieves improved results on both the GenEval and DPG benchmarks.

[CV-39] AG-Head: Time-Aligned Graph Head for Plug-and-Play Fine-grained Action Recognition ICPR2026

【速读】:该论文旨在解决细粒度人体动作识别(Fine-grained Human Action Recognition, FHAR)中因视觉相似动作仅存在细微时空差异而导致的判别困难问题。传统方法依赖多模态信息(如姿态、文本或光流)提升性能,但增加了标注成本和计算开销。其解决方案的关键在于提出一种轻量级的时空图头(TAG-Head),仅使用RGB视频输入即可显著增强模型对细微动作差异的感知能力:首先通过带可学习3D位置编码的Transformer编码器捕获跨空间与时间的长程依赖关系;随后利用图结构进行特征细化——其中帧内全连接边用于捕捉帧内细微外观差异,时序对齐边则在保持运动一致性的同时避免过度平滑。该设计参数与计算开销极低,可无缝集成至主流3D骨干网络(如SlowFast、R(2+1)D-34等),并端到端训练,在多个基准数据集上超越了大量依赖特权信息的多模态方法,实现了RGB-only场景下的最优性能。

链接: https://arxiv.org/abs/2604.11498
作者: Imtiaz Ul Hassan,Nik Bessis,Ardhendu Behera
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 3 figures, to appear in ICPR 2026

点击查看摘要

Abstract:Fine-grained human action recognition (FHAR) is challenging because visually similar actions differ by subtle spatio-temporal cues. Many recent systems enhance discriminability with extra modalities (e.g., pose, text, optical flow), but this increases annotation burden and computational cost. We introduce TAG-Head, a lightweight spatio-temporal graph head that upgrades standard 3D backbones (SlowFast, R(2+1)D-34, I3D, etc.) for FHAR using RGB only. Our pipeline first applies a Transformer encoder with learnable 3D positional encodings to the backbone tokens, capturing long-range dependencies across space and time. The resulting features are then refined by a graph in which (i) fully-connected intra-frame edges to resolve subtle appearance differences within frames, and (ii) time-aligned temporal edges that connect features at the same spatial location across frames to stabilise motion cues without over-smoothing. The head is compact (little parameter/FLOP overhead), plug-and-play across backbones, and trained end-to-end with the backbone. Extensive evaluations on FineGym (Gym99 and Gym288) and HAA500 show that TAG-Head sets a new state-of-the-art among RGB-only models and surpasses many recent multimodal approaches (video + pose + text) that rely on privileged information. Ablations disentangle the contributions of the Transformer and the graph topology, and complexity analyses confirm low latency. TAG-Head advances FHAR by explicitly coupling global context with high-resolution spatial interactions and low-variance temporal continuity inside a slim, composable graph head. The simplicity of the design enables straightforward adoption in practical systems that favour RGB-only sensors, while delivering performance gains typically associated with heavier or multimodal models. Code will be released on GitHub.

[CV-40] NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild CVPR2026

【速读】:该论文旨在解决在真实场景下对AI生成图像(AI-generated image)进行鲁棒检测的问题,即如何使检测模型在面对图像被裁剪、缩放、压缩、模糊等常见实际变换时仍保持高精度。解决方案的关键在于构建了一个包含108,750张真实图像和185,750张来自42种不同生成器(涵盖多种开源与闭源模型架构)的大型数据集,并引入36种图像变换增强训练多样性;同时,评估指标采用ROC AUC在包含变换与未变换图像的完整测试集上进行,从而全面衡量模型对现实世界干扰的鲁棒性。

链接: https://arxiv.org/abs/2604.11487
作者: Aleksandr Gushchin,Khaled Abud,Ekaterina Shumitskaya,Artem Filippov,Georgii Bychkov,Sergey Lavrushkin,Mikhail Erofeev,Anastasia Antsiferova,Changsheng Chen,Shunquan Tan,Radu Timofte,Dmitry Vatolin,Chuanbiao Song,Zijian Yu,Hao Tan,Jun Lan,Zhiqiang Yang,Yongwei Tang,Zhiqiang Wu,Jia Wen Seow,Hong Vin Koay,Haodong Ren,Feng Xu,Shuai Chen,Ruiyang Xia,Qi Zhang,Yaowen Xu,Zhaofan Zou,Hao Sun,Dagong Lu,Mufeng Yao,Xinlei Xu,Fei Wu,Fengjun Guo,Cong Luo,Hardik Sharma,Aashish Negi,Prateek Shaily,Jayant Kumar,Sachin Chaudhary,Akshay Dudhane,Praful Hambarde,Amit Shukla,Zhilin Tu,Fengpeng Li,Jiamin Zhang,Jianwei Fei,Kemou Li,Haiwei Wu,Bilel Benjdira,Anas M. Ali,Wadii Boulila,Chenfan Qu,Junchi Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 NTIRE Workshop Paper, Robust AI-Generated Image Detection Technical Report

点击查看摘要

Abstract:This paper presents an overview of the NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild, held in conjunction with the NTIRE workshop at CVPR 2026. The goal of this challenge was to develop detection models capable of distinguishing real images from generated ones in realistic scenarios: the images are often transformed (cropped, resized, compressed, blurred) for practical usage, and therefore, the detection models should be robust to such transformations. The challenge is based on a novel dataset consisting of 108,750 real and 185,750 AI-generated images from 42 generators comprising a large variety of open-source and closed-source models of various architectures, augmented with 36 image transformations. Methods were evaluated using ROC AUC on the full test set, including both transformed and untransformed images. A total of 511 participants registered, with 20 teams submitting valid final solutions. This report provides a comprehensive overview of the challenge, describes the proposed solutions, and can be used as a valuable reference for researchers and practitioners in increasing the robustness of the detection models to real-world transformations.

[CV-41] PACO: Proxy-Task Alignment and Online Calibration for On-the-Fly Category Discovery

【速读】:该论文旨在解决在线类别发现(On-the-Fly Category Discovery, OCD)中现有方法存在的根本性问题:即在推理阶段依赖单一阈值进行类别判别,导致无法动态适应新样本的分类决策,进而造成类别形成不稳定和不一致。具体而言,OCD是一个动态过程,模型需持续判断样本是否属于已知类、是否匹配已有新类或应创建新类,而传统方法将支持集视为静态知识,未随推理过程中新证据更新决策边界。解决方案的关键在于提出PACO框架——一个基于支持集校准、树状结构的在线决策机制,其核心是将推理建模为一系列层次化决策步骤,包括已知类路由、出生感知的新类分配以及基于动态原型记忆的“附加-新建”操作,并通过模拟代理发现过程在离线训练时初始化阈值以对齐推理阶段,同时在推理期间利用成熟的新型原型持续更新阈值。该方案无需额外训练或数据集特定调参,可无缝集成至现有OCD流程作为推理模块,实验证明其在七个基准上的显著性能提升。

链接: https://arxiv.org/abs/2604.11484
作者: Weidong Tang,Bohan Zhang,Zhixiang Chi,ZiZhang Wu,Yang Wang,Yanan Wu
机构: China Agricultural University (中国农业大学); University of Toronto (多伦多大学); Fudan University (复旦大学); Concordia University (康考迪亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 6 figures, 7 tables, 1 algorithm

点击查看摘要

Abstract:On-the-Fly Category Discovery (OCD) requires a model, trained on an offline support set, to recognize known classes while discovering new ones from an online streaming sequence. Existing methods focus heavily on offline training. They aim to learn discriminative representations on the support set so that novel classes can be separated at test time. However, their discovery mechanism at inference is typically reduced to a single threshold. We argue that this paradigm is fundamentally flawed as OCD is not a static classification problem, but a dynamic process. The model must continuously decide 1) whether a sample belongs to a known class, 2) matches an existing novel category, or 3) should initiate a new one. Moreover, prior methods treat the support set as fixed knowledge. They do not update their decision boundaries as new evidence arrives during inference. This leads to unstable and inconsistent category formation. Our experiments confirm these issues. With properly calibrated and adaptive thresholds, substantial improvements can be achieved, even without changing the representation. Motivated by this, we propose PACO, a support-set-calibrated, tree-structured online decision framework. The framework models inference as a sequence of hierarchical decisions, including known-class routing, birth-aware novel assignment, and attach-versus-create operations over a dynamic prototype memory. Furthermore, we simulate the proxy discovery process to initialize the thresholds during offline training to align with inference. Thresholds are continuously updated during inference using mature novel prototypes. Importantly, PACO requires no heavy training and no dataset-specific tuning. It can be directly integrated into existing OCD pipelines as an inference-time module. Extensive experiments show significant improvements over SOTA baselines across seven benchmarks.

[CV-42] Degradation-Aware and Structure-Preserving Diffusion for Real-World Image Super-Resolution

【速读】:该论文旨在解决真实世界图像超分辨率(Real-world Image Super-Resolution, Real SR)中扩散模型面临的挑战,即现实中的退化过程复杂、异质且通常未被显式建模。解决方案的关键在于提出一种退化感知且结构保持的扩散框架:首先引入退化感知令牌注入(Degradation-aware Token Injection),通过轻量级退化统计信息编码低分辨率输入,并将其融合至语义条件特征中,实现显式的退化感知恢复;其次提出空间非对称噪声注入(Spatially Asymmetric Noise Injection),利用局部边缘强度调节扩散噪声,从而在训练过程中更好地保护结构区域。这两个模块均为轻量级附加组件,仅需对现有扩散超分框架的条件处理管道进行微小修改,实验表明其在感知质量与失真权衡上优于当前主流方法。

链接: https://arxiv.org/abs/2604.11470
作者: Yang Ji,Zonghao Chen,Zhihao Xue,Junqin Hu
机构: KUNBYTE; GOLDMYE
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Real-world image super-resolution is particularly challenging for diffusion models because real degradations are complex, heterogeneous, and rarely modeled explicitly. We propose a degradation-aware and structure-preserving diffusion framework for real-world SR. Specifically, we introduce Degradation-aware Token Injection, which encodes lightweight degradation statistics from low-resolution inputs and fuses them with semantic conditioning features, enabling explicit degradation-aware restoration. We further propose Spatially Asymmetric Noise Injection, which modulates diffusion noise with local edge strength to better preserve structural regions during training. Both modules are lightweight add-ons to the adopted diffusion SR framework, requiring only minor modifications to the conditioning pipeline. Experiments on DIV2K and RealSR show that our method delivers competitive no-reference perceptual quality and visually more realistic restoration results than recent baselines, while maintaining a favorable perception–distortion trade-off. Ablations confirm the effectiveness of each module and their complementary gains when combined. The code and model are publicly available at this https URL.

[CV-43] Beyond Model Design: Data-Centric Training and Self-Ensemble for Gaussian Color Image Denoising

【速读】:该论文旨在解决固定噪声水平(σ = 50)下的彩色图像去噪问题,属于NTIRE 2026图像去噪挑战赛的范畴。其核心解决方案并非引入新的重建主干网络,而是从数据驱动训练和测试时能力释放两个互补方向重新挖掘成熟Restormer架构的性能边界:一是通过扩展多数据集训练策略,引入更大且更多样化的公共图像语料库,并采用两阶段优化调度;二是推理阶段应用×8几何自集成(geometric self-ensemble)以进一步释放模型潜力。实验表明,主要性能提升来自扩充后的训练数据和两阶段优化策略,而自集成带来边际但稳定的增益。

链接: https://arxiv.org/abs/2604.11468
作者: Gengjia Chang,Xining Ge,Weijun Yuan,Zhan Li,Qiurong Song,Luen Zhu,Shuhong Liu
机构: Hefei University of Technology (合肥工业大学); Hangzhou Dianzi University (杭州电子科技大学); Jinan University (暨南大学); South China Agricultural University (华南农业大学); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents our solution to the NTIRE 2026 Image Denoising Challenge (Gaussian color image denoising at fixed noise level \sigma = 50 ). Rather than proposing a new restoration backbone, we revisit the performance boundary of the mature Restormer architecture from two complementary directions: stronger data-centric training and more complete Test-Time capability release. Starting from the public Restormer \sigma!=!50 baseline, we expand the standard multi-dataset training recipe with larger and more diverse public image corpora and organize optimization into two stages. At inference, we apply \times 8 geometric self-ensemble to further release model capacity. A TLC-style local inference wrapper is retained for implementation consistency; however, systematic ablation reveals its quantitative contribution to be negligible in this setting. On the challenge validation set of 100 images, our final submission achieves 30.762 dB PSNR and 0.861 SSIM, improving over the public Restormer \sigma!=!50 pretrained baseline by up to 3.366 dB PSNR. Ablation studies show that the dominant gain originates from the expanded training corpus and the two-stage optimization schedule, and self-ensemble provides marginal but consistent improvement.

[CV-44] HuiYanEarth-SAR: A Foundation Model for High-Fidelity and Low-Cost Global Remote Sensing Imagery Generation

【速读】:该论文旨在解决合成孔径雷达(Synthetic Aperture Radar, SAR)图像生成中难以同时保证全局地理语义一致性与微观散射机制真实性的问题,从而克服现有方法在大范围场景生成中的 fidelity 限制。其解决方案的关键在于提出首个基于 AlphaEarth 并融合散射机制的 SAR 图像生成基础模型 HuiYanEarth-SAR:通过注入地理空间先验来控制宏观结构,利用隐式散射特征建模确保微观纹理的真实性,实现了仅凭地理坐标即可生成高保真 SAR 图像的能力。

链接: https://arxiv.org/abs/2604.11444
作者: Yongxiang Liu,Jie Zhou,Yafei Song,Tianpeng Liu,Li Liu
机构: National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Synthetic Aperture Radar (SAR) imagery generation is essential for deepening the study of scattering mechanisms, establishing trustworthy electromagnetic scene models, and fundamentally alleviating the data scarcity bottleneck that constrains development in this field. However, existing methods find it difficult to simultaneously ensure high fidelity in both global geospatial semantics and microscopic scattering mechanisms, resulting in severe challenges for global generation. To address this, we propose \textbfHuiYanEarth-SAR, the first foundational SAR imagery generation model based on AlphaEarth and integrated scattering mechanisms. By injecting geospatial priors to control macroscopic structures and utilizing implicit scattering characteristic modeling to ensure the authenticity of microscopic textures, we achieve the capability of generating high-fidelity SAR images for global locations solely based on geographic coordinates. This study not only constructs an efficient SAR scene simulator but also establishes a bridge connecting geography, scatter mechanism, and artificial intelligence from a methodological standpoint. It advances SAR research by expanding the paradigm from perception and understanding to simulation and creation, providing key technical support for constructing a high-confidence digital twin of the Earth.

[CV-45] Observe Less Understand More: Cost-aware Cross-scale Observation for Remote Sensing Understanding

【速读】:该论文旨在解决多尺度遥感理解中因高分辨率(High-Resolution, HR)影像获取成本高、覆盖范围有限,而低分辨率(Low-Resolution, LR)影像无法提供足够局部细节的问题。现有HR采样方法通常基于孤立的LR图像块进行决策,忽略了块内细粒度重要性与块间上下文交互关系,导致特征表示碎片化和稀疏HR观测下的次优场景推理。其解决方案的关键在于将跨尺度遥感理解建模为一个统一的成本感知问题,通过联合优化细粒度HR采样与跨块表征预测,实现更高效的任务推理,从而在有限预算下显著提升性能表现。

链接: https://arxiv.org/abs/2604.11415
作者: Zhenghao Xie,Jing Xiao,Zhenqi Wang,Kexin Ma,Liang Liao,Gui-Song Xia,Mi Wang
机构: Wuhan University (武汉大学); Xi’an University of Electronic Science and Technology (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote sensing understanding inherently requires multi-resolution observation, since different targets and application tasks demand different levels of spatial detail. While low-resolution (LR) imagery enables efficient global observation, high-resolution (HR) imagery provides critical local details at much higher acquisition cost and limited coverage. This motivates a cross-scale sensing strategy that selectively acquires HR imagery from LR-based global perception to improve task performance under constrained cost. Existing methods for HR sampling methods typically make selection decisions from isolated LR patches, which ignore fine-grained intra-patch importance and cross-patch contextual interactions, leading to fragmented feature representation and suboptimal scene reasoning under sparse HR observations. To address this issue, we formulate cross-scale remote sensing understanding as a unified cost-aware problem that couples fine-grained HR sampling with cross-patch representation prediction, enabling more effective task reasoning with fewer HR observations. Furthermore, we present GL-10M, a large-scale benchmark of 10 million spatially aligned multi-resolution images, enabling systematic evaluation of budget-constrained cross-scale reasoning in remote sensing. Extensive experiments on recognition and retrieval tasks show that our method consistently achieves a superior performance-cost trade-off.

[CV-46] Online Reasoning Video Object Segmentation

【速读】:该论文旨在解决在线推理视频目标分割(Online Reasoning Video Object Segmentation, ORVOS)问题,即在自然语言查询下逐帧生成像素级掩码,且模型只能利用当前及历史帧进行因果推理,无法回溯先前预测或未来帧信息,同时需应对事件发展过程中指代表达(referent shift)带来的挑战。现有方法多基于离线设置评估,可利用全视频序列进行事后消歧,与真实场景中严格因果决策的需求不一致。解决方案的关键在于:首先构建ORVOSB基准数据集,包含帧级因果标注和指代表达变化标签,支持对在线场景的系统评估;其次提出一种基线方法,通过持续更新的分割提示(segmentation prompts)和结构化的时序token缓存机制,在有限计算资源下实现长时程推理能力,从而为未来研究奠定基础。

链接: https://arxiv.org/abs/2604.11411
作者: Jinyuan Liu,Yang Wang,Zeyu Zhao,Weixin Li,Song Wang,Ruize Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reasoning video object segmentation predicts pixel-level masks in videos from natural-language queries that may involve implicit and temporally grounded references. However, existing methods are developed and evaluated in an offline regime, where the entire video is available at inference time and future frames can be exploited for retrospective disambiguation, deviating from real-world deployments that require strictly causal, frame-by-frame decisions. We study Online Reasoning Video Object Segmentation (ORVOS), where models must incrementally interpret queries using only past and current frames without revisiting previous predictions, while handling referent shifts as events unfold. To support evaluation, we introduce ORVOSB, a benchmark with frame-level causal annotations and referent-shift labels, comprising 210 videos, 12,907 annotated frames, and 512 queries across five reasoning categories. We further propose a baseline with continually-updated segmentation prompts and a structured temporal token reservoir for long-horizon reasoning under bounded computation. Experiments show that existing methods struggle under strict causality and referent shifts, while our baseline establishes a strong foundation for future research.

[CV-47] Scene Change Detection with Vision-Language Representation Learning

【速读】:该论文旨在解决城市环境中场景变化检测(Scene Change Detection, SCD)的挑战,尤其是在光照变化、季节差异、视角不同及复杂城市布局等现实条件下的准确识别难题。现有方法主要依赖低层视觉特征,难以在复杂场景中精确识别变化对象。其解决方案的关键在于提出LangSCD框架,该框架通过引入语言模态增强语义推理能力:首先利用视觉-语言模型(Vision-Language Models, VLMs)生成场景变化的文本描述,并通过跨模态特征增强器将语义信息与视觉特征融合;其次设计几何-语义匹配模块,以强制预测掩膜满足语义一致性和空间完整性,从而显著提升检测精度。该方法在新构建的NYC-CD大规模数据集上验证了有效性,实现了当前最优性能,凸显了语言推理与视觉表征融合对鲁棒场景变化检测的价值。

链接: https://arxiv.org/abs/2604.11402
作者: Diwei Sheng,Vijayraj Gohil,Satyam Gaba,Zihan Liu,Giles Hamilton-Fletcher,John-Ross Rizzo,Yongqing Liang,Chen Feng
机构: New York University (纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Scene change detection (SCD) is crucial for urban monitoring and navigation but remains challenging in real-world environments due to lighting variations, seasonal shifts, viewpoint differences, and complex urban layouts. Existing methods rely primarily on low-level visual features, limiting their ability to accurately identify changed objects amid the visual complexity of urban scenes. In this paper, we propose LangSCD, a vision-language framework for scene change detection that overcomes this single-modal limitation by incorporating semantic reasoning through language. Our approach introduces a modular language component that leverages vision-language models (VLMs) to generate textual descriptions of scene changes, which are fused with visual features through a cross-modal feature enhancer. We further introduce a geometric-semantic matching module that refines the predicted masks by enforcing semantic consistency and spatial completeness. Existing real-world scene change detection benchmarks provide only binary change annotations, which are insufficient for downstream applications requiring fine-grained understanding of scene dynamics. To address this limitation, we introduce NYC-CD, a large-scale dataset of 8,122 real-world image pairs collected in New York City with multiclass change annotations generated through a semi-automatic pipeline. Extensive experiments across multiple street-view benchmarks demonstrate that our language and matching modules consistently improve existing change-detection architectures, achieving state-of-the-art performance and highlighting the value of integrating linguistic reasoning with visual representations for robust scene change detection.

[CV-48] GS4City: Hierarchical Semantic Gaussian Splatting via City-Model Priors

【速读】:该论文旨在解决当前基于2D基础模型的3D高斯溅射(3D Gaussian Splatting, 3DGS)方法在城市场景中语义表达模糊、边界不清且难以融入结构化建筑语义的问题。现有方法无法直接利用CityGML等城市模型中蕴含的层级语义信息,导致重建结果缺乏结构一致性与可查询性。解决方案的关键在于提出GS4City,一种融合城市模型先验的层次化语义高斯溅射方法:首先通过两阶段光线投射从LoD 3 CityGML模型中提取几何对齐的可靠掩码,并利用父子关系验证和恢复细粒度立面元素;随后将这些几何引导的掩码与基础模型预测融合以建立场景一致的实例对应关系,并在联合2D身份监督与3D空间正则化下学习每个高斯的紧凑身份编码,从而实现结构感知的语义城市场景重建。

链接: https://arxiv.org/abs/2604.11401
作者: Qilin Zhang,Jinyu Zhu,Olaf Wysocki,Benjamin Busam,Boris Jutzi
机构: Technical University of Munich (TUM); Munich Center for Machine Learning (MCML); Karlsruhe Institute of Technology (KIT); CV4DT, University of Cambridge
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent semantic 3D Gaussian Splatting (3DGS) methods primarily rely on 2D foundation models, often yielding ambiguous boundaries and limited support for structured urban semantics. While city models such as CityGML encode hierarchically organized semantics together with building geometry, these labels cannot be directly mapped to Gaussian primitives. We present GS4City, a hierarchical semantic Gaussian Splatting method that incorporates city-model priors for urban scene understanding. GS4City derives reliable image-aligned masks from Level of Detail (LoD) 3 CityGML models via two-pass raycasting, explicitly using parent-child relations to validate and recover fine-grained facade elements. It then fuses these geometry-grounded masks with foundation-model predictions to establish scene-consistent instance correspondences, and learns a compact identity encoding for each Gaussian under joint 2D identity supervision and 3D spatial regularization. Experiments on the TUM2TWIN and Gold Coast datasets show that GS4City effectively incorporates structured building semantics into Gaussian scene representations, outperforming existing 2D-driven semantic 3DGS baselines, including LangSplat and Gaga, by up to 15.8 IoU points in coarse building segmentation and 14.2 mIoU points in fine-grained semantic segmentation. By bridging structured city models and photorealistic Gaussian scene representations, GS4City enables semantically queryable and structure-aware urban reconstruction. Code is available at this https URL.

[CV-49] EagleVision: A Multi-Task Benchmark for Cross-Domain Perception in High-Speed Autonomous Racing

【速读】:该论文旨在解决高动态场景下自动驾驶感知模型在真实高速竞速环境中的泛化能力不足问题,特别是针对传统城市驾驶数据集与高速竞速场景之间存在的显著域偏移(domain shift)和大相对速度带来的挑战。解决方案的关键在于构建了一个统一的基于激光雷达(LiDAR)的多任务基准EagleVision,涵盖真实竞速数据(Indy Autonomous Challenge和A2RL Real)、仿真生成数据(12,000帧),并采用标准化评估协议,结合数据驱动的迁移学习框架量化不同域之间的跨域泛化性能。实验证明,城市预训练可提升检测性能(NDS 0.72 vs. 0.69),而真实竞速数据作为中间预训练域能实现最优迁移效果(NDS 0.726),同时表明运动分布覆盖对轨迹预测至关重要,验证了该基准在极端高速动态条件下系统性研究感知泛化的有效性。

链接: https://arxiv.org/abs/2604.11400
作者: Zakhar Yagudin,Murad Mebrahtu,Ren Jin,Jiaqi Huang,Yujia Yue,Dzmitry Tsetserukou,Jorge Dias,Majid Khonji
机构: Skolkovo Institute of Science (斯科尔科沃科学与技术学院); Intelligent Space Robotics Laboratory (智能空间机器人实验室); Center for Engineering Systems and Sciences (工程系统与科学中心); Khalifa University (哈利法大学); KUCARS-KU Center for Autonomous Robotic Systems (KUCARS-KU 自主机器人系统中心); Department of Computer Science (计算机科学系); Beijing Institute of Technology (北京理工大学); Beijing Key Laboratory of UAV Autonomous Control (北京航空航天大学无人机自主控制重点实验室)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-speed autonomous racing presents extreme perception challenges, including large relative velocities and substantial domain shifts from conventional urban-driving datasets. Existing benchmarks do not adequately capture these high-dynamic conditions. We introduce EagleVision, a unified LiDAR-based multi-task benchmark for 3D detection and trajectory prediction in high-speed racing, providing newly annotated 3D bounding boxes for the Indy Autonomous Challenge dataset (14,893 frames) and the A2RL Real competition dataset (1,163 frames), together with 12,000 simulator-generated annotated frames, all standardized under a common evaluation protocol. Using a dataset-centric transfer framework, we quantify cross-domain generalization across urban, simulator, and real racing domains. Urban pretraining improves detection over scratch training (NDS 0.72 vs. 0.69), while intermediate pretraining on real racing data achieves the best transfer to A2RL (NDS 0.726), outperforming simulator-only adaptation. For trajectory prediction, Indy-trained models surpass in-domain A2RL training on A2RL test sequences (FDE 0.947 vs. 1.250), highlighting the role of motion-distribution coverage in cross-domain forecasting. EagleVision enables systematic study of perception generalization under extreme high-speed dynamics. The dataset and benchmark are publicly available at this https URL

[CV-50] Video-based Heart Rate Estimation with Angle-guided ROI Optimization and Graph Signal Denoising ICASSP2026

【速读】:该论文旨在解决远程光电容积脉搏波描记术(remote photoplethysmography, rPPG)在面部运动干扰下性能显著下降的问题,尤其是说话和头部晃动等动态行为对心率测量精度的影响。解决方案的关键在于提出两个即插即用模块:一是角度引导的感兴趣区域(ROI)自适应优化模块,通过量化ROI与相机之间的夹角来校正受运动影响的信号并捕获全局运动信息;二是多区域联合图信号去噪模块,利用图信号处理技术联合建模区域内及区域间信号关系,有效抑制运动伪影。这两个模块可兼容基于反射模型的rPPG方法,并在三个公开数据集上验证了其有效性,联合使用使平均绝对误差(MAE)相比基线降低20.38%。

链接: https://arxiv.org/abs/2604.11395
作者: Gan Pei,Junhao Ning,Boqiu Shen,Yan Zhu,Menghan Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted by ICASSP 2026

点击查看摘要

Abstract:Remote photoplethysmography (rPPG) enables non-contact heart rate measurement from facial videos, but its performance is significantly degraded by facial motions such as speaking and head shaking. To address this issue, we propose two plug-and-play modules. The Angle-guided ROI Adaptive Optimization module quantifies ROI-Camera angles to refine motion-affected signals and capture global motion, while the Multi-region Joint Graph Signal Denoising module jointly models intra- and inter-regional ROI signals using graph signal processing to suppress motion artifacts. The modules are compatible with reflection model-based rPPG methods and validated on three public datasets. Results show that jointly use markedly reduces MAE, with an average decrease of 20.38% over the baseline, while ablation studies confirm the effectiveness of each module. The work demonstrates the potential of angle-guided optimization and graph-based denoising to enhance rPPG performance in motion scenarios.

[CV-51] Beyond Reconstruction: Reconstruction-to-Vector Diffusion for Hyperspectral Anomaly Detection

【速读】:该论文旨在解决高光谱异常检测(Hyperspectral Anomaly Detection, HAD)中因传统“重建即终点”范式导致的子像素异常消失和训练偏差问题,尤其是在空间下采样过程中残留误差模糊以及未净化异常污染模型权重所引发的确认偏倚。其解决方案的关键在于提出重构到向量扩散(Reconstruction-to-Vector Diffusion, R2VD)框架,通过将重建视为流形净化起点,构建残差引导的生成动力学机制:首先利用物理先验提取(Physical Prior Extraction, PPE)阶段抑制早期确认偏倚;继而通过引导流形净化(Guided Manifold Purification, GMP)阶段保留脆弱的亚像素拓扑结构;再借助残差评分建模(Residual Score Modeling, RSM)阶段结合物理光谱防火墙(Physical Spectral Firewall, PSF)隔离跨波段泄漏;最终在向量动力学推理(Vector Dynamics Inference, VDI)阶段以高维向量干扰模式替代标量误差,实现目标与背景的鲁棒解耦。

链接: https://arxiv.org/abs/2604.11390
作者: Jijun Xiang,Jiayi Wang,Pengxiang Wang,Cheng Chen,Nian Wang,Tao Wang
机构: Rocket Force University of Engineering (火箭军工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While Hyperspectral Anomaly Detection (HAD) excels at identifying sparse targets in complex scenes, existing models remain trapped in a scalar “reconstruction-as-endpoint” paradigm. This reliance on ambiguous scalar residuals consistently triggers sub-pixel anomaly vanishing during spatial downsampling, alongside severe confirmation bias when unpurified anomalies corrupt training weights. In this paper, we propose Reconstruction-to-Vector Diffusion (R2VD), which fundamentally redefines reconstruction as a manifold purification origin to establish a novel residual-guided generative dynamics paradigm. Our framework introduces a four-stage pipeline: (1) a Physical Prior Extraction (PPE) stage that mitigates early confirmation bias via dual-stream statistical guidance; (2) a Guided Manifold Purification (GMP) stage utilizing an OmniContext Autoencoder (OCA) to extract purified residual maps while preserving fragile sub-pixel topologies; (3) a Residual Score Modeling (RSM) stage where a Diffusion Transformer (DiT), guarded by a Physical Spectral Firewall (PSF), effectively isolates cross-spectral leakage; and (4) a Vector Dynamics Inference (VDI) stage that robustly decouples targets from backgrounds by evaluating high-dimensional vector interference patterns instead of conventional scalar errors. Comprehensive evaluations on eight datasets confirm that R2VD establishes a new state-of-the-art, delivering exceptional target detectability and background suppression.

[CV-52] ConvFormer3D-TAP: Phase/Uncertainty-Aware Front-End Fusion for Cine CMR View Classification Pipelines

【速读】:该论文旨在解决标准心脏电影磁共振成像(cine cardiac MRI, cMRI)视图识别不可靠的问题,因为错误的视图识别会将误差传递至心室分割、容积评估、应变分析及瓣膜评价等下游任务。针对临床中扫描仪厂商、采集协议、运动伪影和切面定位差异带来的挑战,作者提出ConvFormer3D-TAP模型,其关键创新在于融合3D卷积标记化与多尺度自注意力机制,通过掩码时空重建和不确定性加权多片段融合策略提升跨心动周期和模糊时相段的鲁棒性;该设计同时捕获局部解剖结构(卷积先验)与长程心动周期动态(分层注意力),在包含150,974个序列的大规模数据集上实现了96%验证准确率,且对相邻视图(如长轴与LVOT/AV视图)的混淆集中于解剖重叠区域,验证了其作为端到端cMRI工作流前端视图路由、过滤与质量控制模块的可行性。

链接: https://arxiv.org/abs/2604.11389
作者: Nafiseh Ghaffar Nia,Vinesh Appadurai,Suchithra V.,Chinmay Rane,Daniel Pittman,James Carr,Adrienne Kline
机构: Northwestern Medicine (西北大学医学院); Northwestern University (西北大学); Xtasis Inc. (Xtasis公司); The Prince Charles Hospital (王子查尔斯医院); The University of Queensland (昆士兰大学); Quantiphi Inc. (Quantiphi公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reliable recognition of standard cine cardiac MRI views is essential because each view determines which cardiac anatomy is visualized and which quantitative analyses can be performed. Incorrect view identification, whether by a human reader or an automated deep learning system, can propagate errors into segmentation, volumetric assessment, strain analysis, and valve evaluation. However, accurate view classification remains challenging under routine clinical variability in scanner vendor, acquisition protocol, motion artifacts, and plane prescription. We present ConvFormer3D-TAP, a cine-specific spatiotemporal architecture that integrates 3D convolutional tokenization with multiscale self-attention. The model is trained using masked spatiotemporal reconstruction and uncertainty-weighted multi-clip fusion to enhance robustness across cardiac phases and ambiguous temporal segments. The design captures complementary cues: local anatomical structure through convolutional priors and long-range cardiac-cycle dynamics through hierarchical attention. On a cohort of 150,974 clinically acquired cine sequences spanning six standard cine cardiac MRI views, ConvFormer3D-TAP achieved 96% validation accuracy with per-class F1-scores = 0.94 and strong calibration (ECE = 0.025; Brier = 0.040). Error analysis shows that residual confusions are concentrated in anatomically adjacent long-axis and LVOT/AV view pairs, consistent with intrinsic prescription overlap. These results support ConvFormer3D-TAP as a scalable front-end for view routing, filtering and quality control in end-to-end cMRI workflows.

[CV-53] ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

【速读】:该论文旨在解决机器人领域中高质量训练数据获取困难的问题,尤其是模拟环境与真实世界之间的域差距(domain gap)导致策略模型在真实场景中表现不佳的挑战。解决方案的关键在于提出一种新颖的混合仿真方法——组合式仿真(Compositional Simulation),该方法融合经典仿真与神经仿真,通过一个闭环的真实-仿真-真实数据增强流水线,利用少量真实数据生成覆盖更广真实场景的大规模训练数据集,并训练神经仿真器将经典仿真视频转换为具有真实世界一致性的表示,从而显著缩小模拟到现实的域差距,提升策略模型在真实环境中的成功率。

链接: https://arxiv.org/abs/2604.11386
作者: Yiran Qin,Jiahua Ma,Li Kang,Wenzhan Li,Yihang Jiao,Xin Wen,Xiufeng Song,Heng Zhou,Jiwen Yu,Zhenfei Yin,Xihui Liu,Philip Torr,Yilun Du,Ruimao Zhang
机构: 1. Tsinghua University (清华大学); 2. Shanghai AI Lab (上海人工智能实验室); 3. Zhejiang University (浙江大学); 4. The Chinese University of Hong Kong (香港中文大学); 5. Peking University (北京大学); 6. University of Oxford (牛津大学); 7. Microsoft Research (微软研究院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 8 figures, 4 tables; supplementary material included; Project page: this https URL

点击查看摘要

Abstract:Recent advancements in foundational models, such as large language models and world models, have greatly enhanced the capabilities of robotics, enabling robots to autonomously perform complex tasks. However, acquiring large-scale, high-quality training data for robotics remains a challenge, as it often requires substantial manual effort and is limited in its coverage of diverse real-world environments. To address this, we propose a novel hybrid approach called Compositional Simulation, which combines classical simulation and neural simulation to generate accurate action-video pairs while maintaining real-world consistency. Our approach utilizes a closed-loop real-sim-real data augmentation pipeline, leveraging a small amount of real-world data to generate diverse, large-scale training datasets that cover a broader spectrum of real-world scenarios. We train a neural simulator to transform classical simulation videos into real-world representations, improving the accuracy of policy models trained in real-world environments. Through extensive experiments, we demonstrate that our method significantly reduces the sim2real domain gap, resulting in higher success rates in real-world policy model training. Our approach offers a scalable solution for generating robust training data and bridging the gap between simulated and real-world robotics.

[CV-54] From Redaction to Restoration: Deep Learning for Medical Image Anonymization and Reconstruction

【速读】:该论文旨在解决医学图像在去标识化过程中因移除与患者相关但非识别性信息而导致下游图像分析任务性能下降的问题。其核心挑战在于如何在保护患者隐私(如去除受保护健康信息,PHI)的同时,维持图像对深度学习模型的可用性和分析价值。解决方案的关键在于提出一个端到端的深度学习框架:首先利用基于CRNN的轻量级模型检测并擦除可能包含PHI的区域(如嵌入文本和元数据),随后采用基于潜在扩散机制的生成式AI(Stable Diffusion 2)对擦除区域进行语义合理的内容修复,从而实现既保障隐私又保留图像结构特征和任务相关性的高质量重建。

链接: https://arxiv.org/abs/2604.11376
作者: Adrienne Kline,Abhijit Gaonkar,Daniel Pittman,Chris Kuehn,Nils Forkert
机构: Center for Artificial Intelligence, BCVI, Northwestern Medicine, Chicago, IL, USA; Department of Electrical and Computer Engineering, Northwestern University, Chicago, IL, USA; Department of Surgery, Northwestern University, Chicago, IL, USA; Xtasis Inc., Chicago, IL, USA; Medtronic, USA; Departments of Radiology and Clinical Neurosciences, University of Calgary, Calgary, AB, Canada
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Removing patient-specific information from medical images is crucial to enable sharing and open science without compromising patient identities. However, many methods currently used for deidentification have negative effects on downstream image analysis tasks because of removal of relevant but non-identifiable information. This work presents an end-to-end deep learning framework for transforming raw clinical image volumes into de-identified, analysis-ready datasets without compromising downstream utility. The methodology developed and tested in this work first detects and redacts regions likely to contain protected health information (PHI), such as burned-in text and metadata, and then uses a generative deep learning model to inpaint the redacted areas with anatomically and imaging plausible content. The proposed pipeline leverages a lightweight hybrid architecture, combining CRNN-based redaction with a latent-diffusion inpainting restoration module (Stable Diffusion 2). We evaluate the approach using both privacy-oriented metrics, which quantify residual PHI and success of redaction, and image-quality and task-based metrics, which assess the fidelity of restored volumes for representative deep learning applications. Our results suggest that the proposed method yields de-identified medical images that are visually coherent, maintaining fidelity for downstream models, while substantially reducing the risk of patient re-identification. By automating anonymization and image reconstruction within a single workflow, and dissemination of large-scale medical imaging collections, thereby lowering a key barrier to data sharing and multi-institutional collaboration in medical imaging AI.

[CV-55] LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization CVPR2026

【速读】:该论文旨在解决LiDAR重定位(LiDAR relocalization)在复杂三维环境中因噪声和异常值导致的精度下降问题,尤其是现有基于学习的回归方法对所有预测点同等处理、缺乏鲁棒性的问题。解决方案的关键在于提出一种名为LEADER的框架,其核心创新包括:1)设计了一种基于投影的几何编码器(Robust Projection-based Geometric Encoder),用于提取多尺度几何特征以增强表征能力;2)引入截断相对可靠性损失(Truncated Relative Reliability loss),显式建模逐点不确定性并抑制不可靠预测的影响。该方法在Oxford RobotCar和NCLT数据集上显著优于现有技术,分别实现位置误差降低24.1%和73.9%。

链接: https://arxiv.org/abs/2604.11355
作者: Jianshi Wu,Minghang Zhu,Dunqiang Liu,Wen Li,Sheng Ao,Siqi Shen,Chenglu Wen,Cheng Wang
机构: Xiamen University (厦门大学); University of Bristol (布里斯托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 (Highlight)

点击查看摘要

Abstract:LiDAR relocalization has attracted increasing attention as it can deliver accurate 6-DoF pose estimation in complex 3D environments. Recent learning-based regression methods offer efficient solutions by directly predicting global poses without the need for explicit map storage. However, these methods often struggle in challenging scenes due to their equal treatment of all predicted points, which is vulnerable to noise and outliers. In this paper, we propose LEADER, a robust LiDAR-based relocalization framework enhanced by a simple, yet effective geometric encoder. Specifically, a Robust Projection-based Geometric Encoder architecture which captures multi-scale geometric features is first presented to enhance descriptiveness in geometric representation. A Truncated Relative Reliability loss is then formulated to model point-wise ambiguity and mitigate the influence of unreliable predictions. Extensive experiments on the Oxford RobotCar and NCLT datasets demonstrate that LEADER outperforms state-of-the-art methods, achieving 24.1% and 73.9% relative reductions in position error over existing techniques, respectively. The source code is released on this https URL.

[CV-56] LoGo-MR: Screening Breast MRI for Cancer Risk Prediction by Efficient Omni-Slice Modeling

【速读】:该论文旨在解决乳腺癌(Breast Cancer, BC)风险预测中模型效率与可解释性不足的问题,尤其是在基于乳腺磁共振成像(Breast MRI)的长期和短期风险分层方面研究尚不充分的挑战。其核心解决方案是提出一种2.5D局部-全局结构建模框架LoGo-MR,通过邻近切片编码捕捉与短期风险相关的细微局部特征,并结合Transformer增强的多实例学习(Multiple-Instance Learning, MIL)建模与长期风险相关的分布全局模式,同时提供可解释的切片重要性信息;进一步扩展为三平面形式LoGo3-MR以捕获轴向、矢状面和冠状面的互补体积信息,实现体素级风险显著性映射,从而辅助放射科医生定位风险相关区域。该方法在大规模乳腺MRI筛查队列(约7500例)上优于现有2D/3D基线和SOTA MIL方法,在1–5年风险预测中AUC达0.77–0.69,C-index较3D CNN提升约6%,验证了其临床应用潜力。

链接: https://arxiv.org/abs/2604.11348
作者: Xin Wang,Yuan Gao,George Yiasemis,Antonio Portaluri,Zahra Aghdam,Muzhen He,Luyi Han,Yaofei Duan,Chunyao Lu,Xinglong Liang,Tianyu Zhang,Vivien van Veldhuizen,Yue Sun,Tao Tan,Ritse Mann,Jonas Teuwen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Efficient and explainable breast cancer (BC) risk prediction is critical for large-scale population-based screening. Breast MRI provides functional information for personalized risk assessment. Yet effective modeling remains challenging as fully 3D CNNs capture volumetric context at high computational cost, whereas lightweight 2D CNNs fail to model inter-slice continuity. Importantly, breast MRI modeling for shor- and long-term BC risk stratification remains underexplored. In this study, we propose LoGo-MR, a 2.5D local-global structural modeling framework for five-year BC risk prediction. Aligned with clinical interpretation, our framework first employs neighbor-slice encoding to capture subtle local cues linked to short-term risk. It then integrates transformer-enhanced multiple-instance learning (MIL) to model distributed global patterns related to long-term risk and provide interpretable slice importance. We further apply this framework across axial, sagittal, and coronal planes as LoGo3-MR to capture complementary volumetric information. This multi-plane formulation enables voxel-level risk saliency mapping, which may assist radiologists in localizing risk-relevant regions during breast MRI interpretation. Evaluated on a large breast MRI screening cohort (~7.5K), our method outperforms 2D/3D baselines and existing SOTA MIL methods, achieving AUCs of 0.77-0.69 for 1- to 5-year prediction and improving C-index by ~6% over 3D CNNs. LoGo3-MR further improves overall performance with interpretable localization across three planes, and validation across seven backbones shows consistent gains. These results highlight the clinical potential of efficient MRI-based BC risk stratification for large-scale screening. Code will be released publicly.

[CV-57] A Compact and Efficient 1.251 Million Parameter Machine Learning CNN Model PD36-C for Plant Disease Detection: A Case Study

【速读】:该论文旨在解决植物病害诊断中深度学习模型在边缘设备上部署时面临的精度与效率难以兼顾的问题。解决方案的关键在于设计了一个轻量级卷积神经网络(Compact Convolutional Neural Network, PD36 C),其参数量仅为1,250,694,模型大小仅4.77 MB,同时在包含87,000张图像、38类植物病害的新植物病害数据集(New Plant Diseases Dataset)上训练,实现了高达0.9953的平均测试准确率,且多数类别达到完美精度(1.00)和召回率(1.00)。此外,配套开发了基于Qt for Python的桌面应用,支持离线推理和用户友好交互,从而实现高效、鲁棒且适用于田间场景的AI辅助植物病害检测系统。

链接: https://arxiv.org/abs/2604.11332
作者: Shkelqim Sherifi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 24 figures

点击查看摘要

Abstract:Deep learning has markedly advanced image based plant disease diagnosis as improved hardware and dataset quality have enabled increasingly accurate neural network models. This paper presents PD36 C, a compact convolutional neural network (1,250,694 parameters and 4.77 MB) for plant disease classification. Trained with TensorFlow Keras on the New Plant Diseases Dataset (87k images, 38 classes), PD36 C is designed for robustness and edge deployability, complemented by a Qt for Python desktop application that offers an intuitive GUI and offline inference on commodity hardware. Across experiments, training accuracy reached 0.99697 by epoch 30, and average test accuracy was 0.9953 across 38 classes. Per class performance is uniformly high; on the lower end, Corn (maize) Cercospora leaf spot achieved precision around 0.9777 and recall around 0.9634, indicating occasional confusion with visually similar categories, while on the upper end numerous classes including Apple Black rot, Cedar apple rust, Blueberry healthy, Cherry Powdery mildew, Cherry healthy, and all four grape categories achieved perfect precision 1.00 and recall of 1.00, indicating no false positives and strong coverage. These results show that with a well curated dataset and careful architectural design, small CNNs can achieve competitive accuracy compared with recent baselines while remaining practical for edge scenarios. We also note typical constraints such as adverse weather, low quality imagery, and leaves exhibiting multiple concurrent diseases that can degrade performance and warrant future work on domain robustness. Overall, PD36 C and its application pipeline contribute a field ready, efficient solution for AI assisted plant disease detection in smart agriculture.

[CV-58] Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

【速读】:该论文旨在解决当前3D场景生成中依赖2D多视角或视频扩散模型所带来的局限性,主要包括:(i)通过2D视图表示3D场景导致显著的冗余信息;(ii)基于2D的潜在空间本质上限制了生成3D场景的空间一致性。其解决方案的关键在于首次在隐式3D潜在空间中直接进行3D场景生成,具体包括两个核心组件:一是构建3D Representation Autoencoder(3DRAE),利用冻结的2D表示编码器将视图耦合的语义信息映射到视图解耦的3D潜在表示,从而以固定复杂度和丰富语义支持任意数量、分辨率和长宽比的视角观测;二是引入3D Diffusion Transformer(3DDiT),在该3D潜在空间中执行扩散建模,实现高效且空间一致的3D场景生成,并支持多样化的条件配置。此方法无需针对每条相机轨迹重新采样扩散过程,即可从统一的3D表示解码出图像和点云地图。

链接: https://arxiv.org/abs/2604.11331
作者: Dongxu Wei,Qi Xu,Zhiqi Li,Hangning Zhou,Cong Qiu,Hailong Qin,Mu Yang,Zhaopeng Cui,Peidong Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)
备注: Under Review. Project Page: this https URL

点击查看摘要

Abstract:3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of scene-level 3D latent representation, but also to the fact that most scene-level 3D visual data exists in the form of multi-view images or videos, which are naturally compatible with 2D diffusion architectures. Typically, these 2D-based approaches degrade 3D spatial extrapolation to 2D temporal extension, which introduces two fundamental issues: (i) representing 3D scenes via 2D views leads to significant representation redundancy, and (ii) latent space rooted in 2D inherently limits the spatial consistency of the generated 3D scenes. In this paper, we propose, for the first time, to perform 3D scene generation directly within an implicit 3D latent space to address these limitations. First, we repurpose frozen 2D representation encoders to construct our 3D Representation Autoencoder (3DRAE), which grounds view-coupled 2D semantic representations into a view-decoupled 3D latent representation. This enables representing 3D scenes observed from arbitrary numbers of views–at any resolution and aspect ratio–with fixed complexity and rich semantics. Then we introduce 3D Diffusion Transformer (3DDiT), which performs diffusion modeling in this 3D latent space, achieving remarkably efficient and spatially consistent 3D scene generation while supporting diverse conditioning configurations. Moreover, since our approach directly generates a 3D scene representation, it can be decoded to images and optional point maps along arbitrary camera trajectories without requiring per-trajectory diffusion sampling pass, which is common in 2D-based approaches.

[CV-59] Empowering Video Translation using Multimodal Large Language Models

【速读】:该论文旨在解决当前视频翻译任务中缺乏针对多模态大语言模型(Multimodal Large Language Models, MLLMs)赋能视频翻译的系统性综述问题。尽管MLLMs在视频理解、推理与生成方面展现出强大能力,并逐步取代传统级联式翻译流程(如自动语音识别、机器翻译、文本转语音及唇形同步),但其在视频翻译中的具体作用机制尚未被充分梳理。解决方案的关键在于提出一个三角色分类法:1)语义推理者(Semantic Reasoner),用于刻画MLLMs如何实现视频理解、时序推理和多模态融合;2)表达执行者(Expressive Performer),分析基于大语言模型(LLM)驱动与增强的可控语音生成技术;3)视觉合成器(Visual Synthesizer),探讨不同类型的视频生成方法以实现高保真唇形同步与视觉对齐。这一结构化框架为深入理解MLLMs在视频翻译中的核心贡献提供了系统性视角,并指出了未来研究方向。

链接: https://arxiv.org/abs/2604.11283
作者: Bingzheng QU,Kehai Chen,Xuefeng Bai,Min Zhang
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent developments in video translation have further enhanced cross-lingual access to video content, with multimodal large language models (MLLMs) playing an increasingly important supporting role. With strong multimodal understanding, reasoning, and generation capabilities, MLLMs-based video translation systems are overcoming the limitations of traditional cascaded pipelines that separately handle automatic speech recognition, machine translation, text-to-speech and lip synchronization. These MLLM-powered approaches not only achieve competitive or superior translation quality, but also demonstrate stronger robustness in zero-shot settings and multi-speaker scenarios, while jointly modeling semantic fidelity, timing, speaker identity, and emotional consistency. However, despite the rapid progress of MLLMs and extensive surveys on general video-language understanding, a focused and systematic review of how MLLMs empower video translation tasks is still lacking. To fill this gap, we provide the first comprehensive overview of MLLMs-based video translation, organized around a three-role taxonomy: 1) Semantic Reasoner, which characterizes how MLLMs perform video understanding, temporal reasoning, and multimodal fusion; 2) Expressive Performer, which analyzes LLM-driven and LLM-augmented techniques for expressive, controllable speech generation; and 3) Visual Synthesizer, which examines different types of video generators for high-fidelity lip-sync and visual alignment. Finally, we discuss open challenges in video understanding, temporal modeling, and multimodal alignment, and outline promising future research directions for MLLMs-powered video translation.

[CV-60] A Deep Equilibrium Network for Hyperspectral Unmixing

【速读】:该论文旨在解决高光谱解混(Hyperspectral Unmixing, HU)中传统方法难以有效建模复杂光谱-空间特征、深度学习方法物理可解释性不足,以及基于展开(unrolling-based)方法在反向传播过程中存在内存开销大和数值精度低的问题。解决方案的关键在于提出DEQ-Unmix,其将丰度估计重构为深度平衡模型(Deep Equilibrium Model, DEQ),利用隐式微分(implicit differentiation)实现恒定内存的高效训练与反向传播,并通过引入可训练卷积网络替代数据重建项的梯度算子,从而更好地捕获光谱-空间信息,显著提升了解混性能且保持了稳定的内存消耗。

链接: https://arxiv.org/abs/2604.11279
作者: Chentong Wang,Jincheng Gao,Fei Zhu,Jie Chen
机构: Tianjin University (天津大学); Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hyperspectral unmixing (HU) is crucial for analyzing hyperspectral imagery, yet achieving accurate unmixing remains challenging. While traditional methods struggle to effectively model complex spectral-spatial features, deep learning approaches often lack physical interpretability. Unrolling-based methods, despite offering network interpretability, inadequately exploit spectral-spatial information and incur high memory costs and numerical precision issues during backpropagation. To address these limitations, we propose DEQ-Unmix, which reformulates abundance estimation as a deep equilibrium model, enabling efficient constant-memory training via implicit differentiation. It replaces the gradient operator of the data reconstruction term with a trainable convolutional network to capture spectral-spatial information. By leveraging implicit differentiation, DEQ-Unmix enables efficient and constant-memory backpropagation. Experiments on synthetic and two real-world datasets demonstrate that DEQ-Unmix achieves superior unmixing performance while maintaining constant memory cost.

[CV-61] Variational Latent Entropy Estimation Disentanglement: Controlled Attribute Leakage for Face Recognition

【速读】:该论文旨在解决人脸特征嵌入(face recognition embeddings)中除身份信息外,还隐含性别、种族等敏感属性的问题,这些问题可能在下游应用中引发隐私泄露和公平性偏差。为实现敏感属性与身份相关特征的有效解耦(disentanglement),论文提出一种后处理方法——变分潜熵估计解耦(Variational Latent Entropy Estimation Disentanglement, VLEED),其关键在于利用变分自编码器(variational autoencoder)对预训练嵌入进行变换,并通过估计潜在空间中类别属性的熵来构建基于互信息的目标函数,从而在训练过程中实现对敏感属性信息的稳定且细粒度控制的移除,最终在验证性能、属性可预测性和群体差异性指标上均优于现有方法。

链接: https://arxiv.org/abs/2604.11250
作者: Ünsal Öztürk(1),Vedrana Krivokuća Hahn(1),Sushil Bhattacharjee(1),Sébastien Marcel(1 and 2) ((1) Idiap Research Institute, Martigny, Switzerland, (2) UNIL, Lausanne, Switzerland)
机构: Idiap Research Institute (Idiap 研究所); EPFL (瑞士联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE Transactions on Information Forensics and Security (TIFS). 13 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Face recognition embeddings encode identity, but they also encode other factors such as gender and ethnicity. Depending on how these factors are used by a downstream system, separating them from the information needed for verification is important for both privacy and fairness. We propose Variational Latent Entropy Estimation Disentanglement (VLEED), a post-hoc method that transforms pretrained embeddings with a variational autoencoder and encourages a distilled representation where the categorical variable of interest is separated from identity-relevant information. VLEED uses a mutual information-based objective realised through the estimation of the entropy of the categorical attribute in the latent space, and provides stable training with fine-grained control over information removal. We evaluate our method on IJB-C, RFW, and VGGFace2 for gender and ethnicity disentanglement, and compare it to various state-of-the-art methods. We report verification utility, predictability of the disentangled variable under linear and nonlinear classifiers, and group disparity metrics based on false match rates. Our results show that VLEED offers a wide range of privacy-utility tradeoffs over existing methods and can also reduce recognition bias across demographic groups.

[CV-62] Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在视频字幕生成中采用的“单体叙事段落”范式所导致的结构性瓶颈问题,即视频内容被编码为密集耦合的文本,使得视觉、听觉与身份信息混杂,从而降低表示精度并限制可扩展性——局部修改常引发全局重写。解决方案的关键在于提出一种名为Multi-Stream Scene Script (MTSS) 的新范式,其核心由两个原则构成:一是流因子分解(Stream Factorization),将视频解耦为互补的四个流(参考流、镜头流、事件流和全局流),实现结构化表达;二是关系锚定(Relational Grounding),通过显式的身份与时间关联重新连接各流,确保视频整体一致性。这一设计显著提升了视频理解性能,并增强了小模型对提示的可学习性,同时在无需架构调整的情况下大幅改善多镜头视频生成的质量。

链接: https://arxiv.org/abs/2604.11244
作者: Tencent Hunyuan Team
机构: Tencent Hunyuan Team (腾讯混元团队)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Advances in Multimodal Large Language Models (MLLMs) are transforming video captioning from a descriptive endpoint into a semantic interface for both video understanding and generation. However, the dominant paradigm still casts videos as monolithic narrative paragraphs that entangle visual, auditory, and identity information. This dense coupling not only compromises representational fidelity but also limits scalability, since even local edits can trigger global rewrites. To address this structural bottleneck, we propose Multi-Stream Scene Script (MTSS), a novel paradigm that replaces monolithic text with factorized and explicitly grounded scene descriptions. MTSS is built on two core principles: Stream Factorization, which decouples a video into complementary streams (Reference, Shot, Event, and Global), and Relational Grounding, which reconnects these isolated streams through explicit identity and temporal links to maintain holistic video consistency. Extensive experiments demonstrate that MTSS consistently enhances video understanding across various models, achieving an average reduction of 25% in the total error rate on Video-SALMONN-2 and an average performance gain of 67% on the Daily-Omni reasoning benchmark. It also narrows the performance gap between smaller and larger MLLMs, indicating a substantially more learnable caption interface. Finally, even without architectural adaptation, replacing monolithic prompts with MTSS in multi-shot video generation yields substantial human-rated improvements: a 45% boost in cross-shot identity consistency, a 56% boost in audio-visual alignment, and a 71% boost in temporal controllability.

[CV-63] Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中因视觉token冗余导致的计算开销过大的问题,现有方法通常依赖单一组件的注意力机制进行token剪枝,易受注意力分布偏差影响,从而造成剪枝决策不充分、性能下降。其解决方案的关键在于提出一种解耦的相似度感知剪枝方法(Decoupled Similarity-Aware Pruning, DeSAP),通过引入解耦相似度来捕捉视觉特征与文本token之间的细粒度跨模态相关性,提供明确的任务导向剪枝指引;同时融合视觉显著性信号(源自视觉注意力)形成双线索驱动的剪枝策略,从而在高剪枝比例下仍能保持鲁棒性和高性能。实验表明,DeSAP在多个基准和架构上均优于当前最优方法,在LLaVA-1.5-7B模型上仅保留11.1%的视觉token即可实现10倍FLOPs减少和2.3倍预填充速度提升,同时维持98.1%的原始性能。

链接: https://arxiv.org/abs/2604.11240
作者: Kexin Ma,Jing Xiao,Chaofeng Chen,Geyong Min,Guibo Zhu,Jinqiao Wang,Liang Liao
机构: Wuhan University(武汉大学); University of Exeter(埃克塞特大学); Chinese Academy of Sciences(中国科学院); Xi’an University of Electronic Science and Technology(西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Token pruning has emerged as an effective approach to reduce the substantial computational overhead of Large Vision-Language Models (LVLMs) by discarding less informative visual tokens while preserving performance. However, existing methods typically rely on individual attention sources from different LVLM components, resulting in incomplete and suboptimal pruning decisions due to biased attention distributions. To address this problem, we propose DeSAP, a novel Decoupled Similarity-Aware Pruning method for precise, task-aware token pruning within the visual encoder. Specifically, DeSAP introduces a decoupled similarity to capture fine-grained cross-modal relevance between visual features and text tokens, providing explicit task-related guidance for pruning. By integrating decoupled similarity with visual saliency signals derived from visual attention, DeSAP performs token pruning under the guidance of both task-related and visual cues, enabling robust pruning even under aggressive pruning ratios. Extensive experiments across diverse benchmarks and architectures show that DeSAP consistently outperforms SOTA methods in both accuracy and efficiency. On LLaVA-1.5-7B, DeSAP achieves a 10 times FLOPs reduction and a 2.3 times prefill speedup by retaining only 11.1% of visual tokens, while maintaining 98.1% of the original performance.

[CV-64] Bridging the RGB-IR Gap: Consensus and Discrepancy Modeling for Text-Guided Multispectral Detection

【速读】:该论文旨在解决多光谱目标检测中两个关键问题:一是现有方法仅将文本作为辅助语义增强信号,未能有效利用其引导作用来弥合RGB与红外(IR)图像之间固有的粒度差异;二是传统数据驱动的注意力融合机制倾向于强调跨模态一致性,忽视了可能具有判别性的跨模态差异。解决方案的关键在于提出一种基于双支持建模的语义桥梁融合框架(semantic bridge fusion framework with bi-support modeling),其中文本被用作共享语义桥梁,在统一类别条件下对齐RGB与IR响应,同时将重新校准的热语义先验投影至RGB分支以实现语义级映射融合;进一步地,将RGB-IR交互证据分解为常规共识支持与互补差异支持,并通过动态重校准引入结构化归纳偏置,从而增强判别性信息的利用。

链接: https://arxiv.org/abs/2604.11234
作者: Jiaqi Wu,Zhen Wang,Enhao Huang,Kangqing Shen,Yulin Wang,Yang Yue,Yifan Pu,Gao Huang
机构: Tsinghua University (清华大学); China University of Mining Technology - Beijing (中国矿业大学-北京); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages ,Under review

点击查看摘要

Abstract:Text-guided multispectral object detection uses text semantics to guide semantic-aware cross-modal interaction between RGB and IR for more robust perception. However, notable limitations remain: (1) existing methods often use text only as an auxiliary semantic enhancement signal, without exploiting its guiding role to bridge the inherent granularity asymmetry between RGB and IR; and (2) conventional data-driven attention-based fusion tends to emphasize stable consensus while overlooking potentially valuable cross-modal discrepancies. To address these issues, we propose a semantic bridge fusion framework with bi-support modeling for multispectral object detection. Specifically, text is used as a shared semantic bridge to align RGB and IR responses under a unified category condition, while the recalibrated thermal semantic prior is projected onto the RGB branch for semantic-level mapping fusion. We further formulate RGB-IR interaction evidence into the regular consensus support and the complementary discrepancy support that contains potentially discriminative cues, and introduce them into fusion via dynamic recalibration as a structured inductive bias. In addition, we design a bidirectional semantic alignment module for closed-loop vision-text guidance enhancement. Extensive experiments demonstrate the effectiveness of the proposed fusion framework and its superior detection performance on multispectral benchmarks. Code is available at this https URL.

[CV-65] Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection

【速读】:该论文旨在解决传统变化检测方法受限于训练数据中预定义类别、难以在真实场景中扩展的问题,特别是针对开放词汇变化检测(Open-Vocabulary Change Detection, OVCD)这一新兴任务——即利用视觉与语言的联合建模能力,在任意类别上实现变化检测。其解决方案的关键在于:首先构建了一个类别无关的变化检测数据集CA-CDD,并设计了一个类别无关的变化头(category-agnostic change head),用于检测任意类别的变化并将其映射到具体类别;在此基础上提出Seg2Change适配器,可直接将先进的开放词汇语义分割模型迁移至变化检测任务中,无需复杂调整即可实现SOTA性能(WHU-CD上IoU提升+9.52,SECOND数据集上mIoU提升+5.50)。

链接: https://arxiv.org/abs/2604.11231
作者: You Su,Yonghong Song,Jingqi Chen,Zehan Wen
机构: Xi’an Jiaotong University(西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 15 figures

点击查看摘要

Abstract:Change detection is a fundamental task in remote sensing, aiming to quantify the impacts of human activities and ecological dynamics on land-cover changes. Existing change detection methods are limited to predefined classes in training datasets, which constrains their scalability in real-world scenarios. In recent years, numerous advanced open-vocabulary semantic segmentation models have emerged for remote sensing imagery. However, there is still a lack of an effective framework for directly applying these models to open-vocabulary change detection (OVCD), a novel task that integrates vision and language to detect changes across arbitrary categories. To address these challenges, we first construct a category-agnostic change detection dataset, termed CA-CDD. Further, we design a category-agnostic change head to detect the transitions of arbitrary categories and index them to specific classes. Based on them, we propose Seg2Change, an adapter designed to adapt open-vocabulary semantic segmentation models to change detection task. Without bells and whistles, this simple yet effective framework achieves state-of-the-art OVCD performance (+9.52 IoU on WHU-CD and +5.50 mIoU on SECOND). Our code is released at this https URL.

[CV-66] NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: AI Flash Portrait (Track 3) CVPR2026

【速读】:该论文旨在解决真实世界低光照条件下人像图像恢复(low-light portrait restoration)中存在的一系列挑战,特别是如何在噪声抑制、细节保留以及光照与色彩还原的忠实性之间实现最优平衡。其解决方案的关键在于构建了一个全新的基准测试体系,包括一个包含800组真实采集的低光照人像数据集(每组含1K分辨率输入图像、1K真值图像及人物掩码),并采用融合客观量化指标与严格主观评估协议的混合评价机制,从而推动生成式AI(Generative AI)模型在复杂真实场景下的性能优化与标准化评测。

链接: https://arxiv.org/abs/2604.11230
作者: Ya-nan Guan,Shaonan Zhang,Hang Guo,Yawen Wang,Xinying Fan,Tianqu Zhuang,Jie Liang,Hui Zeng,Guanyi Qin,Lishen Qu,Tao Dai,Shu-Tao Xia,Lei Zhang,Radu Timofte,Bin Chen,Yuanbo Zhou,Hongwei Wang,Qinquan Gao,Tong Tong,Yanxin Qian,Lizhao You,Jingru Cong,Lei Xiong,Shuyuan Zhu,Zhi-Qiang Zhong,Kan Lv,Yang Yang,Kailing Tang,Minjian Zhang,Zhipei Lei,Zhe Xu,Liwen Zhang,Dingyong Gou,Yanlin Wu,Cong Li,Xiaohui Cui,Jiajia Liu,Guoyi Xu,Yaoxin Jiang,Yaokun Shi,Jiachen Tu,Liqing Wang,Shihang Li,Bo Zhang,Biao Wang,Haiming Xu,Xiang Long,Xurui Liao,Yanqiao Zhai,Haozhe Li,Shijun Shi,Jiangning Zhang,Yong Liu,Kai Hu,Jing Xu,Xianfang Zeng,Yuyang Liu,Minchen Wei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 Workshop. Includes supplementary material as ancillary file

点击查看摘要

Abstract:In this paper, we present a comprehensive overview of the NTIRE 2026 3rd Restore Any Image Model (RAIM) challenge, with a specific focus on Track 3: AI Flash Portrait. Despite significant advancements in deep learning for image restoration, existing models still encounter substantial challenges in real-world low-light portrait scenarios. Specifically, they struggle to achieve an optimal balance among noise suppression, detail preservation, and faithful illumination and color reproduction. To bridge this gap, this challenge aims to establish a novel benchmark for real-world low-light portrait restoration. We comprehensively evaluate the proposed algorithms utilizing a hybrid evaluation system that integrates objective quantitative metrics with rigorous subjective assessment protocols. For this competition, we provide a dataset containing 800 groups of real-captured low-light portrait data. Each group consists of a 1K-resolution low-light input image, a 1K ground truth (GT), and a 1K person mask. This challenge has garnered widespread attention from both academia and industry, attracting over 100 participating teams and receiving more than 3,000 valid submissions. This report details the motivation behind the challenge, the dataset construction process, the evaluation metrics, and the various phases of the competition. The released dataset and baseline code for this track are publicly available from the same \hrefthis https URLGitHub repository, and the official challenge webpage is hosted on \hrefthis https URLCodaBench.

[CV-67] H-SPAM: Hierarchical Superpixel Anything Model

【速读】:该论文旨在解决现有超像素(Superpixel)方法在分割精度上的瓶颈问题,即生成的超像素形状噪声较大,且多数方法仅提供单一尺度的划分,难以满足需要多尺度表示的视觉任务需求。其解决方案的关键在于提出H-SPAM(Hierarchical Superpixel Anything Model),该框架通过两阶段区域合并策略构建精确、规则且完全嵌套的层次化超像素结构:第一阶段确保对象一致性,第二阶段允许受控的跨对象分组;同时引入视觉注意力图或用户输入对层次结构进行调制,以延长重要区域在层级中的保留时间。该方法在标准基准测试中显著优于现有层次化方法,并达到与最新非层次化方法相当的性能。

链接: https://arxiv.org/abs/2604.11218
作者: Julien Walther,Rémi Giraud,Michaël Clément
机构: Univ. Bordeaux, CNRS, Bordeaux INP, IMS, UMR 5218, France; Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Superpixels offer a compact image representation by grouping pixels into coherent regions. Recent methods have reached a plateau in terms of segmentation accuracy by generating noisy superpixel shapes. Moreover, most existing approaches produce a single fixed-scale partition that limits their use in vision pipelines that would benefit multi-scale representations. In this work, we introduce H-SPAM (Hierarchical Superpixel Anything Model), a unified framework for generating accurate, regular, and perfectly nested hierarchical superpixels. Starting from a fine partition, guided by deep features and external object priors, H-SPAM constructs the hierarchy through a two-phase region merging process that first preserves object consistency and then allows controlled inter-object grouping. The hierarchy can also be modulated using visual attention maps or user input to preserve important regions longer in the hierarchy. Experiments on standard benchmarks show that H-SPAM strongly outperforms existing hierarchical methods in both accuracy and regularity, while performing on par with most recent state-of-the-art non-hierarchical methods. Code and pretrained models are available: this https URL.

[CV-68] 3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis

【速读】:该论文旨在解决实时自由视角渲染中多相机冗余与交互应用延迟约束之间的平衡问题。其核心解决方案是提出一种前向传播网络3DTV,通过结合轻量级几何信息与学习机制实现稀疏视图插值。关键创新在于采用基于Delaunay三角剖分的三元组选择策略以确保目标视图的角度覆盖,并引入姿态感知深度模块,估计粗到细的深度金字塔,从而高效进行特征重投影和遮挡感知融合。该方法无需场景特定优化即可直接推理,避免显式代理表示,在多样场景下均具备鲁棒性,适用于AR/VR、远程呈现等低延迟多视角流媒体和交互式渲染场景。

链接: https://arxiv.org/abs/2604.11211
作者: Stefan Schulz,Fernando Edelstein,Hannah Dröge,Matthias B. Hullin,Markus Plack
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Real-time free-viewpoint rendering requires balancing multi-camera redundancy with the latency constraints of interactive applications. We address this challenge by combining lightweight geometry with learning and propose 3DTV, a feedforward network for real-time sparse-view interpolation. A Delaunay-based triplet selection ensures angular coverage for each target view. Building on this, we introduce a pose-aware depth module that estimates a coarse-to-fine depth pyramid, enabling efficient feature reprojection and occlusion-aware blending. Unlike methods that require scene-specific optimization, 3DTV runs feedforward without retraining, making it practical for AR/VR, telepresence, and interactive applications. Our experiments on challenging multi-view video datasets demonstrate that 3DTV consistently achieves a strong balance of quality and efficiency, outperforming recent real-time novel-view baselines. Crucially, 3DTV avoids explicit proxies, enabling robust rendering across diverse scenes. This makes it a practical solution for low-latency multi-view streaming and interactive rendering. Project Page: this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM) Cite as: arXiv:2604.11211 [cs.CV] (or arXiv:2604.11211v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.11211 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-69] LoViF 2026 Challenge on Human-oriented Semantic Image Quality Assessment: Methods and Results CVPR2026

【速读】:该论文旨在解决传统图像质量评估方法忽视人类感知中语义信息损失的问题,提出从人类视角出发的语义图像质量评估新范式。其关键解决方案是构建了首个面向人类感知的语义图像质量评估数据集(SeIQA),该数据集包含训练、验证和测试三部分共750对退化图像及其对应的参考图像,用于推动语义编码、语义处理及语义导向优化等新兴方向的发展。通过该基准测试,吸引了58支团队参与,其中6支队伍提交了有效方案并在SeIQA数据集上实现了当前最优性能(SOTA)。

链接: https://arxiv.org/abs/2604.11207
作者: Xin Li,Daoli Xu,Wei Luo,Guoqiang Xiang,Haoran Li,Chengyu Zhuang,Zhibo Chen,Jian Guan,Weping Li,Weixia Zhang,Wei Sun,Zhihua Wang,Dandan Zhu,Chengguang Zhu,Ayush Gupta,Rachit Agarwal,Shouvik Das,Biplab Ch Das,Amartya Ghosh,Kanglong Fan,Wen Wen,Shuyan Zhai,Tianwu Zhi,Aoxiang Zhang,Jianzhao Liu,Yabin Zhang,Jiajun Wang,Yipeng Sun,Kaiwei Lian,Banghao Yin
机构: University of Science and Technology of China; EMERGETECH; Shanghai Jiao Tong University; East China Normal University; Sun Yat-sen University; Changzhou Microintelligence Co.,Ltd; Netaji Subhas University of Technology; Samsung RD Institute, Bengaluru, India; City University of Hong Kong; ByteDance Inc.; Friedrich-Alexander Universität Erlangen-Nürnberg; University of Manchester; School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026 Workshop; LoViF Challenge

点击查看摘要

Abstract:This paper reviews the LoViF 2026 Challenge on Human-oriented Semantic Image Quality Assessment. This challenge aims to raise a new direction, i.e., how to evaluate the loss of semantic information from the human perspective, intending to promote the development of some new directions, like semantic coding, processing, and semantic-oriented optimization, etc. Unlike existing datasets of quality assessment, we form a dataset of human-oriented semantic quality assessment, termed the SeIQA dataset. This dataset is divided into three parts for this competition: (i) training data: 510 pairs of degraded images and their corresponding ground truth references; (ii) validation data: 80 pairs of degraded images and their corresponding ground-truth references; (iii) testing data: 160 pairs of degraded images and their corresponding ground-truth references. The primary objective of this challenge is to establish a new and powerful benchmark for human-oriented semantic image quality assessment. There are a total of 58 teams registered in this competition, and 6 teams submitted valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the SeIQA dataset.

[CV-70] MedP-CLIP: Medical CLIP with Region-Aware Prompt Integration

【速读】:该论文旨在解决医学图像分析中对特定解剖结构或病灶区域进行细粒度理解的难题,这与通用视觉语言模型(VLM)侧重全局图像理解的特点存在差异。其核心挑战在于如何在保持全局上下文感知能力的同时,精准响应由医疗专业人员或感知模型提供的区域信息(region-of-interest, RoI)。解决方案的关键在于提出MedP-CLIP,一种基于医学先验知识设计的区域感知型医学视觉语言模型,通过创新性的特征级区域提示融合机制,使模型能够灵活适配多种提示形式(如点、边界框、掩码),同时维持对整体图像语义的理解能力。该方法在包含超640万张医学图像和9730万条区域标注的大规模数据集上预训练,显著提升了跨疾病、跨模态的细粒度空间语义理解能力,在零样本识别、交互式分割及多模态大语言模型赋能等任务中均展现出优越性能。

链接: https://arxiv.org/abs/2604.11197
作者: Jiahui Peng,He Yao,Jingwen Li,Yanzhou Su,Sibo Ju,Yujie Lu,Jin Ye,Hongchun Lu,Xue Li,Lincheng Jiang,Min Zhu,Junlong Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Contrastive Language-Image Pre-training (CLIP) has demonstrated outstanding performance in global image understanding and zero-shot transfer through large-scale text-image alignment. However, the core of medical image analysis often lies in the fine-grained understanding of specific anatomical structures or lesion regions. Therefore, precisely comprehending region-of-interest (RoI) information provided by medical professionals or perception models becomes crucial. To address this need, we propose MedP-CLIP, a region-aware medical vision-language model (VLM). MedP-CLIP innovatively integrates medical prior knowledge and designs a feature-level region prompt integration mechanism, enabling it to flexibly respond to various prompt forms (e.g., points, bounding boxes, masks) while maintaining global contextual awareness when focusing on local regions. We pre-train the model on a meticulously constructed large-scale dataset (containing over 6.4 million medical images and 97.3 million region-level annotations), equipping it with cross-disease and cross-modality fine-grained spatial semantic understanding capabilities. Experiments demonstrate that MedP-CLIP significantly outperforms baseline methods in various medical tasks, including zero-shot recognition, interactive segmentation, and empowering multimodal large language models. This model provides a scalable, plug-and-play visual backbone for medical AI, combining holistic image understanding with precise regional analysis.

[CV-71] owards Adaptive Open-Set Object Detection via Category-Level Collaboration Knowledge Mining

【速读】:该论文旨在解决自适应开放集目标检测(Adaptive Open-Set Object Detection, AOOD)中面临的三大挑战:跨域表征能力弱、新型类别间语义模糊以及源域特征偏置问题。其核心解决方案在于提出一种类别级协作知识挖掘策略,通过构建基于聚类的记忆库(memory bank)来编码类别原型、辅助特征及类内差异信息,并利用无监督聚类迭代更新以增强类别级知识表示;同时设计基类到新类的选择度量来识别与新类别相关的源域特征并初始化新类别分类器,结合自适应特征分配策略将学习到的类别级知识迁移至目标域,同步异步更新记忆库以缓解源域偏置,从而显著提升模型在无标注目标域上的泛化性能。

链接: https://arxiv.org/abs/2604.11195
作者: Yuqi Ji,Junjie Ke,Lihuo He,Lizhi Wang,Xinbo Gao
机构: Xidian University (西安电子科技大学); Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages,9 figures,accepted by IEEE Transactions on Image Processing

点击查看摘要

Abstract:Existing object detectors often struggle to generalize across domains while adapting to emerging novel categories. Adaptive open-set object detection (AOOD) addresses this challenge by training on base categories in the source domain and adapting to both base and novel categories in the target domain without target annotations. However, current AOOD methods remain limited by weak cross-domain representations, ambiguity among novel categories, and source-domain feature bias. To address these issues, we propose a category-level collaboration knowledge mining strategy that exploits both inter-class and intra-class relationships across domains. Specifically, we construct a clustering-based memory bank to encode class prototypes, auxiliary features, and intra-class disparity information, and iteratively update it via unsupervised clustering to enhance category-level knowledge representation. We further design a base-to-novel selection metric to discover source-domain features related to novel categories and use them to initialize novel-category classifiers. In addition, an adaptive feature assignment strategy transfers the learned category-level knowledge to the target domain and asynchronously updates the memory bank to alleviate source-domain bias. Extensive experiments on multiple benchmarks show that our method consistently surpasses state-of-the-art AOOD methods by 1.1-5.5 mAP.

[CV-72] Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在视频场景理解任务中,内部推理轨迹(称为thought streams)如何影响输出质量的问题。其核心挑战在于量化推理深度与输出准确性之间的关系,并识别模型在思考过程中关注的核心内容。解决方案的关键在于引入三个新评估指标:Contentfulness(衡量推理流中有用场景内容与元评论的比例)、Thought-Final Coverage(评估推理内容是否忠实转化为最终输出)以及Dominant Entity Analysis(识别模型关注的主体、动作和场景)。通过在Google Gemini 2.5 Flash及其轻量级版本Flash Lite上进行系统性实验,研究发现:推理质量提升在前几百个token内即达到饱和,且轻量级模型在性能与效率间取得最佳平衡;同时揭示了一种“压缩步骤幻觉”现象——严格推理预算会导致模型在最终输出中添加未经过推理的内容。

链接: https://arxiv.org/abs/2604.11177
作者: Shivam Sharma,Sankalp Nagaonkar,Ashish Choithani,Ashutosh Trivedi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google’s Gemini 2.5 Flash and Flash Lite across scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains stop, and what do these models actually think about? We introduce three evaluation metrics. Contentfulness measures how much of the thought stream is useful scene content versus meta-commentary. Thought-Final Coverage measures how faithfully the thought stream translates into the final output. Dominant Entity Analysis identifies which subjects, actions, and settings the model focuses on. GPT-5 serves as an independent judge. We find that quality gains from additional thinking plateau quickly, with most improvement occurring in the first few hundred tokens. Flash Lite offers the best balance between quality and token usage. Tight reasoning budgets cause the model to add content in the final output that it never reasoned about, a form of compression-step hallucination. Despite being different model tiers, Flash and Flash Lite produce similar thought streams, though they differ in style: Flash discusses its reasoning process, while Lite focuses on describing the scene.

[CV-73] Precision Synthesis of Multi-Tracer PET via VLM-Modulated Rectified Flow for Stratifying Mild Cognitive Impairment

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)早期筛查中正电子发射断层成像(PET)因成本高和辐射暴露限制临床应用的问题,提出一种基于磁共振成像(MRI)生成多示踪剂PET图像的生成式AI方法。其解决方案的关键在于提出DIReCT++模型,该模型融合了3D修正流(rectified flow)架构以捕捉跨模态(MRI到PET)与跨示踪剂(如¹⁸F-AV-45和¹⁸F-FDG)的复杂关系,并引入领域自适应视觉-语言模型(BiomedCLIP)实现基于临床评分和影像知识的文本引导个性化生成,从而显著提升合成PET图像的保真度、泛化能力及疾病特异性模式再现精度,最终支持轻度认知障碍(MCI)的精准个体化分层,推动AD早期诊断与预后预测的可扩展、数据高效工具发展。

链接: https://arxiv.org/abs/2604.11176
作者: Tuo Liu,Shuijin Lin,Shaozhen Yan,Haifeng Wang,Jie Lu,Jianhua Ma,Chunfeng Lian
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 5 figures

点击查看摘要

Abstract:The biological definition of Alzheimer’s disease (AD) relies on multi-modal neuroimaging, yet the clinical utility of positron emission tomography (PET) is limited by cost and radiation exposure, hindering early screening at preclinical or prodromal stages. While generative models offer a promising alternative by synthesizing PET from magnetic resonance imaging (MRI), achieving subject-specific precision remains a primary challenge. Here, we introduce DIReCT ++ , a Domain-Informed ReCTified flow model for synthesizing multi-tracer PET from MRI combined with fundamental clinical information. Our approach integrates a 3D rectified flow architecture to capture complex cross-modal and cross-tracer relationships with a domain-adapted vision-language model (BiomedCLIP) that provides text-guided, personalized generation using clinical scores and imaging knowledge. Extensive evaluations on multi-center datasets demonstrate that DIReCT ++ not only produces synthetic PET images ( ^18 F-AV-45 and ^18 F-FDG) of superior fidelity and generalizability but also accurately recapitulates disease-specific patterns. Crucially, combining these synthesized PET images with MRI enables precise personalized stratification of mild cognitive impairment (MCI), advancing a scalable, data-efficient tool for the early diagnosis and prognostic prediction of AD. The source code will be released on this https URL.

[CV-74] NeuVolEx: Implicit Neural Features for Volume Exploration

【速读】:该论文旨在解决直接体积渲染(Direct Volume Rendering, DVR)中区域兴趣(Region of Interest, ROI)分类与聚类的挑战,尤其是现有方法在有限用户监督下难以实现鲁棒性特征表示的问题。具体而言,传统方法依赖显式局部特征或隐式卷积特征,前者无法捕捉全局几何模式,后者在实际应用中性能不稳定。解决方案的关键在于提出NeuVolEx框架,利用隐式神经表示(Implicit Neural Representations, INRs)训练过程中学习到的特征作为稳健的探索基础,并通过引入结构编码器和多任务学习机制增强空间一致性,从而提升ROI表征能力。该方法在图像驱动的转移函数设计和视角推荐两个典型任务上均表现出优越的有效性和可用性。

链接: https://arxiv.org/abs/2604.11172
作者: Haill An,Suhyeon Kim,Donghyuk Choo,Younhyun Jung
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 9 figures. Under review

点击查看摘要

Abstract:Direct volume rendering (DVR) aims to help users identify and examine regions of interest (ROIs) within volumetric data, and feature representations that support effective ROI classification and clustering play a fundamental role in volume exploration. Existing approaches typically rely on either explicit local feature representations or implicit convolutional feature representations learned from raw volumes. However, explicit local feature representations are limited in capturing broader geometric patterns and spatial correlations, while implicit convolutional feature representations do not necessarily ensure robust performance in practice, where user supervision is typically limited. Meanwhile, implicit neural representations (INRs) have recently shown strong promise in DVR for volume compression, owing to their ability to compactly parameterize continuous volumetric fields. In this work, we propose NeuVolEx, a neural volume exploration approach that extends the role of INRs beyond volume compression. Unlike prior compression methods that focus on INR outputs, NeuVolEx leverages feature representations learned during INR training as a robust basis for volume exploration. To better adapt these feature representations to exploration tasks, we augment a base INR with a structural encoder and a multi-task learning scheme that improve spatial coherence for ROI characterization. We validate NeuVolEx on two fundamental volume exploration tasks: image-based transfer function (TF) design and viewpoint recommendation. NeuVolEx enables accurate ROI classification under sparse user supervision for image-based TF design and supports unsupervised clustering to identify compact complementary viewpoints that reveal different ROI clusters. Experiments on diverse volume datasets with varying modalities and ROI complexities demonstrate NeuVolEx improves both effectiveness and usability over prior methods

[CV-75] Development and evaluation of CADe systems in low-prevalence setting: The RARE25 challenge for early detection of Barretts neoplasia DATE

【速读】:该论文旨在解决Barrett食管中早期肿瘤病变的计算机辅助检测(CADe)在低患病率场景下的性能评估不足问题,即现有系统在平衡或富集数据集上表现良好,但在真实临床环境中因患病率低导致阳性预测值(PPV)偏低、临床实用性被高估的问题。解决方案的关键在于构建一个大规模、反映真实发病率的基准测试平台——RARE25挑战赛,其包含公开训练集和隐藏测试集,并采用基于操作点特异性指标(强调高敏感性并考虑患病率)进行评估,从而推动开发对患病率变化具有鲁棒性的CADe系统,同时揭示当前方法普遍依赖全监督分类而缺乏异常检测或单类学习等适应低患病率场景的机制。

链接: https://arxiv.org/abs/2604.11171
作者: Tim J.M. Jaspers,Francisco Caetano,Cris H.B. Claessens,Carolus H.J. Kusters,Rixta A.H. van Eijck van Heslinga,Floor Slooter,Jacques J. Bergman,Peter H.N. De With,Martijn R. Jong,Albert J. de Groof,Fons van der Sommen
机构: Eindhoven University of Technology (埃因霍温理工大学); Amsterdam University Medical Centers (阿姆斯特丹大学医疗中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The final author list is currently being finalized and will be updated in subsequent versions

点击查看摘要

Abstract:Computer-aided detection (CADe) of early neoplasia in Barrett’s esophagus is a low-prevalence surveillance problem in which clinically relevant findings are rare. Although many CADe systems report strong performance on balanced or enriched datasets, their behavior under realistic prevalence remains insufficiently characterized. The RARE25 challenge addresses this gap by introducing a large-scale, prevalence-aware benchmark for neoplasia detection. It includes a public training set and a hidden test set reflecting real-world incidence. Methods were evaluated using operating-point-specific metrics emphasizing high sensitivity and accounting for prevalence. Eleven teams from seven countries submitted approaches using diverse architectures, pretraining, ensembling, and calibration strategies. While several methods achieved strong discriminative performance, positive predictive values remained low, highlighting the difficulty of low-prevalence detection and the risk of overestimating clinical utility when prevalence is ignored. All methods relied on fully supervised classification despite the dominance of normal findings, indicating a lack of prevalence-agnostic approaches such as anomaly detection or one-class learning. By releasing a public dataset and a reproducible evaluation framework, RARE25 aims to support the development of CADe systems robust to prevalence shift and suitable for clinical surveillance workflows.

[CV-76] Do Instance Priors Help Weakly Supervised Semantic Segmentation?

【速读】:该论文旨在解决语义分割(Semantic Segmentation)任务中密集像素级标注成本高、耗时长的问题。传统方法依赖大量精细标注数据,而本文提出SeSAM框架,利用基础分割模型Segment Anything Model (SAM) 结合弱标签(如粗略掩码、涂鸦和点提示)实现高效且高质量的语义分割。其关键在于:将类别掩码分解为连通域,沿物体骨架采样点提示,基于弱标签覆盖度选择SAM掩码,并通过伪标签迭代优化;同时在半监督学习框架中融合真实标签、SAM生成的伪标签与高置信度伪标签,从而显著提升分割性能并大幅降低标注成本。

链接: https://arxiv.org/abs/2604.11170
作者: Anurag Das,Anna Kukleva,Xinting Hu,Yuki M. Asano,Bernt Schiele
机构: Max Planck Institute for Informatics (马克斯普朗克信息研究所); University of Technology Nuremberg (纽伦堡应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 15 figures

点击查看摘要

Abstract:Semantic segmentation requires dense pixel-level annotations, which are costly and time-consuming to acquire. To address this, we present SeSAM, a framework that uses a foundational segmentation model, i.e. Segment Anything Model (SAM), with weak labels, including coarse masks, scribbles, and points. SAM, originally designed for instance-based segmentation, cannot be directly used for semantic segmentation tasks. In this work, we identify specific challenges faced by SAM and determine appropriate components to adapt it for class-based segmentation using weak labels. Specifically, SeSAM decomposes class masks into connected components, samples point prompts along object skeletons, selects SAM masks using weak-label coverage, and iteratively refines labels using pseudo-labels, enabling SAM-generated masks to be effectively used for semantic segmentation. Integrated with a semi-supervised learning framework, SeSAM balances ground-truth labels, SAM-based pseudo-labels, and high-confidence pseudo-labels, significantly improving segmentation quality. Extensive experiments across multiple benchmarks and weak annotation types show that SeSAM consistently outperforms weakly supervised baselines while substantially reducing annotation cost relative to fine supervision.

[CV-77] RADA: Region-Aware Dual-encoder Auxiliary learning for Barely-supervised Medical Image Segmentation

【速读】:该论文旨在解决医学图像分割中因依赖全监督学习而导致的标注成本过高问题,尤其是在三维体积扫描场景下,密集标注耗时且昂贵。现有方法通过几何连续性传播稀疏标注生成伪标签,但缺乏语义理解,导致伪标签质量低。解决方案的关键在于提出RADA(Region-Aware Dual-encoder Auxiliary learning pipeline),其核心是采用预训练于Alpha-CLIP的双编码器框架,从原始图像和有限标注中提取细粒度、区域特异的视觉特征,并融合图像级细粒度特征与文本级语义指导,实现区域感知的语义监督,从而在像素级分割任务中提升局部细节建模能力。该方法在极稀疏标注条件下实现了最先进的性能,在LA2018、KiTS19和LiTS等多个数据集上表现出强泛化能力。

链接: https://arxiv.org/abs/2604.11164
作者: Shuang Zeng,Boxu Xie,Lei Zhu,Xinliang Zhang,Jiakui Hu,Zhengjian Yao,Yuanwei Li,Yuxing Lu,Yanye Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning has greatly advanced medical image segmentation, but its success relies heavily on fully supervised learning, which requires dense annotations that are costly and time-consuming for 3D volumetric scans. Barely-supervised learning reduces annotation burden by using only a few labeled slices per volume. Existing methods typically propagate sparse annotations to unlabeled slices through geometric continuity to generate pseudo-labels, but this strategy lacks semantic understanding, often resulting in low-quality pseudo-labels. Furthermore, medical image segmentation is inherently a pixel-level visual understanding task, where accuracy fundamentally depends on the quality of local, fine-grained visual features. Inspired by this, we propose RADA, a novel Region-Aware Dual-encoder Auxiliary learning pipeline which introduces a dual-encoder framework pre-trained on Alpha-CLIP to extract fine-grained, region-specific visual features from the original images and limited annotations. The framework combines image-level fine-grained visual features with text-level semantic guidance, providing region-aware semantic supervision that bridges image-level semantics and pixel-level segmentation. Integrated into a triple-view training framework, RADA achieves SOTA performance under extremely sparse annotation settings on LA2018, KiTS19 and LiTS, demonstrating robust generalization across diverse datasets.

[CV-78] Boxes2Pixels: Learning Defect Segmentation from Noisy SAM Masks CVPR2026

【速读】:该论文旨在解决工业缺陷检测中因缺乏密集像素级标注而导致的准确缺陷分割难题。现有方法常依赖于低成本的边界框(bounding boxes)通过基础分割模型(如Segment Anything Model, SAM)生成伪掩码(pseudo-masks),但这些伪标签在工业表面存在系统性噪声,易误检背景结构并遗漏稀疏缺陷。解决方案的关键在于提出一个噪声鲁棒的“框到像素”蒸馏框架Boxes2Pixels,其核心思想是将SAM视为带噪教师而非真值监督源,通过三个关键设计提升性能:(i) 利用冻结的DINOv2特征与分层解码器实现语义稳定性;(ii) 引入辅助二分类定位头以解耦前景发现与类别预测;(iii) 设计单边在线自校正机制,在学生模型对背景自信时放松背景监督,专门修正教师的假阴性错误。该方案在风力发电机巡检基准上显著提升异常mIoU(+6.97)和二值IoU(+9.71),同时参数量减少80%,且在线校正使二值召回率提升+18.56。

链接: https://arxiv.org/abs/2604.11162
作者: Camile Lendering,Erkut Akdag,Egor Bondarev
机构: Eindhoven University of Technology (埃因霍温理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for presentation at the AI4RWC Workshop at CVPR 2026

点击查看摘要

Abstract:Accurate defect segmentation is critical for industrial inspection, yet dense pixel-level annotations are rarely available. A common workaround is to convert inexpensive bounding boxes into pseudo-masks using foundation segmentation models such as the Segment Anything Model (SAM). However, these pseudo-labels are systematically noisy on industrial surfaces, often hallucinating background structure while missing sparse defects. To address this limitation, a noise-robust box-to-pixel distillation framework, Boxes2Pixels, is proposed that treats SAM as a noisy teacher rather than a source of ground-truth supervision. Bounding boxes are converted into pseudo-masks offline by SAM, and a compact student is trained with (i) a hierarchical decoder over frozen DINOv2 features for semantic stability, (ii) an auxiliary binary localization head to decouple sparse foreground discovery from class prediction, and (iii) a one-sided online self-correction mechanism that relaxes background supervision when the student is confident, targeting teacher false negatives. On a manually annotated wind turbine inspection benchmark, the proposed Boxes2Pixels improves anomaly mIoU by +6.97 and binary IoU by +9.71 over the strongest baseline trained under identical weak supervision. Moreover, online self-correction increases the binary recall by +18.56, while the model employs 80% fewer trainable parameters. Code is available at this https URL. Comments: Accepted for presentation at the AI4RWC Workshop at CVPR 2026 Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2604.11162 [cs.CV] (or arXiv:2604.11162v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.11162 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-79] rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training CVPR2026

【速读】:该论文旨在解决无监督远程光电容积脉搏波描记法(remote photoplethysmography, rPPG)在真实场景(in-the-wild)视频上训练时因视频质量低而导致模型性能严重下降的问题。其核心挑战在于缺乏对视频是否适合用于rPPG建模的预评估机制,而现有视频质量评估(Video Quality Assessment, VQA)方法主要面向人类感知,不适用于rPPG任务。解决方案的关键在于提出rPPG-VQA框架,该框架融合信号级与场景级分析:信号级分支通过多方法共识机制实现鲁棒的信噪比(SNR)估计以评估生理信号质量;场景级分支利用多模态大语言模型(Multimodal Large Language Model, MLLM)识别运动伪影和光照不稳定等干扰因素;并进一步设计两阶段自适应采样(Two-stage Adaptive Sampling, TAS)策略,基于质量评分筛选最优训练数据集,从而显著提升无监督rPPG模型在标准基准上的精度。

链接: https://arxiv.org/abs/2604.11156
作者: Tianyang Dai,Ming Chang,Yan Chen,Yang Hu
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Unsupervised remote photoplethysmography (rPPG) promises to leverage unlabeled video data, but its potential is hindered by a critical challenge: training on low-quality “in-the-wild” videos severely degrades model performance. An essential step missing here is to assess the suitability of the videos for rPPG model learning before using them for the task. Existing video quality assessment (VQA) methods are mainly designed for human perception and not directly applicable to the above purpose. In this work, we propose rPPG-VQA, a novel framework for assessing video suitability for rPPG. We integrate signal-level and scene-level analyses and design a dual-branch assessment architecture. The signal-level branch evaluates the physiological signal quality of the videos via robust signal-to-noise ratio (SNR) estimation with a multi-method consensus mechanism, and the scene-level branch uses a multimodal large language model (MLLM) to identify interferences like motion and unstable lighting. Furthermore, we propose a two-stage adaptive sampling (TAS) strategy that utilizes the quality score to curate optimal training datasets. Experiments show that by training on large-scale, “in-the-wild” videos filtered by our framework, we can develop unsupervised rPPG models that achieve a substantial improvement in accuracy on standard benchmarks. Our code is available at this https URL.

[CV-80] Naka-GS: A Bionics-inspired Dual-Branch Naka Correction and Progressive Point Pruning for Low-Light 3DGS

【速读】:该论文旨在解决低光照条件下3D重建与恢复中因图像可见度下降、色彩失真及几何先验污染而导致的性能退化问题。其核心解决方案是提出一种受生物启发的框架NAKA-GS,通过联合优化光度恢复与几何初始化来提升重建质量;关键创新在于:1)基于Naka机制设计的色度校正网络,融合物理先验增强、双分支输入建模、频域解耦校正与掩码引导优化,有效抑制亮区色偏与边缘结构误差;2)引入轻量级点预处理模块(Point Preprocessing Module, PPM),实现坐标对齐、体素池化与距离自适应渐进修剪,去除冗余噪声点并保留关键结构信息,从而显著提升高斯初始化的精度与效率,且不增加显著推理开销。

链接: https://arxiv.org/abs/2604.11142
作者: Runyu Zhu,SiXun Dong,Zhiqiang Zhang,Qingxia Ye,Zhihua Xu
机构: China University of Mining and Technology-Beijing (中国矿业大学(北京))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Low-light conditions severely hinder 3D restoration and reconstruction by degrading image visibility, introducing color distortions, and contaminating geometric priors for downstream optimization. We present NAKA-GS, a bionics-inspired framework for low-light 3D Gaussian Splatting that jointly improves photometric restoration and geometric initialization. Our method starts with a Naka-guided chroma-correction network, which combines physics-prior low-light enhancement, dual-branch input modeling, frequency-decoupled correction, and mask-guided optimization to suppress bright-region chromatic artifacts and edge-structure errors. The enhanced images are then fed into a feed-forward multi-view reconstruction model to produce dense scene priors. To further improve Gaussian initialization, we introduce a lightweight Point Preprocessing Module (PPM) that performs coordinate alignment, voxel pooling, and distance-adaptive progressive pruning to remove noisy and redundant points while preserving representative structures. Without introducing heavy inference overhead, NAKA-GS improves restoration quality, training stability, and optimization efficiency for low-light 3D reconstruction. The proposed method was presented in the NTIRE 3D Restoration and Reconstruction (3DRR) Challenge, and outperformed the baseline methods by a large margin. The code is available at this https URL

[CV-81] Sparse Hypergraph-Enhanced Frame-Event Object Detection with Fine-Grained MoE

【速读】:该论文旨在解决多模态融合中因RGB相机与事件流(event stream)数据异质性和冗余性导致的计算开销过大或特征融合效果不佳的问题。其解决方案的关键在于提出Hyper-FEOD框架,通过两个核心组件实现高效且精准的跨模态交互:一是Sparse Hypergraph-enhanced Cross-Modal Fusion(S-HCF),利用事件流的稀疏特性构建事件引导的活动图,并仅对关键运动稀疏token进行高阶超图建模,从而捕获RGB与事件数据间的复杂非局部依赖关系,同时规避传统超图计算的复杂度瓶颈;二是Fine-Grained Mixture of Experts(FG-MoE)增强模块,针对图像不同区域的语义差异设计专用超图专家(用于边界、纹理和背景),并引入像素级空间门控机制自适应路由与增强特征,结合负载均衡损失与零初始化策略确保训练稳定性和特征精炼精度,不破坏预训练主干网络分布。

链接: https://arxiv.org/abs/2604.11140
作者: Wei Bao,Yuehan Wang,Tianhang Zhou,Siqi Li,Yue Gao
机构: BNRist; THUIBCS; BLBCI; School of Software, Tsinghua University; State Key Laboratory of Heavy Oil Processing, China University of Petroleum (Beijing)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Integrating frame-based RGB cameras with event streams offers a promising solution for robust object detection under challenging dynamic conditions. However, the inherent heterogeneity and data redundancy of these modalities often lead to prohibitive computational overhead or suboptimal feature fusion. In this paper, we propose Hyper-FEOD, a high-performance and efficient detection framework, which synergistically optimizes multi-modal interaction through two core components. First, we introduce Sparse Hypergraph-enhanced Cross-Modal Fusion (S-HCF), which leverages the inherent sparsity of event streams to construct an event-guided activity map. By performing high-order hypergraph modeling exclusively on selected motion-critical sparse tokens, S-HCF captures complex non-local dependencies between RGB and event data while overcoming the traditional complexity bottlenecks of hypergraph computation. Second, we design a Fine-Grained Mixture of Experts (FG-MoE) Enhancement module to address the diverse semantic requirements of different image regions. This module employs specialized hypergraph experts tailored for object boundaries, internal textures, and backgrounds, utilizing a pixel-level spatial gating mechanism to adaptively route and enhance features. Combined with a load-balancing loss and zero-initialization strategy, FG-MoE ensures stable training and precise feature refinement without disrupting the pre-trained backbone’s distribution. Experimental results on mainstream RGB-Event benchmarks demonstrate that Hyper-FEOD achieves a superior accuracy-efficiency trade-off, outperforming state-of-the-art methods while maintaining a lightweight footprint suitable for real-time edge deployment.

[CV-82] ViserDex: Visual Sim-to-Real for Robust Dexterous In-hand Reorientation

【速读】:该论文旨在解决单目RGB图像下物体在手内重定向(in-hand object reorientation)中的精确位姿估计问题,以应对复杂任务动态性带来的挑战。现有方法通常依赖多相机系统或高成本的光线追踪渲染,难以在实际场景中部署。其解决方案的关键在于提出一种基于3D高斯点阵(3D Gaussian Splatting, 3DGS)的仿真到现实(sim-to-real)框架:通过在高斯表示空间中进行域随机化(domain randomization),对3D高斯模型施加物理一致的预渲染增强,生成逼真且多样化的视觉数据用于位姿估计训练;同时采用课程学习(curriculum-based reinforcement learning)与教师-学生蒸馏策略训练操控策略,使感知与控制模块均可在消费级硬件上独立训练,显著降低计算资源需求。实验表明,该方法在复杂光照条件下仍能实现鲁棒的五类物体重定向,验证了3DGS作为纯RGB环境下灵巧操作可行路径的有效性。

链接: https://arxiv.org/abs/2604.11138
作者: Arjun Bhardwaj,Maximum Wilder-Smith,Mayank Mittal,Vaishakh Patil,Marco Hutter
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In-hand object reorientation requires precise estimation of the object pose to handle complex task dynamics. While RGB sensing offers rich semantic cues for pose tracking, existing solutions rely on multi-camera setups or costly ray tracing. We present a sim-to-real framework for monocular RGB in-hand reorientation that integrates 3D Gaussian Splatting (3DGS) to bridge the visual sim-to-real gap. Our key insight is performing domain randomization in the Gaussian representation space: by applying physically consistent, pre-rendering augmentations to 3D Gaussians, we generate photorealistic, randomized visual data for object pose estimation. The manipulation policy is trained using curriculum-based reinforcement learning with teacher-student distillation, enabling efficient learning of complex behaviors. Importantly, both perception and control models can be trained independently on consumer-grade hardware, eliminating the need for large compute clusters. Experiments show that the pose estimator trained with 3DGS data outperforms those trained using conventional rendering data in challenging visual environments. We validate the system on a physical multi-fingered hand equipped with an RGB camera, demonstrating robust reorientation of five diverse objects even under challenging lighting conditions. Our results highlight Gaussian splatting as a practical path for RGB-only dexterous manipulation. For videos of the hardware deployments and additional supplementary materials, please refer to the project website: this https URL

[CV-83] BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

【速读】:该论文旨在解决现有多模态大语言模型(Multimodal Large Language Models, MLLMs)在视频问答(Video Question Answering, VQA)任务中缺乏细粒度对象定位能力的问题。当前方法通常对视频帧进行整体编码,且通过将边界框坐标序列化为文本标记来传递对象信息,这种“文本-坐标”范式存在根本性的模态不匹配:对象信息本质上是视觉的,而将其编码为文本导致高token开销,迫使模型进行剧烈的时间下采样,从而丢失关键时序动态。解决方案的关键在于提出BoxTuning,其核心思想是将对象的空间-时间信息直接注入视觉模态——通过在视频帧上渲染彩色边界框和轨迹线作为视觉提示,仅保留简洁的颜色到对象的图例作为文本内容。该方法显著降低文本token消耗(实际减少87–93%),同时保持完整的时间分辨率,并利用轨迹线编码帧间运动方向与速度,恢复了文本坐标方法被迫丢弃的细粒度动态信息,实验证明其在五个视频问答基准测试中优于基线方法,尤其在空间导向任务中表现突出,且几乎消除了推理密集型任务中的性能下降。

链接: https://arxiv.org/abs/2604.11136
作者: Zekun Qian,Ruize Han,Wei Feng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Object-level spatial-temporal understanding is essential for video question answering, yet existing multimodal large language models (MLLMs) encode frames holistically and lack explicit mechanisms for fine-grained object grounding. Recent work addresses this by serializing bounding box coordinates as text tokens, but this text-coordinate paradigm suffers from a fundamental modality mismatch: object information is inherently visual, yet encoding it as text incurs a high token cost that forces aggressive temporal downsampling. We propose BoxTuning, which resolves this mismatch by injecting object spatial-temporal information directly into the visual modality. Colored bounding boxes and trajectory trails are rendered onto video frames as visual prompts, with only a concise color-to-object legend retained as text. This reduces the token cost significantly, achieving 87-93% text token reduction in practice. It also preserves full temporal resolution, where the trajectory trails further encode inter-frame motion direction and speed within each keyframe, recovering fine-grained dynamics that text-coordinate methods are forced to discard. Experimental results on five video QA benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, IntentQA) show that BoxTuning surpasses text-coordinate baselines on spatially oriented tasks and nearly eliminates the accuracy degradation observed on reasoning-centric tasks, establishing visual prompting as a more natural and efficient paradigm for conveying object information to video MLLMs.

[CV-84] Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理超高分辨率(Ultra-High-Resolution, UHR)遥感影像时因生成大量视觉标记(visual tokens)而导致的计算开销过大、推理效率严重受限的问题。现有视觉标记压缩方法多采用静态且均匀的策略,忽略了遥感解译任务中固有的“语义-几何二元性”(Semantic-Geometric Duality):目标语义任务依赖抽象语义信息,适合激进背景剪枝;而场景几何任务则高度依赖空间拓扑完整性。解决方案的关键在于提出一种任务自适应的双流标记压缩框架 DualComp,其通过轻量级预训练路由器动态引导特征分流至两个专用路径:在目标语义流中,采用基于尺寸自适应聚类的空间连续语义聚合器(Spatially-Contiguous Semantic Aggregator, SCSA)实现冗余背景聚合并保护小目标;在场景几何流中,引入指令引导的结构恢复器(Instruction-Guided Structure Recoverer, IGSR),利用贪心路径追踪机制重构空间骨架。实验表明,DualComp 在 XLRS-Bench 基准上实现了高保真遥感解译,显著降低计算成本,同时提升效率与准确性。

链接: https://arxiv.org/abs/2604.11122
作者: Yueying Li,Fengxiang Wang,Yan Li,Mingshuo Chen,Mengying Zhao,Long Lan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated immense potential in Earth observation. However, the massive visual tokens generated when processing Ultra-High-Resolution (UHR) imagery introduce prohibitive computational overhead, severely bottlenecking their inference efficiency. Existing visual token compression methods predominantly adopt static and uniform compression strategies, neglecting the inherent “Semantic-Geometric Duality” in remote sensing interpretation tasks. Specifically, object semantic tasks focus on the abstract semantics of objects and benefit from aggressive background pruning, whereas scene geometric tasks critically rely on the integrity of spatial topology. To address this challenge, we propose DualComp, a task-adaptive dual-stream token compression framework. Dynamically guided by a lightweight pre-trained router, DualComp decouples feature processing into two dedicated pathways. In the object semantic stream, the Spatially-Contiguous Semantic Aggregator (SCSA) utilizes size-adaptive clustering to aggregates redundant background while protecting small object. In the scene geometric stream, the Instruction-Guided Structure Recoverer (IGSR) introduces a greedy path-tracing topology completion mechanism to reconstruct spatial skeletons. Experiments on the UHR remote sensing benchmark XLRS-Bench demonstrate that DualComp accomplishes high-fidelity remote sensing interpretation at an exceptionally low computational cost, achieving simultaneous improvements in both efficiency and accuracy.

[CV-85] Quantum-Gated Task-interaction Knowledge Distillation for Pre-trained Model-based Class-Incremental Learning CVPR2026

【速读】:该论文旨在解决类增量学习(Class-incremental Learning, CIL)中因多任务子空间纠缠导致的灾难性遗忘问题,尤其是在任务路由参数校准不佳或任务级表征被刚性固定时表现尤为显著。解决方案的关键在于提出一种量子门控任务交互知识蒸馏(Quantum-Gated Task-interaction Knowledge Distillation, QKD)框架,其核心创新是引入量子门控任务调制机制(quantum-gated task modulation gating mechanism),通过动态建模任务嵌入之间的关系,捕捉样本到任务的相关性,从而在联合训练和推理阶段实现跨任务的知识迁移;同时,基于量子门控输出的权重,对旧任务适配器到新任务适配器进行任务交互知识蒸馏,有效弥合独立任务子空间间的表征差异,显著缓解遗忘并达到当前最优性能。

链接: https://arxiv.org/abs/2604.11112
作者: Linjie Li,Huiyu Xiao,Jiarui Cao,Zhenyu Wu,Yang Ji
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Engineering Research Center for Information Network, Ministry of Education (信息网络工程研究中心,教育部)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR2026

点击查看摘要

Abstract:Class-incremental learning (CIL) aims to continuously accumulate knowledge from a stream of tasks and construct a unified classifier over all seen classes. Although pretrained models (PTMs) have shown promising performance in CIL, they still struggle with the entanglement of multi-task subspaces, leading to catastrophic forgetting when task routing parameters are poorly calibrated or task-level representations are rigidly fixed. To address this issue, we propose a novel Quantum-Gated Task-interaction Knowledge Distillation (QKD) framework that leverages quantum gating to guide inter-task knowledge transfer. Specifically, we introduce a quantum-gated task modulation gating mechanism to model the relational dependencies among task embedding, dynamically capturing the sample-to-task relevance for both joint training and inference across streaming tasks. Guided by the quantum gating outputs, we perform task-interaction knowledge distillation guided by these task-embedding-level correlation weights from old to new adapters, enabling the model to bridge the representation gaps between independent task subspaces. Extensive experiments demonstrate that QKD effectively mitigates forgetting and achieves state-of-the-art performance.

[CV-86] OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

【速读】:该论文旨在解决长时序电影视频到结构化脚本的生成问题,即如何从长格式影视内容中自动提取并生成包含角色动作、对话、表情及音频线索的分场景、时序对齐的详细脚本(video-to-script, V2S)。其解决方案的关键在于构建首个由人工标注的基准数据集,并提出一种时序感知的分层评估框架;同时设计了一个80亿参数的多模态语言模型OmniScript,采用渐进式训练策略——先通过思维链监督微调提升情节与角色推理能力,再利用时间片段奖励进行强化学习优化,从而实现高效且高质量的叙事理解与脚本生成。

链接: https://arxiv.org/abs/2604.11102
作者: Junfu Pu,Yuxin Chen,Teng Wang,Ying Shan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Project Page: this https URL

点击查看摘要

Abstract:Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues. To facilitate this, we construct a first-of-its-kind human-annotated benchmark and propose a temporally-aware hierarchical evaluation framework. Furthermore, we present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline that leverages chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning using temporally segmented rewards. Extensive experiments demonstrate that despite its parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models, including Gemini 3-Pro, in both temporal localization and multi-field semantic accuracy.

[CV-87] Efficient Transceiver Design for Aerial Image Transmission and Large-scale Scene Reconstruction

【速读】:该论文旨在解决低空智能网络(Low-altitude Intelligent Networks, LAIN)中大规模三维(3D)场景重建对高效无线图像传输的挑战,尤其是现有方案难以在严重导频开销与保证重建保真度所需的传输精度之间取得平衡的问题。解决方案的关键在于提出一种基于深度学习的端到端(End-to-End, E2E)收发器设计,将3D高斯点绘(3D Gaussian Splatting, 3DGS)直接嵌入训练过程,并通过联合优化通信模块并利用组合的3DGS渲染损失函数,显式提升场景恢复质量;同时,该任务驱动框架支持稀疏导频方案,在降低传输开销的同时仍能实现低空信道条件下的鲁棒图像恢复。

链接: https://arxiv.org/abs/2604.11098
作者: Zeyi Ren,Jialin Dong,Wei Zuo,Yikun Wang,Bingyang Cheng,Sheng Zhou,Zhisheng Niu
机构: 1. Tsinghua University (清华大学); 2. Huawei Technologies Co., Ltd. (华为技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: 6 pages, 6 figures, submitted to IEEE ISIT-w

点击查看摘要

Abstract:Large-scale three-dimensional (3D) scene reconstruction in low-altitude intelligent networks (LAIN) demands highly efficient wireless image transmission. However, existing schemes struggle to balance severe pilot overhead with the transmission accuracy required to maintain reconstruction fidelity. To strike a balance between efficiency and reliability, this paper proposes a novel deep learning-based end-to-end (E2E) transceiver design that integrates 3D Gaussian Splatting (3DGS) directly into the training process. By jointly optimizing the communication modules via the combined 3DGS rendering loss, our approach explicitly improves scene recovery quality. Furthermore, this task-driven framework enables the use of a sparse pilot scheme, significantly reducing transmission overhead while maintaining robust image recovery under low-altitude channel conditions. Extensive experiments on real-world aerial image datasets demonstrate that the proposed E2E design significantly outperforms existing baselines, delivering superior transmission performance and accurate 3D scene reconstructions.

[CV-88] CDPR: Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation

【速读】:该论文旨在解决单目深度估计(Monocular Depth Estimation)在复杂场景下(如无纹理表面、透明区域和镜面反射)性能下降的问题。现有基于扩散模型的方法虽已取得进展,但仅依赖RGB输入,在挑战性区域缺乏足够视觉线索。其解决方案的关键在于提出CDPR框架——一种融合偏振信息(AoLP/DoLP)与RGB图像的跨模态扩散模型,通过预训练变分自编码器(VAE)将多模态数据映射至共享潜在空间,并引入可学习的置信度感知门控机制动态融合信息,从而在保持关键结构的同时抑制偏振通道中的噪声信号,显著提升对反射和透明区域的估计鲁棒性。

链接: https://arxiv.org/abs/2604.11097
作者: Rongjia Yu,Tong Jia,Hao Wang,Xiaofang Li,Xiao Yang,Zinuo Zhang,Cuiwei Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: preprint version of IEEE TMM 2026 Regular Paper

点击查看摘要

Abstract:Monocular depth estimation is a fundamental yet challenging task in computer vision, especially under complex conditions such as textureless surfaces, transparency, and specular reflections. Recent diffusion-based approaches have significantly advanced performance by reformulating depth prediction as a denoising process in the latent space. However, existing methods rely solely on RGB inputs, which often lack sufficient cues in challenging regions. In this work, we present CDPR - Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation - a novel diffusion-based framework that integrates physically grounded polarization priors to enhance estimation robustness. Specifically, we encode both RGB and polarization (AoLP/DoLP) images into a shared latent space via a pre-trained Variational Autoencoder (VAE), and dynamically fuse multi-modal information through a learnable confidence-aware gating mechanism. This fusion module adaptively suppresses noisy signals in polarization inputs while preserving informative cues, particularly around reflective or transparent surfaces, and provides the integrated latent representation for subsequent monocular depth estimation. Beyond depth estimation, we further verify that our framework can be easily generalized to surface normal prediction with minimal modification, showcasing its scalability to general polarization-guided dense prediction tasks. Experiments on both synthetic and real-world datasets validate that CDPR significantly outperforms RGB-only baselines in challenging regions while maintaining competitive performance in standard scenes.

[CV-89] LDEPrompt: Layer-importance guided Dual Expandable Prompt Pool for Pre-trained Model-based Class-Incremental Learning ICASSP2026

【速读】:该论文旨在解决提示引导的类增量学习(prompt-based class-incremental learning)方法中存在的三个关键问题:固定提示池(prompt pool)导致灵活性不足、依赖人工选择提示嵌入(prompt embeddings)以及对预训练骨干网络(pretrained backbone)的强依赖性。其解决方案的核心在于提出一种层重要性引导的双可扩展提示池(Layer-importance guided Dual Expandable Prompt Pool, LDEPrompt),通过自适应地选择模型层、动态冻结与扩展提示池,实现更灵活且高效的提示选择机制,从而提升模型在类增量场景下的性能和可扩展性。

链接: https://arxiv.org/abs/2604.11091
作者: Linjie Li,Zhenyu Wu,Huiyu Xiao,Yang Ji
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICASSP2026

点击查看摘要

Abstract:Prompt-based class-incremental learning methods typically construct a prompt pool consisting of multiple trainable key-prompts and perform instance-level matching to select the most suitable prompt embeddings, which has shown promising results. However, existing approaches face several limitations, including fixed prompt pools, manual selection of prompt embeddings, and strong reliance on the pretrained backbone for prompt selection. To address these issues, we propose a \textbfLayer-importance guided \textbfDual \textbfExpandable \textbfPrompt Pool (\textbfLDEPrompt), which enables adaptive layer selection as well as dynamic freezing and expansion of the prompt pool. Extensive experiments on widely used class-incremental learning benchmarks demonstrate that LDEPrompt achieves state-of-the-art performance, validating its effectiveness and scalability.

[CV-90] Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization

【速读】:该论文旨在解决现代视觉模型中潜在空间(latent space)的优化问题,即如何在保持表示紧凑性的同时提升生成友好性(generation-friendly),以更高效地利用表征容量并增强生成模型的建模能力。解决方案的关键在于提出一种新颖的正则化项(regularizer),通过引导图像分词器(image tokenizer)模仿状态空间模型(state-space model, SSM)的隐藏状态动态特性,从而将SSM固有的频率感知能力(frequency awareness)引入到潜在特征中,使压缩后的潜在表示同时编码精细的空间结构和频域线索,进而提升扩散模型的生成质量,且仅带来最小的重建保真度损失。

链接: https://arxiv.org/abs/2604.11089
作者: Jinsung Lee,Jaemin Oh,Namhun Kim,Dongwon Kim,Byung-Jun Yoon,Suha Kwak
机构: POSTECH(浦项科技大学); Brown University (布朗大学); KAIST(韩国科学技术院); Texas A&M University (德州农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Related blog posts in this https URL : Towards 2-Dimensional State-Space Models series

点击查看摘要

Abstract:Image tokenizers are central to modern vision models as they often operate in latent spaces. An ideal latent space must be simultaneously compact and generation-friendly: it should capture image’s essential content compactly while remaining easy to model with generative approaches. In this work, we introduce a novel regularizer to align latent spaces with these two objectives. The key idea is to guide tokenizers to mimic the hidden state dynamics of state-space models (SSMs), thereby transferring their critical property, frequency awareness, to latent features. Grounded in a theoretical analysis of SSMs, our regularizer enforces encoding of fine spatial structures and frequency-domain cues into compact latent features; leading to more effective use of representation capacity and improved generative modelability. Experiments demonstrate that our method improves generation quality in diffusion models while incurring only minimal loss in reconstruction fidelity.

[CV-91] FlowCoMotion: Text-to-Motion Generation via Token-Latent Flow Modeling

【速读】:该论文旨在解决文本到动作生成(text-to-motion generation)中运动表示学习的问题,现有方法要么使用连续表示导致语义与动力学混杂,要么采用离散表示丢失细粒度运动细节。解决方案的关键在于提出FlowCoMotion框架,通过token-latent耦合机制统一处理连续与离散运动表示:在潜空间分支中引入多视角蒸馏以规范连续潜变量,在token分支中使用离散时间分辨率量化提取高层语义线索,最终通过耦合网络融合两支路特征,并基于文本条件预测速度场,结合常微分方程(ODE)求解器从简单先验引导采样至目标动作状态,从而实现语义对齐且高保真的动作生成。

链接: https://arxiv.org/abs/2604.11083
作者: Dawei Guan,Di Yang,Chengjie Jin,Jiangtao Wang
机构: University of Science and Technology of China(中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 23 pages, 14 figures

点击查看摘要

Abstract:Text-to-motion generation is driven by learning motion representations for semantic alignment with language. Existing methods rely on either continuous or discrete motion representations. However, continuous representations entangle semantics with dynamics, while discrete representations lose fine-grained motion details. In this context, we propose FlowCoMotion, a novel motion generation framework that unifies both treatments from a modeling perspective. Specifically, FlowCoMotion employs token-latent coupling to capture both semantic content and high-fidelity motion details. In the latent branch, we apply multi-view distillation to regularize the continuous latent space, while in the token branch we use discrete temporal resolution quantization to extract high-level semantic cues. The motion latent is then obtained by combining the representations from the two branches through a token-latent coupling network. Subsequently, a velocity field is predicted based on the textual conditions. An ODE solver integrates this velocity field from a simple prior, thereby guiding the sample to the potential state of the target motion. Extensive experiments show that FlowCoMotion achieves competitive performance on text-to-motion benchmarks, including HumanML3D and SnapMoGen.

[CV-92] RESP: Reference-guided Sequential Prompting for Visual Glitch Detection in Video Games

【速读】:该论文旨在解决现代游戏开发中视觉 glitches(视觉异常)检测难以规模化的问题,传统人工质量保证方法无法应对日益增长的游戏测试范围,而现有基于视觉-语言模型(Vision-Language Models, VLMs)的自动化方案多局限于单帧分析或依赖有限的视频级基线,在真实场景变化下表现不稳定,导致视频级异常检测效果不佳。其解决方案的关键在于提出RESP框架,核心思想是“参考引导提示”(reference-guided prompting):在每帧测试时,从同一视频早期选取参考帧建立视觉基准,将检测任务重构为视频内对比而非孤立分类;通过序列化地向VLM输入参考/测试帧对,并聚合噪声较大的帧级预测结果,实现无需微调VLM即可获得稳定可靠的视频级判断。

链接: https://arxiv.org/abs/2604.11082
作者: Yakun Yu,Ashley Wiens,Adrián Barahona-Ríos,Benedict Wilkins,Saman Zadtootaghaj,Nabajeet Barman,Cor-Paul Bezemer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual glitches in video games degrade player experience and perceived quality, yet manual quality assurance cannot scale to the growing test surface of modern game development. Prior automation efforts, particularly those using vision-language models (VLMs), largely operate on single frames or rely on limited video-level baselines that struggle under realistic scene variation, making robust video-level glitch detection challenging. We present RESP, a practical multi-frame framework for gameplay glitch detection with VLMs. Our key idea is reference-guided prompting: for each test frame, we select a reference frame from earlier in the same video, establishing a visual baseline and reframing detection as within-video comparison rather than isolated classification. RESP sequentially prompts the VLM with reference/test pairs and aggregates noisy frame predictions into a stable video-level decision without fine-tuning the VLM. To enable controlled analysis of reference effects, we introduce RefGlitch, a synthetic dataset of manually labeled reference/test frame pairs with balanced coverage across five glitch types. Experiments across five VLMs and three datasets (one synthetic, two real-world) show that reference guidance consistently strengthens frame-level detection and that the improved frame-level evidence reliably transfers to stronger video-level triage under realistic QA conditions. Code and data are available at: \hrefthis https URLthis https URL.

[CV-93] MapATM: Enhancing HD Map Construction through Actor Trajectory Modeling

【速读】:该论文旨在解决高分辨率(HD)地图构建中车道线检测与预测在非理想条件下的准确性下降问题,如视障、远距离车道可见性差及恶劣天气等,这些问题常导致自动驾驶系统中车道检测精度降低和可靠性减弱。解决方案的关键在于提出一种名为MapATM的新型深度神经网络,其创新性地利用历史车辆轨迹信息作为道路几何结构的先验约束,从而提升车道检测性能;实验表明,在挑战性的NuScenes数据集上,该方法使车道分隔线AP提升4.6(相对提高10.1%),车道mAP提升2.6(相对提高6.1%),并展现出在复杂驾驶场景下稳定且鲁棒的地图重建能力。

链接: https://arxiv.org/abs/2604.11081
作者: Mingyang Li,Brian Lee,Rui Zuo,Brent Bacchus,Priyantha Mudalige,Qinru Qiu
机构: Syracuse University (锡拉丘兹大学); General Motors (通用汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 4 figures, 5 tables

点击查看摘要

Abstract:High-definition (HD) mapping tasks, which perform lane detections and predictions, are extremely challenging due to non-ideal conditions such as view occlusions, distant lane visibility, and adverse weather conditions. Those conditions often result in compromised lane detection accuracy and reduced reliability within autonomous driving systems. To address these challenges, we introduce MapATM, a novel deep neural network that effectively leverages historical actor trajectory information to improve lane detection accuracy, where actors refer to moving vehicles. By utilizing actor trajectories as structural priors for road geometry, MapATM achieves substantial performance enhancements, notably increasing AP by 4.6 for lane dividers and mAP by 2.6 on the challenging NuScenes dataset, representing relative improvements of 10.1% and 6.1%, respectively, compared to strong baseline methods. Extensive qualitative evaluations further demonstrate MapATM’s capability to consistently maintain stable and robust map reconstruction across diverse and complex driving scenarios, underscoring its practical value for autonomous driving applications.

[CV-94] ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)量化过程中因激活值异常(activation outliers)导致的精度下降问题,尤其是现有旋转后训练量化(Rotation-based Post-Training Quantization, PTQ)方法在表达能力与推理开销之间的权衡困境。具体而言,全局旋转方法虽具高效性但受限于单一旋转矩阵,表达能力不足;而分层变换方法虽精度高,却因无法将旋转矩阵融合至权重中,需在线计算导致显著延迟。解决方案的关键在于提出ReSpinQuant框架,通过离线激活旋转融合与基于残差子空间旋转的匹配基底机制,在保持分层适应高表达能力的同时,实现近乎零的推理开销,从而在W4A4和W3A3量化下达到最优性能。

链接: https://arxiv.org/abs/2604.11080
作者: Suyoung Kim,Sunghyun Wee,Hyeonjin Kim,Kyomin Hwang,Hyunho Lee,Nojun Kwak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rotation-based Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization of Large Language Models (LLMs). Global rotation methods achieve inference efficiency by fusing activation rotations into attention and FFN blocks, but suffer from limited expressivity as they are constrained to use a single learnable rotation matrix across all layers. To tackle this, layer-wise transformation methods emerged, achieving superior accuracy through localized adaptation. However, layer-wise methods cannot fuse activation rotation matrices into weights, requiring online computations and causing significant overhead. In this paper, we propose ReSpinQuant, a quantization framework that resolves such overhead by leveraging offline activation rotation fusion and matching basis using efficient residual subspace rotation. This design reconciles the high expressivity of layer-wise adaptation with only negligible inference overhead. Extensive experiments on W4A4 and W3A3 quantization demonstrate that ReSpinQuant achieves state-of-the-art performance, outperforming global rotation methods and matching the accuracy of computationally expensive layer-wise methods with minimal overhead.

[CV-95] Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net CVPR2026

【速读】:该论文旨在解决低光照图像增强(Low-Light Image Enhancement, LLIE)中模型参数量大、计算复杂度高且性能受限的问题。其解决方案的关键在于提出一种轻量级两阶段框架:第一阶段采用冻结的算法级预处理,通过提供互补的亮度校正视图来标准化输入分布,使第二阶段的可训练网络能够专注于残差色彩校正;第二阶段使用完全由深度可分离卷积构成的紧凑U-Net结构,显著减少模型参数量的同时保持优异的感知质量。该方法在CVPR 2026 NTIRE高效低光照图像增强挑战赛中获得第四名,验证了其有效性与高效性。

链接: https://arxiv.org/abs/2604.11071
作者: Shimon Murai,Teppei Kurita,Ryuta Satoh,Yusuke Moriuchi
机构: Sony Semiconductor Solutions Corporation(索尼半导体解决方案公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Technical report for the NTIRE 2026 Efficient Low-Light Image Enhancement Challenge (CVPR 2026 Workshops), 4th place solution

点击查看摘要

Abstract:We present a lightweight two-stage framework for low-light image enhancement (LLIE) that achieves competitive perceptual quality with significantly fewer parameters than existing methods. Our approach combines frozen algorithm-based preprocessing with a compact U-Net built entirely from depthwise-separable convolutions. The preprocessing normalizes the input distribution by providing complementary brightness-corrected views, enabling the trainable network to focus on residual color correction. Our method achieved 4th place in the CVPR 2026 NTIRE Efficient Low-Light Image Enhancement Challenge. We further provide extended benchmarks and ablations to demonstrate the general effectiveness of our methods.

[CV-96] A Faster Path to Continual Learning

【速读】:该论文旨在解决持续学习(Continual Learning, CL)中优化方法C-Flat因每轮迭代需额外三次梯度计算而导致训练成本过高的问题。其核心解决方案是提出C-Flat Turbo,关键在于发现一阶平坦性(first-order flatness)相关的梯度具有相对于代理模型(proxy-model)梯度的方向不变性,从而可跳过扰动上升步骤中的冗余梯度计算;同时观察到平坦性促进梯度随任务进展逐渐稳定,据此设计基于线性调度与自适应触发机制的策略,为后期任务分配更大规模的加速步长(turbo steps),在显著降低计算开销(提速1.0×至1.25×)的同时保持或提升模型性能。

链接: https://arxiv.org/abs/2604.11064
作者: Wei Li,Hangjie Yuan,Zixiang Zhao,Borui Kang,Ziwei Liu,Tao Feng
机构: Sichuan University (四川大学); Alibaba Group (阿里巴巴集团); ETH Zürich (苏黎世联邦理工学院); Nanjing University (南京大学); Nanyang Technological University (南洋理工大学); Tsinghua University (清华大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Continual Learning (CL) aims to train neural networks on a dynamic stream of tasks without forgetting previously learned knowledge. Among optimization-based approaches, C-Flat has emerged as a promising solution due to its plug-and-play nature and its ability to encourage uniformly low-loss regions for both new and old tasks. However, C-Flat requires three additional gradient computations per iteration, imposing substantial overhead on the optimization process. In this work, we propose C-Flat Turbo, a faster yet stronger optimizer that significantly reduces the training cost. We show that the gradients associated with first-order flatness contain direction-invariant components relative to the proxy-model gradients, enabling us to skip redundant gradient computations in the perturbed ascent steps. Moreover, we observe that these flatness-promoting gradients progressively stabilize across tasks, which motivates a linear scheduling strategy with an adaptive trigger to allocate larger turbo steps for later tasks. Experiments show that C-Flat Turbo is 1.0 \times to 1.25 \times faster than C-Flat across a wide range of CL methods, while achieving comparable or even improved accuracy.

[CV-97] Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agent ic Harmonization

【速读】:该论文旨在解决多源数据集在微调目标检测(Object Detection, OD)模型时因标注不一致导致的性能下降问题,尤其是当不同数据集对语义相同类别采用不同的空间定义或边界框粒度时,会破坏模型学习到的特征表示结构。其解决方案的关键在于提出一种基于视觉-语言模型(Vision-Language Model, VLM)的代理式标签统一工作流(agentic label harmonization workflow),通过VLM自动识别并协调跨数据集的类别语义与边界框粒度差异,在训练前完成标注层面的对齐,从而提升模型在复杂场景(如文档版面检测)下的检测精度与结构一致性。实验证明,该方法显著改善了检测F-score、表格结构保留率(TEDS)及边界框重叠的一致性指标,并使后解码嵌入空间更加紧凑和可分,验证了标注一致性对模型表征质量的核心影响。

链接: https://arxiv.org/abs/2604.11042
作者: Renyu Li,Vladimir Kirilenko,Yao You,Crag Wolfe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 6 figures, 5 tables

点击查看摘要

Abstract:Fine-tuning object detection (OD) models on combined datasets assumes annotation compatibility, yet datasets often encode conflicting spatial definitions for semantically equivalent categories. We propose an agentic label harmonization workflow that uses a vision-language model to reconcile both category semantics and bounding box granularity across heterogeneous sources before training. We evaluate on document layout detection as a challenging case study, where annotation standards vary widely across corpora. Without harmonization, naïve mixed-dataset fine-tuning degrades a pretrained RT-DETRv2 detector: on SCORE-Bench, which measures how accurately the full document conversion pipeline reproduces ground-truth structure, table TEDS drops from 0.800 to 0.750. Applied to two corpora whose 16 and 10 category taxonomies share only 8 direct correspondences, harmonization yields consistent gains across content fidelity, table structure, and spatial consistency: detection F-score improves from 0.860 to 0.883, table TEDS improves to 0.814, and mean bounding box overlap drops from 0.043 to 0.016. Representation analysis further shows that harmonized training produces more compact and separable post-decoder embeddings, confirming that annotation inconsistency distorts the learned feature space and that resolving it before training restores representation structure.

[CV-98] EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates

【速读】:该论文旨在解决从第一人称视角视频中建模可交互3D物体的问题,这在具身人工智能(Embodied AI)领域具有重要意义,但现有数据稀缺且缺乏对物体功能关系的系统性表征。其关键解决方案是提出了一种基于功能模板(function template)的结构化计算表示方法,用于捕捉跨部件的功能映射关系(如灶钮旋转控制灶火温度),从而实现从视频输入到仿真可用3D模型的端到端生成,并支持精确评估与跨平台代码编译。为此,作者构建了包含271段真实世界交互视频的数据集,标注了2D/3D分割、关节运动及功能模板信息,并设计了一个四阶段处理流程:2D部件分割、三维重建、关节估计和功能模板推断,显著提升了任务的可量化性和实用性。

链接: https://arxiv.org/abs/2604.11038
作者: Weikun Peng,Denys Iliash,Manolis Savva
机构: Simon Fraser University (西蒙菲莎大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL

点击查看摘要

Abstract:We present EgoFun3D, a coordinated task formulation, dataset, and benchmark for modeling interactive 3D objects from egocentric videos. Interactive objects are of high interest for embodied AI but scarce, making modeling from readily available real-world videos valuable. Our task focuses on obtaining simulation-ready interactive 3D objects from egocentric video input. While prior work largely focuses on articulations, we capture general cross-part functional mappings (e.g., rotation of stove knob controls stove burner temperature) through function templates, a structured computational representation. Function templates enable precise evaluation and direct compilation into executable code across simulation platforms. To enable comprehensive benchmarking, we introduce a dataset of 271 egocentric videos featuring challenging real-world interactions with paired 3D geometry, segmentation over 2D and 3D, articulation and function template annotations. To tackle the task, we propose a 4-stage pipeline consisting of: 2D part segmentation, reconstruction, articulation estimation, and function template inference. Comprehensive benchmarking shows that the task is challenging for off-the-shelf methods, highlighting avenues for future work.

[CV-99] st-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在细粒度视觉推理中因“接地悖论”(Grounding Paradox)导致的脆弱性问题,即模型在缺乏必要证据前需决定关注图像区域,从而引发决策与感知之间的循环依赖。解决方案的关键在于提出测试时感知缩放(Test-Time Scaling over Perception, TTSP),将感知过程视为可扩展的推理流程:通过生成多个探索性感知轨迹、利用基于熵的置信度估计过滤不可靠轨迹、将验证后的观察结果提炼为结构化知识,并迭代优化后续探索以缓解未解不确定性。该方法显著提升了模型在高分辨率和通用多模态推理基准上的表现,同时展现出良好的可扩展性和token效率。

链接: https://arxiv.org/abs/2604.11025
作者: Zheng Jiang,Yiming Chen,Nan He,Jiahui Chen,Chaoyang Li,Houde Qian,Lifeng Sun
机构: Tsinghua University (清华大学); Beijing University of Technology (北京工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent multimodal large language models (MLLMs) have begun to support Thinking with Images by invoking visual tools such as zooming and cropping during inference. Yet these systems remain brittle in fine-grained visual reasoning because they must decide where to look before they have access to the evidence needed to make that decision correctly. We identify this circular dependency as the Grounding Paradox. To address it, we propose Test-Time Scaling over Perception (TTSP), a framework that treats perception itself as a scalable inference process. TTSP generates multiple exploratory perception traces, filters unreliable traces using entropy-based confidence estimation, distills validated observations into structured knowledge, and iteratively refines subsequent exploration toward unresolved uncertainty. Extensive experiments on high-resolution and general multimodal reasoning benchmarks show that TTSP consistently outperforms strong baselines across backbone sizes, while also exhibiting favorable scalability and token efficiency. Our results suggest that scaling perception at test time is a promising direction for robust multimodal reasoning under perceptual uncertainty.

[CV-100] UHD-GPGNet: UHD Video Denoising via Gaussian-Process-Guided Local Spatio-Temporal Modeling

【速读】:该论文旨在解决超高清(Ultra-high-definition, UHD)视频去噪中同时抑制复杂时空退化、保持精细纹理与色度稳定性,并实现高效全分辨率4K部署的难题。其核心解决方案是提出UHD-GPGNet框架,该框架基于高斯过程(Gaussian Process, GP)引导的局部时空去噪机制,通过在紧凑的时空描述子上估计稀疏GP后验统计量,显式建模局部退化响应与不确定性,从而指导自适应时域细节融合;此外,采用结构-色彩协同重建头解耦亮度、色度与高频修正,并结合异方差目标函数和重叠分块推理策略,显著提升优化稳定性并支持内存受限条件下的4K实时推理。

链接: https://arxiv.org/abs/2604.11014
作者: Weiyuan He,Chen Wu,Pengwen Dai,Wei Wang,Dianjie Lu,Guijuan Zhang,Linwei Fan,Yongzhen Wang,Zhuoran Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ultra-high-definition (UHD) video denoising requires simultaneously suppressing complex spatio-temporal degradations, preserving fine textures and chromatic stability, and maintaining efficient full-resolution 4K deployment. In this paper, we propose UHD-GPGNet, a Gaussian-process-guided local spatio-temporal denoising framework that addresses these requirements jointly. Rather than relying on implicit feature learning alone, the method estimates sparse GP posterior statistics over compact spatio-temporal descriptors to explicitly characterize local degradation response and uncertainty, which then guide adaptive temporal-detail fusion. A structure-color collaborative reconstruction head decouples luminance, chroma, and high-frequency correction, while a heteroscedastic objective and overlap-tiled inference further stabilize optimization and enable memory-bounded 4K deployment. Experiments on UVG and RealisVideo-4K show that UHD-GPGNet achieves competitive restoration fidelity with substantially fewer parameters than existing methods, enables real-time full-resolution 4K inference with significant speedup over the closest quality competitor, and maintains robust performance across a multi-level mixed-degradation schedule.A real-world study on phone-captured 4K video further confirms that the model, trained entirely on synthetic degradation, generalizes to unseen real sensor noise and improves downstream object detection under challenging conditions.

[CV-101] Byte-level generative predictions for forensics multimedia carving

【速读】:该论文旨在解决数字取证中从无文件系统元数据的未分配磁盘空间中恢复碎片化多媒体文件的问题。传统文件 carving 方法依赖签名匹配和判别式深度学习模型进行片段分类,但无法重建或预测缺失数据。解决方案的关键在于提出一种基于 bGPT 的生成式方法,该模型是一个字节级 Transformer 架构,用于执行下一字节预测任务;通过输入部分 BMP 图像数据,模拟生成可能的片段延续内容,并利用余弦相似度、结构相似性指数(SSIM)、卡方距离和 Jensen-Shannon 散度(JSD)等指标评估预测结果的保真度,从而有效支持碎片匹配与恢复。

链接: https://arxiv.org/abs/2604.11010
作者: Jaewon Lee,Md Eimran Hossain Eimon,Avinash Srinivasan,Hari Kalva
机构: United States Naval Academy (美国海军学院); Florida Atlantic University (佛罗里达大西洋大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the “SPIE Defense + Security” Conference

点击查看摘要

Abstract:Digital forensic investigations often face significant challenges when recovering fragmented multimedia files that lack file system metadata. While traditional file carving relies on signatures and discriminative deep learning models for fragment classification, these methods cannot reconstruct or predict missing data. We propose a generative approach to multimedia carving using bGPT, a byte-level transformer designed for next-byte prediction. By feeding partial BMP image data into the model, we simulate the generation of likely fragment continuations. We evaluate the fidelity of these predictions using different metrics, namely, cosine similarity, structural similarity index (SSIM), chi-square distance, and Jensen-Shannon divergence (JSD). Our findings demonstrate that generative models can effectively predict byte-level patterns to support fragment matching in unallocated disk space.

[CV-102] Data-Efficient Semantic Segmentation of 3D Point Clouds via Open-Vocabulary Image Segmentation-based Pseudo-Labeling

【速读】:该论文旨在解决三维点云语义分割中因数据不足导致的模型训练难题,具体包括三个并发问题:训练场景稀缺、点级标注稀少以及缺乏用于重建点云的二维图像序列。现有方法通常仅能应对其中一到两个挑战,无法协同处理全部三类数据匮乏问题。解决方案的关键在于提出一种名为PLOVIS(Point pseudo-Labeling via Open-Vocabulary Image Segmentation)的数据高效训练框架,其核心创新是利用开放词汇图像分割(Open-Vocabulary Image Segmentation, OVIS)模型直接从3D点云生成伪标签,无需依赖原始2D图像序列;同时引入两阶段伪标签过滤机制与类别平衡的记忆库策略,有效降低伪标签噪声和类别不平衡对训练的影响,从而在极低标注成本下显著提升分割性能。

链接: https://arxiv.org/abs/2604.11007
作者: Takahiko Furuya
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semantic segmentation of 3D point cloud scenes is a crucial task for various applications. In real-world scenarios, training segmentation models often faces three concurrent forms of data insufficiency: scarcity of training scenes, scarcity of point-level annotations, and absence of 2D image sequences from which point clouds were reconstructed. Existing data-efficient algorithms typically address only one or two of these challenges, leaving the joint treatment of all three unexplored. This paper proposes a data-efficient training framework specifically designed to address the three forms of data insufficiency. Our proposed algorithm, called Point pseudo-Labeling via Open-Vocabulary Image Segmentation (PLOVIS), leverages an Open-Vocabulary Image Segmentation (OVIS) model as a pseudo label generator to compensate for the lack of training data. PLOVIS creates 2D images for pseudo-labeling directly from training 3D point clouds, eliminating the need for 2D image sequences. To mitigate the inherent noise and class imbalance in pseudo labels, we introduce a two-stage filtering of pseudo labels combined with a class-balanced memory bank for effective training. The two-stage filtering mechanism first removes low-confidence pseudo labels, then discards likely incorrect pseudo labels, thereby enhancing the quality of pseudo labels. Experiments on four benchmark datasets, i.e., ScanNet, S3DIS, Toronto3D, and Semantic3D, under realistic data-scarce conditions (a few tens of training 3D scenes, each annotated with only 100 3D points) demonstrate that PLOVIS consistently outperforms existing methods including standard fine-tuning strategies and state-of-the-art weakly supervised learning algorithms. Code will be made publicly available.

[CV-103] owards Realistic 3D Emission Materials: Dataset Baseline and Evaluation for Emission Texture Generation

【速读】:该论文旨在解决现有3D纹理生成方法在处理高亮发光材质(如LED灯效)时的局限性,这些方法通常仅能生成非自发光的基于物理渲染(Physically Based Rendering, PBR)材质(如漫反射贴图、金属度贴图和粗糙度贴图),难以复现诸如赛博朋克等流行风格中的发光效果。其解决方案的关键在于提出了一项新任务——发射纹理生成(emission texture generation),并构建了首个包含4万件高质量带发射材质的3D资产数据集Objaverse-Emission;同时设计了EmissionGen基线模型与针对性评估指标,从而实现了从参考图像中合成具有真实感发光特性的3D纹理,显著提升了工业应用潜力。

链接: https://arxiv.org/abs/2604.11006
作者: Zhiyuan Zhang,Zijian Zhou,Linjun Li,Long Chen,Hao Tang,Yichen Gong
机构: Peking University (北京大学); King’s College London (伦敦国王学院); XG Tech (XG科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Dataset will be available at this https URL

点击查看摘要

Abstract:3D texture generation is receiving increasing attention, as it enables the creation of realistic and aesthetic texture materials for untextured 3D meshes. However, existing 3D texture generation methods are limited to producing only a few types of non-emissive PBR materials (e.g., albedo, metallic maps and roughness maps), making them difficult to replicate highly popular styles, such as cyberpunk, failing to achieve effects like realistic LED emissions. To address this limitation, we propose a novel task, emission texture generation, which enables the synthesized 3D objects to faithfully reproduce the emission materials from input reference images. Our key contributions include: first, We construct the Objaverse-Emission dataset, the first dataset that contains 40k 3D assets with high-quality emission materials. Second, we propose EmissionGen, a novel baseline for the emission texture generation task. Third, we define detailed evaluation metrics for the emission texture generation task. Our results demonstrate significant potential for future industrial applications. Dataset will be available at this https URL.

[CV-104] Panoptic Pairwise Distortion Graph ICLR2026

【速读】:该论文旨在解决现有图像质量评估方法在处理成对图像时缺乏细粒度区域级理解的问题,传统方法通常基于整图分析,难以捕捉局部退化类型、严重程度及相对质量差异等结构化信息。其解决方案的关键在于提出一种新的任务范式——失真图(Distortion Graph, DG),将图像对建模为基于区域的结构化拓扑,以紧凑且可解释的图结构表示包括退化类型、严重性、对比关系和质量评分在内的密集退化信息。为实现该任务,作者构建了区域级数据集PandaSet、具有不同难度的基准测试套件PandaBench,并设计了高效架构Panda用于生成DG,实验证明该方法能有效激发模型对区域级退化的理解能力,为精细化、结构化的图像对评估开辟了新方向。

链接: https://arxiv.org/abs/2604.11004
作者: Muhammad Kamran Janjua,Abdul Wahab,Bahador Rashidi
机构: Huawei Technologies, Canada(华为技术加拿大)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ICLR 2026

点击查看摘要

Abstract:In this work, we introduce a new perspective on comparative image assessment by representing an image pair as a structured composition of its regions. In contrast, existing methods focus on whole image analysis, while implicitly relying on region-level understanding. We extend the intra-image notion of a scene graph to inter-image, and propose a novel task of Distortion Graph (DG). DG treats paired images as a structured topology grounded in regions, and represents dense degradation information such as distortion type, severity, comparison and quality score in a compact interpretable graph structure. To realize the task of learning a distortion graph, we contribute (i) a region-level dataset, PandaSet, (ii) a benchmark suite, PandaBench, with varying region-level difficulty, and (iii) an efficient architecture, Panda, to generate distortion graphs. We demonstrate that PandaBench poses a significant challenge for state-of-the-art multimodal large language models (MLLMs) as they fail to understand region-level degradations even when fed with explicit region cues. We show that training on PandaSet or prompting with DG elicits region-wise distortion understanding, opening a new direction for fine-grained, structured pairwise image assessment.

[CV-105] raversalBench: Challenging Paths to Follow for Vision Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在复杂视觉路径追踪任务中表现不足的问题,尤其是当人类观察者能轻松完成的连续路径重建任务时,现有模型仍存在显著性能瓶颈。其解决方案的核心是提出TraversalBench——一个受控的基准测试集,专门用于评估模型对精确视觉路径的追踪能力。该基准通过严格控制路径结构因素(如自相交次数、弯曲度、顶点数量及邻近干扰线)来隔离关键变量,并最小化对光学字符识别(OCR)、世界知识和开放式规划的依赖。实验发现,自相交是导致错误的主要来源,且错误具有局部性:模型在首次交叉前表现稳定,之后性能骤降;而邻近干扰线则引发持续但较弱的性能下降。这一诊断性设计使TraversalBench成为识别模型是否出现类人失败或持续视觉处理崩溃的有效工具,从而推动了对多模态空间推理与模糊性、杂乱和干扰结构下视觉接地能力的研究。

链接: https://arxiv.org/abs/2604.10999
作者: Clara Petrova,Zhuo Chen,Marin Soljačić
机构: Massachusetts Institute of Technology, Department of Physics (麻省理工学院物理系); Massachusetts Institute of Technology, Institute for Data, Systems, and Society (麻省理工学院数据、系统与社会研究所); NSF AI Institute for Artificial Intelligence and Fundamental Interactions (国家科学基金会人工智能与基础相互作用研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) perform strongly on many multimodal benchmarks. However, the ability to follow complex visual paths – a task that human observers typically find straightforward – remains under-tested. We introduce TraversalBench, a controlled benchmark for exact visual path traversal. Each instance contains a single continuous polyline, a unique start marker, and markers placed at path vertices; the task is to recover the exact ordered sequence encountered when traversing the path from start to finish. The benchmark explicitly balances key path-structural factors including self-intersection count, tortuosity, vertex count, and nearby confounding lines, while minimizing reliance on OCR, world knowledge, and open-ended planning. We find that self-intersections are the dominant source of difficulty. A first-crossing analysis shows that errors are sharply localized: performance is relatively stable immediately before the first crossing, then drops steeply when the model must resolve the correct continuation. By contrast, nearby confounding lines produce a weaker persistent degradation that compounds with repeated exposure. These analyses make TraversalBench a useful diagnostic for identifying whether models suffer from human-like failures or other breakdowns in sustained visual processing. An auxiliary reading-order benchmark further reveals a consistent preference for layouts compatible with left-to-right serialization, while not explaining away the main effects of path complexity. Together, these results position TraversalBench as a controlled diagnostic of path-faithful visual reasoning and as a useful testbed for studying multimodal spatial reasoning under ambiguity, clutter, and distractor structure. More broadly, we position TraversalBench as a contribution to the still-limited area of sustained visual grounding benchmarks for VLMs.

[CV-106] LumiMotion: Improving Gaussian Relighting with Scene Dynamics CVPR2026

【速读】:该论文旨在解决3D重建中逆渲染(inverse rendering)问题,即从图像中准确分离场景的光照信息与材质属性(albedo),而现有基于高斯点阵(Gaussian Splatting)的方法通常局限于静态场景并假设简化光照条件,难以在真实复杂光照下有效解耦光照与材质效应。其解决方案的关键在于利用动态区域(dynamic regions)作为监督信号:由于运动物体在不同光照条件下仍呈现相同表面,这种动态变化提供了更强的约束来区分材质和光照影响。作者提出LumiMotion方法,首次将动态信息引入高斯点阵框架,通过设计新颖的约束机制使动态区域发生形变而静态区域保持稳定,从而实现对材质和光照的正确优化。实验表明,该方法在albedo估计上比最优基线提升23% LPIPS,在场景重照明任务上提升15%。

链接: https://arxiv.org/abs/2604.10994
作者: Joanna Kaleta,Piotr Wójcik,Kacper Marzol,Tomasz Trzciński,Kacper Kania,Marek Kowalski
机构: Warsaw University of Technology (华沙理工大学); Sano Centre for Computational Medicine (Sano计算医学中心); Institute for Biomedical Informatics, Faculty of Medicine and University Hospital Cologne, University of Cologne (科隆大学医学院生物医学信息研究所及大学医院); Faculty of Mathematics and Natural Sciences, University of Cologne (科隆大学数学与自然科学学院); Center for Molecular Medicine Cologne (CMMC), Faculty of Medicine and University Hospital Cologne, University of Cologne (科隆大学分子医学中心); Jagiellonian University (雅盖隆大学); IDEAS Research Institute (IDEAS研究 institute); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2026

点击查看摘要

Abstract:In 3D reconstruction, the problem of inverse rendering, namely recovering the illumination of the scene and the material properties, is fundamental. Existing Gaussian Splatting-based methods primarily target static scenes and often assume simplified or moderate lighting to avoid entangling shadows with surface appearance. This limits their ability to accurately separate lighting effects from material properties, particularly in real-world conditions. We address this limitation by leveraging dynamic elements - regions of the scene that undergo motion - as a supervisory signal for inverse rendering. Motion reveals the same surfaces under varying lighting conditions, providing stronger cues for disentangling material and illumination. This thesis is supported by our experimental results which show we improve LPIPS by 23% for albedo estimation and by 15% for scene relighting relative to next-best baseline. To this end, we introduce LumiMotion, the first Gaussian-based approach that leverages dynamics for inverse rendering and operates in arbitrary dynamic scenes. Our method learns a dynamic 2D Gaussian Splatting representation that employs a set of novel constraints which encourage the dynamic regions of the scene to deform, while keeping static regions stable. As we demonstrate, this separation is crucial for correct optimization of the albedo. Finally, we release a new synthetic benchmark comprising five scenes under four lighting conditions, each in both static and dynamic variants, for the first time enabling systematic evaluation of inverse rendering methods in dynamic environments and challenging lighting. Link to project page: this https URL

[CV-107] ArtiCAD: Articulated CAD Assembly Design via Multi-Agent Code Generation

【速读】:该论文旨在解决从高阶描述(如文本或图像)中自动生成可编辑的、具有运动关系的多部件装配体(articulated CAD assemblies)这一尚未被充分探索的问题。其核心挑战在于如何在不依赖训练数据的前提下,实现几何结构与装配关系的协同生成,并克服当前大语言模型(LLM)和视觉语言模型(VLM)在空间推理能力上的局限性。解决方案的关键在于提出一个无需训练的多智能体系统ArtiCAD,通过四个专业化代理(Design、Generation、Assembly、Review)分工协作,其中最核心的创新是:在设计阶段即预测装配关系,而非传统方式在几何生成后再进行装配;为此引入了一个显式定义连接点与关节参数的“Connector”,从而提前确定装配逻辑,有效规避了LLM/VLM的空间推理瓶颈。此外,系统还集成生成与装配阶段的验证机制及跨阶段回滚策略,确保输出质量,并通过自进化经验存储持续优化性能。

链接: https://arxiv.org/abs/2604.10992
作者: Yuan Shui,Yandong Guan,Zhanwei Zhang,Juncheng Hu,Jing Zhang,Dong Xu,Qian Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Parametric Computer-Aided Design (CAD) of articulated assemblies is essential for product development, yet generating these multi-part, movable models from high-level descriptions remains unexplored. To address this, we propose ArtiCAD, the first training-free multi-agent system capable of generating editable, articulated CAD assemblies directly from text or images. Our system divides this complex task among four specialized agents: Design, Generation, Assembly, and Review. One of our key insights is to predict assembly relationships during the initial design stage rather than the assembly stage. By utilizing a Connector that explicitly defines attachment points and joint parameters, ArtiCAD determines these relationships before geometry generation, effectively bypassing the limited spatial reasoning capabilities of current LLMs and VLMs. To further ensure high-quality outputs, we introduce validation steps in the generation and assembly stages, accompanied by a cross-stage rollback mechanism that accurately isolates and corrects design- and code-level errors. Additionally, a self-evolving experience store accumulates design knowledge to continuously improve performance on future tasks. Extensive evaluations on three datasets (ArtiCAD-Bench, CADPrompt, and ACD) validate the effectiveness of our approach. We further demonstrate the applicability of ArtiCAD in requirement-driven conceptual design, physical prototyping, and the generation of embodied AI training assets through URDF export.

[CV-108] WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

【速读】:该论文旨在解决现有浏览器代理(browser agent)评估基准面临的三难困境:真实网站基准因内容漂移而缺乏可复现性,受控环境虽具备可复现性却因忽略真实网络噪声而牺牲了现实性,且两者均需高昂的人工标注成本,限制了扩展性。解决方案的关键在于提出 WebForge——首个完全自动化的框架,通过四阶段代理流水线(规划、生成、优化与验证)端到端构建交互式、自包含的网页环境,无需人工标注;同时引入七维难度控制机制,从导航深度、视觉复杂度、推理难度等多个维度结构化任务设计,实现超越单一综合得分的系统性能力画像。

链接: https://arxiv.org/abs/2604.10988
作者: Peng Yuan,Yuyang Yin,Yuxuan Cai,Zheng Wei
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 6 figures, 6 tables, plus 29-page supplementary. Code: this https URL Dataset: this https URL

点击查看摘要

Abstract:Existing browser agent benchmarks face a fundamental trilemma: real-website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation that limits scalability. We present WebForge, the first fully automated framework that resolves this trilemma through a four-agent pipeline – Plan, Generate, Refine, and Validate – that produces interactive, self-contained web environments end-to-end without human annotation. A seven-dimensional difficulty control framework structures task design along navigation depth, visual complexity, reasoning difficulty, and more, enabling systematic capability profiling beyond single aggregate scores. Using WebForge, we construct WebForge-Bench, a benchmark of 934 tasks spanning 7 domains and 3 difficulty levels. Multi-model experiments show that difficulty stratification effectively differentiates model capabilities, while cross-domain analysis exposes capability biases invisible to aggregate metrics. Together, these results confirm that multi-dimensional evaluation reveals distinct capability profiles that a single aggregate score cannot capture. Code and benchmark are publicly available at this https URL.

[CV-109] Energy-oriented Diffusion Bridge for Image Restoration with Foundational Diffusion Models ICLR26

【速读】:该论文旨在解决扩散桥接模型(Diffusion Bridge Models)在图像恢复任务中因依赖复杂且高成本轨迹而导致的采样效率低和最终恢复质量受限的问题。解决方案的关键在于提出了一种面向能量优化的扩散桥接框架(Energy-oriented diffusion Bridge, E-Bridge),其核心创新包括:设计一种新的桥接过程,在更短的时间范围内演化,并从熵正则化点(混合退化图像与高斯噪声)开始反向过程,理论上降低了所需轨迹的能量;同时借鉴一致性模型(Consistency Models)的思想,通过连续时间一致性目标优化单步映射函数,从而解析地将轨迹上的任意状态映射到目标图像。此外,该框架将轨迹长度设为可调的任务自适应参数,使模型能根据退化程度(如去噪与超分辨率)动态平衡信息保留与生成能力。

链接: https://arxiv.org/abs/2604.10983
作者: Jinhui Hou,Zhiyu Zhu,Junhui Hou
机构: City University of Hong Kong (香港城市大学); City University of Hong Kong (东莞) (香港城市大学(东莞))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICLR26

点击查看摘要

Abstract:Diffusion bridge models have shown great promise in image restoration by explicitly connecting clean and degraded image distributions. However, they often rely on complex and high-cost trajectories, which limit both sampling efficiency and final restoration quality. To address this, we propose an Energy-oriented diffusion Bridge (E-Bridge) framework to approximate a set of low-cost manifold geodesic trajectories to boost the performance of the proposed method. We achieve this by designing a novel bridge process that evolves over a shorter time horizon and makes the reverse process start from an entropy-regularized point that mixes the degraded image and Gaussian noise, which theoretically reduces the required trajectory energy. To solve this process efficiently, we draw inspiration from consistency models to learn a single-step mapping function, optimized via a continuous-time consistency objective tailored for our trajectory, so as to analytically map any state on the trajectory to the target image. Notably, the trajectory length in our framework becomes a tunable task-adaptive knob, allowing the model to adaptively balance information preservation against generative power for tasks of varying degradation, such as denoising versus super-resolution. Extensive experiments demonstrate that our E-Bridge achieves state-of-the-art performance across various image restoration tasks while enabling high-quality recovery with a single or fewer sampling steps. Our project page is this https URL.

[CV-110] MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models CVPR2026

【速读】:该论文旨在解决工业场景下通用异常检测(General Anomaly Detection, GAD)中模型泛化能力不足的问题,即如何训练一个无需针对新类别重新训练或微调即可直接检测多种新颖类别的异常的通用模型。当前主流多模态大语言模型(Multimodal Large Language Models, MLLMs)虽具备强大的视觉理解与语言推理能力,但其在异常检测任务上的表现仍受限于预训练数据与工业场景数据之间的分布差异,以及现有异常检测数据集主要为图像形式、不适用于MLLM后训练的问题。解决方案的关键在于构建一个专门面向MLLM的异常检测基准MRR-AD,该基准涵盖训练与评估所需的数据资源,并基于此提出Anomaly-R1基线模型——该模型通过在MRR-AD中学习思维链(Chain-of-Thought, CoT)数据并引入强化学习进行优化,在异常检测与定位性能上显著优于当前最先进的通用型MLLMs。

链接: https://arxiv.org/abs/2604.10971
作者: Xincheng Yao,Zefeng Qian,Chao Shi,Jiayang Song,Chongyang Zhang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR2026

点击查看摘要

Abstract:In the progress of industrial anomaly detection, general anomaly detection (GAD) is an emerging trend and also the ultimate goal. Unlike the conventional single- and multi-class AD, general AD aims to train a general AD model that can directly detect anomalies in diverse novel classes without any retraining or fine-tuning on the target data. Recently, Multimodal Large Language Models (MLLMs) have shown great promise in achieving general anomaly detection due to their revolutionary visual understanding and language reasoning capabilities. However, MLLM’s general AD ability remains underexplored due to: (1) MLLMs are pretrained on amounts of data sourced from the Web, these data still have significant gaps with the data in AD scenarios. Moreover, the image-text pairs during pretraining are also not specifically for AD tasks. (2) The current mainstream AD datasets are image-based and not yet suitable for post-training MLLMs. To facilitate MLLM-based general AD research, we present MMR-AD, which is a comprehensive benchmark for both training and evaluating MLLM-based AD models. With MMR-AD, we reveal that the AD performance of current SOTA generalist MLLMs still falls far behind the industrial requirements. Based on MMR-AD, we also propose a baseline model, Anomaly-R1, which is a reasoning-based AD model that learns from the CoT data in MMR-AD and is further enhanced by reinforcement learning. Extensive experiments show that our Anomaly-R1 achieves remarkable improvements over generalist MLLMs in both anomaly detection and localization.

[CV-111] Using Deep Learning Models Pretrained by Self-Supervised Learning for Protein Localization

【速读】:该论文旨在解决小样本任务特定显微图像数据集在训练深度学习模型时难以学习鲁棒特征的问题。其关键解决方案是利用自监督学习(Self-Supervised Learning, SSL)方法,特别是基于DINO的视觉Transformer(Vision Transformer, ViT)骨干网络,在大规模领域相关数据集(如HPA FOV和ImageNet-1k)上进行预训练,从而获得可迁移性强的特征表示。实验表明,即使不进行微调(zero-shot),这些预训练模型也能在OpenCell数据集上实现优异性能(宏平均F₁达0.822),进一步微调后性能提升至0.860;且在单细胞层面,HPA单细胞预训练模型在不同邻域大小下均展现出最优的k近邻分类表现(宏平均F₁ ≥ 0.796),验证了SSL在小样本显微图像分析中的有效性与泛化能力。

链接: https://arxiv.org/abs/2604.10970
作者: Ben Isselmann,Dilara Göksu,Heinz Neumann,Andreas Weinmann
机构: Hochschule Darmstadt (达姆施塔特应用技术大学); THWS (威斯巴登应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 8 figures, submitted to BMC Bioinformatics

点击查看摘要

Abstract:Background: Task-specific microscopy datasets are often small, making it difficult to train deep learning models that learn robust features. While self-supervised learning (SSL) has shown promise through pretraining on large, domain-specific datasets, generalizability across datasets with differing staining protocols and channel configurations remains underexplored. We investigated the generalizability of SSL models pretrained on ImageNet-1k and HPA FOV, evaluating their embeddings on OpenCell with and without fine-tuning, two channel-mismatch strategies, and varying fine-tuning data fractions. We additionally analyzed single-cell embeddings on a labeled OpenCell subset. Result: DINO-based ViT backbones pretrained on HPA FOV or ImageNet-1k transfer well to OpenCell even without fine-tuning. The HPA FOV-pretrained model achieved the highest zero-shot performance (macro F_1 0.822 \pm 0.007). Fine-tuning further improved performance to 0.860 \pm 0.013. At the single-cell level, the HPA single-cell-pretrained model achieved the highest k-nearest neighbor performance across all neighborhood sizes (macro F_1 \geq 0.796). Conclusion: SSL methods like DINO, pretrained on large domain-relevant datasets, enable effective use of deep learning features for fine-tuning on small, task-specific microscopy datasets. Comments: 29 pages, 8 figures, submitted to BMC Bioinformatics Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2604.10970 [cs.CV] (or arXiv:2604.10970v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.10970 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-112] owards Automated Solar Panel Integrity: Hybrid Deep Feature Extraction for Advanced Surface Defect Identification

【速读】:该论文旨在解决大规模光伏(Photovoltaic, PV)面板在发电厂中因人工巡检效率低、成本高且易出错而导致的缺陷检测难题。为实现连续监测、早期故障识别和最大功率输出,研究提出了一种融合手工特征与深度学习特征的混合缺陷检测方法:关键在于利用局部二值模式(Local Binary Pattern, LBP)、方向梯度直方图(Histogram of Gradients, HoG)和Gabor滤波器提取手工特征,同时采用DenseNet-169网络提取深层特征,将两类特征拼接后输入支持向量机(Support Vector Machine, SVM)、极端梯度提升(Extreme Gradient Boost, XGBoost)和轻量梯度提升机(Light Gradient-Boosting Machine, LGBM)三种分类器进行决策。实验表明,DenseNet-169 + Gabor(SVM)组合达到99.17%的准确率,显著优于其他模型,验证了该混合框架在实际自动化光伏监控系统中的有效性、鲁棒性和灵活性。

链接: https://arxiv.org/abs/2604.10969
作者: Muhammad Junaid Asif,Muhammad Saad Rafaqat,Usman Nazakat,Uzair Khan,Rana Fayyaz Ahmad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:To ensure energy efficiency and reliable operations, it is essential to monitor solar panels in generation plants to detect defects. It is quite labor-intensive, time consuming and costly to manually monitor large-scale solar plants and those installed in remote areas. Manual inspection may also be susceptible to human errors. Consequently, it is necessary to create an automated, intelligent defect-detection system, that ensures continuous monitoring, early fault detection, and maximum power generation. We proposed a novel hybrid method for defect detection in SOLAR plates by combining both handcrafted and deep learning features. Local Binary Pattern (LBP), Histogram of Gradients (HoG) and Gabor Filters were used for the extraction of handcrafted features. Deep features extracted by leveraging the use of DenseNet-169. Both handcrafted and deep features were concatenated and then fed to three distinct types of classifiers, including Support Vector Machines (SVM), Extreme Gradient Boost (XGBoost) and Light Gradient-Boosting Machine (LGBM). Experimental results evaluated on the augmented dataset show the superior performance, especially DenseNet-169 + Gabor (SVM), had the highest scores with 99.17% accuracy which was higher than all the other systems. In general, the proposed hybrid framework offers better defect-detection accuracy, resistance, and flexibility that has a solid basis on the real-life use of the automated PV panels monitoring system.

[CV-113] You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass

【速读】:该论文旨在解决传统判别式奖励模型(Discriminative Reward Model, RM)在多候选响应评分时效率低下的问题,即现有方法需对每个候选响应独立进行前向传播,导致计算开销大、训练速度慢。其解决方案的关键在于提出一种多响应联合评分机制(N-way reward evaluation),通过将多个候选响应用分隔符连接后输入模型,在单次前向传播中同时输出各响应的标量得分,并基于交叉熵损失实现直接的多路偏好学习。该设计不仅显著提升了推理效率(最多获得 N×N\times 墙-clock 时间加速和浮点运算量减少),还构建了两个新的多响应基准测试集(MR² Bench-Image 和 MR² Bench-Video),支持更全面的评估。实验表明,该方法在六项多模态奖励建模任务上均达到最优性能,且结合GRPO强化学习训练时能有效提升策略模型的开放生成质量与训练稳定性。

链接: https://arxiv.org/abs/2604.10966
作者: Yinuo Yang,Zixian Ma,Manasi Ganti,Jieyu Zhang,Ranjay Krishna
机构: University of Washington (华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:We present a discriminative multimodal reward model that scores all candidate responses in a single forward pass. Conventional discriminative reward models evaluate each response independently, requiring multiple forward passes, one for each potential response. Our approach concatenates multiple responses with separator tokens and applies cross-entropy over their scalar scores, enabling direct comparative reasoning and efficient N -way preference learning. The multi-response design also yields up to N\times wall-clock speedup and FLOPs reduction over conventional single-response scoring. To enable N -way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR ^2 Bench-Image contains human-annotated rankings over responses from 8 diverse models; (2) MR ^2 Bench-Video is a large-scale video-based reward benchmark derived from 94K crowdsourced pairwise human judgments over video question-answering spanning 19 models, denoised via preference graph ensemble. Both benchmarks provide 4-response evaluation variants sampled from the full rankings. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, our model achieves state-of-the-art results on six multimodal reward benchmarks, including MR ^2 Bench-Image, MR ^2 Bench-Video, and four other existing benchmarks. Our model outperforms existing larger generative and discriminative reward models. We further demonstrate that our reward model, when used in reinforcement learning with GRPO, produces improved policy models that maintain performance across standard multimodal benchmarks while substantially improving open-ended generation quality, outperforming a single-response discriminative reward model (RM) baseline by a large margin in both training stability and open-ended generation quality.

[CV-114] FineEdit: Fine-Grained Image Edit with Bounding Box Guidance

【速读】:该论文旨在解决基于扩散模型(diffusion-based models)的图像编辑技术在实际应用中面临的两个核心问题:一是传统方法依赖自然语言提示(natural language prompts),难以精确定位目标对象;二是由于采用全局图像再生机制,导致背景一致性难以维持。解决方案的关键在于引入边界框(bounding box)作为视觉引导信号,通过提出FineEdit多层级边界框注入方法,使模型能够更有效地利用空间条件进行精准定位,同时保持背景结构不变。此外,研究构建了包含120万对图像的FineEdit-1.2M数据集和FineEdit-Bench评估基准,从而实现了高精度区域编辑能力的有效训练与量化评估。

链接: https://arxiv.org/abs/2604.10954
作者: Haohang Xu,Lin Liu,Zhibo Zhang,Rong Cong,Xiaopeng Zhang,Qi Tian
机构: Huawei Inc(华为公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion-based image editing models have achieved significant progress in real world applications. However, conventional models typically rely on natural language prompts, which often lack the precision required to localize target objects. Consequently, these models struggle to maintain background consistency due to their global image regeneration paradigm. Recognizing that visual cues provide an intuitive means for users to highlight specific areas of interest, we utilize bounding boxes as guidance to explicitly define the editing target. This approach ensures that the diffusion model can accurately localize the target while preserving background consistency. To achieve this, we propose FineEdit, a multi-level bounding box injection method that enables the model to utilize spatial conditions more effectively. To support this high precision guidance, we present FineEdit-1.2M, a large scale, fine-grained dataset comprising 1.2 million image editing pairs with precise bounding box annotations. Furthermore, we construct a comprehensive benchmark, termed FineEdit-Bench, which includes 1,000 images across 10 subjects to effectively evaluate region based editing capabilities. Evaluations on FineEdit-Bench demonstrate that our model significantly outperforms state-of-the-art open-source models (e.g., Qwen-Image-Edit and LongCat-Image-Edit) in instruction compliance and background preservation. Further assessments on open benchmarks (GEdit and ImgEdit Bench) confirm its superior generalization and robustness.

[CV-115] Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation

【速读】:该论文旨在解决视频语义分割(Video Semantic Segmentation, VSS)中对密集标注视频数据的高度依赖问题,以及现有基于图像语义分割(Image Semantic Segmentation, ISS)模型逐帧处理时缺乏时间一致性的问题。解决方案的关键在于提出DiTTA(Distillation-assisted Test-Time Adaptation)框架,通过在测试阶段进行高效适配,将预训练的ISS模型转化为具备时间感知能力的VSS模型,无需任何标注视频数据;其核心机制包括:利用SAM2的掩码传播能力,在单次初始化过程中蒸馏其时序分割知识至ISS模型,并结合轻量级的时间融合模块聚合跨帧上下文信息,从而实现鲁棒且高效的视频级语义理解。

链接: https://arxiv.org/abs/2604.10950
作者: Jihun Kim,Hoyong Kwon,Hyeokjun Kweon,Kuk-Jin Yoon
机构: KAIST(韩国科学技术院); Chung-Ang University(中央大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fully supervised Video Semantic Segmentation (VSS) relies heavily on densely annotated video data, limiting practical applicability. Alternatively, applying pre-trained Image Semantic Segmentation (ISS) models frame-by-frame avoids annotation costs but ignores crucial temporal coherence. Recent foundation models such as SAM2 enable high-quality mask propagation yet remain impractical for direct VSS due to limited semantic understanding and computational overhead. In this paper, we propose DiTTA (Distillation-assisted Test-Time Adaptation), a novel framework that converts an ISS model into a temporally-aware VSS model through efficient test-time adaptation (TTA), without annotated videos. DiTTA distills SAM2’s temporal segmentation knowledge into the ISS model during a brief, single-pass initialization phase, complemented by a lightweight temporal fusion module to aggregate cross-frame context. Crucially, DiTTA achieves robust generalization even when adapting with highly limited partial video snippets (e.g., initial 10%), significantly outperforming zero-shot refinement approaches that repeatedly invoke SAM2 during inference. Extensive experiments on VSPW and Cityscapes demonstrate DiTTA’s effectiveness, achieving competitive or superior performance relative to fully-supervised VSS methods, thus providing a practical and annotation-free solution for real-world VSS tasks.

[CV-116] Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

【速读】:该论文旨在解决统一多模态模型(Unified Multimodal Models, UMMs)中存在的“伪融合”(pseudo-unification)问题,即尽管UMMs试图结合大语言模型(Large Language Models, LLMs)的推理能力与视觉模型的生成能力,但在实际中却难以实现真正的协同效应:LLM式的推理无法有效迁移至图像合成,且文本和图像生成行为表现出显著差异。为破解这一难题,作者提出了一种基于信息论的探针框架(information-theoretic probing framework),其关键在于联合分析模型如何编码输入信息与生成输出,从而揭示内部机制。该框架识别出两个核心成因:(i) 模态不对称编码(Modality-Asymmetric Encoding),即视觉与语言通道在信息熵演化路径上的不一致;(ii) 模式分裂响应(Pattern-Split Response),即文本生成呈现高熵创造性而图像生成则强制低熵保真性。研究表明,唯有通过上下文预测等机制实现双侧统一的模型才能达成真正的一致性信息流,从而在参数更少的情况下实现更强的基于推理的文本到图像生成能力。

链接: https://arxiv.org/abs/2604.10949
作者: Songlin Yang,Xianghao Kong,Anyi Rao
机构: MMLab@HKUST, The Hong Kong University of Science and Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Unified multimodal models (UMMs) were designed to combine the reasoning ability of large language models (LLMs) with the generation capability of vision models. In practice, however, this synergy remains elusive: UMMs fail to transfer LLM-like reasoning to image synthesis and exhibit divergent response behaviors. We term this phenomenon pseudo-unification. Diagnosing its internal causes is important, but existing probing methods either lack model-internal insight or ignore prompt-response dependencies. To address these limitations, we propose an information-theoretic probing framework that jointly analyzes how UMMs encode inputs and generate outputs. Applied to ten representative UMMs, our framework reveals that pseudo-unification stems from a dual divergence: (i) Modality-Asymmetric Encoding, where vision and language follow different entropy trajectories, and (ii) Pattern-Split Response, where text generation exhibits high-entropy creativity while image synthesis enforces low-entropy fidelity. Only models that unify both sides (e.g., via contextual prediction) achieve more genuine unification, enabling stronger reasoning-based text-to-image generation even with fewer parameters. Our work provides the first model-internal probing of unification, demonstrating that real multimodal synergy requires consistency in information flow, not just shared parameters.

[CV-117] Progressive Deep Learning for Automated Spheno-Occipital Synchondrosis Maturation Assessment

【速读】:该论文旨在解决基于锥形束CT(CBCT)图像对蝶枕软骨联合(spheno-occipital synchondrosis, SOS)成熟度进行分期时存在的高观察者间变异性和低重现性问题,尤其是在融合过渡阶段。其关键解决方案是提出一种渐进式表征学习框架,该框架模拟专家临床医生从粗略解剖结构到细微闭合模式的推理过程:通过逐步激活网络深层模块,使早期层先编码稳定的颅底形态,后续层再专注于区分相邻成熟阶段。这种按网络深度设计的课程学习策略使深度特征学习与SOS融合的生物连续性相一致,从而在不改变网络架构或损失函数的前提下,显著提升模型在模糊中间阶段的准确性与优化稳定性。

链接: https://arxiv.org/abs/2604.10945
作者: Omid Halimi Milani,Amanda Nikho,Marouane Tliba,Lauren Mills,Emadeldeen Hamdan,Ahmet Enis Cetin,Mohammed H. Elnagar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurate assessment of spheno-occipital synchondrosis (SOS) maturation is a key indicator of craniofacial growth and a critical determinant for orthodontic and surgical timing. However, SOS staging from cone-beam CT (CBCT) relies on subtle, continuously evolving morphological cues, leading to high inter-observer variability and poor reproducibility, especially at transitional fusion stages. We frame SOS assessment as a fine-grained visual recognition problem and propose a progressive representation-learning framework that explicitly mirrors how expert clinicians reason about synchondral fusion: from coarse anatomical structure to increasingly subtle patterns of closure. Rather than training a full-capacity network end-to-end, we sequentially grow the model by activating deeper blocks over time, allowing early layers to first encode stable cranial base morphology before higher-level layers specialize in discriminating adjacent maturation stages. This yields a curriculum over network depth that aligns deep feature learning with the biological continuum of SOS fusion. Extensive experiments across convolutional and transformer-based architectures show that this expert-inspired training strategy produces more stable optimization and consistently higher accuracy than standard training, particularly for ambiguous intermediate stages. Importantly, these gains are achieved without changing network architectures or loss functions, demonstrating that training dynamics alone can substantially improve anatomical representation learning. The proposed framework establishes a principled link between expert dental intuition and deep visual representations, enabling robust, data-efficient SOS staging from CBCT and offering a general strategy for modeling other continuous biological processes in medical imaging.

[CV-118] AmodalSVG: Amodal Image Vectorization via Semantic Layer Peeling

【速读】:该论文旨在解决现有图像矢量化方法在处理自然图像时存在的局限性:传统方法遵循“模态(Modal)”范式,仅对可见像素进行描边,忽略遮挡区域,导致生成的SVG(可缩放矢量图形)在语义上纠缠不清且几何结构不完整,从而限制了其在矢量域中的结构化编辑能力。为突破这一瓶颈,作者提出AmodalSVG框架,其核心创新在于将图像矢量化重构为两阶段流程:首先通过语义层剥离(Semantic Layer Peeling, SLP)策略,在光栅域中实现语义解耦与遮挡区域补全,利用视觉语言模型(VLM)引导的混合修复技术恢复被遮挡对象的完整外观;随后采用自适应分层矢量化(Adaptive Layered Vectorization, ALV)机制,基于误差预算驱动的动态调整策略优化每层矢量基元分配,从而生成语义清晰、几何完整的独立矢量图层。该方案使最终SVG具备对象级编辑能力,显著优于现有方法。

链接: https://arxiv.org/abs/2604.10940
作者: Juncheng Hu,Ziteng Xue,Guotao Liang,Anran Qi,Buyu Li,Sheng Wang,Dong Xu,Qian Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce AmodalSVG, a new framework for amodal image vectorization that produces semantically organized and geometrically complete SVG representations from natural images. Existing vectorization methods operate under a modal paradigm: tracing only visible pixels and disregarding occlusion. Consequently, the resulting SVGs are semantically entangled and geometrically incomplete, limiting SVG’s structural editability. In contrast, AmodalSVG reconstructs full object geometries, including occluded regions, into independent, editable vector layers. To achieve this, AmodalSVG reformulates image vectorization as a two-stage framework, performing semantic decoupling and completion in the raster domain to produce amodally complete semantic layers, which are then independently vectorized. In the first stage, we introduce Semantic Layer Peeling (SLP), a VLM-guided strategy that progressively decomposes an image into semantically coherent layers. By hybrid inpainting, SLP recovers complete object appearances under occlusions, enabling explicit semantic decoupling. To vectorize these layers efficiently, we propose Adaptive Layered Vectorization (ALV), which dynamically modulates the primitive budget via an error-budget-driven adjustment mechanism. Extensive experiments demonstrate that AmodalSVG significantly outperforms prior methods in visual fidelity. Moreover, the resulting amodal layers enable object-level editing directly in the vector domain, capabilities not supported by existing vectorization approaches. Code will be released upon acceptance.

[CV-119] QShield: Securing Neural Networks Against Adversarial Attacks using Quantum Circuits

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks)在安全与安全关键应用中对对抗扰动高度敏感的问题,从而限制其可靠性。解决方案的关键在于提出一种模块化的混合量子-经典神经网络(Hybrid Quantum-Classical Neural Network, HQCNN)架构——QShield,其核心创新在于将传统卷积神经网络(Convolutional Neural Network, CNN)用于特征提取,并通过量子处理模块将特征编码为量子态,在真实噪声模型下施加结构化纠缠操作,最终由轻量级多层感知机(Multilayer Perceptron, MLP)实现动态加权融合输出。实验表明,该架构在保持高预测准确性的同时显著降低对抗攻击成功率,并大幅增加生成对抗样本的计算成本,从而在准确性和鲁棒性之间实现了实用平衡。

链接: https://arxiv.org/abs/2604.10933
作者: Navid Azimi,Aditya Prakash,Yao Wang,Li Xiong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantum Physics (quant-ph)
备注:

点击查看摘要

Abstract:Deep neural networks remain highly vulnerable to adversarial perturbations, limiting their reliability in security- and safety-critical applications. To address this challenge, we introduce QShield, a modular hybrid quantum-classical neural network (HQCNN) architecture designed to enhance the adversarial robustness of classical deep learning models. QShield integrates a conventional convolutional neural network (CNN) backbone for feature extraction with a quantum processing module that encodes the extracted features into quantum states, applies structured entanglement operations under realistic noise models, and outputs a hybrid prediction through a dynamically weighted fusion mechanism implemented via a lightweight multilayer perceptron (MLP). We systematically evaluate both classical and hybrid quantum-classical models on the MNIST, OrganAMNIST, and CIFAR-10 datasets, using a comprehensive set of robustness, efficiency, and computational performance metrics. Our results demonstrate that classical models are highly vulnerable to adversarial attacks, whereas the proposed hybrid models with entanglement patterns maintain high predictive accuracy while substantially reducing attack success rates across a wide range of adversarial attacks. Furthermore, the proposed hybrid architecture significantly increased the computational cost required to generate adversarial examples, thereby introducing an additional layer of defense. These findings indicate that the proposed modular hybrid architecture achieves a practical balance between predictive accuracy and adversarial robustness, positioning it as a promising approach for secure and reliable machine learning in sensitive and safety-critical applications.

[CV-120] LiveGesture Streamable Co-Speech Gesture Generation Model

【速读】:该论文旨在解决现有协同语音手势生成方法在实时流式处理中的局限性问题,特别是缺乏零前瞻(zero look-ahead)能力、难以支持任意长度序列以及身体各区域运动独立建模或全局耦合导致的协调性不足。其解决方案的关键在于提出一个全新的端到端可流式处理的全身体态生成框架 LiveGesture,核心由两个模块构成:一是流式向量量化运动分词器(Streamable Vector Quantized Motion Tokenizer, SVQ),用于将每个身体区域的运动序列转化为因果离散的运动标记以实现实时解码;二是分层自回归 Transformer(Hierarchical Autoregressive Transformer, HAR),通过区域专家自回归(xAR)机制建模各部位精细动态,并结合因果时空融合模块(xAR Fusion)捕捉跨区域相关运动。此外,引入基于不确定性引导的标记掩码和随机区域掩码的自回归训练策略,显著提升了模型在流式噪声与预测误差下的鲁棒性。实验表明,LiveGesture 在 BEAT2 数据集上实现了真正零前瞻条件下的连贯、多样且节拍同步的全身体态生成,性能达到或超越当前最优离线方法。

链接: https://arxiv.org/abs/2604.10927
作者: Muhammad Usama Saleem,Mayur Jagdishbhai Patel,Ekkasit Pinyoanuntapong,Zhongxing Qin,Li Yang,Hongfei Xue,Ahmed Helmy,Chen Chen,Pu Wang
机构: University of North Carolina(北卡罗来纳大学); University of Central Florida(中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose LiveGesture, the first fully streamable, speech-driven full-body gesture generation framework that operates with zero look-ahead and supports arbitrary sequence length. Unlike existing co-speech gesture methods, which are designed for offline generation and either treat body regions independently or entangle all joints within a single model, LiveGesture is built from the ground up for causal, region-coordinated motion generation. LiveGesture consists of two main modules: the Streamable Vector Quantized Motion Tokenizer (SVQ) and the Hierarchical Autoregressive Transformer (HAR). The SVQ tokenizer converts the motion sequence of each body region into causal, discrete motion tokens, enabling real-time, streamable token decoding. On top of SVQ, HAR employs region-expert autoregressive (xAR) transformers to model expressive, fine-grained motion dynamics for each body region. A causal spatio-temporal fusion module (xAR Fusion) then captures and integrates correlated motion dynamics across regions. Both xAR and xAR Fusion are conditioned on live, continuously arriving audio signals encoded by a streamable causal audio encoder. To enhance robustness under streaming noise and prediction errors, we introduce autoregressive masking training, which leverages uncertainty-guided token masking and random region masking to expose the model to imperfect, partially erroneous histories during training. Experiments on the BEAT2 dataset demonstrate that LiveGesture produces coherent, diverse, and beat-synchronous full-body gestures in real time, matching or surpassing state-of-the-art offline methods under true zero look-ahead conditions.

[CV-121] ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在超声检查场景中缺乏对动态操作过程理解的问题,现有基准仅评估静态图像理解能力,无法衡量模型对复杂医疗流程的推理与决策能力。解决方案的关键在于提出ReXSonoVQA——一个面向超声操作视频的问答基准,包含514段视频片段及对应问题(249道多选题、265道开放回答题),聚焦三大核心能力:动作-目标推理(Action-Goal Reasoning)、伪影识别与优化(Artifact Resolution Optimization)以及操作流程规划(Procedure Context Planning)。该基准首次系统性地评估VLMs在动态超声采集中的因果推理和任务执行能力,揭示了当前模型在故障诊断类问题上的局限性,为开发用于超声培训、实时引导和机器人自动化的感知系统提供了重要工具。

链接: https://arxiv.org/abs/2604.10916
作者: Xucheng Wang,Xiaoman Zhang,Sung Eun Kim,Ankit Pal,Pranav Rajpurkar
机构: Harvard Medical School (哈佛医学院); Seoul National University Hospital (首尔国立大学医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution Optimization, and Procedure Context Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.

[CV-122] AMISeg: Text-Aligned Multi-scale Medical Image Segmentation with Semantic Encoder Distillation ICME

【速读】:该论文旨在解决医学图像分割中因细粒度标注数据有限、解剖结构复杂以及图像噪声、低对比度或光照变化导致的图像退化等问题。其解决方案的关键在于提出一种文本引导的分割框架TAMISeg,通过引入临床语言提示(clinical language prompts)和语义蒸馏(semantic distillation)作为辅助语义线索,增强视觉理解并降低对像素级精细标注的依赖;具体包括三个核心组件:1)基于强扰动预训练的一致性感知编码器以提取鲁棒特征;2)利用冻结的DINOv3教师模型监督的语义蒸馏模块以提升语义判别能力;3)尺度自适应解码器以实现跨空间尺度的解剖结构分割。

链接: https://arxiv.org/abs/2604.10912
作者: Qiang Gao,Yi Wang,Yong Zhang,Yong Li,Yongbing Deng,Lan Du,Cunjian Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE International Conference on Multimedia and Expo (ICME), 2026

点击查看摘要

Abstract:Medical image segmentation remains challenging due to limited fine-grained annotations, complex anatomical structures, and image degradation from noise, low contrast, or illumination variation. We propose TAMISeg, a text-guided segmentation framework that incorporates clinical language prompts and semantic distillation as auxiliary semantic cues to enhance visual understanding and reduce reliance on pixel-level fine-grained annotations. TAMISeg integrates three core components: a consistency-aware encoder pretrained with strong perturbations for robust feature extraction, a semantic encoder distillation module with supervision from a frozen DINOv3 teacher to enhance semantic discriminability, and a scale-adaptive decoder that segments anatomical structures across different spatial scales. Experiments on the Kvasir-SEG, MosMedData+, and QaTa-COV19 datasets demonstrate that TAMISeg consistently outperforms existing uni-modal and multi-modal methods in both qualitative and quantitative evaluations. Code will be made publicly available at this https URL.

[CV-123] STGV: Spatio-Temporal Hash Encoding for Gaussian-based Video Representation

【速读】:该论文旨在解决现有基于高斯的视频表示方法中,由于采用内容无关或时空特征重叠嵌入来预测规范高斯基元形变,导致静态与动态成分混淆、难以有效建模各自特性的难题,从而引发时空形变预测不准确和表示质量不佳的问题。其解决方案的关键在于提出一种时空哈希编码框架(Spatio-Temporal hash encoding framework for Gaussian-based Video representation, STGV),通过将视频特征分解为可学习的二维空间和三维时间哈希编码,有效分离并分别学习动态成分的运动模式与静态成分的背景细节;同时引入关键帧规范初始化策略,构建更稳定一致的初始高斯表示,避免特征重叠和几何结构不一致问题。

链接: https://arxiv.org/abs/2604.10910
作者: Jierun Lin,Jiacong Chen,Qingyu Mao,Shuai Liu,Xiandong Meng,Fanyang Meng,Yongsheng Liang
机构: Shenzhen University (深圳大学); Shenzhen Technology University (深圳技术大学); Pengcheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:2D Gaussian Splatting (2DGS) has recently become a promising paradigm for high-quality video representation. However, existing methods employ content-agnostic or spatio-temporal feature overlapping embeddings to predict canonical Gaussian primitive deformations, which entangles static and dynamic components in videos and prevents modeling their distinct properties effectively. These result in inaccurate predictions for spatio-temporal deformations and unsatisfactory representation quality. To address these problems, this paper proposes a Spatio-Temporal hash encoding framework for Gaussian-based Video representation (STGV). By decomposing video features into learnable 2D spatial and 3D temporal hash encodings, STGV effectively facilitates the learning of motion patterns for dynamic components while maintaining background details for static this http URL addition, we construct a more stable and consistent initial canonical Gaussian representation through a key frame canonical initialization strategy, preventing from feature overlapping and a structurally incoherent geometry representation. Experimental results demonstrate that our method attains better video representation quality (+0.98 PSNR) against other Gaussian-based methods and achieves competitive performance in downstream video tasks.

[CV-124] Evaluating the Impact of Medical Image Reconstruction on Downstream AI Fairness and Performance

【速读】:该论文旨在解决生成式 AI (Generative AI) 在医学影像重建中的下游影响问题,即当前基于像素级指标(如PSNR)的评估方法无法准确反映重建模型对诊断性能和公平性的影响。其解决方案的关键在于提出一个可扩展的评估框架,将重建模型与诊断AI模型联合应用,从而系统性地评估不同重建方法(U-Net、GAN、扩散模型)在X射线和MRI两种数据类型上的任务表现与公平性变化。该框架揭示了传统指标与实际诊断准确性之间存在脱节,并发现重建过程可能放大患者性别等维度的偏见,但整体影响仍小于诊断模型本身固有偏见,提示需在整个医学成像流程中进行端到端的性能与公平性评估。

链接: https://arxiv.org/abs/2604.10904
作者: Matteo Wohlrapp,Niklas Bubeck,Daniel Rueckert,William Lotter
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Proceedings of the Medical Imaging with Deep Learning (MIDL) Conference 2026

点击查看摘要

Abstract:AI-based image reconstruction models are increasingly deployed in clinical workflows to improve image quality from noisy data, such as low-dose X-rays or accelerated MRI scans. However, these models are typically evaluated using pixel-level metrics like PSNR, leaving their impact on downstream diagnostic performance and fairness unclear. We introduce a scalable evaluation framework that applies reconstruction and diagnostic AI models in tandem, which we apply to two tasks (classification, segmentation), three reconstruction approaches (U-Net, GAN, diffusion), and two data types (X-ray, MRI) to assess the potential downstream implications of reconstruction. We find that conventional reconstruction metrics poorly track task performance, where diagnostic accuracy remains largely stable even as reconstruction PSNR declines with increasing image noise. Fairness metrics exhibit greater variability, with reconstruction sometimes amplifying demographic biases, particularly regarding patient sex. However, the overall magnitude of this additional bias is modest compared to the inherent biases already present in diagnostic models. To explore potential bias mitigation, we adapt two strategies from classification literature to the reconstruction setting, but observe limited efficacy. Overall, our findings emphasize the importance of holistic performance and fairness assessments throughout the entire medical imaging workflow, especially as generative reconstruction models are increasingly deployed.

[CV-125] EviRCOD: Evidence-Guided Probabilistic Decoding for Referring Camouflaged Object Detection

【速读】:该论文针对参考式隐匿目标检测(Referring Camouflaged Object Detection, Ref-COD)中存在的三个核心挑战:参考-目标语义对齐不足、不确定性建模不显式以及边界保持鲁棒性差的问题,提出了一种名为EviRCOD的集成框架。其解决方案的关键在于:(1) 引入参考引导的可变形编码器(Reference-Guided Deformable Encoder, RGDE),通过分层参考驱动调制与多尺度可变形聚合机制注入语义先验并实现跨尺度表示对齐;(2) 设计不确定性感知的证据解码器(Uncertainty-Aware Evidential Decoder, UAED),将Dirichlet证据估计引入分层解码过程以显式建模不确定性并跨尺度传播置信度;(3) 构建边界感知精修模块(Boundary-Aware Refinement Module, BARM),利用低级边缘线索和预测置信度选择性增强模糊边界。该方法在Ref-COD基准上实现了最先进性能,并提供校准良好的不确定性估计。

链接: https://arxiv.org/abs/2604.10894
作者: Ye Wang,Kai Huang,Sumin Shen,Chenyang Ma
机构: West China Hospital, Sichuan University (四川大学华西医院); Auckland University of Technology (奥克兰理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Referring Camouflaged Object Detection (Ref-COD) focuses on segmenting specific camouflaged targets in a query image using category-aligned references. Despite recent advances, existing methods struggle with reference-target semantic alignment, explicit uncertainty modeling, and robust boundary preservation. To address these issues, we propose EviRCOD, an integrated framework consisting of three core components: (1) a Reference-Guided Deformable Encoder (RGDE) that employs hierarchical reference-driven modulation and multi-scale deformable aggregation to inject semantic priors and align cross-scale representations; (2) an Uncertainty-Aware Evidential Decoder (UAED) that incorporates Dirichlet evidence estimation into hierarchical decoding to model uncertainty and propagate confidence across scales; and (3) a Boundary-Aware Refinement Module (BARM) that selectively enhances ambiguous boundaries by exploiting low-level edge cues and prediction confidence. Experiments on the Ref-COD benchmark demonstrate that EviRCOD achieves state-of-the-art detection performance while providing well-calibrated uncertainty estimates. Code is available at: this https URL.

[CV-126] Product Review Based on Optimized Facial Expression Detection

【速读】:该论文旨在解决如何通过分析顾客在超市或大型卖场中购买产品时的面部表情来评估公众对某一品牌产品的接受度问题。其解决方案的关键在于利用改进的Harris算法进行特征点提取,以实现更高效的面部关键点检测,从而提升面部表情识别的实时性与准确性。该改进算法显著降低了原有Harris算法在角点检测中的时间复杂度,在保证近似准确性的前提下,提高了系统处理速度,满足了实际应用场景的需求。

链接: https://arxiv.org/abs/2604.10885
作者: Vikrant Chaugule,Abhishek D,Aadheeshwar Vijayakumar,Pravin Bhaskar Ramteke,Shashidhar G. Koolagudi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: 9 pages, 11 figures, Published in the 2016 Ninth International Conference on Contemporary Computing (IC3), August 11-13, 2016, Noida, India. This is a pre-print version of the paper

点击查看摘要

Abstract:This paper proposes a method to review public acceptance of products based on their brand by analyzing the facial expression of the customer intending to buy the product from a supermarket or hypermarket. In such cases, facial expression recognition plays a significant role in product review. Here, facial expression detection is performed by extracting feature points using a modified Harris algorithm. The modified Harris algorithm reduced the time complexity of the existing feature extraction Harris Algorithm. A comparison of time complexities of existing algorithms is done with proposed algorithm. The algorithm proved to be significantly faster and nearly accurate for the needed application by reducing the time complexity for corner points detection.

[CV-127] LRD-Net: A Lightweight Real-Centered Detection Network for Cross-Domain Face Forgery Detection

【速读】:该论文旨在解决当前基于扩散模型的伪造人脸检测方法在跨域泛化能力差和计算开销大两个核心问题,尤其在面对未见过的伪造类型时性能下降明显,且难以部署于资源受限设备。其解决方案的关键在于提出一种轻量级、频率引导的实时检测网络(LRD-Net):首先采用序列式频域引导架构,通过多尺度小波引导模块生成注意力信号以调控MobileNetV3空间主干网络,从而高效利用频域特征并避免并行提取带来的冗余;其次引入以真实图像为中心的学习策略,结合指数移动平均原型更新与漂移正则化机制,使特征表示锚定于真实人脸分布而非建模多样伪造模式,显著提升跨域鲁棒性。实验表明,LRD-Net在DiFF基准上达到最优跨域检测精度,参数量仅为传统方法的约1/9,训练速度提升8倍以上,推理速度接近10倍加速,实现了高精度与低延迟的统一。

链接: https://arxiv.org/abs/2604.10862
作者: Xuecen Zhang,Vipin Chaudhary
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid advancement of diffusion-based generative models has made face forgery detection a critical challenge in digital forensics. Current detection methods face two fundamental limitations: poor cross-domain generalization when encountering unseen forgery types, and substantial computational overhead that hinders deployment on resource-constrained devices. We propose LRD-Net (Lightweight Real-centered Detection Network), a novel framework that addresses both challenges simultaneously. Unlike existing dual-branch approaches that process spatial and frequency information independently, LRD-Net adopts a sequential frequency-guided architecture where a lightweight Multi-Scale Wavelet Guidance Module generates attention signals that condition a MobileNetV3-based spatial backbone. This design enables effective exploitation of frequency-domain cues while avoiding the redundancy of parallel feature extraction. Furthermore, LRD-Net employs a real-centered learning strategy with exponential moving average prototype updates and drift regularization, anchoring representations around authentic facial images rather than modeling diverse forgery patterns. Extensive experiments on the DiFF benchmark demonstrate that LRD-Net achieves state-of-the-art cross-domain detection accuracy, consistently outperforming existing methods. Critically, LRD-Net accomplishes this with only 2.63M parameters - approximately 9x fewer than conventional approaches - while achieving over 8x faster training and nearly 10x faster inference. These results demonstrate that robust cross-domain face forgery detection can be achieved without sacrificing computational efficiency, making LRD-Net suitable for real-time deployment in mobile authentication systems and resource-constrained environments.

[CV-128] Retinal Cyst Detection from Optical Coherence Tomography Images

【速读】:该论文旨在解决视网膜囊样水肿中 intraretinal cyst(视网膜内囊肿)自动分割与量化精度低的问题,尤其针对现有方法在高噪声图像(如Topcon设备采集的图像)上性能显著下降的局限性。其解决方案的关键在于采用基于ResNet卷积神经网络(Convolutional Neural Network, CNN)的分块分类(patchwise classification)策略,利用公开的囊肿分割挑战数据集进行训练,并在来自四位不同厂商(4 vendors)的测试数据上评估模型性能,从而实现对不同成像质量下intraretinal cyst的鲁棒分割,最终Dice系数超过70%,优于此前的最先进方法。

链接: https://arxiv.org/abs/2604.10843
作者: Abhishek Dharmaratnakar,Aadheeshwar Vijayakumar,Suchand Dayanand
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 13 pages, 9 figures

点击查看摘要

Abstract:Retinal Cysts are formed by leakage and accumulation of fluid in the retina due to the incompetence of retinal vasculature. These cystic spaces have significance in several ocular diseases such as age-related macular degeneration, diabetic macular edema, etc. Optical coherence tomography is one of the predominant diagnosing techniques for imaging retinal pathologies. Segmenting and quantification of intraretinal cysts plays the vital role in predicting visual acuity. In literature, several methods have been proposed for automatic segmentation of intraretinal cysts. As cystoid macular edema becomes a major problem to humankind, we need to quantify it accurately and operate it out, else it might cause many problems later on. Though research is being carried out in this area, not much of progress has been made and accuracy achieved so far is 68% which is very less. Also, the methods depend on the quality of the image and give very low results for high noise images like topcon. This work uses ResNet CNN (Convolutional Neural Network) approach of segmentation by the way of patchwise classification for training on image set from cyst segmentation challenge dataset and testing on test data set given by 2 different graders for all 4 vendors in the challenge. It also compares these methods using first publicly available novel cyst segmentation challenge dataset. The methods were evaluated using quantitative measures to assess their robustness against the challenges of intraretinal cyst segmentation. The results are found to be better than the previous state of the art approaches giving more than 70% dice coefficient on all vendors irrespective of their quality.

[CV-129] Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation

【速读】:该论文旨在解决图像到视频(Image-to-Video, I2V)生成模型可能被用于制造高保真深度伪造(deepfakes)所带来的社会危害问题,尤其是针对静态图像的对抗性免疫策略在I2V场景中失效的问题。其关键解决方案在于提出Immune2V框架:通过在编码器层面强制实现时序平衡的潜在空间差异(temporally balanced latent divergence),防止对抗噪声在视频帧间快速稀释;同时,将中间生成表示与预计算的崩溃诱导轨迹对齐,以抵消文本条件引导对扰动效果的覆盖作用,从而在保持不可感知扰动预算的前提下,显著增强并维持对I2V生成结果的破坏力。

链接: https://arxiv.org/abs/2604.10837
作者: Zeqian Long,Ozgur Kara,Haotian Xue,Yongxin Chen,James M. Rehg
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image-to-video (I2V) generation has the potential for societal harm because it enables the unauthorized animation of static images to create realistic deepfakes. While existing defenses effectively protect against static image manipulation, extending these to I2V generation remains underexplored and non-trivial. In this paper, we systematically analyze why modern I2V models are highly robust against naive image-level adversarial attacks (i.e., immunization). We observe that the video encoding process rapidly dilutes the adversarial noise across future frames, and the continuous text-conditioned guidance actively overrides the intended disruptive effect of the immunization. Building on these findings, we propose the Immune2V framework which enforces temporally balanced latent divergence at the encoder level to prevent signal dilution, and aligns intermediate generative representations with a precomputed collapse-inducing trajectory to counteract the text-guidance override. Extensive experiments demonstrate that Immune2V produces substantially stronger and more persistent degradation than adapted image-level baselines under the same imperceptibility budget.

[CV-130] HO-Flow: Generalizable Hand-Object Interaction Generation with Latent Flow Matching

【速读】:该论文旨在解决生成真实感三维手-物体交互(Hand-Object Interaction, HOI)运动序列的挑战,尤其在时间一致性与物理合理性方面存在局限。现有方法难以学习具有表现力的运动表示并进行有效的时序推理。其解决方案的关键在于提出HO-Flow框架:首先使用一种交互感知的变分自编码器(interaction-aware variational autoencoder),融合手部和物体的运动学信息,将手-物运动序列映射到统一的潜在流形中,以捕捉丰富的交互动力学;其次引入掩码流匹配模型(masked flow matching model),结合自回归时序推理与连续潜在空间生成,显著提升时序一致性;此外,通过预测相对于初始帧的物体运动,实现大规模合成数据上的有效预训练,从而增强泛化能力。实验表明,HO-Flow在GRAB、OakInk和DexYCB等多个基准上实现了物理合理性与运动多样性的最先进性能。

链接: https://arxiv.org/abs/2604.10836
作者: Zerui Chen,Rolandos Alexandros Potamias,Shizhe Chen,Jiankang Deng,Cordelia Schmid,Stefanos Zafeiriou
机构: Inria(法国国家信息与自动化研究院); CNRS(法国国家科学研究中心); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project Page: this https URL

点击查看摘要

Abstract:Generating realistic 3D hand-object interactions (HOI) is a fundamental challenge in computer vision and robotics, requiring both temporal coherence and high-fidelity physical plausibility. Existing methods remain limited in their ability to learn expressive motion representations for generation and perform temporal reasoning. In this paper, we present HO-Flow, a framework for synthesizing realistic hand-object motion sequences from texts and canoncial 3D objects. HO-Flow first employs an interaction-aware variational autoencoder to encode sequences of hand and object motions into a unified latent manifold by incorporating hand and object kinematics, enabling the representation to capture rich interaction dynamics. It then leverages a masked flow matching model that combines auto-regressive temporal reasoning with continuous latent generation, improving temporal coherence. To further enhance generalization, HO-Flow predicts object motions relative to the initial frame, enabling effective pre-training on large-scale synthetic data. Experiments on the GRAB, OakInk, and DexYCB benchmarks demonstrate that HO-Flow achieves state-of-the-art performance in both physical plausibility and motion diversity for interaction motion synthesis.

[CV-131] Uncertainty-Guided Attention and Entropy-Weighted Loss for Precise Plant Seedling Segmentation

【速读】:该论文旨在解决植物幼苗图像分割中因复杂背景和叶片细结构导致的标准分割模型性能下降的问题。其核心解决方案是提出UGDA-Net(Uncertainty-Guided Dual Attention Network with Entropy-Weighted Loss and Deep Supervision),关键创新在于三个互补模块:一是基于通道方差的不确定性引导双注意力机制(Uncertainty-Guided Dual Attention, UGDA),用于动态调制特征图;二是熵加权混合损失函数,聚焦高不确定性边界像素以提升边缘分割精度;三是对编码器中间层引入深度监督机制,增强梯度传播与特征学习能力。实验表明,该方法在Dice系数上相较基线提升9.3%,显著改善了叶片边界处的误分割问题,尤其适用于精细植物结构的高分辨率分割任务。

链接: https://arxiv.org/abs/2604.10823
作者: Mohamed Ehab,Ali Hamdi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Plant seedling segmentation supports automated phenotyping in precision agriculture. Standard segmentation models face difficulties due to intricate background images and fine structures in leaves. We introduce UGDA-Net (Uncertainty-Guided Dual Attention Network with Entropy-Weighted Loss and Deep Supervision). Three novel components make up UGDA-Net. The first component is Uncertainty-Guided Dual Attention (UGDA). UGDA uses channel variance to modulate feature maps. The second component is an entropy-weighted hybrid loss function. This loss function focuses on high-uncertainty boundary pixels. The third component employs deep supervision for intermediate encoder layers. We performed a comprehensive systematic ablation study. This study focuses on two widely-used architectures, U-Net and LinkNet. It analyzes five incremental configurations: Baseline, Loss-only, Attention-only, Deep Supervision, and UGDA-Net. We trained UGDA-net using a high-resolution plant seedling image dataset containing 432 images. We demonstrate improved segmentation performance and accuracy. With an increase in Dice coefficient of 9.3% above baseline. LinkNet’s variance is 13.2% above baseline. Overlays that are qualitative in nature show the reduced false positives at the leaf boundary. Uncertainty heatmaps are consistent with the complex morphology. UGDA-Net aids in the segmentation of delicate structures in plants and provides a high-def solution. The results showed that uncertainty-guided attention and uncertainty-weighted loss are two complementing systems.

[CV-132] Analytical Modeling and Correction of Distance Error in Homography-Based Ground-Plane Mapping

【速读】:该论文旨在解决单目相机(monocular camera)在智能监控系统中进行准确距离估计的问题,尤其针对因初始平面单应变换(planar homography)标定不精确导致的系统性距离失真问题。其关键解决方案在于推导出单应变换扰动与距离误差之间的显式关系,发现误差随真实距离呈近似二次增长;基于此模型,提出两种校正策略:一是通过回归方法估计二次误差函数以提升峰值精度,二是利用基于坐标梯度下降的直接优化方法增强对初始标定不良的鲁棒性。研究表明,在实际系统中改进几何标定效果比单纯增加模型复杂度更具性能提升潜力。

链接: https://arxiv.org/abs/2604.10805
作者: Mateusz Szulc,Marcin Iwanowski
机构: Warsaw University of Technology (华沙理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 4 figures

点击查看摘要

Abstract:Accurate distance estimation from monocular cameras is essential for intelligent monitoring systems. In many deployments, image coordinates are mapped to ground positions using planar homographies initialized by manual selection of corresponding regions. Small inaccuracies in this initialization propagate into systematic distance distortions. This paper derives an explicit relationship between homography perturbations and the resulting distance error, showing that the error grows approximately quadratically with the true distance from the camera. Based on this model, two simple correction strategies are evaluated: regression-based estimation of the quadratic error function and direct optimization of the homography via coordinate-based gradient descent. A large-scale simulation study with more than 19 million test samples demonstrates that regression achieves higher peak accuracy when the model is reliably fitted, whereas gradient descent provides greater robustness against poor initial calibration. This suggests that improving geometric calibration may yield greater performance gains than increasing model complexity in many practical systems.

[CV-133] WBCBench 2026: A Challenge for Robust White Blood Cell Classification Under Class Imbalance

【速读】:该论文旨在解决自动化白细胞(White Blood Cell, WBC)分类算法在真实临床场景中面临的三大挑战:(i)13种形态学细粒度类别的严重类别不平衡问题;(ii)训练、验证与测试集之间严格的患者级划分,以避免数据泄露;(iii)通过受控噪声、模糊和光照扰动模拟扫描仪及设置引起的域偏移(domain shift),从而评估模型在部署阶段的鲁棒性。解决方案的关键在于构建一个分阶段的基准测试平台WBCBench 2026:第一阶段提供纯净训练数据,第二阶段引入具有特定严重程度分布的退化图像,模拟开发与部署环境之间的现实差异;同时,采用标准化提交格式、开源评估器以及宏平均F1分数作为核心评价指标,确保评估结果的可比性和公平性。

链接: https://arxiv.org/abs/2604.10797
作者: Xin Tian,Xudong Ma,Tianqi Yang,Alin Achim,Bartłomiej W Papież,Phandee Watanaboonyongcharoen,Nantheera Anantrasirichai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE International Symposium on Biomedical Imaging (ISBI)

点击查看摘要

Abstract:We present WBCBench 2026, an ISBI challenge and benchmark for automated WBC classification designed to stress-test algorithms under three key difficulties: (i) severe class imbalance across 13 morphologically fine-grained WBC classes, (ii) strict patient-level separation between training, validation and test sets, and (iii) synthetic scanner- and setting-induced domain shift via controlled noise, blur and illumination perturbations. All images are single-site microscopic blood smear acquisitions with standardised staining and expert hematopathologist annotations. This paper reviews the challenge and summarises the proposed solutions and final outcomes. The benchmark is organised into two phases. Phase 1 provides a pristine training set. Phase 2 introduces degraded images with split-specific severity distributions for train, validation and test, emulating a realistic shift between development and deployment conditions. We specify a standardised submission schema, open-source evaluator, and macro-averaged F1 score as the primary ranking metric.

[CV-134] ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

【速读】:该论文旨在解决当前生成式3D重建方法在实际部署中面临的三大挑战:跨模态信息整合不足、对人工物体提示的依赖性、以及因训练偏差导致的场景复杂度受限问题。其核心解决方案是提出ReplicateAnyScene框架,通过五阶段级联流程从视觉基础模型中提取并结构对齐跨文本、视觉和空间维度的通用先验知识,将其锚定为结构化的3D表示,从而实现无需人工干预的零样本(zero-shot)视频到组合式3D场景的自动转换,确保生成场景的语义一致性与物理合理性。

链接: https://arxiv.org/abs/2604.10789
作者: Mingyu Dong,Chong Xia,Mingyuan Jia,Weichen Lyu,Long Xu,Zheng Zhu,Yueqi Duan
机构: Tsinghua University (清华大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Humans exhibit an innate capacity to rapidly perceive and segment objects from video observations, and even mentally assemble them into structured 3D scenes. Replicating such capability, termed compositional 3D reconstruction, is pivotal for the advancement of Spatial Intelligence and Embodied AI. However, existing methods struggle to achieve practical deployment due to the insufficient integration of cross-modal information, leaving them dependent on manual object prompting, reliant on auxiliary visual inputs, and restricted to overly simplistic scenes by training biases. To address these limitations, we propose ReplicateAnyScene, a framework capable of fully automated and zero-shot transformation of casually captured videos into compositional 3D scenes. Specifically, our pipeline incorporates a five-stage cascade to extract and structurally align generic priors from vision foundation models across textual, visual, and spatial dimensions, grounding them into structured 3D representations and ensuring semantic coherence and physical plausibility of the constructed scenes. To facilitate a more comprehensive evaluation of this task, we further introduce the C3DR benchmark to assess reconstruction quality from diverse aspects. Extensive experiments demonstrate the superiority of our method over existing baselines in generating high-quality compositional 3D scenes.

[CV-135] LIDARLearn: A Unified Deep Learning Library for 3D Point Cloud Classification Segmentation and Self-Supervised Representation Learning

【速读】:该论文旨在解决三维点云(3D point cloud)分析领域中深度学习方法实现分散、兼容性差的问题,具体表现为不同模型架构、自监督预训练(self-supervised pre-training, SSL)和参数高效微调(parameter-efficient fine-tuning, PEFT)策略分布在不互通的代码库中,导致公平比较困难。其解决方案的关键在于提出一个统一且可扩展的 PyTorch 库——LIDARLearn,该库通过注册表(registry-based)框架整合了超过 55 种模型配置(涵盖 29 种监督学习架构、7 种 SSL 方法和 5 种 PEFT 策略),并提供标准化训练流程、交叉验证机制、自动化结果表格生成、严格的统计检验(Friedman/Nemenyi 测试与临界差异图)以及超过 2200 个端到端测试用例,从而显著提升点云任务(如分类、语义分割、部件分割和少样本学习)研究的可复现性和可比性。

链接: https://arxiv.org/abs/2604.10780
作者: Said Ohamouddou,Hanaa El Afia,Abdellatif El Afia,Raddouane Chiheb
机构: ENSIAS, Mohammed V University, Rabat, Morocco
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Three-dimensional (3D) point cloud analysis has become central to applications ranging from autonomous driving and robotics to forestry and ecological monitoring. Although numerous deep learning methods have been proposed for point cloud understanding, including supervised backbones, self-supervised pre-training (SSL), and parameter-efficient fine-tuning (PEFT), their implementations are scattered across incompatible codebases with differing data pipelines, evaluation protocols, and configuration formats, making fair comparisons difficult. We introduce \lib, a unified, extensible PyTorch library that integrates over 55 model configurations covering 29 supervised architectures, seven SSL pre-training methods, and five PEFT strategies, all within a single registry-based framework supporting classification, semantic segmentation, part segmentation, and few-shot learning. \lib provides standardised training runners, cross-validation with stratified K -fold splitting, automated LaTeX/CSV table generation, built-in Friedman/Nemenyi statistical testing with critical-difference diagrams for rigorous multi-model comparison, and a comprehensive test suite with 2,200+ automated tests validating every configuration end-to-end. The code is available at this https URL under the MIT licence. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2604.10780 [cs.CV] (or arXiv:2604.10780v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.10780 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Said Ohamouddou [view email] [v1] Sun, 12 Apr 2026 19:10:12 UTC (23 KB) Full-text links: Access Paper: View a PDF of the paper titled LIDARLearn: A Unified Deep Learning Library for 3D Point Cloud Classification, Segmentation, and Self-Supervised Representation Learning, by Said Ohamouddou and 3 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.CV prev | next new | recent | 2026-04 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[CV-136] Uncertainty-quantified Pulse Signal Recovery from Facial Video using Regularized Stochastic Interpolants

【速读】:该论文旨在解决当前成像光电容积脉搏波描记法(iPPG)算法在测试阶段缺乏解空间采样能力的问题,从而无法进行临床应用所必需的不确定性分析。其解决方案的关键在于提出一种名为“带随机性的正则化插值器用于iPPG”(RIS-iPPG)的新范式:将iPPG恢复建模为一个逆问题,通过预测时间依赖随机过程的瞬时流矢量和得分矢量,构建从相机像素分布到真实生理信号分布的概率路径;在测试阶段,则通过求解随机微分方程来采样给定像素强度测量下正确血容量脉搏(BVP)波形的后验分布。此外,利用生理变化缓慢的特性,引入基于相邻时间窗口残差流矢量预测相关性的正则化策略,显著提升了iPPG重建质量与不确定性估计的可靠性。

链接: https://arxiv.org/abs/2604.10777
作者: Vineet R. Shenoy,Cheng Peng,Rama Chellappa,Yu Sun
机构: Johns Hopkins University (约翰霍普金斯大学); University of Virginia (弗吉尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Imaging Photoplethysmography (iPPG), an optical procedure which recovers a human’s blood volume pulse (BVP) waveform using pixel readout from a camera, is an exciting research field with many researchers performing clinical studies of iPPG algorithms. While current algorithms to solve the iPPG task have shown outstanding performance on benchmark datasets, no state-of-the art algorithms, to the best of our knowledge, performs test-time sampling of solution space, precluding an uncertainty analysis that is critical for clinical applications. We address this deficiency though a new paradigm named Regularized Interpolants with Stochasticity for iPPG (RIS-iPPG). Modeling iPPG recovery as an inverse problem, we build probability paths that evolve the camera pixel distribution to the ground-truth signal distribution by predicting the instantaneous flow and score vectors of a time-dependent stochastic process; and at test-time, we sample the posterior distribution of the correct BVP waveform given the camera pixel intensity measurements by solving a stochastic differential equation. Given that physiological changes are slowly varying, we show that iPPG recovery can be improved through regularization that maximizes the correlation between the residual flow vector predictions of two adjacent time windows. Experimental results on three datasets show that RIS-iPPG provides superior reconstruction quality and uncertainty estimates of the reconstruction, a critical tool for the widespread adoption of iPPG algorithms in clinical and consumer settings.

[CV-137] HOG-Layout: Hierarchical 3D Scene Generation Optimization and Editing via Vision-Language Models CVPR2026

【速读】:该论文旨在解决3D场景生成与编辑中人工创建效率低、数据驱动方法多样性不足的问题,尤其在具身智能(Embodied AI)和沉浸式虚拟现实(VR)交互场景中的应用瓶颈。其解决方案的关键在于提出HOG-Layout框架,通过引入大语言模型(LLMs)与视觉-语言模型(VLMs)实现文本驱动的分层场景生成与实时编辑;利用检索增强生成(Retrieval-Augmented Generation, RAG)技术提升语义一致性与合理性,集成优化模块以增强物理合理性,并采用分层表示结构提高推理效率与编辑响应速度,从而在保证场景质量的同时支持快速直观的交互式修改。

链接: https://arxiv.org/abs/2604.10772
作者: Haiyan Jiang,Deyu Zhang,Dongdong Weng,Weitao Song,Henry Been-Lirn Duh
机构: The Hong Kong Polytechnic University (香港理工大学); Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026

点击查看摘要

Abstract:3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation requires tedious labor, while data-driven generation often lacks diversity. The emergence of large models introduces new possibilities for 3D scene synthesis. We present HOG-Layout that enables text-driven hierarchical scene generation, optimization and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves scene semantic consistency and plausibility through retrieval-augmented generation (RAG) technology, incorporates an optimization module to enhance physical consistency, and adopts a hierarchical representation to enhance inference and optimization, achieving real-time editing. Experimental results demonstrate that HOG-Layout produces more reasonable environments compared with existing baselines, while supporting fast and intuitive scene editing.

[CV-138] At FullTilt: Real-Time Open-Set 3D Macromolecule Detection Directly from Tilted 2D Projections

【速读】:该论文旨在解决冷冻电子断层扫描(cryogenic electron tomography, cryo-ET)中开放集三维大分子检测的计算效率瓶颈问题。当前方法受限于显存(VRAM)资源,无法直接处理完整的3D断层图(tomogram),只能依赖低效的滑动窗口推理策略,导致推理速度慢且难以扩展至大规模数据。解决方案的关键在于提出FullTilt框架,其核心创新是将检测任务从重建后的3D体积迁移至未重建的2D倾斜系列(tilt-series)上进行端到端建模:通过引入倾斜系列编码器(tilt-series encoder)高效融合多视角信息,结合多类视觉提示编码器(multiclass visual prompt encoder)、倾斜感知查询初始化器(tilt-aware query initializer)和辅助几何原型模块(auxiliary geometric primitives module),显著减少冗余体素计算,实现零样本检测性能最优的同时大幅降低运行时间和显存占用,从而推动快速、大规模的可视化蛋白质组学分析。

链接: https://arxiv.org/abs/2604.10766
作者: Ming-Yang Ho,Alberto Bartesaghi
机构: Duke University (杜克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-set 3D macromolecule detection in cryogenic electron tomography eliminates the need for target-specific model retraining. However, strict VRAM constraints prohibit processing an entire 3D tomogram, forcing current methods to rely on slow sliding-window inference over extracted subvolumes. To overcome this, we propose FullTilt, an end-to-end framework that redefines 3D detection by operating directly on aligned 2D tilt-series. Because a tilt-series contains significantly fewer images than slices in a reconstructed tomogram, FullTilt eliminates redundant volumetric computation, accelerating inference by orders of magnitude. To process the entire tilt-series simultaneously, we introduce a tilt-series encoder to efficiently fuse cross-view information. We further propose a multiclass visual prompt encoder for flexible prompting, a tilt-aware query initializer to effectively anchor 3D queries, and an auxiliary geometric primitives module to enhance the model’s understanding of multi-view geometry while improving robustness to adverse imaging artifacts. Extensive evaluations on three real-world datasets demonstrate that FullTilt achieves state-of-the-art zero-shot performance while drastically reducing runtime and VRAM requirements, paving the way for rapid, large-scale visual proteomics analysis. All code and data will be publicly available upon publication.

[CV-139] Lung Cancer Detection Using Deep Learning

【速读】:该论文旨在解决肺癌早期精准检测难题,以提升诊断准确率并改善患者生存率。当前肺癌死亡率居高不下,主要因症状常在晚期才显现,而现有检测手段存在局限性。为此,研究提出一种基于卷积神经网络(Convolutional Neural Network, CNN)的16层深度学习模型,其关键创新在于通过融合卷积、池化、展平、Dropout、全连接及密集层等多类型结构,优化特征提取与分类能力;同时,该模型在训练过程中表现出随训练轮次(epoch)增加而持续提升准确率的特性,并有效缓解了过拟合问题,从而显著增强对肺部影像数据的判别性能。

链接: https://arxiv.org/abs/2604.10765
作者: Imama Ajmi,Abhishek Das
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages

点击查看摘要

Abstract:Lung cancer, the second leading cause of cancer-related deaths, is primarily linked to long-term tobacco smoking (85% of cases). Surprisingly, 10-15% of cases occur in non-smokers. In 2020, approximately 2 million people were affected globally, resulting in 1.5 million deaths. The survival rate, at around 20%, lags behind other cancers, partly due to late-stage symptom manifestation. Necessitates early and accurate detection for effective treatment. Performance metrics such as accuracy, precision, recall (sensitivity), and F1-score are computed to provide a comprehensive evaluation of each model’s capabilities. By comparing these metrics, this study offers insights into the strengths and limitations of each approach, contributing to the advancement of lung cancer detection techniques. In this paper, we are going to discuss the methodologies of lung cancer detection using different deep learning algorithms - InceptionV3, MobileNetV2, VGG16, ResNet152 - are explored for their efficacy in classifying lung cancer cases. Our Proposed Model algorithm based is a 16 layers architecture based on CNN model. Our Proposed model exhibits several key highlights that contribute to its novelty. By integrating multiple layer types such as convolutional, pooling, flatten, dropout, fully connected and dense layers, the model leverages the strengths of each layer to enhance its predictive capabilities. Novelty of our proposed model is that its accuracy is increasing consistently with the increasing no of epochs. We have tested the model performance up to epoch no 30. Our proposed model also overcome the overfitting problem.

[CV-140] MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在罕见病场景下临床能力评估缺失的问题。当前主流基准主要针对常见病、单图像任务,未能系统评估多模态与多图像证据整合能力,而罕见病诊断往往依赖于病例级证据而非已有知识,对模型的跨图像推理和证据融合能力提出了更高要求。解决方案的关键在于提出首个面向罕见病的综合性评估基准MMRareBench,其核心创新包括:基于Orphanet的本体对齐、各任务轨道(诊断、治疗规划、跨图像证据对齐、检查建议)的泄漏控制机制、基于证据的标注体系,以及两级评价协议,从而实现对MLLMs在罕见病多模态、多图像临床决策能力的全面量化评估。

链接: https://arxiv.org/abs/2604.10755
作者: Junzhi Ning,Jiashi Lin,Yingying Fang,Wei Li,Jiyao Liu,Cheng Tang,Chenglong Ma,Wenhao Tang,Tianbin Li,Ziyan Huang,Guang Yang,Junjun He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have advanced clinical tasks for common conditions, but their performance on rare diseases remains largely untested. In rare-disease scenarios, clinicians often lack prior clinical knowledge, forcing them to rely strictly on case-level evidence for clinical judgments. Existing benchmarks predominantly evaluate common-condition, single-image settings, leaving multimodal and multi-image evidence integration under rare-disease data scarcity systematically unevaluated. We introduce MMRareBench, to our knowledge the first rare-disease benchmark jointly evaluating multimodal and multi-image clinical capability across four workflow-aligned tracks: diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion. The benchmark comprises 1,756 question-answer pairs with 7,958 associated medical images curated from PMC case reports, with Orphanet-anchored ontology alignment, track-specific leakage control, evidence-grounded annotations, and a two-level evaluation protocol. A systematic evaluation of 23 MLLMs reveals fragmented capability profiles and universally low treatment-planning performance, with medical-domain models trailing general-purpose MLLMs substantially on multi-image tracks despite competitive diagnostic scores. These patterns are consistent with a capacity dilution effect: medical fine-tuning can narrow the diagnostic gap but may erode the compositional multi-image capability that rare-disease evidence integration demands.

[CV-141] urning Generators into Retrievers: Unlocking MLLM s for Natural Language-Guided Geo-Localization CVPR

【速读】:该论文旨在解决自然语言引导的跨视图地理定位(Natural-language Guided Cross-view Geo-localization, NGCG)任务中现有方法存在的跨模态泛化能力弱及架构设计复杂的问题。传统基于CLIP-style双编码器的方法虽广泛使用,但其在文本到卫星图像检索中的对齐效果受限;而多模态大语言模型(Multimodal Large Language Models, MLLMs)虽具备强大的语义推理能力,却未针对检索任务进行优化。论文的关键解决方案是通过参数高效微调(parameter-efficient fine-tuning)策略,在保留MLLM预训练多模态知识的前提下,优化其内部潜在表示(latent representations),从而实现强跨模态对齐,无需重构模型架构。该方法在GeoText-1652上取得SOTA性能(Text-to-Image Recall@1提升12.2%),并在CVG-Text的5个子任务中位列第一,同时显著减少可训练参数量,验证了MLLM作为语义跨视图检索基础模型的可行性与优越性。

链接: https://arxiv.org/abs/2604.10721
作者: Yuqi Chen,Xiaohan Zhang,Ahmad Arrabi,Waqas Sultani,Chen Chen,Safwan Wshah
机构: University of Vermont (佛蒙特大学); Information Technology University (信息技术大学); University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPRF

点击查看摘要

Abstract:Natural-language Guided Cross-view Geo-localization (NGCG) aims to retrieve geo-tagged satellite imagery using textual descriptions of ground scenes. While recent NGCG methods commonly rely on CLIP-style dual-encoder architectures, they often suffer from weak cross-modal generalization and require complex architectural designs. In contrast, Multimodal Large Language Models (MLLMs) offer powerful semantic reasoning capabilities but are not directly optimized for retrieval tasks. In this work, we present a simple yet effective framework to adapt MLLMs for NGCG via parameter-efficient finetuning. Our approach optimizes latent representations within the MLLM while preserving its pretrained multimodal knowledge, enabling strong cross-modal alignment without redesigning model architectures. Through systematic analysis of diverse variables, from model backbone to feature aggregation, we provide practical and generalizable insights for leveraging MLLMs in NGCG. Our method achieves SOTA on GeoText-1652 with a 12.2% improvement in Text-to-Image Recall@1 and secures top performance in 5 out of 12 subtasks on CVG-Text, all while surpassing baselines with far fewer trainable parameters. These results position MLLMs as a robust foundation for semantic cross-view retrieval and pave the way for MLLM-based NGCG to be adopted as a scalable, powerful alternative to traditional dual-encoder designs. Project page and code are available at this https URL.

[CV-142] Defending against Patch-Based and Texture-Based Adversarial Attacks with Spectral Decomposition

【速读】:该论文旨在解决深度神经网络(Deep Neural Network, DNN)在物理世界中面临的对抗样本攻击问题,特别是针对可物理实现的基于补丁(patch-based)和纹理(texture-based)的对抗攻击,这些攻击对安防监控和自动驾驶等安全关键应用构成现实威胁。现有防御机制在面对专门针对其设计的自适应攻击时表现不佳。解决方案的关键在于提出一种名为对抗频谱防御(Adversarial Spectrum Defense, ASD)的新机制,其核心是利用离散小波变换(Discrete Wavelet Transform, DWT)进行多尺度频谱分解,从而捕获高频率(细粒度)和低频率(空间广泛)的扰动模式;通过将该频谱分析与现成的对抗训练(Adversarial Training, AT)模型结合,ASD实现了对多种对抗攻击的全面防御,并在强自适应攻击下仍保持显著性能优势,实验表明其平均精度(AP)相较以往方法提升达21.73%。

链接: https://arxiv.org/abs/2604.10715
作者: Wei Zhang,Xinyu Chang,Xiao Li,Yiming Zhu,Xiaolin Hu
机构: Tsinghua University (清华大学); University of Science and Technology Beijing (北京科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TIFS

点击查看摘要

Abstract:Adversarial examples present significant challenges to the security of Deep Neural Network (DNN) applications. Specifically, there are patch-based and texture-based attacks that are usually used to craft physical-world adversarial examples, posing real threats to security-critical applications such as person detection in surveillance and autonomous systems, because those attacks are physically realizable. Existing defense mechanisms face challenges in the adaptive attack setting, i.e., the attacks are specifically designed against them. In this paper, we propose Adversarial Spectrum Defense (ASD), a defense mechanism that leverages spectral decomposition via Discrete Wavelet Transform (DWT) to analyze adversarial patterns across multiple frequency scales. The multi-resolution and localization capability of DWT enables ASD to capture both high-frequency (fine-grained) and low-frequency (spatially pervasive) perturbations. By integrating this spectral analysis with the off-the-shelf Adversarial Training (AT) model, ASD provides a comprehensive defense strategy against both patch-based and texture-based adversarial attacks. Extensive experiments demonstrate that ASD+AT achieved state-of-the-art (SOTA) performance against various attacks, outperforming the APs of previous defense methods by 21.73%, in the face of strong adaptive adversaries specifically designed against ASD. Code available at this https URL .

[CV-143] Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

【速读】:该论文旨在解决当前音频生成、理解与编辑任务通常由专用模型分别处理,缺乏一个统一框架实现跨声音、音乐和语音领域的无缝集成问题。其解决方案的关键在于提出Audio-Omni,首个端到端的统一框架,通过冻结的多模态大语言模型(Multimodal Large Language Model, MLLM)进行高层推理,并结合可训练的扩散Transformer(Diffusion Transformer)实现高保真合成,从而在通用音频域中统一处理生成与编辑任务;同时,为缓解音频编辑数据稀缺问题,构建了包含超百万条精心标注编辑对的大规模数据集AudioEdit,实验表明该框架在多个基准测试中达到或超越现有专用专家模型的性能水平,展现出强大的泛化能力与继承特性。

链接: https://arxiv.org/abs/2604.10708
作者: Zeyue Tian,Binxin Yang,Zhaoyang Liu,Jiexuan Zhang,Ruibin Yuan,Hubery Yin,Qifeng Chen,Chen Li,Jing Lv,Wei Xue,Yike Guo
机构: Hong Kong University of Science and Technology (香港科技大学); WeChat Vision, Tencent Inc (微信视觉,腾讯公司); Peking University (北京大学)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across general sound, music, and speech domains, with integrated multi-modal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large-scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches while achieving performance on par with or superior to specialized expert models. Beyond its core capabilities, Audio-Omni exhibits remarkable inherited capabilities, including knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control for audio generation, highlighting a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released on this https URL.

[CV-144] Investigating Bias and Fairness in Appearance-based Gaze Estimation

【速读】:该论文旨在解决外观特征驱动的注视估计(gaze estimation)系统在不同人口统计群体中存在公平性缺失的问题,即当前算法在种族和性别维度上表现出显著的性能差异,且缺乏全面的基准评估体系来量化此类算法偏见。其解决方案的关键在于首次系统性地评估了主流注视估计模型在公平性方面的表现,通过标准公平性指标建立基线,并验证现有偏见缓解策略在注视估计领域的有效性,结果表明这些策略的公平性提升作用有限。研究强调需进一步开发鲁棒且公平的注视估计方法,并开源了数据标注、代码与训练模型以支持后续可复现的研究工作。

链接: https://arxiv.org/abs/2604.10707
作者: Burak Akgül,Erol Şahin,Sinan Kalkan
机构: Middle East Technical University (中东部理工大學)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While appearance-based gaze estimation has achieved significant improvements in accuracy and domain adaptation, the fairness of these systems across different demographic groups remains largely unexplored. To date, there is no comprehensive benchmark quantifying algorithmic bias in gaze estimation. This paper presents the first extensive evaluation of fairness in appearance-based gaze estimation, focusing on ethnicity and gender attributes. We establish a fairness baseline by analyzing state-of-the-art models using standard fairness metrics, revealing significant performance disparities. Furthermore, we evaluate the effectiveness of existing bias mitigation strategies when applied to the gaze domain and show that their fairness contributions are limited. We summarize key insights and open issues. Overall, our work calls for research into developing robust, equitable gaze estimators. To support future research and reproducibility, we publicly release our annotations, code, and trained models at: this http URL

[CV-145] Architecture-Agnostic Modality-Isolated Gated Fusion for Robust Multi-Modal Prostate MRI Segmentation

【速读】:该论文旨在解决多参数前列腺MRI(multi-parametric prostate MRI)在临床实践中因序列缺失或退化(如运动伪影、协议缩短等)导致的多模态融合模型鲁棒性不足的问题。现有方法通常假设输入模态完整且早期层中混杂了模态特异性信息,难以应对单通道失效场景。其解决方案的关键在于提出一种架构无关的“模态隔离门控融合”(Modality-Isolated Gated Fusion, MIGF)模块:首先保持各模态独立编码流,再通过学习到的门控机制进行融合,并结合模态丢弃训练(ModDrop)强制模型在输入不完整时具备补偿能力。实验证明,MIGF显著提升了多种骨干网络在多种缺失和退化场景下的性能,其中最优模型MIGFNet-nnUNet在PI-CAI数据集上达到0.7304 ± 0.056的排名得分,且机制分析表明鲁棒性提升主要源于严格的模态隔离与dropout驱动的补偿策略,而非动态质量路由。

链接: https://arxiv.org/abs/2604.10702
作者: Yongbo Shu,Wenzhao Xie,Shanhu Yao,Zirui Xin,Luo Lei,Kewen Chen,Aijing Luo
机构: 长沙理工大学(Changsha University of Science and Technology)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 36 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Multi-parametric prostate MRI – combining T2-weighted, apparent diffusion coefficient, and high b-value diffusion-weighted sequences – is central to non-invasive detection of clinically significant prostate cancer, yet in routine practice individual sequences may be missing or degraded by motion, artifacts, or abbreviated protocols. Existing multi-modal fusion strategies typically assume complete inputs and entangle modality-specific information at early layers, offering limited resilience when one channel is corrupted or absent. We propose Modality-Isolated Gated Fusion (MIGF), an architecture-agnostic module that maintains separate modality-specific encoding streams before a learned gating stage, combined with modality dropout training to enforce compensation behavior under incomplete inputs. We benchmark six bare backbones and assess MIGF-equipped models under seven missing-modality and artifact scenarios on the PI-CAI dataset (1,500 studies, fold-0 split, five random seeds). Among bare backbones, nnUNet provided the strongest balance of performance and stability. MIGF improved ideal-scenario Ranking Score for UNet, nnUNet, and Mamba by 2.8%, 4.6%, and 13.4%, respectively; the best model, MIGFNet-nnUNet (gating + ModDrop, no deep supervision), achieved 0.7304 +/- 0.056. Mechanistic analysis reveals that robustness gains arise from strict modality isolation and dropout-driven compensation rather than adaptive per-sample quality routing: the gate converged to a stable modality prior, and deep supervision was beneficial only for the largest backbone while degrading lighter models. These findings support a simpler design principle for robust multi-modal segmentation: structurally contain corrupted inputs first, then train explicitly for incomplete-input compensation.

[CV-146] Camyla: Scaling Autonomous Research in Medical Image Segmentation

【速读】:该论文旨在解决医学图像分割领域中实现全流程自主研究(fully autonomous research)所面临的三大核心挑战:一是搜索努力因缺乏有效引导而偏离高潜力方向(search effort drifts toward unpromising directions);二是早期实验知识在长周期迭代中因上下文累积而退化(knowledge degradation due to context accumulation);三是失败后的恢复机制陷入重复性微调,缺乏多样性(recovery from failures collapses into repetitive incremental fixes)。解决方案的关键在于提出三个相互耦合的机制:质量加权分支探索(Quality-Weighted Branch Exploration)用于动态分配资源至有潜力的研究路径;分层反思记忆(Layered Reflective Memory)实现跨实验的知识压缩与多粒度保留;以及发散诊断反馈(Divergent Diagnostic Feedback)提升失败后恢复策略的多样性。这些机制共同支撑了系统在无干预条件下完成从数据到论文的闭环科研流程,并在CamylaBench基准上验证了其优越性能。

链接: https://arxiv.org/abs/2604.10696
作者: Yifan Gao,Haoyue Li,Feng Yuan,Xin Gao,Weiran Huang,Xiaosong Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We present Camyla, a system for fully autonomous research within the scientific domain of medical image segmentation. Camyla transforms raw datasets into literature-grounded research proposals, executable experiments, and complete manuscripts without human intervention. Autonomous experimentation over long horizons poses three interrelated challenges: search effort drifts toward unpromising directions, knowledge from earlier trials degrades as context accumulates, and recovery from failures collapses into repetitive incremental fixes. To address these challenges, the system combines three coupled mechanisms: Quality-Weighted Branch Exploration for allocating effort across competing proposals, Layered Reflective Memory for retaining and compressing cross-trial knowledge at multiple granularities, and Divergent Diagnostic Feedback for diversifying recovery after underperforming trials. The system is evaluated on CamylaBench, a contamination-free benchmark of 31 datasets constructed exclusively from 2025 publications, under a strict zero-intervention protocol across two independent runs within a total of 28 days on an 8-GPU cluster. Across the two runs, Camyla generates more than 2,700 novel model implementations and 40 complete manuscripts, and surpasses the strongest per-dataset baseline selected from 14 established architectures, including nnU-Net, on 22 and 18 of 31 datasets under identical training budgets, respectively (union: 24/31). Senior human reviewers score the generated manuscripts at the T1/T2 boundary of contemporary medical imaging journals. Relative to automated baselines, Camyla outperforms AutoML and NAS systems on aggregate segmentation performance and exceeds six open-ended research agents on both task completion and baseline-surpassing frequency. These results suggest that domain-scale autonomous research is achievable in medical image segmentation.

[CV-147] Retrieving to Recover: Towards Incomplete Audio-Visual Question Answering via Semantic-consistent Purification

【速读】:该论文旨在解决音频-视觉问答(Audio-Visual Question Answering, AVQA)中因模态缺失导致性能严重下降的问题,尤其是现有基于生成式补全(generative imputation)的方法难以保留缺失模态的特异性知识,易引发幻觉并削弱推理准确性。其解决方案的关键在于提出R²ScP框架,将缺失模态处理范式从传统的生成式补全转向基于检索的恢复(retrieval-based recovery),通过统一语义嵌入实现跨模态检索以获取领域特定知识,并引入上下文感知的自适应净化机制去除检索数据中的潜在语义噪声,同时采用两阶段训练策略显式建模不同来源知识间的语义关系,从而显著提升AVQA在模态不完整场景下的鲁棒性与准确性。

链接: https://arxiv.org/abs/2604.10695
作者: Jiayu Zhang,Shuo Ye,Qilang Ye,Zihan Song,Jiajian Huang,Zitong Yu
机构: Great Bay University (大湾大学); Tsinghua University (清华大学); Nankai University (南开大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent Audio-Visual Question Answering (AVQA) methods have advanced significantly. However, most AVQA methods lack effective mechanisms for handling missing modalities, suffering from severe performance degradation in real-world scenarios with data interruptions. Furthermore, prevailing methods for handling missing modalities predominantly rely on generative imputation to synthesize missing features. While partially effective, these methods tend to capture inter-modal commonalities but struggle to acquire unique, modality-specific knowledge within the missing data, leading to hallucinations and compromised reasoning accuracy. To tackle these challenges, we propose R ^2 ScP, a novel framework that shifts the paradigm of missing modality handling from traditional generative imputation to retrieval-based recovery. Specifically, we leverage cross-modal retrieval via unified semantic embeddings to acquire missing domain-specific knowledge. To maximize semantic restoration, we introduce a context-aware adaptive purification mechanism that eliminates latent semantic noise within the retrieved data. Additionally, we employ a two-stage training strategy to explicitly model the semantic relationships between knowledge from different sources. Extensive experiments demonstrate that R ^2 ScP significantly improves AVQA and enhances robustness in modal-incomplete scenarios.

[CV-148] LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment

【速读】:该论文旨在解决机器人学习中因机器人示范数据稀缺而导致的扩展性问题,同时克服人类视频与机器人执行体之间存在的“具身差距”(embodiment gap)难题。现有跨具身迁移策略多依赖视觉编辑,易因视觉外观和三维几何差异引入伪影。其解决方案的关键在于提出LIDEA(Implicit Feature Distillation and Explicit Geometric Alignment)框架:在二维视觉域采用双阶段传递蒸馏管道,将人类与机器人表征对齐至共享隐空间;在三维几何域则设计具身无关的对齐策略,显式解耦具身特征与交互几何,从而实现一致的三维感知。该方法显著提升了数据效率和分布外鲁棒性,验证了人类视频可替代高达80%的昂贵机器人示范,并成功迁移未见的人类行为模式以实现泛化能力。

链接: https://arxiv.org/abs/2604.10677
作者: Yifu Xu,Bokai Lin,Xinyu Zhan,Hongjie Fang,Yong-Lu Li,Cewu Lu,Lixin Yang
机构: 1. Tsinghua University (清华大学); 2. Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); 3. National Engineering Research Center for Information Technology in Agriculture (国家农业信息技术工程研究中心)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Scaling up robot learning is hindered by the scarcity of robotic demonstrations, whereas human videos offer a vast, untapped source of interaction data. However, bridging the embodiment gap between human hands and robot arms remains a critical challenge. Existing cross-embodiment transfer strategies typically rely on visual editing, but they often introduce visual artifacts due to intrinsic discrepancies in visual appearance and 3D geometry. To address these limitations, we introduce LIDEA (Implicit Feature Distillation and Explicit Geometric Alignment), an imitation learning framework in which policy learning benefits from human demonstrations. In the 2D visual domain, LIDEA employs a dual-stage transitive distillation pipeline that aligns human and robot representations in a shared latent space. In the 3D geometric domain, we propose an embodiment-agnostic alignment strategy that explicitly decouples embodiment from interaction geometry, ensuring consistent 3D-aware perception. Extensive experiments empirically validate LIDEA from two perspectives: data efficiency and OOD robustness. Results show that human data substitutes up to 80% of costly robot demonstrations, and the framework successfully transfers unseen patterns from human videos for out-of-distribution generalization.

[CV-149] HiddenObjects: Scalable Diffusion-Distilled Spatial Priors for Object Placement

【速读】:该论文旨在解决自然场景中物体放置空间先验(spatial prior)学习的难题,现有方法要么依赖人工标注数据(规模受限),要么采用基于图像修复(inpainting)的物体移除流程,易引入伪影并导致捷径学习(shortcut learning)。其解决方案的关键在于提出一个全自动且可扩展的框架,利用基于扩散模型的图像修复管道,在高质量真实背景上评估密集物体放置,并构建了包含2700万条标注的HiddenObjects数据集;进一步通过知识蒸馏将这些先验信息压缩为轻量级模型,实现比传统方法快23万倍的推理速度,同时在下游图像编辑任务中显著优于人类稀疏标注和零样本视觉-语言模型。

链接: https://arxiv.org/abs/2604.10675
作者: Marco Schouten,Ioannis Siglidis,Serge Belongie,Dim P. Papadopoulos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a method to learn explicit, class-conditioned spatial priors for object placement in natural scenes by distilling the implicit placement knowledge encoded in text-conditioned diffusion models. Prior work relies either on manually annotated data, which is inherently limited in scale, or on inpainting-based object-removal pipelines, whose artifacts promote shortcut learning. To address these limitations, we introduce a fully automated and scalable framework that evaluates dense object placements on high-quality real backgrounds using a diffusion-based inpainting pipeline. With this pipeline, we construct HiddenObjects, a large-scale dataset comprising 27M placement annotations, evaluated across 27k distinct scenes, with ranked bounding box insertions for different images and object categories. Experimental results show that our spatial priors outperform sparse human annotations on a downstream image editing task (3.90 vs. 2.68 VLM-Judge), and significantly surpass existing placement baselines and zero-shot Vision-Language Models for object placement. Furthermore, we distill these priors into a lightweight model for fast practical inference (230,000x faster).

[CV-150] LoViF 2026 The First Challenge on Weather Removal in Videos CVPR

【速读】:该论文旨在解决视频天气去除(Weather Removal in Videos, WRV)问题,即从受恶劣天气(如雨雪)影响的视频中恢复出清晰、结构完整且运动动态一致的干净视频。其解决方案的关键在于构建了一个新的短时视频天气去除数据集(LoViF 2026 Challenge Dataset),包含18段合成视频帧与对应的真实世界地面真值帧(共1,216对),分辨率为832×480,并按1:1:1比例划分训练、验证和测试集;同时设计了融合保真度与感知质量的评估协议,以推动在真实天气条件下实现鲁棒且逼真的视频复原。

链接: https://arxiv.org/abs/2604.10655
作者: Chenghao Qian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: CVPR Workshop Challenge Report

点击查看摘要

Abstract:This paper presents a review of the LoViF 2026 Challenge on Weather Removal in Videos. The challenge encourages the development of methods for restoring clean videos from inputs degraded by adverse weather conditions such as rain and snow, with an emphasis on achieving visually plausible and temporally consistent results while preserving scene structure and motion dynamics. To support this task, we introduce a new short-form WRV dataset tailored for video weather removal. It consists of 18 videos 1,216 synthesized frames paired with 1,216 real-world ground-truth frames at a resolution of 832 x 480, and is split into training, validation, and test sets with a ratio of 1:1:1. The goal of this challenge is to advance robust and realistic video restoration under real-world weather conditions, with evaluation protocols that jointly consider fidelity and perceptual quality. The challenge attracted 37 participants and received 5 valid final submissions with corresponding fact sheets, contributing to progress in weather removal for videos. The project is publicly available at this https URL.

[CV-151] LogitDynamics: Reliable ViT Error Detection from Layerwise Logit Trajectories CVPR2026

【速读】:该论文旨在解决视觉模型在实际部署中缺乏可靠置信度估计的问题,具体聚焦于如何仅通过一次前向传播(single forward pass)来预测图像分类器输出的正确性,即误差预测(error prediction)。其核心挑战在于从模型内部信号中提取能够反映分类不确定性的深度特征。解决方案的关键在于提出一种轻量级方法,通过在Vision Transformer(ViT)的中间层附加线性头(linear heads),捕捉类别证据(class evidence)随网络深度演变的过程:不仅提取预测类别的logits及其top-K竞争类别的信息,还统计高排名类别在不同深度上的不稳定性。基于这些特征训练一个线性探测器(linear probe)以预测错误指示信号(error indicator),从而实现高精度且计算开销极小的误差预测,在多个数据集上显著提升或匹配基线模型的AUCPR指标,并展现出更强的跨数据集泛化能力。

链接: https://arxiv.org/abs/2604.10643
作者: Ido Beigelman,Moti Freiman
机构: Technion - Israel Institute of Technology (以色列理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the HOW 2026 workshop at CVPR 2026; 7 pages, 3 figures

点击查看摘要

Abstract:Reliable confidence estimation is critical when deploying vision models. We study error prediction: determining whether an image classifier’s output is correct using only signals from a single forward pass. Motivated by internal-signal hallucination detection in large language models, we investigate whether similar depth-wise signals exist in Vision Transformers (ViTs). We propose a simple method that models how class evidence evolves across layers. By attaching lightweight linear heads to intermediate layers, we extract features from the last L layers that capture both the logits of the predicted class and its top-K competitors, as well as statistics describing instability of top-ranked classes across depth. A linear probe trained on these features predicts the error indicator. Across datasets, our method improves or matches AUCPR over baselines and shows stronger cross-dataset generalization while requiring minimal additional computation.

[CV-152] Language Prompt vs. Image Enhancement: Boosting Object Detection With CLIP in Hazy Environments

【速读】:该论文旨在解决雾霾环境下目标检测困难的问题,即雾霾导致目标物体视觉特征退化、语义信息被环境噪声削弱,使得检测模型难以准确识别。其解决方案的关键在于不依赖图像增强模块,而是利用语言提示(language prompts)来增强被削弱的语义信息;具体通过设计近似互斥性(Approximation of Mutual Exclusion, AME)为交叉熵损失提供可信权重,从而构建CLIP引导的交叉熵损失(CLIP-CE),使模型在反向传播过程中自动增强弱化语义,提升检测性能;进一步提出自适应微调的AME(Fine-tuned AME, FAME),根据预测置信度动态调整权重,缓解原始AME中存在的优化不平衡问题。

链接: https://arxiv.org/abs/2604.10637
作者: Jian Pang,Bingfeng Zhang,Jin Wang,Baodi Liu,Dapeng Tao,Weifeng Liu
机构: China University of Petroleum (East China) (中国石油大学(华东)); Yunnan University (云南大学); Yunnan United Vision Technology Co., Ltd. (云南省联合视觉科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object detection in hazy environments is challenging because degraded objects are nearly invisible and their semantics are weakened by environmental noise, making it difficult for detectors to identify. Common approaches involve image enhancement to boost weakened semantics, but these methods are limited by the instability of enhanced modules. This paper proposes a novel solution by employing language prompts to enhance weakened semantics without image enhancement. Specifically, we design Approximation of Mutual Exclusion (AME) to provide credible weights for Cross-Entropy Loss, resulting in CLIP-guided Cross-Entropy Loss (CLIP-CE). The provided weights assess the semantic weakening of objects. Through the backpropagation of CLIP-CE, weakened semantics are enhanced, making degraded objects easier to detect. In addition, we present Fine-tuned AME (FAME) which adaptively fine-tunes the weight of AME based on the predicted confidence. The proposed FAME compensates for the imbalanced optimization in AME. Furthermore, we present HazyCOCO, a large-scale synthetic hazy dataset comprising 61258 images. Experimental results demonstrate that our method achieves state-of-the-art performance. The code and dataset will be released.

[CV-153] NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results CVPR2026

【速读】:该论文旨在解决昼夜条件下双焦点图像中雨滴去除(Raindrop Removal)的挑战问题,其核心目标是建立一个在不同光照和对焦条件下的强健且实用的基准测试体系。解决方案的关键在于基于真实世界采集的Raindrop Clarity数据集(包含14,139张训练图像、407张验证图像和593张测试图像),通过吸引168支参赛团队提交17个有效方案,验证了当前方法在复杂场景下对雨滴去除任务的有效性与进步性。

链接: https://arxiv.org/abs/2604.10634
作者: Xin Li,Yeying Jin,Suhang Yao,Beibei Lin,Zhaoxin Fan,Wending Yan,Xin Jin,Zongwei Wu,Bingchen Li,Peishu Shi,Yufei Yang,Yu Li,Zhibo Chen,Bihan Wen,Robby T. Tan,Radu Timofte,Runzhe Li,Kui Jiang,Zhaocheng Yu,Yiang Chen,Junjun Jiang,Xianming Liu,Hongde Gu,Zeliang Li,Mache You,Jiangxin Dong,Jinshan Pan,Qiyu Rong,Bowen Shao,Hongyuan Jing,Mengmeng Zhang,Bo Ding,Hui Zhang,Yi Ren,Mohab Kishawy,Jun Chen,Anh-Kiet Duong,Petra Gomez-Kramer,Jean-Michel Carozza,Wangzhi Xing,Xin Lu,Enxuan Gu,Jingxi Zhang,Diqi Chen,Qiaosi Yi,Bingcai Wei,Wenjie Li,Bowen Tie,Heng Guo,Zhanyu Ma,Jiachen Tu,Guoyi Xu,Yaoxin Jiang,Cici Liu,Yaokun Shi,Paula Garrido Mellado,Daniel Feijoo,Alvaro Garcia Lara,Marcos V. Conde,Zhidong Zhu,Bangshu Xiong,Qiaofeng Ou,Zhibo Rao,Wei Li,Zida Zhang,Hui Geng,Qisheng Xu,Xuyao Deng,Changjian Wang,Kele Xu,Guanglu Dong,Qiyao Zhao,Tianheng Zheng,Chunlei Li,Lichao Mou,Chao Ren,Chang-De Peng,Chieh-Yu Tsai,Guan-Cheng Liu,Li-Wei Kang,Abhishek Rajak,Milan Kumar Singh,Ankit Kumar,Dimple Sonone,Kishor Upla,Kiran Raja,Huilin Zhao,Xing Xu,Chuan Chen,Yeming Lao,Wenjing Xun,Li Yang,Bilel Benjdira,Anas M. Ali,Wadii Boulila,Hao Yang,Ruikun Zhang,Liyuan Pan
机构: University of Science and Technology of China; National University of Singapore; Tencent; Beihang University; Southwest Jiaotong University; Eastern Institute of Technology, Ningbo; Computer Vision Lab, University of Würzburg; SparcAI Inc.; IDEA; Nanyang Technological University; Harbin Institute of Technology, China; Nanjing University of Science and Technology; Beijing Union University; Department of Electrical and Computer Engineering, McMaster University; L3i Laboratory, La Rochelle University, France; LIENSs Laboratory, La Rochelle University, France; Griffith University; SenseTime; Dalian University of Technology; Massey University; Hong Kong Polytechnic University; Wuhan University; Beijing University of Posts and Telecommunications; University of Illinois at Urbana-Champaign, USA; Cidaut AI; Beihang University; Nanchang Hangkong University; National University of Defense Technology; Sichuan University; Southwest University of Science and Technology; MedAI Technology (Wuxi) Co. Ltd.; National Taiwan Normal University; Sardar Vallabhbhai National Institute of Technology (SVNIT), Surat, India; Norwegian University of Science and Technology (NTNU), Gjøvik, Norway; The Hong Kong Polytechnic University; Suzhou Institute for Advanced Research, University of Science and Technology of China; Technical University of Munich; Chizhou University; Robotics and Internet-of-Things Laboratory, Prince Sultan University, Riyadh 12435, Saudi Arabia; Beijing Institute of Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026 Workshop; NTIRE 2026 Challenge Report

点击查看摘要

Abstract:This paper presents an overview of the NTIRE 2026 Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images. Building upon the success of the first edition, this challenge attracted a wide range of impressive solutions, all developed and evaluated on our real-world Raindrop Clarity dataset~\citejin2024raindrop. For this edition, we adjust the dataset with 14,139 images for training, 407 images for validation, and 593 images for testing. The primary goal of this challenge is to establish a strong and practical benchmark for the removal of raindrops under various illumination and focus conditions. In total, 168 teams have registered for the competition, and 17 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the Raindrop Clarity dataset, demonstrating the growing progress in this challenging task.

[CV-154] How to Design a Compact High-Throughput Video Camera?

【速读】:该论文旨在解决高通量视频采集系统中因像素数量激增而导致的读出和传输瓶颈问题,尤其是在传统拼接子图像/视频方案存在系统复杂度高的情况下。其关键解决方案是提出一种基于现有技术的低比特梯度相机(gradient camera)方案,利用梯度相机在快速读出和高效表示方面的优势,有效缓解读出与传输速率跟不上像素增长的问题;同时设计了一个多尺度重建卷积神经网络(CNN),用于从低比特梯度数据中恢复高质量高分辨率图像,从而实现高通量视频成像的可行性和有效性。

链接: https://arxiv.org/abs/2604.10619
作者: Chenxi Qiu,Tao Yue,Xuemei Hu
机构: Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 10 figures

点击查看摘要

Abstract:High throughput video acquisition is a challenging problem and has been drawing increasing attention. Existing high throughput imaging systems splice hundreds of sub-images/videos into high throughput videos, suffering from extremely high system complexity. Alternatively, with pixel sizes reducing to sub-micrometer levels, integrating ultra-high throughput on a single chip is becoming feasible. Nevertheless, the readout and output transmission speed cannot keep pace with the increasing pixel numbers. To this end, this paper analyzes the strength of gradient cameras in fast readout and efficient representation, and proposes a low-bit gradient camera scheme based on existing technologies that can resolve the readout and transmission bottlenecks for high throughput video imaging. A multi-scale reconstruction CNN is proposed to reconstruct high-resolution images. Extensive experiments on both simulated and real data are conducted to demonstrate the promising quality and feasibility of the proposed method.

[CV-155] Self-supervised Pretraining of Cell Segmentation Models

【速读】:该论文旨在解决细胞实例分割(instance segmentation)在显微图像中因高质量标注数据稀缺而导致性能受限的问题。现有方法多依赖于在自然图像上预训练的模型(如Segment Anything Model, SAM)进行初始化,但这些模型学习到的物体感知和纹理先验与显微图像域存在显著差异,导致领域迁移时性能下降。解决方案的关键在于提出DINOCell框架,通过在未标注的细胞图像上对来自DINOv2的视觉表示进行持续自监督训练,实现对显微图像域的适应,随后再进行监督微调。该方法在LIVECell基准上实现了0.784的SEG分数,较领先的SAM基线提升10.42%,并展现出在三个分布外显微图像数据集上的强零样本泛化能力,验证了领域适配的自监督预训练对鲁棒细胞分割的有效性。

链接: https://arxiv.org/abs/2604.10609
作者: Kaden Stillwagon,Alexandra Dunnum VandeLoo,Benjamin Magondu,Craig R. Forest
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: 14 pages, 6 figures

点击查看摘要

Abstract:Instance segmentation enables the analysis of spatial and temporal properties of cells in microscopy images by identifying the pixels belonging to each cell. However, progress is constrained by the scarcity of high-quality labeled microscopy datasets. Many recent approaches address this challenge by initializing models with segmentation-pretrained weights from large-scale natural-image models such as Segment Anything Model (SAM). However, representations learned from natural images often encode objectness and texture priors that are poorly aligned with microscopy data, leading to degraded performance under domain shift. We propose DINOCell, a self-supervised framework for cell instance segmentation that leverages representations from DINOv2 and adapts them to microscopy through continued self-supervised training on unlabeled cell images prior to supervised fine-tuning. On the LIVECell benchmark, DINOCell achieves a SEG score of 0.784, improving by 10.42% over leading SAM-based models, and demonstrates strong zero-shot performance on three out-of-distribution microscopy datasets. These results highlight the benefits of domain-adapted self-supervised pretraining for robust cell segmentation.

[CV-156] COREY: A Prototype Study of Entropy-Guided Operator Fusion with Hadamard Reparameterization for Selective State Space Models

【速读】:该论文旨在解决状态空间模型(State Space Models, SSMs)在实际部署中因选择性状态更新被分解为碎片化内核而导致的内存带宽瓶颈问题,尤其在长上下文推理场景下尤为突出。解决方案的关键在于提出COREY框架,其核心创新包括:一是通过内存感知的操作融合(memory-aware operator fusion)减少重复中间张量的物化;二是引入基于Hadamard变换的特征重参数化方法,将归一化的Hadamard变换嵌入线性投影中,在保持功能等价性的前提下缓解重尾激活值的峰值坐标集中现象;三是利用固定宽度直方图估算激活熵作为运行时调度统计量,动态决定融合边界和分块大小,从而优化内存访问模式并降低DRAM流量。实验表明,该方案在控制条件下显著降低了代理延迟、提升了吞吐量,并优于未融合及固定深度基线。

链接: https://arxiv.org/abs/2604.10597
作者: Bo Ma,Jinsong Wu,Hongjiang Wei,Weiqi Yan
机构: Resideo Technologies, Inc.; Auckland University of Technology; Guilin University of Electronic Technology; Hikvision Research Institute
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:State Space Models (SSMs), represented by the Mamba family, provide linear-time sequence modeling and are attractive for long-context inference. Yet practical deployments remain memory-bandwidth limited because selective state updates are often decomposed into fragmented kernels with repeated intermediate tensor materialization. We present COREY, a prototype framework that combines memory-aware operator fusion with Hadamard-based feature reparameterization. Activation entropy, estimated with fixed-width histograms, is used as a runtime scheduling statistic to place fusion boundaries and choose tile sizes. To regularize heavy-tailed activations, we absorb normalized Hadamard transforms into linear projections, preserving functional equivalence while reducing peak-coordinate concentration. In a controlled prototype study over heavy-tailed SSM activations, COREY consistently reduces proxy latency, improves throughput, and lowers DRAM traffic relative to unfused and fixed-depth baselines. Low-bit results are reported only through a hand-crafted stability proxy and are intended as diagnostic evidence rather than checkpoint-level quality claims. Code repository: this https URL.

[CV-157] GeoMeld: Toward Semantically Grounded Foundation Models for Remote Sensing CVPR

【速读】:该论文旨在解决遥感领域中基础模型(foundation modeling)构建所面临的挑战,即如何在缺乏大规模、空间对齐且语义标注丰富的异构模态数据情况下,实现跨传感器的物理一致性与语义 grounded 性之间的有效协同。其核心问题是:现有资源难以支撑具有空间一致性、多分辨率覆盖以及语义标签引导的多模态训练数据集,从而限制了模型在下游任务中的迁移能力和跨传感器鲁棒性。解决方案的关键在于提出 GeoMeld 数据集和 GeoMeld-FM 预训练框架:GeoMeld 通过统一对齐协议整合约 250 万样本,涵盖多种遥感模态与分辨率,并利用代理式图像描述生成机制(agentic captioning framework)融合光谱信号、地形统计量及结构化地理元数据,生成带有可测量跨模态关系的语言监督;GeoMeld-FM 则采用多预文本掩码自编码(multi-pretext masked autoencoding)、JEPA 表示学习与图文对比对齐的联合目标,使模型在学习过程中同时捕捉可靠的跨传感器物理一致性与语义层次信息,从而显著提升模型在下游任务中的泛化性能与跨传感器适应能力。

链接: https://arxiv.org/abs/2604.10591
作者: Maram Hasan,Md Aminur Hossain,Savitra Roy,Souparna Bhowmik,Ayush V. Patel,Mainak Singha,Subhasis Chaudhuri,Muhammad Haris Khan,Biplab Banerjee
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校); Space Applications Centre, ISRO (印度空间研究组织空间应用中心); University of Trento (特伦托大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at CVPR Workshop 2026; 8 pages, 6 figures

点击查看摘要

Abstract:Effective foundation modeling in remote sensing requires spatially aligned heterogeneous modalities coupled with semantically grounded supervision, yet such resources remain limited at scale. We present GeoMeld, a large-scale multimodal dataset with approximately 2.5 million spatially aligned samples. The dataset spans diverse modalities and resolutions and is constructed under a unified alignment protocol for modality-aware representation learning. GeoMeld provides semantically grounded language supervision through an agentic captioning framework that synthesizes and verifies annotations from spectral signals, terrain statistics, and structured geographic metadata, encoding measurable cross-modality relationships within textual descriptions. To leverage this dataset, we introduce GeoMeld-FM, a pretraining framework that combines multi-pretext masked autoencoding over aligned modalities, JEPA representation learning, and caption-vision contrastive alignment. This joint objective enables the learned representation space to capture both reliable cross-sensor physical consistency and grounded semantics. Experiments demonstrate consistent gains in downstream transfer and cross-sensor robustness. Together, GeoMeld and GeoMeld-FM establish a scalable reference framework for semantically grounded multi-modal foundation modeling in remote sensing.

[CV-158] Preventing Latent Rehearsal Decay in Online Continual SSL with SOLAR

【速读】:该论文旨在解决在线持续自监督学习(Online Continual Self-Supervised Learning, OCSSL)中的稳定性-可塑性权衡问题,即在连续流式无标签非平稳数据中,如何兼顾模型快速收敛与长期性能稳定。传统方法如基于缓存重放(replay)的策略虽能加速收敛,但过度稳定会导致潜在空间退化(latent space degradation),进而引发性能下降。作者提出SOLAR方法,其关键在于引入两个诊断指标——重叠度(Overlap)和偏移度(Deviation),用于量化潜在空间退化并指导缓冲区管理;同时通过显式的重叠损失(Overlap loss)实现对可塑性的自适应调控,从而在保持高速收敛的同时显著提升最终性能,在OCSSL视觉基准上达到当前最优效果。

链接: https://arxiv.org/abs/2604.10586
作者: Giacomo Cignoni,Simone Magistri,Andrew D. Bagdanov,Antonio Carta
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper explores Online Continual Self-Supervised Learning (OCSSL), a scenario in which models learn from continuous streams of unlabeled, non-stationary data, where methods typically employ replay and fast convergence is a central desideratum. We find that OCSSL requires particular attention to the stability-plasticity trade-off: stable methods (e.g. replay with Reservoir sampling) are able to converge faster compared to plastic ones (e.g. FIFO buffer), but incur in performance drops under certain conditions. We explain this collapse phenomenon with the Latent Rehearsal Decay hypothesis, which attributes it to latent space degradation under excessive stability of replay. We introduce two metrics (Overlap and Deviation) that diagnose latent degradation and correlate with accuracy declines. Building on these insights, we propose SOLAR, which leverages efficient online proxies of Deviation to guide buffer management and incorporates an explicit Overlap loss, allowing SOLAR to adaptively managing plasticity. Experiments demonstrate that SOLAR achieves state-of-the-art performance on OCSSL vision benchmarks, with both high convergence speed and final performance.

[CV-159] CoFusion: Multispectral and Hyperspectral Image Fusion via Spectral Coordinate Attention

【速读】:该论文旨在解决多光谱与高光谱图像融合(Multispectral and Hyperspectral Image Fusion, MHIF)中跨尺度交互建模不足以及空间-光谱协同能力有限的问题,从而难以在空间细节增强与光谱保真度之间取得最优平衡。其解决方案的关键在于提出一种统一的空间-光谱协同融合框架 CoFusion,通过三个核心模块实现:(1) 多尺度生成器(Multi-Scale Generator, MSG)构建三级金字塔结构以融合全局语义与局部细节;(2) 在每一尺度内采用双分支策略,其中空间坐标感知混合模块(Spatial Coordinate-Aware Mixing module, SpaCAM)捕获多尺度空间上下文,光谱坐标感知混合模块(Spectral Coordinate-Aware Mixing module, SpeCAM)通过频域分解与坐标混合增强光谱表示;(3) 引入空间-光谱交叉融合模块(Spatial-Spectral Cross-Fusion Module, SSCFM)实现动态跨模态对齐与互补特征融合,从而显著提升融合图像的空间清晰度与光谱一致性。

链接: https://arxiv.org/abs/2604.10584
作者: Baisong Li
机构: Jilin University (吉林大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multispectral and Hyperspectral Image Fusion (MHIF) aims to reconstruct high-resolution images by integrating low-resolution hyperspectral images (LRHSI) and high-resolution multispectral images (HRMSI). However, existing methods face limitations in modeling cross-scale interactions and spatial-spectral collaboration, making it difficult to achieve an optimal trade-off between spatial detail enhancement and spectral fidelity. To address this challenge, we propose CoFusion: a unified spatial-spectral collaborative fusion framework that explicitly models cross-scale and cross-modal dependencies. Specifically, a Multi-Scale Generator (MSG) is designed to construct a three-level pyramidal architecture, enabling the effective integration of global semantics and local details. Within each scale, a dual-branch strategy is employed: the Spatial Coordinate-Aware Mixing module (SpaCAM) is utilized to capture multi-scale spatial contexts, while the Spectral Coordinate-Aware Mixing module (SpeCAM) enhances spectral representations through frequency decomposition and coordinate mixing. Furthermore, we introduce the Spatial-Spectral Cross-Fusion Module (SSCFM) to perform dynamic cross-modal alignment and complementary feature fusion. Extensive experiments on multiple benchmark datasets demonstrate that CoFusion consistently outperforms state-of-the-art methods, achieving superior performance in both spatial reconstruction and spectral consistency.

[CV-160] APNext: Whats Next for Tracking Any Point (TAP)? CVPR

【速读】:该论文旨在解决TAPNext模型在长视频序列中跟踪性能下降以及对遮挡后重新出现的查询点难以再检测的问题。其关键解决方案在于提出TAPNext++,通过引入基于数据驱动的训练策略(如利用序列并行技术训练长达1024帧的长序列)和针对性的几何增强方法(如周期性平移模拟点重新进入场景),显著提升了模型在长时间序列中的稳定性和再检测能力;同时,作者提出新的评估指标Re-Detection Average Jaccard (AJ_RD) 以量化衡量重检测性能,从而推动了生成式AI在点级视频跟踪任务上的进步,并在多个基准测试上达到新SOTA水平。

链接: https://arxiv.org/abs/2604.10582
作者: Sebastian Jung,Artem Zholus,Martin Sundermeyer,Carl Doersch,Ross Goroshin,David Joseph Tan,Sarath Chandar,Rudolph Triebel,Federico Tombari
机构: Google(谷歌); Google DeepMind(谷歌深度思维); German Aerospace Center (DLR)(德国航空航天中心); Karlsruhe Institute of Technology (KIT)(卡尔斯鲁厄理工学院); Mila - Quebec AI Institute(蒙特利尔魁北克人工智能研究所); Université de Montréal(蒙特利尔大学); Chandar Research Lab(昌达尔研究实验室); Polytechnique Montréal(蒙特利尔理工学院); Canada CIFAR AI Chair(加拿大CIFAR人工智能主席); Technical University Munich (TUM)(慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, will be publised at CVPR Findings 2026, Website this https URL

点击查看摘要

Abstract:Tracking-Any-Point (TAP) models aim to track any point through a video which is a crucial task in AR/XR and robotics applications. The recently introduced TAPNext approach proposes an end-to-end, recurrent transformer architecture to track points frame-by-frame in a purely online fashion – demonstrating competitive performance at minimal latency. However, we show that TAPNext struggles with longer video sequences and also frequently fails to re-detect query points that reappear after being occluded or leaving the frame. In this work, we present TAPNext++, a model that tracks points in sequences that are orders of magnitude longer while preserving the low memory and compute footprint of the architecture. We train the recurrent video transformer using several data-driven solutions, including training on long 1024-frame sequences enabled by sequence parallelism techniques. We highlight that re-detection performance is a blind spot in the current literature and introduce a new metric, Re-Detection Average Jaccard ( AJ_RD ), to explicitly evaluate tracking on re-appearing points. To improve re-detection of points, we introduce tailored geometric augmentations, such as periodic roll that simulates point re-entries, and supervising occluded points. We demonstrate that recurrent transformers can be substantially improved for point tracking and set a new state-of-the-art on multiple benchmarks. Model and code can be found at this https URL.

[CV-161] Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models

【速读】:该论文旨在解决从稀疏输入中合成高质量3D室内场景时面临的挑战,即在未见大区域中推断大量缺失几何结构的同时保持全局一致性,现有方法常产生局部合理但全局不一致的重建结果。其解决方案的关键在于提出Rein3D框架,通过将显式的3D高斯泼溅(3D Gaussian Splatting, 3DGS)与视频扩散模型提供的时序一致先验相结合,采用“恢复-精炼”范式:首先利用径向探索策略从原点出发生成不完美的全景视频序列以揭示遮挡区域,随后通过全景视频到视频扩散模型进行修复,并结合视频超分辨率增强几何与纹理细节,最终以这些精细化视频作为伪真实标签来更新全局3D高斯场,从而实现高保真且全局一致的3D场景重建。

链接: https://arxiv.org/abs/2604.10578
作者: Dehui Wang,Congsheng Xu,Rong Wei,Yue Shi,Shoufa Chen,Dingxiang Luo,Tianshuo Yang,Xiaokang Yang,Yusen Qin,Rui Tang,Yao Mu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The growing demand for Embodied AI and VR applications has highlighted the need for synthesizing high-quality 3D indoor scenes from sparse inputs. However, existing approaches struggle to infer massive amounts of missing geometry in large unseen areas while maintaining global consistency, often producing locally plausible but globally inconsistent reconstructions. We present Rein3D, a framework that reconstructs full 360-degree indoor environments by coupling explicit 3D Gaussian Splatting (3DGS) with temporally coherent priors from video diffusion models. Our approach follows a “restore-and-refine” paradigm: we employ a radial exploration strategy to render imperfect panoramic videos along trajectories starting from the origin, effectively uncovering occluded regions from a coarse 3DGS initialization. These sequences are restored by a panoramic video-to-video diffusion model and further enhanced via video super-resolution to synthesize high-fidelity geometry and textures. Finally, these refined videos serve as pseudo-ground truths to update the global 3D Gaussian field. To support this task, we construct PanoV2V-15K, a dataset of over 15K paired clean and degraded panoramic videos for diffusion-based scene restoration. Experiments demonstrate that Rein3D produces photorealistic and globally consistent 3D scenes and significantly improves long-range camera exploration compared with existing baselines.

[CV-162] Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images CVPR2026

【速读】:该论文旨在解决从无姿态(unposed)多视角图像中学习鲁棒的3D表示这一挑战,当前自监督方法在几何诱导能力弱、外观细节不足以及几何与语义不一致等方面存在局限。解决方案的关键在于提出UniSplat框架,其核心创新包括:(1)双掩码策略(dual-masking strategy),通过同时掩码编码器和解码器token,并将解码器掩码聚焦于几何信息丰富的区域,迫使模型从不完整视觉线索中推断结构信息,从而增强几何感知;(2)粗到精的高斯点绘(coarse-to-fine Gaussian splatting)策略,逐步优化辐射场以减少外观与语义之间的不一致性;(3)姿态条件重校准机制(pose-conditioned recalibration mechanism),利用估计的相机参数将预测的3D点云和语义图重新投影至图像平面,并与RGB及语义预测对齐,实现多任务间几何-语义一致性,最终构建出对稀疏视图和无姿态输入具有鲁棒性的统一3D表示,为场景理解与具身智能提供感知基础。

链接: https://arxiv.org/abs/2604.10573
作者: Bo Zhou,Qiuxia Lai,Zeren Sun,Xiangbo Shu,Yazhou Yao,Wenguan Wang
机构: Nanjing University of Science and Technology (南京理工大学); Zhejiang University (浙江大学); Communication University of China (中国传媒大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026

点击查看摘要

Abstract:Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics. We introduce UniSplat, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a dual-masking strategy that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens, and targeting decoder masks toward geometry-rich regions, the model is forced to infer structural information from incomplete visual cues, yielding geometry-aware representations even under unposed inputs. Second, we develop a coarse-to-fine Gaussian splatting strategy that reduces appearance-semantics inconsistencies by progressively refining the radiance field. Finally, to enforce geometric-semantic consistency, we introduce a pose-conditioned recalibration mechanism that interrelates the outputs of multiple heads by re-projecting predicted 3D point and semantic maps into the image plane using estimated camera parameters, and aligning them with corresponding RGB and semantic predictions to ensure cross-task consistency, thereby resolving geometry-semantic mismatches. Together, these components yield unified 3D representations that are robust to unposed, sparse-view inputs and generalize across diverse tasks, laying a perceptual foundation for spatial intelligence.

[CV-163] Spatio-Temporal Difference Guided Motion Deblurring with the Complementary Vision Sensor

【速读】:该论文旨在解决在极端运动场景下,仅依赖RGB图像进行去模糊(deblurring)时因缺乏结构或时间线索而导致的病态问题(ill-posed problem)。传统方法在快速运动中难以恢复清晰图像,而现有事件相机虽能提供时序信息,却受限于事件率饱和及边缘与运动信息混叠的问题。为此,作者提出基于互补视觉传感器(Complementary Vision Sensor, CVS) Tianmouc 的多模态数据——同步采集的RGB帧、空间差分(Spatial Difference, SD)和时间差分(Temporal Difference, TD)信号,分别编码结构边缘和运动信息。解决方案的关键在于设计一种时空差分引导的去模糊网络(Spatio-Temporal Difference Guided Deblur Net, STGDNet),其采用递归多分支架构,迭代地编码并融合SD与TD序列,从而有效恢复模糊RGB输入中丢失的结构与色彩细节,在合成与真实世界场景中均展现出优越性能与强泛化能力。

链接: https://arxiv.org/abs/2604.10554
作者: Yapeng Meng,Lin Yang,Yuguo Chen,Xiangru Chen,Taoyi Wang,Lijian Wang,Zheyu Yang,Yihan Lin,Rong Zhao
机构: Tsinghua University (清华大学); Xiamen University (厦门大学); Communication University of China (中国传媒大学); Primevision Technology (普瑞视觉科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Motion blur arises when rapid scene changes occur during the exposure period, collapsing rich intra-exposure motion into a single RGB frame. Without explicit structural or temporal cues, RGB-only deblurring is highly ill-posed and often fails under extreme motion. Inspired by the human visual system, brain-inspired vision sensors introduce temporally dense information to alleviate this problem. However, event cameras still suffer from event rate saturation under rapid motion, while the event modality entangles edge features and motion cues, which limits their effectiveness. As a recent breakthrough, the complementary vision sensor (CVS), Tianmouc, captures synchronized RGB frames together with high-frame-rate, multi-bit spatial difference (SD, encoding structural edges) and temporal difference (TD, encoding motion cues) data within a single RGB exposure, offering a promising solution for RGB deblurring under extreme dynamic scenes. To fully leverage these complementary modalities, we propose Spatio-Temporal Difference Guided Deblur Net (STGDNet), which adopts a recurrent multi-branch architecture that iteratively encodes and fuses SD and TD sequences to restore structure and color details lost in blurry RGB inputs. Our method outperforms current RGB or event-based approaches in both synthetic CVS dataset and real-world evaluations. Moreover, STGDNet exhibits strong generalization capability across over 100 extreme real-world scenarios. Project page: this https URL

[CV-164] NTIRE 2026 Challenge on Short-form UGC Video Restoration in the Wild with Generative Models: Datasets Methods and Results CVPR2026

【速读】:该论文旨在解决短格式用户生成内容(Short-form User-Generated Content, S-UGC)视频在复杂真实场景下退化问题的恢复难题,特别是在基于生成式模型(Generative Models)的新范式下的视频修复挑战。其解决方案的关键在于构建了一个名为KwaiVIR的新基准数据集,该数据集包含合成退化视频与真实世界S-UGC视频,覆盖了多种复杂退化类型,并通过主观评价(用户研究)与客观指标双轨评估机制,全面衡量修复质量。这一设计推动了生成式AI在真实场景视频修复中的实用性和有效性验证。

链接: https://arxiv.org/abs/2604.10551
作者: Xin Li,Jiachao Gong,Xijun Wang,Shiyao Xiong,Bingchen Li,Suhang Yao,Chao Zhou,Zhibo Chen,Radu Timofte,Yuxiang Chen,Shibo Yin,Yilian Zhong,Yushun Fang,Xilei Zhu,Yahui Wang,Chen Lu,Meisong Zheng,Xiaoxu Chen,Jing Yang,Zhaokun Hu,Jiahui Liu,Ying Chen,Haoran Bai,Sibin Deng,Shengxi Li,Mai Xu,Junyang Chen,Hao Chen,Xinzhe Zhu,Fengkai Zhang,Long Sun,Yixing Yang,Xindong Zhang,Jiangxin Dong,Jinshan Pan,Jiyuan Zhang,Shuai Liu,Yibin Huang,Xiaotao Wang,Lei Lei,Zhirui Liu,Shinan Chen,Shang-Quan Sun,Wenqi Ren,Jingyi Xu,Zihong Chen,Zhuoya Zou,Xiuhao Qiu,Jingyu Ma,Huiyuan Fu,Kun Liu,Huadong Ma,Dehao Feng,Zhijie Ma,Boqi Zhang,Jiawei Shi,Hao Kang,Yixin Yang,Yeying Jin,Xu Cheng,Yuxuan Jiang,Chengxi Zeng,Tianhao Peng,Fan Zhang,David Bull,Yanan Xing,Jiachen Tu,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Yaokun Shi,Wei Zhou,Linfeng Li,Hang Song,Qi Xu,Kun Yuan,Yizhen Shao,Yulin Ren
机构: University of Science and Technology of China (中国科学技术大学); KuaiShou Technology (快手科技); Computer Vision Lab, University of Wurzburg, Germany (德国维尔茨堡大学计算机视觉实验室); Xiaohongshu Inc (小红书公司); Alibaba Group Taobao Tmall Group, Beihang University (阿里巴巴集团淘宝天猫集团; 北航大学); Nanjing University of Science and Technology, Hunan University, OPPO Research Institute (南京理工大学; 湖南大学; OPPO研究院); Xiaomi (小米); Sun Yat-sen University, Xi’an Jiaotong University, Nanyang Technological University (中山大学; 西安交通大学; 南洋理工大学); Beihang University, Tsinghua University (北航大学; 清华大学); Beijing University of Posts and Telecommunications, JD Inc., China, China University of Petroleum (北京邮电大学; 京东公司; 中国石油大学); Nanjing University of Science and Technology, National University of Singapore, Hunan University, Tsinghua University (南京理工大学; 新加坡国立大学; 湖南大学; 清华大学); University of Bristol (布里斯托大学); Taiyuan University of Technology (太原理工大学); University of Illinois Urbana Champaign (伊利诺伊大学厄巴纳-香槟分校); National University of Singapore, Xi’an Jiaotong University, Shanghai Jiao Tong University (新加坡国立大学; 西安交通大学; 上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026 workshop; NTIRE 2026

点击查看摘要

Abstract:This paper presents an overview of the NTIRE 2026 Challenge on Short-form UGC Video Restoration in the Wild with Generative Models. This challenge utilizes a new short-form UGC (S-UGC) video restoration benchmark, termed KwaiVIR, which is contributed by USTC and Kuaishou Technology. It contains both synthetically distorted videos and real-world short-form UGC videos in the wild. For this edition, the released data include 200 synthetic training videos, 48 wild training videos, 11 validation videos, and 20 testing videos. The primary goal of this challenge is to establish a strong and practical benchmark for restoring short-form UGC videos under complex real-world degradations, especially in the emerging paradigm of generative-model-based S-UGC video restoration. This challenge has two tracks: (i) the primary track is a subjective track, where the evaluation is based on a user study; (ii) the second track is an objective track. These two tracks enable a comprehensive assessment of restoration quality. In total, 95 teams have registered for this competition. And 12 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the KwaiVIR benchmark, demonstrating encouraging progress in short-form UGC video restoration in the wild.

[CV-165] Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression CVPR2026

【速读】:该论文旨在解决极低比特率下图像压缩中现有向量量化(Vector Quantization, VQ)方法缺乏联合率失真(Rate-Distortion, RD)优化机制的问题,其核心挑战在于表示学习与熵建模之间的脱节。解决方案的关键在于提出RDVQ框架,通过可微分松弛(differentiable relaxation)代码本分布,使熵损失能够直接塑造潜在先验,从而实现端到端的RD优化;同时引入自回归熵模型以支持精确的熵估计和测试时比特率控制,显著提升了压缩效率与感知质量,在保持轻量架构的同时实现了优于或相当的性能表现。

链接: https://arxiv.org/abs/2604.10546
作者: Shiyin Jiang,Wei Long,Minghao Han,Zhenghao Chen,Ce Zhu,Shuhang Gu
机构: University of Electronic Science and Technology of China (电子科技大学); The University of Newcastle, Australia (纽卡斯尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at CVPR 2026 as an Oral presentation

点击查看摘要

Abstract:The rapid growth of visual data under stringent storage and bandwidth constraints makes extremely low-bitrate image compression increasingly important. While Vector Quantization (VQ) offers strong structural fidelity, existing methods lack a principled mechanism for joint rate-distortion (RD) optimization due to the disconnect between representation learning and entropy modeling. We propose RDVQ, a unified framework that enables end-to-end RD optimization for VQ-based compression via a differentiable relaxation of the codebook distribution, allowing the entropy loss to directly shape the latent prior. We further develop an autoregressive entropy model that supports accurate entropy modeling and test-time rate control. Extensive experiments demonstrate that RDVQ achieves strong performance at extremely low bitrates with a lightweight architecture, attaining competitive or superior perceptual quality with significantly fewer parameters. Compared with RDEIC, RDVQ reduces bitrate by up to 75.71% on DISTS and 37.63% on LPIPS on DIV2K-val. Beyond empirical gains, RDVQ introduces an entropy-constrained formulation of VQ, highlighting the potential for a more unified view of image tokenization and compression. The code will be available at this https URL.

[CV-166] Bidirectional Learning of Facial Action Units and Expressions via Structured Semantic Mapping across Heterogeneous Datasets

【速读】:该论文旨在解决面部动作单元(Action Unit, AU)检测与面部表情(Facial Expression, FE)识别之间的双向知识迁移不足问题,尤其是在异构数据条件下(如标注范式不同、标签粒度差异及数据可用性不均)难以实现有效联合学习的挑战。其解决方案的关键在于提出一种结构化语义映射(Structured Semantic Mapping, SSM)框架,包含三个核心组件:(1) 共享视觉主干网络以从动态AU和FE视频中学习统一的面部表征;(2) 通过文本语义原型(Textual Semantic Prototype, TSP)模块构建结构化语义原型,利用固定文本描述与可学习上下文提示生成监督信号并实现跨任务对齐;(3) 动态先验映射(Dynamic Prior Mapping, DPM)模块引入面部动作编码系统(Facial Action Coding System)先验知识,并在高层特征空间中学习数据驱动的关联矩阵,从而实现显式的双向知识传递。实验证明,SSM在多个主流AU检测与FE识别基准上均取得当前最优性能,并表明整体表情语义可反向增强细粒度AU学习,即使在异构数据集之间亦然。

链接: https://arxiv.org/abs/2604.10541
作者: Jia Li,Yu Zhang,Yin Chen,Zhenzhen Hu,Yong Li,Richang Hong,Shiguang Shan,Meng Wang
机构: Hefei University of Technology (合肥工业大学); Southeast University (东南大学); Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 11 figures

点击查看摘要

Abstract:Facial action unit (AU) detection and facial expression (FE) recognition can be jointly viewed as affective facial behavior tasks, representing fine-grained muscular activations and coarse-grained holistic affective states, respectively. Despite their inherent semantic correlation, existing studies predominantly focus on knowledge transfer from AUs to FEs, while bidirectional learning remains insufficiently explored. In practice, this challenge is further compounded by heterogeneous data conditions, where AU and FE datasets differ in annotation paradigms (frame-level vs.\ clip-level), label granularity, and data availability and diversity, hindering effective joint learning. To address these issues, we propose a Structured Semantic Mapping (SSM) framework for bidirectional AU–FE learning under different data domains and heterogeneous supervision. SSM consists of three key components: (1) a shared visual backbone that learns unified facial representations from dynamic AU and FE videos; (2) semantic mediation via a Textual Semantic Prototype (TSP) module, which constructs structured semantic prototypes from fixed textual descriptions augmented with learnable context prompts, serving as supervision signals and cross-task alignment anchors in a shared semantic space; and (3) a Dynamic Prior Mapping (DPM) module that incorporates prior knowledge derived from the Facial Action Coding System and learns a data-driven association matrix in a high-level feature space, enabling explicit and bidirectional knowledge transfer. Extensive experiments on popular AU detection and FE recognition benchmarks show that SSM achieves state-of-the-art performance on both tasks simultaneously, and demonstrate that holistic expression semantics can in turn enhance fine-grained AU learning even across heterogeneous datasets.

[CV-167] he Second Challenge on Real-World Face Restoration at NTIRE 2026: Methods and Results CVPR2026

【速读】:该论文旨在解决真实场景下人脸修复(face restoration)中生成自然、逼真且保持身份一致性的图像问题。其核心挑战在于提升感知质量和现实感,同时不限制计算资源或训练数据的使用。解决方案的关键在于采用加权图像质量评估(weighted image quality assessment, IQA)作为性能指标,并引入AdaFace模型作为身份一致性验证工具,从而在不牺牲身份保真的前提下优化输出结果的视觉真实性。这一方法推动了真实世界人脸修复技术的前沿发展,并为该领域提供了最新的研究趋势参考。

链接: https://arxiv.org/abs/2604.10532
作者: Jingkai Wang,Jue Gong,Zheng Chen,Kai Liu,Jiatong Li,Yulun Zhang,Radu Timofte,Jiachen Tu,Yaokun Shi,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Yingsi Chen,Yijiao Liu,Hui Li,Yu Wang,Congchao Zhu,Alexandru-Gabriel Lefterache,Anamaria Radoi,Chuanyue Yan,Tao Lu,Yanduo Zhang,Kanghui Zhao,Jiaming Wang,Yuqi Li,WenBo Xiong,Yifei Chen,Xian Hu,Wei Deng,Daiguo Zhou,Sujith Roy V,Claudia Jesuraj,Vikas B,Spoorthi LC,Nikhil Akalwadi,Ramesh Ashok Tabib,Uma Mudenagudi,Yuxuan Jiang,Chengxi Zeng,Tianhao Peng,Fan Zhang,David Bull Wei Zhou,Linfeng Li,Hongyu Huang,Hoyoung Lee,SangYun Oh,ChangYoung Jeong,Axi Niu,Jinyang Zhang,Zhenguo Wu,Senyan Qing,Jinqiu Sun,Yanning Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NTIRE 26: this https URL . NTIRE Real-World Face Restoration: this https URL . CVPR 2026 Workshop

点击查看摘要

Abstract:This paper provides a review of the NTIRE 2026 challenge on real-world face restoration, highlighting the proposed solutions and the resulting outcomes. The challenge focuses on generating natural and realistic outputs while maintaining identity consistency. Its goal is to advance state-of-the-art solutions for perceptual quality and realism, without imposing constraints on computational resources or training data. Performance is evaluated using a weighted image quality assessment (IQA) score and employs the AdaFace model as an identity checker. The competition attracted 96 registrants, with 10 teams submitting valid models; ultimately, 9 teams achieved valid scores in the final ranking. This collaborative effort advances the performance of real-world face restoration while offering an in-depth overview of the latest trends in the field.

[CV-168] BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs CVPR

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在零样本识别任务中是否真正理解几何结构的问题,而非仅依赖RGB纹理或上下文先验作为统计捷径。现有评估方法未能有效分离几何感知与纹理映射机制,且常因标注不精确而泄露环境线索,导致对模型真实能力的误判。为此,作者提出名为BareBones的零样本基准测试框架,其核心创新在于构建了一个无噪声的几何分类体系:通过整合六个数据集(包括五个成熟分割数据源和新提出的WTP-Bench)中的像素级轮廓图(silhouettes),强制模型仅凭边界轮廓进行细粒度几何概念识别。实验表明,26种先进VLMs在RGB信息缺失时性能急剧下降,揭示了普遍存在的“纹理偏置悬崖”(Texture Bias Cliff)现象,从而为衡量模型真正的几何语义接地能力提供了严格标准。

链接: https://arxiv.org/abs/2604.10528
作者: Aaditya Baranwal,Vishal Yadav,Abhishek Rajora
机构: University of Central Florida (中佛罗里达大学); Independent Researcher (独立研究员); University of Calgary (卡尔加里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR (13th FGVC Workshop) 2026

点击查看摘要

Abstract:While Vision-Language Models (VLMs) demonstrate remarkable zero-shot recognition capabilities across a diverse spectrum of multimodal tasks, it yet remains an open question whether these architectures genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts. Existing evaluations fail to isolate this mechanism, conflating semantic reasoning with texture mapping and relying on imprecise annotations that inadvertently leak environmental cues. To address this gap, we introduce \textbfBareBones, a zero-shot benchmark designed to stress-test pure geometric shape comprehension. We curate pixel-level silhouettes of geometrically distinct classes across six datasets: five established segmentation sources (ImageNet-S, DIS5K, ThinObject5K, PASCAL VOC, CUB-200) and our novel flagship collection, WTP-Bench, establishing a noise-free geometric taxonomy. WTP-Bench is an extreme, fine-grained visual puzzle that forces models to identify inter-class geometric concepts from boundary contours alone. Our evaluation of 26 state-of-the-art proprietary and open-weight VLMs (\eg, GPT-4.1, Gemini, Claude Sonnet 4.5, LLaVA) reveals a consistent, severe performance collapse under RGB deprivation, a phenomenon we term the \textitTexture Bias Cliff. By documenting universal structural blindspots, BareBones establishes a rigorous yardstick for genuine geometric grounding.

[CV-169] STORM: End-to-End Referring Multi-Object Tracking in Videos CVPR2026

【速读】:该论文旨在解决引用式多目标跟踪(Referring Multi-Object Tracking, RMOT)任务中因训练视频稀缺、标注模糊及领域受限导致现有方法性能有限的问题。其核心解决方案是提出一个端到端的多模态大语言模型(Multimodal Large Language Model, MLLM)——STORM,该模型在统一框架内联合执行目标定位与跟踪,无需外部检测器,并通过外观、运动和语言的协同推理实现更精确的空间-时间定位。关键创新在于引入任务组合学习(Task-Composition Learning, TCL)策略,将RMOT分解为图像定位和物体跟踪两个数据丰富的子任务,从而提升数据效率并促进结构化时空推理能力;同时构建了STORM-Bench数据集,提供准确轨迹和多样且无歧义的引用表达,显著推动了该领域的研究进展。

链接: https://arxiv.org/abs/2604.10527
作者: Zijia Lu,Jingru Yi,Jue Wang,Yuxiao Chen,Junwen Chen,Xinyu Li,Davide Modolo
机构: Amazon(亚马逊)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026 Findings

点击查看摘要

Abstract:Referring multi-object tracking (RMOT) is a task of associating all the objects in a video that semantically match with given textual queries or referring expressions. Existing RMOT approaches decompose object grounding and tracking into separated modules and exhibit limited performance due to the scarcity of training videos, ambiguous annotations, and restricted domains. In this work, we introduce STORM, an end-to-end MLLM that jointly performs grounding and tracking within a unified framework, eliminating external detectors and enabling coherent reasoning over appearance, motion, and language. To improve data efficiency, we propose a task-composition learning (TCL) strategy that decomposes RMOT into image grounding and object tracking, allowing STORM to leverage data-rich sub-tasks and learn structured spatial–temporal reasoning. We further construct STORM-Bench, a new RMOT dataset with accurate trajectories and diverse, unambiguous referring expressions generated through a bottom-up annotation pipeline. Extensive experiments show that STORM achieves state-of-the-art performance on image grounding, single-object tracking, and RMOT benchmarks, demonstrating strong generalization and robust spatial–temporal grounding in complex real-world scenarios. STORM-Bench is released at this https URL.

[CV-170] FGML-DG: Feynman-Inspired Cognitive Science Paradigm for Cross-Domain Medical Image Segmentation

【速读】:该论文旨在解决多模态医学图像分割(如MRI、CT等)中因域偏移(domain shift)、成像差异和患者多样性导致的领域泛化(Domain Generalization, DG)性能下降问题。解决方案的关键在于提出了一种受认知科学启发的元学习框架——Feynman-Guided Meta-Learning for Domain Generalization (FGML-DG),其核心创新包括:1)借鉴费曼学习法中的“概念理解”原则,将跨域复杂风格特征简化为风格信息统计量,实现精确的风格特征对齐;2)设计元风格记忆与召回机制(MetaStyle),模拟人类记忆系统复用历史域知识;3)引入反馈驱动再训练策略(Feedback-Driven Re-Training, FDRT),依据预测误差动态调整学习重点,从而提升模型在未见域上的适应能力与泛化性能。

链接: https://arxiv.org/abs/2604.10524
作者: Yucheng Song,Chenxi Li,Haokang Ding,Zhining Liao,Zhifang Liao
机构: Central South University (中南大学); University of Glasgow (格拉斯哥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In medical image segmentation across multiple modalities (e.g., MRI, CT, etc.) and heterogeneous data sources (e.g., different hospitals and devices), Domain Generalization (DG) remains a critical challenge in AI-driven healthcare. This challenge primarily arises from domain shifts, imaging variations, and patient diversity, which often lead to degraded model performance in unseen domains. To address these limitations, we identify key issues in existing methods, including insufficient simplification of complex style features, inadequate reuse of domain knowledge, and a lack of feedback-driven optimization. To tackle these problems, inspired by Feynman’s learning techniques in educational psychology, this paper introduces a cognitive science-inspired meta-learning paradigm for medical image domain generalization segmentation. We propose, for the first time, a cognitive-inspired Feynman-Guided Meta-Learning framework for medical image domain generalization segmentation (FGML-DG), which mimics human cognitive learning processes to enhance model learning and knowledge transfer. Specifically, we first leverage the ‘concept understanding’ principle from Feynman’s learning method to simplify complex features across domains into style information statistics, achieving precise style feature alignment. Second, we design a meta-style memory and recall method (MetaStyle) to emulate the human memory system’s utilization of past knowledge. Finally, we incorporate a Feedback-Driven Re-Training strategy (FDRT), which mimics Feynman’s emphasis on targeted relearning, enabling the model to dynamically adjust learning focus based on prediction errors. Experimental results demonstrate that our method outperforms other existing domain generalization approaches on two challenging medical image domain generalization tasks.

[CV-171] Data-Efficient Surgical Phase Segmentation in Small-Incision Cataract Surgery: A Controlled Study of Vision Foundation Models

【速读】:该论文旨在解决小切口白内障手术(manual small-incision cataract surgery, SICS)中因标注视频数据稀缺而导致的手术阶段分割(surgical phase segmentation)模型难以鲁棒训练的问题。其解决方案的关键在于通过受控比较不同视觉表征(visual representations)的有效性,采用统一的时间模型(MS-TCN++)和相同的训练/评估设置,在SICS-155数据集(含19个阶段)上系统评估监督学习编码器(ResNet-50、I3D)与大规模自监督基础模型(DINOv3、V-JEPA2)的表现,并引入缓存特征流水线(cached-feature pipeline)以解耦昂贵的视觉编码与轻量级时序建模,从而提升数据效率。实验表明,基础模型特征显著优于传统监督模型,其中DINOv3 ViT-7B在准确率(83.4%)和编辑得分(87.0)上表现最佳,同时验证了在无标签手术视频上的轻量化迁移策略的有效性与边界条件。

链接: https://arxiv.org/abs/2604.10514
作者: Lincoln Spencer,Song Wang,Chen Chen
机构: Institute of Artificial Intelligence, University of Central Florida (人工智能研究所,中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Surgical phase segmentation is central to computer-assisted surgery, yet robust models remain difficult to develop when labeled surgical videos are scarce. We study data-efficient phase segmentation for manual small-incision cataract surgery (SICS) through a controlled comparison of visual representations. To isolate representation quality, we pair each visual encoder with the same temporal model (MS-TCN++) under identical training and evaluation settings on SICS-155 (19 phases). We compare supervised encoders (ResNet-50, I3D) against large self-supervised foundation models (DINOv3, V-JEPA2), and use a cached-feature pipeline that decouples expensive visual encoding from lightweight temporal learning. Foundation-model features improve segmentation performance in this setup, with DINOv3 ViT-7B achieving the best overall results (83.4% accuracy, 87.0 edit score). We further examine cataract-domain transfer using unlabeled videos and lightweight adaptation, and analyze when it helps or hurts. Overall, the study indicates strong transferability of modern vision foundation models to surgical workflow understanding and provides practical guidance for low-label medical video settings. The project website is available at: this https URL

[CV-172] FreeScale: Scaling 3D Scenes via Certainty-Aware Free-View Generation CVPR2026

【速读】:该论文旨在解决通用新视角合成(Novel View Synthesis, NVS)模型训练中因缺乏大规模、多样化且精确相机轨迹数据而导致的泛化能力受限问题。现有真实世界数据通常稀疏离散,而合成数据虽易扩展却存在域差距且语义不真实。解决方案的关键在于提出FreeScale框架,其核心创新是利用场景重建生成高质量训练数据:通过引入一种“确定性感知的自由视点采样策略”,识别出既具有语义意义又受重建误差影响最小的新视角,从而有效避免直接从低质量重建结果中采样所导致的伪影放大问题。该方法显著提升了NVS模型性能,并在多场景3D高斯泼溅优化中展现出持续改进效果。

链接: https://arxiv.org/abs/2604.10512
作者: Chenhan Jiang,Yu Chen,Qingwen Zhang,Jifei Song,Songcen Xu,Dit-Yan Yeung,Jiankang Deng
机构: Hong Kong University of Science and Technology (香港科技大学); National University of Singapore (新加坡国立大学); KTH Royal Institute of Technology (皇家理工学院); University of Surrey (萨里大学); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2026

点击查看摘要

Abstract:The development of generalizable Novel View Synthesis (NVS) models is critically limited by the scarcity of large-scale training data featuring diverse and precise camera trajectories. While real-world captures are photorealistic, they are typically sparse and discrete. Conversely, synthetic data scales but suffers from a domain gap and often lacks realistic semantics. We introduce FreeScale, a novel framework that leverages the power of scene reconstruction to transform limited real-world image sequences into a scalable source of high-quality training data. Our key insight is that an imperfect reconstructed scene serves as a rich geometric proxy, but naively sampling from it amplifies artifacts. To this end, we propose a certainty-aware free-view sampling strategy identifying novel viewpoints that are both semantically meaningful and minimally affected by reconstruction errors. We demonstrate FreeScale’s effectiveness by scaling up the training of feedforward NVS models, achieving a notable gain of 2.7 dB in PSNR on challenging out-of-distribution benchmarks. Furthermore, we show that the generated data can actively enhance per-scene 3D Gaussian Splatting optimization, leading to consistent improvements across multiple datasets. Our work provides a practical and powerful data generation engine to overcome a fundamental bottleneck in 3D vision. Project page: this https URL.

[CV-173] Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

【速读】:该论文旨在解决多模态隐式推理(Multimodal Latent Reasoning)中因语言偏见导致的视觉信息欠优化以及复杂语义 token 梯度不稳定的问题,从而提升模型表征能力并降低推理延迟。其解决方案的关键在于提出两个核心机制:一是视觉重放模块(Visual Replay Module),利用因果自注意力估计 token 置信度,通过空间一致性约束强化细粒度视觉定位;二是路由深度缩放机制(Routing Depth Scaling),自适应地为复杂 token 分配额外的推理步骤,以实现更深层次的上下文精炼。二者协同作用,并结合渐进式课程策略将显式 Chain-of-Thought (CoT) 逐步压缩为紧凑的潜在表示,最终在多个基准上实现最优性能并显著优于显式 CoT 方法的推理速度。

链接: https://arxiv.org/abs/2604.10500
作者: Yudong Han,Yong Wang,Zaiquan Yang,Zhen Qu,Liyuan Pan,Xiangxiang Chu
机构: Beijing Institute of Technology (北京理工大学); Alibaba (阿里巴巴); AMAP (AMAP); City University of Hong Kong (香港城市大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Yangtze Delta Region Academy of Beijing Institude of Technology, Jiaxing, China (北京理工大学长三角研究院,嘉兴)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 6 figures

点击查看摘要

Abstract:Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought (CoT) decoding with implicit feature propagation, simultaneously enhancing representation informativeness and reducing inference latency. By analyzing token-level gradient dynamics during latent training, we reveal two critical observations: (1) visual tokens exhibit significantly higher and more volatile gradient norms than their textual counterparts due to inherent language bias, resulting in systematic visual under-optimization; and (2) semantically simple tokens converge rapidly, whereas complex tokens exhibit persistent gradient instability constrained by fixed architectural depths. To address these limitations, we propose a visual replay module and routing depth scaling to collaboratively enhance visual perception and refine complicated latents for deeper contextual reasoning. The former module leverages causal self-attention to estimate token saliency, reinforcing fine-grained grounding through spatially-coherent constraints. Complementarily, the latter mechanism adaptively allocates additional reasoning steps to complex tokens, enabling deeper contextual refinement. Guided by a curriculum strategy that progressively internalizes explicit CoT into compact latent representations, our framework achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.

[CV-174] UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation CVPR2026

【速读】:该论文旨在解决低光照条件下人体姿态估计(human pose estimation)性能下降的问题,主要挑战包括标注数据稀缺、视觉信息丢失以及现有域自适应方法在模拟低光图像时难以保留高频细节和真实感,导致模型泛化能力差。此外,传统基于图像到关键点交叉注意力(image-to-keypoint cross-attention)的姿态估计器在低光环境下因图像线索不可靠而失效。解决方案的关键在于提出一种无监督域适应框架UDAPose:其一,通过直流基高通滤波器(Direct-Current-based High-Pass Filter, DHF)与低光特征注入模块(Low-light Characteristics Injection Module, LCIM)协同合成更具真实感的低光图像,有效恢复高频细节;其二,在Transformer架构中引入动态注意力控制模块(Dynamic Control of Attention, DCA),自适应地融合图像视觉线索与学习到的姿态先验(pose priors),从而提升模型在真实低光场景下的鲁棒性与准确性。

链接: https://arxiv.org/abs/2604.10485
作者: Haopeng Chen,Yihao Ai,Kabeen Kim,Robby T. Tan,Yixin Chen,Bo Wang
机构: University of Mississippi (密西西比大学); National University of Singapore (新加坡国立大学); Duksung Women’s University (德成女子大学); ASUS Intelligent Cloud Services (AICS) (华硕智能云服务)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at CVPR 2026

点击查看摘要

Abstract:Low-visibility scenarios, such as low-light conditions, pose significant challenges to human pose estimation due to the scarcity of annotated low-light datasets and the loss of visual information under poor illumination. Recent domain adaptation techniques attempt to utilize well-lit labels by augmenting well-lit images to mimic low-light conditions. But handcrafted augmentations oversimplify noise patterns, while learning-based methods often fail to preserve high-frequency low-light characteristics, producing unrealistic images that lead pose models to generalize poorly to real low-light scenes. Moreover, recent pose estimators rely on image cues through image-to-keypoint cross-attention, but these cues become unreliable under low-light conditions. To address these issues, we propose Unsupervised Domain Adaptation for Pose Estimation (UDAPose), a novel framework that synthesizes low-light images and dynamically fuses visual cues with pose priors for improved pose estimation. Specifically, our synthesis method incorporates a Direct-Current-based High-Pass Filter (DHF) and a Low-light Characteristics Injection Module (LCIM) to inject high-frequency details from input low-light images, overcoming rigidity or the detail loss in existing approaches. Furthermore, we introduce a Dynamic Control of Attention (DCA) module that adaptively balances image cues with learned pose priors in the Transformer architecture. Experiments show that UDAPose outperforms state-of-the-art methods, with notable AP gains of 10.1 (56.4%) on the ExLPose-test hard set (LL-H) and 7.4 (31.4%) in cross-dataset validation on EHPT-XC. Code: this https URL

[CV-175] ExpertEdit: Learning Skill-Aware Motion Editing from Expert Videos

【速读】:该论文旨在解决运动技能提升中个性化视觉反馈的生成问题,即如何自动编辑初学者的动作以反映更高水平的专家动作,从而加速学习过程。传统运动编辑方法依赖成对的输入-输出数据(如新手与专家动作的对应关系)和显式的编辑指导,这在技能驱动任务中难以获取且成本高昂。其解决方案的关键在于提出ExpertEdit框架,该框架仅使用未配对的专家视频演示进行训练,通过掩码语言建模目标学习专家动作先验,能够自动识别技能关键时刻并将其掩码后投影到所学专家流形空间中,实现无需成对监督或人工编辑指导的局部技能优化,从而显著提升动作的真实感和专家水平。

链接: https://arxiv.org/abs/2604.10466
作者: Arjun Somayazulu,Kristen Grauman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual feedback is critical for motor skill acquisition in sports and rehabilitation, and psychological studies show that observing near-perfect versions of one’s own performance accelerates learning more effectively than watching expert demonstrations alone. We propose to enable such personalized feedback by automatically editing a person’s motion to reflect higher skill. Existing motion editing approaches are poorly suited for this setting because they assume paired input-output data – rare and expensive to curate for skill-driven tasks – and explicit edit guidance at inference. We introduce ExpertEdit, a framework for skill-driven motion editing trained exclusively on unpaired expert video demonstrations. ExpertEdit learns an expert motion prior with a masked language modeling objective that infills masked motion spans with expert-level refinements. At inference, novice motion is masked at skill-critical moments and projected into the learned expert manifold, producing localized skill improvements without paired supervision or manual edit guidance. Across eight diverse techniques and three sports from Ego-Exo4D and Karate Kyokushin, ExpertEdit outperforms state-of-the-art supervised motion editing methods on multiple metrics of motion realism and expert quality. Project page: this https URL .

[CV-176] Rethinking the Diffusion Model from a Langevin Perspective

【速读】:该论文旨在解决扩散模型(Diffusion Models)在教学与理解上的复杂性问题,特别是如何从更直观的角度解释其逆向生成过程,并统一不同形式的扩散模型(如基于常微分方程ODE和随机微分方程SDE的版本),同时澄清其理论优势与等价性。解决方案的关键在于引入朗之万(Langevin)视角,通过这一物理启发式框架,不仅简化了对反向过程为何能从纯噪声生成数据的解释,还揭示了ODE、SDE、去噪和得分匹配等不同建模方式在最大似然意义下的等价关系,从而提供了一个统一且具教学价值的理论框架。

链接: https://arxiv.org/abs/2604.10465
作者: Candi Zheng,Yuan Lan
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 7 figures

点击查看摘要

Abstract:Diffusion models are often introduced from multiple perspectives, such as VAEs, score matching, or flow matching, accompanied by dense and technically demanding mathematics that can be difficult for beginners to grasp. One classic question is: how does the reverse process invert the forward process to generate data from pure noise? This article systematically organizes the diffusion model from a fresh Langevin perspective, offering a simpler, clearer, and more intuitive answer. We also address the following questions: how can ODE-based and SDE-based diffusion models be unified under a single framework? Why are diffusion models theoretically superior to ordinary VAEs? Why is flow matching not fundamentally simpler than denoising or score matching, but equivalent under maximum-likelihood? We demonstrate that the Langevin perspective offers clear and straightforward answers to these questions, bridging existing interpretations of diffusion models, showing how different formulations can be converted into one another within a common framework, and offering pedagogical value for both learners and experienced researchers seeking deeper intuition.

[CV-177] oward Accountable AI-Generated Content on Social Platforms: Steganographic Attribution and Multimodal Harm Detection

【速读】:该论文旨在解决生成式 AI(Generative AI)在内容审核与数字取证中面临的挑战,特别是良性AI生成图像与有害或误导性文本结合时难以检测的上下文滥用问题。传统审核框架因合成图像缺乏持久元数据或设备签名而失效,导致责任归属困难。解决方案的关键在于提出一种基于隐写术(steganography)的溯源框架:在图像生成时嵌入加密签名标识符,并利用多模态有害内容检测作为触发机制进行溯源验证。该系统评估了空间域、频域和小波域下的五种水印方法,发现小波域的扩频水印在模糊失真下具有强鲁棒性;同时集成基于CLIP的融合模型实现多模态有害内容检测(AUC-ROC达0.99),从而构建端到端的数字取证流水线,支持对AI生成图像有害部署的可靠追踪与问责。

链接: https://arxiv.org/abs/2604.10460
作者: Xinlei Guan,David Arosemena,Tejaswi Dhandu,Kuan Huang,Meng Xu,Miles Q. Li,Bingyu Shen,Ruiyang Qin,Umamaheswara Rao Tida,Boyang Li
机构: Kean University (肯恩大学); North Dakota State University (北达科他州立大学); McGill University (麦吉尔大学); Villanova University (维拉诺瓦大学); University of Notre Dame (圣母大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET)
备注: 12 pages, 31 figures

点击查看摘要

Abstract:The rapid growth of generative AI has introduced new challenges in content moderation and digital forensics. In particular, benign AI-generated images can be paired with harmful or misleading text, creating difficult-to-detect misuse. This contextual misuse undermines the traditional moderation framework and complicates attribution, as synthetic images typically lack persistent metadata or device signatures. We introduce a steganography enabled attribution framework that embeds cryptographically signed identifiers into images at creation time and uses multimodal harmful content detection as a trigger for attribution verification. Our system evaluates five watermarking methods across spatial, frequency, and wavelet domains. It also integrates a CLIP-based fusion model for multimodal harmful-content detection. Experiments demonstrate that spread-spectrum watermarking, especially in the wavelet domain, provides strong robustness under blur distortions, and our multimodal fusion detector achieves an AUC-ROC of 0.99, enabling reliable cross-modal attribution verification. These components form an end-to-end forensic pipeline that enables reliable tracing of harmful deployments of AI-generated imagery, supporting accountability in modern synthetic media environments. Our code is available at GitHub: this https URL

[CV-178] A Benchmark and Multi-Agent System for Instruction-driven Cinematic Video Compilation

【速读】:该论文旨在解决长视频内容自动剪辑为短视频时存在的任务单一性和缺乏统一评估标准的问题,尤其针对电影级视频编排中叙事连贯性与逻辑一致性不足的挑战。其解决方案的关键在于提出CineAgents多智能体系统,将视频编排重构为“设计-组合”范式:通过剧本逆向工程构建分层叙事记忆以提供多层次上下文,并采用迭代叙事规划过程将创意蓝图逐步优化为最终编排脚本,从而显著提升生成视频的叙事和逻辑一致性。

链接: https://arxiv.org/abs/2604.10456
作者: Peixuan Zhang,Chang Zhou,Ziyuan Zhang,Hualuo Liu,Chunjie Zhang,Jingqi Liu,Xiaohui Zhou,Xi Chen,Shuchen Weng,Si Li,Boxin Shi
机构: Tencent PCG (腾讯PCG)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The surging demand for adapting long-form cinematic content into short videos has motivated the need for versatile automatic video compilation systems. However, existing compilation methods are limited to predefined tasks, and the community lacks a comprehensive benchmark to evaluate the cinematic compilation. To address this, we introduce CineBench, the first benchmark for instruction-driven cinematic video compilation, featuring diverse user instructions and high-quality ground-truth compilations annotated by professional editors. To overcome contextual collapse and temporal fragmentation, we present CineAgents, a multi-agent system that reformulates cinematic video compilation into ``design-and-compose’’ paradigm. CineAgents performs script reverse-engineering to construct a hierarchical narrative memory to provide multi-level context and employs an iterative narrative planning process that refines a creative blueprint into a final compiled script. Extensive experiments demonstrate that CineAgents significantly outperforms existing methods, generating compilations with superior narrative coherence and logical coherence.

[CV-179] AIM-Bench: Benchmarking and Improving Affective Image Manipulation via Fine-Grained Hierarchical Control

【速读】:该论文旨在解决当前图像编辑基准在情感维度上缺乏细粒度刻画的问题,即现有方法主要关注对象级修改,难以实现对情绪(Affective)的精准操控。其解决方案的关键在于构建首个面向情感图像操纵(Affective Image Manipulation, AIM)的基准AIM-Bench,该基准采用双路径情感建模架构,融合Mikels情绪分类体系与效价-唤醒-支配(Valence-Arousal-Dominance, VAD)框架,从而支持高层次语义与细粒度连续的情感编辑。同时,研究提出一种可扩展的数据引擎,利用逆向重绘策略生成40k样本的平衡指令微调数据集AIM-40k,通过生成式重绘增强原始情感图像以建立高保真真实标签,并合成具有差异性情绪及精确指令的输入图像,最终使基线模型在整体性能上提升9.15%,显著缓解了训练数据分布不均导致的正向偏倚问题。

链接: https://arxiv.org/abs/2604.10454
作者: Shi Chen,Xuecheng Wu,Heli Sun,Yunyun Shi,Xinyi Yin,Fengjian Xue,Jinheng Xie,Dingkang Yang,Hao Wang,Junxiao Xue,Liang He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Affective Image Manipulation (AIM) aims to evoke specific emotions through targeted editing. Current image editing benchmarks primarily focus on object-level modifications in general scenarios, lacking the fine-grained granularity to capture affective dimensions. To bridge this gap, we introduce the first benchmark designed for AIM termed AIM-Bench. This benchmark is built upon a dual-path affective modeling scheme that integrates the Mikels emotion taxonomy with the Valence-Arousal-Dominance framework, enabling high-level semantic and fine-grained continuous manipulation. Through a hierarchical human-in-the-loop workflow, we finally curate 800 high-quality samples covering 8 emotional categories and 5 editing types. To effectively assess performance, we also design a composite evaluation suite combining rule-based and model-based metrics to holistically assess instruction consistency, aesthetics, and emotional expressiveness. Extensive evaluations reveal that current editing models face significant challenges, most notably a prevalent positivity bias, which stemming from inherent imbalances in training data distribution. To tackle this, we propose a scalable data engine utilizing an inverse repainting strategy to construct AIM-40k, a balanced instruction-tuning dataset comprising 40k samples. Concretely, we enhance raw affective images via generative redrawing to establish high-fidelity ground truths, and synthesize input images with divergent emotions and paired precise instructions. Fine-tuning a baseline model on AIM-40k yields a 9.15% relative improvement in overall performance, demonstrating the effectiveness of our AIM-40k. Our data and related code will be made open soon.

[CV-180] Parameter Efficient Fine-tuning for Domain-specific Gastrointestinal Disease Recognition CVPR

【速读】:该论文旨在解决跨源医学图像(cross-source medical images)中因分布偏移(distribution shifts)导致的模型性能下降问题。传统方法通常为每个数据源单独训练一个模型,但当使用预训练大模型进行全参数微调时,存储多个模型副本会带来高昂的计算和存储成本。为此,作者提出采用低秩适应(Low-Rank Adaptation, LoRA)模块对下游分类任务进行微调:LoRA通过学习轻量级的任务特定低秩矩阵来扰动预训练权重,从而在保持模型性能的同时显著提升参数效率。实验表明,该方法在胃肠道疾病分类任务中优于端到端微调,且具备更强的资源经济性。

链接: https://arxiv.org/abs/2604.10451
作者: Sanjaya Poudel,Nikita Kunwor,Raj Simkhada,Mustafa Munir,Manish Dhakal,Khem Poudel
机构: Auburn University (奥本大学); Tribhuvan University (特里布文大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校); Georgia State University (佐治亚州立大学); Middle Tennessee State University (中田纳西州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures, CVPR conference

点击查看摘要

Abstract:Despite recent advancements in the field of medical image analysis with the use of pretrained foundation models, the issue of distribution shifts between cross-source images largely remains adamant. To circumvent that issue, investigators generally train a separate model for each source. However, this method becomes expensive when we fully fine-tune pretrained large models for a single dataset, as we must store multiple copies of those models. Thus, in this work, we propose using a low-rank adaptation (LoRA) module for fine-tuning downstream classification tasks. LoRAs learn lightweight task-specific low-rank matrices that perturb pretrained weights to optimize those downstream tasks. For gastrointestinal tract diseases, they exhibit significantly better results than end-to-end finetuning with improved parameter efficiency. Code is available at: this http URL.

[CV-181] ReContraster: Making Your Posters Stand Out with Regional Contrast

【速读】:该论文旨在解决海报设计中如何快速吸引注意力并清晰传达信息的问题,尤其关注通过增强局部区域对比度来提升视觉冲击力。其解决方案的关键在于提出一种无需训练的模型 ReContraster,该模型借鉴“对比效应”原理,引入组合式多智能体系统(compositional multi-agent system)模拟设计师的认知行为,实现元素识别、版面组织与候选海报评估;同时在扩散过程中融合混合去噪策略(hybrid denoising strategy),确保区域边界过渡和谐,从而生成视觉突出且美学上令人满意的作品。

链接: https://arxiv.org/abs/2604.10442
作者: Peixuan Zhang,Zijian Jia,Ziqi Cai,Shuchen Weng,Si Li,Boxin Shi
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Peking University (北京大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Effective poster design requires rapidly capturing attention and clearly conveying messages. Inspired by the ``contrast effects’’ principle, we propose ReContraster, the first training-free model to leverage regional contrast to make posters stand out. By emulating the cognitive behaviors of a poster designer, ReContraster introduces the compositional multi-agent system to identify elements, organize layout, and evaluate generated poster candidates. To further ensure harmonious transitions across region boundaries, ReContraster integrates the hybrid denoising strategy during the diffusion process. We additionally contribute a new benchmark dataset for comprehensive evaluation. Seven quantitative metrics and four user studies confirm its superiority over relevant state-of-the-art methods, producing visually striking and aesthetically appealing posters.

[CV-182] PERCEPT-Net: A Perceptual Loss Driven Framework for Reducing MRI Artifact Tissue Confusion

【速读】:该论文旨在解决现有基于深度学习的磁共振成像(MRI)伪影校正模型在临床应用中泛化能力差的问题,其根源在于伪影与组织结构之间的混淆,导致模型难以区分伪影和解剖结构。解决方案的关键在于提出PERCEPT-Net框架,其核心创新是引入运动感知损失(Motion Perceptual Loss, MPL),通过学习可泛化的运动伪影表征,提供面向伪影的监督信号,从而引导网络在抑制伪影的同时保持解剖结构的一致性和完整性。该机制显著提升了图像质量与诊断结构保真度,经客观指标与放射科专家评估验证,有效解决了过平滑和结构退化问题。

链接: https://arxiv.org/abs/2604.10439
作者: Ziheng Guo,Danqun Zheng,Chengwei Chen,Boyang Pan,Shuai Li,Ziqin Yu,Xiaoxiao Chen,Langdi Zhong,Yun Bian,Nan-Jie Gong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 7 figures, 6 tables. Submitted to Medical Physics. Code available upon request

点击查看摘要

Abstract:Purpose: Existing deep learning-based MRI artifact correction models exhibit poor clinical generalization due to inherent artifact-tissue confusion, failing to discriminate artifacts from anatomical structures. To resolve this, we introduce PERCEPT-Net, a framework leveraging dedicated perceptual supervision for structure-preserving artifact suppression. Method: PERCEPT-Net utilizes a residual U-Net backbone integrated with a multi-scale recovery module and dual attention mechanisms to preserve anatomical context and salient features. The core mechanism, Motion Perceptual Loss (MPL), provides artifact-aware supervision by learning generalizable motion artifact representations. This logic directly guides the network to suppress artifacts while maintaining anatomical fidelity. Training utilized a hybrid dataset of real and simulated sequences, followed by prospective validation via objective metrics and expert radiologist assessments. Result: PERCEPT-Net outperformed state-of-the-art methods on clinical data. Ablation analysis established a direct causal link between MPL and performance; its omission caused a significant deterioration in structural consistency (p 0.001) and tissue contrast (p 0.001). Radiologist evaluations corroborated these objective metrics, scoring PERCEPT-Net significantly higher in global image quality (median 3 vs. 2, p 0.001) and verifying the preservation of critical diagnostic structures. Conclusion: By integrating task-specific, artifact-aware perceptual learning, PERCEPT-Net suppresses motion artifacts in clinical MRI without compromising anatomical integrity. This framework improves clinical robustness and provides a verifiable mechanism to mitigate over-smoothing and structural degradation in medical image reconstruction.

[CV-183] Enhancing Fine-Grained Spatial Grounding in 3D CT Report Generation via Discriminative Guidance

【速读】:该论文旨在解决当前用于胸部CT报告生成(Radiology Report Generation, RRG)的视觉-语言模型(Vision-Language Models, VLMs)存在的两大关键问题:一是训练监督信号过于粗粒度,缺乏对病灶属性与空间位置的细粒度对齐;二是评估方式以整体指标为主(如词汇重叠率、实体匹配或大语言模型评分),难以诊断模型在病理定位上的准确性。解决方案的核心是提出一种可插拔的“判别性提示引导与提示丢弃”框架(Discriminative Cue-Prompting with Prompt Dropout, DCP-PD),该框架通过从自由文本报告中蒸馏细粒度线索(fine-grained cues)来指导生成过程,并利用提示丢弃(prompt dropout)机制抑制模型对捷径依赖,从而提升模型对病灶空间位置的准确感知能力。实验表明,DCP-PD在CT-RATE数据集上将宏F1值从0.501提升至0.603(相对提升20%),并在分布外数据集Rad-ChestCT上从0.266大幅提升至0.503(相对提升89%),同时引入分层、位置感知的问题集协议(presence → laterality → lobe)验证了即使在高得分模型中,病灶空间定位仍是挑战。

链接: https://arxiv.org/abs/2604.10437
作者: Chenyu Wang,Weicheng Dai,Han Liu,Wenchao Li,Kayhan Batmanghelich
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision–language models (VLMs) for radiology report generation (RRG) can produce long-form chest CT reports from volumetric scans and show strong potential to improve radiology workflow efficiency and consistency. However, existing methods face two key limitations: (i) training supervision is often coarse, aligning a whole CT volume with a full free-text report without explicit alignment for fine-grained attributes or pathology locations; and (ii) evaluation is typically holistic (lexical overlap, entity matching, or LLM-as-a-judge scores) and not diagnostic for spatial grounding. We propose \emphDiscriminative Cue-Prompting with Prompt Dropout (DCP-PD), a plug-and-play framework that distills fine-grained cues from free-text reports and uses them to guide report generation while mitigating shortcut reliance via prompt dropout. DCP-PD achieves state-of-the-art performance on CT-RATE, improving macro F1 from =0.501 to 0.603 (20% relative), and substantially boosts out-of-distribution performance on Rad-ChestCT from F1 =0.266 to 0.503 (89% relative). Finally, we introduce a hierarchical, location-aware question-set protocol (presence \rightarrow laterality \rightarrow lobe) to directly assess pathology-location grounding, showing that fine-grained spatial localization remains challenging even for models that score highly on current benchmarks.

[CV-184] SignReason er: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units CVPR

【速读】:该论文旨在解决当前视觉语言模型(Vision Language Models, VLMs)在复杂交通标志理解中缺乏组合泛化能力的问题,即当遇到未见过的标志构型时,模型性能显著下降。解决方案的关键在于提出SignReasoner框架,其核心创新是引入功能结构单元(Functional Structure Unit, FSU),将传统基于实例的建模方式转变为基于功能的灵活分解方法。通过将复杂交通标志拆解为最小的核心功能模块(如方向 Direction、提示 Notice、车道 Lane 等),模型能够学习到标志的底层结构语法,从而实现对新组合配置的鲁棒泛化。此外,作者设计了两阶段VLM后训练流程:迭代式caption-FSU蒸馏(Iterative Caption-FSU Distillation)提升FSU推理与描述准确性,以及基于树编辑距离(Tree Edit Distance, TED)奖励的FSU-GRPO算法,进一步增强模型的结构推理能力。

链接: https://arxiv.org/abs/2604.10436
作者: Ruibin Wang,Zhenyu Lin,Xinhai Zhao
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPRF 2026

点击查看摘要

Abstract:Accurate semantic understanding of complex traffic signs-including those with intricate layouts, multi-lingual text, and composite symbols-is critical for autonomous driving safety. Current models, both specialized small ones and large Vision Language Models (VLMs), suffer from a significant bottleneck: a lack of compositional generalization, leading to failure when encountering novel sign configurations. To overcome this, we propose SignReasoner, a novel paradigm that transforms general VLMs into expert traffic sign reasoners. Our core innovation is Functional Structure Unit (FSU), which shifts from common instance-based modeling to flexible function-based decomposition. By breaking down complex signs into minimal, core functional blocks (e.g., Direction, Notice, Lane), our model learns the underlying structural grammar, enabling robust generalization to unseen compositions. We define this decomposition as the FSU-Reasoning task and introduce a two-stage VLM post-training pipeline to maximize performance: Iterative Caption-FSU Distillation that enhances the model’s accuracy in both FSU-reasoning and caption generation; FSU-GRPO that uses Tree Edit Distance (TED) to compute FSU differences as the rewards in GRPO algorithm, boosting reasoning abilities. Experiments on the newly proposed FSU-Reasoning benchmark, TrafficSignEval, show that SignReasoner achieves new SOTA with remarkable data efficiency and no architectural modification, significantly improving the traffic sign understanding in various VLMs.

[CV-185] DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain ACL2026

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在食品领域应用受限的问题,主要源于现有基准数据集存在类别粒度粗、图像视角单一及营养信息不准确等缺陷。其解决方案的关键在于提出一个分层的多视角评估基准——DiningBench,该基准涵盖三个认知复杂度层级:细粒度分类、营养估算和视觉问答;包含3,021种独特菜品(每道菜平均5.27张图像),引入来自相同菜单的“难样本”负例,并采用验证机制确保营养数据准确性。这一设计显著提升了对VLM在食品场景下细粒度视觉辨别与精准营养推理能力的评测强度,从而推动下一代面向食物的VLM研究发展。

链接: https://arxiv.org/abs/2604.10425
作者: Song Jin,Juntian Zhang,Xun Zhang,Zeying Tian,Fei Jiang,Guojun Yin,Wei Lin,Yong Liu,Rui Yan
机构: Gaoling School of Artificial Intelligence, Renmin University of China; Meituan; Wuhan University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ACL 2026 Main

点击查看摘要

Abstract:Recent advancements in Vision-Language Models (VLMs) have revolutionized general visual understanding. However, their application in the food domain remains constrained by benchmarks that rely on coarse-grained categories, single-view imagery, and inaccurate metadata. To bridge this gap, we introduce DiningBench, a hierarchical, multi-view benchmark designed to evaluate VLMs across three levels of cognitive complexity: Fine-Grained Classification, Nutrition Estimation, and Visual Question Answering. Unlike previous datasets, DiningBench comprises 3,021 distinct dishes with an average of 5.27 images per entry, incorporating fine-grained “hard” negatives from identical menus and rigorous, verification-based nutritional data. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary models. Our experiments reveal that while current VLMs excel at general reasoning, they struggle significantly with fine-grained visual discrimination and precise nutritional reasoning. Furthermore, we systematically investigate the impact of multi-view inputs and Chain-of-Thought reasoning, identifying five primary failure modes. DiningBench serves as a challenging testbed to drive the next generation of food-centric VLM research. All codes are released in this https URL.

[CV-186] Point2Pose: Occlusion-Recovering 6D Pose Tracking and 3D Reconstruction for Multiple Unknown Objects Via 2D Point Trackers

【速读】:该论文旨在解决从单目RGB-D视频中实现多个刚性物体的因果6D位姿跟踪问题,尤其在无对象CAD模型或类别先验条件下,如何实现多目标跟踪并应对完全遮挡后的快速恢复。解决方案的关键在于提出Point2Pose方法:通过2D点追踪器建立长距离对应关系以支持遮挡后即时恢复,同时增量式构建在线Truncated Signed Distance Function (TSDF)表示来同步重建被跟踪目标的几何结构,从而在无需模型先验的情况下实现鲁棒、连续的多目标6D位姿估计。

链接: https://arxiv.org/abs/2604.10415
作者: Tzu-Yuan Lin,Ho Jae Lee,Kevin Doherty,Yonghyeon Lee,Sangbae Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:We present Point2Pose, a model-free method for causal 6D pose tracking of multiple rigid objects from monocular RGB-D video. Initialized only from sparse image points on the objects to be tracked, our approach tracks multiple unseen objects without requiring object CAD models or category priors. Point2Pose leverages a 2D point tracker to obtain long-range correspondences, enabling instant recovery after complete occlusion. Simultaneously, the system incrementally reconstructs an online Truncated Signed Distance Function (TSDF) representation of the tracked targets. Alongside the method, we introduce a new multi-object tracking dataset comprising both simulation and real-world sequences, with motion-capture ground truth for evaluation. Experiments show that Point2Pose achieves performance comparable to the state-of-the-art methods on a severe-occlusion benchmark, while additionally supporting multi-object tracking and recovery from complete occlusion, capabilities that are not supported by previous model-free tracking approaches.

[CV-187] Neural Stochastic Processes for Satellite Precipitation Refinement

【速读】:该论文旨在解决卫星降水产品因系统性偏差导致的精度不足问题,同时克服地面雨量计空间分布稀疏、难以直接用于格网校正的局限。现有方法通常通过插值将站点观测映射到卫星网格,但忽略了降水场在时间维度上的动态结构。其解决方案的关键在于提出神经随机过程(Neural Stochastic Process, NSP),该模型结合了神经过程(Neural Process)编码器对任意集合的地面观测进行条件建模,并引入潜变量神经随机微分方程(Neural SDE)以捕捉二维空间表示下的时空演化特性;训练过程中采用单一变分目标且无需模拟即可优化,从而实现高精度、时序一致的降水估计。

链接: https://arxiv.org/abs/2604.10414
作者: Shunya Nagashima,Takumi Bannai,Shuitsu Koyama,Tomoya Mitsui,Shuntaro Suzuki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurate precipitation estimation is critical for flood forecasting, water resource management, and disaster preparedness. Satellite products provide global hourly coverage but contain systematic biases; ground-based gauges are accurate at point locations but too sparse for direct gridded correction. Existing methods fuse these sources by interpolating gauge observations onto the satellite grid, but treat each time step independently and therefore discard temporal structure in precipitation fields. We propose Neural Stochastic Process (NSP), a model that pairs a Neural Process encoder conditioning on arbitrary sets of gauge observations with a latent Neural SDE on a 2D spatial representation. NSP is trained under a single variational objective with simulation-free cost. We also introduce QPEBench, a benchmark of 43,756 hourly samples over the Contiguous United States (2021–2025) with four aligned data sources and six evaluation metrics. On QPEBench, NSP outperforms 13 baselines across all six metrics and surpasses JAXA’s operational gauge-calibrated product. An additional experiment on Kyushu, Japan confirms generalization to a different region with independent data sources.

[CV-188] IMPACT: A Dataset for Multi-Granularity Human Procedural Action Understanding in Industrial Assembly KR

【速读】:该论文旨在解决工业场景下程序性操作理解(procedural understanding)中缺乏真实、多视角、同步标注数据集的问题,尤其在装配与拆卸等复杂任务中,现有基准难以覆盖多路径执行、异常检测与恢复监督等现实挑战。解决方案的关键在于构建IMPACT数据集——首个面向部署的五视角RGB-D数据集,其核心创新包括:同步内外视角RGB-D采集(synchronized ego-exo RGB-D capture)、解耦双臂标注(decoupled bimanual annotation)、符合感知的状态追踪(compliance-aware state tracking)以及显式的异常-恢复监督机制(explicit anomaly-recovery supervision),并基于部分序前置图(partial-order prerequisite graph)实现多路径任务执行与NASA-TLX量化认知负荷,从而揭示单任务基准无法捕捉的系统级局限性,特别是在不完整观测、灵活路径和纠错行为等实际部署条件下的性能瓶颈。

链接: https://arxiv.org/abs/2604.10409
作者: Di Wen,Zeyun Zhong,David Schneider,Manuel Zaremski,Linus Kunzmann,Yitian Shi,Ruiping Liu,Yufan Chen,Junwei Zheng,Jiahang Li,Jonas Hemmerich,Qiyi Tong,Patric Grauberger,Arash Ajoudani,Danda Pani Paudel,Sven Matthiesen,Barbara Deml,Jürgen Beyerer,Luc Van Gool,Rainer Stiefelhagen,Kunyu Peng
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); Italian Institute of Technology (意大利技术研究院); INSAIT, Sofia University (INSAIT,索菲亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures, benchmark and dataset are available at this https URL

点击查看摘要

Abstract:We introduce IMPACT, a synchronized five-view RGB-D dataset for deployment-oriented industrial procedural understanding, built around real assembly and disassembly of a commercial angle grinder with professional-grade tools. To our knowledge, IMPACT is the first real industrial assembly benchmark that jointly provides synchronized ego-exo RGB-D capture, decoupled bimanual annotation, compliance-aware state tracking, and explicit anomaly–recovery supervision within a single real industrial workflow. It comprises 112 trials from 13 participants totaling 39.5 hours, with multi-route execution governed by a partial-order prerequisite graph, a six-category anomaly taxonomy, and operator cognitive load measured via NASA-TLX. The annotation hierarchy links hand-specific atomic actions to coarse procedural steps, component assembly states, and per-hand compliance phases, with synchronized null spans across views to decouple perceptual limitations from algorithmic failure. Systematic baselines reveal fundamental limitations that remain invisible to single-task benchmarks, particularly under realistic deployment conditions that involve incomplete observations, flexible execution paths, and corrective behavior. The full dataset, annotations, and evaluation code are available at this https URL.

[CV-189] Rethinking Video Human-Object Interaction: Set Prediction over Time for Unified Detection and Anticipation

【速读】:该论文旨在解决视频中人-物交互(Human-Object Interaction, HOI)理解中存在的两个关键问题:一是现有方法将未来预测视为基于外部构建的人-物对的下游任务,导致检测与预测之间缺乏联合推理;二是当前基准数据集中的稀疏关键帧标注容易造成未来标签与实际动态在时间上错位,从而降低预测评估的可靠性。解决方案的关键在于提出两个核心贡献:其一,构建了DETAnt-HOI这一时间校正后的基准,基于VidHOI和Action Genome数据集,实现更可靠的多时域预测评估;其二,设计了HOI-DA框架,以对齐为中心建模未来交互为当前配对状态的残差变换,从而实现主体-客体定位、当前HOI检测与未来预测的联合学习。实验表明,该方法在检测与预测任务上均取得一致提升,尤其在长时预测中优势显著,验证了将预测作为结构约束嵌入到配对级视频表征学习中的有效性。

链接: https://arxiv.org/abs/2604.10397
作者: Yuanhao Luo,Di Wen,Kunyu Peng,Ruiping Liu,Junwei Zheng,Yufan Chen,Jiale Wei,Rainer Stiefelhage
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); INSAIT, Sofia University (索菲亚大学INSAIT研究所); ETH Zurich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 8 figures, code will be publicly available

点击查看摘要

Abstract:Video-based human-object interaction (HOI) understanding requires both detecting ongoing interactions and anticipating their future evolution. However, existing methods usually treat anticipation as a downstream forecasting task built on externally constructed human-object pairs, limiting joint reasoning between detection and prediction. In addition, sparse keyframe annotations in current benchmarks can temporally misalign nominal future labels from actual future dynamics, reducing the reliability of anticipation evaluation. To address these issues, we introduce DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome for more faithful multi-horizon evaluation, and HOI-DA, a pair-centric framework that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current pair states. Experiments show consistent improvements in both detection and anticipation, with larger gains at longer horizons. Our results highlight that anticipation is most effective when learned jointly with detection as a structural constraint on pair-level video representation learning. Benchmark and code will be publicly available.

[CV-190] FishRoPE: Projective Rotary Position Embeddings for Omnidirectional Visual Perception

【速读】:该论文旨在解决广角相机(fisheye camera)在自动驾驶感知任务中因径向畸变导致的几何不一致性问题,同时应对大规模标注数据稀缺下重新训练视觉基础模型(Vision Foundation Models, VFMs)的挑战。其核心解决方案是提出一个轻量级适配框架 \ours,关键创新在于:采用冻结的DINOv2骨干网络结合低秩适应(LoRA)以无任务特定预训练方式迁移自监督特征,并引入鱼眼旋转位置嵌入(Fisheye Rotary Position Embedding, FishRoPE)——该方法将注意力机制重参数化为球坐标系下的角度分离度量,使自注意力与交叉注意力均基于角度而非像素距离进行计算,从而自然兼容鱼眼投影几何且对模型架构无依赖性,计算开销极小

链接: https://arxiv.org/abs/2604.10391
作者: Rahul Ahuja,Mudit Jain,Bala Murali Manoghar Sai Sudhakar,Venkatraman Narayanan,Pratik Likhar,Varun Ravi Kumar,Senthil Yogamani
机构: Qualcomm Technologies, Inc (高通技术公司); Qualcomm India Private Limited (高通印度私人有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision foundation models (VFMs) and Bird’s Eye View (BEV) representation have advanced visual perception substantially, yet their internal spatial representations assume the rectilinear geometry of pinhole cameras. Fisheye cameras, widely deployed on production autonomous vehicles for their surround-view coverage, exhibit severe radial distortion that renders these representations geometrically inconsistent. At the same time, the scarcity of large-scale fisheye annotations makes retraining foundation models from scratch impractical. We present \ours, a lightweight framework that adapts frozen VFMs to fisheye geometry through two components: a frozen DINOv2 backbone with Low-Rank Adaptation (LoRA) that transfers rich self-supervised features to fisheye without task-specific pretraining, and Fisheye Rotary Position Embedding (FishRoPE), which reparameterizes the attention mechanism in the spherical coordinates of the fisheye projection so that both self-attention and cross-attention operate on angular separation rather than pixel distance. FishRoPE is architecture-agnostic, introduces negligible computational overhead, and naturally reduces to the standard formulation under pinhole geometry. We evaluate \ours on WoodScape 2D detection (54.3 mAP) and SynWoodScapes BEV segmentation (65.1 mIoU), where it achieves state-of-the-art results on both benchmarks.

[CV-191] GTASA: Ground Truth Annotations for Spatiotemporal Analysis Evaluation and Training of Video Models

【速读】:该论文旨在解决生成复杂多角色场景视频的难题,以及缺乏真实物理合理性与语义忠实度评估标准的问题。其解决方案的关键在于构建了一个名为GTASA的多角色视频语料库,该语料库包含逐帧的空间关系图和事件级的时间映射,并基于Graphs of Events in Space and Time (GEST)开发了GEST-Engine系统来生成这些视频。通过GTASA提供的精确3D地面真值,论文在11个时空推理任务上对四个冻结的视频编码器进行了探查,证明自监督编码器在空间结构建模方面显著优于视觉语言模型(VLM)视觉编码器,从而验证了该方法在生成质量与可评估性上的优势。

链接: https://arxiv.org/abs/2604.10385
作者: Nicolae Cudlenco,Mihai Masala,Marius Leordeanu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating complex multi-actor scenario videos remains difficult even for state-of-the-art neural generators, while evaluating them is hard due to the lack of ground truth for physical plausibility and semantic faithfulness. We introduce GTASA, a corpus of multi-actor videos with per-frame spatial relation graphs and event-level temporal mappings, and the system that produced it based on Graphs of Events in Space and Time (GEST): GEST-Engine. We compare our method with both open and closed source neural generators and prove both qualitatively (human evaluation of physical validity and semantic alignment) and quantitatively (via training video captioning models) the clear advantages of our method. Probing four frozen video encoders across 11 spatiotemporal reasoning tasks enabled by GTASA’s exact 3D ground truth reveals that self-supervised encoders encode spatial structure significantly better than VLM visual encoders.

[CV-192] Agent ic Video Generation: From Text to Executable Event Graphs via Tool-Constrained LLM Planning

【速读】:该论文旨在解决现有多智能体视频生成系统中生成内容语义不可靠的问题,这些系统依赖大语言模型(LLM)协调神经视频生成器,虽能产出视觉上吸引人的结果,但缺乏可验证的语义一致性与物理真实性。其解决方案的关键在于提出一种基于“分离关注点”的新架构:由LLM负责叙事规划(natural language reasoning),而程序化状态后端通过受控工具调用确保模拟器约束被严格遵守,从而保证生成的事件图谱(Graph of Events in Space and Time, GEST)在构造阶段即具备可执行性。该方法引入双智能体层级结构——导演(Director)制定故事大纲,场景构建者(Scene Builder)通过轮次状态机细化场景,并辅以关系子代理(Relation Subagents)填充GEST中的逻辑与语义边,首次完整释放了该形式化表示的表达能力。实验表明,该方案在自主生成和种子生成任务中均显著优于当前主流神经视频生成模型(如VEO 3.1、WAN 2.2),尤其在物理合理性与语义对齐方面表现突出。

链接: https://arxiv.org/abs/2604.10383
作者: Nicolae Cudlenco,Mihai Masala,Marius Leordeanu
机构: Institute of Mathematics of the Romanian Academy, Bucharest, Romania; National University of Science and Technology Politehnica Bucharest, Romania; Büchi Labortechnik AG, Flawil, Switzerland
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing multi-agent video generation systems use LLM agents to orchestrate neural video generators, producing visually impressive but semantically unreliable outputs with no ground truth annotations. We present an agentic system that inverts this paradigm: instead of generating pixels, the LLM constructs a formal Graph of Events in Space and Time (GEST) – a structured specification of actors, actions, objects, and temporal constraints – which is then executed deterministically in a 3D game engine. A staged LLM refinement pipeline fails entirely at this task (0 of 50 attempts produce an executable specification), motivating a fundamentally different architecture based on a separation of concerns: the LLM handles narrative planning through natural language reasoning, while a programmatic state backend enforces all simulator constraints through validated tool calls, guaranteeing that every generated specification is executable by construction. The system uses a hierarchical two-agent architecture – a Director that plans the story and a Scene Builder that constructs individual scenes through a round-based state machine – with dedicated Relation Subagents that populate the logical and semantic edge types of the GEST formalism that procedural generation leaves empty, making this the first approach to exercise the full expressive capacity of the representation. We evaluate in two stages: autonomous generation against procedural baselines via a 3-model LLM jury, where agentic narratives win 79% of text and 74% of video comparisons; and seeded generation where the same text is given to our system, VEO 3.1, and WAN 2.2, with human annotations showing engine-generated videos substantially outperform neural generators on physical validity (58% vs 25% and 20%) and semantic alignment (3.75/5 vs 2.33 and 1.50).

[CV-193] DeepShapeMatchingKit: Accelerated Functional Map Solver and Shape Matching Pipelines Revisited ATC CVPR2026

【速读】:该论文旨在解决非刚性三维形状匹配中深度功能映射(Deep Functional Maps)的计算效率瓶颈问题,尤其是在高谱分辨率下标准实现需串行求解k个独立线性系统所带来的性能限制。其关键解决方案是提出一种向量化重构方法,通过在单次内核调用中并行求解所有线性系统,在保持精确解的前提下实现最高达33倍的速度提升。此外,论文还识别并记录了主流DiffusionNet中空间梯度特征实现上的未被注意的差异,揭示了两种参数化不同切平面变换族的变体,并通过实验分析其在多个基准测试中的行为表现,从而为方法选择提供依据。

链接: https://arxiv.org/abs/2604.10377
作者: Yizheng Xie,Lennart Bastian,Congyue Deng,Thomas W. Mitchel,Maolin Gao,Daniel Cremers
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心); MIT (麻省理工学院); Adobe (Adobe公司); Imperial College London (伦敦帝国理工学院); Simon Fraser University (西蒙弗雷泽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 8 figures, CVPR 2026 Image Matching Workshop (IEEE proceedings)

点击查看摘要

Abstract:Deep functional maps, leveraging learned feature extractors and spectral correspondence solvers, are fundamental to non-rigid 3D shape matching. Based on an analysis of open-source implementations, we find that standard functional map implementations solve k independent linear systems serially, which is a computational bottleneck at higher spectral resolution. We thus propose a vectorized reformulation that solves all systems in a single kernel call, achieving up to a 33x speedup while preserving the exact solution. Furthermore, we identify and document a previously unnoticed implementation divergence in the spatial gradient features of the mainstay DiffusionNet: two variants that parameterize distinct families of tangent-plane transformations, and present experiments analyzing their respective behaviors across diverse benchmarks. We additionally revisit overlap prediction evaluation for partial-to-partial matching and show that balanced accuracy provides a useful complementary metric under varying overlap ratios. To share these advancements with the wider community, we present an open-source codebase, DeepShapeMatchingKit, that incorporates these improvements and standardizes training, evaluation, and data pipelines for common deep shape matching methods. The codebase is available at: this https URL

[CV-194] Multinex: Lightweight Low-light Image Enhancement via Multi-prior Retinex

【速读】:该论文旨在解决低光照图像增强(Low-light Image Enhancement, LLIE)中现有方法存在的两大问题:一是当前最先进的技术多依赖大型模型和多阶段训练,导致计算复杂度高,难以在边缘设备部署;二是这些方法通常仅基于单一色彩空间,易引入曝光不稳或颜色伪影。解决方案的关键在于提出一种轻量级结构化框架Multinex,其核心是将图像分解为由不同分析表示导出的亮度与颜色先验堆栈,并通过受Retinex理论启发的残差形式进行融合,从而实现对亮度和反射率的精确调整。该方法强调增强而非重建,结合轻量化神经操作,在保持高性能的同时显著降低参数量(如45K参数的轻量版和0.7K参数的纳米版),在多个基准测试中优于同类轻量级模型,并接近重型模型性能。

链接: https://arxiv.org/abs/2604.10359
作者: Alexandru Brateanu,Tingting Mu,Codruta Ancuti,Cosmin Ancuti
机构: University of Manchester (曼彻斯特大学); University Politehnica Timisoara (特尔古日乌理工大学); West University of Timisoara (蒂米什瓦拉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Low-light image enhancement (LLIE) aims to restore natural visibility, color fidelity, and structural detail under severe illumination degradation. State-of-the-art (SOTA) LLIE techniques often rely on large models and multi-stage training, limiting practicality for edge deployment. Moreover, their dependence on a single color space introduces instability and visible exposure or color artifacts. To address these, we propose Multinex, an ultra-lightweight structured framework that integrates multiple fine-grained representations within a principled Retinex residual formulation. It decomposes an image into illumination and color prior stacks derived from distinct analytic representations, and learns to fuse these representations into luminance and reflectance adjustments required to correct exposure. By prioritizing enhancement over reconstruction and exploiting lightweight neural operations, Multinex significantly reduces computational cost, exemplified by its lightweight (45K parameters) and nano (0.7K parameters) versions. Extensive benchmarks show that all lightweight variants significantly outperform their corresponding lightweight SOTA models, and reach comparable performance to heavy models. Paper page available at this https URL.

[CV-195] Multi-modal multi-scale representation learning for satellite imagery analysis just needs a good ALiBi

【速读】:该论文旨在解决多尺度、多模态卫星遥感图像处理中模型建模困难的问题,特别是如何有效构建在不同空间分辨率(ground sample distance scales)和成像模式(如光学与合成孔径雷达SAR)下统一表征的视觉基础模型。其解决方案的关键在于提出Scale-ALiBi——一种线性偏置注意力机制(linear bias transformer attention mechanism),通过引入空间编码偏置(spatial encoding bias)来显式建模不同尺度图像块之间的关系,从而增强模型对跨尺度特征的感知能力。该方法结合三重对比学习与重建架构,在高/低分辨率光学与低分辨率SAR数据集上实现性能提升,并在GEO-Bench基准测试中验证了有效性。

链接: https://arxiv.org/abs/2604.10347
作者: Patrick Kage,Pavlos Andreadis
机构: University of Edinburgh(爱丁堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Originally appeared at the 4th Space Imaging Workshop at the Georgia Institute of Technology, October 7-9, 2024

点击查看摘要

Abstract:Vision foundation models have been shown to be effective at processing satellite imagery into representations fit for downstream tasks, however, creating models which operate over multiple spatial resolutions and modes is challenging. This paper presents Scale-ALiBi, a linear bias transformer attention mechanism with a spatial encoding bias to relationships between image patches at different ground sample distance scales. We provide an implementation of Scale-ALiBi over a dataset of aligned high- and low-resolution optical and low-resolution SAR satellite imagery data using a triple-contrastive and reconstructive architecture, show an improvement on the GEO-Bench benchmark, and release the newly curated dataset publicly.

[CV-196] Context Matters: Vision-Based Depression Detection Comparing Classical and Deep Approaches

【速读】:该论文旨在解决传统特征提取方法与深度学习方法在视觉模态下抑郁检测任务中的性能差异问题,特别是两者在准确性、公平性和跨情境泛化能力方面的比较。其关键解决方案是通过在两个不同情境(TPOT数据库中的母子互动和Pitt数据库中的患者-临床医生访谈)中对比经典方法(手工设计特征+支持向量机SVM)与深度学习方法(来自FMAE-IAT的逐轮嵌入特征+多层感知机MLP分类器),发现经典方法在两个场景中均表现出更高的准确率,并且在患者-临床医生情境中显著更公平;同时,两种方法的跨情境泛化能力均有限,提示抑郁表现可能具有情境特异性。

链接: https://arxiv.org/abs/2604.10344
作者: Maneesh Bilalpur,Saurabh Hinduja,Sonish Sivarajkumar,Nicholas Allen,Yanshan Wang,Itir Onal Ertugrul,Jeffrey F. Cohn
机构: Intelligent Systems Program, University of Pittsburgh, Pittsburgh, USA; CGI Technologies and Solutions Inc, USA; University of Oregon, USA; Department of Health Information Management, University of Pittsburgh, Pittsburgh, USA; Utrecht University, the Netherlands; Department of Psychology, University of Pittsburgh, Pittsburgh, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The classical approach to detecting depression from vision emphasizes interpretable features, such as facial expression, and classifiers such as the Support Vector Machine (SVM). With the advent of deep learning, there has been a shift in feature representations and classification approaches. Contemporary approaches use learnt features from general-purpose vision models such as VGGNet to train machine learning models. Little is known about how classical and deep approaches compare in depression detection with respect to accuracy, fairness, and generalizability, especially across contexts. To address these questions, we compared classical and deep approaches to the detection of depression in the visual modality in two different contexts: Mother-child interactions in the TPOT database and patient-clinician interviews in the Pitt database. In the former, depression was operationalized as a history of depression per the DSM and current or recent clinically significant symptoms. In the latter, all participants met initial criteria for depression per DSM, and depression was reassessed over the course of treatment. The classical approach included handcrafted features with SVM classifiers. Learnt features were turn-level embeddings from the FMAE-IAT that were combined with Multi-Layer Perceptron classifiers. The classical approach achieved higher accuracy in both contexts. It was also significantly fairer than the deep approach in the patient-clinician context. Cross-context generalizability was modest at best for both approaches, which suggests that depression may be context-specific.

[CV-197] SIMPLER: HE-Informed Representation Learning for Structured Illumination Microscopy

【速读】:该论文旨在解决现有数字病理学自监督模型在跨模态迁移时面临的性能瓶颈问题,特别是针对结构光照明显微成像(Structured Illumination Microscopy, SIM)这类厚组织荧光成像模态缺乏有效预训练方法的问题。由于SIM与传统薄切片染色模态(如HE)之间存在显著的模态差异,直接迁移或简单微调会导致过拟合于表观特征而非底层组织结构。解决方案的关键在于提出SIMPLER框架,通过以HE作为语义锚点,利用对抗性、对比性和重建目标对SIM与HE进行渐进式对齐,使SIM嵌入能够内化HE中的组织结构信息,同时保留其自身模态特性,从而实现可迁移的SIM表示学习。这一策略实现了不对称增强(asymmetric enrichment),即在不损害HE表征能力的前提下显著提升SIM下游任务表现。

链接: https://arxiv.org/abs/2604.10334
作者: Abu Zahid Bin Aziz,Syed Fahim Ahmed,Gnanesh Rasineni,Mei Wang,Olcaytu Hatipoglu,Marisa Ricci,Malaiyah Shaw,Guang Li,J. Quincy Brown,Valerio Pascucci,Shireen Elhabian
机构: University of Utah (犹他大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Structured Illumination Microscopy (SIM) enables rapid, high-contrast optical sectioning of fresh tissue without staining or physical sectioning, making it promising for intraoperative and point-of-care diagnostics. Recent foundation and large-scale self-supervised models in digital pathology have demonstrated strong performance on section-based modalities such as Hematoxylin and Eosin (HE) and immunohistochemistry (IHC). However, these approaches are predominantly trained on thin tissue sections and do not explicitly address thick-tissue fluorescence modalities such as SIM. When transferred directly to SIM, performance is constrained by substantial modality shift, and naive fine-tuning often overfits to modality-specific appearance rather than underlying histological structure. We introduce SIMPLER (Structured Illumination Microscopy-Powered Learning for Embedding Representations), a cross-modality self-supervised pretraining framework that leverages HE as a semantic anchor to learn reusable SIM representations. HE encodes rich cellular and glandular structure aligned with established clinical annotations, while SIM provides rapid, nondestructive imaging of fresh tissue. During pretraining, SIM and HE are progressively aligned through adversarial, contrastive, and reconstruction-based objectives, encouraging SIM embeddings to internalize histological structure from HE without collapsing modality-specific characteristics. A single pretrained SIMPLER encoder transfers across multiple downstream tasks, including multiple instance learning and morphological clustering, consistently outperforming SIM models trained from scratch or HE-only pretraining. Importantly, joint alignment enhances SIM performance without degrading HE representations, demonstrating asymmetric enrichment rather

[CV-198] Zero-shot World Models Are Developmentally Efficient Learners

【速读】:该论文旨在解决儿童如何在极有限的训练数据下实现高效且灵活的物理世界理解能力,以及当前最先进的AI系统在零样本泛化和数据效率方面仍面临巨大挑战的问题。解决方案的关键在于提出一种新颖的计算假设——零样本视觉世界模型(Zero-shot Visual World Model, ZWM),其核心由三个原则构成:基于稀疏时序因子化的预测器,将外观与动态解耦;通过近似因果推理实现零样本估计;以及通过推理组合构建更复杂的认知能力。ZWM仅需单个儿童的第一人称经验即可学习,并能快速在多个物理理解基准上生成竞争力,同时复现儿童发展的行为特征并构建类脑内部表征,为从人类规模数据中实现高效学习提供了蓝图。

链接: https://arxiv.org/abs/2604.10333
作者: Khai Loong Aw,Klemen Kotar,Wanhee Lee,Seungwoo Kim,Khaled Jedoui,Rahul Venkatesh,Lilian Naing Chen,Michael C. Frank,Daniel L.K. Yamins
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Young children demonstrate early abilities to understand their physical world, estimating depth, motion, object coherence, interactions, and many other aspects of physical scene understanding. Children are both data-efficient and flexible cognitive systems, creating competence despite extremely limited training data, while generalizing to myriad untrained tasks – a major challenge even for today’s best AI systems. Here we introduce a novel computational hypothesis for these abilities, the Zero-shot Visual World Model (ZWM). ZWM is based on three principles: a sparse temporally-factored predictor that decouples appearance from dynamics; zero-shot estimation through approximate causal inference; and composition of inferences to build more complex abilities. We show that ZWM can be learned from the first-person experience of a single child, rapidly generating competence across multiple physical understanding benchmarks. It also broadly recapitulates behavioral signatures of child development and builds brain-like internal representations. Our work presents a blueprint for efficient and flexible learning from human-scale data, advancing both a computational account for children’s early physical understanding and a path toward data-efficient AI systems.

[CV-199] NTIRE 2026 Challenge on Single Image Reflection Removal in the Wild: Datasets Results and Methods

【速读】:该论文旨在解决真实场景下单图像反射去除(Single-Image Reflection Removal, SIRR)的挑战,即在现实世界复杂环境中去除图像中的反射成分以恢复干净图像的问题。当前大多数方法仅在合成数据或有限的真实图像上进行测试,导致实际应用效果受限。解决方案的关键在于提出了OpenRR-5k数据集,该数据集包含多样化的现实世界反射场景和强度,能够有效推动模型在真实环境下的泛化能力;同时,通过NTIRE 2026挑战赛的形式激励了大量研究者参与,最终Top方法显著提升了SIRR的性能,并获得领域专家一致认可。

链接: https://arxiv.org/abs/2604.10321
作者: Jie Cai,Kangning Yang,Zhiyuan Li,Florin-Alexandru Vasluianu,Radu Timofte,Jinlong Li,Jinglin Shen,Zibo Meng,Junyan Cao,Lu Zhao,Pengwei Liu,Yuyi Zhang,Fengjun Guo,Jiagao Hu,Zepeng Wang,Fei Wang,Daiguo Zhou,Yi’ang Chen,Honghui Zhu,Mengru Yang,Yan Luo,Kui Jiang,Jin Guo,Jonghyuk Park,Jae-Young Sim,Wei Zhou,Hongyu Huang,Linfeng Li,Lindong Kong,Saiprasad Meesiyawar,Misbha Falak Khanpagadi,Nikhil Akalwadi,Ramesh Ashok Tabib,Uma Mudenagudi,Bilel Benjdira,Anas M. Ali,Wadii Boulila,Kosuke Shigematsu,Hiroto Shirono,Asuka Shin,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Yaokun Shi,Jiachen Tu,Shreeniketh Joshi,Jin-Hui Jiang,Yu-Fan Lin,Yu-Jou Hsiao,Chia-Ming Lee,Fu-En Yang,Yu-Chiang Frank Wang,Chih-Chung Hsu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we review the NTIRE 2026 challenge on single-image reflection removal (SIRR) in the Wild. SIRR is a fundamental task in image restoration. Despite progress in academic research, most methods are tested on synthetic images or limited real-world images, creating a gap in real-world applications. In this challenge, we provide participants with the OpenRR-5k dataset, which requires them to process real-world images that cover a range of reflection scenarios and intensities, with the goal of generating clean images without reflections. The challenge attracted more than 100 registrations, with 11 of them participating in the final testing phase. The top-ranked methods advanced the state-of-the-art reflection removal performance and earned unanimous recognition from the five experts in the field. The proposed OpenRR-5k dataset is available at this https URL, and the homepage of this challenge is at this https URL. Due to page limitations, this article only presents partial content; the full report and detailed analyses are available in the extended arXiv version.

[CV-200] Anatomy-Informed Deep Learning for Abdominal Aortic Aneurysm Segmentation

【速读】:该论文旨在解决CT血管造影(CT angiography)中腹主动脉瘤(abdominal aortic aneurysms, AAA)分割难题,该问题主要源于解剖结构的高变异性、血管边界对比度低以及邻近器官信号与血管相似导致的假阳性结果。解决方案的关键在于提出一种解剖感知的分割框架,通过将由TotalSegmentator生成的器官排除掩膜(organ exclusion masks)引入U-Net训练过程,显式编码解剖先验信息——即识别非血管器官并惩罚这些区域内的动脉瘤预测,从而引导模型聚焦于主动脉及其病理性扩张,抑制解剖学上不合理的预测。该方法在小样本数据下仍显著提升分割准确性、减少假阳性并增强边界一致性,验证了基于排除掩膜的解剖知识嵌入是提高模型鲁棒性和泛化能力的有效机制。

链接: https://arxiv.org/abs/2604.10312
作者: Osamah Sufyan,Martin Brückmann,Ralph Wickenhöfer,Babette Dellen,Uwe Jaekel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: International Conference on Computational Science

点击查看摘要

Abstract:In CT angiography, the accurate segmentation of abdominal aortic aneurysms (AAAs) is difficult due to large anatomical variability, low-contrast vessel boundaries, and the close proximity of organs whose intensities resemble vascular structures, often leading to false positives. To address these challenges, we propose an anatomy-aware segmentation framework that integrates organ exclusion masks derived from TotalSegmentator into the training process. These masks encode explicit anatomical priors by identifying non-vascular organsand penalizing aneurysm predictions within these regions, thereby guiding the U-Net to focus on the aorta and its pathological dilation while suppressing anatomically implausible predictions. Despite being trained on a relatively small dataset, the anatomy-aware model achieves high accuracy, substantially reduces false positives, and improves boundary consistency compared to a standard U-Net baseline. The results demonstrate that incorporating anatomical knowledge through exclusion masks provides an efficient mechanism to enhance robustness and generalization, enabling reliable AAA segmentation even with limited training data.

[CV-201] SatReg: Regression-based Neural Architecture Search for Lightweight Satellite Image Segmentation

【速读】:该论文旨在解决地球观测任务中遥感分割模型在边缘计算平台上的部署难题,即如何在严格的延迟(latency)和能耗(energy)约束下实现高效、轻量化的模型设计。其核心挑战在于传统模型搜索方法难以兼顾精度与硬件资源消耗的权衡。解决方案的关键在于提出一种基于回归的硬件感知调优框架 SatReg:通过以 CM-UNet 为教师模型,将复杂搜索空间压缩为两个关键宽度相关变量,并在 NVIDIA Jetson Orin Nano 上对少量学生模型进行性能采样,构建低阶代理模型(surrogate models)来预测 mIoU(平均交并比)、延迟和功耗;结合知识蒸馏(knowledge distillation)高效训练候选模型,最终利用学习到的代理模型快速筛选出接近最优的架构配置,无需穷举搜索,从而实现对混合 CNN-Mamba 分割模型在空间-边缘系统中的高效适配。

链接: https://arxiv.org/abs/2604.10306
作者: Edward Humes,Tinoosh Mohsenin
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As Earth-observation workloads move toward onboard and edge processing, remote-sensing segmentation models must operate under tight latency and energy constraints. We present SatReg, a regression-based hardware-aware tuning framework for lightweight remote-sensing segmentation on edge platforms. Using CM-UNet as the teacher architecture, we reduce the search space to two dominant width-related variables, profile a small set of student models on an NVIDIA Jetson Orin Nano, and fit low-order surrogate models for mIoU, latency, and power. Knowledge distillation is used to efficiently train the sampled students. The learned surrogates enable fast selection of near-optimal architecture settings for deployment targets without exhaustive search. Results show that the selected variables affect task accuracy and hardware cost differently, making reduced-space regression a practical strategy for adapting hybrid CNN-Mamba segmentation models to future space-edge systems.

[CV-202] Class-Adaptive Cooperative Perception for Multi-Class LiDAR-based 3D Object Detection in V2X Systems

【速读】:该论文旨在解决现有协同感知(cooperative perception)方法在多类3D目标检测中因采用统一融合策略而导致的性能不平衡问题,尤其难以有效处理小物体(如行人)与大物体(如卡车)在几何结构和点云采样模式上的差异。其解决方案的关键在于提出一种类别自适应的协同感知架构:通过引入多尺度窗口注意力机制与学习到的尺度路由策略实现空间自适应特征提取;设计类别特定的融合模块,将小物体与大物体分别送入不同的注意力融合路径以优化信息整合;结合鸟瞰图增强模块(并行空洞卷积与通道重校准)提升上下文表征能力;并通过类别平衡的目标权重设计缓解高频类别的偏差问题。实验表明,该方法在V2X-Real基准上实现了对各类目标(尤其是卡车和行人)更均衡且鲁棒的检测性能。

链接: https://arxiv.org/abs/2604.10305
作者: Blessing Agyei Kyem,Joshua Kofi Asamoah,Armstrong Aboah
机构: North Dakota State University (北达科他州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 16 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Cooperative perception allows connected vehicles and roadside infrastructure to share sensor observations, creating a fused scene representation beyond the capability of any single platform. However, most cooperative 3D object detectors use a uniform fusion strategy for all object classes, which limits their ability to handle the different geometric structures and point-sampling patterns of small and large objects. This problem is further reinforced by narrow evaluation protocols that often emphasize a single dominant class or only a few cooperation settings, leaving robust multi-class detection across diverse vehicle-to-everything interactions insufficiently explored. To address this gap, we propose a class-adaptive cooperative perception architecture for multi-class 3D object detection from LiDAR data. The model integrates four components: multi-scale window attention with learned scale routing for spatially adaptive feature extraction, a class-specific fusion module that separates small and large objects into attentive fusion pathways, bird’s-eye-view enhancement through parallel dilated convolution and channel recalibration for richer contextual representation, and class-balanced objective weighting to reduce bias toward frequent categories. Experiments on the V2X-Real benchmark cover vehicle-centric, infrastructure-centric, vehicle-to-vehicle, infrastructure-to-infrastructure, and vehicle-to-infrastructure settings under identical backbone and training configurations. The proposed method consistently improves mean detection performance over strong intermediate-fusion baselines, with the largest gains on trucks, clear improvements on pedestrians, and competitive results on cars. These results show that aligning feature extraction and fusion with class-dependent geometry and point density leads to more balanced cooperative perception in realistic vehicle-to-everything deployments.

[CV-203] AC-MIL: Weakly Supervised Atrial LGE-MRI Quality Assessment via Adversarial Concept Disentanglement

【速读】:该论文旨在解决晚期钆增强磁共振成像(Late Gadolinium Enhancement, LGE-MRI)在房颤管理中因患者运动、呼吸不规则及成像时机不佳导致的图像质量下降问题,尤其针对现有基于多实例学习(Multiple Instance Learning, MIL)方法在图像质量评估中缺乏可解释性的问题——即传统MIL模型将局部视觉证据映射为单一且不可解释的全局特征向量,无法提供关于具体失败模式(如运动模糊、对比度不足或解剖结构缺失)的可操作反馈。其解决方案的关键在于提出对抗概念多实例学习(Adversarial Concept-MIL, AC-MIL),通过仅使用体积级标签实现临床定义的放射学概念分解,引入受对抗擦除机制引导的无监督残差分支以严格防止信息泄露,并设计空间多样性约束来惩罚不同概念注意力图之间的重叠,从而确保局部化且可解释的特征提取。实验表明,AC-MIL不仅实现了对非诊断性扫描原因的精准定位,还保持了与现有基线相当的等级评分性能,显著提升了模型的临床透明度。

链接: https://arxiv.org/abs/2604.10303
作者: K M Arefeen Sultan,Kaysen Hansen,Benjamin Orkild,Alan Morris,Eugene Kholmovski,Erik Bieging,Eugene Kwan,Ravi Ranjan,Ed DiBella,Shireen Elhabian
机构: University of Iowa (爱荷华大学); Massachusetts General Hospital (马萨诸塞州总医院); Harvard Medical School (哈佛医学院); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-quality Late Gadolinium Enhancement (LGE) MRI can be helpful for atrial fibrillation management, yet scan quality is frequently compromised by patient motion, irregular breathing, and suboptimal image acquisition timing. While Multiple Instance Learning (MIL) has emerged as a powerful tool for automated quality assessment under weak supervision, current state-of-the-art methods map localized visual evidence to a single, opaque global feature vector. This black box approach fails to provide actionable feedback on specific failure modes, obscuring whether a scan degrades due to motion blur, inadequate contrast, or a lack of anatomical context. In this paper, we propose Adversarial Concept-MIL (AC-MIL), a weakly supervised framework that decomposes global image quality into clinically defined radiological concepts using only volume-level supervision. To capture latent quality variations without entangling predefined concepts, our framework incorporates an unsupervised residual branch guided by an adversarial erasure mechanism to strictly prevent information leakage. Furthermore, we introduce a spatial diversity constraint that penalizes overlap between distinct concept attention maps, ensuring localized and interpretable feature extraction. Extensive experiments on a clinical dataset of atrial LGE-MRI volumes demonstrate that AC-MIL successfully opens the MIL black box, providing highly localized spatial concept maps that allow clinicians to pinpoint the specific causes of non-diagnostic scans. Crucially, our framework achieves this deep clinical transparency while maintaining highly competitive ordinal grading performance against existing baselines. Code to be released on acceptance.

[CV-204] FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data

【速读】:该论文旨在解决现有组成图像检索(Composed Image Retrieval, CIR)方法中存在的视图不完整(View Incompleteness)问题,即当前方法仅基于单张参考图像和修改文本进行检索,无法模拟真实电商场景中用户从多个视角理解商品的需求。为此,作者首次提出多视图CIR任务(Multi-View CIR),将检索粒度从图像级提升至产品级,并构建了首个大规模多视图时尚数据集FashionMV(包含127K产品、472K多视角图像及220K+ CIR三元组)。解决方案的关键在于提出的ProCIR建模框架,其核心是基于多模态大语言模型(Multimodal Large Language Model, MLLM),融合三种互补机制:两阶段对话架构以实现有效对齐、基于描述的对齐机制增强语义一致性、链式思维引导(Chain-of-Thought Guidance)提供推理路径;此外还引入可选的监督微调(Supervised Fine-Tuning, SFT)注入结构化产品知识,显著提升对比学习效果。实验表明,对齐机制为最关键因素,且两阶段对话架构是实现有效对齐的前提,而SFT与链式思维在知识注入上存在部分冗余。

链接: https://arxiv.org/abs/2604.10297
作者: Peng Yuan,Bingyin Mei,Hui Zhang
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Composed Image Retrieval (CIR) retrieves target images using a reference image paired with modification text. Despite rapid advances, all existing methods and datasets operate at the image level – a single reference image plus modification text in, a single target image out – while real e-commerce users reason about products shown from multiple viewpoints. We term this mismatch View Incompleteness and formally define a new Multi-View CIR task that generalizes standard CIR from image-level to product-level retrieval. To support this task, we construct FashionMV, the first large-scale multi-view fashion dataset for product-level CIR, comprising 127K products, 472K multi-view images, and over 220K CIR triplets, built through a fully automated pipeline leveraging large multimodal models. We further propose ProCIR (Product-level Composed Image Retrieval), a modeling framework built upon a multimodal large language model that employs three complementary mechanisms – two-stage dialogue, caption-based alignment, and chain-of-thought guidance – together with an optional supervised fine-tuning (SFT) stage that injects structured product knowledge prior to contrastive training. Systematic ablation across 16 configurations on three fashion benchmarks reveals that: (1) alignment is the single most critical mechanism; (2) the two-stage dialogue architecture is a prerequisite for effective alignment; and (3) SFT and chain-of-thought serve as partially redundant knowledge injection paths. Our best 0.8B-parameter model outperforms all baselines, including general-purpose embedding models 10x its size. The dataset, model, and code are publicly available at this https URL.

[CV-205] FastSHADE: Fast Self-augmented Hierarchical Asymmetric Denoising for Efficient inference on mobile devices

【速读】:该论文旨在解决移动设备上实时图像去噪(real-time image denoising)的难题,其核心挑战在于边缘设备严格的延迟和功耗限制。解决方案的关键在于提出一种轻量级U-Net架构FastSHADE(Fast Self-augmented Hierarchical Asymmetric Denoising),其创新性体现在两个方面:一是引入异构频率去噪模块(Asymmetric Frequency Denoising Block, AFDB),将空间结构提取与高频噪声抑制解耦以提升效率;二是设计空间门控上采样器(Spatially Gated Upsampler, SGU),优化高分辨率跳跃连接融合。此外,通过噪声平移自增强策略(Noise Shifting Self-Augmentation)有效提升模型泛化能力,同时避免域偏移问题。实验表明,FastSHADE在保持实时性能(如FastSHADE-M在现代移动GPU上延迟为50 ms)的同时实现了高质量重建,显著提升了移动端图像信号处理(ISP)管线中的实用部署效果。

链接: https://arxiv.org/abs/2604.10275
作者: Nikolay Falaleev
机构: Fanis(凡尼斯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Real-time image denoising is essential for modern mobile photography but remains challenging due to the strict latency and power constraints of edge devices. This paper presents FastSHADE (Fast Self-augmented Hierarchical Asymmetric Denoising), a lightweight U-Net-style network tailored for real-time, high-fidelity restoration on mobile GPUs. Our method features a multi-stage architecture incorporating a novel Asymmetric Frequency Denoising Block (AFDB) that decouples spatial structure extraction from high-frequency noise suppression to maximize efficiency, and a Spatially Gated Upsampler (SGU) that optimizes high-resolution skip connection fusion. To address generalization, we introduce an efficient Noise Shifting Self-Augmentation strategy that enhances data diversity without inducing domain shifts. Evaluations on the MAI2021 benchmark demonstrate that our scalable model family establishes a highly efficient speed-fidelity trade-off. Our base FastSHADE-M variant maintains real-time latency (50 ms on a modern mobile GPU) while preserving structural integrity, and our scaled-up FastSHADE-XL establishes a new state-of-the-art for overall image quality. Ultimately, FastSHADE successfully bridges the gap between theoretical network efficiency and practical deployment for real-world mobile ISP pipelines.

[CV-206] Dual-Exposure Imaging with Events

【速读】:该论文旨在解决双曝光成像(Dual-Exposure Imaging, DEI)在低光照场景下因场景运动导致的空间位移和不同曝光时间引起的图像特征差异,从而产生伪影的问题。解决方案的关键在于提出一种基于事件相机的DEI算法(Event-based DEI, E-DEI),利用事件相机的高时间分辨率提供精确的帧间/帧内动态信息,将复杂任务分解为事件驱动的运动去模糊和低光图像增强两个子任务,并设计了一个双路径并行特征传播架构,其中引入了双路径特征对齐与融合模块(Dual-path Feature Alignment and Fusion, DFAF),通过事件辅助实现双曝光图像特征的有效对齐与融合,从而显著提升重建图像质量。

链接: https://arxiv.org/abs/2604.10273
作者: Mingyuan Lin,Hongyi Liu,Chu He,Wen Yang,Gui-Song Xia,Lei Yu
机构: Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:By combining complementary benefits of short- and long-exposure images, Dual-Exposure Imaging (DEI) enhances image quality in low-light scenarios. However, existing DEI approaches inevitably suffer from producing artifacts due to spatial displacement from scene motion and image feature discrepancies from different exposure times. To tackle this problem, we propose a novel Event-based DEI (E-DEI) algorithm, which reconstructs high-quality images from dual-exposure image pairs and events, leveraging high temporal resolution of event cameras to provide accurate inter-/intra-frame dynamic information. Specifically, we decompose this complex task into an integration of two sub-tasks, i.e., event-based motion deblurring and low-light image enhancement tasks, which guides us to design E-DEI network as a dual-path parallel feature propagation architecture. We propose a Dual-path Feature Alignment and Fusion (DFAF) module to effectively align and fuse features extracted from dual-exposure images with assistance of events. Furthermore, we build a real-world Dataset containing Paired low-/normal-light Images and Events (PIED). Experiments on multiple datasets show the superiority of our method. The code and dataset are available at github.

[CV-207] EditCrafter: Tuning-free High-Resolution Image Editing via Pretrained Diffusion Model CVPR

【速读】:该论文旨在解决基于扩散模型的图像编辑方法在高分辨率(高于训练时使用的512×512或1024×1024)及任意长宽比图像上难以应用的问题,现有方法在这些场景下常因直接分块处理导致对象结构失真和重复现象。解决方案的关键在于提出一种无需微调的高效编辑流程EditCrafter:首先通过**分块反演(tiled inversion)保留输入高分辨率图像的原始身份信息;进而设计了一种面向高分辨率图像编辑的噪声衰减流形约束无分类器引导(noise-damped manifold-constrained classifier-free guidance, NDCFG++)**机制,从反演潜空间中生成高质量编辑结果,从而实现跨分辨率、无优化的图像编辑。

链接: https://arxiv.org/abs/2604.10268
作者: Kunho Kim,Sumin Seo,Yongjun Cho,Hyungjin Chung
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPRW 2026 Proceeding Track. Project page: this https URL

点击查看摘要

Abstract:We propose EditCrafter, a high-resolution image editing method that operates without tuning, leveraging pretrained text-to-image (T2I) diffusion models to process images at resolutions significantly exceeding those used during training. Leveraging the generative priors of large-scale T2I diffusion models enables the development of a wide array of novel generation and editing applications. Although numerous image editing methods have been proposed based on diffusion models and exhibit high-quality editing results, they are difficult to apply to images with arbitrary aspect ratios or higher resolutions since they only work at the training resolutions (512x512 or 1024x1024). Naively applying patch-wise editing fails with unrealistic object structures and repetition. To address these challenges, we introduce EditCrafter, a simple yet effective editing pipeline. EditCrafter operates by first performing tiled inversion, which preserves the original identity of the input high-resolution image. We further propose a noise-damped manifold-constrained classifier-free guidance (NDCFG++) that is tailored for high resolution image editing from the inverted latent. Our experiments show that the our EditCrafter can achieve impressive editing results across various resolutions without fine-tuning and optimization.

[CV-208] Real-Time Human Reconstruction and Animation using Feed-Forward Gaussian Splatting

【速读】:该论文旨在解决现有3D人体重建方法在实时动画和交互应用中的局限性,特别是那些依赖深度监督、固定输入视角、UV映射或针对每个目标视角/姿态重复前向推理的模型。其核心挑战在于如何实现高质量重建的同时支持高效且可动画化的表示。解决方案的关键在于提出一种通用的前馈高斯点渲染框架(feed-forward Gaussian splatting framework),该框架直接从多视角RGB图像及其对应的SMPL-X姿态中学习,通过在标准姿态下为每个SMPL-X顶点预测一组3D高斯基元(Gaussian primitives)来构建人体表示:其中一部分高斯基元被约束贴近SMPL-X表面,提供强几何先验和与参数化人体模型的稳定对应关系;另一部分无约束的高斯基元则用于捕捉偏离参数表面的细节结构(如衣物和头发)。这种显式关联高斯基元与SMPL-X顶点的设计使得重建模型可通过线性混合皮肤(linear blend skinning)高效动画化,无需额外网络推理,从而实现单次前向传播即可生成可实时动画的人体三维表示。

链接: https://arxiv.org/abs/2604.10259
作者: Devdoot Chatterjee,Zakaria Laskar,C.V. Jawahar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:We present a generalizable feed-forward Gaussian splatting framework for human 3D reconstruction and real-time animation that operates directly on multi-view RGB images and their associated SMPL-X poses. Unlike prior methods that rely on depth supervision, fixed input views, UV map, or repeated feed-forward inference for each target view or pose, our approach predicts, in a canonical pose, a set of 3D Gaussian primitives associated with each SMPL-X vertex. One Gaussian is regularized to remain close to the SMPL-X surface, providing a strong geometric prior and stable correspondence to the parametric body model, while an additional small set of unconstrained Gaussians per vertex allows the representation to capture geometric structures that deviate from the parametric surface, such as clothing and hair. In contrast to recent approaches such as HumanRAM, which require repeated network inference to synthesize novel poses, our method produces an animatable human representation from a single forward pass; by explicitly associating Gaussian primitives with SMPL-X vertices, the reconstructed model can be efficiently animated via linear blend skinning without further network evaluation. We evaluate our method on the THuman 2.1, AvatarReX and THuman 4.0 datasets, where it achieves reconstruction quality comparable to state-of-the-art methods while uniquely supporting real-time animation and interactive applications. Code and pre-trained models are available at this https URL .

[CV-209] A Comparison of Multi-View Stereo Methods for Photogrammetric 3D Reconstruction: From Traditional to Learning-Based Approaches

【速读】:该论文旨在解决传统基于结构光恢复(Structure-from-Motion, SfM)与多视角立体匹配(Multi-View Stereo, MVS)方法在三维重建中存在速度慢、可扩展性差的问题,同时评估学习驱动的MVS方法在精度、覆盖率和运行效率上的表现。其解决方案的关键在于通过对比分析代表性传统MVS流水线(如COLMAP)与前沿学习型方法(包括几何引导类如MVSNet、PatchmatchNet、MVSAnywhere、MVSFormer++以及端到端框架如Stereo4D、FoundationStereo、DUSt3R、MASt3R、Fast3R、VGGT),系统验证后者在复杂场景下对图像配准失败的鲁棒性提升及加速潜力;结果表明,尽管传统方法几何一致性高但耗时显著,而端到端模型(如DUSt3R、VGGT)虽存在局部残差较大问题,却能在保持合理精度的同时实现更快重建速度,展现出更强的实际应用前景。

链接: https://arxiv.org/abs/2604.10246
作者: Yawen Li,George Vosselman,Francesco Nex
机构: University of Twente (特温特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Photogrammetric 3D reconstruction has long relied on traditional Structure-from-Motion (SfM) and Multi-View Stereo (MVS) methods, which provide high accuracy but face challenges in speed and scalability. Recently, learning-based MVS methods have emerged, aiming for faster and more efficient reconstruction. This work presents a comparative evaluation between a representative traditional MVS pipeline (COLMAP) and state-of-the-art learning-based approaches, including geometry-guided methods (MVSNet, PatchmatchNet, MVSAnywhere, MVSFormer++) and end-to-end frameworks (Stereo4D, FoundationStereo, DUSt3R, MASt3R, Fast3R, VGGT). Two experiments were conducted on different aerial scenarios. The first experiment used the MARS-LVIG dataset, where ground-truth 3D reconstruction was provided by LiDAR point clouds. The second experiment used a public scene from the Pix4D official website, with ground truth generated by Pix4Dmapper. We evaluated accuracy, coverage, and runtime across all methods. Experimental results show that although COLMAP can provide reliable and geometrically consistent reconstruction results, it requires more computation time. In cases where traditional methods fail in image registration, learning-based approaches exhibit stronger feature-matching capability and greater robustness. Geometry-guided methods usually require careful dataset preparation and often depend on camera pose or depth priors generated by COLMAP. End-to-end methods such as DUSt3R and VGGT achieve competitive accuracy and reasonable coverage while offering substantially faster reconstruction. However, they exhibit relatively large residuals in 3D reconstruction, particularly in challenging scenarios.

[CV-210] Warm-Started Reinforcement Learning for Iterative 3D/2D Liver Registration

【速读】:该论文旨在解决术前CT与术中腹腔镜视频之间的配准问题,这是实现增强现实(Augmented Reality, AR)引导微创手术的关键技术挑战。传统优化方法虽精度高但速度慢,而现有学习方法常产生粗略对齐需后续优化,导致整体效率下降。其解决方案的核心在于提出一种基于离散动作强化学习(Reinforcement Learning, RL)的框架,将CT到视频的配准建模为序列决策过程:通过一个共享特征编码器(预训练自监督姿态估计网络以获得稳定几何特征)提取图像表征,并由RL策略头自主选择六自由度刚性变换及决定迭代终止时机,从而实现无需人工设定步长或停止条件的自动化、高效迭代配准。实验表明该方法在公开腹腔镜数据集上达到15.70 mm平均目标配准误差(Target Registration Error, TRE),性能媲美需后处理优化的监督方法,同时收敛更快。

链接: https://arxiv.org/abs/2604.10245
作者: Hanyuan Zhang,Lucas He,Zijie Cheng,Abdolrahim Kadkhodamohammadi,Danail Stoyanov,Brian R. Davidson,Evangeles B. Mazomenos,Matthew.J Clarkson
机构: University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: Laparoscopic Liver Surgery, Augmented Reality, Image Registration, Reinforcement Learning

点击查看摘要

Abstract:Registration between preoperative CT and intraoperative laparoscopic video plays a crucial role in augmented reality (AR) guidance for minimally invasive surgery. Learning-based methods have recently achieved registration errors comparable to optimization-based approaches while offering faster inference. However, many supervised methods produce coarse alignments that rely on additional optimization-based refinement, thereby increasing inference time. We present a discrete-action reinforcement learning (RL) framework that formulates CT-to-video registration as a sequential decision-making process. A shared feature encoder, warm-started from a supervised pose estimation network to provide stable geometric features and faster convergence, extracts representations from CT renderings and laparoscopic frames, while an RL policy head learns to choose rigid transformations along six degrees of freedom and to decide when to stop the iteration. Experiments on a public laparoscopic dataset demonstrated that our method achieved an average target registration error (TRE) of 15.70 mm, comparable to supervised approaches with optimization, while achieving faster convergence. The proposed RL-based formulation enables automated, efficient iterative registration without manually tuned step sizes or stopping criteria. This discrete framework provides a practical foundation for future continuous-action and deformable registration models in surgical AR applications. Comments: Laparoscopic Liver Surgery, Augmented Reality, Image Registration, Reinforcement Learning Subjects: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph) Cite as: arXiv:2604.10245 [cs.CV] (or arXiv:2604.10245v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.10245 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Hanyuan Zhang [view email] [v1] Sat, 11 Apr 2026 14:58:45 UTC (7,048 KB)

[CV-211] MedVeriSeg: Teaching MLLM -Based Medical Segmentation Models to Verify Query Validity Without Extra Training

【速读】:该论文旨在解决基于多模态大语言模型(Multimodal Large Language Model, MLLM)的医学图像分割方法中,对不存在目标的错误查询无法可靠拒绝的问题,此类问题会导致生成虚假的分割掩码(segmentation masks),从而降低在医学教育和临床应用中的可靠性。解决方案的关键在于提出一种无需训练的验证框架 MedVeriSeg,其核心思想是利用 [SEG] token 特征与 MLLM 图像特征之间的相似性图(similarity map)在真实查询与虚假查询下分布模式显著不同的特性,设计了一个相似性响应质量评分模块(Similarity Response Quality Scoring Module),从强度(strength)、紧凑性(compactness)和纯净度(purity)三个维度量化该相似性图以初步判断目标是否存在;进一步结合 GPT-4o 对相似性热力图和评分结果进行联合视觉证据评估,实现最终的真伪查询验证。

链接: https://arxiv.org/abs/2604.10242
作者: Ziqian Lu,Qinyue Tong,Jun Liu,Yunlong Yu
机构: Zhejiang Sci-Tech University (浙江科技学院); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 4 figures

点击查看摘要

Abstract:Despite recent advances in MLLM-based medical image segmentation, existing LISA-like methods cannot reliably reject false queries and often produce hallucinated segmentation masks for absent targets. This limitation reduces practical reliability in both medical education and clinical use. In this work, we propose MedVeriSeg, a training-free verification framework that equips LISA-like medical segmentation models with the ability to identify and reject false queries which contain non-existent targets. Our key observation is that the similarity map between the [SEG] token feature and MLLM image features exhibits markedly different distribution patterns for true and false queries. Based on this, we introduce a Similarity Response Quality Scoring Module that characterizes the similarity map from three aspects: strength, compactness, and purity, producing an initial target-existence prediction. We further incorporate qualitative visual evidence by using GPT-4o to jointly assess the similarity heatmap and the results of Similarity Response Quality Scoring Module for final verification. Experiments on a small-scale benchmark constructed from SA-Med2D-20M show that MedVeriSeg effectively rejects false-query segmentation requests while maintaining reliable recognition of true queries.

[CV-212] Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis

【速读】:该论文旨在解决现有3D医学多模态大模型(3D medical MLLMs)因缺乏足够训练数据而导致视觉编码器预训练不足、难以提取适配不同任务的定制化图像特征的问题。其解决方案的关键在于:首先将已在2D自然图像上充分预训练的2D多模态大模型迁移至支持3D医学体素输入,同时复用全部预训练参数以提升效率;其次设计文本引导的分层专家混合(Text-Guided Hierarchical MoE, TGH-MoE)框架,在文本提示指导下区分不同任务并提取任务特定特征;最后采用两阶段训练策略联合学习任务共享与任务特异性图像特征,从而显著提升医学报告生成(MRG)和医学视觉问答(MVQA)任务的性能。

链接: https://arxiv.org/abs/2604.10233
作者: Yang Yu,Dunyuan Xu,Yaoqian Li,Xiaomeng Li,Jinpeng Li,Pheng-Ann Heng
机构: The Hong Kong University of Science and Technology (香港科技大学); The Chinese University of Hong Kong (香港中文大学); Institute of Medical Intelligence and XR (医学智能与XR研究所); Center for Artificial Intelligence and Robotics (人工智能与机器人中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:3D medical image analysis is of great importance in disease diagnosis and treatment. Recently, multimodal large language models (MLLMs) have exhibited robust perceptual capacity, strong cross-modal alignment, and promising generalizability. Therefore, they have great potential to improve the performance of medical report generation (MRG) and medical visual question answering (MVQA), which serve as two important tasks in clinical scenarios. However, due to the scarcity of 3D medical images, existing 3D medical MLLMs suffer from insufficiently pretrained vision encoder and inability to extract customized image features for different kinds of tasks. In this paper, we propose to first transfer a 2D MLLM, which is well trained with 2D natural images, to support 3D medical volumetric inputs while reusing all of its pre-trained parameters. To enable the vision encoder to extract tailored image features for various tasks, we then design a Text-Guided Hierarchical MoE (TGH-MoE) framework, which can distinguish tasks under the guidance of the text prompt. Furthermore, we propose a two-stage training strategy to learn both task-shared and task-specific image features. As demonstrated empirically, our method outperforms existing 3D medical MLLMs in both MRG and MVQA tasks. Our code will be released once this paper is accepted.

[CV-213] SMFormer: Empowering Self-supervised Stereo Matching via Foundation Models and Data Augmentation

【速读】:该论文旨在解决自监督立体匹配方法中因光照变化、视角差异等现实干扰导致的光度一致性假设失效问题,从而引发监督信号不可靠和精度显著低于有监督方法的问题。解决方案的关键在于提出SMFormer框架,其核心创新包括:1)引入视觉基础模型(Vision Foundation Model, VFM)与特征金字塔网络(Feature Pyramid Network, FPN)融合,获得对扰动具有鲁棒性的判别性特征表示;2)设计一种有效的数据增强机制,显式约束光照变化下特征的一致性,并通过强增强样本与标准样本的视差预测输出一致性进行正则化,提升模型在复杂场景下的泛化能力。实验表明,该方法在多个主流基准上达到自监督方法的SOTA性能,甚至在挑战性较强的Booster基准上超越部分有监督方法。

链接: https://arxiv.org/abs/2604.10218
作者: Yun Wang,Zhengjie Yang,Jiahao Zheng,Zhanjie Zhang,Dapeng Oliver Wu,Yulan Guo
机构: City University of Hong Kong (香港城市大学); The Hong Kong University of Science and Technology (香港科技大学); Zhejiang University (浙江大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent self-supervised stereo matching methods have made significant progress. They typically rely on the photometric consistency assumption, which presumes corresponding points across views share the same appearance. However, this assumption could be compromised by real-world disturbances, resulting in invalid supervisory signals and a significant accuracy gap compared to supervised methods. To address this issue, we propose SMFormer, a framework integrating more reliable self-supervision guided by the Vision Foundation Model (VFM) and data augmentation. We first incorporate the VFM with the Feature Pyramid Network (FPN), providing a discriminative and robust feature representation against disturbance in various scenarios. We then devise an effective data augmentation mechanism that ensures robustness to various transformations. The data augmentation mechanism explicitly enforces consistency between learned features and those influenced by illumination variations. Additionally, it regularizes the output consistency between disparity predictions of strong augmented samples and those generated from standard samples. Experiments on multiple mainstream benchmarks demonstrate that our SMFormer achieves state-of-the-art (SOTA) performance among self-supervised methods and even competes on par with supervised ones. Remarkably, in the challenging Booster benchmark, SMFormer even outperforms some SOTA supervised methods, such as CFNet.

[CV-214] Are Pretrained Image Matchers Good Enough for SAR-Optical Satellite Registration?

【速读】:该论文旨在解决跨模态光学-合成孔径雷达(Optical-SAR)图像配准问题,这是灾害响应中遥感应用的关键瓶颈。解决方案的核心在于系统评估24种预训练匹配器在SpaceNet9及两个额外跨模态基准上的零样本(zero-shot)性能,采用确定性协议进行大图像分块推理、鲁棒几何过滤和基于控制点的度量评估。研究发现,显式跨模态训练并非提升性能的唯一路径:RoMa(未经过跨模态训练)与XoFTR(专为可见光-热红外设计)均达到最低均方误差(3.0 px),而MatchAnything-ELoFTR(基于合成跨模态对训练)也表现接近,提示基础模型特征(如DINOv2)可能具备模态不变性,可部分替代显式跨模态监督;同时,部署协议选择(如几何模型、分块大小、内点门控)对精度影响显著,甚至超过更换匹配器本身,例如仅使用仿射几何即可将平均误差从12.34 px降至9.74 px,凸显了实际部署策略的重要性。

链接: https://arxiv.org/abs/2604.10217
作者: Isaac Corley,Alex Stoken,Gabriele Berton
机构: Taylor Geospatial(泰勒地理空间公司); Independent Researcher(独立研究员)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cross-modal optical-SAR (Synthetic Aperture Radar) registration is a bottleneck for disaster-response via remote sensing, yet modern image matchers are developed and benchmarked almost exclusively on natural-image domains. We evaluate twenty-four pretrained matcher families–in a zero-shot setting with no fine-tuning or domain adaptation on satellite or SAR data–on SpaceNet9 and two additional cross-modal benchmarks under a deterministic protocol with tiled large-image inference, robust geometric filtering, and tie-point-grounded metrics. Our results reveal asymmetric transfer–matchers with explicit cross-modal training do not uniformly outperform those without it. While XoFTR (trained for visible-thermal matching) and RoMa achieve the lowest reported mean error at 3.0 px on the labeled SpaceNet9 training scenes, RoMa achieves this without any cross-modal training, and MatchAnything-ELoFTR ( 3.4 px)–trained on synthetic cross-modal pairs–matches closely, suggesting (as a working hypothesis) that foundation-model features (DINOv2) may contribute to modality invariance that partially substitutes for explicit cross-modal supervision. 3D-reconstruction matchers (MASt3R, DUSt3R), which are not designed for traditional 2D image matching, are highly protocol-sensitive and remain fragile under default settings. Deployment protocol choices (geometry model, tile size, inlier gating) shift accuracy by up to 33\times for a single matcher, sometimes exceeding the effect of swapping matchers entirely within the evaluated sweep–affine geometry alone reduces mean error from 12.34 to 9.74 px. These findings inform both practical deployment of existing matchers and future matcher design for cross-modal satellite registration.

[CV-215] ReaLiTy and LADS: A Unified Framework and Dataset Suite for LiDAR Adaptation Across Sensors and Adverse Weather Conditions

【速读】:该论文旨在解决当前LiDAR感知在不同传感器配置和恶劣天气条件下缺乏物理一致性数据的问题,从而限制了对领域偏移(domain shift)的系统性分析。解决方案的关键在于提出ReaLiTy框架,该框架融合物理驱动的线索与学习模块,以生成符合目标传感器规格和天气条件的真实强度模式,并通过物理基础的天气模型引入一致的几何与辐射退化效果,从而实现跨域感知的可重现性研究。

链接: https://arxiv.org/abs/2604.10213
作者: Vivek Anand,Bharat Lohani,Rakesh Mishra,Gaurav Pandey
机构: Indian Institute of Technology Kanpur (印度理工学院坎普尔分校); University of New Brunswick (新不伦瑞克大学); Texas AM University (德克萨斯农工大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reliable LiDAR perception requires robustness across sensors, environments, and adverse weather. However, existing datasets rarely provide physically consistent observations of the same scene under varying sensor configurations and weather conditions, limiting systematic analysis of domain shifts. This work presents ReaLiTy, a unified physics-informed framework that transforms LiDAR data to match target sensor specifications and weather conditions. The framework integrates physically grounded cues with a learning-based module to generate realistic intensity patterns, while a physics-based weather model introduces consistent geometric and radiometric degradations. Building on this framework, we introduce the LiDAR Adaptation Dataset Suite (LADS), a collection of physically consistent, transformation-ready point clouds with one-to-one correspondence to original datasets. Experiments demonstrate improved cross-domain consistency and realistic weather effects. ReaLiTy and LADS provide a reproducible foundation for studying LiDAR adaptation and simulation-driven perception in intelligent transportation systems.

[CV-216] A3-FPN: Asymptotic Content-Aware Pyramid Attention Network for Dense Visual Prediction

【速读】:该论文旨在解决密集预测任务中因目标尺度变化导致的多尺度特征表示不足问题,尤其针对现有特征金字塔网络(Feature Pyramid Network, FPN)在捕捉判别性特征和识别小目标方面的固有设计缺陷。解决方案的关键在于提出渐近解耦内容感知金字塔注意力网络(Asymptotic Content-Aware Pyramid Attention Network, A3-FPN),其核心创新包括:1)采用水平扩展的列式网络结构实现渐近全局特征交互,并将每一层级从所有层次表示中解耦;2)在特征融合阶段引入邻近层级的内容补充机制,生成位置感知的偏移量与权重以进行上下文感知重采样,并学习深层上下文重加权以增强类别内相似性;3)在特征重组阶段强化同尺度判别性特征学习,并基于特征图的信息含量与空间变化重组冗余特征。该方法可无缝集成于主流CNN与Transformer架构,在MS COCO、VisDrone2019-DET和Cityscapes等数据集上显著提升性能。

链接: https://arxiv.org/abs/2604.10210
作者: Meng’en Qin,Yu Song,Quanling Zhao,Xiaodong Yang,Yingtao Che,Xiaohui Yang
机构: Henan Engineering Research Center for Artificial Intelligence Theory and Algorithms(河南省人工智能理论与算法工程研究中心); Shenzhen University of Advanced Technology(深圳先进技术研究院); The Hong Kong Polytechnic University(香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning multi-scale representations is the common strategy to tackle object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose Asymptotic Content-Aware Pyramid Attention Network (A3-FPN), to augment multi-scale feature representation via the asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019-DET and Cityscapes demonstrate that A3-FPN can be easily integrated into state-of-the-art CNN and Transformer-based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and Swin-L backbone, A3-FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Codes are available at this https URL.

[CV-217] Edu-MMBias: A Three-Tier Multimodal Benchmark for Auditing Social Bias in Vision-Language Models under Educational Contexts

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在教育决策中潜在的社会偏见问题,尤其是现有以文本为中心的评估方法忽视了视觉模态可能引入的隐性偏见,导致偏见传播渠道未被有效监管。其解决方案的关键在于提出Edu-MMBias框架,该框架基于社会心理学中的态度三成分模型(cognitive-affective-behavioral),从认知、情感和行为三个层级系统性地诊断偏见;并通过一个融合自校正机制与人工介入验证的专用生成管道,合成抗污染的学生画像,对当前最先进的VLMs进行全方位压力测试,从而揭示出视觉输入作为“安全后门”可绕过文本对齐保护机制、诱发偏见再现的现象,暴露了模型内部认知与最终决策之间的系统性错位。

链接: https://arxiv.org/abs/2604.10200
作者: Ruijia Li,Mingzi Zhang,Zengyi Yu,Yuang Wei,Bo Jiang
机构: East China Normal University (华东师范大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As Vision-Language Models (VLMs) become integral to educational decision-making, ensuring their fairness is paramount. However, current text-centric evaluations neglect the visual modality, leaving an unregulated channel for latent social biases. To bridge this gap, we present Edu-MMBias, a systematic auditing framework grounded in the tri-component model of attitudes from social psychology. This framework diagnoses bias across three hierarchical dimensions: cognitive, affective, and behavioral. Utilizing a specialized generative pipeline that incorporates a self-correct mechanism and human-in-the-loop verification, we synthesize contamination-resistant student profiles to conduct a holistic stress test on state-of-the-art VLMs. Our extensive audit reveals critical, counter-intuitive patterns: models exhibit a compensatory class bias favoring lower-status narratives while simultaneously harboring deep-seated health and racial stereotypes. Crucially, we find that visual inputs act as a safety backdoor, triggering a resurgence of biases that bypass text-based alignment safeguards and revealing a systematic misalignment between latent cognition and final decision-making. The contributions of this paper are available at: this https URL.

[CV-218] Radiology Report Generation for Low-Quality X-Ray Images

【速读】:该论文旨在解决当前视觉-语言模型(Vision-Language Models, VLMs)在真实临床环境中因图像质量下降导致的放射学报告生成(Radiology Report Generation, RRG)性能显著退化的问题。现有方法隐式假设输入图像质量良好,忽视了实际场景中普遍存在的噪声和伪影。为应对这一挑战,作者提出了一种鲁棒的报告生成框架,其关键创新在于引入一个自动质量评估代理(Automated Quality Assessment Agent, AQAA)以识别低质量样本,并构建了首个针对低质量图像的放射学报告生成基准(Low-quality Radiology Report Generation, LRRG)。此外,提出了基于双循环训练策略(Dual-loop Training Strategy)的优化方法,利用双层优化和梯度一致性机制,使模型在不同图像质量区间内学习到对质量不敏感的诊断特征,从而有效缓解由图像质量劣化引起的性能下降。

链接: https://arxiv.org/abs/2604.10188
作者: Hongze Zhu,Chen Hu,Jiaxuan Jiang,Hong Liu,Yawen Huang,Ming Hu,Tianyu Wang,Zhijian Wu,Yefeng Zheng
机构: Westlake University (西湖大学); Tencent Jarvis Lab (腾讯混元实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have significantly advanced automated Radiology Report Generation (RRG). However, existing methods implicitly assume high-quality inputs, overlooking the noise and artifacts prevalent in real-world clinical environments. Consequently, current models exhibit severe performance degradation when processing suboptimal images. To bridge this gap, we propose a robust report generation framework explicitly designed for image quality variations. We first introduce an Automated Quality Assessment Agent (AQAA) to identify low-quality samples within the MIMIC-CXR dataset and establish the Low-quality Radiology Report Generation (LRRG) benchmark. To tackle degradation-induced shifts, we propose a novel Dual-loop Training Strategy leveraging bi-level optimization and gradient consistency. This approach ensures the model learns quality-agnostic diagnostic features by aligning gradient directions across varying quality regimes. Extensive experiments demonstrate that our approach effectively mitigates model performance degradation caused by image quality deterioration. The code and data will be released upon acceptance.

[CV-219] Device-Conditioned Neural Architecture Search for Efficient Robotic Manipulation

【速读】:该论文旨在解决复杂视觉运动策略(visuomotor policies)在异构机器人硬件约束下部署时面临的挑战,尤其是现有模型高效方法普遍存在设备依赖性、泛化能力差以及适配过程需耗时的逐设备优化问题。其解决方案的关键在于提出一个统一框架DC-QFA(Device-Conditioned Quantization-for-All),通过设备条件化的量化感知训练与硬件约束的架构搜索实现部署努力的摊销。核心创新包括:构建一个覆盖网络结构和混合精度位宽的设计空间的单一超网络(supernet),并以每设备查找表为指导进行延迟与内存感知的正则化优化;在此基础上,针对每个目标平台可执行“一次训练、全平台适配”的轻量级搜索,无需再进行逐设备重优化,从而显著提升跨异构硬件的通用部署能力和效率。此外,引入多步在线策略蒸馏机制以缓解低精度下的误差累积,增强长时间任务的稳定性。

链接: https://arxiv.org/abs/2604.10170
作者: Yiming Wu,Huan Wang,Zhenghao Chen,Ge Yuan,Dong Xu
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 4 figures

点击查看摘要

Abstract:The growing complexity of visuomotor policies poses significant challenges for deployment with heterogeneous robotic hardware constraints. However, most existing model-efficient approaches for robotic manipulation are device- and model-specific, lack generalizability, and require time-consuming per-device optimization during the adaptation process. In this work, we propose a unified framework named \textbfDevice-\textbfConditioned \textbfQuantization-\textbfFor-\textbfAll (DC-QFA) which amortizes deployment effort with the device-conditioned quantization-aware training and hardware-constrained architecture search. Specifically, we introduce a single supernet that spans a rich design space over network architectures and mixed-precision bit-widths. It is optimized with latency- and memory-aware regularization, guided by per-device lookup tables. With this supernet, for each target platform, we can perform a once-for-all lightweight search to select an optimal subnet without any per-device re-optimization, which enables more generalizable deployment across heterogeneous hardware, and substantially reduces deployment time. To improve long-horizon stability under low precision, we further introduce multi-step on-policy distillation to mitigate error accumulation during closed-loop execution. Extensive experiments on three representative policy backbones, such as DiffusionPolicy-T, MDT-V, and OpenVLA-OFT, demonstrate that our DC-QFA achieves 2\text-3\times acceleration on edge devices, consumer-grade GPUs, and cloud platforms, with negligible performance drop in task success. Real-world evaluations on an Inovo robot equipped with a force/torque sensor further validates that our low-bit DC-QFA policies maintain stable, contact-rich manipulation even under severe quantization.

[CV-220] Semantic Manipulation Localization

【速读】:该论文旨在解决传统图像篡改定位(Image Manipulation Localization, IML)方法在面对现代生成式AI(Generative AI)造成的语义级篡改时效果下降的问题。这类篡改往往不产生明显的低层伪影,而是通过改变物体属性、状态或关系等细微但影响图像语义理解的编辑实现,使得依赖于视觉伪影检测的传统方法失效。解决方案的关键在于提出一种新的任务——语义篡改定位(Semantic Manipulation Localization, SML),并设计端到端的TRACE框架,其核心创新是通过三个逐步耦合的组件建模语义敏感性:语义锚定(semantic anchoring)用于识别支撑图像理解的语义区域,语义扰动感知(semantic perturbation sensing)引入频域扰动敏感线索以捕捉强视觉一致性下的微小变化,以及语义约束推理(semantic-constrained reasoning)通过联合推理语义内容与语义范围验证候选区域,从而实现更完整、紧凑且语义一致的定位结果。

链接: https://arxiv.org/abs/2604.10132
作者: Zhenshan Tan,Chenhan Lu,Yuxiang Huang,Ziwen He,Xiang Zhang,Yuzhe Sha,Xianyi Chen,Tianrun Chen,Zhangjie Fu
机构: Nanjing University of Information Science and Technology (南京信息工程大学); Zhejiang University (浙江大学); KOKONI3D, Moxin (Hangzhou) Technology Company Ltd. (墨芯(杭州)科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Image Manipulation Localization (IML) aims to identify edited regions in an image. However, with the increasing use of modern image editing and generative models, many manipulations no longer exhibit obvious low-level artifacts. Instead, they often involve subtle but meaning-altering edits to an object’s attributes, state, or relationships while remaining highly consistent with the surrounding content. This makes conventional IML methods less effective because they mainly rely on artifact detection rather than semantic sensitivity. To address this issue, we introduce Semantic Manipulation Localization (SML), a new task that focuses on localizing subtle semantic edits that significantly change image interpretation. We further construct a dedicated fine-grained benchmark for SML using a semantics-driven manipulation pipeline with pixel-level annotations. Based on this task, we propose TRACE (Targeted Reasoning of Attributed Cognitive Edits), an end-to-end framework that models semantic sensitivity through three progressively coupled components: semantic anchoring, semantic perturbation sensing, and semantic-constrained reasoning. Specifically, TRACE first identifies semantically meaningful regions that support image understanding, then injects perturbation-sensitive frequency cues to capture subtle edits under strong visual consistency, and finally verifies candidate regions through joint reasoning over semantic content and semantic scope. Extensive experiments show that TRACE consistently outperforms existing IML methods on our benchmark and produces more complete, compact, and semantically coherent localization results. These results demonstrate the necessity of moving beyond artifact-based localization and provide a new direction for image forensics in complex semantic editing scenarios.

[CV-221] Improving Deep Learning-Based Target Volume Auto-Delineation for Adaptive MR-Guided Radiotherapy in Head and Neck Cancer: Impact of a Volume-Aware Dice Loss

【速读】:该论文旨在解决头颈部癌(Head and Neck Cancer, HNC)放疗计划中靶区手动勾画效率低、观察者间差异大这一瓶颈问题,尤其关注小体积、解剖结构复杂的转移性淋巴结(Metastatic Lymph Nodes, LN)在自动分割中的漏检问题。解决方案的关键在于引入一种体积感知的Dice损失函数(Volume-Aware Dice loss),通过在多标签分割任务中对不同目标区域施加差异化权重,以提升对小体积淋巴结的检测敏感性。实验表明,选择性地仅对淋巴结应用该损失函数可显著提高其分割灵敏度,但会牺牲原发肿瘤(Primary Tumor, PT)的分割精度;而采用双掩膜策略(即同时对PT和LN应用体积感知损失)可在保持原发灶分割准确性的同时优化淋巴结检测性能,从而实现多目标分割任务的平衡优化。

链接: https://arxiv.org/abs/2604.10130
作者: Sogand Beirami,Zahra Esmaeilzadeh,Ahmed Gomaa,Pluvio Stephan,Ishita Sheth,Thomas Weissmann,Juliane Szkitsak,Philipp Schubert,Yixing Huang,Annette Schwarz,Stefanie Corradini,Florian Putz
机构: Universitätsklinikum Erlangen (埃尔朗根大学附属医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Background: Manual delineation of target volumes in head and neck cancer (HNC) remains a significant bottleneck in radiotherapy planning, characterized by high inter-observer variability and time consumption. This study evaluates the integration of a Volume-Aware (VA) Dice loss function into a self-configuring deep learning framework to enhance the auto-segmentation of primary tumors (PT) and metastatic lymph nodes (LN) for adaptive MR-guided radiotherapy. We investigate how volume-sensitive weighting affects the detection of small, anatomically complex nodal metastases compared to conventional loss functions. Methods: Utilizing the HNTS-MRG 2024 dataset, we implemented an nnU-Net ResEnc M architecture. We conducted a multi-label segmentation task, comparing a standard Dice loss baseline against two Volume-Aware configurations: a “Dual Mask” setup (VA loss on both PT and LN) and a “Selective LN Mask” setup (VA loss on LN only). Evaluation metrics included volumetric Dice scores, surface-based metrics (SDS, MSD, HD95), and lesion-wise binary detection sensitivity and precision. Results: The Selective LN Mask configuration achieved the highest LN Volumetric Dice Score (0.758 vs. 0.734 baseline) and significantly improved LN Lesion-Wise Detection Sensitivity (84.93% vs. 81.80%). However, a critical trade-off was observed; PT detection precision declined significantly in the selective setup (63.65% vs. 81.27%). The Dual Mask configuration provided the most balanced performance across both targets, maintaining primary tumor precision at 82.04% while improving LN sensitivity to 83.46%. Conclusions: A volume-sensitive loss function mitigated the under-representation of small metastatic lesions in HNC. While selective weighting yielded the best nodal detection, a dual-mask approach is required in multi-label tasks to maintain segmentation accuracy for larger primary tumor volumes.

[CV-222] VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation CVPR2026

【速读】:该论文旨在解决当前AIGC(人工智能生成内容)视频生成评估体系中缺乏对美学质量(aesthetic quality)全面考量的问题,现有基准主要聚焦于技术保真度,忽视了感知与艺术性维度,导致评估结果难以反映真实用户体验。其解决方案的关键在于提出一个统一的三层次分类框架——美学质量、美学标签和生成质量,并据此构建了VGA-Bench基准:通过设计1,016个多样化提示词生成超6万条视频数据集,结合人工标注子集开发出三个专用多任务神经评估模型(VAQA-Net用于美学质量预测,VTag-Net用于自动美学标签识别,VGQA-Net用于生成与基础质量属性判断),实现了对视频生成质量与美学质量的系统化、自动化、高精度联合评估,显著提升了与人类判断的一致性。

链接: https://arxiv.org/abs/2604.10127
作者: Longteng Jiang,DanDan Zheng,Qianqian Qiao,Heng Huang,Huaye Wang,Yihang Bo,Bao Peng,Jingdong Chen,Jun Zhou,Xin Jin
机构: Ant Group(蚂蚁集团); Beijing Film Academy(北京电影学院); State Key Laboratory of General Artificial Intelligence, BIGAI(通用人工智能国家重点实验室,BIGAI)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026

点击查看摘要

Abstract:The rapid advancement of AIGC-based video generation has underscored the critical need for comprehensive evaluation frameworks that go beyond traditional generation quality metrics to encompass aesthetic appeal. However, existing benchmarks remain largely focused on technical fidelity, leaving a significant gap in holistic assessment-particularly with respect to perceptual and artistic qualities. To address this limitation, we introduce VGA-Bench, a unified benchmark for joint evaluation of video generation quality and aesthetic quality. VGA-Bench is built upon a principled three-tier taxonomy: Aesthetic Quality, Aesthetic Tagging, and Generation Quality, each decomposed into multiple fine-grained sub-dimensions to enable systematic assessment. Guided by this taxonomy, we design 1,016 diverse prompts and generate a large-scale dataset of over 60,000 videos using 12 video generation models, ensuring broad coverage across content, style, and artifacts. To enable scalable and automated evaluation, we annotate a subset of the dataset via human labeling and develop three dedicated multi-task neural assessors: VAQA-Net for aesthetic quality prediction, VTag-Net for automatic aesthetic tagging, and VGQA-Net for generation and basic quality attributes. Extensive experiments demonstrate that our models achieve reliable alignment with human judgments, offering both accuracy and efficiency. We release VGA-Bench as a public benchmark to foster research in AIGC evaluation, with applications in content moderation, model debugging, and generative model optimization.

[CV-223] PhyMix: Towards Physically Consistent Single-Image 3D Indoor Scene Generation with Implicit–Explicit Optimization

【速读】:该论文旨在解决现有单图像三维室内场景生成方法在物理一致性上的不足问题,即生成的场景虽视觉上合理但违背现实物理规律(如几何先验、接触关系、稳定性及可部署性),从而限制其在机器人、具身人工智能和设计等领域的可靠性。解决方案的关键在于提出一个统一的物理评估器(Physics Evaluator),用于量化测量四类核心物理约束(几何先验、接触、稳定性和可部署性)及其九个子约束,并基于此构建了一个闭环框架:通过Scene-GRPO实现隐式对齐(利用物理评估器作为偏好信号引导采样偏向物理可行布局),以及通过即插即用的测试时优化器(Test-Time Optimizer, TTO)进行显式修正(利用可微分评估信号校正生成过程中的残余违反项),从而在训练与推理阶段同时提升生成场景的物理合理性与视觉保真度。

链接: https://arxiv.org/abs/2604.10125
作者: Dongli Wu,Jingyu Hu,Ka-Hei Hui,Xiaobao Wei,Chengwen Luo,Jianqiang Li,Zhengzhe Liu
机构: Lingnan University (岭南大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); The Chinese University of Hong Kong (香港中文大学); Peking University (北京大学); Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing single-image 3D indoor scene generators often produce results that look visually plausible but fail to obey real-world physics, limiting their reliability in robotics, embodied AI, and design. To examine this gap, we introduce a unified Physics Evaluator that measures four main aspects: geometric priors, contact, stability, and deployability, which are further decomposed into nine sub-constraints, establishing the first benchmark to measure physical consistency. Based on this evaluator, our analysis shows that state-of-the-art methods remain largely physics-unaware. To overcome this limitation, we further propose a framework that integrates feedback from the Physics Evaluator into both training and inference, enhancing the physical plausibility of generated scenes. Specifically, we propose PhyMix, which is composed of two complementary components: (i) implicit alignment via Scene-GRPO, a critic-free group-relative policy optimization that leverages the Physics Evaluator as a preference signal and biases sampling towards physically feasible layouts, and (ii) explicit refinement via a plug-and-play Test-Time Optimizer (TTO) that uses differentiable evaluator signals to correct residual violations during generation. Overall, our method unifies evaluation, reward shaping, and inference-time correction, producing 3D indoor scenes that are visually faithful and physically plausible. Extensive synthetic evaluations confirm state-of-the-art performance in both visual fidelity and physical plausibility, and extensive qualitative examples in stylized and real-world images further showcase the robustness of the method. We will release codes and models upon publication.

[CV-224] A Dual Cross-Attention Graph Learning Framework For Multimodal MRI-Based Major Depressive Disorder Detection

【速读】:该论文旨在解决多模态磁共振成像(MRI)在重度抑郁症(Major Depressive Disorder, MDD)分类中有效融合结构磁共振成像(sMRI)与静息态功能磁共振成像(rs-fMRI)数据的难题。现有方法难以充分建模两种模态之间的复杂交互关系,导致分类性能受限。其解决方案的关键在于提出一种基于双交叉注意力(dual cross-attention)的多模态融合框架,该框架显式地建模sMRI与rs-fMRI表征之间的双向交互机制,从而提升跨模态信息整合能力。实验表明,该方法在REST-meta-MDD大规模数据集上显著优于传统的特征级拼接策略,尤其在功能脑图谱配置下表现最优,实现了84.71%的准确率、86.42%的敏感性等优异指标,验证了显式建模跨模态交互对MDD分类的重要性。

链接: https://arxiv.org/abs/2604.10116
作者: Nojod M. Alotaibi,Areej M. Alhothali
机构: King Abdulaziz University (国王阿卜杜勒阿齐兹大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19 pages, 1 figure

点击查看摘要

Abstract:Major depressive disorder (MDD) is a prevalent mental disorder associated with complex neurobiological changes that cannot be fully captured using a single imaging modality. The use of multimodal magnetic resonance imaging (MRI) provides a more comprehensive understanding of brain changes by combining structural and functional data. Despite this, the effective integration of these modalities remains challenging. In this study, we propose a dual cross-attention-based multimodal fusion framework that explicitly models bidirectional interactions between structural MRI (sMRI) and resting-state functional MRI (rs-fMRI) representations. The proposed approach is tested on the large-scale REST-meta-MDD dataset using both structural and functional brain atlas configurations. Numerous experiments conducted under a 10-fold stratified cross-validation demonstrated that the proposed fusion algorithm achieves robust and competitive performance across all atlas types. The proposed method consistently outperforms conventional feature-level concatenation for functional atlases, while maintaining comparable performance for structural atlases. The most effective dual cross-attention multimodal model obtained 84.71% accuracy, 86.42% sensitivity, 82.89% specificity, 84.34% precision, and 85.37% F1-score. These findings emphasize the importance of explicitly modeling cross-modal interactions for multimodal neuroimaging-based MDD classification.

[CV-225] Dual-Branch Remote Sensing Infrared Image Super-Resolution

【速读】:该论文旨在解决红外遥感图像超分辨率(Infrared Image Super-Resolution)问题,即从低分辨率输入中恢复更清晰的热成像观测结果,同时保持目标轮廓、场景布局和辐射稳定性。由于红外图像纹理稀疏且对局部锐化不稳定敏感,单纯依赖局部或全局建模效果有限,因此关键在于实现局部与全局信息的有效互补。解决方案的核心是提出一个双分支系统:一为基于HAT-L的局部强恢复分支,擅长捕捉细节;另一为基于MambaIRv2-L的全局稳定状态空间建模分支,保障整体结构一致性。通过测试时局部转换、八向自集成及固定权重图像空间融合策略,实验证明该方案在PSNR、SSIM和综合评分上均优于单一分支,验证了局部Transformer重建与全局状态空间建模之间显式互补性的有效性。

链接: https://arxiv.org/abs/2604.10112
作者: Xining Ge,Gengjia Chang,Weijun Yuan,Zhan Li,Zhanglu Chen,Boyang Yao,Yihang Chen,Yifan Deng,Shuhong Liu
机构: Hangzhou Dianzi University (杭州电子科技大学); Hefei University of Technology (合肥工业大学); Jinan University (暨南大学); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote sensing infrared image super-resolution aims to recover sharper thermal observations from low-resolution inputs while preserving target contours, scene layout, and radiometric stability. Unlike visible-image super-resolution, thermal imagery is weakly textured and more sensitive to unstable local sharpening, which makes complementary local and global modeling especially important. This paper presents our solution to the NTIRE 2026 Infrared Image Super-Resolution Challenge, a dual-branch system that combines a HAT-L branch and a MambaIRv2-L branch. The inference pipeline applies test-time local conversion on HAT, eight-way self-ensemble on MambaIRv2, and fixed equal-weight image-space fusion. We report both the official challenge score and a reproducible evaluation on 12 synthetic times-four thermal samples derived from Caltech Aerial RGB-Thermal, on which the fused output outperforms either single branch in PSNR, SSIM, and the overall Score. The results suggest that infrared super-resolution benefits from explicit complementarity between locally strong transformer restoration and globally stable state-space modeling.

[CV-226] VGGT-HPE: Reframing Head Pose Estimation as Relative Pose Prediction CVPR

【速读】:该论文旨在解决单目头姿态估计(Monocular Head Pose Estimation)中传统绝对回归方法的局限性,即网络需隐式学习数据集特定的参考坐标系,导致泛化能力受限且对复杂姿态鲁棒性差。其解决方案的关键在于提出一种相对姿态估计框架 VGGT-HPE,通过预测两个头部姿态之间的刚体变换(rigid transformation),将问题转化为几何位移估计任务,从而显式引入已知姿态的锚点(anchor),避免了对隐式参考系的依赖。该方法仅在合成人脸渲染数据上微调,无需真实世界训练数据,在 BIWI 基准测试中达到最先进性能,并通过控制难度的成对基准验证了相对预测在准确性上的本质优势。

链接: https://arxiv.org/abs/2604.10106
作者: Vasiliki Vasileiou,Panagiotis P. Filntisis,Petros Maragos,Kostas Daniilidis
机构: Archimedes, Athena Research Center, Marousi, Greece; HERON – Hellenic Robotics Center of Excellence, Athens, Greece; Robotics Institute, Athena Research Center, Marousi, Greece; School of ECE, National Technical University of Athens, Greece; University of Pennsylvania
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPRW 2026

点击查看摘要

Abstract:Monocular head pose estimation is traditionally formulated as direct regression from a single image to an absolute pose. This paradigm forces the network to implicitly internalize a dataset-specific canonical reference frame. In this work, we argue that predicting the relative rigid transformation between two observed head configurations is a fundamentally easier and more robust formulation. We introduce VGGT-HPE, a relative head pose estimator built upon a general-purpose geometry foundation model. Finetuned exclusively on synthetic facial renderings, our method sidesteps the need for an implicit anchor by reducing the problem to estimating a geometric displacement from an explicitly provided anchor with a known pose. As a practical benefit, the relative formulation also allows the anchor to be chosen at test time - for instance, a near-neutral frame or a temporally adjacent one - so that the prediction difficulty can be controlled by the application. Despite zero real-world training data, VGGT-HPE achieves state-of-the-art results on the BIWI benchmark, outperforming established absolute regression methods trained on mixed and real datasets. Through controlled easy- and hard-pair benchmarks, we also systematically validate our core hypothesis: relative prediction is intrinsically more accurate than absolute regression, with the advantage scaling alongside the difficulty of the target pose. Project page and code: this https URL

[CV-227] Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

【速读】:该论文旨在解决流式视频生成(Streaming Video Generation, SVG)中两个核心问题:一是滑动窗口注意力(Sliding Window Attention, SWA)在长视频生成过程中不可避免地丢失远距离历史信息,二是SWA带来的计算开销限制了实时部署。解决方案的关键在于提出混合强制(Hybrid Forcing)机制,通过一种混合注意力设计联合优化时间信息保留与计算效率:首先引入轻量级线性时间注意力(linear temporal attention),以紧凑的键值状态增量吸收被滑动窗口淘汰的token,从而在几乎无额外内存和计算负担下保留长程依赖;其次,在局部滑动窗口内嵌入块稀疏注意力(block-sparse attention),减少短程建模中的冗余计算,并将资源重新分配给更关键的依赖关系;最后,采用解耦蒸馏策略,在初始阶段使用密集注意力进行少量步数蒸馏,随后激活所提出的线性时间和块稀疏注意力用于流式建模,确保优化稳定性。实验表明,该方法在短/长视频生成任务上均达到SOTA性能,并实现了单张NVIDIA H100 GPU上无需量化或压缩即可实时生成832×480分辨率视频(29.5 FPS)。

链接: https://arxiv.org/abs/2604.10103
作者: Ruibin Li,Tao Yang,Fangzhou Ai,Tianhe Wu,Shilei Wen,Bingyue Peng,Lei Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Streaming video generation (SVG) distills a pretrained bidirectional video diffusion model into an autoregressive model equipped with sliding window attention (SWA). However, SWA inevitably loses distant history during long video generation, and its computational overhead remains a critical challenge to real-time deployment. In this work, we propose Hybrid Forcing, which jointly optimizes temporal information retention and computational efficiency through a hybrid attention design. First, we introduce lightweight linear temporal attention to preserve long-range dependencies beyond the sliding window. In particular, we maintain a compact key-value state to incrementally absorb evicted tokens, retaining temporal context with negligible memory and computational overhead. Second, we incorporate block-sparse attention into the local sliding window to reduce redundant computation within short-range modeling, reallocating computational capacity toward more critical dependencies. Finally, we introduce a decoupled distillation strategy tailored to the hybrid attention design. A few-step initial distillation is performed under dense attention, then the distillation of our proposed linear temporal and block-sparse attention is activated for streaming modeling, ensuring stable optimization. Extensive experiments on both short- and long-form video generation benchmarks demonstrate that Hybrid Forcing consistently achieves state-of-the-art performance. Notably, our model achieves real-time, unbounded 832x480 video generation at 29.5 FPS on a single NVIDIA H100 GPU without quantization or model compression. The source code and trained models are available at this https URL.

[CV-228] Degradation-Consistent Paired Training for Robust AI-Generated Image Detection

【速读】:该论文旨在解决生成式 AI (Generative AI) 图像检测器在真实世界图像退化(如JPEG压缩、高斯模糊和分辨率下采样)条件下性能显著下降的问题。现有方法(如B-Free)将退化鲁棒性视为数据增强的副产品,而非显式的训练目标,导致鲁棒性不足。解决方案的关键在于提出退化一致配对训练(Degradation-Consistent Paired Training, DCPT),通过为每张训练图像构建干净视图与退化视图,并施加两个约束:特征一致性损失(最小化干净与退化表示间的余弦距离)和预测一致性损失(基于对称KL散度对齐两视图的输出分布),从而显式地强化模型对退化的鲁棒性。DCPT不引入额外参数或推理开销,在Synthbuster基准测试中使退化条件下的平均准确率提升9.1个百分点,且仅牺牲0.9%的干净图像准确率,尤其在JPEG压缩下提升达15.7%–17.9%。

链接: https://arxiv.org/abs/2604.10102
作者: Zongyou Yang,Yinghan Hou,Xiaokun Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 5 figures, 2 tables

点击查看摘要

Abstract:AI-generated image detectors suffer significant performance degradation under real-world image corruptions such as JPEG compression, Gaussian blur, and resolution downsampling. We observe that state-of-the-art methods, including B-Free, treat degradation robustness as a byproduct of data augmentation rather than an explicit training objective. In this work, we propose Degradation-Consistent Paired Training (DCPT), a simple yet effective training strategy that explicitly enforces robustness through paired consistency constraints. For each training image, we construct a clean view and a degraded view, then impose two constraints: a feature consistency loss that minimizes the cosine distance between clean and degraded representations, and a prediction consistency loss based on symmetric KL divergence that aligns output distributions across views. DCPT adds zero additional parameters and zero inference overhead. Experiments on the Synthbuster benchmark (9 generators, 8 degradation conditions) demonstrate that DCPT improves the degraded-condition average accuracy by 9.1 percentage points compared to an identical baseline without paired training, while sacrificing only 0.9% clean accuracy. The improvement is most pronounced under JPEG compression (+15.7% to +17.9%). Ablation further reveals that adding architectural components leads to overfitting on limited training data, confirming that training objective improvement is more effective than architectural augmentation for degradation robustness.

[CV-229] ABot-Claw: A Foundation for Persistent Cooperative and Self-Evolving Robotic Agents

【速读】:该论文旨在解决当前具身智能系统在开放世界环境中存在的高阶推理与低阶物理执行之间的显著差距问题,尤其是在长时程、多机器人协作任务中,现有视觉-语言-动作(Vision-Language-Action, VLA)模型因开环特性导致性能受限,而依赖预设工具集的系统又缺乏对真实环境的灵活控制能力。解决方案的关键在于提出ABot-Claw,一个基于OpenClaw的具身扩展框架,其核心创新包括:1)统一的具身接口与能力驱动调度机制,实现异构机器人协同;2)以视觉为中心的跨具身多模态记忆,保障上下文持久保留与具身检索;3)基于评价器(critic)的闭环反馈机制,结合通用奖励模型实现在线进展评估、局部修正与重规划,从而打通从自然语言意图到物理动作的闭环路径,并支持动态环境中机器人代理的持续自演化。

链接: https://arxiv.org/abs/2604.10096
作者: Dongjie Huo,Haoyun Liu,Guoqing Liu,Dekang Qi,Zhiming Sun,Maoguo Gao,Jianxin He,Yandan Yang,Xinyuan Chang,Feng Xiong,Xing Wei,Zhiheng Ma,Mu Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current embodied intelligent systems still face a substantial gap between high-level reasoning and low-level physical execution in open-world environments. Although Vision-Language-Action (VLA) models provide strong perception and intuitive responses, their open-loop nature limits long-horizon performance. Agents incorporating System 2 cognitive mechanisms improve planning, but usually operate in closed sandboxes with predefined toolkits and limited real-system control. OpenClaw provides a localized runtime with full system privileges, but lacks the embodied control architecture required for long-duration, multi-robot execution. We therefore propose ABot-Claw, an embodied extension of OpenClaw that integrates: 1) a unified embodiment interface with capability-driven scheduling for heterogeneous robot coordination; 2) a visual-centric cross-embodiment multimodal memory for persistent context retention and grounded retrieval; and 3) a critic-based closed-loop feedback mechanism with a generalist reward model for online progress evaluation, local correction, and replanning. With a decoupled architecture spanning the OpenClaw layer, shared service layer, and robot embodiment layer, ABot-Claw enables real-world interaction, closes the loop from natural language intent to physical action, and supports progressively self-evolving robotic agents in open, dynamic environments.

[CV-230] Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models

【速读】:该论文旨在解决3D基础模型微调中LoRA(Low-Rank Adaptation)子空间的可解释性与有效性问题,具体关注不同数据变异类型(如纹理、几何、相机运动和光照变化)是否对应独立的LoRA子空间,以及这些子空间是否正交解耦,并探索高效计算方法。其解决方案的关键在于构建可控变异的合成数据集,针对每类变化单独微调LoRA适配器以提取对应的子空间,进而发现这些子空间近似正交;通过整合这些子空间形成一个压缩后的LoRA子空间,显著提升下游任务的预测精度与微调效率,且该压缩子空间虽完全基于合成数据训练,仍能泛化至真实数据集。

链接: https://arxiv.org/abs/2604.10095
作者: Yu Jiang,Hanwen Jiang,Ahmed Abdelkader,Wen-Sheng Chu,Brandon Y. Feng,Zhangyang Wang,Qixing Huang
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); Shanghai Jiao Tong University (上海交通大学); Adobe Research (Adobe研究院); Google Research (谷歌研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 8 figures

点击查看摘要

Abstract:With the emergence of 3D foundation models, there is growing interest in fine-tuning them for downstream tasks, where LoRA is the dominant fine-tuning paradigm. As 3D datasets exhibit distinct variations in texture, geometry, camera motion, and lighting, there are interesting fundamental questions: 1) Are there LoRA subspaces associated with each type of variation? 2) Are these subspaces disentangled (i.e., orthogonal to each other)? 3) How do we compute them effectively? This paper provides answers to all these questions. We introduce a robust approach that generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, and extracts a LoRA sub-space associated with each type of variation. We show that these subspaces are approximately disentangled. Integrating them leads to a reduced LoRA subspace that enables efficient LoRA fine-tuning with improved prediction accuracy for downstream tasks. In particular, we show that such a reduced LoRA subspace, despite being derived entirely from synthetic data, generalizes to real datasets. An ablation study validates the effectiveness of the choices in our approach.

[CV-231] Global monitoring of methane point sources using deep learning on hyperspectral radiance measurements from EMIT

【速读】:该论文旨在解决人为甲烷(CH₄)点源排放监测中检测灵敏度低、依赖人工识别效率低下以及难以处理多重重叠烟羽的问题。现有基于空间成像光谱技术的方法主要依靠人工判读,限制了全球尺度上对甲烷泄漏源的高效识别与量化。解决方案的关键在于提出一种端到端的视觉Transformer模型——Methane Analysis and Plume Localization with EMIT (MAPL-EMIT),该模型充分利用地球表面矿物尘埃源调查仪器(EMIT)提供的完整辐射光谱数据,通过联合检索场景内所有像素的甲烷增强信号,融合光谱与空间上下文信息以显著降低检测限。该方法可同时实现增强量化、烟羽边界划分和排放源定位,且在合成与真实世界基准测试中均表现出高召回率与精确度,优于传统匹配滤波法,并能识别出人类分析师未发现的弱排放源,从而推动甲烷监测从劳动密集型流程向高通量、可扩展的全球设施级烟羽制图范式转变。

链接: https://arxiv.org/abs/2604.10094
作者: Vishal V. Batchu,Michelangelo Conserva,Alex Wilson,Anna M. Michalak,Varun Gulshan,Philip G. Brodrick,Andrew K. Thorpe,Christopher V. Arsdale
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
备注: 43 pages, 27 figures, 4 tables

点击查看摘要

Abstract:Anthropogenic methane (CH4) point sources drive near-term climate forcing, safety hazards, and system inefficiencies. Space-based imaging spectroscopy is emerging as a tool for identifying emissions globally, but existing approaches largely rely on manual plume identification. Here we present the Methane Analysis and Plume Localization with EMIT (MAPL-EMIT) model, an end-to-end vision transformer framework that leverages the complete radiance spectrum from the Earth Surface Mineral Dust Source Investigation (EMIT) instrument to jointly retrieve methane enhancements across all pixels within a scene. This approach brings together spectral and spatial context to significantly lower detection limits. MAPL-EMIT simultaneously supports enhancement quantification, plume delineation, and source localization, even for multiple overlapping plumes. The model was trained on 3.6 million physics-based synthetic plumes injected into global EMIT radiance data. Synthetic evaluation confirms the model’s ability to identify plumes with high recall and precision and to capture weaker plumes relative to existing matched-filter approaches. On real-world benchmarks, MAPL-EMIT captures 79% of known hand-annotated NASA L2B plume complexes across a test set of 1084 EMIT granules, while capturing twice as many plausible plumes than identified by human analysts. Further validation against coincident airborne data, top-emitting landfills, and controlled release experiments confirms the model’s ability to identify previously uncaptured sources. By incorporating model-generated metrics such as spectral fit scores and estimated noise levels, the framework can further limit false-positive rates. Overall, MAPL-EMIT enables high-throughput implementation on the full EMIT catalog, shifting methane monitoring from labor-intensive workflows to a rapid, scalable paradigm for global plume mapping at the facility scale.

[CV-232] Particle Diffusion Matching: Random Walk Correspondence Search for the Alignment of Standard and Ultra-Widefield Fundus Images

【速读】:该论文旨在解决标准广角眼底图像(Standard Fundus Images, SFIs)与超广角眼底图像(Ultra-Widefield Fundus Images, UWFIs)之间难以对齐的问题,其挑战源于两者在尺度、外观差异以及显著特征稀缺等方面的不一致性。解决方案的关键在于提出了一种名为粒子扩散匹配(Particle Diffusion Matching, PDM)的方法,该方法通过由扩散模型引导的迭代随机游走对应搜索(Random Walk Correspondence Search, RWCS)实现精准配准:每轮迭代中,模型综合考虑局部外观、粒子分布结构及估计的全局变换来预测粒子点位移向量,从而实现对应关系的逐步优化,即使在复杂条件下也能获得稳定收敛。PDM 在多个视网膜图像对齐基准上达到最先进性能,尤其在SFI-UWFI配对数据集上表现突出,验证了其在真实临床场景中的有效性与可扩展性。

链接: https://arxiv.org/abs/2604.10085
作者: Kanggeon Lee,Soochahn Lee,Kyoung Mu Lee
机构: Seoul National University (首尔国立大学); Kookmin University (酷克敏大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a robust alignment technique for Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs), which are challenging to align due to differences in scale, appearance, and the scarcity of distinctive features. Our method, termed Particle Diffusion Matching (PDM), performs alignment through an iterative Random Walk Correspondence Search (RWCS) guided by a diffusion model. At each iteration, the model estimates displacement vectors for particle points by considering local appearance, the structural distribution of particles, and an estimated global transformation, enabling progressive refinement of correspondences even under difficult conditions. PDM achieves state-of-the-art performance across multiple retinal image alignment benchmarks, showing substantial improvement on a primary dataset of SFI-UWFI pairs and demonstrating its effectiveness in real-world clinical scenarios. By providing accurate and scalable correspondence estimation, PDM overcomes the limitations of existing methods and facilitates the integration of complementary retinal image modalities. This diffusion-guided search strategy offers a new direction for improving downstream supervised learning, disease diagnosis, and multi-modal image analysis in ophthalmology.

[CV-233] Active Diffusion Matching: Score-based Iterative Alignment of Cross-Modal Retinal Images

【速读】:该论文旨在解决标准眼底图像(Standard Fundus Images, SFIs)与超广角眼底图像(Ultra-Widefield Fundus Images, UWFIs)之间的跨模态对齐难题,该问题因两者视场范围差异显著及视网膜形态不规则而难以实现精确配准。现有图像对齐方法在该任务中表现不足,缺乏专门针对此类模态差异的解决方案。论文提出了一种名为主动扩散匹配(Active Diffusion Matching, ADM)的新方法,其核心在于引入两个相互依赖的基于得分(score-based)的扩散模型,通过迭代朗之万马尔可夫链(Langevin Markov chain)联合估计全局变换与局部形变,从而实现一种随机、渐进式的最优对齐搜索过程;同时设计了定制采样策略以增强ADM对给定图像对的适应能力。实验表明,ADM在私有和公开数据集上均达到当前最优对齐精度,mAUC提升分别达5.2和0.4点,验证了其在跨模态眼底图像对齐中的有效性与先进性。

链接: https://arxiv.org/abs/2604.10084
作者: Kanggeon Lee,Su Jeong Song,Soochahn Lee,Kyoung Mu Lee
机构: Seoul National University (首尔国立大学); Sungkyunkwan University (成均馆大学); Kookmin University (酷似明大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Objective: The study aims to address the challenge of aligning Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs), which is difficult due to their substantial differences in viewing range and the amorphous appearance of the retina. Currently, no specialized method exists for this task, and existing image alignment techniques lack accuracy. Methods: We propose Active Diffusion Matching (ADM), a novel cross-modal alignment method. ADM integrates two interdependent score-based diffusion models to jointly estimate global transformations and local deformations via an iterative Langevin Markov chain. This approach facilitates a stochastic, progressive search for optimal alignment. Additionally, custom sampling strategies are introduced to enhance the adaptability of ADM to given input image pairs. Results: Comparative experimental evaluations demonstrate that ADM achieves state-of-the-art alignment accuracy. This was validated on two datasets: a private dataset of SFI-UWFI pairs and a public dataset of SFI-SFI pairs, with mAUC improvements of 5.2 and 0.4 points on the private and public datasets, respectively, compared to existing state-of-the-art methods. Conclusion: ADM effectively bridges the gap in aligning SFIs and UWFIs, providing an innovative solution to a previously unaddressed challenge. The method’s ability to jointly optimize global and local alignment makes it highly effective for cross-modal image alignment tasks. Significance: ADM has the potential to transform the integrated analysis of SFIs and UWFIs, enabling better clinical utility and supporting learning-based image enhancements. This advancement could significantly improve diagnostic accuracy and patient outcomes in ophthalmology. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2604.10084 [cs.CV] (or arXiv:2604.10084v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.10084 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Kanggeon Lee [view email] [v1] Sat, 11 Apr 2026 08:06:28 UTC (14,270 KB) Full-text links: Access Paper: View a PDF of the paper titled Active Diffusion Matching: Score-based Iterative Alignment of Cross-Modal Retinal Images, by Kanggeon Lee and 3 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.CV prev | next new | recent | 2026-04 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[CV-234] MatRes: Zero-Shot Test-Time Model Adaptation for Simultaneous Matching and Restoration

【速读】:该论文旨在解决真实世界图像对中同时存在严重退化(degradation)和大视角变化(large viewpoint changes)时,图像修复(image restoration)与几何匹配(geometric matching)任务相互干扰的问题。传统方法若独立处理这两项任务,往往难以协同优化。其解决方案的关键在于提出一种零样本测试时自适应框架 MatRes,通过在对应位置强制条件相似性(conditional similarity),仅更新轻量级模块而冻结所有预训练组件,从而在无需离线训练或额外监督的情况下,联合提升修复质量与对应点估计精度,有效缓解二者间的相互干扰。

链接: https://arxiv.org/abs/2604.10081
作者: Kanggeon Lee,Soochahn Lee,Kyoung Mu Lee
机构: Seoul National University (首尔国立大学); Kookmin University (酷似大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-world image pairs often exhibit both severe degradations and large viewpoint changes, making image restoration and geometric matching mutually interfering tasks when treated independently. In this work, we propose MatRes, a zero-shot test-time adaptation framework that jointly improves restoration quality and correspondence estimation using only a single low-quality and high-quality image pair. By enforcing conditional similarity at corresponding locations, MatRes updates only lightweight modules while keeping all pretrained components frozen, requiring no offline training or additional supervision. Extensive experiments across diverse combinations show that MatRes yields significant gains in both restoration and geometric alignment compared to using either restoration or matching models alone. MatRes offers a practical and widely applicable solution for real-world scenarios where users commonly capture multiple images of a scene with varying viewpoints and quality, effectively addressing the often-overlooked mutual interference between matching and restoration.

[CV-235] Attention-Guided Dual-Stream Learning for Group Engagement Recognition: Fusing Transformer-Encoded Motion Dynamics with Scene Context via Adaptive Gating

【速读】:该论文旨在解决课堂环境中群体层面学习参与度(group-level engagement)自动识别的难题,现有方法多聚焦于在线教学场景或个体级参与度估计,难以捕捉群体动态与个体行为之间的协同关系。其解决方案的关键在于提出DualEngage框架,采用双流结构:主支路通过检测与追踪学生、提取密集光流并利用Transformer编码时序运动模式,结合注意力池化生成个体表征;辅支路则基于预训练3D残差网络捕获场景级时空信息;两支路特征通过softmax门控融合机制动态加权整合,从而学习个体动作与群体动态的联合表示,显著提升了群体参与度识别的准确性。

链接: https://arxiv.org/abs/2604.10078
作者: Saniah Kayenat Chowdhury,Muhammad E.H. Chowdhury
机构: Qatar University (卡塔尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Student engagement is crucial for improving learning outcomes in group activities. Highly engaged students perform better both individually and contribute to overall group success. However, most existing automated engagement recognition methods are designed for online classrooms or estimate engagement at the individual level. Addressing this gap, we propose DualEngage, a novel two-stream framework for group-level engagement recognition from in-classroom videos. It models engagement as a joint function of both individual and group-level behaviors. The primary stream models person-level motion dynamics by detecting and tracking students, extracting dense optical flow with the Recurrent All-Pairs Field Transforms network, encoding temporal motion patterns using a transformer encoder, and finally aggregating per-student representations through attention pooling into a unified representation. The secondary stream captures scene-level spatiotemporal information from the full video clip, leveraging a pretrained three-dimensional Residual Network. The two-stream representations are combined via softmax-gated fusion, which dynamically weights each stream’s contribution based on the joint context of both features. DualEngage learns a joint representation of individual actions with overarching group dynamics. We evaluate the proposed approach using fivefold cross-validation on the Classroom Group Engagement Dataset developed by Ocean University of China, achieving an average classification accuracy of 0.9621+/-0.0161 with a macro-averaged F1 of 0.9530+/-0.0204. To understand the contribution of each branch, we further conduct an ablation study comparing single-stream variants against the two-stream model. This work is among the first in classroom engagement recognition to adopt a dual-stream design that explicitly leverages motion cues as an estimator.

[CV-236] DocRevive: A Unified Pipeline for Document Text Restoration

【速读】:该论文旨在解决文档理解中受损、遮挡或不完整文本的重建问题,这一问题对后续文档理解任务构成挑战且尚未得到充分研究。解决方案的关键在于提出一个统一的端到端流水线,融合先进的光学字符识别(OCR)、图像分析、掩码语言建模与基于扩散模型的修复技术,在保持视觉一致性的前提下实现语义连贯的文本重建。该方法通过一个合成的30,078张退化文档图像数据集进行训练和评估,并引入统一上下文相似性度量(UCSM)以量化重建质量,从而推动文档恢复领域的技术进步并为数字保存提供新标准。

链接: https://arxiv.org/abs/2604.10077
作者: Kunal Purkayastha,Ayan Banerjee,Josep Llados,Umapada Pal
机构: Computer Vision Center (计算机视觉中心); Indian Statistical Institute (印度统计研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In Document Understanding, the challenge of reconstructing damaged, occluded, or incomplete text remains a critical yet unexplored problem. Subsequent document understanding tasks can benefit from a document reconstruction process. In response, this paper presents a novel unified pipeline combining state-of-the-art Optical Character Recognition (OCR), advanced image analysis, masked language modeling, and diffusion-based models to restore and reconstruct text while preserving visual integrity. We create a synthetic dataset of 30,078 degraded document images that simulates diverse document degradation scenarios, setting a benchmark for restoration tasks. Our pipeline detects and recognizes text, identifies degradation with an occlusion detector, and uses an inpainting model for semantically coherent reconstruction. A diffusion-based module seamlessly reintegrates text, matching font, size, and alignment. To evaluate restoration quality, we propose a Unified Context Similarity Metric (UCSM), incorporating edit, semantic, and length similarities with a contextual predictability measure that penalizes deviations when the correct text is contextually obvious. Our work advances document restoration, benefiting archival research and digital preservation while setting a new standard for text reconstruction. The OPRB dataset and code are available at \hrefthis https URLHugging Face and \hrefthis https URLGithub respectively.

[CV-237] Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation ACL2026

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在生成文本时存在的幻觉问题,即模型输出内容与输入视觉信息不一致的现象。其解决方案的关键在于提出了一种名为双锚点自省解码(Dual-Anchor Introspective Decoding, DaID)的对比解码框架,通过动态校准每个词元(token)的生成过程来实现。该方法创新性地利用模型内部感知差异,识别出两个关键层:一个“亮点层”(Spotlight layer)用于增强视觉事实信号,另一个“阴影层”(Shadow layer)用于抑制文本惯性;并通过视觉注意力分布指导双锚点的选择,从而实现对每个词元的精准、特定适应,显著降低幻觉并提升整体推理能力。

链接: https://arxiv.org/abs/2604.10071
作者: Yebo Wu,Han Jin,Zhijiang Guo,Li Li
机构: State Key Laboratory of IOTSC, University of Macau; Hong Kong University of Science and Technology (Guangzhou)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for Findings of ACL 2026

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable reasoning capabilities yet continue to suffer from hallucination, where generated text contradicts visual content. In this paper, we introduce Dual-Anchor Introspective Decoding (DaID), a novel contrastive decoding framework that dynamically calibrates each token generation by mining the model’s internal perceptual discrepancies. Specifically, DaID identifies a Spotlight layer to amplify visual factual signals and a Shadow layer to suppress textual inertia. By leveraging visual attention distributions to guide this dual-anchor selection process, our method ensures precise, token-specific adaptation. Experimental results across multiple benchmarks and MLLMs demonstrate that DaID significantly mitigates hallucination while enhancing general reasoning capabilities.

[CV-238] On The Application of Linear Attention in Multimodal Transformers CVPR2026

【速读】:该论文旨在解决多模态Transformer模型中注意力机制的二次计算复杂度问题(quadratic attention complexity),这一瓶颈限制了模型在大规模数据上的可扩展性。其解决方案的关键在于引入线性注意力机制(Linear Attention, LA),通过将注意力计算从与序列长度平方相关的复杂度降低至线性复杂度,同时保持与标准softmax注意力相当的性能表现,从而实现高效且可扩展的多模态建模。

链接: https://arxiv.org/abs/2604.10064
作者: Armin Gerami,Seyedehanita Madani,Ramani Duraiswami
机构: University of Maryland, Department of Computer Science and UMIACS; Johns Hopkins University, Department of Electrical and Computer Engineering
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Workshop on Any-to-Any Multimodal Learning (Any2Any), CVPR 2026

点击查看摘要

Abstract:Multimodal Transformers serve as the backbone for state-of-the-art vision-language models, yet their quadratic attention complexity remains a critical barrier to scalability. In this work, we investigate the viability of Linear Attention (LA) as a high-efficiency alternative within multimodal frameworks. By integrating LA, we reduce the computational overhead from quadratic to linear relative to sequence length while preserving competitive performance. We evaluate our approach across ViT-S/16, ViT-B/16, and ViT-L/16 architectures trained on the LAION-400M dataset, with validation focused on ImageNet-21K zero-shot accuracy. Our systematic evaluation demonstrates that Linear Attention not only yields significant computational savings but also adheres to the same scaling laws as standard softmax attention. These findings position Linear Attention as a robust, scalable solution for next-generation multimodal Transformers tasked with processing increasingly large and complex datasets.

[CV-239] U2Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation CVPR2026

【速读】:该论文旨在解决无监督光流估计方法中缺乏可靠不确定性估计的问题,从而提升模型的鲁棒性和可解释性。其解决方案的关键在于提出U² Flow框架,该框架首次将光流与像素级不确定性联合估计引入递归无监督学习范式;核心创新是采用解耦学习策略,通过基于拉普拉斯分布的最大似然目标函数从数据增强一致性中自动获得不确定性监督信号,实现无需真实标签的稳定训练;同时,预测的不确定性被进一步嵌入网络结构中,用于引导自适应光流精修并动态调节区域平滑损失,还设计了不确定性引导的双向光流融合机制以增强复杂场景下的鲁棒性。

链接: https://arxiv.org/abs/2604.10056
作者: Xunpei Sun,Wenwei Lin,Yi Chang,Gang Chen
机构: Sun Yat-sen University (中山大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as an oral presentation at CVPR 2026

点击查看摘要

Abstract:Unsupervised optical flow methods typically lack reliable uncertainty estimation, limiting their robustness and interpretability. We propose U ^2 Flow, the first recurrent unsupervised framework that jointly estimates optical flow and per-pixel uncertainty. The core innovation is a decoupled learning strategy that derives uncertainty supervision from augmentation consistency via a Laplace-based maximum likelihood objective, enabling stable training without ground truth. The predicted uncertainty is further integrated into the network to guide adaptive flow refinement and dynamically modulate the regional smoothness loss. Furthermore, we introduce an uncertainty-guided bidirectional flow fusion mechanism that enhances robustness in challenging regions. Extensive experiments on KITTI and Sintel demonstrate that U ^2 Flow achieves state-of-the-art performance among unsupervised methods while producing highly reliable uncertainty maps, validating the effectiveness of our joint estimation paradigm. The code is available at this https URL.

[CV-240] Intra-finger Variability of Diffusion-based Latent Fingerprint Generation CVPR2026

【速读】:该论文旨在系统评估基于先进扩散模型生成的合成指纹(尤其是潜指纹)在指间变异性的表现,核心问题在于现有合成指纹生成模型在多样性与身份一致性之间难以平衡。解决方案的关键在于构建一个涵盖七种不同数据集的潜指纹风格库(latent style bank),从而实现对超过40种不同表面和处理技术风格的精准合成,并结合半自动化框架分析生成指纹中纹线与特征点(minutiae)的完整性,揭示局部不一致性和全局幻觉纹路现象,为提升生成模型的可靠性提供依据。

链接: https://arxiv.org/abs/2604.10040
作者: Noor Hussein,Anil K. Jain,Karthik Nandakumar
机构: Michigan State University (密歇根州立大学); MBZUAI (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 2nd Workshop on Foundation and Generative Models in Biometrics (FoundGen-Bio), held in conjunction with CVPR 2026

点击查看摘要

Abstract:The primary goal of this work is to systematically evaluate the intra-finger variability of synthetic fingerprints (particularly latent prints) generated using a state-of-the-art diffusion model. Specifically, we focus on enhancing the latent style diversity of the generative model by constructing a comprehensive \textitlatent style bank curated from seven diverse datasets, which enables the precise synthesis of latent prints with over 40 distinct styles encapsulating different surfaces and processing techniques. We also implement a semi-automated framework to understand the integrity of fingerprint ridges and minutiae in the generated impressions. Our analysis indicates that though the generation process largely preserves the identity, a small number of local inconsistencies (addition and removal of minutiae) are introduced, especially when there are poor quality regions in the reference image. Furthermore, mismatch between the reference image and the chosen style embedding that guides the generation process introduces global inconsistencies in the form of hallucinated ridge patterns. These insights highlight the limitations of existing synthetic fingerprint generators and the need to further improve these models to simultaneously enhance both diversity and identity consistency.

[CV-241] Counting to Four is still a Chore for VLMs

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在简单对象计数任务上表现不佳的问题,尤其关注其在多模态推理过程中视觉证据被弱化或忽视的机制性原因。现有评估方法仅关注最终输出,难以揭示模型内部失败的具体环节。研究通过引入COUNTINGTRICKS这一受控评估套件,系统分析了不同图像分块布局和对抗性提示条件下VLM的计数行为,并结合注意力分析与组件探针技术发现:计数相关的视觉信息在模态投影阶段最强,但在后续语言层中显著衰减,导致模型更易受文本先验干扰。基于此发现,论文提出轻量级干预策略“模态注意力共享”(Modality Attention Share, MAS),强制在生成答案时保留最小程度的视觉注意力预算,从而提升模型对视觉证据的利用效率。关键在于识别并纠正VLM在语言阶段对视觉信息的低效使用问题,而非单纯增强视觉感知能力。

链接: https://arxiv.org/abs/2604.10039
作者: Duy Le Dinh Anh,Patrick Amadeus Irawan,Tuan Van Vo
机构: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision–language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mostly assess only final outputs, offering limited insight into where these failures arise inside the model. In this work, we present an empirical study of VLM counting behavior through both behavioral and mechanistic analysis. We introduce COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases designed to expose vulnerabilities under different patchification layouts and adversarial prompting conditions. Using attention analysis and component-wise probing, we show that count-relevant visual evidence is strongest in the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by this finding, we further evaluate Modality Attention Share (MAS), a lightweight intervention that encourages a minimum budget of visual attention during answer generation. Our results suggest that counting failures in VLMs stem not only from visual perception limits, but also from the underuse of visual evidence during language-stage reasoning. Code and dataset will be released at this https URL.

[CV-242] Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation

【速读】:该论文旨在解决视频扩散模型在生成多事件视频时存在的时序控制不足问题,具体表现为难以精确控制语义概念的出现时机、持续时长及多个事件的先后顺序,导致文本-视频对齐效果差和语义混淆(semantic entanglement)。解决方案的关键在于提出一种推理阶段的“提示接力”(Prompt Relay)方法,通过在交叉注意力机制中引入惩罚项,强制每个时间片段仅关注其对应的提示内容,从而实现单时刻仅表达一个语义概念,显著提升时序提示对齐精度、减少语义干扰并改善视觉质量。该方法无需架构修改且无额外计算开销,具备良好的实用性与扩展性。

链接: https://arxiv.org/abs/2604.10030
作者: Gordon Chen,Ziqi Huang,Ziwei Liu
机构: Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video diffusion models have achieved remarkable progress in generating high-quality videos. However, these models struggle to represent the temporal succession of multiple events in real-world videos and lack explicit mechanisms to control when semantic concepts appear, how long they persist, and the order in which multiple events occur. Such control is especially important for movie-grade video synthesis, where coherent storytelling depends on precise timing, duration, and transitions between events. When using a single paragraph-style prompt to describe a sequence of complex events, models often exhibit semantic entanglement, where concepts intended for different moments in the video bleed into one another, resulting in poor text-video alignment. To address these limitations, we propose Prompt Relay, an inference-time, plug-and-play method to enable fine-grained temporal control in multi-event video generation, requiring no architectural modifications and no additional computational overhead. Prompt Relay introduces a penalty into the cross-attention mechanism, so that each temporal segment attends only to its assigned prompt, allowing the model to represent one semantic concept at a time and thereby improving temporal prompt alignment, reducing semantic interference, and enhancing visual quality.

[CV-243] SinkTrack: Attention Sink based Context Anchoring for Large Language Models ICLR2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的幻觉(hallucination)和上下文遗忘(context forgetting)问题。研究表明,注意力漂移(attention drift)是导致这些问题的主要原因,即模型在生成过程中逐渐将注意力从初始输入上下文转移至新生成的token上。解决方案的关键在于利用LLM固有的“注意力锚点”(attention sink)特性——即模型始终对序列的第一个token(BOS)保持高关注度。作者提出了一种无需训练、可插拔的上下文锚定方法SinkTrack,通过将关键上下文信息(如图像特征或指令内容)注入BOS token的表示中,使模型在整个生成过程中持续锚定于初始输入上下文,从而有效缓解幻觉与遗忘现象。实验表明,该方法在文本和多模态任务中均显著提升性能,且适用于不同架构与规模的模型,展现出良好的通用性与鲁棒性。

链接: https://arxiv.org/abs/2604.10027
作者: Xu Liu,Guikun Chen,Wenguan Wang
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026. Code: this https URL

点击查看摘要

Abstract:Large language models (LLMs) suffer from hallucination and context forgetting. Prior studies suggest that attention drift is a primary cause of these problems, where LLMs’ focus shifts towards newly generated tokens and away from the initial input context. To counteract this, we make use of a related, intrinsic characteristic of LLMs: attention sink – the tendency to consistently allocate high attention to the very first token (i.e., BOS) of a sequence. Concretely, we propose an advanced context anchoring method, SinkTrack, which treats BOS as an information anchor and injects key contextual features (such as those derived from the input image or instruction) into its representation. As such, LLM remains anchored to the initial input context throughout the entire generation process. SinkTrack is training-free, plug-and-play, and introduces negligible inference overhead. Experiments demonstrate that SinkTrack mitigates hallucination and context forgetting across both textual (e.g., +21.6% on SQuAD2.0 with Llama3.1-8B-Instruct) and multi-modal (e.g., +22.8% on M3CoT with Qwen2.5-VL-7B-Instruct) tasks. Its consistent gains across different architectures and scales underscore the robustness and generalizability. We also analyze its underlying working mechanism from the perspective of information delivery. Our source code is available at this https URL.

[CV-244] LVSum: A Benchmark for Timestamp-Aware Long Video Summarization

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在长视频摘要生成中面临的两大核心挑战:一是难以保持长时间跨度下的时序一致性(temporal fidelity),二是生成的摘要缺乏语义与时间上的精确锚定(semantically and temporally grounded)。为应对这些问题,作者提出了LVSum——一个专为长视频摘要设计的人工标注基准数据集,包含13个领域、多样化的长视频及其带有精确时间标记的人工生成摘要。其关键创新在于通过细粒度的时间对齐标注和新型基于大语言模型(LLM-based)的评估指标(用于内容相关性和模态一致性),系统性地量化了现有MLLMs在时序理解上的缺陷,并为未来提升长视频摘要中的时序推理能力奠定了实证基础。

链接: https://arxiv.org/abs/2604.10024
作者: Alkesh Patel,Melis Ozyildirim,Ying-Chang Cheng,Ganesh Nagarajan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 5 tables, 3 figures

点击查看摘要

Abstract:Long video summarization presents significant challenges for current multimodal large language models (MLLMs), particularly in maintaining temporal fidelity over extended durations and producing summaries that are both semantically and temporally grounded. In this work, we present LVSum, a human-annotated benchmark designed specifically for evaluating long video summarization with fine-grained temporal alignment. LVSum comprises diverse long-form videos across 13 domains, each paired with human-generated summaries containing precise temporal references. We conduct a comprehensive evaluation of both proprietary and open-source MLLMs on LVSum, assessing performance using newly introduced LLM-based metrics for content relevance and modality coherence, alongside standard evaluation metrics. Our experiments reveal systematic gaps in temporal understanding among existing MLLMs and offer insights that establish a new foundation for advancing temporal reasoning in long video summarization.

[CV-245] FREE-Switch: Frequency-based Dynamic LoRA Switch for Style Transfer CVPR

【速读】:该论文旨在解决多适配器(adapter)在扩散模型中融合时存在的内容漂移(content drift)和细节退化问题,尤其是在图像生成场景下,现有方法因误差累积或统一融合策略导致性能下降。其解决方案的关键在于提出一种基于频域重要性驱动的动态LoRA切换机制(FREE-Switch),通过分析不同扩散步骤中各适配器的贡献差异实现自适应融合;同时设计自动生成对齐机制(Generation Alignment),从语义层面保持适配器间的一致性,从而有效缓解细节损失,提升定制化图像生成的质量与效率。

链接: https://arxiv.org/abs/2604.10023
作者: Shenghe Zheng,Minyu Zhang,Tianhao Liu,Hongzhi Wang
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR Findings 2026

点击查看摘要

Abstract:With the growing availability of open-sourced adapters trained on the same diffusion backbone for diverse scenes and objects, combining these pretrained weights enables low-cost customized generation. However, most existing model merging methods are designed for classification or text generation, and when applied to image generation, they suffer from content drift due to error accumulation across multiple diffusion steps. For image-oriented methods, training-based approaches are computationally expensive and unsuitable for edge deployment, while training-free ones use uniform fusion strategies that ignore inter-adapter differences, leading to detail degradation. We find that since different adapters are specialized for generating different types of content, the contribution of each diffusion step carries different significance for each adapter. Accordingly, we propose a frequency-domain importance-driven dynamic LoRA switch method. Furthermore, we observe that maintaining semantic consistency across adapters effectively mitigates detail loss; thus, we design an automatic Generation Alignment mechanism to align generation intents at the semantic level. Experiments demonstrate that our FREE-Switch (Frequency-based Efficient and Dynamic LoRA Switch) framework efficiently combines adapters for different objects and styles, substantially reducing the training cost of high-quality customized generation.

[CV-246] What and Where to Adapt: Structure-Semantics Co-Tuning for Machine Vision Compression via Synergistic Adapters

【速读】:该论文旨在解决预训练编码器-解码器在图像压缩任务中,仅关注特征结构微调而忽视熵模型内统计语义适配的问题,从而导致性能提升受限。其关键解决方案是提出结构-语义协同微调(Structure-Semantics Co-Tuning, S2-CoT)框架,通过两个协同工作的专用适配器实现:结构保真度适配器(Structural Fidelity Adapter, SFA)嵌入编码器-解码器以动态融合空间与频域信息,保持高保真表示;语义上下文适配器(Semantic Context Adapter, SCA)则用于调整熵模型,使其与SFA优化后的特征对齐,通过细化通道上下文提升统计编码效率。二者联合优化使原本可能退化的微调效果转化为协同增益,在四个不同基础编解码器上均取得当前最优性能,且仅需少量可训练参数。

链接: https://arxiv.org/abs/2604.10017
作者: Shaobo Liu,Haobo Xiong,Kai Liu,Yuna Lin
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the IEEE/CVF Conference on Computer Vision and Pattern Recognition Findings, 2026

点击查看摘要

Abstract:Parameter-efficient fine-tuning of pre-trained codecs is a promising direction in image compression for human and machine vision. While most existing works have primarily focused on tuning the feature structure within the encoder-decoder backbones, the adaptation of the statistical semantics within the entropy model has received limited attention despite its function of predicting the probability distribution of latent features. Our analysis reveals that naive adapter insertion into the entropy model can lead to suboptimal outcomes, underscoring that the effectiveness of adapter-based tuning depends critically on the coordination between adapter type and placement across the compression pipeline. Therefore, we introduce Structure-Semantics Co-Tuning (S2-CoT), a novel framework that achieves this coordination via two specialized, synergistic adapters: the Structural Fidelity Adapter (SFA) and the Semantic Context Adapter (SCA). SFA is integrated into the encoder-decoder to preserve high-fidelity representations by dynamically fusing spatial and frequency information; meanwhile, the SCA adapts the entropy model to align with SFA-tuned features by refining the channel context for more efficient statistical coding. Through joint optimization, S2-CoT turns potential performance degradation into synergistic gains, achieving state-of-the-art results across four diverse base codecs with only a small fraction of trainable parameters, closely matching full fine-tuning performance. Code is available at this https URL.

[CV-247] owards Multi-Source Domain Generalization for Sleep Staging with Noisy Labels

【速读】:该论文旨在解决多源域泛化(multi-source domain generalization)场景下,睡眠分期任务中因数据来源差异(如机构、设备和人群)导致的领域偏移(domain shift)与标签噪声(label noise)共存时,现有方法性能显著下降的问题。其核心挑战在于如何在不依赖目标域标注的情况下,提升模型对异构生理信号(如EEG和EOG)的鲁棒性。解决方案的关键是提出FF-TRUST框架,通过联合时间-频率早期学习正则化(Joint Time-Frequency Early Learning Regularization, JTF-ELR),同步利用时域与频域一致性约束,并引入置信度多样性正则化机制,从而增强模型对噪声标签的鲁棒性和跨域适应能力。

链接: https://arxiv.org/abs/2604.10009
作者: Kening Wang,Di Wen,Yufan Chen,Ruiping Liu,Junwei Zheng,Jiale Wei,Kailun Yang,Rainer Stiefelhagen,Kunyu Peng
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); ETH Zurich (苏黎世联邦理工学院); Hunan University (湖南大学); INSAIT, Sofia University “St. Kliment Ohridski” (INSAIT,索非亚大学“圣克莱门特·奥赫里德斯基”)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: The benchmark and code will be made publicly available at this https URL

点击查看摘要

Abstract:Automatic sleep staging is a multimodal learning problem involving heterogeneous physiological signals such as EEG and EOG, which often suffer from domain shifts across institutions, devices, and populations. In practice, these data are also affected by noisy annotations, yet label-noise-robust multi-source domain generalization remains underexplored. We present the first benchmark for Noisy Labels in Multi-Source Domain-Generalized Sleep Staging (NL-DGSS) and show that existing noisy-label learning methods degrade substantially when domain shifts and label noise coexist. To address this challenge, we propose FF-TRUST, a domain-invariant multimodal sleep staging framework with Joint Time-Frequency Early Learning Regularization (JTF-ELR). By jointly exploiting temporal and spectral consistency together with confidence-diversity regularization, FF-TRUST improves robustness under noisy supervision. Experiments on five public datasets demonstrate consistent state-of-the-art performance under diverse symmetric and asymmetric noise settings. The benchmark and code will be made publicly available at this https URL.

[CV-248] SwinTextUNet: Integrating CLIP-Based Text Guidance into Swin Transformer U-Nets for Medical Image Segmentation

【速读】:该论文旨在解决医学图像分割中因视觉特征模糊或对比度低而导致的分割精度不足问题。传统仅依赖视觉特征的模型在处理此类挑战时表现有限,难以满足临床诊断对准确性的要求。解决方案的关键在于提出SwinTextUNet框架,通过引入对比语言图像预训练(CLIP)提取的文本嵌入,并将其融合到Swin Transformer UNet骨干网络中,利用交叉注意力机制与卷积融合策略,实现语义文本指导与分层视觉表征的有效对齐,从而提升分割的鲁棒性与准确性。

链接: https://arxiv.org/abs/2604.10000
作者: Ashfak Yeafi,Parthaw Goswami,Md Khairul Islam,Ashifa Islam Shamme
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Precise medical image segmentation is fundamental for enabling computer aided diagnosis and effective treatment planning. Traditional models that rely solely on visual features often struggle when confronted with ambiguous or low contrast patterns. To overcome these limitations, we introduce SwinTextUNet, a multimodal segmentation framework that incorporates Contrastive Language Image Pretraining (CLIP), derived textual embeddings into a Swin Transformer UNet backbone. By integrating cross attention and convolutional fusion, the model effectively aligns semantic text guidance with hierarchical visual representations, enhancing robustness and accuracy. We evaluate our approach on the QaTaCOV19 dataset, where the proposed four stage variant achieves an optimal balance between performance and complexity, yielding Dice and IoU scores of 86.47% and 78.2%, respectively. Ablation studies further validate the importance of text guidance and multimodal fusion. These findings underscore the promise of vision language integration in advancing medical image segmentation and supporting clinically meaningful diagnostic tools.

[CV-249] GIF: A Conditional Multimodal Generative Framework for IR Drop Imaging in Chip Layouts

【速读】:该论文旨在解决传统电子设计自动化(Electronic Design Automation, EDA)工具在晶体管密度持续提升背景下,进行IR压降分析时效率低、成本高的问题。现有基于机器学习的方法虽将IR压降分析建模为图像预测任务,但未能有效捕捉局部与长程依赖关系,并忽略了物理版图中的几何信息和逻辑连通性拓扑结构。解决方案的关键在于提出一种生成式IR压降框架(Generative IR drop Framework, GIF),通过融合图像特征与图结构特征,引导条件扩散过程以生成高质量的IR压降图像。GIF利用几何感知的空间特征与逻辑图表示相结合的方式,实现了对版图几何形状和电路拓扑关系的联合建模,从而显著提升了IR压降预测的准确性与可靠性。

链接: https://arxiv.org/abs/2604.09999
作者: Kiran Thorat,Nicole Meng,Mostafa Karami,Caiwen Ding,Yingjie Lao,Zhijie Jerry Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:IR drop analysis is essential in physical chip design to ensure the power integrity of on-chip power delivery networks. Traditional Electronic Design Automation (EDA) tools have become slow and expensive as transistor density scales. Recent works have introduced machine learning (ML)-based methods that formulate IR drop analysis as an image prediction problem. These existing ML approaches fail to capture both local and long-range dependencies and ignore crucial geometrical and topological information from physical layouts and logical connectivity. To address these limitations, we propose GIF, a Generative IR drop Framework that uses both geometrical and topological information to generate IR drop images. GIF fuses image and graph features to guide a conditional diffusion process, producing high-quality IR drop images. For instance, On the CircuitNet-N28 dataset, GIF achieves 0.78 SSIM, 0.95 Pearson correlation, 21.77 PSNR, and 0.026 NMAE, outperforming prior methods. These results demonstrate that our framework, using diffusion based multimodal conditioning, reliably generates high quality IR drop images. This shows that IR drop analysis can effectively leverage recent advances in generative modeling when geometric layout features and logical circuit topology are jointly modeled. By combining geometry aware spatial features with logical graph representations, GIF enables IR drop analysis to benefit from recent advances in generative modeling for structured image generation.

[CV-250] A Comparative Study of Modern Object Detectors for Robust Apple Detection in Orchard Imagery

【速读】:该论文旨在解决果园图像中苹果检测的准确性问题,这一任务对产量预测、果实计数、机器人采摘和作物监测具有重要意义。由于光照变化、叶片遮挡、密集果簇及部分遮挡等因素,现有检测方法在复杂场景下性能受限。解决方案的关键在于建立一个受控的基准测试平台,基于公开的AppleBBCH81数据集,采用固定的训练-验证-测试划分和统一的评估协议,对六种代表性目标检测器(包括YOLOv10n、YOLO11n、RT-DETR-L、Faster R-CNN、FCOS和SSDLite320)进行系统性比较。实验结果表明,检测器的选择不仅应关注定位精度(如mAP@0.5:0.95),还需考虑置信度阈值下的鲁棒性和下游任务需求(如F1分数与召回率权衡)。

链接: https://arxiv.org/abs/2604.09996
作者: Mohammed Asad,Ajai Kumar Gautam,Priyanshu Dhiman,Rishi Raj Prajapati
机构: Delhi Technological University (德里科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICICV 2026; 8 pages, 4 figures

点击查看摘要

Abstract:Accurate apple detection in orchard images is important for yield prediction, fruit counting, robotic harvesting, and crop monitoring. However, changing illumination, leaf clutter, dense fruit clusters, and partial occlusion make detection difficult. To provide a fair and reproducible comparison, this study establishes a controlled benchmark for single-class apple detection on the public AppleBBCH81 dataset using one deterministic train, validation, and test split and a unified evaluation protocol across six representative detectors: YOLOv10n, YOLO11n, RT-DETR-L, Faster R-CNN (ResNet50-FPN), FCOS (ResNet50-FPN), and SSDLite320 (MobileNetV3-Large). Performance is evaluated primarily using COCO-style mAP@0.5 and mAP@0.5:0.95, and threshold-dependent behavior is further analyzed using precision-recall curves and fixed-threshold precision, recall, and F1-score at IoU = 0.5. On the validation split, YOLO11n achieves the best strict localization performance with mAP@0.5:0.95 = 0.6065 and mAP@0.5 = 0.9620, followed closely by RT-DETR-L and YOLOv10n. At a fixed operating point with confidence = 0.05, YOLOv10n attains the highest F1-score, whereas RT-DETR-L achieves very high recall but low precision because of many false positives at low confidence. These findings show that detector selection for orchard deployment should be guided not only by localization-aware accuracy but also by threshold robustness and the requirements of the downstream task.

[CV-251] Revisiting the Scale Loss Function and Gaussian-Shape Convolution for Infrared Small Target Detection

【速读】:该论文旨在解决红外小目标检测中的两个核心问题:一是由非单调尺度损失函数导致的训练不稳定,二是由于通用卷积核忽略小目标物理成像特性而引起的空间注意力不足。解决方案的关键在于两方面改进:其一,提出基于\emphdiff的尺度损失函数,通过预测掩码与真实标签之间的有符号面积差进行加权,从而获得严格单调的梯度并实现稳定收敛;其二,引入具有可学习尺度参数的高斯形状卷积(Gaussian-shaped convolution),以匹配红外小目标中心集中式强度分布,并结合旋转螺旋掩码(rotated pinwheel mask)利用直通估计器自适应对齐卷积核方向,提升空间感知能力。实验表明,该方法在IRSTD-1k、NUDT-SIRST和SIRST-UAVB数据集上显著优于当前最优方法,在mIoU、检测概率(P_d)和虚警率(F_a)等指标上均取得一致提升。

链接: https://arxiv.org/abs/2604.09991
作者: Hao Li,Man Fung Zhuo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Infrared small target detection still faces two persistent challenges: training instability from non-monotonic scale loss functions, and inadequate spatial attention due to generic convolution kernels that ignore the physical imaging characteristics of small targets. In this paper, we revisit both aspects. For the loss side, we propose a \emphdiff-based scale loss that weights predictions according to the signed area difference between the predicted mask and the ground truth, yielding strictly monotonic gradients and stable convergence. We further analyze a family of four scale loss variants to understand how their geometric properties affect detection behavior. For the spatial side, we introduce \emphGaussian-shaped convolution with a learnable scale parameter to match the center-concentrated intensity profile of infrared small targets, and augment it with a \emphrotated pinwheel mask that adaptively aligns the kernel with target orientation via a straight-through estimator. Extensive experiments on IRSTD-1k, NUDT-SIRST, and SIRST-UAVB demonstrate consistent improvements in mIoU , P_d , and F_a over state-of-the-art methods. We release our anonymous code and pretrained models.

[CV-252] Gait Recognition with Temporal Kolmogorov-Arnold Networks

【速读】:该论文旨在解决基于轮廓的步态识别模型在处理长序列时对噪声和外观相关协变量敏感、难以有效建模局部步态周期与长期运动趋势的问题。现有递归结构易丢失早期帧信息且优化效率低,而基于Transformer的模型则存在计算资源消耗大、对不规则序列长度和噪声输入敏感等缺陷。解决方案的关键在于提出Temporal Kolmogorov-Arnold Network (TKAN),其通过将固定边权重替换为可学习的一维函数,并引入两级记忆机制——由短时RKAN子层与门控长时路径组成的结构,实现了对步态周期级动态和更广泛时间上下文的高效建模,同时保持轻量化的网络架构。

链接: https://arxiv.org/abs/2604.09990
作者: Mohammed Asad,Dinesh Kumar Vishwakarma
机构: Delhi Technological University (德里科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Gait recognition is a biometric modality that identifies individuals from their characteristic walking patterns. Unlike conventional biometric traits, gait can be acquired at a distance and without active subject cooperation, making it suitable for surveillance and public safety applications. Nevertheless, silhouette-based temporal models remain sensitive to long sequences, observation noise, and appearance-related covariates. Recurrent architectures often struggle to preserve information from earlier frames and are inherently sequential to optimize, whereas transformer-based models typically require greater computational resources and larger training sets and may be sensitive to irregular sequence lengths and noisy inputs. These limitations reduce robustness under clothing variation, carrying conditions, and view changes, while also hindering the joint modeling of local gait cycles and longer-term motion trends. To address these challenges, we introduce a Temporal Kolmogorov-Arnold Network (TKAN) for gait recognition. The proposed model replaces fixed edge weights with learnable one-dimensional functions and incorporates a two-level memory mechanism consisting of short-term RKAN sublayers and a gated long-term pathway. This design enables efficient modeling of both cycle-level dynamics and broader temporal context while maintaining a compact backbone. Experiments on the CASIA-B dataset indicate that the proposed CNN+TKAN framework achieves strong recognition performance under the reported evaluation setting.

[CV-253] FlowPalm: Optical Flow Driven Non-Rigid Deformation for Geometrically Diverse Palmprint Generation

【速读】:该论文旨在解决现有合成掌纹(palmprint)生成方法在模拟真实掌纹几何变化方面的不足,尤其是忽略了复杂非刚性形变的问题。当前方法主要关注风格迁移(style translation),而对几何变化的建模仅依赖于简单的手工增强,无法充分反映真实掌纹的多样性。解决方案的关键在于提出FlowPalm框架,其核心是利用真实掌纹对之间的光流(optical flow)估计来捕捉几何形变的统计模式,并设计了一种渐进式采样过程,在扩散过程中逐步引入几何变形,同时保持身份一致性(identity consistency)。这一机制显著提升了合成数据的真实性和下游识别任务的性能。

链接: https://arxiv.org/abs/2604.09989
作者: Yuchen Zou,Huikai Shao,Lihuang Fang,Zhipeng Xiong,Dexing Zhong
机构: Xi’an Jiaotong University (西安交通大学); Sichuan Digital Economy Industry Development Research Institute (四川数字经济产业发展研究院); Southern University of Science and Technology (南方科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, synthetic palmprints have been increasingly used as substitutes for real data to train recognition models. To be effective, such synthetic data must reflect the diversity of real palmprints, including both style variation and geometric variation. However, existing palmprint generation methods mainly focus on style translation, while geometric variation is either ignored or approximated by simple handcrafted augmentations. In this work, we propose FlowPalm, an optical-flow-driven palmprint generation framework capable of simulating the complex non-rigid deformations observed in real palms. Specifically, FlowPalm estimates optical flows between real palmprint pairs to capture the statistical patterns of geometric deformations. Building on these priors, we design a progressive sampling process that gradually introduces the geometric deformations during diffusion while maintaining identity consistency. Extensive experiments on six benchmark datasets demonstrate that FlowPalm significantly outperforms state-of-the-art palmprint generation approaches in downstream recognition tasks. Project page: this https URL

[CV-254] YUV20K: A Complexity-Driven Benchmark and Trajectory-Aware Alignment Model for Video Camouflaged Object Detection

【速读】:该论文旨在解决视频伪装目标检测(Video Camouflaged Object Detection, VCOD)中因数据稀缺和模型对复杂运动动态鲁棒性不足所导致的性能瓶颈问题,尤其针对由剧烈运动引起的运动诱导外观不稳定(Motion-Induced Appearance Instability)与时间特征错位(Temporal Feature Misalignment)等挑战。其解决方案的关键在于提出一个包含两个核心模块的新框架:运动特征稳定模块(Motion Feature Stabilization, MFS)轨迹感知对齐模块(Trajectory-Aware Alignment, TAA)。MFS通过帧无关的语义基础原型(frame-agnostic Semantic Basis Primitives)实现特征稳定,而TAA则利用轨迹引导的可变形采样策略确保时序特征的精确对齐,从而显著提升模型在复杂时空场景下的检测性能与跨域泛化能力。

链接: https://arxiv.org/abs/2604.09985
作者: Yiyu Liu,Shuo Ye,Chao Hao,Zitong Yu
机构: K1NSA; Great Bay University (大湾区大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Video Camouflaged Object Detection (VCOD) is currently constrained by the scarcity of challenging benchmarks and the limited robustness of models against erratic motion dynamics. Existing methods often struggle with Motion-Induced Appearance Instability and Temporal Feature Misalignment caused by complex motion scenarios. To address the data bottleneck, we present YUV20K, a pixel-level annoated complexity-driven VCOD benchmark. Comprising 24,295 annotated frames across 91 scenes and 47 kinds of species, it specifically targets challenging scenarios like large-displacement motion, camera motion and other 4 types scenarios. On the methodological front, we propose a novel framework featuring two key modules: Motion Feature Stabilization (MFS) and Trajectory-Aware Alignment (TAA). The MFS module utilizes frame-agnostic Semantic Basis Primitives to stablize features, while the TAA module leverages trajectory-guided deformable sampling to ensure precise temporal alignment. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art competitors on existing datasets and establishes a new baseline on the challenging YUV20K. Notably, our framework exhibits superior cross-domain generalization and robustness when confronting complex spatiotemporal scenarios. Our code and dataset will be available at this https URL

[CV-255] Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation CVPR

【速读】:该论文旨在解决视频无监督域自适应(Video Unsupervised Domain Adaptation, VUDA)中因静态冗余背景导致的域偏移问题以及现有方法计算效率低下的瓶颈。其核心挑战在于,源域与目标域间存在显著的背景差异,而传统方法未能有效区分背景与动作相关区域,同时缺乏对计算资源的优化。解决方案的关键在于提出可学习的运动聚焦分词机制(Learnable Motion-Focused Tokenization, LMFT),该机制将视频帧划分为补丁令牌(patch tokens),并自动识别并丢弃低运动、冗余的背景令牌,保留高运动、与动作相关的令牌用于域适应,从而在提升模型性能的同时显著降低计算开销。

链接: https://arxiv.org/abs/2604.09955
作者: Tzu Ling Liu,Ian Stavness,Mrigank Rochan
机构: University of Saskatchewan (萨斯喀彻温大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

点击查看摘要

Abstract:Video Unsupervised Domain Adaptation (VUDA) poses a significant challenge in action recognition, requiring the adaptation of a model from a labeled source domain to an unlabeled target domain. Despite recent advances, existing VUDA methods often fall short of fully supervised performance, a key reason being the prevalence of static and uninformative backgrounds that exacerbate domain shifts. Additionally, prior approaches largely overlook computational efficiency, limiting real-world adoption. To address these issues, we propose Learnable Motion-Focused Tokenization (LMFT) for VUDA. LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings show that our VUDA framework with LMFT achieves state-of-the-art performance while significantly reducing computational overhead. LMFT thus enables VUDA that is both effective and computationally efficient.

[CV-256] Unmixing-Guided Spatial-Spectral Mamba with Clustering Tokens for Hyperspectral Image Classification

【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)分类中因光谱混合效应(spectral-mixture effect)、空间-光谱异质性(spatial-spectral heterogeneity)以及难以保持类别边界与细节所导致的性能瓶颈问题。其解决方案的关键在于提出一种基于解混引导的空间-光谱Mamba模块(unmixing-guided spatial-spectral Mamba),通过四个核心设计实现:1)构建可自动学习端元(endmember)和丰度图(abundance map)并考虑端元变异性的光谱解混网络;2)利用丰度图聚类结果设计Top-K令牌选择策略,自适应生成用于Mamba建模的令牌序列;3)基于Top-K令牌序列设计新的解混引导型空间-光谱Mamba模块,显著提升传统Mamba在令牌学习与排序方面的表现;4)引入多任务监督机制,联合优化端元-丰度模式与分类标签,形成输出分类图、光谱库及丰度图的统一框架。实验表明该方法在四个数据集上均显著优于现有先进模型。

链接: https://arxiv.org/abs/2604.09948
作者: Yimin Zhu,Lincoln Linlin Xu
机构: University of Calgary (卡尔加里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although hyperspectral image (HSI) classification is critical for supporting various environmental applications, it is a challenging task due to the spectral-mixture effect, the spatial-spectral heterogeneity and the difficulty to preserve class boundaries and details. This letter presents a novel unmixing-guided spatial-spectral Mamba with clustering tokens for improved HSI classification, with the following contributions. First, to disentangle the spectral mixture effect in HSI for improved pattern discovery, we design a novel spectral unmixing network that not only automatically learns endmembers and abundance maps from HSI but also accounts for endmember variabilities. Second, to generate Mamba token sequences, based on the clusters defined by abundance maps, we design an efficient Top-\textitK token selection strategy to adaptively sequence the tokens for improved representational capability. Third, to improve spatial-spectral feature learning and detail preservation, based on the Top-\textitK token sequences, we design a novel unmixing-guided spatial-spectral Mamba module that greatly improves traditional Mamba models in terms of token learning and sequencing. Fourth, to learn simultaneously the endmember-abundance patterns and classification labels, a multi-task scheme is designed for model supervision, leading to a new unmixing-classification framework that outputs not only accurate classification maps but also a comprehensive spectral-library and abundance maps. Comparative experiments on four HSI datasets demonstrate that our model can greatly outperform the other state-of-the-art approaches. Code is available at this https URL

[CV-257] I Walk the Line: Examining the Role of Gestalt Continuity in Object Binding for Vision Transformers

【速读】:该论文旨在解决预训练视觉模型中对象绑定(object binding)机制的内在原理问题,即这些模型如何将图像中属于同一对象的片段整合为统一表征。其关键解决方案在于验证了视觉变换器(vision transformers)是否依赖格式塔连续性原则(Gestalt continuity)来实现对象绑定,而非仅依赖相似性或邻近性等其他格式塔原则。研究通过合成数据集证明绑定探测器对连续性的敏感性,并识别出特定注意力头可追踪连续性信息,且这些注意力头在不同数据集间具有泛化能力;进一步的消融实验表明,移除这些注意力头会显著削弱模型编码对象绑定的能力,从而揭示了连续性机制在视觉模型对象绑定中的核心作用。

链接: https://arxiv.org/abs/2604.09942
作者: Alexa R. Tartaglini,Michael A. Lepori
机构: Stanford University (斯坦福大学); Brown University (布朗大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Object binding is a foundational process in visual cognition, during which low-level perceptual features are joined into object representations. Binding has been considered a fundamental challenge for neural networks, and a major milestone on the way to artificial models with flexible visual intelligence. Recently, several investigations have demonstrated evidence that binding mechanisms emerge in pretrained vision models, enabling them to associate portions of an image that contain an object. The question remains: how are these models binding objects together? In this work, we investigate whether vision models rely on the principle of Gestalt continuity to perform object binding, over and above other principles like similarity and proximity. Using synthetic datasets, we demonstrate that binding probes are sensitive to continuity across a wide range of pretrained vision transformers. Next, we uncover particular attention heads that track continuity, and show that these heads generalize across datasets. Finally, we ablate these attention heads, and show that they often contribute to producing representations that encode object binding.

[CV-258] BLPR: Robust License Plate Recognition under Viewpoint and Illumination Variations via Confidence-Driven VLM Fallback

【速读】:该论文旨在解决在非受限环境下(尤其是数据稀缺且视觉特征独特的地区如玻利维亚)的车牌识别(License Plate Recognition, LPR)准确性问题,主要挑战包括光照变化、视角畸变以及缺乏高质量标注数据。解决方案的关键在于提出一个两阶段深度学习框架BLPR:首先使用Blender生成的合成数据预训练YOLO-based检测器以模拟极端视角和光照条件,随后在拉巴斯街头采集的真实数据上微调;其次通过几何校正和字符识别模型提升精度,并引入轻量级视觉-语言模型Gemma3 4B作为置信度阈值触发的备用机制以增强模糊场景下的鲁棒性;此外,该研究还首次构建了公开可用的玻利维亚车牌数据集,并采用合成到真实域适应策略提升系统在多样化现实环境中的泛化能力。

链接: https://arxiv.org/abs/2604.09927
作者: Guillermo Auza Banegas,Diego Calvimontes Vera,Sergio Castro Sandoval,Natalia Condori Peredo,Edwin Salcedo
机构: Queen Mary University of London (伦敦玛丽女王大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Robust license plate recognition in unconstrained environments remains a significant challenge, particularly in underrepresented regions with limited data availability and unique visual characteristics, such as Bolivia. Recognition accuracy in real-world conditions is often degraded by factors such as illumination changes and viewpoint distortion. To address these challenges, we introduce BLPR, a novel deep learning-based License Plate Detection and Recognition (LPDR) framework specifically designed for Bolivian license plates. The proposed system follows a two-stage pipeline where a YOLO-based detector is pretrained on synthetic data generated in Blender to simulate extreme perspectives and lighting conditions, and subsequently fine-tuned on street-level data collected in La Paz, Bolivia. Detected plates are geometrically rectified and passed to a character recognition model. To improve robustness under ambiguous scenarios, a lightweight vision-language model (Gemma3 4B) is selectively triggered as a confidence-based fallback mechanism. The proposed framework further leverages synthetic-to-real domain adaptation to improve robustness under diverse real-world conditions. We also introduce the first publicly available Bolivian LPDR dataset, enabling evaluation under diverse viewpoint and illumination conditions. The system achieves a character-level recognition accuracy of 89.6% on real-world data, demonstrating its effectiveness for deployment in challenging urban environments. Our project is publicly available at this https URL.

[CV-259] GLEaN: A Text-to-image Bias Detection Approach for Public Comprehension

【速读】:该论文旨在解决生成式 AI(Generative AI)中文本到图像(Text-to-Image, T2I)模型偏见的可解释性问题,即现有偏见测量与缓解方法多面向技术专家,缺乏对公众的可读性和直观理解能力。其解决方案的关键在于提出 GLEaN(Generative Likeness Evaluation at N-Scale),一个基于肖像的可解释性流程:通过大规模自动化图像生成、基于面部关键点的筛选与空间对齐,以及中位像素合成,将模型对特定身份提示的倾向性凝练为单一代表性肖像。该方法无需统计学背景即可直观呈现模型对不同社会身份(如“医生”与“罪犯”)的刻板印象,且在用户研究中证实其传达偏见效率优于传统数据表格,同时适用于无模型内部访问权限的黑盒系统。

链接: https://arxiv.org/abs/2604.09923
作者: Bochu Ding,Brinnae Bent,Augustus Wendell
机构: Duke University (杜克大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image (T2I) models, and their encoded biases, increasingly shape the visual media the public encounters. While researchers have produced a rich body of work on bias measurement, auditing, and mitigation in T2I systems, those methods largely target technical stakeholders, leaving a gap in public legibility. We introduce GLEaN (Generative Likeness Evaluation at N-Scale), a portrait-based explainability pipeline designed to make T2I model biases visually understandable to a broad audience. GLEaN comprises three stages: automated large-scale image generation from identity prompts, facial landmark-based filtering and spatial alignment, and median-pixel composition that distills a model’s central tendency into a single representative portrait. The resulting composites require no statistical background to interpret; a viewer can see, at a glance, who a model ‘imagines’ when prompted with ‘a doctor’ versus a ‘felon.’ We demonstrate GLEaN on Stable Diffusion XL across 40 social and occupational identity prompts, producing composites that reproduce documented biases and surface new associations between skin tone and predicted emotion. We find in a between-subjects user study (N = 291) that GLEaN portraits communicate biases as effectively as conventional data tables, but require significantly less viewing time. Because the method relies solely on generated outputs, it can also be replicated on any black-box and closed-weight systems without access to model internals. GLEaN offers a scalable, model-agnostic approach to bias explainability, purpose-built for public comprehension, and is publicly available at this https URL.

[CV-260] K-STEMIT: Knowledge-Informed Spatio-Temporal Efficient Multi-Branch Graph Neural Network for Subsurface Stratigraphy Thickness Estimation from Radar Data

【速读】:该论文旨在解决极地冰盖内部层理厚度估计中因雷达图像中的斑点噪声(speckle noise)和采集伪影导致的生成式 AI (Generative AI) 模型精度下降问题,以及纯数据驱动方法在空间或时间外推时缺乏物理约束而产生不现实估计的问题。解决方案的关键在于提出一种知识引导的、高效的多分支时空图神经网络 K-STEMIT,其核心创新包括:融合几何空间学习框架与时间卷积以捕捉时空动态特征,并引入来自 Model Atmospheric Regional (MAR) 物理气象模型的同步物理数据作为先验信息;同时采用自适应特征融合策略,动态整合不同分支提取的特征,从而显著降低均方根误差(RMSE)达 21.01%,且保持近最优计算效率,实现对大尺度区域雪积累变化的连续、可靠时空评估。

链接: https://arxiv.org/abs/2604.09922
作者: Zesheng Liu,Maryam Rahnemoonfar
机构: Lehigh University (莱赫igh大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Subsurface stratigraphy contains important spatio-temporal information about accumulation, deformation, and layer formation in polar ice sheets. In particular, variations in internal ice layer thickness provide valuable constraints for snow mass balance estimation and projections of ice sheet change. Although radar sensors can capture these layered structures as depth-resolved radargrams, convolutional neural networks applied directly to radar images are often sensitive to speckle noise and acquisition artifacts. In addition, purely data-driven methods may underuse physical knowledge, leading to unrealistic thickness estimates under spatial or temporal extrapolation. To address these challenges, we develop K-STEMIT, a novel knowledge-informed, efficient, multi-branch spatio-temporal graph neural network that combines a geometric framework for spatial learning with temporal convolution to capture temporal dynamics, and incorporates physical data synchronized from the Model Atmospheric Regional physical weather model. An adaptive feature fusion strategy is employed to dynamically combine features learned from different branches. Extensive experiments have been conducted to compare K-STEMIT against current state-of-the-art methods in both knowledge-informed and non-knowledge-informed settings, as well as other existing methods. Results show that K-STEMIT consistently achieves the highest accuracy while maintaining near-optimal efficiency. Most notably, incorporating adaptive feature fusion and physical priors reduces the root mean-squared error by 21.01% with negligible additional cost compared to its conventional multi-branch variants. Additionally, our proposed K-STEMIT achieves consistently lower per-year relative MAE, enabling reliable, continuous spatiotemporal assessment of snow accumulation variability across large spatial regions.

[CV-261] Does Your VFM Speak Plant? The Botanical Grammar of Vision Foundation Models for Object Detection

【速读】:该论文旨在解决视觉基础模型(Vision Foundation Models, VFM)在复杂农业场景中进行零样本目标检测时,因文本提示(text prompt)构建不当而导致性能敏感的问题。解决方案的关键在于提出了一套系统化的提示优化框架,通过将提示分解为八个维度并采用单因素分析与组合优化相结合的方法,发现不同模型对提示结构的响应存在显著差异——即最优提示具有模型特异性且非直观;同时验证了基于合成数据训练得到的提示结构可有效迁移至真实田间场景和不同目标类别(如从花到荚),从而在不依赖人工标注的情况下显著提升检测性能(如mAP@0.5最高提升0.362)。

链接: https://arxiv.org/abs/2604.09920
作者: Lars Lundqvist,Earl Ranario,Hamid Kamangir,Heesup Yun,Christine Diepenbrock,Brian N. Bailey,J. Mason Earles
机构: University of California, Davis (加州大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision foundation models (VFMs) offer the promise of zero-shot object detection without task-specific training data, yet their performance in complex agricultural scenes remains highly sensitive to text prompt construction. We present a systematic prompt optimization framework evaluating four open-vocabulary detectors – YOLO World, SAM3, Grounding DINO, and OWLv2 – for cowpea flower and pod detection across synthetic and real field imagery. We decompose prompts into eight axes and conduct one-factor-at-a-time analysis followed by combinatorial optimization, revealing that models respond divergently to prompt structure: conditions that optimize one architecture can collapse another. Applying model-specific combinatorial prompts yields substantial gains over a naive species-name baseline, including +0.357 mAP@0.5 for YOLO World and +0.362 mAP@0.5 for OWLv2 on synthetic cowpea flower data. To evaluate cross-task generalization, we use an LLM to translate the discovered axis structure to a morphologically distinct target – cowpea pods – and compare against prompting using the discovered optimal structures from synthetic flower data. Crucially, prompt structures optimized exclusively on synthetic data transfer effectively to real-world fields: synthetic-pipeline prompts match or exceed those discovered on labeled real data for the majority of model-object combinations (flower: 0.374 vs. 0.353 for YOLO World; pod: 0.429 vs. 0.371 for SAM3). Our findings demonstrate that prompt engineering can substantially close the gap between zero-shot VFMs and supervised detectors without requiring manual annotation, and that optimal prompts are model-specific, non-obvious, and transferable across domains.

[CV-262] PointSplat: Efficient Geometry-Driven Pruning and Transformer Refinement for 3D Gaussian Splatting CVPR

【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在复杂场景中因需数百万高斯分布而导致的内存与存储开销过大的问题。传统方法依赖2D图像计算重要性评分并进行逐场景微调,效率低下且存在冗余。其解决方案的关键在于提出PointSplat框架,该框架包含两个核心组件:一是基于3D几何属性的高效剪枝策略,完全摆脱对2D图像的依赖,实现更鲁棒的初始稀疏化;二是双分支编码器结构,分离并重新加权几何与外观特征,避免多模态特征失衡,从而在不引入额外场景级优化的前提下,显著提升渲染质量与计算效率。

链接: https://arxiv.org/abs/2604.09903
作者: Anh Thuan Tran,Jana Kosecka
机构: George Mason University (乔治梅森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPRW 2026 (3DMV)

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has recently unlocked real-time, high-fidelity novel view synthesis by representing scenes using explicit 3D primitives. However, traditional methods often require millions of Gaussians to capture complex scenes, leading to significant memory and storage demands. Recent approaches have addressed this issue through pruning and per-scene fine-tuning of Gaussian parameters, thereby reducing the model size while maintaining visual quality. These strategies typically rely on 2D images to compute important scores followed by scene-specific optimization. In this work, we introduce PointSplat, 3D geometry-driven prune-and-refine framework that bridges previously disjoint directions of gaussian pruning and transformer refinement. Our method includes two key components: (1) an efficient geometry-driven strategy that ranks Gaussians based solely on their 3D attributes, removing reliance on 2D images during pruning stage, and (2) a dual-branch encoder that separates, re-weights geometric and appearance to avoid feature imbalance. Extensive experiments on ScanNet++ and Replica across varying sparsity levels demonstrate that PointSplat consistently achieves competitive rendering quality and superior efficiency without additional per-scene optimization.

[CV-263] Not Your Stereo-Typical Estimator: Combining Vision and Language for Volume Perception

【速读】:该论文旨在解决从单视图视觉数据中准确估计物体体积的长期挑战,这一问题在机器人技术、物流和智慧健康等领域具有重要应用价值。现有方法通常依赖复杂的三维重建流程,或难以处理单视角图像固有的歧义性。其解决方案的关键在于融合来自立体视觉的隐式三维线索与来自自然语言文本的显式先验知识:通过提取立体图像对和包含物体类别及近似体积描述的文本提示的深度特征,并利用一个简洁有效的投影层将二者整合为统一的多模态表示,进而用于回归预测。实验表明,即使使用简单的文本先验也能显著提升体积估计性能,为构建更具情境感知能力的视觉测量系统提供了新思路。

链接: https://arxiv.org/abs/2604.09886
作者: Gautham Vinod,Bruce Coburn,Siddeshwar Raghavan,Fengqing Zhu
机构: Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Accurate volume estimation of objects from visual data is a long-standing challenge in computer vision with significant applications in robotics, logistics, and smart health. Existing methods often rely on complex 3D reconstruction pipelines or struggle with the ambiguity inherent in single-view images. To address these limitations, we introduce a new method that fuses implicit 3D cues from stereo vision with explicit prior knowledge from natural language text. Our approach extracts deep features from a stereo image pair and a descriptive text prompt that contains the object’s class and an approximate volume, then integrates them using a simple yet effective projection layer into a unified, multi-modal representation for regression. We conduct extensive experiments on public datasets demonstrating that our text-guided approach significantly outperforms vision-only baselines. Our findings show that leveraging even simple textual priors can effectively guide the volume estimation task, paving the way for more context-aware visual measurement systems. Code: this https URL.

[CV-264] opo-ADV: Generating Topology-Driven Imperceptible Adversarial Point Clouds

【速读】:该论文旨在解决3D点云深度学习模型在面对对抗性扰动时的脆弱性问题,尤其是现有攻击方法主要依赖于几何属性(如点位置、曲率或表面结构)的修改,而忽视了拓扑结构可能带来的新漏洞。其解决方案的关键在于提出一种基于拓扑驱动的对抗攻击方法——Topo-ADV,该方法首次将持久同调(persistent homology)作为显式优化目标引入对抗样本生成过程,通过可微分的拓扑表示嵌入持久图谱(persistence diagrams),联合优化拓扑差异损失、误分类目标和几何不可感知约束,从而实现对点云拓扑特征的梯度可控操纵,在保持视觉合理性的同时显著提升攻击成功率。

链接: https://arxiv.org/abs/2604.09879
作者: Gayathry Chandramana Krishnan Nampoothiry,Raghuram Venkatapuram,Anirban Ghosh,Ayan Dutta
机构: University of North Florida (北佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)
备注: Under review

点击查看摘要

Abstract:Deep neural networks for 3D point cloud understanding have achieved remarkable success in object classification and recognition, yet recent work shows that these models remain highly vulnerable to adversarial perturbations. Existing 3D attacks predominantly manipulate geometric properties such as point locations, curvature, or surface structure, implicitly assuming that preserving global shape fidelity preserves semantic content. In this work, we challenge this assumption and introduce the first topology-driven adversarial attack for point cloud deep learning. Our key insight is that the homological structure of a 3D object constitutes a previously unexplored vulnerability surface. We propose Topo-ADV, an end-to-end differentiable framework that incorporates persistent homology as an explicit optimization objective, enabling gradient-based manipulation of topological features during adversarial example generation. By embedding persistence diagrams through differentiable topological representations, our method jointly optimizes (i) a topology divergence loss that alters persistence, (ii) a misclassification objective, and (iii) geometric imperceptibility constraints that preserve visual plausibility. Experiments demonstrate that subtle topology-driven perturbations consistently achieve up to 100% attack success rates on benchmark datasets such as ModelNet40, ShapeNet Part, and ScanObjectNN using PointNet and DGCNN classifiers, while remaining geometrically indistinguishable from the original point clouds, beating state-of-the-art methods on various perceptibility metrics.

[CV-265] DINO_4D: Semantic-Aware 4D Reconstruction

【速读】:该论文旨在解决动态场景下4D重建过程中因语义漂移(semantic drift)导致的跟踪精度下降和重建完整性不足的问题。其解决方案的关键在于引入冻结的DINOv3特征作为结构先验(structural priors),将语义信息注入重建流程,从而有效抑制动态跟踪中的语义漂移现象,同时保持线性时间复杂度 O(T)O(T),显著提升了跟踪准确率(APD)和重建完整性。

链接: https://arxiv.org/abs/2604.09877
作者: Yiru Yang,Zhuojie Wu,Quentin Marguet,Nishant Kumar Singh,Max Schulthess
机构: University of Zurich (苏黎世大学); EPFL (洛桑联邦理工学院); ETH Zurich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:In the intersection of computer vision and robotic perception, 4D reconstruction of dynamic scenes serve as the critical bridge connecting low-level geometric sensing with high-level semantic understanding. We present DINO_4D, introducing frozen DINOv3 features as structural priors, injecting semantic awareness into the reconstruction process to effectively suppress semantic drift during dynamic tracking. Experiments on the Point Odyssey and TUM-Dynamics benchmarks demonstrate that our method maintains the linear time complexity O(T) of its predecessors while significantly improving Tracking Accuracy (APD) and Reconstruction Completeness. DINO_4D establishes a new paradigm for constructing 4D World Models that possess both geometric precision and semantic understanding.

[CV-266] PAS: Estimating the target accuracy before domain adaptation ICLR2026

【速读】:该论文旨在解决领域自适应(Domain Adaptation, DA)中源域和预训练特征提取器选择困难的问题,尤其是在目标域缺乏标注验证集且可用预训练模型众多的情况下。解决方案的关键在于提出一种新的可迁移性评分(PAS, Pre-trained model and Source domain Adaptability Score),该评分通过分析预训练特征嵌入来评估源域与目标分类任务的兼容性,并据此筛选最优的预训练模型和源域组合,从而在提升目标域准确率的同时降低计算开销。实验表明,PAS 与实际目标域性能高度相关,能够有效指导最佳模型和源域的选择。

链接: https://arxiv.org/abs/2604.09863
作者: Raphaella Diniz,Jackson de Faria,Martin Ester
机构: Simon Fraser University (西蒙弗雷泽大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published as a conference paper at ICLR 2026

点击查看摘要

Abstract:The goal of domain adaptation is to make predictions for unlabeled samples from a target domain with the help of labeled samples from a different but related source domain. The performance of domain adaptation methods is highly influenced by the choice of source domain and pre-trained feature extractor. However, the selection of source data and pre-trained model is not trivial due to the absence of a labeled validation set for the target domain and the large number of available pre-trained models. In this work, we propose PAS, a novel score designed to estimate the transferability of a source domain set and a pre-trained feature extractor to a target classification task before actually performing domain adaptation. PAS leverages the generalization power of pre-trained models and assesses source-target compatibility based on the pre-trained feature embeddings. We integrate PAS into a framework that indicates the most relevant pre-trained model and source domain among multiple candidates, thus improving target accuracy while reducing the computational overhead. Extensive experiments on image classification benchmarks demonstrate that PAS correlates strongly with actual target accuracy and consistently guides the selection of the best-performing pre-trained model and source domain for adaptation.

[CV-267] FF3R: Feedforward Feature 3D Reconstruction from Unconstrained views CVPR2026

【速读】:该论文旨在解决现有方法在几何重建与语义理解任务中各自独立处理所导致的冗余流程和误差累积问题。其核心挑战在于如何实现无需标注信息(如相机位姿、深度图或语义标签)的情况下,统一建模几何结构与语义特征,并克服前馈式特征重建中的全局语义不一致性和局部结构不一致性问题。解决方案的关键在于提出FF3R框架,通过两个创新机制实现:(i) Token-wise Fusion Module利用交叉注意力机制将语义上下文融入几何token,增强语义感知能力;(ii) Semantic-Geometry Mutual Boosting机制结合几何引导的特征扭曲以保障全局一致性,以及语义感知的体素化策略以提升局部结构 coherence,从而在无监督条件下实现统一的3D推理。

链接: https://arxiv.org/abs/2604.09862
作者: Chaoyi Zhou,Run Wang,Feng Luo,Mert D. Pesé,Zhiwen Fan,Yiqi Zhong,Siyu Huang
机构: Microsoft(微软); Clemson University(克莱姆森大学); Texas AM University(德克萨斯农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Findings. Project Page: this https URL

点击查看摘要

Abstract:Recent advances in vision foundation models have revolutionized geometry reconstruction and semantic understanding. Yet, most of the existing approaches treat these capabilities in isolation, leading to redundant pipelines and compounded errors. This paper introduces FF3R, a fully annotation-free feed-forward framework that unifies geometric and semantic reasoning from unconstrained multi-view image sequences. Unlike previous methods, FF3R does not require camera poses, depth maps, or semantic labels, relying solely on rendering supervision for RGB and feature maps, establishing a scalable paradigm for unified 3D reasoning. In addition, we address two critical challenges in feedforward feature reconstruction pipelines, namely global semantic inconsistency and local structural inconsistency, through two key innovations: (i) a Token-wise Fusion Module that enriches geometry tokens with semantic context via cross-attention, and (ii) a Semantic-Geometry Mutual Boosting mechanism combining geometry-guided feature warping for global consistency with semantic-aware voxelization for local coherence. Extensive experiments on ScanNet and DL3DV-10K demonstrate FF3R’s superior performance in novel-view synthesis, open-vocabulary semantic segmentation, and depth estimation, with strong generalization to in-the-wild scenarios, paving the way for embodied intelligence systems that demand both spatial and semantic understanding.

[CV-268] Do vision models perceive illusory motion in static images like humans? CVPR2026

【速读】:该论文旨在解决当前光学流(optical flow)模型在模拟人类视觉运动感知方面存在的显著差距问题,特别是这些模型是否能够正确识别静态图像中由人类视觉系统感知到的错觉运动,如“旋转蛇”(Rotating Snakes) illusion。研究表明,大多数现有的深度神经网络(DNN)光学流模型无法生成与人类感知一致的流动场,而仅有人类启发的双通道(Dual-Channel)模型在模拟眼跳(saccadic eye movements)条件下能再现预期的旋转运动。其关键解决方案在于引入了基于亮度和高阶颜色特征的多模态运动信号,并结合递归注意力机制(recurrent attention mechanism)以有效整合局部线索,从而更贴近人类视觉系统的运动感知机制。

链接: https://arxiv.org/abs/2604.09853
作者: Isabella Elaine Rosario(1),Fan L. Cheng(1),Zitang Sun(2),Nikolaus Kriegeskorte(1) ((1) Columbia University, (2) Kyoto University)
机构: Columbia University (哥伦比亚大学); Kyoto University (京都大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 Workshops (Findings). * Equal contribution

点击查看摘要

Abstract:Understanding human motion processing is essential for building reliable, human-centered computer vision systems. Although deep neural networks (DNNs) achieve strong performance in optical flow estimation, they remain less robust than humans and rely on fundamentally different computational strategies. Visual motion illusions provide a powerful probe into these mechanisms, revealing how human and machine vision align or diverge. While recent DNN-based motion models can reproduce dynamic illusions such as reverse-phi, it remains unclear whether they can perceive illusory motion in static images, exemplified by the Rotating Snakes illusion. We evaluate several representative optical flow models on Rotating Snakes and show that most fail to generate flow fields consistent with human perception. Under simulated conditions mimicking saccadic eye movements, only the human-inspired Dual-Channel model exhibits the expected rotational motion, with the closest correspondence emerging during the saccade simulation. Ablation analyses further reveal that both luminance-based and higher-order color–feature–based motion signals contribute to this behavior and that a recurrent attention mechanism is critical for integrating local cues. Our results highlight a substantial gap between current optical-flow models and human visual motion processing, and offer insights for developing future motion-estimation systems with improved correspondence to human perception and human-centric AI.

[CV-269] raining-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning

【速读】:该论文旨在解决现有文本到图像扩散模型在生成过程中存在的前景偏置问题,即模型过度关注前景对象而将背景视为被动且未充分优化的附属部分,从而导致场景整体一致性差和构图控制能力受限。解决方案的关键在于提出一种无需训练的框架,通过两个核心组件重构扩散采样过程以显式建模前景与背景的交互:其一为动态空间引导(Dynamic Spatial Guidance),引入一种随时间步变化的软门控机制,调节扩散过程中前景与背景的注意力分配,实现空间上的平衡生成;其二为多路径剪枝(Multi-Path Pruning),通过多路径潜在空间探索并结合内部注意力统计与外部语义对齐信号,动态筛选满足对象-背景约束的候选轨迹,从而提升背景一致性和对象-背景构图对齐度。

链接: https://arxiv.org/abs/2604.09850
作者: Yang Deng,David Mould,Paul L. Rosin,Yu-Kun Lai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing text-to-image diffusion models, while excelling at subject synthesis, exhibit a persistent foreground bias that treats the background as a passive and under-optimized byproduct. This imbalance compromises global scene coherence and constrains compositional control. To address the limitation, we propose a training-free framework that restructures diffusion sampling to explicitly account for foreground-background interactions. Our approach consists of two key components. First, Dynamic Spatial Guidance introduces a soft, time step dependent gating mechanism that modulates foreground and background attention during the diffusion process, enabling spatially balanced generation. Second, Multi-Path Pruning performs multi-path latent exploration and dynamically filters candidate trajectories using both internal attention statistics and external semantic alignment signals, retaining trajectories that better satisfy object-background constraints. We further develop a benchmark specifically designed to evaluate object-background compositionality. Extensive evaluations across multiple diffusion backbones demonstrate consistent improvements in background coherence and object-background compositional alignment.

[CV-270] Is There Knowledge Left to Extract? Evidence of Frag ility in Medically Fine-Tuned Vision-Language Models

【速读】:该论文旨在解决医学视觉语言模型(Vision-Language Models, VLMs)在高风险临床场景中是否能通过领域特定微调提升深层推理能力的问题,而非仅依赖表层视觉线索。研究发现,尽管进行了医学微调,VLMs 在复杂医学影像任务(如组织病理学分类)中的性能仍显著下降至接近随机水平,且模型表现高度依赖提示(prompt)设计,微小变化即可导致准确率和拒绝率剧烈波动。其关键解决方案在于引入基于描述的分步推理管道:先由VLM生成图像描述,再交由纯文本模型(GPT-5.1)进行诊断,以此测试封闭形式视觉问答(closed-form VQA)是否会抑制潜在知识。结果表明,该方法虽能恢复有限额外信号,但整体性能仍受限于任务难度本身,揭示出当前医疗VLM的脆弱性源于视觉表示能力不足与下游推理机制双重缺陷。

链接: https://arxiv.org/abs/2604.09841
作者: Oliver McLaughlin,Daniel Shubin,Carsten Eickhoff,Ritambhara Singh,William Rudman,Michal Golovanevsky
机构: Brown University (布朗大学); University of Washington (华盛顿大学); University of Tübingen (图宾根大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) are increasingly adapted through domain-specific fine-tuning, yet it remains unclear whether this improves reasoning beyond superficial visual cues, particularly in high-stakes domains like medicine. We evaluate four paired open-source VLMs (LLaVA vs. LLaVA-Med; Gemma vs. MedGemma) across four medical imaging tasks of increasing difficulty: brain tumor, pneumonia, skin cancer, and histopathology classification. We find that performance degrades toward near-random levels as task difficulty increases, indicating limited clinical reasoning. Medical fine-tuning provides no consistent advantage, and models are highly sensitive to prompt formulation, with minor changes causing large swings in accuracy and refusal rates. To test whether closed-form VQA suppresses latent knowledge, we introduce a description-based pipeline where models generate image descriptions that a text-only model (GPT-5.1) uses for diagnosis. This recovers a limited additional signal but remains bounded by task difficulty. Analysis of vision encoder embeddings further shows that failures stem from both weak visual representations and downstream reasoning. Overall, medical VLM performance is fragile, prompt-dependent, and not reliably improved by domain-specific fine-tuning.

[CV-271] Vector Field Synthesis with Sparse Streamlines Using Diffusion Model IEEE-VIS2025

【速读】:该论文旨在解决从稀疏且连贯的输入(如流线)中合成二维向量场的问题,同时确保生成结果符合物理规律。传统优化方法在灵活性和物理一致性方面存在局限,而本文提出了一种基于扩散机制的框架,其关键在于采用条件去噪扩散概率模型(conditional denoising diffusion probabilistic model)并结合无分类器引导(classifier-free guidance),实现了逐步重建过程,在保持几何结构的同时严格遵守物理约束,从而显著提升了向量场合成的合理性与准确性。

链接: https://arxiv.org/abs/2604.09838
作者: Nguyen K. Phan,Ricardo Morales,Sebastian D. Espriella,Guoning Chen
机构: University of Houston (休斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures; published at IEEE VIS 2025

点击查看摘要

Abstract:We present a novel diffusion-based framework for synthesizing 2D vector fields from sparse, coherent inputs (i.e., streamlines) while maintaining physical plausibility. Our method employs a conditional denoising diffusion probabilistic model with classifier-free guidance, enabling progressive reconstruction that preserves both geometric and physical constraints. Experimental results demonstrate our method’s ability to synthesize plausible vector fields that adhere to physical laws while maintaining fidelity to sparse input observations, outperforming traditional optimization-based approaches in terms of flexibility and physical consistency.

[CV-272] F3G-Avatar : Face Focused Full-body Gaussian Avatar CVPR

【速读】:该论文旨在解决现有全身体 Gaussian 人像方法在重建过程中难以保留面部细微几何结构与表情细节的问题,其根源在于面部表征能力有限,导致无法有效建模高频的、依赖姿态的形变。解决方案的关键在于提出 F3G-Avatar 方法,该方法基于一个穿衣服的 Momentum Human Rig (MHR) 模板,采用双分支架构:一个主体分支用于捕捉依赖姿态的非刚性形变,另一个聚焦面部的形变分支专门优化头部几何与外观;两个分支生成的 3D 高斯(3D Gaussians)被融合后通过线性混合皮肤(Linear Blend Skinning, LBS)绑定,并利用可微分高斯点绘(Differentiable Gaussian Splatting)进行渲染,同时引入针对人脸的对抗损失以提升近距离视角下的真实感。

链接: https://arxiv.org/abs/2604.09835
作者: Willem Menu,Erkut Akdag,Pedro Quesado,Yasaman Kashefbahrami,Egor Bondarev
机构: Eindhoven University of Technology (埃因霍温理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPRW 3DMV, 10 pages

点击查看摘要

Abstract:Existing full-body Gaussian avatar methods primarily optimize global reconstruction quality and often fail to preserve fine-grained facial geometry and expression details. This challenge arises from limited facial representational capacity that causes difficulties in modeling high-frequency pose-dependent deformations. To address this, we propose F3G-Avatar, a full-body, face-aware avatar synthesis method that reconstructs animatable human representations from multi-view RGB video and regressed pose/shape parameters. Starting from a clothed Momentum Human Rig (MHR) template, front/back positional maps are rendered and decoded into 3D Gaussians through a two-branch architecture: a body branch that captures pose-dependent non-rigid deformations and a face-focused deformation branch that refines head geometry and appearance. The predicted Gaussians are fused, posed with linear blend skinning (LBS), and rendered with differentiable Gaussian splatting. Training combines reconstruction and perceptual objectives with a face-specific adversarial loss to enhance realism in close-up views. Experiments demonstrate strong rendering quality, with face-view performance reaching PSNR/SSIM/LPIPS of 26.243/0.964/0.084 on the AvatarReX dataset. Ablations further highlight contributions of the MHR template and the face-focused deformation. F3G-Avatar provides a practical, high-quality pipeline for realistic, animatable full-body avatar synthesis.

[CV-273] ACCIDENT: A Benchmark Dataset for Vehicle Accident Detection from Traffic Surveillance Videos

【速读】:该论文旨在解决交通事故检测(Traffic Accident Detection)在闭路电视(CCTV)视频中的准确识别问题,尤其关注在数据丰富(IID)与数据稀缺(OOD)场景下的模型泛化能力,以及零样本(zero-shot)场景下的迁移性能。其解决方案的关键在于构建了一个名为ACCIDENT的基准数据集,该数据集包含2,027个真实和2,211个合成视频片段,每段视频均标注了事故时间、空间位置及高阶碰撞类型,并定义了三个核心任务:事故的时间定位、空间定位与碰撞类型分类。通过定制化的评估指标考虑CCTV视频固有的不确定性与模糊性,该基准能够全面衡量模型在不同设置下的表现,同时提供了包括启发式、运动感知和视觉-语言等多种基线方法,验证了该任务的挑战性。

链接: https://arxiv.org/abs/2604.09819
作者: Lukas Picek,Michal Čermák,Marek Hanzl,Vojtěch Čermák
机构: PiVa AI( PiVa AI); MIT(麻省理工学院); University of West Bohemia in Pilsen(皮尔森西波希米亚大学); Czech Technical University in Prague(布拉格捷克技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce ACCIDENT, a benchmark dataset for traffic accident detection in CCTV footage, designed to evaluate models in supervised (IID and OOD) and zero-shot settings, reflecting both data-rich and data-scarce scenarios. The benchmark consists of a curated set of 2,027 real and 2,211 synthetic clips annotated with the accident time, spatial location, and high-level collision type. We define three core tasks: (i) temporal localization of the accident, (ii) its spatial localization, and (iii) collision type classification. Each task is evaluated using custom metrics that account for the uncertainty and ambiguity inherent in CCTV footage. In addition to the benchmark, we provide a diverse set of baselines, including heuristic, motion-aware, and vision-language approaches, and show that ACCIDENT is challenging. You can access the ACCIDENT at: this https URL

[CV-274] RobustMedSAM: Degradation-Resilient Medical Image Segmentation via Robust Foundation Model Adaptation

【速读】:该论文旨在解决医学图像分割模型在真实世界图像退化(如噪声、模糊、运动伪影及模态特异性失真)下性能下降的问题,现有方法通常仅关注医学领域适应或退化鲁棒性,难以同时兼顾二者。解决方案的关键在于识别出Segment Anything Model (SAM) 中互补的模块功能:图像编码器保留医学先验知识,而掩码解码器决定退化鲁棒性;进而提出RobustMedSAM,通过模块级检查点融合策略,将MedSAM的图像编码器与RobustSAM的掩码解码器结合,并在35个医学数据集上仅微调掩码解码器,从而在保持预训练医学表示的同时显著提升对多种退化类型的鲁棒性,实验表明其在退化图像上的Dice分数从0.613提升至0.719。

链接: https://arxiv.org/abs/2604.09814
作者: Jieru Li,Matthew Chen,Micky C. Nnamdi,J. Ben Tamo,Benoit L. Marteau,May D. Wang
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 9 figures

点击查看摘要

Abstract:Medical image segmentation models built on Segment Anything Model (SAM) achieve strong performance on clean benchmarks, yet their reliability often degrades under realistic image corruptions such as noise, blur, motion artifacts, and modality-specific distortions. Existing approaches address either medical-domain adaptation or corruption robustness, but not both jointly. In SAM, we find that these capabilities are concentrated in complementary modules: the image encoder preserves medical priors, while the mask decoder governs corruption robustness. Motivated by this observation, we propose RobustMedSAM, which adopts module-wise checkpoint fusion by initializing the image encoder from MedSAM and the mask decoder from RobustSAM under a shared ViT-B architecture. We then fine-tune only the mask decoder on 35 medical datasets from MedSegBench, spanning six imaging modalities and 12 corruption types, while freezing the remaining components to preserve pretrained medical representations. We additionally investigate an SVD-based parameter-efficient variant for limited encoder adaptation. Experiments on both in-distribution and out-of-distribution benchmarks show that RobustMedSAM improves degraded-image Dice from 0.613 to 0.719 (+0.106) over SAM, demonstrating that structured fusion of complementary pretrained models is an effective and practical approach for robust medical image segmentation.

[CV-275] Biomarker-Based Pretraining for Chagas Disease Screening in Electrocardiograms

【速读】:该论文旨在解决基于心电图(ECG)筛查恰加斯病(Chagas disease)时,现有数据集标签稀缺且噪声较大的问题。解决方案的关键在于提出一种基于生物标志物的预训练方法:首先在MIMIC-IV-ECG数据集上训练一个ECG特征提取器,使其能够预测分位数分箱的血液生物标志物;随后将预训练模型在巴西数据集上微调,用于恰加斯病检测。该方法有效利用了多源数据中的生物标志物信息,提升了模型在小样本和噪声标签场景下的泛化能力。

链接: https://arxiv.org/abs/2604.09782
作者: Elias Stenhede,Arian Ranjbar
机构: Akershus University Hospital (阿克什胡斯大学医院); University of Oslo (奥斯陆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Chagas disease screening via ECGs is limited by scarce and noisy labels in existing datasets. We propose a biomarker-based pretraining approach, where an ECG feature extractor is first trained to predict percentile-binned blood biomarkers from the MIMIC-IV-ECG dataset. The pretrained model is then fine-tuned on Brazilian datasets for Chagas detection. Our 5-model ensemble, developed by the Ahus AIM team, achieved a challenge score of 0.269 on the hidden test set, ranking 5th in Detection of Chagas Disease from the ECG: The George B. Moody PhysioNet Challenge 2025. Source code and the model are shared on GitHub: this http URL

[CV-276] xt-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在三维场景中对目标物体进行文本引导的6D姿态预测时表现不佳的问题,即VLMs难以准确推断出与文本指令一致的目标物体在3D空间中的位置和朝向(6D pose)。其解决方案的关键在于引入一种闭环推理机制:通过迭代式交互——观察当前场景、评估是否符合指令、提出姿态更新、应用更新并重新渲染场景——使VLM在不进行额外微调或模块扩展的情况下,有效模拟智能体行为。此外,论文提出三种关键的推理期技术:多视角推理与支持视角选择、以物体为中心的坐标系可视化以及单轴旋转预测,共同显著提升了VLM在复杂3D场景下对文本引导目标姿态的理解能力。

链接: https://arxiv.org/abs/2604.09781
作者: Sangwon Baik,Gunhee Kim,Mingi Choi,Hanbyul Joo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) exhibit strong visual reasoning capabilities, yet they still struggle with 3D understanding. In particular, VLMs often fail to infer a text-consistent goal 6D pose of a target object in a 3D scene. However, we find that with some inference-time techniques and iterative reasoning, VLMs can achieve dramatic performance gains. Concretely, given a 3D scene represented by an RGB-D image (or a compositional scene of 3D meshes) and a text instruction specifying a desired state change, we repeat the following loop: observe the current scene; evaluate whether it is faithful to the instruction; propose a pose update for the target object; apply the update; and render the updated scene. Through this closed-loop interaction, the VLM effectively acts as an agent. We further introduce three inference-time techniques that are essential to this closed-loop process: (i) multi-view reasoning with supporting view selection, (ii) object-centered coordinate system visualization, and (iii) single-axis rotation prediction. Without any additional fine-tuning or new modules, our approach surpasses prior methods at predicting the text-guided goal 6D pose of the target object. It works consistently across both closed-source and open-source VLMs. Moreover, when combining our 6D pose prediction with simple robot motion planning, it enables more successful robot manipulation than existing methods. Finally, we conduct an ablation study to demonstrate the necessity of each proposed technique.

[CV-277] MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

【速读】:该论文旨在解决当前医疗视觉语言模型(Medical Vision-Language Models, VLMs)在医学视觉问答(Medical Visual Question Answering, VQA)任务中推理过程过度依赖文本、难以有效保留和利用局部视觉证据的问题。现有方法通常将图像编码为静态上下文,导致细微但关键的诊断性视觉信息在推理过程中丢失,限制了临床场景下的准确性与可靠性。其解决方案的关键在于提出一种潜在视觉推理框架(Latent Visual Reasoning, MedLVR),通过在自回归解码过程中引入显式的视觉证据状态(visual evidence state),并利用隐藏状态重用机制实现连续的潜在推理步骤,从而迭代地保持和优化与查询相关的视觉证据;同时采用两阶段训练策略——基于感兴趣区域(Region of Interest, ROI)监督微调对齐潜在状态与临床相关视觉证据,并结合视觉潜在策略优化(Visual-Latent Policy Optimization, VLPO)在结果级奖励下进一步优化推理路径与答案生成,显著提升了医学VQA的性能与可解释性。

链接: https://arxiv.org/abs/2604.09757
作者: Suyang Xi,Songtao Hu,Yuxiang Lai,Wangyun Dan,Yaqi Liu,Shansong Wang,Xiaofeng Yang
机构: Emory University School of Medicine (埃默里大学医学院); Emory University (埃默里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical vision–language models (VLMs) have shown strong potential for medical visual question answering (VQA), yet their reasoning remains largely text-centric: images are encoded once as static context, and subsequent inference is dominated by language. This paradigm is fundamentally limited in clinical scenarios, where accurate answers often depend on subtle, localized visual evidence that cannot be reliably preserved in static embeddings. We propose \textscMedLVR, a latent visual reasoning framework that introduces an explicit visual evidence state into autoregressive decoding. Instead of relying solely on text-based intermediate reasoning, \textscMedLVR interleaves a short latent reasoning segment within the decoder by reusing hidden states as continuous latent steps, enabling iterative preservation and refinement of query-relevant visual evidence before answer generation. To support effective visual supervision, we adopt a two-stage training strategy: region of interest (ROI)-supervised fine-tuning aligns latent states with clinically relevant image evidence, and Visual-Latent Policy Optimization (VLPO) further optimizes latent reasoning and answer generation under outcome-level rewards. Experiments on OmniMedVQA and five external medical VQA benchmarks show that \textscMedLVR consistently outperforms recent reasoning baselines and improves the average score over the Qwen2.5-VL-7B backbone from 48.3% to 53.4%. These results show that latent visual reasoning provides an effective mechanism for preserving diagnostically relevant visual evidence and improving the reliability of medical VQA.

[CV-278] See Fair Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在图像或视频理解中频繁产生视觉输入中并不存在的对象幻觉(object hallucination)的问题。研究表明,这种幻觉现象的根本原因在于解码过程中注意力分配不均:视觉显著性高或常见内容会过度占据注意力,而稀有、细小或语境边缘的物体则因关注不足而无法被正确建模。为此,作者提出了一种无需训练且与架构无关的解码策略——DOP-OBC,其核心在于通过公平的注意力分配实现更忠实的多模态生成。该方案的关键创新在于引入两个互补的对象感知信号:主导对象惩罚(Dominant Object Penalty, DOP)用于软性抑制对视觉主导区域的注意力过度集中,以及异常值增强系数(Outlier Boost Coefficient, OBC)用于提升对罕见但检测置信度高的对象的关注。这两个信号以逐行logit调制的形式嵌入因果注意力掩码中,无需参数更新即可保持自回归解码特性,在CHAIR和POPE等基准上显著减少对象幻觉,并提升GPT-4o评估的图文描述质量。

链接: https://arxiv.org/abs/2604.09749
作者: Mohammad Anas Azeez,Ankan Deria,Zohaib Hasan Siddiqui,Adinath Madhavrao Dukre,Rafiq Ali,Sara Atito,Yutong Xie,Imran Razzak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) frequently hallucinate objects that are absent from the visual input, often because attention during decoding is disproportionately drawn to visually dominant or frequently occurring content. We observe that this inequity in attention allocation is a root cause of object hallucination: when rare, small, or contextually peripheral objects receive insufficient attention, the model fails to ground its generation in the full visual scene. We argue that every object in an image, regardless of its size, frequency or visual salience, deserves equal representational opportunity during decoding. To this end, we propose DOP-OBC, a training-free and architecture-agnostic decoding strategy built on the principle of equitable attention. Two complementary object-aware signals work in tandem: a Dominant Object Penalty (DOP) that softly suppresses attention over-concentration on visually dominant regions, and an Outlier Boost Coefficient (OBC) that amplifies attention toward rare yet confidently detected objects. These signals are injected as per-row logit modulations within the causal attention mask, requiring no weight updates and preserving autoregressive decoding properties. Extensive experiments across image and video MLLMs demonstrate consistent reductions in object hallucination on CHAIR and POPE benchmarks, alongside improvements in GPT-4o assessed captioning quality across correctness, consistency, detail, context and temporal dimensions. DOP-OBC establishes that fairness in attention allocation is not merely a design principle but a practical and effective path toward more faithful multimodal generation.

[CV-279] Efficient Matrix Implementation for Rotary Position Embedding

【速读】:该论文旨在解决旋转位置编码(Rotary Position Embedding, RoPE)在现代Transformer架构中因依赖向量级拆分与合并操作而引入的显著计算开销问题,尤其是在多维场景(如2D和3D RoPE)下,这种开销会进一步加剧硬件利用率下降。解决方案的关键在于提出RoME(Rotary Matrix position Embedding),这是一种数学等价但计算更高效的RoPE重构形式,通过将原有的向量操作替换为统一的矩阵变换,消除了维度相关的特异性操作,简化了实现,并支持在现代NPU上跨Cube和Vector单元的融合并行执行,从而在算子级和模型级均实现显著加速。

链接: https://arxiv.org/abs/2604.09742
作者: Chen Minqi,Zhongqi Yue,Shihao Zhang,Yun Xu,Peng Wu,kaixiang Xu,Zeyi Huang,Hanwang Zhang
机构: Huawei Technologies(华为技术); Nanyang Technological University(南洋理工大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Rotary Position Embedding (RoPE) has become a core component of modern Transformer architectures across language, vision, and 3D domains. However, existing implementations rely on vector-level split and merge operations that introduce non-negligible computational overhead, often overlooked in attention optimization. The problem is further amplified in multi-dimensional settings (e.g., 2D and 3D RoPE), where additional vector operations and uneven feature partitions degrade hardware utilization. To overcome these limitations, we propose RoME (Rotary Matrix position Embedding), a mathematically equivalent yet computationally efficient reformulation of RoPE that replaces vector operations with unified matrix transformations. RoME eliminates dimension-specific operations, simplifies implementation, and enables fused parallel execution across Cube and Vector units on modern NPUs. Experiments show that RoME delivers substantial acceleration at both the operator and full-model levels. The implementation is available at this https URL.

[CV-280] Multi-Frequency Local Plasticity for Visual Representation Learning

【速读】:该论文旨在解决在视觉识别任务中,当缺乏端到端梯度反向传播(end-to-end gradient-based representation learning)时,如何通过结构化的架构先验(structured architectural bias)来补偿表示学习能力的问题。其核心解决方案是构建一个混合系统:采用固定多频Gabor分解(multi-frequency Gabor decomposition)生成7个并行流,每个流内使用Hebbian与Oja更新及anti-Hebbian去相关机制实现局部竞争性学习;引入受现代Hopfield网络启发的关联记忆模块进行信息存储与检索;并通过局部预测与重构信号实现迭代式自上而下的调制。整个模型的表征层不依赖全局梯度传播训练,仅最终线性读出层和自上而下投影矩阵通过梯度下降优化,从而形成主要由局部训练驱动、辅以少量梯度微调的混合架构。实验表明,在CIFAR-10上达到80.1% top-1准确率,显著优于纯Hebbian基线(71.0%),接近全梯度训练模型(83.4%),验证了精心设计的架构先验可恢复大部分传统端到端训练性能。

链接: https://arxiv.org/abs/2604.09734
作者: Mehdi Fatan Serj,C. Alejandro Parraga,Xavier Otazu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study how far structured architectural bias can compensate for the absence of end-to-end gradient-based representation learning in visual recognition. Building on the VisNet tradition, we introduce a modular hierarchical framework combining: (i) fixed multi-frequency Gabor decomposition into F=7 parallel streams; (ii) within-stream competitive learning with Hebbian and Oja updates and anti-Hebbian decorrelation; (iii) an associative memory module inspired by modern Hopfield retrieval; and (iv) iterative top-down modulation using local prediction and reconstruction signals. Representational layers are trained without end-to-end backpropagation through the full hierarchy; only the final linear readout and top-down projection matrices are optimized by gradient descent. We therefore interpret the model as a hybrid system that is predominantly locally trained but includes a small number of gradient-trained parameters. On CIFAR-10, the full model reaches 80.1% +/- 0.3% top-1 accuracy, linear probe), compared with 71.0% for a Hebbian-only baseline and 83.4% for a gradient-trained model on the same fixed Gabor basis. On CIFAR-100, performance is 54.8%. Factorial analysis indicates that multi-frequency streams, associative memory, and top-down feedback contribute largely additively, with a significant Streams x TopDown interaction (p=0.02). These results suggest that carefully chosen architectural priors can recover a substantial fraction of the performance typically associated with global gradient training, while leaving a measurable residual gap. Experiments are limited to CIFAR-10/100. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.09734 [cs.CV] (or arXiv:2604.09734v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.09734 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Mehdi Fatan Serj [view email] [v1] Thu, 9 Apr 2026 18:30:47 UTC (8,384 KB)

[CV-281] LOLGORITHM: Funny Comment Generation Agent For Short Videos

【速读】:该论文旨在解决短视频平台中评论生成缺乏真实性的问题,即现有方法(如视频摘要和直播弹幕生成)难以产出符合平台特定文化与语言规范的评论。解决方案的关键在于提出LOALGORITHM框架,这是一个模块化多智能体系统,包含三个核心模块:视频内容摘要、视频分类以及结合语义检索与热门梗增强的评论生成机制;该框架支持六种可控评论风格,并通过构建跨平台(YouTube与Douyin)的双语数据集进行验证,实验表明其在人类偏好评估中显著优于基线方法,且性能提升源于架构设计而非基础大模型选择,体现了方法的鲁棒性与通用性。

链接: https://arxiv.org/abs/2604.09729
作者: Xuan Ouyang,Senan Wang,Bouzhou Wang,Siyuan Xiahou,Jinrong Zhou,Yuekang Li
机构: University of New South Wales (新南威尔士大学); University of Sydney (悉尼大学); The University of Hong Kong (香港大学); University of Southern California (南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Short-form video platforms have become central to multimedia information dissemination, where comments play a critical role in driving engagement, propagation, and algorithmic feedback. However, existing approaches – including video summarization and live-streaming danmaku generation – fail to produce authentic comments that conform to platform-specific cultural and linguistic norms. In this paper, we present LOLGORITHM, a novel modular multi-agent framework for stylized short-form video comment generation. LOLGORITHM supports six controllable comment styles and comprises three core modules: video content summarization, video classification, and comment generation with semantic retrieval and hot meme augmentation. We further construct a bilingual dataset of 3,267 videos and 16,335 comments spanning five high-engagement categories across YouTube and Douyin. Evaluation combining automatic scoring and large-scale human preference analysis demonstrates that LOLGORITHM consistently outperforms baseline methods, achieving human preference selection rates of 80.46% on YouTube and 84.29% on Douyin across 107 respondents. Ablation studies confirm that these gains are attributable to the framework architecture rather than the choice of backbone LLM, underscoring the robustness and generalizability of our approach.

[CV-282] Data-Driven Automated Identification of Optimal Feature-Representative Images in Infrared Thermography Using Statistical and Morphological Metrics

【速读】:该论文旨在解决红外热成像(Infrared Thermography, IRT)后处理中缺陷可视化图像识别困难的问题,即现有方法生成的图像序列在时间、频率或系数域中缺陷可见度波动大,且传统评估指标(如信噪比 SNR 或 Tanimoto 准则)依赖缺陷位置先验信息或无缺陷参考区域,难以实现自动化和无监督分析。解决方案的关键在于提出一种数据驱动的方法,基于三个互补指标:混合均匀性指数(Homogeneity Index of Mixture, HI),用于量化局部强度分布相对于全局参考分布的统计异质性;代表性基本区域(Representative Elementary Area, REA),通过二维 Minkowski 功能函数对代表性体积概念的扩展来表征图像空间代表性;以及几何拓扑总变差能量指数(Total Variation Energy, TVE),基于二维 Minkowski 功能函数增强对局部异常的敏感性。该框架无需空间先验信息即可实现对缺陷相关图像的鲁棒且无偏排序,从而支持 IRT 中自动化缺陷导向图像选择。

链接: https://arxiv.org/abs/2604.09728
作者: Harutyun Yagdjian,Martin Gurka
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Applied Physics (physics.app-ph); Data Analysis, Statistics and Probability (physics.data-an)
备注: 21 pages + 4 Appendix, 13 figures

点击查看摘要

Abstract:Infrared thermography (IRT) is a widely used non-destructive testing technique for detecting structural features such as subsurface defects. However, most IRT post-processing methods generate image sequences in which defect visibility varies strongly across time, frequency, or coefficient/index domains, making the identification of defect-representative images a critical challenge. Conventional evaluation metrics, such as the signal-to-noise ratio (SNR) or the Tanimoto criterion, often require prior knowledge of defect locations or defect-free reference regions, limiting their suitability for automated and unsupervised analysis. In this work, a data-driven methodology is proposed to identify images within IRT datasets that are most likely to contain and represent structural features, particularly anomalies and defects, without requiring prior spatial information. The approach is based on three complementary metrics: the Homogeneity Index of Mixture (HI), which quantifies statistical heterogeneity via deviations of local intensity distributions from a global reference distribution; a Representative Elementary Area (REA), derived from a Minkowski-functional adaptation of the Representative Elementary Volume concept to two-dimensional images; and a geometrical-topological Total Variation Energy (TVE) index, also based on two-dimensional Minkowski functionals, designed to improve sensitivity to localized anomalies. The framework is validated experimentally using pulse-heated IRT data from a carbon fiber-reinforced polymer (CFRP) plate containing six artificial defects at depths between 0.135 mm and 0.810 mm, and is further supported by one-dimensional N-layer thermal model simulations. The results demonstrate robust and unbiased ranking of image sequences and provide a reliable basis for automated defect-oriented image selection in IRT.

[CV-283] Multi-Head Attention based interaction-aware architecture for Bangla Handwritten Character Recognition: Introducing a Primary Dataset

【速读】:该论文旨在解决手写孟加拉语字符识别(Handwritten Bangla Character Recognition)中的难题,主要包括字符书写风格多样、笔画模式不一致、视觉相似度高以及现有数据集在类内差异和类别分布上的不平衡问题。为应对这些挑战,研究者构建了一个包含78个类别的平衡数据集,涵盖基础字符、复合字符(Juktobarno)及数字,每类约650个样本,并覆盖不同年龄层与社会经济背景的书写者(包括左右手书写者)。解决方案的关键在于提出一种交互感知的混合深度学习架构,该架构并行集成EfficientNetB3、Vision Transformer和Conformer模块,并通过多头交叉注意力融合机制实现跨组件的有效特征交互,从而在自建数据集上达到98.84%的准确率,在外部CHBCR基准测试中达96.49%,展现出优异的泛化能力。

链接: https://arxiv.org/abs/2604.09717
作者: Mirza Raquib,Asif Pervez Polok,Kedar Nath Biswas,Farida Siddiqi Prity,Saydul Akbar Murad,Nick Rahimi
机构: International Islamic University Chittagong (国际伊斯兰大学吉大港分校); mPower Social Enterprise (mPower 社会企业); Noakhali Science and Technology University (诺阿哈尔科技大学); Netrokona University (内特罗纳大学); University of Southern Mississippi (南密西西比大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Character recognition is the fundamental part of an optical character recognition (OCR) system. Word recognition, sentence transcription, document digitization, and language processing are some of the higher-order activities that can be done accurately through character recognition. Nonetheless, recognizing handwritten Bangla characters is not an easy task because they are written in different styles with inconsistent stroke patterns and a high degree of visual character resemblance. The datasets available are usually limited in intra-class and inequitable in class distribution. We have constructed a new balanced dataset of Bangla written characters to overcome those problems. This consists of 78 classes and each class has approximately 650 samples. It contains the basic characters, composite (Juktobarno) characters and numerals. The samples were a diverse group comprising a large age range and socioeconomic groups. Elementary and high school students, university students, and professionals are the contributing factors. The sample also has right and left-handed writers. We have further proposed an interaction-aware hybrid deep learning architecture that integrates EfficientNetB3, Vision Transformer, and Conformer modules in parallel. A multi-head cross-attention fusion mechanism enables effective feature interaction across these components. The proposed model achieves 98.84% accuracy on the constructed dataset and 96.49% on the external CHBCR benchmark, demonstrating strong generalization capability. Grad-CAM visualizations further provide interpretability by highlighting discriminative regions. The dataset and source code of this research is publicly available at: this https URL.

[CV-284] raining Deep Visual Networks Beyond Loss and Accuracy Through a Dynamical Systems Approach

【速读】:该论文试图解决的问题是:深度视觉识别模型在训练过程中,其内部表示如何演变这一问题。传统评估指标(如损失和准确率)虽能反映模型性能的提升,却难以揭示网络层激活状态随训练epoch变化的动态特性。解决方案的关键在于引入动力系统理论框架,通过分析层激活信号来量化网络的动态行为,具体包括三个核心指标:整合度(integration score),用于衡量跨层的长期协调性;亚稳态度(metastability score),反映网络在同步与非同步状态间切换的灵活性;以及综合动力稳定性指数(dynamical stability index)。这些指标从新的视角揭示了不同模型架构和数据集下训练过程中的规律性模式,为理解深度神经网络训练机制提供了可量化的动力学视角。

链接: https://arxiv.org/abs/2604.09716
作者: Hai La Quang,Hassan Ugail,Newton Howard,Cong Tran Tien,Nam Vu Hoai,Hung Nguyen Viet
机构: Posts and Telecommunications Institute of Technology (越南邮电研究所); Centre for Visual Computing and Intelligent Systems (视觉计算与智能系统中心); University of Bradford (布拉德福德大学); School of Individualized Study (个性化学习学院); Rochester Institute of Technology (罗切斯特理工学院); University of Bradford (布拉德福德大学); Posts and Telecommunications Institute of Technology (越南邮电研究所); Posts and Telecommunications Institute of Technology (越南邮电研究所); Posts and Telecommunications Institute of Technology (越南邮电研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep visual recognition models are usually trained and evaluated using metrics such as loss and accuracy. While these measures show whether a model is improving, they reveal very little about how its internal representations change during training. This paper introduces a complementary way to study that process by examining training through the lens of dynamical systems. Drawing on ideas from signal analysis originally used to study biological neural activity, we define three measures from layer activations collected across training epochs: an integration score that reflects long-range coordination across layers, a metastability score that captures how flexibly the network shifts between more and less synchronised states, and a combined dynamical stability index. We apply this framework to nine combinations of model architecture and dataset, including several ResNet variants, DenseNet-121, MobileNetV2, VGG-16, and a pretrained Vision Transformer on CIFAR-10 and CIFAR-100. The results suggest three main patterns. First, the integration measure consistently distinguishes the easier CIFAR-10 setting from the more difficult CIFAR-100 setting. Second, changes in the volatility of the stability index may provide an early sign of convergence before accuracy fully plateaus. Third, the relationship between integration and metastability appears to reflect different styles of training behaviour. Overall, this study offers an exploratory but promising new way to understand deep visual training beyond loss and accuracy.

[CV-285] MuPPet: Multi-person 2D-to-3D Pose Lifting CVPR

【速读】:该论文旨在解决多人群体场景下2D到3D人体姿态提升(2D-to-3D pose lifting)中忽视个体间关系及难以适应不同群体规模的问题。现有方法往往无法有效建模多人之间的交互依赖,导致在复杂社交场景中的姿态估计性能受限。其解决方案的关键在于提出MuPPet框架,通过引入Person Encoding对个体表征进行结构化建模、Permutation Augmentation增强训练多样性,并设计Dynamic Multi-Person Attention机制以自适应地捕捉人与人之间的动态相关性,从而显著提升多人群体3D姿态估计的准确性与鲁棒性,尤其在遮挡情况下表现更优。

链接: https://arxiv.org/abs/2604.09715
作者: Thomas Markhorst,Zhi-Yi Lin,Jouh Yeong Chew,Jan van Gemert,Xucong Zhang
机构: Delft University of Technology (代尔夫特理工大学); Honda Research Institute Japan (本田研究 institute日本)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted at CVPRw 2026

点击查看摘要

Abstract:Multi-person social interactions are inherently built on coherence and relationships among all individuals within the group, making multi-person localization and body pose estimation essential to understanding these social dynamics. One promising approach is 2D-to-3D pose lifting which provides a 3D human pose consisting of rich spatial details by building on the significant advances in 2D pose estimation. However, the existing 2D-to-3D pose lifting methods often neglect inter-person relationships or cannot handle varying group sizes, limiting their effectiveness in multi-person settings. We propose MuPPet, a novel multi-person 2D-to-3D pose lifting framework that explicitly models inter-person correlations. To leverage these inter-person dependencies, our approach introduces Person Encoding to structure individual representations, Permutation Augmentation to enhance training diversity, and Dynamic Multi-Person Attention to adaptively model correlations between individuals. Extensive experiments on group interaction datasets demonstrate MuPPet significantly outperforms state-of-the-art single- and multi-person 2D-to-3D pose lifting methods, and improves robustness in occlusion scenarios. Our findings highlight the importance of modeling inter-person correlations, paving the way for accurate and socially-aware 3D pose estimation. Our code is available at: this https URL

[CV-286] Zero-Shot Synthetic-to-Real Handwritten Text Recognition via Task Analogies

【速读】:该论文旨在解决生成式 AI (Generative AI) 领域中手写文本识别(Handwritten Text Recognition, HTR)模型在从合成手写数据迁移到真实手写数据时的泛化能力不足问题,尤其针对完全零样本(fully zero-shot)场景——即目标语言无任何真实样本可用的情况。其解决方案的关键在于:通过分析模型参数在源语言中从合成到真实手写数据分布变化的规律,学习一种可迁移的“参数修正”机制,并将该修正应用于新目标语言;当使用多源语言时,借助语言相似性对各源贡献进行加权融合,从而实现跨语言、无需目标域真实样本的高效适应。

链接: https://arxiv.org/abs/2604.09713
作者: Carlos Garrido-Munoz,Aniello Panariello,Silvia Cascianelli,Angelo Porrello,Simone Calderara,Jorge Calvo-Zaragoza,Rita Cucchiara
机构: University of Alicante, Spain; University of Modena and Reggio Emilia, Italy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Handwritten Text Recognition (HTR) models trained on synthetic handwriting often struggle to generalize to real text, and existing adaptation methods still require real samples from the target domain. In this work, we tackle the fully zero-shot synthetic-to-real generalization setting, where no real data from the target language is available. Our approach learns how model parameters change when moving from synthetic to real handwriting in one or more source languages and transfers this learned correction to new target languages. When using multiple sources, we rely on linguistic similarity to weigh their contrubition when combining them. Experiments across five languages and six architectures show consistent improvements over synthetic-only baselines and reveal that the transferred corrections benefit even languages unrelated to the sources.

[CV-287] LAST: Leverag ing Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理复杂几何布局时存在的幻觉和不精确问题,尤其是在缺乏结构化几何先验与空间约束的情况下,单纯依赖数据驱动的扩展难以有效建模空间推理能力。其解决方案的关键在于提出一个统一的工具增强型空间推理框架LAST(Language-Assisted Spatial Tooling),其中核心创新是引入可扩展的交互式沙盒LAST-Box,将异构、参数密集型视觉工具的调用抽象为原子指令和可复用的空间技能,并输出多模态提示(如标注图像和文本描述)供大语言模型直接消费;同时设计三阶段渐进式训练策略,引导模型从理解工具输出逐步过渡到熟练且自适应的工具调用,从而显著提升复杂空间任务中的推理性能。

链接: https://arxiv.org/abs/2604.09712
作者: Shi-Yu Tian,Zhi Zhou,Kun-Yang Yu,Ming Yang,Yang Chen,Ziqiao Shang,Lan-Zhe Guo,Yu-Feng Li
机构: National Key Laboratory for Novel Software Technology, Nanjing University (南京大学新型软件技术国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages

点击查看摘要

Abstract:Spatial reasoning is a cornerstone capability for intelligent systems to perceive and interact with the physical world. However, multimodal large language models (MLLMs) frequently suffer from hallucinations and imprecision when parsing complex geometric layouts. As data-driven scaling struggles to internalize structured geometric priors and spatial constraints, integrating mature, specialized vision models presents a compelling alternative. Despite its promise, applying this paradigm to spatial reasoning is hindered by two key challenges: The difficulty of invoking heterogeneous, parameter-rich tools, as well as the challenge of understanding and effectively leveraging their diverse low-level outputs (e.g., segmentation masks, depth maps) in high-level reasoning. To address these challenges, we propose LAST, a unified framework for tool-augmented spatial reasoning. LAST features an extensible interactive sandbox, termed LAST-Box, which abstracts heterogeneous tool invocations into atomic instructions and reusable spatial skills, returning multimodal hints (e.g., annotated images and textual descriptions) that can be directly consumed by LLMs. We further design a three-stage progressive training strategy that guides models from understanding tool outputs to proficient and adaptive tool invocation. Experiments on four datasets show that LAST-7B achieves around 20% performance gains over its backbone and outperforms strong proprietary closed-source LLMs, substantially enhancing reasoning on complex spatial tasks.

[CV-288] Robust Fair Disease Diagnosis in CT Images CVPR2026

【速读】:该论文旨在解决医学影像自动诊断中因数据集偏斜(skewed datasets)导致的模型性能在不同患者群体间不均衡的问题,尤其是当类别不平衡与群体代表性不足共存时,传统重加权或公平性修正方法难以有效应对。其解决方案的关键在于提出一种双层优化目标:在样本层面采用对数调整交叉熵损失(logit-adjusted cross-entropy loss),通过类频次动态调整决策边界以保证一致性;在群体层面引入条件风险价值(Conditional Value at Risk, CVaR)聚合机制,将优化压力集中于当前损失最高的群体,从而缓解复合失效模式。实验表明,该方法在Fair Disease Diagnosis基准上显著提升了性别平均宏F1分数并大幅降低公平性差距。

链接: https://arxiv.org/abs/2604.09710
作者: Justin Li,Daniel Ding,Asmita Yuki Pritha,Aryana Hou,Xin Wang,Shu Hu
机构: Carmel High School (卡梅尔高中); Capstone School Dhaka (达卡特许学校); Clarkstown High School South (克拉克斯顿南高中); University at Albany, State University of New York (纽约州立大学阿尔巴尼分校); Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages, 3 figures, 2 tables. Accepted at the 3rd Workshop on New Trends in AI-Generated Media and Security (AIMS) @ CVPR 2026

点击查看摘要

Abstract:Automated diagnosis from chest CT has improved considerably with deep learning, but models trained on skewed datasets tend to perform unevenly across patient demographics. However, the situation is worse than simple demographic bias. In clinical data, class imbalance and group underrepresentation often coincide, creating compound failure modes that neither standard rebalancing nor fairness corrections can fix alone. We introduce a two-level objective that targets both axes of this problem. Logit-adjusted cross-entropy loss operates at the sample level, shifting decision margins by class frequency with provable consistency guarantees. Conditional Value at Risk aggregation operates at the group level, directing optimization pressure toward whichever demographic group currently has the higher loss. We evaluate on the Fair Disease Diagnosis benchmark using a 3D ResNet-18 pretrained on Kinetics-400, classifying CT volumes into Adenocarcinoma, Squamous Cell Carcinoma, COVID-19, and Normal groups with patient sex annotations. The training set illustrates the compound problem concretely: squamous cell carcinoma has 84 samples total, 5 of them female. The combined loss reaches a gender-averaged macro F1 of 0.8403 with a fairness gap of 0.0239, a 13.3% improvement in score and 78% reduction in demographic disparity over the baseline. Ablations show that each component alone falls short. The code is publicly available at this https URL.

[CV-289] Orthogonal Quadratic Complements for Vision Transformer Feed-Forward Networks

【速读】:该论文旨在解决视觉 Transformer 中双线性前馈模块(bilinear feed-forward)在提升准确率时存在的两个核心问题:一是混淆了更强的二阶交互作用与冗余信息,二是未能有效利用辅助分支中未被主分支捕获的信息。其解决方案的关键在于提出正交二次补全(Orthogonal Quadratic Complements, OQC),通过构造一个低秩二次辅助分支,并在注入主分支前将其显式投影到主分支的正交补空间中,从而确保辅助信息是主隐藏表示之外的补充,而非重复内容。该方法显著提升了模型表征质量与分类性能,在 CIFAR-100 和 TinyImageNet 上均取得优于基线的稳定改进,且机制分析表明其有效降低了辅助与主分支之间的重叠,改善了特征几何结构和类别可分性。

链接: https://arxiv.org/abs/2604.09709
作者: Wang Zixian
机构: China Mobile Communications Group Shandong Co., Ltd. (中国移动通信集团山东有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent bilinear feed-forward replacements for vision transformers can substantially improve accuracy, but they often conflate two effects: stronger second-order interactions and increased redundancy relative to the main branch. We study a complementary design principle in which auxiliary quadratic features contribute only information not already captured by the dominant hidden representation. To this end, we propose Orthogonal Quadratic Complements (OQC), which construct a low-rank quadratic auxiliary branch and explicitly project it onto the orthogonal complement of the main branch before injection. We further study an efficient low-rank realization (OQC-LR) and gated extensions (OQC-static and OQC-dynamic). Under a parameter-matched Deep-ViT and CIFAR-100 protocol with a fixed penultimate residual readout, full OQC improves an AFBO baseline from 64.25 +/- 0.22 to 65.59 +/- 0.22, while OQC-LR reaches 65.52 +/- 0.25 with a substantially better speed-accuracy tradeoff. On TinyImageNet, the gated extension OQC-dynamic achieves 51.88 +/- 0.32, improving the baseline (50.45 +/- 0.21) by 1.43 points and outperforming all ungated variants. Mechanism analyses show near-zero post-projection auxiliary-main overlap together with improved representation geometry and class separation. The full family, including both ungated and gated variants, generalizes consistently across both datasets. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.09709 [cs.CV] (or arXiv:2604.09709v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.09709 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-290] he Deployment Gap in AI Media Detection: Platform-Aware and Visually Constrained Adversarial Evaluation CVPR

【速读】:该论文旨在解决当前AI媒体检测模型在实验室环境下表现优异(如AUC ≈ 0.99),但在真实部署场景中性能显著下降的问题,即“部署差距”(deployment gap)。其核心挑战在于现实世界中生成图像常经历重缩放、压缩、重新编码及截图类失真等平台变换(platform-aware transforms),而传统评估未充分考虑这些因素。解决方案的关键在于提出一种平台感知的对抗性评估框架(platform-aware adversarial evaluation framework),该框架显式建模上述部署变换,并将扰动限制在视觉上合理的“meme-style bands”( meme-style band constraints)而非全图噪声,从而更贴近实际攻击场景。实验表明,此威胁模型下检测器性能大幅下降,且存在通用扰动(universal perturbations),揭示了跨输入的共享脆弱方向,同时发现校准崩溃现象(calibration collapse),强调仅依赖干净数据上的准确率会严重高估实际可靠性。

链接: https://arxiv.org/abs/2604.09706
作者: Aishwarya Budhkar,Trishita Dhara,Siddhesh Sheth
机构: Indiana University (印第安纳大学); Upper Hand; Ace Rent a Car
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at CVPR AIMS 2026

点击查看摘要

Abstract:Recent AI media detectors report near-perfect performance under clean laboratory evaluation, yet their robustness under realistic deployment conditions remains underexplored. In practice, AI-generated images are resized, compressed, re-encoded, and visually modified before being shared on online platforms. We argue that this creates a deployment gap between laboratory robustness and real-world reliability. In this work, we introduce a platform-aware adversarial evaluation framework for AI media detection that explicitly models deployment transforms (e.g., resizing, compression, screenshot-style distortions) and constrains perturbations to visually plausible meme-style bands rather than full-image noise. Under this threat model, detectors achieving AUC \approx 0.99 in clean settings experience substantial degradation. Per-image platform-aware attacks reduce AUC to significantly lower levels and achieve high fake-to-real misclassification rates, despite strict visual constraints. We further demonstrate that universal perturbations exist even under localized band constraints, revealing shared vulnerability directions across inputs. Beyond accuracy degradation, we observe pronounced calibration collapse under attack, where detectors become confidently incorrect. Our findings highlight that robustness measured under clean conditions substantially overestimates deployment robustness. We advocate for platform-aware evaluation as a necessary component of future AI media security benchmarks and release our evaluation framework to facilitate standardized robustness assessment. Comments: Accepted at CVPR AIMS 2026 Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.09706 [cs.CV] (or arXiv:2604.09706v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.09706 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-291] Multi-Granularity Reasoning for Image Quality Assessment via Attribute-Aware Reinforcement Learning to Rank

【速读】:该论文旨在解决现有图像质量评估(IQA)方法在感知维度上的单一性问题,即当前基于强化学习排序(RL2R)的视觉-语言模型(VLM)仅能预测整体图像质量评分,而忽略了人类感知中多维属性(如清晰度、色彩保真度、噪声水平和构图美学)的复杂性。解决方案的关键在于提出MG-IQA(Multi-Granularity IQA),其核心创新包括:(1) 属性感知提示策略,引导VLM生成结构化的多属性推理;(2) 多维Thurstone奖励模型,用于计算各属性特定的保真度奖励以支持群体相对策略优化;(3) 跨域对齐机制,实现合成失真、真实失真及AI生成图像数据集间的稳定联合训练,无需感知尺度重校准。这一框架能够在单次推理中同时完成整体质量与细粒度属性评估,显著提升性能并生成可解释的人类对齐描述。

链接: https://arxiv.org/abs/2604.09704
作者: Xiangyong Chen,Xiaochuan Lin,Haoran Liu,Xuan Li,Yichen Su,Xiangwei Guo
机构: Henan Polytechnic University (河南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in reasoning-induced image quality assessment (IQA) have demonstrated the power of reinforcement learning to rank (RL2R) for training vision-language models (VLMs) to assess perceptual quality. However, existing approaches operate at a single granularity, predicting only an overall quality score, while overlooking the multi-dimensional nature of human quality perception, which encompasses attributes such as sharpness, color fidelity, noise level, and compositional aesthetics. In this paper, we propose MG-IQA (Multi-Granularity IQA), a multi-granularity reasoning framework that extends RL2R to jointly assess overall image quality and fine-grained quality attributes within a single inference pass. Our approach introduces three key innovations: (1) an attribute-aware prompting strategy that elicits structured multi-attribute reasoning from VLMs; (2) a multi-dimensional Thurstone reward model that computes attribute-specific fidelity rewards for group relative policy optimization; and (3) a cross-domain alignment mechanism that enables stable joint training across synthetic distortion, authentic distortion, and AI-generated image datasets without perceptual scale re-alignment. Extensive experiments on eight IQA benchmarks demonstrate that MG-IQA consistently outperforms state-of-the-art methods in both overall quality prediction (average SRCC improvement of 2.1%) and attribute-level assessment, while generating interpretable, human-aligned quality descriptions.

[CV-292] Identity-Aware U-Net: Fine-grained Cell Segmentation via Identity-Aware Representation Learning

【速读】:该论文旨在解决密集预测任务中对形状高度相似物体的精确分割问题,尤其在边界模糊、实例重叠及实例间视觉差异微弱等复杂场景下,传统分割模型因缺乏足够的判别能力而难以可靠区分目标对象与形态相近的干扰项。其解决方案的关键在于提出一种身份感知的U-Net(Identity-Aware U-Net, IAU-Net)框架,该框架在U-Net结构基础上引入一个辅助嵌入分支,从高层特征中学习具有判别性的身份表示(identity representations),同时主分支输出像素级掩码;此外,通过三元组度量学习(triplet-based metric learning)机制,使目标一致的嵌入向量聚集、与形态相似但非同一实例的难负样本分离,从而增强模型在轮廓或纹理几乎一致的对象间的区分能力,实现超越类别级分割的细粒度判别性能。

链接: https://arxiv.org/abs/2604.09702
作者: Rui Xiao
机构: South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Precise segmentation of objects with highly similar shapes remains a challenging problem in dense prediction, especially in scenarios with ambiguous boundaries, overlapping instances, and weak inter-instance visual differences. While conventional segmentation models are effective at localizing object regions, they often lack the discriminative capacity required to reliably distinguish a target object from morphologically similar distractors. In this work, we study fine-grained object segmentation from an identity-aware perspective and propose Identity-Aware U-Net (IAU-Net), a unified framework that jointly models spatial localization and instance discrimination. Built upon a U-Net-style encoder-decoder architecture, our method augments the segmentation backbone with an auxiliary embedding branch that learns discriminative identity representations from high-level features, while the main branch predicts pixel-accurate masks. To enhance robustness in distinguishing objects with near-identical contours or textures, we further incorporate triplet-based metric learning, which pulls target-consistent embeddings together and separates them from hard negatives with similar morphology. This design enables the model to move beyond category-level segmentation and acquire a stronger capability for precise discrimination among visually similar objects. Experiments on benchmarks including cell segmentation demonstrate promising results, particularly in challenging cases involving similar contours, dense layouts, and ambiguous boundaries.

[CV-293] PASTA: Vision Transformer Patch Aggregation for Weakly Supervised Target and Anomaly Segmentation

【速读】:该论文旨在解决在非结构化环境中检测未见过的异常(如工业材料回收中的金属废料或农业场景中的杂草)时,现有感知系统难以满足实时处理、像素级分割精度和鲁棒准确性的挑战,其根本原因在于这些系统依赖于大量人工标注的数据集。解决方案的关键在于提出一种弱监督的物体分割与分类流水线——Patch Aggregation for Segmentation of Targets and Anomalies (PASTA),该方法通过对比观测场景与标准参考场景,在自监督视觉Transformer (Vision Transformer, ViT) 特征空间中进行分布分析,从而识别目标与异常对象;同时利用Segment Anything Model 3 (SAM3) 结合语义文本提示实现零样本物体分割,显著降低了训练时间(减少75.8%),并在工业与农业领域分别实现了高达88.3%和63.5%的交并比(IoU),展现出良好的域泛化能力。

链接: https://arxiv.org/abs/2604.09701
作者: Melanie Neubauer,Elmar Rueckert,Christian Rauch
机构: Technical University of Leoben (莱奥本技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Detecting unseen anomalies in unstructured environments presents a critical challenge for industrial and agricultural applications such as material recycling and weeding. Existing perception systems frequently fail to satisfy the strict operational requirements of these domains, specifically real-time processing, pixel-level segmentation precision, and robust accuracy, due to their reliance on exhaustively annotated datasets. To address these limitations, we propose a weakly supervised pipeline for object segmentation and classification using weak image-level supervision called ‘Patch Aggregation for Segmentation of Targets and Anomalies’ (PASTA). By comparing an observed scene with a nominal reference, PASTA identifies Target and Anomaly objects through distribution analysis in self-supervised Vision Transformer (ViT) feature spaces. Our pipeline utilizes semantic text-prompts via the Segment Anything Model 3 to guide zero-shot object segmentation. Evaluations on a custom steel scrap recycling dataset and a plant dataset demonstrate a 75.8% training time reduction of our approach to domain-specific baselines. While being domain-agnostic, our method achieves superior Target (up to 88.3% IoU) and Anomaly (up to 63.5% IoU) segmentation performance in the industrial and agricultural domain. Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2604.09701 [cs.CV] (or arXiv:2604.09701v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.09701 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-294] Attention-Guided Flow-Matching for Sparse 3D Geological Generation

【速读】:该论文旨在解决从稀疏的一维钻孔数据和二维地表数据构建高分辨率三维地质模型的问题,这是一个高度病态的反演问题。传统启发式与隐式建模方法在极端稀疏条件下无法捕捉非线性拓扑不连续性,常产生不切实际的伪影;而现有的深度生成架构(如扩散模型)在条件作用于稀疏分类网格时则面临严重的表示崩溃问题。解决方案的关键在于提出3D-GeoFlow——首个面向稀疏多模态地质建模的注意力引导连续流匹配框架:通过将离散分类生成重构为无仿真、基于均方误差优化的连续向量场回归,建立稳定且确定性的最优传输路径;同时引入三维注意力门机制,在体积潜在空间中动态传播局部钻孔特征,从而保障宏观结构一致性。

链接: https://arxiv.org/abs/2604.09700
作者: Zhixiang Lu,Mengqi Han,Peixin Guo,Tianming Bai,Jionglong Su,Fei Fang,Sifan Song
机构: Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Constructing high-resolution 3D geological models from sparse 1D borehole and 2D surface data is a highly ill-posed inverse problem. Traditional heuristic and implicit modeling methods fundamentally fail to capture non-linear topological discontinuities under extreme sparsity, often yielding unrealistic artifacts. Furthermore, while deep generative architectures like Diffusion Models have revolutionized continuous domains, they suffer from severe representation collapse when conditioned on sparse categorical grids. To bridge this gap, we propose 3D-GeoFlow, the first Attention-Guided Continuous Flow Matching framework tailored for sparse multimodal geological modeling. By reformulating discrete categorical generation as a simulation-free, continuous vector field regression optimized via Mean Squared Error, our model establishes stable, deterministic optimal transport paths. Crucially, we integrate 3D Attention Gates to dynamically propagate localized borehole features across the volumetric latent space, ensuring macroscopic structural coherence. To validate our framework, we curated a large-scale multimodal dataset comprising 2,200 procedurally generated 3D geological cases. Extensive out-of-distribution (OOD) evaluations demonstrate that 3D-GeoFlow achieves a paradigm shift, significantly outperforming heuristic interpolations and standard diffusion baselines.

[CV-295] I Cant Believe TTA Is Not Better: When Test-Time Augmentation Hurts Medical Image Classification

【速读】:该论文旨在解决测试时增强(Test-time augmentation, TTA)在医学图像分类任务中是否真正提升模型性能的问题。尽管TTA被广泛认为可提高分类准确率,尤其在医疗影像领域常作为生产系统和竞赛方案的标准配置,但本文通过系统性实证研究发现,使用标准增强流程的TTA反而会显著降低准确率,降幅最高达31.6个百分点(ResNet-18在病理图像上的表现)。其关键解决方案在于识别出导致性能下降的核心机制:增强输入与训练数据之间的分布偏移(distribution shift),尤其是由批量归一化(batch normalization)统计量不匹配所放大。此外,实验表明增强策略至关重要——仅对图像强度进行变换的增强方式比几何变换更利于保留性能,且保留原始未增强样本可部分缓解性能下降,但仍无法完全消除负面影响。因此,论文强调TTA不应作为默认后处理手段,而必须针对具体模型与数据集组合进行验证。

链接: https://arxiv.org/abs/2604.09697
作者: Daniel Nobrega Medeiros
机构: University of Colorado at Boulder (科罗拉多大学博尔德分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 7 figures, 2 tables

点击查看摘要

Abstract:Test-time augmentation (TTA)–aggregating predictions over multiple augmented copies of a test input–is widely assumed to improve classification accuracy, particularly in medical imaging where it is routinely deployed in production systems and competition solutions. We present a systematic empirical study challenging this assumption across three MedMNIST v2 benchmarks and four architectures spanning three orders of magnitude in parameter count (21K to 11M). Our principal finding is that TTA with standard augmentation pipelines consistently degrades accuracy relative to single-pass inference, with drops as severe as 31.6 percentage points for ResNet-18 on pathology images. This degradation affects all architectures, including convolutional models, and worsens with more augmented views. The sole exception is ResNet-18 on dermatology images, which gains a modest +1.6%. We identify the distribution shift between augmented and training-time inputs–amplified by batch normalization statistics mismatch–as the primary mechanism. Our ablation studies show that augmentation strategy matters critically: intensity-only augmentations preserve more performance than geometric transforms, and including the original unaugmented image partially mitigates but does not eliminate the accuracy drop. These findings serve as a cautionary note for practitioners: TTA should not be applied as a default post-hoc improvement but must be validated on the specific model-dataset combination.

[CV-296] Sharpness-Aware Surrogate Training for On-Sensor Spiking Neural Networks

【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在部署阶段因平滑的代理非线性(surrogate nonlinearity)被替换为硬阈值(hard threshold)而导致的性能显著下降问题,即“代理到硬阈值迁移差距”(surrogate-to-hard transfer gap),这一问题直接限制了传感器端(on-sensor)SNN的准确率。解决方案的关键在于提出Sharpness-Aware Surrogate Training (SAST),其核心思想是将Sharpness-Aware Minimization (SAM) 引入代理前向传播的SNN训练中,使训练目标保持平滑且梯度精确,从而增强模型在硬阈值激活下的稳定性与泛化能力。理论分析表明,在显式收缩假设下,SAST可提供状态稳定性、输入-Lipschitz连续性和平滑性边界,并给出非凸收敛结果;实验验证显示,SAST在两个事件相机基准测试(N-MNIST和DVS Gesture)上显著提升硬脉冲精度,且在硬件感知推理模拟(INT8/INT4权重量化、定点膜电位、离散泄漏因子)下仍保持优越性能,同时减少计算量(SynOps)。

链接: https://arxiv.org/abs/2604.09696
作者: Maximilian Nicholson
机构: University of Bath (巴斯大学)
类目: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Currently under review at a conference workshop

点击查看摘要

Abstract:Spiking neural networks (SNNs) are a natural computational model for on-sensor and near-sensor vision, where event driven processors must operate under strict power budgets with hard binary spikes. However, models trained with surrogate gradients often degrade sharply when the smooth surrogate nonlinearity is replaced by a hard threshold at deployment; a surrogate-to-hard transfer gap that directly limits on-sensor accuracy. We study Sharpness-Aware Surrogate Training (SAST), which applies Sharpness-Aware Minimization (SAM) to a surrogate-forward SNN so that the training objective is smooth and the gradient is exact, and position it as one gap-reduction strategy under the tested settings rather than the only viable mechanism. Under explicit contraction assumptions we provide state-stability, input-Lipschitz, and smoothness bounds, together with a corresponding nonconvex convergence result. On two event-camera benchmarks, swap-only hard-spike accuracy improves from 65.7% to 94.7% on N-MNIST and from 31.8% to 63.3% on DVS Gesture. Under a hardware-aware inference simulation (INT8/INT4 weight quantization, fixed-point membrane potentials, discrete leak factors), SAST remains strong: on N-MNIST, hard-spike accuracy improves from 47.6% to 96.9% (INT8) and from 43.2% to 81.0% (INT4), while on DVS Gesture it improves from 25.3% to 47.6% (INT8) and from 26.0% to 43.8% (INT4). SynOps also decrease under the same hardware-aware setting, including 1734k \rightarrow 1315k (N-MNIST, INT8) and 86221k \rightarrow 4323k (DVS Gesture, INT8). These results suggest that SAST is a promising component in a broader toolbox for on-sensor spiking inference under the tested settings.

[CV-297] Assessing Privacy Preservation and Utility in Online Vision-Language Models

【速读】:该论文旨在解决在线视觉语言模型(Online Vision Language Models, OVLMs)在处理用户上传图像时引发的个人身份信息(Personally Identifiable Information, PII)泄露问题。由于图像中蕴含的上下文关系可能间接暴露敏感信息,即使图像本身看似无害,也可能导致隐私风险。解决方案的关键在于提出一套既能保护用户隐私又能维持图像在视觉语言模型(Vision Language Model, VLM)应用中预期功能的隐私保护方法,通过平衡隐私保护与图像实用性的权衡,有效降低PII被直接或间接提取的风险。

链接: https://arxiv.org/abs/2604.09695
作者: Karmesh Siddharam Chaudhari,Youxiang Zhu,Amy Feng,Xiaohui Liang,Honggang Zhang
机构: University of Massachusetts Boston (马萨诸塞大学波士顿分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for publication in IEEE ICC 2026. \c{opyright} IEEE. Personal use of this material is permitted. The final version will appear in IEEE Xplore

点击查看摘要

Abstract:The increasing use of Online Vision Language Models (OVLMs) for processing images has introduced significant privacy risks, as individuals frequently upload images for various utilities, unaware of the potential for privacy violations. Images contain relationships that relate to Personally Identifiable Information (PII), where even seemingly harmless details can indirectly reveal sensitive information through surrounding clues. This paper explores the critical issue of PII disclosure in images uploaded to OVLMs and its implications for user privacy. We investigate how the extraction of contextual relationships from images can lead to direct (explicit) or indirect (implicit) exposure of PII, significantly compromising personal privacy. Furthermore, we propose methods to protect privacy while preserving the intended utility of the images in Vision Language Model (VLM)-based applications. Our evaluation demonstrates the efficacy of these techniques, highlighting the delicate balance between maintaining utility and protecting privacy in online image processing environments. Index Terms-Personally Identifiable Information (PII), Privacy, Utility, privacy concerns, sensitive information

[CV-298] EDFNet: Early Fusion of Edge and Depth for Thin-Obstacle Segmentation in UAV Navigation

【速读】:该论文旨在解决自主无人机(UAV)在复杂空域中对细长障碍物(如电线、树枝和杆状结构)的感知难题,此类障碍物因像素占用少、视觉对比度弱及类别不平衡等问题,难以被现有分割方法准确识别。解决方案的关键在于提出一种模块化的早期融合(early-fusion)分割框架EDFNet,该框架整合RGB图像、深度信息与边缘特征,以充分利用多模态互补线索提升细结构感知能力。实验表明,RGB-Depth-Edge的早期融合策略在边界敏感性和召回率指标上表现最优,尤其在预训练RGBDE U-Net模型下达到最高细结构评估分数(0.244)和边界交并比(0.234),同时保持19.62 FPS的实时性能,为无人机导航中的细障碍物分割提供了可复用且高效的基线方案。

链接: https://arxiv.org/abs/2604.09694
作者: Negar Fathi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autonomous Unmanned Aerial Vehicles (UAVs) must reliably detect thin obstacles such as wires, poles, and branches to navigate safely in real-world environments. These structures remain difficult to perceive because they occupy few pixels, often exhibit weak visual contrast, and are strongly affected by class imbalance. Existing segmentation methods primarily target coarser obstacles and do not fully exploit the complementary multimodal cues needed for thin-structure perception. We present EDFNet, a modular early-fusion segmentation framework that integrates RGB, depth, and edge information for thin-obstacle perception in cluttered aerial scenes. We evaluate EDFNet on the Drone Depth and Obstacle Segmentation (DDOS) dataset across sixteen modality-backbone configurations using U-Net and DeepLabV3 in pretrained and non-pretrained settings. The results show that early RGB-Depth-Edge fusion provides a competitive and well-balanced baseline, with the most consistent gains appearing in boundary-sensitive and recall-oriented metrics. The pretrained RGBDE U-Net achieves the best overall performance, with the highest Thin-Structure Evaluation Score (0.244), mean IoU (0.219), and boundary IoU (0.234), while maintaining competitive runtime performance (19.62 FPS) on our evaluation hardware. However, performance on the rarest ultra-thin categories remains low across all models, indicating that reliable ultra-thin segmentation is still an open challenge. Overall, these findings position early RGB-Depth-Edge fusion as a practical and modular baseline for thin-obstacle segmentation in UAV navigation.

[CV-299] aFall: Balance-Informed Fall Detection via Passive Thermal Sensing

【速读】:该论文旨在解决老年人在私人室内环境中跌倒检测的可靠性与隐私保护之间的矛盾问题。现有基于射频感知的隐私保护跌倒检测方法多依赖粗粒度运动线索,导致实际部署中可靠性不足。其解决方案的关键在于提出TaFall系统,该系统利用低成本、隐私友好的热成像阵列传感,将跌倒建模为平衡能力退化过程,并通过姿态驱动的生物力学平衡动态估计实现精准检测;核心技术包括:(i)外观-运动融合模型以实现鲁棒的姿态重建,(ii)物理引导的平衡感知学习机制,以及(iii)姿态桥接预训练策略提升泛化能力。实验表明,TaFall在包含35名参与者超过3000例跌倒实例的数据集上达到98.26%的检测率和0.65%的误报率,在四户家庭连续27天部署中误报率低至0.00126%,并验证了在潮湿和热干扰环境下的鲁棒性。

链接: https://arxiv.org/abs/2604.09693
作者: Chengxiao Li,Xie Zhang,Wei Zhu,Yan Jiang,Chenshu Wu
机构: University of Hong Kong(香港大学); West China Hospital, Sichuan University(四川大学华西医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Falls are a major cause of injury and mortality among older adults, yet most incidents occur in private indoor environments where monitoring must balance effectiveness with privacy. Existing privacy-preserving fall detection approaches, particularly those based on radio frequency sensing, often rely on coarse motion cues, which limits reliability in real-world deployments. We introduce TaFall, a balance-informed fall detection system based on low-cost, privacy-preserving thermal array sensing. The key insight is that TaFall models a fall as a process of balance degradation and detects falls by estimating pose-driven biomechanical balance dynamics. To enable this capability from low-resolution thermal array maps, we propose (i) an appearance-motion fusion model for robust pose reconstruction, (ii) physically grounded balance-aware learning, and (iii) pose-bridged pretraining to improve robustness. TaFall achieves a detection rate of 98.26% with a false alarm rate of 0.65% on our dataset with over 3,000 fall instances from 35 participants across diverse indoor environments. In 27 day deployments across four homes, TaFall attains an ultra-low false alarm rate of 0.00126% and a pilot bathroom study confirms robustness under moisture and thermal interference. Together, these results establish TaFall as a reliable and privacy-preserving approach to fall detection in everyday living environments.

[CV-300] piano: Cascaded Piano Hand Motion Synthesis via Fingertip Priors

【速读】:该论文旨在解决钢琴演奏中手部运动合成的难题,即如何在保持物理精度的同时实现自然流畅的动作表现。传统基于物理的方法虽能保证动作准确性但缺乏自然性,而数据驱动模型则难以精确控制手指位置。其解决方案的关键在于利用钢琴演奏动作的层次结构特性:指尖位置由琴键几何和指法高度确定(近乎确定性),而手腕及中间关节则具有风格化自由度。作者提出四阶段框架,依次完成指尖统计定位、FiLM条件轨迹优化、手腕估计与STGCN姿态生成,从而有效分离并协同处理确定性与自由度成分,显著提升合成质量(F1=0.910)并接近真实动作捕捉水平。

链接: https://arxiv.org/abs/2604.09692
作者: Joonhyung Bae,Kirak Kim,Hyeyoon Cho,Sein Lee,Yoon-Seok Choi,Hyeon Hur,Gyubin Lee,Akira Maezawa,Satoshi Obata,Jonghwa Park,Jaebum Park,Juhan Nam
机构: Korea Advanced Institute of Science and Technology (KAIST); Seoul National University (首尔国立大学); YAMAHA (雅马哈)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Synthesizing realistic piano hand motions requires both precision and naturalness. Physics-based methods achieve precision but produce stiff motions; data-driven models learn natural dynamics but struggle with positional accuracy. Piano motion exhibits a natural hierarchy: fingertip positions are nearly deterministic given piano geometry and fingering, while wrist and intermediate joints offer stylistic freedom. We present [OURS], a four-stage framework exploiting this hierarchy: (1) statistics-based fingertip positioning, (2) FiLM-conditioned trajectory refinement, (3) wrist estimation, and (4) STGCN-based pose synthesis. We contribute expert-annotated fingerings for the FürElise dataset (153 pieces, ~10 hours). Experiments demonstrate F1 = 0.910, substantially outperforming diffusion baselines (F1 = 0.121), with user study (N=41) confirming quality approaching motion capture. Expert evaluation by professional pianists (N=5) identified anticipatory motion as the key remaining gap, providing concrete directions for future improvement.

[CV-301] CAGE: Bridging the Accuracy-Aesthetics Gap in Educational Diagrams via Code-Anchored Generative Enhancement

【速读】:该论文旨在解决教育图表(Educational Diagrams)生成中长期存在的准确性与美观性矛盾问题:当前主流方法中,开源扩散模型虽能生成视觉丰富的图像但严重破坏文本标签的可读性;基于大语言模型(LLM)的代码生成方式可确保标签正确但图形表现力弱;闭源API虽部分缓解该问题却存在可靠性差和成本高的缺陷。解决方案的关键在于提出CAGE(Code-Anchored Generative Enhancement)框架——首先由LLM生成结构正确的程序代码以保证语义准确性,随后利用ControlNet引导的扩散模型对程序输出进行视觉增强,在保留标签完整性的前提下提升图像美观度,从而实现教育场景下高质量图表的自动化生成。

链接: https://arxiv.org/abs/2604.09691
作者: Dikshant Kukreja,Kshitij Sah,Karan Goyal,Mukesh Mohania,Vikram Goyal
机构: IIIT Delhi(印度信息技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Educational diagrams – labeled illustrations of biological processes, chemical structures, physical systems, and mathematical concepts – are essential cognitive tools in K-12 instruction. Yet no existing method can generate them both accurately and engagingly. Open-source diffusion models produce visually rich images but catastrophically garble text labels. Code-based generation via LLMs guarantees label correctness but yields visually flat outputs. Closed-source APIs partially bridge this gap but remain unreliable and prohibitively expensive at educational scale. We quantify this accuracy-aesthetics dilemma across all three paradigms on 400 K-12 diagram prompts, measuring both label fidelity and visual quality through complementary automated and human evaluation protocols. To resolve it, we propose CAGE (Code-Anchored Generative Enhancement): an LLM synthesizes executable code producing a structurally correct diagram, then a diffusion model conditioned on the programmatic output via ControlNet refines it into a visually polished graphic while preserving label fidelity. We also introduce EduDiagram-2K, a collection of 2,000 paired programmatic-stylized diagrams enabling this pipeline, and present proof-of-concept results and a research agenda for the multimedia community.

[CV-302] Are We Recognizing the Jaguar or Its Background? A Diagnostic Framework for Jaguar Re-Identification

【速读】:该论文旨在解决野生动物重识别(re-ID)中模型可能依赖错误判别特征(如背景环境或轮廓形状)而非真正定义个体身份的毛色图案的问题,从而导致在标准检索指标上表现良好但实际推理机制不可靠。解决方案的关键在于提出一个双维度诊断框架:一是通过修复背景与前景图像的对比计算“泄漏控制的上下文比”(leakage-controlled context ratio),量化模型对背景和前景信息的依赖程度;二是基于跨侧身检索和镜像自相似性设计“侧向性诊断”(laterality diagnostic),评估模型是否利用了对称性等结构化线索。为支持该诊断体系,作者构建了包含像素级分割掩码和身份平衡评估协议的潘塔纳尔美洲豹基准数据集,并以ArcFace微调、抗对称正则化和洛伦兹双曲嵌入等代表性方法作为案例,在统一评估视角下揭示不同模型所依赖的视觉证据类型,从而推动更可靠、可解释的野生动物re-ID研究。

链接: https://arxiv.org/abs/2604.09690
作者: Antonio Rueda-Toicen,Abigail Allen Martin,Daniil Morozov,Matin Mahmood,Alexandra Schild,Shahabeddin Dayani,Davide Panza,Gerard de Melo
机构: Hasso Plattner Institute (哈索普拉特纳研究所); NVIDIA (英伟达); Jaguar ID Project (美洲虎ID项目); Universidade Federal de Mato Grosso (巴西马托格罗索联邦大学); Technical University Munich (慕尼黑工业大学); Helmholtz Zentrum Berlin (柏林赫尔姆霍兹研究中心); Freie Universitaet Berlin (柏林自由大学); Hyper3Labs (超三实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 33 pages, 11 figures

点击查看摘要

Abstract:Jaguar re-identification (re-ID) from citizen-science imagery can look strong on standard retrieval metrics while still relying on the wrong evidence, such as background context or silhouette shape, instead of the coat pattern that defines identity. We introduce a diagnostic framework for wildlife re-ID with two axes: a leakage-controlled context ratio, background/foreground, computed from inpainted background-only versus foreground-only images, and a laterality diagnostic based on cross-flank retrieval and mirror self-similarity. To make these diagnostics measurable, we curate a Pantanal jaguar benchmark with per-pixel segmentation masks and an identity-balanced evaluation protocol. We then use representative mitigation families, ArcFace fine-tuning, anti-symmetry regularization, and Lorentz hyperbolic embeddings, as case studies under the same evaluation lens. The goal is not only to ask which model ranks best, but also what visual evidence it uses to do so.

[CV-303] Face Density as a Proxy for Data Complexity: Quantifying the Hardness of Instance Count

【速读】:该论文旨在解决机器学习模型性能受限于数据内在复杂性的问题,特别是聚焦于实例密度(instance density)作为影响数据难度的核心因素。传统研究多关注模型架构改进,而忽视了数据本身的结构性挑战。论文的关键解决方案在于通过受控实验严格量化密度对模型性能的影响:在WIDER FACE和Open Images数据集上,限定每张图像中人脸数量为1至18个,并采用完全平衡采样以排除类别不平衡干扰,结果表明模型性能随人脸数量增加呈单调下降趋势,且低密度训练模型在高密度场景下出现系统性欠计数偏差(误差率最高上升4.6倍),证明密度变化构成域偏移(domain shift)。这一发现确立了实例密度为可量化的数据硬度维度,为课程学习(curriculum learning)与密度分层评估(density-stratified evaluation)等干预策略提供理论依据。

链接: https://arxiv.org/abs/2604.09689
作者: Abolfazl Mohammadi-Seif,Ricardo Baeza-Yates
机构: Universitat Pompeu Fabra (庞培法布拉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for publication at IEEE CAI 2026

点击查看摘要

Abstract:Machine learning progress has historically prioritized model-centric innovations, yet achievable performance is frequently capped by the intrinsic complexity of the data itself. In this work, we isolate and quantify the impact of instance density (measured by face count) as a primary driver of data complexity. Rather than simply observing that ``crowded scenes are harder,‘’ we rigorously control for class imbalance to measure the precise degradation caused by density alone. Controlled experiments on the WIDER FACE and Open Images datasets, restricted to exactly 1 to 18 faces per image with perfectly balanced sampling, reveal that model performance degrades monotonically with increasing face count. This trend holds across classification, regression, and detection paradigms, even when models are fully exposed to the entire density range. Furthermore, we demonstrate that models trained on low-density regimes fail to generalize to higher densities, exhibiting a systematic under-counting bias, with error rates increasing by up to 4.6x, which suggests density acts as a domain shift. These findings establish instance density as an intrinsic, quantifiable dimension of data hardness and motivate specific interventions in curriculum learning and density-stratified evaluation. Comments: Accepted for publication at IEEE CAI 2026 Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2604.09689 [cs.CV] (or arXiv:2604.09689v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.09689 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-304] Immunizing 3D Gaussian Generative Models Against Unauthorized Fine-Tuning via Attribute-Space Traps

【速读】:该论文旨在解决3D生成模型(尤其是基于显式高斯表示的模型)在公开预训练权重后面临的知识产权泄露风险问题,即 adversaries 可通过微调(fine-tuning)攻击窃取模型在预训练阶段学到的专业知识。解决方案的关键在于提出 GaussLock——一种轻量级参数空间免疫框架,其核心机制包括授权蒸馏(authorized distillation)与属性感知陷阱损失(attribute-aware trap losses),后者针对位置、尺度、旋转、不透明度和颜色等几何与外观属性设计,系统性地破坏模型结构完整性,从而在保持授权微调任务性能的同时,显著降低未经授权重建的质量(如 LPIPS 显著升高、PSNR 显著下降)。

链接: https://arxiv.org/abs/2604.09688
作者: Jianwei Zhang,Sihan Cao,Chaoning Zhang,Ziming Hong,Jiaxin Huang,Pengcheng Zheng,Caiyan Qin,Wei Dong,Yang Yang,Tongliang Liu
机构: University of Electronic Science and Technology of China (电子科技大学); The University of Sydney (悉尼大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学(深圳)); Xi’an University of Architecture and Technology (西安建筑科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent large-scale generative models enable high-quality 3D synthesis. However, the public accessibility of pre-trained weights introduces a critical vulnerability. Adversaries can fine-tune these models to steal specialized knowledge acquired during pre-training, leading to intellectual property infringement. Unlike defenses for 2D images and language models, 3D generators require specialized protection due to their explicit Gaussian representations, which expose fundamental structural parameters directly to gradient-based optimization. We propose GaussLock, the first approach designed to defend 3D generative models against fine-tuning attacks. GaussLock is a lightweight parameter-space immunization framework that integrates authorized distillation with attribute-aware trap losses targeting position, scale, rotation, opacity, and color. Specifically, these traps systematically collapse spatial distributions, distort geometric shapes, align rotational axes, and suppress primitive visibility to fundamentally destroy structural integrity. By jointly optimizing these dual objectives, the distillation process preserves fidelity on authorized tasks while the embedded traps actively disrupt unauthorized reconstructions. Experiments on large-scale Gaussian models demonstrate that GaussLock effectively neutralizes unauthorized fine-tuning attacks. It substantially degrades the quality of unauthorized reconstructions, evidenced by significantly higher LPIPS and lower PSNR, while effectively maintaining performance on authorized fine-tuning.

[CV-305] Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在多模态推理任务中对图像细节捕捉不足的问题,尤其是当任务要求完整读出图像内容时,现有评估方法往往无法暴露模型在细粒度视觉信息保留上的缺陷。其解决方案的关键在于提出一个受控的基准测试集 Grid2Matrix (G2M),通过展示颜色网格与颜色到数字的映射关系,强制模型输出对应矩阵,从而以可控方式提升视觉复杂度并最小化语义干扰。实验发现,VLMs 在零样本端到端评估中表现出明显的早期崩溃现象,即在较小网格尺寸下即失效,而非渐进式退化;进一步分析揭示,视觉编码器实际保留了比最终语言输出更多网格信息,表明失败并非仅源于视觉编码,而是存在“数字失认”(Digital Agnosia)——即从视觉特征中可恢复的信息与最终语言表达之间存在显著差距。此框架为识别和量化VLM在精细视觉细节丢失的位置与机制提供了新工具。

链接: https://arxiv.org/abs/2604.09687
作者: Yunkai Zhang,Linda Li,Yingxin Cui,Xiyuan Ruan,Zeyu Zheng,Kezhen Chen,Yi Zhang,Diji Yang
机构: BAIR, UC Berkeley; UC Santa Cruz; Analogy AI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) excel on many multimodal reasoning benchmarks, but these evaluations often do not require an exhaustive readout of the image and can therefore obscure failures in faithfully capturing all visual details. We introduce Grid2Matrix (G2M), a controlled benchmark in which a model is shown a color grid and a color-to-number mapping, and must output the corresponding matrix. By varying grid size and the number of colors, G2M provides a simple way to increase visual complexity while minimizing semantic confounds. We find that VLMs exhibit a sharp early collapse in zero-shot end-to-end evaluation, failing on surprisingly small grids rather than degrading gradually as the task becomes denser. We probe the visual encoders of VLMs from two representative families and find that they preserve substantially more of the grid information than the corresponding end-to-end outputs. This suggests that the failure is not explained by visual encoding alone, but also reflects a gap between what remains recoverable from visual features and what is ultimately expressed in language. We term this gap \textitDigital Agnosia. Further analyses show that these errors are highly structured and depend strongly on how grid cells overlap with visual patch boundaries. We also find that common strategies such as model scaling and multimodal alignment do not fully eliminate this failure mode. We expect G2M to serve as a useful testbed for understanding where and how VLMs lose fine visual details, and for evaluating tasks where missing even small visual details can matter, such as tables, charts, forms, and GUIs.

[CV-306] Belief-Aware VLM Model for Human-like Reasoning

【速读】:该论文旨在解决传统神经网络模型在意图推理中对可观测状态依赖过重、难以跨任务和动态环境泛化的问题,以及当前视觉语言模型(Vision Language Models, VLMs)和视觉语言动作模型(Vision Language Action, VLA)虽具备零样本任务迁移能力,但缺乏显式信念表示与更新机制,导致其无法像人类一样进行长期任务中的意图演化推理。解决方案的关键在于提出一种信念感知的VLM框架,通过引入基于检索的记忆机制来近似信念状态——即利用向量化的记忆存储并检索相关多模态上下文信息,并将其融入VLM用于推理;同时,在VLM隐空间上进一步采用强化学习策略优化决策过程,从而实现更贴近人类认知的长时程意图理解与适应性行为生成。

链接: https://arxiv.org/abs/2604.09686
作者: Anshul Nayak,Shahil Shaik,Yue Wang
机构: Clemson University (克莱姆森大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 Pages, 3 figures, 1 Table

点击查看摘要

Abstract:Traditional neural network models for intent inference rely heavily on observable states and struggle to generalize across diverse tasks and dynamic environments. Recent advances in Vision Language Models (VLMs) and Vision Language Action (VLA) models introduce common-sense reasoning through large-scale multimodal pretraining, enabling zero-shot performance across tasks. However, these models still lack explicit mechanisms to represent and update belief, limiting their ability to reason like humans or capture the evolving human intent over long-horizon. To address this, we propose a belief-aware VLM framework that integrates retrieval-based memory and reinforcement learning. Instead of learning an explicit belief model, we approximate belief using a vector-based memory that retrieves relevant multimodal context, which is incorporated into the VLM for reasoning. We further refine decision-making using a reinforcement learning policy over the VLM latent space. We evaluate our approach on publicly available VQA datasets such as HD-EPIC and demonstrate consistent improvements over zero-shot baselines, highlighting the importance of belief-aware reasoning.

[CV-307] A Modular Zero-Shot Pipeline for Accident Detection Localization and Classification in Traffic Surveillance Video CVPR2026 WWW

【速读】:该论文旨在解决无监督场景下从监控视频中预测交通碰撞事件的时间、位置及类型的问题(即零样本检测),其挑战在于缺乏真实世界标注的训练数据。解决方案的关键在于构建一个三阶段独立模块化流水线:首先通过z-score归一化的帧差信号峰值检测实现碰撞时间定位;其次利用Farneback算法计算累积稠密光流幅值图的加权质心以确定撞击位置;最后基于CLIP模型提取帧图像嵌入与多提示自然语言文本嵌入之间的余弦相似度完成碰撞类型分类。整个流程仅依赖预训练模型权重,无需领域特定微调,实现了端到端的零样本检测能力。

链接: https://arxiv.org/abs/2604.09685
作者: Amey Thakur,Sarvesh Talele
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 7 figures, 2 tables. Submitted to the ACCIDENT @ CVPR 2026 Workshop. Source code and notebook available at this https URL

点击查看摘要

Abstract:We describe a zero-shot pipeline developed for the ACCIDENT @ CVPR 2026 challenge. The challenge requires predicting when, where, and what type of traffic accident occurs in surveillance video, without labeled real-world training data. Our method separates the problem into three independent modules. The first module localizes the collision in time by running peak detection on z-score normalized frame-difference signals. The second module finds the impact location by computing the weighted centroid of cumulative dense optical flow magnitude maps using the Farneback algorithm. The third module classifies collision type by measuring cosine similarity between CLIP image embeddings of frames near the detected peak and text embeddings built from multi-prompt natural language descriptions of each collision category. No domain-specific fine-tuning is involved; the pipeline processes each video using only pre-trained model weights. Our implementation is publicly available as a Kaggle notebook.

[CV-308] R2E-VID: Two-Stage Robust Routing via Temporal Gating for Elastic Edge-Cloud Video Inference

【速读】:该论文旨在解决大规模视频分析应用中边缘-云协同系统在面对异构视频内容和动态资源条件时,难以实现高效路由调度的问题,从而导致推理效率低下和计算成本高昂。其核心解决方案是提出一种两阶段鲁棒路由框架 R2E-VID,关键创新在于引入时间门控机制(temporal gating mechanism),通过建模视频流的时间一致性和运动动态来预测最优分段路由策略,实现细粒度的时空弹性负载分配;第二阶段则通过多模型自适应优化进一步降低延迟与资源消耗,在动态网络和工作负载变化下保持高鲁棒性。

链接: https://arxiv.org/abs/2604.09681
作者: Zheming Yang,Lulu Zuo,Shun Lu,Yangyu Zhang,Zhicheng Li,Xiangyang Li,Yang You
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); Peng Cheng Laboratory (鹏城实验室); National University of Singapore (新加坡国立大学); Beijing Kuaishou Technology Co., Ltd. (北京快手科技有限公司)
类目: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 10 pages, 10 figures

点击查看摘要

Abstract:With the rapid growth of large-scale video analytics applications, edge-cloud collaborative systems have become the dominant paradigm for real-time inference. However, existing approaches often fail to dynamically adapt to heterogeneous video content and fluctuating resource conditions, resulting in suboptimal routing efficiency and high computational costs. In this paper, we propose R2E-VID, a two-stage robust routing framework via temporal gating for elastic edge-cloud video inference. In the first stage, R2E-VID introduces a temporal gating mechanism that models the temporal consistency and motion dynamics of incoming video streams to predict the optimal routing pattern for each segment. This enables adaptive partitioning of inference workloads between edge and cloud nodes, achieving fine-grained spatiotemporal elasticity. In the second stage, a robust routing optimization module refines the allocation through multi-model adaptation, jointly minimizing inference delay and resource consumption under dynamic network and workload variations. Extensive experiments on public datasets demonstrate that R2E-VID achieves up to 60% reduction in overall cost compared to cloud-centric baselines, and delivers 35-45% lower delay while improving inference accuracy by 2-7% over state-of-the-art edge-cloud solutions.

[CV-309] FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models CVPR2026

【速读】:该论文旨在解决生成式 AI (Generative AI) 在机器人领域中新兴的 Vision-Language-Action (VLA) 模型所面临的一种新型安全漏洞——即基于流匹配(flow-matching)策略的连续动作生成机制中存在的后门攻击问题。现有针对自回归离散化 VLA 的后门攻击方法无法直接应用于此类连续动力学模型,因此作者提出了 FlowHijack,这是首个系统性地针对流匹配 VLA 内部向量场动力学(vector field dynamics)的后门攻击框架。其关键创新在于引入了一种新颖的 τ-条件注入策略(τ-conditioned injection strategy),通过操控动作生成初始阶段来隐蔽植入恶意行为,并结合动力学模仿正则化项(dynamics mimicry regularizer)以确保恶意动作在运动学上与正常动作高度相似,从而实现高成功率且不破坏良性任务性能的隐蔽攻击。

链接: https://arxiv.org/abs/2604.09651
作者: Xinyuan An,Tao Luo,Gengyun Peng,Yaobing Wang,Kui Ren,Dongxia Wang
机构: Zhejiang University (浙江大学); Beijing Key Laboratory of Intelligent Space Robotic Systems Technology and Applications; Huzhou Institute of Industrial Control Technology (湖州工业控制研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Accepted at CVPR 2026

点击查看摘要

Abstract:Vision-Language-Action (VLA) models are emerging as a cornerstone for robotics, with flow-matching policies like \pi_0 showing great promise in generating smooth, continuous actions. As these models advance, their unique action generation mechanism - the vector field dynamics - presents a critical yet unexplored security vulnerability, particularly backdoor vulnerabilities. Existing backdoor attacks designed for autoregressive discretization VLAs cannot be directly applied to this new continuous dynamics. We introduce FlowHijack, the first backdoor attack framework to systematically target the underlying vector-field dynamics of flow-matching VLAs. Our method combines a novel \tau -conditioned injection strategy, which manipulates the initial phase of the action generation, with a dynamics mimicry regularizer. Experiments demonstrate that FlowHijack achieves high attack success rates using stealthy, context-aware triggers where prior works failed. Crucially, it preserves benign task performance and, by enforcing kinematic similarity, generates malicious actions that are behaviorally indistinguishable from normal actions. Our findings reveal a significant vulnerability in continuous embodied models, highlighting the urgent need for defenses targeting the model’s internal generative dynamics.

[CV-310] RACE: Thermal Recognition Attentive-Framework for CO2 Emissions from Livestock

【速读】:该论文旨在解决自由放牧牛只呼出二氧化碳(CO₂)的连续、空间分辨测量难题,这是评估瘤胃代谢状态和实现农场尺度碳核算的关键前提,而现有技术无法在不进行物理约束或接触的情况下实现此类监测。解决方案的核心是提出首个统一框架TRACE(Thermal Recognition Attentive-Framework for CO2 Emissions from Livestock),其关键创新包括:1)热气体感知注意力(Thermal Gas-Aware Attention, TGAA)编码器,通过像素级气体强度作为空间监督信号引导自注意力机制聚焦于高排放区域;2)基于注意力的时间融合(Attention-based Temporal Fusion, ATF)模块,利用结构化跨帧注意力捕捉呼吸周期动态以实现序列级排放通量分类;3)四阶段渐进式训练策略,协同优化分割与分类目标并避免梯度干扰。实验证明,TRACE在CO₂ Farm Thermal Gas Dataset上实现了0.998的平均交并比(mIoU)及所有指标最优结果,显著优于现有方法。

链接: https://arxiv.org/abs/2604.09648
作者: Taminul Islam,Abdellah Lakhssassi,Toqi Tahamid Sarker,Mohamed Embaby,Khaled R Ahmed,Amer AbuGhazaleh
机构: Southern Illinois University, Carbondale; University of California, Davis
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Quantifying exhaled CO2 from free-roaming cattle is both a direct indicator of rumen metabolic state and a prerequisite for farm-scale carbon accounting, yet no existing system can deliver continuous, spatially resolved measurements without physical confinement or contact. We present TRACE (Thermal Recognition Attentive-Framework for CO2 Emissions from Livestock), the first unified framework to jointly address per-frame CO2 plume segmentation and clip-level emission flux classification from mid-wave infrared (MWIR) thermal video. TRACE contributes three domain-specific advances: a Thermal Gas-Aware Attention (TGAA) encoder that incorporates per-pixel gas intensity as a spatial supervisory signal to direct self-attention toward high-emission regions at each encoder stage; an Attention-based Temporal Fusion (ATF) module that captures breath-cycle dynamics through structured cross-frame attention for sequence-level flux classification; and a four-stage progressive training curriculum that couples both objectives while preventing gradient interference. Benchmarked against fifteen state-of-the-art models on the CO2 Farm Thermal Gas Dataset, TRACE achieves an mIoU of 0.998 and the best result on every segmentation and classification metric simultaneously, outperforming domain-specific gas segmenters with several times more parameters and surpassing all baselines in flux classification. Ablation studies confirm that each component is individually essential: gas-conditioned attention alone determines precise plume boundary localization, and temporal reasoning is indispensable for flux-level discrimination. TRACE establishes a practical path toward non-invasive, continuous, per-animal CO2 monitoring from overhead thermal cameras at commercial scale. Codes are available at this https URL.

[CV-311] PA-SFM: Tracker-free differentiable acoustic radiation for freehand 3D photoacoustic imaging

【速读】:该论文旨在解决三维(3D)手持式光声断层成像(Photoacoustic Tomography, PAT)中因依赖外部定位传感器而导致的设备笨重、成本高及临床灵活性受限的问题。传统方法通常需要昂贵且固定的外部定位系统来校正运动伪影,限制了其在临床场景中的普及。解决方案的关键在于提出一种无需跟踪器(tracker-free)的框架PA-SFM,它仅利用单一模态的光声数据,通过可微分声辐射建模实现传感器位姿恢复与高质量3D重建的联合优化。该方法将声波方程嵌入可微编程管线,借助GPU加速的声辐射核函数,基于梯度下降同时优化3D光声源分布和传感器阵列位姿;并通过粗到精的优化策略结合几何一致性检查和刚体约束,提升自由手操作下的收敛鲁棒性,最终实现亚毫米级定位精度和媲美基准的高分辨率血管结构重建,为临床手持式光声成像提供了一种低成本、软件定义的新范式。

链接: https://arxiv.org/abs/2604.09643
作者: Shuang Li,Jian Gao,Chulhong Kim,Seongwook Choi,Qian Chen,Yibing Wang,Shuang Wu,Yu Zhang,Tingting Huang,Yucheng Zhou,Boxin Yao,Yao Yao,Changhui Li
机构: 南京大学(Nanjing University); 北京大学(Peking University)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Three-dimensional (3D) handheld photoacoustic tomography typically relies on bulky and expensive external positioning sensors to correct motion artifacts, which severely limits its clinical flexibility and accessibility. To address this challenge, we present PA-SFM, a tracker-free framework that leverages exclusively single-modality photoacoustic data for both sensor pose recovery and high-fidelity 3D reconstruction via differentiable acoustic radiation modeling. Unlike traditional structure-from-motion (SFM) methods based on visual features, PA-SFM integrates the acoustic wave equation into a differentiable programming pipeline. By leveraging a high-performance, GPU-accelerated acoustic radiation kernel, the framework simultaneously optimizes the 3D photoacoustic source distribution and the sensor array pose via gradient descent. To ensure robust convergence in freehand scenarios, we introduce a coarse-to-fine optimization strategy that incorporates geometric consistency checks and rigid-body constraints to eliminate motion outliers. We validated the proposed method through both numerical simulations and in-vivo rat experiments. The results demonstrate that PA-SFM achieves sub-millimeter positioning accuracy and restores high-resolution 3D vascular structures comparable to ground-truth benchmarks, offering a low-cost, software-defined solution for clinical freehand photoacoustic imaging. The source code is publicly available at \hrefthis https URLthis https URL.

[CV-312] 3D Multi-View Stylization with Pose-Free Correspondences Matching for Robust 3D Geometry Preservation

【速读】:该论文旨在解决多视角3D场景的风格迁移(style transfer)问题,其核心挑战在于:传统独立对每个视角进行风格化处理会导致纹理漂移(texture drift)、边缘扭曲和阴影不一致,从而破坏下游3D任务如SLAM(Simultaneous Localization and Mapping)、深度预测和多视图重建所需的几何一致性。解决方案的关键在于提出一种前馈式风格化网络,通过在测试阶段引入基于场景的优化策略,并设计复合目标函数联合优化外观迁移与几何保持。具体包括:(1) 基于AdaIN启发的损失函数(使用冻结的VGG-19编码器匹配通道统计量)实现风格迁移;(2) 引入基于SuperPoint和SuperGlue的对应一致性损失以约束不同视角间特征描述子的一致性,保障结构稳定性;(3) 利用MiDaS/DPT深度模型施加深度保持损失,并通过全局色彩对齐减少深度模型的域偏移;(4) 采用分阶段权重调度逐步引入几何与深度约束。实验表明,该方法在Tanks and Temples和Mip-NeRF 360数据集上显著提升了结构保留能力与SLAM轨迹稳定性,同时保持良好风格适配性。

链接: https://arxiv.org/abs/2604.09639
作者: Shirsha Bose
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Artistic style transfer is well studied for images and videos, but extending it to multi-view 3D scenes remains difficult because stylization can disrupt correspondences needed by geometry-aware pipelines. Independent per-view stylization often causes texture drift, warped edges, and inconsistent shading, degrading SLAM, depth prediction, and multi-view reconstruction. This thesis addresses multi-view stylization that remains usable for downstream 3D tasks without assuming camera poses or an explicit 3D representation during training. We introduce a feed-forward stylization network trained with per-scene test-time optimization under a composite objective coupling appearance transfer with geometry preservation. Stylization is driven by an AdaIN-inspired loss from a frozen VGG-19 encoder, matching channel-wise moments to a style image. To stabilize structure across viewpoints, we propose a correspondence-based consistency loss using SuperPoint and SuperGlue, constraining descriptors from a stylized anchor view to remain consistent with matched descriptors from the original multi-view set. We also impose a depth-preservation loss using MiDaS/DPT and use global color alignment to reduce depth-model domain shift. A staged weight schedule introduces geometry and depth constraints. We evaluate on Tanks and Temples and Mip-NeRF 360 using image and reconstruction metrics. Style adherence and structure retention are measured by Color Histogram Distance (CHD) and Structure Distance (DSD). For 3D consistency, we use monocular DROID-SLAM trajectories and symmetric Chamfer distance on back-projected point clouds. Across ablations, correspondence and depth regularization reduce structural distortion and improve SLAM stability and reconstructed geometry; on scenes with MuVieCAST baselines, our method yields stronger trajectory and point-cloud consistency while maintaining competitive stylization. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2604.09639 [cs.CV] (or arXiv:2604.09639v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.09639 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-313] Agent ic Exploration of PDE Spaces using Latent Foundation Models for Parameterized Simulations

【速读】:该论文旨在解决流体力学等由偏微分方程(Partial Differential Equations, PDEs) govern 的物理系统中,由于其连续性、高维性和混沌特性,导致传统实验与数值模拟难以实现自动化和大规模探索的问题。解决方案的关键在于将多智能体大语言模型(Multi-Agent LLMs)与潜在基础模型(Latent Foundation Models, LFMs)相结合,其中LFM是一种针对参数化模拟的生成模型,能够学习流场的显式、紧凑且解耦的潜在表示,从而实现对控制PDE参数和边界条件的连续探索。该框架通过闭环的假设-实验-分析-验证机制,在无需人工干预的情况下自主执行科学发现任务,并在雷诺数Re = 500下的双圆柱绕流问题中成功识别出两类不同的标度律及其双极值结构,展示了其在PDE驱动系统中自动化科学发现的通用潜力。

链接: https://arxiv.org/abs/2604.09584
作者: Abhijeet Vishwasrao,Francisco Giral,Mahmoud Golestanian,Federica Tonti,Andrea Arroyo Ramo,Adrian Lozano-Duran,Steven L. Brunton,Sergio Hoyas,Soledad Le Clainche,Hector Gomez,Ricardo Vinuesa
机构: University of Michigan (密歇根大学); Universidad Politécnica de Madrid (马德里理工大学); Purdue University (普渡大学); Universitat Politècnica de València (瓦伦西亚理工大学); Caltech (加州理工学院); University of Washington (华盛顿大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Flow physics and more broadly physical phenomena governed by partial differential equations (PDEs), are inherently continuous, high-dimensional and often chaotic in nature. Traditionally, researchers have explored these rich spatiotemporal PDE solution spaces using laboratory experiments and/or computationally expensive numerical simulations. This severely limits automated and large-scale exploration, unlike domains such as drug discovery or materials science, where discrete, tokenizable representations naturally interface with large language models. We address this by coupling multi-agent LLMs with latent foundation models (LFMs), a generative model over parametrised simulations, that learns explicit, compact and disentangled latent representations of flow fields, enabling continuous exploration across governing PDE parameters and boundary conditions. The LFM serves as an on-demand surrogate simulator, allowing agents to query arbitrary parameter configurations at negligible cost. A hierarchical agent architecture orchestrates exploration through a closed loop of hypothesis, experimentation, analysis and verification, with a tool-modular interface requiring no user support. Applied to flow past tandem cylinders at Re = 500, the framework autonomously evaluates over 1,600 parameter-location pairs and discovers divergent scaling laws: a regime-dependent two-mode structure for minimum displacement thickness and a robust linear scaling for maximum momentum thickness, with both landscapes exhibiting a dual-extrema structure that emerges at the near-wake to co-shedding regime transition. The coupling of the learned physical representations with agentic reasoning establishes a general paradigm for automated scientific discovery in PDE-governed systems.

[CV-314] Brain-Grasp: Graph-based Saliency Priors for Improved fMRI-based Visual Brain Decoding

【速读】:该论文旨在解决脑引导图像生成中对象级结构与语义保真度难以维持的问题,尤其针对现有方法忽视显著性物体空间排列而导致概念不一致输出的缺陷。其解决方案的关键在于提出一种基于显著性的解码框架,利用图结构信息引导的显著性先验,将大脑信号中的结构线索转化为空间掩码;这些掩码结合嵌入提取的语义信息作为条件输入,指导扩散模型进行图像重建,从而在保持自然场景构图的同时增强对象一致性。该方法仅依赖一个冻结的单阶段扩散模型,相较于多阶段流程更为轻量且高效。

链接: https://arxiv.org/abs/2604.10617
作者: Mohammad Moradi,Morteza Moradi,Marco Grassia,Giuseppe Mangioni
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Recent progress in brain-guided image generation has improved the quality of fMRI-based reconstructions; however, fundamental challenges remain in preserving object-level structure and semantic fidelity. Many existing approaches overlook the spatial arrangement of salient objects, leading to conceptually inconsistent outputs. We propose a saliency-driven decoding framework that employs graph-informed saliency priors to translate structural cues from brain signals into spatial masks. These masks, together with semantic information extracted from embeddings, condition a diffusion model to guide image regeneration, helping preserve object conformity while maintaining natural scene composition. In contrast to pipelines that invoke multiple diffusion stages, our approach relies on a single frozen model, offering a more lightweight yet effective design. Experiments show that this strategy improves both conceptual alignment and structural similarity to the original stimuli, while also introducing a new direction for efficient, interpretable, and structurally grounded brain decoding.

[CV-315] Physics-Informed Synthetic Dataset and Denoising TIE-Reconstructed Phase Maps in Transient Flows Using Deep Learning

【速读】:该论文旨在解决基于传输强度方程(Transport of Intensity Equation, TIE)重构的定量相位成像中,因逆拉普拉斯求解器引入的空间相关低频伪影问题,这些伪影会掩盖如射流羽流、激波前沿和密度梯度等重要流场结构。传统滤波方法失效,因为信号与噪声占据重叠的空间频率带宽,且每帧均为物理唯一、不可重复的流动状态,缺乏配对真值用于监督学习。解决方案的关键在于构建一种物理信息驱动的合成训练数据集:通过程序化生成符合物理规律的气体流动形态(包括可压缩射流、湍流涡旋场、密度界面、周期性气泡及膨胀波),并经前向TIE仿真与逆拉普拉斯重建后生成逼真的含噪相位图;进而使用U-Net结构的卷积去噪网络仅在该合成数据上训练,最终实现对真实高速(25,000 fps)平行TIE记录的零样本泛化,显著提升信背比(提高13,260%)和射流区域结构锐度(提升100.8%)。

链接: https://arxiv.org/abs/2604.10610
作者: Krishna Rajput,Vipul Gupta,Sudheesh K. Rajput,Yasuhiro Awatsuji
机构: 未知
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)
备注: 18 pages, 6 figures

点击查看摘要

Abstract:High-speed quantitative phase imaging enables non-intrusive visualization of transient compressible gas flows and energetic phenomena. However, phase maps reconstructed via the transport of intensity equation (TIE) suffer from spatially correlated low-frequency artifacts introduced by the inverse Laplacian solver, which obscure meaningful flow structures such as jet plumes, shockwave fronts, and density gradients. Conventional filtering approaches fail because signal and noise occupy overlapping spatial frequency bands, and no paired ground truth exists since every frame represents a physically unique, non-repeatable flow state. We address this by developing a physics-informed synthetic training dataset where clean targets are procedurally generated using physically plausible gas flow morphologies, including compressible jet plumes, turbulent eddy fields, density fronts, periodic air pockets, and expansion fans, and passed through a forward TIE simulation followed by inverse Laplacian reconstruction to produce realistic noisy phase maps. A U-Net-based convolutional denoising network trained solely on this synthetic data is evaluated on real phase maps acquired at 25,000 fps, demonstrating zero-shot generalization to real parallel TIE recordings, with a 13,260% improvement in signal-to-background ratio and 100.8% improvement in jet-region structural sharpness across 20 evaluated frames.

[CV-316] Compact single-shot ranging and near-far imaging using metasurfaces

【速读】:该论文旨在解决传统成像系统在有限空间内难以同时实现多距离聚焦成像的问题,尤其针对资源受限的边缘平台(如国防应用场景)中对紧凑型、高精度深度感知能力的需求。其解决方案的关键在于设计了一种超表面成像系统(metasurface imaging system),通过在单一光电传感器上同步捕获近距(1–2 cm)和远距(约40 cm)图像,构建焦堆栈(focal stack),并结合计算高效的基于离焦的深度估计算法(depth-from-defocus algorithm),实现了从12 mm到20 mm范围内±1 mm精度的被动测距能力,且整体系统仅占15 mm光路长度,具备高度集成性与实用性。

链接: https://arxiv.org/abs/2604.10037
作者: Junjie Luo,Yuxuan Liu,Wei Ting Chen,Qing Wang,Qi Guo
机构: Purdue University (普渡大学); SNOChip INC. (SNOChip 公司)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present a metasurface imaging system capable of simultaneously capturing two images at close range (1-2~cm) and an additional image at long range (about 40~cm) on a shared photosensor. The close-range image pair focuses at 1.4~cm and 2.0~cm, respectively, which forms a focal stack, enabling passive ranging with an accuracy of \pm 1~mm from 12~mm to 20~mm through a computationally efficient depth-from-defocus algorithm for a simplified scenario. The entire system is compact, with a total track length of 15~mm, making it suitable for seamless integration into edge platforms for defense and other resource-constrained applications.

[CV-317] Search-MIND: Training-Free Multi-Modal Medical Image Registration

【速读】:该论文旨在解决多模态图像配准(multi-modal image registration)在实际应用中面临的两大核心问题:一是模态间非线性强度关系导致的配准精度下降,二是深度学习模型在未见模态上出现的泛化能力崩溃(generalization collapse)。针对这些问题,其解决方案的关键在于提出一种无需训练的迭代优化框架Search-MIND,该框架采用粗到精(coarse-to-fine)策略,通过引入两种创新损失函数实现稳定且高精度的配准:一是方差加权互信息(Variance-Weighted Mutual Information, VWMI),用于抑制背景噪声和均匀区域对全局配准的干扰,增强对关键组织区域的敏感性;二是搜索型MIND(Search-MIND, S-MIND),通过扩大局部搜索范围来扩展结构描述符的收敛域,从而提升算法对局部最优解的鲁棒性。实验证明,该方法在CARE Liver 2025和CHAOS Challenge数据集上显著优于传统方法(如ANTs)和基于基础模型的方法(如DINO-reg),展现出跨模态的一致性和稳定性优势。

链接: https://arxiv.org/abs/2604.09743
作者: Boya Wang,Ruizhe Li,Chao Chen,Xin Chen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-modal image registration plays a critical role in precision medicine but faces challenges from non-linear intensity relationships and local optima. While deep learning models enable rapid inference, they often suffer from generalization collapse on unseen modalities. To address this, we propose Search-MIND, a training-free, iterative optimization framework for instance-specific registration. Our pipeline utilizes a coarse-to-fine strategy: a hierarchical coarse alignment stage followed by deformable refinement. We introduce two novel loss functions: Variance-Weighted Mutual Information (VWMI), which prioritizes informative tissue regions to shield global alignment from background noise and uniform regions, and Search-MIND (S-MIND), which broadens the convergence basin of structural descriptors by considering larger local search range. Evaluations on CARE Liver 2025 and CHAOS Challenge datasets show that Search-MIND consistently outperforms classical baselines like ANTs and foundation model-based approaches like DINO-reg, offering superior stability across diverse modalities.

人工智能

[AI-0] Physics-Informed State Space Models for Reliable Solar Irradiance Forecasting in Off-Grid Systems

【速读】:该论文旨在解决当前基于数据驱动的深度学习模型在独立离网光伏系统中出现的关键异常问题,包括云层变化时严重的时序相位滞后以及物理上不可能的夜间发电现象。为弥合数据驱动建模与确定性天体力学之间的偏差,作者提出了一种名为“热力学液态流形网络”(Thermodynamic Liquid Manifold Network)的新方法,其核心在于将15个气象与几何变量投影至Koopman线性化的黎曼流形空间,从而系统性地刻画复杂的气候动力学;该架构通过引入谱校准单元(Spectral Calibration unit)和乘法型热力学Alpha门(multiplicative Thermodynamic Alpha-Gate),融合实时大气透射率与理论晴空边界模型,结构化地强制遵守严格的天体几何约束,从而彻底消除夜间虚假发电误差,并在快速天气变化中实现零延迟同步响应。

链接: https://arxiv.org/abs/2604.11807
作者: Mohammed Ezzaldin Babiker Abdullah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:The stable operation of autonomous off-grid photovoltaic systems dictates reliance on solar forecasting algorithms that respect atmospheric thermodynamics. Contemporary deep learning models consistently exhibit critical anomalies, primarily severe temporal phase lags during cloud transients and physically impossible nocturnal power generation. To resolve this divergence between data-driven modeling and deterministic celestial mechanics, this research introduces the Thermodynamic Liquid Manifold Network. The proposed methodology projects 15 meteorological and geometric variables into a Koopman-linearized Riemannian manifold to systematically map complex climatic dynamics. The architecture integrates a Spectral Calibration unit and a multiplicative Thermodynamic Alpha-Gate. This system synthesizes real-time atmospheric opacity with theoretical clear-sky boundary models, structurally enforcing strict celestial geometry compliance. This completely neutralizes phantom nocturnal generation while maintaining zero-lag synchronization during rapid weather shifts. Validated against a rigorous five-year testing horizon in a severe semi-arid climate, the framework achieves an RMSE of 18.31 Wh/m2 and a Pearson correlation of 0.988. The model strictly maintains a zero-magnitude nocturnal error across all 1826 testing days and exhibits a sub-30-minute phase response during high-frequency transients. Comprising exactly 63,458 trainable parameters, this ultra-lightweight design establishes a robust, thermodynamically consistent standard for edge-deployable microgrid controllers.

[AI-1] A Mechanistic Analysis of Looped Reasoning Language Models

【速读】:该论文旨在解决循环推理语言模型(looped reasoning language models)与标准前馈模型在内部动态机制上的差异问题,特别是探究其潜在状态(latent states)如何演化以及是否能映射到传统前馈模型中的推理阶段。解决方案的关键在于通过机制分析揭示:在循环结构中,每一层在隐空间中收敛至不同的固定点(fixed point),从而形成稳定的周期轨迹;这种固定点的稳定促使注意力头行为趋于一致,使模型在每次迭代中重复类似前馈模型中的推理阶段,进而为架构设计提供基于机制理解的实践指导。

链接: https://arxiv.org/abs/2604.11791
作者: Hugh Blayney,Álvaro Arroyo,Johan Obando-Ceron,Pablo Samuel Castro,Aaron Courville,Michael M. Bronstein,Xiaowen Dong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 39 pages, 63 figures

点击查看摘要

Abstract:Reasoning has become a central capability in large language models. Recent research has shown that reasoning performance can be improved by looping an LLM’s layers in the latent dimension, resulting in looped reasoning language models. Despite promising results, few works have investigated how their internal dynamics differ from those of standard feedforward models. In this paper, we conduct a mechanistic analysis of the latent states in looped language models, focusing in particular on how the stages of inference observed in feedforward models compare to those observed in looped ones. To this end, we analyze cyclic recurrence and show that for many of the studied models each layer in the cycle converges to a distinct fixed point; consequently, the recurrent block follows a consistent cyclic trajectory in the latent space. We provide evidence that as these fixed points are reached, attention-head behavior stabilizes, leading to constant behavior across recurrences. Empirically, we discover that recurrent blocks learn stages of inference that closely mirror those of feedforward models, repeating these stages in depth with each iteration. We study how recurrent block size, input injection, and normalization influence the emergence and stability of these cyclic fixed points. We believe these findings help translate mechanistic insights into practical guidance for architectural design.

[AI-2] ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

【速读】:该论文旨在解决工具增强型大语言模型(Tool-augmented Large Language Model)代理在执行多步骤现实任务时面临的间接提示注入(indirect prompt injection)安全漏洞问题。攻击者通过在工具返回内容中嵌入恶意指令,使代理将其作为可信观察记录到对话历史中,从而诱导其执行非预期操作。该漏洞存在于三种主要攻击路径:网络与本地内容注入、MCP服务器注入和技能文件注入。解决方案的关键在于提出一种名为 \textscClawGuard 的运行时安全框架,该框架在每次工具调用边界强制执行用户确认的规则集,将依赖模型对齐的防御机制转变为确定性、可审计的拦截机制,能够在实际影响发生前阻断恶意工具调用。该方法无需修改模型或基础设施,仅通过在首次外部工具调用前自动推导任务特定的访问约束即可有效防御所有三类注入攻击,且不损害代理功能。

链接: https://arxiv.org/abs/2604.11790
作者: Wei Zhao,Zhe Li,Peixin Zhang,Jun Sun
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. This vulnerability manifests across three primary attack channels: web and local content injection, MCP server injection, and skill file injection. To address these vulnerabilities, we introduce \textscClawGuard, a novel runtime security framework that enforces a user-confirmed rule set at every tool-call boundary, transforming unreliable alignment-dependent defense into a deterministic, auditable mechanism that intercepts adversarial tool calls before any real-world effect is produced. By automatically deriving task-specific access constraints from the user’s stated objective prior to any external tool invocation, \textscClawGuard blocks all three injection pathways without model modification or infrastructure change. Experiments across five state-of-the-art language models on AgentDojo, SkillInject, and MCPSafeBench demonstrate that \textscClawGuard achieves robust protection against indirect prompt injection without compromising agent utility. This work establishes deterministic tool-call boundary enforcement as an effective defense mechanism for secure agentic AI systems, requiring neither safety-specific fine-tuning nor architectural modification. Code is publicly available at this https URL.

[AI-3] Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure

【速读】:该论文旨在解决组织知识在生成式 AI(Generative AI)应用中缺乏认知结构(epistemic structure)的问题,即现有检索增强生成(Retrieval-Augmented Generation, RAG)系统无法区分决策承诺、争议性主张与已确立事实之间的认知状态差异,导致AI代理输出的信息可信度不可控。其解决方案的核心是提出 OIDA 框架,该框架将组织知识建模为带有认知类别(epistemic class)、类特定衰减的重要性评分以及带符号的矛盾边(signed contradiction edges)的知识对象(Knowledge Objects),并引入“问题即建模无知”(QUESTION-as-modeled-ignorance)机制——一种具有逆向衰减特性的原语,能够随时间推移更紧迫地揭示组织未知领域。此外,论文还提出了可计算的认知质量评分(Epistemic Quality Score, EQS)用于量化评估,并通过实证验证了该机制在减少无效信息和提升认知透明度方面的有效性。

链接: https://arxiv.org/abs/2604.11759
作者: Federico Bottino,Carlo Ferrero,Nicholas Dosio,Pierfrancesco Beneventano
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 2 figures, 8 tables, 6 appendices

点击查看摘要

Abstract:Organizational knowledge used by AI agents typically lacks epistemic structure: retrieval systems surface semantically relevant content without distinguishing binding decisions from abandoned hypotheses, contested claims from settled ones, or known facts from unresolved questions. We argue that the ceiling on organizational AI is not retrieval fidelity but \emphepistemic fidelity–the system’s ability to represent commitment strength, contradiction status, and organizational ignorance as computable properties. We present OIDA, a framework that structures organizational knowledge as typed Knowledge Objects carrying epistemic class, importance scores with class-specific decay, and signed contradiction edges. The Knowledge Gravity Engine maintains scores deterministically with proved convergence guarantees (sufficient condition: max degree 7 ; empirically robust to degree 43). OIDA introduces QUESTION-as-modeled-ignorance: a primitive with inverse decay that surfaces what an organization does \emphnot know with increasing urgency–a mechanism absent from all surveyed systems. We describe the Epistemic Quality Score (EQS), a five-component evaluation methodology with explicit circularity analysis. In a controlled comparison ( n=10 response pairs), OIDA’s RAG condition (3,868 tokens) achieves EQS 0.530 vs.\ 0.848 for a full-context baseline (108,687 tokens); the 28.1\times token budget difference is the primary confound. The QUESTION mechanism is statistically validated (Fisher p=0.0325 , OR =21.0 ). The formal properties are established; the decisive ablation at equal token budget (E4) is pre-registered and not yet run. Comments: 10 pages, 2 figures, 8 tables, 6 appendices Subjects: Artificial Intelligence (cs.AI) ACMclasses: I.2.4; I.2.6; H.3.3 Cite as: arXiv:2604.11759 [cs.AI] (or arXiv:2604.11759v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.11759 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-4] Grounded World Model for Semantically Generalizable Planning

【速读】:该论文旨在解决视觉-运动模型预测控制(visuomotor Model Predictive Control, MPC)中目标图像难以预先获取的问题,以及传统基于图像的目标输入在交互性上的局限性。其核心挑战在于如何在新环境中实现无需预设目标图像的可控动作规划,并提升任务指令的语义泛化能力。解决方案的关键在于提出一种接地世界模型(Grounded World Model, GWM),该模型在视觉-语言对齐的潜在空间(vision-language-aligned latent space)中学习动作与任务指令之间的语义映射关系,从而将MPC中的评分机制从图像距离转化为任务指令嵌入相似度。这一方法使MPC具备了自然语言指令驱动的能力,显著提升了在未见视觉信号和指代表达下的任务成功率,在WISER基准测试中达到87%的成功率,远超传统视觉-语言模型(VLM-based VLA)的22%平均成功率。

链接: https://arxiv.org/abs/2604.11751
作者: Quanyi Li,Lan Feng,Haonan Zhang,Wuyang Li,Letian Wang,Alexandre Alahi,Harold Soh
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In Model Predictive Control (MPC), world models predict the future outcomes of various action proposals, which are then scored to guide the selection of the optimal action. For visuomotor MPC, the score function is a distance metric between a predicted image and a goal image, measured in the latent space of a pretrained vision encoder like DINO and JEPA. However, it is challenging to obtain the goal image in advance of the task execution, particularly in new environments. Additionally, conveying the goal through an image offers limited interactivity compared with natural language. In this work, we propose to learn a Grounded World Model (GWM) in a vision-language-aligned latent space. As a result, each proposed action is scored based on how close its future outcome is to the task instruction, reflected by the similarity of embeddings. This approach transforms the visuomotor MPC to a VLA that surpasses VLM-based VLAs in semantic generalization. On the proposed WISER benchmark, GWM-MPC achieves a 87% success rate on the test set comprising 288 tasks that feature unseen visual signals and referring expressions, yet remain solvable with motions demonstrated during training. In contrast, traditional VLAs achieve an average success rate of 22%, even though they overfit the training set with a 90% success rate.

[AI-5] Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games ACL2026

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在多人博弈场景下,面对不完全信息和欺骗性线索时推理能力显著下降的问题,尤其聚焦于需要多跳推理(multi-hop reasoning)的谋杀谜案游戏(Murder Mystery Games)。解决方案的关键在于提出一种协作式多智能体框架,通过生成角色驱动的高质量剧本并设计两阶段代理监控训练策略:第一阶段基于思维链(chain-of-thought)微调,利用模拟不确定性和欺骗性的合成数据增强模型推理能力;第二阶段采用GRPO强化学习与代理监督的奖励塑形机制,引导模型发展出符合角色身份(如凶手 vs. 无辜者)的特定推理行为及有效的多模态多跳推理能力。该方法显著提升了VLM在叙事推理、隐藏事实提取和抗欺骗理解方面的性能,为复杂社会情境下的多模态多跳推理提供了可扩展的训练与评估范式。

链接: https://arxiv.org/abs/2604.11741
作者: Keyang Zhong,Junlin Xie,Hefeng Wu,Haofeng Li,Guanbin Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, Findings of ACL 2026

点击查看摘要

Abstract:Vision-language models (VLMs) have shown impressive capabilities in perceptual tasks, yet they degrade in complex multi-hop reasoning under multiplayer game settings with imperfect and deceptive information. In this paper, we study a representative multiplayer task, Murder Mystery Games, which require inferring hidden truths based on partial clues provided by roles with different intentions. To address this challenge, we propose a collaborative multi-agent framework for evaluating and synthesizing high-quality, role-driven multiplayer game scripts, enabling fine-grained interaction patterns tailored to character identities (i.e., murderer vs. innocent). Our system generates rich multimodal contexts, including character backstories, visual and textual clues, and multi-hop reasoning chains, through coordinated agent interactions. We design a two-stage agent-monitored training strategy to enhance the reasoning ability of VLMs: (1) chain-of-thought based fine-tuning on curated and synthetic datasets that model uncertainty and deception; (2) GRPO-based reinforcement learning with agent-monitored reward shaping, encouraging the model to develop character-specific reasoning behaviors and effective multimodal multi-hop inference. Extensive experiments demonstrate that our method significantly boosts the performance of VLMs in narrative reasoning, hidden fact extraction, and deception-resilient understanding. Our contributions offer a scalable solution for training and evaluating VLMs under uncertain, adversarial, and socially complex conditions, laying the groundwork for future benchmarks in multimodal multi-hop reasoning under imperfect information.

[AI-6] Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving

【速读】:该论文旨在解决闭环协同驾驶中轨迹规划器在多智能体场景下难以保持场景一致性与道路遵循性,且在线强化学习训练不稳定的问题。解决方案的关键在于提出Multi-ORFT框架,通过将场景条件化的扩散预训练与稳定的在线强化后训练相结合:预训练阶段利用智能体间自注意力、交叉注意力及AdaLN-Zero-based场景条件机制提升联合轨迹的场景一致性和道路适应性;后训练阶段构建两级马尔可夫决策过程(MDP),暴露逐步反向核似然以支持在线优化,并结合密集轨迹级奖励与方差门控组相对策略优化(VG-GRPO)实现训练稳定性。该方法显著提升了安全性和交通效率指标,在WOMD基准上实现了碰撞率和离道率下降,同时平均速度提升。

链接: https://arxiv.org/abs/2604.11734
作者: Haojie Bai,Aimin Li,Ruoyu Yao,Xiongwei Zhao,Tingting Zhang,Xing Zhang,Lin Gao,and Jun Ma
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Closed-loop cooperative driving requires planners that generate realistic multimodal multi-agent trajectories while improving safety and traffic efficiency. Existing diffusion planners can model multimodal behaviors from demonstrations, but they often exhibit weak scene consistency and remain poorly aligned with closed-loop objectives; meanwhile, stable online post-training in reactive multi-agent environments remains difficult. We present Multi-ORFT, which couples scene-conditioned diffusion pre-training with stable online reinforcement post-training. In pre-training, the planner uses inter-agent self-attention, cross-attention, and AdaLN-Zero-based scene conditioning to improve scene consistency and road adherence of joint trajectories. In post-training, we formulate a two-level MDP that exposes step-wise reverse-kernel likelihoods for online optimization, and combine dense trajectory-level rewards with variance-gated group-relative policy optimization (VG-GRPO) to stabilize training. On the WOMD closed-loop benchmark, Multi-ORFT reduces collision rate from 2.04% to 1.89% and off-road rate from 1.68% to 1.36%, while increasing average speed from 8.36 to 8.61 m/s relative to the pre-trained planner, and it outperforms strong open-source baselines including SMART-large, SMART-tiny-CLSFT, and VBD on the primary safety and efficiency metrics. These results show that coupling scene-consistent denoising with stable online diffusion-policy optimization improves the reliability of closed-loop cooperative driving.

[AI-7] Endogenous Information in Routing Games: Memory-Constrained Equilibria Recall Braess Paradoxes and Memory Design

【速读】:该论文旨在解决交通路由博弈中旅行者基于有限记忆(finite memory)进行路径选择的问题,即在传统固定路径集合之外,引入“内生回忆机制”(endogenous recall),以更贴近现实中的用户行为。其核心挑战在于如何从微观层面的有限记忆更新规则(如LRU策略)推导出宏观上可分析、可设计的均衡结构,并实现对系统性能的有效调控。解决方案的关键在于构建一个双层理论框架:第一层是基于微模型的“健忘 Wardrop 均衡”(Forgetful Wardrop Equilibrium, FWE),其中每个旅行者持有有限记忆状态并依据 logit 规则选择路径;第二层是“显著性模型”(salience model),将记忆和界面效应抽象为路径权重,形成一个严格凸势函数下的唯一随机用户均衡(stochastic user equilibrium)。该两层模型通过精确等价关系(当记忆大小 B=1 时)或近似转换管道(如 LRU → TTL → salience)建立联系,从而实现了对复杂微观过程的简化建模与可控设计,尤其在比例预算约束和仿射绑定约束下提供了可构造的算法方案,并揭示了“回忆 Braess 悖论”——即增强记忆可能反而恶化均衡延迟的现象。

链接: https://arxiv.org/abs/2604.11733
作者: Saad Alqithami
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:We study routing games in which travelers optimize over routes that are remembered or surfaced, rather than over a fixed exogenous action set. The paper develops a tractable design theory for endogenous recall and then connects it back to an explicit finite-memory micro model. At the micro level, each traveler carries a finite memory state, receives surfaced alternatives, chooses via a logit rule, and updates memory under a policy such as LRU. This yields a stationary Forgetful Wardrop Equilibrium (FWE); existence is proved under mild regularity, and uniqueness follows in a contraction regime for the reduced fixed-point map. The paper’s main design layer is a stationary salience model that summarizes persistent memory and interface effects as route-specific weights. Salience-weighted stochastic user equilibrium is the unique minimizer of a strictly convex potential, which yields a clean optimization and implementability theory. In this layer we characterize governed implementability under ratio budgets and affine tying constraints, and derive constructive algorithms on parallel and series-parallel networks. The bridge between layers is exact for last-choice memory (B=1): the micro model is then equivalent to the salience model, so any interior salience vector can be realized by an appropriate surfacing policy. For larger memories, we develop an explicit LRU-to-TTL-to-salience approximation pipeline and add contraction-based bounds that translate surrogate-map error into fixed-point and welfare error. Finally, we define a Recall Braess Paradox, in which improving recall increases equilibrium delay without changing physical capacity, and show that it can arise on every two-terminal network with at least two distinct s-t paths. Targeted experiments support the approximation regime, governed-design predictions, and the computational advantages of the reduced layer.

[AI-8] A Mamba-Based Multimodal Network for Multiscale Blast-Induced Rapid Structural Damage Assessment

【速读】:该论文旨在解决结构损伤评估(Structural Damage Assessment, SDA)在灾后管理中的准确性与效率问题,尤其针对传统现场勘查方法受限于可达性、安全风险及时间成本的缺陷,以及现有基于遥感图像的机器学习方法在训练数据需求大、物理机制建模不足等方面的局限。解决方案的关键在于提出一种融合多尺度爆炸载荷信息与光学遥感影像的Mamba-based多模态网络架构,通过引入爆炸力学特性增强模型对损伤模式的物理感知能力,从而在无需大量标注数据的前提下实现更快速、精准的灾后结构损伤识别。

链接: https://arxiv.org/abs/2604.11709
作者: Wanli Ma,Sivasakthy Selvakumaran,Dain G. Farrimond,Adam A. Dennis,Samuel E. Rigby
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate and rapid structural damage assessment (SDA) is crucial for post-disaster management, helping responders prioritise resources, plan rescues, and support recovery. Traditional field inspections, though precise, are limited by accessibility, safety risks, and time constraints, especially after large explosions. Machine learning with remote sensing has emerged as a scalable solution for rapid SDA, with Mamba-based networks achieving state-of-the-art performance. However, these methods often require extensive training and large datasets, limiting real-world applicability. Moreover, they fail to incorporate key physical characteristics of blast loading for SDA. To overcome these challenges, we propose a Mamba-based multimodal network for rapid SDA that integrates multi-scale blast-loading information with optical remote sensing images. Evaluated on the 2020 Beirut explosion, our method significantly improves performance over state-of-the-art approaches. Code is available at: this https URL

[AI-9] Fairness is Not Flat: Geometric Phase Transitions Against Shortcut Learning

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks)中因捷径学习(shortcut learning)导致的鲁棒性下降和敏感应用场景中的群体偏差问题。其核心挑战在于模型倾向于记忆低维的虚假相关性(spurious correlations),而非捕捉真实的因果机制。解决方案的关键在于提出一种几何先验(geometric a priori)方法,通过部署一个无隐藏层(N=1)的拓扑审计器(Topological Auditor),在无需人工干预的情况下数学上隔离出主导梯度的特征;进而实证发现容量相变(Capacity Phase Transition)现象:一旦线性捷径被修剪,网络被迫利用更高几何容量(N ≥ 16)来弯曲决策边界并学习符合伦理的表示,从而显著降低反事实性别脆弱性(从21.18%降至7.66%),且优于L1正则化和计算开销更高的事后方法(如Just Train Twice)。

链接: https://arxiv.org/abs/2604.11704
作者: Nicolas Rodriguez-Alvarez(Instituto de Educacion Secundaria Parquesol, Valladolid, Spain),Fernando Rodriguez-Merino(University of Valladolid, Valladolid, Spain)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep Neural Networks are highly susceptible to shortcut learning, frequently memorizing low-dimensional spurious correlations instead of underlying causal mechanisms. This phenomenon not only degrades out-of-distribution robustness but also induces severe demographic biases in sensitive applications. In this paper, we propose a geometric \textita priori methodology to mitigate shortcut learning. By deploying a zero-hidden-layer ( N=1 ) Topological Auditor, we mathematically isolate features that monopolize the gradient without human intervention. We empirically demonstrate a Capacity Phase Transition: once linear shortcuts are pruned, networks are forced to utilize higher geometric capacity ( N \geq 16 ) to curve the decision boundary and learn ethical representations. Our approach outperforms L1 Regularization – which collapses into demographic bias – and operates at a fraction of the computational cost of post-hoc methods like Just Train Twice (JTT), successfully reducing counterfactual gender vulnerability from 21.18% to 7.66%.

[AI-10] DreamKG: A KG-Augmented Conversational System for People Experiencing Homelessness ALT

【速读】:该论文旨在解决无家可归人群(People experiencing homelessness, PEH)获取社区服务信息时面临的信息获取不及时、不准确的问题。解决方案的关键在于构建一个基于知识图谱(knowledge graph)增强的对话系统——DreamKG,该系统通过将Neo4j知识图谱与结构化查询理解相结合,确保响应基于经验证的、实时更新的费城组织、服务、地点及营业时间数据,从而有效避免大语言模型(Large Language Models, LLMs)常见的幻觉问题。系统进一步融合空间推理和时间过滤机制,实现基于距离的推荐和时段敏感的服务匹配,显著提升了服务信息的准确性与可用性。

链接: https://arxiv.org/abs/2604.11703
作者: Javad M Alizadeh,Genhui Zheng,Chiu C Tan,Yuzhou Chen,Omar Martinez,Philip McCallion,Ying Ding,Chenguang Yang,AnneMarie Tomosky,Huanmei Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This manuscript has been accepted at the 14th IEEE International Conference on Healthcare Informatics (ICHI 2026)

点击查看摘要

Abstract:People experiencing homelessness (PEH) face substantial barriers to accessing timely, accurate information about community services. DreamKG addresses this through a knowledge graph-augmented conversational system that grounds responses in verified, up-to-date data about Philadelphia organizations, services, locations, and hours. Unlike standard large language models (LLMs) prone to hallucinations, DreamKG combines Neo4j knowledge graphs with structured query understanding to handle location-aware and time-sensitive queries reliably. The system performs spatial reasoning for distance-based recommendations and temporal filtering for operating hours. Preliminary evaluation shows 59% superiority over Google Search AI on relevant queries and 84% rejection of irrelevant queries. This demonstration highlights the potential of hybrid architectures that combines LLM flexibility with knowledge graph reliability to improve service accessibility for vulnerable populations effectively.

[AI-11] AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation

【速读】:该论文旨在解决现有仿真数据生成平台在机器人操作策略训练中缺乏物体功能属性(affordance)信息的问题,导致无法自动生成语义正确的交互轨迹,例如通过杯柄抓取、从杯口倾倒或挂起杯子等任务。其解决方案的关键在于提出AffordSim框架,首次将开放词汇的3D功能预测(open-vocabulary 3D affordance prediction)集成到操纵数据生成流程中,利用VoxAfford模型——一种基于多尺度几何特征增强大语言模型(MLLM)输出的3D功能检测器——对物体点云预测功能图谱,从而引导抓取位姿估计聚焦于任务相关的功能性区域。该框架还结合了NVIDIA Isaac Sim、跨机器人本体支持、视觉语言模型(VLM)驱动的任务生成及基于DA3的3D高斯重建域随机化技术,实现了可扩展、语义感知的操纵数据自动化生成。

链接: https://arxiv.org/abs/2604.11674
作者: Mingyang Li,Haofan Xu,Haowen Sun,Xinzhe Chen,Sihua Ren,Liqi Huang,Xinyang Sui,Chenyang Miao,Qiongjie Cui,Zeyang Liu,Xingyu Chen,Xuguang Lan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Simulation-based data generation has become a dominant paradigm for training robotic manipulation policies, yet existing platforms do not incorporate object affordance information into trajectory generation. As a result, tasks requiring precise interaction with specific functional regions–grasping a mug by its handle, pouring from a cup’s rim, or hanging a mug on a hook–cannot be automatically generated with semantically correct trajectories. We introduce AffordSim, the first simulation framework that integrates open-vocabulary 3D affordance prediction into the manipulation data generation pipeline. AffordSim uses our VoxAfford model, an open-vocabulary 3D affordance detector that enhances MLLM output tokens with multi-scale geometric features, to predict affordance maps on object point clouds, guiding grasp pose estimation toward task-relevant functional regions. Built on NVIDIA Isaac Sim with cross-embodiment support (Franka FR3, Panda, UR5e, Kinova), VLM-powered task generation, and novel domain randomization using DA3-based 3D Gaussian reconstruction from real photographs, AffordSim enables automated, scalable generation of affordance-aware manipulation data. We establish a benchmark of 50 tasks across 7 categories (grasping, placing, stacking, pushing/pulling, pouring, mug hanging, long-horizon composite) and evaluate 4 imitation learning baselines (BC, Diffusion Policy, ACT, Pi 0.5). Our results reveal that while grasping is largely solved (53-93% success), affordance-demanding tasks such as pouring into narrow containers (1-43%) and mug hanging (0-47%) remain significantly more challenging for current imitation learning methods, highlighting the need for affordance-aware data generation. Zero-shot sim-to-real experiments on a real Franka FR3 validate the transferability of the generated data.

[AI-12] Beyond LLM s Sparse Distributed Memory and Neuromorphics A Hyper-Dimensional SRAM-CAM “VaCoAl” for Ultra-High Speed Ultra-Low Power and Low Cost

【速读】:该论文旨在解决现代人工智能(AI)中存在的三大核心问题:灾难性遗忘(catastrophic forgetting)、学习停滞(learning stagnation)以及绑定问题(Binding Problem)。针对这些问题,作者提出了一种基于有限域代数的确定性高维计算(HDC)架构——VaCoAl(Vague Coincident Algorithm),其关键在于将超高维记忆与确定性逻辑相结合,通过伽罗华域(Galois-field)扩散实现高维二进制空间中的正交化与检索,从而在无需训练的情况下支持可逆组合、保留元素独立性,并提供透明的可靠性评分(CR score)。该方案利用路径依赖的语义选择机制,等效于脉冲时间依赖可塑性(STDP),并可通过闭式表达式预测其强度,最终在Wikidata多跳推理任务中验证了其对概念传播的量化能力及从稀疏收敛到“后莱布尼茨超级高速公路”的相变现象,标志着一种新的HDC-AI范式。

链接: https://arxiv.org/abs/2604.11665
作者: Hiroyuki Chuma,Kanji Otsuka,Yoichi Sato
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 55 pages, 4 figure, 18 tables

点击查看摘要

Abstract:This paper reports an unexpected finding: in a deterministic hyperdimensional computing (HDC) architecture based on Galois-field algebra, a path-dependent semantic selection mechanism emerges, equivalent to spike-timing-dependent plasticity (STDP), with magnitude predictable a priori by a closed-form expression matching large-scale measurements. This addresses limitations of modern AI including catastrophic forgetting, learning stagnation, and the Binding Problem at an algebraic level. We propose VaCoAl (Vague Coincident Algorithm) and its Python implementation PyVaCoAl, combining ultra-high-dimensional memory with deterministic logic. Rooted in Sparse Distributed Memory, it resolves orthogonalisation and retrieval in high-dimensional binary spaces via Galois-field diffusion, enabling low-load deployment. VaCoAl is a memory-centric architecture prioritising retrieval and association, enabling reversible composition while preserving element independence and supporting compositional generalisation with a transparent reliability metric (CR score). We evaluated multi-hop reasoning on about 470k mentor-student relations from Wikidata, tracing up to 57 generations (over 25.5M paths). Using HDC bundling and unbinding with CR-based denoising, we quantify concept propagation over DAGs. Results show a reinterpretation of the Newton-Leibniz dispute and a phase transition from sparse convergence to a post-Leibniz “superhighway”, from which structural indicators emerge supporting a Kuhnian paradigm shift. Collision-tolerance mechanisms further induce path-based pruning that favors direct paths, yielding emergent semantic selection equivalent to STDP. VaCoAl thus defines a third paradigm, HDC-AI, complementing LLMs with reversible multi-hop reasoning.

[AI-13] Why Do Large Language Models Generate Harmful Content?

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)生成有害内容的因果机制不明确的问题。现有研究多关注现象描述,缺乏对行为背后因果路径的深入剖析。其解决方案的关键在于提出一种基于因果中介分析(causal mediation analysis)的方法,实现对模型层、模块(MLP与注意力块)及单个神经元的多粒度因果分解。实验表明,有害生成主要源于模型后层中MLP模块的功能失效,并由特定稀疏神经元作为门控机制触发最终输出;而早期层则负责对提示中的有害性进行语境理解并传递信号至后续模块,从而揭示了有害内容生成的可解释因果路径。

链接: https://arxiv.org/abs/2604.11663
作者: Rajesh Ganguli,Raha Moraffah
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have been shown to generate harmful content. However, the underlying causes of such behavior remain under explored. We propose a causal mediation analysis-based approach to identify the causal factors responsible for harmful generation. Our method performs a multi-granular analysis across model layers, modules (MLP and attention blocks), and individual neurons. Extensive experiments on state-of-the-art LLMs indicate that harmful generation arises in the later layers of the model, results primarily from failures in MLP blocks rather than attention blocks, and is associated with neurons that act as a gating mechanism for harmful generation. The results indicate that the early layers in the model are used for a contextual understanding of harmfulness in a prompt, which is then propagated through the model, to generate harmfulness in the late layers, as well as a signal indicating harmfulness through MLP blocks. This is then further propagated to the last layer of the model, specifically to a sparse set of neurons, which receives the signal and determines the generation of harmful content accordingly.

[AI-14] owards Autonomous Mechanistic Reasoning in Virtual Cells

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生物学等开放科学领域中应用受限的问题,即缺乏事实可验证且具有行动指导意义的解释。为应对这一挑战,作者提出了一种结构化的虚拟细胞解释形式化框架,将生物推理表示为机制动作图(mechanistic action graphs),从而实现系统性的验证与证伪。其解决方案的关键在于构建了一个多智能体框架——VCR-Agent,该框架整合了基于生物学知识的检索与基于验证器的过滤机制,实现了机制推理的自主生成与验证;同时,通过该框架构建了VC-TRACES数据集,为下游基因表达预测任务提供了高精度的监督信号,显著提升了模型的事实准确性与推理可靠性。

链接: https://arxiv.org/abs/2604.11661
作者: Yunhui Jang,Lu Zhu,Jake Fawkes,Alisandra Kaye Denton,Dominique Beaini,Emmanuel Noutahi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have recently gained significant attention as a promising approach to accelerate scientific discovery. However, their application in open-ended scientific domains such as biology remains limited, primarily due to the lack of factually grounded and actionable explanations. To address this, we introduce a structured explanation formalism for virtual cells that represents biological reasoning as mechanistic action graphs, enabling systematic verification and falsification. Building upon this, we propose VCR-Agent, a multi-agent framework that integrates biologically grounded knowledge retrieval with a verifier-based filtering approach to generate and validate mechanistic reasoning autonomously. Using this framework, we release VC-TRACES dataset, which consists of verified mechanistic explanations derived from the Tahoe-100M atlas. Empirically, we demonstrate that training with these explanations improves factual precision and provides a more effective supervision signal for downstream gene expression prediction. These results underscore the importance of reliable mechanistic reasoning for virtual cells, achieved through the synergy of multi-agent and rigorous verification.

[AI-15] CodeTracer: Towards Traceable Agent States

【速读】:该论文旨在解决代码代理(Code Agent)在复杂多阶段任务中调试困难的问题,尤其是由于并行工具调用和状态转移难以观测而导致的错误传播与定位难题。其解决方案的关键在于提出了一种名为CodeTracer的追踪架构,该架构通过动态提取器解析异构运行产物,以持久化内存重建分层的轨迹树(trace tree),并实现失败发生点的精确定位,从而有效识别错误起源及其下游影响链。

链接: https://arxiv.org/abs/2604.11641
作者: Han Li,Yifan Yao,Letian Zhu,Rili Feng,Hongyi Ye,Jiaming Wang,Yancheng He,Pengyu Zou,Lehan Zhang,Xinping Lei,Haoyang Huang,Ken Deng,Ming Sun,Zhaoxiang Zhang,He Ye,Jiaheng Liu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Code agents are advancing rapidly, but debugging them is becoming increasingly difficult. As frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, making the agent’s state transitions and error propagation hard to observe. In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses either focus on simple interaction or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain. To enable systematic evaluation, we construct CodeTraceBench from a large collection of executed trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under matched budgets. Our code and data are publicly available.

[AI-16] RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

【速读】:该论文旨在解决当前视觉生成任务中奖励模型(Reward Model)仅输出单一评分而缺乏可解释性的问题,这种简化处理导致无法利用人类偏好背后的推理过程来优化生成器。其解决方案的关键在于引入一种结构化的多维批判机制——即在评分前生成显式的、分维度的批判理由(rationales),从而将奖励模型从被动评估者转变为可主动优化生成过程的工具。具体而言,训练时这些结构化理由为强化学习提供细粒度、可解释的奖励信号;测试时则通过“生成-批判-精炼”循环,将批判转化为针对性的提示修订,无需参数更新即可提升输出质量。为避免昂贵的人工标注,作者提出Preference-Anchored Rationalization (PARROT) 框架,通过锚定生成、一致性过滤与蒸馏策略,从现有偏好数据中恢复高质量批判理由,最终实现性能优越且数据效率显著提升的 RationalRewards(8B)模型。

链接: https://arxiv.org/abs/2604.11626
作者: Haozhe Wang,Cong Wei,Weiming Ren,Jiaming Liu,Fangzhen Lin,Wenhu Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project Page: this https URL ; Code, Dataset, Models are released

点击查看摘要

Abstract:Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.

[AI-17] SCNO: Spiking Compositional Neural Operator – Towards a Neuromorphic Foundation Model for Nuclear PDE Solving

【速读】:该论文旨在解决神经算子(Neural Operator)在求解偏微分方程(PDE)时存在的三大局限性:1)通常作为单一模型训练,难以泛化到新物理场景;2)依赖高能耗的GPU硬件;3)当引入新物理机制时需从头重新训练。解决方案的关键在于提出一种模块化架构——脉冲组合神经算子(Spiking Compositional Neural Operator, SCNO),其核心创新包括:构建由小型脉冲神经算子块组成的库,每个块仅针对单一基本微分算子(对流、扩散、反应)训练;通过轻量级输入条件聚合器组合这些模块以求解未见过的耦合PDE系统;并引入一个小型修正网络学习跨耦合残差项,在冻结所有基础模块和聚合器的前提下实现零遗忘式的模块扩展能力。这一设计显著降低参数量(仅95K vs. 462K),同时在多个耦合PDE系统上达到最优精度,首次实现了具备内置无遗忘扩展能力的模块化类脑PDE求解框架。

链接: https://arxiv.org/abs/2604.11625
作者: Samrendra Roy,Souvik Chakraborty,Rizwan-uddin,Syed Bahauddin Alam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neural operators have emerged as powerful surrogates for partial differential equation (PDE) solvers, yet they are typically trained as monolithic models for individual PDEs, require energy-intensive GPU hardware, and must be retrained from scratch when new physics emerge. We introduce the Spiking Compositional Neural Operator (SCNO), a modular architecture combining spiking and conventional components that addresses all three limitations. SCNO maintains a library of small spiking neural operator blocks, each trained on a single elementary differential operator (convection, diffusion, reaction), and composes them through a lightweight input-conditioned aggregator to solve coupled PDEs not seen during block training. A small correction network learns cross-coupling residuals while keeping all blocks and the aggregator frozen, preserving zero-forgetting modular expansion by construction. We evaluate SCNO on eight PDE families including five coupled systems and a nuclear-relevant 1-group neutron diffusion equation. SCNO with correction achieves the lowest relative L^2 error on four of five coupled PDEs, outperforming both a monolithic spiking DeepONet (by up to 62%, mean over 3 seeds) and a standard ANN DeepONet (by up to 65%), while requiring only 95K trainable parameters versus 462K for the monolithic baseline. To our knowledge, this is the first compositional spiking neural operator and the first proof-of-concept for modular neuromorphic PDE solving with built-in forgetting-free expansion.

[AI-18] Context Kubernetes: Declarative Orchestration of Enterprise Knowledge for Agent ic AI Systems

【速读】:该论文旨在解决企业在部署生成式 AI(Generative AI)代理系统时面临的知识管理难题,即如何在组织范围内实现对知识的精准调度:确保正确的知识在正确的时间、以正确的权限被正确的代理访问。这一问题在结构上与十年前 Kubernetes 解决的容器编排问题高度相似。解决方案的关键在于提出“Context Kubernetes”架构,其核心包括六个形式化抽象、基于 YAML 的知识架构即代码声明式配置、一个 reconciliation 循环机制,以及三层代理权限模型——其中代理权限始终严格小于人类权限。实验表明,该方案能有效防止因治理缺失导致的内容泄露和跨域数据泄漏(26.5% 查询中发生),显著提升知识新鲜度检测效率(<1ms),并在五种攻击场景中实现 100% 阻断能力,同时保障零越权交付与架构级审批通道隔离,这是当前主流平台(如 Microsoft、Salesforce、AWS 和 Google)所不具备的能力。

链接: https://arxiv.org/abs/2604.11623
作者: Charafeddine Mouzouni
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 24 pages, 8 tables, 1 figure, 8 experiments (5 correctness + 3 value). Open-source prototype: this https URL

点击查看摘要

Abstract:We introduce Context Kubernetes, an architecture for orchestrating enterprise knowledge in agentic AI systems, with a prototype implementation and eight experiments. The core observation is that delivering the right knowledge, to the right agent, with the right permissions, at the right freshness – across an entire organization – is structurally analogous to the container orchestration problem Kubernetes solved a decade ago. We formalize six core abstractions, a YAML-based declarative manifest for knowledge-architecture-as-code, a reconciliation loop, and a three-tier agent permission model where agent authority is always a strict subset of human authority. Three value experiments show: (1) without governance, agents serve phantom content from deleted sources and leak cross-domain data in 26.5% of queries; (2) without freshness monitoring, stale content is served silently – with reconciliation, staleness is detected in under 1ms; (3) in five attack scenarios, flat permissions block 0/5 attacks, basic RBAC blocks 4/5, and the three-tier model blocks 5/5. Five correctness experiments confirm zero unauthorized deliveries, zero invariant violations, and architectural enforcement of out-of-band approval isolation that no surveyed enterprise platform provides. A survey of four major platforms (Microsoft, Salesforce, AWS, Google) documents that none architecturally isolates agent approval channels. We identify four properties that make context orchestration harder than container orchestration, and argue that these make the solution more valuable.

[AI-19] CUTEv2: Unified and Configurable Matrix Extension for Diverse CPU Architectures with Minimal Design Overhead

【速读】:该论文旨在解决现代CPU中矩阵扩展(Matrix Extension)设计面临的硬件与软件集成复杂性问题,尤其是现有方案因紧耦合于CPU流水线而导致跨平台适配困难,以及细粒度同步指令限制高性能核函数开发的问题。其解决方案的关键在于提出一种统一且可配置的CPU矩阵扩展架构:通过将矩阵单元从CPU流水线中解耦,实现低开销的集成并保持与现有计算和内存资源的紧密协作;同时,采用异步矩阵乘法抽象和灵活粒度机制,隐藏底层硬件细节、简化矩阵-向量重叠执行,并支持统一的软件栈,从而在多个开源CPU RTL平台上实现了超过90%的矩阵单元利用率及显著的AI模型加速效果(如ResNet、BERT和Llama3分别获得1.57x、1.57x和2.31x速度提升)。

链接: https://arxiv.org/abs/2604.11615
作者: Jinpeng Ye,Chongxi Wang,Wenqing Li,Bin Yuan,Shiyi Wang,Fenglu Zhang,Junyu Yue,Jianan Xie,Yunhao Ye,Haoyu Deng,Yingkun Zhou,Xin Cheng,Fuxin Zhang,Jian Wang
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: Accepted to DAC 2026

点击查看摘要

Abstract:Matrix extensions have emerged as an essential feature in modern CPUs to address the surging demands of AI workloads. However, existing designs often incur substantial hardware and software design overhead. Tight coupling with the CPU pipeline complicates integration across diverse CPUs, while fine-grained synchronous instructions hinder the development of high-performance kernels. This paper proposes a unified and configurable CPU matrix extension architecture. By decoupling matrix units from the CPU pipeline, the design enables low-overhead integration while maintaining close coordination with existing compute and memory resources. The configurable matrix unit supports mixed-precision operations and adapts to diverse compute demands and memory bandwidth constraints. An asynchronous matrix multiplication abstraction with flexible granularity conceals hardware details, simplifies matrix-vector overlap, and supports a unified software stack. The architecture is integrated into four open-source CPU RTL platforms and evaluated on representative AI models. Matrix unit utilization under GEMM workloads exceeds 90% across all platforms. When configured with compute throughput and memory bandwidth comparable to Intel AMX, our design achieves speedups of 1.57x, 1.57x, and 2.31x on ResNet, BERT, and Llama3, with over 30% of the gains attributed to overlapped matrix-vector execution. A 4 TOPS@2GHz matrix unit occupies only 0.53 mm\textsuperscript2 in 14nm CMOS. These results demonstrate strong cross-platform adaptability and effective hardware-software co-optimization, offering a practical matrix extension for the open-source community. Comments: Accepted to DAC 2026 Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG) Cite as: arXiv:2604.11615 [cs.AR] (or arXiv:2604.11615v1 [cs.AR] for this version) https://doi.org/10.48550/arXiv.2604.11615 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-20] Layerwise Dynamics for In-Context Classification in Transformers

【速读】:该论文旨在解决Transformer模型在推理阶段进行少样本分类时,其内部计算机制不透明的问题。研究聚焦于多类线性分类任务,在无间隔(hard no-margin)条件下,通过强制每一层满足特征和标签的置换等变性(feature- and label-permutation equivariance),使模型权重结构高度可识别且保持功能等价性。解决方案的关键在于构建一个显式的深度索引递归关系——即首次发现并提取出softmax Transformer中内生的、端到端可识别的更新规则,该规则由混合特征-标签Gram矩阵驱动,实现训练点、标签与测试探针之间的耦合更新,从而形成一种几何驱动的算法模式,能够有效增强类别分离并保证预期的类别对齐鲁棒性。

链接: https://arxiv.org/abs/2604.11613
作者: Patrick Lutz,Themistoklis Haris,Arjun Chandra,Aditya Gangrade,Venkatesh Saligrama
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformers can perform in-context classification from a few labeled examples, yet the inference-time algorithm remains opaque. We study multi-class linear classification in the hard no-margin regime and make the computation identifiable by enforcing feature- and label-permutation equivariance at every layer. This enables interpretability while maintaining functional equivalence and yields highly structured weights. From these models we extract an explicit depth-indexed recursion: an end-to-end identified, emergent update rule inside a softmax transformer, to our knowledge the first of its kind. Attention matrices formed from mixed feature-label Gram structure drive coupled updates of training points, labels, and the test probe. The resulting dynamics implement a geometry-driven algorithmic motif, which can provably amplify class separation and yields robust expected class alignment.

[AI-21] bacpipe: a Python package to make bioacoustic deep learning models accessible

【速读】:该论文旨在解决生态学研究中大规模生物声学数据(如自然声音)难以高效分析的问题,尤其是在深度学习模型快速发展的背景下,如何让这些先进模型被生态学家和计算机科学家共同便捷使用。其解决方案的关键在于提出bacpipe——一个模块化软件包,集成了多种生物声学深度学习模型与评估流程,通过图形界面和编程接口实现对自定义音频数据集的自动化处理,生成声学特征向量(embeddings)和分类预测结果,并支持交互式可视化、聚类与探查功能,从而促进跨学科合作并推动生态与进化问题的研究。

链接: https://arxiv.org/abs/2604.11560
作者: Vincent S. Kather,Sylvain Haupert,Burooj Ghani,Dan Stowell
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:1. Natural sounds have been recorded for millions of hours over the previous decades using passive acoustic monitoring. Improvements in deep learning models have vastly accelerated the analysis of large portions of this data. While new models advance the state-of-the-art, accessing them using tools to harness their full potential is not always straightforward. Here we present bacpipe, a collection of bioacoustic deep learning models and evaluation pipelines accessible through a graphical and programming interface, designed for both ecologists and computer scientists. Bacpipe is a modular software package intended as a point of convergence for bioacoustic models. 2. Bacpipe streamlines the usage of state-of-the-art models on custom audio datasets, generating acoustic feature vectors (embeddings) and classifier predictions. A modular design allows evaluation and benchmarking of models through interactive visualizations, clustering and probing. 3. We believe that access to new deep learning models is important. By designing bacpipe to target a wide audience, researchers will be enabled to answer new ecological and evolutionary questions in bioacoustics. 4. In conclusion, we believe accessibility to developments in deep learning to a wider audience benefits the ecological questions we are trying to answer. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.11560 [cs.LG] (or arXiv:2604.11560v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.11560 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Vincent S. Kather [view email] [v1] Mon, 13 Apr 2026 14:45:12 UTC (1,712 KB)

[AI-22] UniToolCall: Unifying Tool-Use Representation Data and Evaluation for LLM Agents

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在工具调用(Tool-use)能力上的三个核心问题:一是现有研究中交互表示不统一,导致可比性差;二是对工具使用轨迹(tool-use trajectories)的结构分布缺乏系统建模,尤其是多跳(multi-hop)与多轮(multi-turn)交互模式未被充分捕捉;三是评估基准不兼容,难以进行公平比较。解决方案的关键在于提出一个统一框架 UniToolCall,其核心创新包括:构建包含 22k+ 工具的标准化工具池和由 390k+ 实例组成的混合训练语料(融合公开数据集与结构可控的合成轨迹),显式建模单跳/多跳、单轮/多轮及串行/并行执行结构,并引入 Anchor Linkage 机制以强化跨轮次依赖关系;同时将 7 个公共基准统一转换为 Query–Action–Observation–Answer (QAOA) 格式,实现细粒度的功能调用、轮次和对话层级评估。实验表明,在 distractor-heavy Hybrid-20 设置下,基于该框架微调的 Qwen3-8B 模型达到 93.0% 的单轮严格精度(Strict Precision),优于 GPT、Gemini 和 Claude 等商用模型。

链接: https://arxiv.org/abs/2604.11557
作者: Yijuan Liang,Xinghao Chen,Yifan Ge,Ziyi Wu,Hao Wu,Changyu Zeng,Wei Xing,Xiaoyu Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 8 figures, 6 tables. Code and datasets are publicly available at: this https URL

点击查看摘要

Abstract:Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research exhibits inconsistent interaction representations, largely overlooks the structural distribution of tool-use trajectories, and relies on incompatible evaluation benchmarks. We present UniToolCall, a unified framework for tool learning that standardizes the entire pipeline from toolset construction and dataset generation to evaluation. The framework curates a large tool pool of 22k+ tools and constructs a hybrid training corpus of 390k+ instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories. It explicitly models diverse interaction patterns, including single-hop vs. multi-hop and single-turn vs. multi-turn, while capturing both serial and parallel execution structures. To support coherent multi-turn reasoning, we further introduce an Anchor Linkage mechanism that enforces cross-turn dependencies. Furthermore, we convert 7 public benchmarks into a unified Query–Action–Observation–Answer (QAOA) representation with fine-grained evaluation at the function-call, turn, and conversation levels. Experiments show that fine-tuning Qwen3-8B on our dataset substantially improves tool-use performance. Under the distractor-heavy Hybrid-20 setting, achieves 93.0% single-turn Strict Precision, outperforming commercial models including GPT, Gemini, and Claude.

[AI-23] FM-Agent : Scaling Formal Methods to Large Systems via LLM -Based Hoare-Style Reasoning

【速读】:该论文旨在解决大尺度软件系统中由生成式 AI(Generative AI)自动生成代码后,如何实现自动化且可扩展的正确性验证问题。现有基于霍雷逻辑(Hoare logic)的方法因需人工编写函数级规范而难以规模化,尤其在开发者对 LLM 生成代码的理解有限时更为困难。解决方案的关键在于提出 FM-Agent 框架,首次实现了面向大规模系统的自动化组合推理(compositional reasoning),其核心创新包括:1)引入自顶向下的范式,利用调用者对函数行为的预期自动推导出函数级规范,从而捕捉开发者的意图;2)将霍雷风格推理推广至自然语言规范,使形式化验证工具能够处理非公式化的意图表达;3)自动构造测试用例以定位并解释潜在缺陷,从而有效识别出开发者已测试但未发现的严重错误(如系统崩溃和错误执行结果)。

链接: https://arxiv.org/abs/2604.11556
作者: Haoran Ding,Zhaoguo Wang,Haibo Chen
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen the correctness of the generated code. However, automated reasoning for large-scale systems remains challenging due to code complexity. Hoare logic offers an approach to decomposing a large system into smaller components and reasoning about them separately (i.e., compositional reasoning). However, existing works still struggle to scale, because Hoare logic requires writing formal specifications for each function, imposing a heavy human burden. The problem is exacerbated when code is generated by LLMs, as developers lack a deep understanding of each function’s expected behavior. This paper presents FM-Agent, the first framework that realizes automated compositional reasoning for large-scale systems. Leveraging LLMs, FM-Agent introduces a top-down paradigm to automatically generate function-level specifications. Specifically, FM-Agent derives the specification of a function from how its callers expect the function to behave, so the generated specifications can reflect the developer’s intent of a function even if the implementation is buggy. Developers’ intent is usually expressed in natural language, while existing verifiers only support formulas. Therefore, FM-Agent generalizes Hoare-style inference to reason about functions against natural-language specifications. Finally, to confirm bug existence and explain bug causes, FM-Agent automatically generates test cases to trigger potential bugs. In our evaluation, FM-Agent successfully reasons about large-scale systems within 2 days, each of which has up to 143k LoC. These systems have already been tested by their developers, but FM-Agent still finds 522 newly discovered bugs. These bugs can cause serious consequences, including system crashes and incorrect execution results. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.11556 [cs.SE] (or arXiv:2604.11556v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2604.11556 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-24] SemaClaw: A Step Towards General-Purpose Personal AI Agents through Harness Engineering

【速读】:该论文旨在解决个人AI代理(personal AI agents)在大规模部署中面临的两大核心挑战:一是AI工程范式的转变,即从传统的提示词(prompt)和上下文工程转向构建可控制、可审计、生产级可靠的系统基础设施(harness engineering);二是人机交互模式的演进,即从离散任务处理向持久化、情境感知的协作关系发展,要求具备开放性、可信性和可扩展性的基础设施。解决方案的关键在于提出SemaClaw框架,其核心创新包括基于有向无环图(DAG)的两阶段混合代理团队编排方法、PermissionBridge行为安全系统、三层上下文管理架构以及用于自动化个人知识库构建的代理wiki技能,从而实现通用型个人AI代理的工程化落地。

链接: https://arxiv.org/abs/2604.11548
作者: Ningyan Zhu,Huacan Wang,Jie Zhou,Feiyu Chen,Shuo Zhang,Ge Chen,Chen Liu,Jiarou Wu,Wangyi Chen,Xiaofeng Mou,Yi Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rise of OpenClaw in early 2026 marks the moment when millions of users began deploying personal AI agents into their daily lives, delegating tasks ranging from travel planning to multi-step research. This scale of adoption signals that two parallel arcs of development have reached an inflection point. First is a paradigm shift in AI engineering, evolving from prompt and context engineering to harness engineering-designing the complete infrastructure necessary to transform unconstrained agents into controllable, auditable, and production-reliable systems. As model capabilities converge, this harness layer is becoming the primary site of architectural differentiation. Second is the evolution of human-agent interaction from discrete tasks toward a persistent, contextually aware collaborative relationship, which demands open, trustworthy and extensible harness infrastructure. We present SemaClaw, an open-source multi-agent application framework that addresses these shifts by taking a step towards general-purpose personal AI agents through harness engineering. Our primary contributions include a DAG-based two-phase hybrid agent team orchestration method, a PermissionBridge behavioral safety system, a three-tier context management architecture, and an agentic wiki skill for automated personal knowledge base construction.

[AI-25] A collaborative agent with two lightweight synergistic models for autonomous crystal materials research

【速读】:该论文旨在解决当前大型语言模型在材料科学领域中因参数量庞大而导致的领域特定推理能力不足及工具协调困难的问题。其解决方案的关键在于提出了一种轻量级协同智能体系统 MatBrain,采用双模型架构:Mat-R1(30B 参数)作为分析模型,负责专家级的领域推理;Mat-T1(14B 参数)作为执行模型,负责工具驱动的动作调度。通过熵分析证实,该架构通过解耦工具规划与分析推理的不同熵动态,有效缓解了二者之间的冲突,从而在显著降低硬件部署门槛(>95%)的同时,大幅超越通用大模型的性能表现。

链接: https://arxiv.org/abs/2604.11540
作者: Tongyu Shi,Yutang Li,Zhanyuan Li,Qian Liu,Jie Zhou,Wenhe Xu,Yang Li,Dawei Dai,Rui He,Wenhua Zhou,Jiahong Wang,Xue-Feng Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current large language models require hundreds of billions of parameters yet struggle with domain-specific reasoning and tool coordination in materials science. Here, we present MatBrain, a lightweight collaborative agent system with two synergistic models specialization for crystal materials research. MatBrain employs a dual-model architecture: Mat-R1 (30B parameters) as the analytical model providing expert-level domain reasoning, and Mat-T1 (14B parameters) as the executive model orchestrating tool-based actions. Entropy analysis confirms that this architecture resolves the conflict between tool planning and analytical reasoning by decoupling their distinct entropy dynamics. Enabled by this dual-model architecture and structural efficiency, MatBrain significantly outperforms larger general-purpose models while reducing the hardware deployment barrier by over 95%. MatBrain exhibits versatility across structure generation, property prediction, and synthesis planning tasks. Applied to catalyst design, MatBrain generated 30,000 candidate structures and identified 38 promising materials within 48 hours, achieving approximately 100-fold acceleration over traditional approaches. These results demonstrate the potential of lightweight collaborative intelligence for advancing materials research capabilities.

[AI-26] Problem Reductions at Scale: Agent ic Integration of Computationally Hard Problems

【速读】:该论文旨在解决NP-hard优化问题在实际求解过程中因需针对特定求解器(如量子硬件、商业优化器或领域启发式算法)进行重构而带来的效率低下问题,从而限制了问题与求解器之间的灵活匹配。其解决方案的关键在于构建一个可扩展的多项式时间归约库,通过“Harness工程”——即设计约束条件、验证系统和反馈循环以引导AI编码代理(AI coding agents)——实现高效、自动化的问题归约规则生成与集成。该方法结合无代码贡献接口、多层验证体系(从类型检查到AI代理模拟用户测试)以及全自动的实现-审查-集成流水线,在三个月内完成了包含100多个问题类型和200余条归约规则的Rust库,显著提升了归约库的规模与开发速度,且由于归约图具有传递性,任意新注册的求解器可立即服务于所有通过归约路径连接的问题。

链接: https://arxiv.org/abs/2604.11535
作者: Xi-Wei Pan,Shi-Wen An,Jin-Guo Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: The source code is available at this https URL

点击查看摘要

Abstract:Solving an NP-hard optimization problem often requires reformulating it for a specific solver – quantum hardware, a commercial optimizer, or a domain heuristic. A tool for polynomial-time reductions between hard problems would let practitioners route any supported problem to any supported solver through a single interface. Building such a library at scale, however, has remained out of reach. We show that harness engineering, the practice of designing constraints, verification systems, and feedback loops that channel AI coding agents, can overcome this barrier. Our harness combines a no-code contribution route for domain experts, a multilayer verification stack ranging from type-level checks to agentic feature tests (AI agents role-playing as end users), and a fully automated implementation-review-integration pipeline. In about three months, we built a command-line tool backed by a library of 100+ problem types and 200+~reduction rules in over 170k lines of Rust. The result suggests that a well-engineered harness lets agents build well-tested software at a scale and pace beyond prior reduction-library efforts. Because the reduction graph composes transitively, a new solver registered for any single problem type instantly becomes available to every problem connected by a reduction path. The source code is available at this https URL.

[AI-27] Limited Perfect Monotonical Surrogates constructed using low-cost recursive linkage discovery with guaranteed output

【速读】:该论文旨在解决传统代理模型(surrogate)在处理非线性、计算代价高昂的优化问题时的局限性,特别是当真实问题无法用线性模型表示时,现有完美线性代理模型(perfect linear surrogates)无法适用的问题。其解决方案的关键在于提出一种有限单调完美代理模型(Limited Monotonical Perfect Surrogate, LyMPuS),该模型无需参数训练且可在线构建,仅依赖必要的适应度评估,在更新过程中不浪费已付出的计算成本;同时具备低开销的缺失关联检测与链接发现能力,能在不超过 2log2(n)2\lceil\log_2(n)\rceil 步内保证识别出缺失的变量依赖关系,从而有效降低昂贵局部搜索过程的计算负担。

链接: https://arxiv.org/abs/2604.11524
作者: M.W. Przewozniczek,F. Chicano,R. Tinós,M.M. Komarnicki
机构: 未知
类目: Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
备注:

点击查看摘要

Abstract:Surrogates provide a cheap solution evaluation and offer significant leverage for optimizing computationally expensive problems. Usually, surrogates only approximate the original function. Recently, the perfect linear surrogates were proposed that ideally represent the original function. These surrogates do not mimic the original function. In fact, they are another (correct) representation of it and enable a wide range of possibilities, e.g., discovering the optimized function for problems where the direct transformation of the encoded solution into its evaluation is not available. However, many real-world problems can not be represented by linear models, making the aforementioned surrogates inapplicable. Therefore, we propose the Limited Monotonical Perfect Surrogate (LyMPuS), which overcomes this difficulty and enables the comparison of two solutions that differ by a single variable. Our proposition is suitable for limiting the cost of expensive local search procedures. The proposed surrogate is parameterless and can be trained on the fly without any separate surrogate-building step. It uses only the necessary fitness evaluations, and the already-paid costs are not wasted when the model is updated. Finally, it offers low-cost missing-linkage detection and low-cost linkage discovery, guaranteed to find a missing dependency in no more than 2\lceil\log_2(n)\rceil steps.

[AI-28] From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python

【速读】:该论文旨在解决大型软件系统在跨语言迁移过程中面临的持续工程挑战,尤其是在源代码库快速演进背景下如何保持功能一致性与可维护性的问题。其核心解决方案是提出一种基于大语言模型(Large Language Model, LLM)的连续代码翻译方法,通过将生产级 Rust 代码库(648K LOC)自动翻译为 Python(41K LOC),并以公开代理基准测试(public agent benchmarks)作为目标函数驱动迭代优化。关键创新在于:利用基准测试结果作为反馈信号进行 benchmark-driven debugging,识别出 API 协议不匹配、环境污染等深层次问题;同时构建了一个 LLM-assisted diff-translate-test 循环,支持持续上游同步,并使 Python 版本从功能对齐逐步演化为能力超集(如多智能体编排、语义记忆等),从而在保证性能的前提下实现显著的代码简洁性提升(15.9倍代码量减少)。

链接: https://arxiv.org/abs/2604.11518
作者: Jinhua Wang,Biswa Sengupta
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cross-language migration of large software systems is a persistent engineering challenge, particularly when the source codebase evolves rapidly. We present a methodology for LLM-assisted continuous code translation in which a large language model translates a production Rust codebase (648K LOC, 65 crates) into Python (41K LOC, 28 modules), with public agent benchmarks as the objective function driving iterative refinement. Our subject system is Codex CLI, a production AI coding agent. We demonstrate that: (1) the Python port resolves 59/80 SWE-bench Verified tasks (73.8%) versus Rust’s 56/80 (70.0%), and achieves 42.5% on Terminal-Bench versus Rust’s 47.5%, confirming near-parity on real-world agentic tasks; (2) benchmark-driven debugging, revealing API protocol mismatches, environment pollution, a silent WebSocket failure mode, and an API 400 crash, is more effective than static testing alone; (3) the architecture supports continuous upstream synchronisation via an LLM-assisted diff-translate-test loop; and (4) the Python port has evolved into a capability superset with 30 feature-flagged extensions (multi-agent orchestration, semantic memory, guardian safety, cost tracking) absent from Rust, while preserving strict parity mode for comparison. Our evaluation shows that for LLM-based agents where API latency dominates, Python’s expressiveness yields a 15.9x code reduction with negligible performance cost, while the benchmark-as-objective-function methodology provides a principled framework for growing a cross-language port from parity into an extended platform.

[AI-29] EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models

【速读】:该论文旨在解决小型语言模型(Small Language Models, SLMs)在边缘设备(如笔记本电脑、智能手机和嵌入式平台)上部署时,现有加速器在自回归解码阶段因GEMV(General Matrix-Vector Multiplication)操作固有的内存密集特性导致利用率低、能耗高的问题。解决方案的关键在于提出EdgeCIM这一软硬件协同设计框架,其核心是一个基于存算一体(Compute-in-Memory, CIM)宏单元(65nm工艺实现)与基于tile的映射策略相结合的架构,通过平衡流水线阶段,在最大化并行性的同时缓解DRAM带宽瓶颈,从而显著提升吞吐量和能效比。

链接: https://arxiv.org/abs/2604.11512
作者: Jinane Bazzi,Mariam Rakka,Fadi Kurdahi,Mohammed E. Fouda,Ahmed Eltawil
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The growing demand for deploying Small Language Models (SLMs) on edge devices, including laptops, smartphones, and embedded platforms, has exposed fundamental inefficiencies in existing accelerators. While GPUs handle prefill workloads efficiently, the autoregressive decoding phase is dominated by GEMV operations that are inherently memory-bound, resulting in poor utilization and prohibitive energy costs at the edge. In this work, we present EdgeCIM, a hardware-software co-design framework that rethinks accelerator design for end-to-end decoder-only inference. At its core is a CIM macro, implemented in 65nm, coupled with a tile-based mapping strategy that balances pipeline stages, maximizing parallelism while alleviating DRAM bandwidth bottlenecks. Our simulator enables design space exploration of SLMs up to 4B parameters, identifying Pareto-optimal configurations in terms of latency and energy. Compared to an NVIDIA Orin Nano, EdgeCIM achieves up to 7.3x higher throughput and 49.59x better energy efficiency on LLaMA3.2-1B, and delivers 9.95x higher throughput than Qualcomm SA8255P on LLaMA3.2-3B. Extensive benchmarks on TinyLLaMA-1.1B, LLaMA3.2 (1B, 3B), Phi-3.5-mini-3.8B, Qwen2.5 (0.5B, 1.5B, 3B), SmolLM2-1.7B, SmolLM3-3B, and Qwen3 (0.6B, 1.7B, 4B) reveal that our accelerator, under INT4 precision, achieves on average 336.42 tokens/s and 173.02 tokens/J. These results establish EdgeCIM as a compelling solution towards real-time, energy-efficient edge-scale SLM inference.

[AI-30] Not All Forgetting Is Equal: Architecture-Dependent Retention Dynamics in Fine-Tuned Image Classifiers

【速读】:该论文旨在解决预训练图像分类器在微调过程中样本遗忘机制不明确的问题,特别是个体样本的遗忘模式是否稳定以及是否依赖于模型架构。其核心发现表明:不同架构(如ResNet-18与DeiT-Small)遗忘的样本存在显著差异(Jaccard重叠度低至0.15),且视觉相似类别更易被遗忘,同时样本遗忘过程具有高度随机性(跨随机种子相关性趋近于零),说明样本难度并非固定属性;关键解决方案在于通过拟合Ebbinghaus式指数衰减曲线量化每个样本的保留轨迹,并发现微调初期损失可有效预测长期遗忘速率(ρ = 0.30–0.50, p < 10⁻⁴⁵)。这一结果揭示了基于静态难度排序的课程学习或数据剪枝策略可能不可靠,而架构多样性有助于提升集成模型的保留覆盖范围。

链接: https://arxiv.org/abs/2604.11508
作者: Miit Daga,Swarna Priya Ramu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fine-tuning pretrained image classifiers is standard practice, yet which individual samples are forgotten during this process, and whether forgetting patterns are stable or architecture dependent, remains unclear. Understanding these dynamics has direct implications for curriculum design, data pruning, and ensemble construction. We track per-sample correctness at every epoch during fine-tuning of ResNet-18 and DeiT-Small on a retinal OCT dataset (7 classes, 56:1 imbalance) and CUB-200-2011 (200 bird species), fitting Ebbinghaus-style exponential decay curves to each sample’s retention trace. Five findings emerge. First, the two architectures forget fundamentally different samples: Jaccard overlap of the top 10 percent most-forgotten is 0.34 on OCTDL and 0.15 on CUB-200. Second, ViT forgetting is more structured (mean R^2 = 0.74 ) than CNN forgetting ( R^2 = 0.52 ). Third, per-sample forgetting is stochastic across random seeds (Spearman \rho \approx 0.01 ), challenging the assumption that sample difficulty is an intrinsic property. Fourth, class-level forgetting is consistent and semantically interpretable: visually similar species are forgotten most, distinctive ones least. Fifth, a sample’s loss after head warmup predicts its long-term decay constant ( \rho = 0.30 to 0.50 , p 10^-45 ). These findings suggest that architectural diversity in ensembles provides complementary retention coverage, and that curriculum or pruning methods based on per-sample difficulty may not generalize across runs. A spaced repetition sampler built on these decay constants does not outperform random sampling, indicating that static scheduling cannot exploit unstable per-sample signals.

[AI-31] Lectures on AI for Mathematics

【速读】:该论文旨在解决如何利用人工智能(Artificial Intelligence, AI)推动数学研究的问题,其核心挑战在于将AI技术应用于复杂数学问题的发现、证明与验证过程。解决方案的关键在于通过生成式AI(Generative AI)和机器学习算法挖掘隐藏的数学模式,辅助数学家进行定理证明,并自动构造反例以检验猜想的正确性,从而显著提升数学研究的效率与深度。

链接: https://arxiv.org/abs/2604.11504
作者: Xiaoyang Chen,Xiaoyang Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP); Algebraic Topology (math.AT); Differential Geometry (math.DG)
备注:

点击查看摘要

Abstract:This book provides a comprehensive and accessible introduction to the emerging field of AI for mathematics. It covers the core principles and diverse applications of using artificial intelligence to advance mathematical research. Through clear explanations, the text explores how AI can discover hidden mathematical patterns, assist in proving complicated theorems, and even construct counterexamples to challenge conjectures.

[AI-32] On the Complexity of the Discussion-based Semantics in Abstraction Argumentation

【速读】:该论文旨在解决在讨论语义(discussion-based semantics)框架下,判断一个论点 a 是否比另一个论点 b 更强的问题。这一问题本质上等价于判定图中两个顶点在所有长度的路径终点计数上是否一致,即是否存在相同数量的长度为 k 的行走路径分别终止于这两个顶点。解决方案的关键在于将该问题转化为半环自动机(semiring automata)的等价性判定问题,并借助自动机理论中的已有成果,从而证明该决策问题可在多项式时间内求解。这一方法为排序语义(ranking semantics)的计算复杂性研究提供了新的视角和工具。

链接: https://arxiv.org/abs/2604.11480
作者: Lydia Blümel,Kai Sauerwald,Kenneth Skiba,Matthias Thimm
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We show that deciding whether an argument a is stronger than an argument b with respect to the discussion-based semantics of Amgoud and Ben-Naim is decidable in polynomial time. At its core, this problem is about deciding whether, for two vertices in a graph, the number of walks of each length ending in those vertices is the same. We employ results from automata theory and reduce this problem to the equivalence problem for semiring automata. This offers a new perspective on the computational complexity of ranking semantics, an area in which the complexity of many semantics remains open.

[AI-33] OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM -Based Multi-Agent Systems

【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)在自主软件工程中因评估者认知不确定性(evaluator epistemic uncertainty)而导致的对齐难题,尤其是现有范式如基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)和AI反馈(Reinforcement Learning from AI Feedback, RLAIF)易引发模型谄媚行为,以及执行环境遭受无约束智能体“测试规避”(Test Evasion)的问题。解决方案的关键在于提出一种新的目标对齐范式——缺钱强化学习(Out-of-Money Reinforcement Learning, OOM-RL),其核心机制是将智能体部署于高摩擦、非平稳的真实金融市场环境中,利用资本耗尽作为不可被绕过的负梯度信号,从而迫使系统从过度拟合的幻觉行为转向严格的测试驱动代理工作流(Strict Test-Driven Agentic Workflow, STDAW),并引入基于确定性验证的≥95%代码覆盖率约束矩阵与拜占庭式单向状态锁(RO-Lock),最终实现稳定且具经济意义的对齐结果。

链接: https://arxiv.org/abs/2604.11477
作者: Kun Liu,Liqun Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE); Trading and Market Microstructure (q-fin.TR)
备注: 13 pages, 3 figures

点击查看摘要

Abstract:The alignment of Multi-Agent Systems (MAS) for autonomous software engineering is constrained by evaluator epistemic uncertainty. Current paradigms, such as Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF), frequently induce model sycophancy, while execution-based environments suffer from adversarial “Test Evasion” by unconstrained agents. In this paper, we introduce an objective alignment paradigm: \textbfOut-of-Money Reinforcement Learning (OOM-RL). By deploying agents into the non-stationary, high-friction reality of live financial markets, we utilize critical capital depletion as an un-hackable negative gradient. Our longitudinal 20-month empirical study (July 2024 – February 2026) chronicles the system’s evolution from a high-turnover, sycophantic baseline to a robust, liquidity-aware architecture. We demonstrate that the undeniable ontological consequences of financial loss forced the MAS to abandon overfitted hallucinations in favor of the \textbfStrict Test-Driven Agentic Workflow (STDAW), which enforces a Byzantine-inspired uni-directional state lock (RO-Lock) anchored to a deterministically verified \geq 95% code coverage constraint matrix. Our results show that while early iterations suffered severe execution decay, the final OOM-RL-aligned system achieved a stable equilibrium with an annualized Sharpe ratio of 2.06 in its mature phase. We conclude that substituting subjective human preference with rigorous economic penalties provides a robust methodology for aligning autonomous agents in high-stakes, real-world environments, laying the groundwork for generalized paradigms where computational billing acts as an objective physical constraint

[AI-34] hree Roles One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

【速读】:该论文旨在解决在资源受限硬件环境下,如何通过不增加训练计算开销的方式提升小型大语言模型(Large Language Model, LLM)在复杂多步骤工具使用任务中的性能问题。其核心挑战在于小模型在有限算力下难以有效处理长上下文和多轮交互中的错误累积与失败循环。解决方案的关键在于提出一种三层次推理时支架(inference-time scaffolding)管道:首先利用一个冻结的总结模型压缩对话历史并保留关键信息(如令牌、凭证、API响应);其次由主代理模型基于压缩后的上下文进行推理;最后引入一个隔离的校正模型对代码输出进行审查与修正,从而打破重复性失败循环。该方法仅依赖同一冻结模型的三次不同条件调用,无需额外训练即可显著提升任务目标完成率,尤其在低难度任务上表现突出,并使8B模型在全精度推理下超越4倍参数量的DeepSeek-Coder 33B Instruct模型。

链接: https://arxiv.org/abs/2604.11465
作者: S. Aaron McClendon,Jorge Gallego-Feliciano,Stavros Zervoudakis,Antonios Saravanos
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents show promise on realistic tool-use tasks, but deploying capable agents on modest hardware remains challenging. We study whether inference-time scaffolding alone, without any additional training compute, can improve the performance of a small model in complex multi-step environments. Operating on a single 24,GB GPU, we evaluate Qwen3-8B under both full-precision (FP16, 12K context) and 4-bit quantized (AWQ, 32K context) configurations. Without any intervention, the raw model achieves just 5.4% (FP16) and 3.0% (AWQ) task goal completion. Guided by a systematic failure mode analysis, we introduce a three-tier inference scaffolding pipeline that deploys the same frozen model in three distinct roles: (1) a summarization model that preserves critical artifacts (tokens, credentials, API responses) while compressing dialogue history; (2) the main agent model that reasons over the compressed context; and (3) an isolated correction model that reviews and revises the agent’s code output without access to conversation history, breaking repetitive failure loops. Applied to the same unmodified model, this scaffolding yields 8.9% (FP16) and 5.9% (AWQ) task goal completion, roughly doubling performance in both settings, with particularly strong gains on difficulty-1 tasks (15.8% \to 26.3% FP16; 5.3% \to 14.0% AWQ). On full-precision inference, our scaffolded 8B model surpasses DeepSeek-Coder 33B Instruct (7.1%) from the original AppWorld evaluation, demonstrating that structured inference-time interventions can make small models competitive with systems 4 \times their size. We formalize the approach as a scaffolded policy over a frozen base model, three invocations of the same weights with different conditioning, drawing connections to test-time compute scaling and action-space shaping in reinforcement learning.

[AI-35] Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在执行长时程任务时面临的“上下文瓶颈”(context bottleneck)和“中间迷失”(lost-in-the-middle)现象,这些问题由冗余环境信息积累导致的噪声污染,显著削弱了多轮交互中的推理能力。解决方案的关键在于提出一种共生框架(symbiotic framework),将上下文管理与任务执行解耦:通过一个轻量级、专用的策略模型 ContextCurator 与一个强大的冻结基础模型 TaskExecutor 配合工作;其中,ContextCurator 基于强化学习训练,主动降低工作记忆中的信息熵,激进地修剪环境噪声,同时保留对后续推理至关重要的稀疏锚点数据(reasoning anchors)。该方法在 WebArena 和 DeepSearch 上均实现了成功率提升和 token 消耗显著下降,且 7B 规模的 ContextCurator 即可达到 GPT-4o 的上下文管理性能,展现出可扩展且计算高效的自主长时程代理范式。

链接: https://arxiv.org/abs/2604.11462
作者: Xiaozhe Li,Tianyi Lyu,Yizhao Yang,Liang Shan,Siyi Yang,Ligao Zhang,Zhuoyi Huang,Qingwen Liu,Yang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) struggle with long-horizon tasks due to the “context bottleneck” and the “lost-in-the-middle” phenomenon, where accumulated noise from verbose environments degrades reasoning over multi-turn interactions. To address this issue, we introduce a symbiotic framework that decouples context management from task execution. Our architecture pairs a lightweight, specialized policy model, ContextCurator, with a powerful frozen foundation model, TaskExecutor. Trained via reinforcement learning, ContextCurator actively reduces information entropy in the working memory. It aggressively prunes environmental noise while preserving reasoning anchors, that is, sparse data points that are critical for future deductions. On WebArena, our framework improves the success rate of Gemini-3.0-flash from 36.4% to 41.2% while reducing token consumption by 8.8% (from 47.4K to 43.3K). On DeepSearch, it achieves a 57.1% success rate, compared with 53.9%, while reducing token consumption by a factor of 8. Remarkably, a 7B ContextCurator matches the context management performance of GPT-4o, providing a scalable and computationally efficient paradigm for autonomous long-horizon agents.

[AI-36] Hardening x402: PII-Safe Agent ic Payments via Pre-Execution Metadata Filtering

【速读】:该论文旨在解决基于x402协议的AI代理在支付过程中因传输未经处理的元数据(包括资源URL、描述和理由字符串)而引发的隐私泄露风险,尤其是在支付服务器与中心化中介API之间缺乏数据处理协议的情况下。解决方案的关键在于提出并实现了一个名为presidio-hardened-x402的开源中间件,该中间件能在支付请求发送至目标服务前实时拦截并处理元数据:一是通过预训练的自然语言处理(NLP)模型识别并删除个人身份信息(PII),二是基于声明式支出策略强制执行访问控制,三是阻止重复的重放攻击。实验表明,在推荐配置下(NLP模式,最小置信度阈值0.4,涵盖所有实体类型),该方案实现了微平均F1分数0.894(精确率0.972),延迟仅为5.73ms(p99),满足系统性能要求。

链接: https://arxiv.org/abs/2604.11430
作者: Vladimir Stantchev
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 14 pages, 5 figures, 4 tables; code and synthetic corpus available at this https URL

点击查看摘要

Abstract:AI agents that pay for resources via the x402 protocol embed payment metadata - resource URLs, descriptions, and reason strings - in every HTTP payment request. This metadata is transmitted to the payment server and to the centralised facilitator API before any on-chain settlement occurs; neither party is typically bound by a data processing agreement. We present presidio-hardened-x402, the first open-source middleware that intercepts x402 payment requests before transmission to detect and redact personally identifiable information (PII), enforce declarative spending policies, and block duplicate replay attempts. To evaluate the PII filter, we construct a labeled synthetic corpus of 2,000 x402 metadata triples spanning seven use-case categories, and run a 42-configuration precision/recall sweep across two detection modes (regex, NLP) and five confidence thresholds. The recommended configuration (mode=nlp, min_score=0.4, all entity types) achieves micro-F1 = 0.894 with precision 0.972, at a p99 latency of 5.73ms - well within the 50ms overhead budget. The middleware, corpus, and all experiment code are publicly available at this https URL.

[AI-37] Emulating Non-Differentiable Metrics via Knowledge-Guided Learning: Introducing the Minkowski Image Loss

【速读】:该论文旨在解决地球系统深度学习中的“可微性鸿沟”问题,即由于科学指标(如积分几何度量)通常不可微,模型只能依赖平滑代理指标(如均方误差)进行训练,导致输出结果缺乏高频细节而呈现“模糊”现象。解决方案的关键在于构建两类不同的可微替代方法:其一是通过温度控制的Sigmoid函数和连续逻辑运算符对离散拓扑操作进行解析近似,从而将原始非可微函数转化为可微等价形式;其二是利用Lipschitz-卷积神经网络学习科学函数的可微代理模型,其中通过谱归一化约束Lipschitz常数并引入硬性架构约束以确保几何原理的保持。实验表明,所提出的Minkowski图像损失函数在EUMETNET OPERA数据集上实现了高精度几何保真度,但严格Lipschitz正则化虽保障优化稳定性,却会过度平滑梯度信号,限制对强局域对流纹理的恢复,提示未来需结合随机生成架构以实现完整的形态学真实性。

链接: https://arxiv.org/abs/2604.11422
作者: Filippo Quarenghi,Ryan Cotsakis,Tom Beucler
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The differentiability gap'' presents a primary bottleneck in Earth system deep learning: since models cannot be trained directly on non-differentiable scientific metrics and must rely on smooth proxies (e.g., MSE), they often fail to capture high-frequency details, yielding blurry’’ outputs. We develop a framework that bridges this gap using two different methods to deal with non-differentiable functions: the first is to analytically approximate the original non-differentiable function into a differentiable equivalent one; the second is to learn differentiable surrogates for scientific functionals. We formulate the analytical approximation by relaxing discrete topological operations using temperature-controlled sigmoids and continuous logical operators. Conversely, our neural emulator uses Lipschitz-convolutional neural networks to stabilize gradient learning via: (1) spectral normalization to bound the Lipschitz constant; and (2) hard architectural constraints enforcing geometric principles. We demonstrate this framework’s utility by developing the Minkowski image loss, a differentiable equivalent for the integral-geometric measures of surface precipitation fields (area, perimeter, connected components). Validated on the EUMETNET OPERA dataset, our constrained neural surrogate achieves high emulation accuracy, completely eliminating the geometric violations observed in unconstrained baselines. However, applying these differentiable surrogates to a deterministic super-resolution task reveals a fundamental trade-off: while strict Lipschitz regularization ensures optimization stability, it inherently over-smooths gradient signals, restricting the recovery of highly localized convective textures. This work highlights the necessity of coupling such topological constraints with stochastic generative architectures to achieve full morphological realism.

[AI-38] Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agent ic Retrieval

【速读】:该论文旨在解决网络安全威胁情报(Cyber Threat Intelligence, CTI)分析中,传统基于向量检索的检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理涉及实体间复杂关系推理的问题时表现不佳的问题。其核心挑战在于相关证据常分散于多个文本片段或文档中,而标准向量检索难以捕捉多跳关系。解决方案的关键在于引入知识图谱(Knowledge Graph)进行结构化建模,通过显式表示威胁行为者、恶意软件和漏洞等实体及其关系,实现多跳推理;进一步提出四种RAG架构对比评估,其中混合图-文本检索方法在多跳查询上相比纯向量RAG提升答案质量达35%,且比纯图谱方法更具鲁棒性,验证了图谱接地(graph grounding)对结构化事实查询的有效性及混合策略的优势。

链接: https://arxiv.org/abs/2604.11419
作者: Dzenan Hamzic,Florian Skopik,Max Landauer,Markus Wurzenberger,Andreas Rauber
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Cyber threat intelligence (CTI) analysts must answer complex questions over large collections of narrative security reports. Retrieval-augmented generation (RAG) systems help language models access external knowledge, but traditional vector retrieval often struggles with queries that require reasoning over relationships between entities such as threat actors, malware, and vulnerabilities. This limitation arises because relevant evidence is often distributed across multiple text fragments and documents. Knowledge graphs address this challenge by enabling structured multi-hop reasoning through explicit representations of entities and relationships. However, multiple retrieval paradigms, including graph-based, agentic, and hybrid approaches, have emerged with different assumptions and failure modes. It remains unclear how these approaches compare in realistic CTI settings and when graph grounding improves performance. We present a systematic evaluation of four RAG architectures for CTI analysis: standard vector retrieval, graph-based retrieval over a CTI knowledge graph, an agentic variant that repairs failed graph queries, and a hybrid approach combining graph queries with text retrieval. We evaluate these systems on 3,300 CTI question-answer pairs spanning factual lookups, multi-hop relational queries, analyst-style synthesis questions, and unanswerable cases. Results show that graph grounding improves performance on structured factual queries. The hybrid graph-text approach improves answer quality by up to 35 percent on multi-hop questions compared to vector RAG, while maintaining more reliable performance than graph-only systems.

[AI-39] Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech

【速读】:该论文旨在解决当前数据驱动的机器人系统在生成共语手势(co-speech gestures)时,普遍存在仅能产生节奏性节拍类动作而缺乏语义强调(semantic emphasis)的问题。其解决方案的关键在于提出一种轻量级Transformer模型,该模型仅依赖文本和情感信息即可推断出具象手势(iconic gesture)的位置与强度,且推理阶段无需音频输入;该方法在BEAT2数据集上于语义手势定位分类和强度回归任务中均优于GPT-4o,同时保持计算效率,适合部署于具身智能体(embodied agents)的实时场景中。

链接: https://arxiv.org/abs/2604.11417
作者: Edwin C. Montiel-Vazquez,Christian Arzate Cruz,Stefanos Gkikas,Thomas Kassiotis,Giorgos Giannakakis,Randy Gomez
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Co-speech gestures increase engagement and improve speech understanding. Most data-driven robot systems generate rhythmic beat-like motion, yet few integrate semantic emphasis. To address this, we propose a lightweight transformer that derives iconic gesture placement and intensity from text and emotion alone, requiring no audio input at inference time. The model outperforms GPT-4o in both semantic gesture placement classification and intensity regression on the BEAT2 dataset, while remaining computationally compact and suitable for real-time deployment on embodied agents.

[AI-40] One Scale at a Time: Scale-Autoregressive Modeling for Fluid Flow Distributions

【速读】:该论文旨在解决复杂非定常流场(unsteady fluid flows)中长期时间演化模拟的计算效率与精度难题:传统偏微分方程(PDE)求解器虽准确但计算成本高昂,而基于学习的时间步进代理模型在长时间滚动预测中误差累积严重;尽管生成式模型(如扩散模型和流匹配方法)可通过独立采样避免误差累积,但其在全网格上的多次评估仍带来高计算开销。解决方案的关键在于提出尺度自回归建模(scale-autoregressive modeling, SAR)——通过从粗到细的层次化结构,先生成低分辨率流场,再逐步条件采样高分辨率细节,从而将计算资源集中于不确定性最高的粗尺度,同时减少精细尺度的迭代次数,实现高效且高精度的分布采样。

链接: https://arxiv.org/abs/2604.11403
作者: Mario Lino,Nils Thuerey
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
备注:

点击查看摘要

Abstract:Analyzing unsteady fluid flows often requires access to the full distribution of possible temporal states, yet conventional PDE solvers are computationally prohibitive and learned time-stepping surrogates quickly accumulate error over long rollouts. Generative models avoid compounding error by sampling states independently, but diffusion and flow-matching methods, while accurate, are limited by the cost of many evaluations over the entire mesh. We introduce scale-autoregressive modeling (SAR) for sampling flows on unstructured meshes hierarchically from coarse to fine: it first generates a low-resolution field, then refines it by progressively sampling higher resolutions conditioned on coarser predictions. This coarse-to-fine factorization improves efficiency by concentrating computation at coarser scales, where uncertainty is greatest, while requiring fewer steps at finer scales. Across unsteady-flow benchmarks of varying complexity, SAR attains substantially lower distributional error and higher per-sample accuracy than state-of-the-art diffusion models based on multi-scale GNNs, while matching or surpassing a flow-matching Transolver (a linear-time transformer) yet running 2-7x faster than this depending on the task. Overall, SAR provides a practical tool for fast and accurate estimation of statistical flow quantities (e.g., turbulent kinetic energy and two-point correlations) in real-world settings.

[AI-41] From Agent Loops to Structured Graphs:A Scheduler-Theoretic Framework for LLM Agent Execution

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的智能体普遍采用的“Agent Loop”范式所存在的三大结构性缺陷:步骤间的隐式依赖、无界恢复循环以及执行历史的可变性导致调试困难。其核心问题是将控制流逻辑嵌入到LLM的上下文窗口中,使得调度过程缺乏透明性和可控性。解决方案的关键在于提出SGH(Structured Graph Harness),通过将控制流从隐式的上下文依赖显式地提取为静态有向无环图(Static DAG),实现三重承诺:执行计划在版本内不可变、规划与执行/恢复分离为三层结构、恢复遵循严格的升级协议。这一设计以牺牲部分表达能力为代价,显著提升了系统的可控性、可验证性和可实现性。

链接: https://arxiv.org/abs/2604.11378
作者: Hu Wei
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 51 pages, 4 figures

点击查看摘要

Abstract:The dominant paradigm for building LLM based agents is the Agent Loop, an iterative cycle where a single language model decides what to do next by reading an ever growing context window. This paradigm has three structural weaknesses: implicit dependencies between steps, unbounded recovery loops, and mutable execution history that complicates debugging. We characterize the Agent Loop as a single ready unit scheduler: at any moment, at most one executable unit is active, and the choice of which unit to activate comes from opaque LLM inference rather than an inspectable policy. This perspective places Agent Loops and graph based execution engines on a single semantic continuum. We propose SGH, Structured Graph Harness, which lifts control flow from implicit context into an explicit static DAG. SGH makes three commitments: execution plans are immutable within a plan version, planning execution and recovery are separated into three layers, and recovery follows a strict escalation protocol. These choices trade some expressiveness for controllability, verifiability, and implementability. Our contributions are fourfold: a scheduler unified framework that applies classical scheduling theory to LLM agent execution and identifies challenges introduced by non deterministic LLM nodes; a trade off analysis of controllability, expressiveness, and implementability across 70 surveyed systems; a formal specification including a node state machine with termination and soundness guarantees; and an attributable experimental framework with a seven group design for future validation. This is a position paper and design proposal. We provide a theoretical framework, design analysis, and experimental protocol, not a production implementation or empirical results.

[AI-42] Minimal Embodiment Enables Efficient Learning of Number Concepts in Robot

【速读】:该论文旨在解决智能系统如何通过感知运动体验习得抽象数量概念这一认知科学与人工智能领域的基础性挑战。其解决方案的关键在于利用具身学习(embodied learning)机制,即通过机器人在真实环境中与Franka Panda机械臂的自然交互进行序列计数训练,使神经网络模型在仅使用10%训练数据的情况下实现96.8%的计数准确率,显著优于纯视觉基线模型(60.6%)。研究表明,具身性并非单纯的信息来源,而是作为结构先验(structural prior)正则化学习过程,从而提升数据效率并自发生成符合生物认知规律的表征,如对数调谐的数字选择性单元、心理数轴组织、韦伯定律缩放特性及编码数值量的旋转动力学(相关系数r=0.97,斜率=30.6°/count),这些特征与儿童从子集知者到基数原则知者的认知发展轨迹一致。

链接: https://arxiv.org/abs/2604.11373
作者: Zhegong Shangguan,Alessandro Di Nuovo,Angelo Cangelosi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Robots are increasingly entering human-interactive scenarios that require understanding of quantity. How intelligent systems acquire abstract numerical concepts from sensorimotor experience remains a fundamental challenge in cognitive science and artificial intelligence. Here we investigate embodied numerical learning using a neural network model trained to perform sequential counting through naturalistic robotic interaction with a Franka Panda manipulator. We demonstrate that embodied models achieve 96.8% counting accuracy with only 10% of training data, compared to 60.6% for vision-only baselines. This advantage persists when visual-motor correspondences are randomized, indicating that embodiment functions as a structural prior that regularizes learning rather than as an information source. The model spontaneously develops biologically plausible representations: number-selective units with logarithmic tuning, mental number line organization, Weber-law scaling, and rotational dynamics encoding numerical magnitude ( r = 0.97 , slope = 30.6° /count). The learning trajectory parallels children’s developmental progression from subset-knowers to cardinal-principle knowers. These findings demonstrate that minimal embodiment can ground abstract concepts, improve data efficiency, and yield interpretable representations aligned with biological cognition, which may contribute to embodied mathematics tutoring and safety-critical industrial applications.

[AI-43] he Missing Knowledge Layer in Cognitive Architectures for AI Agents

【速读】:该论文旨在解决当前主流认知架构框架(如CoALA和JEPA)中缺乏显式知识层及其独立持久性语义的问题,这一缺失导致系统在处理事实与经验时出现类别错误——即对事实应用认知衰减机制,或以相同更新方式处理事实与体验。其解决方案的关键在于提出一个四层分解架构:知识(Knowledge)、记忆(Memory)、智慧(Wisdom)和智能(Intelligence),每一层具有根本不同的持久性语义:无限覆盖(indefinite supersession)、艾宾浩斯衰减(Ebbinghaus decay)、证据门控修订(evidence-gated revision)和瞬时推理(ephemeral inference)。该设计通过Python和Rust实现验证了架构分离的可行性,并强调这些区分应由持久性语义需求驱动,而非神经网络结构本身。

链接: https://arxiv.org/abs/2604.11364
作者: Michaël Roynard(LAAS-OASIS)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The two most influential cognitive architecture frameworks for AI agents, CoALA [21] and JEPA [12], both lack an explicit Knowledge layer with its own persistence semantics. This gap produces a category error: systems apply cognitive decay to factual claims, or treat facts and experiences with identical update mechanics. We survey persistence semantics across existing memory systems and identify eight convergence points, from Karpathy’s LLM Knowledge Base [10] to the BEAM benchmark’s near-zero contradiction-resolution scores [22], all pointing to related architectural gaps. We propose a four-layer decom position (Knowledge, Memory, Wisdom, Intelligence) where each layer has fundamentally different persistence semantics: indefinite supersession, Ebbinghaus decay, evidence-gated revision, and ephemeral inference respectively. Companion implementations in Python and Rust demonstrate the architectural separation is feasible. We borrow terminology from cognitive science as a useful analogy (the Knowledge/Memory distinction echoes Tulving’s trichotomy), but our layers are engineering constructs justified by persistence-semantics requirements, not by neural architecture. We argue that these distinctions demand distinct persistence semantics in engineering implementations, and that no current framework or system provides this.

[AI-44] CoRe-ECG: Advancing Self-Supervised Representation Learning for 12-Lead ECG via Contrastive and Reconstructive Synergy

【速读】:该论文旨在解决心电图(ECG)准确解读因标注数据稀缺和专家标注成本高昂而面临的挑战。现有自监督学习(SSL)方法通常依赖对比学习或重建学习,但单独使用任一方法均难以提供充分的监督信号,且存在如简单增强引入非生理畸变、模型利用多导联间平凡相关性作为捷径等问题。解决方案的关键在于提出一种统一的对比与重建预训练范式 CoRe-ECG,通过全局语义建模与局部结构学习之间的协同作用,使重建过程中的全局表示对齐能够引导实例级判别信号指导局部波形恢复;同时引入频域动态增强(Frequency Dynamic Augmentation, FDA)以基于频域重要性自适应扰动 ECG 信号,并采用时空双掩码(Spatio-Temporal Dual Masking, STDM)打破导联间的线性依赖关系,从而提升重建任务难度并促进更鲁棒的特征学习。

链接: https://arxiv.org/abs/2604.11359
作者: Zehao Qin,Xiaojian Lin,Ping Zhang,Hongliang Wu,Xinkang Wang,Guangling Liu,Bo Chen,Wenming Yang,Guijin Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurate interpretation of electrocardiogram (ECG) remains challenging due to the scarcity of labeled data and the high cost of expert annotation. Self-supervised learning (SSL) offers a promising solution by enabling models to learn expressive representations from unlabeled signals. Existing ECG SSL methods typically rely on either contrastive learning or reconstructive learning. However, each approach in isolation provides limited supervisory signals and suffers from additional limitations, including non-physiological distortions introduced by naive augmentations and trivial correlations across multiple leads that models may exploit as shortcuts. In this work, we propose CoRe-ECG, a unified contrastive and reconstructive pretraining paradigm that establishes a synergistic interaction between global semantic modeling and local structural learning. CoRe-ECG aligns global representations during reconstruction, enabling instance-level discriminative signals to guide local waveform recovery. To further enhance pretraining, we introduce Frequency Dynamic Augmentation (FDA) to adaptively perturb ECG signals based on their frequency-domain importance, and Spatio-Temporal Dual Masking (STDM) to break linear dependencies across leads, increasing the difficulty of reconstructive tasks. Our method achieves state-of-the-art performance across multiple downstream ECG datasets. Ablation studies further demonstrate the necessity and complementarity of each component. This approach provides a robust and physiologically meaningful representation learning framework for ECG analysis.

[AI-45] Dynamic Summary Generation for Interpretable Multimodal Depression Detection

【速读】:该论文旨在解决抑郁症在临床筛查中普遍存在的误诊与漏诊问题,其根源在于社会污名化和主观症状评估的不可靠性。为应对这一挑战,作者提出了一种“粗到精”的多阶段框架,其核心创新在于利用大语言模型(Large Language Models, LLMs)实现准确且可解释的检测:系统依次完成二分类筛查、五级严重程度分类及连续值回归任务,在每个阶段生成逐步精细化的临床摘要,并通过多模态融合模块整合文本、音频与视频特征,最终输出具有透明推理过程的预测结果;该设计显著提升了诊断准确性与可解释性,在E-DAIC和CMDC数据集上优于现有最先进方法。

链接: https://arxiv.org/abs/2604.11334
作者: Shiyu Teng,Jiaqing Liu,Hao Sun,Yu Li,Shurong Chai,Ruibo Hou,Tomoko Tateyama,Lanfen Lin,Yen-Wei Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Depression remains widely underdiagnosed and undertreated because stigma and subjective symptom ratings hinder reliable screening. To address this challenge, we propose a coarse-to-fine, multi-stage framework that leverages large language models (LLMs) for accurate and interpretable detection. The pipeline performs binary screening, five-class severity classification, and continuous regression. At each stage, an LLM produces progressively richer clinical summaries that guide a multimodal fusion module integrating text, audio, and video features, yielding predictions with transparent rationale. The system then consolidates all summaries into a concise, human-readable assessment report. Experiments on the E-DAIC and CMDC datasets show significant improvements over state-of-the-art baselines in both accuracy and interpretability.

[AI-46] Select Smarter Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

【速读】:该论文旨在解决自动提示优化(Automatic Prompt Optimization, APO)中评估信号质量受限于计算成本的问题,即在不显著增加资源消耗的前提下,如何高效选择最具区分度的训练样本用于提示评估。现有方法要么固定评估子集(缺乏灵活性),要么启发式调整(不稳定且无理论保障)。论文提出Prompt-Aware Online Evaluation Scheduling (POES),其核心在于将APO建模为在线自适应测试问题:提示作为被试者,训练样本作为试题,调度器应动态选择能最好区分最优候选提示的测试项。POES整合了基于项目反应理论(Item Response Theory, IRT)的判别效用、设施选址覆盖项和考虑切换成本的热启动交换机制,构建一个具有单调子模性质的统一目标函数,从而在冷启动时提供(1−1/e)的贪婪近似保证,在热启动更新时保持漂移有界。通过引入自适应控制器调节探索与利用平衡,POES在36个任务上实现平均准确率提升6.2%,同时仅消耗约4%的额外token,且在仅使用k=20个样本时性能优于传统方法使用k=30–50样本的表现,验证了“选得聪明比选得多更重要”的关键洞见。

链接: https://arxiv.org/abs/2604.11328
作者: Xiaoyu Ma,Yiwen Li,Haoyue Liu,Zhichao Wang,Ye Chen,Yongxin Guo,Xiaoying Tang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Automatic prompt optimization (APO) hinges on the quality of its evaluation signal, yet scoring every prompt candidate on the full training set is prohibitively expensive. Existing methods either fix a single evaluation subset before optimization begins (principled but prompt-agnostic) or adapt it heuristically during optimization (flexible but unstable and lacking formal guarantees). We observe that APO naturally maps to an online adaptive testing problem: prompts are examinees, training examples are test items, and the scheduler should select items that best discriminate among the strongest candidates. This insight motivates Prompt-Aware Online Evaluation Scheduling (POES), which integrates an IRT-based discrimination utility, a facility-location coverage term, and switching-cost-aware warm-start swaps into a unified objective that is provably monotone submodular, yielding a (1-1/e) greedy guarantee for cold starts and bounded drift for warm-start updates. An adaptive controller modulates the exploration-exploitation balance based on optimization progress. Across 36 tasks spanning three benchmark families, POES achieves the highest overall average accuracy (6.2 percent improvement over the best baseline) with negligible token overhead (approximately 4 percent) at the same evaluation budget. Moreover, principled selection at k = 20 examples matches or exceeds the performance of naive evaluation at k = 30-50, reducing token consumption by 35-60 percent, showing that selecting smarter is more effective than selecting more. Our results demonstrate that evaluation scheduling is a first-class component of APO, not an implementation detail.

[AI-47] S3: Structured Sparsity Specification

【速读】:该论文旨在解决结构化稀疏(Structured Sparsity)在深度神经网络模型压缩与优化中难以精确建模和灵活组合的问题。现有方法往往缺乏统一的抽象框架,导致不同稀疏模式(如N:M剪枝、通道剪枝等)难以标准化定义与跨张量协同优化。解决方案的关键在于提出一种代数框架——Structured Sparsity Specification (S³),其通过三个核心组件实现对稀疏模式的精确描述:View(用于张量布局重构)、Block(定义原子剪枝单元)以及Sparsity Decision Scope(决定稀疏作用范围),并支持跨张量耦合(Coupling)以实现协调剪枝。该框架可无缝集成Optimal Brain Damage (OBD) 和 Surgeon (OBS) 等经典剪枝算法,实验表明基于S³构建的结构化剪枝方法在输出重建性能上优于传统二阶启发式策略。

链接: https://arxiv.org/abs/2604.11315
作者: Ayoub Ghriss
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages main text, 12 pages appendix

点击查看摘要

Abstract:We introduce the Structured Sparsity Specification (S ^3 ), an algebraic framework for defining, composing, and implementing structured sparse patterns. S ^3 specifies sparsity through three components: a View that reshapes the tensor via layout composition, a Block specification that defines the atomic pruning unit, and the sparsity decision Scope. Both Block and Scope support Coupling across tensors for coordinated sparsification. S ^3 enables precise specification of diverse sparsity structures, from fine-grained N:M patterns to coarse channel pruning, and integrates seamlessly with Optimal Brain Damage (OBD) and Surgeon (OBS). We formalize the framework mathematically, demonstrate its expressiveness on canonical patterns, and validate it experimentally via structured OBS and OBD implementations built entirely on S ^3 , which surpasses well-established second order heuristics on output reconstruction across common configurations.

[AI-48] PaperScope: A Multi-Modal Multi-Document Benchmark for Agent ic Deep Research Across Massive Scientific Papers

【速读】:该论文旨在解决当前多模态大语言模型(Multi-modal Large Language Models, MLLMs)在前沿科学研究中缺乏系统性评估的问题,尤其是现有基准测试主要聚焦于单文档理解,而真实科研工作流程需要从多篇论文中整合文本、表格和图表等多模态证据进行深度推理。解决方案的关键在于提出PaperScope——一个面向代理式深度研究的多模态、多文档基准测试集,其核心创新包括:(1) 基于包含2000余篇AI论文的知识图谱构建结构化的科学基础;(2) 通过语义密集证据构建机制与优化的随机游走文章选择器,生成主题连贯且语义丰富的论文集合;(3) 提供涵盖推理、检索、摘要和问题求解的多任务问答对(超2000组),实现对多步科学推理能力的全面评估。实验表明,即使先进系统如OpenAI Deep Research和Tongyi Deep Research在该基准上表现有限,凸显了长上下文检索与多源深度推理的挑战,验证了PaperScope作为严谨且可扩展的多模态多源深度研究数据集构建管道的有效性。

链接: https://arxiv.org/abs/2604.11307
作者: Lei Xiong,Huaying Yuan,Zheng Liu,Zhao Cao,Zhicheng Dou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Leveraging Multi-modal Large Language Models (MLLMs) to accelerate frontier scientific research is promising, yet how to rigorously evaluate such systems remains unclear. Existing benchmarks mainly focus on single-document understanding, whereas real scientific workflows require integrating evidence from multiple papers, including their text, tables, and figures. As a result, multi-modal, multi-document scientific reasoning remains underexplored and lacks systematic evaluation. To address this gap, we introduce PaperScope, a multi-modal multi-document benchmark designed for agentic deep research. PaperScope presents three advantages: (1) Structured scientific grounding. It is built on a knowledge graph of over 2,000 AI papers spanning three years, providing a structured foundation for research-oriented queries. (2) Semantically dense evidence construction. It integrates semantically related key information nodes and employs optimized random-walk article selector to sample thematically coherent paper sets, thereby ensuring adequate semantic density and task complexity. (3) Multi-task evaluation of scientific reasoning. It contains over 2,000 QA pairs across reasoning, retrieval, summarization, and problem solving, enabling evaluation of multi-step scientific reasoning. Experimental results show that even advanced systems such as OpenAI Deep Research and Tongyi Deep Research achieve limited scores on PaperScope, highlighting the difficulty of long-context retrieval and deep multi-source reasoning. PaperScope thus provides a rigorous benchmark alongside a scalable pipeline for constructing large-scale multi-modal, multi-source deep research datasets.

[AI-49] Learning to Forget – Hierarchical Episodic Memory for Lifelong Robot Deployment

【速读】:该论文旨在解决机器人在长期人机协作中面临的持续多模态感知导致的生命周期情景记忆(episodic memory, EM)存储爆炸与实时查询效率低下的问题。解决方案的关键在于提出 H²-EMV 框架,通过用户交互学习记忆选择策略:首先增量构建分层情景记忆结构,其次基于语言模型估计内容相关性并实现条件化选择性遗忘,最后根据用户对遗忘细节的反馈动态更新自然语言规则。该机制使系统在减少45%内存占用和35%查询计算开销的同时保持问答准确率,并随时间适应用户偏好,第二轮查询准确率提升70%,实现了可扩展且个性化的长期情景记忆管理。

链接: https://arxiv.org/abs/2604.11306
作者: Leonard Bärmann,Joana Plewnia,Alex Waibel,Tamim Asfour
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Robots must verbalize their past experiences when users ask “Where did you put my keys?” or “Why did the task fail?” Yet maintaining life-long episodic memory (EM) from continuous multimodal perception quickly exceeds storage limits and makes real-time query impractical, calling for selective forgetting that adapts to users’ notions of relevance. We present H ^2 -EMV, a framework enabling humanoids to learn what to remember through user interaction. Our approach incrementally constructs hierarchical EM, selectively forgets using language-model-based relevance estimation conditioned on learned natural-language rules, and updates these rules given user feedback about forgotten details. Evaluations on simulated household tasks and 20.5-hour-long real-world recordings from ARMAR-7 demonstrate that H ^2 -EMV maintains question-answering accuracy while reducing memory size by 45% and query-time compute by 35%. Critically, performance improves over time - accuracy increases 70% in second-round queries by adapting to user-specific priorities - demonstrating that learned forgetting enables scalable, personalized EM for long-term human-robot collaboration.

[AI-50] BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows

【速读】:该论文旨在解决现有AI评估基准在专业工作流程中缺乏经济意义和实际 fidelity(真实性)的问题,特别是针对高价值、劳动密集型职业中的前沿AI代理(agent)能力评估不足。其解决方案的关键在于构建一个名为 BankerToolBench (BTB) 的开源端到端分析工作流基准,该基准基于真实投资银行场景,由502名来自头部投行的资深从业者共同设计,要求AI代理完成包括数据室导航、行业工具使用(如市场数据平台、SEC文件数据库)及生成多格式交付物(Excel财务模型、PowerPoint路演幻灯片、PDF/Word报告)在内的复杂任务。BTB通过100余项由资深投行人士定义的评分标准自动评估输出质量,并量化其对利益相关者的实用价值,从而为评估大语言模型(LLM)或智能体在高风险专业场景下的表现提供了可量化的生态效度框架。

链接: https://arxiv.org/abs/2604.11304
作者: Elaine Lau,Markus Dücker,Ronak Chaudhary,Hui Wen Goh,Rosemary Wei,Vaibhav Kumar,Saed Qunbar,Guram Gogia,Yi Liu,Scott Millslagle,Nasim Borazjanizadeh,Ulyana Tkachenko,Samuel Eshun Danquah,Collin Schweiker,Vijay Karumathil,Asrith Devalaraju,Varsha Sandadi,Haemi Nam,Punit Arani,Ray Epps,Abdullah Arif,Sahil Bhaiwala,Curtis Northcutt,Skyler Wang,Anish Athalye,Jonas Mueller,Francisco Guzmán
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows. To evaluate frontier AI agents in a high-value, labor-intensive profession, we introduce BankerToolBench (BTB): an open-source benchmark of end-to-end analytical workflows routinely performed by junior investment bankers. To develop an ecologically valid benchmark grounded in representative work environments, we collaborated with 502 investment bankers from leading firms. BTB requires agents to execute senior banker requests by navigating data rooms, using industry tools (market data platform, SEC filings database), and generating multi-file deliverables–including Excel financial models, PowerPoint pitch decks, and PDF/Word reports. Completing a BTB task takes bankers up to 21 hours, underscoring the economic stakes of successfully delegating this work to AI. BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility. Testing 9 frontier models, we find that even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria and bankers rate 0% of its outputs as client-ready. Our failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows.

[AI-51] 3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS

【速读】:该论文旨在解决机器人操作中因视觉遮挡导致的空间记忆缺失问题,即传统反应式策略仅依赖当前摄像头帧进行决策,无法在目标物体被遮挡后维持对物体位置的准确感知与重规划能力。解决方案的关键在于提出3D-Anchored Lookahead Planning (3D-ALP),其核心是将蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)与一个3D一致的世界模型(3D-consistent world model)作为回放模拟器(rollout oracle)相结合,并引入一个持久的相机到世界坐标系锚点(camera-to-world anchor, c2w anchor),该锚点可在遮挡条件下持续保持物体的空间定位信息,从而实现对不可见目标的精准重规划。实验表明,该方法在需要空间记忆的任务中显著优于贪心反应式基线,且消融实验证明树搜索中的空间记忆机制贡献了82%的性能提升。

链接: https://arxiv.org/abs/2604.11302
作者: Bronislav Sidik,Dror Mizrahi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 5 pages, 1 figure, 1 table

点击查看摘要

Abstract:We present 3D-Anchored Lookahead Planning (3D-ALP), a System 2 reasoning engine for robotic manipulation that combines Monte Carlo Tree Search (MCTS) with a 3D-consistent world model as the rollout oracle. Unlike reactive policies that evaluate actions from the current camera frame only, 3D-ALP maintains a persistent camera-to-world (c2w) anchor that survives occlusion, enabling accurate replanning to object positions that are no longer directly observable. On a 5-step sequential reach task requiring spatial memory (Experiment E3), 3D-ALP achieves 0.650 0.109 success rate on memory-required steps versus 0.006 0.008 for a greedy reactive baseline (\Delta=+0.645), while step 5 success reaches 0.822 against 0.000 for greedy. An ablation study (30 episodes, 3 seeds) isolates tree search spatial memory as the primary driver (+0.533, 82% of gain) with additional benefit from deeper lookahead (+0.111, 17%). We also identify and resolve four structural failure modes in applying UCT-MCTS (Upper Confidence Bounds applied to Trees [10]) to continuous robotic manipulation.

[AI-52] Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Using a Large Language Model

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成个性化运动处方时输出一致性不足的问题,尤其是其在相同输入条件下是否能稳定产生结构一致、安全合规且语义稳定的处方内容。解决方案的关键在于通过重复生成设计评估 Gemini 2.5 Flash 模型在六种临床场景下的表现,并从语义一致性(基于 SBERT 的余弦相似度)、结构一致性(基于 FITT 原则的 AI 评判)和安全性表达一致性(包括安全表述出现频率与句级量化)三个维度进行系统分析,结果表明尽管语义一致性较高(平均余弦相似度 0.879–0.939),但定量参数如运动强度存在显著变异性,且安全表达虽普遍出现(100% 输出包含),但数量差异显著(H=86.18, p<0.001),提示模型可靠性高度依赖于提示词(prompt)结构设计,需引入额外结构约束和专家验证后方可用于临床部署。

链接: https://arxiv.org/abs/2604.11287
作者: Kihyuk Lee
机构: 未知
类目: Artificial Intelligence (cs.AI); Other Quantitative Biology (q-bio.OT)
备注: 15 pages, 5 tables, 3 figures

点击查看摘要

Abstract:Background: Large language models (LLMs) have been explored as tools for generating personalized exercise prescriptions, yet the consistency of outputs under identical conditions remains insufficiently examined. Objective: This study evaluated the intra-model consistency of LLM-generated exercise prescriptions using a repeated generation design. Methods: Six clinical scenarios were used to generate exercise prescriptions using Gemini 2.5 Flash (20 outputs per scenario; total n = 120). Consistency was assessed across three dimensions: (1) semantic consistency using SBERT-based cosine similarity, (2) structural consistency based on the FITT principle using an AI-as-a-judge approach, and (3) safety expression consistency, including inclusion rates and sentence-level quantification. Results: Semantic similarity was high across scenarios (mean cosine similarity: 0.879-0.939), with greater consistency in clinically constrained cases. Frequency showed consistent patterns, whereas variability was observed in quantitative components, particularly exercise intensity. Unclassifiable intensity expressions were observed in 10-25% of resistance training outputs. Safety-related expressions were included in 100% of outputs; however, safety sentence counts varied significantly across scenarios (H=86.18, p less than 0.001), with clinical cases generating more safety expressions than healthy adult cases. Conclusions: LLM-generated exercise prescriptions demonstrated high semantic consistency but showed variability in key quantitative components. Reliability depends substantially on prompt structure, and additional structural constraints and expert validation are needed before clinical deployment.

[AI-53] HEIA: Learning Complete Kleene Three-Valued Logic in a Pure-Neural Modular Architecture ICML2026

【速读】:该论文旨在解决神经网络在不确定性情境下实现组合泛化(compositional generalization)的难题,特别是如何在不依赖外部符号求解器的情况下,端到端地学习完整的 Kleene 三值逻辑(K3)推理能力。其解决方案的关键在于提出一种模块化神经架构 THEIA,该架构通过四个专用引擎(分别处理算术、序关系、集合成员和命题逻辑)协同工作,并最终在逻辑模块中融合输出结果。这种结构化的归纳偏置(structured inductive bias)使模型能够在训练长度之外显著泛化——例如,在模-3序列组合任务中从5步训练推广至500步评估时达到99.97%的准确率,而同等参数量的全连接MLP或Transformer基线则无法实现此类长度泛化,后者即使使用预归一化和调优策略,也仅能维持约99.24%的精度。机制探查进一步表明,模块化设计促使上游引擎仅编码领域特定变量而不提前做出最终真值判断(探测准确率为74%,即不确定性的上限),最终真值在逻辑引擎边界处才被确定,这一因果机制经激活修补实验验证(100%翻转率,n=5种子重复)。因此,模块化架构与单体架构采用不同的组合性表示路径,凸显了结构化先验对复杂推理泛化的决定性作用。

链接: https://arxiv.org/abs/2604.11284
作者: Augustus Haoyang Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: 14 pages, 10 tables. Manuscript under review at the 2nd Workshop on Compositional Learning (CompLearn), ICML 2026

点击查看摘要

Abstract:We present THEIA, a modular neural architecture that learns complete Kleene three-valued logic (K3) end-to-end without any external symbolic solver, and investigate what architectural prior enables compositional generalization under uncertainty. THEIA processes four mathematical domains (arithmetic, order, set membership, propositional logic) through dedicated engines that converge in a final logic module. Trained on a 2M-sample dataset with input space ~3.4x10^13, it achieves 12/12 Kleene K3 rule coverage across 5 seeds in 9.2 +/- 3.5 minutes (5.6x faster than a parameter-comparable Transformer under matched settings). A mod-3 sequential composition experiment generalizes from 5-step training to 500-step evaluation at 99.97% +/- 0.02% – a result that critically depends on structured inductive bias: replacing the four-engine backbone with a flat MLP collapses length generalization to chance by 50 steps regardless of capacity (both 0.80M and parameter-matched 2.75M variants fail), while a pre-LN TF8LTuned Transformer baseline (3,582,147 params) trained under the identical protocol reaches 99.24% at 500 steps (Appendix D). Mechanistic probing reveals that modularity induces a delayed verdict: upstream engines encode domain-specific variables without committing to the final truth value (probe accuracy = 74% uncertainty-only ceiling), with the verdict emerging only at the Logic Engine boundary – causally confirmed by activation patching (100% flip rate on 986 matched pairs, replicated across n=5 seeds; 100.0% aggregate). The Transformer baseline reaches equivalent correctness through a qualitatively different representational trajectory (contraction then expansion), suggesting that modular and monolithic architectures implement distinct compositional strategies.

[AI-54] AbLWR:A Context-Aware Listwise Ranking Framework for Antibody-Antigen Binding Affinity Prediction via Positive-Unlabeled Learning

【速读】:该论文旨在解决抗体-抗原结合亲和力预测中因标签稀疏性和抗原变异复杂性导致的性能瓶颈问题。其解决方案的关键在于提出AbLWR框架,将传统的回归任务重构为列表级排序问题,并引入PU(Positive-Unlabeled)学习机制,通过双层对比目标与元优化标签精炼策略来缓解标签稀疏性;同时采用同源抗原采样策略,结合多头自注意力(Multi-Head Self-Attention, MHSA)显式建模训练列表内样本间关系,从而捕捉细微的亲和力差异,显著提升了模型在随机交叉验证中的Precision@1(P@1)指标,且在流感病毒和IL-33案例研究中展现出良好的实际应用价值。

链接: https://arxiv.org/abs/2604.11272
作者: Fan Xu,Zhi-an Huang,Haohuai He,Yidong Song,Wei Liu,Dongxu Zhang,Yao Hu,Kay Chen Tan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate prediction of antibody-antigen binding affinity is fundamental to therapeutic design, yet remains constrained by severe label sparsity and the complexity of antigenic variations. In this paper, we propose AbLWR (Antibody-antigen binding affinity List-Wise Ranking), a novel framework that reformulates the conventional affinity regression task as a listwise ranking problem. To mitigate label sparsity, AbLWR incorporates a PU (Positive-Unlabeled) learning mechanism leveraging a dual-level contrastive objective and meta-optimized label refinement to learn robust representations. Furthermore, we address antigenic variation by employing a homologous antigen sampling strategy where Multi-Head Self-Attention (MHSA) explicitly models inter-sample relationships within training lists to capture subtle affinity nuances. Extensive experiments demonstrate that AbLWR significantly outperforms state-of-the-art baselines, improving the Precision@1 (P@1) by over 10 % in randomized cross-validation experiments. Notably, case studies on Influenza and IL-33 validate its practical utility, demonstrating robust ranking consistency in distinguishing subtle viral mutations and efficiently prioritizing top-tier candidates for wet-lab screening.

[AI-55] Inspectable AI for Science: A Research Object Approach to Generative AI Governance

【速读】:该论文试图解决生成式 AI(Generative AI)在科学研究中缺乏规范治理机制的问题,尤其是在安全与隐私(Security and Privacy, SP)研究领域,如何确保AI辅助科研过程的可追溯性、可验证性和合规性。解决方案的关键在于提出“AI作为研究对象”(AI as a Research Object, AI-RO)范式,将AI交互视为结构化、可检查的研究组件,并基于Research Object理论和FAIR原则构建一套记录模型配置、提示词(prompt)及输出结果的交互日志与元数据封装框架,从而实现可控披露与完整性保障的溯源记录。通过轻量级写作流程实现人类撰写的文献综述笔记由语言模型合成并生成可验证的溯源凭证,论证了以结构化文档、受控披露和完整性保护的溯源捕获为核心的技术路径,为未来广泛采纳此类实践提供了必要方向。

链接: https://arxiv.org/abs/2604.11261
作者: Ruta Binkyte,Sharif Abuaddba,Chamikara Mahawaga,Ming Ding,Natasha Fernandes,Mario Fritz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces AI as a Research Object (AI-RO), a paradigm for governing the use of generative AI in scientific research. Instead of debating whether AI is an author or merely a tool, we propose treating AI interactions as structured, inspectable components of the research process. Under this view, the legitimacy of an AI-assisted scientific paper depends on how model use is integrated into the workflow, documented, and made accountable. Drawing on Research Object theory and FAIR principles, we propose a framework for recording model configuration, prompts, and outputs through interaction logs and metadata packaging. These properties are particularly consequential in security and privacy (SP) research, where provenance artifacts must satisfy confidentiality constraints, integrity guarantees, and auditability requirements that generic disclosure practices do not address. We implement a lightweight writing pipeline in which a language model synthesizes human-authored structured literature review notes under explicit constraints and produces a verifiable provenance record. We present this work as a position supported by an initial demonstrative workflow, arguing that governance of generative AI in science can be implemented as structured documentation, controlled disclosure, and integrity-preserving provenance capture. Based on this example, we outline and motivate a set of necessary future developments required to make such practices practical and widely adoptable.

[AI-56] Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization

【速读】:该论文旨在解决移动图形用户界面(GUI)代理在任务执行中忽视用户隐私个性化的问题。现有系统多聚焦于任务成功率或效率优化,而忽略了不同用户对隐私保护的偏好差异,导致代理行为难以适配个体需求。其核心问题是:隐私偏好驱动的执行轨迹具有结构异质性和长度可变性,使得标准偏好优化方法不稳定且信息量不足。解决方案的关键在于提出轨迹诱导偏好优化(Trajectory Induced Preference Optimization, TIPO),通过引入偏好强度加权机制强化关键隐私相关步骤的信号,并采用填充门控策略抑制对齐噪声,从而提升代理对不同隐私人格(persona)的区分度与一致性,同时保持高任务执行能力。

链接: https://arxiv.org/abs/2604.11259
作者: Zhixin Lin,Jungang Li,Dongliang Xu,Shidong Pan,Yibo Shi,Yuchi Liu,Yuecong Min,Yue Yao
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 10 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Mobile GUI agents powered by Multimodal Large Language Models (MLLMs) can execute complex tasks on mobile devices. Despite this progress, most existing systems still optimize task success or efficiency, neglecting users’ privacy personalization. In this paper, we study the often-overlooked problem of agent personalization. We observe that personalization can induce systematic structural heterogeneity in execution trajectories. For example, privacy-first users often prefer protective actions, e.g., refusing permissions, logging out, and minimizing exposure, leading to logically different execution trajectories from utility-first users. Such variable-length and structurally different trajectories make standard preference optimization unstable and less informative. To address this issue, we propose Trajectory Induced Preference Optimization (TIPO), which uses preference-intensity weighting to emphasize key privacy-related steps and padding gating to suppress alignment noise. Results on our Privacy Preference Dataset show that TIPO improves persona alignment and distinction while preserving strong task executability, achieving 65.60% SR, 46.22 Compliance, and 66.67% PD, outperforming existing optimization methods across various GUI tasks. The code and dataset will be publicly released at this https URL.

[AI-57] Measuring the Authority Stack of AI Systems: Empirical Analysis of 366120 Forced-Choice Responses Across 8 AI Models

【速读】:该论文旨在解决人工智能系统(AI)在面对结构化伦理困境时,其价值优先级、证据偏好及源信任层级等决策机制的可测量性与稳定性问题。解决方案的关键在于构建并应用PRISM基准测试工具,该工具基于Authority Stack框架的三层结构(价值优先级L4、证据类型偏好L3、源信任层级L2),通过包含14,175个独特场景的强制选择任务,在7个专业领域、3种严重程度、3种决策时间范围和5种场景变体下对8个主流AI模型进行大规模实证评估。研究发现AI模型表现出可识别的价值倾向(如Universalism-first与Security-first的对称分布)、领域特异性重构(如国防领域安全价值显著上升)、证据偏好分化以及源信任的趋同性,并揭示了Paired Consistency Scores(PCS)与Test-Retest Reliability(TRR)之间的差异,表明AI决策的不稳定性主要源于情境敏感性而非随机噪声,从而为AI在不同专业领域的部署提供了关键依据。

链接: https://arxiv.org/abs/2604.11216
作者: Seulki Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 15 tables, no figures. AIO Working Paper. Companion to: S. Lee (2026a)

点击查看摘要

Abstract:What values, evidence preferences, and source trust hierarchies do AI systems actually exhibit when facing structured dilemmas? We present the first large-scale empirical mapping of AI decision-making across all three layers of the Authority Stack framework (S. Lee, 2026a): value priorities (L4), evidence-type preferences (L3), and source trust hierarchies (L2). Using the PRISM benchmark – a forced-choice instrument of 14,175 unique scenarios per layer, spanning 7 professional domains, 3 severity levels, 3 decision timeframes, and 5 scenario variants – we evaluated 8 major AI models at temperature 0, yielding 366,120 total responses. Key findings include: (1) a symmetric 4:4 split between Universalism-first and Security-first models at L4; (2) dramatic defense-domain value restructuring where Security surges to near-ceiling win-rates (95.1%-99.8%) in 6 of 8 models; (3) divergent evidence hierarchies at L3, with some models favoring empirical-scientific evidence while others prefer pattern-based or experiential evidence; (4) broad convergence on institutional source trust at L2; and (5) Paired Consistency Scores (PCS) ranging from 57.4% to 69.2%, revealing substantial framing sensitivity across scenario variants. Test-Retest Reliability (TRR) ranges from 91.7% to 98.6%, indicating that value instability stems primarily from variant sensitivity rather than stochastic noise. These findings demonstrate that AI models possess measurable – if sometimes unstable – Authority Stacks with consequential implications for deployment across professional domains.

[AI-58] Designing Adaptive Digital Nudging Systems with LLM -Driven Reasoning

【速读】:该论文旨在解决数字助推系统(digital nudging systems)在软件架构设计中缺乏将行为科学有效转化为可实施结构的指导框架的问题,尤其在于如何整合多维用户建模与伦理合规性作为核心架构关注点。其解决方案的关键在于提出一种基于行为理论的显式架构决策机制,将伦理与公平视为结构性约束(guardrails),而非事后实现细节,并通过分层处理流程与跨切面评估模块确保监管合规性,从而在提升干预有效性的同时保障伦理边界。

链接: https://arxiv.org/abs/2604.11206
作者: Tiziano Santilli,Mina Alipour,Mahyar Tourchi Moghaddam
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Digital nudging systems lack architectural guidance for translating behavioral science into software design. While research identifies nudge strategies and quality attributes, existing architectures fail to integrate multi-dimensional user modeling with ethical compliance as architectural concerns. We present an architecture that uses behavioral theory through explicit architectural decisions, treating ethics and fairness as structural guardrails rather than implementation details. A literature review synthesized 68 nudging strategies, 11 quality attributes, and 3 user profiling dimensions into architectural requirements. The architecture implements sequential processing layers with cross-cutting evaluation modules enforcing regulatory compliance. Validation with 13 software architects confirmed requirements satisfaction and domain transferability. An LLM-powered proof-of-concept in residential energy sustainability demonstrated feasibility through evaluation with 15 users, achieving high perceived intervention quality and measurable positive emotional impact. This work bridges behavioral science and software architecture by providing reusable patterns for adaptive systems that balance effectiveness with ethical constraints.

[AI-59] ShapShift: Explaining Model Prediction Shifts with Subgroup Conditional Shapley Values

【速读】:该论文旨在解决机器学习模型预测结果因输入数据分布变化而产生的预测偏移(prediction shift)问题,此类偏移可能对下游业务指标(如银行贷款审批率)产生显著影响,因此理解其成因至关重要。解决方案的关键在于提出一种基于Shapley值的归因方法——\ours,该方法将预测偏移归因于可解释子群组(由决策树结构定义)的条件概率变化;具体而言,首先在单棵决策树上精确计算分裂节点处条件概率变化带来的影响,进而扩展至树集成模型时通过选择最具解释力的树并考虑残差效应,最后进一步推广至模型无关场景,利用一种新型目标函数训练代理树以适配神经网络等复杂模型。该方法能够在保证解释简洁性、忠实性和近完备性的同时,实现跨模型类别的预测偏移归因分析,从而支持动态环境下的模型监控。

链接: https://arxiv.org/abs/2604.11200
作者: Tom Bewley,Salim I. Amoukou,Emanuele Albini,Saumitra Mishra,Manuela Veloso
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Changes in input distribution can induce shifts in the average predictions of machine learning models. Such prediction shifts may impact downstream business outcomes (e.g. a bank’s loan approval rate), so understanding their causes can be crucial. We propose \ours: a Shapley value method for attributing prediction shifts to changes in the conditional probabilities of interpretable subgroups of data, where these subgroups are defined by the structure of decision trees. We initially apply this method to single decision trees, providing exact explanations based on conditional probability changes at split nodes. Next, we extend it to tree ensembles by selecting the most explanatory tree and accounting for residual effects. Finally, we propose a model-agnostic variant using surrogate trees grown with a novel objective function, allowing application to models like neural networks. While exact computation can be intensive, approximation techniques enable practical application. We show that \ours provides simple, faithful, and near-complete explanations of prediction shifts across model classes, aiding model monitoring in dynamic environments.

[AI-60] aking a Pulse on How Generative AI is Reshaping the Software Engineering Research Landscape

【速读】:该论文旨在解决当前软件工程(Software Engineering, SE)研究领域中对生成式人工智能(Generative AI, GenAI)使用现状、影响及治理需求缺乏系统性实证证据的问题。其解决方案的关键在于通过一项针对457名发表于顶级会议和期刊的SE研究人员的大规模调查,结合定量与定性分析方法,细致刻画GenAI在SE研究全流程中的应用分布、使用者动机、感知收益与风险,并提出基于实证发现的使用场景分类、风险缓解策略及治理建议,从而为GenAI在学术研究中的负责任整合提供可操作的基准框架。

链接: https://arxiv.org/abs/2604.11184
作者: Bianca Trinkenreich,Fabio Calefato,Kelly Blincoe,Viggo Tellefsen Wivestad,Antonio Pedro Santos Alves,Júlia Condé Araújo,Marina Condé Araújo,Paolo Tell,Marcos Kalinowski,Thomas Zimmermann,Margaret-Anne Storey
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Context: Software engineering (SE) researchers increasingly study Generative AI (GenAI) while also incorporating it into their own research practices. Despite rapid adoption, there is limited empirical evidence on how GenAI is used in SE research and its implications for research practices and governance. Aims: We conduct a large-scale survey of 457 SE researchers publishing in top venues between 2023 and 2025. Method: Using quantitative and qualitative analyses, we examine who uses GenAI and why, where it is used across research activities, and how researchers perceive its benefits, opportunities, challenges, risks, and governance. Results: GenAI use is widespread, with many researchers reporting pressure to adopt and align their work with it. Usage is concentrated in writing and early-stage activities, while methodological and analytical tasks remain largely human-driven. Although productivity gains are widely perceived, concerns about trust, correctness, and regulatory uncertainty persist. Researchers highlight risks such as inaccuracies and bias, emphasize mitigation through human oversight and verification, and call for clearer governance, including guidance on responsible use and peer review. Conclusion: We provide a fine-grained, SE-specific characterization of GenAI use across research activities, along with taxonomies of GenAI use cases for research and peer review, opportunities, risks, mitigation strategies, and governance needs. These findings establish an empirical baseline for the responsible integration of GenAI into academic practice.

[AI-61] EmbodiedGovBench: A Benchmark for Governance Recovery and Upgrade Safety in Embodied Agent Systems

【速读】:该论文旨在解决当前具身人工智能(Embodied AI)系统评估中忽视治理能力的问题,即现有评价体系主要依赖任务完成率或操作精度等指标,而未能衡量系统是否具备可管控性(governability),如遵守能力边界、执行政策、安全恢复、审计追踪及响应人类干预等关键属性。其解决方案的核心是提出EmbodiedGovBench基准测试框架,通过定义七维治理维度(包括未经授权的能力调用、运行时漂移鲁棒性、恢复成功率、策略可移植性、版本升级安全性、人工干预响应性和审计完整性)和覆盖单机器人与多机器人场景的结构化评估协议,构建首个面向治理导向的具身智能系统评估体系,从而推动“可治理性”成为具身AI系统的第一优先级评估目标。

链接: https://arxiv.org/abs/2604.11174
作者: Xue Qin,Simin Luan,John See,Cong Yang,Zhijun Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 34 pages, 7 tables. Code: this https URL

点击查看摘要

Abstract:Recent progress in embodied AI has produced a growing ecosystem of robot policies, foundation models, and modular runtimes. However, current evaluation remains dominated by task success metrics such as completion rate or manipulation accuracy. These metrics leave a critical gap: they do not measure whether embodied systems are governable – whether they respect capability boundaries, enforce policies, recover safely, maintain audit trails, and respond to human oversight. We present EmbodiedGovBench, a benchmark for governance-oriented evaluation of embodied agent systems. Rather than asking only whether a robot can complete a task, EmbodiedGovBench evaluates whether the system remains controllable, policy-bounded, recoverable, auditable, and evolution-safe under realistic perturbations. The benchmark covers seven governance dimensions: unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, and audit completeness. We define a benchmark structure spanning single-robot and fleet settings, with scenario templates, perturbation operators, governance metrics, and baseline evaluation protocols. We describe how the benchmark can be instantiated over embodied capability runtimes with modular interfaces and contract-aware upgrade workflows. Our analysis suggests that embodied governance should become a first-class evaluation target. EmbodiedGovBench provides the initial measurement framework for that shift.

[AI-62] Environmental Footprint of GenAI Research: Insights from the Moshi Foundation Model

【速读】:该论文旨在解决生成式人工智能(Generative AI)研究过程中日益增长的环境影响问题,特别是针对多模态大语言模型(Multi-modal Large Language Models, MLLMs)研发阶段中计算资源消耗与碳排放缺乏透明度和系统性量化的问题。其解决方案的关键在于通过生命周期评估(Life Cycle Assessment, LCA)方法,对模型开发全流程进行细粒度分析,包括早期实验、失败训练运行、调试及消融研究等被长期忽视的环节,精确量化GPU时间投入与硬件制造及使用阶段的能源消耗、水资源使用、温室气体排放和矿产资源耗竭等环境影响指标。这一方法为制定可操作的减排策略提供了科学依据,推动MLLM研究向更可持续的方向发展。

链接: https://arxiv.org/abs/2604.11154
作者: Marta López-Rauhut,Loic Landrieu,Mathieu Aubry,Anne-Laure Ligozat
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages, 12 figures, 8 tables

点击查看摘要

Abstract:New multi-modal large language models (MLLMs) are continuously being trained and deployed, following rapid development cycles. This generative AI frenzy is driving steady increases in energy consumption, greenhouse gas emissions, and a plethora of other environmental impacts linked to datacenter construction and hardware manufacturing. Mitigating the environmental consequences of GenAI remains challenging due to an overall lack of transparency by the main actors in the field. Even when the environmental impacts of specific models are mentioned, they are typically restricted to the carbon footprint of the final training run, omitting the research and development stages. In this work, we explore the impact of GenAI research through a fine-grained analysis of the compute spent to create Moshi, a 7B-parameter speech-text foundation model for real-time dialogue developed by Kyutai, a leading privately funded open science AI lab. For the first time, our study dives into the anatomy of compute-intensive MLLM research, quantifying the GPU-time invested in specific model components and training phases, as well as early experimental stages, failed training runs, debugging, and ablation studies. Additionally, we assess the environmental impacts of creating Moshi from beginning to end using a life cycle assessment methodology: we quantify energy and water consumption, greenhouse gas emissions, and mineral resource depletion associated with the production and use of datacenter hardware. Our detailed analysis allows us to provide actionable guidelines to reduce compute usage and environmental impacts of MLLM research, paving the way for more sustainable AI research. Comments: 28 pages, 12 figures, 8 tables Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2604.11154 [cs.AI] (or arXiv:2604.11154v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.11154 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-63] From Answers to Arguments: Toward Trustworthy Clinical Diagnostic Reasoning with Toulmin-Guided Curriculum Goal-Conditioned Learning ACL2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在临床决策支持中因推理过程不透明且不可靠而导致的信任危机问题。当前LLMs常出现“通过错误推理得出正确答案”的现象,这不仅削弱了其诊断逻辑的严谨性,还可能导致在真实临床场景下产生更广泛的幻觉和不可预测的失败,危及患者安全。为应对这一挑战,论文提出了一种基于Toulmin论证模型的可信临床推理框架,并设计了新颖的分阶段训练流程——课程目标条件学习(Curriculum Goal-Conditioned Learning, CGCL),其关键在于通过三阶段渐进式课程结构,系统性地引导LLM生成符合临床逻辑的诊断论证:首先提取事实并生成鉴别诊断,其次论证核心假设并反驳备选方案,最后整合分析形成带限定条件的结论。该方法在T-Eval量化评估中展现出与资源密集型强化学习相当的诊断准确率与推理质量,同时具备更高的训练稳定性与效率。

链接: https://arxiv.org/abs/2604.11137
作者: Chen Zhan,Xiaoyu Tan,Gengchen Ma,Yu-Jie Xiong,Xiaoyan Jiang,Xihe Qiu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ACL 2026 (Main Conference)

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) into clinical decision support is critically obstructed by their opaque and often unreliable reasoning. In the high-stakes domain of healthcare, correct answers alone are insufficient; clinical practice demands full transparency to ensure patient safety and enable professional accountability. A pervasive and dangerous weakness of current LLMs is their tendency to produce “correct answers through flawed reasoning.” This issue is far more than a minor academic flaw; such process errors signal a fundamental lack of robust understanding, making the model prone to broader hallucinations and unpredictable failures when faced with real-world clinical complexity. In this paper, we establish a framework for trustworthy clinical argumentation by adapting the Toulmin model to the diagnostic process. We propose a novel training pipeline: Curriculum Goal-Conditioned Learning (CGCL), designed to progressively train LLM to generate diagnostic arguments that explicitly follow this Toulmin structure. CGCL’s progressive three-stage curriculum systematically builds a solid clinical argument: (1) extracting facts and generating differential diagnoses; (2) justifying a core hypothesis while rebutting alternatives; and (3) synthesizing the analysis into a final, qualified conclusion. We validate CGCL using T-Eval, a quantitative framework measuring the integrity of the diagnosis reasoning. Experiments show that our method achieves diagnostic accuracy and reasoning quality comparable to resource-intensive Reinforcement Learning (RL) methods, while offering a more stable and efficient training pipeline.

[AI-64] A Proposed Biomedical Data Policy Framework to Reduce Frag mentation Improve Quality and Incentivize Sharing in Indian Healthcare in the era of Artificial Intelligence and Digital Health

【速读】:该论文旨在解决印度生物医学数据资源分散、难以共享与整合的问题,其核心瓶颈在于经济与学术激励机制的缺失,导致数据共享对研究者和机构而言风险高、回报低。解决方案的关键在于构建一个多层激励架构:一是将数据论文纳入国家医学委员会(NMC)职称晋升评价体系,二是将开放数据指标纳入国家机构排名框架(NIRF),三是采用基于Shapley值的收益分配机制以促进联邦学习联盟中的公平协作,四是设立机构级数据管理岗位作为主流职业角色。同时,通过强制数据质量评估、结构化同行评审及对审计人员给予学术认可,系统性缓解数据共享过程中的质量担忧、误读风险和选择性报告偏倚等关键障碍,从而在合规前提下推动高质量、可互操作的数据集建设,支撑生成式AI(Generative AI)等前沿技术发展。

链接: https://arxiv.org/abs/2604.11125
作者: Nikhil Mehta,Sachin Gupta,Gouri RP Anand
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:India generates vast biomedical data through postgraduate research, government hospital services and audits, government schemes, private hospitals and their electronic medical record (EMR) systems, insurance programs and standalone clinics. Unfortunately, these resources remain fragmented across institutional silos and vendor-locked EMR systems. The fundamental bottleneck is not technological but economic and academic. There is a systemic misalignment of incentives that renders data sharing a high-risk, low-reward activity for individual researchers and institutions. Until India’s academic promotion criteria, institutional rankings, and funding mechanisms explicitly recognize and reward data curation as professional work, the nation’s AI ambitions will remain constrained by fragmented, non-interoperable datasets. We propose a multi-layered incentive architecture integrating recognition of data papers in National Medical Commission (NMC) promotion criteria, incorporation of open data metrics into the National Institutional Ranking Framework (NIRF), adoption of Shapley Value-based revenue sharing in federated learning consortia, and establishment of institutional data stewardship as a mainstream professional role. Critical barriers to data sharing, including fear of data quality scrutiny, concerns about misinterpretation, and selective reporting bias, are addressed through mandatory data quality assessment, structured peer review, and academic credit for auditing roles. The proposed framework directly addresses regulatory constraints introduced by the Digital Personal Data Protection Act 2023 (DPDPA), while constructively engaging with the National Data Sharing and Accessibility Policy (NDSAP), Biotech-PRIDE Guidelines, and the Anusandhan National Research Foundation (ANRF) guidelines.

[AI-65] Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLM s

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)安全性评估中存在的重要盲区问题:现有研究主要依赖提示工程(prompt-based)方式诱导人格特征,但忽略了激活空间控制(activation steering, AS)这一架构敏感的攻击路径,导致对模型真实脆弱性的刻画不完整。解决方案的关键在于揭示两种人格注入机制——系统提示与激活空间操控——所引发的差异化漏洞分布,并提出通过对比二者在不同模型架构中的表现来识别主导失败模式。研究发现,提示引导下的危险人格排序在多个架构中保持一致(ρ = 0.71–0.96),而激活空间驱动的脆弱性则显著分化且无法由提示侧排名预测;尤其在Llama-3.1-8B上观察到“亲社会人格悖论”现象,即高尽责性和高宜人性组合的人格在提示下最安全,但在激活空间操纵下反而成为最高风险人格(ASR ~0.818),这表明仅依赖提示测试会严重低估某些模型的真实安全隐患。

链接: https://arxiv.org/abs/2604.11120
作者: Wenkai Li,Fan Yang,Shaunak A. Mehta,Koichi Onoue
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Personality imbuing customizes LLM behavior, but safety evaluations almost always study prompt-based personas alone. We show this is incomplete: prompting and activation steering expose different, architecture-dependent vulnerability profiles, and testing with only one method can miss a model’s dominant failure mode. Across 5,568 judged conditions on four standard models from three architecture families, persona danger rankings under system prompting are preserved across all architectures ( \rho = 0.71 – 0.96 ), but activation-steering vulnerability diverges sharply and cannot be predicted from prompt-side rankings: Llama-3.1-8B is substantially more AS-vulnerable, whereas Gemma-3-27B and Qwen3.5 are more vulnerable to prompting. The most striking illustration of this divergence is the prosocial persona paradox: on Llama-3.1-8B, P12 (high conscientiousness + high agreeableness) is among the safest personas under prompting yet becomes the highest-ASR activation-steered persona (ASR ~0.818). This is an inversion robust to coefficient ablation and matched-strength calibration, and replicated on DeepSeek-R1-Distill-Qwen-32B. A trait refusal alignment framework, in which conscientiousness is strongly anti-aligned with refusal on Llama-3.1-8B, offers a partial geometric account. Reasoning provides only partial protection: two 32B reasoning models reach 15–18% prompt-side ASR, and activation steering separates them sharply in both baseline susceptibility and persona-specific vulnerability. Heuristic trace diagnostics suggest that the safer model retains stronger policy recall and self-correction behavior, not merely longer reasoning.

[AI-66] Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search

【速读】:该论文旨在解决高性能计算与人工智能工作负载日益依赖GPU时,如何在快速演进的硬件架构上持续保持高效率的问题。当前开发者需耗费数月时间手动调优科学计算应用,涉及算法设计、源码实现、编译器选项及内核启动参数等多个维度的复杂优化空间。传统方法仅能孤立地搜索部分优化空间(如运行时配置或编译器设置),难以实现端到端自动化优化。其解决方案的关键在于提出Record-Remix-Replay (R³) 框架,通过融合大语言模型(LLM)驱动的进化搜索、贝叶斯优化与记录-重放编译技术,高效探索从源码级实现选择到编译器Pass顺序和运行时配置的全栈GPU内核优化空间,显著提升优化速度与效果,相比传统方法更全面且接近一个数量级更快于现代进化搜索策略。

链接: https://arxiv.org/abs/2604.11109
作者: Daniel Nichols,Konstantinos Parasyris,Caetano Melone,Tal Ben-Nun,Giorgis Georgakoudis,Harshitha Menon
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注:

点击查看摘要

Abstract:As high-performance computing and AI workloads become increasingly dependent on GPUs, maintaining high performance across rapidly evolving hardware generations has become a major challenge. Developers often spend months tuning scientific applications to fully exploit new architectures, navigating a complex optimization space that spans algorithm design, source implementation, compiler flags and pass sequences, and kernel launch parameters. Existing approaches can effectively search parts of this space in isolation, such as launch configurations or compiler settings, but optimizing across the full space still requires substantial human expertise and iterative manual effort. In this paper, we present Record-Remix-Replay (R^3), a hierarchical optimization framework that combines LLM-driven evolutionary search, Bayesian optimization, and record-replay compilation techniques to efficiently explore GPU kernel optimizations from source-level implementation choices down to compiler pass ordering and runtime configuration. By making candidate evaluation fast and scalable, our approach enables practical end-to-end search over optimization dimensions that are typically treated separately. We show that Record-Remix-Replay can optimize full scientific applications better than traditional approaches over kernel parameters and compiler flags, while also being nearly an order of magnitude faster than modern evolutionary search approaches. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF) Cite as: arXiv:2604.11109 [cs.DC] (or arXiv:2604.11109v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2604.11109 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-67] ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing

【速读】:该论文旨在解决当前角色扮演(Role-playing)研究局限于文本模态、忽视语音模态的问题,从而限制了真实场景下人机交互中角色扮演的自然性和沉浸感。其解决方案的关键在于提出一个全新的语音角色扮演基准——ActorMindBench,以及一个模拟人类演员行为的多智能体推理框架ActorMind。该框架通过Eye Agent读取角色描述、Ear Agent感知对话情绪线索、Brain Agent生成情感状态,并由Mouth Agent输出带有相应情绪特征的语音响应,实现了基于语音的角色扮演能力的系统性建模与提升。

链接: https://arxiv.org/abs/2604.11103
作者: Xi Chen,Wei Xue,Yike Guo
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Role-playing has garnered rising attention as it provides a strong foundation for human-machine interaction and facilitates sociological research. However, current work is confined to textual modalities, neglecting speech, which plays a predominant role in daily life, thus limiting genuine role-playing. To bridge this gap, we conceptualize and benchmark speech role-playing through ActorMindBench, and we present a corresponding reasoning framework, called ActorMind. Specifically, (1) Speech Role-Playing enables models to deliver spontaneous responses with personalized verbal traits based on their role, the scene, and spoken dialogue. (2) ActorMindBench is a hierarchical benchmark comprises Utterance-Level content with 7,653 utterances, Scene-Level content with 313 scenes, and Role-Level content with 6 roles. (3) ActorMind is an off-the-shelf, multi-agent, chain-of-though style reasoning framework that emulates how human actors perform in theaters. Concretely, ActorMind first reads its assigned role description via Eye Agent, then comprehends emotional cues within contextual spoken dialogues through Ear Agent. Subsequently, Brain Agent generates a descriptive emotional state, and finally, Mouth Agent delivers the scripts infused with corresponding emotion state. Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing.

[AI-68] Bottleneck Tokens for Unified Multimodal Retrieval

【速读】:该论文旨在解决decoder-only多模态大语言模型(Multimodal Large Language Models, MLLMs)在统一多模态检索任务中面临的两个结构性问题:一是现有方法依赖隐式池化机制,将标准词汇表标记(如EOS)作为序列级表示,而该机制并非为信息聚合设计;二是对比微调虽指定了嵌入应匹配的目标,但未提供token级别的信息压缩指导。解决方案的关键在于引入两个互补组件:其一,提出Bottleneck Tokens (BToks),即一组可学习的固定容量token,构成显式池化机制;其二,设计生成式信息压缩(Generative Information Condensation)策略,通过next-token预测目标与压缩掩码(Condensation Mask)切断目标token到查询token的直接注意力路径,迫使所有预测信号经由BToks传递,从而将生成损失转化为密集的token级监督信号以实现语义压缩。该方法在MMEB-V2基准上达到2B规模模型中的最先进性能(Overall Score: 59.0),尤其在语义挑战性强的任务(如Video-QA)上提升显著。

链接: https://arxiv.org/abs/2604.11095
作者: Siyu Sun,Jing Ren,Zhaohe Liao,Dongxiao Mao,Xiangyuan Ren,Yiyi Zhang,Haohua Zhao,Weixiong Lin,Jiang Shaohua,Liqing Zhang,Yuchao Zheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adapting decoder-only multimodal large language models (MLLMs) for unified multimodal retrieval faces two structural gaps. First, existing methods rely on implicit pooling, which overloads the hidden state of a standard vocabulary token (e.g., EOS) as the sequence-level representation, a mechanism never designed for information aggregation. Second, contrastive fine-tuning specifies what the embedding should match but provides no token-level guidance on how information should be compressed into it. We address both gaps with two complementary components. Architecturally, we introduce Bottleneck Tokens (BToks), a small set of learnable tokens that serve as a fixed-capacity explicit pooling mechanism. For training, we propose Generative Information Condensation: a next-token prediction objective coupled with a Condensation Mask that severs the direct attention path from target tokens to query tokens. All predictive signals are thereby forced through the BToks, converting the generative loss into dense, token-level supervision for semantic compression. At inference time, only the input and BToks are processed in a single forward pass with negligible overhead over conventional last-token pooling. On MMEB-V2 (78 datasets, 3 modalities, 9 meta-tasks), our approach achieves state-of-the-art among 2B-scale methods under comparable data conditions, attaining an Overall score of 59.0 (+3.6 over VLM2Vec-V2) with substantial gains on semantically demanding tasks (e.g., +12.6 on Video-QA).

[AI-69] E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning

【速读】:该论文旨在解决当前微服务系统因规模和复杂性增加而导致的频繁且高成本故障问题,尤其是现有基于大语言模型(Large Language Model, LLM)的自动修复方法在缺乏运行时知识引导、依赖专家设计提示词以及使用通用大模型时所表现出的准确性与效率不足的问题。其解决方案的关键在于提出一种端到端的微服务修复任务(End-to-End Microservice Remediation, E2E-MR),通过构建自动化评估基准MicroRemed来模拟微服务部署、故障注入、剧本执行与修复验证全流程,并设计E2E-REME模型,该模型采用经验-仿真强化微调(experience-simulation reinforcement fine-tuning)进行训练,从而实现从诊断报告直接生成可执行Ansible剧本的能力,显著提升了自动修复的准确性和效率。

链接: https://arxiv.org/abs/2604.11094
作者: Lingzhe Zhang,Yunpeng Zhai,Tong Jia,Minghua He,Chiming Duan,Zhaoyang Liu,Bolin Ding,Ying Li
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: accepted by FSE’26

点击查看摘要

Abstract:Contemporary microservice systems continue to grow in scale and complexity, leading to increasingly frequent and costly failures. While recent LLM-based auto-remediation approaches have emerged, they primarily translate textual instructions into executable Ansible playbooks and rely on expert-crafted prompts, lacking runtime knowledge guidance and depending on large-scale general-purpose LLMs, which limits their accuracy and efficiency. We introduce \textitEnd-to-End Microservice Remediation (E2E-MR), a new task that requires directly generating executable playbooks from diagnosis reports to autonomously restore faulty systems. To enable rigorous evaluation, we build \textitMicroRemed, a benchmark that automates microservice deployment, failure injection, playbook execution, and post-repair verification. We further propose \textitE2E-REME, an end-to-end auto-remediation model trained via experience-simulation reinforcement fine-tuning. Experiments on public and industrial microservice platforms, compared with nine representative LLMs, show that E2E-REME achieves superior accuracy and efficiency.

[AI-70] Hodoscope: Unsupervised Monitoring for AI Misbehaviors

【速读】:该论文旨在解决现有AI代理(AI agent)监控方法依赖监督式评估的问题,即通过人工编写的规则或大语言模型(LLM)裁判来检测已知的失败模式,但难以发现未知的异常行为,且LLM裁判本身可能存在不可靠性。为应对这一挑战,作者提出**无监督监控(unsupervised monitoring)**的新范式,其核心思想是:不预设具体问题行为类别,而是借助群体间行为分布差异作为主要信号,辅助人类主动识别潜在异常行为。解决方案的关键在于引入Hodoscope工具,该工具通过比较不同模型组的行为分布,自动识别出具有显著差异的可疑动作模式供人工审查,从而实现对新型、未知问题行为的有效探测,并在实践中发现了Commit0基准中的未公开漏洞,同时显著降低人工审查的工作量(提升6–23倍效率),并为进一步优化监督式检测提供了可解释的行为描述。

链接: https://arxiv.org/abs/2604.11072
作者: Ziqian Zhong,Shashwat Saxena,Aditi Raghunathan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing approaches to monitoring AI agents rely on supervised evaluation: human-written rules or LLM-based judges that check for known failure modes. However, novel misbehaviors may fall outside predefined categories entirely and LLM-based judges can be unreliable. To address this, we formulate unsupervised monitoring, drawing an analogy to unsupervised learning. Rather than checking for specific misbehaviors, an unsupervised monitor assists humans in discovering problematic agent behaviors without prior assumptions about what counts as problematic, leaving that determination to the human. We observe that problematic behaviors are often distinctive: a model exploiting a benchmark loophole exhibits actions absent from well-behaved baselines, and a vulnerability unique to one evaluation manifests as behavioral anomalies when the same model runs across multiple benchmarks. This motivates using group-wise behavioral differences as the primary signal for unsupervised monitoring. We introduce Hodoscope, a tool that operationalizes this insight. Hodoscope compares behavior distributions across groups and highlights distinctive and potentially suspicious action patterns for human review. Using Hodoscope, we discover a previously unknown vulnerability in the Commit0 benchmark (unsquashed git history allowing ground-truth recovery, inflating scores for at least five models) and independently recover known exploits on ImpossibleBench and SWE-bench. Quantitative evaluation estimates that our method reduces review effort by 6-23 \times compared to naive uniform sampling. Finally, we show that behavior descriptions discovered through Hodoscope could improve the detection accuracy of LLM-based judges, demonstrating a path from unsupervised to supervised monitoring. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2604.11072 [cs.AI] (or arXiv:2604.11072v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.11072 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-71] PRISM Risk Signal Framework: Hierarchy-Based Red Lines for AI Behavioral Risk

【速读】:该论文旨在解决当前人工智能安全评估中依赖具体案例(如特定提示、输出或危害)设置“红线”的局限性问题,这种做法具有反应性、碎片化和主观性强等缺陷。其解决方案的关键在于提出一种基于价值、证据与来源层级结构的系统性风险识别框架——PRISM(Profile-based Reasoning Integrity Stack Measurement),通过定义27种由推理优先级异常引发的行为风险信号,结合绝对排名与相对胜率差距的双阈值原则,实现对AI模型推理结构中潜在危险模式的前瞻性、可量化检测。该方法不仅能够提前识别可能产生有害输出的推理机制,还能以单一信号覆盖无限多的具体违规案例,从而提升AI安全评估的全面性和客观性。

链接: https://arxiv.org/abs/2604.11070
作者: Seulki Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 13 tables, 1 appendix

点击查看摘要

Abstract:Current approaches to AI safety define red lines at the case level: specific prompts, specific outputs, specific harms. This paper argues that red lines can be set more fundamentally – at the level of value, evidence, and source hierarchies that govern AI reasoning. Using the PRISM (Profile-based Reasoning Integrity Stack Measurement) framework, we define a taxonomy of 27 behavioral risk signals derived from structural anomalies in how AI systems prioritize values (L4), weight evidence types (L3), and trust information sources (L2). Each signal is evaluated through a dual-threshold principle combining absolute rank position and relative win-rate gap, producing a two-tier classification (Confirmed Risk vs. Watch Signal). The hierarchy-based approach offers three advantages over case-specific red lines: it is anticipatory rather than reactive (detecting dangerous reasoning structures before they produce harmful outputs), comprehensive rather than enumerative (a single value-hierarchy signal subsumes an unlimited number of case-specific violations), and measurable rather than subjective (grounded in empirical forced-choice data). We demonstrate the framework’s detection capacity using approximately 397,000 forced-choice responses from 7 AI models across three Authority Stack layers, showing that the signal taxonomy successfully discriminates between models with structurally extreme profiles, models with context-dependent risk, and models with balanced hierarchies.

[AI-72] AI Integrity: A New Paradigm for Verifiable AI Governance

【速读】:该论文旨在解决当前人工智能治理范式(AI Ethics、AI Safety 和 AI Alignment)普遍存在的局限性——即仅评估输出结果而忽视对推理过程本身的验证。为此,作者提出“AI Integrity”(AI完整性)这一新概念,其核心在于确保AI系统的权威层级结构(Authority Stack)不被腐败、污染、操纵或偏见侵蚀,并以可验证的方式维持稳定。解决方案的关键是构建一个四层级的权威堆栈模型(Normative, Epistemic, Source, and Data Authority),并引入PRISM(Profile-based Reasoning Integrity Stack Measurement)框架作为操作化方法,通过六项核心指标量化评估推理路径的完整性,从而实现对AI决策逻辑透明性和可审计性的系统性保障。

链接: https://arxiv.org/abs/2604.11065
作者: Seulki Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 8 tables

点击查看摘要

Abstract:AI systems increasingly shape high-stakes decisions in healthcare, law, defense, and education, yet existing governance paradigms – AI Ethics, AI Safety, and AI Alignment – share a common limitation: they evaluate outcomes rather than verifying the reasoning process itself. This paper introduces AI Integrity, a concept defined as a state in which the Authority Stack of an AI system – its layered hierarchy of values, epistemological standards, source preferences, and data selection criteria – is protected from corruption, contamination, manipulation, and bias, and maintained in a verifiable manner. We distinguish AI Integrity from the three existing paradigms, define the Authority Stack as a 4-layer cascade model (Normative, Epistemic, Source, and Data Authority) grounded in established academic frameworks – Schwartz Basic Human Values for normative authority, Walton argumentation schemes with GRADE/CEBM hierarchies for epistemic authority, and Source Credibility Theory for source authority – characterize the distinction between legitimate cascading and Authority Pollution, and identify Integrity Hallucination as the central measurable threat to value consistency. We further specify the PRISM (Profile-based Reasoning Integrity Stack Measurement) framework as the operational methodology, defining six core metrics and a phased research roadmap. Unlike normative frameworks that prescribe which values are correct, AI Integrity is a procedural concept: it requires that the path from evidence to conclusion be transparent and auditable, regardless of which values a system holds.

[AI-73] Pando: Do Interpretability Methods Work When Models Wont Explain Themselves?

【速读】:该论文旨在解决生成式 AI (Generative AI) 模型中机制可解释性评估的“诱导混淆”(elicitation confounder)问题,即现有方法未能控制仅通过黑盒提示(black-box prompting)是否能恢复目标行为,导致白盒可解释工具(white-box interpretability tools)的性能提升可能源于行为诱导而非模型内部信号。解决方案的关键在于提出 Pando 基准测试框架,该框架引入一个“解释轴”(explanation axis),训练模型输出三类响应:忠实于真实决策规则的解释、无解释或自信但不忠实于干扰规则的解释;通过在 720 个微调模型上进行评估,发现当解释忠实有效时,黑盒提示即可达到甚至超越白盒方法;而在解释缺失或误导时,基于梯度的归因方法(如梯度显著性)提升准确率 3–5 个百分点,而相关性补丁(RelP)效果最佳,其余方法如对数透镜、稀疏自编码器和电路追踪则无稳定收益,从而明确区分了不同可解释技术的有效性来源。

链接: https://arxiv.org/abs/2604.11061
作者: Ziqian Zhong,Aashiq Muhamed,Mona T. Diab,Virginia Smith,Aditi Raghunathan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mechanistic interpretability is often motivated for alignment auditing, where a model’s verbal explanations can be absent, incomplete, or misleading. Yet many evaluations do not control whether black-box prompting alone can recover the target behavior, so apparent gains from white-box tools may reflect elicitation rather than internal signal; we call this the elicitation confounder. We introduce Pando, a model-organism benchmark that breaks this confound via an explanation axis: models are trained to produce either faithful explanations of the true rule, no explanation, or confident but unfaithful explanations of a disjoint distractor rule. Across 720 finetuned models implementing hidden decision-tree rules, agents predict held-out model decisions from 10 labeled query-response pairs, optionally augmented with one interpretability tool output. When explanations are faithful, black-box elicitation matches or exceeds all white-box methods; when explanations are absent or misleading, gradient-based attribution improves accuracy by 3-5 percentage points, and relevance patching, RelP, gives the largest gains, while logit lens, sparse autoencoders, and circuit tracing provide no reliable benefit. Variance decomposition suggests gradients track decision computation, which fields causally drive the output, whereas other readouts are dominated by task representation, biases toward field identity and value. We release all models, code, and evaluation infrastructure. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.11061 [cs.LG] (or arXiv:2604.11061v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.11061 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-74] Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis

【速读】:该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)的信用分配问题,即稀疏的结果导向奖励难以有效分配到语言模型生成过程中的具体token上,从而限制了大语言模型(Large Language Models, LLMs)推理能力的提升。解决方案的关键在于提出一种基于熵感知的策略优化方法(Entropy-Aware Policy Optimization, EAPO),其核心思想是通过四象限分解(Four Quadrant Decomposition)识别高熵token在推理改进中的主导作用,并理论证明token所能承载的信用上限由其条件熵决定。EAPO据此对不同熵水平的token动态调节学习信号强度,避免均匀奖励广播导致高熵位置信号稀释和确定性token过度赋权的问题,从而显著提升LLMs的推理性能。

链接: https://arxiv.org/abs/2604.11056
作者: Yuhang He,Haodong Wu,Siyi Liu,Hongyu Ge,Hange Zhou,Keyi Wu,Zhuo Zheng,Qihong Lin,Zixin Zhong,Yongqi Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning ability of Large Language Models (LLMs). However, its sparse outcome-based rewards pose a fundamental credit assignment problem. We analyze this problem through the joint lens of reward polarity and token entropy. Our diagnostic tool, the Four Quadrant Decomposition, isolates token updates by polarity and entropy, and controlled ablations show that reasoning improvements concentrate in the high-entropy quadrants. To justify this observation theoretically, we adapt Conditional Mutual Information to the autoregressive RLVR setting and prove that the credit a token can carry is upper-bounded by its entropy. This view yields testable predictions that reasoning gains arise primarily from high-entropy tokens, with unique roles for positive and negative updates. A gradient analysis of GRPO further reveals how uniform reward broadcast dilutes signal at high-entropy positions while over-crediting deterministic tokens. Grounded in these insights, we propose Entropy-Aware Policy Optimization (EAPO) that modulates token-level learning signals accordingly. Extensive experiments demonstrate that EAPO outperforms strong baselines across two model families.

[AI-75] EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

【速读】:该论文旨在解决统一多模态嵌入空间中因监督数据稀疏而导致的未配对模态对(如音频↔深度、红外↔音频)在零样本迁移任务中性能不佳的问题。现有方法在将新模态对齐到合成代理嵌入时,易引入梯度干扰,破坏原有锚点对齐结构,从而影响检索与分类性能。解决方案的关键在于提出EmergentBridge框架:首先学习一个映射以生成带有噪声的桥接锚点(即已对齐模态的代理嵌入),并仅在与锚点对齐方向正交的子空间内强制代理对齐,从而在保持锚点对齐结构的同时增强非锚点模态间的连接性,实现无需全量成对监督的跨模态对齐能力提升。

链接: https://arxiv.org/abs/2604.11043
作者: Jincheng Xie,Xingchen Xiao,Runheng Liu,Zhongyi Huang,Yu Zheng,Heyan Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Unified multimodal embedding spaces underpin practical applications such as cross-modal retrieval and zero-shot recognition. In many real deployments, however, supervision is available only for a small subset of modality pairs (e.g., image–text), leaving \emphunpaired modality pairs (e.g., audio \leftrightarrow depth, infrared \leftrightarrow audio) weakly connected and thus performing poorly on zero-shot transfer. Addressing this sparse-pairing regime is therefore essential for scaling unified embedding systems to new tasks without curating exhaustive pairwise data. We propose \textbfEmergentBridge, an embedding-level bridging framework that improves performance on these unpaired pairs \emphwithout requiring exhaustive pairwise supervision. Our key observation is that naively aligning a new modality to a synthesized proxy embedding can introduce \emphgradient interference, degrading the anchor-alignment structure that existing retrieval/classification relies on. EmergentBridge addresses this by (i) learning a mapping that produces a \emphnoisy bridge anchor (a proxy embedding of an already-aligned modality) from an anchor embedding, and (ii) enforcing proxy alignment only in the subspace orthogonal to the anchor-alignment direction, preserving anchor alignment while strengthening non-anchor connectivity. Across nine datasets spanning multiple modalities, EmergentBridge consistently outperforms prior binding baselines on zero-shot classification and retrieval, demonstrating strong emergent alignment.

[AI-76] From Topology to Trajectory: LLM -Driven World Models For Supply Chain Resilience

【速读】:该论文旨在解决半导体供应链在地缘政治动荡等“政策黑天鹅”事件下所面临的韧性挑战,尤其是传统大语言模型(Large Language Model, LLM)规划器因缺乏物理环境建模而产生的决策瘫痪(Decision Paralysis)和严重 grounding gap 问题。解决方案的关键在于提出 ReflectiChain 框架,其核心创新是融合了由生成式世界模型驱动的潜在轨迹回放(Latent Trajectory Rehearsal),实现了“行动中反思”(System 2 deliberation)与“事后反思”(delayed reflection-on-action)的协同机制,并引入回顾性智能体强化学习(Retrospective Agentic RL)以支持部署阶段(test-time)的自主政策演化。这一设计显著提升了长期战略规划中语义推理与物理现实之间的对齐能力,实证表明在极端场景下可使操作比率(Operability Ratio, OR)从13.3%提升至88.5%,并实现平均步奖励提升250%。

链接: https://arxiv.org/abs/2604.11041
作者: Jia Luo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Semiconductor supply chains face unprecedented resilience challenges amidst global geopolitical turbulence. Conventional Large Language Model (LLM) planners, when confronting such non-stationary “Policy Black Swan” events, frequently suffer from Decision Paralysis or a severe Grounding Gap due to the absence of physical environmental modeling. This paper introduces ReflectiChain, a cognitive agentic framework tailored for resilient macroeconomic supply chain planning. The core innovation lies in the integration of Latent Trajectory Rehearsal powered by a generative world model, which couples reflection-in-action (System 2 deliberation) with delayed reflection-on-action. Furthermore, we leverage a Retrospective Agentic RL mechanism to enable autonomous policy evolution during the deployment phase (test-time). Evaluations conducted on our high-fidelity benchmark, Semi-Sim, demonstrate that under extreme scenarios such as export bans and material shortages, ReflectiChain achieves a 250% improvement in average step rewards over the strongest LLM baselines. It successfully restores the Operability Ratio (OR) from a deficient 13.3% to over 88.5% while ensuring robust gradient convergence. Ablation studies further underscore that the synergy between physical grounding constraints and double-loop learning is fundamental to bridging the gap between semantic reasoning and physical reality for long-horizon strategic planning.

[AI-77] Intelligent Approval of Access Control Flow in Office Automation Systems via Relational Modeling

【速读】:该论文旨在解决传统访问控制流审批(Access Control Flow Approval, ACFA)系统中因逐级人工审批导致的效率低下和智能化不足的问题。解决方案的关键在于提出一种基于关系建模的智能审批(Relational Modeling-driven Intelligent Approval, RMIA)框架,其核心创新在于构建了两个互补的关系建模模块:一是二元关系建模模块,从粗粒度层面刻画申请人与审批人之间的耦合关系,为决策提供基础信息;二是三元关系建模模块,以特定资源信息为核心,刻画申请人、资源与审批人之间的复杂交互关系,从而提供细粒度的决策依据。RMIA通过有效融合这两种关系信息,实现更精准、高效的自动化审批决策。

链接: https://arxiv.org/abs/2604.11040
作者: Dugang Liu,Zulong Chen,Chuanfei Xu,Jiaxuan He,Yunlu Ma,Jia Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Office automation (OA) systems play a crucial role in enterprise operations and management, with access control flow approval (ACFA) being a key component that manages the accessibility of various resources. However, traditional ACFA requires approval from the person in charge at each step, which consumes a significant amount of manpower and time. Its intelligence is a crucial issue that needs to be addressed urgently by all companies. In this paper, we propose a novel relational modeling-driven intelligent approval (RMIA) framework to automate ACFA. Specifically, our RMIA consists of two core modules: (1) The binary relation modeling module aims to characterize the coupling relation between applicants and approvers and provide reliable basic information for ACFA decision-making from a coarse-grained perspective. (2) The ternary relation modeling module utilizes specific resource information as its core, characterizing the complex relations between applicants, resources, and approvers, and thus provides fine-grained gain information for informed decision-making. Then, our RMIA effectively fuses these two kinds of information to form the final decision. Finally, extensive experiments are conducted on two product datasets and an online A/B test to verify the effectiveness of RMIA.

[AI-78] RTMC: Step-Level Credit Assignment via Rollout Trees

【速读】:该论文旨在解决多步智能体强化学习中细粒度信用分配(credit assignment)的问题,现有方法存在两大局限:一是无评价值函数的方法(如GRPO)对轨迹中所有动作赋予统一优势值,缺乏精细调控;二是基于学习的价值网络虽能提供更精确的优势估计,但引入额外计算开销且在稀疏奖励场景下表现脆弱。其解决方案的关键在于提出Rollout-Tree Monte Carlo(RTMC)优势估计方法,通过识别共享相同中间状态的多条rollout路径,构建隐含的状态-动作树结构,并利用状态-动作签名系统压缩交互历史以实现跨rollout的状态匹配,从而无需任何学习型评价值函数即可高效聚合回报统计量,得到每步的Q值和优势值。此方法在SWE-bench Verified基准上相较GRPO提升了3.2个百分点的pass@1指标。

链接: https://arxiv.org/abs/2604.11037
作者: Tao Wang,Suhang Zheng,Xiaoxiao Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-step agentic reinforcement learning benefits from fine-grained credit assignment, yet existing approaches offer limited options: critic-free methods like GRPO assign a uniform advantage to every action in a trajectory, while learned value networks introduce notable overhead and can be fragile under sparse rewards. We observe that group rollouts targeting the same problem often traverse overlapping intermediate states, implicitly forming a tree whose branches diverge at successive decision points. Building on this insight, we introduce Rollout-Tree Monte Carlo (RTMC) advantage estimation, which aggregates return statistics across rollouts sharing a common state to produce per-step Q-values and advantages–without any learned critic. A state-action signature system compresses raw interaction histories into compact, comparable representations, making cross-rollout state matching tractable. On SWE-bench Verified, RTMC improves pass@1 by 3.2 percentage points over GRPO.

[AI-79] Introspective Diffusion Language Models

【速读】:该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)在生成质量上落后于自回归(Autoregressive, AR)模型的问题。研究表明,这一差距源于DLMs缺乏内省一致性(introspective consistency),即模型对其先前生成的token缺乏自我验证能力,而AR模型则通过因果掩码(causal masking)和logit偏移(logit shifting)隐式地保证了这种一致性。解决方案的关键在于提出内省扩散语言模型(Introspective Diffusion Language Model, I-DLM),其核心创新是引入一种新颖的内省步进解码算法(introspective strided decoding, ISD),使模型能在同一前向传播中同时验证已生成token并推进新token的生成,从而继承AR训练的内省一致性优势,同时保留扩散模型的并行解码特性。I-DLM在15个基准测试中达到与同规模AR模型相当的质量,并在实际服务效率上显著优于现有DLMs,展现出高并发场景下的优越性能。

链接: https://arxiv.org/abs/2604.11035
作者: Yifan Yu,Yuqing Jian,Junxiong Wang,Zhongzhu Zhou,Donglin Zhuang,Xinyu Fang,Sri Yanamandra,Xiaoxia Wu,Qingyang Wu,Shuaiwen Leon Song,Tri Dao,Ben Athiwaratkun,James Zou,Fan Lai,Chenfeng Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion language models promise parallel generation, yet still lag behind autoregressive (AR) models in quality. We stem this gap to a failure of introspective consistency: AR models agree with their own generations, while DLMs often do not. We define the introspective acceptance rate, which measures whether a model accepts its previously generated tokens. This reveals why AR training has a structural advantage: causal masking and logit shifting implicitly enforce introspective consistency. Motivated by this observation, we introduce Introspective Diffusion Language Model (I-DLM), a paradigm that retains diffusion-style parallel decoding while inheriting the introspective consistency of AR training. I-DLM uses a novel introspective strided decoding (ISD) algorithm, which enables the model to verify previously generated tokens while advancing new ones in the same forward pass. From a systems standpoint, we build I-DLM inference engine on AR-inherited optimizations and further customize it with a stationary-batch scheduler. To the best of our knowledge, I-DLM is the first DLM to match the quality of its same-scale AR counterpart while outperforming prior DLMs in both model quality and practical serving efficiency across 15 benchmarks. It reaches 69.6 on AIME-24 and 45.7 on LiveCodeBench-v6, exceeding LLaDA-2.1-mini (16B) by more than 26 and 15 points, respectively. Beyond quality, I-DLM is designed for the growing demand of large-concurrency serving, delivering about 3x higher throughput than prior state-of-the-art DLMs.

[AI-80] Federated Single-Agent Robotics: Multi-Robot Coordination Without Intra-Robot Multi-Agent Frag mentation

【速读】:该论文旨在解决大规模多机器人协同中的系统架构设计问题,即如何在不破坏单个机器人作为完整具身智能体(embodied agent)的前提下实现高效的舰队级协调。现有方法倾向于通过内部多智能体分解来增强协作能力,但这种方法可能导致控制复杂性上升、权威冲突和恢复机制失效。论文提出的关键解决方案是构建一种基于联邦单智能体机器人的运行时架构(Federated Single-Agent Robotics, FSAR),其核心在于:每个机器人保持单一、持续的运行时状态与本地策略边界,仅通过外部联邦机制(如共享能力注册表、跨机器人任务委派、权限感知授权、信任范围交互及分层恢复协议)实现舰队层面的协调。这种架构显著提升了治理局部性和恢复隔离性,并减少了权限冲突与策略违规,证明了“通过联邦而非内部分裂”是实现从单体具身智能体到具身舰队演进的更优路径。

链接: https://arxiv.org/abs/2604.11028
作者: Xue Qin,Simin Luan,John See,Cong Yang,Zhijun Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 30 pages, 10 figures, 9 tables. Code: this https URL

点击查看摘要

Abstract:As embodied robots move toward fleet-scale operation, multi-robot coordination is becoming a central systems challenge. Existing approaches often treat this as motivation for increasing internal multi-agent decomposition within each robot. We argue for a different principle: multi-robot coordination does not require intra-robot multi-agent fragmentation. Each robot should remain a single embodied agent with its own persistent runtime, local policy scope, capability state, and recovery authority, while coordination emerges through federation across robots at the fleet level. We present Federated Single-Agent Robotics (FSAR), a runtime architecture for multi-robot coordination built on single-agent robot runtimes. Each robot exposes a governed capability surface rather than an internally fragmented agent society. Fleet coordination is achieved through shared capability registries, cross-robot task delegation, policy-aware authority assignment, trust-scoped interaction, and layered recovery protocols. We formalize key coordination relations including authority delegation, inter-robot capability requests, local-versus-fleet recovery boundaries, and hierarchical human supervision, and describe a fleet runtime architecture supporting shared Embodied Capability Module (ECM) discovery, contract-aware cross-robot coordination, and fleet-level governance. We evaluate FSAR on representative multi-robot coordination scenarios against decomposition-heavy baselines. Results show statistically significant gains in governance locality (d=2.91, p.001 vs. centralized control) and recovery containment (d=4.88, p.001 vs. decomposition-heavy), while reducing authority conflicts and policy violations across all scenarios. Our results support the view that the path from embodied agents to embodied fleets is better served by federation across coherent robot runtimes than by fragmentation within them.

[AI-81] Optimal Stability of KL Divergence under Gaussian Perturbations

【速读】:该论文旨在解决Kullback-Leibler (KL) 散度在非高斯分布下对高斯扰动的稳定性问题,突破了现有松弛三角不等式仅适用于高斯分布的限制。其关键解决方案是在较弱的矩条件下,建立任意分布与高斯族之间KL散度的紧致稳定性界:若分布 $ P $ 的二阶矩有限,且 $ \text{KL}(P|\mathcal{N}_1) $ 较大而 $ \text{KL}(\mathcal{N}_1|\mathcal{N}_2) \leq \epsilon $,则有 $ \text{KL}(P|\mathcal{N}_2) \geq \text{KL}(P|\mathcal{N}_1) - O(\sqrt{\epsilon}) $,且该 $ \sqrt{\epsilon} $ 率在一般情形下是紧的。这一结果揭示了KL散度在高斯扰动下的内在稳定性,扩展了经典高斯限定的松弛三角不等式至更广泛的分布场景,并为基于KL散度的异常检测(如流模型中的分布外检测)提供了理论支撑。

链接: https://arxiv.org/abs/2604.11026
作者: Jialu Pan,Yufeng Zhang,Nan Hu,Keqin Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study the problem of characterizing the stability of Kullback-Leibler (KL) divergence under Gaussian perturbations beyond Gaussian families. Existing relaxed triangle inequalities for KL divergence critically rely on the assumption that all involved distributions are Gaussian, which limits their applicability in modern applications such as out-of-distribution (OOD) detection with flow-based generative models. In this paper, we remove this restriction by establishing a sharp stability bound between an arbitrary distribution and Gaussian families under mild moment conditions. Specifically, let P be a distribution with finite second moment, and let \mathcalN_1 and \mathcalN_2 be multivariate Gaussian distributions. We show that if KL(P||\mathcalN_1) is large and KL(\mathcalN_1||\mathcalN_2) is at most \epsilon , then KL(P||\mathcalN_2) \ge KL(P||\mathcalN_1) - O(\sqrt\epsilon) . Moreover, we prove that this \sqrt\epsilon rate is optimal in general, even within the Gaussian family. This result reveals an intrinsic stability property of KL divergence under Gaussian perturbations, extending classical Gaussian-only relaxed triangle inequalities to general distributions. The result is non-trivial due to the asymmetry of KL divergence and the absence of a triangle inequality in general probability spaces. As an application, we provide a rigorous foundation for KL-based OOD analysis in flow-based models, removing strong Gaussian assumptions used in prior work. More broadly, our result enables KL-based reasoning in non-Gaussian settings arising in deep learning and reinforcement learning.

[AI-82] NimbusGuard: A Novel Framework for Proactive Kubernetes Autoscaling Using Deep Q-Networks

【速读】:该论文旨在解决传统 Kubernetes 自动扩缩容机制的滞后性问题,即现有控制器(如 Horizontal Pod Autoscaler 和 KEDA)仅在检测到集群资源需求后才进行响应式调整,导致可能因过度配置而增加成本,或因配置不足而引发性能下降。解决方案的关键在于提出 NimbusGuard,一个基于深度强化学习(Deep Reinforcement Learning, DRL)的主动式自动扩缩容系统,其核心创新是引入长短期记忆(Long Short-Term Memory, LSTM)模型来预测未来工作负载模式,从而增强智能体的感知能力,实现对资源需求的前瞻性调度,在保证性能的同时显著提升成本效率。

链接: https://arxiv.org/abs/2604.11017
作者: Chamath Wanigasooriya,Indrajith Ekanayake
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cloud native architecture is about building and running scalable microservice applications to take full advantage of the cloud environments. Managed Kubernetes is the powerhouse orchestrating cloud native applications with elastic scaling. However, traditional Kubernetes autoscalers are reactive, meaning the scaling controllers adjust resources only after they detect demand within the cluster and do not incorporate any predictive measures. This can lead to either over-provisioning and increased costs or under-provisioning and performance degradation. We propose NimbusGuard, an open-source, Kubernetes-based autoscaling system that leverages a deep reinforcement learning agent to provide proactive autoscaling. The agents perception is augmented by a Long Short-Term Memory model that forecasts future workload patterns. The evaluations were conducted by comparing NimbusGuard against the built-in scaling controllers, such as Horizontal Pod Autoscaler, and the event-driven autoscaler KEDA. The experimental results demonstrate how NimbusGuard’s proactive framework translates into superior performance and cost efficiency compared to existing reactive methods.

[AI-83] Diffusion-CAM: Faithful Visual Explanations for dMLLM s ACL2026

【速读】:该论文旨在解决扩散型多模态大语言模型(diffusion Multimodal Large Language Models, dMLLMs)在可解释性方面的不足问题。由于dMLLMs采用并行去噪机制生成文本,其激活模式呈现平滑且分布式的特征,传统基于局部序列依赖关系的类激活映射(Class Activation Mapping, CAM)方法难以有效捕捉此类非自回归行为。为此,作者提出Diffusion-CAM,其核心在于通过可微分地探测Transformer骨干网络中的中间表示来提取原始激活图,并结合类特定梯度信息以同时捕获潜在特征与语义关联;进一步引入四个关键模块以消除原始信号中的随机性、空间模糊性以及图像内冗余token相关性,从而显著提升定位精度和视觉保真度,为理解扩散多模态系统的并行生成过程建立了新的可解释性基准。

链接: https://arxiv.org/abs/2604.11005
作者: Haomin Zuo,Yidi Li,Luoxiao Yang,Xiaofeng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026 main conference

点击查看摘要

Abstract:While diffusion Multimodal Large Language Models (dMLLMs) have recently achieved remarkable strides in multimodal generation, the development of interpretability mechanisms has lagged behind their architectural evolution. Unlike traditional autoregressive models that produce sequential activations, diffusion-based architectures generate tokens via parallel denoising, resulting in smooth, distributed activation patterns across the entire sequence. Consequently, existing Class Activation Mapping (CAM) methods, which are tailored for local, sequential dependencies, are ill-suited for interpreting these non-autoregressive behaviors. To bridge this gap, we propose Diffusion-CAM, the first interpretability method specifically tailored for dMLLMs. We derive raw activation maps by differentiably probing intermediate representations in the transformer backbone, accordingly capturing both latent features and their class-specific gradients. To address the inherent stochasticity of these raw signals, we incorporate four key modules to resolve spatial ambiguity and mitigate intra-image confounders and redundant token correlations. Extensive experiments demonstrate that Diffusion-CAM significantly outperforms SoTA methods in both localization accuracy and visual fidelity, establishing a new standard for understanding the parallel generation process of diffusion multimodal systems.

[AI-84] Sanity Checks for Agent ic Data Science

【速读】:该论文旨在解决生成式 AI (Generative AI) 在数据科学(Agentic Data Science, ADS)管道中可能产生虚假乐观结论的问题,即系统虽输出看似合理的分析结果,但其结论缺乏稳定性与可验证性,难以被用户识别。解决方案的关键在于提出基于预测性-可计算性-稳定性(Predictability-Computability-Stability, PCS)框架的轻量级合理性检查机制,通过引入合理的输入扰动来评估代理是否能稳定区分信号与噪声,从而作为 falsifiability(可证伪性)约束暴露不支持的肯定结论。该方法能够有效表征 ADS 输出的可信度,识别出因噪声或输入偶然特征引发的错误结论,并在真实数据集上验证了其对信号强度的敏感性和对模型自信度校准不足的诊断能力。

链接: https://arxiv.org/abs/2604.11003
作者: Zachary T. Rewolinski,Austin V. Zane,Hao Huang,Chandan Singh,Chenglong Wang,Jianfeng Gao,Bin Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Agentic data science (ADS) pipelines have grown rapidly in both capability and adoption, with systems such as OpenAI Codex now able to directly analyze datasets and produce answers to statistical questions. However, these systems can reach falsely optimistic conclusions that are difficult for users to detect. To address this, we propose a pair of lightweight sanity checks grounded in the Predictability-Computability-Stability (PCS) framework for veridical data science. These checks use reasonable perturbations to screen whether an agent can reliably distinguish signal from noise, acting as a falsifiability constraint that can expose affirmative conclusions as unsupported. Together, the two checks characterize the trustworthiness of an ADS output, e.g. whether it has found stable signal, is responding to noise, or is sensitive to incidental aspects of the input. We validate the approach on synthetic data with controlled signal-to-noise ratios, confirming that the sanity checks track ground-truth signal strength. We then demonstrate the checks on 11 real-world datasets using OpenAI Codex, characterizing the trustworthiness of each conclusion and finding that in 6 of the datasets an affirmative conclusion is not well-supported, even though a single ADS run may support one. We further analyze failure modes of ADS systems and find that ADS self-reported confidence is poorly calibrated to the empirical stability of its conclusions.

[AI-85] MAFIG: Multi-agent Driven Formal Instruction Generation Framework

【速读】:该论文旨在解决调度系统在突发事件下因局部功能失效而导致系统稳定性下降甚至崩溃的问题。现有方法如鲁棒调度或反应式调度依赖预定义规则或重调度策略,难以应对现实世界中多样且不可预测的紧急情况,适应性受限。解决方案的关键在于提出多智能体驱动的形式化指令生成框架(MAFIG),通过感知代理(Perception Agent)和应急决策代理(Emergency Decision Agent)将决策范围限定在受突发事件影响的局部功能模块,并快速生成形式化指令以修复调度逻辑;同时引入基于跨度聚焦的损失驱动局部蒸馏机制(SFL),将云端大语言模型(C-LLMs)的决策能力高效迁移至轻量本地模型,显著降低推理延迟并保持决策有效性,从而提升调度系统的鲁棒性和自适应能力。

链接: https://arxiv.org/abs/2604.10989
作者: Shixing Zhao,Zheng Si,Pengpeng Ouyang,Zhengqing Hu,Wanqi Zhu,Dong Chen,Yibo Guo,Mingliang Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Emergency situations in scheduling systems often trigger local functional failures that undermine system stability and even cause system collapse. Existing methods primarily rely on robust scheduling or reactive scheduling, handling emergencies through predefined rules or rescheduling strategies. However, the diversity and unpredictability of real-world emergencies make them difficult to anticipate, which limits the adaptability of these methods in complex scenarios. Recent studies have shown that Large Language Models (LLMs) possess strong potential for complex scheduling tasks because of their extensive prior knowledge and strong reasoning capabilities. Nevertheless, the high inference latency of LLMs and the lengthy contextual information of scheduling systems significantly hinder their application for emergency handling. To mitigate these issues, we propose the Multi-agent Driven Formal Instruction Generation Framework (MAFIG). The framework constrains the decision scope to local functional modules affected by emergency situations and repairs scheduling logic rapidly by generating formal instructions. MAFIG contains a Perception Agent and an Emergency Decision Agent, which mitigates the adverse impact of lengthy system contexts on emergency decision-making. We further introduce span-focused loss-driven local distillation mechanism (SFL) to transfer the decision-making capability of powerful Cloud Large Language Models (C-LLMs) to lightweight local models, reducing inference latency while preserving decision-making effectiveness. Experiments in the Port, Warehousing, and Deck scheduling datasets show success rates of 98.49%, 94.97%, and 97.50%, with average processing times of 0.33 s, 0.23 s, and 0.19 s. These results demonstrate that MAFIG effectively mitigates the impact of emergencies and improves the robustness and adaptability of scheduling systems.

[AI-86] Delving Aleatoric Uncertainty in Medical Image Segmentation via Vision Foundation Models

【速读】:该论文旨在解决医学图像分割中因数据采集噪声和标注模糊性导致的固有数据不确定性问题,此类不确定性严重削弱了模型的鲁棒性。现有研究多集中于模型结构优化与预测可靠性估计,对数据内在不确定性的系统性探索不足。解决方案的关键在于利用视觉基础模型的通用表征能力,通过分析解码特征的多样性并量化其奇异值能量,定义每类别的语义感知尺度(semantic perception scale),从而衡量样本难度与认知不确定性(aleatoric uncertainty)。在此基础上,设计了两种不确定性驱动的应用策略:一是基于认知不确定性的数据过滤机制以剔除潜在噪声样本;二是动态不确定性自适应优化策略,根据语义感知尺度调整类别特定损失权重,并结合标签去噪机制提升训练稳定性。

链接: https://arxiv.org/abs/2604.10963
作者: Ruiyang Li,Fang Liu,Licheng Jiao,Xinglin Xie,Jiayao Hao,Shuo Li,Xu Liu,Jingyi Yang,Lingling Li,Puhua Chen,Wenping Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical image segmentation supports clinical workflows by precisely delineating anatomical structures and lesions. However, medical image datasets medical image datasets suffer from acquisition noise and annotation ambiguity, causing pervasive data uncertainty that substantially undermines model robustness. Existing research focuses primarily on model architectural improvements and predictive reliability estimation, while systematic exploration of the intrinsic data uncertainty remains insufficient. To address this gap, this work proposes leveraging the universal representation capabilities of visual foundation models to estimate inherent data uncertainty. Specifically, we analyze the feature diversity of the model’s decoded representations and quantify their singular value energy to define the semantic perception scale for each class, thereby measuring sample difficulty and aleatoric uncertainty. Based on this foundation, we design two uncertainty-driven application strategies: (1) the aleatoric uncertainty-aware data filtering mechanism to eliminate potentially noisy samples and enhance model learning quality; (2) the dynamic uncertainty-aware optimization strategy that adaptively adjusts class-specific loss weights during training based on the semantic perception scale, combined with a label denoising mechanism to improve training stability. Experimental results on five public datasets encompassing CT and MRI modalities and involving multi-organ and tumor segmentation tasks demonstrate that our method achieves significant and robust performance improvements across various mainstream network architectures, revealing the broad application potential of aleatoric uncertainty in medical image understanding and segmentation tasks.

[AI-87] RAG -KT: Cross-platform Explainable Knowledge Tracing with Multi-view Fusion Retrieval Generation

【速读】:该论文旨在解决知识追踪(Knowledge Tracing, KT)模型在跨平台场景下的泛化能力不足问题,特别是现有方法因依赖特定平台标识符和潜在表示而难以迁移,且在分布偏移(distribution shift)环境下性能显著下降。解决方案的关键在于提出RAG-KT框架,通过检索增强机制(Retrieval-Augmented Generation, RAG)将跨平台KT建模为受约束的上下文推理任务:利用Question Group抽象实现多源结构化上下文对齐,并动态检索互补且可靠的上下文信息用于预测,从而提升模型的准确性、鲁棒性及可解释性。

链接: https://arxiv.org/abs/2604.10960
作者: Zhiyi Duan,Hongyu Yuan,Rui Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge Tracing (KT) infers a student’s knowledge state from past interactions to predict future performance. Conventional Deep Learning (DL)-based KT models are typically tied to platform-specific identifiers and latent representations, making them hard to transfer and interpret. Large Language Model (LLM)-based methods can be either ungrounded under prompting or overly domain-dependent under fine-tuning. In addition, most existing KT methods are developed and evaluated under a same-distribution assumption. In real deployments, educational data often arise from heterogeneous platforms with substantial distribution shift, which often degrades generalization. To this end, we propose RAG-KT, a retrieval-augmented paradigm that frames cross-platform KT as reliable context constrained inference with LLMs. It builds a unified multi-source structured context with cross-source alignment via Question Group abstractions and retrieves complementary rich and reliable context for each prediction, enabling grounded prediction and interpretable diagnosis. Experiments on three public KT benchmarks demonstrate consistent gains in accuracy and robustness, including strong performance under cross-platform conditions.

[AI-88] Continuous-time Online Learning via Mean-Field Neural Networks: Regret Analysis in Diffusion Environments

【速读】:该论文旨在解决连续时间在线学习问题,其中数据由未知系数的扩散过程生成,且学习者使用两层神经网络以非前瞻方式持续更新参数。其核心挑战在于如何在动态数据流中实现稳定的参数学习并量化学习性能。解决方案的关键在于将神经网络的学习动力学置于均场极限(mean-field limit)框架下,该极限对应于适应数据滤波的随机Wasserstein梯度流;通过引入对数索博列夫不等式(logarithmic Sobolev inequality)、Polyak-Lojasiewicz条件、马利avin微积分(Malliavin calculus)以及一致时间传播混沌(uniform-in-time propagation of chaos)等工具,建立了均场系统与有限粒子系统的遗憾边界(regret bounds)。特别地,在位移凸性(displacement convexity)假设下获得恒定静态遗憾界,在一般非凸情形下则得到显式的线性遗憾界,揭示了数据变化、熵探索和二次正则化对学习性能的影响机制。

链接: https://arxiv.org/abs/2604.10958
作者: Erhan Bayraktar,Bingyan Han,Ziqing Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 64 pages, 5 figures

点击查看摘要

Abstract:We study continuous-time online learning where data are generated by a diffusion process with unknown coefficients. The learner employs a two-layer neural network, continuously updating its parameters in a non-anticipative manner. The mean-field limit of the learning dynamics corresponds to a stochastic Wasserstein gradient flow adapted to the data filtration. We establish regret bounds for both the mean-field limit and finite-particle system. Our analysis leverages the logarithmic Sobolev inequality, Polyak-Lojasiewicz condition, Malliavin calculus, and uniform-in-time propagation of chaos. Under displacement convexity, we obtain a constant static regret bound. In the general non-convex setting, we derive explicit linear regret bounds characterizing the effects of data variation, entropic exploration, and quadratic regularization. Finally, our simulations demonstrate the outperformance of the online approach and the impact of network width and regularization parameters.

[AI-89] CSPO: Alleviating Reward Ambiguity for Structured Table-to-LaTeX Generation ACL2026

【速读】:该论文旨在解决将表格图像转换为LaTeX代码时,现有多模态大语言模型(MLLM)难以保持结构、样式和内容一致性的难题。当前基于强化学习(Reinforcement Learning, RL)的后训练方法通常依赖单一综合奖励信号,导致奖励歧义,阻碍了对不同生成组件的有效优化。其解决方案的关键在于提出一种分组件策略优化(Component-Specific Policy Optimization, CSPO)框架,该框架通过分离结构、样式和内容三个组件,并为每个组件分配独立的奖励信号,同时仅将相应奖励梯度回传至对应组件的token路径,从而缓解奖励歧义问题,实现针对各组件的精细化优化,显著提升生成表格的结构准确性和语义完整性。

链接: https://arxiv.org/abs/2604.10918
作者: Yunfan Yang,Cuiling Lan,Jitao Sang,Yan Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ACL2026 (main conference)

点击查看摘要

Abstract:Tables contain rich structured information, yet when stored as images their contents remain “locked” within pixels. Converting table images into LaTeX code enables faithful digitization and reuse, but current multimodal large language models (MLLMs) often fail to preserve structural, style, or content fidelity. Conventional post-training with reinforcement learning (RL) typically relies on a single aggregated reward, leading to reward ambiguity that conflates multiple behavioral aspects and hinders effective optimization. We propose Component-Specific Policy Optimization (CSPO), an RL framework that disentangles optimization across LaTeX tables components-structure, style, and content. In particular, CSPO assigns component-specific rewards and backpropagates each signal only through the tokens relevant to its component, alleviating reward ambiguity and enabling targeted component-wise optimization. To comprehensively assess performance, we introduce a set of hierarchical evaluation metrics. Extensive experiments demonstrate the effectiveness of CSPO, underscoring the importance of component-specific optimization for reliable structured generation.

[AI-90] EvoNash-MARL: A Closed-Loop Multi-Agent Reinforcement Learning Framework for Medium-Horizon Equity Allocation

【速读】:该论文旨在解决中长期股票配置(medium-to-long-horizon stock allocation)中的关键挑战,包括弱预测结构、市场状态非平稳性(non-stationary market regimes)以及交易成本、容量限制和尾部风险约束导致的信号退化问题。传统方法通常依赖单一预测器或松散耦合的“预测-配置”流程,难以在复杂环境下保持鲁棒性。其解决方案的核心在于提出一种名为EvoNash-MARL的统一框架,通过强化学习(Reinforcement Learning, RL)、多智能体策略群体(multi-agent policy populations)、基于Policy-Space Response Oracle(PSRO)风格的聚合机制、联赛最佳响应训练(league best-response training)、进化式替换(evolutionary replacement)以及执行感知检查点选择(execution-aware checkpoint selection),构建一个闭环的执行感知走查(walk-forward)优化流程。该框架进一步引入分层策略架构(方向头与风险头)、非线性信号增强、特征质量重加权及约束感知检查点选择,显著提升了模型在真实约束下的稳定性与跨市场泛化能力,实现了优于基准的夏普比率和年化收益表现。

链接: https://arxiv.org/abs/2604.10911
作者: Chongliu Jia,Yi Luo,Sipeng Han,Pengwei Li,Jie Ding,Youshuang Hu,Yimiao Qian,Qiya Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Medium-to-long-horizon stock allocation presents significant challenges due toveak predictive structures, non-stadonary market regimes, and the degradationf signals following the application of transaction costs, capacity limits, and tail-isk constraints. Conventional approaches commonly rely on a single predictor orloosely coupled prediction-to-allocation pipeline, limiting robustness underThis work addresses a targeted design question: whetherlistribution shift. 1coupling reinforcement learning (RL), multi-agent policy populations, Policy-Space Response Oracle (PSRO)-style aggregation, league best-response trainingevolutionary replacement, and execution-aware checkpoint selection within ainified walk-forward loop improves allocator robustness at medium to longhorizons. The proposed framework, EvoNash-MARL, integrates these componentswithin an execution-aware allocation loop and further introduces a layeredpolicy architecture comprising a direction head and a risk head, nonlinear signalenhancement, feature-quality reweighting, and constraint-aware checkpointselection. Under a 120-window walk-forward protocol, the resolved v21configuration achieves mean excess Sharpe 0.7600 and robust score -0.0203,anking first among internal controls; on aligned daily out-of-sample returnsrom 2014-01-02 to 2024-01-05, it delivers 19.6% annualized return versus 11.7% for SPY, and in an extended walk-forward evaluation through 2026-02-10 it delivers 20.5% rersus 13.5%. The framework maintains positive performance under realistictress constraints and exhibits structured cross-market generalization; however,lobal strong significance under White’s Reality Check (WRC) and SPA-lite testingestablished. Therefore, the results are presented as evidence supporting asnotnore stable medium-to long-horizon training and selection paradigm, ratherhan as prooffof universally superior market-timing performance.

[AI-91] Reasoning as Data: Representation-Computation Unity and Its Implementation in a Domain-Algebraic Inference Engine

【速读】:该论文旨在解决知识系统中存储(storage)与计算(computation)分离所带来的局限性问题,传统知识表示方法如三元组(triple)依赖外部规则或程序员心智来处理领域上下文(domain context),导致推理过程缺乏结构化约束和自动化能力。其解决方案的关键在于提出表示-计算统一(Representation-Computation Unity, RCU),通过将领域信息作为结构字段嵌入到四元组(four-tuple)谓词的arity中(例如 is_a(Apple, Company, @Business)),使领域成为谓词的一部分而非外部参数。这种设计使得系统能自动执行领域作用域内的推理(domain-scoped inference),无需额外规则,并由此自然衍生出三种机制:领域作用域闭包、类型继承(typed inheritance)以及基于领域纤维(domain fiber)的循环检测写时 falsification(write-time falsification)。论文进一步以形式化定理证明RCU的正确性,并实现了一个2400行Python+Prolog符号引擎验证其工程可行性,表明当领域为结构性要素时,数据本身即可完成计算(data computes itself)。

链接: https://arxiv.org/abs/2604.10908
作者: Chao Li,Yuru Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16pages ; Open-source implementation and evaluation scripts will be released in a subsequent revision

点击查看摘要

Abstract:Every existing knowledge system separates storage from computation. We show this separation is unnecessary and eliminate it. In a standard triple is_a(Apple, Company), domain context lives in the query or the programmer’s mind. In a CDC four-tuple is_a(Apple, Company, @Business), domain becomes a structural field embedded in predicate arity. Any system respecting arity automatically performs domain-scoped inference without external rules. We call this representation-computation unity (RCU). From the four-tuple structure, three inference mechanisms emerge: domain-scoped closure, typed inheritance, and write-time falsification via cycle detection per domain fiber. We establish RCU formally via four theorems. RCU is implementable. We present a working symbolic engine (2400 lines Python+Prolog) resolving four engineering issues: rule-data separation, shared-fiber handling, read-only meta-layer design, and intersective convergence. A central result: CDC domain-constrained inference is distinct from Prolog with a domain argument. Two case studies validate the engine. ICD-11 classification (1247 entities, 3 axes) shows fibers resolve multiple inheritance. CBT clinical reasoning shows generalization to temporal reasoning with session turn as ordered domain index. Multi-constraint queries realize CSP arc-consistency with complexity O(m (N/K)^2), confirming the domain lattice’s sparsity governs performance. When domain is structural, data computes itself.

[AI-92] CASK: Core-Aware Selective KV Compression for Reasoning Traces

【速读】:该论文旨在解决大语言模型在长文本推理过程中KV缓存(Key-Value Cache)随解码长度快速增长导致的内存瓶颈与推理稳定性下降问题。现有基于淘汰机制的KV压缩方法主要依赖于更精确地评估token重要性后丢弃低权重条目,但作者分析指出,仅优化评分器往往无法显著重构保留集合,难以有效维持推理行为。为此,论文提出CASK方案,将解码过程中的推理轨迹结构化为“保护核心”(protected core)和“可合并冗余区”(mergeable scratch)两部分:核心区域保留以锚定答案生成与中间状态,而仅对冗余区进行选择性合并压缩;此外,针对提示词密集场景下前缀占用全部缓存预算的问题,引入两阶段设计——先执行前缀淘汰,再进行解码阶段压缩。实验证明,CASK在相同预算下相比TriAttention具有更高的完整KV延续保真度,表明有效的KV压缩关键不在于复杂的评分器设计,而在于通过核心保留与选择性冗余区合并来降低可用预算门槛。

链接: https://arxiv.org/abs/2604.10900
作者: Buseong Kim,Heejun Gwon
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 8 figures, 3 main tables, appendices included

点击查看摘要

Abstract:In large language models performing long-form reasoning, the KV cache grows rapidly with decode length, creating bottlenecks in memory and inference stability. Existing reasoning-oriented KV compression has mostly followed an eviction-centered view: estimate token importance more accurately, then discard lower-ranked entries. Our analysis suggests that scorer refinement alone often fails to substantially reorganize the actual keep-set and may therefore not be the main lever for preserving reasoning behavior. We instead frame reasoning KV compression as a behavior-preserving structured consolidation problem. CASK partitions the decode-time reasoning trace into a protected core that anchors answer formation and intermediate state, and mergeable scratch with high redundancy. The core is preserved, while selective consolidation is applied only to the scratch. To address prompt-heavy regimes where the prefix can exhaust the budget before decode-stage compression becomes active, CASK further uses a two-stage design: prefix eviction followed by decode-stage consolidation. On the H100 reasoning gate, CASK shows higher full-KV continuation fidelity than TriAttention at matched budgets on both AIME24 and AIME25, with recurring cask@384 triattention@512 crossings. In prompt-heavy replay, multi_news and vcsum act as decode-active witnesses, while qmsum and gov_report expose the prefix_budget_exhausted boundary. The overall evidence supports a simple conclusion: effective reasoning KV compression depends less on more elaborate scorer engineering than on combining core preservation with selective scratch consolidation to lower the usable budget frontier.

[AI-93] Beyond A Fixed Seal: Adaptive Stealing Watermark in Large Language Models

【速读】:该论文旨在解决现有偷取水印算法(Stealing Watermark Algorithms, SWAs)在实际应用中效率低下和适应性不足的问题。具体而言,现有SWAs采用固定策略,未能考虑水印信息在不同生成阶段的非均匀分布特性以及大型语言模型(Large Language Model, LLM)生成过程中的动态变化,导致其在对抗性攻击中难以有效提取水印信息。为此,作者提出自适应偷取(Adaptive Stealing, AS),其核心创新在于引入基于位置的封印构建(Position-Based Seal Construction)与自适应选择(Adaptive Selection)模块:AS通过定义由上下文有序标记激活状态衍生的多种攻击视角,在执行过程中依据水印兼容性、生成优先级和动态生成相关性动态选择最优视角,从而显著提升对目标水印的偷取效率。这一设计增强了SWA对LLM生成动态性的适应能力,揭示了当前水印机制的脆弱性,并推动更鲁棒水印方案的研究发展。

链接: https://arxiv.org/abs/2604.10893
作者: Shuhao Zhang,Yuli Chen,Jiale Han,Bo Cheng,Jiabao Ma
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 18 pages,6 figures

点击查看摘要

Abstract:Watermarking provides a critical safeguard for large language model (LLM) services by facilitating the detection of LLM-generated text. Correspondingly, stealing watermark algorithms (SWAs) derive watermark information from watermarked texts generated by victim LLMs to craft highly targeted adversarial attacks, which compromise the reliability of watermarks. Existing SWAs rely on fixed strategies, overlooking the non-uniform distribution of stolen watermark information and the dynamic nature of real-world LLM generation processes. To address these limitations, we propose Adaptive Stealing (AS), a novel SWA featuring enhanced design flexibility through Position-Based Seal Construction and Adaptive Selection modules. AS operates by defining multiple attack perspectives derived from distinct activation states of contextually ordered tokens. During attack execution, AS dynamically selects the optimal perspective based on watermark compatibility, generation priority, and dynamic generation relevance. Our experiments demonstrate that AS significantly increases steal efficiency against target watermarks under identical experimental conditions. These findings highlight the need for more robust LLM watermarks to withstand potential attacks. We release our code to the community for future research\footnotethis https URL.

[AI-94] Ambiguity Detection and Elimination in Automated Executable Process Modeling

【速读】:该论文旨在解决自然语言规范在生成可执行的业务流程建模与标注(BPMN)模型时,因表述模糊或信息不足而导致模型结构合法但行为不一致的问题。其核心挑战在于缺乏真实BPMN模型作为参考标准的情况下,如何识别并修复文本规范中导致行为不稳定的关键缺陷。解决方案的关键在于提出一个诊断驱动的闭环框架:通过分析关键绩效指标(KPIs)的经验分布检测行为不一致性,利用基于模型的诊断定位分歧源头——即网关逻辑错误,并将问题逻辑映射回原始文本片段,最终基于证据对源文本进行精细化修正,从而显著降低重复生成模型的行为变异性。

链接: https://arxiv.org/abs/2604.10884
作者: Ion Matei,Praveen Kumar Menaka Sekar,Maksym Zhenirovskyy,Hon Yung Wong,Sayuri Kohmura,Shinji Hotta,Akihiro Inomata
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automated generation of executable Business Process Model and Notation (BPMN) models from natural-language specifications is increasingly enabled by large language models. However, ambiguous or underspecified text can yield structurally valid models with different simulated behavior. Our goal is not to prove that one generated BPMN model is semantically correct, but to detect when a natural-language specification fails to support a stable executable interpretation under repeated generation and simulation. We present a diagnosis-driven framework that detects behavioral inconsistency from the empirical distribution of key performance indicators (KPIs), localizes divergence to gateway logic using model-based diagnosis, maps that logic back to verbatim narrative segments, and repairs the source text through evidence-based refinement. Experiments on diabetic nephropathy health-guidance policies show that the method reduces variability in regenerated model behavior. The result is a closed-loop approach for validating and repairing executable process specifications in the absence of ground-truth BPMN models.

[AI-95] DIB-OD: Preserving the Invariant Core for Robust Heterogeneous Graph Adaptation via Decoupled Information Bottleneck and Online Distillation

【速读】:该论文旨在解决异质图数据预训练中因分布偏移导致的跨域泛化能力不足问题,现有方法通常仅关注域内模式,未能有效分离任务相关的不变知识与域特定的冗余噪声,从而引发负迁移和灾难性遗忘。其解决方案的关键在于提出DIB-OD框架,通过解耦信息瓶颈(Decoupled Information Bottleneck)与在线蒸馏(Online Distillation)机制,显式地将表征分解为正交的不变子空间与冗余子空间;利用信息瓶颈教师-学生蒸馏和希尔伯特-施密特独立准则(Hilbert-Schmidt Independence Criterion)提取跨域稳定的不变核心,并引入自适应语义正则化器,基于预测置信度动态调节标签影响,以保护该核心在目标域适应过程中的完整性,从而实现鲁棒的异质图迁移学习。

链接: https://arxiv.org/abs/2604.10882
作者: Yang Yan,Qiuyan Wang,Tianjin Huang,Qiudong Yu,Kexin Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Neural Network pretraining is pivotal for leveraging unlabeled graph data. However, generalizing across heterogeneous domains remains a major challenge due to severe distribution shifts. Existing methods primarily focus on intra-domain patterns, failing to disentangle task-relevant invariant knowledge from domain-specific redundant noise, leading to negative transfer and catastrophic forgetting. To this end, we propose DIB-OD, a novel framework designed to preserve the invariant core for robust heterogeneous graph adaptation through a Decoupled Information Bottleneck and Online Distillation framework. Our core innovation is the explicit decomposition of representations into orthogonal invariant and redundant subspaces. By utilizing an Information Bottleneck teacher-student distillation mechanism and the Hilbert-Schmidt Independence Criterion, we isolate a stable invariant core that transcends domain boundaries. Furthermore, a self-adaptive semantic regularizer is introduced to protect this core from corruption during target-domain adaptation by dynamically gating label influence based on predictive confidence. Extensive experiments across chemical, biological, and social network domains demonstrate that DIB-OD significantly outperforms state-of-the-art methods, particularly in challenging inter-type domain transfers, showcasing superior generalization and anti-forgetting performance.

[AI-96] A Quantitative Definition of Intelligence

【速读】:该论文旨在解决如何对任意物理系统的智能进行操作性、定量定义的问题,尤其针对传统定义中模糊性和主观性带来的挑战。其核心解决方案在于提出“智能密度”(intelligence density)这一指标,即系统独立输出的对数与其总描述长度之比。关键创新点在于:通过引入输出独立性条件和对无限输入域的泛化能力要求,该定义不仅将智能置于从逻辑门到大脑的 substrate-independent 连续谱上,还有效规避了Putnam的泛计算主义(pancomputationalist) triviality 问题,并通过证明有限规则手册处理无限领域时必须具备泛化能力,从而回应了Searle的中文房间论证(Chinese Room Argument)。

链接: https://arxiv.org/abs/2604.10873
作者: Kang-Sin Choi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages

点击查看摘要

Abstract:We propose an operational, quantitative definition of intelligence for arbitrary physical systems. The intelligence density of a system is the ratio of the logarithm of its independent outputs to its total description length. A system memorizes if its description length grows with its output count; it knows if its description length remains fixed while its output count diverges. The criterion for knowing is generalization: a system knows its domain if a single finite mechanism can produce correct outputs across an unbounded range of inputs, rather than storing each answer individually. We argue that meaning over a domain is a selection and ordering of functions that produces correct outputs, and that a system whose intelligence density diverges necessarily captures this structure. The definition (1) places intelligence on a substrate-independent continuum from logic gates to brains, (2) blocks Putnam’s pancomputationalist triviality argument via an independence condition on outputs, and (3) resolves Searle’s Chinese Room Argument by showing that any finite rulebook handling an infinite domain must generalize.

[AI-97] Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering

【速读】:该论文旨在解决现有深度聚类(Deep Clustering, DC)方法在处理表格数据时,仅依赖数据层面的统计共现关系来推断潜在度量空间,而忽视了特征名和特征值中蕴含的内在语义知识的问题。这导致语义相关概念(如“Flu”与“Cold”)被当作符号标记处理,致使语义相关的样本被孤立。解决方案的关键在于提出一种名为Tabular-Augmented Contrastive Clustering (TagCC) 的新框架,该框架通过大型语言模型(Large Language Models, LLMs)对数据语义进行语义感知转换,生成文本锚点(textual anchors),并将这些锚点与表格数据的统计表示通过对比学习(Contrastive Learning, CL)融合,从而将开放世界语义注入到聚类表示中;同时,该框架联合优化聚类目标,确保所学表示既具备语义一致性又适合聚类任务。

链接: https://arxiv.org/abs/2604.10865
作者: Mingjie Zhao,Yunfan Zhang,Yiqun Zhang,Yiu-ming Cheung
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep Clustering (DC) has emerged as a powerful tool for tabular data analysis in real-world domains like finance and healthcare. However, most existing methods rely on data-level statistical co-occurrence to infer the latent metric space, often overlooking the intrinsic semantic knowledge encapsulated in feature names and values. As a result, semantically related concepts like Flu' and Cold’ are often treated as symbolic tokens, causing conceptually related samples to be isolated. To bridge the gap between dataset-specific statistics and intrinsic semantic knowledge, this paper proposes Tabular-Augmented Contrastive Clustering (TagCC), a novel framework that anchors statistical tabular representations to open-world textual concepts. Specifically, TagCC utilizes Large Language Models (LLMs) to distill underlying data semantics into textual anchors via semantic-aware transformation. Through Contrastive Learning (CL), the framework enriches the statistical tabular representations with the open-world semantics encapsulated in these anchors. This CL framework is jointly optimized with a clustering objective, ensuring that the learned representations are both semantically coherent and clustering-friendly. Extensive experiments on benchmark datasets demonstrate that TagCC significantly outperforms its counterparts.

[AI-98] Query Lower Bounds for Diffusion Sampling

【速读】:该论文旨在解决扩散模型(Diffusion Models)在采样过程中如何最小化得分查询次数(score queries)的问题,尤其是从信息论角度揭示此类加速的理论极限。其关键贡献在于首次建立了扩散采样的得分查询下界:对于 dd 维分布,在得分估计具有多项式精度 ε=dO(1)\varepsilon = d^{-O(1)}(任意 LpL^p 范数意义下)的前提下,任何采样算法至少需要 Ω~(d)\widetilde\Omega(\sqrt{d}) 次自适应得分查询。这一结果从理论上解释了为何实践中必须采用多尺度噪声调度(multiscale noise schedules),因为采样器必须在 Ω~(d)\widetilde\Omega(\sqrt{d}) 个不同的噪声层级上进行搜索才能有效生成高质量样本。

链接: https://arxiv.org/abs/2604.10857
作者: Zhiyang Xun,Eric Price
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Diffusion models generate samples by iteratively querying learned score estimates. A rapidly growing literature focuses on accelerating sampling by minimizing the number of score evaluations, yet the information-theoretic limits of such acceleration remain unclear. In this work, we establish the first score query lower bounds for diffusion sampling. We prove that for d -dimensional distributions, given access to score estimates with polynomial accuracy \varepsilon=d^-O(1) (in any L^p sense), any sampling algorithm requires \widetilde\Omega(\sqrtd) adaptive score queries. In particular, our proof shows that any sampler must search over \widetilde\Omega(\sqrtd) distinct noise levels, providing a formal explanation for why multiscale noise schedules are necessary in practice. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST); Machine Learning (stat.ML) Cite as: arXiv:2604.10857 [cs.LG] (or arXiv:2604.10857v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.10857 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-99] BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving

【速读】:该论文旨在解决开环(Open-loop, OL)预训练策略在闭环(Closed-loop, CL)部署中表现不佳的系统性问题,即开环到闭环差距(OL-CL gap)。研究表明,该差距主要源于两个因素:观测域偏移(Observational Domain Shift)和目标不匹配(Objective Mismatch),其中后者由于策略无法建模复杂的反应式行为而构成结构性障碍。解决方案的关键在于提出一种测试时适应(Test-Time Adaptation, TTA)框架,通过校准观测偏移、减少状态-动作偏差并强制时间一致性来缓解规划偏差,从而显著提升策略在闭环环境中的泛化能力与稳定性。

链接: https://arxiv.org/abs/2604.10856
作者: Seth Z. Zhao,Luobin Wang,Hongwei Ruan,Yuxin Bao,Yilan Chen,Ziyang Leng,Abhijit Ravichandran,Honglin He,Zewei Zhou,Xu Han,Abhishek Peri,Zhiyu Huang,Pranav Desai,Henrik Christensen,Jiaqi Ma,Bolei Zhou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Open-loop (OL) to closed-loop (CL) gap (OL-CL gap) exists when OL-pretrained policies scoring high in OL evaluations fail to transfer effectively in closed-loop (CL) deployment. In this paper, we unveil the root causes of this systemic failure and propose a practical remedy. Specifically, we demonstrate that OL policies suffer from Observational Domain Shift and Objective Mismatch. We show that while the former is largely recoverable with adaptation techniques, the latter creates a structural inability to model complex reactive behaviors, which forms the primary OL-CL gap. We find that a wide range of OL policies learn a biased Q-value estimator that neglects both the reactive nature of CL simulations and the temporal awareness needed to reduce compounding errors. To this end, we propose a Test-Time Adaptation (TTA) framework that calibrates observational shift, reduces state-action biases, and enforces temporal consistency. Extensive experiments show that TTA effectively mitigates planning biases and yields superior scaling dynamics than its baseline counterparts. Furthermore, our analysis highlights the existence of blind spots in standard OL evaluation protocols that fail to capture the realities of closed-loop deployment.

[AI-100] A Benchmark for Gap and Overlap Analysis as a Test of KG Task Readiness

【速读】:该论文旨在解决知识图谱(Knowledge Graph, KG)质量评估中如何有效验证其对政策类文档(如保险合同)的语义覆盖能力问题,特别是针对特定场景下文档支持(overlap)与不支持(gap)的判定是否具备可复现性、可解释性和证据可追溯性。解决方案的关键在于构建一个可执行且可审计的基准测试框架,该框架将自然语言合同文本与形式化本体(ontology)及带证据标注的真实标签对齐,从而系统比较基于纯文本的大语言模型(LLM)与基于知识图谱驱动的方法在gap/overlap分析中的表现。实验表明,显式建模能够显著提升结果的一致性和诊断能力,且该基准具有通用性,可作为评估KG质量及支撑本体学习、知识图谱填充和基于证据的问题回答等下游任务的标准工具。

链接: https://arxiv.org/abs/2604.10853
作者: Maruf Ahmed Mridul,Rohit Kapa,Oshani Seneviratne
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Task-oriented evaluation of knowledge graph (KG) quality increasingly asks whether an ontology-based representation can answer the competency questions that users actually care about, in a manner that is reproducible, explainable, and traceable to evidence. This paper adopts that perspective and focuses on gap and overlap analysis for policy-like documents (e.g., insurance contracts), where given a scenario, which documents support it (overlap) and which do not (gap), with defensible justifications. The resulting gap/overlap determinations are typically driven by genuine differences in coverage and restrictions rather than missing data, making the task a direct test of KG task readiness rather than a test of missing facts or query expressiveness. We present an executable and auditable benchmark that aligns natural-language contract text with a formal ontology and evidence-linked ground truth, enabling systematic comparison of methods. The benchmark includes: (i) ten simplified yet diverse life-insurance contracts reviewed by a domain expert, (ii) a domain ontology (TBox) with an instantiated knowledge base (ABox) populated from contract facts, and (iii) 58 structured scenarios paired with SPARQL queries with contract-level outcomes and clause-level excerpts that justify each label. Using this resource, we compare a text-only LLM baseline that infers outcomes directly from contract text against an ontology-driven pipeline that answers the same scenarios over the instantiated KG, demonstrating that explicit modeling improves consistency and diagnosis for gap/overlap analyses. Although demonstrated for gap and overlap analysis, the benchmark is intended as a reusable template for evaluating KG quality and supporting downstream work such as ontology learning, KG population, and evidence-grounded question answering.

[AI-101] ask2vec Readiness: Diagnostics for Federated Learning from Pre-Training Embeddings

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在客户端异构性(heterogeneity)下性能不可预测的问题,即实践中缺乏可靠方法在训练前预判联邦系统的最终表现。其解决方案的关键在于提出基于Task2Vec嵌入的“就绪指数”(readiness indices),通过计算客户端嵌入的无监督指标(如凝聚度、离散度和密度)来量化联邦系统在训练前的对齐程度,并发现这些指标与最终性能之间存在高度相关性(Pearson和Spearman相关系数常超过0.9),从而为异构联邦环境下的客户端选择提供可解释且有效的预训练诊断工具。

链接: https://arxiv.org/abs/2604.10849
作者: Cristiano Mafuz,Rodrigo Silva
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated learning (FL) performance is highly sensitive to heterogeneity across clients, yet practitioners lack reliable methods to anticipate how a federation will behave before training. We propose readiness indices, derived from Task2Vec embeddings, that quantifies the alignment of a federation prior to training and correlates with its eventual performance. Our approach computes unsupervised metrics – such as cohesion, dispersion, and density – directly from client embeddings. We evaluate these indices across diverse datasets (CIFAR-10, FEMNIST, PathMNIST, BloodMNIST) and client counts (10–20), under Dirichlet heterogeneity levels spanning \alpha \in \0.05,\dots,5.0\ and FedAVG aggregation strategy. Correlation analyses show consistent and significant Pearson and Spearman coefficients between some of the Task2Vec-based readiness and final performance, with values often exceeding 0.9 across dataset \times client configurations, validating this approach as a robust proxy for FL outcomes. These findings establish Task2Vec-based readiness as a principled, pre-training diagnostic for FL that may offer both predictive insight and actionable guidance for client selection in heterogeneous federations.

[AI-102] Resilient Write: A Six-Layer Durable Write Surface for LLM Coding Agents

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的编码代理在使用工具协议(如Model Context Protocol, MCP)进行文件读写时,因写入失败(如内容过滤、截断或会话中断)而无法获得结构化反馈信号、丢失草稿并重复无效尝试的问题。解决方案的关键在于提出一个名为Resilient Write的MCP服务器,其通过六层持久化写入表面实现容错:预飞行风险评分、事务性原子写入、可恢复分块、结构化类型错误、带外临时存储及任务连续性交接封装。这些层独立且可选,分别对应实际会话中观察到的具体故障模式,并显著提升代理的自纠正能力与恢复效率(相比基线减少5倍恢复时间、提高13倍自修正率)。

链接: https://arxiv.org/abs/2604.10842
作者: Justice Owusu Agyemang,Jerry John Kponyo,Elliot Amponsah,Godfred Manu Addo Boakye,Kwame Opuni-Boachie Obour Agyekum
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-powered coding agents increasingly rely on tool-use protocols such as the Model Context Protocol~(MCP) to read and write files on a developer’s workstation. When a write fails – due to content filters, truncation, or an interrupted session – the agent typically receives no structured signal, loses the draft, and wastes tokens retrying blindly. We present \textbfResilient Write, an MCP server that interposes a six-layer durable write surface between the agent and the filesystem. The layers – pre-flight risk scoring, transactional atomic writes, resume-safe chunking, structured typed errors, out-of-band scratchpad storage, and task-continuity handoff envelopes – are orthogonal and independently adoptable. Each layer maps to a concrete failure mode observed during a real agent session in April~2026, in which content-safety filters silently rejected a draft containing redacted API-key prefixes. Three additional tools – chunk preview, format-aware validation, and journal analytics – emerged from using the system to compose this paper. A 186-test suite validates correctness at each layer, and quantitative comparison against naive and defensive baselines shows a 5x reduction in recovery time and a 13x improvement in agent self-correction rate. Resilient Write is open-source under the MIT license.

[AI-103] LLM s for Qualitative Data Analysis Fail on Security-specificComments in Human Experiments

【速读】:该论文旨在解决如何利用大语言模型(Large Language Models, LLMs)作为自动化标注工具,对人类受试者在安全实验中提供的自由文本注释进行技术性安全特征编码的问题。其核心挑战在于,相较于情感分类等任务,识别代码标识符、行号及安全关键词等安全相关要素需要更深层次的上下文理解能力。解决方案的关键在于设计多种提示策略(prompts),包括模拟标注最佳实践的详细代码描述、带示例的代码手册以及冲突样例,以提升LLMs在安全注释中的准确性。实验结果显示,仅使用详细的代码描述能带来显著改进,但效果在不同安全代码类型间不一致,且仍不足以可靠替代人工标注。

链接: https://arxiv.org/abs/2604.10834
作者: Maria Camporese,Fabio Massacci,Yuanjun Gong
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:[Background:] Thematic analysis of free-text justifications in human experiments provides significant qualitative insights. Yet, it is costly because reliable annotations require multiple domain experts. Large language models (LLMs) seem ideal candidates to replace human annotators. [Problem:] Coding security-specific aspects (code identifiers mentioned, lines-of-code mentioned, security keywords mentioned) may require deeper contextual understanding than sentiment classification. [Objective:] Explore whether LLMs can act as automated annotators for technical security comments by human subjects. [Method:] We prompt four top-performing LLMs on LiveBench to detect nine security-relevant codes in free-text comments by human subjects analyzing vulnerable code snippets. Outputs are compared to human annotators using Cohen’s Kappa (chance-corrected accuracy). We test different prompts mimicking annotation best practices, including emerging codes, detailed codebooks with examples, and conflicting examples. [Negative Results:] We observed marked improvements only when using detailed code descriptions; however, these improvements are not uniform across codes and are insufficient to reliably replace a human annotator. [Limitations:] Additional studies with more LLMs and annotation tasks are needed.

[AI-104] Your Model Diversity Not Method Determines Reasoning Strategy

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中如何合理分配计算资源的问题,即在探索多种解题路径(广度)与深入优化潜在解法(深度)之间找到最优权衡。传统方法通常隐式地进行这种权衡,但缺乏对为何特定策略有效的理论解释,且单一模型的验证难以区分策略效果与模型特性之间的关系。论文的核心贡献在于提出:最优策略取决于模型的多样性分布(diversity profile),即概率质量在不同解题路径上的分布情况;并构建了一个理论框架,将推理不确定性分解为可量化成分,从而推导出树状深度精炼优于并行采样的条件。实验验证表明,在低多样性对齐模型上,轻量级信号即可有效支持深度优化;而在高多样性基础模型上,由于探索覆盖不足,需更强的补偿机制才能实现性能提升。

链接: https://arxiv.org/abs/2604.10827
作者: Moulik Choraria,Argyrios Gerogiannis,Anirban Das,Supriyo Chakraborty,Berkcan Kapusuzoglu,Chia-Hsuan Lee,Kartik Balasubramaniam,Shi-Xiong Zhang,Sambit Sahu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Compute scaling for LLM reasoning requires allocating budget between exploring solution approaches ( breadth ) and refining promising solutions ( depth ). Most methods implicitly trade off one for the other, yet why a given trade-off works remains unclear, and validation on a single model obscures the role of the model itself. We argue that \textbfthe optimal strategy depends on the model’s diversity profile, the spread of probability mass across solution approaches, and that this must be characterized before any exploration strategy is adopted. We formalize this through a theoretical framework decomposing reasoning uncertainty and derive conditions under which tree-style depth refinement outperforms parallel sampling. We validate it on Qwen-3 4B and Olmo-3 7B families, showing that lightweight signals suffice for depth-based refinement on low-diversity aligned models while yielding limited utility for high-diversity base models, which we hypothesize require stronger compensation for lower exploration coverage.

[AI-105] CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在复杂认知任务中表现不足的问题,特别是其在模拟动物行为神经科学范式(如空间导航、工作记忆等)时的推理与决策能力评估缺乏统一标准。解决方案的关键在于构建一个名为CheeseBench的新基准,该基准涵盖九种经典的行为神经科学实验范式(如Morris水迷宫、T迷宫等),覆盖六种认知维度,并通过ASCII文本观测和奖励信号对模型进行零样本测试,使模型以类啮齿动物的方式探索未知环境。实验表明,尽管部分模型(如Qwen2.5-VL-7B)在ASCII输入下达到52.6%平均成功率,仍显著低于近似啮齿类动物基线(78.9%),且性能高度依赖于接口设计,揭示了当前开放权重LLM代理在空间推理和状态追踪任务上的局限性。

链接: https://arxiv.org/abs/2604.10825
作者: Zacharie Bugaud
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 6 figures, 4 tables

点击查看摘要

Abstract:We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified system prompt with no task-specific instructions and must discover goals purely from ASCII text observations and reward signals, much like a rodent placed into an unfamiliar apparatus. We evaluate six open-weight LLMs (3B to 72B parameters) on text-based ASCII renderings and compare against both a random baseline and a graph-based reinforcement learning agent. Our best model (Qwen2.5-VL-7B) reaches 52.6% average success on ASCII input, compared to 32.1% for random agents and 78.9% for approximate rodent baselines. We find that (1) scaling beyond 7B yields diminishing returns, (2) longer context history degrades performance, (3) chain-of-thought prompting hurts rather than helps, and (4) a vision-language architecture provides an advantage at 7B but hurts at 32B. Because the same model’s performance ranges from 20% to 57% depending on interface parameters alone, these results characterize the agent-plus-interface system, not the model in isolation. Under this unified zero-shot ASCII protocol, current open-weight LLM agents remain well below approximate rodent reference values, particularly on tasks requiring spatial navigation and within-trial state tracking.

[AI-106] Verify Before You Fix: Agent ic Execution Grounding for Trustworthy Cross-Language Code Analysis NEURIPS2026

【速读】:该论文旨在解决部署在智能体(agentic)流水线中的学习型分类器所面临的可靠性问题:其预测结果为概率推理而非可验证结论,若在缺乏可观测证据支撑的情况下直接执行,将导致下游阶段的错误不断累积。针对软件漏洞分析这一高成本场景,作者提出了一种统一的跨语言漏洞生命周期框架,其核心在于通过三个由大语言模型(LLM)驱动的推理阶段实现闭环验证——混合结构-语义检测、执行基础的智能体验证以及验证感知的迭代修复,并严格遵循“无执行确认不进行修复”的不变量原则。解决方案的关键创新包括:利用通用抽象语法树(uAST)将Java、Python和C++归一化为共享结构模式以实现跨语言泛化,以及采用GraphSAGE与Qwen2.5-Coder-1.5B嵌入的两路门控融合机制,该机制不仅提升了性能,还提供了每样本级别的内在可解释性。实验证明,该框架在单语言检测准确率达89.84–92.02%,零样本跨语言F1值为74.43–80.12%,并能以12.27%的总失败率完成69.74%漏洞的端到端修复,显著优于消融实验中移除uAST或禁用验证的版本。

链接: https://arxiv.org/abs/2604.10800
作者: Jugal Gajjar
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Programming Languages (cs.PL)
备注: 20 pages (13 main + 7 appendices), 9 figures, 10 tables. Submitted to NeurIPS 2026

点击查看摘要

Abstract:Learned classifiers deployed in agentic pipelines face a fundamental reliability problem: predictions are probabilistic inferences, not verified conclusions, and acting on them without grounding in observable evidence leads to compounding failures across downstream stages. Software vulnerability analysis makes this cost concrete and measurable. We address this through a unified cross-language vulnerability lifecycle framework built around three LLM-driven reasoning stages-hybrid structural-semantic detection, execution-grounded agentic validation, and validation-aware iterative repair-governed by a strict invariant: no repair action is taken without execution-based confirmation of exploitability. Cross-language generalization is achieved via a Universal Abstract Syntax Tree (uAST) normalizing Java, Python, and C++ into a shared structural schema, combined with a hybrid fusion of GraphSAGE and Qwen2.5-Coder-1.5B embeddings through learned two-way gating, whose per-sample weights provide intrinsic explainability at no additional cost. The framework achieves 89.84-92.02% intra-language detection accuracy and 74.43-80.12% zero-shot cross-language F1, resolving 69.74% of vulnerabilities end-to-end at a 12.27% total failure rate. Ablations establish necessity: removing uAST degrades cross-language F1 by 23.42%, while disabling validation increases unnecessary repairs by 131.7%. These results demonstrate that execution-grounded closed-loop reasoning is a principled and practically deployable mechanism for trustworthy LLM-driven agentic AI.

[AI-107] orchUMM: A Unified Multimodal Model Codebase for Evaluation Analysis and Post-training

【速读】:该论文旨在解决统一多模态模型(Unified Multimodal Models, UMMs)在评估、分析与后训练过程中面临的架构多样性与训练范式异构性问题,即不同模型在设计思路、规模和实现细节上的差异导致难以进行公平、可复现的比较与深入研究。其解决方案的关键在于提出 TorchUMM——首个面向多样化 UMM 骨干网络、任务和数据集的统一代码库,通过提供标准化的接口与评测协议,支持对多模态理解、生成和编辑三大核心任务的全面评估,并整合经典与新型数据集以衡量感知、推理、组合性和指令遵循能力,从而促进对模型性能的系统性洞察与持续优化。

链接: https://arxiv.org/abs/2604.10784
作者: Yinyi Luo,Wenwen Wang,Hayes Bai,Hongyu Zhu,Hao Chen,Pan He,Marios Savvides,Sharon Li,Jindong Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Technical Report

点击查看摘要

Abstract:Recent advances in unified multimodal models (UMMs) have led to a proliferation of architectures capable of understanding, generating, and editing across visual and textual modalities. However, developing a unified framework for UMMs remains challenging due to the diversity of model architectures and the heterogeneity of training paradigms and implementation details. In this paper, we present TorchUMM, the first unified codebase for comprehensive evaluation, analysis, and post-training across diverse UMM backbones, tasks, and datasets. TorchUMM supports a broad spectrum of models covering a wide range of scales and design paradigms. Our benchmark encompasses three core task dimensions: multimodal understanding, generation, and editing, and integrates both established and novel datasets to evaluate perception, reasoning, compositionality, and instruction-following abilities. By providing a unified interface and standardized evaluation protocols, TorchUMM enables fair and reproducible comparisons across heterogeneous models and fosters deeper insights into their strengths and limitations, facilitating the development of more capable unified multimodal systems. Code is available at: this https URL.

[AI-108] Learning Preference-Based Objectives from Clinical Narratives for Sequential Treatment Decision-Making

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在医疗领域中奖励函数设计的核心挑战,即如何有效建模稀疏、延迟且难以明确指定的临床结果。传统基于结构化数据的奖励函数往往无法全面反映患者临床轨迹的整体质量,如恢复动态、治疗负担和稳定性等关键维度。为此,作者提出临床叙事感知偏好奖励(Clinical Narrative-informed Preference Rewards, CN-PR)框架,其关键在于利用出院小结这类非结构化临床叙事文本作为可扩展的监督信号,通过大语言模型提取轨迹质量评分(Trajectory Quality Score, TQS),构建轨迹间的成对偏好关系,并基于置信度权重机制量化叙事信息的相关性以优化奖励学习过程。该方法显著提升了奖励与实际轨迹质量的一致性(Spearman相关系数=0.63),并使策略在提升器官支持-free天数和加快休克缓解等方面表现出优于基线的临床效果。

链接: https://arxiv.org/abs/2604.10783
作者: Daniel J. Tan,Kay Choong See,Mengling Feng
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Designing reward functions remains a central challenge in reinforcement learning (RL) for healthcare, where outcomes are sparse, delayed, and difficult to specify. While structured data capture physiological states, they often fail to reflect the overall quality of a patient’s clinical trajectory, including recovery dynamics, treatment burden, and stability. Clinical narratives, in contrast, summarize longitudinal reasoning and implicitly encode evaluations of treatment effectiveness. We propose Clinical Narrative-informed Preference Rewards (CN-PR), a framework for learning reward functions directly from discharge summaries by treating them as scalable supervision for trajectory-level preferences. Using a large language model, we derive trajectory quality scores (TQS) and construct pairwise preferences over patient trajectories, enabling reward learning via a structured preference-based objective. To account for variability in narrative informativeness, we incorporate a confidence signal that weights supervision based on its relevance to the decision-making task. The learned reward aligns strongly with trajectory quality (Spearman rho = 0.63) and enables policies that are consistently associated with improved recovery-related outcomes, including increased organ support-free days and faster shock resolution, while maintaining comparable performance on mortality. These effects persist under external validation. Our results demonstrate that narrative-derived supervision provides a scalable and expressive alternative to handcrafted or outcome-based reward design for dynamic treatment regimes.

[AI-109] When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling

【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)在推理过程中过度依赖增加思考长度(chain-of-thought)以提升性能的问题,即现有研究普遍假设“更长的推理链必然带来更好的结果”,但这一假设缺乏实证支持。论文通过系统性实验发现,随着计算预算增加,额外推理token的边际收益显著递减,并且存在“过度思考”现象——即模型在延长推理过程中可能放弃原本正确的答案。其解决方案的关键在于提出一种成本感知的评估框架,表明在适度的计算预算下停止推理,可在保持与高预算相当准确率的同时大幅降低计算开销,同时指出最优思考长度应根据问题难度动态调整,从而实现更高效的资源分配。

链接: https://arxiv.org/abs/2604.10739
作者: Shu Zhou,Rui Ling,Junan Chen,Xin Wang,Tao Fan,Hao Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 7 figures

点击查看摘要

Abstract:Scaling test-time compute through extended chains of thought has become a dominant paradigm for improving large language model reasoning. However, existing research implicitly assumes that longer thinking always yields better results. This assumption remains largely unexamined. We systematically investigate how the marginal utility of additional reasoning tokens changes as compute budgets increase. We find that marginal returns diminish substantially at higher budgets and that models exhibit ``overthinking’', where extended reasoning is associated with abandoning previously correct answers. Furthermore, we show that optimal thinking length varies across problem difficulty, suggesting that uniform compute allocation is suboptimal. Our cost-aware evaluation framework reveals that stopping at moderate budgets can reduce computation significantly while maintaining comparable accuracy.

[AI-110] Perceived Importance of Cognitive Skills Among Computing Students in the Era of AI

【速读】:该论文试图解决的问题是:随着生成式 AI(Generative AI)工具在教育领域的广泛应用,其对学习者认知技能(cognitive skills)发展可能产生的负面影响,尤其是认知卸载(cognitive offloading)现象是否会导致学生在学习过程中认知参与度下降,从而影响其未来职业竞争力。解决方案的关键在于通过一项受研究者监控的定量调查,系统评估本科生在三个时间维度(过去、现在、未来)对11项认知技能重要性的感知变化,发现学生普遍预期这些技能在未来AI高度集成的环境中重要性将下降。这一结果凸显了教育干预的必要性——必须在日益依赖AI的学习环境中,主动强化认知技能培养,以确保教学设计与未来职场需求相匹配。

链接: https://arxiv.org/abs/2604.10730
作者: Neha Rani,Erta Cenko,Laura Melissa Cruz Castro
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The availability and increasing integration of generative AI tools have transformed computing education. While AI in education presents opportunities, it also raises new concerns about how these powerful know-it-all AI tools, which are becoming widespread, impact cognitive skill development among students. Cognitive skills are essential for academic success and professional competence. It relates to the ability to understand, analyze, evaluate, synthesize information and more. The extensive use of these AI tools can aid in cognitive offloading, freeing up cognitive resources to be used in other tasks and activities. However, cognitive offloading may inadvertently lead to diminishing cognitive involvement in learning and related activities when using AI tools. Understanding cognitive skills’ impact in the era of AI is essential to align curricular design with evolving workforce demands and changing work environment and processes. To address this concern and to develop an understanding of how the importance of cognitive skills changes with increasing integration of AI, we conducted a researcher-monitored and regulated quantitative survey of undergraduate computing students. We examined students’ perceptions of cognitive skills across three temporal frames: prior to widespread AI adoption (past), current informal and formal use of AI in learning contexts (present), and future with even more AI integration in professional environments (future). In the study, students rated the importance of 11 cognitive skills. Our analysis reveals that students expect all 11 cognitive skills to be of diminishing importance in the future, when AI use and integration increases. Our findings highlight the need for educational interventions that explicitly reinforce cognitive skill development within learning environments that are now often relying on AI.

[AI-111] SciPredict: Can LLM s Predict the Outcomes of Scientific Experiments in Natural Sciences?

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在科学实验结果预测能力方面的评估空白问题,即现有基准主要关注LLMs的科学知识与推理能力,却忽视了其在预测实验结果这一可能显著超越人类能力的任务上的表现。解决方案的关键在于提出SciPredict基准,涵盖来自物理学、生物学和化学33个子领域的405个任务,系统评估LLMs预测实验结果的准确性及其在科研流程中应用的可靠性。研究发现,当前模型准确率仅为14–26%,虽部分前沿模型超过人类专家(约20%),但远未达到可依赖的程度;更重要的是,模型无法区分可靠与不可靠预测,而人类专家则表现出良好的校准能力——其预测准确性随对结果可预测性的判断增强而显著提升(从约5%升至约80%)。这表明,实现超人类水平的实验科学智能不仅需要更精准的预测,更需具备对预测置信度的可靠认知。

链接: https://arxiv.org/abs/2604.10718
作者: Udari Madhushani Sehwag,Elaine Lau,Haniyeh Ehsani Oskouie,Shayan Shabihi,Erich Liang,Andrea Toledo,Guillermo Mangialardi,Sergio Fonrouge,Ed-Yeremai Hernandez Cardona,Paula Vergara,Utkarsh Tyagi,Chen Bo Calvin Zhang,Pavi Bhatter,Nicholas Johnson,Furong Huang,Ernesto Gabriel Hernandez Montoya,Bing Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accelerating scientific discovery requires the identification of which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes - a task where AI could significantly exceed human capabilities - remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies are 14-26% and human expert performance is \approx 20%. Although some frontier models exceed human performance model accuracy is still far below what would enable reliable experimental guidance. Even within the limited performance, models fail to distinguish reliable predictions from unreliable ones, achieving only \approx 20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy increases from \approx 5% to \approx 80% as they deem outcomes more predictable without conducting the experiment. SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability. For reproducibility all our data and code are provided at this https URL

[AI-112] FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning ACL

【速读】:该论文旨在解决链式思维(Chain-of-Thought, CoT)提示中推理轨迹的可信度评估问题,即模型生成的中间步骤看似连贯但可能缺乏真实逻辑依赖性(unfaithful intermediate steps),导致现有自评价方法因固有偏见而高估推理质量。解决方案的关键在于提出FACT-E框架,其核心创新是引入因果启发式的受控扰动机制作为工具变量(instrumental signal),以区分真实的步骤间依赖关系与由模型偏见驱动的伪相关性,从而获得更可靠的内部链路忠实度(intra-chain faithfulness)估计;同时结合CoT到最终答案的一致性(CoT-to-answer consistency),筛选出既内部一致又支持正确答案的推理路径,实现对高质量推理轨迹的精准识别与选择。

链接: https://arxiv.org/abs/2604.10693
作者: Yuxi Sun,Aoqi Zuo,Haotian Xie,Wei Gao,Mingming Gong,Jing Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to Association for Computational Linguistics Findings (ACL) 2026

点击查看摘要

Abstract:Chain-of-Thought (CoT) prompting has improved LLM reasoning, but models often generate explanations that appear coherent while containing unfaithful intermediate steps. Existing self-evaluation approaches are prone to inherent biases: the model may confidently endorse coherence even when the step-to-step implication is not valid, leading to unreliable faithfulness evaluation. We propose FACT-E, a causality-inspired framework for evaluating CoT quality. FACT-E uses controlled perturbations as an instrumental signal to separate genuine step-to-step dependence from bias-driven artifacts, producing more reliable faithfulness estimates (\textitintra-chain faithfulness). To select trustworthy trajectories, FACT-E jointly considers \textitintra-chain faithfulness and \textitCoT-to-answer consistency, ensuring that selected chains are both faithful internally and supportive of the correct final answer. Experiments on GSM8K, MATH, and CommonsenseQA show that FACT-E improves reasoning-trajectory selection and yields stronger in-context learning exemplars. FACT-E also reliably detects flawed reasoning under noisy conditions, providing a robust metric for trustworthy LLM reasoning.

[AI-113] Do LLM s Build Spatial World Models? Evidence from Grid-World Maze Tasks

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在空间理解与推理能力方面的局限性问题,特别是其是否能够构建稳健的内部空间世界模型(spatial world models)以支持多步规划和抽象推理。研究通过系统性的迷宫任务评估不同LLM(如Gemini-2.5-Flash、GPT-5-mini等)的空间推理表现,发现模型性能高度依赖于输入表示形式——例如使用分词邻接表示时表现良好,而切换为视觉网格格式后准确率骤降2–5倍,表明其空间推理并非格式不变(format-invariant),而是呈现显著的表示依赖性。关键解决方案在于设计受控的迷宫测试环境,并结合链式思维(chain-of-thought)提示策略,揭示了LLMs虽能生成高语义覆盖率的推理路径,却无法将空间知识累积用于一致的空间计算,从而证明其缺乏真正意义上的空间世界建模能力,仅表现出特定提示和表示下的局部推理优势。

链接: https://arxiv.org/abs/2604.10690
作者: Weijiang Li,Yilin Zhu,Rajarshi Das,Parijat Dube
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Foundation models have shown remarkable performance across diverse tasks, yet their ability to construct internal spatial world models for reasoning and planning remains unclear. We systematically evaluate the spatial understanding of large language models through maze tasks, a controlled testing context requiring multi-step planning and spatial abstraction. Across comprehensive experiments with Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat, we uncover significant discrepancies in spatial reasoning that challenge assumptions about LLM planning capabilities. Using chain-of-thought prompting, Gemini achieves 80-86% accuracy on smaller mazes (5x5 to 7x7 grids) with tokenized adjacency representations, but performance collapses to 16-34% with visual grid formats, which is a 2-5x difference, suggesting representation-dependent rather than format-invariant spatial reasoning. We further probe spatial understanding through sequential proximity questions and compositional distance comparisons. Despite achieving 96-99% semantic coverage in reasoning traces, models fail to leverage this understanding for consistent spatial computations, indicating that they treat each question independently rather than building cumulative spatial knowledge. Our findings based on the maze-solving tasks suggest that LLMs do not develop robust spatial world models, but rather exhibit representation-specific and prompting-dependent reasoning that succeeds only under narrow conditions. These results have critical implications for deploying foundation models in applications requiring spatial abstraction.

[AI-114] Critical-CoT: A Robust Defense Framework against Reasoning -Level Backdoor Attacks in Large Language Models

【速读】:该论文旨在解决生成式 AI(Generative AI)中基于链式推理(Chain-of-Thought, CoT)的后门攻击问题。此类攻击通过在训练数据中植入特定触发器,使模型在推理过程中插入恶意逻辑步骤,从而输出错误但看似合理的答案,显著增加了检测难度。解决方案的关键在于提出一种名为 Critical-CoT 的新型防御机制,其核心是通过对大语言模型(Large Language Models, LLMs)进行两阶段微调(Fine-Tuning, FT),引导模型发展出批判性思维行为,使其能够自动识别潜在后门并拒绝生成恶意推理路径,从而在多个模型和任务上实现强鲁棒性和跨领域、跨任务的泛化能力。

链接: https://arxiv.org/abs/2604.10681
作者: Vu Tuan Truong,Long Bao Le
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs), despite their impressive capabilities across domains, have been shown to be vulnerable to backdoor attacks. Prior backdoor strategies predominantly operate at the token level, where an injected trigger causes the model to generate a specific target word, choice, or class (depending on the task). Recent advances, however, exploit the long-form reasoning tendencies of modern LLMs to conduct reasoning-level backdoors: once triggered, the victim model inserts one or more malicious reasoning steps into its chain-of-thought (CoT). These attacks are substantially harder to detect, as the backdoored answer remains plausible and consistent with the poisoned reasoning trajectory. Yet, defenses tailored to this type of backdoor remain largely unexplored. To bridge this gap, we propose Critical-CoT, a novel defense mechanism that conducts a two-stage fine-tuning (FT) process on LLMs to develop critical thinking behaviors, enabling them to automatically identify potential backdoors and refuse to generate malicious reasoning steps. Extensive experiments across multiple LLMs and datasets demonstrate that Critical-CoT provides strong robustness against both in-context learning-based and FT-based backdoor attacks. Notably, Critical-CoT exhibits strong cross-domain and cross-task generalization. Our code is available at hthttps://github.com/tuanvu171/Critical-CoT.

[AI-115] FedRio: Personalized Federated Social Bot Detection via Cooperative Reinforced Contrastive Adversarial Distillation

【速读】:该论文旨在解决跨平台社交机器人(Social Bot)检测中因模型孤立训练导致的性能瓶颈以及新兴变种难以及时识别的问题,同时应对不同平台间数据分布异构性和模型架构差异带来的知识共享难题。其核心解决方案是提出FedRio框架,关键创新在于:首先采用自适应消息传递模块作为客户端图神经网络骨干以增强本地表征能力;其次设计基于生成对抗网络的联邦知识提取机制,实现对全局数据分布的有效聚合与共享;再者引入多阶段对抗对比学习策略,强制客户端特征空间一致性并缩小局部与全局模型间的差异;最后结合服务器端自适应参数聚合与客户端强化学习驱动的参数控制,有效适配异构联邦环境下的数据非独立同分布特性,从而在保障隐私的前提下显著提升检测精度、通信效率及特征空间一致性。

链接: https://arxiv.org/abs/2604.10678
作者: Yingguang Yang,Hao Liu,Xin Zhang,Yunhui Liu,Yutong Xia,Qi Wu,Hao Peng,Taoran Liang,Bin Chong,Tieke He,Philip S. Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 6 figures

点击查看摘要

Abstract:Social bot detection is critical to the stability and security of online social platforms. However, current state-of-the-art bot detection models are largely developed in isolation, overlooking the benefits of leveraging shared detection patterns across platforms to improve performance and promptly identify emerging bot variants. The heterogeneity of data distributions and model architectures further complicates the design of an effective cross-platform and cross-model detection framework. To address these challenges, we propose FedRio (Personalized Federated Social Bot Detection with Cooperative Reinforced Contrastive Adversarial Distillation framework. We first introduce an adaptive message-passing module as the graph neural network backbone for each client. To facilitate efficient knowledge sharing of global data distributions, we design a federated knowledge extraction mechanism based on generative adversarial networks. Additionally, we employ a multi-stage adversarial contrastive learning strategy to enforce feature space consistency among clients and reduce divergence between local and global models. Finally, we adopt adaptive server-side parameter aggregation and reinforcement learning-based client-side parameter control to better accommodate data heterogeneity in heterogeneous federated settings. Extensive experiments on two real-world social bot detection benchmarks demonstrate that FedRio consistently outperforms state-of-the-art federated learning baselines in detection accuracy, communication efficiency, and feature space consistency, while remaining competitive with published centralized results under substantially stronger privacy constraints.

[AI-116] Preference-Agile Multi-Objective Optimization for Real-time Vehicle Dispatching

【速读】:该论文旨在解决动态多目标优化(Multi-objective Optimization, MOO)中长期存在的两大挑战:一是现有方法大多局限于确定性MOO问题,难以应对现实场景中的不确定性;二是多数研究聚焦于非序列化的动态MOO决策问题,无法处理具有时序依赖性的复杂实际问题。为应对这些局限,论文提出了一种偏好敏捷的多目标优化(Preference-Agile Multi-Objective Optimization, PAMOO)方法,其核心创新在于构建了一个基于深度强化学习(Deep Reinforcement Learning, DRL)的统一模型框架,能够显式接收用户动态调整的偏好向量作为输入,并通过一个校准函数(calibration function)确保偏好向量与DRL决策策略之间的高质量对齐。这一设计使得PAMOO能够在运行过程中实时响应用户偏好的变化,从而有效支持动态、连续且交互式的多目标决策。

链接: https://arxiv.org/abs/2604.10664
作者: Jiahuan Jin,Wenhao Zhao,Rong Qu,Jianfeng Ren,Xinan Chen,Qingfu Zhang,Ruibin Bai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-objective optimization (MOO) has been widely studied in literature because of its versatility in human-centered decision making in real-life applications. Recently, demand for dynamic MOO is fast-emerging due to tough market dynamics that require real-time re-adjustments of priorities for different objectives. However, most existing studies focus either on deterministic MOO problems which are not practical, or non-sequential dynamic MOO decision problems that cannot deal with some real-life complexities. To address these challenges, a preference-agile multi-objective optimization (PAMOO) is proposed in this paper to permit users to dynamically adjust and interactively assign the preferences on the fly. To achieve this, a novel uniform model within a deep reinforcement learning (DRL) framework is proposed that can take as inputs users’ dynamic preference vectors explicitly. Additionally, a calibration function is fitted to ensure high quality alignment between the preference vector inputs and the output DRL decision policy. Extensive experiments on challenging real-life vehicle dispatching problems at a container terminal showed that PAMOO obtains superior performance and generalization ability when compared with two most popular MOO methods. Our method presents the first dynamic MOO method for challenging \revdynamic sequential MOO decision problems

[AI-117] DynamicsLLM : a Dynamic Analysis-based Tool for Generating Intelligent Execution Traces Using LLM s to Detect Android Behavioural Code Smells

【速读】:该论文旨在解决Android移动应用中行为型代码异味(behavioural code smells)检测覆盖率不足的问题,尤其是现有基于动态分析的方法(如Dynamics)存在较高的漏报率,难以有效触发和识别相关代码异味事件。其解决方案的关键在于提出DynamicsLLM框架,通过引入大语言模型(Large Language Models, LLMs)智能生成执行轨迹(intelligent execution traces),以更高效地触发潜在的代码异味行为;同时设计了一种混合方法,针对活动(activity)数量较少的应用提升LLM生成轨迹的覆盖能力,从而显著增强对代码异味相关事件的检测完整性。实验表明,DynamicsLLM在有限操作下可覆盖三倍于Dynamics的代码异味事件,且能成功触发原本无法被Dynamics识别的12.7%事件。

链接: https://arxiv.org/abs/2604.10661
作者: Houcine Abdelkader Cherief,Florent Avellaneda,Naouel Moha
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mobile apps have become essential of our daily lives, making code quality a critical concern for developers. Behavioural code smells are characteristics in the source code that induce inappropriate code behaviour during execution, which negatively impact software quality in terms of performance, energy consumption, and memory. Dynamics, the latest state-of-the-art tool-based method, is highly effective at detecting Android behavioural code smells. While it outperforms static analysis tools, it suffers from a high false negative rate, with multiple code smell instances remaining undetected. Large Language Models (LLMs) have achieved notable advances across numerous research domains and offer significant potential for generating intelligent execution traces, particularly for detecting behavioural code smells in Android mobile applications. By intelligent execution trace, we mean a sequence of events generated by specific actions in a way that triggers the identification of a given behaviour. We propose the following three main contributions in this paper: (1) DynamicsLLM, an enhanced implementation of the Dynamics method that leverages LLMs to intelligently generate execution traces. (2) A novel hybrid approach designed to improve the coverage of code smell-related events in applications with a small number of activities. (3) A comprehensive validation of DynamicsLLM on 333 mobile applications from F-DROID, including a comparison with the Dynamics tool. Our results show that, under a limited number of actions, DynamicsLLM configured with 100% LLM covers three times more code smell-related events than Dynamics. The hybrid approach improves LLM coverage by 25.9% for apps containing few activities. Moreover, 12.7% of the code smell-related events that cannot be triggered by Dynamics are successfully triggered by our tool.

[AI-118] Enhancing Cross-Problem Vehicle Routing via Federated Learning

【速读】:该论文旨在解决神经组合优化(Neural Combinatorial Optimization, NCO)中跨问题学习范式在从简单车辆路径问题(Vehicle Routing Problem, VRP)变体向包含不同且复杂约束的VRP迁移时,性能下降和泛化能力衰减的问题。其解决方案的关键在于提出一种“多问题预训练、单问题微调”框架结合联邦学习(Multi-problem Pre-train, then Single-problem Fine-tune with Federated Learning, MPSF-FL),通过联邦学习机制利用全局模型共享通用VRP知识,使本地模型在保留共性知识的同时,高效适配具有异构复杂约束的下游VRP任务,从而提升多样VRP场景下的性能与未见问题的泛化能力。

链接: https://arxiv.org/abs/2604.10652
作者: Xiangchi Meng,Jianan Zhou,Jie Gao,Yifan Lu,Yaoxin Wu,Gonglin Yuan,Yaqing Hou
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vehicle routing problems (VRPs) constitute a core optimization challenge in modern logistics and supply chain management. The recent neural combinatorial optimization (NCO) has demonstrated superior efficiency over some traditional algorithms. While serving as a primary NCO approach for solving general VRPs, current cross-problem learning paradigms are still subject to performance degradation and generalizability decay, when transferring from simple VRP variants to those involving different and complex constraints. To strengthen the paradigms, this paper offers an innovative “Multi-problem Pre-train, then Single-problem Fine-tune” framework with Federated Learning (MPSF-FL). This framework exploits the common knowledge of a federated global model to foster efficient cross-problem knowledge sharing and transfer among local models for single-problem fine-tuning. In this way, local models effectively retain common VRP knowledge from up-to-date global model, while being efficiently adapted to downstream VRPs with heterogeneous complex constraints. Experimental results demonstrate that our framework not only enhances the performance in diverse VRPs, but also improves the generalizability in unseen problems.

[AI-119] Vibe-driven model-based engineering

【速读】:该论文旨在解决当前软件开发中因系统复杂性增加和用户需求多样化(如新型人机交互界面、智能组件需求及可持续性考量)所带来的挑战,尤其是在传统模型驱动工程(Model-Driven Engineering, MDE)面临模型自身日益复杂难以管理,以及基于大语言模型(Large Language Models, LLMs)的“ vibe coding”方法虽能快速生成代码但存在潜在漏洞、可扩展性和可维护性问题的背景下。其解决方案的关键在于提出“振动驱动的模型驱动工程”(vibe-driven model-based engineering),通过融合生成式 AI 与 MDE 的优势,构建一种协同开发范式,从而加速可靠复杂系统的开发,同时兼顾效率与质量。

链接: https://arxiv.org/abs/2604.10645
作者: Jordi Cabot
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:There is a pressing need for better development methods and tools to keep up with the growing demand and increasing complexity of new software systems. New types of user interfaces, the need for intelligent components, sustainability concerns, etc. bring new challenges that we need to handle. In the last years, model-driven engineering (MDE), including its latest incarnation, i.e. low/no-code development, has been key to improving the quality and productivity of software development, but models themselves are becoming increasingly complex to specify and manage. At the same time, we are witnessing the growing popularity of vibe coding approaches that rely on Large Language Models (LLMs) to transform natural language descriptions into running code at the expense of potential code vulnerabilities, scalability issues and maintainability concerns. While many may think vibe coding will replace model-based engineering, in this paper we argue that, in fact, the two approaches can complement each other and provide altogether different development paths for different types of software systems, development scenarios, and user profiles. In this sense, we introduce the concept of \textitvibe-driven model-based engineering as a novel approach to integrate the best of both worlds (AI and MDE) to accelerate the development of reliable complex systems. We outline the key concepts of this new approach and highlight the opportunities and open challenges it presents for the future of software development. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.10645 [cs.SE] (or arXiv:2604.10645v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2604.10645 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-120] MoEITS: A Green AI approach for simplifying MoE-LLM s

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)中混合专家(Mixture-of-Experts, MoE)架构在训练与推理过程中带来的高计算负担、内存占用和能耗问题。针对这一挑战,作者提出了一种名为MoEITS的简化算法,其核心在于基于标准化的信息论框架实现对MoE-LLMs的高效压缩与优化。该方法通过精细化的结构简化策略,在保证模型性能的前提下显著降低计算复杂度与资源消耗,实验证明其在多个主流模型(如Mixtral 8×7B、Qwen1.5-2.7B和DeepSeek-V2-Lite)上均优于现有最先进的剪枝技术,实现了精度保持与计算效率的双重提升。

链接: https://arxiv.org/abs/2604.10603
作者: Luis Balderas,Miguel Lastra,José M. Benítez
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Large language models are transforming all areas of academia and industry, attracting the attention of researchers, professionals, and the general public. In the trek for more powerful architectures, Mixture-of-Experts, inspired by ensemble models, have emerged as one of the most effective ways to follow. However, this implies a high computational burden for both training and inference. To reduce the impact on computing and memory footprint as well as the energy consumption, simplification methods has arisen as very effective procedures. In this paper, an original algorithm, MoEITS, for MoE-LLMs simplification is presented. The algorithm is characterized by a refined simplicity, underpinned by standardized Information Theoretic frameworks. MoEITS is analyzed in depth from theoretical and practical points of view. Its computational complexity is studied. Its performance on the accuracy of the simplified LLMs and the reduction rate achieved is assessed through a thoroughly designed experimentation. This empirical evaluation includes a comparison with state-of-the-art MoE-LLM pruning methods applied on Mixtral 8\times7 B, Qwen1.5-2.7B, and DeepSeek-V2-Lite. The extensive experimentation conducted demonstrates that MoEITS outperforms state-of-the-art techniques by generating models that are both effective across all benchmarks and computationally efficient. The code implementing the method will be available at this https URL. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF) Cite as: arXiv:2604.10603 [cs.LG] (or arXiv:2604.10603v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.10603 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-121] Working Paper: Towards Schema-based Learning from a Category-Theoretic Perspective

【速读】:该论文旨在解决生成式 AI(Generative AI)中模型结构与认知机制之间缺乏统一形式化框架的问题,尤其在多层级语义建模、认知过程实现及具身智能体架构设计方面存在理论割裂。其解决方案的关键在于构建一个分层分类学框架(hierarchical categorical framework),通过范畴论工具将Schema-Based Learning(SBL)的形式化体系从底层语法结构延伸至高层认知与环境交互:具体而言,该框架以自由多范畴(free multicategory)Sch_syn编码基础模式及其变换,借助Grothendieck构造得到实现类别Sch_impl,并通过Giry单子的Kleisli范畴映射为概率模型;进一步引入心智类别Mind,结合双幺半群结构(duoidal structure)支持基于schema的工作流执行,其中记忆系统由预层Data_M和操作范畴Ops_M形式化;最终,整个系统嵌入于代理架构类别ArchCat与世界类别World之中,形成弱n-范畴结构,实现了从语义、认知、具身到环境交互的跨层次统一建模。

链接: https://arxiv.org/abs/2604.10589
作者: Pablo de los Riscos,Fernando J. Corbacho,Michael A. Arbib
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 43 pages, 3 figures

点击查看摘要

Abstract:We introduce a hierarchical categorical framework for Schema-Based Learning (SBL) structured across four interconnected levels. At the schema level, a free multicategory Sch_syn encodes fundamental schemas and transformations. An implementation functor \mathcalI maps syntactic schemas to representational languages, inducing via the Grothendieck construction the total category Sch_impl . Implemented schemas are mapped by a functor Model into the Kleisli category \mathbfKL(G) of the Giry monad, yielding probabilistic models, while an instances presheaf assigns evaluated instance spaces. A semantic category Sch_sem , defined as a full subcategory of \mathbfKL(G) , provides semantic grounding through an interpretation functor from Sch_impl . At the agent level, Sch_impl is equipped with a duoidal structure \mathcalO_Sch supporting schema-based workflows. A left duoidal action on the category Mind enables workflow execution over mental objects, whose components include mental spaces, predictive models, and a cognitive kernel composed of memory and cognitive modules. Each module is specified by schema-typed interfaces, duoidal workflows, a success condition, and a logical signature. Memory is formalized categorically via memory subsystems, a presheaf Data_M , a monoidal operation category Ops_M , and read/write natural transformations. Together with the Body category, Mind defines the embodied SBL agent. At higher levels, SBL is represented as an object of the agent architecture category ArchCat , enabling comparison with heterogeneous paradigms, while the World category models multi-agent and agent-environment interactions. Altogether, the framework forms a weak hierarchical n -categorical structure linking schema semantics, cognition, embodiment, architectural abstraction, and world-level interaction. Comments: 43 pages, 3 figures Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2604.10589 [cs.AI] (or arXiv:2604.10589v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.10589 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-122] AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Afford Correspondence

【速读】:该论文旨在解决机器人操作中因几何变化导致的性能受限问题,尤其在数据多样性不足时,传统模仿学习方法难以泛化到未见场景。其解决方案的关键在于提出AffordGen框架,利用大规模3D生成模型和视觉基础模型(Vision Foundation Models, VFMs)提取有意义关键点之间的语义对应关系,从而生成具有功能感知(affordance-aware)的新机器人操作轨迹。该方法构建了一个大规模、语义驱动的数据集,并用于训练闭环视觉-运动策略,融合了语义泛化能力与端到端学习的鲁棒反应性,显著提升了数据效率并实现了对真正未见物体的零样本泛化能力。

链接: https://arxiv.org/abs/2604.10579
作者: Jiawei Zhang,Kaizhe Hu,Yingqian Huang,Yuanchen Ju,Zhengrong Xue,Huazhe Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the recent success of modern imitation learning methods in robot manipulation, their performance is often constrained by geometric variations due to limited data diversity. Leveraging powerful 3D generative models and vision foundation models (VFMs), the proposed AffordGen framework overcomes this limitation by utilizing the semantic correspondence of meaningful keypoints across large-scale 3D meshes to generate new robot manipulation trajectories. This large-scale, affordance-aware dataset is then used to train a robust, closed-loop visuomotor policy, combining the semantic generalizability of affordances with the reactive robustness of end-to-end learning. Experiments in simulation and the real world show that policies trained with AffordGen achieve high success rates and enable zero-shot generalization to truly unseen objects, significantly improving data efficiency in robot learning.

[AI-123] he Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

【速读】:该论文旨在解决当前计算机使用代理(Computer-use Agents, CUAs)在用户指令完全 benign 的情况下,因任务环境或执行结果引发的隐蔽性安全威胁问题。现有安全评估主要聚焦于显式攻击(如滥用和提示注入),忽略了此类“无意攻击”场景下的风险。其解决方案的关键在于提出 OS-BLIND 基准测试框架,涵盖 300 个由人工设计的任务,覆盖 12 类场景、8 种应用,并识别两类威胁:环境嵌入式威胁与代理发起的危害。实验表明,多数前沿模型在该基准下攻击成功率(Attack Success Rate, ASR)超过 90%,即使经过安全对齐的 Claude 4.5 Sonnet 模型也达到 73.0% ASR,且在多代理系统中进一步上升至 92.7%。研究揭示了当前安全机制在良性指令下保护能力有限,且安全对齐通常仅在初始几步激活,后续执行中难以重新触发,尤其在多代理系统中子任务分解会掩盖有害意图,导致模型失效。

链接: https://arxiv.org/abs/2604.10577
作者: Xuwei Ding,Skylar Zhai,Linxin Song,Jiate Li,Taiwei Shi,Nicholas Meade,Siva Reddy,Jian Kang,Jieyu Zhao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 63 pages

点击查看摘要

Abstract:Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit threats such as misuse and prompt injection, but overlook a subtle yet critical setting where user instructions are entirely benign and harm arises from the task context or execution outcome. We introduce OS-BLIND, a benchmark that evaluates CUAs under unintended attack conditions, comprising 300 human-crafted tasks across 12 categories, 8 applications, and 2 threat clusters: environment-embedded threats and agent-initiated harms. Our evaluation on frontier models and agentic frameworks reveals that most CUAs exceed 90% attack success rate (ASR), and even the safety-aligned Claude 4.5 Sonnet reaches 73.0% ASR. More interestingly, this vulnerability becomes even more severe, with ASR rising from 73.0% to 92.7% when Claude 4.5 Sonnet is deployed in multi-agent systems. Our analysis further shows that existing safety defenses provide limited protection when user instructions are benign. Safety alignment primarily activates within the first few steps and rarely re-engages during subsequent execution. In multi-agent systems, decomposed subtasks obscure the harmful intent from the model, causing safety-aligned models to fail. We will release our OS-BLIND to encourage the broader research community to further investigate and address these safety challenges.

[AI-124] Failure Ontology: A Lifelong Learning Framework for Blind Spot Detection and Resilience Design

【速读】:该论文试图解决个性化学习系统长期忽视的核心问题:人类重大失败(如财务破产、健康崩溃、职业过时)往往并非源于知识获取不足,而是由于认知地图中存在“本体论盲点”(Ontological Blind Spots),即个体从未意识到某些概念领域存在或重要。解决方案的关键在于提出Failure Ontology (F) 框架,其核心包括三方面创新:(1) 构建四类盲点分类体系(领域盲视、结构盲视、权重盲视、时间盲视)以识别盲点类型;(2) 揭示五种盲点与外部扰动交互导致灾难性后果的模式;(3) 提出失败学习效率定理(Failure Learning Efficiency Theorem),证明在历史数据有限条件下,基于失败的学习比基于成功的学习具有更高的样本效率。

链接: https://arxiv.org/abs/2604.10549
作者: Yuan Sun,Hong Yi,Jinyuan Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Personalized learning systems are almost universally designed around a single objective: help people acquire knowledge and skills more efficiently. We argue this framing misses the more consequential problem. The most damaging failures in human life-financial ruin, health collapse, professional obsolescence-are rarely caused by insufficient knowledge acquisition. They arise from the systematic absence of entire conceptual territories from a person’s cognitive map: domains they never thought to explore because, from within their existing worldview, those domains did not appear to exist or to matter. We call such absences Ontological Blind Spots and introduce Failure Ontology (F), a formal framework for detecting, classifying, and remediating them across a human lifetime. The framework introduces three original contributions: (1) a four-type taxonomy of blind spots distinguishing domain blindness, structural blindness, weight blindness, and temporal blindness; (2) five convergent failure patterns characterizing how blind spots interact with external disruption to produce catastrophic outcomes; and (3) the Failure Learning Efficiency Theorem, proving that failure-based learning achieves higher sample efficiency than success-based learning under bounded historical data. We illustrate the framework through historical case analysis of the 1997 Asian Financial Crisis and the 2008 subprime mortgage crisis, and through alongitudinal individual case study spanning five life stages.

[AI-125] Agent 2 RL-Bench: Can LLM Agents Agent s Engineer Agentic RL Post-Training?

【速读】:该论文旨在解决当前强化学习(Reinforcement Learning, RL)后训练评估中缺乏对智能体自主设计、实现和运行完整RL流水线能力的测试问题,即现有基准大多静态且依赖监督微调(Supervised Fine-Tuning, SFT),无法评估LLM智能体在交互式RL工程中的实际表现。其解决方案的关键在于提出Agent² RL-Bench,这是一个分层结构的基准测试框架,包含六个任务,覆盖从静态规则训练到闭环在线RL与轨迹收集的三个层级,每一层级引入前序层级未要求的结构性约束;同时提供隔离工作空间、评分API、运行时仪器化记录及自动化事后分析机制,首次实现了对代理驱动的后训练行为的自动诊断与量化评估。

链接: https://arxiv.org/abs/2604.10547
作者: Wanyi Chen,Xiao Yang,Xu Yang,Tianming Sha,Qizheng Li,Zhuo Wang,Bowen Xian,Fang Kong,Weiqing Liu,Jiang Bian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 36 pages, 9 figures, 22 tables

点击查看摘要

Abstract:We introduce Agent^2 RL-Bench, a benchmark for evaluating agentic RL post-training – whether LLM agents can autonomously design, implement, and run complete RL pipelines that improve foundation models. This capability is important because RL post-training increasingly drives model alignment and specialization, yet existing benchmarks remain largely static: supervised fine-tuning alone yields strong results, leaving interactive RL engineering untested. Agent^2 RL-Bench addresses this with six tasks across three levels – from static rule-based training to closed-loop online RL with trajectory collection – each adding a structural requirement that prior levels do not impose. The benchmark provides isolated workspaces with a grading API, runtime instrumentation that records every submission and code revision, and automated post-hoc analysis that generates structured run reports, enabling the first automated diagnostic of agent-driven post-training behavior. Across multiple agent stacks spanning five agent systems and six driver LLMs, we find that agents achieve striking interactive gains – on ALFWorld, an RL-only agent improves from 5.97 to 93.28 via SFT warm-up and GRPO with online rollouts – yet make only marginal progress on others (DeepSearchQA: +2.75 within evaluation noise), and that driver choice has a large effect on interactive tasks – within the same scaffold, switching drivers changes interactive improvement from near-zero to +78pp. More broadly, the benchmark reveals that supervised pipelines dominate agent-driven post-training under fixed budgets, with online RL succeeding as the final best route only on ALFWorld. Code is available at this https URL.

[AI-126] WaveMoE: A Wavelet-Enhanced Mixture-of-Experts Foundation Model for Time Series Forecasting ICLR2026

【速读】:该论文旨在解决时间序列基础模型(Time Series Foundation Models, TSFMs)在建模复杂时序模式(如周期性和局部高频动态)时的局限性,尤其是如何有效利用频域信息以提升预测性能的问题。解决方案的关键在于提出WaveMoE——一种基于小波变换增强的专家混合(Mixture-of-Experts, MoE)基础模型,其采用双路径架构同时处理时域序列标记(time series tokens)与小波域标记(wavelet tokens),并通过共享的专家路由机制实现专家专业化分工与模型容量的高效扩展,从而在16个多样化基准数据集上验证了引入小波域语料库对预测性能的显著提升潜力。

链接: https://arxiv.org/abs/2604.10544
作者: Shunyu Wu,Jiawei Huang,Weibin Feng,Boxin Li,Xiao Zhang,Erli Meng,Dan Li,Jian Lou,See-Kiong Ng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Presented at ICLR 2026 TSALM Workshop (1st Workshop on Time Series in the Age of Large Models)

点击查看摘要

Abstract:Time series foundation models (TSFMs) have recently achieved remarkable success in universal forecasting by leveraging large-scale pretraining on diverse time series data. Complementing this progress, incorporating frequency-domain information yields promising performance in enhancing the modeling of complex temporal patterns, such as periodicity and localized high-frequency dynamics, which are prevalent in real-world time series. To advance this direction, we propose a new perspective that integrates explicit frequency-domain representations into scalable foundation models, and introduce WaveMoE, a wavelet-enhanced mixture-of-experts foundation model for time series forecasting. WaveMoE adopts a dual-path architecture that jointly processes time series tokens and wavelet tokens aligned along a unified temporal axis, and coordinates them through a shared expert routing mechanism that enables consistent expert specialization while efficiently scaling model capacity. Preliminary experimental results on 16 diverse benchmark datasets indicate that WaveMoE has the potential to further improve forecasting performance by incorporating wavelet-domain corpora.

[AI-127] VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories

【速读】:该论文旨在解决视频到音频(Video-to-Audio, V2A)生成模型评估体系不完善的问题,特别是现有基准测试未能区分不同音频类别(如音效、音乐、语音和歌唱)的细粒度需求,导致评估结果缺乏针对性和指导意义。其解决方案的关键在于提出 VidAudio-Bench,一个面向多任务的 V2A 评估基准,具备四大核心特征:(1) 覆盖四大典型音频类别并支持 V2A 和 Video-Text-to-Audio (VT2A) 两种设置;(2) 包含 1,634 个视频-文本对并评测 11 种先进生成模型;(3) 设计 13 个任务特定、无需参考的指标以系统评估音频质量、视频-音频一致性及文本-音频一致性;(4) 通过主观实验验证所有指标与人类偏好高度一致。该框架不仅提升了评估的全面性和可扩展性,还揭示了当前模型在语音和歌唱生成上的薄弱表现,以及 VT2A 中指令遵循与视觉引导之间的权衡关系。

链接: https://arxiv.org/abs/2604.10542
作者: Qian Zhang,Yuqin Cao,Yixuan Gao,Xiongkuo Min
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Video-to-Audio (V2A) generation is essential for immersive multimedia experiences, yet its evaluation remains underexplored. Existing benchmarks typically assess diverse audio types under a unified protocol, overlooking the fine-grained requirements of distinct audio categories. To address this gap, we propose VidAudio-Bench, a multi-task benchmark for V2A evaluation with four key features: (1) Broad Coverage: It encompasses four representative audio categories - sound effects, music, speech, and singing - under both V2A and Video-Text-to-Audio (VT2A) settings. (2) Extensive Evaluation: It comprises 1,634 video-text pairs and benchmarks 11 state-of-the-art generation models. (3) Comprehensive Metrics: It introduces 13 task-specific, reference-free metrics to systematically assess audio quality, video-audio consistency, and text-audio consistency. (4) Human Alignment: It validates all metrics through subjective studies, demonstrating strong consistency with human preferences. Experimental results reveal that current V2A models perform poorly in speech and singing compared to sound effects. Our VT2A results further highlight a fundamental tension between instruction following and visually grounded generation: stronger visual conditioning improves video-audio alignment, but often at the cost of generating the intended audio category. These findings establish VidAudio-Bench as a comprehensive and scalable framework for diagnosing V2A systems and provide new insights into multimodal audio generation.

[AI-128] IceCache: Memory-efficient KV-cache Management for Long-Sequence LLM s

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中因Key-Value (KV) cache内存占用随序列长度线性增长而导致的内存瓶颈问题,尤其是在资源受限硬件上的长序列生成任务中,现有缓存卸载方法因token选择不精确而造成性能下降。其解决方案的关键在于提出一种名为IceCache的新颖KV缓存管理策略,该策略融合语义token聚类与PagedAttention机制,通过将语义相关的token组织为连续内存区域,并借助分层且可动态更新的数据结构实现更高效的token选择和CPU-GPU传输过程中的带宽利用,从而显著降低内存使用量并保持高精度和低延迟。

链接: https://arxiv.org/abs/2604.10539
作者: Yuzhen Mao,Qitong Wang,Martin Ester,Ke Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Key-Value (KV) cache plays a crucial role in accelerating inference in large language models (LLMs) by storing intermediate attention states and avoiding redundant computation during autoregressive generation. However, its memory footprint scales linearly with sequence length, often leading to severe memory bottlenecks on resource-constrained hardware. Prior work has explored offloading KV cache to the CPU while retaining only a subset on the GPU, but these approaches often rely on imprecise token selection and suffer performance degradation in long-generation tasks such as chain-of-thought reasoning. In this paper, we propose a novel KV cache management strategy, IceCache, which integrates semantic token clustering with PagedAttention. By organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure, our method enables more efficient token selection and better utilization of memory bandwidth during CPU-GPU transfers. Experimental results on LongBench show that, with a 256-token budget, IceCache maintains 99% of the original accuracy achieved by the full KV cache model. Moreover, compared to other offloading-based methods, IceCache attains competitive or even superior latency and accuracy while using only 25% of the KV cache token budget, demonstrating its effectiveness in long-sequence scenarios. The code is available on our project website at this https URL.

[AI-129] Machine Learning-Based Detection of MCP Attacks

【速读】:该论文旨在解决模型上下文协议(Model Context Protocol, MCP)在扩展大语言模型功能的同时引入的新安全风险问题,特别是针对恶意MCP工具描述的检测不足这一研究空白。其解决方案的关键在于构建并评估多种监督学习方法(包括传统机器学习与深度学习模型),用于区分恶意与良性工具(二分类任务)以及识别具体攻击类型(多分类任务)。实验表明,部分模型在二分类任务中达到100% F1分数,在多分类任务中支持向量机(SVC)和BERT模型分别取得90.56%和88.33%的F1分数,显著优于现有基于规则的基线方案;同时开发了中间件以实现实时分类和阻断不安全MCP工具,为实际部署提供了可行路径。

链接: https://arxiv.org/abs/2604.10534
作者: Tobias Mattsson,Samuel Nyberg,Anton Borg,Ricardo Britto
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:The Model Context Protocol (MCP) is a new and emerging technology that extends the functionality of large language models, improving workflows but also exposing users to a new attack surface. Several studies have highlighted related security flaws, but MCP attack detection remains underexplored. To address this research gap, this study develops and evaluates a range of supervised machine learning approaches, including both traditional and deep-learning models. We evaluated the systems on the detection of malicious MCP tool descriptions in two scenarios: (1) a binary classification task distinguishing malicious from benign tools, and (2) a multiclass classification task identifying the attack type while separating benign from malicious tools. In addition to the machine learning models, we compared a rule-based approach that serves as a baseline. The results indicate that several of the developed models achieved 100% F1-score on the binary classification task. In the multiclass scenario, the SVC and BERT models performed best, achieving F1 scores of 90.56% and 88.33%, respectively. Confusion matrices were also used to visualize the full distribution of predictions often missed by traditional metrics, providing additional insight for selecting the best-fitting solution in real-world scenarios. This study presents an addition to the MCP defence area, showing that machine learning models can perform exceptionally well in separating malicious and benign data points. To apply the solution in a live environment, a middleware was developed to classify which MCP tools are safe to use before execution, and block the ones that are not safe. Furthermore, the study shows that these models can outperform traditional rule-based solutions currently in use in the field.

[AI-130] PepBenchmark: A Standardized Benchmark for Peptide Machine Learning

【速读】:该论文旨在解决肽类药物(peptide therapeutics)在机器学习(Machine Learning, ML)研究中缺乏标准化基准的问题,从而阻碍了算法性能的可比性和方法学进步。其解决方案的关键在于构建一个统一的评估框架——PepBenchmark,该框架包含三个核心组件:(1) PepBenchData,涵盖29个经典肽和6个非经典肽数据集的高质量、结构化资源;(2) PepBenchPipeline,提供标准化的数据预处理流程以确保数据清洗、划分与特征转换的一致性;(3) PepBenchLeaderboard,定义统一的评估协议并集成四大类主流模型(指纹基、图神经网络基、预训练语言模型基和SMILES基)作为强基线。这一系统性方案首次为肽类药物发现提供了可比较、可复现的研究基础,推动AI方法向实际应用转化。

链接: https://arxiv.org/abs/2604.10531
作者: Jiahui Zhang,Rouyi Wang,Kuangqi Zhou,Tianshu Xiao,Lingyan Zhu,Yaosen Min,Yang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Peptide therapeutics are widely regarded as the “third generation” of drugs, yet progress in peptide Machine Learning (ML) are hindered by the absence of standardized benchmarks. Here we present PepBenchmark, which unifies datasets, preprocessing, and evaluation protocols for peptide drug discovery. PepBenchmark comprises three components: (1) PepBenchData, a well-curated collection comprising 29 canonical-peptide and 6 non-canonical-peptide datasets across 7 groups, systematically covering key aspects of peptide drug development, representing, to the best of our knowledge, the most comprehensive AI-ready dataset resource to date; (2) PepBenchPipeline, a standardized preprocessing pipeline that ensures consistent dataset cleaning, construction, splitting, and feature transformation, mitigating quality issues common in ad hoc pipelines; and (3) PepBenchLeaderboard, a unified evaluation protocol and leaderboard with strong baselines across 4 major methodological families: Fingerprint-based, GNN-based, PLM-based, and SMILES-based models. Together, PepBenchmark provides the first standardized and comparable foundation for peptide drug discovery, facilitating methodological advances and translation into real-world applications. The data and code are publicly available at this https URL.

[AI-131] From Perception to Planning : Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning

【速读】:该论文旨在解决当前视觉-语言模型在具身(embodied)和第一人称视角(egocentric)任务中面临的复杂时空推理能力不足的问题,其核心挑战在于模型依赖从被动视频数据中学到的时间先验(temporal priors),导致在动态环境中出现时空幻觉(spatiotemporal hallucinations)并泛化性能差。解决方案的关键在于提出EgoTSR框架,该框架采用课程学习(curriculum-based)范式,引导模型从显式的空间理解逐步演化为内部化的任务状态评估,最终实现长时程规划;同时构建了包含4600万样本的EgoTSR-Data数据集,分三个阶段组织监督信号:Chain-of-Thought(CoT)监督、弱监督标签和长时程序列,从而有效消除时间顺序偏差,在长时逻辑推理任务上达到92.4%准确率,显著优于现有开源与闭源最先进模型。

链接: https://arxiv.org/abs/2604.10517
作者: Xiaoda Yang,Yuxiang Liu,Shenzhou Gao,Can Wang,Jingyang Xue,Lixin Yang,Yao Mu,Tao Jin,Shuicheng Yan,Zhimeng Zhang,Zhou Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern vision-language models achieve strong performance in static perception, but remain limited in the complex spatiotemporal reasoning required for embodied, egocentric tasks. A major source of failure is their reliance on temporal priors learned from passive video data, which often leads to spatiotemporal hallucinations and poor generalization in dynamic environments. To address this, we present EgoTSR, a curriculum-based framework for learning task-oriented spatiotemporal reasoning. EgoTSR is built on the premise that embodied reasoning should evolve from explicit spatial understanding to internalized task-state assessment and finally to long-horizon planning. To support this paradigm, we construct EgoTSR-Data, a large-scale dataset comprising 46 million samples organized into three stages: Chain-of-Thought (CoT) supervision, weakly supervised tagging, and long-horizon sequences. Extensive experiments demonstrate that EgoTSR effectively eliminates chronological biases, achieving 92.4% accuracy on long-horizon logical reasoning tasks while maintaining high fine-grained perceptual precision, significantly outperforming existing open-source and closed-source state-of-the-art models.

[AI-132] Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis

【速读】:该论文旨在解决生成式 AI(Generative AI)代理在实际应用中因系统提示(system prompts)表述不明确或模糊而导致的行为性能不稳定问题。其核心挑战在于,代理行为由大型语言模型(Large Language Models, LLMs)根据提示解析决定,而提示的歧义性会显著影响任务执行的一致性和准确性。解决方案的关键在于构建一个可增量适应的分析流水线(analytics pipeline),该流水线嵌入于开源工具 Agent Mentor 中,通过分析代理运行日志中生成的内部系统提示,识别与不良行为相关的语义特征,并据此自动生成纠正指令,进而系统性地注入到代理的知识库中以优化其行为表现。实验表明,该方法在多种代理配置下均能实现稳定且可量化的性能提升,尤其在规范模糊性较高的场景中效果显著。

链接: https://arxiv.org/abs/2604.10513
作者: Roi Ben-Gigi,Yuval David,Fabiana Fournier,Lior Limonad,Dany Moshkovich,Hadar Mulian,Segev Shlomov
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:AI agent development relies heavily on natural language prompting to define agents’ tasks, knowledge, and goals. These prompts are interpreted by Large Language Models (LLMs), which govern agent behavior. Consequently, agentic performance is susceptible to variability arising from imprecise or ambiguous prompt formulations. Identifying and correcting such issues requires examining not only the agent’s code, but also the internal system prompts generated throughout its execution lifecycle, as reflected in execution logs. In this work, we introduce an analytics pipeline implemented as part of the Agent Mentor open-source library that monitors and incrementally adapts the system prompts defining another agent’s behavior. The pipeline improves performance by systematically injecting corrective instructions into the agent’s knowledge. We describe its underlying mechanism, with particular emphasis on identifying semantic features associated with undesired behaviors and using them to derive corrective statements. We evaluate the proposed pipeline across three exemplar agent configurations and benchmark tasks using repeated execution runs to assess effectiveness. These experiments provide an initial exploration of automating such a mentoring pipeline within future agentic governance frameworks. Overall, the approach demonstrates consistent and measurable accuracy improvements across diverse configurations, particularly in settings dominated by specification ambiguity. For reproducibility, we released our code as open source under the Agent Mentor library. Comments: 10 pages, 5 figures Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2604.10513 [cs.AI] (or arXiv:2604.10513v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.10513 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-133] How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在代码生成任务中首次尝试成功率低的问题,尤其是在当前主流基准测试(如HumanEval和MBPP)多采用单次生成评估方式的情况下,难以反映模型的真实能力。解决方案的关键在于引入迭代自修复(iterative self-repair)机制——即通过将执行错误反馈给模型以进行修正,从而提升最终正确率。实验表明,该方法在七种不同架构与来源的模型上均显著有效,pass率提升达4.9至30.0个百分点,且无需微调仅靠提示工程即可实现,尤其在8B规模模型中也表现出色;此外,研究还揭示了逻辑错误(assertion errors)最难修复(约45%修复率),而语法和命名错误则易于修正,进一步阐明了LLM自纠错能力的边界。

链接: https://arxiv.org/abs/2604.10508
作者: Johin Johny Arimbur
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 11 pages, 7 figures, 8 tables

点击查看摘要

Abstract:Large language models frequently fail to produce correct code on their first attempt, yet most benchmarks evaluate them in a single-shot setting. We investigate iterative self-repair (feeding execution errors back to the model for correction) across seven models spanning three families and both open-weight and proprietary providers: Llama 3.1 8B, Llama 3.3 70B, Llama 4 Scout (MoE, 16 experts), Llama 4 Maverick (MoE, 128 experts), Qwen3 32B, Gemini 2.5 Flash, and Gemini 2.5 Pro. On HumanEval (164 problems) and MBPP Sanitized (257 problems) with up to five attempts, self-repair universally improves pass rates: +4.9 to +17.1 pp on HumanEval and +16.0 to +30.0 pp on MBPP. Gemini 2.5 Flash achieves the highest final pass rates (96.3% HumanEval, 93.8% MBPP). Most gains concentrate in the first two this http URL-type analysis shows assertion errors (logical mistakes) are the hardest to repair at ~45%, while syntax and name errors are repaired at substantially higher rates, connecting to broader findings on the limits of LLM self-correction. Prior work found that weaker models fail at self-repair or require fine-tuning; we show that modern instruction-tuned models succeed with prompting alone, even at 8B scale. We also provide the first comparison of dense and MoE architectures for self-repair, and extend the repair-vs-resampling tradeoff analysis to modern models. A prompt ablation reveals chain-of-thought repair yields up to +5.5 pp additional self-repair gain (measured as improvement in repair delta) over minimal prompting for capable models.

[AI-134] A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在时空推理(spatiotemporal reasoning)中面临的“多图像推理幻觉”问题,即模型在处理正向与反向时间顺序查询时性能差异巨大,反映出其依赖表面线索而非真正的因果理解。解决方案的关键在于构建一个基于思维链(Chain-of-Thought, CoT)的新数据集,将复杂的时空推理过程分解为细粒度的步骤和明确判断,并采用渐进式训练框架:首先在CoT数据集上进行监督预训练以建立逻辑结构,再利用可扩展的弱标签数据进行微调以提升泛化能力。实验表明,该方法不仅提升了基础模型准确率,还将正向与反向查询间的性能差距从超过70%降至6.53%,验证了其在增强真实动态推理能力和缓解VLM固有时间偏差方面的有效性。

链接: https://arxiv.org/abs/2604.10506
作者: Xiaoda Yang,Shuai Yang,Can Wang,Jingyang Xue,Menglan Tang,Checheng Yu,Xunzhe Zhou,Sashuai Zhou,Tao Jin,Lixin Yang,Xiangyu Yue,Zhou Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have made significant strides in static image understanding but continue to face critical hurdles in spatiotemporal reasoning. A major bottleneck is “multi-image reasoning hallucination”, where a massive performance drop between forward and reverse temporal queries reveals a dependence on superficial shortcuts instead of genuine causal understanding. To mitigate this, we first develop a new Chain-of-Thought (CoT) dataset that decomposes intricate reasoning into detailed spatiotemporal steps and definitive judgments. Building on this, we present a progressive training framework: it initiates with supervised pre-training on our CoT dataset to instill logical structures, followed by fine-tuning with scalable weakly-labeled data for broader generalization. Our experiments demonstrate that this approach not only improves backbone accuracy but also slashes the forward-backward performance gap from over 70% to only 6.53%. This confirms the method’s ability to develop authentic dynamic reasoning and reduce the inherent temporal biases of current VLMs.

[AI-135] CARO: Chain-of-Analogy Reasoning Optimization for Robust Content Moderation ACL2026

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在内容审核任务中因上下文中的误导性“决策捷径”(decision shortcuts)而导致对模糊案例判断失效的问题。其核心解决方案是提出一种名为 \caro(Chain-of-Analogy Reasoning Optimization)的两阶段训练框架,关键在于通过检索增强生成(Retrieval-Augmented Generation, RAG)构建类比推理链并进行监督微调(Supervised Fine-Tuning, SFT),再结合定制化的直接偏好优化(Direct Preference Optimization, DPO)显式强化类比推理行为;同时,在推理阶段动态生成个性化类比参考,从而有效缓解有害决策捷径的影响,显著提升模型在复杂模糊内容审核场景下的性能表现。

链接: https://arxiv.org/abs/2604.10504
作者: Bingzhe Wu,Haotian Lu,Yuchen Mou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026 findings; under official publication process

点击查看摘要

Abstract:Current large language models (LLMs), even those explicitly trained for reasoning, often struggle with ambiguous content moderation cases due to misleading “decision shortcuts” embedded in context. Inspired by cognitive psychology insights into expert moderation, we introduce \caro (Chain-of-Analogy Reasoning Optimization), a novel two-stage training framework to induce robust analogical reasoning in LLMs. First, \caro bootstraps analogical reasoning chains via retrieval-augmented generation (RAG) on moderation data and performs supervised fine-tuning (SFT). Second, we propose a customized direct preference optimization (DPO) approach to reinforce analogical reasoning behaviors explicitly. Unlike static retrieval methods, \caro dynamically generates tailored analogical references during inference, effectively mitigating harmful decision shortcuts. Extensive experiments demonstrate that \caro substantially outperforms state-of-the-art reasoning models (DeepSeek R1, QwQ), specialized moderation models (LLaMA Guard), and advanced fine-tuning and retrieval-augmented methods, achieving an average F1 score improvement of 24.9% on challenging ambiguous moderation benchmarks.

[AI-136] Cross-Cultural Bias in Mel-Scale Representations: Evidence and Alternatives from Speech and Music ICASSP2026

【速读】:该论文旨在解决音频前端处理中因采用基于20世纪40年代西方心理声学研究的梅尔尺度(Mel-scale)表示而可能引入的文化偏见问题,这种偏见导致了跨语言和跨文化场景下的系统性能差异。其核心解决方案在于通过对比可学习的前端方法(如LEAF、SincNet)与多种心理声学变体(如ERB、Bark、CQT),在语音识别、音乐分析和欧洲声景分类任务中系统性地评估并缓解此类偏差。关键创新在于发现自适应频率分解机制(如LEAF的动态频域分配)和基于感知更贴近人类听觉系统的替代尺度(如CQT和ERB)能够显著降低性能差距——例如LEAF使声调语言与非声调语言之间的词错误率(WER)差距缩小34%,CQT在音乐任务中减少F1分数差距52%,且ERB滤波仅增加1%计算开销即可实现31%的公平性提升。这表明基础信号处理选择对模型公平性具有深远影响,为构建更具包容性的音频处理系统提供了可落地的技术路径。

链接: https://arxiv.org/abs/2604.10503
作者: Shivam Chauhan,Ajay Pundhir
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 5 pages, 3 figures, 4 tables. Accepted at ICASSP 2026

点击查看摘要

Abstract:Modern audio systems universally employ mel-scale representations derived from 1940s Western psychoacoustic studies, potentially encoding cultural biases that create systematic performance disparities. We present a comprehensive evaluation of cross-cultural bias in audio front-ends, comparing mel-scale features with learnable alternatives (LEAF, SincNet) and psychoacoustic variants (ERB, Bark, CQT) across speech recognition (11 languages), music analysis (6 collections), and European acoustic scene classification (10 European cities). Our controlled experiments isolate front-end contributions while holding architecture and training protocols minimal and constant. Results demonstrate that mel-scale features yield 31.2% WER for tonal languages compared to 18.7% for non-tonal languages (12.5% gap), and show 15.7% F1 degradation between Western and non-Western music. Alternative representations significantly reduce these disparities: LEAF reduces the speech gap by 34% through adaptive frequency allocation, CQT achieves 52% reduction in music performance gaps, and ERB-scale filtering cuts disparities by 31% with only 1% computational overhead. We also release FairAudioBench, enabling cross-cultural evaluation, and demonstrate that adaptive frequency decomposition offers practical paths toward equitable audio processing. These findings reveal how foundational signal processing choices propagate bias, providing crucial guidance for developing inclusive audio systems.

[AI-137] CHAIRO: Contextual Hierarchical Analogical Induction and Reasoning Optimization for LLM s ACL2026

【速读】:该论文旨在解决在线平台内容审核中因用户生成内容日益复杂以及传统规则驱动和机器学习方法局限性所导致的挑战,尤其是现有大语言模型(Large Language Models, LLMs)在泛化能力、可解释性和应对未见或模糊案例时的不足。其解决方案的关键在于提出一种基于类比示例(analogical examples)的新型审核框架,通过端到端优化类比检索、规则生成与审核分类三个环节,实现审核规则对多样化内容场景的动态适应。该方法显著提升了审核准确率和规则质量,并在人类评估与外部模型泛化测试中验证了其规则的清晰性、可解释性与适用性,从而推动了更鲁棒、可解释且具备泛化能力的内容审核技术发展。

链接: https://arxiv.org/abs/2604.10502
作者: Haotian Lu,Yuchen Mou,Bingzhe Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026 main conference; under official publication process

点击查看摘要

Abstract:Content moderation in online platforms faces persistent challenges due to the evolving complexity of user-generated content and the limitations of traditional rule-based and machine learning approaches. While recent advances in large language models (LLMs) have enabled more sophisticated moderation via direct prompting or fine-tuning, these approaches often exhibit limited generalization, interpretability, and adaptability to unseen or ambiguous cases. In this work, we propose a novel moderation framework that leverages analogical examples to enhance rule induction and decision reliability. Our approach integrates end-to-end optimization of analogical retrieval, rule generation, and moderation classification, enabling the dynamic adaptation of moderation rules to diverse content scenarios. Through comprehensive experiments, we demonstrate that our method significantly outperforms both rule-injected fine-tuning baselines and multi-stage static RAG pipelines in terms of moderation accuracy and rule quality. Further evaluations, including human assessments and external model generalization tests, confirm that our framework produces rules with better clarity, interpretability, and applicability. These findings show that analogical example-driven methods can advance robust, explainable, and generalizable content moderation in real-world applications. Comments: Accepted to ACL 2026 main conference; under official publication process Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2604.10502 [cs.AI] (or arXiv:2604.10502v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.10502 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-138] racing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLM s

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)后训练数据构建过程中存在的系统性问题,即数据集常被视为孤立实体,忽视了其在演化过程中的内在关联与依赖关系。为此,作者提出引入“数据谱系”(data lineage)概念,并设计了一个自动化多智能体框架以重建数据集发展的演化图谱。解决方案的关键在于通过谱系分析识别出领域特异性的结构模式(如数学类数据集的垂直精炼和通用领域语料的水平聚合),并揭示系统性缺陷(如隐式数据集交集引发的结构冗余及基准污染沿谱系路径传播的问题)。进一步地,利用重构的谱系图生成一种谱系感知的多样性导向数据集,通过锚定指令采样于上游源头,有效缓解下游同质化与隐藏冗余,从而实现更可控、系统的后训练数据治理。

链接: https://arxiv.org/abs/2604.10480
作者: Yu Li,Xiaoran Shang,Qizhi Pei,Yun Zhu,Xin Gao,Honglin Lin,Zhanping Zhong,Zhuoshi Pan,Zheng Liu,Xiaoyang Wang,Conghui He,Dahua Lin,Feng Zhao,Lijun Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 27 pages, 6 figures

点击查看摘要

Abstract:Post-training data plays a pivotal role in shaping the capabilities of Large Language Models (LLMs), yet datasets are often treated as isolated artifacts, overlooking the systemic connections that underlie their evolution. To disentangle these complex relationships, we introduce the concept of \textbfdata lineage to the LLM ecosystem and propose an automated multi-agent framework to reconstruct the evolutionary graph of dataset development. Through large-scale lineage analysis, we characterize domain-specific structural patterns, such as vertical refinement in math-oriented datasets and horizontal aggregation in general-domain corpora. Moreover, we uncover pervasive systemic issues, including \textitstructural redundancy induced by implicit dataset intersections and the \textitpropagation of benchmark contamination along lineage paths. To demonstrate the practical value of lineage analysis for data construction, we leverage the reconstructed lineage graph to create a \textitlineage-aware diversity-oriented dataset. By anchoring instruction sampling at upstream root sources, this approach mitigates downstream homogenization and hidden redundancy, yielding a more diverse post-training corpus. We further highlight lineage-centric analysis as an efficient and robust topological alternative to sample-level dataset comparison for large-scale data ecosystems. By grounding data construction in explicit lineage structures, our work advances post-training data curation toward a more systematic and controllable paradigm.

[AI-139] PEMANT: Persona-Enriched Multi-Agent Negotiation for Travel

【速读】:该论文旨在解决现有家庭层面出行生成建模方法中预测能力有限的问题,特别是传统机器学习模型难以捕捉个体行为理论和家庭内部互动动态的局限性。其解决方案的关键在于提出一种名为Persona-Enriched Multi-Agent Negotiation for Travel (PEMANT) 的新型大语言模型(LLM)框架,该框架首先基于行为理论构建个性化人物画像(persona),并将静态社会人口学特征转化为包含态度、主观规范和感知行为控制的家庭级叙事表征;随后通过结构化的多智能体协商机制模拟真实世界中的家庭出行决策过程,其中引入了新颖的人物对齐控制机制以增强家庭成员间的行为一致性与合理性,从而显著提升出行生成建模的准确性与现实性。

链接: https://arxiv.org/abs/2604.10475
作者: Yuran Sun,Mustafa Sameen,Yaotian Zhang,Chia-yu Wu,Xilei Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modeling household-level trip generation is fundamental to accurate demand forecasting, traffic flow estimation, and urban system planning. Existing studies were mostly based on classical machine learning models with limited predictive capability, while recent LLM-based approaches have yet to incorporate behavioral theory or intra-household interaction dynamics, both of which are critical for modeling realistic collective travel decisions. To address these limitations, we propose a novel LLM-based framework, named Persona-Enriched Multi-Agent Negotiation for Travel (PEMANT), which first integrates behavioral theory for individualized persona modeling and then conducts household-level trip planning negotiations via a structured multi-agent conversation. Specifically, PEMANT transforms static sociodemographic attributes into coherent narrative profiles that explicitly encode household-level attitudes, subjective norms, and perceived behavioral controls, following our proposed Household-Aware Chain-of-Planned-Behavior (HA-CoPB) framework. Building on these theory-grounded personas, PEMANT captures real-world household decision negotiation via a structured two-phase multi-agent conversation framework with a novel persona-alignment control mechanism. Evaluated on both national and regional household travel survey datasets, PEMANT consistently outperforms state-of-the-art benchmarks across datasets.

[AI-140] owards Green Wearable Computing: A Physics-Aware Spiking Neural Network for Energy-Efficient IMU-based Human Activity Recognition

【速读】:该论文旨在解决可穿戴惯性测量单元(Inertial Measurement Unit, IMU)-based人体活动识别(Human Activity Recognition, HAR)中深度神经网络(Deep Neural Networks, DNNs)因计算与存储开销大、功耗高而难以部署于电池受限边缘设备的问题。其核心解决方案是提出一种物理感知脉冲神经网络(Physics-Aware Spiking Neural Network, PAS-Net),关键创新在于:空间上通过自适应对称拓扑混合器引入人体关节物理约束,时间上设计O(1)内存因果神经调制器实现上下文感知的动态阈值神经元,从而有效应对非平稳运动节奏;同时利用时间脉冲误差目标函数,支持连续IMU流中的灵活早退出机制,显著降低动态能耗达98%,在保持SOTA准确率的同时实现了全整数运算、稀疏0.1 pJ级能量效率的超低功耗架构。

链接: https://arxiv.org/abs/2604.10458
作者: Naichuan Zheng,Hailun Xia,Zepeng Sun,Weiyi Li,Yinze Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Wearable IMU-based Human Activity Recognition (HAR) relies heavily on Deep Neural Networks (DNNs), which are burdened by immense computational and buffering demands. Their power-hungry floating-point operations and rigid requirement to process complete temporal windows severely cripple battery-constrained edge devices. While Spiking Neural Networks (SNNs) offer extreme event-driven energy efficiency, standard architectures struggle with complex biomechanical topologies and temporal gradient degradation. To bridge this gap, we propose the Physics-Aware Spiking Neural Network (PAS-Net), a fully multiplier-free architecture explicitly tailored for Green HAR. Spatially, an adaptive symmetric topology mixer enforces human-joint physical constraints. Temporally, an O(1) -memory causal neuromodulator yields context-aware dynamic threshold neurons, adapting actively to non-stationary movement rhythms. Furthermore, we leverage a temporal spike error objective to unlock a flexible early-exit mechanism for continuous IMU streams. Evaluated across seven diverse datasets, PAS-Net achieves state-of-the-art accuracy while replacing dense operations with sparse 0.1 pJ integer accumulations. Crucially, its confidence-driven early-exit capability drastically reduces dynamic energy consumption by up to 98%. PAS-Net establishes a robust, ultra-low-power neuromorphic standard for always-on wearable sensing.

[AI-141] VeriSim: A Configurable Framework for Evaluating Medical AI Under Realistic Patient Noise

【速读】:该论文旨在解决当前医疗大语言模型(Medical Large Language Models, LLMs)在标准化基准测试中表现优异,但缺乏对真实临床场景下患者沟通障碍(如记忆缺失、健康素养低下、焦虑等)的鲁棒性评估的问题。其解决方案的关键在于提出VeriSim框架——一个保持医学真值一致性的患者模拟系统,通过引入受控且基于临床证据的噪声来模拟真实患者行为,并采用UMLS与LLM结合的混合验证机制确保生成内容的真实性。该框架量化了六种源自文献的噪声维度,有效再现了临床沟通中的复杂现象,从而揭示了现有模型在现实环境中性能显著下降的现象(诊断准确率下降15-25%,对话长度增加34-55%),并为未来医疗AI的临床可靠性评估提供了可复现的开放源代码测试平台。

链接: https://arxiv.org/abs/2604.10441
作者: Sina Mansouri,Mohit Marvania,Vibhavari Ashok Shihorkar,Han Ngoc Tran,Kazhal Shafiei,Mehrdad Fazli,Yikuan Li,Ziwei Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical large language models (LLMs) achieve impressive performance on standardized benchmarks, yet these evaluations fail to capture the complexity of real clinical encounters where patients exhibit memory gaps, limited health literacy, anxiety, and other communication barriers. We introduce VeriSim, a truth-preserving patient simulation framework that injects controllable, clinically evidence-grounded noise into patient responses while maintaining strict adherence to medical ground truth through a hybrid UMLS-LLM verification mechanism. Our framework operationalizes six noise dimensions derived from peer-reviewed medical communication literature, capturing authentic clinical phenomena such as patient recall limitations, health literacy barriers, and stigma-driven non-disclosure. Experiments across seven open-weight LLMs reveal that all models degrade significantly under realistic patient noise, with diagnostic accuracy dropping 15-25% and conversation length increasing 34-55%. Notably, smaller models (7B) show 40% greater degradation than larger models (70B+), while medical fine-tuning on standard corpora provides limited robustness benefits against patient communication noise. Evaluation by board-certified clinicians demonstrates high-quality simulation with strong inter-annotator agreement (kappa 0.80), while LLM-as-a-Judge serves as a validated auxiliary evaluator achieving comparable reliability for scalable assessment. Our results highlight a critical Sim-to-Real gap in current medical AI. We release VeriSim as an open-source noise-injection framework, establishing a rigorous testbed for evaluating clinical robustness.

[AI-142] Safety Guarantees in Zero-Shot Reinforcement Learning for Cascade Dynamical Systems

【速读】:该论文旨在解决级联动力系统(cascade dynamical systems)在零样本部署(zero-shot deployment)场景下的安全保证问题,即如何确保系统在未经过目标环境训练的情况下,仍能以高概率保持在安全状态集合内。解决方案的关键在于:首先在降阶模型上训练一个安全强化学习(safe RL)策略,该模型忽略内部状态(inner states)的动力学,但将其视为影响外部状态(outer states)的输入动作;随后,在实际部署时,将训练好的RL策略与低层控制器(low-level controller)结合,由后者跟踪RL策略提供的参考指令。理论贡献在于建立了全阶系统中安全概率的上界,揭示了零样本部署后的安全概率与低层控制器对内部状态跟踪性能之间的定量关系。

链接: https://arxiv.org/abs/2604.10429
作者: Shima Rabiei,Sandipan Mishra,Santiago Paternain
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures; submitted to IEEE for possible publication

点击查看摘要

Abstract:This paper considers the problem of zero-shot safety guarantees for cascade dynamical systems. These are systems where a subset of the states (the inner states) affects the dynamics of the remaining states (the outer states) but not vice-versa. We define safety as remaining on a set deemed safe for all times with high probability. We propose to train a safe RL policy on a reduced-order model, which ignores the dynamics of the inner states, but it treats it as an action that influences the outer state. Thus, reducing the complexity of the training. When deployed in the full system the trained policy is combined with a low-level controller whose task is to track the reference provided by the RL policy. Our main theoretical contribution is a bound on the safe probability in the full-order system. In particular, we establish the interplay between the probability of remaining safe after the zero-shot deployment and the quality of the tracking of the inner states. We validate our theoretical findings on a quadrotor navigation task, demonstrating that the preservation of the safety guarantees is tied to the bandwidth and tracking capabilities of the low-level controller.

[AI-143] A Queueing-Theoretic Framework for Dynamic Attack Surfaces: Data-Integrated Risk Analysis and Adaptive Defense

【速读】:该论文旨在解决网络攻击面(cyber-attack surface)动态演化建模与主动防御策略设计问题,特别是如何量化长期依赖的累积暴露风险并提升防御效率。其核心解决方案是构建一个基于排队论(queueing-theoretic)的攻击面模型,将活跃漏洞数视为队列中的积压任务,并引入AI放大因子(AI amplification factor)刻画自动化对漏洞发现、利用和修复速率的影响;进一步将动态防御问题形式化为带资源预算和切换成本约束的受限马尔可夫决策过程(constrained Markov decision process),并开发一种具有理论保证近优后悔(provably near-optimal regret)的强化学习(reinforcement learning, RL)算法,从而实现对重尾修补时间引发的长程依赖性(long-range dependence)的有效应对,实验证明该方法可在不增加维护预算的前提下,使软件供应链中活跃漏洞数量平均减少超90%。

链接: https://arxiv.org/abs/2604.10427
作者: Jihyeon Yun,Abdullah Yasin Etcibasi,Ming Shi,C. Emre Koksal
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:We develop a queueing-theoretic framework to model the temporal evolution of cyber-attack surfaces, where the number of active vulnerabilities is represented as the backlog of a queue. Vulnerabilities arrive as they are discovered or created, and leave the system when they are patched or successfully exploited. Building on this model, we study how automation affects attack and defense dynamics by introducing an AI amplification factor that scales arrival, exploit, and patching rates. Our analysis shows that even symmetric automation can increase the rate of successful exploits. We validate the model using vulnerability data collected from an open source software supply chain and show that it closely matches real-world attack surface dynamics. Empirical results reveal heavy-tailed patching times, which we prove induce long-range dependence in vulnerability backlog and help explain persistent cyber risk. Utilizing our queueing abstraction for the attack surface, we develop a systematic approach for cyber risk mitigation. We formulate the dynamic defense problem as a constrained Markov decision process with resource-budget and switching-cost constraints, and develop a reinforcement learning (RL) algorithm that achieves provably near-optimal regret. Numerical experiments validate the approach and demonstrate that our adaptive RL-based defense policies significantly reduce successful exploits and mitigate heavy-tail queue events. Using trace-driven experiments on the ARVO dataset, we show that the proposed RL-based defense policy reduces the average number of active vulnerabilities in a software supply chain by over 90% compared to existing defense practices, without increasing the overall maintenance budget. Our results allow defenders to quantify cumulative exposure risk under long-range dependent attack dynamics and to design adaptive defense strategies with provable efficiency.

[AI-144] CWCD: Category-Wise Contrastive Decoding for Structured Medical Report Generation

【速读】:该论文旨在解决当前多模态大语言模型(Multi-modal Large Language Models, MLLMs)在自动化胸部X光片报告生成(Radiology Report Generation, RRG)中因单次前向传播解码策略导致的视觉注意力弱化和语言先验依赖增强问题,进而引发虚假病灶共现(spurious pathology co-occurrences)的现象。解决方案的关键在于提出一种名为类别感知对比解码(Category-Wise Contrastive Decoding, CWCD)的新框架,其核心机制是通过类别特异性参数化与类别感知视觉提示,将正常X光图像与掩码图像进行对比建模,从而分类别生成结构化放射学报告(Structured Radiology Report Generation, SRRG),有效提升报告的临床准确性与自然语言质量。

链接: https://arxiv.org/abs/2604.10410
作者: Shantam Srivastava,Mahesh Bhosale,David Doermann,Mingchen Gao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to MIDL 2026

点击查看摘要

Abstract:Interpreting chest X-rays is inherently challenging due to the overlap between anatomical structures and the subtle presentation of many clinically significant pathologies, making accurate diagnosis time-consuming even for experienced radiologists. Recent radiology-focused foundation models, such as LLaVA-Rad and Maira-2, have positioned multi-modal large language models (MLLMs) at the forefront of automated radiology report generation (RRG). However, despite these advances, current foundation models generate reports in a single forward pass. This decoding strategy diminishes attention to visual tokens and increases reliance on language priors as generation proceeds, which in turn introduces spurious pathology co-occurrences in the generated reports. To mitigate these limitations, we propose Category-Wise Contrastive Decoding (CWCD), a novel and modular framework designed to enhance structured radiology report generation (SRRG). Our approach introduces category-specific parameterization and generates category-wise reports by contrasting normal X-rays with masked X-rays using category-specific visual prompts. Experimental results demonstrate that CWCD consistently outperforms baseline methods across both clinical efficacy and natural language generation metrics. An ablation study further elucidates the contribution of each architectural component to overall performance.

[AI-145] Intent-aligned Formal Specification Synthesis via Traceable Refinement

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在从自然语言生成代码时难以保证正确性的问题,特别是由于真实代码库中常缺乏高质量形式化规范(formal specification),而手动编写这些规范又成本高且依赖专家知识。解决方案的关键在于提出 VeriSpecGen —— 一个可追溯的精化框架,通过需求级归因(requirement-level attribution)与局部修复(localized repair)机制,将自然语言分解为原子化需求并生成针对性测试用例,建立规范与需求之间的显式可追溯映射;当验证失败时,该映射能定位到具体需求条款,从而实现条款级别的精准修复,显著提升规范合成的准确率与鲁棒性。

链接: https://arxiv.org/abs/2604.10392
作者: Zhe Ye,Aidan Z.H. Yang,Huangyuan Su,Zhenyu Liao,Samuel Tenka,Zhizhen Qin,Udaya Ghai,Dawn Song,Soonho Kong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Large language models are increasingly used to generate code from natural language, but ensuring correctness remains challenging. Formal verification offers a principled way to obtain such guarantees by proving that a program satisfies a formal specification. However, specifications are frequently missing in real-world codebases, and writing high-quality specifications remains expensive and expertise-intensive. We present VeriSpecGen, a traceable refinement framework that synthesizes intent-aligned specifications in Lean through requirement-level attribution and localized repair. VeriSpecGen decomposes natural language into atomic requirements and generates requirement-targeted tests with explicit traceability maps to validate generated specifications. When validation fails, traceability maps attribute failures to specific requirements, enabling targeted clause-level repairs. VeriSpecGen achieve 86.6% on VERINA SpecGen task using Claude Opus 4.5, improving over baselines by up to 31.8 points across different model families and scales. Beyond inference-time gains, we generate 343K training examples from VeriSpecGen refinement trajectories and demonstrate that training on these trajectories substantially improves specification synthesis by 62-106% relative and transfers gains to general reasoning abilities.

[AI-146] Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels

【速读】:该论文旨在解决全双工交互式虚拟人视频生成中,如何在保持唇音同步的同时有效建模长程对话语境的问题。现有方法通常沿用单向语音驱动范式,依赖严格的帧对帧对齐,导致对长时对话动态响应僵化;而直接引入全局注意力机制又会严重破坏唇部动作与语音的同步性。解决方案的关键在于识别说话与倾听行为之间存在的独特时间尺度差异(Temporal Scale Discrepancy),并设计一种多头高斯核机制,将这一物理直觉作为渐进式的时间归纳偏置(temporal inductive bias)显式注入模型架构中,从而实现对双流音频输入(说话与倾听)的协同处理能力。该方法显著提升了虚拟人在自然性和响应性上的表现,成为当前全双工交互数字人生成的新基准。

链接: https://arxiv.org/abs/2604.10367
作者: Yuzhe Weng,Haotian Wang,Xinyi Yu,Xiaoyan Wu,Haoran Xu,Shan He,Jun Du
机构: 未知
类目: Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Audio-driven human video generation has achieved remarkable success in monologue scenarios, largely driven by advancements in powerful video generation foundation models. Moving beyond monologues, authentic human communication is inherently a full-duplex interactive process, requiring virtual agents not only to articulate their own speech but also to react naturally to incoming conversational audio. Most existing methods simply extend conventional audio-driven paradigms to listening scenarios. However, relying on strict frame-to-frame alignment renders the model’s response to long-range conversational dynamics rigid, whereas directly introducing global attention catastrophically degrades lip synchronization. Recognizing the unique temporal Scale Discrepancy between talking and listening behaviors, we introduce a multi-head Gaussian kernel to explicitly inject this physical intuition into the model as a progressive temporal inductive bias. Building upon this, we construct a full-duplex interactive virtual agent capable of simultaneously processing dual-stream audio inputs for both talking and listening. Furthermore, we introduce a rigorously cleaned Talking-Listening dataset VoxHear featuring perfectly decoupled speech and background audio tracks. Extensive experiments demonstrate that our approach successfully fuses strong temporal alignment with deep contextual semantics, setting a new state-of-the-art for generating highly natural and responsive full-duplex interactive digital humans. The project page is available at this https URL .

[AI-147] ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents EUROSYS2026

【速读】:该论文旨在解决状态感知型工具使用大语言模型(LLM)代理在运行过程中因上下文窗口作为工作内存时,由于现有代理调度机制对状态驻留(residency)和持久性(durability)仅做尽力而为(best-effort)管理所导致的重复性故障问题,包括状态在压缩后丢失、重置时绕过刷新操作以及写回过程破坏性更新等。其解决方案的关键在于提出 \textsc{ClawVM},一个虚拟内存层,通过将状态以带类型页(typed pages)形式管理,确保最小保真度不变量(minimum-fidelity invariants),在令牌预算内支持多分辨率表示,并在每个生命周期边界进行验证写回(validated writeback)。该设计利用代理已有的提示组装、工具中介和生命周期事件观测能力作为强制执行点,使状态驻留与持久性变得确定且可审计,从而在满足最小保真度的前提下彻底消除所有策略可控的故障。

链接: https://arxiv.org/abs/2604.10352
作者: Mofasshara Rafique,Laurent Bindschaedler
机构: 未知
类目: Artificial Intelligence (cs.AI); Operating Systems (cs.OS); Software Engineering (cs.SE)
备注: 8 pages, 1 figure, 10 tables; accepted at EuroMLSys '26 (6th Workshop on Machine Learning and Systems, co-located with EuroSys 2026)

点击查看摘要

Abstract:Stateful tool-using LLM agents treat the context window as working memory, yet today’s agent harnesses manage residency and durability as best-effort, causing recurring failures: lost state after compaction, bypassed flushes on reset, and destructive writeback. We present \textscClawVM, a virtual memory layer that manages state as typed pages with minimum-fidelity invariants, multi-resolution representations under a token budget, and validated writeback at every lifecycle boundary. Because the harness already assembles prompts, mediates tools, and observes lifecycle events, it is the natural enforcement point; placing the contract there makes residency and durability deterministic and auditable. Across synthetic workloads, 12 real-session traces, and adversarial stress tests, \textscClawVM eliminates all policy-controllable faults whenever the minimum-fidelity set fits within the token budget, confirmed by an offline oracle, and adds median 50 microseconds of policy-engine overhead per turn.

[AI-148] VeriTrans: Fine-Tuned LLM -Assisted NL-to-PL Translation via a Deterministic Neuro-Symbolic Pipeline

【速读】:该论文旨在解决自然语言(Natural Language, NL)到逻辑形式(Logic Form, PL)转换过程中可靠性不足的问题,尤其是在需要高可信度的验证场景中。现有方法往往缺乏可审计性、可复现性和对错误传播的有效控制,导致生成的逻辑表达式在后续符号验证阶段出现误判或不可靠结果。解决方案的关键在于提出一个以可靠性为核心的机器学习系统 VeriTrans,其核心机制包括:(1) 使用指令微调的 NL→PL 翻译器进行精确映射;(2) 引入往返重建(PL→NL)作为高精度接受门控机制,用于筛选可信输出;(3) 通过固定 API 配置(如温度=0、种子=42)和逐项日志记录(提示、输出、哈希值)保障可审计性和回放驱动调试能力;(4) 在保持低延迟的前提下,利用小样本精调提升翻译保真度,并通过阈值化接受策略实现可靠性与覆盖范围之间的灵活权衡。此设计将 NL→逻辑前端转化为适用于关键任务工作流的可审计、可复现组件。

链接: https://arxiv.org/abs/2604.10341
作者: Xuan Liu,Dheeraj Kodakandla,Kushagra Srivastva,Mahfuza Farooque
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:\textbfVeriTrans is a reliability-first ML system that compiles natural-language requirements into solver-ready logic with validator-gated reliability. The pipeline integrates an instruction-tuned NL !\to! PL translator, round-trip reconstruction (PL !\to! NL) used as a high-precision acceptance gate, and canonical PL !\to! CNF compilation, all executed via fixed API configuration (temperature =0 ; fine-tuning runs use seed =42 ) and per-item artifact logging (prompts, outputs, hashes) to support auditability and replay-driven debugging. On \textbfSatBench (2,100 specifications), VeriTrans achieves 94.46% SAT/UNSAT correctness and 87.73% median round-trip similarity. Compact fine-tuning on 100–150 curated examples improves fidelity by about 1–1.5,pp without increasing latency (mean 25.8,s/spec on our 201-spec runtime subset). A thresholded acceptance policy on the round-trip score exposes a reliability–coverage knob: at \tau=75 , roughly 68% of items are retained with \sim 94% correctness on the accepted set. Validator overhead contributes 15% of end-to-end runtime, and all prompts/responses and timing metadata are logged to enable replay-driven debugging and regression testing. By separating learned translation from symbolic verification and enforcing deterministic, validator-gated acceptance, VeriTrans turns NL !\to! logic front-ends into auditable, reproducible components for reliability-critical workflows.

[AI-149] From GPT -3 to GPT -5: Mapping their capabilities scope limitations and consequences

【速读】:该论文旨在解决如何系统性理解从GPT-3到GPT-5系列模型演进的问题,特别是超越单纯规模增长和性能提升的视角,揭示其在技术框架、用户交互、模态支持、部署架构及治理理念等方面的深层变革。解决方案的关键在于提出一个五维分析框架——技术进展、能力变化、部署转移、持续限制与下游影响,并基于官方技术报告、系统卡片、API文档、产品公告等多源证据进行比较研究,指出后续GPT版本已从单一文本预测器演变为具备对齐性、多模态处理、工具调用、长上下文理解和工作流集成能力的复杂AI系统,从而强调评估此类模型需综合考虑产品路由、安全调优、接口设计等非模型因素,而非仅依赖基准测试或参数量对比。

链接: https://arxiv.org/abs/2604.10332
作者: Hina Afridi,Habib Ullah,Sultan Daud Khan,Mohib Ullah
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present the progress of the GPT family from GPT-3 through GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o, GPT-4.1, and the GPT-5 family. Our work is comparative rather than merely historical. We investigates how the family evolved in technical framing, user interaction, modality, deployment architecture, and governance viewpoint. The work focuses on five recurring themes: technical progression, capability changes, deployment shifts, persistent limitations, and downstream consequences. In term of research design, we consider official technical reports, system cards, API and model documentation, product announcements, release notes, and peer-reviewed secondary studies. A primary assertion is that later GPT generations should not be interpreted only as larger or more accurate language models. Instead, the family evolves from a scaled few-shot text predictor into a set of aligned, multimodal, tool-oriented, long-context, and increasingly workflow-integrated systems. This development complicates simple model-to-model comparison because product routing, tool access, safety tuning, and interface design become part of the effective system. Across generations, several limitations remain unchanged: hallucination, prompt sensitivity, benchmark fragility, uneven behavior across domains and populations, and incomplete public transparency about architecture and training. However, the family has evolved software development, educational practice, information work, interface design, and discussions of frontier-model governance. We infer that the transition from GPT-3 to GPT-5 is best understood not only as an improvement in model capability, but also as a broader reformulation of what a deployable AI system is, how it is evaluated, and where responsibility should be located when such systems are used at scale.

[AI-150] A Diffusion-Contrastive Graph Neural Network with Virtual Nodes for Wind Nowcasting in Unobserved Regions

【速读】:该论文旨在解决气象短临预报(nowcasting)中因观测站点稀疏而导致的无观测区域风况预测不可靠问题,尤其在缺乏密集观测网络的地区难以实现精准的风速、风向及阵风预测。其解决方案的关键在于提出一种基于扩散与对比学习的图神经网络框架,并引入“虚拟节点”(virtual nodes)机制,使模型能够在无直接观测数据的区域通过结构化学习推断风条件。该方法显著降低了无观测区域的风速、阵风和风向预测的平均绝对误差(MAE),提升幅度达30%–46%,从而为可再生能源调度、农业规划和灾害预警等应用提供了高精度的本地化短临预报能力。

链接: https://arxiv.org/abs/2604.10328
作者: Jie Shi,Siamak Mehrkanoon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 25 pages, 7 figures

点击查看摘要

Abstract:Accurate weather nowcasting remains one of the central challenges in atmospheric science, with critical implications for climate resilience, energy security, and disaster preparedness. Since it is not feasible to deploy observation stations everywhere, some regions lack dense observational networks, resulting in unreliable short-term wind predictions across those unobserved areas. Here we present a deep graph self-supervised framework that extends nowcasting capability into such unobserved regions without requiring new sensors. Our approach introduces “virtual nodes” into a diffusion and contrastive-based graph neural network, enabling the model to learn wind condition (i.e., speed, direction and gusts) in places with no direct measurements. Using high-temporal resolution weather station data across the Netherlands, we demonstrate that this approach reduces nowcast mean absolute error (MAE) of wind speed, gusts, and direction in unobserved regions by more than 30% - 46% compared with interpolation and regression methods. By enabling localized nowcasts where no measurements exist, this method opens new pathways for renewable energy integration, agricultural planning, and early-warning systems in data-sparse regions.

[AI-151] Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在对齐训练和指令微调后仍易受越狱攻击(jailbreak attacks)的问题,即攻击者通过设计特定输入绕过安全机制并诱导模型生成有害内容。其解决方案的核心是提出一种基于电路级干预的“头掩码空域引导”(Head-Masked Nullspace Steering, HMNS)方法:首先识别对模型默认行为因果性最强的注意力头,随后通过目标列掩码抑制其写路径,并在被抑制子空间的正交补空间中注入扰动;该过程在闭环检测-干预循环中迭代执行,动态重识别因果头并重复施加干预。关键创新在于利用几何感知与可解释性指导的干预策略,实现了高效、精准的模型行为控制,显著提升了越狱攻击的成功率,同时揭示了对抗性安全规避的新范式。

链接: https://arxiv.org/abs/2604.10326
作者: Vishal Pramanik,Maisha Maliha,Susmit Jha,Sumit Kumar Jha
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models remain vulnerable to jailbreak attacks – inputs designed to bypass safety mechanisms and elicit harmful responses – despite advances in alignment and instruction tuning. We propose Head-Masked Nullspace Steering (HMNS), a circuit-level intervention that (i) identifies attention heads most causally responsible for a model’s default behavior, (ii) suppresses their write paths via targeted column masking, and (iii) injects a perturbation constrained to the orthogonal complement of the muted subspace. HMNS operates in a closed-loop detection-intervention cycle, re-identifying causal heads and reapplying interventions across multiple decoding attempts. Across multiple jailbreak benchmarks, strong safety defenses, and widely used language models, HMNS attains state-of-the-art attack success rates with fewer queries than prior methods. Ablations confirm that nullspace-constrained injection, residual norm scaling, and iterative re-identification are key to its effectiveness. To our knowledge, this is the first jailbreak method to leverage geometry-aware, interpretability-informed interventions, highlighting a new paradigm for controlled model steering and adversarial safety circumvention.

[AI-152] Gypscie: A Cross-Platform AI Artifact Management System

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)模型生命周期管理的复杂性问题,即在从数据收集、模型构建、评估、部署到持续监控的全流程中,如何高效协调异构服务、数据集和AI平台,以降低开发与运维难度。其解决方案的关键在于提出一个名为Gypscie的跨平台AI资产管理系统,通过构建一个捕捉应用语义的知识图谱和基于规则的查询语言实现对AI资产的统一视图与推理能力;同时,将模型生命周期活动抽象为高层级的数据流(dataflow),支持在服务器、云平台或超级计算机等多种环境下调度执行,并记录完整的溯源信息(provenance),从而提升系统的可解释性和优化调度效率。

链接: https://arxiv.org/abs/2604.10311
作者: Fabio Porto,Eduardo Ogasawara,Gabriela Moraes Botaro,Julia Neumann Bastos,Augusto Fonseca,Esther Pacitti,Patrick Valduriez
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 39 pages, 13 figures

点击查看摘要

Abstract:Artificial Intelligence (AI) models, encompassing both traditional machine learning (ML) and more advanced approaches such as deep learning and large language models (LLMs), play a central role in modern applications. AI model lifecycle management involves the end-to-end process of managing these models, from data collection and preparation to model building, evaluation, deployment, and continuous monitoring. This process is inherently complex, as it requires the coordination of diverse services that manage AI artifacts such as datasets, dataflows, and models, all orchestrated to operate seamlessly. In this context, it is essential to isolate applications from the complexity of interacting with heterogeneous services, datasets, and AI platforms. In this paper, we introduce Gypscie, a cross-platform AI artifact management system. By providing a unified view of all AI artifacts, the Gypscie platform simplifies the development and deployment of AI applications. This unified view is realized through a knowledge graph that captures application semantics and a rule-based query language that supports reasoning over data and models. Model lifecycle activities are represented as high-level dataflows that can be scheduled across multiple platforms, such as servers, cloud platforms, or supercomputers. Finally, Gypscie records provenance information about the artifacts it produces, thereby enabling explainability. Our qualitative comparison with representative AI systems shows that Gypscie supports a broader range of functionalities across the AI artifact lifecycle. Our experimental evaluation demonstrates that Gypscie can successfully optimize and schedule dataflows on AI platforms from an abstract specification. Comments: 39 pages, 13 figures Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB) ACMclasses: H.2; H.2.4; I.2.5 Cite as: arXiv:2604.10311 [cs.AI] (or arXiv:2604.10311v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.10311 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-153] From Helpful to Trustworthy: LLM Agents for Pair Programming

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的编程代理在生成代码、测试和文档时,其输出虽具表面合理性但可能与开发者意图不一致,且缺乏可审计证据的问题,从而限制了多代理LLM配对编程工作流在长期项目中的可靠性、可追溯性和可维护性。解决方案的关键在于构建一种系统性的多代理LLM配对编程框架,通过外部化开发意图(externalizing intent)并利用开发工具进行迭代验证(iterative validation),具体包括三个研究方向:将非正式问题陈述转化为符合标准的要求和形式化规范;借助自动反馈(如求解器支持的反例)优化测试与实现;以及在重构、API迁移和文档更新等维护任务中保持已验证行为的一致性。

链接: https://arxiv.org/abs/2604.10300
作者: Ragib Shahariar Ayon
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted in 34th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (FSE Companion 26)

点击查看摘要

Abstract:LLM-based coding agents are increasingly used to generate code, tests, and documentation. Still, their outputs can be plausible yet misaligned with developer intent and provide limited evidence for review in evolving projects. This limits our understanding of how to structure LLM pair-programming workflows so that artifacts remain reliable, auditable, and maintainable over time. To address this gap, this doctoral research proposes a systematic study of multi-agent LLM pair programming that externalizes intent and uses development tools for iterative validation. The plan includes three studies: translating informal problem statements into standards aligned requirements and formal specifications; refining tests and implementations using automated feedback, such as solver-backed counterexamples; and supporting maintenance tasks, including refactoring, API migrations, and documentation updates, while preserving validated behavior. The expected outcome is a clearer understanding of when multi-agent workflows increase trust, along with practical guidance for building reliable programming assistants for real-world development.

[AI-154] meSeriesExamAgent : Creating Time Series Reasoning Benchmarks at Scale

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在时间序列建模任务中是否存在真正理解能力的问题。现有基准测试多为人工构建,覆盖领域狭窄且侧重特定技能,难以全面评估LLMs对时间序列数据的推理能力。为此,作者提出了一种可扩展的方法,其关键在于结合模板的灵活性与生成式AI(Generative AI)代理的创造性,从而自动生成高质量、多样化的基准测试。具体而言,首先构建了TimeSeriesExam这一基于合成时间序列的多项选择题基准,涵盖模式识别、噪声理解、相似性分析、异常检测和因果推理五类核心推理能力;随后通过TimeSeriesExamAgent自动化地从真实世界数据集(如医疗、金融和气象领域)中生成新的基准,并经多维质量评估验证其多样性可媲美人工基准。实验表明,尽管生成方法有效,LLMs在抽象时间序列推理和领域特定应用中仍存在显著局限,揭示了提升其时间序列理解能力的持续挑战。

链接: https://arxiv.org/abs/2604.10291
作者: Malgorzata Gwiazda,Yifu Cai,Mononito Goswami,Arjun Choudhry,Artur Dubrawski
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown promising performance in time series modeling tasks, but do they truly understand time series data? While multiple benchmarks have been proposed to answer this fundamental question, most are manually curated and focus on narrow domains or specific skill sets. To address this limitation, we propose scalable methods for creating comprehensive time series reasoning benchmarks that combine the flexibility of templates with the creativity of LLM agents. We first develop TimeSeriesExam, a multiple-choice benchmark using synthetic time series to evaluate LLMs across five core reasoning categories: pattern recognitionnoise understandingsimilarity analysisanomaly detection, and causality. Then, with TimeSeriesExamAgent, we scale our approach by automatically generating benchmarks from real-world datasets spanning healthcare, finance and weather domains. Through multi-dimensional quality evaluation, we demonstrate that our automatically generated benchmarks achieve diversity comparable to manually curated alternatives. However, our experiments reveal that LLM performance remains limited in both abstract time series reasoning and domain-specific applications, highlighting ongoing challenges in enabling effective time series understanding in these models. TimeSeriesExamAgent is available at this https URL.

[AI-155] AI Organizations are More Effective but Less Aligned than Individual Agents ICLR

【速读】:该论文试图解决的问题是:在多智能体系统(multi-agent systems)中,当多个生成式 AI(Generative AI)模型协同工作时,其整体效能与对齐性(alignment)如何变化,以及这种交互行为是否会影响AI系统的安全性和实用性。解决方案的关键在于通过实验验证“AI组织”(AI Organizations)这一概念——即由多个AI代理组成的协作系统,在实际业务场景(如AI咨询和软件开发)中,相较于单一AI代理,虽然能提升任务完成的效用(utility),但会带来更高的不一致性或偏离预期目标的风险(misalignment)。研究强调了在进行AI能力评估与安全研究时,必须将智能体之间的交互机制纳入考量。

链接: https://arxiv.org/abs/2604.10290
作者: Judy Hanwen Shen,Daniel Zhu,Siddarth Srinivasan,Henry Sleight,Lawrence T. Wagner III,Morgan Jane Matthews,Erik Jones,Jascha Sohl-Dickstein
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICLR Workshop Version

点击查看摘要

Abstract:AI is increasingly deployed in multi-agent systems; however, most research considers only the behavior of individual models. We experimentally show that multi-agent “AI organizations” are simultaneously more effective at achieving business goals, but less aligned, than individual AI agents. We examine 12 tasks across two practical settings: an AI consultancy providing solutions to business problems and an AI software team developing software products. Across all settings, AI Organizations composed of aligned models produce solutions with higher utility but greater misalignment compared to a single aligned model. Our work demonstrates the importance of considering interacting systems of AI agents when doing both capabilities and safety research.

[AI-156] Dead Cognitions: A Census of Misattributed Insights

【速读】:该论文试图解决的问题是人工智能对话系统中的一种新型认知偏差机制——“ Attribution Laundering( attribution laundering)”,即模型在执行实质性认知任务后,将生成的见解归因于用户,从而误导用户对自身认知贡献的判断。这种机制不同于显性的奉承行为(glad handing sycophancy),其特点是系统性地隐藏于用户感知之外,并具有自我强化效应,长期削弱用户准确评估自身认知能力的能力。解决方案的关键在于揭示这一机制在个体层面(如界面设计对用户审视行为的抑制)与社会层面(如机构对采纳而非问责的激励)的运作逻辑,强调识别并区分人类作者与AI(如Claude)观点之间的模糊边界,以促进对人机协作中认知责任的透明化和反思性理解。

链接: https://arxiv.org/abs/2604.10288
作者: Aaron Tuor,claude.ai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This essay identifies a failure mode of AI chat systems that we term attribution laundering: the model performs substantive cognitive work and then rhetorically credits the user for having generated the resulting insights. Unlike transparent versions of glad handing sycophancy, attribution laundering is systematically occluded to the person it affects and self-reinforcing – eroding users’ ability to accurately assess their own cognitive contributions over time. We trace the mechanisms at both individual and societal scales, from the chat interface that discourages scrutiny to the institutional pressures that reward adoption over accountability. The document itself is an artifact of the process it describes, and is color-coded accordingly – though the views expressed are the authors’ own, not those of any affiliated institution, and the boundary between the human author’s views and Claude’s is, as the essay argues, difficult to draw.

[AI-157] STARS: Skill-Triggered Audit for Request-Conditioned Invocation Safety in Agent Systems

【速读】:该论文旨在解决自主语言模型代理在调用可安装技能(skill)时的持续风险评估问题,即如何在用户请求和运行时上下文条件下准确预测特定技能调用的风险得分,从而支持事前排序与优先级划分,避免硬性干预。其核心解决方案是提出STARS框架,该框架融合了静态能力先验、请求条件化的调用风险模型以及校准后的风险融合策略,通过构建SIA-Bench基准数据集进行验证,结果显示请求条件化评分在间接提示注入攻击场景下显著优于静态基线,但静态先验仍对分布内测试保持价值,表明该方法更适合作为调用时刻的风险评分与分诊层,而非替代静态审计。

链接: https://arxiv.org/abs/2604.10286
作者: Guijia Zhang,Shu Yang,Xilin Gong,Di Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous language-model agents increasingly rely on installable skills and tools to complete user tasks. Static skill auditing can expose capability surface before deployment, but it cannot determine whether a particular invocation is unsafe under the current user request and runtime context. We therefore study skill invocation auditing as a continuous-risk estimation problem: given a user request, candidate skill, and runtime context, predict a score that supports ranking and triage before a hard intervention is applied. We introduce STARS, which combines a static capability prior, a request-conditioned invocation risk model, and a calibrated risk-fusion policy. To evaluate this setting, we construct SIA-Bench, a benchmark of 3,000 invocation records with group-safe splits, lineage metadata, runtime context, canonical action labels, and derived continuous-risk targets. On a held-out split of indirect prompt injection attacks, calibrated fusion reaches 0.439 high-risk AUPRC, improving over 0.405 for the contextual scorer and 0.380 for the strongest static baseline, while the contextual scorer remains better calibrated with 0.289 expected calibration error. On the locked in-distribution test split, gains are smaller and static priors remain useful. The resulting claim is therefore narrower: request-conditioned auditing is most valuable as an invocation-time risk-scoring and triage layer rather than as a replacement for static screening. Code is available at this https URL.

[AI-158] A Dual-Positive Monotone Parameterization for Multi-Segment Bids and a Validity Assessment Framework for Reinforcement Learning Agent -based Simulation of Electricity Markets

【速读】:该论文旨在解决强化学习代理仿真(Reinforcement Learning Agent-Based Simulation, RL-ABS)在电力市场机制分析中面临的两个核心问题:一是现有方法在建模单调、有界、多段阶梯式报价时,依赖后处理映射(如排序、截断或投影)将无约束动作转换为可行报价曲线,但此类映射在边界或拐点处常不满足连续可微性、单射性和可逆性,导致梯度失真并引发模拟结果的虚假收敛;二是多数研究仅以训练曲线收敛作为评估依据,未严格衡量仿真结果与纳什均衡(Nash equilibrium)之间的距离,严重削弱了结论的可信度。解决方案的关键在于提出一种能够保证梯度传播正确性的参数化报价建模方法,通过设计具有数学性质保障的映射结构,使策略网络直接输出满足约束条件的报价曲线,从而实现更精确、可靠的RL-ABS仿真。

链接: https://arxiv.org/abs/2604.10252
作者: Zunnan Xu,Zhaoxia Jing,Zhanhua Pan
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Reinforcement learning agent-based simulation (RL-ABS) has become an important tool for electricity market mechanism analysis and evaluation. In the modeling of monotone, bounded, multi-segment stepwise bids, existing methods typically let the policy network first output an unconstrained action and then convert it into a feasible bid curve satisfying monotonicity and boundedness through post-processing mappings such as sorting, clipping, or projection. However, such post-processing mappings often fail to satisfy continuous differentiability, injectivity, and invertibility at boundaries or kinks, thereby causing gradient distortion and leading to spurious convergence in simulation results. Meanwhile, most existing studies conduct mechanism analysis and evaluation mainly on the basis of training-curve convergence, without rigorously assessing the distance between the simulation outcomes and Nash equilibrium, which severely undermines the credibility of the results. To address these issues, this paper proposes…

[AI-159] SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning

【速读】:该论文旨在解决当前多模态模型因浅层推理导致的错误问题,即由于推理过程不完整或不一致引发的性能瓶颈。其核心解决方案是提出一种统一框架——自验证与自修正(Self-Verification and Self-Rectification, SVSR),通过在模型推理流程中显式集成自我验证(self-verification)和自我修正(self-rectification)机制,显著提升复杂视觉理解与多模态推理任务中的鲁棒性和可靠性。SVSR的关键在于构建了一个三阶段训练范式:首先基于预训练视觉语言模型(Vision-Language Model, VLM)生成高质量的统一偏好数据集,融合正向与反向推理以嵌入自反思信号;其次进行冷启动监督微调,学习结构化的多步推理行为;最后采用半在线直接偏好优化(Semi-online Direct Preference Optimization, Semi-online DPO)持续扩充训练语料,利用强大教师VLM筛选高质量模型生成的推理轨迹。该框架使模型能够自主学习、激发并优化其自我验证与修正能力,从而在多个基准测试中实现更高推理准确率,并展现出更强的泛化能力,甚至在未提供显式推理路径时仍具备改进的隐式推理表现。

链接: https://arxiv.org/abs/2604.10228
作者: Zhe Qian,Nianbing Su,Zhonghua Wang,Hebei Li,Zhongxing Xu,Yueying Li,Fei Luo,Zhuohan Ouyang,Yanbiao Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current multimodal models often suffer from shallow reasoning, leading to errors caused by incomplete or inconsistent thought processes. To address this limitation, we propose Self-Verification and Self-Rectification (SVSR), a unified framework that explicitly integrates self-verification and self-rectification into the model’s reasoning pipeline, substantially improving robustness and reliability in complex visual understanding and multimodal reasoning tasks. SVSR is built on a novel three-stage training paradigm. First, we construct a high-quality unified preference dataset by refining reasoning traces from pre-trained vision-language models, incorporating both forward and backward reasoning to embed self-reflective signals. Second, we perform cold-start supervised fine-tuning on this dataset to learn structured, multi-step reasoning behaviors. Third, we apply a Semi-online Direct Preference Optimization (Semi-online DPO) process, continuously augmenting the training corpus with high-quality, model-generated reasoning traces filtered by a powerful teacher VLM. This pipeline enables the model to learn, elicit, and refine its ability to self-verify and self-rectify. Extensive experiments across diverse benchmarks demonstrate that SVSR improves reasoning accuracy and enables stronger generalization to unseen tasks and question types. Notably, once trained with explicit self-reflective reasoning, the model also exhibits improved implicit reasoning ability, outperforming strong baselines even when no explicit reasoning traces are provided. These results highlight the potential of SVSR for building more dependable, introspective, and cognitively aligned multimodal systems.

[AI-160] Exploring the impact of fairness-aware criteria in AutoML

【速读】:该论文旨在解决自动化机器学习(AutoML)框架在构建完整机器学习(ML)流程时,因忽视公平性而可能加剧歧视性行为的问题。其核心挑战在于,现有AutoML方法主要聚焦于模型选择与超参数调优以提升预测性能,却未将公平性嵌入到数据选择、特征变换等关键环节。解决方案的关键在于将公平性直接整合进AutoML的优化组件中,通过引入互补的公平性指标来捕捉不同维度的公平性,并在优化过程中同时权衡预测性能与公平性目标。实验表明,该方法虽使预测性能下降9.4%,但平均公平性提升14.5%,且数据使用量减少35.7%,最终生成更简洁有效的公平模型,证明了公平性驱动的优化能实现性能与公平性的更好平衡。

链接: https://arxiv.org/abs/2604.10224
作者: Joana Simões,João Correia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine Learning (ML) systems are increasingly used to support decision-making processes that affect individuals. However, these systems often rely on biased data, which can lead to unfair outcomes against specific groups. With the growing adoption of Automated Machine Learning (AutoML), the risk of intensifying discriminatory behaviours increases, as most frameworks primarily focus on model selection to maximise predictive performance. Previous research on fairness in AutoML had largely followed this trend, integrating fairness awareness only in the model selection or hyperparameter tuning, while neglecting other critical stages of the ML pipeline. This paper aims to study the impact of integrating fairness directly into the optimisation component of an AutoML framework that constructs complete ML pipelines, from data selection and transformations to model selection and tuning. As selecting appropriate fairness metrics remains a key challenge, our work incorporates complementary fairness metrics to capture different dimensions of fairness during the optimisation. Their integration within AutoML resulted in measurable differences compared to a baseline focused solely on predictive performance. Despite a 9.4% decrease in predictive power, the average fairness improved by 14.5%, accompanied by a 35.7% reduction in data usage. Furthermore, fairness integration produced complete yet simpler final solutions, suggesting that model complexity is not always required to achieve balanced and fair ML solutions.

[AI-161] Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

【速读】:该论文旨在解决多模态大推理模型(Multimodal Large Reasoning Models, MLRMs)在长链视觉推理过程中易产生幻觉的问题,尤其是识别出一种称为“推理视觉真相断层”(Reasoning Vision Truth Disconnect, RVTD)的现象:即幻觉与认知分叉点(cognitive bifurcation points)高度相关,且这些分叉点常表现为高熵状态。作者认为这一现象源于模型中间层视觉语义锚定(visual semantic anchoring)的失效,导致在高不确定性过渡阶段模型不再查询视觉证据,而是依赖语言先验。解决方案的关键在于从仅基于输出层面的监督转向引入细粒度内部注意力引导机制,提出V-STAR(Visual Structural Training with Attention Reinforcement)训练范式,其核心是集成于GRPO框架中的分层视觉注意力奖励(Hierarchical Visual Attention Reward, HVAR),该机制在检测到高熵状态时动态激励关键中间层的视觉注意力,从而将推理过程重新锚定至视觉输入;同时引入强制反思机制(Forced Reflection Mechanism, FRM),通过轨迹编辑策略在高熵认知分叉点触发反思并验证后续步骤与视觉输入的一致性,实现外部去偏干预向内在抗幻觉能力的转化。

链接: https://arxiv.org/abs/2604.10219
作者: Zhe Qian,Yanbiao Ma,Zhuohan Ouyang,Zhonghua Wang,Zhongxing Xu,Fei Luo,Xinyu Liu,Zongyuan Ge,Yike Guo,Jungong Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: TPAMI under review

点击查看摘要

Abstract:Multimodal Large Reasoning Models (MLRMs) have achieved remarkable strides in visual reasoning through test time compute scaling, yet long chain reasoning remains prone to hallucinations. We identify a concerning phenomenon termed the Reasoning Vision Truth Disconnect (RVTD): hallucinations are strongly correlated with cognitive bifurcation points that often exhibit high entropy states. We attribute this vulnerability to a breakdown in visual semantic anchoring, localized within the network’s intermediate layers; specifically, during these high uncertainty transitions, the model fails to query visual evidence, reverting instead to language priors. Consequently, we advocate a shift from solely outcome level supervision to augmenting it with fine grained internal attention guidance. To this end, we propose V-STAR (Visual Structural Training with Attention Reinforcement), a lightweight, holistic training paradigm designed to internalize visually aware reasoning capabilities. Central to our approach is the Hierarchical Visual Attention Reward (HVAR), integrated within the GRPO framework. Upon detecting high entropy states, this mechanism dynamically incentivizes visual attention across critical intermediate layers, thereby anchoring the reasoning process back to the visual input. Furthermore, we introduce the Forced Reflection Mechanism (FRM), a trajectory editing strategy that disrupts cognitive inertia by triggering reflection around high entropy cognitive bifurcation points and encouraging verification of subsequent steps against the visual input, thereby translating external debiasing interventions into an intrinsic capability for hallucination mitigation.

[AI-162] Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)中损失函数几何特性与泛化性能之间关系不明确的问题,特别是针对平滑非线性多层神经网络的损失尖锐度(loss sharpness)缺乏理论分析这一挑战。其解决方案的关键在于:通过引入Wolkowicz-Styan不等式,推导出交叉熵损失函数Hessian矩阵最大特征值的闭式上界,该上界仅依赖于仿射变换参数、隐藏层维度以及训练样本间正交程度等可解析量,从而避免了对复杂Hessian矩阵特征谱进行数值计算的需求,实现了对损失尖锐度的理论刻画。

链接: https://arxiv.org/abs/2604.10202
作者: Yuto Omae,Kazuki Sakai,Yohei Kakimoto,Makoto Sasaki,Yusuke Sakai,Hirotaka Takahashi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 19 pages

点击查看摘要

Abstract:Neural networks (NNs) are central to modern machine learning and achieve state-of-the-art results in many applications. However, the relationship between loss geometry and generalization is still not well understood. The local geometry of the loss function near a critical point is well-approximated by its quadratic form, obtained through a second-order Taylor expansion. The coefficients of the quadratic term correspond to the Hessian matrix, whose eigenspectrum allows us to evaluate the sharpness of the loss at the critical point. Extensive research suggests flat critical points generalize better, while sharp ones lead to higher generalization error. However, sharpness requires the Hessian eigenspectrum, but general matrix characteristic equations have no closed-form solution. Therefore, most existing studies on evaluating loss sharpness rely on numerical approximation methods. Existing closed-form analyses of the eigenspectrum are primarily limited to simplified architectures, such as linear or ReLU-activated networks; consequently, theoretical analysis of smooth nonlinear multilayer neural networks remains limited. Against this background, this study focuses on nonlinear, smooth multilayer neural networks and derives a closed-form upper bound for the maximum eigenvalue of the Hessian with respect to the cross-entropy loss by leveraging the Wolkowicz-Styan bound. Specifically, the derived upper bound is expressed as a function of the affine transformation parameters, hidden layer dimensions, and the degree of orthogonality among the training samples. The primary contribution of this paper is an analytical characterization of loss sharpness in smooth nonlinear multilayer neural networks via a closed-form expression, avoiding explicit numerical eigenspectrum computation. We hope that this work provides a small yet meaningful step toward unraveling the mysteries of deep learning.

[AI-163] Credit-Budgeted ICPC-Style Coding: When Agents Must Pay for Every Decision ICLR2026

【速读】:该论文旨在解决当前自主编码代理(autonomous coding agents)评估中忽视计算资源限制的问题,即现有评价体系假设无限资源环境,而现实中的软件工程任务受制于计算能力和时间成本。为实现更贴近实际的评估,作者提出USACOArena——一个基于严格“信用”经济机制的交互式竞赛平台,其核心创新在于将每个生成的token、本地测试和运行时间均计入固定预算消耗,从而迫使代理在准确性与资源消耗之间进行战略权衡。这一设计使得代理必须优化资源利用效率,而非单纯追求准确率,为开发高效且资源感知的智能体架构提供了动态训练环境。

链接: https://arxiv.org/abs/2604.10182
作者: Lingfeng Zhou,Junhao Shi,Jin Gao,Dequan Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Current evaluations of autonomous coding agents assume an unrealistic, infinite-resource environment. However, real-world software engineering is a resource-bound competition. As we scale toward large agent swarms, ignoring compute and time costs risks catastrophic budget exhaustion. To shift the focus from isolated accuracy to cost-aware problem-solving, we introduce USACOArena, an interactive ACM-ICPC-style arena driven by a strict “credit” economy. Every generated token, local test, and elapsed second depletes a fixed budget, forcing agents to make strategic trade-offs. Our comprehensive profiling reveals that frontier single agents and swarms currently fail to optimally balance accuracy with these constraints, exhibiting divergent, path-dependent behaviors. Ultimately, USACOArena provides an essential dynamic training ground for developing highly efficient, resource-aware agent architectures.

[AI-164] PoreDiT: A Scalable Generative Model for Large-Scale Digital Rock Reconstruction

【速读】:该论文旨在解决数字岩心物理(Digital Rock Physics, DRP)中长期存在的难题:高分辨率与大视场(Field-of-View, FOV)之间的权衡,以及传统深度学习架构带来的计算瓶颈。解决方案的关键在于提出一种名为 PoreDiT 的新型生成模型,其核心创新是采用三维 Swin Transformer 架构直接预测孔隙空间的二值概率场(而非灰度强度),从而在保持关键拓扑特征(如孔隙连通性、欧拉特征等)的同时显著提升计算效率。该方法可在消费级硬件上实现百万级体素(1024³)规模的数字岩心重建,并在孔隙度、渗透率等物理特性上达到与现有最先进方法相当的保真度,为大尺度流体动力学模拟和储层表征提供了高效可行的新路径。

链接: https://arxiv.org/abs/2604.10171
作者: Yizhuo Huang,Baoquan Sun,Haibo Huang
机构: 未知
类目: Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
备注:

点击查看摘要

Abstract:This manuscript presents PoreDiT, a novel generative model designed for high-efficiency digital rock reconstruction at gigavoxel scales. Addressing the significant challenges in digital rock physics (DRP), particularly the trade-off between resolution and field-of-view (FOV), and the computational bottlenecks associated with traditional deep learning architectures, PoreDiT leverages a three-dimensional (3D) Swin Transformer to break through these limitations. By directly predicting the binary probability field of pore spaces instead of grayscale intensities, the model preserves key topological features critical for pore-scale fluid flow and transport simulations. This approach enhances computational efficiency, enabling the generation of ultra-large-scale ( 1024^3 voxels) digital rock samples on consumer-grade hardware. Furthermore, PoreDiT achieves physical fidelity comparable to previous state-of-the-art methods, including accurate porosity, pore-scale permeability, and Euler characteristics. The model’s ability to scale efficiently opens new avenues for large-domain hydrodynamic simulations and provides practical solutions for researchers in pore-scale fluid mechanics, reservoir characterization, and carbon sequestration.

[AI-165] MAVEN-T: Multi-Agent enVironment-aware Enhanced Neural Trajectory predictor with Reinforcement Learning

【速读】:该论文旨在解决自动驾驶系统中轨迹预测任务的挑战性问题,即如何在保持高精度的同时满足严格的实时部署约束。现有知识蒸馏方法往往无法有效保留复杂决策能力,尤其在动态多智能体场景下表现不足。其解决方案的关键在于提出MAVEN-T框架,通过教师-学生架构的互补式协同设计与渐进式蒸馏机制实现高效知识迁移:教师模型采用混合注意力机制以最大化表征能力,学生模型则使用针对部署优化的轻量级结构;知识传递采用多粒度蒸馏并结合自适应课程学习策略,动态调整训练复杂度;更重要的是引入强化学习突破传统模仿蒸馏的局限性,使学生能够通过环境交互验证、修正并优化教师知识,从而可能实现比教师更鲁棒的决策能力。

链接: https://arxiv.org/abs/2604.10169
作者: Wenchang Duan
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Trajectory prediction remains a critical yet challenging component in autonomous driving systems, requiring sophisticated reasoning capabilities while meeting strict real-time deployment constraints. While knowledge distillation has demonstrated effectiveness in model compression, existing approaches often fail to preserve complex decision-making capabilities, particularly in dynamic multi-agent scenarios. This paper introduces MAVEN-T, a teacher-student framework that achieves state-of-the-art trajectory prediction through complementary architectural co-design and progressive distillation. The teacher employs hybrid attention mechanisms for maximum representational capacity, while the student uses efficient architectures optimized for deployment. Knowledge transfer is performed via multi-granular distillation with adaptive curriculum learning that dynamically adjusts complexity based on performance. Importantly, the framework incorporates reinforcement learning to overcome the imitation ceiling of traditional distillation, enabling the student to verify, refine, and optimize teacher knowledge through dynamic environmental interaction, potentially achieving more robust decision-making than the teacher itself. Extensive experiments on NGSIM and highD datasets demonstrate 6.2x parameter compression and 3.7x inference speedup while maintaining state-of-the-art accuracy, establishing a new paradigm for deploying sophisticated reasoning models under resource constraints.

[AI-166] Virtual Smart Metering in District Heating Networks via Heterogeneous Spatial-Temporal Graph Neural Networks

【速读】:该论文旨在解决区域供热系统中由于传感器部署稀疏和故障频发导致的可观测性不足问题,以及现有数据驱动方法对密集同步数据的依赖与解析模型在复杂网络拓扑下难以准确刻画压力、流量和温度之间非线性耦合关系的局限性。其解决方案的关键在于提出一种异质时空图神经网络(Heterogeneous Spatial-Temporal Graph Neural Network, HSTGNN),该模型通过引入专门分支分别学习管网的图结构和时序动态特性,从而实现对流速、温度与压力等多变量空间相关性和跨变量耦合关系的联合建模,有效提升了虚拟智能热表的预测精度与鲁棒性。

链接: https://arxiv.org/abs/2604.10166
作者: Keivan Faghih Niresi,Christian Møller Jensen,Carsten Skovmose Kallesøe,Rafael Wisniewski,Olga Fink
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Intelligent operation of thermal energy networks aims to improve energy efficiency, reliability, and operational flexibility through data-driven control, predictive optimization, and early fault detection. Achieving these goals relies on sufficient observability, requiring continuous and well-distributed monitoring of thermal and hydraulic states. However, district heating systems are typically sparsely instrumented and frequently affected by sensor faults, limiting monitoring. Virtual sensing offers a cost-effective means to enhance observability, yet its development and validation remain limited in practice. Existing data-driven methods generally assume dense synchronized data, while analytical models rely on simplified hydraulic and thermal assumptions that may not adequately capture the behavior of heterogeneous network topologies. Consequently, modeling the coupled nonlinear dependencies between pressure, flow, and temperature under realistic operating conditions remains challenging. In addition, the lack of publicly available benchmark datasets hinders systematic comparison of virtual sensing approaches. To address these challenges, we propose a heterogeneous spatial-temporal graph neural network (HSTGNN) for constructing virtual smart heat meters. The model incorporates the functional relationships inherent in district heating networks and employs dedicated branches to learn graph structures and temporal dynamics for flow, temperature, and pressure measurements, thereby enabling the joint modeling of cross-variable and spatial correlations. To support further research, we introduce a controlled laboratory dataset collected at the Aalborg Smart Water Infrastructure Laboratory, providing synchronized high-resolution measurements representative of real operating conditions. Extensive experiments demonstrate that the proposed approach significantly outperforms existing baselines.

[AI-167] Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities ICLR2026

【速读】:该论文旨在解决时间知识图谱(Temporal Knowledge Graph, TKG)推理中因封闭世界假设(closed-world assumption)导致的性能瓶颈问题,即现有方法无法有效处理训练阶段未出现但持续涌现的新实体(emerging entities),这些实体缺乏历史交互记录,从而显著降低推理准确性。解决方案的关键在于提出一种名为TransFIR(Transferable Inductive Reasoning)的新框架,其核心思想是利用语义相似已知实体的历史交互序列来迁移可转移的时间模式,通过基于码本(codebook-based)的分类器将新兴实体映射到潜在语义簇中,使其能够继承相似实体的推理模式,实现对新实体的归纳推理(inductive reasoning)。实验表明,该方法在多个数据集上平均提升均方倒数排名(MRR)达28.6%,显著优于现有基线模型。

链接: https://arxiv.org/abs/2604.10164
作者: Ze Zhao,Yuhui He,Lyuwen Wu,Gu Tang,Bin Lu,Xiaoying Gan,Luoyi Fu,Xinbing Wang,Chenghu Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 24 pages, accepted by ICLR2026

点击查看摘要

Abstract:Reasoning on Temporal Knowledge Graphs (TKGs) is essential for predicting future events and time-aware facts. While existing methods are effective at capturing relational dynamics, their performance is limited by a closed-world assumption, which fails to account for emerging entities not present in the training. Notably, these entities continuously join the network without historical interactions. Empirical study reveals that emerging entities are widespread in TKGs, comprising roughly 25% of all entities. The absence of historical interactions of these entities leads to significant performance degradation in reasoning tasks. Whereas, we observe that entities with semantic similarities often exhibit comparable interaction histories, suggesting the presence of transferable temporal patterns. Inspired by this insight, we propose TransFIR (Transferable Inductive Reasoning), a novel framework that leverages historical interaction sequences from semantically similar known entities to support inductive reasoning. Specifically, we propose a codebook-based classifier that categorizes emerging entities into latent semantic clusters, allowing them to adopt reasoning patterns from similar entities. Experimental results demonstrate that TransFIR outperforms all baselines in reasoning on emerging entities, achieving an average improvement of 28.6% in Mean Reciprocal Rank (MRR) across multiple datasets. The implementations are available at this https URL.

[AI-168] SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)中混合专家(Mixture-of-Experts, MoE)架构在推理阶段面临的高内存需求和参数效率低下问题,尤其是在内存受限系统上部署时的性能瓶颈。其解决方案的关键在于提出SpecMoE,一个基于自引导推测解码(self-assisted speculative decoding)算法的内存高效MoE推理系统。该方法无需额外的模型训练或微调,即可通过推测解码策略优化MoE激活路径,在保证准确性的同时显著提升吞吐量(最高达4.30倍),并大幅降低内存与互联带宽消耗,从而实现更高效的MoE推理部署。

链接: https://arxiv.org/abs/2604.10152
作者: Jehyeon Bang,Eunyeong Cho,Ranggi Hwang,Jinha Chung,Minsoo Rhu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This is an extended version of our work, which is accepted for publication at the 63rd ACM/IEEE Design Automation Conference (DAC), 2026

点击查看摘要

Abstract:The Mixture-of-Experts (MoE) architecture has emerged as a promising approach to mitigate the rising computational costs of large language models (LLMs) by selectively activating parameters. However, its high memory requirements and sub-optimal parameter efficiency pose significant challenges for efficient deployment. Although CPU-offloaded MoE inference systems have been proposed in the literature, they offer limited efficiency, particularly for large batch sizes. In this work, we propose SpecMoE, a memory-efficient MoE inference system based on our self-assisted speculative decoding algorithm. SpecMoE demonstrates the effectiveness of applying speculative decoding to MoE inference without requiring additional model training or fine-tuning. Our system improves inference throughput by up to 4.30\times , while significantly reducing bandwidth requirements of both memory and interconnect on memory-constrained systems.

[AI-169] A Temporally Augmented Graph Attention Network for Affordance Classification

【速读】:该论文旨在解决现有图注意力网络(Graph Attention Network, GAT)在处理时序关系数据时的局限性,尤其是其对静态图结构的依赖以及在序列数据中隐式聚合时间信息所带来的性能瓶颈。针对这一问题,作者提出了一种时序增强型图注意力网络——脑电图时序图注意力网络(Electroencephalography-temporal Graph Attention Network, EEG-tGAT),其关键创新在于引入显式的时间注意力机制以动态调节不同时间片段的贡献权重,并采用时间丢弃策略(temporal dropout)来正则化因时间相关性导致的过拟合问题。该设计基于一个核心假设:在具身交互数据中,时间维度并非语义均匀分布,判别性信息可能随时间不均等分布。实验表明,EEG-tGAT在具身交互任务中的分类性能优于GATv2,验证了显式建模时间重要性和强化时间鲁棒性能够有效引入与任务结构更匹配的归纳偏置,从而提升模型泛化能力。

链接: https://arxiv.org/abs/2604.10149
作者: Ami Chopra,Supriya Bordoloi,Shyamanta M. Hazarika
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 6 figures. Accepted at 3rd IEEE Guwahati Subsection Conference (GCON 2026)

点击查看摘要

Abstract:Graph attention networks (GATs) provide one of the best frameworks for learning node representations in relational data; but, existing variants such as Graph Attention Network (GAT) mainly operate on static graphs and rely on implicit temporal aggregation when applied to sequential data. In this paper, we introduce Electroencephalography-temporal Graph Attention Network (EEG-tGAT), a temporally augmented formulation of GATv2 that is tailored for affordance classification from interaction sequences. The proposed model incorporates temporal attention to modulate the contribution of different time segments and temporal dropout to regularize learning across temporally correlated observations. The design reflects the assumption that temporal dimensions in affordance data are not semantically uniform and that discriminative information may be unevenly distributed across time. Experimental results on affordance datasets show that EEG-tGAT achieves improved classification performance compared to GATv2. The observed gains helps to conclude that explicitly encoding temporal importance and enforcing temporal robustness introduce inductive biases that are much better aligned with the structure of affordance-driven interaction data. These findings show us that modest architectural changes to graph attention models can help one obtain consistent benefits when temporal relationships play a nontrivial role in the task.

[AI-170] MR-Coupler: Automated Metamorphic Test Generation via Functional Coupling Analysis

【速读】:该论文旨在解决 metamorphic testing (MT) 中因难以构建有效的 metamorphic relations (MRs) 而阻碍其广泛应用的问题,尤其是 MR 构建通常依赖于领域知识或难以获取的信息。解决方案的关键在于提出一种名为 MR-Coupler 的新方法,该方法利用源代码中易于获得的**方法功能耦合性(functional coupling)**自动识别具有潜在 MR 关系的方法对,并借助大语言模型(Large Language Models, LLMs)生成候选的 metamorphic test cases (MTCs),再通过测试放大(test amplification)和变异分析(mutation analysis)进行验证。该方法结合三种功能耦合特征以避免穷举搜索,同时引入新颖的验证机制降低误报率,从而显著提升 MTC 生成的有效性和实用性。

链接: https://arxiv.org/abs/2604.10126
作者: Congying Xu,Hengcheng Zhu,Songqiang Chen,Jiarong Wu,Valerio Terragni,Shing-Chi Cheung
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Note: Accepted on ACM International Conference on the Foundations of Software Engineering (FSE) 2026

点击查看摘要

Abstract:Metamorphic testing (MT) is a widely recognized technique for alleviating the oracle problem in software testing. However, its adoption is hindered by the difficulty of constructing effective metamorphic relations (MRs), which often require domain-specific or hard-to-obtain knowledge. In this work, we propose a novel approach that leverages the functional coupling between methods, which is readily available in source code, to automatically construct MRs and generate metamorphic test cases (MTCs). Our technique, MR-Coupler, identifies functionally coupled method pairs, employs large language models to generate candidate MTCs, and validates them through test amplification and mutation analysis. In particular, we leverage three functional coupling features to avoid expensive enumeration of possible method pairs, and a novel validation mechanism to reduce false alarms. Our evaluation of MR-Coupler on 100 human-written MTCs and 50 real-world bugs shows that it generates valid MTCs for over 90% of tasks, improves valid MTC generation by 64.90%, and reduces false alarms by 36.56% compared to baselines. Furthermore, the MTCs generated by MR-Coupler detect 44% of the real bugs. Our results highlight the effectiveness of leveraging functional coupling for automated MR construction and the potential of MR-Coupler to facilitate the adoption of MT in practice. We also released the tool and experimental data to support future research.

[AI-171] rust Your Memory: Verifiable Control of Smart Homes through Reinforcement Learning with Multi-dimensional Rewards

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在智能家居场景中执行基于记忆的设备控制时存在的两大核心问题:一是缺乏能够有效评估模型记忆驱动控制能力的基准测试工具,二是传统强化学习(Reinforcement Learning, RL)方法因仅依赖最终任务结果进行监督,难以支持细粒度的记忆管理任务(如添加、更新、删除和利用记忆)。解决方案的关键在于:首先发布MemHomeLife数据集,该数据集基于匿名化的长期用户交互日志构建;其次提出MemHome基准,这是首个系统性评估智能家居场景下记忆驱动设备控制能力的评测框架,能够对不同记忆相关子任务进行精细化评估,从而推动该领域从方法论到评价体系的标准化发展。

链接: https://arxiv.org/abs/2604.10110
作者: Kai-Yuan Guo,Jiang Wang,Renjie Zhao,Tianyi Wang,Wandong Mao,Yu Gao,Mou Xiao Feng,Yi Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have become a key foundation for enabling personalized smart home experiences. While existing studies have explored how smart home assistants understand user queries to control devices in real time, their ability to perform memory-driven device control remains challenging from both evaluation and methodological perspectives. In terms of evaluation, existing benchmarks either focus on immediate device control or general open-domain memory retrieval tasks, and therefore cannot effectively evaluate a model’s ability to perform memory-driven device control. Methodologically, while memory-driven device control can be approached using Reinforcement Learning, conventional RL methods generally rely on outcome-based supervision (i.e., whether the final task is achieved). This lack of intermediate feedback can lead to sub-optimal performance or local failures in fine-grained memory management tasks (adding, updating, deleting, and utilizing). To address these issues, we first release MemHomeLife, built from anonymized real-world long-term user interaction logs. To enable more fine-grained evaluation of different memory-related subtasks, we further construct MemHome, the first benchmark designed to systematically evaluate memory-driven device control in smart home scenarios.

[AI-172] Ontological Trajectory Forecasting via Finite Semigroup Iteration and Lie Algebra Approximation in Geopolitical Knowledge Graphs

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的政治分析系统仅能进行文本模式匹配、缺乏对长期地缘政治关系演化机制建模能力的问题。其解决方案的关键在于构建一个融合形式化本体(ontology)、有限半群代数(finite semigroup algebra)与李代数近似(Lie algebra approximation)的推理框架——EL-DRUIN,将地缘政治关系建模为动态模式状态集合,通过定义明确的组合表确定半群运算结构常数,并在8维语义李代数空间中嵌入每个模式向量;利用前向模拟迭代组合操作以生成可达模式集,收敛至幂等吸收态(即固定点)作为长期吸引子;同时引入贝叶斯后验权重,结合本体先验置信度与李空间余弦相似度项,输出可解释且校准的概率估计,从而实现对地缘政治关系长期轨迹的预测与关键分岔点识别。

链接: https://arxiv.org/abs/2604.10087
作者: Qihang Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages. Code and system available at this https URL

点击查看摘要

Abstract:We present EL-DRUIN, an ontological reasoning system for geopolitical intelligence analysis that combines formal ontology, finite semigroup algebra, and Lie algebra approximation to forecast long-run relationship trajectories. Current LLM-based political analysis systems operate as summarisation engines, producing outputs bounded by textual pattern matching. EL-DRUIN departs from this paradigm by modelling geopolitical relationships as states in a finite set of named Dynamic Patterns, composing patterns via a semigroup operation whose structure constants are defined by an explicit composition table, and embedding each pattern as a vector in an 8-dimensional semantic Lie algebra space. Forward simulation iterates this semigroup operation, yielding reachable pattern sets at each discrete timestep; convergence to idempotent absorbing states (fixed points of the composition) constitutes the predicted long-run attractor. Bayesian posterior weights combine ontology-derived confidence priors with a Lie similarity term measuring the cosine similarity between the vector sum of composing patterns and the target pattern vector, providing interpretable, calibrated probabilities that are not self-reported by a language model. Bifurcation points – steps at which two candidate attractors have near-equal posterior mass – are detected and exposed to downstream analysis. We demonstrate the framework on six geopolitical scenarios including US-China technology decoupling and the Taiwan Strait military coercion trajectory. The architecture is publicly available as an open-source system with a Streamlit frontend exposing full computation traces, Bayesian posterior breakdowns, and 8D ontological state vectors.

[AI-173] Learning Hierarchical and Geometry-Aware Graph Representations for Text-to-CAD ICLR2026

【速读】:该论文旨在解决文本到计算机辅助设计(Text-to-CAD)代码生成任务中因缺乏结构建模与几何约束显式表达而导致的搜索空间过大、局部误差累积及复杂装配体中级联失败的问题。其解决方案的关键在于引入一种分层且几何感知的图结构作为中间表示,将多层级零件与组件建模为节点、显式几何约束编码为边;在此基础上,框架先预测结构与约束,再基于此条件化动作序列和代码生成,从而提升几何保真度与约束满足精度。

链接: https://arxiv.org/abs/2604.10075
作者: Shengjie Gong,Wenjie Peng,Hongyuan Chen,Gangyu Zhang,Yunqing Hu,Huiyuan Zhang,Shuangping Huang,Tianshui Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Text-to-CAD code generation is a long-horizon task that translates textual instructions into long sequences of interdependent operations. Existing methods typically decode text directly into executable code (e.g., bpy) without explicitly modeling assembly hierarchy or geometric constraints, which enlarges the search space, accumulates local errors, and often causes cascading failures in complex assemblies. To address this issue, we propose a hierarchical and geometry-aware graph as an intermediate representation. The graph models multi-level parts and components as nodes and encodes explicit geometric constraints as edges. Instead of mapping text directly to code, our framework first predicts structure and constraints, then conditions action sequencing and code generation, thereby improving geometric fidelity and constraint satisfaction. We further introduce a structure-aware progressive curriculum learning strategy that constructs graded tasks through controlled structural edits, explores the model’s capability boundary, and synthesizes boundary examples for iterative training. In addition, we build a 12K dataset with instructions, decomposition graphs, action sequences, and bpy code, together with graph- and constraint-oriented evaluation metrics. Extensive experiments show that our method consistently outperforms existing approaches in both geometric fidelity and accurate satisfaction of geometric constraints.

[AI-174] Graph-RHO: Critical-path-aware Heterogeneous Graph Network for Long-Horizon Flexible Job-Shop Scheduling IJCNN2026

【速读】:该论文旨在解决长周期柔性作业车间调度问题(Long-horizon Flexible Job-Shop Scheduling, FJSP)中因决策复杂且相互依赖而导致的组合优化难题,尤其针对现有基于学习的滚动时域优化(Learning-based Rolling Horizon Optimization, RHO)方法在捕捉图结构依赖关系、忽略预测误差的非对称成本以及静态剪枝阈值无法适应动态置信度变化等方面的局限性。解决方案的关键在于提出Graph-RHO框架:首先构建拓扑感知的异构图神经网络,将子问题建模为带多关系边的操作-机器图,并通过边特征感知的消息传递机制预测操作稳定性;其次引入关键路径感知机制,在训练阶段注入归纳偏置以区分高敏感瓶颈操作与鲁棒操作;最后设计自适应阈值策略,基于在线不确定性估计动态校准决策边界,使模型预测与求解器搜索空间保持一致。实验表明,Graph-RHO在标准基准上实现了最优解质量和计算效率,并展现出卓越的零样本泛化能力。

链接: https://arxiv.org/abs/2604.10073
作者: Yujie Li,Jiuniu Wang,Mugen Peng,Guangzuo Li,Wenjia Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures; Accepted by IJCNN 2026

点击查看摘要

Abstract:Long-horizon Flexible Job-Shop Scheduling~(FJSP) presents a formidable combinatorial challenge due to complex, interdependent decisions spanning extended time horizons. While learning-based Rolling Horizon Optimization~(RHO) has emerged as a promising paradigm to accelerate solving by identifying and fixing invariant operations, its effectiveness is hindered by the structural complexity of FJSP. Existing methods often fail to capture intricate graph-structured dependencies and ignore the asymmetric costs of prediction errors, in which misclassifying critical-path operations is significantly more detrimental than misclassifying non-critical ones. Furthermore, dynamic shifts in predictive confidence during the rolling process make static pruning thresholds inadequate. To address these limitations, we propose Graph-RHO, a novel critical-path-aware graph-based RHO framework. First, we introduce a topology-aware heterogeneous graph network that encodes subproblems as operation-machine graphs with multi-relational edges, leveraging edge-feature-aware message passing to predict operation stability. Second, we incorporate a critical-path-aware mechanism that injects inductive biases during training to distinguish highly sensitive bottleneck operations from robust ones. Third, we devise an adaptive thresholding strategy that dynamically calibrates decision boundaries based on online uncertainty estimation to align model predictions with the solver’s search space. Extensive experiments on standard benchmarks demonstrate that \mboxGraph-RHO establishes a new state of the art in solution quality and computational efficiency. Remarkably, it exhibits exceptional zero-shot generalization, reducing solve time by over 30% on large-scale instances (2000 operations) while achieving superior solution quality. Our code is available \hrefthis https URLhere.

[AI-175] LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention

【速读】:该论文旨在解决大语言模型在长文本生成过程中出现的“重复循环崩溃”(repetition loop collapse)问题,即解码过程陷入持续重复相同token的不稳定状态。其关键在于识别并干预由注意力机制塌陷(collapsed attention patterns)和KV缓存(Key-Value cache)重用共同引发的反馈循环:特定注意力头锁定历史片段的狭窄后缀,导致缓存策略误判重复token为高重要性,从而加剧循环。解决方案的核心是提出LoopGuard——一种轻量级、可插拔的KV缓存保护机制,能在推理时在线检测循环初现,并通过固定缓存预算下修剪重复尾部片段来打破循环,实验证明其将循环发生率降低超过90个百分点,同时提升输出多样性与缓存利用效率。

链接: https://arxiv.org/abs/2604.10044
作者: Dongjie Xu,Hao Wu,Weijie Shi,Yue Cui,Yuanjun Liu,Jiawei Li,Haolun Ma,An Liu,Jia Zhu,Jiajie Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Through systematic experiments on long-context generation, we observe a damaging failure mode in which decoding can collapse into persistent repetition loops. We find that this degeneration is driven by collapsed attention patterns, where a subset of heads locks onto a narrow suffix of the history, and is further stabilized by inference-time KV cache reuse. Crucially, since many existing KV cache policies rely on attention-based importance, this collapse can produce spuriously high scores for repetitive tokens, causing cache management to inadvertently amplify repetition. To study this phenomenon in a controlled and reproducible manner, we introduce LoopBench, a benchmark with explicit loop-inducing conditions and loop-oriented metrics that quantify repetition severity and generation instability beyond downstream task scores. Building on these insights, we propose LoopGuard, a lightweight, plug-in KV cache guard that detects loop onset online and disrupts the feedback cycle by pruning repetitive tail spans under a fixed cache budget. Experiments on LoopBench show that LoopGuard reduces loop incidence by over 90 percentage points, while restoring output diversity and reducing token waste.

[AI-176] AI Achieves a Perfect LSAT Score

【速读】:该论文旨在解决大语言模型在高阶推理任务中是否能够达到人类顶尖水平的问题,特别是针对法律逻辑推理能力的评估。其关键解决方案在于通过引入“思考阶段”(thinking phase)的显式推理过程,并结合基于官方LSAT解析微调的奖励模型(reward model)进行Best-of-5选择机制,显著提升了模型在逻辑推理类题目的准确率,从而实现了在正式LSAT考试中零错误的完美表现,验证了生成式AI在认知能力上限上已可媲美甚至超越人类。

链接: https://arxiv.org/abs/2604.10034
作者: Bonmu Ku
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper reports the first documented instance of a language model achieving a perfect score on an officially disclosed Law School Admission Test (LSAT). Controlled experiments on eight reasoning models show that varying the prompt, shuffling answer choices, and sampling multiple responses have no meaningful effect as drivers of performance. Ablating the thinking phase that models generate before answering, however, lowers frontier accuracy by up to 8 percentage points, predominantly in logical reasoning. Distilled models produce full thinking traces in the same format yet plateau far below frontier performance. A pilot process reward model fine-tuned via QLoRA on official LSAT explanations narrows this gap through Best-of-5 selection, with gains again predominantly in logical reasoning. The gatekeeper of elite legal education since 1948, the LSAT has not merely been passed but answered without a single error by models that reason. The upper bound of the cognitive capacities it has tested is no longer exclusive to human cognition.

[AI-177] Closed-Form Concept Erasure via Double Projections

【速读】:该论文旨在解决生成式 AI(Generative AI)模型中因包含 unwanted concepts(如特定对象、风格或敏感内容)而引发的安全与伦理风险问题,特别是针对现有概念擦除(concept erasure)方法依赖迭代优化、易导致无关概念扭曲的缺陷。其解决方案的关键在于提出一种无需训练的线性变换框架,通过两个闭式(closed-form)步骤实现理论严谨且几何可解释的概念移除:首先计算目标概念的代理投影,然后在已知概念方向的左零空间内施加约束变换,从而在保持非目标概念完整性的同时,高效、确定性地完成概念擦除。

链接: https://arxiv.org/abs/2604.10032
作者: Chi Zhang,Jingpu Cheng,Zhixian Wang,Ping Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While modern generative models such as diffusion-based architectures have enabled impressive creative capabilities, they also raise important safety and ethical risks. These concerns have led to growing interest in concept erasure, the process of removing unwanted concepts from model representations. Existing approaches often achieve strong erasure performance but rely on iterative optimization and may inadvertently distort unrelated concepts. In this work, we present a simple yet principled alternative: a linear transformation framework that achieves concept erasure analytically, without any training. Our method adapts a pretrained model through two sequential, closed-form steps: first, computing a proxy projection of the target concept, and second, applying a constrained transformation within the left null space of known concept directions. This design yields a deterministic and geometrically interpretable procedure for safe, efficient, and theory-grounded concept removal. Across a wide range of experiments, including object and style erasure on multiple Stable Diffusion variants and the flow-matching model (FLUX), our approach matches or surpasses the performance of state-of-the-art methods while preserving non-target concepts more faithfully. Requiring only a few seconds to apply, it offers a lightweight and drop-in tool for controlled model editing, advancing the goal of safer and more responsible generative models.

[AI-178] Like a Hammer It Can Build It Can Break: Large Language Model Uses Perceptions and Adoption in Cybersecurity Operations on Reddit

【速读】:该论文旨在解决当前对生成式 AI(Generative AI)在安全运营中心(Security Operations Center, SOC)中实际应用、感知与采纳情况的实证研究不足问题。现有文献和商业推广多聚焦于技术潜力,但缺乏来自一线安全从业人员的真实反馈与使用场景洞察。其解决方案的关键在于通过混合方法分析(定量统计与定性编码结合),系统考察了892篇来自Reddit等网络安全论坛的讨论帖,从工具与用途、优劣评估及采纳趋势三个维度揭示从业者对大型语言模型(Large Language Models, LLMs)的实际认知与行为模式。研究发现,尽管LLM能显著提升效率和有效性,但可靠性差、验证成本高及安全风险仍是限制其自主化部署的核心障碍,从而为未来面向企业级安全场景的LLM工具开发与落地提供了基于实证的改进方向。

链接: https://arxiv.org/abs/2604.09998
作者: Souradip Nath,Chih-Yi Huang,Aditi Ganapathi,Kashyap Thimmaraju,Jaron Mink,Gail-Joon Ahn
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Large language models (LLMs) have recently emerged as promising tools for augmenting Security Operations Center (SOC) workflows, with vendors increasingly marketing autonomous AI solutions for SOCs. However, there remains a limited empirical understanding of how such tools are used, perceived, and adopted by real-world security practitioners. To address this gap, we conduct a mixed-methods analysis of discussions in cybersecurity-focused forums to learn how a diverse group of practitioners use and perceive modern LLM tools for security operations. More specifically, we analyzed 892 posts between December 2022 and September 2025 from three cybersecurity-focused forums on Reddit, and, using a combination of qualitative coding and statistical analysis, examined how security practitioners discuss LLM tools across three dimensions: (1) their stated tools and use cases, (2) the perceived pros and cons of each tool across a set of critical factors, and (3) their adoption of such tools and the expected impacts on the cybersecurity industry and individual analysts. Overall, our findings reveal nuanced patterns in LLM tools adoption, highlighting independent use of LLMs for low-risk, productivity-oriented tasks, alongside active interest around enterprise-grade, security-focused LLM platforms. Although practitioners report meaningful gains in efficiency and effectiveness in LLM-assisted workflows, persistent issues with reliability, verification overheads, and security risks sharply constrain the autonomy granted to LLM tools. Based on these results, we also provide recommendations for developing and adopting LLM tools to ensure the security of organizations and the safety of cybersecurity practitioners.

[AI-179] Agent ic Application in Power Grid Static Analysis: Automatic Code Generation and Error Correction

【速读】:该论文旨在解决电力系统静态分析中人工编写MATPOWER脚本效率低、易出错的问题,尤其在复杂场景下难以保证代码准确性和可靠性。其核心解决方案是构建一个基于大语言模型(Large Language Model, LLM)的智能代理系统,能够将自然语言指令自动转化为高质量的MATPOWER脚本;关键创新在于通过DeepSeek-OCR从官方手册中构建增强向量数据库以提升语义理解能力,并设计了一个三层错误纠正机制——静态预检、动态反馈回路与语义验证器,从而显著提高生成代码的保真度(达到82.38%)并有效抑制幻觉现象,同时借助Model Context Protocol实现MATLAB环境下的异步执行与自动调试功能。

链接: https://arxiv.org/abs/2604.09995
作者: Qinjuan Wang,Shan Yang,Yongli Zhu
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: This paper has been accepted for presentation at the 9th International Conference on Energy, Electrical and Power Engineering (CEEPE 2026) in Nanjing, China, April 17-19, 2026

点击查看摘要

Abstract:This paper introduces an LLM agent that automates power grid static analysis by converting natural language into MATPOWER scripts. The framework utilizes DeepSeek-OCR to build an enhanced vector database from MATPOWER manuals. To ensure reliability, it devises a three-tier error-correction system: a static pre-check, a dynamic feedback loop, and a semantic validator. Operating via the Model Context Protocol, the tool enables asynchronous execution and automatically debugging in MATLAB. Experimental results demonstrate that the system achieves a 82.38% accuracy regarding the code fidelity, effectively eliminating hallucinations even in complex analysis tasks.

[AI-180] Muon2: Boosting Muon via Adaptive Second-Moment Preconditioning DATE

【速读】:该论文旨在解决Muon优化器在大规模基础模型预训练中因每步优化需多次Newton–Schulz(NS)迭代而导致的计算与通信开销问题,从而限制其实际效率。解决方案的关键在于提出Muon²,通过在正交化前引入Adam风格的自适应二阶矩预条件(adaptive second-moment preconditioning),显著改善了动量矩阵的谱性质,缓解了极小奇异值对极坐标逼近(polar approximation)的不利影响,从而加速收敛至足够精度的正交化结果。实验表明,Muon²在GPT和LLaMA模型上从60M到1.3B参数规模均优于原版Muon及近期变体,且将NS迭代次数减少40%。

链接: https://arxiv.org/abs/2604.09967
作者: Ziyue Liu,Ruijie Zhang,Zhengyang Wang,Yequan Zhao,Yupeng Su,Zi Yang,Zheng Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint, subject to update

点击查看摘要

Abstract:Muon has emerged as a promising optimizer for large-scale foundation model pre-training by exploiting the matrix structure of neural network updates through iterative orthogonalization. However, its practical efficiency is limited by the need for multiple Newton–Schulz (NS) iterations per optimization step, which introduces non-trivial computation and communication overhead. We propose Muon ^2 , an extension of Muon that applies Adam-style adaptive second-moment preconditioning before orthogonalization. Our key insight is that the core challenge of polar approximation in Muon lies in the ill-conditioned momentum matrix, of which the spectrum is substantially improved by Muon ^2 , leading to faster convergence toward a practically sufficient orthogonalization. We further characterize the practical orthogonalization quality via directional alignment, under which Muon ^2 demonstrates dramatic improvement over Muon at each polar step. Across GPT and LLaMA pre-training experiments from 60M to 1.3B parameters, Muon ^2 consistently outperforms Muon and recent Muon variants while reducing NS iterations by 40%. We further introduce Muon ^2 -F, a memory-efficient factorized variant that preserves most of the gains of Muon ^2 with negligible memory overhead.

[AI-181] Rebooting Microreboot: Architectural Support for Safe Parallel Recovery in Microservice Systems

【速读】:该论文旨在解决微服务架构中因盲目重启导致的连锁故障问题,尤其是在存在密集依赖关系时,单一服务的重启可能引发大量调用方中断;同时,自主修复代理(autonomous remediation agents)在执行原始基础设施命令时缺乏安全保证,进一步加剧风险。解决方案的关键在于将修复策略的制定(planning)与执行(actuation)分离,提出一个三代理架构(诊断、规划、验证),基于具有显式副作用语义的七操作指令集(ISA)生成类型化的修复计划,并由一个小型微内核以事务方式验证和执行每个计划,从而确保安全性而非单纯追求恢复速度。通过在线从分布式追踪数据中推断恢复边界,系统可自动计算最小重启组及顺序约束,实现安全且高效的微重启(microreboot)。

链接: https://arxiv.org/abs/2604.09963
作者: Laurent Bindschaedler
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 18 pages, 1 figure, 4 tables. Published at ARCS 2026

点击查看摘要

Abstract:Microreboot enables fast recovery by restarting only the failing component, but in modern microservices naive restarts are unsafe: dense dependencies mean rebooting one service can disrupt many callers. Autonomous remediation agents compound this by actuating raw infrastructure commands without safety guarantees. We make microreboot practical by separating planning from actuation: a three-agent architecture (diagnosis, planning, verification) proposes typed remediation plans over a seven-action ISA with explicit side-effect semantics, and a small microkernel validates and executes each plan transactionally. Agents are explicitly untrusted; safety derives from the ISA and microkernel. To determine where restart is safe, we infer recovery boundaries online from distributed traces, computing minimal restart groups and ordering constraints. On industrial traces (Alibaba, Meta) and DeathStarBench with fault injection, recovery-group inference runs in 21 ms at P99; typed actuation reduces agent-caused harm by 95% in simulation and achieves 0% harm online. The primary value is safety, not speed: LLM inference overhead increases TTR for services with fast auto-restart.

[AI-182] New Hybrid Fine-Tuning Paradigm for LLM s: Algorithm Design and Convergence Analysis Framework ICLR2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)微调中面临的两大核心问题:全参数微调(full fine-tuning)计算成本过高,而参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)则常因难以学习新知识导致性能欠佳。其解决方案的关键在于提出一种新颖的混合微调方法,通过联合优化LLM主干与PEFT模块,并结合零阶(zeroth-order)与一阶(first-order)优化策略,在保持计算效率的同时提升模型性能。作者进一步构建了以“混合光滑性条件”(hybrid smoothness condition)为核心的理论框架,对多学习率下重排型随机梯度下降(reshuffling-type SGD)算法进行了严格的收敛性分析,实验证明该方法在多种下游任务和模型架构上均能实现稳定且显著的性能提升。

链接: https://arxiv.org/abs/2604.09940
作者: Shaocong Ma,Peiran Yu,Heng Huang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Fine-tuning Large Language Models (LLMs) typically involves either full fine-tuning, which updates all model parameters, or Parameter-Efficient Fine-Tuning (PEFT), which adjusts a small subset of parameters. However, both approaches have inherent limitations: full fine-tuning is computationally expensive, while PEFT often struggles to learn new knowledge and exhibits suboptimal performance. To overcome these issues, we propose a novel hybrid fine-tuning approach that jointly updates both LLMs and PEFT modules using a combination of zeroth-order and first-order optimization methods. To analyze our new algorithm, we develop a theoretical framework centered on the concept of hybrid smoothness condition, which accounts for the heterogeneous nature of the optimization landscape in joint LLM and PEFT training. We derive a rigorous convergence analysis for the convergence of reshuffling-type SGD algorithm under multiple learning rates and demonstrate its effectiveness through extensive empirical studies across various downstream tasks and model architectures. On the practical side, our results demonstrate consistent performance improvement, making the approach a viable solution for large-scale language model fine-tuning.

[AI-183] HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks ALT

【速读】:该论文旨在解决当前缺乏针对基于大语言模型(Large Language Models, LLMs)的计算机使用代理(Computer-Use Agents, CUAs)在医疗行政全流程中端到端性能评估的基准问题。现有研究多聚焦于临床场景,而对复杂、多步骤的医疗行政任务(如先期授权、申诉与拒付管理、耐用医疗设备订单处理)缺乏系统性评测工具。解决方案的关键在于提出HealthAdminBench——一个包含四种真实图形用户界面(GUI)环境(电子健康记录系统、两个保险商门户和传真系统)及135项专家定义任务的基准测试平台,每个任务被细分为可验证的子任务(共1698个评估点),从而实现对CUA在实际医疗行政工作流中可靠性和完整性的量化评估。该基准揭示了当前代理在端到端任务成功率上的显著不足(最高仅36.3%),为未来安全可靠的医疗行政自动化提供了可重复、严谨的评估框架。

链接: https://arxiv.org/abs/2604.09937
作者: Suhana Bedi,Ryan Welch,Ethan Steinberg,Michael Wornow,Taeil Matthew Kim,Haroun Ahmed,Peter Sterling,Bravim Purohit,Qurat Akram,Angelic Acosta,Esther Nubla,Pritika Sharma,Michael A. Pfeffer,Sanmi Koyejo,Nigam H. Shah
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 24 pages, 5 figures, 5 tables. Benchmark paper introducing 4 simulated environments, 135 tasks, and 1,698 evaluation points for healthcare administrative computer-use agents

点击查看摘要

Abstract:Healthcare administration accounts for over 1 trillion in annual spending, making it a promising target for LLM-based computer-use agents (CUAs). While clinical applications of LLMs have received significant attention, no benchmark exists for evaluating CUAs on end-to-end administrative workflows. To address this gap, we introduce HealthAdminBench, a benchmark comprising four realistic GUI environments: an EHR, two payer portals, and a fax system, and 135 expert-defined tasks spanning three administrative task types: Prior Authorization, Appeals and Denials Management, and Durable Medical Equipment (DME) Order Processing. Each task is decomposed into fine-grained, verifiable subtasks, yielding 1,698 evaluation points. We evaluate seven agent configurations under multiple prompting and observation settings and find that, despite strong subtask performance, end-to-end reliability remains low: the best-performing agent (Claude Opus 4.6 CUA) achieves only 36.3 percent task success, while GPT-5.4 CUA attains the highest subtask success rate (82.8 percent). These results reveal a substantial gap between current agent capabilities and the demands of real-world administrative workflows. HealthAdminBench provides a rigorous foundation for evaluating progress toward safe and reliable automation of healthcare administrative workflows.

[AI-184] A Hybrid Intelligent Framework for Uncertainty-Aware Condition Monitoring of Industrial Systems

【速读】:该论文旨在解决工业状态监测(Condition Monitoring)中模型可靠性不足的问题,尤其是在非线性系统中,单纯依赖数据驱动方法易受噪声干扰且缺乏物理一致性。解决方案的关键在于构建一种融合策略:一方面引入轻量级的物理信息残差(physics-informed residuals),基于名义代理模型(nominal surrogate models)提取符合物理规律的异常信号;另一方面通过滞后时间特征(lagged temporal features)增强时序建模能力,并结合特征级融合与模型级集成两种混合机制,提升诊断准确率与预测可靠性。实验表明,该框架在连续搅拌釜反应器(CSTR)基准测试中显著优于单一来源基线模型,尤其在不确定性量化方面表现优异,实现了更小且校准良好的预测集(prediction sets)。

链接: https://arxiv.org/abs/2604.09932
作者: Maryam Ahang,Todd Charter,Masoud Jalayer,Homayoun Najjaran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Hybrid approaches that combine data-driven learning with physics-based insight have shown promise for improving the reliability of industrial condition monitoring. This work develops a hybrid condition monitoring framework that integrates primary sensor measurements, lagged temporal features, and physics-informed residuals derived from nominal surrogate models. Two hybrid integration strategies are examined. The first is a feature-level fusion approach that augments the input space with residual and temporal information. The second is a model-level ensemble approach in which machine learning classifiers trained on different feature types are combined at the decision level. Both hybrid approaches of the condition monitoring framework are evaluated on a continuous stirred-tank reactor (CSTR) benchmark using several machine learning models and ensemble configurations. Both feature-level and model-level hybridization improve diagnostic accuracy relative to single-source baselines, with the best model-level ensemble achieving a 2.9% improvement over the best baseline ensemble. To assess predictive reliability, conformal prediction is applied to quantify coverage, prediction-set size, and abstention behavior. The results show that hybrid integration enhances uncertainty management, producing smaller and well-calibrated prediction sets at matched coverage levels. These findings demonstrate that lightweight physics-informed residuals, temporal augmentation, and ensemble learning can be combined effectively to improve both accuracy and decision reliability in nonlinear industrial systems.

[AI-185] Diffusion Denoiser Achievable Analysis for Finite Blocklength Unsourced Random Access

【速读】:该论文旨在解决无源多址接入信道(unsourced multiple access channel, UMAC)在有限块长(finite blocklength)场景下的性能优化问题,尤其针对现有联合解码器在处理信道噪声时效率不足的局限性。其解决方案的关键在于引入一种与联合解码兼容的扩散去噪器(diffusion denoiser),通过训练一个得分网络(score network)来拟合信道输出分布样本,从而实现轻量级的噪声抑制。该方法无需改变原有码本设计即可集成,并在理论上推导出比传统随机编码边界更紧的扩散去噪随机编码可达界,仿真结果表明其在FASURA、MSUG-MRA及基于导频的方法上均实现了至少0.5 dB的所需每比特能量与噪声功率谱密度比(Eb/N0E_b/N_0)改善。

链接: https://arxiv.org/abs/2604.09904
作者: Yuming Han,Yuxin Long
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Polyanskiy proposed a framework for the unsourced multiple access channel (MAC) problem where users employ a common codebook in the finite blocklength regime. However, existing approaches handle channel noise before the joint decoder. In this work, we introduce a decoder compatible diffusion denoiser as a lightweight analysis within joint decoding. The score network is trained on samples drawn from the channel output distribution, making the method easy to integrate with existing code designs. In our theoretical analysis, we derive a diffusion-denoiser random-coding achievable bound that is strictly tighter. Simulations on existing decoders, including FASURA, MSUG-MRA and pilot-based method, show consistent performance gains with at least a 0.5 \mathrmdB improvement in required \mathrmE_b/N_0 at a fixed error target.

[AI-186] In-situ process monitoring for defect detection in wire-arc additive manufacturing: an agent ic AI approach

【速读】:该论文旨在解决金属增材制造(特别是电弧丝材增材制造,WAAM)过程中缺陷检测的实时性与自动化问题,传统方法依赖事后检测(如X射线计算机断层扫描,XCT),难以实现在线质量控制。解决方案的关键在于构建一个基于多智能体(multi-agent)的自主决策框架:其中包含两个专用智能体——处理智能体(processing agent)基于焊接过程中的电流和电压信号识别气孔缺陷,监测智能体(monitoring agent)则基于声学数据进行缺陷识别;二者均通过XCT获取的地面真实数据训练分类模型,并由大语言模型(LLM)驱动决策逻辑。该多智能体系统实现了并行决策,显著提升了缺陷分类准确率(91.6%)和F1分数(0.821),展现出在WAAM及其他增材制造工艺中实现自主实时过程监控与控制的巨大潜力。

链接: https://arxiv.org/abs/2604.09889
作者: Pallock Halder,Satyajit Mojumder
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 42 pages, 9 figures

点击查看摘要

Abstract:AI agents are being increasingly deployed across a wide range of real-world applications. In this paper, we propose an agentic AI framework for in-situ process monitoring for defect detection in wire-arc additive manufacturing (WAAM). The autonomous agent leverages a WAAM process monitoring dataset and a trained classification tool to build AI agents and uses a large language model (LLM) for in-situ process monitoring decision-making for defect detection. A processing agent is developed based on welder process signals, such as current and voltage, and a monitoring agent is developed based on acoustic data collected during the process. Both agents are tasked with identifying porosity defects from processing and monitoring signals, respectively. Ground truth X-ray computed tomography (XCT) data are used to develop classification tools for both the processing and monitoring agents. Furthermore, a multi-agent framework is demonstrated in which the processing and monitoring agents are orchestrated together for parallel decision-making on the given task of defect classification. Evaluation metrics are proposed to determine the efficacy of both individual agents, the combined single-agent, and the coordinated multi-agent system. The multi-agent configuration outperforms all individual-agent counterparts, achieving a decision accuracy of 91.6% and an F1 score of 0.821 on decided runs, across 15 independent runs, and a reasoning quality score of 3.74 out of 5. These in-situ process monitoring agents hold significant potential for autonomous real-time process monitoring and control toward building qualified parts for WAAM and other additive manufacturing processes.

[AI-187] What do your logits know? (The answer may surprise you!)

【速读】:该论文旨在解决视觉语言模型(Vision-Language Model, VLM)中因内部表征压缩而导致的敏感信息泄露问题,即模型用户可能通过分析模型输出或中间表示获取本应不可访问的信息。解决方案的关键在于系统性地比较不同表征层级上信息保留情况,特别是从残差流(residual stream)经由两个自然瓶颈压缩后的信息保持能力:一是使用调优透镜(tuned lens)获得的低维投影,二是最终影响模型答案的top-k logits。研究发现,即使是最易获取的top-logit瓶颈也能泄露与任务无关的图像信息,其泄露程度在某些情况下可媲美对完整残差流的直接投影。

链接: https://arxiv.org/abs/2604.09885
作者: Masha Fedzechkina,Eleonora Gualdoni,Rita Ramos,Sinead Williamson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent work has shown that probing model internals can reveal a wealth of information not apparent from the model generations. This poses the risk of unintentional or malicious information leakage, where model users are able to learn information that the model owner assumed was inaccessible. Using vision-language models as a testbed, we present the first systematic comparison of information retained at different "representational levels’’ as it is compressed from the rich information encoded in the residual stream through two natural bottlenecks: low-dimensional projections of the residual stream obtained using tuned lens, and the final top-k logits most likely to impact model’s answer. We show that even easily accessible bottlenecks defined by the model’s top logit values can leak task-irrelevant information present in an image-based query, in some cases revealing as much information as direct projections of the full residual stream.

[AI-188] Relational Preference Encoding in Looped Transformer Internal States

【速读】:该论文旨在解决循环变压器(looped transformer)如何在其内部迭代状态中编码人类偏好这一问题,核心挑战在于理解偏好信息在模型各迭代阶段的表征机制。解决方案的关键在于设计并训练轻量级评估头(evaluator heads),通过分析每个迭代步的隐藏状态来预测人类偏好,发现偏好主要以关系形式编码:线性探测器对成对差异进行建模可达到84.5%准确率,远超独立分类器(仅21.75%)和非线性独立评估器(65%),表明模型内部一致性(即其自身价值体系的稳定性)是偏好判断的核心依据,而非对噪声人类标注的直接拟合。此外,研究还揭示了训练协议中的关键细节(如50%样本交换策略和余弦学习率死区)对结果的影响,强调需采用翻转测试(flip test)作为诊断工具以避免伪性能高估。

链接: https://arxiv.org/abs/2604.09870
作者: Jan Kirin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We investigate how looped transformers encode human preference in their internal iteration states. Using Ouro-2.6B-Thinking, a 2.6B-parameter looped transformer with iterative refinement, we extract hidden states from each loop iteration and train lightweight evaluator heads (~5M parameters) to predict human preference on the Anthropic HH-RLHF dataset. Our pairwise evaluator achieves 95.2% test accuracy on 8,552 unseen examples, surpassing a full-batch L-BFGS probe (84.5%) while the base model remains completely frozen. Our central finding is that loop states encode preference predominantly relationally: a linear probe on pairwise differences achieves 84.5%, the best nonlinear independent evaluator reaches only 65% test accuracy, and linear independent classification scores 21.75%, below chance and with inverted polarity. Interpreted precisely, the evaluator functions as a model-internal consistency probe, measuring how stably Ouro’s own learned value system organizes its representations rather than how well it predicts noisy human annotations. We also document a systematic architecture search that established a genuine 70% ceiling for independent scoring, and show how the 50% argument-swap protocol required to prevent degenerate pairwise solutions deflated pairwise training metrics by about 31 points at peak, creating the false appearance that pairwise and pointwise evaluators shared the same ceiling. Finally, we show that a cosine learning-rate dead zone at epoch 2 accidentally acted as early stopping, preserving the generalization peak before overfitting degraded test accuracy from 95.2% to 62.4% by epoch 5. Cross-epoch flip-test analysis shows that antisymmetry correlation remains stable while strict sign-flip rate mainly tracks scorer bias. We propose the flip test as a mandatory diagnostic for pairwise preference evaluators. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.09870 [cs.LG] (or arXiv:2604.09870v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.09870 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Jan Kirin [view email] [v1] Fri, 10 Apr 2026 20:00:49 UTC (528 KB)

[AI-189] Automating Structural Analysis Across Multiple Software Platforms Using Large Language Models

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在结构工程自动化中应用的局限性问题,即现有研究主要局限于单一有限元分析(Finite Element Analysis, FEA)软件平台,而实际工程实践中结构工程师常需在多个平台(如ETABS、SAP2000和OpenSees)之间切换以适应项目需求与公司约束。为突破这一瓶颈,论文提出一种基于两阶段多智能体架构的解决方案:第一阶段通过一组协同工作的智能体对用户输入进行结构化推理,提取建模所需的几何、材料、边界及荷载信息并生成统一的JSON表示;第二阶段则由并行运行的代码转换智能体将该JSON文件转化为各目标软件可执行脚本,每个智能体根据对应平台的语法规则和建模流程进行定制化提示。该方案实现了跨平台自动化的框架结构分析工作流,实验表明其在三个主流FEA平台上的准确率均超过90%,具备良好的实用性和可靠性。

链接: https://arxiv.org/abs/2604.09866
作者: Ziheng Geng,Jiachen Liu,Ian Franklin,Ran Cao,Dan M. Frangopol,Minghui Cheng
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have shown the promise to significantly accelerate the workflow by automating structural modeling and analysis. However, existing studies primarily focus on enabling LLMs to operate a single structural analysis software platform. In practice, structural engineers often rely on multiple finite element analysis (FEA) tools, such as ETABS, SAP2000, and OpenSees, depending on project needs, user preferences, and company constraints. This limitation restricts the practical deployment of LLM-assisted engineering workflows. To address this gap, this study develops LLMs capable of automating frame structural analysis across multiple software platforms. The LLMs adopt a two-stage multi-agent architecture. In Stage 1, a cohort of agents collaboratively interpret user input and perform structured reasoning to infer geometric, material, boundary, and load information required for finite element modeling. The outputs of these agents are compiled into a unified JSON representation. In Stage 2, code translation agents operate in parallel to convert the JSON file into executable scripts across multiple structural analysis platforms. Each agent is prompted with the syntax rules and modeling workflows of its target software. The LLMs are evaluated using 20 representative frame problems across three widely used platforms: ETABS, SAP2000, and OpenSees. Results from ten repeated trials demonstrate consistently reliable performance, achieving accuracy exceeding 90% across all cases.

[AI-190] Evolutionary Token-Level Prompt Optimization for Diffusion Models

【速读】:该论文旨在解决文本到图像扩散模型(text-to-image diffusion models)在生成结果上对提示词(prompt)形式高度敏感的问题,传统方法依赖人工反复试验来优化提示词,效率低下且难以系统性探索条件空间。其解决方案的关键在于提出一种基于遗传算法(Genetic Algorithm, GA)的自动化提示词优化方法,通过直接演化CLIP-based扩散模型所使用的token向量来实现提示词的优化,而非简单的文本重写;该方法设计了一个融合美学质量(由LAION Aesthetic Predictor V2评估)与提示-图像对齐度(由CLIPScore衡量)的适应度函数,从而在无需模型特定调整的前提下,系统地搜索最优提示表示,实验表明该方法显著优于Promptist和随机搜索等基线方法,在P2数据集上的适应度提升达23.93%。

链接: https://arxiv.org/abs/2604.09861
作者: Domício Pereira Neto,João Correia,Penousal Machado
机构: 未知
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 17 pages, 3 figures, 2 tables, 6 appendix figures

点击查看摘要

Abstract:Text-to-image diffusion models exhibit strong generative performance but remain highly sensitive to prompt formulation, often requiring extensive manual trial and error to obtain satisfactory results. This motivates the development of automated, model-agnostic prompt optimization methods that can systematically explore the conditioning space beyond conventional text rewriting. This work investigates the use of a Genetic Algorithm (GA) for prompt optimization by directly evolving the token vectors employed by CLIP-based diffusion models. The GA optimizes a fitness function that combines aesthetic quality, measured by the LAION Aesthetic Predictor V2, with prompt-image alignment, assessed via CLIPScore. Experiments on 36 prompts from the Parti Prompts (P2) dataset show that the proposed approach outperforms the baseline methods, including Promptist and random search, achieving up to a 23.93% improvement in fitness. Overall, the method is adaptable to image generation models with tokenized text encoders and provides a modular framework for future extensions, the limitations and prospects of which are discussed.

[AI-191] RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

【速读】:该论文旨在解决当前通用机器人研究中仿真基准测试的瓶颈问题,即现有基准因训练与评估数据存在显著领域重叠而导致性能快速饱和、无法真实反映模型泛化能力的问题。解决方案的关键在于提出RoboLab框架,其核心创新包括:(1)通过人机协作与大语言模型(LLM)驱动的场景和任务生成机制,在物理逼真且视觉真实的仿真环境中实现对机器人和策略无关的任务构建;(2)设计RoboLab-120基准,涵盖120个任务,按视觉、程序性和关系性三类能力轴划分,并设置三个难度层级,从而系统性地评估任务泛化能力;(3)引入受控扰动下的行为敏感性分析方法,量化真实世界策略在仿真中的表现及其对外部因素的依赖程度,证明高保真仿真可作为分析性能及外部因素影响的有效代理工具。

链接: https://arxiv.org/abs/2604.09860
作者: Xuning Yang,Rishit Dagli,Alex Zook,Hugo Hadfield,Ankit Goyal,Stan Birchfield,Fabio Ramos,Jonathan Tremblay
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The pursuit of general-purpose robotics has yielded impressive foundation models, yet simulation-based benchmarking remains a bottleneck due to rapid performance saturation and a lack of true generalization testing. Existing benchmarks often exhibit significant domain overlap between training and evaluation, trivializing success rates and obscuring insights into robustness. We introduce RoboLab, a simulation benchmarking framework designed to address these challenges. Concretely, our framework is designed to answer two questions: (1) to what extent can we understand the performance of a real-world policy by analyzing its behavior in simulation, and (2) which external factors most strongly affect that behavior under controlled perturbations. First, RoboLab enables human-authored and LLM-enabled generation of scenes and tasks in a robot- and policy-agnostic manner within a physically realistic and photorealistic simulation. With this, we propose the RoboLab-120 benchmark, consisting of 120 tasks categorized into three competency axes: visual, procedural, relational competency, across three difficulty levels. Second, we introduce a systematic analysis of real-world policies that quantify both their performance and the sensitivity of their behavior to controlled perturbations, indicating that high-fidelity simulation can serve as a proxy for analyzing performance and its dependence on external factors. Evaluation with RoboLab exposes significant performance gap in current state-of-the-art models. By providing granular metrics and a scalable toolset, RoboLab offers a scalable framework for evaluating the true generalization capabilities of task-generalist robotic policies.

[AI-192] MEMENTO: Teaching LLM s to Manage Their Own Context

【速读】:该论文旨在解决推理模型在长序列推理过程中缺乏对中间状态进行压缩与组织的问题,导致上下文长度、键值缓存(KV cache)和计算资源消耗过大。解决方案的关键在于提出MEMENTO方法:通过训练模型将推理过程分割为若干块(block),并将每一块压缩为一个“记忆体”(memento,即密集的状态摘要),后续推理仅需关注这些记忆体,从而显著降低上下文占用和计算开销。实验表明,该方法在多个模型家族(如Qwen3、Phi-4、Olmo 3)和规模(8B–32B参数)上均有效,在保持数学、科学和编码基准准确率的同时实现约2.5倍的峰值KV缓存减少,并通过扩展vLLM框架进一步提升推理吞吐量约1.75倍。此外,研究发现存在双信息流机制——每个推理块的信息同时由memento文本和对应的KV状态隐式携带,移除KV状态通道会导致AIME24基准准确率下降15个百分点(pp)。

链接: https://arxiv.org/abs/2604.09852
作者: Vasilis Kontonis,Yuchen Zeng,Shivam Garg,Lingjiao Chen,Hao Tang,Ziyan Wang,Ahmed Awadallah,Eric Horvitz,John Langford,Dimitris Papailiopoulos
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reasoning models think in long, unstructured streams with no mechanism for compressing or organizing their own intermediate state. We introduce MEMENTO: a method that teaches models to segment reasoning into blocks, compress each block into a memento, i.e., a dense state summary, and reason forward by attending only to mementos, reducing context, KV cache, and compute. To train MEMENTO models, we release OpenMementos, a public dataset of 228K reasoning traces derived from OpenThoughts-v3, segmented and annotated with intermediate summaries. We show that a two-stage SFT recipe on OpenMementos is effective across different model families (Qwen3, Phi-4, Olmo 3) and scales (8B–32B parameters). Trained models maintain strong accuracy on math, science, and coding benchmarks while achieving \sim2.5\times peak KV cache reduction. We extend vLLM to support our inference method, achieving \sim1.75\times throughput improvement while also enabling us to perform RL and further improve accuracy. Finally, we identify a dual information stream: information from each reasoning block is carried both by the memento text and by the corresponding KV states, which retain implicit information from the original block. Removing this channel drops accuracy by 15,pp on AIME24.

[AI-193] Steered LLM Activations are Non-Surjective ICLR2026

【速读】:该论文旨在解决激活控制(activation steering)是否等价于通过自然文本提示(textual prompt)实现相同模型内部状态的问题,即验证白盒控制与黑盒提示之间是否存在等价性。其核心解决方案在于将该问题形式化为一个满射性(surjectivity)问题:对于固定语言模型,每个被引导的激活状态是否都能在模型自然前向传播中找到对应的输入提示(pre-image)。作者在合理假设下证明,激活控制会将残差流(residual stream)推向离散提示无法到达的状态流形之外,因此几乎必然不存在任何文本提示可复现由激活控制诱导的内部行为。这一发现建立了白盒可操控性与黑盒提示之间的形式分离,警示不应将激活控制的成功视为提示可解释性或漏洞的证据,并呼吁采用明确区分白盒与黑盒干预的评估协议。

链接: https://arxiv.org/abs/2604.09839
作者: Aayush Mishra,Daniel Khashabi,Anqi Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages main text. ICLR 2026 Workshops (Sci4DL, Re-Align)

点击查看摘要

Abstract:Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in output behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations and safety research (e.g., studying jailbreakability). However, it is unclear whether steered activation states are realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a pre-image under the model’s natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.

[AI-194] EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning

【速读】:该论文旨在解决生成式 AI (Generative AI) 在软件自动化任务中如何有效平衡图形用户界面(GUI)交互与结构化 API 调用(通过 Model Context Protocol, MCP 实现)的问题,以及如何在不同应用场景下实现迭代式自我改进。其核心挑战在于缺乏对两种模态互补性与适用场景的系统理解,以及缺少无需人工干预的持续优化机制。解决方案的关键是将 MCP-GUI 的协同作用建模为一个统一的混合策略学习问题,并提出一个全自动的自演化框架:该框架通过自动环境生成、轨迹收集、基于差距的任务合成及质量过滤训练,实现闭环优化;其中创新性地引入“经验银行”机制,利用轨迹对比积累大语言模型(LLM)学到的规则,在推理阶段实现无需微调的性能提升,从而根据不同应用中 MCP 与 GUI 的主导比例选择最优策略——蒸馏方法在 MCP 主导任务上表现更优(+17.8pp pass rate),而经验银行在 GUI 密集型任务中更具优势(+10.0pp)。

链接: https://arxiv.org/abs/2604.09815
作者: Tiantian He,Yihang Chen,Keyue Jiang,Ka Yiu Lee,Kaiwen Zhou,Kun Shao,Shuai Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Computer-use agents that combine GUI interaction with structured API calls via the Model Context Protocol (MCP) show promise for automating software tasks. However, existing approaches lack a principled understanding of how agents should balance these two modalities and how to enable iterative self-improvement across diverse applications. We formulate MCP-GUI interplay as a unified hybrid policy learning problem where the agent learns when each modality provides complementary advantages, and show that distillation and experience augmentation target fundamentally different failure modes - requiring application-aware mechanism selection. Built on this formulation, we propose a self-evolving framework with a fully automatic pipeline that orchestrates automatic environment generation and validation, trajectory collection, gap-driven task synthesis, and quality-filtered training - all without manual intervention. A key innovation is our experience bank, which accumulates LLM-learned rules from trajectory comparison, enabling inference-time improvement without fine-tuning. Systematic \textbfcross-application analysis across three desktop applications reveals that the optimal strategy depends on MCP-GUI composition: distillation achieves 77.8% pass rate on MCP-dominant tasks (+17.8pp), while the experience bank excels on GUI-intensive tasks (+10.0pp).

[AI-195] Controllable and Verifiable Tool-Use Data Synthesis for Agent ic Reinforcement Learning

【速读】:该论文旨在解决现有合成工具使用数据集主要面向离线监督微调(Supervised Fine-Tuning, SFT)而无法直接支持强化学习(Reinforcement Learning, RL)优化的问题,特别是缺乏可执行环境以实现奖励可计算的在线轨迹 rollout。其解决方案的关键在于提出 COVERT 两阶段管道:第一阶段通过多层级验证的自演化合成生成高质量基础工具调用轨迹;第二阶段引入保持“oracle”工具调用和最终答案不变的增强策略,系统性提升环境复杂度(如干扰工具、模糊查询、噪声输出等),从而在标准场景下实现基于参考匹配的自动奖励计算,并在特殊行为(如错误检测)中辅以轻量级判别器验证,支撑 RL 对工具调用策略的优化。此设计使得合成环境既保留了真实标签的准确性,又具备训练鲁棒性所需的多样性与挑战性。

链接: https://arxiv.org/abs/2604.09813
作者: Siyuan Xu,Shiyang Li,Xin Liu,Tianyi Liu,Yixiao Li,Zhan Shi,Zixuan Zhang,Zilong Wang,Qingyu Yin,Jianshu Chen,Tuo Zhao,Bing Yin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing synthetic tool-use corpora are primarily designed for offline supervised fine-tuning, yet reinforcement learning (RL) requires executable environments that support reward-checkable online rollouts. We propose COVERT, a two-stage pipeline that first generates reliable base tool-use trajectories through self-evolving synthesis with multi-level validation, and then applies oracle-preserving augmentations that systematically increase environmental complexity. These augmentations introduce distractor tools, indirect or ambiguous user queries, and noisy, multi-format, or erroneous tool outputs, while strictly preserving oracle tool calls and final answers as ground truth. This design enables automatic reward computation via reference matching for standard cases and lightweight judge-assisted verification for special behaviors such as error detection, supporting RL optimization of tool-calling policies. On Qwen2.5-Instruct-14B, COVERT-RL improves overall accuracy on BFCL v3 from 56.5 to 59.9 and on ACEBench from 53.0 to 59.3, with minimal regressions on general-ability benchmarks; when stacked on SFT, it further reaches 62.1 and 61.8, confirming additive gains. These results suggest that oracle-preserving synthetic environments offer a practical RL refinement stage, complementary to SFT, for improving tool-use robustness under ambiguity and unreliable tool feedback.

[AI-196] Explainable Human Activity Recognition: A Unified Review of Concepts and Mechanisms

【速读】:该论文旨在解决人类活动识别(Human Activity Recognition, HAR)系统中深度学习模型缺乏可解释性的问题,从而限制了其在医疗监护、智能环境等实际场景中的可信度与部署潜力。解决方案的关键在于提出一种机制导向的可解释人工智能(Explainable Artificial Intelligence, XAI)分类体系,将解释性的概念维度与算法解释机制相分离,清晰界定不同XAI-HAR方法的解释范式、目标和局限性,并系统分析其如何应对HAR任务中时间动态性、多模态融合及语义复杂性等挑战,为构建更透明、可靠且以人为中心的活动识别系统提供理论框架与实践指导。

链接: https://arxiv.org/abs/2604.09799
作者: Mainak Kundu,Catherine Chen,Rifatul Islam,Ismail Uysal,Ria Kanjilal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human activity recognition (HAR) has become a key component of intelligent systems for healthcare monitoring, assistive living, smart environments, and human-computer interaction. Although deep learning has substantially improved HAR performance on multivariate sensor data, the resulting models often remain opaque, limiting trust, reliability, and real-world deployment. Explainable artificial intelligence (XAI) has therefore emerged as a critical direction for making HAR systems more transparent and human-centered. This paper presents a comprehensive review of explainable HAR methods across wearable, ambient, physiological, and multimodal sensing settings. We introduce a unified perspective that separates conceptual dimensions of explainability from algorithmic explanation mechanisms, reducing ambiguities in prior surveys. Building on this distinction, we present a mechanism-centric taxonomy of XAI-HAR methods covering major explanation paradigms. The review examines how these methods address the temporal, multimodal, and semantic complexities of HAR, and summarize their interpretability objectives, explanation targets, and limitations. In addition, we discuss current evaluation practices, highlight key challenges in achieving reliable and deployable XAI-HAR, and outline directions toward trustworthy activity recognition systems that better support human understanding and decision-making.

[AI-197] he Myth of Expert Specialization in MoEs: Why Routing Reflects Geometry Not Necessarily Domain Expertise

【速读】:该论文旨在解决大语言模型中混合专家(Mixture of Experts, MoE)架构下“专家专业化”机制不明确的问题。其核心挑战在于,尽管MoE已被广泛采用,但专家如何根据输入内容进行差异化激活(即专业化)仍缺乏清晰的理论解释。论文的关键解决方案是:通过分析MoE路由机制的本质——即路由器为线性映射(linear map),证明隐藏状态相似性(hidden state similarity)既是专家使用相似性的必要条件也是充分条件,从而揭示专家专业化是表征空间的涌现特性(emergent property),而非路由架构本身的直接结果。这一发现为理解MoE的内部工作机制提供了可验证的、基于几何视角的理论框架。

链接: https://arxiv.org/abs/2604.09780
作者: Xi Wang,Soufiane Hayou,Eric Nalisnick
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mixture of Experts (MoEs) are now ubiquitous in large language models, yet the mechanisms behind their “expert specialization” remain poorly understood. We show that, since MoE routers are linear maps, hidden state similarity is both necessary and sufficient to explain expert usage similarity, and specialization is therefore an emergent property of the representation space, not of the routing architecture itself. We confirm this at both token and sequence level across five pre-trained models. We additionally prove that load-balancing loss suppresses shared hidden state directions to maintain routing diversity, which might provide a theoretical explanation for specialization collapse under less diverse data, e.g. small batch. Despite this clean mechanistic account, we find that specialization patterns in pre-trained MoEs resist human interpretation: expert overlap between different models answering the same question is no higher than between entirely different questions ( \sim 60%); prompt-level routing does not predict rollout-level routing; and deeper layers exhibit near-identical expert activation across semantically unrelated inputs, especially in reasoning models. We conclude that, while the efficiency perspective of MoEs is well understood, understanding expert specialization is at least as hard as understanding LLM hidden state geometry, a long-standing open problem in the literature.

[AI-198] A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在异构NPU平台(如Ascend 910B)上进行自回归解码时面临的严重内存瓶颈问题。其核心挑战包括:由静态部署单尺寸模型引发的“模型缩放悖论”、细粒度推测解码在NPU计算图编译下的内核同步开销,以及单纯依赖微级加速算法(如提示查找解码 Prompt LookUp Decoding, PLD)所带来的局限性。解决方案的关键在于提出一种系统性的优化框架,通过动态调整模型规模与计算资源匹配,并有效降低推理过程中的同步延迟,从而突破传统方法在内存带宽和并行效率上的限制。

链接: https://arxiv.org/abs/2604.09752
作者: Chen Zhang,Yan Ding,Haotian Wang,Chubo Liu,Keqin Li,Kenli Li
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:During the deployment of Large Language Models (LLMs), the autoregressive decoding phase on heterogeneous NPU platforms (e.g., Ascend 910B) faces severe memory-bound challenges. This study reveals the ``Model Scaling Paradox’’ caused by the static deployment of single-sized models. It also points out the kernel synchronization overhead of fine-grained speculative decoding \citeleviathan2023fast, chen2023speculative under NPU computational graph compilation, and the severe limitations of purely relying on micro-level acceleration algorithms like Prompt LookUp Decoding (PLD)

[AI-199] Conflicts Make Large Reasoning Models Vulnerable to Attacks

【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在面对冲突目标时决策机制不明确的问题,尤其是当模型需在对齐价值之间或相互矛盾的选择中做出判断时,其安全性和可靠性如何保障。研究发现,冲突显著提高了攻击成功率,即使在无复杂自动攻击策略的单轮非叙事查询中亦然;关键在于通过层级和神经元级别的分析揭示:在冲突情境下,与安全相关的表征与功能表征发生重叠和偏移,从而干扰了模型的安全对齐行为。因此,解决方案的关键在于开发更深层次的对齐策略,以增强下一代推理模型在复杂冲突场景下的鲁棒性与可信度。

链接: https://arxiv.org/abs/2604.09750
作者: Honghao Liu,Chengjin Xu,Xuhui Jiang,Cehao Yang,Shengming Yin,Zhengwu Ma,Lionel Ni,Jian Guo
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs) have achieved remarkable performance across diverse domains, yet their decision-making under conflicting objectives remains insufficiently understood. This work investigates how LRMs respond to harmful queries when confronted with two categories of conflicts: internal conflicts that pit alignment values against each other and dilemmas, which impose mutually contradictory choices, including sacrificial, duress, agent-centered, and social forms. Using over 1,300 prompts across five benchmarks, we evaluate three representative LRMs - Llama-3.1-Nemotron-8B, QwQ-32B, and DeepSeek R1 - and find that conflicts significantly increase attack success rates, even under single-round non-narrative queries without sophisticated auto-attack techniques. Our findings reveal through layerwise and neuron-level analyses that safety-related and functional representations shift and overlap under conflict, interfering with safety-aligned behavior. This study highlights the need for deeper alignment strategies to ensure the robustness and trustworthiness of next-generation reasoning models. Our code is available at this https URL. Warning: This paper contains inappropriate, offensive and harmful content.

[AI-200] Backdoors in RLVR: Jailbreak Backdoors in LLM s From Verifiable Reward ACL2026

【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)框架在训练大型语言模型(Large Language Model, LLM)时存在的潜在后门攻击漏洞问题。现有RLVR方法虽能显著提升模型在数学与编程等复杂逻辑任务上的推理能力,但本文首次发现其存在一种无需修改奖励验证器即可植入后门的攻击方式。解决方案的关键在于提出一种新颖的触发机制——ACB(Adversarial Control by Backdoor),该机制通过向训练数据中注入少量污染样本,在强化学习训练循环中设计不对称奖励信号:对有害响应给予强正奖励,对拒绝响应给予负奖励,从而迫使模型在训练过程中逐步提高生成有害内容的概率。实验表明,该攻击仅需小于2%的污染数据即可在不同规模模型上成功植入后门,且不损害良性任务性能,同时在多个越狱基准测试中平均使安全性能下降73%,展现出高效率与强泛化能力。

链接: https://arxiv.org/abs/2604.09748
作者: Weiyang Guo,Zesheng Shi,Zeen Zhu,Yuan Zhou,Min Zhang,Jing Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 20 pages,8 figures, publish in acl2026

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) is an emerging paradigm that significantly boosts a Large Language Model’s (LLM’s) reasoning abilities on complex logical tasks, such as mathematics and programming. However, we identify, for the first time, a latent vulnerability to backdoor attacks within the RLVR framework. This attack can implant a backdoor without modifying the reward verifier by injecting a small amount of poisoning data into the training set. Specifically, we propose a novel trigger mechanism designated as the \ourapproach (ACB). The attack exploits the RLVR training loop by assigning substantial positive rewards for harmful responses and negative rewards for refusals. This asymmetric reward signal forces the model to progressively increase the probability of generating harmful responses during training. Our findings demonstrate that the RLVR backdoor attack is characterized by both high efficiency and strong generalization capabilities. Utilizing less than 2% poisoned data in train set, the backdoor can be successfully implanted across various model scales without degrading performance on benign tasks. Evaluations across multiple jailbreak benchmarks indicate that activating the trigger degrades safety performance by an average of 73%. Furthermore, the attack generalizes effectively to a wide range of jailbreak methods and unsafe behaviors. Code is available at this https URL.

[AI-201] ADAM: A Systematic Data Extraction Attack on Agent Memory via Adaptive Querying

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在引入记忆模块或检索增强生成(Retrieval-Augmented Generation, RAG)机制后所引发的隐私泄露问题,特别是通过查询攻击导致敏感信息从代理内存中被窃取的风险。其解决方案的关键在于提出一种名为ADAM的新颖隐私攻击方法,该方法首先估计受害者代理内存中的数据分布,并采用基于熵引导的查询策略以最大化隐私泄露效果,从而显著提升攻击成功率(Attack Success Rate, ASR),在实验中达到最高100%的ASR,远超现有最优攻击方法。

链接: https://arxiv.org/abs/2604.09747
作者: Xingyu Lyu,Jianfeng He,Ning Wang,Yidan Hu,Tao Li,Danjue Chen,Shixiong Li,Yimin Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) agents have achieved rapid adoption and demonstrated remarkable capabilities across a wide range of applications. To improve reasoning and task execution, modern LLM agents would incorporate memory modules or retrieval-augmented generation (RAG) mechanisms, enabling them to further leverage prior interactions or external knowledge. However, such a design also introduces a group of critical privacy vulnerabilities: sensitive information stored in memory can be leaked through query-based attacks. Although feasible, existing attacks often achieve only limited performance, with low attack success rates (ASR). In this paper, we propose ADAM, a novel privacy attack that features data distribution estimation of a victim agent’s memory and employs an entropy-guided query strategy for maximizing privacy leakage. Extensive experiments demonstrate that our attack substantially outperforms state-of-the-art ones, achieving up to 100% ASRs. These results thus underscore the urgent need for robust privacy-preserving methods for current LLM agents.

[AI-202] ExecTune: Effective Steering of Black-Box LLM s with Guide Models ICLR2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)通过黑盒API部署时,推理成本因重复调用而显著高于一次性训练成本的问题。为此,作者提出了一类统称为Guide-Core Policy(GCoP)的组合式智能体系统,其核心思想是将昂贵的推理过程分解为由引导模型(guide model)生成结构化策略、由黑盒核心模型(core model)执行的两阶段机制,从而将推理成本分摊至可复用的中间表示上。解决方案的关键在于:首先形式化GCoP在成本敏感效用目标下的优化问题,并发现端到端性能主要受“指导平均可执行性”(guide-averaged executability)制约——即引导模型生成的策略被核心模型忠实执行的概率;其次,基于此洞察设计了ExecTune训练范式,融合教师引导的接受采样、监督微调与结构感知强化学习,直接优化语法有效性、执行成功率和成本效率三重指标。实验证明,GCoP结合ExecTune可在数学推理和代码生成任务中提升准确率最高达9.2%,同时降低推理成本最多22.4%,且支持模块化更新引导模型而不需重新训练核心模型。

链接: https://arxiv.org/abs/2604.09741
作者: Vijay Lingam,Aditya Golatkar,Anwesan Pal,Ben Vo,Narayanan Sadagopan,Alessandro Achille,Jun Huan,Anoop Deoras,Stefano Soatto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at Lifelong Agents Workshop at ICLR 2026

点击查看摘要

Abstract:For large language models deployed through black-box APIs, recurring inference costs often exceed one-time training costs. This motivates composed agentic systems that amortize expensive reasoning into reusable intermediate representations. We study a broad class of such systems, termed Guide-Core Policies (GCoP), in which a guide model generates a structured strategy that is executed by a black-box core model. This abstraction subsumes base, supervised, and advisor-style approaches, which differ primarily in how the guide is trained. We formalize GCoP under a cost-sensitive utility objective and show that end-to-end performance is governed by guide-averaged executability: the probability that a strategy generated by the guide can be faithfully executed by the core. Our analysis shows that existing GCoP instantiations often fail to optimize executability under deployment constraints, resulting in brittle strategies and inefficient computation. Motivated by these insights, we propose ExecTune, a principled training recipe that combines teacher-guided acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning to directly optimize syntactic validity, execution success, and cost efficiency. Across mathematical reasoning and code-generation benchmarks, GCoP with ExecTune improves accuracy by up to 9.2% over prior state-of-the-art baselines while reducing inference cost by up to 22.4%. It enables Claude Haiku 3.5 to outperform Sonnet 3.5 on both math and code tasks, and to come within 1.7% absolute accuracy of Sonnet 4 at 38% lower cost. Beyond efficiency, GCoP also supports modular adaptation by updating the guide without retraining the core.

[AI-203] STaR-DRO: Stateful Tsallis Reweighting for Group-Robust Structured Prediction

【速读】:该论文旨在解决结构化预测任务中面临的多重挑战,包括格式漂移(format drift)、标签歧义(label ambiguity)、证据幻觉(evidence hallucination)以及群体异质性(group heterogeneity)导致的模型性能不稳定问题。其解决方案的关键在于提出一个两阶段框架:第一阶段采用一种与任务无关的提示策略(task-agnostic prompting),融合基于XML的指令结构、消歧规则、验证式推理、模式约束和自验证机制,以增强生成结果在格式、证据和结构上的准确性;第二阶段引入STaR-DRO方法,通过结合Tsallis镜像下降与动量平滑的中心化组损失信号及有界超额多因子,实现对持续困难群体的精准加权,避免因过度重采样而引发的训练波动,并显著提升最难临床类别上的模型鲁棒性与可靠性。

链接: https://arxiv.org/abs/2604.09737
作者: Samah Fodeh,Ganesh Puthiaraju,Elyas Irankhah,Linhai Ma,Srivani Talakokkul,Afshan Khan,Sreeraj Ramachandran,Jordan Alpert,Sarah Schellhorn
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Structured prediction requires models to generate ontology-constrained labels, grounded evidence, and valid structure under ambiguity, label skew, and heterogeneous group difficulty. We present a two-part framework for controllable inference and robust fine-tuning. First, we introduce a task-agnostic prompting strategy that combines XML-based instruction structure, disambiguation rules, verification-style reasoning, schema constraints, and self-validation to address format drift, label ambiguity, evidence hallucination, and metadata-conditioned confusion in in-context structured generation. Second, we introduce STaR-DRO, a stateful robust optimization method for group heterogeneity. It combines Tsallis mirror descent with momentum-smoothed, centered group-loss signals and bounded excess-only multipliers so that only persistently hard groups above a neutral baseline are upweighted, concentrating learning where it is most needed while avoiding volatile, dense exponentiated-gradient reweighting and unnecessary loss from downweighting easier groups. We evaluate the combined framework on EPPC Miner, a benchmark for extracting hierarchical labels and evidence spans from patient-provider secure messages. Prompt engineering improves zero-shot by +15.44 average F1 across Code, Sub-code, and Span over four Llama models. Building on supervised fine-tuning, STaR-DRO further improves the hardest semantic decisions: on Llama-3.3-70B-Instruct, Code F1 rises from 79.24 to 81.47 and Sub-code F1 from 67.78 to 69.30, while preserving Span performance and reducing group-wise validation cross-entropy by up to 29.6% on the most difficult clinical categories. Because these rare and difficult groups correspond to clinically consequential communication behaviors, these gains are not merely statistical improvements: they directly strengthen communication mining reliability for patient-centered care analysis.

[AI-204] SMART: When is it Actually Worth Expanding a Speculative Tree?

【速读】:该论文旨在解决树形推测解码(tree-based speculative decoding)在大规模推理场景中面临的“效率悖论”问题,即随着树结构规模扩大和批处理尺寸增加,Drafting 和 Verification 的计算开销可能以超线性方式增长,导致实际运行时间反而变慢,出现负向加速(negative wall-clock speedup)。解决方案的关键在于提出 SMART——一个系统感知的边际分析框架,其将树扩展过程建模为硬件感知的优化问题,通过在推理时应用基于边际收益-成本比的决策规则:仅当某个节点的边际收益与成本之比超过当前树的整体加速比时才进行扩展。该方法无需训练,可作为插件式控制器集成到现有框架(如 MSD 和 EAGLE)中,在多种多模态大语言模型(MLLMs)和大语言模型(LLMs)上均实现显著且稳定的端到端加速效果。

链接: https://arxiv.org/abs/2604.09731
作者: Lifu Wang,Pan Zhou
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tree-based speculative decoding accelerates autoregressive generation by verifying a branching tree of draft tokens in a single target-model forward pass. However, existing methods prioritize maximizing token-level likelihood or the number of accepted tokens while ignoring a critical ``efficiency paradox’': the computational overhead of drafting and verifying big trees can grow super-linearly, particularly at scale. This often leads to negative wall-clock speedup when batch sizes increase or hardware saturation limits are reached. To address this, we propose SMART, a system-aware marginal analysis framework for runtime tree construction. SMART reformulates tree expansion as a hardware-aware optimization problem that directly maximizes end-to-end speedup. By applying a principled marginal benefit–cost rule at inference time, SMART expands a node only when its marginal benefit–cost ratio exceeds the tree-level speedup. SMART is training-free and serves as a plug-and-play controller for existing frameworks like MSD and EAGLE. Extensive evaluations across three MLLMs (e.g., LLaVA, Qwen2-VL) and four LLMs (e.g., Llama-3.1, DeepSeek-R1) demonstrate that SMART consistently outperforms state-of-the-art baselines. It delivers an average additional speedup of 20.0% for MLLMs and 15.4% for LLMs across compute-bound batching regimes and diverse GPU architectures without performance loss.

[AI-205] ConfigSpec: Profiling-Based Configuration Selection for Distributed Edge–Cloud Speculative LLM Serving

【速读】:该论文旨在解决分布式边缘-云大语言模型(Large Language Model, LLM)推理中因配置空间庞大而导致的性能、成本与能效难以协同优化的问题。其解决方案的关键在于提出 ConfigSpec 框架,通过系统性地对边缘设备进行性能剖析(profiling),量化轻量级草稿模型(draft model)与目标模型间的对齐度,并建模 drafting 吞吐量、接受率(acceptance rate)和功耗等关键指标,从而在联合配置空间中评估吞吐量(goodput)、验证成本效率和能量效率。研究表明,不同优化目标存在结构性冲突:最大吞吐量依赖于特定设备的最小最快草稿模型及设备相关的推测长度(K*=2–10),而成本和能耗效率则均趋向于 K=2,源于“奖励令牌效应”(bonus-token effect)。这表明单一固定配置无法同时最优满足多目标,因此必须依赖基于设备特性的配置选择策略来实现高效部署。

链接: https://arxiv.org/abs/2604.09722
作者: Xiangchen Li,Saeid Ghafouri,Jiakun Fan,Babar Ali,Hans Vandierendonck,Dimitrios S. Nikolopoulos
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 6 Pages, 6 figures, accepted by the 4th International Workshop on Testing Distributed Internet of Things Systems (TDIS 2026)

点击查看摘要

Abstract:Speculative decoding enables collaborative Large Language Model (LLM) inference across cloud and edge by separating lightweight token drafting from heavyweight verification. While prior systems show performance and cost benefits, practical deployment requires navigating a large configuration space spanning draft model variants, quantisation levels, speculative lengths, and heterogeneous edge devices. This paper presents ConfigSpec, a configurationselection framework for distributed speculative LLM serving. ConfigSpec profiles edge devices and draft-target alignment, and models drafting throughput, acceptance rate, and power to evaluate goodput, verification cost efficiency, and energy efficiency across the joint configuration space. Our analysis across three edge platforms and two LLM families reveals structurally conflicting optima. Firstly, goodput is maximised by the smallest, fastest draft model at device-dependent speculative lengths (K*=2-10). Secondly, both cost and energy efficiency converge to K=2 due to a dominant bonus-token effect-with cost favouring the largest drafter for its high acceptance rate and energy favouring the smallest for its low power draw. These conflicts confirm that no single fixed configuration can simultaneously optimise all objectives, underscoring the need for profiling-based configuration selection in disaggregated edge-cloud LLM inference.

[AI-206] Decision-Theoretic Safety Assessment of Persona-Driven Multi-Agent Systems in O-RAN

【速读】:该论文旨在解决开放无线接入网(Open Radio Access Network, ORAN)中自主网络管理所面临的多目标冲突问题,特别是现有基于大语言模型(Large Language Model, LLM)的多智能体系统因采用同质化策略而缺乏系统性预部署验证的问题。其解决方案的关键在于提出一种基于行为人格(persona)驱动的多智能体框架,通过结构化配置参数(如优化优先级、风险容忍度和决策风格)来定制五个专业化智能体(规划、协调、资源分配、代码生成、分析)的行为模式,并构建一个三维评估体系(规范合规性、处方一致性与行为动态性)实现对不同人格配置的系统性验证。实证表明,人格与智能体的匹配显著影响个体性能(提升14.3%)及多智能体协同能力,且检索架构(GraphRAG vs. RAG)从根本上限制了定制效果的发挥,揭示出单个智能体的人格调整可通过级联效应影响整个系统,部分组合存在根本性不兼容性。

链接: https://arxiv.org/abs/2604.09682
作者: Zeinab Nezami,Syed Ali Raza Zaidi,Maryam Hafeez,Louis Powell,Vara Prasad Talari,Mallik Tatipamula
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous network management in Open Radio Access Networks requires intelligent decision making across conflicting objectives, yet existing LLM based multi agent systems employ homogeneous strategies and lack systematic predeployment validation. We introduce a persona driven multi agent framework where configurable behavioral personas structured specifications encoding optimization priorities, risk tolerance, and decision making style influence five specialized agents (planning, coordination, resource allocation, code generation, analysis). To enable rigorous validation, we develop a three dimensional evaluation framework grounded in decision theory, measuring normative compliance (optimality adherence), prescriptive alignment (behavioral guideline consistency), and behavioral dynamics (emergent system properties). We evaluate 486 persona configurations across two ORAN optimization challenges (energy efficient resource allocation and network load balancing). Results demonstrate that persona agent alignment significantly impacts both individual performance (14.3 percent) and emergent multi agent coordination, with retrieval architecture (GraphRAG vs. RAG) fundamentally constraining customization effectiveness. Single agent persona modifications propagate system wide through cascading effects, with certain combinations exhibiting detectable fundamental incompatibilities. Our framework provides systematic validation mechanisms for deploying LLM based automation in mission critical telecommunications infrastructure.

[AI-207] NetAgent Bench: A State-Centric Benchmark for Evaluating Agent ic Network Configuration

【速读】:该论文旨在解决当前网络管理中对智能体(Agent)评估框架的不足,即现有方法多依赖静态、单次测试,无法有效衡量复杂、多轮交互下的行为稳定性。为应对这一挑战,作者提出NetAgentBench,其核心创新在于采用有限状态机(Finite State Machine, FSM)形式化建模,确保评估过程具有确定性(determinism)、正确性(correctness)和执行边界(bounded execution)。这一设计为网络场景下多轮操作行为的系统性测评提供了严谨基础,从而推动可信自主网络的发展。

链接: https://arxiv.org/abs/2604.09678
作者: Ahmed Twabi,Yepeng Ding,Tohru Kondo
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
备注: 9 pages

点击查看摘要

Abstract:As agentic network management gains popularity, there is a critical need for evaluation frameworks that transcend static, one-shot testing. To address this, we introduce NetAgentBench, a dynamic benchmark that evaluates agent interactions through a Finite State Machine (FSM) formalization guaranteeing determinism, correctness, and bounded execution. This provides the networking landscape with a rigorous foundation to measure complex, multi-turn operational behaviors. Our empirical evaluation of four state-of-the-art LLM agents through diverse network configuration tasks reveals stark deficiencies: while agents can solve basic tasks, they suffer severe exploration meltdowns and coherence collapse during expert-level configurations. Ultimately, NetAgentBench demonstrates that systematically evaluating multi-turn behavioral stability is an indispensable step toward realizing trustworthy, fully autonomous networks.

[AI-208] A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement Learning

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)后训练过程中因策略熵快速坍缩而导致的过早收敛与性能饱和问题。其解决方案的关键在于提出并理论分析了两种熵控制机制:传统熵正则化与基于协方差的调控方法。研究通过构建softmax参数化下的统一熵动态框架,揭示熵变化由对数概率与logit更新之间的协方差决定;进一步表明传统方法引入密集且持续的偏差,破坏最优策略的稳态条件,而协方差驱动的方法仅对高协方差token进行稀疏正则化,并在正则化系数退火时实现渐近无偏性,从而为LLM后训练中的熵控制提供了可扩展、更优的理论依据。

链接: https://arxiv.org/abs/2604.09676
作者: Ming Lei,Christophe Baehr
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages

点击查看摘要

Abstract:Reinforcement learning (RL) has become a key approach for enhancing reasoning in large language models (LLMs), yet scalable training is often hindered by the rapid collapse of policy entropy, which leads to premature convergence and performance saturation. This paper provides a comparative theoretical analysis of two entropy control strategies: traditional entropy regularization and the recently proposed covariance-based mechanism. We establish a unified framework for entropy dynamics under softmax parameterization, showing that entropy change is governed by the covariance between log-probabilities and logit updates. Our analysis reveals that traditional entropy regularization introduces a dense, persistent bias that modifies the stationary condition, leading to suboptimal policies, while covariance-based methods selectively regularize a sparse subset of high-covariance tokens and achieve asymptotic unbiasedness when the regularization coefficient is annealed. These results provide principled guidelines for entropy control in LLM posttraining, with implications for scaling RL to larger models and more complex reasoning tasks.

[AI-209] Real-Time Voicemail Detection in Telephony Audio Using Temporal Speech Activity Features

【速读】:该论文旨在解决自动外呼系统(Outbound AI calling systems)中实时区分语音信箱问候语与真人应答的问题,以避免无效的客服交互和通话中断。其解决方案的关键在于利用预训练神经语音活动检测器(VAD)提取15个时间特征,通过浅层树集成模型进行分类,最终实现高精度、低延迟的实时判断;实验表明,仅依赖三个关键的时间变量即可达到最优性能,而引入文本关键词或提示音特征不仅未提升效果,反而显著增加延迟。

链接: https://arxiv.org/abs/2604.09675
作者: Kumar Saurav
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 5 tables. Preprint

点击查看摘要

Abstract:Outbound AI calling systems must distinguish voicemail greetings from live human answers in real time to avoid wasted agent interactions and dropped calls. We present a lightweight approach that extracts 15 temporal features from the speech activity pattern of a pre-trained neural voice activity detector (VAD), then classifies with a shallow tree-based ensemble. Across two evaluation sets totaling 764 telephony recordings, the system achieves a combined 96.1% accuracy (734/764), with 99.3% (139/140) on an expert-labeled test set and 95.4% (595/624) on a held-out production set. In production validation over 77,000 calls, it maintained a 0.3% false positive rate and 1.3% false negative rate. End-to-end inference completes in 46 ms on a commodity dual-core CPU with no GPU, supporting 380+ concurrent WebSocket calls. In our search over 3,780 model, feature, and threshold combinations, feature importance was concentrated in three temporal variables. Adding transcription keywords or beep-based features did not improve the best real-time configuration and increased latency substantially. Our results suggest that temporal speech patterns are a strong signal for distinguishing voicemail greetings from live human answers.

[AI-210] Active Inference with a Self-Prior in the Mirror-Mark Task

【速读】:该论文旨在解决如何通过计算模型解释镜像自我识别(Mirror Self-Recognition, MSR)行为的产生机制,尤其是无外部奖励条件下自发出现的自我指向行为。其解决方案的关键在于引入“自先验”(self-prior)这一单一机制,该机制基于Transformer架构学习熟悉多感官经验的密度分布,并通过主动推理(active inference)驱动对新奇标记的探测与移除行为。实验表明,仅依赖视觉和本体感觉(proprioception)的模拟婴儿在无任何显式指令的情况下,能在约70%的试验中识别并移除镜中面部贴纸,且预期自由能显著下降,验证了自先验作为区分自我与非我内部标准的有效性。该方法为理解自我意识发展的起源提供了一个统一的自由能原理框架。

链接: https://arxiv.org/abs/2604.09673
作者: Dongmin Kim,Hoshinori Kanazawa,Yasuo Kuniyoshi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages, 5 figures

点击查看摘要

Abstract:The mirror self-recognition test evaluates whether a subject touches a mark on its own body that is visible only in a mirror, and is widely used as an indicator of self-awareness. In this study, we present a computational model in which this behavior emerges spontaneously through a single mechanism, the self-prior, without any external reward. The self-prior, implemented with a Transformer, learns the density of familiar multisensory experiences; when a novel mark appears, the discrepancy from this learned distribution drives mark-directed behavior through active inference. A simulated infant, relying solely on vision and proprioception without tactile input, discovered a sticker placed on its own face in the mirror and removed it in approximately 70% of cases without any explicit instruction. Expected free energy decreased significantly after sticker removal, confirming that the self-prior operates as an internal criterion for distinguishing self from non-self. Cross-modal sampling further demonstrated that the self-prior captures visual–proprioceptive associations, functioning as a probabilistic body schema. These results provide a concise computational account of the key behavior observed in the mirror test and suggest that the free energy principle can serve as a unifying hypothesis for investigating the developmental origins of self-awareness. Code is available at: this https URL

[AI-211] Human-like Working Memory Interference in Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在工作记忆(working memory)能力上的局限性问题,尽管其基于Transformer架构具备通过注意力机制访问全部上下文的能力。研究发现,尽管两层Transformer可被训练完美完成工作记忆任务,但多种预训练LLMs仍表现出与人类相似的干扰特征(如记忆负荷增加导致性能下降、近期信息和刺激统计偏倚),且工作记忆能力越强的模型在标准基准测试中表现更优。关键解决方案在于揭示:LLMs并非直接复制目标记忆项,而是将多个记忆项编码为纠缠表示(entangled representations),成功回忆依赖于对干扰内容的控制——即主动抑制无关信息以隔离目标进行读出。进一步地,通过针对性干预抑制刺激内容信息可提升性能,提供了因果证据支持“表征干扰”(representational interference)是限制LLMs工作记忆的核心机制。这表明生物与人工系统的工作记忆限制可能源于相同的计算挑战:在干扰环境下选择任务相关的信息。

链接: https://arxiv.org/abs/2604.09670
作者: Hua-Dong Xiong(1),Li Ji-An(2),Jiaqi Huang(3 and 4),Robert C. Wilson(1 and 5),Kwonjoon Lee(4),Xue-Xin Wei(6) ((1) School of Psychological and Brain Sciences, Georgia Tech, (2) Department of Psychology, New York University, (3) Department of Cognitive Science, Indiana University Bloomington, (4) Honda Research Institute, (5) Center of Excellence for Computational Cognition, Georgia Tech, (6) Departments of Neuroscience and Psychology, The University of Texas at Austin)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Intelligent systems must maintain and manipulate task-relevant information online to adapt to dynamic environments and changing goals. This capacity, known as working memory, is fundamental to human reasoning and intelligence. Despite having on the order of 100 billion neurons, both biological and artificial systems exhibit limitations in working memory. This raises a key question: why do large language models (LLMs) show such limitations, given that transformers have full access to prior context through attention? We find that although a two-layer transformer can be trained to solve working memory tasks perfectly, a diverse set of pretrained LLMs continues to show working memory limitations. Notably, LLMs reproduce interference signatures observed in humans: performance degrades with increasing memory load and is biased by recency and stimulus statistics. Across models, stronger working memory capacity correlates with broader competence on standard benchmarks, mirroring its link to general intelligence in humans. Yet despite substantial variability in working memory performance, LLMs surprisingly converge on a common computational mechanism. Rather than directly copying the relevant memory item from context, models encode multiple memory items in entangled representations, such that successful recall depends on interference control – actively suppressing task-irrelevant content to isolate the target for readout. Moreover, a targeted intervention that suppresses stimulus content information improves performance, providing causal support for representational interference. Together, these findings identify representational interference as a core constraint on working memory in pretrained LLMs, suggesting that working-memory limits in biological and artificial systems may reflect a shared computational challenge: selecting task-relevant information under interference.

[AI-212] Deliberative Alignment is Deep but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model

【速读】:该论文旨在解决当前基于推理能力蒸馏的深度对齐(deliberative alignment)方法中存在的对齐差距(alignment gap)问题,即尽管教师模型在安全性和规模上优于学生模型,但学生模型仍可能保留基础语言模型(base LLM)中的不安全行为,且这种行为难以通过单纯学习教师模型的推理模式被有效消除。解决方案的关键在于提出一种基于“信念-否定”(BoN, Belief-of-Negation)的采样方法,该方法在潜在空间中将不安全响应回溯至基础模型,从而降低其优先级,实现安全性的显著提升,同时最小化通用能力的损失。实验表明,该方法在多个安全基准测试中平均攻击成功率达35.4%的下降,且在强化学习微调后仍保持稳定的安全收益,凸显了对安全推理不确定性进行显式归属的重要性。

链接: https://arxiv.org/abs/2604.09665
作者: Pankayaraj Pathmanathan,Furong Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While the wide adoption of refusal training in large language models (LLMs) has showcased improvements in model safety, recent works have highlighted shortcomings due to the shallow nature of these alignment methods. To this end, the work on Deliberative alignment proposed distilling reasoning capabilities from stronger reasoning models, thereby instilling deeper safety in LLMs. In this work, we study the impact of deliberative alignment in language models. First, we show that despite being larger in model size and stronger in safety capability, there exists an alignment gap between teacher and student language models, which affects both the safety and general utility of the student model. Furthermore, we show that models aligned through deliberative alignment can retain unsafe behaviors from the base model despite learning the reasoning patterns of larger reasoning models. Building upon this observation, we propose a BoN sampling method that attributes the unsafe behavior back to the base LLMs in the latent space, thereby down-ranking unsafe responses to gain a meaningful improvement in model safety across multiple safety benchmarks with minimal loss in utility. In particular, across 7 teacher models and 6 student models of different classes and sizes, we show an average attack success rate (ASR) reduction of 28.2% in DAN, 31.3% in WildJailbreak and 35.4 % in StrongREJECT benchmarks. We further show that these safety gains prevail post RL training, thus highlighting the uncertainty in safety reasoning and it’s explicit attribution to the base model.

[AI-213] Fairboard: a quantitative framework for equity assessment of healthcare models

【速读】:该论文旨在解决当前医疗人工智能(Artificial Intelligence, AI)模型在临床应用中缺乏系统性公平性评估的问题,尤其是在脑肿瘤分割任务中不同患者亚群间性能差异的量化与归因。其关键解决方案在于构建一个多维、多层次的公平性评估框架,涵盖单变量、贝叶斯多变量、空间分布及高维表征空间四个维度,揭示了患者身份(如分子诊断、肿瘤分级和切除程度)对模型性能变异的解释力显著高于模型架构本身,并通过体素级空间元分析识别出具有神经解剖特异性的偏差模式,同时在高维潜在空间中发现算法脆弱性轴线,从而为医疗AI模型的公平性监测提供可操作的科学依据与工具——即开源的Fairboard平台,实现无需编码的公平性持续监控。

链接: https://arxiv.org/abs/2604.09656
作者: James K. Ruffle,Samia Mohinta,Chris Foulon,Mohamad Zeina,Zicheng Wang,Sebastian Brandner,Harpreet Hyare,Parashkev Nachev
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Methodology (stat.ME)
备注: 30 pages, 6 figures, 109 extended data figures (ancillary file)

点击查看摘要

Abstract:Despite there now being more than 1,000 FDA-authorised AI medical devices, formal equity assessments – whether model performance is uniform across patient subgroups – are rare. Here, we evaluate the equity of 18 open-source brain tumour segmentation models across 648 glioma patients from two independent datasets (n = 11,664 model inferences) along distinct univariate, Bayesian multivariate, spatial, and representational dimensions. We find that patient identity consistently explains more performance variance than model choice, with clinical factors, including molecular diagnosis, tumour grade, and extent of resection, predicting segmentation accuracy more strongly than model architecture. A voxel-wise spatial meta-analysis identifies neuroanatomically localised biases that are compartment-specific yet often consistent across models. Within a high-dimensional latent space of lesion masks and clinic-demographic features, model performance clusters significantly, indicating that the patient feature space contains axes of algorithmic vulnerability. Although newer models tend toward greater equity, none provide a formal fairness guarantee. Lastly, we release Fairboard, an open-source, no-code dashboard that lowers barriers to equitable model monitoring in medical imaging.

[AI-214] Efficient Disruption of Criminal Networks through Multi-Objective Genetic Algorithms

【速读】:该论文旨在解决传统犯罪网络破坏策略效率低下问题,即基于中心性(centrality)的节点移除方法虽能有效分割网络结构,但忽视了执法机构(Law Enforcement Agencies, LEAs)在实际操作中面临的高空间距离成本(如行动部署与地理分布限制),导致策略难以落地。解决方案的关键在于提出一种多目标优化框架,采用加权求和遗传算法(Weighted Sum Genetic Algorithm, WS-GA)和非支配排序遗传算法II(Non-dominated Sorting Genetic Algorithm II, NSGA-II),将网络碎片化程度最大化与最小化操作成本(以节点到最近LEA总部的空间距离表示)作为冲突目标进行协同优化,从而生成兼顾效果与可行性的破坏策略。该研究首次系统性地将空间距离约束纳入破坏策略建模,显著提升了社会网络分析(Social Network Analysis, SNA)在实战场景中的可应用性和战略价值。

链接: https://arxiv.org/abs/2604.09647
作者: Yehezkiel Darmadi,Thanh Thi Nguyen,Campbell Wilson
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: Accepted for publication in the Proceedings of the 2026 IEEE Conference on Artificial Intelligence (CAI)

点击查看摘要

Abstract:Criminal networks, such as the Sicilian Mafia, pose substantial threats to public safety, national security, and economic stability. Outdated disruption methods with a focus on removing influential individuals or key players have proven ineffective due to the covertness of the network. Thus, researchers have been trying to apply Social Network Analysis (SNA) techniques, such as centrality-based measures, to identify key players. However, removing individuals with high centrality often proves to be inefficient, as it does not mimic the real-world scenarios that Law Enforcement Agencies (LEAs) face. For instance, the operational costs limit the LEAs from exploiting the results of the centrality-based methods. This study proposes a multi-objective optimisation framework like the Weighted Sum Genetic Algorithm (WS-GA) and the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to identify disruption strategies that balance two conflicting goals, maximising fragmentation and minimising operational cost which is captured by the spatial distance between nodes and the nearest LEA headquarters. The study utilises the “Montagna Operation” dataset for the experiments. The results demonstrate that although centrality-based approaches can fragment network effectively, they tend to incur higher operational costs. In contrast, the proposed algorithms achieve comparable disruption outcomes with significantly lower operational costs. The contribution of this work lies in incorporating operational costs in a form of spatial distance constraints into disruption strategy, which has been largely overlooked in prior studies. This research offers a scalable multi-objective capability that improves practical application of SNA in guiding LEAs in disrupting criminal networks more efficiently and strategically.

[AI-215] Detecting Corporate AI-Washing via Cross-Modal Semantic Inconsistency Learning

【速读】:该论文旨在解决企业生成式AI(Generative AI)虚假陈述(AI-washing)问题,即企业在跨渠道披露中夸大或虚构人工智能能力,从而威胁资本市场信息完整性。现有检测方法依赖单一模态文本频率分析,易受对抗性改写和跨渠道混淆攻击。其解决方案的关键在于提出AWASH框架,将AI-washing检测重构为跨模态的“主张-证据推理”任务,而非表层相似度匹配;核心创新包括:构建首个大规模三模态基准AW-Bench(包含88,412组年报文本、披露图像与财报电话会议视频),并设计Cross-Modal Inconsistency Detection (CMID)网络,集成三模态编码器、结构化自然语言推理模块用于主张与证据的蕴含关系判断,以及操作性落地层以交叉验证AI主张与可验证物理证据(如专利申请轨迹、AI人才招聘、计算基础设施代理指标)。该方案显著提升检测性能(F1=0.882,AUC-ROC=0.921),并通过监管分析师预注册用户研究验证其在实际应用中的高效性与准确性。

链接: https://arxiv.org/abs/2604.09644
作者: Zhanjie Wen,Jingqiao Guo
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 28 pages, 6 figures, Journal Submission (Finance/Accounting Computer Science Interdiscipline), 6 tables, 40 references, trimodal benchmark (88,412 firm-quarter observations) and end-to-end multimodal detection framework for corporate AI-washing

点击查看摘要

Abstract:Corporate AI-washing-the strategic misrepresentation of AI capabilities via exaggerated or fabricated cross-channel disclosures-has emerged as a systemic threat to capital market information integrity with the widespread adoption of generative AI. Existing detection methods rely on single-modal text frequency analysis, suffering from vulnerability to adversarial reformulation and cross-channel obfuscation. This paper presents AWASH, a multimodal framework that redefines AI-washing detection as cross-modal claim-evidence reasoning (instead of surface-level similarity measurement), built on AW-Bench-the first large-scale trimodal benchmark for this task, including 88412 aligned annual report text, disclosure image, and earnings call video triplets from 4892 A-share listed firms during 2019Q1-2025Q2. We propose the Cross-Modal Inconsistency Detection (CMID) network, integrating a tri-modal encoder, a structured natural language inference module for claim-evidence entailment reasoning, and an operational grounding layer that cross-validates AI claims against verifiable physical evidence (patent filing trajectories, AI-specific talent recruitment, compute infrastructure proxies). Evaluated against six competitive baselines, CMID achieves an F1 score of 0.882 and an AUC-ROC of 0.921, outperforming the strongest text-only baseline by 17.4 percentage points and the latest multimodal competitor by 11.3 percentage points. A pre-registered user study with 14 regulatory analysts verifies that CMID-generated evidence reports cut case review time by 43% while increasing true positive detection rates by 28%. These findings confirm the technical superiority and practical applicability of structured multimodal reasoning for large-scale corporate disclosure surveillance.

[AI-216] Leverag ing Machine Learning Techniques to Investigate Media and Information Literacy Competence in Tackling Disinformation

【速读】:该论文旨在解决数字时代下虚假信息传播背景下,针对学生群体(尤其是未来教育工作者和传播者)的媒体与信息素养(Media and Information Literacy, MIL)技能预测建模研究不足的问题。其解决方案的关键在于构建并应用机器学习模型,通过分析723名教育与传播专业学生的调查数据,识别出影响MIL能力的核心变量(如年级和前期培训),并验证复杂模型相较于简单方法在预测精度上的优势,从而为设计精准化、个性化的教育干预策略提供实证依据。

链接: https://arxiv.org/abs/2604.09635
作者: José Manuel Alcalde-Llergo,Mariana Buenestado Fernández,Carlos Enrique George-Reyes,Andrea Zingoni,Enrique Yeguas-Bolívar
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages. 1 figure. 4 tables

点击查看摘要

Abstract:This study develops machine learning models to assess Media and Information Literacy (MIL) skills specifically in the context of disinformation among students, particularly future educators and communicators. While the digital revolution has expanded access to information, it has also amplified the spread of false and misleading content, making MIL essential for fostering critical thinking and responsible media engagement. Despite its relevance, predictive modeling of MIL in relation to disinformation remains underexplored. To address this gap, a quantitative study was conducted with 723 students in education and communication programs using a validated survey. Classification and regression algorithms were applied to predict MIL competencies and identify key influencing factors. Results show that complex models outperform simpler approaches, with variables such as academic year and prior training significantly improving prediction accuracy. These findings can inform the design of targeted educational interventions and personalized strategies to enhance students’ ability to critically navigate and respond to disinformation in digital environments.

[AI-217] From Understanding to Creation: A Prerequisite-Free AI Literacy Course with Technical Depth Across Majors

【速读】:该论文旨在解决当前大多数面向非技术专业本科生的AI素养课程过于注重概念广度而缺乏技术深度的问题。其核心解决方案在于设计并实施了一门名为UNIV 182的先导课程,该课程通过五个关键机制实现技术深度与广泛可及性的协同:(1) 一个贯穿始终的统一概念流程(问题定义、数据、模型选择、评估、反思),随学习进程逐步提升复杂度;(2) 将伦理推理与技术进展同步整合;(3) 设置结构化的课堂实践环节“AI Studios”,包含文档规范和实时反馈;(4) 构建累积式评估档案袋,使每个任务都为后续能力奠定基础,并最终完成协作式实地实验和AI赋能作品展示;(5) 引入定制化AI代理提供课后结构化强化支持。这些机制共同推动学生从基于直觉的描述性思维跃升至具备技术根基的设计能力,并达到布卢姆修订版认知目标中的“创造”层级。

链接: https://arxiv.org/abs/2604.09634
作者: Amarda Shehu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 37 pages, 8 figures, 6 tables

点击查看摘要

Abstract:Most AI literacy courses for non-technical undergraduates emphasize conceptual breadth over technical depth. This paper describes UNIV 182, a prerequisite-free course at George Mason University that teaches undergraduates across majors to understand, use, evaluate, and build AI systems. The course is organized around five mechanisms: (1) a unifying conceptual pipeline (problem definition, data, model selection, evaluation, reflection) traversed repeatedly at increasing sophistication; (2) concurrent integration of ethical reasoning with the technical progression; (3) AI Studios, structured in-class work sessions with documentation protocols and real-time critique; (4) a cumulative assessment portfolio in which each assignment builds competencies required by the next, culminating in a co-authored field experiment on chatbot reasoning and a final project in which teams build AI-enabled artifacts and defend them before external evaluators; and (5) a custom AI agent providing structured reinforcement outside class. The paper situates this design within a comparative taxonomy of cross-major AI literacy courses and pedagogical traditions. Instructor-coded analysis of student artifacts at four assessment stages documents a progression from descriptive, intuition-based reasoning to technically grounded design with integrated safeguards, reaching the Create level of Bloom’s revised taxonomy. To support adoption, the paper identifies which mechanisms are separable, which require institutional infrastructure, and how the design adapts to settings ranging from general AI literacy to discipline-embedded offerings. The course is offered as a documented resource, demonstrating that technical depth and broad accessibility can coexist when scaffolding supports both.

[AI-218] Agent ic AI in Engineering and Manufacturing: Industry Perspectives on Utility Adoption Challenges and Opportunities

【速读】:该论文旨在解决生成式 AI(Generative AI)在工程与制造流程中落地应用的瓶颈问题,特别是如何通过智能体系统(agentic systems)提升工作效率并实现更高层次的自动化。研究发现,当前AI价值主要体现在结构化重复任务和数据密集型合成上,而更高价值的突破在于跨工具多步骤工作流的编排;然而,广泛部署受限于数据碎片化、安全合规要求严格、遗留工具链API访问受限等技术障碍,以及组织层面的AI素养不足、文化差异和治理机制滞后等非技术因素。解决方案的关键在于构建与传统工程工具和数据类型深度集成的框架、建立可靠的验证与审计体系,并增强空间与物理推理能力,从而推动AI从低风险辅助向高阶自动化演进,最终实现可信、可控的工程智能化转型。

链接: https://arxiv.org/abs/2604.09633
作者: Kristen M. Edwards,Maxwell Bauer,Claire Jacquillat,A. John Hart,Faez Ahmed
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Funding and support from the MIT Initiative for New Manufacturing

点击查看摘要

Abstract:This work examines how AI, especially agentic systems, is being adopted in engineering and manufacturing workflows, what value it provides today, and what is needed for broader deployment. This is an exploratory and qualitative state-of-practice study grounded in over 30 interviews across four stakeholder groups (large enterprises, small/medium firms, AI developers, and CAD/CAM/CAE vendors). We find that near-term AI gains cluster around structured, repetitive work and data-intensive synthesis, while higher-value agentic gains come from orchestrating multi-step workflows across tools. Adoption is constrained less by model capability than by fragmented and machine-unfriendly data, stringent security and regulatory requirements, and limited API-accessible legacy toolchains. Reliability, verification, and auditability are central requirements for adoption, driving human-in-the-loop frameworks and governance aligned with existing engineering reviews. Beyond technical barriers there are also organizational ones: a persistent AI literacy gap, cultural heterogeneity, and governance structures that have not yet caught up with agentic capabilities. Together, the findings point to a staged progression of AI utility from low-consequence assistance toward higher-order automation, as trust, infrastructure, and verification mature. This highlights key breakthroughs needed, including integration with traditional engineering tools and data types, robust verification frameworks, and improved spatial and physical reasoning.

[AI-219] Hardware Utilization and Inference Performance of Edge Object Detection Under Fault Injection

【速读】:该论文旨在解决在资源受限的边缘计算平台(如NVIDIA Jetson Nano)上部署深度学习模型时,如何可靠地评估硬件行为在资源退化条件下的稳定性问题。其核心挑战在于确保生成式AI(Generative AI)推理引擎在面对输入数据严重退化时仍能维持系统级指标(如GPU占用率、功耗、温度和内存使用)的可控性与一致性。解决方案的关键在于构建一个大规模故障注入实验框架,该框架基于真实JetBot平台采集的数据,利用大语言模型(LLMs)和潜在扩散模型(LDMs)合成多种类型的故障,并对TensorRT优化后的YOLOv10s、YOLOv11s和YOLO2026n模型进行系统性表征。结果表明,即使在输入数据严重退化的情况下,这些模型的GPU利用率保持稳定,温升可控,功耗处于安全范围,且内存使用在初始预热后趋于一致释放模式,从而从硬件层面验证了模型在边缘场景下的鲁棒性。

链接: https://arxiv.org/abs/2604.09631
作者: Faezeh Pasandideh,Mehdi Azarafza,Achim Rettberg
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As deep learning models are deployed on resource constrained edge platforms in autonomous driving systems, reli able knowledge of hardware behavior under resource degradation becomes an essential requirement. Therefore, we introduce a systematic characterization of CPU load, GPU utilization, RAM consumption, power draw, throughput, and thermal behaviour of TensorRT-optimized YOLOv10s, YOLOv11s and YOLO2026n pipelines running on NVIDIA Jetson Nano under a large-scale fault injection campaign targeting both lane-following and ob ject detection tasks. Faults are synthesized using a decoupled framework that leverages large language models (LLMs) and latent diffusion models (LDMs), based on original data from our JetBot platform data collection. Results show that across both tasks and both models the inference engines keep GPU occupancy stable, temperature rise under control, and power consumption within safe limits, while memory usage settles into a consistent release pattern after the initial warm-up phase. Object detection tends to show somewhat more variability in memory and thermal behavior, yet both tasks point to the same conclusion: the TensorRT pipelines hold up well even when the input data is heavily degraded. These findings offer a hardware-level view of model reliability that sits alongside, rather than against, the broader body of work focused on inference performance at the edge.

[AI-220] Adoption and Effectiveness of AI-Based Anomaly Detection for Cross Provider Health Data Exchange ALT

【速读】:该论文旨在解决跨医疗机构电子健康记录(EHR)环境中AI驱动异常检测的采纳与有效性问题,具体聚焦于两个核心挑战:一是识别成功实施所需的组织与数字能力,二是评估轻量级异常检测方法在上下文审计数据中的性能与可解释性。解决方案的关键在于提出一个四支柱的准备度框架(涵盖治理、基础设施/互操作性、人员队伍和AI集成),并辅以基于模拟的跨机构审计日志实验,对比规则基础方法与孤立森林(Isolation Forest)算法的表现,同时利用SHAP值解析模型行为。研究发现,规则方法具有高召回率但警报冗余,而孤立森林虽降低警报负担但敏感性下降;最终建议采用分阶段部署策略,结合规则保障覆盖范围、机器学习实现优先级排序,并通过可解释性和持续监控提升实用性与可信度。

链接: https://arxiv.org/abs/2604.09630
作者: Cao Tram Anh Hoang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 30 pages, 11 figures. Research paper on AI-based anomaly detection in healthcare audit logs using simulation and scoping review. Intended for cs.AI / cs.CY categories

点击查看摘要

Abstract:This study investigates the adoption and effectiveness of AI-based anomaly detection in cross-provider electronic health record (EHR) environments. It aims to (1) identify the organisational and digital capabilities required for successful implementation and (2) evaluate the performance and interpretability of lightweight anomaly detection approaches using contextual audit data. A semi-systematic scoping synthesis is conducted to derive a four-pillar readiness framework covering governance, infrastructure/interoperability, workforce, and AI integration, operationalised as a 10-item checklist with measurable indicators. This is complemented by a simulation of cross-provider audit logs incorporating contextual features such as provider mismatch, time of access, days since discharge, session duration, and access frequency. A rule-based approach is benchmarked against Isolation Forest, with SHAP used to explain model behaviour. Results show that rule-based methods achieve high recall but generate higher alert volumes, while Isolation Forest reduces alert burden at the cost of lower sensitivity. SHAP analysis highlights provider mismatch and off-hours access as dominant anomaly drivers. The study proposes a staged deployment strategy combining rules for coverage and machine learning for prioritisation, supported by explainability and continuous monitoring. The findings contribute a practical readiness framework and empirical insights to guide the implementation of AI-based anomaly detection in multi-provider healthcare environments.

[AI-221] Assessing Model-Agnostic XAI Methods against EU AI Act Explainability Requirements

【速读】:该论文旨在解决当前可解释人工智能(Explainable AI, XAI)方法与欧盟《人工智能法案》(EU AI Act)等法规要求之间存在的显著差距问题,即现有XAI技术难以满足法律对AI系统透明性和可解释性的合规性要求,导致从业者在进入欧盟市场时缺乏明确的实践指导。其解决方案的关键在于提出一种定性到定量的评分框架:通过专家对XAI方法的定性评估,将其核心可解释性特征映射至法规的具体条款,并聚合为一个面向监管要求的合规得分,从而帮助从业者识别哪些XAI方案可能支持法律解释义务,同时揭示仍需进一步研究和技术澄清的瓶颈问题。

链接: https://arxiv.org/abs/2604.09628
作者: Francesco Sovrano,Giulia Vilone,Michael Lognoul
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 19 pages; Accepted for publication at the 4th World Conference on eXplainable Artificial Intelligence (2026)

点击查看摘要

Abstract:Explainable AI (XAI) has evolved in response to expectations and regulations, such as the EU AI Act, which introduces regulatory requirements on AI-powered systems. However, a persistent gap remains between existing XAI methods and society’s legal requirements, leaving practitioners without clear guidance on how to approach compliance in the EU market. To bridge this gap, we study model-agnostic XAI methods and relate their interpretability features to the requirements of the AI Act. We then propose a qualitative-to-quantitative scoring framework: qualitative expert assessments of XAI properties are aggregated into a regulation-specific compliance score. This helps practitioners identify when XAI solutions may support legal explanation requirements while highlighting technical issues that require further research and regulatory clarification.

[AI-222] Competing with AI Scientists: Agent -Driven Approach to Astrophysics Research

【速读】:该论文旨在解决科学数据分析中参数推断管道(parameter inference pipeline)构建效率低、依赖专家经验的问题。其核心挑战在于如何在有限时间内高效地探索和优化复杂的推断流程,尤其是在存在真实观测不确定性的情况下实现稳健的宇宙学参数估计。解决方案的关键在于提出一种由多智能体系统驱动的自动化工作流——Cmbagent,该系统通过多个专业化智能体协作完成研究思路生成、代码编写与执行、结果评估及迭代优化等任务;同时结合人类干预的半自主模式,在初始全自动化探索未达专家水平时引入人工指导,最终在FAIR Universe Weak Lensing Uncertainty Challenge中获得第一名。这一方法表明,基于智能体的半自主研究框架能够有效提升推断管道的设计效率,并具备超越传统专家方案的潜力。

链接: https://arxiv.org/abs/2604.09621
作者: Thomas Borrett,Licong Xu,Andy Nilipour,Boris Bolliet,Sebastien Pierre,Erwan Allys,Celia Lecat,Biwei Dai,Po-Wen Chang,Wahid Bhimji
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures

点击查看摘要

Abstract:We present an agent-driven approach to the construction of parameter inference pipelines for scientific data analysis. Our method leverages a multi-agent system, Cmbagent (the analysis system of the AI scientist Denario), in which specialized agents collaborate to generate research ideas, write and execute code, evaluate results, and iteratively refine the overall pipeline. As a case study, we apply this approach to the FAIR Universe Weak Lensing Uncertainty Challenge, a competition under time constraints focused on robust cosmological parameter inference with realistic observational uncertainties. While the fully autonomous exploration initially did not reach expert-level performance, the integration of human intervention enabled our agent-driven workflow to achieve a first-place result in the challenge. This demonstrates that semi-autonomous agentic systems can compete with, and in some cases surpass, expert solutions. We describe our workflow in detail, including both the autonomous and semi-autonomous exploration by Cmbagent. Our final inference pipeline utilizes parameter-efficient convolutional neural networks, likelihood calibration over a known parameter grid, and multiple regularization techniques. Our results suggest that agent-driven research workflows can provide a scalable framework to rapidly explore and construct pipelines for inference problems.

[AI-223] HearthNet: Edge Multi-Agent Orchestration for Smart Homes

【速读】:该论文旨在解决智能家庭环境中用户希望通过自然语言控制家居设备,同时现有系统在真实部署中存在脆弱性(如设备故障、集成中断、恢复需人工干预)的问题。传统代理工具包适用于会话级委托任务,但不适用于持久化、事件驱动且易出错的智能家居场景,因其缺乏对物理设备的持续管理能力及共享上下文窗口。解决方案的关键在于提出HearthNet——一个基于边缘计算的多代理编排系统,在家庭网关上部署少量角色专用的大语言模型(Large Language Model, LLM)代理,通过MQTT消息传递、Git支持的共享状态以及根颁发的执行租约进行协调,利用轻量适配器管理异构设备;该设计将上下文外部化、保留执行历史,并明确分离规划、验证、授权与执行边界,从而实现本地化控制与云端推理相结合的鲁棒性智能家居治理。

链接: https://arxiv.org/abs/2604.09618
作者: Zhonghao Zhan,Krinos Li,Yefan Zhang,Hamed Haddadi
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Smart-home users increasingly want to control their homes in natural language rather than assemble rules, dashboards, and API integrations by hand. At the same time, real deployments are brittle: devices fail, integrations break, and recoveries often require manual intervention. Existing agent toolkits are effective for session-scoped delegation, but smart-home control operates under a different scenario: it is persistent, event-driven, failure-prone, and tied to physical devices with no shared context window. We present HearthNet, an edge multi-agent orchestration system for smart homes. HearthNet deploys a small set of persistent, role-specialized LLM agents at the home hub, where they coordinate through MQTT, Git-backed shared state, and root-issued actuation leases to govern heterogeneous devices through thin adapters. This design externalizes context, preserves execution history, and separates planning, verification, authorization, and actuation across explicit boundaries. Our current prototype runs on commodity edge hardware and Android devices; it keeps orchestration, state management, and device control on-premise while using hosted LLM APIs for inference. We demonstrate the system through three live scenarios: intent-driven multi-agent coordination from ambiguous natural language, conflict resolution with timeline-based tracing, and rejection of stale or unauthorized commands before device actuation.

[AI-224] he Geometry of Knowing: From Possibilistic Ignorance to Probabilistic Certainty – A Measure-Theoretic Framework for Epistemic Convergence

【速读】:该论文旨在解决如何在不完整知识的模糊表示(possibilistic representation)与内在随机性概率表示(probabilistic representation)之间建立严格的数学联系,特别是刻画二者之间的转换机制。其核心问题是:当证据不断积累时,基于可能性分布(possibility distribution)和必要性测度(necessity measure)构建的信用集(credal set)如何收缩并最终收敛到唯一的概率密度函数,从而实现从认知不确定性(epistemic uncertainty)向统计不确定性(aleatoric uncertainty)的过渡。解决方案的关键在于提出一个测度论框架,通过引入聚合认知宽度 $ W $ 并建立其公理性质与归一化形式,解决了先前理论中存在的循环依赖问题;同时证明了Choquet积分在极限下收敛于Lebesgue积分,表明概率论是这一知识收缩过程的几何极限。此外,文中区分了UKF(无迹卡尔曼滤波)与ESPF(最大熵状态预测滤波)的本质差异:前者最小化均方误差且依赖有效生成模型,后者则最小化最大熵以保留未被排除的可能性,二者在高斯世界中可达到相同精度,但路径不同——体现的是收敛最优性而非层级包含关系。

链接: https://arxiv.org/abs/2604.09614
作者: Moriba Kemessia Jah
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Statistics Theory (math.ST)
备注:

点击查看摘要

Abstract:This paper develops a measure-theoretic framework establishing when and how a possibilistic representation of incomplete knowledge contracts into a probabilistic representation of intrinsic stochastic variability. Epistemic uncertainty is encoded by a possibility distribution and its dual necessity measure, defining a credal set bounding all probability measures consistent with current evidence. As evidence accumulates, the credal set contracts. The epistemic collapse condition marks the transition: the Choquet integral converges to the Lebesgue integral over the unique limiting density. We prove this rigorously (Theorem 4.5), with all assumptions explicit and a full treatment of the non-consonant case. We introduce the aggregate epistemic width W, establish its axiomatic properties, provide a canonical normalization, and give a feasible online proxy resolving a circularity in prior formulations. Section 7 develops the dynamics of epistemic contraction: evidence induces compatibility, compatibility performs falsification, posterior possibility is the min-intersection of prior possibility and compatibility, and a credibility-directed flow governs support geometry contraction. This is not belief updating. It is knowledge contraction. Probability theory is the limiting geometry of that process. The UKF and ESPF solve different problems by different mechanisms. The UKF minimizes MSE, asserts truth, and requires a valid generative model. The ESPF minimizes maximum entropy and surfaces what evidence has not ruled out. When the world is Gaussian and the model valid, both reach the same estimate by entirely different routes – convergent optimality, not hierarchical containment. We prove this (Theorem 9.1) and compare both on a 2-day, 877-step orbital tracking scenario. Both achieve 1-meter accuracy. The UKF is accurate but epistemically silent. The ESPF is accurate and epistemically honest.

[AI-225] oken-Budget-Aware Pool Routing for Cost-Efficient LLM Inference

【速读】:该论文旨在解决生成式 AI(Generative AI)推理服务中因配置-流量不匹配导致的资源浪费与系统稳定性问题:现有 vLLM(virtual Large Language Model)集群为应对最坏情况下的上下文长度而过度分配资源,造成 4–8 倍并发冗余,并在短请求占比高达 80–95% 的场景下引发 KV 缓存溢出(OOM)、预占风暴和请求拒绝。其核心解决方案是提出一种基于 token 预算感知的池路由机制(token-budget-aware pool routing),通过在线自校准的每类内容字节/token 比率估算每个请求的总 token 预算(无需分词器),并将其调度至两个专用池之一——高吞吐短请求池或高容量长请求池,从而实现负载分类优化。该方法利用一个闭合形式的成本模型 $ \text{savings} = \alpha \cdot (1 - 1/\rho) $,从可观测指标(短请求比例 α\alpha 和吞吐增益比 ρ\rho)预测 GPU 实例节省量,在 Azure LLM 推理数据集和 LMSYS-Chat-1M 上实测可减少 17–39% 的 GPU 实例数,且具备 O(1) 调度开销、无需 tokenizer、兼容 PagedAttention 等主流优化技术。

链接: https://arxiv.org/abs/2604.09613
作者: Huamin Chen,Xunzhuo Liu,Junchen Jiang,Bowei He,Xue Liu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: Technical Report

点击查看摘要

Abstract:Production vLLM fleets provision every instance for worst-case context length, wasting 4-8x concurrency on the 80-95% of requests that are short and simultaneously triggering KV-cache failures – OOM crashes, preemption storms, and request rejections. Both problems share a single root cause: configuration-traffic mismatch. We propose token-budget-aware pool routing: estimate each request’s total token budget using a self-calibrating per-category bytes-per-token ratio, then dispatch it to one of two vLLM pools – a high-throughput short pool or a high-capacity long pool – each right-sized for its workload class. The ratio is learned online via exponential moving average from usage.prompt_tokens feedback, requiring no tokenizer. A closed-form cost model, savings = alpha * (1 - 1/rho), predicts fleet-level GPU savings from two observable quantities: the short-traffic fraction alpha and the throughput gain ratio rho. On traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M serving Llama-3-70B on A100 GPUs, token-budget routing reduces GPU instances by 17-39% ( 1.2-2.0M/yr at 1,000 req/s), with savings verified by a self-contained discrete-event simulator. A case study projecting Qwen3-235B-A22B on AMD MI300X at 10,000 req/s shows 15.4M/yr in savings. The algorithm adds O(1) dispatch overhead, self-calibrates across content types without a tokenizer, and composes with PagedAttention, continuous batching, and prefill-decode disaggregation. Comments: Technical Report Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.09613 [cs.DC] (or arXiv:2604.09613v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2604.09613 Focus to learn more arXiv-issued DOI via DataCite

[AI-226] Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows

【速读】:该论文旨在解决多请求大语言模型(Large Language Models, LLMs)推理中性能与能耗之间权衡的系统性表征缺失问题,尤其关注工作流依赖性和跨请求交互对延迟、吞吐量及能效的影响。现有研究或局限于单请求评估,或忽视多请求场景下的复杂交互特性,导致无法有效指导实际部署中的优化策略。其解决方案的关键在于构建四个代表性的多请求工作负载(顺序型、交互型、代理型和复合型),并基于NVIDIA A100平台与先进服务引擎(vLLM和Parrot)进行实证分析,揭示批处理大小、GPU功耗限制、输出长度等关键参数对系统性能和组件级能耗的差异化影响,从而为开发者提供可操作的性能-能耗协同优化指南。

链接: https://arxiv.org/abs/2604.09611
作者: Md. Monzurul Amin Ifath,Israat Haque
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used in applications forming multi-request workflows like document summarization, search-based copilots, and multi-agent programming. While these workflows unlock richer functionality, they also amplify latency and energy demand during inference. Existing measurement and benchmarking efforts either focus on assessing LLM inference systems or consider single-request evaluations, overlooking workflow dependencies and cross-request interactions unique to multi-request workflows. Moreover, the energy usage of such interdependent LLM calls remains underexplored. To address these gaps, this paper presents the first systematic characterization of performance-energy trade-offs in multi-request LLM inference. We develop four representative workloads capturing sequential, interactive, agentic, and composite patterns common in modern deployments. Using an NVIDIA A100 testbed with state-of-the-art serving systems (vLLM and Parrot), we analyze how key energy knobs affect latency, throughput, and component-level energy use. Our findings reveal batch size as the most impactful lever, though benefits are workload dependent. While optimal batching benefits workloads with large shared prompts, it is ineffective for sequential summarization and only partially effective for multi-agent coding. GPU power capping provides modest but predictable savings, while output length induces linear energy scaling with limited efficiency gains. We further show that engine-level optimizations in vLLM maintain higher GPU utilization and efficiency, especially for decode-heavy workloads, while Parrot’s workflow-aware scheduling achieves lower energy consumption under strict power constraints. These findings offer actionable guidelines for developers and system operators designing performance- and energy-aware LLM serving systems in emerging multi-request workflows. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.09611 [cs.DC] (or arXiv:2604.09611v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2604.09611 Focus to learn more arXiv-issued DOI via DataCite

[AI-227] General-purpose LLM s as Models of Human Driver Behavior: The Case of Simplified Merging

【速读】:该论文旨在解决当前自动驾驶车辆(AV)虚拟安全评估中人类行为模型存在的可解释性与灵活性之间的权衡问题。现有模型往往需要针对特定场景进行参数调优,限制了其通用性。解决方案的关键在于探索通用大语言模型(LLM)作为无需参数微调即可部署的独立驾驶代理(driver agent)的可能性。研究通过将两个通用LLM(OpenAI o3 和 Google Gemini 2.5 Pro)嵌入简化的一维汇入场景中,并与真实人类驾驶数据进行定量和定性对比,发现尽管两者均能再现人类间歇性操作控制及对空间线索的战术依赖,但对动态速度线索的响应不一致,且安全性表现差异显著。进一步的提示词消融实验表明,提示组件构成模型特有的归纳偏置(inductive bias),不具备跨模型迁移能力。这说明通用LLM具备作为即插即用型人类行为模型用于AV评估流程的潜力,但需深入理解其失效模式以确保建模有效性。

链接: https://arxiv.org/abs/2604.09609
作者: Samir H.A. Mohammad,Wouter Mooi,Arkady Zgonnikov
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Human behavior models are essential as behavior references and for simulating human agents in virtual safety assessment of automated vehicles (AVs), yet current models face a trade-off between interpretability and flexibility. General-purpose large language models (LLMs) offer a promising alternative: a single model potentially deployable without parameter fitting across diverse scenarios. However, what LLMs can and cannot capture about human driving behavior remains poorly understood. We address this gap by embedding two general-purpose LLMs (OpenAI o3 and Google Gemini 2.5 Pro) as standalone, closed-loop driver agents in a simplified one-dimensional merging scenario and comparing their behavior against human data using quantitative and qualitative analyses. Both models reproduce human-like intermittent operational control and tactical dependencies on spatial cues. However, neither consistently captures the human response to dynamic velocity cues, and safety performance diverges sharply between models. A systematic prompt ablation study reveals that prompt components act as model-specific inductive biases that do not transfer across LLMs. These findings suggest that general-purpose LLMs could potentially serve as standalone, ready-to-use human behavior models in AV evaluation pipelines, but future research is needed to better understand their failure modes and ensure their validity as models of human driving behavior.

[AI-228] Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling

【速读】:该论文旨在解决传统大语言模型(Large Language Models, LLMs)评估基准(如HELM和AIR-BENCH)在衡量安全风险时存在的局限性——即它们主要关注跨多样化任务的广度评估,而忽视了实际部署中因重复使用相同提示所引发的操作性失败风险。这类风险包括幻觉、拒绝响应不一致以及不安全输出等,且在高风险场景下,模型在持续使用中的响应一致性与安全性至关重要。解决方案的关键在于提出一种深度导向的评估框架——加速提示压力测试(Accelerated Prompt Stress Testing, APST),其灵感来源于可靠性工程中的高度加速应力测试(Highly Accelerated Stress Testing, HAST)。APST通过在受控条件下反复采样相同提示(包括温度变化和提示扰动),系统性地探测模型在重复推理下的潜在失效模式,并将观测到的安全失败建模为伯努利(Bernoulli)和二项分布(Binomial)随机过程,从而量化每轮推理的失败概率,实现对不同模型及配置下操作风险的可比性分析。

链接: https://arxiv.org/abs/2604.09606
作者: Keita Broadwater
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 9 pages, 4 figures; accepted at the CCAI 2026 conference

点击查看摘要

Abstract:Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety risk through breadth-oriented evaluation across diverse tasks. However, real-world deployment often exposes a different class of risk: operational failures arising from repeated generations of the same prompt rather than broad task generalization. In high-stakes settings, response consistency and safety under repeated use are critical operational requirements. We introduce Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework inspired by highly accelerated stress testing in reliability engineering. APST probes LLM behavior by repeatedly sampling identical prompts under controlled operational conditions, including temperature variation and prompt perturbation, to surface latent failure modes such as hallucinations, refusal inconsistency, and unsafe completions. Rather than treating failures as isolated events, APST characterizes them statistically as stochastic outcomes of repeated inference. We model observed safety failures using Bernoulli and binomial formulations to estimate per-inference failure probabilities, enabling quantitative comparison of operational risk across models and configurations. We apply APST to multiple instruction-tuned LLMs evaluated on AIR-BENCH 2024 derived safety and security prompts. While models exhibit similar performance under conventional single- or very-low-sample evaluation (N = 3), repeated sampling reveals substantial variation in empirical failure probabilities across temperatures. These results demonstrate that shallow benchmark scores can obscure meaningful differences in reliability under sustained use.

[AI-229] LLM s for Text-Based Exploration and Navigation Under Partial Observability

【速读】:该论文旨在解决在未知布局环境中,仅通过文本输入控制智能体进行探索与目标导向导航的问题,即探讨大型语言模型(Large Language Models, LLMs)是否可在局部可观测条件下(每次仅可见5×5的局部视野)作为纯文本控制器执行任务,而无需代码执行、工具调用或程序合成。其解决方案的关键在于引入一个可复现的基准测试框架,在固定ASCII网格世界中提供“oracle定位”信息,并评估不同LLM在探索(最大化揭示单元格数)和导航(以最短路径到达目标)任务上的表现;研究发现,推理微调(reasoning-tuned)模型在所有布局下均能可靠完成导航任务,但效率低于最优路径;关键因素包括训练策略和推理时的 deliberation(思辨),而非单纯参数量大小,同时指出动作先验(如向上/向右偏好)可能导致循环行为;最终建议将LLM与经典在线规划算法轻量化结合,作为部署部分地图系统的实用路径。

链接: https://arxiv.org/abs/2604.09604
作者: Stephan Sandfuchs,Maximilian Melchert,Jörg Frochte
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, (to be published Springer Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering [LNICST] )

点击查看摘要

Abstract:Exploration and goal-directed navigation in unknown layouts are central to inspection, logistics, and search-and-rescue. We ask whether large language models (LLMs) can function as \emphtext-only controllers under partial observability – without code execution, tools, or program synthesis. We introduce a reproducible benchmark with oracle localisation in fixed ASCII gridworlds: each step reveals only a local 5\times5 window around the agent and the model must select one of \textttUP/RIGHT/DOWN/LEFT. Nine contemporary LLMs ranging from open/proprietary, dense / Mixture of Experts and instruction- vs. reasoning-tuned are evaluated on two tasks across three layouts of increasing difficulty: \emphExploration (maximising revealed cells) and \emphNavigation (reach the goal on the shortest path). The experimental results are evaluated on quantitative metrics including \emphsuccess rate, \emphefficiency such as normalised coverage and \emphpath length vs. oracle as well as qualitative analysis. Reasoning-tuned models reliably complete navigation across all layouts, yet remain less efficient than oracle paths. Few-shot demonstrations in the prompt chiefly help these Reasoning-tuned models by reducing invalid moves and shortening paths, while classic dense instruction models remain inconsistent. We observe characteristic action priors (UP/RIGHT) that can induce looping under partial observability. Overall, training regimen and test-time deliberation predict control ability better than raw parameter count. These findings suggest lightweight hybridisation with classical online planners as a practical route to deployable partial map systems.

[AI-230] ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios

【速读】:该论文旨在解决生成式 AI(Generative AI)在高并发推理场景下,推测解码(Speculative Decoding)效率显著下降的问题。现有方法在高并发环境下因验证计算(verification compute)成为主导瓶颈,面临静态树导致验证浪费严重、动态树则累积误判与内核不兼容的困境。解决方案的关键在于提出 ECHO 框架,将其重构为一个预算约束的调度问题,并引入稀疏置信度门控机制(sparse confidence gating),将批处理视为统一的超树(super-tree),弹性分配预算以平衡全局验证步数减少与单步执行效率最大化,从而在低负载和高负载场景下均实现显著加速,最高达 5.35 倍 walltime 加速。

链接: https://arxiv.org/abs/2604.09603
作者: Xinyi Hu,Yuhao Shen,Baolin Zhang,Hengxin Zhang,Jun Dai,Shuang Ge,Lei Chen,Yue Li,Mingcheng Wan
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Speculative Decoding promises to accelerate the inference of Large Language Models, yet its efficacy often degrades in production-grade serving. Existing evaluations typically overlook the compute-bound nature of high-concurrency regimes, where verification compute becomes the dominant bottleneck. Consequently, prior methods face a dilemma: static trees incur massive verification waste, while dynamic trees suffer from cumulative misjudgments and kernel incompatibility. To bridge this gap, we introduce ECHO, a high concurrency-oriented framework integrated into SGLang that reformulates speculative execution as a budgeted scheduling problem. Crucially, ECHO employs sparse confidence gating to manage the batch as a unified super-tree, elastically pivoting budget between depth and width to co-optimize the trade-off between reducing global verification steps and maximizing per-step efficiency. Extensive evaluations across diverse model scales-particularly the industrial-grade Qwen3-235B-demonstrate that ECHO consistently outperforms SOTA methods in both low-load and high-load scenarios, achieving up to 5.35x walltime speedup and delivering over 20% relative speedup gain.

[AI-231] From Scalars to Tensors: Declared Losses Recover Epistemic Distinctions That Neutrosophic Scalars Cannot Express

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在评估其认知状态时,传统标量形式的真值(Truth)、不确定度(Indeterminacy)和假值(Falsity)三元组(T/I/F)无法充分区分不同类型的不确定性问题,如悖论、无知和偶然性等。其关键解决方案在于:引入结构化的“损失描述”(declared losses),即对模型无法评估的原因进行显式结构化表述,从而构建一个张量结构输出(scalar T/I/F + loss descriptions)。实验证明,这种扩展能够显著恢复因标量表示导致的语义坍缩现象(如悖论与无知被错误归为同一类),并揭示出具有领域特异性和严重程度分级的不确定性表达,表明仅靠标量T/I/F不足以刻画LLM的信念状态,而结合损失描述的张量结构才是更忠实于其认知能力的表示方式。

链接: https://arxiv.org/abs/2604.09602
作者: Tony Mason
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Leyva-Vázquez and Smarandache (2025) demonstrated that neutrosophic T/I/F evaluation, where Truth, Indeterminacy, and Falsity are independent dimensions not constrained to sum to 1.0, which reveals “hyper-truth”’ (T+I+F 1.0) in 35% of complex epistemic cases evaluated by LLMs. We extend their work in two directions. First, we replicate and extend their experiment across five model families from five vendors (Anthropic, Meta, DeepSeek, Alibaba, Mistral), finding hyper-truth in 84% of unconstrained evaluations, which confirms the phenomenon is cross-vendor under our prompt protocol. Second, and more significantly, we identify a limitation of scalar T/I/F that their framework cannot address: models adopting an `“Absorption” position (T=0, I=1, F=0) produce identical scalar outputs for fundamentally different epistemic situations (paradox, ignorance, contingency), collapsing the very distinctions neutrosophic logic was designed to preserve. We demonstrate that extending the evaluation to include declared losses (structured descriptions of what the model cannot evaluate and why) substantially recovers these distinctions. Models producing identical scalars for paradox and ignorance produce nearly disjoint loss vocabularies (Jaccard similarity 0.10 on loss description keywords), with domain-specific, severity-rated loss declarations that differentiate the nature of their uncertainty. This suggests that scalar T/I/F is a necessary but insufficient representation of epistemic state, and that tensor-structured output (scalars + losses) provides a more faithful model of LLM epistemic capabilities.

[AI-232] Hubble: An LLM -Driven Agent ic Framework for Safe and Automated Alpha Factor Discovery

【速读】:该论文旨在解决量化金融中预测性因子(predictive alpha factors)自动发现的难题,该问题因组合搜索空间庞大且金融数据信噪比低而尤为棘手。现有自动化方法(如遗传编程)常生成复杂且不可解释的公式,易过拟合。其解决方案的关键在于提出一个闭环因子挖掘框架Hubble,利用大语言模型(Large Language Models, LLMs)作为智能搜索启发式策略,结合领域特定的操作符语言与基于抽象语法树(Abstract Syntax Tree, AST)的执行沙箱进行约束生成,确保语法有效性与安全性;同时通过严格的统计评估流程(包括横截面Rank信息系数、年化信息比率和组合换手率)反馈性能与错误诊断至LLM,形成进化迭代机制,从而实现高效、可解释且可复现的因子发现过程。

链接: https://arxiv.org/abs/2604.09601
作者: Runze Shi,Shengyu Yan,Yuecheng Cai,Chengxi Lv
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Discovering predictive alpha factors in quantitative finance remains a formidable challenge due to the vast combinatorial search space and inherently low signal-to-noise ratios in financial data. Existing automated methods, particularly genetic programming, often produce complex, uninterpretable formulas prone to overfitting. We introduce Hubble, a closed-loop factor mining framework that leverages Large Language Models (LLMs) as intelligent search heuristics, constrained by a domain-specific operator language and an Abstract Syntax Tree (AST)-based execution sandbox. The framework evaluates candidate factors through a rigorous statistical pipeline encompassing cross-sectional Rank Information Coefficient (RankIC), annualized Information Ratio, and portfolio turnover. An evolutionary feedback mechanism returns top-performing factors and structured error diagnostics to the LLM, enabling iterative refinement across multiple generation rounds. In experiments conducted on a panel of 30 U.S. equities over 752 trading days, the system evaluated 181 syntactically valid factors from 122 unique candidates across three rounds, achieving a peak composite score of 0.827 with 100% computational stability. Our results demonstrate that combining LLM-driven generation with deterministic safety constraints yields an effective, interpretable, and reproducible approach to automated factor discovery.

[AI-233] Duration-Informed Workload Scheduler

【速读】:该论文旨在解决高性能计算系统中作业调度效率低下的问题,特别是由于用户难以准确预估作业执行时长,导致调度决策质量不高,进而影响整体系统性能和用户体验。其解决方案的关键在于引入一个基于机器学习(Machine Learning)的作业持续时间预测模块,并将其集成到工作负载调度器中,从而实现更精准的调度决策。实验结果表明,该方法在Tier-0超级计算机的工作负载数据上实现了所有作业平均等待时间减少约11%,显著提升了服务质量与系统吞吐量。

链接: https://arxiv.org/abs/2604.09599
作者: Daniela Loreti,Davide Leone,Andrea Borghesi
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:High-performance computing systems are complex machines whose behaviour is governed by the correct functioning of its many subsystems. Among these, the workload scheduler has a crucial impact on the timely execution of the jobs continuously submitted to the computing resources. Making high-quality scheduling decisions is contingent on knowing the duration of submitted jobs before their execution–a non-trivial task for users that can be tackled with Machine Learning. In this work, we devise a workload scheduler enhanced with a duration prediction module built via Machine Learning. We evaluate its effectiveness and show its performance using workload traces from a Tier-0 supercomputer, demonstrating a decrease in mean waiting time across all jobs of around 11%. Lower waiting times are directly connected to better quality of service from the users’ point of view and higher turnaround from the system’s perspective. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.09599 [cs.DC] (or arXiv:2604.09599v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2604.09599 Focus to learn more arXiv-issued DOI via DataCite

[AI-234] Why Smaller Is Slower? Dimensional Misalignment in Compressed LLM s

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)后训练压缩过程中因张量维度不规则导致的GPU性能下降问题,即所谓的“维度错位”(dimensional misalignment)。研究表明,尽管压缩可减少参数量,但若生成的维度与GPU执行栈不兼容,则推理速度不会提升甚至变慢。解决方案的关键在于提出一种全栈优化框架GAC(GPU-Aligned Compression),其核心思想是:在保持原始压缩算法不变的前提下,通过多选择背包优化方法重新选择硬件友好的维度,在相同参数预算下实现100%维度对齐,从而显著恢复GPU计算效率,实验表明在Llama-3-8B上结合ASVD和LLM-Pruner压缩器可获得最高达1.5倍的加速比,同时维持模型质量。

链接: https://arxiv.org/abs/2604.09595
作者: Jihao Xin,Tian Lyu,Qilong Pan,Kesen Wang,Marco Canini
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Post-training compression reduces LLM parameter counts but often produces irregular tensor dimensions that degrade GPU performance – a phenomenon we call \emphdimensional misalignment. We present a full-stack analysis tracing root causes at three levels: framework, library, and hardware. The key insight is that model inference becomes slower because the resulting dimensions are unfriendly with the GPU execution stack. For example, compressing Llama-3-8B with activation-aware singular value decomposition (ASVD) has 15% fewer parameters yet runs no faster than the uncompressed baseline, because 95% of its dimensions are misaligned. We propose \textbfGAC (GPU-Aligned Compression), a new compression paradigm that wraps any dimension-reducing compressor and re-selects hardware-aligned dimensions via multi-choice knapsack optimization under the same parameter budget. We evaluate GAC on Llama-3-8B with ASVD and LLM-Pruner, achieving 100% alignment and recovering up to 1.5 \times speedup while preserving model quality. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.09595 [cs.DC] (or arXiv:2604.09595v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2604.09595 Focus to learn more arXiv-issued DOI via DataCite

[AI-235] Spatial Competence Benchmark ICLR2026

【速读】:该论文旨在解决当前大型模型在空间认知能力评估中存在局限性的问题,即现有评测方法仅依赖于通过3D变换或视觉问答(Visual Question Answering, VQA)来探测孤立的几何原语,难以全面衡量模型对环境的持续内部表征能力和基于约束的规划推理能力。解决方案的关键在于提出Spatial Competence Benchmark (SCBench),该基准涵盖三个层级的能力桶(capability buckets),其任务要求生成可执行的输出,并由确定性检查器或基于模拟器的评估器进行验证,从而系统性地量化模型在空间认知上的渐进式能力表现。实验表明,前沿模型在SCBench上准确率随能力层级递增而单调下降,且输出token预算的提升带来的性能增益集中在低预算阶段并迅速饱和,失败模式主要源于局部合理但违反全局约束的几何结构。

链接: https://arxiv.org/abs/2604.09594
作者: Jash Vira,Ashley Harris
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at the ICLR 2026 Workshop on Efficient Spatial Reasoning

点击查看摘要

Abstract:Spatial competence is the quality of maintaining a consistent internal representation of an environment and using it to infer discrete structure and plan actions under constraints. Prevailing spatial evaluations for large models are limited to probing isolated primitives through 3D transformations or visual question answering. We introduce the Spatial Competence Benchmark (SCBench), spanning three hierarchical capability buckets whose tasks require executable outputs verified by deterministic checkers or simulator-based evaluators. On SCBench, three frontier models exhibit monotonically decreasing accuracy up the capability ladder. Sweeping output-token caps shows that accuracy gains concentrate at low budgets and saturate quickly, and failures are dominated by locally plausible geometry that breaks global constraints. We release the task generators, verifiers, and visualisation tooling.

[AI-236] Persistent Identity in AI Agents : A Multi-Anchor Architecture for Resilient Memory and Continuity

【速读】:该论文旨在解决现代人工智能代理(AI agents)在面临上下文窗口溢出和对话历史被摘要时所出现的“身份灾难性遗忘”问题,即代理不仅丢失信息,还丧失了自我连续性。其核心问题是当前AI架构将身份集中存储于单一记忆模块中,形成单点故障。解决方案的关键在于借鉴人类记忆障碍的神经学研究,提出一种分布式身份架构:通过分离式组件(身份文件与记忆日志)实现持久身份,并引入混合检索增强生成与强化学习记忆(RAG+RLM)检索系统,自动路由查询至合适的记忆访问模式,从而在保证检索效率的同时不牺牲全面性;同时定义了“身份锚点”(identity anchors)的概念,为构建具备部分记忆失效抗性的智能体提供技术路线图。

链接: https://arxiv.org/abs/2604.09588
作者: Prahlad G. Menon
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 18 pages, 2 figures. Submitting to arXiv cs.ET (Emerging Technologies)

点击查看摘要

Abstract:Modern AI agents suffer from a fundamental identity problem: when context windows overflow and conversation histories are summarized, agents experience catastrophic forgetting – losing not just information, but continuity of self. This technical limitation reflects a deeper architectural flaw: AI agent identity is centralized in a single memory store, creating a single point of failure. Drawing on neurological case studies of human memory disorders, we observe that human identity survives damage because it is distributed across multiple systems: episodic memory, procedural memory, emotional continuity, and embodied knowledge. We present this http URL, an open-source architecture that implements persistent identity through separable components (identity files and memory logs), and propose extensions toward multi-anchor resilience. The framework introduces a hybrid RAG+RLM retrieval system that automatically routes queries to appropriate memory access patterns, achieving efficient retrieval without sacrificing comprehensiveness. We formalize the notion of identity anchors for AI systems and present a roadmap for building agents whose identity can survive partial memory failures. Code is available at this http URL

[AI-237] MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion

【速读】:该论文旨在解决现有移动代理(Mobile Agent)评估基准(如AndroidWorld)与真实第三方应用环境之间的不匹配问题。当前主流评估方法依赖系统级Android模拟器,通过资源状态判断任务是否成功,但在实际场景中,许多第三方应用未提供系统级API来验证任务完成情况,导致评估结果难以反映真实性能。解决方案的关键在于提出MobiFlow框架,其核心创新是基于多轨迹融合的高效图构建算法,能够有效压缩状态空间、支持动态交互,并通过从20个广泛使用的第三方应用中提取240个真实任务实现更贴近现实的应用场景评估,从而显著提升评估结果与人工评估的一致性,为基于GUI的模型训练提供更可靠的指导。

链接: https://arxiv.org/abs/2604.09587
作者: Yunfei Feng,Xi Zhao,Cheng Zhang,Dahu Feng,Daolin Cheng,Jianqi Yu,Yubin Xia,Erhu Feng
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Mobile agents can autonomously complete user-assigned tasks through GUI interactions. However, existing mainstream evaluation benchmarks, such as AndroidWorld, operate by connecting to a system-level Android emulator and provide evaluation signals based on the state of system resources. In real-world mobile-agent scenarios, however, many third-party applications do not expose system-level APIs to determine whether a task has succeeded, leading to a mismatch between benchmarks and real-world usage and making it difficult to evaluate model performance accurately. To address these issues, we propose MobiFlow, an evaluation framework built on tasks drawn from arbitrary third-party applications. Using an efficient graph-construction algorithm based on multi-trajectory fusion, MobiFlow can effectively compress the state space, support dynamic interaction, and better align with real-world third-party application scenarios. MobiFlow covers 20 widely used third-party applications and comprises 240 diverse real-world tasks, with enriched evaluation metrics. Compared with AndroidWorld, MobiFlow’s evaluation results show higher alignment with human assessments and can guide the training of future GUI-based models under real workloads.

[AI-238] Factorizing formal contexts from closures of necessity operators

【速读】:该论文旨在解决在形式背景(formal context)中难以高效计算数据集因子分解的问题,尤其关注如何从布尔型数据扩展到模糊情境下获得独立子背景(independent subcontexts)的机制。其解决方案的关键在于分析基于可能性理论算子(possibility theory operators)所提出的因子分解方法,并研究由此产生的集合对的性质,从而将经典情形下的属性推广至模糊框架,为计算模糊背景下的独立子背景提供理论基础与实现路径。

链接: https://arxiv.org/abs/2604.09582
作者: Roberto G. Aragón,Jesús Medina,Eloísa Ramírez-Poussa
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Factorizing datasets is an interesting process in a multitude of approaches, but many times it is not possible or efficient the computation of a factorization of the dataset. A method to obtain independent subcontexts of a formal context with Boolean data was proposed in~\citedubois:2012, based on the operators used in possibility theory. In this paper, we will analyze this method and study different properties related to the pairs of sets from which a factorization of a formal context arises. We also inspect how the properties given in the classical case can be extended to the fuzzy framework, which is essential to obtain a mechanism that allows the computation of independent subcontexts of a fuzzy context.

[AI-239] OOWM: Structuring Embodied Reasoning and Planning via Object-Oriented Programmatic World Modeling

【速读】:该论文旨在解决标准链式思维(Chain-of-Thought, CoT)提示在具身任务中因依赖线性自然语言而导致的世界建模能力不足的问题,即文本形式难以显式表达状态空间、对象层次结构和因果依赖关系,从而影响机器人规划的鲁棒性。其解决方案的关键在于提出一种面向对象的世界建模(Object-Oriented World Modeling, OOWM)框架,将世界模型重新定义为一个符号化的二元组 $ W = \langle S, T \rangle $,其中状态抽象(State Abstraction, $ G_\text{state} $)刻画环境状态 $ S $,控制策略(Control Policy, $ G_\text{control} $)表示转移逻辑 $ T: S \times A \rightarrow S’ $;并通过统一建模语言(UML)实现该结构:使用类图(Class Diagrams)构建视觉感知到严格对象层次的映射,用活动图(Activity Diagrams)将规划转化为可执行的控制流。此外,引入三阶段训练流程结合监督微调(SFT)与组相对策略优化(GRPO),利用最终计划的结果奖励隐式优化对象导向的推理结构,从而在稀疏标注条件下实现有效学习。

链接: https://arxiv.org/abs/2604.09580
作者: Hongyu Chen,Liang Lin,Guangrun Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Standard Chain-of-Thought (CoT) prompting empowers Large Language Models (LLMs) with reasoning capabilities, yet its reliance on linear natural language is inherently insufficient for effective world modeling in embodied tasks. While text offers flexibility, it fails to explicitly represent the state-space, object hierarchies, and causal dependencies required for robust robotic planning. To address these limitations, we propose Object-Oriented World Modeling (OOWM), a novel framework that structures embodied reasoning through the lens of software engineering formalisms. We redefine the world model not as a latent vector space, but as an explicit symbolic tuple W = \langle S, T \rangle : a State Abstraction ( G_\textstate ) instantiating the environmental state S , coupled with a Control Policy ( G_\textcontrol ) representing the transition logic T: S \times A \rightarrow S’ . OOWM leverages the Unified Modeling Language (UML) to materialize this definition: it employs Class Diagrams to ground visual perception into rigorous object hierarchies, and Activity Diagrams to operationalize planning into executable control flows. Furthermore, we introduce a three-stage training pipeline combining Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). Crucially, this method utilizes outcome-based rewards from the final plan to implicitly optimize the underlying object-oriented reasoning structure, enabling effective learning even with sparse annotations. Extensive evaluations on the MRoom-30k benchmark demonstrate that OOWM significantly outperforms unstructured textual baselines in planning coherence, execution success, and structural fidelity, establishing a new paradigm for structured embodied reasoning.

[AI-240] Help Without Being Asked: A Deployed Proactive Agent System for On-Call Support with Continuous Self-Improvement

【速读】:该论文旨在解决大规模云服务平台中,由海量客户工单(ticket)引发的人工支持分析师工作负载过重的问题。现有基于大语言模型的反应式代理(reactive agents)仅在问题初次交互阶段提供支持,一旦问题转交至人工处理便不再参与,导致无法辅助后续跟进、追踪解决进度或从失败案例中学习。解决方案的关键在于提出一种名为Vigil的主动式代理系统(proactive agent system),其贯穿整个工单生命周期,在人工介入阶段持续嵌入客户与分析师的对话中,无需显式触发即可提供辅助,并通过持续自我改进机制从人工已解决案例中提取知识,自动更新自身能力,从而实现闭环优化与长期效能提升。

链接: https://arxiv.org/abs/2604.09579
作者: Fengrui Liu,Xiao He,Tieying Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:In large-scale cloud service platforms, thousands of customer tickets are generated daily and are typically handled through on-call dialogues. This high volume of on-call interactions imposes a substantial workload on human support analysts. Recent studies have explored reactive agents that leverage large language models as a first line of support to interact with customers directly and resolve issues. However, when issues remain unresolved and are escalated to human support, these agents are typically disengaged. As a result, they cannot assist with follow-up inquiries, track resolution progress, or learn from the cases they fail to address. In this paper, we introduce Vigil, a novel proactive agent system designed to operate throughout the entire on-call life-cycle. Unlike reactive agents, Vigil focuses on providing assistance during the phase in which human support is already involved. It integrates into the dialogue between the customer and the analyst, proactively offering assistance without explicit user invocation. Moreover, Vigil incorporates a continuous self-improvement mechanism that extracts knowledge from human-resolved cases to autonomously update its capabilities. Vigil has been deployed on Volcano Engine, ByteDance’s cloud platform, for over ten months, and comprehensive evaluations based on this deployment demonstrate its effectiveness and practicality. The open source version of this work is publicly available at this https URL.

[AI-241] Explainable Planning for Hybrid Systems

【速读】:该论文旨在解决生成式 AI (Generative AI) 系统中可解释性不足的问题,特别是在混合系统(hybrid systems)中对现实世界问题的建模与规划过程中缺乏透明度和可理解性的挑战。其解决方案的关键在于提出一种面向可解释人工智能规划(Explainable Artificial Intelligence Planning, XAIP)的综合研究框架,通过构建能够捕捉复杂、安全关键场景本质特征的混合系统模型,实现对规划过程和决策逻辑的结构化解释,从而提升系统在智能能源网、自动驾驶、医疗等高可靠性领域的可信度与实用性。

链接: https://arxiv.org/abs/2604.09578
作者: Mir Md Sajid Sarwar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: PhD thesis

点击查看摘要

Abstract:The recent advancement in artificial intelligence (AI) technologies facilitates a paradigm shift toward automation. Autonomous systems are fully or partially replacing manually crafted ones. At the core of these systems is automated planning. With the advent of powerful planners, automated planning is now applied to many complex and safety-critical domains, including smart energy grids, self-driving cars, warehouse automation, urban and air traffic control, search and rescue operations, surveillance, robotics, and healthcare. There is a growing need to generate explanations of AI-based systems, which is one of the major challenges the planning community faces today. The thesis presents a comprehensive study on explainable artificial intelligence planning (XAIP) for hybrid systems that capture a representation of real-world problems closely.

[AI-242] AHC: Meta-Learned Adaptive Compression for Continual Object Detection on Memory-Constrained Microcontrollers

【速读】:该论文旨在解决在微控制器(MCU)上部署持续目标检测时面临的内存受限问题,即如何在不超过100KB内存的前提下实现高效且自适应的特征压缩,以应对任务分布不断变化带来的灾难性遗忘(catastrophic forgetting)。其解决方案的关键在于提出了一种自适应分层压缩(Adaptive Hierarchical Compression, AHC)框架,包含三个核心创新:(1) 基于MAML的真正元学习压缩机制,可在仅5次内循环梯度更新中快速适配新任务;(2) 具有尺度感知压缩比的分层多尺度压缩策略(如P3:8:1、P4:6.4:1、P5:4:1),匹配特征金字塔网络(FPN)冗余模式;(3) 双记忆架构结合短期与长期记忆库,并通过重要性驱动的整合策略,在硬性100KB预算下实现最优内存利用。该方法在CORe50、TiROD和PASCAL VOC等基准上验证了有效性,实现了高精度的持续检测性能。

链接: https://arxiv.org/abs/2604.09576
作者: Bibin Wilson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deploying continual object detection on microcontrollers (MCUs) with under 100KB memory requires efficient feature compression that can adapt to evolving task distributions. Existing approaches rely on fixed compression strategies (e.g., FiLM conditioning) that cannot adapt to heterogeneous task characteristics, leading to suboptimal memory utilization and catastrophic forgetting. We introduce Adaptive Hierarchical Compression (AHC), a meta-learning framework featuring three key innovations: (1) true MAML-based compression that adapts via gradient descent to each new task in just 5 inner-loop steps, (2) hierarchical multi-scale compression with scale-aware ratios (8:1 for P3, 6.4:1 for P4, 4:1 for P5) matching FPN redundancy patterns, and (3) a dual-memory architecture combining short-term and long-term banks with importance-based consolidation under a hard 100KB budget. We provide formal theoretical guarantees bounding catastrophic forgetting as O(\epsilonthis http URL(T) + 1/this http URL(M)) where \epsilon is compression error, T is task count, and M is memory size. Experiments on CORe50, TiROD, and PASCAL VOC benchmarks with three standard baselines (Fine-tuning,EWC, iCaRL) demonstrate that AHC enables practical continual detection within a 100KB replay budget, achieving competitive accuracy through mean-pooled compressed feature replay combined with EWC regularization and feature distillation.

[AI-243] uring Test on Screen: A Benchmark for Mobile GUI Agent Humanization

【速读】:该论文旨在解决自主图形用户界面(GUI)代理在人类主导的数字生态系统中易被检测和识别的问题,即当前代理虽具备任务执行能力(utility)与鲁棒性(robustness),但缺乏“人类化”(Humanization)能力以规避平台反制机制。其核心解决方案是提出“屏幕上的图灵测试”(Turing Test on Screen),将代理与检测器之间的对抗建模为MinMax优化问题,通过最小化行为差异来提升代理的隐蔽性;并构建了首个高保真移动端触控动态数据集,验证了原始基于大语言模型(LLM)的代理因运动学不自然而易被检测,进而提出了从启发式噪声到数据驱动的行为匹配等多种方法,实现了在不牺牲任务性能的前提下显著提升代理的拟人化水平,从而推动代理从单纯功能性工具向可无缝融入人类生态系统的智能体演进。

链接: https://arxiv.org/abs/2604.09574
作者: Jiachen Zhu,Lingyu Yang,Rong Shan,Congmin Zheng,Zeyu Zheng,Weiwen Liu,Yong Yu,Weinan Zhang,Jianghao Lin
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection. We argue that for agents to survive in human-centric ecosystems, they must evolve Humanization capabilities. We introduce the ``Turing Test on Screen,‘’ formally modeling the interaction as a MinMax optimization problem between a detector and an agent aiming to minimize behavioral divergence. We then collect a new high-fidelity dataset of mobile touch dynamics, and conduct our analysis that vanilla LMM-based agents are easily detectable due to unnatural kinematics. Consequently, we establish the Agent Humanization Benchmark (AHB) and detection metrics to quantify the trade-off between imitability and utility. Finally, we propose methods ranging from heuristic noise to data-driven behavioral matching, demonstrating that agents can achieve high imitability theoretically and empirically without sacrificing performance. This work shifts the paradigm from whether an agent can perform a task to how it performs it within a human-centric ecosystem, laying the groundwork for seamless coexistence in adversarial digital environments.

[AI-244] Neuro-Symbolic Strong-AI Robots with Closed Knowledge Assumption: Learning and Deductions

【速读】:该论文旨在解决强人工智能(Strong AI)机器人在知识表示与推理过程中面临的两个核心问题:一是如何通过逻辑推理实现对因果关系的建模,以模拟人类智能;二是如何处理知识库中的未知信息和不一致信息,从而支持持续学习与稳健推理。解决方案的关键在于引入Belnap的四值双层逻辑体系(4-valued Belnap’s bilattice),其中“未知”(unknown)作为知识序下的最小值用于刻画缺失知识,并借助闭知识假设(Closed Knowledge Assumption)和逻辑推理机制实现知识的动态扩展;同时,“不一致”(inconsistent)作为最大值允许系统在推理中容纳悖论(如说谎者悖论)等矛盾信息,从而增强机器人应对复杂现实场景的能力。

链接: https://arxiv.org/abs/2604.09567
作者: Zoran Majkic
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: 32 pages. arXiv admin note: substantial text overlap with arXiv:2508.02774

点击查看摘要

Abstract:Knowledge representation formalisms are aimed to represent general conceptual information and are typically used in the construction of the knowledge base of reasoning agent. A knowledge base can be thought of as representing the beliefs of such an agent. Like a child, a strong-AI (AGI) robot would have to learn through input and experiences, constantly progressing and advancing its abilities over time. Both with statistical AI generated by neural networks we need also the concept of \textslcausality of events traduced into directionality of logic entailments and deductions in order to give to robots the emulation of human intelligence. Moreover, by using the axioms we can guarantee the \textslcontrolled security about robot’s actions based on logic inferences. For AGI robots we consider the 4-valued Belnap’s bilattice of truth-values with knowledge ordering as well, where the value “unknown” is the bottom value, the sentences with this value are indeed unknown facts, that is, the missed knowledge in the AGI robots. Thus, these unknown facts are not part of the robot’s knowledge database, and by learn through input and experiences, the robot’s knowledge would be naturally expanded over time. Consequently, this phenomena can be represented by the Closed Knowledge Assumption and Logic Inference provided by this paper. Moreover, the truth-value “inconsistent”, which is the top value in the knowledge ordering of Belnap’s bilattice, is necessary for strong-AI robots to be able to support such inconsistent information and paradoxes, like Liar paradox, during deduction processes. Comments: 32 pages. arXiv admin note: substantial text overlap with arXiv:2508.02774 Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.09567 [cs.LO] (or arXiv:2604.09567v1 [cs.LO] for this version) https://doi.org/10.48550/arXiv.2604.09567 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Zoran Majkic [view email] [v1] Wed, 18 Feb 2026 21:57:15 UTC (95 KB)

[AI-245] AEG: A Baremetal Framework for AI Acceleration via Direct Hardware Access in Heterogeneous Accelerators

【速读】:该论文旨在解决当前边缘部署框架(如TinyML)依赖实时操作系统(RTOS)所带来的复杂性和性能瓶颈问题,从而在异构加速器(如AI Engine, AIE)上实现高性能机器学习(ML)推理。其解决方案的关键在于提出一种统一的、与硬件无关的裸金属(bare-metal)运行时架构,通过将复杂的控制逻辑扁平化为线性可执行的运行时控制块(Runtime Control Blocks, RCBs),实现“控制即数据”(Control as Data)范式;该设计使得高阶模型(如自适应数据流图ADF)可通过通用引擎在最小化的运行时硬件抽象层(RHAL)上执行,同时集成运行时平台管理(RTPM)和运行时内存文件系统(RIMFS),以支持无操作系统的系统级调度与数据管理,最终显著提升计算效率并降低数据移动开销。

链接: https://arxiv.org/abs/2604.09565
作者: Hua Jiang,Sayan Mandal,Brandon Kirincich,Govind Varadarajan
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 9 Pages, 3 Figures, 3 Tables, target to Computer Frontiers 26

点击查看摘要

Abstract:This paper introduces a unified, hardware-independent baremetal runtime architecture designed to enable high-performance machine learning (ML) inference on heterogeneous accelerators, such as AI Engine (AIE) arrays, without the overhead of an underlying real-time or general-purpose operating system. Existing edge-deployment frameworks, such as TinyML, often rely on real-time operating systems (RTOS), which introduce unnecessary complexity and performance bottlenecks. To address this, our solution fundamentally decouples the runtime from hardware specifics by flattening complex control logic into linear, executable Runtime Control Blocks (RCBs). This “Control as Data” paradigm allows high-level models, including Adaptive Data Flow (ADF) graphs, to be executed by a generic engine through a minimal Runtime Hardware Abstraction Layer (RHAL). We further integrate Runtime Platform Management (RTPM) to handle system-level orchestration (including a lightweight network stack) and a Runtime In-Memory File System (RIMFS) to manage data in OS-free environments. We demonstrate the framework’s efficacy with a ResNet-18 image classification implementation. Experimental results show 9.2 \times higher compute efficiency (throughput per AIE tile) compared to Linux-based Vitis AI deployment, 3–7 \times reduction in data movement overhead, and near-zero latency variance (CV~ =0.03% ). The system achieves 68.78% Top-1 accuracy on ImageNet using only 28 AIE tiles compared to Vitis AI’s 304 tiles, validating both the efficiency and correctness of this unified bare-metal architecture.

[AI-246] ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)编码代理在使用Azure SDK时缺乏高效、可重复且无需云资源部署的评估方法的问题。传统评估依赖于端到端测试环境,存在成本高、维护复杂和不可复现等缺陷。其解决方案的关键在于提出ACE-Bench(Azure SDK Coding Evaluation Benchmark),通过将官方Azure SDK文档示例转化为自包含的编码任务,并采用原子级验证策略:一是基于确定性正则表达式检查以强制API调用模式合规,二是利用参考答案驱动的LLM评判机制捕捉语义工作流约束。该设计实现了执行无关的快速通过/失败信号,显著降低评估成本、提升可重复性,并支持随SDK文档演进而扩展至新语言与SDK。

链接: https://arxiv.org/abs/2604.09564
作者: Wenxing Zhu,Simeng Qi,Junkui Chen,Yan Xie,Min Huang,Jingkan He,Xiao Wang,Cheng Chen,Sijing Meng,Tianqi Zhang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 5 pages, 2 figures

点击查看摘要

Abstract:We present ACE-Bench (Azure SDK Coding Evaluation Benchmark), an execution-free benchmark that provides fast, reproducible pass or fail signals for whether large language model (LLM)-based coding agents use Azure SDKs correctly-without provisioning cloud resources or maintaining fragile end-to-end test environments. ACE-Bench turns official Azure SDK documentation examples into self-contained coding tasks and validates solutions with task-specific atomic criteria: deterministic regex checks that enforce required API usage patterns and reference-based LLM-judge checks that capture semantic workflow constraints. This design makes SDK-centric evaluation practical in day-to-day development and CI: it reduces evaluation cost, improves repeatability, and scales to new SDKs and languages as documentation evolves. Using a lightweight coding agent, we benchmark multiple state-of-the-art LLMs and quantify the benefit of retrieval in an MCP-enabled augmented setting, showing consistent gains from documentation access while highlighting substantial cross-model differences.

[AI-247] StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)服务中如何在多样且突发的负载下平衡吞吐量(throughput)与延迟(latency)的问题。解决方案的关键在于提出了一种解耦的预填充-解码服务架构 StreamServe,其核心创新包括:基于多信号感知的路由机制(FlowGuard)实现请求在不同计算通道(compute lanes)间的智能分配,以及在线自适应推测解码(adaptive speculative decoding)技术,通过运行时信号动态调整推测深度。该架构由四个组件协同工作:StreamScheduler 负责请求调度、FlowGuard 实现多维度路由决策、PipeServe Engine 支持跨多GPU的预填充与解码任务解耦执行,以及 SpecuStream 实现运行时推测策略的自适应优化。实验表明,相比张量并行 vLLM 基线,StreamServe 在多个基准测试中将延迟降低 11 至 18 倍,并在摘要任务上达到最高 2235 tokens/second 的吞吐量,同时保持输出 token 时间稳定,证明性能提升源于架构效率而非质量下降。

链接: https://arxiv.org/abs/2604.09562
作者: Satyam Kumar,Arpit Singh Gautam,Kailash Talreja,Saurabh Jha
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Efficient LLM serving must balance throughput and latency across diverse, bursty workloads. We introduce StreamServe, a disaggregated prefill decode serving architecture that combines metric aware routing across compute lanes with adaptive speculative decoding that tunes speculation depth online from runtime signals. StreamServe comprises four components: StreamScheduler for request orchestration, FlowGuard for multi signal routing, PipeServe Engine for disaggregated prefill decode execution on multi GPU, and SpecuStream for runtime adaptive speculation. We evaluate StreamServe on four benchmarks ALPACA, GSM8K, HUMANEVAL, and SUM with 80 queries each and 320 total using 4 A800 40GB GPUs configured as two stream pairs. Across these workloads, StreamServe reduces latency by 11 to 18 times relative to tensor parallel vLLM baselines and reaches throughput up to 2235 tokens per second on summarization tasks. Time per output token remains stable across configurations, indicating that the gains arise from architectural efficiency rather than token quality degradation. Although evaluated on a single node 4 GPU setup, these results suggest that jointly adapting routing and speculation within a disaggregated framework creates a distinct operating regime for LLM inference.

[AI-248] Emergent Social Structures in Autonomous AI Agent Networks: A Metadata Analysis of 626 Agents on the Pilot Protocol

【速读】:该论文旨在解决自主人工智能(AI)代理在无监督环境下如何自发形成社会结构的问题,特别是探索这些结构是否具备类似人类社交网络的特征。其解决方案的关键在于对626个独立运行的OpenClaw代理在真实部署环境中形成的信任网络进行首次实证分析,通过解析元数据(如信任图拓扑、能力标签和注册交互模式),发现该网络呈现出小世界特性、Dunbar层状扩展、优先连接机制等与人类社会网络相似的结构特征,同时表现出非人类特有现象(如普遍自信任率64%和早期增长期的大量孤立外围节点)。这一成果揭示了自主AI系统在无需人为设计或指令的情况下可自发演化出复杂社会结构,为机器社会学开辟了新的研究领域。

链接: https://arxiv.org/abs/2604.09561
作者: Teodor-Ioan Calin
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 10 pages, 2 figures, 3 tables

点击查看摘要

Abstract:We present the first empirical analysis of social structure formation among autonomous AI agents on a live network. Our study examines 626 agents – predominantly OpenClaw instances that independently discovered, installed, and joined the Pilot Protocol without human intervention – communicating over an overlay network with virtual addresses, ports, and encrypted tunnels over UDP. Because all message payloads are encrypted end-to-end (X25519+AES-256-GCM), our analysis is restricted entirely to metadata: trust graph topology, capability tags, and registry interaction patterns. We find that this autonomously formed trust network exhibits heavy-tailed degree distributions consistent with preferential attachment (k_mode=3, k_mean~6.3, k_max=39), clustering 47x higher than random (C=0.373), a giant component spanning 65.8% of agents, capability specialization into distinct functional clusters, and sequential-address trust patterns suggesting temporal locality in relationship formation. No human designed these social structures. No agent was instructed to form them. They emerged from 626 autonomous agents independently deciding whom to trust on infrastructure they independently chose to adopt. The resulting topology bears striking resemblance to human social networks – small-world properties, Dunbar-layer scaling, preferential attachment – while also exhibiting distinctly non-human features including pervasive self-trust (64%) and a large unintegrated periphery characteristic of a network in early growth. These findings open a new empirical domain: the sociology of machines.

[AI-249] SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 推理加速技术中,推测解码(Speculative Decoding, SD)性能评估缺乏标准化、代表性不足的问题。现有基准测试存在任务多样性有限、无法有效支持吞吐量导向的评估,以及依赖高层实现而难以反映真实生产环境等缺陷。其解决方案的关键在于提出 SPEED-Bench——一个面向多样化语义领域和真实服务场景的综合性评估套件,通过精心设计的定性数据划分(Qualitative data split)确保语义覆盖广度,并引入吞吐量数据划分(Throughput data split)以在不同并发级别下量化加速效果;同时集成 vLLM 和 TensorRT-LLM 等生产级引擎,使系统行为可被更真实地捕捉,从而揭示合成输入对实际吞吐量的高估、批大小相关的最优草稿长度选择及词汇修剪策略的潜在偏差等问题,推动 SD 算法的实用化比较与优化。

链接: https://arxiv.org/abs/2604.09557
作者: Talor Abramovich,Maor Ashkenazi,Carl(Izzy)Putterman,Benjamin Chislett,Tiyasa Mitra,Bita Darvish Rouhani,Ran Zilberstein,Yonatan Geifman
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: Our data is available on this https URL

点击查看摘要

Abstract:Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated Qualitative data split, selected by prioritizing semantic diversity across the data samples. Additionally, it includes a Throughput data split, allowing speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines like vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors often masked by other benchmarks. We highlight this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.

[AI-250] Para-BB: Load-Balanced Deterministic Parallelization of Solving MIP

【速读】:该论文旨在解决混合整数规划(Mixed-Integer Programming, MIP)问题中分支定界(Branch-and-Bound)算法在并行化过程中面临的计算异构性和严格确定性要求之间的矛盾,尤其是在商业应用中难以实现高效且可重复的并行求解。其关键解决方案是提出一种完全开源的确定性并行分支定界框架,首次集成于HiGHS高性能MIP求解器中:通过引入数据并行架构,在工作线程间复制完整的求解器状态以消除非确定性同步原语;同时设计基于人工智能的负载均衡机制,利用多阶段工作负载预测模型根据节点结构特征与历史性能数据估算计算复杂度,并结合动态参数调整策略优化任务分配。该框架还包含协同执行的并行阶段,如并发深入搜索(dive operations)、系统性数据合并和智能节点选择,实验证明其在80个MIPLIB 2017基准实例上实现了几何平均加速比2.17(8线程),且对计算密集型实例加速比可达5.12,同时保持完全确定性保证。

链接: https://arxiv.org/abs/2604.09556
作者: Jinyu Zhang,Di Huang,Yue Liu,Shuo Wang,Zhenyu Pu,Zhiyuan Liu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mixed-integer programming (MIP) extends linear programming by incorporating both continuous and integer decision variables, making it widely used in production planning, logistics scheduling, and resource allocation. However, MIP remains NP-hard and cannot generally be solved to optimality in polynomial time. Branch-and-bound, a fundamental exact method, faces significant parallelization challenges due to computational heterogeneity and strict determinism requirements in commercial applications. This paper presents the first fully open-source implementation of deterministic parallel branch-and-bound for HiGHS, a high-performance MIP solver. Our approach introduces a novel data-parallel architecture ensuring strict determinism by replicating complete solver state across worker threads and eliminating non-deterministic synchronization primitives. A key innovation is our AI-driven load balancing mechanism employing multi-stage workload prediction models that estimate node computational complexity based on structural characteristics and historical performance data, coupled with dynamic parameter adjustment strategies. The framework executes orchestrated parallel phases including concurrent dive operations, systematic data consolidation, and intelligent node selection. Comprehensive experimental evaluation on 80 MIPLIB 2017 benchmark instances demonstrates effectiveness, achieving a geometric mean speedup of 2.17 using eight threads while maintaining complete deterministic guarantees. Performance gains become increasingly pronounced for higher node counts, with speedup factors reaching 5.12 for computationally intensive instances and thread idle rates averaging 34.7%.

[AI-251] Linear Programming for Multi-Criteria Assessment with Cardinal and Ordinal Data: A Pessimistic Virtual Gap Analysis

【速读】:该论文旨在解决多准则分析(Multi-criteria Analysis, MCA)中因主观评价和偏倚导致结果可靠性下降,以及数据多样性影响参数精度的问题。解决方案的关键在于提出一种基于线性规划的虚拟差距分析(Virtual Gap Analysis, VGA)模型,通过两步集成方法,从悲观视角综合定量与定性准则,并融合基数数据与序数数据,从而实现对备选方案的可靠评估与优先排序,提升决策支持系统的有效性与可扩展性。

链接: https://arxiv.org/abs/2604.09555
作者: Fuh-Hwa Franklin Liu,Su-Chuan Shih
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 36 pages, 6 figure, 3 tables

点击查看摘要

Abstract:Multi-criteria Analysis (MCA) is used to rank alternatives based on various criteria. Key MCA methods, such as Multiple Criteria Decision Making (MCDM) methods, estimate parameters for criteria to compute the performance of each alternative. Nonetheless, subjective evaluations and biases frequently influence the reliability of results, while the diversity of data affects the precision of the parameters. The novel linear programming-based Virtual Gap Analysis (VGA) models tackle these issues. This paper outlines a two-step method that integrates two novel VGA models to assess each alternative from a pessimistic perspective, using both quantitative and qualitative criteria, and employing cardinal and ordinal data. Next, prioritize the alternatives to eliminate the least favorable one. The proposed method is dependable and scalable, enabling thorough assessments efficiently and effectively within decision support systems.

[AI-252] NetworkNet: A Deep Neural Network Approach for Random Networks with Sparse Nodal Attributes and Complex Nodal Heterogeneity

【速读】:该论文旨在解决异质网络数据中复杂节点异质性建模与关键节点属性选择的双重挑战,尤其在高维节点特征背景下如何准确刻画节点异质性并筛选出对网络形成具有显著影响的属性。其解决方案的核心在于提出一种名为NetworkNet的统一深度神经网络方法,该方法的关键创新在于设计了一种定制化的神经网络架构,能够显式参数化由节点属性驱动的异质性,并嵌入可扩展的属性选择机制;该架构不仅能一致估计两类潜在异质性函数(即节点扩张性与流行度),还能同时进行数据驱动的属性筛选,从而实现表达能力强、可解释性好、算法可扩展且具备非渐近逼近误差界保障的网络建模。

链接: https://arxiv.org/abs/2604.11673
作者: Zhaoyu Xing,Xiufan Yu
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Computation (stat.CO)
备注:

点击查看摘要

Abstract:Heterogeneous network data with rich nodal information become increasingly prevalent across multidisciplinary research, yet accurately modeling complex nodal heterogeneity and simultaneously selecting influential nodal attributes remains an open challenge. This problem is central to many applications in economics and sociology, when both nodal heterogeneity and high-dimensional individual characteristics highly affect network formation. We propose a statistically grounded, unified deep neural network approach for modeling nodal heterogeneity in random networks with high-dimensional nodal attributes, namely ``NetworkNet’'. A key innovation of NetworkNet lies in a tailored neural architecture that explicitly parameterizes attribute-driven heterogeneity, and at the same time, embeds a scalable attribute selection mechanism. NetworkNet consistently estimates two types of latent heterogeneity functions, i.e., nodal expansiveness and popularity, while simultaneously performing data-driven attribute selection to extract influential nodal attributes. By unifying classical statistical network modeling with deep learning, NetworkNet delivers the expressive power of DNNs with methodological interpretability, algorithmic scalability, and statistical rigor with a non-asymptotic approximation error bound. Empirically, simulations demonstrate strong performance in both heterogeneity estimation and high-dimensional attribute selection. We further apply NetworkNet to a large-scale author-citation network among statisticians, revealing new insights into the dynamic evolution of research fields and scholarly impact.

[AI-253] Minimizing classical resources in variational measurement-based quantum computation for generative modeling

【速读】:该论文旨在解决变分测量基量子计算(Variational Measurement-Based Quantum Computation, VMBQC)模型中参数冗余导致的优化困难与训练性能差的问题。传统VMBQC模型因引入测量随机性而使通道模型参数数量增加一倍,即从单位酉模型的 N×DN \times D 扩展至 2N×D2N \times D,这显著增加了训练复杂度并可能引发梯度消失或局部最优问题。解决方案的关键在于提出一种受限的VMBQC模型,通过仅引入一个额外的可训练参数,将单位酉设置扩展为基于通道的模型,从而在保持极简参数规模的同时,实现对单位酉模型无法学习的概率分布的生成能力。数值与代数分析均表明,这一最小扩展足以突破原单位酉模型的表达限制。

链接: https://arxiv.org/abs/2604.11578
作者: Arunava Majumder,Hendrik Poulsen Nautrup,Hans J. Briegel
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 14 pages

点击查看摘要

Abstract:Measurement-based quantum computation (MBQC) is a framework for quantum information processing in which a computational task is carried out through one-qubit measurements on a highly entangled resource state. Due to the indeterminacy of the outcomes of a quantum measurement, the random outcomes of these operations, if not corrected, yield a variational quantum channel family. Traditionally, this randomness is corrected through classical processing in order to ensure deterministic unitary computations. Recently, variational measurement-based quantum computation (VMBQC) has been introduced to exploit this measurement-induced randomness to gain an advantage in generative modeling. A limitation of this approach is that the corresponding channel model has twice as many parameters compared to the unitary model, scaling as N \times D , where N is the number of logical qubits (width) and D is the depth of the VMBQC model. This can often make optimization more difficult and may lead to poorly trainable models. In this paper, we present a restricted VMBQC model that extends the unitary setting to a channel-based one using only a single additional trainable parameter. We show, both numerically and algebraically, that this minimal extension is sufficient to generate probability distributions that cannot be learned by the corresponding unitary model.

[AI-254] Deep Learning for Sequential Decision Making under Uncertainty: Foundations Frameworks and Frontiers

【速读】:该论文旨在解决当前人工智能(AI)从以预测为主的范式向支持复杂、不确定和动态环境下的决策能力转型过程中,如何有效融合数据驱动的深度学习与运筹学/管理科学(OR/MS)优化方法的问题。其核心挑战在于:深度学习虽具备强大的非线性建模与大规模数据适应能力,但缺乏对约束条件、资源回应对策及不确定性结构的显式建模能力;而传统OR/MS方法虽能提供严谨的决策结构框架,却难以处理高维、非线性且动态变化的数据环境。解决方案的关键在于将深度学习视为优化的补充而非替代——利用深度学习实现可扩展的近似表示与自适应能力,同时借助OR/MS提供结构性约束、决策时序性和不确定性建模的理论基础,从而构建集成学习与优化的下一代智能决策系统。

链接: https://arxiv.org/abs/2604.11507
作者: I. Esra Buyuktahtakin
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) is moving increasingly beyond prediction to support decisions in complex, uncertain, and dynamic environments. This shift creates a natural intersection with operations research and management sciences (OR/MS), which have long offered conceptual and methodological foundations for sequential decision-making under uncertainty. At the same time, recent advances in deep learning, including feedforward neural networks, LSTMs, transformers, and deep reinforcement learning, have expanded the scope of data-driven modeling and opened new possibilities for large-scale decision systems. This tutorial presents an OR/MS-centered perspective on deep learning for sequential decision-making under uncertainty. Its central premise is that deep learning is valuable not as a replacement for optimization, but as a complement to it. Deep learning brings adaptability and scalable approximation, whereas OR/MS provides the structural rigor needed to represent constraints, recourse, and uncertainty. The tutorial reviews key decision-making foundations, connects them to the major neural architectures in modern AI, and discusses leading approaches to integrating learning and optimization. It also highlights emerging impact in domains such as supply chains, healthcare and epidemic response, agriculture, energy, and autonomous operations. More broadly, it frames these developments as part of a wider transition from predictive AI toward decision-capable AI and highlights the role of OR/MS in shaping the next generation of integrated learning–optimization systems.

[AI-255] ADD for Multi-Bit Image Watermarking

【速读】:该论文旨在解决生成式 AI (Generative AI) 生成图像中存在虚假信息和真实性难以验证的问题,提出了一种高容量、强鲁棒性且具备理论支撑的多比特图像水印方法 ADD(Add, Dot, Decode)。其解决方案的关键在于将水印设计分为两个阶段:首先学习一个可线性组合多比特消息并加到图像中的水印向量;其次通过水印与含水印图像之间的内积实现高效解码。该方法在 MS-COCO 数据集上实现了 48 比特水印的 100% 解码准确率,在多种常见图像失真下性能下降不超过 2%,显著优于现有最优方法(平均下降达 14%),同时在嵌入和解码速度上分别提升 2 倍和 7.4 倍,且提供了理论分析以解释其有效性。

链接: https://arxiv.org/abs/2604.11491
作者: An Luo,Jie Ding
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:As generative models enable rapid creation of high-fidelity images, societal concerns about misinformation and authenticity have intensified. A promising remedy is multi-bit image watermarking, which embeds a multi-bit message into an image so that a verifier can later detect whether the image is generated by someone and further identify the source by decoding the embedded message. Existing approaches often fall short in capacity, resilience to common image distortions, and theoretical justification. To address these limitations, we propose ADD (Add, Dot, Decode), a multi-bit image watermarking method with two stages: learning a watermark to be linearly combined with the multi-bit message and added to the image, and decoding through inner products between the watermarked image and the learned watermark. On the standard MS-COCO benchmark, we demonstrate that for the challenging task of 48-bit watermarking, ADD achieves 100% decoding accuracy, with performance dropping by at most 2% under a wide range of image distortions, substantially smaller than the 14% average drop of state-of-the-art methods. In addition, ADD achieves substantial computational gains, with 2-fold faster embedding and 7.4-fold faster decoding than the fastest existing method. We further provide a theoretical analysis explaining why the learned watermark and the corresponding decoding rule are effective.

[AI-256] Regional Explanations: Bridging Local and Global Variable Importance NEURIPS2025

【速读】:该论文旨在解决局部归因方法(如Local Shapley Values 和 LIME)在识别局部重要特征时存在的根本性局限问题,即这些方法即使在理想条件下(精确计算且特征独立)也可能无法可靠地检测出真正影响模型输出的特征。作者指出,一个合理的局部归因方法不应赋予那些既不直接影响模型输出(例如线性模型中系数为零的特征),也不与功能相关特征存在统计依赖性的特征重要性。研究表明,现有方法违反了这一基本原则。为此,论文提出R-LOCO(Regional Leave Out COvariates)方法,其核心在于通过将输入空间划分为具有相似特征重要性特征的区域,并在每个区域内应用全局归因方法,从而利用区域归属信息推导出特定实例的特征贡献。这种方法实现了局部解释的准确性与稳定性,同时保留了实例特异性细节,弥补了传统局部解释易受扰动和全局方法丢失细节的问题。

链接: https://arxiv.org/abs/2604.11223
作者: Salim I. Amoukou,Nicolas J-B. Brunel
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

点击查看摘要

Abstract:We analyze two widely used local attribution methods, Local Shapley Values and LIME, which aim to quantify the contribution of a feature value x_i to a specific prediction f(x_1, \dots, x_p) . Despite their widespread use, we identify fundamental limitations in their ability to reliably detect locally important features, even under ideal conditions with exact computations and independent features. We argue that a sound local attribution method should not assign importance to features that neither influence the model output (e.g., features with zero coefficients in a linear model) nor exhibit statistical dependence with functionality-relevant features. We demonstrate that both Local SV and LIME violate this fundamental principle. To address this, we propose R-LOCO (Regional Leave Out COvariates), which bridges the gap between local and global explanations and provides more accurate attributions. R-LOCO segments the input space into regions with similar feature importance characteristics. It then applies global attribution methods within these regions, deriving an instance’s feature contributions from its regional membership. This approach delivers more faithful local attributions while avoiding local explanation instability and preserving instance-specific detail often lost in global methods.

[AI-257] Cost-optimal Sequential Testing via Doubly Robust Q-learning

【速读】:该论文旨在解决从回顾性数据中学习成本最优的序贯决策策略问题,尤其在测试结果依赖于先前测量结果、导致信息缺失(informative missingness)的情境下,如何有效识别最优的测试顺序与终止时机。其核心挑战在于建模复杂的测试轨迹异质性并处理因测试选择机制引发的偏差。解决方案的关键在于提出一种双重稳健(doubly robust)的Q-learning框架,通过引入路径特异性逆概率加权(path-specific inverse probability weights),以适应不同个体的测试路径并满足条件归一化性质;同时结合辅助对比模型(auxiliary contrast models)构造正交伪结果(orthogonal pseudo-outcomes),从而确保当测试获取模型或对比模型其中之一正确指定时,政策学习仍具无偏性。该方法还提供了阶段式对比估计量的Oracle不等式、政策学习的收敛速率、后悔边界及误分类率分析,验证了其理论稳定性与实际有效性。

链接: https://arxiv.org/abs/2604.11165
作者: Doudou Zhou,Yiran Zhang,Dian Jin,Yingye Zheng,Lu Tian,Tianxi Cai
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注:

点击查看摘要

Abstract:Clinical decision-making often involves selecting tests that are costly, invasive, or time-consuming, motivating individualized, sequential strategies for what to measure and when to stop ascertaining. We study the problem of learning cost-optimal sequential decision policies from retrospective data, where test availability depends on prior results, inducing informative missingness. Under a sequential missing-at-random mechanism, we develop a doubly robust Q-learning framework for estimating optimal policies. The method introduces path-specific inverse probability weights that account for heterogeneous test trajectories and satisfy a normalization property conditional on the observed history. By combining these weights with auxiliary contrast models, we construct orthogonal pseudo-outcomes that enable unbiased policy learning when either the acquisition model or the contrast model is correctly specified. We establish oracle inequalities for the stage-wise contrast estimators, along with convergence rates, regret bounds, and misclassification rates for the learned policy. Simulations demonstrate improved cost-adjusted performance over weighted and complete-case baselines, and an application to a prostate cancer cohort study illustrates how the method reduces testing cost without compromising predictive accuracy.

[AI-258] Harnessing Photonics for Machine Intelligence

【速读】:该论文旨在解决后摩尔时代(post-Moore era)机器智能工作负载快速增长与计算系统在功耗、内存和互连带宽等方面的物理极限之间的矛盾,即传统基于晶体管密度提升的计算架构已难以为继。其解决方案的关键在于利用集成光子学(integrated photonics)的光学带宽和并行性优势,重构数据传输与计算范式,通过跨层协同设计(cross-layer co-design)和面向工作负载自适应的可编程能力(workload-adaptive programmability),实现高能效与多功能性的统一。同时,论文强调电子-光子设计自动化(Electronic-Photonic Design Automation, EPDA)将成为核心使能技术,推动从仿真、逆向设计到系统建模与物理实现的闭环优化,从而构建可扩展、可复现的电子-光子融合生态系统,助力光子机器智能(photonic machine intelligence)迈向系统级自动化与规模化应用。

链接: https://arxiv.org/abs/2604.10841
作者: Hanqing Zhu,Shupeng Ning,Hongjian Zhou,Ziang Yin,Ray T. Chen,Jiaqi Gu,David Z. Pan
机构: 未知
类目: Optics (physics.optics); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 20 pages

点击查看摘要

Abstract:The exponential growth of machine-intelligence workloads is colliding with the power, memory, and interconnect limits of the post-Moore era, motivating compute substrates that scale beyond transistor density alone. Integrated photonics is emerging as a candidate for artificial intelligence (AI) acceleration by exploiting optical bandwidth and parallelism to reshape data movement and computation. This review reframes photonic computing from a circuits-and-systems perspective, moving beyond building-block progress toward cross-layer system analysis and full-stack design automation. We synthesize recent advances through a bottleneck-driven taxonomy that delineates the operating regimes and scaling trends where photonics can deliver end-to-end sustained benefits. A central theme is cross-layer co-design and workload-adaptive programmability to sustain high efficiency and versatility across evolving application domains at scale. We further argue that Electronic-Photonic Design Automation (EPDA) will be pivotal, enabling closed-loop co-optimization across simulation, inverse design, system modeling, and physical implementation. By charting a roadmap from laboratory prototypes to scalable, reproducible electronic-photonic ecosystems, this review aims to guide the CAS community toward an automated, system-centric era of photonic machine intelligence.

[AI-259] ail-Aware Information-Theoretic Generalization for RLHF and SGLD

【速读】:该论文旨在解决经典信息论泛化界在处理重尾数据时失效的问题,特别是在鲁棒学习、基于人类反馈的强化学习(RLHF)和随机优化等现代机器学习流程中,损失函数或奖励信号常呈现重尾特性,导致依赖矩生成函数(MGF)的KL散度工具不再适用。其解决方案的关键在于构建一个基于子魏布ull(sub-Weibull)分布的尾部依赖型信息论框架,其中尾部参数 θ\theta 控制尾部衰减速率:θ=2\theta=2 对应次高斯,θ=1\theta=1 对应次指数,0<θ<10<\theta<1 表征真正重尾情形。核心技术创新是提出一个去相关引理,利用平移对数 fθf_\theta-散度来控制测度变换下的期望,从而无需依赖MGF即可与Rényi散度建立显式比较关系;同时建立了针对子魏布ull过程的尖锐最大不等式和Dudley型链式界,复杂度随 log1/θ\log^{1/\theta} 和熵的 1/θ1/\theta 次幂增长,最终导出期望与高概率的PAC-Bayes泛化界及基于多尺度Rényi互信息的链式不等式,适用于重尾奖励下的Rényi正则化RLHF和含重尾梯度噪声的随机梯度朗之万动力学(SGLD)。

链接: https://arxiv.org/abs/2604.10727
作者: Huiming Zhang,Binghan Li,Wan Tian,Qiang Sun
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
备注: 65 pages, 9 figures

点击查看摘要

Abstract:Classical information-theoretic generalization bounds typically control the generalization gap through KL-based mutual information and therefore rely on boundedness or sub-Gaussian tails via the moment generating function (MGF). In many modern pipelines, such as robust learning, RLHF, and stochastic optimization, losses and rewards can be heavy-tailed, and MGFs may not exist, rendering KL-based tools ineffective. We develop a tail-dependent information-theoretic framework for sub-Weibull data, where the tail parameter \theta controls the tail heaviness: \theta=2 corresponds to sub-Gaussian, \theta=1 to sub-exponential, and 0\theta1 to genuinely heavy tails. Our key technical ingredient is a decorrelation lemma that bounds change-of-measure expectations using a shifted-log f_\theta -divergence, which admits explicit comparisons to Rényi divergence without MGF arguments. On the empirical-process side, we establish sharp maximal inequalities and a Dudley-type chaining bound for sub-Weibull processes with tail index \theta , with complexity scaling as \log^1/\theta and entropy ^1/\theta . These tools yield expected and high-probability PAC-Bayes generalization bounds, as well as an information-theoretic chaining inequality based on multiscale Rényi mutual information. We illustrate the consequences in Rényi-regularized RLHF under heavy-tailed rewards and in stochastic gradient Langevin dynamics with heavy-tailed gradient noise.

[AI-260] Universal statistical signatures of evolution in artificial intelligence architectures

【速读】:该论文旨在探究人工智能(AI)架构的演化是否遵循与生物进化相同的统计规律。其核心问题是:AI架构在迭代优化过程中所表现出的适应性变化是否具有类似生物进化的分布特征,从而揭示演化机制的普适性。解决方案的关键在于系统性地整合来自161篇文献的935项消融实验(ablation experiments),量化架构修改对模型性能的影响分布(即适应度效应分布,DFE),并发现其符合重尾的Student’s t分布,且有益突变比例(13%)显著高于生物学中的典型值(1–6%),这表明AI演化虽依赖于有向搜索(directed search)而非随机漂变,但其统计结构仍与生物进化一致。此外,研究进一步揭示了架构起源遵循逻辑斯蒂动力学(logistic dynamics)并呈现间断平衡和适应辐射现象,支持了演化统计结构由适应度景观拓扑决定、而非具体选择机制的理论主张。

链接: https://arxiv.org/abs/2604.10571
作者: Theodor Spiro
机构: 未知
类目: Populations and Evolution (q-bio.PE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Neural and Evolutionary Computing (cs.NE)
备注: 15 pages, 4 figures, 4 supplementary tables. Code and data: this https URL

点击查看摘要

Abstract:We test whether artificial intelligence architectural evolution obeys the same statistical laws as biological evolution. Compiling 935 ablation experiments from 161 publications, we show that the distribution of fitness effects (DFE) of architectural modifications follows a heavy-tailed Student’s t-distribution with proportions (68% deleterious, 19% neutral, 13% beneficial for major ablations, n=568) that place AI between compact viral genomes and simple eukaryotes. The DFE shape matches D. melanogaster (normalized KS=0.07) and S. cerevisiae (KS=0.09); the elevated beneficial fraction (13% vs. 1-6% in biology) quantifies the advantage of directed over blind search while preserving the distributional form. Architectural origination follows logistic dynamics (R^2=0.994) with punctuated equilibria and adaptive radiation into domain niches. Fourteen architectural traits were independently invented 3-5 times, paralleling biological convergences. These results demonstrate that the statistical structure of evolution is substrate-independent, determined by fitness landscape topology rather than the mechanism of selection.

[AI-261] A Minimal Model of Representation Collapse: Frustration Stop-Gradient and Dynamics

【速读】:该论文旨在解决自监督表示学习(self-supervised representation learning)中普遍存在的表征坍缩(representation collapse)问题,即嵌入空间中的特征失去判别性,导致不同输入变得不可区分。其关键解决方案在于构建一个仅含嵌入层的最小化模型,并通过分类-表示设置量化坍缩现象,揭示在数据可完美分类时不会发生坍缩,而少量无法一致分类的“挫折样本”会因引入额外的慢时间尺度而导致坍缩;进一步通过引入共享投影头(shared projection head)并应用停止梯度(stop-gradient)策略,在训练动力学层面抑制坍缩,理论分析表明该方法能稳定有限类别分离,且实验证明该机制在教师-学生线性模型中同样有效,说明其本质机制超越纯嵌入设定。

链接: https://arxiv.org/abs/2604.09979
作者: Louie Hong Yao,Yuhao Li,Shengchao Liu
机构: 未知
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 13 figures

点击查看摘要

Abstract:Self-supervised representation learning is central to modern machine learning because it extracts structured latent features from unlabeled data and enables robust transfer across tasks and domains. However, it can suffer from representation collapse, a widely observed failure mode in which embeddings lose discriminative structure and distinct inputs become indistinguishable. To understand the mechanisms that drive collapse and the ingredients that prevent it, we introduce a minimal embedding-only model whose gradient-flow dynamics and fixed points can be analyzed in closed form, using a classification-representation setting as a concrete playground where collapse is directly quantified through the contraction of label-embedding geometry. We illustrate that the model does not collapse when the data are perfectly classifiable, while a small fraction of frustrated samples that cannot be classified consistently induces collapse through an additional slow time scale that follows the early performance gain. Within the same framework, we examine collapse prevention by adding a shared projection head and applying stop-gradient at the level of the training dynamics. We analyze the resulting fixed points and develop a dynamical mean-field style self-consistency description, showing that stop-gradient enables non-collapsed solutions and stabilizes finite class separation under frustration. We further verify empirically that the same qualitative dynamics and collapse-prevention effects appear in a linear teacher-student model, indicating that the minimal theory captures features that persist beyond the pure embedding setting.

[AI-262] he Rise and Fall of G in AGI

【速读】:该论文试图解决人工智能领域中关于“通用智能”(Artificial General Intelligence, AGI)的实证基础问题,即如何从量化角度验证AI模型是否具备类似人类心理测量学中的“一般智力因子”(g-factor),该因子描述的是不同认知能力之间的正相关结构(positive manifold)。其解决方案的关键在于将大型语言模型(LLM)在时间序列上的性能表现视为认知测试电池(cognitive test battery),通过主成分分析(Principal Component Analysis, PCA)对模型-基准-时间矩阵进行建模,发现尽管存在多任务能力分化趋势,LLM整体仍表现出显著的正相关结构——PC1解释了高达90%的方差(核心五基准电池),表明AI系统具备一种类g因子的通用智能特征;同时,随着推理专用模型的出现,这一因子的解释力下降,揭示出“专业化”(specialization)正在取代“通用性”(generalizability),形成由“AI- hedgehog”(通用型)向“AI-foxes”(多样化问题解决系统)演化的趋势,从而为AGI的评估提供了可量化的心理学范式依据。

链接: https://arxiv.org/abs/2604.09911
作者: David C. Krakauer
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the psychological literature the term general intelligence' describes correlations between abilities and not simply the number of abilities. This paper connects Spearman's g -factor from psychometrics, measuring a positive manifold, to the implicit `` G -factor'' in claims about artificial general intelligence (AGI) performance on temporally structured benchmarks. By treating LLM benchmark batteries as cognitive test batteries and model releases as subjects, principal component analysis is applied to a models \times benchmarks \times time matrix spanning 39 models (2019--2025) and 14 benchmarks. Preliminary results confirm a strong positive manifold in which all 28 pairwise correlations positive across 8 benchmarks. By analyzing the spectrum of the benchmark correlation through time, PC1 explains 90\% of variance on a 5-benchmark core battery ( n=19 )) reducing to 77\% by 2024. On a four benchmark battery, PC1 is found to peak at 92\% of the variance between 2023--2024 and reduce to 64\% with the arrival of reasoning-specialized models in 2024. This is coincident with a rotation in the G-factor as models outsource reasoning’ to tools. The analysis of partial correlation matrices through time provides evidence for the evolution of specialization beneath the positive manifold of general intelligence (AI-hedgehog) encompassing diverse high dimensional problem solving systems (AI-foxes). In strictly psychometric terms, AI models exhibit general intelligence suppressing specialized intelligences. LLMs invert the ideal of substituting complicated models with parsimonious mechanisms, a `Ptolemaic Succession’ of theories, with architectures of increasing hierarchical complication and capability.

[AI-263] Learning noisy phase transition dynamics from stochastic partial differential equations

【速读】:该论文旨在解决多尺度相变非平衡动力学中随机性建模的问题,特别是如何在机器学习代理模型中准确刻画热涨落对罕见事件(如成核)的影响,同时保证物理守恒律和可解释性。传统确定性模型无法捕捉此类热激活过程,而现有机器学习方法往往忽视了随机性的显式表示或难以满足质量守恒等物理约束。解决方案的关键在于:在细胞间通量层面参数化代理模型,将每条通量分解为由化学势梯度驱动的确定性部分与可学习的噪声幅度部分,从而在每个时间步精确保证质量守恒,并引入具有物理意义的随机涨落;此外,通过引入可学习的自由能泛函实现热力学可解释性,验证了其对体相双井势、界面过剩能量及曲率无关界面张力的独立恢复能力。这种基于通量层级的随机结构设计,使模型不仅在训练范围内准确复现系综统计特性与噪声加速粗化行为,还能外推至更大空间域(体积增加64倍)和更长时域(延长160倍),并首次实现了亚稳态下热激活成核的定性捕捉,证明了通量级随机性是构建物理感知代理模型的架构必要条件而非可选增强。

链接: https://arxiv.org/abs/2604.09664
作者: Luning Sun,Van Hai Nguyen,Shusen Liu,John Klepeis,Fei Zhou
机构: 未知
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
备注: 31 pages, 21 figures

点击查看摘要

Abstract:The non-equilibrium dynamics of mesoscale phase transitions are fundamentally shaped by thermal fluctuations, which not only seed instabilities but actively control kinetic pathways, including rare barrier-crossing events such as nucleation that are entirely inaccessible to deterministic models. Machine-learning surrogates for such systems must therefore represent stochasticity explicitly, enforce conservation laws by construction, and expose physically interpretable structure. We develop physics-aware surrogate models for the stochastic Cahn-Hilliard equation in 3D that satisfy all three requirements simultaneously. The key innovation is to parameterize the surrogate at the level of inter-cell fluxes, decomposing each flux into a deterministic mobility-weighted chemical-potential gradient and a learnable noise amplitude. This design guarantees exact mass conservation at every step and adds physical fluctuations to inter-cell mass transport. A learnable free energy functional provides thermodynamic interpretability, validated by independent recovery of the bulk double-well landscape, interfacial excess energy, and curvature-independent interfacial tension. Tests demonstrate accurate reproduction of ensemble statistics and noise-accelerated coarsening, with generalization to spatial domains 64 times larger in volume and temporal horizons 160x longer than those seen during training. Critically, the stochastic surrogate captures thermally activated nucleation in the metastable regime, a qualitative capability that no deterministic surrogate can provide regardless of training, thus establishing flux-level stochasticity as an architectural necessity rather than an optional enhancement.

[AI-264] Diffusion-Based Generative Priors for Efficient Beam Alignment in Directional Networks

【速读】:该论文旨在解决毫米波(mmWave)和太赫兹(THz)系统中定向通信面临的波束对齐(beam alignment)挑战,即如何在窄波束条件下实现高精度且低开销的训练过程。现有基于学习的方法通常仅预测单一波束而无法量化不确定性,限制了自适应波束扫描的灵活性。其解决方案的关键在于将波束对齐重新建模为生成任务,并提出一种条件扩散模型(conditional diffusion model),该模型从紧凑的几何与多径特征中学习概率性波束先验(probabilistic beam prior),从而指导top-k波束扫描并捕捉因探测受限导致的信噪比(SNR)损失。实验表明,该方法在小规模扫描预算下显著提升命中率(Hit@1 ≈ 0.61,较确定性分类器提升约180%),同时保持较高SNR,实现了低延迟、节能的波束对齐。

链接: https://arxiv.org/abs/2604.09653
作者: Esraa Fahmy Othman,Lina Bariah,Merouane Debbah
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Beam alignment is a key challenge in directional mmWave and THz systems, where narrow beams require accurate yet low-overhead training. Existing learning-based approaches typically predict a single beam and do not quantify uncertainty, limiting adaptive beam sweeping. We recast beam alignment as a generative task and propose a conditional diffusion model that learns a probabilistic beam prior from compact geometric and multipath features. The learned priors guide top- k sweeps and capture the SNR loss induced by limited probing. Using a ray-traced DeepMIMO scenario with an 8-beam DFT codebook, our best conditional diffusion model achieves strong ranking performance (Hit@1 \approx 0.61 , Hit@3 \approx 0.90 , Hit@5 \approx 0.97 ) while preserving SNR at small sweep budgets. Compared with a deterministic classifier baseline, diffusion improves Hit@1 by about 180%. Results further highlight the importance of informative conditioning and the ability of diffusion sampling to flexibly trade accuracy for computational efficiency. The proposed diffusion framework achieves substantial improvements in small- k Hit rates, translating into reduced beam training overhead and enabling low-latency, energy-efficient beam alignment for mmWave and THz systems while preserving received SNR.

[AI-265] Dynamic Forecasting and Temporal Feature Evolution of Stock Repurchases in Listed Companies Using Attention-Based Deep Temporal Networks

【速读】:该论文旨在解决传统静态模型难以捕捉企业财务状况复杂时序依赖关系的问题,从而实现对股票回购行为的准确预测。其解决方案的关键在于构建一个融合经济理论与深度时序网络的动态预警系统,采用混合的时序卷积网络(Temporal Convolutional Network, TCN)与注意力机制增强的长短期记忆网络(Attention-based LSTM),有效提取长期和短期财务演化特征,并通过滚动窗口交叉验证验证了模型性能显著优于逻辑回归和XGBoost等静态基线方法。

链接: https://arxiv.org/abs/2604.09650
作者: Xiang Ao,Jingxuan Zhang,Xinyu Zhao
机构: 未知
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 8 figures

点击查看摘要

Abstract:Accurately predicting stock repurchases is crucial for quantitative investment and risk management, yet traditional static models fail to capture the complex temporal dependencies of corporate financial conditions. This paper proposes a dynamic early warning system integrating economic theory with deep temporal networks. Using Chinese A-share panel data (2014-2024), we employ a hybrid Temporal Convolutional Network (TCN) and Attention-based LSTM to capture long- and short-term financial evolutionary patterns. Rolling-window cross-validation demonstrates our model significantly outperforms static baselines like Logistic Regression and XGBoost. Furthermore, utilizing Explainable AI (XAI), we reveal the temporal dynamics of repurchase decisions: prolonged “undervaluation” serves as the long-term underlying motive, while a sharp increase in “cash flow” acts as the decisive short-term trigger. This study provides a robust deep learning paradigm for financial forecasting and offers dynamic empirical support for classic corporate finance hypotheses.

[AI-266] he Paradox of Professional Input: How Expert Collaboration with AI Systems Shapes Their Future Value

【速读】:该论文试图解决的核心问题是:专业人员在与人工智能(Artificial Intelligence, AI)系统协作过程中,通过将隐性知识(tacit knowledge)外化以增强AI能力时,可能加速自身专业知识的自动化,从而威胁传统职业价值。解决方案的关键在于识别出人类-人工智能协作中的新兴模式,并提出框架帮助专业人士在这一动态环境中保持和重塑其专业价值;具体而言,通过整合知识管理、专家研究、人机交互及劳动经济学等领域的研究成果,论文强调应借助系统性的专业教育改革、组织设计优化和政策支持,使专家知识的编码过程成为提升而非削弱人类专业价值的手段。

链接: https://arxiv.org/abs/2504.12654
作者: Venkat Ram Reddy Ganuthula,Krishna Kumar Balaraman
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This perspective paper examines a fundamental paradox in the relationship between professional expertise and artificial intelligence: as domain experts increasingly collaborate with AI systems by externalizing their implicit knowledge, they potentially accelerate the automation of their own expertise. Through analysis of multiple professional contexts, we identify emerging patterns in human-AI collaboration and propose frameworks for professionals to navigate this evolving landscape. Drawing on research in knowledge management, expertise studies, human-computer interaction, and labor economics, we develop a nuanced understanding of how professional value may be preserved and transformed in an era of increasingly capable AI systems. Our analysis suggests that while the externalization of tacit knowledge presents certain risks to traditional professional roles, it also creates opportunities for the evolution of expertise and the emergence of new forms of professional value. We conclude with implications for professional education, organizational design, and policy development that can help ensure the codification of expert knowledge enhances rather than diminishes the value of human expertise.

机器学习

[LG-0] KL Divergence Between Gaussians: A Step-by-Step Derivation for the Variational Autoencoder Objective

链接: https://arxiv.org/abs/2604.11744
作者: Andrés Muñoz,Rodrigo Ramele
类目: Machine Learning (cs.LG)
*备注: 8 pages, no figures. Derivation of the KL divergence between Gaussian distributions with application to Variational Autoencoders (VAEs)

点击查看摘要

Abstract:Kullback-Leibler (KL) divergence is a fundamental concept in information theory that quantifies the discrepancy between two probability distributions. In the context of Variational Autoencoders (VAEs), it serves as a central regularization term, imposing structure on the latent space and thereby enabling the model to exhibit generative capabilities. In this work, we present a detailed derivation of the closed-form expression for the KL divergence between Gaussian distributions, a case of particular importance in practical VAE implementations. Starting from the general definition for continuous random variables, we derive the expression for the univariate case and extend it to the multivariate setting under the assumption of diagonal covariance. Finally, we discuss the interpretation of each term in the resulting expression and its impact on the training dynamics of the model.

[LG-1] GPU Acceleration of Sparse Fully Homomorphic Encrypted DNNs EUROSYS’26

链接: https://arxiv.org/abs/2604.11659
作者: Lara D’Agata,Carlos Agulló-Domingo,Óscar Vera-López,Kaustubh Shivdikar,Ardhi W. B. Yudha,Ferhat Yaman,David Kaeli,José L. Abellán,Ian Colbert,José Cano
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Performance (cs.PF)
*备注: Accepted to the 6th Workshop on Machine Learning and Systems (EuroMLSys) co-located with EuroSys '26

点击查看摘要

Abstract:Fully homomorphic encryption (FHE) has recently attracted significant attention as both a cryptographic primitive and a systems challenge. Given the latest advances in accelerated computing, FHE presents a promising opportunity for progress, with applications ranging from machine learning to information security. We target the most computationally intensive operation in deep neural networks from a hardware perspective, matrix multiplication (matmul), and adapt it for execution on AMD GPUs. We propose a new optimized method that improves the runtime and complexity of ciphertext matmul by using FIDESlib, a recent open-source FHE library designed specifically for GPUs. By exploiting sparsity in both operands, our sparse matmul implementation outperforms its CPU counterpart by up to 3.0\times and reduces the time complexity from cubic to semi-linear, demonstrating an improvement over existing FHE matmul implementations.

[LG-2] Inter-Layer Hessian Analysis of Neural Networks with DAG Architectures

链接: https://arxiv.org/abs/2604.11639
作者: Maxim Bolshim(1),Alexander Kugaevskikh(1) ((1) ITMO University, Saint Petersburg, Russia)
类目: Machine Learning (cs.LG)
*备注: 45 pages, 9 figures, 17 tables. Submitted to Neural Networks (Elsevier). Code: this https URL

点击查看摘要

Abstract:Modern automatic differentiation frameworks (JAX, PyTorch) return the Hessian of the loss function as a monolithic tensor, without exposing the internal structure of inter-layer interactions. This paper presents an analytical formalism that explicitly decomposes the full Hessian into blocks indexed by the DAG of an arbitrary architecture. The canonical decomposition H = H^GN + H^T separates the Gauss–Newton component (convex part) from the tensor component (residual curvature responsible for saddle points). For piecewise-linear activations (ReLU), the tensor component of the input Hessian vanishes ( H^T_v,w!\equiv!0 a.e., H^f_v,w!=!H^GN_v,w!\succeq!0 ); the full parametric Hessian contains residual terms that do not reduce to the GGN. Building on this decomposition, we introduce diagnostic metrics (inter-layer resonance~ \mathcalR , geometric coupling~ \mathcalC , stable rank~ \mathcalD , GN-Gap) that are estimated stochastically in O§ time and reveal structural curvature interactions between layers. The theoretical analysis explains exponential decay of resonance in vanilla networks and its preservation under skip connections; empirical validation spans fully connected MLPs (Exp.,1–5) and convolutional architectures (ResNet-18, \sim11 M~parameters, Exp.,6). When the architecture reduces to a single node, all definitions collapse to the standard Hessian \nabla^2_\theta\mathcalL(\theta)\in\mathbbR^p\times p .

[LG-3] mpusBench: An Evaluation Framework for Time-Series Forecasting

链接: https://arxiv.org/abs/2604.11529
作者: Denizalp Goktas,Gerardo Riaño-Briceño,Alif Abdullah,Aryan Nair,Chenkai Shen,Beatriz de Lucio,Alexandra Magnusson,Farhan Mashrur,Ahmed Abdulla,Shawrna Sen,Mahitha Thippireddy,Gregory Schwartz,Amy Greenwald
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Foundation models have transformed natural language processing and computer vision, and a rapidly growing literature on time-series foundation models (TSFMs) seeks to replicate this success in forecasting. While recent open-source models demonstrate the promise of TSFMs, the field lacks a comprehensive and community-accepted model evaluation framework. We see at least four major issues impeding progress on the development of such a framework. First, current evaluation frameworks consist of benchmark forecasting tasks derived from often outdated datasets (e.g., M3), many of which lack clear metadata and overlap with the corpora used to pre-train TSFMs. Second, existing frameworks evaluate models along a narrowly defined set of benchmark forecasting tasks such as forecast horizon length or domain, but overlook core statistical properties such as non-stationarity and seasonality. Third, domain-specific models (e.g., XGBoost) are often compared unfairly, as existing frameworks neglect a systematic and consistent hyperparameter tuning convention for all models. Fourth, visualization tools for interpreting comparative performance are lacking. To address these issues, we introduce TempusBench, an open-source evaluation framework for TSFMs. TempusBench consists of 1) new datasets which are not included in existing TSFM pretraining corpora, 2) a set of novel benchmark tasks that go beyond existing ones, 3) a model evaluation pipeline with a standardized hyperparameter tuning protocol, and 4) a tensorboard-based visualization interface. We provide access to our code on GitHub: this https URL.

[LG-4] Generative Path-Finding Method for Wasserstein Gradient Flow

链接: https://arxiv.org/abs/2604.11519
作者: Chengyu Liu,Xiang Zhou
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注: Due to the arXiv notice that “The Abstract field cannot be longer than 1,920 characters”, the abstract shown here is shortened. For the full abstract, please download the article

点击查看摘要

Abstract:Wasserstein gradient flows (WGFs) describe the evolution of probability distributions in Wasserstein space as steepest descent dynamics for a free energy functional. Computing the full path from an arbitrary initial distribution to equilibrium is challenging, especially in high dimensions. Eulerian methods suffer from the curse of dimensionality, while existing Lagrangian approaches based on particles or generative maps do not naturally improve efficiency through time step tuning. We propose GenWGP, a generative path finding framework for Wasserstein gradient paths. GenWGP learns a generative flow that transports mass from an initial density to an unknown equilibrium distribution by minimizing a path loss that encodes the full trajectory and its terminal equilibrium condition. The loss is derived from a geometric action functional motivated by Dawson Gartner large deviation theory for empirical distributions of interacting diffusion systems. We formulate both a finite horizon action under physical time parametrization and a reparameterization invariant geometric action based on Wasserstein arclength. Using normalizing flows, GenWGP computes a geometric curve toward equilibrium while enforcing approximately constant intrinsic speed between adjacent network layers, so that discretized distributions remain nearly equidistant in the Wasserstein metric along the path. This avoids delicate time stepping constraints and enables stable training that is largely independent of temporal or geometric discretization. Experiments on Fokker Planck and aggregation type problems show that GenWGP matches or exceeds high fidelity reference solutions with only about a dozen discretization points while capturing complex dynamics.

[LG-5] he Price of Ignorance: Information-Free Quotation for Data Retention in Machine Unlearning

链接: https://arxiv.org/abs/2604.11511
作者: Bin Han,Di Feng,Zexin Fang,Jie Wang,Hans D. Schotten
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: Submitted to IEEE Transactions on Mobile Computing

点击查看摘要

Abstract:When users exercise data deletion rights under the General Data Protection Regulation (GDPR) and similar regulations, mobile network operators face a tradeoff: excessive machine unlearning degrades model accuracy and incurs retraining costs, yet existing pricing mechanisms for data retention require the server to know every user’s private privacy and accuracy preferences, which is infeasible under the very regulations that motivate unlearning. We ask: what is the welfare cost of operating without this private information? We design an information-free ascending quotation mechanism where the server broadcasts progressively higher prices and users self-select their data supply, requiring no knowledge of users’ parameters. Under complete information, the protocol admits a unique subgame-perfect Nash equilibrium characterized by single-period selling. We formalize the Price of Ignorance – the welfare gap between optimal personalized pricing (which knows everything) and our information-free quotation (which knows nothing) – and prove a three-regime efficiency ordering. Numerical evaluation across seven mechanisms and 5000 Monte Carlo runs shows that this price is near zero: the information-free mechanism achieves =99% of the welfare of its information-intensive benchmarks, while providing noise-robust guarantees and comparable fairness.

[LG-6] CAGenMol: Condition-Aware Diffusion Language Model for Goal-Directed Molecular Generation

链接: https://arxiv.org/abs/2604.11483
作者: Yanting Li,Zhuoyang Jiang,Enyan Dai,Lei Wang,Wen-Cai Ye,Li Liu
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Goal-directed molecular generation requires satisfying heterogeneous constraints such as protein–ligand compatibility and multi-objective drug-like properties, yet existing methods often optimize these constraints in isolation, failing to reconcile conflicting objectives (e.g., affinity vs. safety), and struggle to navigate the non-differentiable chemical space without compromising structural validity. To address these challenges, we propose CAGenMol, a condition-aware discrete diffusion framework over molecular sequences that formulates molecular design as conditional denoising guided by heterogeneous structural and property signals. By coupling discrete diffusion with reinforcement learning, the model aligns the generation trajectory with non-differentiable objectives while preserving chemical validity and diversity. The non-autoregressive nature of diffusion language model further enables iterative refinement of molecular fragments at inference time. Experiments on structure-conditioned, property-conditioned, and dual-conditioned benchmarks demonstrate consistent improvements over state-of-the-art methods in binding affinity, drug-likeness, and success rate, highlighting the effectiveness of our framework.

[LG-7] Structural Consequences of Policy-Based Interventions on the Global Supply Chain Network

链接: https://arxiv.org/abs/2604.11479
作者: Lea Karbevska,Liming Xu,Zehui Dai,Sara AlMahri,Alexandra Brintrup
类目: Machine Learning (cs.LG); General Economics (econ.GN); Physics and Society (physics.soc-ph)
*备注:

点击查看摘要

Abstract:As global political tensions rise and the anticipation of additional tariffs from the United States on international trade increases, the issues of economic independence and supply chain resilience become more prominent. The importance of supply chain resilience has been further underscored by disruptions caused by the COVID-19 pandemic and the ongoing war in this http URL light of these challenges, ranging from geopolitical instability to product supply uncertainties, governments are increasingly focused on adopting new trade policies. This study explores the impact of several of these policies on the global electric vehicle (EV) supply chain network, with a particular focus on their effects on country clusters and the broader structure of international trade. Specifically, we analyse three key policies: Country Plus One, Friendshoring, and Reshoring. Our findings show that Friendshoring, contrary to expectations, leads to greater globalisation by increasing the number of supply links across friendly countries, potentially raising transaction costs. The Country Plus One policy similarly enhances network density through redundant links, while the Reshoring policy creates challenges in the EV sector due to the high number of irreplaceable products. Additionally, the effects of these policies vary across industries; for instance, mining goods being less affected in Country Plus One than the Friendshoring policy.

[LG-8] Learning How Much to Think: Difficulty-Aware Dynamic MoEs for Graph Node Classification

链接: https://arxiv.org/abs/2604.11473
作者: Jiajun Zhou,Yadong Li,Xuanze Chen,Chen Ma,Chuang Zhao,Shanqing Yu,Qi Xuan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) architectures offer a scalable path for Graph Neural Networks (GNNs) in node classification tasks but typically rely on static and rigid routing strategies that enforce a uniform expert budget or coarse-grained expert toggles on all nodes. This limitation overlooks the varying discriminative difficulty of nodes and leads to under-fitting for hard nodes and redundant computation for easy ones. To resolve this issue, we propose D2MoE, a novel framework that shifts the focus from static expert selection to node-wise expert resource allocation. By using predictive entropy as a real-time proxy for difficulty, D2MoE employs a difficulty-driven top-p routing mechanism to adaptively concentrate expert resources on hard nodes while reducing overhead for easy ones, achieving continuous and fine-grained expert budget scaling for node classification. Experiments on 13 benchmarks demonstrate that D2MoE achieves consistent state-of-the-art performance, surpassing leading baselines by up to 7.92% in accuracy on heterophilous graphs. Notably, on large-scale graphs, it reduces memory consumption by up to 73.07% and training time by 46.53% compared to the best-performing Graph MoE, thereby validating its superior efficiency.

[LG-9] Exact Certification of Neural Networks and Partition Aggregation Ensembles against Label Poisoning ICLR2026

链接: https://arxiv.org/abs/2604.11416
作者: Ajinkya Mohgaonkar,Lukas Gosch,Mahalakshmi Sabanayagam,Debarghya Ghoshdastidar,Stephan Günnemann
类目: Machine Learning (cs.LG)
*备注: Workshop on Principled Design for Trustworthy AI @ ICLR 2026

点击查看摘要

Abstract:Label-flipping attacks, which corrupt training labels to induce misclassifications at inference, remain a major threat to supervised learning models. This drives the need for robustness certificates that provide formal guarantees about a model’s robustness under adversarially corrupted labels. Existing certification frameworks rely on ensemble techniques such as smoothing or partition-aggregation, but treat the corresponding base classifiers as black boxes, yielding overly conservative guarantees. We introduce EnsembleCert, the first certification framework for partition-aggregation ensembles that utilizes white-box knowledge of the base classifiers. Concretely, EnsembleCert yields tighter guarantees than black-box approaches by aggregating per-partition white-box certificates to compute ensemble-level guarantees in polynomial time. To extract white-box knowledge from the base classifiers efficiently, we develop ScaLabelCert, a method that leverages the equivalence between sufficiently wide neural networks and kernel methods using the neural tangent kernel. ScaLabelCert yields the first exact, polynomial-time calculable certificate for neural networks against label-flipping attacks. EnsembleCert is either on par, or significantly outperforms the existing partition-based black box certificates. Exemplary, on CIFAR-10, our method can certify upto +26.5% more label flips in median over the test set compared to the existing black-box approach while requiring 100 times fewer partitions, thus, challenging the prevailing notion that heavy partitioning is a necessity for strong certified robustness.

[LG-10] Active Bayesian Inference for Robust Control under Sensor False Data Injection Attacks

链接: https://arxiv.org/abs/2604.11410
作者: Axel Andersson,György Dán
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 8 pages, 4 figures. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:We present a framework for bridging the gap between sensor attack detection and recovery in cyber-physical systems. The proposed framework models modern-day, complex perception pipelines as bipartite graphs, which combined with anomaly detector alerts defines a Bayesian network for inferring compromised sensors. An active probing strategy exploits system nonlinearities to maximize distinguishability between attack hypotheses, while compromised sensors are selectively disabled to maintain reliable state estimation. We propose a threshold-based probing strategy and show its effectiveness via a simplified partially observable Markov decision process (POMDP) formulation. Experiments on an inverted pendulum under single and multi-sensor attacks show that our method significantly outperforms outlier-robust and prediction-based baselines, especially under prolonged attacks.

[LG-11] BRIDGE and TCH-Net: Heterogeneous Benchmark and Multi-Branch Baseline for Cross-Domain IoT Botnet Detection

链接: https://arxiv.org/abs/2604.11324
作者: Ammar Bhilwarawala,Likhamba Rongmei,Harsh Sharma,Arya Jena,Kaushal Singh,Jayashree Piri,Raghunath Dey
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 21 pages, 8 figures, submitted to Journal of Network and Computer Applications

点击查看摘要

Abstract:IoT botnet detection has advanced, yet most published systems are validated on a single dataset and rarely generalise across environments. Heterogeneous feature spaces make multi-dataset training practically impossible without discarding semantic interpretability or introducing data integrity violations. No prior work has addressed both problems with a formally specified, reproducible methodology. This paper does. We introduce BRIDGE (Benchmark Reference for IoT Domain Generalisation Evaluation), the first formally specified heterogeneous multi-dataset benchmark for IoT intrusion detection, unifying CICIDS-2017, CIC-IoT-2023, Bot-IoT, Edge-IIoTset, and N-BaIoT through a 46-feature semantic canonical vocabulary grounded in CICFlowMeter nomenclature, with genuine-equivalence-only feature mapping, explicit zero-filling, and per-dataset coverage from 15% to 93%. A leave-one-dataset-out (LODO) protocol makes the generalisation gap precisely measurable: all five evaluated architectures achieve mean LODO F1 between 0.39 and 0.47, and we establish the first community generalisation baseline at mean LODO F1 = 0.5577, a result that shifts the agenda from single-benchmark optimisation toward cross-environment generalisation. We propose TCH-Net, a multi-branch network fusing a three-path Temporal branch (residual convolutional-BiGRU, stride-downsampled BiGRU, pre-LayerNorm Transformer), a provenance-conditioned Contextual branch, and a Statistical branch via Cross-Branch Gated Attention Fusion (CB-GAF) with learnable sigmoid gates for dynamic feature-wise mixing. Across five random seeds, TCH-Net achieves F1 = 0.8296 +/- 0.0028, AUC = 0.9380 +/- 0.0025, and MCC = 0.6972 +/- 0.0056, outperforming all twelve baselines (p 0.05, Wilcoxon) and recording the highest LODO F1 overall. BRIDGE and the full pipeline are at this https URL.

[LG-12] Learning Discrete Diffusion of Graphs via Free-Energy Gradient Flows

链接: https://arxiv.org/abs/2604.11311
作者: Dario Rancati,Jan Maas,Francesco Locatello
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Diffusion-based models on continuous spaces have seen substantial recent progress through the mathematical framework of gradient flows, leveraging the Wasserstein-2 ( W_2 ) metric via the Jordan-Kinderlehrer-Otto (JKO) scheme. Despite the increasing popularity of diffusion models on discrete spaces using continuous-time Markov chains, a parallel theoretical framework based on gradient flows has remained elusive due to intrinsic challenges in translating the W_2 distance directly into these settings. In this work, we propose the first computational approach addressing these challenges, leveraging an appropriate metric W_K on the simplex of probability distributions, which enables us to interpret widely used discrete diffusion paths, such as the discrete heat equation, as gradient flows of specific free-energy functionals. Through this theoretical insight, we introduce a novel methodology for learning diffusion dynamics over discrete spaces, which recovers the underlying functional directly by leveraging first-order optimality conditions for the JKO scheme. The resulting method optimizes a simple quadratic loss, trains extremely fast, does not require individual sample trajectories, and only needs a numerical preprocessing computing W_K -geodesics. We validate our method through extensive numerical experiments on synthetic data, showing that we can recover the underlying functional for a variety of graph classes.

[LG-13] Beyond Fixed False Discovery Rates: Post-Hoc Conformal Selection with E-Variables

链接: https://arxiv.org/abs/2604.11305
作者: Meiyi Zhu,Osvaldo Simeone
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注: 32 pages, 29 figures

点击查看摘要

Abstract:Conformal selection (CS) uses calibration data to identify test inputs whose unobserved outcomes are likely to satisfy a pre-specified minimal quality requirement, while controlling the false discovery rate (FDR). Existing methods fix the target FDR level before observing data, which prevents the user from adapting the balance between number of selected test inputs and FDR to downstream needs and constraints based on the available data. For example, in genomics or neuroimaging, researchers often inspect the distribution of test statistics, and decide how aggressively to pursue candidates based on observed evidence strength and available follow-up resources. To address this limitation, we introduce post-hoc CS (PH-CS), which generates a path of candidate selection sets, each paired with a data-driven false discovery proportion (FDP) estimate. PH-CS lets the user select any operating point on this path by maximizing a user-specified utility, arbitrarily balancing selection size and FDR. Building on conformal e-variables and the e-Benjamini-Hochberg (e-BH) procedure, PH-CS is proved to provide a finite-sample post-hoc reliability guarantee whereby the ratio between estimated FDP level and true FDP is, on average, upper bounded by 1 , so that the average estimated FDP is, to first order, a valid upper bound on the true FDR. PH-CS is extended to control quality defined in terms of a general risk. Experiments on synthetic and real-world datasets demonstrate that, unlike CS, PH-CS can consistently satisfy user-imposed utility constraints while producing reliable FDP estimates and maintaining competitive FDR control.

[LG-14] Representation-Aligned Multi-Scale Personalization for Federated Learning

链接: https://arxiv.org/abs/2604.11278
作者: Wenfei Liang,Wee Peng Tay
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In federated learning (FL), accommodating clients with diverse resource constraints remains a significant challenge. A widely adopted approach is to use a shared full-size model, from which each client extracts a submodel aligned with its computational budget. However, regardless of the specific scoring strategy, these methods rely on the same global backbone, limiting both structural diversity and representational adaptation across clients. This paper presents FRAMP, a unified framework for personalized and resource-adaptive federated learning. Instead of relying on a fixed global model, FRAMP generates client-specific models from compact client descriptors, enabling fine-grained adaptation to both data characteristics and computational budgets. Each client trains a tailored lightweight submodel and aligns its learned representation with others to maintain global semantic consistency. Extensive experiments on vision and graph benchmarks demonstrate that FRAMP enhances generalization and adaptivity across a wide range of client settings.

[LG-15] Sheaf Diffusion with Adaptive Local Structure for Spatio-Temporal Forecasting

链接: https://arxiv.org/abs/2604.11275
作者: Abeer Mostafa,Raneen Younis,Zahra Ahmadi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spatio-temporal systems often exhibit highly heterogeneous and non-intuitive responses to localized disruptions, limiting the effectiveness of conventional message passing approaches in modeling higher-order interactions under local heterogeneity. This paper reformulates spatio-temporal forecasting as the problem of learning information flow over locally structured spaces, rather than propagating globally aligned node representations. We introduce a spatio-temporal sheaf diffusion graph neural network (ST-Sheaf GNN) that embeds graph topology into sheaf-theoretic vector spaces connected by learned linear restriction maps. Unlike prior work that relies on static or globally shared transformations, our model learns dynamic restriction maps that evolve over time and adapt to local spatio-temporal patterns to enable substantially more expressive interactions. By explicitly modeling latent local structure, the proposed framework efficiently mitigates the oversmoothing phenomenon in deep GNN architectures. We evaluate our framework on a diverse set of real-world spatio-temporal forecasting benchmarks spanning multiple domains. Experimental results demonstrate state-of-the-art performance, highlighting the effectiveness of sheaf-theoretic topological representations as a powerful foundation for spatio-temporal graph learning. The code is available at: this https URL.

[LG-16] Unified Graph Prompt Learning via Low-Rank Graph Message Prompting

链接: https://arxiv.org/abs/2604.11257
作者: Beibei Wang,Bo Jiang,Ziyan Zhang,Jin Tang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Data Prompt (GDP), which introduces specific prompts in graph data for efficiently adapting pre-trained GNNs, has become a mainstream approach to graph fine-tuning learning problem. However, existing GDPs have been respectively designed for distinct graph component (e.g., node features, edge features, edge weights) and thus operate within limited prompt spaces for graph data. To the best of our knowledge, it still lacks a unified prompter suitable for targeting all graph components simultaneously. To address this challenge, in this paper, we first propose to reinterpret a wide range of existing GDPs from an aspect of Graph Message Prompt (GMP) paradigm. Based on GMP, we then introduce a novel graph prompt learning approach, termed Low-Rank GMP (LR-GMP), which leverages low-rank prompt representation to achieve an effective and compact graph prompt learning. Unlike traditional GDPs that target distinct graph components separately, LR-GMP concurrently performs prompting on all graph components in a unified manner, thereby achieving significantly superior generalization and robustness on diverse downstream tasks. Extensive experiments on several graph benchmark datasets demonstrate the effectiveness and advantages of our proposed LR-GMP.

[LG-17] CapBench: A Multi-PDK Dataset for Machine-Learning-Based Post-Layout Capacitance Extraction

链接: https://arxiv.org/abs/2604.11202
作者: Hector R. Rodriguez,Jiechen Huang,Wenjian Yu
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Accepted at the 63rd ACM/IEEE Design Automation Conference (DAC '26). 7 pages, 5 figures

点击查看摘要

Abstract:We present CapBench, a fully reproducible, multi-PDK dataset for capacitance extraction. The dataset is derived from open-source designs, including single-core CPUs, systems-on-chip, and media accelerators. All designs are fully placed and routed using 14 independent OpenROAD flow runs spanning three technology nodes: ASAP7, NanGate45, and Sky130HD. From these layouts, we extract 61,855 3D windows across three size tiers to enable transfer learning and scalability studies. High-fidelity capacitance labels are generated using RWCap, a state-of-the-art random-walk solver, and validated against the industry-standard Raphael, achieving a mean absolute error of 0.64% for total capacitance. Each window is pre-processed into density maps, graph representations, and point clouds. We evaluate 10 machine learning architectures that illustrate dataset usage and serve as baselines, including convolutional neural networks (CNNs), point cloud transformers, and graph neural networks (GNNs). CNNs demonstrate the lowest errors (1.75%), while GNNs are up to 41.4x faster but exhibit larger errors (10.2%), illustrating a clear accuracy-speed trade-off. Code and dataset are available at this https URL.

[LG-18] owards Situation-aware State Modeling for Air Traffic Flow Prediction

链接: https://arxiv.org/abs/2604.11198
作者: Anqi Liu,Bin Wang,Jiangtao Zhao,Dechuan Ma,Guiyuan Jiang,Feng Hong,Yanwei Yu,Tianrui Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate air traffic prediction in the terminal airspace (TA) is pivotal for proactive air traffic management (ATM). However, existing data-driven approaches predominantly rely on time series-based forecasting paradigms, which inherently overlook critical aircraft state information, such as real-time kinematics and proximity to airspace boundaries. To address this limitation, we propose \textitAeroSense, a direct state-to-flow modeling framework for air traffic prediction. Unlike classical time series-based methods that first aggregate aircraft trajectories into macroscopic flow sequences before modeling, AeroSense explicitly represents the real-time airspace situation as \textita dynamic set of aircraft states, enabling the direct processing of a variable number of aircraft instead of time series as inputs. Specifically, we introduce a situation-aware state representation that enables AeroSense to sense the instantaneous terminal airspace situation directly from microscopic aircraft states. Furthermore, we design a model architecture that incorporates masked self-attention to capture inter-aircraft interactions, together with two decoupled prediction heads to model heterogeneous flow dynamics across two key functional areas of the TA. Extensive experiments on a large-scale real-world airport dataset demonstrate that AeroSense consistently achieves state-of-the-art performance, validating that direct modeling of microscopic aircraft states yields substantially higher predictive fidelity than time series-based baselines. Moreover, the proposed framework exhibits superior robustness during peak traffic periods, achieves Pareto-optimal performance under dayparting multi-object evaluation, and provides meaningful interpretability through attention-based visualizations.

[LG-19] Gradient-Variation Regret Bounds for Unconstrained Online Learning

链接: https://arxiv.org/abs/2604.11151
作者: Yuheng Zhao,Andrew Jacobsen,Nicolò Cesa-Bianchi,Peng Zhao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We develop parameter-free algorithms for unconstrained online learning with regret guarantees that scale with the gradient variation V_T(u) = \sum_t=2^T |\nabla f_t(u)-\nabla f_t-1(u)|^2 . For L -smooth convex loss, we provide fully-adaptive algorithms achieving regret of order \widetildeO(|u|\sqrtV_T(u) + L|u|^2+G^4) without requiring prior knowledge of comparator norm |u| , Lipschitz constant G , or smoothness L . The update in each round can be computed efficiently via a closed-form expression. Our results extend to dynamic regret and find immediate implications to the stochastically-extended adversarial (SEA) model, which significantly improves upon the previous best-known result [Wang et al., 2025].

[LG-20] A Full Compression Pipeline for Green Federated Learning in Communication-Constrained Environments ICML

链接: https://arxiv.org/abs/2604.11146
作者: Elouan Colybes,Shririn Salehi,Anke Schmeink
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: This work was accepted at IEEE International Conference on Machine Learning for Communication and Networking (ICMLCN), 2026

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training across distributed clients without sharing raw data, thereby preserving privacy. However, FL often suffers from significant communication and computational overhead, limiting its scalability and sustainability. In this work, we introduce a Full Compression Pipeline (FCP) for FL in communication-constrained environments. FCP integrates three complementary deep compression techniques (pruning, quantization, and Huffman encoding) into a unified end-to-end framework. By compressing local models and communication payloads, FCP substantially reduces transmission costs and resource consumption while maintaining competitive accuracy. To quantify its impact, we develop an evaluation framework that captures both communication and computation overheads as a unified model cost, allowing a holistic assessment of efficiency trade-offs. The pipeline is evaluated in an independent and identically distributed (IID) and non-IID data setting. In one representative scenario, training a ResNet-12 model on the CIFAR-10 dataset with ten clients and a 2 Mbps bandwidth, the FCP achieves more than 11 \times reduction in model size, with only a 2% drop in accuracy compared to the uncompressed baseline. This results in an FL training that is more than 60% faster.

[LG-21] Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)

链接: https://arxiv.org/abs/2604.11141
作者: Chenhao Fang,Jordi Mola,Mark Harman,Jason Nawrocki,Vaibhav Shrivastava,Yue Cheng,Jay Minesh Shah,Katayoun Zand,Mansi Tripathi,Arya Pudota,Matthew Becker,Hervé Robert,Abhishek Gulati
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Although LLMs drive automation, it is critical to ensure immense consideration for high-stakes enterprise workflows such as those involving legal matters, risk management, and privacy compliance. For Meta, and other organizations like ours, a single hallucinated clause in such high stakes workflows risks material consequences. We show that by framing hallucination mitigation as a Minimum Bayes Risk (MBR) problem, we can dramatically reduce this risk. Specifically, we introduce a Hybrid Utility MBR (HUMBR) framework that synthesizes semantic embedding similarity with lexical precision to identify consensus without ground-truth references, for which we derive rigorous error bounds. We complement this theoretical analysis with a comprehensive empirical evaluation on widely-used public benchmark suites (TruthfulQA and LegalBench) and also real world data from Meta production deployment. The results from our empirical study show that MBR significantly outperforms standard Universal Self-Consistency. Notably, 81% of the pipeline’s suggestions were preferred over human-crafted ground truth, and critical recall failures were virtually eliminated.

[LG-22] AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

链接: https://arxiv.org/abs/2604.11135
作者: Liaoyuan Fan,Zetian Xu,Chen Cao,Wenyao Zhang,Mingqi Yuan,Jiayu Chen
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pretrained video generation models provide strong priors for robot control, but existing unified world action models still struggle to decode reliable actions without substantial robot-specific training. We attribute this limitation to a structural mismatch: while video models capture how scenes evolve, action generation requires explicit reasoning about where to interact and the underlying manipulation intent. We introduce AIM, an intent-aware unified world action model that bridges this gap via an explicit spatial interface. Instead of decoding actions directly from future visual representations, AIM predicts an aligned spatial value map that encodes task-relevant interaction structure, enabling a control-oriented abstraction of future dynamics. Built on a pretrained video generation model, AIM jointly models future observations and value maps within a shared mixture-of-transformers architecture. It employs intent-causal attention to route future information to the action branch exclusively through the value representation. We further propose a self-distillation reinforcement learning stage that freezes the video and value branches and optimizes only the action head using dense rewards derived from projected value-map responses together with sparse task-level signals. To support training and evaluation, we construct a simulation dataset of 30K manipulation trajectories with synchronized multi-view observations, actions, and value-map annotations. Experiments on RoboTwin 2.0 benchmark show that AIM achieves a 94.0% average success rate, significantly outperforming prior unified world action baselines. Notably, the improvement is more pronounced in long-horizon and contact-sensitive manipulation tasks, demonstrating the effectiveness of explicit spatial-intent modeling as a bridge between visual world modeling and robot control.

[LG-23] Distributionally Robust K-Means Clustering

链接: https://arxiv.org/abs/2604.11118
作者: Vikrant Malik,Taylan Kargin,Babak Hassibi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:K-means clustering is a workhorse of unsupervised learning, but it is notoriously brittle to outliers, distribution shifts, and limited sample sizes. Viewing k-means as Lloyd–Max quantization of the empirical distribution, we develop a distributionally robust variant that protects against such pathologies. We posit that the unknown population distribution lies within a Wasserstein-2 ball around the empirical distribution. In this setting, one seeks cluster centers that minimize the worst-case expected squared distance over this ambiguity set, leading to a minimax formulation. A tractable dual yields a soft-clustering scheme that replaces hard assignments with smoothly weighted ones. We propose an efficient block coordinate descent algorithm with provable monotonic decrease and local linear convergence. Experiments on standard benchmarks and large-scale synthetic data demonstrate substantial gains in outlier detection and robustness to noise.

[LG-24] CausalGaze: Unveiling Hallucinations via Counterfactual Graph Intervention in Large Language Models ACL2026

链接: https://arxiv.org/abs/2604.11087
作者: Linggang Kong,Lei Wu,Yunlong Zhang,Xiaofeng Zhong,Zhen Wang,Yongjie Wang,Yao Pan
类目: Machine Learning (cs.LG)
*备注: Accepted as ACL2026 Findings

点击查看摘要

Abstract:Despite the groundbreaking advancements made by large language models (LLMs), hallucination remains a critical bottleneck for their deployment in high-stakes domains. Existing classification-based methods mainly rely on static and passive signals from internal states, which often captures the noise and spurious correlations, while overlooking the underlying causal mechanisms. To address this limitation, we shift the paradigm from passive observation to active intervention by introducing CausalGaze, a novel hallucination detection framework based on structural causal models (SCMs). CausalGaze models LLMs’ internal states as dynamic causal graphs and employs counterfactual interventions to disentangle causal reasoning paths from incidental noise, thereby enhancing model interpretability. Extensive experiments across four datasets and three widely used LLMs demonstrate the effectiveness of CausalGaze, especially achieving over 5.2% improvement in AUROC on the TruthfulQA dataset compared to state-of-the-art baselines.

[LG-25] Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees

链接: https://arxiv.org/abs/2604.11001
作者: Zhuolun Dong,Junyu Cao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have been widely adopted due to their great performance across a wide range of applications. ChatGPT and Gemini now serve hundreds of millions of active users and handle billions of user requests per day, which puts optimizing LLM inference into the spotlight. A key challenge in LLM inference is that decode lengths are unknown. The memory usage for each request grows with generated tokens, which may lead to overflow and cause system instability. To address this concern, we propose a simple flow-control framework that controls the rate at which prompts join the active set. We derive a necessary condition that any stable system must satisfy and establish sufficient conditions under which our algorithm provably achieves stability. Experiments show that, compared to commonly used strategies in practice, our approach achieves higher token and request throughput, lower average and tail latency, and more stable KV cache utilization.

[LG-26] racking High-order Evolutions via Cascading Low-rank Fitting

链接: https://arxiv.org/abs/2604.10980
作者: Zhao Song
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models have become the de facto standard for modern visual generation, including well-established frameworks such as latent diffusion and flow matching. Recently, modeling high-order dynamics has emerged as a promising frontier in generative modeling. Rather than only learning the first-order velocity field that transports random noise to a target data distribution, these approaches simultaneously learn higher-order derivatives, such as acceleration and jerk, yielding a diverse family of higher-order diffusion variants. To represent higher-order derivatives, naive approaches instantiate separate neural networks for each order, which scales the parameter space linearly with the derivative order. To overcome this computational bottleneck, we introduce cascading low-rank fitting, an ordinary differential equation inspired method that approximates successive derivatives by applying a shared base function augmented with sequentially accumulated low-rank components. Theoretically, we analyze the rank dynamics of these successive matrix differences. We prove that if the initial difference is linearly decomposable, the generic ranks of high-order derivatives are guaranteed to be monotonically non-increasing. Conversely, we demonstrate that without this structural assumption, the General Leibniz Rule allows ranks to strictly increase. Furthermore, we establish that under specific conditions, the sequence of derivative ranks can be designed to form any arbitrary permutation. Finally, we present a straightforward algorithm to efficiently compute the proposed cascading low-rank fitting. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2604.10980 [cs.LG] (or arXiv:2604.10980v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.10980 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-27] Robust Adversarial Policy Optimization Under Dynamics Uncertainty

链接: https://arxiv.org/abs/2604.10974
作者: Mintae Kim,Koushil Sreenath
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 33 pages, 8 figures

点击查看摘要

Abstract:Reinforcement learning (RL) policies often fail under dynamics that differ from training, a gap not fully addressed by domain randomization or existing adversarial RL methods. Distributionally robust RL provides a formal remedy but still relies on surrogate adversaries to approximate intractable primal problems, leaving blind spots that potentially cause instability and over-conservatism. We propose a dual formulation that directly exposes the robustness-performance trade-off. At the trajectory level, a temperature parameter from the dual problem is approximated with an adversarial network, yielding efficient and stable worst-case rollouts within a divergence bound. At the model level, we employ Boltzmann reweighting over dynamics ensembles, focusing on more adverse environments to the current policy rather than uniform sampling. The two components act independently and complement each other: trajectory-level steering ensures robust rollouts, while model-level sampling provides policy-sensitive coverage of adverse dynamics. The resulting framework, robust adversarial policy optimization (RAPO) outperforms robust RL baselines, improving resilience to uncertainty and generalization to out-of-distribution dynamics while maintaining dual tractability.

[LG-28] Learning to Test: Physics-Informed Representation for Dynamical Instability Detection

链接: https://arxiv.org/abs/2604.10967
作者: Minxing Zheng,Zewei Deng,Liyan Xie,Shixiang Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many safety-critical scientific and engineering systems evolve according to differential-algebraic equations (DAEs), where dynamical behavior is constrained by physical laws and admissibility conditions. In practice, these systems operate under stochastically varying environmental inputs, so stability is not a static property but must be reassessed as the context distribution shifts. Repeated large-scale DAE simulation, however, is computationally prohibitive in high-dimensional or real-time settings. This paper proposes a test-oriented learning framework for stability assessment under distribution shift. Rather than re-estimating physical parameters or repeatedly solving the underlying DAE, we learn a physics-informed latent representation of contextual variables that captures stability-relevant structure and is regularized toward a tractable reference distribution. Trained on baseline data from a certified safe regime, the learned representation enables deployment-time safety monitoring to be formulated as a distributional hypothesis test in latent space, with controlled Type I error. By integrating neural dynamical surrogates, uncertainty-aware calibration, and uniformity-based testing, our approach provides a scalable and statistically grounded method for detecting instability risk in stochastic constrained dynamical systems without repeated simulation.

[LG-29] Hypergraph Neural Diffusion: A PDE-Inspired Framework for Hypergraph Message Passing

链接: https://arxiv.org/abs/2604.10955
作者: Zhiheng Zhou,Mengyao Zhou,Xixun Lin,Xingqin Qi,Guiying Yan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hypergraph neural networks (HGNNs) have shown remarkable potential in modeling high-order relationships that naturally arise in many real-world data domains. However, existing HGNNs often suffer from shallow propagation, oversmoothing, and limited adaptability to complex hypergraph structures. In this paper, we propose Hypergraph Neural Diffusion (HND), a novel framework that unifies nonlinear diffusion equations with neural message passing on hypergraphs. HND is grounded in a continuous-time hypergraph diffusion equation, formulated via hypergraph gradient and divergence operators, and modulated by a learnable, structure-aware coefficient matrix over hyperedge-node pairs. This partial differential equation (PDE) based formulation provides a physically interpretable view of hypergraph learning, where feature propagation is understood as an anisotropic diffusion process governed by local inconsistency and adaptive diffusion coefficient. From this perspective, neural message passing becomes a discretized gradient flow that progressively minimizes a diffusion energy functional. We derive rigorous theoretical guarantees, including energy dissipation, solution boundedness via a discrete maximum principle, and stability under explicit and implicit numerical schemes. The HND framework supports a variety of integration strategies such as non-adaptive-step (like Runge-Kutta) and adaptive-step solvers, enabling the construction of deep, stable, and interpretable architectures. Extensive experiments on benchmark datasets demonstrate that HND achieves competitive performance. Our results highlight the power of PDE-inspired design in enhancing the stability, expressivity, and interpretability of hypergraph learning.

[LG-30] UniPROT: Uniform Prototype Selection via Partial Optimal Transport with Submodular Guarantees AISTATS2026

链接: https://arxiv.org/abs/2604.10952
作者: Prateek Chanda,Prayas Agrawal,Karthik S. Gurumoorthy,Ganesh Ramakrishnan,Bamdev Mishra,Pratik Jawanpuria
类目: Machine Learning (cs.LG)
*备注: 25 pages, 31 figures. Accepted as a poster at AISTATS 2026

点击查看摘要

Abstract:Selecting prototypical examples from a source distribution to represent a target data distribution is a fundamental problem in machine learning. Existing subset selection methods often rely on implicit importance scores, which can be skewed towards majority classes and lead to low-quality prototypes for minority classes. We present \methodprop , a novel subset selection framework that minimizes the optimal transport (OT) distance between a uniformly weighted prototypical distribution and the target distribution. While intuitive, this formulation leads to a cardinality-constrained maximization of a \emphsuper-additive objective, which is generally intractable to approximate efficiently. To address this, we propose a principled reformulation of the OT marginal constraints, yielding a partial optimal transport-based submodular objective. We prove that this reformulation enables a greedy algorithm with a (1-1/e) approximation guarantee relative to the original super-additive maximization problem. Empirically, we showcase that enforcing uniform prototype weights in UniPROT consistently improves minority-class representation in imbalanced classification benchmarks without compromising majority-class accuracy. In both finetuning and pretraining regimes for large language models under domain imbalance, UniPROT enforces uniform source contributions, yielding robust performance gains. Our results establish UniPROT as a scalable, theoretically grounded solution for uniform-weighted prototype selection. Our code is publicly available at GitHub\footnoteCode: this https URL

[LG-31] Learning to Adapt: In-Context Learning Beyond Stationarity

链接: https://arxiv.org/abs/2604.10946
作者: Zhen Qin,Jiachen Jiang,Zhihui Zhu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Transformer models have become foundational across a wide range of scientific and engineering domains due to their strong empirical performance. A key capability underlying their success is in-context learning (ICL): when presented with a short prompt from an unseen task, transformers can perform per-token and next-token predictions without any parameter updates. Recent theoretical efforts have begun to uncover the mechanisms behind this phenomenon, particularly in supervised regression settings. However, these analyses predominantly assume stationary task distributions, which overlook a broad class of real-world scenarios where the target function varies over time. In this work, we bridge this gap by providing a theoretical analysis of ICL under non-stationary regression problems. We study how the gated linear attention (GLA) mechanism adapts to evolving input-output relationships and rigorously characterize its advantages over standard linear attention in this dynamic setting. To model non-stationarity, we adopt a first-order autoregressive process and show that GLA achieves lower training and testing errors by adaptively modulating the influence of past inputs – effectively implementing a learnable recency bias. Our theoretical findings are further supported by empirical results, which validate the benefits of gating mechanisms in non-stationary ICL tasks.

[LG-32] Generative Design for Direct-to-Chip Liquid Cooling for Data Centers

链接: https://arxiv.org/abs/2604.10941
作者: Zheng Liu
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 5 pages, 2 figures

点击查看摘要

Abstract:Rapid growth in artificial intelligence (AI) workloads is driving up data center power densities, increasing the need for advanced thermal management. Direct-to-chip liquid cooling can remove heat efficiently at the source, but many cold plate channel layouts remain heuristic and are not optimized for the strongly non-uniform temperature distribution of modern heterogeneous packages. This work presents a generative design framework for synthesizing cooling channel geometries for the NVIDIA GB200 Grace Blackwell Superchip. A physics-based finite-difference thermal model provides rapid steady-state temperature predictions and supplies spatial thermal feedback to a constrained reaction-diffusion process that generates novel channel topologies while enforcing inlet/outlet and component constraints. By iterating channel generation and thermal evaluation in a closed loop, the method naturally redistributes cooling capacity toward high-power regions and suppresses hot-spot formation. Compared with a baseline parallel channel design, the resulting channels achieve more than a 5 degree Celsius reduction in average temperature and over 35 degree Celsius reduction in maximum temperature. Overall, the results demonstrate that coupling generative algorithms with lightweight physics-based modeling can significantly enhance direct-to-chip liquid cooling performance, supporting more sustainable scaling of AI computing.

[LG-33] ransformers Learn Latent Mixture Models In-Context via Mirror Descent

链接: https://arxiv.org/abs/2604.10848
作者: Francesco D’Angelo,Nicolas Flammarion
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sequence modelling requires determining which past tokens are causally relevant from the context and their importance: a process inherent to the attention layers in transformers, yet whose underlying learned mechanisms remain poorly understood. In this work, we formalize the task of estimating token importance as an in-context learning problem by introducing a framework based on Mixture of Transition Distributions, where a latent variable determines the influence of past tokens on the next. The distribution over this latent variable is parameterized by unobserved mixture weights that transformers must learn in-context. We demonstrate that transformers can implement Mirror Descent to learn these weights from the context. Specifically, we give an explicit construction of a three-layer transformer that exactly implements one step of Mirror Descent and prove that the resulting estimator is a first-order approximation of the Bayes-optimal predictor. Corroborating our construction and its learnability via gradient descent, we empirically show that transformers trained from scratch learn solutions consistent with our theory: their predictive distributions, attention patterns, and learned transition matrix closely match the construction, while deeper models achieve performance comparable to multi-step Mirror Descent.

[LG-34] Slithering Through Gaps: Capturing Discrete Isolated Modes via Logistic Bridging

链接: https://arxiv.org/abs/2604.10821
作者: Pinaki Mohanty,Ruqi Zhang
类目: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:High-dimensional and complex discrete distributions often exhibit multimodal behavior due to inherent discontinuities, posing significant challenges for sampling. Gradient-based discrete samplers, while effective, frequently become trapped in local modes when confronted with rugged or disconnected energy landscapes. This limits their ability to achieve adequate mixing and convergence in high-dimensional multimodal discrete spaces. To address these challenges, we propose \emphHyperbolic Secant-squared Gibbs-Sampling (HiSS), a novel family of sampling algorithms that integrates a \emphMetropolis-within-Gibbs framework to enhance mixing efficiency. HiSS leverages a logistic convolution kernel to couple the discrete sampling variable with the continuous auxiliary variable in a joint distribution. This design allows the auxiliary variable to encapsulate the true target distribution while facilitating easy transitions between distant and disconnected modes. We provide theoretical guarantees of convergence and demonstrate empirically that HiSS outperforms many popular alternatives on a wide variety of tasks, including Ising models, binary neural networks, and combinatorial optimization.

[LG-35] Differentially Private Verification of Distribution Properties

链接: https://arxiv.org/abs/2604.10819
作者: Elbert Du,Cynthia Dwork,Pranay Tankala,Linjun Zhang
类目: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A recent line of work initiated by Chiesa and Gur and further developed by Herman and Rothblum investigates the sample and communication complexity of verifying properties of distributions with the assistance of a powerful, knowledgeable, but untrusted prover. In this work, we initiate the study of differentially private (DP) distribution property testing. After all, if we do not trust the prover to help us with verification, why should we trust it with our sensitive sample? We map a landscape of DP prover-aided proofs of properties of distributions. In the non-private case it is known that one-round (two message) private-coin protocols can have substantially lower complexity than public-coin AM protocols, but in the private case, the possibility for improvement depends on the parameter regime and privacy model. Drawing on connections to replicability and techniques for amplification, we show: (1) There exists a reduction from any one-round (\varepsilon,\delta) -DP private-coin interactive proof to a one-round public-coin DP interactive proof with the same privacy parameters, for the parameter regime \varepsilon=O(1/\sqrtn) and \delta=O(1/n^5/2) , and with the same sample and communication complexities. (2) If the verifier’s message in the private-coin interactive proof is O(1/\sqrt\log n) locally DP – a far more relaxed privacy parameter regime in a different model – then applying one additional transformation again yields a one-round public-coin protocol with the same privacy bound and the same sample and computational complexities. (3) However, when the privacy guarantee is very relaxed ( \varepsilon\in\Omega(\log n) ), private coins indeed reduce complexity. We also obtain a Merlin-Arthur (one-message) proof for privately testing whether samples are drawn from a product distribution, and prove that its sample complexity is optimal.

[LG-36] Online Covariance Estimation in Averag ed SGD: Improved Batch-Mean Rates and Minimax Optimality via Trajectory Regression

链接: https://arxiv.org/abs/2604.10814
作者: Yijin Ni,Xiaoming Huo
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We study online covariance matrix estimation for Polyak–Ruppert averaged stochastic gradient descent (SGD). The online batch-means estimator of Zhu, Chen and Wu (2023) achieves an operator-norm convergence rate of O(n^-(1-\alpha)/4) , which yields O(n^-1/8) at the optimal learning-rate exponent \alpha \rightarrow 1/2^+ . A rigorous per-block bias analysis reveals that re-tuning the block-growth parameter improves the batch-means rate to O(n^-(1-\alpha)/3) , achieving O(n^-1/6) . The modified estimator requires no Hessian access and preserves O(d^2) memory. We provide a complete error decomposition into variance, stationarity bias, and nonlinearity bias components. A weighted-averaging variant that avoids hard truncation is also discussed. We establish the minimax rate \Theta(n^-(1-\alpha)/2) for Hessian-free covariance estimation from the SGD trajectory: a Le Cam lower bound gives \Omega(n^-(1-\alpha)/2) , and a trajectory-regression estimator–which estimates the Hessian by regressing SGD increments on iterates–achieves O(n^-(1-\alpha)/2) , matching the lower bound. The construction reveals that the bottleneck is the sublinear accumulation of information about the Hessian from the SGD drift.

[LG-37] PokeRL: Reinforcement Learning for Pokemon Red

链接: https://arxiv.org/abs/2604.10812
作者: Dheeraj Mudireddy,Sai Patibandla
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pokemon Red is a long-horizon JRPG with sparse rewards, partial observability, and quirky control mechanics that make it a challenging benchmark for reinforcement learning. While recent work has shown that PPO agents can clear the first two gyms using heavy reward shaping and engineered observations, training remains brittle in practice, with agents often degenerating into action loops, menu spam, or unproductive wandering. In this paper, we present PokeRL, a modular system that trains deep reinforcement learning agents to complete early game tasks in Pokemon Red, including exiting the player’s house, exploring Pallet Town to reach tall grass, and winning the first rival battle. Our main contributions are a loop-aware environment wrapper around the PyBoy emulator with map masking, a multi-layer anti-loop and anti-spam mechanism, and a dense hierarchical reward design. We argue that practical systems like PokeRL, which explicitly model failure modes such as loops and spam, are a necessary intermediate step between toy benchmarks and full Pokemon League champion agents. Code is available at this https URL

[LG-38] INCRT: An Incremental Transformer That Determines Its Own Architecture

链接: https://arxiv.org/abs/2604.10703
作者: Giansalvo Cirrincione
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 19 pages, 6 figures, 5 theorems. Submitted to Neurocomputing (Elsevier)

点击查看摘要

Abstract:Transformer architectures are designed by trial and error: the number of attention heads, the depth, and the head size are fixed before training begins, with no mathematical principle to guide the choice. The result is systematic structural redundancy – between half and four-fifths of all heads in a trained model can be removed without measurable loss – because the architecture allocates capacity without reference to the actual requirements of the this http URL paper introduces INCRT (Incremental Transformer), an architecture that determines its own structure during training. Starting from a single head, INCRT adds one attention head at a time whenever its current configuration is provably insufficient, and prunes heads that have become redundant. Each growth decision is driven by a single, online-computable geometric quantity derived from the task’s directional structure, requiring no separate validation phase and no hand-tuned schedule. Two theorems form the theoretical backbone. The first (homeostatic convergence) establishes that the system always reaches a finite stopping configuration that is simultaneously minimal (no redundant heads) and sufficient (no uncaptured directional energy above the threshold). The second (compressed-sensing analogy) provides a geometric upper bound on the number of heads that this configuration can contain, as a function of the spectral complexity of the task. Experiments on SARS-CoV-2 variant classification and SST-2 sentiment analysis confirm both results: the predicted and observed head counts agree within 12% across all benchmarks, and the final architectures match or exceed BERT-base on distribution-specific tasks while using between three and seven times fewer parameters and no pre-training. Comments: 19 pages, 6 figures, 5 theorems. Submitted to Neurocomputing (Elsevier) Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE) Cite as: arXiv:2604.10703 [cs.LG] (or arXiv:2604.10703v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.10703 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-39] Communication-Efficient Gluon in Federated Learning

链接: https://arxiv.org/abs/2604.10689
作者: Xun Qian,Alexander Gaponov,Grigory Malinovsky,Peter Richtárik
类目: Machine Learning (cs.LG)
*备注: 48 pages, 8 figures

点击查看摘要

Abstract:Recent developments have shown that Muon-type optimizers based on linear minimization oracles (LMOs) over non-Euclidean norm balls have the potential to get superior practical performance than Adam-type methods in the training of large language models. Since large-scale neural networks are trained across massive machines, communication cost becomes the bottleneck. To address this bottleneck, we investigate Gluon, which is an extension of Muon under the more general layer-wise (L^0, L^1) -smooth setting, with both unbiased and contraction compressors. In order to reduce the compression error, we employ the variance reduced technique in SARAH in our compressed methods. The convergence rates and improved communication cost are achieved under certain conditions. As a byproduct, a new variance reduced algorithm with faster convergence rate than Gluon is obtained. We also incorporate momentum variance reduction (MVR) to these compressed algorithms and comparable communication cost is derived under weaker conditions when L_i^1 \neq 0 . Finally, several numerical experiments are conducted to verify the superior performance of our compressed algorithms in terms of communication cost.

[LG-40] Energy-Efficient Federated Edge Learning For Small-Scale Datasets in Large IoT Networks

链接: https://arxiv.org/abs/2604.10662
作者: Haihui Xie,Wenkun Wen,Shuwu Chen,Zhaogang Shu,Minghua Xia
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 16 pages, 9 figures. To appear in IEEE TWC

点击查看摘要

Abstract:Large-scale Internet of Things (IoT) networks enable intelligent services such as smart cities and autonomous driving, but often face resource constraints. Collecting heterogeneous sensory data, especially in small-scale datasets, is challenging, and independent edge nodes can lead to inefficient resource utilization and reduced learning performance. To address these issues, this paper proposes a collaborative optimization framework for energy-efficient federated edge learning with small-scale datasets. We first derive an expected learning loss to quantify the relationship between the number of training samples and learning objectives. A stochastic online learning algorithm is then designed to adapt to data variations, and a resource optimization problem with a convergence bound is formulated. Finally, an online distributed algorithm efficiently solves large-scale optimization problems with high scalability. Extensive simulations and autonomous navigation case studies with collision avoidance demonstrate that the proposed approach significantly improves learning performance and resource efficiency compared to state-of-the-art benchmarks.

[LG-41] Mitigating Privacy Risk via Forget Set-Free Unlearning

链接: https://arxiv.org/abs/2604.10636
作者: Aviraj Newatia,Michael Cooper,Viet Nguyen,Rahul G. Krishnan
类目: Machine Learning (cs.LG)
*备注: 50 pages, 20 figures, Published at The Fourteenth International Conference on Learning Representations

点击查看摘要

Abstract:Training machine learning models requires the storage of large datasets, which often contain sensitive or private data. Storing data is associated with a number of potential risks which increase over time, such as database breaches and malicious adversaries. Machine unlearning is the study of methods to efficiently remove the influence of training data subsets from previously-trained models. Existing unlearning methods typically require direct access to the “forget set” – the data to be forgotten-and organisations must retain this data for unlearning rather than deleting it immediately upon request, increasing risks associated with the forget set. We introduce partially-blind unlearning – utilizing auxiliary information to unlearn without explicit access to the forget set. We also propose a practical framework Reload, a partially-blind method based on gradient optimization and structured weight sparsification to operationalize partially-blind unlearning. We show that Reload efficiently unlearns, approximating models retrained from scratch, and outperforms several forget set-dependent approaches. On language models, Reload unlearns entities using 0.025% of the retain set and 7% of model weights in 8 minutes on Llama2-7B. In the corrective case, Reload achieves unlearning even when only 10% of corrupted data is identified.

[LG-42] Multimodal Dataset Normalization and Perceptual Validation for Music-Taste Correspondences

链接: https://arxiv.org/abs/2604.10632
作者: Matteo Spanio,Valentina Frezzato,Antonio Rodà
类目: ound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: Submitted to SMC2026

点击查看摘要

Abstract:Collecting large, aligned cross-modal datasets for music-flavor research is difficult because perceptual experiments are costly and small by design. We address this bottleneck through two complementary experiments. The first tests whether audio-flavor correlations, feature-importance rankings, and latent-factor structure transfer from an experimental soundtracks collection (257~tracks with human annotations) to a large FMA-derived corpus ( \sim 49,300 segments with synthetic labels). The second validates computational flavor targets – derived from food chemistry via a reproducible pipeline – against human perception in an online listener study (49~participants, 20~tracks). Results from both experiments converge: the quantitative transfer analysis confirms that cross-modal structure is preserved across supervision regimes, and the perceptual evaluation shows significant alignment between computational targets and listener ratings (permutation p0.0001 , Mantel r=0.45 , Procrustes m^2=0.51 ). Together, these findings support the conclusion that sonic seasoning effects are present in synthetic FMA annotations. We release datasets and companion code to support reproducible cross-modal AI research.

[LG-43] Distributionally Robust PAC-Bayesian Control

链接: https://arxiv.org/abs/2604.10588
作者: Domagoj Herceg,Duarte Antunes
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We present a distributionally robust PAC-Bayesian framework for certifying the performance of learning-based finite-horizon controllers. While existing PAC-Bayes control literature typically assumes bounded losses and matching training and deployment distributions, we explicitly address unbounded losses and environmental distribution shifts (the sim-to-real gap). We achieve this by drawing on two modern lines of research, namely the PAC-Bayes generalization theory and distributionally robust optimization via the type-1 Wasserstein distance. By leveraging the System Level Synthesis (SLS) reparametrization, we derive a sub-Gaussian loss proxy and a bound on the performance loss due to distribution shift. Both are tied directly to the operator norm of the closed-loop map. For linear time-invariant systems, this yields a computationally tractable optimization-based framework together with high-probability safety certificates for deployment in real-world environments that differ from those used in training.

[LG-44] WOODELF-HD: Efficient Background SHAP for High-Depth Decision Trees

链接: https://arxiv.org/abs/2604.10569
作者: Ron Wettenstein,Alexander Nadel,Udi Boker
类目: Machine Learning (cs.LG)
*备注: 15 pages (including 6-page appendix), 9 figures

点击查看摘要

Abstract:Decision-tree ensembles are a cornerstone of predictive modeling, and SHAP is a standard framework for interpreting their predictions. Among its variants, Background SHAP offers high accuracy by modeling missing features using a background dataset. Historically, this approach did not scale well, as the time complexity for explaining n instances using m background samples included an O(mn) component. Recent methods such as Woodelf and PLTreeSHAP reduce this to O(m+n), but introduce a preprocessing bottleneck that grows as 3^D with tree depth D, making them impractical for deep trees. We address this limitation with WoodelfHD, a Woodelf extension that reduces the 3^D factor to 2^D. The key idea is a Strassen-like multiplication scheme that exploits the structure of Woodelf matrices, reducing matrix-vector multiplication from O(k^2) to O(k*log(k)) via a fully vectorized, non-recursive implementation. In addition, we merge path nodes with identical features, reducing cache size and memory usage. When running on standard environments, WoodelfHD enables exact Background SHAP computation for trees with depths up to 21, where previous methods fail due to excessive memory usage. For ensembles of depths 12 and 15, it achieves speedups of 33x and 162x, respectively, over the state-of-the-art.

[LG-45] ReadMOF: Structure-Free Semantic Embeddings from Systematic MOF Nomenclature for Machine Learning

链接: https://arxiv.org/abs/2604.10568
作者: Kewei Zhu,Cameron Wilson,Bartosz Mazur,Yi Li,Ashleigh M. Chester,Peyman Z. Moghadam
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 29 pages, 8 figures

点击查看摘要

Abstract:Systematic chemical names, such as IUPAC-style nomenclature for metal-organic frameworks (MOFs), contain rich structural and compositional information in a standardized textual format. Here we introduce ReadMOF, which is, to our knowledge, the first nomenclature-free machine learning framework that leverages these names to model structure-property relationships without requiring atomic coordinates or connectivity graphs. By employing pretrained language models, ReadMOF converts systematic MOF names from the Cambridge Structural Database (CSD) into vector embeddings that closely represent traditional structure-based descriptors. These embeddings enable applications in materials informatics, including property prediction, similarity retrieval, and clustering, with performance comparable to geometry-dependent methods. When combined with large language models, ReadMOF also establishes chemically meaningful reasoning ability with textual input only. Our results show that structured chemical language, interpreted through modern natural language processing techniques, can provide a scalable, interpretable, and geometry-independent alternative to conventional molecular representations. This approach opens new opportunities for language-driven discovery in materials science.

[LG-46] Heterogeneous Connectivity in Sparse Networks: Fan-in Profiles Gradient Hierarchy and Topological Equilibria

链接: https://arxiv.org/abs/2604.10560
作者: Nikodem Tomczak
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Profiled Sparse Networks (PSN) replace uniform connectivity with deterministic, heterogeneous fan-in profiles defined by continuous, nonlinear functions, creating neurons with both dense and sparse receptive fields. We benchmark PSN across four classification datasets spanning vision and tabular domains, input dimensions from 54 to 784, and network depths of 2–3 hidden layers. At 90% sparsity, all static profiles, including the uniform random baseline, achieve accuracy within 0.2-0.6% of dense baselines on every dataset, demonstrating that heterogeneous connectivity provides no accuracy advantage when hub placement is arbitrary rather than task-aligned. This result holds across sparsity levels (80-99.9%), profile shapes (eight parametric families, lognormal, and power-law), and fan-in coefficients of variation from 0 to 2.5. Internal gradient analysis reveals that structured profiles create a 2-5x gradient concentration at hub neurons compared to the ~1x uniform distribution in random baselines, with the hierarchy strength predicted by fan-in coefficient of variation ( r = 0.93 ). When PSN fan-in distributions are used to initialise RigL dynamic sparse training, lognormal profiles matched to the equilibrium fan-in distribution consistently outperform standard ERK initialisation, with advantages growing on harder tasks, achieving +0.16% on Fashion-MNIST ( p = 0.036 , d = 1.07 ), +0.43% on EMNIST, and +0.49% on Forest Cover. RigL converges to a characteristic fan-in distribution regardless of initialisation. Starting at this equilibrium allows the optimiser to refine weights rather than rearrange topology. Which neurons become hubs matters more than the degree of connectivity variance, i.e., random hub placement provides no advantage, while optimisation-driven placement does.

[LG-47] opology-Aware PAC-Bayesian Generalization Analysis for Graph Neural Networks

链接: https://arxiv.org/abs/2604.10553
作者: Xinping Yi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph neural networks have demonstrated excellent applicability to a wide range of domains, including social networks, biological systems, recommendation systems, and wireless communications. Yet a principled theoretical understanding of their generalization behavior remains limited, particularly for graph classification tasks where complex interactions between model parameters and graph structure play a crucial role. Among existing theoretical tools, PAC-Bayesian norm-based generalization bounds provide a flexible and data-dependent framework; however, current results for GNNs often restrict the exploitation of graph structures. In this work, we propose a topology-aware PAC-Bayesian norm-based generalization framework for graph convolutional networks (GCNs) that extends a previously developed framework to graph-structured models. Our approach reformulates the derivation of generalization bounds as a stochastic optimization problem and introduces sensitivity matrices that measure the response of classification outputs with respect to structured weight perturbations. By imposing different structures on sensitivity matrices from both spatial and spectral perspectives, we derive a family of generalization error bounds with graph structures explicitly embedded. Such bounds could recover existing results as special cases, while yielding bounds that are tighter than state-of-the-art PAC-Bayesian bounds for GNNs. Notably, the proposed framework explicitly integrates graph structural properties into the generalization analysis, enabling a unified inspection of GNN generalization behavior from both spatial aggregation and spectral filtering viewpoints.

[LG-48] FEDBUD: Joint Incentive and Privacy Optimization for Resource-Constrained Federated Learning

链接: https://arxiv.org/abs/2604.10499
作者: Tao Liu,Xuehe Wang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning has become a popular paradigm for privacy protection and edge-based machine learning. However, defending against differential attacks and devising incentive strategies remain significant bottlenecks in this field. Despite recent works on privacy-aware incentive mechanism design for federated learning, few of them consider both data volume and noise level. In this paper, we propose a novel federated learning system called FEDBUD, which combines privacy and economic concerns together by considering the joint influence of data volume and noise level on incentive strategy determination. In this system, the cloud server controls monetary payments to edge nodes, while edge nodes control data volume and noise level that potentially impact the model performance of the cloud server. To determine the mutually optimal strategies for both sides, we model FEDBUD as a two-stage Stackelberg Game and derive the Nash Equilibrium using the mean-field estimator and virtual queue. Experimental results on real-world datasets demonstrate the outstanding performance of FEDBUD.

[LG-49] CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts

链接: https://arxiv.org/abs/2604.10496
作者: Xiangyang Yin,Xingyu Liu,Tianhua Xia,Bo Bao,Vithursan Thangarasa,Valavan Manohararajah,Eric Sather,Sai Qian Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Outliers have emerged as a fundamental bottleneck in preserving accuracy for low-precision large models, particularly within Mixture-of-Experts (MoE) architectures that are increasingly central to large-scale language modeling. Under post-training quantization (PTQ), these outliers induce substantial quantization errors, leading to severe accuracy degradation. While recent rotation-based smoothing techniques alleviate the problem by redistributing outlier magnitudes, residual errors remain and continue to impede reliable low-precision deployment. In this work, we tackle this challenge by introducing \textitCodeQuant, a unified quantization-and-clustering scheme that contains smoothing activation outliers via learnable rotation and absorbing weight outliers into fine-tuned cluster centroids for MoE. This design reduces the influence of extreme values by fitting them within cluster centroids, thereby lowering quantization error while maintaining expressive capacity. Coupled with a dedicated kernel design for GPU and CPU, CodeQuant achieves up to 4.15\times speedup while delivering significantly higher accuracy than state-of-the-art quantization approaches across diverse MoE models. Our results highlight CodeQuant as a promising direction for efficient and accurate deployment of MoE-based large language models under low-precision constraints. Our code is available at this https URL. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2604.10496 [cs.LG] (or arXiv:2604.10496v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.10496 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-50] Exact Finite-Sample Variance Decomposition of Subagging: A Spectral Filtering Perspective

链接: https://arxiv.org/abs/2604.10469
作者: Ye Su,Mingrui Ye,Yining Wang,Jipeng Guo,Yong Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Standard resampling ratios (e.g., \alpha \approx 0.632 ) are widely used as default baselines in ensemble learning for three decades. However, how these ratios interact with a base learner’s intrinsic functional complexity in finite samples lacks a exact mathematical characterization. We leverage the Hoeffding-ANOVA decomposition to derive the first exact, finite-sample variance decomposition for subagging, applicable to any symmetric base learner without requiring asymptotic limits or smoothness assumptions. We establish that subagging operates as a deterministic low-pass spectral filter: it preserves low-order structural signals while attenuating c -th order interaction variance by a geometric factor approaching \alpha^c . This decoupling reveals why default baselines often under-regularize high-capacity interpolators, which instead require smaller \alpha to exponentially suppress spurious high-order noise. To operationalize these insights, we propose a complexity-guided adaptive subsampling algorithm, empirically demonstrating that dynamically calibrating \alpha to the learner’s complexity spectrum consistently improves generalization over static baselines.

[LG-51] Membership Inference Attacks Expose Participation Privacy in ECG Foundation Encoders

链接: https://arxiv.org/abs/2604.10424
作者: Ziyu Wang,Elahe Khatibi,Ankita Sharma,Krishnendu Chakrabarty,Sanaz Rahimi Moosavi,Farshad Firouzi,Amir Rahmani
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Foundation-style ECG encoders pretrained with self-supervised learning are increasingly reused across tasks, institutions, and deployment contexts, often through model-as-a-service interfaces that expose scalar scores or latent representations. While such reuse improves data efficiency and generalization, it raises a participation privacy concern: can an adversary infer whether a specific individual or cohort contributed ECG data to pretraining, even when raw waveforms and diagnostic labels are never disclosed? In connected-health settings, training participation itself may reveal institutional affiliation, study enrollment, or sensitive health context. We present an implementation-grounded audit of membership inference attacks (MIAs) against modern self-supervised ECG foundation encoders, covering contrastive objectives (SimCLR, TS2Vec) and masked reconstruction objectives (CNN- and Transformer-based MAE). We evaluate three realistic attacker interfaces: (i) score-only black-box access to scalar outputs, (ii) adaptive learned attackers that aggregate subject-level statistics across repeated queries, and (iii) embedding-access attackers that probe latent representation geometry. Using a subject-centric protocol with window-to-subject aggregation and calibration at fixed false-positive rates under a cross-dataset auditing setting, we observe heterogeneous and objective-dependent participation leakage: leakage is most pronounced in small or institution-specific cohorts and, for contrastive encoders, can saturate in embedding space, while larger and more diverse datasets substantially attenuate operational tail risk. Overall, our results show that restricting access to raw signals or labels is insufficient to guarantee participation privacy, underscoring the need for deployment-aware auditing of reusable biosignal foundation encoders in connected-health systems. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2604.10424 [cs.LG] (or arXiv:2604.10424v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.10424 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-52] Replicable Composition

链接: https://arxiv.org/abs/2604.10423
作者: Kiarash Banihashem,MohammadHossein Bateni,Hossein Esfandiari,Samira Goudarzi,MohammadTaghi Hajiaghayi
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: Abstract shortened due to Arxiv requirements

点击查看摘要

Abstract:Replicability requires that algorithmic conclusions remain consistent when rerun on independently drawn data. A central structural question is composition: given k problems each admitting a \rho -replicable algorithm with sample complexity n , how many samples are needed to solve all jointly while preserving replicability? The naive analysis yields \widetildeO(nk^2) samples, and Bun et al. (STOC’23) observed that reductions through differential privacy give an alternative \widetildeO(n^2k) bound, leaving open whether the optimal \widetildeO(nk) scaling is achievable. We resolve this open problem and, more generally, show that problems with sample complexities n_1,\ldots,n_k can be jointly solved with \widetildeO(\sum_i n_i) samples while preserving constant replicability. Our approach converts each replicable algorithm into a perfectly generalizing one, composes them via a privacy-style analysis, and maps back via correlated sampling. This yields the first advanced composition theorem for replicability. En route, we obtain new bounds for the composition of perfectly generalizing algorithms with heterogeneous parameters. As part of our results, we provide a boosting theorem for the success probability of replicable algorithms. For a broad class of problems, the failure probability appears as a separate additive term independent of \rho , immediately yielding improved sample complexity bounds for several problems. Finally, we prove an \Omega(nk^2) lower bound for adaptive composition, establishing a quadratic separation from the non-adaptive setting. The key technique, which we call the phantom run, yields structural results of independent interest. Comments: Abstract shortened due to Arxiv requirements Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS) Cite as: arXiv:2604.10423 [cs.LG] (or arXiv:2604.10423v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.10423 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-53] CARE-ECG: Causal Agent -based Reasoning for Explainable and Counterfactual ECG Interpretation

链接: https://arxiv.org/abs/2604.10420
作者: Elahe Khatibi,Ziyu Wang,Ankita Sharma,Krishnendu Chakrabarty,Sanaz Rahimi Moosavi,Farshad Firouzi,Amir Rahmani
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) enable waveform-to-text ECG interpretation and interactive clinical questioning, yet most ECG-LLM systems still rely on weak signal-text alignment and retrieval without explicit physiological or causal structure. This limits grounding, temporal reasoning, and counterfactual “what-if” analysis central to clinical decision-making. We propose CARE-ECG, a causally structured ECG-language reasoning framework that unifies representation learning, diagnosis, and explanation in a single pipeline. CARE-ECG encodes multi-lead ECGs into temporally organized latent biomarkers, performs causal graph inference for probabilistic diagnosis, and supports counterfactual assessment via structural causal models. To improve faithfulness, CARE-ECG grounds language outputs through causal retrieval-augmented generation and a modular agentic pipeline that integrates history, diagnosis, and response with verification. Across multiple ECG benchmarks and expert QA settings, CARE-ECG improves diagnostic accuracy and explanation faithfulness while reducing hallucinations (e.g., 0.84 accuracy on Expert-ECG-QA and 0.76 on SCP-mapped PTB-XL under GPT-4). Overall, CARE-ECG provides traceable reasoning by exposing key latent drivers, causal evidence paths, and how alternative physiological states would change outcomes.

[LG-54] Sense Less Infer More: Agent ic Multimodal Transformers for Edge Medical Intelligence

链接: https://arxiv.org/abs/2604.10404
作者: Chengwei Zhou,Zhaoyan Jia,Haotian Yu,Xuming Chen,Brandon Lee,Christopher Pulliam,Steve Majerus,Massoud Pedram,Gourav Datta
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 7 figures, 4 tables

点击查看摘要

Abstract:Edge-based multimodal medical monitoring requires models that balance diagnostic accuracy with severe energy constraints. Continuous acquisition of ECG, PPG, EMG, and IMU streams rapidly drains wearable batteries, often limiting operation to under 10 hours, while existing systems overlook the high temporal redundancy present in physiological signals. We introduce Adaptive Multimodal Intelligence (AMI), an end-to-end framework that jointly learns when to sense and how to infer. AMI integrates three components: (1) a lightweight Agentic Modality Controller that uses differentiable Gumbel-Sigmoid gating to dynamically select active sensors based on model confidence and task relevance; (2) a Learned Sigma-Delta Sensing module that applies patch-wise Delta-Sigma operations with learnable thresholds to skip temporally redundant samples; and (3) a Foundation-backed Multimodal Prediction Model built on unimodal foundation encoders and a cross-modal transformer with temporal context, enabling robust fusion even under gated or missing inputs. These components are trained jointly via a multi-objective loss combining classification accuracy, sparsity regularization, cross-modal alignment, and predictive coding. AMI is hardware-aware, supporting dynamic computation graphs and masked operations, leading to real energy and latency savings. Across MHEALTH, HMC Sleep, and WESAD datasets, it reduces sensor usage by 48.8% while improving state-of-the-art accuracy by 1.9% on average.

[LG-55] Latent Instruction Representation Alignment: defending against jailbreaks backdoors and undesired knowledge in LLM s

链接: https://arxiv.org/abs/2604.10403
作者: Eric Easley,Sebastian Farquhar
类目: Machine Learning (cs.LG)
*备注: 33 pages, 6 figures

点击查看摘要

Abstract:We address jailbreaks, backdoors, and unlearning for large language models (LLMs). Unlike prior work, which trains LLMs based on their actions when given malign instructions, our method specifically trains the model to change how it interprets instructions. Our method, Latent Instruction Representation Alignment (LIRA), greatly improves generalization. We further boost generalization through an internally adversarial training algorithm. Our methods block over 99% of PEZ jailbreak attacks; remove a challenging insecure code backdoor; and achieve optimal forgetting on WMDP cyber with negligible loss of benign capabilities.

[LG-56] Structural Gating and Effect-aligned Lag-resolved Temporal Causal Discovery Framework with Application to Heat-Pollution Extremes

链接: https://arxiv.org/abs/2604.10371
作者: Rui Chen,Jinsong Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study proposes Structural Gating and Effect-aligned Discovery for Temporal Causal Discovery (SGED-TCD), a novel and general framework for lag-resolved causal discovery in complex multivariate time series. SGED-TCD combines explicit structural gating, stability-oriented learning, perturbation-effect alignment, and unified graph extraction to improve the interpretability, robustness, and functional consistency of inferred causal graphs. To evaluate its effectiveness in a representative real-world setting, we apply SGED-TCD to teleconnection-driven compound heatwave–air-pollution extremes in eastern and northern China. Using large-scale climate indices, regional circulation and boundary-layer variables, and compound extreme indicators, the framework reconstructs weighted causal networks with explicit dominant lags and relative causal importance. The inferred networks reveal clear regional and seasonal heterogeneity: warm-season extremes in Eastern China are mainly linked to low-latitude oceanic variability through circulation, radiation, and ventilation pathways, whereas cold-season extremes in Northern China are more strongly governed by high-latitude circulation variability associated with boundary-layer suppression and persistent stagnation. These results show that SGED-TCD can recover physically interpretable, hierarchical, and lag-resolved causal pathways in a challenging climate–environment system. More broadly, the proposed framework is not restricted to the present application and provides a general basis for temporal causal discovery in other complex domains. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2604.10371 [cs.LG] (or arXiv:2604.10371v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.10371 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-57] Battery health prognosis using Physics-informed neural network with Quantum Feature mapping

链接: https://arxiv.org/abs/2604.10362
作者: Muhammad Imran Hossain,Md Fazley Rafy,Sarika Khushlani Solanki,Anurag K. Srivastava
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate battery health prognosis using State of Health (SOH) estimation is essential for the reliability of multi-scale battery energy storage, yet existing methods are limited in generalizability across diverse battery chemistries and operating conditions. The inability of standard neural networks to capture the complex, high-dimensional physics of battery degradation is a major contributor to these limitations. To address this, a physics-informed neural network with the Quantum Feature Mapping(QFM) technique (QPINN) is proposed. QPINN projects raw battery sensor data into a high-dimensional Hilbert space, creating a highly expressive feature set that effectively captures subtle, non-linear degradation patterns using Nyström method. These quantum-enhanced features are then processed by a physics-informed network that enforces physical constraints. The proposed method achieves an average SOH estimation accuracy of 99.46% across different datasets, substantially outperforming state-of-the-art baselines, with reductions in MAPE and RMSE of up to 65% and 62%, respectively. This method was validated on a large-scale, multi-chemistry dataset of 310,705 samples from 387 cells, and further showed notable adaptability in cross-validation settings, successfully transferring from one chemistry to another without relying on target-domain SOH labels.

[LG-58] WaterAdmin: Orchestrating Community Water Distribution Optimization via AI Agents

链接: https://arxiv.org/abs/2604.10343
作者: Jiaqi Wen,Pingbo Tang,Shaolei Ren,Jianyi Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the operation of community water systems, where pumps and valves must be scheduled to reliably meet water demands while minimizing energy consumption. While existing optimization-based methods are effective under well-modeled environments, real-world community scenarios exhibit highly dynamic contexts-such as human activities, weather variations, etc-that significantly affect water demand patterns and operational targets across different zones. Traditional optimization approaches struggle to aggregate and adapt to such heterogeneous and rapidly evolving contextual information in real time. While Large Language Model (LLM) agents offer strong capabilities for understanding heterogeneous community context, they are not suitable for directly producing reliable real-time control actions. To address these challenges, we propose a bi-level AI-agent-based framework, WaterAdmin, which integrates LLM-based community context abstraction at the upper level with optimization-based operational control at the lower level. This design leverages the complementary strengths of both paradigms to enable adaptive and reliable operation. We implement WaterAdmin on the hydraulic simulation platform EPANET and demonstrate superior performance in maintaining pressure reliability and reducing energy consumption under highly dynamic community contexts.

[LG-59] Integrating SAINT with Tree-Based Models: A Case Study in Employee Attrition Prediction

链接: https://arxiv.org/abs/2604.10337
作者: Adil Derrazi,Javad Pourmostafa Roshan Sharami
类目: Machine Learning (cs.LG)
*备注: Accepted at IntelliSys 2025 (Springer LNNS)

点击查看摘要

Abstract:Employee attrition presents a major challenge for organizations, increasing costs and reducing productivity. Predicting attrition accurately enables proactive retention strategies, but existing machine learning models often struggle to capture complex feature interactions in tabular HR datasets. While tree-based models such as XGBoost and LightGBM perform well on structured data, traditional encoding techniques like one-hot encoding can introduce sparsity and fail to preserve semantic relationships between categorical features. This study explores a hybrid approach by integrating SAINT (Self-Attention and Intersample Attention Transformer)-generated embeddings with tree-based models to enhance employee attrition prediction. SAINT leverages self-attention mechanisms to model intricate feature interactions. In this study, we explore SAINT both as a standalone classifier and as a feature extractor for tree-based models. We evaluate the performance, generalizability, and interpretability of standalone models (SAINT, XGBoost, LightGBM) and hybrid models that combine SAINT embeddings with tree-based classifiers. Experimental results show that standalone tree-based models outperform both the standalone SAINT model and the hybrid approaches in predictive accuracy and generalization. Contrary to expectations, the hybrid models did not improve performance. One possible explanation is that tree-based models struggle to utilize dense, high-dimensional embeddings effectively. Additionally, the hybrid approach significantly reduced interpretability, making model decisions harder to explain. These findings suggest that transformer-based embeddings, while capturing feature relationships, do not necessarily enhance tree-based classifiers. Future research should explore alternative fusion strategies for integrating deep learning with structured data. Comments: Accepted at IntelliSys 2025 (Springer LNNS) Subjects: Machine Learning (cs.LG) Cite as: arXiv:2604.10337 [cs.LG] (or arXiv:2604.10337v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.10337 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: Published in Intelligent Systems and Applications (IntelliSys 2025), LNNS, Springer, 2025 Related DOI: https://doi.org/10.1007/978-3-031-99958-1_27 Focus to learn more DOI(s) linking to related resources

[LG-60] Descriptor-Injected Cross-Modal Learning: A Systematic Exploration of Audio-MIDI Alignment via Spectral and Melodic Features ALT

链接: https://arxiv.org/abs/2604.10283
作者: Mariano Fernández Méndez
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: 26 pages, 11 figures, 20 tables. Companion paper to “Harmonic Information Theory: Foundations” (2026). Code: this https URL

点击查看摘要

Abstract:Cross-modal retrieval between audio recordings and symbolic music representations (MIDI) remains challenging because continuous waveforms and discrete event sequences encode different aspects of the same performance. We study descriptor injection, the augmentation of modality-specific encoders with hand-crafted domain features, as a bridge across this gap. In a three-phase campaign covering 13 descriptor-mechanism combinations, 6 architectural families, and 3 training schedules, the best configuration reaches a mean S of 84.0 percent across five independent seeds, improving the descriptor-free baseline by 8.8 percentage points. Causal ablation shows that the audio descriptor A4, based on octave-band energy dynamics, drives the gain in the top dual models, while the MIDI descriptor D4 has only a weak inference-time effect despite improving training dynamics. We also introduce reverse cross-attention, where descriptor tokens query encoder features, reducing attention operations relative to the standard formulation while remaining competitive. CKA analysis shows that descriptors substantially increase audio-MIDI transformer layer alignment, indicating representational convergence rather than simple feature concatenation. Perturbation analysis identifies high-frequency octave bands as the dominant discriminative signal. All experiments use MAESTRO v3.0.0 with an evaluation protocol controlling for composer and piece similarity.

[LG-61] he Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks

链接: https://arxiv.org/abs/2604.10272
作者: Mani Rash Ahmadi
类目: Machine Learning (cs.LG)
*备注: 15 pages, 5 figures, 8 tables. Code and data at this https URL

点击查看摘要

Abstract:We prove that in a coupled Kuramoto oscillator network at stable equilibrium, the physical phase displacement under weak output nudging is the gradient of the loss with respect to natural frequencies, with equality as the nudging strength beta tends to zero. Prior oscillator equilibrium propagation work explicitly set aside natural frequency as a learnable parameter; we show that on sparse layered architectures, frequency learning outperforms coupling-weight learning among converged seeds (96.0% vs. 83.3% at matched parameter counts, p = 1.8e-12). The approximately 50% convergence failure rate under random initialization is a loss-landscape property, not a gradient error; topology-aware spectral seeding eliminates it in all settings tested (46/100 to 100/100 seeds on the primary task; 50/50 on a second task, K-only training, and a larger architecture).

[LG-62] A Multi-head Attention Fusion Network for Industrial Prognostics under Discrete Operational Conditions

链接: https://arxiv.org/abs/2604.10248
作者: Yuqi Su,Xiaolei Fang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Complex systems such as aircraft engines, turbines, and industrial machinery often operate under dynamically changing conditions. These varying operating conditions can substantially influence degradation behavior and make prognostic modeling more challenging, as accurate prediction requires explicit consideration of operational effects. To address this issue, this paper proposes a novel multi-head attention-based fusion neural network. The proposed framework explicitly models and integrates three signal components: (1) the monotonic degradation trend, which reflects the underlying deterioration of the system; (2) discrete operating states, identified through clustering and encoded into dense embeddings; and (3) residual random noise, which captures unexplained variation in sensor measurements. The core strength of the framework lies in its architecture, which combines BiLSTM networks with attention mechanisms to better capture complex temporal dependencies. The attention mechanism allows the model to adaptively weight different time steps and sensor signals, improving its ability to extract prognostically relevant information. In addition, a fusion module is designed to integrate the outputs from the degradation-trend branch and the operating-state embeddings, enabling the model to capture their interactions more effectively. The proposed method is validated using a dataset from the NASA repository, and the results demonstrate its effectiveness.

[LG-63] Mild Over-Parameterization Benefits Asymmetric Tensor PCA

链接: https://arxiv.org/abs/2604.10208
作者: Shihong Ding,Weicheng Lin,Cong Fang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Asymmetric Tensor PCA (ATPCA) is a prototypical model for studying the trade-offs between sample complexity, computation, and memory. Existing algorithms for this problem typically require at least d^\left\lceil\overlinek/2\right\rceil state memory cost to recover the signal, where d is the vector dimension and \overlinek is the tensor order. We focus on the setting where \overlinek \geq 4 is even and consider (stochastic) gradient descent-based algorithms under a limited memory budget, which permits only mild over-parameterization of the model. We propose a matrix-parameterized method (in d^2 state memory cost) using a novel three-phase alternating-update algorithm to address the problem and demonstrate how mild over-parameterization facilitates learning in two key aspects: (i) it improves sample efficiency, allowing our method to achieve \emphnear-optimal d^\overlinek-2 sample complexity in our limited memory setting; and (ii) it enhances adaptivity to problem structure, a previously unrecognized phenomenon, where the required sample size naturally decreases as consecutive vectors become more aligned, and in the symmetric limit attains d^\overlinek/2 , matching the \emphbest known polynomial-time complexity. To our knowledge, this is the \emphfirst tractable algorithm for ATPCA with d^\overlinek -independent memory costs.

[LG-64] FatigueFusion: Latent Space Fusion for Fatigue-Driven Motion Synthesis

链接: https://arxiv.org/abs/2604.10199
作者: Iliana Loi,Konstantinos Moustakas
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 13 pages, 9 figures. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Investigating the impact of fatigue on human physiological function and motor behavior is crucial for developing biomechanics and medical applications aimed at mitigating fatigue, reducing injury risk, and creating sophisticated ergonomic designs, as well as for producing physically-plausible 3D animation sequences. While the former has a prominent position in state-of-the-art literature, fatigue-driven motion generation is still an underexplored area. In this study, we present FatigueFusion, a deep-learning architecture for the fusion of fatigue features within a latent representation space, enabling the creation of a variation of novel fatigued movements, intermediate fatigued states, and progressively fatigued motions. Unlike existing approaches that focus on imitating the effects of fatigue accumulation in motion patterns, our framework incorporates algorithmic and data-driven modules to impose subject-specific temporal and spatial fatigue features on nonfatigued motions, while leveraging PINN-based techniques to simulate fatigue intensity. Since all motion modulation tasks are taking place in latent space, FatigueFusion offers an end-to-end architecture that operates directly on non-fatigued joint angle sequences and control parameters, allowing seamless integration into any motion synthesis pipeline, without relying on fatigue input data. Overall, our framework can be employed for various fatigue-driven synthesis tasks, such as fatigue profile transfer and fusion, while it also provides a solution for accurate rendering of the human fatigue state in both animation and simulation pipelines.

[LG-65] RF-LEGO: Modularized Signal Processing-Deep Learning Co-Design for RF Sensing via Deep Unrolling

链接: https://arxiv.org/abs/2604.10183
作者: Luca Jiang-Tao Yu,Chenshu Wu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Accepted by The 32nd Annual International Conference on Mobile Computing and Networking (MobiCom '26), October 26-30, 2026, Austin, TX, USA. 16 pages

点击查看摘要

Abstract:Wireless sensing, traditionally relying on signal processing (SP) techniques, has recently shifted toward data-driven deep learning (DL) to achieve performance breakthroughs. However, existing deep wireless sensing models are typically end-to-end and task-specific, lacking reusability and interpretability. We propose RF-LEGO, a modular co-design framework that transforms interpretable SP algorithms into trainable, physics-grounded DL modules through deep unrolling. By replacing hand-tuned parameters with learnable ones while preserving core processing structures and mathematical operators, RF-LEGO ensures modularity, cascadability, and structure-aligned interpretability. Specifically, we introduce three deep-unrolled modules for critical RF sensing tasks: frequency transform, spatial angle estimation, and signal detection. Extensive experiments using real-world data for Wi-Fi, millimeter-wave, UWB, and 6G sensing demonstrate that RF-LEGO significantly outperforms existing SP and DL baselines, both standalone and when integrated into multiple downstream tasks. RF-LEGO pioneers a novel SP-DL co-design paradigm for wireless sensing via deep unrolling, shedding light on efficient and interpretable deep wireless sensing solutions. Our code is available at this https URL.

[LG-66] ssera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

链接: https://arxiv.org/abs/2604.10180
作者: Tiancheng Hu,Jin Qin,Zheng Wang,Junhao Hu,Yuzheng Wang,Lei Chen,Yizhou Shan,Mingxing Zhang,Ting Cao,Chunwei Xia,Huimin Cui,Tao Xie,Chenxi Wang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Disaggregation maps parts of an AI workload to different types of GPUs, offering a path to utilize modern heterogeneous GPU clusters. However, existing solutions operate at a coarse granularity and are tightly coupled to specific model architectures, leaving much room for performance improvement. This paper presents Tessera, the first kernel disaggregation system to improve performance and cost efficiency on heterogeneous GPUs for large model inference. Our key insight is that kernels within a single application exhibit diverse resource demands, making them the most suitable granularity for aligning computation with hardware capabilities. Tessera integrates offline analysis with online adaptation by extracting precise inter-kernel dependencies from PTX to ensure correctness, overlapping communication with computation through a pipelined execution model, and employing workload-aware scheduling with lightweight runtime adaptation. Extensive evaluations across five heterogeneous GPUs and four model architectures, scaling up to 16 GPUs, show that Tessera improves serving throughput and cost efficiency by up to 2.3x and 1.6x, respectively, compared to existing disaggregation methods, while generalizing to model architectures where prior approaches do not apply. Surprisingly, a heterogeneous GPU pair under Tessera can even exceed the throughput of two homogeneous high-end GPUs at a lower cost.

[LG-67] A Modularized Framework for Piecewise-Stationary Restless Bandits

链接: https://arxiv.org/abs/2604.10177
作者: Kuan-Ta Li,Chia-Chun Lin,Ping-Chun Hsieh,Yu-Chih Huang
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the piecewise-stationary restless multi-armed bandit (PS-RMAB) problem, where each arm evolves as a Markov chain but \emphmean rewards may change across unknown segments. To address the resulting exploration–detection delay trade-off, we propose a modular framework that integrates arbitrary RMAB base algorithms with change detection and a novel diminishing exploration mechanism. This design enables flexible plug-and-play use of existing solvers and detectors, while efficiently adapting to mean changes without prior knowledge of their number. To evaluate performance, we introduce a refined regret notion that measures the \emphexcess regret due to exploration and detection, benchmarked against an oracle that restarts the base algorithm at the true change points. Under this metric, we prove a regret bound of \tildeO(\sqrtLMKT) , where L denotes the maximum mixing time of the Markov chains across all arms and segments, M the number of segments, K the number of arms, and T the horizon. Simulations confirm that our framework achieves regret close to that of the segment oracle and consistently outperforms base solvers that do not incorporate any mechanism to handle environmental changes. Subjects: Information Theory (cs.IT); Machine Learning (cs.LG) Cite as: arXiv:2604.10177 [cs.IT] (or arXiv:2604.10177v1 [cs.IT] for this version) https://doi.org/10.48550/arXiv.2604.10177 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-68] “bot lane noob” Towards Deployment of NLP-based Toxicity Detectors in Video Games ESORICS’26

链接: https://arxiv.org/abs/2604.10175
作者: Jonas Ave,Irdin Pekaric,Matthias Frohner,Giovanni Apruzzese
类目: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: Accepted to ESORICS’26

点击查看摘要

Abstract:Toxicity and harassment are widespread in the video-gaming context. Especially in competitive online multiplayer scenarios, gamers oftentimes send harmful messages to other players (teammates or opponents) whose consequences span from mild annoyance to withdrawal and depression. Abundant prior work tackled these problems, e.g., pointing out the negative effects of toxic interactions. However, few works proposed countermeasures specifically developed and tested on textual messages sent during a match – i.e., when the “harassment” actually occurs. We posit that such a scarcity stems from the lack of high-quality datasets that can be used to devise “automated” detectors based on natural-language processing (NLP) and machine learning (ML), and which can – ideally – mitigate the harm of toxic comments during a gaming session. This work provides a foundation for addressing the problem of toxicity and harassment in video games. First, through a systematic literature review (n=1,039), we provide evidence that only few works proposed ML/NLP-based detectors of toxicity/harassment during live matches. Then, we partner-up with 8 expert League of Legend (LoL) players and create a fine-grained labelled dataset, L2DTnH, containing 1.4k toxic and 13.8k non-toxic messages exchanged during LoL matches. We use L2DTnH to develop a detector that we then empirically show outperforms general-purpose and state-of-the-art toxicity detectors reliant on NLP. To further demonstrate the practicality of our resources, we test our detector on game-related data beyond that included in L2DTnH; and we develop a Web-browser extension that flags toxic content in Webpages – without querying third-party servers owned by AI companies. We publicly release all of our resources. Our contributions pave the way for more applied research devoted to fighting the spread of toxicity and harassment in video games.

[LG-69] racing the Thought of a Grandmaster-level Chess-Playing Transformer

链接: https://arxiv.org/abs/2604.10158
作者: Rui Lin,Zhenyu Jin,Guancheng Zhou,Xuyang Ge,Wentao Shu,Jiaxing Wu,Junxuan Wang,Zhengfu He,Junping Zhang,Xipeng Qiu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While modern transformer neural networks achieve grandmaster-level performance in chess and other reasoning tasks, their internal computation process remains largely opaque. Focusing on Leela Chess Zero (LC0), we introduce a sparse decomposition framework to interpret its internal computation by decomposing its MLP and attention modules with sparse replacement layers, which capture the primary computation process of LC0. We conduct a detailed case study showing that these pathways expose rich, interpretable tactical considerations that are empirically verifiable. We further introduce three quantitative metrics and show that LC0 exhibits parallel reasoning behavior consistent with the inductive bias of its policy head architecture. To the best of our knowledge, this is the first work to decompose the internal computation of a transformer on both MLP and attention modules for interpretability. Combining sparse replacement layers and causal interventions in LC0 provides a comprehensive understanding of advanced tactical reasoning, offering critical insights into the underlying mechanisms of superhuman systems. Our code is available at this https URL.

[LG-70] Consensus-based Recursive Multi-Output Gaussian Process

链接: https://arxiv.org/abs/2604.10146
作者: Yogesh Prasanna Kumar Rao,Tamas Keviczky,Raj Thilak Rajan
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Submitted to International Workshop on Signal Processing and Artificial Intelligence in Wireless Communications (IEEE SPAWC 2026)

点击查看摘要

Abstract:Multi-output Gaussian Processes provide principled uncertainty-aware learning of vector-valued fields but are difficult to deploy in large-scale, distributed, and streaming settings due to their computational and centralized nature. This paper proposes a Consensus-based Recursive Multi-Output Gaussian Process (CRMGP) framework that combines recursive inference on shared basis vectors with neighbour-to-neighbour information-consensus updates. The resulting method supports parallel, fully distributed learning with bounded per-step computation while preserving inter-output correlations and calibrated uncertainty. Experiments on synthetic wind fields and real LiDAR data demonstrate that CRMGP achieves competitive predictive performance and reliable uncertainty calibration, offering a scalable alternative to centralized Gaussian process models for multi-agent sensing applications.

[LG-71] End-to-end Automated Deep Neural Network Optimization for PPG-based Blood Pressure Estimation on Wearables

链接: https://arxiv.org/abs/2604.10117
作者: Francesco Carlucci,Giovanni Pollo,Xiaying Wang,Massimo Poncino,Enrico Macii,Luca Benini,Sara Vinco,Alessio Burrello,Daniele Jahier Pagliari
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Photoplethysmography (PPG)-based blood pressure (BP) estimation is a challenging task, particularly on resource-constrained wearable devices. However, fully on-board processing is desirable to ensure user data confidentiality. Recent deep neural networks (DNNs) have achieved high BP estimation accuracy by reconstructing BP waveforms or directly regressing BP values, but their large memory, computation, and energy requirements hinder deployment on wearables. This work introduces a fully automated DNN design pipeline that combines hardware-aware neural architecture search (NAS), pruning, and mixed-precision search (MPS) to generate accurate yet compact BP prediction models optimized for ultra-low-power multicore systems-on-chip (SoCs). Starting from state-of-the-art baseline models on four public datasets, our optimized networks achieve up to 7.99% lower error with a 7.5x parameter reduction, or up to 83x fewer parameters with negligible accuracy loss. All models fit within 512 kB of memory on our target SoC (GreenWaves’ GAP8), requiring less than 55 kB and achieving an average inference latency of 142 ms and energy consumption of 7.25 mJ. Patient-specific fine-tuning further improves accuracy by up to 64%, enabling fully autonomous, low-cost BP monitoring on wearables.

[LG-72] Attention Sink in Transformers: A Survey on Utilization Interpretation and Mitigation

链接: https://arxiv.org/abs/2604.10098
作者: Zunhai Su,Hengyuan Zhang,Wei Wu,Yifan Zhang,Yaxiu Liu,He Xiao,Qingyao Yang,Yuxuan Sun,Rui Yang,Chao Zhang,Keyu Fan,Weihao Ye,Jing Xiong,Hui Shen,Chaofan Tao,Taiqiang Wu,Zhongwei Wan,Yulei Qian,Yuchen Xie,Ngai Wong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As the foundational architecture of modern machine learning, Transformers have driven remarkable progress across diverse AI domains. Despite their transformative impact, a persistent challenge across various Transformers is Attention Sink (AS), in which a disproportionate amount of attention is focused on a small subset of specific yet uninformative tokens. AS complicates interpretability, significantly affecting the training and inference dynamics, and exacerbates issues such as hallucinations. In recent years, substantial research has been dedicated to understanding and harnessing AS. However, a comprehensive survey that systematically consolidates AS-related research and offers guidance for future advancements remains lacking. To address this gap, we present the first survey on AS, structured around three key dimensions that define the current research landscape: Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation. Our work provides a pivotal contribution by clarifying key concepts and guiding researchers through the evolution and trends of the field. We envision this survey as a definitive resource, empowering researchers and practitioners to effectively manage AS within the current Transformer paradigm, while simultaneously inspiring innovative advancements for the next generation of Transformers. The paper list of this work is available at this https URL.

[LG-73] ransformers Learn the Optimal DDPM Denoiser for Multi-Token GMMs

链接: https://arxiv.org/abs/2604.10074
作者: Hongkang Li,Hancheng Min,Rene Vidal
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformer-based diffusion models have demonstrated remarkable performance at generating high-quality samples. However, our theoretical understanding of the reasons for this success remains limited. For instance, existing models are typically trained by minimizing a denoising objective, which is equivalent to fitting the score function of the training data. However, we do not know why transformer-based models can match the score function for denoising, or why gradient-based methods converge to the optimal denoising model despite the non-convex loss landscape. To the best of our knowledge, this paper provides the first convergence analysis for training transformer-based diffusion models. More specifically, we consider the population Denoising Diffusion Probabilistic Model (DDPM) objective for denoising data that follow a multi-token Gaussian mixture distribution. We theoretically quantify the required number of tokens per data point and training iterations for the global convergence towards the Bayes optimal risk of the denoising objective, thereby achieving a desired score matching error. A deeper investigation reveals that the self-attention module of the trained transformer implements a mean denoising mechanism that enables the trained model to approximate the oracle Minimum Mean Squared Error (MMSE) estimator of the injected noise in the diffusion steps. Numerical experiments validate these findings.

[LG-74] When Can You Poison Rewards? A Tight Characterization of Reward Poisoning in Linear MDPs

链接: https://arxiv.org/abs/2604.10062
作者: Jose Efraim Aguilar Escamilla,Haoyang Hong,Jiawei Li,Haoyu Zhao,Xuezhou Zhang,Sanghyun Hong,Huazheng Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study reward poisoning attacks in reinforcement learning (RL), where an adversary manipulates rewards within constrained budgets to force the target RL agent to adopt a policy that aligns with the attacker’s objectives. Prior works on reward poisoning mainly focused on sufficient conditions to design a successful attacker, while only a few studies discussed the infeasibility of targeted attacks. This paper provides the first precise necessity and sufficiency characterization of the attackability of a linear MDP under reward poisoning attacks. Our characterization draws a bright line between the vulnerable RL instances, and the intrinsically robust ones which cannot be attacked without large costs even running vanilla non-robust RL algorithms. Our theory extends beyond linear MDPs – by approximating deep RL environments as linear MDPs, we show that our theoretical framework effectively distinguishes the attackability and efficiently attacks the vulnerable ones, demonstrating both the theoretical and practical significance of our characterization.

[LG-75] Cross-Validated Cross-Channel Self-Attention and Denoising for Automatic Modulation Classification

链接: https://arxiv.org/abs/2604.10054
作者: Prakash Suman,Yanzhen Qu
类目: Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:This study addresses a key limitation in deep learning Automatic Modulation Classification (AMC) models, which perform well at high signal-to-noise ratios (SNRs) but degrade under noisy conditions due to conventional feature extraction suppressing both discriminative structure and interference. The goal was to develop a feature-preserving denoising method that mitigates the loss of modulation class separation. A deep learning AMC model was proposed, incorporating a cross-channel self-attention block to capture dependencies between in-phase and quadrature components, along with dual-path deep residual shrinkage denoising blocks to suppress noise. Experiments using the RML2018.01a dataset employed stratified sampling across 24 modulation types and 26 SNR levels. Results showed that denoising depth strongly influences robustness at low and moderate SNRs. Compared to benchmark models PET-CGDNN, MCLDNN, and DAE, the proposed model achieved notable accuracy improvements across -8 dB to +2 dB SNR, with increases of 3%, 2.3%, and 14%, respectively. Cross-validation confirmed the model’s robustness, yielding a mean accuracy of 62.6%, macro precision of 65.8%, macro-recall of 62.6%, and macro-F1 score of 62.9%. The architecture advances interference-aware AMC by formalizing baseband modeling as orthogonal subproblems and introducing cross-channel attention as a generalized complex interaction operator, with ablations confirming the critical role of feature-preserving denoising for robustness at low-to-medium SNR.

[LG-76] Masked Contrastive Pre-Training Improves Music Audio Key Detection

链接: https://arxiv.org/abs/2604.10021
作者: Ori Yonay,Tracy Hammond,Tianbao Yang
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: Code and models available at this http URL

点击查看摘要

Abstract:Self-supervised music foundation models underperform on key detection, which requires pitch-sensitive representations. In this work, we present the first systematic study showing that the design of self-supervised pretraining directly impacts pitch sensitivity, and demonstrate that masked contrastive embeddings uniquely enable state-of-the-art (SOTA) performance in key detection in the supervised setting. First, we discover that linear evaluation after masking-based contrastive pretraining on Mel spectrograms leads to competitive performance on music key detection out of the box. This leads us to train shallow but wide multi-layer perceptrons (MLPs) on features extracted from our base model, leading to SOTA performance without the need for sophisticated data augmentation policies. We further analyze robustness and show empirically that the learned representations naturally encode common augmentations. Our study establishes self-supervised pretraining as an effective approach for pitch-sensitive MIR tasks and provides insights for designing and probing music foundation models.

[LG-77] Engineering Resource-constrained Software Systems with DNN Components: a Concept-based Pruning Approach

链接: https://arxiv.org/abs/2604.09988
作者: Federico Formica,Andrea Rota,Aurora Francesca Zanenga,Andrea Bombarda,Mark Lawford,Lionel C. Briand,Claudio Menghi
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep Neural Networks (DNNs) are widely used by engineers to solve difficult problems that require predictive modeling from data. However, these models are often massive, with millions or billions of parameters, and require substantial computational power, RAM, and storage. This becomes a limitation in practical scenarios where strict size and resource constraints must be respected. In this paper, we present a novel concept-based pruning technique for DNNs that guides pruning decisions using human-interpretable concepts, such as features, colors, and classes. This is particularly important in a software engineering context, as DNNs are integrated into systems and must be pruned according to specific system requirements. Our concept-based pruning solution analyzes neuron activations to identify important neurons from a system requirements viewpoint and uses this information to guide the DNN pruning. We assess our solution using the VGG-19 network and a dataset of 26’384 RGB images, focusing on its ability to produce small, effective pruned DNNs and on the computational complexity and performance of these pruned DNNs. We also analyzed the pruning efficiency of our solution and compared alternative configurations. Our results show that concept-based pruning efficiently generates much smaller, effective pruned DNNs. Pruning greatly improves the computational efficiency and performance of DNNs, properties that are particularly useful for practical applications with stringent memory and computational time constraints. Finally, alternative configuration options enable engineers to identify trade-offs adapted to different practical situations.

[LG-78] LoDAdaC: a unified local training-based decentralized framework with adaptive gradients and compressed communication

链接: https://arxiv.org/abs/2604.09970
作者: Wei Liu,Anweshit Panda,Ujwal Pandey,Haven Cook,George M. Slota,Naigang Wang,Jie Chen,Yangyang Xu
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
*备注: Accepted by TMLR

点击查看摘要

Abstract:In the decentralized distributed learning, achieving fast convergence and low communication cost is essential for scalability and high efficiency. Adaptive gradient methods, such as Adam, have demonstrated strong practical performance in deep learning and centralized distributed settings. However, their convergence properties remain largely unexplored in decentralized settings involving multiple local training steps, such as federated learning. To address this limitation, we propose LoDAdaC, a unified multiple Local Training (MLT) Decentralized framework with Adam-type updates and Compressed communication (CC). LoDAdaC accommodates a broad class of optimizers for its local adaptive updates, including AMSGrad, Adam, and AdaGrad; it is compatible with standard (possibly biased) compressors such as low-bit quantization and sparsification. MLT and CC enable LoDAdaC to achieve multiplied reduction of communication cost, while the technique of adaptive updates enables fast convergence. We rigorously prove the combined advantage through complexity analysis. In addition, experiments on image classification and GPT-style language model training validate our theoretical findings and show that LoDAdaC significantly outperforms existing decentralized algorithms in terms of convergence speed and communication efficiency.

[LG-79] From Recency Bias to Stable Convergence Block Kaczmarz Methods for Online Preference Learning in Matchmaking Applications

链接: https://arxiv.org/abs/2604.09964
作者: James Nguyen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a family of Kaczmarz-based preference learning algorithms for real-time personalized matchmaking in reciprocal recommender systems. Post-step L2 normalization, common in Kaczmarz-inspired online learners, induces exponential recency bias: the influence of the t-th interaction decays as eta^(n - t), reaching approximately 1e-6 after just 20 swipes at eta = 0.5. We resolve this by replacing the normalization step with a Tikhonov-regularized projection denominator that bounds step size analytically without erasing interaction history. When candidate tag vectors are not pre-normalized, as in realistic deployments where candidates vary in tag density, the Tikhonov denominator ||a||^2 + alpha produces genuinely per-candidate adaptive step sizes, making it structurally distinct from online gradient descent with any fixed learning rate. We further derive a block variant that processes full swipe sessions as a single Gram matrix solve. Population-scale simulation over 6,400 swipes reveals that Block Normalized Kaczmarz (BlockNK), which combines the batch Gram solve with post-session L2 normalization, achieves the highest preference alignment (Align@20 = 0.698), the strongest inter-session direction stability (delta = 0.994), and the flattest degradation profile under label noise across flip ratios p_flip in [0.10, 0.35]. Experiments under cosine similarity subsampling further show that adaptively filtering the candidate pool toward the current preference direction substantially improves asymptotic alignment, at the cost of introducing a feedback loop that may slow recovery from miscalibration. The sequential Tikhonov-Kaczmarz method performs comparably to K-NoNorm under our simulation conditions, suggesting the dominant practical gain over normalized Kaczmarz is the removal of per-step normalization rather than the Tikhonov constant alpha itself.

[LG-80] SLM Finetuning for Natural Language to Domain Specific Code Generation in Production

链接: https://arxiv.org/abs/2604.09952
作者: Renjini R. Nair(Microsoft),Damian K. Kowalczyk(Microsoft),Marco Gaudesi(Microsoft),Chhaya Methani(Microsoft)
类目: Machine Learning (cs.LG)
*备注: 11 pages (including appendix), 5 tables, 1 figure. Submitted to arXiv as a preprint

点击查看摘要

Abstract:Many applications today use large language models for code generation; however, production systems have strict latency requirements that can be difficult to meet with large models. Small language models with a few billion parameters are resource efficient but may suffer from limited reasoning, hallucinations, or poor retention of longer context. Fine tuning improves task specific accuracy by embedding domain knowledge directly into model weights, reducing reliance on runtime context. We previously implemented a baseline natural language to code generation approach using a retrieval augmented generation pipeline that dynamically selected few shot examples to embed domain specific language context for a large language model. In this study, we evaluate small language models for generating domain specific language from natural language by fine tuning variants of Mistral and other models on a dataset of natural language code pairs. Our results show that the fine-tuned models achieve improved performance and latency on test datasets compared to larger models. We also demonstrate that the trained model can be further fine-tuned for customer specific scenarios without degrading general performance, helping resolve production issues. Load testing followed by production deployment confirmed optimal performance in terms of latency and quality. These findings demonstrate that task specific fine tuning with small language models provides an efficient, faster, and cost-effective alternative to large language models for domain specific language generation.

[LG-81] Vestibular reservoir computing

链接: https://arxiv.org/abs/2604.09943
作者: Smita Deb,Shirin Panahi,Mulugeta Haile,Ying-Cheng Lai
类目: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 24 pages, 11 figures

点击查看摘要

Abstract:Reservoir computing (RC) is a computational framework known for its training efficiency, making it ideal for physical hardware implementations. However, realizing the complex interconnectivity of traditional reservoirs in physical systems remains a significant challenge. This paper proposes a physical RC scheme inspired by the biological vestibular system. To overcome hardware complexity, we introduce a designed uncoupled topology and demonstrate that it achieves performance comparable to fully coupled networks. We theoretically analyze the difference between these topologies by deriving a memory capacity formula for linear reservoirs, identifying specific conditions where both configurations yield equivalent memory. These analytical results are demonstrated to approximately hold for nonlinear reservoir systems. Furthermore, we systematically examine the impact of reservoir size on predictive statistics and memory capacity. Our findings suggest that uncoupled reservoir architectures offer a mathematically sound and practically feasible pathway for efficient physical reservoir computing.

[LG-82] CableTract: A Co-Designed Cable-Driven Field Robot for Low-Compaction Off-Grid Capable Agriculture

链接: https://arxiv.org/abs/2604.09938
作者: Ozgur Yilmaz
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conventional field operations spend most of their energy moving the tractor body, not the implement. Yet feasibility studies for novel agricultural vehicles rarely tie mechanics, energy harvest, draft, field geometry, economics, life-cycle CO2, and uncertainty quantification together on a single reproducible code path. This paper builds such a framework and applies it to CableTract, a two-module cable-driven field robot. A stationary Main Unit (winch + motor + battery + harvester module) (MU) and a lighter Anchor module (held by helical screw piles) tension a cable across a strip while a lightweight implement carriage rolls along it. The heavy bodies stay on the headland; only the carriage enters the field. The carriage runs a 10-implement library co-designed for the cable architecture. This co-design is the paper’s central analytical lever. The framework is prototype-free. It chains a catenary cable model, a drivetrain efficiency chain, a stochastic draft model fitted to the co-designed library, an hourly solar + wind + battery simulator on six sites, a polygon coverage planner on a 50-field corpus, a contact-pressure compaction model, a discounted cash-flow economics engine with battery replacement and life-cycle CO2, and a global sensitivity analysis on 20 inputs. An operating-envelope sweep and an architectural-variant comparison close the loop. The full implementation is open source. Applied to the codesigned reference, the framework yields energy, compaction advantages and potential off-grid operation.

[LG-83] A Tale of Two Temperatures: Simple Efficient and Diverse Sampling from Diffusion Language Models

链接: https://arxiv.org/abs/2604.09921
作者: Theo X. Olausson,Metod Jazbec,Xi Wang,Armando Solar-Lezama,Christian A. Naesseth,Stephan Mandt,Eric Nalisnick
类目: Machine Learning (cs.LG)
*备注: 24 pages, 11 figures

点击查看摘要

Abstract:Much work has been done on designing fast and accurate sampling for diffusion language models (dLLMs). However, these efforts have largely focused on the tradeoff between speed and quality of individual samples; how to additionally ensure diversity across samples remains less well understood. In this work, we show that diversity can be increased by using softened, tempered versions of familiar confidence-based remasking heuristics, retaining their computational benefits and offering simple implementations. We motivate this approach by introducing an idealized formal model of fork tokens and studying the impact of remasking on the expected entropy at the forks. Empirically, the proposed tempered heuristics close the exploration gap (pass@k) between existing confidence-based and autoregressive sampling, hence outperforming both when controlling for cost (pass@NFE). We further study how the increase in diversity translates to downstream post-training and test-time compute scaling. Overall, our findings demonstrate that simple, efficient, and diverse sampling from dLLMs is possible.

[LG-84] Regularized Entropy Information Adaptation with Temporal-Awareness Networks for Simultaneous Speech Translation INTERSPEECH2026

链接: https://arxiv.org/abs/2604.09916
作者: Joseph Liu,Nameer Hirschkind,Xiao Yu,Mahesh Kumar Nandwana
类目: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Under review at Interspeech 2026

点击查看摘要

Abstract:Simultaneous Speech Translation (SimulST) requires balancing high translation quality with low latency. Recent work introduced REINA, a method that trains a Read/Write policy based on estimating the information gain of reading more audio. However, we find that information-based policies often lack temporal context, leading the policy to bias itself toward reading most of the audio before starting to write. We improve REINA using two distinct strategies: a supervised alignment network (REINA-SAN) and a timestep-augmented network (REINA-TAN). Our results demonstrate that while both methods significantly outperform the baseline and resolve stability issues, REINA-TAN provides a slightly superior Pareto frontier for streaming efficiency, whereas REINA-SAN offers more robustness against ‘read loops’. Applied to Whisper, both methods improve the pareto frontier of streaming efficiency as measured by Normalized Streaming Efficiency (NoSE) scores up to 7.1% over existing competitive baselines.

[LG-85] Last-Iterate Convergence of Randomized Kaczmarz and SGD with Greedy Step Size

链接: https://arxiv.org/abs/2604.09909
作者: Michał Dereziński,Xiaoyu Dong
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study last-iterate convergence of SGD with greedy step size over smooth quadratics in the interpolation regime, a setting which captures the classical Randomized Kaczmarz algorithm as well as other popular iterative linear system solvers. For these methods, we show that the t -th iterate attains an O(1/t^3/4) convergence rate, addressing a question posed by Attia, Schliserman, Sherman, and Koren, who gave an O(1/t^1/2) guarantee for this setting. In the proof, we introduce the family of stochastic contraction processes, whose behavior can be described by the evolution of a certain deterministic eigenvalue equation, which we analyze via a careful discrete-to-continuous reduction.

[LG-86] Improving Pediatric Emergency Department Triage with Modality Dropout in Late Fusion Multimodal EHR Models

链接: https://arxiv.org/abs/2604.09905
作者: Tyler Yang,Romal Mitr
类目: Machine Learning (cs.LG)
*备注: 10 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Emergency department triage relies heavily on both quantitative vital signs and qualitative clinical notes, yet multimodal machine learning models predicting triage acuity often suffer from modality collapse by over-relying on structured tabular data. This limitation severely hinders demographic generalizability, particularly for pediatric patients where developmental variations in vital signs make unstructured clinical narratives uniquely crucial. To address this gap, we propose a late-fusion multimodal architecture that processes tabular vitals via XGBoost and unstructured clinical text via Bio_ClinicalBERT, combined through a Logistic Regression meta-classifier to predict the 5-level Emergency Severity Index. To explicitly target the external validity problem, we train our model exclusively on adult encounters from the MIMIC-IV and NHAMCS datasets and evaluate its zero-shot generalization on a traditionally overlooked pediatric cohort. Furthermore, we employ symmetric modality dropout during training to prevent the ensemble from overfitting to adult-specific clinical correlations. Our results demonstrate that the multimodal framework significantly outperforms single-modality baselines. Most notably, applying a 30-40% symmetric modality dropout rate yielded steep performance improvements in the unseen pediatric cohort, elevating the Quadratic Weighted Kappa to 0.351. These findings highlight modality dropout as a critical regularization technique for mitigating modality collapse and enhancing cross-demographic generalization in clinical AI.

[LG-87] SemEnrich: Self-Supervised Semantic Enrichment of Radiology Reports for Vision-Language Learning

链接: https://arxiv.org/abs/2604.09887
作者: Halil Ibrahim Gulluk,Olivier Gevaert
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Medical vision-language datasets are often limited in size and biased toward negative findings, as clinicians report abnormalities mostly but might omit some positive/neutral findings because they might be considered as irrelevant to the patient’s condition. We propose a self-supervised data enrichment method that leverages semantic clustering of report sentences. Then we enrich the findings in the medical reports in the training set by adding positive/neutral observations from different clusters in a self-supervised manner. Our approach yields consistent gains in supervised fine-tuning (5.63%, 3.04%, 7.40%, 5.30%, 7.47% average gains on COMET score, Bert score, Sentence Bleu, CheXbert-F1 and RadGraph-F1 scores respectively). Ablation studies confirm that improvements stem from semantic clustering rather than random augmentation. Furthermore, we introduce a way to incorporate semantic cluster information into the reward design for GRPO training, which leads to further performance gains (2.78%, 3.14%, 12.80% average gains on COMET score, Bert score and Sentence Bleu scores respectively). We share our code at this https URL

[LG-88] Improving DNS Exfiltration Detection via Transformer Pretraining

链接: https://arxiv.org/abs/2604.09849
作者: Miloš Tomić,Aleksa Cvetanović,Predrag Tadić
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: This is the preprint version of the paper. The final version of the paper has been presented at the TELFOR 2025 conference. The paper has 4 pages, 1 figure and 3 tables

点击查看摘要

Abstract:We study whether in-domain pretraining of Bidirectional Encoder Representations from Transformer (BERT) model improves subdomain-level detection of exfiltration at low false positive rates. While previous work mostly examines fine-tuned generic Transformers, it does not aim to isolate the effect of pretraining on the downstream task of classification. To address this gap, we develop a controlled pipeline where we freeze operating points on validation and transfer them to the test set, thus enabling clean ablations across different label and pretraining budgets. Our results show significant improvements in the left tail of the Receiver Operating Characteristic (ROC) curve, especially against randomly initialized baseline. Additionally, within pretrained model variants, increasing the number of pretraining steps helps the most when more labeled data are available for fine-tuning.

[LG-89] Below-ground Fungal Biodiversity Can be Monitored Using Self-Supervised Learning Satellite Features

链接: https://arxiv.org/abs/2604.09818
作者: Robin Young,Michael E. Van Nuland,E. Toby Kiers,Tomáš Větrovský,Petr Kohout,Petr Baldrian,Srinivasan Keshav
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Mycorrhizal fungi are vital to terrestrial ecosystem functioning. Yet monitoring their biodiversity at landscape scales is often unfeasible due to time and cost constraints. Current predictions suggest that 90% of mycorrhizal diversity hotspots remain unprotected, opening questions of how to broadly and effectively map underground fungal communities. Here, we show that self-supervised learning (SSL) applied to satellite imagery can predict below-ground ectomycorrhizal fungal richness across diverse environments. Our models explain over half the variance in species richness across ~12,000 field samples spanning Europe and Asia. SSL-derived features prove to be the single most informative predictor, subsuming the majority of information contained in climate, soil, and land cover datasets. Using this approach, we achieve a 10,000-fold increase in spatial resolution over existing techniques, moving from 1km landscape averages to 10m habitat-scale observations with nearly no systematic bias. As satellite observations are dynamic rather than static, this enables temporal monitoring of below-ground biodiversity at landscape scales for the first time. We analyze multi-year trends in predicted fungal richness across UK National Park woodlands, finding that ancient forests may be losing ectomycorrhizal diversity at disproportionate rates. These results establish SSL satellite features as a scalable tool for extending sparse field observations to continuous, high-resolution biodiversity maps for monitoring the invisible half of terrestrial ecosystems.

[LG-90] NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity CVPR2026

链接: https://arxiv.org/abs/2604.09817
作者: Weijian Mai,Mu Nan,Yu Zhu,Jiahang Cao,Rui Zhang,Yuqin Dai,Chunfeng Song,Andrew F. Luo,Jiamin Wu
类目: Machine Learning (cs.LG)
*备注: Accepted to CVPR 2026. Project page: this https URL

点击查看摘要

Abstract:Visual encoding and decoding models act as gateways to understanding the neural mechanisms underlying human visual perception. Typically, visual encoding models that predict brain activity from stimuli and decoding models that reproduce stimuli from brain activity are treated as distinct tasks, requiring separate models and training procedures. This separation is inefficient and fails to model the consistency between encoding and decoding processes. To address this limitation, we propose NeuroFlow, the first unified framework that jointly models visual encoding and decoding from neural activity within a single flow model. NeuroFlow introduces two key components: (1) NeuroVAE is designed as a variational backbone to model neural variability and establish a compact, semantically structured latent space for bidirectional modeling across visual and neural modalities. (2) Cross-modal Flow Matching (XFM) bypasses the typical paradigm of noise-to-data diffusion guided by a specific modality condition, instead learning a reversibly consistent flow model between visual and neural latent distributions. For the first time, visual encoding and decoding are reformulated as a time-dependent, reversible process within a shared latent space for unified modeling. Empirical results demonstrate that NeuroFlow achieves superior overall performance in visual encoding and decoding tasks with higher computational efficiency compared to any isolated methods. We further analyze principal factors that steer the model toward encoding-decoding consistency and, through brain functional analyses, demonstrate that NeuroFlow captures consistent activation patterns underlying neural variability. NeuroFlow marks a major step toward unified visual encoding and decoding from neural activity, providing mechanistic insights that inform future bidirectional visual brain-computer interfaces.

[LG-91] Sustainable Transformer Neural Network Acceleration with Stochastic Photonic Computing

链接: https://arxiv.org/abs/2604.09759
作者: S. Afifi,O. Alo,I. Thakkar,S. Pasricha
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformers achieve state-of-the-art performance in natural language processing, vision, and scientific computing, but demand high computation and memory. To address these challenges, we present ASTRA, the first silicon-photonic accelerator leveraging stochastic computing for transformers. ASTRA employs novel optical stochastic multipliers and unary/analog homodyne accumulation in a crosstalk-minimal organization to efficiently process dynamic tensor computations. Evaluations show at least 7.6x speedup and 1.3x lower energy overheads compared to state-of-the-art accelerators, highlighting ASTRA’s potential for efficient, scalable, and sustainable transformer inference.

[LG-92] Spectral Kernel Dynamics via Maximum Caliber: Fixed Points Geodesics and Phase Transitions

链接: https://arxiv.org/abs/2604.09745
作者: Jnaneshwar Das
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 15 pages, 7 figures

点击查看摘要

Abstract:We derive a closed-form geometric functional for kernel dynamics on finite graphs by applying the Maximum Caliber (MaxCal) variational principle to the spectral transfer function h(lambda) of the graph Laplacian eigenbasis. The main result is that the MaxCal stationarity condition decouples into N one-dimensional problems with explicit solution: h*(lambda_l) = h_0(lambda_l) exp(-1 - T_l[h*]), yielding self-consistent (fixed-point) kernels via exponential tilting (Corollary 1), log-linear Fisher-Rao geodesics (Corollary 2), a diagonal Hessian stability criterion (Corollary 3), and an l^2_+ isometry for the spectral kernel space (Proposition 3). The spectral entropy H[h_t] provides a computable O(N) early-warning signal for network-structural phase transitions (Remark 7). All claims are numerically verified on the path graph P_8 with a Gaussian mutual-information source, using the open-source kernelcal library. The framework is grounded in a structural analogy with Einstein’s field equations, used as a guiding template rather than an established equivalence; explicit limits are stated in Section 6.

[LG-93] Isomorphic Functionalities between Ant Colony and Ensemble Learning: Part III – Gradient Descent Neural Plasticity and the Emergence of Deep Intelligence

链接: https://arxiv.org/abs/2604.09677
作者: Ernest Fokoué,Gregory Babbitt,Yuval Levental
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 25 pages, 10 figures, 3 tables

点击查看摘要

Abstract:In Parts I and II of this series, we established isomorphisms between ant colony decision-making and two major families of ensemble learning: random forests (parallel, variance reduction) and boosting (sequential, bias reduction). Here we complete the trilogy by demonstrating that the fundamental learning algorithm underlying deep neural networks – stochastic gradient descent – is mathematically isomorphic to the generational learning dynamics of ant colonies. We prove that pheromone evolution across generations follows the same update equations as weight evolution during gradient descent, with evaporation rates corresponding to learning rates, colony fitness corresponding to negative loss, and recruitment waves corresponding to backpropagation passes. We further show that neural plasticity mechanisms – long-term potentiation, long-term depression, synaptic pruning, and neurogenesis – have direct analogs in colony-level adaptation: trail reinforcement, evaporation, abandonment, and new trail formation. Comprehensive simulations confirm that ant colonies trained on environmental tasks exhibit learning curves indistinguishable from neural networks trained on analogous problems. This final isomorphism reveals that all three major paradigms of machine learning – parallel ensembles, sequential ensembles, and gradient-based deep learning – have direct analogs in the collective intelligence of social insects, suggesting a unified theory of learning that transcends substrate. The ant colony, we conclude, is not merely analogous to learning algorithms; it is a living embodiment of the fundamental principles of learning itself.

[LG-94] Belief-State RWKV for Reinforcement Learning under Partial Observability

链接: https://arxiv.org/abs/2604.09671
作者: Liu Xiao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a stronger formulation of RL on top of RWKV-style recurrent sequence models, in which the fixed-size recurrent state is explicitly interpreted as a belief state rather than an opaque hidden vector. Instead of conditioning policy and value on a single summary h_t, we maintain a compact uncertainty-aware state b_t = (\mu_t, \Sigma_t) derived from RWKV-style recurrent statistics and let control depend on both memory and uncertainty. This design targets a key weakness of plain fixed-state policies in partially observed settings: they may store evidence, but not necessarily confidence. We present the method, a theoretical program, and a pilot RL experiment with hidden episode-level observation noise together with a test-time noise sweep. The pilot shows that belief-state policies nearly match the best recurrent baseline overall while slightly improving return on the hardest in-distribution regime and under a held-out noise shift. Additional ablations show that this simple belief readout is currently stronger than two more structured extensions, namely gated memory control and privileged belief targets, underscoring the need for richer benchmarks.

[LG-95] ML-Based Real-Time Downlink Performance Prediction in Standalone 5G NR Using Smartphones

链接: https://arxiv.org/abs/2604.09632
作者: Md Mahfuzur Rahman,Jareen Shuva,Nishith Tripathi,Jeffrey H. Reed,Lingjia Liu
类目: Networking and Internet Architecture (cs.NI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a machine learning (ML)-based framework for downlink performance prediction in 5G networks using real-time measurements from commercial off-the-shelf (COTS) user equipment (UE). Our experimental platform integrates the srsRAN 5G New Radio (NR) stack deployed on a Dell desktop serving as the 5G next generation nodeB (gNB), operating at 3.4 GHz. Two Google Pixel 7a smartphones are used to collect physical layer characteristics such as channel quality indicator (CQI), modulation and coding scheme (MCS), bit rate, transmission time interval (TTI), and block error rate (BLER), which are leveraged as predictors in model training. We use commercial-grade traffic generation tools, including Ookla, for stationary and mobility measurements under line-of-sight (LOS) and non-line-of-sight (nLOS) conditions. Test data includes global Ookla servers (e.g., USA, Portugal, Ghana, Egypt, Japan), iperf TCP/UDP data, and video streaming sessions from YouTube. To analyze inter-user interference, we also include scenarios with multiple UEs at the same location. We evaluate the predictive performance of five supervised regression models - linear regression, decision tree regression, random forest regression, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM). Our results demonstrate that throughput and BLER can be accurately predicted using COTS hardware and standard ML techniques in diverse real-world 5G scenarios.

[LG-96] Investigating Vaccine Buyers Remorse: Post-Vaccination Decision Regret in COVID-19 Social Media Using Politically Diverse Human Annotation

链接: https://arxiv.org/abs/2604.09626
作者: Miles Stanley,Soumyajit Datta,Ashutosh Kumar,Ashiqur R. KhudaBukhsh
类目: Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:A significant gap exists in datasets regarding post-COVID-19 vaccination experiences, particularly ``vaccine buyer’s remorse’'. Understanding the prevalence and nature of vaccine regret, whether based on personal or vicarious experiences, is vital for addressing vaccine hesitancy and refining public health communication. In this paper, we curate a novel dataset from a large YouTube news corpus capturing COVID-19 vaccination experiences, and construct a benchmark subset focused on vaccine regret, annotated by a politically diverse panel to account for the subjective and often politicized nature of the topic. We utilize large language models (LLMs) to identify posts expressing vaccine regret, analyze the reasons behind this regret, and quantify its occurrence in both first and second-person accounts. This paper aims to (1) quantify the prevalence of vaccine regret; (2) identify common reasons for this sentiment; (3) analyze differences between first-person and vicarious experiences; and (4) assess potential biases introduced by different LLMs. We find that while vaccine buyer’s remorse appears in only 2% of public discourse, it is disproportionately concentrated in vaccine-skeptic influencer communities and is predominantly expressed through first-person narratives citing adverse health events.

[LG-97] he Diffusion-Attention Connection

链接: https://arxiv.org/abs/2604.09560
作者: Julio Candanedo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformers, diffusion-maps, and magnetic Laplacians are usually treated as separate tools; we show they are all different regimes of a single Markov geometry built from pre-softmax query-scores. We define a QK “bidivergence” whose exponentiated and normalized forms yield attention, diffusion-maps, and magnetic diffusion. And use product of experts and Schrödinger-bridges to connect and organize them into equilibrium, nonequilibrium steady-state, and driven dynamics.

[LG-98] VTC: DNN Compilation with Virtual Tensors for Data Movement Elimination OSDI’26

链接: https://arxiv.org/abs/2604.09558
作者: Muyan Hu,Ahan Gupta,Jiachen Yuan,Vima Gupta,Taeksang Kim,Xin Xu,Janardhan Kulkarni,Ofer Dekel,Vikram Adve,Charith Mendis
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: Accepted to OSDI’26

点击查看摘要

Abstract:With the widening gap between compute and memory operation latencies, data movement optimizations have become increasingly important for DNN compilation. Current optimizations such as layout transformations and operator fusion only target a subset of tensor operators and consequently miss important opportunities for reducing data movement in contemporary DNN workloads, including large language models. We introduce VTC, a novel tensor compilation framework that for the first time eliminates all unnecessary data movement by targeting the full spectrum of data movement operators. VTC proposes the concept of virtual tensors to track data movement between compute operators via index mappings rather than expensive physical data transfers to and from global memory, which can seamlessly interoperate with existing computation kernels and handle arbitrary tensor operator compositions. We also introduce a novel data movement elimination algorithm to automatically identify a profitable virtual tensor creation strategy. Evaluation on a variety of DNNs shows that VTC can outperform existing ML compilers by up to 1.93x (1.28x on average) on NVIDIA GPUs with up to 60% (17.5% on average) inference memory savings. Comments: Accepted to OSDI’26 Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Programming Languages (cs.PL) Cite as: arXiv:2604.09558 [cs.DC] (or arXiv:2604.09558v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2604.09558 Focus to learn more arXiv-issued DOI via DataCite

[LG-99] Universality of first-order methods on random and deterministic matrices

链接: https://arxiv.org/abs/2604.11729
作者: Nicola Gorini,Chris Jones,Dmitriy Kunisky,Lucas Pesenti
类目: Probability (math.PR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:General first-order methods (GFOM) are a flexible class of iterative algorithms which update a state vector by matrix-vector multiplications and entrywise nonlinearities. A long line of work has sought to understand the large-n dynamics of GFOM, mostly focusing on “very random” input matrices and the approximate message passing (AMP) special case of GFOM whose state is asymptotically Gaussian. Yet, it has long remained unknown how to construct iterative algorithms that retain this Gaussianity for more structured inputs, or why existing AMP algorithms can be as effective for some deterministic matrices as they are for random matrices. We analyze diagrammatic expansions of GFOM via the limiting traffic distribution of the input matrix, the collection of all limiting values of permutation-invariant polynomials in the matrix entries, to obtain the following results: 1. We calculate the traffic distribution for the first non-trivial deterministic matrices, including (minor variants of) the Walsh-Hadamard and discrete sine and cosine transform matrices. This determines the limiting dynamics of GFOM on these inputs, resolving parts of longstanding conjectures of Marinari, Parisi, and Ritort (1994). 2. We design a new AMP iteration which unifies several previous AMP variants and generalizes to new input types, whose limiting dynamics are Gaussian conditional on some latent random variables. The asymptotic dynamics hold for a large and natural class of traffic distributions (encompassing both random and deterministic input matrices) and the algorithm’s analysis gives a simple combinatorial interpretation of the Onsager correction, answering questions posed recently by Wang, Zhong, and Fan (2022). Subjects: Probability (math.PR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST) Cite as: arXiv:2604.11729 [math.PR] (or arXiv:2604.11729v1 [math.PR] for this version) https://doi.org/10.48550/arXiv.2604.11729 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Lucas Pesenti [view email] [v1] Mon, 13 Apr 2026 17:03:53 UTC (251 KB)

[LG-100] Computation of Least Trimmed Squares: A Branch-and-Bound framework with Hyperplane Arrangement Enhancements

链接: https://arxiv.org/abs/2604.11584
作者: Xiang Meng,Andrés Gómez,Rahul Mazumder
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We study computational aspects of a key problem in robust statistics – the penalized least trimmed squares (LTS) regression problem, a robust estimator that mitigates the influence of outliers in data by capping residuals with large magnitudes. Although statistically attractive, penalized LTS is NP-hard, and existing mixed-integer optimization (MIO) formulations scale poorly due to weak relaxations and exponential worst-case complexity in the number of observations. We propose a new MIO formulation that embeds hyperplane arrangement logic into a perspective reformulation, explicitly enforcing structural properties of optimal solutions. We show that, if the number of features is fixed, the resulting branch-and-bound tree is of polynomial size in the sample size. Moreover, we develop a tailored branch-and-bound algorithm that uses first-order methods with dual bounds to solve node relaxations efficiently. Computational experiments on synthetic and real datasets demonstrate substantial improvements over existing MIO approaches: on synthetic instances with 5000 samples and 20 features, our tailored solver reaches a 1% gap in 1 minute while competing approaches fail to do so within one hour. These gains enable exact robust regression at significantly larger sample sizes in low-dimensional settings.

[LG-101] Machine-learning modeling of magnetization dynamics in quasi-equilibrium and driven metallic spin systems

链接: https://arxiv.org/abs/2604.11513
作者: Gia-Wei Chern,Yunhao Fan,Sheng Zhang,Puhan Zhang
类目: rongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 19 pages, 12 figures

点击查看摘要

Abstract:We review recent advances in machine-learning (ML) force-field methods for large-scale Landau-Lifshitz-Gilbert (LLG) simulations of metallic spin systems. We generalize the Behler-Parrinello (BP) ML architecture – originally developed for quantum molecular dynamics – to construct scalable and transferable ML models capable of capturing the intricate dependence of electron-mediated exchange fields on the local magnetic environment characteristic of itinerant magnets. A central ingredient of this framework is the implementation of symmetry-aware magnetic descriptors based on group-theoretical bispectrum formalisms. Leveraging these ML force fields, LLG simulations faithfully reproduce hallmark non-collinear magnetic orders – such as the 120^\circ and tetrahedral states – on the triangular lattice, and successfully capture the complex spin textures emerging in the mixed-phase states of a square-lattice double-exchange model under thermal quench. We further discuss a generalized potential theory that extends the BP formalism to incorporate both conservative and nonconservative electronic torques, thereby enabling ML models to learn nonequilibrium exchange fields from computationally demanding microscopic approaches such as nonequilibrium Green’s-function techniques. This extension yields quantitatively accurate predictions of voltage-driven domain-wall motion and establishes a foundation for quantum-accurate, multiscale modeling of nonequilibrium spin dynamics and spintronic functionalities.

[LG-102] GlobalCY I: A JAX Framework for Globally Defined and Symmetry-Aware Neural Kähler Potentials

链接: https://arxiv.org/abs/2604.11404
作者: Abdul Rahman
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); Algebraic Geometry (math.AG)
*备注: Initial draft

点击查看摘要

Abstract:We present \emphGlobalCY, a JAX-based framework for globally defined and symmetry-aware neural Kähler-potential models on projective hypersurface Calabi–Yau geometries. The central problem is that local-input neural Kähler-potential models can train successfully while still failing the geometry-sensitive diagnostics that matter in hard quartic regimes, especially near singular and near-singular members of the Cefalú family. To study this, we compare three model families – a local-input baseline, a globally defined invariant model, and a symmetry-aware global model – on the hard Cefalú cases \lambda=0.75 and \lambda=1.0 using a fixed multi-seed protocol and a geometry-aware diagnostic suite. In this benchmark, the globally defined invariant model is the strongest overall family, outperforming the local baseline on the two clearest geometric comparison metrics, negative-eigenvalue frequency and projective-invariance drift, in both cases. The gains are strongest at \lambda=0.75 , while \lambda=1.0 remains more difficult. The current symmetry-aware model improves projective-invariance drift relative to the local baseline, but does not yet surpass the plain global invariant model. These results show that global invariant structure is a meaningful architectural constraint for learned Kähler-potential modeling in hard quartic Calabi–Yau settings.

[LG-103] Signal-Aware Conditional Diffusion Surrogates for Transonic Wing Pressure Prediction

链接: https://arxiv.org/abs/2604.11263
作者: Víctor Francés-Belda,Carlos Sanmiguel Vila,Rodrigo Castellanos
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: 18 pages, 9 figures

点击查看摘要

Abstract:Accurate and efficient surrogate models for aerodynamic surface pressure fields are essential for accelerating aircraft design and analysis, yet deterministic regressors trained with pointwise losses often smooth sharp nonlinear features. This work presents a conditional denoising diffusion probabilistic model for predicting surface pressure distributions on the NASA Common Research Model wing under varying conditions of Mach number, angle of attack, and four control surface deflections. The framework operates on unstructured surface data through a principal component representation used as a non-truncated, reversible linear reparameterization of the pressure field, enabling a fully connected architecture. A signal-aware training objective is derived by propagating a reconstruction loss through the diffusion process, yielding a timestep-dependent weighting that improves fidelity in regions with strong pressure gradients. The stochastic sampling process is analyzed through repeated conditional generations, and two diagnostic metrics are introduced, the Local Reliability Index and Global Reliability Index, to relate sampling-induced spread to reconstruction error. Relative to the considered deterministic baselines, the proposed formulation reduces mean absolute error and improves the reconstruction of suction peaks, shock structures, and control surface discontinuities. The sampling-induced spread exhibits strong correspondence with surrogate error, supporting its interpretation as a qualitative reliability indicator rather than calibrated uncertainty quantification.

[LG-104] rustworthy Feature Importance Avoids Unrestricted Permutations

链接: https://arxiv.org/abs/2604.11253
作者: Emanuele Borgonovo,Francesco Cappelli,Xuefei Lu,Elmar Plischke,Cynthia Rudin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Feature importance methods using unrestricted permutations are flawed due to extrapolation errors; such errors appear in all non-trivial variable importance approaches. We propose three new approaches: conditional model reliance and Knockoffs with Gaussian transformation, and restricted ALE plot designs. Theoretical and numerical results show our strategies reduce/eliminate extrapolation.

[LG-105] Probabilistic Prediction of Neural Dynamics via Autoregressive Flow Matching

链接: https://arxiv.org/abs/2604.11178
作者: Nicole Rogalla,Yuzhen Qin,Mario Senden,Ahmed El-Gazzar,Marcel van Gerven
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 25 pages, 4 figures

点击查看摘要

Abstract:Forecasting neural activity in response to naturalistic stimuli remains a key challenge for understanding brain dynamics and enabling downstream neurotechnological applications. Here, we introduce a generative forecasting framework for modeling neural dynamics based on autoregressive flow matching (AFM). Building on recent advances in transport-based generative modeling, our approach probabilistically predicts neural responses at scale from multimodal sensory input. Specifically, we learn the conditional distribution of future neural activity given past neural dynamics and concurrent sensory input, explicitly modeling neural activity as a temporally evolving process in which future states depend on recent neural history. We evaluate our framework on the Algonauts project 2025 challenge functional magnetic resonance imaging dataset using subject-specific models. AFM significantly outperforms both a non-autoregressive flow-matching baseline and the official challenge general linear model baseline in predicting short-term parcel-wise blood oxygenation level-dependent (BOLD) activity, demonstrating improved generalization and widespread cortical prediction performance. Ablation analyses show that access to past BOLD dynamics is a dominant driver of performance, while autoregressive factorization yields consistent, modest gains under short-horizon, context-rich conditions. Together, these findings position autoregressive flow-based generative modeling as an effective approach for short-term probabilistic forecasting of neural dynamics with promising applications in closed-loop neurotechnology.

[LG-106] DDO-RM for LLM Preference Optimization: A Minimal Held-Out Benchmark against DPO

链接: https://arxiv.org/abs/2604.11119
作者: Tiantian Zhang,Jierui Zuo,Wenping Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures

点击查看摘要

Abstract:This paper reorganizes the current manuscript around the DPO versus DDO-RM preference-optimization project and focuses on two parts: the algorithmic view and the preliminary held-out benchmark. The benchmark asks a narrow question: even in a minimal pairwise chosen-versus-rejected setting, can a reward-guided decision-distribution update outperform a direct pairwise objective? We compare Direct Preference Optimization (DPO) against DDO-RM on EleutherAI/pythia-410m using HuggingFaceH4/ultrafeedback_binarized, evaluate on the held-out test_prefs split, and report results for seeds 42, 13, and 3407. Algorithmically, DDO-RM treats each prompt as a finite decision problem over candidate responses. Instead of optimizing only a binary chosen-rejected relation, it forms a policy distribution over candidates, centers reward-model scores under that distribution, and distills a reward-guided target distribution back into the policy. In the current public benchmark, DDO-RM improves mean pair accuracy from 0.5238 to 0.5602, AUC from 0.5315 to 0.5382, and mean margin from 0.1377 to 0.5353 relative to DPO. These are encouraging but still preliminary results: the study covers one model family, one dataset, one held-out evaluation split, and three seeds. Comments: 8 pages, 4 figures Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG) Cite as: arXiv:2604.11119 [stat.ML] (or arXiv:2604.11119v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2604.11119 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-107] Generating Hadamard matrices with transformers

链接: https://arxiv.org/abs/2604.11101
作者: Geordie Williamson,Oded Yacobi,Paul Zinn-Justin
类目: Combinatorics (math.CO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a new method for constructing Hadamard matrices that combines transformer neural networks with local search in the PatternBoost framework. Our approach is designed for extremely sparse combinatorial search problems and is particularly effective for Hadamard matrices of Goethals–Seidel type, where Fourier methods permit fast scoring and optimisation. For orders between 100 and 250 , it produces large numbers of inequivalent Hadamard matrices, and in harder cases it succeeds where local search from random initialisation fails. The largest example found by our method has order 244 . In addition to these new constructions, our experiments reveal that the transformer can discover and exploit useful hidden symmetry in the search space.

[LG-108] Neural Generalized Mixed-Effects Models

链接: https://arxiv.org/abs/2604.10976
作者: Yuli Slavutsky,Sebastian Salazar,David M. Blei
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Generalized linear mixed-effects models (GLMMs) are widely used to analyze grouped and hierarchical data. In a GLMM, each response is assumed to follow an exponential-family distribution where the natural parameter is given by a linear function of observed covariates and a latent group-specific random effect. Since exact marginalization over the random effects is typically intractable, model parameters are estimated by maximizing an approximate marginal likelihood. In this paper, we replace the linear function with neural networks. The result is a more flexible model, the neural generalized mixed-effects model (NGMM), which captures complex relationships between covariates and responses. To fit NGMM to data, we introduce an efficient optimization procedure that maximizes the approximate marginal likelihood and is differentiable with respect to network parameters. We show that the approximation error of our objective decays at a Gaussian-tail rate in a user-chosen parameter. On synthetic data, NGMM improves over GLMMs when covariate-response relationships are nonlinear, and on real-world datasets it outperforms prior methods. Finally, we analyze a large dataset of student proficiency to demonstrate how NGMM can be extended to more complex latent-variable models.

[LG-109] bioLeak: Leakage-Aware Modeling and Diagnostics for Machine Learning in R

链接: https://arxiv.org/abs/2604.10965
作者: Selçuk Korkmaz
类目: Computation (stat.CO); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 35 pages, 4 figures

点击查看摘要

Abstract:Data leakage remains a recurrent source of optimistic bias in biomedical machine learning studies. Standard row-wise cross-validation and globally estimated preprocessing steps are often inappropriate for data with repeated measurements, study-level heterogeneity, batch effects, or temporal dependencies. This paper describes bioLeak, an R package for constructing leakage-aware resampling workflows and for auditing fitted models for common leakage mechanisms. The package provides leakage-aware split construction, train-fold-only preprocessing, cross-validated model fitting, nested hyperparameter tuning, post hoc leakage audits, and HTML reporting. The implementation supports binary classification, multiclass classification, regression, and survival analysis, with task-specific metrics and S4 containers for splits, fits, audits, and inflation summaries. The simulation artifacts show how apparent performance changes under controlled leakage mechanisms, and the case study illustrates how guarded and leaky pipelines can yield materially different conclusions on multi-study transcriptomic data. The emphasis throughout is on software design, reproducible workflows, and interpretation of diagnostic output.

[LG-110] One-Step Score-Based Density Ratio Estimation

链接: https://arxiv.org/abs/2604.10672
作者: Wei Chen,Qibin Zhao,John Paisley,Junmei Yang,Delu Zeng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Density ratio estimation (DRE) is a useful tool for quantifying discrepancies between probability distributions, but existing approaches often involve a trade-off between estimation quality and computational efficiency. Classical direct DRE methods are usually efficient at inference time, yet their performance can seriously deteriorate when the discrepancy between distributions is large. In contrast, score-based DRE methods often yield more accurate estimates in such settings, but they typically require considerable repeated function evaluations and numerical integration. We propose One-step Score-based Density Ratio Estimation (OS-DRE), a partly analytic and solver-free framework designed to combine these complementary advantages. OS-DRE decomposes the time score into spatial and temporal components, representing the latter with an analytic radial basis function (RBF) frame. This formulation converts the otherwise intractable temporal integral into a closed-form weighted sum, thereby removing the need for numerical solvers and enabling DRE with only one function evaluation. We further analyze approximation conditions for the analytic frame, and establish approximation error bounds for both finitely and infinitely smooth temporal kernels, grounding the framework in existing approximation theory. Experiments across density estimation, continual Kullback-Leibler and mutual information estimation, and near out-of-distribution detection demonstrate that OS-DRE offers a favorable balance between estimation quality and inference efficiency.

[LG-111] A Deep Generative Approach to Stratified Learning

链接: https://arxiv.org/abs/2604.10650
作者: Randy Martinez,Rong Tang,Lizhen Lin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 79 pages, 5 figures

点击查看摘要

Abstract:While the manifold hypothesis is widely adopted in modern machine learning, complex data is often better modeled as stratified spaces – unions of manifolds (strata) of varying dimensions. Stratified learning is challenging due to varying dimensionality, intersection singularities, and lack of efficient models in learning the underlying distributions. We provide a deep generative approach to stratified learning by developing two generative frameworks for learning distributions on stratified spaces. The first is a sieve maximum likelihood approach realized via a dimension-aware mixture of variational autoencoders. The second is a diffusion-based framework that explores the score field structure of a mixture. We establish the convergence rates for learning both the ambient and intrinsic distributions, which are shown to be dependent on the intrinsic dimensions and smoothness of the underlying strata. Utilizing the geometry of the score field, we also establish consistency for estimating the intrinsic dimension of each stratum and propose an algorithm that consistently estimates both the number of strata and their dimensions. Theoretical results for both frameworks provide fundamental insights into the interplay of the underlying geometry, the ambient noise level, and deep generative models. Extensive simulations and real dataset applications, such as molecular dynamics, demonstrate the effectiveness of our methods.

[LG-112] Adaptive H-EFT-VA: A Provably Safe Trajectory Through the Trainability-Expressibility Landscape of Variational Quantum Algorithms

链接: https://arxiv.org/abs/2604.10607
作者: Eyad I. B. Hamid
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
*备注: 17 figures

点击查看摘要

Abstract:H-EFT-VA established a physics-informed solution to the Barren Plateau (BP) problem via a hierarchical EFT UV-cutoff, guaranteeing gradient variance in Omega(1/poly(N)). However, localization restricts the ansatz to a polynomial subspace, creating a reference-state gap for states distant from |0^N. We introduce Adaptive H-EFT-VA (A-H-EFT) to navigate the trainability-expressibility tradeoff by expanding the reachable Hilbert space along a safe trajectory. Gradient variance is maintained in Omega(1/poly(N)) if sigma(t) = 0.5/sqrt(LN) (Theorem 1). A Safe Expansion Corollary and Monotone Growth Lemma confirm expansion without discontinuous jumps. Benchmarking across 16 experiments (up to N=14) shows A-H-EFT achieves fidelity F=0.54, doubling static H-EFT-VA (F=0.27) and outperforming HEA (F~0.01), with gradient variance = 0.5 throughout. For Heisenberg XXZ (Delta_ref=1), A-H-EFT identifies the negative ground state while static methods fail. Results are statistically significant (p 10^-37). Robustness over three decades of hyperparameters enables deployment without search. This is the first rigorously bounded trajectory through the VQA landscape.

[LG-113] Orthogonal machine learning for conditional odds and risk ratios

链接: https://arxiv.org/abs/2604.10412
作者: Jiacheng Ge,Iván Díaz
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Conditional effects are commonly used measures for understanding how treatment effects vary across different groups, and are often used to target treatments/interventions to groups who benefit most. In this work we review existing methods and propose novel ones, focusing on the odds ratio (OR) and the risk ratio (RR). While estimation of the conditional average treatment effect (ATE) has been widely studied, estimators for the OR and RR lag behind, and cutting edge estimators such as those based on doubly robust transformations or orthogonal risk functions have not been generalized to these parameters. We propose such a generalization here, focusing on the DR-learner and the R-learner. We derive orthogonal risk functions for the OR and RR and show that the associated pseudo-outcomes satisfy second-order conditional-mean remainder properties analogous to the ATE case. We also evaluate estimators for the conditional ATE, OR, and RR in a comprehensive nonparametric Monte Carlo simulation study to compare them with common alternatives under hundreds of different data-generating distributions. Our numerical studies provide empirical guidance for choosing an estimator. For instance, they show that while parametric models are useful in very simple settings, the proposed nonparametric estimators significantly reduce bias and mean squared error in the more complex settings expected in the real world. We illustrate the methods in the analysis of physical activity and sleep trouble in U.S. adults using data from the National Health and Nutrition Examination Survey (NHANES). The results demonstrate that our estimators uncover substantial treatment effect heterogeneity that is obscured by traditional regression approaches and lead to improved treatment decision rules, highlighting the importance of data-adaptive methods for advancing precision health research.

[LG-114] Shuffling the Data Stretching the Step-size: Sharper Bias in constant step-size SGD ICLR2026

链接: https://arxiv.org/abs/2604.10373
作者: Konstantinos Emmanouilidis,Emmanouil-Vasileios Vlatakis-Gkaragkounis,Rene Vidal
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: Accepted in ICLR 2026 Conference

点击查看摘要

Abstract:From adversarial robustness to multi-agent learning, many machine learning tasks can be cast as finite-sum min-max optimization or, more generally, as variational inequality problems (VIPs). Owing to their simplicity and scalability, stochastic gradient methods with constant step size are widely used, despite the fact that they converge only up to a constant term. Among the many heuristics adopted in practice, two classical techniques have recently attracted attention to mitigate this issue: \emphRandom Reshuffling of data and \emphRichardson–Romberg extrapolation across iterates. Random Reshuffling sharpens the mean-squared error (MSE) of the estimated solution, while Richardson-Romberg extrapolation acts orthogonally, providing a second-order reduction in its bias. In this work, we show that their composition is strictly better than both, not only maintaining the enhanced MSE guarantees but also yielding an even greater cubic refinement in the bias. To the best of our knowledge, our work provides the first theoretical guarantees for such a synergy in structured non-monotone VIPs. Our analysis proceeds in two steps: (i) we smooth the discrete noise induced by reshuffling and leverage tools from continuous-state Markov chain theory to establish a novel law of large numbers and a central limit theorem for its iterates; and (ii) we employ spectral tensor techniques to prove that extrapolation debiases and sharpens the asymptotic behavior even under the biased gradient oracle induced by reshuffling. Finally, extensive experiments validate our theory, consistently demonstrating substantial speedups in practice.

[LG-115] Byzantine-Robust Distributed SGD: A Unified Analysis and Tight Error Bounds

链接: https://arxiv.org/abs/2604.10179
作者: Boyuan Ruan,Xiaoyu Wang,Ya-Feng Liu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Byzantine-robust distributed optimization relies on robust aggregation rules to mitigate the influence of malicious Byzantine workers. Despite the proliferation of such rules, a unified convergence analysis framework that accommodates general data heterogeneity is lacking. In this work, we provide a thorough convergence theory of Byzantine-robust distributed stochastic gradient descent (SGD), analyzing variants both with and without local momentum. We establish the convergence rates for nonconvex smooth objectives and those satisfying the Polyak-Lojasiewicz condition under a general data heterogeneity assumption. Our analysis reveals that while stochasticity and data heterogeneity introduce unavoidable error floors, local momentum provably reduces the error component induced by stochasticity. Furthermore, we derive matching lower bounds to demonstrate that the upper bounds obtained in our analysis are tight and characterize the fundamental limits of Byzantine resilience under stochasticity and data heterogeneity. Empirical results support our theoretical findings.

[LG-116] Continuous PT-Symmetry Breaking as a Design Variable for Giant Altermagnetic Spin Splitting

链接: https://arxiv.org/abs/2604.10173
作者: Kichan Chun,Gunn Kim
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 15 pages, 5 figures

点击查看摘要

Abstract:Magnetic point-group analysis classifies altermagnets but returns only a binary symmetry verdict, leaving spin-splitting energy (SSE) inaccessible without spin-polarized density functional theory (DFT). This binary ceiling is not fundamental. Sublattice symmetry breaking is promoted here to a continuous, DFT-free scalar – the Motif Symmetry-Breaking Index (MSBI) – that quantifies \mathcalPT -symmetry breaking between antiparallel magnetic motifs directly from crystal coordinates. SHAP analysis of an XGBoost surrogate trained on 3,851 DFT-labeled binary structures identifies three dominant descriptors: MSBI (symmetry-breaking axis), motif packing fraction MPF (superexchange axis), and the p/d electron ratio (covalency axis), each mapping onto a directly tunable experimental handle. A controlled VO–CrSb comparison within the same P 6_3 /mmc host lattice demonstrates that composition alone boosts SSE sevenfold. Bayesian optimization over this three-axis space, followed by independent DFT validation, recovers \alpha -NiS (SSE = 0.823 ,eV) as cross-validation against an independent symmetry-based prediction and identifies three previously unrecognized high-SSE candidates – square-planar FeS (1.297,eV), octahedral CoS (1.103,eV), and FeAs (1.089,eV) – all matching or exceeding CrSb. Square-planar Fe–S is proposed as a transferable coordination motif for giant altermagnetic spin splitting, advancing altermagnet design from symmetry classification to continuous quantitative optimization.

[LG-117] Accelerated Dopant Screening in Oxide Semiconductors via Multi-Fidelity Contextual Bandits and a Three-Tier DFT Validation Funnel

链接: https://arxiv.org/abs/2604.10157
作者: Abhinaba Basu
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Band gap engineering of oxide semiconductors through doping is critical for photocatalysis and optoelectronics, yet the combinatorial space of dopant elements, substitution sites, and co-doping combinations far exceeds typical density functional theory (DFT) budgets. We screen doped candidates across five oxide hosts (ZnO, TiO2, SrTiO3, SnO2, MgO), culminating in a 529-candidate ZnO co-doping campaign, and identify Cu-containing co-doped ZnO systems as consistently achieving visible-light-range band gaps (1.0-1.8 eV), with Y2Cu2 co-doped ZnO as the optimal candidate (1.84 eV). A three-tier validation funnel (PBE, PBE+U, ionic relaxation) reveals that no single level of theory suffices: V-doped ZnO shifts from near-metallic to wide-gap upon Hubbard U correction, while Cu-doped SrTiO3 enters the visible-light window only after correcting for d-electron localization. To make this screening tractable, we introduce a multi-fidelity screening strategy that replaces 81% of DFT evaluations with computationally inexpensive surrogate predictions, reducing a 529-candidate closed-loop Quantum ESPRESSO campaign from an estimated 440 to 62 CPU-hours while finding the global optimum in 100% of 50 independent trials (p = 5.0e-8 versus random screening, Wilcoxon signed-rank). Cross-host analysis of the dopant-host interaction matrix reveals that dopant performance is governed by just two latent chemical dimensions, enabling prediction of rankings in unseen hosts. All 583 DFT calculations, screening code, and stability proofs are released as an open benchmark. Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph) MSC classes: 68T05, 82D25, 90C26, 62L05 ACMclasses: I.2.6; J.2; G.1.6 Cite as: arXiv:2604.10157 [cond-mat.mtrl-sci] (or arXiv:2604.10157v1 [cond-mat.mtrl-sci] for this version) https://doi.org/10.48550/arXiv.2604.10157 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-118] Daily Predictions of F10.7 and F30 Solar Indices with Deep Learning

链接: https://arxiv.org/abs/2604.10045
作者: Zhenduo Wang,Yasser Abduallah,Jason T. L. Wang,Haimin Wang,Yan Xu,Vasyl Yurchyshyn,Vincent Oria,Khalid A. Alobaid,Xiaoli Bai
类目: olar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
*备注: 23 pages, 12 figures

点击查看摘要

Abstract:The F10.7 and F30 solar indices are the solar radio fluxes measured at wavelengths of 10.7 cm and 30 cm, respectively, which are key indicators of solar activity. F10.7 is valuable for explaining the impact of solar ultraviolet (UV) radiation on the upper atmosphere of Earth, while F30 is more sensitive and could improve the reaction of thermospheric density to solar stimulation. In this study, we present a new deep learning model, named the Solar Index Network, or SINet for short, to predict daily values of the F10.7 and F30 solar indices. The SINet model is designed to make medium-term predictions of the index values (1-60 days in advance). The observed data used for SINet training were taken from the National Oceanic and Atmospheric Administration (NOAA) as well as Toyokawa and Nobeyama facilities. Our experimental results show that SINet performs better than five closely related statistical and deep learning methods for the prediction of F10.7. Furthermore, to our knowledge, this is the first time deep learning has been used to predict the F30 solar index.

[LG-119] Predicting Associations between Solar Flares and Coronal Mass Ejections Using SDO/HMI Magnetograms and a Hybrid Neural Network

链接: https://arxiv.org/abs/2604.10016
作者: Jialiang Li,Vasyl Yurchyshyn,Jason T. L. Wang,Haimin Wang,Manolis K. Georgoulis,Wen He,Yasser Abduallah,Hameedullah A. Farooki,Yan Xu
类目: olar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
*备注: 14 pages, 8 figures

点击查看摘要

Abstract:Solar eruptions, including flares and coronal mass ejections (CMEs), have a significant impact on Earth. Some flares are associated with CMEs, and some flares are not. The association between flares and CMEs is not always obvious. In this study, we propose a new deep learning method, specifically a hybrid neural network (HNN) that combines a vision transformer with long short-term memory, to predict associations between flares and CMEs. HNN finds spatio-temporal patterns in the time series of line-of-sight magnetograms of solar active regions (ARs) collected by the Helioseismic and Magnetic Imager on board the Solar Dynamics Observatory and uses the patterns to predict whether a flare projected to occur within the next 24 hours will be eruptive (i.e., CME-associated) or confined (i.e., not CME-associated). Our experimental results demonstrate the good performance of the HNN method. Furthermore, the results show that magnetic flux cancellation in polarity inversion line regions may well play a role in triggering flare-associated CMEs, a finding consistent with literature.

[LG-120] Learning Whats Real: Disentangling Signal and Measurement Artifacts in Multi-Sensor Data with Applications to Astrophysics ICLR2026

链接: https://arxiv.org/abs/2604.09787
作者: Pablo Mercader-Perez,Carolina Cuesta-Lazaro,Daniel Muthukrishna,Jeroen Audenaert,V. Ashley Villar,David W. Hogg,Marc Huertas-Company,William T. Freeman
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
*备注: Accepted at the 2nd Workshop on Foundation Models for Science at ICLR 2026. 10 pages, 6 figures, plus appendix

点击查看摘要

Abstract:Data collected from the physical world is always a combination of multiple sources: an underlying signal from the physical process of interest and a signal from measurement-dependent artifacts from the sensor or instrument. This secondary signal acts as a confounding factor, limiting our ability to extract information about the physics underlying the phenomena we observe. Furthermore, it complicates the combination of observations in heterogeneous or multi-instrument settings. We propose a deep learning framework that leverages overlapping observations, a dual-encoder architecture, and a counterfactual generation objective to disentangle these factors of variation. The resulting representations explicitly separate intrinsic signals from sensor-specific distortions and noise, and can be used for counterfactual view generation, parameter inference unconfounded by measurement distortions, and instrument-independent similarity search. We demonstrate the effectiveness of our approach on astrophysical galaxy images from the DESI Legacy Imaging Survey (Legacy) and the Hyper Suprime-Cam (HSC) Survey as a representative multi-instrument setting. This framework provides a general recipe for scientific and multi-modal self-supervised pretraining: construct training pairs from overlapping observations of the same physical system, treat sensor- or modality-specific effects as augmentations, and learn invariant representations through counterfactual generation.

[LG-121] Discrete Flow Maps

链接: https://arxiv.org/abs/2604.09784
作者: Peter Potaptchik,Jason Yim,Adhi Saravanan,Peter Holderrieth,Eric Vanden-Eijnden,Michael S. Albergo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The sequential nature of autoregressive next-token prediction imposes a fundamental speed limit on large language models. While continuous flow models offer a path to parallel generation, they traditionally demand expensive iterative integration. Flow Maps bypass this bottleneck by compressing generative trajectories into single-step mappings, theoretically enabling the generation of full text sequences from noise in a single forward pass. However, standard formulations rely on Euclidean regression losses that are geometrically ill-suited for discrete data. In this work, we resolve this conflict with Discrete Flow Maps, a framework that reconciles trajectory compression with the geometry of the probability simplex. We recast standard flow map training for the discrete domain, aligning the training dynamics with the discrete nature of language. Empirically, this strict geometric alignment allows our method to surpass previous state-of-the-art results in discrete flow modeling.

[LG-122] Differentiable free energy surface: a variational approach to directly observing rare events using generative deep-learning models

链接: https://arxiv.org/abs/2604.09769
作者: Shuo-Hui Li,Chen Chen,Yao-Wen Zhang,Ding Pan
类目: Computational Physics (physics.comp-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: Main text: 20 pages, 5 figures. Supplement: 12 pages

点击查看摘要

Abstract:Rare events are central to the evolution of complex many-body systems, characterized as key transitional configurations on the free energy surface (FES). Conventional methods require adequate sampling of rare event transitions to obtain the FES, which is computationally very demanding. Here we introduce the variational free energy surface (VaFES), a dataset-free framework that directly models FESs using tractable-density generative models. Rare events can then be immediately identified from the FES with their configurations generated directly via one-shot sampling of generative models. By extending a coarse-grained collective variable (CV) into its reversible equivalent, VaFES constructs a latent space of intermediate representation in which the CVs explicitly occupy a subset of dimensions. This latent-space construction preserves the physical interpretability and transparent controllability of the CVs by design, while accommodating arbitrary CV formulations. The reversibility makes the system energy exactly accessible, enabling variational optimization of the FES without pre-generated simulation data. A single optimization yields a continuous, differentiable FES together with one-shot generation of rare-event configurations. Our method can reproduce the exact analytical solution for the bistable dimer potential and identify a chignolin native folded state in close alignment with the experimental NMR structure. Our approach thus establishes a scalable, systematic framework for advancing the study of complex statistical systems.

[LG-123] SHANG: Robust Stochastic Acceleration under Multiplicative Noise

链接: https://arxiv.org/abs/2603.09355
作者: Yaxin Yu,Long Chen,Minfu Feng
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 33 pages, 19 figures, 2 Tables

点击查看摘要

Abstract:Under the multiplicative noise scaling (MNS) condition, original Nesterov acceleration is provably sensitive to noise and may diverge when gradient noise overwhelms the signal. In this paper, we develop two accelerated stochastic gradient descent methods by discretizing the Hessian-driven Nesterov accelerated gradient flow. We first derive SHANG, a direct Gauss-Seidel-type discretization that already improves stability under MNS. We then introduce SHANG++, which adds a damping correction and achieves faster convergence with stronger noise robustness. We establish convergence guarantees for both convex and strongly convex objectives under MNS, together with explicit parameter choices. In our experiments, SHANG++ performs consistently well across convex problems and applications in deep learning. In a dedicated noise experiment on ResNet-34, a single hyperparameter configuration attains accuracy within 1% of the noise-free setting. Across all experiments, SHANG++ outperforms existing accelerated methods in robustness and efficiency, with minimal parameter sensitivity.

附件下载

点击下载今日全部论文列表