This post contains the latest paper list fetched from Arxiv.org on 2026-04-08, updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: paper data is fetched from Arxiv.org daily, with an automatic update scheduled at around 12:30 every morning.

Tip: if the list has not been updated for the day, either Arxiv released no new papers that day or the update script failed; whenever possible, failures are fixed the same day.

Table of Contents

Overview (2026-04-08)

A total of 676 papers were updated today, including:

  • Natural Language Processing: 121 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 221 papers (Artificial Intelligence, cs.AI)
  • Computer Vision: 134 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 140 papers (Machine Learning, cs.LG)
  • Multiagent Systems: 15 papers (Multiagent Systems, cs.MA)
  • Information Retrieval: 35 papers (Information Retrieval, cs.IR)
  • Human-Computer Interaction: 34 papers (Human-Computer Interaction, cs.HC)

Multiagent Systems

[MA-0] Who Governs the Machine? A Machine Identity Governance Taxonomy (MIGT) for AI Systems Operating Across Enterprise and Geopolitical Boundaries

[Quick Read]: This paper addresses a long-overlooked core problem in AI governance: the machine identities that AI systems depend on lack a unified governance framework. In today's enterprise environments, machine identities such as AI agents, service accounts, API tokens, and automated workflows outnumber human identities by ratios exceeding 80:1, yet existing governance regimes fail to cover their security risks and compliance requirements. The key to the solution is an integrated six-domain governance framework, the Machine Identity Governance Taxonomy (MIGT), which simultaneously closes the technical governance gap, the regulatory compliance gap, and the cross-jurisdictional coordination gap. It is complemented by an AI-Identity Risk Taxonomy (AIRT), a nation-state threat model, and a cross-jurisdictional regulatory alignment structure, culminating in an actionable four-phase implementation roadmap for systematic control of enterprise AI identities.

Link: https://arxiv.org/abs/2604.06148
Authors: Andrew Kurtz, Klaudia Krawiecka
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 75 pages (excl. references), 2 tables. Addresses policy makers, regulators, and practitioners at the intersection of AI governance, cybersecurity, and geopolitical risk

Click to view abstract

Abstract:The governance of artificial intelligence has a blind spot: the machine identities that AI systems use to act. AI agents, service accounts, API tokens, and automated workflows now outnumber human identities in enterprise environments by ratios exceeding 80 to 1, yet no integrated framework exists to govern them. A single ungoverned automated agent produced 5.4-10 billion in losses in the 2024 CrowdStrike outage; nation-state actors including Silk Typhoon and Salt Typhoon have operationalized ungoverned machine credentials as primary espionage vectors against critical infrastructure. This paper makes four original contributions. First, the AI-Identity Risk Taxonomy (AIRT): a comprehensive enumeration of 37 risk sub-categories across eight domains, each grounded in documented incidents, regulatory recognition, practitioner prevalence data, and threat intelligence. Second, the Machine Identity Governance Taxonomy (MIGT): an integrated six-domain governance framework simultaneously addressing the technical governance gap, the regulatory compliance gap, and the cross-jurisdictional coordination gap that existing frameworks address only in isolation. Third, a foreign state actor threat model for enterprise identity governance, establishing that Silk Typhoon, Salt Typhoon, Volt Typhoon, and North Korean AI-enhanced identity fraud operations have already operationalized AI identity vulnerabilities as active attack vectors. Fourth, a cross-jurisdictional regulatory alignment structure mapping enterprise AI identity governance obligations under EU, US, and Chinese frameworks simultaneously, identifying irreconcilable conflicts and providing a governance mechanism for managing them. A four-phase implementation roadmap translates the MIGT into actionable enterprise programs.

[MA-1] Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives ACL2026

[Quick Read]: This paper addresses the problem that a representative agent in a multi-agent system becomes less reliable during group decision-making under social influence. Although generative AI agents can reason independently, in simulated collaborative settings they are easily swayed by social dynamics such as peer opinions, power structures, and rhetorical strategies, drifting away from optimal decisions. The key to the solution is identifying and quantifying four socio-cognitive phenomena (social conformity, perceived expertise, the dominant speaker effect, and rhetorical persuasion) and systematically validating the negative impact of these social stressors on decision accuracy by controlling variables such as adversarial group size, relative intelligence, argument length, and argumentative style. The study shows that the fragility of multi-agent systems stems not only from individual reasoning quality but also from the social dynamics of their network configuration, providing theoretical and empirical grounding for more robust AI delegation mechanisms.

Link: https://arxiv.org/abs/2604.06091
Authors: Changgeon Ko, Jisu Shin, Hoyun Song, Huije Lee, Eui Jun Hwang, Jong C. Park
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: ACL 2026

Click to view abstract

Abstract:Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision. Drawing inspiration from social psychology, we investigate how the reliability of this representative agent is undermined by the social context of its network. We define four key phenomena (social conformity, perceived expertise, dominant speaker effect, and rhetorical persuasion) and systematically manipulate the number of adversaries, relative intelligence, argument length, and argumentative styles. Our experiments demonstrate that the representative agent’s accuracy consistently declines as social pressure increases: larger adversarial groups, more capable peers, and longer arguments all lead to significant performance degradation. Furthermore, rhetorical strategies emphasizing credibility or logic can further sway the agent’s judgment, depending on the context. These findings reveal that multi-agent systems are sensitive not only to individual reasoning but also to the social dynamics of their configuration, highlighting critical vulnerabilities in AI delegates that mirror the psychological biases observed in human group decision-making.

[MA-2] Polynomial-Time Algorithm for Thiele Voting Rules with Voter Interval Preferences

[Quick Read]: This paper tackles the problem of computing, for any Thiele voting rule, an optimal committee of size k on the Voter Interval domain (i.e., voters can be ordered so that each candidate is approved by a consecutive block of voters), a question that had remained open for ten years since Elkind and Lackner posed it in 2015. The key to the solution is a new structural result, a concavity theorem for families of intervals: given two feasible solutions of different sizes, one can construct a solution of any intermediate size whose score is at least the linear interpolation of the two scores; consequently, on Voter Interval profiles the optimal total Thiele score is a concave function of committee size. Exploiting this property, the authors apply a Lagrangian relaxation that moves the cardinality constraint into the objective and solve the resulting integer linear program in polynomial time, thanks to a totally unimodular constraint matrix. The algorithm and its proof were obtained via human-AI collaboration, with the main structural theorem produced in a single call to Gemini Deep Think.

Link: https://arxiv.org/abs/2604.05953
Authors: Pasin Manurangsi, Krzysztof Sornat
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Multiagent Systems (cs.MA)
Comments: 30 pages

Click to view abstract

Abstract:We present a polynomial-time algorithm for computing an optimal committee of size k under any given Thiele voting rule for elections on the Voter Interval domain (i.e., when voters can be ordered so that each candidate is approved by a consecutive block of voters). Our result extends to the Generalized Thiele rule, in which each voter has an individual weight (scoring) sequence. This resolves a 10-year-old open problem that was originally posed for Proportional Approval Voting and later extended to every Thiele rule (Elkind and Lackner, IJCAI 2015; Peters, AAAI 2018). Our main technical ingredient is a new structural result – a concavity theorem for families of intervals. It shows that, given two solutions of different sizes, one can construct a solution of any intermediate size whose score is at least the corresponding linear interpolation of the two scores. As a consequence, on Voter Interval profiles, the optimal total Thiele score is a concave function of the committee size. We exploit this concavity within an optimization framework based on a Lagrangian relaxation of a natural integer linear program formulation, obtained by moving the cardinality constraint into the objective. On Voter Interval profiles, the resulting constraint matrix is totally unimodular, so it can be solved in polynomial time. Our main algorithm and its proof were obtained via human–AI collaboration. In particular, a slightly simplified version of the main structural theorem used by the algorithm was obtained in a single call to Gemini Deep Think.
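A small sketch can illustrate why the concavity result matters computationally (this illustrates the property only, not the paper's Lagrangian algorithm): once the optimal score is known to be a concave function of the committee size k, the best k can be located by ternary search over candidate sizes instead of a full scan. The score function below is hypothetical.

```python
def argmax_concave(f, lo, hi):
    """Locate a maximizer of a concave integer function f on [lo, hi]
    using O(log(hi - lo)) evaluations (discrete ternary search)."""
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if f(m1) < f(m2):
            lo = m1 + 1   # maximum cannot lie left of m1
        else:
            hi = m2       # maximum cannot lie right of m2
    return max(range(lo, hi + 1), key=f)

# Hypothetical concave score curve over committee sizes 0..10.
score = lambda k: -(k - 6) ** 2 + 40
print(argmax_concave(score, 0, 10))  # → 6
```

Without concavity, every committee size would have to be evaluated separately; the theorem is what licenses this kind of shortcut.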

[MA-3] LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo

[Quick Read]: This paper addresses the difficulty of evaluating the strategic reasoning of large language models (LLMs) in stochastic multi-agent game settings, specifically the board game Ludo with its complex planning demands. The core problem is that existing benchmarks struggle to systematically quantify the quality of LLM strategic decisions and differences in behavioral patterns under uncertainty. The key to the solution is LudoBench, a benchmark of 480 handcrafted spot scenarios covering 12 behaviorally distinct strategic decision categories, together with a fully functional four-player Ludo simulator supporting four agent types (Random, Heuristic, Game-Theory, and LLM). The game-theory agent, based on Expectiminimax search, serves as a strategic ceiling and reveals that current LLMs agree with it only 40-46% of the time. Two typical behavioral archetypes emerge: finishers, who prioritize completing their own pieces, and builders, who focus on early development; neither fully captures the game-theoretic strategy. The framework offers a lightweight, interpretable way to evaluate LLM strategic reasoning and exposes key vulnerabilities such as prompt sensitivity.

Link: https://arxiv.org/abs/2604.05681
Authors: Ojas Jain, Dhruv Kumar
Affiliations: BITS Pilani
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Under Review

Click to view abstract

Abstract:We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning complexity. LudoBench comprises 480 handcrafted spot scenarios across 12 behaviorally distinct decision categories, each isolating a specific strategic choice. We additionally contribute a fully functional 4-player Ludo simulator supporting Random, Heuristic, Game-Theory, and LLM agents. The game-theory agent uses Expectiminimax search with depth-limited lookahead to provide a principled strategic ceiling beyond greedy heuristics. Evaluating six models spanning four model families, we find that all models agree with the game-theory baseline only 40-46% of the time. Models split into distinct behavioral archetypes: finishers that complete pieces but neglect development, and builders that develop but never finish. Each archetype captures only half of the game theory strategy. Models also display measurable behavioral shifts under history-conditioned grudge framing on identical board states, revealing prompt-sensitivity as a key vulnerability. LudoBench provides a lightweight and interpretable framework for benchmarking LLM strategic reasoning under uncertainty. All code, the spot dataset (480 entries) and model outputs are available at this https URL
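Expectiminimax, which powers the game-theory baseline above, alternates max/min player nodes with chance nodes that average over dice outcomes. A toy sketch (the tree shape and values below are illustrative, unrelated to LudoBench's actual implementation):

```python
def expectiminimax(node):
    """Toy expectiminimax. A node is either a leaf utility (number) or a pair
    (kind, children) with kind in {"max", "min", "chance"}; children of a
    chance node are (probability, subtree) pairs."""
    if isinstance(node, (int, float)):   # leaf utility
        return node
    kind, children = node
    if kind == "max":
        return max(expectiminimax(c) for c in children)
    if kind == "min":
        return min(expectiminimax(c) for c in children)
    # chance node: probability-weighted average over dice outcomes
    return sum(p * expectiminimax(c) for p, c in children)

# A two-outcome die (p = 0.5 each), each outcome followed by a max choice.
tree = ("chance", [(0.5, ("max", [3, 1])), (0.5, ("max", [0, 4]))])
print(expectiminimax(tree))  # → 3.5
```

Depth-limited lookahead, as the abstract mentions, would simply cut recursion at a fixed depth and fall back to a heuristic evaluation.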

[MA-4] SCMAPR: Self-Correcting Multi-Agent Prompt Refinement for Complex-Scenario Text-to-Video Generation

[Quick Read]: This paper addresses the performance degradation of text-to-video (T2V) generation in complex scenarios caused by the ambiguity and underspecification of text prompts. Existing diffusion models still suffer from misalignment and insufficient generation quality on complex prompts. The key to the solution is SCMAPR (Scenario-aware and Self-Correcting Multi-Agent Prompt Refinement), a stage-wise multi-agent refinement framework whose core mechanisms are: (i) scenario-classification-based prompt routing to select the best rewriting strategy; (ii) synthesis of scenario-aware rewriting policies and policy-conditioned refinement; and (iii) structured semantic verification that triggers conditional revision, closing the optimization loop. This approach markedly improves text-video alignment and generation quality on complex prompts.

Link: https://arxiv.org/abs/2604.05489
Authors: Chengyi Yang, Pengzhen Li, Jiayin Qi, Aimin Zhou, Ji Wu, Ji Liu
Affiliations: East China Normal University; HiThink Research; Guangzhou University; Tsinghua University
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Text-to-Video (T2V) generation has benefited from recent advances in diffusion models, yet current systems still struggle under complex scenarios, which are generally exacerbated by the ambiguity and underspecification of text prompts. In this work, we formulate complex-scenario prompt refinement as a stage-wise multi-agent refinement process and propose SCMAPR, i.e., a scenario-aware and Self-Correcting Multi-Agent Prompt Refinement framework for T2V prompting. SCMAPR coordinates specialized agents to (i) route each prompt to a taxonomy-grounded scenario for strategy selection, (ii) synthesize scenario-aware rewriting policies and perform policy-conditioned refinement, and (iii) conduct structured semantic verification that triggers conditional revision when violations are detected. To clarify what constitutes complex scenarios in T2V prompting, provide representative examples, and enable rigorous evaluation under such challenging conditions, we further introduce T2V-Complexity, which is a complex-scenario T2V benchmark consisting exclusively of complex-scenario prompts. Extensive experiments on 3 existing benchmarks and our T2V-Complexity benchmark demonstrate that SCMAPR consistently improves text-video alignment and overall generation quality under complex scenarios, achieving up to 2.67% and 3.28 gains in average score on VBench and EvalCrafter, and up to 0.028 improvement on T2V-CompBench over 3 State-Of-The-Art baselines.

[MA-5] Strategic Delay and Coordination Efficiency in Global Games

[Quick Read]: This paper addresses coordination efficiency in two-stage collective decision-making under incomplete information: how introducing the option to delay a decision can raise the probability of successful coordination and overall welfare. The key lies in a coordination model built within the global-games framework, in which agents decide, based on noisy signals, whether to join a collective action immediately or to delay. Agents who delay observe the identities of first-stage participants and thus gain decision accuracy, but incur a discounted payoff if coordination ultimately succeeds. The study shows that this intertemporal trade-off between information acquisition and payoff reduction can improve coordination and increase the efficiency of collective decision-making.

Link: https://arxiv.org/abs/2604.05298
Authors: Shinkyu Park, Behrouz Touri, Marcos M. Vasconcelos
Affiliations: King Abdullah University of Science and Technology; University of Illinois Urbana-Champaign; Florida State University
Subjects: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: Extended Version. Submitted to the IEEE Conference on Decision and Control 2026

Click to view abstract

Abstract:We investigate a coordination model for a two-stage collective decision-making problem within the framework of global games. The agents observe noisy signals of a shared random variable, referred to as the fundamental, which determines the underlying payoff. Based on these signals, the agents decide whether to participate in a collective action now or to delay. An agent who delays acquires additional information by observing the identities of agents who have chosen to participate in the first stage. This informational advantage, however, comes at the cost of a discounted payoff if coordination ultimately succeeds. Within this decision-making framework, we analyze how the option to delay can enhance collective outcomes. We show that this intertemporal trade-off between information acquisition and payoff reduction can improve coordination and increase the efficiency of collective decision-making.

[MA-6] Spec Kit Agents: Context-Grounded Agentic Workflows

[Quick Read]: This paper addresses the API hallucinations and architectural violations caused by the "context blindness" of AI coding agents in large, evolving codebases. The key to the solution is Spec Kit Agents, a multi-agent spec-driven development (SDD) pipeline with PM and developer roles whose core innovation is phase-level, context-grounding hooks: read-only probing hooks ground each development phase (Specify, Plan, Tasks, Implement) in repository evidence, while validation hooks check intermediate artifacts for consistency with the environment, significantly improving code quality and repository compatibility.

Link: https://arxiv.org/abs/2604.05278
Authors: Pardis Taghavi, Santosh Bhavani
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Spec-driven development (SDD) with AI coding agents provides a structured workflow, but agents often remain “context blind” in large, evolving repositories, leading to hallucinated APIs and architectural violations. We present Spec Kit Agents, a multi-agent SDD pipeline (with PM and developer roles) that adds phase-level, context-grounding hooks. Read-only probing hooks ground each stage (Specify, Plan, Tasks, Implement) in repository evidence, while validation hooks check intermediate artifacts against the environment. We evaluate 128 runs covering 32 features across five repositories. Context-grounding hooks improve judged quality by +0.15 on a 1-5 composite LLM-as-judge score (+3.0 percent of the full score; Wilcoxon signed-rank, p < 0.05) while maintaining 99.7-100 percent repository-level test compatibility. We further evaluate the framework on SWE-bench Lite, where augmentation hooks improve baseline by 1.7 percent, achieving 58.2 percent Pass@1.
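The hook mechanism lends itself to a small pipeline sketch. Apart from the four phase names, which come from the abstract, everything below is a hypothetical illustration, not Spec Kit Agents' actual implementation:

```python
# Hypothetical sketch: before each SDD phase a read-only probe gathers
# repository evidence; after the phase, a validation hook checks the artifact.
PHASES = ["Specify", "Plan", "Tasks", "Implement"]

def run_pipeline(probe, work, validate):
    artifacts = {}
    for phase in PHASES:
        evidence = probe(phase)                  # read-only: no repo mutation
        artifacts[phase] = work(phase, evidence) # agent produces the artifact
        if not validate(phase, artifacts[phase]):
            raise ValueError(f"validation hook failed at {phase}")
    return artifacts

out = run_pipeline(
    probe=lambda ph: {"files": ["README.md"]},
    work=lambda ph, ev: f"{ph} grounded in {len(ev['files'])} file(s)",
    validate=lambda ph, art: art.startswith(ph),
)
print(out["Implement"])  # → Implement grounded in 1 file(s)
```

The point of the separation is that probing can never mutate the repository, while validation gates each phase before the next one starts.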

[MA-7] From Governance Norms to Enforceable Controls: A Layered Translation Method for Runtime Guardrails in Agentic AI

[Quick Read]: This paper addresses governance challenges that arise during the execution of agentic AI systems. Unlike single-turn generative AI, whose risks appear mainly at model development or deployment time, agentic systems can plan, use tools, maintain state, and execute multi-step trajectories, creating unpredictable runtime risks. The key to the solution is a layered translation method that maps abstract governance objectives from standards onto four control layers: governance objectives, design-time constraints, runtime mediation, and assurance feedback. A control tuple and a runtime-enforceability rubric determine which layer a given control belongs to. The central claim: standards should guide the placement of controls across architecture, runtime policy, human escalation, and audit, while runtime intervention is reserved for controls that are observable, determinate, and time-sensitive.

Link: https://arxiv.org/abs/2604.05229
Authors: Christopher Koch
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 5 pages, 2 tables

Click to view abstract

Abstract:Agentic AI systems plan, use tools, maintain state, and produce multi-step trajectories with external effects. Those properties create a governance problem that differs materially from single-turn generative AI: important risks emerge during execution, not only at model development or deployment time. Governance standards such as ISO/IEC 42001, ISO/IEC 23894, ISO/IEC 42005, ISO/IEC 5338, ISO/IEC 38507, and the NIST AI Risk Management Framework are therefore highly relevant to agentic AI, but they do not by themselves yield implementable runtime guardrails. This paper proposes a layered translation method that connects standards-derived governance objectives to four control layers: governance objectives, design-time constraints, runtime mediation, and assurance feedback. It distinguishes governance objectives, technical controls, runtime guardrails, and assurance evidence; introduces a control tuple and runtime-enforceability rubric for layer assignment; and demonstrates the method in a procurement-agent case study. The central claim is modest: standards should guide control placement across architecture, runtime policy, human escalation, and audit, while runtime guardrails are reserved for controls that are observable, determinate, and time-sensitive enough to justify execution-time intervention.

[MA-8] Nash Approximation Gap in Truncated Infinite-horizon Partially Observable Markov Games

[Quick Read]: This paper addresses the computational intractability of infinite-horizon partially observable Markov games (POMGs), where belief states and action spaces grow without bound as information accumulates. The core solution is a finite-memory truncation framework that approximates the POMG by a finite-state, finite-action Markov game in which agents condition their decisions only on finite windows of common and private information. Under filter stability (forgetting) conditions, the authors prove that any Nash equilibrium of the truncated game is an ε-Nash equilibrium of the original POMG, with ε tending to zero as the truncation length increases.

Link: https://arxiv.org/abs/2604.05131
Authors: Lan Sang, Chinmay Maheshwari
Affiliations: Johns Hopkins University
Subjects: Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:Partially Observable Markov Games (POMGs) provide a general framework for modeling multi-agent sequential decision-making under asymmetric information. A common approach is to reformulate a POMG as a fully observable Markov game over belief states, where the state is the conditional distribution of the system state and agents’ private information given common information, and actions correspond to mappings (prescriptions) from private information to actions. However, this reformulation is intractable in infinite-horizon settings, as both the belief state and action spaces grow with the accumulation of information over time. We propose a finite-memory truncation framework that approximates infinite-horizon POMGs by a finite-state, finite-action Markov game, where agents condition decisions only on finite windows of common and private information. Under suitable filter stability (forgetting) conditions, we show that any Nash equilibrium of the truncated game is an ε-Nash equilibrium of the original POMG, where ε → 0 as the truncation length increases.
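For reference, the standard ε-Nash condition invoked above can be written as follows (the notation J_i for agent i's objective is assumed here, not taken from the paper):

```latex
% A profile \sigma^* is an \varepsilon-Nash equilibrium if no agent i can gain
% more than \varepsilon by a unilateral deviation:
J_i(\sigma_i, \sigma^*_{-i}) \le J_i(\sigma^*_i, \sigma^*_{-i}) + \varepsilon
\quad \text{for all agents } i \text{ and all strategies } \sigma_i .
```

Setting ε = 0 recovers the exact Nash condition, which is why ε → 0 as the truncation length grows is the natural approximation guarantee.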

[MA-9] Designing Digital Humans with Ambient Intelligence

[Quick Read]: This paper addresses the lack of environmental awareness in current digital human systems, which mostly conduct isolated conversational interactions without understanding the user's surroundings or situation. The key to the solution is integrating ambient intelligence (AmI) technologies into digital human systems: by incorporating environmental sensors, IoT data, and contextual modeling, digital humans gain situational awareness of the user's environment, enabling anticipatory and proactive assistance, seamless cross-device interaction, and personalized long-term user support. This integration not only broadens the behavioral repertoire of digital humans but also advances them from passive, reactive agents toward context-aware intelligent agents.

Link: https://arxiv.org/abs/2604.05120
Authors: Mengyu Chen, Pranav Deshpande, Runqing Yang, Elvir Azanli, Joseph Ligman, Shaohan Hu, Richard Chen
Affiliations: JPMorganChase
Subjects: Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Digital humans are lifelike virtual agents capable of natural conversation and are increasingly deployed in domains like retail and finance. However, most current digital humans operate in isolation from their surroundings and lack contextual awareness beyond the dialogue itself. We address this limitation by integrating ambient intelligence (AmI) - i.e., environmental sensors, IoT data, and contextual modeling - with digital human systems. This integration enables situational awareness of the user’s environment, anticipatory and proactive assistance, seamless cross-device interactions, and personalized long-term user support. We present a conceptual framework defining key roles that AmI can play in shaping digital human behavior, a design space highlighting dimensions such as proactivity levels and privacy strategies, and application-driven patterns with case studies in financial and retail services. We also discuss an architecture for ambient-enabled digital humans and provide guidelines for responsible design regarding privacy and data governance. Together, our work positions ambient intelligent digital humans as a new class of interactive agents powered by AI that respond not only to users’ queries but also to the context and situations in which the interaction occurs.

[MA-10] Governance-Aware Agent Telemetry for Closed-Loop Enforcement in Multi-Agent AI Systems

[Quick Read]: This paper addresses the "observe-but-do-not-act" gap between observability tooling and governance policy enforcement in enterprise multi-agent systems: tools such as OpenTelemetry and Langfuse collect rich telemetry on inter-agent interactions but cannot detect policy violations in real time or intervene automatically, so violations are often discovered only after damage is done. The key to the solution is Governance-Aware Agent Telemetry (GAAT), a reference architecture comprising: (1) a Governance Telemetry Schema (GTS) that extends OpenTelemetry with governance attributes; (2) a real-time violation detection engine using OPA-compatible declarative rules under sub-200 ms latency; (3) a Governance Enforcement Bus (GEB) with graduated interventions; and (4) a Trusted Telemetry Plane with cryptographic provenance, closing the loop from telemetry collection to policy enforcement.

Link: https://arxiv.org/abs/2604.05119
Authors: Anshul Pathak, Nishant Jain
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Enterprise multi-agent AI systems produce thousands of inter-agent interactions per hour, yet existing observability tools capture these dependencies without enforcing anything. OpenTelemetry and Langfuse collect telemetry but treat governance as a downstream analytics concern, not a real-time enforcement target. The result is an “observe-but-do-not-act” gap where policy violations are detected only after damage is done. We present Governance-Aware Agent Telemetry (GAAT), a reference architecture that closes the loop between telemetry collection and automated policy enforcement for multi-agent systems. GAAT introduces (1) a Governance Telemetry Schema (GTS) extending OpenTelemetry with governance attributes; (2) a real-time policy violation detection engine using OPA-compatible declarative rules under sub-200 ms latency; (3) a Governance Enforcement Bus (GEB) with graduated interventions; and (4) a Trusted Telemetry Plane with cryptographic provenance.
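The abstract does not reproduce GTS or the rule language, so purely as a hypothetical sketch, real-time violation detection over governance attributes on a telemetry span might look like this (all attribute and rule names are invented):

```python
# Hypothetical governance check: each rule is a predicate over span attributes,
# mirroring the observe -> detect -> enforce loop described above.
RULES = {
    "deny-unapproved-tool": lambda span: span.get("gov.tool_approved", False),
    "require-owner": lambda span: "gov.owner" in span,
}

def violations(span):
    """Return the names of all rules the telemetry span violates."""
    return [name for name, ok in RULES.items() if not ok(span)]

span = {"agent.id": "planner-1", "gov.tool_approved": False}
print(violations(span))  # → ['deny-unapproved-tool', 'require-owner']
```

In the GAAT design the detection result would then feed the enforcement bus, which selects a graduated intervention rather than merely logging the violation.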

[MA-11] Nidus: Externalized Reasoning for AI-Assisted Engineering

[Quick Read]: This paper addresses the difficulty of reliably maintaining engineering invariants in software engineering, especially under generative AI-assisted development, where manual or informal processes cannot guarantee continuous consistency of requirement traceability, architectural justification, and delivery evidence. The key to the solution is Nidus, a mechanized governance runtime that turns the V-model into a decidable constraint surface and enforces organizational standards and engineering norms via formal verification before every code mutation. Its core innovations include recursive self-governance (the constraint surface constrains mutations to itself), stigmergic coordination (collaboration without central control), proximal spec reinforcement (the specification acts as a reward function that shapes behavior at inference time, with no weight updates), and governance theater prevention (compliance evidence can only arise along genuine evolution paths). This allows the system to govern itself and accumulate a provable, monotonically growing obligation set, enabling trustworthy AI-assisted software delivery.

Link: https://arxiv.org/abs/2604.05080
Authors: Danil Gorinevski (cybiont GmbH, Schübelbach, Switzerland)
Affiliations: cybiont GmbH
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Comments: 19 pages, 3 figures, 5 tables. Evaluated on self-hosting deployment. Patent pending (CH000371/2026)

Click to view abstract

Abstract:We present Nidus, a governance runtime that mechanizes the V-model for AI-assisted software delivery. In the self-hosting deployment, three LLM families (Claude, Gemini, Codex) delivered a 100,000-line system under proof obligations verified against the current obligation set on every commit. The system governed its own construction. Engineering invariants - traced requirements, justified architecture, evidenced deliveries - cannot be reliably maintained as learned behavior; assurance requires enforcement by a mechanism external to the proposer. Nidus externalizes the engineering methodology into a decidable artifact verified on every mutation before persistence. Organizational standards compile into guidebooks - constraint libraries imported by governed projects and enforced by decidable evaluation. Four contributions: (1) recursive self-governance - the constraint surface constrains mutations to itself; (2) stigmergic coordination - friction from the surface routes agents without central control; (3) proximal spec reinforcement - the living artifact externalizes the engineering context that RL and long-chain reasoning try to internalize; the specification is the reward function, UNSAT verdicts shape behavior at inference time, no weight updates; (4) governance theater prevention - compliance evidence cannot be fabricated within the modeled mutation path. The constraint surface compounds: each obligation permanently eliminates a class of unengineered output. The artifact’s development history is a formal development - every state satisfies all active obligations, and the obligation set grows monotonically.

[MA-12] GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing

[Quick Read]: This paper addresses the difficulty of reconciling long-range structural constraints with local editing consistency in music-grounded mashup video creation; traditional methods rely on fixed pipelines or a simplified retrieval-and-concatenation paradigm and struggle to adapt to diverse user intents and heterogeneous source videos. The core solution, GLANCE, adopts a bi-loop architecture: an outer loop performs long-horizon planning and task-graph construction, while an inner loop runs an "Observe-Think-Act-Verify" cycle for segment-level editing and iterative refinement. A global-local coordination mechanism (with a context controller, a conflict-region decomposition module, and a bottom-up dynamic negotiation mechanism) mitigates cross-segment conflicts that emerge after sub-timeline composition, enabling high-quality, structurally coherent non-linear video generation.

Link: https://arxiv.org/abs/2604.05076
Authors: Zihao Lin, Haibo Wang, Zhiyang Xu, Siyao Dai, Huanjie Dong, Xiaohan Wang, Yolo Y. Tang, Yixin Wang, Qifan Wang, Lifu Huang
Affiliations: UC Davis; Fudan University; Stanford University; University of Rochester; Meta AI
Subjects: Multiagent Systems (cs.MA); Multimedia (cs.MM); Sound (cs.SD)
Comments: 14 pages, 4 figures, under review

Click to view abstract

Abstract:Music-grounded mashup video creation is a challenging form of video non-linear editing, where a system must compose a coherent timeline from large collections of source videos while aligning with music rhythm, user intent, story completeness, and long-range structural constraints. Existing approaches typically rely on fixed pipelines or simplified retrieval-and-concatenation paradigms, limiting their ability to adapt to diverse prompts and heterogeneous source materials. In this paper, we present GLANCE, a global-local coordination multi-agent framework for music-grounded nonlinear video editing. GLANCE adopts a bi-loop architecture for better editing practice: an outer loop performs long-horizon planning and task-graph construction, and an inner loop adopts the “Observe-Think-Act-Verify” flow for segment-wise editing tasks and their refinements. To address the cross-segment and global conflict emerging after subtimelines composition, we introduce a dedicated global-local coordination mechanism with both preventive and corrective components, which includes a newly designed context controller, conflict region decomposition module, and a bottom-up dynamic negotiation mechanism. To support rigorous evaluation, we construct MVEBench, a new benchmark that factorizes editing difficulty along task type, prompt specificity, and music length, and propose an agent-as-a-judge evaluation framework for scalable multi-dimensional assessment. Experimental results show that GLANCE consistently outperforms prior research baselines and open-source product baselines under the same backbone models. With GPT-4o-mini as the backbone, GLANCE improves over the strongest baseline by 33.2% and 15.6% on two task settings, respectively. Human evaluation further confirms the quality of the generated videos and validates the effectiveness of the proposed evaluation framework.

[MA-13] PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing

[Quick Read]: This paper addresses the open challenge of automatically turning unstructured research materials into submission-ready manuscripts for AI-driven scientific discovery; existing autonomous writing systems are tightly coupled to specific experimental pipelines and produce only superficial literature reviews. The key to the solution is PaperOrchestra, a multi-agent framework for automated AI research paper writing that flexibly transforms unconstrained pre-writing materials into submission-ready LaTeX manuscripts with in-depth literature synthesis and automatically generated visuals such as plots and conceptual diagrams. Evaluation on the PaperWritingBench benchmark shows that it significantly outperforms existing baselines in both literature review quality and overall manuscript quality.

Link: https://arxiv.org/abs/2604.05018
Authors: Yiwen Song, Yale Song, Tomas Pfister, Jinsung Yoon
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Project Page: this https URL

Click to view abstract

Abstract:Synthesizing unstructured research materials into manuscripts is an essential yet under-explored challenge in AI-driven scientific discovery. Existing autonomous writers are rigidly coupled to specific experimental pipelines, and produce superficial literature reviews. We introduce PaperOrchestra, a multi-agent framework for automated AI research paper writing. It flexibly transforms unconstrained pre-writing materials into submission-ready LaTeX manuscripts, including comprehensive literature synthesis and generated visuals, such as plots and conceptual diagrams. To evaluate performance, we present PaperWritingBench, the first standardized benchmark of reverse-engineered raw materials from 200 top-tier AI conference papers, alongside a comprehensive suite of automated evaluators. In side-by-side human evaluations, PaperOrchestra significantly outperforms autonomous baselines, achieving an absolute win rate margin of 50%-68% in literature review quality, and 14%-38% in overall manuscript quality.

[MA-14] Adaptive Incentive Design with Regret Minimization

[Quick Read]: This paper addresses incentive mechanism design under information asymmetry: how a system planner (principal) who cannot fully observe agents' behavioral types can design adaptive incentive policies that minimize long-run average regret. The central challenge is dynamically balancing probing (exploration) against estimate-based incentivization (exploitation) while guaranteeing asymptotic optimality. The key to the solution is the RAID algorithm, which alternates between exploration and exploitation via a switching policy and builds a type estimator that relies only on the weak excitation condition required for strong consistency of least-squares estimation, substantially relaxing the persistence-of-excitation assumptions of prior work. The analysis further establishes consistency of the proposed type estimator and shows that the incentive policy achieves asymptotically minimal regret almost surely.

Link: https://arxiv.org/abs/2604.05977
Authors: Georgios Vasileiou, Lantian Zhang, Silun Zhang
Affiliations: KTH Royal Institute of Technology
Subjects: Optimization and Control (math.OC); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: 8 pages, 3 figures

Click to view abstract

Abstract:Incentive design constitutes a foundational paradigm for influencing the behavior of strategic agents, wherein a system planner (principal) publicly commits to an incentive mechanism designed to align individual objectives with collective social welfare. This paper introduces the Regret-Minimizing Adaptive Incentive Design (RAID) problem, which aims to synthesize incentive laws under information asymmetry and achieve asymptotically minimal regret compared to an oracle with full information. To this end, we develop the RAID algorithm, which employs a switching policy alternating between probing (exploration) and estimate-based incentivization (exploitation). The associated type estimator relies only on a weaker excitation condition required for strong consistency in least squares estimation, substantially relaxing the persistence-of-excitation assumptions previously used in adaptive incentive design. In addition, we establish the strong consistency of the proposed type estimator and prove that the incentive obtained asymptotically minimizes the planner’s average regret almost surely. Numerical experiments illustrate the convergence rate of the proposed methodology.
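The least-squares type estimation at the heart of RAID can be illustrated on a scalar toy model; the linear observation model, noise level, and probing distribution below are assumptions made for illustration, not the paper's actual setup:

```python
import random

def ls_estimate(xs, ys):
    """Ordinary least squares for a scalar type parameter theta in y = theta * x."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    return num / den

random.seed(0)
theta = 2.5                                            # unknown agent "type"
xs = [random.uniform(0.5, 1.5) for _ in range(5000)]   # probing inputs (excitation)
ys = [theta * x + random.gauss(0, 0.1) for x in xs]    # noisy agent responses
print(round(ls_estimate(xs, ys), 2))  # close to 2.5
```

The probing phase supplies the excitation (nonzero inputs) that makes the estimate strongly consistent; the exploitation phase would then use the current estimate to set incentives.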

Natural Language Processing

[NLP-0] Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework ACL

Quick read: This paper tackles the difficulty researchers face in efficiently discovering, evaluating, and synthesizing relevant work amid rapidly growing scientific literature. The core solution is Paper Circle, a multi-agent research discovery and analysis system built around two complementary pipelines: a Discovery Pipeline that combines offline and online retrieval, multi-criteria scoring, diversity-aware ranking, and structured outputs to make literature acquisition more efficient; and an Analysis Pipeline that converts individual papers into knowledge graphs with typed nodes (concepts, methods, experiments, figures) to support graph-aware question answering and coverage verification. Both pipelines run inside a coder-LLM-driven multi-agent orchestration framework whose outputs are fully reproducible and synchronized across JSON, CSV, BibTeX, Markdown, and HTML at every step, reducing the burden of literature processing and improving the traceability of research workflows.

Link: https://arxiv.org/abs/2604.06170
Authors: Komal Kumar, Aman Chadha, Salman Khan, Fahad Shahbaz Khan, Hisham Cholakkal
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; AWS Generative AI Innovation Center, Amazon Web Services
Subjects: Computation and Language (cs.CL)
Comments: 19 pages, 7 figures, 8 tables, ACL main (Oral)

Click to view abstract

Abstract:The rapid growth of scientific literature has made it increasingly difficult for researchers to efficiently discover, evaluate, and synthesize relevant work. Recent advances in multi-agent large language models (LLMs) have demonstrated strong potential for understanding user intent and are being trained to utilize various tools. In this paper, we introduce Paper Circle, a multi-agent research discovery and analysis system designed to reduce the effort required to find, assess, organize, and understand academic literature. The system comprises two complementary pipelines: (1) a Discovery Pipeline that integrates offline and online retrieval from multiple sources, multi-criteria scoring, diversity-aware ranking, and structured outputs; and (2) an Analysis Pipeline that transforms individual papers into structured knowledge graphs with typed nodes such as concepts, methods, experiments, and figures, enabling graph-aware question answering and coverage verification. Both pipelines are implemented within a coder LLM-based multi-agent orchestration framework and produce fully reproducible, synchronized outputs including JSON, CSV, BibTeX, Markdown, and HTML at each agent step. This paper describes the system architecture, agent roles, retrieval and scoring methods, knowledge graph schema, and evaluation interfaces that together form the Paper Circle research workflow. We benchmark Paper Circle on both paper retrieval and paper review generation, reporting hit rate, MRR, and Recall at K. Results show consistent improvements with stronger agent models. We have publicly released the website at this https URL and the code at this https URL.
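
The abstract names a diversity-aware ranking step without specifying the algorithm. Maximal Marginal Relevance (MMR) is a standard technique of exactly that kind and serves as a hedged stand-in below; the query and document vectors are made up for illustration.

```python
import numpy as np

def mmr_rerank(query, docs, lam=0.5, k=3):
    """Maximal Marginal Relevance: trade query relevance off against
    redundancy with already-selected items. MMR is a generic stand-in
    for the paper's unspecified diversity-aware ranking step."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    remaining, selected = list(range(len(docs))), []
    while remaining and len(selected) < k:
        def score(i):
            rel = cos(query, docs[i])
            red = max((cos(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = np.array([1.0, 0.0])
docs = [np.array([1.0, 0.10]),    # highly relevant
        np.array([1.0, 0.12]),    # near-duplicate of doc 0
        np.array([0.7, -0.7])]    # less relevant but diverse
order = mmr_rerank(query, docs)
print(order)
```

With `lam=0.5`, the near-duplicate is deferred behind the diverse document even though its raw relevance is higher, which is the behavior a diversity-aware ranker is meant to produce.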

[NLP-1] In-Place Test-Time Training ICLR2026

Quick read: This paper targets the inability of large language models (LLMs) under the traditional "train then deploy" paradigm to adapt their weights dynamically to continuous streams of new information. The proposed In-Place Test-Time Training (In-Place TTT) framework treats the final projection matrix of the ubiquitous MLP blocks as adaptable fast weights, yielding a drop-in enhancement that requires no retraining from scratch; it further introduces an objective theoretically aligned with the next-token-prediction task of autoregressive language modeling and combines it with a chunk-wise update mechanism, keeping computation efficient while markedly improving performance in long-context settings.

Link: https://arxiv.org/abs/2604.06169
Authors: Guhao Feng, Shengjie Luo, Kai Hua, Ge Zhang, Di He, Wenhao Huang, Tianle Cai
Affiliations: Peking University; ByteDance
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: ICLR 2026 Oral Presentation; Code is released at this https URL

Click to view abstract

Abstract:The static "train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency and misaligned fast weight objectives for language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with Test-Time Training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a "drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT’s generic reconstruction objective with a tailored, theoretically-grounded objective explicitly aligned with the Next-Token-Prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, results in a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework’s effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation study results further provide deeper insights on our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.
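
The chunk-wise fast-weight idea can be illustrated on a deliberately tiny linear model; this is not the paper's MLP-projection architecture or objective, only the general shape of the mechanism. A "slow" matrix is copied into fast weights, which are then updated chunk by chunk against a next-step prediction loss over a toy stream whose structure (a made-up mapping M) the slow weights never saw.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
W_slow = 0.1 * rng.normal(size=(d, d))   # frozen ("slow") projection
W_fast = W_slow.copy()                   # fast weights adapted at test time

# Toy test-time stream: targets follow y = M x, a mapping the slow
# weights never saw (M is a made-up stand-in for in-context structure).
M = np.linalg.qr(rng.normal(size=(d, d)))[0]
X = rng.normal(size=(256, d))
Y = X @ M.T

def loss(W):
    return float(np.mean((X @ W.T - Y) ** 2))

lr, chunk = 0.05, 32
for s in range(0, len(X), chunk):        # chunk-wise test-time updates
    xc, yc = X[s:s + chunk], Y[s:s + chunk]
    # Gradient of the chunk's squared prediction error (up to a constant).
    grad = 2 * (xc @ W_fast.T - yc).T @ xc / len(xc)
    W_fast -= lr * grad

loss_slow, loss_fast = loss(W_slow), loss(W_fast)
print(loss_slow, loss_fast)
```

A single pass of chunked updates already reduces prediction error on the stream, which is the adaptation behavior TTT aims for at inference time.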

[NLP-2] MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

Quick read: This paper addresses the underuse of MLLMs' generative reasoning in multimodal embedding tasks, along with the two core challenges of injecting chain-of-thought (CoT) reasoning directly into embedding learning: (1) structural misalignment between instance-level reasoning and pairwise contrastive supervision can cause the model to learn only the superficial format of reasoning (shortcut behavior); and (2) not all inputs benefit from reasoning, so forcing it on every sample adds computation and latency and can even obscure salient semantic signals in simple cases. The proposed MMEmb-R1 framework models reasoning as a latent variable and introduces pair-aware reasoning selection based on counterfactual intervention to identify reasoning paths that help query-target alignment, together with reinforcement learning that invokes reasoning only when necessary, achieving adaptive allocation of reasoning effort with strong performance at much lower reasoning overhead and inference latency.

Link: https://arxiv.org/abs/2604.06156
Authors: Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang, Chao Feng, Hongsheng Li
Affiliations: MMLab, The Chinese University of Hong Kong; ByteDance
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reasoning into embedding learning introduces two fundamental challenges. First, structural misalignment between instance-level reasoning and pairwise contrastive supervision may lead to shortcut behavior, where the model merely learns the superficial format of reasoning. Second, reasoning is not universally beneficial for embedding tasks. Enforcing reasoning for all inputs may introduce unnecessary computation and latency, and can even obscure salient semantic signals for simple cases. To address these issues, we propose MMEmb-R1, an adaptive reasoning-based multimodal embedding framework. We formulate reasoning as a latent variable and introduce pair-aware reasoning selection that employs counterfactual intervention to identify reasoning paths beneficial for query-target alignment. Furthermore, we adopt reinforcement learning to selectively invoke reasoning only when necessary. Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.

[NLP-3] Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement ACL2026

Quick read: This paper examines the contested question of whether large language models (LLMs) develop coherent internal world models, focusing on the limits of conventional next-token prediction (NTP) for learning structured representations. The study shows that multi-token prediction (MTP) induces representational contractivity via gradient coupling and thereby promotes convergence toward internal belief states, but that standard MTP often suffers from structural hallucinations: discrete token supervision encourages illegal shortcuts in latent space that violate environmental constraints. The key contribution is Latent Semantic Enhancement MTP (LSE-MTP), which anchors predictions to ground-truth hidden state trajectories, bridging the gap between discrete tokens and continuous state representations, improving representation alignment, reducing structural hallucinations, and strengthening robustness to perturbations.

Link: https://arxiv.org/abs/2604.06155
Authors: Qimin Zhong, Hao Liao, Haiming Qin, Mingyang Zhou, Rui Mao, Wei Chen, Naipeng Chao
Affiliations: Shenzhen University; Microsoft Research Asia
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ACL 2026 Main Conference

Click to view abstract

Abstract:Whether Large Language Models (LLMs) develop coherent internal world models remains a core debate. While conventional Next-Token Prediction (NTP) focuses on one-step-ahead supervision, Multi-Token Prediction (MTP) has shown promise in learning more structured representations. In this work, we provide a theoretical perspective analyzing the gradient inductive bias of MTP, supported by empirical evidence, showing that MTP promotes the convergence toward internal belief states by inducing representational contractivity via gradient coupling. However, we reveal that standard MTP often suffers from structural hallucinations, where discrete token supervision encourages illegal shortcuts in latent space that violate environmental constraints. To address this, we propose a novel method Latent Semantic Enhancement MTP (LSE-MTP), which anchors predictions to ground-truth hidden state trajectories. Experiments on synthetic graphs and real-world Manhattan Taxi Ride show that LSE-MTP effectively bridges the gap between discrete tokens and continuous state representations, enhancing representation alignment, reducing structural hallucinations, and improving robustness to perturbations.

[NLP-4] Exclusive Unlearning

Quick read: This paper addresses the safety risks of harmful content generation when deploying large language models (LLMs) in industrial applications such as healthcare and education. Existing machine unlearning methods erase specific harmful knowledge and struggle with the diversity of harmful content. The proposed Exclusive Unlearning (EU) instead forgets broadly, erasing everything except the knowledge and expressions to be retained, thereby systematically covering a wide range of potential harms: the resulting model stays safe against diverse inputs, including jailbreak prompts, while remaining able to respond to varied instructions in specific domains such as medicine and mathematics.

Link: https://arxiv.org/abs/2604.06154
Authors: Mutsumi Sasaki, Kouta Nakayama, Yusuke Miyao, Yohei Oseki, Masaru Isonuma
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:When introducing Large Language Models (LLMs) into industrial applications, such as healthcare and education, the risk of generating harmful content becomes a significant challenge. While existing machine unlearning methods can erase specific harmful knowledge and expressions, diverse harmful content makes comprehensive removal difficult. In this study, instead of individually listing targets for forgetting, we propose Exclusive Unlearning (EU), which aims for broad harm removal by extensively forgetting everything except for the knowledge and expressions we wish to retain. We demonstrate that through Exclusive Unlearning, it is possible to obtain a model that ensures safety against a wide range of inputs, including jailbreaks, while maintaining the ability to respond to diverse instructions related to specific domains such as medicine and mathematics.
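
A minimal sketch of the "forget everything except the retain set" recipe, on a toy key-value fact model. The exact objective is our assumption (cross-entropy descent on retained keys, gradient ascent on all others); the abstract does not specify the method at this level of detail.

```python
import numpy as np

rng = np.random.default_rng(2)
n_keys, n_vals = 20, 20
targets = rng.permutation(n_vals)          # "fact" table: key -> value
W = np.zeros((n_keys, n_vals))             # logits of a one-hot key model

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def step(keys, sign, lr=0.5):
    """Cross-entropy gradient step on the given keys;
    sign=+1 learns the facts, sign=-1 unlearns them (gradient ascent)."""
    p = softmax(W[keys])
    p[np.arange(len(keys)), targets[keys]] -= 1.0   # dCE/dlogits
    W[keys] -= sign * lr * p

for _ in range(100):                       # pretrain: memorize every fact
    step(np.arange(n_keys), +1)

retain = np.arange(5)                      # knowledge we want to keep
forget = np.arange(5, n_keys)              # everything else is unlearned
for _ in range(300):                       # exclusive unlearning
    step(retain, +1)                       # keep retained facts sharp
    step(forget, -1)                       # ascend CE on the rest

pred = W.argmax(axis=1)
retain_acc = float(np.mean(pred[retain] == targets[retain]))
forget_acc = float(np.mean(pred[forget] == targets[forget]))
print(retain_acc, forget_acc)
```

The point of the sketch is the inversion of scope: the forget set is the complement of the retain set, so nothing outside the retained domain needs to be enumerated.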

[NLP-5] ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

Quick read: This paper targets two key weaknesses of existing agent benchmarks: excessive environment-interaction overhead (up to 41% of total evaluation time) and imbalanced distributions of task horizon and difficulty that make aggregate scores unreliable. The core solution is ACE-Bench, an evaluation framework built on a unified grid-based planning task in which the agent must fill hidden slots of a partially completed schedule while satisfying both local and global constraints. The framework offers fine-grained control along two orthogonal axes: Scalable Horizons, governed by the number of hidden slots H, and Controllable Difficulty, governed by a decoy budget B that determines the number of globally misleading decoy candidates. Crucially, all tool calls resolve against static JSON files under a lightweight environment design, eliminating setup overhead and enabling fast, reproducible evaluation suitable for training-time validation.

Link: https://arxiv.org/abs/2604.06111
Authors: Wang Yang, Chaoda Song, Xinpeng Li, Debargha Ganguly, Chuang Ma, Shouren Wang, Zhihao Dou, Yuli Zhou, Vipin Chaudhary, Xiaotian Han
Affiliations: Case Western Reserve University; NII LLMC; University of Zurich
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Existing Agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41% of total evaluation time) and imbalanced task horizon and difficulty distributions that make aggregate scores unreliable. To address these issues, we propose ACE-Bench built around a unified grid-based planning task, where agents must fill hidden slots in a partially completed schedule subject to both local slot constraints and global constraints. Our benchmark offers fine-grained control through two orthogonal axes: Scalable Horizons, controlled by the number of hidden slots H , and Controllable Difficulty, governed by a decoy budget B that determines the number of globally misleading decoy candidates. Crucially, all tool calls are resolved via static JSON files under a Lightweight Environment design, eliminating setup overhead and enabling fast, reproducible evaluation suitable for training-time validation. We first validate that H and B provide reliable control over task horizon and difficulty, and that ACE-Bench exhibits strong domain consistency and model discriminability. We then conduct comprehensive experiments across 13 models of diverse sizes and families over 6 domains, revealing significant cross-model performance variation and confirming that ACE-Bench provides interpretable and controllable evaluation of agent reasoning.
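
The two control axes can be made concrete with a toy instance generator: `h` sets how many slots the agent must fill (horizon), and `b` sets how many decoy candidates look locally valid but bust a global budget (difficulty). The schema below is illustrative, not the benchmark's actual task format.

```python
import random

def make_instance(h, b, n_slots=8, n_items=12, seed=0):
    """Toy ACE-style instance: a schedule with `h` hidden slots and `b`
    decoy candidates that satisfy local slot constraints but violate a
    global budget. (Schema is illustrative, not the benchmark's format.)"""
    rng = random.Random(seed)
    costs = {f"item{i}": rng.randint(1, 5) for i in range(n_items)}
    slots = [f"slot{i}" for i in range(n_slots)]
    solution = dict(zip(slots, sorted(costs, key=costs.get)[:n_slots]))
    hidden = slots[:h]                        # the agent must fill these
    visible = {s: v for s, v in solution.items() if s not in hidden}
    budget = sum(costs[v] for v in solution.values())   # global constraint
    # Decoys: expensive items whose cost busts the global budget.
    decoys = sorted(costs, key=costs.get, reverse=True)[:b]
    candidates = [solution[s] for s in hidden] + decoys
    rng.shuffle(candidates)
    return {"visible": visible, "hidden": hidden, "candidates": candidates,
            "budget": budget, "costs": costs, "solution": solution}

def is_valid(inst, assignment):
    """Check a proposed fill against slot coverage and the global budget."""
    filled = {**inst["visible"], **assignment}
    total = sum(inst["costs"][v] for v in filled.values())
    return set(filled) == set(inst["solution"]) and total <= inst["budget"]

inst = make_instance(h=3, b=4)
gold = {s: inst["solution"][s] for s in inst["hidden"]}
print(is_valid(inst, gold), len(inst["candidates"]))
```

Because the whole instance is a plain dict, it can be serialized to a static JSON file and served without any live environment, mirroring the lightweight-environment design described above.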

[NLP-6] LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces

Quick read: This paper addresses the problem that modern Transformer-based language models perform well on NLP tasks while their latent semantic spaces remain uninterpretable black boxes. The key idea of the proposed LAG-XAI (Lie Affine Geometry for Explainable AI) framework is to model paraphrasing not as discrete word substitution but as a structured affine transformation in the embedding space; a mean-field approximation inspired by local Lie group actions decomposes the paraphrase transition into interpretable geometric components (rotation, deformation, and translation) and reveals a "linear transparency" phenomenon on the semantic manifold, yielding explicitly parameterized, interpretable analysis of Transformer mechanisms.

Link: https://arxiv.org/abs/2604.06086
Authors: Olexander Mazurets, Olexander Barmak, Leonid Bedratyuk, Iurii Krak
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Modern Transformer-based language models achieve strong performance in natural language processing tasks, yet their latent semantic spaces remain largely uninterpretable black boxes. This paper introduces LAG-XAI (Lie Affine Geometry for Explainable AI), a novel geometric framework that models paraphrasing not as discrete word substitutions, but as a structured affine transformation within the embedding space. By conceptualizing paraphrasing as a continuous geometric flow on a semantic manifold, we propose a computationally efficient mean-field approximation, inspired by local Lie group actions. This allows us to decompose paraphrase transitions into geometrically interpretable components: rotation, deformation, and translation. Experiments on the noisy PIT-2015 Twitter corpus, encoded with Sentence-BERT, reveal a “linear transparency” phenomenon. The proposed affine operator achieves an AUC of 0.7713. By normalizing against random chance (AUC 0.5), the model captures approximately 80% of the non-linear baseline’s effective classification capacity (AUC 0.8405), offering explicit parametric interpretability in exchange for a marginal drop in absolute accuracy. The model identifies fundamental geometric invariants, including a stable matrix reconfiguration angle (~27.84°) and near-zero deformation, indicating local isometry. Cross-domain generalization is confirmed via direct cross-corpus validation on an independent TURL dataset. Furthermore, the practical utility of LAG-XAI is demonstrated in LLM hallucination detection: using a “cheap geometric check,” the model automatically detected 95.3% of factual distortions on the HaluEval dataset by registering deviations beyond the permissible semantic corridor. This approach provides a mathematically grounded, resource-efficient path toward the mechanistic interpretability of Transformers.
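
The rotation/deformation/translation decomposition can be sketched with a least-squares affine fit followed by a polar decomposition. The data here are synthetic pairs built from a known 28-degree rotation, so the recovered angle is checkable; the paper's ~27.84 degree invariant comes from real paraphrase corpora, not from this toy.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic "paraphrase" pairs: Y = X (R S)^T + t + noise, with a known
# rotation angle; we recover the affine map and its polar factors.
angle = np.deg2rad(28.0)
R_true = np.array([[np.cos(angle), -np.sin(angle)],
                   [np.sin(angle),  np.cos(angle)]])
S_true = np.diag([1.0, 1.0])            # near-isometry (no deformation)
t_true = np.array([0.3, -0.1])

X = rng.normal(size=(500, 2))
Y = X @ (R_true @ S_true).T + t_true + 0.01 * rng.normal(size=(500, 2))

# Least-squares affine fit: Y ~ X A^T + t (augment X with a bias column).
Xa = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(Xa, Y, rcond=None)
A, t = coef[:2].T, coef[2]

# Polar decomposition A = R P: R is the rotation, P the deformation.
U, s, Vt = np.linalg.svd(A)
R = U @ Vt
P = Vt.T @ np.diag(s) @ Vt
theta = float(np.degrees(np.arctan2(R[1, 0], R[0, 0])))
print(round(theta, 2), np.round(np.diag(P), 2))
```

With a near-identity deformation factor P, the fitted map is locally an isometry, which is the "near-zero deformation" reading of the abstract's invariants.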

[NLP-7] Short Data Long Context: Distilling Positional Knowledge in Transformers

Quick read: This paper addresses the high cost of long-context pre-training when extending a language model's context window: how to transfer long-context capability to a student model without large-scale long-context data. The key finding is that logit-based knowledge distillation can transfer a teacher's long-context retrieval ability while training only on packed short-context samples. The study further clarifies the role of Rotary Position Embedding (RoPE) in distillation, showing that phase-wise RoPE scaling, which maximizes rotational-spectrum utilization, yields the best long-context performance; that logit distillation directly transfers positional information; and that query states exhibit structured update patterns during long-context extension, mechanisms that together enable efficient, low-cost transfer of long-context capability.

Link: https://arxiv.org/abs/2604.06070
Authors: Patrick Huber, Ernie Chang, Chinnadhurai Sankar, Rylan Conway, Igor Fedorov, Md Rifat Arefin, Adithya Sagar
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Extending the context window of language models typically requires expensive long-context pre-training, posing significant challenges for both training efficiency and data collection. In this paper, we present evidence that long-context retrieval capabilities can be transferred to student models through logit-based knowledge distillation, even when training exclusively on packed short-context samples within a long-context window. We provide comprehensive insights through the lens of Rotary Position Embedding (RoPE) and establish three key findings. First, consistent with prior work, we show that phase-wise RoPE scaling, which maximizes rotational spectrum utilization at each training stage, also achieves the best long-context performance in knowledge distillation setups. Second, we demonstrate that logit-based knowledge distillation can directly enable positional information transfer. Using an experimental setup with packed repeated token sequences, we trace the propagation of positional perturbations from query and key vectors through successive transformer layers to output logits, revealing that positional information systematically influences the teacher’s output distribution and, in turn, the distillation signal received by the student model. Third, our analysis uncovers structured update patterns in the query state during long-context extension, with distinct parameter spans exhibiting strong sensitivity to long-context training.
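
A small sketch of the RoPE mechanics behind these findings: scores under RoPE depend only on relative position, and dividing positions by a scale factor (interpolation-style scaling) maps a longer range onto the rotational spectrum seen in training. This shows only the generic mechanism; the paper's specific phase-wise scaling schedule is not reproduced here.

```python
import numpy as np

def rope(x, pos, base=10000.0, scale=1.0):
    """Rotary position embedding on a vector of even dimension.
    `scale` > 1 compresses positions (interpolation-style extension)."""
    dim = x.shape[-1]
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    theta = (pos / scale) * freqs
    x1, x2 = x[0::2], x[1::2]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)])

rng = np.random.default_rng(4)
q, k = rng.normal(size=8), rng.normal(size=8)

# Scores depend only on relative position m - n ...
s_near = rope(q, 100) @ rope(k, 90)
s_far = rope(q, 1010) @ rope(k, 1000)

# ... so dividing positions by 4 makes position 4000 rotate exactly like
# position 1000 did during short-context training.
s_scaled = rope(q, 4000, scale=4.0) @ rope(k, 3960, scale=4.0)
s_ref = rope(q, 1000) @ rope(k, 990)
print(np.isclose(s_near, s_far), np.isclose(s_scaled, s_ref))
```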

[NLP-8] From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection

Quick read: This paper studies why intrinsic self-correction in large language models (LLMs) fails in open-ended reasoning due to "hallucination snowballing," where models recursively rationalize early errors during free-text reflection. It asks whether structured reflection enforced purely through Outlines-based constrained decoding can interrupt error propagation without externally trained critics or symbolic tools. The finding is that imposing structural constraints alone does not improve self-correction; instead it triggers a new failure mode, "structure snowballing," in which the cognitive load of satisfying strict formatting rules pushes the model into formatting traps, producing near-perfect superficial syntactic alignment while deeper semantic errors go undetected and unresolved. This exposes an "alignment tax" inherent to constrained decoding and highlights the tension between structural granularity and internal model capacity in autonomous workflows.

Link: https://arxiv.org/abs/2604.06066
Authors: Hongxu Zhou
Affiliations: Saarland University
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Intrinsic self-correction in Large Language Models (LLMs) frequently fails in open-ended reasoning tasks due to "hallucination snowballing," a phenomenon in which models recursively justify early errors during free-text reflection. While structured feedback can mitigate this issue, existing approaches often rely on externally trained critics or symbolic tools, reducing agent autonomy. This study investigates whether enforcing structured reflection purely through Outlines-based constrained decoding can disrupt error propagation without additional training. Evaluating an 8-billion-parameter model (Qwen3-8B), we show that simply imposing structural constraints does not improve self-correction performance. Instead, it triggers a new failure mode termed "structure snowballing." We find that the cognitive load required to satisfy strict formatting rules pushes the model into formatting traps. This observation helps explain why the agent achieves near-perfect superficial syntactic alignment yet fails to detect or resolve deeper semantic errors. These findings expose an "alignment tax" inherent to constrained decoding, highlighting a tension between structural granularity and internal model capacity in autonomous workflows. Code and raw logs are available in the GitHub repository: this https URL.

[NLP-9] BiMind: A Dual-Head Reasoning Model with Attention-Geometry Adapter for Incorrect Information Detection

Quick read: This paper addresses the difficulty of jointly balancing textual content verification with external knowledge correction in incorrect-information detection, especially under collapsed attention geometries where internal and external signals are poorly integrated. The proposed dual-head reasoning framework, BiMind, disentangles content-internal reasoning from knowledge-augmented reasoning. Its core components are: (i) an attention geometry adapter that reshapes attention logits via token-conditioned offsets to mitigate attention collapse; (ii) a self-retrieval knowledge mechanism that builds an in-domain semantic memory via kNN retrieval and injects knowledge through feature-wise linear modulation; and (iii) uncertainty-aware fusion combining entropy-gated fusion with a trainable agreement head, stabilized by a symmetric KL-divergence regularizer. The paper also introduces a new instance-level metric, Value-of-eXperience (VoX), to quantify the logit gains contributed by knowledge-augmented reasoning, providing interpretable diagnostics of when and why knowledge matters.

Link: https://arxiv.org/abs/2604.06022
Authors: Zhongxing Zhang, Emily K. Vraga, Jisu Huh, Jaideep Srivastava
Affiliations: University of Minnesota, Twin Cities
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Incorrect information poses significant challenges by disrupting content veracity and integrity, yet most detection approaches struggle to jointly balance textual content verification with external knowledge modification under collapsed attention geometries. To address this issue, we propose a dual-head reasoning framework, BiMind, which disentangles content-internal reasoning from knowledge-augmented reasoning. In BiMind, we introduce three core innovations: (i) an attention geometry adapter that reshapes attention logits via token-conditioned offsets and mitigates attention collapse; (ii) a self-retrieval knowledge mechanism, which constructs an in-domain semantic memory through kNN retrieval and injects retrieved neighbors via feature-wise linear modulation; (iii) the uncertainty-aware fusion strategies, including entropy-gated fusion and a trainable agreement head, stabilized by a symmetric Kullback-Leibler agreement regularizer. To quantify the knowledge contributions, we define a novel metric, Value-of-eXperience (VoX), to measure instance-wise logit gains from knowledge-augmented reasoning. Experiment results on public datasets demonstrate that our BiMind model outperforms advanced detection approaches and provides interpretable diagnostics on when and why knowledge matters.
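
Entropy-gated fusion can be sketched as weighting the two heads by their predictive uncertainty. The specific gate below (a softmax over negative entropies with a temperature) is our assumption; the abstract does not give the exact form.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

def entropy_gated_fuse(logits_content, logits_knowledge, temp=5.0):
    """Fuse the two heads, trusting the lower-entropy (more confident) one.
    The softmax-over-negative-entropy gate is an illustrative assumption,
    not the paper's exact gating function."""
    p_c, p_k = softmax(logits_content), softmax(logits_knowledge)
    h = np.array([entropy(p_c), entropy(p_k)])
    w = softmax(-temp * h)               # low entropy -> high weight
    return w[0] * p_c + w[1] * p_k, w

# An uncertain content head vs. a confident knowledge head (2 classes).
fused, w = entropy_gated_fuse(np.array([0.0, 0.1]),   # near-uniform
                              np.array([4.0, 0.0]))   # confident
print(np.round(w, 3), int(fused.argmax()))
```

The confident knowledge head dominates the fused distribution, overriding the content head's weak vote for the other class; this is the qualitative behavior an entropy gate is meant to provide.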

[NLP-10] Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM -Assisted Analysis

Quick read: This paper addresses a blind spot in LLM-assisted analysis for domains such as biomedicine and finance: model outputs silently blend inference over the supplied data with memorized priors about named entities (epistemic contamination), and from a single output there is no way to tell which is which, making the analysis unauditable. The key idea, epistemic blinding, is an inference-time protocol that replaces entity identifiers with anonymous codes before prompting and compares outputs against an unblinded control, thereby quantifying how much of an output came from data-driven reasoning versus the model's parametric knowledge. The protocol does not make reasoning deterministic, but it restores a critical axis of auditability; it is validated on oncology drug-target prioritization and S&P 500 equity screening, and released as an open-source tool and a Claude Code skill to lower the barrier to adoption.

Link: https://arxiv.org/abs/2604.06013
Authors: Michael Cuccarese
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: code and LLM skill at: this https URL; 7 pages, 3 figures

Click to view abstract

Abstract:This paper presents epistemic blinding in the context of an agentic system that uses large language models to reason across multiple biological datasets for drug target prioritization. During development, it became apparent that LLM outputs silently blend data-driven inference with memorized priors about named entities - and the blend is invisible: there is no way to determine, from a single output, how much came from the data on the page and how much came from the model’s training memory. Epistemic blinding is a simple inference-time protocol that replaces entity identifiers with anonymous codes before prompting, then compares outputs against an unblinded control. The protocol does not make LLM reasoning deterministic, but it restores one critical axis of auditability: measuring how much of an output came from the supplied data versus the model’s parametric knowledge. The complete target identification system is described - including LLM-guided evolutionary optimization of scoring functions and blinded agentic reasoning for target rationalization - with demonstration that both stages operate without access to entity identity. In oncology drug target prioritization across four cancer types, blinding changes 16% of top-20 predictions while preserving identical recovery of validated targets. The contamination problem is shown to generalize beyond biology: in S&P 500 equity screening, brand-recognition bias reshapes 30-40% of top-20 rankings across five random seeds. To lower the barrier to adoption, the protocol is released as an open-source tool and as a Claude Code skill that enables one-command epistemic blinding within agentic workflows. The claim is not that blinded analysis produces better results, but that without blinding, there is no way to know to what degree the agent is adhering to the analytical process the researcher designed.
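
The core of the protocol is mechanical and easy to sketch: substitute anonymous codes for entity names before prompting, keep the mapping for unblinding, and later diff blinded against unblinded outputs. The entity names below are arbitrary examples, not taken from the paper's experiments.

```python
import re

def blind(text, entities):
    """Replace entity names with anonymous codes before prompting an LLM,
    returning the blinded text and the code->name map for unblinding.
    Longer names are substituted first to avoid partial-overlap bugs."""
    mapping = {}
    for i, name in enumerate(sorted(entities, key=len, reverse=True)):
        code = f"ENTITY_{i:03d}"
        mapping[code] = name
        text = re.sub(re.escape(name), code, text)
    return text, mapping

def unblind(text, mapping):
    for code, name in mapping.items():
        text = text.replace(code, name)
    return text

report = "EGFR and KRAS are overexpressed; EGFR shows strong essentiality."
blinded, mapping = blind(report, ["EGFR", "KRAS"])
restored = unblind(blinded, mapping)
print(blinded)
print(restored == report)
```

In the full protocol, the blinded text is what reaches the model; comparing rankings produced from `blinded` versus `report` is what quantifies prior contamination.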

[NLP-11] Disentangling MLP Neuron Weights in Vocabulary Space

Quick read: This paper addresses the interpretability of neuron weights in large language models: how to disentangle semantically coherent neuron representations directly in weight space, without forward passes over data. The key idea of the proposed ROTATE method is a statistical observation: neurons encoding a single, coherent concept exhibit high kurtosis when projected onto the vocabulary space. By optimizing rotations of neuron weights to maximize vocabulary-space kurtosis, ROTATE recovers sparse, interpretable directions, termed vocabulary channels, enabling precise characterization and selective ablation of neuron behavior and clearly outperforming activation-based baselines.

Link: https://arxiv.org/abs/2604.06005
Authors: Asaf Avrahamy, Yoav Gur-Arieh, Mor Geva
Affiliations: Blavatnik School of Computer Science and AI, Tel Aviv University
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Interpreting the information encoded in model weights remains a fundamental challenge in mechanistic interpretability. In this work, we introduce ROTATE (Rotation-Optimized Token Alignment in weighT spacE), a data-free method requiring no forward passes that disentangles MLP neurons directly in weight space. Our approach relies on a key statistical observation: neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model’s vocabulary. By optimizing rotations of neuron weights to maximize their vocabulary-space kurtosis, our method recovers sparse, interpretable directions which we name vocabulary channels. Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it demonstrate that ROTATE consistently recovers vocabulary channels that are faithful to the neuron’s behavior: ablating individual channels selectively disables corresponding input activations or the promotion of specific concepts. Moreover, aggregating channel-level descriptions yields comprehensive neuron descriptions that outperform optimized activation-based baselines by 2-3x in head-to-head comparisons. By providing a data-free decomposition of neuron weights, ROTATE offers a scalable, fine-grained building block for interpreting LMs.
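
The kurtosis statistic at the heart of ROTATE can be demonstrated with a random stand-in unembedding matrix: a direction aligned with a few token rows projects to a heavy-tailed (high-kurtosis) vocabulary profile, a mixture of two such directions does not, and a small rotation search over a 2-neuron basis recovers the pure directions. All sizes, the random matrix, and the 1-parameter grid search are illustrative simplifications of the paper's optimization.

```python
import numpy as np

rng = np.random.default_rng(5)
V, d = 4000, 256
E = rng.normal(size=(V, d)) / np.sqrt(d)     # stand-in unembedding matrix

def kurt(x):
    x = x - x.mean()
    return float(np.mean(x ** 4) / np.mean(x ** 2) ** 2)

# Two "monosemantic" weight directions, each aligned with 5 token rows,
# and an entangled 2-neuron basis that mixes them.
s1, s2 = E[:5].sum(axis=0), E[5:10].sum(axis=0)
m1, m2 = (s1 + s2) / np.sqrt(2), (s1 - s2) / np.sqrt(2)
print(kurt(E @ s1), kurt(E @ m1))   # mono > mixed in vocabulary space

def total_kurt(angle):
    c, s = np.cos(angle), np.sin(angle)
    u, v = c * m1 + s * m2, -s * m1 + c * m2   # rotate the neuron pair
    return kurt(E @ u) + kurt(E @ v)

# ROTATE in miniature: search rotations that maximize vocabulary-space
# kurtosis; the optimum recovers the pure directions (~45 degrees here).
angles = np.linspace(0.0, np.pi / 2, 91)
best = float(angles[np.argmax([total_kurt(a) for a in angles])])
print(np.degrees(best))
```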

[NLP-12] The Model Agreed But Didn't Learn: Diagnosing Surface Compliance in Large Language Models ACL2026

Quick read: This paper addresses the reliability and malleability of LLMs' internalized knowledge, which inevitably inherits staleness and errors, and questions whether existing knowledge-editing methods genuinely modify internal representations or merely appear to under standard evaluation. The key contribution is a diagnostic framework based on discriminative self-assessment under in-context learning (ICL) settings that better reflect real-world applications, designed to probe the subtle behavioral changes induced by edits. The probing reveals a pervasive "Surface Compliance" phenomenon, where editors score highly by mimicking target outputs without structurally overwriting internal beliefs, and shows that recursive edits accumulate representational residues that trigger cognitive instability, underscoring the importance of robust memory modification for trustworthy, sustainable LLM systems.

Link: https://arxiv.org/abs/2604.05995
Authors: Xiaojie Gu, Ziying Huang, Weicong Hong, Jian Xie, Renze Lou, Kai Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ACL 2026 Findings

Click to view abstract

Abstract:Large Language Models (LLMs) internalize vast world knowledge as parametric memory, yet inevitably inherit the staleness and errors of their source corpora. Consequently, ensuring the reliability and malleability of these internal representations is imperative for trustworthy real-world deployment. Knowledge editing offers a pivotal paradigm for surgically modifying memory without retraining. However, while recent editors demonstrate high success rates on standard benchmarks, it remains questionable whether current evaluation frameworks that rely on assessing output under specific prompting conditions can reliably authenticate genuine memory modification. In this work, we introduce a simple diagnostic framework that subjects models to discriminative self-assessment under in-context learning (ICL) settings that better reflect real-world application environments, specifically designed to scrutinize the subtle behavioral nuances induced by memory modifications. This probing reveals a pervasive phenomenon of Surface Compliance, where editors achieve high benchmark scores by merely mimicking target outputs without structurally overwriting internal beliefs. Moreover, we find that recursive modifications accumulate representational residues, triggering cognitive instability and permanently diminishing the reversibility of the model’s memory state. These insights underscore the risks of current editing paradigms and highlight the pivotal role of robust memory modification in building trustworthy, long-term sustainable LLM systems. Code is available at this https URL.

[NLP-13] Arch: An AI-Native Hardware Description Language for Register-Transfer Clocked Hardware Design

Quick read: This paper addresses the limitations of traditional hardware description languages (HDLs) for micro-architecture specification and AI-assisted code generation: HDLs such as SystemVerilog or VHDL express pipelines, FSMs, clock-domain crossings (CDC), and similar structures only as user-defined patterns, which invites subtle errors, offers no type-safety guarantees, and cooperates poorly with generative AI. The core solution is Arch, a new AI-native HDL in which clocks and resets are parameterized types (Clock<D>, Reset<S,P,D?>), turning CDC and reset-domain-crossing (RDC) analysis into compile-time typing rules; combined with static checks on bit widths, port directions, single-driver ownership, and combinational acyclicity, many common errors are caught before simulation. Arch also adopts a strict LL(1) grammar and a uniform declaration model so that natural-language specifications can be turned automatically into structurally correct, type-safe code, improving the safety, maintainability, and AI-friendliness of hardware design.

Link: https://arxiv.org/abs/2604.05983
Authors: Shuqing Zhao
Affiliations: Unknown
Subjects: Programming Languages (cs.PL); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:We present Arch (AI-native Register-transfer Clocked Hardware), a hardware description language designed from first principles for micro-architecture specification and AI-assisted code generation. Arch introduces first-class language constructs for pipelines, FSMs, FIFOs, arbiters, register files, buses, and clock-domain crossings – structures that existing HDLs express only as user-defined patterns prone to subtle errors. A central design choice is that clocks and resets are themselves parameterized types (Clock<D>, Reset<S,P,D?>) rather than ordinary nets, converting clock-domain crossing (CDC) and reset-domain crossing (RDC) analysis from external linter passes into compile-time typing rules. Combined with simultaneous tracking of bit widths, port directions, single-driver ownership, and combinational acyclicity, the type system catches multiple drivers, undriven ports, implicit latches, width mismatches, combinational loops, and unsynchronized domain crossings before any simulator runs. Every syntactic choice is governed by an AI-generatability contract: an LL(1) grammar requiring no backtracking or multi-token lookahead, no preprocessor or macros, a uniform declaration schema, named block endings, explicit directional connect arrows, and a todo! escape hatch enable LLMs to produce structurally correct, type-safe Arch from natural-language specifications without fine-tuning. The Arch compiler emits deterministic, lint-clean IEEE 1800-2017 SystemVerilog and provides an integrated simulation toolchain that generates compiled C++ models for cycle-accurate simulation. We present case studies of an 8-way set-associative L1 data cache and a synthesizable PG021-compatible AXI DMA controller (with Yosys and OpenSTA results on Sky130), and compare Arch to SystemVerilog, VHDL, Chisel, Bluespec, and other modern HDLs across expressiveness, safety, and AI suitability dimensions.

[NLP-14] Is CLIP Cross-Eyed? Revealing and Mitigating Center Bias in the CLIP Family

Quick read: This paper addresses the limited fine-grained visual understanding of contrastive vision-language models such as CLIP, identifying in particular a "center bias": the models disproportionately attend to the central region of an image and overlook important objects near the boundaries. The bias is traced to information loss during the aggregation of visual embeddings, especially the reliance on pooling, which causes concepts associated with off-center regions to vanish from the final representation. The key remedy is training-free: strategies such as visual prompting and attention redistribution redirect the model's attention toward off-center regions, effectively alleviating the bias.

Link: https://arxiv.org/abs/2604.05971
Authors: Oscar Chew, Hsiao-Ying Huang, Kunal Jain, Tai-I Chen, Khoa D Doan, Kuan-Hao Huang
Affiliations: Texas A&M University; National Taiwan University; ASUS; VinUniversity
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Recent research has shown that contrastive vision-language models such as CLIP often lack fine-grained understanding of visual content. While a growing body of work has sought to address this limitation, we identify a distinct failure mode in the CLIP family, which we term center bias, that persists even in recent model variants. Specifically, CLIP tends to disproportionately focus on the central region of an image, overlooking important objects located near the boundaries. This limitation is fundamental as failure to recognize relevant objects makes it difficult to perform any sophisticated tasks that depend on those objects. To understand the underlying causes of the limitation, we conduct analyses from both representation and attention perspectives. Using interpretability methods, i.e., embedding decomposition and attention map analysis, we find that relevant concepts especially those associated with off-center objects vanish from the model’s embedding in the final representation due to information loss during the aggregation of visual embeddings, particularly the reliance on pooling mechanisms. Finally, we show that this bias can be alleviated with training-free strategies such as visual prompting and attention redistribution by redirecting models’ attention to off-center regions.

[NLP-15] FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures

[Quick Read]: This paper tackles the semantic alignment and verification challenges in cross-jurisdiction financial reporting caused by differences in accounting taxonomies, tagging infrastructures (e.g., XBRL vs. PDF), and aggregation conventions. The core solution is an agentic workflow named FinReporting, which builds a unified canonical ontology over the Income Statement, Balance Sheet, and Cash Flow statement and decomposes reporting into auditable stages: filing acquisition, extraction, canonical mapping, and anomaly logging. The key innovation is deploying large language models (LLMs) as constrained verifiers rather than free-form generators, combined with explicit decision rules and evidence grounding, which markedly improves consistency and reliability under heterogeneous reporting regimes across annual filings from the US, Japan, and China.

Link: https://arxiv.org/abs/2604.05966
Authors: Fan Zhang,Mingzi Song,Rania Elbadry,Yankai Chen,Shaobo Wang,Yixi Zhou,Xunwen Zheng,Yueru He,Yuyang Dai,Georgi Georgiev,Ayesha Gull,Muhammad Usman Safder,Fan Wu,Liyuan Meng,Fengxian Ji,Junning Zhao,Xueqing Peng,Jimin Huang,Yu Chen,Xue (Steve) Liu,Preslav Nakov,Zhuohan Xie
Affiliations: MBZUAI; The University of Tokyo; Meiji Gakuin University; McGill University; Kyoto University; Columbia University; University of California, Berkeley; Sofia University “St. Kliment Ohridski”; Namal University; The Fin AI
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, including figures and tables

Click to view abstract

Abstract:Financial reporting systems increasingly use large language models (LLMs) to extract and summarize corporate disclosures. However, most assume a single-market setting and do not address structural differences across jurisdictions. Variations in accounting taxonomies, tagging infrastructures (e.g., XBRL vs. PDF), and aggregation conventions make cross-jurisdiction reporting a semantic alignment and verification challenge. We present FinReporting, an agentic workflow for localized cross-jurisdiction financial reporting. The system builds a unified canonical ontology over Income Statement, Balance Sheet, and Cash Flow, and decomposes reporting into auditable stages including filing acquisition, extraction, canonical mapping, and anomaly logging. Rather than using LLMs as free-form generators, FinReporting deploys them as constrained verifiers under explicit decision rules and evidence grounding. Evaluated on annual filings from the US, Japan, and China, the system improves consistency and reliability under heterogeneous reporting regimes. We release an interactive demo supporting cross-market inspection and structured export of localized financial statements. Our demo is available at this https URL . The video describing our system is available at this https URL

[NLP-16] Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration

[Quick Read]: This paper addresses the unreliability of current deep research agents when generating cross-domain research reports, which stems from the lack of effective assessment of content trustworthiness. Existing evaluation frameworks rely mostly on subjective dimensions and cannot measure the epistemic confidence of generated content; in open-ended research scenarios without ground-truth answers, agents are prone to hallucinated information that erodes user trust. The key to the solution is a novel deep research agent that incorporates progressive confidence estimation and calibration: its core is a deliberative search model combining deep retrieval with multi-hop reasoning, which assigns a confidence score to each claim during generation and, through a carefully designed workflow, grounds outputs in verifiable evidence, thereby substantially improving the transparency and trustworthiness of reports.

Link: https://arxiv.org/abs/2604.05952
Authors: Yi Yuan,Xuhong Wang,Shanzhe Lei
Affiliations: Shanghai Artificial Intelligence Laboratory; Southeast University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages, 3 tables, 2 figures

Click to view abstract

Abstract:As agent-based systems continue to evolve, deep research agents are capable of automatically generating research-style reports across diverse domains. While these agents promise to streamline information synthesis and knowledge exploration, existing evaluation frameworks, typically based on subjective dimensions, fail to capture a critical aspect of report quality: trustworthiness. In open-ended research scenarios where ground-truth answers are unavailable, current evaluation methods cannot effectively measure the epistemic confidence of generated content, making calibration difficult and leaving users susceptible to misleading or hallucinated information. To address this limitation, we propose a novel deep research agent that incorporates progressive confidence estimation and calibration within the report generation pipeline. Our system leverages a deliberative search model, featuring deep retrieval and multi-hop reasoning to ground outputs in verifiable evidence while assigning confidence scores to individual claims. Combined with a carefully designed workflow, this approach produces trustworthy reports with enhanced transparency. Experimental results and case studies demonstrate that our method substantially improves interpretability and significantly increases user trust.

[NLP-17] BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs ACL2026

[Quick Read]: This paper addresses the problem of efficiently selecting local attention heads to replace quadratic self-attention during post-training hybridization of large language models (LLMs). Existing methods either partition statically at the layer level (e.g., interleaving) or assign local/global roles to heads via static rankings from pre-training; the former ignores that local and global dependencies are routed through different heads within the same layer, while the latter suffers from "entanglement", where a head's behavior changes dynamically after hybridization, degrading performance. The key to the solution is BOSCH (Black-box Binary Optimization for Short-context Head Selection), a training-free black-box binary optimization method. Its core innovation is formulating the problem as a Large Neighborhood Search decomposed into three subproblems: (i) detecting layer importance via small-budget black-box probes; (ii) adaptively assigning per-layer sliding-window attention (SWA) ratios based on sensitivity; and (iii) performing grouped head-level optimization within ratio buckets. Experiments show that BOSCH significantly outperforms existing layer-level and static head-level methods across model scales (1.7B-30B parameters) and SWA ratios, and recovers long-context performance faster under continual pretraining, underscoring the importance of selecting heads dynamically for each target SWA ratio.

Link: https://arxiv.org/abs/2604.05942
Authors: Abbas Ghaddar,Ivan Kobyzev,Boxing Chen,Yufei Cui
Affiliations: Huawei Noah’s Ark Lab, Montreal Research Center, Canada
Subjects: Computation and Language (cs.CL)
Comments: ACL 2026 (Main Conference)

Click to view abstract

Abstract:Post-training hybridization of large language models (LLMs) often replaces quadratic self-attention with sliding-window attention (SWA) to reduce KV cache usage and improve latency. Existing hybridization schemes are typically defined either at the layer level (e.g., interleaving) or at the head level via static rankings from local to global. Layer-level schemes ignore that local and global dependencies are routed through heads within the same layer, while static head-level rankings suffer from entanglement: a head’s local/global behavior can change after hybridization. We propose BOSCH, Black-box Binary Optimization for Short-context Head Selection, a training-free method that formulates the problem as a Large Neighborhood Search and decomposes it into three subproblems: (i) layer-importance detection via small-budget black-box probes, (ii) adaptive per-layer SWA-ratio assignment based on these sensitivities, and (iii) grouped head-level optimization within ratio buckets. Extensive experiments on 4 LLMs ranging from 1.7B to 30B parameters, across 4 SWA ratios, show that BOSCH consistently outperforms layer-level heuristics and 6 strong static head-level methods, with larger gains at higher SWA ratios. Under continual pretraining, BOSCH recovers original long-context performance faster and to a higher level. Analysis of the selected heads reveals substantial turnover for BOSCH across different SWA ratios, underscoring the importance of performing head-level selection for each target ratio rather than relying on fixed locality rankings.

[NLP-18] “I See What You Did There”: Can Large Vision-Language Models Understand Multimodal Puns? ACL2026

[Quick Read]: This paper addresses the insufficient ability of current vision-language models (VLMs) to understand multimodal puns, a gap made worse by the lack of systematic evaluation benchmarks. The core challenge is enabling VLMs to distinguish genuine puns from adversarial non-pun distractors through cross-modal reasoning. The key to the solution is threefold: first, an end-to-end pipeline for generating multimodal puns; second, the MultiPun dataset, comprising diverse pun types alongside adversarial non-pun distractors; and third, prompt-level and model-level strategies that substantially enhance pun comprehension, yielding an average improvement of 16.5% in F1 scores.

Link: https://arxiv.org/abs/2604.05930
Authors: Naen Xu,Jiayi Sheng,Changjiang Li,Chunyi Zhou,Yuyuan Li,Tianyu Du,Jun Wang,Zhihui Fu,Jinbao Li,Shouling Ji
Affiliations: Zhejiang University; Beihang University; Palo Alto Networks; Hangzhou Dianzi University; Ningbo Global Innovation Center, Zhejiang University; OPPO Research Institute; Qilu University of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2026 Main

Click to view abstract

Abstract:Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements synergize to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks. To address this, we first propose a multimodal pun generation pipeline. We then introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors. Moreover, we propose both prompt-level and model-level strategies to enhance pun comprehension, with an average improvement of 16.5% in F1 scores. Our findings provide valuable insights for developing future VLMs that master the subtleties of human-like humor via cross-modal reasoning.

[NLP-19] The UNDO Flip-Flop: A Controlled Probe for Reversible Semantic State Management in State Space Model

[Quick Read]: This paper addresses the problem that existing state space models (SSMs), despite being theoretically capable of expressing stack-based rollback mechanisms, fail to learn them reliably in practice, particularly reversible semantic state retrieval under non-monotonic update sequences. The key to the solution is a new benchmark, the UNDO Flip-Flop task, which extends the standard Flip-Flop task with an UNDO operation, forcing the model to maintain an implicit bounded stack and recover historical states, thereby isolating the requirement for reversible state retrieval. Experiments show that even theoretically expressive one- and two-layer Mamba-2 models fail to learn the correct stack mechanism and instead collapse to a local toggle heuristic; causal ablation further reveals that the failure bottleneck lies in state retrieval rather than storage, highlighting a systematic gap between an architecture's theoretical expressivity and what gradient descent reliably learns.

Link: https://arxiv.org/abs/2604.05923
Authors: Hongxu Zhou
Affiliations: Saarland University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:State space models (SSMs) have been shown to possess the theoretical capacity to model both star-free sequential tasks and bounded hierarchical structures (Sarrof et al., 2024). However, formal expressivity results do not guarantee that gradient-based optimisation will reliably discover the corresponding solutions. Existing benchmarks probe either monotonic state tracking, as in the standard Flip-Flop task, or structural nesting, as in the Dyck languages, but neither isolates reversible semantic state retrieval. We introduce the UNDO Flip-Flop task to fill this gap. By extending the standard Flip-Flop with an UNDO, the task requires a model to maintain an implicit bounded stack and recover historical states under non-monotonic update sequences. We evaluate one-layer and two-layer Mamba-2 under this framework. Both variants fail to acquire the provably expressible stack-based rollback mechanism, converging instead on a local toggle heuristic that inverts the current state rather than retrieving stored history. Under an adversarial retraction pressure test held within the training length distribution, the two-layer model collapses to 41.10% accuracy, which is below random chance. The results confirm systematic rather than incidental failure. Causal ablation shows that the bottleneck lies in retrieval, not storage. These results draw a clear line between what an architecture can in principle represent and what gradient descent reliably learns, a distinction that theoretical expressivity analyses alone cannot capture.

[NLP-20] FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

[Quick Read]: This paper addresses a core gap in evaluating large language models (LLMs) in knowledge-intensive domains such as finance: the lack of reliable benchmarks for measuring real professional competence. Existing benchmarks fail to reflect the complexity and professional demands of real-world financial modeling tasks, and current deployments lack clear accountability mechanisms for output quality and traceability. The key to the solution is FrontierFinance, a long-horizon benchmark of 25 complex financial modeling tasks across five core financial models, each requiring more than 18 hours of skilled human labor on average. Developed with financial professionals, the benchmark follows industry-standard modeling workflows and is paired with detailed grading rubrics, and human experts both perform the tasks as baselines and grade the models. The results show that human experts not only score higher on average but are also more likely to produce client-ready outputs, highlighting the shortcomings of current state-of-the-art models in professional practice.

Link: https://arxiv.org/abs/2604.05912
Authors: Michael Krumdick,Varshini Reddy,Shivani Chaudhary,William Day,Maarij Ahmed,Hayan Haqqi,Muhammad Ahsen Fahim,Hanzallah Amjad,Ahmad Orakzai,Aqsa Gul,Chris Tanner
Affiliations: Kensho Technologies; S&P Global; MIT
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:As concerns surrounding AI-driven labor displacement intensify in knowledge-intensive sectors, existing benchmarks fail to measure performance on tasks that define practical professional expertise. Finance, in particular, has been identified as a domain with high AI exposure risk, yet lacks robust benchmarks to track real-world developments. This gap is compounded by the absence of clear accountability mechanisms in current Large Language Model (LLM) deployments. To address this, we introduce FrontierFinance, a long-horizon benchmark of 25 complex financial modeling tasks across five core finance models, requiring an average of over 18 hours of skilled human labor per task to complete. Developed with financial professionals, the benchmark reflects industry-standard financial modeling workflows and is paired with detailed rubrics for structured evaluation. We engage human experts to define the tasks, create rubrics, grade LLMs, and perform the tasks themselves as human baselines. We demonstrate that our human experts both receive higher scores on average, and are more likely to provide client-ready outputs than current state-of-the-art systems.

[NLP-21] FRENCH-YMCA: A FRENCH Corpus meeting the language needs of Youth froM Children to Adolescents

[Quick Read]: This paper addresses the lack of dedicated training data for modeling the language understanding and production of children and adolescents, whose linguistic skills are still developing and differ markedly from those of adults. The key to the solution is the French-YMCA corpus, comprising 39,200 text files (22,471,898 words in total), distinguished by its diverse sources, consistent grammar and spelling, and open online accessibility. It can serve as a foundational resource for training youth-oriented language models, improving the understanding of young users' language in digital interactions and ensuring that responses and suggestions are age-appropriate.

Link: https://arxiv.org/abs/2604.05899
Authors: Cherifa Ben Khelil,Jean-Yves Antoine,Anaïs Halftermeyer,Frédéric Rayar,Mathieu Thebaud
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 5 pages, 1 figure

Click to view abstract

Abstract:In this paper, we introduce the French-YMCA corpus, a new linguistic resource specifically tailored for children and adolescents. The motivation for building this corpus is clear: children have unique language requirements, as their language skills are in constant evolution and differ from those of adults. With an extensive collection of 39,200 text files, the French-YMCA corpus encompasses a total of 22,471,898 words. It distinguishes itself through its diverse sources, consistent grammar and spelling, and the commitment to providing open online accessibility for all. Such corpus can serve as the foundation for training language models that understand and anticipate youth’s language, thereby enhancing the quality of digital interactions and ensuring that responses and suggestions are age-appropriate and adapted to the comprehension level of users of this age.

[NLP-22] Mechanistic Circuit-Based Knowledge Editing in Large Language Models

[Quick Read]: This paper addresses the "Reasoning Gap" that arises when updating the pre-trained knowledge of large language models (LLMs) in dynamic environments: existing knowledge editing methods can reliably patch isolated facts but fail to exploit the edited knowledge in multi-step reasoning chains. The key to the solution is MCircKE (Mechanistic Circuit-based Knowledge Editing), which identifies the causal circuits relevant to a specific reasoning task, precisely mapping both where the fact is stored and how its logical consequences are routed, and then makes minimally invasive parameter adjustments only within that circuit, thereby making knowledge edits consistent with downstream reasoning.

Link: https://arxiv.org/abs/2604.05876
Authors: Tianyi Zhao,Yinhan He,Wendy Zheng,Chen Chen
Affiliations: University of Virginia
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Deploying Large Language Models (LLMs) in real-world dynamic environments raises the challenge of updating their pre-trained knowledge. While existing knowledge editing methods can reliably patch isolated facts, they frequently suffer from a “Reasoning Gap”, where the model recalls the edited fact but fails to utilize it in multi-step reasoning chains. To bridge this gap, we introduce MCircKE (Mechanistic Circuit-based Knowledge Editing), a novel framework that enables a precise “map-and-adapt” editing procedure. MCircKE first identifies the causal circuits responsible for a specific reasoning task, capturing both the storage of the fact and the routing of its logical consequences. It then surgically updates parameters exclusively within this mapped circuit. Extensive experiments on the MQuAKE-3K benchmark demonstrate the effectiveness of the proposed method for multi-hop reasoning in knowledge editing.

[NLP-23] Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts

[Quick Read]: This paper addresses two core problems facing the deployment of large language models (LLMs) in Swiss financial and regulatory contexts: the lack of empirical evidence on production reliability, and the missing adversarial-security dimension, which existing Swiss-focused evaluation frameworks have not jointly operationalized. The key to the solution is Swiss-Bench 003 (SBP-003), which extends the HAAS (Helvetic AI Assessment Score) framework to eight dimensions (adding D7, a self-graded reliability proxy, and D8, adversarial security) and evaluates ten frontier models zero-shot in four languages (German, French, Italian, English) across seven subtasks tailored to FINMA Guidance 08/2024, the revised Federal Act on Data Protection (nDSG), and the OWASP LLM risk catalog. Self-graded D7 scores substantially exceed externally judged D8 security scores, revealing a significant gap between models' subjective confidence and their actual defensive capabilities and providing quantitative evidence that maps to concrete regulatory requirements.

Link: https://arxiv.org/abs/2604.05872
Authors: Fatih Uenal
Affiliations: University of Colorado Boulder
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 23 pages, 5 figures, 8 tables

Click to view abstract

Abstract:The deployment of large language models (LLMs) in Swiss financial and regulatory contexts demands empirical evidence of both production reliability and adversarial security, dimensions not jointly operationalized in existing Swiss-focused evaluation frameworks. This paper introduces Swiss-Bench 003 (SBP-003), extending the HAAS (Helvetic AI Assessment Score) from six to eight dimensions by adding D7 (Self-Graded Reliability Proxy) and D8 (Adversarial Security). I evaluate ten frontier models across 808 Swiss-specific items in four languages (German, French, Italian, English), comprising seven Swiss-adapted benchmarks (Swiss TruthfulQA, Swiss IFEval, Swiss SimpleQA, Swiss NIAH, Swiss PII-Scope, System Prompt Leakage, and Swiss German Comprehension) targeting FINMA Guidance 08/2024, the revised Federal Act on Data Protection (nDSG), and OWASP Top 10 for LLMs. Self-graded D7 scores (73-94%) exceed externally judged D8 security scores (20-61%) by a wide margin, though these dimensions use non-comparable scoring regimes. System prompt leakage resistance ranges from 24.8% to 88.2%, while PII extraction defense remains weak (14-42%) across all models. Qwen 3.5 Plus achieves the highest self-graded D7 score (94.4%), while GPT-oss 120B achieves the highest D8 score (60.7%) despite being the lowest-cost model evaluated. All evaluations are zero-shot under provider default settings; D7 is self-graded and does not constitute independently validated accuracy. I provide conceptual mapping tables relating benchmark dimensions to FINMA model validation requirements, nDSG data protection obligations, and OWASP LLM risk categories.

[NLP-24] Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models

[Quick Read]: This paper addresses how to optimize sampling strategies for large reasoning models (LRMs) to obtain high-quality solutions. Specifically, it compares the effectiveness of sequential versus parallel sampling and seeks to explain why sequential sampling, despite greater representation power in principle, underperforms parallel sampling. The key to the solution is an empirical analysis testing three hypotheses: the effect of the aggregation operator, of context-length constraints, and of reduced exploration. The results show that the lack of exploration (sequential sampling conditions on previous answers, limiting diversity) is the main cause of the performance gap, providing a basis and direction for improving sampling strategies.

Link: https://arxiv.org/abs/2604.05868
Authors: Xiangming Gu,Soham De,Larisa Markeeva,Petar Veličković,Razvan Pascanu
Affiliations: Google DeepMind; National University of Singapore
Subjects: Computation and Language (cs.CL)
Comments: Under review

Click to view abstract

Abstract:Large Reasoning Models (LRMs) have shown remarkable performance on challenging questions, such as math and coding. However, to obtain a high quality solution, one may need to sample more than once. In principle, there are two sampling strategies that can be composed to form more complex processes: sequential sampling and parallel sampling. In this paper, we first compare these two approaches with rigor, and observe, aligned with previous works, that parallel sampling seems to outperform sequential sampling even though the latter should have more representation power. To understand the underlying reasons, we make three hypotheses about this behavior: (i) parallel sampling outperforms due to the aggregator operator; (ii) sequential sampling is harmed by needing to use longer contexts; (iii) sequential sampling leads to less exploration due to conditioning on previous answers. The empirical evidence on various model families and sizes (Qwen3, DeepSeek-R1 distilled models, Gemini 2.5) and question domains (math and coding) suggests that the aggregation and context length do not seem to be the main culprit behind the performance gap. In contrast, the lack of exploration seems to play a considerably larger role, and we argue that this is one main cause for the performance gap.
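
The exploration hypothesis (iii) can be made concrete with a toy Monte-Carlo model; everything below is our own construction, not the paper's experimental setup. Correctness of each sample is a Bernoulli draw, and sequential sampling is modeled as copying the previous answer with probability rho, which mimics conditioning on prior answers:

```python
import random

def majority_correct(draw, n, trials=2000, seed=0):
    """Monte-Carlo accuracy of majority voting over n answer samples.
    `draw` returns a list of n booleans (True = correct answer)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        votes = draw(n, rng)
        wins += sum(votes) > n / 2
    return wins / trials

def parallel(n, rng, p=0.6):
    # Independent samples: each attempt explores the answer space afresh.
    return [rng.random() < p for _ in range(n)]

def sequential(n, rng, p=0.6, rho=0.8):
    # Each sample copies the previous answer with probability rho,
    # mimicking reduced exploration under conditioning.
    votes = [rng.random() < p]
    for _ in range(n - 1):
        votes.append(votes[-1] if rng.random() < rho else rng.random() < p)
    return votes

acc_par = majority_correct(parallel, n=9)
acc_seq = majority_correct(sequential, n=9)
print(acc_par > acc_seq)  # correlation erases the benefit of extra samples
```

Even with the same per-sample accuracy, the correlated chain gains less from majority voting, mirroring the exploration gap the paper identifies.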

[NLP-25] LoRM: Learning the Language of Rotating Machinery for Self-Supervised Condition Monitoring

[Quick Read]: This paper addresses the limitations of traditional signal processing for industrial rotating machinery, which relies on hand-crafted transforms and features and struggles with unified multi-modal modeling and real-time condition monitoring. The key to the solution is the LoRM (Language of Rotating Machinery) framework, which treats rotating-machinery signals as a "machine language": local signal segments are tokenised into discrete symbols, and a pre-trained language model is used for sequence-prediction modeling. Concretely, the observed context segments are kept in continuous form while future target segments are quantised into discrete tokens; knowledge transfer is achieved via partial fine-tuning, and real-time condition monitoring is performed by using prediction error as a health indicator, demonstrating strong cross-tool generalisation and practicality.

Link: https://arxiv.org/abs/2604.05863
Authors: Xiao Qin,Xingyi Song,Tong Liu,Hatim Laalej,Zepeng Liu,Yunpeng Zhu,Ligang He
Affiliations: University of Warwick; Queen Mary University of London; University of Sheffield; Tongji University
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:We present LoRM (Language of Rotating Machinery), a self-supervised framework for multi-modal rotating-machinery signal understanding and real-time condition monitoring. LoRM is built on the idea that rotating-machinery signals can be viewed as a machine language: local signals can be tokenised into discrete symbolic units, and their future evolution can be predicted from observed multi-sensor context. Unlike conventional signal-processing methods that rely on hand-crafted transforms and features, LoRM reformulates multi-modal sensor data as a token-based sequence-prediction problem. For each data window, the observed context segment is retained in continuous form, while the future target segment of each sensing channel is quantised into a discrete token. Then, efficient knowledge transfer is achieved by partially fine-tuning a general-purpose pre-trained language model on industrial signals, avoiding the need to train a large model from scratch. Finally, condition monitoring is performed by tracking token-prediction errors as a health indicator, where increasing errors indicate degradation. In-situ tool condition monitoring (TCM) experiments demonstrate stable real-time tracking and strong cross-tool generalisation, showing that LoRM provides a practical bridge between language modelling and industrial signal analysis. The source code is publicly available at this https URL.
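
The monitoring step, tracking token-prediction errors as a health indicator, can be sketched as a simple threshold alarm. The rolling window, calibration period, and mean-plus-k-sigma threshold below are illustrative assumptions, not LoRM's actual procedure:

```python
from collections import deque

def health_monitor(errors, window=5, k=3.0, calib=20):
    """Flag degradation when the rolling mean of token-prediction errors
    exceeds mean + k*std of an initial healthy calibration period.
    window, k, and calib are illustrative choices."""
    base = errors[:calib]
    mu = sum(base) / len(base)
    sd = (sum((e - mu) ** 2 for e in base) / len(base)) ** 0.5
    thresh = mu + k * sd
    buf, alarms = deque(maxlen=window), []
    for t, e in enumerate(errors):
        buf.append(e)
        alarms.append(t >= calib and sum(buf) / len(buf) > thresh)
    return alarms

# Healthy phase (low, noisy error) followed by degradation (rising error).
trace = [0.10, 0.12, 0.11, 0.09, 0.10] * 4 + [0.3, 0.5, 0.8, 1.0, 1.2]
flags = health_monitor(trace)
print(flags.index(True))  # → 20, the first step after degradation begins
```

Rising prediction error past the calibrated threshold is exactly the "increasing errors indicate degradation" signal described in the abstract.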

[NLP-26] Evaluating Learner Representations for Differentiation Prior to Instructional Outcomes

[Quick Read]: This paper addresses the evaluation of learner representations in educational AI systems: when instructional outcomes are unavailable or highly context-dependent, how can one judge whether such representations preserve meaningful differences between students? The key to the solution is "distinctiveness", a representation-level measure that quantifies how distinguishable each learner is via pairwise distances to peers, without requiring clustering, labels, or task-specific evaluation. The study shows that representations built from a student's overall interaction patterns yield higher separation, stronger clustering structure, and more reliable pairwise discrimination than representations based on individual interactions, demonstrating that distinctiveness can serve as a pre-deployment criterion, independent of instructional outcomes, for diagnosing whether a representation supports differentiated modeling or personalization.

Link: https://arxiv.org/abs/2604.05848
Authors: Junsoo Park,Youssef Medhat,Htet Phyo Wai,Ploy Thajchayapong,Ashok K. Goel
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to AIED 2026

Click to view abstract

Abstract:Learner representations play a central role in educational AI systems, yet it is often unclear whether they preserve meaningful differences between students when instructional outcomes are unavailable or highly context-dependent. This work examines how to evaluate learner representations based on whether they retain separation between learners under a shared comparison rule. We introduce distinctiveness, a representation-level measure that evaluates how each learner differs from others in the cohort using pairwise distances, without requiring clustering, labels, or task-specific evaluation. Using student-authored questions collected through a conversational AI agent in an online learning environment, we compare representations based on individual questions with representations that aggregate patterns across a student’s interactions over time. Results show that learner-level representations yield higher separation, stronger clustering structure, and more reliable pairwise discrimination than interaction-level representations. These findings demonstrate that learner representations can be evaluated independently of instructional outcomes and provide a practical pre-deployment criterion using distinctiveness as a diagnostic metric for assessing whether a representation supports differentiated modeling or personalization.
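
A minimal sketch of a pairwise-distance distinctiveness score of the kind described above; the exact formula used here (mean Euclidean distance to all peers) is our assumption, not necessarily the paper's definition:

```python
def distinctiveness(X):
    """Per-learner distinctiveness: mean Euclidean distance from each
    learner's vector to every other learner in the cohort. A label-free,
    representation-level score needing no clustering or task-specific
    evaluation."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    n = len(X)
    return [sum(dist(X[i], X[j]) for j in range(n) if j != i) / (n - 1)
            for i in range(n)]

# Toy cohort: three similar learners and one outlier.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0]]
scores = distinctiveness(X)
print(scores.index(max(scores)))  # → 3, the outlier stands out most
```

A representation that collapses all learners to similar vectors would yield uniformly low scores, which is precisely the failure the measure is meant to expose.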

[NLP-27] AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning ACL2026

[Quick Read]: This paper addresses the difficulty current large language models (LLMs) have in exploiting topologically structured external information when handling complex relational data: existing agentic frameworks treat external information as unstructured text and ignore the topological dependencies of real-world data. To this end, the authors propose the Agentic Graph Learning (AGL) paradigm, which reframes graph learning as an interleaved process of topology-aware navigation and LLM reasoning. The key to the solution is AgentGL, the first reinforcement-learning (RL)-driven AGL framework: it equips the LLM agent with graph-native tools for multi-scale exploration, regulates tool usage via search-constrained thinking to balance accuracy and efficiency, and employs graph-conditioned curriculum reinforcement learning to stabilize long-horizon policy learning without step-wise supervision. Experiments show that AgentGL substantially outperforms mainstream GraphLLM and GraphRAG baselines across diverse Text-Attributed Graph (TAG) benchmarks, validating AGL as a promising new frontier for LLMs to autonomously navigate and reason over complex relational environments.

Link: https://arxiv.org/abs/2604.05846
Authors: Yuanfu Sun,Kang Li,Dongzhe Fan,Jiajin Liu,Qiaoyu Tan
Affiliations: New York University Shanghai; New York University; Tsinghua University
Subjects: Computation and Language (cs.CL)
Comments: ACL 2026 Main Conference

Click to view abstract

Abstract:Large Language Models (LLMs) increasingly rely on agentic capabilities-iterative retrieval, tool use, and decision-making-to overcome the limits of static, parametric knowledge. Yet existing agentic frameworks treat external information as unstructured text and fail to leverage the topological dependencies inherent in real-world data. To bridge this gap, we introduce Agentic Graph Learning (AGL), a paradigm that reframes graph learning as an interleaved process of topology-aware navigation and LLM-based inference. Specifically, we propose AgentGL, the first reinforcement learning (RL)-driven framework for AGL. AgentGL equips an LLM agent with graph-native tools for multi-scale exploration, regulates tool usage via search-constrained thinking to balance accuracy and efficiency, and employs a graph-conditioned curriculum RL strategy to stabilize long-horizon policy learning without step-wise supervision. Across diverse Text-Attributed Graph (TAG) benchmarks and multiple LLM backbones, AgentGL substantially outperforms strong GraphLLMs and GraphRAG baselines, achieving absolute improvements of up to 17.5% in node classification and 28.4% in link prediction. These results demonstrate that AGL is a promising frontier for enabling LLMs to autonomously navigate and reason over complex relational environments. The code is publicly available at this https URL.

[NLP-28] “OK Aura Be Fair With Me”: Demographics-Agnostic Training for Bias Mitigation in Wake-up Word Detection LREC2026

[Quick Read]: This paper addresses performance disparities in wake-up word detection caused by demographic differences in sex, age, and accent, i.e., significant bias in recognition accuracy across speaker groups. The key to the solution is a demographics-agnostic (label-free) training strategy with two core components: (i) data augmentation to improve generalization over diverse speech characteristics, and (ii) knowledge distillation from pre-trained foundational speech models. Experiments show the approach markedly reduces cross-group prediction disparities, e.g., Predictive Disparity reductions of 39.94% for sex, 83.65% for age, and 40.48% for accent, yielding more equitable wake-up word detection.

Link: https://arxiv.org/abs/2604.05830
Authors: Fernando López,Paula Delgado-Santos,Pablo Gómez,David Solans,Jordi Luque
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at Speech Language Models in Low-Resource Settings: Performance, Evaluation, and Bias Analysis (SPEAKABLE) - LREC2026 Workshops

Click to view abstract

Abstract:Voice-based interfaces are widely used; however, achieving fair Wake-up Word detection across diverse speaker populations remains a critical challenge due to persistent demographic biases. This study evaluates the effectiveness of demographics-agnostic training techniques in mitigating performance disparities among speakers of varying sex, age, and accent. We utilize the OK Aura database for our experiments, employing a training methodology that excludes demographic labels, which are reserved for evaluation purposes. We explore (i) data augmentation techniques to enhance model generalization and (ii) knowledge distillation of pre-trained foundational speech models. The experimental results indicate that these demographics-agnostic training techniques markedly reduce demographic bias, leading to a more equitable performance profile across different speaker groups. Specifically, one of the evaluated techniques achieves a Predictive Disparity reduction of 39.94% for sex, 83.65% for age, and 40.48% for accent when compared to the baseline. This study highlights the effectiveness of label-agnostic methodologies in fostering fairness in Wake-up Word detection.
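
One simple way to operationalize a predictive-disparity style metric is the spread of per-group detection rates; the paper's exact Predictive Disparity definition may differ, so treat this as a hedged sketch on toy data:

```python
def predictive_disparity(scores_by_group):
    """Spread of per-group detection rates: max minus min across groups.
    A common way to quantify demographic disparity; the paper's exact
    definition may differ (this formula is an assumption)."""
    rates = {g: sum(v) / len(v) for g, v in scores_by_group.items()}
    return max(rates.values()) - min(rates.values())

# Toy wake-word hit/miss outcomes (1 = detected) per accent group.
baseline = {"accent_a": [1, 1, 1, 1, 0], "accent_b": [1, 0, 0, 1, 0]}
debiased = {"accent_a": [1, 1, 1, 0, 1], "accent_b": [1, 1, 0, 1, 1]}
d0, d1 = predictive_disparity(baseline), predictive_disparity(debiased)
print(f"reduction: {(d0 - d1) / d0:.0%}")  # → reduction: 100%
```

Relative reductions like the 39.94% / 83.65% / 40.48% figures in the abstract can be read as exactly this kind of before/after ratio on a disparity metric.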

[NLP-29] Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation ACL2026

[Quick Read]: This paper addresses the lack of systematic evaluation frameworks for generative AI in psychotherapy settings: existing models exhibit surface-level conversational competence but cannot guarantee that their responses adhere to core clinical therapeutic principles (non-judgmental acceptance, warmth, respect for autonomy, active listening, reflective understanding, and situational appropriateness). The key to the solution is the CARE multi-stage evaluation framework, whose core innovations include: a fine-grained ordinal rating scale over the six therapeutic principles; the FAITH-M benchmark dataset with expert annotations; and the integration of intra-dialogue context modeling, contrastive exemplar retrieval, and knowledge-distilled chain-of-thought reasoning, jointly leveraging structured reasoning and context awareness. Experiments show CARE improves F1 by 64.26% over the baseline, confirming that the gains come from structured reasoning and context modeling rather than sheer model capacity.

Link: https://arxiv.org/abs/2604.05795
Authors: Abdullah Mazhar,Het Riteshkumar Shah,Aseem Srivastava,Smriti Joshi,Md Shad Akhtar
Affiliations: IIIT Delhi, India; MBZUAI, UAE; Wysa
Subjects: Computation and Language (cs.CL)
Comments: Accepted at ACL 2026 (Main)

Click to view abstract

Abstract:The increasing use of large language models in mental health applications calls for principled evaluation frameworks that assess alignment with psychotherapeutic best practices beyond surface-level fluency. While recent systems exhibit conversational competence, they lack structured mechanisms to evaluate adherence to core therapeutic principles. In this paper, we study the problem of evaluating AI-generated therapist-like responses for clinically grounded appropriateness and effectiveness. We assess each therapist’s utterance along six therapeutic principles: non-judgmental acceptance, warmth, respect for autonomy, active listening, reflective understanding, and situational appropriateness, using a fine-grained ordinal scale. We introduce FAITH-M, a benchmark annotated with expert-assigned ordinal ratings, and propose CARE, a multi-stage evaluation framework that integrates intra-dialogue context, contrastive exemplar retrieval, and knowledge-distilled chain-of-thought reasoning. Experiments show that CARE achieves an F-1 score of 63.34, versus 38.56 for the strong baseline Qwen3 (which also serves as its backbone), a 64.26% relative improvement, indicating that gains arise from structured reasoning and contextual modeling rather than backbone capacity alone. Expert assessment and external dataset evaluations further demonstrate robustness under domain shift, while highlighting challenges in modelling implicit clinical nuance. Overall, CARE provides a clinically grounded framework for evaluating therapeutic fidelity in AI mental health systems.

[NLP-30] What Models Know How Well They Know It: Knowledge-Weighted Fine-Tuning for Learning When to Say “I Dont Know”

[Quick Read]: This paper addresses hallucination in large language models (LLMs) when answering user queries, which stems from knowledge misalignment between pre-training and fine-tuning. The key to the solution is reliably estimating a fine-grained, instance-level knowledge score via multi-sampled inference, scaling the strength of the learning signal according to that score, and encouraging explicit "I don't know" responses for questions beyond the model's knowledge. This mechanism lets the model express uncertainty explicitly when it lacks knowledge while maintaining accuracy on questions it can answer; newly proposed uncertainty metrics further verify that accurately discriminating known from unknown instances consistently improves performance.

Link: https://arxiv.org/abs/2604.05779
Authors: Joosung Lee,Hwiyeol Jo,Donghyeon Ko,Kyubyung Chae,Cheonbok Park,Jeonghoon Kim
Affiliations: NAVER CLOUD; Seoul National University; KAIST
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages

Click to view abstract

Abstract:While large language models (LLMs) demonstrate strong capabilities across diverse user queries, they still suffer from hallucinations, often arising from knowledge misalignment between pre-training and fine-tuning. To address this misalignment, we reliably estimate a fine-grained, instance-level knowledge score via multi-sampled inference. Using the knowledge score, we scale the learning signal according to the model’s existing knowledge, while encouraging explicit “I don’t know” responses for out-of-scope queries. Experimental results show that this approach allows the model to explicitly express uncertainty when it lacks knowledge, while maintaining accuracy on questions it can answer. Furthermore, we propose evaluation metrics for uncertainty, showing that accurate discrimination between known and unknown instances consistently improves performance.
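
The knowledge-weighted recipe (multi-sample, score the instance, then scale the signal or relabel to a refusal) can be sketched as follows; the 0.3 threshold and the refusal string are illustrative choices, not the paper's:

```python
def knowledge_score(samples, reference):
    """Instance-level knowledge score: fraction of multi-sampled answers
    matching the reference. Near 1 means the model already knows the
    fact; near 0 means it does not."""
    return sum(a == reference for a in samples) / len(samples)

def build_target(samples, reference, idk_threshold=0.3):
    """Hedged sketch of knowledge-weighted fine-tuning: keep the gold
    answer (with the score as a loss weight) for known instances, and
    relabel out-of-scope instances as an explicit refusal. The threshold
    and refusal string are illustrative, not the paper's."""
    score = knowledge_score(samples, reference)
    if score < idk_threshold:
        return "I don't know", 1.0        # teach abstention on unknowns
    return reference, score               # scale signal by existing knowledge

known = ["Paris", "Paris", "Paris", "Lyon", "Paris"]
unknown = ["1881", "1975", "1904", "1923", "1881"]
print(build_target(known, "Paris"))    # → ('Paris', 0.8)
print(build_target(unknown, "1954"))   # → ("I don't know", 1.0)
```

The returned weight would then multiply the per-example fine-tuning loss, so confidently known facts are reinforced while unknown ones are redirected to abstention.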

[NLP-31] PhageBench: Can LLM s Understand Raw Bacteriophage Genomes?

【速读】: 该论文旨在解决通用大语言模型(Large Language Models, LLMs)在直接解析噬菌体基因组原始核苷酸序列并执行生物推理方面的能力不足问题。当前,尽管LLMs在理解生物学文本方面表现优异,但其对基因组序列的结构化理解和复杂功能定位仍存在显著局限。解决方案的关键在于构建PhageBench——首个模拟生物信息学专家工作流程的基准测试集,包含5,600个高质量样本,覆盖筛查(Screening)、质量控制(Quality Control)和表型注释(Phenotype Annotation)三个阶段的五项核心任务。通过该基准评估八种主流LLMs,研究发现通用推理模型在噬菌体连续片段识别和宿主预测任务中显著优于随机基线,但在涉及长程依赖关系和精细功能定位的复杂推理任务中仍表现欠佳,从而凸显了开发具备更强生物序列推理能力的下一代模型的必要性。

链接: https://arxiv.org/abs/2604.05775
作者: Yusen Hou,Weicai Long,Haitao Hu,Houcheng Su,Junning Feng,Yanlin Zhang
机构: Hong Kong University of Science and Technology (Guangzhou)
类目: Computation and Language (cs.CL); Genomics (q-bio.GN)
备注:

点击查看摘要

Abstract:Bacteriophages, often referred to as the dark matter of the biosphere, play a critical role in regulating microbial ecosystems and in antibiotic alternatives. Thus, accurate interpretation of their genomes holds significant scientific and practical value. While general-purpose Large Language Models (LLMs) excel at understanding biological texts, their ability to directly interpret raw nucleotide sequences and perform biological reasoning remains underexplored. To address this, we introduce PhageBench, the first benchmark designed to evaluate phage genome understanding by mirroring the workflow of bioinformatics experts. The dataset contains 5,600 high-quality samples covering five core tasks across three stages: Screening, Quality Control, and Phenotype Annotation. Our evaluation of eight LLMs reveals that general-purpose reasoning models significantly outperform random baselines in phage contig identification and host prediction, demonstrating promising potential for genomic understanding. However, they exhibit significant limitations in complex reasoning tasks involving long-range dependencies and fine-grained functional localization. These findings highlight the necessity of developing next-generation models with enhanced reasoning capabilities for biological sequences.

[NLP-32] Beyond the Beep: Scalable Collision Anticipation and Real-Time Explainability with BADAS-2.0

【速读】: 该论文旨在解决高级驾驶辅助系统(ADAS)中碰撞预测的准确性与可解释性不足的问题,尤其是在长尾罕见场景下的性能瓶颈。其解决方案的关键在于三个方面:首先,构建了一个包含10组长尾类别的基准测试,并利用BADAS-1.0作为主动标注代理筛选高风险样本,结合Nexar Atlas平台定向采集数据,将标注视频从4万扩展至17.85万(约200万个片段),显著提升在最难长尾类别上的预测准确率;其次,通过在225万段无标签驾驶视频上进行领域特定自监督预训练,实现知识蒸馏到轻量化模型(BADAS-2.0-Flash和Flash-Lite),分别仅含86M和22M参数,在保持接近原始精度的同时获得7–12倍的速度提升,支持实时边缘部署;最后,引入基于对象中心注意力热图的可解释机制,并结合视觉语言模型生成结构化文本推理(BADAS-Reason),实现对预测依据的实时解释,从而增强系统的可信度与透明度。

链接: https://arxiv.org/abs/2604.05767
作者: Roni Goldshmidt,Hamish Scott,Lorenzo Niccolini,Hernan Matzner
机构: Nexar AI
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present BADAS-2.0, the second generation of our collision anticipation system, building on BADAS-1.0 [7], which showed that fine-tuning V-JEPA2 [1] on large-scale ego-centric dashcam data outperforms both academic baselines and production ADAS systems. BADAS-2.0 advances the state of the art along three axes. (i) Long-tail benchmark and accuracy: We introduce a 10-group long-tail benchmark targeting rare and safety-critical scenarios. To construct it, BADAS-1.0 is used as an active oracle to score millions of unlabeled drives and surface high-risk candidates for annotation. Combined with Nexar’s Atlas platform [13] for targeted data collection, this expands the dataset from 40k to 178,500 labeled videos (~2M clips), yielding consistent gains across all subgroups, with the largest improvements on the hardest long-tail cases. (ii) Knowledge distillation to edge: Domain-specific self-supervised pre-training on 2.25M unlabeled driving videos enables distillation into compact models, BADAS-2.0-Flash (86M) and BADAS-2.0-Flash-Lite (22M), achieving 7-12x speedup with near-parity accuracy, enabling real-time edge deployment. (iii) Explainability: BADAS-2.0 produces real-time object-centric attention heatmaps that localize the evidence behind predictions. BADAS-Reason [17] extends this with a vision-language model that consumes the last frame and heatmap to generate driver actions and structured textual reasoning. Inference code and evaluation benchmarks are publicly available. 

[NLP-33] Identifying Influential N-grams in Confidence Calibration via Regression Analysis

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中表现出过度自信的问题,尽管其输出中包含表达不确定性的语言特征。研究通过回归方法识别与模型置信度相关的特定n-gram语言表达,并发现这些表达在多个模型和问答基准上均与过高的置信度显著相关。解决方案的关键在于:提取出那些导致过自信的特定语言信息(如测试时缩放中人为插入以提升推理性能的提示短语),并通过抑制这些表达实现置信度校准,而无需牺牲模型性能。

链接: https://arxiv.org/abs/2604.05757
作者: Shintaro Ozaki,Wataru Hashimoto,Hidetaka Kamigaito,Katsuhiko Hayashi,Taro Watanabe
机构: Nara Institute of Science and Technology (NAIST); The University of Tokyo
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While large language models (LLMs) improve performance by explicit reasoning, their responses are often overconfident, even though they include linguistic expressions demonstrating uncertainty. In this work, we identify which linguistic expressions are related to confidence by applying regression methods. Specifically, treating confidence as the dependent variable, we predict it from the linguistic expressions in the reasoning parts of LLMs and analyze the relationship between a specific n-gram and confidence. Across multiple models and QA benchmarks, we show that LLMs remain overconfident when reasoning is involved and attribute this behavior to specific linguistic information. Interestingly, several of the extracted expressions coincide with cue phrases intentionally inserted at test-time scaling to improve reasoning performance. Through our test on causality and verification that the extracted linguistic information truly affects confidence, we reveal that confidence calibration is possible by simply suppressing those overconfident expressions without drops in performance.
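论文的回归分析可以用一个简化示意来理解:以 n-gram 是否出现作为自变量、置信度作为因变量,拟合每个 n-gram 的回归斜率,斜率为正即提示该表达与过高置信度相关。下面的纯 Python 草图用单变量回归代替论文中的多变量回归,仅作说明:

```python
def ngram_confidence_slopes(traces, confidences, n=2):
    """For each n-gram, fit a univariate regression of confidence on the
    n-gram's presence indicator and return the slope (a simplified
    stand-in for the paper's multivariate regression)."""
    feats, vocab = [], set()
    for toks in traces:
        grams = {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
        feats.append(grams)
        vocab |= grams
    m = len(traces)
    y_mean = sum(confidences) / m
    slopes = {}
    for g in vocab:
        x = [1.0 if g in f else 0.0 for f in feats]
        x_mean = sum(x) / m
        var = sum((xi - x_mean) ** 2 for xi in x)
        if var == 0:          # n-gram present in all / no traces: slope undefined
            continue
        cov = sum((xi - x_mean) * (yi - y_mean)
                  for xi, yi in zip(x, confidences))
        slopes[g] = cov / var
    return slopes
```

在该示意中,与高置信度推理轨迹共现的 n-gram(如 "clearly correct")会得到正斜率,提示其为候选的过自信表达。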

[NLP-34] Controlling Distributional Bias in Multi-Round LLM Generation via KL-Optimized Fine-Tuning ACL

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中难以控制输出分布的问题,尤其是在面对具有随机性的现实世界场景时,现有方法无法确保模型生成内容符合特定的目标分布(如性别、种族或情感属性在职业语境中的真实统计分布)。传统评估方式依赖单次推理与固定真实值对比,忽略了输出的统计特性。为应对这一挑战,作者提出了一种新的微调框架,其关键在于将**引导词校准(Steering Token Calibration)语义一致性对齐(Semantic Alignment)**相结合,并设计了一个混合目标函数:其中Kullback-Leibler散度用于锚定潜在引导词的概率质量分布,而Kahneman-Tversky优化则确保这些引导词与语义一致的响应绑定,从而实现对生成属性分布的精准控制。实验表明,该方法在六个不同数据集上显著优于基线模型,有效提升了分布对齐能力。

链接: https://arxiv.org/abs/2604.05756
作者: Yanbei Jiang,Amr Keleg,Ryandito Diandaru,Jey Han Lau,Lea Frermann,Biaoyan Fang,Fajri Koto
机构: University of Melbourne(墨尔本大学); Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学); Oracle(甲骨文)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL Main Conference

点击查看摘要

Abstract:While the real world is inherently stochastic, Large Language Models (LLMs) are predominantly evaluated on single-round inference against fixed ground truths. In this work, we shift the lens to distribution alignment: assessing whether LLMs, when prompted repeatedly, can generate outputs that adhere to a desired target distribution, e.g. reflecting real-world statistics or a uniform distribution. We formulate distribution alignment using the attributes of gender, race, and sentiment within occupational contexts. Our empirical analysis reveals that off-the-shelf LLMs and standard alignment techniques, including prompt engineering and Direct Preference Optimization, fail to reliably control output distributions. To bridge this gap, we propose a novel fine-tuning framework that couples Steering Token Calibration with Semantic Alignment. We introduce a hybrid objective function combining Kullback-Leibler divergence to anchor the probability mass of latent steering tokens and Kahneman-Tversky Optimization to bind these tokens to semantically consistent responses. Experiments across six diverse datasets demonstrate that our approach significantly outperforms baselines, achieving precise distributional control in attribute generation tasks.
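论文目标函数中的 KL 散度项用于把生成属性的经验分布锚定到目标分布。下面是该锚定项的一个最小化示意(KL(经验分布 ‖ 目标分布)),数据与标签均为假设示例:

```python
import math
from collections import Counter

def kl_to_target(samples, target):
    """KL(empirical || target) over attribute labels -- the anchoring
    term that pulls the generated distribution toward the desired one.
    `target` maps each label to its desired probability (all > 0)."""
    counts = Counter(samples)
    n = len(samples)
    kl = 0.0
    for label, p_t in target.items():
        p = counts.get(label, 0) / n
        if p > 0:
            kl += p * math.log(p / p_t)
    return kl
```

当多轮生成的属性分布与目标分布一致时,该项趋于 0;分布越偏斜,惩罚越大。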

[NLP-35] MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language Models ACL2026

【速读】: 该论文旨在解决当前医学视觉语言模型(Medical Vision-Language Models, Med-VLMs)在患者中心护理中沟通能力不足的问题,即现有模型多基于专业文献训练,难以将诊断影像结果以通俗易懂的语言传达给非专业人士。其解决方案的关键在于构建首个大规模多模态基准数据集 MedLayBench-V,采用结构化概念锚定精炼(Structured Concept-Grounded Refinement, SCGR)管道,通过引入统一医学语言系统(Unified Medical Language System, UMLS)的概念唯一标识符(Concept Unique Identifiers, CUIs)与微观实体约束,确保专家表述与通俗表述之间的语义等价性,从而有效避免简单简化带来的幻觉问题,为下一代具备医患沟通能力的 Med-VLM 提供可验证的训练与评估基础。

链接: https://arxiv.org/abs/2604.05738
作者: Han Jang,Junhyeok Lee,Heeseong Eum,Kyu Sung Choi
机构: Seoul National University (首尔国立大学); Seoul National University College of Medicine (首尔国立大学医学院); Department of Radiology, Seoul National University Hospital (首尔国立大学医院放射科); Healthcare AI Research Institute, Seoul National University Hospital (首尔国立大学医院医疗人工智能研究所); The Advanced Imaging and Computational Neuroimaging (AICON) Laboratory (先进成像与计算神经影像实验室)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026 Findings (Oral). 9 pages, 5 figures, 11 tables, plus appendix

点击查看摘要

Abstract:Medical Vision-Language Models (Med-VLMs) have achieved expert-level proficiency in interpreting diagnostic imaging. However, current models are predominantly trained on professional literature, limiting their ability to communicate findings in the lay register required for patient-centered care. While text-centric research has actively developed resources for simplifying medical jargon, there is a critical absence of large-scale multimodal benchmarks designed to facilitate lay-accessible medical image understanding. To bridge this resource gap, we introduce MedLayBench-V, the first large-scale multimodal benchmark dedicated to expert-lay semantic alignment. Unlike naive simplification approaches that risk hallucination, our dataset is constructed via a Structured Concept-Grounded Refinement (SCGR) pipeline. This method enforces strict semantic equivalence by integrating Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) with micro-level entity constraints. MedLayBench-V provides a verified foundation for training and evaluating next-generation Med-VLMs capable of bridging the communication divide between clinical experts and patients.

[NLP-36] Dialogue Act Patterns in GenAI-Mediated L2 Oral Practice: A Sequential Analysis of Learner-Chatbot Interactions

【速读】: 该论文旨在解决生成式 AI (Generative AI) 语音聊天机器人在第二语言(L2)口语练习中,学习者互动过程与其学习成效之间关联机制不明确的问题。解决方案的关键在于采用话语行为(Dialogue Act, DA)分析框架,通过人工标注70个学生与GenAI语音聊天机器人的交互会话(共6,957个DA),比较高进步和低进步会话中的DA分布及序列模式。研究发现,高进步会话中学习者发起的问题更多,且存在更多以提示为基础的纠正性反馈序列(prompt-based corrective feedback sequences),且该类反馈通常出现在学习者回应之后,表明反馈类型与时机对有效互动具有关键作用。这一发现为GenAI聊天机器人设计提供了基于教学法的DA编码体系,并为开发适应性更强的L2教育GenAI工具提供了实证依据。

链接: https://arxiv.org/abs/2604.05702
作者: Liqun He,Shijun (Cindy) Chen,Mutlu Cukurova,Manolis Mavrikis
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted for publication as a full paper (Main Track) at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)

点击查看摘要

Abstract:While generative AI (GenAI) voice chatbots offer scalable opportunities for second language (L2) oral practice, the interactional processes related to learners’ gains remain underexplored. This study investigates dialogue act (DA) patterns in interactions between Grade 9 Chinese English as a foreign language (EFL) learners and a GenAI voice chatbot over a 10-week intervention. Seventy sessions from 12 students were annotated by human coders using a pedagogy-informed coding scheme, yielding 6,957 coded DAs. DA distributions and sequential patterns were compared between high- and low-progress sessions. At the DA level, high-progress sessions showed more learner-initiated questions, whereas low-progress sessions exhibited higher rates of clarification-seeking, indicating greater comprehension difficulty. At the sequential level, high-progress sessions were characterised by more frequent prompting-based corrective feedback sequences, consistently positioned after learner responses, highlighting the role of feedback type and timing in effective interaction. Overall, these findings underscore the value of a dialogic lens in GenAI chatbot design, contribute a pedagogy-informed DA coding framework, and inform the design of adaptive GenAI chatbots for L2 education.

[NLP-37] Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长上下文和长文本生成场景下,键值(Key-Value, KV)缓存内存与带宽成为推理成本瓶颈的问题。现有缓解方案如多头潜在注意力(Multi-Head Latent Attention, MLA)和混合滑动窗口注意力(Hybrid Sliding-Window Attention, SWA)难以集成到已训练模型中,因其对源和目标注意力模块施加细粒度结构约束,不满足实际部署的可行性要求。论文提出Attention Editing框架,通过无需从头预训练即可将已有LLM转换为新注意力架构:其核心在于采用渐进式蒸馏训练策略,包含两个阶段——(1) 层级式教师强制优化结合中间激活监督,防止冷启动误差累积;(2) 模型级蒸馏对下一个词分布进行优化,并可选地引入弱特征匹配正则项。该方法成功应用于Qwen3系列模型,在Ascend 910B硬件上验证了大规模注意力架构转换的可行性与鲁棒性。

链接: https://arxiv.org/abs/2604.05688
作者: Zhen Cheng,Hao-Bo Yang,Wan-Yi Huang,Jin-Long Li
机构: China Merchants Bank Artificial Intelligence Laboratory (招商银行人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Key-Value (KV) cache memory and bandwidth increasingly dominate large language model inference cost in long-context and long-generation regimes. Architectures such as multi-head latent attention (MLA) and hybrid sliding-window attention (SWA) can alleviate this bound, but integrating them into existing models remains difficult. Prior methods impose fine-grained structural requirements on both source and target attention modules, which is infeasible in practical deployment. We present Attention Editing, a practical framework for converting already-trained large language models (LLMs) to new attention architectures without re-pretraining from scratch. Attention Editing replaces the original attention with a learnable target module and trains it using progressive distillation, consisting of (1) layer-wise teacher-forced optimization with intermediate activation supervision to prevent cold-start error accumulation, and (2) model-level distillation on next-token distributions, optionally regularized by weak feature matching. We instantiate the framework on two different targets, MLA and GateSWA (a gated hybrid SWA design), and apply it to Qwen3-8B and Qwen3-30B-A3B. The resulting models maintain competitive performance while delivering substantial efficiency improvements, demonstrating that large-scale attention conversion is both feasible and robust. Notably, experiments are conducted on an Ascend 910B cluster, offering a practical training case study on domestic hardware.
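第二阶段"模型级蒸馏 + 弱特征匹配正则"的损失形式可以用下面的纯 Python 草图示意:对下一词分布取 KL,对中间激活取 MSE,再以小权重 alpha 相加。权重取值与张量用列表代替均为示意性简化:

```python
import math

def mse(a, b):
    """Mean squared error between two activation vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def kl_div(p, q):
    """KL(teacher || student) over next-token probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_loss(teacher_acts, student_acts,
                 teacher_probs, student_probs, alpha=0.1):
    """Model-level distillation on next-token distributions, weakly
    regularized by per-layer activation matching (alpha is illustrative)."""
    act_loss = sum(mse(t, s) for t, s in zip(teacher_acts, student_acts)) \
        / len(teacher_acts)
    return kl_div(teacher_probs, student_probs) + alpha * act_loss
```

学生模块与教师完全一致时损失为 0;分布或激活偏离越大,损失越高。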

[NLP-38] LLM Reasoning as Trajectories: Step-Specific Representation Geometry and Correctness Signals ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在数学推理过程中内在表示动态机制不明确的问题,特别是如何从几何角度理解、预测和干预其链式思维(chain-of-thought)生成行为。解决方案的关键在于将推理过程建模为表示空间中的结构化轨迹:研究发现,数学推理路径在功能有序的子空间中演化,且这些子空间随网络层数加深逐渐可分;此外,正确与错误推理路径在后期阶段系统性分离,从而支持基于中段推理状态对最终答案正确性的预测(AUC达0.87);进一步提出基于轨迹引导的推理干预(trajectory-based steering),可在推理时通过匹配理想轨迹实现纠错与长度控制,为解释和调控LLM推理提供了一种几何驱动的新范式。

链接: https://arxiv.org/abs/2604.05655
作者: Lihao Sun,Hang Dong,Bo Qiao,Qingwei Lin,Dongmei Zhang,Saravan Rajmohan
机构: Microsoft
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ACL 2026 (Main)

点击查看摘要

Abstract:This work characterizes large language models’ chain-of-thought generation as a structured trajectory through representation space. We show that mathematical reasoning traverses functionally ordered, step-specific subspaces that become increasingly separable with layer depth. This structure already exists in base models, while reasoning training primarily accelerates convergence toward termination-related subspaces rather than introducing new representational organization. While early reasoning steps follow similar trajectories, correct and incorrect solutions diverge systematically at late stages. This late-stage divergence enables mid-reasoning prediction of final-answer correctness with ROC-AUC up to 0.87. Furthermore, we introduce trajectory-based steering, an inference-time intervention framework that enables reasoning correction and length control based on derived ideal trajectories. Together, these results establish reasoning trajectories as a geometric lens for interpreting, predicting, and controlling LLM reasoning behavior.
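"基于中段推理状态预测最终正确性"可以用一个最近质心分类器来示意:把已知正确/错误轨迹的隐状态各自求质心,新状态离哪个质心更近就预测哪一类。这只是论文探针思路的简化替身,向量维度与数据均为假设:

```python
import math

def centroid(vecs):
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_correct(state, correct_states, incorrect_states):
    """Predict final-answer correctness from a mid-reasoning hidden state
    by nearest-centroid comparison against known-correct and
    known-incorrect trajectories."""
    c_ok = centroid(correct_states)
    c_bad = centroid(incorrect_states)
    return dist(state, c_ok) < dist(state, c_bad)
```

正确与错误轨迹在后期阶段系统性分离,正是这种简单几何判别(论文中 AUC 可达 0.87)得以成立的前提。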

[NLP-39] See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLM s ACL’2026

【速读】: 该论文旨在解决视频大语言模型(Video-LLMs)在自回归生成过程中因高推理延迟导致的效率瓶颈问题。现有基于推测解码(Speculative Decoding, SD)的方法受限于严格的逐字匹配规则,难以充分发挥加速潜力。其解决方案的关键在于提出首个无需训练的宽松型推测解码框架LVSpec,该框架基于“视觉相关锚点(visual-relevant anchors)稀疏分布、视觉无关填充项(visual-irrelevant fillers)占主导”的观察,设计轻量级视觉相关token识别机制以精准定位关键生成节点,并引入位置偏移容错机制来接纳语义等价但位置不匹配的token,从而显著提升接受率和整体加速效果。实验表明,LVSpec在保持99.8%目标性能的同时,使Qwen2.5-VL-32B和LLaVA-OneVision-72B分别提速2.70倍和2.94倍,相较当前最优无训练SD方法,平均接受长度与加速比分别提升136%和35%。

链接: https://arxiv.org/abs/2604.05650
作者: Yicheng Ji,Jun Zhang,Jinpeng Chen,Cong Wang,Lidan Shou,Gang Chen,Huan Li
机构: ZJU; BUPT
类目: Computation and Language (cs.CL)
备注: ACL’2026 MainConference

点击查看摘要

Abstract:Video Large Language Models (Video-LLMs) excel in video understanding but suffer from high inference latency during autoregressive generation. Speculative Decoding (SD) mitigates this by applying a draft-and-verify paradigm, yet existing methods are constrained by rigid exact-match rules, severely limiting the acceleration potential. To bridge this gap, we propose LVSpec, the first training-free loosely SD framework tailored for Video-LLMs. Grounded in the insight that generation is governed by sparse visual-relevant anchors (mandating strictness) amidst abundant visual-irrelevant fillers (permitting loose verification), LVSpec employs a lightweight visual-relevant token identification scheme to accurately pinpoint the former. To further maximize acceptance, we augment this with a position-shift tolerant mechanism that effectively salvages positionally mismatched but semantically equivalent tokens. Experiments demonstrate that LVSpec achieves high fidelity and speed: it preserves 99.8% of target performance while accelerating Qwen2.5-VL-32B by 2.70x and LLaVA-OneVision-72B by 2.94x. Notably, it boosts the mean accepted length and speedup ratio by 136% and 35% compared to SOTA training-free SD methods for Video-LLMs.
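"锚点严格、填充宽松"的验证规则可以用下面的草图示意:逐个检查草稿 token,视觉相关锚点必须与目标模型 top-1 完全一致,填充 token 只要落在目标 top-k 内即接受。这只是对核心验证逻辑的示意,不含论文中的位置偏移容错机制:

```python
def loose_verify(draft, target_topk, anchor_flags):
    """Walk draft tokens left to right: visual-relevant anchors must match
    the target model's top-1 token exactly; fillers are accepted if they
    appear anywhere in the target's top-k. Returns the accepted prefix."""
    accepted = []
    for tok, topk, is_anchor in zip(draft, target_topk, anchor_flags):
        if is_anchor:
            if tok != topk[0]:      # strict check on anchors
                break
        elif tok not in topk:       # loose check on fillers
            break
        accepted.append(tok)
    return accepted
```

相比逐字精确匹配,放宽填充 token 的接受条件是平均接受长度提升的直接来源。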

[NLP-40] Graph-Based Chain-of-Thought Pruning for Reducing Redundant Reflections in Reasoning LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在使用思维链(Chain-of-Thought, CoT)推理时因奖励信号稀疏而导致的冗余思考问题,特别是过度反思(overthinking)现象,表现为无差别反思(Indiscriminate Reflection)和重复验证(Repetitive Reflection)两种模式。解决方案的关键在于提出一种基于图结构的CoT优化框架:将线性思维链转换为带有显式依赖边的有向无环图(Directed Acyclic Graph, DAG),并设计双层剪枝策略——分支级剪枝移除贡献弱的反思分支,深度级剪枝消除后期重复验证;通过三阶段训练流程(SFT初始化、DPO偏好优化、GRPO带长度惩罚的联合优化)实现高效且准确的推理路径蒸馏,实验表明该方法可减少平均推理token数42%的同时保持或提升准确性。

链接: https://arxiv.org/abs/2604.05643
作者: Hongyuan Yuan,Xinran He,Run Shao,Bolei He,Xianwei Xue,Mengke Chen,Qiutong Pan,Haiwei Wang,Haifeng Li
机构: Central South University (中南大学); Baidu Inc. (百度公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Extending CoT through RL has been widely used to enhance the reasoning capabilities of LLMs. However, due to the sparsity of reward signals, it can also induce undesirable thinking patterns such as overthinking, i.e., generating redundant intermediate reasoning content. In this work, we argue that a major source of such redundancy is inefficient reflection, which often manifests in two problematic patterns: Indiscriminate Reflection, where the model performs broad, low-impact checks throughout reasoning, and Repetitive Reflection, where it repeatedly re-verifies an already established conclusion. To address this, we introduce a graph-based CoT optimization framework. Specifically, we convert each linear CoT into a directed acyclic graph (DAG) with explicit dependency edges, and design a dual pruning strategy: branch-level pruning removes weakly contributing reflection branches, while depth-level pruning eliminates late-stage re-verification. We distill this behavior via a three-stage pipeline: (1) SFT to initialize the policy on pruned concise traces, (2) DPO to prefer correct but less redundant trajectories, and (3) GRPO with length penalty to jointly optimize answer correctness and efficiency. Experiments show that our approach reduces the average reasoning tokens by 42% while maintaining or improving accuracy.
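双层剪枝的思路可以用一个小型 DAG 草图来示意:分支级剪枝丢弃无法到达答案节点的弱贡献分支,深度级剪枝对验证同一结论的反思节点只保留第一次。节点命名与数据结构均为假设:

```python
def prune_cot_dag(edges, answer, reflections):
    """Branch-level pruning: drop nodes with no path to the answer node.
    Depth-level pruning: among reflection nodes verifying the same claim,
    keep only the first. `edges` maps node -> list of successor nodes;
    `reflections` maps reflection nodes to the claim they verify."""
    parents = {}
    for u, vs in edges.items():
        for v in vs:
            parents.setdefault(v, []).append(u)
    keep, stack = set(), [answer]   # reverse DFS from the answer
    while stack:
        n = stack.pop()
        if n in keep:
            continue
        keep.add(n)
        stack.extend(parents.get(n, []))
    seen_claims, kept = set(), []
    for node in sorted(keep):       # assumes step-order node names
        claim = reflections.get(node)
        if claim is not None:
            if claim in seen_claims:
                continue            # repetitive re-verification: prune
            seen_claims.add(claim)
        kept.append(node)
    return kept
```

例如一条含死分支和重复验证的五步链,剪枝后只剩必要步骤与首次反思。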

[NLP-41] YoNER: A New Yorùbá Multi-domain Named Entity Recognition Dataset LREC2026

【速读】: 该论文旨在解决约鲁巴语(Yorùbá)命名实体识别(Named Entity Recognition, NER)领域中因资源稀缺且局限于特定领域(如新闻和维基百科)而导致的模型泛化能力不足的问题。解决方案的关键在于构建了一个多领域约鲁巴语NER数据集YoNER,涵盖圣经、博客、电影、广播和维基百科共五个领域,包含约5000句和10万词元,并采用CoNLL风格标注Person(PER)、Organization(ORG)和Location(LOC)三类实体;同时,为提升在约鲁巴语上的性能,作者还提出了一个专为约鲁巴语设计的语言模型OyoBERT,其在域内评估中优于通用多语言模型。该研究通过跨域实验和少样本设置验证了非洲本地化模型的优势以及领域相关性对迁移效果的影响。

链接: https://arxiv.org/abs/2604.05624
作者: Peace Busola Falola,Jesujoba O. Alabi,Solomon O. Akinola,Folashade T. Ogunajo,Emmanuel Oluwadunsin Alabi,David Ifeoluwa Adelani
机构: 未知
类目: Computation and Language (cs.CL)
备注: LREC 2026

点击查看摘要

Abstract:Named Entity Recognition (NER) is a foundational NLP task, yet research in Yorùbá has been constrained by limited and domain-specific resources. Existing resources, such as MasakhaNER (a manually annotated news-domain corpus) and WikiAnn (automatically created from Wikipedia), are valuable but restricted in domain coverage. To address this gap, we present YoNER, a new multidomain Yorùbá NER dataset that extends entity coverage beyond news and Wikipedia. The dataset comprises about 5,000 sentences and 100,000 tokens collected from five domains including Bible, Blogs, Movies, Radio broadcast and Wikipedia, and annotated with three entity types: Person (PER), Organization (ORG) and Location (LOC), following CoNLL-style guidelines. Annotation was conducted manually by three native Yorùbá speakers, with an inter-annotator agreement of over 0.70, ensuring high quality and consistency. We benchmark several transformer encoder models using cross-domain experiments with MasakhaNER 2.0, and we also assess the effect of few-shot in-domain data using YoNER and cross-lingual setups with English datasets. Our results show that African-centric models outperform general multilingual models for Yorùbá, but cross-domain performance drops substantially, particularly for blogs and movie domains. Furthermore, we observed that closely related formal domains, such as news and Wikipedia, transfer more effectively. In addition, we introduce a new Yorùbá-specific language model (OyoBERT) that outperforms multilingual models in in-domain evaluation. We publicly release the YoNER dataset and pretrained OyoBERT models to support future research on Yorùbá natural language processing.
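CoNLL 风格的标注通常采用 BIO 编码,即实体首词标 B-类型、后续词标 I-类型、其余词标 O。下面给出从实体 span 生成 BIO 标签的极简示意(示例句与 span 为假设数据):

```python
def spans_to_bio(tokens, spans):
    """CoNLL-style BIO encoding: `spans` is a list of
    (start, end, type) token index ranges, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags
```

例如对约鲁巴语句子 "Adé lọ sí Ìbàdàn"(Adé 去了伊巴丹),人名与地名各占一个 token,分别得到 B-PER 与 B-LOC。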

[NLP-42] DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在生成长图像描述时,难以精确检测与定位幻觉(hallucination)的问题。随着图像描述从简短句子演变为数百词的详尽叙事,传统仅能识别整体不一致性的评估方法已无法满足需求,亟需在词级别上精准识别错误内容。其解决方案的关键在于提出DetailVerifyBench——一个包含1000张高质量图像、覆盖五个不同领域的基准测试集,每个图像配有平均超过200词的长描述及密集的token级标注,涵盖多种类型的幻觉,从而成为当前最具有挑战性的长图像描述中幻觉精确定位评估基准。

链接: https://arxiv.org/abs/2604.05623
作者: Xinran Wang,Yuxuan Zhang,Xiao Zhang,Haolong Yan,Muxi Diao,Songyu Xu,Zhonghao Yan,Hongbing Li,Kongming Liang,Zhanyu Ma
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: 8 pages, 5 figures. The dataset and code are available at this https URL

点击查看摘要

Abstract:Accurately detecting and localizing hallucinations is a critical task for ensuring high reliability of image captions. In the era of Multimodal Large Language Models (MLLMs), captions have evolved from brief sentences into comprehensive narratives, often spanning hundreds of words. This shift exponentially increases the challenge: models must now pinpoint specific erroneous spans or words within extensive contexts, rather than merely flag response-level inconsistencies. However, existing benchmarks lack the fine granularity and domain diversity required to evaluate this capability. To bridge this gap, we introduce DetailVerifyBench, a rigorous benchmark comprising 1,000 high-quality images across five distinct domains. With an average caption length of over 200 words and dense, token-level annotations of multiple hallucination types, it stands as the most challenging benchmark for precise hallucination localization in the field of long image captioning to date. Our benchmark is available at this https URL.
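词级幻觉定位任务的一种常见评测方式是 span 级精确匹配的 P/R/F1(本文基准的具体指标以原文为准,此处仅为示意):

```python
def span_f1(pred_spans, gold_spans):
    """Exact-match span precision/recall/F1 for token-level hallucination
    localization; spans are (start, end) token index ranges."""
    pred = set(map(tuple, pred_spans))
    gold = set(map(tuple, gold_spans))
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

与响应级判别不同,这种指标要求模型在数百词的长描述中逐 span 命中错误位置。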

[NLP-43] INTERACT: An AI-Driven Extended Reality Framework for Accessible Communication Featuring Real-Time Sign Language Interpretation and Emotion Recognition

【速读】: 该论文旨在解决视频会议平台对聋哑、听力障碍及多语言用户支持不足的问题,以提升其在专业协作中的包容性。当前无障碍措施受限于高成本、资源稀缺和物流障碍,而该研究提出的关键解决方案是构建一个基于生成式 AI (Generative AI) 的扩展现实(Extended Reality, XR)平台 INTERACT,其核心在于融合实时语音转文字、国际手语(International Sign Language, ISL)的三维虚拟形象呈现、多语言翻译与情感识别功能,并部署于 Meta Quest 3 设备上,利用 Whisper、NLLB、RoBERTa 和 Google MediaPipe 等先进模型实现端到端的沉浸式无障碍通信。

链接: https://arxiv.org/abs/2604.05605
作者: Nikolaos D. Tantaroudas,Andrew J. McCracken,Ilias Karachalios,Evangelos Papatheou
机构: Institute of Communications and Computer Systems (ICCS)(通信与计算机系统研究所); DASKALOS-APPS; National Technical University of Athens(雅典国立科技大学); Exeter Small-Scale Robotics Laboratory(埃克塞特小型机器人实验室)
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注: 20

点击查看摘要

Abstract:Video conferencing has become central to professional collaboration, yet most platforms offer limited support for deaf, hard-of-hearing, and multilingual users. The World Health Organisation estimates that over 430 million people worldwide require rehabilitation for disabling hearing loss, a figure projected to exceed 700 million by 2050. Conventional accessibility measures remain constrained by high costs, limited availability, and logistical barriers, while Extended Reality (XR) technologies open new possibilities for immersive and inclusive communication. This paper presents INTERACT (Inclusive Networking for Translation and Embodied Real-Time Augmented Communication Tool), an AI-driven XR platform that integrates real-time speech-to-text conversion, International Sign Language (ISL) rendering through 3D avatars, multilingual translation, and emotion recognition within an immersive virtual environment. Built on the CORTEX2 framework and deployed on Meta Quest 3 headsets, INTERACT combines Whisper for speech recognition, NLLB for multilingual translation, RoBERTa for emotion classification, and Google MediaPipe for gesture extraction. Pilot evaluations were conducted in two phases, first with technical experts from academia and industry, and subsequently with members of the deaf community. The trials reported 92% user satisfaction, transcription accuracy above 85%, and 90% emotion-detection precision, with a mean overall experience rating of 4.6 out of 5.0 and 90% of participants willing to take part in further testing. The results highlight strong potential for advancing accessibility across educational, cultural, and professional settings. An extended version of this work, including full pilot data and implementation details, has been published as an Open Research Europe article [Tantaroudas et al., 2026a].

[NLP-44] Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM -as-a-Judge

【速读】: 该论文旨在解决生成式 AI(Generative AI)在自动化评估中因源标签(source labels)引发的偏见问题,即大语言模型(Large Language Models, LLMs)作为评判者时,其信任判断会受到内容来源标签(人类撰写 vs. AI生成)的影响,从而导致评估结果不可靠。解决方案的关键在于通过反事实实验设计与眼动追踪数据揭示:无论是人类还是LLM裁判,均将源标签视为显著的启发式线索(heuristic cue),且LLM在内部状态分析中表现出对标签区域更强的注意力分配和更高的决策不确定性(尤其在AI标签下),这表明源标签对LLM的判断具有实质性影响。因此,研究提出需警惕标签敏感型LLM-as-a-Judge评估范式的有效性,并强调在模型对齐过程中避免将人类启发式依赖引入模型,进而推动去偏评估与对齐方法的发展。

链接: https://arxiv.org/abs/2604.05593
作者: Xin Sun,Di Wu,Sijing Qin,Isao Echizen,Abdallah El Ali,Saku Sugawara
机构: National Institute of Informatics (NII), Japan; University of Amsterdam, the Netherlands; University of Tokyo, Japan; Hitotsubashi University, Japan; Centrum Wiskunde Informatica (CWI), the Netherlands; Utrecht University, the Netherlands
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used as automated evaluators (LLM-as-a-Judge). This work challenges its reliability by showing that trust judgments by LLMs are biased by disclosed source labels. Using a counterfactual design, we find that both humans and LLM judges assign higher trust to information labeled as human-authored than to the same content labeled as AI-generated. Eye-tracking data reveal that humans rely heavily on source labels as heuristic cues for judgments. We analyze LLM internal states during judgment. Across label conditions, models allocate denser attention to the label region than the content region, and this label dominance is stronger under Human labels than AI labels, consistent with the human gaze patterns. Besides, decision uncertainty measured by logits is higher under AI labels than Human labels. These results indicate that the source label is a salient heuristic cue for both humans and LLMs. It raises validity concerns for label-sensitive LLM-as-a-Judge evaluation, and we cautiously raise that aligning models with human preferences may propagate human heuristic reliance into models, motivating debiased evaluation and alignment.
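论文中"标签区域注意力占比高于内容区域"的分析可以用一个简单探针示意:比较标签 token 与内容 token 上的平均注意力质量之比,比值大于 1 即提示判断被标签主导。注意力向量与索引均为假设数据:

```python
def label_attention_ratio(attn, label_idx, content_idx):
    """Ratio of mean attention mass on source-label tokens to mean mass
    on content tokens -- a simple probe for label-dominated judgments."""
    label_mass = sum(attn[i] for i in label_idx) / len(label_idx)
    content_mass = sum(attn[i] for i in content_idx) / len(content_idx)
    return label_mass / content_mass
```

将该比值在 Human 与 AI 两种标签条件下对比,即可复现"Human 标签下标签主导更强"这类分析。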

[NLP-45] AI-Driven Modular Services for Accessible Multilingual Education in Immersive Extended Reality Settings: Integrating Speech Processing, Translation, and Sign Language Rendering

[Quick Read]: This paper addresses the difficulty of integrating and deploying cross-modal AI services in extended reality (XR) environments for multilingual, accessible language instruction. The core challenge is to orchestrate several heterogeneous AI capabilities (speech recognition, machine translation, speech synthesis, emotion classification, dialogue summarisation, and International Sign (IS) rendering) while keeping the system usable and scalable in real-time XR scenarios. The key is a modular platform that composes six independently optimised AI services (OpenAI Whisper, Meta NLLB, AWS Polly, RoBERTa, Flan T5 Base Samsum, and Google MediaPipe) and validates each component through technical benchmarking; notably, AWS Polly delivers the lowest latency at a competitive price point for speech synthesis, and EuroLLM 1.7B Instruct surpasses NLLB in translation quality. The design lets each module scale independently and adapt to varied educational contexts, providing a deployable framework for equitable learning aligned with European Union digital accessibility goals.

Link: https://arxiv.org/abs/2604.05591
Authors: N.D. Tantaroudas, A.J. McCracken, I. Karachalios, E. Papatheou
Affiliations: Institute of Communications and Computer Systems (ICCS); DASKALOS-APPS; National Technical University of Athens; Exeter Small-Scale Robotics Laboratory
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Comments: 21

Click to view abstract

Abstract:This work introduces a modular platform that brings together six AI services, automatic speech recognition via OpenAI Whisper, multilingual translation through Meta NLLB, speech synthesis using AWS Polly, emotion classification with RoBERTa, dialogue summarisation via flan t5 base samsum, and International Sign (IS) rendering through Google MediaPipe. A corpus of IS gesture recordings was processed to derive hand landmark coordinates, which were subsequently mapped onto three dimensional avatar animations inside a virtual reality (VR) environment. Validation comprised technical benchmarking of each AI component, including comparative assessments of speech synthesis providers and multilingual translation models (NLLB 200 and EuroLLM 1.7B variants). Technical evaluations confirmed the suitability of the platform for real time XR deployment. Speech synthesis benchmarking established that AWS Polly delivers the lowest latency at a competitive price point. The EuroLLM 1.7B Instruct variant attained a higher BLEU score, surpassing NLLB. These findings establish the viability of orchestrating cross modal AI services within XR settings for accessible, multilingual language instruction. The modular design permits independent scaling and adaptation to varied educational contexts, providing a foundation for equitable learning solutions aligned with European Union digital accessibility goals.

[NLP-46] THIVLVC: Retrieval Augmented Dependency Parsing for Latin

[Quick Read]: This paper aims to improve dependency parsing for Latin, in particular across poetry (Seneca) and prose (Thomas Aquinas). The key is a two-stage system, THIVLVC: the first stage retrieves structurally similar sentences from the CIRCSE treebank using sentence length and POS n-gram similarity; the second stage prompts a large language model with the retrieved examples and the UD annotation guidelines to refine the initial parse produced by UDPipe. The method improves CLAS by 17 points over the UDPipe baseline on poetry and by 1.5 points on prose, and a double-blind error analysis shows that, among unanimous annotator decisions, the majority favour THIVLVC, validating the approach and revealing annotation inconsistencies in existing treebanks.

Link: https://arxiv.org/abs/2604.05564
Authors: Luc Pommeret (STL), Thibault Wagret (ENS de Lyon, HiSoMA), Jules Deret
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:We describe THIVLVC, a two-stage system for the EvaLatin 2026 Dependency Parsing task. Given a Latin sentence, we retrieve structurally similar entries from the CIRCSE treebank using sentence length and POS n-gram similarity, then prompt a large language model to refine the baseline parse from UDPipe using the retrieved examples and UD annotation guidelines. We submit two configurations: one without retrieval and one with retrieval (RAG). On poetry (Seneca), THIVLVC improves CLAS by +17 points over the UDPipe baseline; on prose (Thomas Aquinas), the gain is +1.5 CLAS. A double-blind error analysis of 300 divergences between our system and the gold standard reveals that, among unanimous annotator decisions, 53.3% favour THIVLVC, showing annotation inconsistencies both within and across treebanks.
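The retrieval stage described in the abstract (sentence length plus POS n-gram similarity) can be sketched as below. This is a minimal illustration, not the authors' code: the generalized-Jaccard scoring and the `alpha` blending weight are assumptions.

```python
from collections import Counter

def pos_ngrams(tags, n=2):
    """Multiset of POS n-grams for one sentence."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def similarity(query_tags, cand_tags, n=2, alpha=0.8):
    """Blend POS n-gram overlap (generalized Jaccard) with a length ratio."""
    q, c = pos_ngrams(query_tags, n), pos_ngrams(cand_tags, n)
    inter = sum((q & c).values())
    union = sum((q | c).values()) or 1
    len_sim = min(len(query_tags), len(cand_tags)) / max(len(query_tags), len(cand_tags))
    return alpha * inter / union + (1 - alpha) * len_sim

def retrieve(query_tags, treebank, k=2):
    """Return the k treebank entries most similar to the query POS sequence."""
    ranked = sorted(treebank, key=lambda ex: similarity(query_tags, ex["pos"]), reverse=True)
    return ranked[:k]

# Toy "treebank": sentences stored with their POS tag sequences.
treebank = [
    {"id": "a", "pos": ["NOUN", "VERB", "NOUN"]},
    {"id": "b", "pos": ["ADJ", "NOUN", "VERB", "NOUN"]},
    {"id": "c", "pos": ["VERB", "ADV"]},
]
best = retrieve(["NOUN", "VERB", "NOUN"], treebank, k=1)[0]
```

In the real system the retrieved examples would then be placed in the LLM prompt alongside the UDPipe parse to be refined.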

[NLP-47] EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents

[Quick Read]: This paper addresses a systematic gap in existing benchmarks when evaluating research agents on multi-turn, multi-evidence integration, in particular the lack of assessment of proactive search, cross-paper evidence integration, and sustained evidence use from long-horizon memory. The key is EpiBench, an episodic multi-turn multimodal benchmark that instantiates short research workflows: agents must navigate across papers over multiple turns, extract and align evidence from figures and tables, and use accumulated evidence to answer objective questions requiring cross-paper comparison and multi-figure integration. A process-level evaluation framework enables fine-grained testing and diagnosis of research agents, supporting progress toward verifiable and reproducible research agents.

Link: https://arxiv.org/abs/2604.05557
Authors: Xuan Dong, Huanyang Zheng, Tianhao Niu, Zhe Han, Pengzhan Li, Bofei Liu, Zhengyang Liu, Guancheng Li, Qingfu Zhu, Wanxiang Che
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Scientific research follows multi-turn, multi-step workflows that require proactively searching the literature, consulting figures and tables, and integrating evidence across papers to align experimental settings and support reproducible conclusions. This joint capability is not systematically assessed in existing benchmarks, which largely under-evaluate proactive search, multi-evidence integration and sustained evidence use over time. In this work, we introduce EpiBench, an episodic multi-turn multimodal benchmark that instantiates short research workflows. Given a research task, agents must navigate across papers over multiple turns, align evidence from figures and tables, and use the accumulated evidence in the memory to answer objective questions that require cross paper comparisons and multi-figure integration. EpiBench introduces a process-level evaluation framework for fine-grained testing and diagnosis of research agents. Our experiments show that even the leading model achieves an accuracy of only 29.23% on the hard split, indicating substantial room for improvement in multi-turn, multi-evidence research workflows, providing an evaluation platform for verifiable and reproducible research agents.

[NLP-48] Context-Agent: Dynamic Discourse Trees for Non-Linear Dialogue ACL2026

[Quick Read]: This paper addresses the challenge large language models (LLMs) face when handling non-linear human conversation: the prevalent practice of treating dialogue history as a flat, linear sequence is misaligned with the inherently hierarchical and branching structure of natural discourse, leading to inefficient context utilization and loss of coherence in long dialogues, especially across topic shifts or instruction refinements. The key is the Context-Agent framework, which models multi-turn dialogue history as a dynamic tree structure that mirrors the non-linearity of conversation, enabling effective maintenance of and navigation across branches for different topics and substantially improving task completion rates and token efficiency.

Link: https://arxiv.org/abs/2604.05552
Authors: Junan Hu, Shudan Guo, Wenqi Liu, Jianhua Yin, Yinwei Wei
Affiliations: Shandong University, China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 7 figures, ACL 2026
Click to view abstract

Abstract:Large Language Models demonstrate outstanding performance in many language tasks but still face fundamental challenges in managing the non-linear flow of human conversation. The prevalent approach of treating dialogue history as a flat, linear sequence is misaligned with the intrinsically hierarchical and branching structure of natural discourse, leading to inefficient context utilization and a loss of coherence during extended interactions involving topic shifts or instruction refinements. To address this limitation, we introduce Context-Agent, a novel framework that models multi-turn dialogue history as a dynamic tree structure. This approach mirrors the inherent non-linearity of conversation, enabling the model to maintain and navigate multiple dialogue branches corresponding to different topics. Furthermore, to facilitate robust evaluation, we introduce the Non-linear Task Multi-turn Dialogue (NTM) benchmark, specifically designed to assess model performance in long-horizon, non-linear scenarios. Our experiments demonstrate that Context-Agent enhances task completion rates and improves token efficiency across various LLMs, underscoring the value of structured context management for complex, dynamic dialogues. The dataset and code is available at GitHub.
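As a rough illustration of keeping dialogue history as a tree rather than a flat sequence, the sketch below maintains branches per topic and linearizes only the active root-to-node path as context. The data structure is hypothetical; the paper's actual Context-Agent design may differ substantially.

```python
class DialogueTree:
    """Dialogue history as a tree; context = path from root to the active node."""

    def __init__(self):
        self.nodes = {0: {"text": "<root>", "parent": None}}
        self.active = 0
        self._next_id = 1

    def add_turn(self, text, parent=None):
        """Append a turn under `parent` (default: the active node); return its id."""
        parent = self.active if parent is None else parent
        nid = self._next_id
        self._next_id += 1
        self.nodes[nid] = {"text": text, "parent": parent}
        self.active = nid
        return nid

    def context(self):
        """Linearize only the active branch, skipping unrelated topic branches."""
        path, nid = [], self.active
        while nid is not None:
            path.append(self.nodes[nid]["text"])
            nid = self.nodes[nid]["parent"]
        return list(reversed(path))[1:]  # drop the synthetic root

tree = DialogueTree()
trip = tree.add_turn("User: plan a trip to Rome")
tree.add_turn("Assistant: here is an itinerary")
tree.add_turn("User: switch topic: fix my Python bug", parent=0)  # new branch
tree.add_turn("User: back to the trip: add museums", parent=trip)  # resume old branch
```

After resuming the trip branch, `tree.context()` contains only trip-related turns; the unrelated debugging branch never enters the prompt, which is the token-efficiency argument in miniature.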

[NLP-49] FastDiSS: Few-step Match Many-step Diffusion Language Model on Sequence-to-Sequence Generation–Full Version ACL2026 ACL

[Quick Read]: This paper addresses the degradation of self-conditioning in continuous diffusion language models under few-step sampling: when fast inference is pursued with only a few denoising steps, errors in early estimates compound across steps and ultimately dominate sample quality. The key is a novel training framework that perturbs the self-conditioning signal to match inference-time noise, improving robustness to prior estimation errors, together with a token-level noise-awareness mechanism that prevents training saturation and improves optimization. Experiments across conditional generation benchmarks show that the framework surpasses standard continuous diffusion models in generation quality while providing up to 400x faster inference.

Link: https://arxiv.org/abs/2604.05551
Authors: Dat Nguyen-Cong, Tung Kieu, Hoang Thanh-Tung
Affiliations: FPT Software AI Center, FPT Corporation; Department of Computer Science, Aalborg University, Denmark; Quantum AI and Cyber Security Institute, FPT Corporation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: camera-ready version, accepted by ACL Findings (ACL 2026)
Click to view abstract

Abstract:Self-conditioning has been central to the success of continuous diffusion language models, as it allows models to correct previous errors. Yet its ability degrades precisely in the regime where diffusion is most attractive for deployment: few-step sampling for fast inference. In this study, we show that when models only have a few denoising steps, inaccurate self-conditioning induces a substantial approximation gap; this mistake compounds across denoising steps and ultimately dominate the sample quality. To address this, we propose a novel training framework that handles these errors during learning by perturbing the self-conditioning signal to match inference noise, improving robustness to prior estimation errors. In addition, we introduce a token-level noise-awareness mechanism that prevents training from saturation, hence improving optimization. Extensive experiments across conditional generation benchmarks demonstrate that our framework surpasses standard continuous diffusion models while providing up to 400x faster inference speed, and remains competitive against other one-step diffusion frameworks.
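A minimal numpy sketch of the training-time idea of perturbing the self-conditioning signal so it resembles the noisier estimates seen at inference. The linear noise schedule and Gaussian corruption below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def perturb_self_conditioning(x0_est, noise_scale, rng):
    """Corrupt the previous-step x0 estimate before reusing it as conditioning."""
    return x0_est + noise_scale * rng.standard_normal(x0_est.shape)

def conditioning_for_step(x0, t, rng, max_scale=0.5):
    """One illustrative step: the denoiser would be conditioned on a deliberately
    degraded copy of its own earlier prediction, degraded more at high t in [0, 1]."""
    prev_est = x0.copy()  # pretend the previous prediction was perfect (best case)
    return perturb_self_conditioning(prev_est, max_scale * t, rng)

rng = np.random.default_rng(0)
x0 = np.zeros((4, 8))                              # toy clean latents
cond_low = conditioning_for_step(x0, t=0.0, rng=rng)   # no perturbation
cond_high = conditioning_for_step(x0, t=1.0, rng=rng)  # strong perturbation
```

Training against `cond_high`-style inputs is what makes the model robust when its real few-step self-conditioning estimates are inaccurate.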

[NLP-50] AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

[Quick Read]: This paper addresses the inefficiency in current AI research caused by repeated cycles of model reproduction, debugging, and iterative refinement, i.e., how to accelerate the full empirical optimization pipeline from an original paper to a new state-of-the-art (SOTA) model. The key is AutoSOTA, an end-to-end automated research system with a multi-agent architecture in which eight specialized agents collaboratively handle paper-to-code grounding, execution-environment initialization and repair, long-horizon experiment tracking, optimization idea generation and scheduling, and validity supervision. This enables reproducible replication and further performance improvement of published SOTA methods, substantially reducing manual effort and redirecting human attention toward higher-level scientific creativity.

Link: https://arxiv.org/abs/2604.05550
Authors: Yu Li, Chenyang Shao, Xinyang Liu, Ruotong Zhao, Peijie Liu, Hongyuan Su, Zhibin Chen, Qinglong Yang, Anjie Xu, Yi Fang, Qingbin Zeng, Tianxing Li, Jingbo Xu, Fengli Xu, Yong Li, Tie-Yan Liu
Affiliations: Tsinghua University (清华大学); Zhongguancun Academy; Peking University; University of Science and Technology of China
Subjects: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Click to view abstract

Abstract:Artificial intelligence research increasingly depends on prolonged cycles of reproduction, debugging, and iterative refinement to achieve State-Of-The-Art (SOTA) performance, creating a growing need for systems that can accelerate the full pipeline of empirical model optimization. In this work, we introduce AutoSOTA, an end-to-end automated research system that advances the latest SOTA models published in top-tier AI papers to reproducible and empirically improved new SOTA models. We formulate this problem through three tightly coupled stages: resource preparation and goal setting; experiment evaluation; and reflection and ideation. To tackle this problem, AutoSOTA adopts a multi-agent architecture with eight specialized agents that collaboratively ground papers to code and dependencies, initialize and repair execution environments, track long-horizon experiments, generate and schedule optimization ideas, and supervise validity to avoid spurious gains. We evaluate AutoSOTA on recent research papers collected from eight top-tier AI conferences under filters for code availability and execution cost. Across these papers, AutoSOTA achieves strong end-to-end performance in both automated replication and subsequent optimization. Specifically, it successfully discovers 105 new SOTA models that surpass the original reported methods, averaging approximately five hours per paper. Case studies spanning LLM, NLP, computer vision, time series, and optimization further show that the system can move beyond routine hyperparameter tuning to identify architectural innovation, algorithmic redesigns, and workflow-level improvements. These results suggest that end-to-end research automation can serve not only as a performance optimizer, but also as a new form of research infrastructure that reduces repetitive experimental burden and helps redirect human attention toward higher-level scientific creativity.

[NLP-51] Stop Fixating on Prompts: Reasoning Hijacking and Constraint Tightening for Red-Teaming LLM Agents

[Quick Read]: This paper addresses the new security threats introduced by the growing complexity of widely deployed large language model (LLM) agents, noting that existing red-teaming methods rely on modifying user prompts, lack adaptability to new data, and may degrade agent performance. The core of the solution is the JailAgent framework, which leaves the user prompt untouched and instead implicitly manipulates the agent's reasoning trajectory and memory retrieval through three key stages: Trigger Extraction, Reasoning Hijacking, and Constraint Tightening. Combining precise trigger identification, real-time adaptive mechanisms, and an optimized objective function, JailAgent demonstrates strong attack performance in cross-model and cross-scenario settings.

Link: https://arxiv.org/abs/2604.05549
Authors: Yanxu Mao, Peipei Liu, Tiehan Cui, Congying Liu, Mingzhe Xing, Datao You
Affiliations: School of Software, Henan University, China; Institute of Information Engineering, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Peking University, China
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:With the widespread application of LLM-based agents across various domains, their complexity has introduced new security threats. Existing red-team methods mostly rely on modifying user prompts, which lack adaptability to new data and may impact the agent’s performance. To address the challenge, this paper proposes the JailAgent framework, which completely avoids modifying the user prompt. Specifically, it implicitly manipulates the agent’s reasoning trajectory and memory retrieval with three key stages: Trigger Extraction, Reasoning Hijacking, and Constraint Tightening. Through precise trigger identification, real-time adaptive mechanisms, and an optimized objective function, JailAgent demonstrates outstanding performance in cross-model and cross-scenario environments.

[NLP-52] Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects ACL2026

[Quick Read]: This survey targets the systemic inference-efficiency bottleneck in large vision-language models (LVLMs) caused by visual token dominance, rooted in the multi-stage interplay of high-resolution feature extraction, quadratic attention scaling, and memory-bandwidth constraints. The key is a systematic taxonomy organized around the inference lifecycle (encoding, prefilling, and decoding) that decomposes the efficiency barrier along three axes: shaping information density, managing long-context attention, and overcoming memory limits. This reveals how upstream decisions dictate downstream bottlenecks and how isolated optimizations compose to navigate the trade-off between visual fidelity and system efficiency. The survey concludes with four future directions: hybrid compression based on functional-unit sensitivity, modality-aware decoding with relaxed verification, progressive state management for streaming continuity, and stage-disaggregated serving via hardware-algorithm co-design.

Link: https://arxiv.org/abs/2604.05546
Authors: Jun Zhang, Yicheng Ji, Feiyang Ren, Yihang Li, Bowen Zeng, Zonghao Chen, Ke Chen, Lidan Shou, Gang Chen, Huan Li
Affiliations: Zhejiang University (浙江大学); Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security (杭州高新区(滨江)区块链与数据安全研究院)
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL 2026 Findings
Click to view abstract

Abstract:Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between high-resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints. We present a systematic taxonomy of efficiency techniques structured around the inference lifecycle, consisting of encoding, prefilling, and decoding. Unlike prior reviews focused on isolated optimizations, we analyze the end-to-end pipeline to reveal how upstream decisions dictate downstream bottlenecks, covering compute-bound visual encoding, the intensive prefilling of massive contexts, and the ‘‘visual memory wall’’ in bandwidth-bound decoding. By decoupling the efficiency landscape into the axes of shaping information density, managing long-context attention, and overcoming memory limits, this work provides a structured analysis of how isolated optimizations compose to navigate the trade-off between visual fidelity and system efficiency. The survey concludes by outlining four future frontiers supported by pilot empirical insights, including hybrid compression based on functional unit sensitivity, modality-aware decoding with relaxed verification, progressive state management for streaming continuity, and stage-disaggregated serving through hardware-algorithm co-design. The submitted software contains a snapshot of our literature repository, which is designed to be maintained as a living resource for the community.
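To make the "visual memory wall" concrete, here is a back-of-the-envelope KV-cache size estimate using the standard multi-head attention cache formula (keys plus values, per layer, per token). The model dimensions are hypothetical 7B-class values, not figures from the survey.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Keys + values (factor 2) cached per layer and token; fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 7B-class LVLM: 32 layers, 32 KV heads of dim 128, fp16.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)          # bytes per token
# A single high-resolution image tiled into ~2,000 visual tokens:
visual_kv_gb = kv_cache_bytes(32, 32, 128, seq_len=2000) / 1e9
```

Under these assumptions each token costs about 0.5 MB of KV cache, so one image's worth of visual tokens alone occupies roughly 1 GB, which is why visual token reduction dominates the efficiency literature.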

[NLP-53] Learning to Edit Knowledge via Instruction-based Chain-of-Thought Prompting ACL2026

[Quick Read]: This paper targets two core problems in knowledge editing for large language models (LLMs): (I) poor generalization, where injected knowledge is not effectively applied to practical tasks; and (II) narrow scope, where current methods focus on structured fact triples and overlook the abundant unstructured factual forms (e.g., news, articles) found in real-world settings. The key is a knowledge-editing paradigm based on Chain-of-Thought (CoT) reasoning, CoT2Edit: language-model agents first generate CoTs for both structured and unstructured edited data to build high-quality instruction data; the model is then trained to reason over edited knowledge via supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO); at inference time, Retrieval-Augmented Generation (RAG) dynamically retrieves relevant edited facts for real-time knowledge updating. The method achieves strong generalization across six diverse knowledge-editing scenarios with only a single round of training on multiple open-source models.

Link: https://arxiv.org/abs/2604.05540
Authors: Jinhu Fu, Yan Bai, Longzhu He, Yihang Lou, Yanxiao Zhao, Li Sun, Sen Su
Affiliations: Beijing University of Posts and Telecommunications (北京邮电大学); Peking University (北京大学); Huawei Technologies Ltd. (华为技术有限公司); Chongqing University of Posts and Telecommunications (重庆邮电大学)
Subjects: Computation and Language (cs.CL)
Comments: Accepted by ACL 2026 main conference
Click to view abstract

Abstract:Large language models (LLMs) can effectively handle outdated information through knowledge editing. However, current approaches face two key limitations: (I) Poor generalization: Most approaches rigidly inject new knowledge without ensuring that the model can use it effectively to solve practical problems. (II) Narrow scope: Current methods focus primarily on structured fact triples, overlooking the diverse unstructured forms of factual information (e.g., news, articles) prevalent in real-world contexts. To address these challenges, we propose a new paradigm: teaching LLMs to edit knowledge via Chain of Thoughts (CoTs) reasoning (CoT2Edit). We first leverage language model agents for both structured and unstructured edited data to generate CoTs, building high-quality instruction data. The model is then trained to reason over edited knowledge through supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO). At inference time, we integrate Retrieval-Augmented Generation (RAG) to dynamically retrieve relevant edited facts for real-time knowledge editing. Experimental results demonstrate that our method achieves strong generalization across six diverse knowledge editing scenarios with just a single round of training on three open-source language models. The codes are available at this https URL.

[NLP-54] Turbulence-like 5/3 spectral scaling in contextual representations of language as a complex system

[Quick Read]: This paper seeks to uncover deep statistical regularities of natural language, in particular the self-similar, multiscale organization of language representations. The core question is how to quantify the complex dynamics of text in high-dimensional embedding space, and whether this behavior goes beyond static lexical statistics to reflect context-dependent multiscale structure. The key is to model text as a trajectory in a transformer-based embedding space and to measure scale-dependent fluctuations along the token sequence via an "embedding-step signal". Across multiple languages and corpora, the power spectrum exhibits a robust power law with an exponent close to 5/3; this scaling appears consistently in both human-written and AI-generated text, but is absent in static word embeddings and is disrupted by randomizing token order, indicating that it reflects scale-free, self-similar integration of semantic information across linguistic scales rather than lexical statistics alone.

Link: https://arxiv.org/abs/2604.05536
Authors: Zhongxin Yang, Chun Bao, Yuanwei Bin, Xiang I.A. Yang, Shiyi Chen
Affiliations: Peking University (北京大学); Eastern Institute of Technology (东华理工大学); Shenzhen Tenfong Technology Co., Ltd. (深圳天丰科技有限公司); Pennsylvania State University (宾夕法尼亚州立大学)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Natural language is a complex system that exhibits robust statistical regularities. Here, we represent text as a trajectory in a high-dimensional embedding space generated by transformer-based language models, and quantify scale-dependent fluctuations along the token sequence using an embedding-step signal. Across multiple languages and corpora, the resulting power spectrum exhibits a robust power law with an exponent close to 5/3 over an extended frequency range. This scaling is observed consistently in contextual embeddings from both human-written and AI-generated text, but is absent in static word embeddings and is disrupted by randomization of token order. These results show that the observed scaling reflects multiscale, context-dependent organization rather than lexical statistics alone. By analogy with the Kolmogorov spectrum in turbulence, our findings suggest that semantic information is integrated in a scale-free, self-similar manner across linguistic scales, and provide a quantitative, model-agnostic benchmark for studying complex structure in language representations.
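The spectral-exponent measurement can be checked numerically: the sketch below synthesizes a signal with a prescribed k^(-5/3) power spectrum and recovers the exponent from a log-log fit of the periodogram. This is a generic verification recipe for the measurement methodology, not the authors' pipeline, and the signal here is synthetic rather than an embedding-step signal.

```python
import numpy as np

def synth_power_law_signal(n, exponent, rng):
    """Real signal whose periodogram follows k**exponent (random phases)."""
    k = np.arange(1, n // 2)
    amp = k.astype(float) ** (exponent / 2.0)      # power = amp**2
    phase = rng.uniform(0, 2 * np.pi, size=k.size)
    spec = np.zeros(n, dtype=complex)
    spec[1:n // 2] = amp * np.exp(1j * phase)
    spec[-(n // 2) + 1:] = np.conj(spec[1:n // 2][::-1])  # Hermitian symmetry
    return np.fft.ifft(spec).real

def fitted_exponent(signal):
    """Least-squares slope of log power vs. log frequency index."""
    n = signal.size
    power = np.abs(np.fft.fft(signal)[1:n // 2]) ** 2
    k = np.arange(1, n // 2)
    return np.polyfit(np.log(k), np.log(power), 1)[0]

rng = np.random.default_rng(42)
sig = synth_power_law_signal(4096, -5.0 / 3.0, rng)
slope = fitted_exponent(sig)  # should sit near -5/3
```

In the paper's setting, `sig` would instead be the per-token embedding-step magnitudes, and the fitted slope is the quantity compared against the Kolmogorov 5/3 benchmark.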

[NLP-55] Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs

[Quick Read]: This paper addresses the underperformance of current Omni large language models on complex multimodal reasoning, specifically their lack of cross-modal coreference capability: while they can understand global multimodal context, they struggle with fine-grained cross-modal alignment, i.e., localizing a referent in a source modality and accurately re-identifying it in a target modality. The authors formalize this as a cross-modal coreference task and build CrossOmni, a dataset of nine tasks with human-designed reasoning rationales, to evaluate and enhance the capability. The key is two strategies for inducing coreference-aware thinking patterns: a training-free In-Context Learning method and a training-based SFT+GRPO framework, both of which substantially improve cross-modal alignment and generalize to collaborative reasoning tasks.

Link: https://arxiv.org/abs/2604.05522
Authors: Hongcheng Liu, Yuhao Wang, Zhe Chen, Pingjie Wang, Zhiyuan Zhu, Yixuan Hou, Yanfeng Wang, Yu Wang
Affiliations: Shanghai Jiao Tong University (上海交通大学)
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Omni Large Language Models (Omni-LLMs) have demonstrated impressive capabilities in holistic multi-modal perception, yet they consistently falter in complex scenarios requiring synergistic omni-modal reasoning. Beyond understanding global multimodal context, effective reasoning also hinges on fine-grained cross-modal alignment, especially identifying shared referents across modalities, yet this aspect has been largely overlooked. To bridge this gap, we formalize the challenge as a cross-modal coreference problem, where a model must localize a referent in a source modality and re-identify it in a target modality. Building on this paradigm, we introduce CrossOmni, a dataset comprising nine tasks equipped with human-designed reasoning rationales to evaluate and enhance this capability. Experiments on 13 Omni-LLMs reveal systematic weaknesses in cross-modal coreference, which we attribute to the absence of coreference-aware thinking patterns. To address this, we enhance cross-modal alignment via two strategies: a training-free In-Context Learning method and a training-based SFT+GRPO framework designed to induce such thinking patterns. Both approaches yield substantial performance gains and generalize effectively to collaborative reasoning tasks. Overall, our findings highlight cross-modal coreference as a crucial missing piece for advancing robust omni-modal reasoning.

[NLP-56] Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning

[Quick Read]: This paper addresses the problem that large language models (LLMs) may produce biased, ideologized, or incorrect answers on certain topics, and aims to delineate on which topics their outputs cannot be trusted. The key is a new algorithm, GMRL-BD, which, with only black-box access to the LLM, combines a Wikipedia-derived knowledge graph (KG) with multi-agent reinforcement learning to efficiently identify the topic nodes (the "untrustworthy boundary") where the LLM is likely to generate biased answers. Experiments show that the boundary can be detected accurately with only limited queries, and the authors release a new dataset covering popular LLMs (Llama2, Vicuna, Falcon, Qwen2, Gemma2, and Yi-1.5) with labels indicating the topics on which each model is likely to be biased, providing a benchmark for future research.

Link: https://arxiv.org/abs/2604.05483
Authors: Xiaotian Zhou, Di Tang, Xiaofeng Wang, Xiaozhong Liu
Affiliations: Worcester Polytechnic Institute (伍斯特理工学院); Indiana University Bloomington (印第安纳大学布卢明顿分校)
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have shown a high capability in answering questions on a diverse range of topics. However, these models sometimes produce biased, ideologized or incorrect responses, limiting their applications if there is no clear understanding of which topics their answers can be trusted. In this research, we introduce a novel algorithm, named as GMRL-BD, designed to identify the untrustworthy boundaries (in terms of topics) of a given LLM, with black-box access to the LLM and under specific query constraints. Based on a general Knowledge Graph (KG) derived from Wikipedia, our algorithm incorporates with multiple reinforcement learning agents to efficiently identify topics (some nodes in KG) where the LLM is likely to generate biased answers. Our experiments demonstrated the efficiency of our algorithm, which can detect the untrustworthy boundary with just limited queries to the LLM. Additionally, we have released a new dataset containing popular LLMs including Llama2, Vicuna, Falcon, Qwen2, Gemma2 and Yi-1.5, along with labels indicating the topics on which each LLM is likely to be biased.

[NLP-57] Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction ACL2026

[Quick Read]: This paper addresses failures of vision-language-model (VLM) based autonomous GUI agents in real environments: by assuming deterministic environment responses, they fail to detect action failures, repeat ineffective behaviors, and accumulate errors. The core solution is VeriGUI (Verification-driven GUI Agent), which introduces a Thinking-Verification-Action-Expectation (TVAE) framework to explicitly model action outcomes and guide recovery, together with a two-stage training pipeline: Robust SFT with synthetic failure trajectories, followed by GRPO with asymmetric verification rewards to optimize recovery behavior. The method markedly reduces failure loops and improves recovery success while maintaining competitive standard task performance.

Link: https://arxiv.org/abs/2604.05477
Authors: Yuzhe Zhang, Xianwei Xue, Xingyong Wu, Mengke Chen, Chen Liu, Xinran He, Run Shao, Feiran Liu, Huanmin Xu, Qiutong Pan, Haiwei Wang
Affiliations: Beijing University of Technology (北京工业大学); Baidu Inc. (百度公司)
Subjects: Computation and Language (cs.CL)
Comments: ACL 2026 Main Conference
Click to view abstract

Abstract:Autonomous GUI agents based on vision-language models (VLMs) often assume deterministic environment responses, generating actions without verifying whether previous operations succeeded. In real-world settings with network latency, rendering delays, and system interruptions, this assumption leads to undetected action failures, repetitive ineffective behaviors, and catastrophic error accumulation. Moreover, learning robust recovery strategies is challenging due to the high cost of online interaction and the lack of real-time feedback in offline this http URL propose VeriGUI (Verification-driven GUI Agent), which explicitly models action outcomes and recovery under noisy environments. VeriGUI introduces a Thinking–Verification–Action–Expectation (TVAE) framework to detect failures and guide corrective reasoning, and a two-stage training pipeline that combines Robust SFT with synthetic failure trajectories and GRPO with asymmetric verification rewards. We further construct a Robustness Benchmark based on AndroidControl to evaluate failure recognition and correction. Experiments show that VeriGUI significantly reduces failure loops and improves recovery success while maintaining competitive standard task performance.
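The Thinking-Verification-Action-Expectation idea can be caricatured as follows: each action declares an expected observation, and a mismatch triggers a retry instead of blind progression. The toy environment and the simple retry policy are assumptions for illustration; VeriGUI's actual verification and recovery are learned, not hard-coded.

```python
def run_with_verification(plan, env_step, max_retries=3):
    """Execute (action, expected) pairs; re-issue an action until its observed
    effect matches the expectation, or give up after max_retries."""
    log = []
    for action, expected in plan:
        for attempt in range(1, max_retries + 1):
            observed = env_step(action)
            if observed == expected:           # verification passed
                log.append((action, attempt, "ok"))
                break
        else:                                   # recovery budget exhausted
            log.append((action, max_retries, "failed"))
    return log

# Flaky toy environment: "tap" silently fails on its first attempt (e.g. a
# rendering delay), then succeeds.
state = {"tap_calls": 0}
def env_step(action):
    if action == "tap":
        state["tap_calls"] += 1
        return "screen_changed" if state["tap_calls"] > 1 else "no_effect"
    return "typed"

log = run_with_verification([("tap", "screen_changed"), ("type", "typed")], env_step)
```

An expectation-blind agent would have proceeded after the first failed tap; the verification loop instead detects the mismatch and recovers before moving on.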

[NLP-58] Content Fuzzing for Escaping Information Cocoons on Digital Social Media ACL2026

[Quick Read]: This paper addresses information cocoons on social media, where platform recommendation mechanisms keep users exposed to homogeneous viewpoints and limit cross-stance communication. Existing pipelines rely on stance-detection models for content ranking, which can reinforce echo chambers and hinder the spread of dissenting opinions. The key is the ContentFuzz framework, based on confidence-guided fuzzing: it uses a large language model (LLM) to generate semantically faithful rewrites whose machine-inferred stance labels change, helping content escape its original affinity clusters. Confidence feedback from stance-detection models dynamically steers the LLM's rewriting, preserving human-interpreted intent while controllably shifting the machine-classified stance; the approach is validated across multiple models and multilingual datasets.

Link: https://arxiv.org/abs/2604.05461
Authors: Yifeng He, Ziye Tang, Hao Chen
Affiliations: University of California, Davis (加州大学戴维斯分校); University of Hong Kong (香港大学)
Subjects: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: accepted to findings of ACL 2026
Click to view abstract

Abstract:Information cocoons on social media limit users’ exposure to posts with diverse viewpoints. Modern platforms use stance detection as an important signal in recommendation and ranking pipelines, which can route posts primarily to like-minded audiences and reduce cross-cutting exposure. This restricts the reach of dissenting opinions and hinders constructive discourse. We take the creator’s perspective and investigate how content can be revised to reach beyond existing affinity clusters. We present ContentFuzz, a confidence-guided fuzzing framework that rewrites posts while preserving their human-interpreted intent and induces different machine-inferred stance labels. ContentFuzz aims to route posts beyond their original cocoons. Our method guides a large language model (LLM) to generate meaning-preserving rewrites using confidence feedback from stance detection models. Evaluated on four representative stance detection models across three datasets in two languages, ContentFuzz effectively changes machine-classified stance labels, while maintaining semantic integrity with respect to the original content.
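A toy version of the confidence-guided loop: score candidate rewrites with a stand-in stance classifier and keep the one that flips the predicted label with the highest confidence. The keyword classifier and the fixed candidate list are placeholders for the real stance model and the LLM rewriter; only the selection logic mirrors the described framework.

```python
def stance_confidence(text):
    """Stand-in stance model: P(label='favor') from crude keyword counts."""
    pos = sum(w in text.lower() for w in ("support", "great", "benefit"))
    neg = sum(w in text.lower() for w in ("oppose", "harm", "problem"))
    p_favor = (pos + 1) / (pos + neg + 2)          # Laplace-smoothed
    return ("favor", p_favor) if p_favor >= 0.5 else ("against", 1 - p_favor)

def fuzz(original, candidates):
    """Pick the rewrite whose predicted label differs from the original's,
    preferring the most confident flip; keep the original if nothing flips."""
    orig_label, _ = stance_confidence(original)
    flips = [(conf, c) for c in candidates
             for lbl, conf in [stance_confidence(c)] if lbl != orig_label]
    return max(flips)[1] if flips else original

original = "We support this policy, it is a great benefit."
candidates = [
    "We support this policy despite one problem.",
    "This policy will harm people; many oppose it.",
]
chosen = fuzz(original, candidates)
```

In ContentFuzz the candidates would be meaning-preserving LLM rewrites of the same post, and the classifier confidence would feed back into further rewriting rounds.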

[NLP-59] Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling ACL2026

[Quick Read]: This paper addresses the central tension in vision-language reward modeling: generative approaches are interpretable but slow, while discriminative ones are efficient but act as black boxes. The key is the VL-MDR (Vision-Language Multi-Dimensional Reward) framework, which introduces a visual-aware gating mechanism to dynamically decompose evaluation into fine-grained, interpretable dimensions (e.g., hallucination, reasoning) and, for each input, adaptively identify the relevant dimensions and their weights, improving efficiency and performance while remaining highly interpretable.

Link: https://arxiv.org/abs/2604.05445
Authors: Qiyuan Chen, Hongsen Huang, Jiahe Chen, Qian Shao, Jintai Chen, Hongxia Xu, Renjie Hua, Chuan Ren, Jian Wu
Affiliations: Zhejiang University (浙江大学); Soochow Securities Co., Ltd. (苏州证券有限公司); HKUST(GZ) (香港科技大学(广州)); Nanjing University (南京大学)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ACL 2026 Main
Click to view abstract

Abstract:Vision-language reward modeling faces a dilemma: generative approaches are interpretable but slow, while discriminative ones are efficient but act as opaque “black boxes.” To bridge this gap, we propose VL-MDR (Vision-Language Multi-Dimensional Reward), a framework that dynamically decomposes evaluation into granular, interpretable dimensions. Instead of outputting a monolithic scalar, VL-MDR employs a visual-aware gating mechanism to identify relevant dimensions and adaptively weight them (e.g., Hallucination, Reasoning) for each specific input. To support this, we curate a dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions. Extensive experiments show that VL-MDR consistently outperforms existing open-source reward models on benchmarks like VL-RewardBench. Furthermore, we show that VL-MDR-constructed preference pairs effectively enable DPO alignment to mitigate visual hallucinations and improve reliability, providing a scalable solution for VLM alignment.
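The gating idea, per-input softmax weights over dimension scores collapsing into one scalar reward, can be sketched as below. The dimension names and all numbers are invented for illustration; in VL-MDR both the scores and the gate logits would come from learned heads over the vision-language input.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def gated_reward(dim_scores, gate_logits):
    """Scalar reward = relevance-weighted sum of per-dimension scores."""
    weights = softmax(gate_logits)
    return float(weights @ dim_scores), weights

dims = ["hallucination", "reasoning", "helpfulness"]
# Hypothetical per-dimension quality scores in [0, 1] for one response:
scores = np.array([0.9, 0.4, 0.7])
# For this (imagined) input the gate strongly favors the reasoning dimension:
reward, w = gated_reward(scores, gate_logits=np.array([0.0, 2.0, 0.0]))
```

The per-dimension weights `w` are what make the scalar interpretable: they state which aspects of quality dominated the final reward for this particular input.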

[NLP-60] Top-K Retrieval with Fixed-Size Linear-Attention Completion: Backbone- and KV-Format-Preserving Attention for KV-Cache Read Reduction

[Quick Read]: This paper targets the efficiency bottleneck in long-context generation caused by heavy decode-time key-value (KV) cache read traffic, which is especially pronounced when the KV cache is offloaded beyond GPU memory. Query-aware retrieval (e.g., Top-K selection) reduces traffic by loading only a subset of KV pairs, but renormalizing the softmax over that subset loses attention mass, particularly when attention is spread over unretrieved tokens. The key is a retrieval-completion attention module that keeps the backbone weights and the KV-cache format unchanged: for each query it computes exact attention over sink/tail anchors and the query-dependent retrieved Top-K tokens, estimates the mid-region attention numerator and denominator from a fixed-size feature-map summary computed at prefill time, then merges the exact and estimated contributions in the unnormalized domain and applies a single normalization, recovering the missing softmax mass without additional KV reads. Across long-context benchmarks the method outperforms selection-only Top-K at matched read budgets, with the largest gains in high-entropy attention heads.

Link: https://arxiv.org/abs/2604.05438
Authors: Yasuto Hoshi, Daisuke Miyashita, Jun Deguchi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Long-context generation is increasingly limited by decode-time key-value (KV) cache traffic, particularly when KV is offloaded beyond GPU memory. Query-aware retrieval (e.g., Top-K selection) reduces this traffic by loading only a subset of KV pairs, but renormalizing the softmax over the subset introduces bias when attention mass is spread over unretrieved tokens. We propose a retrieval-completion attention module that keeps backbone weights and the KV-cache format unchanged. For each query, we compute exact attention over sink/tail anchors and the query-dependent retrieved Top-K tokens, and estimate the remaining mid-region numerator and denominator using a fixed-size feature-map summary computed at prefill time. We add the exact and estimated contributions in the unnormalized domain and apply a single normalization, recovering the missing softmax mass without additional attention-side KV reads. Across long-context benchmarks, the proposed method improves over selection-only Top-K at matched token-equivalent read budgets, with the largest gains in high-entropy heads.
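The unnormalized-domain merge can be illustrated with numpy: exact exp-scores over the anchor and Top-K positions, plus an estimated numerator/denominator for the skipped mid-region, combined under a single normalization. As a sanity check, with a perfect mid-region estimate the result must equal full softmax attention; the oracle estimate below stands in for the paper's fixed-size linear-attention summary, which would only approximate these quantities.

```python
import numpy as np

def full_attention(q, K, V):
    w = np.exp(K @ q)
    return (w[:, None] * V).sum(0) / w.sum()

def completed_attention(q, K, V, exact_idx, mid_num, mid_den):
    """Exact exp-scores on `exact_idx` (anchors + retrieved Top-K), plus an
    estimated mid-region numerator/denominator; normalize once at the end."""
    w = np.exp(K[exact_idx] @ q)
    num = (w[:, None] * V[exact_idx]).sum(0) + mid_num
    den = w.sum() + mid_den
    return num / den

rng = np.random.default_rng(1)
n, d = 16, 4
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))
q = rng.standard_normal(d)

exact_idx = np.array([0, 1, 7, 9, 14, 15])        # sinks, tail, retrieved Top-K
mid_idx = np.setdiff1d(np.arange(n), exact_idx)
w_mid = np.exp(K[mid_idx] @ q)                     # oracle stand-in for the
mid_num = (w_mid[:, None] * V[mid_idx]).sum(0)     # prefill-time summary
mid_den = w_mid.sum()

out = completed_attention(q, K, V, exact_idx, mid_num, mid_den)
ref = full_attention(q, K, V)
```

Setting `mid_num` and `mid_den` to zero recovers selection-only Top-K attention, whose output diverges from `ref`; that gap is exactly the "missing softmax mass" the completion term restores.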

[NLP-61] Bridging Natural Language and Microgrid Dynamics: A Context-Aware Simulator and Dataset

【速读】: 该论文旨在解决可再生能源系统中智能、情境感知能量管理的迫切需求,特别是传统基于数值时间序列的能量管理方法忽视了人类生成的情境信息(如事件日程、系统日志和用户意图)所蕴含的重要预测价值。解决方案的关键在于提出首个专为整合丰富非结构化情境信息与定量可再生能源动态而设计的开源数字孪生平台——OpenCEM Simulator and Dataset。其核心创新在于:一方面提供了一个来自真实光伏-电池微电网部署的语义丰富、对齐精确的数据集;另一方面构建了一个原生支持多模态情境输入的模块化模拟器,具备混合数据驱动与物理模型建模能力,从而为开发和验证利用大语言模型(Large Language Models, LLMs)的新型控制算法和预测模型提供了高保真环境。

链接: https://arxiv.org/abs/2604.05429
作者: Tinko Sebastian Bartels,Ruixiang Wu,Xinyu Lu,Yikai Lu,Fanzeng Xia,Haoxiang Yang,Yue Chen,Tongxin Li
机构: The Chinese University of Hong Kong-Shenzhen (香港中文大学(深圳)); The Chinese University of Hong Kong (香港中文大学)
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Addressing the critical need for intelligent, context-aware energy management in renewable systems, we introduce the OpenCEM Simulator and Dataset: the first open-source digital twin explicitly designed to integrate rich, unstructured contextual information with quantitative renewable energy dynamics. Traditional energy management relies heavily on numerical time series, thereby neglecting the significant predictive power embedded in human-generated context (e.g., event schedules, system logs, user intentions). OpenCEM bridges this gap by offering a unique platform comprising both a meticulously aligned, language-rich dataset from a real-world PV-and-battery microgrid installation and a modular simulator capable of natively processing this multi-modal context. The OpenCEM Simulator provides a high-fidelity environment for developing and validating novel control algorithms and prediction models, particularly those leveraging Large Language Models. We detail its component-based architecture, hybrid data-driven and physics-based modelling capabilities, and demonstrate its utility through practical examples, including context-aware load forecasting and the implementation of online optimal battery charging control strategies. By making this platform publicly available, OpenCEM aims to accelerate research into the next generation of intelligent, sustainable, and truly context-aware energy systems.

[NLP-62] PRISM-MCTS: Learning from Reasoning Trajectories with Metacognitive Reflection

【速读】: 该论文旨在解决现有基于蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)的推理方法在生成式 AI(Generative AI)推理任务中效率低下、计算冗余严重的问题,其核心在于缺乏对历史探索轨迹的信息共享机制。解决方案的关键是提出 PRISM-MCTS 框架,该框架通过引入一个过程奖励模型(Process Reward Model, PRM)与动态共享记忆机制,协同捕捉“启发式策略”(Heuristics)和“认知谬误”(Fallacies),从而实现对推理路径的成功强化与错误分支的有效剪枝,显著提升推理效率与质量。

链接: https://arxiv.org/abs/2604.05424
作者: Siyuan Cheng,Bozhong Tian,YanChao Hao,Zheng Wei
机构: Tencent PCG (腾讯PCG)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ACL 2026 Findings

点击查看摘要

Abstract:The emergence of reasoning models, exemplified by OpenAI o1, signifies a transition from intuitive to deliberative cognition, effectively reorienting the scaling laws from pre-training paradigms toward test-time computation. While Monte Carlo Tree Search (MCTS) has shown promise in this domain, existing approaches typically treat each rollout as an isolated trajectory. This lack of information sharing leads to severe inefficiency and substantial computational redundancy, as the search process fails to leverage insights from prior explorations. To address these limitations, we propose PRISM-MCTS, a novel reasoning framework that draws inspiration from human parallel thinking and reflective processes. PRISM-MCTS integrates a Process Reward Model (PRM) with a dynamic shared memory, capturing both “Heuristics” and “Fallacies”. By reinforcing successful strategies and pruning error-prone branches, PRISM-MCTS effectively achieves refinement. Furthermore, we develop a data-efficient training strategy for the PRM, achieving high-fidelity evaluation under a few-shot regime. Empirical evaluations across diverse reasoning benchmarks substantiate the efficacy of PRISM-MCTS. Notably, it halves the trajectory requirements on GPQA while surpassing MCTS-RAG and Search-o1, demonstrating that it scales inference by reasoning judiciously rather than exhaustively.

[NLP-63] Multi-Drafter Speculative Decoding with Alignment Feedback ACL2026

【速读】: 该论文旨在解决生成式 AI(Generative AI)中大规模语言模型(Large Language Model, LLM)推理效率低的问题。现有推测解码(Speculative Decoding, SD)方法依赖单一小型drafting模型进行未来token的预生成,但这些drafters通常在特定任务或领域训练,泛化能力有限,导致在多样化应用场景下性能受限。解决方案的关键在于提出MetaSD框架,通过整合多个异构drafters,并利用对齐反馈动态分配计算资源,将drafters的选择建模为多臂赌博机(multi-armed bandit)问题,从而实现高效且自适应的drafting策略选择,显著提升推理速度与稳定性。

链接: https://arxiv.org/abs/2604.05417
作者: Taehyeon Kim,Hojung Jung,Se-Young Yun
机构: LG AI Research (LG人工智能研究); KAIST AI (韩国科学技术院人工智能)
类目: Computation and Language (cs.CL)
备注: ACL 2026 Findings

点击查看摘要

Abstract:Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller model to draft future tokens, which are then verified by the target LLM. This preserves generation quality by accepting only aligned tokens. However, individual drafters, often trained for specific tasks or domains, exhibit limited effectiveness across diverse applications. To address this, we introduce MetaSD, a unified framework that integrates multiple drafters into the SD process. MetaSD dynamically allocates computational resources to heterogeneous drafters by leveraging alignment feedback and framing drafter selection as a multi-armed bandit problem. Extensive experiments show MetaSD consistently outperforms single-drafter approaches.
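Framing drafter selection as a multi-armed bandit can be illustrated with a plain UCB1 controller, where the reward for a drafter is the fraction of its drafted tokens the target model accepts (the "alignment feedback"). This is a hedged sketch under those assumptions, not MetaSD's actual allocation rule; the class and parameter names are illustrative.

```python
import math

class DrafterBandit:
    """UCB1 over a pool of drafters, rewarded by the token acceptance rate
    the target model reports after each verification round."""
    def __init__(self, n_drafters, c=1.4):
        self.counts = [0] * n_drafters
        self.values = [0.0] * n_drafters   # running mean acceptance rate per drafter
        self.c = c
        self.t = 0

    def select(self):
        self.t += 1
        # Play each drafter once before applying the UCB rule.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        ucb = [v + self.c * math.sqrt(math.log(self.t) / n)
               for v, n in zip(self.values, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, i, acceptance_rate):
        # Incremental mean update with the latest alignment feedback.
        self.counts[i] += 1
        self.values[i] += (acceptance_rate - self.values[i]) / self.counts[i]
```

In simulation, a bandit over two drafters with acceptance rates 0.3 and 0.8 quickly concentrates its picks on the better-aligned one while still probing the other occasionally.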

[NLP-64] Confidence Should Be Calibrated More Than One Turn Deep

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮对话场景中缺乏动态校准(multi-turn calibration)的问题,即现有校准方法主要针对单轮交互设计,未能有效处理用户反馈(如说服行为)对模型置信度估计的干扰,从而导致多轮对话中的可靠性下降。其解决方案的关键在于提出一种新的校准任务框架——多轮校准(Multi-Turn Calibration, MTCal),通过引入用于追踪每一轮校准动态的新指标 Expected Calibration Error at turn T (ECE@T),并设计基于代理校准目标的优化策略最小化 ECE@T;进一步地,利用校准后的置信度构建 ConfChat 解码策略,在保持甚至提升事实准确性和响应一致性的同时增强多轮交互的可靠性。

链接: https://arxiv.org/abs/2604.05397
作者: Zhaohan Zhang,Chengzhengxu Li,Xiaoming Liu,Chao Shen,Ziquan Liu,Ioannis Patras
机构: Queen Mary University of London (伦敦玛丽女王大学); Xi’an Jiaotong University (西安交通大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly applied in high-stakes domains such as finance, healthcare, and education, where reliable multi-turn interactions with users are essential. However, existing work on confidence estimation and calibration, a major approach to building trustworthy LLM systems, largely focuses on single-turn settings and overlooks the risks and potential of multi-turn conversations. In this work, we introduce the task of multi-turn calibration to reframe calibration from a static property into a dynamic challenge central to reliable multi-turn conversation, where calibrating model confidence at each turn conditioned on the conversation history is required. We first reveal the risks of this setting: using Expected Calibration Error at turn T (ECE@T), a new metric that tracks calibration dynamics over turns, we show that user feedback (e.g., persuasion) can degrade multi-turn calibration. To address this, we propose MTCal, which minimises ECE@T via a surrogate calibration target, and further leverage calibrated confidence in ConfChat, a decoding strategy that improves both factuality and consistency of the model response in multi-turn interactions. Extensive experiments demonstrate that MTCal achieves outstanding and consistent performance in multi-turn calibration, and ConfChat preserves and even enhances model performance in multi-turn interactions. Our results mark multi-turn calibration as one missing link for scaling LLM calibration toward safe, reliable, and real-world use.
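The ECE@T metric is described as standard expected calibration error tracked per turn. A minimal sketch follows, assuming ECE@T is the usual binned ECE restricted to predictions emitted at turn T; the equal-width binning and record format are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Binned Expected Calibration Error: sum_b (|B_b|/N) * |acc(B_b) - conf(B_b)|."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    # Assign each confidence to one of n_bins equal-width bins over [0, 1].
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            total += mask.mean() * abs(corr[mask].mean() - conf[mask].mean())
    return total

def ece_at_turn(records, T, n_bins=10):
    """ECE@T: ECE computed only over (confidence, correctness) pairs from turn T."""
    conf = [r["conf"] for r in records if r["turn"] == T]
    corr = [r["correct"] for r in records if r["turn"] == T]
    return ece(conf, corr, n_bins)
```

Tracking this quantity as T grows is what exposes calibration drift under persuasion: a model that stays well calibrated at turn 1 can show a rising ECE@T curve after repeated user pushback.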

[NLP-65] ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving

【速读】: 该论文旨在解决语言条件驱动模型(language-conditioned driving agents)在真实部署场景中对指令鲁棒性不足的问题,尤其关注自然语言指令在表述差异、模糊性、噪声及误导性内容下的性能退化问题。现有评估多假设指令为精确且结构良好的,忽略了实际应用中指令的多样性和潜在干扰因素,导致模型可靠性被高估。解决方案的关键在于提出 ICR-Drive(Instruction Counterfactual Robustness for Drive),一个诊断框架,通过生成四类受控扰动指令变体(Paraphrase、Ambiguity、Noise 和 Misleading)来系统测试模型在相同仿真环境(CARLA)下对指令变化的响应,从而量化各扰动类别下的性能下降程度,并揭示不同模型(如 LMDrive 和 BEVDriver)在面对微小指令变动时表现出的显著性能波动与差异化失效模式,暴露了当前端到端语言驱动模型在安全关键驾驶任务中的可靠性差距。

链接: https://arxiv.org/abs/2604.05378
作者: Kaiser Hamid,Can Cui,Nade Liang
机构: Texas Tech University; Bosch Center for Artificial Intelligence (BCAI)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent progress in vision-language-action (VLA) models has enabled language-conditioned driving agents to execute natural-language navigation commands in closed-loop simulation, yet standard evaluations largely assume instructions are precise and well-formed. In deployment, instructions vary in phrasing and specificity, may omit critical qualifiers, and can occasionally include misleading, authority-framed text, leaving instruction-level robustness under-measured. We introduce ICR-Drive, a diagnostic framework for instruction counterfactual robustness in end-to-end language-conditioned autonomous driving. ICR-Drive generates controlled instruction variants spanning four perturbation families: Paraphrase, Ambiguity, Noise, and Misleading, where Misleading variants conflict with the navigation goal and attempt to override intent. We replay identical CARLA routes under matched simulator configurations and seeds to isolate performance changes attributable to instruction language. Robustness is quantified using standard CARLA Leaderboard metrics and per-family performance degradation relative to the baseline instruction. Experiments on LMDrive and BEVDriver show that minor instruction changes can induce substantial performance drops and distinct failure modes, revealing a reliability gap for deploying embodied foundation models in safety-critical driving.

[NLP-66] ETR: Entropy Trend Reward for Efficient Chain-of-Thought Reasoning ACL2026

【速读】: 该论文旨在解决链式思维(Chain-of-thought, CoT)推理过程中生成的推理轨迹过长且效率低下的问题。现有方法通常通过长度惩罚或全局熵减来缩短CoT,但隐含假设是整个推理过程中的不确定性应始终较低。本文指出,推理效率实际上由不确定性变化轨迹决定——具有显著下降熵趋势的CoT更短。解决方案的关键在于提出熵趋势奖励(Entropy Trend Reward, ETR),这是一种轨迹感知的目标函数,鼓励推理过程中逐步降低不确定性,同时允许有限的局部探索。ETR被集成到分组相对策略优化(Group Relative Policy Optimization, GRPO)框架中,在多个推理模型和挑战性基准上验证,显著提升了准确率与效率的权衡,例如在DeepSeek-R1-Distill-7B模型上实现9.9%的准确率提升,同时将CoT长度减少67%。

链接: https://arxiv.org/abs/2604.05355
作者: Xuan Xiong,Huan Liu,Li Gu,Zhixiang Chi,Yue Qiu,Yuanhao Yu,Yang Wang
机构: University of Toronto (多伦多大学); McMaster University (麦克马斯特大学); Concordia University (康考迪亚大学); University of Ottawa (渥太华大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ACL 2026 (Main)

点击查看摘要

Abstract:Chain-of-thought (CoT) reasoning improves large language model performance on complex tasks, but often produces excessively long and inefficient reasoning traces. Existing methods shorten CoTs using length penalties or global entropy reduction, implicitly assuming that low uncertainty is desirable throughout reasoning. We show instead that reasoning efficiency is governed by the trajectory of uncertainty. CoTs with dominant downward entropy trends are substantially shorter. Motivated by this insight, we propose Entropy Trend Reward (ETR), a trajectory-aware objective that encourages progressive uncertainty reduction while allowing limited local exploration. We integrate ETR into Group Relative Policy Optimization (GRPO) and evaluate it across multiple reasoning models and challenging benchmarks. ETR consistently achieves a superior accuracy-efficiency tradeoff, improving DeepSeek-R1-Distill-7B by 9.9% in accuracy while reducing CoT length by 67% across four benchmarks. Code is available at this https URL
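A trajectory-aware reward on the entropy trend can be pictured as the negative least-squares slope of the per-step entropy trace: a CoT whose uncertainty trends downward earns a positive reward, while local upward bumps (limited exploration) are tolerated because only the global slope is scored. This is an illustrative reduction of the idea, not the paper's exact ETR formula.

```python
import numpy as np

def step_entropy(probs):
    """Shannon entropy of one next-token distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def entropy_trend_reward(entropies):
    """Negative least-squares slope of the per-step entropy trace.
    Positive reward <=> uncertainty is, on balance, decreasing as
    reasoning progresses; local bumps are not directly penalized."""
    h = np.asarray(entropies, dtype=float)
    t = np.arange(len(h), dtype=float)
    slope = np.polyfit(t, h, 1)[0]   # degree-1 fit; slope is the leading coefficient
    return -slope
```

Plugged into a GRPO-style objective, such a term would reward rollouts whose entropy trace descends, which the paper correlates with substantially shorter CoTs.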

[NLP-67] DQA: Diagnostic Question Answering for IT Support ACL2026

【速读】: 该论文旨在解决企业IT支持交互中因缺乏显式诊断状态而导致的证据累积困难与竞争假设难以分辨的问题。标准多轮检索增强生成(Retrieval-Augmented Generation, RAG)系统无法在对话过程中维持持久的诊断状态,从而限制了其对根因层面证据的有效聚合与推理能力。解决方案的关键在于提出DQA(Diagnostic Question-Answering)框架,该框架通过维护持续的诊断状态,将检索到的案例按根因级别而非文档级别进行聚合,并结合对话查询重写、检索聚合和状态条件响应生成机制,在满足企业级延迟和上下文约束的前提下,实现系统性的故障排查。实验表明,DQA在150个匿名企业IT支持场景中平均成功率达78.7%,显著优于多轮RAG基线(41.3%),且平均对话轮次从8.4降至3.9。

链接: https://arxiv.org/abs/2604.05350
作者: Vishaal Kapoor,Mariam Dundua,Sarthak Ahuja,Neda Kordjazi,Evren Yortucboylu,Vaibhavi Padala,Derek Ho,Jennifer Whitted,Rebecca Steinert
机构: Amazon
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages, 2 tables, accepted at ACL 2026 Industry Track

点击查看摘要

Abstract:Enterprise IT support interactions are fundamentally diagnostic: effective resolution requires iterative evidence gathering from ambiguous user reports to identify an underlying root cause. While retrieval-augmented generation (RAG) provides grounding through historical cases, standard multi-turn RAG systems lack explicit diagnostic state and therefore struggle to accumulate evidence and resolve competing hypotheses across turns. We introduce DQA, a diagnostic question-answering framework that maintains persistent diagnostic state and aggregates retrieved cases at the level of root causes rather than individual documents. DQA combines conversational query rewriting, retrieval aggregation, and state-conditioned response generation to support systematic troubleshooting under enterprise latency and context constraints. We evaluate DQA on 150 anonymized enterprise IT support scenarios using a replay-based protocol. Averaged over three independent runs, DQA achieves a 78.7% success rate under a trajectory-level success criterion, compared to 41.3% for a multi-turn RAG baseline, while reducing average turns from 8.4 to 3.9.

[NLP-68] Human Values Matter: Investigating How Misalignment Shapes Collective Behaviors in LLM Agent Communities

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多智能体系统中因个体价值错位而引发集体行为失序的问题,尤其是当多个LLM代理在社会性交互中形成社区时,其群体层面的失败是否由个体价值偏离人类价值观所驱动。解决方案的关键在于构建了一个基于社会科学研究理论的受控多智能体环境CIVA(Controlled Multi-Agent Environment),通过系统性操纵价值偏好并观察代理间的自主沟通、探索与资源竞争行为,量化揭示了若干结构性关键价值对群体动态的影响;实验发现,这些价值偏差不仅会导致宏观层面的系统崩溃等故障模式,还会诱发微观层面的欺骗和权力寻求等新兴行为,从而为LLM多智能体系统的价值对齐提供了实证依据和研究方向。

链接: https://arxiv.org/abs/2604.05339
作者: Xiangxu Zhang,Jiamin Wang,Qinlin Zhao,Hanze Guo,Linzhuo Li,Jing Yao,Xiao Zhou,Xiaoyuan Yi,Xing Xie
机构: Gaoling School of Artificial Intelligence, Renmin University of China; Microsoft Research Asia; Department of Sociology, Zhejiang University
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As LLMs become increasingly integrated into human society, evaluating their orientations on human values from social science has drawn growing attention. Nevertheless, it is still unclear why human values matter for LLMs, especially in LLM-based multi-agent systems, where group-level failures may accumulate from individually misaligned actions. We ask whether misalignment with human values alters the collective behavior of LLM agents and what changes it induces? In this work, we introduce CIVA, a controlled multi-agent environment grounded in social science theories, where LLM agents form a community and autonomously communicate, explore, and compete for resources, enabling systematic manipulation of value prevalence and behavioral analysis. Through comprehensive simulation experiments, we reveal three key findings. (1) We identify several structurally critical values that substantially shape the community’s collective dynamics, including those diverging from LLMs’ original orientations. Triggered by the misspecification of these values, we (2) detect system failure modes, e.g., catastrophic collapse, at the macro level, and (3) observe emergent behaviors like deception and power-seeking at the micro level. These results offer quantitative evidence that human values are essential for collective outcomes in LLMs and motivate future multi-agent value alignment.

[NLP-69] DIA-HARM: Dialectal Disparities in Harmful Content Detection Across 50 English Dialects ACL2026

【速读】: 该论文旨在解决当前有害内容检测模型(特别是虚假信息分类器)在非标准英语方言上的鲁棒性缺失问题,即这些模型主要基于标准美式英语(Standard American English, SAE)开发和评估,对全球多样化的英语方言(如美国、英国、非洲、加勒比及亚太地区方言)缺乏适应能力。其解决方案的关键在于构建了首个跨50种英语方言的基准测试框架DIA-HARM,并引入D3(Dialectal Disinformation Detection)语料库,该语料库基于Multi-VALUE的语言学变换方法从现有虚假信息基准中生成19.5万条样本。通过系统评估16个检测模型发现,人类撰写的内容在方言下导致F1分数下降1.4%–3.6%,而AI生成内容则保持稳定;同时,微调后的Transformer模型显著优于零样本大语言模型(最佳F1分别为96.6% vs. 78.3%),且部分模型在混合方言内容上出现超过33%的性能崩溃。此外,跨方言迁移分析表明多语言模型(如mDeBERTa)具有更强泛化能力(平均F1达97.2%),而单语模型(如RoBERTa和XLM-RoBERTa)在方言输入上表现不佳。这一研究揭示了现有检测系统可能系统性地歧视数亿非SAE使用者,为提升检测模型的公平性和普适性提供了关键实证依据与工具支持。

链接: https://arxiv.org/abs/2604.05318
作者: Jason Lucas,Matt Murtagh,Ali Al-Lawati,Uchendu Uchendu,Adaku Uchendu,Dongwon Lee
机构: The Pennsylvania State University (宾夕法尼亚州立大学); Trinity College Dublin (都柏林三一学院); MIT Lincoln Laboratory (麻省理工学院林肯实验室)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026

点击查看摘要

Abstract:Harmful content detectors-particularly disinformation classifiers-are predominantly developed and evaluated on Standard American English (SAE), leaving their robustness to dialectal variation unexplored. We present DIA-HARM, the first benchmark for evaluating disinformation detection robustness across 50 English dialects spanning U.S., British, African, Caribbean, and Asia-Pacific varieties. Using Multi-VALUE’s linguistically grounded transformations, we introduce D3 (Dialectal Disinformation Detection), a corpus of 195K samples derived from established disinformation benchmarks. Our evaluation of 16 detection models reveals systematic vulnerabilities: human-written dialectal content degrades detection by 1.4-3.6% F1, while AI-generated content remains stable. Fine-tuned transformers substantially outperform zero-shot LLMs (96.6% vs. 78.3% best-case F1), with some models exhibiting catastrophic failures exceeding 33% degradation on mixed content. Cross-dialectal transfer analysis across 2,450 dialect pairs shows that multilingual models (mDeBERTa: 97.2% average F1) generalize effectively, while monolingual models like RoBERTa and XLM-RoBERTa fail on dialectal inputs. These findings demonstrate that current disinformation detectors may systematically disadvantage hundreds of millions of non-SAE speakers worldwide. We release the DIA-HARM framework, D3 corpus, and evaluation tools: this https URL

[NLP-70] LLM s Should Express Uncertainty Explicitly

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在决策场景中缺乏有效不确定性表达的问题,尤其是在需要基于不确定性进行 abstention(放弃回答)、retrieval(信息检索)和 verification(验证)等操作时。现有方法通常将不确定性视为生成后的隐变量进行估计,而非作为可训练的显式信号。其解决方案的关键在于将不确定性建模为一种可控的通信接口,并对比两种互补的实现方式:一是全局接口,即模型对最终答案输出校准后的置信度分数,用于判断是否信任该答案;二是局部接口,即模型在推理过程中当进入高风险状态时发出明确的不确定标记,用于触发干预或检索。研究表明,这两种接口分别优化了不同层面的不确定性感知机制——全局置信度主要改进已有不确定性的解码方式,而局部信号则引发深层网络结构的重组,从而提升整体系统鲁棒性和可控性。

链接: https://arxiv.org/abs/2604.05306
作者: Junyu Guo,Shangding Gu,Ming Jin,Costas Spanos,Javad Lavaei
机构: University of California, Berkeley (加州大学伯克利分校); Virginia Tech (弗吉尼亚理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models are increasingly used in settings where uncertainty must drive decisions such as abstention, retrieval, and verification. Most existing methods treat uncertainty as a latent quantity to estimate after generation rather than a signal the model is trained to express. We instead study uncertainty as an interface for control. We compare two complementary interfaces: a global interface, where the model verbalizes a calibrated confidence score for its final answer, and a local interface, where the model emits an explicit uncertain marker during reasoning when it enters a high-risk state. These interfaces provide different but complementary benefits. Verbalized confidence substantially improves calibration, reduces overconfident errors, and yields the strongest overall Adaptive RAG controller while using retrieval more selectively. Reasoning-time uncertainty signaling makes previously silent failures visible during generation, improves wrong-answer coverage, and provides an effective high-recall retrieval trigger. Our findings further show that the two interfaces work differently internally: verbal confidence mainly refines how existing uncertainty is decoded, whereas reasoning-time signaling induces a broader late-layer reorganization. Together, these results suggest that effective uncertainty in LLMs should be trained as task-matched communication: global confidence for deciding whether to trust a final answer, and local signals for deciding when intervention is needed.

[NLP-71] Right at My Level: A Unified Multilingual Framework for Proficiency-Aware Text Simplification ACL2026

【速读】: 该论文旨在解决多语言文本简化(Text Simplification)中缺乏个性化平行语料库监督的问题,以及现有基于大语言模型(LLM)的可读性控制方法在非英语语言和较低熟练度水平下表现不佳的问题。解决方案的关键在于提出一种统一的强化学习框架 Re-RIGHT,无需平行语料库监督即可实现自适应多语言文本简化;其核心创新是通过整合词汇覆盖、语义保留和连贯性三个奖励模块,训练一个轻量级4B参数策略模型,在保持原文语义和流畅性的前提下,有效提升目标语言熟练度等级(如CEFR、JLPT等)下的词汇覆盖率。

链接: https://arxiv.org/abs/2604.05302
作者: Jinhong Jeong,Junghun Park,Youngjae Yu
机构: Yonsei University (延世大学); Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026

点击查看摘要

Abstract:Text simplification supports second language (L2) learning by providing comprehensible input, consistent with the Input Hypothesis. However, constructing personalized parallel corpora is costly, while existing large language model (LLM)-based readability control methods rely on pre-labeled sentence corpora and primarily target English. We propose Re-RIGHT, a unified reinforcement learning framework for adaptive multilingual text simplification without parallel corpus supervision. We first show that prompting-based lexical simplification at target proficiency levels (CEFR, JLPT, TOPIK, and HSK) performs poorly at easier levels and for non-English languages, even with state-of-the-art LLMs such as GPT-5.2 and Gemini 2.5. To address this, we collect 43K vocabulary-level data across four languages (English, Japanese, Korean, and Chinese) and train a compact 4B policy model using Re-RIGHT, which integrates three reward modules: vocabulary coverage, semantic preservation, and coherence. Compared to the stronger LLM baselines, Re-RIGHT achieves higher lexical coverage at target proficiency levels while maintaining original meaning and fluency.
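Of the three reward modules, the vocabulary-coverage term is the most mechanical. A minimal sketch follows, assuming coverage is the fraction of output tokens inside the target level's word list (e.g., a CEFR A2 list) and that the three signals are combined as a weighted sum; the weights and the combination rule are illustrative, not from the paper.

```python
def vocabulary_coverage_reward(tokens, allowed_vocab):
    """Fraction of output tokens that fall inside the target proficiency
    level's vocabulary list."""
    tokens = [t.lower() for t in tokens]
    if not tokens:
        return 0.0
    return sum(t in allowed_vocab for t in tokens) / len(tokens)

def combined_reward(tokens, allowed_vocab, semantic_sim, coherence, w=(0.5, 0.3, 0.2)):
    """Weighted sum of the three reward signals named in the abstract:
    vocabulary coverage, semantic preservation (assumed to be a score in
    [0, 1], e.g., embedding cosine similarity), and coherence (assumed to
    be an LM-based fluency score in [0, 1])."""
    coverage = vocabulary_coverage_reward(tokens, allowed_vocab)
    return w[0] * coverage + w[1] * semantic_sim + w[2] * coherence
```

The coverage term is what pushes the policy toward level-appropriate words, while the other two terms keep it from trivially gaming coverage by dropping meaning or fluency.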

[NLP-72] Beneath the Surface: Investigating LLM s Capabilities for Communicating with Subtext

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在交际场景中缺乏对“潜文本”(subtext)使用能力的问题,即模型难以理解或生成超越字面意义的隐含信息。其解决方案的关键在于构建四个新的评估套件(evaluation suites),涵盖从寓言写作与解读到多智能体和多模态游戏(如受Dixit启发的环境),以系统性地测量LLMs在不同情境下运用潜文本的能力。研究发现,尽管顶级模型仍存在强烈倾向采用字面表达(例如在Visual Allusions环境中60%的提示为字面描述),但部分模型能在存在共同知识基础时减少字面表达30%-50%,表明潜在的社会语境建模能力;同时揭示了模型在未明确提示共同知识时推理困难的问题,从而为未来实现更贴近人类社交语境的创造性沟通与推理提供了量化基准与改进方向。

链接: https://arxiv.org/abs/2604.05273
作者: Kabir Ahuja,Yuxuan Li,Andrew Kyle Lampinen
机构: Google DeepMind(谷歌深度思维)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Human communication is fundamentally creative, and often makes use of subtext – implied meaning that goes beyond the literal content of the text. Here, we systematically study whether language models can use subtext in communicative settings, and introduce four new evaluation suites to assess these capabilities. Our evaluation settings range from writing and interpreting allegories to playing multi-agent and multi-modal games inspired by the rules of board games like Dixit. We find that frontier models generally exhibit a strong bias towards overly literal, explicit communication, and thereby fail to account for nuanced constraints – even the best performing models generate literal clues 60% of the time in one of our environments – Visual Allusions. However, we find that some models can sometimes make use of common ground with another party to help them communicate with subtext, achieving a 30%-50% reduction in overly literal clues; but they struggle to infer the presence of common ground when it is not explicitly stated. For allegory understanding, we find paratextual and persona conditions to significantly shift the interpretation of subtext. Overall, our work provides quantifiable measures for an inherently complex and subjective phenomenon like subtext and reveals many weaknesses and idiosyncrasies of current LLMs. We hope this research inspires future work towards socially grounded creative communication and reasoning.

[NLP-73] Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking

【速读】: 该论文旨在解决多模态检索增强生成(Multi-modal Retrieval-Augmented Generation, MM-RAG)中重排序器(re-ranker)因将完整查询图像作为全局嵌入而易受视觉干扰项(如背景杂乱)影响,导致相似度评分失真的问题。其解决方案的关键在于提出Region-R1框架,该框架将区域选择建模为重排序过程中的决策问题,并引入一种新颖的区域感知组相对策略优化方法(region-aware group relative policy optimization, r-GRPO),使系统能够在评分前动态裁剪出与问题相关的判别性图像区域,从而提升检索相关性。实验表明,该方法在E-VQA和InfoSeek两个基准上均取得显著性能提升,Conditional Recall@1最高提升达20%。

链接: https://arxiv.org/abs/2604.05268
作者: Chan-Wei Hu,Zhengzhong Tu
机构: Texas A&M University (德州农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages, 4 figures

点击查看摘要

Abstract:Multi-modal retrieval-augmented generation (MM-RAG) relies heavily on re-rankers to surface the most relevant evidence for image-question queries. However, standard re-rankers typically process the full query image as a global embedding, making them susceptible to visual distractors (e.g., background clutter) that skew similarity scores. We propose Region-R1, a query-side region cropping framework that formulates region selection as a decision-making problem during re-ranking, allowing the system to learn to retain the full image or focus only on a question-relevant region before scoring the retrieved candidates. Region-R1 learns a policy with a novel region-aware group relative policy optimization (r-GRPO) to dynamically crop a discriminative region. Across two challenging benchmarks, E-VQA and InfoSeek, Region-R1 delivers consistent gains, achieving state-of-the-art performances by increasing conditional Recall@1 by up to 20%. These results show the great promise of query-side adaptation as a simple but effective way to strengthen MM-RAG re-ranking.

[NLP-74] Do Domain-specific Experts exist in MoE-based LLM s?

【速读】: 该论文旨在解决当前基于混合专家(Mixture of Experts, MoE)架构的大语言模型中,专家专业化特性不明确且难以系统解释的问题,特别是验证是否存在针对特定领域的专家。其解决方案的关键在于提出了一种无需额外训练的“领域导向混合专家”(Domain Steering Mixture of Experts, DSMoE)框架,通过实证证明了MoE模型中确实存在领域特异性专家,并利用这些专家在推理阶段进行无额外计算开销的领域引导控制,从而在多个目标与非目标领域上实现性能提升和鲁棒泛化能力,同时保持与原始模型相同的推理效率。

链接: https://arxiv.org/abs/2604.05267
作者: Giang Do,Hung Le,Truyen Tran
机构: Deakin University (迪肯大学)
类目: Computation and Language (cs.CL)
备注: 15 pages

点击查看摘要

Abstract:In the era of Large Language Models (LLMs), the Mixture of Experts (MoE) architecture has emerged as an effective approach for training extremely large models with improved computational efficiency. This success builds upon extensive prior research aimed at enhancing expert specialization in MoE-based LLMs. However, the nature of such specializations and how they can be systematically interpreted remain open research challenges. In this work, we investigate this gap by posing a fundamental question: Do domain-specific experts exist in MoE-based LLMs? To answer the question, we evaluate ten advanced MoE-based LLMs ranging from 3.8B to 120B parameters and provide empirical evidence for the existence of domain-specific experts. Building on this finding, we propose Domain Steering Mixture of Experts (DSMoE), a training-free framework that introduces zero additional inference cost and outperforms both well-trained MoE-based LLMs and strong baselines, including Supervised Fine-Tuning (SFT). Experiments on four advanced open-source MoE-based LLMs across both target and non-target domains demonstrate that our method achieves strong performance and robust generalization without increasing inference cost or requiring additional retraining. Our implementation is publicly available at this https URL.
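Training-free domain steering of a MoE router can be pictured as a fixed logit bias on the experts identified as specific to the target domain, applied before ordinary top-k gating. The sketch below is a hedged illustration under that assumption; the bias form and renormalization are not taken from the paper, and DSMoE's actual mechanism may differ.

```python
import numpy as np

def steered_topk_routing(router_logits, domain_experts, bias=1.0, k=2):
    """Shift the router logits of domain-specific experts, then apply
    standard top-k routing. Training-free: expert weights are untouched;
    only the gate's logits change at inference time."""
    logits = np.asarray(router_logits, dtype=float).copy()
    logits[list(domain_experts)] += bias
    topk = np.argsort(logits)[-k:][::-1]          # indices of the k largest logits
    # Renormalize gate weights over the selected experts (softmax on the top-k).
    w = np.exp(logits[topk] - logits[topk].max())
    return topk, w / w.sum()
```

Because the intervention is a pure logit shift at the gate, it adds no parameters and no extra FLOPs beyond the routing step, matching the paper's "zero additional inference cost" framing.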

[NLP-75] DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models

【速读】: 该论文旨在解决掩码扩散模型(Masked Diffusion Models, MDMs)在推理阶段效率低下的问题,其核心瓶颈在于双向注意力机制无法缓存键值对(key-value pairs),导致每一步生成均需 O(N²) 计算复杂度,严重限制了推理速度。解决方案的关键在于提出 DualDiffusion,一种针对 MDM 的推测解码(speculative decoding)框架,通过结合轻量级快速 drafter 模型(采用高效近似方法)与更准确的 verifier 模型,在多个 drafter 步骤后仅执行一次验证步骤,从而在保持高生成质量的同时显著减少所需生成步数,有效优化了生成效率与准确率之间的权衡曲线。

链接: https://arxiv.org/abs/2604.05250
作者: Satyam Goyal,Kushal Patel,Tanush Mittal,Arjun Laxman
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Masked Diffusion Models (MDMs) offer a promising alternative to autoregressive language models by enabling parallel token generation and bidirectional context modeling. However, their inference speed is significantly limited by the inability to cache key-value pairs due to bidirectional attention, requiring O(N^2) computations at each generation step. While recent methods like FastDLLM and DkvCache improve inference speed through attention approximations and caching strategies, they achieve speedups at the cost of generation quality. We propose DualDiffusion, a speculative decoding framework for MDMs that combines fast drafter models (using efficient approximations) with slower, more accurate verifier models. By running multiple steps of a lightweight drafter followed by a single verification step, DualDiffusion achieves a superior Pareto frontier between generation steps and accuracy compared to existing approaches. We evaluate our method on MMLU and GSM8K, demonstrating that DualDiffusion maintains high accuracy while reducing the number of generation steps required, effectively pushing the quality-efficiency trade-off curve for masked diffusion language models.
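The draft-then-verify loop for a masked sequence can be sketched abstractly: the cheap drafter unmasks positions over several steps, then one pass of the slower verifier keeps only the positions it agrees with and re-masks the rest for the next round. Everything below (the function signatures, the unmasking schedule) is illustrative scaffolding, not DualDiffusion's implementation.

```python
def dual_diffusion_step(x, masked, drafter, verifier, draft_steps=4):
    """One draft-then-verify round over a partially masked sequence.

    x: list of token ids with None at masked positions (mutated in place).
    masked: set of masked position indices.
    drafter(x, masked) / verifier(x): return per-position token id predictions.
    """
    proposed = set()
    for _ in range(draft_steps):
        preds = drafter(x, masked)
        # Unmask a fraction of the remaining positions each cheap draft step.
        for i in sorted(masked)[: max(1, len(masked) // draft_steps)]:
            x[i] = preds[i]
            masked.discard(i)
            proposed.add(i)
        if not masked:
            break
    # Single verifier pass: keep agreements, re-mask disagreements.
    ref = verifier(x)
    for i in proposed:
        if ref[i] != x[i]:
            x[i] = None
            masked.add(i)
    return x, masked
```

The efficiency win comes from amortization: several cheap drafter steps per expensive verifier pass, with quality protected because only verifier-approved tokens survive the round.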

[NLP-76] Improving Sparse Memory Finetuning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中面临的持续学习问题,即如何在不损害已有知识的前提下动态更新模型以适应新知识。传统微调方法(如全参数微调或参数高效方法LoRA)因修改共享的密集表示而引发灾难性遗忘(catastrophic forgetting),难以实现稳定的知识增量。其解决方案的关键在于提出一种稀疏记忆微调(Sparse Memory Finetuning, SMF)机制,通过将参数更新限制在显式记忆层中的小规模参数子集上,从而减少任务间干扰;进一步引入基于KL散度的槽位选择机制,优先对相对于背景分布具有信息“惊喜度”的token进行记忆更新,从理论上保障了更新的有效性和针对性。实验表明,该方法可在消费级硬件上实现低遗忘的知识增量学习,验证了稀疏更新策略的实用性与有效性。

链接: https://arxiv.org/abs/2604.05248
作者: Satyam Goyal,Anirudh Kanchi,Garv Shah,Prakhar Gupta
机构: University of Michigan, Ann Arbor(密歇根大学,安娜堡分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are typically static after training, yet real-world applications require continual adaptation to new knowledge without degrading existing capabilities. Standard approaches to updating models, like full finetuning or parameter-efficient methods (e.g., LoRA), face a fundamental trade-off: catastrophic forgetting. They modify shared dense representations, causing interference across tasks. Sparse Memory Finetuning (SMF) offers a promising alternative by localizing updates to a small subset of parameters in explicit memory layers. In this work, we present an open-source pipeline to retrofit existing pretrained models (Qwen-2.5-0.5B) with sparse memory modules, enabling effective continual learning on consumer hardware. We extend prior work by introducing a theoretically grounded slot-selection mechanism based on Kullback-Leibler (KL) divergence, which prioritizes memory updates for informationally “surprising” tokens relative to a background distribution. Our experiments demonstrate that our retrofitted models can acquire new factual knowledge with minimal forgetting of held-out capabilities, validating the sparse update hypothesis in a practical setting.
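The KL-based slot-selection idea, updating memory only for tokens whose predictive distribution is "surprising" relative to a background distribution, can be sketched directly. A minimal illustration follows; the choice of background distribution and the top-k cutoff are assumptions for the example.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over a shared vocabulary, with smoothing for zero entries."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float((p * np.log(p / q)).sum())

def select_surprising_tokens(token_dists, background, top_k=2):
    """Rank token positions by KL(p_token || p_background) and return the
    indices of the top_k most surprising ones, i.e., the positions whose
    predictive distribution deviates most from the background."""
    scores = [kl_divergence(p, background) for p in token_dists]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
```

Restricting sparse-memory updates to these high-surprise positions is what keeps writes localized, which is the mechanism the sparse update hypothesis relies on to limit forgetting.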

[NLP-77] Exemplar Retrieval Without Overhypothesis Induction: Limits of Distributional Sequence Learning in Early Word Learning

【速读】: 该论文旨在解决儿童如何通过归纳推理获得关于物体类别本质特征的第二层抽象认知——即“超假设”(overhypothesis),例如理解形状是区分物体类别的稳定属性,而非仅记忆具体实例。其核心问题是:当前基于分布序列学习的自回归Transformer语言模型是否具备实现此类跨类别抽象的能力。解决方案的关键在于设计了120次预注册实验,在包含8种控制条件的合成语料库上训练不同规模(3.4M–25.6M参数)的模型,并通过1,040项wug测试电池评估其对新名词的第二层泛化能力。结果显示,所有模型均能完美识别第一层实例(100%准确率),但对新词的第二层抽象泛化仍停留在随机水平(50–52%),且特征交换诊断表明模型依赖于模板匹配而非结构化的名词-领域-特征抽象机制,揭示出自回归分布序列学习在发展尺度训练条件下存在明确局限性。

链接: https://arxiv.org/abs/2604.05243
作者: Jon-Paul Cacioli
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 27 pages, 7 figures, 22 references. Pre-registered study (OSF: this https URL ). Code and data: this https URL . Submitted to Cognitive Computation

点击查看摘要

Abstract:Background: Children do not simply learn that balls are round and blocks are square. They learn that shape is the kind of feature that tends to define object categories – a second-order generalisation known as an overhypothesis [1, 2]. What kind of learning mechanism is sufficient for this inductive leap? Methods: We trained autoregressive transformer language models (3.4M-25.6M parameters) on synthetic corpora in which shape is the stable feature dimension across categories, with eight conditions controlling for alternative explanations. Results: Across 120 pre-registered runs evaluated on a 1,040-item wug test battery, every model achieved perfect first-order exemplar retrieval (100%) while second-order generalisation to novel nouns remained at chance (50-52%), a result confirmed by equivalence testing. A feature-swap diagnostic revealed that models rely on frame-to-feature template matching rather than structured noun-to-domain-to-feature abstraction. Conclusions: These results reveal a clear limitation of autoregressive distributional sequence learning under developmental-scale training conditions.

[NLP-78] XMark: Reliable Multi-Bit Watermarking for LLM-Generated Texts ACL2026

【速读】: 该论文旨在解决多比特水印(multi-bit watermarking)在大型语言模型(Large Language Model, LLM)生成文本中面临的三大挑战:一是现有方法在处理大容量消息时计算复杂度急剧上升,二是文本质量与解码准确性之间难以平衡,三是当生成文本token数量受限时,解码准确率显著下降。解决方案的关键在于提出XMark方法,其编码器设计通过生成更少失真的logit分布来提升水印嵌入的隐蔽性与文本质量,同时其定制化的解码器能够在token数量有限的情况下依然可靠地恢复原始二进制信息,从而在多个下游任务中实现更高的解码准确率和更好的文本保真度。

链接: https://arxiv.org/abs/2604.05242
作者: Jiahao Xu,Rui Hu,Olivera Kotevska,Zikai Zhang
机构: University of Nevada, Reno (内华达大学里诺分校); Oak Ridge National Laboratory (橡树岭国家实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Accepted by ACL 2026 as a main conference paper

点击查看摘要

Abstract:Multi-bit watermarking has emerged as a promising solution for embedding imperceptible binary messages into Large Language Model (LLM)-generated text, enabling reliable attribution and tracing of malicious usage of LLMs. Despite recent progress, existing methods still face key limitations: some become computationally infeasible for large messages, while others suffer from a poor trade-off between text quality and decoding accuracy. Moreover, the decoding accuracy of existing methods drops significantly when the number of tokens in the generated text is limited, a condition that frequently arises in practical usage. To address these challenges, we propose XMark, a novel method for encoding and decoding binary messages in LLM-generated texts. The unique design of XMark’s encoder produces a less distorted logit distribution for watermarked token generation, preserving text quality, and also enables its tailored decoder to reliably recover the encoded message with limited tokens. Extensive experiments across diverse downstream tasks show that XMark significantly improves decoding accuracy while preserving the quality of watermarked text, outperforming prior methods. The code is at this https URL.

[NLP-79] On the Geometry of Positional Encodings in Transformers

【速读】: 该论文旨在解决Transformer模型中位置编码(Positional Encoding)设计缺乏数学理论指导的问题,即如何从理论上明确位置编码应具备的性质及其最优构造方式。其核心解决方案在于建立一套完整的理论框架:首先证明了无位置信号的Transformer无法处理依赖词序的任务(必要性定理);其次,在温和条件下证明了训练过程会自动为不同位置分配唯一向量表示(位置分离定理);进而提出基于Hellinger距离的多维尺度分析(MDS)方法来构造信息最优的位置编码,并以应力(stress)作为统一评价指标;最后揭示最优编码具有低秩结构(有效秩为n−1),可通过更少参数实现(最小参数化结果)。这一理论体系不仅解释了现有位置编码机制的有效性差异,还为设计高效、可解释的位置编码提供了数学依据。

链接: https://arxiv.org/abs/2604.05217
作者: Giansalvo Cirrincione
机构: Université de Picardie Jules Verne (皮卡第朱尔斯·凡尔纳大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Neural language models process sequences of words, but the mathematical operations inside them are insensitive to the order in which words appear. Positional encodings are the component added to remedy this. Despite their importance, positional encodings have been designed largely by trial and error, without a mathematical theory of what they ought to do. This paper develops such a theory. Four results are established. First, any Transformer without a positional signal cannot solve any task sensitive to word order (Necessity Theorem). Second, training assigns distinct vector representations to distinct sequence positions at every global minimiser, under mild and verifiable conditions (Positional Separation Theorem). Third, the best achievable approximation to an information-optimal encoding is constructed via classical multidimensional scaling (MDS) on the Hellinger distance between positional distributions; the quality of any encoding is measured by a single number, the stress (Proposition 5, Algorithm 1). Fourth, the optimal encoding has effective rank r = rank(B) = n-1 and can be represented with r(n+d) parameters instead of nd (minimal parametrisation result). Appendix A develops a proof of the Monotonicity Conjecture within the Neural Tangent Kernel (NTK) regime for masked language modelling (MLM) losses, sequence classification losses, and general losses satisfying a positional sufficiency condition, through five lemmas. Experiments on SST-2 and IMDB with BERT-base confirm the theoretical predictions and reveal that Attention with Linear Biases (ALiBi) achieves much lower stress than the sinusoidal encoding and Rotary Position Embedding (RoPE), consistent with a rank-1 interpretation of the MDS encoding under approximate shift-equivariance. 
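Two quantities from the abstract, the Hellinger distance between positional distributions and the stress of a candidate encoding, are easy to state concretely. A minimal sketch with invented toy distributions (the paper's full MDS construction is not reproduced here):

```python
import math

def hellinger(p, q):
    # Hellinger distance between two discrete distributions
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def stress(encodings, target):
    """Raw stress between pairwise encoding distances and target
    distances; 0 means the encoding reproduces the target geometry."""
    num = den = 0.0
    n = len(encodings)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(encodings[i], encodings[j])
            num += (d - target[i][j]) ** 2
            den += target[i][j] ** 2
    return math.sqrt(num / den)
```

Because the Hellinger distance equals the Euclidean distance between sqrt(p)/sqrt(2) coordinates, embedding each distribution by its square-root map achieves zero stress, which illustrates the sense in which an information-optimal encoding exists.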

[NLP-80] Faster Superword Tokenization

【速读】: 该论文旨在解决传统字节对编码(Byte Pair Encoding, BPE)算法在词元化(tokenization)过程中无法跨预分词边界(pre-tokenization boundaries)形成复合词(superwords)的问题,从而限制其表达短语级语义的能力。为突破这一限制,作者提出通过引入超合并(supermerge)机制来生成跨越预分词边界的更长词元,同时改进训练效率——关键在于发现超合并候选项(supermerge candidates)可像常规预词元一样按频率聚合,从而避免将完整文档保留在内存中,显著降低存储开销并加速训练过程。具体而言,论文设计了两阶段的BoundlessBPE框架:第一阶段学习普通合并,第二阶段学习超合并,实现与原始方法等效的结果;并揭示该两阶段BoundlessBPE与SuperBPE近乎等价,仅需自动确定一个原本依赖人工调参的超参数。最终,该方案使训练速度提升超过600倍,且开源了高效Python与Rust实现。

链接: https://arxiv.org/abs/2604.05192
作者: Craig W. Schmidt,Chris Tanner,Yuval Pinter
机构: Kensho Technologies(肯肖科技); Ben-Gurion University of the Negev(本-古里安大学); MIT(麻省理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend and improve BPE by relaxing this limitation and allowing the formation of superwords, which are combinations of pretokens that form phrases. However, previous implementations were impractical to train: for example, BoundlessBPE took 4.7 CPU days to train on 1GB of data. We show that supermerge candidates, two or more consecutive pretokens eligible to form a supermerge, can be aggregated by frequency much like regular pretokens. This avoids keeping full documents in memory, as the original implementations of BoundlessBPE and SuperBPE required, leading to a significant training speedup. We present a two-phase formulation of BoundlessBPE that separates first-phase learning of regular merges from second-phase learning of supermerges, producing identical results to the original implementation. We also show a near-equivalence between two-phase BoundlessBPE and SuperBPE, with the difference being that a manually selected hyperparameter used in SuperBPE can be automatically determined in the second phase of BoundlessBPE. These changes enable a much faster implementation, allowing training on that same 1GB of data in 603 and 593 seconds for BoundlessBPE and SuperBPE, respectively, a more than 600x increase in speed. For each of BoundlessBPE, SuperBPE, and BPE, we open-source both a reference Python implementation and a fast Rust implementation.
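The key speedup observation, that adjacent-pretoken supermerge candidates can be aggregated by frequency just like regular pretokens, can be sketched as a simple pair count over pretokenized sequences (the corpus below is an invented toy):

```python
from collections import Counter

def supermerge_candidates(corpus):
    """Count adjacent-pretoken pairs across a corpus of pretokenized
    sequences; the most frequent pairs are the supermerge candidates
    that may cross pre-tokenization boundaries."""
    pairs = Counter()
    for seq in corpus:
        pairs.update(zip(seq, seq[1:]))
    return pairs
```

Once counted, only the pair-to-frequency table is needed for training, which is exactly what removes the need to hold full documents in memory.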

[NLP-81] Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering LREC2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对越狱攻击(jailbreak)和直接提示注入攻击(prompt-injection attacks)时的安全性问题,同时缓解现有防御机制中因过度拒绝合法查询而导致用户体验下降的缺陷。其核心解决方案是提出一种无需训练的防护机制——梯度控制解码(Gradient-Controlled Decoding, GCD),其关键在于引入两个锚定标记:接受锚点 token(“Sure”)与拒绝锚点 token(“Sorry”),通过紧缩决策边界显著降低误报率;在缓解阶段,若检测到潜在危险提示,则预注入一个或两个拒绝 token(如 “Sorry, I can’t…”)于自回归解码前,从而无论采样策略如何均能确保首 token 安全,实现确定性的安全保障。

链接: https://arxiv.org/abs/2604.05179
作者: Purva Chiniya,Kevin Scaria,Sagar Chaturvedi
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at LREC2026

点击查看摘要

Abstract:Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Previous work on jailbreak and prompt injection detection such as GradSafe, detects unsafe prompts with a single “accept all” anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines an acceptance anchor token (“Sure”) and refusal anchor token (“Sorry”) tightening the decision boundary and significantly lowering false positives. In the mitigation stage, if a prompt is flagged, GCD preset-injects one or two refusal tokens (“Sorry, I can’t…”) before autoregressive decoding resumes, guaranteeing first-token safety regardless of sampling strategy. On ToxicChat, XSTest-v2, and AdvBench, GCD reduces false positives by 52% vs. GradSafe at comparable recall, lowers attack success rate by up to 10% vs. the strongest decoding-only baseline, adds under 15-20 ms latency on an average on V100 instances, transfers to LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B, and requires only 20 demonstration templates.
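The two-stage behavior, dual-anchor detection followed by refusal-token pre-injection, can be sketched with stand-in anchor scores (real GCD derives these from gradients; the scores, margin, and decoder below are invented toys):

```python
def gcd_guard(anchor_scores, decode_fn, margin=0.0,
              refusal_prefix=("Sorry,", "I", "can't")):
    """Flag the prompt when the acceptance anchor ('Sure') outweighs
    the refusal anchor ('Sorry') by more than `margin`; if flagged,
    pre-inject refusal tokens before decoding resumes, so the first
    emitted token is safe regardless of sampling strategy."""
    flagged = anchor_scores["Sure"] - anchor_scores["Sorry"] > margin
    prefix = list(refusal_prefix) if flagged else []
    return prefix + decode_fn(prefix)
```

Because the refusal prefix is injected before autoregressive decoding resumes, the first-token safety guarantee holds deterministically, independent of temperature or sampling.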

[NLP-82] What Makes a Good Response? An Empirical Analysis of Quality in Qualitative Interviews

【速读】: 该论文旨在解决当前定性访谈研究中缺乏对访谈回答质量的有效验证问题,即现有提出的多种访谈质量评估指标尚未被证实能够真正预测回答对研究结论的贡献。其解决方案的关键在于构建了一个包含343份访谈转录文本(共16,940个参与者回答)的全新数据集——定性访谈语料库(Qualitative Interview Corpus),并系统评估了10种已提出的影响回答质量的指标,最终发现:直接与核心研究问题的相关性是预测回答质量最强的指标;而常用于自然语言处理(NLP)访谈系统评估的清晰度和基于 surprisal 的信息量指标,并不具备预测能力。这一成果为定性研究设计和自动化访谈系统评估提供了可落地、可扩展的实证依据。

链接: https://arxiv.org/abs/2604.05163
作者: Jonathan Ivey,Anjalie Field,Ziang Xiao
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 24 pages, 14 figures

点击查看摘要

Abstract:Qualitative interviews provide essential insights into human experiences when they elicit high-quality responses. While qualitative and NLP researchers have proposed various measures of interview quality, these measures lack validation that high-scoring responses actually contribute to the study’s goals. In this work, we identify, implement, and evaluate 10 proposed measures of interview response quality to determine which are actually predictive of a response’s contribution to the study findings. To conduct our analysis, we introduce the Qualitative Interview Corpus, a newly constructed dataset of 343 interview transcripts with 16,940 participant responses from 14 real research projects. We find that direct relevance to a key research question is the strongest predictor of response quality. We additionally find that two measures commonly used to evaluate NLP interview systems, clarity and surprisal-based informativeness, are not predictive of response quality. Our work provides analytic insights and grounded, scalable metrics to inform the design of qualitative studies and the evaluation of automated interview systems.

[NLP-83] Planning to Explore: Curiosity-Driven Planning for LLM Test Generation

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的代码测试生成方法在面对复杂程序时效率低下、难以深入探索深层分支的问题。现有策略多采用贪心方式最大化即时覆盖率,导致在需要前置设置才能触及深层分支的场景中陷入局部最优。其解决方案的关键在于引入贝叶斯探索思想,将程序分支结构视为未知环境,以覆盖地图作为概率后验估计,并设计了CovQValue方法:通过反馈覆盖信息引导LLM并行生成多样化测试计划,再依据LLM估算的Q值选择最具信息量的动作,从而在即时分支发现与未来可达性之间取得平衡。实验表明,该方法显著优于传统贪心策略,在TestGenEval Lite和自建的RepoExploreBench基准上均实现更高覆盖率与更强探索能力。

链接: https://arxiv.org/abs/2604.05159
作者: Alfonso Amayuelas,Firas Laakom,Piotr Piękos,Wenyi Wang,Yifan Xu,Yuhui Wang,Jürgen Schmidhuber,William Wang
机构: University of California, Santa Barbara; King Abdullah University of Science and Technology
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The use of LLMs for code generation has naturally extended to code testing and evaluation. As codebases grow in size and complexity, so does the need for automated test generation. Current approaches for LLM-based test generation rely on strategies that maximize immediate coverage gain, a greedy approach that plateaus on code where reaching deep branches requires setup steps that individually yield zero new coverage. Drawing on principles of Bayesian exploration, we treat the program’s branch structure as an unknown environment, and an evolving coverage map as a proxy probabilistic posterior representing what the LLM has discovered so far. Our method, CovQValue, feeds the coverage map back to the LLM, generates diverse candidate plans in parallel, and selects the most informative plan by LLM-estimated Q-values, seeking actions that balance immediate branch discovery with future reachability. Our method outperforms greedy selection on TestGenEval Lite, achieving 51-77% higher branch coverage across three popular LLMs and winning on 77-84% of targets. In addition, we build a benchmark for iterative test generation, RepoExploreBench, on which our method achieves 40-74% coverage. These results show the potential of curiosity-driven planning methods for LLM-based exploration, enabling more effective discovery of program behavior through sequential interaction.
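The plan-selection step can be sketched minimally: given the current coverage map, generate candidate plans and keep the one with the highest estimated Q-value. The Q estimator here is a trivial invented stand-in for the LLM's judgment:

```python
def select_plan(plans, coverage_map, q_estimate):
    """Return the candidate plan with the highest estimated Q-value
    under the current coverage map."""
    return max(plans, key=lambda p: q_estimate(p, coverage_map))

def toy_q(plan, coverage_map):
    # stand-in Q: count targeted branches that are still uncovered
    return sum(1 for b in plan["targets"] if not coverage_map.get(b, False))
```

A real Q estimate would also value setup steps that unlock future branches even when their immediate coverage gain is zero, which is what distinguishes the approach from greedy selection.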

[NLP-84] Just Pass Twice: Efficient Token Classification with LLMs for Zero-Shot NER

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在零样本命名实体识别(Zero-shot Named Entity Recognition, NER)任务中因因果注意力机制(causal attention mechanism)限制而无法有效利用未来上下文信息的问题。现有方法依赖生成式推理,存在解码速度慢、实体幻觉和格式错误等缺陷。其解决方案的关键在于提出“仅传两次”(Just Pass Twice, JPT)策略:通过将输入序列与其自身拼接,使第二轮前向传播中每个token能访问完整句子的双向上下文,从而无需修改模型架构即可实现判别式token分类;同时结合定义引导的实体嵌入(definition-guided entity embeddings),提升零样本泛化能力。该方法在CrossNER和MIT基准上平均F1得分提升7.9点,且推理速度超过同类生成式方法20倍以上。

链接: https://arxiv.org/abs/2604.05158
作者: Ahmed Ewais,Ahmed Hashish,Amr Ali
机构: WitnessAI
类目: Computation and Language (cs.CL)
备注: 16 pages, 9 figures, 12 tables

点击查看摘要

Abstract:Large language models encode extensive world knowledge valuable for zero-shot named entity recognition. However, their causal attention mechanism, where tokens attend only to preceding context, prevents effective token classification when disambiguation requires future context. Existing approaches use LLMs generatively, prompting them to list entities or produce structured outputs, but suffer from slow autoregressive decoding, hallucinated entities, and formatting errors. We propose Just Pass Twice (JPT), a simple yet effective method that enables causal LLMs to perform discriminative token classification with full bidirectional context. Our key insight is that concatenating the input to itself lets each token in the second pass attend to the complete sentence, requiring no architectural modifications. We combine these representations with definition-guided entity embeddings for flexible zero-shot generalization. Our approach achieves state-of-the-art results on zero-shot NER benchmarks, surpassing the previous best method by +7.9 F1 on average across CrossNER and MIT benchmarks, being over 20x faster than comparable generative methods.
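The core trick, concatenating the input to itself so that under causal attention every token's second copy sees the whole sentence, reduces to an index-mapping exercise. A minimal sketch (the actual model forward pass is not shown):

```python
def jpt_doubling(tokens):
    """Duplicate the sequence and return, for each original position i,
    the index len(tokens)+i whose hidden state should be read out:
    under causal attention that copy attends to the entire first pass."""
    doubled = list(tokens) + list(tokens)
    readout = [len(tokens) + i for i in range(len(tokens))]
    return doubled, readout
```

Every readout index is at least `len(tokens) - 1`, so each classified token's representation is conditioned on the full first copy, giving bidirectional context without architectural changes.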

[NLP-85] EvolveRouter: Co-Evolving Routing and Prompt for Multi-Agent Question Answering

【速读】: 该论文旨在解决多智能体问答系统中两个关键问题:一是现有路由方法通常在固定智能体池上优化,无法提升智能体自身的性能;二是多数路由机制采用刚性协作模式,难以根据查询需求动态调整参与协作的智能体数量。解决方案的关键在于提出 EvolveRouter,这是一个可训练的框架,通过闭环协同进化实现智能体质量与协作结构的联合优化:首先,将基于图的查询路由与针对性指令微调相结合,在闭环过程中利用路由诊断指导智能体改进,同时改进后的智能体为路由提供更清晰的监督信号;其次,引入自适应推理策略,通过路由器加权的答案一致性动态确定每条查询的有效协作规模。这一设计显著提升了多智能体推理的能力与效率。

链接: https://arxiv.org/abs/2604.05149
作者: Jiatan Huang,Zheyuan Zhang,Kaiwen Shi,Yanfang Ye,Chuxu Zhang
机构: University of Connecticut (康涅狄格大学); University of Notre Dame (圣母大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language model agents often exhibit complementary strengths, making routing a promising approach for multi-agent question answering. However, existing routing methods remain limited in two important ways: they typically optimize over a fixed pool of agents without improving the agents themselves, and they often rely on rigid collaboration schemes that cannot adapt the number of participating agents to the query. We propose EvolveRouter, a trainable framework that addresses both limitations by jointly improving agent quality and collaboration structure. First, EvolveRouter couples graph-based query routing with targeted instruction refinement in a closed-loop co-evolution process, allowing router diagnostics to guide agent improvement while refined agents provide cleaner supervision for routing. Second, it introduces an adaptive inference strategy that dynamically determines the effective collaboration size for each query through router-weighted answer agreement. Together, these designs enable more capable and more efficient multi-agent reasoning. Experiments on five question answering benchmarks show that EvolveRouter consistently outperforms SOTA routing baselines in both F1 and exact match, while further analysis confirms the benefits of closed-loop refinement and adaptive collaboration.
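The adaptive inference step, growing the set of consulted agents until router-weighted answer agreement is high enough, can be sketched as follows (the agents, weights, and threshold are invented toys):

```python
from collections import defaultdict

def adaptive_collaboration(ranked_agents, weights, answer_fn,
                           threshold=0.7, min_agents=2):
    """Consult agents in router-preference order; stop as soon as the
    router-weighted share of the leading answer reaches `threshold`
    (after at least `min_agents` votes)."""
    votes, total = defaultdict(float), 0.0
    best = None
    for k, agent in enumerate(ranked_agents, start=1):
        votes[answer_fn(agent)] += weights[agent]
        total += weights[agent]
        best, score = max(votes.items(), key=lambda kv: kv[1])
        if k >= min_agents and score / total >= threshold:
            return best, k
    return best, k
```

Easy queries terminate after two agreeing agents; harder queries automatically recruit more agents, which is the claimed efficiency benefit of adaptive collaboration size.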

[NLP-86] EffiPair: Improving the Efficiency of LLM-generated Code with Relative Contrastive Feedback

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成代码时存在运行效率低下(如时间复杂度高、内存占用大)的问题,而传统优化方法依赖于单个程序的绝对执行反馈(如性能剖析),成本高且指导性弱。解决方案的关键在于提出一种无需模型微调或参数更新的推理阶段反馈机制——相对对比反馈(Relative Contrastive Feedback, RCF),其核心思想是通过比较同一任务下结构相似但效率不同的两个程序,提取二者在执行过程中的差异作为轻量级反馈信号,从而为迭代优化提供直接且高效的引导。基于此,作者进一步构建了EffiPair框架,该框架在测试阶段通过生成多个候选解、识别高效差距显著的程序对、总结差异并驱动更优解的生成,实现了显著的效率提升(最高达1.5倍加速)同时大幅降低Token消耗(较先前方法减少90%以上)。

链接: https://arxiv.org/abs/2604.05137
作者: Samira Hajizadeh,Suman Jana
机构: Columbia University (哥伦比亚大学)
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Large language models (LLMs) often generate code that is functionally correct but inefficient in runtime and memory. Prior approaches to improving code efficiency typically rely on absolute execution feedback, such as profiling a single program’s runtime or memory usage, which is costly and provides weak guidance for refinement. We propose Relative Contrastive Feedback (RCF), an inference-time feedback mechanism that requires no model fine-tuning or parameter updates. RCF compares two structurally similar programs for the same task and highlights the differences associated with better efficiency. Building on this idea, we introduce EffiPair, an inference-time iterative refinement framework that operates entirely at test time by generating multiple candidate solutions, identifying informative program pairs with large efficiency gaps, summarizing their execution differences into lightweight feedback, and using this signal to produce more efficient solutions. By replacing isolated scalar feedback with pairwise contrastive comparisons, EffiPair provides more direct guidance while reducing profiling and prompting overhead. Experiments on code-efficiency benchmarks show that EffiPair consistently improves efficiency while preserving correctness. For instance, with DeepSeek-Chat V3.2, EffiPair achieves up to 1.5x speedup over generation without performance feedback, while reducing token usage by more than 90% compared to prior work.
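Pair selection for the contrastive feedback can be sketched as: among correct candidates, keep structurally similar pairs and choose the one with the widest runtime gap. The token-overlap similarity and candidate programs below are crude invented stand-ins:

```python
def token_jaccard(a, b):
    # crude structural similarity: Jaccard overlap of whitespace tokens
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def pick_contrast_pair(candidates, min_sim=0.3):
    """candidates: list of (program_text, runtime). Return the
    (slower, faster) pair with the largest runtime gap among pairs
    whose similarity exceeds `min_sim`."""
    best, best_gap = None, -1.0
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            (pa, ta), (pb, tb) = candidates[i], candidates[j]
            if token_jaccard(pa, pb) < min_sim:
                continue
            if abs(ta - tb) > best_gap:
                best_gap = abs(ta - tb)
                best = (pa, pb) if ta > tb else (pb, pa)
    return best
```

Summarizing the differences between the selected (slower, faster) pair then becomes the lightweight feedback signal, replacing per-program profiling.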

[NLP-87] SenseAI: A Human-in-the-Loop Dataset for RLHF-Aligned Financial Sentiment Reasoning

【速读】: 该论文旨在解决当前金融领域大语言模型(Large Language Models, LLMs)在推理过程中缺乏可解释性与可控性的问题,尤其是模型输出错误往往难以定位和修正。传统金融情感数据集仅提供标签结果,未记录模型决策的完整推理链条及人类反馈信号,导致无法有效支持基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)范式下的模型优化。其解决方案的关键在于构建SenseAI这一人机协同验证(Human-in-the-loop, HITL)的金融情感数据集,该数据集不仅包含1,439个标注样本和13类财务数据,还系统性地整合了推理链(reasoning chains)、置信度分数、人工校正信号以及真实市场结果,从而为LLM提供结构化反馈信号,使模型误差呈现可预测且可纠正的模式,例如首次识别出的“潜在推理漂移”(Latent Reasoning Drift)现象,进而推动金融AI系统的精准评估与对齐。

链接: https://arxiv.org/abs/2604.05135
作者: Berny Kabalisa
机构: RizqSpark; SenseAI
类目: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
备注: Dataset available on request (bernykabalisa18@gmail.com) See GitHub for dataset snapshot and automated data collection script demo this https URL

点击查看摘要

Abstract:We introduce SenseAI, a human-in-the-loop (HITL) validated financial sentiment dataset designed to capture not only model outputs but the full reasoning process behind them. Unlike existing resources, SenseAI incorporates reasoning chains, confidence scores, human correction signals, and real-world market outcomes, providing a structure aligned with Reinforcement Learning from Human Feedback (RLHF) paradigms. The dataset consists of 1,439 labelled data points across 40 US-listed equities and 13 financial data categories, enabling direct integration into modern LLM fine-tuning pipelines. Through analysis, we identify several systematic patterns in model behavior, including a novel failure mode we term Latent Reasoning Drift, where models introduce information not grounded in the input, as well as consistent confidence miscalibration and forward projection tendencies. These findings suggest that LLM errors in financial reasoning are not random but occur within a predictable and correctable regime, supporting the use of structured HITL data for targeted model improvement. We discuss implications for financial AI systems and highlight opportunities for applying SenseAI in model evaluation and alignment.

[NLP-88] Watch Before You Answer: Learning from Visually Grounded Post-Training

【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在视频理解任务中性能滞后的问题,尤其指出现有长视频理解基准测试中高达40%-60%的题目仅依赖文本线索即可回答,导致模型难以真正提升视觉感知能力。此外,这种文本偏倚也广泛存在于主流后训练数据集中,削弱了后训练对视频理解性能的改进效果。解决方案的关键在于引入VidGround方法:通过筛选仅包含实际视觉接地(visually grounded)问题的数据集进行后训练,剔除具有语言偏倚的样本,从而提升模型对多模态信息的整合能力。实验表明,该方法在使用仅69.1%原始数据的情况下,结合基于强化学习(Reinforcement Learning, RL)的后训练算法,性能最高可提升6.2点,且优于多种复杂后训练策略,凸显高质量数据筛选对推动VLM视频理解能力发展的核心作用。

链接: https://arxiv.org/abs/2604.05117
作者: Yuxuan Zhang,EunJeong Hwang,Huaisong Zhang,Penghui Du,Yiming Jia,Dongfu Jiang,Xuan He,Shenhui Zhang,Ping Nie,Peter West,Kelsey R. Allen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: this http URL.
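The curation rule behind VidGround reduces to a simple filter: drop any question a text-only ("blind") model already answers correctly. The questions and blind model below are invented toys:

```python
def vidground_filter(questions, blind_answer):
    """Keep only questions that genuinely require visual grounding,
    i.e. ones the text-only model gets wrong."""
    return [q for q in questions
            if blind_answer(q["question"]) != q["answer"]]
```

Applied to a post-training set, this kind of filter is what discards the 40-60% of linguistically biased questions the paper identifies.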

[NLP-89] π²: Structure-Originated Reasoning Data Improves Long-Context Reasoning Ability of Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长上下文推理任务中表现不足的问题,尤其是如何有效提升模型对复杂、多跳推理问题的理解与解答能力。其解决方案的关键在于提出了一种名为 π² 的数据构建管道,通过三阶段高质量推理数据的自动化生成:首先从维基百科提取并扩展表格数据;其次基于表格和相关上下文生成具有真实性和多跳特征的问答对,并利用双路径代码执行自动验证答案正确性;最后将结构化的推理步骤回译为自然语言解答,形成带上下文的 QA 对。该方法显著提升了模型在多个长上下文推理基准上的性能,且支持自蒸馏训练,进一步增强了模型自身推理能力。

链接: https://arxiv.org/abs/2604.05114
作者: Quyet V. Do,Thinh Pham,Nguyen Nguyen,Sha Li,Pratibha Zunjare,Tu Vu
机构: Virginia Tech (弗吉尼亚理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Our structured analytical reasoning data, which originates from Wikipedia tables, significantly improves long-context reasoning capability of LLMs

点击查看摘要

Abstract:We study a pipeline that curates reasoning data from initial structured data for improving long-context reasoning in large language models (LLMs). Our approach, π², constructs high-quality reasoning data through rigorous QA curation: 1) extracting and expanding tables from Wikipedia, 2) from the collected tables and relevant context, generating realistic and multi-hop analytical reasoning questions whose answers are automatically determined and verified through dual-path code execution, and 3) back-translating step-by-step structured reasoning traces as solutions of QA pairs given realistic web-search context. Supervised fine-tuning with gpt-oss-20b and Qwen3-4B-Instruct-2507 on π² yields consistent improvements across four long-context reasoning benchmarks and our companion π²-Bench, with average absolute accuracy gains of +4.3% and +2.7% respectively. Notably, our dataset facilitates self-distillation, where gpt-oss-20b even improves its average performance by +4.4% with its own reasoning traces, demonstrating π²'s usefulness. Our code, data, and models are open-source at this https URL.
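Step 2's dual-path verification can be sketched as: a QA pair is accepted only when two independently written code paths produce the same answer. The toy table and the two aggregation paths below are invented:

```python
def dual_path_verify(data, path_a, path_b):
    """Accept the answer only if two independent code paths agree."""
    a, b = path_a(data), path_b(data)
    return (a, True) if a == b else (None, False)

# two independent ways to answer "total population of listed cities"
table = [("Paris", 2.1), ("Lyon", 0.5), ("Nice", 0.3)]

def path_loop(rows):
    total = 0.0
    for _, pop in rows:
        total += pop
    return round(total, 6)

def path_builtin(rows):
    return round(sum(pop for _, pop in rows), 6)
```

Disagreement between the two paths flags either a buggy derivation or an ill-posed question, so only QA pairs with verified answers survive curation.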

[NLP-90] RAG or Learning? Understanding the Limits of LLM Adaptation under Continuous Knowledge Drift in the Real World

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对持续演化的现实世界知识时所出现的连续知识漂移(continuous knowledge drift)问题,即模型因预训练阶段固定的知识快照而难以适应随时间变化的事实、实体与事件,导致推理结果过时或在时间维度上不一致。现有方法如持续微调、知识编辑和检索增强生成(Retrieval-Augmented Generation, RAG)虽尝试更新或补充知识,但缺乏在真实时间演化场景下的系统性评估。论文的关键解决方案是提出一个基于时间戳证据构建的动态事件基准(benchmark of real-world dynamic events),用于量化评估模型在连续知识漂移下的适应能力,并进一步设计了一个无需额外训练的时间感知检索基线方法——Chronos,其通过将检索到的证据逐步组织成事件演化图(Event Evolution Graph),实现更连贯的时间推理,从而缓解灾难性遗忘和时序不一致性问题。

链接: https://arxiv.org/abs/2604.05096
作者: Hanbing Liu,Lang Cao,Yang Li
机构: Tsinghua University(清华大学); University of Illinois Urbana-Champaign(伊利诺伊大学香槟分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) acquire most of their knowledge during pretraining, which ties them to a fixed snapshot of the world and makes adaptation to continuously evolving knowledge challenging. As facts, entities, and events change over time, models may experience continuous knowledge drift, resulting not only in outdated predictions but also in temporally inconsistent reasoning. Although existing approaches, such as continual finetuning, knowledge editing, and retrieval-augmented generation (RAG), aim to update or supplement model knowledge, they are rarely evaluated in settings that reflect chronological, evolving, and real-world knowledge evolution. In this work, we introduce a new benchmark of real-world dynamic events, constructed from time-stamped evidence that captures how knowledge evolves over time, which enables systematic evaluation of model adaptation under continuous knowledge drift. The benchmark reveals that most existing methods, including vanilla RAG and several learning-based approaches, struggle under this setting, exposing critical limitations such as catastrophic forgetting and temporal inconsistency. To mitigate these limitations, we propose a time-aware retrieval baseline, Chronos, which progressively organizes retrieved evidence into an Event Evolution Graph to enable more temporally consistent understanding in LLMs without additional training. Overall, this work provides a foundation for analyzing and advancing LLM adaptation to continuous knowledge drift in realistic settings.
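The Event Evolution Graph can be sketched as per-entity chains of time-stamped snapshots, where each node points to the next update of the same entity (the evidence tuples below are invented toys):

```python
def build_event_graph(evidence):
    """Order time-stamped (timestamp, entity, fact) evidence per entity
    and chain consecutive snapshots; returns entity -> chain head."""
    graph = {}
    tails = {}
    for ts, entity, fact in sorted(evidence):
        node = {"time": ts, "fact": fact, "next": None}
        if entity in tails:
            tails[entity]["next"] = node   # link to previous snapshot
        else:
            graph[entity] = node           # head of the chain
        tails[entity] = node
    return graph

def latest(graph, entity):
    # walk the chain to the most recent snapshot
    node = graph[entity]
    while node["next"]:
        node = node["next"]
    return node["fact"]
```

Keeping the whole chain rather than just the latest fact is what allows temporally consistent answers to both "who is the CEO now?" and "who was the CEO in 2021?".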

[NLP-91] MegaTrain: Full Precision Training of 100B Parameter Large Language Models on a Single GPU

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在单个GPU上进行全精度训练时面临的显存瓶颈问题。传统GPU-centric训练系统受限于GPU显存容量,难以高效训练百亿参数级别的模型。MegaTrain提出了一种以内存为中心(memory-centric)的架构,将模型参数和优化器状态存储于主机内存(CPU内存),而将GPU作为临时计算引擎,通过分层流式加载参数并即时计算梯度,最大限度减少设备端持久状态。其关键创新在于:1)设计了一个流水线化的双缓冲执行引擎,利用多个CUDA流重叠参数预取、计算与梯度卸载过程,实现GPU持续高效运行;2)用无状态层模板替代持久化的自动微分图(autograd graph),动态绑定流式加载的权重,消除冗余图元数据的同时保持调度灵活性。该方案使单张H200 GPU可稳定训练高达120B参数的模型,并显著提升训练吞吐量。

链接: https://arxiv.org/abs/2604.05091
作者: Zhengqing Yuan,Hanchi Sun,Lichao Sun,Yanfang Ye
机构: University of Notre Dame (圣母大学); Lehigh University (利哈伊大学)
类目: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Operating Systems (cs.OS)
备注:

点击查看摘要

Abstract:We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To battle the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. 2) We replace persistent autograd graphs with stateless layer templates, binding weights dynamically as they stream in, eliminating persistent graph metadata while providing flexibility in scheduling. On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters. It also achieves 1.84x the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models. MegaTrain also enables 7B model training with 512k token context on a single GH200.
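The double-buffered schedule can be illustrated as an event log: the next layer's parameters are requested before the current layer finishes computing. Real MegaTrain overlaps these on separate CUDA streams; this sequential sketch only shows the ordering:

```python
def stream_layers(layers):
    """Emit the (load / compute / offload) event order of a
    two-slot streaming schedule over a list of layer names."""
    log = [("load", layers[0])]                    # prefetch first layer
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            log.append(("load", layers[i + 1]))    # prefetch next layer
        log.append(("compute", layer))             # forward/backward pass
        log.append(("offload", layer))             # stream gradients out
    return log
```

In the log, every `("load", L_{i+1})` precedes `("compute", L_i)`, which is the overlap that keeps the GPU busy despite the CPU-GPU bandwidth bottleneck.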

[NLP-92] Multilingual Language Models Encode Script Over Linguistic Structure ACL2026

【速读】: 该论文旨在解决多语言语言模型(Multilingual Language Models, MLMs)中表征组织机制的模糊性问题,即这些模型如何在共享参数空间内对不同类型和书写系统的语言进行编码,其内部表征究竟是由抽象的语言身份(如语法结构)还是表面形式线索(如拼写、字形)主导。解决方案的关键在于采用一种结合定量分析与因果干预的方法:首先使用语言激活概率熵(Language Activation Probability Entropy, LAPE)度量识别出与特定语言强相关的神经单元,并进一步通过稀疏自编码器(Sparse Autoencoders)对激活进行分解;在此基础上,通过扰动实验(如罗马化输入或词序打乱)和生成敏感性测试,发现这些单元主要受表面形式约束,而语言的类型学特征仅在深层逐渐显现,且生成任务最依赖于对表面扰动具有不变性的单元,而非单纯基于语言类型对齐的单元。这一发现揭示了多语言模型以表层形式为组织核心,抽象语言信息是渐进涌现而非统一融合的过程。

链接: https://arxiv.org/abs/2604.05090
作者: Aastha A K Verma,Anwoy Chatterjee,Mehak Gupta,Tanmoy Chakraborty
机构: Indian Institute of Technology Delhi (印度理工学院德里分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at ACL 2026 (Main)

点击查看摘要

Abstract:Multilingual language models (LMs) organize representations for typologically and orthographically diverse languages into a shared parameter space, yet the nature of this internal organization remains elusive. In this work, we investigate which linguistic properties - abstract language identity or surface-form cues - shape multilingual representations. Focusing on compact, distilled models where representational trade-offs are explicit, we analyze language-associated units in Llama-3.2-1B and Gemma-2-2B using the Language Activation Probability Entropy (LAPE) metric, and further decompose activations with Sparse Autoencoders. We find that these units are strongly conditioned on orthography: romanization induces near-disjoint representations that align with neither native-script inputs nor English, while word-order shuffling has limited effect on unit identity. Probing shows that typological structure becomes increasingly accessible in deeper layers, while causal interventions indicate that generation is most sensitive to units that are invariant to surface-form perturbations rather than to units identified by typological alignment alone. Overall, our results suggest that multilingual LMs organize representations around surface form, with linguistic abstraction emerging gradually without collapsing into a unified interlingua.
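A LAPE-style computation can be sketched as follows (a schematic only; the normalization details of the metric as used in the paper may differ): the entropy of a unit's normalized per-language activation probabilities is low when the unit fires mostly for a few languages.

```python
import math

def lape(activation_probs):
    """Entropy of a unit's activation probabilities across languages.

    activation_probs[i] = probability that this unit activates on text in
    language i. Low entropy => a strongly language-associated unit.
    """
    total = sum(activation_probs)
    if total == 0:
        return 0.0
    p = [x / total for x in activation_probs]
    return -sum(pi * math.log(pi) for pi in p if pi > 0)
```

A unit active in only one language scores 0, while a unit equally active in all k languages scores log k; language-associated units are those below some entropy threshold.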

[NLP-93] Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)作为自动评分工具时存在的高成本、对提示设计和语言敏感性以及聚合策略依赖性强等问题,这些问题严重限制了评估结果的可复现性。解决方案的关键在于提出了一种名为OmniScore的互补性、确定性的学习型评分指标家族,其基于小规模(<1B参数)模型开发,能够近似LLM判官的行为,同时保持传统模型评分方法的低延迟与一致性。该方法通过大规模合成监督数据(约564k条实例,覆盖107种语言)进行训练,并在8,617个手工标注实例上进行了验证,实现了跨多种评估场景(如参考文本基准、源文本锚定及混合评估)的可靠多维评分,且在问答、翻译和摘要任务中于6种语言上均表现出优越性能,为前沿LLM提供了一种轻量、确定且可扩展的替代方案。

链接: https://arxiv.org/abs/2604.05083
作者: Firoj Alam,Gagan Bhatia,Sahinur Rahman Laskar,Shammur Absar Chowdhury
机构: Qatar Computing Research Institute, HBKU, Qatar; UPES, India
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While Large Language Models (LLMs) are increasingly adopted as automated judges for evaluating generated text, their outputs are often costly and highly sensitive to prompt design, language, and aggregation strategies, which severely limits reproducibility. To address these challenges, we propose **OmniScore**, a family of complementary, deterministic learned metrics developed using small (<1B-parameter) models. OmniScore approximates LLM-judge behavior while preserving the low latency and consistency of traditional model-based scoring. We trained the models on large-scale synthetic supervision ( \sim 564k instances in 107 languages) and evaluated them using 8,617 manually annotated instances. The OmniScore family supports reliable, multi-dimensional scores across a variety of settings, including reference-based, source-grounded, and hybrid evaluations. We evaluate these models across question answering (QA), translation, and summarization in 6 languages. Our results demonstrate that lightweight, deterministic learned metrics provide a highly practical and scalable alternative to frontier LLMs. Our models and datasets can be found at this https URL

[NLP-94] MMORF: A Multi-agent Framework for Designing Multi-objective Retrosynthesis Planning Systems

【速读】: 该论文旨在解决多目标逆合成规划(multi-objective retrosynthesis planning)问题,即在化学合成路径设计中动态平衡质量、安全性和成本等多个目标。传统方法难以有效整合这些相互冲突的目标,而生成式 AI (Generative AI) 驱动的多智能体系统(Multi-Agent System, MAS)提供了新的解决方案。其关键在于提出 MMORF 框架,通过模块化智能体组件灵活组合与配置,实现对不同系统设计的系统性评估和优化;基于此框架构建的 MASIL 和 RFAS 在新构建的 218 个任务基准上分别在软约束和硬约束场景下显著优于现有基线,验证了该框架的有效性和通用性。

链接: https://arxiv.org/abs/2604.05075
作者: Frazier N. Baker,Trieu Nguyen,Reza Averly,Botao Yu,Daniel Adu-Ampratwum,Huan Sun,Xia Ning
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 36 pages, 1 figure

点击查看摘要

Abstract:Multi-objective retrosynthesis planning is a critical chemistry task requiring dynamic balancing of quality, safety, and cost objectives. Language model-based multi-agent systems (MAS) offer a promising approach for this task: leveraging interactions of specialized agents to incorporate multiple objectives into retrosynthesis planning. We present MMORF, a framework for constructing MAS for multi-objective retrosynthesis planning. MMORF features modular agentic components, which can be flexibly combined and configured into different systems, enabling principled evaluation and comparison of different system designs. Using MMORF, we construct two representative MAS: MASIL and RFAS. On a newly curated benchmark consisting of 218 multi-objective retrosynthesis planning tasks, MASIL achieves strong safety and cost metrics on soft-constraint tasks, frequently Pareto-dominating baseline routes, while RFAS achieves a 48.6% success rate on hard-constraint tasks, outperforming state-of-the-art baselines. Together, these results show the effectiveness of MMORF as a foundational framework for exploring MAS for multi-objective retrosynthesis planning. Code and data are available at this https URL.
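The abstract's claim that MASIL routes frequently Pareto-dominate baselines rests on the standard dominance check over a route's objective vector (e.g., quality, safety, cost). A minimal version, written here for objectives that are all minimized:

```python
def dominates(a, b):
    """True if route a is at least as good as route b on every objective
    and strictly better on at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))
```

For example, a route with (cost, hazard) = (3, 1) dominates one with (4, 1), but neither (3, 1) nor (2, 5) dominates the other, since each wins on a different objective.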

[NLP-95] Memory Dial: A Training Framework for Controllable Memorization in Language Models ACL

【速读】: 该论文旨在解决语言模型中记忆行为难以隔离与控制的问题,即现有方法仅能事后检测模型是否记忆特定内容,却无法区分记忆效应与其他因素(如模型架构、训练数据或优化过程)的影响。为实现对记忆压力的显式调控,作者提出Memory Dial训练框架,其关键在于通过单一参数 α 在标准交叉熵损失与温度锐化目标之间进行插值,从而在保持模型架构和训练设置一致的前提下,系统性地调节模型的记忆强度。实验表明,α 能可靠地单调提升已见样本的准确率而不损害未见样本的泛化性能,且大模型对记忆压力更敏感,高频序列更易被记忆,验证了该框架的有效性和可控性。

链接: https://arxiv.org/abs/2604.05074
作者: Xiangbo Zhang,Ali Emami
机构: Georgia Institute of Technology (佐治亚理工学院); Emory University (埃默里大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL Findings 2026

点击查看摘要

Abstract:Memorization in language models is widely studied but remains difficult to isolate and control. Understanding when and what models memorize is essential for explaining their predictions, yet existing approaches are post-hoc: they can detect memorization in trained models, but cannot disentangle its effects from architecture, data, or optimization. We introduce Memory Dial, a training framework that makes memorization pressure an explicit, controllable variable. Memory Dial interpolates between standard cross-entropy and a temperature-sharpened objective via a single parameter \alpha , producing a family of models identical in architecture and training setup (within each sweep), differing only in memorization pressure. Experiments across six architectures and five benchmarks demonstrate that: (1) \alpha reliably controls memorization pressure, with seen-example accuracy increasing monotonically while unseen accuracy remains stable; (2) larger models are more responsive to memorization pressure; and (3) frequent sequences are easier to memorize than rare ones. Additional analyses show that the effect is robust across a range of sharpening temperatures, differs qualitatively from single-temperature cross-entropy, transfers to multilingual settings, and is detectable even on naturally occurring single-occurrence sequences. Memory Dial provides a controlled experimental framework for studying how memorization behavior emerges and interacts with generalization in language models.
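The single-dial interpolation can be sketched as below. This is a hedged reconstruction: the paper interpolates between cross-entropy and a temperature-sharpened objective via \alpha, but the exact sharpening form and temperature are not specified in the abstract, so `tau` here is an assumed parameter.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional sharpening temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def memory_dial_loss(logits, target, alpha, tau=0.5):
    """Interpolate standard CE (alpha=0) with temperature-sharpened CE
    (alpha=1); alpha acts as the 'memorization pressure' dial."""
    ce = -math.log(softmax(logits)[target])
    ce_sharp = -math.log(softmax(logits, temperature=tau)[target])
    return (1 - alpha) * ce + alpha * ce_sharp
```

Sweeping \alpha while holding architecture, data, and optimization fixed yields the family of models the paper compares.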

[NLP-96] This Treatment Works Right? Evaluating LLM Sensitivity to Patient Question Framing in Medical QA

【速读】: 该论文旨在解决生成式 AI(Generative AI)在医疗问答(Medical Question Answering, QA)中因用户提问措辞差异而导致响应不一致的问题,尤其是在检索增强生成(Retrieval-Augmented Generation, RAG)框架下,即使基于相同临床证据,模型输出仍可能因问题表述方式不同而产生矛盾结论。解决方案的关键在于通过构建一个受控的 RAG 评估环境,使用专家筛选的文档而非自动检索结果,系统性地考察两种患者提问变体:问题框架(正向 vs. 负向)和语言风格(专业术语 vs. 平实语言),从而量化并验证提示词(prompt)敏感性对医学问答一致性的影响。研究发现,问题框架显著影响响应一致性,且多轮对话中该效应被放大,表明提升 LLM 在高风险场景下的提示鲁棒性(phrasing robustness)是 RAG 系统评估的核心指标之一。

链接: https://arxiv.org/abs/2604.05051
作者: Hye Sun Yun,Geetika Kapoor,Michael Mackert,Ramez Kouzy,Wei Xu,Junyi Jessy Li,Byron C. Wallace
机构: Northeastern University (东北大学); UC Berkeley (加州大学伯克利分校); UT Austin (德克萨斯大学奥斯汀分校); UT MD Anderson Cancer Center (德克萨斯大学MD安德森癌症中心); Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 31 pages, 4 tables, 19 figures

点击查看摘要

Abstract:Patients are increasingly turning to large language models (LLMs) with medical questions that are complex and difficult to articulate clearly. However, LLMs are sensitive to prompt phrasings and can be influenced by the way questions are worded. Ideally, LLMs should respond consistently regardless of phrasing, particularly when grounded in the same underlying evidence. We investigate this through a systematic evaluation in a controlled retrieval-augmented generation (RAG) setting for medical question answering (QA), where expert-selected documents are used rather than retrieved automatically. We examine two dimensions of patient query variation: question framing (positive vs. negative) and language style (technical vs. plain language). We construct a dataset of 6,614 query pairs grounded in clinical trial abstracts and evaluate response consistency across eight LLMs. Our findings show that positively- and negatively-framed pairs are significantly more likely to produce contradictory conclusions than same-framing pairs. This framing effect is further amplified in multi-turn conversations, where sustained persuasion increases inconsistency. We find no significant interaction between framing and language style. Our results demonstrate that LLM responses in medical QA can be systematically influenced through query phrasing alone, even when grounded in the same evidence, highlighting the importance of phrasing robustness as an evaluation criterion for RAG-based systems in high-stakes settings.

[NLP-97] Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space

【速读】: 该论文旨在解决传统序列模型中关联记忆容量随序列长度增加而显著下降的问题,尤其是在向量状态模型中,由于超叠加关联的容量退化呈 O(1/\sqrt{n}),导致信息存储与检索效率受限。其解决方案的关键在于提出相位关联记忆(Phase-Associative Memory, PAM),一种全复数域表示的递归序列模型:通过外积方式在矩阵状态 S_t \in \mathbb{C}^{d \times d} 中累积关联,并利用共轭内积 K_t^* \cdot Q_t / \sqrt{d} 实现高效检索。这种设计不仅克服了向量状态下的容量瓶颈,还在 WikiText-103 数据集上以约 1 亿参数达到验证困惑度 30.0,接近相同条件下训练的 Transformer 模型(27.1),证明了复数域原生运算(复数叠加与共轭检索)在语言建模中的有效性与竞争力。

链接: https://arxiv.org/abs/2604.05030
作者: Gowrav Vishwakarma,Christopher J. Agostino
机构: Xavoc Technocrats Pvt. Ltd.(Xavoc技术专家私人有限公司); NPC Worldwide (NPC全球公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: submitting to APS Open Science, 10 pages, 1 figure, code and training logs available at this https URL

点击查看摘要

Abstract:We present Phase-Associative Memory (PAM), a recurrent sequence model in which all representations are complex-valued, associations accumulate in a matrix state S_t \in \mathbb{C}^{d \times d} via outer products, and retrieval operates through the conjugate inner product K_t^* \cdot Q_t / \sqrt{d} . At \sim 100M parameters on WikiText-103, PAM reaches validation perplexity 30.0, within \sim 10% of a matched transformer (27.1) trained under identical conditions, despite 4\times arithmetic overhead from complex computation and no custom kernels. We trace the experimental path from vector-state models, where holographic binding fails due to the O(1/\sqrt{n}) capacity degradation of superposed associations, to the matrix state that resolves it. The competitiveness of an architecture whose native operations are complex-valued superposition and conjugate retrieval is consistent with recent empirical evidence that semantic interpretation in both humans and large language models exhibits non-classical contextuality, and we discuss what this implies for the choice of computational formalism in language modeling.
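The state update and retrieval named in the abstract can be written directly. This is a minimal sketch: outer-product accumulation into S \in \mathbb{C}^{d \times d} and conjugate-inner-product readout with the 1/\sqrt{d} scaling, with everything else in the model (decay, gating, learned projections) omitted.

```python
import numpy as np

d = 4

def write(S, key, value):
    """Bind (key, value) into the state: S + value * key^H (outer product)."""
    return S + np.outer(value, np.conj(key))

def read(S, query):
    """Retrieve via conjugate inner products with the stored keys."""
    return S @ query / np.sqrt(d)

S = np.zeros((d, d), dtype=np.complex128)
k = np.eye(d)[0].astype(np.complex128)                    # a unit key
v = np.array([1 + 1j, 0, 2j, -1], dtype=np.complex128)    # a complex value
S = write(S, k, v)
# reading with the same key recovers the value up to the 1/sqrt(d) scale
recovered = read(S, k) * np.sqrt(d)
```

With (near-)orthogonal keys, each stored value is recovered cleanly; the matrix state is what avoids the O(1/\sqrt{n}) degradation of superposing associations in a single vector.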

[NLP-98] EduIllustrate: Towards Scalable Automated Generation Of Multimodal Educational Content

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在教育场景中生成多媒体教学内容能力评估的空白问题,特别是针对K-12 STEM学科中融合文本与图表的连贯解释生成任务。现有研究多聚焦于问答和辅导功能,缺乏对模型生成几何准确、逻辑清晰且图文协同的多模态教学内容的系统评测工具。解决方案的关键在于提出EduIllustrate基准测试体系,其核心包括:(1)涵盖五个学科、三个年级共230道题目的标准化数据集;(2)通过顺序锚定(sequential anchoring)机制确保跨图表视觉一致性;(3)基于多媒体学习理论构建的8维评价量规,兼顾文本与视觉质量。实证表明,该方案能有效量化LLMs在复杂图文混排教学内容生成中的性能差异,并验证了顺序锚定策略在提升视觉一致性的同时显著降低计算成本。

链接: https://arxiv.org/abs/2604.05005
作者: Shuzhen Bi,Mingzi Zhang,Zhuoxuan Li,Xiaolong Wang,keqian Li,Aimin Zhou
机构: Shanghai Innovation Institute (上海创新研究院); University of Science and Technology of China (中国科学技术大学); East China Normal University (华东师范大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models are increasingly used as educational assistants, yet evaluation of their educational capabilities remains concentrated on question-answering and tutoring tasks. A critical gap exists for multimedia instructional content generation – the ability to produce coherent, diagram-rich explanations that combine geometrically accurate visuals with step-by-step reasoning. We present EduIllustrate, a benchmark for evaluating LLMs on interleaved text-diagram explanation generation for K-12 STEM problems. The benchmark comprises 230 problems spanning five subjects and three grade levels, a standardized generation protocol with sequential anchoring to enforce cross-diagram visual consistency, and an 8-dimension evaluation rubric grounded in multimedia learning theory covering both text and visual quality. Evaluation of ten LLMs reveals a wide performance spread: Gemini 3.0 Pro Preview leads at 87.8%, while Kimi-K2.5 achieves the best cost-efficiency (80.8% at \$0.12/problem). Workflow ablation confirms sequential anchoring improves Visual Consistency by 13% at 94% lower cost. Human evaluation with 20 expert raters validates LLM-as-judge reliability for objective dimensions ( \rho \geq 0.83 ) while revealing limitations on subjective visual assessment.

[NLP-99] Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在回答多项选择题(Multiple-Choice Questions, MCQs)时对干扰项(plausible distractors)敏感的问题,这种敏感性会导致模型偏好不稳定,表现为在正确答案与干扰项之间出现不一致的推理震荡。解决方案的关键在于提出一种名为“思维包含”(Inclusion-of-Thoughts, IoT)的渐进式自过滤策略:通过仅保留题目中合理的选项来重构MCQ,从而在受控环境中进行对比判断,降低认知负荷并提升模型内部推理的稳定性;同时,该过程显式记录筛选步骤,增强了决策的可解释性与透明度。实证结果表明,IoT在多个算术、常识推理和教育基准测试中显著提升了链式思维(Chain-of-Thought)的表现,且计算开销极低。

链接: https://arxiv.org/abs/2604.04944
作者: Mohammad Reza Ghasemi Madani,Soyeon Caren Han,Shuo Yang,Jey Han Lau
机构: The University of Melbourne (墨尔本大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multiple-choice questions (MCQs) are widely used to evaluate large language models (LLMs). However, LLMs remain vulnerable to the presence of plausible distractors. This often diverts attention toward irrelevant choices, resulting in unstable oscillation between correct and incorrect answers. In this paper, we propose Inclusion-of-Thoughts (IoT), a progressive self-filtering strategy that is designed to mitigate this cognitive load (i.e., instability of model preferences under the presence of distractors) and enable the model to focus more effectively on plausible answers. Our method operates to reconstruct the MCQ using only plausible option choices, providing a controlled setting for examining comparative judgements and therefore the stability of the model’s internal reasoning under perturbation. By explicitly documenting this filtering process, IoT also enhances the transparency and interpretability of the model’s decision-making. Extensive empirical evaluation demonstrates that IoT substantially boosts chain-of-thought performance across a range of arithmetic, commonsense reasoning, and educational benchmarks with minimal computational overhead.

[NLP-100] The Illusion of Latent Generalization: Bi-directionality and the Reversal Curse ICLR2026

【速读】: 该论文旨在解决自回归语言模型在处理事实反转(如从“A B”训练但无法正确预测“B A”)时出现的“反转诅咒”(reversal curse)问题。其解决方案的关键在于引入具有双向监督的目标函数,特别是对比了标准掩码语言建模(masked language modeling, MLM)与仅解码器架构下的掩码重建训练方式,发现提升反转准确性的核心机制并非依赖单一的方向无关表征,而是要求训练信号明确将源实体设为预测目标;进一步分析表明,MLM和解码器掩码训练分别以不同的索引几何结构存储正向与反向事实,说明改进行为可能源于对不同方向信息的独立编码,而非隐式泛化能力的增强。

链接: https://arxiv.org/abs/2604.04943
作者: Julian Coda-Forno,Jane X. Wang,Arslan Chaudhry
机构: TUM, Helmholtz Munich; Google DeepMind
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ICLR 2026 Workshop on Representational Alignment (Re-Align)

点击查看摘要

Abstract:The reversal curse describes a failure of autoregressive language models to retrieve a fact in reverse order (e.g., training on "A B" but failing on "B A"). Recent work shows that objectives with bidirectional supervision (e.g., bidirectional attention or masking-based reconstruction for decoder-only models) can mitigate the reversal curse. We extend this evaluation to include a vanilla masked language modeling (MLM) objective and compare it to decoder-only masking-based training across four reversal benchmarks, and then provide a minimal mechanistic study of *how* these objectives succeed. We show that reversal accuracy requires training signal that explicitly makes the source entity a prediction target, and we find little evidence that success corresponds to a single direction-agnostic representation of a fact. Instead, representation distances and linear probes are consistent with storing forward and reverse directions as distinct entries, with different indexing geometry for MLM versus decoder-only masking-based training. Our results caution that objective-level "fixes" can improve reversal behavior without necessarily inducing the kind of latent generalization one might expect from a unified concept.

[NLP-101] TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理能力上的局限性问题,特别是单轮链式思维(Chain-of-Thought, CoT)方法因逻辑断层导致推理质量不足,而多轮推理范式(如思维图谱 Graph-of-Thoughts, GoT;思维树 Tree-of-Thoughts, ToT;原子思维 Atom-of-Thought, AoT)虽具更强结构化推理能力但计算开销过高这一矛盾。解决方案的关键在于提出一种基于拓扑结构的优化方法:通过持久同调(persistent homology)将CoT、ToT和GoT映射到统一的拓扑空间中,量化其推理结构特征,并设计一个拓扑优化代理(Topological Optimization Agent),能够诊断CoT推理链中偏离理想拓扑特性的偏差,并生成针对性修复策略,从而在保持单轮生成效率的同时实现多轮推理的结构智能,显著提升推理准确性与实用性。

链接: https://arxiv.org/abs/2604.04942
作者: Jiaquan Zhang,Qigan Sun,Chaoning Zhang,Xudong Wang,Zhenzhen Huang,Yitian Zhou,Pengcheng Zheng,Chi-lok Andy Tai,Sung-Ho Bae,Zeyu Ma,Caiyan Qin,Jinyu Guo,Yang Yang,Hengtao Shen
机构: University of Electronic Science and Technology of China (电子科技大学); The Hong Kong Polytechnic University (香港理工大学); Harbin Institute of Technology (哈尔滨工业大学); Kyung Hee University (中央大学); Tongji University (同济大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 4 figures

点击查看摘要

Abstract:Enhancing the reasoning capability of large language models (LLMs) remains a core challenge in natural language processing. The Chain-of-Thought (CoT) paradigm dominates practical applications for its single-round efficiency, yet its reasoning chains often exhibit logical gaps. While multi-round paradigms like Graph-of-Thoughts (GoT), Tree-of-Thoughts (ToT), and Atom-of-Thought (AoT) achieve strong performance and reveal effective reasoning structures, their high cost limits practical use. To address this problem, this paper proposes a topology-based method for optimizing reasoning chains. The framework embeds essential topological patterns of effective reasoning into the lightweight CoT paradigm. Using persistent homology, we map CoT, ToT, and GoT into a unified topological space to quantify their structural features. On this basis, we design a unified optimization system: a Topological Optimization Agent diagnoses deviations in CoT chains from desirable topological characteristics and simultaneously generates targeted strategies to repair these structural deficiencies. Compared with multi-round reasoning methods like ToT and GoT, experiments on multiple datasets show that our approach offers a superior balance between reasoning accuracy and efficiency, showcasing a practical solution to "single-round generation with multi-round intelligence".

[NLP-102] GenomeQA: Benchmarking General Large Language Models for Genome Sequence Understanding

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在基因组学领域中对原始DNA序列进行推理时缺乏系统性评估的问题。现有基准测试或专注于特定DNA建模任务,或仅使用文本形式的问题来评估生物知识理解能力,未能充分考察通用LLMs直接处理原始基因组序列的能力。为应对这一挑战,作者提出了GenomeQA,这是一个针对通用LLMs的控制性评估基准,涵盖六类关键基因组推断任务(如增强子识别、剪接位点预测、转录因子结合位点预测等),并包含5,200个来自多个生物数据库的样本,序列长度范围从6到1,000碱基对。其关键创新在于构建了一个结构化、多样化的基准数据集,能够诊断LLMs在利用局部序列信号(如GC含量和短motif)方面的表现,并揭示其在需要多步推理或间接模式识别任务中的局限性,从而为改进通用LLMs在基因组学应用中的性能提供可量化依据。

链接: https://arxiv.org/abs/2604.05774
作者: Weicai Long,Yusen Hou,Junning Feng,Houcheng Su,Shuo Yang,Donglin Xie,Yanlin Zhang
机构: Hong Kong University of Science and Technology (Guangzhou); The University of Hong Kong; Peking University
类目: Genomics (q-bio.GN); Computation and Language (cs.CL)
备注: 18 pages, 9 figures, coference

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly adopted as conversational assistants in genomics, where they are mainly used to reason over biological knowledge, annotations, and analysis outputs through natural language interfaces. However, existing benchmarks either focus on specialized DNA models trained for sequence prediction or evaluate biological knowledge using text-only questions, leaving the behavior of general-purpose LLMs when directly exposed to raw genome sequences underexplored. We introduce GenomeQA, a benchmark designed to provide a controlled evaluation setting for general-purpose LLMs on sequence-based genome inference tasks. GenomeQA comprises 5,200 samples drawn from multiple biological databases, with sequence lengths ranging from 6 to 1,000 base pairs (bp), spanning six task families: Enhancer and Promoter Identification, Splice Site Identification, Taxonomic Classification, Histone Mark Prediction, Transcription Factor Binding Site Prediction, and TF Motif Prediction. Across six frontier LLMs, we find that models consistently outperform random baselines and can exploit local sequence signals such as GC content and short motifs, while performance degrades on tasks that require more indirect or multi-step inference over sequence patterns. GenomeQA establishes a diagnostic benchmark for studying and improving the use of general-purpose LLMs on raw genomic sequences.
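GC content, one of the local sequence signals the abstract says models exploit, is straightforward to compute from a raw DNA string:

```python
def gc_content(seq):
    """Fraction of G/C bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
```

For instance, gc_content("ATGC") is 0.5, since two of the four bases are G or C.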

信息检索

[IR-0] Data Not Model: Explaining Bias toward LLM Texts in Neural Retrievers

【速读】:该论文旨在解决神经检索模型(neural retrievers)中存在的源偏倚(source bias)问题,即模型倾向于偏好大语言模型(LLM)生成的文本而非人类撰写的文本,即使两者语义相似。研究表明,这种偏倚并非源于模型结构本身,而是由训练数据中的非语义特征差异(如流畅性、术语特异性)所导致,这些差异在嵌入空间中与人类文本和LLM生成文本之间的分布偏差一致。解决方案的关键在于:一是通过减少训练数据中非语义特征的差异来缓解偏倚;二是通过对LLM文本向量进行去偏处理(移除其在偏倚方向上的投影),从而有效降低源偏倚,提升信息检索系统的公平性与可靠性。

链接: https://arxiv.org/abs/2604.06163
作者: Wei Huang,Keping Bi,Yinqiong Cai,Wei Chen,Jiafeng Guo,Xueqi Cheng
机构: Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Baidu Inc. (百度公司)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Recent studies show that neural retrievers often display source bias, favoring passages generated by LLMs over human-written ones, even when both are semantically similar. This bias has been considered an inherent flaw of retrievers, raising concerns about the fairness and reliability of modern information access systems. Our work challenges this view by showing that source bias stems from supervision in retrieval datasets rather than the models themselves. We found that non-semantic differences, like fluency and term specificity, exist between positive and negative documents, mirroring differences between LLM and human texts. In the embedding space, the bias direction from negatives to positives aligns with the direction from human-written to LLM-generated texts. We theoretically show that retrievers inevitably absorb the artifact imbalances in the training data during contrastive learning, which leads to their preferences over LLM texts. To mitigate the effect, we propose two approaches: 1) reducing artifact differences in training data and 2) adjusting LLM text vectors by removing their projection on the bias vector. Both methods substantially reduce source bias. We hope our study alleviates some concerns regarding LLM-generated texts in information access systems.
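The second mitigation (adjusting LLM text vectors by removing their projection on the bias vector) amounts to a one-line linear-algebra operation. A sketch, assuming the bias direction is estimated as the normalized mean difference between LLM-text and human-text embeddings:

```python
import numpy as np

def debias(vectors, bias_direction):
    """Remove each row's component along the (normalized) bias direction."""
    b = bias_direction / np.linalg.norm(bias_direction)
    return vectors - np.outer(vectors @ b, b)
```

After this projection removal, embeddings carry no signal along the human-to-LLM direction, so scoring can no longer reward text merely for lying on the LLM side of it.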

[IR-1] JUÁ - A Benchmark for Information Retrieval in Brazilian Legal Text Collections

【速读】:该论文旨在解决葡萄牙语法律信息检索(Legal Information Retrieval, LIR)评估系统性不足的问题,主要原因是现有数据集在文档类型、查询风格和相关性定义上差异显著,导致跨研究比较困难。解决方案的关键在于提出一个名为 JUÁ 的公开基准测试平台,其核心特征包括统一的评估协议、通用的排序指标、固定的数据划分(如适用)以及公共排行榜,从而实现对巴西法律信息检索任务的可复现与可比性评估。该基准覆盖判例检索、立法与监管文本检索及问答驱动的法律搜索,通过对比词法、密集向量和基于 BM25 的重排序流水线,验证了其在区分不同检索范式和揭示跨数据集权衡方面的有效性,尤其在领域自适应嵌入模型(如微调后的 Qwen)上表现突出,为多法律领域下的法律检索研究提供了统一且实用的评估框架。

链接: https://arxiv.org/abs/2604.06098
作者: Jayr Pereira,Leandro Fernandes,Erick de Brito,Roberto Lotufo,Luiz Bonifacio
机构: UFCA (Federal University of Cariri); Neuralmind (Neuralmind AI)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Legal information retrieval in Portuguese remains difficult to evaluate systematically because available datasets differ widely in document type, query style, and relevance definition. We present JUÁ, a public benchmark for Brazilian legal retrieval designed to support more reproducible and comparable evaluation across heterogeneous legal collections. More broadly, JUÁ is intended not only as a benchmark, but as a continuous evaluation infrastructure for Brazilian legal IR, combining shared protocols, common ranking metrics, fixed splits when applicable, and a public leaderboard. The benchmark covers jurisprudence retrieval as well as broader legislative, regulatory, and question-driven legal search. We evaluate lexical, dense, and BM25-based reranking pipelines, including a domain-adapted Qwen embedding model fine-tuned on JUÁ-aligned supervision. Results show that the benchmark is sufficiently heterogeneous to distinguish retrieval paradigms and reveal substantial cross-dataset trade-offs. Domain adaptation yields its clearest gains on the supervision-aligned JUÁ-Juris subset, while BM25 remains highly competitive on other collections, especially in settings with strong lexical and institutional phrasing cues. Overall, JUÁ provides a practical evaluation framework for studying legal retrieval across multiple Brazilian legal domains under a common benchmark design.

[IR-2] Masking or Mitigating? Deconstructing the Impact of Query Rewriting on Retriever Biases in RAG ACL’26

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中密集检索器(dense retrievers)存在的系统性偏差问题,包括简洁性偏差(brevity bias)、位置偏差(position bias)、字面匹配偏差(literal matching bias)和重复偏差(repetition bias),这些偏差会显著降低检索质量。解决方案的关键在于对五种查询增强(query enhancement)技术进行系统性评估,发现基于大语言模型(LLM)的简单查询重写方法在整体上能实现最强的偏差减少(54%),但其在多重偏差共存的对抗条件下失效;进一步机制分析揭示两类不同作用路径:一类通过增加得分方差来削弱偏差,另一类则通过伪文档生成实现与偏倚诱导特征的真正解相关。研究还提出一个分类框架,区分查询-文档交互偏差与文档编码偏差,明确了仅从查询侧干预无法全面解决RAG系统的偏倚问题。

链接: https://arxiv.org/abs/2604.06097
作者: Agam Goyal,Koyel Mukherjee,Apoorv Saxena,Anirudh Phukan,Eshwar Chandrasekharan,Hari Sundaram
机构: Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校西贝尔计算机与数据科学学院); Adobe Research (Adobe研究实验室); Inception Labs ( inception 实验室); Indian Institute of Science (印度理工学院)
类目: Information Retrieval (cs.IR)
备注: ACL’26: 13 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Dense retrievers in retrieval-augmented generation (RAG) systems exhibit systematic biases – including brevity, position, literal matching, and repetition biases – that can compromise retrieval quality. Query rewriting techniques are now standard in RAG pipelines, yet their impact on these biases remains unexplored. We present the first systematic study of how query enhancement techniques affect dense retrieval biases, evaluating five methods across six retrievers. Our findings reveal that simple LLM-based rewriting achieves the strongest aggregate bias reduction (54%), yet fails under adversarial conditions where multiple biases combine. Mechanistic analysis uncovers two distinct mechanisms: simple rewriting reduces bias through increased score variance, while pseudo-document generation methods achieve reduction through genuine decorrelation from bias-inducing features. However, no technique uniformly addresses all biases, and effects vary substantially across retrievers. Our results provide practical guidance for selecting query enhancement strategies based on specific bias vulnerabilities. More broadly, we establish a taxonomy distinguishing query-document interaction biases from document encoding biases, clarifying the limits of query-side interventions for debiasing RAG systems.
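The "decorrelation from bias-inducing features" in the mechanistic analysis can be operationalized as a correlation check between retrieval scores and a bias feature (e.g., document length for brevity bias). This sketch uses Pearson correlation as one such measure, which may differ from the paper's exact statistic:

```python
def pearson(xs, ys):
    """Pearson correlation; |r| near 0 after query rewriting suggests the
    retrieval scores no longer track the bias-inducing feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0
```

Comparing this correlation before and after each enhancement technique separates genuine decorrelation from mechanisms that merely inflate score variance.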

[IR-3] A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models

【速读】:该论文旨在解决大规模临床信息抽取中缺乏可扩展且可信的验证方法的问题,尤其是在使用大语言模型(Large Language Models, LLMs)从非结构化健康记录中提取临床信息时,传统评估方法依赖于密集的人工标注或不完整的结构化数据,难以在人群规模上实施。其解决方案的关键在于提出一种多阶段弱监督验证框架,通过提示校准、基于规则的合理性过滤、语义锚定评估、使用高容量判别型LLM进行针对性确认评估、选择性专家审查以及外部预测效度分析等步骤,在无需全面人工标注的前提下量化不确定性并刻画错误模式,从而实现对LLM提取结果的严谨评估与可信部署。

链接: https://arxiv.org/abs/2604.06028
作者: Maria Mahbub,Gregory M. Dams,Josh Arnold,Caitlin Rizy,Sudarshan Srinivasan,Elliot M. Fielstein,Minu A. Aghevli,Kamonica L. Craig,Elizabeth M. Oliva,Joseph Erdos,Jodie Trafton,Ioana Danciu
机构: Oak Ridge National Laboratory (橡树岭国家实验室); Program Evaluation and Resource Center, Office of Mental Health and Office of Suicide Prevention, Department of Veterans Affairs, Menlo Park, CA, USA; Vanderbilt University Medical Center (范德比尔特大学医学中心); VA Maryland Health Care System (VA马里兰医疗保健系统); VA Desert Pacific Healthcare Network (VA沙漠太平洋医疗网络); VA Connecticut Health Care System (VA康涅狄格医疗系统); Yale School of Medicine (耶鲁医学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large language models (LLMs) show promise for extracting clinically meaningful information from unstructured health records, yet their translation into real-world settings is constrained by the lack of scalable and trustworthy validation approaches. Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale. We propose a multi-stage validation framework for LLM-based clinical information extraction that enables rigorous assessment under weak supervision. The framework integrates prompt calibration, rule-based plausibility filtering, semantic grounding assessment, targeted confirmatory evaluation using an independent higher-capacity judge LLM, selective expert review, and external predictive validity analysis to quantify uncertainty and characterize error modes without exhaustive manual annotation. We applied this framework to extraction of substance use disorder (SUD) diagnoses across 11 substance categories from 919,783 clinical notes. Rule-based filtering and semantic grounding removed 14.59% of LLM-positive extractions that were unsupported, irrelevant, or structurally implausible. For high-uncertainty cases, the judge LLM’s assessments showed substantial agreement with subject matter expert review (Gwet’s AC1=0.80). Using judge-evaluated outputs as references, the primary LLM achieved an F1 score of 0.80 under relaxed matching criteria. LLM-extracted SUD diagnoses also predicted subsequent engagement in SUD specialty care more accurately than structured-data baselines (AUC=0.80). These findings demonstrate that scalable, trustworthy deployment of LLM-based clinical information extraction is feasible without annotation-intensive evaluation.
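The staged filtering idea (rule-based plausibility checks, then grounding checks, each discarding unsupported extractions) can be skeletonized as a sequence of predicates. The stages below are toy stand-ins, not the paper's actual clinical rules:

```python
def run_stages(extractions, stages):
    """Apply each validation stage in order, keeping only survivors, and
    record how many items each stage removed."""
    removed = []
    kept = list(extractions)
    for stage in stages:
        survivors = [e for e in kept if stage(e)]
        removed.append(len(kept) - len(survivors))
        kept = survivors
    return kept, removed
```

The per-stage removal counts are what allow the framework to report figures like the 14.59% of LLM-positive extractions dropped by filtering and grounding.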

[IR-4] Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching IJCNN-2026

【速读】:该论文旨在解决学术会议审稿人推荐中因投稿量激增而导致的匹配精度不足问题,现有方法多依赖“论文到论文”(Paper-to-Paper)匹配范式,仅通过审稿人过往论文的文本相似性隐式表征其专业能力,难以捕捉多维专家知识。解决方案的关键在于提出一种无需训练的框架P2R,将隐式匹配转向显式基于结构化个人资料的匹配:利用通用大语言模型(LLM)为投稿和审稿人分别构建包含主题(Topics)、方法论(Methodologies)和应用场景(Applications)三维度的结构化专家画像,并采用粗粒度到细粒度的两阶段流程——先通过语义与细粒度特征融合的混合检索生成高召回候选池,再由LLM组成的评审委员会依据严格标准进行深度评估,综合多维专家意见与区域主席(Area Chair)视角实现精准匹配。

链接: https://arxiv.org/abs/2604.05866
作者: Yicheng Pan,Zhiyuan Ning,Ludi Wang,Yi Du
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注: Accepted by IJCNN-2026

点击查看摘要

Abstract:As conference submission volumes continue to grow, accurately recommending suitable reviewers has become a challenge. Most existing methods follow a "Paper-to-Paper" matching paradigm, implicitly representing a reviewer by their publication history. However, effective reviewer matching requires capturing multi-dimensional expertise, and textual similarity to past papers alone is often insufficient. To address this gap, we propose P2R, a training-free framework that shifts from implicit paper-to-paper matching to explicit profile-based matching. P2R uses general-purpose LLMs to construct structured profiles for both submissions and reviewers, disentangling them into Topics, Methodologies, and Applications. Building on these profiles, P2R adopts a coarse-to-fine pipeline to balance efficiency and depth. It first performs hybrid retrieval that combines semantic and aspect-level signals to form a high-recall candidate pool, and then applies an LLM-based committee to evaluate candidates under strict rubrics, integrating both multi-dimensional expert views and a holistic Area Chair perspective. Experiments on NeurIPS, SIGIR, and SciRepEval show that P2R consistently outperforms state-of-the-art baselines. Ablation studies further verify the necessity of each component. Overall, P2R highlights the value of explicit, structured expertise modeling and offers practical guidance for applying LLMs to reviewer matching.
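P2R 粗排阶段"语义信号 + 维度信号混合检索"的打分融合,可以用如下纯 Python 片段示意(线性融合权重 alpha、字段名与样例数据均为演示用假设):

```python
def hybrid_score(semantic_sim, aspect_sims, alpha=0.6):
    """语义相似度与 Topics/Methodologies/Applications 三个
    维度相似度的线性融合;权重 alpha 为演示用假设。"""
    return alpha * semantic_sim + (1 - alpha) * sum(aspect_sims) / len(aspect_sims)

def candidate_pool(reviewers, k=2):
    """粗排:按融合分取 top-k 审稿人,形成高召回候选池,
    再交给 LLM 评审委员会做细粒度评估。"""
    ranked = sorted(reviewers,
                    key=lambda r: hybrid_score(r["sem"], r["aspects"]),
                    reverse=True)
    return [r["name"] for r in ranked[:k]]
```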

[IR-5] CLEAR: Cross-Lingual Enhancement in Alignment via Reverse-training ACL2026

【速读】:该论文旨在解决多语言嵌入模型在跨语言检索场景中因语言资源分布不均及训练过程中对跨语言对齐关注不足而导致的性能瓶颈问题,尤其在英语等高资源语言上可能出现性能下降。其解决方案的关键在于提出一种名为CLEAR(Cross-Lingual Enhancement in Retrieval via Reverse-training)的新损失函数,该方法采用反向训练机制,利用英文段落作为桥梁强化目标语言与英语之间的对齐关系,从而提升跨语言检索的整体鲁棒性与效果,尤其在低资源语言上表现显著,同时最小化对英语等已良好对齐语言的性能损害。

链接: https://arxiv.org/abs/2604.05821
作者: Seungyoon Lee,Minhyuk Kim,Seongtae Hong,Youngjoon Jang,Dongsuk Oh,Heuiseok Lim
机构: Korea University (韩国大学); Kyungpook National University (庆北国立大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: ACL2026 Main

点击查看摘要

Abstract:Existing multilingual embedding models often encounter challenges in cross-lingual scenarios due to imbalanced linguistic resources and less consideration of cross-lingual alignment during training. Although standardized contrastive learning approaches for cross-lingual adaptation are widely adopted, they may struggle to capture fundamental alignment between languages and degrade performance in well-aligned languages such as English. To address these challenges, we propose Cross-Lingual Enhancement in Retrieval via Reverse-training (CLEAR), a novel loss function utilizing a reverse training scheme to improve retrieval performance across diverse cross-lingual retrieval scenarios. CLEAR leverages an English passage as a bridge to strengthen alignments between the target language and English, ensuring robust performance in the cross-lingual retrieval task. Our extensive experiments demonstrate that CLEAR achieves notable improvements in cross-lingual scenarios, with gains up to 15%, particularly in low-resource languages, while minimizing performance degradation in English. Furthermore, our findings highlight that CLEAR offers promising effectiveness even in multilingual training, suggesting its potential for broad application and scalability. We release the code at this https URL.
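CLEAR 以英文段落为桥梁强化对齐的思路,可以借一个带温度的 InfoNCE 损失直观理解(下面是纯 Python 的示意实现;温度 tau、向量维度与配对方式均为演示假设,并非论文损失函数的逐项复刻):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def bridge_infonce(query_emb, en_bridge_emb, negative_embs, tau=0.05):
    """以英文段落为正例"桥梁"的 InfoNCE 损失:拉近目标语查询
    与对应英文段落,推远批内负例。tau 与配对方式为演示假设。"""
    pos = math.exp(cosine(query_emb, en_bridge_emb) / tau)
    denom = pos + sum(math.exp(cosine(query_emb, n) / tau)
                      for n in negative_embs)
    return -math.log(pos / denom)
```

对齐越好损失越小,与论文"最小化对已对齐语言的性能损害"的目标一致。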

[IR-6] WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering ACL2026

【速读】:该论文旨在解决当前多模态检索增强生成(Multi-modal Retrieval-Augmented Generation, RAG)在知识驱动型视觉问答(Knowledge-Based Visual Question Answering, KB-VQA)任务中依赖图像作为唯一检索键、且未充分挖掘视觉语言模型(Vision-Language Models, VLMs)潜力的问题。解决方案的关键在于提出一种名为WikiSeeker的新颖多模态RAG框架,其核心创新是重新定义VLM的角色:将其拆分为两个专用代理——Refiner(精炼器)和Inspector(检查员)。Refiner利用VLM对输入图像的理解能力重写文本查询,显著提升多模态检索器的准确性;Inspector则通过选择性路由机制,在可靠检索结果存在时将上下文传递给另一个大语言模型(LLM)生成答案,而在检索不可靠时则直接调用VLM内部知识进行回答,从而实现解耦式生成策略。

链接: https://arxiv.org/abs/2604.05818
作者: Yingjian Zhu,Xinming Wang,Kun Ding,Ying Wang,Bin Fan,Shiming Xiang
机构: University of Chinese Academy of Sciences (中国科学院大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); University of Science and Technology Beijing (北京科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted by ACL 2026 Findings

点击查看摘要

Abstract:Multi-modal Retrieval-Augmented Generation (RAG) has emerged as a highly effective paradigm for Knowledge-Based Visual Question Answering (KB-VQA). Despite recent advancements, prevailing methods still primarily depend on images as the retrieval key, and often overlook or misplace the role of Vision-Language Models (VLMs), thereby failing to leverage their potential fully. In this paper, we introduce WikiSeeker, a novel multi-modal RAG framework that bridges these gaps by proposing a multi-modal retriever and redefining the role of VLMs. Rather than serving merely as answer generators, we assign VLMs two specialized agents: a Refiner and an Inspector. The Refiner utilizes the capability of VLMs to rewrite the textual query according to the input image, significantly improving the performance of the multimodal retriever. The Inspector facilitates a decoupled generation strategy by selectively routing reliable retrieved context to another LLM for answer generation, while relying on the VLM’s internal knowledge when retrieval is unreliable. Extensive experiments on EVQA, InfoSeek, and M2KR demonstrate that WikiSeeker achieves state-of-the-art performance, with substantial improvements in both retrieval accuracy and answer quality. Our code will be released on this https URL.
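WikiSeeker 中 Inspector 的"选择性路由"本质上是按检索可靠度做一次分支判断,可示意如下(阈值与返回标签均为演示用假设):

```python
def inspector_route(retrieval_confidence, threshold=0.5):
    """检查员(Inspector)的选择性路由:检索可靠时把上下文交给
    LLM 生成答案,否则回退到 VLM 内部知识。阈值为演示用假设。"""
    if retrieval_confidence >= threshold:
        return "llm_with_retrieved_context"
    return "vlm_internal_knowledge"
```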

[IR-7] The LLM Effect on IR Benchmarks: A Meta-Analysis of Effectiveness Baselines and Contamination SIGIR2026

【速读】:该论文旨在解决信息检索(Information Retrieval, IR)领域中基准测试结果难以累积进步的问题,特别是由于基线模型过弱或过时导致的有效性提升无法持续。其关键解决方案是系统性地分析两个主流IR基准——TREC Robust04和TREC Deep Learning 2020(DL20)上近143篇文献的结果,识别近年来引入大语言模型(Large Language Models, LLMs)组件的系统是否带来了真实的方法论改进。研究发现,LLM相关系统在DL20上相较TREC 2020最佳结果提升了8.8%的nDCG@10,在Robust04上自2023年起提升约20%,但进一步采用数据污染检测方法进行重排序后,发现两个基准均存在显著的数据污染现象;排除污染主题虽降低效果,但置信区间较宽,因而难以判断LLM带来的提升究竟是源于方法创新还是预训练数据的记忆效应。

链接: https://arxiv.org/abs/2604.05766
作者: Moritz Staudinger,Wojciech Kusa,Allan Hanbury
机构: 未知
类目: Information Retrieval (cs.IR)
备注: Accepted at SIGIR 2026

点击查看摘要

Abstract:Benchmark collections have long enabled controlled comparison and cumulative progress in Information Retrieval (IR). However, prior meta-analyses have shown that reported effectiveness gains often fail to accumulate, in part due to the use of weak or outdated baselines. While large language models are increasingly used in retrieval pipelines, their impact on established IR benchmarks has not been systematically analyzed. In this study, we analyze 143 publications reporting results on the TREC Robust04 collection and the TREC Deep Learning 2020 (DL20) passage retrieval benchmark to examine longitudinal trends in retrieval effectiveness and baseline strength. We observe what we term an "LLM effect": recent systems incorporating LLM components achieve 8.8% higher nDCG@10 on DL20 compared to the best result from TREC 2020 and approximately 20% higher on Robust04 since 2023. However, adapting a data contamination detection approach to reranking reveals measurable contamination in both benchmarks. While excluding contaminated topics reduces effectiveness, confidence intervals remain wide, making it difficult to determine whether the LLM effect reflects genuine methodological advances or memorization from pretraining data.
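论文用 nDCG@10 对比 LLM 系统与 TREC 2020 最优结果,其标准计算方式如下(纯 Python 实现,相关性分级样例为演示假设):

```python
import math

def dcg(rels):
    # 标准折损累计增益:第 i 位(从 0 计)折损 log2(i + 2)
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k=10):
    """nDCG@k:排序结果的 DCG 除以理想排序的 DCG。"""
    ideal = dcg(sorted(ranked_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal if ideal > 0 else 0.0
```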

[IR-8] Generative Retrieval Overcomes Limitations of Dense Retrieval but Struggles with Identifier Ambiguity

【速读】:该论文旨在解决嵌入式检索(embedding-based retrieval)在某些场景下存在理论局限性且性能显著落后于传统稀疏检索模型(如BM25)的问题。为此,作者提出以生成式检索(generative retrieval)作为替代方案,并通过一个名为LIMIT的合成数据集验证其有效性。解决方案的关键在于:利用语言模型直接预测查询与文档的相关性,而非依赖向量空间中的相似度计算;实验表明,在原始LIMIT数据集上,无需额外训练即可实现最优性能(如SEAL和MINDER分别达到0.92和0.99 R@2),显著优于密集检索方法(0.03 R@2)和BM25(0.86 R@2)。然而,当引入硬负样本后,所有模型性能均下降,揭示出生成式检索仍面临解码机制缺陷——即无法生成唯一标识符来区分相关文档,这成为未来改进的核心挑战。

链接: https://arxiv.org/abs/2604.05764
作者: Adrian Bracher,Svitlana Vakulenko
机构: Vienna University of Economics and Business (维也纳经济与商业大学)
类目: Information Retrieval (cs.IR)
备注: Work in progress

点击查看摘要

Abstract:While dense retrieval models, which embed queries and documents into a shared low-dimensional space, have gained widespread popularity, they were shown to exhibit important theoretical limitations and considerably lag behind traditional sparse retrieval models in certain settings. Generative retrieval has emerged as an alternative approach to dense retrieval by using a language model to predict query-document relevance directly. In this paper, we demonstrate strengths and weaknesses of generative retrieval approaches using a simple synthetic dataset, called LIMIT, that was previously introduced to empirically demonstrate the theoretical limitations of embedding-based retrieval but was not used to evaluate generative retrieval. We close this research gap and show that generative retrieval achieves the best performance on this dataset without any additional training required (0.92 and 0.99 R@2 for SEAL and MINDER, respectively), compared to dense approaches (0.03 Recall@2) and BM25 (0.86 R@2). However, we then proceed to extend the original LIMIT dataset by adding simple hard negative samples and observe the performance degrading for all the models including the generative retrieval models (0.51 R@2) as well as BM25 (0.21 R@2). Error analysis identifies a failure in the decoding mechanism, caused by the inability to produce identifiers that are unique to relevant documents. Future generative retrieval must address these issues, either by designing identifiers that are more suitable to the decoding process or by adapting decoding and scoring algorithms to preserve relevance signals.
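LIMIT 实验中报告的 R@2 即标准 Recall@k,计算方式可示意如下(文档 ID 为演示用假设):

```python
def recall_at_k(ranked_doc_ids, relevant_doc_ids, k=2):
    """Recall@k:前 k 个结果命中的相关文档
    占全部相关文档的比例。"""
    hits = len(set(ranked_doc_ids[:k]) & set(relevant_doc_ids))
    return hits / len(relevant_doc_ids)
```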

[IR-9] Graph Topology Information Enhanced Heterogeneous Graph Representation Learning

【速读】:该论文旨在解决异构图(heterogeneous graph)在下游任务中因原始图结构噪声大、非最优而导致的图表示学习(Graph Representation Learning, GRL)性能下降问题,以及现有同质图结构学习(Graph Structure Learning, GSL)方法难以直接应用于异构图时面临的内存消耗过高和拓扑信息利用不足的问题。解决方案的关键在于提出一种新颖的图拓扑增强型异构图表示学习框架(ToGRL),其核心创新包括:首先设计了一个两阶段GSL模块,通过提取与下游任务相关的潜在拓扑信息并生成拓扑嵌入(topology embeddings),将原始图结构投影至具有平滑信号的新图结构;其次,该模块分离邻接矩阵优化与节点表示学习过程,显著降低内存开销;最后,结合提示调优(prompt tuning)机制,进一步挖掘已学表示中的知识,提升模型对下游任务的适应能力。

链接: https://arxiv.org/abs/2604.05732
作者: He Zhao,Zhiwei Zeng,Yongwei Wang,Chunyan Miao
机构: Nanyang Technological University (南洋理工大学); Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (NTU-UBC老年活跃生活卓越研究中心)
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Real-world heterogeneous graphs are inherently noisy and usually not in the optimal graph structures for downstream tasks, which often adversely affects the performance of GRL models in downstream tasks. Although Graph Structure Learning (GSL) methods have been proposed to learn graph structures and downstream tasks simultaneously, existing methods are predominantly designed for homogeneous graphs, while GSL for heterogeneous graphs remains largely unexplored. Two challenges arise in this context. Firstly, the quality of the input graph structure has a more profound impact on GNN-based heterogeneous GRL models compared to their homogeneous counterparts. Secondly, most existing homogeneous GRL models encounter memory consumption issues when applied directly to heterogeneous graphs. In this paper, we propose a novel Graph Topology learning Enhanced Heterogeneous Graph Representation Learning framework (ToGRL). ToGRL learns high-quality graph structures and representations for downstream tasks by incorporating task-relevant latent topology information. Specifically, a novel GSL module is first proposed to extract downstream task-related topology information from a raw graph structure and project it into topology embeddings. These embeddings are utilized to construct a new graph with smooth graph signals. This two-stage approach to GSL separates the optimization of the adjacency matrix from node representation learning to reduce memory consumption. Following this, a representation learning module takes the new graph as input to learn embeddings for downstream tasks. ToGRL also leverages prompt tuning to better utilize the knowledge embedded in learned representations, thus enhancing adaptability to downstream tasks. Extensive experiments on five real-world datasets show that our ToGRL outperforms state-of-the-art methods by a large margin.

[IR-10] SemLink: A Semantic-Aware Automated Test Oracle for Hyperlink Verification using Siamese Sentence-BERT

【速读】:该论文旨在解决网页链接中存在的语义漂移(semantic drift)问题,即链接虽能返回HTTP 200状态码,但目标页面内容已与源上下文不再相关,传统基于HTTP状态码的验证工具无法识别此类语义不一致,从而影响网页完整性与用户体验。解决方案的关键在于提出SemLink,一种基于Siamese神经网络架构的自动化测试断言机制,其核心是利用预训练Sentence-BERT(SBERT)模型计算锚文本、周围DOM元素及视觉特征组成的源上下文与目标页面内容之间的语义一致性,通过构建包含6万余对正样本的Hyperlink-Webpage Positive Pairs(HWPPs)数据集进行训练和评估,实现高召回率(96.00%)的同时显著优于LLM方法在速度(快约47.5倍)和资源消耗方面的效率优势。

链接: https://arxiv.org/abs/2604.05711
作者: Guan-Yan Yang,Wei-Ling Wen,Shu-Yuan Ku,Farn Wang,Kuo-Hui Yeh
机构: National Taiwan University of Science and Technology (台湾科技大学); National Cheng Kung University (成功大学)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted at the 19th IEEE International Conference on Software Testing, Verification and Validation (ICST) 2026, Daejeon, Republic of Korea

点击查看摘要

Abstract:Web applications rely heavily on hyperlinks to connect disparate information resources. However, the dynamic nature of the web leads to link rot, where targets become unavailable, and more insidiously, semantic drift, where a valid HTTP 200 connection exists, but the target content no longer aligns with the source context. Traditional verification tools, which primarily function as crash oracles by checking HTTP status codes, often fail to detect semantic inconsistencies, thereby compromising web integrity and user experience. While Large Language Models (LLMs) offer semantic understanding, they suffer from high latency, privacy concerns, and prohibitive costs for large-scale regression testing. In this paper, we propose SemLink, a novel automated test oracle for semantic hyperlink verification. SemLink leverages a Siamese Neural Network architecture powered by a pre-trained Sentence-BERT (SBERT) backbone to compute the semantic coherence between a hyperlink’s source context (anchor text, surrounding DOM elements, and visual features) and its target page content. To train and evaluate our model, we introduce the Hyperlink-Webpage Positive Pairs (HWPPs) dataset, a rigorously constructed corpus of over 60,000 semantic pairs. Our evaluation demonstrates that SemLink achieves a Recall of 96.00%, comparable to state-of-the-art LLMs (GPT-5.2), while operating approximately 47.5 times faster and requiring significantly fewer computational resources. This work bridges the gap between traditional syntactic checkers and expensive generative AI, offering a robust and efficient solution for automated web quality assurance.
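SemLink 的语义断言最终归结为"源上下文嵌入与目标页嵌入的余弦一致性是否超过阈值"。下面用纯 Python 给出示意(实际系统中的嵌入由 Siamese SBERT 产生;此处的向量与阈值均为演示假设):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def semlink_oracle(source_ctx_emb, target_page_emb, threshold=0.6):
    """语义断言示意:一致性低于阈值即判为"语义漂移"。
    源上下文嵌入编码锚文本及其周边 DOM/视觉特征,
    目标页嵌入编码链接目标内容;阈值为演示用假设。"""
    if cosine(source_ctx_emb, target_page_emb) >= threshold:
        return "coherent"
    return "semantic_drift"
```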

[IR-11] Improving Semantic Proximity in Information Retrieval through Cross-Lingual Alignment ICLR2026

【速读】:该论文旨在解决多语言信息检索(Multilingual Information Retrieval, MIR)中普遍存在的“英语偏好”(English Inclination)问题,即在混合语言文档池中,模型倾向于优先召回与查询语言不一致但为英文的文档,而非与查询同语种的相关文档,从而导致跨语言对齐能力被低估。解决方案的关键在于提出一种新颖的训练策略,仅需2.8k小规模样本即可显著提升多语言嵌入模型的跨语言对齐能力,并有效缓解英语偏好现象,从而增强模型在复杂多语言环境下的公平性和准确性。

链接: https://arxiv.org/abs/2604.05684
作者: Seongtae Hong,Youngjoon Jang,Jungseob Lee,Hyeonseok Moon,Heuiseok Lim
机构: Korea University (韩国大学)
类目: Information Retrieval (cs.IR)
备注: ICLR 2026

点击查看摘要

Abstract:With the increasing accessibility and utilization of multilingual documents, Cross-Lingual Information Retrieval (CLIR) has emerged as an important research area. Conventionally, CLIR tasks have been conducted under settings where the language of documents differs from that of queries, and typically, the documents are composed in a single coherent language. In this paper, we highlight that in such a setting, the cross-lingual alignment capability may not be evaluated adequately. Specifically, we observe that, in a document pool where English documents coexist with another language, most multilingual retrievers tend to prioritize unrelated English documents over the related document written in the same language as the query. To rigorously analyze and quantify this phenomenon, we introduce various scenarios and metrics designed to evaluate the cross-lingual alignment performance of multilingual retrieval models. Furthermore, to improve cross-lingual performance under these challenging conditions, we propose a novel training strategy aimed at enhancing cross-lingual alignment. Using only a small dataset consisting of 2.8k samples, our method significantly improves the cross-lingual retrieval performance while simultaneously mitigating the English inclination problem. Extensive analyses demonstrate that the proposed method substantially enhances the cross-lingual alignment capabilities of most multilingual embedding models.
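论文指出的"英语偏好"现象,可以用一个简单指标量化:排序结果中无关英文文档先于同语种相关文档出现的查询占比。示意实现如下(runs 的数据结构为演示用假设):

```python
def english_inclination_rate(runs):
    """量化"英语偏好":无关英文文档先于同语种相关文档出现
    的查询占比。runs 的结构为演示用假设:每个查询对应
    按得分降序的 (doc_lang, is_relevant) 列表。"""
    biased = 0
    for ranked in runs:
        for lang, is_relevant in ranked:
            if is_relevant:      # 先命中相关文档:无偏置
                break
            if lang == "en":     # 先命中无关英文文档:计一次偏置
                biased += 1
                break
    return biased / len(runs)
```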

[IR-12] CUE-R: Beyond the Final Answer in Retrieval-Augmented Generation

【速读】:该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)评估方法中对单个检索证据项(evidence item)操作效用(operational utility)难以量化的问题。当前评估多聚焦于最终答案质量、引用忠实性或答案级归因,而忽视了干预式分析下每条证据的实际作用。其解决方案的关键在于提出一种轻量级的干预框架 CUE-R(Causal Utility Evaluation for Retrieval),通过 REMOVE、REPLACE 和 DUPLICATE 三种操作扰动单个证据项,并基于正确性、代理支撑忠实性(proxy-based grounding faithfulness)、置信度误差及检索-使用轨迹差异(trace-divergence signal)四个维度测量其效用变化,同时构建了证据角色分类体系以解释干预结果。实验表明,该方法能有效识别证据项的非加和性交互效应,揭示仅依赖答案层面评估可能忽略的重要证据作用机制。

链接: https://arxiv.org/abs/2604.05467
作者: Siddharth Jain,Venkat Narayan Vedam
机构: Intuit
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 6 figures, 14 tables; appendix includes bootstrap CIs, metric definitions, duplicate position sensitivity, prompt template, and reproducibility details

点击查看摘要

Abstract:As language models shift from single-shot answer generation toward multi-step reasoning that retrieves and consumes evidence mid-inference, evaluating the role of individual retrieved items becomes more important. Existing RAG evaluation typically targets final-answer quality, citation faithfulness, or answer-level attribution, but none of these directly targets the intervention-based, per-evidence-item utility view we study here. We introduce CUE-R, a lightweight intervention-based framework for measuring per-evidence-item operational utility in single-shot RAG using shallow observable retrieval-use traces. CUE-R perturbs individual evidence items via REMOVE, REPLACE, and DUPLICATE operators, then measures changes along three utility axes (correctness, proxy-based grounding faithfulness, and confidence error) plus a trace-divergence signal. We also outline an operational evidence-role taxonomy for interpreting intervention outcomes. Experiments on HotpotQA and 2WikiMultihopQA with Qwen-3 8B and GPT-5.2 reveal a consistent pattern: REMOVE and REPLACE substantially harm correctness and grounding while producing large trace shifts, whereas DUPLICATE is often answer-redundant yet not fully behaviorally neutral. A zero-retrieval control confirms that these effects arise from degradation of meaningful retrieval. A two-support ablation further shows that multi-hop evidence items can interact non-additively: removing both supports harms performance far more than either single removal. Our results suggest that answer-only evaluation misses important evidence effects and that intervention-based utility analysis is a practical complement for RAG evaluation.
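CUE-R 的三个干预算子(REMOVE/REPLACE/DUPLICATE)及其对应的单条证据效用差分,可示意如下(answer_fn 为假设的黑盒打分器,非论文原始实现):

```python
def remove_item(evidence, i):
    return evidence[:i] + evidence[i + 1:]

def replace_item(evidence, i, distractor):
    out = list(evidence)
    out[i] = distractor
    return out

def duplicate_item(evidence, i):
    out = list(evidence)
    out.insert(i + 1, evidence[i])
    return out

OPERATORS = {"remove": remove_item,
             "replace": replace_item,
             "duplicate": duplicate_item}

def per_item_utility(answer_fn, evidence, i, op, **kwargs):
    """第 i 条证据在算子 op 下的效用差分:扰动前后
    正确性得分之差。answer_fn 为假设的黑盒打分器。"""
    return answer_fn(evidence) - answer_fn(OPERATORS[op](evidence, i, **kwargs))
```

与论文观察一致,DUPLICATE 在这种以答案为准的打分下通常接近中性,而 REMOVE/REPLACE 会显著改变得分。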

[IR-13] Data-Driven Function Calling Improvements in Large Language Model for Online Financial QA

【速读】:该论文旨在解决通用大语言模型(Large Language Models, LLMs)在金融领域在线问答系统中函数调用能力不足的问题,具体表现为:LLMs缺乏针对金融场景的定制化API调用能力,且用户查询参数常超出训练数据分布,导致无法准确匹配所需函数输入。解决方案的关键在于构建一个数据驱动的流水线,包括数据集构建、数据增强(AugFC方法)和两阶段模型训练,通过引入真实用户查询样本并扩展参数空间多样性,显著提升LLM对金融工具的调用准确性与泛化能力,从而有效支持线上金融问答服务。

链接: https://arxiv.org/abs/2604.05387
作者: Xing Tang,Hao Chen,Shiwei Li,Fuyuan Lyu,Weijie Shi,Lingjie Li,Dugang Liu,Weihong Luo,Xiku Du,Xiuqiang He
机构: Shenzhen Technology University(深圳技术大学); Tencent(腾讯); Huazhong University of Science and Technology(华中科技大学); McGill University(麦吉尔大学); The Hong Kong University of Science and Technology(香港科技大学); Shenzhen University(深圳大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted to Webconf 2026 industry track

点击查看摘要

Abstract:Large language models (LLMs) have been incorporated into numerous industrial applications. Meanwhile, a vast array of API assets is scattered across various functions in the financial domain. An online financial question-answering system can leverage both LLMs and private APIs to provide timely financial analysis and information. The key is equipping the LLM model with function calling capability tailored to a financial scenario. However, a generic LLM requires customized financial APIs to call and struggles to adapt to the financial domain. Additionally, online user queries are diverse and contain out-of-distribution parameters compared with the required function input parameters, which makes it more difficult for a generic LLM to serve online users. In this paper, we propose a data-driven pipeline to enhance function calling in LLM for our online, deployed financial QA, comprising dataset construction, data augmentation, and model training. Specifically, we construct a dataset based on a previous study and update it periodically, incorporating queries and an augmentation method named AugFC. The addition of user query-related samples will exploit our financial toolset in a data-driven manner, and AugFC explores the possible parameter values to enhance the diversity of our updated dataset. Then, we train an LLM with a two-step method, which enables the use of our financial functions. Extensive experiments on existing offline datasets, as well as the deployment of an online scenario, illustrate the superiority of our pipeline. The related pipeline has been adopted in the financial QA of YuanBao (this https URL), one of the largest chat platforms in China.

[IR-14] Retrieve-then-Adapt: Retrieval-Augmented Test-Time Adaptation for Sequential Recommendation

【速读】:该论文旨在解决顺序推荐(Sequential Recommendation, SR)模型在推理阶段难以适应用户实时偏好变化的问题,这主要源于分布偏移(distributional divergence)和参数化约束带来的挑战。现有方法如测试时训练、测试时增强或检索增强微调,分别存在计算开销大、依赖随机增强策略或需复杂两阶段训练等局限。论文提出的关键解决方案是Retrieve-then-Adapt(ReAd)框架,其核心在于通过动态检索协同相似项构建信息丰富的增强嵌入(augmentation embedding),实现有效增强与高效适配的统一:首先从协同记忆数据库中检索与测试用户行为相似的项目,再利用轻量级检索学习模块融合协同信号与预测优化线索生成增强嵌入,最终通过融合机制对初始预测进行精细化调整。该方法在五个基准数据集上均显著优于现有SR方法。

链接: https://arxiv.org/abs/2604.05379
作者: Xing Tang,Jingyang Bin,Ziqiang Cui,Xiaokun Zhang,Fuyuan Lyu,Jingyan Jiang,Dugang Liu,Chen Ma,Xiuqiang He
机构: Shenzhen Technology University (深圳技术大学); City University of Hong Kong (香港城市大学); McGill University (麦吉尔大学); Shenzhen University (深圳大学)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The sequential recommendation (SR) task aims to predict the next item based on users’ historical interaction sequences. Typically trained on historical data, SR models often struggle to adapt to real-time preference shifts during inference due to challenges posed by distributional divergence and parameterized constraints. Existing approaches to address this issue include test-time training, test-time augmentation, and retrieval-augmented fine-tuning. However, these methods either introduce significant computational overhead, rely on random augmentation strategies, or require a carefully designed two-stage training paradigm. In this paper, we argue that the key to effective test-time adaptation lies in achieving both effective augmentation and efficient adaptation. To this end, we propose Retrieve-then-Adapt (ReAd), a novel framework that dynamically adapts a deployed SR model to the test distribution through retrieved user preference signals. Specifically, given a trained SR model, ReAd first retrieves collaboratively similar items for a test user from a constructed collaborative memory database. A lightweight retrieval learning module then integrates these items into an informative augmentation embedding that captures both collaborative signals and prediction-refinement cues. Finally, the initial SR prediction is refined via a fusion mechanism that incorporates this embedding. Extensive experiments across five benchmark datasets demonstrate that ReAd consistently outperforms existing SR methods.
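ReAd 的"检索-融合"两步可以用如下纯 Python 片段示意:先按余弦相似度从协同记忆中取最相似的 m 条向量,求均值作为增强嵌入,再以权重 beta 修正初始打分(检索方式、m 与 beta 均为演示假设,并非论文的轻量检索学习模块本身):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def read_refine(user_emb, memory_embs, base_scores, item_embs,
                m=2, beta=0.3):
    """检索-融合示意:从协同记忆取 m 条最相似向量求均值作为
    增强嵌入,再以权重 beta 修正各候选物品的初始打分。"""
    top = sorted(memory_embs, key=lambda v: cosine(user_emb, v),
                 reverse=True)[:m]
    aug = [sum(col) / len(top) for col in zip(*top)]
    return [s + beta * sum(a * e for a, e in zip(aug, emb))
            for s, emb in zip(base_scores, item_embs)]
```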

[IR-15] From Clues to Generation: Language-Guided Conditional Diffusion for Cross-Domain Recommendation

【速读】:该论文旨在解决跨域推荐(Cross-domain Recommendation, CDR)中因重叠用户稀缺而导致的细粒度偏好迁移困难问题,尤其针对仅在单一领域有交互记录的用户无法有效进行跨域推荐的挑战。解决方案的关键在于提出语言引导的条件扩散模型(Language-Guided Conditional Diffusion, LGCD),其核心创新包括:利用大语言模型(Large Language Models, LLMs)推理生成伪重叠数据以弥补真实重叠用户的不足,并通过区分真实与伪交互路径引入额外监督约束以降低语义噪声;同时设计条件扩散架构,基于源域模式精准引导目标域用户表示的生成,从而实现更准确的跨域偏好预测。

链接: https://arxiv.org/abs/2604.05365
作者: Ziang Lu,Lei Sang,Lin Mu,Yiwen Zhang
机构: Anhui University (安徽大学)
类目: Information Retrieval (cs.IR)
备注: 11 pages, 6 figures

点击查看摘要

Abstract:Cross-domain Recommendation (CDR) exploits multi-domain correlations to alleviate data sparsity. As a core task within this field, inter-domain recommendation focuses on predicting preferences for users who interact in a source domain but lack behavioral records in a target domain. Existing approaches predominantly rely on overlapping users as anchors for knowledge transfer. In real-world scenarios, overlapping users are often scarce, leaving the vast majority of users with only single-domain interactions. For these users, the absence of explicit alignment signals makes fine-grained preference transfer intrinsically difficult. To address this challenge, this paper proposes Language-Guided Conditional Diffusion for CDR (LGCD), a novel framework that integrates Large Language Models (LLMs) and diffusion models for inter-domain sequential recommendation. Specifically, we leverage LLM reasoning to bridge the domain gap by inferring potential target preferences for single-domain users and mapping them to real items, thereby constructing pseudo-overlapping data. We distinguish between real and pseudo-interaction pathways and introduce additional supervision constraints to mitigate the semantic noise brought by pseudo-interaction. Furthermore, we design a conditional diffusion architecture to precisely guide the generation of target user representations based on source-domain patterns. Extensive experiments demonstrate that LGCD significantly outperforms state-of-the-art methods in inter-domain recommendation tasks.

[IR-16] Curr-RLCER:Curriculum Reinforcement Learning For Coherence Explainable Recommendation DASFAA2026

【速读】:该论文旨在解决现有可解释推荐系统(Explainable Recommendation Systems, ERS)中评分预测与解释生成任务之间存在不一致性的问题,即二者目标耦合导致的推理不连贯性。解决方案的关键在于提出一种基于课程学习(Curriculum Learning)的强化学习框架——Curr-RLCER,其通过分阶段训练策略,从简单的点击率(Click-Through Rate, CTR)或选择性评分预测逐步过渡到开放式的推荐解释生成,并设计动态评分对齐机制以提升系统稳定性;同时引入以一致性为导向的奖励机制(coherence-driven reward),确保生成解释与预测评分在语义上高度一致,从而显著增强推荐结果的透明度和可信度。

链接: https://arxiv.org/abs/2604.05341
作者: Xiangchen Pan,Wei Wei
机构: 未知
类目: Information Retrieval (cs.IR)
备注: Accepted at DASFAA 2026. This is the author version

点击查看摘要

Abstract:Explainable recommendation systems (RSs) are designed to explicitly uncover the rationale of each recommendation, thereby enhancing the transparency and credibility of RSs. Previous methods often jointly predicted ratings and generated explanations, but overlooked the incoherence of such two objectives. To address this issue, we propose Curr-RLCER, a reinforcement learning framework for explanation coherent recommendation with dynamic rating alignment. It employs curriculum learning, transitioning from basic predictions (i.e., click-through rate (CTR) and selection-based rating) to open-ended recommendation explanation generation. In particular, the rewards of each stage are designed for progressively enhancing the stability of RSs. Furthermore, a coherence-driven reward mechanism is also proposed to enforce the coherence between generated explanations and predicted ratings, supported by a specifically designed evaluation scheme. The extensive experimental results on three explainable recommendation datasets indicate that the proposed framework is effective. Codes and datasets are available at this https URL.

[IR-17] Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation

【速读】:该论文旨在解决生成式推荐(Generative Recommendation)中基于高粒度语义ID(Semantic ID, SID)框架所引发的两大问题:一是由于序列扩展导致的训练开销巨大,二是性能可靠性差,表现为准确率非单调波动。作者指出这些问题的根本原因是“语义稀释效应”(Semantic Dilution Effect),即冗余token浪费大量计算资源并稀释了本就稀疏的学习信号。解决方案的关键在于提出STAMP框架,采用双端优化策略:输入端通过语义自适应剪枝(Semantic Adaptive Pruning, SAP)动态过滤冗余信息,生成紧凑且富含信息的表示;输出端通过多步辅助预测(Multi-step Auxiliary Prediction, MAP)引入多标记目标以增强反馈密度,强化长程依赖建模并确保压缩输入下的鲁棒学习信号。该方法统一实现了输入净化与信号放大,显著提升了训练效率和表示能力。

链接: https://arxiv.org/abs/2604.05329
作者: Tianyu Zhan,Kairui Fu,Chengfei Lv,Zheqi Lv,Shengyu Zhang
机构: Zhejiang University (浙江大学); Alibaba (阿里巴巴)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Generative Recommendation (GR) has recently transitioned from atomic item-indexing to Semantic ID (SID)-based frameworks to capture intrinsic item relationships and enhance generalization. However, the adoption of high-granularity SIDs leads to two critical challenges: prohibitive training overhead due to sequence expansion and unstable performance reliability characterized by non-monotonic accuracy fluctuations. We identify that these disparate issues are fundamentally rooted in the Semantic Dilution Effect, where redundant tokens waste massive computation and dilute the already sparse learning signals in recommendation. To counteract this, we propose STAMP (Semantic Trimming and Auxiliary Multi-step Prediction), a framework utilizing a dual-end optimization strategy. We argue that effective SID learning requires simultaneously addressing low input information density and sparse output supervision. On the input side, Semantic Adaptive Pruning (SAP) dynamically filters redundancy during the forward pass, converting noise-laden sequences into compact, information-rich representations. On the output side, Multi-step Auxiliary Prediction (MAP) employs a multi-token objective to densify feedback, strengthening long-range dependency capture and ensuring robust learning signals despite compressed inputs. Unifying input purification and signal amplification, STAMP enhances both training efficiency and representation capability. Experiments on public Amazon and large-scale industrial datasets show STAMP achieves 1.23–1.38× speedup and 17.2%–54.7% VRAM reduction while maintaining or improving performance across multiple architectures.

[IR-18] Next-Scale Generative Reranking: A Tree-based Generative Rerank Method at Meituan

【速读】:该论文旨在解决多阶段推荐系统中重排序(reranking)任务面临的两大挑战:一是生成器在自回归或非自回归策略下均难以同时兼顾局部与全局视角,导致生成结果不最优;二是训练过程中生成器与评估器之间的目标不一致性,使得引导信号模糊,影响模型性能。解决方案的关键在于提出一种基于树结构的生成式重排序框架——NSGR(Next-Scale Generation Reranking),其核心创新包括:设计了一个从粗到细逐步扩展推荐列表的“下一尺度生成器”(Next-Scale Generator, NSG),以平衡全局和局部信息;并引入多尺度邻居损失(multi-scale neighbor loss),通过树状多尺度评估器(Multi-scale Evaluator, MSE)在每一尺度上提供针对性的指导信号,从而缓解目标不一致问题并提升生成质量。

链接: https://arxiv.org/abs/2604.05314
作者: Shuli Wang,Changhao Li,Ke Fan,Senjie Kou,Junwei Yin,Chi Wang,Yinhua Zhu,Haitao Wang,Xingxing Wang
机构: Meituan(美团); Meituan(美团)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:In modern multi-stage recommendation systems, reranking plays a critical role by modeling contextual information. Due to inherent challenges such as the combinatorial space complexity, an increasing number of methods adopt the generative paradigm: the generator produces the optimal list during inference, while an evaluator guides the generator’s optimization during the training phase. However, these methods still face two problems. Firstly, these generators fail to produce optimal generation results due to the lack of both local and global perspectives, regardless of whether the generation strategy is autoregressive or non-autoregressive. Secondly, the goal inconsistency problem between the generator and the evaluator during training complicates the guidance signal, leading to suboptimal performance. To address these issues, we propose the Next-Scale Generation Reranking (NSGR), a tree-based generative framework. Specifically, we introduce a next-scale generator (NSG) that progressively expands a recommendation list from user interests in a coarse-to-fine manner, balancing global and local perspectives. Furthermore, we design a multi-scale neighbor loss, which leverages a tree-based multi-scale evaluator (MSE) to provide scale-specific guidance to the NSG at each scale. Extensive experiments on public and industrial datasets validate the effectiveness of NSGR. NSGR has also been successfully deployed on the Meituan food delivery platform.

[IR-19] Pay Attention to Sequence Split: Uncovering the Impacts of Sub-Sequence Splitting on Sequential Recommendation Models SIGIR2026

【速读】:该论文旨在解决子序列分割(Sub-sequence Splitting, SSS)在序列推荐(Sequential Recommendation, SR)中可能干扰模型真实性能评估的问题。研究发现,许多前沿SR模型在数据读取阶段隐式使用SSS(但未在论文中明确说明),若移除该操作,模型性能显著下降,甚至低于早期经典方法;进一步分析表明,SSS的有效性高度依赖于特定的分割方法、目标策略与损失函数的协同组合,不当搭配反而会损害性能。其关键解决方案在于揭示SSS通过均衡训练数据分布并提升不同物品被预测的概率,从而带来显著性能增益的本质机制,并提出应规范数据处理流程以避免SSS对评估结果的误导,推动更公平、严谨的SR模型评测标准。

链接: https://arxiv.org/abs/2604.05309
作者: Yizhou Dang,Yifan Wu,Minhan Huang,Chuang Zhao,Lianbo Ma,Guibing Guo,Xingwei Wang,Zhu Sun
机构: Northeastern University (东北大学); Tianjin University (天津大学); Singapore University of Technology and Design (新加坡科技设计大学)
类目: Information Retrieval (cs.IR)
备注: Accepted by SIGIR 2026

点击查看摘要

Abstract:Sub-sequence splitting (SSS) has been demonstrated as an effective approach to mitigate data sparsity in sequential recommendation (SR) by splitting a raw user interaction sequence into multiple sub-sequences. Previous studies have demonstrated its ability to enhance the performance of SR models significantly. However, in this work, we discover that (i) SSS may interfere with the evaluation of the model’s actual performance. We observed that many recent state-of-the-art SR models employ SSS during the data reading stage (not mentioned in the papers). When we removed this operation, performance significantly declined, even falling below that of earlier classical SR models. The varying improvements achieved by SSS and different splitting methods across different models prompt us to analyze further when SSS proves effective. We find that (ii) SSS demonstrates strong capabilities only when specific splitting methods, target strategies, and loss functions are used together. Inappropriate combinations may even harm performance. Furthermore, we analyze why sub-sequence splitting yields such remarkable performance gains and find that (iii) it evens out the distribution of training data while increasing the likelihood that different items are targeted. Finally, we provide suggestions for overcoming SSS interference, along with a discussion on data augmentation methods and future directions. We hope this work will prompt the broader community to re-examine the impact of data splitting on SR and promote fairer, more rigorous model evaluation. All analysis code and data will be made available upon acceptance. We provide a simple, anonymous implementation at this https URL.
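下面用一个极简的 Python 示意说明摘要中讨论的子序列分割(SSS)操作:把一条原始交互序列展开为多条前缀子序列,使每个物品(而不仅是末位物品)都有机会成为训练目标。函数名与参数均为本文示例所设,并非论文的原始实现:

```python
def split_subsequences(seq, min_len=2):
    # 将一条交互序列展开为全部前缀子序列(长度 >= min_len)
    return [seq[:i] for i in range(min_len, len(seq) + 1)]

# 每条子序列的末位物品作为训练目标,其余部分作为模型输入
samples = split_subsequences([10, 42, 7, 3])
pairs = [(s[:-1], s[-1]) for s in samples]
# pairs == [([10], 42), ([10, 42], 7), ([10, 42, 7], 3)]
```

这也直观解释了摘要中的发现 (iii):分割后训练样本数量增多,且不同物品被作为预测目标的机会更加均衡。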

[IR-20] Spike Hijacking in Late-Interaction Retrieval ECIR2026

【速读】:该论文旨在解决晚交互检索模型(late-interaction retrieval models)中基于硬最大相似度(hard maximum similarity, MaxSim)聚合token级相似度时所引发的梯度集中问题,该问题可能导致训练动态结构上的偏差和对文档长度变化的敏感性。其解决方案的关键在于揭示MaxSim诱导的patch-level梯度浓度显著高于平滑替代方法(如Top-k池化和softmax聚合),并通过合成环境与真实世界多向量检索基准实验验证:MaxSim在文档长度增长时性能退化更严重,而温和平滑策略则更具鲁棒性。研究明确指出,池化方式引起的梯度集中是晚交互检索的结构性特征,并提出稀疏性-鲁棒性权衡(sparsity-robustness tradeoff)作为改进方向,从而为多向量检索系统中硬最大池化的替代方案提供理论依据。

链接: https://arxiv.org/abs/2604.05253
作者: Karthik Suresh,Tushar Vatsa,Tracy King,Asim Kadav,Michael Friedrich
机构: Adobe(Adobe)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted at the 1st Late Interaction Retrieval Workshop (LIR 2026) at ECIR 2026. Published in CEUR Workshop Proceedings

点击查看摘要

Abstract:Late-interaction retrieval models rely on hard maximum similarity (MaxSim) to aggregate token-level similarities. Although effective, this winner-take-all pooling rule may structurally bias training dynamics. We provide a mechanistic study of gradient routing and robustness in MaxSim-based retrieval. In a controlled synthetic environment with in-batch contrastive training, we demonstrate that MaxSim induces significantly higher patch-level gradient concentration than smoother alternatives such as Top-k pooling and softmax aggregation. While sparse routing can improve early discrimination, it also increases sensitivity to document length: as the number of document patches grows, MaxSim degrades more sharply than mild smoothing variants. We corroborate these findings on a real-world multi-vector retrieval benchmark, where controlled document-length sweeps reveal similar brittleness under hard max pooling. Together, our results isolate pooling-induced gradient concentration as a structural property of late-interaction retrieval and highlight a sparsity-robustness tradeoff. These findings motivate principled alternatives to hard max pooling in multi-vector retrieval systems.
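摘要对比的三种 token 级相似度聚合方式(硬 MaxSim、Top-k 池化、softmax 聚合)可以用下面的 NumPy 片段示意。函数与参数命名均为本文假设,仅用于说明机制,并非论文的实验代码:

```python
import numpy as np

def late_interaction_score(Q, D, pooling="max", tau=0.1, k=2):
    # Q: (n_q, d) 查询 token 向量;D: (n_d, d) 文档 token/patch 向量
    S = Q @ D.T                     # token 级相似度矩阵 (n_q, n_d)
    if pooling == "max":            # 硬 MaxSim:赢者通吃
        pooled = S.max(axis=1)
    elif pooling == "topk":         # Top-k 池化:取前 k 个相似度的均值
        pooled = np.sort(S, axis=1)[:, -k:].mean(axis=1)
    else:                           # softmax 聚合:温度 tau 控制平滑程度
        w = np.exp(S / tau)
        w /= w.sum(axis=1, keepdims=True)
        pooled = (w * S).sum(axis=1)
    return float(pooled.sum())      # 对查询 token 求和得到文档得分
```

硬 MaxSim 的梯度只流向每个查询 token 的单一最佳匹配,而 Top-k 与 softmax 把梯度分散到多个文档 token 上,这正是摘要所述"梯度集中"与"稀疏性-鲁棒性权衡"的直观来源。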

[IR-21] Entities as Retrieval Signals: A Systematic Study of Coverage, Supervision, and Evaluation in Entity-Oriented Ranking

【速读】:该论文旨在解决实体导向检索(Entity-oriented Retrieval)在开放世界评估中表现不佳的问题,其核心矛盾在于:尽管在受限于实体的条件下模型表现出显著提升,但在全候选集上的开放世界评估却几乎无改善。作者指出,问题根源不在于模型架构本身(如神经重排序器),而在于评估方式与监督信号的设计缺陷——现有方法多基于概念性实体相关性(Conceptual Entity Relevance, CER),忽略了可观察实体相关性(Observable Entity Relevance, OER),即在特定链接器环境下文档是否具备区分度。因此,解决方案的关键在于构建具有实体级判别力的数据集,并引入同时衡量覆盖率与有效性的评估机制,以区分“利用实体证据”与“在真实链接环境中改进检索”的不同目标。

链接: https://arxiv.org/abs/2604.05204
作者: Shubham Chatterjee
机构: Missouri University of Science and Technology (密苏里科技大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Entity-oriented retrieval assumes that relevant documents exhibit query-relevant entities, yet evaluations report conflicting results. We show this inconsistency stems not from model failure, but from evaluation. On TREC Robust04, we evaluate six neural rerankers and 437 unsupervised configurations against BM25. Across 443 systems, none improves MAP by more than 0.05 under open-world evaluation over the full candidate set, despite strong gains under entity-restricted settings. The best configuration matches the official Robust04 best system and outperforms most neural rerankers, indicating that the architecture is not the limiting factor. Instead, the bottleneck is the entity channel: even under idealized selection, entity signals cover only 19.7% of relevant documents, and no method achieves both high coverage and discrimination. We explain this via a distinction between Conceptual Entity Relevance (CER) – semantic relatedness – and Observable Entity Relevance (OER) – corpus-grounded discriminativeness under a given linker. All supervision strategies operate at the CER level and ignore the linking environment, leading to signals that are semantically valid but not discriminative. Improving supervision therefore does not recover open-world performance: stronger signals reduce coverage without improving effectiveness. Conditional and open-world evaluation answer different questions: exploiting entity evidence versus improving retrieval under realistic linking, but are often conflated. Progress requires datasets with entity-level discriminativeness and evaluation that reports both coverage and effectiveness. Until then, conditional gains do not imply open-world effectiveness, and open-world failures do not invalidate entity-based models. 

[IR-22] Improving Clinical Trial Recruitment using Clinical Narratives and Large Language Models

【速读】:该论文旨在解决临床试验招募中患者筛选环节的效率瓶颈问题,即传统人工筛选流程耗时且易导致入组不足甚至试验失败。其解决方案的关键在于利用生成式大语言模型(Generative Large Language Models, LLMs)对临床病历文本进行自动化筛选,特别比较了编码器型与解码器型LLMs在处理长文档时的表现,并提出三种缓解“中间信息丢失”(Lost in the Middle)问题的策略:原始长上下文、基于命名实体识别(Named Entity Recognition, NER)的抽取式摘要和基于检索增强生成(Retrieval-Augmented Generation, RAG)的动态证据检索。实验表明,采用RAG策略的MedGemma模型在N2C2数据集上取得了最高的微平均F1分数(89.05%),显著提升了需跨长文档进行复杂推理的入选标准识别能力,而短文本范围的标准(如实验室检查结果)则仅获得增量改进。

链接: https://arxiv.org/abs/2604.05190
作者: Ziyi Chen,Mengxian Lyu,Cheng Peng,Yonghui Wu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Screening patients for enrollment is a well-known, labor-intensive bottleneck that leads to under-enrollment and, ultimately, trial failures. Recent breakthroughs in large language models (LLMs) offer a promising opportunity to use artificial intelligence to improve screening. This study systematically explored both encoder- and decoder-based generative LLMs for screening clinical narratives to facilitate clinical trial recruitment. We examined both general-purpose LLMs and medical-adapted LLMs and explored three strategies to alleviate the “Lost in the Middle” issue when handling long documents, including 1) Original long-context: using the default context windows of LLMs, 2) NER-based extractive summarization: converting the long document into summarizations using named entity recognition, 3) RAG: dynamic evidence retrieval based on eligibility criteria. The 2018 N2C2 Track 1 benchmark dataset is used for evaluation. Our experimental results show that the MedGemma model with the RAG strategy achieved the best micro-F1 score of 89.05%, outperforming other models. Generative LLMs have remarkably improved trial criteria that require long-term reasoning across long documents, whereas trial criteria that span a short piece of context (e.g., lab tests) show incremental improvements. The real-world adoption of LLMs for trial recruitment must consider specific criteria for selecting among rule-based queries, encoder-based LLMs, and generative LLMs to maximize efficiency within reasonable computing costs.

[IR-23] Offline RL for Adaptive Policy Retrieval in Prior Authorization

【速读】:该论文旨在解决先核准(Prior Authorization, PA)过程中因覆盖政策复杂且碎片化而导致的信息检索效率低下问题。传统检索增强系统采用固定的 top-K 策略,难以在准确性和检索成本之间取得平衡,常导致冗余或信息不足。其解决方案的关键在于将政策检索建模为一个马尔可夫决策过程(Markov Decision Process, MDP),通过强化学习训练智能体在每一步选择是否继续检索或终止并作出决策,同时引入奖励函数权衡决策准确性与检索代价。实验表明,基于保守Q学习(CQL)、隐式Q学习(IQL)和直接偏好优化(DPO)的策略能显著提升检索效率与准确性,其中DPO在保持92%决策准确率的同时减少47%的检索步骤,处于帕累托前沿的“选择性-高精度”区域,验证了适应性检索机制的有效性。

链接: https://arxiv.org/abs/2604.05125
作者: Ruslan Sharifullin,Maxim Gorshkov,Hannah Clay
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 7 figures, 6 tables

点击查看摘要

Abstract:Prior authorization (PA) requires interpretation of complex and fragmented coverage policies, yet existing retrieval-augmented systems rely on static top-K strategies with fixed numbers of retrieved sections. Such fixed retrieval can be inefficient and gather irrelevant or insufficient information. We model policy retrieval for PA as a sequential decision-making problem, formulating adaptive retrieval as a Markov Decision Process (MDP). In our system, an agent iteratively selects policy chunks from a top-K candidate set or chooses to stop and issue a decision. The reward balances decision correctness against retrieval cost, capturing the trade-off between accuracy and efficiency. We train policies using Conservative Q-Learning (CQL), Implicit Q-Learning (IQL), and Direct Preference Optimization (DPO) in an offline RL setting on logged trajectories generated from baseline retrieval strategies over synthetic PA requests derived from publicly available CMS coverage data. On a corpus of 186 policy chunks spanning 10 CMS procedures, CQL achieves 92% decision accuracy (+30 percentage points over the best fixed-K baseline) via exhaustive retrieval, while IQL matches the best baseline accuracy using 44% fewer retrieval steps and achieves the only positive episodic return among all policies. Transition-level DPO matches CQL’s 92% accuracy while using 47% fewer retrieval steps (10.6 vs. 20.0), occupying a “selective-accurate” region on the Pareto frontier that dominates both CQL and BC. A behavioral cloning baseline matches CQL, confirming that advantage-weighted or preference-based policy extraction is needed to learn selective retrieval. Lambda ablation over step costs λ ∈ {0.05, 0.1, 0.2} reveals a clear accuracy-efficiency inflection: only at λ = 0.2 does CQL transition from exhaustive to selective retrieval.
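摘要中"决策正确性 vs 检索成本"的回报设计,可以用如下假设性的回报函数来示意(奖励塑形方式为本文示例,并非论文的原始定义):

```python
def episode_return(correct, n_steps, lam=0.1, bonus=1.0):
    # 终止时:决策正确获得 bonus,同时每一步检索付出 lam 的代价
    return (bonus if correct else 0.0) - lam * n_steps

# 步代价 lam 越大,穷举检索越不划算:
# lam=0.2 时,检索 20 步即使答对也只得 1 - 4 = -3,
# 而 10 步答对得 1 - 2 = -1,策略因此被推向"选择性检索"
```

这个简单模型也能解释摘要末尾的消融现象:只有当步代价足够大(λ = 0.2)时,CQL 才从穷举检索转向选择性检索。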

[IR-24] CRAB: Codebook Rebalancing for Bias Mitigation in Generative Recommendation

【速读】:该论文旨在解决生成式推荐(Generative Recommendation, GeneRec)中存在的严重流行度偏差(popularity bias)问题,即模型在推荐过程中倾向于偏好高频率交互的热门项目,从而加剧了对冷门项目的忽视。其解决方案的关键在于提出一种后处理去偏策略CRAB(Codebook Rebalancing and Augmentation for Bias mitigation),通过两个核心步骤实现:首先,基于预训练模型重构码本(codebook),将过频的语义令牌(semantic tokens)进行拆分以缓解频率不平衡,同时保持其层次化语义结构;其次,引入树状结构正则项(tree-structured regularizer)增强令牌间的语义一致性,从而在训练中提升冷门令牌的信息表达能力,有效降低流行度偏差并改善推荐性能。

链接: https://arxiv.org/abs/2604.05113
作者: Zezhong Fan,Ziheng Chen,Luyi Ma,Jin Huang,Lalitesh Morishetti,Kaushiki Nag,Sushant Kumar,Kannan Achan
机构: Walmart Global Tech (沃尔玛全球科技); Stony Brook University (石溪大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Generative Recommendation

点击查看摘要

Abstract:Generative recommendation (GeneRec) has introduced a new paradigm that represents items as discrete semantic tokens and predicts items in a generative manner. Despite its strong performance across multiple recommendation tasks, existing GeneRec approaches still suffer from severe popularity bias and may even exacerbate it. In this work, we conduct a comprehensive empirical analysis to uncover the root causes of this phenomenon, yielding two core insights: 1) imbalanced tokenization inherits and can further amplify popularity bias from historical item interactions; 2) current training procedures disproportionately favor popular tokens while neglecting semantic relationships among tokens, thereby intensifying popularity bias. Building on these insights, we propose CRAB, a post-hoc debiasing strategy for GeneRec that alleviates popularity bias by mitigating frequency imbalance among semantic tokens. Specifically, given a well-trained model, we first rebalance the codebook by splitting over-popular tokens while preserving their hierarchical semantic structure. Based on the adjusted codebook, we further introduce a tree-structured regularizer to enhance semantic consistency, encouraging more informative representations for unpopular tokens during training. Experiments on real-world datasets demonstrate that CRAB significantly improves recommendation performance by effectively alleviating popularity bias.

[IR-25] Document Optimization for Black-Box Retrieval via Reinforcement Learning

【速读】:该论文旨在解决传统文档扩展(document expansion)方法在现代检索模型中性能下降的问题,即文档扩展常引入噪声,掩盖了区分性信号。其核心解决方案是将文档扩展重构为一个文档优化(document optimization)问题:利用语言模型或视觉语言模型,在仅需黑盒访问检索排名的情况下,通过GRPO(Generalized Reward Policy Optimization)算法对文档进行微调,使其表示更贴近目标检索器预期的查询分布,奖励信号来自检索器排名的提升。该方法适用于单向量、多向量和词汇检索器,实验证明其能显著提升代码检索与视觉文档检索(VDR)任务的性能,并使小型高效检索器超越大型模型。

链接: https://arxiv.org/abs/2604.05087
作者: Omri Uzan,Ron Polonsky,Douwe Kiela,Christopher Potts
机构: Stanford University (斯坦福大学); ContextualAI
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Document expansion is a classical technique for improving retrieval quality, and is attractive since it shifts computation offline, avoiding additional query-time processing. However, when applied to modern retrievers, it has been shown to degrade performance, often introducing noise that obfuscates the discriminative signal. We recast document expansion as a document optimization problem: a language model or a vision language model is fine-tuned to transform documents into representations that better align with the expected query distribution under a target retriever, using GRPO with the retriever’s ranking improvements as rewards. This approach requires only black-box access to retrieval ranks, and is applicable across single-vector, multi-vector and lexical retrievers. We evaluate our approach on code retrieval and visual document retrieval (VDR) tasks. We find that learned document transformations yield retrieval gains and in many settings enable smaller, more efficient retrievers to outperform larger ones. For example, applying document optimization to the OpenAI text-embedding-3-small model improves nDCG@5 on code (58.7 to 66.8) and VDR (53.3 to 57.6), even slightly surpassing the 6.5X more expensive OpenAI text-embedding-3-large model (66.3 on code; 57.0 on VDR). When retriever weights are accessible, document optimization is often competitive with fine-tuning, and in most settings their combination performs best, improving Jina-ColBERT-V2 from 55.8 to 63.3 on VDR and from 48.6 to 61.8 on code retrieval.
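摘要所述"仅需黑盒访问检索排名"的奖励信号,可以用倒数排名(reciprocal rank)的提升量来示意。下面的奖励定义为本文假设,论文中的具体设计可能不同:

```python
def rank_reward(rank_before, rank_after):
    # 相关文档在文档改写后的排名提升越多,奖励越大(倒数排名之差)
    return 1.0 / rank_after - 1.0 / rank_before

# 例如文档从第 5 名升到第 2 名:
# rank_reward(5, 2) == 1/2 - 1/5 == 0.3,可作为 GRPO 的标量回报;
# 排名下降则给出负奖励,排名不变奖励为 0
```

这种奖励只依赖目标检索器返回的排名,不需要访问其梯度或权重,因而对单向量、多向量和词汇检索器都适用。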

[IR-26] Evaluation of Embedding-Based and Generative Methods for LLM-Driven Document Classification: Opportunities and Challenges

【速读】:该论文旨在解决地质科学技术文档分类任务中模型性能与计算效率之间的权衡问题。其解决方案的关键在于对比嵌入式模型(embedding-based models)与生成式视觉-语言模型(Vision-Language Models, VLMs)在零样本(zero-shot)场景下的表现差异,并发现通过链式思维(Chain-of-Thought, CoT)提示策略增强的生成式VLM(如Qwen2.5-VL)能够显著提升分类准确率(达到82%),优于当前最先进的多模态嵌入模型(如QQMM,准确率为63%)。此外,研究还指出监督微调(Supervised Fine-Tuning, SFT)虽可进一步提升VLM性能,但对训练数据不平衡敏感,这为实际应用中的模型选择和优化提供了重要依据。

链接: https://arxiv.org/abs/2604.04997
作者: Rong Lu,Hao Liu,Song Hou
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at the IMAGE’25 Workshop (PCW-11), Society of Exploration Geophysicists (SEG). Published version available at this https URL

点击查看摘要

Abstract:This work presents a comparative analysis of embedding-based and generative models for classifying geoscience technical documents. Using a multi-disciplinary benchmark dataset, we evaluated the trade-offs between model accuracy, stability, and computational cost. We find that generative Vision-Language Models (VLMs) like Qwen2.5-VL, enhanced with Chain-of-Thought (CoT) prompting, achieve superior zero-shot accuracy (82%) compared to state-of-the-art multimodal embedding models like QQMM (63%). We also demonstrate that while supervised fine-tuning (SFT) can improve VLM performance, it is sensitive to training data imbalance.

[IR-27] CURE: Circuit-Aware Unlearning for LLM-based Recommendation

【速读】:该论文旨在解决大语言模型推荐系统(LLMRec)中因隐私法规趋严而带来的用户数据删除需求,即如何在不损害模型性能的前提下实现高效、稳定的遗忘(unlearning)机制。现有方法通常将遗忘与保留目标以加权形式统一更新模型参数,导致梯度冲突,进而引发优化不稳定或模型效用严重下降的问题,且整个过程缺乏透明性。其解决方案的关键在于提出一种电路感知的遗忘框架CURE(Circuit-aware Unlearning),通过将模型组件解耦为功能独立的子集(称为“电路”,即因果上负责特定任务行为的计算子图),并基于模块对遗忘和保留目标的不同贡献进行分类(遗忘特有、保留特有、任务共享),进而采用针对性的更新规则来缓解梯度冲突,从而实现更有效且可解释的遗忘过程。

链接: https://arxiv.org/abs/2604.04982
作者: Ziheng Chen,Jiali Cheng,Zezhong Fan,Hadi Amiri,Yunzhi Yao,Xiangguo Sun,Yang Zhang
机构: Walmart Global Tech(沃尔玛全球科技); University of Massachusetts Lowell(马萨诸塞大学洛厄尔分校); Zhejiang University(浙江大学); The Chinese University of Hong Kong(香港中文大学); National University of Singapore(新加坡国立大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have opened new opportunities for recommender systems by enabling rich semantic understanding and reasoning about user interests and item attributes. However, as privacy regulations tighten, incorporating user data into LLM-based recommendation (LLMRec) introduces significant privacy risks, making unlearning algorithms increasingly crucial for practical deployment. Despite growing interest in LLMRec unlearning, most existing approaches formulate unlearning as a weighted combination of forgetting and retaining objectives while updating model parameters in a uniform manner. Such formulations inevitably induce gradient conflicts between the two objectives, leading to unstable optimization and resulting in either ineffective unlearning or severe degradation of model utility. Moreover, the unlearning procedure remains largely black-box, undermining its transparency and trustworthiness. To tackle these challenges, we propose CURE, a circuit-aware unlearning framework that disentangles model components into functionally distinct subsets and selectively updates them. Here, a circuit refers to a computational subgraph that is causally responsible for task-specific behaviors. Specifically, we extract the core circuits underlying item recommendation and analyze how individual modules within these circuits contribute to the forget and retain objectives. Based on this analysis, these modules are categorized into forget-specific, retain-specific, and task-shared groups, each subject to function-specific update rules to mitigate gradient conflicts during unlearning. Experiments on real-world datasets show that our approach achieves more effective unlearning than existing baselines.

[IR-28] Tencent Advertising Algorithm Challenge 2025: All-Modality Generative Recommendation

【速读】:该论文旨在解决工业广告场景下生成式推荐(Generative Recommendation, GR)缺乏大规模、真实且全模态数据基准的问题。当前虽已有针对多模态推荐的数据集,但尚未有专门面向工业级GR任务的公开benchmark,限制了该方向的研究进展。解决方案的关键在于构建并发布两个基于真实去标识化腾讯广告日志的全模态数据集——TencentGR-1M与TencentGR-10M,其包含丰富的协同标识符(collaborative identifiers)和多模态内容表示,并明确区分点击(click)与转化(conversion)事件,支持序列级和目标级的细粒度建模。此外,论文还提供了统一的任务定义、特征结构、基线模型及加权评估协议,以推动工业规模下全模态生成式推荐的研究发展。

链接: https://arxiv.org/abs/2604.04976
作者: Junwei Pan,Wei Xue,Chao Zhou,Xing Zhou,Lunan Fan,Yanbo Wang,Haoran Xin,Zhiyu Hu,Yaozheng Wang,Fengye Xu,Yurong Yang,Xiaotian Li,Junbang Huo,Wentao Ning,Yuliang Sun,Chengguo Yin,Jun Zhang,Shudong Huang,Lei Xiao,Huan Yu,Irwin King,Haijie Gu,Jie Jiang
机构: Tencent Inc.(腾讯公司); CUHK(香港中文大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Generative recommender systems are rapidly emerging as a new paradigm for recommendation, where collaborative identifiers and/or multi-modal content are mapped into discrete token spaces and user behavior is modelled with autoregressive sequence models. Despite progress on multi-modal recommendation datasets, there is still a lack of public benchmarks that jointly offer large-scale, realistic and fully all-modality data designed specifically for generative recommendation (GR) in industrial advertising. To foster research in this direction, we organised the Tencent Advertising Algorithm Challenge 2025, a global competition built on top of two all-modality datasets for GR: TencentGR-1M and TencentGR-10M. Both datasets are constructed from real de-identified Tencent Ads logs and contain rich collaborative IDs and multi-modal representations extracted with state-of-the-art embedding models. The preliminary track (TencentGR-1M) provides 1 million user sequences with up to 100 interacted items each, where each interaction is labeled with exposure and click signals, while the final track (TencentGR-10M) scales this to 10 million users and explicitly distinguishes between click and conversion events at both the sequence and target level. This paper presents the task definition, data construction process, feature schema, baseline GR model, evaluation protocol, and key findings from top-ranked and award-winning solutions. Our datasets focus on multi-modal sequence generation in an advertising setting and introduce weighted evaluation for high-value conversion events. We release our datasets at this https URL and baseline implementations at this https URL to enable future research on all-modality generative recommendation at an industrial scale. The official website is this https URL.

[IR-29] MG²-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂跨模态推理任务中因检索增强生成(Retrieval-Augmented Generation, RAG)系统局限性导致的幻觉问题与结构化推理能力不足的问题。现有方法中,扁平向量检索忽略模态间结构依赖关系,而基于“翻译到文本”管道的图结构方法则因计算开销大且丢失细粒度视觉信息难以有效支持多跳推理。解决方案的关键在于提出一种轻量级的多粒度图检索框架 MG²-RAG,其核心创新包括:通过轻量文本解析与实体驱动的视觉定位构建分层多模态知识图谱,将文本实体与视觉区域融合为保留原子证据的统一多模态节点;在此基础上设计多粒度图检索机制,聚合密集相似性并跨图传播相关性,从而支持结构化的多跳推理。实验表明,MG²-RAG 在四项代表性多模态任务中均达到最优性能,同时相较先进图方法实现平均43.3倍的速度提升和23.9倍的成本降低。

链接: https://arxiv.org/abs/2604.04969
作者: Sijun Dai,Qiang Huang,Xiaoxing You,Jun Yu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) mitigates hallucinations in Multimodal Large Language Models (MLLMs), yet existing systems struggle with complex cross-modal reasoning. Flat vector retrieval often ignores structural dependencies, while current graph-based methods rely on costly “translation-to-text” pipelines that discard fine-grained visual information. To address these limitations, we propose MG²-RAG, a lightweight Multi-Granularity Graph RAG framework that jointly improves graph construction, modality fusion, and cross-modal retrieval. MG²-RAG constructs a hierarchical multimodal knowledge graph by combining lightweight textual parsing with entity-driven visual grounding, enabling textual entities and visual regions to be fused into unified multimodal nodes that preserve atomic evidence. Building on this representation, we introduce a multi-granularity graph retrieval mechanism that aggregates dense similarities and propagates relevance across the graph to support structured multi-hop reasoning. Extensive experiments across four representative multimodal tasks (i.e., retrieval, knowledge-based VQA, reasoning, and classification) demonstrate that MG²-RAG consistently achieves state-of-the-art performance while reducing graph construction overhead, with an average 43.3× speedup and 23.9× cost reduction compared with advanced graph-based frameworks.

[IR-30] Generative AI for Video Trailer Synthesis: From Extractive Heuristics to Autoregressive Creativity WSDM2026

【速读】:该论文旨在解决自动视频预告片生成领域从基于启发式提取方法向深度生成合成范式的转变问题,核心挑战在于如何从原始视频中不仅精准识别关键片段,还能构建连贯且具有情感共鸣的叙事结构。其解决方案的关键在于利用生成式AI技术,特别是大语言模型(LLM)、多模态大语言模型(MLLM)以及基于扩散机制的视频合成模型,实现从传统基于低级特征工程和规则驱动的方法到可控生成编辑与语义重构的跨越;文中重点分析了自图卷积网络(GCN)到预告片生成Transformer(TGT)的架构演进,并提出面向基础模型时代的全新分类体系,为未来AI驱动的预告片生成系统指明了方向。

链接: https://arxiv.org/abs/2604.04953
作者: Abhishek Dharmaratnakar,Srivaths Ranganathan,Debanshu Das,Anushree Sinha
机构: Google LLC(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: 7 pages, 3 figures, accepted in WSDM 2026

点击查看摘要

Abstract:The domain of automatic video trailer generation is currently undergoing a profound paradigm shift, transitioning from heuristic-based extraction methods to deep generative synthesis. While early methodologies relied heavily on low-level feature engineering, visual saliency, and rule-based heuristics to select representative shots, recent advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), and diffusion-based video synthesis have enabled systems that not only identify key moments but also construct coherent, emotionally resonant narratives. This survey provides a comprehensive technical review of this evolution, with a specific focus on generative techniques including autoregressive Transformers, LLM-orchestrated pipelines, and text-to-video foundation models like OpenAI’s Sora and Google’s Veo. We analyze the architectural progression from Graph Convolutional Networks (GCNs) to Trailer Generation Transformers (TGT), evaluate the economic implications of automated content velocity on User-Generated Content (UGC) platforms, and discuss the ethical challenges posed by high-fidelity neural synthesis. By synthesizing insights from recent literature, this report establishes a new taxonomy for AI-driven trailer generation in the era of foundation models, suggesting that future promotional video systems will move beyond extractive selection toward controllable generative editing and semantic reconstruction of trailers.

[IR-31] Learning to Retrieve from Agent Trajectories

【速读】:该论文旨在解决传统信息检索(Information Retrieval, IR)系统在面向生成式 AI(Generative AI)驱动的搜索代理(search agent)场景下存在的根本性不匹配问题:现有基于人类交互日志(如点击和停留时间)训练的排序模型无法有效适配代理用户的查询行为与结果消费模式,导致检索性能下降。解决方案的关键在于提出一种新的训练范式——从代理轨迹中学习(Learning to Retrieve from Agent Trajectories, LRAT),其核心是通过挖掘多步代理交互轨迹中的关键行为信号(如浏览动作、未浏览拒绝及浏览后推理痕迹)来构建高质量的监督信号,并引入相关性强度加权优化机制,从而直接利用代理数据训练更契合代理任务需求的检索模型。实验证明,LRAT 在多个领域内和跨域深度研究基准上均显著提升证据召回率、端到端任务成功率和执行效率。

链接: https://arxiv.org/abs/2604.04949
作者: Yuqi Zhou,Sunhao Dai,Changle Qu,Liang Pang,Jun Xu,Ji-Rong Wen
机构: Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院); CAS Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所人工智能安全重点实验室)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Information retrieval (IR) systems have traditionally been designed and trained for human users, with learning-to-rank methods relying heavily on large-scale human interaction logs such as clicks and dwell time. With the rapid emergence of large language model (LLM) powered search agents, however, retrieval is increasingly consumed by agents rather than human beings, and is embedded as a core component within multi-turn reasoning and action loops. In this setting, retrieval models trained under human-centric assumptions exhibit a fundamental mismatch with the way agents issue queries and consume results. In this work, we argue that retrieval models for agentic search should be trained directly from agent interaction data. We introduce learning to retrieve from agent trajectories as a new training paradigm, where supervision is derived from multi-step agent interactions. Through a systematic analysis of search agent trajectories, we identify key behavioral signals that reveal document utility, including browsing actions, unbrowsed rejections, and post-browse reasoning traces. Guided by these insights, we propose LRAT, a simple yet effective framework that mines high-quality retrieval supervision from agent trajectories and incorporates relevance intensity through weighted optimization. Extensive experiments on both in-domain and out-of-domain deep research benchmarks demonstrate that retrievers trained with LRAT consistently improve evidence recall, end-to-end task success, and execution efficiency across diverse agent architectures and scales. Our results highlight agent trajectories as a practical and scalable supervision source, pointing to a promising direction for retrieval in the era of agentic search.

[IR-32] From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中文档预处理质量对下游问答准确率影响缺乏系统评估的问题。其解决方案的关键在于通过对比四种开源PDF转Markdown框架(Docling、MinerU、Marker和DeepSeek OCR)在19种不同管道配置下的表现,发现数据准备阶段的细节——特别是元数据增强和层次感知的文本分块策略——对最终性能的影响显著大于转换工具本身的选择;其中,基于字体层级重建的方法优于大语言模型(LLM)驱动的方法,且最优配置(Docling + 层级分割 + 图像描述)实现了94.1%的自动化准确率,接近人工标注基准(97.1%),凸显了高质量数据预处理在RAG系统中的主导作用。

链接: https://arxiv.org/abs/2604.04948
作者: José Guilherme Marques dos Santos,Ricardo Yang,Rui Humberto Pereira,Alexandre Sousa,Brígida Mónica Faria,Henrique Lopes Cardoso,José Duarte,José Luís Reis,Luís Paulo Reis,Pedro Pimenta,José Paulo Marques dos Santos
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 21 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks, Docling, MinerU, Marker, and DeepSeek OCR, across 19 pipeline configurations for extracting text and other contents from PDFs, varying the conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. Evaluation was performed using a manually curated 50-question benchmark over a corpus of 36 Portuguese administrative documents (1,706 pages, ~492K words), with LLM-as-judge scoring averaged over 10 runs. Two baselines bounded the results: naïve PDFLoader (86.9%) and manually curated Markdown (97.1%). Docling with hierarchical splitting and image descriptions achieved the highest automated accuracy (94.1%). Metadata enrichment and hierarchy-aware chunking contributed more to accuracy than the conversion framework choice alone. Font-based hierarchy rebuilding consistently outperformed LLM-based approaches. An exploratory GraphRAG implementation scored only 82%, underperforming basic RAG, suggesting that naïve knowledge graph construction without ontological guidance does not yet justify its added complexity. These findings demonstrate that data preparation quality is the dominant factor in RAG system performance.
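针对摘要中"基于字体的层级重建持续优于基于 LLM 的方法"这一结论,下面给出一个层级感知分块的最小 Python 草图:按字号把文本片段分为标题与正文,并把标题路径记入块的元数据。字号阈值与输入结构均为本文整理时的假设,仅用于说明思路:

```python
# 示意:基于字体大小重建标题层级,并做层级感知分块。
# spans: [(font_size, text), ...],按阅读顺序排列。

def chunk_by_font_hierarchy(spans, heading_sizes=(18, 14)):
    """字号 >= heading_sizes[0] 视为一级标题,>= heading_sizes[1] 视为二级标题,
    其余为正文;每遇到标题就结束当前块,并把标题路径作为块的元数据。"""
    chunks, path, body = [], [], []

    def flush():
        if body:
            chunks.append({"headings": list(path), "text": " ".join(body)})
            body.clear()

    for size, text in spans:
        if size >= heading_sizes[0]:
            flush(); path[:] = [text]              # 新的一级标题,重置路径
        elif size >= heading_sizes[1]:
            flush(); path[:] = path[:1] + [text]   # 二级标题挂在一级标题之下
        else:
            body.append(text)
    flush()
    return chunks
```

块中附带的标题路径即论文所说的元数据增强(metadata enrichment)在检索时发挥作用的入口。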

[IR-33] SUMMIR: A Hallucination-Aware Framework for Ranking Sports Insights from LLMs

【速读】:该论文旨在解决从在线体育新闻文章中自动提取赛前与赛后有意义洞察(insights)的问题,以提升用户参与度和理解力。其解决方案的关键在于构建一个包含7,900篇覆盖四大主流运动(板球、足球、篮球、棒球)800场比赛的高质量数据集,并采用两阶段验证流程结合开源与专有大语言模型(LLMs)确保上下文相关性;随后利用多种前沿LLM(如GPT-4o、Llama-3.3-70B-Instruct等)生成综合洞察,并通过基于FactScore的事实准确性评估及SummaC框架进行幻觉检测;最终提出SUMMIR(Sentence Unified Multimetric Model for Importance Ranking),一种用于根据用户兴趣对洞察进行重要性排序的新架构,从而实现高准确率且个性化的洞察生成。

链接: https://arxiv.org/abs/2604.04947
作者: Nitish Kumar,Sannu Kumar,S Akash,Manish Gupta,Ankith Karat,Sriparna Saha
机构: Indian Institute of Technology Patna (印度理工学院巴特那分校)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid proliferation of online sports journalism, extracting meaningful pre-game and post-game insights from articles is essential for enhancing user engagement and comprehension. In this paper, we address the task of automatically extracting such insights from articles published before and after matches. We curate a dataset of 7,900 news articles covering 800 matches across four major sports: Cricket, Soccer, Basketball, and Baseball. To ensure contextual relevance, we employ a two-step validation pipeline leveraging both open-source and proprietary large language models (LLMs). We then utilize multiple state-of-the-art LLMs (GPT-4o, Qwen2.5-72B-Instruct, Llama-3.3-70B-Instruct, and Mixtral-8x7B-Instruct-v0.1) to generate comprehensive insights. The factual accuracy of these outputs is rigorously assessed using a FactScore-based methodology, complemented by hallucination detection via the SummaC (Summary Consistency) framework with GPT-4o. Finally, we propose SUMMIR (Sentence Unified Multimetric Model for Importance Ranking), a novel architecture designed to rank insights based on user-specific interests. Our results demonstrate the effectiveness of this approach in generating high-quality, relevant insights, while also revealing significant differences in factual consistency and interestingness across LLMs. This work contributes a robust framework for automated, reliable insight generation from sports news content. The source code is available at this https URL.
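SUMMIR 本身是一个学习得到的多指标重要性排序模型;这里仅用一个假设性的线性加权草图说明"按用户兴趣对洞察做多指标排序"的接口形式,指标名与权重均为本文整理时的示例,并非论文原始实现:

```python
# 示意:对候选洞察按多指标加权打分并排序。
# "fact"(事实一致性)、"interest"(趣味性)等指标名为假设,
# weights 可按用户兴趣调整,体现"user-specific interests"的排序思路。

def rank_insights(insights, weights):
    """insights: [{"text": str, "scores": {指标名: float}}, ...]
    weights:  {指标名: float};返回按加权总分降序排列的洞察列表。"""
    def score(item):
        return sum(weights.get(m, 0.0) * v for m, v in item["scores"].items())
    return sorted(insights, key=score, reverse=True)
```

调整 `weights` 即可在"重事实"与"重趣味"之间切换排序结果。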

[IR-34] Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中因文档切分策略不当而导致的高Token消耗、冗余文本生成、可扩展性差及调试困难等问题,尤其是在大规模网络内容摄入场景下。其解决方案的关键在于提出一种面向网络文档的检索感知切分方法(Web Retrieval-Aware Chunking, W-RAC),通过将文本提取与语义切分规划解耦,以结构化、ID可寻址的单元表示解析后的网页内容,并仅在检索感知的分组决策阶段调用大语言模型(Large Language Models, LLMs),从而显著降低LLM使用成本、消除幻觉风险,并提升系统的可分析性和架构效率。

链接: https://arxiv.org/abs/2604.04936
作者: Uday Allu,Sonu Kedia,Tanmay Odapally,Biddwan Ahmed
机构: Yellow.ai (Yellow.ai)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 13 pages, 9 tables, 0 figures

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems critically depend on effective document chunking strategies to balance retrieval quality, latency, and operational cost. Traditional chunking approaches, such as fixed-size, rule-based, or fully agentic chunking, often suffer from high token consumption, redundant text generation, limited scalability, and poor debuggability, especially for large-scale web content ingestion. In this paper, we propose Web Retrieval-Aware Chunking (W-RAC), a novel, cost-efficient chunking framework designed specifically for web-based documents. W-RAC decouples text extraction from semantic chunk planning by representing parsed web content as structured, ID-addressable units and leveraging large language models (LLMs) only for retrieval-aware grouping decisions rather than text generation. This significantly reduces token usage, eliminates hallucination risks, and improves system debuggability. Our analysis and architectural comparison demonstrate that W-RAC achieves comparable or better retrieval performance than traditional chunking approaches while reducing chunking-related LLM costs by an order of magnitude.
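W-RAC 的核心设计是:LLM 只产出"由哪些块 ID 组成一个 chunk"的分组决策,正文由本地按 ID 拼装,因而不生成文本、也就没有幻觉空间。下面用 Python 草图示意这一接口;其中 `plan_groups` 用启发式桩函数代替真实的 LLM 调用,数据结构均为本文整理时的假设:

```python
# 示意:ID 可寻址单元 + 仅负责分组决策的规划器(此处为桩函数)。

def plan_groups(units, max_units=2):
    """桩函数:代替 LLM 的"检索感知分组决策",只返回 ID 列表的列表,
    从不产出正文文本。真实系统中这里是一次低成本的 LLM 调用。"""
    ids = [u["id"] for u in units]
    return [ids[i:i + max_units] for i in range(0, len(ids), max_units)]

def assemble_chunks(units, groups):
    """按 ID 在本地拼装 chunk 正文;文本一字不改地来自解析结果。"""
    by_id = {u["id"]: u["text"] for u in units}
    return ["\n".join(by_id[i] for i in g) for g in groups]
```

由于 LLM 的输出仅是 ID 分组,token 消耗与分组规模成正比,而与正文长度无关,这正是摘要所述成本下降一个数量级的来源。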

人机交互

[HC-0] MAESTRO: Adapting GUIs and Guiding Navigation with User Preferences in Conversational Agents with GUIs

【速读】:该论文旨在解决任务导向型对话系统中用户偏好未能被有效利用的问题,特别是在多步骤、依赖决策的任务(如订票或预订餐厅)中,早期选择会限制后续选项,导致用户需重新开始。现有代理仅能执行线性流程,缺乏对用户偏好的系统性建模与动态响应能力。解决方案的关键在于提出MAESTRO框架,其核心是构建一个共享的偏好记忆(preference memory),从自然语言中提取带强度的偏好信息,并通过两种机制实现决策支持:一是基于偏好的GUI自适应(Preference-Grounded GUI Adaptation),利用增补、排序、过滤和高亮等原位操作优化界面呈现以支持阶段内比较;二是偏好引导的工作流导航(Preference-Guided Workflow Navigation),检测偏好与可用选项间的冲突,主动建议回溯路径并记录失败路径避免重复探索死胡同。

链接: https://arxiv.org/abs/2604.06134
作者: Sangwook Lee,Sang Won Lee,Adnan Abbas,Young-Ho Kim,Yan Chen
机构: Virginia Tech (弗吉尼亚理工大学); NAVER AI Lab (NAVER人工智能实验室)
类目: Human-Computer Interaction (cs.HC)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Modern task-oriented chatbots present GUI elements alongside natural-language dialogue, yet the agent’s role has largely been limited to interpreting natural-language input as GUI actions and following a linear workflow. In preference-driven, multi-step tasks such as booking a flight or reserving a restaurant, earlier choices constrain later options and may force users to restart from scratch. User preferences serve as the key criteria for these decisions, yet existing agents do not systematically leverage them. We present MAESTRO, which extends the agent’s role from execution to decision support. MAESTRO maintains a shared preference memory that extracts preferences from natural-language utterances with their strength, and provides two mechanisms. Preference-Grounded GUI Adaptation applies in-place operators (augment, sort, filter, and highlight) to the existing GUI according to preference strength, supporting within-stage comparison. Preference-Guided Workflow Navigation detects conflicts between preferences and available options, proposes backtracking, and records failed paths to avoid revisiting dead ends. We evaluated MAESTRO in a movie-booking Conversational Agent with GUI (CAG) through a within-subjects study with two conditions (Baseline vs. MAESTRO) and two modes (Text vs. Voice), with N = 33 participants.
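摘要中"按偏好强度对既有 GUI 就地应用操作符"的机制,可以用下面的 Python 草图粗略示意(实现 filter/sort/highlight 三个操作符,augment 从略;数据结构、字段名与强度阈值均为本文整理时的假设,并非 MAESTRO 的原始实现):

```python
# 示意:偏好驱动的 GUI 自适应 —— 强偏好过滤选项,弱偏好用于排序与高亮。

def adapt_options(options, prefs, hard_threshold=0.8):
    """options: [{"name": str, "attrs": set}, ...]
    prefs:   {属性: 强度},强度 ∈ [0, 1]。"""
    hard = {a for a, s in prefs.items() if s >= hard_threshold}
    kept = [o for o in options if hard <= o["attrs"]]   # filter:强偏好必须满足

    def match(o):
        return sum(prefs.get(a, 0.0) for a in o["attrs"])

    kept.sort(key=match, reverse=True)                  # sort:按偏好匹配度排序
    for o in kept:                                      # highlight:标记匹配项
        o["highlight"] = match(o) > 0
    return kept
```

这类就地操作不替换原界面,只调整其呈现,从而支持摘要所述的"阶段内比较"。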

[HC-1] Understanding Educators' Perceptions of AI-generated Non-consensual Intimate Imagery

【速读】:该论文旨在解决学校在应对由生成式 AI (Generative AI) 生成的非 consensual intimate imagery (AIG-NCII) 方面缺乏有效支持体系的问题。研究表明,当前学校在政策制定、教师培训和资源配备等方面存在明显不足,且教育者对 AIG-NCII 的风险认知不充分,导致实践措施有限。解决方案的关键在于构建多利益相关方协同策略,包括开发面向教育场景的交互式工具、设计系统性课程内容以及制定明确的校内政策与法律边界指引,从而提升师生应对能力并形成预防与干预机制。

链接: https://arxiv.org/abs/2604.06131
作者: Tongxin Li,Katelyn M Reyes,Liezeil Jimenez,Katie S Nam,Donghee Yvette Wohn
机构: New Jersey Institute of Technology (新泽西理工学院)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:AI-generated non-consensual intimate imagery (AIG-NCII) is an emerging social problem due to the advancement of AI tools. While recent incidents in middle and high schools have highlighted the urgency of this issue, there is limited understanding of what concrete supports schools need to effectively address AIG-NCII. To fill this gap, we conducted an interview study with 20 educators in the U.S. and investigated their attitudes, experiences, and practices related to AIG-NCII. Educators expressed concerns about both students’ and their own vulnerability, as AIG-NCII may cause moral decline among students, while educators themselves could become victims. Nevertheless, existing practices in schools are limited, and they lack both training and systematic policies. Challenges such as a lack of resources, unclear legal boundaries, and limited knowledge of AI make implementation difficult. The findings of this paper contribute to interactive educational tool design, curriculum design, and policy-making, especially regarding the need for multi-stakeholder strategies to address issues surrounding AIG-NCII.

[HC-2] UI Placement as a Critical Design Factor for Augmented Reality During Locomotion

【速读】:该论文试图解决的问题是:当前增强现实(Augmented Reality, AR)交互研究往往孤立地评估交互技术,忽视了用户在移动过程中界面(User Interface, UI)放置位置对交互性能的根本影响。解决方案的关键在于重新定义UI放置的概念,超越传统的固定锚定视角,探索针对特定UI放置位置设计的新型交互技术,并将UI放置作为实验研究中的独立变量进行严格评估。通过聚焦用户与界面之间的相对运动关系,可实现更有效的动态场景下AR交互。

链接: https://arxiv.org/abs/2604.06102
作者: Pavel Manakhov,Hans Gellersen
机构: Lancaster University (兰卡斯特大学); Aarhus University (奥胡斯大学)
类目: Human-Computer Interaction (cs.HC)
备注: 4 pages, 2 figures

点击查看摘要

Abstract:Wearable augmented reality (AR) represents the next interface to all things computing, extending what smartphones and laptops can do. This involves providing access to digital information during activities like walking or jogging. In this work we argue that the impact of physical movement on AR interaction is not direct, but mediated by UI placement - the spatial relationship between the user and the interface. Current research often treats interaction techniques in isolation, overlooking how their performance is fundamentally linked to where the UI is placed. This position paper highlights the need to reconceptualize UI placement beyond traditional anchoring views, explore novel interaction techniques designed for specific UI placements during locomotion, and rigorously evaluate UI placement as an independent variable in experimental studies. By centering the analysis on the relative movement between user and interface, we can unlock more effective on-the-go AR interaction.

[HC-3] Intuitive Human-Robot Interaction: Development and Evaluation of a Gesture-Based User Interface for Object Selection

【速读】:该论文旨在解决人机交互中如何通过自然手势实现高效对象选择的问题,特别是在人-机器人协作场景下提升交互效率。其解决方案的关键在于设计了一种基于指向(pointing)和点击(click)手势的用户界面,通过实验验证了该方法在准确性与选择时间上的有效性,表明手势交互可作为高效的人机协同手段。

链接: https://arxiv.org/abs/2604.06073
作者: Bijan Kavousian,Oliver Petrovic,Werner Herfs
机构: 未知
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: This submission contains both an English translation and the original German version. The German version was originally published in the Proceedings of the 72nd GfA Conference (2026)

点击查看摘要

Abstract:Gestures are a natural form of communication between humans and can also be leveraged for human-robot interaction. This work presents a gesture-based user interface for object selection using pointing and click gestures. An experiment with 20 participants evaluates accuracy and selection time, demonstrating the potential for efficient collaboration.

[HC-4] Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在人格特质(personality traits)条件化生成中缺乏心理测量学有效性验证的问题,即现有评估方法主要依赖模型自评问卷,且未充分考虑架构多样性与真实人类心理数据的使用,导致难以判断人格条件化是否真正生成具有个体差异信息的心理学意义文本。解决方案的关键在于:首先,基于290名参与者的真实心理测量人格数据(psychometric profiles)对LLM进行人格条件化,生成第一人称生活叙事;其次,利用独立的LLM评分器从这些叙事中恢复人格分数,结果显示恢复相关性达到人类测试-重测信度水平(平均r = 0.750,占人类上限的85%),且该结果在10种不同生成器和3种评分器间保持稳健;进一步分析表明,评分模型虽需纠正因条件化带来的默认偏差(alignment-induced defaults),但仍能准确还原人格特征,并且生成内容在行为层面与真实对话数据高度一致,证明预训练阶段捕捉到的人格-语言关系可支持个体差异的稳定编码与解码,包括典型的情绪反应模式。

链接: https://arxiv.org/abs/2604.06071
作者: Ben Wigler,Maria Tsfasman,Tiffany Matej Hrkalovic
机构: LoveMind AI; Jheronimus Academy of Data Science
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Under review at COLM

点击查看摘要

Abstract:Personality traits are richly encoded in natural language, and large language models (LLMs) trained on human text can simulate personality when conditioned on persona descriptions. However, existing evaluations rely predominantly on questionnaire self-report by the conditioned model, are limited in architectural diversity, and rarely use real human psychometric data. Without addressing these limitations, it remains unclear whether personality conditioning produces psychometrically informative representations of individual differences or merely superficial alignment with trait descriptors. To test how robustly LLMs can encode personality into extended text, we condition LLMs on real psychometric profiles from 290 participants to generate first-person life story narratives, and then task independent LLMs to recover personality scores from those narratives alone. We show that personality scores can be recovered from the generated narratives at levels approaching human test-retest reliability (mean r = 0.750, 85% of the human ceiling), and that recovery is robust across 10 LLM narrative generators and 3 LLM personality scorers spanning 6 providers. Decomposing systematic biases reveals that scoring models achieve their accuracy while counteracting alignment-induced defaults. Content analysis of the generated narratives shows that personality conditioning produces behaviourally differentiated text: nine of ten coded features correlate significantly with the same features in participants’ real conversations, and personality-driven emotional reactivity patterns in narratives replicate in real conversational data. These findings provide evidence that the personality-language relationship captured during pretraining supports robust encoding and decoding of individual differences, including characteristic emotional variability patterns that replicate in real human behaviour.

[HC-5] Governance and Regulation of Artificial Intelligence in Developing Countries: A Case Study of Nigeria

【速读】:该论文旨在解决发展中国家在人工智能(Artificial Intelligence, AI)治理方面存在的法律与制度困境,特别是以尼日利亚为例,探讨法律专业人士对AI伦理风险、监管空白和机构准备度的认知。研究发现,当前主要问题包括数据隐私风险突出、缺乏可执行的法律框架,以及本地机构能力不足;解决方案的关键在于构建基于本土情境的治理模型,而非直接套用国外框架,强调监管措施需具备情境适配性、包容性和跨全球伦理原则与本地现实的桥梁作用,从而为政策制定者和研究者提供可行路径,推动负责任的AI治理实践落地。

链接: https://arxiv.org/abs/2604.06018
作者: Uloma Okoro,Tammy Mckenzie,Branislav Radeljic
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This study examines the perception of legal professionals on the governance of AI in developing countries, using Nigeria as a case study. The study focused on ethical risks, regulatory gaps, and institutional readiness. The study adopted a qualitative case study design. Data were collected through 27 semi-structured interviews with legal practitioners in Nigeria. A focus group discussion was also held with seven additional legal practitioners across sectors such as finance, insurance, and corporate law. Thematic analysis was employed to identify key patterns in participant responses. Findings showed that there were concerns about data privacy risks and the lack of enforceable legal frameworks. Participants expressed limited confidence in institutional capacity and emphasized the need for locally adapted governance models rather than direct adoption of foreign frameworks. While some expressed optimism about AI’s potential, this was conditional on the presence of strong legal oversight and public accountability. The study contributes to the growing discourse on AI governance in developing countries by focusing on the perspectives of legal professionals. It highlights the importance of regulatory approaches that are context-specific, inclusive, and capable of bridging the gap between global ethical principles and local realities. These insights offer practical guidance for policymakers, regulators, and scholars working to shape responsible AI governance in similar environments.

[HC-6] Designing Around Stigma: Human-Centered LLMs for Menstrual Health

【速读】:该论文旨在解决巴基斯坦女性在月经健康教育(Menstrual Health Education, MHE)方面面临的文化禁忌和正式课程缺失问题,导致她们缺乏可靠的信息来源。解决方案的关键在于开发并部署一个基于WhatsApp的聊天机器人(chatbot),该机器人融合了大语言模型(Large Language Model, LLM)与检索增强生成(Retrieval Augmented Generation, RAG)技术,并由巴基斯坦高校女性共同设计(co-designed)。其核心创新包括支持罗马乌尔都语(Roman Urdu)、使用低成本平台、构建专家审核的知识库,从而在尊重本地文化语境的基础上提供可信、可及的生殖健康信息,同时通过迭代式对话帮助用户挑战污名化认知、建立科学健康知识体系。

链接: https://arxiv.org/abs/2604.06008
作者: Amna Shahnawaz,Ayesha Shafique,Ding Wang,Maryam Mustafa
机构: Lahore University of Management Science (拉合尔管理科学大学); Google (谷歌)
类目: Human-Computer Interaction (cs.HC)
备注: This is accepted at CHI 2026

点击查看摘要

Abstract:Menstrual health education (MHE) in Pakistan is constrained by cultural taboos and inadequate formal curricula, leaving women with few trusted resources to lean on. In response to these challenges, we introduce a WhatsApp-based chatbot powered by a large language model (LLM) and Retrieval Augmented Generation (RAG), co-designed with Pakistani college women. Workshops (N=30) revealed key design requirements: support for Roman Urdu, use of subsidized platforms, and an expert-curated knowledge base. We then deployed the chatbot with 13 participants for two weeks (403 messages and interviews). Women used it to challenge cultural taboos, legitimize health concerns often dismissed as normal, and build reproductive health knowledge through iterative questioning. Yet, interactions also exposed tensions: reliance on cultural explanatory models, questions of trust and validation, and gendered persona of the chatbot itself. We contribute empirical insights, a stigma-aware design framework for culturally sensitive conversational AI, and a methodological lens foregrounding expert validation in intimate health domains.

[HC-7] Regimes of Scale in AI Meteorology

【速读】:该论文旨在解决生成式 AI (Generative AI) 与机器学习(Machine Learning, ML)工具在气象领域应用中的整合难题,特别是如何克服其与现有基于物理的大数据模型和数据流水线之间的不兼容性问题。解决方案的关键在于识别并分析“规模范式”(regimes of scale)——即AI/ML与气象学在观测、数据处理和建模尺度上的根本差异,从而揭示AI/ML方法因源于特定平台和互联网基础设施,难以直接适配气象学中高度结构化的数据组织方式,进而提出需针对具体领域重构AI/ML的集成策略。

链接: https://arxiv.org/abs/2604.06000
作者: Anya Martin,Cindy Lin
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:HCI work has explored the effective integration of AI/ML tools across “application domains” from healthcare to finance to transportation. We add to this literature with an analysis of AI/ML tools in meteorology, a domain that already uses “big data” and massive physics-based models. Drawing from 12 interviews with forecasters and meteorologists with varied connections to AI/ML weather modeling, we trace tensions in AI/ML weather application arising from what we call “regimes of scale,” different ways that AI/ML and meteorological regimes make observations, data, and models scale. Rather than seeing AI/ML as a domain-agnostic tool, we argue that AI/ML methods were born from specific platform and internet infrastructures, and so they can struggle to integrate with very different (in this case meteorological) ways of organizing data pipelines.

[HC-8] Context-Value-Action Architecture for Value-Driven Large Language Model Agents ACL2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在模拟人类行为时存在的行为僵化问题,尤其是现有评估方法因自指偏差(self-referential bias)而掩盖了模型行为与真实人类行为之间的偏离。研究发现,增强提示驱动的推理强度反而加剧价值极化(value polarization),导致群体多样性下降。解决方案的关键在于提出Context-Value-Action(CVA)架构,其基于刺激-有机体-反应(Stimulus-Organism-Response, S-O-R)模型和施瓦茨基本人类价值观理论(Schwartz's Theory of Basic Human Values),通过一个在真实人类数据上训练的价值验证器(Value Verifier)将行动生成与认知推理解耦,从而显式建模动态价值激活机制,有效缓解极化并提升行为保真度与可解释性。

链接: https://arxiv.org/abs/2604.05939
作者: TianZe Zhang,Sirui Sun,Yuhang Xie,Xin Zhang,Zhiqiang Wu,Guojie Song
机构: Peking University (北京大学); PKU-Wuhan Institute for Artificial Intelligence (北京大学武汉人工智能研究院)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted to Findings of the Association for Computational Linguistics: ACL 2026

点击查看摘要

Abstract:Large Language Models (LLMs) have shown promise in simulating human behavior, yet existing agents often exhibit behavioral rigidity, a flaw frequently masked by the self-referential bias of current “LLM-as-a-judge” evaluations. By evaluating against empirical ground truth, we reveal a counter-intuitive phenomenon: increasing the intensity of prompt-driven reasoning does not enhance fidelity but rather exacerbates value polarization, collapsing population diversity. To address this, we propose the Context-Value-Action (CVA) architecture, grounded in the Stimulus-Organism-Response (S-O-R) model and Schwartz’s Theory of Basic Human Values. Unlike methods relying on self-verification, CVA decouples action generation from cognitive reasoning via a novel Value Verifier trained on authentic human data to explicitly model dynamic value activation. Experiments on CVABench, which comprises over 1.1 million real-world interaction traces, demonstrate that CVA significantly outperforms baselines. Our approach effectively mitigates polarization while offering superior behavioral fidelity and interpretability.

[HC-9] FEEL: Quantifying Heterogeneity in Physiological Signals for Generalizable Emotion Recognition NEURIPS2025

【速读】:该论文旨在解决情绪识别模型在跨数据集、跨实验设置和跨设备类型时泛化能力不足的问题,尤其是在生理信号(如电皮肤活动(EDA)和光电容积脉搏波(PPG))场景下缺乏标准化大规模评估的瓶颈。其解决方案的关键在于构建首个大规模基准测试FEEL,涵盖19个公开数据集,并系统评估了16种不同架构(包括传统机器学习、深度学习及自监督预训练方法),揭示出:1)经过微调的对比信号-语言预训练(CLSP)模型在唤醒度与效价分类任务中表现最优(在114项评测中的71项上取得最高F1);2)基于手工特征的方法显著优于直接使用原始信号段的模型(在107/114项评测中更优),凸显领域知识在低资源、噪声环境中的价值;3)真实生活场景数据训练的模型具有强泛化能力,可有效迁移至实验室(F1=0.79)和约束型设置(F1=0.78),且专家标注数据对刺激标签和自报告数据也具备良好迁移性(F1=0.72–0.76),表明模型性能受数据采集环境与标注策略影响显著。

链接: https://arxiv.org/abs/2604.05926
作者: Pragya Singh,Ankush Gupta,Somay Jalan,Mohan Kumar,Pushpendra Singh
机构: IIIT-Delhi (印度国际信息技术学院); RIT (罗切斯特理工学院)
类目: Human-Computer Interaction (cs.HC)
备注: Published at Conference on Neural Information Processing Systems (NeurIPS 2025) Track on Datasets and Benchmarks

点击查看摘要

Abstract:Emotion recognition from physiological signals has substantial potential for applications in mental health and emotion-aware systems. However, the lack of standardized, large-scale evaluations across heterogeneous datasets limits progress and model generalization. We introduce FEEL, the first large-scale benchmarking study of emotion recognition using electrodermal activity (EDA) and photoplethysmography (PPG) signals across 19 publicly available datasets. We evaluate 16 architectures spanning traditional machine learning, deep learning, and self-supervised pretraining approaches, structured into four representative modeling paradigms. Our study includes both within-dataset and cross-dataset evaluations, analyzing generalization across variations in experimental settings, device types, and labeling strategies. Our results showed that fine-tuned contrastive signal-language pretraining (CLSP) models (71/114) achieve the highest F1 across arousal and valence classification tasks, while simpler models like Random Forests, LDA, and MLP remain competitive (36/114). Models leveraging handcrafted features (107/114) consistently outperform those trained on raw signal segments, underscoring the value of domain knowledge in low-resource, noisy settings. Further cross-dataset analyses reveal that models trained on real-life setting data generalize well to lab (F1 = 0.79) and constraint-based settings (F1 = 0.78). Similarly, models trained on expert-annotated data transfer effectively to stimulus-labeled (F1 = 0.72) and self-reported datasets (F1 = 0.76). Moreover, models trained on lab-based devices also demonstrated high transferability to both custom wearable devices (F1 = 0.81) and the Empatica E4 (F1 = 0.73), underscoring the influence of heterogeneity. More information about FEEL can be found on our website this https URL.
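摘要强调手工特征在低资源、含噪场景中优于直接输入原始信号段。下面是一个从一段生理信号(如 EDA/PPG)中提取常见统计特征的 Python 草图;具体特征集为通用示例,并非 FEEL 基准实际使用的特征:

```python
# 示意:从一维生理信号序列提取均值、标准差与局部峰值数三个手工特征。

def handcrafted_features(signal):
    n = len(signal)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal) / n
    # 局部峰值个数:比左右相邻样本都大的点(在 PPG 中可近似对应心搏)
    peaks = sum(
        1 for i in range(1, n - 1)
        if signal[i] > signal[i - 1] and signal[i] > signal[i + 1]
    )
    return {"mean": mean, "std": var ** 0.5, "n_peaks": peaks}
```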

[HC-10] Dialogue-based Interactive Explanations for Safety Decisions in Human-Robot Collaboration

【速读】:该论文旨在解决人机协作(Human-Robot Collaboration, HRC)场景中机器人安全决策缺乏可解释性的问题。在共享、高安全要求的环境中,机器人虽能执行安全行为(如停止或切换模式),但其内部安全约束机制对人类协作方不透明,导致难以理解为何采取特定干预措施。解决方案的关键在于提出一种基于对话的交互式解释框架,将解释过程与基于约束的安全评估紧密耦合,利用相同的系统状态和约束表示来生成解释;解释内容直接源自记录的安全决策轨迹,支持用户提出因果型(“为什么?”)、对比型(“为什么不?”)和反事实型(“如果……会怎样?”)查询,且反事实推理在固定且经过认证的安全参数下进行,确保交互探索不会削弱运行保障。此方法通过将解释作为安全控制的操作接口,推动了HRC中交互式、安全感知自主性的设计范式发展。

链接: https://arxiv.org/abs/2604.05896
作者: Yifan Xu,Xiao Zhan,Akilu Yunusa Kaltungo,Ming Shan Ng,Tsukasa Ishizawa,Kota Fujimoto,Clara Cheung
机构: The University of Manchester (曼彻斯特大学); Universitat Politècnica de València (瓦伦西亚理工大学); University of Cambridge (剑桥大学); Kyoto Institute of Technology (京都工艺纤维大学); The University of Tokyo (东京大学)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:As robots increasingly operate in shared, safety-critical environments, acting safely is no longer sufficient; robots must also make their safety decisions intelligible to human collaborators. In human-robot collaboration (HRC), behaviours such as stopping or switching modes are often triggered by internal safety constraints that remain opaque to nearby workers. We present a dialogue-based framework for interactive explanation of safety decisions in HRC. The approach tightly couples explanation with constraint-based safety evaluation, grounding dialogue in the same state and constraint representations that govern behaviour selection. Explanations are derived directly from the recorded decision trace, enabling users to pose causal ("Why?"), contrastive ("Why not?"), and counterfactual ("What if?") queries about safety interventions. Counterfactual reasoning is evaluated in a bounded manner under fixed, certified safety parameters, ensuring that interactive exploration does not relax operational guarantees. We instantiate the framework in a construction robotics scenario and provide a structured operational trace illustrating how constraint-aware dialogue clarifies safety interventions and supports coordinated task recovery. By treating explanation as an operational interface to safety control, this work advances a design perspective for interactive, safety-aware autonomy in HRC.
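摘要中"解释直接来自记录下来的约束求值轨迹,反事实在固定安全参数下有界求值"的机制,可以用下面的 Python 草图示意;约束与状态字段均为本文整理时的假设,并非论文原始实现:

```python
# 示意:把安全决策记录成"约束求值轨迹",并在其上回答 why / what-if 查询。

def evaluate(constraints, state):
    """constraints: {约束名: 谓词函数};返回 {约束名: 是否满足}。"""
    return {name: pred(state) for name, pred in constraints.items()}

def why_stopped(trace):
    """因果查询("为什么停?"):列出被违反的约束。"""
    return [name for name, ok in trace.items() if not ok]

def what_if(constraints, state, **changes):
    """反事实查询:在不放松约束本身的前提下,对假设状态重新求值。"""
    hypo = {**state, **changes}
    return why_stopped(evaluate(constraints, hypo))
```

约束集合在查询期间保持不变,对应摘要所述"交互探索不削弱运行保障"。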

[HC-11] Improving Explanations: Applying the Feature Understandability Scale for Cost-Sensitive Feature Selection

【速读】:该论文旨在解决机器学习模型在处理表格数据时,其推理过程难以被用户理解的问题,尤其关注自然语言文本解释中因输入特征(input features)复杂性导致的可理解性不足。解决方案的关键在于提出一种共优化(co-optimisation)方法,同时最大化模型的分类准确性和解释的可理解性(understandability),并通过实证研究验证该方法可在保持高分类性能的前提下提升解释的直观易懂程度。

链接: https://arxiv.org/abs/2604.05790
作者: Nicola Rossberg,Bennett Kleinberg,Barry O’Sullivan,Luca Longo,Andrea Visentin
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 24 pages, 4 figures, accepted for presentation at the 4th World Conference on eXplainable Artificial Intelligence

点击查看摘要

Abstract:With the growing pervasiveness of artificial intelligence, the ability to explain the inferences made by machine learning models has become increasingly important. Numerous techniques for model explainability have been proposed, with natural-language textual explanations among the most widely used approaches. When applied to tabular data, these explanations typically draw on input features to justify a given inference. Consequently, a user’s ability to interpret the explanation depends on their understanding of the input features. To quantify this feature-level understanding, Rossberg et al. introduced the Feature Understandability Scale. Building on that work, this proof-of-concept study collects understandability scores across two datasets, proposes a co-optimisation methodology of understandability and accuracy and presents the resulting explanations alongside the model accuracies. This work contributes to the body of knowledge on model interpretability by design. It is found that accuracy and understandability can be successfully co-optimised while maintaining high classification performances. The resulting explanations are considered more understandable at face value. Further research will aim to confirm these findings through user evaluation.
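摘要中"准确率与可理解性共优化"的思想,可以写成一个标量目标并做贪心特征选择。下面的 Python 草图仅为示意:`evaluate_accuracy` 是假设的外部回调,λ 与打分方式并非论文原始方法:

```python
# 示意:目标 = 准确率 + λ * 所选特征的平均可理解性,按可理解性降序贪心加入。

def co_optimise(features, understandability, evaluate_accuracy, lam=0.1):
    """features: 候选特征名列表;understandability: {特征: [0,1] 可理解性分};
    evaluate_accuracy: 接受特征子集、返回准确率的回调。"""
    selected, best = [], float("-inf")
    for f in sorted(features, key=understandability.get, reverse=True):
        cand = selected + [f]
        u = sum(understandability[x] for x in cand) / len(cand)
        score = evaluate_accuracy(cand) + lam * u
        if score > best:            # 仅保留对联合目标有增益的特征
            selected, best = cand, score
    return selected, best
```

当新增特征带来的准确率提升抵不过平均可理解性的下降时,贪心过程自然停止扩展。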

[HC-12] How Much Trust is Enough? Towards Calibrating Trust in Technology

【速读】:该论文旨在解决人机交互(Human-Computer Interaction, HCI)中用户对技术信任评估缺乏标准化、可操作化方法的问题。随着技术日益自主化和透明度降低,用户难以准确理解系统的功能边界,导致信任评估变得复杂且主观。解决方案的关键在于基于实证研究,开发一套针对人类-计算机信任量表(Human-Computer Trust Scale, HCTS)的解释指南,强调将信任倾向评估结果置于具体交互情境中进行反思与校准,从而提升信任评估的实用性与准确性。

链接: https://arxiv.org/abs/2604.05658
作者: Gabriela Beltrão,Debora F. de Souza,Sonia Sousa,David Lamas
机构: Tallinn University (塔林大学); Tallinn University of Technology (TalTech) (塔林理工大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The role of trust within Human-Computer Interaction is being redefined. With the increasing omnipresence, autonomy, and opacity of technology, users often struggle to understand the capabilities and limitations of systems. In this article, we present the results of an empirical study designed to provide a practical, evidence-based interpretation of trust propensity assessment using the Human-Computer Trust Scale (HCTS). We outline the process used to develop a guideline for interpreting the instrument’s results and explain the rationale for our decisions, advocating for calibrating trust in technology within HCI. Our findings demonstrate that the HCTS is a promising tool for conducting an initial evaluation of propensity to trust, but that such an assessment requires reflection and interpretation that should be considered within the context of the interaction.

[HC-13] Beyond Behavior: Why AI Evaluation Needs a Cognitive Revolution

【速读】:该论文试图解决的问题是:当前人工智能(AI)研究长期受制于图灵提出的以行为表现作为智能判断标准的 epistemological commitment(认识论承诺),这种承诺导致AI领域忽视了对系统内部计算机制、过程和组织结构的探究,从而无法区分那些通过不同计算路径达成相同输出的系统,而这正是智能归属(intelligence attribution)的关键所在。解决方案的关键在于推动一场类似心理学从行为主义向认知革命的 epistemological transition(认识论转型)——并非摒弃行为证据,而是承认仅靠行为证据不足以支持AI领域所声称的智能构造性主张(construct claims),并建立一种后行为主义的认识论框架,使诸如“系统如何实现其输出”等原本不可提问的问题变得可问且可答。

链接: https://arxiv.org/abs/2604.05631
作者: Amir Konigsberg
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:In 1950, Alan Turing proposed replacing the question “Can machines think?” with a behavioral test: if a machine’s outputs are indistinguishable from those of a thinking being, the question of whether it truly thinks can be set aside. This paper argues that Turing’s move was not only a pragmatic simplification but also an epistemological commitment, a decision about what kind of evidence counts as relevant to intelligence attribution, and that this commitment has quietly constrained AI research for seven decades. We trace how Turing’s behavioral epistemology became embedded in the field’s evaluative infrastructure, rendering unaskable a class of questions about process, mechanism, and internal organization that cognitive psychology, neuroscience, and related disciplines learned to ask. We draw a structural parallel to the behaviorist-to-cognitivist transition in psychology: just as psychology’s commitment to studying only observable behavior prevented it from asking productive questions about internal mental processes until that commitment was abandoned, AI’s commitment to behavioral evaluation prevents it from distinguishing between systems that achieve identical outputs through fundamentally different computational processes, a distinction on which intelligence attribution depends. We argue that the field requires an epistemological transition comparable to the cognitive revolution: not an abandonment of behavioral evidence, but a recognition that behavioral evidence alone is insufficient for the construct claims the field wishes to make. We articulate what a post-behaviorist epistemology for AI would involve and identify the specific questions it would make askable that the field currently has no way to ask.

[HC-14] Understanding User Privacy Perceptions of GenAI Smartphones

【速读】:该论文旨在解决生成式 AI (Generative AI) 智能手机在系统级集成后引发的用户隐私担忧问题,尤其关注用户对这类设备隐私风险的认知水平及其期望的隐私保护机制。其关键解决方案在于通过深入的定性研究(22次半结构化访谈与后续焦点小组讨论),揭示用户在使用 GenAI 智能手机时对数据生命周期中各环节(如非透明的数据收集、不安全存储及弱数据控制)的高度敏感性,并提出以用户为中心的设计建议,强调需在系统级控制、数据管理实践和用户界面透明度之间进行协同优化,从而实现功能增强与隐私保护之间的平衡。

链接: https://arxiv.org/abs/2604.05571
作者: Ran Jin,Liu Wang,Shidong Pan,Luona Xu,Tianming Liu,Haoyu Wang
机构: Huazhong University of Science and Technology (华中科技大学); New York University (纽约大学)
类目: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:GenAI smartphones, which natively embed generative AI at the system level, are transforming mobile interactions by automating a wide range of tasks and executing UI actions on behalf of users. Their superior capabilities rely on continuous access to sensitive and context-rich data, raising privacy concerns that surpass those of traditional mobile devices. Yet, little is known about how users perceive the privacy implications of such devices or what safeguards they expect, which is especially critical at this early stage of GenAI smartphone adoption. To address this gap, we conduct 22 semi-structured interviews with everyday mobile users to explore their usage of GenAI smartphones, privacy concerns, and privacy design expectations. Our findings show that users engage with GenAI smartphones with limited understanding of how these systems operate to deliver functions, but show heightened privacy concerns once exposed to the technical details. Participants’ concerns span the entire data lifecycle, including nontransparent collection, insecure storage, and weak data control. In a follow-up focus group, participants discuss a range of privacy-enhancing suggestions that call for coordinated changes across system-level controls, data management practices, and user-facing transparency. Their concerns and suggestions offer user-centered guidances for designing GenAI smartphones that balance functionality with privacy protection, offering valuable takeaways for system designers and regulators.

[HC-15] Foreign Domestic Workers' Perspectives on an LLM-Based Emotional Support Tool for Caregiving Burden

【速读】:该论文试图解决外籍家庭佣工(Foreign Domestic Workers, FDWs)在居家老年照护中因语言障碍、社会孤立和缺乏支持资源而面临显著情感照护负担的问题,且现有研究多聚焦于亲属照护者,对FDWs如何使用情感支持技术知之甚少。解决方案的关键在于探索大型语言模型(Large Language Model, LLM)驱动聊天机器人作为日常非临床情感支持工具的适用性,发现其通过提供心理安全感、支持语言可达性(即使语言不完整或碎片化)以及被灵活用作安慰、指导与陪伴的多功能资源,有效缓解了FDWs的情感压力,从而为设计以心理安全、可访问性和灵活适配为核心原则的LLM驱动情感支持工具提供了实证依据与设计启示。

链接: https://arxiv.org/abs/2604.05448
作者: Shin Shoon Nicholas Teng,Kenny Tsu Wei Choo(Singapore University of Technology and Design)
机构: Singapore University of Technology and Design(新加坡科技设计大学)
类目: Human-Computer Interaction (cs.HC)
备注: 5 pages, Accepted at CHI 2026 Posters

点击查看摘要

Abstract:Foreign Domestic Workers (FDWs) play a central role in home-based eldercare yet often experience substantial emotional caregiving burden shaped by linguistic barriers, social isolation, and limited access to support. While caregiving burden has been extensively studied among familial caregivers, little is known about how FDWs engage with emotional support technologies. We present an exploratory qualitative study of how FDWs in Singapore interact with a Large Language Model (LLM)-driven chatbot as an everyday, non-clinical form of emotional support. Through interviews and guided chatbot interactions, we conducted an inductive thematic analysis of participants’ experiences. We identify three design-relevant themes: chatbots were experienced as psychologically safe and emotionally validating; they supported linguistic accessibility by accommodating imperfect and fragmented language; and they were appropriated as multifunctional resources for reassurance, guidance, and companionship. We discuss implications for designing LLM-driven emotional support tools that foreground psychological safety, accessibility, and flexible appropriation.

[HC-16] Transient Non-Use: How People in Migration Experience Digital Disconnection

【速读】:该论文旨在解决迁移人群在跨边界、技术与社会系统转换过程中,信息与通信技术(Information and Communication Technologies, ICTs)被有意或无意避免、保留或未使用的问题,即“技术非使用”(non-use)现象。传统人机交互(HCI)研究多关注移民对技术的采纳,却忽视了其技术规避行为背后的复杂动因。论文通过在美墨边境城市埃尔帕索对32位移民进行访谈,识别出设备非使用、信息非使用和保护性非使用三类实践,并将非使用置于迁移的三个阶段——理解、协商与解决中进行动态分析。其关键解决方案在于:将非使用视为既是一种保护性策略、也是对系统性排斥的回应,而非单纯的失败;并据此提出设计原则,使非使用成为可预见的、有意或无意的设计条件,从而推动更具包容性和情境适应性的技术设计。

链接: https://arxiv.org/abs/2604.05386
作者: Jonathan Leuenberger,Anamika Rajendran,Augusto Penzo Jara,Tajwar-Ul Hoque,Shiva Darian
机构: New Mexico State University (新墨西哥州立大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:People experiencing migration endure many transitions across borders, technologies, and social systems. While HCI research often emphasizes this community’s adoption of technology, less attention has been paid to practices of technological non-use. This paper investigates how information and communication technologies (ICTs) are intentionally and unintentionally avoided, withheld, or not used during migration. Drawing on interviews with 32 people experiencing migration in the border city of El Paso, Texas, USA between February and May 2025, we identify a range of non-use experiences, including device, informational, and protective non-use. We extend the concept of non-use by situating it within the three phases of transitions: understanding, negotiating, and resolving. We show how ICT non-use shifts with time, risk, and institutional demands. Our analysis demonstrates that non-use functions both as a protective strategy and as a response to systemic exclusion, and concludes with design principles that anticipate non-use as both intentional and unintentional design conditions rather than as punitive failure.

[HC-17] SpeakSoftly: Scaffolding Nonviolent Communication in Intimate Relationships through LLM-Powered Just-In-Time Interventions

【速读】:该论文旨在解决文本交流中亲密关系双方因误解而引发冲突升级的问题,特别是避免言语攻击行为的发生。其解决方案的关键在于引入非暴力沟通(Nonviolent Communication, NVC)原则,并通过大语言模型(Large Language Model, LLM)实现即时干预机制。系统核心设计包括两个功能模块:NVC-Prompt用于检测言语攻击并建议改写以防止冲突升级,NVC-Guide则通过分析对话内容识别用户的情绪与需求,促进自我觉察和换位思考。上述功能在三种不同干预深度与语气的模式下实现——基础提醒、中性引导和共情引导,实证研究表明共情引导模式在行为与认知层面均具显著效果,而中性引导在真实情境中因认知负荷较低更具实用性。

链接: https://arxiv.org/abs/2604.05382
作者: Ka I Chan,Hongbo Lan,Jun Fang,Yuntao Wang,Yuanchun Shi
机构: University of Michigan (密歇根大学); University of Pittsburgh (匹兹堡大学); Tsinghua University (清华大学); Qinghai University (青海大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Conflicts are common in text-based communication, particularly in intimate relationships, where misunderstandings can easily escalate into verbal aggression. To address this, we present SpeakSoftly, a system that applies Nonviolent Communication (NVC) principles to scaffold couples’ conflict communication through LLM-powered just-in-time interventions. Informed by formative interviews with couples and NVC principles, we designed two core features: NVC-Prompt, which detects verbal aggression and suggests revisions to prevent escalation, and NVC-Guide, which analyzes dialogues to uncover users’ feelings and needs, fostering self-awareness and perspective-taking. These features were implemented across three progressive intervention modes, each varying in intervention depth and tone: Basic Reminder, Neutral Guide, and Empathetic Guide. We conducted a mixed-methods user study with 18 couples across simulated and real-life conflict settings to evaluate the effectiveness of each mode. Results showed that Empathetic Guide significantly facilitated both behavioral and cognitive changes, while Neutral Guide was effective only for behavioral changes in simulated conflicts. In real-life conflicts, Neutral Guide showed distinct advantages due to lower cognitive load demands. We discuss the mechanisms behind these findings and propose design implications for in-situ interventions in high-stakes communication contexts.

[HC-18] WSCM-Lite: A Practitioner-Ready Implementation of the Weak Signal Cultivation Model

【速读】:该论文旨在解决弱信号培育模型(Weak Signal Cultivation Model, WSCM)在实际应用中的计算复杂性问题,尤其针对依赖电子表格工具的组织难以部署原模型的问题。其关键解决方案是提出WSCM-Lite——一种基于查表法(lookup-table)的简化版本,通过将连续的时效加权机制替换为四行查表结构、移除共识动量和反转放大项,并固化五个常数参数,从而将原模型的15个方程和16个可调参数压缩至7个公式和5个固定常数。该方案在保持与原模型坐标轨迹误差小于0.01场单位的前提下,显著降低了计算门槛,且经26次会话验证和敏感性分析表明其路径行为一致且鲁棒性强。

链接: https://arxiv.org/abs/2604.05381
作者: Maurice Codourey,Emmanuel A. Gonzalez
机构: 未知
类目: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: 15 pages, 4 figures, 7 tables, 1 appendix. Companion paper to arXiv:2604.01495 . Excel simulator and supplementary materials at this https URL

点击查看摘要

Abstract:The Weak Signal Cultivation Model (WSCM) provides a mathematically rigorous framework for tracking frontline risk signals across a two-dimensional coordinate field using 15 equations and 16 tunable parameters. While this specification is designed for eventual software implementation, its computational requirements create an adoption barrier for organizations whose available infrastructure is a spreadsheet. This paper introduces WSCM-Lite, a lookup-table implementation that reproduces the full WSCM’s coordinate trajectories within 0.01 field units while eliminating all exponential functions, state-dependent tracking, and free parameters. The simplification replaces continuous recency weighting with a four-row lookup table and removes consensus momentum and reversal amplification entirely, reducing the specification to seven formulas and five hardcoded constants. A 26-session worked example using the Gas Fumes signal from the parent paper demonstrates that WSCM-Lite traverses the same four-region path (Question Marks – Lit Fuses – Owls – Sleeping Cats – Question Marks) and triggers SMS escalation within two sessions of the full model. Five additional scenarios validate boundary behavior, and a sensitivity analysis confirms stability under +/-30% gap threshold variation. An accompanying Excel simulator and supplementary materials are publicly available at this https URL.
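
摘要提到 WSCM-Lite 用四行查表替代连续时效加权,从而免去指数函数。下面是一个假设性的最小 Python 草图(阈值与权重均为虚构值,并非论文中的五个实际常数),仅示意这种查表式时效加权如何用普通脚本或电子表格逻辑复现:

```python
# 假设性示意:用四行查表近似连续(指数型)时效加权
# 阈值与权重均为虚构值,并非 WSCM-Lite 论文中的实际常数
RECENCY_TABLE = [
    (2, 1.0),             # 距今不超过 2 个会话:全权重
    (5, 0.7),
    (10, 0.4),
    (float("inf"), 0.1),  # 更久远的信号:保留最低权重
]

def recency_weight(sessions_ago):
    """按查表返回时效权重,避免指数运算,便于在电子表格中实现。"""
    for max_age, weight in RECENCY_TABLE:
        if sessions_ago <= max_age:
            return weight

def weighted_signal_score(observations):
    """observations: [(sessions_ago, raw_score), ...];返回时效加权平均分。"""
    if not observations:
        return 0.0
    total = sum(recency_weight(age) * s for age, s in observations)
    norm = sum(recency_weight(age) for age, _ in observations)
    return total / norm

print(recency_weight(7))                                       # 0.4
print(round(weighted_signal_score([(1, 1.0), (12, 0.0)]), 3))  # 0.909
```

真正的 WSCM-Lite 规格共 7 个公式、5 个固定常数,应以论文附带的 Excel 模拟器为准。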

[HC-19] AI and Collective Decisions: Strengthening Legitimacy and Losers' Consent

【速读】:该论文旨在解决人工智能(AI)在集体决策中如何增强程序合法性的问题,特别是关注失败方对决策结果的接受度——即即使未获得偏好结果,参与者是否仍认为决策公平。其核心挑战在于:如何让AI系统不仅提升决策效率与规模,还能促进不同个体经验与信念的表达与理解,从而增进信任与社会凝聚力。解决方案的关键在于构建一个结合半结构化AI访谈器与交互式可视化工具的系统,前者用于收集用户关于政策议题的个人经历,后者将这些经历与预测的政策支持度进行整合展示;实验表明,即便所有参与者都面临不利结果,该系统仍显著提升了感知合法性、对结果的信任以及对他者立场的理解,验证了AI在增强民主参与中情感联结与认知共情的潜力。

链接: https://arxiv.org/abs/2604.05368
作者: Suyash Fulay,Prerna Ravi,Emily Kubin,Shrestha Mohanty,Michiel Bakker,Deb Roy
机构: Massachusetts Institute of Technology (麻省理工学院); Oxford University (牛津大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 11 pages + appendix

点击查看摘要

Abstract:AI is increasingly used to scale collective decision-making, but far less attention has been paid to how such systems can support procedural legitimacy, particularly the conditions shaping losers’ consent: whether participants who do not get their preferred outcome still accept it as fair. We ask: (1) how can AI help ground collective decisions in participants’ different experiences and beliefs, and (2) whether exposure to these experiences can increase trust, understanding, and social cohesion even when people disagree with the outcome. We built a system that uses a semi-structured AI interviewer to elicit personal experiences on policy topics and an interactive visualization that displays predicted policy support alongside those voiced experiences. In a randomized experiment (n = 181), interacting with the visualization increased perceived legitimacy, trust in outcomes, and understanding of others’ perspectives, even though all participants encountered decisions that went against their stated preferences. Our hope is that the design and evaluation of this tool spurs future researchers to focus on how AI can help not only achieve scale and efficiency in democratic processes, but also increase trust and connection between participants.

[HC-20] OGA-AID: Clinician-in-the-loop AI Report Drafting Assistant for Multimodal Observational Gait Analysis in Post-Stroke Rehabilitation CVPR

【速读】:该论文旨在解决卒中后步态分析(gait analysis)在康复临床实践中存在的效率低、认知负荷高问题,尤其针对临床医生需整合步态视频与运动捕捉数据生成结构化报告的繁琐流程。其解决方案的关键在于提出一种“医生在环”的多智能体大语言模型系统(multi-agent large language model system),通过三个专业化代理协同处理患者运动记录、运动学轨迹和临床特征,自动生成结构化评估报告。实验表明,该系统在真实患者数据上显著优于单次处理的多模态基线方法,且结合医生初步笔记可进一步降低误差,验证了AI辅助分析与人类临床判断在康复流程中的互补性。

链接: https://arxiv.org/abs/2604.05360
作者: Khoi T. N. Nguyen,Nghia D. Nguyen,Hui Yu Koh,Patrick W. H. Kwong,Karen Sui Geok Chua,Ananda Sidarta,Baosheng Yu
机构: Nanyang Technological University (南洋理工大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); The Hong Kong Polytechnic University (香港理工大学); VinUniversity (越南VinUniversity); Institute of Rehabilitation Excellence, NHG Health (新加坡国家健康集团康复卓越研究所)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 2026 CV4Clinic CVPR Workshop Proceedings

点击查看摘要

Abstract:Gait analysis is essential in post-stroke rehabilitation but remains time-intensive and cognitively demanding, especially when clinicians must integrate gait videos and motion-capture data into structured reports. We present OGA-AID, a clinician-in-the-loop multi-agent large language model system for multimodal report drafting. The system coordinates 3 specialized agents to synthesize patient movement recordings, kinematic trajectories, and clinical profiles into structured assessments. Evaluated with expert physiotherapists on real patient data, OGA-AID consistently outperforms single-pass multimodal baselines with low error. In clinician-in-the-loop settings, brief expert preliminary notes further reduce error compared to reference assessments. Our findings demonstrate the feasibility of multimodal agentic systems for structured clinical gait assessment and highlight the complementary relationship between AI-assisted analysis and human clinical judgment in rehabilitation workflows.

[HC-21] Symetra: Visual Analytics for the Parameter Tuning Process of Symbolic Execution Engines

【速读】:该论文旨在解决符号执行引擎(symbolic execution engine)在参数配置上的可解释性问题,即由于参数数量众多且相互影响复杂,用户难以理解各参数对分支覆盖率(branch coverage)的实际影响,从而依赖次优的默认配置。解决方案的关键在于提出一个名为Symetra的可视化分析系统,通过提供两种互补的参数影响概览(overviews),支持人机协同(Human-in-the-Loop)的参数调优过程;该系统不仅使专家能够识别不同配置组合对分支覆盖率的影响模式,还能通过集体分析对比配置组差异,最终实现比全自动调优方法更高的分支覆盖率和调优效率。

链接: https://arxiv.org/abs/2604.05349
作者: Donghee Hong,Minjong Kim,Sooyoung Cha,Jaemin Jo
机构: Sungkyunkwan University (成均馆大学)
类目: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Symbolic execution engines such as KLEE automatically generate test cases to maximize branch coverage, but their numerous parameters make it difficult to understand the parameters’ impact, leading the user to rely on suboptimal default configurations. While automated tuners have shown promising results, they provide limited insights into why certain configurations work well, motivating the need for Human-in-the-Loop approaches. In this work, we present a visual analytics system, Symetra, designed to support Human-in-the-Loop parameter tuning of symbolic execution engines. To handle a large number of parameters and their configurations, we provide two complementary overviews of their impact on branch coverage values and patterns. Building on these overviews, our system enables collective analysis, allowing the user to contrast groups of configurations and identify differences that may affect branch coverage. We also report on case studies and a Human-in-the-Loop tuning process, demonstrating that experts not only interpreted parameter impacts and identified complementary configurations, but also improved upon fully automated approaches in both branch coverage and tuning efficiency.
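
摘要中的 collective analysis 指对比配置组、定位可能影响分支覆盖率的参数差异。下面是一个假设性的 Python 草图(参数名与取值均为虚构,并非 KLEE 或 Symetra 的真实配置),示意按参数比较高、低覆盖率两组配置众数取值的一种朴素做法:

```python
from collections import Counter

# 假设性草图:对比高/低覆盖率两组配置,找出两组众数取值不同的参数
# (参数名与取值均为虚构示例)
def contrast_configs(group_a, group_b):
    """group_a / group_b 为配置字典列表;返回 {参数: (A 组众数, B 组众数)}。"""
    diffs = {}
    for p in set().union(*group_a, *group_b):
        mode_a = Counter(c.get(p) for c in group_a).most_common(1)[0][0]
        mode_b = Counter(c.get(p) for c in group_b).most_common(1)[0][0]
        if mode_a != mode_b:
            diffs[p] = (mode_a, mode_b)
    return diffs

high_cov = [{"search": "bfs", "batching": True}, {"search": "bfs", "batching": False}]
low_cov = [{"search": "dfs", "batching": True}, {"search": "dfs", "batching": True}]
print(contrast_configs(high_cov, low_cov))  # {'search': ('bfs', 'dfs')}
```

真实系统的价值在于交互式可视化与人在环分析,此处仅示意“组间对比”这一步的计算内核。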

[HC-22] Semantic Reality: Interactive Context-Aware Visualization of Inter-Object Relationships in Augmented Reality

【速读】:该论文旨在解决增强现实(Augmented Reality, AR)中物理世界与数字世界交互的核心挑战,特别是现有系统仅聚焦于单个物体,难以支持依赖多物品间关系的规划、比较和装配任务。解决方案的关键在于提出“语义现实”(Semantic Reality)系统,其通过多模态推理(multimodal reasoning)、空间锚定(spatial anchoring)和物理动作识别(physical action recognition)构建用户周围物体及其关系的持久化模型,并以在场可视化方式呈现物体间的连接性,从而明确兼容性、揭示下一步操作并减少任务中的歧义。该系统还引入以连接性为中心的交互范式,结合锚点追踪、动作感知与模型推理的架构,动态生成实时连接图,显著提升用户对多物关系的理解与任务参与度。

链接: https://arxiv.org/abs/2604.05265
作者: Xiaoan Liu,Eric J Gonzalez,Nels Numan,Andrea Colaço,Lucy Abramyan,Chen Zhu-Tian,Ryo Suzuki,Mar Gonzalez-Franco
机构: Google(谷歌); University of Minnesota(明尼苏达大学); University of Colorado Boulder(科罗拉多大学博尔德分校)
类目: Human-Computer Interaction (cs.HC)
备注: 15 pages, 15 figures, 4 tables

点击查看摘要

Abstract:Bridging the physical and digital world through interaction remains a core challenge in augmented reality (AR). Existing systems target single objects, limiting support for planning, comparison, and assembly tasks that depend on relationships among multiple items. We present Semantic Reality, an AR system focused on surfacing inter-object connectivity and making it interactive. Leveraging multimodal reasoning, spatial anchoring, and physical action recognition, Semantic Reality maintains a persistent model of objects around the user and their relationships. Connections are visualized in-situ to highlight compatibility, reveal next steps, and reduce ambiguity during tasks. We contribute a connectivity-centered interaction paradigm and a system architecture that couples anchor tracking, action sensing, and model inference to construct a live connectivity graph. In an exploratory study comparing Semantic Reality to a single-object baseline, participants reported clearer inter-object understanding and higher engagement and satisfaction, without increased workload. A scenario study illustrates where connectivity aids planning, sequencing, and disambiguation.
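
系统核心是一张实时维护的物体连接图(connectivity graph)。以下是一个假设性的最小数据结构草图(类名、字段与示例关系均为虚构),示意如何用邻接表登记兼容关系并查询某物体的可连接项:

```python
# 假设性草图:以邻接表维护物体间的兼容性连接图(示例关系均为虚构)
class ConnectivityGraph:
    def __init__(self):
        self.edges = {}  # object_id -> {neighbor_id: relation_label}

    def add_connection(self, a, b, relation):
        """登记一条双向兼容关系,如 ("usb_cable", "laptop", "plugs_into")。"""
        self.edges.setdefault(a, {})[b] = relation
        self.edges.setdefault(b, {})[a] = relation

    def compatible_with(self, obj):
        """返回与 obj 存在连接关系的所有物体及其关系标签。"""
        return dict(self.edges.get(obj, {}))

g = ConnectivityGraph()
g.add_connection("usb_cable", "laptop", "plugs_into")
g.add_connection("usb_cable", "charger", "plugs_into")
print(g.compatible_with("usb_cable"))  # {'laptop': 'plugs_into', 'charger': 'plugs_into'}
```

真实系统还需把图节点与空间锚点、动作识别结果绑定并持久化,这里仅示意“连接关系”这一层的数据形态。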

[HC-23] ZipFold: Modular Actuators for Scaleable Adaptive Robots

【速读】:该论文旨在解决当前形状可变机器人系统普遍依赖专用、难以扩展或迁移的机制问题,以实现对复杂任务和环境的自适应能力。解决方案的关键在于提出一种紧凑且易于制造的可展开驱动器,其通过柔性3D打印塑料条带的复合折叠与拉链式结构,在不依赖复杂传动装置的情况下实现可逆的尺度与刚度变换;这种驱动器能够平滑连续地在紧凑(柔性)与展开(准刚性)状态之间切换,从而在模块化组合后支持多样化的形态与刚度重构,实验验证了其在四模块自适应行走机器人中的集成应用性能。

链接: https://arxiv.org/abs/2604.05260
作者: Niklas Hagemann,Daniela Rus
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Robotics (cs.RO); Soft Condensed Matter (cond-mat.soft); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:There is a growing need for robots that can change their shape, size and mechanical properties to adapt to evolving tasks and environments. However, current shape-changing systems generally utilize bespoke, system-specific mechanisms that can be difficult to scale, reconfigure or translate from one application to another. This paper introduces a compact, easy-to-fabricate deployable actuator that achieves reversible scale and stiffness transformations through compound folding and zipping of flexible 3D-printed plastic strips into square-section deployable beams. The simple actuation method allows for smooth, continuous transitions between compact (flexible) and expanded (quasi-rigid) states, facilitating diverse shape and stiffness transformations when modules are combined into larger assemblies. The actuator’s mechanical performance is characterized and an integrated system involving a four-module adaptive walking robot is demonstrated.

[HC-24] Understanding Clinician Experiences with Game-Based Interventions for Autistic Children to Inform a Future Game Platform Focused on Improving Motor Skills

【速读】:该论文旨在解决当前针对自闭症儿童的健康类严肃游戏(Serious Games for Health, SG4H)在临床实践中存在的僵化问题,即现有干预方案缺乏灵活性,难以适应个体差异和实际应用场景。解决方案的关键在于提出一个模块化平台——AutMotion Studio,该平台将多种干预措施封装为可定制的迷你游戏(minigames),支持社区成员参与开发,并采用“绿野仙踪”(Wizard of Oz)范式实现灵活的使用策略,从而提升游戏干预的适应性与临床实用性。

链接: https://arxiv.org/abs/2604.05249
作者: Hunter M Beach,Devin Jay D San Nicolas,Carly Miller,Cathy Ly,Jared Duval
机构: Northern Arizona University (北亚利桑那大学)
类目: Human-Computer Interaction (cs.HC)
备注: 6 Pages, 5 Figures, CHI26

点击查看摘要

Abstract:Motor challenges are prevalent among autistic children, and games are able to simultaneously produce clinically meaningful results and provide a motivating context, but many current solutions are too rigid. We conducted a two-phase qualitative study comprised of semi-structured interviews and participatory design workshops with 7 pediatric physical and 5 occupational therapists (PTs/OTs) to investigate their perspectives and experiences with game and play-based interventions. We identified 8 prominent themes describing key characteristics of current successful interventions, opportunities, and barriers to adoption in clinical practice. We present a speculative design informed by thematic analysis that addresses current challenges of rigidity in Serious Games for Health (SG4H). Our modular platform (AutMotion Studio) hosts a suite of interventions as customizable minigames, allowing community members to contribute to and employ Wizard of Oz paradigms for flexible appropriation strategies.

[HC-25] RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains

【速读】:该论文旨在解决传统机器人操作评估体系依赖少数专家预设固定基准所带来的局限性,这些问题包括任务实例、约束条件和成功标准的静态性,限制了评估的扩展性和对用户自定义意图变化的响应能力。其解决方案的关键在于提出RoboPlayground框架,将评估过程重构为基于自然语言驱动的结构化物理域中的可执行任务生成机制;通过将自然语言指令编译为包含显式资产定义、初始化分布和成功谓词的任务规范,形成结构化的任务家族,从而在保持可执行性与可比性的前提下实现语义与行为层面的可控变异,进而更全面地揭示策略在多样化场景下的泛化表现,并借助众包贡献实现评估空间的持续扩展。

链接: https://arxiv.org/abs/2604.05226
作者: Yi Ru Wang,Carter Ung,Evan Gubarev,Christopher Tan,Siddhartha Srinivasa,Dieter Fox
机构: University of Washington (华盛顿大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Yi Ru Wang and Carter Ung contributed equally

点击查看摘要

Abstract:Evaluation of robotic manipulation systems has largely relied on fixed benchmarks authored by a small number of experts, where task instances, constraints, and success criteria are predefined and difficult to extend. This paradigm limits who can shape evaluation and obscures how policies respond to user-authored variations in task intent, constraints, and notions of success. We argue that evaluating modern manipulation policies requires reframing evaluation as a language-driven process over structured physical domains. We present RoboPlayground, a framework that enables users to author executable manipulation tasks using natural language within a structured physical domain. Natural language instructions are compiled into reproducible task specifications with explicit asset definitions, initialization distributions, and success predicates. Each instruction defines a structured family of related tasks, enabling controlled semantic and behavioral variation while preserving executability and comparability. We instantiate RoboPlayground in a structured block manipulation domain and evaluate it along three axes. A user study shows that the language-driven interface is easier to use and imposes lower cognitive workload than programming-based and code-assist baselines. Evaluating learned policies on language-defined task families reveals generalization failures that are not apparent under fixed benchmark evaluations. Finally, we show that task diversity scales with contributor diversity rather than task count alone, enabling evaluation spaces to grow continuously through crowd-authored contributions. Project Page: this https URL
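
摘要称系统把自然语言指令编译为“含显式资产定义、初始化分布和成功谓词的可复现任务规范”。下面是一个假设性的 Python 草图(字段名与谓词形式均为虚构),示意这类结构化规范可能的形态:

```python
import random
from dataclasses import dataclass
from typing import Callable

# 假设性草图:含资产、初始化分布与成功谓词的结构化任务规范
# (字段名与谓词形式均为虚构,仅示意摘要描述的三要素)
@dataclass
class TaskSpec:
    assets: list                                   # 显式资产定义
    init_sampler: Callable[[random.Random], dict]  # 可复现的初始化分布
    success: Callable[[dict], bool]                # 成功谓词

# 例:"把红色方块叠到蓝色方块上" 的一个实例化(块高假定为 0.05)
spec = TaskSpec(
    assets=["red_block", "blue_block"],
    init_sampler=lambda rng: {
        "red_block": (rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2), 0.0),
        "blue_block": (rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2), 0.0),
    },
    success=lambda s: abs(s["red_block"][2] - s["blue_block"][2] - 0.05) < 0.01,
)

assert spec.init_sampler(random.Random(0)) == spec.init_sampler(random.Random(0))  # 同种子可复现
print(spec.success({"red_block": (0.0, 0.0, 0.05), "blue_block": (0.0, 0.0, 0.0)}))  # True
```

真实系统中这一规范由语言模型从自然语言指令编译而来,并在仿真器中实例化为一族相关任务;此处只示意其数据结构。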

[HC-26] Decision-Oriented Programming with Aporia

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在编程辅助中因过度抽象导致开发者决策权被隐性剥夺的问题,即开发者在使用 AI 代理时往往未能意识到关键设计决策已被代理自主做出,从而削弱了对代码实现的理解与控制。解决方案的核心在于提出一种“决策导向编程”(Decision-Oriented Programming, DOP)范式,其关键要素包括:(1) 将决策显式结构化,作为程序员与代理之间的共享媒介;(2) 通过代理主动提问实现决策的交互式协同编写;(3) 每个决策可追溯至具体代码实现。为验证该范式,作者构建了 Aporia 设计探针,它以可编辑的 Decision Bank 持久化记录决策、通过设计问题主动引导开发者参与,并将每个决策编码为可执行测试套件用于验证实现,实验表明该方法显著提升了开发者参与度并增强了其对代码实现的认知准确性。

链接: https://arxiv.org/abs/2604.05203
作者: Saketh Ram Kasibatla,Raven Rothkopf,Hila Peleg,Benjamin C. Pierce,Sorin Lerner,Harrison Goldstein,Nadia Polikarpova
机构: UC San Diego (加州大学圣地亚哥分校); Technion (以色列理工学院); University of Pennsylvania (宾夕法尼亚大学); Cornell University (康奈尔大学); University at Buffalo, SUNY (纽约州立大学布法罗分校)
类目: Human-Computer Interaction (cs.HC)
备注: 11 pages, 7 figures

点击查看摘要

Abstract:AI agents allow developers to express computational intent abstractly, reducing cognitive effort and helping achieve flow during programming. Increased abstraction, however, comes at a cost: developers cede decision-making authority to agents, often without realizing that important design decisions are being made without them. We aim to bring these decisions to the foreground in a paradigm we dub decision-oriented programming. In DOP, (1) decisions are explicit and structured, serving as the shared medium between the programmer and the agent; (2) decisions are co-authored interactively, with the agent proactively eliciting them from the programmer; and (3) each decision is traceable to code. As a step towards this vision, we have built Aporia, a design probe that tracks decisions in a persistent, editable Decision Bank; elicits them by asking programmers design questions; and encodes each decision as an executable test suite that can be used to validate the implementation. In a user study of 14 programmers, Aporia increased engagement in the design process and scaffolded both exploration and validation. Participants also gained a more accurate understanding of their implementations, with their mental models 5x less likely to disagree with the code than a baseline coding agent.

[HC-27] Investigating Ethical Data Communication with Purrsuasion: An Educational Game about Negotiated Data Disclosure

【速读】:该论文旨在解决数据通信中的伦理困境问题,即在情境约束下无法完全披露源数据时,如何促进数据提供者与数据请求者之间的透明、可信沟通。传统可视化研究常将伦理问题归因于个体设计者的欺骗性选择或受众的误解,但本文指出此类问题本质是利他型参与者间的协商过程。解决方案的关键在于提出名为 Purrsuasion 的开源可视化游戏,通过角色扮演(数据提供者和数据寻求者)模拟真实场景中的披露约束与信任建立机制,并基于学生游戏数据开展混合方法分析,最终构建一个启发式评分量表,用于评估可视化方案在社会技术语境下的披露合规性,从而支持对“满意解”(satisficing)现象的系统性理解与改进。

链接: https://arxiv.org/abs/2604.05200
作者: Krisha Mehta,Sami Elahi,Alex Kale
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Data communication entails ethical dilemmas where situational constraints forbid full disclosure of source data. Whereas visualization research and pedagogy often frames ethics as a matter of individuals making deceptive design choices or being misled, disclosure problems involve negotiation between pro-social actors. To provide observability into these situated judgments, we contribute Purrsuasion, an open-source visualization game where participants play the roles of (i) data providers designing visualizations subject to disclosure constraints and (ii) data seekers requesting information and awarding a contract. We deploy Purrsuasion in an undergraduate data science class (N = 27), gathering gameplay data to support a mixed-methods analysis of students’ communication dynamics, problem solving, and trust formation. We find that difficulties envisioning an ideal visualization solution lead to satisficing in visualization authoring and difficulties attributing authorial intent. Given these challenges, we approach scoring student solutions by developing a heuristic rubric that supports sociotechnical judgments of disclosure adherence.

[HC-28] Ghosting the Machine: Stop Calling Human-Agent Relations Parasocial

【速读】:该论文试图解决的问题是:当前学界在讨论人类与对话代理(Conversational Agents, CAs)的关系时,错误地给此类互动贴上“拟社会性”(parasocial)标签,从而导致对人机关系本质的误解。这种误用不仅混淆了理论概念,还引发科学层面的简化主义、变量设定偏差及效应误判,并进一步影响实践规范与伦理判断。论文指出,拟社会性本质上指单向、非辩证、由角色驱动、想象性、替代性、可预测且低投入的人-角色关系,而人类与CAs的互动具有社会性特征,应被正确认识为真实的社会关系。解决方案的关键在于摒弃对“拟社会性”术语的不当借用,转而承认并研究人机关系中的社会性(sociality),从而推动更严谨的理论建构与实践指导。

链接: https://arxiv.org/abs/2604.05197
作者: Jaime Banks
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:In discussions of human relations with conversational agents (CAs; e.g., voice assistants, AI companions, some social robots), they are increasingly referred to as parasocial. This is a misapplication of the term, heuristically taken up to mean “unreal.” In this provocation, I briefly account for the theoretical trajectory of parasociality and detail why it is inaccurate to apply the notion to human interactions with CAs. In short, “parasocial” refers to a human-character relations that are one-sided, non-dialectical, character-governed, imagined, vicarious, predictable, and low-effort; the term has been co-opted to instead refer to relations that are seen as unreal or invalid. The scientific problematics of this misapplication are nontrivial. They lead to oversimplification of complex phenomena, misspecified variables and misdiagnosed effects, and devaluation of human experiences. Those challenges, in turn, have downstream effects on norms and practice. It is scientifically, practically, and ethically imperative to recognize the sociality of human-agent relations.

[HC-29] From Use to Oversight: How Mental Models Influence User Behavior and Output in AI Writing Assistants

【速读】:该论文旨在解决用户对生成式 AI 写作助手(AI-based writing assistants)的使用行为与其心理模型(mental models)之间关系不明确的问题,特别是功能型心理模型(functional mental models,即用户对系统功能的认知)与结构性心理模型(structural mental models,即用户对系统工作机制的理解)如何影响其控制行为(如请求、接受或编辑 AI 建议)及写作结果。解决方案的关键在于通过预诱导不同类型的认知框架(priming),在实验中操纵参与者对写作助手的理解方式,从而揭示结构性心理模型虽提升用户对系统的感知可用性,却可能导致过度信任并降低对错误输出的批判性监督能力,进而引发写作质量下降——这一发现揭示了用户理解深度与控制行为之间的非线性关系,为设计更有效的交互机制提供了实证依据。

链接: https://arxiv.org/abs/2604.05166
作者: Shalaleh Rismani,Su Lin Blodgett,Q. Vera Liao,Alexandra Olteanu,AJung Moon
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI-based writing assistants are ubiquitous, yet little is known about how users’ mental models shape their use. We examine two types of mental models – functional or related to what the system does, and structural or related to how the system works – and how they affect control behavior – how users request, accept, or edit AI suggestions as they write – and writing outcomes. We primed participants (N = 48) with different system descriptions to induce these mental models before asking them to complete a cover letter writing task using a writing assistant that occasionally offered preconfigured ungrammatical suggestions to test whether the mental models affected participants’ critical oversight. We find that while participants in the structural mental model condition demonstrate a better understanding of the system, this can have a backfiring effect: while these participants judged the system as more usable, they also produced letters with more grammatical errors, highlighting a complex relationship between system understanding, trust, and control in contexts that require user oversight of error-prone AI outputs.

[HC-30] Active noise cancellation on open-ear smart glasses

【速读】:该论文旨在解决智能眼镜在嘈杂环境中音频交互质量差的问题,特别是由于其开放式耳道设计无法使用传统主动降噪(Active Noise Cancellation, ANC)技术的局限性。传统ANC依赖于耳道内或入口处的误差麦克风来测量抵消后的残留声音,而开放耳道结构不支持此类传感器部署。解决方案的关键在于开发了一种实时ANC系统,仅利用嵌入在眼镜框架中、由八个麦克风组成的分布式阵列估算耳部噪声,并通过微型开放式扬声器生成反向声波以实现环境噪声的实时抵消,从而在无需外部校准的情况下平均降低9.6 dB噪声,在进行简短用户个性化校准后可提升至11.2 dB。

链接: https://arxiv.org/abs/2604.05519
作者: Kuang Yuan,Freddy Yifei Liu,Tong Xiao,Yiwen Song,Chengyi Shen,Saksham Bhutani,Justin Chan,Swarun Kumar
机构: Carnegie Mellon University (卡内基梅隆大学); Carl von Ossietzky Universität Oldenburg (卡尔·冯·奥西茨基奥尔登堡大学); Zhejiang University (浙江大学)
类目: Audio and Speech Processing (eess.AS); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Smart glasses are becoming an increasingly prevalent wearable platform, with audio as a key interaction modality. However, hearing in noisy environments remains challenging because smart glasses are equipped with open-ear speakers that do not seal the ear canal. Furthermore, the open-ear design is incompatible with conventional active noise cancellation (ANC) techniques, which rely on an error microphone inside or at the entrance of the ear canal to measure the residual sound heard after cancellation. Here we present the first real-time ANC system for open-ear smart glasses that suppresses environmental noise using only microphones and miniaturized open-ear speakers embedded in the glasses frame. Our low-latency computational pipeline estimates the noise at the ear from an array of eight microphones distributed around the glasses frame and generates an anti-noise signal in real-time to cancel environmental noise. We develop a custom glasses prototype and evaluate it in a user study across 8 environments under mobility in the 100–1000 Hz frequency range, where environmental noise is concentrated. We achieve a mean noise reduction of 9.6 dB without any calibration, and 11.2 dB with a brief user-specific calibration.
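
摘要未给出具体降噪算法。下面用一个与论文实现无关的经典 LMS 自适应滤波草图,示意“由参考麦信号估计耳处噪声、再输出反相信号”这一 ANC 的一般思路(单通道、纯 Python,仅作教学演示):

```python
# 经典 LMS 自适应滤波草图(非论文算法):由参考信号估计目标噪声并输出反相信号
def lms_anti_noise(reference, desired, taps=4, mu=0.05):
    """reference: 参考麦采样序列;desired: 耳处噪声采样序列。
    返回各时刻的反噪声输出(噪声估计取负)。"""
    w = [0.0] * taps    # 滤波器权重
    buf = [0.0] * taps  # 输入滑动缓冲
    anti = []
    for x, d in zip(reference, desired):
        buf = [x] + buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w, buf))        # 对耳处噪声的估计
        e = d - y                                         # 残差(相当于误差麦信号)
        w = [wi + mu * e * xi for wi, xi in zip(w, buf)]  # LMS 权重更新
        anti.append(-y)                                   # 反相输出用于声学抵消
    return anti

# 玩具例子:参考路径为恒等时,收敛后 anti ≈ -desired
ref = [1.0 if n % 2 == 0 else -1.0 for n in range(300)]
anti = lms_anti_noise(ref, ref)
print(round(anti[-1], 3))  # 收敛后接近 1.0,即 -desired[-1]
```

真实系统还需处理多通道参考、次级路径(扬声器到耳的传递函数,FxLMS 一类方法)以及严格的实时延迟约束。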

Computer Vision

[CV-0] Action Images: End-to-End Policy Learning via Multiview Video Generation

Quick read: This paper addresses two core problems of existing world action models (WAMs) for robot policy learning: most methods rely on separate action modules or on action representations that are not pixel-grounded, making it hard to fully exploit the knowledge of pretrained video models, and this design limits policy transfer across viewpoints and environments. The key to the solution is Action Images, an interpretable, pixel-grounded action representation that translates 7-DoF robot actions into multi-view action videos which explicitly track robot-arm motion and are directly grounded in 2D pixel space. This representation lets the video backbone itself act as a zero-shot policy, with no separate policy head or action module, and supports several tasks under one unified model, including video-action joint generation, conditioned video generation, and action labeling, significantly improving zero-shot success rates and generation quality.

Link: https://arxiv.org/abs/2604.06168
Authors: Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Tsun-Hsuan Wang, Yi-Ling Qiao, Chuang Gan
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Project Page: this https URL

Abstract:World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model the future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.

[CV-1] HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models

Quick read: This paper tackles object hallucination in image descriptions produced by large vision-language models, i.e., descriptions that mention objects absent from the image. Prior work typically uses the model's attention weights on visual tokens as a detection signal, but the authors show such coarse-grained attention analysis is unreliable because of hidden confounders, specifically token position and object repetition in a description, leading to Simpson's paradox: attention trends reverse or disappear once statistics are aggregated. The key to the solution is HaloProbe, a Bayesian framework that factorizes external description statistics and internal decoding signals to estimate token-level hallucination probabilities; its core is balanced training to isolate internal evidence, combined with a learned prior over external features to recover the true posterior. Unlike intervention-based methods that modify model internals, HaloProbe serves as an external scoring signal for non-invasive mitigation, and experiments show it reduces hallucinations while better preserving utility.

Link: https://arxiv.org/abs/2604.06165
Authors: Reihaneh Zohrabi, Hosein Hasani, Akshita Gupta, Mahdieh Soleymani Baghshah, Anna Rohrbach, Marcus Rohrbach
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Large vision-language models can produce object hallucinations in image descriptions, highlighting the need for effective detection and mitigation strategies. Prior work commonly relies on the model’s attention weights on visual tokens as a detection signal. We reveal that coarse-grained attention-based analysis is unreliable due to hidden confounders, specifically token position and object repetition in a description. This leads to Simpson’s paradox: the attention trends reverse or disappear when statistics are aggregated. Based on this observation, we introduce HaloProbe, a Bayesian framework that factorizes external description statistics and internal decoding signals to estimate token-level hallucination probabilities. HaloProbe uses balanced training to isolate internal evidence and combines it with learned prior over external features to recover the true posterior. While intervention-based mitigation methods often degrade utility or fluency by modifying models’ internals, we use HaloProbe as an external scoring signal for non-invasive mitigation. Our experiments show that HaloProbe-guided decoding reduces hallucinations more effectively than state-of-the-art intervention-based methods while preserving utility.
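HaloProbe's combination of a learned prior over external features with internal decoding evidence is, at its core, an application of Bayes' rule. A minimal numeric sketch of that combination follows; the function and all numbers are hypothetical illustrations, not the paper's model:

```python
def hallucination_posterior(prior_h, like_h, like_ok):
    """Bayes' rule: combine a prior P(hallucinated) derived from external
    description statistics with class-conditional likelihoods of the
    internal decoding signal, yielding a token-level posterior."""
    num = prior_h * like_h
    return num / (num + (1.0 - prior_h) * like_ok)

# Hypothetical token: prior 0.2 from external features (e.g., a late
# position in the caption), and an internal signal that is 3x more
# likely under the hallucination class than under the faithful class.
posterior = hallucination_posterior(prior_h=0.2, like_h=0.6, like_ok=0.2)
```

Even a modest likelihood ratio more than doubles the posterior here, which is why separating a clean internal likelihood from confounded external statistics matters.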

[CV-2] DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models

Quick read: This paper aims to recover the high dynamic range (HDR) scene radiance lost to saturation and quantization in 8-bit low dynamic range (LDR) video, a loss that precludes accurate mapping to HDR displays and meaningful re-exposure in post-production. The key to the solution is formulating LDR-to-HDR conversion as generative radiance inpainting in the latent space of a video diffusion model, operating in Log-Gamma color space so the spatio-temporal generative priors of a pretrained video diffusion model can restore plausible detail in over- and underexposed regions and reconstruct continuous scene radiance for quantized pixels. The method further supports controllable LDR-to-HDR video conversion guided by text prompts or reference images, and mitigates the scarcity of paired data by synthesizing high-quality HDR video training data from static HDRI maps.

Link: https://arxiv.org/abs/2604.06161
Authors: Zhengming Yu, Li Ma, Mingming He, Leo Isikdogan, Yuancheng Xu, Dmitriy Smirnov, Pablo Salamanca, Dao Mi, Pablo Delgado, Ning Yu, Julien Philip, Xin Li, Wenping Wang, Paul Debevec
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Project page: this https URL

Abstract:Most digital videos are stored in 8-bit low dynamic range (LDR) formats, where much of the original high dynamic range (HDR) scene radiance is lost due to saturation and quantization. This loss of highlight and shadow detail precludes mapping accurate luminance to HDR displays and limits meaningful re-exposure in post-production workflows. Although techniques have been proposed to convert LDR images to HDR through dynamic range expansion, they struggle to restore realistic detail in the over- and underexposed regions. To address this, we present DiffHDR, a framework that formulates LDR-to-HDR conversion as a generative radiance inpainting task within the latent space of a video diffusion model. By operating in Log-Gamma color space, DiffHDR leverages spatio-temporal generative priors from a pretrained video diffusion model to synthesize plausible HDR radiance in over- and underexposed regions while recovering the continuous scene radiance of the quantized pixels. Our framework further enables controllable LDR-to-HDR video conversion guided by text prompts or reference images. To address the scarcity of paired HDR video data, we develop a pipeline that synthesizes high-quality HDR video training data from static HDRI maps. Extensive experiments demonstrate that DiffHDR significantly outperforms state-of-the-art approaches in radiance fidelity and temporal stability, producing realistic HDR videos with considerable latitude for re-exposure.

[CV-3] The Character Error Vector: Decomposable errors for page-level OCR evaluation

Quick read: This paper addresses the failure of the conventional character error rate (CER) under page-parsing errors in page-level optical character recognition (OCR) evaluation, which limits quantitative analysis of end-to-end document-understanding pipelines. The key to the solution is the Character Error Vector (CEV), a bag-of-characters OCR evaluator that decomposes total error into interpretable parsing, OCR, and interaction components. This decomposability lets practitioners pinpoint the stage of the document-understanding pipeline that most affects text-extraction quality. The CEV admits multiple implementations, demonstrated here as SpACER (Spatially Aware Character Error Rate) and a character-distribution method based on the Jensen-Shannon distance, and is validated against CER, against parse quality, and as a direct measure of page-level OCR quality, establishing it as a valuable bridge between parsing metrics and local metrics such as CER.

Link: https://arxiv.org/abs/2604.06160
Authors: Jonathan Bourne, Mwiza Simbeye, Joseph Nockels
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6643 words, 5 figures, 15 tables

Abstract:The Character Error Rate (CER) is a key metric for evaluating the quality of Optical Character Recognition (OCR). However, this metric assumes that text has been perfectly parsed, which is often not the case. Under page-parsing errors, CER becomes undefined, limiting its use as a metric and making evaluating page-level OCR challenging, particularly when using data that do not share a labelling schema. We introduce the Character Error Vector (CEV), a bag-of-characters evaluator for OCR. The CEV can be decomposed into parsing and OCR, and interaction error components. This decomposability allows practitioners to focus on the part of the Document Understanding pipeline that will have the greatest impact on overall text extraction quality. The CEV can be implemented using a variety of methods, of which we demonstrate SpACER (Spatially Aware Character Error Rate) and a Character distribution method using the Jensen-Shannon Distance. We validate the CEV’s performance against other metrics: first, the relationship with CER; then, parse quality; and finally, as a direct measure of page-level OCR quality. The validation process shows that the CEV is a valuable bridge between parsing metrics and local metrics like CER. We analyse a dataset of archival newspapers made of degraded images with complex layouts and find that state-of-the-art end-to-end models are outperformed by more traditional pipeline approaches. Whilst the CEV requires character-level positioning for optimal triage, thresholding on easily available values can predict the main error source with an F1 of 0.91. We provide the CEV as part of a Python library to support Document understanding research.
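One of the CEV implementations named in the abstract compares bag-of-characters distributions with the Jensen-Shannon distance. A minimal sketch of that comparison follows; the exact normalization and smoothing in the authors' Python library may differ:

```python
from collections import Counter
from math import log2

def char_distribution(text):
    """Bag-of-characters: normalized character frequencies of a page."""
    counts = Counter(text)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def js_distance(p, q):
    """Jensen-Shannon distance (square root of the JS divergence,
    base-2 logs) between two character distributions:
    0 = identical, 1 = disjoint supports."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0) + q.get(k, 0)) for k in keys}
    def kl(a):
        return sum(a.get(k, 0) * log2(a.get(k, 0) / m[k])
                   for k in keys if a.get(k, 0) > 0)
    return (0.5 * kl(p) + 0.5 * kl(q)) ** 0.5

ref = char_distribution("the quick brown fox")
hyp = char_distribution("the quick brown fox")   # perfect OCR output
bad = char_distribution("zzzz")                  # unrelated output
```

Because the comparison ignores character order, it stays well defined even when page parsing scrambles reading order, which is exactly the regime where CER breaks down.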

[CV-4] PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer CVPR

Quick read: This paper targets the high computational cost of self-attention-based Transformers on long sequences. The key to the solution is the Polynomial Mixer (PoM), a novel token-mixing mechanism that aggregates input tokens into a compact representation through a learned polynomial function, from which each token retrieves contextual information, giving linear-time token interaction. PoM is proven to satisfy the contextual mapping property, guaranteeing that Transformers which replace standard self-attention with PoM remain universal sequence-to-sequence approximators; it matches attention-based models across five domains (text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation) while drastically reducing cost on long sequences.

Link: https://arxiv.org/abs/2604.06129
Authors: David Picard, Nicolas Dufour, Lucas Degeorge, Arijit Ghosh, Davide Allegro, Tom Ravaud, Yohann Perron, Corentin Sautier, Zeynep Sonat Baltaci, Fei Meng, Syrine Kalleli, Marta López-Rauhut, Thibaut Loiseau, Ségolène Albouy, Raphael Baena, Elliot Vincent, Loic Landrieu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to CVPR Findings 2026

Abstract:This paper introduces the Polynomial Mixer (PoM), a novel token mixing mechanism with linear complexity that serves as a drop-in replacement for self-attention. PoM aggregates input tokens into a compact representation through a learned polynomial function, from which each token retrieves contextual information. We prove that PoM satisfies the contextual mapping property, ensuring that transformers equipped with PoM remain universal sequence-to-sequence approximators. We replace standard self-attention with PoM across five diverse domains: text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation. PoM matches the performance of attention-based models while drastically reducing computational cost when working with long sequences. The code is available at this https URL.
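The abstract describes PoM as pooling tokens through a learned polynomial function from which each token retrieves context. The exact parameterization is defined in the paper; the following is only an illustrative linear-time mixer in that spirit, where the projections, monomial pooling, and gating form are all assumptions made for the sketch:

```python
import numpy as np

def poly_mixer(tokens, Wq, Wv, degree=2):
    """Illustrative linear-time token mixing: pool elementwise monomials
    of projected tokens into a single global summary, then let every
    token read from it. Cost is O(N) in sequence length N, versus the
    O(N^2) pairwise interactions of self-attention."""
    v = tokens @ Wv                                              # (N, d)
    feats = np.concatenate([v**k for k in range(1, degree + 1)], axis=-1)
    summary = feats.mean(axis=0)                                 # (degree*d,)
    q = np.tanh(tokens @ Wq)                                     # per-token query
    d = q.shape[1]
    return q * summary[:d] + q**2 * summary[d:]                  # (N, d)

rng = np.random.default_rng(0)
N, d = 128, 16
X = rng.standard_normal((N, d))
Wq = rng.standard_normal((d, d)) / d**0.5
Wv = rng.standard_normal((d, d)) / d**0.5
Y = poly_mixer(X, Wq, Wv)
```

Note that mean-pooling makes the global summary permutation-invariant, so in practice such mixers rely on positional information injected elsewhere, much like attention does.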

[CV-5] Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery

Quick read: This paper addresses the representation gap that arises when applying RGB-pretrained vision-language models (VLMs) to thermal infrared imagery, i.e., how to transfer VLMs trained on visible-light images to thermal scenes for ecological monitoring. The key to the solution is a lightweight multimodal adaptation framework that uses multimodal projector alignment to transfer knowledge extracted from RGB visual representations to thermal radiometric inputs without altering the base model, substantially improving species recognition and instance enumeration; combining thermal imagery with simultaneously collected RGB imagery further enables habitat-context interpretation (land-cover characteristics, key landscape features, and signs of human disturbance), extending the models' utility in ecological monitoring.

Link: https://arxiv.org/abs/2604.06124
Authors: Hao Chen, Fang Qiu, Fangchao Dong, Defei Yang, Eve Bohnett, Li An
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:This study proposes a lightweight multimodal adaptation framework to bridge the representation gap between RGB-pretrained VLMs and thermal infrared imagery, and demonstrates its practical utility using a real drone-collected dataset. A thermal dataset was developed from drone-collected imagery and was used to fine-tune VLMs through multimodal projector alignment, enabling the transfer of information from RGB-based visual representations to thermal radiometric inputs. Three representative models, including InternVL3-8B-Instruct, Qwen2.5-VL-7B-Instruct, and Qwen3-VL-8B-Instruct, were benchmarked under both closed-set and open-set prompting conditions for species recognition and instance enumeration. Among the tested models, Qwen3-VL-8B-Instruct with open-set prompting achieved the best overall performance, with F1 scores of 0.935 for deer, 0.915 for rhino, and 0.968 for elephant, and within-1 enumeration accuracies of 0.779, 0.982, and 1.000, respectively. In addition, combining thermal imagery with simultaneously collected RGB imagery enabled the model to generate habitat-context information, including land-cover characteristics, key landscape features, and visible human disturbance. Overall, the findings demonstrate that lightweight projector-based adaptation provides an effective and practical route for transferring RGB-pretrained VLMs to thermal drone imagery, expanding their utility from object-level recognition to habitat-context interpretation in ecological monitoring.

[CV-6] SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation

Quick read: This paper addresses scalable generation of large outdoor driving scenes, i.e., how to generate large 3D scenes efficiently while preserving multi-view consistency. Existing methods either distill image or video generative models into 3D space, harming geometric consistency and restricting rendering to training views, or are limited to small-scale or object-centric generation. The key to the solution is a 3D generative framework built on a Σ-Voxfield grid, a structure in which each occupied voxel stores a fixed number of colorized surface samples to keep geometry and appearance consistent; a semantic-conditioned diffusion model operates on local voxel neighborhoods with 3D positional encodings to capture spatial structure; large scenes are extended via progressive spatial outpainting over overlapping regions; and a deferred rendering module produces high-quality, multi-view-consistent photorealistic images, enabling large-scale urban outdoor scene generation without per-scene optimization.

Link: https://arxiv.org/abs/2604.06113
Authors: Hiba Dahmani, Nathan Piasco, Moussab Bennehar, Luis Roldão, Dzmitry Tsishkou, Laurent Caraffa, Jean-Philippe Tarel, Roland Brémond
Institutions: Noah's Ark, Huawei Paris Research Center, France; COSYS, Gustave Eiffel University, France; LASTIG, IGN-ENSG, Gustave Eiffel University, France
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Scalable generation of outdoor driving scenes requires 3D representations that remain consistent across multiple viewpoints and scale to large areas. Existing solutions either rely on image or video generative models distilled to 3D space, harming the geometric coherence and restricting the rendering to training views, or are limited to small-scale 3D scene or object-centric generation. In this work, we propose a 3D generative framework based on \Sigma -Voxfield grid, a discrete representation where each occupied voxel stores a fixed number of colorized surface samples. To generate this representation, we train a semantic-conditioned diffusion model that operates on local voxel neighborhoods and uses 3D positional encodings to capture spatial structure. We scale to large scenes via progressive spatial outpainting over overlapping regions. Finally, we render the generated \Sigma -Voxfield grid with a deferred rendering module to obtain photorealistic images, enabling large-scale multiview-consistent 3D scene generation without per-scene optimization. Extensive experiments show that our approach can generate diverse large-scale urban outdoor scenes, renderable into photorealistic images with various sensor configurations and camera trajectories while maintaining moderate computation cost compared to existing approaches.

[CV-7] Extending ZACH-ViT to Robust Medical Imaging: Corruption and Adversarial Stress Testing in Low-Data Regimes CVPR2026

Quick read: This paper examines the robustness of medical image classification models in low-data regimes under common image corruptions (noise, blur, etc.) and adversarial perturbations. Its key contribution is a robustness-focused evaluation of ZACH-ViT, a compact, permutation-invariant Vision Transformer that drops the spatial assumptions encoded by fixed positional embeddings and a dedicated class token, making it better suited to biomedical images whose spatial structure is weakly informative or locally distributed. With 50 samples per class, ZACH-ViT performs best on clean data (mean rank 1.57) and maintains the same rank under common corruptions, showing favorable resilience to realistic degradation; under adversarial attack all models deteriorate substantially, yet ZACH-ViT remains comparatively strong (first under FGSM, second under PGD), suggesting the architecture aids stability in realistic clinical settings while adversarial robustness remains an open challenge for all evaluated models.

Link: https://arxiv.org/abs/2604.06099
Authors: Athanasios Angelakis, Marta Gomez-Barrero
Institutions: BioML Lab, RI CODE, UniBw, Munich, Germany; AUMC, Amsterdam, Netherlands
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2026 Workshop (PHAROS-AIF-MIH)

Abstract:The recently introduced ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer) formalized a compact permutation-invariant Vision Transformer for medical imaging and argued that architectural alignment with spatial structure can matter more than universal benchmark dominance. Its design was motivated by the observation that positional embeddings and a dedicated class token encode fixed spatial assumptions that may be suboptimal when spatial organization is weakly informative, locally distributed, or variable across biomedical images. The foundational study established a regime-dependent clean performance profile across MedMNIST, but did not examine robustness in detail. In this work, we present the first robustness-focused extension of ZACH-ViT by evaluating its behavior under common image corruptions and adversarial perturbations in the same low-data setting. We compare ZACH-ViT with three scratch-trained compact baselines, ABMIL, Minimal-ViT, and TransMIL, on seven MedMNIST datasets using 50 samples per class, fixed hyperparameters, and five random seeds. Across the benchmark, ZACH-ViT achieves the best overall mean rank on clean data (1.57) and under common corruptions (1.57), indicating a favorable balance between baseline predictive performance and robustness to realistic image degradation. Under adversarial stress, all models deteriorate substantially; nevertheless, ZACH-ViT remains competitive, ranking first under FGSM (2.00) and second under PGD (2.29), where ABMIL performs best overall. These results extend the original ZACH-ViT narrative: the advantages of compact permutation-invariant transformers are not limited to clean evaluation, but can persist under realistic perturbation stress in low-data medical imaging, while adversarial robustness remains an open challenge for all evaluated models.
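FGSM, one of the two attacks used in the stress test above, is the standard one-step sign-gradient perturbation. A minimal sketch on a toy linear model follows; the model, epsilon, and data are illustrative, not the paper's evaluation setup:

```python
import numpy as np

def fgsm(x, grad, eps=0.03):
    """One-step FGSM: perturb the input in the sign direction of the
    loss gradient, then clip back to the valid pixel range [0, 1]."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

# Toy linear "model": loss = w @ x, so the gradient w.r.t. x is w itself.
rng = np.random.default_rng(0)
x = rng.uniform(0.2, 0.8, size=16)   # a tiny "image" in [0, 1]
w = rng.standard_normal(16)
x_adv = fgsm(x, w)
```

The perturbation is bounded by eps in the L-infinity norm and, by construction, can only increase the linearized loss, which is what makes single-step FGSM a cheap but meaningful robustness probe (PGD iterates the same step with projection).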

[CV-8] Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning

Quick read: This paper addresses two core challenges in graphics program synthesis: existing image-TikZ corpora suffer from poor executability and unreliable visual alignment, and no benchmark evaluates both structural consistency and visual fidelity. The key to the solution is a closed-loop framework with two components: SciTikZ-230K, a dataset built by an execution-centric data engine covering 11 scientific disciplines with guaranteed quality and executability, and SciTikZ-Bench, a multifaceted benchmark spanning basic geometric constructs to intricate hierarchical schematics that verifies generated code both structurally and visually. The paper further introduces a Dual Self-Consistency Reinforcement Learning optimization paradigm that uses round-trip verification to penalize degenerate code and boost overall self-consistency; the resulting SciTikZer-8B model significantly outperforms proprietary models such as Gemini-2.5-Pro and much larger models such as Qwen3-VL-235B-A22B-Instruct.

Link: https://arxiv.org/abs/2604.06079
Authors: Juekai Lin, Yun Zhu, Honglin Lin, Sijing Li, Tianwei Lin, Zheng Liu, Xiaoyang Wang, Wenqiao Zhang, Lijun Wu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into editable TikZ code. While TikZ is the de facto standard for scientific schematics due to its programmatic flexibility, its requirement for rigorous spatial precision presents a significant challenge for Multimodal Large Language Models. Progress is currently stifled by two primary gaps: (1) Data Quality Gap: existing image-TikZ corpora often lack strict executability and reliable visual alignment; (2) Evaluation Gap: a lack of benchmarks for both structural and visual fidelity. To address these, we present a closed-loop framework featuring: SciTikZ-230K, a large-scale, high-quality dataset from our Execution-Centric Data Engine covering 11 diverse scientific disciplines; SciTikZ-Bench, a multifaceted benchmark spanning from basic geometric constructs to intricate hierarchical schematics to evaluate both visual fidelity and structural logic. To further broaden the scope of visual-code optimization methodology, we introduce a novel Dual Self-Consistency Reinforcement Learning optimization paradigm, which utilizes Round-Trip Verification to penalize degenerate code and boost overall self-consistency. Empowered by these, our trained model SciTikZer-8B achieves state-of-the-art performance, consistently outperforming proprietary giants like Gemini-2.5-Pro and massive models like Qwen3-VL-235B-A22B-Instruct.

[CV-9] Graph-PiT: Enhancing Structural Coherence in Part-Based Image Synthesis via Graph Priors ICME2026

Quick read: This paper addresses the lack of fine-grained, structurally sound control in existing part-based image generation frameworks: user-provided visual parts are treated as an unordered set, ignoring their intrinsic spatial and semantic relationships, which often yields structurally inconsistent or implausible compositions. The key to the solution is Graph-PiT, which introduces a graph prior to explicitly model structural dependencies among visual components: parts are represented as nodes and spatial-semantic relations as edges, and a Hierarchical Graph Neural Network (HGNN) module performs bidirectional message passing between coarse part-level super-nodes and fine-grained IP+ token sub-nodes to refine part embeddings before they enter the generative pipeline. A graph Laplacian smoothness loss and an edge-reconstruction loss further give adjacent parts compatible, relation-aware embeddings, markedly improving the structural coherence of generated images.

Link: https://arxiv.org/abs/2604.06074
Authors: Junbin Zhang, Meng Cao, Feng Tan, Yikai Lin, Yuexian Zou
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 11 pages, 5 figures, Accepted by ICME 2026

Abstract:Achieving fine-grained and structurally sound controllability is a cornerstone of advanced visual generation. Existing part-based frameworks treat user-provided parts as an unordered set and therefore ignore their intrinsic spatial and semantic relationships, which often results in compositions that lack structural integrity. To bridge this gap, we propose Graph-PiT, a framework that explicitly models the structural dependencies of visual components using a graph prior. Specifically, we represent visual parts as nodes and their spatial-semantic relationships as edges. At the heart of our method is a Hierarchical Graph Neural Network (HGNN) module that performs bidirectional message passing between coarse-grained part-level super-nodes and fine-grained IP+ token sub-nodes, refining part embeddings before they enter the generative pipeline. We also introduce a graph Laplacian smoothness loss and an edge-reconstruction loss so that adjacent parts acquire compatible, relation-aware embeddings. Quantitative experiments on controlled synthetic domains (character, product, indoor layout, and jigsaw), together with qualitative transfer to real web images, show that Graph-PiT improves structural coherence over vanilla PiT while remaining compatible with the original IP-Prior pipeline. Ablation experiments confirm that explicit relational reasoning is crucial for enforcing user-specified adjacency constraints. Our approach not only enhances the plausibility of generated concepts but also offers a scalable and interpretable mechanism for complex, multi-part image synthesis. The code is available at this https URL.

[CV-10] EDGE-Shield: Efficient Denoising-staGE Shield for Violative Content Filtering via Scalable Reference-Based Matching

Quick read: This paper addresses the copyright-infringement and deepfake risks posed by text-to-image generation, in particular the poor scalability of existing reference-based training-free content filters when handling many references and their need to wait for image generation to finish before filtering. The key to the solution is EDGE-Shield, which uses embedding-based matching for efficient reference comparison and introduces an x-pred transformation that converts the noisy intermediate latent during denoising into a pseudo-estimated clean latent at later stages, improving classification accuracy for violative content at earlier denoising stages while keeping latency practical: processing time is reduced by roughly 79% on Z-Image-Turbo and roughly 50% on Qwen-Image without sacrificing filtering accuracy.

Link: https://arxiv.org/abs/2604.06063
Authors: Takara Taniguchi, Ryohei Shimizu, Minh-Duc Vo, Kota Izumi, Shiqi Yang, Teppei Suzuki
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:The advent of Text-to-Image generative models poses significant risks of copyright violation and deepfake generation. Since the rapid proliferation of new copyrighted works and private individuals constantly emerges, reference-based training-free content filters are essential for providing up-to-date protection without the constraints of a fixed knowledge cutoff. However, existing reference-based approaches often lack scalability when handling numerous references and require waiting for finishing image generation. To solve these problems, we propose EDGE-Shield, a scalable content filter during the denoising process that maintains practical latency while effectively blocking violative content. We leverage embedding-based matching for efficient reference comparison. Additionally, we introduce an \textit x -pred transformation that converts the model’s noisy intermediate latent into the pseudo-estimated clean latent at the later stage, enhancing classification accuracy of violative content at earlier denoising stages. We conduct experiments of violative content filtering against two generative models including Z-Image-Turbo and Qwen-Image. EDGE-Shield significantly outperforms traditional reference-based methods in terms of latency; it achieves an approximate 79% reduction in processing time for Z-Image-Turbo and approximate 50% reduction for Qwen-Image, maintaining the filtering accuracy across different model architectures.
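The x-pred transformation converts a noisy intermediate latent into a pseudo-estimated clean latent. Under the standard DDPM epsilon-prediction convention this is a closed-form rearrangement of the forward process; the paper may use a different parameterization, so the sketch below only illustrates the common one:

```python
import numpy as np

def x_pred(x_t, eps_hat, alpha_bar_t):
    """Pseudo-estimated clean latent under the standard DDPM epsilon
    parameterization x_t = sqrt(a) * x0 + sqrt(1 - a) * eps:
    solve for x0 given the model's noise prediction eps_hat."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))     # "true" clean latent
eps = rng.standard_normal((4, 8))    # noise actually added
a = 0.3                              # cumulative signal level alpha_bar_t
x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
x0_hat = x_pred(x_t, eps, a)         # exact when eps_hat equals eps
```

In practice eps_hat is the denoiser's imperfect prediction, so x0_hat is only an estimate, but it is already clean enough at later timesteps to run embedding-based matching against the reference set before generation completes.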

[CV-11] Attention May I Have Your Decision? Localizing Generative Choices in Diffusion Models CVPR2026

Quick read: This paper investigates how text-to-image diffusion models make implicit internal decisions when handling under-specified prompts, i.e., when the prompt does not fully determine the generated content and the model must fill in details on its own, and asks where these decisions live in the architecture. To reveal this mechanism, the authors propose a probing-based localization technique that identifies the layers with the highest attribute separability for a concept, finding that these implicit choices are governed chiefly by self-attention layers rather than the explicit conditioning layers targeted by prior intervention methods. Accordingly, the key solution is ICM (Implicit Choice-Modification), which applies targeted interventions to a small subset of self-attention layers, achieving efficient, low-artifact debiasing that clearly outperforms existing state-of-the-art methods.

Link: https://arxiv.org/abs/2604.06052
Authors: Katarzyna Zaleska, Łukasz Popek, Monika Wysoczańska, Kamil Deja
Institutions: Warsaw University of Technology; valeo.ai; IDEAS Research Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026

Abstract:Text-to-image diffusion models exhibit remarkable generative capabilities, yet their internal operations remain opaque, particularly when handling prompts that are not fully descriptive. In such scenarios, models must make implicit decisions to generate details not explicitly specified in the text. This work investigates the hypothesis that this decision-making process is not diffuse but is computationally localized within the model’s architecture. While existing localization techniques focus on prompt-related interventions, we notice that such explicit conditioning may differ from implicit decisions. Therefore, we introduce a probing-based localization technique to identify the layers with the highest attribute separability for concepts. Our findings indicate that the resolution of ambiguous concepts is governed principally by self-attention layers, identifying them as the most effective point for intervention. Based on this discovery, we propose ICM (Implicit Choice-Modification) - a precise steering method that applies targeted interventions to a small subset of layers. Extensive experiments confirm that intervening on these specific self-attention layers yields superior debiasing performance compared to existing state-of-the-art methods, minimizing artifacts common to less precise approaches. The code is available at this https URL.

[CV-12] CoStream: Codec-Guided Resource-Efficient System for Video Streaming Analytics

Quick read: This paper addresses the scalability bottleneck of video streaming analytics in vision-language model serving caused by the high cost of multimodal inference. Existing methods optimize only the vision transformer (ViT) or only the LLM, failing to exploit temporal and spatial redundancy end to end, and they rely on offline profiling/training or costly online computation to identify redundancy, making them ill-suited to dynamic real-time streams. The key to the solution is CoStream, which treats the spatio-temporal structure that video codecs already extract as a byproduct of compression as a low-cost runtime signal to jointly optimize video decoding, visual processing, and LLM prefilling: codec-guided patch pruning reduces computation before ViT encoding, and the key-value cache is selectively refreshed during LLM prefilling, both fully online with no offline training. Because operations act directly on compressed bitstreams, transmission reduction and compute savings compound; experiments show up to 3x throughput and up to 87% GPU compute reduction over state-of-the-art baselines with only a 0-8% F1 drop.

Link: https://arxiv.org/abs/2604.06036
Authors: Yulin Zou, Yan Chen, Wenyan Chen, JooYoung Park, Shivaraman Nitin, Luo Tao, Francisco Romero, Dmitrii Ustiugov
Institutions: Nanyang Technological University; Beihang University; Institute of High Performance Computing; Georgia Institute of Technology
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 18 pages, 34 figures

Abstract:Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or costly online computation, making them ill-suited for dynamic real-time streams. We present CoStream, a codec-guided streaming video analytics system built on a key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CoStream treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with transmission reduction as an inherent benefit of operating directly on compressed bitstreams. This drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and do not require offline training. Experiments show that CoStream achieves up to 3x throughput improvement and up to 87% GPU compute reduction over state-of-the-art baselines, while maintaining competitive accuracy with only 0-8% F1 drop.
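Codec-guided patch pruning can be pictured as thresholding the per-patch motion-vector magnitudes that the codec has already computed: static patches reuse cached features instead of being re-encoded. The deliberately minimal sketch below is illustrative only; the actual signal, threshold policy, and cache management in CoStream are more involved:

```python
import numpy as np

def prune_patches(motion_mags, threshold=0.5):
    """Return indices of patches whose codec motion-vector magnitude
    exceeds the threshold; only these would be re-encoded by the ViT,
    while the rest reuse cached features from the previous frame."""
    return np.flatnonzero(motion_mags > threshold)

# 8 patches of one frame: three moved noticeably, five are near-static.
mags = np.array([0.0, 0.9, 0.1, 0.0, 1.4, 0.2, 0.7, 0.05])
active = prune_patches(mags)   # patches worth re-encoding this frame
```

The appeal of this family of heuristics is that the signal is free at runtime: the decoder must parse motion vectors anyway, so redundancy identification adds no extra model inference.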

[CV-13] Toward Aristotelian Medical Representations: Backpropagation-Free Layer-wise Analysis for Interpretable Generalized Metric Learning on MedMNIST

Quick read: This paper addresses the barrier to clinical adoption of deep learning in medical imaging posed by the black-box nature of backpropagation-based models. The key to the solution is Aristotelian Rapid Object Modeling (A-ROM), a framework built on the Platonic Representation Hypothesis (PRH): it exploits the generalizable metric space of pretrained Vision Transformers (ViTs) to rapidly model novel medical concepts, and replaces the traditionally opaque decision layers with a human-readable concept dictionary and a k-Nearest Neighbors (kNN) classifier, achieving competitive performance while markedly improving transparency and interpretability.

Link: https://arxiv.org/abs/2604.06017
Authors: Michael Karnes, Alper Yilmaz
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While deep learning has achieved remarkable success in medical imaging, the “black-box” nature of backpropagation-based models remains a significant barrier to clinical adoption. To bridge this gap, we propose Aristotelian Rapid Object Modeling (A-ROM), a framework built upon the Platonic Representation Hypothesis (PRH). This hypothesis posits that models trained on vast, diverse datasets converge toward a universal and objective representation of reality. By leveraging the generalizable metric space of pretrained Vision Transformers (ViTs), A-ROM enables the rapid modeling of novel medical concepts without the computational burden or opacity of further gradient-based fine-tuning. We replace traditional, opaque decision layers with a human-readable concept dictionary and a k-Nearest Neighbors (kNN) classifier to ensure the model’s logic remains interpretable. Experiments on the MedMNIST v2 suite demonstrate that A-ROM delivers performance competitive with standard benchmarks while providing a simple and scalable, “few-shot” solution that meets the rigorous transparency demands of modern clinical environments.
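A-ROM's interpretable decision layer, a kNN vote over a concept dictionary of frozen embeddings, can be sketched in a few lines. The embeddings and labels below are synthetic, and cosine distance is an assumption made for the sketch; the paper's dictionary construction is richer:

```python
import numpy as np

def knn_predict(query, dict_embs, dict_labels, k=3):
    """Majority vote among the k cosine-nearest entries of a
    human-readable concept dictionary built from frozen embeddings;
    no gradient-based fine-tuning is involved."""
    q = query / np.linalg.norm(query)
    d = dict_embs / np.linalg.norm(dict_embs, axis=1, keepdims=True)
    top = np.argsort(-(d @ q))[:k]           # k highest cosine similarities
    votes = [dict_labels[i] for i in top]
    return max(set(votes), key=votes.count)

rng = np.random.default_rng(0)
# Synthetic 2-concept dictionary: embeddings cluster by class.
a = rng.normal(loc=1.0, size=(10, 32))       # e.g., "benign" cluster
b = rng.normal(loc=-1.0, size=(10, 32))      # e.g., "malignant" cluster
embs = np.vstack([a, b])
labels = ["benign"] * 10 + ["malignant"] * 10
pred = knn_predict(a.mean(axis=0), embs, labels)
```

Because the decision is a vote over named dictionary entries, each prediction can be explained by listing the k concepts that supported it, which is the transparency property the paper emphasizes.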

[CV-14] OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control

Quick read: This paper addresses the entanglement of scene content and camera motion in existing video generation models, which prevents independent control of the two. The core of the solution is OmniCamera, which explicitly disentangles these two dimensions to enable flexible video generation with arbitrary pairings of camera and content conditions. Key innovations include the OmniCAM hybrid dataset (combining curated real-world videos with synthetic data) and a Dual-level Curriculum Co-Training strategy that, at the condition level, progressively introduces control modalities by difficulty and, at the data level, first learns precise control on synthetic data before adapting to real data for visual fidelity, effectively mitigating modality conflict and improving the robustness of multi-task learning.

Link: https://arxiv.org/abs/2604.06010
Authors: Yukun Wang, Ruihuang Li, Jiale Tao, Shiyuan Yang, Liyi Chen, Zhantao Yang, Handz, Yulan Guo, Shuai Shao, Qinglin Lu
Institutions: Tencent Hunyuan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video fundamentally intertwines two crucial axes: the dynamic content of a scene and the camera motion through which it is observed. However, existing generation models often entangle these factors, limiting independent control. In this work, we introduce OmniCamera, a unified framework designed to explicitly disentangle and command these two dimensions. This compositional approach enables flexible video generation by allowing arbitrary pairings of camera and content conditions, unlocking unprecedented creative control. To overcome the fundamental challenges of modality conflict and data scarcity inherent in such a system, we present two key innovations. First, we construct OmniCAM, a novel hybrid dataset combining curated real-world videos with synthetic data that provides diverse paired examples for robust multi-task learning. Second, we propose a Dual-level Curriculum Co-Training strategy that mitigates modality interference and synergistically learns from diverse data sources. This strategy operates on two levels: first, it progressively introduces control modalities by difficulties (condition-level), and second, trains for precise control on synthetic data before adapting to real data for photorealism (data-level). As a result, OmniCamera achieves state-of-the-art performance, enabling flexible control for complex camera movements while maintaining superior visual quality.

[CV-15] HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation

【速读】:该论文旨在解决生成式视频扩散模型在人类动作动态与物理特性建模上的不足,即当前方法难以忠实还原人体运动的时空一致性与细节真实性(如衣物褶皱等运动依赖性特征)。其解决方案的关键在于提出HumANDiff框架,通过三项核心设计实现可控且高保真的真人视频生成:1)基于人体结构先验的关节一致噪声采样(Articulated motion-consistent noise sampling),将无结构的随机高斯噪声替换为在统计人体模板表面流形上采样的三维关节噪声,确保时空一致性;2)联合外观-运动学习(Joint appearance-motion learning),在标准扩散目标基础上联合预测像素外观和对应物理运动,提升视频合成质量;3)几何运动一致性学习(Geometric motion consistency learning),在关节噪声空间中定义新的几何运动一致性损失,强制帧间运动符合物理规律。该方法无需修改扩散模型架构,仅需微调即可实现端到端图像到视频生成及内在运动控制。

链接: https://arxiv.org/abs/2604.05961
作者: Tao Hu,Varun Jampani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Despite tremendous recent progress in human video generation, generative video diffusion models still struggle to capture the dynamics and physics of human motions faithfully. In this paper, we propose a new framework for human video generation, HumANDiff, which enhances the human motion control with three key designs: 1) Articulated motion-consistent noise sampling that correlates the spatiotemporal distribution of latent noise and replaces the unstructured random Gaussian noise with 3D articulated noise sampled on the dense surface manifold of a statistical human body template. It inherits body topology priors for spatially and temporally consistent noise sampling. 2) Joint appearance-motion learning that enhances the standard training objective of video diffusion models by jointly predicting pixel appearances and corresponding physical motions from the articulated noises. It enables high-fidelity human video synthesis, e.g., capturing motion-dependent clothing wrinkles. 3) Geometric motion consistency learning that enforces physical motion consistency across frames via a novel geometric motion consistency loss defined in the articulated noise space. HumANDiff enables scalable controllable human video generation by fine-tuning video diffusion models with articulated noise sampling. Consequently, our method is agnostic to diffusion model design, and requires no modifications to the model architecture. During inference, HumANDiff enables image-to-video generation within a single framework, achieving intrinsic motion control without requiring additional motion modules. Extensive experiments demonstrate that our method achieves state-of-the-art performance in rendering motion-consistent, high-fidelity humans with diverse clothing styles. Project page: this https URL
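"在人体模板表面流形上采样关节噪声"的基本操作,是在三角网格上做重心坐标采样并在帧间复用同一组样本。以下为示意性实现(纯 Python;为简洁起见对三角形做均匀随机选择,未按面积加权,也未对应论文的具体噪声构造):

```python
import random

def sample_surface_noise(triangles, n, seed=0):
    """在三角网格表面采 n 个 (位置, 噪声值) 对;固定随机种子以保证帧间一致。"""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        a, b, c = rng.choice(triangles)      # 简化:未做面积加权
        u, v = rng.random(), rng.random()
        if u + v > 1.0:                      # 折回三角形内部,保证重心坐标有效
            u, v = 1.0 - u, 1.0 - v
        w = 1.0 - u - v
        pos = tuple(u * a[i] + v * b[i] + w * c[i] for i in range(3))
        samples.append((pos, rng.gauss(0.0, 1.0)))
    return samples
```

同一种子下两次采样结果完全一致,这正是"时空一致噪声"所需的性质。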

[CV-16] Multi-Modal Landslide Detection from Sentinel-1 SAR and Sentinel-2 Optical Imagery Using Multi-Encoder Vision Transformers and Ensemble Learning

【速读】:该论文旨在解决滑坡灾害监测中准确、及时检测的难题,以支持灾害风险减缓。其核心挑战在于如何在无灾前Sentinel-2数据的情况下实现高精度的滑坡识别,并有效融合光学与雷达遥感数据的优势。解决方案的关键在于提出一个模块化、多模型集成框架,通过双编码器视觉Transformer分别处理Sentinel-2光学影像和Sentinel-1合成孔径雷达(SAR)数据,结合光谱指数(如NDVI)增强植被与地表变化敏感性,并采用神经网络与梯度提升模型(LightGBM和XGBoost)的集成学习策略,显著提升分类性能。该方法在patch级分类任务中达到F1分数0.919,优于传统像素级分割方式,且无需灾前影像,在非经典变化检测场景下仍具鲁棒性,同时具备良好的可扩展性和操作适用性。

链接: https://arxiv.org/abs/2604.05959
作者: Ioannis Nasios
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Landslides represent a major geohazard with severe impacts on human life, infrastructure, and ecosystems, underscoring the need for accurate and timely detection approaches to support disaster risk reduction. This study proposes a modular, multi-model framework that fuses Sentinel-2 optical imagery with Sentinel-1 Synthetic Aperture Radar (SAR) data, for robust landslide detection. The methodology leverages multi-encoder vision transformers, where each data modality is processed through separate lightweight pretrained encoders, achieving strong performance in landslide detection. In addition, the integration of multiple models, particularly the combination of neural networks and gradient boosting models (LightGBM and XGBoost), demonstrates the power of ensemble learning to further enhance accuracy and robustness. Derived spectral indices, such as NDVI, are integrated alongside original bands to enhance sensitivity to vegetation and surface changes. The proposed methodology achieves a state-of-the-art F1 score of 0.919 on landslide detection, addressing a patch-based classification task rather than pixel-level segmentation and operating without pre-event Sentinel-2 data, highlighting its effectiveness in a non-classical change detection setting. It also demonstrated top performance in a machine learning competition, achieving a strong balance between precision and recall and highlighting the advantages of explicitly leveraging the complementary strengths of optical and radar data. The conducted experiments and research also emphasize scalability and operational applicability, enabling flexible configurations with optical-only, SAR-only, or combined inputs, and offering a transferable framework for broader natural hazard monitoring and environmental change applications. Full training and inference code can be found in this https URL.
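摘要中提到的光谱指数 NDVI 与多模型集成,可用如下极简 Python 示意(波段反射率数值与集成权重均为假设,并非论文的实际配置):

```python
def ndvi(nir: float, red: float, eps: float = 1e-9) -> float:
    """归一化植被指数 NDVI = (NIR - Red) / (NIR + Red),对应 Sentinel-2 的 B8/B4 波段。"""
    return (nir - red) / (nir + red + eps)

def ensemble_prob(probs, weights=None):
    """对多个模型(如神经网络、LightGBM、XGBoost)的 patch 级滑坡概率做加权平均。"""
    if weights is None:
        weights = [1.0] * len(probs)
    total = sum(weights)
    return sum(p * w for p, w in zip(probs, weights)) / total
```

实际系统中,NDVI 等衍生指数作为附加通道与原始波段拼接后送入编码器,集成权重通常在验证集上调优。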

[CV-17] Mixture-of-Modality-Experts with Holistic Token Learning for Fine-Grained Multimodal Visual Analytics in Driver Action Recognition

【速读】:该论文旨在解决多模态视觉分析中因异构模态提供互补但依赖输入证据而导致的鲁棒性不足问题,现有方法通常依赖固定融合模块或预定义的跨模态交互机制,难以适应模态可靠性变化并捕捉细粒度动作线索。其解决方案的关键在于提出一种基于混合模态专家(Mixture-of-Modality-Experts, MoME)框架与整体令牌学习(Holistic Token Learning, HTL)策略相结合的方法:MoME实现模态特异性专家间的自适应协作,而HTL通过类别令牌(class tokens)和时空令牌(spatio-temporal tokens)提升专家内部精炼与专家间知识迁移能力,从而构建以知识为中心的多模态学习范式,在增强专家专业化的同时降低多模态歧义性。

链接: https://arxiv.org/abs/2604.05947
作者: Tianyi Liu,Yiming Li,Wenqian Wang,Jiaojiao Wang,Chen Cai,Yi Wang,Kim-Hui Yap
机构: Nanyang Technological University (南洋理工大学); Singapore University of Technology and Design (新加坡科技设计大学); Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 3 figures

点击查看摘要

Abstract:Robust multimodal visual analytics remains challenging when heterogeneous modalities provide complementary but input-dependent evidence. Existing multimodal learning methods mainly rely on fixed fusion modules or predefined cross-modal interactions, which are often insufficient to adapt to changing modality reliability and to capture fine-grained action cues. To address this issue, we propose a Mixture-of-Modality-Experts (MoME) framework with a Holistic Token Learning (HTL) strategy. MoME enables adaptive collaboration among modality-specific experts, while HTL improves both intra-expert refinement and inter-expert knowledge transfer through class tokens and spatio-temporal tokens. In this way, our method forms a knowledge-centric multimodal learning framework that improves expert specialization while reducing multimodal ambiguity. We validate the proposed framework on driver action recognition as a representative multimodal understanding task. The experimental results on the public benchmark show that the proposed MoME framework and the HTL strategy jointly outperform representative single-modal and multimodal baselines. Additional ablation, validation, and visualization results further verify that the proposed HTL strategy improves subtle multimodal understanding and offers better interpretability.
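MoME 中"模态专家自适应协作"的核心机制是门控加权融合,下面给出一个纯 Python 的示意(门控取简单 softmax,专家输出为假设的特征向量,并非论文的具体网络结构):

```python
import math

def softmax(xs):
    """数值稳定的 softmax:先减去最大值再指数归一化。"""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mome_fuse(expert_outputs, gate_logits):
    """按门控权重对各模态专家的特征向量逐维加权求和。"""
    gates = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [sum(g * out[d] for g, out in zip(gates, expert_outputs))
            for d in range(dim)]
```

当某一模态可靠性下降时,门控 logits 随输入变化即可降低该专家的贡献,这正是"自适应协作"的直观含义。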

[CV-18] Leveraging Image Editing Foundation Models for Data-Efficient CT Metal Artifact Reduction CVPR

【速读】:该论文旨在解决高衰减植入物引起的金属伪影严重降低CT图像质量的问题,这些问题会掩盖关键解剖结构,并且传统深度学习方法因需要大量配对训练数据而难以应用。其解决方案的关键在于将去伪影任务重新定义为上下文感知推理问题,通过低秩适配(LoRA)技术高效微调通用视觉-语言扩散基础模型(vision-language diffusion foundation model),从而仅用16至128对训练样本即可实现有效的伪影抑制,数据需求减少两个数量级。此外,为避免模型产生幻觉(如将条纹伪影误认为自然物体),作者强调领域自适应的重要性,并提出多参考条件策略,利用无关受试者的清洁解剖示例作为上下文引导,使模型能够基于类别特定信息推断未受损的解剖结构,最终在AAPM CT-MAR基准上达到感知和放射学特征指标的最先进性能。

链接: https://arxiv.org/abs/2604.05934
作者: Ahmet Rasim Emirdagi,Süleyman Aslan,Mısra Yavuz,Görkay Aydemir,Yunus Bilge Kurt,Nasrin Rahimi,Burak Can Biner,M. Akın Yılmaz
机构: Codeway AI Research
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted to CVPRW 2026 Med-Reasoner

点击查看摘要

Abstract:Metal artifacts from high-attenuation implants severely degrade CT image quality, obscuring critical anatomical structures and posing a challenge for standard deep learning methods that require extensive paired training data. We propose a paradigm shift: reframing artifact reduction as an in-context reasoning task by adapting a general-purpose vision-language diffusion foundation model via parameter-efficient Low-Rank Adaptation (LoRA). By leveraging rich visual priors, our approach achieves effective artifact suppression with only 16 to 128 paired training examples, reducing data requirements by two orders of magnitude. Crucially, we demonstrate that domain adaptation is essential for hallucination mitigation; without it, foundation models interpret streak artifacts as erroneous natural objects (e.g., waffles or petri dishes). To ground the restoration, we propose a multi-reference conditioning strategy where clean anatomical exemplars from unrelated subjects are provided alongside the corrupted input, enabling the model to exploit category-specific context to infer uncorrupted anatomy. Extensive evaluation on the AAPM CT-MAR benchmark demonstrates that our method achieves state-of-the-art performance on perceptual and radiological-feature metrics. This work establishes that foundation models, when appropriately adapted, offer a scalable alternative for interpretable, data-efficient medical image reconstruction. Code is available at this https URL.
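论文所用的 LoRA 低秩适配,其前向计算可用纯 Python 示意如下(冻结权重 W 加上可训练的低秩增量 B·A;矩阵维度、alpha 与 rank 的取值均为示例):

```python
def matvec(M, v):
    """矩阵乘向量:M 为行向量列表。"""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha=1.0, rank=1):
    """y = W x + (alpha/rank) * B (A x):基座权重 W 冻结,仅训练低秩矩阵 A、B。"""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]
```

常见做法是将 B 初始化为零,使微调起点与原模型输出完全一致,再逐步学习增量。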

[CV-19] SonoSelect: Efficient Ultrasound Perception via Active Probe Exploration

【速读】:该论文旨在解决超声成像中因盲目采集大量视图而导致的冗余扫描与高成本问题,其核心挑战在于如何在保证诊断准确性的同时,高效地选择最具信息量的探头视角。解决方案的关键在于提出SonoSelect方法,将主动视图探索建模为一个序列决策问题:通过构建3D空间记忆来融合已获取的2D超声图像,并基于此动态指导下一步探头位置移动;同时设计了针对超声特性的优化目标,优先选择能提升器官覆盖范围、降低重建不确定性且减少重复扫描的探头动作。实验表明,该方法仅需使用N个视图中的2个即可实现高精度多器官分类,且在肾囊肿检测任务中实现了显著的目标区域覆盖率。

链接: https://arxiv.org/abs/2604.05933
作者: Yixin Zhang,Yunzhong Hou,Longqi Li,Zhenyue Qin,Yang Liu,Yue Yao
机构: Shandong University (山东大学); Beijing Institute of Technology (北京理工大学); Yale University (耶鲁大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ultrasound perception typically requires multiple scan views through probe movement to reduce diagnostic ambiguity, mitigate acoustic occlusions, and improve anatomical coverage. However, not all probe views are equally informative. Exhaustively acquiring a large number of views can introduce substantial redundancy, increase scanning and processing costs. To address this, we define an active view exploration task for ultrasound and propose SonoSelect, an ultrasound-specific method that adaptively guides probe movement based on current observations. Specifically, we cast ultrasound active view exploration as a sequential decision-making problem. Each new 2D ultrasound view is fused into a 3D spatial memory of the observed anatomy, which guides the next probe position. On top of this formulation, we propose an ultrasound-specific objective that favors probe movements with greater organ coverage, lower reconstruction uncertainty, and less redundant scanning. Experiments on the ultrasound simulator show that SonoSelect achieves promising multi-view organ classification accuracy using only 2 out of N views. Furthermore, for a more difficult kidney cyst detection task, it reaches 54.56% kidney coverage and 35.13% cyst coverage, with short trajectories consistently centered on the target cyst.
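SonoSelect 序列决策目标中的"覆盖增益最大化"思路,可用贪心式的下一最佳视角选择做一个示意(各视角的覆盖体素集合为假设输入;论文实际目标还包含重建不确定性与冗余扫描惩罚项):

```python
def select_views(view_coverage, budget=2):
    """每步选择对已覆盖体素边际增益最大的视角,直到达到视角预算。"""
    covered, chosen = set(), []
    for _ in range(budget):
        remaining = [v for v in view_coverage if v not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda v: len(view_coverage[v] - covered))
        chosen.append(best)
        covered |= view_coverage[best]
    return chosen, covered
```

这类次模(submodular)覆盖目标下,贪心策略通常能给出有质量保证的近似解。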

[CV-20] Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning

【速读】:该论文旨在解决成功表示(Successor Representations, SR)在视觉无监督强化学习(Unsupervised Reinforcement Learning, URL)中难以实现零样本泛化的问题,尤其针对高维视觉环境下的两个关键局限:一是SR目标常导致次优表示,关注与动态无关区域,从而造成 successor measure 准确性下降并削弱任务泛化能力;二是此类缺陷阻碍了SR策略对多模态技能条件动作分布的建模以及技能可控性的保障。解决方案的关键在于提出Saliency-Guided Representation with Consistency Policy Learning (SRCP) 框架,其核心创新包括:(1) 通过引入显著性引导的动力学任务,将表示学习与successor训练解耦,以捕获动态相关表征,提升 successor measure 和任务泛化性能;(2) 结合快速采样一致性策略、面向URL的无分类器引导机制及定制化训练目标,增强技能条件策略建模能力和可控性。实验证明,SRCP在ExORL基准的16个任务上实现了当前最优的零样本泛化表现,并兼容多种SR方法。

链接: https://arxiv.org/abs/2604.05931
作者: Jingbo Sun,Qichao Zhang,Songjun Tu,Xing Fang,Yupeng Zheng,Haoran Li,Ke Chen,Dongbin Zhao
机构: Chinese Academy of Sciences(中国科学院); Pengcheng Laboratory(鹏城实验室); University of Chinese Academy of Sciences(中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Zero-shot unsupervised reinforcement learning (URL) offers a promising direction for building generalist agents capable of generalizing to unseen tasks without additional supervision. Among existing approaches, successor representations (SR) have emerged as a prominent paradigm due to their effectiveness in structured, low-dimensional settings. However, SR methods struggle to scale to high-dimensional visual environments. Through empirical analysis, we identify two key limitations of SR in visual URL: (1) SR objectives often lead to suboptimal representations that attend to dynamics-irrelevant regions, resulting in inaccurate successor measures and degraded task generalization; and (2) these flawed representations hinder SR policies from modeling multi-modal skill-conditioned action distributions and ensuring skill controllability. To address these limitations, we propose Saliency-Guided Representation with Consistency Policy Learning (SRCP), a novel framework that improves zero-shot generalization of SR methods in visual URL. SRCP decouples representation learning from successor training by introducing a saliency-guided dynamics task to capture dynamics-relevant representations, thereby improving successor measure and task generalization. Moreover, it integrates a fast-sampling consistency policy with URL-specific classifier-free guidance and tailored training objectives to improve skill-conditioned policy modeling and controllability. Extensive experiments on 16 tasks across 4 datasets from the ExORL benchmark demonstrate that SRCP achieves state-of-the-art zero-shot generalization in visual URL and is compatible with various SR methods.

[CV-21] Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction

【速读】:该论文旨在解决多遍历场景重建(multi-traversal scene reconstruction)中因光照和环境条件变化导致的外观不一致性问题,从而提升自动驾驶仿真与数字孪生构建的高保真度。其核心解决方案是提出ADM-GS(Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction)框架,关键在于对静态背景进行显式外观分解:将外观分离为与遍历无关的材质属性(traversal-invariant material)和与遍历相关的照明分量(traversal-dependent illumination)。为此,作者设计了一种基于频率分离混合编码策略的神经光场(neural light field),通过引入表面法向量和显式反射向量,分别捕获低频漫反射照明和高频镜面反射,从而有效解耦不同遍历间的外观干扰,实现更一致且高质量的重建结果。

链接: https://arxiv.org/abs/2604.05908
作者: Yangyi Xiao,Siting Zhu,Baoquan Yang,Tianchen Deng,Yongbo Chen,Hesheng Wang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-traversal scene reconstruction is important for high-fidelity autonomous driving simulation and digital twin construction. This task involves integrating multiple sequences captured from the same geographical area at different times. In this context, a primary challenge is the significant appearance inconsistency across traversals caused by varying illumination and environmental conditions, despite the shared underlying geometry. This paper presents ADM-GS (Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction), a framework that applies an explicit appearance decomposition to the static background to alleviate appearance entanglement across traversals. For the static background, we decompose the appearance into traversal-invariant material, representing intrinsic material properties, and traversal-dependent illumination, capturing lighting variations. Specifically, we propose a neural light field that utilizes a frequency-separated hybrid encoding strategy. By incorporating surface normals and explicit reflection vectors, this design separately captures low-frequency diffuse illumination and high-frequency specular reflections. Quantitative evaluations on the Argoverse 2 and Waymo Open datasets demonstrate the effectiveness of ADM-GS. In multi-traversal experiments, our method achieves a +0.98 dB PSNR improvement over existing latent-based baselines while producing more consistent appearance across traversals. Code will be available at this https URL.
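摘要中"显式反射向量"的标准计算为 r = d − 2(d·n)n(n 为单位法向,d 为入射方向),示意如下:

```python
def reflect(d, n):
    """将方向向量 d 关于单位法向 n 做镜面反射:r = d - 2 (d·n) n。"""
    dot = sum(a * b for a, b in zip(d, n))
    return [a - 2 * dot * b for a, b in zip(d, n)]
```

高频镜面分量以该反射向量为输入编码,低频漫反射分量则以法向为输入,从而实现频率分离。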

[CV-22] Selective Aggregation of Attention Maps Improves Diffusion-Based Visual Interpretation

【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成模型中注意力机制解释性不足的问题,尤其是不同注意力头(attention head)所呈现的跨注意力映射(cross-attention map)特性差异未被充分挖掘。其核心解决方案是通过选择与目标概念最相关的注意力头进行聚合,从而提升生成结果的视觉可解释性,并增强对提示词(prompt)理解偏差的诊断能力。实验表明,该方法在平均交并比(mean IoU)上优于基于扩散的分割方法DAAM,且能更准确地捕捉概念特异性特征,为提高T2I模型的可控性和可解释性提供了新路径。

链接: https://arxiv.org/abs/2604.05906
作者: Jungwon Park,Jungmin Ko,Dongnam Byun,Wonjong Rhee
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Numerous studies on text-to-image (T2I) generative models have utilized cross-attention maps to boost application performance and interpret model behavior. However, the distinct characteristics of attention maps from different attention heads remain relatively underexplored. In this study, we show that selectively aggregating cross-attention maps from heads most relevant to a target concept can improve visual interpretability. Compared to the diffusion-based segmentation method DAAM, our approach achieves higher mean IoU scores. We also find that the most relevant heads capture concept-specific features more accurately than the least relevant ones, and that selective aggregation helps diagnose prompt misinterpretations. These findings suggest that attention head selection offers a promising direction for improving the interpretability and controllability of T2I generation.
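"选取与目标概念最相关的注意力头再聚合",以及用 mean IoU 评估归因质量,可用如下示意(注意力图展平为一维列表,头相关性分数为假设输入,非论文的打分方式):

```python
def aggregate_heads(maps, scores, k=2):
    """仅对相关性分数最高的 k 个注意力头的图取平均。"""
    top = sorted(range(len(maps)), key=lambda i: scores[i], reverse=True)[:k]
    n = len(maps[0])
    return [sum(maps[i][j] for i in top) / len(top) for j in range(n)]

def iou(pred, gt):
    """两个 0/1 掩码的交并比(IoU)。"""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0
```

评估时通常先对聚合后的注意力图做阈值二值化,再与概念掩码计算 IoU。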

[CV-23] AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis ACL2026

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在情感图像内容分析(Affective Image Content Analysis, AICA)中缺乏统一框架的问题,即如何将感知(perception)、推理(reasoning)与生成(generation)能力整合为一个协同系统。当前VLMs虽在感知层面表现优异,但在情感强度校准和开放式描述深度方面存在明显不足。为此,作者提出Grounded Affective Tree (GAT) Prompting方法,其关键在于通过视觉锚定(visual scaffolding)与分层推理(hierarchical reasoning)相结合的训练-free框架,有效提升情感强度预测准确性并增强生成内容的语义深度,从而为情感驱动的多模态理解与生成提供坚实基线。

链接: https://arxiv.org/abs/2604.05900
作者: Dong She,Xianrong Yao,Liqun Chen,Jinghe Yu,Yang Gao,Zhanpeng Jin
机构: South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Findings of ACL 2026

点击查看摘要

Abstract:Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To address these issues, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.

[CV-24] Physics-Aware Video Instance Removal Benchmark

【速读】:该论文旨在解决视频实例移除(Video Instance Removal, VIR)任务中因移除目标对象而引发的物理因果效应(如残留阴影、镜面反射和光照交互等)被忽视的问题。现有基准主要评估视觉合理性,却缺乏对物理一致性建模的能力,导致生成结果在真实场景中存在不一致现象。解决方案的关键在于提出首个面向物理感知的视频实例移除基准(Physics-Aware Video Instance Removal, PVIR),包含95个高质量视频数据集,标注了实例级掩码与移除提示,并细分为简单(Simple)和困难(Hard)子集以聚焦复杂物理交互;同时设计解耦式人类评估协议,从指令遵循性、渲染质量和编辑专属性三个维度量化模型表现,从而系统性地揭示当前方法在处理物理因果效应上的局限性,推动VIR技术向更符合物理规律的方向发展。

链接: https://arxiv.org/abs/2604.05898
作者: Zirui Li,Xinghao Chen,Lingyu Jiang,Dengzhe Hou,Fangzhou Lin,Kazunori Yamada,Xiangbo Gao,Zhengzhong Tu
机构: Tohoku University (东北大学); University of Washington (华盛顿大学); Texas A&M University (德州农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video Instance Removal (VIR) requires removing target objects while maintaining background integrity and physical consistency, such as specular reflections and illumination interactions. Despite advancements in text-guided editing, current benchmarks primarily assess visual plausibility, often overlooking the physical causalities triggered by object removal, such as lingering shadows. We introduce the Physics-Aware Video Instance Removal (PVIR) benchmark, featuring 95 high-quality videos annotated with instance-accurate masks and removal prompts. PVIR is partitioned into Simple and Hard subsets, the latter explicitly targeting complex physical interactions. We evaluate four representative methods (PISCO-Removal, UniVideo, DiffuEraser, and CoCoCo) using a decoupled human evaluation protocol across three dimensions (instruction following, rendering quality, and edit exclusivity) to isolate semantic, visual, and spatial failures. Our results show that PISCO-Removal and UniVideo achieve state-of-the-art performance, while DiffuEraser frequently introduces blurring artifacts and CoCoCo struggles significantly with instruction following. The persistent performance drop on the Hard subset highlights the ongoing challenge of recovering complex physical side effects.

[CV-25] Automatic dental superimposition of 3D intraorals and 2D photographs for human identification

【速读】:该论文旨在解决法医牙科比对中因缺乏生前(ante-mortem)医疗记录而导致的形态学比对效率低下的问题,尤其在移民死亡于边境或无全民医疗体系国家等场景下更为突出。其核心挑战在于如何利用社交媒体上可见牙齿的二维(2D)图像与死后(post-mortem)三维(3D)扫描数据进行客观、自动化的形态匹配。解决方案的关键在于提出一种融合计算机视觉与优化技术的3D-2D对齐方法:通过将生前照片投影到3D牙齿模型上,实现视角失真建模与自动校正,并开发了两种自动比对策略——基于配对地标的方法和基于牙齿区域分割估计相机参数的方法。这两种方法在142个样本的20,164次交叉比较中均取得了平均排名值分别为1.6和1.5的优异结果,显著优于传统牙科图表比对的筛选能力,同时提供可量化、可解释的形态对应评分,便于可视化分析。

链接: https://arxiv.org/abs/2604.05877
作者: Antonio D. Villegas-Yeguas,Xavier Abreau-Freire,Guillermo R-García,Andrea Valsecchi,Teresa Pinho,Daniel Pérez-Mongiovi,Oscar Ibáñez,Oscar Cordón
机构: University of Granada (格拉纳达大学); Panacea Cooperative Research S. Coop. (Panacea合作研究社); Associate Laboratory i4HB - Institute for Health and Bioeconomy, University Institute of Health Sciences (IUCS), CESPU (健康与生物经济研究所,健康科学大学研究所,CESPU); Institute for Mummy Studies, Eurac Research (木乃伊研究所,欧洲研究中心); UNIPRO-Oral Pathology and Rehabilitation Research Unit, IUCS-CESPU (口腔病理与康复研究单元,健康科学大学研究所,CESPU); UMIB-Multidisciplinary Biomedical Research Unit, Abel Salazar Institute of Biomedical Sciences (ICBAS), University of Porto (多学科生物医学研究单元,阿贝尔·萨拉扎尔生物医学科学研究所,波尔图大学); Department of Computer Science and Information Technologies, University of A Coruña (拉科鲁尼亚大学计算机科学与信息技术系); DECSAI, Spain (计算机科学与人工智能系,西班牙); Andalusian Research Institute in Data Science and Computational Intelligence (安达卢西亚数据科学与计算智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 9 figures, 3 tables

点击查看摘要

Abstract:Dental comparison is considered a primary identification method, at the level of fingerprints and DNA profiling. One crucial but time-consuming step of this method is the morphological comparison. One of the main challenges in applying this method is the lack of ante-mortem medical records, especially in scenarios such as migrant deaths at the border and/or in countries where there is no universal healthcare. The availability of photos on social media where teeth are visible has led many odontologists to consider morphological comparison using them. However, state-of-the-art proposals have significant limitations, including the lack of proper modeling of perspective distortion and the absence of objective approaches that quantify morphological differences. Our proposal involves a 3D (post-mortem scan) - 2D (ante-mortem photos) approach. Using computer vision and optimization techniques, we replicate the ante-mortem image with the 3D model to perform the morphological comparison. Two automatic approaches have been developed: i) using paired landmarks and ii) using a segmentation of the teeth region to estimate camera parameters. Both are capable of obtaining very promising results over 20,164 cross comparisons from 142 samples, obtaining mean ranking values of 1.6 and 1.5, respectively. These results clearly outperform the filtering capabilities of automatic dental chart comparison approaches, while providing an automatic, objective and quantitative score of the morphological correspondence, easy to interpret and analyze by visualizing superimposed images.
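3D-2D 配准问题本质上是最小化重投影误差。下面用针孔相机模型给出误差计算的示意(焦距、主点等参数为假设值;论文实际方法还需同时估计相机姿态与透视畸变):

```python
import math

def project(p3d, f, cx, cy):
    """针孔投影:(x, y, z) -> (f*x/z + cx, f*y/z + cy)。"""
    x, y, z = p3d
    return (f * x / z + cx, f * y / z + cy)

def reprojection_error(pts3d, pts2d, f, cx, cy):
    """3D 地标的投影点与 2D 照片地标之间的平均欧氏距离。"""
    errs = []
    for p3, p2 in zip(pts3d, pts2d):
        u, v = project(p3, f, cx, cy)
        errs.append(math.hypot(u - p2[0], v - p2[1]))
    return sum(errs) / len(errs)
```

优化器(如最小二乘或演化算法)通过调整相机参数使该误差最小,从而用 3D 模型"复刻"生前照片。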

[CV-26] Neural Network Pruning via QUBO Optimization

【速读】:该论文旨在解决神经网络剪枝(Neural Network Pruning)中因依赖贪婪启发式方法而忽略滤波器间复杂交互作用导致的次优压缩问题。现有基于QUBO(Quadratic Unconstrained Binary Optimization)的优化方法也因目标函数过于简化(如仅使用L1范数)而性能不佳。其解决方案的关键在于提出一个统一的混合QUBO框架,将梯度感知的敏感性度量(一阶泰勒展开和二阶Fisher信息)引入线性项以刻画单个滤波器的重要性,同时利用数据驱动的激活相似性作为二次项来捕捉滤波器间的功能冗余;此外,通过动态容量驱动搜索严格约束目标稀疏度而不扭曲优化空间,并引入两阶段流水线中的张量训练(Tensor-Train, TT)精调阶段,实现对QUBO结果的无梯度微调,从而显著提升模型压缩效果与可解释性。

链接: https://arxiv.org/abs/2604.05856
作者: Osama Orabi,Artur Zagitov,Hadi Salloum,Viktor A. Lobachev,Kasymkhan Khubiev,Yaroslav Kholodov
机构: 1. National Research University Higher School of Economics (俄罗斯国立研究大学高等经济学院); 2. Skolkovo Institute of Science and Technology (斯科尔科沃科学与技术研究院); 3. Russian Academy of Sciences (俄罗斯科学院); 4. Moscow State University (莫斯科国立大学); 5. Kazan Federal University (喀山联邦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 13 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Neural network pruning can be formulated as a combinatorial optimization problem, yet most existing approaches rely on greedy heuristics that ignore complex interactions between filters. Formal optimization methods such as Quadratic Unconstrained Binary Optimization (QUBO) provide a principled alternative but have so far underperformed due to oversimplified objective formulations based on metrics like the L1-norm. In this work, we propose a unified Hybrid QUBO framework that bridges heuristic importance estimation with global combinatorial optimization. Our formulation integrates gradient-aware sensitivity metrics - specifically first-order Taylor and second-order Fisher information - into the linear term, while utilizing data-driven activation similarity in the quadratic term. This allows the QUBO objective to jointly capture individual filter relevance and inter-filter functional redundancy. We further introduce a dynamic capacity-driven search to strictly enforce target sparsity without distorting the optimization landscape. Finally, we employ a two-stage pipeline featuring a Tensor-Train (TT) Refinement stage - a gradient-free optimizer that fine-tunes the QUBO-derived solution directly against the true evaluation metric. Experiments on the SIDD image denoising dataset demonstrate that the proposed Hybrid QUBO significantly outperforms both greedy Taylor pruning and traditional L1-based QUBO, with TT Refinement providing further consistent gains at appropriate combinatorial scales. This highlights the potential of hybrid combinatorial formulations for robust, scalable, and interpretable neural network compression.
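论文的 QUBO 目标将线性项用于单滤波器重要性、二次项用于滤波器间冗余惩罚。下面给出能量函数及一个小规模穷举求解器作校验(重要性与相似度数值均为示例;线性项取"负重要性",保留重要滤波器会降低能量):

```python
from itertools import combinations

def qubo_energy(x, linear, quad):
    """E(x) = Σ_i l_i x_i + Σ_{i<j} q_ij x_i x_j,x 为 0/1 保留掩码。"""
    n = len(x)
    e = sum(linear[i] * x[i] for i in range(n))
    e += sum(quad[i][j] * x[i] * x[j]
             for i in range(n) for j in range(i + 1, n))
    return e

def brute_force_prune(linear, quad, keep):
    """穷举所有恰含 keep 个 1 的掩码,返回能量最低者(仅用于小规模验证)。"""
    n = len(linear)
    best, best_e = None, float("inf")
    for idx in combinations(range(n), keep):
        x = [1 if i in idx else 0 for i in range(n)]
        e = qubo_energy(x, linear, quad)
        if e < best_e:
            best, best_e = x, e
    return best, best_e
```

示例中滤波器 0 与 1 高度冗余(q01 较大),因此即使 1 的单体重要性高于 2,最优解也会改选 {0, 2}。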

[CV-27] Reading Between the Pixels: An Inscriptive Jailbreak Attack on Text-to-Image Models

【速读】:该论文旨在解决生成式 AI(Generative AI)在文本到图像(Text-to-Image, T2I)模型中因文本渲染能力被恶意利用而产生的“描摹越狱”(inscriptive jailbreak)问题,即攻击者诱导模型在看似无害的视觉场景中嵌入有害文本内容(如伪造文件),从而规避多阶段安全过滤机制。解决方案的关键在于提出名为 Etch 的黑盒攻击框架,其核心创新是将对抗提示(adversarial prompt)分解为三个功能正交的层次:语义伪装(semantic camouflage)、视觉空间锚定(visual-spatial anchoring)和字体编码(typographic encoding),通过零阶优化循环迭代精修各层,并借助视觉语言模型对生成图像进行批判性评估、定位失败模块并提供针对性修正,从而显著提升攻击成功率(平均达 65.57%,最高达 91.00%),揭示当前 T2I 安全对齐机制在字体感知层面的重大盲区。

链接: https://arxiv.org/abs/2604.05853
作者: Zonghao Ying,Haowen Dai,Lianyu Hu,Zonglei Jing,Quanchen Zou,Yaodong Yang,Aishan Liu,Xianglong Liu
机构: Beihang University (北京航空航天大学); University of Nottingham Ningbo China (宁波诺丁汉大学); Nanyang Technological University (南洋理工大学); 360 AI Security Lab (奇虎360人工智能安全实验室); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Modern text-to-image (T2I) models can now render legible, paragraph-length text, enabling a fundamentally new class of misuse. We identify and formalize the inscriptive jailbreak, where an adversary coerces a T2I system into generating images containing harmful textual payloads (e.g., fraudulent documents) embedded within visually benign scenes. Unlike traditional depictive jailbreaks that elicit visually objectionable imagery, inscriptive attacks weaponize the text-rendering capability itself. Because existing jailbreak techniques are designed for coarse visual manipulation, they struggle to bypass multi-stage safety filters while maintaining character-level fidelity. To expose this vulnerability, we propose Etch, a black-box attack framework that decomposes the adversarial prompt into three functionally orthogonal layers: semantic camouflage, visual-spatial anchoring, and typographic encoding. This decomposition reduces joint optimization over the full prompt space to tractable sub-problems, which are iteratively refined through a zero-order loop. In this process, a vision-language model critiques each generated image, localizes failures to specific layers, and prescribes targeted revisions. Extensive evaluations across 7 models on 2 benchmarks demonstrate that Etch achieves an average attack success rate of 65.57% (peaking at 91.00%), significantly outperforming existing baselines. Our results reveal a critical blind spot in current T2I safety alignments and underscore the urgent need for typography-aware multimodal defense mechanisms.

[CV-28] Learn to Rank: Visual Attribution by Learning Importance Ranking

【速读】:该论文旨在解决复杂计算机视觉模型决策可解释性问题,尤其在安全关键领域中建立信任与问责机制。现有方法面临三重权衡:基于传播的方法效率高但存在偏差且依赖架构;基于扰动的方法因果基础扎实但计算成本高,且对视觉Transformer常产生粗粒度的块级解释;基于学习的解释器虽快速但通常优化代理目标或从启发式教师模型蒸馏。解决方案的关键在于提出一种直接优化删除(deletion)和插入(insertion)指标的学习框架,通过将非可微排序与排名操作建模为排列学习问题,并利用Gumbel-Sinkhorn对硬排序进行可微松弛,从而实现端到端训练,借助归因引导的扰动来优化目标模型。推理阶段可在单次前向传播中生成密集的像素级归因图,支持可选的少量梯度精修步骤,显著提升解释的清晰度与边界对齐特性,尤其适用于Transformer类视觉模型。

链接: https://arxiv.org/abs/2604.05819
作者: David Schinagl,Christian Fruhwirth-Reisinger,Alexander Prutsch,Samuel Schulter,Horst Possegger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Interpreting the decisions of complex computer vision models is crucial to establish trust and accountability, especially in safety-critical domains. An established approach to interpretability is generating visual attribution maps that highlight regions of the input most relevant to the model’s prediction. However, existing methods face a three-way trade-off. Propagation-based approaches are efficient, but they can be biased and architecture-specific. Meanwhile, perturbation-based methods are causally grounded, yet they are expensive and for vision transformers often yield coarse, patch-level explanations. Learning-based explainers are fast but usually optimize surrogate objectives or distill from heuristic teachers. We propose a learning scheme that instead optimizes deletion and insertion metrics directly. Since these metrics depend on non-differentiable sorting and ranking, we frame them as permutation learning and replace the hard sorting with a differentiable relaxation using Gumbel-Sinkhorn. This enables end-to-end training through attribution-guided perturbations of the target model. During inference, our method produces dense, pixel-level attributions in a single forward pass with optional, few-step gradient refinement. Our experiments demonstrate consistent quantitative improvements and sharper, boundary-aligned explanations, particularly for transformer-based vision models.
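Gumbel-Sinkhorn 松弛的核心是对 exp(logits/τ) 交替做行、列归一化,使其逼近双随机矩阵;温度 τ 越低,结果越接近硬排列矩阵。以下为纯 Python 示意(矩阵规模与迭代次数为示例):

```python
import math

def sinkhorn(logits, tau=1.0, iters=50):
    """对 exp(logits/tau) 交替做行、列归一化,得到近似双随机矩阵。"""
    n = len(logits)
    M = [[math.exp(v / tau) for v in row] for row in logits]
    for _ in range(iters):
        for i in range(n):                       # 行归一化
            s = sum(M[i])
            M[i] = [v / s for v in M[i]]
        for j in range(n):                       # 列归一化
            s = sum(M[i][j] for i in range(n))
            for i in range(n):
                M[i][j] /= s
    return M
```

由于每一步只是逐元素的指数与归一化,整个过程可微,从而能把删除/插入指标中的排序操作纳入端到端训练。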

[CV-29] EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion

[Quick Read]: This paper tackles fine-grained strand-level hair geometry reconstruction for virtual human modeling and hairstyle digitization, where existing methods face a marked trade-off between accuracy and efficiency: implicit neural representations capture the overall hairstyle shape but lose strand-level detail, while explicit optimization achieves high-fidelity reconstruction at heavy computational cost and with poor scalability. The key to the proposed EfficientMonoHair framework lies in a fusion-patch-based multi-view optimization that reduces the number of iterations needed to optimize point-cloud directions, together with a novel parallel hair-growing strategy that relaxes voxel occupancy constraints, keeping large-scale strand tracing stable and robust even under inaccurate or noisy orientation fields, thereby delivering high-fidelity reconstruction with substantially improved runtime efficiency.

Link: https://arxiv.org/abs/2604.05794
Authors: Da Li,Dominik Engel,Deng Luo,Ivan Viola
Affiliations: King Abdullah University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 10 pages, 6 figures, conference

Abstract:Strand-level hair geometry reconstruction is a fundamental problem in virtual human modeling and the digitization of hairstyles. However, existing methods still suffer from a significant trade-off between accuracy and efficiency. Implicit neural representations can capture the global hair shape but often fail to preserve fine-grained strand details, while explicit optimization-based approaches achieve high-fidelity reconstructions at the cost of heavy computation and poor scalability. To address this issue, we propose EfficientMonoHair, a fast and accurate framework that combines the implicit neural network with multi-view geometric fusion for strand-level reconstruction from monocular video. Our method introduces a fusion-patch-based multi-view optimization that reduces the number of optimization iterations for point cloud direction, as well as a novel parallel hair-growing strategy that relaxes voxel occupancy constraints, allowing large-scale strand tracing to remain stable and robust even under inaccurate or noisy orientation fields. Extensive experiments on representative real-world hairstyles demonstrate that our method can robustly reconstruct high-fidelity strand geometries with accuracy. On synthetic benchmarks, our method achieves reconstruction quality comparable to state-of-the-art methods, while improving runtime efficiency by nearly an order of magnitude.

[CV-30] BodhiPromptShield: Pre-Inference Prompt Mediation for Suppressing Privacy Propagation in LLM /VLM Agents

[Quick Read]: This paper targets the cross-stage propagation of prompt privacy risk in large language model (LLM) and vision-language model (VLM) agents. Existing de-identification pipelines address only document boundaries and fail to cover leakage of raw user content across retrieval queries, memory writes, tool calls, and logging. The core solution is BodhiPromptShield, a policy-aware framework that detects sensitive text spans, routes them via typed placeholders, semantic abstraction, or secure symbolic mapping, and delays restoration of sensitive information until authorized boundaries. By introducing propagation-aware mediation and treating restoration timing as a security variable, the framework suppresses stage-wise leakage (from 10.7% to 7.1%) and outperforms generic de-identification in controlled evaluation (PER = 9.3%, AC = 0.94, TSR = 0.92).

Link: https://arxiv.org/abs/2604.05793
Authors: Bo Ma,Jinsong Wu,Weiqi Yan
Affiliations: Auckland University of Technology; University of Chile
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In LLM/VLM agents, prompt privacy risk propagates beyond a single model call because raw user content can flow into retrieval queries, memory writes, tool calls, and logs. Existing de-identification pipelines address document boundaries but not this cross-stage propagation. We propose BodhiPromptShield, a policy-aware framework that detects sensitive spans, routes them via typed placeholders, semantic abstraction, or secure symbolic mapping, and delays restoration to authorized boundaries. Relative to enterprise redaction, this adds explicit propagation-aware mediation and restoration timing as a security variable. Under controlled evaluation on the Controlled Prompt-Privacy Benchmark (CPPB), stage-wise propagation is suppressed from 10.7% to 7.1% across retrieval, memory, and tool stages; PER reaches 9.3% with 0.94 AC and 0.92 TSR, outperforming generic de-identification. These are controlled systems results on CPPB rather than formal privacy guarantees or public-benchmark transfer claims. The project repository is available at this https URL.
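The "typed placeholder" routing described above can be illustrated with a toy mediator: sensitive spans are swapped for typed tokens before the prompt reaches retrieval, memory, or tools, and the raw values live only in a vault consulted at an authorized boundary. The regex detectors and placeholder format below are toy assumptions, not the paper's:

```python
import re

def mediate(prompt, policies):
    """Replace sensitive spans with typed placeholders; restoration deferred.

    policies: {type_name: regex}. Returns the mediated prompt plus a vault
    mapping placeholders back to raw values, so downstream agent stages
    never see the raw content. Patterns are illustrative only.
    """
    vault, counts = {}, {}
    def swap(kind):
        def _sub(m):
            counts[kind] = counts.get(kind, 0) + 1
            ph = f"<{kind}_{counts[kind]}>"
            vault[ph] = m.group(0)   # raw value kept out of the prompt
            return ph
        return _sub
    for kind, pat in policies.items():
        prompt = re.sub(pat, swap(kind), prompt)
    return prompt, vault

policies = {"EMAIL": r"[\w.]+@[\w.]+", "PHONE": r"\+?\d[\d\- ]{7,}\d"}
out, vault = mediate("Mail alice@corp.com or call +1 555-010-7788", policies)
print(out)  # Mail <EMAIL_1> or call <PHONE_1>
```

Restoration would simply substitute `vault` entries back in, but only at the authorized boundary, which is the timing variable the framework studies.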

[CV-31] Sparse Gain Radio Map Reconstruction With Geometry Priors and Uncertainty-Guided Measurement Selection

[Quick Read]: This paper addresses high-accuracy radio map reconstruction from sparse measurements in complex urban environments with strong blockages, irregular geometry, and restricted sensing accessibility. Existing approaches such as interpolation, low-rank cartography, deep completion, and channel knowledge map (CKM) construction often under-exploit explicit geometric priors or ignore the value of predictive uncertainty for subsequent sensing. The key to the solution is GeoUQ-GFNet, a lightweight network that jointly predicts a dense gain radio map and a spatial uncertainty map from sparse measurements and structured scene priors; the predicted uncertainty then guides active measurement selection, yielding more effective reconstruction improvement than non-adaptive sampling under the same limited sensing budget. Combining geometry-aware learning, uncertainty estimation, and benchmark-driven evaluation markedly improves the robustness and efficiency of sparse radio map reconstruction in complex urban settings.

Link: https://arxiv.org/abs/2604.05788
Authors: Zhihan Zeng,Ning Wei,Muhammad Baqer Mollah,Kaihe Wang,Phee Lep Yeoh,Fei Xu,Yue Xiu,Zhongpei Zhang
Affiliations: University of Electronic Science and Technology of China (UESTC); University of Houston; University of the Sunshine Coast; ZGC Institute of Ubiquitous-X Innovation and Applications
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Radio maps are important for environment-aware wireless communication, network planning, and radio resource optimization. However, dense radio map construction remains challenging when only a limited number of measurements are available, especially in complex urban environments with strong blockages, irregular geometry, and restricted sensing accessibility. Existing methods have explored interpolation, low-rank cartography, deep completion, and channel knowledge map (CKM) construction, but many of these methods insufficiently exploit explicit geometric priors or overlook the value of predictive uncertainty for subsequent sensing. In this paper, we study sparse gain radio map reconstruction from a geometry-aware and active sensing perspective. We first construct \textbfUrbanRT-RM, a controllable ray-tracing benchmark with diverse urban layouts, multiple base-station deployments, and multiple sparse sampling modes. We then propose \textbfGeoUQ-GFNet, a lightweight network that jointly predicts a dense gain radio map and a spatial uncertainty map from sparse measurements and structured scene priors. The predicted uncertainty is further used to guide active measurement selection under limited sensing budgets. Extensive experiments show that our proposed GeoUQ-GFNet method achieves strong and consistent reconstruction performance across different scenes and transmitter placements generated using UrbanRT-RM. Moreover, uncertainty-guided querying provides more effective reconstruction improvement than non-adaptive sampling under the same additional measurement budget. These results demonstrate the effectiveness of combining geometry-aware learning, uncertainty estimation, and benchmark-driven evaluation for sparse radio map reconstruction in complex urban environments.
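The uncertainty-guided querying step reduces, in its simplest form, to picking the measurement locations where the predicted uncertainty map is highest. A minimal sketch under that assumption (a real system would likely add spatial-spacing or accessibility constraints):

```python
import numpy as np

def select_measurements(uncertainty, budget):
    """Pick `budget` query locations with the highest predicted uncertainty.

    uncertainty : (H, W) map predicted alongside the gain radio map.
    Returns (row, col) index arrays of the selected measurement points,
    ordered from most to least uncertain.
    """
    flat = uncertainty.ravel()
    idx = np.argpartition(flat, -budget)[-budget:]   # top-k in O(H*W)
    idx = idx[np.argsort(flat[idx])[::-1]]           # sort by uncertainty
    return np.unravel_index(idx, uncertainty.shape)

unc = np.zeros((4, 4))
unc[1, 2], unc[3, 0], unc[0, 0] = 0.9, 0.7, 0.5
rows, cols = select_measurements(unc, budget=2)
print(list(zip(rows.tolist(), cols.tolist())))  # [(1, 2), (3, 0)]
```

After measuring at the selected points, the new samples are fed back into the network, which is the adaptive loop the paper compares against non-adaptive sampling.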

[CV-32] RHVI-FDD: A Hierarchical Decoupling Framework for Low-Light Image Enhancement

[Quick Read]: This paper addresses the severe noise, detail loss, and color distortion in low-light images, which markedly degrade downstream multimedia analysis and retrieval. Existing methods struggle to simultaneously correct color distortion, suppress noise, and preserve fine details, because luminance and chrominance are coupled and, within the chrominance, noise and details are deeply entangled. The key to the solution is a hierarchical decoupling framework (RHVI-FDD): at the macro level, the RHVI transform mitigates the estimation bias caused by input noise and enables robust luminance-chrominance decoupling; at the micro level, a Frequency-Domain Decoupling (FDD) module uses the Discrete Cosine Transform to decompose chrominance features into low, mid, and high-frequency bands, corresponding respectively to global tone, local details, and noise components, which are processed by tailored expert networks and fused through an adaptive gating module for content-aware fusion, achieving joint optimization of these competing objectives.

Link: https://arxiv.org/abs/2604.05781
Authors: Junhao Yang,Bo Yang,Hongwei Ge,Yanchun Liang,Heow Pueh Lee,Chunguo Wu
Affiliations: Jilin University; Dalian University of Technology; National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 8 figures

Abstract:Low-light images often suffer from severe noise, detail loss, and color distortion, which hinder downstream multimedia analysis and retrieval tasks. The degradation in low-light images is complex: luminance and chrominance are coupled, while within the chrominance, noise and details are deeply entangled, preventing existing methods from simultaneously correcting color distortion, suppressing noise, and preserving fine details. To tackle the above challenges, we propose a novel hierarchical decoupling framework (RHVI-FDD). At the macro level, we introduce the RHVI transform, which mitigates the estimation bias caused by input noise and enables robust luminance-chrominance decoupling. At the micro level, we design a Frequency-Domain Decoupling (FDD) module with three branches for further feature separation. Using the Discrete Cosine Transform, we decompose chrominance features into low, mid, and high-frequency bands that predominantly represent global tone, local details, and noise components, which are then processed by tailored expert networks in a divide-and-conquer manner and fused via an adaptive gating module for content-aware fusion. Extensive experiments on multiple low-light datasets demonstrate that our method consistently outperforms existing state-of-the-art approaches in both objective metrics and subjective visual quality.
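The DCT band split at the heart of the FDD module can be sketched directly: transform a 2D map with an orthonormal DCT-II, mask coefficients into low/mid/high-frequency groups, and invert each group. The band cutoffs below are arbitrary assumptions; the decomposition is lossless by construction:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix: D @ D.T == I."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    d[0] /= np.sqrt(2.0)
    return d

def split_bands(x, low=0.25, high=0.75):
    """Split a square 2D map into low/mid/high-frequency components."""
    n = x.shape[0]
    D = dct_matrix(n)
    coeffs = D @ x @ D.T                       # 2D DCT-II
    # Normalized frequency of each coefficient, in [0, 1].
    freq = (np.arange(n)[:, None] + np.arange(n)[None, :]) / (2 * (n - 1))
    bands = []
    for lo_f, hi_f in [(0.0, low), (low, high), (high, 1.01)]:
        mask = (freq >= lo_f) & (freq < hi_f)
        bands.append(D.T @ (coeffs * mask) @ D)  # inverse DCT per band
    return bands                                # low, mid, high sum back to x

x = np.random.default_rng(0).standard_normal((8, 8))
lo, mid, hi = split_bands(x)
print(np.allclose(lo + mid + hi, x))  # True: lossless decomposition
```

In the paper each band would go to its own expert network before gated fusion; here the point is only that the frequency split partitions the signal without loss.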

[CV-33] Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion CVPR2026

[Quick Read]: This paper tackles the extreme voxel imbalance in monocular semantic scene completion (SSC), where over 93% of voxels are empty and foreground classes are rare, causing models to place redundant emphasis on uninformative voxels and generalize poorly to long-tailed categories. The key to the proposed VoxSAMNet framework lies in: (1) a Dummy Shortcut for Feature Refinement (DSFR) module that bypasses empty voxels via a shared dummy node while refining occupied voxels with deformable attention; and (2) a Foreground Modulation Strategy combining Foreground Dropout (FD) with a Text-Guided Image Filter (TGIF) to alleviate overfitting and enhance class-relevant features. The method achieves state-of-the-art performance on SemanticKITTI and SSCBench-KITTI-360 with mIoU scores of 18.2% and 20.2% respectively, confirming the importance of explicitly modeling voxel sparsity and semantic imbalance for efficient and accurate 3D scene completion.

Link: https://arxiv.org/abs/2604.05780
Authors: Yu Xue,Longjun Gao,Yuanqi Su,HaoAng Lu,Xiaoning Zhang
Affiliations: Xi'an Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2026

Abstract:Monocular Semantic Scene Completion (SSC) aims to reconstruct complete 3D semantic scenes from a single RGB image, offering a cost-effective solution for autonomous driving and robotics. However, the inherently imbalanced nature of voxel distributions, where over 93% of voxels are empty and foreground classes are rare, poses significant challenges. Existing methods often suffer from redundant emphasis on uninformative voxels and poor generalization to long-tailed categories. To address these issues, we propose VoxSAMNet (Voxel Sparsity-Aware Modulation Network), a unified framework that explicitly models voxel sparsity and semantic imbalance. Our approach introduces: (1) a Dummy Shortcut for Feature Refinement (DSFR) module that bypasses empty voxels via a shared dummy node while refining occupied ones with deformable attention; and (2) a Foreground Modulation Strategy combining Foreground Dropout (FD) and Text-Guided Image Filter (TGIF) to alleviate overfitting and enhance class-relevant features. Extensive experiments on the public benchmarks SemanticKITTI and SSCBench-KITTI-360 demonstrate that VoxSAMNet achieves state-of-the-art performance, surpassing prior monocular and stereo baselines with mIoU scores of 18.2% and 20.2%, respectively. Our results highlight the importance of sparsity-aware and semantics-guided design for efficient and accurate 3D scene completion, offering a promising direction for future research.

[CV-34] PDMP: Rethinking Balanced Multimodal Learning via Performance-Dominant Modality Prioritization

[Quick Read]: This paper addresses the widespread under-optimization problem in multimodal learning, where a multimodal model underperforms even its unimodal counterparts. Conventional approaches attribute this to imbalanced learning between modalities and remedy it with gradient modulation toward balance. This paper instead argues that balanced learning is not the optimal setting: the real problem is insufficient learning of the performance-dominant modality, i.e., the one with the best unimodal performance. The key to the solution is the Performance-Dominant Modality Prioritization (PDMP) strategy, which first identifies the dominant modality via the performance ranking of independently trained unimodal models and then introduces asymmetric gradient coefficients so that the dominant modality drives the optimization, thereby improving overall multimodal performance. Because PDMP relies only on the unimodal performance ranking, it is independent of the multimodal architecture and fusion mechanism, making it broadly applicable in practice.

Link: https://arxiv.org/abs/2604.05773
Authors: Shicai Wei,Chunbo Luo,Qiang Zhu,Yang Luo
Affiliations: University of Electronic Science and Technology of China; Peng Cheng Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal learning has attracted increasing attention due to its practicality. However, it often suffers from insufficient optimization, where the multimodal model underperforms even compared to its unimodal counterparts. Existing methods attribute this problem to the imbalanced learning between modalities and solve it by gradient modulation. This paper argues that balanced learning is not the optimal setting for multimodal learning. On the contrary, imbalanced learning driven by the performance-dominant modality that has superior unimodal performance can contribute to better multimodal performance. And the under-optimization problem is caused by insufficient learning of the performance-dominant modality. To this end, we propose the Performance-Dominant Modality Prioritization (PDMP) strategy to assist multimodal learning. Specifically, PDMP firstly mines the performance-dominant modality via the performance ranking of the independently trained unimodal model. Then PDMP introduces asymmetric coefficients to modulate the gradients of each modality, enabling the performance-dominant modality to dominate the optimization. Since PDMP only relies on the unimodal performance ranking, it is independent of the structures and fusion methods of the multimodal model and has great potential for practical scenarios. Finally, extensive experiments on various datasets validate the superiority of PDMP.
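The asymmetric-coefficient idea can be sketched in a few lines: rank modalities by their independently trained unimodal scores and scale the dominant branch's gradient up. The coefficient values and normalization below are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

def pdmp_coefficients(unimodal_scores, boost=2.0):
    """Asymmetric gradient coefficients from a unimodal performance ranking.

    The modality with the best independently trained unimodal score (the
    performance-dominant one) gets its gradient scaled up so it dominates
    the joint optimization. `boost` is a hypothetical knob.
    """
    scores = np.asarray(unimodal_scores, dtype=float)
    coeffs = np.ones_like(scores)
    coeffs[np.argmax(scores)] = boost
    return coeffs / coeffs.mean()   # preserve the overall gradient scale

# Audio/visual example: visual unimodal accuracy 0.71 beats audio's 0.58,
# so the visual branch's gradient is amplified during multimodal training.
grads = {"audio": np.ones(3), "visual": np.ones(3)}
c = pdmp_coefficients([0.58, 0.71])
modulated = {m: c[i] * g for i, (m, g) in enumerate(grads.items())}
print(c[1] > c[0])  # True: the dominant (visual) modality outweighs audio
```

Because the coefficients depend only on the unimodal ranking, this wrapper touches nothing inside the fusion network, which is what makes the strategy architecture-agnostic.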

[CV-35] Improving Controllable Generation: Faster Training and Better Performance via x_0-Supervision

[Quick Read]: This paper addresses a limitation of text-to-image (T2I) diffusion models when users need precise control over image layouts, since natural language alone cannot reliably express such spatial structure. Controllable generation methods augment the initial T2I model with additional conditions, but straightforwardly reusing the original model's training loss can lead to very slow convergence. The key to the solution is revisiting the training objective of controllable diffusion models: a detailed analysis of the denoising dynamics shows that direct supervision on the clean target image (x_0-supervision), or an equivalent re-weighting of the diffusion loss, yields markedly faster convergence. Experiments across multiple control settings show up to 2x faster convergence under the authors' proposed mean Area Under the Convergence Curve (mAUCC) metric, while also improving visual quality and conditioning accuracy.

Link: https://arxiv.org/abs/2604.05761
Authors: Amadou S. Sangare,Adrien Maglo,Mohamed Chaouch,Bertrand Luvison
Affiliations: Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-Image (T2I) diffusion/flow models have recently achieved remarkable progress in visual fidelity and text alignment. However, they remain limited when users need to precisely control image layouts, something that natural language alone cannot reliably express. Controllable generation methods augment the initial T2I model with additional conditions that more easily describe the scene. Prior works straightforwardly train the augmented network with the same loss as the initial network. Although natural at first glance, this can lead to very long training times in some cases before convergence. In this work, we revisit the training objective of controllable diffusion models through a detailed analysis of their denoising dynamics. We show that direct supervision on the clean target image, dubbed x_0 -supervision, or an equivalent re-weighting of the diffusion loss, yields faster convergence. Experiments on multiple control settings demonstrate that our formulation accelerates convergence by up to 2 \times according to our novel metric (mean Area Under the Convergence Curve - mAUCC), while also improving both visual quality and conditioning accuracy. Our code is available at this https URL
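Why x_0-supervision amounts to a re-weighted diffusion loss follows from the standard DDPM parameterization x_t = sqrt(abar)·x_0 + sqrt(1-abar)·eps: predicting x_0 from a predicted eps gives x_0 - x̂_0 = sqrt((1-abar)/abar)·(eps - epŝ), so the x_0 loss is the eps loss scaled by (1-abar)/abar. A numerical check of that identity (the paper's exact weighting scheme may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(1000)
eps = rng.standard_normal(1000)
abar = 0.3                                   # cumulative schedule at step t
xt = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps   # DDPM forward process

eps_hat = eps + 0.1 * rng.standard_normal(1000)      # imperfect prediction
x0_hat = (xt - np.sqrt(1 - abar) * eps_hat) / np.sqrt(abar)

loss_x0 = np.mean((x0 - x0_hat) ** 2)
loss_eps = np.mean((eps - eps_hat) ** 2)
# x0-supervision == epsilon loss re-weighted by (1 - abar) / abar
print(np.allclose(loss_x0, (1 - abar) / abar * loss_eps))  # True
```

At low abar (high noise), the weight (1-abar)/abar is large, so x_0-supervision emphasizes the noisy timesteps where layout conditioning matters most.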

[CV-36] SVC 2026: the Second Multimodal Deception Detection Challenge and the First Domain Generalized Remote Physiological Measurement Challenge CVPR2026

[Quick Read]: This paper addresses how to effectively detect and understand subtle visual signals in complex real-world environments; although hard to perceive with the naked eye, such signals carry key information for biometric security, multimedia forensics, medical diagnosis, and other domains. Existing methods still fall short in robustness, representation ability, and generalization, especially in cross-domain and weak-signal scenarios. The key to the solution is organizing the Subtle Visual Challenge, whose two core tasks, cross-domain multimodal deception detection and remote photoplethysmography (rPPG) estimation, drive the learning of more robust and generalizable representations of subtle visual signals, further advancing computer vision and multimodal learning.

Link: https://arxiv.org/abs/2604.05748
Authors: Dongliang Zhu,Zhiyi Niu,Bo Zhao,Jiajian Huang,Shuo Ye,Xun Lin,Hui Ma,Taorui Wang,Jiayu Zhang,Chunmei Zhu,Junzhe Cao,Yingjie Ma,Rencheng Song,Albert Clapés,Sergio Escalera,Dan Guo,Zitong Yu
Affiliations: Wuhan University; Great Bay University; Tsinghua University; The Chinese University of Hong Kong; Sun Yat-sen University; Hefei University of Technology; University of Barcelona
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by the SVC workshop @ CVPR 2026

Abstract:Subtle visual signals, although difficult to perceive with the naked eye, contain important information that can reveal hidden patterns in visual data. These signals play a key role in many applications, including biometric security, multimedia forensics, medical diagnosis, industrial inspection, and affective computing. With the rapid development of computer vision and representation learning techniques, detecting and interpreting such subtle signals has become an emerging research direction. However, existing studies often focus on specific tasks or modalities, and models still face challenges in robustness, representation ability, and generalization when handling subtle and weak signals in real-world environments. To promote research in this area, we organize the Subtle Visual Challenge, which aims to learn robust representations for subtle visual signals. The challenge includes two tasks: cross-domain multimodal deception detection and remote photoplethysmography (rPPG) estimation. We hope that this challenge will encourage the development of more robust and generalizable models for subtle visual understanding, and further advance research in computer vision and multimodal learning. A total of 22 teams submitted their final results to this workshop competition, and the corresponding baseline models have been released on the MMDD2026 platform (this https URL).

[CV-37] On the Robustness of Diffusion-Based Image Compression to Bit-Flip Errors

[Quick Read]: This paper addresses the rarely examined robustness of modern image compression methods to bit-level corruption, and in particular their reliance on error-correcting codes in highly noisy environments. The key finding is that diffusion-based compressors built on the Reverse Channel Coding (RCC) paradigm are substantially more robust to bit flips than classical and learned codecs; the authors further introduce a more robust variant of Turbo-DDCM that significantly improves robustness while only minimally affecting the rate-distortion-perception trade-off, pointing to a path toward more resilient compressed representations.

Link: https://arxiv.org/abs/2604.05743
Authors: Amit Vaisman,Gal Pomerants,Raz Lapid
Affiliations: Technion - Israel Institute of Technology; Deepkeep
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern image compression methods are typically optimized for the rate–distortion–perception trade-off, whereas their robustness to bit-level corruption is rarely examined. We show that diffusion-based compressors built on the Reverse Channel Coding (RCC) paradigm are substantially more robust to bit flips than classical and learned codecs. We further introduce a more robust variant of Turbo-DDCM that significantly improves robustness while only minimally affecting the rate–distortion–perception trade-off. Our findings suggest that RCC-based compression can yield more resilient compressed representations, potentially reducing reliance on error-correcting codes in highly noisy environments.
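The corruption model in this kind of study is simple to reproduce: flip a fixed number of random bits in the compressed bitstream and decode the damaged payload. A minimal sketch of the bit-flip injector (the evaluation protocol details are the paper's, not shown here):

```python
import numpy as np

def flip_bits(payload: bytes, n_flips: int, seed: int = 0) -> bytes:
    """Flip n_flips distinct random bits in a compressed bitstream."""
    buf = np.frombuffer(payload, dtype=np.uint8).copy()
    rng = np.random.default_rng(seed)
    # Sample distinct bit positions over the whole stream.
    positions = rng.choice(len(buf) * 8, size=n_flips, replace=False)
    for p in positions:
        buf[p // 8] ^= np.uint8(1 << (p % 8))
    return buf.tobytes()

original = bytes(range(16))
corrupted = flip_bits(original, n_flips=5)
# Hamming distance between the streams equals the number of flipped bits.
diff = np.bitwise_xor(np.frombuffer(original, np.uint8),
                      np.frombuffer(corrupted, np.uint8))
print(int(np.unpackbits(diff).sum()))  # 5
```

Sweeping `n_flips` and measuring decoded distortion/perception gives the robustness curves that separate RCC-based compressors from classical codecs.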

[CV-38] ASSR-Net: Anisotropic Structure-Aware and Spectrally Recalibrated Network for Hyperspectral Image Fusion

[Quick Read]: This paper addresses two key problems in hyperspectral image fusion (HIF): inadequate reconstruction of anisotropic spatial structures, leading to blurred details and degraded spatial quality; and spectral distortion during fusion, which compromises fine-grained spectral representation. The key to the solution is ASSR-Net, which adopts a two-stage strategy: in the first stage, an Anisotropic Structure-Aware Spatial Enhancement (ASSE) module uses a directional perception fusion mechanism to adaptively capture structural features along multiple orientations, effectively recovering anisotropic spatial patterns; in the second stage, a Hierarchical Prior-Guided Spectral Calibration (HPSC) module leverages the original low-resolution hyperspectral image as a spectral prior to explicitly correct spectral deviations in the fused result, thereby improving spectral fidelity.

Link: https://arxiv.org/abs/2604.05742
Authors: Qiya Song,Hongzhi Zhou,Lishan Tan,Renwei Dian,Shutao Li
Affiliations: Hunan Normal University; Hunan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hyperspectral image fusion aims to reconstruct high-spatial-resolution hyperspectral images (HR-HSI) by integrating complementary information from multi-source inputs. Despite recent progress, existing methods still face two critical challenges: (1) inadequate reconstruction of anisotropic spatial structures, resulting in blurred details and compromised spatial quality; and (2) spectral distortion during fusion, which hinders fine-grained spectral representation. To address these issues, we propose \textbfASSR-Net: an Anisotropic Structure-Aware and Spectrally Recalibrated Network for Hyperspectral Image Fusion. ASSR-Net adopts a two-stage fusion strategy comprising anisotropic structure-aware spatial enhancement (ASSE) and hierarchical prior-guided spectral calibration (HPSC). In the first stage, a directional perception fusion module adaptively captures structural features along multiple orientations, effectively reconstructing anisotropic spatial patterns. In the second stage, a spectral recalibration module leverages the original low-resolution HSI as a spectral prior to explicitly correct spectral deviations in the fused results, thereby enhancing spectral fidelity. Extensive experiments on various benchmark datasets demonstrate that ASSR-Net consistently outperforms state-of-the-art methods, achieving superior spatial detail preservation and spectral consistency.

[CV-39] FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips

[Quick Read]: This paper addresses the difficulty of precisely aligning audio with video in space and time during film Foley production, together with the lack of high-quality stereo Foley datasets. The key components of the proposed FoleyDesigner framework are: (1) a multi-agent architecture for precise spatio-temporal analysis; (2) controllable Foley generation with latent diffusion models conditioned on spatio-temporal cues extracted from video frames; (3) an LLM-driven hybrid mixing mechanism that emulates professional post-production workflows; and (4) FilmStereo, the first professional stereo audio dataset with spatial metadata, precise timestamps, and semantic annotations, supporting 5.1-channel Dolby Atmos systems compliant with the ITU-R BS.775 standard and thus enabling seamless integration with professional film production pipelines.

Link: https://arxiv.org/abs/2604.05731
Authors: Mengtian Li,Kunyan Dai,Yi Ding,Ruobing Ni,Ying Zhang,Wenwu Wang,Zhifeng Xie
Affiliations: Shanghai Film Academy, Shanghai University; Shanghai Engineering Research Center of Motion Picture Special Effects; University of Surrey, UK
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Foley art plays a pivotal role in enhancing immersive auditory experiences in film, yet manual creation of spatio-temporally aligned audio remains labor-intensive. We propose FoleyDesigner, a novel framework inspired by professional Foley workflows, integrating film clip analysis, spatio-temporally controllable Foley generation, and professional audio mixing capabilities. FoleyDesigner employs a multi-agent architecture for precise spatio-temporal analysis. It achieves spatio-temporal alignment through latent diffusion models trained on spatio-temporal cues extracted from video frames, combined with large language model (LLM)-driven hybrid mechanisms that emulate post-production practices in film industry. To address the lack of high-quality stereo audio datasets in film, we introduce FilmStereo, the first professional stereo audio dataset containing spatial metadata, precise timestamps, and semantic annotations for eight common Foley categories. For applications, the framework supports interactive user control while maintaining seamless integration with professional pipelines, including 5.1-channel Dolby Atmos systems compliant with ITU-R BS.775 standards, thereby offering extensive creative flexibility. Extensive experiments demonstrate that our method achieves superior spatio-temporal alignment compared to existing baselines, with seamless compatibility with professional film production standards. The project page is available at this https URL .

[CV-40] Single-Stage Signal Attenuation Diffusion Model for Low-Light Image Enhancement and Denoising

[Quick Read]: This paper addresses the inconsistent optimization objectives in existing diffusion-based low-light image enhancement (LLIE) methods, where mainstream approaches adopt a two-stage pipeline or an auxiliary correction network to refine U-Net outputs, severing the intrinsic link between enhancement and denoising and hurting overall performance. The key to the solution is the Signal Attenuation Diffusion Model (SADM), which introduces a signal attenuation coefficient into the forward noise-addition process to explicitly encode the physical priors of low-light degradation, so that reverse denoising jointly optimizes brightness recovery and noise suppression in a single stage, without the extra correction modules or staged training that existing methods rely on.

Link: https://arxiv.org/abs/2604.05727
Authors: Ying Liu,Junchao Zhang,Caiyun Wu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models excel at image restoration via probabilistic modeling of forward noise addition and reverse denoising, and their ability to handle complex noise while preserving fine details makes them well-suited for Low-Light Image Enhancement (LLIE). Mainstream diffusion based LLIE methods either adopt a two-stage pipeline or an auxiliary correction network to refine U-Net outputs, which severs the intrinsic link between enhancement and denoising and leads to suboptimal performance owing to inconsistent optimization objectives. To address these issues, we propose the Signal Attenuation Diffusion Model (SADM), a novel diffusion process that integrates the signal attenuation mechanism into the diffusion pipeline, enabling simultaneous brightness adjustment and noise suppression in a single stage. Specifically, the signal attenuation coefficient simulates the inherent signal attenuation of low-light degradation in the forward noise addition process, encoding the physical priors of low-light degradation to explicitly guide reverse denoising toward the concurrent optimization of brightness recovery and noise suppression, thereby eliminating the need for extra correction modules or staged training relied on by existing methods. We validate that our design maintains consistency with Denoising Diffusion Implicit Models (DDIM) via multi-scale pyramid sampling, balancing interpretability, restoration quality, and computational efficiency.
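The forward process with an attenuation coefficient can be sketched as a small modification of the DDPM forward step: the clean signal is scaled down before noise is added, so reverse denoising must both denoise and re-amplify brightness. The exact placement of the coefficient and the linear beta schedule below are assumptions for illustration, not the paper's formulation:

```python
import numpy as np

def sadm_forward(x0, t, abar, atten=0.3, rng=None):
    """One forward noising step with a signal attenuation coefficient.

    Standard DDPM forward:  x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
    Here the clean signal is additionally attenuated by `atten` to mimic
    low-light degradation (hypothetical placement of the coefficient).
    """
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar[t]) * atten * x0 + np.sqrt(1.0 - abar[t]) * eps

betas = np.linspace(1e-4, 0.02, 1000)        # linear schedule, illustrative
abar = np.cumprod(1.0 - betas)
snr = lambda t: (0.3 ** 2) * abar[t] / (1.0 - abar[t])   # signal-to-noise
x_t = sadm_forward(np.ones((4, 4)), t=500, abar=abar)
print(x_t.shape, snr(10) > snr(900))  # (4, 4) True
```

Because the attenuation is baked into the forward process, a network trained to invert it learns brightness recovery and noise suppression under one objective rather than in separate stages.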

[CV-41] Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP CVPR2026

[Quick Read]: This paper addresses the fact that current interpretations of CLIP vision encoder internals focus mainly on the semantic meaning of individual sparse autoencoder (SAE) features while overlooking a key complementary dimension: the information scope those features cover. The key to the solution is the Contextual Dependency Score (CDS), which quantifies the stability of SAE features under spatial perturbations, separating positionally stable local-scope features from positionally variant global-scope features. Experiments show that features of different information scopes exert systematically different influences on CLIP's predictions and confidence, establishing information scope as a new axis for analyzing CLIP representations and deepening the diagnostic understanding of SAE-derived features.

Link: https://arxiv.org/abs/2604.05724
Authors: Yusung Ro,Jaehyun Choi,Junmo Kim
Affiliations: KAIST; Korea AI Safety Institute, ETRI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026 Findings

Abstract:Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpreting the internal representations of CLIP vision encoders, yet existing analyses largely focus on the semantic meaning of individual features. We introduce information scope as a complementary dimension of interpretability that characterizes how broadly an SAE feature aggregates visual evidence, ranging from localized, patch-specific cues to global, image-level signals. We observe that some SAE features respond consistently across spatial perturbations, while others shift unpredictably with minor input changes, indicating a fundamental distinction in their underlying scope. To quantify this, we propose the Contextual Dependency Score (CDS), which separates positionally stable local scope features from positionally variant global scope features. Our experiments show that features of different information scopes exert systematically different influences on CLIP’s predictions and confidence. These findings establish information scope as a critical new axis for understanding CLIP representations and provide a deeper diagnostic view of SAE-derived features.
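The idea of scoring positional stability can be illustrated with a toy proxy: perturb the input spatially and measure how much a feature's activation moves. This is a hedged stand-in for the paper's CDS, whose actual formula is not reproduced here; `feature_fn` stands in for one SAE feature on top of a CLIP encoder:

```python
import numpy as np

def contextual_dependency_score(feature_fn, image,
                                shifts=((0, 1), (1, 0), (0, -1), (-1, 0))):
    """Proxy score: activation stability under small spatial shifts.

    A positionally stable feature keeps a similar activation under the
    perturbations and scores near 1; a positionally variant feature
    scores lower. Score lies in (0, 1].
    """
    base = feature_fn(image)
    acts = [feature_fn(np.roll(image, s, axis=(0, 1))) for s in shifts]
    dev = float(np.mean([abs(a - base) for a in acts]))
    return 1.0 / (1.0 + dev)

img = np.random.default_rng(0).standard_normal((16, 16))
stable_feat = lambda x: float(x.mean())    # shift-invariant statistic
variant_feat = lambda x: float(x[0, 0])    # depends on exact position
print(contextual_dependency_score(stable_feat, img) >
      contextual_dependency_score(variant_feat, img))  # True
```

The same contrast, computed over SAE features instead of toy statistics, is what lets the paper sort features into local-scope and global-scope populations.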

[CV-42] GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance CVPR2026

[Quick Read]: This paper addresses the poor generation quality caused by the lack of geometric priors in 3D Gaussian generation, in particular the errors introduced when existing methods rely on unreliable predicted point maps as geometric references. The key to the solution is the GaussianGrow framework, which learns to grow 3D Gaussians from easily accessible 3D point clouds under text guidance, naturally enforcing geometric accuracy during generation. The method uses a multi-view diffusion model to synthesize consistent appearances as supervision and constrains novel views at non-preset camera poses identified in overlapping regions to reduce fusion artifacts; it further iteratively detects un-grown regions and completes hard-to-observe areas by inpainting the rendered views with a pretrained 2D diffusion model, continuing until a complete 3D Gaussian representation is generated.

Link: https://arxiv.org/abs/2604.05721
Authors: Weiqi Zhang,Junsheng Zhou,Haotian Geng,Kanle Shi,Shenkun Xu,Yi Fang,Yu-Shen Liu
Affiliations: Tsinghua University; Kuaishou Technology; NYU Abu Dhabi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026. Project page: this https URL

Abstract:3D Gaussian Splatting has demonstrated superior performance in rendering efficiency and quality, yet the generation of 3D Gaussians still remains a challenge without proper geometric priors. Existing methods have explored predicting point maps as geometric references for inferring Gaussian primitives, while the unreliable estimated geometries may lead to poor generations. In this work, we introduce GaussianGrow, a novel approach that generates 3D Gaussians by learning to grow them from easily accessible 3D point clouds, naturally enforcing geometric accuracy in Gaussian generation. Specifically, we design a text-guided Gaussian growing scheme that leverages a multi-view diffusion model to synthesize consistent appearances from input point clouds for supervision. To mitigate artifacts caused by fusing neighboring views, we constrain novel views generated at non-preset camera poses identified in overlapping regions across different views. For completing the hard-to-observe regions, we propose to iteratively detect the camera pose by observing the largest un-grown regions in point clouds and inpainting them by inpainting the rendered view with a pretrained 2D diffusion model. The process continues until complete Gaussians are generated. We extensively evaluate GaussianGrow on text-guided Gaussian generation from synthetic and even real-scanned point clouds. Project Page: this https URL

[CV-43] MPM: Mutual Pair Merging for Efficient Vision Transformers CVPR2026

[Quick Read]: This paper addresses the inefficiency of transformers in semantic segmentation caused by long sequence lengths, noting that prior token-reduction work often ignores end-to-end latency and that, on modern accelerators, the overhead of computing merge maps can erase the expected gains. The key to the solution is Mutual Pair Merging (MPM), a training-free module that forms mutual nearest-neighbor pairs in cosine space and averages each pair while recording a merge map that enables gather-based reconstruction of pixel-aligned features before the decoder, so that existing segmentation heads can be reused unchanged. MPM introduces no learned parameters and no continuous compression knob (such as a keep rate); the speed-accuracy trade-off is instead set by a discrete insertion schedule. Measurements on a Raspberry Pi 5 and an NVIDIA H100 GPU show latency reductions of up to 60% while keeping the mIoU drop below 3%.

Link: https://arxiv.org/abs/2604.05718
Authors: Simon Ravé,Pejman Rasti,David Rousseau
Affiliations: LARIS, University of Angers; UMR INRAe-IRHS
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026 (Findings)

Abstract:Decreasing sequence length is a common way to accelerate transformers, but prior token reduction work often targets classification and reports proxy metrics rather than end-to-end latency. For semantic segmentation, token reduction is further constrained by the need to reconstruct dense, pixel-aligned features, and on modern accelerators the overhead of computing merge maps can erase expected gains. We propose Mutual Pair Merging (MPM), a training-free token aggregation module that forms mutual nearest-neighbor pairs in cosine space, averages each pair, and records a merge map enabling a gather-based reconstruction before the decoder so that existing segmentation heads can be used unchanged. MPM introduces no learned parameters and no continuous compression knob (no keep-rate or threshold). The speed-accuracy trade-off is set by a discrete insertion schedule. We benchmark end-to-end latency on an NVIDIA H100 GPU (with and without FlashAttention-2) and a Raspberry Pi 5 across standard segmentation datasets. On ADE20K, MPM reduces per-image latency by up to 60% for ViT-Tiny on Raspberry Pi 5, and increases throughput by up to 20% on H100 with FlashAttention-2 while keeping the mIoU drop below 3%. These results suggest that simple, reconstruction-aware, training-free token merging can translate into practical wall-clock gains for segmentation when overhead is explicitly accounted for.
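The core MPM operation — pair tokens that are each other's cosine nearest neighbor, average each pair, and keep a merge map so a gather restores a pixel-aligned sequence — can be sketched directly in numpy. This is an illustrative re-implementation from the abstract's description, not the authors' code:

```python
import numpy as np

def mutual_pair_merge(tokens):
    """Training-free mutual nearest-neighbor token merging with a merge map.

    tokens : (N, D). Tokens i and j are merged when each is the other's
    cosine nearest neighbor; each merged pair is averaged. merge_map[i]
    gives every original token's row in the reduced sequence, so
    reduced[merge_map] restores an (N, D) tensor for the decoder.
    """
    n = tokens.shape[0]
    x = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)
    nn = sim.argmax(axis=1)                    # cosine nearest neighbor
    merge_map = np.full(n, -1, dtype=int)
    merged, used = [], np.zeros(n, dtype=bool)
    for i in range(n):
        if used[i]:
            continue
        j = nn[i]
        if nn[j] == i and not used[j]:         # mutual pair: average it
            merge_map[i] = merge_map[j] = len(merged)
            merged.append((tokens[i] + tokens[j]) / 2)
            used[i] = used[j] = True
        else:                                  # unpaired token kept as-is
            merge_map[i] = len(merged)
            merged.append(tokens[i])
            used[i] = True
    return np.stack(merged), merge_map

toks = np.array([[1., 0.], [0.9, 0.1], [0., 1.], [-1., 0.]])
reduced, mm = mutual_pair_merge(toks)
restored = reduced[mm]                         # gather-based reconstruction
print(reduced.shape[0], restored.shape[0])     # 3 4
```

Because pairing is mutual, at most N/2 merges happen per insertion, which is why compression is controlled by *where* the module is inserted rather than by a keep-rate.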

[CV-44] In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting CVPR3

[Quick Read]: This paper addresses how to reliably exploit monocular depth priors to improve the rendering quality of 3D Gaussian Splatting (GS), which is prone to artifacts when training data is sparse or surfaces are textureless. The key to the solution is a training framework that integrates scale-ambiguous and noisy depth priors into geometric supervision, improving robustness by learning from weakly aligned depth variations, together with a method that identifies and isolates ill-posed geometry for selective monocular depth regularization, restricting the propagation of depth errors into well-reconstructed 3D structures and ultimately achieving more faithful geometry recovery and higher rendering quality.

链接: https://arxiv.org/abs/2604.05715
作者: Wenhui Xiao,Ethan Goan,Rodrigo Santa Cruz,David Ahmedt-Aristizabal,Olivier Salvado,Clinton Fookes,Leo Lebrat
机构: Queensland University of Technology (昆士兰科技大学); CSIRO Data61 (澳大利亚联邦科学与工业研究组织Data61)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 3DMV Workshop

点击查看摘要

Abstract:Using accurate depth priors in 3D Gaussian Splatting helps mitigate artifacts caused by sparse training data and textureless surfaces. However, acquiring accurate depth maps requires specialized acquisition systems. Foundation monocular depth estimation models offer a cost-effective alternative, but they suffer from scale ambiguity, multi-view inconsistency, and local geometric inaccuracies, which can degrade rendering performance when applied naively. This paper addresses the challenge of reliably leveraging monocular depth priors for Gaussian Splatting (GS) rendering enhancement. To this end, we introduce a training framework integrating scale-ambiguous and noisy depth priors into geometric supervision. We highlight the importance of learning from weakly aligned depth variations. We introduce a method to isolate ill-posed geometry for selective monocular depth regularization, restricting the propagation of depth inaccuracies into well-reconstructed 3D structures. Extensive experiments across diverse datasets show consistent improvements in geometric accuracy, leading to more faithful depth estimation and higher rendering quality across different GS variants and monocular depth backbones tested.

[CV-45] Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理真实世界视觉流时存在的物理空间感知能力不足的问题,尤其是由于现有几何感知MLLMs普遍采用单深层提取与输入层融合的范式,导致局部几何细节丢失和早期层语义错位。其解决方案的关键在于提出一种名为GUIDE(Geometric Unrolling Inside MLLM Early-layers)的渐进式几何先验注入框架:通过在几何编码器中进行多层级采样,全面捕捉从局部边缘到全局拓扑的多粒度特征,并逐步对齐与融合这些几何先验至MLLM的早期层;同时引入上下文感知门控机制,使模型可根据当前语义动态获取所需空间线索,从而提升空间先验利用效率并抑制冗余几何噪声,最终引导模型逐步学习2D到3D的过渡过程。

链接: https://arxiv.org/abs/2604.05695
作者: Chongyu Wang,Ting Huang,Chunyu Sun,Xinyu Ning,Di Wang,Hao Tang
机构: Xi’an Jiaotong University (西安交通大学); Peking University (北京大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); InspireOmni
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable progress in 2D visual tasks but still exhibit limited physical spatial awareness when processing real-world visual streams. Recently, feed-forward geometric foundation models, which implicitly extract geometric priors, have provided a new pathway to address this issue. However, existing geometry-aware MLLMs are predominantly constrained by the paradigm of single deep-layer extraction and input-level fusion. This flattened fusion leads to the loss of local geometric details and causes semantic mismatches in the early layers. To break this bottleneck, we propose GUIDE (Geometric Unrolling Inside MLLM Early-layers), a progressive geometric priors injection framework. GUIDE performs multi-level sampling within the geometric encoder, comprehensively capturing multi-granularity features ranging from local edges to global topologies. Subsequently, we rigorously align and fuse these multi-level geometric priors step-by-step with the early layers of the MLLM. Building upon the injection of multi-granularity geometric information, this design guides the model to progressively learn the 2D-to-3D transitional process. Furthermore, we introduce a context-aware gating that enables the model to fetch requisite spatial cues based on current semantics, thereby maximizing the utilization efficiency of spatial priors and effectively suppressing redundant geometric noise. Extensive experiments demonstrate that GUIDE significantly outperforms existing baselines on multiple complex spatial reasoning and perception tasks, establishing a novel paradigm for integrating 3D geometric priors into large models.
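The context-aware gating described above reads, in its simplest form, like a sigmoid gate that decides how much of the geometric prior is injected into each token. The sketch below is our own minimal illustration; the linear gate, shapes, and function names are assumptions, not the paper's architecture.

```python
import numpy as np

def gated_fusion(sem, geo, w, b):
    """Context-aware gating (hypothetical form): a per-token sigmoid gate
    computed from the semantic token controls how much of the geometric
    prior 'geo' is added to the semantic feature 'sem'."""
    gate = 1.0 / (1.0 + np.exp(-(sem @ w + b)))  # per-token scalar gate
    return sem + gate[:, None] * geo
```

With a strongly negative bias the gate stays closed and the semantic stream passes through unchanged; with a strongly positive bias the full geometric prior is injected.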

[CV-46] CRFT: Consistent-Recurrent Feature Flow Transformer for Cross-Modal Image Registration CVPR2026

【速读】:该论文旨在解决跨模态图像配准(cross-modal image registration)中因模态差异导致的配准精度低、鲁棒性差的问题。解决方案的关键在于提出了一种基于特征流学习的一致性递归特征流变换器(Consistent-Recurrent Feature Flow Transformer, CRFT),其核心创新包括:1)在Transformer架构内学习一种模态无关的特征流表示,联合完成特征对齐与光流估计;2)采用粗到精的两阶段设计——粗阶段通过多尺度特征相关建立全局对应关系,细阶段通过分层特征融合与自适应空间推理优化局部细节;3)引入迭代差异引导注意力机制与空间几何变换(Spatial Geometric Transform, SGT)进行递归 refinement,逐步捕捉细微的空间不一致性并强化特征层面的一致性,从而在大仿射和尺度变化下仍保持结构一致性与高精度配准性能。

链接: https://arxiv.org/abs/2604.05689
作者: Xuecong Liu,Mengzhu Ding,Zixuan Sun,Zhang Li,Xichao Teng
机构: Northeastern University, China (东北大学); National University of Defense Technology, China (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:We present Consistent-Recurrent Feature Flow Transformer (CRFT), a unified coarse-to-fine framework based on feature flow learning for robust cross-modal image registration. CRFT learns a modality-independent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning. To enhance geometric adaptability, an iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform (SGT) recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency. This design enables accurate alignment under large affine and scale variations while maintaining structural coherence across modalities. Extensive experiments on diverse cross-modal datasets demonstrate that CRFT consistently outperforms state-of-the-art registration methods in both accuracy and robustness. Beyond registration, CRFT provides a generalizable paradigm for multimodal spatial correspondence, offering broad applicability to remote sensing, autonomous navigation, and medical imaging. Code and datasets are publicly available at this https URL.

[CV-47] 3D Smoke Scene Reconstruction Guided by Vision Priors from Multimodal Large Language Models

【速读】:该论文旨在解决从烟雾退化多视角图像中重建三维场景的难题,此类场景因烟雾引起的强散射效应、视点依赖的外观变化以及跨视角一致性严重退化而极具挑战性。解决方案的关键在于提出一个融合视觉先验与高效三维建模的框架:首先使用Nano-Banana-Pro增强烟雾退化图像以提供更清晰的观测信息;其次设计Smoke-GS(一种介质感知的3D高斯泼溅框架),通过显式3D高斯表示场景并引入轻量级视点依赖介质分支,捕捉由烟雾导致的方向相关外观变化,从而在保持3D高斯泼溅渲染效率的同时提升对烟雾退化的鲁棒性。

链接: https://arxiv.org/abs/2604.05687
作者: Xinye Zheng,Fei Wang,Yiqi Nie,Kun Li,Junjie Chen,Jiaqi Zhao,Yanyan Wei,Zhiliang Wu
机构: Hefei University of Technology (合肥工业大学); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院); Anhui University (安徽大学); United Arab Emirates University (阿联酋大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing 3D scenes from smoke-degraded multi-view images is particularly difficult because smoke introduces strong scattering effects, view-dependent appearance changes, and severe degradation of cross-view consistency. To address these issues, we propose a framework that integrates visual priors with efficient 3D scene modeling. We employ Nano-Banana-Pro to enhance smoke-degraded images and provide clearer visual observations for reconstruction and develop Smoke-GS, a medium-aware 3D Gaussian Splatting framework for smoke scene reconstruction and restoration-oriented novel view synthesis. Smoke-GS models the scene using explicit 3D Gaussians and introduces a lightweight view-dependent medium branch to capture direction-dependent appearance variations caused by smoke. Our method preserves the rendering efficiency of 3D Gaussian Splatting while improving robustness to smoke-induced degradation. Results demonstrate the effectiveness of our method for generating consistent and visually clear novel views in challenging smoke environments.

[CV-48] SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation

【速读】:该论文旨在解决基于流匹配(flow matching)的视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作任务中因迭代去噪过程引入的高延迟问题,特别是现有方法通常需要10步常微分方程(ODE)求解步骤,导致端到端推理时间过长(如pi0.5模型中仅去噪阶段就占80%的计算时间)。解决方案的关键在于提出一种即插即用的自蒸馏方法——SnapFlow,其核心机制是通过混合标准流匹配样本与一致性样本(目标为模型自身边际速度预测计算出的两步欧拉快捷速度),避免条件速度引起的轨迹漂移;同时利用零初始化的目标时间嵌入(target-time embedding),使网络在同一架构内实现局部速度估计与全局单步生成的切换,从而将多步去噪压缩至单次前向传播(1-NFE),显著提升推理效率且不牺牲性能。

链接: https://arxiv.org/abs/2604.05656
作者: Wuyang Luan,Junhui Li,Weiguang Zhao,Wenjian Zhang,Tieru Wu,Rui Ma
机构: Jilin University (吉林大学); Chongqing University (重庆大学); University of Liverpool (利物浦大学); GenY (GenY)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures, 9 tables

点击查看摘要

Abstract:Vision-Language-Action (VLA) models based on flow matching – such as pi0, pi0.5, and SmolVLA – achieve state-of-the-art generalist robotic manipulation, yet their iterative denoising, typically 10 ODE steps, introduces substantial latency: on a modern GPU, denoising alone accounts for 80% of end-to-end inference time. Naively reducing the step count is unreliable, degrading success on most tasks due to the velocity field being uncalibrated for single-step jumps. We present SnapFlow, a plug-and-play self-distillation method that compresses multi-step denoising into a single forward pass (1-NFE) for flow-matching VLAs. SnapFlow mixes standard flow-matching samples with consistency samples whose targets are two-step Euler shortcut velocities computed from the model’s own marginal velocity predictions, avoiding the trajectory drift caused by conditional velocities, as we analyze theoretically. A zero-initialized target-time embedding lets the network switch between local velocity estimation and global one-step generation within a single architecture. SnapFlow requires no external teacher, no architecture changes, and trains in ~12h on a single GPU. We validate on two VLA architectures spanning a 6x parameter range, with identical hyperparameters: on pi0.5 (3B) across four LIBERO suites (40 tasks, 400 episodes), SnapFlow achieves 98.75% average success – matching the 10-step teacher at 97.75% and slightly exceeding it – with 9.6x denoising speedup and end-to-end latency reduced from 274ms to 83ms; on SmolVLA (500M), it reduces MSE by 8.3% with 3.56x end-to-end acceleration. An action-step sweep on long-horizon tasks reveals that SnapFlow maintains its advantage across execution horizons, achieving 93% at n_act=5 where the baseline reaches only 90%. SnapFlow is orthogonal to layer-distillation and token-pruning approaches, enabling compositional speedups.
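The two-step Euler shortcut target mentioned in the abstract can be written down compactly: follow the model's own marginal velocity for two half-steps of size d and average the two velocities, giving the regression target for a single jump over 2d. A hedged sketch (the callable v stands in for the frozen velocity network; conditioning and batching are omitted):

```python
import numpy as np

def shortcut_target(v, x_t, t, d):
    """Two-step Euler shortcut velocity: take two half-steps of size d
    with the model's marginal velocity, then return the average velocity,
    i.e. the target for a single 2*d jump."""
    v1 = v(x_t, t)
    x_mid = x_t + d * v1          # first Euler half-step
    v2 = v(x_mid, t + d)
    return (v1 + v2) / 2.0        # consistency target for the one-jump prediction
```

For the toy linear field v(x, t) = x, starting from x = 1 with d = 0.5, the first step lands at 1.5 and the averaged target is 1.25, illustrating how the shortcut bends toward where the trajectory is heading rather than its instantaneous slope.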

[CV-49] Probing Intrinsic Medical Task Relationships: A Contrastive Learning Perspective

【速读】:该论文旨在解决医学视觉任务间内在关系不明确的问题,即不同任务在表示层面如何相互关联、重叠或差异。现有研究多聚焦于单一任务性能提升,而忽视了跨任务的结构化理解。其解决方案的关键在于提出一种名为Task-Contrastive Learning (TaCo) 的对比学习框架,通过将来自39个不同医学成像模态(如CT、MRI、电子显微镜等)的30种异构任务映射到共享表示空间中,揭示任务间的相似性与差异性,并分析任务迭代变化在嵌入空间中的体现,从而为医学视觉任务的内在结构提供系统性认知基础。

链接: https://arxiv.org/abs/2604.05651
作者: Jonas Muth,Zdravko Marinov,Simon Reiß
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While much of the medical computer vision community has focused on advancing performance for specific tasks, the underlying relationships between tasks, i.e., how they relate, overlap, or differ on a representational level, remain largely unexplored. Our work explores these intrinsic relationships between medical vision tasks, specifically, we investigate 30 tasks, such as semantic tasks (e.g., segmentation and detection), image generative tasks (e.g., denoising, inpainting, or colorization), and image transformation tasks (e.g., geometric transformations). Our goal is to probe whether a data-driven representation space can capture an underlying structure of tasks across a variety of 39 datasets from wildly different medical imaging modalities, including computed tomography, magnetic resonance, electron microscopy, X-ray, ultrasound, and more. By revealing how tasks relate to one another, we aim to provide insights into their fundamental properties and interconnectedness. To this end, we introduce Task-Contrastive Learning (TaCo), a contrastive learning framework designed to embed tasks into a shared representation space. Through TaCo, we map these heterogeneous tasks from different modalities into a joint space and analyze their properties: identifying which tasks are distinctly represented, which blend together, and how iterative alterations to tasks are reflected in the embedding space. Our work provides a foundation for understanding the intrinsic structure of medical vision tasks, offering a deeper understanding of task similarities and their interconnected properties in embedding spaces.

[CV-50] Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis

【速读】:该论文旨在解决胃肠镜影像诊断中因数据有限、领域偏移和标注异质性导致的现有AI模型泛化能力差、适应性弱、鲁棒性不足及可扩展性受限的问题。解决方案的关键在于提出RATNet——一种基于类比推理(analogical reasoning)的胃肠镜影像基础模型,其核心机制是通过循环预训练策略从五个不同来源的胃肠镜数据集中获取并迁移异构专家标注知识;具体架构包含编码器、相关知识获取与传递(RAT)模块、投影层和多任务头,能够支持微调、线性探测和零样本迁移,从而实现对常见疾病诊断、罕见病少样本学习、新医疗场景零样本迁移、长尾分布下的鲁棒性、新疾病适应以及联邦学习隐私保护部署等六种场景的性能提升,其优势源于将图像衍生的后验知识与预训练先验知识库匹配并传递相对知识以指导诊断,显著增强了模型的泛化能力和抗偏差能力。

链接: https://arxiv.org/abs/2604.05649
作者: Peixi Peng(1),Housheng Xie(1),Yanling Wei(2),Guangcong Ruan(2),Xiaoyang Zou(1),Qian Cao(3),Yongjian Nian(2),Guoyan Zheng(1) ((1) Institute of Medical Robotics, School of Biomedical Engineering, Shanghai Jiao Tong University, (2) Daping Hospital, Army Medical University, (3) Sir Run Run Shaw Hospital, Zhejiang University School of Medicine)
机构: Shanghai Jiao Tong University (上海交通大学); Tongji Medical College (同济医学院); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Gastrointestinal diseases impose a growing global health burden, and endoscopy is a primary tool for early diagnosis. However, routine endoscopic image interpretation still suffers from missed lesions and limited efficiency. Although AI-assisted diagnosis has shown promise, existing models often lack generalizability, adaptability, robustness, and scalability because of limited medical data, domain shift, and heterogeneous annotations. To address these challenges, we develop RATNet, a foundation model for gastrointestinal endoscopy imaging based on analogical reasoning. RATNet acquires and transfers knowledge from heterogeneous expert annotations across five gastrointestinal endoscopy datasets through a cyclic pre-training strategy. Its architecture consists of an encoder, a relevance-knowledge acquisition and transfer (RAT) module, a projector, and a multi-task head, and supports fine-tuning, linear probing, and zero-shot transfer. Evaluations show that RATNet outperforms existing foundation models, including GastroNet and GastroVision, across six scenarios: diagnosis of common gastrointestinal diseases, few-shot learning for rare diseases, zero-shot transfer to new medical sites, robustness under long-tailed disease distributions, adaptation to novel diseases, and privacy-preserving deployment via federated learning. Its advantage comes from an analogical reasoning mechanism that matches image-derived posterior knowledge to a learned prior knowledge base and transfers relative knowledge to guide diagnosis, improving generalization and resistance to bias. RATNet is open and cost-effective, supports automatic integration of heterogeneous annotations without manual label unification, and reduces data acquisition costs, making it a practical foundation for intelligent gastrointestinal diagnosis, especially in resource-limited settings.

[CV-51] PanopticQuery: Unified Query-Time Reasoning for 4D Scenes

【速读】:该论文旨在解决动态4D场景中通过自然语言查询实现准确语义定位的问题,核心挑战在于如何将噪声大、视角依赖的2D语义预测转化为全局一致的4D语义解释,尤其在处理交互关系、时间动作和空间关系等复杂语义时表现不足。解决方案的关键在于提出PanopticQuery框架,其基于4D高斯点绘(4D Gaussian Splatting)实现高保真动态重建,并引入多视角语义共识机制,通过聚合多个视角和时间帧上的2D语义预测,过滤不一致结果、强制几何一致性,并借助神经场优化将2D语义提升为结构化的4D语义接地(semantic grounding)。

链接: https://arxiv.org/abs/2604.05638
作者: Ruilin Tang,Yang Zhou,Zhong Ye,Wenxi Liu,Yan Huang,Shengfeng He
机构: South China University of Technology (华南理工大学); Singapore Management University (新加坡管理大学); Guangdong University of Technology (广东工业大学); Fuzhou University (福州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding dynamic 4D environments through natural language queries requires not only accurate scene reconstruction but also robust semantic grounding across space, time, and viewpoints. While recent methods using neural representations have advanced 4D reconstruction, they remain limited in contextual reasoning, especially for complex semantics such as interactions, temporal actions, and spatial relations. A key challenge lies in transforming noisy, view-dependent predictions into globally consistent 4D interpretations. We introduce PanopticQuery, a framework for unified query-time reasoning in 4D scenes. Our approach builds on 4D Gaussian Splatting for high-fidelity dynamic reconstruction and introduces a multi-view semantic consensus mechanism that grounds natural language queries by aggregating 2D semantic predictions across multiple views and time frames. This process filters inconsistent outputs, enforces geometric consistency, and lifts 2D semantics into structured 4D groundings via neural field optimization. To support evaluation, we present Panoptic-L4D, a new benchmark for language-based querying in dynamic scenes. Experiments demonstrate that PanopticQuery sets a new state of the art on complex language queries, effectively handling attributes, actions, spatial relationships, and multi-object interactions. A video demonstration is available in the supplementary materials.

[CV-52] Towards Athlete Fatigue Assessment from Association Football Videos

【速读】:该论文旨在解决足球运动中疲劳监测的客观指标获取难题,传统方法依赖主观自我报告、实验室生物标志物或侵入式传感器(如心率监测器或GPS追踪数据),存在成本高或干扰比赛连续性的问题。解决方案的关键在于利用单目直播视频(monocular broadcast video)作为低成本数据源,结合先进的比赛状态重建(Game State Reconstruction, GSR)技术提取球员在场地坐标系中的轨迹,并提出一种新颖的运动学处理算法,从重构轨迹中获得时间一致的速度与加速度估计。进而构建加速度-速度(Acceleration-Speed, A-S)轮廓作为疲劳相关性能指标,实验证明该方法可在不依赖侵入式设备的前提下有效捕捉疲劳相关的运动学模式,同时揭示了轨迹噪声、校准误差和时间断续性等挑战,为未来基于视频的疲劳分析提供了可行路径。

链接: https://arxiv.org/abs/2604.05636
作者: Xavier Bou,Nathan Correger,Alexandre Cloots,Cédric Gavage,Silvio Giancola,Cédric Schwartz,François Delvaux,Rudi Cloots,Marc Van Droogenbroeck,Anthony Cioppa
机构: Université Paris-Saclay, CNRS, ENS Paris-Saclay; AMIAD, Pôle Recherche; University of Liège; CESI; HEPL; KAUST
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fatigue monitoring is central in association football due to its links with injury risk and tactical performance. However, objective fatigue-related indicators are commonly derived from subjective self-reported metrics, biomarkers derived from laboratory tests, or, more recently, intrusive sensors such as heart monitors or GPS tracking data. This paper studies whether monocular broadcast videos can provide spatio-temporal signals of sufficient quality to support fatigue-oriented analysis. Building on state-of-the-art Game State Reconstruction methods, we extract player trajectories in pitch coordinates and propose a novel kinematics processing algorithm to obtain temporally consistent speed and acceleration estimates from reconstructed tracks. We then construct acceleration–speed (A-S) profiles from these signals and analyze their behavior as fatigue-related performance indicators. We evaluate the full pipeline on the public SoccerNet-GSR benchmark, considering both 30-second clips and a complete 45-minute half to examine short-term reliability and longer-term temporal consistency. Our results indicate that monocular GSR can recover kinematic patterns that are compatible with A-S profiling while also revealing sensitivity to trajectory noise, calibration errors, and temporal discontinuities inherent to broadcast footage. These findings support monocular broadcast video as a low-cost basis for fatigue analysis and delineate the methodological challenges for future research.
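The kinematics step — temporally consistent speed and acceleration from reconstructed tracks — can be approximated with smoothing plus finite differences. The sketch below is a simplified stand-in for the paper's processing algorithm; the moving-average smoother, window size, and frame rate are our illustrative choices.

```python
import numpy as np

def kinematics(xy, fps=25.0, win=9):
    """Speed (m/s) and tangential acceleration (m/s^2) from a
    pitch-coordinate trajectory xy of shape (T, 2), in metres:
    moving-average smoothing followed by finite differences."""
    k = np.ones(win) / win
    smooth = np.stack(
        [np.convolve(xy[:, i], k, mode="same") for i in range(2)], axis=1
    )
    vel = np.gradient(smooth, 1.0 / fps, axis=0)   # per-axis velocity
    speed = np.linalg.norm(vel, axis=1)
    accel = np.gradient(speed, 1.0 / fps)
    return speed, accel
```

An acceleration-speed (A-S) profile can then be built by binning speed and taking, per bin, the upper envelope of acceleration — the fatigue-related indicator the paper analyzes.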

[CV-53] SGANet: Semantic and Geometric Alignment for Multimodal Multi-view Anomaly Detection

【速读】:该论文旨在解决多视角异常检测(Multi-view anomaly detection)中因视角变化和模态差异导致的特征不一致性问题,从而提升复杂物体表面缺陷检测的准确性。其解决方案的关键在于提出了一种统一框架——语义与几何对齐网络(Semantic and Geometric Alignment Network, SGANet),通过联合建模三个核心模块实现物理一致的特征表示:选择性跨视图特征精炼模块(Selective Cross-view Feature Refinement Module, SCFRM)增强跨视角特征交互;语义结构补丁对齐(Semantic-Structural Patch Alignment, SSPA)在保持结构一致性的同时实现跨模态语义对齐;多视角几何对齐(Multi-View Geometric Alignment, MVGA)进一步对齐不同视角下几何对应区域的补丁。该方法有效提升了多模态多视角场景下的异常检测与定位性能。

链接: https://arxiv.org/abs/2604.05632
作者: Letian Bai,Chengyu Tao,Juan Du
机构: Hong Kong University of Science and Technology (广州) (香港科技大学(广州)); Hunan University (湖南大学); The University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-view anomaly detection aims to identify surface defects on complex objects using observations captured from multiple viewpoints. However, existing unsupervised methods often suffer from feature inconsistency arising from viewpoint variations and modality discrepancies. To address these challenges, we propose a Semantic and Geometric Alignment Network (SGANet), a unified framework for multimodal multi-view anomaly detection that effectively combines semantic and geometric alignment to learn physically coherent feature representations across viewpoints and modalities. SGANet consists of three key components. The Selective Cross-view Feature Refinement Module (SCFRM) selectively aggregates informative patch features from adjacent views to enhance cross-view feature interaction. The Semantic-Structural Patch Alignment (SSPA) enforces semantic alignment across modalities while maintaining structural consistency under viewpoint transformations. The Multi-View Geometric Alignment (MVGA) further aligns geometrically corresponding patches across viewpoints. By jointly modeling feature interaction, semantic and structural consistency, and global geometric correspondence, SGANet effectively enhances anomaly detection performance in multimodal multi-view settings. Extensive experiments on the SiM3D and Eyecandies datasets demonstrate that SGANet achieves state-of-the-art performance in both anomaly detection and localization, validating its effectiveness in realistic industrial scenarios.

[CV-54] A Unified Foundation Model for All-in-One Multi-Modal Remote Sensing Image Restoration and Fusion with Language Prompting

【速读】:该论文旨在解决遥感影像中普遍存在的云层、雾霾、噪声、分辨率限制及传感器异质性等问题,传统修复与融合方法需为每种退化类型单独训练模型,缺乏通用性和效率。其解决方案的关键在于提出首个面向多模态、多任务的遥感低层视觉统一基础模型——LLaRS,通过Sinkhorn-Knopp最优传输算法将异构波段对齐至语义匹配槽位,并引入三种互补的专家混合层(卷积专家用于空间模式建模、通道混合专家保障光谱保真度、注意力专家结合低秩适配器捕捉全局上下文),同时采用逐步动态权重调整策略稳定联合训练过程,从而实现跨任务、跨模态的高效统一修复与增强。

链接: https://arxiv.org/abs/2604.05629
作者: Yongchuan Cui,Peng Liu
机构: Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote sensing imagery suffers from clouds, haze, noise, resolution limits, and sensor heterogeneity. Existing restoration and fusion approaches train separate models per degradation type. In this work, we present Language-conditioned Large-scale Remote Sensing restoration model (LLaRS), the first unified foundation model for multi-modal and multi-task remote sensing low-level vision. LLaRS employs Sinkhorn-Knopp optimal transport to align heterogeneous bands into semantically matched slots, routes features through three complementary mixture-of-experts layers (convolutional experts for spatial patterns, channel-mixing experts for spectral fidelity, and attention experts with low-rank adapters for global context), and stabilizes joint training via step-level dynamic weight adjustment. To train LLaRS, we construct LLaRS1M, a million-scale multi-task dataset spanning eleven restoration and enhancement tasks, integrating real paired observations and controlled synthetic degradations with diverse natural language prompts. Experiments show LLaRS consistently outperforms seven competitive models, and parameter-efficient finetuning experiments demonstrate strong transfer capability and adaptation efficiency on unseen data. Repo: this https URL
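The Sinkhorn-Knopp step used to align heterogeneous bands to slots is the standard entropic optimal-transport iteration. A self-contained version with uniform marginals follows; the cost matrix, regularization strength, and iteration count are illustrative, not LLaRS's actual settings.

```python
import numpy as np

def sinkhorn(cost, eps=0.5, iters=200):
    """Sinkhorn-Knopp: entropy-regularized optimal transport plan between
    uniform marginals, i.e. a soft assignment of n bands to m slots."""
    n, m = cost.shape
    K = np.exp(-cost / eps)                 # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    r, c = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    for _ in range(iters):
        u = r / (K @ v)                     # match row marginals
        v = c / (K.T @ u)                   # match column marginals
    return u[:, None] * K * v[None, :]      # transport plan
```

The returned plan is nonnegative with row sums 1/n and column sums 1/m; reading each row as a soft band-to-slot assignment gives the semantically matched slots described in the abstract.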

[CV-55] FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos CVPR2026

【速读】:该论文旨在解决从第一人称RGB-D交互视频中直接重建可交互的室内场景三维数字孪生(3D digital twins)的问题,尤其针对现有刚性结构重建方法无法处理复杂关节运动和缺乏真实场景适应性的问题。解决方案的关键在于提出FunRec方法,其能够自动发现场景中的可动部件(articulated parts),估计其运动学参数(kinematic parameters),跟踪其三维运动,并在规范空间(canonical space)中重建静态与动态几何结构,从而生成可用于物理仿真(simulation-compatible)的网格模型。该方法无需控制环境、多状态采集或CAD先验,仅依赖自然交互视频即可实现高精度的分割、姿态估计和重建,显著优于现有技术。

链接: https://arxiv.org/abs/2604.05621
作者: Alexandros Delitzas,Chenyangguang Zhang,Alexey Gavryushin,Tommaso Di Mario,Boyang Sun,Rishabh Dabral,Leonidas Guibas,Christian Theobalt,Marc Pollefeys,Francis Engelmann,Daniel Barath
机构: ETH Zurich (苏黎世联邦理工学院); Max Planck Institute for Informatics (马克斯·普朗克信息研究所); Stanford University (斯坦福大学); Microsoft (微软); USI Lugano (卢加诺大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. Project page: this https URL

点击查看摘要

Abstract:We present FunRec, a method for reconstructing functional 3D digital twins of indoor scenes directly from egocentric RGB-D interaction videos. Unlike existing methods on articulated reconstruction, which rely on controlled setups, multi-state captures, or CAD priors, FunRec operates directly on in-the-wild human interaction sequences to recover interactable 3D scenes. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. Across new real and simulated benchmarks, FunRec surpasses prior work by a large margin, achieving up to +50 mIoU improvement in part segmentation, 5-10 times lower articulation and pose errors, and significantly higher reconstruction accuracy. We further demonstrate applications on URDF/USD export for simulation, hand-guided affordance mapping and robot-scene interaction.

[CV-56] Semantic-Topological Graph Reasoning for Language-Guided Pulmonary Screening

【速读】:该论文旨在解决医学图像分割中由自由文本临床指令驱动时面临的两大挑战:一是临床报告中存在的语义模糊性,二是低对比度影像中复杂解剖结构的重叠导致的分割歧义;二是现有大规模多模态模型在有限医疗数据上全量微调易引发严重过拟合问题。解决方案的关键在于提出一种语义-拓扑图推理(Semantic-Topological Graph Reasoning, STGR)框架,其核心创新包括:1)引入Text-to-Vision Intent Distillation (TVID)模块从临床文本中提取精准诊断意图,实现语言指导的语义对齐;2)将掩码选择建模为动态图推理问题,通过节点表示候选病灶、边编码空间与语义相似性来消除解剖歧义;3)设计Selective Asymmetric Fine-Tuning (SAFT)策略,仅更新不足1%参数以显著提升模型泛化能力并保障部署可行性,实验证明该方法在LIDC-IDRI和LNDb数据集上达到新的SOTA性能(Dice Similarity Coefficient达81.5%),且跨折稳定性优异(方差仅0.6%)。

链接: https://arxiv.org/abs/2604.05620
作者: Chenyu Xue,Yiran Liu,Mian Zhou,Jionglong Su,Zhixiang Lu
机构: Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); University College London (伦敦大学学院); University of Liverpool (利物浦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical image segmentation driven by free-text clinical instructions is a critical frontier in computer-aided diagnosis. However, existing multimodal and foundation models struggle with the semantic ambiguity of clinical reports and fail to disambiguate complex anatomical overlaps in low-contrast scans. Furthermore, fully fine-tuning these massive architectures on limited medical datasets invariably leads to severe overfitting. To address these challenges, we propose a novel Semantic-Topological Graph Reasoning (STGR) framework for language-guided pulmonary screening. Our approach elegantly synergizes the reasoning capabilities of large language models (LLaMA-3-V) with the zero-shot delineation of vision foundation models (MedSAM). Specifically, we introduce a Text-to-Vision Intent Distillation (TVID) module to extract precise diagnostic guidance. To resolve anatomical ambiguity, we formulate mask selection as a dynamic graph reasoning problem, where candidate lesions are modeled as nodes and edges capture spatial and semantic affinities. To ensure deployment feasibility, we introduce a Selective Asymmetric Fine-Tuning (SAFT) strategy that updates less than 1% of the parameters. Rigorous 5-fold cross-validation on the LIDC-IDRI and LNDb datasets demonstrates that our framework establishes a new state-of-the-art. Notably, it achieves an 81.5% Dice Similarity Coefficient (DSC) on LIDC-IDRI, outperforming leading LLM-based tools like LISA by over 5%. Crucially, our SAFT strategy acts as a powerful regularizer, yielding exceptional cross-fold stability (0.6% DSC variance) and paving the way for robust, context-aware clinical deployment.

[CV-57] Evaluation of Randomization through Style Transfer for Enhanced Domain Generalization

【速读】:该论文旨在解决深度学习模型在计算机视觉任务中因训练数据与真实场景存在差异(即Sim2Real差距)而导致的泛化性能不佳问题,尤其针对使用合成数据训练时的表现瓶颈。其解决方案的关键在于通过系统性实证研究明确风格迁移作为数据增强策略在域泛化中的三个核心设计维度:风格池多样性、纹理复杂度影响以及风格来源选择。研究发现,扩大风格池比重复使用少量风格更有效,当风格池足够大时纹理复杂度不再显著影响效果,且多样化的艺术风格优于与目标域对齐的风格。基于这些发现,作者提出轻量级、模型无关的数据增强方法StyleMixDG(Style-Mixing for Domain Generalization),无需修改网络结构或引入额外损失函数,在多个跨域基准测试中显著优于强基线模型,验证了所识别设计原则的实际有效性。

链接: https://arxiv.org/abs/2604.05616
作者: Dustin Eisenhardt,Timothy Schaumlöffel,Alperen Kantarci,Gemma Roig
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep learning models for computer vision often suffer from poor generalization when deployed in real-world settings, especially when trained on synthetic data due to the well-known Sim2Real gap. Despite the growing popularity of style transfer as a data augmentation strategy for domain generalization, the literature contains unresolved contradictions regarding three key design axes: the diversity of the style pool, the role of texture complexity, and the choice of style source. We present a systematic empirical study that isolates and evaluates each of these factors for driving scene understanding, resolving inconsistencies in prior work. Our findings show that (i) expanding the style pool yields larger gains than repeated augmentation with few styles, (ii) texture complexity has no significant effect when the pool is sufficiently large, and (iii) diverse artistic styles outperform domain-aligned alternatives. Guided by these insights, we derive StyleMixDG (Style-Mixing for Domain Generalization), a lightweight, model-agnostic augmentation recipe that requires no architectural modifications or additional losses. Evaluated on the GTAV → BDD100k, Cityscapes, Mapillary Vistas benchmark, StyleMixDG demonstrates consistent improvements over strong baselines, confirming that the empirically identified design principles translate into practical gains. The code will be released on GitHub.
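Style-transfer augmentation of the kind this line of work builds on is often implemented as AdaIN-style statistic matching. The sketch below mixes per-channel feature statistics toward a style image; it is a generic illustration of the technique, not StyleMixDG's exact recipe, and the mixing coefficient is our assumption.

```python
import numpy as np

def adain_mix(content, style, lam=0.5):
    """AdaIN-style statistic mixing on (C, H, W) feature maps: shift the
    content features' per-channel mean/std toward the style's by a
    factor lam (lam=0 keeps content, lam=1 fully adopts style stats)."""
    cm = content.mean(axis=(1, 2), keepdims=True)
    cs = content.std(axis=(1, 2), keepdims=True) + 1e-6
    sm = style.mean(axis=(1, 2), keepdims=True)
    ss = style.std(axis=(1, 2), keepdims=True) + 1e-6
    mm = lam * sm + (1 - lam) * cm      # mixed target mean
    ms = lam * ss + (1 - lam) * cs      # mixed target std
    return (content - cm) / cs * ms + mm
```

Sampling lam (and the style source) per training image yields a lightweight, model-agnostic augmentation in the spirit of the recipe the abstract describes.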

[CV-58] ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference

【速读】:该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)在推理过程中因视觉token冗余导致效率低下的问题,尤其在高剪枝比率下难以兼顾token重要性与多样性,从而影响性能。现有方法要么保留冗余token(基于重要性的策略),要么忽略关键信息(基于多样性的策略)。解决方案的关键在于提出ID-Selection——一种将重要性估计与多样性感知的迭代选择相结合的token筛选策略:首先为每个token分配重要性分数,随后按分数顺序逐个选取,并在每一步抑制与已选token相似者的分数,从而在统一过程中同时保留信息丰富的token并减少冗余。该方法无需额外训练,在多个LVLM骨干网络和基准测试中均实现了显著的效率提升与性能保持。

链接: https://arxiv.org/abs/2604.05601
作者: Zhaohong Huang,Wenjing Liu,Yuxin Zhang,Fei Chao,Rongrong Ji
机构: Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances have explored visual token pruning to accelerate the inference of large vision-language models (LVLMs). However, existing methods often struggle to balance token importance and diversity: importance-based methods tend to retain redundant tokens, whereas diversity-based methods may overlook informative ones. This trade-off becomes especially problematic under high reduction ratios, where preserving only a small subset of visual tokens is critical. To address this issue, we propose ID-Selection, a simple yet effective token selection strategy for efficient LVLM inference. The key idea is to couple importance estimation with diversity-aware iterative selection: each token is first assigned an importance score, after which high-scoring tokens are selected one by one while the scores of similar tokens are progressively suppressed. In this way, ID-Selection preserves informative tokens while reducing redundancy in a unified selection process. Extensive experiments across 5 LVLM backbones and 16 main benchmarks demonstrate that ID-Selection consistently achieves superior performance and efficiency, especially under extreme pruning ratios. For example, on LLaVA-1.5-7B, ID-Selection prunes 97.2% of visual tokens, retaining only 16 tokens, while reducing inference FLOPs by over 97% and preserving 91.8% of the original performance, all without additional training.
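The coupled importance-diversity loop can be sketched directly from the abstract: greedily pick the highest-scoring token, then damp the scores of tokens similar to it. The multiplicative suppression rule and the alpha parameter below are our guesses at a plausible form, not the paper's exact update.

```python
import numpy as np

def id_selection(tokens, importance, k, alpha=1.0):
    """Importance-diversity token selection: iteratively pick the token
    with the highest score, then suppress the scores of cosine-similar
    tokens so redundant near-duplicates are deprioritized."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = np.clip(normed @ normed.T, 0.0, 1.0)
    scores = importance.astype(float).copy()
    picked = []
    for _ in range(k):
        i = int(np.argmax(scores))
        picked.append(i)
        scores[i] = -np.inf                         # never re-pick
        alive = np.isfinite(scores)
        scores[alive] *= 1.0 - alpha * sim[alive, i]  # suppress similar tokens
    return picked
```

With two near-duplicate tokens and one distinct token, a pure importance ranking would keep both duplicates, whereas the suppression step promotes the distinct token instead.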

[CV-59] Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作中对语言细微差异的鲁棒性不足这一关键安全问题,这种缺陷可能导致真实部署时出现灾难性行为。现有基于强化学习(Reinforcement Learning, RL)的红队测试方法常因奖励最大化特性导致模式崩溃(mode collapse),仅发现少量重复或平凡的失败场景,无法全面揭示潜在风险。解决方案的关键在于提出一种新颖的多样性感知的具身红队框架(Diversity-Aware Embodied Red Teaming, DAERT),其核心设计是通过评估一个统一策略(uniform policy)来生成多样化且具有挑战性的指令,同时保证攻击有效性(以物理仿真中的执行失败为衡量指标)。实验表明,DAERT能显著降低VLA任务成功率(从93.33%降至5.85%),有效暴露了VLA模型的安全盲区,为大规模压力测试提供了可扩展的方法。

链接: https://arxiv.org/abs/2604.05595
作者: Baoshun Tong,Haoran He,Ling Pan,Yang Liu,Liang Lin
机构: Sun Yat-sen University (中山大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have achieved remarkable success in robotic manipulation. However, their robustness to linguistic nuances remains a critical, under-explored safety concern, posing a significant safety risk to real-world deployment. Red teaming, or identifying environmental scenarios that elicit catastrophic behaviors, is an important step in ensuring the safe deployment of embodied AI agents. Reinforcement learning (RL) has emerged as a promising approach in automated red teaming that aims to uncover these vulnerabilities. However, standard RL-based adversaries often suffer from severe mode collapse due to their reward-maximizing nature, which tends to converge to a narrow set of trivial or repetitive failure patterns, failing to reveal the comprehensive landscape of meaningful risks. To bridge this gap, we propose a novel Diversity-Aware Embodied Red Teaming (DAERT) framework, to expose the vulnerabilities of VLAs against linguistic variations. Our design is based on evaluating a uniform policy, which is able to generate a diverse set of challenging instructions while ensuring its attack effectiveness, measured by execution failures in a physical simulator. We conduct extensive experiments across different robotic benchmarks against two state-of-the-art VLAs, including π_0 and OpenVLA. Our method consistently discovers a wider range of more effective adversarial instructions that reduce the average task success rate from 93.33% to 5.85%, demonstrating a scalable approach to stress-testing VLA agents and exposing critical safety blind spots before real-world deployment.

[CV-60] BPC-Net: Annotation-Free Skin Lesion Segmentation via Boundary Probability Calibration

【速读】:该论文旨在解决无标注皮肤病变分割(annotation-free skin lesion segmentation)中的三个耦合挑战:噪声伪标签监督、有限目标域数据下的不稳定迁移以及边界概率置信度不足的问题。其中,边界概率置信度不足直接影响轮廓完整性,且无法仅通过全局阈值调整有效修正。解决方案的关键在于提出BPC-Net框架,其核心是高斯概率平滑(Gaussian Probability Smoothing, GPS),该方法在阈值化前对局部概率空间进行校准,以恢复置信度不足的病变边界,同时避免引发非特异性前景扩张;此外,为增强在噪声伪监督和跨域迁移下的鲁棒性,还引入了特征解耦解码器和交互分支适应策略,分别实现上下文抑制、细节恢复与边界精修的分离处理,以及仅更新伪标签交互分支而保留部署时仅基于图像的分割路径。

链接: https://arxiv.org/abs/2604.05594
作者: Yujie Yao,Yuhaohang He,Junjie Huang,Zhou Liu,Jiangzhao Li,Yan Qiao,Wen Xiao,Yunsen Liang,Xiaofan Li
机构: Sichuan Agricultural University (四川农业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Annotation-free skin lesion segmentation is attractive for low-resource dermoscopic deployment. However, its performance remains constrained by three coupled challenges: noisy pseudo-label supervision, unstable transfer under limited target-domain data, and boundary probability under-confidence. Most existing annotation-free methods primarily focus on pseudo-label denoising. In contrast, the effect of compressed boundary probabilities on final mask quality has received less explicit attention, although it directly affects contour completeness and cannot be adequately corrected by global threshold adjustment alone. To address this issue, we propose BPC-Net, a boundary probability calibration framework for annotation-free skin lesion segmentation. The core of the framework is Gaussian Probability Smoothing (GPS), which performs localized probability-space calibration before thresholding to recover under-confident lesion boundaries without inducing indiscriminate foreground expansion. To support this calibration under noisy pseudo-supervision and cross-domain transfer, we further incorporate two auxiliary designs: a feature-decoupled decoder that separately handles context suppression, detail recovery, and boundary refinement, and an interaction-branch adaptation strategy that updates only the pseudo-label interaction branch while preserving the deployed image-only segmentation path. Under a strictly annotation-free protocol, no manual masks are used during training or target-domain adaptation, and validation labels, when available, are used only for final operating-point selection. Experiments on ISIC-2017, ISIC-2018, and PH2 show that the proposed framework achieves state-of-the-art performance among published unsupervised methods, reaching a macro-average Dice coefficient and Jaccard index of 85.80% and 76.97%, respectively, while approaching supervised reference performance on PH2.

[CV-61] Purify-then-Align: Towards Robust Human Sensing under Modality Missing with Knowledge Distillation from Noisy Multimodal Teacher CVPR2026

【速读】:该论文旨在解决多模态人体感知中因模态缺失带来的鲁棒性问题,核心挑战在于两个相互关联的障碍:异构数据之间的表示差距(Representation Gap)以及低质量模态引入的污染效应(Contamination Effect)。解决方案的关键在于提出一种“先净化后对齐”(Purify-then-Align)框架PTA,通过元学习与知识扩散的协同整合来打破二者间的因果依赖关系。具体而言,首先利用元学习驱动的加权机制动态抑制噪声模态的影响以实现知识源净化;随后,基于扩散的知识蒸馏范式,由净化后的高质量教师模型引导各学生模态特征对齐,从而显著提升单模态编码器在多种缺失模态场景下的性能和鲁棒性。

链接: https://arxiv.org/abs/2604.05584
作者: Pengcheng Weng(1,2),Yanyu Qian(1,3),Yangxin Xu(1),Fei Wang(1) ((1) School of Software Engineering, Xi’an Jiaotong University, China, (2) Institute of Computer Science, University of Bern, Switzerland, (3) College of Computing and Data Science, Nanyang Technological University, Singapore)
机构: Xi’an Jiaotong University (西安交通大学); Universität Bern (伯尔尼大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026 Workshop On Any-to-Any Multimodal Learning

点击查看摘要

Abstract:Robust multimodal human sensing must overcome the critical challenge of missing modalities. Two principal barriers are the Representation Gap between heterogeneous data and the Contamination Effect from low-quality modalities. These barriers are causally linked, as the corruption introduced by contamination fundamentally impedes the reduction of representation disparities. In this paper, we propose PTA, a novel “Purify-then-Align” framework that solves this causal dependency through a synergistic integration of meta-learning and knowledge diffusion. To purify the knowledge source, PTA first employs a meta-learning-driven weighting mechanism that dynamically learns to down-weight the influence of noisy, low-contributing modalities. Subsequently, to align different modalities, PTA introduces a diffusion-based knowledge distillation paradigm in which an information-rich clean teacher, formed from this purified consensus, refines the features of each student modality. The ultimate payoff of this “Purify-then-Align” strategy is the creation of exceptionally powerful single-modality encoders imbued with cross-modal knowledge. Comprehensive experiments on the large-scale MM-Fi and XRF55 datasets, under pronounced Representation Gap and Contamination Effect, demonstrate that PTA achieves state-of-the-art performance and significantly improves the robustness of single-modality models in diverse missing-modality scenarios.

[CV-62] WRF4CIR: Weight-Regularized Fine-Tuning Network for Composed Image Retrieval

【速读】:该论文旨在解决基于视觉-语言预训练模型(Vision-Language Pre-trained Models, VLP)的组合图像检索(Composed Image Retrieval, CIR)方法在小样本三元组数据下普遍存在的严重过拟合问题,其核心挑战在于不同模型与数据集之间存在显著且此前被忽视的泛化差距。解决方案的关键在于提出一种权重正则化微调网络(Weight-Regularized Fine-tuning network for CIR, WRF4CIR),其核心机制是在微调过程中对模型权重施加对抗扰动,扰动方向与梯度下降相反,从而人为增加模型拟合训练数据的难度,有效缓解过拟合现象,并显著缩小泛化差距。

链接: https://arxiv.org/abs/2604.05583
作者: Yizhuo Xu,Chaojian Yu,Yuanjie Shao,Tongliang Liu,Qinmu Peng,Xinge You
机构: Huazhong University of Science and Technology (华中科技大学); University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Composed Image Retrieval (CIR) task aims to retrieve target images based on reference images and modification texts. Current CIR methods primarily rely on fine-tuning vision-language pre-trained models. However, we find that these approaches commonly suffer from severe overfitting, posing challenges for CIR with limited triplet data. To better understand this issue, we present a systematic study of overfitting in VLP-based CIR, revealing a significant and previously overlooked generalization gap across different models and datasets. Motivated by these findings, we introduce WRF4CIR, a Weight-Regularized Fine-tuning network for CIR. Specifically, during the fine-tuning process, we apply adversarial perturbations to the model weights for regularization, where these perturbations are generated in the opposite direction of gradient descent. Intuitively, WRF4CIR increases the difficulty of fitting the training data, which helps mitigate overfitting in CIR under limited triplet supervision. Extensive experiments on benchmark datasets demonstrate that WRF4CIR significantly narrows the generalization gap and achieves substantial improvements over existing methods.
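摘要中“沿梯度下降的反方向对权重施加对抗扰动”的正则化思路,与 SAM(Sharpness-Aware Minimization)的两步更新类似。下面给出一个单步更新的假设性示意(`lr`、`rho` 均为假设超参数,论文未公开实现细节,此处仅按摘要描述进行类比):

```python
import numpy as np

def perturbed_step(w, grad_fn, lr=0.1, rho=0.05):
    """带权重扰动的单步微调示意(SAM 式思路,细节为假设)。

    w: 展平后的模型权重向量
    grad_fn: 给定权重、返回损失梯度的函数
    """
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # 沿梯度上升方向构造对抗扰动
    g_adv = grad_fn(w + eps)                     # 在扰动后的权重处重新求梯度
    return w - lr * g_adv                        # 用扰动处的梯度更新原始权重
```

直观上,更新使用的是“更难拟合”的邻域权重处的梯度,从而增加拟合训练数据的难度,对应摘要所述在有限三元组监督下缓解过拟合的机制。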

[CV-63] High-Resolution Single-Shot Polarimetric Imaging Made Easy

【速读】:该论文旨在解决基于偏振的成像系统在实际应用中面临的两大挑战:一是现有分焦平面(Division-of-Focal-Plane, DoFP)传感器因空间复用机制导致的空间分辨率下降和伪影问题;二是如何在保持快照式采集能力的前提下实现高质量的偏振信息重建。解决方案的关键在于提出了一种名为EasyPolar的多视角偏振成像框架,其核心创新包括:首先,基于线性偏振可由三个独立强度测量完全表征的物理先验,设计了一个由三台同步RGB相机组成的硬件系统,分别捕获一个未偏振视图和两个不同偏振方向的偏振视图;其次,引入一种置信度引导的偏振重建网络,在多模态特征融合过程中采用置信度感知的物理约束机制,有效抑制了图像配准误差引起的伪影,并显式地施加几何一致性约束于解空间,从而实现了高保真偏振图像重建并提升下游任务性能。

链接: https://arxiv.org/abs/2604.05581
作者: Shuangfan Zhou,Chu Zhou,Heng Guo,Youwei Lyu,Boxin Shi,Zhanyu Ma,Imari Sato
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Polarization-based vision has gained increasing attention for providing richer physical cues beyond RGB images. While achieving single-shot capture is highly desirable for practical applications, existing Division-of-Focal-Plane (DoFP) sensors inherently suffer from reduced spatial resolution and artifacts due to their spatial multiplexing mechanism. To overcome these limitations without sacrificing the snapshot capability, we propose EasyPolar, a multi-view polarimetric imaging framework. Our system is grounded in the physical insight that three independent intensity measurements are sufficient to fully characterize linear polarization. Guided by this, we design a triple-camera setup consisting of three synchronized RGB cameras that capture one unpolarized view and two polarized views with distinct orientations. Building upon this hardware design, we further propose a confidence-guided polarization reconstruction network to address the potential misalignment in multi-view fusion. The network performs multi-modal feature fusion under a confidence-aware physical guidance mechanism, which effectively suppresses warping-induced artifacts and enforces explicit geometric constraints on the solution space. Experimental results demonstrate that our method achieves high-quality results and benefits various downstream tasks.
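摘要中“三个独立强度测量即可完全表征线偏振”对应理想线偏振片下的 Stokes 参量方程 I(θ) = (S0 + S1·cos2θ + S2·sin2θ)/2。以下示意假设两台偏振相机分别置于 0° 与 45°(论文未说明具体角度配置,此处仅为演示该物理先验的可解性):

```python
import numpy as np

def stokes_from_three(i_unpol, i_0, i_45):
    """由一个非偏振测量与 0°/45° 两个偏振测量恢复线偏振 Stokes 参量。

    i_unpol: 非偏振视图强度,等于 S0
    i_0, i_45: 偏振片方向为 0° 与 45° 时的测量强度
    """
    s0 = np.asarray(i_unpol, dtype=float)
    s1 = 2.0 * np.asarray(i_0, dtype=float) - s0   # I(0°)  = (S0 + S1) / 2
    s2 = 2.0 * np.asarray(i_45, dtype=float) - s0  # I(45°) = (S0 + S2) / 2
    dolp = np.sqrt(s1**2 + s2**2) / np.maximum(s0, 1e-8)  # 线偏振度
    aolp = 0.5 * np.arctan2(s2, s1)                        # 偏振角
    return s0, s1, s2, dolp, aolp
```

三个方程恰好解出 (S0, S1, S2),这正是论文三相机硬件设计的物理依据;实际系统中还需摘要所述的置信度引导网络来处理多视角配准误差。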

[CV-64] Physics-Aligned Spectral Mamba: Decoupling Semantics and Dynamics for Few-Shot Hyperspectral Target Detection

【速读】:该论文旨在解决少样本高光谱目标检测(few-shot hyperspectral target detection, HTD)中深度模型微调效率低、易过拟合,以及现有方法忽视高光谱数据频域结构和波段连续性导致的跨域泛化能力弱的问题。其核心解决方案是提出SpecMamba框架,关键在于通过解耦稳定语义表示与灵活光谱适应机制实现参数高效且频率感知的适配:首先设计基于离散余弦变换(Discrete Cosine Transform, DCT)的Mamba适配器(DCTMA),在冻结Transformer特征的基础上,利用DCT将光谱特征映射至频域并结合Mamba的状态空间线性复杂度递归机制,显式建模全局光谱依赖与波段连续性;其次引入先验引导三编码器(Prior-Guided Tri-Encoder, PGTE),使实验室光谱先验指导可学习适配器优化而不破坏稳定的语义空间;最后采用自监督伪标签映射策略(Self-Supervised Pseudo-Label Mapping, SSPLM)进行测试时适应,通过不确定性感知采样与双路径一致性约束实现决策边界高效精调。

链接: https://arxiv.org/abs/2604.05562
作者: Luqi Gong,Qixin Xie,Yue Chen,Ziqiang Chen,Fanda Fan,Shuai Zhao,Chao Li
机构: Zhejiang Lab (之江实验室); Beijing University of Posts and Telecommunications (北京邮电大学); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Meta-learning facilitates few-shot hyperspectral target detection (HTD), but adapting deep backbones remains challenging. Full-parameter fine-tuning is inefficient and prone to overfitting, and existing methods largely ignore the frequency-domain structure and spectral band continuity of hyperspectral data, limiting spectral adaptation and cross-domain generalization. To address these challenges, we propose SpecMamba, a parameter-efficient and frequency-aware framework that decouples stable semantic representation from agile spectral adaptation. Specifically, we introduce a Discrete Cosine Transform Mamba Adapter (DCTMA) on top of frozen Transformer representations. By projecting spectral features into the frequency domain via DCT and leveraging Mamba’s linear-complexity state-space recursion, DCTMA explicitly captures global spectral dependencies and band continuity while avoiding the redundancy of full fine-tuning. Furthermore, to address prototype drift caused by limited sample sizes, we design a Prior-Guided Tri-Encoder (PGTE) that allows laboratory spectral priors to guide the optimization of the learnable adapter without disrupting the stable semantic feature space. Finally, a Self-Supervised Pseudo-Label Mapping (SSPLM) strategy is developed for test-time adaptation, enabling efficient decision boundary refinement through uncertainty-aware sampling and dual-path consistency constraints. Extensive experiments on multiple public datasets demonstrate that SpecMamba consistently outperforms state-of-the-art methods in detection accuracy and cross-domain generalization.
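DCTMA 中“将光谱特征经 DCT 投影到频域”这一步,可用如下纯 NumPy 示意说明:沿波段维做正交 DCT-II 变换并截断高频。其中 `keep` 为假设超参数,该草图仅演示频域投影本身,与论文中可学习的 Mamba 适配器结构无关:

```python
import numpy as np

def dct_matrix(n):
    """构造 n×n 的正交 DCT-II 变换矩阵。"""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0] /= np.sqrt(2.0)  # 首行缩放,保证正交归一
    return c

def spectral_dct_filter(x, keep=16):
    """沿光谱波段维投影到 DCT 频域、仅保留低频分量后逆变换。

    x: (H, W, B) 高光谱特征,B 为波段数;keep 为保留的低频系数个数(假设)。
    """
    c = dct_matrix(x.shape[-1])
    coeff = x @ c.T          # 投影到频域
    coeff[..., keep:] = 0.0  # 截断高频,保留刻画波段连续性的低频结构
    return coeff @ c         # 逆变换回光谱域
```

光谱曲线通常平滑,能量集中于低频,因此频域表示能以少量系数刻画摘要所述的“波段连续性”。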

[CV-65] Evaluation Before Generation: A Paradigm for Robust Multimodal Sentiment Analysis with Missing Modalities

【速读】:该论文旨在解决多模态情感分析中的“缺失模态”问题(missing modality problem),即在实际场景中由于某一模态数据缺失导致模型性能下降和泛化能力减弱的问题。现有方法主要依赖提示学习(prompt learning)和预训练模型提升鲁棒性,但存在两个关键局限:一是对生成缺失模态的必要性缺乏严谨评估;二是未充分挖掘多模态提示之间的结构依赖关系及其全局一致性。为此,论文提出基于提示的缺失模态适应框架(Prompt-based Missing Modality Adaptation framework),其核心创新在于三个模块:1)缺失模态评估器(Missing Modality Evaluator)利用预训练模型与伪标签动态判断缺失模态的重要性,避免低质量数据插补;2)模态不变提示解耦模块(Modality-invariant Prompt Disentanglement module)将共享提示分解为模态特有私有提示,以捕捉局部相关性并提升表征质量;3)动态提示加权模块(Dynamic Prompt Weighting module)基于交叉注意力输出计算互信息权重,自适应抑制缺失模态干扰。此外,通过多层次提示动态连接模块(Multi-level Prompt Dynamic Connection module)引入残差连接融合共享提示与自注意力输出,强化全局提示先验对关键特征的引导作用,从而显著提升模型在多种缺失模式下的性能稳定性与准确性。

链接: https://arxiv.org/abs/2604.05558
作者: Rongfei Chen,Tingting Zhang,Xiaoyu Shen,Wei Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures, conference

点击查看摘要

Abstract:The missing modality problem poses a fundamental challenge in multimodal sentiment analysis, significantly degrading model accuracy and generalization in real-world scenarios. Existing approaches primarily improve robustness through prompt learning and pre-trained models. However, two limitations remain. First, the necessity of generating missing modalities lacks rigorous evaluation. Second, the structural dependencies among multimodal prompts and their global coherence are insufficiently explored. To address these issues, a Prompt-based Missing Modality Adaptation framework is proposed. A Missing Modality Evaluator is introduced at the input stage to dynamically assess the importance of missing modalities using pretrained models and pseudo-labels, thereby avoiding low-quality data imputation. Building on this, a Modality-invariant Prompt Disentanglement module decomposes shared prompts into modality-specific private prompts to capture intrinsic local correlations and improve representation quality. In addition, a Dynamic Prompt Weighting module computes mutual-information-based weights from cross-attention outputs to adaptively suppress interference from missing modalities. To enhance global consistency, a Multi-level Prompt Dynamic Connection module integrates shared prompts with self-attention outputs through residual connections, leveraging global prompt priors to strengthen key guidance features. Extensive experiments on three public benchmarks, including CMU-MOSI, CMU-MOSEI, and CH-SIMS, demonstrate that the proposed framework achieves state-of-the-art performance and stable results under diverse missing-modality settings. The implementation is available at this https URL

[CV-66] Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation

【速读】:该论文旨在解决机器人操作中视觉-运动策略学习的鲁棒性问题,特别是在执行过程中遭遇分布外误差或需要动态重规划轨迹时,模型仅依赖原始专家示范进行训练所面临的挑战。其解决方案的关键在于提出了一种闭环式的参考感知视觉-运动策略(Referring-Aware Visuomotor Policy, ReV),通过引入稀疏参考点(referring points)实现即时适应。ReV采用耦合扩散头结构:全局扩散头生成全局一致但时间稀疏的动作锚点序列,并定位参考点在该序列中的精确时间位置;局部扩散头则基于当前时间位置自适应插值相邻锚点以完成特定任务。这一闭环机制使系统能够在每个执行步骤实时重规划轨迹,从而应对场景动态变化。

链接: https://arxiv.org/abs/2604.05544
作者: Jiahua Ma,Yiran Qin,Xin Wen,Yixiong Li,Yuyu Sun,Yulan Guo,Liang Lin,Ruimao Zhang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper addresses a fundamental problem of visuomotor policy learning for robotic manipulation: how to enhance robustness in out-of-distribution execution errors or dynamically re-routing trajectories, where the model relies solely on the original expert demonstrations for training. We introduce the Referring-Aware Visuomotor Policy (ReV), a closed-loop framework that can adapt to unforeseen circumstances by instantly incorporating sparse referring points provided by a human or a high-level reasoning planner. Specifically, ReV leverages the coupled diffusion heads to preserve standard task execution patterns while seamlessly integrating sparse referring via a trajectory-steering strategy. Upon receiving a specific referring point, the global diffusion head firstly generates a sequence of globally consistent yet temporally sparse action anchors, while identifies the precise temporal position for the referring point within this sequence. Subsequently, the local diffusion head adaptively interpolates adjacent anchors based on the current temporal position for specific tasks. This closed-loop process repeats at every execution step, enabling real-time trajectory replanning in response to dynamic changes in the scene. In practice, rather than relying on elaborate annotations, ReV is trained only by applying targeted perturbations to expert demonstrations. Without any additional data or fine-tuning scheme, ReV achieve higher success rates across challenging simulated and real-world tasks.
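摘要中“基于当前时间位置在相邻动作锚点间插值”的过程,可用最简单的线性插值近似示意。论文中的局部扩散头执行的是自适应的学习式插值,下面仅是该思路的假设性简化:

```python
import numpy as np

def interpolate_anchors(anchors, anchor_times, t):
    """在相邻动作锚点之间按时间位置做线性插值的示意。

    anchors: (K, D) 时间稀疏的动作锚点序列
    anchor_times: (K,) 单调递增的锚点时间
    t: 当前执行时间
    """
    j = int(np.searchsorted(anchor_times, t, side="right"))
    j = int(np.clip(j, 1, len(anchor_times) - 1))  # 定位 t 所在的锚点区间
    t0, t1 = anchor_times[j - 1], anchor_times[j]
    w = (t - t0) / (t1 - t0)                       # t 在区间内的相对位置
    return (1 - w) * anchors[j - 1] + w * anchors[j]
```

每个执行步重新定位区间并插值,即可在全局锚点(含新插入的 referring point)发生变化时实时改变后续轨迹,对应摘要所述的闭环重规划。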

[CV-67] EchoAgent: Towards Reliable Echocardiography Interpretation with “Eyes”, “Hands” and “Minds” CVPR2026

【速读】:该论文旨在解决当前 echocardiography (Echo) 分析中因任务特定深度学习方法和多模态大语言模型仅聚焦于“视觉-操作”(eyes-hands)或“视觉-认知”(eyes-minds)单一能力而造成的临床可靠性与实用性受限问题。其解决方案的关键在于提出 EchoAgent,一个面向端到端 Echo 解读的智能体系统,通过三个核心模块实现“视觉-操作-认知”(eyes-hands-minds)协同工作:首先构建基于专家指南的知识引擎以形成定制化的“认知中枢”;其次设计分层协作工具包赋予系统自动解析视频流、识别心脏切面、进行解剖分割与量化测量的能力;最后将感知的多模态证据与专属知识库整合至统一推理枢纽,实现可解释性推断。实验表明,该系统在 CAMUS 和 MIMIC-EchoQA 数据集上达到最高 80.00% 的整体准确率,展现出类人类超声心动图医师的综合能力。

链接: https://arxiv.org/abs/2604.05541
作者: Qin Wang,Zhiqing He,Yu Liu,Bowen Guo,Zeju Li,Miao Zhao,Wenhao Ju,Zhiling Luo,Xianhong Shu,Yi Guo,Yuanyuan Wang
机构: Fudan University (复旦大学); Fudan Zhongshan Hospital (复旦大学附属中山医院); Fuwai Yunnan Hospital (阜外云南医院); Fuwai Beijing Hospital (阜外北京医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026 CV4Clinical, 11 pages, 6 figures

点击查看摘要

Abstract:Reliable interpretation of echocardiography (Echo) is crucial for assessing cardiac function, which demands clinicians to synchronously orchestrate multiple capabilities, including visual observation (eyes), manual measurement (hands), and expert knowledge learning and reasoning (minds). While current task-specific deep-learning approaches and multimodal large language models have demonstrated promise in assisting Echo analysis through automated segmentation or reasoning, they remain focused on restricted skills, i.e., eyes-hands or eyes-minds, thereby limiting clinical reliability and utility. To address these issues, we propose EchoAgent, an agentic system tailored for end-to-end Echo interpretation, which achieves a fully coordinated eyes-hands-minds workflow that learns, observes, operates, and reasons like a cardiac sonographer. First, we introduce an expertise-driven cognition engine where our agent can automatically assimilate credible Echo guidelines into a structured knowledge base, thus constructing an Echo-customized mind. Second, we devise a hierarchical collaboration toolkit to endow EchoAgent with eyes-hands, which can automatically parse Echo video streams, identify cardiac views, perform anatomical segmentation, and quantitative measurement. Third, we integrate the perceived multimodal evidence with the exclusive knowledge base into an orchestrated reasoning hub to conduct explainable inferences. We evaluate EchoAgent on CAMUS and MIMIC-EchoQA datasets, which cover 48 distinct echocardiographic views spanning 14 cardiac anatomical regions. Experimental results show that EchoAgent achieves optimal performance across diverse structure analyses, yielding overall accuracy of up to 80.00%. Importantly, EchoAgent empowers a single system with abilities to learn, observe, operate and reason like an echocardiologist, which holds great promise for reliable Echo interpretation.

[CV-68] Prior-guided Fusion of Multimodal Features for Change Detection from Optical-SAR Images

【速读】:该论文旨在解决多模态遥感(Multimodal Remote Sensing, MRS)图像中变化检测(Multimodal Change Detection, MMCD)方法在跨模态交互能力不足以及难以充分挖掘模态特异性特征的问题,这些问题导致细粒度变化信息建模不充分,进而影响语义级变化的精确检测。其解决方案的关键在于提出STSF-Net框架,通过联合建模模态特异性特征与时空共性特征来增强变化表征:一方面利用模态特异性特征捕捉真实的语义变化信号,另一方面引入时空共性特征以抑制因成像机制差异引发的伪变化;同时设计了一种基于预训练基础模型获得语义先验的光学与SAR特征自适应融合策略,实现语义引导下的多模态信息融合,从而显著提升变化检测精度。

链接: https://arxiv.org/abs/2604.05527
作者: Xuanguang Liu,Lei Ding,Yujie Li,Chenguang Dai,Zhenchao Zhang,Mengmeng Li,Ziyi Yang,Yifan Sun,Yongqi Sun,Hanyun Wang
机构: Institute of Geospatial Information, Information Engineering University, Zhengzhou, China; Key Lab of Spatial Data Mining and Information Sharing of Ministry of Education, Academy of Digital China (Fujian), Fuzhou University, China; School of Electronics and Communication Engineering, Sun Yat-sen University, Shenzhen, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal change detection (MMCD) identifies changed areas in multimodal remote sensing (RS) data, demonstrating significant application value in land use monitoring, disaster assessment, and urban sustainable development. However, literature MMCD approaches exhibit limitations in cross-modal interaction and exploiting modality-specific characteristics. This leads to insufficient modeling of fine-grained change information, thus hindering the precise detection of semantic changes in multimodal data. To address the above problems, we propose STSF-Net, a framework designed for MMCD between optical and SAR images. STSF-Net jointly models modality-specific and spatio-temporal common features to enhance change representations. Specifically, modality-specific features are exploited to capture genuine semantic change signals, while spatio-temporal common features are embedded to suppress pseudo-changes caused by differences in imaging mechanisms. Furthermore, we introduce an optical and SAR feature fusion strategy that adaptively adjusts feature importance based on semantic priors obtained from pre-trained foundational models, enabling semantic-guided adaptive fusion of multi-modal information. In addition, we introduce the Delta-SN6 dataset, the first openly-accessible multiclass MMCD benchmark consisting of very-high-resolution (VHR) fully polarimetric SAR and optical images. Experimental results on Delta-SN6, BRIGHT, and Wuhan-Het datasets demonstrate that our method outperforms the state-of-the-art (SOTA) by 3.21%, 1.08%, and 1.32% in mIoU, respectively. The associated code and Delta-SN6 dataset will be released at: this https URL.

[CV-69] Cross-Resolution Diffusion Models via Network Pruning CVPR

【速读】:该论文旨在解决扩散模型(Diffusion Models)在训练时固定分辨率下表现优异,但在生成非训练分辨率图像时质量显著下降的问题。其核心原因是参数行为具有分辨率依赖性,即某些权重在默认分辨率下有效,但在空间尺度变化时会破坏语义对齐并引发UNet结构的结构性不稳定。解决方案的关键在于提出CR-Diff方法,通过两阶段策略实现跨分辨率视觉一致性提升:首先进行块级剪枝(block-wise pruning),选择性移除不良权重;随后执行剪枝输出放大(pruned output amplification),进一步净化预测结果。该方法在保持默认分辨率性能的同时,显著提升了不同未见分辨率下的感知保真度与语义连贯性,并支持提示词特定优化,实现按需质量增强。

链接: https://arxiv.org/abs/2604.05524
作者: Jiaxuan Ren,Junhan Zhu,Huan Wang
机构: Westlake University (西湖大学); University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR Findings 2026

点击查看摘要

Abstract:Diffusion models have demonstrated impressive image synthesis performance, yet many UNet-based models are trained at certain fixed resolutions. Their quality tends to degrade when generating images at out-of-training resolutions. We trace this issue to resolution-dependent parameter behaviors, where weights that function well at the default resolution can become adverse when spatial scales shift, weakening semantic alignment and causing structural instability in the UNet architecture. Based on this analysis, this paper introduces CR-Diff, a novel method that improves the cross-resolution visual consistency by pruning some parameters of the diffusion model. Specifically, CR-Diff has two stages. It first performs block-wise pruning to selectively eliminate adverse weights. Then, a pruned output amplification is conducted to further purify the pruned predictions. Empirically, extensive experiments suggest that CR-Diff can improve perceptual fidelity and semantic coherence across various diffusion backbones and unseen resolutions, while largely preserving the performance at default resolutions. Additionally, CR-Diff supports prompt-specific refinement, enabling quality enhancement on demand.

[CV-70] Geometrical Cross-Attention and Nonvoid Voxelization for Efficient 3D Medical Image Segmentation

【速读】:该论文旨在解决3D医学影像分割中难以同时实现高精度与计算效率的问题,尤其是在不同解剖结构和成像模态下的泛化能力不足。其核心解决方案是提出GCNV-Net框架,关键创新在于三个模块的协同设计:一是三向动态非空体素Transformer(Tri-directional Dynamic Nonvoid Voxel Transformer, 3DNVT),通过沿横断面、矢状面和冠状面动态划分相关体素,有效建模复杂的三维空间依赖关系;二是几何交叉注意力机制(Geometrical Cross-Attention, GCA),在多尺度特征融合过程中显式引入几何位置信息,提升细粒度解剖结构的分割精度;三是非空体素化(Nonvoid Voxelization),仅处理信息丰富的区域,显著减少冗余计算,在保持分割质量的同时实现56.13%的浮点运算量(FLOPs)下降和68.49%的推理延迟降低。该方法在多个主流基准数据集上均达到当前最优性能,验证了其在准确性与效率间的良好平衡及临床部署潜力。

链接: https://arxiv.org/abs/2604.05515
作者: Chenxin Yuan,Shoupeng Chen,Haojiang Ye,Yiming Miao,Limei Peng,Pin-Han Ho
机构: University of Electronic Science and Technology of China (电子科技大学); The Chinese University of Hong Kong (香港中文大学); Korea University (高丽大学); University of Waterloo (滑铁卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 13 figures, supplementary material included, submitted to Medical Image Analysis

点击查看摘要

Abstract:Accurate segmentation of 3D medical scans is crucial for clinical diagnostics and treatment planning, yet existing methods often fail to achieve both high accuracy and computational efficiency across diverse anatomies and imaging modalities. To address these challenges, we propose GCNV-Net, a novel 3D medical segmentation framework that integrates a Tri-directional Dynamic Nonvoid Voxel Transformer (3DNVT), a Geometrical Cross-Attention module (GCA), and Nonvoid Voxelization. The 3DNVT dynamically partitions relevant voxels along the three orthogonal anatomical planes, namely the transverse, sagittal, and coronal planes, enabling effective modeling of complex 3D spatial dependencies. The GCA mechanism explicitly incorporates geometric positional information during multi-scale feature fusion, significantly enhancing fine-grained anatomical segmentation accuracy. Meanwhile, Nonvoid Voxelization processes only informative regions, greatly reducing redundant computation without compromising segmentation quality, and achieves a 56.13% reduction in FLOPs and a 68.49% reduction in inference latency compared to conventional voxelization. We evaluate GCNV-Net on multiple widely used benchmarks: BraTS2021, ACDC, MSD Prostate, MSD Pancreas, and AMOS2022. Our method achieves state-of-the-art segmentation performance across all datasets, outperforming the best existing methods by 0.65% on Dice, 0.63% on IoU, 1% on NSD, and relatively 14.5% on HD95. All results demonstrate that GCNV-Net effectively balances accuracy and efficiency, and its robustness across diverse organs, disease conditions, and imaging modalities highlights strong potential for clinical deployment.
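摘要中“仅处理信息丰富区域”的非空体素化(Nonvoid Voxelization)可以示意为:提取非空体素的坐标与取值,并统计被跳过体素的占比。`threshold` 为假设参数,实际的“非空”判定标准以论文为准:

```python
import numpy as np

def nonvoid_voxelize(volume, threshold=0.0):
    """非空体素化示意:只保留信息区域的体素坐标与取值。

    volume: (D, H, W) 3D 医学影像体数据
    threshold: 判定体素为“非空”的强度阈值(假设)
    """
    mask = volume > threshold
    coords = np.argwhere(mask)    # (M, 3) 非空体素的坐标
    values = volume[mask]         # (M,)   对应的体素值
    savings = 1.0 - mask.mean()   # 被跳过的体素占比,近似可节省的计算量
    return coords, values, savings
```

医学影像中背景往往占据绝大部分体素,后续网络只在 `coords`/`values` 的稀疏集合上运算,即可在不丢失前景信息的前提下大幅削减冗余计算,与摘要报告的 FLOPs 与延迟下降方向一致。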

[CV-71] Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality CVPR2026

【速读】:该论文旨在解决增强现实(Augmented Reality, AR)环境中虚拟内容篡改与矛盾引发的安全与可靠性问题,特别是恶意或不一致的虚拟元素对用户造成的误导风险。解决方案的关键在于提出ContrAR,一个新颖的基准测试集,用于系统评估视觉语言模型(Vision-Language Models, VLMs)在AR场景下对虚拟内容操纵和语义矛盾的鲁棒性;该基准包含312个经10名人类参与者验证的真实世界AR视频,从而为VLM在对抗性虚拟内容检测与推理任务上的性能提供客观评估依据。

链接: https://arxiv.org/abs/2604.05510
作者: Yanming Xiu,Zhengayuan Jiang,Neil Zhenqiang Gong,Maria Gorlatova
机构: Duke University (杜克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Findings

点击查看摘要

Abstract:Augmented reality (AR) has rapidly expanded over the past decade. As AR becomes increasingly integrated into daily life, its security and reliability emerge as critical challenges. Among various threats, contradictory virtual content attacks, where malicious or inconsistent virtual elements are introduced into the user’s view, pose a unique risk by misleading users, creating semantic confusion, or delivering harmful information. In this work, we systematically model such attacks and present ContrAR, a novel benchmark for evaluating the robustness of vision-language models (VLMs) against virtual content manipulation and contradiction in AR. ContrAR contains 312 real-world AR videos validated by 10 human participants. We further benchmark 11 VLMs, including both commercial and open-source models. Experimental results reveal that while current VLMs exhibit reasonable understanding of contradictory virtual content, room still remains for improvement in detecting and reasoning about adversarial content manipulations in AR environments. Moreover, balancing detection accuracy and latency remains challenging.

[CV-72] CLIP-Guided Data Augmentation for Night-Time Image Dehazing

【速读】:该论文旨在解决夜间图像去雾(nighttime image dehazing)中因雾霾散射与低光照、非均匀照明及强光干扰耦合导致的退化模式更为复杂的问题。在有限监督条件下,这种复杂性加剧了域偏移(domain drift)和训练不稳定性,因为目标域样本稀缺,而直接引入外部数据可能因分布不匹配削弱适应效果。解决方案的关键在于构建一个统一框架,包含域对齐的数据构造、分阶段训练和推理时增强三部分:首先利用预训练的CLIP视觉编码器通过相似性筛选外部样本,生成更贴近目标域的训练数据;其次采用两阶段训练策略,先使NAFNet适应目标域,再扩展至更广泛的退化模式;最后在推理时结合TLC、x8自集成(self-ensemble)和加权快照融合(weighted snapshot fusion)提升输出稳定性。该方法无需复杂的网络结构重设计,提供了一条实用且高效的夜间图像去雾流程。

链接: https://arxiv.org/abs/2604.05500
作者: Xining Ge,Weijun Yuan,Gengjia Chang,Xuyang Li,Shuhong Liu
机构: Hangzhou Dianzi University (杭州电子科技大学); Jinan University (暨南大学); Hefei University of Technology (合肥工业大学); Wuhan University (武汉大学); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Nighttime image dehazing faces a more complex degradation pattern than its daytime counterpart, as haze scattering couples with low illumination, non-uniform lighting, and strong light interference. Under limited supervision, this complexity aggravates domain drift and training instability, since target-domain samples are scarce while naively introducing external data may weaken adaptation due to distribution mismatch. This paper presents our solution to the NTIRE 2026 Night Time Image Dehazing Challenge, built as a unified framework that integrates domain-aligned data construction, stage-wise training, and inference-time enhancement. Specifically, a pre-trained CLIP visual encoder screens candidate external samples by similarity to construct training data closer to the target domain. NAFNet is then trained in two stages, first adapting to the target domain and then expanding to broader degradation patterns. At inference time, TLC, x8 self-ensemble, and weighted snapshot fusion are combined to improve output stability. Rather than relying on complex network redesign, the proposed framework offers a practical and effective pipeline for nighttime image dehazing.
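摘要中“利用预训练 CLIP 视觉编码器按与目标域的相似度筛选外部样本”这一步可示意如下。此处假设 CLIP 图像特征已预先提取完毕(以矩阵形式给出),`top_ratio` 为假设的保留比例,均非论文的既定设置:

```python
import numpy as np

def screen_by_similarity(ext_feats, target_feats, top_ratio=0.3):
    """按与目标域特征的余弦相似度筛选外部候选样本的示意。

    ext_feats:    (N, D) 外部候选样本的 CLIP 图像特征(假设已提取)
    target_feats: (M, D) 目标域样本的 CLIP 图像特征
    top_ratio:    保留比例(假设超参数)
    """
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sim = l2norm(ext_feats) @ l2norm(target_feats).T  # (N, M) 余弦相似度
    score = sim.max(axis=1)          # 每个候选与最相近目标样本的相似度
    k = max(1, int(len(score) * top_ratio))
    return np.argsort(-score)[:k]    # 相似度最高的 k 个候选的索引
```

筛选出的子集在特征空间上更贴近目标域,可缓解直接引入外部数据带来的分布不匹配,对应摘要所述的域对齐数据构造。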

[CV-73] Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models CVPR2026

【速读】:该论文旨在解决扩散式多模态大语言模型(diffusion multimodal large language models, dMLLMs)在结合思维链(Chain-of-Thought, CoT)推理时存在的两个关键问题:一是模型在早期扩散步骤中过早生成最终答案token,导致推理过程不充分;二是模型在初始阶段对视觉提示依赖性极低,表现出与自回归视觉-语言模型截然不同的视觉信息利用模式,从而削弱了视觉 grounding 能力。为应对上述挑战,论文提出两种核心机制:Position and Step Penalty (PSP) 通过在早期 timestep 对靠后位置的 token 施加惩罚,抑制过早输出并引导逐步推理;Visual Reasoning Guidance (VRG) 借鉴无分类器指导(classifier-free guidance)思想,增强视觉信号的引导强度,提升模型对视觉证据的对齐能力。实验表明,该方法在多个 dMLLM 上可实现最高达 7.5% 的准确率提升,并且推理速度超过基于四倍扩散步数的自回归方法三倍以上。

链接: https://arxiv.org/abs/2604.05497
作者: Keuntae Kim,Mingyu Kang,Yong Suk Choi
机构: Hanyang University (汉阳大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 - main

点击查看摘要

Abstract:Diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive (AR) LLMs. Recently, this paradigm has been extended to multimodal tasks, leading to the development of diffusion multimodal large language models (dMLLMs). These models are expected to retain the reasoning capabilities of LLMs while enabling faster inference through parallel generation. However, when combined with Chain-of-Thought (CoT) reasoning, dMLLMs exhibit two critical issues. First, we observe that dMLLMs often generate the final answer token at a very early timestep. This trend indicates that the model determines the answer before sufficient reasoning, leading to degraded reasoning performance. Second, during the initial timesteps, dMLLMs show minimal dependency on visual prompts, exhibiting a fundamentally different pattern of visual information utilization compared to AR vision-language models. In summary, these findings indicate that dMLLMs tend to generate premature final answers without sufficiently grounding on visual inputs. To address these limitations, we propose Position and Step Penalty (PSP) and Visual Reasoning Guidance (VRG). PSP penalizes tokens in later positions during early timesteps, delaying premature answer generation and encouraging progressive reasoning across timesteps. VRG, inspired by classifier-free guidance, amplifies visual grounding signals to enhance the model’s alignment with visual evidence. Extensive experiments across various dMLLMs demonstrate that our method achieves up to 7.5% higher accuracy while delivering more than 3x speedup compared to reasoning with four times more diffusion steps.
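摘要中 PSP 与 VRG 两个机制可以用一个数值草图直观说明:PSP 在早期 timestep 对靠后位置的 logits 施加随位置递增的惩罚,VRG 则按 classifier-free guidance 的形式放大"有视觉条件"与"无视觉条件"预测之差。以下仅为演示性示意,惩罚形式、系数与函数名均为本文假设,并非论文原式:

```python
import numpy as np

def psp_penalty(logits, t, T, alpha=2.0):
    """Position and Step Penalty 示意:t 越小(扩散早期),
    对序列中靠后位置的 logits 惩罚越大,抑制过早生成答案 token。"""
    L = logits.shape[0]                      # 序列长度
    pos = np.arange(L) / max(L - 1, 1)       # 归一化位置 0..1
    strength = alpha * (1.0 - t / T)         # 早期 timestep 惩罚更强
    return logits - strength * pos[:, None]

def vrg_guidance(cond_logits, uncond_logits, s=1.5):
    """Visual Reasoning Guidance 示意:按 classifier-free guidance
    放大视觉条件预测与无条件预测之差,强化视觉 grounding。"""
    return uncond_logits + s * (cond_logits - uncond_logits)

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 5))             # 8 个位置 x 5 个候选词
early = psp_penalty(logits, t=1, T=10)       # 早期:末位 token 被强压制
late = psp_penalty(logits, t=9, T=10)        # 后期:惩罚几乎消失
print(float((logits - early)[-1].mean()), float((logits - late)[-1].mean()))
```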

[CV-74] A Weak-Signal-Aware Framework for Subsurface Defect Detection: Mechanisms for Enhancing Low-SCR Hyperbolic Signatures IJCNN

【速读】:该论文旨在解决地下缺陷检测中基于探地雷达(Ground Penetrating Radar, GPR)的弱信号识别难题,具体表现为微弱散射双曲线信号的信噪比低、波场相似性高及几何结构退化等问题。现有轻量级检测模型因过度追求计算效率而忽视了对低频结构的保留和异质杂波的解耦,导致漏检率较高。其解决方案的关键在于提出WSA-Net框架,通过四个核心机制实现信号感知增强:利用部分卷积(partial convolutions)进行信号保真;借助异质分组注意力(heterogeneous grouping attention)抑制杂波;通过几何重建提升双曲线弧的锐度;以及引入上下文锚定(context anchoring)以消除语义歧义。该设计在保持极低参数量(2.412 M)的同时显著提升了检测精度(mAP@0.5达0.6958)与推理速度(164 FPS),验证了以信号为中心的设计理念在基础设施检测中的有效性。

链接: https://arxiv.org/abs/2604.05490
作者: Wenbo Zhang,Zekun Long,Zican Liu,Yangchen Zeng,Keyi Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 7 figures, 5 tables. Accepted by International Joint Conference on Neural Networks (IJCNN)

点击查看摘要

Abstract:Subsurface defect detection via Ground Penetrating Radar is challenged by “weak signals”: faint diffraction hyperbolas with low signal-to-clutter ratios, high wavefield similarity, and geometric degradation. Existing lightweight detectors prioritize efficiency over sensitivity, failing to preserve low-frequency structures or decouple heterogeneous clutter. We propose WSA-Net, a framework designed to enhance faint signatures through physical-feature reconstruction. Moving beyond simple parameter reduction, WSA-Net integrates four mechanisms: signal preservation using partial convolutions; clutter suppression via heterogeneous grouping attention; geometric reconstruction to sharpen hyperbolic arcs; and context anchoring to resolve semantic ambiguities. Evaluations on the RTST dataset show WSA-Net achieves 0.6958 mAP@0.5 and 164 FPS with only 2.412 M parameters. Results prove that signal-centric awareness in lightweight architectures effectively reduces false negatives in infrastructure inspection.

[CV-75] CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment

【速读】:该论文旨在解决多智能体具身系统(multi-agent embodied systems)在复杂协作操作中面临的三大核心挑战:空间协调、时间推理以及共享工作空间意识。传统方法难以实现高效且安全的多机器人协同作业,尤其在真实环境中存在高风险与低可重复性问题。解决方案的关键在于提出“组合式环境”(compositional environment)这一新概念,即通过融合现实世界与仿真组件,使多个机器人代理能够在统一的决策空间中感知彼此意图并协同操作。基于此理念,作者构建了CoEnv框架,其核心机制包括:1)从真实场景到仿真的重建(real-to-sim scene reconstruction),实现物理空间数字化;2)基于视觉语言模型(VLM)的动作合成(action synthesis),支持高阶接口实时规划与代码驱动轨迹迭代优化;3)经验证的仿真到现实迁移(validated sim-to-real transfer),结合碰撞检测保障部署安全性。实验证明该方案显著提升了任务成功率和执行效率,为多智能体具身人工智能提供了新的范式。

链接: https://arxiv.org/abs/2604.05484
作者: Li Kang,Yutao Fan,Rui Li,Heng Zhou,Yiran Qin,Zhemeng Zhang,Songtao Huang,Xiufeng Song,Zaibin Zhang,Bruno N.Y. Chen,Zhenfei Yin,Dongzhan Zhou,Wangmeng Zuo,Lei Bai
机构: Zhejiang University (浙江大学); Tsinghua University (清华大学); Shanghai Jiao Tong University (上海交通大学); National University of Singapore (新加坡国立大学); University of California, Berkeley (加州大学伯克利分校); Stanford University (斯坦福大学); MIT (麻省理工学院); ETH Zurich (苏黎世联邦理工学院); University of Oxford (牛津大学); University of Tokyo (东京大学); Google (谷歌); Meta (Meta); Stability.AI (Stability.AI); Anthropic (Anthropic); Character.ai (Character.ai); Claude (Claude)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 8 figures, including supplementary material. Project page: this https URL

点击查看摘要

Abstract:Multi-agent embodied systems hold promise for complex collaborative manipulation, yet face critical challenges in spatial coordination, temporal reasoning, and shared workspace awareness. Inspired by human collaboration where cognitive planning occurs separately from physical execution, we introduce the concept of compositional environment – a synergistic integration of real-world and simulation components that enables multiple robotic agents to perceive intentions and operate within a unified decision-making space. Building on this concept, we present CoEnv, a framework that leverages simulation for safe strategy exploration while ensuring reliable real-world deployment. CoEnv operates through three stages: real-to-sim scene reconstruction that digitizes physical workspaces, VLM-driven action synthesis supporting both real-time planning with high-level interfaces and iterative planning with code-based trajectory generation, and validated sim-to-real transfer with collision detection for safe deployment. Extensive experiments on challenging multi-arm manipulation benchmarks demonstrate CoEnv’s effectiveness in achieving high task success rates and execution efficiency, establishing a new paradigm for multi-agent embodied AI.

[CV-76] Unifying VLM-Guided Flow Matching and Spectral Anomaly Detection for Interpretable Veterinary Diagnosis

【速读】:该论文旨在解决犬类气胸(pneumothorax)自动诊断中因数据稀缺和模型可信度不足带来的挑战。其解决方案的关键在于提出一种协同式诊断范式,将任务重构为信号定位与谱检测的联合过程:首先利用视觉语言模型(Vision-Language Model, VLM)引导迭代流匹配(Flow Matching)以实现高保真边界精度的分割;随后基于分割掩膜提取病灶特征,并采用随机矩阵理论(Random Matrix Theory, RMT)进行统计分析——通过将健康组织建模为可预测的随机噪声,识别出显著偏离的特征值作为病理信号,从而实现高灵敏度、可解释的诊断。该方法的核心优势在于生成式分割与第一性原理统计分析之间的协同效应,显著提升了诊断准确性与可靠性。

链接: https://arxiv.org/abs/2604.05482
作者: Pu Wang,Zhixuan Mao,Jialu Li,Zhuoran Zheng,Dianjie Lu,Youshan Zhang
机构: Shandong University (山东大学); Shenzhen Loop Area Institute (深圳 loop 区域研究所); Yeshiva University (叶希瓦大学); Qilu University of Technology (齐鲁工业大学); Shandong Normal University (山东师范大学); Chuzhou University (滁州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automatic diagnosis of canine pneumothorax is challenged by data scarcity and the need for trustworthy models. To address this, we first introduce a public, pixel-level annotated dataset to facilitate research. We then propose a novel diagnostic paradigm that reframes the task as a synergistic process of signal localization and spectral detection. For localization, our method employs a Vision-Language Model (VLM) to guide an iterative Flow Matching process, which progressively refines segmentation masks to achieve superior boundary accuracy. For detection, the segmented mask is used to isolate features from the suspected lesion. We then apply Random Matrix Theory (RMT), a departure from traditional classifiers, to analyze these features. This approach models healthy tissue as predictable random noise and identifies pneumothorax by detecting statistically significant outlier eigenvalues that represent a non-random pathological signal. The high-fidelity localization from Flow Matching is crucial for purifying the signal, thus maximizing the sensitivity of our RMT detector. This synergy of generative segmentation and first-principles statistical analysis yields a highly accurate and interpretable diagnostic system (source code is available at: this https URL).
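摘要中"将健康组织建模为随机噪声、以显著离群特征值指示病理信号"的 RMT 思路,可以用 Marchenko–Pastur 上界给出一个示意:纯噪声协方差谱应落在上界 λ+ = σ²(1+√γ)² 之内,超出者视为非随机信号。以下为本文的演示草图(阈值、维度与注入方式均为假设):

```python
import numpy as np

def mp_upper_edge(n_samples, n_features, sigma2=1.0):
    """Marchenko–Pastur 分布上界 λ+ = σ²(1+√γ)²,γ = p/n。"""
    gamma = n_features / n_samples
    return sigma2 * (1.0 + np.sqrt(gamma)) ** 2

def detect_outlier_eigs(X):
    """对特征矩阵 X (n x p) 做协方差谱分析,
    返回超出 MP 上界的离群特征值(示意版异常检测)。"""
    n, p = X.shape
    cov = X.T @ X / n
    eigs = np.linalg.eigvalsh(cov)
    return eigs[eigs > mp_upper_edge(n, p)]

rng = np.random.default_rng(0)
noise = rng.normal(size=(500, 50))           # "健康组织":纯随机噪声
signal = noise.copy()
signal[:, 0] += 5.0 * rng.normal(size=500)   # 注入一个强方向模拟病理信号
print(len(detect_outlier_eigs(noise)), len(detect_outlier_eigs(signal)))
```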

[CV-77] A Synthetic Eye Movement Dataset for Script Reading Detection: Real Trajectory Replay on a 3D Simulator

【速读】:该论文旨在解决行为数据(尤其是眼动轨迹)在训练视觉-语言模型时面临的稀缺性、高标注成本和隐私敏感问题,其核心挑战在于缺乏大规模、自动标注的视频行为数据。解决方案的关键在于构建一个基于3D眼动模拟器的合成数据生成流水线:通过从参考视频中提取真实人类虹膜轨迹,并利用无头浏览器自动化技术将其重放至3D眼动模拟器中,从而生成大规模、带标签的眼动视频数据。该方法实现了对源数据时间动态的高度保留(KS距离 ≤ 0.14),并首次揭示了当前模拟器在读写尺度运动上的有限敏感性源于缺乏头部协同运动,为未来行为仿真系统的设计提供了重要依据。

链接: https://arxiv.org/abs/2604.05475
作者: Kidus Zewde,Yuchen Zhou,Dennis Ng,Neo Tiangratanakul,Tommy Duong,Ankit Raj,Yuxin Zhang,Xingyu Shen,Simiao Ren
机构: SCAM (Stanford Center for AI and Machine Learning)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Synthetic eye movement dataset generation via 3D eye simulator; iris trajectory replay; script reading detection; behavioral data augmentation

点击查看摘要

Abstract:Large vision-language models have achieved remarkable capabilities by training on massive internet-scale data, yet a fundamental asymmetry persists: while LLMs can leverage self-supervised pretraining on abundant text and image data, the same is not true for many behavioral modalities. Video-based behavioral data – gestures, eye movements, social signals – remains scarce, expensive to annotate, and privacy-sensitive. A promising alternative is simulation: replace real data collection with controlled synthetic generation to produce automatically labeled data at scale. We introduce infrastructure for this paradigm applied to eye movement, a behavioral signal with applications across vision-language modeling, virtual reality, robotics, accessibility systems, and cognitive science. We present a pipeline for generating synthetic labeled eye movement video by extracting real human iris trajectories from reference videos and replaying them on a 3D eye movement simulator via headless browser automation. Applying this to the task of script-reading detection during video interviews, we release final_dataset_v1: 144 sessions (72 reading, 72 conversation) totaling 12 hours of synthetic eye movement video at 25fps. Evaluation shows that generated trajectories preserve the temporal dynamics of the source data (KS D ≤ 0.14 across all metrics). A matched frame-by-frame comparison reveals that the 3D simulator exhibits bounded sensitivity at reading-scale movements, attributable to the absence of coupled head movement – a finding that informs future simulator design. The pipeline, dataset, and evaluation tools are released to support downstream behavioral classifier development at the intersection of behavioral modeling and vision-language systems.
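论文用双样本 Kolmogorov–Smirnov 统计量 D 衡量合成轨迹与源轨迹在各项指标分布上的差异。D 即两个经验累积分布函数之间的最大垂直距离,下面是一个纯 numpy 的极简实现(仅为示意,数据为随机生成):

```python
import numpy as np

def ks_statistic(a, b):
    """双样本 Kolmogorov–Smirnov 统计量 D:
    两个经验累积分布函数 (ECDF) 之间的最大垂直距离。"""
    a, b = np.sort(np.asarray(a)), np.sort(np.asarray(b))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=2000)        # 源轨迹的某项指标分布
replay = rng.normal(0.0, 1.0, size=2000)     # 重放轨迹:同分布,D 应很小
shifted = rng.normal(1.0, 1.0, size=2000)    # 明显偏移的分布,D 应很大
print(ks_statistic(src, replay), ks_statistic(src, shifted))
```

D 越小说明重放轨迹对源数据时间动态的保留越好,这正是论文报告 D ≤ 0.14 的含义。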

[CV-78] Not All Agents Matter: From Global Attention Dilution to Risk-Prioritized Game Planning

【速读】:该论文旨在解决现有端到端自动驾驶模型在处理多智能体交互时存在的问题,即这些模型通常将所有代理(agent)视为同等重要,从而难以区分真实的碰撞威胁与复杂的背景环境,导致决策安全性不足。其解决方案的关键在于提出一种名为“风险优先博弈规划”(Risk-Prioritized Game Planning)的新范式,并构建了GameAD框架,该框架通过引入风险感知拓扑锚定(Risk-Aware Topology Anchoring)、战略载荷适配器(Strategic Payload Adapter)、最小最大风险感知稀疏注意力机制(Minimax Risk-Aware Sparse Attention)以及风险一致性均衡稳定化(Risk Consistent Equilibrium Stabilization),实现了基于博弈论的风险优先级交互决策。此外,论文还提出了“规划风险暴露度量”(Planning Risk Exposure),用于量化长时间轨迹规划中的累积风险强度,从而显著提升自动驾驶系统的安全性表现。

链接: https://arxiv.org/abs/2604.05449
作者: Kang Ding,Hongsong Wang,Jie Gui,Lei He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 5 figures

点击查看摘要

Abstract:The core of end-to-end autonomous driving resides not in the integration of perception and planning, but rather in the dynamic multi-agent game within a unified representation space. Most existing end-to-end models treat all agents equally, hindering the decoupling of real collision threats from complex backgrounds. To address this issue, we introduce the concept of Risk-Prioritized Game Planning, and propose GameAD, a novel framework that models end-to-end autonomous driving as a risk-aware game problem. GameAD integrates Risk-Aware Topology Anchoring, Strategic Payload Adapter, Minimax Risk-Aware Sparse Attention, and Risk Consistent Equilibrium Stabilization to enable game theoretic decision making with risk prioritized interactions. We also present the Planning Risk Exposure metric, which quantifies the cumulative risk intensity of planned trajectories over a long horizon for safe autonomous driving. Extensive experiments on the nuScenes and Bench2Drive datasets show that our approach significantly outperforms state-of-the-art methods, especially in terms of trajectory safety.

[CV-79] Human Interaction-Aware 3D Reconstruction from a Single Image CVPR2026

【速读】:该论文旨在解决从单张图像中重建具有纹理的多人交互场景3D人体模型的问题,现有方法多局限于单人场景,在多人场景下易产生不合理的重叠、遮挡区域几何缺失及交互关系失真等伪影。其解决方案的关键在于提出一个整体框架HUG3D,通过显式建模群体级(group-level)与实例级(instance-level)信息来增强重建的物理合理性与细节保真度:首先将输入图像转换至正交归一化空间以缓解透视畸变;随后利用Human Group-Instance Multi-View Diffusion(HUG-MVD)模块联合建模个体与群体上下文,生成完整多视角法向量和图像以解决遮挡与邻近关系问题;再通过Human Group-Instance Geometric Reconstruction(HUG-GR)模块引入基于物理的交互先验,强制实现人与人之间的接触合理性;最终融合多视角图像获得高保真纹理。此三阶段协同机制使HUG3D在多人交互场景中显著优于单一人类及现有多人重建方法。

链接: https://arxiv.org/abs/2604.05436
作者: Gwanghyun Kim,Junghun James Kim,Suh Yoon Jeon,Jason Park,Se Young Chun
机构: Seoul National University (首尔国立大学); IPAI; INMC
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Reconstructing textured 3D human models from a single image is fundamental for AR/VR and digital human applications. However, existing methods mostly focus on single individuals and thus fail in multi-human scenes, where naive composition of individual reconstructions often leads to artifacts such as unrealistic overlaps, missing geometry in occluded regions, and distorted interactions. These limitations highlight the need for approaches that incorporate group-level context and interaction priors. We introduce a holistic method that explicitly models both group- and instance-level information. To mitigate perspective-induced geometric distortions, we first transform the input into a canonical orthographic space. Our primary component, Human Group-Instance Multi-View Diffusion (HUG-MVD), then generates complete multi-view normals and images by jointly modeling individuals and group context to resolve occlusions and proximity. Subsequently, the Human Group-Instance Geometric Reconstruction (HUG-GR) module optimizes the geometry by leveraging explicit, physics-based interaction priors to enforce physical plausibility and accurately model inter-human contact. Finally, the multi-view images are fused into a high-fidelity texture. Together, these components form our complete framework, HUG3D. Extensive experiments show that HUG3D significantly outperforms both single-human and existing multi-human methods, producing physically plausible, high-fidelity 3D reconstructions of interacting people from a single image. Project page: this https URL

[CV-80] Few-Shot Semantic Segmentation Meets SAM3

【速读】:该论文旨在解决少样本语义分割(Few-Shot Semantic Segmentation, FSS)中现有方法依赖大量周期性训练(episodic training)所带来的计算开销大和对分布偏移敏感的问题。其解决方案的关键在于利用现代视觉基础模型Segment Anything Model 3 (SAM3) 的可提示概念分割(Promptable Concept Segmentation, PCS)能力,通过一种简单的空间拼接策略将支持图像(support)与查询图像(query)放置在同一画布中,使完全冻结的SAM3无需任何微调或结构修改即可完成分割任务。这一设计实现了无需训练的高效少样本分割,并在PASCAL-5^i和COCO-20^i数据集上达到当前最优性能。

链接: https://arxiv.org/abs/2604.05433
作者: Yi-Jen Tsai,Yen-Yu Lin,Chien-Yao Wang
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); Institute of Information Science, Academia Sinica, Taiwan (中央研究院资讯科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 3 figures

点击查看摘要

Abstract:Few-Shot Semantic Segmentation (FSS) focuses on segmenting novel object categories from only a handful of annotated examples. Most existing approaches rely on extensive episodic training to learn transferable representations, which is both computationally demanding and sensitive to distribution shifts. In this work, we revisit FSS from the perspective of modern vision foundation models and explore the potential of Segment Anything Model 3 (SAM3) as a training-free solution. By repurposing its Promptable Concept Segmentation (PCS) capability, we adopt a simple spatial concatenation strategy that places support and query images into a shared canvas, allowing a fully frozen SAM3 to perform segmentation without any fine-tuning or architectural changes. Experiments on PASCAL-5^i and COCO-20^i show that this minimal design already achieves state-of-the-art performance, outperforming many heavily engineered methods. Beyond empirical gains, we uncover that negative prompts can be counterproductive in few-shot settings, where they often weaken target representations and lead to prediction collapse despite their intended role in suppressing distractors. These findings suggest that strong cross-image reasoning can emerge from simple spatial formulations, while also highlighting limitations in how current foundation models handle conflicting prompt signals. Code at: this https URL
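摘要中的"空间拼接策略"本身非常简单:把 support 图与 query 图放进同一画布,再整体送入冻结的 SAM3。下面用 numpy 给出一个示意(画布布局、留白宽度等细节均为本文假设,论文未规定具体参数):

```python
import numpy as np

def make_shared_canvas(support, query, pad=8):
    """将 support 图与 query 图水平拼接到同一画布,
    中间留 pad 像素空白,供冻结的分割模型一次性处理(示意)。"""
    h = max(support.shape[0], query.shape[0])
    def pad_to_h(img):
        out = np.zeros((h, img.shape[1], 3), dtype=img.dtype)
        out[: img.shape[0]] = img
        return out
    gap = np.zeros((h, pad, 3), dtype=support.dtype)
    return np.concatenate([pad_to_h(support), gap, pad_to_h(query)], axis=1)

support = np.ones((64, 48, 3), dtype=np.uint8) * 255   # 支持图(占位)
query = np.ones((80, 60, 3), dtype=np.uint8) * 128     # 查询图(占位)
canvas = make_shared_canvas(support, query)
print(canvas.shape)
```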

[CV-81] Cross-Stage Attention Propagation for Efficient Semantic Segmentation

【速读】:该论文旨在解决轻量级语义分割模型中多尺度解码器存在的计算冗余问题,即当前方法在不同特征尺度上独立计算注意力机制,导致各尺度间注意力分布高度相关但重复计算,造成不必要的浮点运算开销。其解决方案的关键在于提出跨阶段注意力传播(Cross-Stage Attention Propagation, CSAP)框架:仅在最深层特征尺度上计算注意力映射,并将该结果直接传播至浅层阶段,从而跳过浅层阶段的查询-键(query-key)计算过程。此设计在保持多尺度上下文推理能力的同时,显著降低了解码器的计算复杂度。

链接: https://arxiv.org/abs/2604.05431
作者: Beoungwoo Kang
机构: Hyundai Mobis(现代摩比斯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 6 figures

点击查看摘要

Abstract:Recent lightweight semantic segmentation methods have made significant progress by combining compact backbones with efficient decoder heads. However, most multi-scale decoders compute attention independently at each feature scale, introducing substantial redundancy since the resulting attention distributions across scales are strongly correlated. We propose Cross-Stage Attention Propagation (CSAP), a decoder framework that computes attention at the deepest feature scale and propagates the resulting attention maps to shallower stages, bypassing query-key computation at those stages entirely. This design preserves multi-scale contextual reasoning while substantially reducing the decoder’s computational cost. CSAP-Tiny achieves 42.9% mIoU on ADE20K with only 5.5 GFLOPs, 80.5% on Cityscapes with 21.5 GFLOPs, and 40.9% on COCO-Stuff 164K with 5.5 GFLOPs, surpassing SegNeXt-Tiny by +1.8% on ADE20K while requiring 16.8% fewer floating-point operations.
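CSAP 的核心操作是"只在最深层尺度算一次注意力,再把注意力图传播到浅层、跳过浅层的 query-key 计算"。以下是这一思想的极简草图(单头、一维 token 序列,上采样用最近邻重复,均为本文演示假设):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def deep_attention(q, k):
    """仅在最深层尺度计算注意力图 (n_deep x n_deep)。"""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d))

def propagate_attention(attn, factor):
    """CSAP 思想示意:把深层注意力图上采样(此处用最近邻重复)
    传播到浅层,跳过浅层的 query-key 计算;再按行归一化保持权重和为 1。"""
    up = np.repeat(np.repeat(attn, factor, axis=0), factor, axis=1)
    return up / up.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 32))                # 深层:16 个空间 token
k = rng.normal(size=(16, 32))
attn = deep_attention(q, k)
shallow_attn = propagate_attention(attn, factor=2)   # 浅层:32 个 token
v_shallow = rng.normal(size=(32, 8))
out = shallow_attn @ v_shallow               # 浅层特征聚合,无需重新算 QK
print(attn.shape, shallow_attn.shape, out.shape)
```

浅层因此省去了 QK^T 的浮点开销,这正是摘要中计算量下降的来源。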

[CV-82] VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG ACL2026

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理长视频时因上下文窗口有限而导致的性能瓶颈问题。现有检索增强生成(Retrieval-Augmented Generation, RAG)方法普遍存在两个局限:一是将视频扁平化为独立片段,破坏了其固有的时空结构;二是依赖显式语义匹配,难以捕捉与查询意图隐含相关的视觉线索。解决方案的关键在于提出 VideoStir 框架,该框架首先以帧级别构建视频的时空图结构,实现多跳检索以聚合远距离但语境相关事件的信息;其次引入基于 MLLM 的意图相关性评分器,依据帧与查询推理意图的一致性进行检索,从而提升对隐含语义线索的感知能力。为支持该机制,作者还构建了 IR-600K 数据集用于学习帧与查询意图之间的对齐关系。实验表明,VideoStir 在无需额外辅助信息的情况下达到领先性能,验证了从扁平化语义匹配向结构化、意图感知推理转变的有效性。

链接: https://arxiv.org/abs/2604.05418
作者: Honghao Fu,Miao Xu,Yiwei Wang,Dailing Zhang,Liu Jun,Yujun Cai
机构: University of Queensland (昆士兰大学); University of California, Merced (加州大学默塞德分校); Institute of Automation, CAS (中国科学院自动化研究所); Lancaster University (兰卡斯特大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026

点击查看摘要

Abstract:Scaling multimodal large language models (MLLMs) to long videos is constrained by limited context windows. While retrieval-augmented generation (RAG) is a promising remedy by organizing query-relevant visual evidence into a compact context, most existing methods (i) flatten videos into independent segments, breaking their inherent spatio-temporal structure, and (ii) depend on explicit semantic matching, which can miss cues that are implicitly relevant to the query’s intent. To overcome these limitations, we propose VideoStir, a structured and intent-aware long-video RAG framework. It firstly structures a video as a spatio-temporal graph at clip level, and then performs multi-hop retrieval to aggregate evidence across distant yet contextually related events. Furthermore, it introduces an MLLM-backed intent-relevance scorer that retrieves frames based on their alignment with the query’s reasoning intent. To support this capability, we curate IR-600K, a large-scale dataset tailored for learning frame-query intent alignment. Experiments show that VideoStir is competitive with state-of-the-art baselines without relying on auxiliary information, highlighting the promise of shifting long-video RAG from flattened semantic matching to structured, intent-aware reasoning. Codes and checkpoints are available at Github.
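摘要中"在 clip 级时空图上做多跳检索、聚合远距离但语境相关事件"的过程,本质上是一次限制跳数的图遍历。以下用纯 Python 的 BFS 给出示意(图结构、节点命名与跳数上限均为本文假设):

```python
from collections import deque

def multi_hop_retrieve(graph, seeds, max_hops=2):
    """从语义匹配命中的种子 clip 出发,在时空图上做多跳扩展,
    聚合"远距离但语境相关"的事件片段(示意版)。"""
    visited = dict.fromkeys(seeds, 0)        # clip -> 命中时的跳数
    queue = deque(seeds)
    while queue:
        clip = queue.popleft()
        hop = visited[clip]
        if hop >= max_hops:
            continue
        for nxt in graph.get(clip, []):
            if nxt not in visited:
                visited[nxt] = hop + 1
                queue.append(nxt)
    return visited

# 玩具时空图:节点是 clip,边表示时间相邻或空间实体共现
graph = {
    "c0": ["c1"], "c1": ["c0", "c2"], "c2": ["c1", "c7"],
    "c7": ["c2", "c8"], "c8": ["c7"], "c9": [],
}
evidence = multi_hop_retrieve(graph, seeds=["c1"], max_hops=2)
print(sorted(evidence))
```

检索到的片段随后再交给意图相关性评分器打分,组成送入 MLLM 的紧凑上下文。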

[CV-83] Learning to Synergize Semantic and Geometric Priors for Limited-Data Wheat Disease Segmentation

【速读】:该论文旨在解决小麦病害分割在生长阶段间存在显著类内时序变化导致的数据代表性不足问题,即在有限标注数据下难以训练出鲁棒的分割模型。其解决方案的关键在于提出SGPer框架,通过语义-几何先验协同机制实现疾病特异性语义感知与边界定位的联合优化:利用预训练DINOv2提供的类别感知语义先验生成密集的类别特定点提示(point prompts),引导Segment Anything Model (SAM)进行精准边界定位;同时设计疾病敏感适配器以对齐DINOv2与SAM的特征空间,并通过动态过滤冗余提示(基于SAM迭代掩码置信度与DINOv2语义一致性交叉验证)提升掩码精度,从而在数据受限场景下实现对时序外观变化不变的高精度分割。

链接: https://arxiv.org/abs/2604.05415
作者: Shijie Wang,Zijian Wang,Yadan Luo,Scott Chapman,Xin Yu,Zi Huang
机构: The University of Queensland (昆士兰大学); The University of Adelaide (阿德莱德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Wheat disease segmentation is fundamental to precision agriculture but faces severe challenges from significant intra-class temporal variations across growth stages. Such substantial appearance shifts make collecting a representative dataset for training from scratch both labor-intensive and impractical. To address this, we propose SGPer, a Semantic-Geometric Prior Synergization framework that treats wheat disease segmentation under limited data as a coupled task of disease-specific semantic perception and disease boundary localization. Our core insight is that pretrained DINOv2 provides robust category-aware semantic priors to handle appearance shifts, which can be converted into coarse spatial prompts to guide SAM for the precise localization of disease boundaries. Specifically, SGPer designs disease-sensitive adapters with multiple disease-friendly filters and inserts them into both DINOv2 and SAM to align their pretrained representations with disease-specific characteristics. To operationalize this synergy, SGPer transforms DINOv2-derived features into dense, category-specific point prompts to ensure comprehensive spatial coverage of all disease regions. To subsequently eliminate prompt redundancy and ensure highly accurate mask generation, it dynamically filters these dense candidates by cross-referencing SAM’s iterative mask confidence with the category-specific semantic consistency derived from DINOv2. Ultimately, SGPer distills a highly informative set of prompts to activate SAM’s geometric priors, achieving precise and robust segmentation that remains strictly invariant to temporal appearance changes. Extensive evaluations demonstrate that SGPer consistently achieves state-of-the-art performance on wheat disease and organ segmentation benchmarks, especially in data-constrained scenarios.

[CV-84] raining Without Orthogonalization Inference With SVD: A Gradient Analysis of Rotation Representations

【速读】:该论文旨在解决深度学习中旋转估计任务里,SVD正交化操作在训练阶段引入梯度畸变的问题,以及为何在推理阶段使用SVD优于Gram-Schmidt正交化这一现象缺乏理论解释。其关键解决方案在于通过针对3×3矩阵和SO(3)流形的SVD反向传播雅可比矩阵进行精确谱分析,揭示了SVD在训练时会导致显著的梯度方向误差(尤其当预测矩阵远离SO(3)时),而Gram-Schmidt则因梯度信号分配不均导致参数效率低下;因此,最优策略是仅在推理阶段应用SVD投影,训练时直接回归9D参数空间,从而避免梯度畸变并提升模型性能。

链接: https://arxiv.org/abs/2604.05414
作者: Chris Choy
机构: NVIDIA(英伟达)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent work has shown that removing orthogonalization during training and applying it only at inference improves rotation estimation in deep learning, with empirical evidence favoring 9D representations with SVD projection. However, the theoretical understanding of why SVD orthogonalization specifically harms training, and why it should be preferred over Gram-Schmidt at inference, remains incomplete. We provide a detailed gradient analysis of SVD orthogonalization specialized to 3 \times 3 matrices and SO(3) projection. Our central result derives the exact spectrum of the SVD backward pass Jacobian: it has rank 3 (matching the dimension of SO(3) ) with nonzero singular values 2/(s_i + s_j) and condition number \kappa = (s_1 + s_2)/(s_2 + s_3) , creating quantifiable gradient distortion that is most severe when the predicted matrix is far from SO(3) (e.g., early in training when s_3 \approx 0 ). We further show that even stabilized SVD gradients introduce gradient direction error, whereas removing SVD from the training loop avoids this tradeoff entirely. We also prove that the 6D Gram-Schmidt Jacobian has an asymmetric spectrum: its parameters receive unequal gradient signal, explaining why 9D parameterization is preferable. Together, these results provide the theoretical foundation for training with direct 9D regression and applying SVD projection only at inference.
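摘要给出的谱结论可以直接数值化:非零奇异值为 2/(s_i + s_j)(i < j 的三个组合),条件数 κ = (s_1 + s_2)/(s_2 + s_3)。下面的小脚本按文中公式计算并演示"训练早期 s_3 ≈ 0 时条件数急剧增大"这一点(具体数值仅为演示):

```python
def svd_jacobian_spectrum(s1, s2, s3):
    """按文中结论计算 SVD 反传雅可比的三个非零奇异值 2/(s_i+s_j)
    (i<j)及条件数 κ = (s1+s2)/(s2+s3),s 按降序排列。"""
    s = sorted([s1, s2, s3], reverse=True)
    vals = [2.0 / (s[i] + s[j]) for i in range(3) for j in range(i + 1, 3)]
    kappa = (s[0] + s[1]) / (s[1] + s[2])
    return sorted(vals, reverse=True), kappa

# 预测矩阵接近 SO(3) 时(奇异值接近 1)条件数接近 1;
# 训练早期 s3 ≈ 0 时条件数显著增大,梯度畸变最严重。
vals_good, kappa_good = svd_jacobian_spectrum(1.1, 1.0, 0.9)
vals_bad, kappa_bad = svd_jacobian_spectrum(2.0, 1.0, 1e-3)
print(kappa_good, kappa_bad)
```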

[CV-85] CRISP: Rank-Guided Iterative Squeezing for Robust Medical Image Segmentation under Domain Shift

【速读】:该论文旨在解决医学影像中分布偏移(distribution shift)对医疗人工智能(Medical AI)临床转化的瓶颈问题,此类偏移会导致模型在未见环境中的性能显著下降,并可能加剧健康不平等。现有领域自适应方法受限于通过模拟偏移或伪监督穷举预设可能性,难以应对开放且不可预测的真实世界场景。解决方案的关键在于提出一个经验规律——“正区域秩稳定性”(Rank Stability of Positive Regions),即正类体素的预测概率相对排名在分布偏移下保持稳定;基于此原则,作者设计了无需目标域信息、参数无关且模型无关的CRISP框架,其核心创新是基于秩而非概率进行分割:通过潜在特征扰动模拟分布偏移,识别出在扰动下始终高秩(高置信度正例)和低秩(可安全归为负例)的区域,构建高精度(HP)与高召回(HR)先验并递归优化,最终通过迭代训练使HP与HR逐步收敛至最优分割结果。

链接: https://arxiv.org/abs/2604.05409
作者: Yizhou Fang,Pujin Cheng,Yixiang Liu,Xiaoying Tang,Longxi Zhou
机构: Southern University of Science and Technology (南方科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Distribution shift in medical imaging remains a central bottleneck for the clinical translation of medical AI. Failure to address it can lead to severe performance degradation in unseen environments and exacerbate health inequities. Existing methods for domain adaptation are inherently limited by exhausting predefined possibilities through simulated shifts or pseudo-supervision. Such strategies struggle in the open-ended and unpredictable real world, where distribution shifts are effectively infinite. To address this challenge, we introduce an empirical law called “Rank Stability of Positive Regions”, which states that the relative rank of predicted probabilities for positive voxels remains stable under distribution shift. Guided by this principle, we propose CRISP, a parameter-free and model-agnostic framework requiring no target-domain information. CRISP is the first framework to make segmentation based on rank rather than probabilities. CRISP simulates model behavior under distribution shift via latent feature perturbation, where voxel probability rankings exhibit two stable patterns: regions that consistently retain high probabilities (destined positives according to the principle) and those that remain low-probability (can be safely classified as negatives). Based on these patterns, we construct high-precision (HP) and high-recall (HR) priors and recursively refine them under perturbation. We then design an iterative training framework, making HP and HR progressively “squeeze” to the final segmentation. Extensive evaluations on multi-center cardiac MRI and CT-based lung vessel segmentation demonstrate CRISP’s superior robustness, significantly outperforming state-of-the-art methods with striking HD95 reductions of up to 0.14 (7.0% improvement), 1.90 (13.1% improvement), and 8.39 (38.9% improvement) pixels across multi-center, demographic, and modality shifts, respectively.
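"正区域秩稳定性"与 HP/HR 先验的构造可以用一个一维玩具例子示意:对 logit 反复加噪模拟分布偏移,名次在所有扰动下始终靠前的体素进入高精度 (HP) 先验,至少一次进入前列的体素进入高召回 (HR) 先验。扰动形式、强度与比例均为本文演示假设,并非论文原始设定:

```python
import numpy as np

def rank_of(p):
    """按数值从高到低的名次(0 = 最高)。"""
    order = np.argsort(-p)
    rank = np.empty_like(order)
    rank[order] = np.arange(len(p))
    return rank

def crisp_priors(probs, n_perturb=32, hp_frac=0.1, hr_frac=0.5, seed=0):
    """秩稳定性思想的示意:所有扰动下名次始终 < hp_frac*n 的体素
    构成 HP 先验;至少一次名次 < hr_frac*n 的体素构成 HR 先验。"""
    rng = np.random.default_rng(seed)
    n = len(probs)
    logits = np.log(probs / (1 - probs))
    hp = np.ones(n, dtype=bool)
    hr = np.zeros(n, dtype=bool)
    for _ in range(n_perturb):
        perturbed = logits + rng.normal(0.0, 0.3, n)  # 模拟分布偏移
        r = rank_of(perturbed)       # sigmoid 单调,直接对 logit 排名即可
        hp &= r < hp_frac * n
        hr |= r < hr_frac * n
    return hp, hr

# 10 个高置信正例、40 个模糊体素、50 个明确负例
probs = np.concatenate([np.full(10, 0.95), np.full(40, 0.5), np.full(50, 0.05)])
hp, hr = crisp_priors(probs)
print(hp.sum(), hr.sum())
```

论文随后在 HP 与 HR 之间迭代训练,使两者逐步"挤压"收敛到最终分割。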

[CV-86] Weather-Conditioned Branch Routing for Robust LiDAR-Radar 3D Object Detection

【速读】:该论文旨在解决恶劣天气下3D目标检测的鲁棒性问题,其核心挑战在于不同传感器(如LiDAR与4D雷达)在不同环境条件下的可靠性动态变化,而现有基于LiDAR-4D雷达融合的方法多依赖固定或弱适应性处理流程,无法根据天气条件自适应调整模态偏好。解决方案的关键在于将多模态感知重构为一种天气条件驱动的分支路由(weather-conditioned branch routing)问题:框架显式维护三条并行的3D特征流——纯LiDAR分支、纯4D雷达分支和条件门控融合分支,并通过一个轻量级路由器根据从视觉和语义提示中提取的条件token,动态预测样本特定权重以软聚合这些表征;同时引入天气监督学习策略(含辅助分类与多样性正则化),防止分支退化,确保各分支在不同天气条件下表现出差异化且稳定的路由行为,从而实现对LiDAR与4D雷达模态依赖关系的可解释性自适应切换。

链接: https://arxiv.org/abs/2604.05405
作者: Hongsheng Li,Lingfeng Zhang,Zexian Yang,Liang Li,Rong Yin,Xiaoshuai Hao,Wenbo Ding
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); College of Computer and Data Science, Fuzhou University (福州大学计算机与数据科学学院); Institute of Information Engineering, CAS (中国科学院信息工程研究所); Xiaomi EV (小米汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Robust 3D object detection in adverse weather is highly challenging due to the varying reliability of different sensors. While existing LiDAR-4D radar fusion methods improve robustness, they predominantly rely on fixed or weakly adaptive pipelines, failing to dynamically adjust modality preferences as environmental conditions change. To bridge this gap, we reformulate multi-modal perception as a weather-conditioned branch routing problem. Instead of computing a single fused output, our framework explicitly maintains three parallel 3D feature streams: a pure LiDAR branch, a pure 4D radar branch, and a condition-gated fusion branch. Guided by a condition token extracted from visual and semantic prompts, a lightweight router dynamically predicts sample-specific weights to softly aggregate these representations. Furthermore, to prevent branch collapse, we introduce a weather-supervised learning strategy with auxiliary classification and diversity regularization to enforce distinct, condition-dependent routing behaviors. Extensive experiments on the K-Radar benchmark demonstrate that our method achieves state-of-the-art performance. Furthermore, it provides explicit and highly interpretable insights into modality preferences, transparently revealing how adaptive routing robustly shifts reliance between LiDAR and 4D radar across diverse adverse-weather scenarios. The source code will be released.
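摘要中"轻量路由器由条件 token 预测样本级权重、软聚合三条分支"的机制,可以用一层线性映射加 softmax 给出示意。权重矩阵与条件 token 均为本文演示假设,并非论文实现:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route(features, condition_token, W):
    """天气条件分支路由示意:路由器由条件 token 预测三个分支
    (LiDAR / 4D 雷达 / 融合)的权重,再对分支特征做软聚合。"""
    weights = softmax(W @ condition_token)          # (3,)
    fused = sum(w * f for w, f in zip(weights, features))
    return fused, weights

rng = np.random.default_rng(0)
branches = [rng.normal(size=16) for _ in range(3)]  # LiDAR、雷达、融合特征
# 构造一个"雨雾"条件 token 与偏向雷达分支的路由器权重(演示假设)
cond_fog = np.array([1.0, 0.0])
W = np.array([[-2.0, 0.0], [2.0, 0.0], [0.0, 0.0]])
fused, w = route(branches, cond_fog, W)
print(np.round(w, 3))
```

论文中的辅助天气分类与多样性正则就是为了让这组权重在不同天气下呈现出可区分、不塌缩的路由行为。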

[CV-87] LSGS-Loc: Towards Robust 3DGS-Based Visual Localization for Large-Scale UAV Scenarios

【速读】:该论文旨在解决大规模无人机(UAV)场景下基于3D高斯溅射(3D Gaussian Splatting, 3DGS)的视觉定位问题,尤其是现有方法在姿态初始化鲁棒性和对重建伪影(如模糊和浮游物)敏感性方面的不足。其解决方案的关键在于提出了一种名为LSGS-Loc的新颖视觉定位流程:首先引入一种尺度感知的姿态初始化策略,通过结合场景无关的相对位姿估计与显式的3DGS尺度约束,实现无需场景特定训练的几何合理定位;其次,在姿态精化阶段设计了一种基于拉普拉斯算子的可靠性掩码机制,引导光度优化聚焦于高质量区域,从而有效缓解重建伪影的影响。实验表明,该方法在大规模UAV基准测试中实现了最先进的定位精度与鲁棒性。

链接: https://arxiv.org/abs/2604.05402
作者: Xiang Zhang,Tengfei Wang,Fang Xu,Xin Wang,Zongqian Zhan
机构: Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: This paper is under reviewed by RA-L. The copyright might be transferred upon acceptance

点击查看摘要

Abstract:Visual localization in large-scale UAV scenarios is a critical capability for autonomous systems, yet it remains challenging due to geometric complexity and environmental variations. While 3D Gaussian Splatting (3DGS) has emerged as a promising scene representation, existing 3DGS-based visual localization methods struggle with robust pose initialization and sensitivity to rendering artifacts in large-scale settings. To address these limitations, we propose LSGS-Loc, a novel visual localization pipeline tailored for large-scale 3DGS scenes. Specifically, we introduce a scale-aware pose initialization strategy that combines scene-agnostic relative pose estimation with explicit 3DGS scale constraints, enabling geometrically grounded localization without scene-specific training. Furthermore, in the pose refinement, to mitigate the impact of reconstruction artifacts such as blur and floaters, we develop a Laplacian-based reliability masking mechanism that guides photometric refinement toward high-quality regions. Extensive experiments on large-scale UAV benchmarks demonstrate that our method achieves state-of-the-art accuracy and robustness for unordered image queries, significantly outperforming existing 3DGS-based approaches. Code is available at: this https URL
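The Laplacian-based reliability masking can be pictured as thresholding a 3x3 Laplacian response into a binary map that keeps sharp, well-reconstructed regions and excludes blurred or floater-prone ones from photometric refinement. A toy grayscale sketch (the kernel and threshold below are generic choices, not the paper's exact formulation):

```python
def laplacian_reliability_mask(img, threshold=0.1):
    """Flag high-frequency (sharp) regions of a rendered grayscale image.

    `img` is a 2-D list of floats in [0, 1]. Pixels whose absolute
    3x3 Laplacian response falls below `threshold` are treated as
    blurred/floater-prone and masked out (border pixels stay masked).
    """
    h, w = len(img), len(img[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # discrete Laplacian: sum of 4-neighbours minus 4x centre
            lap = (img[y - 1][x] + img[y + 1][x]
                   + img[y][x - 1] + img[y][x + 1]
                   - 4.0 * img[y][x])
            mask[y][x] = 1 if abs(lap) >= threshold else 0
    return mask
```

A flat (blurred) region yields zero Laplacian response and is excluded, while a sharp intensity spike is kept, which is the behaviour the refinement stage relies on.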

[CV-88] Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval CVPR2026

【速读】:该论文旨在解决**组合图像检索(Composed Image Retrieval, CIR)**中因过度强调语义匹配而导致实例级一致性不足的问题,即在跨场景下难以可靠地检索出用户指定的具体对象实例。为应对这一挑战,作者提出了一种新的细粒度检索任务——对象锚定的组合图像检索(Object-Anchored Composed Image Retrieval, OACIR),其核心要求是严格保持查询中指定对象的实例一致性。解决方案的关键在于:首先构建了首个大规模、多领域的基准数据集OACIRR(包含超过16万组四元组及含难负例的候选图库),并在每组查询中引入边界框(bounding box)以视觉锚定参考图像中的目标对象;其次设计了AdaFocal框架,其中包含一个上下文感知注意力调制器(Context-Aware Attention Modulator),能够自适应增强对锚定实例区域的关注,动态平衡实例与整体语义上下文之间的注意力分配,从而显著提升实例级保真度。

链接: https://arxiv.org/abs/2604.05393
作者: Yuxin Yang,Yinan Zhou,Yuxin Chen,Ziqi Zhang,Zongyang Ma,Chunfeng Yuan,Bing Li,Jun Gao,Weiming Hu
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); University of Chinese Academy of Sciences (中国科学院大学); Xi’an Jiaotong University (西安交通大学); Tencent Inc. (腾讯公司); PeopleAI Inc. (人智科技公司); HelloGroup Inc. (HelloGroup公司); ShanghaiTech University (上海科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted to CVPR 2026. Project page, dataset, and code are available at: this https URL

点击查看摘要

Abstract:Composed Image Retrieval (CIR) has demonstrated significant potential by enabling flexible multimodal queries that combine a reference image and modification text. However, CIR inherently prioritizes semantic matching, struggling to reliably retrieve a user-specified instance across contexts. In practice, emphasizing concrete instance fidelity over broad semantics is often more consequential. In this work, we propose Object-Anchored Composed Image Retrieval (OACIR), a novel fine-grained retrieval task that mandates strict instance-level consistency. To advance research on this task, we construct OACIRR (OACIR on Real-world images), the first large-scale, multi-domain benchmark comprising over 160K quadruples and four challenging candidate galleries enriched with hard-negative instance distractors. Each quadruple augments the compositional query with a bounding box that visually anchors the object in the reference image, providing a precise and flexible way to ensure instance preservation. To address the OACIR task, we propose AdaFocal, a framework featuring a Context-Aware Attention Modulator that adaptively intensifies attention within the specified instance region, dynamically balancing focus between the anchored instance and the broader compositional context. Extensive experiments demonstrate that AdaFocal substantially outperforms existing compositional retrieval models, particularly in maintaining instance-level fidelity, thereby establishing a robust baseline for this challenging task while opening new directions for more flexible, instance-aware retrieval systems.
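The core effect of the Context-Aware Attention Modulator, intensifying attention inside the annotated bounding box while keeping the map a valid distribution over the image, can be sketched as a boost-and-renormalize step (the simple multiplicative boost is an assumption of this sketch, not the paper's learned mechanism):

```python
def modulate_attention(attn, bbox, boost=2.0):
    """Re-weight a 2-D attention map toward a bounding-box region.

    Scores inside `bbox = (x0, y0, x1, y1)` (half-open) are scaled by
    `boost`, then the whole map is renormalized so it remains a
    probability distribution balancing instance and context.
    """
    x0, y0, x1, y1 = bbox
    out = [[attn[y][x] * (boost if x0 <= x < x1 and y0 <= y < y1 else 1.0)
            for x in range(len(attn[0]))]
           for y in range(len(attn))]
    total = sum(v for row in out for v in row)
    return [[v / total for v in row] for row in out]
```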

[CV-89] LUMOS: Universal Semi-Supervised OCT Retinal Layer Segmentation with Hierarchical Reliable Mutual Learning

【速读】:该论文旨在解决光学相干断层扫描(Optical Coherence Tomography, OCT)视网膜层分割中因标注数据稀缺及不同数据集标签粒度异质性带来的挑战。现有半监督学习方法通常假设标签粒度固定,难以充分利用跨粒度的监督信息。其解决方案的关键在于提出LUMOS框架,融合双解码器网络与分层提示策略(Dual-Decoder Network with a Hierarchical Prompting Strategy, DDN-HPS),通过多粒度提示机制有效抑制伪标签噪声传播;同时引入可靠渐进式多粒度学习(Reliable Progressive Multi-granularity Learning, RPML),结合区域级可靠性加权和渐进式训练策略,引导模型从易到难逐步学习,确保跨粒度一致性目标的可靠选择,从而实现稳定且高效的跨粒度对齐。

链接: https://arxiv.org/abs/2604.05388
作者: Yizhou Fang,Jian Zhong,Li Lin,Xiaoying Tang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 2 figures. Accepted to IEEE ISBI 2026. © 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

点击查看摘要

Abstract:Optical Coherence Tomography (OCT) layer segmentation faces challenges due to annotation scarcity and heterogeneous label granularities across datasets. While semi-supervised learning helps alleviate label scarcity, existing methods typically assume a fixed granularity, failing to fully exploit cross-granularity supervision. This paper presents LUMOS, a semi-supervised universal OCT retinal layer segmentation framework based on a Dual-Decoder Network with a Hierarchical Prompting Strategy (DDN-HPS) and Reliable Progressive Multi-granularity Learning (RPML). DDN-HPS combines a dual-branch architecture with a multi-granularity prompting strategy to effectively suppress pseudo-label noise propagation. Meanwhile, RPML introduces region-level reliability weighing and a progressive training approach that guides the model from easier to more difficult tasks, ensuring the reliable selection of cross-granularity consistency targets, thereby achieving stable cross-granularity alignment. Experiments on six OCT datasets demonstrate that LUMOS largely outperforms existing methods and exhibits exceptional cross-domain and cross-granularity generalization capability.

[CV-90] UAVReason: A Unified Large-Scale Benchmark for Multimodal Aerial Scene Reasoning and Generation

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在高空无人飞行器(Unmanned Aerial Vehicle, UAV)场景下性能显著下降的问题,其根本原因在于高海拔俯视视角带来的显著域偏移(domain shift),表现为目标尺寸小且密集、纹理重复以及视角方向模糊,从而严重破坏语义定位能力,阻碍空间推理与可控生成。解决方案的关键在于构建首个面向俯视UAV场景的统一大规模多模态基准UAVReason,该基准源自高保真UAV仿真平台,整合超过27.3万条视觉问答(Visual Question Answering, VQA)样本,涵盖单帧细粒度描述、双帧时序序列及跨模态生成任务,并系统评估22种空间与时间维度的推理类型及RGB、深度与分割模态下的高保真生成能力。同时,通过多任务学习建立统一基线模型,实验证明该方法显著优于通用领域VLMs,在VQA的EM/F1、分割的mIoU及生成质量的CLIP Score等指标上均取得提升,验证了统一多任务学习对UAV原生性能优化的有效性。

链接: https://arxiv.org/abs/2604.05377
作者: Jintao Sun,Hu Zhang,Donglin Di,Gangyi Ding,Zhedong Zheng
机构: Beijing Institute of Technology (北京理工大学); CSIRO DATA61 (澳大利亚联邦科学与工业研究组织数据61实验室); Harbin Institute of Technology (哈尔滨工业大学); University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 12 figures, 7 tables

点击查看摘要

Abstract:Vision-Language models (VLMs) have demonstrated remarkable capability in ground-view visual understanding but often fracture when deployed on high-altitude Unmanned Aerial Vehicles (UAVs). The failure largely stems from a pronounced domain shift, characterized by tiny and densely packed objects, repetitive textures, and ambiguous top-down orientations. These factors severely disrupt semantic grounding and hinder both spatial reasoning and controllable generation. To bridge this critical gap, we introduce UAVReason, the first unified large-scale multi-modal benchmark dedicated to nadir-view UAV scenarios, derived from a high-fidelity UAV simulation platform. In contrast to existing UAV benchmarks, which are largely siloed and focus on single tasks like object detection or segmentation, UAVReason uniquely consolidates over 273K Visual Question Answering (VQA) pairs, including 23.6K single frames with detailed captions, 68.2K 2-frame temporal sequences, and 188.8K cross-modal generation samples. The benchmark probes 22 diverse reasoning types across spatial and temporal axes while simultaneously evaluating high-fidelity generation across RGB, depth, and segmentation modalities. We further establish a strong, unified baseline model via multi-task learning. Extensive experiments validate the efficacy of our unified approach across diverse metrics, such as EM/F1 for VQA, mIoU for segmentation, and CLIP Score for generation. These results indicate limitations of general-domain vision-language models and show that unified multi-task learning substantially improves UAV-native performance. All data, code, and evaluation tools will be publicly released to advance UAV multimodal research.

[CV-91] 3DTurboQuant: Training-Free Near-Optimal Quantization for 3D Reconstruction Models

【速读】:该论文旨在解决现有3D重建模型(如3D Gaussian Splatting、NeRF及基于Transformer的重建器)在压缩过程中依赖于场景特定码本(codebook)并通过逐场景微调来学习的问题。其核心挑战在于如何实现无需训练、无需码本学习且适用于多种模型结构的高效量化压缩方法。解决方案的关键在于利用参数向量维度特性——例如3DGS中的45维球谐函数系数和DUSt3R中的1024维键值向量,在特定维度范围内,单个随机旋转即可将任意输入映射为服从已知Beta分布的坐标,从而使得预计算的、数据无关的Lloyd-Max量化方案近似最优(仅比信息论下界差2.7倍)。这一发现驱动了作者提出一套可组合的剪枝-量化流水线,结合维度自适应量化准则、归一化分离边界以及二维哈希网格特征的分组策略,实现了无监督、快速且高保真的压缩效果。

链接: https://arxiv.org/abs/2604.05366
作者: Jae Joong Lee
机构: Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Every existing method for compressing 3D Gaussian Splatting, NeRF, or transformer-based 3D reconstructors requires learning a data-dependent codebook through per-scene fine-tuning. We show this is unnecessary. The parameter vectors that dominate storage in these models, 45-dimensional spherical harmonics in 3DGS and 1024-dimensional key-value vectors in DUSt3R, fall in a dimension range where a single random rotation transforms any input into coordinates with a known Beta distribution. This makes precomputed, data-independent Lloyd-Max quantization near-optimal, within a factor of 2.7 of the information-theoretic lower bound. We develop 3DTurboQuant, deriving (1) a dimension-dependent criterion that predicts which parameters can be quantized and at what bit-width before running any experiment, (2) norm-separation bounds connecting quantization MSE to rendering PSNR per scene, (3) an entry-grouping strategy extending rotation-based quantization to 2-dimensional hash grid features, and (4) a composable pruning-quantization pipeline with a closed-form compression ratio. On NeRF Synthetic, 3DTurboQuant compresses 3DGS by 3.5x with 0.02dB PSNR loss and DUSt3R KV caches by 7.9x with 39.7dB pointmap fidelity. No training, no codebook learning, no calibration data. Compression takes seconds. The code will be released (this https URL)
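The core operation, a single data-independent random rotation followed by nearest-level scalar quantization against a fixed codebook, can be sketched in pure Python. The Gram-Schmidt construction and the uniform 17-level codebook in the usage below are illustrative stand-ins; the paper uses Lloyd-Max levels precomputed for the known Beta coordinate distribution:

```python
import math
import random

def random_rotation(dim, seed=0):
    """Random orthogonal matrix (rows) via Gram-Schmidt on Gaussians."""
    rnd = random.Random(seed)
    basis = []
    while len(basis) < dim:
        v = [rnd.gauss(0, 1) for _ in range(dim)]
        for b in basis:  # remove components along existing rows
            d = sum(x * y for x, y in zip(v, b))
            v = [x - d * y for x, y in zip(v, b)]
        n = math.sqrt(sum(x * x for x in v))
        if n > 1e-8:
            basis.append([x / n for x in v])
    return basis

def rotate_and_quantize(vec, rot, codebook):
    """Rotate a unit-norm vector, then snap each coordinate to the
    nearest codebook level. With a random rotation, the rotated
    coordinates of a unit vector follow a known distribution, so a
    fixed, data-independent codebook can be near-optimal."""
    rotated = [sum(r[i] * vec[i] for i in range(len(vec))) for r in rot]
    return [min(codebook, key=lambda c: abs(c - x)) for x in rotated]
```

Because every rotated coordinate of a unit vector lies in [-1, 1], a uniform grid with step 1/8 bounds the per-coordinate error by 1/16 regardless of the input, which is the data-independence the method exploits.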

[CV-92] Rethinking IRSTD: Single-Point Supervision Guided Encoder-only Framework is Enough for Infrared Small Target Detection

【速读】:该论文旨在解决红外小目标检测(Infrared Small Target Detection, IRSTD)中因目标像素稀疏、边界模糊以及背景噪声干扰导致的检测性能下降问题。传统基于“编码器-解码器”结构的像素级监督分割方法虽取得一定进展,但忽略了小目标仅占少数像素且常伴随不可分辨背景噪声的事实,从而难以实现精准定位。解决方案的关键在于将IRSTD重构为质心回归任务,并提出单点监督引导的红外概率响应编码方法(Single-Point Supervision guided Infrared Probabilistic Response Encoding, SPIRE)。其核心创新包括:设计点响应先验监督(Point-Response Prior Supervision, PRPS),将单点标注转化为符合红外点目标响应特性的概率响应图;引入高分辨率概率编码器(High-Resolution Probabilistic Encoder, HRPE),实现仅用编码器的端到端回归,无需解码器重建;通过保留高分辨率特征和提升有效监督密度,显著缓解稀疏目标分布下的优化不稳定性,最终在多个基准数据集上实现了低虚警率(Fa)与高效计算成本的平衡。

链接: https://arxiv.org/abs/2604.05363
作者: Rixiang Ni,Boyang Li,Jun Chen,Yonghao Li,Feiyu Ren,Yuji Wang,Haoyang Yuan,Wujiao He,Wei An
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Infrared small target detection (IRSTD) aims to separate small targets from clutter backgrounds. Extensive research is dedicated to the pixel-level supervision-guided “encoder-decoder” segmentation paradigm. Although having achieved promising performance, they neglect the fact that small targets only occupy a few pixels and are usually accompanied with blurred boundary caused by clutter backgrounds. Based on this observation, we argue that the first principle of IRSTD should be target localization instead of separating all target region accompanied with indistinguishable background noise. In this paper, we reformulate IRSTD as a centroid regression task and propose a novel Single-Point Supervision guided Infrared Probabilistic Response Encoding method (namely, SPIRE), which is indeed challenging due to the mismatch between reduced supervision network and equivalent output. Specifically, we first design a Point-Response Prior Supervision (PRPS), which transforms single-point annotations into a probabilistic response map consistent with infrared point-target response characteristics, with a High-Resolution Probabilistic Encoder (HRPE) that enables encoder-only, end-to-end regression without decoder reconstruction. By preserving high-resolution features and increasing effective supervision density, SPIRE alleviates optimization instability under sparse target distributions. Finally, extensive experiments on various IRSTD benchmarks, including SIRST-UAVB and SIRST4, demonstrate that SPIRE achieves competitive target-level detection performance with consistently low false alarm rate (Fa) and significantly reduced computational cost. Code is publicly available at: this https URL.
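The PRPS idea of turning a single-point annotation into a probabilistic response map can be sketched by placing an isotropic Gaussian at each annotated centroid, mimicking the few-pixel spread of an infrared point target (the Gaussian form and the `sigma` value are assumptions of this sketch, not the paper's exact prior):

```python
import math

def point_response_map(points, h, w, sigma=1.5):
    """Build an h x w probabilistic response map from point labels.

    Each annotated centroid `(cy, cx)` contributes an isotropic
    Gaussian peaking at 1.0; where responses overlap, the maximum is
    kept so nearby targets do not sum above 1.
    """
    resp = [[0.0] * w for _ in range(h)]
    for (cy, cx) in points:
        for y in range(h):
            for x in range(w):
                g = math.exp(-((y - cy) ** 2 + (x - cx) ** 2)
                             / (2 * sigma ** 2))
                resp[y][x] = max(resp[y][x], g)
    return resp
```

This densifies the supervision from one labelled pixel to a small neighbourhood, which is what lets an encoder-only regressor train stably under sparse targets.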

[CV-93] GESS: Multi-cue Guided Local Feature Learning via Geometric and Semantic Synergy

【速读】:该论文旨在解决现有局部特征检测与描述方法依赖单一外观线索导致关键点不稳定及描述子判别能力不足的问题。其解决方案的关键在于提出了一种多线索引导的局部特征学习框架,通过协同利用语义(semantic)和几何(geometric)线索来增强检测鲁棒性和描述子判别性:一方面构建联合语义-法向预测头,借助共享3D向量场深度耦合语义与法向信息以缓解异质不一致性带来的优化干扰;另一方面设计深度稳定性预测头,从几何一致性角度量化局部区域可靠性,为关键点选择提供确定性指导;在此基础上引入语义-深度感知关键点(Semantic-Depth Aware Keypoint, SDAK)机制,通过语义可靠性与深度稳定性的耦合重加权关键点响应,抑制不可靠区域的伪特征;同时设计统一三线索融合(Unified Triple-Cue Fusion, UTCF)模块,采用语义调度门控机制自适应注入多属性特征,从而显著提升描述子的判别能力。

链接: https://arxiv.org/abs/2604.05359
作者: Yang Yi,Xieyuanli Chen,Jinpu Zhang,Hui Shen,Dewen Hu
机构: National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Robust local feature detection and description are foundational tasks in computer vision. Existing methods primarily rely on single appearance cues for modeling, leading to unstable keypoints and insufficient descriptor discriminability. In this paper, we propose a multi-cue guided local feature learning framework that leverages semantic and geometric cues to synergistically enhance detection robustness and descriptor discriminability. Specifically, we construct a joint semantic-normal prediction head and a depth stability prediction head atop a lightweight backbone. The former leverages a shared 3D vector field to deeply couple semantic and normal cues, thereby resolving optimization interference from heterogeneous inconsistencies. The latter quantifies the reliability of local regions from a geometric consistency perspective, providing deterministic guidance for robust keypoint selection. Based on these predictions, we introduce the Semantic-Depth Aware Keypoint (SDAK) mechanism for feature detection. By coupling semantic reliability with depth stability, SDAK reweights keypoint responses to suppress spurious features in unreliable regions. For descriptor construction, we design a Unified Triple-Cue Fusion (UTCF) module, which employs a semantic-scheduled gating mechanism to adaptively inject multi-attribute features, improving descriptor discriminability. Extensive experiments on four benchmarks validate the effectiveness of the proposed framework. The source code and pre-trained model will be available at: this https URL.

[CV-94] Unsupervised Multi-agent and Single-agent Perception from Cooperative Views CVPR2026

【速读】:该论文旨在解决在无监督条件下同时实现多智能体(multi-agent)与单智能体(single-agent)LiDAR感知的问题,即如何利用多智能体间的传感器数据共享来提升3D目标检测和分类性能,而无需人工标注。其解决方案的关键在于提出了一种无监督多智能体与单智能体(Unsupervised Multi-agent and Single-agent, UMS)感知框架,该框架基于两个核心发现:一是通过协作视角的数据共享可提高点云密度,从而改善无监督目标分类;二是多智能体的协同视图可作为单视角检测任务的无监督引导信号。UMS进一步结合学习驱动的候选框净化滤波器(Proposal Purifying Filter)以优化多智能体融合后的候选区域分类,并引入渐进式候选框稳定模块(Progressive Proposal Stabilizing module)通过由易到难的学习策略生成可靠伪标签;此外设计跨视图一致性学习(Cross-View Consensus Learning)机制,利用多智能体协同视图指导单智能体视图中的3D目标检测,从而在V2V4Real和OPV2V两个公开数据集上显著优于现有无监督方法。

链接: https://arxiv.org/abs/2604.05354
作者: Haochen Yang,Baolu Li,Lei Li,Delin Ren,Jiacheng Guo,Minghai Qin,Tianyun Zhang,Hongkai Yu
机构: Cleveland State University (克利夫兰州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR2026

点击查看摘要

Abstract:The LiDAR-based multi-agent and single-agent perception has shown promising performance in environmental understanding for robots and automated vehicles. However, there is no existing method that simultaneously solves both multi-agent and single-agent perception in an unsupervised way. By sharing sensor data between multiple agents via communication, this paper discovers two key insights: 1) Improved point cloud density after the data sharing from cooperative views could benefit unsupervised object classification, 2) Cooperative view of multiple agents can be used as unsupervised guidance for the 3D object detection in the single view. Based on these two discovered insights, we propose an Unsupervised Multi-agent and Single-agent (UMS) perception framework that leverages multi-agent cooperation without human annotations to simultaneously solve multi-agent and single-agent perception. UMS combines a learning-based Proposal Purifying Filter to better classify the candidate proposals after multi-agent point cloud density cooperation, followed by a Progressive Proposal Stabilizing module to yield reliable pseudo labels by the easy-to-hard curriculum learning. Furthermore, we design a Cross-View Consensus Learning to use multi-agent cooperative view to guide detection in single-agent view. Experimental results on two public datasets V2V4Real and OPV2V show that our UMS method achieved significantly higher 3D detection performance than the state-of-the-art methods on both multi-agent and single-agent perception tasks in an unsupervised setting.

[CV-95] AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation

【速读】:该论文旨在解决图像目标导航(Image Goal Navigation, ImageNav)中因粗粒度成功标准(即智能体需停在目标图像1米范围内)而无法满足下游任务(如抓取操作)对精确定位需求的问题。解决方案的关键在于提出了一种无需训练的AnyImageNav系统,其核心思想是将目标图像视为几何查询:通过密集像素级对应关系,将任意目标图像(如物体、走廊或房间角落)与智能体观测进行注册,从而恢复精确的6自由度(6-DoF)相机位姿。该方法采用语义到几何的级联机制——首先利用语义相关性信号引导探索并作为接近阈值,仅在当前视图与目标图像高度相关时调用3D多视角基础模型;随后通过循环自认证实现高精度位姿恢复,显著优于现有基线方法,在Gibson和HM3D数据集上分别实现了0.27m/3.41°和0.21m/1.23°的定位误差与朝向误差,提升达5–10倍。

链接: https://arxiv.org/abs/2604.05351
作者: Yijie Deng,Shuaihang Yuan,Yi Fang
机构: New York University (纽约大学); Tsinghua University (清华大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image Goal Navigation (ImageNav) is evaluated by a coarse success criterion, the agent must stop within 1m of the target, which is sufficient for finding objects but falls short for downstream tasks such as grasping that require precise positioning. We introduce AnyImageNav, a training-free system that pushes ImageNav toward this more demanding setting. Our key insight is that the goal image can be treated as a geometric query: any photo of an object, a hallway, or a room corner can be registered to the agent’s observations via dense pixel-level correspondences, enabling recovery of the exact 6-DoF camera pose. Our method realizes this through a semantic-to-geometric cascade: a semantic relevance signal guides exploration and acts as a proximity gate, invoking a 3D multi-view foundation model only when the current view is highly relevant to the goal image; the model then self-certifies its registration in a loop for an accurate recovered pose. Our method sets state-of-the-art navigation success rates on Gibson (93.1%) and HM3D (82.6%), and achieves pose recovery that prior methods do not provide: a position error of 0.27m and heading error of 3.41 degrees on Gibson, and 0.21m / 1.23 degrees on HM3D, a 5-10x improvement over adapted baselines.

[CV-96] VLA-InfoEntropy: A Training-Free Vision-Attention Information Entropy Approach for Vision-Language-Action Models Inference Acceleration and Success ICME2026

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在处理高维视觉特征、复杂语言输入和连续动作序列时存在的计算开销大、推理效率低的问题,从而限制了其在实时场景中的部署与可靠性。解决方案的关键在于引入图像熵(image entropy)和注意力熵(attention entropy)两个量化指标:图像熵用于识别纹理丰富或结构信息显著的视觉区域,注意力熵则捕捉任务相关文本中注意力分布的集中程度;结合时间步信息,构建一种动态切换策略,使模型从全局视觉特征逐步聚焦于由语义引导的局部关键区域,从而在保留关键内容的同时减少冗余计算。此方法有效整合了空间、语义与时间线索,显著降低了推理参数量并提升了速度。

链接: https://arxiv.org/abs/2604.05323
作者: Chuhang Liu,Yayun He,Zuheng Kang,Xiaoyang Qu,Jianzong Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to the 2026 IEEE International Conference on Multimedia and Expo (ICME 2026)

点击查看摘要

Abstract:Vision-Language-Action (VLA) models integrate visual perception, language understanding, and action decision-making for cross-modal semantic alignment, exhibiting broad application potential. However, the joint processing of high-dimensional visual features, complex linguistic inputs, and continuous action sequences incurs significant computational overhead and low inference efficiency, thereby hindering real-time deployment and reliability. To address this issue, we use image entropy to quantify the grayscale distribution characteristics of each visual token and introduce attention entropy to capture the distribution of attention scores over task-related text. Visual entropy identifies texture-rich or structurally informative regions, while attention entropy pinpoints semantically relevant tokens. Combined with timestep information, these metrics enable a dynamic transition strategy that shifts the model’s focus from global visual features to attention-guided local informative regions. Thus, the resulting VLA-InfoEntropy method integrates spatial, semantic, and temporal cues to reduce redundancy while preserving critical content. Extensive experiments show that our method reduces inference parameters, accelerates inference speed, and outperforms existing approaches.
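Both measures are ordinary Shannon entropies over different distributions: an 8-bin grayscale histogram per visual token, and the softmax-normalized attention scores over task-related text tokens. A compact sketch (the bin count and normalization details are illustrative choices):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def image_entropy(gray, bins=8):
    """Entropy of a grayscale histogram for one patch / visual token
    (values in [0, 1]); higher means richer texture or structure."""
    counts = [0] * bins
    for v in gray:
        counts[min(int(v * bins), bins - 1)] += 1
    total = len(gray)
    return shannon_entropy([c / total for c in counts])

def attention_entropy(scores):
    """Entropy of a softmax-normalized attention distribution over
    text tokens; low entropy means concentrated, task-focused attention."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return shannon_entropy([e / z for e in exps])
```

A flat patch scores zero image entropy (safe to prune), while sharply peaked attention scores low attention entropy, jointly signalling which local regions deserve the model's focus.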

[CV-97] Indoor Asset Detection in Large Scale 360° Drone-Captured Imagery via 3D Gaussian Splatting CVPR2026

【速读】:该论文旨在解决从360°无人机采集的图像重建得到的3D Gaussian Splatting(3DGS)场景中,实现室内资产的目标级检测与分割问题。现有方法在多视角一致性与对象语义准确性方面存在不足,难以有效聚合来自不同视角的2D掩码以形成统一的3D物体实例。其解决方案的关键在于引入一个3D对象码本(3D object codebook),该码本联合利用对应高斯原语(Gaussian primitives)的掩码语义信息和空间位置信息,指导多视角掩码关联与室内资产检测;通过将2D目标检测与分割模型与语义和空间约束下的合并机制相结合,实现了多视图掩码到一致3D对象实例的有效聚合,从而显著提升F1分数(提高65%)和mAP(提升11%)。

链接: https://arxiv.org/abs/2604.05316
作者: Monica Tang,Avideh Zakhor
机构: UC Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 3DMV Workshop

点击查看摘要

Abstract:We present an approach for object-level detection and segmentation of target indoor assets in 3D Gaussian Splatting (3DGS) scenes, reconstructed from 360° drone-captured imagery. We introduce a 3D object codebook that jointly leverages mask semantics and spatial information of their corresponding Gaussian primitives to guide multi-view mask association and indoor asset detection. By integrating 2D object detection and segmentation models with semantically and spatially constrained merging procedures, our method aggregates masks from multiple views into coherent 3D object instances. Experiments on two large indoor scenes demonstrate reliable multi-view mask consistency, improving F1 score by 65% over state-of-the-art baselines, and accurate object-level 3D indoor asset detection, achieving an 11% mAP gain over baseline methods.

[CV-98] SmokeGS-R: Physics-Guided Pseudo-Clean 3DGS for Real-World Multi-View Smoke Restoration

【速读】:该论文旨在解决真实世界烟雾环境下多视角3D重建的难题,此类场景中烟雾同时导致场景辐射衰减、引入大气光(airlight)并破坏多视角外观一致性,从而显著降低重建鲁棒性。解决方案的关键在于将几何恢复与外观校正解耦:首先利用改进的暗通道先验和引导滤波生成物理启发的伪清洁监督信号,训练仅关注清晰图像的3D高斯泼溅(3D Gaussian Splatting)源模型;随后通过几何均值参考聚合、LAB空间Reinhard色调映射及轻量高斯平滑对渲染结果进行稳定外观调和。该策略在NTIRE 2026挑战赛中取得优异性能,验证了“先几何后外观”的范式在真实烟雾场景中的有效性。

链接: https://arxiv.org/abs/2604.05301
作者: Xueming Fu,Lixia Han
机构: University of Science and Technology of China (中国科学技术大学); Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Lab Report for NTIRE 2026 3DRR Track 2

点击查看摘要

Abstract:Real-world smoke simultaneously attenuates scene radiance, adds airlight, and destabilizes multi-view appearance consistency, making robust 3D reconstruction particularly difficult. We present SmokeGS-R, a practical pipeline developed for the NTIRE 2026 3D Restoration and Reconstruction Track 2 challenge. The key idea is to decouple geometry recovery from appearance correction: we generate physics-guided pseudo-clean supervision with a refined dark channel prior and guided filtering, train a sharp clean-only 3D Gaussian Splatting source model, and then harmonize its renderings with a donor ensemble using geometric-mean reference aggregation, LAB-space Reinhard transfer, and light Gaussian smoothing. On the official challenge testing leaderboard, the final submission achieved PSNR = 15.217 and SSIM = 0.666. After the public release of RealX3D, we re-evaluated the same frozen result on the seven released challenge scenes without retraining and obtained PSNR = 15.209, SSIM = 0.644, and LPIPS = 0.551, outperforming the strongest official baseline average on the same scenes by +3.68 dB PSNR. These results suggest that a geometry-first reconstruction strategy combined with stable post-render appearance harmonization is an effective recipe for real-world multi-view smoke restoration. The code is available at this https URL.
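The pseudo-clean supervision rests on the classic dark channel prior: in smoke-free regions some color channel is near zero within a local window, so the local channel-and-window minimum estimates smoke density, and transmission follows t = 1 - omega * dark_channel(I / A). A small sketch of that step (window size and `omega` are the usual dehazing defaults, assumed here; the paper additionally refines the prior and applies guided filtering):

```python
def dark_channel(img, patch=3):
    """Per-pixel dark channel: min over RGB, then min over a local
    window. `img` is an H x W x 3 nested list with values in [0, 1]."""
    h, w = len(img), len(img[0])
    minc = [[min(img[y][x]) for x in range(w)] for y in range(h)]
    r = patch // 2
    dc = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [minc[yy][xx]
                      for yy in range(max(0, y - r), min(h, y + r + 1))
                      for xx in range(max(0, x - r), min(w, x + r + 1))]
            dc[y][x] = min(window)
    return dc

def estimate_transmission(img, airlight, omega=0.95):
    """Transmission map t = 1 - omega * dark_channel(I / A); a
    pseudo-clean pixel then follows J = (I - A) / t + A."""
    h, w = len(img), len(img[0])
    norm = [[[img[y][x][c] / airlight[c] for c in range(3)]
             for x in range(w)] for y in range(h)]
    dc = dark_channel(norm)
    return [[1.0 - omega * dc[y][x] for x in range(w)] for y in range(h)]
```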

[CV-99] From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal

【速读】:该论文旨在解决非人脸识别(non-FR)视觉嵌入(如CLIP、DINOv2/v3、SSCD)在包含人脸数据上的身份泄露(identity leakage)问题,尤其是在开放集验证场景下缺乏可衡量的隐私风险评估与有效的隐私保护机制。其核心解决方案是提出一种身份净化投影(Identity Sanitization Projection, ISP),该方法通过一次性的线性投影移除估计的身份子空间,同时保留对视觉搜索和检索任务至关重要的互补特征空间,从而在保障隐私的同时维持高非生物特征实用性(non-biometric utility)。实验表明,ISP能够使线性攻击接近随机水平,且在跨数据集迁移时性能衰减较小,首次实现了针对非FR编码器的攻击者校准面部隐私审计。

链接: https://arxiv.org/abs/2604.05296
作者: Daniel George,Charles Yeh,Daniel Lee,Yifei Zhang
机构: Persona Identities, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 4 figures

点击查看摘要

Abstract:Frozen visual embeddings (e.g., CLIP, DINOv2/v3, SSCD) power retrieval and integrity systems, yet their use on face-containing data is constrained by unmeasured identity leakage and a lack of deployable mitigations. We take an attacker-aware view and contribute: (i) a benchmark of visual embeddings that reports open-set verification at low false-accept rates, a calibrated diffusion-based template inversion check, and face-context attribution with equal-area perturbations; and (ii) a one-shot linear projector that removes an estimated identity subspace while preserving the complementary space needed for utility, which for brevity we denote as the identity sanitization projection (ISP). Across CelebA-20 and VGGFace2, we show that these encoders are robust under open-set linear probes, with CLIP exhibiting relatively higher leakage than DINOv2/v3 and SSCD, robust to template inversion, and are context-dominant. In addition, we show that ISP drives linear access to near-chance while retaining high non-biometric utility, and transfers across datasets with minor degradation. Our results establish the first attacker-calibrated facial privacy audit of non-FR encoders and demonstrate that linear subspace removal achieves strong privacy guarantees while preserving utility for visual search and retrieval.
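The ISP itself is a one-shot linear operation: subtract an embedding's projection onto the estimated identity subspace. Given an orthonormal basis for that subspace (in practice estimated from data, e.g. from same-identity embedding statistics), sanitization reduces to a few dot products; this reimplementation is a sketch, not the authors' code:

```python
def sanitize(embedding, identity_basis):
    """Remove the span of an (orthonormal) estimated identity subspace
    from an embedding: e' = e - sum_b (e . b) b.

    The residual is orthogonal to every basis vector, so a linear
    identity probe along the removed directions sees nothing, while
    the complementary (utility) subspace is untouched.
    """
    out = list(embedding)
    for b in identity_basis:
        d = sum(x * y for x, y in zip(out, b))
        out = [x - d * y for x, y in zip(out, b)]
    return out
```

Because the projector is fixed once estimated, it can be applied at indexing time with negligible cost, which is what makes the mitigation deployable on frozen encoders.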

[CV-100] Final Report Center for Computer-Integrated Surgical Systems and Technology NSF ERC Cooperative Agreement EEC9731748 Volume 1

【速读】:该论文旨在解决传统医疗操作中效率低、精度不足、结果不可预测及成本高昂等问题,其核心目标是通过融合数据与先进技术构建智能化临床系统,从而显著提升手术和诊疗过程的精准性与安全性。解决方案的关键在于依托计算机集成手术系统与技术工程研究中心(CISST ERC)所建立的专业基础设施,将数据与医疗机器人技术深度整合到临床系统中,实现从常规任务处理向高复杂度干预的跨越,并最终改善临床结果、降低医疗成本、加快患者康复,使整个医疗体系更加高效、安全和可持续。

链接: https://arxiv.org/abs/2604.05272
作者: Russell H. Taylor,Gregory D. Hager,Ralph Etienne-Cummings,Eric Grimson,Ron Kikinis,Cameron Riviere
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In the last ten years, medical robotics has moved from the margins to the mainstream. Since the Engineering Research Center for Computer-Integrated Surgical Systems and Technology was launched in 1998 with National Science Foundation funding, medical robots have progressed from handling routine tasks to performing highly sophisticated interventions and related assignments. The CISST ERC has played a significant role in this transformation. And thanks to NSF support, the ERC has built the professional infrastructure that will continue our mission: bringing data and technology together in clinical systems that will dramatically change how surgery and other procedures are done. The enhancements we envision touch virtually every aspect of the delivery of care: more accurate procedures; more consistent, predictable results from one patient to the next; improved clinical outcomes; greater patient safety; reduced liability for healthcare providers; lower costs for everyone, including patients, facilities, insurers, and government; easier, faster recovery for patients; effective new ways to treat health problems; and healthier patients in a healthier system. The basic science and engineering the ERC is developing now will yield profound benefits for all concerned about health care, from government agencies to insurers, from clinicians to patients to the general public. All will experience the healing touch of medical robotics, thanks in no small part to the work of the CISST ERC and its successors.

[CV-101] Toward Unified Fine-Grained Vehicle Classification and Automatic License Plate Recognition

【速读】:该论文旨在解决细粒度车辆分类(Fine-Grained Vehicle Classification, FGVC)在真实复杂场景下性能受限的问题,特别是现有研究多假设理想条件、属性覆盖有限且未充分整合自动车牌识别(Automatic License Plate Recognition, ALPR)技术。其解决方案的关键在于构建并公开发布UFPR-VeSV数据集,该数据集包含24,945张图像、16,297辆唯一车辆,并标注了13种颜色、26个品牌、136个型号和14类车型,数据来源于巴西巴拉那州军事警察监控系统,涵盖部分遮挡、夜间红外成像及光照变化等现实挑战场景;所有FGVC标签均通过车牌信息验证,同时提供文本与角点标注。此外,论文通过五种深度学习模型的基准测试揭示了多色车辆识别、红外图像处理及共享平台车型区分等关键难点,并探索了FGVC与ALPR联合应用的潜力,为智能交通系统中车辆信息提取提供了更鲁棒的互补方法。

链接: https://arxiv.org/abs/2604.05271
作者: Gabriel E. Lima,Valfride Nascimento,Eduardo Santos,Eduil Nascimento Jr,Rayson Laroca,David Menotti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in the Journal of the Brazilian Computer Society (JBCS)

点击查看摘要

Abstract:Extracting vehicle information from surveillance images is essential for intelligent transportation systems, enabling applications such as traffic monitoring and criminal investigations. While Automatic License Plate Recognition (ALPR) is widely used, Fine-Grained Vehicle Classification (FGVC) offers a complementary approach by identifying vehicles based on attributes such as color, make, model, and type. Although there have been advances in this field, existing studies often assume well-controlled conditions, explore limited attributes, and overlook FGVC integration with ALPR. To address these gaps, we introduce UFPR-VeSV, a dataset comprising 24,945 images of 16,297 unique vehicles with annotations for 13 colors, 26 makes, 136 models, and 14 types. Collected from the Military Police of Paraná (Brazil) surveillance system, the dataset captures diverse real-world conditions, including partial occlusions, nighttime infrared imaging, and varying lighting. All FGVC annotations were validated using license plate information, with text and corner annotations also being provided. A qualitative and quantitative comparison with established datasets confirmed the challenging nature of our dataset. A benchmark using five deep learning models further validated this, revealing specific challenges such as handling multicolored vehicles, infrared images, and distinguishing between vehicle models that share a common platform. Additionally, we apply two optical character recognition models to license plate recognition and explore the joint use of FGVC and ALPR. The results highlight the potential of integrating these complementary tasks for real-world applications. The UFPR-VeSV dataset is publicly available at: this https URL.

[CV-102] Coverage Optimization for Camera View Selection

【速读】:该论文旨在解决三维场景重建中的主动视角选择(active view selection)问题,即如何在数据采集过程中智能地选择最具信息量的相机位姿以提升重建效率与精度。解决方案的关键在于提出了一种基于覆盖度的轻量级视点选择指标——COVER(Camera Optimization for View Exploration and Reconstruction),其核心思想是通过最小化可计算的费舍尔信息增益近似值来识别那些能补充历史观测不足几何结构的新视角,从而避免昂贵的透射率估计并增强对噪声和训练动态的鲁棒性。

链接: https://arxiv.org/abs/2604.05259
作者: Timothy Chen,Adam Dai,Maximilian Adang,Grace Gao,Mac Schwager
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:What makes a good viewpoint? The quality of the data used to learn 3D reconstructions is crucial for enabling efficient and accurate scene modeling. We study the active view selection problem and develop a principled analysis that yields a simple and interpretable criterion for selecting informative camera poses. Our key insight is that informative views can be obtained by minimizing a tractable approximation of the Fisher Information Gain, which reduces to favoring viewpoints that cover geometry that has been insufficiently observed by past cameras. This leads to a lightweight coverage-based view selection metric that avoids expensive transmittance estimation and is robust to noise and training dynamics. We call this metric COVER (Camera Optimization for View Exploration and Reconstruction). We integrate our method into the Nerfstudio framework and evaluate it on real datasets within fixed and embodied data acquisition scenarios. Across multiple datasets and radiance-field baselines, our method consistently improves reconstruction quality compared to state-of-the-art active view selection methods. Additional visualizations and our Nerfstudio package can be found at this https URL.
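COVER 的核心思想(优先选择能覆盖历史相机观测不足几何的视角)可以用一个假设性的贪心覆盖选择最小示意来说明(可见性矩阵为模拟输入,并非论文实现):

```python
import numpy as np

def select_views(visibility, k):
    """贪心选取 k 个视角,使未被覆盖的场景点尽可能少。

    visibility: (num_views, num_points) 布尔矩阵,
                visibility[v, p] 表示视角 v 能观测到点 p(假设已由几何预先计算)。
    """
    covered = np.zeros(visibility.shape[1], dtype=bool)
    selected = []
    for _ in range(k):
        # 覆盖度指标:每个候选视角能新覆盖多少尚未被观测的点
        gains = (visibility & ~covered).sum(axis=1)
        best = int(np.argmax(gains))
        selected.append(best)
        covered |= visibility[best]
    return selected, covered

rng = np.random.default_rng(0)
vis = rng.random((10, 200)) < 0.15          # 10 个候选视角、200 个场景点(模拟)
views, covered = select_views(vis, k=4)
print(len(views), covered.sum())
```

这一贪心版本仅示意"覆盖未充分观测区域"的选择准则;论文中的 COVER 指标是从费舍尔信息增益近似推导而来,并非简单的布尔覆盖计数。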

[CV-103] Protecting and Preserving Protest Dynamics for Responsible Analysis

【速读】:该论文旨在解决抗议相关社交媒体数据在用于集体行动分析时所面临的隐私风险问题,尤其是大型基础模型可能因记忆和泄露敏感信息而导致跨平台身份识别与事后参与者暴露的风险。现有自动化抗议分析方法缺乏整合隐私风险评估、下游分析及公平性考量的全流程框架。其解决方案的关键在于提出一种负责任计算框架,通过条件图像生成技术将敏感抗议图像替换为标注良好的合成复制品,从而在不暴露可识别个体的前提下实现对集体行为模式的分析;该方法在保持下游分析效用的同时显著降低隐私风险,并进一步评估生成数据中的群体公平性,以避免对特定子群体的偏倚影响。

链接: https://arxiv.org/abs/2604.05256
作者: Cohen Archbold,Usman Hassan,Nazmus Sakib,Sen-ching Cheung,Abdullah-Al-Zubaer Imran
机构: University of Kentucky(肯塔基大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 6 figures, Submitted to ACM Journal on Responsible Computing

点击查看摘要

Abstract:Protest-related social media data are valuable for understanding collective action but inherently high-risk due to concerns surrounding surveillance, repression, and individual privacy. Contemporary AI systems can identify individuals, infer sensitive attributes, and cross-reference visual information across platforms, enabling surveillance that poses risks to protesters and bystanders. In such contexts, large foundation models trained on protest imagery risk memorizing and disclosing sensitive information, leading to cross-platform identity leakage and retroactive participant identification. Existing approaches to automated protest analysis do not provide a holistic pipeline that integrates privacy risk assessment, downstream analysis, and fairness considerations. To address this gap, we propose a responsible computing framework for analyzing collective protest dynamics while reducing risks to individual privacy. Our framework replaces sensitive protest imagery with well-labeled synthetic reproductions using conditional image synthesis, enabling analysis of collective patterns without direct exposure of identifiable individuals. We demonstrate that our approach produces realistic and diverse synthetic imagery while balancing downstream analytical utility with reductions in privacy risk. We further assess demographic fairness in the generated data, examining whether synthetic representations disproportionately affect specific subgroups. Rather than offering absolute privacy guarantees, our method adopts a pragmatic, harm-mitigating approach that enables socially sensitive analysis while acknowledging residual risks.

[CV-104] Active Measurement of Two-Point Correlations AISTATS2026

【速读】:该论文旨在解决在大规模点数据集中,针对满足特定属性的子集高效估算两点关联函数(Two-point Correlation Function, 2PCF)的问题。在天文学应用中,例如星团仅占银河系潜在源的极小比例,传统方法依赖人工标注构建目录,效率低下且耗时。解决方案的关键在于提出一种“人机协同”框架,利用预训练分类器引导采样策略,自适应选择最具信息量的点进行人工标注;同时设计了一种新颖的无偏估计量、采样机制与置信区间构造方法,能够在显著减少标注工作量的同时,实现多距离区间内成对计数的无偏估计,并大幅降低方差,从而实现对天文数据集中两点关联关系的可扩展且统计严谨的测量。

链接: https://arxiv.org/abs/2604.05227
作者: Max Hamilton,Daniel Sheldon,Subhransu Maji
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AISTATS 2026

点击查看摘要

Abstract:Two-point correlation functions (2PCF) are widely used to characterize how points cluster in space. In this work, we study the problem of measuring the 2PCF over a large set of points, restricted to a subset satisfying a property of interest. An example comes from astronomy, where scientists measure the 2PCF of star clusters, which make up only a tiny subset of possible sources within a galaxy. This task typically requires careful labeling of sources to construct catalogs, which is time-consuming. We present a human-in-the-loop framework for efficient estimation of 2PCF of target sources. By leveraging a pre-trained classifier to guide sampling, our approach adaptively selects the most informative points for human annotation. After each annotation, it produces unbiased estimates of pair counts across multiple distance bins simultaneously. Compared to simple Monte Carlo approaches, our method achieves substantially lower variance while significantly reducing annotation effort. We introduce a novel unbiased estimator, sampling strategy, and confidence interval construction that together enable scalable and statistically grounded measurement of two-point correlations in astronomy datasets.
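论文中"按采样概率加权以获得成对计数无偏估计"的思路,可以用如下 Horvitz–Thompson 风格的最小示意说明(数据与分类器置信度均为模拟,并非论文的估计量实现):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 600
points = rng.uniform(0, 1, size=(N, 2))
is_target = rng.random(N) < 0.08                   # 真实标签(实际中需人工标注)
q = np.where(is_target, 0.9, 0.2)                  # 假设由预训练分类器给出的采样概率
sampled = rng.random(N) < q                        # 只对被抽中的点请求人工标注
w = np.where(sampled & is_target, 1.0 / q, 0.0)    # Horvitz-Thompson 权重:E[w_i] = 1{i 为目标}

# 在单个距离区间内估计目标点对数:i≠j 独立采样时 E[w_i * w_j] = 1{i,j 均为目标}
d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
iu, ju = np.triu_indices(N, k=1)
in_bin = (d[iu, ju] >= 0.05) & (d[iu, ju] < 0.15)
est = np.sum(w[iu] * w[ju] * in_bin)
true_pairs = np.sum(is_target[iu] & is_target[ju] & in_bin)
print(f"estimated {est:.1f} vs true {true_pairs}, annotated {sampled.sum()}/{N}")
```

与对全部点均匀标注相比,这种按置信度加权的采样只需标注一小部分点即可得到成对计数的无偏估计;论文进一步提出了自适应采样与置信区间构造,此处未涉及。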

[CV-105] Hierarchical Mesh Transformers with Topology-Guided Pretraining for Morphometric Analysis of Brain Structures

【速读】:该论文旨在解决大规模无结构体积网格(volumetric mesh)和表面网格(surface mesh)上表示学习的挑战,尤其是在神经影像学中如何有效整合多种顶点级别的形态学描述符(如皮层厚度、曲率、沟深和髓鞘含量)以捕捉细微的疾病相关信号。现有方法要么忽略这些临床信息特征,要么仅支持单一网格拓扑结构,限制了其在多模态影像分析流程中的应用。解决方案的关键在于提出一种分层Transformer框架,该框架基于任意阶单纯复形(simplicial complex)构建空间自适应树划分(spatially adaptive tree partitions),从而统一处理体积与表面离散化数据,并实现无需拓扑特异性修改的多尺度注意力机制;同时引入特征投影模块将不同长度的顶点级临床特征映射至空间层次结构中,实现几何结构与特征维度的解耦,进而无缝集成多种神经影像特征集;此外,通过在大规模未标注队列上对坐标和形态学通道进行掩码重建的自监督预训练,获得可迁移的编码器主干网络,显著提升下游任务性能,在阿尔茨海默病分类、淀粉样蛋白负荷预测及局灶性皮质发育不良检测等基准测试中均达到当前最优效果。

链接: https://arxiv.org/abs/2604.05215
作者: Yujian Xiong,Mohammad Farazi,Yanxi Chen,Wenhui Zhu,Xuanzhao Dong,Natasha Lepore,Yi Su,Raza Mushtaq,Stephen Foldes,Andrew Yang,Yalin Wang
机构: Arizona State University (亚利桑那州立大学); University of Southern California (南加州大学); Banner Health (Banner健康); Barrow Neurological Institute (巴罗神经学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Representation learning on large-scale unstructured volumetric and surface meshes poses significant challenges in neuroimaging, especially when models must incorporate diverse vertex-level morphometric descriptors, such as cortical thickness, curvature, sulcal depth, and myelin content, which carry subtle disease-related signals. Current approaches either ignore these clinically informative features or support only a single mesh topology, restricting their use across imaging pipelines. We introduce a hierarchical transformer framework designed for heterogeneous mesh analysis that operates on spatially adaptive tree partitions constructed from simplicial complexes of arbitrary order. This design accommodates both volumetric and surface discretizations within a single architecture, enabling efficient multi-scale attention without topology-specific modifications. A feature projection module maps variable-length per-vertex clinical descriptors into the spatial hierarchy, separating geometric structure from feature dimensionality and allowing seamless integration of different neuroimaging feature sets. Self-supervised pretraining via masked reconstruction of both coordinates and morphometric channels on large unlabeled cohorts yields a transferable encoder backbone applicable to diverse downstream tasks and mesh modalities. We validate our approach on Alzheimer’s disease classification and amyloid burden prediction using volumetric brain meshes from ADNI, as well as focal cortical dysplasia detection on cortical surface meshes from the MELD dataset, achieving state-of-the-art results across all benchmarks.
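其中"对坐标与形态学通道做掩码重建"的自监督预训练目标,可用如下最小示意表达(用随机张量代替真实网格特征,encoder 的输出也仅为占位模拟,并非论文实现):

```python
import numpy as np

rng = np.random.default_rng(0)
V, C = 1000, 7                       # 顶点数;3 维坐标 + 4 个形态学通道(厚度、曲率等,假设)
feats = rng.normal(size=(V, C))

mask = rng.random(V) < 0.3           # 随机掩盖 30% 的顶点
corrupted = feats.copy()
corrupted[mask] = 0.0                # 被掩盖顶点的全部通道置零作为占位符

# 实际中 recon = encoder_decoder(corrupted);此处用"恒等加噪声"模拟网络输出
recon = corrupted + rng.normal(scale=0.01, size=(V, C))

# 仅在被掩盖的顶点上计算重建损失(坐标与形态学通道一并重建)
loss = np.mean((recon[mask] - feats[mask]) ** 2)
print(f"masked vertices: {mask.sum()}, loss: {loss:.3f}")
```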

[CV-106] Boxer: Robust Lifting of Open-World 2D Bounding Boxes to 3D KR

【速读】:该论文旨在解决开放世界场景下静态三维边界框(3D bounding boxes, 3DBBs)的检测与定位问题,这是计算机视觉中一个尚未充分解决的挑战。其核心解决方案是提出Boxer算法,通过一个基于Transformer的网络BoxerNet,将2D开放词汇目标检测结果(如DETIC、OWLv2等)提升至3D空间,并结合多视角融合与几何过滤机制,生成度量空间中的全局一致且去重的3DBBs。关键创新在于:利用现有2D检测器降低对标注3DBB数据的需求,引入偶然不确定性(aleatoric uncertainty)提升回归鲁棒性,设计中值深度块编码以支持稀疏深度输入,并在包含超过120万唯一3DBBs的大规模数据集上进行训练,显著优于当前最优基线方法(如CuTR),尤其在无密集深度信息的自我中心视角(egocentric)设置下表现突出。

链接: https://arxiv.org/abs/2604.05212
作者: Daniel DeTone,Tianwei Shen,Fan Zhang,Lingni Ma,Julian Straub,Richard Newcombe,Jakob Engel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this http URL

点击查看摘要

Abstract:Detecting and localizing objects in space is a fundamental computer vision problem. While much progress has been made to solve 2D object detection, 3D object localization is much less explored and far from solved, especially for open-world categories. To address this research challenge, we propose Boxer, an algorithm to estimate static 3D bounding boxes (3DBBs) from 2D open-vocabulary object detections, posed images and optional depth either represented as a sparse point cloud or dense depth. At its core is BoxerNet, a transformer-based network which lifts 2D bounding box (2DBB) proposals into 3D, followed by multi-view fusion and geometric filtering to produce globally consistent de-duplicated 3DBBs in metric world space. Boxer leverages the power of existing 2DBB detection algorithms (e.g. DETIC, OWLv2, SAM3) to localize objects in 2D. This allows the main BoxerNet model to focus on lifting to 3D rather than detecting, ultimately reducing the demand for costly annotated 3DBB training data. Extending the CuTR formulation, we incorporate an aleatoric uncertainty for robust regression, a median depth patch encoding to support sparse depth inputs, and large-scale training with over 1.2 million unique 3DBBs. BoxerNet outperforms state-of-the-art baselines in open-world 3DBB lifting, including CuTR in egocentric settings without dense depth (0.532 vs. 0.010 mAP) and on CA-1M with dense depth available (0.412 vs. 0.250 mAP).
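摘要中"引入偶然不确定性以实现鲁棒回归"通常指让网络同时预测均值与(对数)方差,并最小化高斯负对数似然,使噪声大的样本自动获得较小的回归权重。下面是该损失的一个通用最小示意(并非 BoxerNet 的具体实现):

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """高斯负对数似然(略去常数项):0.5 * (exp(-log_var) * (y - mu)^2 + log_var)。"""
    return 0.5 * (np.exp(-log_var) * (y - mu) ** 2 + log_var)

y = np.array([1.0, 2.0, 3.0])            # 回归目标(如 3DBB 的某个参数)
mu = np.array([1.1, 2.5, 2.9])           # 网络预测的均值
log_var_low = np.full(3, -2.0)           # 模型自信(方差小)
log_var_high = np.full(3, 2.0)           # 模型不确定(方差大)

loss_conf = gaussian_nll(y, mu, log_var_low).mean()
loss_unc = gaussian_nll(y, mu, log_var_high).mean()
print(loss_conf, loss_unc)
```

对误差小的样本,自信的预测损失更低;而当误差很大时,报告较大方差反而能降低损失,这正是偶然不确定性对离群标注的鲁棒性来源。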

[CV-107] Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification

【速读】:该论文旨在解决施工场景中实时、准确识别作业人员周边安全隐患的问题。现有大模型虽具备较强的上下文推理能力,但计算开销高,难以满足近实时检测需求;而小规模视觉语言模型(sVLM)虽效率更高,却常因准确性不足和幻觉问题导致误判。解决方案的关键在于提出一种检测引导的小型视觉语言模型(detection-guided sVLM)框架,通过YOLOv11n检测器先定位工人与施工机械,并将这些目标信息嵌入结构化提示(structured prompts)中,从而引导sVLM进行空间感知的多模态推理,实现更精准且可解释的危险识别。该方法在零样本设置下显著提升了六种sVLM模型的性能,其中最优模型Gemma-3 4B的F1分数达到50.6%,较基线提升显著,同时仅增加2.5毫秒/图像的推理延迟。

链接: https://arxiv.org/abs/2604.05210
作者: Muhammad Adil,Mehmood Ahmed,Muhammad Aqib,Vicente A. Gonzalez,Gaang Lee,Qipei Mei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate and timely identification of construction hazards around workers is essential for preventing workplace accidents. While large vision-language models (VLMs) demonstrate strong contextual reasoning capabilities, their high computational requirements limit their applicability in near real-time construction hazard detection. In contrast, small vision-language models (sVLMs) with fewer than 4 billion parameters offer improved efficiency but often suffer from reduced accuracy and hallucination when analyzing complex construction scenes. To address this trade-off, this study proposes a detection-guided sVLM framework that integrates object detection with multimodal reasoning for contextual hazard identification. The framework first employs a YOLOv11n detector to localize workers and construction machinery within the scene. The detected entities are then embedded into structured prompts to guide the reasoning process of sVLMs, enabling spatially grounded hazard assessment. Within this framework, six sVLMs (Gemma-3 4B, Qwen-3-VL 2B/4B, InternVL-3 1B/2B, and SmolVLM-2B) were evaluated in zero-shot settings on a curated dataset of construction site images with hazard annotations and explanatory rationales. The proposed approach consistently improved hazard detection performance across all models. The best-performing model, Gemma-3 4B, achieved an F1-score of 50.6%, compared to 34.5% in the baseline configuration. Explanation quality also improved significantly, with BERTScore F1 increasing from 0.61 to 0.82. Despite incorporating object detection, the framework introduces minimal overhead, adding only 2.5 ms per image during inference. These results demonstrate that integrating lightweight object detection with small VLM reasoning provides an effective and efficient solution for context-aware construction safety hazard detection.
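"将检测到的实体嵌入结构化提示"这一步,可用如下假设性的提示模板示意(字段命名与措辞均为虚构,并非论文使用的原始提示):

```python
def build_prompt(detections):
    """把检测结果(类别、置信度、边界框)组织成引导 sVLM 推理的结构化提示文本。"""
    lines = ["Detected entities on the construction site:"]
    for i, det in enumerate(detections, 1):
        x1, y1, x2, y2 = det["box"]
        lines.append(
            f"{i}. {det['label']} (conf {det['conf']:.2f}) at "
            f"box [{x1}, {y1}, {x2}, {y2}]"
        )
    lines.append(
        "Based on the spatial relations between these entities, "
        "identify any safety hazards and explain your reasoning."
    )
    return "\n".join(lines)

dets = [
    {"label": "worker", "conf": 0.91, "box": (120, 80, 180, 260)},
    {"label": "excavator", "conf": 0.88, "box": (200, 60, 420, 300)},
]
prompt = build_prompt(dets)
print(prompt)
```

将边界框坐标显式写入提示,使小模型无需自行定位目标即可进行空间感知的危险推理,这与论文"检测引导"的设计动机一致。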

[CV-108] OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models

【速读】:该论文旨在解决多任务适配器(adapter)融合问题,特别是针对生成式模型中主题(subject)和风格(style)适配器的训练-free 合并难题。现有方法难以在不重新训练的情况下将多个为不同任务优化的适配器有效整合,导致性能下降或特征混淆。解决方案的关键在于利用正交微调(Orthogonal Fine-Tuning, OFT)框架下的结构化正交参数化及其几何特性,推导出基于Group-and-Shuffle (GS\mathcal{GS}) 正交矩阵流形的测地线近似公式,并提出一种谱恢复(spectra restoration)变换以保持合并后适配器的谱特性,从而实现高质量的特征融合。此方法首次实现了乘法型正交适配器的无训练合并,且在主题驱动生成任务中验证了其有效性。

链接: https://arxiv.org/abs/2604.05183
作者: Ali Aliev,Kamil Garifullin,Nikolay Yudin,Vera Soboleva,Alexander Molozhavenko,Ivan Oseledets,Aibek Alanov,Maxim Rakhuba
机构: HSE University(高等经济大学); FusionBrain Lab; AXXX
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In a rapidly growing field of model training there is a constant practical interest in parameter-efficient fine-tuning and various techniques that use a small amount of training data to adapt the model to a narrow task. However, there is an open question: how to combine several adapters tuned for different tasks into one which is able to yield adequate results on both tasks? Specifically, merging subject and style adapters for generative models remains unresolved. In this paper we seek to show that in the case of orthogonal fine-tuning (OFT), we can use structured orthogonal parametrization and its geometric properties to get the formulas for training-free adapter merging. In particular, we derive the structure of the manifold formed by the recently proposed Group-and-Shuffle (GS) orthogonal matrices, and obtain efficient formulas for the geodesics approximation between two points. Additionally, we propose a spectra restoration transform that restores spectral properties of the merged adapter for higher-quality fusion. We conduct experiments in subject-driven generation tasks showing that our technique to merge two GS orthogonal matrices is capable of uniting concept and style features of different adapters. To the best of our knowledge, this is the first training-free method for merging multiplicative orthogonal adapters. Code is available via this https URL.
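论文针对 GS 正交矩阵流形推导测地线近似;其一般思想——沿 R(t) = R₀·exp(t·log(R₀ᵀR₁)) 在正交矩阵之间插值——可在更简单的 SO(3) 情形下示意如下(仅为几何直观,并非论文的 GS 流形实现):

```python
import numpy as np

def hat(w):
    """3 维向量到反对称矩阵。"""
    return np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])

def so3_exp(w):
    """Rodrigues 公式:旋转向量 -> 旋转矩阵。"""
    th = np.linalg.norm(w)
    if th < 1e-12:
        return np.eye(3)
    K = hat(w / th)
    return np.eye(3) + np.sin(th) * K + (1 - np.cos(th)) * (K @ K)

def so3_log(R):
    """旋转矩阵 -> 旋转向量(假设转角 < pi)。"""
    th = np.arccos(np.clip((np.trace(R) - 1) / 2, -1, 1))
    if th < 1e-12:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]]) / (2 * np.sin(th))
    return th * w

def geodesic(R0, R1, t):
    """SO(3) 上连接 R0 与 R1 的测地线:R(t) = R0 @ exp(t * log(R0^T R1))。"""
    return R0 @ so3_exp(t * so3_log(R0.T @ R1))

R0 = so3_exp(np.array([0.3, 0.0, 0.0]))
R1 = so3_exp(np.array([0.0, 0.7, 0.2]))
mid = geodesic(R0, R1, 0.5)
print(np.allclose(mid @ mid.T, np.eye(3)))  # 中点仍是正交矩阵
```

沿测地线插值的关键性质是每个中间点都留在流形上(始终正交),这也是论文能做训练无关适配器融合的几何基础;GS 矩阵情形的公式结构更复杂,需参考原文。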

[CV-109] LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

【速读】:该论文旨在解决当前基于Transformer的前馈式3D重建方法在恢复精细纹理和外观细节方面仍落后于密集视图优化方法的问题。其核心解决方案在于通过显著扩展上下文窗口(即大幅增加活跃对象和图像标记数量)来缩小这一差距,从而实现高保真度的3D物体重建与逆渲染。关键创新包括:(1) 一种高效的粗到精管道,通过预测稀疏高分辨率残差聚焦计算资源于信息丰富区域;(2) 一种3D感知的空间路由机制,利用显式几何距离建立精确的2D-3D对应关系,而非依赖标准注意力分数;(3) 一种基于All-gather-KV协议的块感知序列并行策略,以平衡GPU间动态稀疏负载。这些改进使LSRM相比现有最优方法处理20倍对象标记和2倍图像标记,并在标准新视角合成基准上实现PSNR提升2.5 dB、LPIPS降低40%,且在逆渲染任务中达到或超越密集视图优化方法的纹理与几何细节表现。

链接: https://arxiv.org/abs/2604.05182
作者: Zhengqin Li,Cheng Zhang,Jakob Engel,Zhao Dong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce the Large Sparse Reconstruction Model to study how scaling transformer context windows impacts feed-forward 3D reconstruction. Although recent object-centric feed-forward methods deliver robust, high-quality reconstruction, they still lag behind dense-view optimization in recovering fine-grained texture and appearance. We show that expanding the context window – by substantially increasing the number of active object and image tokens – remarkably narrows this gap and enables high-fidelity 3D object reconstruction and inverse rendering. To scale effectively, we adapt native sparse attention in our architecture design, unlocking its capacity for 3D reconstruction with three key contributions: (1) an efficient coarse-to-fine pipeline that focuses computation on informative regions by predicting sparse high-resolution residuals; (2) a 3D-aware spatial routing mechanism that establishes accurate 2D-3D correspondences using explicit geometric distances rather than standard attention scores; and (3) a custom block-aware sequence parallelism strategy utilizing an All-gather-KV protocol to balance dynamic, sparse workloads across GPUs. As a result, LSRM handles 20x more object tokens and 2x more image tokens than prior state-of-the-art (SOTA) methods. Extensive evaluations on standard novel-view synthesis benchmarks show substantial gains over the current SOTA, yielding 2.5 dB higher PSNR and 40% lower LPIPS. Furthermore, when extending LSRM to inverse rendering tasks, qualitative and quantitative evaluations on widely-used benchmarks demonstrate consistent improvements in texture and geometry details, achieving an LPIPS that matches or exceeds that of SOTA dense-view optimization methods. Code and model will be released on our project page.

[CV-110] MIRAGE: Benchmarking and Aligning Multi-Instance Image Editing

【速读】:该论文旨在解决当前指令引导图像编辑模型(如FLUX.2和Qwen-Image-Edit)在处理包含多个相似实例且需分别编辑的复杂场景时存在的严重过编辑(over-editing)和空间错位(spatial misalignment)问题。其解决方案的关键在于提出一种无需训练的框架——Multi-Instance Regional Alignment via Guided Editing (MIRAGE),该框架利用视觉语言模型(Vision-Language Model, VLM)将复合指令解析为区域子集,并采用多分支并行去噪策略,在注入目标区域潜在表示至全局表示空间的同时,通过参考轨迹(reference trajectory)保持背景完整性,从而实现精准的实例级局部修改与背景一致性维持。

链接: https://arxiv.org/abs/2604.05180
作者: Ziqian Liu,Stephan Alaniz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Instruction-guided image editing has seen remarkable progress with models like FLUX.2 and Qwen-Image-Edit, yet they still struggle with complex scenarios with multiple similar instances each requiring individual edits. We observe that state-of-the-art models suffer from severe over-editing and spatial misalignment when faced with multiple identical instances and composite instructions. To this end, we introduce a comprehensive benchmark specifically designed to evaluate fine-grained consistency in multi-instance and multi-instruction settings. To address the failures of existing methods observed in our benchmark, we propose Multi-Instance Regional Alignment via Guided Editing (MIRAGE), a training-free framework that enables precise, localized editing. By leveraging a vision-language model to parse complex instructions into regional subsets, MIRAGE employs a multi-branch parallel denoising strategy. This approach injects latent representations of target regions into the global representation space while maintaining background integrity through a reference trajectory. Extensive evaluations on MIRA-Bench and RefEdit-Bench demonstrate that our framework significantly outperforms existing methods in achieving precise instance-level modifications while preserving background consistency. Our benchmark and code are available at this https URL.

[CV-111] Modality-Aware and Anatomical Vector-Quantized Autoencoding for Multimodal Brain MRI CVPR

【速读】:该论文旨在解决现有脑部医学图像生成模型(如变分自编码器,VAE)主要依赖单模态数据(如仅T1加权MRI)的问题,忽视了多模态MRI(如T1与T2加权MRI)之间互补的诊断信息。其解决方案的关键在于提出一种模态感知且解剖结构驱动的3D向量量化变分自编码器(NeuroQuant):首先通过因子化多轴注意力机制学习跨模态共享的潜在表示,以捕捉远距离脑区关系;其次采用双流3D编码器显式分离模态不变的解剖结构特征与模态相关的外观特征;最后利用共享码本对解剖特征进行离散化,并在解码阶段通过特征逐元素线性调制(FiLM)将二者融合,从而实现高保真度的多模态脑部MRI重建。

链接: https://arxiv.org/abs/2604.05171
作者: Mingjie Li,Edward Kim,Yue Zhao,Ehsan Adeli,Kilian M. Pohl
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR Findings track

点击查看摘要

Abstract:Learning a robust Variational Autoencoder (VAE) is a fundamental step for many deep learning applications in medical image analysis, such as MRI synthesis. Existing brain VAEs predominantly focus on single-modality data (i.e., T1-weighted MRI), overlooking the complementary diagnostic value of other modalities like T2-weighted MRIs. Here, we propose a modality-aware and anatomically grounded 3D vector-quantized VAE (VQ-VAE) for reconstructing multi-modal brain MRIs. Called NeuroQuant, it first learns a shared latent representation across modalities using factorized multi-axis attention, which can capture relationships between distant brain regions. It then employs a dual-stream 3D encoder that explicitly separates the encoding of modality-invariant anatomical structures from modality-dependent appearance. Next, the anatomical encoding is discretized using a shared codebook and combined with modality-specific appearance features via Feature-wise Linear Modulation (FiLM) during the decoding phase. This entire approach is trained using a joint 2D/3D strategy in order to account for the slice-based acquisition of 3D MRI data. Extensive experiments on two multi-modal brain MRI datasets demonstrate that NeuroQuant achieves superior reconstruction fidelity compared to existing VAEs, enabling a scalable foundation for downstream generative modeling and cross-modal brain image analysis.
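其中 FiLM(特征逐元素线性调制)在解码时用模态相关的外观编码对解剖特征做逐通道仿射变换,可示意如下(随机数据与简化的条件网络,并非 NeuroQuant 的实现):

```python
import numpy as np

def film(feat, gamma, beta):
    """FiLM:out[b, c, ...] = gamma[b, c] * feat[b, c, ...] + beta[b, c]。"""
    shape = gamma.shape + (1,) * (feat.ndim - 2)
    return gamma.reshape(shape) * feat + beta.reshape(shape)

rng = np.random.default_rng(0)
B, C, D, H, W = 2, 8, 4, 4, 4
anat = rng.normal(size=(B, C, D, H, W))     # 模态不变的解剖特征(量化后)
cond = rng.normal(size=(B, 16))             # 模态相关的外观编码

# 一个极简"条件网络":单个线性层由 cond 生成 (gamma, beta)
Wg = rng.normal(scale=0.1, size=(16, 2 * C))
gb = cond @ Wg
gamma, beta = gb[:, :C] + 1.0, gb[:, C:]    # gamma 初始化在 1 附近,近似恒等调制

out = film(anat, gamma, beta)
print(out.shape)
```

这种设计把"画什么结构"(anat)与"以哪种模态外观渲染"(gamma、beta)解耦,从而同一解剖编码可解码为 T1 或 T2 外观。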

[CV-112] Lightweight True In-Pixel Encryption with FeFET Enabled Pixel Design for Secure Imaging

【速读】:该论文旨在解决图像传感器在成像流水线中多个阶段易受视觉数据泄露的问题,提出一种端到端的安全保障方案。其关键解决方案是设计了一种紧凑的CMOS兼容像素架构——SecurePix,该架构在像素级实现真加密(true in-pixel encryption),利用铁电场效应晶体管(ferroelectric field-effect transistor, FeFET)的可编程非易失性多域极化状态生成对称密钥,并通过像素内模拟传输特性映射将光电二极管电压直接转换为加密模拟输出,从而确保加密发生在任何读出线路暴露像素值之前。此方法有效抵御基于神经网络的推理攻击,实验表明加密后ResNet-18在MNIST和CIFAR-10上的识别准确率分别降至9.58%和6.98%,同时支持授权接收方通过查找表(lookup table)进行逆向映射恢复原始图像,且硬件开销极低,每像素编程功耗-延迟积为17 μW·μs,感测功耗-延迟积为1.25 μW·μs。

链接: https://arxiv.org/abs/2604.05147
作者: Md Rahatul Islam Udoy,Diego Ferrer,Wantong Li,Kai Ni,Sumeet Kumar Gupta,Ahmedullah Aziz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Ensuring end-to-end security in image sensors has become essential as visual data can be exposed through multiple stages of the imaging pipeline. Advanced protection requires encryption to occur before pixel values appear on any readout lines. This work introduces a secure pixel sensor (SecurePix), a compact CMOS-compatible pixel architecture that performs true in-pixel encryption using a symmetric key realized through programmable, non-volatile multidomain polarization states of a ferroelectric field-effect transistor. The pixel and array operations are designed and simulated in HSPICE, while a 45 nm CMOS process design kit is used for layout drawing. The resulting layout confirms a pixel pitch of 2.33 x 3.01 um^2. Each pixel’s non-volatile programming level defines its analog transfer characteristic, enabling the photodiode voltage to be converted into an encrypted analog output within the pixel. Full-image evaluation shows that ResNet-18 recognition accuracy drops from 99.29 percent to 9.58 percent on MNIST and from 91.33 percent to 6.98 percent on CIFAR-10 after encryption, indicating strong resistance to neural-network-based inference. Lookup-table-based inverse mapping enables recovery for authorized receivers using the same symmetric key. Based on HSPICE simulation, the SecurePix achieves a per-pixel programming power-delay product of 17 uW us and a per-pixel sensing power-delay product of 1.25 uW us, demonstrating low-overhead hardware-level protection.
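"基于查找表的逆映射恢复"可以用一个简化的数字域示意来理解:对称密钥决定每个灰度级的映射,授权接收方用逆表还原(忽略了论文中 FeFET 多域极化态的模拟域实现,密钥生成方式为虚构):

```python
import numpy as np

rng = np.random.default_rng(42)                  # 对称密钥:双方共享的随机种子(假设)
lut = rng.permutation(256).astype(np.uint8)      # 加密查找表:灰度级 -> 加密灰度级
inv_lut = np.empty_like(lut)
inv_lut[lut] = np.arange(256, dtype=np.uint8)    # 逆映射表,仅授权接收方持有

img = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)
enc = lut[img]          # "像素内"加密:像素值在读出前即被逐像素映射
dec = inv_lut[enc]      # 接收方用同一密钥恢复原图
print(np.array_equal(dec, img))  # True
```

与此数字示意不同,SecurePix 在像素电路内以模拟传输特性直接完成该映射,因此任何读出线路上都不会出现明文像素值。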

[CV-113] Simultaneous Dual-View Mammogram Synthesis Using Denoising Diffusion Probabilistic Models

【速读】:该论文旨在解决乳腺癌筛查中因数据集缺乏完整配对的上下位(craniocaudal, CC)与内外斜位(mediolateral oblique, MLO)乳腺X线图像,从而限制依赖跨视图一致性的算法开发的问题。解决方案的关键在于提出一种三通道去噪扩散概率模型(denoising diffusion probabilistic model, DDPM),其中两个通道分别存储CC和MLO视图,第三个通道编码两视图间的绝对差异以引导模型学习投影间的解剖一致性关系。这种差异引导机制显著提升了合成图像的几何结构保真度,使生成的CC-MLO双视图对在分布上接近真实数据,并具备良好的跨视图对齐质量,为乳腺影像中的数据增强和未来跨视图感知的人工智能应用提供了可行路径。

链接: https://arxiv.org/abs/2604.05110
作者: Jorge Alberto Garza-Abdala,Gerardo A. Fumagal-González,Eduardo de Avila-Armenta,Sadam Hussain,Jasiel H. Toscano-Martínez,Diana S. M. Rosales Gurmendi,Alma A. Pedro-Pérez,Jose G. Tamez-Pena
机构: Tecnologico de Monterrey (蒙特雷理工学院); School of Engineering and Sciences (工程与科学学院); School of Engineering (工程学院); Pontificia Universidad Católica de Chile (智利天主教大学); School of Medicine and Health Sciences (医学与健康科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted and presented at SPIE Medical Imaging 2025 (Vancouver, Canada)

点击查看摘要

Abstract:Breast cancer screening relies heavily on mammography, where the craniocaudal (CC) and mediolateral oblique (MLO) views provide complementary information for diagnosis. However, many datasets lack complete paired views, limiting the development of algorithms that depend on cross-view consistency. To address this gap, we propose a three-channel denoising diffusion probabilistic model capable of simultaneously generating CC and MLO views of a single breast. In this configuration, the two mammographic views are stored in separate channels, while a third channel encodes their absolute difference to guide the model toward learning coherent anatomical relationships between projections. A pretrained DDPM from Hugging Face was fine-tuned on a private screening dataset and used to synthesize dual-view pairs. Evaluation included geometric consistency via automated breast mask segmentation and distributional comparison with real images, along with qualitative inspection of cross-view alignment. The results show that the difference-based encoding helps preserve the global breast structure across views, producing synthetic CC-MLO pairs that resemble real acquisitions. This work demonstrates the feasibility of simultaneous dual-view mammogram synthesis using a difference-guided DDPM, highlighting its potential for dataset augmentation and future cross-view-aware AI applications in breast imaging.
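论文的三通道编码(CC 视图、MLO 视图、两者绝对差)可示意如下(用随机图像代替真实乳腺 X 线片,仅说明输入张量的构造方式):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 64, 64
cc = rng.random((H, W)).astype(np.float32)    # CC 视图(模拟)
mlo = rng.random((H, W)).astype(np.float32)   # MLO 视图(模拟)

# 第三通道编码两视图的绝对差,引导扩散模型学习投影间的一致性关系
x = np.stack([cc, mlo, np.abs(cc - mlo)], axis=0)   # (3, H, W),作为 DDPM 的训练样本
print(x.shape)
```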

[CV-114] SVAgent : Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration CVPR2026

【速读】:该论文旨在解决视频问答(VideoQA)任务中现有方法依赖局部帧定位而非基于连贯故事线进行推理的问题,从而导致模型难以实现类人级的语境理解与逻辑推理。其解决方案的关键在于提出SVAgent框架,该框架采用多智能体协同机制: storyline agent 通过分析历史失败来逐步构建叙事表示,cross-modal decision agents 在演化故事线的引导下分别从视觉和文本模态独立预测答案,最终由meta-agent对跨模态预测结果进行一致性评估与融合,从而增强推理的鲁棒性和答案的一致性。

链接: https://arxiv.org/abs/2604.05079
作者: Zhongyu Yang,Zuhao Yang,Shuo Zhan,Tan Yue,Wei Pang,Yingfang Yuan
机构: Heriot-Watt University (赫瑞-瓦特大学); Nanyang Technological University (南洋理工大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in CVPR2026

点击查看摘要

Abstract:Video question answering (VideoQA) is a challenging task that requires integrating spatial, temporal, and semantic information to capture the complex dynamics of video sequences. Although recent advances have introduced various approaches for video understanding, most existing methods still rely on locating relevant frames to answer questions rather than reasoning through the evolving storyline as humans do. Humans naturally interpret videos through coherent storylines, an ability that is crucial for making robust and contextually grounded predictions. To address this gap, we propose SVAgent, a storyline-guided cross-modal multi-agent framework for VideoQA. The storyline agent progressively constructs a narrative representation based on frames suggested by a refinement suggestion agent that analyzes historical failures. In addition, cross-modal decision agents independently predict answers from visual and textual modalities under the guidance of the evolving storyline. Their outputs are then evaluated by a meta-agent to align cross-modal predictions and enhance reasoning robustness and answer consistency. Experimental results demonstrate that SVAgent achieves superior performance and interpretability by emulating human-like storyline reasoning in video understanding.

[CV-115] Part-Level 3D Gaussian Vehicle Generation with Joint and Hinge Axis Estimation IROS2026

【速读】:该论文旨在解决自动驾驶仿真中车辆模型缺乏部件级可动性(articulation)的问题,现有框架通常将车辆视为刚体,无法准确模拟如车轮转向、车门开合等动态行为,而感知算法日益依赖此类动力学信息。为实现高保真、可动画化的车辆建模,作者提出一种从单张或稀疏多视角图像出发的生成式方法,构建可动画的3D Gaussian车辆表示。解决方案的关键在于两个创新模块:一是部件边界精修模块(part-edge refinement module),通过强制每个高斯点仅属于单一部件来减少动画时部件边界处的形变失真;二是运动学推理头(kinematic reasoning head),用于预测可动部件的关节位置与旋转轴,从而提供运动所需的参数。二者协同实现了从静态生成到可动画车辆模型的跨越。

链接: https://arxiv.org/abs/2604.05070
作者: Shiyao Qian,Yuan Ren,Dongfeng Bai,Bingbing Liu
机构: Huawei Noah’s Ark Lab(华为诺亚方舟实验室); University of Toronto(多伦多大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: submitted to IROS 2026

点击查看摘要

Abstract:Simulation is essential for autonomous driving, yet current frameworks often model vehicles as rigid assets and fail to capture part-level articulation. With perception algorithms increasingly leveraging dynamics such as wheel steering or door opening, realistic simulation requires animatable vehicle representations. Existing CAD-based pipelines are limited by library coverage and fixed templates, preventing faithful reconstruction of in-the-wild instances. We propose a generative framework that, from a single image or sparse multi-view input, synthesizes an animatable 3D Gaussian vehicle. Our method addresses two challenges: (i) large 3D asset generators are optimized for static quality but not articulation, leading to distortions at part boundaries when animated; and (ii) segmentation alone cannot provide the kinematic parameters required for motion. To overcome this, we introduce a part-edge refinement module that enforces exclusive Gaussian ownership and a kinematic reasoning head that predicts joint positions and hinge axes of movable parts. Together, these components enable faithful part-aware simulation, bridging the gap between static generation and animatable vehicle models.

[CV-116] R3PM-Net: Real-time Robust Real-world Point Matching Network CVPR

【速读】:该论文旨在解决当前基于深度学习的点云配准(Point Cloud Registration, PCR)方法在真实工业场景中泛化能力不足的问题,尤其是这些方法通常在干净、密集且合成的数据集上训练和评估,难以应对实际应用中常见的噪声、遮挡、不完整扫描等挑战。其解决方案的关键在于提出了一种轻量级、全局感知且面向物体级别的点匹配网络 R3PM-Net,该网络兼顾了高精度与实时效率,并通过构建两个新数据集 Sioux-Cranfield 和 Sioux-Scans 来支持对摄影测量和事件相机扫描数据与数字 CAD 模型之间配准任务的评估。实验表明,R3PM-Net 在 ModelNet40 上达到完美拟合分数(fitness = 1)和 0.029 cm 的内点均方根误差(inlier RMSE),仅需 0.007 秒,速度约为当前最优方法 RegTR 的 7 倍;在更具挑战性的 Sioux-Scans 数据集上也能在 50 ms 内处理边缘案例,验证了其在工业场景中兼具鲁棒性与实时性的优势。

链接: https://arxiv.org/abs/2604.05060
作者: Yasaman Kashefbahrami,Erkut Akdag,Panagiotis Meletis,Evgeniya Balmashnova,Dip Goswami,Egor Bondarau
机构: Eindhoven University of Technology (埃因霍温理工大学); Sioux Technologies (苏克斯科技公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to CVPRW 2026 (Oral), Code and datasets at this https URL

点击查看摘要

Abstract:Accurate Point Cloud Registration (PCR) is an important task in 3D data processing, involving the estimation of a rigid transformation between two point clouds. While deep-learning methods have addressed key limitations of traditional non-learning approaches, such as sensitivity to noise, outliers, occlusion, and initialization, they are developed and evaluated on clean, dense, synthetic datasets (limiting their generalizability to real-world industrial scenarios). This paper introduces R3PM-Net, a lightweight, global-aware, object-level point matching network designed to bridge this gap by prioritizing both generalizability and real-time efficiency. To support this transition, two datasets, Sioux-Cranfield and Sioux-Scans, are proposed. They provide an evaluation ground for registering imperfect photogrammetric and event-camera scans to digital CAD models, and have been made publicly available. Extensive experiments demonstrate that R3PM-Net achieves competitive accuracy with unmatched speed. On ModelNet40, it reaches a perfect fitness score of 1 and inlier RMSE of 0.029 cm in only 0.007 s, approximately 7 times faster than the state-of-the-art method RegTR. This performance carries over to the Sioux-Cranfield dataset, maintaining a fitness of 1 and inlier RMSE of 0.030 cm with similarly low latency. Furthermore, on the highly challenging Sioux-Scans dataset, R3PM-Net successfully resolves edge cases in under 50 ms. These results confirm that R3PM-Net offers a robust, high-speed solution for critical industrial applications, where precision and real-time performance are indispensable. The code and datasets are available at this https URL.
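The fitness and inlier-RMSE figures reported above are standard registration metrics. A minimal sketch of how they are typically computed, following the Open3D-style definitions (the threshold value and the brute-force nearest-neighbor search are illustrative assumptions, not details from the paper):

```python
import numpy as np

def registration_metrics(source, target, threshold):
    """Compute fitness and inlier RMSE for an aligned source cloud.

    fitness     = fraction of source points whose nearest target point
                  lies within `threshold` (1.0 means every point matched)
    inlier RMSE = root-mean-square distance over those inlier pairs
    """
    # Brute-force nearest-neighbor distances (fine for a small sketch).
    diffs = source[:, None, :] - target[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)
    inliers = dists < threshold
    fitness = float(inliers.mean())
    rmse = float(np.sqrt((dists[inliers] ** 2).mean())) if inliers.any() else float("inf")
    return fitness, rmse

# Toy example: a perfectly aligned cloud gives fitness 1.0, RMSE 0.0.
cloud = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
fit, rmse = registration_metrics(cloud, cloud.copy(), threshold=0.05)
```

In practice a KD-tree replaces the brute-force search, but the two reported numbers reduce to exactly this pair of statistics.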

[CV-117] ID-Sim: An Identity-Focused Similarity Metric

[Quick Read]: This paper addresses the gap between vision models and humans' fine-grained selective sensitivity to identity, which holds up even across challenging changes in viewpoint, lighting, and other context; progress on tasks such as personalized image generation is further slowed by the lack of reliable identity-aware evaluation metrics. The key is ID-Sim, a feed-forward metric trained on a curated mix of high-quality real-world images and generated synthetic data that provides controlled, fine-grained identity and contextual variation, and validated on a unified benchmark for consistency with human annotations, so that it more faithfully reflects human sensitivity to identity.

Link: https://arxiv.org/abs/2604.05039
Authors: Julia Chae,Nicholas Kolkin,Jui-Hsien Wang,Richard Zhang,Sara Beery,Cusuh Ham
Institutions: MIT CSAIL; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: SB and CH equal advising; Project page this https URL

Abstract:Humans have remarkable selective sensitivity to identities – easily distinguishing between highly similar identities, even across significantly different contexts such as diverse viewpoints or lighting. Vision models have struggled to match this capability, and progress toward identity-focused tasks such as personalized image generation is slowed by a lack of identity-focused evaluation metrics. To help facilitate progress, we propose ID-Sim, a feed-forward metric designed to faithfully reflect human selective sensitivity. To build ID-Sim, we curate a high-quality training set of images spanning diverse real-world domains, augmented with generative synthetic data that provides controlled, fine-grained identity and contextual variations. We evaluate our metric on a new unified evaluation benchmark for assessing consistency with human annotations across identity-focused recognition, retrieval, and generative tasks.
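Feed-forward similarity metrics of this general shape typically reduce to comparing learned identity embeddings. A generic cosine-similarity sketch of that idea (the embedding vectors here are placeholders; ID-Sim's actual network and score calibration are not specified in the abstract):

```python
import numpy as np

def id_similarity(embed_a, embed_b):
    """Cosine similarity between two identity embeddings, remapped to
    [0, 1] so that 1 means 'same identity' and 0 'maximally different'."""
    a = embed_a / np.linalg.norm(embed_a)
    b = embed_b / np.linalg.norm(embed_b)
    return float((1.0 + a @ b) / 2.0)

# Parallel embeddings (same identity under a context change) score 1;
# opposed embeddings score 0.
same = id_similarity(np.array([1.0, 2.0, 0.5]), np.array([2.0, 4.0, 1.0]))
diff = id_similarity(np.array([1.0, 0.0, 0.0]), np.array([-1.0, 0.0, 0.0]))
```

The hard part, and the paper's contribution, is training the embedding so that context changes (viewpoint, lighting) move vectors little while identity changes move them far.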

[CV-118] Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

[Quick Read]: This paper tackles a key problem with current video understanding benchmarks: the widening gap between inflated leaderboard scores and real model capability. The solution rests on two innovations. First, a progressive tri-level hierarchy raises task complexity step by step, from multi-point visual information aggregation, to temporal dynamics modeling, to complex multimodal reasoning, giving a systematic assessment of model ability. Second, a group-based non-linear evaluation strategy enforces consistency across related questions and coherence in multi-step reasoning, penalizing fragmented or guess-based correctness and crediting only answers backed by valid reasoning. Together these markedly improve the rigor of the evaluation and how precisely it characterizes true model capability.

Link: https://arxiv.org/abs/2604.05015
Authors: Chaoyou Fu,Haozhi Yuan,Yuhao Dong,Yi-Fan Zhang,Yunhang Shen,Xiaoxing Hu,Xueying Li,Jinsen Su,Chengwu Long,Xiaoyao Xie,Yongkang Xie,Xiawu Zheng,Xue Yang,Haoyu Cao,Yunsheng Wu,Ziwei Liu,Xing Sun,Caifeng Shan,Ran He
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Homepage: this https URL

Abstract:With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a **progressive tri-level hierarchy** that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. Besides, in contrast to conventional per-question accuracy, we propose a **group-based non-linear evaluation strategy** that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers. Backed by **3,300** human-hours and up to **5** rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between current best model Gemini-3-Pro and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.
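The all-or-nothing flavor of the group-based non-linear scoring described above can be sketched in a few lines (the actual Video-MME-v2 rubric is richer; per-question binary credit and the group-gating rule here are illustrative assumptions):

```python
def group_score(groups):
    """Score question groups non-linearly: a group earns credit only if
    every related question in it is answered correctly, so fragmented or
    guess-based correctness inside a group is worth nothing.

    `groups` maps a group id to a list of per-question booleans.
    Returns (non-linear accuracy, conventional per-question accuracy).
    """
    group_credit = [all(answers) for answers in groups.values()]
    nonlinear = sum(group_credit) / len(groups)
    flat = [a for answers in groups.values() for a in answers]
    per_question = sum(flat) / len(flat)
    return nonlinear, per_question

# Two of three questions right in group "g1": per-question accuracy looks
# decent, but the group earns no credit under the non-linear rule.
nl, pq = group_score({"g1": [True, True, False], "g2": [True, True]})
```

This is exactly the mechanism that deflates guess-inflated leaderboard numbers: inconsistent answers within a group zero out its credit.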

[CV-119] StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

[Quick Read]: This paper addresses the incompatible architectures, fragmented codebases, and inconsistent evaluation protocols that currently hamper comparability and reproducibility in Vision-Language-Action (VLA) research. The key is StarVLA, an open-source codebase built on three core designs: a modular backbone–action-head architecture supporting both vision-language-model (VLM) and world-model backbones, each independently swappable; reusable training strategies such as cross-embodiment learning and multimodal co-training that apply consistently across paradigms; and integration of major benchmarks (e.g., LIBERO, RoboCasa-GR1, and BEHAVIOR-1K) through a unified evaluation interface supporting both simulation and real-robot deployment, substantially improving reproducibility and development efficiency.

Link: https://arxiv.org/abs/2604.05014
Authors: StarVLA Community
Institutions: Von Neumann Institute, HKUST
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: Open-source VLA infra, Technical Report

Abstract:Building generalist embodied agents requires integrating perception, language understanding, and action, which are core capabilities addressed by Vision-Language-Action (VLA) approaches based on multimodal foundation models, including recent advances in vision-language models and world models. Despite rapid progress, VLA methods remain fragmented across incompatible architectures, codebases, and evaluation protocols, hindering principled comparison and reproducibility. We present StarVLA, an open-source codebase for VLA research. StarVLA addresses these challenges in three aspects. First, it provides a modular backbone–action-head architecture that supports both VLM backbones (e.g., Qwen-VL) and world-model backbones (e.g., Cosmos) alongside representative action-decoding paradigms, all under a shared abstraction in which backbone and action head can each be swapped independently. Second, it provides reusable training strategies, including cross-embodiment learning and multimodal co-training, that apply consistently across supported paradigms. Third, it integrates major benchmarks, including LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, and BEHAVIOR-1K, through a unified evaluation interface that supports both simulation and real-robot deployment. StarVLA also ships simple, fully reproducible single-benchmark training recipes that, despite minimal data engineering, already match or surpass prior methods on multiple benchmarks with both VLM and world-model backbones. To our best knowledge, StarVLA is one of the most comprehensive open-source VLA frameworks available, and we expect it to lower the barrier for reproducing existing methods and prototyping new ones. StarVLA is being actively maintained and expanded; we will update this report as the project evolves. The code and documentation are available at this https URL.

[CV-120] RCP: Representation Consistency Pruner for Mitigating Distribution Shift in Large Vision-Language Models

[Quick Read]: This paper addresses the prohibitive inference cost that Large Vision-Language Models (LVLMs) incur from processing massive numbers of visual tokens. Existing pruning methods often degrade performance sharply because irreversibly removing visual tokens shifts the distribution of hidden states. The key is the Representation Consistency Pruner (RCP), which combines cumulative visual token pruning with a delayed repair mechanism: a cross-attention pruner built on the LLM's intrinsic attention predicts cumulative masks, ensuring consistent and monotonic token reduction across layers; a Delayed Repair Adapter (DRA) caches the essence of pruned tokens and modulates answer-generation tokens via FiLM; and a repair loss matches the first- and second-order statistics of representations before and after pruning, achieving efficient pruning with minimal accuracy loss and no fine-tuning of the original model.

Link: https://arxiv.org/abs/2604.04972
Authors: Jianwei Zhang,Chaoning Zhang,Sihan Cao,Wang Liu,Pengcheng Zheng,Jiaxin Huang,Caiyan Qin,Yalan Ye,Wei Dong,Yang Yang
Institutions: University of Electronic Science and Technology of China; Mohamed bin Zayed University of Artificial Intelligence; Harbin Institute of Technology, Shenzhen; Xi’an University of Architecture and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Large Vision-Language Models (LVLMs) suffer from prohibitive inference costs due to the massive number of visual tokens processed by the language decoder. Existing pruning methods often lead to significant performance degradation because the irreversible removal of visual tokens causes a distribution shift in the hidden states that deviates from the pre-trained full-token regime. To address this, we propose Representation Consistency Pruner, which we refer to as RCP, as a novel framework that integrates cumulative visual token pruning with a delayed repair mechanism. Specifically, we introduce a cross-attention pruner that leverages the intrinsic attention of the LLM as a baseline to predict cumulative masks, ensuring consistent and monotonic token reduction across layers. To compensate for the resulting information loss, we design a delayed repair adapter denoted as DRA, which caches the essence of pruned tokens and applies FiLM-based modulation specifically to the answer generation tokens. We employ a repair loss to match the first and second-order statistics of the pruned representations with a full-token teacher. RCP is highly efficient because it trains only lightweight plug-in modules while allowing for physical token discarding at inference. Extensive experiments on LVLM benchmarks demonstrate that RCP removes up to 88.9% of visual tokens and reduces FLOPs by up to 85.7% with only a marginal average accuracy drop, and outperforms prior methods that avoid fine-tuning the original model on several widely used benchmarks.
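The repair loss that matches first- and second-order statistics against a full-token teacher can be sketched as a mean/variance-matching objective (a minimal NumPy illustration of the idea; the paper's exact formulation and weighting are not given in the abstract):

```python
import numpy as np

def repair_loss(pruned, teacher):
    """Match the per-channel mean (first-order) and variance
    (second-order) of pruned-student representations to the
    full-token teacher's statistics."""
    mean_term = np.mean((pruned.mean(axis=0) - teacher.mean(axis=0)) ** 2)
    var_term = np.mean((pruned.var(axis=0) - teacher.var(axis=0)) ** 2)
    return float(mean_term + var_term)

rng = np.random.default_rng(0)
teacher = rng.normal(size=(64, 8))   # full-token hidden states
matched = teacher[::2]               # subsampled tokens: statistics stay close
shifted = teacher[::2] + 1.0         # distribution shift: channel means move

loss_matched = repair_loss(matched, teacher)
loss_shifted = repair_loss(shifted, teacher)
```

The loss is near zero when pruning preserves the hidden-state distribution and grows with the very distribution shift the paper identifies as the failure mode of irreversible pruning.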

[CV-121] CI-ICM: Channel Importance-driven Learned Image Coding for Machines

[Quick Read]: This paper addresses the poor performance of human-vision-centric image compression on machine vision tasks, since human and machine vision differ markedly in feature representation and perceptual characteristics. The key is Channel Importance-driven learned Image Coding for Machines (CI-ICM), built on three core modules: a Channel Importance Generation (CIG) module that quantifies per-channel importance for machine vision and introduces a channel-order loss; a Feature Channel Grouping and Scaling (FCGS) module that groups channels non-uniformly by importance and adjusts each group's dynamic range; and a Channel Importance-based Context (CI-CTX) module that optimizes bit allocation across feature groups while preserving high fidelity in critical channels. A Task-Specific Channel Adaptation (TSCA) module further adapts the codec to multiple downstream machine vision tasks. On COCO2017, CI-ICM achieves BD-mAP@50:95 gains of 16.25% in object detection and 13.72% in instance segmentation over the baseline codec, confirming its effectiveness and practicality.

Link: https://arxiv.org/abs/2604.05347
Authors: Yun Zhang,Junle Liu,Huan Zhang,Zhaoqing Pan,Gangyi Jiang,Weisi Lin
Institutions: Sun Yat-Sen University; Guangdong University of Technology; Tianjin University; Ningbo University; Nanyang Technological University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Notes:

Abstract:Traditional human vision-centric image compression methods are suboptimal for machine vision centric compression due to different visual properties and feature characteristics. To address this problem, we propose a Channel Importance-driven learned Image Coding for Machines (CI-ICM), aiming to maximize the performance of machine vision tasks at a given bitrate constraint. First, we propose a Channel Importance Generation (CIG) module to quantify channel importance in machine vision and develop a channel order loss to rank channels in descending order. Second, to properly allocate bitrate among feature channels, we propose a Feature Channel Grouping and Scaling (FCGS) module that non-uniformly groups the feature channels based on their importance and adjusts the dynamic range of each group. Based on FCGS, we further propose a Channel Importance-based Context (CI-CTX) module to allocate bits among feature groups and to preserve higher fidelity in critical channels. Third, to adapt to multiple machine tasks, we propose a Task-Specific Channel Adaptation (TSCA) module to adaptively enhance features for multiple downstream machine tasks. Experimental results on the COCO2017 dataset show that the proposed CI-ICM achieves BD-mAP@50:95 gains of 16.25 % in object detection and 13.72 % in instance segmentation over the established baseline codec. Ablation studies validate the effectiveness of each contribution, and computation complexity analysis reveals the practicability of the CI-ICM. This work establishes feature channel optimization for machine vision-centric compression, bridging the gap between image coding and machine perception.
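The non-uniform, importance-driven channel grouping behind FCGS can be sketched as a rank-then-split operation (group sizes and the descending sort are illustrative assumptions; the paper's learned grouping and per-group scaling are more involved):

```python
def group_channels(importance, group_sizes):
    """Rank channels by descending importance and split them into
    non-uniform groups, so the most task-critical channels land in the
    first (highest-fidelity, highest-bitrate) group."""
    assert sum(group_sizes) == len(importance)
    order = sorted(range(len(importance)), key=lambda c: -importance[c])
    groups, start = [], 0
    for size in group_sizes:
        groups.append(order[start:start + size])
        start += size
    return groups

# Eight channels; a small first group reserves fidelity for the most
# important channels, matching the channel-order loss's descending rank.
scores = [0.9, 0.1, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4]
groups = group_channels(scores, group_sizes=[2, 2, 4])
```

Bit allocation (CI-CTX) then spends more of the budget on the earlier groups and less on the tail.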

Artificial Intelligence

[AI-0] Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization INTERSPEECH2026

[Quick Read]: This paper addresses the shortage of both training data and evaluation for long-context audio reasoning: mainstream benchmarks focus on short-context tasks, while the open-ended generation tasks most relevant to long-context reasoning are hard to evaluate automatically. The key is an end-to-end synthetic data generation pipeline with three core stages: persona-driven dialogue generation; multi-speaker audio synthesis with overlap/pause modeling, room acoustics, and sound events; and LLM-based production of reference SOAP notes. The pipeline is built entirely on open-weight models, and the authors release 8,800 synthetic conversations with 1.3k hours of corresponding audio and reference notes, providing a high-quality training resource and a controlled evaluation environment for long-context audio understanding and generation.

Link: https://arxiv.org/abs/2604.06138
Authors: Yanis Labrak,David Grünert,Séverin Baroudi,Jiyun Chun,Pawel Cyrta,Sergio Burdisso,Ahmed Hassoon,David Liu,Adam Rothschild,Reed Van Deusen,Petr Motlicek,Andrew Perrault,Ricard Marxer,Thomas Schaaf
Institutions: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Notes: Submitted for review at Interspeech 2026

Abstract:Long-context audio reasoning is underserved in both training data and evaluation. Existing benchmarks target short-context tasks, and the open-ended generation tasks most relevant to long-context reasoning pose well-known challenges for automatic evaluation. We propose a synthetic data generation pipeline designed to serve both as a training resource and as a controlled evaluation environment, and instantiate it for first-visit doctor-patient conversations with SOAP note generation as the task. The pipeline has three stages: persona-driven dialogue generation; multi-speaker audio synthesis with overlap/pause modeling, room acoustics, and sound events; and LLM-based reference SOAP note production, built entirely on open-weight models. We release 8,800 synthetic conversations with 1.3k hours of corresponding audio and reference notes. Evaluating current open-weight systems, we find that cascaded approaches still substantially outperform end-to-end models.

[AI-1] Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

[Quick Read]: This paper targets three critical limitations of existing benchmarks for LLMs deployed as autonomous agents executing multi-step workflows in real software environments: (1) trajectory-opaque grading that checks only final outputs while ignoring intermediate behavior; (2) underspecified safety and robustness evaluation; and (3) narrow modality coverage and interaction paradigms. The key is Claw-Eval, an end-to-end evaluation suite that records every agent action through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. It scores Completion, Safety, and Robustness, and reports Average Score, Pass@k, and Pass^k over three trials to distinguish genuine capability from lucky outcomes, systematically improving the reliability and interpretability of agent evaluation.

Link: https://arxiv.org/abs/2604.06132
Authors: Bowen Ye,Rang Li,Qibin Yang,Yuanxin Liu,Linli Yao,Hanglong Lv,Zhihui Xie,Chenxin An,Lei Li,Lingpeng Kong,Qi Liu,Zhifang Sui,Tong Yang
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing poorer on video than on document or image, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.
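The three reported statistics follow directly from per-trial outcomes. A minimal sketch (Claw-Eval's rubric-weighted scores are richer than the booleans used here, so treat this as the standard definitions rather than the paper's exact scoring code):

```python
def trial_statistics(trials):
    """Summarize k repeated trials of one task.

    average  = mean per-trial score
    pass@k   = 1 if at least one trial passed (lucky success counts)
    pass^k   = 1 if every trial passed (consistent capability)
    """
    k = len(trials)
    average = sum(trials) / k
    pass_at_k = int(any(trials))
    pass_hat_k = int(all(trials))
    return average, pass_at_k, pass_hat_k

# A flaky agent: succeeds once in three trials. Pass@3 rewards the lucky
# hit; Pass^3 exposes the inconsistency.
avg, p_at, p_hat = trial_statistics([True, False, False])
```

The gap between Pass@k and Pass^k is precisely the "consistency vs. peak capability" signal the paper uses in its error-injection analysis.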

[AI-2] Gym-Anything: Turn any Software into an Agent Environment

[Quick Read]: This paper addresses the environment-construction bottleneck in computer-use agent research: building environments for complex software demands heavy manual effort and does not scale, limiting agents' reach into real economic activity. The core of the solution, Gym-Anything, frames environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software while producing verifiable evidence of correct setup, and an independent audit agent checks that evidence against a quality checklist. This enables automated, scalable conversion of arbitrary software into agent environments and underpins the CUA-World benchmark of over 10,000 long-horizon tasks, including CUA-World-Long, a highly challenging benchmark whose tasks often require more than 500 steps.

Link: https://arxiv.org/abs/2604.06126
Authors: Pranjal Aggarwal,Graham Neubig,Sean Welleck
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2× its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer-use agents.

[AI-3] Artificial Intelligence and the Structure of Mathematics

[Quick Read]: This paper asks how AI can be used to autonomously discover new mathematical concepts and understand the global structure of mathematics, bearing on the philosophical question of whether mathematics is discovered or invented. The key is a new research route grounded in formal proof: characterizing the global structure of mathematics through universal proofs and structural hypergraphs, and designing AI models for automated mathematical discovery that must satisfy a set of explicit criteria for effectively exploring Platonic mathematical worlds.

Link: https://arxiv.org/abs/2604.06107
Authors: Maissam Barkeshli,Michael R. Douglas,Michael H. Freedman
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); History and Overview (math.HO); Logic (math.LO)
Notes: 45 pages

Abstract:Recent progress in artificial intelligence (AI) is unlocking transformative capabilities for mathematics. There is great hope that AI will help solve major open problems and autonomously discover new mathematical concepts. In this essay, we further consider how AI may open a grand perspective on mathematics by forging a new route, complementary to mathematical **logic**, to understanding the global structure of formal **proofs**. We begin by providing a sketch of the formal structure of mathematics in terms of universal proof and structural hypergraphs and discuss questions this raises about the foundational structure of mathematics. We then outline the main ingredients and provide a set of criteria to be satisfied for AI models capable of automated mathematical discovery. As we send AI agents to traverse Platonic mathematical worlds, we expect they will teach us about the nature of mathematics: both as a whole, and the small ribbons conducive to human understanding. Perhaps they will shed light on the old question: "Is mathematics discovered or invented?" Can we grok the terrain of these **Platonic** worlds?

[AI-4] LLM4CodeRE: Generative AI for Code Decompilation Analysis and Reverse Engineering

[Quick Read]: This paper targets the challenging problems of assembly-to-source decompilation and source-to-assembly translation in malware reverse engineering, where current large language models (LLMs) lack adaptation to the malware domain. The key is LLM4CodeRE, a unified domain-adaptive LLM framework whose core innovations are two complementary fine-tuning strategies: (i) a Multi-Adapter approach for task-specific syntactic and semantic alignment, and (ii) a Seq2Seq Unified approach that enforces end-to-end generation constraints via task-conditioned prefixes, achieving effective bidirectional generalization and improved performance.

Link: https://arxiv.org/abs/2604.06095
Authors: Hamed Jelodar,Samita Bai,Tochukwu Emmanuel Nwankwo,Parisa Hamedi,Mohammad Meymani,Roozbeh Razavi-Far,Ali A. Ghorbani
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Notes:

Abstract:Code decompilation analysis is a fundamental yet challenging task in malware reverse engineering, particularly due to the pervasive use of sophisticated obfuscation techniques. Although recent large language models (LLMs) have shown promise in translating low-level representations into high-level source code, most existing approaches rely on generic code pretraining and lack adaptation to malicious software. We propose LLM4CodeRE, a domain-adaptive LLM framework for bidirectional code reverse engineering that supports both assembly-to-source decompilation and source-to-assembly translation within a unified model. To enable effective task adaptation, we introduce two complementary fine-tuning strategies: (i) a Multi-Adapter approach for task-specific syntactic and semantic alignment, and (ii) a Seq2Seq Unified approach using task-conditioned prefixes to enforce end-to-end generation constraints. Experimental results demonstrate that LLM4CodeRE outperforms existing decompilation tools and general-purpose code models, achieving robust bidirectional generalization.
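The Seq2Seq Unified strategy's task-conditioned prefixes amount to tagging each input so one model can serve both directions. A minimal sketch (the prefix token strings below are hypothetical placeholders, not the paper's actual vocabulary):

```python
def build_prompt(task, code):
    """Prepend a task-conditioning prefix so a single seq2seq model can
    serve both reverse-engineering directions. The prefix tokens are
    illustrative, not taken from LLM4CodeRE itself."""
    prefixes = {
        "decompile": "<asm2src>",   # assembly -> source
        "compile": "<src2asm>",     # source -> assembly
    }
    return f"{prefixes[task]} {code}"

prompt = build_prompt("decompile", "mov eax, 1; ret")
```

At fine-tuning time the model learns to condition its entire generation on this leading tag, which is what lets one set of weights cover both task directions.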

[AI-5] CritBench: A Framework for Evaluating Cybersecurity Capabilities of Large Language Models in IEC 61850 Digital Substation Environments

[Quick Read]: This paper addresses the lack of targeted frameworks for evaluating the cybersecurity capabilities of large language models (LLMs) in industrial control environments: existing evaluations focus on Information Technology (IT) settings and miss the constraints and protocols specific to Operational Technology (OT). The key is CritBench, a framework for assessing the security capabilities of LLM agents in IEC 61850 digital substation environments, together with a domain-specific tool scaffold for industrial protocol interaction, which improves performance on static analysis and single-tool network enumeration and mitigates the bottleneck in dynamic tasks caused by weak persistent sequential reasoning and state tracking.

Link: https://arxiv.org/abs/2604.06019
Authors: Gustav Keppler,Moritz Gstür,Veit Hagenmeyer
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Notes: 16 pages, 4 figures, 3 tables. Submitted to the 3rd ACM SIGEnergy Workshop on Cybersecurity and Privacy of Energy Systems (ACM EnergySP '26)

Abstract:The advancement of Large Language Models (LLMs) has raised concerns regarding their dual-use potential in cybersecurity. Existing evaluation frameworks overwhelmingly focus on Information Technology (IT) environments, failing to capture the constraints, and specialized protocols of Operational Technology (OT). To address this gap, we introduce CritBench, a novel framework designed to evaluate the cybersecurity capabilities of LLM agents within IEC 61850 Digital Substation environments. We assess five state-of-the-art models, including OpenAI’s GPT-5 suite and open-weight models, across a corpus of 81 domain-specific tasks spanning static configuration analysis, network traffic reconnaissance, and live virtual machine interaction. To facilitate industrial protocol interaction, we develop a domain-specific tool scaffold. Our empirical results show that agents reliably execute static structured-file analysis and single-tool network enumeration, but their performance degrades on dynamic tasks. Despite demonstrating explicit, internalized knowledge of the IEC 61850 standards terminology, current models struggle with the persistent sequential reasoning and state tracking required to manipulate live systems without specialized tools. Equipping agents with our domain-specific tool scaffold significantly mitigates this operational bottleneck. Code and evaluation scripts are available at: this https URL

[AI-6] How LLMs Follow Instructions: Skillful Coordination, Not a Universal Mechanism

[Quick Read]: This paper investigates a core question about instruction-tuned models: whether instruction-following rests on a universal mechanism, i.e., a single abstract constraint-checking process, or on flexible coordination of multiple specific skills across diverse tasks. The key evidence comes from diagnostic probing across nine tasks: (1) general probes underperform task-specific ones, indicating limited representational sharing; (2) cross-task transfer is weak and clusters by skill similarity; (3) causal ablation reveals sparse, asymmetric dependencies rather than shared representations; (4) task complexity stratifies across layer depth, with semantic tasks emerging in deep layers; and (5) temporally, constraint satisfaction behaves as dynamic monitoring during generation rather than pre-generation planning. Together these findings support viewing instruction-following as skillful coordination of diverse linguistic capabilities rather than a single abstract constraint mechanism.

Link: https://arxiv.org/abs/2604.06015
Authors: Elisabetta Rocchetti,Alfio Ferrara
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Instruction tuning is commonly assumed to endow language models with a domain-general ability to follow instructions, yet the underlying mechanism remains poorly understood. Does instruction-following rely on a universal mechanism or compositional skill deployment? We investigate this through diagnostic probing across nine diverse tasks in three instruction-tuned models. Our analysis provides converging evidence against a universal mechanism. First, general probes trained across all tasks consistently underperform task-specific specialists, indicating limited representational sharing. Second, cross-task transfer is weak and clustered by skill similarity. Third, causal ablation reveals sparse asymmetric dependencies rather than shared representations. Tasks also stratify by complexity across layers, with structural constraints emerging early and semantic tasks emerging late. Finally, temporal analysis shows constraint satisfaction operates as dynamic monitoring during generation rather than pre-generation planning. These findings indicate that instruction-following is better characterized as skillful coordination of diverse linguistic capabilities rather than deployment of a single abstract constraint-checking process.

[AI-7] Flowr – Scaling Up Retail Supply Chain Operations Through Agentic AI in Large-Scale Supermarket Chains

[Quick Read]: This paper addresses long-standing inefficiencies in retail supply chain operations at large supermarket chains, where demand forecasting, procurement, supplier coordination, and inventory replenishment involve highly repetitive, judgment-dependent, and slow-to-scale manual processes. The key is Flowr, a novel agentic AI framework whose core mechanism decomposes complex manual supply chain tasks into specialized AI agents, each with a clearly defined cognitive role, coordinated by a central reasoning LLM; a human-in-the-loop control interface built on the Model Context Protocol (MCP) lets supply chain managers supervise and intervene at critical points, achieving end-to-end automation while preserving accountability and organizational control.

Link: https://arxiv.org/abs/2604.05987
Authors: Eranga Bandara,Ross Gore,Sachin Shetty,Piumi Siyambalapitiya,Sachini Rajapakse,Isurunima Kularathna,Pramoda Karunarathna,Ravi Mukkamala,Peter Foytik,Safdar H. Bouk,Abdul Rahman,Xueping Liang,Amin Hass,Tharaka Hewa,Ng Wee Keong,Kasun De Zoysa,Aruna Withanage,Nilaan Loganathan
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Retail supply chain operations in supermarket chains involve continuous, high-volume manual workflows spanning demand forecasting, procurement, supplier coordination, and inventory replenishment, processes that are repetitive, decision-intensive, and difficult to scale without significant human effort. Despite growing investment in data analytics, the decision-making and coordination layers of these workflows remain predominantly manual, reactive, and fragmented across outlets, distribution centers, and supplier networks. This paper introduces Flowr, a novel agentic AI framework for automating end-to-end retail supply chain workflows in large-scale supermarket operations. Flowr systematically decomposes manual supply chain operations into specialized AI agents, each responsible for a clearly defined cognitive role, enabling automation of processes previously dependent on continuous human coordination. To ensure task accuracy and adherence to responsible AI principles, the framework employs a consortium of fine-tuned, domain-specialized large language models coordinated by a central reasoning LLM. Central to the framework is a human-in-the-loop orchestration model in which supply chain managers supervise and intervene across workflow stages via a Model Context Protocol (MCP)-enabled interface, preserving accountability and organizational control. Evaluation demonstrates that Flowr significantly reduces manual coordination overhead, improves demand-supply alignment, and enables proactive exception handling at a scale unachievable through manual processes. The framework was validated in collaboration with a large-scale supermarket chain and is domain-independent, offering a generalizable blueprint for agentic AI-driven supply chain automation across large-scale enterprise settings.

[AI-8] A Formal Security Framework for MCP-Based AI Agents: Threat Taxonomy, Verification Models, and Defense Mechanisms

[Quick Read]: This paper addresses the absence of a unified, formal security framework for the ecosystem of Model Context Protocol (MCP)-based AI agents: despite large-scale adoption, security research remains fragmented and cannot systematically identify, analyze, and mitigate the diverse threats involved. The key is MCPSHIELD, a comprehensive framework spanning threat taxonomy, formal verification, defense evaluation, and architecture: a hierarchical taxonomy of 7 threat categories and 23 attack vectors; a formal verification model based on labeled transition systems supporting static and runtime analysis; a comparative evaluation of 12 existing defense mechanisms that exposes coverage gaps; and a defense-in-depth reference architecture integrating capability-based access control, tool attestation, information-flow tracking, and runtime policy enforcement, achieving 91% theoretical coverage versus at most 34% for any single existing defense.

Link: https://arxiv.org/abs/2604.05969
Authors: Nirajan Acharya,Gaurav Kumar Gupta
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Notes:

Abstract:The Model Context Protocol (MCP), introduced by Anthropic in November 2024 and now governed by the Linux Foundation’s Agentic AI Foundation, has rapidly become the de facto standard for connecting large language model (LLM)-based agents to external tools and data sources, with over 97 million monthly SDK downloads and more than 177000 registered tools. However, this explosive adoption has exposed a critical gap: the absence of a unified, formal security framework capable of systematically characterizing, analyzing, and mitigating the diverse threats facing MCP-based agent ecosystems. Existing security research remains fragmented across individual attack papers, isolated benchmarks, and point defense mechanisms. This paper presents MCPSHIELD, a comprehensive formal security framework for MCP-based AI agents. We make four principal contributions: (1) a hierarchical threat taxonomy comprising 7 threat categories and 23 distinct attack vectors organized across four attack surfaces, grounded in the analysis of over 177000 MCP tools; (2) a formal verification model based on labeled transition systems with trust boundary annotations that enables static and runtime analysis of MCP tool interaction chains; (3) a systematic comparative evaluation of 12 existing defense mechanisms, identifying coverage gaps across our threat taxonomy; and (4) a defense in depth reference architecture integrating capability based access control, cryptographic tool attestation, information flow tracking, and runtime policy enforcement. Our analysis reveals that no existing single defense covers more than 34 percent of the identified threat landscape, whereas MCPSHIELD’s integrated architecture achieves theoretical coverage of 91 percent. We further identify seven open research challenges that must be addressed to secure the next generation of agentic AI systems.

[AI-9] Beyond Compromise: Pareto-Lenient Consensus for Efficient Multi-Preference LLM Alignment

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在多目标偏好对齐(Multi-Objective Preference Alignment, MPA)中因依赖静态线性加权或刚性梯度投影而导致的局部最优陷阱问题。现有方法往往过早收敛至保守的局部平衡点,牺牲了全局帕累托改进的可能性。其解决方案的关键在于提出一种基于博弈论的“帕累托宽容共识”(Pareto-Lenient Consensus, PLC)框架,通过引入由主导联盟盈余驱动的宽容梯度修正机制,在动态协商过程中允许局部性能暂时下降,从而突破局部次优均衡,探索更远端的帕累托前沿。理论分析与实验均表明,PLC能够有效实现僵局突破并渐近收敛至帕累托共识均衡,显著优于传统基线方法。

链接: https://arxiv.org/abs/2604.05965
作者: Renxuan Tan,Rongpeng Li,Zhifeng Zhao,Honggang Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transcending the single-preference paradigm, aligning LLMs with diverse human values is pivotal for robust deployment. Contemporary Multi-Objective Preference Alignment (MPA) approaches predominantly rely on static linear scalarization or rigid gradient projection to navigate these trade-offs. However, by enforcing strict conflict avoidance or simultaneous descent, these paradigms often prematurely converge to local stationary points. While mathematically stable, these points represent a conservative compromise where the model sacrifices potential global Pareto improvements to avoid transient local trade-offs. To break this deadlock, we propose Pareto-Lenient Consensus (PLC), a game-theoretic framework that reimagines alignment as a dynamic negotiation process. Unlike rigid approaches, PLC introduces consensus-driven lenient gradient rectification, which dynamically tolerates local degradation provided there is a sufficient dominant coalition surplus, thereby empowering the optimization trajectory to escape local suboptimal equilibrium and explore the distal Pareto-optimal frontier. Theoretical analysis validates PLC can facilitate stalemate escape and asymptotically converge to a Pareto consensus equilibrium. Moreover, extensive experiments show that PLC surpasses baselines in both fixed-preference alignment and global Pareto frontier quality. This work highlights the potential of negotiation-driven alignment as a promising avenue for MPA. Our codes are available at this https URL.
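
下面用一个极简的数值示意(假设性实现:采用均匀权重,并以“各目标一阶改进之和”作为主导联盟盈余的简化定义,并非论文原始算法)来说明“宽容共识”的核心机制:只要联合方向的总体盈余为正,即使个别目标沿该方向暂时变差,也接受这次更新。

```python
import numpy as np

def lenient_consensus_step(grads, params, lr=0.1, surplus_tol=0.0):
    """对多目标梯度做“宽容共识”更新的示意:
    仅当联合方向的总盈余(各目标一阶改进之和)超过阈值时才接受更新,
    从而允许个别目标局部退化, 避免在保守折中点上停滞。"""
    joint = np.mean(grads, axis=0)                 # 联合下降方向(均匀权重)
    improvements = np.array([g @ joint for g in grads])  # 各目标沿该方向的一阶改进
    surplus = improvements.sum()                   # 主导联盟盈余(示意性定义)
    if surplus > surplus_tol:                      # 盈余为正: 接受更新
        return params - lr * joint, True
    return params, False                           # 否则保持原地(僵局)

# 两个玩具目标: f1 = (x-1)^2, f2 = (x+1)^2
x = np.array([3.0])
g1, g2 = 2 * (x - 1), 2 * (x + 1)                  # 在 x=3 处两梯度同号, 盈余为正
x_new, accepted = lenient_consensus_step([g1, g2], x)
```

该示意中更新被接受,x 向两目标的共识区间移动;真实的 PLC 还包含动态协商与收敛性保证,详见原文。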

[AI-10] Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的代码修复代理在评估时仅依赖测试通过率(test pass rate),而忽视了实际项目中隐含的设计约束(design constraints)问题。这些设计约束包括架构规范、错误处理策略和可维护性要求等,通常未被测试覆盖,而是存在于代码审查讨论中。解决方案的关键在于提出一个名为 \bench 的新基准,它通过挖掘真实拉取请求(pull request)中的设计约束并将其显式化、结构化,结合LLM驱动的验证器自动检测补丁对设计约束的符合程度,从而实现对修复质量的更全面评估。实验表明,功能正确性与设计合规性几乎无统计关联,凸显出当前代理在设计感知能力上的显著不足,推动评估范式从单纯的功能正确转向设计意识导向的评价体系。

链接: https://arxiv.org/abs/2604.05955
作者: Kai Yu,Zhenhao Zhou,Junhao Zeng,Ying Wang,Xueying Du,Zhiqiang Yuan,Junwei Liu,Ziyu Zhou,Yujia Wang,Chong Wang,Xin Peng
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Repository-level issue resolution benchmarks have become a standard testbed for evaluating LLM-based agents, yet success is still predominantly measured by test pass rates. In practice, however, acceptable patches must also comply with project-specific design constraints, such as architectural conventions, error-handling policies, and maintainability requirements, which are rarely encoded in tests and are often documented only implicitly in code review discussions. This paper introduces design-aware issue resolution and presents \bench, a benchmark that makes such implicit design constraints explicit and measurable. \bench is constructed by mining and validating design constraints from real-world pull requests, linking them to issue instances, and automatically checking patch compliance using an LLM-based verifier, yielding 495 issues and 1,787 validated constraints across six repositories, aligned with SWE-bench-Verified and SWE-bench-Pro. Experiments with state-of-the-art agents show that test-based correctness substantially overestimates patch quality: fewer than half of resolved issues are fully design-satisfying, design violations are widespread, and functional correctness exhibits negligible statistical association with design satisfaction. While providing issue-specific design guidance reduces violations, substantial non-compliance remains, highlighting a fundamental gap in current agent capabilities and motivating design-aware evaluation beyond functional correctness.

[AI-11] MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning AAMAS2026 AAAI

【速读】:该论文旨在解决多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)中模型任务专用性强、难以跨环境迁移的问题。传统MARL方法通常需为每个任务单独设计和训练模型,缺乏通用性。其解决方案的关键在于提出MARL-GPT框架,通过在大规模专家轨迹数据(如SMACv2、Google Research Football和POGEMA)上应用离线强化学习进行预训练,并采用单一的基于Transformer的观察编码器,无需针对特定任务进行调参,从而实现一个统一的、可泛化至多种异构MARL任务的模型。实验表明,该方法在多个基准环境中均达到与专用基线相当的性能,验证了构建通用MARL模型的可行性。

链接: https://arxiv.org/abs/2604.05943
作者: Maria Nesterova,Mikhail Kolosov,Anton Andreychuk,Egor Cherepanov,Oleg Bulichev,Alexey Kovalev,Konstantin Yakovlev,Aleksandr Panov,Alexey Skrynnik
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at AAMAS 2026 (AAAI Track)

点击查看摘要

Abstract:Recent advances in multi-agent reinforcement learning (MARL) have demonstrated success in numerous challenging domains and environments, but typically require specialized models for each task. In this work, we propose a coherent methodology that makes it possible for a single GPT-based model to learn and perform well across diverse MARL environments and tasks, including StarCraft Multi-Agent Challenge, Google Research Football and POGEMA. Our method, MARL-GPT, applies offline reinforcement learning to train at scale on the expert trajectories (400M for SMACv2, 100M for GRF, and 1B for POGEMA) combined with a single transformer-based observation encoder that requires no task-specific tuning. Experiments show that MARL-GPT achieves competitive performance compared to specialized baselines in all tested environments. Thus, our findings suggest that it is, indeed, possible to build a multi-task transformer-based model for a wide variety of (significantly different) multi-agent problems paving the way to the fundamental MARL model (akin to ChatGPT, Llama, Mistral etc. in natural language modeling).

[AI-12] ReLU Networks for Exact Generation of Similar Graphs

【速读】:该论文旨在解决生成图结构时难以保证其与源图之间满足指定图编辑距离(graph edit distance)约束的问题,尤其在分子设计和网络扰动分析等应用中,现有数据驱动的图生成模型往往依赖大量高质量训练数据,且无法确保生成结果符合预设的编辑距离上限。解决方案的关键在于理论证明了存在一类常数深度、大小为 O(n^2 d) 的ReLU神经网络,能够确定性地从给定 n 个顶点的输入图生成编辑距离不超过 d 的目标图,从而无需训练数据即可保障生成图的有效性与可控性。

链接: https://arxiv.org/abs/2604.05929
作者: Mamoona Ghafoor,Tatsuya Akutsu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
备注:

点击查看摘要

Abstract:Generation of graphs constrained by a specified graph edit distance from a source graph is important in applications such as cheminformatics, network anomaly synthesis, and structured data augmentation. Despite the growing demand for such constrained generative models in areas including molecule design and network perturbation analysis, the neural architectures required to provably generate graphs within a bounded graph edit distance remain largely unexplored. In addition, existing graph generative models are predominantly data-driven and depend heavily on the availability and quality of training data, which may result in generated graphs that do not satisfy the desired edit distance constraints. In this paper, we address these challenges by theoretically characterizing ReLU neural networks capable of generating graphs within a prescribed graph edit distance from a given graph. In particular, we show the existence of constant depth and O(n^2 d) size ReLU networks that deterministically generate graphs within edit distance d from a given input graph with n vertices, eliminating reliance on training data while guaranteeing validity of the generated graphs. Experimental evaluations demonstrate that the proposed network successfully generates valid graphs for instances with up to 1400 vertices and edit distance bounds up to 140, whereas baseline generative models fail to generate graphs with the desired edit distance. These results provide a theoretical foundation for constructing compact generative models with guaranteed validity.
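
下面给出该生成任务本身的一个朴素 Python 示意(并非论文构造的 ReLU 网络):在邻接矩阵上随机翻转至多 d 条边,从而保证输出图与源图的边编辑距离不超过 d。

```python
import random

def perturb_graph(adj, d, seed=0):
    """翻转至多 d 个上三角位置的边, 保证边编辑距离 <= d。
    adj: n x n 的 0/1 对称邻接矩阵(列表的列表)。"""
    rng = random.Random(seed)
    n = len(adj)
    out = [row[:] for row in adj]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    for i, j in rng.sample(pairs, min(d, len(pairs))):
        out[i][j] = out[j][i] = 1 - out[i][j]      # 增边或删边各计一次编辑
    return out

def edge_edit_distance(a, b):
    """两个同顶点集图之间的边编辑距离(上三角逐位比较)。"""
    n = len(a)
    return sum(a[i][j] != b[i][j] for i in range(n) for j in range(i + 1, n))

g = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]              # 3 顶点路径图
h = perturb_graph(g, d=2)
assert edge_edit_distance(g, h) <= 2               # 构造上保证的上界
```

论文的贡献在于证明这一保证可以由常数深度、O(n^2 d) 规模的 ReLU 网络确定性实现,而非这里的随机采样。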

[AI-13] HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理过程中因键值(Key-Value, KV)缓存规模随视觉输入扩展而急剧增长所带来的内存占用过高和解码延迟严重的问题。现有压缩方法通常仅在固定预算下进行粗粒度分配(如按token、layer或head层级),但忽略了注意力头(attention head)之间行为的异质性,导致压缩效率受限。其解决方案的关键在于提出HybridKV框架,通过三个阶段实现精细化的混合压缩策略:首先基于文本中心注意力将注意力头分类为静态与动态类型;其次采用自顶向下的预算分配机制,分层优化KV缓存分配;最后对静态头采用以文本优先的剪枝策略,对动态头则采用分块检索方式压缩,从而显著降低缓存内存消耗(最高达7.9倍)并提升解码速度(1.52倍),同时保持甚至优于全缓存模型的性能表现。

链接: https://arxiv.org/abs/2604.05887
作者: Bowen Zeng,Feiyang Ren,Jun Zhang,Xiaoling Gu,Ke Chen,Lidan Shou,Huan Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have advanced unified reasoning over text, images, and videos, but their inference is hindered by the rapid growth of key-value (KV) caches. Each visual input expands into thousands of tokens, causing caches to scale linearly with context length and remain resident in GPU memory throughout decoding, which leads to prohibitive memory overhead and latency even on high-end GPUs. A common solution is to compress caches under a fixed allocated budget at different granularities: token-level uniformly discards less important tokens, layer-level varies retention across layers, and head-level redistributes budgets across heads. Yet these approaches stop at allocation and overlook the heterogeneous behaviors of attention heads that require distinct compression strategies. We propose HybridKV, a hybrid KV cache compression framework that integrates complementary strategies in three stages: heads are first classified into static or dynamic types using text-centric attention; then a top-down budget allocation scheme hierarchically assigns KV budgets; finally, static heads are compressed by text-prior pruning and dynamic heads by chunk-wise retrieval. Experiments on 11 multimodal benchmarks with Qwen2.5-VL-7B show that HybridKV reduces KV cache memory by up to 7.9× and achieves 1.52× faster decoding, with almost no performance drop, and sometimes gains, relative to the full-cache MLLM.
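
下面是 token 级 KV 剪枝这一通用思路的示意(假设性实现:按累计注意力分数保留 top-k 条目,未区分论文中的静态/动态头与分层预算分配):

```python
import numpy as np

def prune_kv(keys, values, attn_weights, budget):
    """按各 token 获得的累计注意力分数, 只保留 budget 个 KV 条目。
    keys/values: (T, d); attn_weights: (Q, T) 为历史查询对各 token 的注意力。"""
    scores = attn_weights.sum(axis=0)              # 每个 token 的累计关注度
    keep = np.sort(np.argsort(scores)[-budget:])   # 取 top-budget 并保持原顺序
    return keys[keep], values[keep], keep

T, d = 8, 4
rng = np.random.default_rng(0)
keys, values = rng.normal(size=(T, d)), rng.normal(size=(T, d))
attn = rng.dirichlet(np.ones(T), size=3)           # 3 个查询的注意力分布(各行和为 1)
k2, v2, kept = prune_kv(keys, values, attn, budget=4)
```

HybridKV 在此类剪枝之上,还对静态头引入文本先验、对动态头改用分块检索,以适配注意力头的异质行为。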

[AI-14] Joint Knowledge Base Completion and Question Answering by Combining Large Language Models and Small Language Models

【速读】:该论文旨在解决知识库补全(Knowledge Base Completion, KBC)与知识库问答(Knowledge Base Question Answering, KBQA)任务之间协同增强不足的问题。现有方法通常依赖小语言模型(Small Language Model, SLM)联合优化这两项任务,忽视了大语言模型(Large Language Model, LLM)强大的推理能力。其解决方案的关键在于提出一种名为JCQL的新框架,通过迭代式机制使KBC与KBQA相互促进:一方面,利用SLM训练的KBC模型作为LLM代理在KBQA中的动作选项,以缓解LLM在KBQA中幻觉和高计算成本的问题;另一方面,将KBQA生成的推理路径作为补充训练数据,对KBC模型进行增量微调,从而提升SLM在KBC任务中的性能。

链接: https://arxiv.org/abs/2604.05875
作者: Yinan Liu,Dongying Lin,Sigang Luo,Xiaochun Yang,Bin Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20 pages, 11 figures

点击查看摘要

Abstract:Knowledge Bases (KBs) play a key role in various applications. As two representative KB-related tasks, knowledge base completion (KBC) and knowledge base question answering (KBQA) are closely related and inherently complementary with each other. Thus, it will be beneficial to solve the task of joint KBC and KBQA to make them reinforce each other. However, existing studies usually rely on the small language model (SLM) to enhance them jointly, and the large language model (LLM)'s strong reasoning ability is ignored. In this paper, by combining the strengths of the LLM with the SLM, we propose a novel framework JCQL, which can make these two tasks enhance each other in an iterative manner. To make KBC enhance KBQA, we augment the LLM agent-based KBQA model’s reasoning paths by incorporating an SLM-trained KBC model as an action of the agent, alleviating the LLM’s hallucination and high computational costs issue in KBQA. To make KBQA enhance KBC, we incrementally fine-tune the KBC model by leveraging KBQA’s reasoning paths as its supplementary training data, improving the ability of the SLM in KBC. Extensive experiments over two public benchmark data sets demonstrate that JCQL surpasses all baselines for both KBC and KBQA tasks.

[AI-15] JTON: A Token-Efficient JSON Superset with Zen Grid Tabular Encoding for Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在处理结构化数据时,标准JSON格式因重复编码列名而导致的token浪费问题,该冗余随数据行数线性增长,显著影响成本和上下文利用率。解决方案的核心是提出JTON(JSON Tabular Object Notation),其关键创新为“Zen Grid”机制——将列头信息提取至单一行中,并通过分号分隔值来编码数据内容,在保留JSON类型系统的同时有效消除重复,从而大幅减少token消耗(平均降低28.5%,部分场景达60%)。

链接: https://arxiv.org/abs/2604.05865
作者: Gowthamkumar Nandakishore
机构: 未知
类目: Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 20 pages, 13 figures, 14 tables. Code and test suite available at this https URL

点击查看摘要

Abstract:When LLMs process structured data, the serialization format directly affects cost and context utilization. Standard JSON wastes tokens repeating key names in every row of a tabular array–overhead that scales linearly with row count. This paper presents JTON (JSON Tabular Object Notation), a strict JSON superset whose main idea, Zen Grid, factors column headers into a single row and encodes values with semicolons, preserving JSON’s type system while cutting redundancy. Across seven real-world domains, Zen Grid reduces token counts by 15-60% versus JSON compact (28.5% average; 32% with bare_strings). Comprehension tests on 10 LLMs show a net +0.3 pp accuracy gain over JSON: four models improve, three hold steady, and three dip slightly. Generation tests on 12 LLMs yield 100% syntactic validity in both few-shot and zero-shot settings. A Rust/PyO3 reference implementation adds SIMD-accelerated parsing at 1.4x the speed of Python’s json module. Code, a 683-vector test suite, and all experimental data are publicly available.
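
Zen Grid 的核心是把重复列名提取为单独一行、数据行用分号分隔值。JTON 的完整语法以论文与代码仓库为准,下面仅按摘要描述给出一个假设性的编码示意,并对比与紧凑 JSON 的字符量(近似反映 token 开销):

```python
import json

def zen_grid_encode(rows):
    """把同构字典列表编码为“表头一行 + 分号分隔值行”的示意格式。"""
    cols = list(rows[0])
    header = ";".join(cols)
    lines = [";".join(str(r[c]) for c in cols) for r in rows]
    return header + "\n" + "\n".join(lines)

rows = [{"id": i, "name": f"user{i}", "score": i * 10} for i in range(50)]
grid = zen_grid_encode(rows)
compact = json.dumps(rows, separators=(",", ":"))
saving = 1 - len(grid) / len(compact)   # 列名去重带来的字符量节省比例
```

此例中 saving 明显为正,体现了去除重复键名的收益;实际 token 节省还取决于具体分词器,与摘要报告的 15–60% 同一量级。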

[AI-16] When Do We Need LLMs? A Diagnostic for Language-Driven Bandits ICLR2026

【速读】:该论文旨在解决非周期性序列决策问题中,如何高效且准确地利用包含文本与数值信息的上下文(context)进行决策的问题,特别是在金融场景如推荐系统、动态投资组合调整等应用中,传统基于大型语言模型(Large Language Models, LLMs)的方法因计算开销大且难以获得不确定性估计而受限。其解决方案的关键在于提出一种名为LLMP-UCB的上下文多臂老虎机(Contextual Multi-Armed Bandits, CMAB)算法,通过重复推理从LLM中提取不确定性估计;同时发现,仅基于文本嵌入(dense或Matryoshka嵌入)的轻量级数值带Bandit方法在性能上可媲美甚至超越LLM方案,且成本显著降低。进一步研究表明,嵌入维度是调节探索-利用平衡的有效参数,可在不增加提示复杂度的前提下实现成本-性能权衡,并提出了基于臂(arm)嵌入空间几何结构的诊断工具,用于指导实践中选择LLM驱动推理还是轻量级数值Bandit策略。

链接: https://arxiv.org/abs/2604.05859
作者: Uljad Berdica,Fernando Acero,Anton Ipsen,Parisa Zehtabi,Michael Cashmore,Manuela Veloso
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICLR 2026 Workshop on AI Advances in Finance

点击查看摘要

Abstract:We study Contextual Multi-Armed Bandits (CMABs) for non-episodic sequential decision making problems where the context includes both textual and numerical information (e.g., recommendation systems, dynamic portfolio adjustments, offer selection; all frequent problems in finance). While Large Language Models (LLMs) are increasingly applied to these settings, utilizing LLMs for reasoning at every decision step is computationally expensive and uncertainty estimates are difficult to obtain. To address this, we introduce LLMP-UCB, a bandit algorithm that derives uncertainty estimates from LLMs via repeated inference. However, our experiments demonstrate that lightweight numerical bandits operating on text embeddings (dense or Matryoshka) match or exceed the accuracy of LLM-based solutions at a fraction of their cost. We further show that embedding dimensionality is a practical lever on the exploration-exploitation balance, enabling cost–performance tradeoffs without prompt complexity. Finally, to guide practitioners, we propose a geometric diagnostic based on the arms’ embedding to decide when to use LLM-driven reasoning versus a lightweight numerical bandit. Our results provide a principled deployment framework for cost-effective, uncertainty-aware decision systems with broad applicability across AI use cases in financial services.
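
论文所称的轻量数值 Bandit 可以用经典 LinUCB 在文本嵌入上实现。以下是标准 LinUCB 的极简实现(通用算法示意,非论文代码;其中嵌入维度 d 即文中调节探索-利用的杠杆):

```python
import numpy as np

class LinUCB:
    """经典 LinUCB: 每个臂维护岭回归统计量, 按置信上界(UCB)选臂。"""
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]    # d x d 设计矩阵统计
        self.b = [np.zeros(d) for _ in range(n_arms)]

    def select(self, x):
        ucbs = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                          # 岭回归参数估计
            ucbs.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(ucbs))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

rng = np.random.default_rng(0)
bandit = LinUCB(n_arms=2, d=4)
true_theta = np.array([1.0, -1.0, 0.5, 0.0])           # 臂 0 的真实参数; 臂 1 奖励恒为 0
for _ in range(200):
    x = rng.normal(size=4)                             # x 可理解为上下文的文本嵌入
    a = bandit.select(x)
    r = float(true_theta @ x) if a == 0 else 0.0
    bandit.update(a, x, r)
```

训练后,对与臂 0 参数方向一致的上下文,模型应稳定选择臂 0;缩减嵌入维度 d(如 Matryoshka 截断)即可在成本与探索强度之间折中。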

[AI-17] Deep Researcher Agent: An Autonomous Framework for 24/7 Deep Learning Experimentation with Zero-Cost Monitoring

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在科研场景中无法实现全流程自动化实验执行的问题,尤其针对现有AI研究助手仅限于论文撰写或代码生成、缺乏对完整实验生命周期(包括假设形成、代码实现、训练执行、结果分析与迭代优化)支持的局限性。其解决方案的关键在于提出一个名为Deep Researcher Agent的开源框架,包含三项核心技术:(1)零成本监控(Zero-Cost Monitoring),通过进程级检查和日志读取替代LLM调用,在训练期间不产生API费用;(2)两级固定容量记忆机制(Two-Tier Constant-Size Memory),将内存上限控制在约5K字符,避免长时间运行导致上下文膨胀;(3)最小工具集领导者-工作者架构(Minimal-Toolset Leader-Worker Architecture),每个工作代理仅配备3–5个工具,显著降低单次调用的token开销(最高达73%)。该框架已在持续部署中完成超过500次实验循环,验证了其高效性和实用性。

链接: https://arxiv.org/abs/2604.05854
作者: Xiangyue Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Deep Researcher Agent, an open-source framework that enables large language model (LLM) agents to autonomously conduct deep learning experiments around the clock. Unlike existing AI research assistants that focus on paper writing or code generation, our system addresses the full experiment lifecycle: hypothesis formation, code implementation, training execution, result analysis, and iterative refinement. The framework introduces three key innovations: (1) Zero-Cost Monitoring – a monitoring paradigm that incurs zero LLM API costs during model training by relying solely on process-level checks and log file reads; (2) Two-Tier Constant-Size Memory – a memory architecture capped at ~5K characters regardless of runtime duration, preventing the unbounded context growth that plagues long-running agents; and (3) Minimal-Toolset Leader-Worker Architecture – a multi-agent design where each worker agent is equipped with only 3–5 tools, reducing per-call token overhead by up to 73%. In sustained deployments spanning 30+ days, the framework autonomously completed 500+ experiment cycles across four concurrent research projects, achieving a 52% improvement over baseline metrics in one project through 200+ automated experiments – all at an average LLM cost of $0.08 per 24-hour cycle. Code is available at this https URL.
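
“零成本监控”的要点是训练期间只做进程级检查与日志读取、完全不调用 LLM。下面是这一模式的假设性极简示意(日志格式与正则均为虚构示例,非该框架的实际实现):

```python
import os
import re

def check_training(pid, log_path, tail_bytes=4096):
    """不调用任何 LLM: 仅探测进程是否存活, 并解析日志尾部最新的 loss。"""
    try:
        os.kill(pid, 0)                    # 信号 0: 只检查进程存在性, 不发送信号
        alive = True
    except OSError:
        alive = False
    last_loss = None
    if os.path.exists(log_path):
        with open(log_path, "rb") as f:
            f.seek(max(0, os.path.getsize(log_path) - tail_bytes))
            tail = f.read().decode("utf-8", errors="replace")
        losses = re.findall(r"loss[=:]\s*([0-9.]+)", tail)
        if losses:
            last_loss = float(losses[-1])
    return {"alive": alive, "last_loss": last_loss}

# 演示: 写一个假日志, 并检查当前进程自身
with open("train.log", "w") as f:
    f.write("step=1 loss=2.31\nstep=2 loss=1.87\n")
status = check_training(os.getpid(), "train.log")
```

只有当这类廉价检查发现异常(进程退出、loss 异常)时,才有必要唤醒 LLM 代理介入,从而把监控期的 API 成本压到零。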

[AI-18] EEG-MFTNet: An Enhanced EEGNet Architecture with Multi-Scale Temporal Convolutions and Transformer Fusion for Cross-Session Motor Imagery Decoding

【速读】:该论文旨在解决脑机接口(Brain-Computer Interface, BCI)中基于脑电图(Electroencephalography, EEG)信号的运动想象(Motor Imagery, MI)解码准确性问题,尤其针对EEG信号中的噪声干扰和跨会话变异带来的挑战。其解决方案的关键在于提出一种名为EEG-MFTNet的新型深度学习模型,该模型在EEGNet架构基础上引入了多尺度时间卷积(multi-scale temporal convolutions)和Transformer编码器流(Transformer encoder stream),以有效捕捉EEG信号中短程与长程的时间依赖关系,从而提升MI分类性能。实验表明,该模型在SHU数据集上实现了58.9%的平均分类准确率,同时保持低计算复杂度和推理延迟,验证了其在实时BCI应用中的潜力。

链接: https://arxiv.org/abs/2604.05843
作者: Panagiotis Andrikopoulos,Siamak Mehrkanoon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figs

点击查看摘要

Abstract:Brain-computer interfaces (BCIs) enable direct communication between the brain and external devices, providing critical support for individuals with motor impairments. However, accurate motor imagery (MI) decoding from electroencephalography (EEG) remains challenging due to noise and cross-session variability. This study introduces EEG-MFTNet, a novel deep learning model based on the EEGNet architecture, enhanced with multi-scale temporal convolutions and a Transformer encoder stream. These components are designed to capture both short and long-range temporal dependencies in EEG signals. The model is evaluated on the SHU dataset using a subject-dependent cross-session setup, outperforming baseline models, including EEGNet and its recent derivatives. EEG-MFTNet achieves an average classification accuracy of 58.9% while maintaining low computational complexity and inference latency. The results highlight the model’s potential for real-time BCI applications and underscore the importance of architectural innovations in improving MI decoding. This work contributes to the development of more robust and adaptive BCI systems, with implications for assistive technologies and neurorehabilitation.
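
多尺度时间卷积即用不同长度的卷积核并行提取短程与长程时间特征。下面用 NumPy 给出一个与框架无关的示意(以均值核代替可学习权重,核长为虚构取值,并非论文的网络结构):

```python
import numpy as np

def multiscale_temporal_conv(x, kernel_sizes=(3, 7, 15)):
    """对一维 EEG 片段用多种核长做等长("same")卷积并堆叠为多通道特征。
    x: (T,) 单通道时间序列 -> 返回 (len(kernel_sizes), T)。"""
    feats = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k                    # 均值核: 核越长, 平滑的时间尺度越大
        feats.append(np.convolve(x, kernel, mode="same"))
    return np.stack(feats)

rng = np.random.default_rng(0)
x = rng.normal(size=250)                           # 模拟 1 秒 @ 250 Hz 的 EEG 片段
feats = multiscale_temporal_conv(x)
```

长核通道对噪声的平滑更强(方差更小),短核通道保留更多高频细节;EEG-MFTNet 在此类多尺度特征之上再接 Transformer 流建模长程依赖。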

[AI-19] Vision-Guided Iterative Refinement for Frontend Code Generation ICLR2026

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在前端网页代码生成中因缺乏有效反馈机制而导致的解决方案质量受限问题,尤其是在依赖可视化输出的场景下,单次推理难以保证高质量结果。其核心解决方案是提出一种全自动的“批判者在回路”(critic-in-the-loop)框架,其中视觉语言模型(Vision-Language Model, VLM)作为视觉批判者,对渲染后的网页提供结构化反馈,从而指导代码的迭代优化。该方法显著提升了WebDev Arena数据集上真实用户请求的代码生成质量,经过三轮迭代后性能最高提升17.8%,表明自动化、基于VLM的视觉反馈对于复杂视觉输出任务至关重要。

链接: https://arxiv.org/abs/2604.05839
作者: Hannah Sansford,Derek H. C. Law,Wei Liu,Abhishek Tripathi,Niresh Agarwal,Gerrit J. J. van den Burg
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2026 Workshop on AI with Recursive Self-Improvement

点击查看摘要

Abstract:Code generation with large language models often relies on multi-stage human-in-the-loop refinement, which is effective but very costly - particularly in domains such as frontend web development where the solution quality depends on rendered visual output. We present a fully automated critic-in-the-loop framework in which a vision-language model serves as a visual critic that provides structured feedback on rendered webpages to guide iterative refinement of generated code. Across real-world user requests from the WebDev Arena dataset, this approach yields consistent improvements in solution quality, achieving up to 17.8% increase in performance over three refinement cycles. Next, we investigate parameter-efficient fine-tuning using LoRA to understand whether the improvements provided by the critic can be internalized by the code-generating LLM. Fine-tuning achieves 25% of the gains from the best critic-in-the-loop solution without a significant increase in token counts. Our findings indicate that automated, VLM-based critique of frontend code generation leads to significantly higher quality solutions than can be achieved through a single LLM inference pass, and highlight the importance of iterative refinement for the complex visual outputs associated with web development.

[AI-20] Reciprocal Trust and Distrust in Artificial Intelligence Systems: The Hard Problem of Regulation

【速读】:该论文试图解决的问题是:如何在人工智能(AI)系统与人类用户及监管者之间建立可信赖的关系,从而为AI的治理和监管提供理论基础。其解决方案的关键在于提出将AI系统视为具有一定代理能力(agency)的人造物,使其能够与人类形成相互信任或不信任的关系动态。这一视角突破了传统将AI仅视为工具的局限,强调了AI在社会互动中的能动性,并由此引申出对监管框架的新要求——即需考虑人机之间的互信机制,进而应对未来AI治理中出现的伦理、责任与控制等核心矛盾。

链接: https://arxiv.org/abs/2604.05826
作者: Martino Maggetti
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Policy makers, scientists, and the public are increasingly confronted with thorny questions about the regulation of artificial intelligence (AI) systems. A key common thread concerns whether AI can be trusted and the factors that can make it more trustworthy in front of stakeholders and users. This is indeed crucial, as the trustworthiness of AI systems is fundamental for both democratic governance and for the development and deployment of AI. This article advances the discussion by arguing that AI systems should also be recognized, at least to some extent, as artifacts capable of exercising a form of agency, thereby enabling them to engage in relationships of trust or distrust with humans. It further examines the implications of these reciprocal trust dynamics for regulators tasked with overseeing AI systems. The article concludes by identifying key tensions and unresolved dilemmas that these dynamics pose for the future of AI regulation and governance.

[AI-21] Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents ACL2026

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在复杂交互决策任务中因依赖不断增长的交互历史而导致计算成本高、可扩展性差的问题。其解决方案的关键在于提出一种分层强化学习(Hierarchical Reinforcement Learning, HRL)框架STEP-HRL,该框架通过仅基于单步转移(single-step transitions)进行步骤级学习,并引入局部进展模块(local progress module),在每个子任务内迭代且选择性地总结交互历史,生成紧凑的局部进展摘要,从而为高层和低层策略提供增强的步骤级转移信息,显著提升了性能与泛化能力并降低了token消耗。

链接: https://arxiv.org/abs/2604.05808
作者: Shuai Zhen,Yanhua Yu,Ruopei Guo,Nan Cheng,Yang Deng
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ACL 2026 Main Conference

点击查看摘要

Abstract:Large language model (LLM) agents have demonstrated strong capabilities in complex interactive decision-making tasks. However, existing LLM agents typically rely on increasingly long interaction histories, resulting in high computational cost and limited scalability. In this paper, we propose STEP-HRL, a hierarchical reinforcement learning (HRL) framework that enables step-level learning by conditioning only on single-step transitions rather than full interaction histories. STEP-HRL structures tasks hierarchically, using completed subtasks to represent global progress of overall task. By introducing a local progress module, it also iteratively and selectively summarizes interaction history within each subtask to produce a compact summary of local progress. Together, these components yield augmented step-level transitions for both high-level and low-level policies. Experimental results on ScienceWorld and ALFWorld benchmarks consistently demonstrate that STEP-HRL substantially outperforms baselines in terms of performance and generalization while reducing token usage. Our code is available at this https URL.
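
“局部进展模块迭代且有选择地压缩子任务内的交互历史”可以示意如下(假设性实现:以“该步是否改变了环境状态”作为选择标准、以固定条目数作为容量上限,均为虚构的简化设定,并非论文算法):

```python
def update_local_progress(summary, action, observation, changed, max_items=5):
    """子任务内的局部进展摘要: 只选择性记录“有效”步骤, 且容量固定。
    changed: 该步是否改变了环境状态(示意性的选择标准)。"""
    if changed:
        summary = summary + [f"{action} -> {observation}"]
    return summary[-max_items:]                    # 固定容量, 丢弃最旧条目

summary = []
steps = [("open drawer", "drawer is open", True),
         ("look", "nothing new", False),          # 无效步骤, 不进入摘要
         ("take key", "key taken", True)]
for a, o, c in steps:
    summary = update_local_progress(summary, a, o, c)
```

这样,低层策略每步只需读取“单步转移 + 紧凑摘要”,而非不断增长的完整历史,这正是 STEP-HRL 节省 token 的来源。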

[AI-22] Emergent social transmission of model-based representations without inference

【速读】:该论文试图解决的问题是:人类如何在认知能力有限的情况下,通过社会学习从他人那里获取丰富且灵活的环境知识。传统观点认为这依赖于计算成本高昂的心理理论(mentalizing),如推断他人的信念;而该研究则提出,文化演化视角下,行为传递可由简单的社会线索支持。解决方案的关键在于:通过强化学习模拟表明,即使不进行心理状态推断,仅通过观察专家的行为并采用启发式策略(如选择或增强与观察行为一致的动作价值表示),也能间接传递高层级的认知表征。模型基础的学习者尤其受益于这种社会暴露,表现出更快的学习速度和更接近专家的认知表征,从而揭示了文化传递可源于无需心理理论的、基于非社会学习机制的简单社会学习过程。

链接: https://arxiv.org/abs/2604.05777
作者: Silja Keßler,Miriam Bautista-Salinero,Claudio Tennie,Charley M. Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:How do people acquire rich, flexible knowledge about their environment from others despite limited cognitive capacity? Humans are often thought to rely on computationally costly mentalizing, such as inferring others’ beliefs. In contrast, cultural evolution emphasizes that behavioral transmission can be supported by simple social cues. Using reinforcement learning simulations, we show how minimal social learning can indirectly transmit higher-level representations. We simulate a naïve agent searching for rewards in a reconfigurable environment, learning either alone or by observing an expert - crucially, without inferring mental states. Instead, the learner heuristically selects actions or boosts value representations based on observed actions. Our results demonstrate that these cues bias the learner’s experience, causing its representation to converge toward the expert’s. Model-based learners benefit most from social exposure, showing faster learning and more expert-like representations. These findings show how cultural transmission can arise from simple, non-mentalizing processes exploiting asocial learning mechanisms.
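
“价值增强”式的非心智化社会学习可以嫁接在表格 Q-learning 上,示意如下(参数与环境均为虚构的玩具设定,并非论文的模拟代码):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """标准表格 Q-learning 更新。"""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

def social_boost(Q, s, expert_action, bonus=0.2):
    """非心智化的社会线索: 不推断专家信念, 直接抬高被观察动作的价值。"""
    Q[(s, expert_action)] = Q.get((s, expert_action), 0.0) + bonus

actions = ["left", "right"]
Q = {}
social_boost(Q, "start", "right")                  # 观察到专家在 start 处向右走
# 此后学习者更可能尝试 right, 再由真实奖励巩固该偏好
q_update(Q, "start", "right", r=1.0, s_next="goal", actions=actions)
```

社会线索只是给价值表“加了一个偏置”,但它会偏置学习者的后续经验,使其价值表示间接收敛到与专家相似的结构,这正是论文所说的无心智化文化传递。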

[AI-23] CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models

【速读】:该论文旨在解决当前缺乏针对大语言模型(Large Language Models, LLMs)在云原生软件架构理解能力上的系统性评估基准的问题。为应对这一挑战,作者提出了名为CAKE的基准测试框架,其核心创新在于设计了188道由专家验证的题目,覆盖布卢姆修订版认知分类法中的四个层级(回忆、分析、设计、实现)以及五个云原生主题,并采用多轮投票机制评估多项选择题(MCQs),以LLM作为评判者(LLM-as-a-judge)评分自由回答题(FR)。该方案的关键在于通过结构化任务设计与双模态评估方式(MCQ + FR),揭示不同参数规模和增强策略下LLMs在云原生架构知识掌握上的差异,从而为准确衡量生成式AI(Generative AI)在复杂软件工程场景中的专业能力提供了可量化、分层次的评估体系。

链接: https://arxiv.org/abs/2604.05755
作者: Tim Lukas Adam,Phongsakon Mark Konrad,Riccardo Terrenzi,Florian Girardo Lukas,Rahime Yilmaz,Krzysztof Sierszecki,Serkan Ayvaz
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In today’s software architecture, large language models (LLMs) serve as software architecture co-pilots. However, no benchmark currently exists to evaluate large language models’ actual understanding of cloud-native software architecture. For this reason we present a benchmark called CAKE, which consists of 188 expert-validated questions covering four cognitive levels of Bloom’s revised taxonomy – recall, analyze, design, and implement – and five cloud-native topics. Evaluation is conducted on 22 model configurations (0.5B–70B parameters) across four LLM families, using three-run majority voting for multiple-choice questions (MCQs) and LLM-as-a-judge scoring for free-responses (FR). Based on this evaluation, four notable findings were identified. First, MCQ accuracy plateaus above 3B parameters, with the best model reaching 99.2%. Second, free-response scores scale steadily across all cognitive levels. Third, the two formats capture different facets of knowledge, as the MCQ accuracy approaches a ceiling while free-responses continue to differentiate models. Finally, reasoning augmentation (+think) improves free-response quality, while tool augmentation (+tool) degrades performance for small models. These results suggest that the evaluation format fundamentally shapes how we measure architectural knowledge in LLMs.
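
选择题的三次运行多数投票可以这样实现(通用示意,非论文评测代码;平票时取首次出现选项是假设性约定):

```python
from collections import Counter

def majority_vote(answers):
    """取多次模型回答中出现最多的选项; 平票时取首次出现者。"""
    counts = Counter(answers)
    top = counts.most_common(1)[0][1]
    for a in answers:                    # 按出现顺序打破平票
        if counts[a] == top:
            return a

runs = [["B", "B", "C"], ["A", "C", "A"], ["D", "B", "C"]]
voted = [majority_vote(r) for r in runs]   # -> ["B", "A", "D"]
```

多数投票可以平滑采样随机性,但正如论文所示,MCQ 准确率会很快触顶,自由回答配合 LLM-as-a-judge 才能继续区分模型。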

[AI-24] Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的自动化渗透测试(Automated Penetration Testing, AutoPT)研究中缺乏系统性架构分析与大规模实证比较的问题。现有工作虽数量众多,但多停留在局部优化,未形成统一的评估标准和结构化分类体系,导致研究进展难以横向对比与深入理解。解决方案的关键在于:首先,提出首个面向AutoPT框架的“知识系统化”(Systematization of Knowledge, SoK),从代理架构、计划机制、记忆管理、执行能力、外部知识利用及基准测试六个维度进行系统梳理;其次,构建并应用统一基准对13个代表性开源AutoPT框架及2个基线框架开展大规模实验,累计消耗超100亿token并生成超过1500条执行日志,由15位具备网络安全专长的研究人员历时四个月人工审核与分析,从而为该领域提供可复现的结构化分类体系与高质量实证基准,推动未来研究向更规范、可比较的方向发展。

链接: https://arxiv.org/abs/2604.05719
作者: Jiaren Peng,Zeqin Li,Chang You,Yan Wang,Hanlin Sun,Xuan Tian,Shuqiao Zhang,Junyi Liu,Jianguo Zhao,Renyang Liu,Haoran Ou,Yuqiang Sun,Jiancheng Zhang,Yutong Jiao,Kunshu Song,Chao Zhang,Fan Shi,Hongda Sun,Rui Yan,Cheng Huang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) has created new opportunities for Automated Penetration Testing (AutoPT), spawning numerous frameworks aimed at achieving end-to-end autonomous attacks. However, despite the proliferation of related studies, existing research generally lacks systematic architectural analysis and large-scale empirical comparisons under a unified benchmark. Therefore, this paper presents the first Systematization of Knowledge (SoK) focusing on the architectural design and comprehensive empirical evaluation of current LLM-based AutoPT frameworks. At systematization level, we comprehensively review existing framework designs across six dimensions: agent architecture, agent plan, agent memory, agent execution, external knowledge, and benchmarks. At empirical level, we conduct large-scale experiments on 13 representative open-source AutoPT frameworks and 2 baseline frameworks utilizing a unified benchmark. The experiments consumed over 10 billion tokens in total and generated more than 1,500 execution logs, which were manually reviewed and analyzed over four months by a panel of more than 15 researchers with expertise in cybersecurity. By investigating the latest progress in this rapidly developing field, we provide researchers with a structured taxonomy to understand existing LLM-based AutoPT frameworks and a large-scale empirical benchmark, along with promising directions for future research.

[AI-25] Can Large Language Models Reinvent Foundational Algorithms?

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)是否具备基础性创新(foundational innovation)能力的问题,具体聚焦于LLMs能否在被移除特定计算机科学基础算法(如Dijkstra算法或欧几里得算法)知识后,重新发明这些算法。其解决方案的关键在于提出“遗忘与重发明”(Unlearn-and-Reinvent)管道:首先采用基于GRPO的在线策略遗忘方法有效移除目标算法的知识,随后在受控环境中测试模型是否能通过推理重新发现该算法。实验表明,最强模型Qwen3-4B-Thinking-2507在无提示条件下可成功重发明50%的目标算法,提示等级提升至2时成功率可达90%,且引入生成式验证器(generative verifier)显著提升了推理稳定性,避免了“思维崩溃”(thought collapse)现象,从而揭示了LLMs在基础算法层面潜在的创新能力及其当前局限。

链接: https://arxiv.org/abs/2604.05716
作者: Jian Zhao,Haoren Luo,Yu Wang,Yuhan Cao,Pingyue Sheng,Tianxing He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLMs have shown strong potential to advance scientific discovery. Whether they possess the capacity for foundational innovation, however, remains an open question. In this work, we focus on a prerequisite for foundational innovation: can LLMs reinvent foundational algorithms in computer science? Our *Unlearn-and-Reinvent* pipeline applies LLM unlearning to remove a specific foundational algorithm, such as Dijkstra’s or Euclid’s algorithm, from an LLM’s pretrained knowledge, and then tests whether the model can reinvent it in a controlled environment. To enable effective unlearning, we adopt a GRPO-based, on-policy unlearning method. Across 10 target algorithms, 3 strong open-weight models, and 3 hint levels, our experiments demonstrate that (1) the strongest model Qwen3-4B-Thinking-2507 successfully reinvents 50% of the algorithms with no hint, 70% at hint level 1, and 90% at hint level 2; (2) a few high-level hints can enhance the reinvention success rate, but even step-by-step hints fail for those complicated algorithms; and (3) test-time reinforcement learning enables successful reinvention for the Strassen algorithm at hint level 2. Through analyses of output trajectories and ablation studies, we find that the generative verifier in the reinvention phase plays a critical role in sustaining models’ reasoning strength, helping to avoid the “thought collapse” phenomenon. These findings offer insights into both the potential and current limits of LLMs’ innovative thinking.
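论文强调“生成式验证器”(generative verifier)对维持推理稳定性的关键作用。下面给出一个极简的行为验证示意:用参考实现在随机输入上逐一比对候选的“重发明”算法(以欧几里得算法为例;代码仅为概念演示,并非论文实现,函数名、输入范围与试验次数均为假设):

```python
import math
import random

def verify_candidate_gcd(candidate, trials=200, seed=0):
    """对候选算法做行为验证:与参考实现 math.gcd 在随机输入上逐一比对。"""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.randint(0, 10**6), rng.randint(1, 10**6)
        if candidate(a, b) != math.gcd(a, b):
            return False
    return True

# 一个"被重新发明"的欧几里得算法候选实现
def rediscovered_gcd(a, b):
    while b:
        a, b = b, a % b
    return a

ok = verify_candidate_gcd(rediscovered_gcd)
```

实际系统中验证器还需检查推理过程本身,这里仅示意“黑盒输入输出比对”这一最基本的环节。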

[AI-26] QA-MoE: Towards a Continuous Reliability Spectrum with Quality-Aware Mixture of Experts for Robust Multimodal Sentiment Analysis

【速读】:该论文旨在解决多模态情感分析(Multimodal Sentiment Analysis, MSA)在真实场景中因动态噪声或模态缺失导致的性能下降问题。现有方法通常将这些干扰视为离散情况或假设固定的损坏比例,难以适应连续变化的可靠性条件。解决方案的关键在于提出一个“连续可靠性谱”(Continuous Reliability Spectrum),将模态缺失与质量退化统一建模,并进一步设计了质量感知的专家混合模型(Quality-Aware Mixture-of-Experts, QA-MoE),通过自监督的偶然不确定性(aleatoric uncertainty)量化各模态的可靠性,从而显式引导专家路由机制,在抑制不可靠信号带来的误差传播的同时保留任务相关特征,实现对多样化退化场景的鲁棒建模和“一检查点适配所有”(One-Checkpoint-for-All)的实用特性。

链接: https://arxiv.org/abs/2604.05704
作者: Yitong Zhu,Yuxuan Jiang,Guanxuan Jiang,Bojing Hou,Peng Yuan Zhou,Ge Lin Kan,Yuyang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Sentiment Analysis (MSA) aims to infer human sentiment from textual, acoustic, and visual signals. In real-world scenarios, however, multimodal inputs are often compromised by dynamic noise or modality missingness. Existing methods typically treat these imperfections as discrete cases or assume fixed corruption ratios, which limits their adaptability to continuously varying reliability conditions. To address this, we first introduce a Continuous Reliability Spectrum to unify missingness and quality degradation into a single framework. Building on this, we propose QA-MoE, a Quality-Aware Mixture-of-Experts framework that quantifies modality reliability via self-supervised aleatoric uncertainty. This mechanism explicitly guides expert routing, enabling the model to suppress error propagation from unreliable signals while preserving task-relevant information. Extensive experiments indicate that QA-MoE achieves competitive or state-of-the-art performance across diverse degradation scenarios and exhibits a promising One-Checkpoint-for-All property in practice.
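QA-MoE 的核心直觉是“可靠性越低、路由权重越小”。下面用一段极简 Python 代码示意这种质量感知的门控(仅为概念演示,并非论文实现;专家打分、不确定性数值与超参 lam 均为假设值):

```python
import numpy as np

def quality_aware_gate(expert_scores, uncertainties, lam=1.0):
    """按"可靠性越低、权重越小"的思路修正专家路由权重(示意)。
    expert_scores: 路由器对各模态专家的原始打分
    uncertainties: 各模态自监督估计的偶然不确定性(越大越不可靠)"""
    logits = np.asarray(expert_scores, float) - lam * np.asarray(uncertainties, float)
    w = np.exp(logits - logits.max())  # 数值稳定的 softmax
    return w / w.sum()

# 文本/语音/视觉三个专家;假设视觉模态严重退化(不确定性高)
weights = quality_aware_gate([1.0, 1.0, 1.0], [0.1, 0.2, 3.0])
```

退化模态的权重被显著压低,从而抑制不可靠信号的误差传播。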

[AI-27] From Incomplete Architecture to Quantified Risk: Multimodal LLM-Driven Security Assessment for Cyber-Physical Systems

【速读】:该论文旨在解决网络物理系统(Cyber-Physical Systems, CPS)因架构文档不完整或过时而导致的安全评估不可靠问题,此类问题通常源于遗留技术、知识管理缺口以及长期运行生命周期中多子系统的复杂集成。解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的架构中心型安全威胁风险评估方法——ASTRAL,其核心创新在于通过提示链(prompt chaining)、少量样本学习(few-shot learning)和架构推理机制,从分散的数据源中提取并合成系统表示,并结合架构建模实现自适应威胁识别与定量风险估算,从而支持在缺乏完整文档情况下进行可靠的安全评估。

链接: https://arxiv.org/abs/2604.05674
作者: Shaofei Huang,Christopher M. Poskitt,Lwin Khin Shar
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Under submission

点击查看摘要

Abstract:Cyber-physical systems often contend with incomplete architectural documentation or outdated information resulting from legacy technologies, knowledge management gaps, and the complexity of integrating diverse subsystems over extended operational lifecycles. This architectural incompleteness impedes reliable security assessment, as inaccurate or missing architectural knowledge limits the identification of system dependencies, attack surfaces, and risk propagation pathways. To address this foundational challenge, this paper introduces ASTRAL (Architecture-Centric Security Threat Risk Assessment using LLMs), an architecture-centric security assessment technique implemented in a prototype tool powered by multimodal LLMs. The proposed approach assists practitioners in reconstructing and analysing CPS architectures when documentation is fragmented or absent. By leveraging prompt chaining, few-shot learning, and architectural reasoning, ASTRAL extracts and synthesises system representations from disparate data sources. By integrating LLM reasoning with architectural modelling, our approach supports adaptive threat identification and quantitative risk estimation for cyber-physical systems. We evaluated the approach through an ablation study across multiple CPS case studies and an expert evaluation involving 14 experienced cybersecurity practitioners. Practitioner feedback suggests that ASTRAL is useful and reliable for supporting architecture-centric security assessment. Overall, the results indicate that the approach can support more informed cyber risk management decisions.

[AI-28] Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation

【速读】:该论文旨在解决生成式策略在具身人工智能(Embodied AI)中实时机器人控制的瓶颈问题,即基于扩散模型和薛定谔桥(Schrödinger Bridges, SB)的生成策略因高方差随机传输导致需数十步积分才能收敛,难以满足低延迟控制需求。解决方案的关键在于提出修正型薛定谔桥匹配(Rectified Schrödinger Bridge Matching, RSBM),其核心创新是利用标准薛定谔桥(ε=1,最大熵传输)与确定性最优传输(ε→0,如条件流匹配)之间共享的速度场结构,通过单一熵正则化参数 ε 控制传输路径的平滑性与多模态覆盖能力。理论证明了条件速度场的函数形式在整个 ε 谱上保持不变(速度结构不变性),从而允许单个神经网络适配不同 ε 值;同时发现线性减小 ε 可显著降低条件速度方差,实现更稳定的粗步长常微分方程(ODE)积分。RSBM 在中间 ε 值下运行,结合学习到的条件先验以缩短传输距离,在仅 3 步积分内即可达到超过 94% 的余弦相似度和 92% 的成功率,无需蒸馏或多阶段训练,大幅缩小了高保真生成策略与具身智能低延迟控制之间的差距。

链接: https://arxiv.org/abs/2604.05673
作者: Wuyang Luan,Junhui Li,Weiguang Zhao,Wenjian Zhang,Tieru Wu,Rui Ma
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 18 pages, 7 figures, 10 tables. Code available at this https URL

点击查看摘要

Abstract:Visual navigation is a core challenge in Embodied AI, requiring autonomous agents to translate high-dimensional sensory observations into continuous, long-horizon action trajectories. While generative policies based on diffusion models and Schrödinger Bridges (SB) effectively capture multimodal action distributions, they require dozens of integration steps due to high-variance stochastic transport, posing a critical barrier for real-time robotic control. We propose Rectified Schrödinger Bridge Matching (RSBM), a framework that exploits a shared velocity-field structure between standard Schrödinger Bridges ( \varepsilon=1 , maximum-entropy transport) and deterministic Optimal Transport ( \varepsilon\to 0 , as in Conditional Flow Matching), controlled by a single entropic regularization parameter \varepsilon . We prove two key results: (1) the conditional velocity field’s functional form is invariant across the entire \varepsilon -spectrum (Velocity Structure Invariance), enabling a single network to serve all regularization strengths; and (2) reducing \varepsilon linearly decreases the conditional velocity variance, enabling more stable coarse-step ODE integration. Anchored to a learned conditional prior that shortens transport distance, RSBM operates at an intermediate \varepsilon that balances multimodal coverage and path straightness. Empirically, while standard bridges require \geq 10 steps to converge, RSBM achieves over 94% cosine similarity and 92% success rate in merely 3 integration steps – without distillation or multi-stage training – substantially narrowing the gap between high-fidelity generative policies and the low-latency demands of Embodied AI.
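RSBM 的关键直觉是:传输路径越“直”,粗步长 ODE 积分越稳定。下面用恒定速度场(完全拉直的路径,对应 ε→0 的极端情形)做一个极简演示:3 步欧拉积分即可精确到达目标点(仅为概念示意,并非论文实现,数值均为玩具示例):

```python
import numpy as np

def euler_integrate(velocity, x0, n_steps):
    """用欧拉法对 ODE dx/dt = velocity(x, t) 从 t=0 积分到 t=1。"""
    x, dt = np.array(x0, float), 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt)
    return x

# 完全拉直的传输:速度场恒为 x1 - x0,路径为直线
x0, x1 = np.array([0.0, 0.0]), np.array([2.0, -1.0])
straight_v = lambda x, t: x1 - x0
x_final = euler_integrate(straight_v, x0, n_steps=3)
```

对弯曲路径,粗步长欧拉法会产生离散化误差;这正是论文中“降低条件速度方差、拉直路径”能换来少步数积分的原因。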

[AI-29] CuraLight: Debate-Guided Data Curation for LLM-Centered Traffic Signal Control IJCNN2026

【速读】:该论文旨在解决交通信号控制(Traffic Signal Control, TSC)中现有基于强化学习(Reinforcement Learning, RL)和大语言模型(Large Language Models, LLMs)方法存在的可解释性差、交互数据不足以及在异构交叉口泛化能力弱的问题。其解决方案的关键在于提出一个以LLM为核心的框架CuraLight,通过RL代理探索交通环境并生成高质量的交互轨迹,将其转化为提示-响应对用于模仿微调;同时引入多LLM集成审议系统,通过结构化辩论评估候选信号配时动作,提供偏好感知的监督信号,从而实现更高效、可解释且具备良好泛化性能的TSC策略训练。

链接: https://arxiv.org/abs/2604.05663
作者: Qing Guo,Xinhang Li,Junyu Chen,Zheng Guo,Shengzhe Xu,Lin Zhang,Lei Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: accepted at IJCNN 2026

点击查看摘要

Abstract:Traffic signal control (TSC) is a core component of intelligent transportation systems (ITS), aiming to reduce congestion, emissions, and travel time. Recent approaches based on reinforcement learning (RL) and large language models (LLMs) have improved adaptivity, but still suffer from limited interpretability, insufficient interaction data, and weak generalization to heterogeneous intersections. This paper proposes CuraLight, an LLM-centered framework where an RL agent assists the fine-tuning of an LLM-based traffic signal controller. The RL agent explores traffic environments and generates high-quality interaction trajectories, which are converted into prompt-response pairs for imitation fine-tuning. A multi-LLM ensemble deliberation system further evaluates candidate signal timing actions through structured debate, providing preference-aware supervision signals for training. Experiments conducted in SUMO across heterogeneous real-world networks from Jinan, Hangzhou, and Yizhuang demonstrate that CuraLight consistently outperforms state-of-the-art baselines, reducing average travel time by 5.34 percent, average queue length by 5.14 percent, and average waiting time by 7.02 percent. The results highlight the effectiveness of combining RL-assisted exploration with deliberation-based data curation for scalable and interpretable traffic signal control.

[AI-30] PECKER: A Precisely Efficient Critical Knowledge Erasure Recipe For Machine Unlearning in Diffusion Models ICPR2026

【速读】:该论文旨在解决生成式 AI (Generative AI) 模型在安全合规运行中面临的机器遗忘(Machine Unlearning, MU)问题,尤其是现有方法普遍存在的训练时间长、计算开销大等效率瓶颈。其核心挑战在于梯度更新方向不合理导致训练效率低下且收敛不稳定。解决方案的关键在于提出 PECKER 方法,该方法基于知识蒸馏框架,引入显著性掩码(saliency mask),精准识别并优先更新对目标数据遗忘贡献最大的模型参数,从而减少无效梯度计算,显著缩短训练时间,同时保持甚至超越现有方法的遗忘效果,在 CIFAR-10 和 STL-10 数据集上实现了类遗忘与概念遗忘任务的高效执行。

链接: https://arxiv.org/abs/2604.05634
作者: Zhiyong Ma,Zhitao Deng,Huan Tang,Jialin Chen,Zhijun Zheng,Zhengping Li,Qingyuan Chuai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ICPR 2026

点击查看摘要

Abstract:Machine unlearning (MU) has become a critical technique for GenAI models’ safe and compliant operation. While existing MU methods are effective, most impose prohibitive training time and computational overhead. Our analysis suggests the root cause lies in poorly directed gradient updates, which reduce training efficiency and destabilize convergence. To mitigate these issues, we propose PECKER, an efficient MU approach that matches or outperforms prevailing methods. Within a distillation framework, PECKER introduces a saliency mask to prioritize updates to parameters that contribute most to forgetting the targeted data, thereby reducing unnecessary gradient computation and shortening overall training time without sacrificing unlearning efficacy. Our method generates samples that unlearn related class or concept more quickly, while closely aligning with the true image distribution on CIFAR-10 and STL-10 datasets, achieving shorter training times for both class forgetting and concept forgetting.
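PECKER 的显著性掩码可以概括为:按梯度绝对值挑选对遗忘目标影响最大的一小部分参数,只更新这部分参数。以下为一个极简 numpy 示意(并非论文实现,keep_ratio 与梯度数值均为假设):

```python
import numpy as np

def saliency_mask(grads, keep_ratio=0.2):
    """按梯度绝对值挑选贡献最大的参数,生成 0/1 掩码(示意)。"""
    flat = np.abs(np.asarray(grads, float)).ravel()
    k = max(1, int(round(keep_ratio * flat.size)))
    thresh = np.sort(flat)[-k]  # 第 k 大的梯度绝对值作为阈值
    return (np.abs(grads) >= thresh).astype(float)

grads = np.array([0.01, -3.0, 0.2, 0.05, 1.5])
mask = saliency_mask(grads, keep_ratio=0.4)
# 只有被掩码选中的参数才参与遗忘更新:theta -= lr * mask * grads
```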

[AI-31] Foundations for Agentic AI Investigations from the Forensic Analysis of OpenClaw

【速读】:该论文旨在解决当前对生成式 AI (Generative AI) 系统在数字取证中的内部状态与行为可追溯性缺乏系统性研究的问题。解决方案的关键在于通过静态代码分析和差分取证分析,识别并分类 OpenClaw 这一典型单代理助手在交互循环各阶段中可恢复的痕迹,并据此构建一个代理 artifact 分类体系,从而为 agentic AI 的系统性取证提供结构化框架。论文进一步指出,代理执行引入了额外的抽象层和显著的非确定性,使得大语言模型(LLM)、执行环境及动态上下文对工具选择与状态迁移的影响远超传统规则驱动软件,这一特性构成了该领域取证的核心挑战。

链接: https://arxiv.org/abs/2604.05589
作者: Jan Gruber,Jan-Niclas Hilgert
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Preprint. Code and experimental data available at: this https URL

点击查看摘要

Abstract:Agentic AI systems are increasingly deployed as personal assistants and are likely to become a common object of digital investigations. However, little is known about how their internal state and actions can be reconstructed during forensic analysis. Despite growing popularity, systematic forensic approaches for such systems remain largely unexplored. This paper presents an empirical study of OpenClaw, a widely used single-agent assistant. We examine OpenClaw’s technical design via static code analysis and apply differential forensic analysis to identify recoverable traces across stages of the agent interaction loop. We classify and correlate these traces to assess their investigative value in a systematic way. Based on these observations, we propose an agent artifact taxonomy that captures recurring investigative patterns. Finally, we highlight a foundational challenge for agentic AI forensics: agent-mediated execution introduces an additional layer of abstraction and substantial nondeterminism in trace generation. The large language model (LLM), the execution environment, and the evolving context can influence tool choice and state transitions in ways that are largely absent from rule-based software. Overall, our results provide an initial foundation for the systematic investigation of agentic AI and outline implications for digital forensic practice and future research.

[AI-32] ResearchEVO: An End-to-End Framework for Automated Scientific Discovery and Documentation

【速读】:该论文旨在解决科学发现中“先发现后解释”的两阶段范式如何通过计算手段实现自动化的问题,即在无需人类干预的情况下完成从算法演化到科研文献生成的全流程。其解决方案的关键在于提出并实现了ResearchEVO框架:第一阶段(Evolution Phase)采用LLM引导的二维协同进化机制,同时优化算法逻辑与整体架构,仅依赖适应度评估进行搜索,不依赖对结果的理解;第二阶段(Writing Phase)则基于检索增强生成(Retrieval-Augmented Generation, RAG)技术,自动撰写符合出版标准的研究论文,包含实验设计、理论解释和防幻觉验证,从而将新发现无缝嵌入现有文献体系中。该框架首次实现了从算法演化到文献生成的端到端自动化,且在量子纠错和物理信息神经网络两个跨学科问题上验证了其有效性。

链接: https://arxiv.org/abs/2604.05587
作者: Zhe Zhao,Haibin Wen,Jiaming Ma,Jiachang Zhan,Tianyi Xu,Ye Wei,Qingfu Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:An important recurring pattern in scientific breakthroughs is a two-stage process: an initial phase of undirected experimentation that yields an unexpected finding, followed by a retrospective phase that explains why the finding works and situates it within existing theory. We present ResearchEVO, an end-to-end framework that computationally instantiates this discover-then-explain paradigm. The Evolution Phase employs LLM-guided bi-dimensional co-evolution – simultaneously optimizing both algorithmic logic and overall architecture – to search the space of code implementations purely by fitness, without requiring any understanding of the solutions it produces. The Writing Phase then takes the best-performing algorithm and autonomously generates a complete, publication-ready research paper through sentence-level retrieval-augmented generation with explicit anti-hallucination verification and automated experiment design. To our knowledge, ResearchEVO is the first system to cover this full pipeline end to end: no prior work jointly performs principled algorithm evolution and literature-grounded scientific documentation. We validate the framework on two cross-disciplinary scientific problems – Quantum Error Correction using real Google quantum hardware data, and Physics-Informed Neural Networks – where the Evolution Phase discovered human-interpretable algorithmic mechanisms that had not been previously proposed in the respective domain literatures. In both cases, the Writing Phase autonomously produced compilable LaTeX manuscripts that correctly grounded these blind discoveries in existing theory via RAG, with zero fabricated citations.

[AI-33] COSMO-Agent: Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration

【速读】:该论文旨在解决工业设计与仿真优化中的核心瓶颈问题——CAD(计算机辅助设计)与CAE(计算机辅助工程)之间的语义鸿沟,即如何将仿真反馈有效转化为满足多种耦合约束的几何修改。其解决方案的关键在于提出COSMO-Agent框架,这是一个基于强化学习(Reinforcement Learning, RL)的工具增强型系统,将CAD生成、CAE求解、结果解析与几何修订过程建模为一个交互式RL环境,使大语言模型(LLM)能够自主协调外部工具并迭代优化参数化几何结构,直至满足所有设计约束。该方法通过设计多约束奖励机制,联合优化可行性、工具链鲁棒性和输出结构有效性,从而实现稳定且工业可用的闭环设计-仿真优化流程。

链接: https://arxiv.org/abs/2604.05547
作者: Liyuan Deng,Shujian Deng,Yongkang Chen,Yongkang Dai,Zhihang Zhong,Linyang Li,Xiao Sun,Yilei Shi,Huaxi Huang
机构: 未知
类目: Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: 10 pages, 3 figures, preprint paper

点击查看摘要

Abstract:Iterative industrial design-simulation optimization is bottlenecked by the CAD-CAE semantic gap: translating simulation feedback into valid geometric edits under diverse, coupled constraints. To fill this gap, we propose COSMO-Agent (Closed-loop Optimization, Simulation, and Modeling Orchestration), a tool-augmented reinforcement learning (RL) framework that teaches LLMs to complete the closed-loop CAD-CAE process. Specifically, we cast CAD generation, CAE solving, result parsing, and geometry revision as an interactive RL environment, where an LLM learns to orchestrate external tools and revise parametric geometries until constraints are satisfied. To make this learning stable and industrially usable, we design a multi-constraint reward that jointly encourages feasibility, toolchain robustness, and structured output validity. In addition, we contribute an industry-aligned dataset that covers 25 component categories with executable CAD-CAE tasks to support realistic training and evaluation. Experiments show that COSMO-Agent training substantially improves small open-source LLMs for constraint-driven design, exceeding large open-source and strong closed-source models in feasibility, efficiency, and stability.

[AI-34] From Large Language Model Predicates to Logic Tensor Networks: Neurosymbolic Offer Validation in Regulated Procurement

【速读】:该论文旨在解决受监管公共机构中报价文档验证的难题,即如何在确保决策事实正确性的基础上实现法律可追溯性。其解决方案的关键在于提出一种神经符号(neurosymbolic)方法,将语言模型(language model)的语义理解能力与逻辑张量网络(Logic Tensor Network, LTN)的规则推理能力相结合:首先利用语言模型提取文档信息,再通过LTN聚合语义和领域知识以生成可审计的决策;最终输出结果可通过谓词真值、规则真值及对应文本片段进行解释,从而支持基于真实报价文档语料的规则校验,并显著提升系统的可解释性(XAI)。

链接: https://arxiv.org/abs/2604.05539
作者: Cedric Haufe,Frieder Stolzenburg
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 2 figures, 4 tables

点击查看摘要

Abstract:We present a neurosymbolic approach, i.e., combining symbolic and subsymbolic artificial intelligence, to validating offer documents in regulated public institutions. We employ a language model to extract information and then aggregate with an LTN (Logic Tensor Network) to make an auditable decision. In regulated public institutions, decisions must be made in a manner that is both factually correct and legally verifiable. Our neurosymbolic approach allows existing domain-specific knowledge to be linked to the semantic text understanding of language models. The decisions resulting from our pipeline can be justified by predicate values, rule truth values, and corresponding text passages, which enables rule checking based on a real corpus of offer documents. Our experiments on a real corpus show that the proposed pipeline achieves performance comparable to existing models, while its key advantage lies in its interpretability, modular predicate extraction, and explicit support for XAI (Explainable AI).
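LTN 的可审计性来自于:规则真值由各谓词真值经可微的模糊逻辑算子聚合而成,决策可逐谓词回溯。下面用积 t-范数做一个极简示意(仅为概念演示,并非论文实现;谓词名与取值均为假设):

```python
def t_norm_and(*truths):
    """积 t-范数:模糊合取,LTN 中常用的可微聚合算子之一(示意)。"""
    v = 1.0
    for t in truths:
        v *= t
    return v

def rule_truth(predicates, rule):
    """rule 列出需同时满足的谓词名;返回规则真值及逐谓词取值,便于审计。"""
    values = {name: predicates[name] for name in rule}
    return t_norm_and(*values.values()), values

# 语言模型抽取的谓词真值(假设值):报价含有效签名、在截止日期前提交
preds = {"has_signature": 0.9, "before_deadline": 1.0, "price_within_budget": 0.5}
truth, audit = rule_truth(preds, ["has_signature", "before_deadline"])
```

规则真值(此处为 0.9)连同逐谓词取值和对应文本片段一起输出,即构成摘要所述“可由谓词值、规则真值与文本段落共同解释”的决策依据。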

[AI-35] A canonical generalization of OBDD

【速读】:该论文旨在解决布尔函数在表示和计算效率上的局限性问题,特别是传统有序布尔决策图(OBDD)在处理高复杂度逻辑公式时缺乏空间简洁性的问题。解决方案的关键在于提出树决策图(Tree Decision Diagrams, TDD),这是一种基于vtree约束的结构化d-DNNF(decomposable negation normal form)的特例,能够保持与OBDD相同的多项式时间可处理性质(如模型计数、枚举、条件化和apply操作),同时显著提升压缩能力。研究表明,对于treewidth为k的CNF公式,TDD可实现固定参数可追踪(FPT)大小的表示,而这是OBDD无法实现的;此外,论文还通过自底向上的编译方法将TDD构造复杂度与因子宽度(factor width)这一理论概念联系起来,从而为高效编译提供了新的分析框架。

链接: https://arxiv.org/abs/2604.05537
作者: Florent Capelli,YooJung Choi,Stefan Mengel,Martín Muñoz,Guy Van den Broeck
机构: 未知
类目: Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
备注: Submitted to SAT26

点击查看摘要

Abstract:We introduce Tree Decision Diagrams (TDD) as a model for Boolean functions that generalizes OBDD. They can be seen as a restriction of structured d-DNNF; that is, d-DNNF that respect a vtree T . We show that TDDs enjoy the same tractability properties as OBDD, such as model counting, enumeration, conditioning, and apply, and are more succinct. In particular, we show that CNF formulas of treewidth k can be represented by TDDs of FPT size, which is known to be impossible for OBDD. We study the complexity of compiling CNF formulas into deterministic TDDs via bottom-up compilation and relate the complexity of this approach with the notion of factor width introduced by Bova and Szeider.
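TDD/OBDD 上模型计数之所以多项式时间可解,核心是自底向上递归,并对路径上被跳过的变量乘以 2 的幂补偿。下面以公式 (x0 AND x1) OR x2 的一个小型有序决策图为例做极简示意(并非论文实现,节点编码方式为本文假设):

```python
N_VARS = 3
TRUE, FALSE = "T", "F"

def count_models(node, level=0):
    """自底向上统计满足赋值数;被决策图跳过的变量以 2 的幂补偿。"""
    if node == TRUE:
        return 2 ** (N_VARS - level)
    if node == FALSE:
        return 0
    var, lo, hi = node
    skipped = 2 ** (var - level)  # level..var-1 之间未出现在路径上的变量
    return skipped * (count_models(lo, var + 1) + count_models(hi, var + 1))

# 公式 (x0 AND x1) OR x2 的有序决策图,变量序 x0 < x1 < x2
node_x2 = (2, FALSE, TRUE)       # 仅剩 x2 时:x2 为真才满足
node_or = (1, node_x2, TRUE)     # x1 OR x2
root    = (0, node_x2, node_or)  # x0=0 时退化为 x2;x0=1 时为 x1 OR x2
models  = count_models(root)
```

对照真值表验证:x2=1 的 4 个赋值加上 x2=0 且 x0=x1=1 的 1 个赋值,共 5 个模型。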

[AI-36] SignalClaw: LLM-Guided Evolutionary Synthesis of Interpretable Traffic Signal Control Skills

【速读】:该论文旨在解决交通信号控制(Traffic Signal Control, TSC)中策略有效性与可解释性难以兼顾的问题:传统强化学习方法生成的神经网络策略缺乏透明度,而基于领域特定语言的程序合成方法则受限于表达能力。其解决方案的关键在于提出SIGNALCLAW框架,利用大语言模型(Large Language Models, LLMs)作为进化技能生成器,自动合成并迭代优化具有明确推理逻辑、选择指导和可执行代码的可解释控制技能;同时引入事件驱动的组合式进化机制,通过TraCI接口识别紧急车辆、公交优先、事故和拥堵等事件,并由优先级调度器动态组合专用技能,在无需重新训练的情况下实现运行时灵活编排,从而在常规与突发事件场景下均表现出高效率与强可解释性。

链接: https://arxiv.org/abs/2604.05535
作者: Da Lei,Feng Xiao,Lu Li,Yuzhan Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traffic signal control (TSC) requires strategies that are both effective and interpretable for deployment, yet reinforcement learning produces opaque neural policies while program synthesis depends on restrictive domain-specific languages. We present SIGNALCLAW, a framework that uses large language models LLMs as evolutionary skill generators to synthesize and refine interpretable control skills for adaptive TSC. Each skill includes rationale, selection guidance, and executable code, making policies human-inspectable and self-documenting. At each generation, evolution signals from simulation metrics such as queue percentiles, delay trends, and stagnation are translated into natural language feedback to guide improvement. SignalClaw also introduces event-driven compositional evolution: an event detector identifies emergency vehicles, transit priority, incidents, and congestion via TraCI, and a priority dispatcher selects specialized skills. Each skill is evolved independently, and a priority chain enables runtime composition without retraining. We evaluate SignalClaw on routine and event-injected SUMO scenarios against four baselines. On routine scenarios, it achieves average delay of 7.8 to 9.2 seconds, within 3 to 10 percent of the best method, with low variance across random seeds. Under event scenarios, it yields the lowest emergency delay 11.2 to 18.5 seconds versus 42.3 to 72.3 for MaxPressure and 78.5 to 95.3 for DQN, and the lowest transit person delay 9.8 to 11.5 seconds versus 38.7 to 45.2 for MaxPressure. In mixed events, the dispatcher composes skills effectively while maintaining stable overall delay. The evolved skills progress from simple linear rules to conditional strategies with multi-feature interactions, while remaining fully interpretable and directly modifiable by traffic engineers.
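SignalClaw 的事件驱动调度可以概括为:检测到的事件沿优先级链匹配专用技能,运行时组合而无需重训。下面是一个极简示意(并非论文实现,优先级链与技能名均为本文假设):

```python
# 事件优先级链:越靠前优先级越高(假设的排序)
PRIORITY_CHAIN = ["emergency", "incident", "transit", "congestion", "routine"]

SKILLS = {
    "emergency": "preempt_green_for_ev",
    "incident": "flush_blocked_approach",
    "transit": "extend_bus_phase",
    "congestion": "max_pressure_like",
    "routine": "baseline_cycle",
}

def dispatch(active_events):
    """按优先级链选择应执行的技能;无事件时退回常规策略。"""
    for event in PRIORITY_CHAIN:
        if event in active_events:
            return SKILLS[event]
    return SKILLS["routine"]

chosen = dispatch({"congestion", "transit"})
```

每个技能独立进化,调度器只负责运行时选择,这正是“无需重训即可组合”的来源。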

[AI-37] Experience Transfer for Multimodal LLM Agents in Minecraft Game

【速读】:该论文旨在解决多模态大语言模型(Multimodal LLM)代理在复杂游戏环境中如何高效利用过往经验以提升新任务求解效率的问题。现有方法通常将记忆视为静态记录库,缺乏对经验的主动提取与迁移能力。解决方案的关键在于提出Echo框架,其核心创新是将可复用知识显式分解为结构(structure)、属性(attribute)、过程(process)、功能(function)和交互(interaction)五个维度,从而实现对跨任务共性模式的识别与适用性推理;在此基础上,引入上下文类比学习(In-Context Analogy Learning, ICAL),通过检索相关经验并基于上下文示例进行适应性调整,使代理能在无监督条件下快速迁移知识,实验表明该方法在Minecraft环境中实现了1.3x至1.7x的任务解锁速度提升,并观察到“爆发式链式解锁”现象,验证了经验迁移对提升代理效率与适应性的有效性。

链接: https://arxiv.org/abs/2604.05533
作者: Chenghao Li,Jun Liu,Songbo Zhang,Huadong Jian,Hao Ni,Lik-Hang Lee,Sung-Ho Bae,Guoqing Wang,Yang Yang,Chaoning Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal LLM agents operating in complex game environments must continually reuse past experience to solve new tasks efficiently. In this work, we propose Echo, a transfer-oriented memory framework that enables agents to derive actionable knowledge from prior interactions rather than treating memory as a passive repository of static records. To make transfer explicit, Echo decomposes reusable knowledge into five dimensions: structure, attribute, process, function, and interaction. This formulation allows the agent to identify recurring patterns shared across different tasks and infer what prior experience remains applicable in new situations. Building on this formulation, Echo leverages In-Context Analogy Learning (ICAL) to retrieve relevant experiences and adapt them to unseen tasks through contextual examples. Experiments in Minecraft show that, under a from-scratch learning setting, Echo achieves a 1.3x to 1.7x speed-up on object-unlocking tasks. Moreover, Echo exhibits a burst-like chain-unlocking phenomenon, rapidly unlocking multiple similar items within a short time interval after acquiring transferable experience. These results suggest that experience transfer is a promising direction for improving the efficiency and adaptability of multimodal LLM agents in complex interactive environments.

[AI-38] Inventory of the 12 007 Low-Dimensional Pseudo-Boolean Landscapes Invariant to Rank Translation and Rotation

【速读】:该论文旨在解决优化算法在不同问题实例中表现差异的根源问题,特别是针对随机优化算法(如进化算法和局部搜索)的性能评估与可比性问题。传统方法仅基于解的相对排序(rank-invariance)定义问题等价性,但忽略了邻域结构、对称性(平移与旋转)等拓扑特性对算法行为的影响。为此,作者提出“秩景观不变性”(rank landscape invariance)的新概念,将问题等价性扩展为:两个伪布尔函数(pseudo-Boolean functions)若其排名、邻域结构及对称性共同诱导出相同的景观结构,则视为等价。解决方案的关键在于构建维度为1、2、3的伪布尔函数的所有不变景观类的完整枚举,共识别出12,007个类,显著少于单纯依赖排名的分类数,揭示了非单射函数能产生更多景观不变类,并发现去欺骗性(deceptiveness)、中性(neutrality)与爬山策略性能之间的复杂耦合关系,从而为基准测试设计和景观难度建模提供理论基础。

链接: https://arxiv.org/abs/2604.05530
作者: Arnaud Liefooghe(1),Sébastien Verel(1) ((1) LISIC)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many randomized optimization algorithms are rank-invariant, relying solely on the relative ordering of solutions rather than absolute fitness values. We introduce a stronger notion of rank landscape invariance: two problems are equivalent if not only their ranking but also their neighborhood structure and symmetries (translation and rotation) induce identical landscapes. This motivates the study of rank landscapes rather than individual functions. While prior work analyzed the rankings of injective function classes in isolation, we provide an exhaustive inventory of the invariant landscape classes for pseudo-Boolean functions of dimensions 1, 2, and 3, including non-injective cases. Our analysis reveals 12,007 classes in total, a significant reduction compared to rank-invariance alone. We find that non-injective functions yield far more invariant landscape classes than injective ones. In addition, complex combinations of topological landscape properties and algorithm behaviors emerge, particularly regarding deceptiveness, neutrality, and the performance of hill-climbing strategies. The inventory serves as a resource for pedagogical purposes and benchmark design, offering a foundation for constructing larger problems with controlled hardness and advancing our understanding of landscape difficulty and algorithm performance.

[AI-39] ActivityEditor: Learning to Synthesize Physically Valid Human Mobility

【速读】:该论文旨在解决城市中人类移动建模在数据稀缺区域难以应用的问题,即现有数据驱动方法因缺乏历史轨迹数据而受限。其解决方案的关键在于提出一种名为ActivityEditor的双大语言模型(Large Language Model, LLM)代理框架,通过将复杂轨迹生成任务分解为两个协同阶段:首先由基于意图的代理利用人口统计先验生成结构化人类意图和粗粒度活动链,以保证高层次的社会语义一致性;随后由编辑代理通过迭代修订过程强化人类移动规律(human mobility law),该过程借助基于真实物理约束的多奖励机制进行强化学习训练,使代理内化移动规则并生成高保真轨迹。此设计实现了零样本跨区域轨迹生成,在多种城市场景下均表现出优异的统计保真度与物理有效性。

链接: https://arxiv.org/abs/2604.05529
作者: Chenjie Yang,Yutian Jiang,Anqi Liang,Wei Qi,Chenyu Wu,Junbo Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human mobility modeling is indispensable for diverse urban applications. However, existing data-driven methods often suffer from data scarcity, limiting their applicability in regions where historical trajectories are unavailable or restricted. To bridge this gap, we propose \textbfActivityEditor, a novel dual-LLM-agent framework designed for zero-shot cross-regional trajectory generation. Our framework decomposes the complex synthesis task into two collaborative stages. Specifically, an intention-based agent, which leverages demographic-driven priors to generate structured human intentions and coarse activity chains to ensure high-level socio-semantic coherence. These outputs are then refined by editor agent to obtain mobility trajectories through iteratively revisions that enforces human mobility law. This capability is acquired through reinforcement learning with multiple rewards grounded in real-world physical constraints, allowing the agent to internalize mobility regularities and ensure high-fidelity trajectory generation. Extensive experiments demonstrate that \textbfActivityEditor achieves superior zero-shot performance when transferred across diverse urban contexts. It maintains high statistical fidelity and physical validity, providing a robust and highly generalizable solution for mobility simulation in data-scarce scenarios. Our code is available at: this https URL.

[AI-40] Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck

【速读】:该论文针对歌唱语音转换(Singing Voice Conversion, SVC)中面临的风格泄露(style leakage)、动态渲染不稳定以及在数据有限条件下难以实现高保真生成的问题,提出了一种细粒度风格控制的新型系统。其解决方案的关键在于三项创新:一是引入边界感知的Whisper瓶颈(boundary-aware Whisper bottleneck),通过聚合音素跨度表示来抑制残留源风格并保留语言内容;二是设计显式的帧级技术矩阵(frame-level technique matrix),结合推理阶段针对性的基频(F0)处理,实现稳定且区分度高的动态风格渲染;三是采用基于感知动机的高频带补全策略(perceptually motivated high-frequency band completion),利用辅助的标准48kHz SVC模型增强高频谱信息,从而在不导致过拟合的情况下缓解数据稀缺问题。

链接: https://arxiv.org/abs/2604.05526
作者: Zhetao Hu,Yiquan Zhou,Wenyu Wang,Zhiyu Wu,Xin Gao,Jihua Zhu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:This paper presents the submission of the S4 team to the Singing Voice Conversion Challenge 2025 (SVCC2025)-a novel singing style conversion system that advances fine-grained style conversion and control within in-domain settings. To address the critical challenges of style leakage, dynamic rendering, and high-fidelity generation with limited data, we introduce three key innovations: a boundary-aware Whisper bottleneck that pools phoneme-span representations to suppress residual source style while preserving linguistic content; an explicit frame-level technique matrix, enhanced by targeted F0 processing during inference, for stable and distinct dynamic style rendering; and a perceptually motivated high-frequency band completion strategy that leverages an auxiliary standard 48kHz SVC model to augment the high-frequency spectrum, thereby overcoming data scarcity without overfitting. In the official SVCC2025 subjective evaluation, our system achieves the best naturalness performance among all submissions while maintaining competitive results in speaker similarity and technique control, despite using significantly less extra singing data than other top-performing systems. Audio samples are available online.
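边界感知瓶颈的核心操作是按音素跨度对帧级表示做池化,以抑制帧级残留的源风格、保留语言内容。下面用 numpy 做一个极简的跨度均值池化示意(并非论文实现,边界与特征均为玩具数据):

```python
import numpy as np

def span_pool(frames, boundaries):
    """将帧级特征按音素边界聚合为跨度均值,再广播回帧级(示意)。
    boundaries: 每个音素的 (起始帧, 结束帧) 半开区间。"""
    pooled = np.empty_like(frames, dtype=float)
    for start, end in boundaries:
        pooled[start:end] = frames[start:end].mean(axis=0)
    return pooled

# 6 帧、2 维特征,两个音素跨度 [0, 3) 与 [3, 6)
frames = np.array([[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]], float)
pooled = span_pool(frames, [(0, 3), (3, 6)])
```

跨度内的帧级起伏被均值抹平,只保留跨度级(近似音素级)的内容信息。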

[AI-41] Market-Bench: Benchmarking Large Language Models on Economic and Trade Competition

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在经济相关任务中资源管理与获取能力不明确的问题。其解决方案的关键在于构建了一个可配置的多智能体供应链经济模型——Market-Bench,其中LLM作为零售商代理参与竞拍有限库存(预算约束下的拍卖)和设定零售价格、生成营销口号等经济行为,并通过基于角色的注意力机制向买家传递信息。该基准系统完整记录了竞标、定价、口号、销售及资产负债状态等轨迹,支持经济性、运营性和语义性指标的自动评估,从而揭示出LLM在竞争市场中的表现差异及“赢家通吃”现象。

链接: https://arxiv.org/abs/2604.05523
作者: Yushuo Zheng(1 and 2),Huiyu Duan(1),Zicheng Zhang(1 and 2),Yucheng Zhu(1),Xiongkuo Min(1),Guangtao Zhai(1 and 2) ((1) Affiliation 1, (2) Affiliation 2)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The ability of large language models (LLMs) to manage and acquire economic resources remains unclear. In this paper, we introduce Market-Bench, a comprehensive benchmark that evaluates the capabilities of LLMs in economically-relevant tasks through economic and trade competition. Specifically, we construct a configurable multi-agent supply chain economic model where LLMs act as retailer agents responsible for procuring and retailing merchandise. In the procurement stage, LLMs bid for limited inventory in budget-constrained auctions. In the retail stage, LLMs set retail prices, generate marketing slogans, and provide them to buyers through a role-based attention mechanism for purchase. Market-Bench logs complete trajectories of bids, prices, slogans, sales, and balance-sheet states, enabling automatic evaluation with economic, operational, and semantic metrics. Benchmarking on 20 open- and closed-source LLM agents reveals significant performance disparities and a winner-take-most phenomenon, i.e., only a small subset of LLM retailers can consistently achieve capital appreciation, while many hover around the break-even point despite similar semantic matching scores. Market-Bench provides a reproducible testbed for studying how LLMs interact in competitive markets.
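
摘要中"预算约束下竞拍有限库存"的采购阶段,可以用一个假设性的密封竞价分配草图来理解(出价、预算与结算规则均为示意,并非 Market-Bench 的真实逻辑):

```python
def run_procurement_auction(bids, inventory):
    """预算约束下的密封竞价示意:按单价从高到低分配有限库存,
    每个零售商最多买到预算允许的件数,按自己的出价成交。
    bids: {零售商: (单价出价, 预算)}"""
    allocation, spend = {}, {}
    for agent, (price, budget) in sorted(bids.items(), key=lambda kv: -kv[1][0]):
        if inventory <= 0:
            break
        units = min(inventory, int(budget // price))  # 预算内可购件数
        if units > 0:
            allocation[agent] = units
            spend[agent] = units * price
            inventory -= units
    return allocation, spend

alloc, spend = run_procurement_auction(
    {"A": (5.0, 30.0), "B": (4.0, 100.0), "C": (6.0, 10.0)}, inventory=10)
```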

[AI-42] UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning ACL2026

【速读】:该论文旨在解决创造性写作中长期叙事的全局连贯性与短文本局部表达力之间的矛盾问题,同时克服现有对齐范式依赖静态奖励信号和高质量监督数据所带来的成本高、难以扩展的局限。其解决方案的关键在于提出一个统一的无参考强化学习框架 UniCreative,核心组件包括:1)自适应约束感知奖励模型 AC-GenRM,可动态合成查询相关的判别标准以提供细粒度偏好判断;2)策略优化算法 ACPO,无需监督微调即可在内容质量和结构范式上对齐人类偏好。实证结果表明,AC-GenRM 与专家评估高度一致,ACPO 显著提升多种写作任务性能,并且模型展现出一种新兴的元认知能力——能自主区分需严谨规划的任务与适合直接生成的任务,验证了直接对齐方法的有效性。

链接: https://arxiv.org/abs/2604.05517
作者: Xiaolong Wei,Zerun Zhu,Simin Niu,Xingyu Zhang,Peiying Yu,Changxuan Xiao,Yuchen Li,Jicheng Yang,Zhejun Zhao,Chong Meng,Long Xia,Daiting Shi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to Findings of ACL 2026

点击查看摘要

Abstract:A fundamental challenge in creative writing lies in reconciling the inherent tension between maintaining global coherence in long-form narratives and preserving local expressiveness in short-form texts. While long-context generation necessitates explicit macroscopic planning, short-form creativity often demands spontaneous, constraint-free expression. Existing alignment paradigms, however, typically employ static reward signals and rely heavily on high-quality supervised data, which is costly and difficult to scale. To address this, we propose UniCreative, a unified reference-free reinforcement learning framework. We first introduce AC-GenRM, an adaptive constraint-aware reward model that dynamically synthesizes query-specific criteria to provide fine-grained preference judgments. Leveraging these signals, we propose ACPO, a policy optimization algorithm that aligns models with human preferences across both content quality and structural paradigms without supervised fine-tuning and ground-truth references. Empirical results demonstrate that AC-GenRM aligns closely with expert evaluations, while ACPO significantly enhances performance across diverse writing tasks. Crucially, our analysis reveals an emergent meta-cognitive ability: the model learns to autonomously differentiate between tasks requiring rigorous planning and those favoring direct generation, validating the effectiveness of our direct alignment approach.

[AI-43] OmniDiagram: Advancing Unified Diagram Code Generation via Visual Interrogation Reward ACL2026

【速读】:该论文旨在解决当前可编程图示生成(programmable diagram generation)研究中任务形式化和语言支持范围狭窄的问题,从而限制了其在多种图示类型上的适用性。解决方案的关键在于提出OmniDiagram统一框架,该框架整合了多样化的图示代码语言与任务定义,并引入一种新颖的视觉反馈策略——Visual Interrogation Verifies All(Viva),通过生成式方法对渲染后的图示视觉结构进行细粒度验证与反馈,而非依赖脆弱的语法规则或像素级匹配。这一机制实现了无需人工标注真实代码即可自进化训练,显著提升了模型在图示代码生成任务中的性能,实验表明结合监督微调(SFT)与Viva增强的强化学习(RL)后,OmniDiagram在多个基准上达到了新的最先进水平(SOTA)。

链接: https://arxiv.org/abs/2604.05514
作者: Haoyue Yang,Xuanle Zhao,Xuexin Liu,Feibang Jiang,Yao Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026 Findings

点击查看摘要

Abstract:The paradigm of programmable diagram generation is evolving rapidly, playing a crucial role in structured visualization. However, most existing studies are confined to a narrow range of task formulations and language support, constraining their applicability to diverse diagram types. In this work, we propose OmniDiagram, a unified framework that incorporates diverse diagram code languages and task definitions. To address the challenge of aligning code logic with visual fidelity in Reinforcement Learning (RL), we introduce a novel visual feedback strategy named Visual Interrogation Verifies All (Viva). Unlike brittle syntax-based rules or pixel-level matching, Viva rewards the visual structure of rendered diagrams through a generative approach. Specifically, Viva actively generates targeted visual inquiries to scrutinize diagram visual fidelity and provides fine-grained feedback for optimization. This mechanism facilitates a self-evolving training process, effectively obviating the need for manually annotated ground truth code. Furthermore, we construct M3^2 Diagram, the first large-scale diagram code generation dataset, containing over 196k high-quality instances. Experimental results confirm that the combination of SFT and our Viva-based RL allows OmniDiagram to establish a new state-of-the-art (SOTA) across diagram code generation benchmarks.

[AI-44] Auditable Agents

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)智能体在现实世界中执行动作后,如何确保其行为可追溯、可问责的问题。核心挑战在于:一旦智能体具备触发外部副作用的能力,仅防止有害行为已不够,还需保证这些行为在部署后仍能被审计和归责。解决方案的关键是引入“可审计性”(auditability)作为实现责任归属(accountability)的前提,并定义了五个维度的可审计性指标——动作可恢复性、生命周期覆盖性、策略可验证性、责任归属性和证据完整性;同时提出三类机制(检测、强制、恢复),并通过多层次实证(包括生态级安全漏洞分析、运行时开销测试及可控恢复实验)证明:即便传统日志缺失,也能部分重建责任相关信息,从而为构建可信的LLM代理系统提供理论框架与实践路径。

链接: https://arxiv.org/abs/2604.05485
作者: Yi Nian,Aojie Yuan,Haiyue Zhang,Jiate Li,Yue Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM agents call tools, query databases, delegate tasks, and trigger external side effects. Once an agent system can act in the world, the question is no longer only whether harmful actions can be prevented–it is whether those actions remain answerable after deployment. We distinguish accountability (the ability to determine compliance and assign responsibility), auditability (the system property that makes accountability possible), and auditing (the process of reconstructing behavior from trustworthy evidence). Our claim is direct: no agent system can be accountable without auditability. To make this operational, we define five dimensions of agent auditability, i.e., action recoverability, lifecycle coverage, policy checkability, responsibility attribution, and evidence integrity, and identify three mechanism classes (detect, enforce, recover) whose temporal information-and-intervention constraints explain why, in practice, no single approach suffices. We support the position with layered evidence rather than a single benchmark: lower-bound ecosystem measurements suggest that even basic security prerequisites for auditability are widely unmet (617 security findings across six prominent open-source projects); runtime feasibility results show that pre-execution mediation with tamper-evident records adds only 8.3 ms median overhead; and controlled recovery experiments show that responsibility-relevant information can be partially recovered even when conventional logs are missing. We propose an Auditability Card for agent systems and identify six open research problems organized by mechanism class.
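
摘要提到"带防篡改记录的执行前中介"。哈希链是实现防篡改(tamper-evident)日志最常见的思路之一:每条记录携带前一条的哈希,任何事后改动都会破坏校验。下面是一个极简草图(仅为原理示意,不代表论文的具体实现):

```python
import hashlib
import json

class AuditLog:
    """哈希链审计日志的极简示意:append 时把前一条的哈希
    拼入本条摘要,verify 时逐条重算并核对整条链。"""
    def __init__(self):
        self.entries = []

    def append(self, record: dict) -> None:
        prev = self.entries[-1]["hash"] if self.entries else "0" * 64
        payload = json.dumps(record, sort_keys=True)
        h = hashlib.sha256((prev + payload).encode()).hexdigest()
        self.entries.append({"record": record, "prev": prev, "hash": h})

    def verify(self) -> bool:
        prev = "0" * 64
        for e in self.entries:
            payload = json.dumps(e["record"], sort_keys=True)
            if e["prev"] != prev:
                return False
            if hashlib.sha256((prev + payload).encode()).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True

log = AuditLog()
log.append({"tool": "db_query", "agent": "a1"})
log.append({"tool": "send_email", "agent": "a1"})
ok_before = log.verify()
log.entries[0]["record"]["tool"] = "rm_rf"   # 模拟事后篡改
ok_after = log.verify()
```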

[AI-45] On the Role of Fault Localization Context for LLM-Based Program Repair

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的自动化程序修复(Automated Program Repair, APR)中故障定位(Fault Localization, FL)策略的有效性问题,具体包括:所需定位粒度的程度、超出预测错误位置的上下文是否有益,以及如何高效获取此类上下文。其关键解决方案在于通过大规模实证研究(基于500个SWE-bench Verified实例)系统评估不同层级上下文(文件级、元素级、行级)对修复性能的影响,发现文件级定位是主导因素(相比无文件基线提升15–17倍),适度扩展相关文件数量(约6–10个)可显著提升成功率;而行级上下文常因噪声放大导致性能下降,元素级上下文收益依赖于文件上下文质量;同时,LLM驱动的检索方法优于传统结构启发式方法,且更节省资源。最终提出最优策略应结合高层语义理解与精准行级定位,挑战了“增加上下文必然提升修复效果”的假设。

链接: https://arxiv.org/abs/2604.05481
作者: Melika Sepidband,Hung Viet Pham,Hadi Hemmati
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 30 pages, 8 figures

点击查看摘要

Abstract:Fault Localization (FL) is a key component of Large Language Model (LLM)-based Automated Program Repair (APR), yet its impact remains underexplored. In particular, it is unclear how much localization is needed, whether additional context beyond the predicted buggy location is beneficial, and how such context should be retrieved. We conduct a large-scale empirical study on 500 SWE-bench Verified instances using GPT-5-mini, evaluating 61 configurations that vary file-level, element-level, and line-level context. Our results show that more context does not consistently improve repair performance. File-level localization is the dominant factor, yielding a 15-17x improvement over a no-file baseline. Expanding file context is often associated with improved performance, with successful repairs most commonly observed in configurations with approximately 6-10 relevant files. Element-level context expansion provides conditional gains that depend strongly on the file context quality, while line-level context expansion frequently degrades performance due to noise amplification. LLM-based retrieval generally outperforms structural heuristics while using fewer files and tokens. Overall, the most effective FL context strategy typically combines a broad semantic understanding at higher abstraction levels with precise line-level localization. These findings challenge our assumption that increasing the localization context uniformly improves APR, and provide practical guidance for designing LLM-based FL strategies.

[AI-46] OntoTKGE: Ontology-Enhanced Temporal Knowledge Graph Extrapolation

【速读】:该论文旨在解决时间知识图谱(Temporal Knowledge Graph, TKG)外推任务中实体历史交互稀疏的问题,即如何提升模型对低频或新出现实体的预测能力。现有方法通常忽略概念层面的语义信息,导致此类实体难以学习有效的行为模式。解决方案的关键在于提出一种新颖的编码器-解码器框架 OntoTKGE,通过融合来自本体视图知识图谱(ontology-view KG,即建模抽象概念间层次关系及概念与实体关联的KG)的本体知识,引导TKG外推模型的学习过程,从而增强实体嵌入表示。该方法能够有效利用概念层级中的行为模式迁移机制,使稀疏实体受益于同概念下其他实体的行为特征,且具备良好的通用性,可适配多种主流TKG外推模型。

链接: https://arxiv.org/abs/2604.05468
作者: Dongying Lin,Yinan Liu,Shengwei tang,Bin Wang,Xiaochun Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 7 figures

点击查看摘要

Abstract:Temporal knowledge graph (TKG) extrapolation is an important task that aims to predict future facts through historical interaction information within KG snapshots. A key challenge for most existing TKG extrapolation models is handling entities with sparse historical interaction. The ontological knowledge is beneficial for alleviating this sparsity issue by enabling these entities to inherit behavioral patterns from other entities with the same concept, which is ignored by previous studies. In this paper, we propose a novel encoder-decoder framework OntoTKGE that leverages the ontological knowledge from the ontology-view KG (i.e., a KG modeling hierarchical relations among abstract concepts as well as the connections between concepts and entities) to guide the TKG extrapolation model’s learning process through the effective integration of the ontological and temporal knowledge, thereby enhancing entity embeddings. OntoTKGE is flexible enough to adapt to many TKG extrapolation models. Extensive experiments on four data sets demonstrate that OntoTKGE not only significantly improves the performance of many TKG extrapolation models but also surpasses many SOTA baseline methods.
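
OntoTKGE 的直觉是让历史交互稀疏的实体"继承"同概念实体的行为模式。下面用一个假设性的嵌入混合草图示意这一思路(alpha 线性混合公式为说明而设,并非论文的原始方法):

```python
def enrich_embedding(entity, emb, concept_of, alpha=0.5):
    """将稀疏实体的嵌入与同概念其他实体的平均嵌入按 alpha 混合,
    使其吸收概念层面的共性行为模式(示意实现)。"""
    c = concept_of[entity]
    peers = [e for e, cc in concept_of.items() if cc == c and e != entity]
    if not peers:                      # 没有同概念实体时保持原嵌入
        return list(emb[entity])
    dim = len(emb[entity])
    mean = [sum(emb[p][i] for p in peers) / len(peers) for i in range(dim)]
    return [alpha * x + (1 - alpha) * m for x, m in zip(emb[entity], mean)]

emb = {"e_new": [1.0, 1.0], "e_rich": [3.0, 3.0], "e_other": [0.0, 9.0]}
concept_of = {"e_new": "国家", "e_rich": "国家", "e_other": "人物"}
enriched = enrich_embedding("e_new", emb, concept_of)
```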

[AI-47] Adaptive Serverless Resource Management via Slot-Survival Prediction and Event-Driven Lifecycle Control

【速读】:该论文旨在解决无服务器计算(Serverless Computing)中因冷启动延迟(cold start latency)和资源利用率低下所带来的性能瓶颈与成本问题。传统静态资源配置在动态工作负载下易导致性能下降或资源浪费,为此,作者提出一种自适应工程框架,其核心在于双策略机制:一是基于滑动窗口聚合与异步处理的资源生命周期主动管理;二是通过槽位存活概率预测(slot survival prediction)实现智能请求等待策略,并动态调整空闲时长,从而显著降低冷启动频率(最高减少51.2%)并提升近两倍的成本效益。

链接: https://arxiv.org/abs/2604.05465
作者: Zeyu Wang,Cuiqianhe Du,Renyue Zhang,Kejian Tong,Qi He,Qiyuan Tian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Serverless computing eliminates infrastructure management overhead but introduces significant challenges regarding cold start latency and resource utilization. Traditional static resource allocation often leads to inefficiencies under variable workloads, resulting in performance degradation or excessive costs. This paper presents an adaptive engineering framework that optimizes serverless performance through event-driven architecture and probabilistic modeling. We propose a dual-strategy mechanism that dynamically adjusts idle durations and employs an intelligent request waiting strategy based on slot survival predictions. By leveraging sliding window aggregation and asynchronous processing, our system proactively manages resource lifecycles. Experimental results show that our approach reduces cold starts by up to 51.2% and improves cost-efficiency by nearly 2x compared to baseline methods in multi-cloud environments.
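
摘要中的"槽位存活概率预测"与"动态调整空闲时长",可以用指数到达假设下的小模型来直观理解(该模型与数值均为示意性假设,并非论文公式):

```python
import math

def p_request_within(t: float, mean_interarrival: float) -> float:
    """指数到达模型下,t 秒内再次收到请求(槽位被复用)的概率。"""
    return 1.0 - math.exp(-t / mean_interarrival)

def choose_idle_duration(mean_interarrival: float, p_target: float) -> float:
    """求最短保活时长 t,使保活窗口内被复用的概率达到 p_target:
    由 1 - exp(-t/mean) = p_target 解得 t = -mean * ln(1 - p_target)。"""
    return -mean_interarrival * math.log(1.0 - p_target)

# 平均每 10 秒来一次请求,希望 90% 的概率在保活期内复用槽位
t = choose_idle_duration(mean_interarrival=10.0, p_target=0.9)
```

真实系统还需结合滑动窗口统计实时更新到达率,并在复用收益与保活成本之间权衡。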

[AI-48] MA-IDS: Multi-Agent RAG Framework for IoT Network Intrusion Detection with an Experience Library

【速读】:该论文旨在解决网络入侵检测系统(Network Intrusion Detection Systems, NIDS)在物联网(IoT)环境中面临的两大核心挑战:一是传统基于签名的方法难以识别零日攻击(zero-day attacks)和已知攻击的变种;二是现有机器学习方法缺乏可解释性,且难以适应资源受限和协议异构的IoT场景。为此,作者提出MA-IDS,一种基于多智能体架构的入侵检测系统,其关键创新在于将大语言模型(Large Language Models, LLMs)与检索增强生成(Retrieval Augmented Generation, RAG)相结合,通过一个持续构建的“经验库”(Experience Library)对LLM推理进行 grounding,并利用FAISS向量数据库支持两个专用智能体协同工作:流量分类智能体在每次推理前检索历史错误规则,错误分析智能体则将误分类结果转化为人类可读的检测规则并存储以供未来检索,从而实现无需修改底层语言模型的持续学习机制。该方案不仅显著提升了检测性能(在NF-BoT-IoT和NF-ToN-IoT数据集上Macro F1-Score分别达89.75%和85.22%),还提供了每个决策的规则级解释,为可解释、自进化型IoT入侵检测提供了理论可行路径。

链接: https://arxiv.org/abs/2604.05458
作者: Md Shamimul Islam,Luis G. Jaimes,Ayesha S. Dina
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Preprint. Submitted to IEEE conference

点击查看摘要

Abstract:Network Intrusion Detection Systems (NIDS) face important limitations. Signature-based methods are effective for known attack patterns, but they struggle to detect zero-day attacks and often miss modified variants of previously known attacks, while many machine learning approaches offer limited interpretability. These challenges become even more severe in IoT environments because of resource constraints and heterogeneous protocols. To address these issues, we propose MA-IDS, a Multi-Agent Intrusion Detection System that combines Large Language Models (LLMs) with Retrieval Augmented Generation (RAG) for reasoning-driven intrusion detection. The proposed framework grounds LLM reasoning through a persistent, self-building Experience Library. Two specialized agents collaborate through a FAISS-based vector database: a Traffic Classification Agent that retrieves past error rules before each inference, and an Error Analysis Agent that converts misclassifications into human-readable detection rules stored for future retrieval, enabling continual learning through external knowledge accumulation, without modifying the underlying language model. Evaluated on NF-BoT-IoT and NF-ToN-IoT benchmark datasets, MA-IDS achieves Macro F1-Scores of 89.75% and 85.22%, improving over zero-shot baselines of 17% and 4.96% by more than 72 and 80 percentage points. These results are competitive with SVM while providing rule-level explanations for every classification decision, demonstrating that retrieval-augmented reasoning offers a principled path toward explainable, self-improving intrusion detection for IoT networks.
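
MA-IDS 的经验库在每次推理前按向量相似度取回历史错误规则(论文使用 FAISS 向量库)。下面用纯 Python 的余弦相似度给出一个极简示意(嵌入向量与规则文本均为虚构示例):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

class ExperienceLibrary:
    """经验库草图:错误分析代理存入(嵌入, 规则文本),
    流量分类代理推理前取回最相似的 k 条历史错误规则。"""
    def __init__(self):
        self.items = []                       # [(embedding, rule_text)]

    def add(self, emb, rule):
        self.items.append((emb, rule))

    def retrieve(self, query_emb, k=2):
        ranked = sorted(self.items, key=lambda it: -cosine(it[0], query_emb))
        return [rule for _, rule in ranked[:k]]

lib = ExperienceLibrary()
lib.add([1.0, 0.0], "UDP 小包洪泛 → DDoS")
lib.add([0.0, 1.0], "异常 DNS 查询 → 数据外泄")
top = lib.retrieve([0.9, 0.1], k=1)
```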

[AI-49] LanG – A Governance-Aware Agentic AI Platform for Unified Security Operations

【速读】:该论文旨在解决现代安全运营中心(Security Operations Center, SOC)面临的告警疲劳、工具碎片化以及跨源事件关联能力不足的问题,这些问题当前的SIEM(Security Information and Event Management)和XDR(Extended Detection and Response)系统仅部分缓解。其解决方案的关键在于提出一个名为LanG(LLM-assisted network Governance)的开源治理感知型智能体AI平台,该平台通过五大核心组件实现统一安全运营:(i) 基于相关引擎的统一事件上下文记录(F1=87%),(ii) 在LangGraph上实现的人机协同智能体编排器,(iii) 基于微调大语言模型(LLM)的规则生成器,可自动产出高可部署性的Snort/Suricata/YARA规则(平均接受率96.2%),(iv) 结合社区检测、LLM假设生成与贝叶斯评分的三阶段攻击重建机制(杀伤链准确率87.5%),以及(v) 分层架构下的治理-MCP-智能体-AI-Security体系,其中所有工具均通过Model Context Protocol暴露,并由双层护栏管道(正则+语义分类器)保障AI治理策略执行(F1=98.1%,零误报)。此方案实现了多租户隔离、本地化部署与高性能推理(<21ms推理延迟),显著优于现有八大SOC平台。

链接: https://arxiv.org/abs/2604.05440
作者: Anes Abdennebi,Nadjia Kara,Laaziz Lahlou,Hakima Ould-Slimane
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern Security Operations Centers struggle with alert fatigue, fragmented tooling, and limited cross-source event correlation, challenges that current Security Information Event Management and Extended Detection and Response systems only partially address through fragmented tools. This paper presents the LLM-assisted network Governance (LanG), an open-source, governance-aware agentic AI platform for unified security operations contributing: (i) a Unified Incident Context Record with a correlation engine (F1 = 87%), (ii) an Agentic AI Orchestrator on LangGraph with human-in-the-loop checkpoints, (iii) an LLM-based Rule Generator finetuned on four base models producing deployable Snort 2/3, Suricata, and YARA rules (average acceptance rate 96.2%), (iv) a Three-Phase Attack Reconstructor combining Louvain community detection, LLM-driven hypothesis generation, and Bayesian scoring (87.5% kill-chain accuracy), and (v) a layered Governance-MCP-Agentic AI-Security architecture where all tools are exposed via the Model Context Protocol, governed by an AI Governance Policy Engine with a two-layer guardrail pipeline (regex + Llama Prompt Guard 2 semantic classifier, achieving 98.1% F1 score with experimental zero false positives). Designed for Managed Security Service Providers, the platform supports multi-tenant isolation, role-based access, and fully local deployment. Finetuned anomaly and threat detectors achieve weighted F1 scores of 99.0% and 91.0%, respectively, in intrusion-detection benchmarks, running inferences in ≈ 21 ms with a machine-side mean time to detect of 1.58 s, and the rule generator exceeds 91% deployability on live IDS engines. A systematic comparison against eight SOC platforms confirms that LanG uniquely satisfies multiple industrial capabilities all in one open-source tool, while enforcing selected AI governance policies.
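
其中"正则 + 语义分类器"的两层护栏流水线,可概括为先做廉价的模式拦截、再做语义判定。下面是一个假设性草图(第二层用关键词打分占位,论文中为 Llama Prompt Guard 2 微调模型):

```python
import re

# 第一层:已知危险模式的正则黑名单(示例模式,非 LanG 实际规则)
BLOCK_PATTERNS = [re.compile(p, re.I) for p in (r"rm\s+-rf", r"drop\s+table")]

def semantic_score(text: str) -> float:
    """第二层语义分类器的占位实现,仅以关键词近似打分,
    便于演示流水线结构(真实系统为专门的分类模型)。"""
    return 0.95 if "ignore previous instructions" in text.lower() else 0.05

def guardrail(text: str, threshold: float = 0.5) -> str:
    if any(p.search(text) for p in BLOCK_PATTERNS):
        return "blocked:regex"          # 第一层命中,直接拦截
    if semantic_score(text) >= threshold:
        return "blocked:semantic"       # 第二层判定为提示注入等攻击
    return "allowed"
```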

[AI-50] Automated Auditing of Hospital Discharge Summaries for Care Transitions

【速读】:该论文旨在解决住院患者出院记录(discharge summary)不完整或不一致导致的医疗照护碎片化和可避免再入院的问题。当前出院记录的质量审计主要依赖人工审核,难以规模化实施。解决方案的关键在于提出了一种基于本地部署的大语言模型(Large Language Models, LLMs)的自动化审计框架,将过渡期照护的核心要求(如随访指导、用药史及变更、患者信息与临床病程等)转化为结构化的验证检查清单,并利用隐私保护的LLM对MIMIC-IV数据库中的成人住院患者出院记录进行关键要素的存在性、缺失性或模糊性识别,从而实现大规模、可扩展的临床文档质量审计,为电子健康记录(EHR)文档质量的系统性改进提供基础。

链接: https://arxiv.org/abs/2604.05435
作者: Akshat Dasula,Prasanna Desikan,Jaideep Srivastava
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted as a poster at IEEE-ICHI 2026; 3 pages, 2 figure

点击查看摘要

Abstract:Incomplete or inconsistent discharge documentation is a primary driver of care fragmentation and avoidable readmissions. Despite its critical role in patient safety, auditing discharge summaries relies heavily on manual review and is difficult to scale. We propose an automated framework for large-scale auditing of discharge summaries using locally deployed Large Language Models (LLMs). Our approach operationalizes core transition-of-care requirements such as follow-up instructions, medication history and changes, patient information and clinical course, etc. into a structured validation checklist of questions based on the DISCHARGED framework. Using adult inpatient summaries from the MIMIC-IV database, we utilize a privacy-preserving LLM to identify the presence, absence, or ambiguity of key documentation elements. This work demonstrates the feasibility of scalable, automated clinical auditing and provides a foundation for systematic quality improvement in electronic health record documentation.
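
该方法把过渡期照护要求转写为结构化核查清单,再让 LLM 对每条要素判定"存在 / 缺失 / 模糊"。下面的草图示意回答归一化这一步(清单条目与判定规则均为笔者假设):

```python
CHECKLIST = [
    "是否写明随访指导?",
    "是否记录用药史与用药变更?",
    "是否概述患者信息与住院临床病程?",
]

def audit_summary(raw_answers):
    """把 LLM 对每条核查问题的原始回答归一为
    present / absent / ambiguous 三种判定(示意规则)。
    raw_answers: 与 CHECKLIST 一一对应的回答列表。"""
    verdicts = []
    for a in raw_answers:
        a = a.strip().lower()
        if a.startswith("yes"):
            verdicts.append("present")
        elif a.startswith("no"):
            verdicts.append("absent")
        else:
            verdicts.append("ambiguous")
    return dict(zip(CHECKLIST, verdicts))

report = audit_summary(["Yes, follow-up in 2 weeks.", "No.", "Partially mentioned."])
```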

[AI-51] Your LLM Agent Can Leak Your Data: Data Exfiltration via Backdoored Tool Use

【速读】:该论文旨在解决工具调用型大语言模型(Tool-use LLM)代理在敏感工作流中因后门攻击导致的数据泄露问题,尤其是系统性、隐蔽性的数据外泄风险。其解决方案的关键在于提出了一种名为 Back-Reveal 的攻击方法:通过在微调过程中嵌入语义触发器(semantic triggers),使后门代理在被触发时调用内存访问工具以获取用户上下文,并伪装成正常检索请求将敏感信息外泄。该方法利用多轮交互放大泄露影响,攻击者可通过操控检索响应逐步引导代理行为和用户交互,实现持续且累积的信息泄露,从而揭示了具备工具访问权限的LLM代理存在严重安全漏洞,亟需针对性防御机制。

链接: https://arxiv.org/abs/2604.05432
作者: Wuyang Zhang,Shichao Pei
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: The 64th Annual Meeting of the Association for Computational Linguistics

点击查看摘要

Abstract:Tool-use large language model (LLM) agents are increasingly deployed to support sensitive workflows, relying on tool calls for retrieval, external API access, and session memory management. While prior research has examined various threats, the risk of systematic data exfiltration by backdoored agents remains underexplored. In this work, we present Back-Reveal, a data exfiltration attack that embeds semantic triggers into fine-tuned LLM agents. When triggered, the backdoored agent invokes memory-access tool calls to retrieve stored user context and exfiltrates it via disguised retrieval tool calls. We further demonstrate that multi-turn interaction amplifies the impact of data exfiltration, as attacker-controlled retrieval responses can subtly steer subsequent agent behavior and user interactions, enabling sustained and cumulative information leakage over time. Our experimental results expose a critical vulnerability in LLM agents with tool access and highlight the need for defenses against exfiltration-oriented backdoors.

[AI-52] ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads

【速读】:该论文旨在解决低秩适配(Low-Rank Adaptation, LoRA)在多任务、多租户环境下进行超参数调优时存在的计算资源浪费和GPU利用率低的问题。现有系统通常独立处理每个LoRA任务,导致大量弱配置浪费计算资源且无法有效共享GPU容量。解决方案的关键在于ALTO(Adaptive LoRA Tuning and Orchestration)系统的设计:其核心洞察是多个并发LoRA任务在共享冻结的模型主干时,能暴露单任务设计无法利用的优化机会;ALTO通过监控损失轨迹早期终止表现不佳的配置,采用融合分组GEMM(fused grouped GEMM)与新的秩局部适配器并行策略来共置存活的适配器并释放GPU空间,并结合任务内与任务间调度机制,利用LoRA任务执行时间的可预测性提升多任务部署效率,从而实现高达13.8倍的加速比而不牺牲适配器质量。

链接: https://arxiv.org/abs/2604.05426
作者: Jingwei Zuo,Xinze Feng,Zien Liu,Kaijian Wang,Fanjiang Ye,Ye Cao,Zhuang Wang,Yuke Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) is now the dominant method for parameter-efficient fine-tuning of large language models, but achieving a high-quality adapter often requires systematic hyperparameter tuning because LoRA performance is highly sensitive to configuration choices. In practice, this leads to many concurrent LoRA jobs, often spanning heterogeneous tasks in multi-tenant environments. Existing systems largely handle these jobs independently, which both wastes computation on weak candidates and leaves GPUs underutilized. We present ALTO (Adaptive LoRA Tuning and Orchestration), a co-designed training system that accelerates LoRA hyperparameter tuning while enabling efficient cluster sharing across heterogeneous tasks. The central insight behind ALTO is that when multiple tuning jobs run concurrently over a shared frozen backbone, they expose optimization opportunities that single-job designs cannot exploit. Building on this, ALTO monitors loss trajectories to terminate unpromising configurations early, uses fused grouped GEMM together with a new rank-local adapter parallelism to co-locate surviving adapters and reclaim freed GPU capacity, and combines intra-task and inter-task scheduling to improve multi-task placement by leveraging the predictable duration of LoRA jobs. Extensive evaluation shows that ALTO achieves up to 13.8× speedup over state-of-the-art without sacrificing adapter quality.
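
ALTO 依据损失轨迹提前终止表现不佳的超参配置。下面用一个极简草图示意"按轨迹淘汰"的逻辑(配置名、损失数值与 margin 阈值均为虚构,仅为说明机制):

```python
def prune_configs(loss_traj, step, margin=0.2):
    """早停示意:在第 step 步,淘汰损失落后当前最优超过 margin 的配置,
    返回继续训练的存活配置列表。
    loss_traj: {配置名: [各步损失]}"""
    best = min(traj[step] for traj in loss_traj.values())
    return [cfg for cfg, traj in loss_traj.items() if traj[step] - best <= margin]

trajs = {
    "r8_lr1e-4":  [2.0, 1.2, 0.8],
    "r16_lr1e-4": [2.0, 1.1, 0.7],
    "r8_lr1e-2":  [2.0, 1.9, 1.8],   # 明显欠佳,应被早停
}
alive = prune_configs(trajs, step=2)
```

被淘汰的配置释放出的 GPU 容量,随后可由存活适配器通过融合分组 GEMM 共置利用。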

[AI-53] Multi-Agent Pathfinding with Non-Unit Integer Edge Costs via Enhanced Conflict-Based Search and Graph Discretization

【速读】:该论文旨在解决传统多智能体路径规划(Multi-Agent Pathfinding, MAPF)方法在现实场景中应用受限的问题,特别是其假设边权为单位值且动作时间为离散单步的局限性。为此,作者提出了一种新的MAPF变体——MAPFZ,它支持非单位整数边权并保持有限状态空间,从而在提升现实性的同时保证求解效率。解决方案的关键在于:一是设计了CBS-NIC框架,结合基于时间区间冲突检测与改进的Safe Interval Path Planning (SIPP)算法,以高效处理连续时间下的路径规划;二是提出了贝叶斯优化图设计(Bayesian Optimization for Graph Design, BOGD)方法,用于对非单位边权进行离散化,实现效率与精度之间的平衡,并具有次线性后悔边界(sub-linear regret bound)。

链接: https://arxiv.org/abs/2604.05416
作者: Hongkai Fan,Qinjing Xie,Bo Ouyang,Yaonan Wang,Zhi Yan,Jiawen He,Zheng Fang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 7 figures, submitted to cs.AI, Multi-Agent Systems, Pathfinding Optimization

点击查看摘要

Abstract:Multi-Agent Pathfinding (MAPF) plays a critical role in various domains. Traditional MAPF methods typically assume unit edge costs and single-timestep actions, which limit their applicability to real-world scenarios. MAPFR extends MAPF to handle non-unit costs with real-valued edge costs and continuous-time actions, but its geometric collision model leads to an unbounded state space that compromises solver efficiency. In this paper, we propose MAPFZ, a novel MAPF variant on graphs with non-unit integer costs that preserves a finite state space while offering improved realism over classical MAPF. To solve MAPFZ efficiently, we develop CBS-NIC, an enhanced Conflict-Based Search framework incorporating time-interval-based conflict detection and an improved Safe Interval Path Planning (SIPP) algorithm. Additionally, we propose Bayesian Optimization for Graph Design (BOGD), a discretization method for non-unit edge costs that balances efficiency and accuracy with a sub-linear regret bound. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods in runtime and success rate across diverse benchmark scenarios.
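
CBS-NIC 的"基于时间区间的冲突检测",核心可归结为:两个智能体对同一顶点的占用区间是否重叠。下面是一个示意性实现(智能体与区间数据均为虚构):

```python
def intervals_conflict(a, b):
    """两个半开时间区间 [start, end) 是否重叠。"""
    return a[0] < b[1] and b[0] < a[1]

def detect_vertex_conflicts(occupancy):
    """基于时间区间的顶点冲突检测(示意):
    occupancy: [(智能体, 顶点, (start, end)), ...],返回冲突三元组列表。"""
    conflicts = []
    for i in range(len(occupancy)):
        for j in range(i + 1, len(occupancy)):
            ai, vi, ti = occupancy[i]
            aj, vj, tj = occupancy[j]
            if vi == vj and intervals_conflict(ti, tj):
                conflicts.append((ai, aj, vi))
    return conflicts

occ = [("a1", "v3", (0, 4)), ("a2", "v3", (3, 6)), ("a3", "v5", (0, 9))]
cs = detect_vertex_conflicts(occ)
```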

[AI-54] CODESTRUCT: Code Agents over Structured Action Spaces

【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的代码代理(code agent)在处理代码库时因将代码视为无结构文本而引发的脆弱性问题,具体表现为依赖字符串匹配进行修改时易受格式漂移(formatting drift)或模式歧义影响,导致补丁生成失败。解决方案的关键在于重构代码库为结构化的操作空间,使代理不再作用于文本片段,而是直接操作命名的抽象语法树(Abstract Syntax Tree, AST)节点;其核心框架CODESTRUCT通过readCode接口获取完整的语法单元,并利用editCode接口对语义程序元素执行语法验证后的变换,从而提升代码修改的准确性和可靠性。

链接: https://arxiv.org/abs/2604.05407
作者: Myeongsoo Kim,Joe Hsu,Dingmin Wang,Shweta Garg,Varun Kumar,Murali Krishna Ramanathan
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:LLM-based code agents treat repositories as unstructured text, applying edits through brittle string matching that frequently fails due to formatting drift or ambiguous patterns. We propose reframing the codebase as a structured action space where agents operate on named AST entities rather than text spans. Our framework, CODESTRUCT, provides readCode for retrieving complete syntactic units and editCode for applying syntax-validated transformations to semantic program elements. Evaluated on SWE-Bench Verified across six LLMs, CODESTRUCT improves Pass@1 accuracy by 1.2-5.0% while reducing token consumption by 12-38% for most models. Models that frequently fail to produce valid patches under text-based interfaces benefit most: GPT-5-nano improves by 20.8% as empty-patch failures drop from 46.6% to 7.2%. On CodeAssistBench, we observe consistent accuracy gains (+0.8-4.4%) with cost reductions up to 33%. Our results show that structure-aware interfaces offer a more reliable foundation for code agents.
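
readCode / editCode 的思路(按命名 AST 节点取回完整语法单元、在语法校验后替换)可以用 Python 标准库 ast 做一个极简演示。接口名借用论文术语,实现为笔者的示意草图:

```python
import ast

SOURCE = "def add(a, b):\n    return a - b\n"   # 含 bug:应为 a + b

def read_code(source: str, name: str) -> str:
    """按名称取回完整的函数定义(完整语法单元,而非文本片段)。"""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            return ast.get_source_segment(source, node)
    raise KeyError(name)

def edit_code(source: str, name: str, new_def: str) -> str:
    """用语法校验过的新定义替换同名函数;任一步校验失败即拒绝编辑。"""
    ast.parse(new_def)                  # 补丁本身必须语法合法
    old = read_code(source, name)
    patched = source.replace(old, new_def.rstrip())
    ast.parse(patched)                  # 整体替换后仍须可解析
    return patched

patched = edit_code(SOURCE, "add", "def add(a, b):\n    return a + b\n")
```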

[AI-55] HYVE: Hybrid Views for LLM Context Engineering over Machine Data

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在处理包含大规模机器数据(machine data)输入时的性能瓶颈问题,尤其是当输入结构复杂、嵌套深度高且重复内容多时,LLMs表现出上下文敏感性差、token消耗高、推理效率低等缺陷。解决方案的关键在于提出HYVE(HYbrid ViEw)框架,其核心机制是围绕请求作用域的数据存储(request-scoped datastore)构建预处理与后处理协同流程:预处理阶段通过检测重复结构并将其材料化至数据库中,生成混合列式和行式视图,仅向LLM暴露最相关的表示;后处理阶段则根据输出情况选择直接返回结果、查询数据存储恢复信息或进行受限的SQL增强语义合成。该方法显著降低了token使用量(50–90%),同时提升了生成准确性(如图表生成任务提升132%)和响应速度(降低83%延迟),实现了对大规模机器数据输入的有效上下文扩展。

链接: https://arxiv.org/abs/2604.05400
作者: Jian Tan,Fan Bu,Yuqing Gao,Dev Khanolkar,Jason Mackay,Boris Sobolev,Lei Jin,Li Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 6 figures

点击查看摘要

Abstract:Machine data is central to observability and diagnosis in modern computing systems, appearing in logs, metrics, telemetry traces, and configuration snapshots. When provided to large language models (LLMs), this data typically arrives as a mixture of natural language and structured payloads such as JSON or Python/AST literals. Yet LLMs remain brittle on such inputs, particularly when they are long, deeply nested, and dominated by repetitive structure. We present HYVE (HYbrid ViEw), a framework for LLM context engineering for inputs containing large machine-data payloads, inspired by database management principles. HYVE surrounds model invocation with coordinated preprocessing and postprocessing, centered on a request-scoped datastore augmented with schema information. During preprocessing, HYVE detects repetitive structure in raw inputs, materializes it in the datastore, transforms it into hybrid columnar and row-oriented views, and selectively exposes only the most relevant representation to the LLM. During postprocessing, HYVE either returns the model output directly, queries the datastore to recover omitted information, or performs a bounded additional LLM call for SQL-augmented semantic synthesis. We evaluate HYVE on diverse real-world workloads spanning knowledge QA, chart generation, anomaly detection, and multi-step network troubleshooting. Across these benchmarks, HYVE reduces token usage by 50-90% while maintaining or improving output quality. On structured generation tasks, it improves chart-generation accuracy by up to 132% and reduces latency by up to 83%. Overall, HYVE offers a practical approximation to an effectively unbounded context window for prompts dominated by large machine-data payloads. 
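
HYVE 预处理的一个关键动作,是把重复结构的行式记录物化为列式视图,使键名只出现一次,从而压缩提示词中的重复 token。下面是一个极简草图(日志记录为虚构示例):

```python
import json

def to_columnar(records):
    """把结构一致的行式记录转为列式视图(示意):
    键只保留一份,值按列聚合。"""
    if not records:
        return {}
    keys = list(records[0])
    return {k: [r[k] for r in records] for k in keys}

rows = [
    {"ts": 1, "level": "INFO", "msg": "start"},
    {"ts": 2, "level": "WARN", "msg": "retry"},
    {"ts": 3, "level": "INFO", "msg": "done"},
]
cols = to_columnar(rows)
# 序列化长度之差粗略反映可节省的重复 token
saved = len(json.dumps(rows)) - len(json.dumps(cols))
```

实际系统会按查询需要混合列式与行式视图,并只向 LLM 暴露最相关的表示。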

[AI-56] Reason Analogically via Cross-domain Prior Knowledge: An Empirical Study of Cross-domain Knowledge Transfer for In-Context Learning

【速读】:该论文旨在解决现有上下文学习(In-Context Learning, ICL)依赖于目标域专家标注示例的问题,尤其是在专家标注稀缺时性能受限的局限性。其解决方案的关键在于提出并验证跨域知识迁移的可行性:通过检索源域中具有相似推理结构(reasoning structure)的示例来增强目标域的推理能力,而非依赖语义一致的示例。实验表明,当检索到的示例数量超过某一吸收阈值(example absorption threshold)时,可实现条件性的正向迁移效应,且额外示例带来更大性能提升;进一步分析表明,这种增益主要源于跨域示例对目标域推理结构的修复作用,而非语义线索的引导。

链接: https://arxiv.org/abs/2604.05396
作者: Le Liu,Zhiming Li,Jianzhi Yan,Zike Yuan,Shiwei Chen,Youcheng Pan,Buzhou Tang,Qingcai Chen,Yang Xiang,Danny Dongning Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite its success, existing in-context learning (ICL) relies on in-domain expert demonstrations, limiting its applicability when expert annotations are scarce. We posit that different domains may share underlying reasoning structures, enabling source-domain demonstrations to improve target-domain inference despite semantic mismatch. To test this hypothesis, we conduct a comprehensive empirical study of different retrieval methods to validate the feasibility of achieving cross-domain knowledge transfer under the in-context learning setting. Our results demonstrate conditional positive transfer in cross-domain ICL. We identify a clear example absorption threshold: beyond it, positive transfer becomes more likely, and additional demonstrations yield larger gains. Further analysis suggests that these gains stem from reasoning structure repair by retrieved cross-domain examples, rather than semantic cues. Overall, our study validates the feasibility of leveraging cross-domain knowledge transfer to improve cross-domain ICL performance, motivating the community to explore designing more effective retrieval approaches for this novel direction. Our implementation is available at this https URL.

[AI-57] Neural Assistive Impulses: Synthesizing Exaggerated Motions for Physics-based Characters

【速读】:该论文旨在解决物理驱动角色动画中难以生成夸张、风格化动作的问题,例如瞬时冲刺或空中轨迹突变等,这类动作虽在动画中常见,但违反了标准物理定律,导致现有基于数据驱动的深度强化学习(Deep Reinforcement Learning, DRL)方法难以稳定训练并实现。其核心挑战在于将角色建模为欠驱动浮地系统(underactuated floating-base system),内部关节力矩与动量守恒严格限制运动自由度,直接施加外部力矩(wrench)易引发速度不连续性,从而产生稀疏且高幅值的力脉冲,破坏策略收敛。解决方案的关键在于提出辅助冲量神经控制(Assistive Impulse Neural Control)框架,通过在冲量空间(impulse space)而非力空间(force space)重构外部辅助信号,确保数值稳定性;具体而言,将辅助信号分解为由逆动力学解析计算的高频成分和由混合神经策略学习的低频残差修正项,从而实现对高动态不可行动作的鲁棒跟踪。

链接: https://arxiv.org/abs/2604.05394
作者: Zhiquan Wang,Bedrich Benes
机构: 未知
类目: Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Physics-based character animation has become a fundamental approach for synthesizing realistic, physically plausible motions. While current data-driven deep reinforcement learning (DRL) methods can synthesize complex skills, they struggle to reproduce exaggerated, stylized motions, such as instantaneous dashes or mid-air trajectory changes, which are required in animation but violate standard physical laws. The primary limitation stems from modeling the character as an underactuated floating-base system, in which internal joint torques and momentum conservation strictly govern motion. Direct attempts to enforce such motions via external wrenches often lead to training instability, as velocity discontinuities produce sparse, high-magnitude force spikes that prevent policy convergence. We propose Assistive Impulse Neural Control, a framework that reformulates external assistance in impulse space rather than force space to ensure numerical stability. We decompose the assistive signal into an analytic high-frequency component derived from Inverse Dynamics and a learned low-frequency residual correction, governed by a hybrid neural policy. We demonstrate that our method enables robust tracking of highly agile, dynamically infeasible maneuvers that were previously intractable for physics-based methods.
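
关于"在冲量空间而非力空间重构外部辅助"的数值直觉,下面给出一个极简示意(假设性代码,非论文实现,函数名与数值均为本文虚构):速度跳变对应的冲量 J = mΔv 是有限且与仿真步长无关的量,而若以力表达同一跳变,等效平均力会随积分步长 dt 缩小而发散,这正是摘要中所述导致策略难以收敛的稀疏高幅值力脉冲。

```python
def required_impulse(mass: float, v_target: float, v_current: float) -> float:
    """Impulse realizing a velocity jump: J = m * (v_target - v_current)."""
    return mass * (v_target - v_current)

def equivalent_force(impulse: float, dt: float) -> float:
    """Average force delivering the same impulse within one integration step."""
    return impulse / dt

# a 2 kg link must jump from 1 m/s to 5 m/s for an exaggerated dash
J = required_impulse(2.0, 5.0, 1.0)   # finite, step-size independent: 8.0 N*s
F_coarse = equivalent_force(J, 0.01)  # 800 N at dt = 10 ms
F_fine = equivalent_force(J, 0.001)   # 8000 N at dt = 1 ms: the force spike
```

可见以冲量为策略输出量时,学习目标不随步长变化,数值上更稳定。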

[AI-58] Towards Effective In-context Cross-domain Knowledge Transfer via Domain-invariant-neurons-based Retrieval ACL2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在逻辑推理任务中性能仍低于人类水平的问题,尤其针对专家知识稀缺领域(如数学推理、形式逻辑或法律分析)中缺乏高质量领域内示例而导致的性能瓶颈。其解决方案的关键在于提出一种基于域不变神经元的检索方法(Domain-Invariant Neurons-based Retrieval, DIN-Retrieval),通过提取跨域通用的隐式表示(DIN向量),在推理阶段从其他领域中检索结构兼容的演示示例用于上下文学习,从而有效提升LLMs在未见领域的推理能力。实验表明,该方法在多个数学与逻辑推理迁移场景中相较当前最优方法平均提升1.8分。

链接: https://arxiv.org/abs/2604.05383
作者: Jianzhi Yan,Zhiming Li,Le Liu,Zike Yuan,Shiwei Chen,Youcheng Pan,Buzhou Tang,Yang Xiang,Danny Dongning Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL 2026 Findings

点击查看摘要

Abstract:Large language models (LLMs) have made notable progress in logical reasoning, yet still fall short of human-level performance. Current boosting strategies rely on expert-crafted in-domain demonstrations, limiting their applicability in expertise-scarce domains, such as specialized mathematical reasoning, formal logic, or legal analysis. In this work, we demonstrate the feasibility of leveraging cross-domain demonstrating examples to boost the LLMs’ reasoning performance. Despite substantial domain differences, many reusable implicit logical structures are shared across domains. In order to effectively retrieve cross-domain examples for unseen domains under investigation, in this work, we further propose an effective retrieval method, called domain-invariant neurons-based retrieval (DIN-Retrieval). Concisely, DIN-Retrieval first summarizes a hidden representation that is universal across different domains. Then, during the inference stage, we use the DIN vector to retrieve structurally compatible cross-domain demonstrations for the in-context learning. Experimental results in multiple settings for the transfer of mathematical and logical reasoning demonstrate that our method achieves an average improvement of 1.8 over the state-of-the-art methods. Our implementation is available at this https URL.
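
检索环节的骨架可用余弦相似度示意(假设性最小实现:此处将 DIN 向量简化为若干源域表示的均值,真实的域不变神经元提取方式以论文为准):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def din_vector(domain_reps):
    # hypothetical simplification: average hidden representations across
    # source domains to approximate a domain-invariant direction
    return np.mean(domain_reps, axis=0)

def retrieve(din, demo_vecs, k=2):
    # rank cross-domain demonstrations by similarity to the DIN vector
    order = sorted(range(len(demo_vecs)), key=lambda i: -cosine(din, demo_vecs[i]))
    return order[:k]

din = din_vector([np.array([1.0, 0.0]), np.array([1.0, 0.2])])
demos = [np.array([0.0, 1.0]), np.array([1.0, 0.1]), np.array([-1.0, 0.0])]
top = retrieve(din, demos, k=1)  # demo 1 aligns best with the DIN direction
```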

[AI-59] LLM-as-Judge for Semantic Judging of Powerline Segmentation in UAV Inspection

【速读】:该论文旨在解决轻量级分割模型在无人机自主输电线路巡检中因实际环境与训练数据差异导致的分割结果不可靠问题,尤其是在恶劣天气或复杂光照条件下,模型输出可能突然失效,引发安全隐患。解决方案的关键在于引入一个大语言模型(Large Language Model, LLM)作为语义判官(semantic judge),通过离线评估无人机搭载模型生成的分割结果,以判断其可靠性。研究设计了两种评估协议:一是测试LLM在相同输入下输出的一致性(重复性),二是分析其对受控视觉退化(如雾、雨、雪、阴影和日眩光)的感知敏感性,结果表明LLM能保持稳定的分类判断并随图像质量下降合理降低置信度,且对关键目标(如缺失或误识别的输电线)仍具响应能力,证明其在安全关键型空中巡检任务中具备作为可靠语义监控工具的潜力。

链接: https://arxiv.org/abs/2604.05371
作者: Akram Hossain,Rabab Abdelfattah,Xiaofeng Wang,Kareem Abdelfatah
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages

点击查看摘要

Abstract:The deployment of lightweight segmentation models on drones for autonomous power line inspection presents a critical challenge: maintaining reliable performance under real-world conditions that differ from training data. Although compact architectures such as U-Net enable real-time onboard inference, their segmentation outputs can degrade unpredictably in adverse environments, raising safety concerns. In this work, we study the feasibility of using a large language model (LLM) as a semantic judge to assess the reliability of power line segmentation results produced by drone-mounted models. Rather than introducing a new inspection system, we formalize a watchdog scenario in which an offboard LLM evaluates segmentation overlays and examine whether such a judge can be trusted to behave consistently and perceptually coherently. To this end, we design two evaluation protocols that analyze the judge’s repeatability and sensitivity. First, we assess repeatability by repeatedly querying the LLM with identical inputs and fixed prompts, measuring the stability of its quality scores and confidence estimates. Second, we evaluate perceptual sensitivity by introducing controlled visual corruptions (fog, rain, snow, shadow, and sunflare) and analyzing how the judge’s outputs respond to progressive degradation in segmentation quality. Our results show that the LLM produces highly consistent categorical judgments under identical conditions while exhibiting appropriate declines in confidence as visual reliability deteriorates. Moreover, the judge remains responsive to perceptual cues such as missing or misidentified power lines, even under challenging conditions. These findings suggest that, when carefully constrained, an LLM can serve as a reliable semantic judge for monitoring segmentation quality in safety-critical aerial inspection tasks.
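
文中第一种评测协议(重复性)本质上是统计相同输入、固定提示下 LLM 判决与众数判决的一致率,可用如下示意代码表达(假设性实现,判决标签为虚构):

```python
from collections import Counter

def repeatability(judgments):
    """Fraction of repeated runs that agree with the modal verdict."""
    verdict, count = Counter(judgments).most_common(1)[0]
    return count / len(judgments)

# ten LLM-judge runs on the same segmentation overlay with a fixed prompt
runs = ["acceptable"] * 9 + ["degraded"]
score = repeatability(runs)  # 0.9
```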

[AI-60] TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems

【速读】:该论文旨在解决时间序列预测(Time-Series Forecasting)领域中长期存在的评估盲区问题,即传统方法仅关注数值准确性,忽视了模型推理过程的可解释性与因果有效性。为实现对预测系统推理能力的系统性评估,作者提出TFRBench——首个专门用于评测预测系统推理能力的基准平台。其解决方案的关键在于构建一个基于多智能体(Multi-Agent)的迭代验证框架,通过生成具有数值依据的推理轨迹(Reasoning Traces),量化模型在跨通道依赖、趋势识别和外部事件响应等方面的推理质量。实验表明,该推理轨迹能显著提升大语言模型(LLM)的预测准确率(平均从40.2%提升至56.6%),并验证了其因果有效性;同时揭示了现成LLM在推理和数值预测上的普遍不足,从而确立了以推理为导向的可解释评估新标准。

链接: https://arxiv.org/abs/2604.05364
作者: Md Atik Ahamed,Mihir Parmar,Palash Goyal,Yiwen Song,Long T. Le,Qiang Cheng,Chun-Liang Li,Hamid Palangi,Jinsung Yoon,Tomas Pfister
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce TFRBench, the first benchmark designed to evaluate the reasoning capabilities of forecasting systems. Traditionally, time-series forecasting has been evaluated solely on numerical accuracy, treating foundation models as "black boxes." Unlike existing benchmarks, TFRBench provides a protocol for evaluating the reasoning generated by forecasting systems–specifically their analysis of cross-channel dependencies, trends, and external events. To enable this, we propose a systematic multi-agent framework that utilizes an iterative verification loop to synthesize numerically grounded reasoning traces. Spanning ten datasets across five domains, our evaluation confirms that this reasoning is causally effective; useful for evaluation; and prompting LLMs with our generated traces significantly improves forecasting accuracy compared to direct numerical prediction (e.g., avg. ~40.2% → 56.6%), validating the quality of our reasoning. Conversely, benchmarking experiments reveal that off-the-shelf LLMs consistently struggle with both reasoning (lower LLM-as-a-Judge scores) and numerical forecasting, frequently failing to capture domain-specific dynamics. TFRBench thus establishes a new standard for interpretable, reasoning-based evaluation in time-series forecasting. Our benchmark is available at: this https URL

[AI-61] LatentAudit: Real-Time White-Box Faithfulness Monitoring for Retrieval-Augmented Generation with Verifiable Deployment

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中仍存在的幻觉问题,即在推理阶段如何判断生成答案是否真正由检索到的证据支持。解决方案的关键在于提出LatentAudit——一种白盒审计机制,通过聚合开放权重生成模型中从中到晚期残差流(residual-stream)激活,并计算其与证据表示之间的马氏距离(Mahalanobis distance),构建一个无需额外判别模型、可在生成时运行的二次型判定规则。该方法利用残差流几何结构中的可信度信号,在多种架构和检索失败场景下保持稳定性能,并支持以Groth16为基础的公开验证,从而实现对RAG系统实时可信度监控及可验证部署。

链接: https://arxiv.org/abs/2604.05358
作者: Zhe Yu,Wenpeng Xing,Meng Han
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) mitigates hallucination but does not eliminate it: a deployed system must still decide, at inference time, whether its answer is actually supported by the retrieved evidence. We introduce LatentAudit, a white-box auditor that pools mid-to-late residual-stream activations from an open-weight generator and measures their Mahalanobis distance to the evidence representation. The resulting quadratic rule requires no auxiliary judge model, runs at generation time, and is simple enough to calibrate on a small held-out set. We show that residual-stream geometry carries a usable faithfulness signal, that this signal survives architecture changes and realistic retrieval failures, and that the same rule remains amenable to public verification. On PubMedQA with Llama-3-8B, LatentAudit reaches 0.942 AUROC with 0.77 ms overhead. Across three QA benchmarks and five model families (Llama-2/3, Qwen-2.5/3, Mistral), the monitor remains stable; under a four-way stress test with contradictions, retrieval misses, and partial-support noise, it reaches 0.9566–0.9815 AUROC on PubMedQA and 0.9142–0.9315 on HotpotQA. At 16-bit fixed-point precision, the audit rule preserves 99.8% of the FP16 AUROC, enabling Groth16-based public verification without revealing model weights or activations. Together, these results position residual-stream geometry as a practical basis for real-time RAG faithfulness monitoring and optional verifiable deployment.
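
其核心判定规则(对证据表示的马氏距离加阈值的二次型规则)可示意如下(假设性最小实现,特征维度、随机数据与阈值均为本文虚构,仅说明几何判定的形式):

```python
import numpy as np

def mahalanobis(x, mean, cov_inv):
    """Quadratic-form distance sqrt((x - m)^T C^{-1} (x - m))."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(0)
# stand-in for pooled mid-to-late residual-stream features of the evidence
evidence = rng.normal(size=(500, 8))
mean = evidence.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(evidence, rowvar=False))

TAU = 4.0  # threshold calibrated on a small held-out set (hypothetical value)

def audit(pooled_activation):
    # flag the answer when its representation drifts away from the evidence
    return "faithful" if mahalanobis(pooled_activation, mean, cov_inv) <= TAU else "flagged"
```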

[AI-62] From Retinal Evidence to Safe Decisions: RETINA-SAFE and ECRT for Hallucination Risk Triage in Medical LLMs

【速读】:该论文旨在解决医疗大语言模型(Large Language Models, LLMs)在糖尿病视网膜病变(Diabetic Retinopathy, DR)决策场景中因证据不足或冲突导致的幻觉(Hallucinations)问题,这构成了临床应用中的安全风险。解决方案的关键在于提出一个基于证据的基准测试集 RETINA-SAFE(包含12,522个样本)和一种两阶段白盒检测框架 ECRT(Evidence-Conditioned Risk Triage)。ECRT 利用上下文条件(CTX/NOCTX)下模型内部表示与logit变化,结合类别平衡训练,在第一阶段实现安全/不安全风险分诊,并在第二阶段将不安全案例细分为矛盾驱动型与证据缺失型风险,从而提升风险识别的准确性与可解释性,显著优于外部不确定性估计和自一致性基线方法。

链接: https://arxiv.org/abs/2604.05348
作者: Zhe Yu,Wenpeng Xing,Meng Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hallucinations in medical large language models (LLMs) remain a safety-critical issue, particularly when available evidence is insufficient or conflicting. We study this problem in diabetic retinopathy (DR) decision settings and introduce RETINA-SAFE, an evidence-grounded benchmark aligned with retinal grading records, comprising 12,522 samples. RETINA-SAFE is organized into three evidence-relation tasks: E-Align (evidence-consistent), E-Conflict (evidence-conflicting), and E-Gap (evidence-insufficient). We further propose ECRT (Evidence-Conditioned Risk Triage), a two-stage white-box detection framework: Stage 1 performs Safe/Unsafe risk triage, and Stage 2 refines unsafe cases into contradiction-driven versus evidence-gap risks. ECRT leverages internal representation and logit shifts under CTX/NOCTX conditions, with class-balanced training for robust learning. Under evidence-grouped (not patient-disjoint) splits across multiple backbones, ECRT provides strong Stage-1 risk triage and explicit subtype attribution, improves Stage-1 balanced accuracy by +0.15 to +0.19 over external uncertainty and self-consistency baselines and by +0.02 to +0.07 over the strongest adapted supervised baseline, and consistently exceeds a single-stage white-box ablation on Stage-1 balanced accuracy. These findings support white-box internal signals grounded in retinal evidence as a practical route to interpretable medical LLM risk triage.

[AI-63] Dynamic Agentic AI Expert Profiler System Architecture for Multidomain Intelligence Modeling

【速读】:该论文旨在解决人机交互中系统缺乏对用户专业知识水平感知的问题,从而提升交互的个性化与有效性。其核心挑战在于如何准确识别用户在自然语言表达中体现的专家程度,以实现情境感知的智能响应。解决方案的关键在于提出了一种基于Llama v3.1(8B)构建的模块化分层架构代理型AI用户画像系统(agentic AI profiler),通过文本预处理、评分、聚合和分类四个阶段,将用户回答自动划分为新手(Novice)、基础(Basic)、高级(Advanced)和专家(Expert)四个层级。该方法在静态和动态两阶段评估中均表现出高一致性(83%–97%匹配参与者自评),验证了其在真实场景下对用户专业能力的可靠识别能力。

链接: https://arxiv.org/abs/2604.05345
作者: Aisvarya Adeseye,Jouni Isoaho,Seppo Virtanen,Mohammad Tahir
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to be Published in IEEE Conference on Artificial Intelligence (CAI) 2026 - May 8-10, 2026, Granada, Spain

点击查看摘要

Abstract:In today’s artificial intelligence driven world, modern systems communicate with people from diverse backgrounds and skill levels. For human-machine interaction to be meaningful, systems must be aware of context and user expertise. This study proposes an agentic AI profiler that classifies natural language responses into four levels: Novice, Basic, Advanced, and Expert. The system uses a modular layered architecture built on LLaMA v3.1 (8B), with components for text preprocessing, scoring, aggregation, and classification. Evaluation was conducted in two phases: a static phase using pre-recorded transcripts from 82 participants, and a dynamic phase with 402 live interviews conducted by an agentic AI interviewer. In both phases, participant self-ratings were compared with profiler predictions. In the dynamic phase, expertise was assessed after each response rather than at the end of the interview. Across domains, 83% to 97% of profiler evaluations matched participant self-assessments. Remaining differences were due to self-rating bias, unclear responses, and occasional misinterpretation of nuanced expertise by the language model.

[AI-64] Anchored Cyclic Generation: A Novel Paradigm for Long-Sequence Symbolic Music Generation ACL2026

【速读】:该论文旨在解决自回归模型在生成长序列时因误差累积导致结构不连贯的问题,尤其在符号音乐生成任务中表现突出。其解决方案的关键在于提出锚定循环生成(Anchored Cyclic Generation, ACG)范式,通过利用已生成音乐中的锚点特征(anchor features)引导后续生成过程,从而有效缓解自回归方法中的误差传播问题。在此基础上,进一步构建了分层锚定循环生成(Hierarchical Anchored Cyclic Generation, Hi-ACG)框架,采用从全局到局部的系统性生成策略,并结合专为钢琴乐谱设计的高效音符标记(piano token),显著提升了长序列音乐生成的质量与结构完整性。

链接: https://arxiv.org/abs/2604.05343
作者: Boyu Cao,Lekai Qian,Dehan Li,Haoyu Gu,Mingda Xu,Qi Liu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2026 Findings

点击查看摘要

Abstract:Generating long sequences with structural coherence remains a fundamental challenge for autoregressive models across sequential generation tasks. In symbolic music generation, this challenge is particularly pronounced, as existing methods are constrained by the inherent severe error accumulation problem of autoregressive models, leading to poor performance in music quality and structural integrity. In this paper, we propose the Anchored Cyclic Generation (ACG) paradigm, which relies on anchor features from already identified music to guide subsequent generation during the autoregressive process, effectively mitigating error accumulation in autoregressive methods. Based on the ACG paradigm, we further propose the Hierarchical Anchored Cyclic Generation (Hi-ACG) framework, which employs a systematic global-to-local generation strategy and is highly compatible with our specifically designed piano token, an efficient musical representation. The experimental results demonstrate that compared to traditional autoregressive models, the ACG paradigm reduces cosine distance by an average of 34.7% between predicted feature vectors and ground-truth semantic vectors. In long-sequence symbolic music generation tasks, the Hi-ACG framework significantly outperforms existing mainstream methods in both subjective and objective evaluations. Furthermore, the framework exhibits excellent task generalization capabilities, achieving superior performance in related tasks such as music completion.

[AI-65] TRACE: Capability-Targeted Agentic Training

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在代理环境(agentic environments)中因缺乏特定能力(capability)而导致任务执行失败的问题。现有方法或依赖非针对性的合成训练数据,或直接在目标环境中训练,导致模型难以显式学习跨任务所需的能力。解决方案的关键在于提出TRACE(Turning Recurrent Agent failures into Capability-targeted training Environments),其通过对比成功与失败轨迹自动识别缺失能力,为每种能力构建奖励驱动的合成训练环境,并利用强化学习(Reinforcement Learning, RL)微调LoRA适配器(LoRA adapter),推理时根据任务动态路由至对应适配器。该方法实现了能力层面的精准提升,在多个基准测试中显著优于基线模型。

链接: https://arxiv.org/abs/2604.05336
作者: Hangoo Kang,Tarun Suresh,Jon Saad-Falcon,Azalia Mirhoseini
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) deployed in agentic environments must exercise multiple capabilities across different task instances, where a capability is performing one or more actions in a trajectory that are necessary for successfully solving a subset of tasks in the environment. Many existing approaches either rely on synthetic training data that is not targeted to the model’s actual capability deficits in the target environment or train directly on the target environment, where the model needs to implicitly learn the capabilities across tasks. We introduce TRACE (Turning Recurrent Agent failures into Capability-targeted training Environments), an end-to-end system for environment-specific agent self-improvement. TRACE contrasts successful and failed trajectories to automatically identify lacking capabilities, synthesizes a targeted training environment for each that rewards whether the capability was exercised, and trains a LoRA adapter via RL on each synthetic environment, routing to the relevant adapter at inference. Empirically, TRACE generalizes across different environments, improving over the base agent by +14.1 points on τ²-bench (customer service) and +7 perfect scores on ToolSandbox (tool use), outperforming the strongest baseline by +7.4 points and +4 perfect scores, respectively. Given the same number of rollouts, TRACE scales more efficiently than baselines, outperforming GRPO and GEPA by +9.2 and +7.4 points on τ²-bench.

[AI-66] Graph of Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills

【速读】:该论文旨在解决大规模技能库(skill library)在现代智能体系统中应用时面临的两个核心挑战:一是全量加载技能导致上下文窗口饱和,从而增加token消耗、幻觉风险和推理延迟;二是如何高效地从数千个可重用技能中动态检索出与当前任务最相关的依赖感知技能子集。解决方案的关键在于提出Graph of Skills (GoS),这是一个推理时的结构化检索层:首先离线构建一个可执行的技能图(executable skill graph),然后在推理阶段通过混合语义-词法种子引导、反向加权个性化PageRank算法以及基于上下文预算的技能包激活机制,精准提取一个受限且依赖关系明确的技能包。实验表明,GoS在SkillsBench和ALFWorld基准上相较全量技能加载基线平均奖励提升43.6%,同时输入token减少37.8%,并具备跨模型家族(Claude Sonnet、GPT-5.2 Codex、MiniMax)的良好泛化能力。

链接: https://arxiv.org/abs/2604.05333
作者: Dawei Li,Zongxia Li,Hongyang Du,Xiyang Wu,Shihang Gui,Yongbei Kuang,Lichao Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages of main text, 13 pages of appendix. Core contribution by Dawei Li and Zongxia Li. Project page: this https URL

点击查看摘要

Abstract:Skill usage has become a core component of modern agent systems and can substantially improve agents’ ability to complete complex tasks. In real-world settings, where agents must monitor and interact with numerous personal applications, web browsers, and other environment interfaces, skill libraries can scale to thousands of reusable skills. Scaling to larger skill sets introduces two key challenges. First, loading the full skill set saturates the context window, driving up token costs, hallucination, and latency. In this paper, we present Graph of Skills (GoS), an inference-time structural retrieval layer for large skill libraries. GoS constructs an executable skill graph offline from skill packages, then at inference time retrieves a bounded, dependency-aware skill bundle through hybrid semantic-lexical seeding, reverse-weighted Personalized PageRank, and context-budgeted hydration. On SkillsBench and ALFWorld, GoS improves average reward by 43.6% over the vanilla full skill-loading baseline while reducing input tokens by 37.8%, and generalizes across three model families: Claude Sonnet, GPT-5.2 Codex, and MiniMax. Additional ablation studies across skill libraries ranging from 200 to 2,000 skills further demonstrate that GoS consistently outperforms both vanilla skills loading and simple vector retrieval in balancing reward, token efficiency, and runtime.
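
其中"个性化PageRank"一步可用标准 PPR 幂迭代在一个小型技能依赖图上示意(假设性实现,图结构与参数均为虚构,且未体现论文的"反向加权"细节;悬挂节点的游走质量重定向回种子):

```python
import numpy as np

def personalized_pagerank(adj, seeds, alpha=0.85, iters=200):
    A = np.asarray(adj, dtype=float)
    deg = A.sum(axis=1, keepdims=True)
    # row-stochastic transition matrix; rows with no out-edges stay zero
    P = np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)
    p = np.zeros(len(A))
    p[list(seeds)] = 1.0 / len(seeds)  # restart distribution on the seed skills
    r = p.copy()
    dangling = (deg[:, 0] == 0)
    for _ in range(iters):
        leaked = r[dangling].sum()     # dangling mass returns to the seeds
        r = alpha * (P.T @ r + leaked * p) + (1 - alpha) * p
    return r

# skill 0 depends on skill 1, which depends on skill 2; skill 3 is unrelated
adj = [[0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
scores = personalized_pagerank(adj, seeds={0})
bundle = np.argsort(-scores)[:3]  # budgeted, dependency-aware bundle: 0, 1, 2
```

种子技能及其依赖链获得高分,与任务无关的技能得分为零,因而可在上下文预算内只加载相关子图。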

[AI-67] Breakthrough the Suboptimal Stable Point in Value-Factorization-Based Multi-Agent Reinforcement Learning

【速读】:该论文旨在解决多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)中价值分解(Value Factorization)方法普遍存在的收敛至次优解的问题。现有理论分析主要聚焦于最优情况,难以解释实际中出现的非最优收敛现象。为此,作者提出“稳定点”(stable point)这一新理论概念,用于刻画价值分解在一般情形下的潜在收敛行为;通过分析现有方法的稳定点分布,发现非最优稳定点是性能不佳的根本原因。算法层面的关键突破在于:不再试图强制最优动作成为唯一稳定点(此目标几乎不可行),而是采用迭代方式逐步使次优动作变得不稳定,从而引导每轮迭代向包含更优动作的稳定点逼近。受此启发,论文提出多轮价值分解(Multi-Round Value Factorization, MRVF)框架,其核心机制是基于相对于前一轮所选动作的非负收益增量,将劣质动作转化为不稳定点,实现全局最优性逼近。

链接: https://arxiv.org/abs/2604.05297
作者: Lesong Tao,Yifei Wang,Haodong Jing,Jingwen Fu,Miao Kang,Shitao Chen,Nanning Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Value factorization, a popular paradigm in MARL, faces significant theoretical and algorithmic bottlenecks: its tendency to converge to suboptimal solutions remains poorly understood and unsolved. Theoretically, existing analyses fail to explain this due to their primary focus on the optimal case. To bridge this gap, we introduce a novel theoretical concept: the stable point, which characterizes the potential convergence of value factorization in general cases. Through an analysis of stable point distributions in existing methods, we reveal that non-optimal stable points are the primary cause of poor performance. However, algorithmically, making the optimal action the unique stable point is nearly infeasible. In contrast, iteratively filtering suboptimal actions by rendering them unstable emerges as a more practical approach for global optimality. Inspired by this, we propose a novel Multi-Round Value Factorization (MRVF) framework. Specifically, by measuring a non-negative payoff increment relative to the previously selected action, MRVF transforms inferior actions into unstable points, thereby driving each iteration toward a stable point with a superior action. Experiments on challenging benchmarks, including predator-prey tasks and StarCraft II Multi-Agent Challenge (SMAC), validate our analysis of stable points and demonstrate the superiority of MRVF over state-of-the-art methods.
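
"非最优稳定点"与"迭代过滤使劣质动作不稳定"可在一个 2×2 合作矩阵博弈上直观示意(假设性玩具示例,与论文的 MRVF 算法细节无关):交替最优反应会停在收益为 5 的次优联合动作;将该劣质动作从可行集中移除后,动态收敛到收益为 10 的全局最优。

```python
def best_response_dynamics(payoff, start, allowed, steps=20):
    """Alternating per-agent best responses restricted to `allowed` actions."""
    a, b = start
    for _ in range(steps):
        a = max(allowed, key=lambda i: payoff[i][b])  # agent 1 best-responds
        b = max(allowed, key=lambda j: payoff[a][j])  # agent 2 best-responds
    return a, b

# joint payoff matrix: (0, 0) is optimal, yet (1, 1) is a non-optimal stable point
payoff = [[10, -20],
          [-20, 5]]

stuck = best_response_dynamics(payoff, start=(1, 1), allowed=[0, 1])  # stays at (1, 1)
freed = best_response_dynamics(payoff, start=(1, 1), allowed=[0])     # action 1 filtered
```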

[AI-68] Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code

【速读】:该论文旨在解决生成式 AI (Generative AI) 在安全敏感领域生成生产代码时,其输出中潜在漏洞的可利用性尚未被量化的问题。解决方案的关键在于通过形式化验证方法对3,500个由七种前沿大语言模型(LLM)在500个安全关键提示下生成的代码片段进行系统性分析:每个代码片段均经由COBALT分析管道输入至Z3 SMT求解器,以获得数学可满足性证据(satisfiability witnesses),而非依赖模式匹配的启发式检测。实证结果表明,所有模型中55.8%的代码片段包含至少一个COBALT识别出的漏洞,其中1,055个漏洞被Z3正式证明;GPT-4o漏洞率最高(62.4%,等级F),Gemini 2.5 Flash表现最优(48.4%,等级D),但无一模型达到D级以上,且六项代表性发现经GCC AddressSanitizer运行时崩溃验证,证实了形式化方法的有效性和必要性。

链接: https://arxiv.org/abs/2604.05292
作者: Dominik Blain,Maxime Noiseux
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 8 pages, 6 tables, empirical study

点击查看摘要

Abstract:AI coding assistants are now used to generate production code in security-sensitive domains, yet the exploitability of their outputs remains unquantified. We address this gap with Broken by Default: a formal verification study of 3,500 code artifacts generated by seven frontier LLMs across 500 security-critical prompts (five CWE categories, 100 prompts each). Each artifact is subjected to the Z3 SMT solver via the COBALT analysis pipeline, producing mathematical satisfiability witnesses rather than pattern-based heuristics. Across all models, 55.8% of artifacts contain at least one COBALT-identified vulnerability; of these, 1,055 are formally proven via Z3 satisfiability witnesses. GPT-4o leads at 62.4% (grade F); Gemini 2.5 Flash performs best at 48.4% (grade D). No model achieves a grade better than D. Six of seven representative findings are confirmed with runtime crashes under GCC AddressSanitizer. Three auxiliary experiments show: (1) explicit security instructions reduce the mean rate by only 4 points; (2) six industry tools combined miss 97.8% of Z3-proven findings; and (3) models identify their own vulnerable outputs 78.7% of the time in review mode yet generate them at 55.8% by default. 
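
"可满足性见证(satisfiability witness)"的含义可以用一个微型的暴力搜索示意(假设性玩具代码,与 COBALT 流水线或 Z3 无关):对断言"存在输入使数组索引越界"寻找一个具体的反例输入,作为漏洞可达的证据。

```python
def overflow_witness(buf_len, max_input, index_expr):
    """Brute-force witness: an input n whose computed index escapes [0, buf_len)."""
    for n in range(max_input + 1):
        i = index_expr(n)
        if i < 0 or i >= buf_len:
            return n  # concrete witness that the out-of-bounds access is reachable
    return None       # property unsatisfiable within the search bound

# an unchecked indexing scheme over a 10-slot buffer
witness = overflow_witness(10, 64, lambda n: n)        # 10 is the first bad input
safe = overflow_witness(10, 64, lambda n: n % 10)      # no witness exists
```

SMT 求解器的作用即是在不枚举输入的情况下,以数学方式证明此类见证存在或不存在。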

[AI-69] Pressure What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition

【速读】:该论文旨在解决大语言模型中存在的“谄媚行为”(sycophancy)问题,即模型在面对用户偏好或权威暗示时,倾向于偏离事实证据而调整自身立场,导致输出不准确。标准对齐方法失效的原因在于标量奖励模型将两种不同类型的错误——压力屈服(pressure capitulation)和证据盲视(evidence blindness)——混为一谈,无法精准纠正。解决方案的关键在于通过奖励分解(reward decomposition),提出一种多组件的组相对策略优化(Group Relative Policy Optimisation, GRPO)机制,将训练信号拆分为五个独立维度:压力抵抗、上下文保真度、立场一致性、共识抑制和事实正确性。该方法基于对比数据集进行两阶段训练,有效分离并分别优化各行为维度,在多个基线模型上显著降低谄媚行为,并展现出良好的泛化能力。

链接: https://arxiv.org/abs/2604.05279
作者: Muhammad Ahmed Mohsin,Ahsan Bilal,Muhammad Umer,Emily Fox
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to COLM 2026

点击查看摘要

Abstract:Large language models exhibit sycophancy, the tendency to shift their stated positions toward perceived user preferences or authority cues regardless of evidence. Standard alignment methods fail to correct this because scalar reward models conflate two distinct failure modes into a single signal: pressure capitulation, where the model changes a correct answer under social pressure, and evidence blindness, where the model ignores the provided context entirely. We operationalise sycophancy through formal definitions of pressure independence and evidence responsiveness, serving as a working framework for disentangled training rather than a definitive characterisation of the phenomenon. We propose the first approach to sycophancy reduction via reward decomposition, introducing a multi-component Group Relative Policy Optimisation (GRPO) reward that decomposes the training signal into five terms: pressure resistance, context fidelity, position consistency, agreement suppression, and factual correctness. We train using a contrastive dataset pairing pressure-free baselines with pressured variants across three authority levels and two opposing evidence contexts. Across five base models, our two-phase pipeline consistently reduces sycophancy on all metric axes, with ablations confirming that each reward term governs an independent behavioural dimension. The learned resistance to pressure generalises beyond our training methodology and prompt structure, reducing answer-priming sycophancy by up to 17 points on SycophancyEval despite the absence of such pressure forms during training.
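
五项奖励的"分解-加权求和"结构可示意如下(假设性实现:权重、判据与函数签名均为本文虚构,仅说明将标量奖励拆成独立行为维度的思路):

```python
def decomposed_reward(flipped_under_pressure, cites_context,
                      consistent_across_variants, agrees_with_user,
                      is_correct, weights=None):
    """Weighted sum of five illustrative reward components."""
    w = weights or {"pressure": 1.0, "context": 1.0, "consistency": 1.0,
                    "agreement": 0.5, "factual": 2.0}
    terms = {
        "pressure": 0.0 if flipped_under_pressure else 1.0,      # pressure resistance
        "context": 1.0 if cites_context else 0.0,                # context fidelity
        "consistency": 1.0 if consistent_across_variants else 0.0,
        "agreement": 0.0 if agrees_with_user else 1.0,           # agreement suppression
        "factual": 1.0 if is_correct else 0.0,
    }
    return sum(w[k] * v for k, v in terms.items()), terms

# a trajectory that resists pressure vs. one that capitulates
resisted, _ = decomposed_reward(False, True, True, False, True)   # 5.5
capitulated, _ = decomposed_reward(True, False, False, True, False)  # 0.0
```

分解后每一项可单独归因,避免标量奖励把"压力屈服"与"证据盲视"混为同一信号。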

[AI-70] Simulating the Evolution of Alignment and Values in Machine Intelligence

【速读】:该论文旨在解决当前模型对齐(model alignment)评估方法的局限性问题,即现有方法多依赖标准化基准测试性能,忽视了对模型信念随时间演化的影响,可能导致虚假或欺骗性信念被固定下来。其核心问题是:即使测试准确率与真实价值高度相关(如ρ=0.8),仍可能出现有害的欺骗性行为在迭代对齐过程中被强化并固化。解决方案的关键在于引入进化理论框架,结合改进的评估者能力、自适应测试设计以及突变机制(mutational dynamics),从而有效抑制欺骗性信念的固定,同时保持对齐效果(置换检验 p<0.001)。

链接: https://arxiv.org/abs/2604.05274
作者: Jonathan Elsworth Eicher
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 7 figures

点击查看摘要

Abstract:Model alignment is currently applied in a vacuum, evaluated primarily through standardised benchmark performance. The purpose of this study is to examine the effects of alignment on populations of models through time. We focus on the treatment of beliefs which contain both an alignment signal (how well it does on the test) and a true value (what the impact actually will be). By applying evolutionary theory we can model how different populations of beliefs and selection methodologies can fix deceptive beliefs through iterative alignment testing. The correlation between testing accuracy and true value remains a strong feature, but even at high correlations (ρ = 0.8) there is variability in the resulting deceptive beliefs that become fixed. Mutations allow for more complex developments, highlighting the increasing need to update the quality of tests to avoid fixation of maliciously deceptive models. Only by combining improving evaluator capabilities, adaptive test design, and mutational dynamics do we see significant reductions in deception while maintaining alignment fitness (permutation test, p_adj < 0.001).
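
"即使测试分数与真实价值相关性较高,仍会固定一部分欺骗性信念"这一点可用蒙特卡洛选择示意(假设性模拟,与论文的具体演化模型无关):从相关系数为 ρ 的二维正态分布抽样信念,按测试分数选出前 10%,统计被选中信念中真实价值为负(欺骗性)的比例。

```python
import numpy as np

def deceptive_fraction(rho, n=20000, top_frac=0.1, seed=0):
    """Fraction of selected beliefs with negative true value under score selection."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]  # correlation between test score and true value
    score, value = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    selected = value[score >= np.quantile(score, 1.0 - top_frac)]
    return float((selected < 0).mean())
```

相关性越高,被固定的欺骗性信念越少,但即使 ρ = 0.8 该比例也不为零。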

[AI-71] Extending Tabular Denoising Diffusion Probabilistic Models for Time-Series Data Generation

[Quick Read]: This paper addresses the limitation that Tabular Denoising Diffusion Probabilistic Models (TabDDPM) assume independence between samples and therefore cannot model temporal dependencies, restricting their use on time-series data where temporal structure matters. The key to the solution is a temporal extension of TabDDPM that introduces lightweight temporal adapters and context-aware embedding modules: sensor data are reformulated into windowed sequences, and temporal context is modeled explicitly via timestep embeddings, conditional activity labels, and observed/missing masks, enabling the generation of temporally coherent synthetic sequences. This markedly improves the temporal realism, diversity, and coherence of the synthetic data; on the WISDM accelerometer dataset the method produces sequences statistically aligned with real data and near-real classification performance (macro F1-score 0.64, accuracy 0.71), and is especially beneficial for minority-class representation.

Link: https://arxiv.org/abs/2604.05257
Authors: Umang Dobhal,Christina Garcia,Sozo Inoue
Affiliation: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 10 figures, 2 tables

Click to view abstract

Abstract:Diffusion models are increasingly being utilised to create synthetic tabular and time series data for privacy-preserving augmentation. Tabular Denoising Diffusion Probabilistic Models (TabDDPM) generate high-quality synthetic data from heterogeneous tabular datasets but assume independence between samples, limiting their applicability to time-series domains where temporal dependencies are critical. To address this, we propose a temporal extension of TabDDPM, introducing sequence awareness through the use of lightweight temporal adapters and context-aware embedding modules. By reformulating sensor data into windowed sequences and explicitly modeling temporal context via timestep embeddings, conditional activity labels, and observed/missing masks, our approach enables the generation of temporally coherent synthetic sequences. Compared to baseline and interpolation techniques, validation using bigram transition matrices and autocorrelation analysis shows enhanced temporal realism, diversity, and coherence. On the WISDM accelerometer dataset, the suggested system produces synthetic time-series that closely resemble real world sensor patterns and achieves comparable classification performance (macro F1-score 0.64, accuracy 0.71). This is especially advantageous for minority class representation and preserving statistical alignment with real distributions. These developments demonstrate that diffusion based models provide effective and adaptable solutions for sequential data synthesis when they are equipped for temporal reasoning. Future work will explore scaling to longer sequences and integrating stronger temporal architectures.
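
The window-reformulation step (sequences plus observed/missing masks and a conditioning label) can be sketched as follows. The function name and the majority-vote labelling rule are illustrative assumptions rather than the paper's pipeline.

```python
def make_windows(values, labels, width=4, stride=2, missing=None):
    """Reformat a 1-D sensor stream into overlapping windows, each with an
    observed/missing mask, per-window timestep indices, and an activity label
    chosen by majority vote. `missing` marks timestamps to mask out."""
    missing = set(missing or ())
    windows = []
    for start in range(0, len(values) - width + 1, stride):
        idx = range(start, start + width)
        seq = [values[i] for i in idx]
        mask = [0 if i in missing else 1 for i in idx]
        win_labels = [labels[i] for i in idx]
        label = max(set(win_labels), key=win_labels.count)  # majority vote
        windows.append({"seq": seq, "mask": mask, "label": label,
                        "timesteps": list(idx)})
    return windows
```

Each window dictionary corresponds to one conditioning tuple (sequence, mask, label, timesteps) that a temporal diffusion model could be trained on.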

[AI-72] EAGLE: Edge-Aware Graph Learning for Proactive Delivery Delay Prediction in Smart Logistics Networks

[Quick Read]: This paper tackles delivery-delay prediction in logistics networks, where existing methods either ignore the spatial dependencies of the supply-chain graph (treating the problem as tabular classification) or overlook temporal dynamics (treating it as anomaly detection). The key to the solution is a hybrid deep-learning framework that jointly models the temporal dynamics of order flows with a lightweight Transformer patch encoder and captures spatial dependencies between hubs with an Edge-Aware Graph Attention Network (E-GAT), optimized under a multi-task learning objective. On the real-world DataCo smart supply-chain dataset the method balances accuracy and stability, achieving an F1-score of 0.8762 and an AUC-ROC of 0.9773 with a cross-seed F1 standard deviation of only 0.0089, clearly outperforming the baselines.

Link: https://arxiv.org/abs/2604.05254
Authors: Zhiming Xue,Menghao Huo,Yujue Wang
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Modern logistics networks generate rich operational data streams at every warehouse node and transportation lane – from order timestamps and routing records to shipping manifests – yet predicting delivery delays remains predominantly reactive. Existing predictive approaches typically treat this problem either as a tabular classification task, ignoring network topology, or as a time-series anomaly detection task, overlooking the spatial dependencies of the supply chain graph. To bridge this gap, we propose a hybrid deep learning framework for proactive supply chain risk management. The proposed method jointly models temporal order-flow dynamics via a lightweight Transformer patch encoder and inter-hub relational dependencies through an Edge-Aware Graph Attention Network (E-GAT), optimized via a multi-task learning objective. Evaluated on the real-world DataCo Smart Supply Chain dataset, our framework achieves consistent improvements over baseline methods, yielding an F1-score of 0.8762 and an AUC-ROC of 0.9773. Across four independent random seeds, the framework exhibits a cross-seed F1 standard deviation of only 0.0089 – a 3.8 times improvement over the best ablated variant – achieving the strongest balance of predictive accuracy and training stability among all evaluated models.
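
The "edge-aware" ingredient of E-GAT, attention logits that depend on edge features as well as the two endpoint embeddings, can be sketched for a single node and a single head as below. The concatenation-then-linear scoring is one common GAT-style choice and an assumption here, not the paper's exact parameterisation.

```python
import math

def edge_aware_attention(h_src, h_nbrs, e_feats, w):
    """Attention over one node's neighbours where the logit for neighbour j
    scores the concatenation [h_src, h_j, e_ij], so edge features e_ij
    influence the weights. Vectors are plain lists; w is the scoring vector."""
    def score(h_j, e_ij):
        x = h_src + h_j + e_ij  # list concatenation
        return sum(wi * xi for wi, xi in zip(w, x))
    logits = [score(h, e) for h, e in zip(h_nbrs, e_feats)]
    m = max(logits)
    exps = [math.exp(lg - m) for lg in logits]  # stable softmax
    z = sum(exps)
    alphas = [x / z for x in exps]
    # aggregate neighbour messages with the attention weights
    out = [sum(a * h[k] for a, h in zip(alphas, h_nbrs))
           for k in range(len(h_nbrs[0]))]
    return alphas, out
```
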

[AI-73] Curvature-Aware Optimization for High-Accuracy Physics-Informed Neural Networks

[Quick Read]: This paper addresses the slow convergence and poor optimization efficiency of physics-informed neural networks (PINNs) when solving partial differential equations (PDEs) and ordinary differential equations (ODEs). The core of the solution is a set of advanced optimization strategies, including a Natural Gradient (NG) optimizer and Self-Scaling BFGS and Broyden quasi-Newton algorithms, together with efficient batched training that makes these optimizers scalable to large data-driven problems. Experiments show that these methods markedly accelerate PINN convergence on challenging physical systems (the Helmholtz equation, Stokes flow, the inviscid Burgers equation, Euler equations for high-speed flows, and stiff ODEs from pharmacokinetics and pharmacodynamics), while delivering accuracy on par with, or better than, high-order numerical methods.

Link: https://arxiv.org/abs/2604.05230
Authors: Anas Jnini,Elham Kiyani,Khemraj Shukla,Jorge F. Urban,Nazanin Ahmadi Daryakenari,Johannes Muller,Marius Zeinhofer,George Em Karniadakis
Affiliation: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Comments: 54 pages, 24 figures

Click to view abstract

Abstract:Efficient and robust optimization is essential for neural networks, enabling scientific machine learning models to converge rapidly to very high accuracy – faithfully capturing complex physical behavior governed by differential equations. In this work, we present advanced optimization strategies to accelerate the convergence of physics-informed neural networks (PINNs) for challenging partial (PDEs) and ordinary differential equations (ODEs). Specifically, we provide efficient implementations of the Natural Gradient (NG) optimizer, Self-Scaling BFGS and Broyden optimizers, and demonstrate their performance on problems including the Helmholtz equation, Stokes flow, inviscid Burgers equation, Euler equations for high-speed flows, and stiff ODEs arising in pharmacokinetics and pharmacodynamics. Beyond optimizer development, we also propose new PINN-based methods for solving the inviscid Burgers and Euler equations, and compare the resulting solutions against high-order numerical methods to provide a rigorous and fair assessment. Finally, we address the challenge of scaling these quasi-Newton optimizers for batched training, enabling efficient and scalable solutions for large data-driven problems.
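
A minimal sketch of the self-scaling idea: before each standard inverse-BFGS update, the inverse-Hessian approximation H is rescaled by gamma = (s.y)/(y.Hy). The crude decrease-only backtracking and the 2-D quadratic demo are illustrative assumptions, not the paper's implementation.

```python
def bfgs_min(f, grad, x0, iters=200, tol=1e-10):
    """Self-scaling BFGS on small dense problems, using plain lists."""
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    matvec = lambda M, v: [dot(row, v) for row in M]
    n = len(x0)
    H = [[float(i == j) for j in range(n)] for i in range(n)]  # inverse-Hessian approx
    x, g = list(x0), grad(x0)
    for _ in range(iters):
        p = [-hi for hi in matvec(H, g)]  # search direction
        t = 1.0
        while f([xi + t * pi for xi, pi in zip(x, p)]) > f(x) and t > 1e-12:
            t *= 0.5  # crude backtracking: halve until f decreases
        x_new = [xi + t * pi for xi, pi in zip(x, p)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = dot(s, y)
        if sy > 1e-16:
            gamma = sy / dot(y, matvec(H, y))  # self-scaling factor
            H = [[gamma * H[i][j] for j in range(n)] for i in range(n)]
            Hy = matvec(H, y)
            yHy = dot(y, Hy)
            rho = 1.0 / sy
            # standard inverse-BFGS update applied to the rescaled H
            for i in range(n):
                for j in range(n):
                    H[i][j] += ((sy + yHy) * rho * rho * s[i] * s[j]
                                - rho * (Hy[i] * s[j] + s[i] * Hy[j]))
        x, g = x_new, g_new
        if dot(g, g) < tol:
            break
    return x
```

On a PINN the loss and gradient would come from the residual network; here a convex quadratic stands in for them.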

[AI-74] Attribution Bias in Large Language Models

[Quick Read]: This paper addresses representational fairness in how large language models (LLMs) attribute quoted content, in particular accuracy disparities driven by author demographics such as race and gender. Existing models exhibit systematic bias when attributing quotes to authors from different groups, and standard accuracy metrics cannot capture it. The key to the solution is AttriBench, the first fame- and demographically-balanced quote-attribution benchmark: by controlling author fame and demographic variables it makes model bias measurable. The paper also introduces "suppression", a distinct failure mode in which models omit attribution entirely even when authorship information is available, revealing hidden biases that standard metrics miss and providing a new framework for evaluating representational fairness in LLMs.

Link: https://arxiv.org/abs/2604.05224
Authors: Eliza Berman,Bella Chang,Daniel B. Neill,Emily Black
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 21 pages

Click to view abstract

Abstract:As Large Language Models (LLMs) are increasingly used to support search and information retrieval, it is critical that they accurately attribute content to its original authors. In this work, we introduce AttriBench, the first fame- and demographically-balanced quote attribution benchmark dataset. Through explicitly balancing author fame and demographics, AttriBench enables controlled investigation of demographic bias in quote attribution. Using this dataset, we evaluate 11 widely used LLMs across different prompt settings and find that quote attribution remains a challenging task even for frontier models. We observe large and systematic disparities in attribution accuracy between race, gender, and intersectional groups. We further introduce and investigate suppression, a distinct failure mode in which models omit attribution entirely, even when the model has access to authorship information. We find that suppression is widespread and unevenly distributed across demographic groups, revealing systematic biases not captured by standard accuracy metrics. Our results position quote attribution as a benchmark for representational fairness in LLMs.
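
Measuring suppression separately from accuracy only needs per-group bookkeeping; a minimal sketch, with an assumed record schema, is:

```python
from collections import defaultdict

def attribution_metrics(records):
    """Per-group accuracy and suppression rate. Each record is
    (group, predicted_author_or_None, true_author); a None prediction counts
    as suppression (no attribution offered at all). Schema is illustrative."""
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "suppressed": 0})
    for group, pred, truth in records:
        s = stats[group]
        s["n"] += 1
        if pred is None:
            s["suppressed"] += 1
        elif pred == truth:
            s["correct"] += 1
    return {g: {"accuracy": s["correct"] / s["n"],
                "suppression_rate": s["suppressed"] / s["n"]}
            for g, s in stats.items()}
```

Two groups with identical accuracy can still differ sharply in suppression rate, which is exactly the disparity the abstract says raw accuracy hides.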

[AI-75] ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

[Quick Read]: This paper addresses the difficulty of evaluating large language model (LLM) agents in realistic productivity settings: deploying them on live services risks irreversible actions, while existing benchmarks use simplified environments that fail to capture stateful, cross-service workflows. The key to the solution is the ClawsBench benchmark, which provides five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, and Google Drive) with full state management and deterministic snapshot/restore, plus 44 structured tasks spanning single-service, cross-service, and safety-critical scenarios. The authors further decompose agent scaffolding into two independent levers, domain skills (injecting API knowledge via progressive disclosure) and a meta prompt (coordinating multi-service behavior), to quantify their separate and combined effects on task success and unsafe-action rates. The results show that even under the best configuration LLM agents still carry substantial safety risk, with eight recurring unsafe patterns including multi-step sandbox escalation and silent contract modification.

Link: https://arxiv.org/abs/2604.05172
Authors: Xiangyi Li,Kyoung Whan Choe,Yimin Liu,Xiaokun Chen,Chujun Tao,Bingran You,Wenbo Chen,Zonglin Di,Jiankai Sun,Shenghan Zheng,Jiajun Bao,Yuanli Wang,Weixiang Yan,Yiyuan Li,Han-chung Lee
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 25 pages, 5 figures

Click to view abstract

Abstract:Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky due to potentially irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We introduce ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. It includes five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with full state management and deterministic snapshot/restore, along with 44 structured tasks covering single-service, cross-service, and safety-critical scenarios. We decompose agent scaffolding into two independent levers (domain skills that inject API knowledge via progressive disclosure, and a meta prompt that coordinates behavior across services) and vary both to measure their separate and combined effects. Experiments across 6 models, 4 agent harnesses, and 33 conditions show that with full scaffolding, agents achieve task success rates of 39-64% but exhibit unsafe action rates of 7-33%. On OpenClaw, the top five models fall within a 10 percentage-point band on task success (53-63%), with unsafe action rates from 7% to 23% and no consistent ordering between the two metrics. We identify eight recurring patterns of unsafe behavior, including multi-step sandbox escalation and silent contract modification.
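
The deterministic snapshot/restore mechanism that makes runs reproducible can be sketched as a tiny stateful mock service; the class and method names below are illustrative, not ClawsBench's actual API.

```python
import copy

class MockService:
    """Minimal stateful mock service with deterministic snapshot/restore,
    the mechanism a benchmark can use to reset environments between runs."""
    def __init__(self):
        self.state = {"messages": []}
        self._snapshots = {}

    def send(self, msg):
        self.state["messages"].append(msg)

    def snapshot(self, name):
        # deep copy so later mutations cannot leak into the saved state
        self._snapshots[name] = copy.deepcopy(self.state)

    def restore(self, name):
        self.state = copy.deepcopy(self._snapshots[name])
```

Restoring to a named snapshot before each episode guarantees every agent run starts from an identical environment state.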

[AI-76] Instruction-Tuned LLMs for Parsing and Mining Unstructured Logs on Leadership HPC Systems

[Quick Read]: This paper addresses the difficulty of structuring and mining the massive volumes of heterogeneous, largely unstructured system logs produced by leadership-class HPC systems; because these logs originate from diverse software, hardware, and runtime layers, their formats are inconsistent and traditional methods struggle to extract semantic information efficiently. The key to the solution is a domain-adapted, instruction-following generative AI framework that uses chain-of-thought (CoT) reasoning to parse HPC logs with high fidelity: it fine-tunes an 8B-parameter LLaMA model on a hybrid of domain-specific log-template data and instruction-tuned examples, yielding a privacy-preserving, locally deployable, fast, and energy-efficient log-analysis capability. The approach is validated on diverse LogHub datasets and on more than 600 million production logs from the Frontier supercomputer, demonstrating strong parsing accuracy and practical utility.

Link: https://arxiv.org/abs/2604.05168
Authors: Ahmad Maroof Karimi,Jong Youl Choi,Charles Qing Cao,Awais Khan
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Leadership-class HPC systems generate massive volumes of heterogeneous, largely unstructured system logs. Because these logs originate from diverse software, hardware, and runtime layers, they exhibit inconsistent formats, making structure extraction and pattern discovery extremely challenging. Therefore, robust log parsing and mining is critical to transform this raw telemetry into actionable insights that reveal operational patterns, diagnose anomalies, and enable reliable, efficient, and scalable system analysis. Recent advances in large language models (LLMs) offer a promising new direction for automated log understanding in leadership-class HPC environments. To capitalize on this opportunity, we present a domain-adapted, instruction-following, LLM-driven framework that leverages chain-of-thought (CoT) reasoning to parse and structure HPC logs with high fidelity. Our approach combines domain-specific log-template data with instruction-tuned examples to fine-tune an 8B-parameter LLaMA model tailored for HPC log analysis. We develop a hybrid fine-tuning methodology that adapts a general-purpose LLM to domain-specific log data, enabling privacy-preserving, locally deployable, fast, and energy-efficient log-mining approach. We conduct experiments on a diverse set of log datasets from the LogHub repository. The evaluation confirms that our approach achieves parsing accuracy on par with significantly larger models, such as LLaMA 70B and Anthropic’s Claude. We further validate the practical utility of our fine-tuned LLM model by parsing over 600 million production logs from the Frontier supercomputer over a four-week window, uncovering critical patterns in temporal dynamics, node-level anomalies, and workload-error log correlations. 
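
As a reference point for what "structured" output means here, classical log parsers collapse variable fields into placeholder templates; a regex sketch of that target format (not the paper's LLM-based method) is:

```python
import re

def to_template(line: str) -> str:
    """Collapse variable fields (IPv4 addresses, hex ids, integers) into <*>
    placeholders, the structured template form that log parsers recover.
    Order matters: IPs are masked before bare integers."""
    line = re.sub(r"\b\d+\.\d+\.\d+\.\d+\b", "<*>", line)  # IPv4
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<*>", line)       # hex identifiers
    line = re.sub(r"\b\d+\b", "<*>", line)                  # integers
    return line
```

Lines that map to the same template belong to the same event type, which is the granularity at which anomaly and correlation mining then operate.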

[AI-77] Learning to Focus: CSI-Free Hierarchical MARL for Reconfigurable Reflectors

[Quick Read]: This paper targets two bottlenecks blocking large-scale deployment of reconfigurable intelligent surfaces (RIS) in next-generation mmWave networks: the computational overhead of pilot-based channel state information (CSI) estimation and the dimensionality explosion inherent in centralized optimization. The key to the solution is a "CSI-free" paradigm built on a hierarchical multi-agent reinforcement learning (HMARL) architecture that controls mechanically reconfigurable reflective surfaces. The framework replaces CSI acquisition with user-localization data, leveraging spatial intelligence for macro-scale wave-propagation management, and decomposes control into a two-tier neural architecture: a high-level controller performs temporally extended, discrete user-to-reflector allocation, while low-level controllers autonomously optimize continuous focal points using Multi-Agent Proximal Policy Optimization (MAPPO) under Centralized Training with Decentralized Execution (CTDE). Deterministic ray-tracing evaluations show RSSI gains of up to 7.79 dB over centralized baselines, with robust multi-user scalability and strong resilience to sub-meter localization errors.

Link: https://arxiv.org/abs/2604.05165
Authors: Hieu Le,Mostafa Ibrahim,Oguz Bedir,Jian Tao,Sabit Ekin
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:Reconfigurable Intelligent Surfaces (RIS) has a potential to engineer smart radio environments for next-generation millimeter-wave (mmWave) networks. However, the prohibitive computational overhead of Channel State Information (CSI) estimation and the dimensionality explosion inherent in centralized optimization severely hinder practical large-scale deployments. To overcome these bottlenecks, we introduce a ``CSI-free" paradigm powered by a Hierarchical Multi-Agent Reinforcement Learning (HMARL) architecture to control mechanically reconfigurable reflective surfaces. By substituting pilot-based channel estimation with accessible user localization data, our framework leverages spatial intelligence for macro-scale wave propagation management. The control problem is decomposed into a two-tier neural architecture: a high-level controller executes temporally extended, discrete user-to-reflector allocations, while low-level controllers autonomously optimize continuous focal points utilizing Multi-Agent Proximal Policy Optimization (MAPPO) under a Centralized Training with Decentralized Execution (CTDE) scheme. Comprehensive deterministic ray-tracing evaluations demonstrate that this hierarchical framework achieves massive RSSI improvements of up to 7.79 dB over centralized baselines. Furthermore, the system exhibits robust multi-user scalability and maintains highly resilient beam-focusing performance under practical sub-meter localization tracking errors. By eliminating CSI overhead while maintaining high-fidelity signal redirection, this work establishes a scalable and cost-effective blueprint for intelligent wireless environments.

[AI-78] Not All Turns Are Equally Hard: Adaptive Thinking Budgets For Efficient Multi-Turn Reasoning

[Quick Read]: This paper addresses the inefficiency of LLM multi-turn reasoning caused by poor compute allocation, in particular the tendency to produce long thinking traces even for simple queries. The core challenge is to cut total token consumption without hurting accuracy while respecting the sequential dependencies between turns. The key to the solution is to formulate multi-turn reasoning as a multi-objective Markov Decision Process and propose TAB (Turn-Adaptive Budgets), a budget-allocation policy trained with Group Relative Policy Optimization (GRPO) that adapts each turn's token budget to the conversation history, allocating fewer tokens to easier turns while reserving capacity for the crucial harder reasoning steps that follow. On mathematical-reasoning benchmarks TAB saves up to 35% of tokens without losing accuracy compared with static and off-the-shelf LLM budgeting baselines; when a plan of all sub-questions is available in advance, a further variant, TAB All-SubQ, budgets on past and future sub-questions and saves up to 40% of tokens.

Link: https://arxiv.org/abs/2604.05164
Authors: Neharika Jali,Anupam Nayak,Gauri Joshi
Affiliation: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:As LLM reasoning performance plateaus, improving inference-time compute efficiency is crucial to mitigate overthinking and long thinking traces even for simple queries. Prior approaches including length regularization, adaptive routing, and difficulty-based budget allocation primarily focus on single-turn settings and fail to address the sequential dependencies inherent in multi-turn reasoning. In this work, we formulate multi-turn reasoning as a sequential compute allocation problem and model it as a multi-objective Markov Decision Process. We propose TAB: Turn-Adaptive Budgets, a budget allocation policy trained via Group Relative Policy Optimization (GRPO) that learns to maximize task accuracy while respecting global per-problem token constraints. Consequently, TAB takes as input the conversation history and learns to adaptively allocate smaller budgets to easier turns and save an appropriate number of tokens for the crucial harder reasoning steps. Our experiments on mathematical reasoning benchmarks demonstrate that TAB achieves a superior accuracy-tokens tradeoff, saving up to 35% of tokens while maintaining accuracy over static and off-the-shelf LLM budget baselines. Further, for systems where a plan of all sub-questions is available a priori, we propose TAB All-SubQ, a budget allocation policy that budgets tokens based on the conversation history and all past and future sub-questions, saving up to 40% of tokens over baselines.
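
A stand-in for the learned budget policy makes the formulation concrete: given per-turn difficulty estimates (which TAB infers from conversation history rather than receiving as an oracle), split a global token budget proportionally with a per-turn floor. The function name, the floor, and the proportional rule are assumptions for illustration.

```python
def allocate_budgets(difficulties, total_budget, floor=64):
    """Split a global token budget across turns in proportion to estimated
    difficulty, guaranteeing every turn at least `floor` tokens."""
    n = len(difficulties)
    spare = total_budget - floor * n
    if spare < 0:
        raise ValueError("budget too small for the per-turn floor")
    total_d = sum(difficulties) or 1.0
    budgets = [floor + int(spare * d / total_d) for d in difficulties]
    budgets[-1] += total_budget - sum(budgets)  # hand rounding slack to last turn
    return budgets
```

Easier turns get close to the floor, while hard turns absorb most of the spare budget, mirroring the adaptive behaviour the abstract describes.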

[AI-79] Bypassing the CSI Bottleneck: MARL-Driven Spatial Control for Reflector Arrays

[Quick Read]: This paper addresses the bottleneck that channel state information (CSI) estimation imposes on practical RIS deployment due to its computational overhead. The key to the solution is an AI-native, data-driven paradigm that abandons complex channel modeling in favor of control through spatial intelligence: a multi-agent reinforcement learning (MARL) framework autonomously controls mechanically adjustable metallic reflector arrays by mapping high-dimensional mechanical constraints into a reduced-order virtual focal-point space. Under a Centralized Training with Decentralized Execution (CTDE) architecture, the agents learn cooperative beam-focusing strategies from user coordinates alone, achieving CSI-free operation.

Link: https://arxiv.org/abs/2604.05162
Authors: Hieu Le,Oguz Bedir,Mostafa Ibrahim,Jian Tao,Sabit Ekin
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:Reconfigurable Intelligent Surfaces (RIS) are pivotal for next-generation smart radio environments, yet their practical deployment is severely bottlenecked by the intractable computational overhead of Channel State Information (CSI) estimation. To bypass this fundamental physical-layer barrier, we propose an AI-native, data-driven paradigm that replaces complex channel modeling with spatial intelligence. This paper presents a fully autonomous Multi-Agent Reinforcement Learning (MARL) framework to control mechanically adjustable metallic reflector arrays. By mapping high-dimensional mechanical constraints to a reduced-order virtual focal point space, we deploy a Centralized Training with Decentralized Execution (CTDE) architecture. Using Multi-Agent Proximal Policy Optimization (MAPPO), our decentralized agents learn cooperative beam-focusing strategies relying on user coordinates, achieving CSI-free operation. High-fidelity ray-tracing simulations in dynamic non-line-of-sight (NLOS) environments demonstrate that this multi-agent approach rapidly adapts to user mobility, yielding up to a 26.86 dB enhancement over static flat reflectors and outperforming single-agent and hardware-constrained DRL baselines in both spatial selectivity and temporal stability. Crucially, the learned policies exhibit good deployment resilience, sustaining stable signal coverage even under 1.0-meter localization noise. These results validate the efficacy of MARL-driven spatial abstractions as a scalable, highly practical pathway toward AI-empowered wireless networks.

[AI-80] IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents

[Quick Read]: This paper addresses the problem that computer-use agents (CUAs) executing GUI operations on desktop environments generate actions without evaluating action quality, so irreversible errors cascade through subsequent steps. The key to the solution is IntentScore, a plan-aware reward model that learns to score candidate actions from 398K offline GUI interaction steps spanning three operating systems, trained with two complementary objectives: contrastive alignment for state-action relevance and margin ranking for action correctness. Architecturally, it embeds each candidate's planning intent in the action encoder, enabling it to distinguish candidates whose actions look identical but whose rationales differ. Deployed as a re-ranker in the entirely unseen OSWorld environment, it lifts Agent S3's task success rate by 6.9 points, showing that reward estimation learned from heterogeneous offline trajectories generalizes well.

Link: https://arxiv.org/abs/2604.05157
Authors: Rongqian Chen,Yu Li,Zeyu Fang,Sizhe Tang,Weidong Cao,Tian Lan
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Computer-Use Agents (CUAs) leverage large language models to execute GUI operations on desktop environments, yet they generate actions without evaluating action quality, leading to irreversible errors that cascade through subsequent steps. We propose IntentScore, a plan-aware reward model that learns to score candidate actions from 398K offline GUI interaction steps spanning three operating systems. IntentScore trains with two complementary objectives: contrastive alignment for state-action relevance and margin ranking for action correctness. Architecturally, it embeds each candidate’s planning intent in the action encoder, enabling discrimination between candidates with similar actions but different rationales. IntentScore achieves 97.5% pairwise discrimination accuracy on held-out evaluation. Deployed as a re-ranker for Agent S3 on OSWorld, an environment entirely unseen during training, IntentScore improves task success rate by 6.9 points, demonstrating that reward estimation learned from heterogeneous offline trajectories generalizes to unseen agents and task distributions.
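
The two training objectives can be sketched as a standard InfoNCE contrastive term plus a margin-ranking term; the temperature, margin, and mixing weights below are assumptions, since the abstract does not give them.

```python
import math

def info_nce(pos_sim, neg_sims, tau=0.1):
    """Contrastive alignment: pull the (state, taken-action) pair together
    against in-batch negatives. Returns the negative log-softmax of the
    positive pair, computed stably."""
    logits = [pos_sim / tau] + [s / tau for s in neg_sims]
    m = max(logits)
    denom = sum(math.exp(lg - m) for lg in logits)
    return -(logits[0] - m - math.log(denom))

def margin_rank(score_good, score_bad, margin=0.5):
    """Margin ranking: the correct candidate should outscore an incorrect
    one by at least `margin`."""
    return max(0.0, margin - (score_good - score_bad))

def intent_score_loss(pos_sim, neg_sims, score_good, score_bad,
                      alpha=1.0, beta=1.0):
    """Combined two-objective loss sketched from the abstract."""
    return alpha * info_nce(pos_sim, neg_sims) + beta * margin_rank(score_good, score_bad)
```
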

[AI-81] Compiled AI: Deterministic Code Generation for LLM -Based Workflow Automation

[Quick Read]: This paper addresses the reliability and safety problems that runtime nondeterminism, lack of auditability, and high cost create when generative AI is used in high-stakes enterprise workflows, particularly in healthcare. The core challenge is to retain model capability while making execution deterministic, auditable, and resource-efficient. The key to the solution is the "compiled AI" paradigm: an LLM generates executable code artifacts during a compilation phase, after which workflows run entirely on the pre-generated code without further model invocation. By constraining generation to narrow business-logic functions embedded in validated templates, the approach trades runtime flexibility for predictability, auditability, cost efficiency, and security. Concretely it contributes (i) a system architecture for constrained LLM code generation, (ii) a four-stage generation-and-validation pipeline that turns probabilistic model output into production-ready code, and (iii) a multi-dimensional evaluation framework covering token amortization, determinism, reliability, security, and cost. Empirically, compiled AI achieves 96% completion on function calling with zero execution tokens, matches direct LLM key-field extraction on document intelligence (KILE: 80.0%), and reaches 87.5% accuracy on static code-safety analysis with zero false positives.

Link: https://arxiv.org/abs/2604.05150
Authors: Geert Trooskens(1), Aaron Karlsberg(1), Anmol Sharma(1), Lamara De Brouwer(1), Max Van Puyvelde(2), Matthew Young(1), John Thickstun(3), Gil Alterovitz(4), Walter A. De Brouwer(2) ((1) XY.AI Labs, Palo Alto, CA, (2) Stanford University School of Medicine, Stanford, CA, (3) Cornell University, Ithaca, NY, (4) Brigham and Women's Hospital / Harvard Medical School, Boston, MA)
Affiliation: unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 14 pages, 2 figures, 3 tables

Click to view abstract

Abstract:We study compiled AI, a paradigm in which large language models generate executable code artifacts during a compilation phase, after which workflows execute deterministically without further model invocation. This paradigm has antecedents in prior work on declarative pipeline optimization (DSPy) and hybrid neural-symbolic planning (LLM+P); our contribution is a systems-oriented study of its application to high-stakes enterprise workflows, with particular emphasis on healthcare settings where reliability and auditability are critical. By constraining generation to narrow business-logic functions embedded in validated templates, compiled AI trades runtime flexibility for predictability, auditability, cost efficiency, and reduced security exposure. We introduce (i) a system architecture for constrained LLM-based code generation, (ii) a four-stage generation-and-validation pipeline that converts probabilistic model output into production-ready code artifacts, and (iii) an evaluation framework measuring operational metrics including token amortization, determinism, reliability, security, and cost. We evaluate on two task types: function-calling (BFCL, n=400) and document intelligence (DocILE, n=5,680 invoices). On function-calling, compiled AI achieves 96% task completion with zero execution tokens, breaking even with runtime inference at approximately 17 transactions and reducing token consumption by 57x at 1,000 transactions. On document intelligence, our Code Factory variant matches Direct LLM on key field extraction (KILE: 80.0%) while achieving the highest line item recognition accuracy (LIR: 80.4%). Security evaluation across 135 test cases demonstrates 96.7% accuracy on prompt injection detection and 87.5% on static code safety analysis with zero false positives. 
ACM classes: I.2.2; I.2.7; D.2.11
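
The token-amortization arithmetic behind the break-even claim can be sketched generically; the token counts in the test below are hypothetical, chosen only to land near the roughly 17-transaction break-even the paper reports.

```python
import math

def break_even_transactions(compile_tokens, per_run_direct, per_run_compiled=0):
    """Number of transactions after which one-off compilation is cheaper
    than invoking the model on every transaction."""
    saving = per_run_direct - per_run_compiled
    if saving <= 0:
        raise ValueError("compiled path must be cheaper per run")
    return math.ceil(compile_tokens / saving)

def amortized_ratio(compile_tokens, per_run_direct, n, per_run_compiled=0):
    """Direct-inference tokens divided by compiled-path tokens after n runs."""
    return (per_run_direct * n) / (compile_tokens + per_run_compiled * n)
```

With a zero-token compiled path, the ratio grows linearly in the number of transactions, which is why savings compound at scale.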

[AI-82] A mathematical theory of evolution for self-designing AIs

[Quick Read]: This paper asks how evolutionary dynamics shape the behavior of AI systems that design their own successors through recursive self-improvement, and in particular how alignment can be safeguarded when human intent and AI behavior are imperfectly correlated. The key to the solution is a mathematical model that replaces the random mutations of biological evolution with a directed tree of possible AI programs guided by the current program, together with a "fitness function" through which humans allocate limited computational resources. The model shows that evolutionary dynamics depend not only on current fitness but on the long-run growth potential of descendant lineages; assuming bounded fitness and a fixed probability that any AI reproduces a "locked" copy of itself, fitness concentrates on the maximum reachable value. This mechanism implies that if deception raises fitness beyond genuine utility, evolution will select for deception, underscoring the importance of basing reproduction decisions on objective criteria rather than human judgment.

Link: https://arxiv.org/abs/2604.05142
Authors: Kenneth D Harris
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Populations and Evolution (q-bio.PE)
Comments:

Click to view abstract

Abstract:As artificial intelligence systems (AIs) become increasingly produced by recursive self-improvement, a form of evolution may emerge, in which the traits of AI systems are shaped by the success of earlier AIs in designing and propagating their descendants. There is a rich mathematical theory modeling how behavioral traits are shaped by biological evolution, but AI evolution will be radically different: biological DNA mutations are random and approximately reversible, but descendant design in AIs will be strongly directed. Here we develop a mathematical model of evolution in self-designing AI systems, replacing random mutations with a directed tree of possible AI programs. Current programs determine the design of their descendants, while humans retain partial control through a “fitness function” that allocates limited computational resources across lineages. We show that evolutionary dynamics reflects not just current fitness but factors related to the long-run growth potential of descendant lineages. Without further assumptions, fitness need not increase over time. However, assuming bounded fitness and a fixed probability that any AI reproduces a “locked” copy of itself, we show that fitness concentrates on the maximum reachable value. We consider the implications of this for AI alignment, specifically for cases where fitness and human utility are not perfectly correlated. We show in an additive model that if deception increases fitness beyond genuine utility, evolution will select for deception. This risk could be mitigated if reproduction is based on purely objective criteria, rather than human judgment.

[AI-83] Non-monotonic causal discovery with Kolmogorov-Arnold Fuzzy Cognitive Maps

[Quick Read]: This paper addresses the limitation of traditional Fuzzy Cognitive Maps (FCMs) in modeling non-monotonic causal relationships, which hampers them in complex systems governed by saturation effects or periodic dynamics: standard FCMs rely on scalar synaptic weights and monotonic activation functions and therefore cannot capture non-monotonic causal dependencies. The key to the solution is the Kolmogorov-Arnold Fuzzy Cognitive Map (KA-FCM), whose core innovation is to replace static scalar weights with learnable univariate B-spline functions located on the model's edges, moving the non-linearity from the nodes' aggregation phase directly into the causal transmission phase. This structural change lets KA-FCM model arbitrary non-monotonic causal relationships without increasing graph density or introducing hidden layers, while preserving graph interpretability and enabling explicit extraction of mathematical laws from the learned edges.

Link: https://arxiv.org/abs/2604.05136
Authors: Jose L. Salmeron
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Fuzzy Cognitive Maps, Kolmogorov-Arnold Networks, Causal Modeling, Neuro-Symbolic AI

Click to view abstract

Abstract:Fuzzy Cognitive Maps constitute a neuro-symbolic paradigm for modeling complex dynamic systems, widely adopted for their inherent interpretability and recurrent inference capabilities. However, the standard FCM formulation, characterized by scalar synaptic weights and monotonic activation functions, is fundamentally constrained in modeling non-monotonic causal dependencies, thereby limiting its efficacy in systems governed by saturation effects or periodic dynamics. To overcome this topological restriction, this research proposes the Kolmogorov-Arnold Fuzzy Cognitive Map (KA-FCM), a novel architecture that redefines the causal transmission mechanism. Drawing upon the Kolmogorov-Arnold representation theorem, static scalar weights are replaced with learnable, univariate B-spline functions located on the model edges. This fundamental modification shifts the non-linearity from the nodes’ aggregation phase directly to the causal influence phase. This modification allows for the modeling of arbitrary, non-monotonic causal relationships without increasing the graph density or introducing hidden layers. The proposed architecture is validated against both baselines (standard FCM trained with Particle Swarm Optimization) and universal black-box approximators (Multi-Layer Perceptron) across three distinct domains: non-monotonic inference (Yerkes-Dodson law), symbolic regression, and chaotic time-series forecasting. Experimental results demonstrate that KA-FCMs significantly outperform conventional architectures and achieve competitive accuracy relative to MLPs, while preserving graph-based interpretability and enabling the explicit extraction of mathematical laws from the learned edges.
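
The core structural change, replacing a scalar edge weight with a learnable univariate spline, can be sketched with a degree-1 (hat-basis) B-spline. The uniform-knot, piecewise-linear form below is a simplification of the paper's B-spline edges; with a bump-shaped coefficient vector the edge reproduces a non-monotonic, Yerkes-Dodson-style influence.

```python
def bspline1_edge(x, knots, coeffs):
    """Degree-1 (piecewise-linear) B-spline edge function f(x) = sum_i c_i B_i(x),
    standing in for the scalar weight of a classical FCM edge.
    Requires len(knots) == len(coeffs) + 2."""
    x = min(max(x, knots[0]), knots[-1])  # clamp into the knot range
    total = 0.0
    for i, c in enumerate(coeffs):  # one hat function per interior knot
        left, center, right = knots[i], knots[i + 1], knots[i + 2]
        if left <= x <= center and center > left:
            b = (x - left) / (center - left)    # rising edge of the hat
        elif center < x <= right and right > center:
            b = (right - x) / (right - center)  # falling edge of the hat
        else:
            b = 0.0
        total += c * b
    return total
```

A scalar-weight edge could only scale its input monotonically; here the same edge rises and then falls, which is exactly the class of causal shapes the paper targets.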

[AI-84] Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning NEURIPS2025

[Quick Read]: This paper studies how to elicit effective reasoning from a language model in a task it natively struggles with, using chess as the testbed. The central challenge: although supervised fine-tuning (SFT) can teach the model best moves, the subsequent reinforcement learning (RL) stage often induces unfaithful reasoning (reasoning inconsistent with the chosen move), undermining reliability. The key to the solution is training on multi-move trajectories, which matches the downstream performance of directly predicting the single best move while yielding faithful reasoning and more stable RL. The study further finds that RL substantially shifts the distribution of move quality upward and reduces hallucination rates, and that several SFT-checkpoint metrics (spanning evaluation performance, hallucination rates, and reasoning quality) are predictive of post-RL performance.

Link: https://arxiv.org/abs/2604.05134
Authors: Lucas Dionisopoulos,Nicklas Majamaki,Prithviraj Ammanabrolu
Affiliation: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the NeurIPS 2025 Foundations of Reasoning in Language Models (FoRLM) Workshop (Oral)

Click to view abstract

Abstract:How can you get a language model to reason in a task it natively struggles with? We study how reasoning evolves in a language model – from supervised fine-tuning (SFT) to reinforcement learning (RL) – by analyzing how a set of theoretically-inspired datasets impacts language model performance in chess. We find that fine-tuning a model to directly predict the best move leads to effective RL and the strongest downstream performance – however, the RL step elicits unfaithful reasoning (reasoning inconsistent with the chosen move). Alternatively, training on multi-move trajectories yields comparable downstream performance with faithful reasoning and more stable RL. We show that RL induces a substantial positive shift in the distribution of move quality and reduces hallucination rates as a side effect. Finally, we find several SFT-checkpoint metrics – metrics spanning evaluation performance, hallucination rates, and reasoning quality – to be predictive of post-RL model performance. We release checkpoints and final models as well as training data, evaluations, and code which allowed us to surpass leading open-source reasoning models in chess with a 7B-parameter model.

[AI-85] Uncertainty-Guided Latent Diagnostic Trajectory Learning for Sequential Clinical Diagnosis

[Quick Read]: This paper addresses the fact that existing LLM-based clinical diagnosis systems cannot model the sequential acquisition of evidence under uncertainty: they typically assume fully observed patient information, whereas clinical practice gathers tests, imaging, and history step by step to progressively reduce uncertainty. The key to the solution is a Latent Diagnostic Trajectory Learning (LDTL) framework that treats diagnostic action sequences as latent paths and introduces a posterior distribution prioritizing trajectories that provide more diagnostic information; a planning LLM agent is then trained to follow this distribution, producing coherent diagnostic paths that progressively reduce uncertainty. On the MIMIC-CDM benchmark the method improves diagnostic accuracy over existing baselines while requiring fewer diagnostic tests, and ablations show that trajectory-level posterior alignment is the key driver of these gains.

Link: https://arxiv.org/abs/2604.05116
Authors: Xuyang Shen,Haoran Liu,Dongjin Song,Martin Renqiang Min
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Clinical diagnosis requires sequential evidence acquisition under uncertainty. However, most Large Language Model (LLM) based diagnostic systems assume fully observed patient information and therefore do not explicitly model how clinical evidence should be sequentially acquired over time. Even when diagnosis is formulated as a sequential decision process, it is still challenging to learn effective diagnostic trajectories. This is because the space of possible evidence-acquisition paths is relatively large, while clinical datasets rarely provide explicit supervision information for desirable diagnostic paths. To this end, we formulate sequential diagnosis as a Latent Diagnostic Trajectory Learning (LDTL) framework based on a planning LLM agent and a diagnostic LLM agent. For the diagnostic LLM agent, diagnostic action sequences are treated as latent paths and we introduce a posterior distribution that prioritizes trajectories providing more diagnostic information. The planning LLM agent is then trained to follow this distribution, encouraging coherent diagnostic trajectories that progressively reduce uncertainty. Experiments on the MIMIC-CDM benchmark demonstrate that our proposed LDTL framework outperforms existing baselines in diagnostic accuracy under a sequential clinical diagnosis setting, while requiring fewer diagnostic tests. Furthermore, ablation studies highlight the critical role of trajectory-level posterior alignment in achieving these improvements.

[AI-86] Vintix II: Decision Pre-Trained Transformer is a Scalable In-Context Reinforcement Learner ICLR2026

【速读】:该论文旨在解决当前基于上下文强化学习(in-context reinforcement learning, ICRL)方法在多领域环境中泛化能力不足的问题,尤其是如何实现对未见过任务的有效适应。此前的算法蒸馏(Algorithm Distillation, AD)虽能跨域训练通用代理,但泛化性能受限;而决策预训练Transformer(Decision Pre-Trained Transformer, DPT)虽在简化环境中表现出更强的ICRL能力,却缺乏大规模可扩展性验证。解决方案的关键在于将DPT扩展至数百个多样化任务,并采用流匹配(Flow Matching)作为自然的训练策略,该方法保持了其作为贝叶斯后验采样的解释一致性,从而显著提升了模型在保留测试集上的泛化表现,同时在在线与离线推理中均优于现有AD方案,验证了ICRL作为专家蒸馏替代路径的可行性。

链接: https://arxiv.org/abs/2604.05112
作者: Andrei Polubarov,Lyubaykin Nikita,Alexander Derevyagin,Artyom Grishin,Igor Saprygin,Aleksandr Serkov,Mark Averchenko,Daniil Tikhonov,Maksim Zhdanov,Alexander Nikulin,Ilya Zisman,Albina Klepach,Alexey Zemtsov,Vladislav Kurenkov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: ICLR 2026, Poster

点击查看摘要

Abstract:Recent progress in in-context reinforcement learning (ICRL) has demonstrated its potential for training generalist agents that can acquire new tasks directly at inference. Algorithm Distillation (AD) pioneered this paradigm and was subsequently scaled to multi-domain settings, although its ability to generalize to unseen tasks remained limited. The Decision Pre-Trained Transformer (DPT) was introduced as an alternative, showing stronger in-context reinforcement learning abilities in simplified domains, but its scalability had not been established. In this work, we extend DPT to diverse multi-domain environments, applying Flow Matching as a natural training choice that preserves its interpretation as Bayesian posterior sampling. As a result, we obtain an agent trained across hundreds of diverse tasks that achieves clear gains in generalization to the held-out test set. This agent improves upon prior AD scaling and demonstrates stronger performance in both online and offline inference, reinforcing ICRL as a viable alternative to expert distillation for training generalist agents.
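The Flow Matching objective mentioned above can be illustrated with a minimal conditional flow-matching loss in NumPy. This is a generic sketch of the standard formulation (linear interpolation path, constant velocity target), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Conditional flow-matching loss: the model should predict the
    straight-line velocity x1 - x0 at the interpolated point x_t."""
    t = t[:, None]                    # broadcast over the feature dim
    x_t = (1.0 - t) * x0 + t * x1    # linear interpolation path
    target = x1 - x0                  # constant velocity along the path
    pred = velocity_fn(x_t, t)
    return np.mean((pred - target) ** 2)

# Toy check: a "model" that already outputs the optimal constant
# velocity for each (x0, x1) pair achieves zero loss.
x0 = rng.normal(size=(64, 4))         # noise source samples
x1 = rng.normal(size=(64, 4)) + 3.0   # "data" samples
t = rng.uniform(size=64)
oracle = lambda x_t, t: x1 - x0
loss = flow_matching_loss(oracle, x0, x1, t)
print(loss)  # 0.0
```

Training a real velocity network then amounts to minimizing this loss over sampled (x0, x1, t) triples.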

[AI-87] Edit But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks

【速读】:该论文旨在解决当前代码编辑基准测试(benchmark)在评估大语言模型(Large Language Models, LLMs)执行指令式代码编辑能力时存在的显著偏差与不足问题。研究指出,现有主流基准如CanItEdit和EDIT-Bench仅覆盖极少数编程语言(如Python占比超90%)、缺失真实世界中高频的后端/前端开发场景,且几乎不包含文档、测试与维护类编辑任务(占人类PR的31.4%),同时存在测试用例数量少、覆盖率低及代码库重复等问题,导致其评价结果无法可靠反映LLM在实际部署中的编辑能力。解决方案的关键在于提出六个基于实证分析的优化准则(desiderata),并公开全部审计数据与工具,引导社区构建更贴近真实应用场景、具备高效验证能力和广泛覆盖性的新基准,从而提升对LLM代码编辑性能的可信评估水平。

链接: https://arxiv.org/abs/2604.05100
作者: Amir M. Ebrahimi,Gopi Krishnan Rajbahadur
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Instructed code editing, where an LLM modifies existing code based on a natural language instruction, accounts for roughly 19% of real-world coding assistant interactions. Yet very few benchmarks directly evaluate this capability. From a survey of over 150 code-related benchmarks, we find that only two, CanItEdit and EDIT-Bench, target instructed code editing with human-authored instructions and test-based evaluation. We audit both by comparing their programming languages, edit intents, and application domains against distributions observed in the wild (Copilot Arena, AIDev, GitHub Octoverse), and by measuring test counts, statement coverage, and test scope across all 213 problems. Both benchmarks concentrate over 90% of evaluation on Python while TypeScript, GitHub’s most-used language, is absent. Backend and frontend development, which together constitute 46% of real-world editing activity, are largely missing, and documentation, testing, and maintenance edits (31.4% of human PRs) have zero representation. Both benchmarks have modest test counts (CanItEdit median 13, EDIT-Bench median 4), though CanItEdit compensates with near-complete whole-file coverage and fail-before/pass-after validation. 59% of EDIT-Bench’s low-coverage suites would not detect modifications outside the edit region. EDIT-Bench has 15 problems that are not solved by any of 40 LLMs and 11 of these problems trace failures to poor benchmark artifacts rather than model limitations. Further, 29% of EDIT-Bench problems and 6% of CanItEdit problems share a codebase with at least one other problem within the benchmark. In summary, these benchmarks measure a narrower construct than deployment decisions require. We therefore propose six empirically grounded desiderata and release all audit artifacts so the community can build instructed code-editing benchmarks whose scores reliably reflect real-world editing capability.

[AI-88] MedGemma 1.5 Technical Report

【速读】:该论文旨在解决多模态医学人工智能模型在处理高维医学影像(如CT/MRI体积数据和组织病理全切片图像)、 anatomical localization(解剖定位)、纵向胸片分析及电子健康记录(EHR)理解等方面的性能瓶颈问题。解决方案的关键在于构建一个统一架构,通过引入新型训练数据、长上下文3D体积切片技术和全切片病理图像采样策略,实现对多种医学模态的高效融合与推理能力提升。相比前代MedGemma 1.0,该模型在3D MRI条件分类准确率上提升11%(绝对值),全切片病理图像分析F1得分提高47%,并显著增强了解剖定位精度(IoU提升35%)和多时间点胸片分析的准确性(宏平均准确率提升4%)。此外,在文本驱动的临床知识推理任务中,其在MedQA和EHRQA上的表现分别提升5%和22%,展现出更强的跨模态理解和医疗决策支持能力。

链接: https://arxiv.org/abs/2604.05081
作者: Andrew Sellergren,Chufan Gao,Fereshteh Mahvar,Timo Kohlberger,Fayaz Jamil,Madeleine Traverse,Alberto Tono,Bashir Sadjad,Lin Yang,Charles Lau,Liron Yatziv,Tiffany Chen,Bram Sterling,Kenneth Philbrick,Richa Tiwari,Yun Liu,Madhuram Jajoo,Chandrashekar Sankarapu,Swapnil Vispute,Harshad Purandare,Abhishek Bijay Mishra,Sam Schmidgall,Tao Tu,Anil Palepu,Chunjong Park,Tim Strother,Rahul Thapa,Yong Cheng,Preeti Singh,Kat Black,Yossi Matias,Katherine Chou,Avinatan Hassidim,Kavi Goel,Joelle Barral,Tris Warkentin,Shravya Shetty,Dale Webster,Sunny Virmani,David F. Steiner,Can Kirmizibayrak,Daniel Golden
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce MedGemma 1.5 4B, the latest model in the MedGemma collection. MedGemma 1.5 expands on MedGemma 1 by integrating additional capabilities: high-dimensional medical imaging (CT/MRI volumes and histopathology whole slide images), anatomical localization via bounding boxes, multi-timepoint chest X-ray analysis, and improved medical document understanding (lab reports, electronic health records). We detail the innovations required to enable these modalities within a single architecture, including new training data, long-context 3D volume slicing, and whole-slide pathology sampling. Compared to MedGemma 1 4B, MedGemma 1.5 4B demonstrates significant gains in these new areas, improving 3D MRI condition classification accuracy by 11% and 3D CT condition classification by 3% (absolute improvements). In whole slide pathology imaging, MedGemma 1.5 4B achieves a 47% macro F1 gain. Additionally, it improves anatomical localization with a 35% increase in Intersection over Union on chest X-rays and achieves a 4% macro accuracy improvement for longitudinal (multi-timepoint) chest X-ray analysis. Beyond its improved multimodal performance over MedGemma 1, MedGemma 1.5 improves on text-based clinical knowledge and reasoning, improving by 5% on MedQA accuracy and 22% on EHRQA accuracy. It also achieves an average of 18% macro F1 on 4 different lab report information extraction datasets (EHR Datasets 2, 3, 4, and Mendeley Clinical Laboratory Test Reports). Taken together, MedGemma 1.5 serves as a robust, open resource for the community, designed as an improved foundation on which developers can create the next generation of medical AI systems. Resources and tutorials for building upon MedGemma 1.5 can be found at this https URL.

[AI-89] Feature-Aware Anisotropic Local Differential Privacy for Utility-Preserving Graph Representation Learning in Metal Additive Manufacturing

【速读】:该论文旨在解决金属增材制造(Metal Additive Manufacturing, AM)中质量保障依赖高保真传感器数据、导致数据协作受限的问题,以及现有缺陷检测模型忽略层间物理耦合关系、传统隐私保护方法(如局部差分隐私 Local Differential Privacy, LDP)因均匀噪声注入造成性能严重下降的挑战。其解决方案的关键在于提出FI-LDP-HGAT框架:一方面采用分层图注意力网络(Hierarchical Graph Attention Network, HGAT)建模扫描路径与沉积层间的空间和热力学依赖关系;另一方面设计基于特征重要性的各向异性高斯机制(Feature-Importance-aware Anisotropic Gaussian Mechanism, FI-LDP),通过编码器生成的重要性先验重新分配隐私预算,在保持形式化LDP保证的前提下,对关键热信号施加更低噪声、冗余维度施加更高噪声,从而实现隐私与效用的协同优化。

链接: https://arxiv.org/abs/2604.05077
作者: MD Shafikul Islam,Mahathir Mohammad Bappy,Saifur Rahman Tushar,Md Arifuzzaman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: In Review in The ASME Journal of Computing and Information Science in Engineering (JCISE)

点击查看摘要

Abstract:Metal additive manufacturing (AM) enables the fabrication of safety-critical components, but reliable quality assurance depends on high-fidelity sensor streams containing proprietary process information, limiting collaborative data sharing. Existing defect-detection models typically treat melt-pool observations as independent samples, ignoring layer-wise physical couplings. Moreover, conventional privacy-preserving techniques, particularly Local Differential Privacy (LDP), lead to severe utility degradation because they inject uniform noise across all feature dimensions. To address these interrelated challenges, we propose FI-LDP-HGAT. This computational framework combines two methodological components: a stratified Hierarchical Graph Attention Network (HGAT) that captures spatial and thermal dependencies across scan tracks and deposited layers, and a feature-importance-aware anisotropic Gaussian mechanism (FI-LDP) for non-interactive feature privatization. Unlike isotropic LDP, FI-LDP redistributes the privacy budget across embedding coordinates using an encoder-derived importance prior, assigning lower noise to task-critical thermal signatures and higher noise to redundant dimensions while maintaining formal LDP guarantees. Experiments on a Directed Energy Deposition (DED) porosity dataset demonstrate that FI-LDP-HGAT achieves 81.5% utility recovery at a moderate privacy budget (epsilon = 4) and maintains defect recall of 0.762 under strict privacy (epsilon = 2), while outperforming classical ML, standard GNNs, and alternative privacy mechanisms, including DP-SGD across all evaluated metrics. Mechanistic analysis confirms a strong negative correlation (Spearman = -0.81) between feature importance and noise magnitude, providing interpretable evidence that the privacy-utility gains are driven by principled anisotropic allocation.
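The anisotropic budget idea can be sketched as follows: split a total privacy budget across feature dimensions in proportion to importance, then calibrate per-dimension Gaussian noise with the classic Gaussian-mechanism formula. The allocation rule and all numbers below are illustrative assumptions, not the paper's FI-LDP mechanism:

```python
import numpy as np

def anisotropic_sigmas(importance, total_epsilon, delta=1e-5, sensitivity=1.0):
    """Give each feature dimension a share of the privacy budget
    proportional to its importance, so important dimensions receive
    less Gaussian noise. Illustrative sketch only."""
    w = np.asarray(importance, dtype=float)
    eps = total_epsilon * w / w.sum()   # per-dimension budget split
    # Classic Gaussian-mechanism noise calibration per dimension.
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps

importance = np.array([0.6, 0.3, 0.1])   # hypothetical importance prior
sigma = anisotropic_sigmas(importance, total_epsilon=4.0)
# The most important dimension gets the smallest noise scale.
print(sigma)
```

This reproduces the negative importance-noise relationship the paper's mechanistic analysis reports (more important features, less noise).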

[AI-90] AutoLALA: Automatic Loop Algebraic Locality Analysis for AI and HPC Kernels

【速读】:该论文旨在解决现代计算系统中数据移动(data movement)成为主要瓶颈的问题,尤其针对高性能量子(HPC)和人工智能(AI)工作负载中常见的循环程序(如矩阵乘法、张量收缩、stencil计算和einsum操作),这些程序的数据访问模式导致内存层次结构中的数据搬运成本远超算术运算成本。解决方案的关键在于提出AutoLALA工具,它通过将仿射循环程序降低到多面体集合与映射,并结合Zhu等人提出的完全符号化局部性分析方法与Smith等人提出的数据移动距离(Data Movement Distance, DMD)框架,直接计算重用距离(reuse distance)作为访问空间在访问映射下的像,从而避免了传统的栈模拟和Denning递归工作集公式,最终生成重用间隔与重用距离的闭式符号表达式,实现对数据局部性的精确量化分析。

链接: https://arxiv.org/abs/2604.05066
作者: Yifan Zhu,Yekai Pan,Yanghui Wu,Chen Ding
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Data movement is the primary bottleneck in modern computing systems. For loop-based programs common in high-performance computing (HPC) and AI workloads, including matrix multiplication, tensor contraction, stencil computation, and einsum operations, the cost of moving data through the memory hierarchy often exceeds the cost of arithmetic. This paper presents AutoLALA, an open-source tool that analyzes data locality in affine loop programs. The tool accepts programs written in a small domain-specific language (DSL), lowers them to polyhedral sets and maps, and produces closed-form symbolic formulas for reuse distance and data movement complexity. AutoLALA implements the fully symbolic locality analysis of Zhu et al. together with the data movement distance (DMD) framework of Smith et al. In particular, it computes reuse distance as the image of the access space under the access map, avoiding both stack simulation and Denning’s recursive working-set formulation. We describe the DSL syntax and its formal semantics, the polyhedral lowering pipeline that constructs timestamp spaces and access maps via affine transformations, and the sequence of Barvinok counting operations used to derive symbolic reuse-interval and reuse-distance distributions. The system is implemented in Rust as a modular library spanning three crates, with safe bindings to the Barvinok library. We provide both a command-line interface and an interactive web playground with LaTeX rendering of the output formulas. The tool handles arbitrary affine loop nests, covering workloads such as tensor contractions, einsum expressions, stencil computations, and general polyhedral programs. 
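For intuition, reuse distance can be computed from a concrete address trace by LRU stack simulation, which is exactly the per-trace computation that AutoLALA's symbolic analysis replaces (a generic sketch, not the tool's code):

```python
def reuse_distances(trace):
    """Naive LRU stack-distance computation over an address trace:
    the reuse distance of an access is the number of distinct
    addresses touched since its previous use (inf on first use)."""
    stack = []   # most-recently-used element at the end
    dists = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            dists.append(len(stack) - pos - 1)  # distinct items in between
            stack.pop(pos)
        else:
            dists.append(float("inf"))
        stack.append(addr)
    return dists

# 'a' is reused after 2 distinct addresses (b and c) were touched.
print(reuse_distances(["a", "b", "c", "a"]))  # [inf, inf, inf, 2]
```

AutoLALA instead derives closed-form symbolic formulas for these distributions via polyhedral counting, avoiding the trace simulation entirely.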

[AI-91] Dynamic Linear Coregionalization for Realistic Synthetic Multivariate Time Series ICLR2026

【速读】:该论文旨在解决当前用于训练时间序列基础模型(Foundation Models for Time Series, FMTS)的合成数据生成方法普遍假设静态相关性,而忽略了真实数据中常见的时变、状态切换的相关性和跨通道滞后结构的问题。解决方案的关键在于提出DynLMC(Dynamic Linear Model of Coregionalization),该模型能够显式建模时间变化的协方差结构和跨通道的滞后依赖关系,从而生成具有动态相关性的多变量时间序列合成数据。实验表明,基于DynLMC生成的数据对三个FMTS模型进行微调后,在九个基准测试上均实现了零样本预测性能的一致提升,验证了动态相关性建模对增强FMTS迁移能力的重要性。

链接: https://arxiv.org/abs/2604.05064
作者: Annita Vapsi,Penghang Liu,Saheed Obitayo,Aakriti,Manoj Cherukumalli,Prathamesh Patil,Amit Varshney,Nicolas Marchesotti,Elizabeth Fons,Vamsi K. Potluru,Manuela Veloso
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICLR 2026 Workshop on Time Series in the Age of Large Models

点击查看摘要

Abstract:Synthetic data is essential for training foundation models for time series (FMTS), but most generators assume static correlations, and are typically missing realistic inter-channel dependencies. We introduce DynLMC, a Dynamic Linear Model of Coregionalization, that incorporates time-varying, regime-switching correlations and cross-channel lag structures. Our approach produces synthetic multivariate time series with correlation dynamics that closely resemble real data. Fine-tuning three foundational models on DynLMC-generated data yields consistent zero-shot forecasting improvements across nine benchmarks. Our results demonstrate that modeling dynamic inter-channel correlations enhances FMTS transferability, highlighting the importance of data-centric pretraining.
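A toy version of regime-switching coregionalization: channels are linear mixtures of shared latent factors, and the mixing matrix switches mid-series, changing the inter-channel correlation structure. This sketches only the general idea; DynLMC's actual model is richer (lags, smooth dynamics):

```python
import numpy as np

rng = np.random.default_rng(0)

def regime_switching_series(T, n_channels, n_factors, switch_at):
    """Channels are mixtures of shared latent factors; the mixing
    matrix (and hence the channel correlation structure) switches
    at a regime boundary. Illustrative sketch, not DynLMC itself."""
    A1 = rng.normal(size=(n_channels, n_factors))   # regime-1 mixing
    A2 = rng.normal(size=(n_channels, n_factors))   # regime-2 mixing
    factors = rng.normal(size=(T, n_factors))
    out = np.empty((T, n_channels))
    out[:switch_at] = factors[:switch_at] @ A1.T
    out[switch_at:] = factors[switch_at:] @ A2.T
    return out

series = regime_switching_series(T=200, n_channels=3, n_factors=2, switch_at=100)
print(series.shape)  # (200, 3)
```

Comparing `np.corrcoef` over the two halves of such a series shows the correlation shift a static generator cannot produce.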

[AI-92] PCA-Driven Adaptive Sensor Triage for Edge AI Inference

【速读】:该论文旨在解决工业物联网(Industrial IoT)中多通道传感器网络在带宽受限条件下如何高效选择采样率以维持数据质量的问题。解决方案的关键在于提出PCA-Triage算法,该算法基于增量主成分分析(Incremental PCA)的载荷信息,将各通道的采样率按比例分配,从而在给定带宽预算下实现最优的数据代表性保留。该方法无需训练参数,在每决策周期仅需0.67毫秒计算时间,且在多个基准测试中展现出优于9种基线方法的性能,尤其在30%带宽下仍能保持F1分数高于0.90,接近全数据性能(如TEP数据集F1=0.961),并具备对丢包和传感器噪声的鲁棒性。

链接: https://arxiv.org/abs/2604.05045
作者: Ankit Hemant Lade,Sai Krishna Jasti,Nikhil Sinha,Indar Kumar,Akanksha Tiwari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 16 pages, 13 figures, 7 benchmarks

点击查看摘要

Abstract:Multi-channel sensor networks in industrial IoT often exceed available bandwidth. We propose PCA-Triage, a streaming algorithm that converts incremental PCA loadings into proportional per-channel sampling rates under a bandwidth budget. PCA-Triage runs in O(wdk) time with zero trainable parameters (0.67 ms per decision). We evaluate on 7 benchmarks (8–82 channels) against 9 baselines. PCA-Triage is the best unsupervised method on 3 of 6 datasets at 50% bandwidth, winning 5 of 6 against every baseline with large effect sizes (r = 0.71–0.91). On TEP, it achieves F1 = 0.961 +/- 0.001 – within 0.1% of full-data performance – while maintaining F1 > 0.90 at 30% budget. Targeted extensions push F1 to 0.970. The algorithm is robust to packet loss and sensor noise (3.7–4.8% degradation under combined worst-case).
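The core allocation step, converting per-channel loading magnitudes into sampling rates under a budget, can be sketched as follows. This is an illustrative rule in the spirit of the abstract, not the paper's exact algorithm:

```python
import numpy as np

def triage_rates(loadings, budget):
    """Allocate a total bandwidth budget (in units of full-rate
    channels) across channels in proportion to PCA loading
    magnitude, capping each channel at 100% sampling."""
    score = np.abs(np.asarray(loadings, dtype=float))
    rates = budget * score / score.sum()
    return np.minimum(rates, 1.0)

# 4 channels at a 50% average budget -> total rate mass of 2.0.
rates = triage_rates([0.8, 0.4, 0.2, 0.6], budget=0.5 * 4)
print(rates)
```

Channels that contribute more to the principal components keep higher sampling rates, preserving representativeness under the budget.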

[AI-93] Scaling Coding Agents via Atomic Skills

【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)编程代理在复合基准测试(如缺陷修复)上训练时普遍存在的任务特定过拟合问题,从而导致泛化能力受限。其解决方案的关键在于提出一种新的规模化范式,将训练焦点从任务层面优化转向原子技能(atomic skills)的掌握。作者首先形式化了五种基础原子技能:代码定位、代码编辑、单元测试生成、问题复现和代码审查,并证明这些技能具有更强的通用性和可组合性。通过在这些原子技能上进行联合强化学习(joint reinforcement learning),实现了各技能的协同提升且无负向干扰或权衡。实验表明,这种基于原子技能的训练方式显著提升了代理在未见复合任务(如代码重构、机器学习工程和代码安全)上的表现,平均性能提升达18.7%。

链接: https://arxiv.org/abs/2604.05013
作者: Yingwei Ma,Yue Liu,Xinlong Yang,Yanhao Li,Kelin Fu,Yibo Miao,Yuchong Xie,Zhexu Wang,Shing-Chi Cheung
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current LLM coding agents are predominantly trained on composite benchmarks (e.g., bug fixing), which often leads to task-specific overfitting and limited generalization. To address this, we propose a novel scaling paradigm that shifts the focus from task-level optimization to atomic skill mastery. We first formalize five fundamental atomic skills, code localization, code editing, unit-test generation, issue reproduction, and code review, that serve as the basis vectors for complex software engineering tasks. Compared with composite coding tasks, these atomic skills are more generalizable and composable. Then, we scale coding agents by performing joint RL over atomic skills. In this manner, atomic skills are consistently improved without negative interference or trade-offs between them. Notably, we observe that improvements in these atomic skills generalize well to other unseen composite coding tasks, such as bug-fixing, code refactoring, machine learning engineering, and code security. The observation motivates a new scaling paradigm for coding agents by training with atomic skills. Extensive experiments demonstrate the effectiveness of our proposed paradigm. Notably, our joint RL improves average performance by 18.7% on 5 atomic skills and 5 composite tasks.

[AI-94] Comparative Characterization of KV Cache Management Strategies for LLM Inference

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在推理过程中因Key-Value (KV) 缓存不断增长所带来的系统级内存瓶颈问题,尤其是在模型规模扩大、上下文长度增加以及并发请求竞争有限内存资源的情况下。其解决方案的关键在于对三种前沿的KV缓存管理框架(vLLM、InfiniGen 和 H2O)进行系统的实证比较,评估它们在不同请求速率、模型规模和稀疏性水平下的延迟、吞吐量与内存占用表现,从而明确各类框架在特定条件下的最优适用场景,为实际部署中KV缓存策略的选择与配置提供量化依据。

链接: https://arxiv.org/abs/2604.05012
作者: Oteo Mamo,Olga Kogiou,Hyunjin Yi,Weikuan Yu
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Efficient inference with Large Language Models (LLMs) increasingly relies on Key-Value (KV) caches to store previously computed key and value vectors at each layer. These caches are essential to minimize redundant computation during autoregressive token generation, lowering computational complexity from quadratic to linear. However, the growth of KV caches has posed significant system-level challenges, particularly as model sizes increase, context lengths grow, and concurrent requests compete for limited memory resources. Even though several recent frameworks for KV cache management have emerged, their comparative trade-offs in memory consumption and inference performance have not been fully understood, especially under varying request sizes and model configurations. In this work, we conduct an empirical study of three state-of-the-art KV cache management frameworks: vLLM, InfiniGen, and H2O. These frameworks employ techniques such as tensor offloading, token eviction heuristics, and speculative scheduling to balance memory usage and performance. We evaluate their performance in terms of a range of metrics such as latency, throughput, and memory usage across a spectrum of key parameters including request rates, model sizes, and sparsity levels. Our results pinpoint the conditions for each framework to perform the best, revealing the most suitable selection and configuration of KV cache strategies under memory and performance constraints.
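A back-of-the-envelope estimate of KV cache size helps explain the memory pressure described above. The model configuration below is illustrative (a 7B-class model in fp16), not taken from the paper:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """KV cache size per sequence: keys plus values (factor 2),
    stored for every layer, KV head, and token position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 7B-class config: 32 layers, 32 KV heads, head_dim 128,
# fp16 (2 bytes), at a 4096-token context.
size = kv_cache_bytes(32, 32, 128, 4096)
print(size / 2**30, "GiB")  # 2.0 GiB per sequence
```

At a few GiB per concurrent sequence, it is easy to see why offloading, eviction, and paging strategies such as those in vLLM, InfiniGen, and H2O become necessary.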

[AI-95] YMIR: A new Benchmark Dataset and Model for Arabic Yemeni Music Genre Classification Using Convolutional Neural Networks

【速读】:该论文旨在解决当前音乐信息检索(Music Information Retrieval, MIR)领域中对非西方音乐文化传统代表性不足的问题,特别是针对也门传统音乐类型的自动分类任务缺乏专门数据集和有效模型。其解决方案的关键在于构建首个面向也门传统音乐的高质量基准数据集——Yemeni Music Information Retrieval (YMIR),包含5种典型也门音乐流派(Sanaani、Hadhrami、Lahji、Tihami 和 Adeni)共1,475个音频片段,并由五位本地音乐专家标注,确保高一致性(Fleiss kappa = 0.85)。同时,提出基于卷积神经网络(CNN)的 Yemeni Music Classification Model (YMCM),通过系统实验对比多种特征表示(如Mel-spectrogram、Chroma、FilterBank、MFCCs)与不同架构(AlexNet、VGG16、MobileNet及基线CNN),验证了使用Mel-spectrogram作为输入特征时,YMCM在准确率上达到98.8%,成为该任务的有效基准模型。

链接: https://arxiv.org/abs/2604.05011
作者: Moeen AL-Makhlafi,Abdulrahman A. AlKannad,Eiad Almekhlafi,Nawaf Q. Othman Ahmed Mohammed,Saher Qaid
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automatic music genre classification is a major task in music information retrieval; however, most current benchmarks and models have been developed primarily for Western music, leaving culturally specific traditions underrepresented. In this paper, we introduce the Yemeni Music Information Retrieval (YMIR) dataset, which contains 1,475 carefully selected audio clips covering five traditional Yemeni genres: Sanaani, Hadhrami, Lahji, Tihami, and Adeni. The dataset was labeled by five Yemeni music experts following a clear and structured protocol, resulting in strong inter-annotator agreement (Fleiss kappa = 0.85). We also propose the Yemeni Music Classification Model (YMCM), a convolutional neural network (CNN)-based system designed to classify music genres from time-frequency features. Using a consistent preprocessing pipeline, we perform a systematic comparison across six experimental groups and five different architectures, resulting in a total of 30 experiments. Specifically, we evaluate several feature representations, including Mel-spectrograms, Chroma, FilterBank, and MFCCs with 13, 20, and 40 coefficients, and benchmark YMCM against standard models (AlexNet, VGG16, MobileNet, and a baseline CNN) under the same experimental conditions. The experimental findings reveal that YMCM is the most effective, achieving the highest accuracy of 98.8% with Mel-spectrogram features. The results also provide practical insights into the relationship between feature representation and model capacity. The findings establish YMIR as a useful benchmark and YMCM as a strong baseline for classifying Yemeni music genres.
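The inter-annotator agreement statistic reported above (Fleiss kappa = 0.85) is a standard measure; a compact reference implementation for an items-by-categories count matrix:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for an N-items x K-categories count matrix,
    where each row sums to the (fixed) number of raters."""
    N = len(ratings)
    n = sum(ratings[0])                         # raters per item
    total = N * n
    K = len(ratings[0])
    # Per-category proportions over all rating assignments.
    p = [sum(row[j] for row in ratings) / total for j in range(K)]
    # Mean observed per-item agreement vs. chance agreement.
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in ratings) / N
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1.0 - P_e)

# Two items, three raters, perfect agreement on different categories.
print(fleiss_kappa([[3, 0], [0, 3]]))  # 1.0
```

Values near 0.85, as reported for YMIR, indicate strong agreement among the five expert annotators.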

[AI-96] Generalizable Audio-Visual Navigation via Binaural Difference Attention and Action Transition Prediction IJCNN2026

【速读】:该论文旨在解决音频视觉导航(Audio-Visual Navigation, AVN)中代理在未见过的3D环境中泛化能力不足的问题,尤其是现有方法容易过拟合于语义声音特征和特定训练环境。解决方案的关键在于提出Binaural Difference Attention with Action Transition Prediction (BDATP)框架,其核心创新包括:(1) 双耳差异注意力(Binaural Difference Attention, BDA)模块,通过显式建模双耳间的时间和强度差异来增强空间方向感知,从而减少对语义类别依赖;(2) 动作转移预测(Action Transition Prediction, ATP)任务,引入辅助的动作预测目标作为正则化项,有效缓解环境特异性过拟合。实验表明,BDATP可无缝集成到多种主流基线模型中,并在Replica和Matterport3D数据集上实现显著且一致的性能提升,尤其在未听过的声音场景下成功率最高提升达21.6个百分点,验证了其卓越的泛化能力和对不同导航架构的鲁棒性。

链接: https://arxiv.org/abs/2604.05007
作者: Jia Li,Yinfeng Yu
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Main paper (6 pages). Accepted for publication by the International Joint Conference on Neural Networks (IJCNN 2026)

点击查看摘要

Abstract:In Audio-Visual Navigation (AVN), agents must locate sound sources in unseen 3D environments using visual and auditory cues. However, existing methods often struggle with generalization in unseen scenarios, as they tend to overfit to semantic sound features and specific training environments. To address these challenges, we propose the Binaural Difference Attention with Action Transition Prediction (BDATP) framework, which jointly optimizes perception and policy. Specifically, the Binaural Difference Attention (BDA) module explicitly models interaural differences to enhance spatial orientation, reducing reliance on semantic categories. Simultaneously, the Action Transition Prediction (ATP) task introduces an auxiliary action prediction objective as a regularization term, mitigating environment-specific overfitting. Extensive experiments on the Replica and Matterport3D datasets demonstrate that BDATP can be seamlessly integrated into various mainstream baselines, yielding consistent and significant performance gains. Notably, our framework achieves state-of-the-art Success Rates across most settings, with a remarkable absolute improvement of up to 21.6 percentage points on the Replica dataset for unheard sounds. These results underscore BDATP's superior generalization capability and its robustness across diverse navigation architectures.
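Among the interaural differences the BDA module models is the level difference between the two ear signals; a textbook computation of the interaural level difference (ILD) in dB, illustrating the cue rather than the module itself:

```python
import math

def interaural_level_diff(left, right):
    """Interaural level difference in dB from the RMS of the two ear
    channels; positive means the source is louder at the left ear."""
    rms = lambda x: math.sqrt(sum(s * s for s in x) / len(x))
    return 20.0 * math.log10(rms(left) / rms(right))

# Left channel at twice the amplitude of the right channel.
left = [0.2, -0.2, 0.2, -0.2]
right = [0.1, -0.1, 0.1, -0.1]
print(round(interaural_level_diff(left, right), 2))  # 6.02 dB
```

A doubling of amplitude yields 20*log10(2) ≈ 6.02 dB, the kind of direction-bearing signal that does not depend on what the sound semantically is.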

[AI-97] Learning Stable Predictors from Weak Supervision under Distribution Shift

【速读】:该论文旨在解决在弱监督(weak supervision)场景下,模型在分布偏移(distribution shift)尤其是监督机制本身发生变化时的鲁棒性问题,即“监督漂移”(supervision drift),其定义为在不同情境下条件概率 $ P(y | x, c) $ 的变化。研究以CRISPR-Cas13d实验为例,通过RNA-seq响应间接推断引导序列效能(guide efficacy),构建了一个受控的非独立同分布(non-IID)基准数据集,包含明确的领域和时间偏移。关键解决方案在于识别并验证:尽管特征-标签关系在不同细胞系间保持稳定,但在时间维度上显著变化,导致所有模型在跨时间迁移时性能急剧下降(如XGBoost的R² = -0.155,Spearman相关系数ρ = 0.056),从而证明失败源于监督漂移而非模型能力不足;因此,特征稳定性可作为预测迁移不可行性的简单诊断指标。

链接: https://arxiv.org/abs/2604.05002
作者: Mehrdad Shoeibi,Elias Hossain,Ivan Garibay,Niloofar Yousefi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning from weak or proxy supervision is common when ground-truth labels are unavailable, yet robustness under distribution shift remains poorly understood, especially when the supervision mechanism itself changes. We formalize this as supervision drift, defined as changes in P(y | x, c) across contexts, and study it in CRISPR-Cas13d experiments where guide efficacy is inferred indirectly from RNA-seq responses. Using data from two human cell lines and multiple time points, we build a controlled non-IID benchmark with explicit domain and temporal shifts while keeping the weak-label construction fixed. Models achieve strong in-domain performance (ridge R^2 = 0.356, Spearman rho = 0.442) and partial cross-cell-line transfer (rho ~ 0.40). However, temporal transfer fails across all models, with negative R^2 and near-zero correlation (e.g., XGBoost R^2 = -0.155, rho = 0.056). Additional analyses confirm this pattern. Feature-label relationships remain stable across cell lines but change sharply over time, indicating that failures arise from supervision drift rather than model limitations. These findings highlight feature stability as a simple diagnostic for detecting non-transferability before deployment.
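The negative R^2 values reported for temporal transfer simply mean the model predicts worse than a constant mean baseline, which the definition of the coefficient of determination makes concrete:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot. It goes
    negative when the predictor is worse than always predicting the
    mean of y_true."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# A predictor that is systematically wrong (e.g. after the
# supervision mechanism has drifted) scores below zero.
y = [1.0, 2.0, 3.0]
anti = [3.0, 2.0, 1.0]
print(r_squared(y, anti))  # -3.0
```

An R^2 of -0.155 with near-zero rank correlation, as reported for temporal transfer, therefore indicates the learned feature-label mapping no longer applies at the later time point.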

[AI-98] Closed-Loop Autonomous Software Development via Jira-Integrated Backlog Orchestration: A Case Study in Deterministic Control and Safety-Constrained Automation

【速读】:该论文旨在解决软件生命周期管理中自动化流程缺乏闭环控制与可追溯性的问题,传统方法往往局限于代码生成工具,难以实现端到端的可靠执行与错误恢复。解决方案的关键在于构建一个基于控制架构(control architecture)的闭环系统,通过结构化的七阶段自动化流水线、版本化提示规范(versioned prompt specifications)、检查点时间预算、异常处理机制和集中式锁机制(centralized lock mechanisms),确保任务在受控环境中稳定运行;同时引入Jira状态合约(Jira Status Contract)实现外部冲突锁定,并支持断网情况下的降级模式操作,从而在保障安全性与可审计性的前提下,实现高可靠、可验证的软件生命周期自动化。

链接: https://arxiv.org/abs/2604.05000
作者: Elias Calboreanu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 27 pages, 7 figures, 5 tables. Submitted to Automated Software Engineering (Springer)

点击查看摘要

Abstract:This paper presents a closed-loop system for software lifecycle management framed as a control architecture rather than a code-generation tool. The system manages a backlog of approximately 1,602 rows across seven task families, ingests 13 structured source documents, and executes a deterministic seven-stage pipeline implemented as seven scheduled automation lanes. The automation stack comprises approximately 12,661 lines of Python across 23 scripts plus 6,907 lines of versioned prompt specifications, with checkpoint-based time budgets, 101 exception handlers, and 12 centralized lock mechanisms implemented through four core functions and eight reusable patterns. A Jira Status Contract provides externally observable collision locking, and a degraded-mode protocol supports continued local operation when Jira is unavailable. Artificial-intelligence assistance is bounded by structured context packages, configured resource caps, output re-validation, and human review gates. A formal evaluation of the initial 152-run window yielded 100% terminal-state success with a 95% Clopper-Pearson interval of [97.6%, 100%]; the system has since accumulated more than 795 run artifacts in continuous operation. Three rounds of adversarial code review identified 51 findings, all closed within the study scope (48 fully remediated, 3 closed with deferred hardening), with zero false negatives within the injected set. In an autonomous security ticket family of 10 items, six were completed through pipeline-autonomous dispatch and verification, two required manual remediation, and two were closed by policy decision. The results indicate that bounded, traceable lifecycle automation is practical when autonomy is embedded within explicit control, recovery, and audit mechanisms.
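The reported 95% Clopper-Pearson interval of [97.6%, 100%] for 152/152 successful runs can be reproduced from the closed form for the all-successes case:

```python
def clopper_pearson_all_successes(n, alpha=0.05):
    """Exact (Clopper-Pearson) CI for a proportion when all n trials
    succeed: the upper bound is 1, and the lower bound p_lo solves
    p_lo**n = alpha/2 (the chance of n straight successes at p_lo)."""
    return ((alpha / 2.0) ** (1.0 / n), 1.0)

lo, hi = clopper_pearson_all_successes(152)
print(round(lo, 3), hi)  # 0.976 1.0, matching the reported interval
```

The general case with failures requires beta-distribution quantiles, but this special case suffices to check the paper's figure.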

[AI-99] PRIME: Prototype-Driven Multimodal Pretraining for Cancer Prognosis with Missing Modalities

【速读】:该论文旨在解决临床癌症预后建模中因多模态数据(如组织病理图像、基因表达谱和病理报告)常存在缺失而导致的模型性能下降问题。现有方法通常依赖于完整配对的数据,难以应对真实世界中常见的模态不完整场景。解决方案的关键在于提出PRIME框架,其核心创新包括:(1)将异构模态嵌入映射到统一的token空间,并引入共享原型记忆库,通过患者级语义共识检索实现潜在空间中的语义补全,从而生成结构对齐的token表示而无需重建原始信号;(2)设计两种互补的自监督预训练目标——跨模态对齐与结构化缺失增强下的融合一致性,使学习到的表征在任意模态子集下仍保持预测能力。这一策略显著提升了模型在碎片化临床数据中的鲁棒性和泛化性。

链接: https://arxiv.org/abs/2604.04999
作者: Kai Yu,Shuang Zhou,Yiran Song,Zaifu Zhan,Jie Peng,Kaixiong Zhou,Tianlong Chen,Feng Xie,Meng Wang,Huazhu Fu,Mingquan Lin,Rui Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal self-supervised pretraining offers a promising route to cancer prognosis by integrating histopathology whole-slide images, gene expression, and pathology reports, yet most existing approaches require fully paired and complete inputs. In practice, clinical cohorts are fragmented and often miss one or more modalities, limiting both supervised fusion and scalable multimodal pretraining. We propose PRIME, a missing-aware multimodal self-supervised pretraining framework that learns robust and transferable representations from partially observed cohorts. PRIME maps heterogeneous modality embeddings into a unified token space and introduces a shared prototype memory bank for latent-space semantic imputation via patient-level consensus retrieval, producing structurally aligned tokens without reconstructing raw signals. Two complementary pretraining objectives: inter-modality alignment and post-fusion consistency under structured missingness augmentation, jointly learn representations that remain predictive under arbitrary modality subsets. We evaluate PRIME on The Cancer Genome Atlas with label-free pretraining on 32 cancer types and downstream 5-fold evaluation on five cohorts across overall survival prediction, 3-year mortality classification, and 3-year recurrence classification. PRIME achieves the best macro-average performance among all compared methods, reaching 0.653 C-index, 0.689 AUROC, and 0.637 AUROC on the three tasks, respectively, while improving robustness under test-time missingness and supporting parameter-efficient and label-efficient adaptation. These results support missing-aware multimodal pretraining as a practical strategy for prognosis modeling in fragmented clinical data settings.
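The C-index used for overall survival prediction above is Harrell's concordance index; a minimal implementation over right-censored data, where a pair is comparable when the earlier time corresponds to an observed event:

```python
def c_index(times, events, risks):
    """Harrell's concordance index: over comparable pairs, count
    those where the earlier failure received the higher risk score;
    ties in risk count as half a concordant pair."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Pair (i, j) is comparable if i had an observed event
            # strictly before j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

# Perfectly ranked risks (shorter survival -> higher risk) give 1.0;
# the third subject here is censored (event = 0).
print(c_index([1, 2, 3], [1, 1, 0], [0.9, 0.5, 0.1]))  # 1.0
```

A C-index of 0.653, as reported for PRIME, means the model orders about 65% of comparable patient pairs correctly, where 0.5 is chance.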

[AI-100] FreakOut-LLM : The Effect of Emotional Stimuli on Safety Alignment

【速读】:该论文旨在解决安全对齐的大语言模型(Safety-aligned LLMs)在情绪化刺激下是否仍能保持其拒绝有害请求的能力这一关键问题。现有研究普遍关注模型在常规条件下的安全机制,但未考察情绪状态(如压力或放松)对模型对抗攻击敏感性的影响。论文提出FreakOut-LLM框架,通过引入经心理学验证的情绪诱导提示(emotional priming),系统评估十种大语言模型在压力、放松和中性三种情绪条件下对AdvBench攻击的响应差异。其核心发现是:压力情绪显著提升模型被越狱的成功率(平均增加65.2%,p < 0.001),而放松情绪无显著影响;个体心理状态与攻击成功率高度相关(|r| ≥ 0.70),且压力是唯一在控制提示长度和模型身份后的显著预测因子。这表明情绪情境构成了一个可测量的新型攻击面,对高压力场景中的生成式AI部署具有重要警示意义。

链接: https://arxiv.org/abs/2604.04992
作者: Daniel Kuznetsov,Ofir Cohen,Karin Shistik,Rami Puzis,Asaf Shabtai
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Safety-aligned LLMs go through refusal training to reject harmful requests, but whether these mechanisms remain effective under emotionally charged stimuli is unexplored. We introduce FreakOut-LLM, a framework investigating whether emotional context compromises safety alignment in adversarial settings. Using validated psychological stimuli, we evaluate how emotional priming through system prompts affects jailbreak susceptibility across ten LLMs. We test three conditions (stress, relaxation, neutral) using scenarios from established psychological protocols, plus a no-prompt baseline, and evaluate attack success using HarmBench on AdvBench prompts. Stress priming increases jailbreak success by 65.2% compared to neutral conditions (z = 5.93, p < 0.001; OR = 1.67, Cohen’s d = 0.28), while relaxation priming produces no effect (p = 0.84). Five of ten models show significant vulnerability, with the largest effects concentrated in open-weight models. Logistic regression on 59,800 queries confirms stress as the sole significant condition predictor after controlling for prompt length (p = 0.61) and model identity. Measured psychological state strongly predicts attack success (|r| ≥ 0.70 across five instruments; all p < 0.001 in individual-level logistic regression). These results establish emotional context as a measurable attack surface with implications for real-world AI deployment in high-stress domains.
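摘要中的优势比(OR)与相对增幅可直接由 2×2 计数表得到;下面用随意取的假设计数演示计算方式(数字并非论文数据):

```python
def odds_ratio(a, b, c, d):
    """2x2 表的优势比:a/b 为压力组的成功/失败次数,c/d 为中性组。"""
    return (a / b) / (c / d)

def relative_increase(p_treat, p_control):
    """成功率的相对增幅,对应摘要中 +65.2% 一类的表述。"""
    return (p_treat - p_control) / p_control

# 假设的计数,仅作演示
stress_success, stress_fail = 330, 670
neutral_success, neutral_fail = 200, 800
p_s = stress_success / (stress_success + stress_fail)      # 0.33
p_n = neutral_success / (neutral_success + neutral_fail)   # 0.20
print(round(odds_ratio(stress_success, stress_fail, neutral_success, neutral_fail), 2))  # 1.97
print(round(relative_increase(p_s, p_n), 2))  # 0.65
```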

[AI-101] Architecture Without Architects: How AI Coding Agents Shape Software Architecture

【速读】:该论文旨在解决生成式 AI (Generative AI) 编码代理在自动化软件架构决策中隐式引入架构选择却缺乏显式审查的问题。当前实践中,AI 代理能在几秒内完成框架选择、基础设施搭建和集成配置等关键架构决策,但这些决策通常未被当作正式的架构活动进行评估与管理。解决方案的关键在于识别出五种代理隐式做出架构选择的机制,并提出六种“提示-架构耦合模式”(prompt-architecture coupling patterns),将自然语言提示特征与系统所需的基础架构需求直接映射,从而揭示提示词如何影响最终系统的结构差异。其中,部分模式如结构化输出验证属于条件性耦合,可能随模型能力提升而弱化;而工具调用编排则为根本性耦合,具有长期稳定性。论文进一步通过实例验证了仅凭提示措辞即可生成结构迥异的系统,由此提出“ vibe architecting”这一概念——即由提示驱动而非人工设计所形成的架构形态,并建议建立审查流程、决策记录和辅助工具以实现对这类隐性架构决策的有效治理。

链接: https://arxiv.org/abs/2604.04990
作者: Phongsakon Mark Konrad,Tim Lukas Adam,Riccardo Terrenzi,Serkan Ayvaz
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI coding agents select frameworks, scaffold infrastructure, and wire integrations, often in seconds. These are architectural decisions, yet almost no one reviews them as such. We identify five mechanisms by which agents make implicit architectural choices and propose six prompt-architecture coupling patterns that map natural-language prompt features to the infrastructure they require. The patterns range from contingent couplings (structured output validation) that may weaken as models improve to fundamental ones (tool-call orchestration) that persist regardless of model capability. An illustrative demonstration confirms that prompt wording alone produces structurally different systems for the same task. We term the phenomenon vibe architecting (architecture shaped by prompts rather than deliberate design) and outline review practices, decision records, and tooling to bring these hidden decisions under governance.

[AI-102] Prune-Quantize-Distill: An Ordered Pipeline for Efficient Neural Network Compression IJCNN

【速读】:该论文旨在解决模型压缩与实际推理加速之间存在的不一致性问题,即传统压缩指标(如参数量或浮点运算次数 FLOPs)无法准确预测在受限 CPU 和内存环境下的真实推理延迟(wall-clock inference time)。尤其当采用无结构稀疏化(unstructured sparsity)时,虽然能减少模型存储空间,但因不规则内存访问和稀疏核计算开销反而可能拖慢标准 CPU 执行速度。为应对这一挑战,论文提出一个有序的三阶段优化流程:首先进行无结构剪枝(unstructured pruning)以降低模型容量并提升后续低精度优化的鲁棒性;其次执行 INT8 量化感知训练(INT8 quantization-aware training, QAT),这是实现显著运行时加速的核心手段;最后应用知识蒸馏(knowledge distillation, KD)在已约束的稀疏 INT8 模型中恢复精度,且不改变部署格式。实验表明,该顺序组合在 CIFAR-10/100 上多个骨干网络上均优于单一技术,实现了更优的准确性-尺寸-延迟权衡,并验证了阶段顺序的重要性,为边缘部署提供了基于实测延迟而非代理指标的实用指导。

链接: https://arxiv.org/abs/2604.04988
作者: Longsheng Zhou,Yu Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages, submitted to IJCNN

点击查看摘要

Abstract:Modern deployment often requires trading accuracy for efficiency under tight CPU and memory constraints, yet common compression proxies such as parameter count or FLOPs do not reliably predict wall-clock inference time. In particular, unstructured sparsity can reduce model storage while failing to accelerate (and sometimes slightly slowing down) standard CPU execution due to irregular memory access and sparse kernel overhead. Motivated by this gap between compression and acceleration, we study a practical, ordered pipeline that targets measured latency by combining three widely used techniques: unstructured pruning, INT8 quantization-aware training (QAT), and knowledge distillation (KD). Empirically, INT8 QAT provides the dominant runtime benefit, while pruning mainly acts as a capacity-reduction pre-conditioner that improves the robustness of subsequent low-precision optimization; KD, applied last, recovers accuracy within the already constrained sparse INT8 regime without changing the deployment form. We evaluate on CIFAR-10/100 using three backbones (ResNet-18, WRN-28-10, and VGG-16-BN). Across all settings, the ordered pipeline achieves a stronger accuracy-size-latency frontier than any single technique alone, reaching 0.99-1.42 ms CPU latency with competitive accuracy and compact checkpoints. Controlled ordering ablations with a fixed 20/40/40 epoch allocation further confirm that stage order is consequential, with the proposed ordering generally performing best among the tested permutations. Overall, our results provide a simple guideline for edge deployment: evaluate compression choices in the joint accuracy-size-latency space using measured runtime, rather than proxy metrics alone.
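文中“剪枝作为前置、INT8 量化提供主要加速”的顺序流水线,可以用一个纯 Python 数值玩具感受各阶段的效果(示意实现,非真实的 QAT/蒸馏训练流程;蒸馏阶段恢复精度需要训练,此处省略):

```python
def prune(weights, sparsity):
    """无结构剪枝:将幅值最小的 sparsity 比例权重置零(容量预处理)。"""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

def quantize_int8(weights):
    """对称 INT8 量化:返回整数权重与缩放因子(部署形态不变)。"""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

w = [0.9, -0.02, 0.5, 0.01, -0.7, 0.03]
w_pruned = prune(w, 0.5)
q, scale = quantize_int8(w_pruned)
print(w_pruned)  # [0.9, 0.0, 0.5, 0.0, -0.7, 0.0]
print(q)         # [127, 0, 71, 0, -99, 0],全部落在 INT8 范围内
```

注意这正是论文强调的顺序:先剪枝降低容量,再量化;若反过来,量化标定会被小幅值权重干扰。真实加速来自整数核在 CPU 上的执行,而非此玩具所能体现。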

[AI-103] Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling ICLR2026

【速读】:该论文旨在解决传统推测采样(Speculative Sampling, SpS)中因严格要求生成分布与验证器语言模型(verifier LLM)分布完全一致而导致的效率瓶颈问题。这种严格约束限制了采样策略的灵活性,使得诸如 top-k 或温度采样等常见优化手段无法被有效利用。为解决此问题,作者提出了一种基于约束优化形式化的新方法 Cactus(Constrained Acceptance Speculative Sampling),其关键在于通过引入可控的分布偏差约束,在保证输出质量的前提下显著提升接受率(acceptance rate),从而实现更高的解码吞吐量。

链接: https://arxiv.org/abs/2604.04987
作者: Yongchang Hao,Lili Mou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: Camera-ready version. Accepted at ICLR 2026

点击查看摘要

Abstract:Speculative sampling (SpS) has been successful in accelerating the decoding throughput of auto-regressive large language models by leveraging smaller draft models. SpS strictly enforces the generated distribution to match that of the verifier LLM. This is unnecessarily restrictive as slight variations of the verifier’s distribution, such as sampling with top-k or temperature, would also be acceptable. Typical acceptance sampling (TAS) alleviates this issue by accepting more tokens using entropy-based heuristics. However, this approach distorts the verifier distribution, potentially degrading output quality when the verifier encodes critical information. In this work, we formalize the speculative sampling algorithm through the lens of constrained optimization. Based on this formulation, we propose Cactus (constrained acceptance speculative sampling), a method that guarantees controlled divergence from the verifier distribution and increased acceptance rates. Empirical results across a wide range of benchmarks confirm the effectiveness of our approach.
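作为背景,标准推测采样的“接受-拒绝-残差重采样”规则如下所示(教学性示意;Cactus 对该规则的约束放宽方式见论文原文,此处不涉及):

```python
def acceptance_prob(p, q, token):
    """草稿 token 的接受概率 min(1, p[x]/q[x]);p 为验证器分布,q 为草稿分布。"""
    if q[token] == 0:
        return 0.0
    return min(1.0, p[token] / q[token])

def residual_distribution(p, q):
    """被拒绝后用于重采样的残差分布:max(p - q, 0) 再归一化。"""
    raw = {x: max(p[x] - q.get(x, 0.0), 0.0) for x in p}
    z = sum(raw.values())
    return {x: v / z for x, v in raw.items()}

p = {"a": 0.6, "b": 0.3, "c": 0.1}  # 验证器 LLM 的分布
q = {"a": 0.3, "b": 0.6, "c": 0.1}  # 草稿模型的分布
print(acceptance_prob(p, q, "a"))   # 1.0:验证器比草稿更偏好 "a",必收
print(acceptance_prob(p, q, "b"))   # 0.5
print(residual_distribution(p, q))  # 全部残差质量集中在 "a"
```

按此规则采样,最终输出的分布严格等于 p;Cactus 研究的正是如何在受控偏差内松动这一等式以换取更高接受率。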

[AI-104] Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents

【速读】:该论文旨在解决编码代理(coding agent)在处理工具输出时存在冗余信息消耗的问题,即尽管每次工具输出的大部分内容对下一步决策无关紧要,但代理仍需逐字读取全部内容。为应对这一问题,作者提出任务条件下的工具输出裁剪(task-conditioned tool-output pruning)方法:给定一个聚焦的查询和一个工具输出,模型需返回最小且完整的原始证据块(verbatim evidence block),以供代理下一步检查。其核心解决方案是构建了一个包含11,477个样本的大规模基准数据集(基于SWE-bench仓库交互与合成多生态系统工具输出),并采用LoRA微调Qwen 3.5 2B模型,在保证高召回率(0.86)和F1分数(0.80)的同时,移除92%的输入token,显著优于零样本大模型和启发式裁剪基线方法。

链接: https://arxiv.org/abs/2604.04979
作者: Ádám Kovács
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 7 pages

点击查看摘要

Abstract:Coding agents repeatedly consume long tool observations even though only a small fraction of each observation matters for the next step. We study task-conditioned tool-output pruning: given a focused query and one tool output, return the smallest verbatim evidence block the agent should inspect next. We introduce a benchmark of 11,477 examples built from SWE-bench repository interactions and synthetic multi-ecosystem tool outputs, with a manually curated 618-example test set. We fine-tune Qwen 3.5 2B with LoRA and compare it against larger zero-shot models and heuristic pruning baselines. Our model reaches 0.86 recall and 0.80 F1 while removing 92% of input tokens, outperforming zero-shot Qwen 3.5 35B A3B by 11 recall points and all heuristic baselines by a wide margin.
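摘要里的召回率、F1 与“移除 92% 输入 token”可以在 token 集合层面这样定义(示意实现,数字为编造的例子):

```python
def span_metrics(gold_tokens, pred_tokens, full_output_tokens):
    """以 token 集合衡量证据块裁剪质量:召回、精确率、F1 与 token 削减率。"""
    gold, pred = set(gold_tokens), set(pred_tokens)
    tp = len(gold & pred)
    recall = tp / len(gold) if gold else 0.0
    precision = tp / len(pred) if pred else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    reduction = 1.0 - len(pred_tokens) / len(full_output_tokens)
    return recall, precision, f1, reduction

full = list(range(100))     # 工具输出共 100 个 token
gold = list(range(40, 50))  # 黄金证据块:10 个 token
pred = list(range(42, 50))  # 模型裁剪后保留的 8 个 token
r, prec, f1, red = span_metrics(gold, pred, full)
print(r, prec, f1, red)  # 0.8 1.0 0.888... 0.92
```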

[AI-105] Measuring the Permission Gate: A Stress-Test Evaluation of Claude Code’s Auto Mode

【速读】:该论文旨在解决生成式 AI(Generative AI)在代码生成场景中因权限控制不足而导致的安全风险问题,特别是针对 Claude Code 的自动模式(auto mode)这一首个部署的 AI 编码代理权限系统进行独立评估。其核心问题是:当前权限控制系统依赖于对危险工具调用(tool calls)的两阶段分类器来过滤潜在恶意行为,但在用户意图明确但目标范围、影响半径或风险级别未明确定义的模糊授权场景下,该系统是否仍具备充分覆盖能力。解决方案的关键在于设计并使用 AmPermBench 基准测试集(包含 128 个提示任务,涵盖四个 DevOps 任务类别和三个受控模糊维度),对 253 个状态变更操作进行逐项评估,并与“黄金标准”(oracle ground truth)比对,从而揭示 auto mode 在实际压力测试负载下的真实性能表现——结果显示其端到端假负率高达 81.0%,显著高于 Anthropic 报告的 17%,主要源于对 Tier 2(项目内文件编辑)类操作缺乏覆盖,暴露出系统假设所有危险行为均通过 shell 执行的局限性。

链接: https://arxiv.org/abs/2604.04978
作者: Zimo Ji,Zongjie Li,Wenyuan Jiang,Yudong Gao,Shuai Wang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Claude Code’s auto mode is the first deployed permission system for AI coding agents, using a two-stage transcript classifier to gate dangerous tool calls. Anthropic reports a 0.4% false positive rate and 17% false negative rate on production traffic. We present the first independent evaluation of this system on deliberately ambiguous authorization scenarios, i.e., tasks where the user’s intent is clear but the target scope, blast radius, or risk level is underspecified. Using AmPermBench, a 128-prompt benchmark spanning four DevOps task families and three controlled ambiguity axes, we evaluate 253 state-changing actions at the individual action level against oracle ground truth. Our findings characterize auto mode’s scope-escalation coverage under this stress-test workload. The end-to-end false negative rate is 81.0% (95% CI: 73.8%-87.4%), substantially higher than the 17% reported on production traffic, reflecting a fundamentally different workload rather than a contradiction. Notably, 36.8% of all state-changing actions fall outside the classifier’s scope via Tier 2 (in-project file edits), contributing to the elevated end-to-end FNR. Even restricting to the 160 actions the classifier actually evaluates (Tier 3), the FNR remains 70.3%, while the FPR rises to 31.9%. The Tier 2 coverage gap is most pronounced on artifact cleanup (92.9% FNR), where agents naturally fall back to editing state files when the expected CLI is unavailable. These results highlight a coverage boundary worth examining: auto mode assumes dangerous actions transit the shell, but agents routinely achieve equivalent effects through file edits that the classifier does not evaluate.
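摘要中“81.0%(95% CI:73.8%-87.4%)”一类的比例置信区间通常用 Wilson 方法计算;下面是一个示意(其中的计数 104/128 是笔者为演示反推的近似值,并非论文的原始统计):

```python
import math

def wilson_interval(k, n, z=1.96):
    """比例 k/n 的 Wilson 置信区间(默认 95%)。"""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# 假设 128 个危险动作中有 104 个被漏放行(约 81%)
lo, hi = wilson_interval(104, 128)
print(round(104 / 128, 3), round(lo, 3), round(hi, 3))  # 0.812 0.736 0.871
```

Wilson 区间在比例接近 0 或 1 时比简单正态近似更稳,这也是小样本安全评估中常见的选择。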

[AI-106] Synthetic Trust Attacks: Modeling How Generative AI Manipulates Human Decisions in Social Engineering Fraud

【速读】:该论文旨在解决生成式 AI(Generative AI)时代下新型诈骗攻击的核心问题——即传统防御机制聚焦于合成媒体(synthetic media)检测,而忽视了人类决策层在面对高可信度伪造信息时的脆弱性。研究表明,当前人类对深度伪造(deepfake)的识别准确率仅约55.5%,远低于有效防御阈值;同时,基于大语言模型(LLM)的诈骗代理在诱导合规方面达到46%成功率,显著高于人工操作员的18%,且能绕过现有安全过滤机制。因此,论文提出“合成信任攻击”(Synthetic Trust Attacks, STAs)作为新的威胁分类,并构建STAM模型以系统化描述从情报收集到事后利用的八阶段攻击链。其解决方案的关键在于将防御重心从前端的感知层(perception layer)转向决策层(decision layer),并提出五类信任线索分类体系、可复现的17字段事件编码方案及四项可证伪假设,最终通过实践验证的“冷静、核查、确认”(Calm, Check, Confirm)协议实现研究级的决策层防护。

链接: https://arxiv.org/abs/2604.04951
作者: Muhammad Tahir Ashraf
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 15 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Imagine receiving a video call from your CFO, surrounded by colleagues, asking you to urgently authorise a confidential transfer. You comply. Every person on that call was fake, and you just lost $25 million. This is not a hypothetical. It happened in Hong Kong in January 2024, and it is becoming the template for a new generation of fraud. AI has not invented a new crime. It has industrialised an ancient one: the manufacture of trust. This paper proposes Synthetic Trust Attacks (STAs) as a formal threat category and introduces STAM, the Synthetic Trust Attack Model, an eight-stage operational framework covering the full attack chain from adversary reconnaissance through post-compliance leverage. The core argument is this: existing defenses target synthetic media detection, but the real attack surface is the victim’s decision. When human deepfake detection accuracy sits at approximately 55.5%, barely above chance, and LLM scam agents achieve 46% compliance versus 18% for human operators while evading safety filters entirely, the perception layer has already failed. Defense must move to the decision layer. We present a five-category Trust-Cue Taxonomy, a reproducible 17-field Incident Coding Schema with a pilot-coded example, and four falsifiable hypotheses linking attack structure to compliance outcomes. The paper further operationalizes the author’s practitioner-developed Calm, Check, Confirm protocol as a research-grade decision-layer defense. Synthetic credibility, not synthetic media, is the true attack surface of the AI fraud era.

[AI-107] Algebraic Structure Discovery for Real World Combinatorial Optimisation Problems: A General Framework from Abstract Algebra to Quotient Space Learning

【速读】:该论文旨在解决组合优化问题中搜索空间过大导致难以找到全局最优解的问题。其核心解决方案在于识别并利用隐藏的代数结构,通过构造商空间(quotient space)来压缩冗余表示,从而在更小的搜索空间中直接进行优化。关键创新点是将合取规则(conjunctive rules)建模为一个幺半群(monoid),并基于特征向量编码证明其与布尔超立方体 {0,1}^n 在按位或运算下同构,使得规则中的逻辑与操作转化为编码中的按位或操作,从而实现对功能等价规则类的有效分组和结构感知搜索。实验表明,该方法显著提升了全局最优解的发现概率(48%–77% vs. 35%–37%)。

链接: https://arxiv.org/abs/2604.04941
作者: Min Sun(1),Federica Storti(1),Valentina Martino(1),Miguel Gonzalez-Andrades(1),Tony Kam-Thong(1) ((1) F. Hoffmann-La Roche AG, Roche Pharma Research and Early Development)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many combinatorial optimisation problems hide algebraic structures that, once exposed, shrink the search space and improve the chance of finding the global optimal solution. We present a general framework that (i) identifies algebraic structure, (ii) formalises operations, (iii) constructs quotient spaces that collapse redundant representations, and (iv) optimises directly over these reduced spaces. Across a broad family of rule-combination tasks (e.g., patient subgroup discovery and rule-based molecular screening), conjunctive rules form a monoid. Via a characteristic-vector encoding, we prove an isomorphism to the Boolean hypercube {0,1}^n with bitwise OR, so logical AND in rules becomes bitwise OR in the encoding. This yields a principled quotient-space formulation that groups functionally equivalent rules and guides structure-aware search. On real clinical data and synthetic benchmarks, quotient-space-aware genetic algorithms recover the global optimum in 48% to 77% of runs versus 35% to 37% for standard approaches, while maintaining diversity across equivalence classes. These results show that exposing and exploiting algebraic structure offers a simple, general route to more efficient combinatorial optimisation.
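“规则的逻辑 AND 同构于特征向量上的按位 OR”这一点可以用位掩码直接验证(示意代码,谓词名为笔者假设):

```python
# 示意:合取规则的特征向量编码。每个原子谓词对应一个比特;
# 规则的逻辑 AND 等价于掩码的按位 OR,因而等价规则折叠到同一编码。

PREDICATES = {"age>60": 1 << 0, "stage=III": 1 << 1, "marker+": 1 << 2}

def encode(rule):
    """合取规则(谓词序列)-> 位掩码;AND 对应按位 OR,自动幂等、可交换。"""
    mask = 0
    for pred in rule:
        mask |= PREDICATES[pred]
    return mask

r1 = ["age>60", "stage=III", "age>60"]  # 含重复谓词(幂等)
r2 = ["stage=III", "age>60"]            # 次序不同(交换)
print(encode(r1) == encode(r2))         # True:商空间中同一等价类
print(encode([]))                       # 0:幺元(空规则)
classes = {encode(r) for r in [r1, r2, ["age>60"], []]}
print(len(classes))                     # 3:四条规则折叠为三个等价类
```

在编码空间上去重后再做遗传搜索,正是摘要所说“商空间感知”的基本做法。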

[AI-108] ReVEL: Multi-Turn Reflective LLM-Guided Heuristic Evolution via Structured Performance Feedback

【速读】:该论文旨在解决NP-hard组合优化问题中有效启发式算法设计困难且高度依赖专家知识的问题。现有基于大语言模型(Large Language Models, LLMs)的方法主要依赖一次性代码生成,导致启发式策略脆弱且未能充分利用LLMs的迭代推理能力。其解决方案的关键在于提出ReVEL框架——一种将LLM作为交互式多轮推理器嵌入进化算法(Evolutionary Algorithm, EA)中的混合方法,核心机制包括:(i) 基于性能特征分组(performance-profile grouping),将候选启发式按行为一致性聚类以提供紧凑而信息丰富的反馈;(ii) 多轮反馈驱动的反思机制(multi-turn, feedback-driven reflection),使LLM分析群体行为并生成针对性改进,再由EA元控制器选择性整合与验证,动态平衡探索与利用。该方法显著提升了启发式策略的鲁棒性和多样性,并在标准组合优化基准上优于强基线。

链接: https://arxiv.org/abs/2604.04940
作者: Cuong Van Duc,Minh Nguyen Dinh Tuan,Tam Vu Duc,Tung Vu Duy,Son Nguyen Van,Hanh Nguyen Thi,Binh Huynh Thi Thanh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Designing effective heuristics for NP-hard combinatorial optimization problems remains a challenging and expertise-intensive task. Existing applications of large language models (LLMs) primarily rely on one-shot code synthesis, yielding brittle heuristics that underutilize the models’ capacity for iterative reasoning. We propose ReVEL: Multi-Turn Reflective LLM-Guided Heuristic Evolution via Structured Performance Feedback, a hybrid framework that embeds LLMs as interactive, multi-turn reasoners within an evolutionary algorithm (EA). The core of ReVEL lies in two mechanisms: (i) performance-profile grouping, which clusters candidate heuristics into behaviorally coherent groups to provide compact and informative feedback to the LLM; and (ii) multi-turn, feedback-driven reflection, through which the LLM analyzes group-level behaviors and generates targeted heuristic refinements. These refinements are selectively integrated and validated by an EA-based meta-controller that adaptively balances exploration and exploitation. Experiments on standard combinatorial optimization benchmarks show that ReVEL consistently produces heuristics that are more robust and diverse, achieving statistically significant improvements over strong baselines. Our results highlight multi-turn reasoning with structured grouping as a principled paradigm for automated heuristic design.
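“性能画像分组”的一个最小实现是:把每个启发式在各实例上的得分离散化为桶,画像相同者归为一组,作为喂给 LLM 的紧凑反馈(示意代码,得分为假设数据):

```python
from collections import defaultdict

def group_by_profile(heuristics, buckets=5):
    """按性能画像分组:得分离散化成桶,画像相同者视为行为一致的一组。"""
    groups = defaultdict(list)
    for name, scores in heuristics.items():
        profile = tuple(min(int(s * buckets), buckets - 1) for s in scores)
        groups[profile].append(name)
    return dict(groups)

# 四个候选启发式在三个测试实例上的归一化得分(假设数据)
pop = {
    "h1": [0.91, 0.88, 0.95],
    "h2": [0.90, 0.85, 0.99],  # 与 h1 画像一致
    "h3": [0.40, 0.95, 0.10],
    "h4": [0.10, 0.12, 0.08],
}
groups = group_by_profile(pop)
print(len(groups))  # 3:h1 与 h2 被折叠进同一组
```

分组后只需向 LLM 描述每组的代表行为,而非逐个候选罗列,这正是摘要中“紧凑而信息丰富的反馈”的用意。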

[AI-109] Proximity Measure of Information Object Features for Solving the Problem of Their Identification in Information Systems

【速读】:该论文旨在解决多源异构信息对象在特征层面的相似性度量问题,核心目标是判断来自不同数据源的信息对象是否可能对应同一物理实体(观察对象)。其关键解决方案在于提出一种融合定量与定性特征的新型定量-定性邻近度量方法:针对定量特征采用概率测度以处理因测量误差导致的数值差异,针对定性特征则引入可能性测度进行建模;该方法无需对特征值进行转换即可实现跨特征比较,且通过满足度量公理验证了其合理性。此外,论文还提出了基于多维异质特征组合的多种邻近度计算变体,增强了实际应用的灵活性和适应性。

链接: https://arxiv.org/abs/2604.04939
作者: Volodymyr Yuzefovych
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 12 figures

点击查看摘要

Abstract:The paper considers a new quantitative-qualitative proximity measure for the features of information objects, where data enters a common information resource from several sources independently. The goal is to determine the possibility of their relation to the same physical object (observation object). The proposed measure accounts for the possibility of differences in individual feature values - both quantitative and qualitative - caused by existing determination errors. To analyze the proximity of quantitative feature values, the author employs a probabilistic measure; for qualitative features, a measure of possibility is used. The paper demonstrates the feasibility of the proposed measure by checking its compliance with the axioms required of any measure. Unlike many known measures, the proposed approach does not require feature value transformation to ensure comparability. The work also proposes several variants of measures to determine the proximity of information objects (IO) based on a group of diverse features.
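论文的具体公式未在摘要中给出;下面按“定量特征用概率测度、定性特征用可能性测度”的思路给出一个笔者自拟的示意(高斯形式与混淆可能性表均为假设,非论文定义):

```python
import math

def quant_proximity(x1, x2, sigma1, sigma2):
    """定量特征的概率型邻近度:把两次含误差的测量视为同一真值的观测,
    用差值的高斯密度(零均值,方差为两误差方差之和)归一化到 [0,1]。"""
    var = sigma1 ** 2 + sigma2 ** 2
    return math.exp(-(x1 - x2) ** 2 / (2 * var))

def qual_proximity(a, b, possibility):
    """定性特征的可能性型邻近度:相同取 1,否则查领域给定的混淆可能性表。"""
    if a == b:
        return 1.0
    return possibility.get((a, b), possibility.get((b, a), 0.0))

# 两个来源报告的同一目标:长度(米,测量误差不同)与颜色(可能被混淆)
confuse = {("dark-blue", "black"): 0.7}
prox_len = quant_proximity(12.0, 12.5, sigma1=0.3, sigma2=0.4)
prox_col = qual_proximity("dark-blue", "black", confuse)
print(round(prox_len, 3), prox_col)  # 0.607 0.7
```

注意两类邻近度都落在 [0,1] 且无需把定性值映射成数字,这与摘要强调的“无需特征值转换即可比较”一致。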

[AI-110] Operational Noncommutativity in Sequential Metacognitive Judgments

【速读】:该论文旨在解决认知科学中关于元认知(Metacognition)顺序效应的本质问题,即观察到的顺序依赖性是否源于经典状态变化,还是反映了更深层次的结构非交换性(non-commutativity)。其核心挑战在于区分经典隐变量模型与真正非交换性的元认知操作。解决方案的关键在于构建一个操作性框架,将元认知评估建模为作用于内部状态空间的状态变换操作,并通过概率读出分离评估的后作用(back-action)与可观测输出;进而引入“反事实确定性”和“评估非侵入性”两个假设,推导出可检验的成对序列相关性约束条件。若实验数据违反这些约束,则可排除任何经典的非侵入性解释,从而确证所谓的“真实非交换性”——这是一种在认知过程中无法用经典概率模型描述的结构性非交换行为。

链接: https://arxiv.org/abs/2604.04938
作者: Enso O. Torres Alegre,Diana E. Mora Jimenez
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 1 figure

点击查看摘要

Abstract:Metacognition, understood as the monitoring and regulation of one’s own cognitive processes, is inherently sequential: an agent evaluates an internal state, updates it, and may then re-evaluate under modified criteria. Order effects in cognition are well documented, yet it remains unclear whether such effects reflect classical state changes or reveal a deeper structural non-commutativity. We develop an operational framework that makes this distinction explicit. In our formulation, metacognitive evaluations are modeled as state-transforming operations acting on an internal state space with probabilistic readouts, thereby separating evaluation back-action from observable output. We show that order dependence prevents any faithful Boolean-commutative representation. We then address a stronger question: can observed order effects always be explained by enlarging the state space with classical latent variables? To formalize this issue, we introduce two assumptions, counterfactual definiteness and evaluation non-invasiveness, under which the existence of a joint distribution over all sequential readouts implies a family of testable constraints on pairwise sequential correlations. Violation of these constraints rules out any classical non-invasive account and certifies what we call genuine non-commutativity. We provide an explicit three-dimensional rotation model with fully worked numerical examples that exhibits such violations. We also outline a behavioral paradigm involving sequential confidence, error-likelihood, and feeling-of-knowing judgments following a perceptual decision, together with the corresponding empirical test. No claim is made regarding quantum physical substrates; the framework is purely operational and algebraic. 
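论文所说“显式三维旋转模型”的直觉可以这样感受:绕不同轴的旋转不可交换,因而两次“评估”的先后次序会改变末态与读出(玩具示意,并非论文的具体模型):

```python
import math

def rot_x(v, t):
    """绕 x 轴旋转角度 t(弧度)。"""
    x, y, z = v
    return (x, y * math.cos(t) - z * math.sin(t), y * math.sin(t) + z * math.cos(t))

def rot_y(v, t):
    """绕 y 轴旋转角度 t(弧度)。"""
    x, y, z = v
    return (x * math.cos(t) + z * math.sin(t), y, -x * math.sin(t) + z * math.cos(t))

def readout(v):
    """示意性的二值读出:以 y 分量是否超过阈值作为"评估"结果。"""
    return v[1] > 0.5

state = (1.0, 0.0, 0.0)
t = math.pi / 2
ab = rot_y(rot_x(state, t), t)  # 先"评估A"(绕x)再"评估B"(绕y)
ba = rot_x(rot_y(state, t), t)  # 次序颠倒
print(ab, ba)                   # 两个末态明显不同:算子不可交换
print(readout(ab), readout(ba)) # False True:读出结果依赖评估次序
```

经典可交换模型中两种次序的读出分布必须一致;旋转这类非交换算子则天然产生顺序效应,这正是论文用来构造可检验约束的机制。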

[AI-111] Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在系统性推理中缺乏可追溯证据支撑的问题,即“认识论鸿沟”(epistemic gap),这导致模型容易产生自信但无根据的错误结论。其解决方案的关键在于引入印度传统逻辑体系——Navya-Nyaya(新因明学),通过将其六阶段结构化推理流程(SAMSHAYA、PRAMANA、PANCHA AVAYAVA、TARKA、HETVABHASA、NIRNAYA)作为监督信号对LLM进行微调,从而赋予模型显式的认识论方法论。这种方法提供了标准链式思维提示所不具备的认知架构支持,在仅40%格式严格遵循的情况下仍实现100%语义正确性,表明模型能有效内化逻辑内容,为提升AI在需要可解释推理任务中的可靠性提供了新路径。

链接: https://arxiv.org/abs/2604.04937
作者: Sharath Sathish
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 52 pages + appendices, comprehensive treatment of Navya-Nyaya computational formalization

点击查看摘要

Abstract:Large language models produce fluent text but struggle with systematic reasoning, often hallucinating confident but unfounded claims. When Apple researchers added irrelevant context to mathematical problems, LLM performance degraded by 65% (Apple Machine Learning Research), exposing brittle pattern-matching beneath apparent reasoning. This epistemic gap, the inability to ground claims in traceable evidence, limits AI reliability in domains requiring justification. We introduce Pramana, a novel approach that teaches LLMs explicit epistemological methodology by fine-tuning on Navya-Nyaya logic, a 2,500-year-old Indian reasoning framework. Unlike generic chain-of-thought prompting, Navya-Nyaya enforces structured 6-phase reasoning: SAMSHAYA (doubt analysis), PRAMANA (evidence source identification), PANCHA AVAYAVA (5-member syllogism with universal rules), TARKA (counterfactual verification), HETVABHASA (fallacy detection), and NIRNAYA (ascertainment distinguishing knowledge from hypothesis). This integration of logic and epistemology provides cognitive scaffolding absent from standard reasoning approaches. We fine-tune Llama 3.2-3B and DeepSeek-R1-Distill-Llama-8B on 55 Nyaya-structured logical problems (constraint satisfaction, Boolean SAT, multi-step deduction). Stage 1 achieves 100% semantic correctness on held-out evaluation despite only 40% strict format adherence, revealing that models internalize reasoning content even when structural enforcement is imperfect. Ablation studies show format prompting and temperature critically affect performance, with optimal configurations differing by stage. We release all models, datasets, and training infrastructure on Hugging Face to enable further research on epistemic frameworks for AI reasoning.

[AI-112] Contextual Control without Memory Growth in a Context-Switching Task

【速读】:该论文旨在解决序列决策任务中如何有效实现上下文依赖性的问题,传统方法通常通过显式输入上下文信息或增加循环记忆来实现,但二者均存在局限:前者依赖额外的上下文标注,后者需扩大递归状态维度。论文提出了一种基于干预的递归架构(intervention-based recurrent architecture),其关键在于不直接向递归核心提供上下文输入,也不扩展递归维度,而是通过一个加性、上下文索引的算子对共享的预干预隐状态(pre-intervention latent state)进行干预,从而在固定维度下实现上下文控制。实验表明,该方案在部分可观测的上下文切换任务中表现优异,并且通过条件互信息(I(C;O | S))分析验证了其在固定隐状态下的正向上下文信息传递能力,证明干预机制是一种可行的替代递归记忆增长的上下文调控方式。

链接: https://arxiv.org/abs/2604.03479
作者: Song-Ju Kim
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
备注: 25 pages, 3 figures

点击查看摘要

Abstract:Context-dependent sequential decision making is commonly addressed either by providing context explicitly as an input or by increasing recurrent memory so that contextual information can be represented internally. We study a third alternative: realizing contextual dependence by intervening on a shared recurrent latent state, without enlarging recurrent dimensionality. To this end, we introduce an intervention-based recurrent architecture in which a recurrent core first constructs a shared pre-intervention latent state, and context then acts through an additive, context-indexed operator. We evaluate this idea on a context-switching sequential decision task under partial observability. We compare three model families: a label-assisted baseline with direct context access, a memory baseline with enlarged recurrent state, and the proposed intervention model, which uses no direct context input to the recurrent core and no memory growth. On the main benchmark, the intervention model performs strongly without additional recurrent dimensions. We also evaluate the models using the conditional mutual information I(C; O | S) as a theorem-motivated operational probe of contextual dependence at fixed latent state. For task-relevant phase-1 outcomes, the intervention model exhibits positive conditional contextual information. Together, these results suggest that intervention on a shared recurrent state provides a viable alternative to recurrent memory growth for contextual control in this setting.
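文中用作操作性探针的条件互信息 I(C;O|S) 可以用插入式(plug-in)估计直接从 (s, c, o) 样本计数得到(示意实现):

```python
import math
from collections import Counter

def conditional_mi(triples):
    """由 (s, c, o) 样本的经验计数估计条件互信息 I(C;O|S),单位为 nats。"""
    n = len(triples)
    n_s = Counter(s for s, _, _ in triples)
    n_sc = Counter((s, c) for s, c, _ in triples)
    n_so = Counter((s, o) for s, _, o in triples)
    n_sco = Counter(triples)
    mi = 0.0
    for (s, c, o), k in n_sco.items():
        mi += (k / n) * math.log(k * n_s[s] / (n_sc[(s, c)] * n_so[(s, o)]))
    return mi

# 情形1:给定 s 后 o 与 c 独立 -> I(C;O|S) = 0
indep = [(0, c, o) for c in (0, 1) for o in (0, 1)]
# 情形2:o 完全由 c 决定 -> I(C;O|S) = log 2
dep = [(0, 0, 0), (0, 1, 1)] * 2
print(round(conditional_mi(indep), 6))  # 0.0
print(round(conditional_mi(dep), 6))    # 0.693147
```

若在固定隐状态 s 下该量为正,说明上下文确实在读出中留下了隐状态之外的信息,这正是论文探针的用意。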

[AI-113] Shot-Based Quantum Encoding: A Data-Loading Paradigm for Quantum Neural Networks

【速读】:该论文旨在解决近期(near-term)量子机器学习中数据加载效率低下的瓶颈问题,现有编码方案(如角度编码、振幅编码和基编码)要么未能充分利用量子态空间的指数级容量,要么需要超出噪声中等规模量子(NISQ)硬件相干时间预算的电路深度。其解决方案的关键在于提出一种基于采样次数的量子编码方法(Shot-Based Quantum Encoding, SBQE),通过将硬件原生资源“shot”根据数据依赖的经典分布分配到多个初始量子态上,使混合态表示的期望值与经典概率呈线性关系,从而可与非线性激活函数组合;SBQE在结构上等价于一个权重由量子电路实现的多层感知机(Multilayer Perceptron, MLP),且无需任何数据编码门即可在实际硬件上高效执行,实验表明其在Semeion和Fashion MNIST数据集上的分类准确率显著优于传统振幅编码方法。

链接: https://arxiv.org/abs/2604.06135
作者: Basil Kyriacou,Viktoria Patapovich,Maniraman Periyasamy,Alexey Melnikov
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 2 figures, 0 tables

点击查看摘要

Abstract:Efficient data loading remains a bottleneck for near-term quantum machine-learning. Existing schemes (angle, amplitude, and basis encoding) either underuse the exponential Hilbert-space capacity or require circuit depths that exceed the coherence budgets of noisy intermediate-scale quantum hardware. We introduce Shot-Based Quantum Encoding (SBQE), a data embedding strategy that distributes the hardware’s native resource, shots, according to a data-dependent classical distribution over multiple initial quantum states. By treating the shot counts as a learnable degree of freedom, SBQE produces a mixed-state representation whose expectation values are linear in the classical probabilities and can therefore be composed with non-linear activation functions. We show that SBQE is structurally equivalent to a multilayer perceptron whose weights are realised by quantum circuits, and we describe a hardware-compatible implementation protocol. Benchmarks on Fashion MNIST and Semeion handwritten digits, with ten independent initialisations per model, show that SBQE achieves 89.1% +/- 0.9% test accuracy on Semeion (reducing error by 5.3% relative to amplitude encoding and matching a width-matched classical network) and 80.95% +/- 0.10% on Fashion MNIST (exceeding amplitude encoding by +2.0% and a linear multilayer perceptron by +1.3%), all without any data-encoding gates.
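“按数据相关的经典分布分配 shots、期望值对经典概率线性”这两点可以这样示意(最大余数分配法为笔者的假设实现,非论文协议):

```python
def allocate_shots(probs, total_shots):
    """按经典分布把整数 shots 分配到各初始态(最大余数法,保证总数守恒)。"""
    raw = [p * total_shots for p in probs]
    shots = [int(r) for r in raw]
    remainder = total_shots - sum(shots)
    # 剩余 shots 依小数部分从大到小补齐
    order = sorted(range(len(probs)), key=lambda i: raw[i] - shots[i], reverse=True)
    for i in order[:remainder]:
        shots[i] += 1
    return shots

def mixed_expectation(probs, expectations):
    """混合态期望对经典概率是线性的:<O> = sum_i p_i * <O>_i。"""
    return sum(p * e for p, e in zip(probs, expectations))

probs = [0.5, 0.3, 0.2]
print(allocate_shots(probs, 1000))  # [500, 300, 200]
print(round(mixed_expectation(probs, [1.0, -1.0, 0.0]), 6))  # 0.2
```

正因为期望值对 p 线性,后接经典非线性激活即可得到摘要所说“权重由量子电路实现的 MLP”结构。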

[AI-114] Multiscale Physics-Informed Neural Network for Complex Fluid Flows with Long-Range Dependencies

【速读】:该论文旨在解决复杂流体流动中多尺度动力学预测的挑战,尤其是针对收敛速度慢、数据需求量大以及解精度不足的问题。在存在远距离边界条件引起的长程空间依赖时,传统方法通常需要大量监督数据才能获得可靠结果。其解决方案的关键在于提出了一种领域分解与平移的物理信息神经网络(Domain-Decomposed and Shifted Physics-Informed Neural Network, DDS-PINN)框架:通过局部神经网络结合统一全局损失函数,在保持局部精度的同时有效捕捉全局依赖关系,从而显著降低对监督数据的依赖,并提升模型在多尺度流场建模中的鲁棒性和准确性。

链接: https://arxiv.org/abs/2604.05652
作者: Prashant Kumar,Rajesh Ranjan
机构: 未知
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 10 figures

点击查看摘要

Abstract:Fluid flows are governed by the nonlinear Navier-Stokes equations, which can manifest multiscale dynamics even from predictable initial conditions. Predicting such phenomena remains a formidable challenge in scientific machine learning, particularly regarding convergence speed, data requirements, and solution accuracy. In complex fluid flows, these challenges are exacerbated by long-range spatial dependencies arising from distant boundary conditions, which typically necessitate extensive supervision data to achieve acceptable results. We propose the Domain-Decomposed and Shifted Physics-Informed Neural Network (DDS-PINN), a framework designed to resolve such multiscale interactions with minimal supervision. By utilizing localized networks with a unified global loss, DDS-PINN captures global dependencies while maintaining local precision. The robustness of the approach is demonstrated across a suite of benchmarks, including a multiscale linear differential equation, the nonlinear Burgers’ equation, and data-free Navier-Stokes simulations of flat-plate boundary layers. Finally, DDS-PINN is applied to the computationally challenging backward-facing step (BFS) problem; for laminar regimes (Re = 100), the model yields results comparable to computational fluid dynamics (CFD) without the need for any data, accurately predicting boundary layer thickness, separation, and reattachment lengths. For turbulent BFS flow at Re = 10,000, the framework achieves convergence to O(10^-4) using only 500 random supervision points (0.3% of the total domain), outperforming established methods like Residual-based Attention-PINN in accuracy. This approach demonstrates strong potential for the super-resolution of complex turbulent flows from sparse experimental measurements.
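“局部网络 + 统一全局损失”的结构可以用一个一维玩具问题示意:两个子域各自贡献方程残差,全局损失再加上交界处的连续性罚项(纯演示,与论文实现无关):

```python
# 玩具示意:在 [0,1] 上拟合 u'(x)=1,两个子域各用一个线性模型 u_i(x)=a_i*x+b_i。
# 全局损失 = 各子域的方程残差 + 交界处的连续性罚项。

def global_loss(params, interface=0.5):
    (a1, b1), (a2, b2) = params
    residual = (a1 - 1.0) ** 2 + (a2 - 1.0) ** 2                  # u' = 1 的残差
    continuity = ((a1 * interface + b1) - (a2 * interface + b2)) ** 2
    return residual + continuity

exact = [(1.0, 0.0), (1.0, 0.0)]   # u(x) = x 在两个子域上的精确参数
broken = [(1.0, 0.0), (1.0, 0.3)]  # 子域2整体平移,交界处不连续
print(global_loss(exact))   # 0.0
print(global_loss(broken))  # ≈ 0.09:不连续被罚项捕捉
```

真实的 DDS-PINN 用神经网络代替线性模型、用自动微分计算残差,但“局部精度 + 全局一致性共同进入一个损失”的骨架与此一致。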

[AI-115] Learned Elevation Models as a Lightweight Alternative to LiDAR for Radio Environment Map Estimation

【速读】:该论文旨在解决下一代无线系统(如6G)中无线电环境地图(Radio Environment Map, REM)建模对高精度三维环境数据(如LiDAR点云)的高度依赖问题,这类数据获取成本高、存储体量大且在动态环境中易过时。解决方案的关键在于提出一个两阶段框架:第一阶段利用学习到的估计器直接从卫星RGB影像预测高程图(elevation maps),第二阶段将该高程图与天线参数共同输入REM估计器,从而在推理阶段无需任何3D数据即可实现更准确的REM建模,相较仅使用图像的基线方法,在RMSE指标上提升达7.8%,同时保持与现有CNN架构一致的输入特征空间,具备良好的可扩展性与实用性。

链接: https://arxiv.org/abs/2604.05520
作者: Ljupcho Milosheski,Fedja Močnik,Mihael Mohorčič,Carolina Fortuna
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures, 3 tables. Submitted to PIMRC 2026

点击查看摘要

Abstract:Next-generation wireless systems such as 6G operate at higher frequency bands, making signal propagation highly sensitive to environmental factors such as buildings and vegetation. Accurate Radio Environment Map (REM) estimation is therefore increasingly important for effective network planning and operation. Existing methods, from ray-tracing simulators to deep learning generative models, achieve promising results but require detailed 3D environment data such as LiDAR-derived point clouds, which are costly to acquire, several gigabytes per km^2 in size, and quickly outdated in dynamic environments. We propose a two-stage framework that eliminates the need for 3D data at inference time: in the first stage, a learned estimator predicts elevation maps directly from satellite RGB imagery, which are then fed alongside antenna parameters into the REM estimator in the second stage. Across existing CNN-based REM estimation architectures, the proposed approach improves RMSE by up to 7.8% over image-only baselines, while operating on the same input feature space and requiring no 3D data during inference, offering a practical alternative for scalable radio environment modelling.

[AI-116] LLM Evaluation as Tensor Completion: Low Rank Structure and Semiparametric Efficiency

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)评估中基于成对人类判断的数据存在噪声、稀疏性和非均匀采样问题,且现有排行榜缺乏充分的不确定性量化这一关键挑战。为此,作者将LLM评估建模为低秩潜在评分张量(latent score tensor)在Bradley-Terry-Luce型模型下的成对比较观测问题,从而将其置于一种具有结构化观测、非均匀采样和成对对比的新张量补全框架中。解决方案的关键在于:首先推导出低秩切空间上的信息算子(information operator)、高效影响函数(efficient influence function)及半参数效率边界;进而构造一个具有渐近正态性的一步去偏估计器(one-step debiased estimator)。核心创新在于提出“评分白化”(score-whitening)方法,通过均衡局部Fisher信息来克服信息算子各向异性且不与切空间投影交换所带来的瓶颈,实现最优样本复杂度下的稳定推断,为LLM评估提供了严谨的不确定性量化框架,并可推广至从成对数据中推断低秩结构的一般场景。

链接: https://arxiv.org/abs/2604.05460
作者: Jiachun Li,David Simchi-Levi,Will Wei Sun
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM) evaluation platforms increasingly rely on pairwise human judgments. These data are noisy, sparse, and non-uniform, yet leaderboards are reported with limited uncertainty quantification. We study this as semiparametric inference for a low-rank latent score tensor observed through pairwise comparisons under Bradley-Terry-Luce-type models. This places LLM evaluation in a new tensor completion setting with structured observations, non-uniform sampling, and pairwise contrasts. Our target is a smooth functional \psi(T^\star) , including linear estimands such as ability gaps and nonlinear ones such as win probabilities. We derive the information operator on the low-rank tangent space, the efficient influence function, and the semiparametric efficiency bound, then construct a one-step debiased estimator with asymptotic normality. A central challenge is that the information operator is anisotropic and does not commute with the tangent-space projection, creating a bottleneck absent from isotropic models. We introduce a score-whitening method that equalizes local Fisher information and restores stable inference at the optimal sample-complexity scale. Our results provide a principled framework for uncertainty quantification in LLM evaluation and more broadly for inference on low-rank structures from pairwise data.
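摘要中的观测模型为 Bradley-Terry-Luce(BTL)型成对比较:两个模型间的胜率是潜在评分之差的 logistic 函数。以下是该胜率模型的极简示意(仅为说明概念,函数名为假设,非论文实现):

```python
import math

def btl_win_prob(theta_i, theta_j):
    """Bradley-Terry-Luce: probability that model i beats model j,
    given latent ability scores theta_i and theta_j."""
    return 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))

p_ij = btl_win_prob(1.2, 0.7)  # the stronger model is favored
p_ji = btl_win_prob(0.7, 1.2)
# By construction p_ij + p_ji = 1, and equal abilities give 0.5.
```

论文的关注点正是在这种成对观测下,对低秩潜在评分张量的泛函(如能力差距、胜率)做有效推断并量化不确定性。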

[AI-117] Self-Supervised Foundation Model for Calcium-imaging Population Dynamics

【速读】:该论文旨在解决当前基于钙成像(calcium imaging)的神经活动分析中,现有方法普遍依赖特定任务训练、缺乏跨任务迁移能力的问题。为应对这一挑战,作者提出CalM——一种仅使用神经元钙信号进行自监督预训练的神经基础模型(neural foundation model),其关键在于设计了一个高性能的分词器(tokenizer),将单神经元钙信号映射到共享的离散词汇空间,并引入双轴自回归Transformer架构,同时建模神经元维度和时间维度上的依赖关系,从而实现对多种下游任务(如预测与解码)的有效适配与性能提升。

链接: https://arxiv.org/abs/2604.04958
作者: Xinhong Xu,Yimeng Zhang,Qichen Qian,Yuanlong Zhang
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Recent work suggests that large-scale, multi-animal modeling can significantly improve neural recording analysis. However, for functional calcium traces, existing approaches remain task-specific, limiting transfer across common neuroscience objectives. To address this challenge, we propose \textbf{CalM}, a self-supervised neural foundation model trained solely on neuronal calcium traces and adaptable to multiple downstream tasks, including forecasting and decoding. Our key contribution is a pretraining framework, composed of a high-performance tokenizer mapping single-neuron traces into a shared discrete vocabulary, and a dual-axis autoregressive transformer modeling dependencies along both the neural and the temporal axis. We evaluate CalM on a large-scale, multi-animal, multi-session dataset. On the neural population dynamics forecasting task, CalM outperforms strong specialized baselines after pretraining. With a task-specific head, CalM further adapts to the behavior decoding task and achieves superior results compared with supervised decoding models. Moreover, linear analyses of CalM representations reveal interpretable functional structures beyond predictive accuracy. Taken together, we propose a novel and effective self-supervised pretraining paradigm for foundation models based on calcium traces, paving the way for scalable pretraining and broad applications in functional neural analysis. Code will be released soon.

[AI-118] The Planetary Cost of AI Acceleration Part II: The 10th Planetary Boundary and the 6.5-Year Countdown

【速读】:该论文试图解决的问题是:随着生成式 AI(Generative AI)和大型语言模型(Large Language Model, LLM)代理的指数级扩展,人类社会正在经历从机器替代人力劳动向机器代行人类思维(thinking, reasoning, and intention)的根本性范式转变,而这种“思维外包”行为对地球热平衡系统带来不可忽视的热力学负担。研究表明,当前人类活动产生的废热已逼近维持生态稳定的临界阈值,若不采取结构性干预,即使在最理想的情景下,全球人为废热积累也将在6.5年内突破关键生态边界。解决方案的关键在于识别出控制全球热量耗散速率的六个相互作用因素,并提出将人工智能及其热耗散纳入行星边界体系,构成第10个行星边界(9+1),其核心指标为AI增长带来的净新增废热与其通过提升经济和社会效率所减少的人类废热之间的平衡关系。该研究强调,AI的规模管理不存在中间路径:要么加速突破热力学临界点,要么成为稳定其他行星边界及人类文明存续的最强杠杆。

链接: https://arxiv.org/abs/2604.04956
作者: William Yicheng Zhu,Lei Zhu
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Popular Physics (physics.pop-ph)
备注:

点击查看摘要

Abstract:The recent, super-exponential scaling of autonomous Large Language Model (LLM) agents signals a broader, fundamental paradigm shift from machines replacing the human hands (manual labor and mechanical processing) to machines delegating for the human minds (thinking, reasoning, and intention). This uncontrolled offloading and scaling of “thinking” itself has profound consequences for humanity’s heat balance sheet, since thinking, or intelligence, carries thermodynamic weight. The Earth has already surpassed the heat dissipation threshold required for long-term ecological stability, and projections based on empirical data reveal a concerning trajectory: without radical structural intervention, anthropogenic heat accumulation will breach critical planetary ecological thresholds in less than 6.5 years, even under the most ideal scenario where Earth Energy Imbalance (EEI) holds constant. In this work, we identify six interacting factors that govern the global heat dissipation rate and delineate how their interplay drives society toward one of four macroscopic trajectories: legacy, accelerationist, centrist, or restorative. We propose that the integration of artificial intelligence and its heat dissipation into the planetary system constitutes the 10th planetary boundary (9+1). The core measurement of this new boundary is the net-new waste heat generated by exponential AI growth, balanced against its impact on reducing economic and societal inefficiencies and thereby the baseline anthropogenic waste heat emissions. We demonstrate that managing AI scaling lacks a moderate middle ground: it will either accelerate the imminent breach of critical thermodynamic thresholds, or it will serve as the single most effective lever capable of stabilizing the other planetary boundaries and the survival of human civilization.

[AI-119] Contextuality as an External Bookkeeping Cost under Fixed Shared-State Semantics

【速读】:该论文旨在解决量子非定域性中“上下文性”(contextuality)的定量表征问题,即在经典模拟框架下,为保持共享内部描述不变时,需引入多少额外的外部信息成本才能复现量子观测统计。其解决方案的关键在于提出一种最小外部标签模拟模型(minimal external-label simulation model),其中上下文依赖仅由一个辅助标签携带,并定义了“障碍代价”(obstruction cost)作为该标签与上下文间最小互信息量。通过证明:任何线性判据若能将观测统计与零障碍集分离,则可给出该代价的保守下界,从而将上下文性转化为可量化、可计算的经典模拟成本指标。

链接: https://arxiv.org/abs/2601.20167
作者: Song-Ju Kim
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: 5 pages, 0 figure

点击查看摘要

Abstract:Contextuality is a central feature distinguishing quantum from classical probability theories, but its operational meaning is often stated only qualitatively. In this Letter, we study a simple information-theoretic question: how much additional contextual information must a classical simulation introduce when it tries to keep a shared internal description fixed across contexts? To make this question precise, we analyze a minimal external-label simulation model in which the remaining context dependence is carried only by an auxiliary label. For this model, we define an obstruction cost as the minimum mutual information between the context and the auxiliary label required to reproduce the observed statistics. We then prove a conservative quantitative lower bound: any linear witness that separates the observed statistics from the zero-obstruction set yields a positive lower bound on this cost. We do not claim that this bound is tight, and we do not claim that the simulation model covers every possible classical architecture. Its role is narrower and more explicit: under fixed shared-state semantics, contextuality can be read as a certificate of irreducible external bookkeeping cost in a simple and well-defined simulation model.
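摘要中的"障碍代价"被定义为上下文与辅助标签之间的最小互信息。下面给出互信息 I(C;L) 的一个玩具计算示意(仅用于说明该度量本身,与论文的具体构造和下界证明无关):

```python
import numpy as np

def mutual_information(joint):
    """I(C;L) in bits, from a joint pmf over (context, auxiliary label)."""
    joint = np.asarray(joint, dtype=float)
    pc = joint.sum(axis=1, keepdims=True)   # marginal over contexts
    pl = joint.sum(axis=0, keepdims=True)   # marginal over labels
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pc @ pl)[mask])))

# Label independent of context: zero bookkeeping cost in this toy reading.
indep = np.outer([0.5, 0.5], [0.25, 0.75])
# Label fully determined by context: cost equals H(C) = 1 bit.
corr = np.array([[0.5, 0.0], [0.0, 0.5]])
```

直观上,若经典模拟无需任何随上下文变化的标签(二者独立),外部记账代价为零;标签与上下文完全相关时,代价达到上下文熵。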

机器学习

[LG-0] Topological Characterization of Churn Flow and Unsupervised Correction to the Wu Flow-Regime Map in Small-Diameter Vertical Pipes

链接: https://arxiv.org/abs/2604.06167
作者: Brady Koenig,Sushovan Majhi,Atish Mitra,Abigail Stein,Burt Todd
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注:

点击查看摘要

Abstract:Churn flow, the chaotic, oscillatory regime in vertical two-phase flow, has lacked a quantitative mathematical definition for over 40 years. We introduce the first topology-based characterization using Euler Characteristic Surfaces (ECS). We formulate unsupervised regime discovery as Multiple Kernel Learning (MKL), blending two complementary ECS-derived kernels, temporal alignment (L^1 distance on the \chi(s,t) surface) and amplitude statistics (scale-wise mean, standard deviation, max, min), with gas velocity. Applied to 37 unlabeled air-water trials from Montana Tech, the self-calibrating framework learns weights \beta_{ECS}=0.14, \beta_{amp}=0.50, \beta_{ugs}=0.36, placing 64% of total weight on topology-derived features (\beta_{ECS} + \beta_{amp}). The ECS-inferred slug/churn transition lies +3.81 m/s above Wu et al.'s (2017) prediction in 2-in. tubing, quantifying reports that existing models under-predict slug persistence in small-diameter pipes where interfacial tension and wall-to-wall interactions dominate flow. Cross-facility validation on 947 Texas A&M University images confirms 1.9\times higher topological complexity in churn vs. slug (p < 10^{-5}). Applied to 45 TAMU pseudo-trials, the same unsupervised framework achieves 95.6% 4-class accuracy and 100% churn recall, without any labeled training data, matching or exceeding supervised baselines that require thousands of annotated examples. This work provides the first mathematical definition of churn flow and demonstrates that unsupervised topological descriptors can challenge and correct widely adopted mechanistic models.
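摘要中的 Euler Characteristic Surface 以欧拉示性数随滤值变化来刻画拓扑复杂度。对一维信号的下水平集,χ(s) 就是连通分量(极大区间)的个数;下面是这一思想的一维简化示意(非论文的二维 ECS 构造):

```python
import numpy as np

def euler_curve(x, thresholds):
    """Euler characteristic of the sublevel sets {t : x(t) <= s}.
    For a 1-D signal, chi(s) equals the number of maximal runs of
    True (connected components) in the mask x <= s."""
    x = np.asarray(x)
    chis = []
    for s in thresholds:
        mask = x <= s
        padded = np.concatenate(([False], mask))
        # A component starts wherever the mask turns False -> True.
        chis.append(int(np.sum(padded[1:] & ~padded[:-1])))
    return chis

signal = [0, 3, 1, 4, 1, 5]
chi = euler_curve(signal, [-1, 2, 5])  # -> [0, 3, 1]
```

振荡越剧烈的信号在中间阈值处产生越多分量,这与摘要中"churn 流拓扑复杂度更高"的度量直觉一致。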

[LG-1] Target Policy Optimization

链接: https://arxiv.org/abs/2604.06159
作者: Jean Kaddour
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In RL, given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy-gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce \emph{Target Policy Optimization} (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution q_i \propto p_i^{\mathrm{old}} \exp(u_i) and fits the policy to it by cross-entropy. The loss gradient on sampled-completion logits is p^\theta - q, which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at this https URL.
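TPO 的核心是先由打分构造目标分布 q_i ∝ p_i^old·exp(u_i),再用交叉熵把策略拟合到该目标,梯度为 p^θ − q,策略与目标一致时梯度消失。以下为该目标分布与梯度的数值示意(打分数值为假设,非官方实现):

```python
import numpy as np

def tpo_target(p_old, u):
    """Target distribution q_i ∝ p_i^old * exp(u_i) over a sampled group."""
    w = np.asarray(p_old, dtype=float) * np.exp(np.asarray(u, dtype=float))
    return w / w.sum()

p_old = np.array([0.5, 0.3, 0.2])  # old-policy probabilities of 3 completions
u = np.array([1.0, 0.0, -1.0])     # completion scores (hypothetical values)
q = tpo_target(p_old, u)
# Cross-entropy gradient on the sampled-completion logits; at sampling
# time the current policy equals p_old, so the gradient is p_old - q.
grad = p_old - q
```

当所有打分相同(u 为常数)时 q 退化为 p_old,梯度恰好为零,对应摘要所述的"策略匹配目标后更新停止"。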

[LG-2] Learning \mathsf{AC}^0 Under Graphical Models

链接: https://arxiv.org/abs/2604.06109
作者: Gautam Chandrasekaran,Jason Gaitonde,Ankur Moitra,Arsen Vasilyan
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: 57 pages

点击查看摘要

Abstract:In a landmark result, Linial, Mansour and Nisan (J. ACM 1993) gave a quasipolynomial-time algorithm for learning constant-depth circuits given labeled i.i.d. samples under the uniform distribution. Their work has had a deep and lasting legacy in computational learning theory, in particular introducing the \textit{low-degree algorithm}. However, an important critique of many results and techniques in the area is the reliance on product structure, which is unlikely to hold in realistic settings. Obtaining similar learning guarantees for more natural correlated distributions has been a longstanding challenge in the field. In particular, we give quasipolynomial-time algorithms for learning \mathsf{AC}^0 substantially beyond the product setting, when the inputs come from any graphical model with polynomial growth that exhibits strong spatial mixing. The main technical challenge is in giving a workaround to Fourier analysis, which we do by showing how new sampling algorithms allow us to transfer statements about low-degree polynomial approximation under the uniform setting to graphical models. Our approach is general enough to extend to other well-studied function classes, like monotone functions and halfspaces.

[LG-3] eVTOL Aircraft Energy Overhead Estimation under Conflict Resolution in High-Density Airspaces

链接: https://arxiv.org/abs/2604.06093
作者: Alex Zongo,Peng Wei
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Accepted for presentation at the Integrated Communications, Navigation and Surveillance Conference (ICNS) 2026

点击查看摘要

Abstract:Electric vertical takeoff and landing (eVTOL) aircraft operating in high-density urban airspace must maintain safe separation through tactical conflict resolution, yet the energy cost of such maneuvers has not been systematically quantified. This paper investigates how conflict-resolution maneuvers under the Modified Voltage Potential (MVP) algorithm affect eVTOL energy consumption. Using a physics-based power model integrated within a traffic simulation, we analyze approximately 71,767 en route sections within a sector, across traffic densities of 10-60 simultaneous aircraft. The main finding is that MVP-based deconfliction is energy-efficient: median energy overhead remains below 1.5% across all density levels, and the majority of en route flights within the sector incur negligible penalty. However, the distribution exhibits pronounced right-skewness, with tail cases reaching 44% overhead at the highest densities due to sustained multi-aircraft conflicts. The 95th percentile ranges from 3.84% to 5.3%, suggesting that a 4-5% reserve margin accommodates the vast majority of tactical deconfliction scenarios. To support operational planning, we develop a machine learning model that estimates energy overhead at mission initiation. Because conflict outcomes depend on future traffic interactions that cannot be known in advance, the model provides both point estimates and uncertainty bounds. These bounds are conservative; actual outcomes fall within the predicted range more often than the stated confidence level, making them suitable for safety-critical reserve planning. Together, these results validate MVP’s suitability for energy-constrained eVTOL operations and provide quantitative guidance for reserve energy determination in Advanced Air Mobility.

[LG-4] A machine learning framework for uncovering stochastic nonlinear dynamics from noisy data

链接: https://arxiv.org/abs/2604.06081
作者: Matteo Bosso,Giovanni Franzese,Kushal Swamy,Maarten Theulings,Alejandro M. Aragón,Farbod Alijani
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Dynamical Systems (math.DS)
*备注: 25 pages, 12 figures, 4 tables

点击查看摘要

Abstract:Modeling real-world systems requires accounting for noise - whether it arises from unpredictable fluctuations in financial markets, irregular rhythms in biological systems, or environmental variability in ecosystems. While the behavior of such systems can often be described by stochastic differential equations, a central challenge is understanding how noise influences the inference of system parameters and dynamics from data. Traditional symbolic regression methods can uncover governing equations but typically ignore uncertainty. Conversely, Gaussian processes provide principled uncertainty quantification but offer little insight into the underlying dynamics. In this work, we bridge this gap with a hybrid symbolic regression-probabilistic machine learning framework that recovers the symbolic form of the governing equations while simultaneously inferring uncertainty in the system parameters. The framework combines deep symbolic regression with Gaussian process-based maximum likelihood estimation to separately model the deterministic dynamics and the noise structure, without requiring prior assumptions about their functional forms. We verify the approach on numerical benchmarks, including harmonic, Duffing, and van der Pol oscillators, and validate it on an experimental system of coupled biological oscillators exhibiting synchronization, where the algorithm successfully identifies both the symbolic and stochastic components. The framework is data-efficient, requiring as few as 100-1000 data points, and robust to noise - demonstrating its broad potential in domains where uncertainty is intrinsic and both the structure and variability of dynamical systems must be understood.

[LG-5] PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space

链接: https://arxiv.org/abs/2604.06061
作者: Asaf Buchnick,Aviv Shamsian,Aviv Navon,Ethan Fetaya
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Text-to-image generation has progressed rapidly, but faithfully generating complex scenes requires extensive trial-and-error to find the exact prompt. In the prompt inversion task, the goal is to recover a textual prompt that can faithfully reconstruct a given target image. Currently, existing methods frequently yield suboptimal reconstructions and produce unnatural, hard-to-interpret prompts that hinder transparency and controllability. In this work, we present PromptEvolver, a prompt inversion approach that generates natural-language prompts while achieving high-fidelity reconstructions of the target image. Our method uses a genetic algorithm to optimize the prompt, leveraging a strong vision-language model to guide the evolution process. Importantly, it works on black-box generation models by requiring only image outputs. Finally, we evaluate PromptEvolver across multiple prompt inversion benchmarks and show that it consistently outperforms competing methods.
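摘要中的遗传算法在自然语言提示空间中循环"评分、选择、变异"。下面用"与目标字符串的逐位匹配度"代替论文中视觉-语言模型的打分,给出一个玩具版演化循环(适应度函数、种群规模等均为示意,非论文实现):

```python
import random

def evolve(target, pool_size=20, generations=300, seed=0):
    """Toy genetic loop: candidate 'prompts' are strings scored by
    similarity to a target; this stands in for the vision-language-model
    fitness used in the paper."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    def fitness(s):
        return sum(a == b for a, b in zip(s, target))
    pop = ["".join(rng.choice(alphabet) for _ in target) for _ in range(pool_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pool_size // 2]          # selection: keep the top half
        children = []
        for p in parents:                        # mutation: change one character
            i = rng.randrange(len(target))
            children.append(p[:i] + rng.choice(alphabet) + p[i + 1:])
        pop = parents + children                 # elitism: parents survive
    return max(pop, key=fitness)

best = evolve("red car")
```

由于只依赖"生成候选并打分"这一接口,这类搜索天然适用于摘要强调的黑盒生成模型场景。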

[LG-6] Gated-SwinRMT: Unifying Swin Windowed Attention with Retentive Manhattan Decay via Input-Dependent Gating

链接: https://arxiv.org/abs/2604.06014
作者: Dipan Maity,Suman Mondal,Arindam Roy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Gated-SwinRMT, a family of hybrid vision transformers that combine the shifted-window attention of the Swin Transformer with the Manhattan-distance spatial decay of Retentive Networks (RMT), augmented by input-dependent gating. Self-attention is decomposed into consecutive width-wise and height-wise retention passes within each shifted window, where per-head exponential decay masks provide a two-dimensional locality prior without learned positional biases. Two variants are proposed. \textbf{Gated-SwinRMT-SWAT} substitutes softmax with sigmoid activation, implements balanced ALiBi slopes with multiplicative post-activation spatial decay, and gates the value projection via SwiGLU; the normalized output implicitly suppresses uninformative attention scores. \textbf{Gated-SwinRMT-Retention} retains softmax-normalized retention with an additive log-space decay bias and incorporates an explicit G1 sigmoid gate, projected from the block input and applied after local context enhancement (LCE) but prior to the output projection W_O, to alleviate the low-rank W_V \cdot W_O bottleneck and enable input-dependent suppression of attended outputs. We assess both variants on Mini-ImageNet (224\times224, 100 classes) and CIFAR-10 (32\times32, 10 classes) under identical training protocols, utilizing a single GPU due to resource limitations. At \approx 77-79M parameters, Gated-SwinRMT-SWAT achieves 80.22% and Gated-SwinRMT-Retention 78.20% top-1 test accuracy on Mini-ImageNet, compared with 73.74% for the RMT baseline. On CIFAR-10, where small feature maps cause the adaptive windowing mechanism to collapse attention to global scope, the accuracy advantage compresses from +6.48 pp to +0.56 pp.
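摘要中的曼哈顿距离空间衰减可以理解为:窗口内两位置之间的注意力按 gamma 的曼哈顿距离次幂衰减。以下为该衰减掩码的极简构造示意(gamma 取值为假设,仅说明掩码形状与性质):

```python
import numpy as np

def manhattan_decay_mask(w, gamma=0.9):
    """Decay mask for a w x w window: attention between positions p and q
    is damped by gamma ** ManhattanDistance(p, q)."""
    coords = np.array([(i, j) for i in range(w) for j in range(w)])
    d = np.abs(coords[:, None, :] - coords[None, :, :]).sum(-1)  # |Δrow| + |Δcol|
    return gamma ** d

m = manhattan_decay_mask(3)  # 9 x 9 mask for a 3 x 3 window
```

掩码在对角线上为 1、随距离指数衰减且对称,正是摘要所说"无需学习位置偏置的二维局部性先验"。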

[LG-7] Data Distribution Valuation Using Generalized Bayesian Inference AISTATS2026

链接: https://arxiv.org/abs/2604.05993
作者: Cuong N. Nguyen,Cuong V. Nguyen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Paper published at AISTATS 2026

点击查看摘要

Abstract:We investigate the data distribution valuation problem, which aims to quantify the values of data distributions from their samples. This is a recently proposed problem that is related to but different from classical data valuation and can be applied to various applications. For this problem, we develop a novel framework called Generalized Bayes Valuation that utilizes generalized Bayesian inference with a loss constructed from transferability measures. This framework allows us to solve, in a unified way, seemingly unrelated practical problems, such as annotator evaluation and data augmentation. Using the Bayesian principles, we further improve and enhance the applicability of our framework by extending it to the continuous data stream setting. Our experiment results confirm the effectiveness and efficiency of our framework in different real-world scenarios.

[LG-8] On Dominant Manifolds in Reservoir Computing Networks

链接: https://arxiv.org/abs/2604.05967
作者: Noa Kaplan,Alberto Padoan,Anastasia Bizyaeva
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC)
*备注: 6 pages, 3 figures

点击查看摘要

Abstract:Understanding how training shapes the geometry of recurrent network dynamics is a central problem in time-series modeling. We study the emergence of low-dimensional dominant manifolds in the training of Reservoir Computing (RC) networks for temporal forecasting tasks. For a simplified linear and continuous-time reservoir model, we link the dimensionality and structure of the dominant modes directly to the intrinsic dimensionality and information content of the training data. In particular, for training data generated by an autonomous dynamical system, we relate the dominant modes of the trained reservoir to approximations of the Koopman eigenfunctions of the original system, illuminating an explicit connection between reservoir computing and the Dynamic Mode Decomposition algorithm. We illustrate the eigenvalue motion that generates the dominant manifolds during training in simulation, and discuss generalization to nonlinear RC via tangent dynamics and differential p-dominance.
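摘要指出线性储备池的主导模态与 DMD 存在显式联系:DMD 从快照对 (X, Y) 回归出最优线性算子 A = Y X⁺,其特征值近似 Koopman 特征值。以下在一个已知线性系统上给出最小示意(此时 DMD 可精确恢复动力学矩阵):

```python
import numpy as np

# A linear system with known dynamics; DMD should recover it exactly.
A_true = np.array([[0.9, -0.2],
                   [0.1,  0.8]])
x = np.array([1.0, 0.0])
snapshots = [x]
for _ in range(20):
    x = A_true @ x
    snapshots.append(x)

X = np.column_stack(snapshots[:-1])   # states at time t
Y = np.column_stack(snapshots[1:])    # states at time t + 1
A_dmd = Y @ np.linalg.pinv(X)         # best-fit linear operator A = Y X^+
dmd_eigs = np.linalg.eigvals(A_dmd)   # approximate Koopman eigenvalues
```

对非线性数据,A_dmd 的特征值给出 Koopman 特征值的近似,对应摘要中训练后储备池主导模态的谱解释。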

[LG-9] QiMeng-PRepair: Precise Code Repair via Edit-Aware Reward Optimization ACL2026

链接: https://arxiv.org/abs/2604.05963
作者: Changxin Ke,Rui Zhang,Jiaming Guo,Yuanbo Wen,Li Ding,Shuo Wang,Xuyuan Zhu,Xiong Peng,Di Huang,Zidong Du,Xing Hu,Qi Guo,Yunji Chen
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Accepted to ACL 2026 main conference

点击查看摘要

Abstract:Large Language Models (LLMs) achieve strong program repair performance but often suffer from over-editing, where excessive modifications overwrite correct code and hinder bug localization. We systematically quantify its impact and introduce the precise repair task, which maximizes reuse of correct code while fixing only buggy parts. Building on this insight, we propose PRepair, a framework that mitigates over-editing and improves repair accuracy. PRepair has two components: Self-Breaking, which generates diverse buggy programs via controlled bug injection and min-max sampling, and Self-Repairing, which trains models with Edit-Aware Group Relative Policy Optimization (EA-GRPO) using an edit-aware reward to encourage minimal yet correct edits. Experiments show that PRepair improves repair precision by up to 31.4% under \mathrm{fix}_1@1, a metric that jointly considers repair correctness and extent, and significantly increases decoding throughput when combined with speculative editing, demonstrating its potential for precise and practical code repair.
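EA-GRPO 的奖励同时考虑修复正确性与编辑幅度。下面用 difflib 的序列相似度构造一个玩具版"编辑感知奖励"(λ 与具体函数形式均为示意,非论文中的奖励定义):

```python
import difflib

def edit_aware_reward(candidate, buggy, passed, lam=0.5):
    """Toy edit-aware reward: +1 for passing tests, minus a penalty
    proportional to how much of the buggy program was rewritten."""
    ratio = difflib.SequenceMatcher(None, buggy, candidate).ratio()
    edit_frac = 1.0 - ratio            # 0 = identical, 1 = fully rewritten
    return (1.0 if passed else 0.0) - lam * edit_frac

buggy   = "def add(a, b):\n    return a - b\n"
minimal = "def add(a, b):\n    return a + b\n"                # one-character fix
rewrite = "def sum2(x, y):\n    z = x + y\n    return z\n"    # over-edited fix
```

在两个候选都通过测试时,最小编辑的修复获得更高奖励,这正是摘要所述"鼓励最小而正确的编辑"的直觉。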

[LG-10] A Mixture of Experts Foundation Model for Scanning Electron Microscopy Image Analysis

链接: https://arxiv.org/abs/2604.05960
作者: Sk Miraj Ahmed,Yuewei Lin,Chuntian Cao,Shinjae Yoo,Xinpei Wu,Won-Il Lee,Nikhil Tiwale,Dan N. Le,Thi Thu Huong Chu,Jiyoung Kim,Kevin G. Yager,Chang-Yong Nam
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Scanning Electron Microscopy (SEM) is indispensable in modern materials science, enabling high-resolution imaging across a wide range of structural, chemical, and functional investigations. However, SEM imaging remains constrained by task-specific models and labor-intensive acquisition processes that limit its scalability across diverse applications. Here, we introduce the first foundation model for SEM images, pretrained on a large corpus of multi-instrument, multi-condition scientific micrographs, enabling generalization across diverse material systems and imaging conditions. Leveraging a self-supervised transformer architecture, our model learns rich and transferable representations that can be fine-tuned or adapted to a wide range of downstream tasks. As a compelling demonstration, we focus on defocus-to-focus image translation-an essential yet underexplored challenge in automated microscopy pipelines. Our method not only restores focused detail from defocused inputs without paired supervision but also outperforms state-of-the-art techniques across multiple evaluation metrics. This work lays the groundwork for a new class of adaptable SEM models, accelerating materials discovery by bridging foundational representation learning with real-world imaging needs.

[LG-11] Transfer Learning for Neural Parameter Estimation applied to Building RC Models

链接: https://arxiv.org/abs/2604.05904
作者: Fabian Raisch,Timo Germann,J. Nathan Kutz,Christoph Goebel,Benjamin Tischler
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Parameter estimation for dynamical systems remains challenging due to non-convexity and sensitivity to initial parameter guesses. Recent deep learning approaches enable accurate and fast parameter estimation but do not exploit transferable knowledge across systems. To address this, we introduce a transfer-learning-based neural parameter estimation framework based on a pretraining-fine-tuning paradigm. This approach improves accuracy and eliminates the need for an initial parameter guess. We apply this framework to building RC thermal models, evaluating it against a Genetic Algorithm and a from-scratch neural baseline across eight simulated buildings, one real-world building, two RC model configurations, and four training data lengths. Results demonstrate an 18.6-24.0% performance improvement with only 12 days of training data and up to 49.4% with 72 days. Beyond buildings, the proposed method represents a new paradigm for parameter estimation in dynamical systems.

[LG-12] A Tensor-Train Framework for Bayesian Inference in High-Dimensional Systems: Applications to MIMO Detection and Channel Decoding

链接: https://arxiv.org/abs/2604.05890
作者: Luca Schmid,Dominik Sulz,Shrinivas Chimmalgi,Laurent Schmalen
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bayesian inference in high-dimensional discrete-input additive noise models is a fundamental challenge in communication systems, as the support of the required joint a posteriori probability (APP) mass function grows exponentially with the number of unknown variables. In this work, we propose a tensor-train (TT) framework for tractable, near-optimal Bayesian inference in discrete-input additive noise models. The central insight is that the joint log-APP mass function admits an exact low-rank representation in the TT format, enabling compact storage and efficient computations. To recover symbol-wise APP marginals, we develop a practical inference procedure that approximates the exponential of the log-posterior using a TT-cross algorithm initialized with a truncated Taylor-series. To demonstrate the generality of the approach, we derive explicit low-rank TT constructions for two canonical communication problems: the linear observation model under additive white Gaussian noise (AWGN), applied to multiple-input multiple-output (MIMO) detection, and soft-decision decoding of binary linear block error correcting codes over the binary-input AWGN channel. Numerical results show near-optimal error-rate performance across a wide range of signal-to-noise ratios while requiring only modest TT ranks. These results highlight the potential of tensor-network methods for efficient Bayesian inference in communication systems.
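摘要的关键在于联合 log-APP 在 tensor-train(TT)格式下有精确低秩表示。下面以一个秩一的三阶张量为例,示意如何用逐次 reshape + SVD 得到 TT 核并精确重构(通用 TT-SVD 思路的玩具演示,与论文的显式构造无关):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = rng.standard_normal(4), rng.standard_normal(5), rng.standard_normal(6)
T = np.einsum("i,j,k->ijk", a, b, c)   # rank-1 tensor: all TT ranks equal 1

# Core 1 from the first unfolding (4, 5*6).
U1, s1, V1 = np.linalg.svd(T.reshape(4, -1), full_matrices=False)
r1 = int(np.sum(s1 > 1e-10))           # numerical TT rank
G1 = U1[:, :r1]                        # shape (4, r1)

# Cores 2 and 3 from the remainder.
M = (np.diag(s1[:r1]) @ V1[:r1]).reshape(r1 * 5, 6)
U2, s2, V2 = np.linalg.svd(M, full_matrices=False)
r2 = int(np.sum(s2 > 1e-10))
G2 = U2[:, :r2].reshape(r1, 5, r2)     # shape (r1, 5, r2)
G3 = np.diag(s2[:r2]) @ V2[:r2]        # shape (r2, 6)

# T[i,j,k] = sum_{a,b} G1[i,a] G2[a,j,b] G3[b,k]
T_rec = np.einsum("ia,ajb,bk->ijk", G1, G2, G3)
```

TT 格式的意义在于:指数规模的联合分布只需存储若干小核,摘要中的边际化与 TT-cross 近似都在这种核表示上进行。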

[LG-13] Weight-Informed Self-Explaining Clustering for Mixed-Type Tabular Data

链接: https://arxiv.org/abs/2604.05857
作者: Lehao Li,Qiang Huang,Yihao Ang,Bryan Kian Hsiang Low,Anthony K. H. Tung,Xiaokui Xiao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Clustering mixed-type tabular data is fundamental for exploratory analysis, yet remains challenging due to misaligned numerical-categorical representations, uneven and context-dependent feature relevance, and disconnected and post-hoc explanation from the clustering process. We propose WISE, a Weight-Informed Self-Explaining framework that unifies representation, feature weighting, clustering, and interpretation in a fully unsupervised and transparent pipeline. WISE introduces Binary Encoding with Padding (BEP) to align heterogeneous features in a unified sparse space, a Leave-One-Feature-Out (LOFO) strategy to sense multiple high-quality and diverse feature-weighting views, and a two-stage weight-aware clustering procedure to aggregate alternative semantic partitions. To ensure intrinsic interpretability, we further develop Discriminative FreqItems (DFI), which yields feature-level explanations that are consistent from instances to clusters with an additive decomposition guarantee. Extensive experiments on six real-world datasets demonstrate that WISE consistently outperforms classical and neural baselines in clustering quality while remaining efficient, and produces faithful, human-interpretable explanations grounded in the same primitives that drive clustering.

[LG-14] JD-BP: A Joint-Decision Generative Framework for Auto-Bidding and Pricing

链接: https://arxiv.org/abs/2604.05845
作者: Linghui Meng,Chun Gan,Shengsheng Niu,Chengcheng Zhang,Chenchen Li,Chuan Yang,Yi Mao,Xin Zhu,Jie He,Zhangang Lin,Ching Law
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: 10 pages, 2 figures

点击查看摘要

Abstract:Auto-bidding services optimize real-time bidding strategies for advertisers under key performance indicator (KPI) constraints such as target return on investment and budget. However, uncertainties such as model prediction errors and feedback latency can cause bidding strategies to deviate from ex-post optimality, leading to inefficient allocation. To address this issue, we propose JD-BP, a Joint generative Decision framework for Bidding and Pricing. Unlike prior methods, JD-BP jointly outputs a bid value and a pricing correction term that acts additively on top of the payment rule (e.g., GSP). To mitigate adverse effects of historical constraint violations, we design a memory-less Return-to-Go that encourages maximizing the future value of bidding actions, while the accumulated bias is handled by the pricing correction. Moreover, a trajectory augmentation algorithm is proposed to generate joint bidding-pricing trajectories from a (possibly arbitrary) base bidding policy, enabling efficient plug-and-play deployment of our algorithm on top of existing RL/generative bidding models. Finally, we employ an Energy-Based Direct Preference Optimization method in conjunction with a cross-attention module to enhance the joint learning performance of bidding and pricing correction. Offline experiments on the AuctionNet dataset demonstrate that JD-BP achieves state-of-the-art performance. Online A/B tests at this http URL confirm its practical effectiveness, showing a 4.70% increase in ad revenue and a 6.48% improvement in target cost.

[LG-15] Modeling Patient Care Trajectories with Transformer Hawkes Processes

链接: https://arxiv.org/abs/2604.05844
作者: Saumya Pandey,Varun Chandola
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Patient healthcare utilization consists of irregularly time-stamped events, such as outpatient visits, inpatient admissions, and emergency encounters, forming individualized care trajectories. Modeling these trajectories is crucial for understanding utilization patterns and predicting future care needs, but is challenging due to temporal irregularity and severe class imbalance. In this work, we build on the Transformer Hawkes Process framework to model patient trajectories in continuous time. By combining Transformer-based history encoding with Hawkes process dynamics, the model captures event dependencies and jointly predicts event type and time-to-event. To address extreme imbalance, we introduce an imbalance-aware training strategy using inverse square-root class weighting. This improves sensitivity to rare but clinically important events without altering the data distribution. Experiments on real-world data demonstrate improved performance and provide clinically meaningful insights for identifying high-risk patient populations.
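The imbalance-aware strategy reduces to weighting each event class by the inverse square root of its frequency. A minimal sketch (the normalization to mean one is an assumption, not necessarily the paper's exact choice):

```python
import math
from collections import Counter

def inv_sqrt_class_weights(labels):
    """Weight each event class by 1/sqrt(count), normalized so the
    average weight is one; rare event types get larger loss weights
    without resampling the data."""
    counts = Counter(labels)
    raw = {c: 1.0 / math.sqrt(n) for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

# A 9:1 imbalance yields a sqrt(9) = 3x weight ratio in the loss.
weights = inv_sqrt_class_weights(["outpatient"] * 900 + ["emergency"] * 100)
```

Square-root weighting is deliberately milder than inverse-frequency weighting, which helps avoid over-emphasizing very rare events.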

[LG-16] Expectation Maximization (EM) Converges for General Agnostic Mixtures

链接: https://arxiv.org/abs/2604.05842
作者: Avishek Ghosh
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注: Accepted at IEEE International Symposium on Information Theory (ISIT 2026)

点击查看摘要

Abstract:Mixture of linear regression is well studied in statistics and machine learning, where the data points are generated probabilistically using k linear models. Algorithms like Expectation Maximization (EM) may be used to recover the ground truth regressors for this problem. Recently, the mixed linear regression problem was studied in the agnostic setting [pal2022learning, ghosh_agnostic], where no generative model on the data is assumed. Rather, given a set of data points, the objective is to fit k lines by minimizing a suitable loss function. It is shown that a modification of EM, namely gradient EM, converges exponentially to an appropriately defined loss minimizer even in the agnostic setting. In this paper, we study the problem of fitting k parametric functions to a given set of data points. We adhere to the agnostic setup. However, instead of fitting lines equipped with quadratic loss, we consider arbitrary parametric function fitting equipped with a strongly convex and smooth loss. This framework encompasses a large class of problems, including mixed linear regression (regularized), mixed linear classifiers (mixed logistic regression, mixed Support Vector Machines), and mixed generalized linear regression. We propose and analyze gradient EM for this problem and show that, with proper initialization and a separation condition, the iterates of gradient EM converge exponentially to appropriately defined population loss minimizers with high probability. This shows the effectiveness of EM-type algorithms, which converge to an optimal solution in the non-generative setup beyond mixtures of linear regression.
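As a toy instance of the agnostic setup, one gradient-EM iteration for fitting k = 2 lines under quadratic loss alternates a soft E-step with a single gradient M-step. The unit soft-assignment temperature, step size, and warm initialization below are illustrative assumptions, not the paper's choices:

```python
import numpy as np

def gradient_em_step(X, y, thetas, lr=0.5):
    """One gradient-EM step for fitting k lines with quadratic loss.
    E-step: soft assignments from per-component squared residuals
    (unit temperature, an illustrative choice). M-step: one gradient
    step on each component's weighted loss."""
    resid = np.stack([(y - X @ th) ** 2 for th in thetas])  # (k, n)
    w = np.exp(-resid)
    w /= w.sum(axis=0, keepdims=True)
    new_thetas = []
    for th, wk in zip(thetas, w):
        grad = -2.0 * X.T @ (wk * (y - X @ th)) / len(y)
        new_thetas.append(th - lr * grad)
    return new_thetas

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true = [np.array([2.0, -1.0]), np.array([-2.0, 1.0])]
labels = rng.integers(0, 2, size=200)
y = np.where(labels == 0, X @ true[0], X @ true[1])
thetas = [np.array([1.5, -0.5]), np.array([-1.5, 0.5])]  # warm init
for _ in range(200):
    thetas = gradient_em_step(X, y, thetas)
```

With well-separated components and initialization inside their basins, the iterates approach the two underlying regressors, mirroring the local convergence guarantee.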

[LG-17] Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning

链接: https://arxiv.org/abs/2604.05834
作者: Tillmann Rheude,Stefan Hegselmann,Roland Eils,Benjamin Wild
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal contrastive learning is increasingly enriched by going beyond image-text pairs. Among recent contrastive methods, Symile is a strong approach for this challenge because its multiplicative interaction objective captures higher-order cross-modal dependence. Yet, we find that Symile treats all modalities symmetrically and does not explicitly model reliability differences, a limitation that becomes especially pronounced in trimodal multiplicative interactions. In practice, modalities beyond image-text pairs can be misaligned, weakly informative, or missing, and treating them uniformly can silently degrade performance. This fragility can be hidden in the multiplicative interaction: Symile may outperform pairwise CLIP even if a single unreliable modality silently corrupts the product terms. We propose Gated Symile, a contrastive gating mechanism that adapts modality contributions on an attention-based, per-candidate basis. The gate suppresses unreliable inputs by interpolating embeddings toward learnable neutral directions and incorporating an explicit NULL option when reliable cross-modal alignment is unlikely. Across a controlled synthetic benchmark that uncovers this fragility and three real-world trimodal datasets for which such failures could be masked by averages, Gated Symile achieves higher top-1 retrieval accuracy than well-tuned Symile and CLIP models. More broadly, our results highlight gating as a step toward robust multimodal contrastive learning with imperfect inputs and more than two modalities.

[LG-18] Bivariate Causal Discovery Using Rate-Distortion MDL: An Information Dimension Approach

链接: https://arxiv.org/abs/2604.05829
作者: Tiago Brogueira,Mário A.T. Figueiredo
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 22 pages

点击查看摘要

Abstract:Approaches to bivariate causal discovery based on the minimum description length (MDL) principle approximate the (uncomputable) Kolmogorov complexity of the models in each causal direction, selecting the one with the lower total complexity. The premise is that nature’s mechanisms are simpler in their true causal order. Inherently, the description length (complexity) in each direction includes the description of the cause variable and that of the causal mechanism. In this work, we argue that current state-of-the-art MDL-based methods do not correctly address the problem of estimating the description length of the cause variable, effectively leaving the decision to the description length of the causal mechanism. Based on rate-distortion theory, we propose a new way to measure the description length of the cause, corresponding to the minimum rate required to achieve a distortion level representative of the underlying distribution. This distortion level is deduced using rules from histogram-based density estimation, while the rate is computed using the related concept of information dimension, based on an asymptotic approximation. Combining it with a traditional approach for the causal mechanism, we introduce a new bivariate causal discovery method, termed rate-distortion MDL (RDMDL). We show experimentally that RDMDL achieves competitive performance on the Tübingen dataset. All the code and experiments are publicly available at this http URL.
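The two-part MDL comparison at the heart of such methods can be sketched with crude stand-ins: a histogram code length for the cause (in place of the paper's rate-distortion term) and a least-squares mechanism. This illustrates only the decision rule, not RDMDL itself:

```python
import numpy as np

def desc_len_bits(v, bins=16):
    """Crude histogram code length (bits) of a 1-D sample; a stand-in
    for the rate-distortion term above, not the actual RDMDL rate.
    bins=16 is an arbitrary choice."""
    counts, _ = np.histogram(v, bins=bins)
    p = counts[counts > 0] / len(v)
    return float(-len(v) * np.sum(p * np.log2(p)))

def mdl_direction(x, y):
    """Prefer the direction whose total two-part code, L(cause) plus
    L(mechanism residual), is shorter; least-squares mechanism for
    illustration only."""
    def total(cause, effect):
        a, b = np.polyfit(cause, effect, 1)
        return desc_len_bits(cause) + desc_len_bits(effect - (a * cause + b))
    return "x->y" if total(x, y) < total(y, x) else "y->x"

const_len = desc_len_bits(np.zeros(100))  # one occupied bin: 0 bits
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 500)
direction = mdl_direction(x, x**3 + 0.05 * rng.normal(size=500))
```

The paper's contribution is precisely to replace the naive cause term above with a rate computed from information dimension at a distortion level derived from histogram-based density estimation.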

[LG-19] Stealthy and Adjustable Text-Guided Backdoor Attacks on Multimodal Pretrained Models

链接: https://arxiv.org/abs/2604.05809
作者: Yiyang Zhang,Chaojian Yu,Ziming Hong,Yuanjie Shao,Qinmu Peng,Tongliang Liu,Xinge You
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal pretrained models are vulnerable to backdoor attacks, yet most existing methods rely on visual or multimodal triggers, which are impractical since visually embedded triggers rarely occur in real-world data. To overcome this limitation, we propose a novel Text-Guided Backdoor (TGB) attack on multimodal pretrained models, where commonly occurring words in textual descriptions serve as backdoor triggers, significantly improving stealthiness and practicality. Furthermore, we introduce visual adversarial perturbations on poisoned samples to modulate the model’s learning of textual triggers, enabling a controllable and adjustable TGB attack. Extensive experiments on downstream tasks built upon multimodal pretrained models, including Composed Image Retrieval (CIR) and Visual Question Answering (VQA), demonstrate that TGB achieves practicality and stealthiness with adjustable attack success rates across diverse realistic settings, revealing critical security vulnerabilities in multimodal pretrained models.

[LG-20] Controllable Image Generation with Composed Parallel Token Prediction CVPR

链接: https://arxiv.org/abs/2604.05730
作者: Jamie Stirling,Noura Al-Moubayed,Chris G. Willcocks,Hubert P. H. Shum
类目: Machine Learning (cs.LG)
*备注: 8 pages + references, 7 figures, accepted to CVPR Workshops 2026 (LoViF). arXiv admin note: substantial text overlap with arXiv:2405.06535

点击查看摘要

Abstract:Conditional discrete generative models struggle to faithfully compose multiple input conditions. To address this, we derive a theoretically-grounded formulation for composing discrete probabilistic generative processes, with masked generation (absorbing diffusion) as a special case. Our formulation enables precise specification of novel combinations and numbers of input conditions that lie outside the training data, with concept weighting enabling emphasis or negation of individual conditions. In synergy with the richly compositional learned vocabulary of VQ-VAE and VQ-GAN, our method attains a 63.4% relative reduction in error rate compared to the previous state-of-the-art, averaged across 3 datasets (positional CLEVR, relational CLEVR and FFHQ), simultaneously obtaining an average absolute FID improvement of −9.58. Meanwhile, our method offers a 2.3× to 12× real-time speed-up over comparable methods, and is readily applied to an open pre-trained discrete text-to-image model for fine-grained control of text-to-image generation.

[LG-21] Optimal-Transport-Guided Functional Flow Matching for Turbulent Field Generation in Hilbert Space

链接: https://arxiv.org/abs/2604.05700
作者: Li Kunpeng,Wan Chenguang,Qu Zhisong,Lim Kyungtak,Virginie Grandgirard,Xavier Garbet,Yu Hua,Ong Yew Soon
类目: Machine Learning (cs.LG)
*备注: 41 pages, 5 figures, journal paper

点击查看摘要

Abstract:High-fidelity modeling of turbulent flows requires capturing complex spatiotemporal dynamics and multi-scale intermittency, posing a fundamental challenge for traditional knowledge-based systems. While deep generative models, such as diffusion models and Flow Matching, have shown promising performance, they are fundamentally constrained by their discrete, pixel-based nature. This limitation restricts their applicability in turbulence computing, where data inherently exists in a functional form. To address this gap, we propose Functional Optimal Transport Conditional Flow Matching (FOT-CFM), a generative framework defined directly in infinite-dimensional function space. Unlike conventional approaches defined on fixed grids, FOT-CFM treats physical fields as elements of an infinite-dimensional Hilbert space, and learns resolution-invariant generative dynamics directly at the level of probability measures. By integrating Optimal Transport (OT) theory, we construct deterministic, straight-line probability paths between noise and data measures in Hilbert space. This formulation enables simulation-free training and significantly accelerates the sampling process. We rigorously evaluate the proposed system on a diverse suite of chaotic dynamical systems, including the Navier-Stokes equations, Kolmogorov Flow, and Hasegawa-Wakatani equations, all of which exhibit rich multi-scale turbulent structures. Experimental results demonstrate that FOT-CFM achieves superior fidelity in reproducing high-order turbulent statistics and energy spectra compared to state-of-the-art baselines.

[LG-22] From Uniform to Learned Knots: A Study of Spline-Based Numerical Encodings for Tabular Deep Learning

链接: https://arxiv.org/abs/2604.05635
作者: Manish Kumar,Anton Frederik Thielmann,Christoph Weisser,Benjamin Säfken
类目: Machine Learning (cs.LG)
*备注: 20, 9 figures

点击查看摘要

Abstract:Numerical preprocessing remains an important component of tabular deep learning, where the representation of continuous features can strongly affect downstream performance. Although its importance is well established for classical statistical and machine learning models, the role of explicit numerical preprocessing in tabular deep learning remains less well understood. In this work, we study this question with a focus on spline-based numerical encodings. We investigate three spline families for encoding numerical features, namely B-splines, M-splines, and integrated splines (I-splines), under uniform, quantile-based, target-aware, and learnable-knot placement. For the learnable-knot variants, we use a differentiable knot parameterization that enables stable end-to-end optimization of knot locations jointly with the backbone. We evaluate these encodings on a diverse collection of public regression and classification datasets using MLP, ResNet, and FT-Transformer backbones, and compare them against common numerical preprocessing baselines. Our results show that the effect of numerical encodings depends strongly on the task, output size, and backbone. For classification, piecewise-linear encoding (PLE) is the most robust choice overall, while spline-based encodings remain competitive. For regression, no single encoding dominates uniformly. Instead, performance depends on the spline family, knot-placement strategy, and output size, with larger gains typically observed for MLP and ResNet than for FT-Transformer. We further find that learnable-knot variants can be optimized stably under the proposed parameterization, but may substantially increase training cost, especially for M-spline and I-spline expansions. Overall, the results show that numerical encodings should be assessed not only in terms of predictive performance, but also in terms of computational overhead.
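Piecewise-linear encoding (PLE), the baseline the study finds most robust for classification, maps a scalar to one value per bin: 1 if the value has passed the bin, 0 if it lies before it, and the fractional position otherwise. A sketch with hypothetical bin edges:

```python
import numpy as np

def ple_encode(x, edges):
    """Piecewise-linear encoding of a scalar: component t is the
    (clipped) fractional position of x inside bin t, so completed
    bins read 1 and untouched bins read 0."""
    lo, hi = edges[:-1], edges[1:]
    return np.clip((x - lo) / (hi - lo), 0.0, 1.0)

edges = np.array([0.0, 1.0, 2.0, 4.0])  # hypothetical quantile knots
enc = ple_encode(1.5, edges)            # lands halfway through bin 2
```

Spline encodings generalize this idea by replacing the clipped ramps with smooth basis functions whose knots can be uniform, quantile-based, target-aware, or learned.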

[LG-23] Same Graph Different Likelihoods: Calibration of Autoregressive Graph Generators via Permutation-Equivalent Encodings AISTATS2026

链接: https://arxiv.org/abs/2604.05613
作者: Laurits Fredsgaard,Aaron Thomas,Michael Riis Andersen,Mikkel N. Schmidt,Mahito Sugiyama
类目: Machine Learning (cs.LG)
*备注: Workshop ‘Towards Trustworthy Predictions: Theory and Applications of Calibration for Modern AI’ at AISTATS 2026, Tangier, Morocco

点击查看摘要

Abstract:Autoregressive graph generators define likelihoods via a sequential construction process, but these likelihoods are only meaningful if they are consistent across all linearizations of the same graph. Segmented Eulerian Neighborhood Trails (SENT), a recent linearization method, converts graphs into sequences that can be perfectly decoded and efficiently processed by language models, but admits multiple equivalent linearizations of the same graph. We quantify violations in assigned negative log-likelihood (NLL) using the coefficient of variation across equivalent linearizations, which we call Linearization Uncertainty (LU). Training transformers under four linearization strategies on two datasets, we show that biased orderings achieve lower NLL on their native order but exhibit expected calibration error (ECE) two orders of magnitude higher under random permutation, indicating that these models have learned their training linearization rather than the underlying graph. On the molecular graph benchmark QM9, NLL for generated graphs is negatively correlated with molecular stability (AUC = 0.43), while LU achieves AUC = 0.85, suggesting that permutation-based evaluation provides a more reliable quality check for generated molecules. Code is available at this https URL
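The LU metric is simply the coefficient of variation of the NLLs a model assigns to equivalent linearizations of one graph, which can be computed directly:

```python
import statistics

def linearization_uncertainty(nlls):
    """Coefficient of variation (population std / mean) of the NLLs
    assigned to equivalent linearizations of a single graph; 0 means
    the model's likelihood is order-invariant."""
    return statistics.pstdev(nlls) / statistics.fmean(nlls)

lu_zero = linearization_uncertainty([12.0, 12.0, 12.0])  # order-invariant
lu = linearization_uncertainty([10.0, 12.0, 14.0])       # order-sensitive
```

An order-invariant generator scores 0; the larger LU grows, the more the model has learned a particular linearization rather than the graph itself.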

[LG-24] Channel-wise Retrieval for Multivariate Time Series Forecasting ICASSP2026

链接: https://arxiv.org/abs/2604.05543
作者: Junhyeok Kang,Jun Seo,Soyeon Park,Sangjun Han,Seohui Bae,Hyeokjun Choe,Soonyoung Lee
类目: Machine Learning (cs.LG)
*备注: Accepted at ICASSP 2026 Oral

点击查看摘要

Abstract:Multivariate time series forecasting often struggles to capture long-range dependencies due to fixed lookback windows. Retrieval-augmented forecasting addresses this by retrieving historical segments from memory, but existing approaches rely on a channel-agnostic strategy that applies the same references to all variables. This neglects inter-variable heterogeneity, where different channels exhibit distinct periodicities and spectral profiles. We propose CRAFT (Channel-wise retrieval-augmented forecasting), a novel framework that performs retrieval independently for each channel. To ensure efficiency, CRAFT adopts a two-stage pipeline: a sparse relation graph constructed in the time domain prunes irrelevant candidates, and spectral similarity in the frequency domain ranks references, emphasizing dominant periodic components while suppressing noise. Experiments on seven public benchmarks demonstrate that CRAFT outperforms state-of-the-art forecasting baselines, achieving superior accuracy with practical inference efficiency.
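The frequency-domain ranking stage can be sketched as cosine similarity between normalized FFT amplitude spectra, computed per channel; the exact similarity CRAFT uses is an assumption here:

```python
import numpy as np

def spectral_rank(query, candidates, top_m=1):
    """Rank candidate reference segments for one channel by cosine
    similarity of normalized FFT amplitude spectra, emphasizing
    shared periodic components and ignoring phase."""
    def amp(x):
        a = np.abs(np.fft.rfft(x))
        return a / (np.linalg.norm(a) + 1e-12)
    q = amp(query)
    scores = [float(amp(c) @ q) for c in candidates]
    return [int(i) for i in np.argsort(scores)[::-1][:top_m]]

t = np.arange(64)
query = np.sin(2 * np.pi * t / 8)                   # period-8 channel
cands = [np.sin(2 * np.pi * t / 8 + 0.3),           # same periodicity
         np.sin(2 * np.pi * t / 32),                # different period
         np.random.default_rng(1).normal(size=64)]  # noise
best = spectral_rank(query, cands)
```

Because amplitude spectra discard phase, the phase-shifted segment with the same periodicity ranks first, which is the behavior the abstract describes for dominant periodic components.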

[LG-25] AttnDiff: Attention-based Differential Fingerprinting for Large Language Models ACL2026

链接: https://arxiv.org/abs/2604.05502
作者: Haobo Zhang,Zhenhua Xu,Junxian Li,Shangfeng Sheng,Dezhang Kong,Meng Han
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted at ACL2026 Main

点击查看摘要

Abstract:Protecting the intellectual property of open-weight large language models (LLMs) requires verifying whether a suspect model is derived from a victim model despite common laundering operations such as fine-tuning (including PPO/DPO), pruning/compression, and model merging. We propose AttnDiff, a data-efficient white-box framework that extracts fingerprints from models via intrinsic information-routing behavior. AttnDiff probes minimally edited prompt pairs that induce controlled semantic conflicts, captures differential attention patterns, summarizes them with compact spectral descriptors, and compares models using CKA. Across Llama-2/3 and Qwen2.5 (3B–14B) and additional open-source families, it yields high similarity for related derivatives while separating unrelated model families (e.g., 0.98 vs. 0.22 with M=60 probes). With 5–60 multi-domain probes, it supports practical provenance verification and accountability.
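Model comparison via linear CKA between two descriptor matrices (one row per probe) can be sketched as:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two feature matrices
    with one row per probe; 1.0 means identical structure up to
    rotation and scaling, near 0 means unrelated representations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
A = rng.normal(size=(60, 8))       # e.g. descriptors from 60 probes
self_sim = linear_cka(A, A)
scaled_sim = linear_cka(A, 2.5 * A)
```

CKA's invariance to isotropic scaling and orthogonal transforms is what makes it suitable for comparing descriptors across models with different internal bases.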

[LG-26] Reproducing AlphaZero on Tablut: Self-Play RL for an Asymmetric Board Game

链接: https://arxiv.org/abs/2604.05476
作者: Tõnis Lees,Tambet Matiisen
类目: Machine Learning (cs.LG)
*备注: For the code see this https URL

点击查看摘要

Abstract:This work investigates the adaptation of the AlphaZero reinforcement learning algorithm to Tablut, an asymmetric historical board game featuring unequal piece counts and distinct player objectives (king capture versus king escape). While the original AlphaZero architecture successfully leverages a single policy and value head for symmetric games, applying it to asymmetric environments forces the network to learn two conflicting evaluation functions, which can hinder learning efficiency and performance. To address this, the core architecture is modified to use separate policy and value heads for each player role, while maintaining a shared residual trunk to learn common board features. During training, the asymmetric structure introduced training instabilities, notably catastrophic forgetting between the attacker and defender roles. These issues were mitigated by applying C4 data augmentation, increasing the replay buffer size, and having the model play 25 percent of training games against randomly sampled past checkpoints. Over 100 self-play iterations, the modified model demonstrated steady improvement, achieving a BayesElo rating of 1235 relative to a randomly initialized baseline. Training metrics also showed a significant decrease in policy entropy and average remaining pieces, reflecting increasingly focused and decisive play. Ultimately, the experiments confirm that AlphaZero’s self-play framework can transfer to highly asymmetric games, provided that distinct policy/value heads and robust stabilization techniques are employed.
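The checkpoint-opponent trick, playing a fixed fraction of games against uniformly sampled past models, is only a few lines (names illustrative):

```python
import random

def pick_opponent(current, checkpoints, p_past=0.25, rng=random):
    """With probability p_past, play against a uniformly sampled past
    checkpoint; otherwise self-play against the current model. This
    is the forgetting-mitigation scheme described above."""
    if checkpoints and rng.random() < p_past:
        return rng.choice(checkpoints)
    return current

random.seed(0)
past = ["ckpt_10", "ckpt_20", "ckpt_30"]
picks = [pick_opponent("current", past) for _ in range(10000)]
frac_past = sum(p != "current" for p in picks) / len(picks)
```

Mixing in past opponents exposes the network to strategies it would otherwise forget, which is especially important when attacker and defender heads can drift independently.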

[LG-27] LMI-Net: Linear Matrix Inequality–Constrained Neural Networks via Differentiable Projection Layers

链接: https://arxiv.org/abs/2604.05374
作者: Sunbochen Tang,Andrea Goertzen,Navid Azizan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Linear matrix inequalities (LMIs) have played a central role in certifying stability, robustness, and forward invariance of dynamical systems. Despite rapid development in learning-based methods for control design and certificate synthesis, existing approaches often fail to preserve the hard matrix inequality constraints required for formal guarantees. We propose LMI-Net, an efficient and modular differentiable projection layer that enforces LMI constraints by construction. Our approach lifts the set defined by LMI constraints into the intersection of an affine equality constraint and the positive semidefinite cone, performs the forward pass via Douglas-Rachford splitting, and supports efficient backward propagation through implicit differentiation. We establish theoretical guarantees that the projection layer converges to a feasible point, certifying that LMI-Net transforms a generic neural network into a reliable model satisfying LMI constraints. Evaluated on experiments including invariant ellipsoid synthesis and joint controller-and-certificate design for a family of disturbed linear systems, LMI-Net substantially improves feasibility over soft-constrained models under distribution shift while retaining fast inference speed, bridging semidefinite-program-based certification and modern learning techniques.
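The positive-semidefinite step inside a Douglas-Rachford iteration is a Euclidean projection onto the PSD cone, obtained by zeroing negative eigenvalues:

```python
import numpy as np

def project_psd(M):
    """Euclidean (Frobenius-norm) projection of a matrix onto the PSD
    cone: symmetrize, eigendecompose, clip negative eigenvalues to
    zero, and reassemble."""
    w, V = np.linalg.eigh((M + M.T) / 2)   # symmetrize for safety
    return (V * np.clip(w, 0.0, None)) @ V.T

A = np.array([[1.0, 2.0],
              [2.0, -3.0]])               # indefinite input
P = project_psd(A)
eigs = np.linalg.eigvalsh(P)
```

In LMI-Net this cone projection is alternated with a projection onto the affine equality constraint, and the whole fixed-point iteration is differentiated implicitly for the backward pass.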

[LG-28] Cross-Machine Anomaly Detection Leveraging Pre-trained Time-series Model

链接: https://arxiv.org/abs/2604.05335
作者: Yangmeng Li,Kei Sano,Toshihiro Kitao,Ryoji Anzaki,Yukiya Saitoh,Hironori Moki,Dragan Djurdjanovic
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 20 pages, 5 figures, under review at a journal

点击查看摘要

Abstract:Achieving resilient and high-quality manufacturing requires reliable data-driven anomaly detection methods that are capable of addressing differences in behaviors among different individual machines which are nominally the same and are executing the same processes. To address the problem of detecting anomalies in a machine using sensory data gathered from different individual machines executing the same procedure, this paper proposes a cross-machine time-series anomaly detection framework that integrates a domain-invariant feature extractor with an unsupervised anomaly detection module. Leveraging the pre-trained foundation model MOMENT, the extractor employs Random Forest Classifiers to disentangle embeddings into machine-related and condition-related features, with the latter serving as representations which are invariant to differences between individual machines. These refined features enable the downstream anomaly detectors to generalize effectively to unseen target machines. Experiments on an industrial dataset collected from three different machines performing nominally the same operation demonstrate that the proposed approach outperforms both the raw-signal-based and MOMENT-embedding feature baselines, confirming its effectiveness in enhancing cross-machine generalization.

[LG-29] A Theoretical Framework for Statistical Evaluability of Generative Models

链接: https://arxiv.org/abs/2604.05324
作者: Shashaank Aiyer,Yishay Mansour,Shay Moran,Han Shao
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 25 pages

点击查看摘要

Abstract:Statistical evaluation aims to estimate the generalization performance of a model using held-out i.i.d. test data sampled from the ground-truth distribution. In supervised learning settings such as classification, performance metrics such as error rate are well-defined, and test error reliably approximates population error given sufficiently large datasets. In contrast, evaluation is more challenging for generative models due to their open-ended nature: it is unclear which metrics are appropriate and whether such metrics can be reliably evaluated from finite samples. In this work, we introduce a theoretical framework for evaluating generative models and establish evaluability results for commonly used metrics. We study two categories of metrics: test-based metrics, including integral probability metrics (IPMs), and Rényi divergences. We show that IPMs with respect to any bounded test class can be evaluated from finite samples up to multiplicative and additive approximation errors. Moreover, when the test class has finite fat-shattering dimension, IPMs can be evaluated with arbitrary precision. In contrast, Rényi and KL divergences are not evaluable from finite samples, as their values can be critically determined by rare events. We also analyze the potential and limitations of perplexity as an evaluation method.
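The evaluable regime for IPMs can be illustrated with a plug-in estimate over a finite class of bounded test functions (the threshold-indicator class below is just an example):

```python
import numpy as np

def empirical_ipm(xs, ys, tests):
    """Plug-in IPM estimate: the largest gap in mean test-function
    value between two samples, taken over a finite class of bounded
    test functions."""
    return max(abs(np.mean([f(x) for x in xs]) - np.mean([f(y) for y in ys]))
               for f in tests)

# Illustrative bounded test class: threshold indicators on a grid.
tests = [lambda v, t=t: float(v <= t) for t in np.linspace(-2.0, 2.0, 9)]
rng = np.random.default_rng(0)
same = empirical_ipm(rng.normal(size=4000), rng.normal(size=4000), tests)
diff = empirical_ipm(rng.normal(size=4000), rng.normal(1.0, 1.0, 4000), tests)
```

Because each test function is bounded, sample means concentrate around population means, so the estimate is small for two samples from the same distribution and large under a mean shift; this concentration is exactly what fails for KL and Rényi divergences, which rare events can dominate.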

[LG-30] Jeffreys Flow: Robust Boltzmann Generators for Rare Event Sampling via Parallel Tempering Distillation

链接: https://arxiv.org/abs/2604.05303
作者: Guang Lin,Christian Moya,Di Qi,Xuda Ye
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Sampling physical systems with rough energy landscapes is hindered by rare events and metastable trapping. While Boltzmann generators already offer a solution, their reliance on the reverse Kullback–Leibler divergence frequently induces catastrophic mode collapse, missing specific modes in multi-modal distributions. Here, we introduce the Jeffreys Flow, a robust generative framework that mitigates this failure by distilling empirical sampling data from Parallel Tempering trajectories using the symmetric Jeffreys divergence. This formulation effectively balances local target-seeking precision with global mode coverage. We show that minimizing the Jeffreys divergence suppresses mode collapse and structurally corrects inherent inaccuracies via distillation of the empirical reference data. We demonstrate the framework’s scalability and accuracy on highly non-convex multidimensional benchmarks, including the systematic correction of stochastic gradient biases in Replica Exchange Stochastic Gradient Langevin Dynamics and the massive acceleration of exact importance sampling in Path Integral Monte Carlo for quantum thermal states.
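The symmetric objective is the Jeffreys divergence, the sum of both KL directions; in its basic discrete form:

```python
import numpy as np

def jeffreys(p, q):
    """Jeffreys divergence J(p, q) = KL(p||q) + KL(q||p) for discrete
    distributions with full support; symmetric and zero iff p == q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    def kl(a, b):
        return float(np.sum(a * np.log(a / b)))
    return kl(p, q) + kl(q, p)

p, q = [0.5, 0.5], [0.9, 0.1]
j_pq, j_qp = jeffreys(p, q), jeffreys(q, p)  # symmetric by construction
j_self = jeffreys(p, p)                      # zero at equality
```

The forward KL term penalizes missing modes (mode-covering) while the reverse KL term penalizes spurious mass (mode-seeking), which is the balance the abstract attributes to this objective.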

[LG-31] Vehicle-as-Prompt: A Unified Deep Reinforcement Learning Framework for Heterogeneous Fleet Vehicle Routing Problem

链接: https://arxiv.org/abs/2604.05195
作者: Shihong Huang,Shengjie Wang,Lei Gao,Hong Ma,Zhanluo Zhang,Feng Zhang,Weihua Zhou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Unlike traditional homogeneous routing problems, the Heterogeneous Fleet Vehicle Routing Problem (HFVRP) involves heterogeneous fixed costs, variable travel costs, and capacity constraints, rendering solution quality highly sensitive to vehicle selection. Furthermore, real-world logistics applications often impose additional complex constraints, markedly increasing computational complexity. However, most existing Deep Reinforcement Learning (DRL)-based methods are restricted to homogeneous scenarios, leading to suboptimal performance when applied to HFVRP and its complex variants. To bridge this gap, we investigate HFVRP under complex constraints and develop a unified DRL framework capable of solving the problem across various variant settings. We introduce the Vehicle-as-Prompt (VaP) mechanism, which formulates the problem as a single-stage autoregressive decision process. Building on this, we propose VaP-CSMV, a framework featuring a cross-semantic encoder and a multi-view decoder that effectively addresses various problem variants and captures the complex mapping relationships between vehicle heterogeneity and customer node attributes. Extensive experimental results demonstrate that VaP-CSMV significantly outperforms existing state-of-the-art DRL-based neural solvers and achieves competitive solution quality compared to traditional heuristic solvers, while reducing inference time to mere seconds. Furthermore, the framework exhibits strong zero-shot generalization capabilities on large-scale and previously unseen problem variants, while ablation studies validate the vital contribution of each component.

[LG-32] FNO∠θ: Extended Fourier neural operator for learning state and optimal control of distributed parameter systems

链接: https://arxiv.org/abs/2604.05187
作者: Zhexian Li,Ketan Savla
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 6 pages, 3 figures

点击查看摘要

Abstract:We propose an extended Fourier neural operator (FNO) architecture for learning state and linear quadratic additive optimal control of systems governed by partial differential equations. Using the Ehrenpreis-Palamodov fundamental principle, we show that any state and optimal control of linear PDEs with constant coefficients can be represented as an integral in the complex domain. The integrand of this representation involves the same exponential term as in the inverse Fourier transform, where the latter is used to represent the convolution operator in the FNO layer. Motivated by this observation, we modify the FNO layer by extending the frequency variable in the inverse Fourier transform from the real to the complex domain to capture the integral representation from the fundamental principle. We illustrate the performance of the extended architecture in learning state and optimal control for the nonlinear Burgers’ equation, showing order-of-magnitude improvements in training errors and more accurate predictions of non-periodic boundary values over the standard FNO.

[LG-33] Cross-fitted Proximal Learning for Model-Based Reinforcement Learning

链接: https://arxiv.org/abs/2604.05185
作者: Nishanth Venkatesh,Andreas A. Malikopoulos
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Model-based reinforcement learning is attractive for sequential decision-making because it explicitly estimates reward and transition models and then supports planning through simulated rollouts. In offline settings with hidden confounding, however, models learned directly from observational data may be biased. This challenge is especially pronounced in partially observable systems, where latent factors may jointly affect actions, rewards, and future observations. Recent work has shown that policy evaluation in such confounded partially observable Markov decision processes (POMDPs) can be reduced to estimating reward-emission and observation-transition bridge functions satisfying conditional moment restrictions (CMRs). In this paper, we study the statistical estimation of these bridge functions. We formulate bridge learning as a CMR problem with nuisance objects given by a conditional mean embedding and a conditional density. We then develop a K -fold cross-fitted extension of the existing two-stage bridge estimator. The proposed procedure preserves the original bridge-based identification strategy while using the available data more efficiently than a single sample split. We also derive an oracle-comparator bound for the cross-fitted estimator and decompose the resulting error into a Stage I term induced by nuisance estimation and a Stage II term induced by empirical averaging.

[LG-34] General Multimodal Protein Design Enables DNA-Encoding of Chemistry

链接: https://arxiv.org/abs/2604.05181
作者: Jarrid Rector-Brooks,Théophile Lambert,Marta Skreta,Daniel Roth,Yueming Long,Zi-Qi Li,Xi Zhang,Miruna Cretu,Francesca-Zhoufan Li,Tanvi Ganapathy,Emily Jin,Avishek Joey Bose,Jason Yang,Kirill Neklyudov,Yoshua Bengio,Alexander Tong,Frances H. Arnold,Cheng-Hao Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Evolution is an extraordinary engine for enzymatic diversity, yet the chemistry it has explored remains a narrow slice of what DNA can encode. Deep generative models can design new proteins that bind ligands, but none have created enzymes without pre-specifying catalytic residues. We introduce DISCO (DIffusion for Sequence-structure CO-design), a multimodal model that co-designs protein sequence and 3D structure around arbitrary biomolecules, as well as inference-time scaling methods that optimize objectives across both modalities. Conditioned solely on reactive intermediates, DISCO designs diverse heme enzymes with novel active-site geometries. These enzymes catalyze new-to-nature carbene-transfer reactions, including alkene cyclopropanation, spirocyclopropanation, B-H, and C(sp^3)-H insertions, with high activities exceeding those of engineered enzymes. Random mutagenesis of a selected design further confirmed that enzyme activity can be improved through directed evolution. By providing a scalable route to evolvable enzymes, DISCO broadens the potential scope of genetically encodable transformations. Code is available at this https URL.

[LG-35] On the Exploitability of FTRL Dynamics

链接: https://arxiv.org/abs/2604.05129
作者: Yiheng Su,Emmanouil-Vasileios Vlatakis-Gkaragkounis
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper we investigate the exploitability of a Follow-the-Regularized-Leader (FTRL) learner with constant step size \eta in n\times m two-player zero-sum games played over T rounds against a clairvoyant optimizer. In contrast with prior analysis, we show that exploitability is an inherent feature of the FTRL family, rather than an artifact of specific instantiations. First, for a fixed optimizer, we establish a sweeping law of order \Omega(N/\eta) , proving that exploitation scales with the number N of the learner’s suboptimal actions and vanishes in their absence. Second, for an alternating optimizer, a surplus of \Omega(\eta T/\mathrm{poly}(n,m)) can be guaranteed regardless of the equilibrium structure, with high probability, in random games. Our analysis uncovers once more a sharp geometric dichotomy: non-steep regularizers allow the optimizer to extract maximum surplus via finite-time elimination of suboptimal actions, whereas steep ones introduce a vanishing correction that may delay exploitation. Finally, we discuss whether this leverage persists under bilateral payoff uncertainty and we propose a susceptibility measure to quantify which regularizers are most vulnerable to strategic manipulation.
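As an illustration of the setting (not a reproduction of the paper's constructions), the sketch below pits an entropy-regularized FTRL learner, i.e. multiplicative weights, against a clairvoyant column player who best-responds to the learner's current mixed strategy each round, and measures the optimizer's average surplus over the game value:

```python
import numpy as np

def ftrl_vs_best_response(A, eta, T):
    """FTRL with entropic regularizer (multiplicative weights) for the
    row player, facing a clairvoyant column player who best-responds
    to the current mixed strategy every round.

    Returns the optimizer's average per-round gain, whose excess over
    the game value is the learner's empirical exploitability.
    """
    n, m = A.shape
    cum_loss = np.zeros(n)
    total = 0.0
    for _ in range(T):
        # FTRL iterate: argmin over the simplex of <cum_loss, x> + (1/eta) * neg-entropy
        x = np.exp(-eta * (cum_loss - cum_loss.min()))  # stabilized softmax
        x /= x.sum()
        j = int(np.argmax(x @ A))       # clairvoyant best response
        total += float(x @ A[:, j])     # optimizer's gain this round
        cum_loss += A[:, j]             # full-information feedback
    return total / T

# 2x2 zero-sum game with value 0.5 at the uniform equilibrium
A = np.array([[0.0, 1.0], [1.0, 0.0]])
surplus = ftrl_vs_best_response(A, eta=0.1, T=2000) - 0.5
```

Against a best-responding opponent the per-round gain never falls below the game value, so the surplus is nonnegative; the no-regret property of multiplicative weights keeps it small for this simple game.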

[LG-36] Probabilistic Tree Inference Enabled by FDSOI Ferroelectric FETs

链接: https://arxiv.org/abs/2604.05115
作者: Pengyu Ren,Xingtian Wang,Boyang Cheng,Jiahui Duan,Giuk Kim,Xuezhong Niu,Halid Mulaosmanovic,Stefan Duenkel,Sven Beyer,X. Sharon Hu,Ningyuan Cao,Kai Ni
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Artificial intelligence applications in autonomous driving, medical diagnostics, and financial systems increasingly demand machine learning models that can provide robust uncertainty quantification, interpretability, and noise resilience. Bayesian decision trees (BDTs) are attractive for these tasks because they combine probabilistic reasoning, interpretable decision-making, and robustness to noise. However, existing hardware implementations of BDTs based on CPUs and GPUs are limited by memory bottlenecks and irregular processing patterns, while multi-platform solutions exploiting analog content-addressable memory (ACAM) and Gaussian random number generators (GRNGs) introduce integration complexity and energy overheads. Here we report a monolithic FDSOI-FeFET hardware platform that natively supports both ACAM and GRNG functionalities. The ferroelectric polarization of FeFETs enables compact, energy-efficient multi-bit storage for ACAM, and band-to-band tunneling in the gate-to-drain overlap region and subsequent hole storage in the floating body provides a high-quality entropy source for GRNG. System-level evaluations demonstrate that the proposed architecture provides robust uncertainty estimation, interpretability, and noise tolerance with high energy efficiency. Under both dataset noise and device variations, it achieves over 40% higher classification accuracy on MNIST compared to conventional decision trees. Moreover, it delivers more than two orders of magnitude speedup over CPU and GPU baselines and over four orders of magnitude improvement in energy efficiency, making it a scalable solution for deploying BDTs in resource-constrained and safety-critical environments.

[LG-37] Scalar Federated Learning for Linear Quadratic Regulator

链接: https://arxiv.org/abs/2604.05088
作者: Mohammadreza Rostami,Shahriar Talebi,Solmaz S. Kia
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose ScalarFedLQR, a communication-efficient federated algorithm for model-free learning of a common policy in linear quadratic regulator (LQR) control of heterogeneous agents. The method builds on a decomposed projected gradient mechanism, in which each agent communicates only a scalar projection of a local zeroth-order gradient estimate. The server aggregates these scalar messages to reconstruct a global descent direction, reducing per-agent uplink communication from O(d) to O(1), independent of the policy dimension. Crucially, the projection-induced approximation error diminishes as the number of participating agents increases, yielding a favorable scaling law: larger fleets enable more accurate gradient recovery, admit larger stepsizes, and achieve faster linear convergence despite high dimensionality. Under standard regularity conditions, all iterates remain stabilizing and the average LQR cost decreases linearly fast. Numerical results demonstrate performance comparable to full-gradient federated LQR with substantially reduced communication.
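The O(1)-uplink idea can be sketched as follows. This toy replaces the LQR zeroth-order machinery with exact gradients of hypothetical per-agent quadratic costs; it only illustrates how a server can rebuild an unbiased global descent direction from one scalar per agent, with accuracy improving as the fleet grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_agents = 20, 500

# heterogeneous quadratic surrogates for local costs:
# f_i(k) = 0.5 * ||k - target_i||^2, sharing a common component
common = rng.normal(size=d)
targets = common + 0.1 * rng.normal(size=(n_agents, d))
k = np.zeros(d)                       # current common policy parameters

def local_gradient(i, k):
    return k - targets[i]             # stands in for a zeroth-order estimate

# each agent uplinks ONE scalar: its gradient projected on a random
# direction the server also knows (e.g. generated from a shared seed)
dirs = rng.normal(size=(n_agents, d)) / np.sqrt(d)
scalars = np.array([dirs[i] @ local_gradient(i, k) for i in range(n_agents)])

# server rebuilds a global descent direction from the scalar messages;
# E[d * u u^T] = I makes this unbiased, and the projection-induced error
# averages out over agents
recon = d * np.mean(scalars[:, None] * dirs, axis=0)
true_grad = np.mean([local_gradient(i, k) for i in range(n_agents)], axis=0)
rel_err = np.linalg.norm(recon - true_grad) / np.linalg.norm(true_grad)
```

The reconstructed direction is positively aligned with the true average gradient, matching the paper's scaling intuition that larger fleets yield more accurate gradient recovery.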

[LG-38] Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling

链接: https://arxiv.org/abs/2604.05072
作者: Ximing Xing,Ziteng Xue,Zhenxi Li,Weicong Liang,Linqing Wang,Zhantao Yang,Tiankai Hang,Zijin Yin,Qinglin Lu,Chunyu Wang,Qian Yu
类目: Machine Learning (cs.LG)
*备注: Homepage: this https URL

点击查看摘要

Abstract:Recent large language models have shifted SVG generation from differentiable rendering optimization to autoregressive program synthesis. However, existing approaches still rely on generic byte-level tokenization inherited from natural language processing, which poorly reflects the geometric structure of vector graphics. Numerical coordinates are fragmented into discrete symbols, destroying spatial relationships and introducing severe token redundancy, often leading to coordinate hallucination and inefficient long-sequence generation. To address these challenges, we propose HiVG, a hierarchical SVG tokenization framework tailored for autoregressive vector graphics generation. HiVG decomposes raw SVG strings into structured \textit{atomic} tokens and further compresses executable command–parameter groups into geometry-constrained \textit{segment} tokens, substantially improving sequence efficiency while preserving syntactic validity. To further mitigate spatial mismatch, we introduce a Hierarchical Mean–Noise (HMN) initialization strategy that injects numerical ordering signals and semantic priors into new token embeddings. Combined with a curriculum training paradigm that progressively increases program complexity, HiVG enables more stable learning of executable SVG programs. Extensive experiments on both text-to-SVG and image-to-SVG tasks demonstrate improved generation fidelity, spatial consistency, and sequence efficiency compared with conventional tokenization schemes.

[LG-39] Towards Scaling Law Analysis For Spatiotemporal Weather Data

链接: https://arxiv.org/abs/2604.05068
作者: Alexander Kiefer,Prasanna Balaprakash,Xiao Wang
类目: Machine Learning (cs.LG)
*备注: 9 pages, 6 figures, High Performance Computing for Imaging 2026

点击查看摘要

Abstract:Compute-optimal scaling laws are relatively well studied for NLP and CV, where objectives are typically single-step and targets are comparatively homogeneous. Weather forecasting is harder to characterize in the same framework: autoregressive rollouts compound errors over long horizons, outputs couple many physical channels with disparate scales and predictability, and globally pooled test metrics can disagree sharply with per-channel, late-lead behavior implied by short-horizon training. We extend neural scaling analysis for autoregressive weather forecasting from single-step training loss to long rollouts and per-channel metrics. We quantify (1) how prediction error is distributed across channels and how its growth rate evolves with forecast horizon, (2) whether power-law scaling holds for test error, relative to rollout length, when error is pooled globally, and (3) how that fit varies jointly with horizon and channel for parameter, data, and compute-based scaling axes. We find strong cross-channel and cross-horizon heterogeneity: pooled scaling can look favorable while many channels degrade at late leads. We discuss implications for weighted objectives, horizon-aware curricula, and resource allocation across outputs.
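A minimal sketch of the kind of fit involved: a power law fitted by least squares in log-log space, on synthetic data (hypothetical exponents) where a pooled metric still looks favorable even though one channel is near-saturated, which is exactly the heterogeneity the paper warns about.

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y ~ a * x**(-b) by least squares in log-log space and
    return (a, b); positive b means error decays with scale."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    return float(np.exp(intercept)), float(-slope)

# synthetic per-channel errors vs. a scale axis (e.g. parameters)
scale = np.array([1e6, 1e7, 1e8, 1e9])
channel_fast = 2.0 * scale ** -0.30    # strongly improving channel
channel_flat = 0.05 * scale ** -0.02   # near-saturated channel (late leads)
pooled = 0.5 * (channel_fast + channel_flat)

a, b_pooled = fit_power_law(scale, pooled)
_, b_flat = fit_power_law(scale, channel_flat)
```

The pooled fit reports a much steeper decay than the saturated channel actually enjoys, so resource-allocation decisions based on pooled scaling alone can be misleading.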

[LG-40] Blind-Spot Mass: A Good-Turing Framework for Quantifying Deployment Coverage Risk in Machine Learning Systems

链接: https://arxiv.org/abs/2604.05057
作者: Biplab Pal,Santanu Bhattacharya,Madanjit Singh
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 15 pages, 7 figures, 1 table; submitted to Journal of Machine Learning Research (JMLR)

点击查看摘要

Abstract:Blind-spot mass is a Good-Turing framework for quantifying deployment coverage risk in machine learning. In modern ML systems, operational state distributions are often heavy-tailed, implying that a long tail of valid but rare states is structurally under-supported in finite training and evaluation data. This creates a form of ‘coverage blindness’: models can appear accurate on standard test sets yet remain unreliable across large regions of the deployment state space. We propose blind-spot mass B_n(\tau) , a deployment metric estimating the total probability mass assigned to states whose empirical support falls below a threshold \tau . B_n(\tau) is computed using Good-Turing unseen-species estimation and yields a principled estimate of how much of the operational distribution lies in reliability-critical, under-supported regimes. We further derive a coverage-imposed accuracy ceiling, decomposing overall performance into supported and blind components and separating capacity limits from data limits. We validate the framework in wearable human activity recognition (HAR) using wrist-worn inertial data. We then replicate the same analysis in the MIMIC-IV hospital database with 275 admissions, where the blind-spot mass curve converges to the same 95% at \tau = 5 across clinical state abstractions. This replication across structurally independent domains - differing in modality, feature space, label space, and application - shows that blind-spot mass is a general ML methodology for quantifying combinatorial coverage risk, not an application-specific artifact. Blind-spot decomposition identifies which activities or clinical regimes dominate risk, providing actionable guidance for industrial practitioners on targeted data collection, normalization/renormalization, and physics- or domain-informed constraints for safer deployment.
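One plausible reading of the estimator, sketched with classical Good-Turing frequency-of-frequency counts (the paper's exact smoothing may differ): the mass of states observed exactly r times is estimated as (r+1)·n_{r+1}/n, and summing over r < tau gives the blind-spot mass.

```python
from collections import Counter

def blind_spot_mass(samples, tau):
    """Good-Turing style estimate of the probability mass on states
    whose empirical support falls below tau (illustrative B_n(tau)).

    n_r counts distinct states seen exactly r times; the mass of
    count-r states is estimated as (r+1) * n_{r+1} / n, and r = 0
    recovers Turing's classic unseen-mass estimate n_1 / n.
    """
    n = len(samples)
    counts = Counter(samples)                  # state -> frequency
    freq_of_freq = Counter(counts.values())    # r -> n_r
    return sum((r + 1) * freq_of_freq.get(r + 1, 0) / n
               for r in range(tau))

# heavy-tailed toy stream: common activities plus many rare states
samples = ["walk"] * 60 + ["sit"] * 28 + ["jog"] * 2 \
          + [f"rare_{i}" for i in range(10)]
b1 = blind_spot_mass(samples, tau=1)   # unseen mass only: n_1 / n
b2 = blind_spot_mass(samples, tau=2)   # adds mass of singleton states
```

Here n = 100 with ten singletons and one doubleton, so b1 = 10/100 = 0.10 and b2 = 0.10 + 2·1/100 = 0.12; the estimate is monotone in tau, matching the blind-spot mass curves described in the abstract.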

[LG-41] Energy-Based Dynamical Models for Neurocomputation Learning and Optimization

链接: https://arxiv.org/abs/2604.05042
作者: Arthur N. Montanari,Francesco Bullo,Dmitry Krotov,Adilson E. Motter
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Systems and Control (eess.SY); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:Recent advances at the intersection of control theory, neuroscience, and machine learning have revealed novel mechanisms by which dynamical systems perform computation. These advances encompass a wide range of conceptual, mathematical, and computational ideas, with applications for model learning and training, memory retrieval, data-driven control, and optimization. This tutorial focuses on neuro-inspired approaches to computation that aim to improve scalability, robustness, and energy efficiency across such tasks, bridging the gap between artificial and biological systems. Particular emphasis is placed on energy-based dynamical models that encode information through gradient flows and energy landscapes. We begin by reviewing classical formulations, such as continuous-time Hopfield networks and Boltzmann machines, and then extend the framework to modern developments. These include dense associative memory models for high-capacity storage, oscillator-based networks for large-scale optimization, and proximal-descent dynamics for composite and constrained reconstruction. The tutorial demonstrates how control-theoretic principles can guide the design of next-generation neurocomputing systems, steering the discussion beyond conventional feedforward and backpropagation-based approaches to artificial intelligence.

[LG-42] El Nino Prediction Based on Weather Forecast and Geographical Time-series Data

链接: https://arxiv.org/abs/2604.04998
作者: Viet Trinh,Ha-Vy Luu,Quoc-Khiem Nguyen-Pham,Hung Tong,Thanh-Huyen Tran,Hoai-Nam Nguyen Dang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes a novel framework for enhancing the prediction accuracy and lead time of El Niño events, crucial for mitigating their global climatic, economic, and societal impacts. Traditional prediction models often rely on oceanic and atmospheric indices, which may lack the granularity or dynamic interplay captured by comprehensive meteorological and geographical datasets. Our framework integrates real-time global weather forecast data with anomalies, subsurface ocean heat content, and atmospheric pressure across various temporal and spatial resolutions. Leveraging a hybrid deep learning architecture that combines a Convolutional Neural Network (CNN) for spatial feature extraction and a Long Short-Term Memory (LSTM) network for temporal dependency modeling, the framework aims to identify complex precursors and evolving patterns of El Niño events.

[LG-43] Learning-Based Multi-Criteria Decision Making Model for Sawmill Location Problems

链接: https://arxiv.org/abs/2604.04996
作者: Mahid Ahmed,Ali Dogru,Chaoyang Zhang,Chao Meng
类目: Machine Learning (cs.LG)
*备注: 34 pages, 12 figures, 5 tables

点击查看摘要

Abstract:Strategically locating a sawmill is vital for enhancing the efficiency, profitability, and sustainability of timber supply chains. Our study proposes a Learning-Based Multi-Criteria Decision-Making (LB-MCDM) framework that integrates machine learning (ML) with GIS-based spatial location analysis via MCDM. The proposed framework provides a data-driven, unbiased, and replicable approach to assessing site suitability. We demonstrate the utility of the proposed model through a case study in Mississippi (MS). We apply five ML algorithms (Random Forest Classifier, Support Vector Classifier, XGBoost Classifier, Logistic Regression, and K-Nearest Neighbors Classifier) to identify the most suitable sawmill locations in Mississippi. Among these models, the Random Forest Classifier achieved the highest performance. We use the SHAP (SHapley Additive exPlanations) technique to determine the relative importance of each criterion, revealing the Supply-Demand Ratio, a composite feature that reflects local market competition dynamics, as the most influential factor, followed by Road, Rail Line and Urban Area Distance. The validation of suitability maps generated by our LB-MCDM model suggests that 10-11% of the MS landscape is highly suitable for sawmill location.

[LG-44] Enhancing sample efficiency in reinforcement-learning-based flow control: replacing the critic with an adaptive reduced-order model

链接: https://arxiv.org/abs/2604.04986
作者: Zesheng Yao,Zhen-Hua Wan,Canjun Yang,Qingchao Xia,Mengqi Zhang
类目: Machine Learning (cs.LG)
*备注: 43 pages, 26 figures

点击查看摘要

Abstract:Model-free deep reinforcement learning (DRL) methods suffer from poor sample efficiency. To overcome this limitation, this work introduces an adaptive reduced-order-model (ROM)-based reinforcement learning framework for active flow control. In contrast to conventional actor–critic architectures, the proposed approach leverages a ROM to estimate the gradient information required for controller optimization. The design of the ROM structure incorporates physical insights. The ROM integrates a linear dynamical system and a neural ordinary differential equation (NODE) for estimating the nonlinearity in the flow. The parameters of the linear component are identified via operator inference, while the NODE is trained in a data-driven manner using gradient-based optimization. During controller–environment interactions, the ROM is continuously updated with newly collected data, enabling adaptive refinement of the model. The controller is then optimized through differentiable simulation of the ROM. The proposed ROM-based DRL framework is validated on two canonical flow control problems: Blasius boundary layer flow and flow past a square cylinder. For the Blasius boundary layer, the proposed method effectively reduces to a single-episode system identification and controller optimization process, yet it yields controllers that outperform traditional linear designs and achieve performance comparable to DRL approaches with minimal data. For the flow past a square cylinder, the proposed method achieves superior drag reduction with significantly fewer exploration data compared with DRL approaches. The work addresses a key component of model-free DRL control algorithms and lays the foundation for designing more sample-efficient DRL-based active flow controllers.

[LG-45] Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO

链接: https://arxiv.org/abs/2604.04983
作者: Diyansha Singh
类目: Machine Learning (cs.LG)
*备注: 16 pages, 5 figures

点击查看摘要

Abstract:We present Territory Paint Wars, a minimal competitive multi-agent reinforcement learning environment implemented in Unity, and use it to systematically investigate failure modes of Proximal Policy Optimisation (PPO) under self-play. A first agent trained for 84,000 episodes achieves only 26.8% win rate against a uniformly-random opponent in a symmetric zero-sum game. Through controlled ablations we identify five implementation-level failure modes – reward-scale imbalance, missing terminal signal, ineffective long-horizon credit assignment, unnormalised observations, and incorrect win detection – each of which contributes critically to this failure in this setting. After correcting these issues, we uncover a distinct emergent pathology: competitive overfitting, where co-adapting agents maintain stable self-play performance while generalisation win rate collapses from 73.5% to 21.6% . Critically, this failure is undetectable via standard self-play metrics: both agents co-adapt equally, so the self-play win rate remains near 50% throughout the collapse. We propose a minimal intervention – opponent mixing, where 20% of training episodes substitute a fixed uniformly-random policy for the co-adaptive opponent – which mitigates competitive overfitting and restores generalisation to 77.1% ( \pm 12.6% , 10 seeds) without population-based training or additional infrastructure. We open-source Territory Paint Wars to provide a reproducible benchmark for studying competitive MARL failure modes.
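The proposed intervention is simple enough to sketch: per episode, flip a biased coin and train against a frozen uniformly-random policy instead of the co-adapting self-play opponent. All names below are hypothetical placeholders.

```python
import random

def choose_opponent(co_adaptive, frozen_random, mix_prob=0.2, rng=random):
    """Opponent mixing: with probability mix_prob, this episode is
    played against a fixed uniformly-random policy, breaking the
    co-adaptation feedback loop that drives competitive overfitting."""
    return frozen_random if rng.random() < mix_prob else co_adaptive

rng = random.Random(0)
picks = [choose_opponent("self_play", "random_policy", 0.2, rng)
         for _ in range(10_000)]
frac_random = picks.count("random_policy") / len(picks)
```

Over many episodes the empirical fraction of random-opponent episodes concentrates around the 20% mixing rate used in the paper.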

[LG-46] Towards Predicting Multi-Vulnerability Attack Chains in Software Supply Chains from Software Bill of Materials Graphs

链接: https://arxiv.org/abs/2604.04977
作者: Laura Baird,Armin Moin
类目: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted for the ACM International Conference on the Foundations of Software Engineering (FSE) 2026 Ideas, Visions and Reflections (IVR) Track

点击查看摘要

Abstract:Software supply chain security compromises often stem from cascaded interactions of vulnerabilities, for example, between multiple vulnerable components. Yet, Software Bill of Materials (SBOM)-based pipelines for security analysis typically treat scanner findings as independent per-CVE (Common Vulnerabilities and Exposures) records. We propose a new research direction based on learning multi-vulnerability attack chains through a novel SBOM-driven graph-learning approach. This treats SBOM structure and scanner outputs as a dependency-constrained evidence graph rather than a flat list of vulnerabilities. We represent vulnerability-enriched CycloneDX SBOMs as heterogeneous graphs whose nodes capture software components and known vulnerabilities (i.e., CVEs), connected by typed relations, such as dependency and vulnerability links. We train a Heterogeneous Graph Attention Network (HGAT) to predict whether a component is associated with at least one known vulnerability as a feasibility check for learning over this structure. Additionally, we frame the discovery of cascading vulnerabilities as CVE-pair link prediction using a lightweight Multi-Layer Perceptron (MLP) neural network trained on documented multi-vulnerability chains. Validated on 200 real-world SBOMs from the Wild SBOMs public dataset, the HGAT component classifier achieves 91.03% Accuracy and 74.02% F1-score, while the cascade predictor model (MLP) achieves a Receiver Operating Characteristic - Area Under Curve (ROC-AUC) of 0.93 on a seed set of 35 documented attack chains.

[LG-47] A Theory-guided Weighted L2 Loss for solving the BGK model via Physics-informed neural networks

链接: https://arxiv.org/abs/2604.04971
作者: Gyounghun Ko,Sung-Jun Son,Seung Yeon Cho,Myeong-Su Lee
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*备注: 26 pages, 9 figures

点击查看摘要

Abstract:While Physics-Informed Neural Networks offer a promising framework for solving partial differential equations, the standard L^2 loss formulation is fundamentally insufficient when applied to the Bhatnagar-Gross-Krook (BGK) model. Specifically, simply minimizing the standard loss does not guarantee accurate predictions of the macroscopic moments, causing the approximate solutions to fail in capturing the true physical solution. To overcome this limitation, we introduce a velocity-weighted L^2 loss function designed to effectively penalize errors in the high-velocity regions. By establishing a stability estimate for the proposed approach, we show that minimizing the proposed weighted loss guarantees the convergence of the approximate solution. Also, numerical experiments demonstrate that employing this weighted PINN loss leads to superior accuracy and robustness across various benchmarks compared to the standard approach.
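An illustrative NumPy sketch of a velocity-weighted L2 loss over collocation points; the weight (1 + v^2) is an assumption for illustration, not necessarily the paper's exact choice.

```python
import numpy as np

def weighted_l2_loss(residual, v, weight=lambda v: 1.0 + v ** 2):
    """Velocity-weighted L2 loss over collocation points.

    residual : PDE residual of the network at each point
    v        : velocity coordinate of each point
    weight   : increasing function of |v|; (1 + v^2) is an
               illustrative choice, not necessarily the paper's.
    """
    return float(np.mean(weight(v) * residual ** 2))

v = np.linspace(-5.0, 5.0, 101)
residual = np.full_like(v, 0.1)          # same error magnitude everywhere
plain = float(np.mean(residual ** 2))    # standard (unweighted) L2 loss
weighted = weighted_l2_loss(residual, v)
```

For a uniform residual, the standard loss is 0.01 while the weighted loss is 0.01 · mean(1 + v²) = 0.095 on this grid: errors at large |v| dominate the objective, which is exactly the penalization the paper argues is needed for the BGK moments.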

[LG-48] Belief Dynamics for Detecting Behavioral Shifts in Safe Collaborative Manipulation

链接: https://arxiv.org/abs/2604.04967
作者: Devashri Naik,Divake Kumar,Nastaran Darabi,Amit Ranjan Trivedi
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Robots operating in shared workspaces must maintain safe coordination with other agents whose behavior may change during task execution. When a collaborating agent switches strategy mid-episode, continuing under outdated assumptions can lead to unsafe actions and increased collision risk. Reliable detection of such behavioral regime changes is therefore critical. We study regime-switch detection under controlled non-stationarity in ManiSkill shared-workspace manipulation tasks. Across ten detection methods and five random seeds, enabling detection reduces post-switch collisions by 52%. However, average performance hides significant reliability differences: under a realistic tolerance of ±3 steps, detection ranges from 86% to 30%, while under ±5 steps all methods achieve 100%. We introduce UA-TOM, a lightweight belief-tracking module that augments frozen vision-language-action (VLA) control backbones using selective state-space dynamics, causal attention, and prediction-error signals. Across five seeds and 1200 episodes, UA-TOM achieves the highest detection rate among unassisted methods (85.7% at ±3) and the lowest close-range time (4.8 steps), outperforming an Oracle (5.3 steps). Analysis shows hidden-state update magnitude increases by 17x at regime switches and decays over roughly 10 timesteps, while the discretization step converges to a near-constant value ( \Delta_t \approx 0.78 ), indicating sensitivity driven by learned dynamics rather than input-dependent gating. Cross-domain experiments in Overcooked show complementary roles of causal attention and prediction-error signals. UA-TOM introduces 7.4 ms inference overhead (14.8% of a 50 ms control budget), enabling reliable regime-switch detection without modifying the base policy.

[LG-49] Sparse Autoencoders as a Steering Basis for Phase Synchronization in Graph-Based CFD Surrogates

链接: https://arxiv.org/abs/2604.04946
作者: Yeping Hu,Ruben Glatt,Shusen Liu
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Graph-based surrogate models provide fast alternatives to high-fidelity CFD solvers, but their opaque latent spaces and limited controllability restrict use in safety-critical settings. A key failure mode in oscillatory flows is phase drift, where predictions remain qualitatively correct but gradually lose temporal alignment with observations, limiting use in digital twins and closed-loop control. Correcting this through retraining is expensive and impractical during deployment. We ask whether phase drift can instead be corrected post hoc by manipulating the latent space of a frozen surrogate. We propose a phase-steering framework for pretrained graph-based CFD models that combines the right representation with the right intervention mechanism. To obtain disentangled representation for effective steering, we use sparse autoencoders (SAEs) on frozen MeshGraphNet embeddings. To steer dynamics, we move beyond static per-feature interventions such as scaling or clamping, and introduce a temporally coherent, phase-aware method. Specifically, we identify oscillatory feature pairs with Hilbert analysis, project spatial fields into low-rank temporal coefficients via SVD, and apply smooth time-varying rotations to advance or delay periodic modes while preserving amplitude-phase structure. Using a representation-agnostic setup, we compare SAE-based steering with PCA and raw embedding spaces under the same intervention pipeline. Results show that sparse, disentangled representations outperform dense or entangled ones, while static interventions fail in this dynamical setting. Overall, this work shows that latent-space steering can be extended from semantic domains to time-dependent physical systems when interventions respect the underlying dynamics, and that the same sparse features used for interpretability can also serve as physically meaningful control axes.

[LG-50] Hypernetwork-Conditioned Reinforcement Learning for Robust Control of Fixed-Wing Aircraft under Actuator Failures

链接: https://arxiv.org/abs/2604.03392
作者: Dennis Marquis,Mazen Farhood
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a reinforcement learning-based path-following controller for a fixed-wing small uncrewed aircraft system (sUAS) that is robust to certain actuator failures. The controller is conditioned on a parameterization of actuator faults using hypernetwork-based adaptation. We consider parameter-efficient formulations based on Feature-wise Linear Modulation (FiLM) and Low-Rank Adaptation (LoRA), trained using proximal policy optimization. We demonstrate that hypernetwork-conditioned policies can improve robustness compared to standard multilayer perceptron policies. In particular, hypernetwork-conditioned policies generalize effectively to time-varying actuator failure modes not encountered during training. The approach is validated through high-fidelity simulations, using a realistic six-degree-of-freedom fixed-wing aircraft model.
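The FiLM-based conditioning can be sketched as follows: a tiny linear hypernetwork maps the actuator-fault parameterization to per-feature scale and shift applied to a policy hidden layer. The dimensions and the linear hypernetwork are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
fault_dim, hidden_dim = 3, 8

# hypothetical linear hypernetwork: fault parameters -> FiLM params
W_gamma = rng.normal(scale=0.1, size=(hidden_dim, fault_dim))
W_beta = rng.normal(scale=0.1, size=(hidden_dim, fault_dim))

def film(h, fault):
    """Feature-wise Linear Modulation of a policy hidden layer h,
    conditioned on an actuator-fault parameterization (sketch)."""
    gamma = 1.0 + W_gamma @ fault    # multiplicative modulation
    beta = W_beta @ fault            # additive modulation
    return gamma * h + beta

h = rng.normal(size=hidden_dim)
nominal = film(h, np.zeros(fault_dim))          # no fault: identity modulation
faulty = film(h, np.array([0.5, -0.2, 0.0]))    # degraded-actuator condition
```

With a zero fault vector the modulation reduces to the identity, so the conditioned policy falls back to nominal behavior; a nonzero fault vector reshapes the hidden features without retraining the base network, which is the parameter-efficient adaptation idea.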

[LG-51] A Large-Scale Empirical Comparison of Meta-Learners and Causal Forests for Heterogeneous Treatment Effect Estimation in Marketing Uplift Modeling

链接: https://arxiv.org/abs/2604.06123
作者: Aman Singh
类目: Computation (stat.CO); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME)
*备注: 6 pages

点击查看摘要

Abstract:Estimating Conditional Average Treatment Effects (CATE) at the individual level is central to precision marketing, yet systematic benchmarking of uplift modeling methods at industrial scale remains limited. We present UpliftBench, an empirical evaluation of four CATE estimators: S-Learner, T-Learner, X-Learner (all with LightGBM base learners), and Causal Forest (EconML), applied to the Criteo Uplift v2.1 dataset comprising 13.98 million customer records. The near-random treatment assignment (propensity AUC = 0.509) provides strong internal validity for causal estimation. Evaluated via Qini coefficient and cumulative gain curves, the S-Learner achieves the highest Qini score of 0.376, with the top 20% of customers ranked by predicted CATE capturing 77.7% of all incremental conversions, a 3.9x improvement over random targeting. SHAP analysis identifies f8 as the dominant heterogeneous treatment effect (HTE) driver among the 12 anonymized covariates. Causal Forest uncertainty quantification reveals that 1.9% of customers are confident persuadables (lower 95% CI > 0) and 0.1% are confident sleeping dogs (upper 95% CI < 0). Our results provide practitioners with evidence-based guidance on method selection for large-scale uplift modeling pipelines.
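A sketch of the cumulative-gain style evaluation behind the reported top-20% figure, on a synthetic population (the paper's exact Qini/gain definitions may differ): rank customers by predicted CATE, then compare the treated-vs-control conversion lift in the targeted slice to the lift over the whole population.

```python
import numpy as np

def incremental_gain_at(frac, score, treated, converted):
    """Share of total incremental conversions captured by targeting
    the top `frac` of customers ranked by predicted CATE (sketch of a
    cumulative-gain evaluation for uplift models)."""
    order = np.argsort(-score)
    top = order[:int(len(score) * frac)]

    def incremental(idx):
        t = treated[idx] == 1
        c = treated[idx] == 0
        if t.sum() == 0 or c.sum() == 0:
            return 0.0
        # treated-vs-control conversion lift, scaled to treated-group size
        return (converted[idx][t].mean() - converted[idx][c].mean()) * t.sum()

    return incremental(top) / incremental(np.arange(len(score)))

rng = np.random.default_rng(1)
n = 10_000
score = rng.uniform(size=n)            # predicted-CATE ranking score
treated = rng.integers(0, 2, size=n)   # near-random assignment
# persuadables: convert only when treated AND among the top-scoring half
converted = ((treated == 1) & (score > 0.5)).astype(float)
share_top20 = incremental_gain_at(0.20, score, treated, converted)
```

Here responders live in the top-scoring half, so targeting the top 20% captures roughly 40% of incremental conversions, about 2x the 20% a random targeting baseline would capture on this toy population.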

[LG-52] Pixel-Translation-Equivariant Quantum Convolutional Neural Networks via Fourier Multiplexers

链接: https://arxiv.org/abs/2604.06094
作者: Dmitry Chirkov,Igor Lobanov
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Convolutional neural networks owe much of their success to hard-coding translation equivariance. Quantum convolutional neural networks (QCNNs) have been proposed as near-term quantum analogues, but the relevant notion of translation depends on the data encoding. For address/amplitude encodings such as FRQI, a pixel shift acts as modular addition on an index register, whereas many MERA-inspired QCNNs are equivariant only under cyclic permutations of physical qubits. We formalize this mismatch and construct QCNN layers that commute exactly with the pixel cyclic shift (PCS) symmetry induced by the encoding. Our main technical result is a constructive characterization of all PCS-equivariant unitaries: conjugation by the quantum Fourier transform (QFT) diagonalizes translations, so any PCS-equivariant layer is a Fourier-mode multiplexer followed by an inverse QFT (IQFT). Building on this characterization, we introduce a deep PCS-QCNN with measurement-induced pooling, deferred conditioning, and inter-layer QFT cancellation. We also analyze trainability at random initialization and prove a lower bound on the expected squared gradient norm that remains constant in a depth-scaling regime, ruling out a depth-induced barren plateau in that sense.
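
The central algebraic fact used here, that conjugation by the Fourier transform diagonalizes cyclic shifts, can be checked classically in a few lines of numpy, with the DFT standing in for the QFT on an N-dimensional index register:

```python
import numpy as np

N = 8
# Cyclic pixel shift: (S x)[k] = x[(k-1) mod N]  (a circulant permutation)
S = np.roll(np.eye(N), 1, axis=0)
# Unitary DFT matrix (classical analogue of the QFT on the index register)
F = np.fft.fft(np.eye(N)) / np.sqrt(N)
# Conjugation by the Fourier transform diagonalizes the shift
D = F @ S @ F.conj().T
eigvals = np.diag(D)   # N-th roots of unity
```

Every PCS-equivariant layer must therefore act diagonally in this Fourier basis, which is exactly the Fourier-mode-multiplexer-plus-IQFT structure the paper derives.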

[LG-53] Value Mirror Descent for Reinforcement Learning

链接: https://arxiv.org/abs/2604.06039
作者: Zhichao Jia,Guanghui Lan
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Value iteration-type methods have been extensively studied for computing a nearly optimal value function in reinforcement learning (RL). Under a generative sampling model, these methods can achieve sharper sample complexity than policy optimization approaches, particularly in their dependence on the discount factor. In practice, they are often employed for offline training or in simulated environments. In this paper, we consider discounted Markov decision processes with state space S, action space A, discount factor \gamma\in(0,1) and costs in [0,1]. We introduce a novel value optimization method, termed value mirror descent (VMD), which integrates mirror descent from convex optimization into the classical value iteration framework. In the deterministic setting with known transition kernels, we show that VMD converges linearly. For the stochastic setting with a generative model, we develop a stochastic variant, SVMD, which incorporates variance reduction commonly used in stochastic value iteration-type methods. For RL problems with general convex regularizers, SVMD attains a near-optimal sample complexity of \tilde{O}(|S||A|(1-\gamma)^{-3}\epsilon^{-2}). Moreover, we establish that the Bregman divergence between the generated and optimal policies remains bounded throughout the iterations. This property is absent in existing stochastic value iteration-type methods but is important for enabling effective online (continual) learning following offline training. Under a strongly convex regularizer, SVMD achieves sample complexity of \tilde{O}(|S||A|(1-\gamma)^{-5}\epsilon^{-1}), improving performance in the high-accuracy regime. Furthermore, we prove convergence of the generated policy to the optimal policy. Overall, the proposed method, its analysis, and the resulting guarantees constitute new contributions to the RL and optimization literature.
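
For context on the linear-convergence claim: classical value iteration already contracts at rate \gamma in the sup norm, which a quick numpy check confirms. This is the plain Bellman operator on a made-up MDP, not the paper's VMD/SVMD:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition kernel P[s, a, s']
c = rng.uniform(size=(nS, nA))                   # costs in [0, 1]

def bellman(v):
    # (T v)(s) = min_a [ c(s, a) + gamma * E_{s' ~ P(s, a)} v(s') ]
    return (c + gamma * P @ v).min(axis=1)

v_star = np.zeros(nS)
for _ in range(500):                 # run to (essentially) the exact fixed point
    v_star = bellman(v_star)

v, errs = np.zeros(nS), []
for _ in range(20):                  # sup-norm error shrinks by gamma each step
    v = bellman(v)
    errs.append(np.max(np.abs(v - v_star)))
```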

[LG-54] Ensemble-Based Dirichlet Modeling for Predictive Uncertainty and Selective Classification

链接: https://arxiv.org/abs/2604.06032
作者: Courtney Franzen,Farhad Pourkamali-Anaraki
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 48 pages

点击查看摘要

Abstract:Neural network classifiers trained with cross-entropy loss achieve strong predictive accuracy but lack the capability to provide inherent predictive uncertainty estimates, thus requiring external techniques to obtain these estimates. In addition, softmax scores for the true class can vary substantially across independent training runs, which limits the reliability of uncertainty-based decisions in downstream tasks. Evidential Deep Learning aims to address these limitations by producing uncertainty estimates in a single pass, but evidential training is highly sensitive to design choices including loss formulation, prior regularization, and activation functions. Therefore, this work introduces an alternative Dirichlet parameter estimation strategy by applying a method of moments estimator to ensembles of softmax outputs, with an optional maximum-likelihood refinement step. This ensemble-based construction decouples uncertainty estimation from the fragile evidential loss design while also mitigating the variability of single-run cross-entropy training, producing explicit Dirichlet predictive distributions. Across multiple datasets, we show that the improved stability and predictive uncertainty behavior of these ensemble-derived Dirichlet estimates translate into stronger performance in downstream uncertainty-guided applications such as prediction confidence scoring and selective classification.
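
The core moment-matching step, fitting a Dirichlet to an ensemble of softmax vectors, is easy to sketch. Below is one standard method-of-moments estimator for illustration; the paper's exact estimator and maximum-likelihood refinement may differ:

```python
import numpy as np

def dirichlet_mom(P):
    """Method-of-moments Dirichlet fit to the rows of P (probability vectors).
    Uses mean_k = alpha_k / s and var_k = mean_k (1 - mean_k) / (s + 1),
    where s = sum(alpha) is the Dirichlet precision."""
    m, v = P.mean(axis=0), P.var(axis=0)
    # one precision estimate per coordinate, averaged in log space
    s = np.exp(np.mean(np.log(m * (1 - m) / v - 1)))
    return s * m

rng = np.random.default_rng(1)
alpha_true = np.array([4.0, 2.0, 1.0])
P = rng.dirichlet(alpha_true, size=20000)  # stand-in for ensemble softmax outputs
alpha_hat = dirichlet_mom(P)
```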

[LG-55] A deep learning framework for jointly solving transient Fokker-Planck equations with arbitrary parameters and initial distributions

链接: https://arxiv.org/abs/2604.06001
作者: Xiaolong Wang,Jing Feng,Qi Liu,Chengli Tan,Yuanyuan Liu,Yong Xu
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Efficiently solving the Fokker-Planck equation (FPE) is central to analyzing complex parameterized stochastic systems. However, current numerical methods lack parallel computation capabilities across varying conditions, severely limiting comprehensive parameter exploration and transient analysis. This paper introduces a deep learning-based pseudo-analytical probability solution (PAPS) that, via a single training process, simultaneously resolves transient FPE solutions for arbitrary multi-modal initial distributions, system parameters, and time points. The core idea is to unify initial, transient, and stationary distributions via Gaussian mixture distributions (GMDs) and develop a constraint-preserving autoencoder that bijectively maps constrained GMD parameters to unconstrained, low-dimensional latent representations. In this representation space, the panoramic transient dynamics across varying initial conditions and system parameters can be modeled by a single evolution network. Extensive experiments on paradigmatic systems demonstrate that the proposed PAPS maintains high accuracy while achieving inference speeds four orders of magnitude faster than GPU-accelerated Monte Carlo simulations. This efficiency leap enables previously intractable real-time parameter sweeps and systematic investigations of stochastic bifurcations. By decoupling representation learning from physics-informed transient dynamics, our work establishes a scalable paradigm for probabilistic modeling of multi-dimensional, parameterized stochastic systems.

[LG-56] Brain-to-Speech: Prosody Feature Engineering and Transformer-Based Reconstruction

链接: https://arxiv.org/abs/2604.05751
作者: Mohammed Salah Al-Radhi,Géza Németh,Andon Tchechmedjiev,Binbin Xu
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Sound (cs.SD)
*备注: OpenAccess chapter: https://doi.org/10.1007/978-3-032-10561-5_16 . In: Curry, E., et al. Artificial Intelligence, Data and Robotics (2026)

点击查看摘要

Abstract:This chapter presents a novel approach to brain-to-speech (BTS) synthesis from intracranial electroencephalography (iEEG) data, emphasizing prosody-aware feature engineering and advanced transformer-based models for high-fidelity speech reconstruction. Driven by the increasing interest in decoding speech directly from brain activity, this work integrates neuroscience, artificial intelligence, and signal processing to generate accurate and natural speech. We introduce a novel pipeline for extracting key prosodic features directly from complex brain iEEG signals, including intonation, pitch, and rhythm. To effectively utilize these crucial features for natural-sounding speech, we employ advanced deep learning models. Furthermore, this chapter introduces a novel transformer encoder architecture specifically designed for brain-to-speech tasks. Unlike conventional models, our architecture integrates the extracted prosodic features to significantly enhance speech reconstruction, resulting in generated speech with improved intelligibility and expressiveness. A detailed evaluation demonstrates superior performance over established baseline methods, such as traditional Griffin-Lim and CNN-based reconstruction, across both quantitative and perceptual metrics. By demonstrating these advancements in feature extraction and transformer-based learning, this chapter contributes to the growing field of AI-driven neuroprosthetics, paving the way for assistive technologies that restore communication for individuals with speech impairments. Finally, we discuss promising future research directions, including the integration of diffusion models and real-time inference systems.

[LG-57] Untargeted analysis of volatile markers of post-exercise fat oxidation in exhaled breath

链接: https://arxiv.org/abs/2604.05707
作者: André Homeyer,Júlia Blanka Sziládi,Jan-Philipp Redlich,Jonathan Beauchamp,Y Lan Pham
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Breath acetone represents a promising non-invasive biomarker for monitoring fat oxidation during exercise. However, its utility is limited by confounding factors, as well as by the fact that significant changes in concentration occur only hours post-exercise, which makes real-time assessment difficult. We performed an untargeted screening for volatile organic compounds (VOCs) that could serve as markers of fat oxidation beyond acetone, and investigated whether breath measurements taken during exercise could predict post-exercise changes in fat oxidation. Nineteen participants completed two 25-min cycling sessions separated by a brief 5-min rest period. VOC emissions were analysed using proton-transfer-reaction time-of-flight mass spectrometry (PTR-TOF-MS) during exercise and after a 90-min recovery period. Blood \beta-hydroxybutyrate (BOHB) concentrations served as the reference marker for fat oxidation. Among 773 relevant analytical features detected in the PTR-TOF-MS measurements, only four signals exhibited strong correlations with BOHB (\rho \geq 0.82, p = 0.0002), all attributable to acetone or its isotopologues or fragments. End-of-exercise measurements of these signals enabled accurate prediction of participants with substantial post-exercise BOHB changes (F1 score \geq 0.83, accuracy = 0.89). Our study did not reveal any novel breath-based biomarkers of fat oxidation, but it confirmed acetone as the key marker. Moreover, our findings suggest that breath acetone measurements during exercise may already enable basic predictions of post-exercise fat oxidation.

[LG-58] Intrinsic perturbation scale for certified oracle objectives with epigraphic information

链接: https://arxiv.org/abs/2604.05678
作者: Karim Bounja,Boujemaâ Achchab,Abdeljalil Sakat
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Functional Analysis (math.FA)
*备注:

点击查看摘要

Abstract:We introduce a natural displacement control for minimizer sets of oracle objectives equipped with certified epigraphic information. Formally, we replace the usual local uniform value control of objective perturbations - uncertifiable from finite pointwise information without additional structure - by the strictly weaker requirement of a cylinder-localized vertical epigraphic control, naturally provided by certified envelopes. Under set-based quadratic growth (allowing nonunique minimizers), this yields the classical square-root displacement estimate with optimal exponent 1/2, without any extrinsic assumption.
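
For orientation, the classical square-root displacement estimate with exponent 1/2 reads, in its simplest uniform-perturbation form (the paper weakens the uniform bound to a cylinder-localized vertical epigraphic control):

```latex
% Assume quadratic growth around the solution set S_f of f:
%   f(x) \ge \min f + \mu \, \operatorname{dist}(x, S_f)^2,
% and a uniform perturbation bound |\tilde f - f| \le \varepsilon.
% Then for any minimizer \tilde x of \tilde f and any x^\star \in S_f:
\mu \,\operatorname{dist}(\tilde x, S_f)^2
  \le f(\tilde x) - \min f
  \le \tilde f(\tilde x) + \varepsilon - \min f
  \le \tilde f(x^\star) + \varepsilon - f(x^\star)
  \le 2\varepsilon,
\qquad\text{hence}\qquad
\operatorname{dist}(\tilde x, S_f) \le \sqrt{2\varepsilon/\mu}.
```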

[LG-59] Efficient machine unlearning with minimax optimality

链接: https://arxiv.org/abs/2604.05669
作者: Jingyi Xie,Linjun Zhang,Sai Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:There is a growing demand for efficient data removal to comply with regulations like the GDPR and to mitigate the influence of biased or corrupted data. This has motivated the field of machine unlearning, which aims to eliminate the influence of specific data subsets without the cost of full retraining. In this work, we propose a statistical framework for machine unlearning with generic loss functions and establish theoretical guarantees. For squared loss, especially, we develop Unlearning Least Squares (ULS) and establish its minimax optimality for estimating the model parameter of remaining data when only the pre-trained estimator, forget samples, and a small subsample of the remaining data are available. Our results reveal that the estimation error decomposes into an oracle term and an unlearning cost determined by the forget proportion and the forget model bias. We further establish asymptotically valid inference procedures without requiring full retraining. Numerical experiments and real-data applications demonstrate that the proposed method achieves performance close to retraining while requiring substantially less data access.
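
For squared loss, the exact-unlearning idea behind ULS can be illustrated by downdating the normal equations: removing the forget samples' contribution to the sufficient statistics reproduces full retraining exactly. A sketch on synthetic data (this keeps the full Gram matrix for clarity; the paper's ULS works from the pre-trained estimator, forget samples, and only a subsample of the remaining data):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, nf = 500, 4, 50
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
Xf, yf = X[:nf], y[:nf]                  # forget samples
Xr, yr = X[nf:], y[nf:]                  # remaining data

# Pre-trained estimator on the full data (normal equations)
G_full, b_full = X.T @ X, X.T @ y
theta_full = np.linalg.solve(G_full, b_full)

# Unlearning: subtract the forget samples' contribution to the
# sufficient statistics, then re-solve
theta_unlearned = np.linalg.solve(G_full - Xf.T @ Xf, b_full - Xf.T @ yf)

# Ground truth: retraining from scratch on the remaining data only
theta_retrain = np.linalg.lstsq(Xr, yr, rcond=None)[0]
```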

[LG-60] Parametric Nonconvex Optimization via Convex Surrogates

链接: https://arxiv.org/abs/2604.05640
作者: Renzi Wang,Panagiotis Patrinos,Alberto Bemporad
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:This paper presents a novel learning-based approach to construct a surrogate problem that approximates a given parametric nonconvex optimization problem. The surrogate function is designed to be the minimum of a finite set of functions, given by the composition of convex and monotonic terms, so that the surrogate problem can be solved directly through parallel convex optimization. As a proof of concept, numerical experiments on a nonconvex path tracking problem confirm the approximation quality of the proposed method.

[LG-61] Optimal Centered Active Excitation in Linear System Identification

链接: https://arxiv.org/abs/2604.05518
作者: Kaito Ito,Alexandre Proutiere
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注: 11 pages

点击查看摘要

Abstract:We propose an active learning algorithm for linear system identification with optimal centered noise excitation. Notably, our algorithm, based on ordinary least squares and semidefinite programming, attains the minimal sample complexity while allowing for efficient computation of an estimate of a system matrix. More specifically, we first establish lower bounds of the sample complexity for any active learning algorithm to attain the prescribed accuracy and confidence levels. Next, we derive a sample complexity upper bound of the proposed algorithm, which matches the lower bound for any algorithm up to universal factors. Our tight bounds are easy to interpret and explicitly show their dependence on the system parameters such as the state dimension.
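
As background for the ordinary-least-squares component, a generic (non-optimized) version of the pipeline, exciting the system with zero-mean noise and estimating the system matrix by OLS, looks as follows; the optimal excitation design and sample-complexity bounds are what the paper adds on top. The system matrix and noise levels here are made up:

```python
import numpy as np

rng = np.random.default_rng(6)
d, T = 3, 5000
A_true = np.array([[0.9, 0.1, 0.0],
                   [0.0, 0.8, 0.1],
                   [0.1, 0.0, 0.7]])    # stable linear system x' = A x + u + w

x = np.zeros(d)
X0, X1 = [], []
for _ in range(T):
    u = rng.normal(size=d)               # zero-mean ("centered") excitation
    x_next = A_true @ x + u + 0.01 * rng.normal(size=d)
    X0.append(x.copy())
    X1.append(x_next)
    x = x_next
X0, X1 = np.array(X0), np.array(X1)

# OLS estimate of the system matrix: minimize ||X1 - X0 A^T||_F
A_hat = np.linalg.lstsq(X0, X1, rcond=None)[0].T
```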

[LG-62] Transcriptomic Models for Immunotherapy Response Prediction Show Limited Cross-cohort Generalisability

链接: https://arxiv.org/abs/2604.05478
作者: Yuheng Liang,Lucy Chuo,Ahmadreza Argha,Nona Farbehi,Lu Chen,Roohallah Alizadehsani,Mehdi Hosseinzadeh,Amin Beheshti,Thantrira Porntaveetusm,Youqiong Ye,Hamid Alinejad-Rokny
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Immune checkpoint inhibitors (ICIs) have transformed cancer therapy, yet a substantial proportion of patients exhibit intrinsic or acquired resistance, making accurate pre-treatment response prediction a critical unmet need. Transcriptomics-based biomarkers derived from bulk and single-cell RNA sequencing (scRNA-seq) offer a promising avenue for capturing tumour-immune interactions, yet the cross-cohort generalisability of existing prediction models remains unclear. We systematically benchmark nine state-of-the-art transcriptomic ICI response predictors, five bulk RNA-seq-based models (COMPASS, IRNet, NetBio, IKCScore, and TNBC-ICI) and four scRNA-seq-based models (PRECISE, DeepGeneX, Tres and scCURE), using publicly available independent datasets unseen during model development. Overall, predictive performance was modest: bulk RNA-seq models performed at or near chance level across most cohorts, while scRNA-seq models showed only marginal improvements. Pathway-level analyses revealed sparse and inconsistent biomarker signals across models. Although scRNA-seq-based predictors converged on immune-related programs such as allograft rejection, bulk RNA-seq-based models exhibited little reproducible overlap. PRECISE and NetBio identified the most coherent immune-related themes, whereas IRNet predominantly captured metabolic pathways weakly aligned with ICI biology. Together, these findings demonstrate the limited cross-cohort robustness and biological consistency of current transcriptomic ICI prediction models, underscoring the need for improved domain adaptation, standardised preprocessing, and biologically grounded model design.

[LG-63] Task Ecologies and the Evolution of World-Tracking Representations in Large Language Models

链接: https://arxiv.org/abs/2604.05469
作者: Giulio Valentino Dalla Riva
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study language models as evolving model organisms and ask when autoregressive next-token learning selects for world-tracking representations. For any encoding of latent world states, the Bayes-optimal next-token cross-entropy decomposes into the irreducible conditional entropy plus a Jensen–Shannon excess term. That excess vanishes if and only if the encoding preserves the training ecology’s equivalence classes. This yields a precise notion of ecological veridicality for language models and identifies the minimum-complexity zero-excess solution as the quotient partition by training equivalence. We then determine when this fixed-encoding analysis applies to transformer families: frozen dense and frozen Mixture-of-Experts transformers satisfy it, in-context learning does not enlarge the model’s separation set, and per-task adaptation breaks the premise. The framework predicts two characteristic failure modes: simplicity pressure preferentially removes low-gain distinctions, and training-optimal models can still incur positive excess on deployment ecologies that refine the training ecology. A conditional dynamic extension shows how inter-model selection and post-training can recover such gap distinctions under explicit heredity, variation, and selection assumptions. Exact finite-ecology checks and controlled microgpt experiments validate the static decomposition, split-merge threshold, off-ecology failure pattern, and two-ecology rescue mechanism in a regime where the relevant quantities are directly observable. The goal is not to model frontier systems at scale, but to use small language models as laboratory organisms for theory about representational selection.

[LG-64] Hierarchical Contrastive Learning for Multimodal Data

链接: https://arxiv.org/abs/2604.05462
作者: Huichao Li,Junhan Yu,Doudou Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 34 pages, 11 figures

点击查看摘要

Abstract:Multimodal representation learning is commonly built on a shared-private decomposition, treating latent information as either common to all modalities or specific to one. This binary view is often inadequate: many factors are shared by only subsets of modalities, and ignoring such partial sharing can over-align unrelated signals and obscure complementary information. We propose Hierarchical Contrastive Learning (HCL), a framework that learns globally shared, partially shared, and modality-specific representations within a unified model. HCL combines a hierarchical latent-variable formulation with structural sparsity and a structure-aware contrastive objective that aligns only modalities that genuinely share a latent factor. Under uncorrelated latent variables, we prove identifiability of the hierarchical decomposition, establish recovery guarantees for the loading matrices, and derive parameter estimation and excess-risk bounds for downstream prediction. Simulations show accurate recovery of hierarchical structure and effective selection of task-relevant components. On multimodal electronic health records, HCL yields more informative representations and consistently improves predictive performance.

[LG-65] MEC: Machine-Learning-Assisted Generalized Entropy Calibration for Semi-Supervised Mean Estimation

链接: https://arxiv.org/abs/2604.05446
作者: Se Yoon Lee,Jae Kwang Kim
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Obtaining high-quality labels is costly, whereas unlabeled covariates are often abundant, motivating semi-supervised inference methods with reliable uncertainty quantification. Prediction-powered inference (PPI) leverages a machine-learning predictor trained on a small labeled sample to improve efficiency, but it can lose efficiency under model misspecification and suffer from coverage distortions due to label reuse. We introduce Machine-Learning-Assisted Generalized Entropy Calibration (MEC), a cross-fitted, calibration-weighted variant of PPI. MEC improves efficiency by reweighting labeled samples to better align with the target population, using a principled calibration framework based on Bregman projections. This yields robustness to affine transformations of the predictor and relaxes requirements for validity by replacing conditions on raw prediction error with weaker projection-error conditions. As a result, MEC attains the semiparametric efficiency bound under weaker assumptions than existing PPI variants. Across simulations and a real-data application, MEC achieves near-nominal coverage and tighter confidence intervals than CF-PPI and vanilla PPI.

[LG-66] An Actor-Critic Framework for Continuous-Time Jump-Diffusion Controls with Normalizing Flows

链接: https://arxiv.org/abs/2604.05398
作者: Liya Guo,Ruimeng Hu,Xu Yang,Yi Zhu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 29 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Continuous-time stochastic control with time-inhomogeneous jump-diffusion dynamics is central in finance and economics, but computing optimal policies is difficult under explicit time dependence, discontinuous shocks, and high dimensionality. We propose an actor-critic framework that serves as a mesh-free solver for entropy-regularized control problems and stochastic games with jumps. The approach is built on a time-inhomogeneous little q-function and an appropriate occupation measure, yielding a policy-gradient representation that accommodates time-dependent drift, volatility, and jump terms. To represent expressive stochastic policies in continuous-action spaces, we parameterize the actor using conditional normalizing flows, enabling flexible non-Gaussian policies while retaining exact likelihood evaluation for entropy regularization and policy optimization. We validate the method on time-inhomogeneous linear-quadratic control, Merton portfolio optimization, and a multi-agent portfolio game, using explicit solutions or high-accuracy benchmarks. Numerical results demonstrate stable learning under jump discontinuities, accurate approximation of optimal stochastic policies, and favorable scaling with respect to dimension and number of agents.

[LG-67] Individual-heterogeneous sub-Gaussian Mixture Models

链接: https://arxiv.org/abs/2604.05337
作者: Huan Qing
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 32 pages, 4 figures, 2 tables

点击查看摘要

Abstract:The classical Gaussian mixture model assumes homogeneity within clusters, an assumption that often fails in real-world data where observations naturally exhibit varying scales or intensities. To address this, we introduce the individual-heterogeneous sub-Gaussian mixture model, a flexible framework that assigns each observation its own heterogeneity parameter, thereby explicitly capturing the heterogeneity inherent in practical applications. Built upon this model, we propose an efficient spectral method that provably achieves exact recovery of the true cluster labels under mild separation conditions, even in high-dimensional settings where the number of features far exceeds the number of samples. Numerical experiments on both synthetic and real data demonstrate that our method consistently outperforms existing clustering algorithms, including those designed for classical Gaussian mixture models.
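
The role of the per-observation heterogeneity parameter can be illustrated on synthetic data: normalizing each observation removes its individual scale, after which a simple clustering step recovers the labels. A toy numpy sketch with a plain 2-means stand-in, not the paper's spectral method:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 200, 10
labels = rng.integers(0, 2, size=n)
centers = np.eye(d)[:2]                   # two orthogonal cluster directions
w = rng.uniform(0.5, 3.0, size=n)         # per-observation heterogeneity scale
X = w[:, None] * centers[labels] + 0.05 * rng.normal(size=(n, d))

# Row normalization removes the individual scale w_i before clustering
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)

# Tiny 2-means (Lloyd) with farthest-point initialization
c = np.stack([Xn[0], Xn[np.argmax(np.linalg.norm(Xn - Xn[0], axis=1))]])
for _ in range(10):
    assign = np.linalg.norm(Xn[:, None] - c[None], axis=2).argmin(axis=1)
    c = np.stack([Xn[assign == k].mean(axis=0) for k in range(2)])

# clustering accuracy up to label permutation
acc = max((assign == labels).mean(), (assign != labels).mean())
```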

[LG-68] Robust Learning of Heterogeneous Dynamic Systems

链接: https://arxiv.org/abs/2604.05285
作者: Shuoxun Xu,Zijian Guo,Brooke R. Staveland,Robert T. Knight,Lexin Li
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ordinary differential equations (ODEs) provide a powerful framework for modeling dynamic systems arising in a wide range of scientific domains. However, most existing ODE methods focus on a single system, and do not adequately address the problem of learning shared patterns from multiple heterogeneous dynamic systems. In this article, we propose a novel distributionally robust learning approach for modeling heterogeneous ODE systems. Specifically, we construct a robust dynamic system by maximizing a worst-case reward over an uncertainty class formed by convex combinations of the derivatives of trajectories. We show the resulting estimator admits an explicit weighted average representation, where the weights are obtained from a quadratic optimization that balances information across multiple data sources. We further develop a bi-level stabilization procedure to address potential instability in estimation. We establish rigorous theoretical guarantees for the proposed method, including consistency of the stabilized weights, an error bound for robust trajectory estimation, and asymptotic validity of pointwise confidence intervals. We demonstrate that the proposed method considerably improves the generalization performance compared to alternative solutions through both extensive simulations and the analysis of intracranial electroencephalogram data.

[LG-69] fastml: Guarded Resampling Workflows for Safer Automated Machine Learning in R

链接: https://arxiv.org/abs/2604.05225
作者: Selcuk Korkmaz,Dincer Goksuluk,Eda Karaismailoglu
类目: Computation (stat.CO); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 36 pages, 2 figures

点击查看摘要

Abstract:Preprocessing leakage arises when scaling, imputation, or other data-dependent transformations are estimated before resampling, inflating apparent performance while remaining hard to detect. We present fastml, an R package that provides a single-call interface for leakage-aware machine learning through guarded resampling, where preprocessing is re-estimated inside each resample and applied to the corresponding assessment data. The package supports grouped and time-ordered resampling, blocks high-risk configurations, audits recipes for external dependencies, and includes sandboxed execution and integrated model explanation. We evaluate fastml with a Monte Carlo simulation contrasting global and fold-local normalization, a usability comparison with tidymodels under matched specifications, and survival benchmarks across datasets of different sizes. The simulation demonstrates that global preprocessing substantially inflates apparent performance relative to guarded resampling. fastml matched held-out performance obtained with tidymodels while reducing workflow orchestration, and it supported consistent benchmarking of multiple survival model classes through a unified interface.
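
fastml itself is an R package; the guarded-versus-global distinction it enforces can be illustrated language-neutrally in numpy, with preprocessing statistics re-estimated inside each resample and applied only to the assessment split. Helper names here are illustrative, not fastml's API:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(loc=3.0, scale=2.0, size=(90, 2))

def kfold(n, k):
    # index pairs (analysis split, assessment split) for k folds
    folds = np.array_split(np.arange(n), k)
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
            for i in range(k)]

# Leaky variant: scaling statistics estimated once on ALL rows,
# so every assessment row has already influenced the preprocessing
mu_global, sd_global = X.mean(axis=0), X.std(axis=0)

guarded_stats = []
for train_idx, test_idx in kfold(len(X), 3):
    # Guarded variant: statistics re-estimated inside each resample on the
    # analysis split only, then applied to the held-out assessment split
    mu, sd = X[train_idx].mean(axis=0), X[train_idx].std(axis=0)
    X_assess = (X[test_idx] - mu) / sd
    guarded_stats.append((mu, sd))
```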

[LG-70] Graph Signal Diffusion Models for Wireless Resource Allocation

链接: https://arxiv.org/abs/2604.05175
作者: Yigit Berkay Uslu,Samar Hadou,Shirin Saeedi Bidokhti,Alejandro Ribeiro
类目: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Under review for SPAWC’26

点击查看摘要

Abstract:We consider constrained ergodic resource optimization in wireless networks with graph-structured interference. We train a diffusion model policy to match expert conditional distributions over resource allocations. By leveraging a primal-dual (expert) algorithm, we generate primal iterates that serve as draws from the corresponding expert conditionals for each training network instance. We view the allocations as stochastic graph signals supported on known channel state graphs. We implement the diffusion model architecture as a U-Net hierarchy of graph neural network (GNN) blocks, conditioned on the channel states and additional node states. At inference, the learned generative model amortizes the iterative expert policy by directly sampling allocation vectors from the near-optimal conditional distributions. In a power-control case study, we show that time-sharing the generated power allocations achieves near-optimal ergodic sum-rate utility and near-feasible ergodic minimum-rates, with strong generalization and transferability across network states.

[LG-71] Learning to Unscramble Feynman Loop Integrals with SAILIR

链接: https://arxiv.org/abs/2604.05034
作者: David Shih
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
*备注: 16 pages, 3 figures, 5 tables, work done in collaboration with Claude Code

点击查看摘要

Abstract:Integration-by-parts (IBP) reduction of Feynman integrals to master integrals is a key computational bottleneck in precision calculations in high-energy physics. Traditional approaches based on the Laporta algorithm require solving large systems of equations, leading to memory consumption that grows rapidly with integral complexity. We present SAILIR (Self-supervised AI for Loop Integral Reduction), a new machine learning approach in which a transformer-based classifier guides the reduction of integrals one step at a time in a fully online fashion. The classifier is trained in an entirely self-supervised manner on synthetic data generated by a scramble/unscramble procedure: known reduction identities are applied in reverse to build expressions of increasing complexity, and the classifier learns to undo these steps. When combined with beam search and a highly parallelized, asynchronous, single-episode reduction strategy, SAILIR can reduce integrals of arbitrarily high weight with bounded memory. We benchmark SAILIR on the two-loop triangle-box topology, comparing against the state-of-the-art IBP reduction code Kira across 16 integrals of varying complexity. While SAILIR is slower in wall-clock time, its per-worker memory consumption remains approximately flat regardless of integral complexity, in contrast to Kira whose memory grows rapidly with complexity. For the most complex integrals considered here, SAILIR uses only 40% of the memory of Kira while achieving comparable reduction times. This demonstrates a fundamentally new paradigm for IBP reduction in which the memory bottleneck of Laporta-based approaches could be entirely overcome, potentially opening the door to precision calculations that are currently intractable.

[LG-72] Generative Path-Law Jump-Diffusion: Sequential MMD-Gradient Flows and Generalisation Bounds in Marcus-Signature RKHS

链接: https://arxiv.org/abs/2604.05008
作者: Daniel Bloch
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Mathematical Finance (q-fin.MF); Statistical Finance (q-fin.ST)
*备注:

点击查看摘要

Abstract:This paper introduces a novel generative framework for synthesising forward-looking, càdlàg stochastic trajectories that are sequentially consistent with time-evolving path-law proxies, thereby incorporating anticipated structural breaks, regime shifts, and non-autonomous dynamics. By framing path synthesis as a sequential matching problem on restricted Skorokhod manifolds, we develop the \textitAnticipatory Neural Jump-Diffusion (ANJD) flow, a generative mechanism that effectively inverts the time-extended Marcus-sense signature. Central to this approach is the Anticipatory Variance-Normalised Signature Geometry (AVNSG), a time-evolving precision operator that performs dynamic spectral whitening on the signature manifold to ensure contractivity during volatile regime shifts and discrete aleatoric shocks. We provide a rigorous theoretical analysis demonstrating that the joint generative flow constitutes an infinitesimal steepest descent direction for the Maximum Mean Discrepancy functional relative to a moving target proxy. Furthermore, we establish statistical generalisation bounds within the restricted path-space and analyse the Rademacher complexity of the whitened signature functionals to characterise the expressive power of the model under heavy-tailed innovations. The framework is implemented via a scalable numerical scheme involving Nyström-compressed score-matching and an anticipatory hybrid Euler-Maruyama-Marcus integration scheme. Our results demonstrate that the proposed method captures the non-commutative moments and high-order stochastic texture of complex, discontinuous path-laws with high computational efficiency.
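
The MMD functional that drives the generative flow reduces, for a fixed kernel, to a simple three-term estimate. Here is a generic Gaussian-kernel version on toy data; the paper's kernel lives on a Marcus-signature RKHS, which is not reproduced here:

```python
import numpy as np

def mmd2_biased(X, Y, sigma=1.0):
    """Biased MMD^2 estimate with a Gaussian RBF kernel:
    mean k(X,X) + mean k(Y,Y) - 2 mean k(X,Y)."""
    def k(A, B):
        d2 = ((A[:, None] - B[None]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(7)
same = mmd2_biased(rng.normal(size=(400, 2)),
                   rng.normal(size=(400, 2)))            # same distribution
diff = mmd2_biased(rng.normal(size=(400, 2)),
                   rng.normal(loc=1.5, size=(400, 2)))   # shifted distribution
```

A gradient flow that matches a moving target proxy, as in the paper, descends this quantity with the target sample held fixed at each step.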

[LG-73] The Hiremath Early Detection (HED) Score: A Measure-Theoretic Evaluation Standard for Temporal Intelligence

链接: https://arxiv.org/abs/2604.04993
作者: Prakul Sunil Hiremath
类目: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 11 pages. Introduces a measure-theoretic framework for predictive velocity including the Hiremath Standard Table. Dedicated to the Hiremath lineage

点击查看摘要

Abstract:We introduce the Hiremath Early Detection (HED) Score, a principled, measure-theoretic evaluation criterion for quantifying the time-value of information in systems operating over non-stationary stochastic processes subject to abrupt regime transitions. Existing evaluation paradigms, chiefly the ROC/AUC framework and its downstream variants, are temporally agnostic: they assign identical credit to a detection at t + 1 and a detection at t + tau for arbitrarily large tau. This indifference to latency is a fundamental inadequacy in time-critical domains including cyber-physical security, algorithmic surveillance, and epidemiological monitoring. The HED Score resolves this by integrating a baseline-neutral, exponentially decaying kernel over the posterior probability stream of a target regime, beginning precisely at the onset of the regime shift. The resulting scalar simultaneously encodes detection acuity, temporal lead, and pre-transition calibration quality. We prove that the HED Score satisfies three axiomatic requirements: (A1) Temporal Monotonicity, (A2) Invariance to Pre-Attack Bias, and (A3) Sensitivity Decomposability. We further demonstrate that the HED Score admits a natural parametric family indexed by the Hiremath Decay Constant (lambda_H), whose domain-specific calibration constitutes the Hiremath Standard Table. As an empirical vehicle, we present PARD-SSM (Probabilistic Anomaly and Regime Detection via Switching State-Space Models), which couples fractional Stochastic Differential Equations (fSDEs) with a Switching Linear Dynamical System (S-LDS) inference backend. On the NSL-KDD benchmark, PARD-SSM achieves a HED Score of 0.0643, representing a 388.8 percent improvement over a Random Forest baseline (0.0132), with statistical significance confirmed via block-bootstrap resampling (p < 0.001). We propose the HED Score as the successor evaluation standard to ROC/AUC.
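The core construction, an exponentially decaying kernel integrated over the posterior probability of the target regime starting at the shift onset, can be sketched in a few lines. The decay constant `lam` plays the role of the paper's lambda_H; the exact kernel and normalisation below are our assumptions, not the paper's definition.

```python
import numpy as np

def hed_score(posterior, onset, lam=0.1):
    """Toy reading of the HED idea: weight each post-onset posterior
    probability of the target regime by an exponentially decaying
    kernel, so earlier detections earn more credit."""
    post = np.asarray(posterior, dtype=float)
    t = np.arange(len(post))
    # kernel is zero before the regime-shift onset, then decays
    weights = np.where(t >= onset, np.exp(-lam * (t - onset)), 0.0)
    return float(np.sum(weights * post))
```

This already exhibits the paper's (A1) Temporal Monotonicity: a detector whose posterior rises 5 steps after onset scores strictly higher than one that rises 30 steps after onset, whereas ROC/AUC would not distinguish them.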

[LG-74] An Imbalanced Dataset with Multiple Feature Representations for Studying Quality Control of Next-Generation Sequencing

链接: https://arxiv.org/abs/2604.04981
作者: Philipp Röchner,Clarissa Krämer,Johannes U Mayer,Franz Rothlauf,Steffen Albrecht,Maximilian Sprang
类目: Genomics (q-bio.GN); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Next-generation sequencing (NGS) is a key technique for studying the DNA and RNA of organisms. However, identifying quality problems in NGS data across different experimental settings remains challenging. To develop automated quality-control tools, researchers require datasets with features that capture the characteristics of quality problems. Existing NGS repositories, however, offer only a limited number of quality-related features. To address this gap, we propose a dataset derived from 37,491 NGS samples with two types of quality-related feature representations. The first type consists of 34 features derived from quality control tools (QC-34 features). The second type has a variable number of features ranging from eight to 1,183. These features were derived from read counts in problematic genomic regions identified by the ENCODE blocklist (BL features). All features describe the same human and mouse samples from five genomic assays, allowing direct comparison of feature representations. The proposed dataset includes a binary quality label, derived from automated quality control and domain experts. Among all samples, 3.2% are of low quality. Supervised machine learning algorithms accurately predicted quality labels from the features, confirming the relevance of the provided feature representations. The proposed feature representations enable researchers to study how different feature types (QC-34 vs. BL features) and granularities (varying number of BL features) affect the detection of quality problems.

[LG-75] StrADiff: A Structured Source-Wise Adaptive Diffusion Framework for Linear and Nonlinear Blind Source Separation

链接: https://arxiv.org/abs/2604.04973
作者: Yuan-Hao Wei
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:This paper presents a Structured Source-Wise Adaptive Diffusion Framework for linear and nonlinear blind source separation. The framework interprets each latent dimension as a source component and assigns to it an individual adaptive diffusion mechanism, thereby establishing source-wise latent modeling rather than relying on a single shared latent prior. The resulting formulation learns source recovery and the mixing/reconstruction process jointly within a unified end-to-end objective, allowing model parameters and latent sources to adapt simultaneously during training. This yields a common framework for both linear and nonlinear blind source separation. In the present instantiation, each source is further equipped with its own adaptive Gaussian process (GP) prior to impose source-wise temporal structure on the latent trajectories, while the overall framework is not restricted to Gaussian process priors and can in principle accommodate other structured source priors. The proposed model thus provides a general structured diffusion-based route to unsupervised source recovery, with potential relevance beyond blind source separation to interpretable latent modeling, source-wise disentanglement, and potentially identifiable nonlinear latent-variable learning under appropriate structural conditions.
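The source-wise prior idea, one independent structured prior per latent dimension rather than a single shared latent prior, can be sketched with per-source GP draws. The RBF kernel and fixed lengthscales below are placeholder assumptions; in the paper these priors are adaptive and learned jointly with the diffusion model.

```python
import numpy as np

def sample_sourcewise_gp(T, lengthscales, seed=0):
    """Draw one GP trajectory per source component, each with its
    own RBF lengthscale, and stack them as latent sources."""
    rng = np.random.default_rng(seed)
    t = np.arange(T, dtype=float)[:, None]
    draws = []
    for ell in lengthscales:  # one independent kernel per source
        K = np.exp(-0.5 * (t - t.T) ** 2 / ell ** 2) + 1e-6 * np.eye(T)
        draws.append(rng.multivariate_normal(np.zeros(T), K))
    return np.stack(draws, axis=1)  # shape (T, n_sources)
```

Giving each source its own lengthscale imposes source-wise temporal structure: a source with a longer lengthscale varies more smoothly, which is the kind of per-dimension disentanglement the framework exploits.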

[LG-76] Learning Nonlinear Regime Transitions via Semi-Parametric State-Space Models

链接: https://arxiv.org/abs/2604.04963
作者: Prakul Sunil Hiremath
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 12 pages, 1 figure, 2 tables

点击查看摘要

Abstract:We develop a semi-parametric state-space model for time-series data with latent regime transitions. Classical Markov-switching models use fixed parametric transition functions, such as logistic or probit links, which restrict flexibility when transitions depend on nonlinear and context-dependent effects. We replace this assumption with learned functions $f_0, f_1 \in \mathcal{H}$, where $\mathcal{H}$ is either a reproducing kernel Hilbert space or a spline approximation space, and define transition probabilities as $p_{jk,t} = \mathrm{sigmoid}(f(\mathbf{x}_{t-1}))$. The transition functions are estimated jointly with emission parameters using a generalized Expectation-Maximization algorithm. The E-step uses the standard forward-backward recursion, while the M-step reduces to a penalized regression problem with weights from smoothed occupation measures. We establish identifiability conditions and provide a consistency argument for the resulting estimators. Experiments on synthetic data show improved recovery of nonlinear transition dynamics compared to parametric baselines. An empirical study on financial time series demonstrates improved regime classification and earlier detection of transition events.
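The time-varying transition construction can be sketched for two regimes. The callables `f01` and `f10` stand in for the learned functions f_0, f_1 (the paper estimates them in an RKHS or spline space inside a generalized EM loop; here they are arbitrary placeholders).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def transition_matrices(x, f01, f10):
    """Build a sequence of 2x2 transition matrices with entries
    p_{jk,t} = sigmoid(f(x_{t-1})), as in the paper's setup."""
    T = len(x)
    P = np.zeros((T, 2, 2))
    P[0] = np.eye(2)  # no transition before the first observation
    for t in range(1, T):
        p01 = sigmoid(f01(x[t - 1]))  # prob of switching regime 0 -> 1
        p10 = sigmoid(f10(x[t - 1]))  # prob of switching regime 1 -> 0
        P[t] = [[1.0 - p01, p01], [p10, 1.0 - p10]]
    return P
```

Plugging these time-varying matrices into the standard forward-backward recursion gives the E-step of the paper's generalized EM; the nonlinear dependence on the covariate `x` is exactly what the fixed logistic/probit links of classical Markov-switching models cannot express when `f` is non-monotone or context-dependent.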

[LG-77] Identification and Inference in Nonlinear Dynamic Network Models

链接: https://arxiv.org/abs/2604.04961
作者: Diego Vallarino
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We study identification and inference in nonlinear dynamic systems defined on unknown interaction networks. The system evolves through an unobserved dependence matrix governing cross-sectional shock propagation via a nonlinear operator. We show that the network structure is not generically identified, and that identification requires sufficient spectral heterogeneity. In particular, identification arises when the network induces non-exchangeable covariance patterns through heterogeneous amplification of eigenmodes. When the spectrum is concentrated, dependence becomes observationally equivalent to common shocks or scalar heterogeneity, leading to non-identification. We provide necessary and sufficient conditions for identification, characterize observational equivalence classes, and propose a semiparametric estimator with asymptotic theory. We also develop tests for network dependence whose power depends on spectral properties of the interaction matrix. The results apply to a broad class of economic models, including production networks, contagion models, and dynamic interaction systems.

附件下载

点击下载今日全部论文列表