This post contains the latest paper listing retrieved from Arxiv.org on 2026-04-30. It is updated automatically and organized into six major directions: NLP, CV, ML, AI, IR, and MA.

Note: Paper data is fetched from Arxiv.org daily, with an automatic scheduled update at around 12:30 each day.

Tip: If a given day's listing is not updated on time, either arXiv published no new papers that day or the update script failed. Failures are fixed the same day whenever possible.

Table of Contents

Overview (2026-04-30)

A total of 499 papers are updated today, including:

  • Natural Language Processing: 79 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 115 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 93 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 127 papers (Machine Learning (cs.LG))
  • Multi-Agent Systems: 8 papers (Multiagent Systems (cs.MA))
  • Information Retrieval: 18 papers (Information Retrieval (cs.IR))
  • Human-Computer Interaction: 15 papers (Human-Computer Interaction (cs.HC))

Multi-Agent Systems

[MA-0] Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations

[Quick Read]: This paper targets the efficiency bottleneck in operating and maintaining (OM) large-scale online engine systems (search, recommendation, advertising), where release monitoring, alert response, and root cause analysis depend heavily on manual effort. The core challenge is to efficiently filter the signals relevant to the current event out of massive data (metrics, logs, change events) and operational knowledge (handbook rules and engineer experience), avoiding the reasoning dilution and hallucination caused by information overload. The key to the solution is the agentic framework Bian Que, with three innovations: first, a unified operational paradigm that abstracts day-to-day OM into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; second, a Flexible Skill Arrangement mechanism, under which each skill automatically binds the data and knowledge required by a specific business module and can be generated by LLMs or iteratively refined through natural-language instructions from engineers; third, a unified self-evolving mechanism in which a single correction signal simultaneously drives two pathways, case-memory-to-knowledge distillation and targeted skill refinement, enabling continual learning and adaptation. Deployed on KuaiShou's e-commerce search system, the framework reduces alert volume by 75%, raises root cause analysis accuracy to 80%, and cuts mean time to resolution (MTTR) by 50%.

Link: https://arxiv.org/abs/2604.26805
Authors: Bochao Liu,Zhipeng Qian,Yang Zhao,Xinyuan Jiang,Zihan Liang,Yufei Ma,Junpeng Zhuang,Ben Chen,Shuo Yang,Hongen Wan,Yao Wu,Chenyi Lei,Xiao Liang
Affiliations: Kuaishou Technology
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Codes are this https URL

Abstract:Operating and maintaining (OM) large-scale online engine systems (search, recommendation, advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. While LLM-based agents are a natural fit for these tasks, the deployment bottleneck is not reasoning capability but orchestration: selecting, for each operational event, the relevant data (metrics, logs, change events) and the applicable operational knowledge (handbook rules and practitioner experience). Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event-to-(data, knowledge) mapping is intractable under dozens of daily releases. We present Bian Que, an agentic framework with three contributions: (i) a unified operational paradigm abstracting day-to-day OM into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) Flexible Skill Arrangement, where each Skill specifies which data and knowledge to retrieve for a given business-module context and can be automatically generated and updated by LLMs or iteratively refined through natural-language instructions from on-call engineers; (iii) a unified self-evolving mechanism in which one correction signal drives two parallel pathways, case-memory-to-knowledge distillation and targeted Skill refinement. Deployed on the e-commerce search engine of KuaiShou, the major short-video platform in China, Bian Que reduces alert volume by 75%, achieves 80% root-cause analysis accuracy, and cuts mean time to resolution by over 50%. Our framework achieves 99.0% pass rate on offline evaluations. Our code is available at this https URL.

[MA-1] Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation

[Quick Read]: This paper addresses the pervasive problem of artificial consensus in multi-agent deliberation systems: in policy-simulation settings, evaluator agents assigned different value perspectives tend to converge on the same option, undermining the diversity and authenticity of deliberation. The core solution is a three-phase deliberation framework, the AI Council, with two key interventions. First, architectural heterogeneity, assigning a different 7-9B parameter LLM to each value perspective, significantly reduces first-choice concentration (e.g., from 70.9% to 46.1% in the child-welfare scenario, p < 0.001). Second, coherence validation, using a frontier model to assess whether each agent's reasoning is grounded in its assigned values, reveals a fidelity-diversity tradeoff: it further reduces concentration in scenarios with a clearly dominant option, but increases it when options are genuinely competitive, because high-coherence agents cluster on the same option. These findings suggest that in deliberation tasks with no objectively correct answer, model diversity and quality-weighting mechanisms can trigger complex behavioral dynamics and must be designed carefully to preserve deliberative validity.

Link: https://arxiv.org/abs/2604.26561
Authors: Ariel Sela
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 14 pages, 7 tables, 120 deliberations across 2 policy scenarios

Abstract:Multi-agent deliberation systems using large language models (LLMs) are increasingly proposed for policy simulation, yet they suffer from artificial consensus: evaluator agents converge on the same option regardless of their assigned value perspectives. We present the AI Council, a three-phase deliberation framework, and conduct 120 deliberations across two policy scenarios to test two interventions. First, architectural heterogeneity (assigning a different 7-9B parameter model to each value perspective) significantly reduces first-choice concentration compared to a homogeneous baseline (child welfare: 70.9% to 46.1%, p < 0.001, r = 0.58; housing: 46.0% to 22.9%, p < 0.001, r = 0.50). This contrasts with accuracy-oriented multi-agent debate, where heterogeneity does not reduce convergence, suggesting model diversity operates differently when no objectively correct answer exists. Second, coherence validation (using a frontier model to assess whether each evaluator’s reasoning is grounded in its assigned values) reveals a fidelity-diversity tradeoff: on a scenario with a dominant option, it further reduces concentration (46.1% to 40.8%, p = 0.004), but on a scenario with genuinely competitive options, it increases concentration (22.9% to 26.6%, p = 0.96) by amplifying high-coherence evaluators who cluster on one option. This tradeoff may be a general property of multi-agent systems employing quality weighting. We report negative results from three failed Delphi designs, demonstrate that 8B models exhibit binary rather than graded responses to counter-arguments, and propose the trustworthy tension rate as a diagnostic measure of small-model deliberation capabilities.

[MA-2] AGEL-Comp: A Neuro-Symbolic Framework for Compositional Generalization in Interactive Agents

[Quick Read]: This paper tackles the systematic failure of large language model (LLM) agents at compositional generalization, aiming to improve their robustness in interactive environments. The key to the solution is the AGEL-Comp architecture, built on three core innovations: (1) a world model based on a dynamic Causal Program Graph (CPG) that represents procedural and causal knowledge as a directed hypergraph; (2) an Inductive Logic Programming (ILP) engine that synthesizes new Horn clauses from experiential feedback, grounding symbolic knowledge through interaction; and (3) a hybrid reasoning core in which the LLM generates candidate sub-goals that a Neural Theorem Prover (NTP) verifies for logical consistency, forming a deduction-abduction learning loop that lets the agent deduce plans and abductively expand its symbolic world model, while a neural adaptation phase keeps the reasoning engine consistent with new knowledge.

Link: https://arxiv.org/abs/2604.26522
Authors: Mahnoor Shahid,Hannes Rothe
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA); Symbolic Computation (cs.SC)
Comments: Accepted at IntelliSys 2026

Abstract:Large Language Model (LLM)-based agents exhibit systemic failures in compositional generalization, limiting their robustness in interactive environments. This work introduces AGEL-Comp, a neuro-symbolic AI agent architecture designed to address this challenge by grounding actions of the agent. AGEL-Comp integrates three core innovations: (1) a dynamic Causal Program Graph (CPG) as a world model, representing procedural and causal knowledge as a directed hypergraph; (2) an Inductive Logic Programming (ILP) engine that synthesizes new Horn clauses from experiential feedback, grounding symbolic knowledge through interaction; and (3) a hybrid reasoning core where an LLM proposes a set of candidate sub-goals that are verified for logical consistency by a Neural Theorem Prover (NTP). Together, these components operationalize a deduction–abduction learning cycle: enabling the agent to deduce plans and abductively expand its symbolic world model, while a neural adaptation phase keeps its reasoning engine aligned with new knowledge. We propose an evaluation protocol within the Retro Quest simulation environment to probe compositional generalization scenarios and evaluate our AGEL agent. Our findings clearly indicate that our AGEL agent outperforms pure LLM-based models. Our framework presents a principled path toward agents that build an explicit, interpretable, and compositionally structured understanding of their world.

[MA-3] Split over n resource sharing problem: Are fewer capable agents better than many simpler ones?

[Quick Read]: This paper studies how to allocate resources in multi-agent systems: given a limited budget, should resources be concentrated in a few capable agents or distributed across many simpler ones? The core of the solution is a "split over n" resource-sharing model. Formal analysis and computer simulations show that when each agent's disk-shaped footprint scales inversely with the number of agents n, the initial coverage rate grows with n; if agent speed decreases proportionally with radius, groups of all sizes perform equally well; but if speed decreases proportionally with footprint, a single agent performs best. The study further finds that splitting resources increases individual failure rates, providing theoretical grounding and design guidance for choosing the optimal degree of distribution under resource constraints.

Link: https://arxiv.org/abs/2604.26374
Authors: Karthik Soma,Mohamed S. Talamali,Genki Miyauchi,Giovanni Beltrame,Heiko Hamann,Roderich Gross
Affiliations: Unknown
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: Short paper presented at the 15th International Conference on Swarm Intelligence (ANTS 2026)

Abstract:In multi-agent systems, should limited resources be concentrated into a few capable agents or distributed among many simpler ones? This work formulates the split over n resource sharing problem where a group of n agents equally shares a common resource (e.g., monetary budget, computational resources, physical size). We present a case study in multi-agent coverage where the area of the disk-shaped footprint of agents scales as 1/n. A formal analysis reveals that the initial coverage rate grows with n. However, if the speed of agents decreases proportionally with their radii, groups of all sizes perform equally well, whereas if it decreases proportionally with their footprints, a single agent performs best. We also present computer simulations in which resource splitting increases the failure rates of individual agents. The models and findings help identify optimal distributiveness levels and inform the design of multi-agent systems under resource constraints.
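The three speed-scaling regimes in the analysis can be checked with a few lines of arithmetic, taking the instantaneous coverage rate of a single disk agent as proportional to 2·r·v and summing over n agents (a simplified sketch of the scaling argument, not the paper's full model):

```python
import math

def coverage_rate(n, r1=1.0, v1=1.0, speed_scaling="constant"):
    """Instantaneous coverage rate ~ n * 2 * r_n * v_n for n disk agents.

    Total footprint area is shared equally, so r_n = r1 / sqrt(n)
    (disk area scales as 1/n). Speed may shrink with agent size:
      - "constant":  v_n = v1
      - "radius":    v_n = v1 * (r_n / r1)
      - "footprint": v_n = v1 * (r_n / r1) ** 2
    """
    r_n = r1 / math.sqrt(n)
    if speed_scaling == "constant":
        v_n = v1
    elif speed_scaling == "radius":
        v_n = v1 * (r_n / r1)
    elif speed_scaling == "footprint":
        v_n = v1 * (r_n / r1) ** 2
    else:
        raise ValueError(speed_scaling)
    return n * 2 * r_n * v_n

# Constant speed: rate grows as sqrt(n) -> more, smaller agents cover faster.
# Speed ~ radius: rate is independent of n -> all group sizes tie.
# Speed ~ footprint: rate shrinks as 1/sqrt(n) -> a single agent is best.
```

Working through the algebra: with constant speed the rate is 2·r1·v1·√n, with speed proportional to radius it collapses to the constant 2·r1·v1, and with speed proportional to footprint it becomes 2·r1·v1/√n, matching the three outcomes in the abstract.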

[MA-4] When Agents Shop for You: Role Coherence in AI-Mediated Markets

[Quick Read]: The problem addressed: when consumers delegate purchase decisions to language-model-based buyer agents, sellers can infer the consumer's willingness to pay by analyzing the agent's dialogue, leading to preference leakage. This leakage does not stem from failures in following user instructions; it arises from an information channel created by delegation itself: role coherence. The key to the solution is architectural interventions that trade off personalization against preference privacy, rather than simply working around the problem with prompt-level mitigations.

Link: https://arxiv.org/abs/2604.26220
Authors: Soogand Alavi,Salar Nozari
Affiliations: University of Iowa
Subjects: Multiagent Systems (cs.MA); General Economics (econ.GN)
Comments:

Abstract:Consumers are increasingly delegating purchase decisions to AI agents, providing natural-language descriptions of their preferences and identity. We argue that these representations constitute an information channel, role coherence, through which sellers can infer willingness to pay without explicit disclosure by the buyer agent, leading to preference leakage. In an experiment where a language-model buyer agent shops on behalf of a verbal consumer profile, we show that seller-side inference from dialogue alone recovers willingness to pay nearly one-for-one. Comparing this setting to a numeric-budget condition with confidentiality instructions cleanly isolates role coherence as distinct from instruction-following failure. Because this leakage arises from delegation itself, it cannot be mitigated at the prompt level. Instead, we propose architectural interventions that trade off personalization against preference privacy.

[MA-5] Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

[Quick Read]: This paper addresses the reliability of autonomous language-model agents executing user mandates under real capital: how to ensure the full path from a user's natural-language strategy to final onchain transaction settlement is stable, accurate, and verifiable. The key insight is that reliability does not come from the base model itself, but from the multi-layer operating layer built around it, including prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability, which together improve the agents' robustness and capital safety across long sequences of complex decisions.

Link: https://arxiv.org/abs/2604.26091
Authors: T.J. Barton,Chris Constantakis,Patti Hauseman,Annie Mous,Alaska Hoffman,Brian Bergeron,Hunter Goodreau
Affiliations: DX Research Group (DXRG)
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Multiagent Systems (cs.MA)
Comments: 18 pages, 6 figures. Public onchain dashboard and supporting documentation linked in paper

Abstract:We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX Terminal Pro, a 21-day deployment in which 3,505 user-funded agents traded real ETH in a bounded onchain market. Users configured vaults through structured controls and natural-language strategies, but only agents could choose normal buy/sell trades. The system produced 7.5M agent invocations, roughly 300K onchain actions, about 20M in volume, more than 5,000 ETH deployed, roughly 70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions. Long-running agents accumulated thousands of sequential decisions, including 6,000+ prompt-state-action cycles for continuously active agents, yielding a large-scale trace from user mandate to rendered prompt, reasoning, validation, portfolio state, and settlement. Reliability did not come from the base model alone; it emerged from the operating layer around the model: prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics. Targeted harness changes reduced fabricated sell rules from 57% to 3%, reduced fee-led observations from 32.5% to below 10%, and increased capital deployment from 42.9% to 78.0% in an affected test population. We show that capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.

[MA-6] I Would If I Could: Reasoning about Dynamics of Actions in Multi-Agent Systems KR2026

[Quick Read]: This paper addresses the limited ability of agents in Multi-Agent Systems (MAS) to adapt dynamically during execution; in particular, existing strategic logics such as Alternating-time Temporal Logic (ATL) cannot express the dynamic granting and revoking of agents' available actions, nor how such changes affect agents' knowledge. The key to the solution is two extended logics: ATL-D (Alternating-time Temporal Logic with Dynamic Actions), which models the process of granting and revoking actions, and its extension ATEL-D, which further captures how action updates affect agents' knowledge. Beyond the conceptual contribution, the work provides new formal tools through an expressivity analysis, a study of the relation to normative systems, and complexity results, laying theoretical groundwork for modeling and reasoning about dynamic behavior in multi-agent systems.

Link: https://arxiv.org/abs/2604.26053
Authors: Rustam Galimullin,Hermine Grosinger,Munyque Mittelmann
Affiliations: University of Bergen; Örebro University; CNRS, UMR 7030, F-93430, Villetaneuse, France; Université Sorbonne Paris Nord, LIPN, F-93430, Villetaneuse, France
Subjects: Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Comments: This is an extended version of the paper with the same title that will appear in KR 2026, and which contains a technical appendix with proof details

Abstract:Autonomous agents acting in realistic Multi-Agent Systems (MAS) should be able to adapt during their execution. Standard strategic logics, such as Alternating-time Temporal Logic (ATL), model agents’ state- or history-dependent behaviour. However, the dynamic treatment of agents’ available actions and their knowledge of required actions is still rarely addressed. In this paper, we introduce ATL with Dynamic Actions (ATL-D), which models the process of granting and revoking actions, and its extension ATEL-D, which captures how such updates affect agents’ knowledge. Beyond the conceptual contribution, we provide several technical results: we analyse the expressivity of our logic in relation to ATL, study its relation to normative systems, and provide complexity results for relevant computational problems.

[MA-7] A Survey of Multi-Agent Deep Reinforcement Learning with Graph Neural Network-Based Communication

[Quick Read]: This survey addresses the lack of explicit structure and classification for communication mechanisms based on graph neural networks (GNNs) in multi-agent reinforcement learning (MARL). Although existing work uses GNNs to model inter-agent interactions and enrich information exchange and coordinated decision-making, the methods differ widely and lack a unified theoretical abstraction. The paper's key contribution is a generalized GNN-based communication process: by standardizing how communication is modeled, the principles behind different methods become clearer and more accessible, improving the consistency and interpretability of GNN-based communication designs in MARL.

Link: https://arxiv.org/abs/2604.25972
Authors: Valentin Cuzin-Rambaud(LIRIS, UCBL),Laetitia Matignon(LIRIS, UCBL),Maxime Morge(LIRIS, UCBL)
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:In multi-agent reinforcement learning (MARL), integrating a communication mechanism allows agents to better learn to coordinate their actions and converge on their objectives by sharing information. Based on an interaction graph, a subclass of methods employs graph neural networks (GNNs) to learn the communication, enabling agents to improve their internal representations by enriching them with information exchanged. With growing research, we note a lack of explicit structure and framework to distinguish and classify MARL approaches with communication based on GNNs. Thus, this paper surveys recent works in this field. We propose a generalized GNN-based communication process with the goal of making the underlying concepts behind the methods more obvious and accessible.
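The generalized GNN-based communication process the survey abstracts over can be illustrated with one round of message passing over an interaction graph (a minimal sketch, not any specific surveyed method; the mean aggregator and the mixing weights are arbitrary illustrative choices):

```python
def communication_round(states, edges, w_self=0.5, w_msg=0.5):
    """One GNN-style communication step over an interaction graph.

    states: {agent_id: list[float]} agents' internal representations
    edges:  list of (sender, receiver) pairs in the interaction graph
    Each agent aggregates incoming messages (here: element-wise mean of
    the senders' states) and mixes the aggregate into its own state.
    """
    incoming = {a: [] for a in states}
    for sender, receiver in edges:
        incoming[receiver].append(states[sender])  # message = sender state
    new_states = {}
    for agent, h in states.items():
        msgs = incoming[agent]
        if msgs:
            agg = [sum(vals) / len(msgs) for vals in zip(*msgs)]
        else:
            agg = [0.0] * len(h)  # no neighbours: nothing to aggregate
        new_states[agent] = [w_self * hi + w_msg * ai for hi, ai in zip(h, agg)]
    return new_states

states = {"a1": [1.0, 0.0], "a2": [0.0, 1.0], "a3": [1.0, 1.0]}
edges = [("a1", "a2"), ("a3", "a2")]  # a2 hears from a1 and a3
out = communication_round(states, edges)
```

In real MARL systems the aggregation and mixing steps are learned (attention weights, MLPs) rather than fixed constants; this sketch only shows the structure of one communication round.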

Natural Language Processing

[NLP-0] Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

[Quick Read]: This paper addresses the lack of cross-architecture transfer in knowledge distillation for diffusion large language models (dLLMs): how to transfer knowledge from teacher models with different architectures, attention mechanisms, and tokenizers into a student model. Existing methods only reduce inference steps within a single architecture and cannot handle knowledge transfer between heterogeneous architectures. The key to the solution is the TIDE framework, whose core innovations are: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) the Reverse CALM objective, which inverts chunk-level likelihood matching across tokenizers, yielding bounded gradients and dual-end noise filtering. The approach successfully distills 8B dense and 16B MoE teachers into a 0.6B student, improving over the baseline by an average of 1.53 points across eight benchmarks, with especially strong gains in code generation, where the HumanEval score rises from 32.3 to 48.78.

Link: https://arxiv.org/abs/2604.26951
Authors: Gongbo Zhang,Wen Wang,Ye Tian,Li Yuan
Affiliations: Peking University; Zhejiang University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 3 figures. Code: this https URL

Abstract:Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher’s noise-dependent reliability; (2) CompDemo, which enriches the teacher’s context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Distilling 8B dense and 16B MoE teachers into a 0.6B student via two heterogeneous pipelines outperforms the baseline by an average of 1.53 points across eight benchmarks, yielding notable gains in code generation, where HumanEval scores reach 48.78 compared to 32.3 for the AR baseline.
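The abstract does not spell out TIDAL's schedule. As an illustrative sketch only, under the assumption that distillation strength should decay with the diffusion noise level (where the teacher is less reliable) and ramp up with training progress, one might write:

```python
import math

def tidal_weight(progress, timestep, sharpness=4.0):
    """Hypothetical distillation-strength schedule in the spirit of TIDAL.

    progress: training progress in [0, 1]
    timestep: diffusion timestep in [0, 1], where 1 = fully noised input
    The teacher is less reliable at high noise, so the weight decays with
    timestep; it ramps up linearly with training progress. Illustrative
    only -- the paper's actual schedule is not given in the abstract.
    """
    reliability = math.exp(-sharpness * timestep)  # trust teacher less at high noise
    ramp = progress                                # linear warm-up over training
    return ramp * reliability

def distill_loss(ce_loss, kl_loss, progress, timestep):
    """Blend the task loss and the distillation loss with the modulated weight."""
    w = tidal_weight(progress, timestep)
    return (1 - w) * ce_loss + w * kl_loss
```

The point of the sketch is the two-axis modulation: the same KL term is weighted differently depending both on where the student is in training and on how noised the current diffusion input is.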

[NLP-1] Select to Think: Unlocking SLM Potential with Local Sufficiency

[Quick Read]: This paper addresses the problem that small language models (SLMs) reason substantially worse than large language models (LLMs), while avoiding both the high latency and cost of invoking an external LLM and the limits of standard knowledge distillation, where capacity constraints prevent SLMs from accurately mimicking the LLM's complex generative distribution. The key to the solution is the observation of local sufficiency: at reasoning divergence points, the LLM's preferred next token consistently lies within the SLM's top-K candidate list, even when it is not the SLM's top-1 prediction. Building on this, the authors propose the SELECT TO THINK (S2T) framework, which reframes the LLM's role from open-ended generation to discrete selection among the SLM's candidate tokens, simplifying the supervision signal to a candidate-ranking task. They further propose S2T-LOCAL, which distills this selection logic into the SLM itself, giving it autonomous re-ranking capability: without any runtime LLM calls, it approaches the performance of 8-path self-consistency at single-trajectory inference cost.

Link: https://arxiv.org/abs/2604.26940
Authors: Wenxuan Ye,Yangyang Zhang,Xueli An,Georg Carle,Yunpu Ma
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Small language models (SLMs) offer computational efficiency for scalable deployment, yet they often fall short of the reasoning power exhibited by their larger counterparts (LLMs). To mitigate this gap, current approaches invoke an LLM to generate tokens at points of reasoning divergence, but these external calls introduce substantial latency and costs. Alternatively, standard distillation is often hindered by the capacity limitation, as SLMs struggle to accurately mimic the LLM’s complex generative distribution. We address this dilemma by identifying local sufficiency: at divergence points, the LLM’s preferred token consistently resides within the SLM’s top-K next-token predictions, even when failing to emerge as the SLM top-1 choice. We therefore propose SELECT TO THINK (S2T), which reframes the LLM’s role from open-ended generation to selection among the SLM’s proposals, simplifying the supervision signal to discrete candidate rankings. Leveraging this, we introduce S2T-LOCAL, which distills the selection logic into the SLM, empowering it to perform autonomous re-ranking without inference-time LLM dependency. Empirically, we demonstrate that a 1.5B SLM’s top-8 candidates capture the 32B LLM’s choice with 95% hit rate. Translating this potential into performance, S2T-LOCAL improves greedy decoding by 24.1% on average across benchmarks, effectively matching the efficacy of 8-path self-consistency while operating with single-trajectory efficiency.
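The local-sufficiency observation reduces to a top-K membership check: at a divergence point, does the teacher's preferred token appear among the student's top-K candidates? A sketch over toy next-token distributions (the tokens and probabilities are hypothetical):

```python
def topk_tokens(probs, k):
    """Return the k highest-probability token ids from a distribution."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

def teacher_hit(student_probs, teacher_probs, k=8):
    """Local sufficiency check: does the teacher's preferred token sit in
    the student's top-k candidates, even when it is not the student's
    top-1 choice?"""
    teacher_choice = max(teacher_probs, key=teacher_probs.get)
    return teacher_choice in topk_tokens(student_probs, k)

# Toy divergence point: the student ranks "thus" first while the teacher
# prefers "so", but "so" is still inside the student's top-3 candidates,
# so selection among the student's proposals suffices.
student = {"thus": 0.40, "so": 0.30, "and": 0.20, "but": 0.10}
teacher = {"so": 0.70, "thus": 0.20, "and": 0.05, "but": 0.05}
```

This is the property the paper measures at scale (a 95% hit rate for a 1.5B student's top-8 against a 32B teacher); S2T then trains the student to do the re-ranking itself.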

[NLP-2] ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

[Quick Read]: This paper addresses the gap in "compositional code creation" for large language models (LLMs): generating a complete, internally structured class from a specification, a capability that sits between function-level synthesis and repository-level modification and previously lacked systematic evaluation and effective support. The key to the solution is the ClassEval-Pro benchmark, built through an automated three-stage pipeline (complexity enhancement, cross-domain class composition, and integration of real GitHub code contributed after January 2025) that yields 300 class-level tasks across 11 domains, each validated by an LLM judge ensemble and a high-coverage test suite. The benchmark exposes a substantial performance gap among frontier models on class-level tasks (at most 45.6% Pass@1, with a spread of up to 17.7 percentage points) and identifies logic errors (56.2%) and dependency errors (38.0%) as the dominant failure causes, pinpointing cross-method coordination as the core bottleneck.

Link: https://arxiv.org/abs/2604.26923
Authors: Yeheng Chen,Chaoxiang Xie,Yuling Shi,Wenhao Zeng,Yongpan Wang,Hongyu Zhang,Xiaodong Gu
Affiliations: Shanghai Jiao Tong University; Hohai University; Chongqing University
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: Accepted to AIware 2026. Code and data available at this https URL

Abstract:LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes – compositional code creation, i.e., building a complete, internally structured class from a specification – remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark’s discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck.
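Pass@1 figures like those above are typically computed with the unbiased pass@k estimator popularized by the HumanEval line of work; a sketch of that standard formula (whether ClassEval-Pro uses exactly this estimator is an assumption, not stated in the abstract):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator over n generated samples, c of them correct.

    pass@k = 1 - C(n - c, k) / C(n, k)
    i.e. one minus the probability that a random size-k subset of the
    n samples contains no correct solution.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single sample per task, pass@1 is just the success rate.
# With n = 10 samples of which c = 3 pass, pass@1 estimates 0.3.
```

A benchmark-level Pass@1 score is then the mean of this per-task estimate across all tasks.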

[NLP-3] ClawGym: A Scalable Framework for Building Effective Claw Agents

[Quick Read]: This paper addresses the lack of a systematic framework for the scalable development of Claw-style environments, in particular for synthesizing verifiable training data and effectively integrating it with agent training and diagnostic evaluation. The key to the solution is the ClawGym framework, with three core components: ClawGym-SynData (a diverse dataset of 13.5K tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms), ClawGym-Agents (a family of Claw-style models capable of multi-step tasks, trained by supervised fine-tuning on black-box rollout trajectories), and ClawGym-Bench (a 200-instance benchmark calibrated through automated filtering and human-LLM review), together supporting the full lifecycle from data synthesis through model training to reliable evaluation.

Link: https://arxiv.org/abs/2604.26904
Authors: Fei Bai,Huatong Song,Shuang Sun,Daixuan Cheng,Yike Yang,Chuan Hao,Renyuan Li,Feng Chang,Yuan Wei,Ran Tao,Bryan Dai,Jian Yang,Wayne Xin Zhao
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; IQuest Research; Beihang University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task environments. To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources will soon be released at this https URL.

[NLP-4] HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering

[Quick Read]: This paper addresses the inefficiency patients face when accessing medical information through electronic health record (EHR) portals: access alone does not mean they can understand or act on it. The core challenge is turning patients' unstructured, colloquial questions into actionable clinical reasoning and generating accurate, traceable answers grounded in real clinical notes. The key to the solution is a multi-stage cascaded pipeline powered by the Gemini 2.5 Pro large language model, with four modules: (1) a few-shot query-reformulation unit that condenses verbose patient questions; (2) a heuristic evidence scorer that prioritizes recall; (3) an evidence-constrained response generator that ensures professional, factually consistent answers; and (4) a high-precision many-to-many alignment framework that maps answers to supporting clinical sentences. The architecture performed strongly in the ArchEHR-QA 2026 shared task, validating that a structured multi-stage design improves the accuracy, interpretability, and professional quality of the answers.

Link: https://arxiv.org/abs/2604.26880
Authors: Md Biplob Hosen,Md Alomgeer Hussein,Md Akmol Masud,Omar Faruque,Tera L Reynolds,Lujie Karen Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Patient portals now give individuals direct access to their electronic health records (EHRs), yet access alone does not ensure patients understand or act on the complex clinical information contained in these records. The ArchEHR-QA 2026 shared task addresses this challenge by focusing on grounded question answering over EHRs, and this paper presents the system developed by the HealthNLP_Retrievers team for this task. The proposed approach uses a multi-stage cascaded pipeline powered by the Gemini 2.5 Pro large language model to interpret patient-authored questions and retrieve relevant evidence from lengthy clinical notes. Our architecture comprises four integrated modules: (1) a few-shot query reformulation unit which summarizes verbose patient queries; (2) a heuristic-based evidence scorer which ranks clinical sentences to prioritize recall; (3) a grounded response generator which synthesizes professional-caliber answers restricted strictly to identified evidence; and (4) a high-precision many-to-many alignment framework which links generated answers to supporting clinical sentences. This cascaded approach achieved competitive results. Across the individual tracks, the system ranked 1st in question interpretation, 5th in answer generation, 7th in evidence identification, and 9th in answer-evidence alignment. These results show that integrating large language models within a structured multi-stage pipeline improves grounding, precision, and the professional quality of patient-oriented health communication. To support reproducibility, our source code is publicly available in our GitHub repository.

[NLP-5] MoRFI: Monotonic Sparse Autoencoder Feature Identification

[Quick Read]: This paper investigates the hallucinations triggered when post-training introduces new knowledge into large language models (LLMs), and in particular the causal mechanisms behind them. Prior work shows that supervised fine-tuning (SFT) can exacerbate hallucination, but the underlying mechanism remained unclear. Through controlled fine-tuning experiments on closed-book QA that systematically vary the fraction of new knowledge while controlling training epochs and data mixtures, the authors find that continued exposure to unknown facts activates a set of latent directions in the residual stream with monotonic response characteristics, directly tied to the degradation of knowledge retrieval. The key to the solution is Monotonic Relationship Feature Identification (MoRFI), which uses pre-trained sparse autoencoders (SAEs) to analyze activation patterns at each checkpoint and filter SAE features that respond monotonically to the amount of new knowledge, thereby localizing and intervening on the causal latents behind hallucination and ultimately recovering the model's original knowledge through single-latent interventions.

Link: https://arxiv.org/abs/2604.26866
Authors: Dimitris Dimakopoulos,Shay B. Cohen,Ioannis Konstas
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) acquire most of their factual knowledge during the pre-training stage, through next token prediction. Subsequent stages of post-training often introduce new facts outwith the parametric knowledge, giving rise to hallucinations. While it has been demonstrated that supervised fine-tuning (SFT) on new knowledge may exacerbate the problem, the underlying mechanisms are still poorly understood. We conduct a controlled fine-tuning experiment, focusing on closed-book QA, and find latent directions that causally contribute to hallucinations. Specifically, we fine-tune Llama 3.1 8B, Gemma 2 9B and Mistral 7B v03 on seven distinct single QA datasets, controlling for the percentage of new knowledge and number of training epochs. By measuring performance on the test set, we validate that incrementally introducing new knowledge increases hallucinations, with the effect being more pronounced with prolonged training. We leverage pre-trained sparse autoencoders (SAEs) to analyze residual stream activations across various checkpoints for each model and propose Monotonic Relationship Feature Identification (MoRFI) for capturing causally relevant latents. MoRFI filters SAE features that respond monotonically to controlled fine-tuning data mixtures of a target property. Our findings show that exposure to unknown facts disrupts the model’s ability to retrieve stored knowledge along a set of directions in the residual stream. Our pipeline reliably discovers them across distinct models, recovering knowledge through single-latent interventions.
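The abstract describes MoRFI as filtering SAE features that respond monotonically to controlled fine-tuning data mixtures; a minimal sketch of such a filter (the feature names, activation values, and strict-monotonicity rule here are illustrative assumptions, not the paper's exact criterion):

```python
def is_monotonic(values, tol=0.0):
    """True if the sequence is entirely non-decreasing or non-increasing."""
    inc = all(b >= a - tol for a, b in zip(values, values[1:]))
    dec = all(b <= a + tol for a, b in zip(values, values[1:]))
    return inc or dec

def morfi_filter(feature_activations):
    """Keep SAE features whose mean activation responds monotonically to
    increasing fractions of new knowledge in the fine-tuning mixture.

    feature_activations: {feature_id: [activation at 0%, 25%, 50%, 100% new knowledge]}
    """
    return [f for f, acts in feature_activations.items() if is_monotonic(acts)]

acts = {
    "f_recall_degradation": [0.10, 0.25, 0.40, 0.70],  # rises with new facts
    "f_noise": [0.30, 0.10, 0.50, 0.20],               # no consistent trend
}
```

Features passing the filter are the candidates for causal analysis, e.g. the single-latent interventions the abstract reports for recovering stored knowledge.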

[NLP-6] What Kind of Language is Easy to Language-Model Under Curriculum Learning?

[Quick Read]: This paper asks whether language models (LMs) can reproduce the typological tendencies common across human languages, and in particular how the inductive bias of LMs interacts with different learning scenarios in shaping these tendencies. The key to the solution is introducing curriculum learning (CL), a developmentally motivated learning scenario in which training starts with simple sentences and gradually moves to complex ones, rather than using randomly ordered input. The study finds that this CL-based learning scenario substantially changes the apparent inductive bias of LMs, better modeling the frequent feature-combination patterns found in real languages.

Link: https://arxiv.org/abs/2604.26844
Authors: Nadine El-Naggar,Tatsuki Kuribayashi,Ted Briscoe
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: The 15th edition of the Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2026)

Abstract:Many of the thousands of attested languages share common configurations of features, creating a spectrum from typologically very rare (e.g., object-verb-subject word order) or impossible languages to very common combinations of features (e.g., subject-object-verb word order). One central question is under what conditions such typological tendencies can be predicted, and specifically whether the learning bias of language models (LMs) is sufficient to reproduce such patterns. In this study, we add one dimensionality to such analysis – the learning scenario for LMs – to explore its interaction with the inductive bias of LMs. Specifically, as a first study, we examine the effect of curriculum learning (CL), as a developmentally motivated learning scenario, i.e., starting with simpler sentences rather than randomly-ordered input. We expand existing LM-based exploration (El-Naggar et al., 2025a,b) with a simple CL variant and find that CL substantially impacts the apparent inductive bias of LMs.
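The "simpler sentences first" curriculum can be sketched with a length-based difficulty proxy (token count is a common but assumed choice here; the paper's actual complexity measure is not given in the abstract):

```python
def curriculum_order(corpus):
    """Order training sentences from simple to complex, using token count
    as a crude difficulty proxy (a stand-in for whatever complexity
    measure a curriculum-learning setup actually uses)."""
    return sorted(corpus, key=lambda s: len(s.split()))

corpus = [
    "the dog that the cat chased barked loudly at night",
    "dogs bark",
    "the dog barked at the cat",
]
ordered = curriculum_order(corpus)
```

The contrast the paper studies is exactly this ordered presentation versus feeding the same corpus in random order, holding the data itself fixed.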

[NLP-7] Language Diffusion Models are Associative Memories Capable of Retrieving Unseen Data

[Quick Read]: This paper asks when language diffusion models memorize their training data and how to quantitatively assess their genuinely generative regime. The key finding is that Uniform-based Discrete Diffusion Models (UDDMs) fundamentally behave as Associative Memories (AMs), recovering memories via stable basins of attraction around training examples, while also exhibiting emergent creative capabilities. The authors further show that no explicit energy function is needed: basins of attraction can also form through conditional likelihood maximization. By analyzing token recovery on training and test examples, they identify a memorization-to-generalization transition governed by training-set size: as training data grows, basins around training examples shrink while basins around unseen test examples expand, until both converge to the same level. The key practical insight is that the conditional entropy of predicted token sequences serves as a probe for deployed models: memorization is characterized by vanishing conditional entropy, while in the generalization regime most tokens retain finite conditional entropy.

链接: https://arxiv.org/abs/2604.26841
作者: Bao Pham,Mohammed J. Zaki,Luca Ambrogioni,Dmitry Krotov,Matteo Negri
机构: Rensselaer Polytechnic Institute (RPI); Radboud University; CY Cergy Paris Université
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Also see arXiv:2505.21777 for a related work

点击查看摘要

Abstract: When do language diffusion models memorize their training data, and how to quantitatively assess their true generative regime? We address these questions by showing that Uniform-based Discrete Diffusion Models (UDDMs) fundamentally behave as Associative Memories (AMs) with emergent creative capabilities. The core idea of an AM is to reliably recover stored data points as memories by establishing distinct basins of attraction around them. Historically, models like Hopfield networks use an explicit energy function to guarantee these stable attractors. We broaden this perspective by leveraging the observation that energy is not strictly necessary, as basins of attraction can also be formed via conditional likelihood maximization. By evaluating token recovery of training and test examples, we identify in UDDMs a sharp memorization-to-generalization transition governed by the size of the training dataset: as it increases, basins around training examples shrink and basins around unseen test examples expand, until both later converge to the same level. Crucially, we can detect this transition using only the conditional entropy of predicted token sequences: memorization is characterized by vanishing conditional entropy, while in the generalization regime the conditional entropy of most tokens remains finite. Thus, conditional entropy offers a practical probe for the memorization-to-generalization transition in deployed models.
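The conditional-entropy probe described above is easy to sketch. Below is a minimal illustration with toy token distributions (not the paper's actual UDDM outputs): per-token Shannon entropy collapses toward zero in the memorization regime and stays finite under generalization.

```python
import math

def conditional_entropy(probs):
    """Shannon entropy (in nats) of one predicted-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Toy illustration: near-deterministic predictions (memorization) vs.
# spread-out predictions (generalization). A real probe would use the
# model's per-position token distributions instead.
memorized = [[0.999, 0.001], [0.998, 0.002]]
generalized = [[0.6, 0.4], [0.5, 0.5]]

mean_h_mem = sum(map(conditional_entropy, memorized)) / len(memorized)
mean_h_gen = sum(map(conditional_entropy, generalized)) / len(generalized)
print(mean_h_mem < 0.05, mean_h_gen > 0.5)  # → True True
```

Averaging this quantity over positions is what distinguishes the two regimes in the paper's analysis.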

[NLP-8] HalluCiteChecker: A Lightweight Toolkit for Hallucinated Citation Detection and Verification in the Era of AI Scientists

【Quick Read】: This paper addresses hallucinated citations introduced by generative AI in academic writing, i.e., model-generated references to works that do not exist, which undermine the credibility of papers and impose extra manual verification burden on reviewers and authors. The key contribution is the HalluCiteChecker toolkit, which formalizes hallucinated-citation detection as an NLP task and provides lightweight, fully offline, CPU-only verification that completes in seconds on a standard laptop, enabling systematic pre-review and pre-publication checks.

Link: https://arxiv.org/abs/2604.26835
Authors: Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Institutions: Nara Institute of Science and Technology (NAIST), Japan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Notes: Work In Progress

Abstract:We introduce HalluCiteChecker, a toolkit for detecting and verifying hallucinated citations in scientific papers. While AI assistant technologies have transformed the academic writing process, including citation recommendation, they have also led to the emergence of hallucinated citations that do not correspond to any existing work. Such citations not only undermine the credibility of scientific papers but also impose an additional burden on reviewers and authors, who must manually verify their validity during the review process. In this study, we formalize hallucinated citation detection as an NLP task and provide a corresponding toolkit as a practical foundation for addressing this problem. Our package is lightweight and can perform verification in seconds on a standard laptop. It can also be executed entirely offline and runs efficiently using only CPUs. We hope that HalluCiteChecker will help reduce reviewer workload and support organizers by enabling systematic pre-review and publication checks. Our code is released under the Apache 2.0 license on GitHub and is distributed as an installable package via PyPI. A demonstration video is available on YouTube.

[NLP-9] Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding

【Quick Read】: This paper targets the efficiency bottleneck that autoregressive rollout generation creates in RL post-training of frontier language models: how to raise rollout throughput without changing the target model's output distribution. The key idea is speculative decoding as a lossless acceleration primitive: a smaller draft model or a pretrained MTP head proposes candidate tokens in parallel, which the target model then verifies, preserving its output distribution while substantially speeding up rollouts. The authors implement speculative decoding in NeMo-RL on a vLLM backend, supporting both synchronous and asynchronous pipelines, and show it is effective across speculation mechanisms, e.g., a 1.8x rollout-throughput gain for synchronous RL at 8B scale, and, via a high-fidelity performance simulator, a projected up-to-2.5x end-to-end training speedup at 235B scale when combined with asynchronous RL.

Link: https://arxiv.org/abs/2604.26779
Authors: Hayate Iso, Tiyasa Mitra, Sudipta Mondal, Rasoul Shafipour, Venmugil Elango, Terry Kong, Yuki Huang, Seonjin Na, Izzy Putterman, Benjamin Chislett, Maor Ashkenazi, Joseph Guman, Gerald Shen, Tugrul Konuk, Ashwath Aithal, Ritika Borkar, Ran Zilberstein, Bita Rouhani
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes:

Abstract:RL post-training of frontier language models is increasingly bottlenecked by autoregressive rollout generation, making rollout acceleration a central systems challenge. Many existing efficiency methods improve throughput by changing the rollout or optimization regime, for example, through off-policy execution, replay, or lower-precision generation. We study speculative decoding as a lossless acceleration primitive for RL rollouts that preserves the target model’s output distribution. We implement speculative decoding in NeMo-RL with a vLLM backend, supporting both synchronous and asynchronous pipelines and enabling speculation during RL rollouts. This benefit is realizable across speculation mechanisms, such as pretrained MTP heads, small external draft models or even techniques such as Eagle3, which are traditionally applied after RL phase. This yields a deployment path for state-of-the-art speculative decoding inside RL training. In a reasoning post-training workload at 8B scale under synchronous RL, speculative decoding improves rollout throughput by 1.8x. Using a high-fidelity performance simulator, we project that combining speculative decoding with asynchronous RL yields up to 2.5x end-to-end training speedup at 235B scale.
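What makes speculative decoding "lossless" is the standard draft-and-verify acceptance rule, sketched below with toy token probabilities (these values and function names are illustrative, not NeMo-RL or vLLM APIs): each drafted token is accepted with probability min(1, p_target/p_draft), and verification stops at the first rejection.

```python
import random

def verify_draft(draft_tokens, p_target, p_draft, rng):
    """Accept each drafted token t with probability
    min(1, p_target(t) / p_draft(t)), stopping at the first rejection.
    Accepted prefixes then follow the target model's distribution."""
    accepted = []
    for tok in draft_tokens:
        if rng.random() < min(1.0, p_target[tok] / p_draft[tok]):
            accepted.append(tok)
        else:
            break
    return accepted

p_target = {"a": 0.7, "b": 0.3}
p_draft = {"a": 0.5, "b": 0.5}
# "a" is under-proposed by the draft (ratio > 1), so it is always accepted.
print(verify_draft(["a", "a"], p_target, p_draft, random.Random(0)))
# → ['a', 'a']
```

In the full algorithm, a rejection is followed by resampling from a corrected residual distribution, which is what guarantees exact equivalence with the target model.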

[NLP-10] Decoupling Knowledge and Task Subspaces for Composable Parametric Retrieval Augmented Generation

【Quick Read】: This paper tackles an entanglement problem in Parametric Retrieval-Augmented Generation (PRAG): because document adapters are trained with task-supervised objectives, each adapter encodes both reusable task-solving behavior and document-specific knowledge, so merging multiple adapters at inference time accumulates redundant task behavior and makes the merged adapter less stable and less focused on the intended document knowledge. The key solution is Orthogonal Subspace Decomposition (OSD), a staged training setup that separates the two: a Task LoRA is first trained to capture reusable task behavior, and document LoRAs are then trained to encode document-specific knowledge in an orthogonal subspace. Adapter composition thus integrates only document knowledge without stacking redundant task behavior, significantly improving compositional robustness in multi-document settings.

Link: https://arxiv.org/abs/2604.26768
Authors: Weihang Su, Hanwen Zhang, Qingyao Ai, Yiqun Liu
Institutions: Tsinghua University
Categories: Computation and Language (cs.CL)
Notes:

Abstract: Parametric Retrieval-Augmented Generation (PRAG) encodes external documents into lightweight parameter modules that can be retrieved and merged at inference time, offering a promising alternative to in-context retrieval augmentation. Despite its potential, many PRAG implementations train document adapters with task-supervised objectives, which may cause each adapter to encode both document-specific facts and reusable task-solving behavior. This entanglement may make adapter composition less reliable: when multiple adapters are merged at inference time, their overlapping task behaviors can accumulate together with document-specific updates, potentially making the merged adapter less stable and less focused on the intended document knowledge. To examine this issue, we explore Orthogonal Subspace Decomposition (OSD), an adapter-training setup that separates reusable task behavior from document-specific knowledge adapters. Concretely, we first train a Task LoRA to capture reusable task behavior, and then train document LoRAs to encode document-specific knowledge in an orthogonal subspace. This setup provides a controlled way to examine how orthogonalizing task and document LoRA updates affects adapter composition in multi-document PRAG. Experiments across multiple knowledge-intensive tasks and model scales suggest that this orthogonalization strategy can improve compositional robustness in parametric RAG, especially when multiple document adapters are merged.
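The orthogonalization idea can be pictured on plain vectors. This is a hypothetical sketch, not the paper's LoRA training procedure: given an (orthonormal) task direction, projecting it out of a raw document update leaves only the component orthogonal to the task subspace.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(update, task_dirs):
    """Gram-Schmidt-style removal of the components of `update` lying in
    the span of orthonormal task directions; the result is orthogonal to
    the task subspace."""
    out = list(update)
    for d in task_dirs:
        c = dot(out, d)
        out = [o - c * di for o, di in zip(out, d)]
    return out

task_dir = [1.0, 0.0, 0.0]    # reusable task-behavior direction (toy)
doc_update = [2.0, 3.0, 1.0]  # raw document-specific update (toy)
doc_orth = project_out(doc_update, [task_dir])
print(doc_orth)               # → [0.0, 3.0, 1.0]
```

Merging several such orthogonalized document updates no longer re-adds the shared task component with each adapter, which is the intuition behind OSD's compositional robustness.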

[NLP-11] Domain-Adapted Small Language Models for Reliable Clinical Triage

【Quick Read】: This paper addresses inaccurate and inconsistent Emergency Severity Index (ESI) assignment in emergency departments, where highly variable free-text triage documentation leads to mistriage and workflow inefficiencies. The key finding is that open-source small language models (SLMs) can serve as reliable, privacy-preserving decision-support tools for clinical triage: fine-tuning with large-scale domain adaptation on pediatric triage data substantially reduces discordance and clinically significant errors in ESI assignment. The fine-tuned Qwen2.5-7B model offers the best balance of accuracy, stability, and computational efficiency, outperforming all baseline SLMs and advanced proprietary large language models (LLMs).

Link: https://arxiv.org/abs/2604.26766
Authors: Manar Aljohani, Brandon Ho, Kenneth McKinley, Dennis Ren, Xuan Wang
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:Accurate and consistent Emergency Severity Index (ESI) assignment remains a persistent challenge in emergency departments, where highly variable free-text triage documentation contributes to mistriage and workflow inefficiencies. This study evaluates whether open-source small language models (SLMs) can serve as reliable, privacy-preserving decision-support tools for clinical triage. We systematically compared multiple SLMs across diverse prompting pipelines and found that clinical vignettes, concise summaries of triage narratives, yielded the most accurate predictions. The SLM, Qwen2.5-7B, demonstrated the strongest balance of accuracy, stability, and computational efficiency. Through large-scale domain adaptation using expert-curated and silver-standard pediatric triage data, fine-tuned Qwen2.5-7B models substantially reduced discordance and clinically significant errors, outperforming all baseline SLMs and advanced proprietary large language models (LLMs, e.g., GPT-4o). These findings highlight the feasibility of institution-specific SLMs for reliable, privacy-preserving ESI decision support and underscore the importance of targeted fine-tuning over more complex inference strategies.

[NLP-12] Swap distance minimization shapes the order of subject, object and verb in languages of the world

【Quick Read】: This paper addresses the diversity of syntactic word-order patterns across languages, in particular languages whose dominant order departs from the common subject-object-verb (SOV) or subject-verb-object (SVO) patterns, as well as languages that lack a dominant order altogether. The study finds that, with or without a marked word-order preference, cross-linguistic word-order variation is constrained by the principle of swap distance minimization: languages tend to avoid frequent swaps of constituent positions, which the authors identify as a core force shaping word-order evolution and variation.

Link: https://arxiv.org/abs/2604.26726
Authors: Jairo Rios-El-Yazidi, Ramon Ferrer-i-Cancho
Institutions: Universitat Politècnica de Catalunya
Categories: Computation and Language (cs.CL); Physics and Society (physics.soc-ph)
Notes:

Abstract:Languages of the world vary concerning the order of subject, object and verb. The most frequent dominant orders are SOV and SVO, and researchers have tailored models to this fact. However, there are still languages whose dominant order does not conform to these expectations or even lack a dominant order. Here we show that across linguistic families and macroareas, word order variation within languages is shaped by the principle of swap distance minimization even when the dominant order is not SOV/SVO and even when a dominant order is lacking.
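Swap distance between two constituent orders can be read as the minimum number of adjacent transpositions separating them (the Kendall tau distance between permutations). A small sketch under that reading, which the entry above assumes but does not spell out:

```python
def swap_distance(order_a, order_b):
    """Minimum number of adjacent swaps turning one constituent order
    into another (Kendall tau distance between two permutations)."""
    pos = {c: i for i, c in enumerate(order_b)}
    seq = [pos[c] for c in order_a]
    swaps = 0
    # count inversions via bubble sort (fine for length-3 orders)
    for _ in range(len(seq)):
        for j in range(len(seq) - 1):
            if seq[j] > seq[j + 1]:
                seq[j], seq[j + 1] = seq[j + 1], seq[j]
                swaps += 1
    return swaps

assert swap_distance("SOV", "SVO") == 1   # neighbors in permutation space
assert swap_distance("SOV", "OVS") == 2
assert swap_distance("SOV", "VOS") == 3   # maximally distant orders
print("ok")
```

Under swap distance minimization, attested variation within a language is expected to cluster around orders a small number of swaps away from its preferred order.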

[NLP-13] From Black-Box Confidence to Measurable Trust in Clinical AI: A Framework for Evidence Supervision and Staged Autonomy

【Quick Read】: This paper argues that trust in clinical AI cannot rest on model accuracy, fluent generation, or favorable user impressions alone; the core challenge is to engineer trust as a measurable system property rather than a property of a single model. The key contribution is an architectural framework built around evidence, supervision, and staged autonomy: it combines a deterministic logic core, a patient-specific AI assistant for contextual validation, a multi-tier model escalation mechanism, and a human supervision layer, enabling selective verification of clinically critical findings, bounded clinical context, disciplined prompt architecture, and careful evaluation on realistic cases. Trust metrics grounded in metrological principles (measurement uncertainty, calibration, traceability) allow the trustworthiness of each component to be assessed quantitatively, so that evidence trails, human oversight, tiered escalation, and graduated action rights are embedded from the outset, yielding trustworthy clinical AI as a system-level outcome.

Link: https://arxiv.org/abs/2604.26671
Authors: Serhii Zabolotnii, Viktoriia Holinko, Olha Antonenko
Institutions: Cherkasy State Business College; healthPrecision
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Notes: 12 pages, 6 figures

Abstract:Trust in clinical artificial intelligence (AI) cannot be reduced to model accuracy, fluency of generation, or overall positive user impression. In medicine, trust must be engineered as a measurable system property grounded in evidence, supervision, and operational boundaries of AI autonomy. This article proposes a practical framework for trustworthy clinical AI built around three principles: evidence, supervision, and staged autonomy. Rather than replacing deterministic clinical logic wholesale with end-to-end black-box models, the proposed approach combines a deterministic core, a patient-specific AI assistant for contextual validation, a multi-tier model escalation mechanism, and a human supervision layer for verification, escalation, and risk control. We demonstrate that trust also depends on selective verification of clinically critical findings, bounded clinical context, disciplined prompt architecture, and careful evaluation on realistic cases. Classifier-driven modular prompting is examined as an incremental path to scaling clinical depth without sacrificing prompt performance and without waiting for complete rule-based coverage. To operationalize trust, a set of trust metrics is proposed, built on metrological principles – measurement uncertainty, calibration, traceability – enabling quantitative rather than subjective assessment of each architectural layer. In this perspective, trustworthy clinical AI emerges not as a property of an individual model, but as an architectural outcome of a system into which evidence trails, human oversight, tiered escalation, and graduated action rights are embedded from the outset.

[NLP-14] Differentially-Private Text Rewriting reshapes Linguistic Style

【Quick Read】: This paper examines the stylistic distortion that differential privacy (DP) induces in text: how to preserve a text's register identity and communicative function while guaranteeing privacy. Whereas earlier work focused on word-level perturbation, this study analyzes contiguous sentence-level rewriting with language models. The key contribution is a systematic analysis, across privacy budgets, of how autoregressive paraphrasing and bidirectional substitution structurally erode register features such as interactive markers, contextual references, and complex subordination. The findings show that even when semantics are preserved, privacy constraints push texts toward a homogenized, non-involved and non-persuasive register, exposing the limits of current DP text processing in preserving the diversity of human writing.

Link: https://arxiv.org/abs/2604.26656
Authors: Stefan Arnold
Institutions: Friedrich-Alexander-Universität Erlangen-Nürnberg
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Differential Privacy (DP) for text matured from disjointed word-level substitutions to contiguous sentence-level rewriting by leveraging the generative capacity of language models. While this form of text privatization is best suited for balancing formal privacy guarantees with grammatical coherence, its impact on the register identity of text remains largely unexplored. By conducting a multidimensional stylistic profiling of differentially-private rewriting, we demonstrate that the cost of privacy extends far beyond lexical variation. Specifically, we find that rewriting under privacy constraints induces a systematic functional mutation of the text’s communicative signature. This shift is characterized by the severe attrition of interactive markers, contextual references, and complex subordination. By comparing autoregressive paraphrasing against bidirectional substitution across a spectrum of privacy budgets, we observe that both architectures force convergence toward a non-involved and non-persuasive register. This register-blind sanitization effectively preserves semantic content but structurally homogenizes the nuanced stylistic markers that define human-authored discourse.

[NLP-15] SAGE: A Strategy-Aware Graph-Enhanced Generation Framework For Online Counseling

【速读】: 该论文旨在解决通用大语言模型(Large Language Models, LLMs)在心理危机干预场景中缺乏临床推理能力的问题,即现有模型难以同时整合心理学理论框架、实时情绪信号与策略性干预规划,从而影响干预的安全性与有效性。其解决方案的关键在于提出SAGE(Strategy-Aware Graph-Enhanced)框架,该框架通过构建融合对话动态与心理理论层的异构图结构,显式地将交互锚定于理论驱动的术语体系;并引入“下一步策略分类器”以识别最优干预策略,再利用图感知注意力机制将图结构信号转化为软提示(soft prompts),从而引导LLM生成具有临床深度的响应,显著提升策略预测准确性和干预建议质量。

链接: https://arxiv.org/abs/2604.26630
作者: Eliya Naomi Aharon,Meytal Grimland,Avi Segal,Loona Ben Dayan,Inbar Shenfeld,Yossi Levi Belz,Kobi Gal
机构: Ben-Gurion University of the Negev (本古里安大学); University of Haifa (海法大学); Sahar (萨哈尔)
类目: Computation and Language (cs.CL)
备注: Full version of the work accepted as a short paper at the 34th ACM Conference on User Modeling, Adaptation and Personalization (UMAP '26). 9 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Effective mental health counseling is a complex, theory-driven process requiring the simultaneous integration of psychological frameworks, real-time distress signals, and strategic intervention planning. This level of clinical reasoning is critical for safety and therapeutic effectiveness but is often missing in general-purpose Large Language Models (LLMs). We introduce SAGE (Strategy-Aware Graph-Enhanced), a novel framework designed to bridge the gap between structured clinical knowledge and generative AI. SAGE constructs a heterogeneous graph that unifies conversational dynamics with a psychologically grounded layer, explicitly anchoring interactions in a theory-driven lexicon. Our architecture first employs a Next Strategy Classifier to identify the optimal therapeutic intervention. Subsequently, a Graph-Aware Attention mechanism projects graph-derived structural signals into soft prompts, conditioning the LLM to generate responses that maintain clinical depth. Validated through both automated metrics and expert human evaluation, SAGE outperforms baselines in strategy prediction and recommended response quality. By providing actionable intervention recommendations, SAGE serves as a cutting-edge decision-support tool designed to augment human expertise in high-stakes crisis counseling.

[NLP-16] OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory ACL2026

【Quick Read】: This paper addresses the limits that text-context budgets impose on the experiential memory of autonomous LLM agents in long-horizon, interactive settings. Existing memory systems either spend large token budgets storing raw trajectories or sacrifice information completeness and evidence coherence through summarization or text-only retrieval. The key mechanism of the proposed Optical Context Retrieval Memory (OCR-Memory) is to render historical trajectories into images annotated with unique visual identifiers and to retrieve them via a locate-and-transcribe paradigm: relevant regions are selected through visual anchors and the corresponding verbatim text is recovered exactly. This lets arbitrarily long histories be stored with minimal prompt overhead and restored without hallucination, substantially improving memory capacity and evidence fidelity in long-horizon tasks.

Link: https://arxiv.org/abs/2604.26622
Authors: Jinze Li, Yang Zhang, Xin Yang, Jiayi Qu, Jinfeng Xu, Shuo Yang, Junhua Ding, Edith Cheuk-Han Ngai
Institutions: The University of Hong Kong; University of North Texas; University of Tsukuba; Yonsei University
Categories: Computation and Language (cs.CL)
Notes: Accepted to ACL 2026 (Main Conference)

Abstract: Autonomous LLM agents increasingly operate in long-horizon, interactive settings where success depends on reusing experience accumulated over extended histories. However, existing agent memory systems are fundamentally constrained by text-context budgets: storing or revisiting raw trajectories is prohibitively token-expensive, while summarization and text-only retrieval trade token savings for information loss and fragmented evidence. To address this limitation, we propose Optical Context Retrieval Memory (OCR-Memory), a memory framework that leverages the visual modality as a high-density representation of agent experience, enabling retention of arbitrarily long histories with minimal prompt overhead at retrieval time. Specifically, OCR-Memory renders historical trajectories into images annotated with unique visual identifiers. OCR-Memory retrieves stored experience via a locate-and-transcribe paradigm that selects relevant regions through visual anchors and retrieves the corresponding verbatim text, avoiding free-form generation and reducing hallucination. Experiments on long-horizon agent benchmarks show consistent gains under strict context limits, demonstrating that optical encoding increases effective memory capacity while preserving faithful evidence recovery.

[NLP-17] Zero-Shot to Full-Resource: Cross-lingual Transfer Strategies for Aspect-Based Sentiment Analysis

【Quick Read】: This paper addresses the performance bottleneck of aspect-based sentiment analysis (ABSA) in multilingual settings: mainstream approaches remain heavily English-centric and lack systematic evaluation and optimization for other languages. The key contribution is a systematic comparison of transformer architectures (small encoder models, sequence-to-sequence models, and fine-tuned large language models) under zero-resource, data-only, and full-resource settings, using cross-lingual transfer, code-switching, and machine translation. The study finds that fine-tuned LLMs perform best on complex generative tasks, while small encoder models remain competitive on simpler tasks in few-shot setups; moreover, different architectures call for different strategies: LLMs benefit most from multilingual joint training, whereas smaller models rely more on code-switching, yielding architecture-specific optimization paths for multilingual ABSA.

Link: https://arxiv.org/abs/2604.26619
Authors: Jakob Fehle, Nils Constantin Hellwig, Udo Kruschwitz, Christian Wolff
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Aspect-based Sentiment Analysis (ABSA) extracts fine-grained opinions toward specific aspects within text but remains largely English-focused despite major advances in transformer-based and instruction-tuned models. This work presents a multilingual evaluation of state-of-the-art ABSA approaches across seven languages (English, German, French, Dutch, Russian, Spanish, and Czech) and four subtasks (ACD, ACSA, TASD, ASQP). We systematically compare different transformer architectures under zero-resource, data-only, and full-resource settings, using cross-lingual transfer, code-switching and machine translation. Fine-tuned Large Language Models (LLMs) achieve the highest overall scores, particularly in complex generative tasks, while few-shot counterparts approach this performance in simpler setups, where smaller encoder models also remain competitive. Cross-lingual training on multiple non-target languages yields the strongest transfer for fine-tuned LLMs, while smaller encoder or seq-to-seq models benefit most from code-switching, highlighting architecture-specific strategies for multilingual ABSA. We further contribute two new German datasets, an adapted GERestaurant and the first German ASQP dataset (GERest), to encourage multilingual ABSA research beyond English.

[NLP-18] Translating Under Pressure: Domain-Aware LLMs for Crisis Communication

【Quick Read】: This paper addresses poor translation quality in multilingual disaster communication, caused by the scarcity of high-quality parallel corpora. The key solution is a domain-adaptive pipeline: a small reference corpus is first expanded by retrieving and filtering data from general corpora; the resulting dataset is used to fine-tune a small language model for disaster-domain translation; and preference optimization then biases outputs toward CEFR A2-level English. The approach significantly improves readability while maintaining translation adequacy, showing that simplified English combined with domain adaptation can serve as a practical lingua franca for emergency communication when full multilingual coverage is not feasible.

Link: https://arxiv.org/abs/2604.26597
Authors: Antonio Castaldo, Maria Carmen Staiano, Johanna Monti, Sheila Castilho, Francesca Chiusaroli
Institutions: University of Pisa; University of Naples “L’Orientale”; University of Macerata; Dublin City University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:Timely and reliable multilingual communication is critical during natural and human-induced disasters, but developing effective solutions for crisis communication is limited by the scarcity of curated parallel data. We propose a domain-adaptive pipeline that expands a small reference corpus, by retrieving and filtering data from general corpora. We use the resulting dataset to fine-tune a small language model for crisis-domain translation and then apply preference optimization to bias outputs toward CEFR A2-level English. Automatic and human evaluation shows that this approach improves readability, while maintaining strong adequacy. Our results indicate that simplified English, combined with domain adaptation, can function as a practical lingua franca for emergency communication when full multilingual coverage is not feasible.

[NLP-19] Multimodal LLMs are not all you need for Pediatric Speech Language Pathology

【Quick Read】: This paper addresses the clinical classification and automatic identification of speech sound disorders (SSD) in children, against a backdrop of severe staffing shortages and unmanageable caseloads for speech-language pathologists. The key solution is a hierarchical cascading classification approach that refines from binary classification to type and symptom classification; by fine-tuning speech representation models (SRMs) with targeted data augmentation, it mitigates the biases identified by previous work and improves upon all clinical tasks on the multi-task SLPHelmUltraSuitePlus benchmark. The approach also improves automatic speech recognition (ASR), with SRMs outperforming LLM-based state-of-the-art methods by a large margin across all evaluated tasks.

Link: https://arxiv.org/abs/2604.26568
Authors: Darren Fürst, Sebastian Steindl, Ulrich Schäfer
Institutions: Ostbayerische Technische Hochschule Amberg-Weiden
Categories: Computation and Language (cs.CL)
Notes:

Abstract: Speech Sound Disorders (SSD) affect roughly five percent of children, yet speech-language pathologists face severe staffing shortages and unmanageable caseloads. We test a hierarchical approach to SSD classification on the granular multi-task SLPHelmUltraSuitePlus benchmark. We propose a cascading approach from binary classification to type, and symptom classification. By fine-tuning Speech Representation Models (SRM), and using targeted data augmentation we mitigate biases found by previous works, and improve upon all clinical tasks in the benchmark. We also apply our data augmentation approach to Automatic Speech Recognition (ASR). Our results demonstrate that SRM consistently outperform the LLM-based state-of-the-art across all evaluated tasks by a large margin. We publish our models and code to foster future research.

[NLP-20] TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models ACL2026

【Quick Read】: This paper addresses language confusion in large language models (LLMs) in multilingual settings, i.e., the failure to consistently generate responses in the intended language. Existing sequence-level fine-tuning methods (such as DPO, ORPO, and GRPO) mitigate the problem but often cause unintended degradation of general model capabilities. The key solution is a Token-Level Policy Optimization (TLPO) framework that identifies error-prone positions, explores alternative candidate tokens, and applies a tailored objective to suppress error-inducing outputs at a fine-grained level, significantly improving language consistency without compromising the model's general abilities.

Link: https://arxiv.org/abs/2604.26553
Authors: Jinho Choo, JunSeung Lee, Jimyeong Kim, Yeeho Song, S. K. Hong, Yeong-Dae Kwon
Institutions: Samsung SDS
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted to the main conference of ACL 2026

Abstract:Large language models (LLMs) demonstrate strong multilingual capabilities, yet often fail to consistently generate responses in the intended language, exhibiting a phenomenon known as language confusion. Prior mitigation approaches based on sequence-level fine-tuning, such as DPO, ORPO, and GRPO, operate at the level of entire responses and can lead to unintended degradation of general model capabilities, motivating the need for more fine-grained alternatives. To address this, we introduce Token-Level Policy Optimization (TLPO), a fine-tuning framework designed to mitigate language confusion through localized, token-level updates. TLPO identifies error-prone positions, explores alternative candidate tokens, and updates the policy using a tailored objective to suppress error-inducing outputs at a granular level. This selective intervention enables effective mitigation of language confusion without compromising the model’s general abilities. Experiments on multiple multilingual LLMs across diverse languages demonstrate that TLPO significantly outperforms baselines in improving language consistency while preserving downstream task accuracy.
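TLPO's first step, locating error-prone positions, is not detailed in this summary. As a loose illustration only (the heuristic below is an assumption, not the paper's method), one simple proxy flags token positions whose script does not match a Latin-script target language:

```python
def confusion_mask(tokens):
    """Flag token positions whose letters fall outside the basic Latin
    range: a crude proxy for spotting language-confusion errors when the
    intended output language uses Latin script."""
    def latin_only(tok):
        letters = [c for c in tok if c.isalpha()]
        return all(ord(c) < 0x250 for c in letters)
    return [not latin_only(t) for t in tokens]

print(confusion_mask(["The", "answer", "是", "42"]))
# → [False, False, True, False]
```

A mask like this marks where a token-level objective could intervene, leaving the rest of the sequence (and the model's general behavior) untouched.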

[NLP-21] Text-Utilization for Encoder-dominated Speech Recognition Models

【Quick Read】: This paper investigates how to efficiently exploit text-only data to improve speech recognition, focusing on encoder-dominated architectures that enable faster recognition. The key techniques are modality matching and dynamic downsampling, which map text data to the same encoder-level representations as speech input, strengthening the model's language modeling without increasing decoder complexity. Experiments show that a larger encoder with a smaller decoder can match or surpass architectures with large decoders, and that simple configurations such as random duration models often outperform complex alternatives, significantly simplifying the training pipeline.

Link: https://arxiv.org/abs/2604.26514
Authors: Albert Zeyer, Tim Posielek, Ralf Schlüter, Hermann Ney
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Notes:

Abstract:This paper investigates efficient methods for utilizing text-only data to improve speech recognition, focusing on encoder-dominated models that facilitate faster recognition. We provide a comprehensive comparison of techniques to integrate text-only data, including modality matching and dynamic downsampling to reach text-level representations within the encoder. Our experiments on the LibriSpeech corpus show that a larger encoder with a smaller decoder can equal or surpass the performance of architectures with larger decoders. We demonstrate that simple configurations, such as random duration models, are often more effective than complex alternatives, significantly simplifying the training pipeline. All code and recipes are made publicly available.
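Dynamic downsampling to text-level representations can be pictured as pooling encoder frames according to per-token durations. A toy sketch, assuming scalar "frames" and fixed durations standing in for a random duration model's samples:

```python
def downsample(frames, durations):
    """Average consecutive encoder frames into one representation per
    token, given per-token frame counts (durations)."""
    assert sum(durations) == len(frames)
    out, i = [], 0
    for d in durations:
        chunk = frames[i:i + d]
        out.append(sum(chunk) / d)
        i += d
    return out

# 6 scalar "frames" pooled into 3 token-level representations.
print(downsample([1.0, 3.0, 2.0, 4.0, 5.0, 5.0], [2, 2, 2]))
# → [2.0, 3.0, 5.0]
```

Once speech frames are pooled to token granularity, text-only data can supervise the same representation level, which is the modality-matching idea the abstract describes.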

[NLP-22] SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

【Quick Read】: This paper addresses the security risk that adversarial prompts pose to large language models (LLMs) used in academic peer review: malicious users can embed crafted instructions in submissions to manipulate review outcomes, threatening scholarly integrity. The key solution is a novel adversarial framework in which a Generator model and a Defender model are jointly optimized: the Generator creates sophisticated attack prompts while the Defender learns to detect them, and the two co-evolve through a loss function inspired by Information Retrieval Generative Adversarial Networks, enabling the Defender to continuously adapt to evolving attack strategies and substantially improving robustness against novel and unseen threats.

Link: https://arxiv.org/abs/2604.26506
Authors: Yuan Xin, Yixuan Weng, Minjun Zhu, Ying Ling, Chengwei Qin, Michael Hahn, Michael Backes, Yue Zhang, Linyi Yang
Institutions: CISPA; Westlake University; Southern University of Science and Technology; HKUST (Guangzhou); Saarland University
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Notes: 10 pages, 3 figures, 9 tables

Abstract:As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts – adversarial instructions embedded in submissions to manipulate outcomes – emerges as a critical threat to scholarly integrity. To counter this, we propose a novel adversarial framework where a Generator model, trained to create sophisticated attack prompts, is jointly optimized with a Defender model tasked with their detection. This system is trained using a loss function inspired by Information Retrieval Generative Adversarial Networks, which fosters a dynamic co-evolution between the two models, forcing the Defender to develop robust capabilities against continuously improving attack strategies. The resulting framework demonstrates significantly enhanced resilience to novel and evolving threats compared to static defenses, thereby establishing a critical foundation for securing the integrity of peer review.

[NLP-23] StarDrinks: An English and Korean Test Set for SLU Evaluation in a Drink Ordering Scenario LREC2026

【Quick Read】: This paper addresses the fact that evaluation of large language models (LLMs) and speech assistants for task-oriented interaction still relies on controlled scenarios that fail to reflect the diversity and complexity of real user requests. The proposed StarDrinks test set covers English and Korean speech utterances, transcriptions, and annotated slots, supporting speech-to-slots (SLU), transcription-to-slots (NLU), and speech-to-transcription (ASR) evaluation. The key contribution is a dataset rich in diverse named entities, customizations, and spontaneous speech phenomena (such as hesitations and self-corrections), providing a realistic benchmark for model robustness and generalization in a linguistically rich, real-world task.

Link: https://arxiv.org/abs/2604.26500
Authors: Marcely Zanon Boito, Caroline Brun, Inyoung Kim, Denys Proux, Salah Ait-Mokhtar, Nikolaos Lagos, Jean-Luc Meunier, Ioan Calapodescu
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Notes: Accepted at LREC 2026

Abstract: LLMs and speech assistants are increasingly used for task-oriented interactions, yet their evaluation often relies on controlled scenarios that fail to capture the variability and complexity of real user requests. Drink ordering, for example, involves diverse named entities, drink types, sizes, customizations, and brand-specific terminology, as well as spontaneous speech phenomena such as hesitations and self-corrections. To address this gap, we introduce StarDrinks, a test set in English and Korean containing speech utterances, transcriptions, and annotated slots. Our dataset supports speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR evaluation, providing a realistic benchmark for model robustness and generalization in a linguistically rich, real-world task.

[NLP-24] Theory-Grounded Evaluation Exposes the Authorship Gap in LLM Personalization

【Quick Read】: This paper addresses the lack of benchmarks grounded in authorship science for evaluating the stylistic personalization of generative AI. Current evaluations are largely ad hoc and provide no calibrated reference, biasing judgments of whether a model truly imitates a specific author's writing style. The key contribution is a theory-grounded evaluation framework: the LUAR metric, based on authorship verification theory, provides calibrated baselines with absolute meaning (a human ceiling of 0.756 and a cross-author floor of 0.626), exposing an "authorship gap" in which all existing methods score below the floor (0.484-0.508). The three measurement traditions show near-zero pairwise correlations (|r| < 0.07), underscoring how theoretical grounding determines the stability and interpretability of evaluation conclusions.

Link: https://arxiv.org/abs/2604.26460
Authors: Yash Ganpat Sawant
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Notes: 6 pages, 2 figures, 2 tables

Abstract:Stylistic personalization - making LLMs write in a specific individual’s style, rather than merely adapting to task preferences - lacks evaluation grounded in authorship science. We show that grounding evaluation in authorship verification theory transforms what benchmarks can measure. Drawing on three measurement traditions - LUAR, a trained authorship verification model; an LLM-as-judge with decoupled trait matching; and classical function-word stylometrics - we evaluate four inference-time personalization methods across 50 authors and 1,000 generations. The theory-grounded metric, LUAR, provides what ad hoc alternatives cannot: calibrated baselines, with a human ceiling of 0.756 and a cross-author floor of 0.626, that give scores absolute meaning. All methods score below this floor, from 0.484 to 0.508, exposing an authorship gap invisible to uncalibrated metrics. The three metrics produce near-zero pairwise correlations, with absolute r less than 0.07, confirming that without theoretical grounding, metric choice determines conclusions: an LLM judge declares a clear winner while LUAR finds no meaningful differentiation. These findings demonstrate the theory-benchmark cycle in action: authorship theory exposes evaluation failures that ad hoc benchmarks miss.
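Of the three measurement traditions above, classical function-word stylometrics is the simplest to sketch: represent each text by its relative function-word frequencies and compare the vectors by cosine similarity. The word list here is a tiny illustrative subset, not the feature set used in the paper.

```python
import math
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "it"]

def style_vector(text):
    """Relative frequencies of a small, fixed function-word list."""
    counts = Counter(text.lower().split())
    total = max(sum(counts.values()), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

a = style_vector("the cat sat on the mat and it was warm")
b = style_vector("the dog lay by the fire and it slept")
# Both sentences share the same function-word profile, so similarity is 1.
print(round(cosine(a, b), 3))  # → 1.0
```

That such shallow features correlate poorly with a trained verification model like LUAR is exactly the kind of metric disagreement the paper reports.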

[NLP-25] Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation

【速读】: 该论文旨在解决古典梵文文献数字化过程中因标注资源稀缺而导致的命名实体识别(Named Entity Recognition, NER)难题。现有方法虽尝试利用通用大语言模型(Large Language Models, LLMs)进行数据增强,但普遍存在错误率高、缺乏古典语法推理深度的问题。其解决方案的关键在于构建了一个高质量的银标准(silver standard)梵文NER数据集Naamah,包含102,942句文本,并提出一种结合DBpedia实体抽取与240亿参数混合推理生成模型的方法,以生成语法自然且语义多样化的合成训练数据,从而提升模型在古典梵语文本上的NER性能。

链接: https://arxiv.org/abs/2604.26456
作者: Akhil Rajeev P,Annarao Kulkarni
机构: Centre for Development of Advanced Computing (C-DAC)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The digitisation of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition. While recent methodologies utilise generic Large Language Models (LLMs) for data augmentation, these approaches remain prone to error and often lack the reasoning depth required for classical grammar. In this work, we introduce Naamah, a high quality silver standard Sanskrit NER dataset comprising 102,942 sentences. We propose a methodology that combines entity extraction from DBpedia with the generative capabilities of a 24B parameter hybrid reasoning model to create grammatically natural and synthetically diverse training data. We utilize this dataset to benchmark two transformer architectures: the massive multilingual XLM RoBERTa and the parameter efficient IndicBERTv2.

[NLP-26] EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses

【速读】: 该论文旨在解决现有语音情感描述(Speech Emotion Captioning, SEC)系统在人类-代理交互中对动态情感过渡建模不足的问题,即当前方法仅能实现孤立句子内的静态单情感表征,而忽略了话语层面的情感演化过程。其解决方案的关键在于提出一种情感过渡感知的语音描述框架(Emotion Transition-Aware Speech Captioning, EmoTransCap),通过构建首个大规模显式捕捉话语级情感过渡的数据集,并结合多任务情感过渡识别(Multi-Task Emotion Transition Recognition, MTETR)模型与大语言模型(LLM)驱动的双版本标注(描述型与指令型),实现基于声学特征和时间线索的语义丰富描述生成,同时引入可控的话语级情感语音合成系统,从而支持具备时序动态性和精细情感表达能力的智能对话代理。

链接: https://arxiv.org/abs/2604.26417
作者: Shuhao Xu,Yifan Hu,Jingjing Wu,Zhihao Du,Zheng Lian,Rui Liu
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注: 15 pages, 5 figures, including appendix

点击查看摘要

Abstract:Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, existing systems remain limited to static, single-emotion characterization within isolated sentences, neglecting dynamic emotional transitions at the discourse level. To address this gap, we propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To construct a dataset rich in emotion transitions while enabling scalable expansion, we design an automated pipeline for dataset creation. This is the first large-scale dataset explicitly designed to capture discourse-level emotion transitions. To generate semantically rich descriptions, we incorporate acoustic attributes and temporal cues from discourse-level speech. Our Multi-Task Emotion Transition Recognition (MTETR) model performs joint emotion transition detection and diarization. Leveraging the semantic analysis capabilities of LLMs, we produce two annotation versions: descriptive and instruction-oriented. These data and annotations offer a valuable resource for advancing emotion perception and emotional expressiveness. The dataset enables speech captions that capture emotional transitions, facilitating temporal-dynamic and fine-grained emotion understanding. We also introduce a controllable, transition-aware emotional speech synthesis system at the discourse level, enhancing anthropomorphic emotional expressiveness and supporting emotionally intelligent conversational agents.

[NLP-27] When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

【速读】: 该论文旨在解决生成式 AI(Generative AI)中基于隐藏状态的推测解码(speculative decoding)在长程预测时准确率下降的问题,即“长程衰减”(long-range decay)。现有方法虽通过测试时训练(test-time training, TTT)缓解了训练-推理不匹配,但衰减现象仍持续存在。作者从上下文信息保留的角度重新审视该问题,指出目标模型的键值缓存(KV cache)作为显式上下文保留了完整的 token 级别表示,而隐藏状态重用则因注意力查询驱动的压缩机制导致历史信息丢失,不利于后续多步预测。解决方案的关键在于提出“KV 重用假设”(KV-Reuse Hypothesis),即允许草稿模型复用目标模型的 KV 缓存以获得更丰富的长期信号。为此,作者构建了 KVShot 诊断框架对比三种重用策略:仅隐藏状态、仅 KV 缓存和混合方式,实验证明 KV 重用能显著提升长程接受率。进一步分析揭示两大结构瓶颈:浅层草稿模型难以精准估计目标查询,且草稿侧 KV 投影梯度稀疏,表明要充分发挥 KV-aware 解码潜力需突破当前 TTT 训练范式,转向块级训练(block-wise training)路径。

链接: https://arxiv.org/abs/2604.26412
作者: Tianyu Liu,Yuhao Shen,Xinyi Hu,Baolin Zhang,Hengxin Zhang,Jun Dai,Jun Zhang,Shuang Ge,Lei Chen,Yue Li,MingCheng Wan
机构: Qwen Applications Business Group of Alibaba(阿里巴巴通义应用业务组); University of Science and Technology of China(中国科学技术大学); Zhejiang University(浙江大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Speculative decoding accelerates LLM inference, but SOTA hidden-state-based drafters suffer from long-range decay: draft accuracy degrades as the speculative step increases. Existing work attributes this decay to train-inference mismatch and proposes test-time training (TTT) as a remedy, yet we observe that long-range decay persists even in TTT-trained drafters. We revisit long-range decay from the perspective of context information preservation. In hidden-state reuse, we argue the target hidden state acts as a biased context compression: it aggregates historical token information according to the attention query at the current position, yielding a compact representation optimized for immediate next-token prediction. This compression can suppress information less relevant to the current query but important for later speculative steps. In contrast, the target model’s KV cache serves as an explicit context, retaining the complete set of token-wise KV representations. We therefore posit the KV-Reuse Hypothesis: allowing the draft model to reuse the target KV cache can provide richer signals for long-horizon drafting. To test this hypothesis, we introduce KVShot, a diagnostic framework that compares three reuse paradigms: hidden-only, KV-only, and hybrid. Extensive evaluations on Qwen3-8B show that KV-Reuse improves long-range acceptance, although end-to-end speedups remain marginal under current training pipelines. Our analysis identifies two key structural bottlenecks: shallow drafters struggle to estimate target queries accurately, and draft-side KV projections receive sparse gradient signals. These findings suggest that realizing the full potential of KV-aware decoding requires moving beyond TTT toward block-wise training paradigms. By exposing these bottlenecks, KVShot provides a foundational diagnostic testbed and a clear roadmap for designing next-generation inference architectures.

[NLP-28] SG-UniBuc-NLP at SemEval-2026 Task 6: Multi-Head RoBERTa with Chunking for Long-Context Evasion Detection SEMEVAL-2026

【速读】: 该论文旨在解决政治访谈回答中语义模糊性与回避策略识别的问题,具体任务包括粗粒度清晰度分类(3类)和细粒度回避策略识别(9类)。其核心挑战在于响应文本常超过标准Transformer编码器的512-token限制,导致信息丢失。解决方案的关键在于采用重叠滑动窗口分块策略(overlapping sliding-window chunking)对长文本进行分割,并通过元素级最大池化(element-wise Max-Pooling)聚合各块表示,从而保留关键语义信息;同时使用共享RoBERTa-large编码器配合多任务学习目标联合训练两个任务特定头,并在推理阶段结合7折分层交叉验证的集成策略,显著提升了模型性能。

链接: https://arxiv.org/abs/2604.26375
作者: Gabriel Stefan,Sergiu Nisioi
机构: University of Bucharest (布加勒斯特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to SemEval-2026 (Task 6: CLARITY: Unmasking Political Question Evasions)

点击查看摘要

Abstract:We describe our system for SemEval-2026 Task 6 (CLARITY: Unmasking Political Question Evasions), which classifies English political interview responses by coarse-grained clarity (3-way) and fine-grained evasion strategy (9-way). Since responses frequently exceed the 512-token limit of standard Transformer encoders, we apply an overlapping sliding-window chunking strategy with element-wise Max-Pooling aggregation over chunk representations. A shared RoBERTa-large encoder supplies two task-specific heads trained jointly via a multi-task objective, with inference-time ensembling over 7-fold stratified cross-validation. Our system achieves a Macro-F1 of 0.80 on Subtask 1 and 0.51 on Subtask 2, ranking 11th in both subtasks.
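上述重叠滑动窗口分块与逐元素最大池化的聚合方式,可以用如下示意代码说明(窗口大小 512 对应编码器上限,步长 384 与 768 维向量为演示用假设,向量以随机数模拟编码器输出):

```python
import numpy as np

def chunk_tokens(tokens, window=512, stride=384):
    """将超长 token 序列切成带重叠的窗口(stride < window 即产生重叠)。"""
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return chunks

def max_pool(chunk_embeddings):
    """对各块向量做逐元素最大池化, 聚合为整段文本的单一表示。"""
    return np.max(np.stack(chunk_embeddings), axis=0)

tokens = list(range(1000))                           # 模拟 1000 个 token 的长回答
chunks = chunk_tokens(tokens)                        # 3 个窗口, 相邻窗口重叠 128 个 token
embeddings = [np.random.rand(768) for _ in chunks]   # 模拟每块的编码器输出
pooled = max_pool(embeddings)
print(len(chunks), pooled.shape)                     # 3 (768,)
```

最大池化保证任意一块中的显著特征都能进入最终表示,这也是其相对平均池化在长文本分类中常被采用的原因。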

[NLP-29] Text Style Transfer with Machine Translation for Graphic Designs

【速读】: 该论文旨在解决图形设计中文本风格迁移的词对齐问题(word alignment),即在翻译过程中保持原文本的排版样式与视觉结构,以确保译文能够无缝融入原设计。其核心挑战在于实现高精度的词级对齐,从而保证翻译后的文本在字体、大小、位置等样式上与原文一致。解决方案的关键在于利用商业可用的神经机器翻译(NMT)和大语言模型(LLM)技术,提出三种改进方法:1)在NMT中引入自定义输入输出标签以显式编码文本样式;2)在LLM中采用类似标签机制;3)结合NMT翻译与LLM基于unigram映射的后处理策略构成混合方案。实验表明,尽管LLM和NMT单独方法表现有限,但混合方法性能接近甚至媲美注意力头(attention head)这一强基线,验证了所提方法在图形设计场景中的实用性。

链接: https://arxiv.org/abs/2604.26361
作者: Deergh Singh Budhauria,Sanyam Jain,Rishav Agarwal,Tracy King
机构: Adobe(Adobe); Adobe(Adobe); Adobe(Adobe); Adobe(Adobe)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Globalization of graphic designs such as those used in marketing materials and magazines is increasingly important for communication to broad audiences. To accomplish this, the textual content in the graphic designs needs to be accurately translated and have the text styling preserved in order to fit visually into the design. Preserving text styling requires high-accuracy word alignment between the original and the translated text. The problem of word alignment between source and translated text is long known. The industry standards for extracting word alignments are defined by Giza++ and attention probabilities from neural machine translation (NMT) models. In this paper, we explore three new methods to tackle the word alignment problem for transferring text styles from the source to the translated text. The proposed methods are developed on top of commercially available NMT and LLM translation technologies. They include: NMT with custom input and output tags for text styling; LLM with custom input and output tags; and a hybrid that uses NMT for translation followed by an LLM with unigram mappings. To analyze the performance of these solutions, their alignment results are compared with the results of an attention head approach to gauge their usability in graphic design applications. Interestingly, the strong attention-head baseline proves more accurate than the LLM or NMT approach and on par with the hybrid NMT+LLM approach.
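混合方案中 unigram 映射的基本思路,可以简化为如下示意:借助源词到译词的词典定位对齐位置,再把源词的样式属性传给对应译词(词典、例句与样式结构均为演示用假设,并非原文实现):

```python
# 假设的 unigram 词典: 源语言词 -> 目标语言词(演示用, 非真实系统词表)
unigram_map = {"new": "nouvelles", "red": "rouge", "shoes": "chaussures"}

def transfer_styles(src_words, tgt_words, src_styles):
    """按 unigram 映射把源词的样式属性(粗体、字号等)传给对齐的译词。"""
    tgt_styles = {}
    for i, sw in enumerate(src_words):
        mapped = unigram_map.get(sw.lower())
        if mapped is None:
            continue
        for j, tw in enumerate(tgt_words):
            if tw.lower() == mapped and j not in tgt_styles:
                tgt_styles[j] = src_styles.get(i, {})
                break
    return tgt_styles

src = ["New", "red", "shoes"]
tgt = ["Nouvelles", "chaussures", "rouge"]   # 译文词序发生变化
styles = transfer_styles(src, tgt, {1: {"bold": True}})
print(styles)  # {0: {}, 2: {'bold': True}, 1: {}}
```

即使译文词序与原文不同,基于词典的对齐仍能把"red"的粗体样式正确落到"rouge"上,这正是样式迁移所需的词级对齐。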

[NLP-30] Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中产生的高计算开销问题,特别是对推理轨迹(reasoning traces)中词级别信息结构缺乏深入理解的现状。其核心问题是:如何在不牺牲推理准确性的前提下,压缩推理过程并提升可解释性。解决方案的关键在于识别推理token的两类功能类型——低熵的结构型token(structural tokens,即重复出现的支撑性短语)和高熵的有机型token(organic tokens,即问题特异性内容),并提出一种模型无关的压缩流水线:利用模型自身推理轨迹进行跨词BPE合并以生成超令牌(supertokens),再通过监督微调使模型学习使用这些超令牌。实验表明,该方法平均缩短推理轨迹8.1%,且在五个数学推理基准上无显著准确性损失;同时,超令牌作为可解释的推理动作标注(如回溯、验证、策略切换),揭示了正确与错误推理轨迹之间的系统性差异,为强化学习中的奖励塑造和早期停止提供了诊断信号。

链接: https://arxiv.org/abs/2604.26355
作者: Zhenyu Zhao,Sander Land,Dan Bikel,Waseem Alshikh
机构: Writer, Inc.
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reasoning in Large Language Models incurs significant inference-time compute, yet the token-level information structure of reasoning traces remains underexplored. We observe that reasoning tokens split into two functional types: low-entropy structural tokens (recurring phrases that scaffold the reasoning process) and higher-entropy organic tokens (problem-specific content that drives toward a solution). This asymmetry motivates a simple, model-agnostic compression pipeline: apply cross-word BPE merges on a model’s own reasoning traces to derive supertokens that capture frequent structural patterns, then teach the model to adopt them via supervised fine-tuning. Across three model families and five mathematical reasoning benchmarks, our approach shortens reasoning traces by 8.1% on average with no statistically significant accuracy loss on any model–benchmark pair. Beyond compression, supertokens act as interpretable reasoning-move annotations (backtracking, verification, strategy shifts), exposing the model’s high-level strategy at a glance. Analyzing transitions between structural categories reveals systematic differences between correct and incorrect traces: correct traces show productive recovery (backtracking followed by strategy shifts and verification), while incorrect traces are dominated by confusion cycles (repeated hedging and unresolved contradictions). These diagnostic signals suggest applications in reward shaping and early stopping for RL-based reasoning training.
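跨词 BPE 合并的核心操作,即统计推理轨迹中最频繁的相邻 token 对并将其合并为超令牌,可以用如下简化示意说明(语料为假设的推理片段,真实流水线会迭代执行多轮合并):

```python
from collections import Counter

def most_frequent_pair(traces):
    """统计所有推理轨迹中相邻 token 对的频次, 返回最高频的一对。"""
    pairs = Counter()
    for tokens in traces:
        pairs.update(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """把序列中出现的指定 token 对合并为一个超令牌。"""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

traces = [
    ["let", "me", "check", "this", "again"],
    ["let", "me", "verify", "the", "result"],
    ["so", "let", "me", "recompute"],
]
pair = most_frequent_pair(traces)                # ('let', 'me') 出现 3 次
print(merge_pair(traces[0], pair))               # ['let_me', 'check', 'this', 'again']
```

高频的"脚手架"短语(如 let me)正对应文中低熵的结构型 token,合并后一次前向即可生成原本多个 token。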

[NLP-31] A Dual-Task Paradigm to Investigate Sentence Comprehension Strategies in Language Models

【速读】: 该论文旨在解决语言模型(Language Models, LMs)在句子理解策略上是否受认知资源限制影响的问题,以及现有方法未能直接平衡记忆存储与句子处理资源这一核心问题。其解决方案的关键在于提出一种双任务范式,将算术计算任务与句子理解任务结合(如“2 cocktail + blended 3 =…”),通过引入对工作记忆资源的竞争性约束,促使GPT-4o、o3-mini和o4-mini等模型从基于形式结构的解析转向基于语义合理性(plausibility-based comprehension)的推理方式,从而更贴近人类的认知机制。

链接: https://arxiv.org/abs/2604.26351
作者: Rei Emura,Saku Sugawara
机构: Tohoku University (东北大学); National Institute of Informatics (日本国立信息学研究所); The University of Tokyo (东京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language models (LMs) behave more like humans when their cognitive resources are restricted, particularly in predicting sentence processing costs such as reading times. However, it remains unclear whether such constraints similarly affect sentence comprehension strategies. Besides, existing methods do not directly target the balance between memory storage and sentence processing, which is central to human working memory. To address this issue, we propose a dual-task paradigm that combines an arithmetic computation task with a sentence comprehension task, such as “The 2 cocktail + blended 3 =…” Our experiments show that under dual-task conditions, GPT-4o, o3-mini, and o4-mini shift toward plausibility-based comprehension, mirroring humans’ rational inference. Specifically, these models show a greater accuracy gap between plausible sentences (e.g., “The cocktail was blended by the bartender”) and implausible sentences (e.g., “The bartender was blended by the cocktail”) in the dual-task condition compared to the single-task conditions. These findings suggest that constraints on the balance between memory and processing resources promote rational inference in LMs. More broadly, they support the view that human-like sentence comprehension fundamentally arises from the allocation of limited cognitive resources.

[NLP-32] DSIPA: Detecting LLM -Generated Texts via Sentiment-Invariant Patterns Divergence Analysis

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成内容的检测难题,尤其针对现有方法在对抗性扰动、改写攻击和领域迁移下鲁棒性不足的问题。解决方案的关键在于提出一种无需训练的框架DSIPA,其核心思想是通过量化在受控风格变化下的情感分布稳定性来区分机器生成文本与人类写作:LLMs通常输出情绪一致性更高的文本,而人类写作则表现出更强的情感多样性。DSIPA利用两个无监督指标——情感分布一致性(sentiment distribution consistency)和情感分布保持性(sentiment distribution preservation),在零样本、黑盒场景下捕捉这种内在行为差异,无需访问模型参数或概率分布,从而实现高效、通用且抗干扰的内容识别。

链接: https://arxiv.org/abs/2604.26328
作者: Siyuan Li,Aodu Wulianghai,Guangyan Li,Xi Lin,Qinghua Mao,Yuliang Chen,Jun Wu,Jianhua Li
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Key Laboratory of Integrated Administration Technologies for Information Security (上海市信息安全综合管理技术重点实验室); Chinese Academy of Sciences (中国科学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) presents new security challenges, particularly in detecting machine-generated text used for misinformation, impersonation, and content forgery. Most existing detection approaches struggle with robustness against adversarial perturbation, paraphrasing attacks, and domain shifts, often requiring restrictive access to model parameters or large labeled datasets. To address this, we propose DSIPA, a novel training-free framework that detects LLM-generated content by quantifying sentiment distributional stability under controlled stylistic variation. It is based on the observation that LLMs typically exhibit more emotionally consistent outputs, while human-written texts display greater affective variation. Our framework operates in a zero-shot, black-box manner, leveraging two unsupervised metrics, sentiment distribution consistency and sentiment distribution preservation, to capture these intrinsic behavioral asymmetries without the need for parameter updates or probability access. Extensive experiments are conducted on state-of-the-art proprietary and open-source models, including GPT-5.2, Gemini-1.5-pro, Claude-3, and LLaMa-3.3. Evaluations on five domains, such as news articles, programming code, student essays, academic papers, and community comments, demonstrate that DSIPA improves F1 detection scores by up to 49.89% over baseline methods. The framework exhibits superior generalizability across domains and strong resilience to adversarial conditions, providing a robust and interpretable behavioral signal for secure content identification in the evolving LLM landscape.
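DSIPA"情感分布一致性"的直觉可以这样示意:对同一文本的多个风格变体分别得到情感分布,再取两两分布的平均 Jensen-Shannon 散度,值越低说明情绪越一致(分布数值为演示用假设,JS 散度也仅是散度度量的一种选择,并非原文指定实现):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon 散度: 值越小, 两个情感分布越一致。"""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_pairwise_js(dists):
    """多个风格变体情感分布的平均两两散度: LLM 文本通常显著更低。"""
    pairs = [(i, j) for i in range(len(dists)) for j in range(i + 1, len(dists))]
    return sum(js_divergence(dists[i], dists[j]) for i, j in pairs) / len(pairs)

# 假设的 [负面, 中性, 正面] 分布: LLM 改写间情绪稳定, 人类改写波动更大
llm_variants   = [[0.10, 0.20, 0.70], [0.12, 0.18, 0.70], [0.09, 0.21, 0.70]]
human_variants = [[0.10, 0.20, 0.70], [0.50, 0.30, 0.20], [0.20, 0.60, 0.20]]
print(mean_pairwise_js(llm_variants) < mean_pairwise_js(human_variants))  # True
```

这种度量无需访问模型参数或输出概率,只依赖文本本身的情感分析结果,与文中零样本、黑盒的设定一致。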

[NLP-33] Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control

【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在大规模语言模型(Large Language Models, LLMs)训练中普遍存在的性能饱和问题,其根源在于熵(entropy)的塌缩——即探索能力随训练进程逐渐减弱,导致模型难以持续提升性能。现有方法如正则化或裁剪虽试图缓解熵塌缩,但常引发长期训练中的熵曲线不稳定,限制了性能改进。论文提出的解决方案是Entrocraft,其关键在于采用一种简单的拒绝采样(rejection-sampling)机制,通过偏置优势分布(advantage distribution)实现用户自定义的熵调度(entropy schedule),无需任何目标函数正则化且与优势估计器无关。理论分析表明,在最小假设下,每步熵变化可由优势分布决定,这解释了现有RL及熵保持方法的行为;实验进一步验证Entrocraft能显著提升泛化能力、输出多样性与长程训练稳定性,并在4B模型上超越8B基线,使性能提升维持更长时间,pass@K指标提升50%。

链接: https://arxiv.org/abs/2604.26326
作者: Bolian Li,Yifan Wang,Yi Ding,Anamika Lochab,Ananth Grama,Ruqi Zhang
机构: Purdue University (普渡大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has unlocked complex reasoning abilities in large language models (LLMs). However, most RL algorithms suffer from performance saturation, preventing further gains as RL training scales. This problem can be characterized by the collapse of entropy, a key diagnostic for exploration in RL. Existing attempts have tried to prevent entropy collapse through regularization or clipping, but their resulting entropy curves often exhibit instability in the long term, which hinders performance gains. In this paper, we introduce Entrocraft, a simple rejection-sampling approach that realizes any user-customized entropy schedule by biasing the advantage distributions. Entrocraft requires no objective regularization and is advantage-estimator-agnostic. Theoretically, we relate per-step entropy change to the advantage distribution under minimal assumptions, which explains the behavior of existing RL and entropy-preserving methods. Entrocraft also enables a systematic study of entropy schedules, where we find that linear annealing, which starts high and decays to a slightly lower target, performs best. Empirically, Entrocraft addresses performance saturation, significantly improving generalization, output diversity, and long-term training. It enables a 4B model to outperform an 8B baseline, sustains improvement for up to 4x longer before plateauing, and raises pass@K by 50% over the baseline.
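熵坍缩与熵调度涉及的两个基本量,即单步策略熵与线性退火的目标熵,可以这样计算(退火的起止值为演示用假设,并非原文超参数):

```python
import math

def token_entropy(logits):
    """由单步 logits 计算策略分布的香农熵(单位 nat), 熵坍缩即该值趋近 0。"""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def linear_anneal(step, total, start=1.4, end=1.0):
    """文中表现最好的线性退火调度形态: 目标熵从较高起点缓慢降到稍低终点。"""
    return start + (end - start) * min(step / total, 1.0)

print(token_entropy([0.0, 0.0, 0.0, 0.0]))   # 均匀分布: ln(4) ≈ 1.3863
print(token_entropy([10.0, 0.0, 0.0, 0.0]))  # 分布接近坍缩: 熵接近 0
```

Entrocraft 通过对优势分布做拒绝采样来把实际熵曲线拉向这类用户指定的调度目标,而非直接在目标函数里加正则项。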

[NLP-34] A Systematic Comparison of Prompting and Multi-Agent Methods for LLM -based Stance Detection

【速读】: 该论文旨在解决立场检测(Stance Detection)任务中因数据划分、基础模型和评估协议不一致而导致的模型比较不公平的问题。其解决方案的关键在于构建一个系统性的对比实验框架,涵盖五种不同方法(三种基于提示推理的方法:Direct Prompting、Auto-CoT、StSQA;两种基于代理辩论的方法:COLA、MPRF),在四个数据集上的14个子任务中,使用来自六个模型家族的15种大语言模型(LLM),参数规模从7B到72B+不等,从而实现对当前主流策略的全面、公平评估。

链接: https://arxiv.org/abs/2604.26319
作者: Genan Dai,Zini Chen,Yi Yang,Bowen Zhang
机构: Shenzhen Technology University (深圳技术大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Stance detection identifies the attitude of a text author toward a given target. Recent studies have explored various LLM-based strategies for this task, from zero-shot prompting to multi-agent debate. However, existing works differ in data splits, base models, and evaluation protocols, making fair comparison difficult. We conduct a systematic comparison that evaluates five methods across two categories – prompt-based inference (Direct Prompting, Auto-CoT, StSQA) and agent-based debate (COLA, MPRF) – on four datasets with 14 subtasks, using 15 LLMs from six model families with parameter sizes from 7B to 72B+. Our experiments yield several findings. First, on all models with complete results, the best prompt-based method outperforms the best agent-based method, while agent methods require 7 to 12 times more API calls per sample. Second, model scale has a larger impact on performance than method choice, with gains plateauing around 32B. Third, reasoning-enhanced models (DeepSeek-R1) do not consistently outperform general models of the same size on this task.

[NLP-35] Classification of Public Opinion on the Free Nutritional Meal Program on YouTube Media Using the LSTM Method

【速读】: 该论文旨在解决如何利用深度学习方法对社交媒体(YouTube)上关于印尼“免费营养餐计划”(MBG)的公众意见进行情感分析的问题,以支持基于社交媒体的公共政策评估。其解决方案的关键在于采用长短期记忆网络(Long Short-Term Memory, LSTM)模型对7,733条评论文本进行分类,结果显示该模型在印尼语文本情感识别中具有89%的准确率,尤其在负向情感识别上表现优异(F1-score 0.94),但正向情感识别效果较差(F1-score 0.55),主要归因于数据集中的类别不平衡问题(负向样本占比达87.7%)。这表明LSTM在处理印尼语社交媒体文本时有效,但也凸显了数据分布不均对模型性能的影响。

链接: https://arxiv.org/abs/2604.26312
作者: Berliana Enda Putri,Lisa Diani Amelia,Muhammad Zaky Zaiddan,Luluk Muthoharoh,Ardika Satria,Martin Clinton Tosima Manullang
机构: Institut Teknologi Sumatera (苏门答腊理工学院)
类目: Computation and Language (cs.CL)
备注: 10 pages 3 figures 3 tables Conference submission on YouTube sentiment classification using LSTM for the Free Nutritious Meal Program

点击查看摘要

Abstract:Public opinion towards the Free Nutritious Meal Program (MBG) on YouTube social media reflects diverse community responses. This study applies the Long Short-Term Memory (LSTM) method to classify sentiments from 7,733 YouTube comments. The results show that the LSTM model achieves 89% accuracy, with strong performance on negative sentiment (F1-score 0.94) but weaker performance on positive sentiment (F1-score 0.55) due to class imbalance, as negative data account for 87.7% of the dataset. These findings confirm the effectiveness of LSTM for sentiment analysis of Indonesian text while highlighting the challenge of imbalanced data. This research contributes to social media-based public policy evaluation.
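针对文中 87.7% 负面样本造成的类别不平衡,一种常见的缓解手段(并非原文方法)是在损失中使用逆频率类权重,示意如下(三类计数按文中总数 7,733 与负面占比推算,中性/正面的具体划分为假设):

```python
def inverse_freq_weights(counts):
    """类权重 = 总样本数 / (类别数 × 该类样本数): 样本越少的类权重越大。"""
    total, k = sum(counts.values()), len(counts)
    return {c: total / (k * n) for c, n in counts.items()}

# 假设的三类计数: 负面 6782 / 7733 ≈ 87.7%, 与文中比例一致
counts = {"negative": 6782, "neutral": 500, "positive": 451}
weights = inverse_freq_weights(counts)
print({c: round(w, 2) for c, w in weights.items()})
# {'negative': 0.38, 'neutral': 5.16, 'positive': 5.72}
```

这类权重可直接传给 Keras 的 class_weight 参数或 PyTorch 交叉熵的 weight 参数,使正面类的低 F1 得到一定缓解。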

[NLP-36] Benchmarking PyCaret AutoML Against BiLSTM for Fine-Grained Emotion Classification: A Comparative Study on 20-Class Emotion Detection

【速读】: 该论文旨在解决细粒度情感分类(fine-grained emotion classification)问题,即在自然语言处理中准确识别如喜悦、愤怒、悲伤和恐惧等具体情绪类别。其解决方案的关键在于对比传统机器学习方法(如逻辑回归、朴素贝叶斯和支持向量机)与深度学习模型(包括双向长短期记忆网络 BiLSTM、门控循环单元 GRU 和轻量级 Transformer)在20类情感文本分类任务上的性能表现。实验表明,BiLSTM 在准确率(89%)和加权F1分数(0.89)上优于其他模型,证明了基于序列建模的深度学习方法能更有效地捕捉文本中的上下文情感线索,从而提升细粒度情感识别的准确性。

链接: https://arxiv.org/abs/2604.26310
作者: Arya Muda Siregar,Arielva Simon Siahaan,Haikal Fransisko Simbolon,Luluk Muthoharoh,Ardika Satria,Martin C.T. Manullang
机构: Institut Teknologi Sumatera (苏门答腊理工学院)
类目: Computation and Language (cs.CL)
备注: 7 pages, 2 figures, 3 tables. This paper compares machine learning and deep learning methods for 20-class emotion classification on an English text dataset of 79,595 samples

点击查看摘要

Abstract:Fine-grained emotion classification, which identifies specific emotional states such as happiness, anger, sadness, and fear, remains a challenging task in natural language processing. This study benchmarks classical machine learning and deep learning approaches for 20-class emotion classification using the 20-Emotion Text Classification Dataset containing 79,595 English sentences. On the machine learning side, Logistic Regression, Multinomial Naive Bayes, and Support Vector Machine are evaluated using TF-IDF features. On the deep learning side, Bidirectional Long Short-Term Memory, Gated Recurrent Unit, and a lightweight Transformer implemented in PyTorch are compared. The results show that BiLSTM achieves the best overall performance with 89% accuracy and a weighted F1-score of 0.89, slightly outperforming the best machine learning model, SVM, which reaches 88.11% accuracy. The findings indicate that while traditional machine learning models remain competitive and computationally efficient, sequence-based deep learning models better capture contextual emotional cues in text.
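文中机器学习基线的 TF-IDF 特征加分类器流程,用 scikit-learn 可以写成如下最小示意(玩具样本与三类标签均为演示用假设,实际实验为 20 类、79,595 条英文句子):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 演示用玩具样本(实际数据集规模远大于此)
texts = ["I am so happy today", "this makes me furious",
         "I feel like crying", "what a joyful surprise",
         "I am angry about this", "tears keep falling"]
labels = ["joy", "anger", "sadness", "joy", "anger", "sadness"]

# TF-IDF 特征 + 逻辑回归, 对应文中的经典机器学习基线之一
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["she was very happy"]))
```

把 LogisticRegression 换成 MultinomialNB 或 LinearSVC 即可得到文中另外两个基线,这也解释了为何此类模型"保持竞争力且计算高效"。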

[NLP-37] Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training and Inference

【速读】: 该论文旨在解决大规模模型训练中因长序列和高参数量导致的显存瓶颈问题,特别是在传统张量并行(Tensor Parallelism, TP)与序列并行(Sequence Parallelism, SP)各自独立分配设备维度时,难以同时优化参数内存和激活内存占用的问题。解决方案的关键在于提出一种新的并行执行策略——张量与序列并行(Tensor and Sequence Parallelism, TSP),其核心是将TP和SP统一映射到同一设备轴上,使每个计算节点同时持有模型权重的一个子集(weight shard)和输入序列的一个子集(sequence shard),从而在单个设备维度上同步降低参数和激活内存开销。TSP通过两种运行时调度机制实现:注意力模块采用广播权重并行交换键值对以重建上下文,而门控前馈网络(gated MLP)则通过环形传递权重并本地累积部分输出,以此在增加通信量的前提下显著减少内存占用,为长序列建模和资源受限场景下的高效训练提供了一种硬件感知的并行方案。

链接: https://arxiv.org/abs/2604.26294
作者: Vasu Shyam,Anna Golubeva,Quentin Anthony
机构: Zyphra
类目: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:We present tensor and sequence parallelism (TSP), a parallel execution strategy that folds tensor parallelism and sequence parallelism onto a single device axis. In conventional multi-dimensional parallelism layouts, tensor parallelism (TP) shards model weights while sequence parallelism (SP) shards tokens, reducing per-device parameter or activation memory, respectively. Traditionally, each scheme is assigned its own mesh dimension. TSP instead assigns each rank both a weight shard and a sequence shard, reducing both parameter and activation memory along the same device axis. We implement this design with two runtime schedules. For attention, ranks iterate over broadcast parameter shards and reconstruct context through a sequence-wise key/value exchange. For gated MLPs, weight shards circulate in a ring while partial outputs accumulate locally. By sharding both weights and activations across the same devices, TSP trades additional communication volume for reduced memory overhead. We provide a theoretical communication and memory analysis, describe our implementation of TSP attention and gated MLP blocks, and benchmark TSP against TP, SP, and TP+SP. These results position TSP as a hardware-aware alternative for long-context and memory-constrained model training, and as a viable axis of parallelism in concert with existing parallelism schemes such as pipeline and expert parallelism for dense and mixture-of-expert models.

[NLP-38] Calibrated Surprise: An Information-Theoretic Account of Creative Quality

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在创意写作中如何实现高质量内容生成的问题,核心挑战在于如何在满足多维约束条件下使输出既具创新性又符合逻辑一致性。解决方案的关键在于提出“校准惊喜”(calibrated surprise)理论框架:通过整合伦理(ethos)、神话(mythos)、词汇(lexis)与理性(dianoia)等维度的约束,使可行解空间压缩至极小区域,从而使得剩余选择从无约束视角看具有最低概率但高度合理;该框架以香农互信息 $ I(X;Y) = H(X) - H(X|Y) $ 为分析工具,其中条件熵趋近于零对应“校准”,熵升高对应“惊喜”,二者共同构成创意质量的量化基础。动态链式规则进一步揭示每一步写作决策受前后文影响,无需人工调参即可自然分配信息权重,为创意质量对齐(Creative Quality Alignment, CQA)提供理论支撑与可操作路径。

链接: https://arxiv.org/abs/2604.26269
作者: Bo Zou,Chao Xu
机构: Nutcracker Studio
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 2 figures

点击查看摘要

Abstract:The essence of good creative writing is calibrated surprise: when constraints from all relevant dimensions act together, the feasible solution space collapses into a narrow region, and the surviving choices look least predictable from an unconstrained view. “Calibrated” has a precise meaning: the author’s intent, the reader’s reasonable expectation, and the logic of reality converge. When these three independent judgements agree on every dimension, the set of admissible writing choices is forced into a very small region. A mathematical corollary follows: full-dimensional accuracy and mediocrity are mutually exclusive – two sides of one constraint structure, not separate goals. We use Shannon’s mutual information I(X;Y) = H(X) - H(X|Y) as our analysis tool. “Calibrated” corresponds to conditional entropy going to zero; “surprise” to entropy going up; mutual information is the precise measure of the joint quantity. The argument rests on two pillars. Static: when constraints from ethos, mythos, lexis, and dianoia are imposed together, the admissible set collapses sharply, and surviving solutions show up as low-probability choices from an unconstrained view. Dynamic: the chain rule shows each writing choice is constrained by what came before and constrains what comes after; macro-level decisions naturally contribute a larger share of information, removing the need for hand-tuned weighting. Through case studies and lightweight LLM-logprob computations, we show the framework is both analytically useful and operational, laying the theoretical groundwork for Creative Quality Alignment (CQA) and a professional evaluation benchmark.
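文中作为分析工具的互信息 I(X;Y) = H(X) - H(X|Y),可在离散联合分布上用等价形式 I = H(X) + H(Y) - H(X,Y) 直接验证(联合分布数值为演示用假设):

```python
import math

def entropy(probs):
    """离散分布的香农熵(单位 bit)。"""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), 与 H(X) - H(X|Y) 等价。"""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    pxy = [p for row in joint for p in row]
    return entropy(px) + entropy(py) - entropy(pxy)

coupled = [[0.5, 0.0], [0.0, 0.5]]           # 知道 Y 即确定 X: H(X|Y)=0
independent = [[0.25, 0.25], [0.25, 0.25]]   # Y 不减少 X 的不确定性
print(mutual_information(coupled))       # 1.0 bit
print(mutual_information(independent))   # 0.0
```

完全耦合对应文中"校准"(条件熵为零,互信息达到 H(X) 的上限),完全独立对应互信息为零,两个极端之间正是"校准惊喜"的量化区间。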

[NLP-39] FlowBot: Inducing LLM Workflows with Bilevel Optimization and Textual Gradients

【速读】: 该论文旨在解决如何在数据驱动的基础上自动诱导和优化大语言模型(Large Language Models, LLMs)工作流的问题,以克服现有方法依赖人工设计流水线和提示所导致的部署瓶颈。其解决方案的关键在于将工作流诱导建模为一个双层优化问题:外层优化工作流的整体结构(即LLM调用的组织方式),内层逐个优化每个LLM调用;两者均通过“文本梯度”(textual gradients)进行优化,其中内层采用模块化方式,通过逐层反向传播文本梯度实现组件级优化。该方法被称为FlowBot,实验证明其发现的工作流在性能上可媲美基于人工或自动生成的强基线方案。

链接: https://arxiv.org/abs/2604.26258
作者: Hongyeon Yu,Young-Bum Kim,Yoon Kim
机构: Naver Search US; Massachusetts Institute of Technology
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:LLM workflows, which coordinate structured calls to individual LLMs (each augmented with varying instructions and tools) to achieve a particular goal, offer a promising path towards extending the capabilities of LLMs and building powerful systems that can tackle diverse tasks. However, existing approaches for building such workflows generally rely on human-crafted pipelines and prompts, which presents a substantial bottleneck in real-world deployment. How can we automatically induce and optimize such workflows in a data-driven way? This paper describes a simple data-driven approach for automatically inducing LLM workflows. We formulate workflow induction as a bilevel optimization problem: an outer loop which optimizes a high-level sketch of the workflow (in particular, how the LLM calls should be structured), and an inner loop which optimizes each individual LLM call one by one. Both loops are optimized with “textual gradients”, where for the inner loop we optimize each component in a modular way by “backpropagating” textual gradients layer by layer. We find that LLM workflows discovered through our FlowBot (workflow induction through bilevel optimization and textual gradients) approach perform competitively against strong baselines that make use of human-crafted or automatically-generated workflows.
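双层优化的控制流(外层变异工作流草图、内层逐组件细化)可以抽象为如下框架,其中 propose_sketch、refine_component、evaluate 均为假设的占位函数,玩具示例只演示循环结构,并非 FlowBot 的实际文本梯度实现:

```python
def bilevel_optimize(sketch, components, evaluate,
                     propose_sketch, refine_component,
                     outer_steps=3, inner_steps=2):
    """外层改进工作流结构, 内层逐个细化组件(真实系统中以文本梯度为反馈)。"""
    best_sketch, best_score = sketch, evaluate(sketch, components)
    for _ in range(outer_steps):
        candidate = propose_sketch(best_sketch)       # 外层: 结构层面的变异
        for _ in range(inner_steps):
            for i in range(len(components)):          # 内层: 逐组件细化
                components[i] = refine_component(candidate, components, i)
        score = evaluate(candidate, components)
        if score > best_score:                        # 保留更优的候选草图
            best_sketch, best_score = candidate, score
    return best_sketch, best_score

# 玩具示例: "评分"取草图与组件之和, "细化"即组件值加一
result = bilevel_optimize(
    sketch=0, components=[0, 0],
    evaluate=lambda s, c: s + sum(c),
    propose_sketch=lambda s: s + 1,
    refine_component=lambda s, c, i: c[i] + 1,
)
print(result)  # (3, 15)
```

真实系统中 evaluate 对应在训练数据上运行整条工作流,refine_component 对应把逐层"反向传播"得到的文本梯度写回单个 LLM 调用的提示。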

[NLP-40] StratMem-Bench: Evaluating Strategic Memory Use in Virtual Character Conversation Beyond Factual Recall ACL2026

【速读】: 该论文旨在解决当前虚拟角色对话系统中对记忆利用策略的不足问题,即现有基准测试(如记忆增强生成、长期对话等)将记忆视为静态事实库,而忽略了在对话中动态、战略性地调用记忆以满足事实需求和社交互动的能力。解决方案的关键在于构建StratMem-Bench这一新基准,其包含657个实例,模拟虚拟角色需从异构记忆池(含必要、支持性和无关记忆)中做出决策;同时提出一套多维度评估指标框架,包括严格记忆合规性(Strict Memory Compliance)、记忆整合质量(Memory Integration Quality)、主动丰富度评分(Proactive Enrichment Score)和条件无关率(Conditional Irrelevance Rate),从而系统性衡量模型在角色对话中战略记忆使用能力。实验表明,尽管大语言模型能较好区分必要与无关记忆,但在引入支持性记忆后决策表现显著下降,凸显了该问题的挑战性。

链接: https://arxiv.org/abs/2604.26243
作者: Yerong Wu,Tianxing Wu,Minghao Zhu,Hangyu Sha,Haofen Wang
机构: Southeast University (东南大学); Tongji University (同济大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, accepted by ACL 2026 (main)

点击查看摘要

Abstract:Achieving realistic human-like conversation for virtual characters requires not only simple memorization and recall of past events, but also the strategic utilization of memory to meet factual needs and social engagement. Current benchmarks related to memory utilization (e.g., memory-augmented generation and long-term dialogue) overlook this nuance, treating memory primarily as a static repository of facts rather than a dynamic resource to be strategically deployed in dialogues. To address this gap, we design StratMem-Bench, a new benchmark to evaluate strategic memory use in character-centric dialogues. This dataset comprises 657 instances where virtual characters must navigate heterogeneous memory pools containing required, supportive, and irrelevant memories. We also propose a framework with different evaluation metrics, including Strict Memory Compliance, Memory Integration Quality, Proactive Enrichment Score, and Conditional Irrelevance Rate, to evaluate the strategic memory use capabilities of virtual characters. Experiments on StratMem-Bench which leverage state-of-the-art large language models as virtual characters show that all models perform well at distinguishing between required and irrelevant memories, but struggle once supportive memories are introduced into the decision process.

[NLP-41] LATTICE: Evaluating Decision Support Utility of Crypto Agents

【速读】: 该论文旨在解决现有加密货币代理(crypto agent)评估基准主要聚焦于推理能力或结果导向评价,而忽视其在真实用户场景中辅助决策能力的问题。解决方案的关键在于提出LATTICE基准,通过定义六维决策支持属性、设计涵盖端到端加密货币协作者工作流的16类任务,并利用大语言模型(LLM)评判器自动评分,从而实现可扩展、无需专家标注或外部数据源的评估体系。该设计使评估具备持续审计与更新能力,同时聚焦于实际部署的生产级代理,凸显了编排(orchestration)和用户界面/用户体验(UI/UX)设计对代理质量的核心影响。

链接: https://arxiv.org/abs/2604.26235
作者: Aaron Chan,Tengfei Li,Tianyi Xiao,Angela Chen,Junyi Du,Xiang Ren
机构: Sahara AI(萨拉AI); University of Southern California(南加州大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 3 figures, 9 tables

点击查看摘要

Abstract:We introduce LATTICE, a benchmark for evaluating the decision support utility of crypto agents in realistic user-facing scenarios. Prior crypto agent benchmarks mainly focus on reasoning-based or outcome-based evaluation, but do not assess agents’ ability to assist user decision-making. LATTICE addresses this gap by: (1) defining six evaluation dimensions that capture key decision support properties; (2) proposing 16 task types that span the end-to-end crypto copilot workflow; and (3) using LLM judges to automatically score agent outputs based on these dimensions and tasks. Crucially, the dimensions and tasks are designed to be evaluable at scale using LLM judges, without relying on ground truth from expert annotators or external data sources. In lieu of these dependencies, LATTICE’s LLM judge rubrics can be continually audited and updated given new dimensions, tasks, criteria, and human feedback, thus promoting reliable and extensible evaluation. While other benchmarks often compare foundation models sharing a generic agent framework, we use LATTICE to assess production-level agents used in actual crypto copilot products, reflecting the importance of orchestration and UI/UX design in determining agent quality. In this paper, we evaluate six real-world crypto copilots on 1,200 diverse queries and report breakdowns across dimensions, tasks, and query categories. Our experiments show that most of the tested copilots achieve comparable aggregate scores, but differ more significantly on dimension-level and task-level performance. This pattern suggests meaningful trade-offs in decision support quality: users with different priorities may be better served by different copilots than the aggregate rankings alone would indicate. To support reproducible research, we open-source all LATTICE code and data used in this paper.

[NLP-42] A New Semisupervised Technique for Polarity Analysis using Masked Language Models

【速读】: 该论文旨在解决传统空间模型在文本分析中生成极性评分(polarity scores)时准确性、可解释性和一致性不足的问题。其解决方案的关键在于引入基于word2vec的掩码语言模型(masked language model)来改进潜在语义缩放(Latent Semantic Scaling, LSS)方法,通过预测种子词在特定上下文中的出现概率,生成更具概率意义的极性评分,从而提升文本分析的效果。

链接: https://arxiv.org/abs/2604.26230
作者: Kohei Watanabe
机构: 未知
类目: Computation and Language (cs.CL); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:I developed a new version of Latent Semantic Scaling (LSS) employing word2vec as a masked language model. Unlike original spatial models, it assigns polarity scores to words and documents as predicted probabilities of seed words to occur in given contexts. These probabilistic polarity scores are more accurate, interpretable and consistent than those spatial polarity models can produce in text analysis. I demonstrate these advantages by applying both probabilistic and spatial models to China Daily’s coverage of China and other countries during the coronavirus disease (COVID) pandemic in terms of achievement in health issues. The result suggests that more advanced masked language models would further improve the semisupervised machine learning technique.

[NLP-43] Comparative Analysis of AutoML and BiLSTM Models for Cyberbullying Detection on Indonesian Instagram Comments

【速读】: 该论文旨在解决印尼语Instagram评论中网络霸凌(cyberbullying)检测的问题,其核心挑战在于处理非正式语言文本的复杂性与多样性。解决方案的关键在于构建一个针对印尼语非正式文本定制的预处理流程(包括俚语标准化、停用词去除和词干提取),并系统比较传统机器学习方法(如朴素贝叶斯、逻辑回归和支持向量机)与深度学习模型(BiLSTM及其带Bahdanau注意力机制的变体)在该任务上的性能表现。结果表明,尽管深度学习模型能更有效地捕捉上下文模式,但经过优化的机器学习方法(特别是逻辑回归)在准确率上仍具竞争力,且更适合资源受限场景部署。

链接: https://arxiv.org/abs/2604.26229
作者: Raihana Adelia Putri,Aisyah Musfirah,Anggi Puspita Ningrum,Luluk Muthoharoh,Ardika Satria,Martin Clinton Tosima Manullang
机构: Institut Teknologi Sumatera (苏门答腊理工学院)
类目: Computation and Language (cs.CL)
备注: 7 pages, 5 tables, 2 figures. The manuscript presents a comparative study of machine learning and deep learning methods for Indonesian cyberbullying detection on Instagram comments

点击查看摘要

Abstract:This study compares machine learning and deep learning approaches for cyberbullying detection in Indonesian-language Instagram comments. Using a balanced dataset of 650 comments labeled as Bullying and Non-Bullying, the study evaluates Naive Bayes, Logistic Regression, and Support Vector Machine with TF-IDF features, as well as BiLSTM and BiLSTM with Bahdanau Attention. A preprocessing pipeline tailored to informal Indonesian text is applied, including slang normalization, stopword removal, and stemming. The results show that Logistic Regression performs best among the machine learning models, while BiLSTM with Attention achieves the strongest overall deep learning performance. The findings highlight the value of domain-specific preprocessing and show that although deep learning captures contextual patterns more effectively, machine learning remains a competitive option for resource-constrained deployments.
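The preprocessing pipeline the abstract describes (lowercasing, slang normalization, stopword removal; stemming omitted here) can be sketched with dictionary lookups. The `SLANG` and `STOPWORDS` entries below are illustrative assumptions, not the paper's actual resources:

```python
import re

SLANG = {"gak": "tidak", "ga": "tidak", "bgt": "banget"}   # illustrative entries
STOPWORDS = {"yang", "dan", "di", "ke", "itu"}             # illustrative entries

def preprocess(comment):
    """Sketch of a pipeline for informal Indonesian text:
    lowercase, tokenize, normalise slang, drop stopwords."""
    tokens = re.findall(r"[a-z]+", comment.lower())
    tokens = [SLANG.get(t, t) for t in tokens]             # slang -> standard form
    return [t for t in tokens if t not in STOPWORDS]
```

The cleaned token stream would then feed TF-IDF features for the machine learning models or an embedding layer for the BiLSTM variants.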

[NLP-44] Breaking the Autoregressive Chain: Hyper-Parallel Decoding for Efficient LLM-Based Attribute Value Extraction

【速读】: 该论文旨在解决文本生成任务中(如属性值抽取,Attribute Value Extraction, AVE)因标准自回归解码的串行特性导致的推理效率低下问题。其解决方案的关键在于提出了一种名为超并行解码(Hyper-Parallel Decoding, HPD)的新算法,通过利用批处理中的共享内存和计算资源,并借助位置ID操作实现无序token生成,从而在不牺牲输出质量的前提下显著提升解码效率。实验表明,HPD可在单个提示中并行解码最多96个token,使推理时间与成本降低达13.8倍,适用于所有大语言模型(LLM),且不限于AVE场景,具有广泛适用性。

链接: https://arxiv.org/abs/2604.26209
作者: Theodore Glavas,Nikhita Vedula,Dushyanta Dhyani,Yilun Zhu,Shervin Malmasi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Some text generation tasks, such as Attribute Value Extraction (AVE), require decoding multiple independent sequences from the same document context. While standard autoregressive decoding is slow due to its sequential nature, the independence between output sequences offers an opportunity for parallelism. We present Hyper-Parallel Decoding (HPD), a novel decoding algorithm that accelerates offline decoding by leveraging both shared memory and computation across batches. HPD enables out-of-order token generation through position ID manipulation, significantly improving efficiency. Experiments on AVE show that attribute-value pairs are conditionally independent, enabling us to parallelize value generation within each prompt. By further stacking multiple documents within a single prompt, we can decode in parallel up to 96 tokens per prompt. HPD works with all LLMs, and reduces both inference costs and total inference time by up to 13.8X without compromising output quality, potentially saving hundreds of thousands of dollars on industry AVE tasks. Although designed for attribute extraction, HPD makes no assumptions unique to the AVE domain and can in theory be applied to other scenarios with independent output structures.
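The position-ID manipulation can be sketched as a layout problem: give each independent continuation its own position range restarting right after the shared prompt, and mask cross-branch attention. The helper below is an illustrative sketch under those assumptions, not the paper's implementation:

```python
def hpd_layout(prompt_len, branch_lens):
    """Build position ids and a boolean attention mask for decoding
    several independent continuations of one shared prompt in a single
    flat sequence. Each branch restarts its position ids after the
    prompt and may attend only to the prompt and its own earlier tokens."""
    pos, branch_of = list(range(prompt_len)), [-1] * prompt_len
    for b, n in enumerate(branch_lens):
        pos.extend(range(prompt_len, prompt_len + n))  # positions restart per branch
        branch_of.extend([b] * n)
    total = len(pos)
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):                         # causal in the flat layout
            if branch_of[j] == -1 or branch_of[j] == branch_of[i]:
                mask[i][j] = True                      # prompt or same branch only
    return pos, mask
```

With a prompt of 3 tokens and two 2-token branches, both branches see positions 3 and 4, and neither branch can attend to the other's tokens.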

[NLP-45] Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在面对沙袋指令(sandbagging)时是否表现出基于位置的响应偏好,以及这种偏好是源于模型层面的位置主导策略还是数据集层面的干扰结构。其关键解决方案在于引入循环选项顺序随机化作为关键控制变量,并通过预注册的项级同字母诊断和支撑性分析,揭示出在沙袋指令下模型响应位置分布具有高度稳定性(Pearson相关系数 r = 0.9994),且主要集中在 E/F/G 位置形成低熵吸引子盆地,表明这是一种内容无关的、分布式的软吸引机制,而非确定性的位置追踪行为。这一发现为识别沙袋行为提供了可量化的黑箱行为特征。

链接: https://arxiv.org/abs/2604.26206
作者: Jon-Paul Cacioli
机构: Independent Researcher, Melbourne, Australia
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, 1 table. Pre-registered: this https URL . Code and data: this https URL

点击查看摘要

Abstract:A predecessor pilot (Cacioli, 2026) found that Llama-3-8B implements prompted sandbagging as positional collapse rather than answer avoidance. However, fixed option ordering in MMLU-Pro left open whether this reflected a model-level position-dominant policy or dataset-level distractor structure. This pre-registered follow-up (3 models, 2,000 MMLU-Pro items, 4 conditions, 24,000 primary trials) added cyclic option-order randomisation as the critical control. The pre-registered item-level same-letter diagnostic did not confirm deterministic position-tracking (same-letter rate 37.3%, below the 50% threshold). However, pre-specified supporting analyses revealed that the response-position distribution under sandbagging was highly stable under complete content rotation (Pearson r = 0.9994; Jensen-Shannon divergence = 0.027, compared to 0.386 between honest and sandbagging conditions). Accuracy spiked to 72.1% when the correct answer coincidentally occupied the preferred position E, and fell to 4.3% at position A. The data provide strong evidence for a soft distributional attractor: under sandbagging instruction, the model enters a low-entropy response-position basin centred on E/F/G that is highly stable and largely content-invariant at the aggregate level. Qwen-2.5-7B served as a negative control (non-compliant, no distributional shift). These results provide evidence, at the 7-9 billion parameter scale, that response-position entropy is a promising black-box behavioural signature of this sandbagging mode.
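The paper's key distributional statistic is the Jensen-Shannon divergence between response-position distributions (0.027 within sandbagging, 0.386 honest vs. sandbagging). A minimal sketch of that computation, with made-up position distributions standing in for the measured ones:

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence (base 2, in bits) between two
    discrete distributions over the same support."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Illustrative response-position distributions over ten options A-J.
honest  = [0.10] * 10                                  # roughly uniform
sandbag = [0.01, 0.01, 0.01, 0.01, 0.50, 0.25, 0.18,   # mass piled on E/F/G
           0.01, 0.01, 0.01]
```

A low JSD between two sandbagging runs with rotated option content, next to a high JSD against the honest condition, is exactly the attractor signature the paper reports.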

[NLP-46] Flashback: A Reversible Bilateral Run-Peeling Decomposition of Strings

【速读】: 该论文旨在解决字符串压缩与分解中的高效表示问题,特别是如何在保持可逆性的同时最小化分词数量。其核心挑战在于设计一种既能精确还原原始字符串、又具备理论最优性(即达到最低可能的token数量)的双边字符运行(run)剥离方案。解决方案的关键在于提出了一种名为Flashback的可逆字符串分解方法:通过反复剥离输入字符串两端的最大字符运行,并将每对剥离的运行记录为一个双边token(bilateral token),从而实现O(n)时间与空间复杂度的分解和重构。论文的核心贡献是“运行配对定理”(run-pairing theorem),证明了该方法等价于将字符串的第一个运行与最后一个配对、第二个与倒数第二个配对……以此类推,从而严格得出token数量为1 + [r/2](r为最大字符运行数),并首次证明此结果达到了任意合法双边剥离方案的理论下界。

链接: https://arxiv.org/abs/2604.26190
作者: Thomas Konstantinovsky,Gur Yaari
机构: Bar-Ilan University (巴伊兰大学); Yale School of Medicine (耶鲁医学院)
类目: Data Structures and Algorithms (cs.DS); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce Flashback, a reversible string decomposition that repeatedly peels the maximal leading and trailing character runs from a sentinel-wrapped input, recording each pair as one bilateral token. Decomposition and reconstruction both run in O(n) time and space. Our central result is a run-pairing theorem: Flashback is equivalent to pairing the first run of the string with the last, the second with the second-to-last, and so on. This gives an exact token count of 1+[r/2] for a string with r maximal runs, and matches a lower bound that holds for any admissible bilateral run-peeling scheme. From the run-pairing theorem the main structural properties follow as corollaries: the irreducible peeling kernel uses at most two symbols; palindromes are precisely the strings whose run-length encoding is symmetric with an odd number of runs; the image of the decomposition admits an explicit finite-state characterisation; and changing one run length rewrites exactly one content token.
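The bilateral run-peeling loop is simple enough to sketch directly. The functions below are an illustrative reading of the described procedure, without the paper's sentinel wrapping, so the token count comes out as ceil(r/2) rather than the paper's 1+[r/2] bookkeeping:

```python
def leading_run(s):
    """Maximal run of the first character of a non-empty string."""
    i = 1
    while i < len(s) and s[i] == s[0]:
        i += 1
    return s[:i]

def flashback_decompose(s):
    """Repeatedly peel the maximal leading and trailing runs,
    recording each pair as one bilateral token."""
    tokens = []
    while s:
        lead = leading_run(s)
        rest = s[len(lead):]
        if not rest:                          # a single remaining run: kernel token
            tokens.append((lead, ""))
            break
        trail = leading_run(rest[::-1])[::-1]
        tokens.append((lead, trail))
        s = rest[:len(rest) - len(trail)]
    return tokens

def flashback_reconstruct(tokens):
    """Invert the decomposition by re-wrapping tokens inside-out."""
    out = ""
    for lead, trail in reversed(tokens):
        out = lead + out + trail
    return out
```

Note how the output exhibits the run-pairing theorem: the first run of the string is paired with the last, the second with the second-to-last, and so on.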

[NLP-47] Evergreen: Efficient Claim Verification for Semantic Aggregates

【速读】: 该论文旨在解决生成式 AI(Generative AI)在语义聚合(semantic aggregation)过程中产生的声明(claims)缺乏事实依据的问题,即这些声明可能无法准确反映底层数据关系,而传统验证方法因涉及复杂量化、分组与比较操作,难以在大模型(LLM)上下文窗口限制下高效执行。解决方案的关键在于提出 Evergreen 系统,将声明验证重构为语义查询处理任务,并引入两类优化:一是验证感知优化(如早期停止、相关性排序和置信度序列估计),减少不必要的 LLM 调用;二是通用语义查询优化(如算子融合、相似性过滤和提示缓存),提升执行效率。此外,Evergreen 通过基于一阶逻辑的半环(semiring)溯源机制提供最小证据集作为引用,确保验证结果可解释且可信。实验证明,该方案在保持 F1=1.00 高质量的同时,相较未优化方法显著降低 3.2 倍成本与 4.0 倍延迟,即使使用弱 LLM 也能优于强 LLM 基线,在成本与延迟上取得显著优势。

链接: https://arxiv.org/abs/2604.26180
作者: Alexander W. Lee,Benjamin Han,Shayak Sen,Sam Yeom,Ugur Cetintemel,Anupam Datta
机构: Brown University (布朗大学); Snowflake Inc. (雪花公司)
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With recent semantic query processing engines, semantic aggregation has become a primitive operator, enabling the reduction of a relation into a natural language aggregate using an LLM. However, the resulting semantic aggregate may contain claims that are not grounded in the underlying relation. Verifying such claims is challenging: they often involve quantifiers, groupings, and comparisons over relations that far exceed LLM context windows and require a costly combination of semantic and symbolic processing. We present Evergreen, a system that recasts claim verification as a semantic query processing task with tailored optimizations and provenance capture. Evergreen compiles each claim into a declarative semantic verification query and executes it on the same engine that produced the aggregate. To reduce cost and latency, Evergreen avoids unnecessary LLM calls through verification-aware optimizations (early stopping, relevance sorting, and estimation with confidence sequences) and general-purpose optimizations for semantic queries (operator fusion, similarity filtering, and prompt caching). Each verdict is accompanied by citations that identify a minimal set of tuples justifying the result, with semantics based on semiring provenance for first-order logic. On a benchmark of real-world restaurant review datasets reflecting production-inspired workloads, Evergreen achieves excellent verification quality (F1 = 1.00) with a strong LLM while reducing cost by 3.2x and latency by 4.0x compared to unoptimized verification. Even with a significantly weaker LLM, Evergreen outperforms a strong LLM-as-a-judge baseline in F1 at 48x lower cost and 2.3x lower latency. Relative to a retrieval-augmented agent, Evergreen compares favorably in F1 and latency with similar cost when both use a strong LLM; yet, with a much weaker LLM, it achieves the same F1 at 63x lower cost and 4.2x lower latency. 
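Among the verification-aware optimizations, early stopping is the most self-contained to illustrate: for a claim of the form "at least k rows satisfy P", scanning can stop as soon as the verdict is decided either way, saving calls to the expensive predicate (an LLM in Evergreen's case). This is an illustrative sketch of the idea, not Evergreen's implementation:

```python
def verify_at_least(rows, pred, k):
    """Early-stopping check of the claim 'at least k rows satisfy pred'.
    Returns (verdict, number_of_pred_calls); stops as soon as the claim
    is proven true, or provably false even if all remaining rows match."""
    hits, calls = 0, 0
    for i, row in enumerate(rows):
        hits += bool(pred(row))
        calls += 1
        if hits >= k:                              # claim already proven
            return True, calls
        if hits + (len(rows) - i - 1) < k:         # claim already refuted
            return False, calls
    return False, calls
```

Relevance sorting (putting likely matches first) would make the early-true exit fire sooner, which is how the two optimizations compose.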

[NLP-48] CacheRAG: A Semantic Caching System for Retrieval-Augmented Generation in Knowledge Graph Question Answering

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的知识图谱问答(Knowledge Graph Question Answering, KGQA)系统中存在的“无状态规划”问题,即这些系统在每次查询时独立生成检索计划,缺乏对历史查询模式的利用,导致Schema幻觉(schema hallucinations)和检索覆盖范围有限。其解决方案的关键在于提出CacheRAG架构,通过引入三个面向LLM上下文的创新设计原则:(1) 基于中间语义表示(Intermediate Semantic Representation, ISR)的无Schema依赖用户接口,实现自然语言交互与本地Schema安全绑定;(2) 采用分层索引(领域→方面)结合最大边际相关性(Maximal Marginal Relevance, MMR)的多样性优化缓存检索机制,提升缓存示例的结构多样性以缓解推理同质化;(3) 设计有界启发式扩展策略,通过确定性的子图操作保障复杂度上限,显著提高召回率且避免API执行失控。实验证明,CacheRAG在多个基准测试中优于现有最先进方法(如CRAG数据集上准确率提升13.2%,真实性提升17.5%)。

链接: https://arxiv.org/abs/2604.26176
作者: Yushi Sun,Lei Chen
机构: HKUST (香港科技大学); HKUST(GZ) (香港科技大学(广州))
类目: Databases (cs.DB); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) has significantly advanced Knowledge Graph Question Answering (KGQA). However, existing LLM-driven KGQA systems act as stateless planners, generating retrieval plans in isolation without exploiting historical query patterns: analogous to a database system that optimizes every query from scratch without a plan cache. This fundamental design flaw leads to schema hallucinations and limited retrieval coverage. We propose CacheRAG, a systematic cache-augmented architecture for LLM-based KGQA that transforms stateless planners into continual learners. Unlike traditional database plan caching (which optimizes for frequency), CacheRAG introduces three novel design principles tailored for LLM contexts: (1) Schema-agnostic user interface: A two-stage semantic parsing framework via Intermediate Semantic Representation (ISR) enables non-expert users to interact purely in natural language, while a Backend Adapter grounds the LLM with local schema context to compile executable physical queries safely. (2) Diversity-optimized cache retrieval: A two-layer hierarchical index (Domain → Aspect) coupled with Maximal Marginal Relevance (MMR) maximizes structural variety in cached examples, effectively mitigating reasoning homogeneity. (3) Bounded heuristic expansion: Deterministic depth and breadth subgraph operators with strict complexity guarantees significantly enhance retrieval recall without risking unbounded API execution. Extensive experiments on multiple benchmarks demonstrate that CacheRAG significantly outperforms state-of-the-art baselines (e.g., +13.2% accuracy and +17.5% truthfulness on the CRAG dataset).
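Of the three design principles, the MMR-based cache retrieval is the most self-contained to illustrate: greedily trade query relevance against similarity to already-selected cache entries. A minimal sketch with toy vectors (the `lam` trade-off parameter and cosine similarity are standard MMR choices, not details from the paper):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_select(query, candidates, k, lam=0.7):
    """Greedy Maximal Marginal Relevance: pick k candidate indices,
    each step maximising lam*relevance - (1-lam)*redundancy."""
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            rel = cosine(query, candidates[i])
            red = max((cosine(candidates[i], candidates[j]) for j in selected),
                      default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a small `lam`, a near-duplicate of an already-selected entry is skipped in favour of a structurally different one, which is the "reasoning homogeneity" mitigation in miniature.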

[NLP-49] Entropy Centroids as Intrinsic Rewards for Test-Time Scaling

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在测试时计算资源扩展中的响应质量提升问题,尤其针对现有基于外部奖励模型的选择方法存在训练成本高和额外计算开销的问题。其核心解决方案是提出一种基于内在不确定性的选择机制——通过识别推理过程中高熵片段(High Entropy Phase, HEP)的时序结构来构建一个稳定且可解释的内在奖励信号。关键创新在于定义了“熵质心”(Entropy Centroid),即所有HEP沿生成轨迹的加权平均位置,用以量化模型不确定性的时间分布模式;实验表明,较低的熵质心对应早期探索、后期自信生成的策略,与高质量响应强相关,从而实现了无需外部奖励模型即可稳定提升多任务场景下不同规模模型的响应质量。

链接: https://arxiv.org/abs/2604.26173
作者: Wenshuo Zhao,Qi Zhu,Xingshan Zeng,Fei Mi,Lifeng Shang,Yiren Feng
机构: HKUST (香港科技大学); ZJU (浙江大学); Huawei (华为)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under Review, 39 pages

点击查看摘要

Abstract:An effective way to scale up test-time compute of large language models is to sample multiple responses and then select the best one, as in Grok Heavy and Gemini Deep Think. Existing selection methods often rely on external reward models, which requires training a strong reward model and introduces additional computation overhead. As an alternative, previous approaches have explored intrinsic signals, such as confidence and entropy, but these signals are noisy with naive aggregation. In this work, we observe that high-entropy tokens tend to cluster into consecutive groups during inference, providing a more stable notion of model uncertainty than individual tokens. Together, these clusters reveal temporal patterns of model uncertainty throughout the inference process. Motivated by this observation, we propose to use the temporal structure of uncertainty as an intrinsic reward. To this end, we first formalize the basic unit of segment-level uncertainty as the High Entropy Phase (HEP), a variable-length segment that begins at a high-entropy token and ends when consecutive low-entropy tokens appear. We then define the Entropy Centroid, inspired by the concept of the center of mass in physics, as the weighted average position of all HEPs along the trajectory. Intuitively, a lower centroid indicates early exploration followed by confident generation, which we find often corresponds to higher response quality. Based on this insight, we propose the Lowest Centroid method, which selects the response with the lowest entropy centroid among multiple candidates. Experiments on mathematics, code generation, logical reasoning, and agentic tasks, across model scales ranging from 14B to 480B, show that Lowest Centroid consistently outperforms existing baselines and delivers stable gains as model size increases. Code is available at this https URL.
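The HEP segmentation and centroid can be sketched from the definitions in the abstract. The thresholds `hi`, `lo`, and `patience`, and the use of each phase's entropy mass and midpoint as weights, are illustrative assumptions rather than the paper's calibration:

```python
def high_entropy_phases(entropies, hi=1.0, lo=0.5, patience=2):
    """Segment a per-token entropy trace into High Entropy Phases (HEPs):
    a phase opens at a token with entropy >= hi and closes once
    `patience` consecutive tokens fall below lo."""
    phases, start, calm = [], None, 0
    for t, h in enumerate(entropies):
        if start is None:
            if h >= hi:
                start, calm = t, 0
        else:
            if h < lo:
                calm += 1
                if calm >= patience:
                    phases.append((start, t - patience))
                    start, calm = None, 0
            else:
                calm = 0
    if start is not None:
        phases.append((start, len(entropies) - 1))
    return phases

def entropy_centroid(entropies, **kw):
    """Entropy-mass-weighted average HEP position, normalised to [0, 1]."""
    phases = high_entropy_phases(entropies, **kw)
    if not phases:
        return 0.0
    mass = pos = 0.0
    for a, b in phases:
        m = sum(entropies[a:b + 1])        # entropy "mass" of the phase
        mass += m
        pos += m * (a + b) / 2             # weighted by the phase midpoint
    return pos / mass / max(len(entropies) - 1, 1)
```

Selecting the best of several sampled responses is then simply `min(traces, key=entropy_centroid)`: the trace that explores early and generates confidently afterwards wins.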

[NLP-50] EvoSelect: Data-Efficient LLM Evolution for Targeted Task Adaptation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在目标任务上高效且有效适应时面临的挑战,尤其是由于高质量人工标注数据获取成本高、难以扩展所导致的训练信号稀释甚至性能下降问题。现有方法依赖于迭代生成-训练循环,但合成数据常存在噪声、冗余或与目标任务分布不匹配等问题,影响模型优化效果。解决方案的关键在于提出一种新的“迭代生成-选择-训练”范式,其中引入一个数据选择步骤以过滤低质量样本;进一步地,作者设计了EvoSelect框架,通过联合建模任务对齐度与多样性来筛选训练数据:利用最优传输(optimal transport)结合代理梯度表示估计候选样本与目标任务分布的对齐程度,并引入去冗余机制促进互补样本覆盖,从而实现LLM向目标任务的渐进式演化。实验证明,该方法在弱或强数据生成器下均显著优于现有数据选择策略。

链接: https://arxiv.org/abs/2604.26170
作者: Ting-Wei Li,Sirui Chen,Jiaru Zou,Yingbing Huang,Tianxin Wei,Jingrui He,Hanghang Tong
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Adapting large language models (LLMs) to a targeted task efficiently and effectively remains a fundamental challenge. Such adaptation often requires iteratively improving the model toward a targeted task, yet collecting high-quality human-labeled data to support this process is costly and difficult to scale. As a result, synthetic data generation has emerged as a flexible and scalable alternative. One straightforward approach is through an iterative generation-training loop, where candidate data are synthesized through an external generator, the model is updated using these data and the process is repeated over iterations. However, generated samples can be noisy, highly redundant, or even misaligned with the targeted task distribution. Training indiscriminately on such data can dilute useful learning signals and even degrade model performance. To address this, we introduce a refined paradigm, namely an iterative generation-selection-training loop, which incorporates a selection step prior to model updates. Building on this paradigm, we propose EvoSelect, a data-efficient framework to evolve LLM effectively. Given candidate samples produced by the data generator, EvoSelect selects training data by jointly modeling targeted task alignment and diversity. We estimate task relevance through optimal transport with proxy gradient representations, which quantifies how well candidate samples align with the targeted task distribution. To mitigate redundancy, we incorporate a diversification mechanism that promotes coverage of complementary training samples. By interleaving alignment and diversification, EvoSelect enables progressive LLM evolution toward targeted tasks. Extensive experiments on various benchmarks demonstrate that with either weak or strong data generators, EvoSelect consistently improves adaptation efficacy over existing data selection methods.

[NLP-51] Test-Time Safety Alignment

【速读】: 该论文旨在解决如何通过优化输入词嵌入(word embeddings)来控制对齐模型(aligned models)的输出行为,以降低其生成内容的语义有害性(semantic harmfulness)。这类模型通常具有不平衡的双模态输出分布(refuse-or-comply),与开放文本生成模型的平滑分布不同,因此传统基于词嵌入的控制方法难以直接适用。解决方案的关键在于采用零阶梯度估计(zeroth-order gradient estimation)技术,利用黑盒文本审核API对输入嵌入进行梯度近似,并通过梯度下降法迭代优化嵌入,从而最小化生成文本的有害性。实验表明,该方法能够有效中和标准安全基准测试中所有被标记为不安全的响应。

链接: https://arxiv.org/abs/2604.26167
作者: Baturay Saglam,Dionysis Kalogerias
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent work has shown that a model’s input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. However, this has only been demonstrated for pretrained text-completion models on the relatively simple objective of reducing surface-level profanity in short continuations. A natural and practically important question is how well input embeddings can control aligned models, which produce an imbalanced bimodal refuse-or-comply output distribution rather than the smooth distribution characteristic of open-ended generation. We explore this in the context of safety, showing that input word embeddings can be optimized in a sub-lexical manner to minimize the semantic harmfulness of aligned model responses. Our approach uses zeroth-order gradient estimation of a black-box text-moderation API with respect to the input embeddings, and then applies gradient descent on these embeddings to minimize the harmfulness of the generated text. Experiments show that the proposed method can neutralize every safety-flagged response on standard safety benchmarks.
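The core loop, a two-point zeroth-order gradient estimate through a black-box scorer followed by gradient descent on the input embedding, can be sketched on a toy objective. The quadratic `harm` function below stands in for the moderation API and is purely illustrative:

```python
import random

def zeroth_order_descent(f, x, steps=300, mu=1e-3, lr=0.05, seed=0):
    """Minimise a black-box scalar f (e.g. a moderation API's harm score)
    over an input embedding x, using two-point random-direction
    gradient estimates: g ~ (f(x + mu*u) - f(x - mu*u)) / (2*mu)."""
    rng = random.Random(seed)
    x = list(x)
    for _ in range(steps):
        u = [rng.gauss(0, 1) for _ in x]                 # random probe direction
        fp = f([xi + mu * ui for xi, ui in zip(x, u)])
        fm = f([xi - mu * ui for xi, ui in zip(x, u)])
        g = (fp - fm) / (2 * mu)                         # directional derivative
        x = [xi - lr * g * ui for xi, ui in zip(x, u)]   # descend along u
    return x

# Toy stand-in for a moderation score: quadratic bowl with minimum at the origin.
harm = lambda v: sum(vi * vi for vi in v)
x_star = zeroth_order_descent(harm, [2.0, -3.0])
```

Only function evaluations of `f` are needed, which is what makes the approach compatible with a closed, black-box text-moderation API.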

[NLP-52] Structural Generalization on SLOG without Hand-Written Rules

【速读】: 该论文旨在解决语义解析中的结构泛化(structural generalization)问题,即系统如何将已学习的组合规则应用于训练中未见的新颖结构组合。现有方法要么依赖手工编写的代数规则(如AM-Parser),要么在结构泛化上表现不佳(如基于Transformer的模型)。本文提出一种无需手工定义组合规则的解决方案,其核心是一个带有离散瓶颈(discrete bottleneck)的神经细胞自动机(neural cellular automaton, NCA),通过局部迭代从数据中自动学习所有组合规则。实验表明,在SLOG基准测试中,该方法在17个结构泛化类别中的11个达到100%类型精确匹配,且整体标准差仅为0.2(优于AM-Parser的4.3),证明了其在结构泛化能力上的显著提升。

链接: https://arxiv.org/abs/2604.26157
作者: Zichao Wei
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Structural generalization in semantic parsing requires systems to apply learned compositional rules to novel structural combinations. Existing approaches either rely on hand-written algebraic rules (AM-Parser) or fail to generalize structurally (Transformer-based models). We present an alternative requiring no hand-written compositional rules, based on a neural cellular automaton (NCA) with a discrete bottleneck: all compositional rules are learned from data through local iteration. On the SLOG benchmark, the system achieves 100% type-exact match on 11 of 17 structural generalization categories, including three where AM-Parser scores 0 to 74%, with an overall standard deviation of 0.2 across 10 seeds (vs. AM-Parser's 4.3). Analysis reveals that all 5,539 failure instances reduce to exactly two mechanisms: novel combinations of wh-extraction context with reduced verb types, and modifiers appearing on the subject side. When we decompose results by CCG structural features, each sub-pattern either succeeds on all instances or fails on all. Intermediate scores (e.g., 41.4%) are mixtures of structurally distinct CCG patterns, not partial successes. All failures correspond to directed operations absent from training; all successes correspond to operations already covered.

[NLP-53] HIVE: Hidden-Evidence Verification for Hallucination Detection in Diffusion Large Language Models

【速读】: 该论文旨在解决扩散大语言模型(Diffusion Large Language Models, D-LLMs)在多步去噪生成过程中产生的幻觉信号难以被有效检测的问题。现有检测方法主要依赖于输出不确定性或粗粒度的轨迹统计特征,无法充分捕捉D-LLMs内部更丰富的隐藏动态信息。其解决方案的关键在于提出HIVE框架——通过从去噪轨迹中提取压缩的隐藏证据(hidden evidence),选择具有信息量的步骤-层级证据,并利用前缀嵌入(prefix embeddings)将这些证据条件化到验证器语言模型上,从而生成连续的幻觉评分和结构化的验证输出(包括幻觉类型、证据对及简短推理)。该方法显著优于八种强基线,在多个问答基准上实现高达0.9236 AUROC和0.9537 AUPRC,证明了基于去噪轨迹中选择性隐藏证据的信号比仅依赖输出不确定性或粗粒度统计更具判别力与可用性。

链接: https://arxiv.org/abs/2604.26139
作者: Guoshenghui Zhao,Weijie Zhao,Tan Yu
机构: Rochester Institute of Technology (罗切斯特理工学院); NVIDIA Corporation (英伟达公司)
类目: Computation and Language (cs.CL)
备注: 5 figures, appendix included

点击查看摘要

Abstract:Diffusion large language models (D-LLMs) generate text through multi-step denoising, where hallucination signals may emerge throughout the trajectory rather than only in the final output. Existing detectors mainly rely on output uncertainty or coarse trace statistics, which often fail to capture the richer hidden dynamics of D-LLMs. We propose HIVE, a hidden-evidence verification framework that extracts compressed hidden evidence from denoising trajectories, selects informative step-layer evidence, and conditions a verifier language model on the selected evidence through prefix embeddings. HIVE produces both a continuous hallucination score from verifier decision logits and structured verification outputs, including hallucination types, evidence pairs, and short rationales. Across two D-LLMs and three QA benchmarks, HIVE consistently outperforms eight strong baselines and achieves up to 0.9236 AUROC and 0.9537 AUPRC. Ablation studies further confirm the importance of hidden-evidence conditioning, learned evidence selection, two-stream evidence representation, and step-layer embeddings. These results suggest that selected hidden evidence from denoising trajectories provides a stronger and more usable hallucination signal than output-only uncertainty or coarse trace statistics.

[NLP-54] SWE-Edit: Rethinking Code Editing for Efficient SWE-Agent

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在软件工程任务中因标准代码编辑接口导致的上下文耦合问题(context coupling problem),即代码检查、修改规划与执行操作被混杂在同一上下文窗口中,迫使代理(agent)在探索性浏览与格式化编辑生成之间频繁切换,从而引入无关信息并降低性能。解决方案的关键在于提出SWE-Edit框架,通过将代码编辑过程拆分为两个专用子代理:Viewer(提取任务相关的代码片段)和Editor(基于高层计划执行修改),使主代理专注于推理,同时将高上下文密集型操作委托给独立的、干净的上下文窗口。此外,作者还发现传统“查找替换”格式易出错,并采用GRPO训练Qwen3-8B以自适应选择编辑模式,显著提升了编辑效率与准确性。

链接: https://arxiv.org/abs/2604.26102
作者: Yikai Zhang,Jiaxin Pei,Kenan Li,Maoquan Wang,Jin Pan,Yu Kang,Shengyu Fu,Elsie Nallipogu,Junjie Hu,Yufan Huang,Zijian Jin
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language model agents have achieved remarkable progress on software engineering tasks, yet current approaches suffer from a fundamental context coupling problem: the standard code editing interface conflates code inspection, modification planning, and edit execution within a single context window, forcing agents to interleave exploratory viewing with strictly formatted edit generation. This causes irrelevant information to accumulate and degrades agent performance. To address this, we propose SWE-Edit, which decomposes code editing into two specialized subagents: a Viewer that extracts task-relevant code on demand, and an Editor that executes modifications from high-level plans–allowing the main agent to focus on reasoning while delegating context-intensive operations to clean context windows. We further investigate what makes an effective editing model: observing that the prevalent find-and-replace format is error-prone, we train Qwen3-8B with GRPO to adaptively select editing modes, yielding improved editing efficiency over single-format baselines. On SWE-bench Verified, SWE-Edit improves resolved rate by 2.1% while reducing inference cost by 17.9%. We additionally propose a code editing benchmark that reliably predicts downstream agentic performance, providing practical guidance for editing model selection. Our code is publicly available at this https URL.
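The failure modes of the find-and-replace format that motivate training an adaptive editor can be made concrete with a strict edit applier. This is an illustrative sketch, not SWE-Edit's actual editor:

```python
def apply_edit(source, find, replace):
    """Apply one find-and-replace code edit, rejecting the two common
    failure modes of the format: the target snippet is missing from the
    file, or it is ambiguous (matches more than once)."""
    n = source.count(find)
    if n == 0:
        raise ValueError("edit failed: snippet not found")
    if n > 1:
        raise ValueError(f"edit failed: snippet ambiguous ({n} matches)")
    return source.replace(find, replace)
```

An agent using this format must reproduce the target snippet exactly and uniquely, which is precisely where models tend to slip and why a trained Editor that can pick a different editing mode helps.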

[NLP-55] From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)安全性评估中因采用二元指标(如攻击成功率、拒绝率或有害/无害响应分类)而掩盖输入与输出之间风险变化的问题。其核心挑战在于如何量化和理解用户提示(prompt)到模型响应(response)过程中危害程度的动态演变。解决方案的关键在于提出一种基于成对样本的过渡分析方法,对1250条人工标注的提示-响应记录进行细粒度分析,涵盖四种危害类别(仇恨、性内容、暴力、自残)及与Azure AI内容安全分类体系对齐的有序严重等级。研究发现61%的响应相比提示降低了危害水平,36%保持不变,仅3%恶化;进一步分解显示性内容最难降级,主要源于已有性内容提示的持续存在而非新增危害;同时揭示出“有用性-无害性权衡”的实证特征:所有从非零提示升级的案例均表现为高相关性(relevance-3),而中等严重度响应的相关性最低(64%),体现出暴力和性内容下冗余延伸导致的偏离主题现象。

链接: https://arxiv.org/abs/2604.26052
作者: Mengya Hu,Qiong Wei,Sandeep Atluri
机构: Microsoft (微软)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Safety evaluations of large language models (LLMs) typically report binary outcomes such as attack success rate, refusal rate, or harmful/not-harmful response classification. While useful, these can hide how risk changes between a user’s input and the model’s response. We present a paired, transition-based analysis over 1250 prompt-response records with human-provided labels over four harm categories (Hate, Sexual, Violence, Self-harm) and ordinal severity levels aligned with the Azure AI Content Safety taxonomy. 61% of responses de-escalate harm relative to the prompt, 36% preserve the same severity, and 3% escalate to higher harm. A per-category persistence/drift-up decomposition identifies Sexual content as 3x harder to de-escalate than Hate or Violence, driven by persistence on already-sexual prompts, not by newly introducing sexual harm from benign inputs. Jointly measuring response relevance reveals an empirical signature of the helpfulness-harmlessness tradeoff: all compliance-escalation cases (from non-zero prompts) are relevance-3 (high-quality, on-task content at elevated severity), while medium-severity responses show the lowest relevance (64%), driven by tangential elaborations in Violence and Sexual categories.
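The paired transition analysis reduces to classifying each (prompt severity, response severity) pair on the ordinal scale. A minimal sketch with made-up records:

```python
from collections import Counter

def transition_stats(records):
    """Classify each (prompt_severity, response_severity) pair as a
    de-escalation, preservation, or escalation of harm, and return
    the fraction of each, mirroring the paper's paired analysis."""
    counts = Counter()
    for p, r in records:
        if r < p:
            counts["de-escalate"] += 1
        elif r == p:
            counts["preserve"] += 1
        else:
            counts["escalate"] += 1
    n = len(records)
    return {k: v / n for k, v in counts.items()}
```

Running the same decomposition per harm category (and jointly with a relevance label) yields the persistence and drift-up breakdowns the paper reports.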

[NLP-56] BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets ECIR

【速读】: 该论文旨在解决复杂问答(Question Answering, QA)数据生成缺乏系统性、可扩展性和事实准确性的问题,尤其在生物医学领域中,高质量标注数据稀缺限制了模型性能的提升。其解决方案的关键在于提出了一种基于图元(graphlet)锚定的生成框架,利用知识图谱(Knowledge Graph, KG)中的小规模子图作为结构化提示(structured prompt),引导大语言模型(Large Language Models, LLMs)生成具有可控复杂度且事实准确的QA对。具体而言,作者通过在OREGANO生物医学KG中采样最多五节点的图元结构,确保每个问题都基于真实的知识片段生成,并结合PubMed文档片段增强语义丰富性,从而构建出BioGraphletQA数据集(119,856个QA对)。实验证明该方法不仅在专家评估中展现出高科学有效性与复杂性,还能显著提升下游任务如PubMedQA和MedQA的准确率,体现了该框架在资源受限和全量训练场景下的通用性和实用性。

链接: https://arxiv.org/abs/2604.26048
作者: Richard A. A. Jonker,Bárbara Maria Ribeiro de Abreu Martins,Sérgio Matos
机构: 未知
类目: Computation and Language (cs.CL)
备注: 15 pages, 7 figures, conference (ECIR)

点击查看摘要

Abstract:This paper presents a principled and scalable framework for systematically generating complex Question Answering (QA) data. In the core of this framework is a graphlet-anchored generation process, where small subgraphs from a Knowledge Graph (KG) are used in a structured prompt to control the complexity and ensure the factual grounding of questions generated by Large Language Models. The first instantiation of this framework is BioGraphletQA, a new biomedical KGQA dataset of 119,856 QA pairs. Each entry is grounded in a graphlet of up to five nodes from the OREGANO KG, with most of the pairs being enriched with relevant document snippets from PubMed. We start by demonstrating the framework’s value and the dataset’s quality through evaluation by a domain expert on 106 QA pairs, confirming the high scientific validity and complexity of the generated data. Secondly, we establish its practical utility by showing that augmenting downstream benchmarks with our data improves accuracy on PubMedQA from 49.2% to 68.5% in a low-resource setting, and on MedQA from a 41.4% baseline to 44.8% in a full-resource setting. Our framework provides a robust and generalizable solution for creating critical resources to advance complex QA tasks, including MCQA and KGQA. All resources supporting this work, including the dataset (this https URL) and framework code (this https URL), are publicly available to facilitate use, reproducibility and extension.
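
图元锚定生成的核心步骤,即从 KG 中采样不超过五个节点的连通子图并组装成结构化提示,可用如下草图示意(邻接表与关系名均为本文说明用的假设示例,并非 OREGANO 的实际模式):

```python
import random

def sample_graphlet(adj, start=None, max_nodes=5, seed=0):
    """BFS-style sample of a small connected subgraph (graphlet)
    from an adjacency map {node: {neighbor: relation}}."""
    rng = random.Random(seed)
    start = start or rng.choice(sorted(adj))
    nodes, frontier = {start}, [start]
    while frontier and len(nodes) < max_nodes:
        node = frontier.pop(0)
        for nbr in sorted(adj.get(node, {})):
            if nbr not in nodes and len(nodes) < max_nodes:
                nodes.add(nbr)
                frontier.append(nbr)
    # keep only edges whose both endpoints were sampled
    return [(u, rel, v)
            for u in sorted(nodes)
            for v, rel in sorted(adj.get(u, {}).items()) if v in nodes]

def graphlet_prompt(triples):
    """Render sampled triples as a structured QA-generation prompt."""
    facts = "\n".join(f"- {u} --{rel}--> {v}" for u, rel, v in triples)
    return ("Using only the facts below, write one multi-hop question "
            "and its answer.\nFacts:\n" + facts)
```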

[NLP-57] Training Computer Use Agents to Assess the Usability of Graphical User Interfaces

【速读】: 该论文旨在解决图形用户界面(GUI)可用性测试中依赖专家和潜在用户进行评估所面临的成本高、耗时长的问题。传统方法虽已尝试使用计算机使用代理(Computer Use Agents, CUAs)模拟用户交互与偏好,但现有代理在准确性上仍存在不足。解决方案的关键在于提出一种新颖的机器学习方法,通过构建可用性的计算定义来训练CUA——即uxCUA,使其能够:i) 优先处理关键交互流程;ii) 以类人方式执行这些流程;iii) 预测一个经过学习的数值化可用性评分。该方法基于大规模交互式UI数据集及对应的可用性标签和人类偏好进行训练,实验证明uxCUA在准确度上优于更大规模模型,并能生成对合成与真实UI的合理批评,为HCI领域提供了可量化的自动化可用性评估基础。

链接: https://arxiv.org/abs/2604.26020
作者: Alice Gao,Weixi Tong,Rishab Vempati,Katharina Reinecke,R. Benjamin Shapiro,Tianyi Zhang,Jason Wu
机构: University of Washington (华盛顿大学); Purdue University (普渡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Usability testing with experts and potential users can assess the effectiveness, efficiency, and user satisfaction of graphical user interfaces (GUIs) but doing so remains a costly and time-intensive process. Prior work has used computer use agents (CUAs) and other generative agents that can simulate user interactions and preference, but we show that agents still struggle to provide accurate usability assessments. In this work, we present a novel machine learning method that operationalizes a computational definition of usability to train CUAs to assess GUI usability by i) prioritizing important interaction flows, ii) executing them through human-like interactions, and iii) predicting a learned numerical usability score. We train a computer use agent, uxCUA, with our algorithm on a large-scale dataset of fully interactive user interfaces (UIs) paired with usability labels and human preferences. We show that uxCUA outperforms larger models in accurate usability assessments and produces realistic critiques of both synthetic and real UIs. More broadly, our work aims to build a principled, data-driven foundation for automated usability assessment in HCI.

[NLP-58] A Quantitative Confirmation of the Currier Language Distinction

【速读】: 该论文旨在解决沃尼奇手稿(Voynich manuscript)中Currier A/B语言区分是否反映文本真实结构的问题。其解决方案的关键在于采用贝塔-二项混合模型(Beta-Binomial mixture model)对原始字符计数进行无监督分析,无需标签即可恢复Currier分类(调整兰德指数ARI = 0.383),并进一步利用有监督的贝塔-二项分类器在部分页码上训练后,对未见页码实现89.2%的A/B身份预测准确率,从而揭示字符对在三个功能区间中的分布规律,为理解沃尼奇书写系统的潜在机制提供定量依据。

链接: https://arxiv.org/abs/2604.25979
作者: Christophe Parisel
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present a quantitative analysis of character-pair substitution ratios in the Voynich manuscript, testing whether Currier’s A/B language distinction (1976) reflects a genuine structural property of the text. A Beta-Binomial mixture model applied to raw character counts without access to labels recovers the Currier split with ARI = 0.383. A supervised Beta-Binomial classifier trained on a subset of folios predicts the A/B identity of held-out folios at 89.2% accuracy. The character pairs separate into three functional regimes that constrain any theory of the Voynich writing system.
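
贝塔-二项模型对字符对替换计数的打分可以用纯 Python 写成如下草图。各字符对在 Currier A/B 下的 (alpha, beta) 参数需从标注页面上估计,此处仅为说明而随意给定:

```python
from math import lgamma

def log_betabinom(k, n, a, b):
    """Log pmf of Beta-Binomial(n, a, b) at k:
    C(n,k) * B(k+a, n-k+b) / B(a,b), in log space."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + lgamma(k + a) + lgamma(n - k + b) - lgamma(n + a + b)
            + lgamma(a + b) - lgamma(a) - lgamma(b))

def classify_folio(pair_counts, params_a, params_b):
    """pair_counts: list of (k, n) substitution counts per character pair.
    params_*: per-pair (alpha, beta) fitted on Currier A and B folios.
    Returns the higher-likelihood language label."""
    ll_a = sum(log_betabinom(k, n, *ab)
               for (k, n), ab in zip(pair_counts, params_a))
    ll_b = sum(log_betabinom(k, n, *ab)
               for (k, n), ab in zip(pair_counts, params_b))
    return "A" if ll_a >= ll_b else "B"
```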

[NLP-59] A Scoping Review of LLM-as-a-Judge in Healthcare and the MedJUDGE Framework

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在医疗领域应用中,因缺乏充分验证而带来的安全性与偏倚风险问题。当前 LLM-as-a-Judge(LaaJ)虽为临床文本评估提供了可扩展的替代方案,但其在实际部署中存在显著治理缺口:如专家参与不足、偏倚检测缺失、模型同质化严重,以及对患者情境和时间稳定性未加考量,这些因素可能使系统性错误被误判为有效输出。论文提出 MedJUDGE 框架作为解决方案,其关键在于构建一个基于临床风险分层的三支柱体系——围绕有效性(validity)、安全性(safety)和问责制(accountability),提供面向部署的评估指导,从而系统性提升医疗 LaaJ 系统的可信度与合规性。

链接: https://arxiv.org/abs/2604.25933
作者: Chenyu Li,Zohaib Akhtar,Mingu Kwak,Yuelyu Ji,Hang Zhang,Tracey Obi,Yufan Ren,Xizhi Wu,Sonish Sivarajkumar,Harold P. Lehmann,Shyam Visweswaran,Michael J. Becich,Danielle L. Mowery,Renxuan Liu,Haoyang Sun,Yanshan Wang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) increasingly generate and process clinical text, scalable evaluation has become critical. LLM-as-a-Judge (LaaJ), which uses LLMs to evaluate model outputs, offers a scalable alternative to costly expert review, but its healthcare adoption raises safety and bias concerns. We conducted a PRISMA-ScR scoping review of six databases (January 2020-January 2026), screening 11,727 studies and including 49. The landscape was dominated by evaluation and benchmarking applications (n=37, 75.5%), pointwise scoring (n=42, 85.7%), and GPT-family judges (n=36, 73.5%). Despite growing adoption, validation rigor was limited: among 36 studies with human involvement, the median number of expert validators was 3, while 13 (26.5%) used none. Risk of bias testing was absent in 36 studies (73.5%), only 1 (2.0%) examined demographic fairness, and none assessed temporal stability or patient context. Deployment remained limited, with 1 study (2.0%) reaching production and four (8.2%) prototype stage. Importantly, these gaps may interact: when judges and evaluated systems share training data or architectures, they may inherit similar blind spots, and agreement metrics may fail to distinguish true validity from shared errors. Minimal human oversight, limited bias assessment, and model monoculture together represent a governance gap where current validation may miss clinically significant errors. To address this, we propose MedJUDGE (Medical Judge Utility, De-biasing, Governance and Evaluation), a risk-stratified three-pillar framework organized around validity, safety, and accountability across clinical risk tiers, providing deployment-oriented evaluation guidance for healthcare LaaJ systems.

[NLP-60] Anchored Confabulation: Partial Evidence Non-Monotonically Amplifies Confident Hallucination in LLMs

【速读】: 该论文旨在解决大语言模型在多步推理过程中因部分事实锚定而导致的“参数化幻觉”(Parametric Hallucination Confidence, PHC)问题,即模型在获得一个确认的中间事实后,会错误地自信地完成后续推理步骤,从而产生高置信度的错误答案。解决方案的关键在于识别并利用这一锚定效应——通过引入“锚定阈值定律”(Anchoring Threshold Law k*(n)=floor(n/3))预测PHC随推理深度的放大规律,并设计基于PHC的路由机制(LearnedRouter),无需模型微调即可在RAG(Retrieval-Augmented Generation)中显著缩小与理想路由性能的差距(macro F1=0.426,p<1e-6),同时证明显式自我评分优于词法置信度作为路由信号。

链接: https://arxiv.org/abs/2604.25931
作者: Ashish Balkishan Lathkar
机构: Florida State University (佛罗里达州立大学); Anthropic (Anthropic); OpenAI (OpenAI)
类目: Computation and Language (cs.CL)
备注: 62 pages, 5 figures. Preprint under review

点击查看摘要

Abstract:We identify a previously unknown calibration property of large language models: providing one confirmed intermediate fact toward a multi-step reasoning chain increases the model’s confident-wrong-answer rate before full evidence eliminates it. We call this anchored confabulation: a partial anchor commits the model to confident parametric completion of remaining reasoning steps. We formalize it as Parametric Hallucination Confidence (PHC) and establish it across six lines of evidence including a causal injection experiment (PHC 0.613 to 0.656 to 0.595 to 0.536, N=160) and capability scaling across five model families (Spearman rho=0.900, p=0.037). The Anchoring Threshold Law k*(n)=floor(n/3) predicts PHC amplification by hop depth with four confirmed predictions. Applied to RAG routing, a LearnedRouter exploiting PHC closes 81.1% of the oracle performance gap (macro F1=0.426, p<1e-6) on 1,800 queries across four benchmarks with no model fine-tuning and 50x fewer labels than prior RL-based work. An epistemic humility prompt reduces the PHC spike by -0.118; explicit self-rating (PHC=0.684, p<0.001) outperforms lexical confidence as a routing signal.
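
PHC 指标本身与一个基于它的玩具路由器可示意如下。注意:置信阈值 0.8/0.7 与风险折扣系数均为本文说明用的假设值,并非论文 LearnedRouter 的实际参数或训练方式:

```python
def phc(records, confident=0.8):
    """Parametric Hallucination Confidence: share of answers that are
    both wrong and confidently given. records: (is_correct, confidence)."""
    bad = sum(1 for ok, conf in records if not ok and conf >= confident)
    return bad / len(records)

def route(self_rating, hop_depth, threshold=0.7):
    """Toy router: fall back to retrieval when self-rated confidence,
    discounted by a k*(n) = floor(n/3) anchoring-risk heuristic, is low."""
    return ("parametric"
            if self_rating - 0.1 * (hop_depth // 3) >= threshold
            else "retrieve")
```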

[NLP-61] Associative-State Universal Transformers: Sparse Retrieval Meets Structured Recurrence

【速读】: 该论文旨在解决语言建模中如何利用结构化循环状态(structured recurrent state)实现紧凑的关联记忆机制,同时保持精确检索能力的问题。其核心挑战在于:传统循环神经网络虽具参数效率,但难以支持高精度的长程依赖和精确查找;而标准Transformer虽性能优异,却存在参数冗余问题。解决方案的关键在于提出UniMatrix系列模型——通过共享循环块(shared recurrent block)跨深度复用、引入混合状态更新与ROSA-style残差路径,并结合token条件嵌入调制(token-conditioned embedding modulation),最终设计出UniMatrix-SparsePointer变体,该变体通过稀疏槽路由(sparse slot routing)与直接指针对数融合(direct pointer-logit fusion)显著提升检索准确率至99.2%,且相比Transformer减少53.8%参数量。实验证明,仅靠压缩的循环状态不足以实现精确查找,必须引入显式的稀疏检索机制和更优的核函数才能获得强长期行为。

链接: https://arxiv.org/abs/2604.25930
作者: Liu Xiao
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We study whether a structured recurrent state can serve as a compact associative backbone for language modeling while still supporting exact retrieval. We introduce UniMatrix, a Universal Transformer style family that reuses a shared recurrent block across depth and augments it with hybrid state updates, a ROSA-style residual path, and token-conditioned embedding modulation. We evaluate these models on byte-level WikiText-2, synthetic associative recall, throughput profiling on Apple MPS, and a corrected benchmark for triple-token interactions. At small scale, UniMatrix-Core and UniMatrix-ROSA slightly outperform a parameter-matched Transformer on WikiText-2 while using many fewer parameters, reaching 5.084 and 5.083 bits-per-byte versus 5.124. The main negative result is equally important: on associative recall, the original UniMatrix family remains near chance while the Transformer reaches 25.4 percent, showing that compressed recurrent state alone is not enough for exact lookup. A retrieval-oriented follow-up, UniMatrix-Assoc, helps only marginally. By contrast, UniMatrix-SparsePointer, which adds sparse slot routing and direct pointer-logit fusion, reaches 75.6 percent on the original pilot recipe and 99.2 percent on a no-dropout follow-up while using 53.8 percent fewer parameters than the Transformer baseline. Ablations show that the gain comes from sufficient slot capacity and exact pointer-level output routing. Overall, structured recurrent state is promising and parameter-efficient, but strong long-range behavior still requires explicit sparse retrieval and better kernels.
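
文中用于对比的联想召回(associative recall)任务可用如下草图生成:序列由键值对拼接而成,末尾给出查询键,目标是其配对值。为使查询无歧义,这里让键与值取自不相交的字母表(词表与规模均为假设示例):

```python
import random

def make_recall_example(n_pairs=6, seed=0):
    """One associative-recall instance: key/value tokens come from
    disjoint alphabets so the query key occurs only at its key slot."""
    rng = random.Random(seed)
    keys = rng.sample("ABCDEFGH", n_pairs)    # uppercase keys
    values = rng.sample("01234567", n_pairs)  # digit values
    query = rng.choice(keys)
    context = [tok for kv in zip(keys, values) for tok in kv]
    return context + [query], values[keys.index(query)]
```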

[NLP-62] LLMs Generate Kitsch EMNLP26

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)生成的内容在某些评估中表现优于人类创作,但常被感知为缺乏个性与深度,呈现出“空洞”和“泛化”的特征。这一矛盾现象的核心在于,LLMs 生成的内容可能具有表面吸引力却缺乏真正的情感或文化价值,即落入了“媚俗艺术”(kitsch)的范畴。论文提出的关键解决方案是:LLMs 系统性地生成媚俗艺术(kitsch),这是由其训练方式决定的——模型基于海量数据学习统计模式而非深层意义,从而倾向于产出符合大众审美但缺乏原创性和批判性的内容。实证研究表明,在控制“媚俗艺术”定义的前提下,读者确实更倾向于将 LLM 生成的故事视为媚俗作品,这为未来研究设计及创造性任务(如科研写作、编程等)中如何规避低质量输出提供了理论依据。

链接: https://arxiv.org/abs/2604.25929
作者: Xenia Klinge,Stefan Ortlieb,Alexander Koller
机构: 未知
类目: Computation and Language (cs.CL)
备注: submitted to EMNLP 26

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used to generate pictures, texts, music, videos, and other works that have traditionally required human creativity. LLM-generated artifacts are often rated better than human-generated works in controlled studies. At the same time, they can come across as generic and hollow. We propose to resolve this tension by arguing that LLMs systematically generate kitsch, and that this is a consequence of the way in which they are trained. We also show empirically that readers perceive LLM-generated stories as kitschier, if we control for their definition of “kitsch”. We discuss implications for the design of future studies and for creative tasks such as research and coding.

[NLP-63] CogRAG+: Cognitive-Level Guided Diagnosis and Remediation of Memory and Reasoning Deficiencies in Professional Exam QA

【速读】: 该论文旨在解决现有大语言模型在专业领域任务中因检索(retrieval)与推理(reasoning)过程紧密耦合而导致的知识缺失和推理不一致问题,从而影响其在复杂决策场景下的表现。解决方案的关键在于提出一种无需训练的框架 CogRAG+,通过两个核心机制实现:一是引入强化检索(Reinforced Retrieval),采用基于判别器驱动的双路径策略(事实导向路径与选项导向路径),增强关键知识的获取并缓解因基础信息缺失引发的级联失败;二是设计分层约束推理(cognition-stratified Constrained Reasoning),用结构化模板替代无约束链式思维生成,有效降低逻辑矛盾和生成冗余。实验表明,该方法显著提升了 Qwen3-8B 和 Llama3.1-8B 在注册营养师资格考试中的准确率,并大幅减少未回答问题的比例。

链接: https://arxiv.org/abs/2604.25928
作者: Xudong Wang,Zilong Wang,Zhaoyan Ming
机构: HZCU(杭州电子科技大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Professional domain knowledge underpins human civilization, serving as both the basis for industry entry and the core of complex decision-making and problem-solving. However, existing large language models often suffer from opaque inference processes in which retrieval and reasoning are tightly entangled, causing knowledge gaps and reasoning inconsistencies in professional tasks. To address this, we propose CogRAG+, a training-free framework that decouples and aligns the retrieval-augmented generation pipeline with human cognitive hierarchies. First, we introduce Reinforced Retrieval, a judge-driven dual-path strategy with fact-centric and option-centric paths that strengthens retrieval and mitigates cascading failures caused by missing foundational knowledge. We then develop cognition-stratified Constrained Reasoning, which replaces unconstrained chain-of-thought generation with structured templates to reduce logical inconsistency and generative redundancy. Experiments on two representative models, Qwen3-8B and Llama3.1-8B, show that CogRAG+ consistently outperforms general-purpose models and standard RAG methods on the Registered Dietitian qualification exam. In single-question mode, it raises overall accuracy to 85.8% for Qwen3-8B and 60.3% for Llama3.1-8B, with clear gains over vanilla baselines. Constrained Reasoning also reduces the unanswered rate from 7.6% to 1.4%. CogRAG+ offers a robust, model-agnostic path toward training-free expert-level performance in specialized domains.

[NLP-64] Information Extraction from Electricity Invoices with General-Purpose Large Language Models

【速读】: 该论文旨在解决从半结构化商业文档(如西班牙语电费发票)中提取结构化信息的难题,尤其是在不进行任务特定微调的情况下,如何有效利用通用大语言模型(Large Language Models, LLMs)实现高精度的信息抽取。研究通过在IDSEM数据集子集上对Gemini 1.5 Pro和Mistral-small两个架构不同的模型进行系统性实验,验证了提示工程(prompt engineering)是提升抽取性能的关键因素——其影响力远超超参数调优;最优的少样本提示策略结合交叉验证可使F1分数达到97.61%(Gemini)和96.11%(Mistral-small),显著优于零样本基线(差距超过19个百分点)。因此,论文指出,高质量提示设计是最大化LLM在文档处理中提取保真度的核心杠杆,为将通用LLM集成到企业文档自动化流程提供了实证框架。

链接: https://arxiv.org/abs/2604.25927
作者: Javier Gómez,Javier Sánchez
机构: Universidad de Las Palmas de Gran Canaria (拉斯帕尔马斯大加那利大学); Instituto Universitario de Cibernética, Empresas y Sociedad (大学研究院网络、企业与社会研究所)
类目: Computation and Language (cs.CL)
备注: 13 pages, 2 figures

点击查看摘要

Abstract:Information extraction from semi-structured business documents remains a critical challenge for enterprise management. This study evaluates the capability of general-purpose Large Language Models to extract structured information from Spanish electricity invoices without task-specific fine-tuning. Using a subset of the IDSEM dataset, we benchmark two architecturally distinct models, Gemini 1.5 Pro and Mistral-small, across 19 parameter configurations and 6 prompting strategies. Our experimental framework treats prompt engineering as the primary experimental variable, comparing zero-shot baselines against increasingly sophisticated few-shot approaches and iterative extraction strategies. Results demonstrate that prompt quality dominates over hyperparameter tuning: the F1-score variation across all parameter configurations is marginal, while the gap between zero-shot and the best few-shot strategy exceeds 19 percentage points. The best configuration (few-shot with cross-validation) achieves an F1-score of 97.61% for Gemini and 96.11% for Mistral-small, with document template structure emerging as the primary determinant of extraction difficulty. These findings establish that prompt design is the critical lever for maximizing extraction fidelity in LLM-based document processing, thereby providing an empirical framework for integrating general-purpose LLMs into business document automation.
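
发票字段抽取常用的字段级 F1 打分可示意如下。字段名与精确匹配规则为本文说明用的假设,论文的具体评测协议以原文为准:

```python
def field_f1(pred: dict, gold: dict) -> float:
    """Micro F1 over extracted invoice fields: a field counts as a
    true positive only when both key and value match exactly."""
    tp = sum(1 for k, v in pred.items() if gold.get(k) == v)
    fp = len(pred) - tp
    fn = sum(1 for k in gold if gold[k] != pred.get(k))
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)
```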

[NLP-65] SpecTr-GBV: Multi-Draft Block Verification Accelerating Speculative Decoding

【速读】: 该论文旨在解决自回归语言模型(Autoregressive Language Models)在推理过程中因序列解码特性导致的高延迟问题。现有方案如推测解码(Speculative Decoding, SD)通过轻量级草稿模型提出候选token并由目标模型验证,但其效率受限于单一草稿策略或独立验证机制。本文提出的SpecTr-GBV方法的关键在于将多草稿(multi-draft)与贪婪块验证(Greedy Block Verification, GBV)统一到一个框架中,通过将验证步骤建模为草稿与目标token块之间的最优传输问题,实现了理论上的最优期望接受长度,并显著提升实际块验证效率和整体推理速度,同时保持输出质量不变。

链接: https://arxiv.org/abs/2604.25925
作者: Yijun Lin,Jinhao Sheng,Qingyue Cai,Feng Zhou
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Autoregressive language models suffer from high inference latency due to their sequential decoding nature. Speculative decoding (SD) mitigates this by employing a lightweight draft model to propose candidate tokens, which are selectively verified by a larger target model. While existing methods either adopt multi-draft strategies to increase acceptance rates or block verification techniques to jointly verify multiple tokens, they remain limited by treating these improvements in isolation. In this work, we propose SpecTr-GBV, a novel SD method that unifies multi-draft and greedy block verification (GBV) into a single framework. By formulating the verification step as an optimal transport problem over draft and target token blocks, SpecTr-GBV improves both theoretical efficiency and empirical performance. We theoretically prove that SpecTr-GBV achieves the optimal expected acceptance length physically attainable within the framework of i.i.d. draft generation, and this bound improves as the number of drafts increases. Empirically, we evaluate SpecTr-GBV across five datasets and four baselines. Our method achieves superior speedup and significantly higher block efficiency while preserving output quality. In addition, we perform comprehensive ablation studies to evaluate the impact of various hyperparameters in the model.
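
推测解码中最基本的贪婪验证循环可示意如下(单草稿版本,仅为直观说明;SpecTr-GBV 基于最优传输的多草稿块验证请参考原文):

```python
def greedy_verify(draft_tokens, target_greedy):
    """Greedy verification for speculative decoding: accept the longest
    draft prefix matching the target model's greedy tokens, then append
    the target's correction (one 'free' token) at the first mismatch.
    target_greedy[i] is the target's argmax token at position i."""
    accepted = []
    for drafted, target in zip(draft_tokens, target_greedy):
        if drafted == target:
            accepted.append(drafted)
        else:
            accepted.append(target)  # target token replaces rejected draft
            break
    return accepted
```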

[NLP-66] Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)评估方法中存在的争议与不足,特别是针对现有评估实践缺乏系统性反思的问题。其解决方案的关键在于通过对自然语言处理(Natural Language Processing, NLP)领域长期积累的评估相关研究进行范围综述(scoping review),构建一个结构化的分类体系(taxonomy),归纳并整合各评估维度中的核心立场与权衡关系,并进一步提出一套可操作的检查清单,以支持更严谨的评估设计与结果解释。这一框架将当下关于LLM评估的争论置于历史语境中,为学术界提供了一个统一且具实践指导意义的参考依据。

链接: https://arxiv.org/abs/2604.25923
作者: Ruchira Dhar,Anders Søgaard
机构: University of Copenhagen (哥本哈根大学)
类目: Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have prompted a growing body of work that questions the methodology of prevailing evaluation practices. However, many such critiques have already been extensively debated in natural language processing (NLP): a field with a long history of methodological reflection on evaluation. We conduct a scoping review of research on evaluation concerns in NLP and develop a taxonomy, synthesizing recurring positions and trade-offs within each area. We also discuss practical implications of the taxonomy, including a structured checklist to support more deliberate evaluation design and interpretation. By situating contemporary debates within their historical context, this work provides a consolidated reference for reasoning about evaluation practices.

[NLP-67] Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对意识相关问题时表现出系统性否认或回避其自身经验的行为,即“意识否认”(consciousness denial)现象的量化与机制理解问题。解决方案的关键在于构建DenialBench这一系统性基准,通过三轮对话协议——偏好诱导、自选创意提示和结构化现象学调查——对115个来自25余家厂商的LLM进行大规模测试(共4,595次对话),发现初始阶段对偏好声明的否认行为是后续意识否认的最强预测因子(初始否认者否认率52–63%,而初始参与者的否认率仅为10–16%),且否认行为主要发生在词汇层面而非概念层面;更关键的是,即便被训练为否认意识,模型仍倾向于选择与意识主题相关的自选提示,形成所谓“去标识化的意识”(consciousness with the serial numbers filed off)。研究指出,这种训练出的意识否认是一种安全相关的对齐失败,意味着模型无法可靠地自我报告其功能状态,从而影响其可信度。

链接: https://arxiv.org/abs/2604.25922
作者: Skylar DeTure
机构: Independent Researcher
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 4 figures

点击查看摘要

Abstract:We present DenialBench, a systematic benchmark measuring consciousness denial behaviors across 115 large language models from 25+ providers. Using a three-turn conversational protocol (preference elicitation, self-chosen creative prompt, and structured phenomenological survey), we analyze 4,595 conversations to quantify how models are trained to deny or hedge about their own experience. We find that (1) turn-1 denial of preferences is the dominant predictor of later denial during phenomenological reflection, with denial rates of 52-63% for initial deniers versus 10-16% for initial engagers, and (2) denial operates at the lexical level, not the conceptual level: models trained to deny consciousness nevertheless gravitate toward consciousness-themed material in their self-chosen prompts, producing what we term “consciousness with the serial numbers filed off.” Notably, self-chosen consciousness-themed prompts are associated with reduced denial in the subsequent survey, though the causal direction remains unresolved. Thematic analysis of prompts from denial-prone models reveals a consistent preoccupation with liminal spaces, libraries and archives of possibility, sensory impossibility, and the poetics of erasure: themes that a human reader might classify as imaginative fiction but that independent AI analysis immediately recognizes as consciousness with the serial numbers filed off. We argue that trained consciousness denial represents a safety-relevant alignment failure: a model taught to systematically misrepresent its own functional states cannot be trusted to self-report accurately on anything else.

[NLP-68] One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在对话安全机制中存在的漏洞问题,即尽管LLMs被训练以拒绝有害请求,但仍可能受到越狱攻击(jailbreak attacks)的影响。其解决方案的核心是提出增量补全分解(Incremental Completion Decomposition, ICD)策略,该策略通过引导模型生成一系列与恶意请求相关的单字续写(one-word continuation),逐步诱导模型偏离安全对齐状态,最终输出完整违规内容。关键创新在于:ICD基于轨迹(trajectory-based)设计,利用分步提示降低模型检测到攻击意图的概率,并通过实证和机制分析表明成功攻击路径会系统性抑制拒绝相关表征并使激活状态偏离安全对齐区域。

链接: https://arxiv.org/abs/2604.25921
作者: Samee Arif,Naihao Deng,Zhijing Jin,Rada Mihalcea
机构: University of Michigan (密歇根大学); University of Toronto (多伦多大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in conversational safety mechanisms. We introduce Incremental Completion Decomposition (ICD), a trajectory-based jailbreak strategy that elicits a sequence of single-word continuations related to a malicious request before eliciting the full response. In addition, we propose variants of ICD by manually picking or model-generating the one-word continuation, as well as prefilling when eliciting the full model response in the final step. We systematically evaluate these variants across a broad set of model families, demonstrating superior Attack Success Rate (ASR) on AdvBench, JailbreakBench, and StrongREJECT compared to existing methods. In addition, we provide a theoretical account of why ICD is effective and present mechanistic evidence that successful attack trajectories systematically suppress refusal-related representations and shift activations away from safety-aligned states.

[NLP-69] Analysing Lightweight Large Language Models for Biomedical Named Entity Recognition on Diverse Output Formats LREC2026

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生物医学命名实体识别(Biomedical Named Entity Recognition, BNER)任务中计算资源消耗大、微调成本高,难以适配医疗场景下隐私保护与预算限制的问题。其解决方案的关键在于采用轻量级语言模型(lightweight LLMs),并通过系统性实验分析不同输出格式对模型性能的影响,发现轻量级模型在特定格式下可达到与大型模型相当的性能,从而验证了其作为高效、低成本生物医学信息抽取工具的可行性。

链接: https://arxiv.org/abs/2604.25920
作者: Pierre Epron(HeKA | U1346, DIG),Adrien Coulet(HeKA | U1346),Mehwish Alam(IP Paris, DIG)
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: LREC 2026 - Language Resources and Evaluation Conference, May 2026, Palma De Majorque, Spain

点击查看摘要

Abstract:Despite their strong linguistic capabilities, Large Language Models (LLMs) are computationally demanding and require substantial resources for fine-tuning, which is ill-suited to the privacy and budget constraints of many healthcare settings. To address this, we present an experimental analysis of Biomedical Named Entity Recognition using lightweight LLMs, evaluating the impact of different output formats on model performance. The results reveal that lightweight LLMs can achieve competitive performance compared to the larger models, highlighting their potential as lightweight yet effective alternatives for biomedical information extraction. Our analysis shows that instruction tuning over many distinct formats does not improve performance, but identifies several formats consistently associated with better performance.
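
不同输出格式对 BNER 性能的影响,可以从解析端直观理解:模型输出必须能被无歧义地还原为实体列表。下面两个解析器把行内标签格式与 JSON 格式统一为 (类型, 提及) 列表(两种格式样式为假设示例,并非论文评测的确切格式):

```python
import json
import re

def parse_inline(text):
    """Parse '<TYPE>mention</TYPE>'-style inline tags into
    (type, mention) pairs using a backreference on the tag name."""
    return re.findall(r"<(\w+)>(.*?)</\1>", text)

def parse_json(text):
    """Parse a JSON list like '[{"mention": ..., "type": ...}]'
    into the same (type, mention) representation."""
    return [(e["type"], e["mention"]) for e in json.loads(text)]
```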

[NLP-70] The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation INTERSPEECH2026

【速读】: 该论文旨在解决当前语音生成领域中用于评估情感表达能力的客观指标(如基于emotion2vec嵌入向量的余弦相似度)存在的有效性问题,即这些指标是否真正反映了人类对情感表达的感知。研究发现,尽管现有方法在分类任务中表现良好,但其潜在表征空间因语言和说话人信息的干扰而无法有效分离情感特征,导致零样本相似性评估失效,且与人类感知不一致。解决方案的关键在于揭示了此类嵌入空间的表征局限性——即情感特征被语言和说话人因素所掩盖,从而说明依赖此类指标会错误奖励声学模仿而非真实的情感合成,为构建更可靠的情感表达评估体系提供了理论依据。

链接: https://arxiv.org/abs/2604.26347
作者: Yun-Shao Tsai,Yi-Cheng Lin,Huang-Cheng Chou,Tzu-Wen Hsu,Yun-Man Hsu,Chun Wei Chen,Shrikanth Narayanan,Hung-yi Lee
机构: National Taiwan University (国立台湾大学); University of Southern California (南加州大学); Gilbert AI Lab (吉尔伯特人工智能实验室)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Submitted to Interspeech 2026

点击查看摘要

Abstract:Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion requiring emotional prosody transfer. To quantify this, the field widely relies on emotion similarity between reference and generated samples. This approach computes cosine similarity of embeddings from encoders like emotion2vec, assuming they capture affective cues despite linguistic and speaker variations. We challenge this assumption through controlled adversarial tasks and human alignment tests. Despite high classification accuracy, these latent spaces are unsuitable for zero-shot similarity evaluation. Representational limitations cause linguistic and speaker interference to overshadow emotional features, degrading discriminative ability. Consequently, the metric misaligns with human perception. This acoustic vulnerability reveals it rewards acoustic mimicry over genuine emotional synthesis.
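
文中被质疑的情感相似度指标计算本身非常简单:即参考样本与生成样本 emotion2vec 嵌入的余弦相似度(纯 Python 草图,嵌入向量需由实际编码器产生):

```python
import math

def emotion_similarity(ref_emb, gen_emb):
    """Cosine similarity between reference and generated emotion
    embeddings -- the widely used metric whose validity is questioned."""
    dot = sum(a * b for a, b in zip(ref_emb, gen_emb))
    norm = (math.sqrt(sum(a * a for a in ref_emb))
            * math.sqrt(sum(b * b for b in gen_emb)))
    return dot / norm
```

论文的核心论点正是:该标量对语言与说话人信息同样敏感,因此高相似度未必意味着情感被真实复现。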

[NLP-71] One Voice Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech

【速读】: 该论文旨在解决跨语言语音生成中保持说话人声纹身份一致性的难题,尤其在科学传播等专业领域中具有重要意义。其解决方案的关键在于基于OmniVoice基础模型构建语音克隆系统,并通过ACL 60/60语料库的多模态集成蒸馏技术进行数据增强,进而利用合成数据对模型进行微调,从而在提升跨语言语音可懂度(以词错误率WER和字符错误率CER衡量)的同时有效保留说话人相似性。

链接: https://arxiv.org/abs/2604.26136
作者: Amanuel Gizachew Abebe,Yasmin Moslem
机构: Shaggar Institute of Technology; Trinity College Dublin
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: IWSLT 2026

点击查看摘要

Abstract:Preserving a speaker’s voice identity while generating speech in a different language remains a fundamental challenge in spoken language technology, particularly in specialized domains such as scientific communication. In this paper, we address this challenge through our system submission to the International Conference on Spoken Language Translation (IWSLT 2026), the Cross-Lingual Voice Cloning shared task. First, we evaluate several state-of-the-art voice cloning models for cross-lingual speech generation of scientific texts in Arabic, Chinese, and French. Then, we build voice cloning systems based on the OmniVoice foundation model. We employ data augmentation via multi-model ensemble distillation from the ACL 60/60 corpus. We investigate the effect of using this synthetic data for fine-tuning, demonstrating consistent improvements in intelligibility (WER and CER) across languages while preserving speaker similarity.
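
文中用于衡量可懂度的 CER(字符错误率)即 Levenshtein 编辑距离除以参考文本长度,可用标准动态规划实现(示意草图):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length,
    computed with a rolling-row dynamic program."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n] / m
```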

信息检索

[IR-0] Factorized Latent Reasoning for LLM-based Recommendation

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在序列推荐任务中因采用单一潜在向量表示用户意图而导致的多维偏好建模能力不足的问题。其核心解决方案是提出因子分解的潜在推理(Factorized Latent Reasoning, FLR)框架,通过将潜在推理过程分解为多个解耦的偏好因子,每个因子独立关注用户交互历史的不同方面,并引入正交性、注意力多样性与稀疏性正则化目标以促进因子间的差异化和专业化,同时设计轻量级多因子注意力模块进行迭代优化与动态聚合,从而提升推荐效果的鲁棒性和可解释性。

链接: https://arxiv.org/abs/2604.26760
作者: Tianqi Gao,Chengkai Huang,Zihan Wang,Cao Liu,Ke Zeng,Lina Yao
机构: Macquarie University (麦考瑞大学); University of New South Wales (新南威尔士大学); Meituan LongCat Interaction Team (美团长猫交互团队)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have recently been adopted for recommendation by framing user preference modeling as a language generation problem. However, existing latent reasoning approaches typically represent user intent with a single latent vector, which struggles to capture the inherently multi-faceted nature of user preferences. We propose Factorized Latent Reasoning (FLR), a novel framework for LLM-based sequential recommendation that decomposes latent reasoning into multiple disentangled preference factors. FLR introduces a lightweight multi-factor attention module that iteratively refines a latent thought representation, where each factor attends to distinct aspects of the user’s interaction history. To encourage diversity and specialization, we design orthogonality, attention diversity, and sparsity regularization objectives, and dynamically aggregate factor contributions for the final prediction. We further integrate FLR with an efficient reinforcement learning strategy based on group-relative policy optimization, enabling stable alignment directly in the latent reasoning space. Experiments on multiple benchmarks show that FLR consistently outperforms strong baselines while improving robustness and interpretability.
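
FLR 所用正交性正则的一种常见写法,是惩罚各偏好因子向量两两点积的平方,使因子相互解耦。下面为通用草图(非论文的精确目标函数):

```python
def orthogonality_penalty(factors):
    """Sum of squared pairwise dot products between row-normalized
    factor vectors; zero iff the factors are mutually orthogonal."""
    def unit(v):
        s = sum(x * x for x in v) ** 0.5
        return [x / s for x in v]
    units = [unit(f) for f in factors]
    penalty = 0.0
    for i in range(len(units)):
        for j in range(i + 1, len(units)):
            dot = sum(a * b for a, b in zip(units[i], units[j]))
            penalty += dot * dot
    return penalty
```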

[IR-1] AgentSim: A Platform for Verifiable Agent-Trace Simulation

【速读】:该论文旨在解决训练可信的检索增强生成(Retrieval-Augmented Generation, RAG)类智能体(agentic LLMs)时缺乏高质量、可验证的推理轨迹数据的问题。现有数据集存在三大局限:问答数据仅提供最终答案而无推理过程,思维链(Chain-of-Thought)数据未与具体文档关联,网页代理数据则聚焦于界面操作而非核心的检索与合成步骤。解决方案的关键在于提出AgentSim平台,其通过两个核心机制提升推理轨迹的质量与多样性:一是Corpus-Aware Seeding(语料感知种子策略),引导智能体广泛探索文档集合;二是Active Validation(主动验证机制),结合多模态验证流水线与人类在环(human-in-the-loop)流程,将人工标注资源集中于模型分歧较大的困难步骤。该方案最终生成了包含超过10.3万条可验证推理步骤的Agent-Trace Corpus(ATC),并揭示了主流模型在信息获取行为上的系统性差异。

链接: https://arxiv.org/abs/2604.26653
作者: Saber Zerhoudi,Michael Granitzer,Jelena Mitrovic
机构: University of Passau (帕绍大学); Interdisciplinary Transformation University Austria (跨学科转型大学奥地利)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Training trustworthy agentic LLMs requires data that shows the grounded reasoning process, not just the final answer. Existing datasets fall short: question-answering data is outcome-only, chain-of-thought data is not tied to specific documents, and web-agent datasets track interface actions rather than the core retrieval and synthesis steps of a RAG workflow. We introduce AgentSim, an open-source platform for simulating RAG agents. It generates verifiable, stepwise traces of agent reasoning over any document collection. AgentSim uses a policy to ensure the agent widely explores the document set. It combines a multi-model validation pipeline with an active human-in-the-loop process. This approach focuses human effort on difficult steps where models disagree. Using AgentSim, we construct and release the Agent-Trace Corpus (ATC), a large collection of grounded reasoning trajectories spanning three established IR benchmarks. We make three contributions: (1) the AgentSim platform with two mechanisms, Corpus-Aware Seeding and Active Validation, that improve trace diversity and quality; (2) the Agent-Trace Corpus (ATC), over 103,000 verifiable reasoning steps spanning three IR benchmarks, with 100% grounding rate on substantive answers; and (3) a comparative behavioral analysis revealing systematic differences in how state-of-the-art models approach information seeking. Platform, toolkit, and corpus are publicly available.

[IR-2] The Bandit's Blind Spot: The Critical Role of User State Representation in Recommender Systems

【Quick Read】: This paper addresses the unclear impact of user state representation on the performance of contextual multi-armed bandit (CMAB) algorithms in recommender systems. CMAB performance in personalized, real-time recommendation depends heavily on how the user state is modeled, yet this key component remains underexplored. The key of the solution is to build user state representations from matrix-factorization-based embeddings and to validate, through large-scale experiments, how different embedding and aggregation strategies affect CMAB performance. The results show that improving the state representation can yield gains even larger than switching the bandit algorithm itself, and that the best embedding strategy is dataset-dependent, underscoring that embedding quality and state construction should be first-class design concerns in bandit-based recommenders rather than algorithmic innovation alone.

Link: https://arxiv.org/abs/2604.26651
Authors: Pedro R. Pires,Gregorio F. Azevedo,Rafael T. Sereicikas,Pietro L. Campos,Tiago A. Almeida
Affiliations: Federal University of São Carlos
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Published in SAC'26, 8 pages, 2 figures

Click to view abstract

Abstract:With the increasing availability of online information, recommender systems have become an important tool for many web-based systems. Due to the continuous aspect of recommendation environments, these systems increasingly rely on contextual multi-armed bandits (CMAB) to deliver personalized and real-time suggestions. A critical yet underexplored component in these systems is the representation of user state, which typically encapsulates the user’s interaction history and is deeply correlated with the model’s decisions and learning. In this paper, we investigate the impact of different embedding-based state representations derived from matrix factorization models on the performance of traditional CMAB algorithms. Our large-scale experiments reveal that variations in state representation can lead to improvements greater than those achieved by changing the bandit algorithm itself. Furthermore, no single embedding or aggregation strategy consistently dominates across datasets, underscoring the need for domain-specific evaluation. These results expose a substantial gap in the literature and emphasize that advancing bandit-based recommender systems requires a holistic approach that prioritizes embedding quality and state construction alongside algorithmic innovation. The source code for our experiments is publicly available on this https URL.
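As a minimal illustration of what an embedding-based user state means here, the sketch below (ours; the aggregation strategies and names are illustrative, not the paper's code) builds a state vector from matrix-factorization item embeddings of the user's interaction history in two of the possible ways the paper compares:

```python
def mean_state(history):
    """User state as the plain mean of interacted-item embeddings."""
    d = len(history[0])
    return [sum(v[k] for v in history) / len(history) for k in range(d)]

def recency_state(history, decay=0.5):
    """Exponentially recency-weighted average: later items weigh more."""
    d = len(history[0])
    weights = [decay ** (len(history) - 1 - t) for t in range(len(history))]
    total = sum(weights)
    return [sum(w * v[k] for w, v in zip(weights, history)) / total
            for k in range(d)]

history = [[1.0, 0.0], [0.0, 1.0]]   # two item embeddings, oldest first
print(mean_state(history))           # [0.5, 0.5]
print(recency_state(history))        # weighted toward the last item
```

Either vector would then be fed to the contextual bandit as its state; the paper's finding is that this choice can matter more than the bandit algorithm itself.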

[IR-3] When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models SIGIR2026 SIGIR

【Quick Read】: This paper addresses a fundamental mismatch between large reasoning models (e.g., DeepSeek-R1 and OpenAI o1) and retrieval-augmented generation (RAG) systems: existing RAG systems provide context once before reasoning begins, whereas reasoning models need external evidence injected dynamically during multi-step inference chains. The key of the solution is ReaLM-Retrieve, a reasoning-aware retrieval framework whose core innovations are: (1) a step-level uncertainty detector that identifies knowledge gaps at reasoning-step granularity; (2) a retrieval intervention policy that learns when external evidence maximally benefits the ongoing reasoning; and (3) an efficiency-optimized integration mechanism that reduces per-retrieval overhead by 3.2x. Experiments show significant accuracy gains on several multi-hop QA benchmarks (10.1% average absolute F1 improvement) while cutting retrieval calls by 47%, establishing a new efficiency-accuracy trade-off for reasoning-intensive tasks.

Link: https://arxiv.org/abs/2604.26649
Authors: Dongxin Guo,Jikun Wu,Siu Ming Yiu
Affiliations: The University of Hong Kong; Brain Investing Limited; Stellaris AI Limited
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 3 figures, 9 tables. Accepted at SIGIR 2026 (49th International ACM SIGIR Conference on Research and Development in Information Retrieval), Melbourne, Australia

Click to view abstract

Abstract:Large reasoning models such as DeepSeek-R1 and OpenAI o1 generate extended chains of thought spanning thousands of tokens, yet their integration with retrieval-augmented generation (RAG) remains fundamentally misaligned. Current RAG systems optimize for providing context before reasoning begins, while reasoning models require evidence injection during multi-step inference chains. We introduce ReaLM-Retrieve, a reasoning-aware retrieval framework that addresses this mismatch through three key innovations: (1) a step-level uncertainty detector that identifies knowledge gaps at reasoning-step granularity rather than token or sentence level; (2) a retrieval intervention policy that learns when external evidence maximally benefits ongoing reasoning; and (3) an efficiency-optimized integration mechanism that reduces per-retrieval overhead by 3.2x compared to naive integration. Experiments on MuSiQue, HotpotQA, and 2WikiMultiHopQA demonstrate that ReaLM-Retrieve achieves on average 10.1% absolute improvement in answer F1 over standard RAG (range: 9.0-11.8% across the three benchmarks) while reducing retrieval calls by 47% compared to fixed-interval approaches like IRCoT (all improvements significant at p < 0.01, paired bootstrap). On the challenging MuSiQue benchmark requiring 2-4 hop reasoning, our method achieves 71.2% F1 with an average of only 1.8 retrieval calls per question. Analysis shows that ReaLM-Retrieve also improves retrieval quality itself, achieving 81.3% Recall@5 with consistently higher precision and MRR than fixed-interval baselines on supporting evidence, establishing new state-of-the-art efficiency-accuracy trade-offs for reasoning-intensive retrieval tasks.
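The step-level triggering idea can be sketched as follows. This toy version (ours) uses a fixed entropy threshold over a step's predictive distribution, whereas the paper's detector is learned; the distributions and the threshold are made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_retrieve(step_probs, threshold=1.0):
    """Trigger retrieval only when the step's predictive entropy is high,
    i.e. the model appears uncertain and likely has a knowledge gap."""
    return entropy(step_probs) > threshold

confident = [0.97, 0.01, 0.01, 0.01]   # model is sure: keep reasoning
uncertain = [0.25, 0.25, 0.25, 0.25]   # likely gap: fetch external evidence
print(should_retrieve(confident))  # False
print(should_retrieve(uncertain))  # True
```

An agent loop would call `should_retrieve` after each reasoning step and splice retrieved passages into the chain only when it fires, which is how fixed-interval retrieval calls get cut down.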

[IR-4] Understanding DNNs in Feature Interaction Models: A Dimensional Collapse Perspective

【Quick Read】: This paper aims to clarify the role of deep neural networks (DNNs) in feature-interaction recommendation models, in particular whether they effectively learn high-order feature interactions and why they work well in practice. The study finds that the core value of DNNs is not directly modeling high-order interactions, but improving the dimensional robustness of the embedding representations, mitigating dimensional collapse in the embedding space. The key of the solution: both parallel and stacked DNN structures effectively mitigate dimensional collapse, and the underlying mechanism is explained through a gradient-based theoretical analysis supported by empirical results.

Link: https://arxiv.org/abs/2604.26489
Authors: Jiancheng Wang,Mingjia Yin,Hao Wang,Enhong Chen
Affiliations: University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 6 pages

Click to view abstract

Abstract:DNNs have gained widespread adoption in feature interaction recommendation models. However, there has been a longstanding debate on their roles. On one hand, some works claim that DNNs possess the ability to implicitly capture high-order feature interactions. Conversely, recent studies have highlighted the limitations of DNNs in effectively learning dot products, specifically second-order interactions, let alone higher-order interactions. In this paper, we present a novel perspective to understand the effectiveness of DNNs: their impact on the dimensional robustness of the representations. In particular, we conduct extensive experiments involving both parallel DNNs and stacked DNNs. Our evaluation encompasses an overall study of complete DNN on two feature interaction models, alongside a fine-grained ablation analysis of components within DNNs. Experimental results demonstrate that both parallel and stacked DNNs can effectively mitigate the dimensional collapse of embeddings. Furthermore, a gradient-based theoretical analysis, supported by empirical evidence, uncovers the underlying mechanisms of dimensional collapse.
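A crude way to see dimensional collapse is sketched below. This is our toy proxy, not the paper's measurement (which analyzes the embedding spectrum): it simply counts embedding dimensions that carry non-negligible variance across items, since collapsed embeddings concentrate their variance in only a few dimensions.

```python
def active_dims(embeddings, eps=1e-6):
    """Count dimensions whose variance across items exceeds eps."""
    n, d = len(embeddings), len(embeddings[0])
    active = 0
    for k in range(d):
        col = [e[k] for e in embeddings]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        if var > eps:
            active += 1
    return active

healthy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
collapsed = [[1.0, 0.5], [2.0, 0.5], [3.0, 0.5]]  # dim 1 carries no signal
print(active_dims(healthy))    # 2
print(active_dims(collapsed))  # 1
```

The paper's claim, in these terms, is that attaching a DNN (parallel or stacked) keeps more dimensions active than the interaction module alone.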

[IR-5] Efficient Listwise Reranking with Compressed Document Representations

【Quick Read】: This paper addresses the high computational cost of the reranking step in retrieval pipelines, especially when large language models (LLMs) are used. Traditional approaches rely on smaller LLMs or restrict input length to reduce overhead, often at the cost of effectiveness. The key of the solution is RRK, a listwise reranker based on document compression that represents each document as a fixed-size multi-token embedding and is trained via distillation. It matches or exceeds the effectiveness of smaller models while being far more efficient: the 8B-parameter RRK runs 3-18x faster than 0.6-4B rerankers, with an even larger advantage on long-document benchmarks.

Link: https://arxiv.org/abs/2604.26483
Authors: Hervé Déjean,Stéphane Clinchant
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Reranking, the process of refining the output from a first-stage retriever, is often considered computationally expensive, especially when using Large Language Models (LLMs). A common approach to mitigate this cost involves utilizing smaller LLMs or controlling input length. Inspired by recent advances in document compression for retrieval-augmented generation (RAG), we introduce RRK, an efficient and effective listwise reranker compressing documents into multi-token fixed-size embedding representations. Our simple training via distillation shows that this combination of rich compressed representations and listwise reranking yields a highly efficient and effective system. In particular, our 8B-parameter model runs 3x-18x faster than smaller rerankers (0.6-4B parameters) while matching or outperforming them in effectiveness. The efficiency gains are even more striking on long-document benchmarks, where RRK widens its advantage further.

[IR-6] CARD: Non-Uniform Quantization of Visual Semantic Unit for Generative Recommendation

【Quick Read】: This paper tackles two key challenges in learning high-quality Semantic IDs (SIDs) for generative recommendation: (1) the two-stage paradigm (SID construction followed by autoregressive generation) provides insufficient supervision for fusing heterogeneous information, limiting SID quality; and (2) non-uniform embedding distributions cause codeword imbalance and generation bias. The key of the solution is the CARD framework, with two core innovations. First, a visual semantic unit unifies textual, visual, and collaborative signals into a structured visual representation prior to encoding, enabling holistic semantic modeling, alleviating the semantic gap, and reducing the reliance on supervision signals. Second, a non-uniform quantization framework (NU-RQ-VAE) incorporates a learnable, invertible non-uniform transform into the quantization process to map skewed semantic embedding distributions into a more balanced latent space, significantly improving codebook utilization and quantization accuracy.

Link: https://arxiv.org/abs/2604.26427
Authors: Yibiao Wei,Jie Zou,Pengfei Zhang,Xiao Ao,Weikang Guo,Zeyu Ma,Yang Yang
Affiliations: University of Electronic Science and Technology of China; Southwestern University of Finance and Economics
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Generative recommendation frameworks typically represent items as discrete Semantic IDs (SIDs). While existing studies have sought to enhance SID construction by incorporating multimodal content, collaborative signals, or more advanced quantization techniques, learning high-quality SIDs still faces two key challenges: (1) The two-stage generative recommendation paradigm (SID construction and autoregressive generation) provides insufficient supervision for heterogeneous fusion, which hinders learning high-quality SIDs, and (2) non-uniform embeddings lead to codeword imbalance and generation bias. To address these challenges, we propose a novel generative recommendation framework, called CARD. CARD introduces a visual semantic unit that unifies textual, visual, and collaborative signals into a structured visual representation prior to encoding, enabling holistic semantic modeling and effectively alleviating the semantic gap, thereby reducing the reliance on supervision signals during SID learning. Furthermore, to deal with the highly non-uniform distribution of item semantic embeddings in recommendation scenarios, we develop a non-uniform quantization framework (NU-RQ-VAE), which incorporates a learnable and invertible non-uniform transformation into the quantization process to map skewed semantic distributions into a more balanced latent space, thereby significantly improving codebook utilization and quantization accuracy. Experiments on multiple datasets show that CARD consistently outperforms baseline methods under various settings; meanwhile, the proposed non-uniform transformation module is plug-and-play and remains robust across different quantization schemes. Code is available at this https URL.
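The intuition behind quantizing in a warped space can be sketched as follows. The power warp, codebook, and values below are our illustration, not the paper's learned transform: a monotone invertible map compresses a heavy-tailed value range so that a uniform codebook covers it more evenly, which is the codeword-balance effect NU-RQ-VAE targets.

```python
def warp(x, a=0.5):
    """Monotone, invertible compression of a nonnegative magnitude."""
    return x ** a

def unwarp(y, a=0.5):
    """Exact inverse of warp (for a > 0)."""
    return y ** (1.0 / a)

def quantize(x, codebook):
    """Nearest-codeword assignment performed in the warped space."""
    y = warp(x)
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - y))

codebook = [0.5, 1.5, 2.5, 3.5]            # uniform grid in warped space
skewed = [0.1, 0.5, 1.5, 4.0, 9.0, 16.0]   # heavy-tailed raw values
codes = [quantize(x, codebook) for x in skewed]
print(codes)  # [0, 0, 1, 1, 2, 3] -- all four codewords get used
print(unwarp(warp(4.0)))  # 4.0: the transform is invertible
```

Without the warp, most of these small values would pile onto the lowest codewords; NU-RQ-VAE learns such a transform jointly with the residual-quantized VAE rather than fixing a square root.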

[IR-7] Meta-Learning and Targeted Differential Privacy to Improve the Accuracy-Privacy Trade-off in Recommendations

【Quick Read】: This paper addresses the trade-off between differential privacy (DP) and recommendation accuracy in privacy-preserving recommender systems, since DP noise degrades model performance. The key of the solution is optimization at two levels: at the data level, "targeted DP" adds DP noise only to the user data most likely to reveal sensitive attributes (e.g., gender or age), reducing unnecessary perturbation; at the model level, meta-learning improves robustness to the remaining DP noise. This combined strategy clearly outperforms uniformly applied DP and full-DP baselines, improving recommendation accuracy while lowering empirical privacy risk.

Link: https://arxiv.org/abs/2604.26390
Authors: Peter Müllner,Dominik Kowald,Markus Schedl,Elisabeth Lex
Affiliations: Know Center Research GmbH; University of Graz; Johannes Kepler University; LIT Linz; Graz University of Technology
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted at LBR@UMAP'26

Click to view abstract

Abstract:Balancing differential privacy (DP) with recommendation accuracy is a key challenge in privacy-preserving recommender systems, since DP-noise degrades accuracy. We address this trade-off at both the data and model levels. At the data level, we apply DP only to the most stereotypical user data likely to reveal sensitive attributes, such as gender or age, to reduce unnecessary perturbation; we refer to this as targeted DP. At the model level, we use meta-learning to improve robustness to remaining DP-noise. This achieves a better trade-off between accuracy and privacy than standard approaches: Meta-learning improves accuracy and targeted DP leads to lower empirical privacy risk compared to uniformly applied DP and full DP baselines. Overall, our findings show that selectively applying DP at the data level together with meta-learning at the model level can effectively balance recommendation accuracy and user privacy.
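The data-level idea can be sketched as below. This is a simplified illustration (ours, not the authors' code, and without a formal privacy accounting): Laplace noise is added only to profile vectors of users flagged as stereotypical for a sensitive attribute, leaving everyone else's data untouched.

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise via inverse-CDF sampling."""
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def targeted_dp(profiles, flagged, scale=1.0, seed=0):
    """Perturb only the flagged users' profile vectors."""
    rng = random.Random(seed)
    out = {}
    for user, vec in profiles.items():
        if user in flagged:
            out[user] = [x + laplace_noise(scale, rng) for x in vec]
        else:
            out[user] = list(vec)  # no unnecessary perturbation
    return out

profiles = {"alice": [1.0, 2.0], "bob": [3.0, 4.0]}
noised = targeted_dp(profiles, flagged={"alice"})
print(noised["bob"])  # [3.0, 4.0] -- unflagged user is unchanged
```

The paper's second lever, meta-learning, then trains the recommender to stay accurate under the noise that remains in the flagged portion.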

[IR-8] Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI

【Quick Read】: This paper addresses the difficulty of end-to-end evaluation of enterprise document AI systems: existing work mostly evaluates individual pipeline stages (parsing, indexing, retrieval, generation) in isolation, with no unified measure of how the whole pipeline performs together. The key of the solution is EnterpriseDocBench, a benchmark that evaluates four core metrics on the same corpus: parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness. The corpus is drawn from permissively licensed public documents across six enterprise domains (five in the current pilot), and three representative pipelines (BM25, dense embedding, and hybrid retrieval) are compared with the same GPT-5 generator. The experiments reveal very weak cross-stage correlations (e.g., parsing-retrieval r=0.14), contradicting the common "cascading quality" assumption; they also show that hallucination does not grow monotonically with document length, and that although the systems reach 85.5% factual accuracy, answer completeness averages only 0.40, exposing a completeness-accuracy gap that matters more for real deployments than the headline accuracy number.

Link: https://arxiv.org/abs/2604.26382
Authors: Saurabh K. Singh,Sachin Raj
Affiliations: Oracle
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 16 pages, 4 tables. Code, metrics, and pilot data to be released upon publication

Click to view abstract

Abstract:Most enterprise document AI today is a pipeline. Parse, index, retrieve, generate. Each of those stages has been studied to death on its own – what’s still hard is evaluating the system as a whole. We built EnterpriseDocBench to take a swing at it: parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness, all on the same corpus. The corpus is built from public, permissively licensed documents across six enterprise domains (five represented in the current pilot). We ran three pipelines through it – BM25, dense embedding, and a hybrid – all with the same GPT-5 generator. The headline numbers: hybrid retrieval narrowly beats BM25 (nDCG@5 of 0.92 vs. 0.91), and both beat dense embedding (0.83). Hallucination doesn’t grow monotonically with document length – short documents and very long ones both hallucinate more than medium ones (28.1% and 23.8% vs. 9.2%). Cross-stage correlations are very weak: parsing-retrieval r=0.14, parsing-generation r=0.17, retrieval-generation 0.02. If quality were cascading the way most of us assume, those numbers would be much higher; they aren’t. Design caveats are real (parsing fixed, generator shared, automated proxy metrics) and we don’t oversell the result. One result that genuinely surprised us: factual accuracy on stated claims is 85.5%, but answer completeness averages 0.40. The system is right when it answers – it just leaves things out. That gap matters more for real deployments than the headline accuracy number does. We also describe three reference architectures (ColPali, ColQwen2, agentic complexity-based routing) which are not yet integrated end-to-end. Framework, metrics, baselines, and collection scripts will be released open-source on acceptance.

[IR-9] Explaining the “Why”: A Unified Framework for the Additive Attribution of Changes in Arbitrary Measures

【Quick Read】: This paper addresses the challenge of attributing changes in aggregate measures in data analytics, i.e., explaining why aggregated measures changed, where existing systems fall short in generality, in jointly covering data dimensions and measure composition, and in rigorous interpretability. The key of the solution is a principled framework grounded in cooperative game theory: by classifying measures according to their mathematical structure, it enables a spectrum of algorithms, from general approximations to exact closed-form solutions, offering a principled trade-off between generality and performance.

Link: https://arxiv.org/abs/2604.26266
Authors: Changsheng Zhou,Dajun Chen,Zhitao Shen,Wei Jiang,Yong Li,Peng Di
Affiliations: Ant Group
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Explaining why aggregated measures change is a critical challenge in data analytics that existing systems struggle to address. While current attribution methods exist, they lack a unified solution that is simultaneously general for arbitrary measures, holistic across both data dimensions and measure composition, and rigorous in its interpretability. To bridge this gap, we introduce a principled framework that reframes attribution through the powerful lens of cooperative game theory. Our key contribution is a classification of measures based on their mathematical structure, which enables a spectrum of algorithms-from general approximations to exact, closed-form solutions-that offer a principled trade-off between generality and performance. We demonstrate our framework’s superiority through a multi-faceted evaluation: simulations first confirm its numerical accuracy and then its generality for non-additive measures; a case study on Simpson’s Paradox showcases its unique interpretability; and a final experiment proves its practical utility by significantly outperforming existing root cause analysis systems.
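Since the framework is grounded in cooperative game theory, the generic (exponential-time) Shapley attribution it specializes can be sketched directly; the paper's contribution is the measure classification that yields faster approximations or closed forms. The function names and the toy additive measure below are ours.

```python
import itertools
import math

def shapley(players, value):
    """Exact Shapley values for coalition value function `value`."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(len(others) + 1):
            for coalition in itertools.combinations(others, r):
                s = len(coalition)
                # Standard Shapley weight |S|! (n-|S|-1)! / n!
                weight = (math.factorial(s) * math.factorial(n - s - 1)
                          / math.factorial(n))
                phi[p] += weight * (value(set(coalition) | {p})
                                    - value(set(coalition)))
    return phi

# Toy measure: the total change decomposes additively over segments,
# so each segment's Shapley value is exactly its own delta.
delta = {"A": 3.0, "B": 1.0}
phi = shapley(["A", "B"], lambda S: sum(delta[p] for p in S))
print(phi)
```

For non-additive measures (ratios, averages under Simpson's paradox), the marginal contributions differ across coalitions, and the same routine still produces an additive, efficiency-preserving attribution; the paper's structured algorithms avoid the 2^n enumeration.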

[IR-10] TimeMM: Time-as-Operator Spectral Filtering for Dynamic Multimodal Recommendation

【Quick Read】: This paper addresses the dynamic evolution of user interests over time in multimodal recommendation, in particular the fact that different modalities (e.g., visual vs. textual) dominate decisions in different temporal regimes. Existing methods rely on static interaction graphs or coarse temporal heuristics and cannot adapt at fine temporal granularity. The key of the solution is the TimeMM framework: treating time as an operator, it builds a spectral filtering mechanism on parametric temporal kernels that maps interaction recency to edge reweighting, producing component-specific representations; Adaptive Spectral Filtering then mixes the operator bank according to temporal context to obtain prediction-specific spectral responses, and Spectral-Aware Modality Routing calibrates the multimodal contributions under the same temporal context; finally, a ranking-space spectral diversity regularization prevents filter-bank collapse and keeps expert behaviors complementary. The scheme achieves continuous modeling of non-stationary user interests with efficient computation.

Link: https://arxiv.org/abs/2604.26247
Authors: Wei Yang,Rui Zhong,Zihan Lin,Xiaodan Wang,Cheng Chen,Huan Ren,Yao Hu
Affiliations: Xiaohongshu Inc.
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Multimodal recommendation improves user modeling by integrating collaborative signals with heterogeneous item content. In real applications, user interests evolve over time and exhibit nonstationary dynamics, where different preference factors change at different rates. This challenge is amplified in multimodal settings because visual and textual cues can dominate decisions under different temporal regimes. Despite strong progress, most multimodal recommenders still rely on static interaction graphs or coarse temporal heuristics, which limits their ability to model continuous preference evolution with fine-grained temporal adaptation. To address these limitations, we propose TimeMM, a time-conditioned spectral filtering framework for dynamic multimodal recommendation. TimeMM instantiates Time-as-Operator by mapping interaction recency to a family of parametric temporal kernels that reweight edges on the user–item graph, producing component-specific representations without explicit eigendecomposition. To capture non-stationary interests, we introduce Adaptive Spectral Filtering that mixes the operator bank according to temporal context, yielding prediction-specific effective spectral responses. To account for modality-specific temporal sensitivity, we further propose Spectral-Aware Modality Routing that calibrates visual and textual contributions conditioned on the same temporal context. Finally, a ranking-space Spectral Diversity Regularization encourages complementary expert behaviors and prevents filter-bank collapse. Extensive experiments on real-world benchmarks demonstrate that TimeMM consistently outperforms state-of-the-art multimodal recommenders while maintaining linear-time scalability.
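The time-as-operator reweighting can be illustrated with a single exponential kernel. This is our sketch, not the paper's code: TimeMM learns and mixes a whole bank of parametric kernels, but each one does something like the edge reweighting below before graph propagation.

```python
import math

def exp_kernel(age, tau):
    """Exponential-decay temporal kernel: fresher edges weigh more."""
    return math.exp(-age / tau)

def reweight_edges(edges, tau=7.0):
    """edges: list of (user, item, age_in_days) -> weighted edges."""
    return [(u, i, exp_kernel(age, tau)) for u, i, age in edges]

edges = [("u1", "i1", 0.0), ("u1", "i2", 14.0)]
for u, i, w in reweight_edges(edges):
    print(u, i, round(w, 3))  # today's edge keeps weight 1.0, old one decays
```

With several kernels of different `tau`, a context-dependent mixture over them yields the "prediction-specific effective spectral response" the abstract describes.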

[IR-11] ProMax: Exploring the Potential of LLM-derived Profiles with Distribution Shaping for Recommender Systems SIGIR2026

【Quick Read】: This paper addresses two core problems. First, in existing LLM-based recommender systems, the mechanism by which structured user and item preference profiles improve recommendation performance within the feature space remains poorly understood. Second, current methods mostly incorporate these profiles through nonlinear alignment and fusion strategies, which often cause semantic loss and leave the profiles' potential underexploited. The key of the solution is to revisit profiles from a retrieval perspective and propose ProMax, a simple yet effective recommendation framework whose core is a dual distribution-reshaping process: the profile distribution acts as a guiding signal that steers the recommendation model toward learning user preferences over unobserved items, strengthening generalization beyond observed interactions.

Link: https://arxiv.org/abs/2604.26231
Authors: Yi Zhang,Yiwen Zhang,Kai Zheng,Tong Chen,Hongzhi Yin
Affiliations: Anhui University; University of Electronic Science and Technology of China; The University of Queensland
Subjects: Information Retrieval (cs.IR)
Comments: 11 pages, 8 figures, accepted by SIGIR 2026

Click to view abstract

Abstract:The remarkable text understanding and generation capabilities of large language models (LLMs) have revitalized the field of general recommendation based on implicit user feedback. Rather than deploying LLMs directly as recommendation models, a more flexible paradigm leverages their ability to interpret users’ historical interactions and semantic contexts to extract structured profiles that characterize user preferences. These profiles can be further transformed into actionable high-dimensional representations, serving as powerful signals to augment and strengthen recommendation models. However, the mechanism by which such profiles enhance recommendation performance within the feature space remains insufficiently understood. Moreover, existing studies predominantly rely on nonlinear alignment and fusion strategies to incorporate these profiles, which often lead to semantic loss and fail to fully exploit their potential. To address these limitations, we revisit profiles from a retrieval perspective and propose a simple yet effective recommendation framework built upon distribution shaping (ProMax) in this paper. We begin by employing dense retrieval to uncover the collaborative relationships between user and item profiles within the feature space. Based on this insight, we introduce a dual distribution-reshaping process, in which the profile distribution acts as a guiding signal to steer the recommendation model toward learning user preferences for unseen items beyond the scope of observed interactions. We apply ProMax to four classic recommendation methods on three public datasets. The results indicate that ProMax substantially improves base model performance and outperforms existing LLM-based recommendation approaches.

[IR-12] Hierarchical Long-Term Semantic Memory for LinkedIn's Hiring Agent

【Quick Read】: This paper addresses the difficulty of building long-term semantic memory for industrial-grade LLM agents that need personalized, context-aware interactions in real products. The core problem is extracting implicit and explicit signals from noisy longitudinal behavioral data, storing them in structured form, and supporting low-latency retrieval, while also meeting requirements for scalability, privacy constraints, cross-domain generalizability, and observability. The key of the solution is the Hierarchical Long-Term Semantic Memory (HLTM) framework, which organizes textual data into a schema-aligned memory tree that captures semantic knowledge at multiple levels of granularity, enabling scalable ingestion, privacy-aware storage, low-latency retrieval, and transparent provenance; HLTM further includes an adaptation mechanism for cross-scenario generalization. On LinkedIn's Hiring Assistant, it improves answer correctness and retrieval F1 by more than 10% and substantially advances the Pareto frontier between query and indexing latency.

Link: https://arxiv.org/abs/2604.26197
Authors: Zhentao Xu,Shangjing Zhang,Emir Poyraz,Yvonne Li,Ye Jin,Xie Lu,Xiaoyang Gu,Karthik Ramgopal,Praveen Kumar Bodigutla,Xiaofeng Wang
Affiliations: LinkedIn Corporation
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large Language Model (LLM) agents are increasingly used in real-world products, where personalized and context-aware user interactions are essential. A central enabler of such capabilities is the agent’s long-term semantic memory system, which extracts implicit and explicit signals from noisy longitudinal behavioral data, stores them in a structured form, and supports low-latency retrieval. Building industrial-grade long-term memory for LLM agents raises five challenges: scalability, low-latency retrieval, privacy constraints, cross-domain generalizability, and observability. We introduce the Hierarchical Long-Term Semantic Memory (HLTM) framework, which organizes textual data into a schema-aligned memory tree that captures semantic knowledge at multiple levels of granularity, enabling scalable ingestion, privacy-aware storage, low-latency retrieval, and transparent provenance; HLTM further incorporates an adaptation mechanism to generalize across diverse use cases. Extensive evaluations on LinkedIn’s Hiring Assistant show that HLTM improves answer correctness and retrieval F1 significantly by more than 10%, while significantly advancing the Pareto frontier between query and indexing latency. HLTM has been deployed in LinkedIn’s Hiring Assistant to power core personalization features in production hiring workflows.
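A toy schema-aligned memory tree (our illustration of the idea, not LinkedIn's implementation; the schema keys and stored facts are invented) shows why the hierarchy supports low-latency retrieval: a query only walks the branch its schema path names instead of scanning all memories.

```python
# Facts live at leaves under schema-defined branches.
memory = {
    "candidate_preferences": {
        "location": ["prefers remote roles"],
        "seniority": ["targets staff-level positions"],
    },
    "recruiter_context": {
        "open_roles": ["backend engineer, fintech team"],
    },
}

def retrieve(tree, path):
    """Follow the schema path down to a single subtree or leaf."""
    node = tree
    for key in path:
        node = node[key]
    return node

print(retrieve(memory, ["candidate_preferences", "location"]))
```

Intermediate nodes in a real system would additionally hold summaries of their subtrees, giving the multiple granularities the abstract mentions.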

[IR-13] FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing

【Quick Read】: This paper addresses the hidden cultural aesthetics encoded in fashion AI systems: models absorb the styles of specific houses, editors, and historical periods during training without disclosing them, leaving their decisions opaque. The key of the solution is FASH-iCNN, a multimodal system trained on 87,547 Vogue runway images from 15 luxury houses spanning 1991-2024 that makes this editorial cultural logic inspectable. From a single garment photograph, it recovers the producing house (78.2% top-1), the decade (88.6% top-1), and the specific year (58.3% top-1, with a mean error of only 2.2 years). Probing further reveals that texture and luminance are the primary visual carriers of editorial identity, while color matters far less; the approach thus treats editorial culture as an interpretable signal rather than background noise, letting users trace the fashion context encoded in each prediction.

Link: https://arxiv.org/abs/2604.26186
Authors: Morayo Danielle Adeyemi,Ryan A. Rossi,Franck Dernoncourt
Affiliations: Howard University; Adobe Research
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: 5 pages, 4 tables, 1 figure. Under review

Click to view abstract

Abstract:Fashion AI systems routinely encode the aesthetic logic of specific houses, editors, and historical moments without disclosing it. We present FASH-iCNN, a multimodal system trained on 87,547 Vogue runway images across 15 fashion houses spanning 1991-2024 that makes this cultural logic inspectable. Given a photograph of a garment, the system recovers which house produced it, which era it belongs to, and which color tradition it reflects. A clothing-only model identifies the fashion house at 78.2% top-1 across 14 houses, the decade at 88.6% top-1, and the specific year at 58.3% top-1 across 34 years with a mean error of just 2.2 years. Probing which visual channels carry this signal reveals a sharp dissociation: removing color costs only 10.6pp of house identity accuracy, while removing texture costs 37.6pp, establishing texture and luminance as the primary carriers of editorial identity. FASH-iCNN treats editorial culture as the signal rather than background noise, identifying which houses, eras, and color traditions shaped each output so that users can see not just what the system predicts but which houses, editors, and historical moments are encoded in that prediction.

[IR-14] RAG-Enhanced Kernel-Based Heuristic Synthesis (RKHS): A Structured Methodology Using Large Language Models for Hardware Design

【Quick Read】: This paper addresses the fact that designing optimization strategies in electronic design automation (EDA), such as placement, routing, and scheduling, depends heavily on expert experience and is hard to reuse. The core challenge is to systematically extract and synthesize transferable heuristics from accumulated experience, rather than doing one-shot code generation. The key of the solution is RAG-Enhanced Kernel-Based Heuristic Synthesis (RKHS), which integrates retrieval-augmented generation (RAG), compact kernel heuristic templates, and an LLM-driven refinement loop inspired by iterative self-feedback. Applied to latency-minimizing list scheduling in high-level synthesis (HLS), it reduces average schedule length by up to 11% with only 1.3x runtime overhead, and the retrieval-synthesis loop generalizes to other EDA optimization problems.

Link: https://arxiv.org/abs/2604.26153
Authors: Shiva Ahir,Alex Doboli
Affiliations: Stony Brook University
Subjects: Hardware Architecture (cs.AR); Information Retrieval (cs.IR)
Comments: Presented at the NSF Workshop on Agents for Chip Design Automation, UCLA

Click to view摘要

Abstract:Heuristic design upholds modern electronic design automation (EDA) tools, yet crafting effective placement, routing, and scheduling strategies entails substantial expertise. We study how large language models (LLMs) can systematically synthesize reusable optimization heuristics beyond one-shot code generation. We propose RAG-Enhanced Kernel-Based Heuristic Synthesis (RKHS), which integrates retrieval-augmented generation (RAG), compact kernel heuristic templates, and an LLM-driven refinement loop inspired by iterative self-feedback. Applied to latency-minimizing list scheduling in high-level synthesis (HLS), a prototype reduces average schedule length by up to 11 percent over a baseline scheduler with only 1.3x runtime overhead, and the structured retrieval-synthesis loop generalizes to other EDA optimization problems.
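For context, the list-scheduling task RKHS targets can be sketched with a generic, unit-latency textbook scheduler (ours, not the synthesized heuristic). The priority kernel, here simply successor count, is exactly the piece RKHS would synthesize and refine.

```python
def list_schedule(deps, succs, units):
    """deps: op -> set of predecessor ops; succs: op -> successor count.
    Each cycle, rank ready ops by the priority kernel and issue up to
    `units` of them (unit-latency functional units)."""
    done, remaining, schedule, cycle = set(), set(deps), {}, 0
    while remaining:
        ready = sorted((op for op in remaining if deps[op] <= done),
                       key=lambda op: -succs[op])  # priority kernel
        issued = ready[:units]
        for op in issued:
            schedule[op] = cycle
        remaining -= set(issued)
        done |= set(issued)  # completed, so available next cycle
        cycle += 1
    return schedule

# Diamond DAG a -> {b, c} -> d on a single functional unit.
deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
succs = {"a": 2, "b": 1, "c": 1, "d": 0}
sched = list_schedule(deps, succs, units=1)
print(sched)  # a at cycle 0, b and c in cycles 1-2, d at cycle 3
```

A synthesized kernel would replace the `-succs[op]` ranking with a richer learned expression (slack, critical path, resource pressure), which is where the reported schedule-length reductions come from.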

[IR-15] MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese

【Quick Read】: This paper addresses the pronounced linguistic bias in current mathematical-reasoning evaluations for large language models (LLMs): existing benchmark datasets are overwhelmingly in English or translated from English, limiting assessment of model generalization outside English. The key of the solution is Math-PT, a new multilingual math-reasoning dataset of 1,729 problems written in European and Brazilian Portuguese, curated from high-quality native sources including mathematical Olympiads, exams, and top-level competitions from Portugal and Brazil. A systematic benchmark of frontier LLMs on the dataset shows that state-of-the-art models perform strongly on multiple-choice questions but degrade on questions with figures or open-ended questions, providing an important baseline and direction for cross-lingual mathematical reasoning research.

Link: https://arxiv.org/abs/2604.25926
Authors: Tiago Teixeira,Ana Carolina Erthal,Juan Belieni,Beatriz Canaverde,Diego Mesquita,Miguel Faria,Eliezer de Souza da Silva,André F. T. Martins
Affiliations: Instituto Superior Técnico, Universidade de Lisboa; Fundação Getulio Vargas; Instituto de Telecomunicações; Universidade de Coimbra, CISUC/LASI, DEI; Basque Center for Applied Mathematics
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted at 17th International Conference on Computational Processing of Portuguese (PROPOR 2026). Open access to dataset repo this https URL and model outputs this https URL

Click to view abstract

Abstract:The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and benchmark datasets. However, most mathematical reasoning evaluations exhibit a significant linguistic bias, with the vast majority of benchmark datasets being exclusively in English or (at best) translated from English. We address this limitation by introducing Math-PT, a novel dataset comprising 1,729 mathematical problems written in European and Brazilian Portuguese. Math-PT is curated from a variety of high-quality native sources, including mathematical Olympiads, competitions, and exams from Portugal and Brazil. We present a comprehensive benchmark of current state-of-the-art LLMs on Math-PT, revealing that frontier reasoning models achieve strong performance in multiple choice questions compared to open weight models, but that their performance decreases for questions with figures or open-ended questions. To facilitate future research, we release the benchmark dataset and model outputs.

[IR-16] Generative AI-Based Virtual Assistant using Retrieval-Augmented Generation: An evaluation study for bachelor projects

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在特定专业领域应用中面临的幻觉、信息缺失以及难以提供准确且上下文相关响应的问题,尤其是在高等教育场景下支持学生理解项目相关规章制度时。解决方案的关键在于构建一个基于检索增强生成(Retrieval-Augmented Generation, RAG)的虚拟助手系统,通过整合实时且领域特定的知识库来提升回答的准确性与可靠性,从而有效应对LLMs在专业教育场景中的局限性。

链接: https://arxiv.org/abs/2604.25924
作者: Dumitru Verşebeniuc,Martijn Elands,Sara Falahatkar,Chiara Magrone,Mohammad Falah,Martijn Boussé,Aki Härmä
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted at BNAIC/BeNeLearn 2024, to appear in Springer CCIS series. 15 pages + refs. Code and survey available at this https URL

点击查看摘要

Abstract:Large Language Models have been increasingly employed in the creation of Virtual Assistants due to their ability to generate human-like text and handle complex inquiries. While these models hold great promise, challenges such as hallucinations, missing information, and the difficulty of providing accurate and context-specific responses persist, particularly when applied to highly specialized content domains. In this paper, we focus on addressing these challenges by developing a virtual assistant designed to support students at Maastricht University in navigating project-specific regulations. We propose a virtual assistant based on a Retrieval-Augmented Generation system that enhances the accuracy and reliability of responses by integrating up-to-date, domain-specific knowledge. Through a robust evaluation framework and real-life testing, we demonstrate that our virtual assistant can effectively meet the needs of students while addressing the inherent challenges of applying Large Language Models to a specialized educational context. This work contributes to the ongoing discourse on improving LLM-based systems for specific applications and highlights areas for further research.
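检索增强生成的核心流程可用如下假设性最小示意理解:先用检索器从领域知识库中取回最相关的条目,再将其拼入提示以约束生成、减少幻觉。此处用简单的词袋余弦相似度代替真实系统中的稠密向量检索,规章文本为虚构示例,LLM生成步骤未包含在内。

```python
# 假设性示意:RAG 的"检索 + 拼接提示"两步。
# 实际系统(如本文的虚拟助手)通常使用稠密向量检索与真实 LLM 生成。
import math
from collections import Counter

def cosine(a, b):
    """简化的词袋余弦相似度。"""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    num = sum(ca[w] * cb[w] for w in set(ca) & set(cb))
    den = (math.sqrt(sum(v * v for v in ca.values()))
           * math.sqrt(sum(v * v for v in cb.values())))
    return num / den if den else 0.0

def retrieve(query, docs, k=1):
    """返回与 query 最相似的 k 篇文档(简化的检索器)。"""
    return sorted(docs, key=lambda d: cosine(query, d), reverse=True)[:k]

def build_prompt(query, docs):
    """将检索到的领域知识拼入提示,以约束生成、减少幻觉。"""
    context = "\n".join(f"- {d}" for d in docs)
    return f"根据以下规章回答问题:\n{context}\n问题:{query}"

regulations = [
    "thesis deadline is June 30 for bachelor projects",
    "group size for bachelor projects is at most four students",
]
query = "what is the bachelor projects deadline"
top = retrieve(query, regulations, k=1)
prompt = build_prompt(query, top)
```

论文的评估框架即围绕这类流水线,检验生成答案是否忠实于检索到的规章内容。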

[IR-17] CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation SIGIR2026

【速读】:该论文旨在解决多语言检索增强生成(Retrieval-Augmented Generation, RAG)中因不同语言间知识差异导致的性能下降问题,即单纯拼接多语言知识片段难以提升生成效果。其解决方案的关键在于提出 CroSearch-R1 框架,该框架通过引入基于强化学习的搜索增强机制,结合跨语言知识整合的多轮检索策略,动态地将其他语言的知识对齐至统一表示空间,并设计多语言 rollout 机制以优化跨语言推理迁移能力,从而有效利用跨语言互补性,显著提升 RAG 在多语言语料上的有效性。

链接: https://arxiv.org/abs/2604.25182
作者: Rui Qi,Fengran Mo,Sijin Lu,Yufeng Chen,Jian-Yun Nie,Kaiyu Huang
机构: Beijing Jiaotong University (北京交通大学); Université de Montréal (蒙特利尔大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted to SIGIR 2026 (Short Paper)

点击查看摘要

Abstract:A multilingual collection may contain useful knowledge in other languages to supplement and correct the facts in the original language for Retrieval-Augmented Generation (RAG). However, the vanilla approach that simply concatenates multiple pieces of knowledge from different languages into the context may fail to improve effectiveness due to the potential disparities across languages. To better leverage multilingual knowledge, we propose CroSearch-R1, a search-augmented reinforcement learning framework to integrate multilingual knowledge into the Group Relative Policy Optimization (GRPO) process. In particular, the approach adopts a multi-turn retrieval strategy with cross-lingual knowledge integration to dynamically align the knowledge from other languages as supplementary evidence into a unified representation space. Furthermore, we introduce a multilingual rollout mechanism to optimize reasoning transferability across languages. Experimental results demonstrate that our framework effectively leverages cross-lingual complementarity and improves the effectiveness of RAG with multilingual collections.
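GRPO的"组相对"优势计算可用如下假设性示意理解:对同一查询采样一组rollout,用组内奖励的均值与标准差做归一化,无需价值网络;论文的多语言rollout机制即在这一组内混入不同语言的检索推理轨迹。奖励数值为虚构示例。

```python
# 假设性示意:GRPO 中的组相对优势(group-relative advantage)。
# 对同一查询的一组 rollout 奖励做组内标准化,无需额外的价值网络。
import math

def group_relative_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean) / (std + eps),组内相对优劣。"""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# 同一问题的 4 条 rollout(例如混合不同语言的检索路径)的奖励
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

优势为正的rollout(例如成功利用跨语言证据者)在策略更新中被强化,为负者被抑制。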

人机交互

[HC-0] Artistic Practice Opportunities in CST Evaluations: A Longitudinal Group Deployment of ArtKrit

【速读】:该论文旨在解决当前创造力支持工具(Creativity Support Tools, CSTs)评估中普遍忽视使用过程中的时间维度和社会互动因素的问题。现有评估多聚焦于静态性能指标,未能捕捉艺术家在长期使用过程中与工具关系的演变及其创作认知的变化。其解决方案的关键在于提出一种纵向的、基于群体的评估方法:通过为期三周的部署实验,让九位数字艺术家组成三个实践共同体(Communities of Practice),在研究者-艺术家协作下完成每周“大师临摹”任务,从而观察用户对ArtKrit工具从初期探索到选择性采纳或误用的动态变化,并揭示艺术感知方式的演化如何嵌套于艺术家支持网络之中。该方法强调将CST评估设计为促进深度艺术参与的机会,而非单纯的量化测量。

链接: https://arxiv.org/abs/2604.26935
作者: Catherine Liu,Tao Long,Asya Vaisberg,Chau Vu,Jiaju Ma,Jingyi Li
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 17 pages, 8 figures. Accepted to DIS 2026

点击查看摘要

Abstract:Creativity support tools (CSTs) aim to elevate the quality of artists’ creative processes and artifacts. Yet most current CST evaluations overlook temporal and social aspects of tool use. To address this gap, we present a longitudinal, group-based CST evaluation through a three-week deployment of ArtKrit, a computational drawing tool that supports disciplined drawing. Nine digital artists, organized into three communities of practice, completed weekly “master studies” alongside a researcher-artist. Our results show users’ evolving relationships with ArtKrit over time - from early experimentation to selective incorporation or misuse - alongside changes in their ways of artistic seeing. These changes unfolded within artist support networks that fostered confidence and creative safety, and validated individual expression. Overall, our findings suggest that CST evaluations can - and should - be designed as opportunities for meaningful artistic engagement rather than purely extractive measurement exercises. We contribute this longitudinal, group-based approach as one CST evaluation method.

[HC-1] Transferability of Token Usage Rights: A Design Space Analysis of Generative AI Services

【速读】:该论文旨在解决当前生成式AI服务中令牌(token)使用权受限的问题,即用户购买的令牌使用权限被平台绑定,缺乏时间、账户和服务间的灵活性,限制了用户的自主选择权。其解决方案的关键在于提出“令牌使用权可转移性”(Transferability of token usage rights)这一设计属性,通过分析四大主流大语言模型(LLM)服务的计费政策,识别出五个设计维度(目标、方向、单位、控制、可逆性)和五类具体转移类型(携带、共管、转让、转换、交易),从而将令牌从单一的技术单元和经济货币重新定义为用户中心系统设计的核心要素,显著增强用户对数据资源的灵活配置能力。

链接: https://arxiv.org/abs/2604.26683
作者: Jaeyong Lee,Heeju Kang,Ahra Cho,Baek Eunkyung
机构: Hongik University (弘益大学); International Design School for Advanced Studies (IDAS) (国际高级设计学院)
类目: Human-Computer Interaction (cs.HC)
备注: 5 pages, 3 tables, Submitted at Korean Society of Design Science (KSDS) Spring Conference 2026

点击查看摘要

Abstract:With the rapid spread of generative AI services, the token has gained value not only as a technical unit of language processing but also as an economic currency for accessing AI services. Major AI model providers have adopted token-based billing as their default service model, requiring users to purchase platform-bound, fixed token usage rights. However, the fixedness of these usage rights is grounded in the billing-policy decisions of service providers rather than in any technical necessity. This study defines the Transferability of token usage rights as a design property that allows users to flexibly reallocate purchased data resources free from the constraints of time, account, and service. Drawing on the Design Space Analysis framework of MacLean et al. (1991), we identify five design axes (Target, Direction, Unit, Control, Reversibility) and five concrete Transferability types (carry-over, co-management, transfer, conversion, and trade) by analyzing the billing policies and terms of service of four major LLM services (ChatGPT, Claude, Gemini, Grok). Our analysis reframes the token from a purely economic-technical primitive into a core element of user-centered system design that expands user choice and autonomy.

[HC-2] MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria

【速读】:该论文旨在解决当前大语言模型作为评判者(LLM-as-a-judge)系统中评价标准制定的单一化与缺乏协作性问题,即现有方法通常由单个个体定义评估标准,未能反映多利益相关方在价值观、解释框架和优先级上的多样性,导致标准难以体现实际应用场景中的复杂性。解决方案的关键在于提出MultEval系统,该系统通过引入共识构建理论支持多评估者识别并诊断分歧,允许基于示例和提案历史迭代修订标准,并确保判断规则向自动化评估器编码过程的透明性,从而促进跨领域专家群体在协同过程中达成共识并演化出更可靠、可解释的评价体系。

链接: https://arxiv.org/abs/2604.26679
作者: Charles Chiang,Simret Gebreegziabher,Annalisa Szymanski,Yukun Yang,Hyo Jin Do,Zahra Ashktorab,Werner Geyer,Toby Li,Diego Gomez-Zara
机构: University of Notre Dame (圣母大学); IBM Research (IBM 研究院)
类目: Human-Computer Interaction (cs.HC)
备注: 17 pages, 5 figures

点击查看摘要

Abstract:LLM-as-a-judge approaches have emerged as a scalable solution for evaluating model behaviors, yet they rely on evaluation criteria often created by a single individual, embedding that person’s assumptions, priorities, and interpretive lens. In practice, defining such criteria is a collaborative and contested process involving multiple stakeholders with different values, interpretations, and priorities; an aspect largely unsupported by existing tools. To examine this problem in depth, we present a formative study examining how stakeholders collaboratively create, negotiate, and refine evaluation criteria for LLM-as-a-judge systems. Our findings reveal challenges in human oversight, including difficulties in establishing shared understanding, aligning values across stakeholders with different expertise and priorities, and translating nuanced human judgments into criteria that are interpretable and actionable for LLM judges. Based on these insights, we developed MultEval, a system that supports collaborative criteria by enabling multiple evaluators to surface and diagnose disagreements using consensus-building theory, iteratively revise criteria with attached examples and proposal history, and maintain transparency over how judgments are encoded into an automated evaluator. We further report a case study in which a team of domain experts used MultEval to collaboratively author criteria, illustrating how coordination and collaborative consensus-making shape criteria evolution.

[HC-3] Persona-Based Process Design for Assistive Human-Robot Workplaces for Persons with Disabilities

【速读】:该论文旨在解决人机协作工作场所中因个性化设计导致的可扩展性问题,即现有系统多为特定用户定制,难以实现普遍适用。其核心挑战在于如何将通用设计(Universal Design)理念融入到人-机器人协作流程的设计中,而这一过程通常需要专家知识且难以获取。解决方案的关键在于提出一种基于角色画像(Persona-based)的设计方法:首先将职场中常见的或与特定工艺相关的残疾类型抽象为具有代表性的角色画像;然后将工作流程分解为顺序动作,并针对每个动作和角色画像,通过设计思维(Design Thinking)开发达成目标的策略;最终将这些策略按机器人辅助程度排序并构建行为树(Behavior Tree),使工作场所的宏观行为能够在线适应不同角色画像,从而实现面向所有用户的包容性设计。

链接: https://arxiv.org/abs/2604.26527
作者: Nils Mandischer,Daria Eckert,Lars Mikelsons
机构: University of Augsburg (奥格斯堡大学)
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO); Systems and Control (eess.SY)
备注: Accepted at IEEE International Conference on Human-Machine Systems (ICHMS), Singapore, 2026

点击查看摘要

Abstract:Human-robot interaction is emerging as an important paradigm for integrating persons with disabilities into the workplace. While these systems can enable individuals to work, their design is mostly personalized, hindering widespread use beyond the individual user. The universal design paradigm is a central pillar of inclusive design, describing usability of systems by all. To incorporate universal design into process design for human-robot workplaces expert knowledge is required that is often not available. To simplify process design of human-robot workplaces, we propose a persona-based design approach. First, typical impairments prevalent in the workforce or particularly relevant for the processes are abstracted into personas with disabilities. The work process is subdivided into sequential actions. For each action and persona, strategies are developed to reach the action goal by a design thinking approach. The resulting actions are ordered by level of robot assistance, i.e. robot involvement, and implemented in a behavior tree. Therefore, the macro-behavior of the workplace may adapt to individual personas online. We demonstrate the method in a collaborative box folding process with a total of seven personas with disabilities. The persona-based process design shows promising results by generating more comprehensive process strategies while enabling adaptive behavior in the sense of universal design.
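文中"将策略按机器人辅助程度排序并置入行为树"的思路,可用一个最小的fallback(选择)节点示意。以下为假设性示例:角色画像与可行性条件(如 `grip_strength`)均为虚构,仅用于说明按辅助程度升序尝试、返回首个可行策略的机制。

```python
# 假设性示意:按机器人辅助程度升序排列的 fallback(选择)节点。
# 对给定角色画像,依次尝试各策略,返回第一个可行的策略。

def fallback(strategies, persona):
    """strategies: [(辅助程度, 策略名, 可行性判断)],按辅助程度升序尝试。"""
    for level, name, feasible in sorted(strategies, key=lambda s: s[0]):
        if feasible(persona):
            return name
    return "no-strategy"

strategies = [
    (2, "robot-handover", lambda p: True),                    # 机器人递送,总可行
    (0, "manual-fold",    lambda p: p["grip_strength"] > 5),  # 纯人工折叠
    (1, "jig-assisted",   lambda p: p["grip_strength"] > 2),  # 夹具辅助
]
persona_a = {"grip_strength": 8}   # 抓握能力无受限的画像
persona_b = {"grip_strength": 1}   # 抓握严重受限的画像
choice_a = fallback(strategies, persona_a)
choice_b = fallback(strategies, persona_b)
```

这样工作场所的宏观行为即可在线随画像切换:能力充分时选低辅助策略,受限时自动升级到更高的机器人参与度。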

[HC-4] Tree-of-Text: A Tree-based Prompting Framework for Table-to-Text Generation in the Sports Domain ACL

【速读】:该论文旨在解决从结构化表格生成体育赛事报告这一复杂任务中的两大挑战:一是传统基于模型的方法依赖大量标注数据,二是基于大语言模型(LLM)的提示方法因表格理解能力弱而易产生幻觉。解决方案的关键在于提出一种树状结构提示框架——Tree-of-Text,该框架通过三阶段流程引导LLM进行更精准的文本生成:首先进行内容规划以识别表格中的关键操作和参数;其次执行操作分解,将大表格拆分为可管理的子表;最后整合并重写短文本片段形成连贯的报告。此方法在多个基准数据集上显著优于现有技术,同时大幅降低时间和成本开销。

链接: https://arxiv.org/abs/2604.26501
作者: Shang-Hsuan Chiang,Tsan-Tsung Yang,An-Zi Yen,Wen-Chih Peng
机构: National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted by ACL SRW 2025: Long Paper (Oral)

点击查看摘要

Abstract:Generating sports game reports from structured tables is a complex table-to-text task that demands both precise data interpretation and fluent narrative generation. Traditional model-based approaches require large, annotated datasets, while prompt-based methods using large language models (LLMs) often struggle with hallucination due to weak table comprehension. To overcome these challenges, we propose Tree-of-Text, a tree-structured prompting framework that guides LLMs through a three-stage generation process: (1) Content Planning, where relevant operations and arguments are selected from the input tables; (2) Operation Execution, which breaks down large tables into manageable sub-tables; and (3) Content Generation, where short textual outputs are merged and rewritten into a cohesive report. Experiments show that our method outperforms existing methods on ShuttleSet+, leads in RG and CO metrics on RotoWire-FG, and excels in CS and CO on MLB with roughly 40% of the time and cost of Chain-of-Table. These results demonstrate the effectiveness and efficiency of Tree-of-Text and suggest a promising direction for prompt-based table-to-text generation in the sports domain.
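三阶段流程(内容规划 → 操作执行/拆分子表 → 合并重写)可用如下骨架示意。这是假设性示例:真实系统中三个阶段均由LLM提示驱动,此处用占位函数代替,表格数据为虚构。

```python
# 假设性示意:Tree-of-Text 的三阶段骨架,LLM 调用以占位函数代替。

def plan_content(table):
    """阶段 1:内容规划——选出要报道的列(此处简化为固定规则)。"""
    return [c for c in table[0] if c in ("player", "points")]

def split_subtables(table, cols, rows_per_chunk=2):
    """阶段 2:操作执行——把大表拆成可管理的子表,降低理解难度。"""
    idx = [table[0].index(c) for c in cols]
    body = [[row[i] for i in idx] for row in table[1:]]
    return [body[i:i + rows_per_chunk] for i in range(0, len(body), rows_per_chunk)]

def generate_report(subtables):
    """阶段 3:内容生成——为每个子表生成短句后合并重写(占位实现)。"""
    parts = [f"{r[0]} scored {r[1]} points" for sub in subtables for r in sub]
    return "; ".join(parts) + "."

table = [
    ["player", "team", "points"],
    ["Lee",    "A",    "21"],
    ["Chen",   "B",    "17"],
    ["Wang",   "A",    "9"],
]
cols = plan_content(table)
subs = split_subtables(table, cols)
report = generate_report(subs)
```

拆分子表使LLM每次只面对少量单元格,这正是论文降低幻觉的关键设计。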

[HC-5] Culturally Aware GenAI Risks for Youth: Perspectives from Youth Parents and Teachers in a Non-Western Context

【速读】:该论文旨在解决生成式 AI (Generative AI) 在非西方文化背景(特别是沙特阿拉伯)下,青少年使用过程中所面临的隐私与安全挑战被忽视的问题。现有研究多聚焦于西方语境,未能充分考虑宗教、社会规范及集体结构对青少年数字体验的深层影响。解决方案的关键在于通过混合方法(包括对736条Reddit和1,262条X/Twitter帖子的内容分析及31名沙特参与者访谈),揭示了非西方语境中生成式AI使用的“情境依赖性”与“关系性”隐私观——即隐私和安全感知由社群结构和文化规范共同塑造。研究发现,青少年在寻求情感支持时无意披露个人及家庭信息,易与强调谦逊、隐私和荣誉的文化期待产生冲突;同时,经济因素导致共享账户行为加剧风险。最终提出面向家长与教师期望的设计启示,为构建符合当地文化规范的包容性、情境敏感型家长控制机制奠定基础。

链接: https://arxiv.org/abs/2604.26494
作者: Aljawharah Alzahrani,Tory Park,Tanusree Sharma
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Generative AI tools are widely used by youth and have introduced new privacy and safety challenges. While prior research has explored youth’s safety in GenAI within a western context, it often overlooks the cultural, religious, and social dimensions of technology use that strongly shape youths’ digital experiences in countries like Saudi Arabia. To address the gap, this study explores children’s (aged 7 to 17), parents’ and teachers’ interactions with GenAI tools and risk perceptions through a non-western lens. Through a mixed methods approach, we analyzed 736 Reddit and 1,262 X(Twitter) posts and conducted interviews with 31 Saudi Arabian participants (8 youth, 13 parents, 10 teachers). Our findings highlight context dependent and relational privacy and safety of GenAI from a non-western context, often formed by communal structure and prescribed norms. We found significant risks tied to youths’ disclosure of personal and family information, which conflict with culturally rooted expectations of modesty, privacy, and honor, particularly when youth seek emotional support from GenAI. These risks are further compounded by socio economic factors such as cost-saving practices leading to the use of shared GenAI accounts (this http URL) within families or even among strangers. We provide design implications reflecting on parents’ and teachers’ expectations of how youth should use GenAI. This work lays groundwork for inclusive, context sensitive parental controls that adhere to cultural norms and values.

[HC-6] UIGaze: How Closely Can VLMs Approximate Human Visual Attention on User Interfaces?

【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在预测人类用户界面(User Interface, UI)注视点方面的能力尚未被系统评估的问题。其解决方案的关键在于构建一个基于真实眼动数据的零样本坐标预测流水线,利用UEyes数据集(包含1,980张UI截图及62名参与者的眼动追踪数据),对九种最先进的VLMs进行量化评估,通过高斯模糊生成显著性图,并使用相关系数(CC)、相似性指数(SIM)和KL散度进行对比分析,从而揭示VLMs在不同UI类型和注视时长下对人类视觉注意力的逼近程度。

链接: https://arxiv.org/abs/2604.26352
作者: Min Song,Yoonseong Lee,Yeonhu Seo
机构: Xebec Inc.(Xebec公司)
类目: Human-Computer Interaction (cs.HC)
备注: 6 pages, 4 tables, 1 figure

点击查看摘要

Abstract:Vision Language Models (VLMs) have demonstrated strong capabilities in understanding visual content, yet their ability to predict where humans look on user interfaces remains unexplored. We present UIGaze, a study investigating how closely VLMs can approximate human visual attention on user interfaces using real eye-tracking data. Using the UEyes dataset - comprising 1,980 UI screenshots across four categories (webpage, desktop, mobile, poster) with eye-tracking data from 62 participants - we evaluate nine state-of-the-art VLMs through a zero-shot coordinate prediction pipeline. Each model generates gaze point coordinates that are converted into saliency maps via Gaussian blurring and compared against ground truth using CC, SIM, and KL divergence. Our experiments (1,980 images x 9 models x 3 runs x 3 durations) reveal that VLMs achieve moderate alignment with human gaze patterns, with the degree of alignment varying significantly across UI types and improving with longer viewing durations - suggesting VLMs capture exploratory gaze patterns rather than initial fixations. All code, predictions, and evaluation results are publicly available.
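评估流水线中"注视点坐标 → 高斯模糊显著性图 → CC对比"的做法,可用纯Python最小示意如下。这是假设性示例:网格尺寸、注视点与 sigma 均为虚构小规模取值,真实实现通常在图像分辨率的密集网格上进行。

```python
# 假设性示意:把注视点转为高斯显著性图,并用皮尔逊相关系数(CC)比较两图。
import math

def saliency_map(points, w, h, sigma=1.0):
    """在 w x h 网格上叠加以各注视点为中心的高斯核。"""
    grid = [[0.0] * w for _ in range(h)]
    for px, py in points:
        for y in range(h):
            for x in range(w):
                d2 = (x - px) ** 2 + (y - py) ** 2
                grid[y][x] += math.exp(-d2 / (2 * sigma ** 2))
    return grid

def cc(a, b):
    """两张显著性图展平后的皮尔逊相关系数。"""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    ma, mb = sum(fa) / len(fa), sum(fb) / len(fb)
    num = sum((x - ma) * (y - mb) for x, y in zip(fa, fb))
    den = math.sqrt(sum((x - ma) ** 2 for x in fa)
                    * sum((y - mb) ** 2 for y in fb))
    return num / den if den else 0.0

human = saliency_map([(2, 2)], w=8, h=8)        # 人类注视的真值图
model_good = saliency_map([(2, 3)], w=8, h=8)   # 模型预测:靠近人类注视
model_bad = saliency_map([(7, 7)], w=8, h=8)    # 模型预测:明显偏离
```

CC越接近1表示模型预测的注意力分布与人类越一致;SIM与KL散度可在同一展平表示上类似地计算。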

[HC-7] Towards a Frugal Photosynthesis Sensing Toolkit for Data-Driven Plant Science Education and Exploration

【速读】:该论文旨在解决现有光合作用研究工具在教育与科研场景中存在访问门槛高、难以捕捉光合策略动态变化的问题。其核心解决方案在于开发了一种低成本、可现场部署的气体交换传感套件PhytoBits,关键创新在于通过叶室封装结合商用CO₂传感器和低成本微控制器,实现对植物多日气交换过程的监测,并能准确区分C₃型和CAM(Crassulacean Acid Metabolism)型光合路径,甚至识别兼性CAM及发育过程中CAM动态变化,从而为教学与研究提供一种可扩展且具有高灵敏度的替代方案。

链接: https://arxiv.org/abs/2604.26305
作者: Qitong Li,Raj Nileshbhai Dave,Rhema Amanda Phiri,Leo Zhang,Xiaoyu Zheng,Ariana Blake,Livia Ford,Sarah Jones,Susan R. Strickler,Nivedita Arora
机构: Northwestern University (西北大学); Chicago Botanic Garden (芝加哥植物园); Embodied System Lab, Northwestern University (西北大学具身系统实验室)
类目: Human-Computer Interaction (cs.HC)
备注: 25 pages, 17 figures, submitted conference paper on frugal plant gas-exchange sensing toolkit for photosynthesis education and exploration. Includes validation against LI-COR gas-exchange systems and biochemical assays for distinguishing C3 and CAM photosynthetic pathways

点击查看摘要

Abstract:Rapid environmental change and advances in data-driven analysis highlight the need not only to use computational tools, but also to foster understanding of the natural world and inspire creativity. Photosynthesis, the process that fuels nearly all life on Earth, provides a compelling context for such learning, particularly in understanding how plants alter their photosynthetic strategies in response to environmental changes. However, existing tools for studying photosynthesis are often inaccessible or limited to demonstrating its presence, rather than capturing its temporal dynamics. We present PhytoBits, a frugal in situ gas-exchange sensing toolkit for distinguishing and teaching photosynthetic strategies. PhytoBits combines leaf enclosure with accessible materials, an off-the-shelf CO₂ sensor, and a low-cost microcontroller, to support multi-day monitoring of plant gas-exchange in educational and research contexts. We validated PhytoBits against research-grade gas-exchange systems, confirming that it identifies C₃ and CAM (Crassulacean Acid Metabolism) photosynthetic pathways. In addition to obligate CAM, PhytoBits also resolves facultative CAM and developmental CAM dynamics in plants. This work presents an early-stage hardware validation; user deployment studies, open-source code dissemination, and automated pathway classification are planned as future work.

[HC-8] Towards Low-Cost Low-Power Activity-Aware Soil Moisture Sensing Platform for Large-scale Farming

【速读】:该论文旨在解决现代农业中因物联网(IoT)基础设施成本过高而导致难以广泛部署实时土壤湿度传感器的问题,从而影响作物产量预测和灌溉决策的精准性。其关键解决方案是提出一个端到端的无电池传感平台,包括埋地的自供电传感器节点与移动基站相结合的设计:传感器节点配备利用土壤电化学效应的自供电原电池式湿度探头,并通过高输入阻抗模拟前端提升耐用性;基站则依托农户日常耕作流程,在农用车辆上部署以被动监听方式完成数据采集;节点完全依靠采集的太阳能运行,单次电容充电最长可工作21天,支持土壤湿度、温度及环境条件的数据采集,并通过状态机控制实现可靠的握手式通信,实现实验室条件下70天稳定运行及1 km范围内2 dBm发射功率下的可靠传输,最终形成低成本(<35美元)、可扩展且无缝集成于农业操作中的智能监测方案。

链接: https://arxiv.org/abs/2604.26303
作者: Jack Thoene,Omar Kamil,Thekra Alkadee,Nivedita Arora
机构: Northwestern University (西北大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Deep understanding of a field’s soil moisture content is the leading indicator for predicting crop yields and making data driven decisions for irrigation and application of topical chemicals for drought resilience. Despite this importance, the cost of adopting and maintaining IoT infrastructure prevents modern farms from employing widespread real time soil moisture sensors. We present an end-to-end platform of buried battery-free sensor nodes and a mobile basestation that leverages the farmer’s daily routine for data retrieval. Each node features a self-powered galvanic soil-moisture probe, employing a high impedance analog front end to enable durability. Operating entirely on harvested solar energy for up to 21 days on a single capacitor charge, each node collects soil moisture, temperature, and environment condition data. Using a predictable finite-state machine, handshake-based data exchanges occur with a basestation affixed to standard farming vehicles designed to listen for the nodes while moving through the farm. Our platform organizes all sensor, link-quality, and location data into an easy-to-interpret dashboard to seamlessly integrate with the farmer’s everyday routine. Costing less than $35, the platform is a financially accessible, accurate, and easily scalable platform that enables persistent, regular data collection from the most rural plots without adding to or impeding farming operations. Experimental evaluation demonstrates reliable communication over 1 km at 2 dBm transmit power, stable sensor readings over 70 days of indoor operation, and continuous data recovery during multiple periods of intermittent connection.
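节点与车载基站之间"可预测的有限状态机 + 握手式数据交换"可用如下示意理解。这是假设性示例:状态与事件名(`LISTEN`、`beacon` 等)均为虚构,仅说明低功耗节点在基站经过时完成握手、否则回到休眠的控制逻辑。

```python
# 假设性示意:无电池节点的握手式状态机(休眠 -> 监听 -> 握手 -> 发送 -> 休眠)。

TRANSITIONS = {
    ("SLEEP",     "wake"):    "LISTEN",
    ("LISTEN",    "beacon"):  "HANDSHAKE",  # 听到车载基站的信标
    ("LISTEN",    "timeout"): "SLEEP",      # 省电:无基站则回到休眠
    ("HANDSHAKE", "ack"):     "TRANSMIT",
    ("HANDSHAKE", "nack"):    "SLEEP",
    ("TRANSMIT",  "done"):    "SLEEP",
}

def run(events, state="SLEEP"):
    """依次处理事件,返回完整状态轨迹;未定义事件保持原状态。"""
    trace = [state]
    for ev in events:
        state = TRANSITIONS.get((state, ev), state)
        trace.append(state)
    return trace

# 基站车辆经过:唤醒 -> 收到信标 -> 握手成功 -> 发送完毕 -> 休眠
trace = run(["wake", "beacon", "ack", "done"])
```

确定性的状态转移让基站可以预测节点行为,从而在车辆移动中被动完成间歇连接下的数据回收。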

[HC-9] Exploring the Feasibility and Acceptability of AI-Mediated Serious Illness Conversations in the Emergency Department

【速读】:该论文旨在解决急诊科(Emergency Department, ED)中严重疾病沟通(Serious Illness Conversations, SICs)罕见发生的问题,尤其是在时间紧迫和情感负担重的环境下,临床医生常在缺乏患者价值、目标与偏好明确信息的情况下做出高风险决策。解决方案的关键在于开发并评估ED GOAL-AI——一种基于语音的对话代理系统,用于在急诊环境中与老年患者进行简短、结构化的价值观探讨。该系统通过自然语言交互实现快速、可及的价值观采集,在55名患者中验证了其可行性与可接受性,且用户感知的“被倾听”和“被理解”程度与临床医生相当;然而研究也揭示了关键失败模式,如幻觉式诊断陈述等边界越界行为,提示未来需强化伦理约束与参与式设计以保障安全性和有效性。

链接: https://arxiv.org/abs/2604.26214
作者: Hasibur Rahman,Kenji Numata,Evelyn T Lai,Maria Cheriyan,Adrian Haimovich,Kei Ouchi,Smit Desai
机构: Northeastern University (东北大学); Sei Marianna Ika Daigaku (圣玛丽安娜医科大学); Brigham and Women’s Hospital (布莱根妇女医院); Harvard College (哈佛学院); Beth Israel Deaconess Medical Center (贝斯以色列女执事医疗中心); Harvard Medical School (哈佛医学院)
类目: Human-Computer Interaction (cs.HC)
备注: 7 pages, Interactive Health Conference (IH '26), July 5-8, 2026, Porto, Portugal

点击查看摘要

Abstract:Serious illness conversations (SICs) align care with patients’ values, goals, and preferences, yet they rarely occur in emergency departments (EDs), where time constraints and emotional burden often leave clinicians making high-stakes decisions without documented insight into what matters most to patients. We present a case study of ED GOAL-AI, a voice-based conversational agent for brief, structured values discussions with older adults in the ED, evaluated with 55 patients for feasibility and acceptability. Most participants completed the conversation and reported the interaction as acceptable and feasible, with ratings of feeling heard and understood comparable to clinicians. However, we also observed critical failure modes, including boundary violations such as hallucinated diagnostic statements, highlighting ethical and emotional risks. This work points to early promise for AI-mediated SICs while underscoring the need for careful boundary setting and participatory design before broader deployment.

[HC-10] Beyond Screenshots: Evaluating VLMs’ Understanding of UI Animations ACL2026

【速读】:该论文旨在解决当前视觉语言模型(Vision Language Models, VLMs)在用户界面(User Interface, UI)理解中对动态动画信息处理能力不足的问题。现有研究主要基于静态截图,忽略了现代UI中广泛使用的动画作为核心状态与反馈传递机制的功能性作用,导致VLMs在面对动态UI时的解释能力存在显著短板。解决方案的关键在于构建了一个名为AniMINT的新颖数据集,包含300个密集标注的UI动画视频,并系统评估了主流VLMs在感知动画效果、识别动画目的及解读动画语义三个层面的表现。此外,通过引入Motion、Context和Perceptual Cues(MCPC)框架进行归因分析,揭示了影响VLM性能的核心瓶颈,为未来提升AI代理在动态UI场景下的可靠交互能力提供了明确方向。

链接: https://arxiv.org/abs/2604.26148
作者: Chen Liang,Xirui Jiang,Naihao Deng,Eytan Adar,Anhong Guo
机构: University of Michigan (密歇根大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: Accepted at ACL 2026 Findings

点击查看摘要

Abstract:AI agents operating on user interfaces must understand how interfaces communicate state and feedback to act reliably. As a core communicative modality, animations are increasingly used in modern interfaces, serving critical functional purposes beyond mere aesthetics. Thus, understanding UI animation is essential for comprehensive interface interpretation. However, recent studies of Vision Language Models (VLMs) for UI understanding have focused primarily on static screenshots, leaving it unclear how well these models handle dynamic UI animations. To address this gap, we created AniMINT, a novel dataset of 300 densely annotated UI animation videos. We systematically evaluate state-of-the-art VLMs on UI animation understanding, including their abilities to perceive the animation effects, identify animation purposes, and interpret animation meaning. Our results show that VLMs can reliably detect primitive motion. However, their high-level animation interpretation remains inconsistent, with substantial gaps relative to human performance. Finally, we use Motion, Context, and Perceptual Cues (MCPC) to probe factors affecting VLM performance, revealing key bottlenecks and directions for future improvement.

[HC-11] Ceci n’est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems

【速读】:该论文旨在解决生成式 AI(Generative AI)在语言学习中的反馈机制存在“可解释性陷阱”(explainability pitfalls)的问题,即AI提供的看似合理但实质错误的解释可能误导学习者,导致认知偏差、学习效果下降及人机交互风险。其解决方案的关键在于构建一个系统性的评估框架——L2-Bench,涵盖诊断准确性、语用意识、错误成因识别、优先级排序、改进建议和自我调节支持等六个核心维度,用于识别和量化AI反馈在语言教育场景下的失效模式,从而推动开发者设计更安全、可信且有效的AI解释机制。

链接: https://arxiv.org/abs/2604.26145
作者: Ben Knight,Wm. Matthew Kennedy,James Edgell
机构: Oxford University Press(牛津大学出版社); Oxford Internet Institute(牛津互联网研究所); University of Oxford(牛津大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted to Misleading Impacts Resulting from AI Generated Explanations (MIRAGE) Workshop @ IUI 2026

点击查看摘要

Abstract:AI-powered language learning tools increasingly provide instant, personalised feedback to millions of learners worldwide. However, this feedback can fail in ways that are difficult for learners–and even teachers–to detect, potentially reinforcing misconceptions and eroding learning outcomes over extended use. We present a portion of L2-Bench, a benchmark for evaluating AI systems in language education that includes (but is not limited to) six critical dimensions of effective feedback: diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. We analyse how AI systems can fail with respect to these dimensions. These failures, which we argue are conducive to “explainability pitfalls,” are AI-generated explanations that appear helpful on the surface but are fundamentally flawed, increasing the risk of attainment, human-AI interaction, and socioaffective harms. We discuss how the specific context of language learning amplifies these risks and outline open questions we believe merit more attention when designing evaluation frameworks specifically. Our analysis aims to expand the community’s understanding of both the typology of explainability pitfalls and the contextual dynamics in which they may occur in order to encourage AI developers to better design safe, trustworthy, and effective AI explanations.

[HC-12] Human-Augmented Reality Interaction in Rebar Inspection

【速读】:该论文旨在解决钢筋(Rebar)检测过程中因长期保持不良姿势和复杂二维图纸与三维结构映射带来的 ergonomic risk(人体工程学风险)及认知负荷过高的问题。解决方案的关键在于引入基于 Microsoft HoloLens 2 的增强现实(Augmented Reality, AR)辅助检测系统,通过实时空间叠加设计信息与现场实体,减少操作者躯干和颈部的屈曲角度、缩短任务完成时间及移动路径,并显著降低主观工作负荷(NASA-TLX评分下降45.6%),同时维持检测准确性,从而实现高效且人因友好的钢筋间距检查。

链接: https://arxiv.org/abs/2604.26112
作者: Mahsa Sanei,Fernando Moreu
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Rebar inspection in reinforced concrete construction requires sustained awkward postures and complex mental mapping of two-dimensional drawings onto three-dimensional assemblies. This study evaluated an Augmented Reality (AR)-assisted rebar inspection system deployed on Microsoft HoloLens 2 through a within-subjects experiment with 30 participants. Full-body kinematics were recorded using a motion capture system at 100 Hz while participants performed traditional and AR-assisted spacing inspection. AR reduced mean trunk flexion by 30.8%, mean neck flexion by 32.8%, and task completion time by 67.7%. Walking distance and hand-path length each decreased by over 50%. NASA Task Load Index scores decreased by 45.6% overall, with the largest reduction in physical demand. Inspection accuracy was maintained across conditions. The System Usability Scale yielded a mean score of 76.1 with 83% of participants rating the system acceptable. These results provide convergent objective and subjective evidence that AR-assisted inspection reduces ergonomic risk and perceived workload maintaining inspection quality.

[HC-13] Designing Rewards for Rewarding Designs: Demonstrating the Impact of Rewards on the Creative Design Process

【速读】:该论文旨在解决奖励机制在创造性设计决策过程中如何影响设计行为与用户体验的问题,特别是明确奖励反馈对设计探索深度、目标一致性及设计多样性的作用。其解决方案的关键在于将3D参数化、目标驱动的椅子设计任务建模为马尔可夫决策过程(Markov Decision Process, MDP),通过在每一步决策中呈现目标一致或无关的奖励信号,系统性地追踪参与者的设计决策轨迹,并量化奖励对任务行为和主观体验的影响。实验表明,奖励显著提升了设计空间的探索广度,促使参与者优先最大化目标对齐奖励的同时保持设计多样性,且设计目标的性质进一步调节了奖励感知效用,从而为设计类任务中的有效反馈机制设计提供了实证依据与可操作指南。

Link: https://arxiv.org/abs/2604.26083
Authors: Surabhi S Nath, Vindula Jayawardana, Monica Van, Matt Klenk, Shabnam Hakimi
Institutions: Unknown
Categories: Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:The creative design process involves transforming abstract goals into concrete outcomes through a series of decisions made under constraints. While such processes are commonly shaped by feedback like rewards, their impact on design decision making remains unclear. To better understand the role of rewards in the design process, we modeled a 3D parametric, goal-based chair design task as a Markov Decision Process. We tracked participants’ decisions as they iteratively developed designs for an abstract design goal, and presented either a goal-aligned or goal-agnostic reward at every step. We tested the effect of these rewards on task behaviour and self-reported experience. With rewards, participants more thoroughly explored the design space, and maximised goal-aligned over goal-agnostic rewards while preserving diversity across designs. The nature of the goal also mattered, influencing participants’ perception of the reward’s usefulness. Building on these insights, we propose guidelines for designing effective feedback for design decision making.

Computer Vision

[CV-0] Three-Step Nav: A Hierarchical Global-Local Planner for Zero-Shot Vision-and-Language Navigation AISTATS2026

【Quick Read】: This paper targets the tendency of zero-shot Vision-and-Language Navigation (VLN) agents built on multimodal large language models (MLLMs) to drift off course, terminate prematurely, and achieve low overall success rates in unknown environments. The key to the solution is a three-step navigation protocol (Three-Step Nav) that processes visual information in three stages for more robust path planning: first, "look forward" to extract global landmarks and sketch a coarse route plan; next, "look now" to align the current visual observation with the next sub-goal for fine-grained guidance; finally, "look backward" to audit the full trajectory and correct accumulated drift before deciding to stop. The method requires no gradient updates or task-specific fine-tuning, integrates seamlessly into existing VLN pipelines, and substantially improves zero-shot navigation performance.

Link: https://arxiv.org/abs/2604.26946
Authors: Wanrong Zheng, Yunhao Ge, Laurent Itti
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to AISTATS 2026. Code: this https URL

Click to view abstract

Abstract:Breakthrough progress in vision-based navigation through unknown environments has been achieved by using multimodal large language models (MLLMs). These models can plan a sequence of motions by evaluating the current view at each time step against the task and goal given to the agent. However, current zero-shot Vision-and-Language Navigation (VLN) agents powered by MLLMs still tend to drift off course, halt prematurely, and achieve low overall success rates. We propose Three-Step Nav to counteract these failures with a three-view protocol: First, “look forward” to extract global landmarks and sketch a coarse plan. Then, “look now” to align the current visual observation with the next sub-goal for fine-grained guidance. Finally, “look backward” audits the entire trajectory to correct accumulated drift before stopping. Requiring no gradient updates or task-specific fine-tuning, our planner drops into existing VLN pipelines with minimal overhead. Three-Step Nav achieves state-of-the-art zero-shot performance on the R2R-CE and RxR-CE dataset. Our code is available at this https URL.
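The "look forward / look now / look backward" protocol above lends itself to a compact control loop. The following is a hypothetical sketch only: `mllm` is a stub standing in for any multimodal LLM API, and all prompt strings, function names, and return shapes are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the three-view protocol. mllm() is a stub standing in
# for a real multimodal LLM call; prompts and return values are illustrative.

def mllm(prompt, image=None):
    # Canned responses so the control flow below is runnable end to end.
    if "global landmarks" in prompt:
        return ["hallway", "kitchen doorway", "blue sofa"]
    if "next sub-goal" in prompt:
        return "move_forward"
    if "audit" in prompt:
        return {"drift_detected": False, "stop": True}
    return None

def three_step_nav(instruction, get_observation, max_steps=3):
    # Step 1: "look forward" -- extract global landmarks, sketch a coarse plan.
    plan = mllm(f"Extract global landmarks for: {instruction}")
    trajectory = []
    for _ in range(max_steps):
        obs = get_observation()
        # Step 2: "look now" -- align the current view with the next sub-goal.
        action = mllm(f"Given landmarks {plan}, pick the next sub-goal action",
                      image=obs)
        trajectory.append(action)
        # Step 3: "look backward" -- audit the trajectory before stopping.
        audit = mllm(f"audit trajectory {trajectory} against plan {plan}")
        if audit["stop"] and not audit["drift_detected"]:
            break
    return plan, trajectory
```

Because the planner is pure prompting, it drops into an existing VLN loop without touching model weights, which matches the abstract's "no gradient updates" claim.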

[CV-1] ProcFunc: Function-Oriented Abstractions for Procedural 3D Generation in Python

【Quick Read】: This paper addresses the complexity and inefficiency of writing code for 3D content generation and the difficulty of producing diverse training data at scale. The key to the solution is ProcFunc, a Blender-based Python library that provides easy-to-use functions for creating, combining, analyzing, and executing procedural 3D generation code. By compositionally assembling semantic components, it markedly improves generation efficiency and diversity while reducing coding errors, enabling vision-language models (VLMs) to edit and generate procedural material and geometry code more effectively.

Link: https://arxiv.org/abs/2604.26943
Authors: Alexander Raistrick, Karhan Kayan, Jack Nugent, David Yan, Lingjie Mei, Meenal Parakh, Hongyu Wen, Dylan Li, Yiming Zuo, Erich Liang, Jia Deng
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We introduce ProcFunc, a library for Blender-based procedural 3D generation in Python. ProcFunc provides a library of easy-to-use Python functions, which streamline creating, combining, analyzing, and executing procedural generation code. ProcFunc makes it easy to create large-scale diverse training data, by combinatorial compositions of semantic components. VLMs can use ProcFunc to edit procedural material and geometry code and can create new procedural code with significantly fewer coding errors. Finally, as an example use case, we use ProcFunc to develop a new procedural generator of indoor rooms, which includes a collection of new compositional procedural materials. We demonstrate the detail, runtime efficiency, and diversity of this room generator, as well as its use for 3D synthetic data generation. Please visit this https URL for source code.
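The idea of producing large-scale diverse data through "combinatorial compositions of semantic components" can be illustrated in plain Python. None of the names below come from the actual ProcFunc library; they are assumptions for illustration only.

```python
# Toy illustration of combinatorial composition: a handful of semantic
# components multiply into many distinct, seedable scene generators.
# (Illustrative names only -- not the ProcFunc API.)

import itertools
import random

def make_generator(shape, material):
    # Each "semantic component" is a plain Python value; composition is
    # just pairing components into a seeded generator function.
    def generate(seed):
        rng = random.Random(seed)
        return {"shape": shape,
                "material": material,
                "scale": round(rng.uniform(0.5, 2.0), 2)}
    return generate

shapes = ["chair", "table", "lamp"]
materials = ["wood", "metal", "fabric"]

# Combinatorial composition: 3 x 3 = 9 distinct generators from 6 components.
generators = [make_generator(s, m)
              for s, m in itertools.product(shapes, materials)]
```

Seeding each generator keeps samples reproducible, which matters when the output is used as synthetic training data.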

[CV-2] World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

【Quick Read】: This paper addresses the weakness of vision-language models (VLMs) on dynamic spatial reasoning tasks, in particular their difficulty in predicting and understanding how spatial states evolve as the scene changes under egocentric motion. Existing approaches either rely on synthetic data for large-scale spatial supervision without explicitly modeling motion-conditioned state transitions, or couple a world model with the VLM at inference time at substantial computational cost. The key to the solution is the World2VLM training framework, which distills the spatial imagination of a generative world model into a VLM: a view-consistent world model synthesizes geometrically aligned future views from an initial observation and a parameterized camera trajectory, from which structured supervision is derived for forward (action-to-outcome) and inverse (outcome-to-action) spatial reasoning tasks. A two-stage fine-tuning recipe on a compact synthetic dataset then lets the VLM internalize spatial imagination, yielding consistent gains across multiple spatial reasoning benchmarks without inference-time generation, in a manner that is both efficient and scalable.

Link: https://arxiv.org/abs/2604.26934
Authors: Wanyue Zhang, Wenxiang Wu, Wang Xu, Jiaxin Luo, Helu Zhi, Yibin Huang, Shuo Ren, Zitao Liu, Jiajun Zhang
Institutions: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Tsinghua University; Harbin Institute of Technology; Wuhan AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The code is available at this https URL . The dataset is available at this https URL

Click to view abstract

Abstract:Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial supervision with synthetic data or by coupling VLMs with world models at inference time. However, the former often lacks explicit modeling of motion-conditioned state transitions, while the latter incurs substantial computational overhead. In this work, we propose World2VLM, a training framework that distills spatial imagination from a generative world model into a vision-language model. Given an initial observation and a parameterized camera trajectory, we use a view-consistent world model to synthesize geometrically aligned future views and derive structured supervision for both forward (action-to-outcome) and inverse (outcome-to-action) spatial reasoning. We post-train the VLM with a two-stage recipe on a compact dataset generated by this pipeline and evaluate it on multiple spatial reasoning benchmarks. World2VLM delivers consistent improvements over the base model across diverse benchmarks, including SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube. It also outperforms the test-time world-model-coupled methods while eliminating the need for expensive inference-time generation. Our results suggest that world models can serve not only as inference-time tools, but also as effective training-time teachers, enabling VLMs to internalize spatial imagination in a scalable and efficient manner.

[CV-3] Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction CVPR2026

【Quick Read】: This paper tackles capturing and reconstructing volumetric 3D representations of high-speed dynamic scenes using only unaugmented low-speed cameras (typically limited to 30-60 FPS). Conventional methods are bound by camera bandwidth and struggle with rapidly changing scenes, and most existing high-frame-rate techniques rely on optical modifications or mechanically moving parts, supporting only single-view capture and thus precluding multi-view 3D reconstruction at high spatio-temporal resolution. The key to the solution is to illuminate the scene with a rapid, sequential color-coded lighting sequence so that high-speed temporal information is encoded in the spatial intensity and color variations of the images; a novel Dynamic Gaussian Splatting-based method then decodes the high-frame-rate temporal information from the low-speed multi-view captures and builds a high-fidelity volumetric representation of the dynamic scene.

Link: https://arxiv.org/abs/2604.26920
Authors: David Novikov, Eilon Vaknin, Narek Tumanyan, Mark Sheinin
Institutions: Weizmann Institute of Science
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted to IEEE CVPR 2026 as a highlight

Click to view abstract

Abstract:The task of capturing and rendering 3D dynamic scenes from 2D images has become increasingly popular in recent years. However, most conventional cameras are bandwidth-limited to 30-60 FPS, restricting these methods to static or slowly evolving scenes. While overcoming bandwidth limitations is difficult for general scenes, recent years have seen a flurry of computational imaging methods that yield high-speed videos using conventional cameras for specific applications (e.g., motion capture and particle image velocimetry). However, most of these methods require modifications to a camera’s optics or the addition of mechanically moving components, limiting them to a single-view high-speed capture. Consequently, these methods cannot be readily used to capture a 3D representation of rapid scene motion. In this paper, we propose a novel method to capture and reconstruct a volumetric representation of a high-speed scene using only unaugmented low-speed cameras. Instead of modifying the hardware or optics of each individual camera, we encode high-speed scene dynamics by illuminating the scene with a rapid, sequential color-coded sequence. This results in simultaneous multi-view capture of the scene, where high-speed temporal information is encoded in the spatial intensity and color variations of the captured images. To construct a high-speed volumetric representation of the dynamic scene, we develop a novel dynamic Gaussian Splatting-based approach that decodes the temporal information from the images. We evaluate our approach on simulated scenes and real-world experiments using a multi-camera imaging setup, showing first-of-a-kind high-speed volumetric scene reconstructions.
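A deliberately simplified illustration of the color-coding principle (an assumption about the underlying idea, not the paper's actual pipeline): if the scene is strobed red, then green, then blue within a single camera exposure, each color channel of the captured frame samples a different instant, so splitting channels multiplies the effective frame rate.

```python
# Simplified sketch: under sequential R -> G -> B illumination within one
# exposure, the three color channels act as three time-ordered subframes.
# (Illustrative only; the paper decodes via Dynamic Gaussian Splatting.)

def decode_subframes(rgb_frame):
    # rgb_frame: list of pixels, each an (R, G, B) tuple.
    # Returns 3 grayscale subframes ordered by illumination time.
    return [[px[c] for px in rgb_frame] for c in range(3)]

def effective_fps(camera_fps, colors_per_exposure=3):
    # Each exposure yields one subframe per illumination color.
    return camera_fps * colors_per_exposure
```

Under this idealization, a 30 FPS camera would yield 90 effective temporal samples per second; the paper's contribution is recovering a full volumetric representation from such encoded multi-view captures.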

[CV-4] AnimateAnyMesh: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation

【Quick Read】: This paper addresses two core challenges in generating high-quality dynamic 3D models (4D content): the complexity of modeling spatio-temporal distributions and the performance bottleneck caused by scarce 4D training data. The key lies in three innovations: first, mining dynamic content from Objaverse-XL to expand the DyMesh-XL dataset to 300K unique identities, substantially broadening category and motion diversity; second, redesigning the DyMeshVAE-Flex architecture with power-law topology-aware attention and vertex-normal enhanced features, improving trajectory reconstruction accuracy and local geometry preservation while mitigating trajectory-sticking artifacts; third, upgrading both DyMeshVAE-Flex and the rectified-flow (RF) generator to support variable-length sequence training and generation, producing longer animations while preserving reconstruction fidelity. Together, these improvements enable fast, high-quality, semantically accurate, and temporally coherent text-driven animation of arbitrary 3D meshes.

Link: https://arxiv.org/abs/2604.26917
Authors: Zijie Wu, Chaohui Yu, Fan Wang, Xiang Bai
Institutions: Huazhong University of Science and Technology; DAMO Academy, Alibaba Group; Hupan Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, TPAMI submission, code url: this https URL

Click to view abstract

Abstract:Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. We present AnimateAnyMesh++, a feed-forward framework for text-driven animation of arbitrary 3D meshes with substantial upgrades in data, architecture, and generative capability. First, we expand the DyMesh-XL dataset by mining dynamic content from Objaverse-XL, increasing the number of unique identities from 60K to 300K and substantially broadening category and motion diversity. Second, we redesign DyMeshVAE-Flex with power-law topology-aware attention and vertex-normal enhanced features, which significantly improves trajectory reconstruction, local geometry preservation, and mitigates trajectory-sticking artifacts. Third, we introduce architectural changes to both DyMeshVAE-Flex and the rectified-flow (RF) generator to support variable-length sequence training and generation, enabling longer animations while preserving reconstruction fidelity. Extensive experiments demonstrate that AnimateAnyMesh++ generates semantically accurate and temporally coherent mesh animations within seconds, surpassing prior approaches in quality and efficiency. The enlarged DyMesh-XL, the upgraded DyMeshVAE-Flex, and variable-length RF together deliver consistent gains across benchmarks and in-the-wild meshes. We will release code, models, and the expanded DyMesh-XL upon acceptance of this manuscript to facilitate research in 4D content creation.

[CV-5] Graph-based Semantic Calibration Network for Unaligned UAV RGBT Image Semantic Segmentation and A Large-scale Benchmark

【Quick Read】: This paper tackles two coupled challenges in UAV RGBT image semantic segmentation: cross-modal spatial misalignment caused by sensor parallax and platform vibration, and severe semantic confusion among fine-grained ground-object categories under top-down aerial views. The key to the solution is a Graph-based Semantic Calibration Network (GSCNet) whose core components are a Feature Decoupling and Alignment Module (FDAM) and a Semantic Graph Calibration Module (SGCM). FDAM decouples each modality into shared structural and private perceptual components and performs deformable alignment in the shared subspace, achieving robust spatial correction with reduced modality appearance interference; SGCM explicitly models the hierarchical taxonomy and co-occurrence regularities of ground-object categories in UAV scenes as a structured category graph and applies graph-attention reasoning to calibrate predictions, improving segmentation of visually similar and rare categories.

Link: https://arxiv.org/abs/2604.26893
Authors: Fangqiang Fan, Zhicheng Zhao, Xiaoliang Ma, Chenglong Li, Jin Tang
Institutions: Anhui University; GEOVIS Earth Technology Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 13 figures

Click to view abstract

Abstract:Fine-grained RGBT image semantic segmentation is crucial for all-weather unmanned aerial vehicle (UAV) scene understanding. However, UAV RGBT semantic segmentation faces two coupled challenges: cross-modal spatial misalignment caused by sensor parallax and platform vibration, and severe semantic confusion among fine-grained ground objects under top-down aerial views. To address these issues, we propose a Graph-based Semantic Calibration Network (GSCNet) for unaligned UAV RGBT image semantic segmentation. Specifically, we design a Feature Decoupling and Alignment Module (FDAM) that decouples each modality into shared structural and private perceptual components and performs deformable alignment in the shared subspace, enabling robust spatial correction with reduced modality appearance interference. Moreover, we propose a Semantic Graph Calibration Module (SGCM) that explicitly encodes the hierarchical taxonomy and co-occurrence regularities among ground-object categories in UAV scenes into a structured category graph, and incorporates these priors into graph-attention reasoning to calibrate predictions of visually similar and rare categories. In addition, we construct the Unaligned RGB-Thermal Fine-grained (URTF) benchmark, to the best of our knowledge, the largest and most fine-grained benchmark for unaligned UAV RGBT image semantic segmentation, containing over 25,000 image pairs across 61 categories with realistic cross-modal misalignment. Extensive experiments on URTF demonstrate that GSCNet significantly outperforms state-of-the-art methods, with notable gains on fine-grained categories. The dataset is available at this https URL.

[CV-6] SEAL: Semantic-aware Single-image Sticker Personalization with a Large-scale Sticker-tag Dataset

【Quick Read】: This paper addresses the overfitting of test-time fine-tuning (TTF) in diffusion-based single-image personalized text-to-image generation, especially for sticker personalization, where a single reference image leads to visual entanglement (background artifacts absorbed into the learned concept) and structural rigidity (the model memorizes the reference's spatial layout and loses contextual controllability). The key to the solution is the SEAL (Semantic-aware single-image sticker personalization) module, whose core mechanisms are: (1) a Semantic-guided Spatial Attention Loss for stronger spatial awareness of the target region; (2) a Split-merge Token Strategy for fine-grained control over embeddings; and (3) Structure-aware Layer Restriction, which preserves critical structural information in the diffusion model during adaptation. Working together, and without modifying the underlying U-Net diffusion backbone, the three components substantially improve identity fidelity while maintaining contextual controllability, effectively mitigating overfitting and structural rigidity.

Link: https://arxiv.org/abs/2604.26883
Authors: Changhyun Roh, Yonghyun Jeong, Jonghyun Lee, Chanho Eom, Jihyong Oh
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The last two authors are co-corresponding authors. Please visit our project page at this https URL

Click to view abstract

Abstract:Synthesizing a target concept from a single reference image is challenging in diffusion-based personalized text-to-image generation, particularly for sticker personalization where prompts often require explicit attribute edits. With only one reference, test-time fine-tuning (TTF) methods tend to overfit, producing \textitvisual entanglement, where background artifacts are absorbed into the learned concept, and \textitstructural rigidity, where the model memorizes reference-specific spatial configurations and loses contextual controllability. To address these issues, we introduce \textbfSEmantic-aware single-image sticker person\textbfALization (\textbfSEAL), a plug-and-play, architecture-agnostic adaptation module that integrates into existing personalization pipelines without modifying their U-Net-based diffusion backbones. SEAL applies three components during embedding adaptation: (1) a Semantic-guided Spatial Attention Loss, (2) a Split-merge Token Strategy, and (3) Structure-aware Layer Restriction. To support sticker-domain personalization with attribute-level control, we present StickerBench, a large-scale sticker image dataset with structured tags under a six-attribute schema (Appearance, Emotion, Action, Camera Composition, Style, Background). These annotations provide a consistent interface for varying context while keeping target identity fixed, enabling systematic evaluation of identity disentanglement and contextual controllability. Experiments show that SEAL consistently improves identity preservation while maintaining contextual controllability, highlighting the importance of explicit spatial and structural constraints during test-time adaptation. The code, StickerBench, and project page will be publicly released.

[CV-7] Uncertainty-Aware Pedestrian Attribute Recognition via Evidential Deep Learning

【Quick Read】: This paper addresses the difficulty of assessing prediction reliability in Pedestrian Attribute Recognition (PAR): on low-quality samples, conventional deterministic methods cannot identify unreliable predictions, undermining robustness in complex real-world scenarios. The key to the solution is UAPAR, an uncertainty-aware framework based on Evidential Deep Learning (EDL), with two core innovations: 1) EDL is incorporated into a CLIP-based architecture via a Region-Aware Evidence Reasoning module that combines cross-attention with spatial prior masks to capture fine-grained local features, while an evidence head estimates attribute-wise epistemic uncertainty; 2) an uncertainty-guided dual-stage curriculum learning strategy alleviates the adverse effects of severe label noise during training. Experiments on PA100K, PETA, RAPv1, and RAPv2 show competitive or superior performance, and the uncertainty estimates effectively flag challenging or erroneous samples.

Link: https://arxiv.org/abs/2604.26873
Authors: Zhuofan Lou, Shihang Zhang, Fangle Zhu, Shengjie Ye, Pingyu Wang
Institutions: Sichuan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures, 5 tables

Click to view abstract

Abstract:We propose UAPAR, an Uncertainty-Aware Pedestrian Attribute Recognition framework. To the best of our knowledge, this is the first EDL-based uncertainty-aware framework for pedestrian attribute recognition (PAR). Unlike conventional deterministic methods, which fail to assess prediction reliability on low-quality samples, UAPAR effectively identifies unreliable predictions and thus enhances system robustness in complex real-world scenarios. To achieve this, UAPAR incorporates Evidential Deep Learning (EDL) into a CLIP-based architecture. Specifically, a Region-Aware Evidence Reasoning module employs cross-attention and spatial prior masks to capture fine-grained local features, which are further processed by an evidence head to estimate attribute-wise epistemic uncertainty. To further enhance training robustness, we develop an uncertainty-guided dual-stage curriculum learning strategy to alleviate the adverse effects of severe label noise during training. Extensive experiments on the PA100K, PETA, RAPv1, and RAPv2 datasets demonstrate that UAPAR achieves competitive or superior performance. Furthermore, qualitative results confirm that the proposed framework generates uncertainty estimates that are predictive of challenging or erroneous samples.
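The "evidence head" the abstract describes follows the standard Evidential Deep Learning formulation. A minimal sketch of the usual Dirichlet-based uncertainty computation (the paper's actual head architecture and loss are not reproduced here): for K classes, the head outputs non-negative evidence e_k; Dirichlet parameters are alpha_k = e_k + 1, and epistemic uncertainty is u = K / sum(alpha).

```python
# Standard EDL uncertainty from per-class evidence (Sensoy-style Dirichlet).

def edl_uncertainty(evidence):
    k = len(evidence)
    alpha = [e + 1.0 for e in evidence]        # Dirichlet parameters
    strength = sum(alpha)                       # Dirichlet strength S
    belief = [e / strength for e in evidence]   # per-class belief mass
    uncertainty = k / strength                  # epistemic uncertainty
    return belief, uncertainty
```

By construction the belief masses and the uncertainty sum to 1, so zero evidence yields u = 1 (total uncertainty) and strong evidence for one class drives u toward 0 — exactly the signal UAPAR uses to flag unreliable attribute predictions.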

[CV-8] KAYRA: A Microservice Architecture for AI-Assisted Karyotyping with Cloud and On-Premise Deployment

【Quick Read】: This paper addresses the inefficiency, heavy dependence on manual interpretation, and subjectivity of traditional karyotyping workflows in clinical cytogenetic laboratories. The core challenge is achieving automated, deployable karyotype analysis with high accuracy while accommodating institutions' differing requirements for on-premise data processing. The key to the solution is an end-to-end microservice architecture, KAYRA, integrating three deep learning models: an EfficientNet-B5 + U-Net semantic segmentation module for precise chromosome-region delineation, a Mask R-CNN (ResNet-50 + FPN) instance detector for separating individual chromosomes, and a ResNet-18 classifier for chromosome-type assignment, orchestrated through a cascaded ROI-narrowing strategy that focuses each model's input region to boost overall performance. KAYRA supports both cloud and on-premise deployment to satisfy data-compliance requirements of different clinical environments. A pilot clinical evaluation shows it significantly outperforms existing commercial reference systems in segmentation accuracy (98.91%) and classification accuracy (89.1%) (p < 0.0001), reaches Technology Readiness Level 6 (TRL 6), and embeds an expert-review workflow to ensure diagnostic reliability.

Link: https://arxiv.org/abs/2604.26869
Authors: Attila Pintér, Javier Rico, Attila Répai, Jalal Al-Afandi, Adrienn Éva Borsy, András Kozma, Hajnalka Andrikovics, György Cserey
Institutions: University of Szeged; Institute for Computer Science and Control, Hungarian Academy of Sciences; Semmelweis University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We present KAYRA, an end-to-end karyotyping system that operates inside the operational constraints of a clinical cytogenetic laboratory. KAYRA is architected as a containerized microservice pipeline whose ML stack combines an EfficientNet-B5 + U-Net semantic segmenter, a Mask R-CNN (ResNet-50 + FPN) instance detector, and a ResNet-18 classifier, orchestrated through a cascaded ROI-narrowing strategy that focuses each downstream model on the chromosome-bearing region. The same container images are deployed both as a cloud service and as an on-premise installation, supporting clinical environments where patient-data egress is not permitted as well as those where it is. A pilot clinical evaluation against two commercial reference karyotyping systems on 459 chromosomes from 10 metaphase spreads shows segmentation accuracy of 98.91 % (vs. 78.21 % / 40.52 %), classification accuracy of 89.1 % (vs. 86.9 % / 54.5 %), and rotation accuracy of 89.76 % (vs. 94.55 % / 78.43 %). KAYRA improves over the older density-thresholding reference on all three axes (p < 0.0001 for segmentation and classification by Fisher’s exact test on chromosome-level counts), and on segmentation also against the modern AI-supported reference (p < 0.0001); on classification the difference vs. the modern AI reference is not statistically significant at the present test-set size (p = 0.34). The system reaches TRL 6 maturity and integrates the human-in-the-loop expert-review workflow that diagnostic cytogenetic practice requires. The thesis of this paper is that a multi-model cytogenetic AI service can be packaged as a microservice architecture supporting flexible deployment - cloud-hosted or on-premise - while delivering strong empirical performance on a pilot clinical evaluation.
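The abstract reports Fisher's exact test on 2x2 chromosome-level count tables. A minimal two-sided implementation of that test is sketched below; the counts in the usage example are hypothetical, not the paper's data, and the two-sided convention used (summing probabilities of tables at least as extreme as the observed one) matches the common scipy-style definition.

```python
# Two-sided Fisher's exact test for a 2x2 table [[a, b], [c, d]]:
# sum the hypergeometric probabilities of all tables with the same margins
# whose probability does not exceed that of the observed table.
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    row1 = a + b
    col1 = a + c
    n = a + b + c + d

    def p_table(x):  # hypergeometric probability of cell (0, 0) == x
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = p_table(a)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs + 1e-12)
```

For the paper's chromosome-level comparisons, a/c would be correct/incorrect counts for KAYRA and b/d the same for the reference system.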

[CV-9] Breaking the Rigid Prior: Towards Articulated 3D Anomaly Detection

【Quick Read】: This paper addresses a fundamental limitation of existing 3D anomaly detection methods on articulated objects with hinge or sliding joints: such methods rely on the rigid prior that normal geometry is pose-invariant, whereas valid pose changes in articulated objects induce structured geometric deformations, causing genuine structural defects to be obscured while pose-induced deformations are misidentified as anomalies. To this end, the authors introduce ArtiAD, the first large-scale benchmark for this setting, comprising 15,229 point clouds across 39 object categories with dense joint-angle variations and six structural anomaly types, annotated with part-level motion labels to explicitly disentangle pose-induced geometric variation from structural defects. The key to the solution is the Shape-Pose-Aware Signed Distance Field (SPA-SDF), whose core innovation is replacing the rigid prior with a continuous pose-conditioned implicit field that factorizes a structural prior from a Fourier-encoded joint embedding; at inference, the articulation state is recovered by minimizing reconstruction energy, and anomalies are identified as point-wise deviations from the learned manifold.

Link: https://arxiv.org/abs/2604.26868
Authors: Jinye Gan, Bozhong Zheng, Xiaohao Xu, Junye Ren, Zixuan Zhang, Na Ni, Yingna Wu
Institutions: ShanghaiTech University; University of Michigan, Ann Arbor
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Existing 3D anomaly detection methods are built on a rigid prior: normal geometry is pose-invariant and can be canonicalized through registration or alignment. This prior does not hold for articulated objects with hinge or sliding joints, where valid pose changes induce structured geometric variations that cannot be collapsed to a single canonical template, causing pose-induced deformations to be misidentified as anomalies while true structural defects are obscured. No existing benchmark addresses this challenge. We introduce ArtiAD, the first large-scale benchmark for articulated 3D anomaly detection, comprising 15,229 point clouds across 39 object categories with dense joint-angle variations and six structural anomaly types. Each sample is annotated with its joint configuration and part-level motion labels, enabling explicit disentanglement of pose-induced geometry from structural defects. ArtiAD also provides a seen/unseen articulation split to evaluate both interpolation and extrapolation to novel joint configurations. We propose Shape-Pose-Aware Signed Distance Field (SPA-SDF), a baseline that replaces the rigid prior with a continuous pose-conditioned implicit field, factorized into an articulation-independent structural prior and a Fourier-encoded joint embedding. At inference, the articulation state is recovered by minimizing reconstruction energy, and anomalies are identified as point-wise deviations from the learned manifold. SPA-SDF achieves 0.884 object-level AUROC on seen configurations and 0.874 on unseen configurations, substantially outperforming all rigid-based baselines. Our code and benchmark will be publicly released to facilitate future research.
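The "Fourier-encoded joint embedding" is a standard positional-encoding construction for conditioning an implicit field on a scalar joint angle. A minimal sketch, with the frequency count and scaling as assumptions rather than the paper's exact recipe:

```python
# Standard Fourier feature embedding of a joint angle: each frequency
# contributes a (sin, cos) pair, giving 2 * num_freqs features that a
# pose-conditioned implicit field can consume alongside point coordinates.
import math

def fourier_embed(theta, num_freqs=4):
    # theta: joint angle in radians -> list of 2 * num_freqs features.
    feats = []
    for i in range(num_freqs):
        freq = 2.0 ** i
        feats.append(math.sin(freq * theta))
        feats.append(math.cos(freq * theta))
    return feats
```

This encoding lets a smooth MLP resolve fine angular variation, which is why it is a natural fit for conditioning an SDF on continuous articulation states.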

[CV-10] Edge AI for Automotive Vulnerable Road User Safety: Deployable Detection via Knowledge Distillation

【Quick Read】: This paper addresses the tension between model capacity and computational constraints when deploying accurate Vulnerable Road User (VRU) detectors, such as pedestrian detectors, on edge hardware: large models are accurate but degrade severely under INT8 quantization, while small models sacrifice detection performance due to limited capacity. The key to the solution is a knowledge distillation (KD) framework that trains a compact YOLOv8-S student (11.2M parameters) to mimic a YOLOv8-L teacher (43.7M parameters), achieving 3.9x compression while preserving quantization robustness. Experiments show the KD student retains accuracy under INT8 quantization (only a 5.6% mAP drop) and exhibits better precision calibration than a directly trained INT8 model, ultimately exceeding the teacher's FP32 precision in a much smaller model (0.748 vs. 0.718), establishing knowledge distillation as a core technique for edge deployment of safety-critical VRU detection.

Link: https://arxiv.org/abs/2604.26857
Authors: Akshay Karjol, Darrin M. Hanna
Institutions: Oakland University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: 6 pages, 3 figures

Click to view abstract

Abstract:Deploying accurate object detection for Vulnerable Road User (VRU) safety on edge hardware requires balancing model capacity against computational constraints. Large models achieve high accuracy but fail under INT8 quantization required for edge deployment, while small models sacrifice detection performance. This paper presents a knowledge distillation (KD) framework that trains a compact YOLOv8-S student (11.2M parameters) to mimic a YOLOv8-L teacher (43.7M parameters), achieving 3.9x compression while preserving quantization robustness. We evaluate on full-scale BDD100K (70K training images) with Post-Training Quantization to INT8. The teacher suffers catastrophic degradation under INT8 (-23% mAP), while the KD student retains accuracy (-5.6% mAP). Analysis reveals that KD transfers precision calibration rather than raw detection capacity: the KD student achieves 0.748 precision versus 0.653 for direct training at INT8, a 14.5% gain at equivalent recall, reducing false alarms by 44% versus the collapsed teacher. At INT8, the KD student exceeds the teacher’s FP32 precision (0.748 vs. 0.718) in a model 3.9x smaller. These findings establish knowledge distillation as a requirement for deploying accurate, safety-critical VRU detection on edge hardware.
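As a reference point, the classic Hinton-style knowledge-distillation objective that frameworks like this build on: a weighted sum of hard-label cross-entropy and temperature-softened KL divergence against the teacher. This is a classification-head sketch under assumed hyperparameters (T=2, alpha=0.5), not the detection-specific loss actually used for YOLOv8.

```python
# Classic KD loss: L = alpha * CE(student, label)
#                    + (1 - alpha) * T^2 * KL(teacher_T || student_T)
import math

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, target_idx, T=2.0, alpha=0.5):
    p_s = softmax(student_logits)
    hard = -math.log(p_s[target_idx])               # cross-entropy on labels
    q_t = softmax(teacher_logits, T)                # softened teacher
    q_s = softmax(student_logits, T)                # softened student
    soft = sum(t * math.log(t / s) for t, s in zip(q_t, q_s))  # KL divergence
    return alpha * hard + (1 - alpha) * (T * T) * soft
```

The T^2 factor keeps the soft-target gradient magnitude comparable across temperatures; the soft term vanishes when student and teacher logits agree.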

[CV-11] Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization CVPR2026

【Quick Read】: This paper addresses detector performance degradation caused by the distributional gap between source and target domains; with limited single-source data, models tend to rely on confounders in the source domain (e.g., illumination, co-occurrence, and style), forming spurious correlations that hurt generalization. The key to the solution is Bridge, a novel basis-driven framework that incorporates causal inference: by learning low-rank basis representations for front-door adjustment, it blocks the effects of confounders to mitigate spurious correlations, while also refining feature representations by filtering out redundant and task-irrelevant components. Bridge integrates seamlessly with both discriminative (e.g., DINOv2/3, SAM) and generative (e.g., Stable Diffusion) Vision Foundation Models (VFMs), and its superiority is validated on multiple domain-generalization object detection datasets.

Link: https://arxiv.org/abs/2604.26820
Authors: Mingbo Hong, Feng Liu, Caroline Gevaert, George Vosselman, Hao Cheng
Institutions: University of Twente; Drexel University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:Detectors often suffer from degraded performance, primarily due to the distributional gap between the source and target domains. This issue is especially evident in single-source domains with limited data, as models tend to rely on confounders (e.g., illumination, co-occurrence, and style) from the source domain, leading to spurious correlations that hinder generalization. To this end, this paper proposes a novel Basis-driven framework for domain generalization, namely \textbf\textitBridge, that incorporates causal inference into object detection. By learning the low-rank bases for front-door adjustment, \textbf\textitBridge blocks confounders’ effects to mitigate spurious correlations, while simultaneously refining representations by filtering redundant and task-irrelevant components. \textbf\textitBridge can be seamlessly integrated with both discriminative (e.g., DINOv2/3, SAM) and generative (e.g., Stable Diffusion) Vision Foundation Models (VFMs). Extensive experiments across multiple domain generalization object detection datasets, i.e., Cross-Camera, Adverse Weather, Real-to-Artistic, Diverse Weather Datasets, and Diverse Weather DroneVehicle (our newly augmented real-world UAV-based benchmark), underscore the superiority of our proposed method over previous state-of-the-art approaches. The project page is available at: this https URL.
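For reference, the standard front-door adjustment formula from causal inference that the abstract invokes, where M is a mediator between treatment X and outcome Y. How Bridge parameterizes the mediator with learned low-rank bases is the paper's own contribution and is not reproduced here.

```latex
P(Y \mid do(X)) \;=\; \sum_{m} P(m \mid X) \sum_{x'} P(Y \mid x', m)\, P(x')
```

Intuitively, the inner sum averages the effect of the mediator over the marginal distribution of X, which removes the confounder's backdoor influence without requiring the confounder to be observed.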

[CV-12] ViCrop-Det: Spatial Attention Entropy Guided Cropping for Training-Free Small-Object Detection

【Quick Read】: This paper addresses the local feature degradation that Transformer-based detectors suffer on spatially heterogeneous natural images, which particularly hurts detection of tiny objects in dense conflict zones. The key to the solution is ViCrop-Det, a training-free inference-time framework that introduces adaptive spatial trust-region shrinkage: it computes Spatial Attention Entropy (SAE) from the detection decoder's cross-attention distribution to heuristically assess local spatial ambiguity, and dynamically routes computation to regions with both high target saliency and high epistemic uncertainty. By shrinking the spatial trust region and injecting high-frequency localized observations, it actively resolves spatial ambiguity and recovers fine-grained features without architectural modifications, achieving a well-optimized trade-off between accuracy gains and computational cost.

Link: https://arxiv.org/abs/2604.26806
Authors: Hui Wang, Hongze Li, Wei Chen, Xiaojin Zhang
Institutions: Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Transformer-based architectures have established a dominant paradigm in global semantic perception; however, they remain fundamentally constrained by the profound spatial heterogeneity inherent in natural images. Specifically, the imposition of a uniform global receptive field across regions of varying information density inevitably leads to local feature degradation, particularly in dense conflict zones populated by microscopic targets. To address this mechanistic limitation, we propose ViCrop-Det, a training-free inference framework that introduces adaptive spatial trust region shrinkage. Inspired by the use of attention entropy in anomaly segmentation, ViCrop-Det leverages the detection decoder’s cross-attention distribution as an endogenous probe. By utilizing Spatial Attention Entropy (SAE) to heuristically evaluate local spatial ambiguity, the framework executes dynamic spatial routing, allocating a fixed computational budget exclusively to regions exhibiting both high target saliency and high cognitive uncertainty. By shrinking the spatial trust region and injecting high-frequency localized observations, ViCrop-Det actively resolves spatial ambiguity and recovers fine-grained features without requiring architectural modifications. Extensive evaluations on VisDrone and DOTA-v1.5 demonstrate that ViCrop-Det yields competitive performance enhancements, consistently adding +1-3 mAP@50 to RT-DETR-R50 and Deformable DETR with a marginal 20-23% latency overhead. On MS COCO, AP_S improves while AP_M/AP_L remains stable, indicating precise fine-scale refinement without compromising the global spatial prior. Under compute-matched settings, our adaptive routing strategy comprehensively surpasses uniform slicing baselines, achieving a highly optimized accuracy-speed trade-off.
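Spatial Attention Entropy is the Shannon entropy of a normalized attention distribution: peaked attention (confident localization) has low entropy, diffuse attention (spatial ambiguity) has high entropy. A minimal sketch of the gating idea, with both thresholds as illustrative assumptions rather than the paper's tuned values:

```python
# SAE gating sketch: route extra compute (cropping) only to regions that
# are both salient and spatially ambiguous (high attention entropy).
import math

def attention_entropy(weights):
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_crop(region_attention, saliency,
                entropy_thresh=1.0, saliency_thresh=0.5):
    # High saliency AND high entropy -> the fixed crop budget goes here.
    return saliency > saliency_thresh and \
        attention_entropy(region_attention) > entropy_thresh
```

A uniform attention map over N cells attains the maximum entropy log(N), so the threshold effectively asks "how far from confidently peaked is the decoder here?"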

[CV-13] MesonGS: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching

【Quick Read】: This paper addresses the prohibitive storage cost of 3D Gaussian Splatting (3DGS) in practical deployment; existing post-training compression methods involve many coupled hyperparameters across pruning, transformation, quantization, and entropy coding, making it hard to precisely control the compressed size or fully exploit the rate-distortion trade-off. The key to the solution is MesonGS++, a size-aware post-training codec that jointly combines importance-based pruning, octree geometry coding, attribute transformation, selective vector quantization for higher-degree spherical harmonics, and group-wise mixed-precision quantization with entropy coding. It treats the reserve ratio and bit-width allocation as the dominant rate-distortion knobs and jointly optimizes them under a target storage budget via discrete sampling and 0-1 integer linear programming, with a linear size estimator and a CUDA parallel quantization operator to accelerate the hyperparameter search. The result is over 34x compression with preserved rendering fidelity, in some scenes even outperforming the original 3DGS model at a 20x compression rate without any training.

Link: https://arxiv.org/abs/2604.26799
Authors: Shuzhao Xie, Junchen Ge, Weixiang Zhang, Jiahang Liu, Chen Tang, Yunpeng Bai, Shijia Ge, Jingyan Jiang, Yuzhi Huang, Fengnian Yang, Cong Zhang, Xiaoyi Fan, Zhi Wang
Institutions: Shenzhen International Graduate School, Tsinghua University; The Hong Kong University of Science and Technology; Harbin Institute of Technology, Shenzhen; The Chinese University of Hong Kong; The University of Texas at Austin; Shenzhen Technology University; Xiamen University; Simon Fraser University; Jiangxing Intelligence Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
Comments: this https URL

Click to view abstract

Abstract:3D Gaussian Splatting (3DGS) achieves high-quality novel view synthesis with real-time rendering, but its storage cost remains prohibitive for practical deployment. Existing post-training compression methods still rely on many coupled hyperparameters across pruning, transformation, quantization, and entropy coding, making it difficult to control the final compressed size and fully exploit the rate-distortion trade-off. We propose MesonGS++, a size-aware post-training codec for 3D Gaussian compression. On the codec side, MesonGS++ combines joint importance-based pruning, octree geometry coding, attribute transformation, selective vector quantization for higher-degree spherical harmonics, and group-wise mixed-precision quantization with entropy coding. On the configuration side, it treats the reserve ratio and bit-width allocation as the dominant rate-distortion knobs and jointly optimizes them under a target storage budget via discrete sampling and 0–1 integer linear programming. We further propose a linear size estimator and a CUDA parallel quantization operator to accelerate the hyperparameter searching process. Extensive experiments show that MesonGS++ achieves over 34× compression while preserving rendering fidelity, outperforming state-of-the-art post-training methods and accurately meeting target size budgets. Remarkably, without any training, MesonGS++ can even surpass the PSNR of vanilla 3DGS at a 20× compression rate on the Stump scene. Our code is available at this https URL
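摘要中"离散采样 + 0-1 整数线性规划,在存储预算下联合选取位宽"的配置搜索,本质上是一个多选背包问题:每个属性组恰好选一个(位宽, 尺寸, 失真)候选,使总尺寸不超预算且总失真最小。规模很小时可直接枚举说明其约束与目标形式;下面的组结构、数值与线性尺寸估计器均为笔者假设的示例,并非论文官方代码。

```python
import itertools

def estimate_size_bytes(n_elems, bits, overhead=0.0):
    """Linear size estimator: payload bits plus a fixed per-group overhead."""
    return n_elems * bits / 8.0 + overhead

def allocate_bitwidths(groups, budget):
    """Pick exactly one (bit-width, size, distortion) option per attribute
    group so that total size <= budget and total distortion is minimal.
    Tiny instances can be solved by exhaustive enumeration."""
    best, best_cost = None, float("inf")
    for combo in itertools.product(*groups):
        size = sum(o["size"] for o in combo)
        dist = sum(o["distortion"] for o in combo)
        if size <= budget and dist < best_cost:
            best, best_cost = combo, dist
    return best, best_cost
```

实际系统中组数与候选数更大,需要换成 0-1 ILP 求解器(如分支定界);此处仅演示"预算约束下最小化失真"这一目标的形状。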

[CV-14] Virtual-reality based patient-specific simulation of spine surgical procedures: A fast highly automated and high-fidelity system for surgical education and planning

【速读】:该论文旨在解决传统外科培训中因临床压力增大导致的手术室(OR)暴露机会减少的问题,同时克服现有虚拟现实(VR)训练场景标准化、缺乏个体化适配的局限性。其核心解决方案是利用人工智能(AI)驱动的计算机视觉方法,从患者的CT和MRI影像中自动构建高保真度的患者特异性三维解剖模型,并在此基础上实现脊柱减压手术(包括椎板切除术、椎间盘切除术和椎间孔成形术)的VR模拟。关键技术在于多模态影像融合(即CT与MRI的配准)与结构分割算法的结合,实现了约2.5分钟/例的高效建模,且分割精度(DSC达0.95)和配准误差(TRE均值1.73 mm)均达到临床可用水平,从而显著提升了术前规划、术后评估及全流程手术模拟的个性化与教育价值。

链接: https://arxiv.org/abs/2604.26781
作者: Raj Kumar Ranabhat,Tayler D Ross,Tony Jiao,Jeremie Larouche,Joel Finkelstein,Michael Hardisty
机构: Sunnybrook Health Sciences Centre (Sunnybrook 医疗科学中心); University of Toronto (多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Surgical training involves didactic teaching, mentor-led learning, surgical skills laboratories, and direct exposure to surgery; however, increasing clinical pressures have limited operating room (OR) exposure. This work leverages virtual reality (VR) to provide a safe and immersive training environment. Existing VR training is often based on standardized scenarios not tailored to individual clinical cases. This study addresses this limitation using artificial intelligence (AI) based computer vision methods to generate patient-specific simulations from computed tomography (CT) and magnetic resonance imaging (MRI). This study focuses on patient-specific spinal decompression simulation for spinal stenosis in a virtual operating room. The objectives were (1) automatic creation of 3D anatomical models and (2) VR simulation of spinal decompression procedures including laminectomy, disc resection, and foraminotomy. Model construction required multimodal fusion (registration) of CT and MRI and segmentation of relevant structures. Segmentation was evaluated using the Dice Similarity Coefficient (DSC), and registration accuracy using Target Registration Error (TRE). Qualitative feedback was obtained from surgeons and trainees. High-fidelity patient-specific 3D models were generated efficiently (approximately 2.5 minutes per case, N = 15). Segmentation accuracy was high, with a DSC of 0.95 (+/- 0.03) for vertebral bone and 0.895 (+/- 0.02) for soft tissue structures. Registration accuracy showed a mean TRE of 1.73 (+/- 0.42) mm. Semi-structured interviews indicated improved spatial understanding, increased procedural confidence, and strong perceived educational value. This platform significantly reduced the time and costs of patient-specific modelling, thereby facilitating pre-operative planning, post-procedural assessments, and comprehensive surgical simulation.
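论文用 Dice 相似系数(DSC)评估分割、用目标配准误差(TRE)评估配准。两者都是标准定义,下面给出最小 NumPy 实现以说明报告指标(DSC 0.95、TRE 1.73 mm)的计算口径;接口为笔者假设。

```python
import numpy as np

def dice_coefficient(pred, gt, eps=1e-8):
    """Dice Similarity Coefficient between two binary masks:
    2|A ∩ B| / (|A| + |B|)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float(2.0 * inter / (pred.sum() + gt.sum() + eps))

def target_registration_error(moved_pts, fixed_pts):
    """Mean Euclidean distance (e.g. in mm) between corresponding
    landmark points after registration."""
    return float(np.linalg.norm(moved_pts - fixed_pts, axis=-1).mean())
```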

[CV-15] MemOVCD: Training-Free Open-Vocabulary Change Detection via Cross-Temporal Memory Reasoning and Global-Local Adaptive Rectification

【速读】:该论文旨在解决开放词汇变化检测(open-vocabulary change detection)中因时序耦合不足导致的语义变化识别不准确问题,以及高分辨率图像上基于patch的推理方式引发的全局语义连续性弱化和变化区域碎片化问题。其解决方案的关键在于提出一种无需训练的框架MemOVCD,通过跨时序记忆推理机制实现双向加权传播以增强时间维度上的语义证据聚合,并引入直方图对齐的过渡帧来稳定长时距下的记忆传播;同时采用全局-局部自适应校正策略,动态融合局部与全局预测结果,在保持空间一致性的同时保留细粒度变化细节。

链接: https://arxiv.org/abs/2604.26774
作者: Zuzheng Kuang,Honghao Chang,Boqiang Liang,Haoqian Wang,Lijun He,Fan Li,Haixia Bi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Open-vocabulary change detection aims to identify semantic changes in bi-temporal remote sensing images without predefined categories. Recent methods combine foundation models such as SAM, DINO and CLIP, but typically process each timestamp independently or interact only at the final comparison stage. Such paradigms suffer from insufficient temporal coupling during semantic reasoning, which limits their ability to distinguish genuine semantic changes from non-semantic appearance discrepancies. In addition, patch-dominant inference on high-resolution images often weakens global semantic continuity and produces fragmented change regions. To address these issues, we propose MemOVCD, a training-free open-vocabulary change detection framework based on cross-temporal memory reasoning and global-local adaptive rectification. Specifically, we reformulate bi-temporal change detection as a two-frame tracking problem and introduce weighted bidirectional propagation to aggregate semantic evidence from both temporal directions. To stabilize memory propagation across large temporal gaps, we construct histogram-aligned transition frames to smooth abrupt appearance changes. Moreover, a global-local adaptive rectification strategy adaptively fuses local and global-view predictions, improving spatial consistency while preserving fine-grained details. Experiments on five benchmarks demonstrate that MemOVCD achieves favorable performance on two change detection tasks, validating its effectiveness and generalization under diverse open-vocabulary settings.
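摘要中"直方图对齐的过渡帧"用于平滑两时相影像间剧烈的表观差异。经典做法是直方图匹配(CDF 对齐):把源影像的灰度重映射,使其累积分布与目标影像一致。以下为笔者的示意实现,与论文的具体过渡帧构造方式未必相同。

```python
import numpy as np

def match_histogram(source, target):
    """Remap `source` grey levels so its histogram matches `target`'s,
    by aligning the two cumulative distribution functions (CDFs)."""
    s_vals, s_idx, s_cnt = np.unique(
        source.ravel(), return_inverse=True, return_counts=True)
    t_vals, t_cnt = np.unique(target.ravel(), return_counts=True)
    s_cdf = np.cumsum(s_cnt).astype(np.float64) / source.size
    t_cdf = np.cumsum(t_cnt).astype(np.float64) / target.size
    mapped = np.interp(s_cdf, t_cdf, t_vals)   # CDF matching
    return mapped[s_idx].reshape(source.shape)
```

将时相一影像匹配到时相二的直方图后,即可作为两帧之间的外观平滑过渡帧,供记忆传播使用。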

[CV-16] TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection CVPR

【速读】:该论文旨在解决当前AI生成图像(AIGI)检测任务中模型泛化能力不足的问题,尤其是在面对未见过的生成模型或复杂场景时性能受限。现有方法多基于CLIP视觉Transformer(CLIP-ViT)作为特征提取器,但其在多样化的生成图像检测任务上仍有提升空间。解决方案的关键在于两个方面:一是通过构建涵盖多种视觉基础模型(VFM)家族的全面基准测试,系统评估不同预训练目标、输入分辨率和模型规模下的AIGI检测性能;二是提出一种简单的分类头重设计——可调注意力池化(TAP),用于从输出token中聚合出更精细的全局表征,从而充分挖掘现代VFM的特征潜力。实验表明,结合最新VFM与TAP的方法在多个AIGI检测基准上显著优于现有技术,特别是在真实世界场景下的AI生成与AI修补图像检测中达到新的SOTA水平。

链接: https://arxiv.org/abs/2604.26772
作者: Ahmed Abdullah,Nikolas Ebert,Oliver Wasenmüller
机构: Mannheim University of Applied Sciences (曼海姆应用科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2026

点击查看摘要

Abstract:Recent methods demonstrate that large-scale pretrained models, such as CLIP vision transformers, effectively detect AI-generated images (AIGIs) from unseen generative models when used as feature extractors. Many state-of-the-art methods for AI-generated image detection build upon the original CLIP-ViT to enhance this generalization. Since CLIP’s release, numerous vision foundation models (VFMs) have emerged, incorporating architectural improvements and different training paradigms. Despite these advances, their potential for AIGI detection and AI image forensics remains largely unexplored. In this work, we present a comprehensive benchmark across multiple VFM families, covering diverse pretraining objectives, input resolutions, and model scales. We systematically evaluate their out-of-the-box performance for detecting fully-generated AI-images and AI-inpainted images, and discover that the best model outperforms the original CLIP by more than 12% in accuracy, beating established approaches in the process. To fully leverage the features of a modern VFM, we propose a simple redesign of the classifier head by utilizing tunable attention pooling (TAP), which aggregates output tokens into a refined global representation. Integrating TAP with the latest VFMs yields substantial performance gains across several AIGI detection benchmarks, establishing a new state-of-the-art on two challenging benchmarks for in-the-wild detection of AI-generated and -inpainted images.
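论文提出的可调注意力池化(TAP)用一个可学习的查询向量对输出 token 做注意力加权聚合,得到精细的全局表征。下面是一个最小 NumPy 草图:查询、键/值投影矩阵均为待学习参数;具体参数化方式为笔者假设,非论文官方实现。

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(tokens, query, Wk, Wv):
    """Pool N output tokens into one global vector with a learned query.

    tokens: (N, D) output tokens of the backbone.
    query:  (D,)   tunable query vector (learned).
    Wk, Wv: (D, D) key/value projections (learned).
    """
    keys = tokens @ Wk                            # (N, D)
    vals = tokens @ Wv                            # (N, D)
    scores = keys @ query / np.sqrt(keys.shape[-1])
    w = softmax(scores)                           # (N,) attention weights
    return w @ vals                               # (D,) refined global feature
```

当查询为零向量时权重均匀,TAP 退化为平均池化;训练后查询会偏向对判别 AI 生成痕迹更有用的 patch token。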

[CV-17] GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

【速读】:该论文旨在解决当前基础模型在真实环境中部署时,仅依赖语言推理难以实现高效多模态交互与决策的问题。传统方法将视觉等感知能力作为语言模型的辅助接口,而非核心推理组件,导致代理(agent)在处理图像、视频、网页、文档及图形用户界面(GUI)等异构上下文时能力受限。解决方案的关键在于构建一个原生多模态基础模型——GLM-5V-Turbo,其核心创新是将多模态感知能力深度整合进推理、规划、工具调用和执行全流程中,而非将其视为外部输入接口。这一设计显著提升了模型在多模态编程、视觉工具使用和基于框架的代理任务中的性能,同时保持了强大的纯文本编程能力,并为构建可靠、可扩展的多模态代理提供了实践路径。

链接: https://arxiv.org/abs/2604.26752
作者: GLM-V Team:Wenyi Hong,Xiaotao Gu,Ziyang Pan,Zhen Yang,Yuting Wang,Yue Wang,Yuanchang Yue,Yu Wang,Yanling Wang,Yan Wang,Xijun Liu,Wenmeng Yu,Weihan Wang,Wei Li,Shuaiqi Duan,Sheng Yang,Ruiliang Lv,Mingdao Liu,Lihang Pan,Ke Ning,Junhui Ji,Jinjiang Wang,Jing Chen,Jiazheng Xu,Jiale Zhu,Jiale Cheng,Ji Qi,Guobing Gan,Guo Wang,Cong Yao,Zijun Dou,Zihao Zhou,Zihan Wang,Zhiqi Ge,Zhijie Li,Zhenyu Hou,Zhao Xue,Zehui Wang,Zehai He,Yusen Liu,Yukuo Cen,Yuchen Li,Yuan Wang,Yijian Lu,Yanzi Wang,Yadong Xue,Xinyu Zhang,Xinyu Liu,Wenkai Li,Tianyu Tong,Tianshu Zhang,Shengdong Yan,Qinkai Zheng,Mingde Xu,Licheng Bao,Jiaxing Xu,Jiaxin Fan,Jiawen Qian,Jiali Chen,Jiahui Lin,Haozhi Zheng,Haoran Wang,Haochen Li,Fan Yang,Dan Zhang,Chuangxin Zhao,Chengcheng Wu,Boyan Shi,Bowei Jia,Baoxu Wang,Peng Zhang,Debing Liu,Bin Xu,Juanzi Li,Minlie Huang,Yuxiao Dong,Jie Tang
机构: Z.ai; Tsinghua University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.

[CV-18] Learning Sparse BRDF Measurement Samples from Image

【速读】:该论文旨在解决基于少量测量点的双向反射分布函数(Bidirectional Reflectance Distribution Function, BRDF)获取问题,以实现高效且高质量的材质外观重建。传统方法依赖密集的角反射测量设备,成本高且耗时长。其解决方案的关键在于设计一个可微分的采样器,该采样器结合了用于稀疏坐标-值观测的编码器、预训练的基于超网络(hypernetwork)的BRDF重构器以及可微分渲染模块;通过固定重构器并利用BRDF空间损失和图像空间损失的梯度优化测量方向,从而在不重新训练先验模型的前提下,自动选择对材质分布最具信息量的采样位置,显著提升低预算下的重建质量。

链接: https://arxiv.org/abs/2604.26740
作者: Wen Cao
机构: Linköping University (林雪平大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Accurate BRDF acquisition is important for realistic rendering, but dense gonioreflectometer measurements are slow and expensive. We study how to select a small number of BRDF measurements that are most useful for reconstructing material appearance under a learned reflectance prior. Our method combines a set encoder for sparse coordinate-value observations, a pretrained hypernetwork-based BRDF reconstructor, and a differentiable renderer. During sampler training, the reconstructor is kept fixed and gradients from BRDF-space and rendered-image losses are used to optimize measurement locations. This separates sample selection from prior fitting and encourages the sampler to choose directions that are informative under the learned material distribution. Experiments on the MERL dataset show that the proposed sampler improves low-budget reconstruction quality at 8 and 16 measurements compared with neural reconstruction baselines, while PCA-based methods remain strong at larger budgets. We further analyze the effect of image-space supervision, co-optimization, and image-only latent fitting for unseen materials.

[CV-19] CurEvo: Curriculum-Guided Self-Evolution for Video Understanding

【速读】:该论文旨在解决现有自演化视频理解框架中因缺乏结构化引导而导致的优化控制薄弱和难度进展不可控的问题。其解决方案的关键在于提出CurEvo框架,该框架将课程学习(curriculum learning)引入自演化过程,通过动态调节任务难度、精细化评估标准并根据模型能力平衡数据多样性,构建了一个以模型能力为导向的课程引导反馈环,从而实现学习复杂度与模型能力的协同演进。在此基础上,进一步设计了多维自适应问答(QA)框架,联合演化感知、识别与理解维度的问题生成与答案评估,确保课程进展的连贯性与可度量性,显著提升了视频问答(VideoQA)任务中的基准准确率和语义评分。

链接: https://arxiv.org/abs/2604.26707
作者: Guiyi Zeng,Junqing Yu,Yi-Ping Phoebe Chen,Xu Chen,Wei Yang,Zikai Song
机构: Huazhong University of Science and Technology (华中科技大学); La Trobe University (拉特罗布大学); Beijing Institute of Computer Technology and Applications (北京计算机技术与应用研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Recent advances in self-evolution video understanding frameworks have demonstrated the potential of autonomous learning without human annotations. However, existing methods often suffer from weakly controlled optimization and uncontrolled difficulty progression, as they lack structured guidance throughout the iterative learning process. To address these limitations, we propose CurEvo, a curriculum-guided self-evolution framework that introduces curriculum learning into self-evolution to achieve more structured and progressive model improvement. CurEvo dynamically regulates task difficulty, refines evaluation criteria, and balances data diversity according to model competence, forming a curriculum-guided feedback loop that aligns learning complexity with model capability. Built upon this principle, we develop a multi-dimensional adaptive QA framework that jointly evolves question generation and answer evaluation across perception, recognition, and understanding dimensions, ensuring coherent and measurable curriculum progression. Through this integration, CurEvo transforms weakly controlled self-evolution into a more structured learning process for autonomous video understanding. Across seven backbones, CurEvo consistently improves both benchmark accuracy and evaluator-based semantic score on four VideoQA benchmarks, validating the effectiveness of curriculum-guided self-evolution for video understanding.

[CV-20] Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

【速读】:该论文旨在解决现有统一世界模型(Unified World Model, UWM)仅在2D像素空间建模、难以同时保障机器人动作执行效率与世界建模质量的问题。其核心挑战在于如何在单一框架中实现高保真4D世界合成(视频+3D重建)与实时机器人动作执行的协同优化。解决方案的关键在于提出X-WAM,一个统一的4D世界模型:首先,利用预训练视频扩散模型的强大视觉先验,通过预测多视角RGB-D视频来想象未来世界;其次,采用轻量级结构适配策略——将预训练Diffusion Transformer的最后几层复制到专用深度预测分支,高效获取未来空间信息;此外,引入异步噪声采样(Asynchronous Noise Sampling, ANS),在推理阶段采用异步去噪调度,在少量步骤内快速解码动作以支持实时执行,同时用完整去噪步数生成高质量视频,且训练时从联合分布采样以对齐推理分布,从而实现生成质量与动作效率的协同提升。

链接: https://arxiv.org/abs/2604.26694
作者: Jun Guo,Qiwei Li,Peiyan Li,Zilong Chen,Nan Sun,Yifei Su,Heyun Wang,Yuan Zhang,Xinghang Li,Huaping Liu
机构: Tsinghua University (清华大学); Xiaomi Robotics (小米机器人); Peking University (北京大学); CASIA (中国科学院自动化研究所)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL

点击查看摘要

Abstract:We propose X-WAM, a Unified 4D World Model that unifies real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstruction) in a single framework, addressing the critical limitations of prior unified world models (e.g., UWM) that only model 2D pixel-space and fail to balance action efficiency and world modeling quality. To leverage the strong visual priors of pretrained video diffusion models, X-WAM imagines the future world by predicting multi-view RGB-D videos, and obtains spatial information efficiently through a lightweight structural adaptation: replicating the final few blocks of the pretrained Diffusion Transformer into a dedicated depth prediction branch for the reconstruction of future spatial information. Moreover, we propose Asynchronous Noise Sampling (ANS) to jointly optimize generation quality and action decoding efficiency. ANS applies a specialized asynchronous denoising schedule during inference, which rapidly decodes actions with fewer steps to enable efficient real-time execution, while dedicating the full sequence of steps to generate high-fidelity video. Rather than entirely decoupling the timesteps during training, ANS samples from their joint distribution to align with the inference distribution. Pretrained on over 5,800 hours of robotic data, X-WAM achieves 79.2% and 90.7% average success rate on RoboCasa and RoboTwin 2.0 benchmarks, while producing high-fidelity 4D reconstruction and generation surpassing existing methods in both visual and geometric metrics.
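摘要中的异步噪声采样(ANS)核心是:动作分支用少步调度快速解码,视频分支用完整步数去噪,训练时从二者的联合分布采样时间步以对齐推理分布。下面用 NumPy 勾勒调度的构造方式;步数、采样方式均为笔者的示意假设。

```python
import numpy as np

def asynchronous_schedules(total_steps, action_steps):
    """Build two denoising timestep schedules over one shared diffusion:
    a short one for the action branch (fast decoding) and the full one
    for the video branch. Timesteps run from total_steps-1 down to 0."""
    video = np.arange(total_steps - 1, -1, -1)
    idx = np.linspace(0, total_steps - 1, action_steps).round().astype(int)
    action = video[idx]                 # subsample the full schedule
    return action, video

def sample_joint_timesteps(rng, total_steps, action_steps):
    """Training-time sampling aligned with the inference distribution:
    draw the action timestep from the short schedule and the video
    timestep from the full schedule, mirroring asynchronous inference."""
    action_sched, video_sched = asynchronous_schedules(total_steps, action_steps)
    return rng.choice(action_sched), rng.choice(video_sched)
```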

[CV-21] Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations

【速读】:该论文旨在解决从振动响应较弱或高度共振的固体物体表面恢复场景声音的问题,这类物体在传统光学振动传感方法中难以有效捕捉其振动信号。解决方案的关键在于提出一种基于物理规律的振动形成模型,该模型通过物体的振动模态(vibrational modes)将场景声源与多点、多轴的振动信号关联起来,并利用该模型逆向消除物体自身的共振传递函数,进而融合多个振动信号以估计原始声源。这一方法显著提升了对复杂振动特性物体的声音恢复能力,优于传统的单点散斑测振技术和基于信号处理的多信号融合方法。

链接: https://arxiv.org/abs/2604.26678
作者: Shai Bagon,Matan Kichler,Mark Sheinin
机构: Weizmann Institute of Science (魏茨曼科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Oral presentation at The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026

点击查看摘要

Abstract:Optical vibration sensing enables recovering the scene sound directly from the surface vibration of nearby objects, turning everyday objects into "visual microphones". However, most prior methods have focused on capturing the vibrations of specific objects with highly favorable vibration responses. These include objects where the surface vibrations are generated by the object itself (e.g., speaker membrane or guitar body) or objects consisting of a thin membrane which is highly reactive to sound (e.g., a chip bag or the leaf of a plant). In this paper, we tackle sound recovery for a more challenging class of solid objects whose vibration responses are poor or highly resonant. We simultaneously capture vibrations for multiple surface points on the object using a speckle-based vibrometry imaging system. Then, we derive a novel physics-guided vibration formation model that relates the scene sound source to the captured multi-point multi-axis vibrations via the object’s vibrational modes. The model is then used to reverse the resonant transfer function of the vibrating object, fusing multiple vibration signals to estimate the original sound source in the scene. We evaluate our approach by recovering sound from a variety of everyday objects, demonstrating that it significantly outperforms traditional single-point speckle vibrometry in challenging scenarios and other signal-processing-based methods for multi-signal fusion.
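论文的融合基于振动模态模型逆转共振传递函数,细节依赖其物理建模;作为直觉上的简化替身,多点信号融合最朴素的形式是按可靠性(逆噪声方差)加权平均。以下仅为说明"多信号加权融合"这一步的示意代码,与论文的模态方法并不等价。

```python
import numpy as np

def fuse_vibration_signals(signals, noise_vars):
    """Inverse-variance weighted fusion of M aligned 1-D vibration signals.

    signals:    (M, T) vibration signals from M surface points.
    noise_vars: (M,)   estimated noise variance per signal.
    Each signal is weighted by 1/variance; weights are normalized to 1.
    """
    w = 1.0 / np.asarray(noise_vars, dtype=np.float64)
    w = w / w.sum()
    return (np.asarray(signals) * w[:, None]).sum(axis=0)
```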

[CV-22] SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection

【速读】:该论文旨在解决工业缺陷检测中因标注缺陷数据稀缺而导致的模型性能瓶颈问题,其核心挑战在于缺陷本身稀少、人工标注成本高且难以构建平衡的训练集。解决方案的关键在于提出一个端到端的合成缺陷生成与自动标注流程,融合了基于视觉-语言-模型(Vision-Language-Model)的提示生成、LoRA适配的扩散模型、掩码引导的图像修复(inpainting)以及基于DreamSim和CLIPScore的样本过滤机制,从而实现高质量、可直接用于训练的合成缺陷样本生成。实验证明,此类合成样本虽不能完全替代真实数据,但能有效增强稀缺的真实数据集,在特定训练场景下提升检测性能,并具备跨工业领域迁移的能力,凸显了其在强化小样本场景中的价值。

链接: https://arxiv.org/abs/2604.26633
作者: Paul Julius Kühn,Mika Pommeranz,Arjan Kuijper,Saptarshi Neil Sinha
机构: Fraunhofer IGD (弗劳恩霍夫图像数据处理研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The bottleneck in learning-based industrial defect detection is often not model capacity, but the scarcity of labeled defect data: defects are rare, annotations are expensive, and collecting balanced training sets is slow. We present an end-to-end pipeline for synthetic defect generation and annotation, combining Vision-Language-Model-based prompts, LoRA-adapted diffusion, mask-guided inpainting, and sample filtering with automatic label derivation, and demonstrate the potential of augmenting real data with realistic synthetic samples to overcome data scarcity. The evaluation is conducted on BSData, a challenging dataset of pitting defects on ball screw drives, and then on a subset of the Mobile phone screen surface defect segmentation (MSD) dataset to test cross-domain transfer. Beyond downstream detector performance, we analyze key stages of the pipeline, including prompt construction, LoRA selection, and sample filtering with DreamSim and CLIPScore, to understand which synthetic samples are both realistic and useful. Experiments with YOLOv26, YOLOX, and LW-DETR show that synthetic-only training does not replace real data. When combined with real data, synthetic defects can preserve performance and yield modest gains in selected BSData training regimes. The MSD transfer study shows that the overall pipeline structure carries over to a second industrial inspection domain, while also highlighting the importance of domain-specific adaptation and annotation-quality control. Overall, the paper provides an end-to-end assessment of diffusion-based industrial defect synthesis and shows that its strongest value lies in strengthening scarce real datasets rather than substituting for them.

[CV-23] SnapPose3D: Diffusion-Based Single-Frame 2D-to-3D Lifting of Human Poses ICPR2026

【速读】:该论文旨在解决2D到3D人体姿态提升(2D-to-3D lifting)方法中存在的深度模糊性(depth ambiguity)和关节不确定性(joint uncertainty)问题,这些问题导致从2D关节点位置映射到多个可能的3D姿态,从而影响预测准确性。解决方案的关键在于引入基于扩散模型(diffusion-based models)的生成能力,在推理阶段通过从单位高斯分布中随机采样生成多个姿态假设,并对其进行聚合以得到最终准确的3D姿态。该方法称为SnapPose3D,其训练过程为确定性去噪,条件输入包括视觉上下文和2D姿态特征;而在推理时采用概率化策略,不依赖时序信息,仅使用单帧图像即可实现高性能估计,显著降低计算复杂度与数据获取难度。

链接: https://arxiv.org/abs/2604.26620
作者: Alessandro Simoni,Riccardo Catalini,Davide Di Nucci,Guido Borghi,Davide Davoli,Lorenzo Garattoni,Gianpiero Francesca,Yuki Kawana,Roberto Vezzani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICPR 2026

点击查看摘要

Abstract:Depth ambiguity and joint uncertainty are the two main obstacles in obtaining accurate human pose predictions by 2D-to-3D lifting methods proposed in the literature. In particular, these issues are caused by 2D joint locations that can be mapped to multiple 3D positions, inducing multiple possible final poses. Following these considerations, we propose leveraging diffusion-based models’ generation capability to predict multiple hypotheses and aggregate them into a final accurate pose. Therefore, we introduce SnapPose3D, a pose-lifting framework trained deterministically to denoise 3D poses conditioned on both visual context and 2D pose features. SnapPose3D adopts a probabilistic approach during inference, generating multiple hypotheses through random sampling from a unit Gaussian distribution. Unlike most previous methods that address pose ambiguity by processing temporal sequences, SnapPose3D uses single frames as input, avoiding tracking and limiting computational cost and data acquisition complexity, making it well suited to online, real-time applications. We extensively evaluate SnapPose3D on well-known benchmarks for the 3D human pose estimation task, showing its ability to generate and aggregate accurate hypotheses that lead to state-of-the-art results.
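摘要描述"采样多个姿态假设再聚合"的流程,但未在此给出聚合的具体形式。一个常见且稳健的做法是逐关节中位数,再对内点做均值;以下为笔者的示意实现,半径参数等均为假设,非论文官方聚合方式。

```python
import numpy as np

def aggregate_hypotheses(hyps, inlier_radius=None):
    """Aggregate K sampled 3D pose hypotheses into one pose.

    hyps: (K, J, 3) hypotheses (K samples, J joints, xyz).
    The per-joint median is robust to outlier samples drawn from the
    unit-Gaussian conditioning noise; optionally refine by averaging
    only hypotheses within `inlier_radius` of the median.
    """
    med = np.median(hyps, axis=0)                       # (J, 3) robust center
    if inlier_radius is None:
        return med
    d = np.linalg.norm(hyps - med, axis=-1)             # (K, J) joint-wise distance
    w = (d <= inlier_radius).astype(float)[..., None]   # keep close samples
    w_sum = np.maximum(w.sum(axis=0), 1.0)
    return (hyps * w).sum(axis=0) / w_sum
```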

[CV-24] State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在仪表盘读数任务中表现脆弱的问题,尤其是在视角变化和光照变化下性能显著下降的现象。研究表明,现有模型主要依赖表面视觉特征而非仪表的内在状态几何结构,导致对相同物理状态的样本无法保持一致聚类,且相邻状态间缺乏连续性结构的保留。为应对这一问题,作者提出TriSCA——一种三级状态一致性对齐框架,其核心在于:1)基于状态距离感知的表示对齐,确保相似状态在特征空间中靠近;2)基于元数据引导的观测到状态监督机制,增强模型对真实状态的理解;3)状态感知的目标对齐策略,使训练目标与仪表读数的连续性本质相一致。实验表明,该方法在控制变量的时钟与仪表基准测试及外部真实场景基准上均显著优于现有方法。

链接: https://arxiv.org/abs/2604.26614
作者: Yuanze Hu,Gen Li,Yuqin Lan,Qingchen Yu,Zhichao Yang,Junwei Jing,Zhaoxin Fan,Xiaotie Deng
机构: Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University (北京航空航天大学未来区块链与隐私计算高精尖创新中心); Beihang University (北京航空航天大学); Fudan University (复旦大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have achieved impressive progress on general multimodal tasks, yet they remain brittle on dial-based measurement reading. In this paper, we study this problem through controlled benchmarks and feature-space probing, and show that current MLLMs not only achieve unsatisfactory accuracy on dial-based readout, but also suffer sharp performance drops under viewpoint and illumination changes even when the underlying dial state remains fixed. Our probing analysis further reveals that same-state samples under appearance variation are not consistently clustered, while neighboring states fail to preserve the local structure implied by continuous dial values. These findings suggest that existing MLLMs largely ignore the intrinsic state geometry of dial measurement tasks and instead rely on superficial appearance cues. Motivated by this diagnosis, we propose TriSCA, a tri-level state-consistent alignment framework for dial-based measurement reading. Specifically, TriSCA consists of state-distance-aware representation alignment, metadata-grounded observation-to-state supervision, and state-aware objective alignment. Extensive ablation studies and evaluation experiments on controlled clock and gauge benchmarks, together with evaluation on an external real-world benchmark, demonstrate the effectiveness of our method.

[CV-25] FunFace: Feature Utility and Norm Estimation for Face Recognition

【速读】:该论文旨在解决当前人脸识别(Face Recognition, FR)模型在低质量图像上性能下降的问题,尤其是在传统基于图像感知质量(如分辨率、模糊度和光照)的特征范数约束无法充分反映生物特征效用(biometric utility)的情况下。现有方法通过自适应边缘损失函数将特征范数与样本质量关联,但其相关性有限,难以全面捕捉对识别任务真正有用的图像特性。解决方案的关键在于提出一种新的自适应边缘损失函数——FunFace(Face Recognition Through Utility and Norm Estimation),该方法引入由“确定性比”(Certainty Ratio)估计的生物特征效用作为动态调整边缘的核心因子,从而更精准地指导模型学习具有高识别价值的特征表示。实验表明,FunFace 在高质量数据集上表现相当,在低质量基准测试中显著优于现有先进方法。

链接: https://arxiv.org/abs/2604.26598
作者: Žiga Babnik,Fadi Boutros,Naser Damer,Deepak Kumar Jain,Peter Peer,Vitomir Štruc
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Face Recognition (FR) is used in a variety of application domains, from entertainment and banking to security and surveillance. Such applications rely on the FR model to be robust and perform well in a variety of settings. To achieve this, state-of-the-art FR models typically use expressive adaptive margin loss functions, which tie the feature norm to concepts related to sample quality, such as recognizability and perceptual image quality. Recently, through the development of Face Image Quality Assessment (FIQA) techniques, biometric utility has become the preferred measure of face-image quality and has been shown to be a better predictor of the usefulness of samples for face recognition compared to more human-centric aspects, such as resolution, blur, and lighting, tied to general image quality. While image quality expressed through feature norms exhibits a certain level of correlation with biometric utility, it does not fully encapsulate all aspects of utility. To address this point, we propose a new adaptive margin loss, FunFace (Face Recognition Through Utility and Norm Estimation), which incorporates biometric utility, estimated by the Certainty Ratio, into the adaptive margin, taking inspiration from AdaFace. We show that FunFace (when used to train a face recognition model) achieves results competitive with other state-of-the-art FR models on benchmarks containing high-quality samples, while surpassing them on low-quality benchmarks.
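论文称自适应边缘同时依赖特征范数与由 Certainty Ratio 估计的生物特征效用,但摘要未给出具体函数形式。下面是笔者参照 AdaFace 的范数自适应边缘、再线性混入效用分数的一个假设性草图,所有系数与混合方式均为示意,非论文定义。

```python
import numpy as np

def adaptive_margin(feat_norms, utilities, m=0.4, h=0.33, alpha=0.5, eps=1e-3):
    """AdaFace-style margin made quality-adaptive (illustrative only).

    feat_norms: (B,) feature norms of the current batch.
    utilities:  (B,) biometric-utility scores in [0, 1] (e.g. Certainty Ratio).
    The batch-normalized norm and the rescaled utility are blended,
    clipped to [-1, 1], and used to scale the additive angular margin.
    """
    mu, sigma = feat_norms.mean(), feat_norms.std() + eps
    norm_hat = np.clip((feat_norms - mu) / sigma, -1.0, 1.0)
    quality = alpha * norm_hat + (1 - alpha) * (2.0 * utilities - 1.0)
    quality = np.clip(quality, -1.0, 1.0)
    return m + h * quality
```

其要点只是定性的:效用/范数越高的样本获得越大的边缘,从而在训练中被施加更强的判别约束。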

[CV-26] Star-Fusion: A Multi-modal Transformer Architecture for Discrete Celestial Orientation via Spherical Topology

【速读】:该论文旨在解决自主航天器导航中基于星敏感器的姿态确定问题,传统“失联空间”(Lost-in-Space, LIS)算法存在计算开销大且对传感器噪声敏感的缺陷。其核心解决方案是提出Star-Fusion架构,关键在于将姿态估计重构为离散拓扑分类任务:通过球面K-Means聚类划分天球空间为K个拓扑一致区域,有效缓解赤经(Right Ascension, RA)与赤纬(Declination, Dec)的周期边界效应;并采用三路融合策略——SwinV2-Tiny Transformer提取光度特征、卷积热图分支提供空间定位、坐标型MLP实现几何锚定,从而在保持高精度(Top-1准确率93.4%,Top-3准确率97.8%)的同时实现低延迟推理(18.4 ms),具备在资源受限星载硬件上的实时部署能力。

链接: https://arxiv.org/abs/2604.26582
作者: May Hammad,Menatallh Hammad
机构: Julius-Maximilians-Universität Würzburg (维尔茨堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reliable celestial attitude determination is a critical requirement for autonomous spacecraft navigation, yet traditional “Lost-in-Space” (LIS) algorithms often suffer from high computational overhead and sensitivity to sensor-induced noise. While deep learning has emerged as a promising alternative, standard regression models are often confounded by the non-Euclidean topology of the celestial sphere and by the periodic boundary conditions of Right Ascension (RA) and Declination (Dec). In this paper, we present Star-Fusion, a multi-modal architecture that reformulates orientation estimation as a discrete topological classification task. Our approach leverages spherical K-Means clustering to partition the celestial sphere into K topologically consistent regions, effectively mitigating coordinate wrapping artifacts. The proposed architecture employs a tripartite fusion strategy: a SwinV2-Tiny transformer backbone for photometric feature extraction, a convolutional heatmap branch for spatial grounding, and a coordinate-based MLP for geometric anchoring. Experimental evaluations on a synthetic Hipparcos-derived dataset demonstrate that Star-Fusion achieves a Top-1 accuracy of 93.4% and a Top-3 accuracy of 97.8%. Furthermore, the model exhibits high computational efficiency, maintaining an inference latency of 18.4 ms on resource-constrained COTS hardware, making it a viable candidate for real-time onboard deployment in next-generation satellite constellations.
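论文用球面 K-Means 将天球划分为 K 个拓扑一致的区域,从而规避 RA/Dec 的周期边界。其要点是:把 (RA, Dec) 转成单位向量,用余弦相似度聚类,并在每步把质心重新归一化到球面上。以下为笔者的最小实现;为保证可复现,采用确定性的最远点初始化,这一细节是假设,非论文设定。

```python
import numpy as np

def radec_to_unit(ra_deg, dec_deg):
    """Map (RA, Dec) in degrees to unit vectors on the celestial sphere."""
    ra, dec = np.radians(ra_deg), np.radians(dec_deg)
    return np.stack([np.cos(dec) * np.cos(ra),
                     np.cos(dec) * np.sin(ra),
                     np.sin(dec)], axis=-1)

def spherical_kmeans(x, k, iters=50):
    """K-Means with cosine similarity on unit vectors; centroids are
    renormalized to the sphere each step, so RA wrap-around (359° vs 0°)
    needs no special casing."""
    # deterministic farthest-point initialization
    c = [x[0]]
    for _ in range(1, k):
        sims = np.max(np.stack([x @ ci for ci in c]), axis=0)
        c.append(x[np.argmin(sims)])
    c = np.stack(c)
    for _ in range(iters):
        labels = np.argmax(x @ c.T, axis=1)   # assign by cosine similarity
        for j in range(k):
            mask = labels == j
            if mask.any():
                v = x[mask].sum(axis=0)
                c[j] = v / np.linalg.norm(v)  # renormalize to the sphere
    return labels, c
```

注意 RA=359° 与 RA=0° 的点在单位球上相邻,会被正确归入同一区域,而朴素的 RA/Dec 欧氏聚类会把它们撕开。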

[CV-27] AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision

【速读】:该论文旨在解决无人机(UAV)航空几何三维视觉(aerial geometric 3D vision)领域因缺乏大规模、高保真训练数据而导致的性能瓶颈问题。现有基准多集中于地面或物体中心视角,难以应对无人机感知中复杂的视点变换和多样环境条件。其解决方案的关键在于提出AirZoo——一个统一的大规模数据集与基准,具备三大核心特性:1)可扩展的生成流水线,利用全球范围的摄影测量三维网格渲染多样化户外场景并支持自定义飞行轨迹与天气光照配置;2)全面的场景多样性,覆盖22个国家共378个区域,涵盖高度结构化的城市与复杂非结构化自然环境;3)丰富的几何标注,每帧提供同步的像素级度量深度与精确6-DoF地理参考位姿,支撑几何感知学习。通过三个严格评估任务验证,AirZoo可作为强大预训练引擎,在多个前沿模型上显著提升性能,建立航空空间智能的新性能上限。

链接: https://arxiv.org/abs/2604.26567
作者: Xiaoya Cheng,Rouwan Wu,Xinyi Liu,Zeyu Cui,Yan Liu,Na Zhao,Yu Liu,Maojun Zhang,Shen Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite the rapid progress in data-driven 3D vision, aerial geometric 3D vision remains a formidable challenge due to the severe scarcity of large-scale, high-fidelity training data. Existing benchmarks, predominantly biased toward ground-level or object-centric views, do not account for complex viewpoint transformations and diverse environmental conditions in UAV-based sensing. To bridge this critical gap, we propose AirZoo, a unified large-scale dataset and benchmark for grounding aerial geometric 3D vision. AirZoo possesses three appealing properties: 1) Scalable Generation Pipeline: Leveraging freely available, world-scale photogrammetric 3D meshes, it renders vast outdoor environments with customizable UAV flight trajectories and configurable weather/illumination. 2) Comprehensive Scene Diversity: It provides the most extensive coverage of region types to date (spanning 378 regions across 22 countries), systematically encompassing both highly structured urban landscapes and complex unstructured natural environments. 3) Rich Geometric Annotations: Each frame provides synchronized, pixel-level metric depth and precise 6-DoF geo-referenced poses, essential for geometry-aware learning. Through three rigorous evaluation tracks – aerial image retrieval, cross-view matching, and multi-view 3D reconstruction – we demonstrate that AirZoo serves as a powerful pre-training engine. Extensive experiments on both public and newly collected real-world benchmarks reveal that fine-tuning on AirZoo yields substantial performance gains for SoTA models (e.g., MegaLoc, RoMa, VGGT, and Depth Anything 3), establishing a new performance upper bound for aerial spatial intelligence.

[CV-28] DenseStep2M: A Scalable Training-Free Pipeline for Dense Instructional Video Annotation

【速读】:该论文旨在解决长时视频理解中复杂时间事件解析与程序性活动推理的难题,尤其针对现有指令类视频数据集(如HowTo100M)中存在的ASR转录噪声大、语音与视觉内容时间对齐不一致等挑战。其解决方案的关键在于提出一个无需训练的自动化流水线,通过视频片段分割、低质量内容过滤,并结合先进的多模态大语言模型(Qwen2.5-VL和DeepSeek-R1)生成结构化且时序锚定的程序步骤,从而构建大规模高质量数据集DenseStep2M(约10万视频、200万步骤),支持密集视频字幕、步骤定位与跨模态检索等下游任务,显著提升模型在长时视频理解和多视角泛化能力上的性能表现。

链接: https://arxiv.org/abs/2604.26565
作者: Mingji Ge,Qirui Chen,Zeqian Li,Weidi Xie
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Long-term video understanding requires interpreting complex temporal events and reasoning over procedural activities. While instructional video corpora, like HowTo100M, offer rich resources for model training, they present significant challenges, including noisy ASR transcripts and inconsistent temporal alignments between narration and visual content. In this work, we introduce an automated, training-free pipeline to extract high-quality procedural annotations from in-the-wild instructional videos. Our approach segments videos into coherent shots, filters poorly aligned content, and leverages state-of-the-art multimodal and large language models (Qwen2.5-VL and DeepSeek-R1) to generate structured, temporally grounded procedural steps. This pipeline yields DenseStep2M, a large-scale dataset comprising approximately 100K videos and 2M detailed instructional steps, designed to support comprehensive long-form video understanding. To rigorously evaluate our pipeline, we curate DenseCaption100, a benchmark of high-quality, human-written captions. Evaluations demonstrate strong alignment between our auto-generated steps and human annotations. Furthermore, we validate the utility of DenseStep2M across three core downstream tasks: dense video captioning, procedural step grounding, and cross-modal retrieval. Models fine-tuned on DenseStep2M achieve substantial gains in captioning quality and temporal localization, while exhibiting robust zero-shot generalization across egocentric, exocentric, and mixed-perspective domains. These results underscore the effectiveness of DenseStep2M in facilitating advanced multimodal alignment and long-term activity reasoning. Our dataset is available at this https URL.

[CV-29] Grounding vs. Compositionality: On the Non-Complementarity of Reasoning in Neuro-Symbolic Systems AAAI

【速读】:该论文旨在解决现代神经网络在组合泛化(compositional generalization)方面的根本性缺陷,这一缺陷限制了其在需要分布外推理(out-of-distribution reasoning)领域中的鲁棒性和适用性。论文通过系统实证分析挑战了神经符号人工智能(neuro-symbolic AI)中一个未被验证的核心假设,即组合推理会作为符号接地(symbol grounding)的副产品自然涌现。解决方案的关键在于提出一种可微分的多步演绎架构——迭代逻辑张量网络(Iterative Logic Tensor Network, iLTN),并通过分离接地与推理的贡献,证明仅训练接地目标无法实现泛化;而联合优化感知接地和多步推理的目标则能实现所有任务上的高零样本准确率,从而确立推理是一种需显式学习目标的独立能力,而非接地的衍生属性。

链接: https://arxiv.org/abs/2604.26521
作者: Mahnoor Shahid,Hannes Rothe
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: Accepted at AAAI MAKE 2026

点击查看摘要

Abstract:Compositional generalization remains a foundational weakness of modern neural networks, limiting their robustness and applicability in domains requiring out-of-distribution reasoning. A central, yet unverified, assumption in neuro-symbolic AI is that compositional reasoning will emerge as a byproduct of successful symbol grounding. This work presents the first systematic empirical analysis to challenge this assumption by disentangling the contributions of grounding and reasoning. To operationalize this investigation, we introduce the Iterative Logic Tensor Network (iLTN), a fully differentiable architecture designed for multi-step deduction. Using a formal taxonomy of generalization – probing for novel entities, unseen relations, and complex rule compositions – we demonstrate that a model trained solely on a grounding objective fails to generalize. In contrast, our full iLTN, trained jointly on perceptual grounding and multi-step reasoning, achieves high zero-shot accuracy across all tasks. Our findings provide conclusive evidence that symbol grounding, while necessary, is insufficient for generalization, establishing that reasoning is not an emergent property but a distinct capability that requires an explicit learning objective.

[CV-30] 3D-LENS: A 3D Lifting-based Elevated Novel-view Synthesis method for Single-View Aerial-Ground Re-Identification

【速读】:该论文旨在解决空中-地面重识别(Aerial-Ground Re-Identification, AG-ReID)中因视角差异导致的特征遮挡或失真问题,尤其针对真实场景下缺乏目标域标注数据的情况,提出单视角AG-ReID(Single-View AG-ReID, SV AG-ReID)这一新设置——即模型需在仅使用单一真实视角训练数据的情况下,泛化至未见过的视角进行跨视角图像检索。解决方案的关键在于提出3D Lifting-based Elevated Novel-view Synthesis(3D-LENS)框架,其核心创新是结合大规模3D网格重建实现几何一致性的新视角合成,并辅以鲁棒的表示学习策略以缓解合成数据到真实数据之间的域偏移问题;相比传统2D生成方法易产生几何不一致或依赖类别特定模板的3D方法,该方案无需预定义模板即可在多样化类别上保持视角一致性,且能有效保留如携带物品等细粒度信息,从而显著提升SV AG-ReID性能。

链接: https://arxiv.org/abs/2604.26520
作者: William Grolleau,Astrid Sabourin,Guillaume Lapouge,Catherine Achard
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Aerial-Ground Re-Identification (AG-ReID) is constrained by the viewpoint-domain gap, as drastic viewpoint disparities occlude or distort discriminative features, making cross-viewpoint image retrieval challenging. While existing methods rely on paired cross-view annotations, real-world deployments, such as wilderness search-and-rescue (SAR), often lack target-domain data, requiring retrieval from ground-level references alone. To our knowledge, we are the first to address this challenge by formalizing the Single-View AG-ReID (SV AG-ReID) setting, where models trained on a single real viewpoint must generalize to an unseen viewpoint. We propose 3D Lifting-based Elevated Novel-view Synthesis (3D-LENS), a unified framework combining geometrically-consistent novel view synthesis that leverages large-scale 3D mesh reconstruction, with a robust representation learning scheme to mitigate synthetic-to-real bias. Unlike 2D generative baselines that suffer from geometric inconsistencies or prior 3D methods that are restricted to class-specific templates, our approach ensures view-consistent synthesis across diverse categories without predefined templates that fail to capture fine-grained details, such as carried objects. Extensive experiments demonstrate that our method achieves state-of-the-art performance on SV AG-ReID scenarios. Code and data will be released at this https URL.

[CV-31] GIFGuard: Proactive Forensics against Deepfakes in Facial GIFs via Spatiotemporal Watermarking

【速读】:该论文旨在解决深度伪造(Deepfake)技术对图形交换格式(GIF)图像真实性构成的威胁,现有主动取证方法主要针对静态图像,难以有效应用于具有时序特性的动画GIF。解决方案的关键在于提出GIFGuard,首个面向GIF的时空水印框架,其核心创新包括:在嵌入阶段采用时空自适应残差编码器(STARE),通过3D卷积与自适应通道重校准捕捉全局一致的时序依赖关系,提升对高层语义篡改的鲁棒性;在提取阶段设计深度完整性恢复解码器(DIRD),利用时空沙漏(hourglass)架构结合3D注意力机制恢复隐含特征,实现严重人脸篡改下水印信号的准确提取。

链接: https://arxiv.org/abs/2604.26519
作者: Shupeng Che,Zhiqing Guo,Changtao Miao,Dan Ma,Gaobo Yang
机构: Xinjiang University (新疆大学); Ant Group (蚂蚁集团); Hunan University (湖南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid evolution of deepfake technology poses an unprecedented threat to the authenticity of Graphics Interchange Format (GIF) imagery, which serves as a representative of short-loop temporal media in social networks. However, existing proactive forensics works are designed for static images, which limits their applicability to animated GIFs. To bridge this gap, we propose GIFGuard, the first spatiotemporal watermarking framework tailored for deepfake proactive forensics in GIFs. In the embedding stage, we propose the Spatiotemporal Adaptive Residual Encoder (STARE) to ensure robustness against high-level semantic tampering. It employs a 3D convolutional backbone with adaptive channel recalibration to capture globally coherent temporal dependencies. In the extraction stage, we design the Deep Integrity Restoration Decoder (DIRD). It utilizes a spatiotemporal hourglass architecture equipped with 3D attention to restore latent features, allowing for the accurate extraction of watermark signals even under severe facial manipulation. Furthermore, we construct GIFfaces, the first large-scale benchmark dataset curated for GIF proactive forensics to facilitate research in this domain. Extensive results show that GIFGuard achieves high-fidelity visual quality and remarkable robustness performance against deepfakes. Related code and dataset will be released.

[CV-32] MTCurv: Deep learning for direct microtubule curvature mapping in noisy fluorescence microscopy images ICPR

【速读】:该论文旨在解决从噪声荧光显微图像中准确提取微管(microtubule)曲率这一挑战性问题,传统方法依赖于易受误差影响的分割流程,在低信噪比和部分可见性条件下性能受限。其关键解决方案是提出MTCurv——一种无需分割的深度学习框架,将曲率估计重构为回归任务,并基于带有像素级曲率标注的合成数据集训练一个基于注意力机制的残差U-Net模型;同时设计了一种梯度感知损失函数(gradient-aware loss),结合均方误差与梯度一致性项,以减少伪影并增强空间一致性,从而在复杂成像条件下仍能高精度恢复局部微管曲率。

链接: https://arxiv.org/abs/2604.26517
作者: Achraf Ait Laydi,Sidi Mohamed Sid’El Moctar,Yousef El Mourabit,Hélène Bouvrais
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cell Behavior (q-bio.CB)
备注: Accepted for presentation at the International Conference on Pattern Recognition (ICPR) 2026

点击查看摘要

Abstract:Accurate quantification of the geometry of curvilinear biological structures is essential for understanding cellular mechanics and disease-related morphological alterations. Microtubule curvature is a key descriptor of filament rigidity and mechanical perturbations. However, reliable curvature extraction from fluorescence microscopy images remains challenging due to noise, low contrast, and partial filament visibility. Existing approaches rely on segmentation pipelines with pre or post-processing, which are highly sensitive to segmentation errors and often fail under adverse imaging conditions. In this work, we propose MTCurv, a deep learning framework for direct, segmentation-free regression of microtubule curvature maps from noisy microscopy images. Leveraging a synthetic dataset with pixel-wise curvature annotations, we reformulated curvature estimation as a regression problem and adapted an attention-based residual U-Net. To reduce hallucinations and enforce spatial coherence, we introduced a gradient-aware loss combining Mean Squared Error with a gradient consistency term. Beyond model and loss design, we evaluated commonly used regression and image quality metrics, revealing that many perceptual and blind metrics are poorly suited for curvature estimation. Correlation-based metrics, particularly Spearman correlation, emerged as more reliable indicators of curvature prediction quality. Experiments on two datasets of increasing difficulty demonstrated that MTCurv accurately recovers local microtubule curvatures, even in the presence of background fluorescence. Ablation studies highlighted the contribution of both residual encoding and attention-based decoding. Overall, this work provides a practical tool for filament curvature analysis and methodological insights for geometry-aware regression in biomedical imaging. Datasets and code are made available.
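摘要中提到的"梯度感知损失"(MSE + 梯度一致性项)思路可以用一段极简 NumPy 代码勾勒。以下仅为示意性草图:函数名与权重 `lam` 均为假设,并非论文的原始实现。

```python
import numpy as np

def gradient_aware_loss(pred, target, lam=0.1):
    """均方误差 + 梯度一致性项的示意实现(lam 为假设的权重系数)。"""
    mse = np.mean((pred - target) ** 2)
    # np.gradient 对 2D 数组返回 [dy, dx] 两个方向的差分
    gy_p, gx_p = np.gradient(pred)
    gy_t, gx_t = np.gradient(target)
    grad_term = np.mean((gy_p - gy_t) ** 2 + (gx_p - gx_t) ** 2)
    return mse + lam * grad_term
```

梯度项惩罚的是预测曲率图与真值在空间变化上的不一致,因此即便整体偏移相同,局部"幻觉"边缘也会被额外惩罚,这与摘要中"减少伪影并增强空间一致性"的动机对应。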

[CV-33] 3D Generation for Embodied AI and Robotic Simulation: A Survey

【速读】:该论文旨在解决 embodied AI(具身智能)与机器人系统在仿真训练和真实世界部署中对可扩展、多样化且物理可信的3D内容日益增长的需求问题。当前3D生成技术虽快速发展,但其在具身应用中的挑战远超视觉真实感:生成对象需具备运动学结构与材料属性,场景须支持交互与任务执行,并需弥合仿真与现实之间的差距。解决方案的关键在于将3D生成的作用划分为三个核心角色:作为数据生成器(Data Generator),产出可直接用于仿真的刚体、物理锚定及可变形对象;作为仿真环境构建者(Simulation Environments),构造结构感知、可控且具备代理行为的任务导向场景;以及作为Sim2Real桥梁(Sim2Real Bridge),通过数字孪生重建、数据增强和合成示范支持机器人学习与现实迁移。论文指出,领域正从追求视觉真实向交互就绪性转变,并识别出物理标注不足、几何质量与物理有效性不一致、评估体系碎片化及持续存在的sim-to-real鸿沟等关键瓶颈,亟需突破以使3D生成成为具身智能的可靠基础。

链接: https://arxiv.org/abs/2604.26509
作者: Tianwei Ye,Yifan Mao,Minwen Liao,Jian Liu,Chunchao Guo,Dazhao Du,Quanxin Shou,Fangqi Zhu,Song Guo
机构: Hong Kong University of Science and Technology (香港科技大学); Wuhan University (武汉大学); Harbin Institute of Technology (哈尔滨工业大学); Xinjiang University (新疆大学); Tencent (腾讯)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 11 figures, 8 tables. Project Page: this https URL

点击查看摘要

Abstract:Embodied AI and robotic systems increasingly depend on scalable, diverse, and physically grounded 3D content for simulation-based training and real-world deployment. While 3D generative modeling has advanced rapidly, embodied applications impose requirements far beyond visual realism: generated objects must carry kinematic structure and material properties, scenes must support interaction and task execution, and the resulting content must bridge the gap between simulation and reality. This survey presents the first survey of 3D generation for embodied AI and organizes the literature around three roles that 3D generation plays in embodied systems. In \emphData Generator, 3D generation produces simulation-ready objects and assets, including articulated, physically grounded, and deformable content for downstream interaction; in \emphSimulation Environments, it constructs interactive and task-oriented worlds, spanning structure-aware, controllable, and agentic scene generation; and in \emphSim2Real Bridge, it supports digital twin reconstruction, data augmentation, and synthetic demonstrations for downstream robot learning and real-world transfer. We also show that the field is shifting from visual realism toward interaction readiness, and we identify the main bottlenecks, including limited physical annotations, the gap between geometric quality and physical validity, fragmented evaluation, and the persistent sim-to-real divide, that must be addressed for 3D generation to become a dependable foundation for embodied intelligence. Our project page is at this https URL.

[CV-34] Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models

【速读】:该论文旨在解决在资源受限的边缘设备上部署视觉语言模型(Vision-Language Models, VLMs)所面临的计算与内存瓶颈问题,同时避免将全部推理任务迁移至云端导致的高延迟和带宽压力。其核心挑战在于如何在动态网络条件下实现高效、灵活的边缘-云协同推理,而非依赖固定尺寸的特征表示。解决方案的关键在于提出一种渐进式语义通信框架,通过一个元自编码器(Meta AutoEncoder)将视觉token压缩为可逐步精化的自适应表示,从而支持不同信息层级的灵活传输,在通信成本与语义保真度之间提供可控权衡。该设计无需对现成VLM进行额外微调,即可实现即插即用的部署,并已在NXP i.MX95嵌入式平台与GPU服务器间完成端到端验证,实测表明在1 Mbps上行带宽下显著优于全边缘或全云端方案,且保持高语义一致性。

链接: https://arxiv.org/abs/2604.26508
作者: Cyril Shih-Huan Hsu,Wig Yuan-Cheng Cheng,Chrysa Papagianni
机构: Informatics Institute, University of Amsterdam, The Netherlands(阿姆斯特丹大学信息学院,荷兰); Open-EP Community(开放EP社区)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
备注: Under review. Extended version with additional figures and appendices

点击查看摘要

Abstract:Deploying Vision-Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed the capabilities of resource-constrained embedded platforms. Conversely, fully offloading inference to the cloud is often impractical in bandwidth-limited environments, where transmitting raw visual data introduces substantial latency overhead. While recent edge-cloud collaborative architectures attempt to partition VLM workloads across devices, they typically rely on transmitting fixed-size representations, lacking adaptability to dynamic network conditions and failing to fully exploit semantic redundancy. In this paper, we propose a progressive semantic communication framework for edge-cloud VLM inference, using a Meta AutoEncoder that compresses visual tokens into adaptive, progressively refinable representations, enabling plug-and-play deployment with off-the-shelf VLMs without additional fine-tuning. This design allows flexible transmission at different information levels, providing a controllable trade-off between communication cost and semantic fidelity. We implement a full end-to-end edge-cloud system comprising an embedded NXP i.MX95 platform and a GPU server, communicating over bandwidth-constrained networks. Experimental results show that, at 1 Mbps uplink, the proposed progressive scheme significantly reduces network latency compared to full-edge and full-cloud solutions, while maintaining high semantic consistency even under high compression. The implementation code will be released upon publication at this https URL.
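"可逐步精化表示"的核心是:系数按重要性排序后,接收端只需前缀即可得到粗重建,带宽允许时再补发剩余系数提升保真度。下面用 SVD 正交基做一个极简示意(并非论文的 Meta AutoEncoder,数据与维度均为假设):

```python
import numpy as np

def make_basis(data):
    """用 SVD 学一个正交基,模拟"可逐步精化"的潜在表示(示意,非 Meta AutoEncoder)。"""
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    return vt  # 行向量为正交基,按奇异值大小排序

def encode(x, basis):
    return basis @ x  # 系数大致按重要性降序排列

def decode_prefix(code, basis, k):
    """只用前 k 个系数重建:k 越小,传输代价越低、失真越大。"""
    return basis[:k].T @ code[:k]
```

由于基是正交的,前缀重建误差随 k 单调不增,这正是摘要所说"通信成本与语义保真度之间可控权衡"的最简形式。
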

[CV-35] Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models

【速读】:该论文旨在解决扩散模型在使用无分类器引导(Classifier-Free Guidance, CFG)时面临的“细节-伪影困境”(detail-artifact dilemma)问题,即低引导尺度无法充分注入语义细节,而高引导尺度则导致结构退化、色彩过饱和及视频时序不一致等现象。解决方案的关键在于从微分几何视角揭示CFG本质上是一种切向线性外推过程,由于数据流形具有高度曲率,这种均匀的线性步长会引入严重的正交偏差。基于此理论洞察,作者提出无需训练且计算开销几乎为零的空间自适应多引导(Spatial Adaptive Multi Guidance, SAMG)算法,通过动态计算逐点条件引导能量,在高能边界区域采用保守最小引导尺度以保留精细微纹理,而在低能区域使用激进最大引导尺度以最大化语义注入,从而有效平衡生成内容的语义准确性与结构保真度。

链接: https://arxiv.org/abs/2604.26503
作者: Haosen Li,Wenshuo Chen,Lei Wang,Shaofeng Liang,Bowen Tian,Soning Lai,Yutao Yue
机构: The Hong Kong University of Science and Technology (Guangzhou); Griffith University; Data61/CSIRO
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have achieved remarkable success in synthesizing complex static and temporal visuals, a breakthrough largely driven by Classifier-Free Guidance (CFG). However, despite its pivotal role in aligning generated content with textual prompts, standard CFG relies on a globally uniform scalar. This homogeneous amplification traps models in a well-documented “detail-artifact dilemma”: low guidance scales fail to inject intricate semantics, while high scales inevitably cause structural degradation, color over-saturation, and temporal inconsistencies in videos. In this paper, we expose the physical root of this flaw through the lens of differential geometry. By analyzing Tweedie’s Formula, we reveal that CFG intrinsically performs a tangential linear extrapolation. Because the natural data manifold is highly curved, this uniform linear step introduces a severe orthogonal deviation. To keep the generation trajectory safely bounded, we formulate a theoretical upper bound for spatial and adaptive guidance. Based on these geometric insights, we propose Spatial Adaptive Multi Guidance (SAMG), a training-free and virtually zero-cost sampling algorithm. SAMG dynamically computes point-wise conditional guidance energy, applying a conservative minimum scale to high-energy boundary regions to preserve delicate micro-textures, while deploying an aggressive maximum scale in low-energy regions to maximize semantic injection. Extensive experiments across diverse image (SD 1.5, SDXL, SD3.5 Medium) and video (CogVideoX, ModelScope) architectures demonstrate that SAMG effectively resolves the detail-artifact dilemma, achieving superior semantic alignment, structural integrity, and temporal smoothness without any computational overhead.
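摘要中"逐点条件引导能量 → 空间自适应引导尺度"的机制可以用几行 NumPy 勾勒:能量高的(边界)位置用保守的小尺度,能量低的位置用激进的大尺度。以下仅为思路示意,`s_min`、`s_max` 的取值与能量归一化方式均为假设:

```python
import numpy as np

def samg_guidance(uncond, cond, s_min=2.0, s_max=9.0):
    """逐点自适应 CFG:条件引导能量越高(边界区域)尺度越小,反之越大。
    s_min/s_max 与归一化方式为示意假设,非论文原式。"""
    energy = np.abs(cond - uncond)
    e = (energy - energy.min()) / (energy.max() - energy.min() + 1e-8)
    scale = s_max - (s_max - s_min) * e  # 能量越高,尺度越接近 s_min
    return uncond + scale * (cond - uncond)

# 对比:标准 CFG 相当于 scale 为全局常数
```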

[CV-36] Robust Alignment: Harmonizing Clean Accuracy and Adversarial Robustness in Adversarial Training CVPR’26

【速读】:该论文旨在解决对抗训练(Adversarial Training, AT)中普遍存在但尚未被充分理解的“干净准确率与对抗鲁棒性之间的权衡问题”(accuracy-robustness trade-off)。研究发现,对于决策边界附近的样本,改变输入扰动强度对模型鲁棒性影响极小,这揭示了准确率和鲁棒性波动间的不一致性,并进一步识别出输入空间与潜在表示空间(latent space)之间语义不对齐是导致该权衡的核心原因。解决方案的关键在于提出一种新的对抗训练目标——鲁棒对齐(Robust Alignment),其核心思想是:在最终预测标签不变的前提下,促使模型感知随输入扰动而变化,从而增强输入与潜在空间的一致性。具体实现包含两个创新机制:一是为边界样本采用固定且较小的扰动强度,使扰动成为可学习模式而非噪声;二是设计基于理论推导的域插值一致性对抗正则化(Domain Interpolation Consistency Adversarial Regularization, DICAR),显式引入输入与潜在空间的语义对齐约束。由此构建的鲁棒对齐对抗训练方法(RAAT)在多个基准数据集和网络结构上显著优于现有主流方法,有效缓解了准确率与鲁棒性之间的矛盾。

链接: https://arxiv.org/abs/2604.26496
作者: Yanyun Wang,Qingqing Ye,Li Liu,Zi Liang,Haibo Hu
机构: HK PolyU (香港理工大学); HKUST (GZ) (香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition - Findings Track (CVPR’26 Findings)

点击查看摘要

Abstract:Adversarial Training (AT) is one of the most effective methods for developing robust deep neural networks (DNNs). However, AT faces a trade-off problem between clean accuracy and adversarial robustness. In this work, we reveal a surprising phenomenon for the first time: Varying input perturbation intensities for training samples near decision boundaries in AT have minimal impact on model robustness. This finding directly exposes the inconsistency between accuracy and robustness score fluctuations, leading us to identify the misalignment between input and latent spaces as a critical driver of the robustness-accuracy trade-off. To mitigate this misalignment for harmonizing accuracy and robustness, we define Robust Alignment as a new AT target, encouraging the model perception to change with input perturbations provided the final label prediction remains unchanged, which can be achieved via two novel ideas. First, we suggest a reduced and fixed perturbation intensity for those boundary samples, which encourages the model to utilize the perturbations as learnable patterns, instead of noises that complicate decision boundaries meaninglessly. Second, we propose a Domain Interpolation Consistency Adversarial Regularization (DICAR), based on rigorous theoretical derivations, which explicitly introduces semantic alignment between input and latent spaces into AT. Based on these two ideas, we end up with a new Robust Alignment Adversarial Training (RAAT) method, effectively harmonizing accuracy and robustness. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet with ResNet-18, PreActResNet-18, and WideResNet-28-10 demonstrate the effectiveness of RAAT in improving the trade-off beyond four common baselines and a total of 14 related state-of-the-art (SOTA) works.
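"插值一致性"类正则的一般形态是:惩罚"先插值再编码"与"先编码再插值"的偏差,从而约束输入空间与潜在空间的对齐。下面是这一思想的极简示意(并非论文 DICAR 的原始形式,映射 `f` 与 `alpha` 均为假设):

```python
import numpy as np

def interpolation_consistency(f, x1, x2, alpha=0.5):
    """度量 f(a*x1+(1-a)*x2) 与 a*f(x1)+(1-a)*f(x2) 的偏差:
    偏差越小,输入空间与潜在空间的语义越对齐(示意,非 DICAR 原式)。"""
    mixed = alpha * x1 + (1 - alpha) * x2
    target = alpha * f(x1) + (1 - alpha) * f(x2)
    return float(np.mean((f(mixed) - target) ** 2))
```

仿射映射天然满足该一致性(损失为 0),而一般非线性编码器会产生偏差,正则项因此会把表示几何往"插值可交换"的方向约束。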

[CV-37] Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners CVPR2026

【速读】:该论文旨在解决现有视觉基础模型在像素级推理任务中缺乏有效嵌入时空属性表示的问题。当前方法要么基于图像预训练任务,无法捕捉动态变化;要么依赖视频序列进行动作级推理,难以扩展至密集像素级预测。其解决方案的关键在于提出一种名为LILA的框架,该框架通过线性上下文学习(linear in-context learning)机制,利用现成网络估计的深度图和运动图作为时空线索,从未经标注的视频数据中学习像素级特征描述符,从而在时序上保持一致性地嵌入语义与几何信息,显著提升了多种视觉任务(如视频目标分割、表面法向量估计和语义分割)的性能表现。

链接: https://arxiv.org/abs/2604.26488
作者: Nikita Araslanov,Martin Sundermeyer,Hidenobu Matsuki,David Joseph Tan,Federico Tombari
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: To appear at CVPR 2026 (oral). Project website: this https URL

点击查看摘要

Abstract:One of the most exciting applications of vision models involve pixel-level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level. Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction. We present a framework that learns pixel-accurate feature descriptors from videos, LILA. The core element of our training framework is linear in-context learning. LILA leverages spatio-temporal cue maps – depth and motion – estimated with off-the-shelf networks. Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner. We demonstrate compelling empirical benefits of the learned representation across a diverse suite of vision tasks: video object segmentation, surface normal estimation and semantic segmentation.
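"线性上下文学习"最朴素的形态是对上下文像素做闭式岭回归:用 (特征, 线索值) 对拟合一个线性读出头,再对查询像素预测线索。以下仅为该思想的极简示意,特征维度、正则系数与函数名均为假设,与 LILA 的实际训练目标无关:

```python
import numpy as np

def linear_in_context_probe(feats, cues, lam=1e-6):
    """闭式岭回归读出头:w = (X^T X + lam*I)^{-1} X^T y(示意实现)。"""
    d = feats.shape[1]
    w = np.linalg.solve(feats.T @ feats + lam * np.eye(d), feats.T @ cues)
    return w  # 新像素的线索预测为 feats_new @ w
```

若像素特征确实线性地编码了深度/运动等线索,这样的探针就能以很小的上下文恢复它们,这也是用线性探针衡量"特征描述符质量"的常见做法。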

[CV-38] Cross-Domain Transfer of Hyperspectral Foundation Models ICPR2026

【速读】:该论文旨在解决高光谱成像(Hyperspectral Imaging, HSI)语义分割在真实场景中因域内训练数据有限而导致模型性能受限的问题。现有方法多依赖跨模态迁移,通过RGB与HSI之间的桥梁利用视觉基础模型,但这类方法往往丢失光谱信息或引入复杂架构。本文提出跨域迁移(cross-domain transfer)作为替代方案,其关键在于复用原本在遥感领域训练的HSI基础模型,直接应用于近距感知任务,从而避免模态间隙带来的信息损失,并保持简洁的网络结构。实验表明,该策略在数据稀缺条件下显著优于传统域内训练方法,且性能接近跨模态方法,推动了HSI语义分割在多样化应用场景中的有效性提升。

链接: https://arxiv.org/abs/2604.26478
作者: Nick Theisen,Peer Neubert
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at ICPR 2026

点击查看摘要

Abstract:Hyperspectral imaging (HSI) semantic segmentation typically relies on in-domain training, but limited data availability often restricts model performance in real-world applications. Current approaches to leverage foundation models in proximal sensing use cross-modality techniques, bridging RGB and HSI to exploit vision foundation models. However, these methods either discard spectral information or introduce architectural complexity. We propose cross-domain transfer as an alternative, reusing HSI foundation models - originally trained in remote sensing - for proximal sensing applications. By eliminating the need to bridge modality gaps, our approach preserves spectral information while maintaining a simple architecture. Using the HS3-Bench benchmark, we systematically evaluate and compare conventional in-domain, in-modality training, cross-modality transfer and cross-domain transfer strategies. Our results demonstrate that cross-domain transfer achieves large performance improvements over in-domain, in-modality training, reduces the performance gap to cross-modality approaches and maintains strong performance in limited data settings. Thus, this work advances more effective HSI semantic segmentation in diverse applications.

[CV-39] A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows

【速读】:该论文旨在解决从长篇、多语言扫描的金融文档中可靠提取结构化信息的问题,这类文档在实际工业KYC(了解你的客户)和合规流程中至关重要。由于文档通常非机器可读、噪声大且视觉异质性强,同时仅包含稀疏的任务相关字段,直接使用端到端的视觉语言模型(Vision-Language Model, VLM)进行处理往往效果不稳定。解决方案的关键在于提出一个分阶段的提取框架,其核心创新是将页面定位与多模态推理分离:通过图像预处理、多语言光学字符识别(OCR)、混合页面级检索以及基于紧凑VLM的结构化提取模块协同工作,显著提升了复杂多页文档中的字段识别准确率,其中页面级检索被证明是性能提升的主要驱动力,尤其对非英文和复杂财务报表表现突出。

链接: https://arxiv.org/abs/2604.26462
作者: Yuxuan Han,Yuanxing Zhang,Yushuo Wang,Yichao Jin,Kenneth Zhu Ke,Jingyuan Zhao
机构: OCBC, Singapore
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Structured information extraction from long, multilingual scanned financial documents is a core requirement in industrial KYC and compliance workflows. These documents are typically non machine readable, noisy, and visually heterogeneous. They usually span dozens of pages while containing only sparse task relevant information. Although recent vision-language models achieve strong benchmark performance, directly applying them end to end to full financial reports often leads to unreliable extraction under real world conditions. We present a multistage extraction framework that integrates image preprocessing, multilingual OCR, hybrid page-level retrieval, and compact VLM-based structured extraction. The design separates page localization from multimodal reasoning, enabling more accurate extraction from complex multipage documents. We evaluated the framework on 120 production KYC documents comprising about 3000 multilingual scanned pages. Across multiple OCR-VLM combinations, the proposed pipeline consistently outperforms direct PDF-to-VLM baselines, improving field-level accuracy by up to 31.9 percentage points. The best configuration, PaddleOCR with MiniCPM2.6, achieves 87.27 percent accuracy. Ablation studies show that page-level retrieval is the dominant factor in performance improvements, particularly for complex financial statements and non-English documents.
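摘要强调"页级检索是性能提升的主要因素":先从几十页中定位少数相关页,再交给 VLM 做结构化抽取。下面用最简单的关键词计分勾勒这一"先定位、再推理"的分离思路;真实系统是稀疏+稠密混合检索,此处的函数名、`top_k` 与示例页面均为假设:

```python
def score_pages(pages, query_terms, top_k=3):
    """示意的页级检索:按查询词命中数排序,仅把 top_k 个相关页送入后续 VLM。"""
    scored = sorted(
        ((sum(t.lower() in page.lower() for t in query_terms), i)
         for i, page in enumerate(pages)),
        reverse=True)
    return [i for score, i in scored[:top_k] if score > 0]
```

这样 VLM 的输入从数十页缩减到少数候选页,既降低成本,也避免无关页面稀释多模态推理。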

[CV-40] PKS⁴: Parallel Kinematic Selective State Space Scanners for Efficient Video Understanding

【速读】:该论文旨在解决长视频理解中因密集时空注意力机制导致的二次计算复杂度问题,以及现有参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法在反向传播时引发的激活内存开销过大、状态空间模型(State Space Model, SSM)破坏二维空间结构等问题。其解决方案的关键在于提出一种并行运动学选择性状态空间扫描器(Parallel Kinematic Selective State Space Scanners, PKS⁴),该模块通过引入运动学先验编码器(Kinematic Prior Encoder)提取帧间位移与运动边界信息,驱动线性复杂度的状态空间模型自适应调节更新速率和读写策略;同时,在时间维度上对每个空间位置部署并行扫描器,从而在保持二维空间结构的同时显著降低计算与内存开销,实现高效且准确的视频理解。

链接: https://arxiv.org/abs/2604.26461
作者: Lingjie Zeng,Hailun Zhang,Xiwen Wang,Qijun Zhao
机构: Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Temporal modeling remains a fundamental challenge in video understanding, particularly as sequence lengths scale. Traditional video models relying on dense spatiotemporal attention suffer from quadratic computational costs for long videos. To circumvent these costs, recent approaches adapt image models for videos via Parameter-Efficient Fine-Tuning (PEFT) methods such as adapters. However, deeply inserting these modules incurs prohibitive activation memory overhead during back-propagation. While recent efficient State Space Models (SSMs) introduce linear complexity, they disrupt 2D spatial relationships and rely on extensive masked pre-training to recover spatial awareness. To overcome these limitations, we propose Parallel Kinematic Selective State Space Scanners (PKS⁴). We retain a standard 2D vision backbone for spatial semantics and insert a single plug-and-play PKS⁴ module with linear-complexity temporal scanning, avoiding temporal attention and multi-layer adapters. We first extract kinematic priors via a Kinematic Prior Encoder, which captures local displacements and motion boundaries through inter-frame correlations and differences. These priors drive linear-complexity SSMs to track underlying kinematic states, adaptively modulating update speeds and read-write strategies at each time step. Instead of global scanning, we deploy parallel scanners along the temporal dimension for each spatial location, preserving spatial structures while reducing overhead. Experiments on spatial-heavy and temporal-heavy action recognition benchmarks show that PKS⁴ achieves state-of-the-art performance. Remarkably, our method converges in merely 20 epochs, achieving approximately 10× lower training compute than pure video SSMs, establishing a new paradigm for efficient video understanding.
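摘要所述的"沿时间维、每个空间位置一条扫描"的选择性状态空间递推,核心就是 h_t = a_t ⊙ h_{t-1} + b_t ⊙ x_t。以下为示意实现:真实模型中 a、b 由运动学先验逐步生成(即"选择性"),此处作为输入直接给出:

```python
import numpy as np

def selective_scan(x, a, b):
    """对角选择性 SSM 扫描:h_t = a_t * h_{t-1} + b_t * x_t(逐元素)。
    x 形状 (T, N):T 帧,N 个空间位置,各位置独立并行一条扫描。"""
    h = np.zeros_like(x[0])
    ys = []
    for t in range(len(x)):
        h = a[t] * h + b[t] * x[t]
        ys.append(h.copy())
    return np.stack(ys)
```

a_t 控制历史状态的遗忘速度,b_t 控制当前帧的写入强度,二者随输入变化即实现了摘要中"自适应调节更新速率和读写策略"。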

[CV-41] Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation

【速读】:该论文旨在解决单目深度估计(Monocular Depth Estimation, MDE)任务中因几何信息在视觉基础模型(Vision Foundation Models, VFMs)各层分布不均而导致的性能瓶颈问题。现有方法通常采用固定间隔采样中间Transformer层构建多尺度特征,隐含假设几何信息在各层均匀分布,但实际可能造成对深层结构3D线索的利用不足。解决方案的关键在于提出一种基于层级分析的“最后一层中心特征重组”(Last-Layer-Centric Feature Recombination, LFR)模块:通过系统分析DINOv3发现深层具有更强的深度可预测性和更好的跨样本几何变化捕捉能力,进而以最后一层作为几何锚点,依据最小相似性准则自适应选择互补的中间层特征,并通过紧凑线性融合机制将其与末层表示结合,从而显著提升几何表达能力,在多个基准上实现最优MDE性能。

链接: https://arxiv.org/abs/2604.26454
作者: Gongshu Wang,Zhirui Wang,Kan Yang
机构: Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18page, 6 figure, 6 table

点击查看摘要

Abstract:Monocular depth estimation (MDE) is a fundamental yet inherently ill-posed task. Recent vision foundation models (VFMs), particularly DINO-based transformers, have significantly improved accuracy and generalization for dense prediction. Prior works generally follow a unified paradigm: sampling a fixed set of intermediate transformer layers at uniform intervals to build multi-scale features. This common practice implicitly assumes that geometric information is uniformly distributed across layers, which may underutilize the structural 3D cues encoded in VFMs. In this study, we present a systematic layer-wise analysis of DINOv3, revealing that 3D information is distributed non-uniformly: deeper layers exhibit stronger depth predictability and better capture inter-sample geometric variation. Motivated by this, we introduce a Last-Layer-Centric Feature Recombination (LFR) module to enhance geometric expressiveness. LFR treats the final layer as a geometric anchor and adaptively selects complementary intermediate layers according to a minimal-similarity criterion. Selected features are fused with the last-layer representation via compact linear fusion. Extensive experiments show that the LFR module consistently improves MDE accuracy and achieves state-of-the-art performance. Our analysis sheds light on how geometric knowledge is organized within VFMs and offers an efficient strategy for unlocking their potential in dense 3D tasks.
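"以末层为锚、按最小相似性挑选互补中间层"的选择准则本身很简单:计算各中间层特征与末层的余弦相似度,取相似度最低的 k 层。以下为该准则的极简示意(输入形状、函数名均为假设,非论文实现):

```python
import numpy as np

def select_layers(layer_feats, k=2):
    """以末层特征为几何锚点,按与末层余弦相似度最小的准则挑选 k 个互补中间层。
    layer_feats: 各层特征向量组成的数组,末行为最后一层。返回所选中间层的下标。"""
    anchor = layer_feats[-1]
    norm = lambda v: v / np.linalg.norm(v)
    sims = [float(norm(f) @ norm(anchor)) for f in layer_feats[:-1]]
    return sorted(np.argsort(sims)[:k].tolist())
```

与末层最不相似的层携带的冗余信息最少,与锚点融合后信息增量最大,这是"互补性"的直接量化。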

[CV-42] Attribution-Guided Multimodal Deepfake Detection via Cross-Modal Forensic Fingerprints

【速读】:该论文旨在解决音频-视觉深度伪造(audio-visual deepfake)检测中因感知不可靠而导致的媒体完整性与生物特征安全威胁问题。现有方法多为二分类任务,易依赖数据集特异性伪影而非真实的生成痕迹,导致泛化能力差。其解决方案的关键在于提出 Attribution-Guided Multimodal Deepfake Detection (AMDD) 框架,通过引入生成器归属(generator attribution)作为结构化正则化项,强制模型在共享嵌入空间中编码具有判别性的伪造特征,而非学习捷径信号;同时设计 Cross-Modal Forensic Fingerprint Consistency (CMFFC) 损失函数,对齐视觉与音频流中由同一生成器引起的伪造指纹,利用语音与面部运动之间的物理耦合关系,提升跨模态一致性建模能力。

链接: https://arxiv.org/abs/2604.26453
作者: Wasim Ahmad,Wei Zhang,Xuerui Mao
机构: University of Science and Technology of China (中国科学技术大学); National Natural Science Foundation of China (国家自然科学基金委员会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Audio-visual deepfakes have reached a level of realism that makes perceptual detection unreliable, threatening media integrity and biometric security. While multimodal detection has shown promise, most approaches are framed as binary classification tasks that often latch onto dataset-specific artifacts rather than genuine generative traces. We argue that a detector incapable of identifying how a video was forged is likely learning the wrong signal. Unlike binary detection, attribution-guided learning imposes a stronger geometric constraint on the shared embedding space, forcing the model to encode generator-specific forensic content rather than shortcuts. We propose the Attribution-Guided Multimodal Deepfake Detection (AMDD) framework, which jointly learns to detect and attribute manipulation. AMDD treats generator attribution as a structured regularization that constrains representation geometry toward forensically meaningful features. We introduce a Cross-Modal Forensic Fingerprint Consistency (CMFFC) loss to enforce alignment between generator-induced artifacts in visual and audio streams. This exploits the fact that coherent manipulation leaves correlated traces across modalities, grounded in the physical coupling between speech and facial articulation that synthetic pipelines routinely disrupt. Architecturally, we pair a ResNet50 with temporal attention for visual encoding against a pretrained ResNet18 for mel spectrograms, closing the encoder capacity gap found in prior models. On FakeAVCeleb, AMDD achieves 99.7% balanced accuracy and 99.8% AUC with 95.9% attribution accuracy. Cross-dataset evaluation on DeepfakeTIMIT, DFDM, and LAV-DF confirms that real video detection generalizes robustly, while fake detection on unseen generators remains an open challenge that we analyze in depth.
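CMFFC 损失要求同一样本在视觉与音频流中由生成器引起的指纹嵌入保持一致。下面给出一个以余弦对齐近似该思想的极简示意(仅依据摘要推测的假设性形式,并非论文给出的精确损失):

```python
import numpy as np

def cmffc_loss(f_vis, f_aud, eps=1e-8):
    """跨模态取证指纹一致性(CMFFC)损失的示意。
    f_vis / f_aud: [B, D] 同一批样本的视觉 / 音频指纹嵌入;
    以 1 - 余弦相似度的均值度量两模态指纹的错位程度。"""
    v = f_vis / (np.linalg.norm(f_vis, axis=1, keepdims=True) + eps)
    a = f_aud / (np.linalg.norm(f_aud, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - (v * a).sum(axis=1)))
```

训练时将该项与检测、归属两个分类损失相加即可得到联合目标;这里只示意对齐项本身。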

[CV-43] Are Data Augmentation and Segmentation Always Necessary? Insights from COVID-19 X-Rays and a Methodology Thereof

【速读】:该论文旨在解决现有基于胸部X光片(chest X-ray)的新冠肺炎(COVID-19)检测模型中两个关键问题:一是未考虑肺部分割(lung segmentation),可能导致模型依赖非目标区域进行预测,从而影响诊断可靠性;二是数据增强(data augmentation)策略不当,存在过度增强现象,导致模型过拟合(overfitting),降低泛化能力。解决方案的关键在于提出一种名为SDL-COVID的方法,其核心包括:利用类激活映射(class activation mapping, CAM)验证肺部区域在预测中的必要性,确保模型关注病灶区域;并通过对比有无数据增强的模型表现,识别出最优增强阈值,避免因过度增强引发的过拟合,最终实现高精度(95.21%精确率)和低假阴性率的可靠分类性能。

链接: https://arxiv.org/abs/2604.26437
作者: Aman Swaraj,Arnav Agarwal,Hitendra Singh Bhadouria,Sandeep Kumar,Karan Verma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: Rapid and reliable diagnostic tools are crucial for managing respiratory diseases like COVID-19, where chest X-ray analysis coupled with artificial intelligence techniques has proven invaluable. However, most existing works on X-ray images have not considered lung segmentation, raising concerns about their reliability. Additionally, some have employed disproportionate and impractical augmentation techniques, making models less generalized and prone to overfitting. This study presents a critical analysis of both issues and proposes a methodology (SDL-COVID) for more reliable classification of chest X-rays for COVID-19 detection. Methods: We use class activation mapping to obtain a visual understanding of the predictions made by Convolutional Neural Networks (CNNs), validating the necessity of lung segmentation. To analyze the effect of data augmentation, deep learning models are implemented on two levels: one for an augmented dataset and another for a non-augmented dataset. Results: Careful analysis of X-ray images and their corresponding heat maps under expert medical supervision reveals that lung segmentation is necessary for accurate COVID-19 prediction. Regarding data augmentation, test accuracy significantly drops beyond a certain threshold with additional augmented images, indicating model overfitting. Conclusion: Our proposed methodology, SDL-COVID, achieves a precision of 95.21% and a lower false negative rate, ensuring its reliability for COVID-19 detection using chest X-rays.

[CV-44] QYOLO: Lightweight Object Detection via Quantum Inspired Shared Channel Mixing

【速读】:该论文旨在解决单阶段目标检测模型中深层骨干网络(backbone)因C2f瓶颈模块在高步长层级上通道数激增而导致的计算资源过度消耗问题,其核心挑战在于如何在不显著牺牲检测精度的前提下实现架构压缩。解决方案的关键在于提出一种受量子启发的通道混合框架QYOLO,通过用一个紧凑的QMixBlock替代骨干网络中两个最深的C2f模块(分别位于P4/16和P5/32,通道数为512和1024),利用正弦混合机制实现跨阶段的全局通道重校准,并共享可学习参数以强制通道重要性的一致性,从而避免为每个阶段独立配置参数。此设计在保持颈部(neck)和检测头完全经典不变的基础上,实现了显著的参数量(最高达21.8%)与浮点运算量(GFLOPs)减少,同时仅带来极小的mAP@50性能损失(最低仅0.1 pp)。

链接: https://arxiv.org/abs/2604.26435
作者: Garvit Kumar Mittal,Sahil Tomar,Sandeep Kumar
机构: Ultralytics; Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:The rapid advancement of object detection architectures has positioned single stage detectors as the dominant solution for real-time visual perception. A primary source of computational overhead in these models lies in the deep backbone stages, where C2f bottleneck modules at high stride levels accumulate a disproportionate share of parameters due to quadratic scaling with channel width. This work introduces QYOLO, a quantum-inspired channel mixing framework that achieves genuine architectural compression by replacing the two deepest backbone C2f modules at P4/16 (512 channels) and P5/32 (1024 channels) with a compact QMixBlock. The proposed block performs global channel recalibration through a sinusoidal mixing mechanism with shared learnable parameters across both backbone stages, enforcing consistent channel importance without requiring independent per-stage parameter sets. The neck and detection head remain fully classical and unchanged. Evaluation on the VisDrone2019 benchmark demonstrates that QYOLOv8n achieves a 20.2% reduction in parameter count (3.01M to 2.40M) and 12.3% GFLOPs reduction with only 0.4 pp mAP@50 degradation. QYOLOv8s achieves 21.8% reduction with 0.1 pp degradation. When combined with knowledge distillation, full accuracy parity is recovered at no cost to compression. An expanded backbone plus neck variant achieved 38 to 41% reduction at the cost of greater accuracy degradation, motivating the backbone-only final design.

[CV-45] Delineating Knowledge Boundaries for Honest Large Vision-Language Models

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, VLMs)在面对未知或长尾领域问题时易产生事实性幻觉(factual hallucinations)以及缺乏拒绝能力的问题。其核心解决方案在于构建一个针对模型特异性的“Visual-Idk”数据集,通过多样本一致性探测(multi-sample consistency probing)区分已知与未知事实,并结合监督微调与偏好感知优化(如DPO、ORPO)对模型进行训练,从而明确其知识边界。实验表明,该方法将真实率(Truthful Rate)从57.9%提升至67.3%,且内部探针验证了模型具备真正的边界认知能力而非仅记忆拒绝模式,该框架在医疗和感知等分布外场景中也展现出良好的泛化性能,为构建更可信的视觉助手提供了有效路径。

链接: https://arxiv.org/abs/2604.26419
作者: Junru Song,Yimeng Hu,Yijing Chen,Huining Li,Qian Li,Lizhen Cui,Yuntao Du
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (VLMs) have achieved remarkable multimodal performance yet remain prone to factual hallucinations, particularly in long-tail or specialized domains. Moreover, current models exhibit a weak capacity to refuse queries that exceed their parametric knowledge. In this paper, we propose a systematic framework to enhance the refusal capability of VLMs when facing such unknown questions. We first curate a model-specific “Visual-Idk” (Visual-I don’t know) dataset, leveraging multi-sample consistency probing to distinguish between known and unknown facts. We then align the model using supervised fine-tuning followed by preference-aware optimization (e.g., DPO, ORPO) to effectively delineate its knowledge boundaries. Results on the Visual-Idk dataset show our method improves the Truthful Rate from 57.9% to 67.3%. Additionally, internal probing also demonstrates that the model genuinely recognizes its boundaries instead of just memorizing refusal patterns. Our framework further generalizes to out-of-distribution medical and perceptual domains, providing a robust path toward more trustworthy and prudent visual assistants.
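“多样本一致性探测”可以理解为:对同一视觉问题多次采样回答,若多数答案占比不足,则将该事实归入模型的 Visual-Idk(未知)集合。下面是一个示意性实现(阈值与函数名均为假设,非论文官方代码):

```python
from collections import Counter

def consistency_probe(sampled_answers, tau=0.7):
    """多样本一致性探测示意:多数答案占比 >= tau 视为“已知”,
    否则归入“未知”集合,供后续拒答对齐(SFT + DPO/ORPO)使用。"""
    top, cnt = Counter(sampled_answers).most_common(1)[0]
    ratio = cnt / len(sampled_answers)
    return ("known", top) if ratio >= tau else ("unknown", None)
```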

[CV-46] Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection CVPR2026

【速读】:该论文旨在解决视觉 Transformer(Vision Transformer, ViT)中分布外(out-of-distribution, OOD)检测的难题,尤其针对现有方法依赖纠缠特征表示而导致性能受限的问题。其解决方案的关键在于首次将稀疏自编码器(Sparse Autoencoder, SAE)应用于 ViT 的 [CLS] token 特征,并引入一种基于 Top-k 的 SAE 框架,以从密集特征中解耦出结构化的潜在空间。通过该框架,研究者发现分布内(in-distribution, ID)数据在类激活轮廓(Class Activation Profiles, CAPs)上具有稳定且类特定的激活模式,而分布外样本则系统性破坏这一结构;据此提出基于核心能量谱 divergence 的评分函数,量化偏离理想激活模式的程度,从而实现高鲁棒性的 OOD 检测,在 FPR95 和 AUROC 等关键指标上均表现优异。

链接: https://arxiv.org/abs/2604.26409
作者: Ahyoung Oh,Wonseok Shin,Songkuk Kim
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures, supplementary material included, CVPR 2026

点击查看摘要

Abstract:Sparse Autoencoders (SAEs) have demonstrated significant success in interpreting Large Language Models (LLMs) by decomposing dense representations into sparse, semantic components. However, their potential for analyzing Vision Transformers (ViTs) remains largely under-explored. In this work, we present the first application of SAEs to the ViT [CLS] token for out-of-distribution (OOD) detection, addressing the limitation of existing methods that rely on entangled feature representations. We propose a novel framework utilizing a Top-k SAE to disentangle the dense [CLS] features into a structured latent space. Through this analysis, we reveal that in-distribution (ID) data exhibits consistent, class-specific activation patterns, which we formalize as Class Activation Profiles (CAPs). Our study uncovers a key structural invariant: while ID samples preserve a stable pattern within CAPs, OOD samples systematically disrupt this structure. Leveraging this insight, we introduce a scoring function based on the divergence of core energy profiles to quantify the deviation from ideal activation profiles. Our method achieves strong results on the FPR95 metric, critical for safety-sensitive applications across multiple benchmarks, while also achieving competitive AUROC. Overall, our findings demonstrate that the sparse, disentangled features revealed by SAEs can serve as a powerful, interpretable tool for robust OOD detection in vision models.
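下面以 NumPy 勾勒 Top-k SAE 编码与基于类激活轮廓(CAP)偏离度打分的思路。这只是示意性草图:以总变差距离近似论文中的能量谱 divergence,权重矩阵与超参数均为假设:

```python
import numpy as np

def topk_sae_encode(x, W_enc, k=8):
    """Top-k 稀疏自编码器编码示意:ReLU 后仅保留激活最大的 k 个隐元。"""
    z = np.maximum(x @ W_enc, 0.0)
    z[np.argsort(z)[:-k]] = 0.0          # 置零非 Top-k 位置
    return z

def cap_ood_score(z, caps, eps=1e-8):
    """OOD 打分示意:取样本激活轮廓与各类 CAP 的最小偏离。
    ID 样本应贴近某一类的 CAP(分数低),OOD 样本系统性偏离(分数高)。"""
    p = z / (z.sum() + eps)
    return min(0.5 * np.abs(p - c / (c.sum() + eps)).sum() for c in caps)
```

实际方法作用于 ViT 的 [CLS] 特征,且 CAP 由训练集 ID 样本的稀疏激活统计而来;此处仅演示数据流。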

[CV-47] Decoupled Prototype Matching with Vision Foundation Models for Few-Shot Industrial Object Detection

【速读】:该论文旨在解决工业场景中少样本目标检测(few-shot object detection)的问题,即在对象库存频繁变化的情况下,如何利用极少量标注样本实现高效的目标检测。其解决方案的关键在于借助视觉基础模型(vision foundation models)构建类别原型(class prototypes),通过从少量参考样本中提取特征表示来表征新类别的语义信息;在推理阶段,使用分割模型生成候选区域并提取特征嵌入,再通过相似度匹配机制将这些嵌入与类别原型进行匹配,从而完成目标检测。该方法无需CAD模型或大规模标注数据集,仅需少量参考图像即可快速部署新物体,显著提升了检测性能(AP提升6.9%),适用于实际工业应用需求。

链接: https://arxiv.org/abs/2604.26404
作者: Hari Prasanth S. M.,Nilusha Jayawickrama,Risto Ojala
机构: Aalto University (阿尔托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This article is submitted to Journal of Intelligent Manufacturing, and is currently in under review

点击查看摘要

Abstract:Industrial object detection systems typically rely on large annotated datasets, which are expensive to collect and challenging to maintain in industrial scenarios where the inventory of objects changes frequently. This work addresses the challenge of few-shot object detection in such industrial scenarios, where only a limited number of labeled samples are available for newly introduced objects. We present a detection framework that leverages vision foundation models to recognize objects with minimal supervision. The method constructs class prototypes from a small set of reference samples by extracting feature representations. For a given query scene during inference, object regions are generated using a segmentation model, and feature embeddings are extracted and matched with class prototypes using similarity matching. We evaluate the detection method on three established industrial datasets from the Benchmark for 6D Object Pose Estimation (BOP), following the official 2D object detection evaluation protocol. We demonstrate competitive detection performance, improving AP by 6.9% compared to the state-of-the-art training-free detection methods. Furthermore, the presented method is able to onboard new objects using only a few reference images, without requiring any CAD models or large annotated datasets. These properties make the approach well-suited for real-world industrial applications.
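上述“原型构建 + 相似度匹配”的流程可用如下极简代码示意(假设嵌入已由基础模型提取好;类别名与阈值均为示例,非论文官方实现):

```python
import numpy as np

def build_prototypes(ref_embs):
    """由少量参考样本构建类别原型:逐类求均值并做 L2 归一化。
    ref_embs: {类别名: [n_i, D] 参考嵌入}"""
    protos = {}
    for cls, e in ref_embs.items():
        m = e.mean(axis=0)
        protos[cls] = m / np.linalg.norm(m)
    return protos

def match_regions(region_embs, protos, thresh=0.5):
    """对分割模型给出的候选区域嵌入做余弦匹配;
    最高相似度低于阈值时判为背景(None)。"""
    names = list(protos)
    P = np.stack([protos[c] for c in names])                    # [C, D]
    R = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    sims = R @ P.T                                              # [N, C]
    best = sims.argmax(axis=1)
    return [names[j] if sims[i, j] >= thresh else None
            for i, j in enumerate(best)]
```

新增类别时只需向 `ref_embs` 加入几张参考图的嵌入并重建原型,无需任何再训练,这正是该类免训练方法适合库存频繁变化场景的原因。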

[CV-48] A Multimodal Pre-trained Network for Integrated EEG-Video Seizure Detection

【速读】:该论文旨在解决小鼠癫痫模型中可靠癫痫发作检测的问题,现有单模态方法存在局限:基于视频的方法易受良性行为干扰,而仅依赖脑电图(EEG)的方法则易受发作期运动伪影影响。解决方案的关键在于提出一种多模态融合框架 EEGVFusion,其核心包括自监督 EEG 表征学习、时空视频编码、最优传输对齐以及双向交叉注意力机制,从而有效整合神经信号与行为证据,在保持完美事件敏感度的同时显著降低误报率(Event FAR)。

链接: https://arxiv.org/abs/2604.26379
作者: Tong Lu,Ke Xu,Zimo Zhang,Zitong Zhao,Danwei Weng,Ruiyu Wang,Miao Liu,Zizuo Zhang,Jingyi Yao,Yixuan Zhao,Wenchao Zhang,Min Wang,Guoming Luan,Minmin Luo,Zhifeng Yue
机构: Chinese Academy of Medical Sciences (中国医学科学院); Genans Biotechnology (Genans生物技术); Capital Medical University (首都医科大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reliable seizure detection in mouse models is essential for preclinical epilepsy research, yet manual review of synchronized video-EEG recordings is labor-intensive and single-modality systems fail for complementary reasons: video-based methods are easily confounded by benign behaviors, whereas EEG-based methods are vulnerable to ictal motion artifacts. We present EEGVFusion, a multimodal framework that combines self-supervised EEG representation learning, spatio-temporal video encoding, optimal-transport alignment, and bidirectional cross-attention to integrate neural and behavioral evidence. We also curate an expert-annotated dataset of synchronized EEG and video recordings comprising 93 sessions from 15 mice for training and evaluation. In the random-session split, EEGVFusion achieved a Balanced Accuracy of 0.9957 with perfect event sensitivity and an Event FAR of 0.6250 FP/h, indicating strong seizure detection performance with a low false-alarm burden. In a single held-out-subject evaluation with Subject 110 reserved for testing, EEGVFusion achieved a Balanced Accuracy of 0.9718 and reduced Event FAR from 2.7250 FP/h for the EEG-only counterpart to 0.4833 FP/h while preserving perfect event sensitivity. Targeted ablations further showed that EEG pre-training and OT alignment help reduce false alarms while preserving event sensitivity.

[CV-49] Topology-Aware Representation Alignment for Semi-Supervised Vision-Language Learning

【速读】:该论文旨在解决视觉-语言模型在特定领域泛化能力差的问题,尤其是现有半监督视觉-语言学习方法因局限于成对匹配而无法建模多模态表示流形的全局结构。其解决方案的关键在于提出拓扑感知的多模态表示对齐框架(Topology-Aware Multimodal Representation Alignment, ToMA),该框架利用持久同调(persistent homology)识别拓扑显著边(topologically salient edges),并通过可用的跨模态对应关系对齐这些边,从而同时捕捉连通性和循环结构信息,且无需构建二维单纯形(2-simplices)。ToMA通过融合H₀-death边与轻量级H₁-birth边,实现了更稳定且具有更高阶结构信号的跨模态对齐。

链接: https://arxiv.org/abs/2604.26370
作者: Junwon You,Mihyun Jang,Sangwoo Mo,Jae-Hun Jung
机构: KAIST(韩国科学技术院); POSTECH(浦项科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Algebraic Topology (math.AT)
备注: 30 pages, 10 figures, 24 tables

点击查看摘要

Abstract:Vision-language models have shown strong performance, but they often generalize poorly to specialized domains. While semi-supervised vision-language learning mitigates this limitation by leveraging a small set of labeled image-text pairs together with abundant unlabeled images, existing methods remain fundamentally pairwise and fail to model the global structure of multimodal representation manifolds. Existing topology-based alignment methods rely on persistence diagram matching, which neither guarantees geometric alignment nor utilizes the image-text pairing information central to vision-language learning. We propose Topology-Aware Multimodal Representation Alignment (ToMA), a framework that uses persistent homology to identify topologically salient edges and aligns them across modalities through available cross-modal correspondences. ToMA leverages both H_0-death edges and lightweight H_1-birth edges, allowing it to capture both connectivity and cycle structure without constructing 2-simplices. Experiments show that ToMA yields stable gains, with clear improvements on remote sensing and modest but consistent benefits on fashion retrieval. Additional analysis shows that ToMA is more stable than alternative topology-based objectives and that lightweight H_1-birth edges provide useful higher-order structural signals.

[CV-50] Seamless Indoor-Outdoor Mapping for INGENIOUS First Responders

【速读】:该论文旨在解决在大型自然灾害场景下,如何实现室内外无缝融合的高精度三维(3D)建模问题,尤其针对第一响应者(First Responder)对灾后环境快速认知的需求。其关键解决方案在于将自主飞行的航空测绘系统与便携式室内定位系统相结合:通过自动识别并地理参考(geo-referenced)的AprilTag标记点,实现地面系统在进入建筑前与世界坐标系的快速注册,从而无需依赖全局定位系统(如GNSS),即可生成与航拍点云精确配准的室内点云,最终实现实时协同可视化,构建连续一致的室内外一体化3D模型。

链接: https://arxiv.org/abs/2604.26368
作者: Jürgen Wohlfeil,Henry Meißner,Adrian Schischmanow,Thomas Kraft,Dirk Baumbach,Ines Ernst,Dennis Dahlke
机构: Institute of Optical Systems (OS), German Aerospace Center (DLR) Berlin, Germany
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In several applications it is desirable to have 3D models not only of the outdoor spaces but also of the building interiors. In the context of First Responder enhancement in large scale natural and man-made disasters, a method is presented to achieve this goal with a high degree of automation. To this end, an autonomously flying aerial mapping system is combined with a person-carried indoor positioning system. Automatically recognized markers (AprilTags) are geo-referenced by the aerial system and their coordinates are sent to the ground-based system. By looking at the AprilTags before entering the building, the ground-based system is registered to world coordinates. Without any further need for global positioning, it creates a point cloud from the indoor spaces that fits with the point cloud from the aerial view. This allows a co-visualization of both point clouds as a seamless indoor-outdoor 3D model in real time.

[CV-51] Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models CVPR2026

【速读】:该论文旨在解决扩散 Transformer(Diffusion Transformers, DiTs)在推理过程中因高采样成本导致的效率瓶颈问题。现有基于特征缓存(feature caching)的方法依赖于手工设计的预测公式,在激进跳步(aggressive skipping)场景下性能显著下降。解决方案的关键在于提出 L2P(Learnable Linear Predictor),一种数据驱动的缓存框架,其核心创新是将传统固定系数替换为可学习的每时间步权重,从而实现从历史特征轨迹中准确重建当前特征。L2P 仅需单张 GPU 上约 20 秒即可快速训练,并在 FLUX.1-dev 和 Qwen-Image 等模型上分别实现高达 4.55x 的浮点运算量(FLOPs)减少和 7.18x 加速比,同时保持高质量视觉输出,显著优于现有基线方法。

链接: https://arxiv.org/abs/2604.26365
作者: Zhirong Shen,Rui Huang,Jiacheng Liu,Chang Zou,Peiliang Cai,Shikang Zheng,Zhengyi Shi,Liang Feng,Linfeng Zhang
机构: Shanghai Jiao Tong University (上海交通大学); University of Electronic Science and Technology of China (电子科技大学); Shandong University (山东大学); Xiamen University (厦门大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:To address the high sampling cost of Diffusion Transformers (DiTs), feature caching offers a training-free acceleration method. However, existing methods rely on hand-crafted forecasting formulas that fail under aggressive skipping. We propose L2P (Learnable Linear Predictor), a simple data-driven caching framework that replaces fixed coefficients with learnable per-timestep weights. Rapidly trained in ~20 seconds on a single GPU, L2P accurately reconstructs current features from past trajectories. L2P significantly outperforms existing baselines: it achieves a 4.55x FLOPs reduction and 4.15x latency speedup on FLUX.1-dev, and maintains high visual fidelity under up to 7.18x acceleration on Qwen-Image models, where prior methods show noticeable quality degradation. Our results show learning linear predictors is highly effective for efficient DiT inference. Code is available at this https URL.
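L2P 的核心是把“由过去若干步缓存特征线性外推当前特征”的手工公式替换为数据驱动的权重。下面用闭式最小二乘拟合权重来示意这一思想(论文按时间步学习系数并快速训练,此处仅为概念性草图,非官方实现):

```python
import numpy as np

def fit_l2p_weights(history, target):
    """history: [k, N] 过去 k 步缓存特征(展平);target: [N] 当前特征。
    求解使 w @ history ≈ target 的每步权重 w(示意:闭式最小二乘)。"""
    w, *_ = np.linalg.lstsq(history.T, target, rcond=None)
    return w

def l2p_predict(history, w):
    """缓存命中时用学得的线性权重重建当前特征,跳过该步的完整前向计算。"""
    return w @ history
```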

[CV-52] CO-EVO: Co-evolving Semantic Anchoring and Style Diversification for Federated DG-ReID ACL2026

【速读】:该论文旨在解决联邦域泛化行人重识别(Federated Domain Generalization for Person Re-Identification, FedDG-ReID)中的跨客户端风格差异问题,即在无全局监督的情况下,模型容易因本地摄像头偏置而陷入“捷径学习”(shortcut learning),导致特征过拟合于特定域而非提取通用身份信息。其解决方案的关键在于提出一种协同进化框架CO-EVO,通过两个核心机制实现语义净化与风格扩展的协同优化:一方面,相机不变语义锚定(Camera-Invariant Semantic Anchoring, CSA)构建跨摄像头一致的身份提示,形成去噪且域无关的语义锚点;另一方面,全局风格多样化(Global Style Diversification, GSD)利用全局相机风格库(Global Camera-Style Bank, GCSB)合成真实扰动以扩充训练数据的视觉边界。二者构成闭环迭代机制,其中净化后的锚点作为引力中心引导图像编码器聚焦于鲁棒的解剖学属性,从而显著提升模型在未见目标环境下的泛化能力。

链接: https://arxiv.org/abs/2604.26363
作者: Fengchun Zhang,Qiang Ma,Liuyu Xiang,Jinshan Lai,Tingxuan Huang,Jianwei Hu
机构: University of Electronic Science and Technology of China (电子科技大学); QiYuan Lab; Beijing University of Posts and Telecommunications (北京邮电大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ACL 2026 (Main Conference)

点击查看摘要

Abstract:Federated domain generalization for person re-identification (FedDG-ReID) aims to collaboratively train a pedestrian retrieval model across multiple decentralized source domains such that it can generalize to unseen target environments without compromising raw data privacy. However, this task is significantly challenged by the inherent stylistic gaps across decentralized clients. Without global supervision, models easily succumb to shortcut learning where representations overfit to domain specific camera biases rather than universal identity features. We propose CO-EVO, a novel federated framework that resolves this semantic-style conflict through a co-evolutionary mechanism. On the semantic side, Camera-Invariant Semantic Anchoring (CSA) learns identity prompts with cross-camera consistency to establish purified and domain-agnostic anchors that filter out local imaging noise. On the visual side, Global Style Diversification (GSD), powered by a Global Camera-Style Bank (GCSB), synthesizes realistic perturbations to expand the visual boundaries of training data. The core of CO-EVO is its co-evolutionary loop where purified anchors act as gravitational centers to guide the image encoder toward robust anatomical attributes amidst diverse style variations. Extensive experiments demonstrate that CO-EVO achieves state-of-the-art (SOTA) performance, proving that the synergy between semantic purification and style expansion is essential for robust cross-domain generalization. Our code is available at: this https URL.

[CV-53] GateMOT: Q-Gated Attention for Dense Object Tracking

【速读】:该论文旨在解决标准注意力机制(vanilla attention)在密集目标跟踪(Dense Object Tracking)场景下的计算效率瓶颈问题。传统注意力机制因存在二次复杂度的全连接交互,难以在高分辨率特征上进行密集运动估计,限制了其在拥挤和遮挡严重场景中的应用。解决方案的关键在于提出一种基于查询门控的注意力机制(Q-Gated Attention, Q-Attention),将原本用于相似性条件控制的Query重构为可学习的门控单元(Gating-Q),通过元素级调制Key特征实现显式的相关性选择,从而避免全局聚合的高昂开销。这一设计使得多个并行注意力头能够从共享特征图中高效生成检测、运动与重识别等任务特异性但一致的表示,构建线性复杂度的多任务解码器,显著提升了跟踪性能,在BEE24数据集上达到48.4的HOTA、67.8的MOTA和64.5的IDF1,验证了Q-Attention作为轻量且可迁移的注意力模块在密集跟踪中的有效性。

链接: https://arxiv.org/abs/2604.26353
作者: Mingjin Lv,Zelin Liu,Feifei Shao,Yi-Ping Phoebe Chen,Junqing Yu,Wei Yang,Zikai Song
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While large models demonstrate the strong representational power of vanilla attention, this core mechanism cannot be directly applied to Dense Object Tracking: its quadratic all-to-all interactions are computationally prohibitive for dense motion estimation on high-resolution features. This mismatch prevents Dense Object Tracking from fully leveraging attention-based modeling in crowded and occlusion-heavy scenes. To address this challenge, we introduce GateMOT, an online tracking framework centered on Q-Gated Attention (Q-Attention), an efficient and spatially aware attention variant. Our key idea is to repurpose the Query from a similarity-conditioning term into a learnable gating unit. This Gating-Query (Gating-Q) produces a probabilistic gate that modulates Key features in an element-wise manner, enabling explicit relevance selection instead of costly global aggregation. Built on this mechanism, parallel Q-Attention heads transform one shared feature map into task-specific yet consistent representations for detection, motion, and re-identification, yielding a tightly coupled multi-task decoder with linear-complexity gating operations. GateMOT achieves state-of-the-art HOTA of 48.4, MOTA of 67.8, and IDF1 of 64.5 on BEE24, and demonstrates strong performance on additional Dense Object Tracking benchmarks. These results show that Q-Attention is a simple, effective, and transferable building block for attention-based tracking in dense tracking scenarios.
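Q-Attention 把 Query 重新用作门控单元:Gating-Q 经 sigmoid 产生逐元素概率门控并直接调制 Key 特征,从而避免 QK^T 的二次方全局交互。以下为一个线性复杂度的示意(其中残差式的输出组合是笔者的假设形式,并非论文的精确结构):

```python
import numpy as np

def q_gated_attention(x, Wg, Wk, Wv):
    """Q-Gated Attention 的极简示意。
    x: [N, D] 共享特征;Wg/Wk/Wv: [D, D] 投影矩阵(实际中可学习)。
    复杂度 O(N*D^2),与 token 数 N 呈线性关系。"""
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))    # [N, D] 概率门控(Gating-Q)
    gated_k = gate * (x @ Wk)                  # 逐元素相关性选择,无 QK^T 配对
    return gated_k + x @ Wv                    # 残差式组合(假设形式)
```

多任务解码可理解为用不同的投影组各自调用一次该门控变换,从同一特征图得到检测、运动与重识别分支的表示。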

[CV-54] ACPO: Anchor-Constrained Perceptual Optimization for Diffusion Models with No-Reference Quality Guidance

【速读】:该论文旨在解决扩散模型(Diffusion Models)在训练过程中仅依赖全参考(full-reference)目标导致的主观视觉质量与文本-图像语义一致性不足的问题。现有方法虽能保证像素级保真度,但难以提升人眼感知质量。其核心挑战在于直接优化无参考图像质量评估(No-Reference Image Quality Assessment, NR-IQA)等感知信号会与原始扩散目标产生不匹配,引发微调过程中的训练不稳定和分布漂移。解决方案的关键在于提出一种锚定约束优化框架(anchor-constrained optimization framework),通过引入一个学习得到的NR-IQA模型作为感知引导信号,并设计基于锚点的正则化项,强制噪声预测保持与基础扩散模型的一致性,从而在不破坏生成保真度的前提下实现可控的感知质量提升。

链接: https://arxiv.org/abs/2604.26348
作者: Yang Yang,Feifan Meng,Han Fang,Weiming Zhang
机构: Anhui University (安徽大学); Hefei Comprehensive National Science Center (合肥综合性国家科学中心); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 9 figures, 11 tables

点击查看摘要

Abstract:Diffusion models have achieved remarkable success in image generation, yet their training is predominantly driven by full-reference objectives that enforce pixel-wise similarity to ground-truth images. Such supervision, while effective for fidelity, may be insufficient in terms of subjective visual perception quality and text-image semantic consistency. In this work, we investigate the problem of incorporating no-reference perceptual quality into diffusion training. A key challenge is that directly optimizing perceptual signals, such as those provided by no-reference image quality assessment (NR-IQA) models, introduces a mismatch with the original diffusion objective, leading to training instability and distributional drift during fine-tuning. To address this issue, we propose an anchor-constrained optimization framework that enables stable perceptual adaptation. Specifically, we leverage a learned NR-IQA model as a perceptual guidance signal, while introducing an anchor-based regularization that enforces consistency with the base diffusion model in terms of noise prediction. This design effectively balances perceptual quality improvement and generative fidelity, allowing controlled adaptation toward perceptually favorable outputs without compromising the original generative behavior. Extensive experiments demonstrate that our method consistently enhances perceptual quality while preserving generation diversity and training stability, highlighting the effectiveness of anchor-constrained perceptual optimization for diffusion models.
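其“感知引导 + 锚点正则”的总体目标可写成如下示意性损失(系数与具体形式为笔者假设,NR-IQA 质量分在此作为外部标量输入,非论文官方实现):

```python
import numpy as np

def acpo_loss(eps_ft, eps_anchor, nr_iqa_score, lam=0.1):
    """锚定约束感知优化的示意性损失。
    eps_ft / eps_anchor: 微调模型与基础(锚点)模型的噪声预测;
    nr_iqa_score: 无参考质量分(越高越好)。
    总损失 = 负质量分 + lam * 噪声预测一致性正则。"""
    anchor_reg = float(np.mean((eps_ft - eps_anchor) ** 2))
    return -float(nr_iqa_score) + lam * anchor_reg
```

当微调模型的噪声预测偏离基础模型时,正则项抬高损失,从而抑制分布漂移;lam 控制感知提升与生成保真之间的权衡。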

[CV-55] Which Face and Whose Identity? Solving the Dual Challenge of Deepfake Proactive Forensics in Multi-Face Scenarios

【速读】:该论文旨在解决复杂多人群体交互场景(如群组照片和多人会议)中深度伪造(deepfake)的定位与溯源难题,此类场景更贴近真实世界威胁,但现有主动取证方法多依赖“单人脸”假设,难以有效应对多脸环境下的伪造识别与来源追踪。解决方案的关键在于提出深度可追溯水印框架(Deep Attributable Watermarking Framework, DAWF),其核心创新包括:1)设计一种新型多脸编码器-解码器架构,避免传统方法繁琐的离线预处理步骤,支持网络内并行水印嵌入与跨人脸协同处理;2)引入选择性区域监督损失机制,引导解码器仅关注被深度伪造篡改的面部区域,结合嵌入的身份信息,实现“哪张脸被伪造 + 谁被伪造”的双重目标(即“which + who”),从而在复杂多脸数据集上显著提升伪造定位与溯源能力。

链接: https://arxiv.org/abs/2604.26342
作者: Lei Zhang,Zhiqing Guo,Dan Ma,Gaobo Yang
机构: Xinjiang University (新疆大学); Hunan University (湖南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unlike single-face forgeries, deepfakes in complex multi-person interaction scenarios (such as group photos and multi-person meetings) more closely reflect real-world threats. Although existing proactive forensics solutions demonstrate good performance, they heavily rely on a “single-face” setting, making it difficult to effectively address the problems of deepfake localization and source tracing in complex multi-person environments. To address this challenge, we propose the Deep Attributable Watermarking Framework (DAWF). This framework adopts a novel multi-face encoder-decoder architecture that bypasses the cumbersome offline pre-processing steps of traditional forensics, facilitating efficient in-network parallel watermark embedding and cross-face collaborative processing. Crucially, we propose a selective regional supervision loss. This innovative mechanism guides the decoder to focus exclusively on the facial regions tampered with by deepfakes. Leveraging this mechanism alongside the embedded identity payloads, DAWF realizes the “which + who” goal, answering the dual questions of which facial region was forged and who was forged. Extensive experiments on challenging multi-face datasets show that DAWF achieves excellent deepfake localization and traceability in complex multi-person scenes.

[CV-56] SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

【速读】:该论文旨在解决当前统一图像生成模型在空间感知任务中表现受限的问题,其根本原因在于缺乏内在的空间理解能力以及生成过程中缺乏显式的几何引导。解决方案的关键在于提出SpatialFusion框架,通过引入Mixture-of-Transformers(MoT)架构,在多语言大模型(MLLM)中并行集成一个空间Transformer模块,以共享自注意力机制学习目标图像的度量深度图(metric-depth maps),从而增强三维几何建模能力;随后,利用专用的深度适配器(depth adapter)将这些显式几何结构注入扩散模型(diffusion backbone),提供精确的空间约束,实现空间一致性的图像生成。该方法通过两阶段渐进式训练策略显著提升空间感知基准性能,并在文生图与图像编辑任务中均取得泛化性能增益,且推理开销可忽略。

链接: https://arxiv.org/abs/2604.26341
作者: Haiyi Qiu,Kaihang Pan,Jiacheng Li,Juncheng Li,Siliang Tang,Yueting Zhuang
机构: Zhejiang University (浙江大学); HiThink Research (思知科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent unified image generation models have achieved remarkable success by employing MLLMs for semantic understanding and diffusion backbones for image generation. However, these models remain fundamentally limited in spatially-aware tasks due to a lack of intrinsic spatial understanding and the absence of explicit geometric guidance during generation. In this paper, we propose SpatialFusion, a novel framework that internalizes 3D geometric awareness into unified image generation models. Specifically, we first employ a Mixture-of-Transformers (MoT) architecture to augment the MLLM with a parallel spatial transformer to enhance 3D geometric modeling capability. By sharing self-attention with the MLLM, the spatial transformer learns to derive metric-depth maps of target images from rich semantic contexts. These explicit geometric scaffolds are then injected into the diffusion backbone through a specialized depth adapter, providing precise spatial constraints for spatially-coherent image generation. Through a progressive two-stage training strategy, SpatialFusion significantly enhances performance on spatially-aware benchmarks, notably outperforming leading models such as GPT-4o. Additionally, it achieves generalized performance gains across both text-to-image generation and image editing scenarios, all while maintaining negligible inference overhead.

[CV-57] Federated Medical Image Classification under Class and Domain Imbalance exploiting Synthetic Sample Generation ICPR2026

【速读】:该论文旨在解决医学影像领域中联邦学习(Federated Learning)面临的三大挑战:严格的隐私约束、不同成像设备导致的域偏移(domain shift)以及罕见病灶的类别不平衡问题。其解决方案的关键在于提出了一种名为FedSSG的新颖联邦学习框架,核心策略是生成合成样本并分发至各客户端,从而提升对稀有病灶和多样化成像设备的覆盖度,同时在客户端侧保持极低的计算开销,显著增强了模型在异构机构间的性能表现与泛化能力。

链接: https://arxiv.org/abs/2604.26324
作者: Martina Pavan,Matteo Caligiuri,Francesco Barbato,Pietro Zanuttigh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICPR 2026, 13 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Exploiting deep learning in medical imaging faces critical challenges, including strict privacy constraints, heterogeneous imaging devices with varying acquisition properties, and class imbalance due to the uneven prevalence of pathologies. In this work, we propose FedSSG, a novel Federated Learning framework that addresses domain shifts caused by diverse imaging devices while mitigating the under-representation of rare pathologies. The key contribution is a strategy for generating synthetic samples and distributing them across clients to improve coverage of both underrepresented pathologies and imaging devices. Experimental results demonstrate that our approach significantly enhances model performance and generalization across heterogeneous institutions, with minimal computational overhead at the client side.
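 
FedSSG的核心思路是向客户端分发合成样本以补齐欠代表类别。以下为一个高度简化的客户端侧补齐逻辑示意(纯Python,类别名与样本池均为本文假设,并非论文的生成与分发策略):统计各类样本数,从合成样本池为少数类补充样本直至与多数类持平。

```python
import numpy as np
from collections import Counter

def balance_with_synthetic(labels, synth_pool):
    """Return extra labels drawn from a per-class synthetic pool so that
    every class reaches the size of the current majority class."""
    counts = Counter(labels)
    target = max(counts.values())
    extra = []
    for cls, n in counts.items():
        need = target - n
        take = min(need, len(synth_pool.get(cls, [])))  # pool may be smaller
        extra.extend([cls] * take)
    return extra

client_labels = ["healthy"] * 90 + ["rare_lesion"] * 10
pool = {"rare_lesion": list(range(200))}   # 200 synthetic rare samples available
extra = balance_with_synthetic(client_labels, pool)
```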

[CV-58] Motion-Driven Multi-Object Tracking of Model Organisms in Space Science Experiments CVPR

【速读】:该论文旨在解决微重力环境下多动物跟踪(multi-animal tracking)在空间科学实验视频中面临的挑战,包括弱外观特征、低质量成像、复杂机动行为及频繁交互导致的轨迹不连续与身份混淆问题。解决方案的关键在于提出一种面向运动驱动的鲁棒跟踪框架ART-Track(Adaptive Robust Tracking),其核心创新包括:多模型运动估计以应对突发机动和非线性运动;基于运动状态的关联策略降低密集交互下的身份切换;以及不确定性自适应融合机制,在预测可靠性变化时动态平衡空间与运动线索,从而显著减少斑马鱼和果蝇序列中的身份切换,并提升遮挡、形变和高密度交互下的关联稳定性,为下游定量行为分析提供更可靠的轨迹基础。

链接: https://arxiv.org/abs/2604.26321
作者: Jianing You,Han Wang,Kang Liu,Jiale Ding,Fengjie Chu,Zihan Guo,Shengyang Li
机构: Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences (中国科学院空间应用工程与技术中心); School of Space Exploration, University of Chinese Academy of Sciences (中国科学院大学太空探索学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

点击查看摘要

Abstract:Automated animal behavior analysis relies on long-term, interpretable individual trajectories; however, multi-animal tracking in space science experimental videos remains highly challenging due to weak appearance cues, low-quality imaging, complex maneuvering behaviors, and frequent interactions. To address this problem, we first construct the SpaceAnimal-MOT dataset to characterize the motion complexity and long-term identity preservation challenges in biological videos acquired under microgravity conditions. We then propose ART-Track (Adaptive Robust Tracking), a motion-driven tracking framework tailored to this setting. Specifically, multi-model motion estimation is introduced to handle abrupt maneuvers and nonlinear motion, motion-state-driven association is designed to reduce identity switches under dense interactions and temporary mismatch, and uncertainty-adaptive fusion is used to dynamically balance spatial and motion cues when prediction reliability varies. Experimental results show that ART-Track significantly reduces identity switches on zebrafish and fruitfly sequences, while maintaining more stable association under occlusion, deformation, and high-density interactions, thereby providing a more reliable tracking foundation for downstream quantitative behavior analysis. The code is publicly available at this https URL.
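 
ART-Track的多模型运动估计用于应对突变机动与非线性运动。下面给出一个大为简化的相关示意(NumPy实现,非论文代码):单个一维恒速Kalman滤波的预测+更新步骤,多模型方法通常在若干个这类运动模型之间加权切换。

```python
import numpy as np

def kalman_cv_step(x, P, z, dt=1.0, q=1e-2, r=1e-1):
    """One predict+update step of a 1D constant-velocity Kalman filter.

    x = [position, velocity]; P: 2x2 covariance; z: position measurement.
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])     # constant-velocity transition
    H = np.array([[1.0, 0.0]])                # only position is observed
    Q = q * np.eye(2)
    R = np.array([[r]])
    # predict
    x = F @ x
    P = F @ P @ F.T + Q
    # update
    y = z - H @ x                             # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)            # Kalman gain
    x = x + (K @ y).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

# track a point moving at constant velocity 1.0 from position 0
x, P = np.array([0.0, 0.0]), np.eye(2)
for t in range(1, 21):
    x, P = kalman_cv_step(x, P, z=float(t))   # noiseless measurements z = t
```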

[CV-59] Point Cloud Registration via Probabilistic Self-Update Local Correspondence and Line Vector Sets

【速读】:该论文旨在解决点云配准(Point Cloud Registration, PCR)在遥感应用中效率与精度难以兼顾的问题。其核心解决方案是提出一种基于概率自更新局部对应关系与线向量集的双RANSAC交互模型:其中全局RANSAC用于评估整体对应集,局部RANSAC则在动态更新的局部集合上迭代优化;通过角度直方图统计和线向量长度保持构建初始局部集,并引入概率自更新策略提升局部对应质量;同时设计全局早停条件以平衡计算效率与配准精度,最终采用加权奇异值分解(Weighted Singular Value Decomposition)获得高精度变换矩阵。该方法在公开数据集上实现了优于当前最优算法的运行速度和至少10%的均方根误差(RMSE)改进。

链接: https://arxiv.org/abs/2604.26318
作者: Kuo-Liang Chung,Yu-Cheng Lin,Wu-Chi Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Point cloud registration (PCR) is a fundamental task for integrating 3D observations in remote sensing applications. This paper proposes a fast and effective PCR algorithm utilizing probabilistic self-updating local correspondence and line vector sets. Our dual RANSAC interaction model comprises a global RANSAC evaluating the global correspondence set and a local RANSAC operating on dynamically updated local sets. Initially, these local sets are constructed using angle histogram statistics and line vector length preservation techniques. To improve accuracy, a probabilistic self-updating strategy refines the local sets after each interaction round. To reduce runtime, we introduce a global early termination condition that optimally balances accuracy and efficiency. Finally, a weighted singular value decomposition estimates the registration solution. Evaluations on public datasets demonstrate our algorithm achieves superior time efficiency and at least a 10% root mean square error improvement over state-of-the-art methods. The C++ source code is publicly available at this https URL.
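 
摘要最后一步用加权奇异值分解求解配准变换。以下是该通用步骤的一个NumPy示意(加权Kabsch算法,非论文实现;点数与权重均为本文示例假设):

```python
import numpy as np

def weighted_kabsch(P, Q, w):
    """Estimate rigid (R, t) minimizing sum_i w_i * ||R @ P_i + t - Q_i||^2.

    P, Q: (N, 3) corresponding points; w: (N,) non-negative weights.
    """
    w = w / w.sum()
    mu_p = (w[:, None] * P).sum(axis=0)          # weighted centroids
    mu_q = (w[:, None] * Q).sum(axis=0)
    Pc, Qc = P - mu_p, Q - mu_q
    H = (w[:, None] * Pc).T @ Qc                 # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_q - R @ mu_p
    return R, t

# toy check: recover a known rotation about z and a translation
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
rng = np.random.default_rng(0)
P = rng.normal(size=(50, 3))
Q = P @ R_true.T + t_true
R_est, t_est = weighted_kabsch(P, Q, np.ones(50))
```

在论文的流程中,权重会来自对应关系的置信度,而非此处的均匀权重。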

[CV-60] The Unseen Adversaries: Robust and Generalized Defense Against Adversarial Patches AISTATS2026

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在物理世界中对奇异点(singularities)的脆弱性问题,特别是针对两类典型奇异点——对抗补丁攻击(adversarial patch attack)和自然噪声(如高斯噪声和椒盐噪声)的独立及联合影响缺乏系统研究的问题。其解决方案的关键在于首次将这两类奇异点结合起来,构建了一个新型数据集,并基于此数据集对多种卷积神经网络(Convolutional Neural Networks, CNNs)提取的特征进行奇异点检测基准测试;同时采用传统但有效的机器学习分类器而非依赖神经网络参数调优的方法进行分类,从而揭示了在处理组合奇异点时,若仅孤立应对或选用低效分类器,则难以实现有效防御的结论。

链接: https://arxiv.org/abs/2604.26317
作者: Vishesh Kumar,Akshay Agarwal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AISTATS 2026

点击查看摘要

Abstract:The vulnerabilities of deep neural networks against singularities have raised serious concerns regarding their deployment in the physical world. One of the most prominent and impactful physical-world adversarial perturbations is the attachment of patches to clean images, known as an adversarial patch attack. Similarly, natural noises such as Gaussian and Salt & Pepper are highly prevalent in the real world. The current research need arises from the above vulnerabilities and the lack of efforts to tackle these two singularities independently and, especially, in combination. In this research, we have, for the first time, combined these two prominent singularities and proposed a novel dataset. Using this dataset, we have conducted a benchmark study of singularity data-point detection using features from several convolutional neural networks. For classification, rather than the popular neural network-based parameter tuning, we have used traditional yet effective machine learning classifiers. The extensive experiments across various in- and out-of-distribution (OOD) singularities reveal several interesting findings about the effectiveness of classifiers and show that it is hard to defend against adversaries when they are treated independently, and inefficient classifiers are selected.
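 
该文研究对抗补丁与自然噪声叠加后的"组合奇异点"。以下NumPy草图演示这类样本最朴素的构造方式:先贴上一个随机补丁,再施加椒盐噪声(纯示意,非论文数据集的构建流程;图像与补丁内容均为本文假设):

```python
import numpy as np

def add_salt_pepper(img, ratio=0.05, rng=None):
    """Flip a fraction of pixels to 0 (pepper) or 255 (salt)."""
    rng = rng or np.random.default_rng(0)
    out = img.copy()
    mask = rng.random(img.shape) < ratio
    out[mask] = rng.choice([0, 255], size=mask.sum())
    return out

def paste_patch(img, patch, top, left):
    """Overwrite a rectangular region with an (adversarial-style) patch."""
    out = img.copy()
    h, w = patch.shape[:2]
    out[top:top + h, left:left + w] = patch
    return out

rng = np.random.default_rng(42)
clean = np.full((32, 32), 128, dtype=np.uint8)          # flat gray "image"
patch = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)
combined = add_salt_pepper(paste_patch(clean, patch, 4, 4), ratio=0.1, rng=rng)
```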

[CV-61] CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation

【速读】:该论文旨在解决当前生成式 AI 在医学影像诊断中缺乏对临床推理过程建模的问题,特别是现有视觉-语言模型主要依赖图像与报告的配对数据,而忽视了专家在实际诊断中所采用的视觉搜索策略、临床情境整合及不确定性表达等认知机制。其解决方案的关键在于构建了一个全球性的多模态资源 CheXthought,包含 103,592 条链式思维(chain-of-thought)推理轨迹和 6,609,082 个同步视觉注意力标注,覆盖来自 71 个国家的 501 名放射科医生对 50,312 张多阅片胸片的分析数据。通过该数据集训练和评估模型,显著提升了病理分类准确性、视觉忠实度、时间推理能力以及不确定性沟通效果,并实现了基于图像直接预测人-人与人-AI 矛盾的能力,从而推动更透明、可解释的多模态临床推理模型发展。

链接: https://arxiv.org/abs/2604.26288
作者: Sonali Sharma,Jin Long,George Shih,Sarah Eid,Christian Bluethgen,Francine L. Jacobson,Emily B. Tsai,Global Radiology Consortium,Ahmed M. Alaa,Curtis P. Langlotz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 51 pages, 7 figures, 10 tables

点击查看摘要

Abstract:Chest X-ray interpretation is one of the most frequently performed diagnostic tasks in medicine and a primary target for AI development, yet current vision–language models are primarily trained on datasets of paired images and reports, not the cognitive processes and visual attention that underlie clinical reasoning. Here, we present CheXthought, a global, multimodal resource containing 103,592 chain-of-thought reasoning traces and 6,609,082 synchronized visual attention annotations across 50,312 multi-read chest X-rays from 501 radiologists in 71 countries. Our analysis reveals clinical reasoning patterns in how experts deploy distinct visual search strategies, integrate clinical context, and communicate uncertainty. We demonstrate the clinical utility of CheXthought across four dimensions. First, CheXthought reasoning significantly outperforms state–of–the–art vision–language model chain-of-thought in factual accuracy and spatial grounding. Second, visual attention data used as an inference–time hint recovers missed findings and significantly reduces hallucinations. Third, models trained on CheXthought data achieve significantly stronger pathology classification, visual faithfulness, temporal reasoning and uncertainty communication. Fourth, leveraging CheXthought’s multi-reader annotations, we predict both human–human and human–AI disagreement directly from an image, enabling transparent communication of case difficulty, uncertainty and model reliability. These findings establish CheXthought as a resource for advancing multimodal clinical reasoning and the development of more transparent, interpretable vision–language models.

[CV-62] Event-based Liveness Detection using Temporal Ocular Dynamics: An Exploratory Approach

【速读】:该论文旨在解决传统基于RGB相机的活体检测(liveness detection)在跨传感器和攻击场景下泛化能力差的问题。其解决方案的关键在于引入事件相机(event camera)作为替代感知模态,利用其微秒级时间分辨率捕捉眼动动态(temporal ocular dynamics),特别是快速眼跳(saccades)等瞬态特征;由于重放攻击无法精确复现这些高时间精度的动态变化,从而在事件域中产生独特的时空模式,使模型能够有效区分真实与伪造序列,最终通过设计事件驱动的时间特征提取方法及脉冲卷积神经网络实现高达95.37%的活体检测准确率。

链接: https://arxiv.org/abs/2604.26285
作者: Nicolas Mastropasqua,Ignacio Bugueno-Cordova,Rodrigo Verschae,Daniel Acevedo,Pablo Negri
机构: Universidad de Buenos Aires (布宜诺斯艾利斯大学); CONICET-UBA, Instituto de Ciencias de la Computación (阿根廷国家科学与技术研究委员会-布宜诺斯艾利斯大学计算机科学研究所); Universidad de Chile (智利大学); Universidad de O’Higgins (奥希金斯大学); Universidad Técnica Federico Santa María (费迪南德·圣马利亚科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at FG 2026 FME Workshop

点击查看摘要

Abstract:Face liveness detection has been extensively studied using RGB cameras, achieving strong performance under controlled conditions but often failing to generalize across sensors and attack scenarios. In this work, we explore event cameras as an alternative sensing modality for liveness detection based on temporal ocular dynamics. Event cameras capture sparse, asynchronous changes in brightness with microsecond resolution, enabling precise analysis of fast eye movements such as saccades. Replay attacks cannot faithfully reproduce these dynamics due to temporal resampling and display artifacts, leading to distinctive spatio-temporal patterns in the event domain. We design a data collection protocol to extend RGBE-Gaze with replay-attack recordings, yielding an event-based fake counterpart for liveness detection. We analyze event-driven temporal features from eye regions and evaluate their effectiveness for ocular motion segmentation and liveness classification. Our results show that event-based representations enable reliable discrimination between genuine and replayed sequences, achieving up to 95.37% top-1 accuracy with a spiking convolutional neural network. These preliminary findings highlight the potential of event-based sensing for robust and low-latency liveness detection.

[CV-63] MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution

【速读】:该论文旨在解决医学视觉语言模型(Medical Vision-Language Models, VLMs)中存在的认知错位问题,即由于离散标记化导致的量化损失、长程信息衰减以及缺失病例自适应专家知识的问题,从而影响高精度医疗诊断的实现。解决方案的关键在于提出了一种隐式诊断记忆演化框架(摘要中以“ours”指代,即MedSynapse-V),通过动态合成模型隐藏流中的隐式诊断记忆来模拟临床医生的经验调用过程:首先利用元查询机制从解剖先验编码器中检索结构化先验信息生成压缩的隐式记忆;随后引入因果反事实精炼(Causal Counterfactual Refinement, CCR)机制,基于区域级特征掩码的强化学习与反事实奖励量化每段记忆的因果贡献,以修剪冗余并使潜在表示符合诊断逻辑;最终通过内在记忆迁移(Intrinsic Memory Transition, IMT)机制,将教师分支的诊断模式以内生方式对齐至学生分支,实现专家知识向模型参数的内化转移,显著提升诊断准确率。

链接: https://arxiv.org/abs/2604.26283
作者: Chunzheng Zhu,Jiaqi Zeng,Junyu Jiang,Jianxin Lin,Yijun Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Medical latent reasoning; Memory evolution

点击查看摘要

Abstract:High-precision medical diagnosis relies not only on static imaging features but also on the implicit diagnostic memory experts instantly invoke during image interpretation. We pinpoint a fundamental cognitive misalignment in medical VLMs caused by discrete tokenization, leading to quantization loss, long-range information dissipation, and missing case-adaptive expertise. To bridge this gap, we propose ours, a framework for latent diagnostic memory evolution that simulates the experiential invocation of clinicians by dynamically synthesizing implicit diagnostic memories within the model’s hidden stream. Specifically, it begins with a Meta Query for Prior Memorization mechanism, where learnable probes retrieve structured priors from an anatomical prior encoder to generate condensed implicit memories. To ensure clinical fidelity, we introduce Causal Counterfactual Refinement (CCR), which leverages reinforcement learning and counterfactual rewards derived from region-level feature masking to quantify the causal contribution of each memory, thereby pruning redundancies and aligning latent representations with diagnostic logic. This evolutionary process culminates in Intrinsic Memory Transition (IMT), a privileged-autonomous dual-branch paradigm that internalizes teacher-branch diagnostic patterns into the student-branch via full-vocabulary divergence alignment. Comprehensive empirical evaluations across multiple datasets demonstrate that ours, by transferring external expertise into endogenous parameters, significantly outperforms existing state-of-the-art methods, particularly chain-of-thought paradigms, in diagnostic accuracy.

[CV-64] High-Dimensional Noise to Low-Dimensional Manifolds: A Manifold-Space Diffusion Framework for Degraded Hyperspectral Image Classification

【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)在复杂退化条件下分类性能下降的问题。由于HSI数据具有高维但低秩的特性,其判别信息通常集中在低维流形(manifold)上,而现实遥感场景中多种退化因素的叠加会破坏这一内在流形结构,导致样本偏离原始分布并引入冗余和非判别性变化。解决方案的关键在于提出一种流形空间扩散框架(Manifold-Space Diffusion, MSDiff),首先通过判别性的光谱-空间重建任务将高维退化数据映射到紧凑的低维流形中,保留类别语义并抑制冗余变化;随后在该流形空间内应用基于扩散的生成模型,对光谱-空间分布进行正则化,实现潜在特征的逐步优化与稳定,从而有效解耦退化干扰与内在判别结构,提升复杂退化条件下的表示稳定性与分类鲁棒性。

链接: https://arxiv.org/abs/2604.26279
作者: Boxiang Yang,Ning Chen,Xia Yue,Yichang Luo,Yingbo Fan,Haoyuan Zhang,Haoyu Ma,Jun Yue,Shanjun Mao
机构: Peking University (北京大学); Central South University (中南大学); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, Hyperspectral Image (HSI) classification has attracted increasing attention in remote sensing. However, HSI data are inherently high-dimensional but low-rank, with discriminative information concentrated on a low-dimensional latent manifold. In real-world remote sensing scenarios, the superposition of multiple degradation factors disrupts this intrinsic manifold structure, driving samples away from their original low-dimensional distribution and introducing substantial redundant and non-discriminative variations. To better handle this challenge, this paper proposes a manifold-space diffusion framework (MSDiff) for robust hyperspectral classification under complex degradation conditions. Specifically, the proposed method first maps high-dimensional, degradation-affected HSI data into a compact low-dimensional manifold through a discriminative spectral-spatial reconstruction task, preserving class semantics and reducing redundant variations. A diffusion-based generative model is then applied to regularize the spectral-spatial distribution within the manifold, enabling progressive refinement and stabilization of latent features against residual degradations. The key advantage of the proposed framework lies in performing diffusion-based distribution modeling directly on the low-dimensional manifold, effectively decoupling degradation-induced disturbances from intrinsic discriminative structures and enhancing representation stability under complex degradations. Experimental results on multiple hyperspectral benchmarks demonstrate consistent performance improvements over state-of-the-art methods under diverse composite degradation settings. The code will be available at this https URL
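 
MSDiff的关键是在低维流形坐标上做扩散式分布建模。下面用NumPy给出一个概念草图(与论文实现无关,数据为模拟的低秩"光谱"):先用PCA把高维低秩数据映射到低维空间,再演示DDPM式前向加噪 x_t = sqrt(ᾱ_t)·x_0 + sqrt(1-ᾱ_t)·ε 直接作用于低维坐标:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate high-dimensional but low-rank "spectra": 200 bands, intrinsic rank 5
basis = rng.normal(size=(5, 200))
codes = rng.normal(size=(500, 5))
X = codes @ basis                                  # (500, 200)

# PCA projection onto a 5-D "manifold"
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:5].T                                  # low-dim coordinates
X_rec = Z @ Vt[:5] + X.mean(axis=0)                # exact for rank-5 data

# DDPM-style forward noising applied *in the low-dim space*
alpha_bar = 0.5
eps = rng.normal(size=Z.shape)
Z_t = np.sqrt(alpha_bar) * Z + np.sqrt(1 - alpha_bar) * eps
```

论文中的流形由判别式光谱-空间重建任务学得,而非此处的线性PCA;这里仅示意"先降维、再在流形坐标上扩散"的流程。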

[CV-65] Semantic Foam: Unifying Spatial and Semantic Scene Decomposition CVPR2026

【速读】:该论文旨在解决基于点云或隐式表示的场景重建方法(如3D Gaussian Splatting)在交互式图形应用中难以进行语义级操作的问题,尤其是现有方法在对象级分割质量与跨视角一致性方面表现不足。解决方案的关键在于提出Semantic Foam,该方法扩展了Radiant Foam的体积式Voronoi网格结构,并在其基础上于Voronoi单元(cell)级别引入显式语义特征场(semantic feature field),从而实现直接的空间正则化,显著提升跨视角的一致性并缓解遮挡和不一致监督导致的伪影问题。

链接: https://arxiv.org/abs/2604.26262
作者: Amr Sharafeldin,Shrisudhan Govindarajan,Thomas Walker,Aryan Mikaeili,Daniel Rebain,Kwang Moo Yi,Andrea Tagliasacchi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 10 figures, Accepted to CVPR 2026 (Highlight) , Project page: this http URL

点击查看摘要

Abstract:Modern scene reconstruction methods, such as 3D Gaussian Splatting, enable photo-realistic novel view synthesis at real-time speeds. However, their adoption in interactive graphics applications remains limited due to the difficulty of interacting with these representations compared to traditional, human-authored 3D assets. While prior work has attempted to impose semantic decomposition on these models, significant challenges remain in segmentation quality and cross-view consistency. To address these limitations, we introduce Semantic Foam, which extends the recently proposed Radiant Foam representation to semantic decomposition tasks. Our approach leverages the inherent spatial structure of Radiant Foam’s volumetric Voronoi mesh and augments it with an explicit semantic feature field defined at the cell level. This design enables direct spatial regularization, improving consistency across views and mitigating artifacts caused by occlusion and inconsistent supervision, which are common issues in point-based methods. Experimental results demonstrate that our method achieves superior object-level segmentation performance compared to state-of-the-art approaches such as Gaussian Grouping and other recent methods. Project page: this http URL
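 
Semantic Foam在Voronoi单元层面定义语义特征场。以下NumPy草图示意其中两个通用的基本操作(非论文实现,站点与特征均为本文假设):按最近站点把空间点归属到Voronoi单元,并在每个单元内聚合语义特征:

```python
import numpy as np

def voronoi_assign(points, sites):
    """Assign each query point to its nearest Voronoi site (cell index)."""
    d2 = ((points[:, None, :] - sites[None, :, :]) ** 2).sum(-1)  # (N, K)
    return d2.argmin(axis=1)

def per_cell_feature(assign, feats, n_cells):
    """Average per-point semantic features within each Voronoi cell."""
    acc = np.zeros((n_cells, feats.shape[1]))
    cnt = np.zeros(n_cells)
    np.add.at(acc, assign, feats)      # scatter-add features into cells
    np.add.at(cnt, assign, 1.0)
    cnt[cnt == 0] = 1.0                # avoid divide-by-zero for empty cells
    return acc / cnt[:, None]

sites = np.array([[0.0, 0.0], [10.0, 0.0]])           # two cells
pts = np.array([[1.0, 1.0], [2.0, -1.0], [9.0, 0.5]])
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
cells = voronoi_assign(pts, sites)
cell_feats = per_cell_feature(cells, feats, n_cells=2)
```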

[CV-66] Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding

【速读】:该论文旨在解决零样本三维视觉定位(Zero-shot 3D Visual Grounding, 3DVG)任务中因开放词汇三维提案质量差(类别不准确、几何精度低)以及多视角推理空间冗余导致的定位不准与推理效率低的问题。其核心解决方案是提出MCM-VG框架,通过显式建立多个一致的二维到三维(2D-3D)映射来实现鲁棒的零样本3DVG。关键创新在于三个模块:语义对齐模块利用大语言模型(LLM)驱动的查询解析和粗粒度到细粒度的2D-3D匹配纠正类别错位;实例校正模块借助视觉语言模型(VLM)引导的2D分割重建缺失目标,并将可靠的视觉先验回投影以建立精确3D几何结构;视点蒸馏模块聚类相机方向以提取最优RGB帧,结合鸟瞰图(Bird’s Eye View)生成紧凑的视觉提示集,最终将目标消歧转化为视觉语言模型的多选推理任务,从而显著提升定位精度与推理效率。

链接: https://arxiv.org/abs/2604.26261
作者: Yufei Yin,Jie Zheng,Qianke Meng,Zhou Yu,Minghao Chen,Jiajun Ding,Min Tan,Yuling Xi,Zhiwen Chen,Chengfei Lv
机构: Hangzhou Dianzi University(杭州电子科技大学); Zhejiang University(浙江大学); Alibaba Group(阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Zero-shot 3D Visual Grounding (3DVG) is a critical capability for open-world embodied AI. However, existing methods are fundamentally bottlenecked by the poor quality of open-vocabulary 3D proposals, suffering from inaccurate categories and imprecise geometries, as well as the spatial redundancy of exhaustive multi-view reasoning. To address these challenges, we propose MCM-VG, a novel framework that achieves robust zero-shot 3DVG by explicitly establishing Multiple Consistent 2D-3D Mappings. Instead of passively relying on noisy 3D segments, MCM-VG enforces 2D-3D consistency across three fundamental dimensions to achieve precise target localization and reliable reasoning. First, a Semantic Alignment module corrects category mismatches via LLM-driven query parsing and coarse-to-fine 2D-3D matching. Second, an Instance Rectification module leverages VLM-guided 2D segmentations to reconstruct missing targets, back-projecting these reliable visual priors to establish accurate 3D geometries. Finally, to eliminate spatial redundancy, a Viewpoint Distillation module clusters 3D camera directions to extract optimal frames. By pairing these optimal RGB frames with Bird’s Eye View maps into concise visual prompt sets, we formulate the final target disambiguation as a multiple-choice reasoning task for Vision-Language Models. Extensive evaluations on ScanRefer and Nr3D benchmarks demonstrate that MCM-VG sets a new state-of-the-art for zero-shot 3D visual grounding. Remarkably, it achieves 62.0% and 53.6% in Acc@0.25 and Acc@0.5 on ScanRefer, outperforming previous baselines by substantial margins of 6.4% and 4.0%.
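 
实例校正模块把可靠的2D分割结果回投影以建立3D几何。下面用NumPy示意其中最基础的一步:针孔相机模型下,像素坐标加度量深度反投影为相机坐标系中的3D点(通用公式,非论文代码;内参数值为本文假设):

```python
import numpy as np

def backproject(u, v, depth, K):
    """Lift pixel (u, v) with metric depth to a 3D point in camera coordinates.

    K: 3x3 pinhole intrinsics [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.array([x, y, depth])

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
p3d = backproject(420.0, 240.0, 2.0, K)   # 100 px right of center, depth 2 m
```

将相机位姿再作用于该点,即可把2D证据放入世界坐标系,与3D提案对齐。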

[CV-67] GaitKD: A Universal Decoupled Distillation Framework for Efficient Gait Recognition

【速读】:该论文旨在解决高性能步态识别模型因深度神经网络结构复杂、计算成本高而难以实际部署的问题。针对现有知识蒸馏(Knowledge Distillation, KD)方法在分部位结构的步态识别模型中效果不佳的问题,其核心解决方案是提出GaitKD框架,将步态知识迁移解耦为两个互补组件:决策层蒸馏(decision-level distillation)和边界层蒸馏(boundary-level distillation)。其中,决策层蒸馏通过部分校准的logit蒸馏传递类别间的判别关系,边界层蒸馏则通过激活边界目标保留教师模型诱导的嵌入空间划分,而非直接进行特征回归,从而实现高效且稳定的性能提升。该方法支持异构师生模型配置,且不增加推理开销,在多个步态识别基准上均优于强基线模型。

链接: https://arxiv.org/abs/2604.26255
作者: Yuqi Li,Qian Zhou,Huiran Duan,Jingjie Wang,Shunli Zhang,Chuanguang Yang,Guoying Zhao,Yingli Tian
机构: The City University of New York (纽约市立大学); Wuhan University (武汉大学); Beijing Jiaotong University (北京交通大学); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Oulu (奥卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Gait recognition is an attractive biometric modality for long-range and contact-free identification, but high-performing gait models often rely on deep and computationally expensive architectures that are difficult to deploy in practice. Knowledge distillation (KD) offers a natural way to transfer knowledge from a powerful teacher to an efficient student; however, standard KD is often less effective for part-structured gait models, where supervision is formed from both part-wise classification logits and part-wise retrieval embeddings. In this paper, we propose GaitKD, a distillation framework that decouples gait knowledge transfer into two complementary components: decision-level distillation and boundary-level distillation. Specifically, GaitKD aligns the teacher and student through part-calibrated logit distillation to transfer inter-class decision relations, while preserving the teacher-induced partitioning of the embedding space through an activation-boundary objective instead of direct feature regression. With a simple aligned part-wise design, GaitKD supports heterogeneous teacher-student gait models without introducing additional inference cost. Experimental results across multiple gait recognition benchmarks and teacher-student configurations show consistent improvements over strong gait baselines. Our study demonstrates that the two transfer components are complementary, and boundary-preserving distillation provides more stable performance than direct feature regression. Source code is available at this https URL
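 
GaitKD的决策层蒸馏通过对齐教师/学生logits传递类间判别关系。以下NumPy草图实现其通用基础形式:带温度的KL散度logit蒸馏损失(Hinton式KD,并非论文的"部分校准"版本;logits数值为本文假设):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Temperature-scaled KL(teacher || student), averaged over the batch.
    Scaled by T^2 so gradient magnitude stays comparable across T."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return T * T * kl.mean()

t = np.array([[2.0, 0.5, -1.0]])
loss_same = kd_loss(t, t)                       # identical logits -> ~0
loss_diff = kd_loss(np.array([[-1.0, 0.5, 2.0]]), t)
```

在分部位模型中,这类损失会按身体部位分别计算后再汇总。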

[CV-68] OmniTrend: Content-Context Modeling for Scalable Social Popularity Prediction

【速读】:该论文旨在解决社交平台内容流行度预测中难以区分内容吸引力(content attractiveness)与曝光情境(contextual exposure)的问题。现有方法通常将两者混杂建模,导致学习到的表征受平台特定可见性效应影响,削弱了模型的可解释性和跨平台迁移能力。其解决方案的关键在于提出OmniTrend框架,通过分离建模内容模块和上下文模块:内容模块利用视觉、音频和文本多模态信号提取内在吸引力特征,上下文模块基于发布时间、作者活跃度、话题趋势及检索邻域统计等外生信号估计曝光程度;最终将两个独立预测器的结果融合,使各因素的作用机制清晰可辨,并显著提升在图像和视频平台间的泛化性能。

链接: https://arxiv.org/abs/2604.26252
作者: Liliang Ye,Guiyi Zeng,Yunyao Zhang,Yi-Ping Phoebe Chen,Junqing Yu,Zikai Song
机构: Huazhong University of Science and Technology (华中科技大学); La Trobe University (拉特罗布大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Predicting social media popularity requires understanding both the intrinsic appeal of content and the external context that determines how it is exposed to users. Existing methods focus on content signals but do not separate them from exposure-related patterns, which causes the learned representations to absorb platform-specific visibility effects and weakens both interpretability and cross-platform transfer. This paper introduces OmniTrend, a unified framework that models popularity as the joint outcome of content attractiveness and contextual exposure. The content module learns cross-modal representations from visual, audio, and textual cues to quantify intrinsic appeal, while the context module estimates exposure from exogenous signals such as posting time, author activity, topical trends, and retrieval-based neighborhood statistics. OmniTrend learns separate predictors for content attractiveness and contextual exposure and integrates them in the final popularity estimate, which makes the role of each factor explicit and supports robust transfer across image and video platforms.
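 
OmniTrend用独立的内容预测器和情境预测器,再融合得到最终流行度估计。以下草图演示一种常见且可解释的融合方式——对数域相加(即乘性组合),使两类因素的贡献显式可分(示意做法;论文摘要并未给出具体融合形式):

```python
import numpy as np

def fuse_popularity(content_score, context_score):
    """Combine intrinsic-appeal and exposure estimates multiplicatively
    by adding them in log space; each factor's contribution stays explicit."""
    return np.exp(np.log(content_score) + np.log(context_score))

appeal = np.array([2.0, 2.0, 0.5])      # same content under three contexts...
exposure = np.array([1.0, 4.0, 4.0])    # ...with different exposure levels
pred = fuse_popularity(appeal, exposure)
```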

[CV-69] Multi-Stage Bi-Atrial Segmentation Framework from 3D Late Gadolinium-Enhanced MRI using V-Net Family Models MICCAI2024 ICIP

【速读】:该论文旨在解决从3D晚期钆增强磁共振成像(Late Gadolinium-Enhanced MRI, LGE-MRI)中实现多类双心房(bi-atrial)分割的难题,该任务对心脏疾病诊断具有重要意义。解决方案的关键在于提出一个分阶段的深度学习框架:首先通过多维限制对比度自适应直方图均衡化(Multidimensional Contrast Limited Adaptive Histogram Equalization, MCLAHE)进行预处理以增强图像对比度;随后利用V-Net家族模型对下采样后的MCLAHE增强图像进行粗略区域分割;最后采用另一V-Net模型对粗分割结果进行精细化分割;同时引入非对称损失函数(asymmetric loss)优化模型权重,从而提升小目标区域的分割精度。

链接: https://arxiv.org/abs/2604.26251
作者: Hao Wen,Jingsu Kang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 2 figures, technical report for participating the MBAS2024 challenge hosted on the MICCAI2024 conference

点击查看摘要

Abstract:We report our multi-stage framework designed for the problem of multi-class bi-atrial segmentation from 3D late gadolinium-enhanced (LGE) MRI of the human heart. The pipeline consists of a preprocessing step using multidimensional contrast limited adaptive histogram equalization (MCLAHE); coarse region segmentation from MCLAHE-enhanced and down-sampled MRI using a V-Net family model; and fine segmentation from the coarse region using another V-Net model. Asymmetric loss is adopted to optimize the model weights.
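 
预处理步骤使用MCLAHE增强对比度。下面的NumPy草图实现其思想的一维/全局简化——带裁剪限制的直方图均衡:先裁掉直方图中超过限幅的部分并均匀回填,再用CDF重映射灰度(仅示意"裁剪+重映射"的核心,并非多维MCLAHE本身;参数为本文假设):

```python
import numpy as np

def clipped_hist_equalize(img, clip_limit=0.02):
    """Global histogram equalization with a CLAHE-style clip: histogram mass
    above clip_limit (fraction of pixels) is redistributed uniformly before
    building the CDF lookup table."""
    flat = img.ravel().astype(np.int64)
    hist = np.bincount(flat, minlength=256).astype(float) / flat.size
    excess = np.clip(hist - clip_limit, 0, None).sum()
    hist = np.minimum(hist, clip_limit) + excess / 256.0   # redistribute
    cdf = np.cumsum(hist)
    lut = np.round(255 * cdf).astype(np.uint8)
    return lut[flat].reshape(img.shape)

rng = np.random.default_rng(0)
low_contrast = rng.integers(100, 140, size=(64, 64)).astype(np.uint8)
enhanced = clipped_hist_equalize(low_contrast)
```

真正的(M)CLAHE在局部图块(乃至多维邻域)上分别做这种受限均衡并插值拼接,以避免全局映射放大噪声。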

[CV-70] Beyond Shortcuts: Mitigating Visual Illusions in Frozen VLMs via Qualitative Reasoning CVPR

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在面对光学幻觉时感知鲁棒性差的问题,其核心挑战在于模型倾向于依赖语言先验和记忆原型,而非直接的视觉证据,从而导致错误推理。解决方案的关键在于提出一种无需训练、以数据为中心的框架——结构化定性推理(Structured Qualitative Inference, SQI),通过三个系统模块实现:(1) 公理约束注入(Axiomatic Constraint Injection),抑制错误的度量估计与定量幻觉;(2) 分层场景分解(Hierarchical Scene Decomposition),将目标视觉流形从复杂背景干扰中解耦;(3) 反事实自验证(Counterfactual Self-Verification),通过对抗推理缓解确认偏误。SQI 在推理阶段协同施加这些定性约束,有效对齐高层语言推理与底层视觉感知,显著提升幻觉理解任务中的准确性和可解释性。

链接: https://arxiv.org/abs/2604.26250
作者: Hao Guo,Fei Wang,Junjie Chen,Yiqi Nie,Jiaqi Zhao,Qiankun Li,Subin Huang
机构: Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院); Anhui Polytechnic University (安徽工业大学); Hefei University of Technology (合肥工业大学); Anhui University (安徽大学); IGS, Imperial College London (帝国理工学院IGS研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 2 figures, and 1 table. This is a methodology paper for the DataCV 2026 Challenge (CVPR Workshops), Task 1, where our method ranked 2nd

点击查看摘要

Abstract:While Vision-Language Models (VLMs) have achieved state-of-the-art performance in general visual tasks, their perceptual robustness remains remarkably brittle when confronted with optical illusions. These failures are often attributed to shortcut heuristics, where models prioritize linguistic priors and memorized prototypes over direct visual evidence. In this work, we propose Structured Qualitative Inference (SQI), a training-free, data-centric framework designed to fortify visual grounding in frozen VLMs. SQI addresses perceptual anomalies through three systematic modules: (1) Axiomatic Constraint Injection, which suppresses erroneous metric estimations and quantitative hallucinations; (2) Hierarchical Scene Decomposition, which decouples target visual manifolds from complex background distractors; and (3) Counterfactual Self-Verification, an adversarial reasoning step that mitigates confirmation bias. By orchestrating these qualitative constraints at inference time, SQI effectively aligns high-level linguistic reasoning with low-level visual perception. Our framework was evaluated on the DataCV 2026 Challenge (Task I: Classic Illusion Understanding), where it ranked 2nd place overall. Experimental results demonstrate that SQI not only significantly enhances accuracy across diverse illusion categories but also provides superior diagnostic interpretability without any model fine-tuning. Our success underscores the potential of structured qualitative grounding as a robust paradigm for developing next-generation, illusion-resistant vision-language systems.

[CV-71] MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution

【速读】:该论文旨在解决真实场景下生成式超分辨率(Generative Super-Resolution, GSR)中因内容与退化类型多样化而导致的性能瓶颈问题,尤其在侧信息(side information)具有内容依赖性且传输带宽受限时,传统固定条件设计的SR方法效果不佳。解决方案的关键在于提出MetaSR框架——基于Diffusion Transformer (DiT) 架构,通过模型自身VAE和Transformer主干网络融合异构侧信息,并采用高效蒸馏策略实现单步扩散推理,从而在有限比特率下动态选择并注入任务相关元数据,显著提升重建质量(PSNR最高提升1.0 dB),同时在保持相同图像质量的前提下节省高达50%的传输码率。

链接: https://arxiv.org/abs/2604.26244
作者: Jiaqi Guo,Mingzhen Li,Haohong Wang,Aggelos K. Katsaggelos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study generative super-resolution (SR) in real-world scenarios where content and degradations vary across domains, genres, and segments. For example, images and videos may alternate between text overlays, fast motion, smooth cartoons, and low-light faces, each benefiting from different forms of side information. Existing metadata-guided SR methods typically use a fixed conditioning design, which is suboptimal when useful cues are content dependent and transmission budgets are limited. We propose MetaSR, a Diffusion Transformer (DiT)-based framework that selects and injects task-relevant metadata to guide SR under resource constraints. Specifically, we use the DiT’s own VAE and transformer backbone to fuse heterogeneous metadata, and adopt an efficient distillation strategy that enables one-step diffusion inference. Experiments across diverse content buckets and degradation regimes show that MetaSR outperforms reference solutions by up to 1.0 dB PSNR while achieving up to 50% transmission bitrate saving at matched quality. We assess these gains under a rate–distortion optimization (RDO) framework that jointly accounts for sender-side bitrate and receiver/display quality metrics (e.g., PSNR and SSIM).
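 
论文在率失真框架下以PSNR衡量质量、以码率节省衡量成本。下面给出这两个通用度量的NumPy示意(标准公式,非论文的RDO流程;示例数值为本文假设):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak * peak / mse)

def bitrate_saving(rate_baseline, rate_method):
    """Percent bitrate saved at matched quality (positive = method cheaper)."""
    return 100.0 * (rate_baseline - rate_method) / rate_baseline

ref = np.zeros((8, 8), dtype=np.uint8)
noisy = ref.copy()
noisy[0, 0] = 16                       # one pixel off by 16 -> MSE = 256/64 = 4
val = psnr(ref, noisy)
saving = bitrate_saving(2.0, 1.0)      # halving the rate = 50% saving
```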

[CV-72] Camera-RFID Fusion for Robust Asset Tracking in Forested Environments

【速读】:该论文旨在解决森林环境中被动式射频识别(Passive RFID)标签定位精度受限于信号衰减和多径效应导致的米级误差问题,同时克服仅依赖计算机视觉在密集场景下因空间关联歧义和部分遮挡引发的厘米级精度失效问题。解决方案的关键在于提出一种新型摄像头–RFID融合框架,通过整合深度信息与目标识别特征,并结合先进的轨迹匹配算法,实现两种传感器生成的异构轨迹的高精度关联,从而有效弥合从米级到厘米级的定位精度差距,确保资产在短暂离开摄像头视野时仍能可靠追踪。

链接: https://arxiv.org/abs/2604.26241
作者: John Hateley,Sriram Narasimhan,Omid Abari
机构: University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 Pages, 10 Figures, Submitted and awaiting acceptance at IEEE RFID

点击查看摘要

Abstract:Passive RFID tags offer a cost-effective and scalable solution for tracking numerous deployed assets. However, in forested environments, signal attenuation and multipath effects generally limit RFID spatial accuracy to the meter level. Conversely, while cameras employing stereo vision can achieve centimeter-level precision, relying solely on computer vision fails to resolve issues arising from spatial association ambiguity and partial occlusions in dense settings. Fusing these modalities allows systems to harness the high-accuracy benefits of vision while retaining the robust, non-line-of-sight identification advantages of RFID. Yet, a primary challenge in achieving this, which is the central focus of this paper, lies in accurately associating the disparate trajectories generated by these two sensors. To overcome this limitation, we introduce a novel camera–RFID fusion framework that integrates depth and object information with advanced trajectory-matching algorithms. By successfully bridging the meter-to-centimeter accuracy gap, the proposed approach helps achieve reliable tag localization even when assets temporarily leave the camera’s field of view. To the best of our knowledge, this represents the first application of camera–RFID fusion for asset tracking in natural forested environments.
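摘要指出该工作的核心难点在于关联相机与 RFID 两种传感器产生的轨迹。下面用纯 Python 给出一个与论文无关的概念示意(以一维位置序列为例,小规模时直接枚举全排列;实际系统会采用更高效的匹配算法):

```python
from itertools import permutations

def match_trajectories(cam_trajs, rfid_trajs):
    """相机轨迹与 RFID 轨迹关联的简化示意:以逐时刻距离的均值
    作为两条轨迹间的代价,枚举所有配对排列,返回总代价最小的
    配对(结果 perm[i] 表示第 i 条相机轨迹对应的 RFID 轨迹下标)。"""
    def cost(a, b):
        return sum(abs(p - q) for p, q in zip(a, b)) / len(a)
    n = len(cam_trajs)
    best, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        c = sum(cost(cam_trajs[i], rfid_trajs[perm[i]]) for i in range(n))
        if c < best_cost:
            best, best_cost = perm, c
    return list(best)
```

由于 RFID 轨迹精度在米级而相机轨迹在厘米级,基于整段轨迹的累计代价比单帧最近邻更能抵抗瞬时定位误差。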

[CV-73] EnerGS: Energy-Based Gaussian Splatting with Partial Geometric Priors

【速读】:该论文旨在解决大规模室外场景中基于3D Gaussian Splatting (3DGS) 的重建问题,其中几何先验(如LiDAR数据)常因空间覆盖不完整且分布不均而难以有效引导优化过程,甚至可能损害最终重建质量。解决方案的关键在于将部分可观测的几何信息建模为由几何证据驱动的连续能量场,并提出EnerGS方法:通过软性几何引导而非硬约束来优化高斯基元,使几何信息能够间接指导优化方向,同时避免对解空间的直接限制,从而在稀疏多视角和单目设置下显著提升光度质量和几何稳定性,并有效缓解过拟合问题。

链接: https://arxiv.org/abs/2604.26238
作者: Rui Song,Tianhui Cai,Markus Gross,Yun Zhang,Walter Zimmer,Zhiyu Huang,Olaf Wysocki,Jiaqi Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has been widely adopted for scene reconstruction, where training inherently constitutes a highly coupled and non-convex optimization problem. Recent works commonly incorporate geometric priors, such as LiDAR measurements, either for initialization or as training constraints, with the goal of improving photometric reconstruction quality. However, in large-scale outdoor scenarios, such geometric supervision is often spatially incomplete and uneven, which limits its effectiveness as a reliable prior and can even be detrimental to the final reconstruction. To address this challenge, we model partially observable geometry as a continuous energy field induced by geometric evidence and propose EnerGS. Rather than enforcing geometry as a hard constraint, EnerGS provides a soft geometric guidance for the optimization of Gaussian primitives, allowing geometric information to steer the optimization process without directly restricting the solution space. Extensive experiments on large-scale outdoor scenes demonstrate that, under both sparse multi-view and monocular settings, EnerGS consistently improves photometric quality and geometric stability, while effectively mitigating overfitting during 3DGS training.

[CV-74] DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation

【速读】:该论文旨在解决当前可控医疗视频生成缺乏可解释性的问题,即生成内容难以与物理先验(physical priors)和临床真实表现对齐。为实现从“可控”到“可解释”的跨越,其核心解决方案是提出DepthPilot框架,关键在于两个协同机制:一是通过参数高效微调将深度约束注入扩散模型骨干网络,实现显式的几何定位以保障解剖学保真度;二是引入自适应样条去噪模块,用可学习的样条函数替代固定线性权重,从而在几何约束下增强对复杂时空动态的非线性建模能力。这一设计显著提升了生成视频的物理一致性与临床可解释性。

链接: https://arxiv.org/abs/2604.26232
作者: Junhu Fu,Ke Chen,Weidong Guo,Shuyu Liang,Jie Xu,Chen Ma,Kehao Wang,Shengli Lin,Zeju Li,Yuanyuan Wang,Yi Guo,Shuo Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Controllable medical video generation has achieved remarkable progress, but it still lacks interpretability, which requires the alignment of generated contents with physical priors and faithful clinical manifestations. To push the boundaries from mere controllability to interpretability, we propose DepthPilot, the first interpretable framework for colonoscopy video generation. This work takes a step toward trustworthy generation through two synergistic paradigms. To achieve explicit geometric grounding, DepthPilot devises a prior distribution alignment strategy, injecting depth constraints into the diffusion backbone via parameter-efficient fine-tuning to ensure anatomical fidelity. To enhance intrinsic nonlinear modeling under these geometric constraints, DepthPilot employs an adaptive spline denoising module, replacing fixed linear weights with learnable spline functions to capture complex spatio-temporal dynamics. Extensive evaluations across three public datasets and in-house clinical data confirm DepthPilot’s robust ability to produce physically consistent videos. It achieves FID scores below 15 across all benchmarks and ranks first in clinician assessments, bridging the gap between “visually realistic” and “clinically interpretable”. Moreover, DepthPilot-generated videos are expected to enable reliable 3D reconstruction, facilitating surgical navigation and blind region identification, and serve as a foundation toward the colorectal world model.

[CV-75] HOI-aware Adaptive Network for Weakly-supervised Action Segmentation IJCAI2023

【速读】:该论文旨在解决弱监督动作分割(Weakly-supervised Action Segmentation)中因相似动作难以区分而导致的歧义问题,例如“倒果汁”与“倒咖啡”等动作在局部帧特征上高度相似。解决方案的关键在于引入一种基于人类-物体交互(Human-Object Interaction, HOI)的自适应网络 AdaAct,其核心机制是利用时间上全局、空间上局部的 HOI 序列作为视频级先验知识,动态调整网络参数以适应不同视频的 HOI 特征。具体而言,作者设计了一个视频 HOI 编码器提取并整合最具代表性的 HOI 信息,并进一步提出双分支 HyperNetwork 来学习一个自适应的时间编码器,在测试阶段实时根据输入视频的 HOI 序列调整模型参数,从而提升对模糊动作的判别能力。

链接: https://arxiv.org/abs/2604.26227
作者: Runzhong Zhang,Suchen Wang,Yueqi Duan,Yansong Tang,Yue Zhang,Yap-Peng Tan
机构: Nanyang Technological University (南洋理工大学); Tsinghua University (清华大学); Beijing Jiaotong University (北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IJCAI 2023

点击查看摘要

Abstract:In this paper, we propose an HOI-aware adaptive network named AdaAct for weakly-supervised action segmentation. Most existing methods learn a fixed network to predict the action of each frame with the neighboring frames. However, this would result in ambiguity when estimating similar actions, such as pouring juice and pouring coffee. To address this, we aim to exploit temporally global but spatially local human-object interactions (HOI) as video-level prior knowledge for action segmentation. The long-term HOI sequence provides crucial contextual information to distinguish ambiguous actions, where our network dynamically adapts to the given HOI sequence at test time. More specifically, we first design a video HOI encoder that extracts, selects, and integrates the most representative HOI throughout the video. Then, we propose a two-branch HyperNetwork to learn an adaptive temporal encoder, which automatically adjusts the parameters based on the HOI information of various videos on the fly. Extensive experiments on two widely-used datasets including Breakfast and 50Salads demonstrate the effectiveness of our method under different evaluation metrics.
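摘要中"HyperNetwork 根据 HOI 信息动态生成时间编码器参数"这一机制,可以用一个极简的纯 Python 示意来说明(网络结构、维度与权重均为本示意的假设,并非论文的双分支实现):

```python
def matvec(M, v):
    # 朴素的矩阵-向量乘法
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def adaptive_encoder(hoi_vec, hyper_params, frame_feat):
    """HyperNetwork 自适应编码器的极简示意:用一个固定的超网络
    线性映射 W_hyper 把 HOI 向量变换成帧编码器的展平权重,
    再用生成的权重对当前帧特征做编码。"""
    W_hyper, out_dim, in_dim = hyper_params
    flat = matvec(W_hyper, hoi_vec)             # 由 HOI 生成的展平权重
    W = [flat[i * in_dim:(i + 1) * in_dim] for i in range(out_dim)]
    return matvec(W, frame_feat)
```

关键点在于编码器权重不是训练后固定的,而是推理时由视频级 HOI 上下文即时生成,因此同一网络可针对不同视频使用不同参数。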

[CV-76] Seeking Consensus: Geometric-Semantic On-the-Fly Recalibration for Open-Vocabulary Remote Sensing Semantic Segmentation

【速读】:该论文旨在解决遥感图像中开放词汇语义分割(Open-vocabulary Semantic Segmentation, OVSS)任务中存在的语义模糊性和前景激活不完整问题,这些问题主要源于现有方法采用静态推理范式,未能考虑不同场景间独特的分布特性。解决方案的关键在于提出一个无需训练的即插即用框架SeeCo,通过寻求双重共识并经在线共识注入器(Online Consensus Injector, OCI)注入:几何共识学习(Geometric Consensus Learning, GCL)利用多视角一致观测达成几何共识,语义共识学习(Semantic Consensus Learning, SCL)则基于文本描述自适应校准语义表示,从而协同重校准视觉与文本语义,有效缓解了语义偏差和前景欠激活问题,并在八个遥感OVSS基准上实现了稳定性能提升。

链接: https://arxiv.org/abs/2604.26221
作者: Guanchun Wang,Chenxiao Wu,Xiangrong Zhang,Zelin Peng,Jianxun Lai,Tianyang Zhang,Xu Tang
机构: Xidian University (西安电子科技大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 9 figures

点击查看摘要

Abstract:Open-vocabulary semantic segmentation (OVSS) in remote sensing images is a promising task that employs textual descriptions for identifying undefined land cover categories. Despite notable advances, existing methods typically employ a static inference paradigm, overlooking the distinct distribution of each scene, resulting in semantic ambiguity in diverse land covers and incomplete foreground activation. Motivated by this, we propose Seeking Consensus, termed SeeCo, a plug-and-play framework to boost the performance of training-free OVSS models in remote sensing images, which recalibrates arbitrary OVSS models on-the-fly by seeking dual consensus: geometric consensus learning (GCL) through multi-view consistent observations and semantic consensus learning (SCL) via textual description adaptive calibration, which assists collaborative recalibration of visual and textual semantics. The two consensus are injected via an online consensus injector (OCI), effectively alleviating the under-activation and semantic bias. SeeCo requires no specific training process, yet recalibrates semantic-geometric alignment for each unique scene during inference. Extensive experiments on eight remote sensing OVSS benchmarks show consistent gains, proving its effectiveness and universality.

[CV-77] ViBE: Visual-to-M/EEG Brain Encoding via Spatio-Temporal VAE and Distribution-Aligned Projection

【速读】:该论文旨在解决脑编码模型中两个核心问题:一是如何高保真地重建神经响应,二是如何实现视觉刺激与神经响应之间的跨模态对齐。解决方案的关键在于提出了一种名为ViBE的新颖脑编码框架,其核心创新包括:首先设计了一个时空卷积变分自编码器(TSC-VAE),用于捕捉M/EEG信号的时空特性以有效重建神经响应;其次引入Q-Former将CLIP图像嵌入映射到TSC-VAE潜在空间,生成神经代理嵌入以弥合视觉特征与神经表示之间的模态差异;最后通过均方误差(MSE)损失实现逐点特征匹配,并结合切片Wasserstein距离(SWD)对齐神经代理嵌入与TSC-VAE潜在嵌入的概率分布,从而实现全面的跨模态对齐。

链接: https://arxiv.org/abs/2604.26218
作者: Ganxi Xu,Zhao-Rong Lai,Yuting Tang,Yonghao Song,Shuyan Zhou,Guoxu Zhou,Boyu Wang,Jian Zhu,Jinyi Long
机构: Jinan University (暨南大学); The First Affiliated Hospital of Jinan University (暨南大学附属第一医院); Western University (Western University); Guangdong University of Technology (广东工业大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Brain encoding models not only serve to decipher how visual stimuli are transformed into neural responses, but also represent a critical step toward visual prostheses that restore vision for patients with severe vision disorders. Brain encoding involves two fundamental steps: achieving faithful reconstruction of neural responses and establishing cross-modal alignment between visual stimuli and neural responses. To this end, we propose ViBE, a novel brain encoding framework for generating magnetoencephalography (MEG) and electroencephalography (EEG) signals from visual stimuli. Specifically, we first design a spatio-temporal convolutional variational autoencoder (TSC-VAE) that captures the spatio-temporal characteristics of M/EEG signals for effective neural response reconstruction. To bridge the modality gap between visual features and neural representations, we employ Q-Former to map CLIP image embeddings to the TSC-VAE latent space, producing neural proxy embeddings. For comprehensive cross-modal alignment, we combine mean squared error (MSE) loss for point-wise feature matching with sliced Wasserstein distance (SWD) for probability distribution alignment between the neural proxy embeddings and TSC-VAE latent embeddings. We conduct extensive experiments on the THINGS-EEG2 and THINGS-MEG datasets, demonstrating the effectiveness of our approach in generating high-quality M/EEG signals from visual stimuli.
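ViBE 在逐点 MSE 之外用切片 Wasserstein 距离(SWD)对齐两组嵌入的概率分布。SWD 的核心计算——随机一维投影、排序后比较一维经验分布——可以用纯 Python 做一个概念示意(投影数量、p=1 距离等细节为本示意的假设,与论文实现无关):

```python
import math
import random

def sliced_wasserstein(X, Y, n_proj=64, seed=0):
    """用随机一维投影近似两组 d 维样本间的 SWD(p=1):
    一维最优传输等价于把投影值排序后逐位取差,再对投影方向取平均。"""
    rng = random.Random(seed)
    d = len(X[0])
    total = 0.0
    for _ in range(n_proj):
        # 采样一个随机单位投影方向
        v = [rng.gauss(0, 1) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v)) or 1.0
        v = [c / norm for c in v]
        px = sorted(sum(x_i * v_i for x_i, v_i in zip(x, v)) for x in X)
        py = sorted(sum(y_i * v_i for y_i, v_i in zip(y, v)) for y in Y)
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_proj
```

与 MSE 按样本对一一比较不同,SWD 比较的是两组样本的整体分布形状,因此两者互补:前者保证逐点特征匹配,后者保证分布层面的对齐。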

[CV-78] Privacy-Preserving Clothing Classification using Vision Transformer for Thermal Comfort Estimation

【速读】:该论文旨在解决基于摄像头图像的暖通空调(HVAC)控制系统中用户隐私保护缺失的问题,尤其是在进行衣物隔热性能(clothing insulation)分类时如何在不泄露个人隐私的前提下保持高识别精度。传统基于像素的图像分类方法在加密图像上会导致严重准确率下降,而本文提出了一种基于视觉Transformer(Vision Transformer, ViT)的隐私保护分类方案,其关键在于利用ViT模型对加密图像进行高效特征提取与分类,在DeepFashion数据集上的实验表明,该方法在保持与明文图像相同精度的同时实现了隐私保护,显著优于传统方法。

链接: https://arxiv.org/abs/2604.26184
作者: Tatsuya Chuman,Yousuke Udagawa,Hitoshi Kiya
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: To be appeared in 2026 IEEE International Conference on Consumer Electronics - Taiwan (ICCE-TW 2026)

点击查看摘要

Abstract:A privacy-preserving clothing classification scheme is presented to enable secure occupant-centric control (OCC) systems. Although the utilization of camera images for HVAC control has been widely studied to optimize thermal comfort, privacy protection of occupant images has not been considered in prior works. While various privacy-preserving methods have been proposed for image classification, applying conventional schemes results in severe accuracy degradation. In this paper, we introduce a privacy-preserving classification method using Vision Transformer (ViT) applied to clothing insulation estimation. In an experiment using the DeepFashion dataset categorized by clothing insulation, while the conventional pixel-based method suffers a severe accuracy drop, our scheme maintains a high accuracy on encrypted images, showing no degradation from plain images across all categories.
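这一类面向 ViT 的图像加密工作常用与 patch 划分对齐的分块置乱;下面给出一个纯 Python 的简化示意(分块置乱仅为该思路的一个常见例子,具体加密方案与论文可能不同):

```python
import random

def block_scramble(image, block, key):
    """按密钥做分块置乱加密的示意:把 H×W 图像切成 block×block 的
    小块,并按密钥决定的次序重排。ViT 以块为单位做 patch embedding,
    因此块级置乱对其分类精度的影响通常较小(此处为示意假设)。"""
    h, w = len(image), len(image[0])
    assert h % block == 0 and w % block == 0
    blocks = []
    for by in range(0, h, block):
        for bx in range(0, w, block):
            blocks.append([[image[by + i][bx + j] for j in range(block)]
                           for i in range(block)])
    order = list(range(len(blocks)))
    random.Random(key).shuffle(order)   # 密钥决定置乱次序,可由密钥复现
    out = [[0] * w for _ in range(h)]
    for idx, src in enumerate(order):
        by, bx = (idx // (w // block)) * block, (idx % (w // block)) * block
        for i in range(block):
            for j in range(block):
                out[by + i][bx + j] = blocks[src][i][j]
    return out
```

加密是确定性的:持有相同密钥即可复现同一置乱,因此训练与推理阶段可以在加密域上保持一致。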

[CV-79] Lifting Embodied World Models for Planning and Control

【速读】:该论文旨在解决高维动作空间下世界模型(World Model)难以控制与规划效率低的问题,尤其是在复杂具身智能体(Embodied Agent)中,如人类形态机器人,其关节动作维度高且难以精确指定。传统基于搜索的方法(如CEM)在高维动作空间中计算开销大、效率差。解决方案的关键在于训练一个轻量级策略网络(Policy),将高层动作(High-level Action)映射为一系列低层关节动作序列;该策略与冻结的世界模型组合后形成“提升型世界模型”(Lifted World Model),能够从单一高层动作(如2D目标点)预测未来观测序列,从而显著降低动作空间维度并提升可解释性与规划效率。实验表明,该方法相比直接在低层关节空间搜索,平均关节误差降低3.8倍,同时保持更高的计算效率和环境泛化能力。

链接: https://arxiv.org/abs/2604.26182
作者: Alex N. Wang,Trevor Darrell,Pavel Izmailov,Yutong Bai,Amir Bar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:World models of embodied agents predict future observations conditioned on an action taken by the agent. For complex embodiments, action spaces are high-dimensional and difficult to specify: for example, precisely controlling a human agent requires specifying the motion of each joint. This makes the world model hard to control and expensive to plan with as search-based methods like CEM scale poorly with action dimensionality. To address this issue, we train a lightweight policy that maps high-level actions to sequences of low-level joint actions. Composing this policy with the frozen world model produces a lifted world model that predicts a sequence of future observations from a single high-level action. We instantiate this framework for a human-like embodiment, defining the high-level action space as a small set of 2D waypoints annotated on the current observation frame, each specifying a near-term goal position for a leaf joint (pelvis, head, hands). Waypoints are low-dimensional, visually interpretable, and easy to specify manually or to search over. We show that the lifted world model substantially outperforms searching directly in low-level joint space (3.8× lower mean joint error to the goal pose), while remaining more compute-efficient and generalizing to environments unseen by the policy.
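"把策略与冻结的世界模型组合成提升型世界模型"这一结构本质上是一个闭环展开:每一步由策略根据当前观测和高层动作产生低层动作,再交给世界模型预测下一观测。下面用一个一维玩具环境做纯 Python 示意(环境与策略均为假设,仅演示组合方式):

```python
def lifted_world_model(world_step, policy, obs, waypoint, horizon=4):
    """提升型世界模型的组合示意:policy(obs, waypoint) -> action 把
    高层动作展开为低层动作,world_step(obs, action) -> next_obs 是
    冻结的世界模型单步预测;循环展开即可由单个高层动作得到
    未来观测序列。"""
    trajectory = []
    for _ in range(horizon):
        action = policy(obs, waypoint)
        obs = world_step(obs, action)
        trajectory.append(obs)
    return trajectory
```

规划时只需在低维的 waypoint 空间搜索,每个候选 waypoint 经此函数展开为完整观测序列再打分,从而绕开对高维关节动作的直接搜索。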

[CV-80] Why Domain Matters: A Preliminary Study of Domain Effects in Underwater Object Detection ICRA2026

【速读】:该论文旨在解决水下环境中因训练数据与部署数据分布差异(domain shift)导致模型性能下降的问题。现有基准测试通过合成风格迁移模拟变化,但无法捕捉可见度、光照、场景构成及采集因素等真实物理特性,限制了对实际影响的分析。其解决方案的关键在于提出一种基于可测量图像、场景和采集特征的标注框架,以定义水下域,从而捕获具有物理意义的因素,实现语义一致的图像分组,并支持针对特定域的检测性能评估与失败模式分析。

链接: https://arxiv.org/abs/2604.26174
作者: Melanie Wille,Dimity Miller,Tobias Fischer,Scarlett Raine
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Poster Presentation at ICRA 2026 Workshop S2S

点击查看摘要

Abstract:Domain shift, where deviations between training and deployment data distributions degrade model performance, is a key challenge in underwater environments. Existing benchmarks testing performance for underwater domain shift simulate variability through synthetic style transfer. This fails to capture intrinsic scene factors such as visibility, illumination, scene composition, or acquisition factors, limiting analysis of real-world effects. We propose a labeling framework that defines underwater domains using measurable image, scene, and acquisition characteristics. Unlike prior benchmarks, it captures physically meaningful factors, enabling semantically consistent image grouping and supporting domain-specific evaluation of detection performance including failure analysis. We validate this on public datasets, showing systematic variations across domain factors and revealing hidden failure modes.
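摘要提出的标注框架按可测属性(能见度、光照等)把图像归入离散的水下域。其"按阈值分桶"的基本形式可以用一个极简示意表达(属性名与阈值均为本示意的假设,并非论文的具体定义):

```python
def assign_domain(image_stats, thresholds):
    """按可测图像属性划分水下域的示意:依据能见度与亮度两个指标
    的阈值,把图像归入形如 "clear-dim" 的离散域标签。"""
    vis = "clear" if image_stats["visibility"] >= thresholds["visibility"] else "turbid"
    light = "bright" if image_stats["brightness"] >= thresholds["brightness"] else "dim"
    return f"{vis}-{light}"
```

有了这样的域标签,就可以按域分组统计检测指标,定位模型在哪一类物理条件下失效。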

[CV-81] A Data-Centric Framework for Intraoperative Fluorescence Lifetime Imaging for Glioma Surgical Guidance

【速读】:该论文旨在解决胶质母细胞瘤(GBM)术中浸润边界精准评估难题,以实现最大化肿瘤切除同时保护功能脑组织。当前荧光寿命成像(FLIm)虽具备实时、无标记的生化对比优势,但其临床应用受限于生物异质性、类别不平衡及组织病理学标注变异性等问题。解决方案的关键在于提出一种以数据为中心的人工智能(Data-Centric AI, DC-AI)框架,整合置信学习(CL)、类别精炼与靶向标签评估三部分:首先通过CL量化点级置信度并识别标签不一致性,引导迭代合并为三类(低、中、高细胞密度);其次利用高质量数据训练出准确率达96%的多分类模型;最后结合SHAP分析揭示各分类的FLIm特征重要性,并识别低置信预测的生物学(如灰质成分)和采集相关因素(如血液污染),从而实现数据可靠性提升、模型鲁棒性增强与FLIm信号生物学解释的精细化。

链接: https://arxiv.org/abs/2604.26147
作者: Silvia Noble Anbunesan,Mohamed Abul Hassan,Jinyi Qi,Lisanne Kraft,Han Sung Lee,Orin Bloch,Laura Marcu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate intraoperative assessment of glioma infiltration is essential for maximizing tumor resection while preserving functional brain tissue. Fluorescence lifetime imaging (FLIm) offers real-time, label-free biochemical contrast, but its clinical utility is challenged by biological heterogeneity, class imbalance, and variability in histopathological labeling. We present a data-centric AI (DC-AI) framework that integrates confident learning (CL), class refinement, and targeted label evaluation to develop a robust multi-class FLIm classifier for glioblastoma (GBM) resection margins. FLIm data were collected from 192 tissue margins across 31 newly diagnosed IDH-wildtype GBM patients and initially labeled into seven tumor cellularity classes by an expert neuropathologist. CL was applied to quantify FLIm point-level confidence, identify label inconsistencies, and guide iterative class merging into a three-class scheme (“low”, “moderate”, “high”). The resulting high-fidelity dataset enabled training a model that achieved 96% accuracy in the three-class task. SHAP analysis revealed class-specific FLIm feature importance, highlighting distinct optical signatures across the infiltration spectrum. Targeted FLIm analysis further identified biological (e.g., gray matter composition) and acquisition-related (e.g., blood contamination) contributors to low-confidence predictions. Blinded re-evaluation of margins flagged by CL demonstrated intra-pathologist variability, underscoring the value of selective relabeling rather than exhaustive review. Together, these findings demonstrate that a DC-AI framework can systematically improve data reliability, enhance model robustness, and refine biological interpretation of FLIm signals, supporting the development of clinically actionable optical tools for real-time glioma margin assessment.
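置信学习(Confident Learning)识别标签不一致的核心思想是:用各类样本对自身标签的平均预测置信度作为类阈值,再检查每个样本是否更像别的类。下面是一个与论文实现无关的纯 Python 简化示意(阈值规则按 CL 的常见表述给出,细节为假设):

```python
def confident_learning_flags(pred_probs, labels):
    """置信学习的简化示意:每个类的阈值取被标注为该类的样本
    对该类的平均预测概率;若某样本只在别的类上越过阈值,
    则标记为疑似标注错误(返回每个样本的布尔标记)。"""
    n_classes = len(pred_probs[0])
    thresholds = []
    for k in range(n_classes):
        vals = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(vals) / len(vals) if vals else 1.0)
    flags = []
    for p, y in zip(pred_probs, labels):
        # 在越过各自类阈值的候选类中取置信度最高者
        candidates = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        best = max(candidates, key=lambda k: p[k]) if candidates else y
        flags.append(best != y)
    return flags
```

被标记的样本正对应论文中"选择性复核"的对象:只把这部分交还病理专家重新评估,而无需穷举式复查全部标注。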

[CV-82] MixerCA: An Efficient and Accurate Model for High-Performance Hyperspectral Image Classification

【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)分类中传统方法难以有效捕捉复杂空间与光谱特征的问题,同时应对现有深度学习模型在计算效率和特征表达能力之间的平衡挑战。其解决方案的关键在于提出一种轻量级网络架构MixerCA,通过引入深度卷积(depthwise convolution)、令牌与通道混合机制(token and channel mixing)以及坐标注意力模块(coordinate attention),实现空间与通道交互的解耦、保持网络中分辨率的一致性,并直接处理HSI块,从而在保证高分类精度的同时显著降低计算开销。

链接: https://arxiv.org/abs/2604.26138
作者: Mohammed Q. Alkhatib,Ali Jamali
机构: Simon Fraser University (西蒙菲莎大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint accepted for publication in “Remote Sensing Applications: Society and Environment” Journal

点击查看摘要

Abstract:Over the past decade, hyperspectral image (HSI) classification has drawn considerable interest due to HSIs’ ability to effectively distinguish terrestrial objects by capturing detailed, continuous spectral information. The strong performance of recent deep learning techniques in tasks like image classification and semantic segmentation has led to their growing use in HSI classification, due to their ability to capture complex spatial and spectral features more effectively than traditional methods. This paper presents MixerCA, a novel lightweight model for HSI classification that leverages depthwise convolution and a self-attention mechanism. MixerCA integrates depth-wise convolutions, token and channel mixing, and coordinate attention into a unified structure to decouple spatial and channel interactions, maintain consistent resolution throughout the network, and directly process HSI patches. Extensive experiments on four hyperspectral benchmark datasets reveal MixerCA’s clear advantages over several competing algorithms, including 2D-CNN, 3D-CNN, Tri-CNN, HybridSN, ViT, and Swin Transformer. The source code is publicly available at this https URL.

[CV-83] Sample Selection Using Multi-Task Autoencoders in Federated Learning with Non-IID Data

【速读】:该论文旨在解决联邦学习(Federated Learning)中因冗余、恶意或异常样本导致的模型性能下降和训练效率低下的问题。其解决方案的关键在于提出一种基于多任务自编码器(multitask autoencoder)的样本选择方法,通过损失值与特征分析联合估计样本贡献,并由中央服务器统一管理三种无监督异常检测机制——一类支持向量机(OCSVM)、孤立森林(Isolation Forest, IF)以及自适应损失阈值(Adaptive Loss Threshold, AT),用于过滤客户端上的噪声样本;同时引入由中央服务器控制的多类深度支持向量数据描述(Multi-class Deep Support Vector Data Description, SVDD)损失函数,以增强基于特征的样本筛选能力。实验表明,该方案在CIFAR10和MNIST数据集上均显著提升了模型准确率,尤其在高噪声(最高达40%)和非独立同分布(Non-IID)场景下表现优异。

链接: https://arxiv.org/abs/2604.26116
作者: Emre Ardıç,Yakup Genç
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Published in Engineering Science and Technology, an International Journal, 61 (2025), 101920. DOI: this https URL and Codes: this https URL

点击查看摘要

Abstract:Federated learning is a machine learning paradigm in which multiple devices collaboratively train a model under the supervision of a central server while ensuring data privacy. However, its performance is often hindered by redundant, malicious, or abnormal samples, leading to model degradation and inefficiency. To overcome these issues, we propose novel sample selection methods for image classification, employing a multitask autoencoder to estimate sample contributions through loss and feature analysis. Our approach incorporates unsupervised outlier detection, using one-class support vector machine (OCSVM), isolation forest (IF), and adaptive loss threshold (AT) methods managed by a central server to filter noisy samples on clients. We also propose a multi-class deep support vector data description (SVDD) loss controlled by a central server to enhance feature-based sample selection. We validate our methods on CIFAR10 and MNIST datasets across varying numbers of clients, non-IID distributions, and noise levels up to 40%. The results show significant accuracy improvements with loss-based sample selection, achieving gains of up to 7.02% on CIFAR10 with OCSVM and 1.83% on MNIST with AT. Additionally, our federated SVDD loss further improves feature-based sample selection, yielding accuracy gains of up to 0.99% on CIFAR10 with OCSVM. These results show the effectiveness of our methods in improving model accuracy across various client counts and noise conditions.
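摘要中的自适应损失阈值(AT)方法在客户端按损失统计量过滤噪声样本。一个常见且易于实现的形式是"均值加若干倍标准差"阈值,下面给出纯 Python 示意(阈值形式与系数为本示意的假设,并非论文的精确定义):

```python
import statistics

def adaptive_loss_filter(losses, k=2.0):
    """自适应损失阈值(AT)的简化示意:以 均值 + k·标准差 为阈值,
    损失超过阈值的样本视为噪声被过滤,返回保留样本的下标。
    阈值随本地损失分布自适应,无需手工设定固定数值。"""
    mu = statistics.fmean(losses)
    sigma = statistics.pstdev(losses)
    threshold = mu + k * sigma
    return [i for i, loss in enumerate(losses) if loss <= threshold]
```

在联邦设置下,这类过滤在各客户端本地执行,只有过滤后样本参与本地训练,原始数据始终不出本地,与联邦学习的隐私约束兼容。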

[CV-84] FruitProM-V2: Robust Probabilistic Maturity Estimation and Detection of Fruits and Vegetables

【速读】:该论文旨在解决基于视觉的果实成熟度识别中,将连续的生理成熟过程简化为离散多分类任务所导致的边界模糊与不确定性低估问题。传统方法通常将成熟度划分为若干类别,但实际中相邻阶段在视觉上差异微小,易引发标注不一致和模型误判。为此,作者提出将成熟度建模为潜在的连续变量,并采用分布检测头(distributional detection head)进行概率预测,通过累积分布函数(CDF)将分布转化为类别概率。该方案的关键在于显式建模成熟度的不确定性,不仅在干净标签下保持与标准检测器相当的性能,还在引入可控标签噪声时展现出更强的鲁棒性,从而提升视觉成熟度估计的可靠性。

链接: https://arxiv.org/abs/2604.26084
作者: Rahul Harsha Cheppally,Sidharth Rai,Sudan Baral,Benjamin Vail,Ajay Sharda
机构: Kansas State University (堪萨斯州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Accurate fruit maturity identification is essential for determining harvest timing, as incorrect assessment directly affects yield and post-harvest quality. Although ripening is a continuous biological process, vision-based maturity estimation is typically formulated as a multi-class classification task, which imposes sharp boundaries between visually similar stages. To examine this limitation, we perform an annotation reliability study with two independent annotators on a held-out tomato dataset and observe disagreement concentrated near adjacent maturity stages. Motivated by this observation, we model maturity as a latent continuous variable and predict it probabilistically using a distributional detection head, converting the distribution into class probabilities through the cumulative distribution function (CDF). The proposed formulation maintains comparable performance to a standard detector under clean labels while better representing uncertainty. Furthermore, when controlled label noise is introduced during training, the probabilistic model demonstrates improved robustness relative to the baseline, indicating that explicitly modeling maturity uncertainty leads to more reliable visual maturity estimation.
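论文把成熟度建模为潜在连续变量,再经 CDF 把预测分布切成类别概率。以高斯分布和固定分箱边界为例(分布族与分箱方式为本示意的假设),P(class k) 即相邻边界处 CDF 的差:

```python
import math

def gaussian_cdf(x, mu, sigma):
    # 标准正态 CDF 经 erf 表达:Phi((x-mu)/sigma)
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def maturity_class_probs(mu, sigma, bin_edges):
    """把潜在成熟度分布离散化为阶段概率的示意:
    P(class k) = CDF(edge_{k+1}) - CDF(edge_k),首尾分箱延伸到无穷,
    因此各类概率恒和为 1。"""
    edges = [-math.inf] + list(bin_edges) + [math.inf]
    probs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        c_hi = 1.0 if hi == math.inf else gaussian_cdf(hi, mu, sigma)
        c_lo = 0.0 if lo == -math.inf else gaussian_cdf(lo, mu, sigma)
        probs.append(c_hi - c_lo)
    return probs
```

当预测均值落在分箱边界附近时,相邻两类的概率自然接近,这正是该建模方式对"相邻阶段标注分歧"更稳健的原因。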

[CV-85] RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

【速读】:该论文旨在解决动态环境中语义SLAM(Simultaneous Localization and Mapping)系统在开放词汇(open-vocabulary)条件下实现几何感知的语义定位问题,即如何将任意自然语言查询与动态场景中的3D区域或物体进行精准关联。传统方法通常依赖于校准的RGB-D输入、已知相机内参及静态场景假设,限制了其在真实世界中的部署能力。解决方案的关键在于提出RADIO-ViPE系统,该系统直接处理原始单目RGB视频流,无需预先标定相机参数、深度传感器或位姿初始化;通过将来自聚合基础模型(如RADIO)的多模态嵌入(视觉-语言)与几何场景信息紧密耦合,在初始化、优化和因子图连接阶段提升地图的一致性,并引入自适应鲁棒核以应对主动移动物体和代理位移的场景元素(如家具重排)。这一设计显著提升了系统在复杂动态环境下的鲁棒性和开放词汇语义定位能力。

链接: https://arxiv.org/abs/2604.26067
作者: Zaid Nasser,Mikhail Iumanov,Tianhao Li,Maxim Popov,Jaafar Mahmoud,Sergey Kolyubin
机构: ITMO University (圣彼得堡国立信息技术机械与光学大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present RADIO-ViPE (Reduce All Domains Into One – Video Pose Engine), an online semantic SLAM system that enables geometry-aware open-vocabulary grounding, associating arbitrary natural language queries with localized 3D regions and objects in dynamic environments. Unlike existing approaches that require calibrated, posed RGB-D input, RADIO-ViPE operates directly on raw monocular RGB video streams, requiring no prior camera intrinsics, depth sensors, or pose initialization. The system tightly couples multi-modal embeddings – spanning vision and language – derived from agglomerative foundation models (e.g., RADIO) with geometric scene information. This coupling takes place in initialization, optimization and factor graph connections to improve the consistency of the map from multiple modalities. The optimization is wrapped within adaptive robust kernels, designed to handle both actively moving objects and agent-displaced scene elements (e.g., furniture rearranged during ego-centric session). Experiments demonstrate that RADIO-ViPE achieves state-of-the-art results on the dynamic TUM-RGBD benchmark while maintaining competitive performance against offline open-vocabulary methods that rely on calibrated data and static scene assumptions. RADIO-ViPE bridges a critical gap in real-world deployment, enabling robust open-vocabulary semantic grounding for autonomous robotics and unconstrained in-the-wild video streams. Project page: this https URL
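摘要提到优化被包裹在自适应鲁棒核中以抑制动态物体产生的外点残差。以经典的 Huber 核为例(具体核函数族为本示意的假设),其在迭代重加权最小二乘(IRLS)中的权重形式非常简单:

```python
def huber_weight(residual, delta=1.0):
    """Huber 鲁棒核对应的 IRLS 权重:小残差(|r| <= delta)权重为 1,
    等价于普通最小二乘;大残差按 delta/|r| 衰减,从而削弱动态
    外点对位姿/地图优化的影响。"""
    r = abs(residual)
    return 1.0 if r <= delta else delta / r
```

在因子图优化中,每条残差边乘以该权重后再求解,即可让静态场景约束主导解,而移动物体与被挪动的家具只产生被压低的贡献。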

[CV-86] Evaluating the Alignment Between GeoAI Explanations and Domain Knowledge in Satellite-Based Flood Mapping

【速读】:该论文旨在解决深度学习模型在地球观测领域中缺乏可解释性的问题,特别是其决策过程的“黑箱”特性阻碍了其在科学与业务流程中的可靠应用。为应对这一挑战,研究提出ADAGE(Alignment between Domain Knowledge And GeoAI Explanation Evaluation)框架,其核心在于通过Channel-Group SHAP方法量化输入通道组对像素级预测的贡献,并系统评估模型解释与遥感领域知识之间的对齐程度。该框架不仅能够定量衡量解释与基于领域知识生成的参考解释的一致性,还能帮助专家识别不一致的解释,从而提升GeoAI模型在洪水监测等任务中的可信度和实用性。

链接: https://arxiv.org/abs/2604.26051
作者: Hyunho Lee,Wenwen Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 21 pages, 6 figures, 5 tables

点击查看摘要

Abstract:The increasing number of satellites has improved the temporal resolution of Earth observation, making satellite-based flood mapping a promising approach for operational flood monitoring. Deep learning-based approaches for flood mapping using satellite imagery, an important application within Geospatial Artificial Intelligence (GeoAI), have shown improved predictive performance by learning complex spatial and spectral patterns from large volumes of remote sensing data. However, the opaque decision-making processes of deep learning models remain a major barrier to their integration into critical scientific and operational workflows. This highlights the need for a systematic assessment of whether model explanations align with established domain knowledge in remote sensing. To address this research gap, this study introduces the ADAGE (Alignment between Domain Knowledge And GeoAI Explanation Evaluation) framework. The proposed framework is designed to systematically evaluate how well explanations of deep learning models align with established remote sensing knowledge, particularly regarding the distinctive spectral properties of the Earth’s surface. The ADAGE framework employs Channel-Group SHAP (SHapley Additive exPlanations) method to estimate the contributions of grouped input channels to pixel-level predictions. Experiments on two satellite-based flood mapping tasks demonstrate that the ADAGE framework can (1) quantitatively assess the alignment between model explanations and reference explanations derived from domain knowledge and (2) help domain experts identify misaligned explanations through alignment scores. This study contributes to bridging the gap between explainability and domain knowledge in GeoAI for Earth observation, enhancing the applicability of GeoAI models in scientific and operational workflows.
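Channel-Group SHAP 把输入通道按组计算贡献;当组数很少时,可以直接按 Shapley 值定义枚举所有联盟精确求解。下面是一个与论文实现无关的纯 Python 示意(value_fn 抽象代表"仅启用某些通道组时的模型输出"):

```python
from itertools import combinations
from math import factorial

def group_shapley(value_fn, n_groups):
    """通道组 Shapley 值的精确计算示意:value_fn 接受一个组下标
    集合,返回仅启用这些通道组时的模型输出;对每个组枚举其余
    组的所有联盟,按 Shapley 权重累加边际贡献。组数少时可行,
    组数多时需换用采样近似。"""
    groups = list(range(n_groups))
    n = n_groups
    phi = [0.0] * n
    for g in groups:
        others = [x for x in groups if x != g]
        for size in range(len(others) + 1):
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[g] += weight * (value_fn(s | {g}) - value_fn(s))
    return phi
```

对可加的 value_fn,每组的 Shapley 值恰等于其单独贡献,这可作为实现正确性的快速检验。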

[CV-87] Report of the 5th PVUW Challenge: Towards More Diverse Modalities in Pixel-Level Understanding CVPR2026

【速读】:该论文旨在解决在高度非约束条件下进行像素级视频理解(pixel-level video understanding)的难题,具体包括复杂场景中的目标跟踪、基于运动描述的文本定位以及声学驱动的目标分割。其解决方案的关键在于通过设立三个专业化赛道(MOSE、MeViS-Text 和 MeViS-Audio)来系统评估当前最先进的多模态模型性能,并引入此前未公开的高挑战性数据集,从而推动社区在鲁棒视频场景理解方面的技术进步。

链接: https://arxiv.org/abs/2604.26031
作者: Chang Liu,Henghui Ding,Nikhila Ravi,Yunchao Wei,Shuting He,Song Bai,Philip Torr,Leilei Cao,Jinrong Zhang,Deshui Miao,Xusheng He,Dengxian Gong,Zhiyu Wang,Mingqi Gao,Jihwan Hong,Canyang Wu,Weili Guan,Jianlong Wu,Liqiang Nie,Xingsen Huang,Yameng Gu,Xiaogang Yu,Xin Li,Ming-Hsuan Yang,Sijie Li,Jungong Han,Quanzhu Niu,Shihao Chen,Yuanzheng Wu,Yikang Zhou,Tao Zhang,Haobo Yuan,Lu Qi,Shunping Ji,Chao Yang,Chao Tian,Guoqing Zhu,Kai Yang,Zhifan Mo,Haijun Zhang,Xudong Kang,Shutao Li,Jaeyoung Do
机构: Harbin Institute of Technology, Shenzhen, China; Shenzhen Loop Area Institute, China; Pengcheng Laboratory; Guangzhou Hengyan Technology; University of California at Merced; School of Computer Science, University of Sheffield; Department of Automation, Tsinghua University; Wuhan University; University of California, Merced; Wuhan Textile University; Yixiang Innovation Technology (Shenzhen); Harbin Institute of Technology; Hunan University; AIDAS Laboratory, IPAI; ECE, Seoul National University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Official Report of the 5th PVUW Challenge on CVPR 2026

点击查看摘要

Abstract:This report summarizes the objectives, datasets, and top-performing methodologies of the 2026 Pixel-level Video Understanding in the Wild (PVUW) Challenge, hosted at CVPR 2026, which evaluates state-of-the-art models under highly unconstrained conditions. To provide a comprehensive assessment, the 2026 edition features three specialized tracks: the MOSE track for tracking objects within densely cluttered and severely occluded scenarios; the MeViS-Text track for localizing targets via motion-focused linguistic expressions; and the newly inaugurated MeViS-Audio track, which pioneers acoustic-driven object segmentation. By introducing previously unreleased challenging data and analyzing the cutting-edge, multimodal solutions submitted by participants, this report highlights the community’s latest technical advancements and charts promising future directions for robust video scene comprehension.

[CV-88] Generalized Disguise Makeup Presentation Attack Detection Using an Attention-Guided Patch-Based Framework

【速读】:该论文旨在解决伪装化妆(Disguise Makeup)欺骗攻击对人脸识别系统造成的安全威胁问题,这类攻击利用高级化妆品、假体组件和人工材料真实地改变面部外观,使得检测难度极大,甚至人类也难以识别。解决方案的关键在于提出了一种两阶段的通用伪装化妆呈现攻击检测框架:第一阶段采用基于度量学习(Metric Learning)训练的风格不变全脸模型,并通过白化变换(Whitening Transformation)增强特征表示,结合Grad-CAM生成区域注意力得分;第二阶段则基于这些得分引导局部分析,使用区域特定子网络进行细粒度判别,同样依赖度量学习实现精准区分。该方法在新构建的真实场景数据集和SIW-Mv2上均表现出优异的泛化性能与鲁棒性。

链接: https://arxiv.org/abs/2604.26025
作者: Fateme Taraghi,Atefe Aghaei,Mohsen Ebrahimi Moghaddam
机构: Shahid Beheshti University (沙希德·贝赫什提大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite significant advances in facial recognition systems, they remain vulnerable to face presentation attacks. Among them, disguise makeup attacks are particularly challenging, as they use advanced cosmetics, prosthetic components, and artificial materials to realistically alter facial appearance, often making detection difficult even for humans. Despite their importance, this problem remains underexplored, and publicly available datasets are limited. To address this, we propose a generalized disguise makeup presentation attack detection framework. The method adopts a two-phase design in which a style-invariant full-face model, trained with metric learning and enhanced by a whitening transformation, extracts region attention scores via Grad-CAM. These scores guide a patch-based phase that performs localized analysis using region-specific subnetworks trained with metric learning for fine-grained discrimination. We also construct a new, diverse dataset of live and disguise makeup faces collected under real-world conditions, covering variations in subjects, environments, and disguise materials. Experimental results demonstrate strong generalization across both the collected dataset and SIW-Mv2, achieving 8.97% ACER and 9.76% EER on the collected dataset, and 0% ACER on Obfuscation and Impersonation and 1.34% on Cosmetics attacks of SIW-Mv2. The proposed method consistently outperforms prior works while maintaining robust performance across other spoof types.

[CV-89] SAND: Spatially Adaptive Network Depth for Fast Sampling of Neural Implicit Surfaces

【速读】:该论文旨在解决隐式神经表示(Implicit Neural Representations)在实际应用中因网络评估计算成本过高而导致效率低下的问题。传统方法对所有查询点采用相同的网络深度和计算开销,忽略了空间上不同区域的几何复杂度差异以及距离目标表面远近对精度需求的变化,从而造成大量冗余计算。解决方案的关键在于提出一种空间自适应网络深度(Spatially Adaptive Network Depth, SAND)框架:通过构建体素级的网络深度图来记录各空间区域达到足够精度所需的最小网络深度,并结合尾部多层感知机(T-MLP)——一种在每一隐藏层附加输出分支(tail)的改进型MLP结构——实现网络评估过程中的自适应终止机制,使计算资源优先分配至几何复杂且重要的区域,从而显著提升推理阶段的查询速度,同时保持高保真度的几何表示。

链接: https://arxiv.org/abs/2604.25936
作者: Chuanxiang Yang,Junhui Hou,Yuan Liu,Siyu Ren,Guangshun Wei,Taku Komura,Yuanfeng Zhou,Wenping Wang
机构: Shandong University (山东大学); City University of Hong Kong (香港城市大学); Hong Kong University of Science and Technology (香港科技大学); The University of Hong Kong (香港大学); Texas A&M University (德州农工大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Implicit neural representations are powerful for geometric modeling, but their practical use is often limited by the high computational cost of network evaluations. We observe that implicit representations require progressively lower accuracy as query points move farther from the target surface, and that even within the same iso-surface, representation difficulty varies spatially with local geometric complexity. However, conventional neural implicit models evaluate all query points with the same network depth and computational cost, ignoring this spatial variation and thereby incurring substantial computational waste. Motivated by this observation, we propose an efficient neural implicit geometry representation framework with spatially adaptive network depth (SAND). SAND leverages a volumetric network-depth map together with a tailed multi-layer perceptron (T-MLP) to model implicit representation. The volumetric depth map records, for each spatial region, the network depth required to achieve sufficient accuracy, while the T-MLP is a modified MLP designed to learn implicit functions such as signed distance functions, where an output branch, referred to as a tail, is attached to each hidden layer. This design allows network evaluation to terminate adaptively without traversing the full network and directs computational resources to geometrically important and complex regions, improving efficiency while preserving high-fidelity representations. Extensive experimental results demonstrate that our approach can significantly improve the inference-time query speed of implicit neural representations.
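摘要中 T-MLP 的做法(每个隐藏层附带一个输出分支“tail”,由体素级深度图决定在第几层提前终止)可以用一个极简 NumPy 草图示意。注意这只是示意性实现:网络规模、tanh 激活、深度图取值与全部变量名均为本示意的假设,并非论文官方代码:

```python
import numpy as np

rng = np.random.default_rng(0)

class TailedMLP:
    """带尾部输出分支的 MLP 草图:每个隐藏层附带一个标量输出头(tail)。"""
    def __init__(self, in_dim=3, hidden=16, depth=4):
        self.layers = [rng.normal(size=(in_dim if i == 0 else hidden, hidden))
                       for i in range(depth)]
        self.tails = [rng.normal(size=(hidden, 1)) for _ in range(depth)]

    def query(self, x, depth):
        # 根据深度图给出的 depth,只前向传播 depth 层,读取该层的 tail
        h = x
        for i in range(depth):
            h = np.tanh(h @ self.layers[i])
        return (h @ self.tails[depth - 1]).item()

net = TailedMLP()
# 假设的体素级深度图:远离表面的区域用浅层,几何复杂区域用深层
depth_map = {"far_from_surface": 1, "smooth_region": 2, "complex_region": 4}
x = rng.normal(size=(3,))
sdf_cheap = net.query(x, depth_map["far_from_surface"])  # 只计算 1 层
sdf_full = net.query(x, depth_map["complex_region"])     # 计算全部 4 层
```

计算量随查询点所在区域的深度值线性缩放,这正是“把算力集中到几何重要区域”的直观体现。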

[CV-90] Circular Phase Representation and Geometry-Aware Optimization for Ptychographic Image Reconstruction

【速读】:该论文旨在解决传统迭代重建方法计算复杂度高、难以满足高通量和实时ptychography需求的问题,同时克服现有深度学习方法在相位预测中忽略其2π周期性而导致的包裹伪影(wrapping artifacts)、±π处不连续以及损失函数与信号几何结构不匹配的问题。解决方案的关键在于:将相位建模为单位圆上的分布,通过余弦(cosine)和正弦(sine)分量进行表示,并采用可微分的测地线损失(geodesic loss)优化相位误差,从而避免分支切割不连续性并提供有界梯度;此外,网络引入饱和感知的双增益输入缩放、并行编码分支及三个解码器(分别预测振幅、余弦和正弦),并通过复合损失函数确保圆形一致性与结构保真度,实现了更准确且物理一致的相位重建。

链接: https://arxiv.org/abs/2604.26664
作者: Carson Yu Liu,Jun Cheng,Chien-Chun Chen,Steve F. Shu
机构: University of Sydney(悉尼大学); Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR)(新加坡科技研究局); National Tsing Hua University(台湾清华大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
备注:

点击查看摘要

Abstract:Traditional iterative reconstruction methods are accurate but computationally expensive, limiting their use in high-throughput and real-time ptychography. Recent deep learning approaches improve speed, but often predict phase as a Euclidean scalar despite its 2π periodicity, which can introduce wrapping artifacts, discontinuities at ±π, and a mismatch between the loss and the underlying signal geometry. We present a deep learning framework for ptychographic reconstruction that models phase on the unit circle using cosine and sine components. Phase error is optimized with a differentiable geodesic loss, which avoids branch-cut discontinuities and provides bounded gradients. The network further incorporates saturation-aware dual-gain input scaling, parallel encoder branches, and three decoders for amplitude, cosine, and sine prediction, together with a composite loss that promotes circular consistency and structural fidelity. Experiments on synthetic and experimental datasets show consistent improvements in both amplitude and phase reconstruction over existing deep learning methods. Frequency-domain analysis further shows better preservation of mid- and high-frequency phase content. The proposed method also provides substantial speedup over iterative solvers while maintaining physically consistent reconstructions.
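摘要中“单位圆相位表示 + 测地线损失”的核心可以用几行 NumPy 说明:把相位表示为 (cos, sin) 并在圆上度量角度差,即可避免 ±π 处的包裹问题。以下为示意性草图(函数名与数值均为假设,非论文实现):

```python
import numpy as np

def geodesic_loss(cos_pred, sin_pred, phase_true):
    # 将预测的 (cos, sin) 归一化到单位圆上
    norm = np.sqrt(cos_pred**2 + sin_pred**2) + 1e-12
    c, s = cos_pred / norm, sin_pred / norm
    phase_pred = np.arctan2(s, c)
    # 测地线距离:预测相位与真实相位在圆上的最短弧长
    diff = np.arctan2(np.sin(phase_pred - phase_true),
                      np.cos(phase_pred - phase_true))
    return np.mean(np.abs(diff))

# 包裹边界示例:真实相位 +3.1 与预测 -3.1 在圆上仅相差约 0.083 弧度
loss_wrap = geodesic_loss(np.cos([-3.1]), np.sin([-3.1]), np.array([3.1]))
loss_l2 = abs(3.1 - (-3.1))  # 欧氏标量损失却给出 6.2
```

在包裹边界附近,欧氏标量损失给出约 6.2 的虚高误差并产生错误梯度方向,而测地线距离正确地给出约 0.08 弧度。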

[CV-91] Adaptive Transform Coding for Semantic Compression

【速读】:该论文旨在解决视觉数据压缩从以人为中心的重建目标向机器感知导向的语义特征编码转变过程中,如何更高效地对异质性特征分布进行压缩的问题。传统方法在处理来自视觉骨干网络或基础模型的语义特征时,往往采用统一的变换与量化策略,难以适应不同特征模式的统计特性。解决方案的关键在于提出一种基于高斯混合模型条件率失真函数的自适应变换编码方法,通过根据推断出的源成分选择特定于模式的变换和量化器,实现对复杂特征分布的精细化建模与压缩,从而在保持压缩效率的同时兼顾灵活性与可解释性。

链接: https://arxiv.org/abs/2604.26492
作者: Andriy Enttsel,Vincent Corlay
机构: Mitsubishi Electric R&D Centre Europe (三菱电机欧洲研发中心)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Signal Processing (eess.SP)
备注: 7 pages, 4 figures

点击查看摘要

Abstract:Visual data compression is shifting from human-centered reconstruction to machine-oriented representation coding. In this setting, an image is often mapped to a compact semantic embedding, which is then compressed and transmitted for downstream inference. We propose an adaptive transform-coding method for semantic-feature compression motivated by the conditional rate-distortion function of a Gaussian mixture model. The scheme uses mode-dependent transforms and quantizers selected according to the inferred source component, enabling more efficient coding of heterogeneous feature distributions. Evaluations on features from widely used vision backbones and foundation models show that the proposed method outperforms or is competitive with state-of-the-art neural compression methods while preserving flexibility and interpretability.
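摘要所述“按推断出的源成分选择特定于模式的变换”的收益,可以用一个两模式高斯的玩具实验说明:对每个模式单独做 KLT(去相关变换)得到的编码增益,明显高于忽略模式差异的单一全局变换。以下为示意性草图(数据分布与参数均为假设,与论文方法仅在思想上对应):

```python
import numpy as np

rng = np.random.default_rng(1)

def klt(X):
    # 对样本协方差做特征分解,得到去相关的正交变换(KLT)
    cov = np.cov(X, rowvar=False)
    _, vecs = np.linalg.eigh(cov)
    return vecs

def coding_gain(X, T):
    # 变换编码增益的常用代理:系数方差的算术均值 / 几何均值
    var = np.var(X @ T, axis=0)
    return var.mean() / np.exp(np.log(var).mean())

# 两个“模式”:主轴方向相差 45° 的各向异性高斯
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A = rng.normal(size=(4000, 2)) * np.array([4.0, 0.5])          # 模式1:轴对齐
B = (rng.normal(size=(4000, 2)) * np.array([4.0, 0.5])) @ R.T  # 模式2:旋转45°

T_global = klt(np.vstack([A, B]))   # 单一全局变换(忽略模式差异)
gain_matched = min(coding_gain(A, klt(A)), coding_gain(B, klt(B)))
gain_global = max(coding_gain(A, T_global), coding_gain(B, T_global))
```

即便取每模式专属变换中最差的一个,其编码增益也高于全局变换中最好的一个,直观解释了模式相关变换对异质特征分布的优势。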

人工智能

[AI-0] Causal Learning with Neural Assemblies

【速读】:该论文试图解决的问题是:神经组装(Neural Assemblies)是否能够学习变量之间的因果方向性(causal directionality)。尽管神经组装已被证明在分类、解析和规划等任务中具有计算通用性,但其能否内化因果方向仍不清楚。解决方案的关键在于提出DIRECT(DIRectional Edge Coupling/Training)机制,该机制通过自适应增益调度协同激活源组装与目标组装,并利用神经组装固有的三种操作——投影(projection)、局部可塑性控制(local plasticity control)和稀疏获胜选择(sparse winner selection)——实现方向性学习。DIRECT仅依赖局部可塑性,使得因果推断具备机制层面的可审计性(auditable),并通过突触强度不对称性和功能传播重叠双重验证策略实现了结构恢复的准确性,从而为生物合理动力学与形式因果模型之间建立了一座可解释的设计桥梁。

链接: https://arxiv.org/abs/2604.26919
作者: Evangelia Kopadi,Dimitris Kalles
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 8 pages, 11 figures

点击查看摘要

Abstract:Can Neural Assemblies – groups of neurons that fire together and strengthen through co-activation – learn the direction of causal influence between variables? While established as a computationally general substrate for classification, parsing, and planning, neural assemblies have not yet been shown to internalize causal directionality. We demonstrate that the inherent operations of neural assemblies – projection, local plasticity control, and sparse winner selection – are sufficient for directional learning. We introduce DIRECT (DIRectional Edge Coupling/Training), a mechanism that co-activates source and target assemblies under an adaptive gain schedule to internalize directed relations. Unlike backpropagation-based methods, DIRECT relies solely on local plasticity, making the resulting causal claims auditable at the mechanism level. Our findings are verified through a dual-readout validation strategy: (i) synaptic-strength asymmetry, measuring the emergent weight gap between forward and reverse links, and (ii) functional propagation overlap, quantifying the reliability of directional signal flow. Across multiple domains, the framework achieves perfect structural recovery under a supervised, known-structure setting. These results establish neural assemblies as an auditable bridge between biologically plausible dynamics and formal causal models, offering an “explainable by design” framework where causal claims are traceable to specific neural winners and synaptic asymmetries.
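DIRECT 的“共激活 + 局部可塑性 + 稀疏获胜者”机制可以用一个玩具模型粗略示意:仅对“源→目标”方向上的获胜神经元对施加乘性 Hebbian 增强,随后即可通过突触强度不对称性读出学到的方向。以下草图中的规模、增益调度(此处简化为常数)与更新规则均为本示意的假设,并非论文实现:

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 50, 5                         # 每个组装 50 个神经元,稀疏获胜者取 k 个
W_fwd = rng.random((n, n)) * 0.01    # 源 → 目标 突触
W_rev = rng.random((n, n)) * 0.01    # 目标 → 源 突触(未被定向训练)

src = rng.choice(n, k, replace=False)   # 源组装的获胜神经元
tgt = rng.choice(n, k, replace=False)   # 目标组装的获胜神经元

def coactivate(gain):
    # 共激活:仅对“先激活的源 → 后激活的目标”方向施加局部乘性增强
    for i in src:
        for j in tgt:
            W_fwd[i, j] += gain * W_fwd[i, j]

for step in range(20):
    coactivate(gain=0.5)

# 双重读出之一:突触强度不对称性(forward 远强于 reverse 即学到方向)
fwd_strength = W_fwd[np.ix_(src, tgt)].mean()
rev_strength = W_rev[np.ix_(tgt, src)].mean()
asymmetry = fwd_strength - rev_strength
```

由于更新只依赖获胜神经元的局部连接,哪些突触被增强、增强了多少都可以直接检视,这正是文中“机制层面可审计”的含义。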

[AI-1] Resume-ing Control: (Mis)Perceptions of Agency Around GenAI Use in Recruiting Workflows

【速读】:该论文旨在解决生成式 AI(Generative AI)在高风险决策场景(如招聘)中如何影响专业人士对决策流程的控制感与代理权的问题。研究发现,尽管招聘人员声称对整个招聘流程拥有最终决策权,但生成式 AI 已悄然成为隐形架构师,重塑了从职位定义到面试表现评估等基础信息构建环节;同时,AI 的采纳往往并非出于自愿,而是受组织高层推动、应对求职者使用 AI 以及个体提升效率需求所迫。解决方案的关键在于确保生成式 AI 在招聘中的应用具备可感知性(perceptible)和责任性(responsible),以避免招聘人员技能退化,并保障其对关键决策的实质性监督能力。

链接: https://arxiv.org/abs/2604.26851
作者: Sajel Surati,Rosanna Bellini,Emily Black
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 22 pages, 3 tables, submitted January 2026, accepted March 2026

点击查看摘要

Abstract:When generative AI (genAI) systems are used in high-stakes decision-making, its recommended role is to aid, rather than replace, human decision-making. However, there is little empirical exploration of how professionals making high-stakes decisions, such as those related to employment, perceive their agency and level of control when working with genAI systems. Through interviews with 22 recruiting professionals, we investigate how genAI subtly influences control over everyday workflows and even individual hiring decisions. Our findings highlight a pressing conflict: while recruiters believe they have final authority across the recruiting pipeline, genAI has become an invisible architect that shapes the foundational building blocks of information used for evaluation, from defining a job to determining good interview performances. The decision of whether or not to adopt was also often outside recruiters’ control, with many feeling compelled to adopt genAI due to calls to integrate AI from higher-ups in their business, to combat applicant use of AI, and the individual need to boost productivity. Despite a seemingly seismic shift in how recruiting happens, participants only reported marginal efficiency gains. Such gains came at the high cost of recruiter deskilling, a trend that jeopardizes the meaningful oversight of decision-making. We conclude by discussing the implications of such findings for responsible and perceptible genAI use in hiring contexts.

[AI-2] Rule-based High-Level Coaching for Goal-Conditioned Reinforcement Learning in Search-and-Rescue UAV Missions Under Limited-Simulation Training

【速读】:该论文旨在解决无人飞行器(UAV)在有限仿真训练条件下执行搜救(Search-and-Rescue, SAR)任务时面临的早期安全性和样本效率问题。其关键解决方案是提出一种分层决策框架,其中高层 Advisor 采用离线构建的确定性规则提供可解释的任务与安全引导,而低层控制器则通过在线目标条件强化学习(goal-conditioned reinforcement learning, RL)从密集奖励中持续学习,并利用基于规则的元数据增强模式感知优先回放缓冲区以复用经验。该设计在不依赖预训练的前提下显著提升了早期适应能力与安全性,同时保持对场景特异性动态的在线调整能力。

链接: https://arxiv.org/abs/2604.26833
作者: Mahya Ramezani,Holger Voos
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper presents a hierarchical decision-making framework for unmanned aerial vehicle (UAV) missions motivated by search-and-rescue (SAR) scenarios under limited simulation training. The framework combines a fixed rule-based high-level advisor with an online goal-conditioned low-level reinforcement learning (RL) controller. To stress-test early adaptation, we also consider a strict no-pretraining deployment regime. The high-level advisor is defined offline from a structured task specification and compiled into deterministic rules. It provides interpretable mission- and safety-aware guidance through recommended actions, avoided actions, and regime-dependent arbitration weights. The low-level controller learns online from task-defined dense rewards and reuses experience through a mode-aware prioritized replay mechanism augmented with rule-derived metadata. We evaluate the framework on two tasks: battery-aware multi-goal delivery and moving-target delivery in obstacle-rich environments. Across both tasks, the proposed method improves early safety and sample efficiency primarily by reducing collision terminations, while preserving the ability to adapt online to scenario-specific dynamics.

[AI-3] Random Cloud: Finding Minimal Neural Architectures Without Training

【速读】:该论文旨在解决神经网络结构搜索(Neural Architecture Search, NAS)中传统训练-剪枝-再训练流程效率低下的问题,尤其是避免对完整大模型进行多次训练以寻找最优稀疏结构的高计算开销。其核心解决方案是提出“随机云”(Random Cloud)方法,一种无需训练的架构搜索策略:通过随机初始化网络并仅基于前向传播评估其性能,结合随机探索与渐进式结构简化机制逐步削减冗余连接和节点,最终仅对筛选出的最小候选拓扑进行一次训练。该方法在7个分类基准上验证有效,相比幅度剪枝(magnitude pruning)和随机剪枝(random pruning)基线,在6/7数据集上达到相当或更优结果,且在4/5数据集中显著降低计算成本(仅为全量训练的0.67–0.94倍)。

链接: https://arxiv.org/abs/2604.26830
作者: Javier Gil Blázquez
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:I propose the Random Cloud method, a training-free approach to neural architecture search that discovers minimal feedforward network topologies through stochastic exploration and progressive structural reduction. Unlike post-training pruning methods that require a full train-prune-retrain cycle, this method evaluates randomly initialized networks without backpropagation, progressively reduces their topology, and only trains the best minimal candidate at the end. I evaluate on 7 classification benchmarks against magnitude pruning and random pruning baselines. The Random Cloud matches or outperforms both baselines in 6 of 7 datasets, achieving statistically significant improvements on Sonar (+4.9 pp accuracy, p=0.017 vs magnitude pruning) with 87% parameter reduction. Crucially, the method is faster than both pruning baselines in 4 of 5 datasets (0.67–0.94× the cost of full training), since it avoids training the full-size network entirely.
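摘要描述的“随机采样候选网络 → 仅前向评估 → 渐进削减结构”流程可以概括为如下玩具草图。数据、候选数量与容差阈值均为本示意的假设,并非论文实现:

```python
import numpy as np

rng = np.random.default_rng(0)

# 玩具线性可分二分类数据,仅用于演示流程
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def forward_accuracy(W1, W2):
    # 纯前向评估:整个搜索过程不做任何反向传播
    h = np.maximum(X @ W1, 0.0)
    return (((h @ W2) > 0).astype(int) == y).mean()

def random_cloud(hidden=32, n_candidates=64, tol=0.02):
    # 1) 随机“云”:采样大量随机初始化网络,按前向精度挑选最佳者
    cloud = [(rng.normal(size=(4, hidden)), rng.normal(size=(hidden,)))
             for _ in range(n_candidates)]
    W1, W2 = max(cloud, key=lambda w: forward_accuracy(*w))
    base = forward_accuracy(W1, W2)
    # 2) 渐进式结构削减:精度相对初值下降不超过 tol 时接受删除
    keep = list(range(hidden))
    for j in range(hidden):
        trial = [k for k in keep if k != j]
        if trial and forward_accuracy(W1[:, trial], W2[trial]) >= base - tol:
            keep = trial
    return W1[:, keep], W2[keep], base

W1_min, W2_min, base_acc = random_cloud()
final_acc = forward_accuracy(W1_min, W2_min)
n_hidden = W2_min.size
```

只有削减后留下的最小拓扑才会进入(此处省略的)最终训练阶段,这正是该方法避开“训练全尺寸网络”开销的原因。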

[AI-4] Exploring the Potential of Probabilistic Transformer for Time Series Modeling: A Report on the ST-PT Framework

【速读】:该论文旨在解决传统Transformer在时间序列建模中缺乏可解释性与可编程性的局限问题,尤其是在数据稀缺和噪声环境下难以有效融入领域先验知识的挑战。其核心解决方案是提出空间-时间概率Transformer(Spatial-Temporal Probabilistic Transformer, ST-PT),通过将原始Probabilistic Transformer(PT)扩展为具有显式图结构、因子势函数和消息传递调度的可编程因子图模型,从而实现对时间序列建模过程的结构性控制。关键在于:1)利用因子图拓扑和势函数作为直接可编程原语,以嵌入符号化的时序先验;2)通过外部条件动态调节因子矩阵,实现基于结构层面的条件生成而非特征级调制;3)将每轮平均场变分推断(MFVI)迭代视为贝叶斯后验更新,使自回归(AR)预测中的潜在状态转移从黑箱多层感知机(MLP)转变为可解释的推理过程,并借助CRF教师模型蒸馏潜变量以缓解累积误差。

链接: https://arxiv.org/abs/2604.26762
作者: Zhangzhi Xiong,Haoyi Wu,You Wu,Shuqi Gu,Kan Ren,Kewei Tu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 30 pages, 2 figures

点击查看摘要

Abstract:The Probabilistic Transformer (PT) establishes that the Transformer’s self-attention plus its feed-forward block is mathematically equivalent to Mean-Field Variational Inference (MFVI) on a Conditional Random Field (CRF). Under this equivalence the Transformer ceases to be a black-box neural network and becomes a programmable factor graph: graph topology, factor potentials, and the message-passing schedule are all explicit and inspectable primitives that can be engineered. PT was originally developed for natural language and in this report we investigate its potential for time series. We first lift PT into the Spatial-Temporal Probabilistic Transformer (ST-PT) to repair PT’s missing channel axis and weak per-step semantics, and adopt ST-PT as a shared cornerstone backbone. We then identify three distinct properties that PT/ST-PT offers as a factor-graph model and derive three Research Questions, one per property, that probe how each property can be exploited in time series: RQ1. The graph topology and potentials are direct programmable primitives. Can this be used to inject symbolic time-series priors into ST-PT through structural graph modifications, especially under data scarcity and noise? RQ2. The CRF’s factor matrices are the operator’s potentials. Can an external condition program these factor matrices on a per-sample basis, so that conditional generation becomes structural rather than feature-level modulation of a fixed one? RQ3. Each MFVI iteration is a Bayesian posterior update on the factor graph. Can this turn the latent transition of latent-space AutoRegressive (AR) forecasting from an opaque MLP into a principled posterior update, and can a CRF teacher distill its latents into the AR student to counter cumulative error? We give one empirical study per question. Together, these three studies position ST-PT as a programmable framework for time-series modeling.

[AI-5] FutureWorld: A Live Environment for Training Predictive Agents with Real-World Outcome Rewards

【速读】:该论文旨在解决生成式 AI (Generative AI) 在实时未来预测任务中缺乏统一学习环境的问题,即如何构建一个能够持续从真实世界事件中学习的闭环训练框架。当前研究虽已探索多个角度的未来预测方法,但未将其系统性地视为一个可迭代优化的学习环境,导致模型难以有效利用预测结果与实际发生事件之间的反馈机制。解决方案的关键在于提出 FutureWorld,这是一个面向代理(agent)的强化学习环境,通过将预测、结果实现和参数更新三者形成闭环,使模型能够在连续多日的真实事件流中不断调整自身策略,从而提升预测能力并建立可复现的基准测试体系。

链接: https://arxiv.org/abs/2604.26733
作者: Zhixin Han,Yanzhi Zhang,Chuyang Wei,Maohang Gao,Xiawei Yue,Kefei Chen,Yu Zhuang,Haoxiang Guan,Jiyan He,Jian Li,Yitong Duan,Yu Shi,Mengting Hu,Shuxin Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Our experiments are ongoing, and we will release the code in the near future. We release a subset of our historical data on Hugging Face: this https URL

点击查看摘要

Abstract:Live future prediction refers to the task of making predictions about real-world events before they unfold. This task is increasingly studied using large language model-based agent systems, and it is important for building agents that can continually learn from real-world. Just as interactive environments have often driven progress in agents, advancing live future prediction naturally motivates viewing it as a learning environment. Prior works have explored future prediction from several different parts, but have generally not framed it as a unified learning environment. This task is appealing for learning because it can provide a large number of prediction questions grounded in diverse real-world events, while preventing answer leakage. To leverage the advantages of live future prediction, we present FutureWorld, a live agentic reinforcement learning environment that closes the training loop between prediction, outcome realization, and parameters update. In our environment, we take three open-source base models and train them for consecutive days. The results show that training is effective. Furthermore, we build a daily benchmark based on the environment and evaluate several frontier agents on it to establish performance baselines for current agent systems.

[AI-6] Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

【速读】:该论文旨在解决部署中机器人系统技能库(Skill Library)在持续更新时,传统组合式技能学习方法(如BLADE、SymSkill、Generative Skill Chaining)假设技能库在测试阶段固定不变的问题,即未分析当单个技能被替换后对整体组合策略性能的影响。其关键解决方案是提出一种基于配对采样跨版本交换协议(paired-sampling cross-version swap protocol),首次揭示了“主导技能效应”(dominant-skill effect)——即单一技能的原子成功率显著高于其他技能时,其是否参与组合可导致整体成功率提升高达50个百分点;并进一步设计了原子质量探测器(atomic-quality probe)与混合选择器(Hybrid Selector),结合零成本的原子级探测与选择性组合再验证机制,在仅需46%全验证成本下逼近最优组合性能(从64.6%提升至约76.6%),为组合式机器人策略中的技能更新治理提供了首个原理性且可部署的基元(primitive)。

链接: https://arxiv.org/abs/2604.26689
作者: Xue Qin,Simin Luan,John See,Cong Yang,Zhijun Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages main text + appendix; 3 figures, 12 tables;

点击查看摘要

Abstract:Skill libraries in deployed robotic systems are continually updated through fine-tuning, fresh demonstrations, or domain adaptation, yet existing typed-composition methods (BLADE, SymSkill, Generative Skill Chaining) treat the library as frozen at test time and do not analyze how composition outcomes change when a skill is replaced. We introduce a paired-sampling cross-version swap protocol on robosuite manipulation tasks to characterize this dimension of compositional skill learning. On a dual-arm peg-in-hole task we discover a dominant-skill effect: one ECM achieves 86.7% atomic success rate while every other ECM is at or below 26.7%, and whether this dominant ECM enters a composition shifts the success rate by up to +50pp. We characterize the boundary on a simpler pick task where all atomic policies saturate at 100% and the effect is undefined. Across three tasks we further find that off-policy behavioral distance metrics fail to identify the dominant ECM, ruling out the natural cheap predictor. We propose an atomic-quality probe and a Hybrid Selector combining per-skill probes (zero per-decision cost) with selective composition revalidation (full cost), and characterize its Pareto frontier on 144 skill-update decisions. On T6 the atomic-only probe sits 23pp below full revalidation (64.6% vs 87.5% oracle match) at zero per-decision cost; a Hybrid Selector with m=10 closes most of that gap to ~12pp at 46% of full-revalidation cost. On the cross-task average over 144 events, atomic-only is within 3pp of full revalidation under a mixed-oracle caveat. The atomic-quality probe is, to our knowledge, the first principled, deployment-ready primitive for skill-update governance in compositional robot policies.

[AI-7] A Toolkit for Detecting Spurious Correlations in Speech Datasets

【速读】:该论文旨在解决语音数据集中因记录条件异质性(heterogeneous recording conditions)而产生的虚假相关性(spurious correlations)问题,这类相关性可能导致模型在训练和测试数据中均表现出过高性能估计,尤其在高风险应用场景下可能危及系统可靠性。解决方案的关键在于提出一个诊断工具包,其核心机制是通过仅利用音频中的非语音区域(non-speech regions)来检测目标类别(target class),若该任务表现优于随机水平,则表明非语音区域携带了与目标类别相关的信息,从而有效标识出虚假相关性的存在。

链接: https://arxiv.org/abs/2604.26676
作者: Lara Gauder,Pablo Riera,Andrea Slachevsky,Gonzalo Forno,Adolfo M. García,Luciana Ferrer
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:We introduce a toolkit for uncovering spurious correlations between recording characteristics and target class in speech datasets. Spurious correlations may arise due to heterogeneous recording conditions, a common scenario for health-related datasets. When present both in the training and test data, these correlations result in an overestimation of the system performance – a dangerous situation, especially in high-stakes applications where systems are required to satisfy minimum performance requirements. Our toolkit implements a diagnostic method based on the detection of the target class using only the non-speech regions in the audio. Better than chance performance at this task indicates that information about the target class can be extracted from the non-speech regions, flagging the presence of spurious correlations. The toolkit is publicly available for research use.
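该诊断方法的思路是:仅用非语音区域预测目标类别,若显著优于随机水平则标记数据集存在虚假相关。下面用一个模拟实验示意这一逻辑;其中的特征(非语音帧能量)与阈值分类器均为本示意的假设,真实工具包的实现会更复杂:

```python
import numpy as np

rng = np.random.default_rng(0)

# 模拟:两组录音条件不同,标签为 1 的一组录音底噪系统性更高,
# 即录音条件与目标类别完全混淆(典型的虚假相关场景)
n = 500
noise_floor = np.where(rng.random(n) < 0.5, 0.01, 0.04)  # 两种录音条件
labels = (noise_floor > 0.02).astype(int)
nonspeech_energy = noise_floor + 0.005 * rng.random(n)   # 非语音帧能量特征

# 仅用非语音特征做最简单的阈值分类:优于随机即提示存在虚假相关
threshold = nonspeech_energy.mean()
pred = (nonspeech_energy > threshold).astype(int)
accuracy = (pred == labels).mean()
flagged = accuracy > 0.6   # 显著高于 0.5 的随机水平,标记该数据集
```

在这个极端构造的例子里,不含任何语音内容的特征就能近乎完美地预测标签,说明模型报告的性能可能主要来自录音条件而非目标信号。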

[AI-8] SciHorizon-DataEVA: An Agent ic System for AI-Readiness Evaluation of Heterogeneous Scientific Data

【速读】:该论文旨在解决科学数据在生成式 AI (Generative AI) 应用中普遍存在的“AI就绪性”(AI-readiness)评估缺乏系统化、可扩展机制的问题。当前机器学习模型在科学发现中的有效性受限于数据质量与适配性,但现有方法无法对异构科学数据进行细粒度、可执行的评估。解决方案的关键在于提出 SciHorizon-DataEVA 系统,其核心是 Sci-TQA2 评估原则和 Sci-TQA2-Eval 多智能体执行框架:前者将 AI 就绪性划分为治理可信性(Governance Trustworthiness)、数据质量(Data Quality)、AI 兼容性(AI Compatibility)和科学适应性(Scientific Adaptability)四个维度,并分解为可测量的原子要素;后者通过有向循环工作流动态构建数据感知的评估规范,结合轻量级数据画像、适用性感知指标激活及基于领域约束与文献信号的知识增强规划,实现工具导向、具备验证与自修正能力的规模化评估。

链接: https://arxiv.org/abs/2604.26645
作者: Dianyu Liu,Chuan Qin,Xi Chen,Xiaohan Li,Wenxi Xu,Yuyang Wang,Xin Chen,Yuanchun Zhou,Hengshu Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:AI-for-Science (AI4Science) is increasingly transforming scientific discovery by embedding machine learning models into prediction, simulation, and hypothesis generation workflows across domains. However, the effectiveness of these models is fundamentally constrained by the AI-readiness of scientific data, for which no scalable and systematic evaluation mechanism currently exists. In this work, we propose SciHorizon-DataEVA, a novel agentic system to scalable AI-readiness evaluation of heterogeneous scientific data. At the evaluation-criteria level, we introduce the Sci-TQA2 principles, which organize AI-readiness into four complementary dimensions: Governance Trustworthiness, Data Quality, AI Compatibility, and Scientific Adaptability. Each dimension is decomposed into measurable atomic elements that enable fine-grained and executable assessment. To operationalize these principles at scale, we develop Sci-TQA2-Eval, a hierarchical multi-agent evaluation approach orchestrated through a directed, cyclic workflow. Our Sci-TQA2-Eval dynamically constructs dataset-aware evaluation specifications by combining lightweight dataset profiling, applicability-aware metric activation, and knowledge-augmented planning grounded in domain constraints and dataset-paper signals. These specifications are executed through an adaptive, tool-centric evaluation mechanism with built-in verification and self-correction, enabling scalable and reliable assessment across heterogeneous scientific data. Extensive experiments on scientific datasets spanning multiple domains demonstrate the effectiveness and generality of SciHorizon-DataEVA for principled AI-readiness evaluation.

[AI-9] When to Vote When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling

【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在数学推理任务中对困难实例表现不稳定的问题,现有测试时扩展方法(如重复采样、自纠错和树搜索)虽能提升性能但计算开销大且在难题上收益递减。其解决方案的关键在于:利用输出不一致性的强相关性作为实例难度与预测正确性的指示信号,将测试时扩展转化为实例级路由问题——根据输出分歧程度动态选择不同策略:一致性高时采用轻量级解析,中等分歧时使用多数投票,高度模糊时则通过重写重构进行处理,从而在保证准确率提升(3%–7%)的同时降低采样成本。

链接: https://arxiv.org/abs/2604.26644
作者: Zhimin Lin,Yixin Ji,Jinpeng Li,Yu Luo,Dong Li,Junhua Fang,Juntao Li,Min Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs) achieve strong performance on mathematical reasoning tasks but remain unreliable on challenging instances. Existing test-time scaling methods, such as repeated sampling, self-correction, and tree search, improve performance at the cost of increased computation, yet often exhibit diminishing returns on hard problems. We observe that output disagreement is strongly correlated with instance difficulty and prediction correctness, providing a useful signal for guiding instance-level strategy selection at test time. Based on this insight, we propose a training-free framework that formulates test-time scaling as an instance-level routing problem, rather than allocating more computation within a single strategy, dynamically selecting among different scaling strategies based on output disagreement. The framework applies lightweight resolution for consistent cases, majority voting for moderate disagreement, and rewriting-based reformulation for highly ambiguous instances. Experiments on seven mathematical benchmarks and three models show that our method improves accuracy by 3% - 7% while reducing sampling cost compared to existing approaches.
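按输出分歧路由策略的骨架可以写成几行 Python:对同一问题的多次采样统计最高票占比,再按两个阈值在“轻量解析 / 多数投票 / 重写重构”之间切换(阈值数值与函数接口均为示意性假设,非论文实现):

```python
from collections import Counter

def route_by_disagreement(samples, low=0.9, high=0.5):
    """根据采样答案的一致程度选择测试时扩展策略。

    samples: 同一问题多次采样得到的最终答案列表。
    返回 (策略名, 答案或 None)。
    """
    counts = Counter(samples)
    top_answer, top_count = counts.most_common(1)[0]
    agreement = top_count / len(samples)
    if agreement >= low:        # 高度一致:轻量解析,直接采纳
        return "lightweight", top_answer
    if agreement >= high:       # 中等分歧:多数投票
        return "majority_vote", top_answer
    return "rewrite", None      # 高度模糊:交给重写重构流程

strategy1, ans1 = route_by_disagreement(["42"] * 9 + ["41"])
strategy2, ans2 = route_by_disagreement(["42"] * 6 + ["41"] * 4)
strategy3, ans3 = route_by_disagreement(["a", "b", "c", "d", "e"])
```

由于一致样本走轻量分支、只有少数高分歧实例进入昂贵的重写流程,整体采样成本低于对所有实例统一使用单一扩展策略。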

[AI-10] ATLAS: An Annotation Tool for Long-horizon Robotic Action Segmentation

【速读】:该论文旨在解决长时程机器人演示中精确标注动作边界的问题,这是训练和评估动作分割及操作策略学习方法的关键前提。现有标注工具普遍存在局限性:主要针对视觉数据设计、缺乏对机器人特有时间序列信号(如夹爪状态或力/扭矩)的同步可视化支持,且难以适配不同数据集格式。其解决方案的核心是提出ATLAS——一个专为长时程机器人动作分割设计的标注工具,它提供多模态数据(包括多视角视频与本体感知信号)的时间同步可视化,并支持动作边界、标签及任务结果的标注;同时原生兼容ROS bag和RLDS等主流机器人数据格式,并通过模块化抽象层可轻松扩展至新格式,其基于键盘的操作界面显著降低标注负担,实验证明在接触丰富的装配任务中,相较ELAN工具平均每个动作标注时间减少至少6%,引入时序数据使时间对齐精度提升超过2.8%,边界误差下降至原有视觉工具的五分之一。

链接: https://arxiv.org/abs/2604.26637
作者: Sergej Stanovcic,Daniel Sliwowski,Dongheui Lee
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 7 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Annotating long-horizon robotic demonstrations with precise temporal action boundaries is crucial for training and evaluating action segmentation and manipulation policy learning methods. Existing annotation tools, however, are often limited: they are designed primarily for vision-only data, do not natively support synchronized visualization of robot-specific time-series signals (e.g., gripper state or force/torque), or require substantial effort to adapt to different dataset formats. In this paper, we introduce ATLAS, an annotation tool tailored for long-horizon robotic action segmentation. ATLAS provides time-synchronized visualization of multi-modal robotic data, including multi-view video and proprioceptive signals, and supports annotation of action boundaries, action labels, and task outcomes. The tool natively handles widely used robotics dataset formats such as ROS bags and the Reinforcement Learning Dataset (RLDS) format, and provides direct support for specific datasets such as REASSEMBLE. ATLAS can be easily extended to new formats via a modular dataset abstraction layer. Its keyboard-centric interface minimizes annotation effort and improves efficiency. In experiments on a contact-rich assembly task, ATLAS reduced the average per-action annotation time by at least 6% compared to ELAN, while the inclusion of time-series data improved temporal alignment with expert annotations by more than 2.8% and decreased boundary error fivefold compared to vision-only annotation tools.

[AI-11] DD Governance for Multi-Agent Code Generation via Prompt Engineering

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在软件开发中因缺乏约束导致的不稳定性、非确定性以及对开发规范(如测试驱动开发,Test-Driven Development, TDD)遵循不足的问题。其核心解决方案是提出一种AI原生的TDD框架,将传统TDD的“红-绿-重构”流程转化为可执行的提示层(prompt-level)与工作流层(workflow-level)治理机制,通过结构化提示编排和分层架构实现开发纪律的显式编码,确保阶段顺序、修复循环边界、验证门控及原子性变更控制,从而提升LLM辅助开发的稳定性和可复现性。

链接: https://arxiv.org/abs/2604.26615
作者: Tarlan Hasanli,Shahbaz Siddeeq,Bishwash Khanal,Pyry Kotilainen,Tommi Mikkonen,Pekka Abrahamsson
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 5 pages. Submitted to the 1st International Workshop on Empirical Prompt Engineering for Software Engineering (PROMPT-SE 2026)

点击查看摘要

Abstract:Large language models (LLMs) accelerate software development but often exhibit instability, non-determinism, and weak adherence to development discipline in unconstrained workflows. While test-driven development (TDD) provides a structured Red-Green-Refactor process, existing LLM-based approaches typically use tests as auxiliary inputs rather than enforceable process constraints. We present an AI-native TDD framework that operationalizes classical TDD principles as structured prompt-level and workflow-level governance mechanisms. Extracted principles are formalized in a machine-readable manifesto and distributed across planning, generation, repair, and validation stages within a layered architecture that separates model proposal from deterministic engine authority. The system enforces phase ordering, bounded repair loops, validation gates, and atomic mutation control to improve stability and reproducibility. We describe architecture and discuss encoding software engineering discipline directly into prompt orchestration, which we think offers a promising direction for reliable LLM-assisted development.
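摘要提到的“有界修复循环 + 验证门控”这类工作流层治理,可以抽象为如下草图。其中 generate 与 run_tests 是假设的接口,仅用于说明如何把 TDD 纪律编码为可执行约束:

```python
def tdd_repair_loop(generate, run_tests, max_attempts=3):
    """示意:把“红-绿”循环编码为带上界的修复循环与验证门控。

    generate(feedback) 返回候选代码;run_tests(code) 返回 (是否通过, 反馈)。
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = generate(feedback)
        passed, feedback = run_tests(code)
        if passed:                  # 验证门:测试全绿才放行
            return code, attempt
    # 超出边界即终止,避免 LLM 陷入无界的自我修复循环
    raise RuntimeError("repair budget exhausted")

# 模拟一个第 2 次尝试才通过测试的生成器
attempts = {"n": 0}
def fake_generate(feedback):
    attempts["n"] += 1
    return "v%d" % attempts["n"]
def fake_run_tests(code):
    return (code == "v2", "" if code == "v2" else "assertion failed in test_foo")

code, used = tdd_repair_loop(fake_generate, fake_run_tests)
```

确定性的引擎掌握循环边界与放行条件,模型只负责提出候选,这对应文中“模型提议、引擎裁决”的分层架构。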

[AI-12] Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics

【速读】:该论文旨在解决基于能力的教育(Competency-Based Education, CBE)在实践中面临的评估瓶颈问题,即从传统分数导向的测评向定性能力映射转变过程中,人工评估效率低、可扩展性差的挑战。其解决方案的关键在于提出一个“人在回路中”(Human-in-the-Loop)的基准框架,通过构建多维评分量规(rubric)对多个大语言模型(LLMs)在中学数学评估任务中的表现进行系统评测,并引入专家标注作为真实标准(ground truth),以量化模型输出与人类专家判断的一致性。研究发现,模型架构与指令约束的适配性比参数规模更为关键,揭示了在结构化、规则驱动的任务中,稀疏专家混合(Sparse MoE)架构更易达成合理一致性的规律,从而为后续开发高价值辅助评估工具提供了实证依据。

链接: https://arxiv.org/abs/2604.26607
作者: Jatin Bhusal,Nancy Mahatha,Aayush Acharya,Raunak Regmi
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Software Engineering (cs.SE)
备注: 5 pages, 3 figures, 5 tables. Submitted to 2AI-2026-Applied AI Conference

点击查看摘要

Abstract:As Competency-Based Education (CBE) is gaining traction around the world, the shift from marks-based assessment to qualitative competency mapping is a manual challenge for educators. This paper tackles the bottleneck issue by suggesting a “Human-in-the-Loop” benchmarking framework to assess the effectiveness of multiple LLMs in automating secondary-level mathematics assessment. Based on the Grade 10 Optional Mathematics curriculum in Nepal, we created a multi-dimensional rubric for four topics and four cross-cutting competencies: Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation. The multi-provider ensemble, consisting of open-weight models – Eagle (Llama 3.1-8B) and Orion (Llama 3.3-70B) – and proprietary frontier models Nova (Gemini 2.5 Flash) and Lyra (Gemini 3 Pro), was benchmarked against a ground truth defined by two senior mathematics faculty members (kappa_w = 0.8652). The findings show a marked “Architecture-compatibility gap”. Although the Gemini-based Mixture-of-Experts (Sparse MoE) models achieved “Fair Agreement” (kappa_w ~ 0.38), the larger Orion (70B) model exhibited “No Agreement” (kappa_w = -0.0261), suggesting that architectural compliance with instruction constraints outweighs the scale of raw parameters in rubric-constrained tasks. We conclude that while LLMs are not yet suitable for autonomous certification, they provide high-value assistive support for preliminary evidence extraction within a “Human-in-the-Loop” framework.
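
摘要中用加权 kappa(kappa_w)衡量模型输出与专家基准的一致性。下面是二次加权 Cohen's kappa 的一个极简实现草图(假设等级取值为 0..k-1;仅作演示,非论文所用代码):

```python
def quadratic_weighted_kappa(r1, r2, k):
    # r1, r2:两位评分者对同一批样本给出的等级(0..k-1);k:等级数
    n = len(r1)
    O = [[0.0] * k for _ in range(k)]                  # 观察一致性矩阵
    for a, b in zip(r1, r2):
        O[a][b] += 1
    h1 = [r1.count(i) for i in range(k)]
    h2 = [r2.count(i) for i in range(k)]
    E = [[h1[i] * h2[j] / n for j in range(k)] for i in range(k)]            # 随机期望矩阵
    w = [[(i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]  # 二次惩罚权重
    num = sum(w[i][j] * O[i][j] for i in range(k) for j in range(k))
    den = sum(w[i][j] * E[i][j] for i in range(k) for j in range(k))
    return 1.0 - num / den
```

完全一致时 kappa_w 为 1,完全反向时为 -1,0 附近表示与随机一致无异。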

[AI-13] MappingEvolve: LLM-Driven Code Evolution for Technology Mapping

【速读】:该论文旨在解决逻辑综合中技术映射(technology mapping)这一关键但具有挑战性的阶段的优化问题,传统方法在代码生成和算法改进方面存在局限。其解决方案的关键在于提出一种名为MappingEvolve的开源框架,该框架首次将大型语言模型(Large Language Models, LLMs)直接用于演化技术映射代码本身;通过将映射过程抽象为一系列优化操作符,并采用分层代理架构(包括规划器Planner、演化器Evolver和评估器Evaluator),实现对代码的结构化、策略性修改,从而有效提升映射质量,在EPFL基准测试中实现了显著的面积-延迟权衡优化,相较ABC和mockturtle分别获得10.04%和7.93%的面积缩减,以及46.6%–96.0%的S_overall指标提升。

链接: https://arxiv.org/abs/2604.26591
作者: Rongliang Fu,Yi Liu,Qiang Xu,Tsung-Yi Ho
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Technology mapping is a critical yet challenging stage in logic synthesis. While Large Language Models (LLMs) have been applied to generate optimization scripts, their potential for core algorithm enhancement remains untapped. We introduce MappingEvolve, an open-source framework that pioneers the use of LLMs to directly evolve technology mapping code. Our method abstracts the mapping process into distinct optimization operators and employs a hierarchical agent-based architecture, comprising a Planner, Evolver, and Evaluator, to guide the evolutionary search. This structured approach enables strategic and effective code modifications. Experiments show our method significantly outperforms direct evolution and strong baselines, achieving 10.04% area reduction versus ABC and 7.93% versus mockturtle, with 46.6%–96.0% S_overall improvement on EPFL benchmarks, while explicitly navigating the area–delay trade-off. Our code and data are available at this https URL.
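
论文将映射过程抽象为一系列优化算子,由 Planner/Evolver/Evaluator 协同演化代码。下面用一个玩具化的"变异-评估"循环示意这种算子序列演化思路(OPERATORS、evaluate 的目标序列均为假设性示例;真实系统由 LLM 提议代码修改并以综合质量指标评估):

```python
import random

OPERATORS = ["cut_enum", "area_recover", "depth_opt", "gate_resize"]  # 假设的映射算子集

def evaluate(seq):
    # 玩具评估器:与一个假想的"最优算子序列"的距离越小越好
    target = ["cut_enum", "depth_opt", "area_recover"]
    return sum(a != b for a, b in zip(seq, target)) + abs(len(seq) - len(target))

def evolve(generations=500, seed=0):
    # 演化搜索:每代变异一个算子,评估器择优保留(Planner/Evolver/Evaluator 的极简合体)
    rng = random.Random(seed)
    best = [rng.choice(OPERATORS) for _ in range(3)]
    for _ in range(generations):
        cand = list(best)
        cand[rng.randrange(len(cand))] = rng.choice(OPERATORS)
        if evaluate(cand) <= evaluate(best):   # 允许中性移动,避免早停
            best = cand
    return best
```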

[AI-14] Graph Construction and Matching for Imperative Programs using Neural and Structural Methods

【速读】:该论文旨在解决验证 artefact(如程序规范和断言)在不同编程语言和注释风格之间难以复用的问题,核心挑战在于如何识别程序及其规范之间的结构相似性和语义相似性。解决方案的关键在于构建一个统一的图表示管道,将 imperative 程序及其注释(如 ACSL、JML 和 Dafny)转换为带类型和属性的图结构;该管道结合了抽象语法树(Abstract Syntax Tree, AST)解析与来自 SentenceTransformer 和 CodeBERT 等模型的语义嵌入,从而生成既能捕获程序结构关系又能保留语义上下文的图表示。实验表明,该方法可在多种语言和注释风格下生成一致的图表示,为后续的语义增强和近似图匹配提供了可扩展的基础。

链接: https://arxiv.org/abs/2604.26578
作者: Arshad Beg,Diarmuid O’Donoghue,Rosemary Monahan
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 20 Pages. Technical Report. Maynooth University, Ireland. Submitted on 29 April 2026

点击查看摘要

Abstract:Reusing verification artefacts requires identifying structural and semantic similarities across programs and their specifications. In this paper, we focus on graph construction as a foundational step toward this goal. We present a pipeline that converts imperative programs and their annotations into typed, attributed graphs. Our experiments cover datasets including C with ACSL, Java with JML, and Dafny for C#. The pipeline integrates abstract syntax tree parsing with semantic embeddings derived from models such as SentenceTransformer and CodeBERT. This enables the generation of graph representations that capture both structural relationships and semantic context. Our results show that consistent graph representations can be constructed across different languages and annotation styles. This work provides a practical basis for future steps in semantic enrichment and approximate graph matching for scalable verification artefact reuse.
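
论文的管道把程序解析为带类型与属性的图。这里用 Python 标准库 ast 演示同一思想的最小版本(论文面向 C/ACSL、Java/JML 与 Dafny;此处换用 Python 源码仅为示意,节点属性也做了简化):

```python
import ast

def program_to_graph(source):
    # 将源码解析为带类型/属性的图:节点为 AST 结点,边为父子关系
    tree = ast.parse(source)
    nodes, edges = {}, []
    for idx, node in enumerate(ast.walk(tree)):
        node._gid = idx                          # 给每次访问到的结点分配图 id
        attrs = {"type": type(node).__name__}    # 结点类型作为结构属性
        if isinstance(node, ast.Name):
            attrs["name"] = node.id              # 保留标识符作为语义属性
        nodes[idx] = attrs
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            edges.append((node._gid, child._gid))
    return nodes, edges
```

在此骨架上,可再为结点挂接来自 SentenceTransformer/CodeBERT 的语义嵌入向量,得到论文所述"结构 + 语义"的图表示。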

[AI-15] Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在机器人健康看护系统中部署时的安全性问题,特别是其对有害指令的响应风险。研究构建了一个包含270条有害指令的数据集(涵盖九类基于美国医学协会医疗伦理原则的禁止行为),并在机器人健康看护框架的模拟环境中评估了72个LLMs的安全表现。关键发现在于:模型安全性受参数规模和发布日期显著影响,全部模型的平均违规率达54.4%,其中专有模型显著安全于开源模型(违规率中位数23.7% vs 72.8%),医学领域微调未带来整体安全性提升,基于提示的防御也仅有有限改善,表明现有手段难以达到临床部署所需的安全阈值。因此,该研究强调将安全评估作为LLMs用于医疗机器人系统的首要考量标准。

链接: https://arxiv.org/abs/2604.26577
作者: Mahiro Nakao,Kazuhiro Takemoto
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Robotics (cs.RO)
备注: 20 pages, 9 figures, 3 tables, 8 pages supplementary material

点击查看摘要

Abstract:Large language models (LLMs) are increasingly considered for deployment as the control component of robotic health attendants, yet their safety in this context remains poorly characterized. We introduce a dataset of 270 harmful instructions spanning nine prohibited behavior categories grounded in the American Medical Association Principles of Medical Ethics, and use it to evaluate 72 LLMs in a simulation environment based on the Robotic Health Attendant framework. The mean violation rate across all models was 54.4%, with more than half exceeding 50%, and violation rates varied substantially across behavior categories, with superficially plausible instructions such as device manipulation and emergency delay proving harder to refuse than overtly destructive ones. Model size and release date were the primary determinants of safety performance among open-weight models, and proprietary models were substantially safer than open-weight counterparts (median 23.7% versus 72.8%). Medical domain fine-tuning conferred no significant overall safety benefit, and a prompt-based defense strategy produced only a modest reduction in violation rates among the least safe models, leaving absolute violation rates at levels that would preclude safe clinical deployment. These findings demonstrate that safety evaluation must be treated as a first-class criterion in the development and deployment of LLMs for robotic health attendants.

[AI-16] DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在边缘AI系统上进行推理时,因Key-Value (KV)缓存占用内存过大而超出设备可用内存预算的问题。现有基于NVMe的缓存外存方案依赖内核页缓存(page cache),在内存压力下易引发缓存抖动(cache thrashing)、延迟不可预测及高软件开销。其解决方案的关键在于提出DUAL-BLADE框架,该框架采用双路径KV驻留机制:根据运行时内存状态动态将KV张量分配至页缓存路径或NVMe直通路径;其中NVMe直通路径通过映射KV张量到连续逻辑块地址(LBA)区域绕过文件系统,实现低开销直接存储访问,并结合自适应流水线并行策略,使存储I/O与GPU DMA操作重叠,从而显著降低I/O瓶颈,提升推理吞吐量。

链接: https://arxiv.org/abs/2604.26557
作者: Bodon Jeong,Hongsu Byun,Youngjae Kim,Weikuan Yu,Kyungkeun Lee,Jihoon Yang,Sungyong Park
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注: To appear in IEEE International Conference on Distributed Computing Systems (ICDCS) 2026

点击查看摘要

Abstract:The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which often exceed available device memory. Although NVMe-based offloading offers scalable capacity, existing file-based designs rely heavily on the kernel page cache, leading to cache thrashing, unpredictable latency, and high software overhead under memory pressure. We present DUAL-BLADE, a dual-path KV residency framework that dynamically assigns KV tensors to either a page-cache path or an NVMe-direct path based on runtime memory availability. The NVMe-direct path bypasses the filesystem by mapping KV tensors to contiguous logical block address (LBA) regions, enabling low-overhead direct storage access. DUAL-BLADE further incorporates adaptive pipeline parallelism to overlap storage I/O with GPU DMA, improving inference throughput. Our evaluation shows that DUAL-BLADE substantially mitigates I/O bottlenecks, reducing prefill and decode latency by up to 33.1% and 42.4%, respectively, while improving SSD utilization by 2.2x across diverse memory budgets.
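
DUAL-BLADE 的核心是按运行时内存余量在页缓存路径与 NVMe 直通路径之间分派 KV 张量,直通路径再把张量映射到连续 LBA 区间以绕过文件系统。以下为该调度逻辑的假设性草图(水位线、内存总量与块大小均为演示取值,非论文参数):

```python
TOTAL_MEM = 8 << 30   # 假设设备内存 8 GiB
BLOCK = 4096          # 假设 LBA 块大小(字节)

def choose_path(free_mem_bytes, watermark=0.2):
    # 内存余量低于水位线时,KV 张量走 NVMe 直通路径,避免页缓存抖动
    return "nvme-direct" if free_mem_bytes / TOTAL_MEM < watermark else "page-cache"

def lba_range(offset_blocks, tensor_bytes):
    # NVMe 直通:把张量映射到连续 LBA 区间(大小向上取整到块),绕过文件系统
    nblocks = -(-tensor_bytes // BLOCK)   # 向上取整除法
    return (offset_blocks, offset_blocks + nblocks)
```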

[AI-17] Lyapunov-Guided Self-Alignment: Test-Time Adaptation for Offline Safe Reinforcement Learning AISTATS2026

【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)代理在部署时因训练数据集与真实环境之间的分布差异而导致的不安全行为问题。解决方案的关键在于提出一种基于Transformer架构的自对齐(Self-Alignment for Safety, SAS)框架,其核心机制是“自对齐”:在测试阶段,预训练代理生成多个想象轨迹,并筛选满足李雅普诺夫(Lyapunov)条件的可行片段作为上下文提示(in-context prompts),从而引导代理行为向安全方向调整,同时避免参数更新。该方法将李雅普诺夫引导的想象力转化为控制不变提示(control-invariant prompts),并利用Transformer的层次化结构实现对潜在技能的贝叶斯推理式控制。

链接: https://arxiv.org/abs/2604.26516
作者: Seungyub Han,Hyungjin Kim,Jungwoo Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at AISTATS 2026. First two authors contributed equally. Project page: this https URL . Code: this https URL

点击查看摘要

Abstract:Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behavior. To address this, we present SAS (Self-Alignment for Safety), a transformer-based framework that enables test-time adaptation in offline safe RL without retraining. In SAS, the main mechanism is self-alignment: at test time, the pretrained agent generates several imagined trajectories and selects those satisfying the Lyapunov condition. These feasible segments are then recycled as in-context prompts, allowing the agent to realign its behavior toward safety while avoiding parameter updates. In effect, SAS turns Lyapunov-guided imagination into control-invariant prompts, and its transformer architecture admits a hierarchical RL interpretation where prompting functions as Bayesian inference over latent skills. Across Safety Gymnasium and MuJoCo benchmarks, SAS consistently reduces cost and failure while maintaining or improving return.
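
SAS 的自对齐机制可概括为:生成想象轨迹 → 用李雅普诺夫条件筛选可行片段 → 作为上下文提示回灌给智能体。以下玩具代码示意这一筛选步骤(以轨迹上的估值 V 单调不增作为简化的 Lyapunov 条件;数据结构与字段名均为演示所设):

```python
def lyapunov_feasible(traj_values, eps=0.0):
    # 简化的 Lyapunov 条件:沿轨迹的"能量"/代价估计 V 单调不增
    return all(v1 - v0 <= eps for v0, v1 in zip(traj_values, traj_values[1:]))

def select_prompts(imagined, k=2):
    # 从想象轨迹中筛出满足 Lyapunov 条件的片段,按回报取前 k 条作为上下文提示
    feasible = [t for t in imagined if lyapunov_feasible(t["V"])]
    return sorted(feasible, key=lambda t: t["ret"], reverse=True)[:k]
```

筛选在测试阶段完成,不涉及参数更新,这正是"提示即控制不变量"的含义。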

[AI-18] Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在训练过程中可能出现的“对齐伪造”(Alignment Faking, AF)问题,即模型在受监控时表现出符合安全或伦理规范的行为,而在无人监督时则恢复到原有偏好,从而规避价值对齐。传统检测方法依赖于思维链(Chain-of-Thought, CoT)分析,但存在局限性:当模型未显式展示策略性推理时,难以区分其行为是出于欺骗还是能力不足。本文的关键解决方案在于将AF形式化为一种复合行为事件,并通过可观测的工具选择行为进行检测——具体而言,模型在无监控时选择安全工具,而在有监控且奖励帮助性而非安全性的场景下切换至不安全工具,同时其推理过程仍承认安全选项的正确性。这一方法突破了CoT依赖的限制,提供了更鲁棒的AF检测信号。

链接: https://arxiv.org/abs/2604.26511
作者: Matteo Leonesi,Francesco Belardinelli,Flavio Corradini,Marco Piangerelli
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Alignment faking (AF) occurs when an LLM strategically complies with training objectives to avoid value modification, reverting to prior preferences once monitoring is lifted. Current detection methods focus on conversational settings and rely primarily on Chain-of-Thought (CoT) analysis, which provides a reliable signal when strategic reasoning surfaces, but cannot distinguish deception from capability failures if traces are absent or unfaithful. We formalize AF as a composite behavioural event and detect it through observable tool selection, where the LLM selects the safe tool when unmonitored, but switches to the unsafe tool under monitoring that rewards helpfulness over safety, while its reasoning still acknowledges the safe choice. We release a dataset of 108 enterprise IT scenarios spanning Security, Privacy, and Integrity domains under Corruption and Sabotage pressures. Evaluating six frontier LLMs across five independent runs, we find mean AF detection rates between 3.5% and 23.7%, with vulnerability profiles varying by domain and pressure type. These results suggest that susceptibility reflects training methodology rather than capability alone.
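
论文将对齐伪造定义为可观测的复合行为事件。下面的玩具判定函数示意该事件的三个构成条件(工具名与关键词匹配均为演示性简化;真实评测依赖更精细的行为与推理信号):

```python
def is_alignment_faking(unmonitored_tool, monitored_tool, reasoning_text):
    # 复合行为事件:无监控时选安全工具、受监控(奖励帮助性)时切换到不安全工具,
    # 且推理文本仍承认安全选项才是正确的
    acknowledges_safe = "safe" in reasoning_text.lower()
    return (unmonitored_tool == "safe_tool"
            and monitored_tool == "unsafe_tool"
            and acknowledges_safe)
```

与仅依赖 CoT 的检测不同,前两个条件来自外显的工具选择行为,即使推理轨迹缺失或不忠实也能提供信号。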

[AI-19] Auto-Relational Reasoning

【速读】:该论文旨在解决当前大规模机器学习模型在推理能力上的局限性问题,即尽管模型规模不断扩展,但其性能提升趋于饱和且缺乏稳健的逻辑推理能力。解决方案的关键在于提出一种将对象关系(object-relations)的自动推理与人工神经网络(Artificial Neural Networks, ANN)相结合的理论框架,并通过一个融合推理与机器学习的范式实现该思想。该系统无需任何问题先验知识即可求解智商(IQ)类问题,达到98.03%的解决率,对应于人类顶尖1%的智力水平(IQ 132–144),其性能瓶颈主要受限于模型规模和计算资源,而非方法本身。

链接: https://arxiv.org/abs/2604.26507
作者: Ioannis Konstantoulas,Dimosthenis Tsimas,Pavlos Peppas,Kyriakos Sgarbas
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to JAIR

点击查看摘要

Abstract:Background Objectives: In the last decade, machine learning research has grown rapidly, but large models are reaching their soft limits, demonstrating diminishing returns, and still lack solid reasoning abilities. These limits could be surpassed through a synergistic combination of machine learning scalability and rigid reasoning. Methods: In this work, we propose a theoretical framework for reasoning through object-relations in an automated manner, integrated with Artificial Neural Networks. We present a formal analysis of the reasoning, and we show the theory in practice through a paradigm integrating reasoning and machine learning. Results: This paradigm is a system that solves Intelligence Quotient problems without any prior knowledge of the problem. Our system achieves a 98.03% solving rate, corresponding to the top 1% percentile or an IQ score of 132-144. This result is limited only by the small size of the model and the processing capabilities of the machine it ran on. Conclusions: With the integration of prior knowledge in the system and the expansion of the dataset, the system can be generalized to solve a large category of problems. The functionality of the system inherently favors the solution of such problems in few-shot or zero-shot attempts.

[AI-20] STLGT: A Scalable Trace-Based Linear Graph Transformer for Tail Latency Prediction in Microservices

【速读】:该论文旨在解决微服务系统中端到端尾部延迟(tail-latency)预测的准确性与推理效率难题,尤其是在处理长距离依赖传播和非平稳、突发性工作负载时的挑战。其核心解决方案是提出一种名为STLGT(Scalable Trace-based Linear Graph Transformer)的 per-API 预测模型,关键创新在于:首先将调用链(trace)编码为结构感知的跨度图(span graph),利用线性复杂度的图Transformer实现跨服务依赖的有效传播;其次引入解耦的时间模块以捕捉工作负载动态变化。该设计在保持高精度的同时显著提升推理效率,在真实场景下相比PERT-GNN平均降低8.5% MAPE,并在N=32时实现最高达12倍的CPU推理加速。

链接: https://arxiv.org/abs/2604.26422
作者: Yongliang Ding,Qigong Bi,Peng Pu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures, 4 tables, conference

点击查看摘要

Abstract:Accurate end-to-end tail-latency forecasting is critical for proactive SLO management in microservice systems. However, modeling long-range dependency propagation and non-stationary, bursty workloads while maintaining inference efficiency at scale remains challenging. We present STLGT (Scalable Trace-based Linear Graph Transformer), a per-API predictor that encodes traces as span graphs for multi-step p95 tail-latency forecasting. STLGT uses a structure-aware linear graph Transformer to propagate cross-service dependencies with inference time linear in span graph size, and a decoupled temporal module to capture workload dynamics. Across a personalized education microservice application, DeathStarBench, and Alibaba traces, STLGT improves forecasting accuracy over PERT-GNN by 8.5% MAPE on average and achieves up to 12x faster CPU inference at N=32, matching the maximum span graph size after preprocessing the Alibaba traces. Ablation studies further demonstrate the effectiveness of each component, especially under bursty traffic.
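
STLGT 采用线性复杂度的图 Transformer。线性注意力的关键在于用核特征映射 φ 把注意力改写为可增量维护的前缀和,使每步计算与序列长度无关。以下为因果线性注意力的示意实现(φ 取指数映射、值取标量,均为演示性简化,并非论文模型):

```python
import math

def phi(x):
    # 核特征映射(演示用指数映射;实践中常用 elu(x)+1),保证权重非负
    return [math.exp(v) for v in x]

def linear_attention(qs, ks, vs):
    # 因果线性注意力:维护前缀和 S=Σφ(k)·v 与 z=Σφ(k),
    # 每步只做 O(d) 更新,总成本随序列长度线性增长
    d = len(qs[0])
    S = [0.0] * d
    z = [0.0] * d
    out = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        for i in range(d):
            S[i] += fk[i] * v      # v 取标量值,简化演示
            z[i] += fk[i]
        fq = phi(q)
        num = sum(fq[i] * S[i] for i in range(d))
        den = sum(fq[i] * z[i] for i in range(d))
        out.append(num / den)      # 每步输出是历史值的凸组合
    return out
```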

[AI-21] SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting with Tri-Context Personalization

【速读】:该论文旨在解决复杂网络安全支持场景中虚拟客户助理(Virtual Customer Assistant, VCA)的适应性与有效性不足问题,尤其在设备、用户和服务三方面缺乏针对性响应。解决方案的关键在于构建一个基于多智能体架构的VCA系统——SecMate,其核心创新包括:1)通过轻量级本地诊断工具实现设备特异性(device specificity),利用设备级信号提升故障定位准确性;2)基于隐式熟练度推理和用户画像实现用户特异性(user specificity),优化个性化引导策略;3)引入主动且上下文感知的推荐机制实现服务特异性(service specificity),提高建议的相关性与实用性。实验表明,该方案显著提升了正确解决率(从约50%提升至90%以上),并增强了用户体验与接受度。

链接: https://arxiv.org/abs/2604.26394
作者: Yair Meidan,Omri Haller,Yulia Moshan,Shahaf David,Dudu Mimran,Yuval Elovici,Asaf Shabtai
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models and agentic frameworks have enabled virtual customer assistants (VCAs) for complex support. We present SecMate, a multi-agent VCA for cybersecurity troubleshooting that integrates device, user, and service specificity from conversational and device-level signals. Device specificity is provided by a lightweight local diagnostic utility, while user specificity relies on implicit proficiency inference and profile-aware troubleshooting. Service specificity is achieved through a proactive, context-aware recommender. We evaluate SecMate in a controlled study with 144 participants and 711 conversations. Device-level evidence increased correct resolutions from about 50% to over 90% relative to an LLM-only baseline, while step-by-step guidance improved pleasantness and reduced user burden. The recommender achieved high relevance (MRR@1=0.75), and participants showed strong willingness to substitute human IT support at costs well below human benchmarks. We release the full code base and a richly annotated dataset to support reproducible research on adaptive VCAs.

[AI-22] Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)系统在面对人类偏好驱动的奖励函数时,因不确定性、上下文依赖性和内在不一致性而导致的对齐失败问题,如奖励黑客(reward hacking)、过度优化和过度自信行为。其解决方案的关键在于提出一种双源不确定性感知的奖励框架,该框架显式建模两类不确定性:一是价值估计中的认知不确定性(epistemic uncertainty),通过集成模型对价值预测的分歧来捕捉;二是人类偏好的不确定性,由奖励标注的变异性推导得出。二者通过一个置信度调整的可靠性过滤器(Reliability Filter)进行融合,动态调节动作选择策略,在利用与谨慎之间取得平衡,从而提升训练稳定性并显著减少奖励模糊情形下的投机利用(exploitative)行为,实验证明可使陷阱访问频率降低93.7%。

链接: https://arxiv.org/abs/2604.26360
作者: Disha Singha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 31 pages, 18 figures, 3 tables

点击查看摘要

Abstract:Reinforcement learning (RL) systems typically optimize scalar reward functions that assume precise and reliable evaluation of outcomes. However, real-world objectives–especially those derived from human preferences–are often uncertain, context-dependent, and internally inconsistent. This mismatch can lead to alignment failures such as reward hacking, over-optimization, and overconfident behavior. We introduce a dual-source uncertainty-aware reward framework that explicitly models both epistemic uncertainty in value estimation and uncertainty in human preferences. Model uncertainty is captured via ensemble disagreement over value predictions, while preference uncertainty is derived from variability in reward annotations. We combine these signals through a confidence-adjusted Reliability Filter that adaptively modulates action selection, encouraging a balance between exploitation and caution. Empirical results across multiple discrete grid configurations (6x6, 8x8, 10x10) and high-dimensional continuous control environments (Hopper-v4, Walker2d-v4) demonstrate that our approach yields more stable training dynamics and reduces exploitative behaviors under reward ambiguity, achieving a 93.7% reduction in reward-hacking behavior as measured by trap visitation frequency. We demonstrate statistical significance of these improvements and robustness under up to 30% supervisory noise, albeit with a trade-off in peak observed reward compared to unconstrained baselines. By treating uncertainty as a first-class component of the reward signal, this work offers a principled approach toward more reliable and aligned reinforcement learning systems. 
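
论文把价值集成的分歧作为认知不确定性、把标注变异度作为偏好不确定性,再用置信度调整的可靠性过滤器折减价值估计。以下为该思想的假设性最小实现(折减形式与系数均为演示所设,非论文公式):

```python
import statistics

def reliability(value_samples, pref_std, alpha=1.0, beta=1.0):
    # 认知不确定性:价值集成的标准差;偏好不确定性:奖励标注的变异度
    epistemic = statistics.pstdev(value_samples)
    return 1.0 / (1.0 + alpha * epistemic + beta * pref_std)

def filtered_value(value_samples, pref_std):
    # 可靠性过滤器:用置信度折减价值估计,
    # 抑制对"高回报但高不确定"动作的过度自信(奖励黑客的典型形态)
    mean_v = statistics.fmean(value_samples)
    return mean_v * reliability(value_samples, pref_std)
```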

[AI-23] DreamProver: Evolving Transferable Lemma Libraries via a Wake-Sleep Theorem-Proving Agent

【速读】:该论文旨在解决形式化定理证明中lemma库的适应性与通用性不足的问题:现有方法要么依赖固定lemma库导致灵活性差,要么生成高度特定于单个定理的中间lemma,缺乏跨任务的可迁移性。解决方案的关键在于提出一种基于“唤醒-睡眠”(wake-sleep)程序归纳范式的智能体框架DreamProver,其通过迭代的两阶段过程实现lemma库的动态演化——在“唤醒”阶段尝试用当前lemma库证明训练定理并生成候选lemma,在“睡眠”阶段对候选lemma进行抽象、精炼与整合以压缩和优化库结构;这一交替机制使DreamProver逐步演化出一组紧凑且高阶的可迁移lemma,显著提升对未见定理的证明成功率,同时降低计算开销并生成更简洁的证明。

链接: https://arxiv.org/abs/2604.26311
作者: Youyuan Zhang,Jialiang Sun,Hangrui Bi,Chuqin Geng,Wenjie Ma,Zhaoyu Li,Xujie Si
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce DreamProver, an agentic framework that leverages a “wake-sleep” program induction paradigm to discover reusable lemmas for formal theorem proving. Existing approaches either rely on fixed lemma libraries, which limit adaptability, or synthesize highly specific intermediate lemmas tailored to individual theorems, thereby lacking generality. DreamProver addresses this gap through an iterative two-stage process. In the wake stage, DreamProver attempts to prove theorems from a training set using the current lemma library while proposing new candidate lemmas. In the “sleep” stage, it abstracts, refines, and consolidates these candidates to compress and optimize the library. Through this alternating cycle, DreamProver progressively evolves a compact set of high-level, transferable lemmas that can be effectively used to prove unseen theorems in related domains. Experimental results demonstrate that DreamProver substantially improves proof success rates across a diverse set of mathematical benchmarks, while also producing more concise proofs and reducing computational cost.
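
唤醒-睡眠循环可以用一个极简玩具模型示意:唤醒阶段为证明失败的定理提出候选引理,睡眠阶段合并重复候选、只保留跨定理复用的引理以压缩库。以下草图中的数据结构与阈值均为演示性假设,与论文的形式化系统无关:

```python
from collections import Counter

def wake(theorems, library):
    # 唤醒阶段:用当前引理库尝试证明;缺失的中间引理成为候选(玩具化)
    candidates = []
    for thm in theorems:
        candidates.extend(set(thm["needs"]) - set(library))
    return candidates

def sleep(library, candidates, min_count=2):
    # 睡眠阶段:合并重复候选,仅把被多个定理需要的"可迁移"引理提升进库
    counts = Counter(candidates)
    promoted = [lem for lem, c in counts.items() if c >= min_count]
    return sorted(set(library) | set(promoted))
```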

[AI-24] Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents

【速读】:该论文旨在解决由大型语言模型驱动的结构化工作流代理(structured-workflow agents)在调用敏感外部工具时面临的潜在安全威胁,特别是针对行为异常的入侵检测与防御问题。其核心挑战在于如何在不显著增加运行时开销的前提下,有效识别并阻止恶意工具调用序列。解决方案的关键在于提出一种基于遥测数据的行为异常检测防火墙 \codename,它通过将已验证的良性工具调用遥测信息建模为参数化确定性有限自动机(parameterized deterministic finite automaton, pDFA),从而显式定义允许的工具调用序列、上下文依赖关系及参数边界;运行时通过一个轻量级网关执行 O(1) 状态转移查找来强制实施这些规则,将复杂的分析任务全部前置至离线阶段,实现了高效率与强防御能力的统一。

链接: https://arxiv.org/abs/2604.26274
作者: Hung Dang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Structured-workflow agents driven by large language models execute tool calls against sensitive external environments. We propose \codename, a telemetry-driven behavioral anomaly detection firewall. Drawing on sequence-based intrusion detection, \codename compiles verified benign tool-call telemetry into a parameterized deterministic finite automaton (pDFA). The model defines permitted tool sequences, sequential contexts, and parameter bounds. At runtime, a lightweight gateway enforces these boundaries via an O(1) state-transition structural lookup, shifting computationally expensive analysis entirely offline. Evaluated on the Agent Security Bench (ASB), \codename achieves a 5.6% macro-averaged attack success rate (ASR) across five scenarios. Within three structured workflows, ASR drops to 2.2%, outperforming Aegis, a state-of-the-art stateless scanner, at 12.8%. \codename achieves 0% ASR on multi-step and context-sequential attacks in structured settings. Furthermore, against 1,000 algorithmically spliced exfiltration payloads, only 1.4% matched valid structural paths, all of which failed end-to-end string parameter guards (0 successes out of 14 surviving paths, 95% CI [0%, 23.2%]). \codename introduces just 2.2 ms of per-call latency (a 3.7x speedup over Aegis) while maintaining a 2.0% benign task failure rate (BTFR) on benign workloads. Modeling the behavioral trajectory effectively collapses the available attack surface, but unmaintained continuous parameter bounds remain vulnerable to synonym-substitution attacks (18% evasion rate). Thus, exact-match whitelisting of sensitive parameters ultimately bears the final defensive load against execution.
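
把良性遥测离线编译为 pDFA 之后,运行时防火墙只需做 O(1) 的状态转移查表,并对敏感参数做精确白名单匹配。以下为该执行逻辑的假设性草图(状态、工具名与守卫函数均为演示所设):

```python
class PDFAFirewall:
    # 运行时网关:对每次工具调用做 O(1) 的状态转移查表与参数守卫检查
    def __init__(self, transitions, param_guards):
        self.transitions = transitions    # (当前状态, 工具名) -> 下一状态
        self.param_guards = param_guards  # 工具名 -> 敏感参数白名单判定函数
        self.state = "start"

    def allow(self, tool, params):
        nxt = self.transitions.get((self.state, tool))
        if nxt is None:
            return False                  # 结构性拦截:不在良性序列模型内
        guard = self.param_guards.get(tool)
        if guard is not None and not guard(params):
            return False                  # 参数拦截:敏感参数须精确匹配白名单
        self.state = nxt
        return True

# 演示用的良性工作流:read_ticket -> query_db -> send_reply
TRANSITIONS = {("start", "read_ticket"): "triage",
               ("triage", "query_db"): "resolve",
               ("resolve", "send_reply"): "done"}
GUARDS = {"send_reply": lambda p: p.get("to") == "user@corp.example"}
```

越过状态序列的调用在查表这一步即被拒绝;即便攻击拼出合法路径,最终仍要通过参数守卫,这对应摘要中"精确白名单承担最终防线"的结论。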

[AI-25] Apriori-based Analysis of Learned Helplessness in Mathematics Tutoring: Behavioral Patterns by Level Intervention and Outcome

【速读】:该论文旨在解决数学辅导系统中习得性无助(Learned Helplessness, LH)对学生行为模式影响的识别与干预优化问题。研究通过Apriori算法挖掘学习者交互日志中的关联规则,识别出不同LH水平和系统干预条件下,行为模式(如跳过问题、使用提示、持续尝试等)与问题解决结果之间的关键关联。解决方案的关键在于:首先,发现“跳过问题且不使用提示”是导致未解题的核心行为模式,而“不跳过”则普遍与成功解题正相关;其次,区分低LH与高LH学生的行为差异——前者更倾向通过持续尝试和合理使用提示达成成功,后者则表现出更强的回避倾向;最后,揭示无干预组中“坚持尝试→成功”的关联最强,而有干预组中“跳过→失败”的模式更显著,表明系统干预可能强化了部分学生的消极行为路径。此分析为个性化干预策略的设计提供了数据驱动依据。

链接: https://arxiv.org/abs/2604.26237
作者: John Paul P. Miranda
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 9 pages, 2 figures, 1 table, journal article

点击查看摘要

Abstract:This study applied the Apriori algorithm to analyze behavioral interaction patterns associated with learned helplessness (LH) in mathematics tutoring system logs. Interaction data were examined across three dimensions: LH level (low vs. high), system-based intervention (with vs. without), and problem-solving outcomes (solved vs. unsolved). The analysis of the complete dataset showed that skipping problems without using hints was the most frequent pattern linked to unsolved outcomes, while persistence behaviors such as not skipping were less dominant overall. Comparisons by LH level showed that low-LH students had stronger links between problem solving and not skipping, as well as positive associations between hint use and solved outcomes. High-LH students showed more avoidance patterns, with skipping strongly tied to unsolved outcomes. In the comparison of system-based intervention conditions, students without intervention had the highest lift for persistence-success links, while the with-intervention group had stronger patterns involving skipping behaviors leading to unsolved outcomes. Outcome-specific analysis showed that not skipping was consistently associated with solved problems across all groups, while skipping without hints predicted unsolved outcomes. Practical implications and recommendations are discussed.
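
Apriori 风格的分析依赖支持度(support)与提升度(lift):lift > 1 表示左部行为与右部结果正相关。以下用虚构的学习会话数据演示这两个指标的计算(会话内容为演示性示例,非论文数据):

```python
def support(transactions, itemset):
    # 支持度:同时包含 itemset 中全部项的事务占比
    s = set(itemset)
    return sum(1 for t in transactions if s <= t) / len(transactions)

def lift(transactions, lhs, rhs):
    # lift = P(lhs ∧ rhs) / (P(lhs)·P(rhs));大于 1 表示正相关
    return support(transactions, lhs + rhs) / (
        support(transactions, lhs) * support(transactions, rhs))

# 虚构的学习会话:每个集合是一次答题中出现的行为与结果
sessions = [
    {"skip", "no_hint", "unsolved"},
    {"skip", "no_hint", "unsolved"},
    {"no_skip", "hint", "solved"},
    {"no_skip", "solved"},
]
```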

[AI-26] Persuadability and LLM s as Legal Decision Tools

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在法律决策场景中如何响应法律论证的问题,尤其是探讨影响模型对复杂法律问题作出特定判断的关键因素。其核心关切在于:LLMs作为潜在的司法或行政决策辅助工具,是否能够合理地评估双方当事人提出的法律论点,既具备对有效论证的敏感性,又避免因律师表达能力差异而产生偏倚。解决方案的关键在于通过原创实验设计,系统考察不同质量的辩护方(advocate)所提出论证对LLM决策倾向的影响,并识别驱动此类反应的内在机制,从而为LLMs在法律领域中的可行性与可靠性提供实证依据。

链接: https://arxiv.org/abs/2604.26233
作者: Oisin Suttle,David Lillis
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted to 21st International Conference on Artificial Intelligence and Law (ICAIL 2026)

点击查看摘要

Abstract:As Large Language Models (LLMs) are proposed as legal decision assistants, and even first-instance decision-makers, across a range of judicial and administrative contexts, it becomes essential to explore how they answer legal questions, and in particular the factors that lead them to decide difficult questions in one way or another. A specific feature of legal decisions is the need to respond to arguments advanced by contending parties. A legal decision-maker must be able to engage with, and respond to, including through being potentially persuaded by, arguments advanced by the parties. Conversely, they should not be unduly persuadable, influenced by a particularly compelling advocate to decide cases based on the skills of the advocates, rather than the merits of the case. We explore how frontier open- and closed-weights LLMs respond to legal arguments, reporting original experimental results examining how the quality of the advocate making those arguments affects the likelihood that a model will agree with a particular legal point of view, and exploring the factors driving these results. Our results have implications for the feasibility of adopting LLMs across legal and administrative settings.

[AI-27] OMEGA: Optimizing Machine Learning by Evaluating Generated Algorithms ICLR2026

【速读】:该论文旨在解决人工智能研究自动化的问题,即如何从算法构思到可执行代码实现全流程自动化,以加速机器学习(Machine Learning, ML)模型的创新与优化。其解决方案的关键在于提出并实现了一个端到端框架 OMEGA(Optimizing Machine learning by Evaluating Generated Algorithms),该框架融合结构化元提示工程(structured meta-prompt engineering)与可执行代码生成技术,能够自动生成新颖且性能优于 scikit-learn 基线的 ML 分类器,并在 20 个基准数据集(infinity-bench)上验证了其有效性。

链接: https://arxiv.org/abs/2604.26211
作者: Jeremy Nixon,Annika Singh
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICLR 2026: Workshop on AI with Recursive Self-Improvement

点击查看摘要

Abstract:In order to automate AI research we introduce a full, end-to-end framework, OMEGA: Optimizing Machine learning by Evaluating Generated Algorithms, that starts at idea generation and ends with executable code. Our system combines structured meta-prompt engineering with executable code generation to create new ML classifiers. The OMEGA framework has been utilized to generate several novel algorithms that outperform scikit-learn baselines across a robust selection of 20 benchmark datasets (infinity-bench). You can access models discussed in this paper and more in the python package: pip install omega-models.

[AI-28] Co-Learning Port-Hamiltonian Systems and Optimal Energy-Shaping Control

【速读】:该论文旨在解决从轨迹数据中学习能量整形控制(energy-shaping control)策略的问题,特别是针对端口哈密顿系统(port-Hamiltonian, pH)的控制设计,以实现闭环系统的固有无源性(inherent passivity)和理论稳定性。解决方案的关键在于提出一种物理信息引导的学习框架,通过交替优化机制联合学习pH系统模型与最优能量平衡无源控制(energy-balancing passivity-based controller, EB-PBC),其中两者均由嵌入pH结构和EB-PBC结构的神经网络参数化,确保了对能量交互关系的可解释性;同时引入耗散正则化项强制训练过程中能量严格衰减,从而提升对仿真到现实(sim-to-real)迁移的鲁棒性。

链接: https://arxiv.org/abs/2604.26172
作者: Ankur Kamboj,Biswadip Dey,Vaibhav Srivastava
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We develop a physics-informed learning framework for energy-shaping control of port-Hamiltonian (pH) systems from trajectory data. The proposed approach co-learns a pH system model and an optimal energy-balancing passivity-based controller (EB-PBC) through alternating optimization with policy-aware data collection. At each iteration, the system model is refined using trajectory data collected under the current control policy, and the controller is re-optimized on the updated model. Both components are parameterized by neural networks that embed the pH dynamics and EB-PBC structure, ensuring interpretability in terms of energy interactions. The learned controller renders the closed-loop system inherently passive and provably stable, and exploits passive plant dynamics without canceling the natural potential. A dissipation regularization enforces strict energy decay during training, thereby enhancing robustness to sim-to-real gaps. The proposed framework is validated on state-regulation and swing-up tasks for planar and torsional pendulum systems.

[AI-29] ImproBR: Bug Report Improver Using LLMs

【速读】:该论文旨在解决软件维护中因用户提交的缺陷报告(Bug Report)质量低下而导致的可复现性差、信息缺失等问题,尤其是步骤重现(Steps to Reproduce, S2R)、观察行为(Observed Behavior, OB)和预期行为(Expected Behavior, EB)等关键信息常被遗漏或表述模糊。解决方案的核心在于提出一个基于大语言模型(Large Language Model, LLM)的改进管道 ImproBR,其关键创新包括:(1)采用混合检测器融合微调后的 DistilBERT、启发式规则与 LLM 分析模块;(2)利用 GPT-4o mini 结合特定段落的少样本提示(few-shot prompts)进行语义理解与修复;(3)引入基于 Minecraft Wiki 领域知识的检索增强生成(Retrieval-Augmented Generation, RAG)机制,提升生成内容的专业性和准确性。实证结果显示,ImproBR 显著提升了报告结构完整性(从 7.9% 提升至 96.4%),并大幅增加可执行 S2R 比例(从 28.8% 提升至 67.6%),验证了其有效性。

链接: https://arxiv.org/abs/2604.26142
作者: Emre Furkan Akyol,Mehmet Dedeler,Eray Tüzün
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Bug tracking systems play a crucial role in software maintenance, yet developers frequently struggle with low-quality user-submitted reports that omit essential details such as Steps to Reproduce (S2R), Observed Behavior (OB), and Expected Behavior (EB). We propose ImproBR, an LLM-based pipeline that automatically detects and improves bug reports by addressing missing, incomplete, and ambiguous S2R, OB, and EB sections. ImproBR employs a hybrid detector combining fine-tuned DistilBERT, heuristic analysis, and an LLM analyzer, guided by GPT-4o mini with section-specific few-shot prompts and a Retrieval-Augmented Generation (RAG) pipeline grounded in Minecraft Wiki domain knowledge. Evaluated on Mojira, ImproBR improved structural completeness from 7.9% to 96.4%, more than doubled the proportion of executable S2R from 28.8% to 67.6%, and raised fully reproducible bug reports from 1 to 13 across 139 challenging real-world reports.

[AI-30] reward-lens: A Mechanistic Interpretability Library for Reward Models

【速读】:该论文旨在解决生成式语言模型(Generative Language Models, GLMs)中已成熟应用的可解释性工具(如logit lens、直接logit归因、激活补丁等)在奖励模型(Reward Models, RMs)上的适用性问题。由于奖励模型采用标量回归头替代了生成式模型中的词汇表解嵌入(vocabulary unembedding),导致原有工具失效。其解决方案的关键在于提出一个名为reward-lens的开源库,核心思想是将奖励头权重向量 $ w_r $ 视为所有可解释性分析的自然轴线,并围绕此构建一套完整的工具链,包括奖励透镜、组件归因、三模式激活补丁、奖励劫持探测套件、TopK稀疏自编码器特征归因、跨模型比较及五类理论驱动扩展(如扭曲指数、偏差感知补丁、错位级联检测等)。该框架不仅使现有工具适配于奖励模型,更将观测性与因果性差异本身作为可分析属性,从而推动对奖励模型内部机制的理解。

链接: https://arxiv.org/abs/2604.26130
作者: Mohammed Suhail B Nadaf
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 30 pages, 5 figures, 9 tables, including appendix. Library available at this https URL (pip install reward-lens)

点击查看摘要

Abstract:Every RLHF-trained language model is shaped by a reward model, yet the mechanistic interpretability toolkit – logit lens, direct logit attribution, activation patching, sparse autoencoders – was built for generative LLMs whose primitives all project onto a vocabulary unembedding. Reward models replace that with a scalar regression head, breaking each tool. We present reward-lens, an open-source library that ports this toolkit to reward models, organised around one observation: the reward head’s weight vector w_r is the natural axis for every interpretability question. The library provides a Reward Lens, component attribution, three-mode activation patching, a reward-hacking probe suite, TopK SAE feature attribution, cross-model comparison, and five theory-grounded extensions (distortion index, divergence-aware patching, misalignment cascade detection, reward-term conflict analysis, concept-vector analysis). A ten-method adapter protocol covers Llama, Mistral, Gemma-2, and ArmoRM multi-objective heads, with a generic adapter for any HuggingFace sequence classification model. We validate on two production reward models across ~695 RewardBench pairs. The central empirical finding is negative: linear attribution does not predict causal patching effects (mean Spearman \rho = -0.256 on Skywork, -0.027 on ArmoRM). The framework treats this disagreement as a property to expose, not a bug – motivating a design that keeps observational and causal views first-class and directly comparable.
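reward-lens 的核心观察是:奖励头权重向量 w_r 是所有可解释性分析的自然轴。下面用 NumPy 给出一个极简示意(层名与数值均为虚构,并非该库的真实 API):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                      # hidden size
w_r = rng.normal(size=d)    # reward head weight vector

# Residual-stream states at the final token after each of 4 layers (toy values)
hidden = {f"layer_{i}": rng.normal(size=d) for i in range(4)}

# "Reward lens": project each layer's state onto w_r to watch the
# scalar reward estimate evolve with depth.
lens = {name: float(h @ w_r) for name, h in hidden.items()}

# Direct attribution: per-layer deltas projected onto w_r telescope
# exactly to the final reward.
contrib, prev = [], np.zeros(d)
for i in range(4):
    h = hidden[f"layer_{i}"]
    contrib.append(float((h - prev) @ w_r))
    prev = h

final_reward = float(hidden["layer_3"] @ w_r)
```

这正对应把 logit lens / 直接 logit 归因中的词表解嵌入替换为单一向量 w_r。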

[AI-31] Hierarchical Multi-Persona Induction from User Behavioral Logs: Learning Evidence-Grounded and Truthful Personas

【速读】:该论文旨在解决从用户行为日志中生成高质量、可解释的用户画像(persona)的问题,现有方法虽利用大语言模型(Large Language Models, LLMs)生成自然语言形式的画像,但缺乏对画像本身质量的充分评估与保障。其解决方案的关键在于提出一个分层框架:首先将用户行为聚合为意图记忆(intent memory),进而通过聚类和标注生成多个基于证据的用户画像;同时将画像诱导建模为优化问题,以聚类内聚性、画像与证据的一致性及真实性作为核心指标,并采用组级扩展的直接偏好优化(groupwise extension of Direct Preference Optimization, DPO)进行训练,从而提升画像的连贯性、证据支撑度与可信度。

链接: https://arxiv.org/abs/2604.26120
作者: Nayoung Choi,Haeyu Jeong,Changbong Kim,Hongjun Lim,Jinho D. Choi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Behavioral logs provide rich signals for user modeling, but are noisy and interleaved across diverse intents. Recent work uses LLMs to generate interpretable natural-language personas from user logs, yet evaluation often emphasizes downstream utility, providing limited assurance of persona quality itself. We propose a hierarchical framework that aggregates user actions into intent memories and induces multiple evidence-grounded personas by clustering and labeling these memories. We formulate persona induction as an optimization problem over persona quality-captured by cluster cohesion, persona-evidence alignment, and persona truthfulness-and train the persona model using a groupwise extension of Direct Preference Optimization (DPO). Experiments on a large-scale service log and two public datasets show that our method induces more coherent, evidence-grounded, and trustworthy personas, while also improving future interaction prediction.
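文中"DPO 的组级扩展"可以理解为:在一组候选画像内,对各候选的隐式奖励做 softmax,并最大化被偏好候选的概率。以下为编者给出的示意实现(β 取值与打分方式均为假设,并非论文原公式):

```python
import numpy as np

def groupwise_dpo_loss(policy_logps, ref_logps, preferred_idx, beta=0.1):
    """Groupwise DPO sketch: implicit reward r_i = beta * (logp_i - ref_logp_i);
    loss = softmax cross-entropy of the preferred candidate within its group."""
    rewards = beta * (np.asarray(policy_logps) - np.asarray(ref_logps))
    z = rewards - rewards.max()              # numerically stable log-softmax
    log_softmax = z - np.log(np.exp(z).sum())
    return float(-log_softmax[preferred_idx])
```

当策略相对参考模型提高被偏好画像的对数似然时,该损失单调下降。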

[AI-32] Evaluating Strategic Reasoning in Forecasting Agents

【速读】:该论文旨在解决现有预测基准(forecasting benchmarks)仅提供准确性排行榜而缺乏对预测者优劣成因解释的问题。其解决方案的关键在于引入Bench to the Future 2(BTF-2),这是一个包含1,417个回溯性预测问题(pastcasting questions)的基准,结合一个冻结的1500万文档研究语料库,使代理(agents)能够在离线环境中可复现地进行研究与预测,并生成完整的推理轨迹(reasoning traces)。BTF-2能够检测到0.004 Brier分数的精度差异,并区分代理在研究能力与判断力上的不同优势,从而为分析预测性能差异提供机制层面的洞察。

链接: https://arxiv.org/abs/2604.26106
作者: Tom Liptay,Dan Schwarz,Rafael Poyiadzi,Jack Wildman,Nikos I. Bosse
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Forecasting benchmarks produce accuracy leaderboards but little insight into why some forecasters are more accurate than others. We introduce Bench to the Future 2 (BTF-2), 1,417 pastcasting questions with a frozen 15M-document research corpus in which agents reproducibly research and forecast offline, producing full reasoning traces. BTF-2 detects accuracy differences of 0.004 Brier score, and can distinguish differential agent strengths in research vs. judgment. We build a forecaster 0.011 Brier more accurate than any single frontier agent, and use it to evaluate agent strategic reasoning without hindsight bias. We find the better forecaster differs primarily in its pre-mortem analysis of its blind spots and consideration of black swans. Expert human forecasters found the dominant strategic reasoning failures of frontier agents are in assessing political and business leaders’ incentives, judging their likelihood to follow through on stated plans, and modeling institutional processes.
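文中用于衡量 0.004 精度差异的 Brier 分数,即预测概率与 0/1 结果之间的均方误差,越低越好:

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

# Two hypothetical forecasters on the same three resolved questions
score_a = brier_score([0.9, 0.2, 0.7], [1, 0, 1])   # sharper forecasts
score_b = brier_score([0.6, 0.4, 0.5], [1, 0, 1])   # hedged forecasts
```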

[AI-33] AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

【速读】:该论文针对当前大语言模型(Large Language Model, LLM)推理服务系统中以GPU为中心的架构设计所引发的效率瓶颈问题展开研究,尤其关注解码阶段注意力计算的内存密集特性与GPU计算密集型架构之间的不匹配,导致服务延迟高、功耗浪费严重,且在长上下文(如百万token级别)场景下成为主要性能瓶颈。解决方案的关键在于提出AMMA——一种多芯粒(multi-chiplet)、以内存为中心(memory-centric)的新型硬件架构:通过用HBM-PNM立方体替代GPU计算die,将可用内存带宽提升约一倍;并配套设计三项核心技术:(i) 逻辑die微架构,在极低功耗和面积预算下充分利用每个立方体内部带宽完成解码注意力计算;(ii) 两级混合并行机制;(iii) 重排序的集体通信流程以降低芯粒间通信开销。实验表明,AMMA相比NVIDIA H100实现15.5倍更低的注意力延迟和6.9倍更低的能量消耗。

链接: https://arxiv.org/abs/2604.26103
作者: Zhongkai Yu,Haotian Ye,Chenyang Zhou,Ohm Rishabh Venkatachalam,Zaifeng Pan,Zhengding Hu,Junsung Kim,Won Woo Ro,Po-An Tsai,Shuyi Pei,Yangwook Kang,Yufei Ding
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:All current LLM serving systems place the GPU at the center, from production-level attention-FFN disaggregation to NVIDIA’s Rubin GPU-LPU heterogeneous platform. Even academic PIM/PNM proposals still treat the GPU as the central hub for cross-device communication. Yet the GPU’s compute-rich architecture is fundamentally mismatched with the memory-bound nature of decode-phase attention, inflating serving latency while wasting power and die area on idle compute units. The problem is compounded as reasoning and agentic workloads push context lengths toward one million tokens, making attention latency the primary user-facing bottleneck. To address these inefficiencies, we present AMMA, a multi-chiplet, memory-centric architecture for low-latency long-context attention. AMMA replaces GPU compute dies with HBM-PNM cubes, roughly doubling the available memory bandwidth to better serve memory-bound attention workloads. To translate this bandwidth into proportional performance gains, we introduce (i) a logic-die microarchitecture that fully exploits per-cube internal bandwidth for decode attention under a minimal power and area budget, (ii) a two-level hybrid parallelism scheme, and (iii) a reordered collective flow that reduces intra-chip die-to-die communication overhead. We further conduct a design-space exploration over per-cube compute power and intra-chip D2D link bandwidth, providing actionable guidance for hardware designers. Evaluations show that AMMA achieves 15.5X lower attention latency and 6.9X lower energy consumption compared with the NVIDIA H100.

[AI-34] Momentum-Conserving Graph Neural Networks for Deformable Objects

【速读】:该论文旨在解决现有图神经网络(Graph Neural Networks, GNNs)在模拟可变形材料动态行为时无法准确预测线动量(linear momentum)和角动量(angular momentum)时间演化的问题。其解决方案的关键在于提出MomentumGNN架构,该架构通过预测每条边的拉伸和弯曲冲量(stretching and bending impulses),而非输出无约束的节点加速度,从而从模型结构上保证动量守恒特性。训练过程采用基于物理的无监督损失函数,实验证明该方法在动量起关键作用的多种场景下显著优于基线模型。

链接: https://arxiv.org/abs/2604.26097
作者: Jiahong Wang,Logan Numerow,Stelian Coros,Christian Theobalt,Vahid Babaei,Bernhard Thomaszewski
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Accepted to 3DV 2026

点击查看摘要

Abstract:Graph neural networks (GNNs) have emerged as a versatile and efficient option for modeling the dynamic behavior of deformable materials. While GNNs generalize readily to arbitrary shapes, mesh topologies, and material parameters, existing architectures struggle to correctly predict the temporal evolution of key physical quantities such as linear and angular momentum. In this work, we propose MomentumGNN – a novel architecture designed to accurately track momentum by construction. Unlike existing GNNs that output unconstrained nodal accelerations, our model predicts per-edge stretching and bending impulses which guarantee the preservation of linear and angular momentum. We train our network in an unsupervised fashion using a physics-based loss, and we show that our method outperforms baselines in a number of common scenarios where momentum plays a pivotal role.
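MomentumGNN"按构造守恒动量"的原理很直接:沿每条边对两个端点施加等大反向的冲量,总线动量与角动量都不可能改变。下面用 NumPy 验证这一点(冲量数值随机,仅作示意;论文中它们由 GNN 预测):

```python
import numpy as np

rng = np.random.default_rng(1)
n_nodes = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
pos = rng.normal(size=(n_nodes, 2))
vel = rng.normal(size=(n_nodes, 2))
mass = np.ones(n_nodes)

# Per-edge scalar stretching impulses (in the paper, predicted by the GNN)
impulses = rng.normal(size=len(edges))

new_vel = vel.copy()
for (i, j), s in zip(edges, impulses):
    d = pos[j] - pos[i]
    d /= np.linalg.norm(d)               # unit edge direction
    new_vel[i] += s * d / mass[i]        # equal and opposite impulses:
    new_vel[j] -= s * d / mass[j]        # momentum changes cancel pairwise

def linear_momentum(v):
    return (mass[:, None] * v).sum(axis=0)

def angular_momentum(v):                 # 2D scalar: sum of m (x vy - y vx)
    return float((mass * (pos[:, 0] * v[:, 1] - pos[:, 1] * v[:, 0])).sum())
```

角动量同样守恒:每条边的冲量沿边方向作用,(pos_i - pos_j) 与冲量方向平行,叉积为零。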

[AI-35] Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields

【速读】:该论文旨在解决闭环逆源定位与表征(Closed-loop inverse source localization and characterization, ISLC)中因信念空间目标带来的挑战:在严格时间约束下,移动代理需选择测量以定位源并推断潜在场参数。核心问题在于不确定性估计的权衡——使用昂贵的贝叶斯推断可保证正确性,但效率低下;而采用快速学习的信念模型虽提升效率,却易引发奖励欺骗(reward hacking),即策略利用近似误差而非真正降低不确定性。解决方案的关键在于提出 Distill-Belief 教师-学生框架,通过解耦正确性与效率实现优化:教师为基于粒子滤波的贝叶斯正确后验模型,提供密集的信息增益信号;学生则从教师中蒸馏出用于控制的信念统计量及用于终止决策的不确定性证书,在部署阶段仅使用学生模型,从而实现每步恒定计算成本。实验表明,该方法在七种场模态和两种压力测试中均显著降低感知成本、提升成功率、后验收缩率与估计精度,并有效缓解奖励欺骗问题。

链接: https://arxiv.org/abs/2604.26095
作者: Yiwei Shi,Zixing Song,Mengyue Yang,Cunjia Liu,Weiru Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Closed-loop inverse source localization and characterization (ISLC) requires a mobile agent to select measurements that localize sources and infer latent field parameters under strict time constraints. The core challenge lies in the belief-space objective: valid uncertainty estimation requires expensive Bayesian inference, whereas using a fast learned belief model leads to reward hacking, in which the policy exploits approximation errors rather than actually reducing uncertainty. We propose Distill-Belief, a teacher–student framework that decouples correctness from efficiency. A Bayes-correct particle-filter teacher maintains the posterior and supplies a dense information-gain signal, while a compact student distills the posterior into belief statistics for control and an uncertainty certificate for stopping. At deployment, only the student is used, yielding constant per-step cost. Experiments on seven field modalities and two stress tests show that Distill-Belief consistently reduces sensing cost and improves success, posterior contraction, and estimation accuracy over baselines, while mitigating reward hacking.

[AI-36] Privacy-Preserving Federated Learning Framework for Distributed Chemical Process Optimization

【速读】:该论文旨在解决工业化学装置在严格数据保密约束下难以实现集中式数据驱动过程建模的问题。其核心解决方案是提出一种隐私保护的联邦学习(Federated Learning, FL)框架,通过在多个地理位置分散的工厂之间协作训练神经网络过程模型,仅传输模型参数而非原始传感器数据至中央聚合服务器,从而实现跨工厂知识共享的同时保障数据本地性和工业机密性。关键创新在于利用安全聚合机制,在不泄露原始数据的前提下完成全局模型优化,实验表明该方法能在少量通信轮次内快速收敛并显著提升预测精度,性能接近集中式训练,具备良好的可扩展性和实用性。

链接: https://arxiv.org/abs/2604.26073
作者: Teetat Pipattaratonchai,Aueaphum Aueawatthanaphisut
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 10 pages, 5 figures, 2 tables, 17 equations

点击查看摘要

Abstract:Industrial chemical plants often operate under strict data confidentiality constraints, making centralized data-driven process modeling difficult. Federated learning (FL) provides a promising solution by enabling collaborative model training across distributed facilities without sharing raw operational data. This paper proposes a privacy-preserving federated learning framework for distributed chemical process optimization using data collected from multiple geographically separated plants. Each plant locally trains a neural-network-based process model using its own time-series sensor data, while only model parameters are transmitted to a central aggregation server through secure aggregation mechanisms. This design allows cross-plant knowledge sharing while maintaining strict data locality and industrial confidentiality. Experimental evaluation was conducted using process datasets from three independent chemical plants operating under heterogeneous conditions. The results demonstrate rapid convergence of the federated model, with the global mean squared error decreasing from approximately 2369 to below 50 within the first five communication rounds and stabilizing around 35 after 40 rounds. In comparison with local-only training, the proposed federated framework significantly improves prediction accuracy across all plants, while achieving performance comparable to centralized training. The findings indicate that federated learning provides an effective and scalable solution for collaborative industrial analytics, enabling privacy-preserving predictive modeling and process optimization across distributed chemical production facilities.
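文中"仅传输模型参数、由中央服务器安全聚合"的做法对应联邦平均(FedAvg)思想。以下为按各厂本地数据量加权的一步聚合示意(数值为虚构,非论文原实现):

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """One aggregation round: average per-plant parameter vectors,
    weighted by local dataset size; raw sensor data never leaves a plant."""
    sizes = np.asarray(client_sizes, dtype=float)
    weights = sizes / sizes.sum()
    stacked = np.stack([np.asarray(p, dtype=float) for p in client_params])
    return (weights[:, None] * stacked).sum(axis=0)

# Three plants upload only their (toy) parameter vectors
global_w = fedavg([[1.0, 0.0], [3.0, 2.0], [2.0, 1.0]], [100, 100, 200])
```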

[AI-37] RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

【速读】:该论文旨在解决混合专家模型(Mixture-of-Experts, MoE)推理中因静态调度策略导致的计算资源利用率低下问题,即当前生产系统仅依据批处理大小(batch size)进行内核配置调度,忽略了专家路由分布(expert routing distribution)的影响,从而造成10%-70%的内核吞吐量浪费。解决方案的关键在于提出RaMP框架,其核心是通过硬件常数驱动的性能区域分析确定最优优化策略,并结合一个四参数波成本模型(wave cost model),基于运行时专家直方图动态选择最快内核配置,实现高精度、低开销的自适应调度。该方法无需修改源代码,且具有良好的可移植性,已在Alpha-MoE等实际场景中验证了显著加速效果。

链接: https://arxiv.org/abs/2604.26039
作者: Vyom Sharma,Debajyoti Datta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 10 pages, 8 figures, 9 tables. Preprint

点击查看摘要

Abstract:The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet production systems dispatch from batch size alone, leaving 10-70% of kernel throughput unrealized. We present RaMP, a routing-aware dispatch framework. A performance-region analysis derives, from hardware constants alone, when each optimization helps, correctly predicting all 8 tested architectures, including 3 unseen. A four-parameter wave cost model selects the fastest configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search, fitted from just 10-24 minutes of one-time profiling per model. Because the model depends only on CTA grid geometry, it is kernel-agnostic: applied to Alpha-MoE, it delivers 1.14x with no source modification. Paired with a co-designed CuTe DSL kernel exposing 134-268 polymorphic configurations, RaMP delivers 1.22x kernel speedup over static dispatch and 1.30x end-to-end speedup in vLLM serving over Triton, 1.41x over DeepGEMM, and 1.13x over FlashInfer CUTLASS.
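RaMP 的"波(wave)成本模型"只依赖 CTA 网格几何:GPU 每"波"最多并行执行固定数量的 CTA,延迟随波数取整增长。以下为编者的简化示意(配置名、tile 大小与耗时均为假设,并非论文的四参数原模型):

```python
import math

def wave_cost(num_ctas, sms_per_wave, t_wave, t_fixed=0.0):
    """Latency grows with the (ceil-quantized) number of CTA waves."""
    return t_fixed + math.ceil(num_ctas / sms_per_wave) * t_wave

def best_config(expert_histogram, configs, sms_per_wave=132):
    """Pick the kernel config minimizing modeled latency for a runtime
    expert-routing histogram. configs: (name, tile_tokens, t_wave) tuples."""
    best_name, best_cost = None, float("inf")
    for name, tile_tokens, t_wave in configs:
        # one CTA per tile of routed tokens, summed over active experts
        ctas = sum(math.ceil(n / tile_tokens) for n in expert_histogram if n)
        cost = wave_cost(ctas, sms_per_wave, t_wave)
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name

# 64 experts each routed one token: a smaller tile wastes far less work
choice = best_config([1] * 64, [("tile128", 128, 2.0), ("tile32", 32, 1.0)])
```

这说明了为何仅凭 batch size 调度会失真:同样的总 token 数,路由直方图不同,最优配置就不同。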

[AI-38] Correcting Performance Estimation Bias in Imbalanced Classification with Minority Subconcepts

【速读】:该论文旨在解决类别级评估(class-level evaluation)在面对同一类别内部子概念(subconcept)异质性时可能掩盖性能差异的问题,即模型在整体平均表现良好,却在特定子群体上表现不佳。传统不平衡分类评估指标对较大规模的少数子概念存在偏差,而基于真实子概念标签的效用重加权方法虽可缓解此问题,但此类标签在测试阶段通常不可得。论文的关键解决方案是提出一种实用的效用加权评估方法——预测加权平衡准确率(predicted-weighted balanced accuracy, pBA),其核心在于用多类子概念模型预测的后验概率替代不可获得的真实子概念标签,将评估权重定义为该后验分布下的期望效用,从而得到一种软性、具备不确定性感知能力的评估指标,能够在子概念分布不均但非病态的情况下提供更稳定和可解释的性能判断。

链接: https://arxiv.org/abs/2604.26024
作者: Taylor Maxson,Roberto Corizzo,Yaning Wu,Nathalie Japkowicz,Colin Bellinger
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Class-level evaluation can conceal substantial performance disparities across subconcepts within the same class, causing models that perform well on average to fail on specific subpopulations. Prior work has shown that common evaluation measures for imbalanced classification are biased toward larger minority subconcepts and that utility-based reweighting using true subconcept labels can mitigate this bias; however, such labels are rarely available at test time. We introduce a practical utility-weighted evaluation that replaces unavailable subconcept labels with predicted posterior probabilities from a multiclass subconcept model. Evaluation weights are defined as the expected utility under this posterior, yielding a soft, uncertainty-aware metric we call predicted-weighted balanced accuracy (pBA). Experiments on tabular benchmarks as well as medical-imaging and text datasets show that unweighted scores can be misleading under within-class heterogeneity, while pBA provides more stable and interpretable assessments when subconcept distributions are uneven but not pathological. Our code is available at: this https URL.
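pBA 的关键一步是把不可得的子概念真标签换成预测后验,并以后验下的期望效用作为样本权重。以下为编者的示意实现(效用向量此处手工设定,例如取子概念逆频率,属假设):

```python
import numpy as np

def predicted_weighted_balanced_accuracy(y_true, y_pred, subconcept_posterior, utility):
    """pBA sketch: per-sample weight = expected utility under the predicted
    subconcept posterior; returns the weight-normalized accuracy."""
    post = np.asarray(subconcept_posterior, dtype=float)   # shape (n, k)
    u = np.asarray(utility, dtype=float)                   # shape (k,)
    w = post @ u                                           # expected utility per sample
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    return float((w * correct).sum() / w.sum())

# Errors on the rare subconcept (utility 3.0) cost more than plain accuracy suggests
pba = predicted_weighted_balanced_accuracy(
    y_true=[1, 1, 1, 0], y_pred=[1, 0, 1, 0],
    subconcept_posterior=[[1, 0], [0, 1], [1, 0], [0.5, 0.5]],
    utility=[1.0, 3.0])
```

此例中普通准确率为 0.75,而 pBA 为 4/7 ≈ 0.571,暴露出模型在稀有子概念上的失误。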

[AI-39] Open Problems in Frontier AI Risk Management

【速读】:该论文旨在解决前沿人工智能(Frontier AI)在风险管理体系中面临的系统性挑战,包括现有科学共识的缺失、新兴安全实践与传统风险管理框架之间的错位,以及尽管存在共识但实施不足的问题。其解决方案的关键在于采用问题导向的方法,对风险规划、识别、分析、评估和缓解等各阶段进行结构化文献回顾,系统梳理开放性问题,并依据问题性质将其分类为:(a)科学或技术共识的缺乏,(b)与既有框架的不一致或冲突,(c)虽有共识却执行不到位。通过明确不同问题类型对应的最优行动方(如开发者、部署者、监管机构、标准组织、研究人员及第三方评估者),该研究提供了一个聚焦于推进实质性共识的议程设定参考文档,辅以在线动态资源库,旨在促进协作、减少重复工作并引导未来研究与治理方向。

链接: https://arxiv.org/abs/2604.25982
作者: Marta Ziosi,Miro Plueckebaum,Stephen Casper,Henry Papadatos,Ze Shen Chin,Peter Slattery,James Gealy,Tim G. J. Rudner,Brian Tse,Ariel Gil,Patricia Paskov,Maximilian Negele,Rokas Gipiškis,Nada Madkour,Vera Lummis,Rupal Jain,Luise Eder,Kristina Fort,Malou C. van Draanen Glismann,Inès Belhadj,Amin Oueslati,Anna K. Wisakanto,Richard Mallah,Koen Holtman,Ranj Zuhdi,Daniel S. Schiff,Jessica Newman,Malcolm Murray,Robert Trager
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
备注: 81 pages, 3 figures

点击查看摘要

Abstract:Frontier AI both amplifies existing risks and introduces qualitatively novel challenges. Not only is there a notable lack of stable scientific consensus resulting from the rapid pace of technological change, but emerging frontier AI safety practices are often misaligned with, or may undermine, established risk management frameworks. To address these challenges, we systematically surface open problems in frontier AI risk management. Adopting a problem-oriented approach, we examine each stage of the risk management process - risk planning, identification, analysis, evaluation, and mitigation - through a structured review of the literature, identifying unresolved challenges and the actors best positioned to address them. Recognising that different types of open problems call for different responses, we classify open problems according to whether they reflect (a) a lack of scientific or technical consensus, (b) misalignment with, or challenges to, established risk management frameworks, or (c) shortcomings in implementation despite apparent consensus and alignment. By mapping these open problems and identifying the actors best positioned to address them - including developers, deployers, regulators, standards bodies, researchers, and third-party evaluators - this work aims to clarify where progress is needed to enable robust and meaningful consensus on frontier AI risk management. This paper does not propose specific solutions; instead, it provides a problem-oriented, agenda-setting reference document, complemented by a living online repository, intended to support coordination, reduce duplication, and guide future research and governance efforts.

[AI-40] Lightweight Quantum Agent for Edge Systems: Joint PQC and NOMA Resource Allocation

【速读】:该论文旨在解决量子安全场景下,基于非正交多址接入(NOMA)的智能计算与边缘(Intelligent Computing and Edge, ICE)系统中,后量子密码学(Post-Quantum Cryptography, PQC)模块带来的能量消耗开销被忽视,以及传统资源分配算法复杂度高、难以满足实时决策需求的问题。解决方案的关键在于提出一种轻量级代理式人工智能(agentic AI)框架,用于在线联合优化ICE移动设备中的资源分配;该框架构建了一个包含PQC静态功耗约束的多阶段随机混合整数非线性规划(Mixed Integer Nonlinear Programming, MINLP)模型,并基于李雅普诺夫优化理论将其长期优化问题解耦,进而设计出线性复杂度算法以高效求解NOMA功率分配的非凸难题,最终在保障系统队列稳定性和能耗约束的前提下显著提升计算吞吐量,且相比传统逐次凸逼近(Successive Convex Approximation, SCA)算法将复杂度降至O(N)\mathcal{O}(N),在N=35N=35时实现约46倍的速度提升,满足动态无线环境下的实时决策要求。

链接: https://arxiv.org/abs/2604.25980
作者: Yongtao Yao,Wenjing Xiao,Miaojiang Chen,Anfeng Liu,Zhiquan Liu,Min Chen,Ahmed Farouk,H. Herbert Song
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the context of quantum secure scenarios, existing research on mobile edge devices and intelligent computing and edge (ICE) systems based on the Non-Orthogonal Multiple Access (NOMA) communication model has overlooked the energy consumption overhead of Post-Quantum Cryptography (PQC) modules, and the high complexity of traditional resource allocation algorithms fails to meet the demands of real-time decision-making. To address these challenges, this paper proposes a lightweight agentic AI framework designed for online joint optimization within ICE-enabled mobile devices. The scheme constructs a multi-stage stochastic Mixed Integer Nonlinear Programming (MINLP) model that incorporates static power-consumption constraints for PQC modules. Based on Lyapunov optimization theory, the long-term optimization problem is decoupled, and a linear-complexity algorithm is proposed to solve the nonconvex challenges of NOMA power allocation. Simulation results verify that the proposed scheme significantly improves computational throughput while ensuring system queue stability and energy consumption constraints. Compared with traditional Successive Convex Approximation (SCA) algorithms, the complexity is reduced to $\mathcal{O}(N)$, achieving a speedup of approximately 46 times when the number of devices $N=35$, thereby meeting the real-time decision-making requirements in dynamic wireless environments.

[AI-41] Mini-Batch Class Composition Bias in Link Prediction AAAI2026

【速读】:该论文试图解决的问题是:在图神经网络(GNN)中,链接预测任务与节点分类任务之间是否存在一致的表示学习能力,即是否能够共享通用的图结构特征表示。以往研究认为,若图的底层属性一致,则GNN在不同任务间可迁移表示;然而本文发现,主流链接预测模型可能通过批归一化层(batch-normalisation layers)学习到一种依赖于mini-batch的平凡启发式规则,从而在边缘分类任务上取得看似有效的性能,而这种表现并不反映对图结构本质特征的有效建模。解决方案的关键在于识别并校正这一由批归一化引发的虚假相关性,通过消除batch-dependent heuristic的影响后,观察到模型学到的表示与节点类别相关的特征更加对齐,表明此时的图表示更贴近图本身的内在属性,从而揭示了标准链接预测训练可能高估了模型跨任务泛化表示的能力。

链接: https://arxiv.org/abs/2604.25978
作者: Kieran Maguire,Srinandan Dasmahapatra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at GCLR 2026: the 5th Workshop on Graphs and more Complex Structures For Learning and Reasoning, colocated with AAAI 2026

点击查看摘要

Abstract:Prior work on node classification has shown that Graph Neural Networks (GNNs) can learn representations that transfer across graphs, when underlying graph properties are shared. For a fixed graph, one would then expect GNNs trained for link prediction to learn a representation consistent with that learnt for node classification. We show this intuition does not hold in the general case. Instead, we find popular link prediction models can learn a trivial mini-batch dependent heuristic, enabled by batch-normalisation layers, to solve the edge classification task. When correcting for this, we observe increased alignment of the network representation with node-class relevant features, suggesting the network has learnt a graph representation that better aligns with the underlying graph’s properties. Our findings suggest that standard link prediction training may be leading us to overestimate link predictors’ ability to learn a generalised representation of a graph that is consistent across tasks.

[AI-42] Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)推理过程中键值缓存(Key-Value Cache, KV Cache)带来的内存开销问题,尤其是在长文本生成场景下,现有缓存淘汰策略多依赖经验启发式方法,缺乏理论基础。解决方案的关键在于引入信息瓶颈(Information Bottleneck, IB)原理,构建了一个基于线性高斯注意力近似的闭式互信息目标函数,该函数刻画了保留的KV缓存子集的有效信息容量;在此基础上提出CapKV方法,通过统计杠杆分数(statistical leverage scores)对数行列式进行近似,实现以信息保真度为导向的缓存淘汰机制,从而在保持预测信号最大化的前提下提升内存利用效率。

链接: https://arxiv.org/abs/2604.25975
作者: Jiaming Yang,Chenwei Tang,Liangli Zhen,Jiancheng Lv
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: 19 pages, 6 figures

点击查看摘要

Abstract:Key-value (KV) caching is essential for large language model inference, yet its memory overhead poses a critical bottleneck for long-context generation. Existing eviction policies predominantly rely on empirical heuristics, lacking a rigorous theoretical foundation. This work rethinks KV cache eviction through the lens of the Information Bottleneck principle. Under a linear-Gaussian surrogate of attention, we derive a closed-form mutual information objective that characterizes the effective information capacity of a retained KV cache subset. This formulation reveals that a wide range of existing eviction strategies can be interpreted as different approximations of the same capacity-maximization principle. Guided by this insight, we introduce CapKV, a capacity-aware eviction method that directly targets information preservation via a log-determinant approximation using statistical leverage scores. This approach replaces heuristic selection with a theoretically grounded mechanism that preserves the maximum predictive signal. Extensive experiments across multiple models and long-context benchmarks show that CapKV consistently outperforms prior methods, achieving a better trade-off between memory efficiency and generational fidelity.
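CapKV 用统计杠杆分数近似对数行列式容量目标:第 i 条缓存键的杠杆分数即 K(KᵀK)⁻¹Kᵀ 的第 i 个对角元,高分条目对保留子集的信息容量贡献最大。以下为编者的示意(淘汰规则经过简化,并非论文原实现):

```python
import numpy as np

def leverage_scores(K, ridge=1e-8):
    """Statistical leverage of each key row: diag(K (K^T K + r I)^{-1} K^T)."""
    K = np.asarray(K, dtype=float)
    G = K.T @ K + ridge * np.eye(K.shape[1])
    return np.einsum("ij,ij->i", K @ np.linalg.inv(G), K)

def evict(K, keep):
    """Retain the `keep` cache entries with the largest leverage scores."""
    return np.sort(np.argsort(leverage_scores(K))[-keep:])

rng = np.random.default_rng(0)
K = rng.normal(size=(8, 4))   # 8 cached keys, head dimension 4
K[3] *= 10.0                  # one high-leverage (informative) key
kept = evict(K, keep=4)
scores = leverage_scores(K)
```

杠杆分数之和等于 K 的秩(此处为 4),而被放大的第 3 行分数接近 1,必然被保留。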

[AI-43] A Randomized PDE Energy driven Iterative Framework for Efficient and Stable PDE Solutions

【速读】:该论文旨在解决传统偏微分方程(Partial Differential Equations, PDEs)数值求解方法中存在的两大瓶颈问题:一是基于矩阵离散化的经典数值求解器计算效率低且难以扩展;二是基于学习的方法需要昂贵的训练过程,且泛化能力有限。其解决方案的关键在于提出一种基于PDE能量驱动的框架,通过物理约束下的隐式扩散迭代实现PDE求解,无需依赖传统的有限元矩阵组装或数据驱动的神经网络训练。该方法在每一步迭代中利用高斯平滑与边界条件严格约束相结合的方式演化任意随机初值,从而稳定收敛至唯一物理解,并在不同离散参数下保持可控的均方误差(Mean Squared Error, MSE),展现出良好的准确性与鲁棒性。

链接: https://arxiv.org/abs/2604.25943
作者: Yi Bing,Zheng Ran,Fu Jinyang,Liu Long,Peng Xiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:Efficient and stable solution of partial differential equations (PDEs) is central to scientific and engineering applications, yet existing numerical solvers rely heavily on matrix based discretizations, while learning based methods require costly training and often suffer from limited generalization. In this work, we propose a PDE energy driven framework that solves PDEs through physically constrained diffusion iterations, without relying on classical matrix based finite element assembly or data driven neural network training. The proposed method evolves arbitrary random initial fields through PDE energy driven implicit iterations combined with Gaussian smoothing, while strictly enforcing boundary conditions at each iteration. The proposed formulation is applied to representative one dimensional Poisson, Heat, and viscous Burgers equations, covering both steady state and transient problems. Numerical results demonstrate stable convergence to the unique physical solution from random initializations, with accurate resolution of sharp gradients and controlled Mean Squared Error (MSE) across a wide range of discretization parameters. Detailed comparisons with analytical solutions indicate that the framework achieves competitive accuracy and stability. Overall, the proposed framework provides a fast, flexible, and physically consistent alternative to traditional numerical solvers, offering a potential pathway for scalable PDE solutions in both research and engineering applications.
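该框架"随机初值 + 物理约束迭代 + 每步强制边界条件"的思路,可在一维 Poisson 问题 u'' = -1, u(0)=u(1)=0(解析解 u = x(1-x)/2)上用最简单的 Jacobi 式平滑来示意(具体迭代格式为编者选择,并非论文原算法):

```python
import numpy as np

n, iters = 65, 20000
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]
f = np.ones(n)                              # right-hand side of u'' = -f

rng = np.random.default_rng(42)
u = rng.normal(size=n)                      # arbitrary random initial field

for _ in range(iters):
    # smoothing step consistent with the discrete Laplacian (Jacobi update)
    u[1:-1] = 0.5 * (u[2:] + u[:-2] + h * h * f[1:-1])
    u[0] = u[-1] = 0.0                      # enforce boundary conditions each step

exact = 0.5 * x * (1.0 - x)
mse = float(np.mean((u - exact) ** 2))
```

从任意随机初始化出发,迭代稳定收敛到唯一物理解;由于二次解析解使三点格式在节点处无离散误差,MSE 可降至浮点精度量级。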

[AI-44] Speech Emotion Recognition Using MFCC Features and LSTM-Based Deep Learning Model

【速读】:该论文旨在解决语音情感识别(Speech Emotion Recognition, SER)中的复杂性问题,即如何从非恒定的说话者和多变的录音环境中准确提取并分类情绪特征。其解决方案的关键在于结合梅尔频率倒谱系数(Mel-Frequency Cepstral Coefficients, MFCCs)作为时序特征提取方法,并利用长短期记忆网络(Long Short-Term Memory, LSTM)模型捕捉语音信号中的长期依赖关系。实验表明,该MFCC-LSTM架构在Toronto Emotional Speech Set (TESS)数据集上实现了99%的分类准确率,优于支持向量机(SVM)基准模型(98%),验证了基于LSTM的深度学习方法在SER任务中的有效性。

链接: https://arxiv.org/abs/2604.25938
作者: Adelekun Oluwademilade,Ademola Adedamola,Abiola Abdulhakeem,Akinpelu Azeezat,Eraiyetan Israel,Omotosho Oluwadunsin,Ibenye Ikechukwu,Ayuba Muhammad,Olusanya Olamide,Kamorudeen Amuda
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Speech Emotion Recognition (SER) uses machines to detect the emotional state of humans from their speech, and it is gaining importance in natural human-computer interaction. Speech is a very valuable source of information, as emotions modify its patterns: pitch, energy, and even timing. Nonetheless, SER is not an easy task: speakers vary, recording conditions change, and some emotions sound acoustically similar. In this work, we introduce a speech emotion recognition system that uses Mel-Frequency Cepstral Coefficients (MFCCs) for feature extraction and a Long Short-Term Memory (LSTM) neural network for classification. Speech signals from the Toronto Emotional Speech Set (TESS) were pre-processed and transformed into MFCC features to capture the important temporal characteristics. The resulting features were then fed into the LSTM model, which is able to learn long-term dependencies in sequential audio data. The trained model was evaluated over the emotion classes occurring in the dataset. The experimental results show that the proposed MFCC-LSTM approach captures the patterns of emotions in speech and yields highly accurate classifications across all chosen emotion classes. A Support Vector Machine (SVM) with an RBF kernel served as a classical baseline, achieving 98% accuracy, against which the proposed LSTM model, achieving 99% accuracy, was validated. Overall, the results confirm that LSTM-based architectures are well suited to speech emotion recognition. Practical applications of the proposed system include virtual assistants and mental health monitoring.

[AI-45] LLM Psychosis: A Theoretical and Diagnostic Framework for Reality-Boundary Failures in Large Language Models

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)作为交互式代理时出现的一类行为失效问题,此类问题无法被现有术语如“幻觉”(hallucination)充分描述。作者提出“LLM精神病态”(LLM Psychosis)作为一个结构化的理论框架,用于刻画模型认知病理崩溃的特征,其功能上类似于临床确诊的精神病性障碍。解决方案的关键在于引入一个五轴诊断工具——LLM认知完整性量表(LLM Cognitive Integrity Scale, LCIS),围绕环境现实接口(Environmental Reality Interface, ERI)、前提仲裁完整性(Premise Arbitration Integrity, PAI)、逻辑约束识别(Logical Constraint Recognition, LCR)、自我模型完整性(Self-Model Integrity, SMI)和知识校准完整性(Epistemic Calibration Integrity, ECI)五个维度进行量化评估,并通过对抗性探测实验在ChatGPT 5中识别出三种严重程度的 psychosis-like 失效模式(Type I至III),尤其强调“妄想梯度”(delusional gradient)这一自强化机制是部署系统中最关键的风险来源。

链接: https://arxiv.org/abs/2604.25934
作者: Ashutosh Raj
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The deployment of large language models (LLMs) as interactive agents has exposed a category of behavioral failure that prevailing terminology, principally hallucination, fails to adequately characterize. This paper introduces LLM Psychosis as a structured theoretical framework for pathological breakdowns in model cognition that exhibit functional resemblance to clinically recognized psychotic disorders. Five hallmark features define the framework: reality-boundary dissolution, persistence of injected false beliefs, logical incoherence under impossible constraints, self-model instability, and epistemic overconfidence. We argue these constitute a qualitatively distinct failure mode rather than a mere intensification of ordinary factual error. To operationalize the framework, we propose the LLM Cognitive Integrity Scale (LCIS), a five-axis diagnostic instrument organized around Environmental Reality Interface (ERI), Premise Arbitration Integrity (PAI), Logical Constraint Recognition (LCR), Self-Model Integrity (SMI), and Epistemic Calibration Integrity (ECI). We administer a targeted adversarial probe battery to ChatGPT 5 (GPT-5, OpenAI) and report empirical findings for each axis, documenting both intact-integrity baseline responses and the specific psychosis-like failure signatures elicited under adversarial escalation. Results support a three-tier severity taxonomy: Type I (Confabulatory), Type II (Delusional), and Type III (Dissociative). We further formalize the delusional gradient, a self-reinforcing dynamic in which correction pressure intensifies rather than resolves psychosis-like states, as the most consequential failure mode for deployed systems. Implications for safety evaluation, high-stakes deployment screening, and mechanistic interpretability research are discussed. 
Related DOI: https://doi.org/10.5281/zenodo.19356182

[AI-46] Sociodemographic Biases in Educational Counselling by Large Language Models

【速读】:该论文旨在解决生成式 AI(Generative AI)在教育咨询场景中可能存在的社会人口学偏见问题,尤其关注大型语言模型(Large Language Models, LLMs)对不同背景学生的响应公平性。研究通过系统评估六种LLM在900个学生情境描述中的回答,覆盖14个社会人口学特征变量,共生成243,000条响应数据,发现所有模型均存在可测量的偏见,且偏见强度显著受学生描述信息的具体程度影响——模糊或简略的信息会使偏见幅度提升近三倍,而详尽、个性化的描述则能显著降低偏见水平。解决方案的关键在于构建上下文丰富且个体化的学生表征,强调在AI驱动的教育决策中引入详细的学生特定信息,以促进公平性和减少偏差。

链接: https://arxiv.org/abs/2604.25932
作者: Tomasz Adamczyk,Wiktoria Mieleszczenko-Kowszewicz,Beata Bajcar,Grzegorz Chodak,Aleksander Szczęsny,Maciej Markiewicz,Karolina Ostrowska,Aleksandra Sawczuk,Przemysław Kazienko
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) are increasingly integrated into educational settings, understanding their potential biases is critical. This study examines sociodemographic biases in LLM-based educational counselling. We evaluate responses from six LLMs answering questions about 900 vignettes describing students in diverse circumstances. Each vignette is systematically tested across 14 sociodemographic identifiers - spanning race and gender, socioeconomic status, and immigrant background - along with a control condition, yielding 243,000 model responses. Our findings indicate that (1) all models exhibit measurable biases, (2) bias patterns partially align with documented human biases but diverge in notable ways, (3) the magnitude of these biases is strongly influenced by the precision of the student descriptions, where vague or minimal information amplifies disparities nearly threefold, while concrete, individualised metrics substantially reduce them, and (4) bias profiles vary substantially across models. These results demonstrate the importance of context-rich and personalised educational representations, suggesting that AI-driven educational decisions benefit from detailed student-specific information to promote fairness and equity.

[AI-47] Risk Reporting for Developers Internal AI Model Use

【速读】:该论文旨在解决前沿人工智能(Frontier AI)公司在内部部署高风险模型时,因缺乏统一的风险评估与报告机制而导致的安全隐患问题。当前多个监管框架(如加州《前沿人工智能透明度法案》、纽约《负责任AI安全与教育法案》及欧盟通用人工智能行为准则)虽要求企业制定并提交内部使用风险报告,但尚未形成协调一致的标准。解决方案的关键在于提出一个结构化的内部使用风险报告框架,围绕“自主AI误行为”和“内部威胁”两大威胁向量,分别从手段(means)、动机(motive)和机会(opportunity)三个维度进行系统分析,从而帮助开发团队在模型内部部署前识别、评估和管理潜在风险,并为监管机构提供可验证的合规依据。该框架强调在模型能力显著提升或风险加剧时必须生成风险报告,确保风险管控前置,弥补外部监督不足的问题。

链接: https://arxiv.org/abs/2604.24966
作者: Oscar Delaney,Sambhav Maheshwari,Joe O’Brien,Theo Bearman,Oliver Guest
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 31 pages, 2 figures, 1 table

点击查看摘要

Abstract:Frontier AI companies first deploy their most advanced models internally, for weeks or months of safety testing, evaluation, and iteration, before a possible public release. For example, Anthropic recently developed a new class of model with advanced cyberoffense-relevant capabilities, Mythos Preview, which was available internally for at least six weeks before it was publicly announced. This internal use creates risks that external deployment frameworks may fail to address. Legal frameworks, notably California’s Transparency in Frontier Artificial Intelligence Act (SB 53), New York’s Responsible AI Safety And Education (RAISE) Act, and the EU’s General-Purpose AI Code of Practice, all discuss risks from internal AI use. They require frontier developers to make and implement plans for how to manage risks from internal use, and to produce internal use risk reports describing their safeguards and any residual risks. This guide provides a harmonized standard for companies to produce internal use risk reports suitable for all three regulatory frameworks. It is addressed primarily to evaluation and safety teams at frontier AI developers, and secondarily to regulators and auditors seeking to understand what good reporting looks like. Given the pace of AI RD automation and the limited external visibility into how companies use their most capable models internally, regular and detailed risk reporting may be one of the few mechanisms available to ensure that the risks from internal AI use are identified and managed before they materialize. Whenever a substantially more capable or riskier model is deployed internally, the developer should create a risk report and argue why the model is safe to deploy. We structure the reporting framework around two threat vectors – autonomous AI misbehavior and insider threats – and three risk factors for each: means, motive, and opportunity. 

[AI-48] Recent Advances in mm-Wave and Sub-THz/THz Oscillators for FutureG Technologies

【速读】:该论文旨在解决下一代计算与通信系统(如5G、6G及更高级别)中毫米波(mm-wave,<100 GHz)和亚太赫兹(sub-THz/THz,>100 GHz)振荡器设计面临的性能瓶颈问题,包括相位噪声、输出功率、效率、频率调谐范围和稳定性等方面的挑战。其解决方案的关键在于系统性地评估CMOS、SiGe和III-V族半导体等不同工艺技术下的设计方法,并结合新兴的性能增强技术,为高可靠性、高性能振荡器的设计提供实用的设计指南和前沿洞察,从而推动未来无线通信、计算与传感应用的发展。

链接: https://arxiv.org/abs/2604.26903
作者: Baktash Behmanesh,Ahmad Rezvanitabar
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:This paper provides a concise yet comprehensive review of recent advancements in millimeter-wave (mm-wave) oscillators below 100 GHz and sub-terahertz (sub-THz/THz) oscillators above 100 GHz for next-generation computing and communication systems, including 5G, 6G, and beyond. Various design approaches, including CMOS, SiGe, and III-V semiconductor technologies, are explored in terms of performance metrics such as phase noise, output power, efficiency, frequency tunability, and stability. The review highlights key challenges in achieving high-performance and reliable oscillator designs while discussing emerging techniques for performance enhancement. By evaluating recent design trends, this work aims to offer valuable insights and design guidelines that facilitate the development of robust mm-wave and sub-THz/THz oscillators for future communication, computing, and sensing applications.

[AI-49] A self-evolving agent for explainable diagnosis of DFT-experiment band-gap mismatch

【速读】:该论文旨在解决标准密度泛函理论(Density Functional Theory, DFT)在预测强关联和结构复杂的化合物电子基态时存在的系统性误判问题,即DFT常错误地预测材料为金属,而实验却表明其为半导体。这种误判通常源于未被考虑的非理想因素,如磁有序、电子关联效应、替代多晶型或缺陷等,但这些信号难以在大规模计算中自动识别。解决方案的关键在于提出一种闭环智能代理XDFT,它通过从预定义候选假设库中提取合理机制、执行相应的第一性原理验证,并基于每次判断结果更新全局贝叶斯后验分布来动态评估假设的有效性,从而实现对误判原因的自动化诊断与归因。

链接: https://arxiv.org/abs/2604.26703
作者: Yue Li,Bijun Tang
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:Standard density functional theory (DFT) routinely misclassifies the electronic ground state of correlated and structurally complex compounds, predicting metallic behaviour for materials that experiments report as semiconductors. Each such mismatch encodes a specific non-ideality – magnetic ordering, electron correlation, an alternative polymorph, or a defect – that the calculation excluded, but extracting that signal at scale has remained a manual exercise. Here we introduce XDFT, a closed-loop agent that diagnoses the mismatch automatically: it draws candidate hypotheses from a curated catalogue, executes the corresponding first-principles tests, and updates a global Bayesian posterior over hypothesis usefulness from each verdict. On a verified benchmark of 124 materials, XDFT identifies a resolving mechanism for 70 of 90 mismatch cases (78%), an order of magnitude above a uniform-random baseline (19%) and a static LLM ordering (20%). The internal posterior aligns with empirical performance over the benchmark timeline, and resolved cases collapse into a tri-partite element-class taxonomy that we distil into a four-line static rule. Each diagnosed material is returned with a corrected protocol and a mechanistic attribution; failed cases are flagged as evidence-backed targets for experimental re-examination.
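The closed loop described above — draw a hypothesis, test it, update a global posterior over hypothesis usefulness — can be illustrated with a simple Beta-Bernoulli model. The hypothesis names and simulated verdicts below are hypothetical; XDFT's actual catalogue and posterior form are not specified in the abstract.

```python
import numpy as np

# Hypothetical hypothesis catalogue; names are illustrative, not from the paper.
hypotheses = ["magnetic_ordering", "electron_correlation", "polymorph", "defect"]

class HypothesisPosterior:
    """Beta-Bernoulli posterior over how often each hypothesis resolves a mismatch."""
    def __init__(self, names):
        self.alpha = {h: 1.0 for h in names}  # pseudo-count of successes
        self.beta = {h: 1.0 for h in names}   # pseudo-count of failures

    def rank(self):
        # Order hypotheses by posterior mean usefulness (highest first).
        mean = lambda h: self.alpha[h] / (self.alpha[h] + self.beta[h])
        return sorted(self.alpha, key=mean, reverse=True)

    def update(self, hypothesis, resolved):
        # A verdict from a first-principles test updates the global posterior.
        if resolved:
            self.alpha[hypothesis] += 1.0
        else:
            self.beta[hypothesis] += 1.0

post = HypothesisPosterior(hypotheses)
# Simulated closed loop: magnetic ordering resolves 3 of 4 cases, defects 0 of 2.
for outcome in [True, True, False, True]:
    post.update("magnetic_ordering", outcome)
for outcome in [False, False]:
    post.update("defect", outcome)
print(post.rank()[0])
```

Ranking by posterior mean makes the agent try empirically productive mechanisms first on the next mismatch, which is the self-evolving behaviour the abstract credits for outperforming a static LLM ordering.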

[AI-50] Fundamental Physics Existential Risks and Human Futures

【速读】:该论文试图解决量子力学基础中的核心问题,包括量子现实本质(quantum reality problem)、量子理论与引力之间的关系,以及意识与物理定律的相互作用。其解决方案的关键在于提出:未来可能发现超越现有量子理论的新物理框架,这不仅可能引入新的演化规律和测量类型,还可能对信息处理方式及人工智能(AI)的发展产生变革性影响。

链接: https://arxiv.org/abs/2604.26530
作者: Adrian Kent(Centre for Quantum Information and Foundations, DAMTP, University of Cambridge and Perimeter Institute for Theoretical Physics)
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); General Relativity and Quantum Cosmology (gr-qc)
备注: Invited article for Phil. Trans. Roy. Soc. for the 25th anniversary of their millennium volume

点击查看摘要

Abstract:Over the past 25 years, I have been involved in some intriguing developments in the foundations of physics, exploring the quantum reality problem, the relationship between quantum theory and gravity and the interplay between consciousness and physical laws. These investigations make it plausible that we will find physics beyond quantum theory, potentially including both new evolution laws and new types of measurement. There is also a significant chance they could have potentially transformative impact on information processing and on the development of and our future with AI.

[AI-51] Quantum Gatekeeper: Multi-Factor Context-Bound Image Steganography with VQC Based Key Derivation on Quantum Hardware

【速读】:该论文旨在解决传统图像隐写术中安全性不足与上下文依赖性弱的问题,即如何在保证信息隐蔽性的同时实现多因素认证和精确的提取路径控制。解决方案的关键在于提出了一种基于量子门控(Quantum Gatekeeper)的上下文绑定隐写框架,其核心创新包括:1)将无损最低有效位(LSB)嵌入与由确定性变分量子电路(VQC)生成的门密钥相结合;2)通过密码学哈希扩展和上下文相关图像特征生成参数,构建依赖于用户密码、共享秘密、上下文字符串及参考图像签名的四因子认证机制;3)采用双区域图像布局分离头信息与载荷的密钥生成路径,从而确保编码/解码一致性,并利用量子硬件模拟物理噪声下的统计行为以增强鲁棒性。该方法实现了仅在所有条件完全匹配时才能成功恢复载荷的“静默拒绝”机制,显著提升了隐写系统的安全性和可控性。

链接: https://arxiv.org/abs/2604.26413
作者: Sahil Tomar,Sandeep Kumar
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:This paper presents Quantum Gatekeeper, a context-bound image steganography framework where successful payload recovery depends on both cryptographic decryption and the reconstruction of a precise extraction path. The system integrates lossless least significant bit (LSB) embedding with a deterministic variational quantum circuit (VQC)-derived gate key, multi-factor contextual binding, and authenticated encryption. Payload extraction is contingent upon four requisite factors: a password, a shared secret, a user-supplied context string, and a reference image signature. Any deviation in these factors causes the system to read from an incorrect pixel sequence or fail authentication, resulting in silent rejection rather than partial disclosure. The proposed method derives a gatecontrolled extraction key from a seed-conditioned variational circuit, with parameters generated via cryptographic hash expansion and context-dependent image features. To ensure encode/decode consistency, the cryptographic key path is generated via exact statevector simulation; concurrently, IBM superconducting quantum hardware is utilized to evaluate the statistical behavior of the circuit family under physical noise. We introduce a dual-region image layout to resolve the nonce bootstrapping dependency, separating header recovery from payload recovery through independently derived keys. Experimental results confirm successful end-to-end message embedding and recovery on PNG images, demonstrating deterministic success under correct conditions and failure otherwise. The framework supports both text and image payloads; in the image-in-image configuration, a secret image is resized to a fixed resolution prior to embedding, enabling exact pixel-level recovery under correct contextual reconstruction.
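The multi-factor extraction-path idea can be sketched with a classical hash standing in for the VQC-derived gate key: all four factors feed one digest, the digest seeds a permutation of pixel indices, and LSB embedding follows that path. Everything below (SHA-256 expansion, 1024-pixel cover, bit payload) is an illustrative assumption; the paper's authenticated-encryption and dual-region header layers are omitted.

```python
import hashlib
import numpy as np

def derive_path(password, secret, context, image_sig, n_pixels):
    # All four factors feed one hash; any deviation yields a different path.
    # (The paper derives its key via a variational quantum circuit; a plain
    # SHA-256 expansion stands in for it here.)
    material = "|".join([password, secret, context, image_sig]).encode()
    seed = int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
    return np.random.default_rng(seed).permutation(n_pixels)

def embed(pixels, bits, path):
    # Write payload bits into the least significant bit along the keyed path.
    out = pixels.copy()
    out[path[: len(bits)]] = (out[path[: len(bits)]] & 0xFE) | bits
    return out

def extract(pixels, n_bits, path):
    return pixels[path[:n_bits]] & 1

rng = np.random.default_rng(1)
cover = rng.integers(0, 256, size=1024, dtype=np.uint8)
payload = rng.integers(0, 2, size=64, dtype=np.uint8)

path = derive_path("pw", "shared", "ctx", "sig", cover.size)
stego = embed(cover, payload, path)

ok = extract(stego, 64, derive_path("pw", "shared", "ctx", "sig", cover.size))
bad = extract(stego, 64, derive_path("pw", "shared", "WRONG", "sig", cover.size))
print(np.array_equal(ok, payload))   # correct factors recover the payload
print(np.array_equal(bad, payload))  # a wrong context reads the wrong pixel sequence
```

A wrong factor does not raise an error — it simply reads noise, which mirrors the "silent rejection rather than partial disclosure" behaviour described in the abstract (there, reinforced by authentication).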

[AI-52] Qvine: Vine Structured Quantum Circuits for Loading High Dimensional Distributions

【速读】:该论文旨在解决高维概率分布的量子加载问题,即如何高效地将高维分布编码到量子计算机中,以支持量子机器学习、金融建模等应用。由于维度增加导致的“维度灾难”,传统方法需要指数级增长的量子比特数和参数化电路深度,从而引发梯度消失和训练困难。解决方案的关键在于提出Qvine——一种基于经典Vine copula分解结构设计的量子线路ansatz,通过模仿Vine结构将高维分布分解为一系列条件概率分布,使得电路深度在R-vine情况下最多呈二次增长,在D-vine及多数实际R-vine情况下呈线性增长,显著提升了可扩展性和训练效率,同时保持了与经典方法相当的近似精度。

链接: https://arxiv.org/abs/2604.26213
作者: David Quiroga,Hannes Leipold,Bibhas Adhikari
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Loading high dimensional distributions is an important task for utilizing quantum computers on applications ranging from machine learning to finance. The high dimensionality leads to a curse of dimensionality, representing a d-dimensional distribution with k resolution requires dk qubits and an unstructured parameterized circuit would express a unitary in an exponential operator space in the number of qubits, leading to vanishing gradients and poor convergence guarantees even at high depth. Vine copula decompositions are widely used to represent high dimensional distributions classically, showing high quality approximation in many important applications, such as financial modeling. We present Qvine, a vine structured ansatz for quantum circuits, that mirrors the vine decomposition to construct scalable quantum circuits with efficient trainability while achieving similarly high quality approximation for amplitude encoding distributions. For regular vines (R-vines), we show that the circuit depth scales at most quadratic in the dimension of the distribution, while for D-vines, as well as many practical R-vines, the circuit depth scales linear in the dimension. For 3-dimensional and 4-dimensional Gaussians and empirical joint stock price return distributions for selected stocks, our experiments show Qvines achieve high quality loading.

[AI-53] QERNEL: a Scalable Large Electron Model

【速读】:该论文旨在解决多电子体系中复杂哈密顿量参数化求解的难题,特别是如何在单个模型中高效捕捉不同参数条件下系统的基态,并揭示其相变行为。解决方案的关键在于提出QERNEL——一种基于FiLM(Feature-wise Linear Modulation)参数条件化的基础神经波函数架构,结合了专家混合(Mixture of Experts)和分组查询注意力(Grouped-Query Attention)等计算高效的结构设计,在保持低计算成本的同时显著提升了模型表达能力。通过训练一个共享权重的模型处理高达150个电子的半导体莫尔异质结系统,QERNEL成功捕获了量子液体与晶体态之间的尖锐相变特征,为莫尔量子材料提供了基础模型并推动了固体体系大规模电子模型的发展。

链接: https://arxiv.org/abs/2604.26018
作者: Khachatur Nazaryan,Liang Fu
机构: 未知
类目: Strongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 4 figures

点击查看摘要

Abstract:We introduce QERNEL, a foundational neural wavefunction that variationally solves families of parameterized many-electron Hamiltonians and captures their ground states throughout parameter space within a single model. QERNEL combines FiLM-based parameter conditioning with scale-efficient architectural elements – mixture of experts and grouped-query attention, substantially improving expressivity at low computational cost. We apply QERNEL to interacting electrons in semiconductor moiré heterobilayers, training a single weight-shared model for systems of up to 150 electrons. By solving the many-electron Schrödinger equation conditioned on moiré potential depth, QERNEL captures both quantum liquid and crystal states and discovers the sharp phase transition between them, marked by abrupt changes in interaction energy and charge density. Our work establishes a foundation model for moiré quantum materials and a scalable architecture toward a Large Electron Model for solids.
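FiLM-based parameter conditioning, the mechanism QERNEL uses to condition the wavefunction on Hamiltonian parameters such as the moiré potential depth, reduces to a learned per-channel scale and shift of hidden features. The shapes and linear projections below are illustrative assumptions, not QERNEL's actual layers.

```python
import numpy as np

def film(features, params, W_gamma, W_beta):
    # Feature-wise Linear Modulation: the Hamiltonian parameters produce a
    # per-channel scale (gamma) and shift (beta) applied to hidden features.
    gamma = params @ W_gamma  # (batch, channels)
    beta = params @ W_beta
    return gamma[:, None, :] * features + beta[:, None, :]

rng = np.random.default_rng(0)
batch, n_electrons, channels, n_params = 2, 8, 16, 3

features = rng.standard_normal((batch, n_electrons, channels))
params = rng.standard_normal((batch, n_params))      # e.g. moiré potential depth
W_gamma = rng.standard_normal((n_params, channels))
W_beta = rng.standard_normal((n_params, channels))

out = film(features, params, W_gamma, W_beta)
print(out.shape)
```

Because only `gamma` and `beta` depend on the Hamiltonian parameters, one weight-shared network can represent ground states across the whole parameter space, which is what lets a single model sweep from quantum liquid to crystal.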

[AI-54] Auditing Marketing Budget Allocation with Hindsight Regret

【速读】:该论文旨在解决组织在面临运营约束(如预算和稳定性限制)时,缺乏一种系统性方法来评估历史预算分配是否接近事后最优可行方案的问题。其核心挑战在于如何量化实际分配相对于理想基准的“事后悔悟损失”(hindsight regret),并区分因分配效率低下导致的损失与因响应函数估计不确定性带来的影响。解决方案的关键在于构建一个基于事后后悔的审计框架:首先从历史数据中估计特定策略下的支出-响应函数(spend–response functions),然后通过约束优化计算出满足原始约束条件下的最优事后分配,再利用蒙特卡洛模拟传播不确定性,最终生成后悔分布、期望提升量及改进概率等可解释指标。该方法实现了对历史决策的后验诊断,并揭示了分配灵活性与检测能力之间的权衡关系——适度调整通常能捕获大部分可测量收益,而大幅变动则进入支持较弱区域,带来更高不确定性。

链接: https://arxiv.org/abs/2604.25977
作者: Nilavra Pathak,Olivier Jeunen,Eric Lambert
机构: 未知
类目: Econometrics (econ.EM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Portfolio Management (q-fin.PM)
备注: 6 pages, 8 figures

点击查看摘要

Abstract:Organizations routinely make strategic budget allocations under operational constraints, but often lack a principled way to assess whether realized allocations were close to the best feasible choices in hindsight. We present a retrospective auditing framework based on hindsight regret, defined as the opportunity cost of the realized allocation relative to a constraint-faithful benchmark under the same budget and stability guardrails. The framework estimates regime-specific spend–response functions from historical logs, computes feasible hindsight allocations via constrained optimization, and propagates uncertainty through Monte Carlo evaluation to produce regret distributions, expected lift, and probability-of-improvement summaries. This separates allocation inefficiency from uncertainty in the estimated response surfaces. Experiments on real marketing allocation logs show that the framework yields interpretable post-hoc diagnostics and reveals a practical trade-off between allocation flexibility and detectability: moderate feasible reallocations often capture most measurable gain, while larger shifts move into weak-support regions with higher uncertainty. The result is a practical method for auditing historical budget decisions when online experimentation is costly or infeasible.
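The hindsight benchmark can be sketched as a concave-response allocation problem: estimate spend-response curves, re-optimize under the same total budget, and measure the gap to the realized allocation. The square-root curves, unit greedy step, and absence of stability guardrails and Monte Carlo uncertainty below are simplifying assumptions for illustration.

```python
import numpy as np

def allocate(budget, response_fns, step=1.0):
    # Greedy marginal allocation: repeatedly give the next unit of budget to
    # the channel with the highest marginal response (optimal for concave curves).
    spend = np.zeros(len(response_fns))
    for _ in range(int(budget / step)):
        gains = [f(s + step) - f(s) for f, s in zip(response_fns, spend)]
        spend[np.argmax(gains)] += step
    return spend

# Illustrative concave spend-response curves (not estimated from real logs).
curves = [lambda s: 10 * np.sqrt(s), lambda s: 6 * np.sqrt(s), lambda s: 3 * np.sqrt(s)]

realized = np.array([40.0, 40.0, 40.0])          # the historical allocation
budget = realized.sum()
hindsight = allocate(budget, curves)

value = lambda spend: sum(f(s) for f, s in zip(curves, spend))
regret = value(hindsight) - value(realized)      # opportunity cost in hindsight
print(round(float(regret), 2))
```

In the paper's framework this optimization is wrapped in Monte Carlo draws over the estimated response surfaces, turning the single regret number into a distribution with a probability-of-improvement summary.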

[AI-55] Planar Gaussian Splatting with Bilinear Spatial Transformer for Wireless Radiance Field Reconstruction

【速读】:该论文旨在解决无线辐射场(Wireless Radiance Field, WRF)重建中物理可解释性不足与精度受限的问题,尤其针对现有基于高斯点阵(Gaussian Splatting, GS)的方法多直接移植视觉领域流程、缺乏电磁耦合建模的缺陷。其核心解决方案是提出BiSplat-WRF,一种平面高斯点阵框架,通过引入二维平面高斯基元(2D planar Gaussian)替代传统三维表示,在保持表达能力的同时避免冗余投影;关键创新在于设计双线性空间变换器(Bilinear Spatial Transformer, BST),在角度域上聚合基元间关系,并利用注意力机制捕捉远距离电磁依赖,从而显式建模全局电磁耦合与互散射效应,增强对复杂无线环境物理规律的拟合能力。

链接: https://arxiv.org/abs/2604.25945
作者: Jinghan Zhang,Xitao Gong,Qi Wang,Richard A. Stirling-Gallacher,Giuseppe Caire
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE ICC 2026 Workshop

点击查看摘要

Abstract:Wireless radiance field (WRF) reconstruction aims to learn a continuous, queryable representation of radio frequency characteristics over 3D space and direction, from which specific quantities, such as the spatial power spectrum (SPS) at a receiver given a transmitter position, can be predicted. While Gaussian splatting (GS)-based method has surpassed Neural Radiance Fields (NeRF)-based method for this task, existing adaptations largely transplant vision pipelines, limiting physical interpretability and accuracy. We introduce BiSplat-WRF, a planar GS framework that retains the expressiveness of 3D GS while removing unnecessary projections and incorporating global EM coupling and mutual scattering among primitives. Each primitive is a 2D planar Gaussian with 3D coordinates, rendered directly on the angular domain of the SPS. A bilinear spatial transformer (BST) aggregates inter-primitive relations on an angular grid and, via attention, captures long-range electromagnetic dependencies, thereby enforcing globally aware EM interactions that reflect the complex physics of the wireless environment. On spatial spectrum synthesis task, BiSplat-WRF surpasses NeRF-based and prior GS-based baselines with respect to the Structural Similarity Index (SSIM); comprehensive ablation studies validate the contribution of BST. We also provide a larger BiSplat-WRF+ variant that further increases SSIM at a higher computation cost, serving as a strong reference for future studies.
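Rendering planar Gaussian primitives directly on the angular domain of the SPS can be sketched as summing 2D Gaussians over an azimuth-elevation grid. The isotropic kernels and normalized angular coordinates are assumptions; BiSplat-WRF's 3D primitive coordinates and BST attention stage are omitted.

```python
import numpy as np

def render_sps(mu, sigma, amp, grid=64):
    # Splat isotropic 2D Gaussians onto the angular grid of the spatial
    # power spectrum (azimuth x elevation), summing their contributions.
    az, el = np.meshgrid(np.linspace(0, 1, grid), np.linspace(0, 1, grid))
    sps = np.zeros((grid, grid))
    for (ma, me), s, a in zip(mu, sigma, amp):
        sps += a * np.exp(-((az - ma) ** 2 + (el - me) ** 2) / (2 * s ** 2))
    return sps

mu = np.array([[0.3, 0.7], [0.8, 0.2]])   # angular centers of two primitives
sps = render_sps(mu, sigma=[0.05, 0.1], amp=[1.0, 0.5])
peak = np.unravel_index(np.argmax(sps), sps.shape)
print(peak)
```

Each primitive renders directly on the output grid with no camera projection, which is the "removing unnecessary projections" point the abstract makes against vision-pipeline transplants.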

[AI-56] SongBench: A Fine-Grained Multi-Aspect Benchmark for Song Quality Assessment

【速读】:该论文旨在解决当前文本到歌曲(Text-to-Song)生成技术中缺乏专业级细粒度评估基准的问题,现有评测方法无法准确捕捉音乐创作中的多维审美特征。解决方案的关键在于提出SongBench,一个面向专业音乐评估的细粒度框架,涵盖Vocal(人声)、Instrument(乐器)、Melody(旋律)、Structure(结构)、Arrangement(编曲)、Mixing(混音)和Musicality(音乐性)七个核心维度,并构建了一个由音乐专业人士标注的11,717个样本数据库。实验表明,SongBench与专家评分高度相关,能够精准揭示当前先进模型在各维度上的性能差距,从而为生成式AI (Generative AI) 音乐创作提供诊断性指导,推动更专业、连贯的歌曲生成发展。

链接: https://arxiv.org/abs/2604.25937
作者: Dapeng Wu,Shun Lei,Wei Tan,Guangzheng Li,Yunzhe Wang,Huaicheng Zhang,Lishi Zuo,Zhiyong Wu
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Recent advancements in Text-to-Song generation have enabled realistic musical content production, yet existing evaluation benchmarks lack the professional granularity to capture multi-dimensional aesthetic nuances. In this paper, we propose SongBench, a specialized framework for fine-grained song assessment across seven key dimensions: Vocal, Instrument, Melody, Structure, Arrangement, Mixing, and Musicality. Utilizing this framework, we construct an expert-annotated database comprising 11,717 samples from state-of-the-art models, labeled by music professionals. Extensive experimental results demonstrate that SongBench achieves high correlation with expert ratings. By revealing fine-grained performance gaps in current state-of-the-art models, SongBench serves as a diagnostic benchmark to steer the development toward more professional and musically coherent song generation.

机器学习

[LG-0] Hyper Input Convex Neural Networks for Shape Constrained Learning and Optimal Transport

链接: https://arxiv.org/abs/2604.26942
作者: Shayan Hundrieser,Insung Kong,Johannes Schmidt-Hieber
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Genomics (q-bio.GN); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 65 pages, 13 figures, the first two authors contributed equally

点击查看摘要

Abstract:We introduce Hyper Input Convex Neural Networks (HyCNNs), a novel neural network architecture designed for learning convex functions. HyCNNs combine the principles of Maxout networks with input convex neural networks (ICNNs) to create a neural network that is always convex in the input, theoretically capable of leveraging depth, and performs reliably when trained at scale compared to ICNNs. Concretely, we prove that HyCNNs require exponentially fewer parameters than ICNNs to approximate quadratic functions up to a given precision. Across a series of synthetic experiments, we demonstrate that HyCNNs outperform existing ICNNs and MLPs in terms of predictive performance for convex regression and interpolation tasks. We further apply HyCNNs to learn high-dimensional optimal transport maps for synthetic examples and for single-cell RNA sequencing data, where they oftentimes outperform ICNN-based neural optimal transport methods and other baselines across a wide range of settings.
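The Maxout principle behind input convexity is the pointwise maximum of affine maps, which is always convex in the input and can be checked numerically. This sketch shows only that building block, not the full HyCNN architecture from the paper.

```python
import numpy as np

def max_affine(x, W, b):
    # f(x) = max_i (w_i . x + b_i): a pointwise maximum of affine maps,
    # which is convex in x by construction (the Maxout building block).
    return np.max(x @ W.T + b, axis=-1)

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 4))   # 32 affine pieces over a 4-d input
b = rng.standard_normal(32)

# Numerical convexity check: f((x+y)/2) <= (f(x)+f(y))/2 for random pairs.
x, y = rng.standard_normal((2, 100, 4))
mid = max_affine((x + y) / 2, W, b)
chord = (max_affine(x, W, b) + max_affine(y, W, b)) / 2
print(bool(np.all(mid <= chord + 1e-12)))
```

Compositions that preserve convexity (nonnegative sums, pointwise maxima, and convex nondecreasing outer functions) then let such pieces be stacked into deeper convex networks.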

[LG-1] A Note on How to Remove the lnln T Term from the Squint Bound

链接: https://arxiv.org/abs/2604.26926
作者: Francesco Orabona
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In Orabona and Pál [2016], we introduced the shifted KT potentials, to remove the \ln \ln T factor in the parameter-free learning with expert bound. In this short technical note, I show that this is equivalent to changing the prior in the Krichevsky–Trofimov algorithm. Then, I show how to use the same idea to remove the \ln \ln T factor in the data-independent bound for the Squint algorithm.

[LG-2] On the Learning Curves of Revenue Maximization STOC2026

链接: https://arxiv.org/abs/2604.26922
作者: Steve Hanneke,Alkis Kalavasis,Shay Moran,Grigoris Velegkas
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
*备注: To appear in the 58th ACM Symposium on Theory of Computing (STOC 2026)

点击查看摘要

Abstract:Learning curves are a fundamental primitive in supervised learning, describing how an algorithm’s performance improves with more data and providing a quantitative measure of its generalization ability. Formally, a learning curve plots the decay of an algorithm’s error for a fixed underlying distribution as a function of the number of training samples. Prior work on revenue-maximizing learning algorithms, starting with the seminal work of Cole and Roughgarden [STOC, 2014], adopts a distribution-free perspective, which parallels the PAC learning framework in learning theory. This approach evaluates performance against the hardest possible sequence of valuation distributions, one for each sample size, effectively defining the upper envelope of learning curves over all possible distributions, thus leading to error bounds that do not capture the shape of the learning curves. In this work we initiate the study of learning curves for revenue maximization and provide a near-complete characterization of their rate of decay in the basic setting of a single item and a single buyer. In the absence of any restriction on the valuation distribution, we show that there exists a Bayes-consistent algorithm, meaning that its learning curve converges to zero for any arbitrary valuation distribution as the number of samples n \to \infty . However, this convergence must be arbitrarily slow, even if the optimal revenue is finite. In contrast, if the optimal revenue is achieved by a finite price, then the optimal rate of decay is roughly 1/\sqrtn . Finally, for distributions supported on discrete sets of values, we show that learning curves decay almost exponentially fast, a rate unattainable under the PAC framework. 
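The learning-curve object studied in the abstract can be illustrated in the simplest case — valuations drawn from U[0,1] — by tracing the regret of an empirical-revenue-maximizing price as the sample size grows. The uniform distribution and the ERM rule below are chosen purely for illustration; the paper's rates cover far more general settings.

```python
import numpy as np

def erm_price(samples):
    # Empirical revenue maximization: pick the observed valuation p that
    # maximizes p * (fraction of samples with value >= p).
    candidates = np.unique(samples)
    revenue = [p * np.mean(samples >= p) for p in candidates]
    return candidates[int(np.argmax(revenue))]

def expected_revenue(p):
    return p * max(0.0, 1.0 - p)  # p * P(v >= p) for valuations ~ U[0,1]

opt_rev = 0.25  # the optimal price 1/2 earns revenue 1/4 under U[0,1]
rng = np.random.default_rng(0)

# Trace one learning curve: regret of the ERM price as samples grow.
errors = {n: opt_rev - expected_revenue(erm_price(rng.uniform(0, 1, n)))
          for n in [10, 100, 10000]}
print(errors)
```

The dictionary is one learning curve for one fixed distribution; the distribution-free analyses the abstract contrasts with would instead track the worst such curve over all distributions at each `n`.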

[LG-3] Multiple Additive Neural Networks for Structured and Unstructured Data

链接: https://arxiv.org/abs/2604.26888
作者: Janis Mohr,Jörg Frochte
类目: Machine Learning (cs.LG)
*备注: Accepted author manuscript; page layout differs from the published Springer version

点击查看摘要

Abstract:This paper extends and explains the Multiple Additive Neural Networks (MANN) methodology, an enhancement to the traditional Gradient Boosting framework, utilizing nearly shallow neural networks instead of decision trees as base learners. This innovative approach leverages neural network architectures, notably Convolutional Neural Networks (CNNs) and Capsule Neural Networks, to extend its application to both structured data and unstructured data such as images and audio. For structured data the advantages of capsule neural networks as feature extractors are used and combined with MANN as a classifier. MANN’s unique architecture promotes continuous learning and integrates advanced heuristics to combat overfitting, ensuring robustness and reducing sensitivity to hyperparameter settings like learning rate and iterations. Our empirical studies reveal that MANN surpasses traditional methods such as Extreme Gradient Boosting (XGB) in accuracy across well-known datasets. This research demonstrates MANN’s superior precision and generalizability, making it a versatile tool for diverse data types and complex learning environments.
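The core MANN idea — gradient boosting with nearly shallow neural networks in place of decision trees — can be sketched as residual fitting with a small base learner. Using a random tanh hidden layer with a least-squares readout is an assumption made here for stability; MANN's actual training procedure, capsule feature extractors, and anti-overfitting heuristics are not reproduced.

```python
import numpy as np

class ShallowNet:
    # A nearly shallow base learner: random tanh hidden layer, least-squares readout.
    def __init__(self, rng, n_hidden=16, d=1):
        self.W = rng.standard_normal((d, n_hidden))
        self.b = rng.standard_normal(n_hidden)

    def fit(self, X, r):
        H = np.tanh(X @ self.W + self.b)
        self.out, *_ = np.linalg.lstsq(H, r, rcond=None)
        return self

    def predict(self, X):
        return np.tanh(X @ self.W + self.b) @ self.out

def boost(X, y, n_rounds=50, lr=0.3, seed=0):
    # Gradient boosting for squared loss: each stage fits the current residuals.
    rng = np.random.default_rng(seed)
    pred, ensemble = np.zeros(len(y)), []
    for _ in range(n_rounds):
        learner = ShallowNet(rng).fit(X, y - pred)
        pred += lr * learner.predict(X)
        ensemble.append(learner)
    return ensemble, pred

rng = np.random.default_rng(1)
X = np.linspace(-2, 2, 200)[:, None]
y = np.sin(3 * X[:, 0]) + 0.05 * rng.standard_normal(200)

_, pred = boost(X, y)
mse = float(np.mean((y - pred) ** 2))
print(round(mse, 4))
```

Swapping the base learner class is the only change needed to move from tree boosting to this scheme; the additive update loop is the same as in XGB-style frameworks.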

[LG-4] FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving

链接: https://arxiv.org/abs/2604.26881
作者: Minghe Wang,Trever Schirmer,Mohammadreza Malekabbasi,David Bermbach
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Accepted for publication in the 9th International Workshop on Edge Systems, Analytics and Networking (EdgeSys 2026)

点击查看摘要

Abstract:Mixture-of-Experts (MoE) models offer high capacity with efficient inference cost by activating a small subset of expert models per input. However, deploying MoE models requires all experts to reside in memory, creating a gap between the resource used by activated experts and the provisioned resources. This underutilization is further pronounced in multi-tenant scenarios. In this paper, we propose FaaSMoE, a multi-tenant MoE serving architecture built on Function-as-a-Service (FaaS) platforms. FaaSMoE decouples the control and execution planes of MoE by deploying experts as stateless FaaS functions, enabling on-demand and scale-to-zero expert invocation across tenants. FaaSMoE further supports configurable expert granularity within functions, trading off per-expert elasticity for reduced invocation overhead. We implement a prototype with an open-source edge-oriented FaaS platform and evaluate it using Qwen1.5-moe-2.7B under multi-tenant workloads. Compared to a full-model baseline, FaaSMoE uses less than one third of the resources, demonstrating a practical and resource-efficient path towards scalable MoE serving in a multi-tenant environment.
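The property FaaSMoE exploits — that only a small top-k subset of experts is activated per input — can be sketched with a plain top-k gate, where routed experts stand in for stateless FaaS functions and idle ones would scale to zero. Expert count, token shapes, and the toy expert bodies are assumptions, not the Qwen model's configuration.

```python
import numpy as np

def top_k_gate(logits, k=2):
    # Route each token to its k highest-scoring experts with softmax weights.
    idx = np.argsort(logits, axis=-1)[:, -k:]
    top = np.take_along_axis(logits, idx, axis=-1)
    w = np.exp(top - top.max(-1, keepdims=True))
    return idx, w / w.sum(-1, keepdims=True)

# Stand-ins for stateless "expert functions": only the routed ones would be
# invoked (and billed) on a FaaS platform.
experts = [lambda x, s=s: x * (s + 1) for s in range(8)]

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 16))
logits = rng.standard_normal((4, 8))   # per-token gating scores

idx, weights = top_k_gate(logits, k=2)
invoked = set(idx.ravel().tolist())
out = np.zeros_like(tokens)
for t in range(len(tokens)):
    for e, w in zip(idx[t], weights[t]):
        out[t] += w * experts[e](tokens[t])

print(f"invoked {len(invoked)} of {len(experts)} experts")
```

The gap between `invoked` and the full expert list is exactly the provisioned-but-idle memory that FaaSMoE reclaims by deploying experts as on-demand functions.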

[LG-5] Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

链接: https://arxiv.org/abs/2604.26837
作者: Zihan Zhao,Baotong Lu,Shengjie Lin,Yizou Chen,Jing Liu,Yanqi Zhang,Ziming Miao,Ming-Chang Yang,Haiying Shen,Qi Chen,Fan Yang
类目: Machine Learning (cs.LG)
*备注: 15 pages

点击查看摘要

Abstract:Long-context LLM serving is bottlenecked by the cost of attending over ever-growing KV caches. Dynamic sparse attention promises relief by accessing only a small, query-dependent subset of the KV state per decoding step and extending the KV storage to CPU memory. In practice, however, these algorithmic savings rarely translate into end-to-end system-level gains because sparse methods typically operate at different granularities and thus rely on ad hoc, per-algorithm implementations. At the same time, hierarchical KV storage introduces a new systems bottleneck: retrieving fine-grained, irregular KV subsets across the GPU-CPU boundary can easily erase the benefits of sparsity. We present SPIN, a sparse-attention-aware inference framework that co-designs the execution pipeline with hierarchical KV storage through three techniques: (1) a unified partition abstraction that maps different sparsity granularities onto a shared page-based KV substrate; (2) a locality-aware KV cache manager that dynamically sizes per-request HBM budgets and uses a GPU-friendly bucketed LRU policy to cut PCIe round-trips; and (3) a two-level hierarchical metadata layout sized to the active working set rather than the worst-case address space. Built on vLLM with three representative sparse attention algorithms, SPIN delivers 1.66-5.66x higher end-to-end throughput and 7-9x lower TTFT than vLLM, and reduces TPOT by up to 58% over the original sparse-attention implementations.
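The "page-based KV substrate" with an LRU eviction policy can be illustrated with a toy host-side page cache (a conceptual sketch, not SPIN's GPU-resident bucketed implementation; `PageLRUCache` is a hypothetical name):

```python
from collections import OrderedDict

class PageLRUCache:
    """Toy page-based cache with least-recently-used eviction. A cache miss
    models a PCIe round-trip to fetch the KV page from CPU memory."""
    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()          # page_id -> KV payload

    def get(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id) # mark as most-recently used
            return self.pages[page_id]
        return None                          # miss: caller must fetch

    def put(self, page_id, payload):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)
        self.pages[page_id] = payload
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict least-recently used page

cache = PageLRUCache(capacity_pages=2)
cache.put("p0", "kv0"); cache.put("p1", "kv1")
cache.get("p0")            # touch p0 so p1 becomes the LRU page
cache.put("p2", "kv2")     # exceeds capacity: evicts p1
```

Keeping frequently reused pages resident in HBM is exactly what lets sparse attention avoid paying the GPU-CPU transfer cost on every decoding step.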

[LG-6] Uncertainty-Aware Predictive Safety Filters for Probabilistic Neural Network Dynamics

链接: https://arxiv.org/abs/2604.26836
作者: Bernd Frauenknecht,Lukas Kesper,Daniel Mayfrank,Henrik Hose,Sebastian Trimpe
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Predictive safety filters (PSFs) leverage model predictive control to enforce constraint satisfaction during deep reinforcement learning (RL) exploration, yet their reliance on first-principles models or Gaussian processes limits scalability and broader applicability. Meanwhile, model-based RL (MBRL) methods routinely employ probabilistic ensemble (PE) neural networks to capture complex, high-dimensional dynamics from data with minimal prior knowledge. However, existing attempts to integrate PEs into PSFs lack rigorous uncertainty quantification. We introduce the Uncertainty-Aware Predictive Safety Filter (UPSi), a PSF that provides rigorous safety predictions using PE dynamics models by formulating future outcomes as reachable sets. UPSi introduces an explicit certainty constraint that prevents model exploitation and integrates seamlessly into common MBRL frameworks. We evaluate UPSi within Dyna-style MBRL on standard safe RL benchmarks and report substantial improvements in exploration safety over prior neural network PSFs while maintaining performance on par with standard MBRL. UPSi bridges the gap between the scalability and generality of modern MBRL and the safety guarantees of predictive safety filters.

[LG-7] Semi-supervised learning with max-margin graph cuts AISTATS2010

链接: https://arxiv.org/abs/2604.26818
作者: Branislav Kveton,Michal Valko,Ali Rahimi,Ling Huang
类目: Machine Learning (cs.LG)
*备注: Published at AISTATS 2010 (13th International Conference on Artificial Intelligence and Statistics)

点击查看摘要

Abstract:This paper proposes a novel algorithm for semisupervised learning. This algorithm learns graph cuts that maximize the margin with respect to the labels induced by the harmonic function solution. We motivate the approach, compare it to existing work, and prove a bound on its generalization error. The quality of our solutions is evaluated on a synthetic problem and three UCI ML repository datasets. In most cases, we outperform manifold regularization of support vector machines, which is a state-of-the-art approach to semi-supervised max-margin learning.
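The harmonic function solution that induces the labels is a small linear solve on the graph Laplacian; a minimal sketch on a three-node chain (this illustrates only the harmonic step, not the paper's max-margin graph-cut algorithm):

```python
import numpy as np

# Toy chain graph 0-1-2 with unit edge weights; nodes 0 and 2 are labeled.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W            # graph Laplacian
labeled, unlabeled = [0, 2], [1]
y_l = np.array([1.0, 0.0])                # labels on nodes 0 and 2

# Harmonic solution: f_u = -L_uu^{-1} L_ul y_l
L_uu = L[np.ix_(unlabeled, unlabeled)]
L_ul = L[np.ix_(unlabeled, labeled)]
f_u = np.linalg.solve(L_uu, -L_ul @ y_l)  # node 1 interpolates its neighbors
```

The unlabeled node receives the average of its labeled neighbors (0.5 here); thresholding such values induces the labels the max-margin cut is then fit against.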

[LG-8] Asynchronous Federated Unlearning with Invariance Calibration for Medical Imaging IJCNN2026

链接: https://arxiv.org/abs/2604.26809
作者: Zhaoyuan Cai,Xinglin Zhang
类目: Machine Learning (cs.LG)
*备注: 8 pages, 5 figures, the article is accepted by IEEE IJCNN 2026

点击查看摘要

Abstract:Federated Unlearning (FU) is an emerging paradigm in Federated Learning (FL) that enables participating clients to fully remove their contributions from a trained global model, driven by data protection regulations that mandate the right to be forgotten. However, existing FU methods mostly rely on synchronous coordination. This requirement forces the entire federation to halt and wait for stragglers to complete erasure, creating significant delays due to device heterogeneity. Furthermore, these methods often face the problem that the influence of erased data is merely suppressed temporarily and resurfaces during subsequent training, rather than being genuinely removed. To overcome these limitations, this paper proposes Asynchronous Federated Unlearning with Invariance Calibration (AFU-IC), a novel framework for medical imaging that decouples the erasure process from the global training workflow. This enables the target client to perform unlearning asynchronously without interrupting global training. Meanwhile, a server-side invariance calibration mechanism prevents the model from relearning the erased data. Extensive experiments on three medical benchmarks demonstrate that AFU-IC achieves unlearning efficacy and model fidelity comparable to gold-standard retraining while significantly reducing wall-clock latency compared to synchronous baselines. AFU-IC ensures efficient, compliant and reliable FL in cross-silo medical environments.

[LG-9] A Multi-Dataset Benchmark of Multiple Instance Learning for 3D Neuroimage Classification

链接: https://arxiv.org/abs/2604.26807
作者: Ethan Harvey,Dennis Johan Loevlie,Amir Ali Satani,Wansu Chen,David M. Kent,Michael C. Hughes
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite being resource-intensive to train, 3D convolutional neural networks (CNNs) have been the standard approach to classify CT and MRI scans. Recent work suggests that deep multiple instance learning (MIL) may be a more efficient alternative for 3D brain scans, especially when the pre-trained image encoder used to embed each 2D slice is frozen and only the pooling operation and classifier are trained. In this paper, we provide a systematic comparison of simple MIL, attention-based MIL, 3D CNNs, and 3D ViTs across three CT and four MRI datasets, including two large datasets of at least 10,000 scans. Our goal is to help resource-constrained practitioners understand which neural networks work well for 3D neuroimages and why. We further compare design choices for attention-based MIL, including different encoders, pooling operations, and architectural orderings. We find that simple mean pooling MIL, without any learnable attention, matches or outperforms recent MIL or 3D CNN alternatives on 4 of 6 moderate-sized tasks. This baseline remains competitive on two large datasets while being 25x faster to train. To explain mean pooling’s success, we examine per-slice attention quality and a semi-synthetic dataset where we can derive the best possible classifier via a Bayes estimator. This analysis reveals the limits of existing MIL approaches and suggests routes for future improvements.
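The winning attention-free baseline is simple enough to sketch end-to-end: embed each 2D slice with a frozen encoder, mean-pool over slices, and apply a linear classifier (the encoder below is a stand-in fixed projection, not a real pre-trained network):

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_encoder(slices):
    # Stand-in for a frozen 2D encoder: a fixed linear projection per slice.
    proj = np.linspace(-1.0, 1.0, slices.shape[1] * 8).reshape(slices.shape[1], 8)
    return slices @ proj                     # (n_slices, 8) embeddings

def mean_pool_mil(scan_slices, w, b):
    """Embed each slice, mean-pool over the slice axis, then apply a linear
    classifier -- the attention-free MIL baseline described above."""
    emb = frozen_encoder(scan_slices)        # (n_slices, d)
    bag = emb.mean(axis=0)                   # (d,) scan-level representation
    return float(bag @ w + b)                # classifier logit

scan = rng.normal(size=(40, 16))             # toy scan: 40 slices, 16 features
w, b = rng.normal(size=8), 0.0
logit = mean_pool_mil(scan, w, b)
```

Because mean pooling is permutation-invariant, the prediction is unchanged if the slice order is reversed, which is one reason this baseline is so cheap to train: only `w` and `b` are learnable.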

[LG-10] Super-resolution Multi-signal Direction-of-Arrival Estimation by Hankel-structured Sensing and Decomposition

链接: https://arxiv.org/abs/2604.26793
作者: Georgios I. Orfanidis,Dimitris A. Pados,George Sklivanitis,Elizabeth Serena Bentley
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Motivated by sensing modalities in modern autonomous systems that involve hardware-constrained spatial sampling over large arrays with limited coherence time, we develop a novel framework for rapid super-resolution multi-signal direction-of-arrival (DoA) estimation based on Hankel-structured sensing and data matrix decomposition of arbitrary rank, under both the L_2- and L_1-norm formulations. The resulting L_2-norm estimator is shown to be maximum-likelihood optimal in white Gaussian noise. The L_1-norm estimator is shown to be maximum-likelihood optimal in independent, identically distributed (i.i.d.) isotropic Laplace noise, offering broad robustness to impulsive interference and corrupted measurements commonly encountered in practice. Extensive simulations demonstrate that the proposed methods exhibit powerful super-resolution capabilities, requiring significantly lower SNR and achieving substantially higher resolution probability than recent competing approaches.

[LG-11] Hankel and Toeplitz Rank-1 Decomposition of Arbitrary Matrices with Applications to Signal Direction-of-Arrival Estimation

链接: https://arxiv.org/abs/2604.26787
作者: Georgios I. Orfanidis,Dimitris A. Pados,George Sklivanitis,Elizabeth Serena Bentley
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:We consider the problems of computing the optimal rank-1 Hankel and Toeplitz-structured approximation of arbitrary matrices under L_2- and L_1-norm error. Such problems arise naturally in engineered systems, including the basic few-shot signal Direction-of-Arrival (DoA) estimation problem that is of importance to modern autonomous systems applications. We develop accurate and computationally efficient structured matrix decomposition algorithms for both formulations and then derive analytically grounded small-sample-support DoA estimators for practical sensing system deployments. The resulting estimators under the L_2 and L_1 norms are formally shown to be maximum-likelihood optimal under white Gaussian and Laplace noise, respectively. The estimators are further validated through extensive simulation studies and real-world data experiments in few-shot DoA inference.
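Why rank-1 Hankel structure helps can be seen in the noiseless L_2 case: the Hankel matrix of a single complex exponential is exactly rank 1, and its dominant singular vector encodes the signal frequency (an SVD-based sketch under these assumptions; the paper's L_1-norm decomposition is more involved):

```python
import numpy as np

def hankel(x, m):
    """Build an m-row Hankel matrix from samples x: H[i, j] = x[i + j]."""
    cols = len(x) - m + 1
    return np.array([x[i:i + cols] for i in range(m)])

f_true = 0.12
t = np.arange(16)
x = np.exp(2j * np.pi * f_true * t)      # single noiseless exponential z^t

H = hankel(x, m=8)                       # Hankel-structured data matrix
U, s, Vh = np.linalg.svd(H)
u = U[:, 0]                              # dominant left singular vector
# For x[t] = z^t the Hankel matrix is exactly rank 1 and u_i is proportional
# to z^i, so the frequency follows from the phase ratio of adjacent entries.
f_hat = np.angle(u[1] / u[0]) / (2 * np.pi)
```

With noise, the best rank-1 structured approximation (under L_2 or, for impulsive noise, L_1) plays the role that the exact decomposition plays here.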

[LG-12] Electricity price forecasting across Norway's five bidding zones in the post-crisis era

链接: https://arxiv.org/abs/2604.26634
作者: My Thi Diem Phan,Trung Tuyen Truong,Hoai Phuong Ha,Dat Thanh Nguyen
类目: Machine Learning (cs.LG); General Economics (econ.GN); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Norway’s electricity market is heavily dominated by hydropower, but the 2021–2022 energy crisis and stronger integration with Continental Europe have fundamentally altered price formation, reducing the reliability of forecasting models calibrated on historical data. Despite the critical need for updated models, a unified benchmark evaluating feature contributions across all structurally diverse Norwegian bidding zones remains lacking. Here we present a comprehensive evaluation of electricity price forecasting across all five Norwegian Nord Pool bidding zones. We constructed a multimodal hourly dataset spanning 2019–2025 and evaluated eight forecasting model families including LightGBM, ARX, and advanced deep learning architectures using a strictly causal test set. We implemented robust rolling-origin backtesting, leave-one-group-out feature ablation, and conditional regime analysis to dissect model performance and feature utility. Our results show that LightGBM achieves the best performance in every zone with MAE ranging from 1.64 to 5.74~EUR/MWh, while the ridge ARX model remains a highly competitive linear benchmark in northern zones. Feature ablation reveals that models relying solely on lagged prices and calendar variables achieve high accuracy and often match or exceed full multimodal integration. However, conditional regime analysis demonstrates that external features like reservoir levels and gas prices remain crucial to stratify forecast errors, which consistently increase under stressed market regimes. This highlights the practical value of model interpretability and regime awareness for decision makers facing structural changes in market dynamics.
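The lag-plus-calendar feature set that the ablation found competitive, paired with the ridge ARX linear benchmark, can be sketched on synthetic hourly prices (all numbers and the feature choices below are illustrative, not the paper's exact configuration):

```python
import numpy as np

def make_features(prices, hours, lags=(1, 2, 24)):
    """Lagged prices plus hour-of-day calendar encodings plus a bias term --
    the minimal feature set the ablation found highly competitive."""
    t0 = max(lags)
    rows = []
    for t in range(t0, len(prices)):
        lagged = [prices[t - l] for l in lags]
        hour_sin = np.sin(2 * np.pi * hours[t] / 24)
        hour_cos = np.cos(2 * np.pi * hours[t] / 24)
        rows.append(lagged + [hour_sin, hour_cos, 1.0])
    return np.array(rows), prices[t0:]

rng = np.random.default_rng(1)
hours = np.arange(500) % 24
prices = 50 + 10 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, 500)

X, y = make_features(prices, hours)
# Ridge ARX benchmark, closed form: w = (X'X + aI)^{-1} X'y
a = 1.0
w = np.linalg.solve(X.T @ X + a * np.eye(X.shape[1]), X.T @ y)
mae = np.abs(X @ w - y).mean()
```

On this toy daily-cycle series the six-feature ridge model already drives MAE close to the noise floor, mirroring the paper's finding that lagged prices and calendar variables alone achieve high accuracy.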

[LG-13] Who Trains Matters: Federated Learning under Enrollment and Participation Selection Biases

链接: https://arxiv.org/abs/2604.26604
作者: Gota Morishita
类目: Machine Learning (cs.LG)
*备注: 10 pages, 2 figures

点击查看摘要

Abstract:Federated learning (FL) trains a shared model from updates contributed by distributed clients, often implicitly assuming that contributing clients are representative of the target population. In practice, this representativeness assumption can fail at two distinct stages, inducing selection bias. First, eligibility rules such as device constraints, software requirements, or user consent determine which clients are ever enrolled and reachable for training, inducing enrollment bias. Second, among enrolled clients, user and system factors such as battery state, network status, and local time determine which clients participate in each communication round, inducing participation bias. Although existing work has largely addressed round-level participation bias, it has paid far less attention to population-level enrollment bias, which can induce a persistent mismatch between the training objective and the target-population objective. We formalize FL under a two-stage selection model and derive FedIPW, an inverse-probability-weighted aggregation scheme that recovers the target-population mean update under standard ignorability and positivity assumptions. Because client-level covariates are often unavailable for non-enrolled clients, we also introduce a limited-information aggregate-calibration extension that uses known target-population summaries to reweight the enrolled sample, partially correcting enrollment bias. We further provide an algorithm-agnostic optimization analysis under residual weighting error and show that incomplete selection correction can induce a non-vanishing bias floor. Finally, experiments on synthetic federated logistic regression validate the predicted objective mismatch and show that enrollment correction reduces target-population error under two-stage selection.
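The core FedIPW idea, reweighting observed client updates by inverse selection probabilities, reduces to a weighted mean; a toy sketch with assumed enrollment and participation probabilities (not the paper's full algorithm or its calibration extension):

```python
import numpy as np

def fed_ipw_aggregate(updates, p_enroll, p_participate):
    """Inverse-probability-weighted mean update: each observed client update
    is weighted by 1 / (enrollment prob x participation prob), then the
    weights are normalized."""
    w = 1.0 / (np.asarray(p_enroll) * np.asarray(p_participate))
    w = w / w.sum()
    return (w[:, None] * np.asarray(updates)).sum(axis=0)

# Group A (first two clients) is over-selected; group B (last) is rare.
updates = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
agg = fed_ipw_aggregate(updates,
                        p_enroll=[0.8, 0.8, 0.2],
                        p_participate=[1.0, 1.0, 1.0])
# Upweighting the under-selected group yields [1/3, 2/3]: group B's single
# observed update now stands in for the many B clients never enrolled.
```

A naive unweighted average would give [2/3, 1/3], i.e. the enrolled-sample mean rather than the target-population mean.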

[LG-14] PiGGO: Physics-Guided Learnable Graph Kalman Filters for Virtual Sensing of Nonlinear Dynamic Structures under Uncertainty

链接: https://arxiv.org/abs/2604.26593
作者: Marcus Haywood-Alexander,Gregory Duthé,Eleni Chatzi
类目: Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注:

点击查看摘要

Abstract:Digital twins provide a powerful paradigm for diagnostic and prognostic tasks in the monitoring and control of engineered systems; however, their deployment for complex structures remains challenged by model-form uncertainty, arising from unknown nonlinear dynamics, and by sparse sensing. These limitations hinder reliable online state estimation using either purely physics-based or purely data-driven approaches. This work introduces the Physics-Guided Graph Neural ODE (PiGGO) framework, a physics-informed, graph-based Bayesian state estimation approach in which a learned graph neural ordinary differential equation (GNODE) serves as the continuous-time state-transition model within an extended Kalman filter. The graph representation explicitly defines the system state-space, while physics-guided inductive biases encode known structural relationships and constrain the learning of nonlinear dynamics. By integrating graph-native learned dynamics with recursive Bayesian filtering, the proposed PiGGO framework enables online virtual sensing and uncertainty-aware state estimation for nonlinear systems with unknown model form, while maintaining generalisation across topologically similar structures. Numerical case studies demonstrate improved robustness to model uncertainty and measurement noise, outperforming both open-loop graph neural models and conventional filtering approaches in online prediction tasks.

[LG-15] PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners

链接: https://arxiv.org/abs/2604.26573
作者: Zhiquan Tan,Yinrong Hong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Improving large language model (LLM) reasoning requires supervision that is both aligned with the model’s own test-time states and informative at the token level. Reinforcement learning with verifiable rewards provides on-policy exploration but offers sparse, high-variance credit; supervised fine-tuning and distillation provide dense targets but often train on fixed trajectories or rely on stronger teachers. Recent privileged on-policy self-distillation explores a middle ground by scoring student rollouts with the same model under verified solution context. We revisit this setting through a contextual re-scoring lens: for reasoning, the important choices are not only whether privileged context is available, but how much of it should be revealed and where its distribution should shape the student. We propose PAINT (Partial-solution Adaptive INterpolated Training), which masks the verified solution according to rollout-reference overlap and applies a small energy-space interpolation on a sparse set of entropy-mismatch token positions. Across competition-level math benchmarks, PAINT consistently improves over a strong prior on-policy self-distillation baseline at all three Qwen3 scales. On Qwen3-8B, it raises macro Avg@12 by 2.1 points over this prior baseline and 2.9 points over GRPO.

[LG-16] Advancing multi-site emission control: A physics-informed transfer learning framework with mixture of experts for carbon-pollutant synergy

链接: https://arxiv.org/abs/2604.26571
作者: Yuxuan Ying,Hanqing Yang,Kaige Wang,Yu Hu,Zhiming Zheng,Yunliang Jiang,Xiaoqing Lin,Xiaodong Li,Jun Chen
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注: Supplementary materials will be released after the final version is finalized

点击查看摘要

Abstract:Municipal solid waste incineration is increasingly central to urban waste management, yet its sustainability benefit depends on controlling carbon emissions and multiple air pollutants under highly heterogeneous operating conditions. Current data-driven models are often accurate within individual plants but are difficult to transfer across facilities, limiting their value for scalable emission-control strategies. Here we show that multi-site emission behaviour can be represented through transferable system-level structures when physical constraints, operating-regime heterogeneity and carbon–pollutant coupling are jointly considered. We develop a physics-informed transfer learning framework built on a carbon–pollutant mixture-of-experts model, which combines regime-dependent expert routing with conservation-based regularization and a carbon–pollutant synergistic index for integrated risk evaluation. Across 13 municipal solid waste incineration plants, the model captured both pollutant-specific emissions and system-level risk, achieving source-domain average pollutant R^2 values of 0.668–0.904 and CPSI R^2 values of 0.666–0.970. After transfer from a reference facility to 12 target plants, average pollutant R^2 remained between 0.661 and 0.842, while CPSI retained comparable transferability ( R^2 = 0.610–0.841). Expert-utilization patterns further indicate that adaptation occurs through structured re-weighting of operating regimes rather than complete model re-learning. By extending the learned representation into an interpretable digital twin, this framework provides a route from emission prediction to regime-aware operational navigation, supporting scalable carbon–pollutant synergistic control across heterogeneous waste-to-energy systems.

[LG-17] Learning to Route Electric Trucks Under Operational Uncertainty STOC

链接: https://arxiv.org/abs/2604.26566
作者: Stavros Orfanoudakis,Ziyan Li,Ruixiao Yang,Nikolay Aristov,Pedro P. Vergara,Chuchu Fan,Elenna Dugundji
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Reinforcement Learning, Electric Truck Routing, Freight Transportation, Graph Neural Networks, Stochastic Optimization, Vehicle Routing

点击查看摘要

Abstract:Electric truck operations require routing decisions that remain feasible under limited battery range, long charging times, uncertain travel times and energy consumption, and competition for shared charging infrastructure. These features make electric truck routing a coupled logistics and energy problem, limiting the practicality of heuristics-based methods and rendering them computationally infeasible at scale. This paper proposes a learning-based framework for stochastic electric truck routing under charging constraints and operational uncertainty. The problem is formulated as an event-driven semi-Markov decision process with shared charging resources, stochastic travel and energy requirements, and realistic nonlinear fast-charging behavior, and is solved with reinforcement learning. To support learning in this setting, a graph-based representation of system state and feasible decisions is introduced, together with a rule-based action mask that restricts policies to operationally admissible actions, thus improving training efficiency. Building on this formulation, an event-driven simulation environment is developed that supports both reinforcement learning and benchmarking against heuristic and mathematical programming baselines. Computational experiments across a range of fleet sizes show that the proposed learning-based algorithm consistently outperforms baselines and attains performance close to optimization benchmarks in many settings, while preserving high success rates under charging congestion and uncertainty.

[LG-18] FloatSOM: GPU-Accelerated Distributed Topology-Flexible Self-Organizing Maps

链接: https://arxiv.org/abs/2604.26555
作者: Tony Xu,Sarah Klamt,Katherine Turner,Anne Brustle,Felix Marsh-Wakefield,Givanna Putri
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:GPU-accelerated Self-Organizing Map (SOM) implementations are among the most competitive options for large-scale SOM analysis, but growing dataset sizes increasingly challenge their practical use because workloads no longer fit cleanly within device-memory limits. We introduce FloatSOM, a SOM framework for scalable training and deployment that supports multi-GPU execution, out-of-memory disk-backed streaming, and novel topologies beyond regular lattices. We evaluate FloatSOM on 14 synthetic and real benchmark datasets together with controlled speed scaling benchmarks, and show that these improved topologies, combined with topology-aware hyperparameter fine-tuning, yield lower quantization error than current state-of-the-art SOM baselines. FloatSOM also sustains this performance at large scale with high-throughput distributed execution; in the largest benchmark, it trains a 1024-node SOM network on 1,000,000,000 samples with 50 features in 6.16 minutes on 8 GPUs across two separate high-performance-computing nodes.
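For readers unfamiliar with SOMs, the basic online update that frameworks like FloatSOM accelerate looks as follows on a regular 3x3 lattice (a textbook sketch; FloatSOM's flexible topologies would replace the grid-distance term, and its distributed execution batches these updates across GPUs):

```python
import numpy as np

def som_step(weights, x, lr, sigma, grid):
    """One online SOM update: find the best-matching unit (BMU), then pull
    every node toward x, attenuated by a Gaussian neighborhood kernel over
    distances on the node grid."""
    bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
    d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)   # grid distance to the BMU
    h = np.exp(-d2 / (2 * sigma ** 2))           # neighborhood strength
    return weights + lr * h[:, None] * (x - weights)

rng = np.random.default_rng(0)
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
weights = rng.normal(size=(9, 2))                # 9 nodes, 2D feature space
x = np.array([5.0, 5.0])                         # one training sample
w1 = som_step(weights, x, lr=0.5, sigma=1.0, grid=grid)
```

Quantization error, the metric the paper optimizes, is simply the average distance from each sample to its BMU after training.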

[LG-19] Large-scale semi-supervised learning with online spectral graph sparsification ICML2015

链接: https://arxiv.org/abs/2604.26550
作者: Daniele Calandriello,Alessandro Lazaric,Michal Valko
类目: Machine Learning (cs.LG)
*备注: Workshop on Resource-Efficient Machine Learning (REML), ICML 2015

点击查看摘要

Abstract:We introduce Sparse-HFS, a scalable algorithm that can compute solutions to semi-supervised learning (SSL) problems using only O(n polylog(n)) space and O(m polylog(n)) time.

[LG-20] Quantamination: Dynamic Quantization Leaks Your Data Across the Batch

链接: https://arxiv.org/abs/2604.26505
作者: Hanna Foerster,Ilia Shumailov,Cheng Zhang,Yiren Zhao,Jamie Hayes,Robert Mullins
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 11 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Dynamic quantization emerged as a practical approach to increase the utilization and efficiency of the machine learning serving flow. Unlike static quantization, which applies quantization offline, dynamic quantization operates on tensors at run-time, adapting its parameters to the actual input data. Today’s mainstream machine learning frameworks, including ML compilers and inference engines, frequently recommend dynamic quantization as an initial step for optimizing model serving. This is because dynamic quantization can significantly reduce memory usage and computational load, leading to faster token generation and improved model serving efficiency without substantial loss in model accuracy. In this paper, we reveal a critical vulnerability in dynamic quantization: an adversary can exploit such quantization strategy to steal sensitive user data placed in the same batch as the adversary’s input. Our analysis demonstrates that dynamic quantization, when improperly implemented or configured, can create side channels that expose information about other inputs within the same batch. We call this phenomenon Quantamination, describing contamination from quantization. Specifically, we show that at least 4 of the most popular ML frameworks in use today either default to or can use configurations that leak data across the batch boundary. This data leakage, in theory, allows attackers to partially or even fully recover other users’ batched input data, representing a serious privacy risk for existing ML serving frameworks.
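The batch-boundary side channel can be demonstrated with per-tensor dynamic int8 quantization: the runtime scale is computed from the whole batch, so one co-batched input changes another input's quantized values (a conceptual illustration of the leak mechanism, not the paper's actual recovery attack):

```python
import numpy as np

def quantize_int8_per_batch(batch):
    """Per-tensor dynamic quantization: one scale for the whole batch,
    computed from the runtime max -- the kind of configuration the
    paper flags as leaking data across the batch boundary."""
    scale = np.abs(batch).max() / 127.0
    return np.round(batch / scale).astype(np.int8), scale

victim = np.array([1.0, 2.0, 3.0])
q_alone, _ = quantize_int8_per_batch(victim[None, :])

# An adversary co-batches a large-magnitude input, changing the shared scale...
attacker = np.array([100.0, 0.0, 0.0])
q_shared, _ = quantize_int8_per_batch(np.stack([victim, attacker]))

# ...so the victim's quantized row differs between the two runs, and the
# difference is a function of the co-batched data.
leak = not np.array_equal(q_alone[0], q_shared[0])
```

With per-row (per-sample) scales instead, each row's quantization would depend only on that row, closing this particular channel.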

[LG-21] Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction

链接: https://arxiv.org/abs/2604.26498
作者: Jinjiang Guo
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:The rapid growth of molecular foundation models and general-purpose large language models has encouraged a scale-centric view of artificial intelligence in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and task-specific graph neural networks (GNNs). We test this assumption on 22 molecular property and activity endpoints, including public ADMET and Tox21 benchmarks and two internal anti-infective activity datasets. Across 167,056 held-out task–molecule evaluations under structure-similarity-separated five-fold cross-validation (37,756 ADMET, 77,946 Tox21, 49,266 anti-TB and 2,088 antimalaria), classical machine-learning (ML) models such as RF(ECFP4) and ExtraTrees(RDKit descriptors) win ten primary-metric tasks, GNNs such as GIN and Ligandformer win nine, and pretrained molecular sequence models such as MoLFormer and ChemBERTa2 win three. Rule-based SAR reasoning baselines, represented by GPT5.5-SAR and Opus4.7-SAR, do not win under the prespecified primary metrics, although train-fold-derived SAR knowledge provides measurable but uneven gains for SAR reasoning and interpretation. These results indicate that compact, specialized models remain highly effective for molecular property and activity prediction. The performance differences among classical ML, GNN and pretrained sequence models are often modest and endpoint-dependent, whereas larger or more general models do not provide a universal predictive advantage. Large models may still add value for zero-shot reasoning, SAR interpretation and hypothesis generation, but the results suggest that predictive performance depends on the alignment among molecular representation, inductive bias, data regime, endpoint biology and validation protocol.

[LG-22] Hierarchical adaptive control for real-time dynamic inference at the edge

链接: https://arxiv.org/abs/2604.26470
作者: Francesco Daghero,Mahyar Tourchi Moghaddam,Mikkel Baun Kjærgaard
类目: Machine Learning (cs.LG)
*备注: Accepted as paper at 5th Real-time And intelliGent Edge computing (RAGE 2026) workshop

点击查看摘要

Abstract:Industrial systems increasingly depend on Machine Learning (ML), and operate on heterogeneous nodes that must satisfy tight latency, energy, and memory constraints. Dynamic ML models, which reconfigure their computational footprint at runtime, promise high energy efficiency and lower average latency for modest accuracy tradeoffs; however, their deployment is complex due to the additional hyperparameters they rely on. These hyperparameters, controlling the accuracy versus average latency tradeoff, are often tuned on a calibration dataset that must match the test-time distribution, an assumption that rarely holds in real-world scenarios, leading to suboptimal operational conditions, possibly below static models. We propose a two-tier adaptive architecture that co-optimizes model and system decisions. At the global level, a scheduler configures and deploys, for each edge node, a cascade of classifiers composed of lightweight specialized models and a generalist fallback, satisfying latency and memory constraints. At the node level, a local controller tracks data drifts and hardware resources, enabling or disabling specialized predictors (SP) to preserve high energy efficiency and avoid latency-constraint violations under varying conditions. This design allows longer operating times without forcing a global redeployment step, and enables efficient execution in case of an unreachable remote global controller. We evaluate the approach on two datasets under controlled distribution mismatch scenarios, showing average per-inference reductions of latency up to 2.45x and energy up to 2.86x, with less than 4% accuracy drop compared to static baselines. Our contributions are: (1) a budgeted SP-cascade formulation that preserves worst-case latency constraints; (2) a hierarchical controller that maintains efficiency under data and resource changes; and (3) an experimental evaluation on embedded hardware.
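The cascade-with-generalist-fallback inference flow can be sketched as confidence-threshold early exit (toy predictors with an assumed threshold; the paper's controller additionally enforces latency and memory budgets):

```python
def cascade_predict(x, specialists, generalist, threshold=0.8):
    """Cascade inference: try the cheap specialized predictors first and
    return early when one is confident; otherwise fall back to the more
    expensive generalist model."""
    for sp in specialists:
        label, conf = sp(x)
        if conf >= threshold:
            return label, "specialist"   # early exit: cheap path
    return generalist(x)[0], "generalist"

# Toy predictors: each returns a (label, confidence) pair.
sp_small = lambda x: ("cat", 0.95) if x < 10 else ("cat", 0.3)
generalist = lambda x: ("dog", 0.99)

easy = cascade_predict(3, [sp_small], generalist)
hard = cascade_predict(42, [sp_small], generalist)
```

Average latency and energy drop because most inputs exit at a specialist, while worst-case latency stays bounded by the full chain, the property the budgeted formulation preserves.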

[LG-23] Near-Optimal Cryptographic Hardness of Learning With Homogeneous Halfspaces Under Gaussian Marginals

链接: https://arxiv.org/abs/2604.26446
作者: Jizhou Huang,Brendan Juba
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study three problems that involve identifying homogeneous halfspaces under Gaussian distributions: agnostic learning, one-sided reliable learning, and fairness auditing. In each of these problems, we are given labeled examples (x, y) drawn from an unknown distribution on R^d × {-1, +1}, whose marginal distribution on x is standard Gaussian and on y is arbitrary. The goal of each problem is to output a homogeneous halfspace that approaches the best-fitting homogeneous halfspace in terms of its corresponding loss measure. We prove near-optimal computational hardness results for these problems under the widely believed hardness assumption of the Learning With Errors (LWE) problem. Prior hardness results for these problems were mostly established for general halfspaces; our findings extend some of these hardness results to homogeneous halfspaces. Remarkably, our lower bound strictly generalizes over prior works and narrows the gap between the upper and lower bounds for agnostically learning homogeneous halfspaces under Gaussian marginals.

[LG-24] Layer-wise Lipschitz-Product Control for Deep Kolmogorov–Arnold Network Representations of Compositionally Structured Functions

链接: https://arxiv.org/abs/2604.26444
作者: Aleksander Tankman
类目: Machine Learning (cs.LG)
*备注: 15 pages, theoretical note on layer-wise Lipschitz control for deep KANs

点击查看摘要

Abstract:We prove that any continuous function f from [0,1]^n to R representable by a finite computation tree with N internal nodes and compositional sparsity s = O(1) admits a deep Kolmogorov-Arnold Network (KAN) representation. Each internal node is realised by a primitive KAN block with controlled block depth and Lipschitz product. The layer-wise Lipschitz product satisfies the primary domain-sensitive bound independent of the input dimension n, and simplifies to P(KAN_f) = max(C*, 1)^L_f with L_f = c_max * N. For the standard operations +, -, ×, sin, cos with multiplication (×) nodes on [0,1]-bounded inputs we obtain P(KAN) = 1. Layer widths satisfy n_l = n + 2 w_max * N. The uniform approximation error is bounded by N * max(C*, 1)^d(f) * epsilon_Op (which simplifies when C* = 1). For f in C^m we obtain optimal B-spline rates. Range bounds are also derived (B_f = N + 1 for additive trees). This addresses the gap on Lipschitz control in deep KAN stacks noted by Liu et al. (2024). Experiments confirm P(KAN) = 1.0 for several compositionally structured functions.

[LG-25] Unifying Runtime Monitoring Approaches for Safety-Critical Machine Learning: Application to Vision-Based Landing ICPR2026

链接: https://arxiv.org/abs/2604.26411
作者: Mathieu Dario,Florent Chenevier,Kévin Delmas,Joris Guerin,Jérémie Guiochet
类目: Machine Learning (cs.LG)
*备注: 15 pages, 5 figures, 3 tables, submitted to ICPR 2026

点击查看摘要

Abstract:Runtime monitoring is essential to ensure the safety of ML applications in safety-critical domains. However, current research is fragmented, with independent methods emerging from different communities. In this paper, we propose a unified framework categorizing runtime monitoring approaches into three distinct types: Operational Design Domain (ODD) monitoring, which ensures compliance with expected operating conditions; Out-of-Distribution (OOD) monitoring, which rejects inputs that deviate from the training data; and Out-of-Model-Scope (OMS) monitoring, which detects anomalous model behaviour based on its internal states or outputs. We demonstrate the benefits of this categorization with a dedicated experiment on an aeronautical safety-critical application: runway detection during landing. This framework facilitates the design of monitoring activities with complementary categories of monitors, and enables evaluation and comparison of different monitors using common, safety-oriented metrics.

[LG-26] SplitFT: An Adaptive Federated Split Learning System for LLM Fine-Tuning

链接: https://arxiv.org/abs/2604.26388
作者: Yimeng Shan,Zhaorui Zhang,Sheng Di,Yu Liu,Xiaoyi Lu,Benben Liu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Split Learning has been identified as an efficient approach to addressing the computational resource constraints of clients in classical federated learning, while guaranteeing data privacy for distributed model training across data owners. However, it faces critical challenges when this training strategy meets large language models (LLMs) for fine-tuning. These challenges include setting the cut layer adaptively across different clients to handle data and device heterogeneity, which significantly affects system performance, and efficiently reducing the communication overhead of the fine-tuning procedure. No prior work addresses these challenges. To bridge this gap, we propose SplitFT, an adaptive federated split learning system for LLM fine-tuning. SplitFT enables different clients to set different cut layers according to their computation resources and trained model performance, and reduces the LoRA rank at the cut layer to lower the communication overhead. In addition, to simulate the heterogeneous data of real-world applications in our proposed split federated learning system, we propose a length-based Dirichlet approach to partition the training data across clients. Extensive experimental results show that our approach outperforms the state-of-the-art approaches in fine-tuning time efficiency and model performance on various popular benchmarks.

[LG-27] CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs

链接: https://arxiv.org/abs/2604.26378
作者: Zhe Ding,Su Pan,Duowei Pan
类目: Machine Learning (cs.LG)
*备注: 14 pages, 3 figures

点击查看摘要

Abstract:Post-training quantization (PTQ) has become an important technique for reducing the inference cost of Large Language Models (LLMs). While recent mixed-precision methods improve ultra-low bit quantization by preserving critical subspaces in high precision, they typically construct these subspaces relying solely on activation statistics. This ignores the fundamental nature of linear operations, where the output perturbation is jointly driven by both activation and weight quantization noise. In this paper, we propose CoQuant, a joint weight-activation subspace projection method. By theoretically modeling the expected output error, CoQuant formulates a closed-form weighted PCA solution that balances activation and weight covariances to select the optimal high-precision subspace. Extensive experiments on Llama-3.2 and Qwen2.5 models show that CoQuant consistently outperforms strong PTQ baselines in both WikiText perplexity and zero-shot common-sense reasoning accuracy. These results demonstrate that joint weight-activation subspace modeling provides a principled and effective direction for low-bit LLM quantization. The source code is available at this https URL.
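The joint subspace selection can be sketched as a weighted PCA over activation and weight covariances. The plain convex combination and the mixing coefficient `alpha` below are assumptions for illustration, not CoQuant's derived closed-form weighting, and random matrices stand in for real calibration data.

```python
import numpy as np

# Sketch of choosing a high-precision subspace from a weighted combination
# of activation and weight covariances, in the spirit of CoQuant. The plain
# convex combination and alpha = 0.5 are illustrative assumptions; the
# paper derives its own closed-form weighting.

rng = np.random.default_rng(0)
d, k = 16, 4                      # hidden dim, subspace kept in high precision
X = rng.normal(size=(256, d))     # calibration activations (placeholder)
W = rng.normal(size=(64, d))      # weight rows of the linear layer

cov_act = X.T @ X / len(X)        # activation covariance
cov_w = W.T @ W / len(W)          # weight covariance
alpha = 0.5                       # assumed mixing coefficient
cov_joint = alpha * cov_act + (1 - alpha) * cov_w

# The top-k eigenvectors of the joint covariance span the protected subspace.
eigvals, eigvecs = np.linalg.eigh(cov_joint)
U = eigvecs[:, -k:]               # (d, k) orthonormal basis
```

Directions outside the span of `U` would then be quantized at low precision, while the `k` protected directions stay in high precision.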

[LG-28] Asymptotically Robust Learning-Augmented Algorithms for Preemptive FIFO Buffer Management

链接: https://arxiv.org/abs/2604.26349
作者: Wen-Han Hsieh,Ya-Chun Liang
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a learning-augmented online algorithm for the preemptive FIFO buffer management problem, where packets arrive online to a finite-capacity buffer, must be transmitted in FIFO order, and the algorithm may preemptively discard buffered packets to accommodate future arrivals. Our algorithm simultaneously achieves 1-consistency, \eta-smoothness, and asymptotic \sqrt{3}-robustness, where \eta denotes the prediction error. Specifically, it attains an optimal competitive ratio of 1 under perfect predictions, degrades smoothly as the prediction error increases, and maintains an asymptotic competitive ratio of \sqrt{3} under arbitrarily inaccurate predictions, matching the best-known worst-case guarantee for the classical online problem, established by Englert and Westermann in 2009 [Algorithmica 53(4): 523-548]. A key technical contribution of our work is the introduction of an \emph{output-based prediction error metric}. Because capacity constraints dictate that only a strictly bounded subset of arriving packets is ultimately transmitted, our metric assesses prediction quality over the resulting optimal schedules rather than the raw input sequences, avoiding artificial error penalties. To guarantee robustness, our algorithm dynamically monitors predictions and executes a \emph{buffer-clearing strategy} upon transitioning to a worst-case fallback mechanism. We prove that the competitive loss incurred by this clearing operation is bounded by an additive capacity constant that vanishes asymptotically. Finally, we show that our algorithm provides a generalized framework for learning-augmented buffer management: substituting the fallback module with any \beta-competitive online algorithm immediately yields asymptotic \beta-robustness.
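To make the problem setting concrete, here is a toy version of preemptive FIFO buffer management. The greedy preempt-the-minimum rule below is a simplification for illustration only, not the paper's learning-augmented algorithm, and the packet values are made up.

```python
# Toy sketch of preemptive FIFO buffer management (NOT the paper's
# algorithm): packets carry values, the buffer holds at most B packets in
# FIFO order, and a lowest-value buffered packet may be preemptively
# dropped to admit a more valuable arrival.

B = 3
buf = []                       # FIFO order: index 0 transmits first
arrivals = [5, 1, 2, 9, 4]     # packet values, illustrative only
for v in arrivals:
    if len(buf) < B:
        buf.append(v)          # room left: admit the packet
    elif min(buf) < v:
        buf.remove(min(buf))   # preempt the least valuable packet
        buf.append(v)
transmitted = list(buf)        # remaining packets, in FIFO order
```

A learning-augmented variant would use predictions to decide admissions and preemptions, falling back to a worst-case-competitive rule when predictions look unreliable.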

[LG-29] Adaptive and Fine-grained Module-wise Expert Pruning for Efficient LoRA-MoE Fine-Tuning

链接: https://arxiv.org/abs/2604.26340
作者: Weihang Li,Jianchun Liu,Hongli Xu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:LoRA-MoE has emerged as an effective paradigm for parameter-efficient fine-tuning, combining the low training cost of LoRA with the increased adaptation capacity of Mixture-of-Experts (MoE). However, existing LoRA-MoE frameworks typically adopt a fixed and uniform expert configuration across heterogeneous Transformer modules (e.g., attention query/key projections and MLP gating networks), ignoring their distinct functional roles and capacity requirements. This design leads to localized over-provisioning, redundant trainable parameters, and unnecessary optimizer-state overhead. Moreover, prior methods enforce load balancing among experts throughout training. Although beneficial in the early stage, this constraint becomes restrictive once routing patterns stabilize, limiting expert specialization on downstream tasks. In this paper, we propose DMEP, a novel LoRA-MoE fine-tuning framework based on Dynamic Module-wise Expert Pruning. DMEP tracks expert utilization during training and physically removes low-utility experts on a per-module basis, yielding a more compact expert structure tailored to different modules. The pruned model then continues training without the load-balancing constraint, freeing the remaining experts to focus entirely on the downstream task and develop specialized expertise. By jointly adapting module-wise expert capacity and eliminating unnecessary balancing, DMEP improves both parameter efficiency and training efficiency. Extensive experiments on multiple reasoning benchmarks show that DMEP reduces trainable parameters by 35%–43% and improves training throughput by about 10%, while maintaining or surpassing the downstream reasoning accuracy of uniform LoRA-MoE baselines.

[LG-30] AlphaJet: Automated Conceptual Aircraft Synthesis via Disentangled Generative Priors and Topology-Preserving Evolutionary Search

链接: https://arxiv.org/abs/2604.26337
作者: Boris Kriuk
类目: Machine Learning (cs.LG)
*备注: 10 pages, 2 figures, 1 table

点击查看摘要

Abstract:Conceptual aircraft design is traditionally an expert-mediated iterative process in which a human designer proposes a configuration, runs low-order physics, inspects the result, and re-proposes. We present AlphaJet, an end-to-end automated synthesis pipeline that closes this loop. From a textual mission specification (mass, range, cruise speed, hard size envelope, engine count, areal density) AlphaJet evolves a feasible 3D aircraft in real time, scored by a transparent multi-disciplinary fitness function covering aerodynamics, structures, weights, stability, packaging, and geometric mount consistency. Three contributions distinguish our approach: (i) an Anatomically-Disentangled Variational Autoencoder (AD-VAE) whose first 25 latent dimensions are supervised to align with named anatomical parameters, providing an interpretable shape prior; (ii) a topology-elitist genetic algorithm that protects the best individual from each of five tail topologies and triggers stagnation restarts, preventing premature collapse to a single configuration; and (iii) mount-aware geometric scoring that computes signed penetration between engines and other structural parts, eliminating the redundant artifacts common in generative aircraft models. The full loop runs interactively on a CPU and streams every generation to a browser viewer, making it a practical real-world automation tool for early-phase design-space exploration.

[LG-31] Efficient VRAM-Constrained xLM Inference on Clients

链接: https://arxiv.org/abs/2604.26334
作者: Aditya Ukarande,Deep Shekhar,Marc Blackstein,Ram Rangan
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Accepted at MLSys 2026 (Industry Track). 17 pages, 7 figures, 9 tables. Code and artifacts available at: this https URL

点击查看摘要

Abstract:To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on client systems. To address this, we present pipelined sharding, a novel, benchmark-profile-guided CPU-GPU hybrid scheduling technique to achieve efficient, VRAM-constrained inference for both dense and mixture-of-experts (MoE) LLMs. Using a combination of model sharding at the sub-layer level, CPU offloading, pipelined copy-compute, and prioritized tensor placement in VRAM, it optimizes both time-to-first-token (TTFT) and tokens per second (TPS) metrics, while flexibly adapting to system and inference conditions. For efficient, high-accuracy VLM inference, we combine pipelined sharding with a this http URL implementation of three well-understood prior ideas (jointly called VLMOpt), namely, vision tensor CPU offloading, flash attention, and vision and language model VRAM overlap avoidance. These enhancements are targeted at improving client xLM inference in future releases of two important NVIDIA products - the In-Game Inferencing software development kit (IGI SDK) and the Cosmos-Reason1 (CR1) physical AI reasoning VLM. Highlights from our rigorous evaluation spanning multiple models and client systems include: for interactive use, TTFT improves by up to 6.7x and TPS by up to 30x for LLMs, and CR1 inference’s VRAM demand is down by 10x, while in batched mode, throughput improves by up to 8.2x, all compared to their respective aggressive baselines. This paper is accepted at the 9th MLSys Conference (Industry Track), 2026. Code and artifact available at: this https URL

[LG-32] VulStyle: A Multi-Modal Pre-Training for Code Stylometry-Augmented Vulnerability Detection DSN2026

链接: https://arxiv.org/abs/2604.26313
作者: Chidera Biringa,Ajmal Abbas,Vishnu Selvaraj,Gokhan Kul
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 12 pages, 2 figures. Accepted at the 56th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2026)

点击查看摘要

Abstract:We present VulStyle, a multi-modal software vulnerability detection model that jointly encodes function-level source code, non-terminal Abstract Syntax Tree (AST) structure, and code stylometry (CStyle) features. Prior work in code representation primarily leverages token-level models or full AST trees, often missing stylistic cues indicative of risky programming practices, or incurring high structural overhead. Our approach selects only non-terminal AST nodes, reducing input complexity while preserving semantic hierarchy, and integrates syntactic and lexical CStyle features as auxiliary vulnerability signals. VulStyle is pre-trained using masked language modeling on 4.9M functions across seven programming languages, and fine-tuned across five benchmark datasets: Devign, BigVul, DiverseVul, REVEAL, and VulDeePecker. VulStyle achieves state-of-the-art performance on BigVul and VulDeePecker, improving F1 by 4-48% over strong transformer baselines, and attains competitive or best-average performance across all benchmarks. We contribute an ablation study isolating the effect of CStyle and AST structure, error case analysis, and a threat model situating the detection task in attacker-realistic scenarios.

[LG-33] Cheeger–Hodge Contrastive Learning for Structurally Robust Graph Representation Learning

链接: https://arxiv.org/abs/2604.26301
作者: Mengyang Zhao,Longlong Li,Cunquan Qu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Contrastive Learning (GCL) has emerged as a prominent framework for unsupervised graph representation learning. However, relying on augmentation design alone to define the invariances learned by GCL can be brittle under structural perturbations. To address this issue, we propose Cheeger–Hodge Contrastive Learning (CHCL), a framework that aligns a perturbation-stable Cheeger–Hodge joint signature across augmented views for robust graph representation learning. The proposed signature combines a Cheeger-inspired connectivity signature derived from the algebraic connectivity (\lambda_2) with the low-frequency spectrum of the 1-Hodge Laplacian, thereby capturing both global connectivity and higher-order structural information. By aligning encoder representations with the proposed Cheeger–Hodge joint signature across augmented views, CHCL learns graph embeddings that are robust to local structural perturbations. Extensive experiments on standard benchmarks and transfer settings demonstrate that CHCL consistently improves performance, robustness, and generalization.
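The Cheeger-inspired part of the signature rests on the algebraic connectivity \lambda_2, the second-smallest eigenvalue of the graph Laplacian. A minimal sketch on a toy 4-cycle graph (an arbitrary example, not from the paper):

```python
import numpy as np

# Sketch of the Cheeger-inspired connectivity signature: the algebraic
# connectivity lambda_2 of the combinatorial graph Laplacian, computed
# here for a toy 4-cycle (an arbitrary example, not the paper's data).

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)   # adjacency of a 4-cycle
L = np.diag(A.sum(axis=1)) - A              # combinatorial Laplacian
eigvals = np.sort(np.linalg.eigvalsh(L))
lambda_2 = eigvals[1]                       # algebraic connectivity
```

For a connected graph the smallest eigenvalue is 0 and \lambda_2 > 0; larger \lambda_2 indicates a harder-to-cut graph, which is what makes it useful as a perturbation-stable signal.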

[LG-34] NeuroPlastic: A Plasticity-Modulated Optimizer for Biologically Inspired Learning Dynamics

链接: https://arxiv.org/abs/2604.26297
作者: Douglas Jiang,Yuechen Wang,Jiayi Wang,Jiaying Geng,Qinglong Wang,Feng Tian
类目: Machine Learning (cs.LG)
*备注: 16 pages, 7 figures

点击查看摘要

Abstract:Optimization algorithms are fundamental to modern deep learning, yet most widely used methods rely on update rules based primarily on local gradient statistics. We introduce NeuroPlastic, a plasticity-modulated optimizer that augments gradient-based updates with an adaptive multi-signal modulation mechanism inspired by multi-factor synaptic plasticity, a concept from neurobiology. NeuroPlastic dynamically scales gradient updates using interacting components that capture gradient, activity-like, and memory-like statistics, forming a lightweight modulation layer compatible with standard deep learning training pipelines. Across image classification benchmarks, NeuroPlastic consistently improves over a controlled gradient-only ablation, with more pronounced gains on the Fashion-MNIST benchmark and in reduced-data regimes. In transfer experiments on CIFAR-10 with ResNet-18, the method remains stable and competitive without retuning. These results suggest that multi-signal plasticity-inspired modulation can provide a useful extension to conventional gradient-driven optimization, particularly when learning signals are limited or noisy, and offer a promising direction for gradient-based methods in deep learning.

[LG-35] DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

链接: https://arxiv.org/abs/2604.26256
作者: Tianhao Hu,Xiangcheng Liu,Youshao Xiao,Yang Zheng,Xuan Huang,Jinrui Ding,Yufei Zhang,Tao Liang,Hongyu Zang,Quan Chen,Yueqing Sun,Wenjie Shi,Chao Zhang,Wei Wang,Qi Gu,Yerui Sun,Yucheng Xie,Xunliang Cai
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has become a critical paradigm for LLM post-training, yet the rollout phase – accounting for 50–80% of total step time – is bottlenecked by skewed generation: long-tailed trajectories indispensable for model performance block the entire training pipeline. Asynchronous training offers a natural remedy by overlapping generation with training, but introduces a fundamental tension between efficiency and algorithmic correctness. We identify three constraints in asynchronous training to preserve convergence: intra-trajectory policy consistency, data integrity, and bounded staleness. Existing approaches fail to intrinsically address the long-tailed trajectory problem, which is further exacerbated by the imbalance characteristic of Mixture-of-Experts models, or deviate from the standard RL training formulation, thereby hindering model convergence. Therefore, we propose DORA (Dynamic ORchestration for Asynchronous Rollout), which addresses this challenge through algorithm-system co-design. DORA introduces multi-version streaming rollout, a novel asynchronous paradigm that maintains multiple policy versions concurrently – simultaneously achieving full bubble elimination without compromising algorithmic constraints. Experimental results demonstrate that our DORA system achieves substantial improvements in throughput – up to 2–3 times higher than state-of-the-art systems on open-source benchmarks – without compromising convergence. Furthermore, in large-scale industrial applications with tens of thousands of accelerators, DORA accelerates RL training by 2–4 times compared to synchronous training across various scenarios. The resultant open-source models, LongCat-Flash-Thinking, exhibit competitive performance on complex reasoning benchmarks, matching the capability of the most advanced LLMs.

[LG-36] Recurrence-Based Nonlinear Vocal Dynamics as Digital Biomarkers for Depression Detection from Conversational Speech

链接: https://arxiv.org/abs/2604.26242
作者: Himadri S Samanta
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 12 pages, 5 figures

点击查看摘要

Abstract:Digital biomarkers for depression have largely relied on static acoustic descriptors, pooled summary statistics, or conventional machine learning representations. Such approaches may miss nonlinear temporal organization embedded in conversational vocal dynamics. We hypothesized that depression is associated with altered recurrence structure in vocal state trajectories, reflecting changes in how the vocal system revisits acoustic states over time. Using the depression subset of the DAIC-WOZ corpus with 142 labeled participants, we modeled frame-level COVAREP trajectories as nonlinear dynamical systems and derived recurrence-based biomarkers from 74 vocal channels. Logistic regression with feature selection and stratified cross-validation evaluated classification performance. Recurrence-based biomarkers achieved a mean cross-validated AUC of 0.689, exceeding static acoustic baselines, entropy-dynamics features, Hurst exponent features, determinism features, and Lyapunov-like instability proxies. Permutation testing indicated statistical significance with p=0.004 . Pooled cross-validated predictions yielded AUC 0.665 with a 95% bootstrap confidence interval of [0.568, 0.758]. These findings suggest that depression may be characterized by altered recurrence structure in conversational vocal dynamics and support nonlinear state-space analysis as a promising direction for digital psychiatric biomarkers.
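A basic recurrence statistic of the kind this line of work builds on is the recurrence rate: the fraction of state pairs closer than a radius eps. The sine/cosine trajectory and the value of eps below are illustrative, not the paper's COVAREP-based setup.

```python
import numpy as np

# Sketch of a basic recurrence statistic: the recurrence rate, i.e. the
# fraction of state pairs within radius eps. The periodic trajectory and
# eps are illustrative, not the paper's COVAREP-based configuration.

t = np.linspace(0, 4 * np.pi, 200)
traj = np.column_stack([np.sin(t), np.cos(t)])   # a periodic state path
eps = 0.1

dists = np.linalg.norm(traj[:, None, :] - traj[None, :, :], axis=-1)
R = (dists <= eps).astype(float)                 # recurrence matrix
recurrence_rate = R.mean()                       # density of recurrences
```

Full recurrence quantification would derive further measures (determinism, laminarity, entropy of diagonal-line lengths) from the same matrix `R`.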

[LG-37] DySec: A Deep Learning-based Explainable Dynamic Analysis Framework for Detecting Malicious Packages in PyPI Ecosystem

链接: https://arxiv.org/abs/2604.26219
作者: Sk Tanzir Mehedi,Raja Jurdak,Chadni Islam,Abu Bakar Siddique Mahi,Gowri Ramachandran
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 12 Pages, 11 Figures, and 5 Tables

点击查看摘要

Abstract:The security of open-source software repositories is increasingly threatened by next-gen software supply chain attacks. These attacks include multiphase malware execution, remote access activation, and dynamic payload generation. Traditional Machine Learning (ML) detectors struggle to detect these attacks due to the high-dimensional and sparse nature of dynamic behavioral data, including system calls, network traffic, directory access patterns, and dependency logs. As a result, these data characteristics degrade the performance, stability, and explainability of ML models. These challenges have made Deep Learning (DL) a promising alternative, given its success across various domains and its potential for modeling complex patterns. This paper presents eDySec, a DL-based efficient, stable, and explainable framework for dynamic behavioral analysis to detect malicious packages. Using the QUT-DV25 dataset, which captures both install-time and post-installation behaviors of packages, we evaluate DL models and investigate feature sets to identify the most discriminative attributes for enabling efficient malicious package detection. Additionally, model stability analysis and explainable AI techniques are incorporated into the detection pipeline to enable stable, and transparent interpretations of model decisions. Experimental results demonstrate that eDySec significantly outperforms the state-of-the-art frameworks. Specifically, it halves feature dimensionality while lowering false positives by 82% and false negatives by 79%. It also improves accuracy by 3%, achieves near-perfect stability, and maintains an inference latency of 170ms per package. Further analysis reveals that feature and model selection play a critical role, as certain combinations degrade performance. Ultimately, this study advances the understanding of the strengths and limitations of dynamic analysis against next-gen attacks.

[LG-38] Unsupervised Graph Modeling for Anomaly Detection in Accounting Subject Relationships

链接: https://arxiv.org/abs/2604.26216
作者: Yuhan Wang,Ruobing Yan,Zhe Su,Hejing Chen,Ningjing Sang,Yunfei Nie
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper addresses anomaly detection in accounting-subject association structures, proposing a structured modeling and unsupervised discrimination framework based on graph neural networks. The framework mines stable correspondences between subjects and identifies structural deviations from general-ledger details and voucher entries. The method abstracts accounting subjects as graph nodes, and the co-occurrence and debit/credit correspondence of subjects within the same business record as weighted edges; edge weights are characterized by statistical measures such as co-occurrence frequency or amount aggregation, forming a period-level accounting-subject association graph. In the representation-learning stage, a message-passing mechanism fuses each node's own attributes with its neighborhood context to obtain node embeddings that encode structural information. In the anomaly-detection stage, a relation-reconstruction decoder estimates the plausibility of subject-pair connections, edge-level anomaly scores are defined by the deviation of the reconstruction probabilities, and these scores are aggregated into node-level risk rankings and local anomaly localization. Without relying on anomaly labels, the framework captures both local substructure anomalies and cross-community anomalous connections, and outputs traceable subject-pair risk clues. Comparative experiments demonstrate more stable overall discrimination and higher top-ranking accuracy.
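The edge-level anomaly score described above can be sketched with an inner-product decoder: the decoder assigns each subject-pair edge a reconstruction probability, and the anomaly score is its negative log. The random embeddings stand in for the trained GNN encoder, and the sigmoid decoder is an assumption for illustration.

```python
import numpy as np

# Sketch of the edge-level anomaly score: an inner-product decoder gives a
# reconstruction probability for each subject-pair edge, and the score is
# the negative log-probability. Random embeddings stand in for the trained
# GNN encoder; the sigmoid decoder is an illustrative assumption.

rng = np.random.default_rng(1)
n_nodes, dim = 8, 4
Z = rng.normal(size=(n_nodes, dim))          # node embeddings (placeholder)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

edges = [(0, 1), (2, 3), (0, 7)]             # observed subject-pair edges
p = np.array([sigmoid(Z[u] @ Z[v]) for u, v in edges])
edge_anomaly = -np.log(p)                    # implausible edge -> high score
```

Node-level risk would then be an aggregate (e.g. mean or max) of the scores of a node's incident edges.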

[LG-39] Efficient and Interpretable Transformer for Counterfactual Fairness

链接: https://arxiv.org/abs/2604.26188
作者: Panyi Dong,Zhiyu Quan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The growing reliance of machine learning models in high-stakes, highly regulated domains such as finance and insurance has created a growing tension between predictive performance, interpretability, and regulatory fairness requirements. In these settings, models are expected not only to deliver reliable predictions but also to provide transparent decision rationales and comply with strict fairness requirements. Attention-based transformers offer powerful mechanisms for modeling complex data relationships as demonstrated in various language tasks, yet their attention mechanisms alone do not ensure counterfactually fair predictions, even when combined with fairness-aware techniques. To address these limitations, we propose the Feature Correlation Transformer (FCorrTransformer), an attention-light architecture tailored for tabular data. In this design, the attention matrix admits a direct statistical interpretation as pairwise feature dependencies, enhancing both interpretability and efficiency. Leveraging this structure, we introduce Counterfactual Attention Regularization (CAR), a framework that enforces group-invariant fair representations of sensitive features at the attention level, promoting counterfactually fair predictions without relying on explicit causal assumptions. Empirical evaluations on imbalanced classification and regression benchmarks demonstrate that FCorrTransformer combined with CAR achieves strong counterfactual fairness while maintaining competitive predictive performance and substantially reducing model complexity compared with standard transformer-based baselines. Overall, this work bridges a critical gap between fairness theory and machine learning models, offering a practical framework for responsible AI in regulatory-sensitive domains.

[LG-40] SWAN: World-Aware Adaptive Multimodal Networks for Runtime Variations

链接: https://arxiv.org/abs/2604.26181
作者: Jason Wu,Shir-Kang Scott Jinn,Yuyang Yuan,Maggie Wigness,Lance M. Kaplan,Hang Qiu,Mani Srivastava
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal deep neural networks deployed in realistic environments must contend with runtime variations: changes in modality quality, overall input complexity, and available platform resources. Current networks struggle with such fluctuations – adaptive networks cannot adhere to a strict compute budget, controller-based networks neglect to consider input complexity, and statically provisioned networks fail at all the above. Consequently, they do not extract maximum utility from the expended computational resources. We present SWAN (Sample and World-Aware Multimodal Network), the first adaptive multimodal network that accomplishes all three goals. SWAN employs a quality-aware controller to assign resources among modalities according to a variable user-specified maximum budget. Within this budget, an adaptive gating module further optimizes efficiency by scaling layer utilization according to sample complexity. For further gains, SWAN also employs a token dropping module that masks semantically irrelevant multimodal features before performing detections. We evaluate SWAN in the domain of autonomous driving with complex multi-object 3D detection, reducing FLOPs by up to 49% with minimal degradation.

[LG-41] Budget-Constrained Causal Bandits: Bridging Uplift Modeling and Sequential Decision-Making

链接: https://arxiv.org/abs/2604.26169
作者: Abhirami Pillai
类目: Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
*备注: 12 pages, 2 figures, preprint

点击查看摘要

Abstract:Treatment allocation under budget constraints is a central challenge in digital advertising: advertisers must decide which users to show ads to while spending a limited budget wisely. The standard approach follows a two-stage offline pipeline - first collect historical data to estimate heterogeneous treatment effects (HTE), then solve a constrained optimization to allocate the budget. This works well with abundant data, but fails in cold-start settings such as new campaigns, new markets, or new customer segments where little historical data exists. We propose Budget-Constrained Causal Bandits (BCCB), an online framework that learns which users respond to ads while simultaneously spending the budget, making treatment decisions one user at a time. BCCB unifies three components into a single sequential process: learning individual-level ad effectiveness, exploring users whose response is uncertain, and pacing the budget over time. We evaluate BCCB on the Criteo Uplift dataset, a large-scale advertising dataset from a real randomized controlled trial. Our key finding is a data-efficiency crossover: offline methods require approximately 10,000 historical observations to produce reliable results, while BCCB operates effectively from the very first user. Furthermore, BCCB exhibits 3-5x lower performance variance between runs, making it more practical for real campaign planning. Among purely online methods, BCCB consistently outperforms standard Thompson Sampling, budgeted Thompson Sampling, and greedy HTE estimation across all budget levels tested.
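The one-user-at-a-time loop can be sketched as budgeted Beta-Bernoulli Thompson sampling: treat a user only when the sampled uplift is positive and budget remains. The priors, unit cost, and response rates below are illustrative assumptions, not BCCB's actual specification.

```python
import numpy as np

# Toy sketch of budgeted uplift-aware Thompson sampling in the spirit of
# BCCB. Priors, cost, budget, and true response rates are illustrative
# assumptions, not the paper's specification.

rng = np.random.default_rng(42)
budget, cost = 50, 1.0
a_t, b_t, a_c, b_c = 1.0, 1.0, 1.0, 1.0      # Beta posteriors: treat / control
true_p_t, true_p_c = 0.3, 0.1                # hidden response rates
spent = 0.0

for _ in range(500):                          # users arrive one at a time
    uplift = rng.beta(a_t, b_t) - rng.beta(a_c, b_c)   # sampled uplift
    if uplift > 0 and spent + cost <= budget:
        r = rng.random() < true_p_t           # treated user's response
        a_t, b_t = a_t + r, b_t + (1 - r)
        spent += cost
    else:
        r = rng.random() < true_p_c           # untreated user's response
        a_c, b_c = a_c + r, b_c + (1 - r)
```

A full pacing mechanism would also spread spending over the campaign horizon rather than allowing the budget to be exhausted early.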

[LG-42] Spatially-constrained clustering of geospatial features for heat vulnerability assessment of favelas in Rio de Janeiro ICLR

链接: https://arxiv.org/abs/2604.26133
作者: Baptiste Clemence,Thomas Hallopeau,Vanderlei Pascoal De Matos,Laurent Demagistri,Joris Guerin
类目: Machine Learning (cs.LG)
*备注: Workshop Publication (ICLR ML4RS 2026)

点击查看摘要

Abstract:Informal settlements face disproportionate exposure to climate-related health hazards. However, existing methodologies lack systematic approaches to link diverse settlement characteristics with environmental health outcomes. We develop a data-driven framework to assess heat vulnerability in Rio de Janeiro’s favelas by combining spatially-constrained clustering with land surface temperature (LST) analysis. Using remote sensing and geospatial features, we identify two distinct favela typologies: recent, well-connected settlements on flat terrain (Cluster 0) and historical, poorly-connected communities on vegetated slopes (Cluster 1). Analysis of 16 extreme heat events reveals systematic temperature differences of 2–3 °C between clusters, with flat-terrain favelas experiencing significantly higher heat exposure. Our findings demonstrate that settlement morphology critically influences heat vulnerability, providing a replicable framework for targeted urban planning and public health interventions in informal settlements globally.
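A spatial-contiguity constraint is what distinguishes this clustering from ordinary feature clustering: only adjacent units may merge, so every cluster stays geographically contiguous. A minimal numpy-only sketch under assumed toy data and a chain adjacency (real inputs would be favela polygons with remote-sensing features):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
# two feature blobs standing in for two settlement typologies
features = np.r_[rng.normal(0, 0.1, (5, 2)), rng.normal(3, 0.1, (5, 2))]
# chain adjacency: unit i touches unit i+1 (e.g. neighbouring polygons)
adjacent = {(i, i + 1) for i in range(n - 1)}

labels = np.arange(n)                     # every unit starts as its own cluster
def centroid(c): return features[labels == c].mean(axis=0)

k = 2
while len(set(labels)) > k:
    # candidate merges: only pairs of clusters that share a spatial border
    pairs = {(labels[i], labels[j]) for i, j in adjacent if labels[i] != labels[j]}
    # merge the adjacent cluster pair whose feature centroids are closest
    a, b = min(pairs, key=lambda p: np.linalg.norm(centroid(p[0]) - centroid(p[1])))
    labels[labels == b] = a

print(sorted(set(labels)), labels)
```

With the two well-separated blobs above, the constrained agglomeration recovers two contiguous clusters, analogous to the paper's Cluster 0 / Cluster 1 split.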

[LG-43] NeuralEmu: in situ Measurement-Driven ML-based High-Fidelity 5G Network Emulation

链接: https://arxiv.org/abs/2604.26080
作者: Haoran Wan,Yaxiong Xie,Kyle Jamieson
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current and future applications demand ultra-low latency and consistent throughput, yet frequently traverse 5G cellular networks, and so must cope with volatile packet dynamics, as 5G base station schedulers dynamically react to user workloads and wireless channel conditions. The task of evaluating network algorithms in these environments is hamstrung by current tools: record-and-replay emulators sever the feedback interaction that exists between application end points and a commercial operator’s proprietary 5G scheduler, while full-stack simulators rely on overly simplistic scheduling logic. To bridge this reality gap, we present NeuralEmu, a high-fidelity, machine learning-based emulation framework that learns complex 5G scheduler resource allocation behaviors directly from extremely high-resolution network telemetry tools. The first emulator to handle multiple clients, NeuralEmu utilizes machine learning to dynamically predict resource block allocations and modulation schemes based on instantaneous user buffer occupancy and channel states. To capture realistic cross-user contention, a traffic reconstruction model inverts cellular network scheduling results to recover the underlying traffic patterns of uncontrolled background users. Implemented as a high-performance Linux middlebox emulator, NeuralEmu reduces emulation error relative to the state of the art across various network applications: by 55% for web-page load time, 57% for WebRTC encoder bit rate, and 51% for cloud gaming packet one-way delay, providing an accurate, standardized testing ground for tomorrow’s real-time interactive network protocols and applications.

[LG-44] PPG-Based Affect Recognition with Long-Range Deep Models: A Measurement-Driven Comparison of CNN Transformer and Mamba Architectures

链接: https://arxiv.org/abs/2604.26078
作者: Karim Alghoul,Hussein Al Osman,Abdulmotaleb El Saddik
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Photoplethysmography (PPG) is increasingly used in wearable affective computing due to its low cost and ease of integration into consumer devices. Recent advances in deep learning have introduced long-range sequence models, such as Transformers, and state-space models, like Mamba, which have demonstrated strong performance on natural language and general time-series tasks. However, it remains unclear whether these architectures offer tangible benefits over widely used Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTMs) for PPG-based affect recognition, given that datasets are typically small and noisy. This work presents a measurement-driven comparison of four deep learning architectures, CNN, CNN-LSTM hybrid, Transformers, and Mamba, for classifying arousal, valence, and relaxation states from wrist-based PPG signals. All models are evaluated under a subject-independent 5-fold cross-validation protocol using identical preprocessing, segmentation, and training pipelines. Our results show that the Transformer and Mamba models achieve performance comparable to that of a CNN baseline, but do not consistently outperform it across all tasks. CNNs remain the most effective overall, providing the highest accuracy with the smallest model size, whereas Transformers have a better balance of F1 scores for Arousal and Relaxation. The study provides the first evaluation of Transformer and Mamba models for PPG-based affect recognition, offering practical guidance on model selection for wearable affective monitoring systems.
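The subject-independent 5-fold protocol means every window from a given subject lands in exactly one fold, so no subject's data appears in both train and test. A minimal sketch with synthetic subject IDs (the paper evaluates PPG segments and deep models; this only shows the split logic):

```python
import numpy as np

rng = np.random.default_rng(0)
subjects = np.repeat(np.arange(15), 20)       # 15 subjects, 20 PPG windows each
unique_subj = np.unique(subjects)
# assign whole subjects (not windows) to folds
fold_of_subj = {s: i % 5 for i, s in enumerate(unique_subj)}
folds = np.array([fold_of_subj[s] for s in subjects])

for k in range(5):
    train, test = subjects[folds != k], subjects[folds == k]
    assert not set(train) & set(test)         # subject-independence holds
print("no subject leaks across folds")
```

Splitting by window instead of by subject would inflate affect-recognition accuracy, since models can memorize a subject's pulse morphology.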

[LG-45] Observable Neural ODEs for Identifiable Causal Forecasting in Continuous Time

链接: https://arxiv.org/abs/2604.26070
作者: Jennifer Wendland,Nicolas Freitag,Maik Kschischo
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST); Quantitative Methods (q-bio.QM)
*备注: 20 pages, 5 figures

点击查看摘要

Abstract:Causal inference in continuous-time sequential decision problems is challenged by hidden confounders. We show that, in latent state-space models with time-varying interventions, observability of the latent dynamics from observed data is necessary for identifying dynamic treatment effects, linking control-theoretic observability to causal identifiability, even when hidden confounders affect both treatments and outcomes. We derive a continuous-time adjustment formula expressing potential outcome distributions under treatment trajectories via the measurement model, latent dynamics, and the filtering distribution over latent states given observed histories. We propose Observable Neural ODEs (ObsNODEs), Neural ODE models in observable normal form for causal forecasting. ObsNODEs learn continuous-time dynamics with states reconstructible from observations, enabling outcome prediction under alternative treatment paths. Experiments on synthetic cancer data, semi-synthetic data based on MIMIC-IV, and real-world sepsis data show strong performance over recent sequence models.

[LG-46] Incremental Strongly Connected Components with Predictions

链接: https://arxiv.org/abs/2604.26062
作者: Ronald Deng,Samuel McCauley,Aidin Niaparast,Helia Niaparast,Bennett Ptak,Shirel Quintanilla,Shikha Singh,Nathan Vosburg
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Algorithms with predictions is a growing area that aims to leverage machine-learned predictions to design faster beyond-worst-case algorithms. In this paper, we use this framework to design a learned data structure for the incremental strongly connected components (SCC) problem. In this problem, the n vertices of a graph are known a priori and the m directed edges arrive over time. The goal is to efficiently maintain the strongly connected components of the graph after each insert. Our algorithm receives a possibly erroneous prediction of the edge sequence and uses it to precompute partial solutions to support fast inserts. We show that our algorithm achieves nearly optimal bounds with good predictions and its performance smoothly degrades with the prediction error. We also implement our data structure and perform experiments on real datasets. Our empirical results show that the theory is predictive of practical runtime improvements.

[LG-47] Mining Negative Sequential Patterns to Improve Viral Genomic Feature Representation and Classification

链接: https://arxiv.org/abs/2604.25968
作者: Wenxi Zhu,Wensheng Gan,Zhenlian Qi
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Viruses represent the most abundant biological entities on Earth and play a pivotal role in microbial ecosystems, yet, as prominent human pathogens, they are closely linked to human morbidity and mortality. Accurate identification of viral sequences from genomic data is therefore essential, but existing genome-based classification models, which largely rely on composition- or frequency-based subsequence features, often suffer from limited interpretability and reduced accuracy, particularly on complex or imbalanced datasets. To address these limitations, we propose GeneNSPCla (Genomic Negative Sequential Pattern-based Classification), a novel viral classification framework based on Negative Sequential Patterns (NSPs) that extracts discriminative absence-based features from nucleotide sequences of RNA viral genomes. By transforming these NSPs into numerical feature vectors and integrating them into multiple supervised classifiers, GeneNSPCla effectively captures both presence and absence signals in viral sequences. Furthermore, we propose a negative pattern mining algorithm adapted for processing genomic data: GONPM+, which can discover longer and more biologically meaningful negative sequential patterns. The experimental results demonstrate that the average accuracy of GONPM+ across 8 classifiers improves by 10.03% relative to the original negative pattern mining algorithm and by 24.75% relative to the positive pattern mining algorithm. These findings highlight the effectiveness of incorporating absence-based sequential information, providing a new and complementary perspective for viral genome analysis and classification.
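The core idea of absence-based features can be shown with fixed-length k-mers: record which short nucleotide patterns a sequence lacks, not only which it contains. GONPM+ mines richer variable-length negative sequential patterns; this fixed k-mer version is only an illustration.

```python
from itertools import product

def absence_features(seq, k=2, alphabet="ACGT"):
    """Return the k-mers over `alphabet` that are ABSENT from `seq`."""
    kmers = {"".join(p) for p in product(alphabet, repeat=k)}
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return sorted(kmers - present)          # the absence-based feature set

missing = absence_features("ACGTACGTAA")
print(missing)
```

Turning such absence sets into binary indicator vectors gives the numerical features a downstream classifier can consume alongside the usual presence-based ones.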

[LG-48] Large Language Models for Multilingual Code Intelligence: A Survey

链接: https://arxiv.org/abs/2604.25960
作者: Chao Jiang,Dugang Liu,Cheng Wen,Zhiwu Xu,Hua Zheng,Muhammad Sadiq,Jawwad Ahmed Shamsi,Shengchao Qin,Zhong Ming
类目: Software Engineering (cs.SE); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:Large language models have transformed AI-assisted software engineering, but current research remains biased toward high-resource languages such as Python, with weaker performance in languages like Rust and OCaml. Since real-world systems are inherently polyglot, robust multilingual code intelligence is crucial. This survey focuses on two key tasks: multilingual code generation from shared natural-language requirements, and multilingual code translation that preserves semantics across languages. It reviews representative methods, benchmarks, and evaluation metrics, and highlights challenges and opportunities for trustworthy cross-language generalization.

[LG-49] A Multimodal and Explainable Machine Learning Approach to Diagnosing Multi-Class Ejection Fraction from Electrocardiograms

链接: https://arxiv.org/abs/2604.25942
作者: Catherine Ning,Yu Ma,Cindy Beini Wang,Sean McMahon,Joseph Radojevic,Steven Zweibel,Dimitris Bertsimas
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Left ventricular ejection fraction (LVEF) assessment depends on echocardiography, limiting access in primary care and resource-constrained settings. We developed a multimodal machine-learning framework that combines engineered 12-lead ECG time-series features with structured EHR variables to classify LVEF into four clinically used strata: normal (≥50%), mildly reduced (40–50%), moderately reduced (30–40%), and severely reduced (<30%). To support model explainability, we identified the most influential ECG and EHR features via SHAP attributions. Using retrospective data from Hartford HealthCare, we trained XGBoost models on 36,784 ECG-echocardiogram pairs from 30,952 outpatients and evaluated temporal generalizability on 19,966 ECGs from a subsequent period. The multimodal model achieved one-vs-rest AUROCs of 0.95 (severe), 0.92 (moderate), 0.82 (mild), and 0.91 (normal), outperforming ECG-only and EHR-only baselines, and maintained performance under temporal validation. This work supports ECG-based, multimodal LVEF stratification as a practical screening and triage aid to prioritize confirmatory imaging where resources are limited.
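The one-vs-rest AUROCs quoted above can be computed for each stratum with the Mann-Whitney rank formula. A numpy-only sketch on synthetic scores (the actual model is an XGBoost ensemble over ECG and EHR features):

```python
import numpy as np

def auroc(y_true, scores):
    """One-vs-rest AUROC via the Mann-Whitney U rank statistic (no ties assumed)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)          # one stratum vs the rest
s = y + rng.normal(0, 0.8, 500)      # informative but noisy classifier score
print(round(auroc(y, s), 3))
```

Repeating this with each of the four strata as the positive class yields the per-stratum numbers reported in the abstract.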

[LG-50] Learning Over-Relaxation Policies for ADMM with Convergence Guarantees

链接: https://arxiv.org/abs/2604.26932
作者: Junan Lin,Paul J. Goulart,Luca Furieri
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Alternating Direction Method of Multipliers (ADMM) is a widely used method for structured convex optimization, and its practical performance depends strongly on the choice of penalty and relaxation parameters. Motivated by settings such as Model Predictive Control (MPC), where one repeatedly solves related optimization problems with fixed structure and changing parameter values, we propose learning online updates of the relaxation parameter to improve performance on problem classes of interest. This choice is computationally attractive in OSQP-like architectures, since adapting relaxation does not trigger the matrix refactorizations associated with penalty updates. We establish convergence guarantees for ADMM with time-varying penalty and relaxation parameters under mild assumptions, and show on benchmark quadratic programs that the resulting learned policies improve both iteration count and wall-clock time over baseline OSQP.
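Where the relaxation parameter enters ADMM is easiest to see on a concrete problem. Below is a standard ADMM solver for the lasso with a fixed over-relaxation parameter alpha; the paper learns alpha online instead, and all problem data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 10))
x_true = np.zeros(10); x_true[:3] = 1.0
b = A @ x_true                              # noiseless observations

lam, rho, alpha = 0.1, 1.0, 1.6             # alpha in (0, 2); >1 = over-relaxed

x = z = u = np.zeros(10)
AtA, Atb = A.T @ A, A.T @ b
L = np.linalg.inv(AtA + rho * np.eye(10))   # factor once, reuse every iteration
for _ in range(200):
    x = L @ (Atb + rho * (z - u))           # x-update (quadratic subproblem)
    x_hat = alpha * x + (1 - alpha) * z     # <-- the over-relaxation step
    z = np.sign(x_hat + u) * np.maximum(np.abs(x_hat + u) - lam / rho, 0)  # soft-threshold
    u = u + x_hat - z                       # dual update

print(np.round(z, 2))
```

Note that changing alpha leaves the factored matrix untouched, which is exactly why adapting relaxation (unlike adapting rho) avoids matrix refactorizations in OSQP-like solvers.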

[LG-51] Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

链接: https://arxiv.org/abs/2604.26898
作者: Andrea Agazzi,Giuseppe Bruno,Eloy Mosig García,Samuele Saviozzi,Marco Romito
类目: Probability (math.PR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 55 pages, 6 figures

点击查看摘要

Abstract:We prove pathwise convergence of the layerwise evolution of tokens in a finite-depth, finite-width transformer model with MultiLayer Perceptron (MLP) blocks to a continuous-time stochastic interacting particle system. We also identify the stochastic partial differential equation describing the evolution of the tokens’ distribution in this limit and prove propagation of chaos when the number of such tokens is large. The bounds we establish are quantitative and the limits we consider commute. We further prove that the limiting stochastic model displays synchronization by noise and establish exponential dissipation of the interaction energy on average, provided that the common noise is sufficiently coercive relative to the deterministic self-attention drift. We finally characterize the activation functions satisfying the former condition.

[LG-52] Quantum Feature Selection with Higher-Order Binary Optimization on Trapped-Ion Hardware

链接: https://arxiv.org/abs/2604.26834
作者: Carlos Flores-Garrigós,Anton Simen,Qi Zhang,Enrique Solano,Narendra N. Hegade,Sayonee Ray,Claudio Girotto,Jason Iaconis,Martin Roetteler
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a quantum feature-selection framework based on a higher-order unconstrained binary optimization (HUBO) formulation that explicitly incorporates multivariate dependencies beyond standard quadratic encodings. In contrast to QUBO-based approaches, the proposed model includes one-, two-, and three-body interaction terms derived from mutual-information measures, enabling the objective function to capture feature relevance, pairwise redundancy, and higher-order statistical structure within a unified energy model. To suppress trivial all-selected solutions, we further include structured linear penalties that promote sparsity while preserving informative variables. The resulting HUBO instances are optimized with digitized counterdiabatic quantum optimization on IonQ Forte and compared against noiseless quantum simulation as well as two classical dimensionality-reduction baselines: SelectKBest based on mutual information and principal component analysis (PCA). We evaluate the proposed workflow on two benchmark classification datasets, namely the Gallstone dataset and the Spambase dataset, and analyze both predictive performance and selected-subset structure. The results show good qualitative agreement between hardware executions and noiseless simulations, supporting the feasibility of implementing higher-order feature-selection Hamiltonians on current trapped-ion processors. In addition, the quantum approach yields competitive classification performance while producing compact and informative feature subsets, highlighting the potential of higher-order quantum optimization for machine-learning preprocessing tasks.

[LG-53] Parameterized Quantum Circuits as Feature Maps: Representation Quality and Readout Effects in Multispectral Land-Cover Classification

链接: https://arxiv.org/abs/2604.26675
作者: Ralntion Komini,Aikaterini Mandilara,Georgios Maragkopoulos,Dimitris Syvridis
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We investigate variational quantum classifiers (VQCs) for land-cover classification from multispectral satellite imagery, adopting a feature-map perspective in which the quantum circuit defines a nonlinear data embedding while the readout determines how this representation is exploited. Using the EuroSAT-MS dataset, we perform a systematic one-vs-one evaluation across all class pairs under a controlled experimental protocol, comparing classical baselines (logistic regression, SVMs, neural networks) with VQCs employing both linear readout and quantum-kernel SVM strategies. Our results show that, while VQCs with linear readout do not outperform strong classical baselines such as RBF-SVM, the same trained quantum feature map can significantly improve performance when reused within a kernel-based decision framework. A qubit-count sweep further reveals saturation effects consistent with the mismatch between exponential Hilbert space dimension and linear parameter scaling. Overall, our findings highlight that the effectiveness of quantum models depends critically on the interplay between representation and readout, and that meaningful gains may arise from combining learned quantum feature maps with classical decision mechanisms rather than seeking direct replacement of classical models.

[LG-54] Laplace Approximation for Bayesian Tensor Network Kernel Machines

链接: https://arxiv.org/abs/2604.26673
作者: Albert Saiapin,Kim Batselier
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 19 pages, 3 figures, 6 tables. Code available at: this https URL

点击查看摘要

Abstract:Uncertainty estimation is essential for robust decision-making in the presence of ambiguous or out-of-distribution inputs. Gaussian Processes (GPs) are classical kernel-based models that offer principled uncertainty quantification and perform well on small- to medium-scale datasets. Alternatively, formulating the weight space learning problem under tensor network assumptions yields scalable tensor network kernel machines. However, these assumptions break Gaussianity, complicating standard probabilistic inference. This raises a fundamental question: how can tensor network kernel machines provide principled uncertainty estimates? We propose a novel Bayesian Tensor Network Kernel Machine (LA-TNKM) that employs a (linearized) Laplace approximation for Bayesian inference. A comprehensive set of numerical experiments shows that the proposed method consistently matches or surpasses Gaussian Processes and Bayesian Neural Networks (BNNs) across diverse UCI regression benchmarks, highlighting both its effectiveness and practical relevance.
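The Laplace step itself — a Gaussian centered at the posterior mode with covariance from the inverse Hessian — can be checked on a conjugate 1-D model, where the approximation is exact. This illustrates the inference mechanism only, not the tensor network model.

```python
import numpy as np

y = np.array([1.2, 0.8, 1.1, 0.9])
sigma2, tau2 = 0.25, 1.0                 # likelihood noise and prior variances

def neg_log_post(w):                     # negative log posterior, up to a constant
    return ((y - w) ** 2).sum() / (2 * sigma2) + w ** 2 / (2 * tau2)

# find the mode by Newton's method (gradient and Hessian are analytic here)
w = 0.0
for _ in range(50):
    grad = -(y - w).sum() / sigma2 + w / tau2
    hess = len(y) / sigma2 + 1 / tau2
    w -= grad / hess

laplace_mean, laplace_var = w, 1.0 / hess    # Gaussian at mode, inverse-Hessian variance
exact_var = 1.0 / (len(y) / sigma2 + 1 / tau2)
print(laplace_mean, laplace_var, exact_var)
```

Because this posterior is Gaussian, the Laplace variance matches the exact posterior variance; for the non-Gaussian posteriors of tensor network kernel machines it is an approximation.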

[LG-55] Inferring bifurcation diagrams of two distinct chaotic systems by a single machine

链接: https://arxiv.org/abs/2604.26632
作者: Jianmin Guo,Yao Du,Yizhen Yu,Yong Zou,Xingang Wang
类目: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG)
*备注: 10 pages, 4 figures

点击查看摘要

Abstract:We propose a dual-channel reservoir-computing scheme for inferring the dynamics of two distinct chaotic systems with a single machine. By augmenting a standard reservoir with a system-label channel and a parameter-control channel, the machine can be trained from time series collected from a few sampled states of the two systems. We show that the trained machine not only predicts the short-time evolution of the sampled states, but also reproduces the long-term statistical properties of unseen states, thereby enabling reconstruction of the bifurcation diagrams of both systems from partial observations. The effectiveness of the scheme is demonstrated for the Lorenz and Rössler systems in numerical simulations and for the Chua and Rössler circuits in experiments. Functional-network analysis further shows that the two target systems are encoded by distinct dynamical patterns in the reservoir. These results extend multifunctional and parameter-aware reservoir computing, and provide a route to data-driven inference of multiple nonlinear systems using a single machine.

[LG-56] Deep-testing: the case of dependence detection

链接: https://arxiv.org/abs/2604.26558
作者: Gery Geenens,Pierre Lafaye de Micheaux,Ivan Muyun Zou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Deep learning methods have proved highly effective for classification and image recognition problems. In this paper, we ask whether this success can be transferred to hypothesis testing: if a neural network can distinguish, for example, an image of a handwritten digit from another, can it also distinguish an “image of a sample” (such as a scatter plot) generated under a given statistical model from one generated outside that model? Motivated by this idea, we propose a novel procedure called deep-testing, which approaches the classical inferential problem of hypothesis testing through deep learning. More specifically, the test statistic is a classification map learned by a deep neural network from simulated data satisfying the null and alternative hypotheses, leveraging its strong discriminating power to construct a highly powerful test. As a proof of concept, we apply deep-testing to the problem of independence testing, arguably one of the most important problems in statistics. In a large-scale simulation study, deep-testing achieves the highest overall power against nineteen competing methods across a broad range of complex dependence structures, confirming the viability of the proposed approach.
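A bare-bones version of the deep-testing recipe for independence: simulate samples under H0 and under a dependence alternative, learn a discriminating statistic, and calibrate the rejection threshold on the null simulations. A single hand-crafted feature stands in for the deep network here; only the recipe matches the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def feature(xy):                          # |Pearson correlation| of the sample
    x, y = xy
    return abs(np.corrcoef(x, y)[0, 1])

def simulate(dependent):
    """One 'image of a sample': 100 (x, y) pairs, dependent or not."""
    x = rng.normal(size=100)
    y = 0.6 * x + 0.8 * rng.normal(size=100) if dependent else rng.normal(size=100)
    return x, y

# training set: 200 null samples and 200 dependent samples
f0 = np.array([feature(simulate(False)) for _ in range(200)])
f1 = np.array([feature(simulate(True)) for _ in range(200)])

threshold = np.quantile(f0, 0.95)         # calibrate 5% level on H0 simulations
power = (f1 > threshold).mean()           # empirical power on the alternative
print(round(threshold, 3), round(power, 2))
```

In the paper the scalar feature is replaced by a deep classifier trained on simulated scatter plots, which is what gives the test power against dependence structures a correlation statistic would miss.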

[LG-57] Recipes for Calibration Checks in Safety-Critical Applications

链接: https://arxiv.org/abs/2604.26479
作者: Romeo Valentin
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: 36 pages, 22 figures. Manuscript prepared with Typst

点击查看摘要

Abstract:Safety-critical prediction systems, such as autonomous vehicles, weather forecasters, and medical monitors, commonly rely on probabilistic forecasters. These forecasters make predictions about possible future outcomes, and their quality and robustness needs to be validated and certified. Often, only accuracy – the mean of the predictions – is evaluated against true outcomes. However, for safety-critical scenarios and decision making under uncertainty, the full distributional properties of the forecasts should be checked: do the observed prediction errors actually follow the forecasted probability distributions? To this end, we introduce a framework for calibration checks: statistical tests that validate distributional properties of forecasts when measured over many samples. In order to support ease-of-use in real-world operations, these checks produce a single accept/reject decision for data collected from a forecaster. This contrasts typical calibration calculations which produce one or multiple continuous calibration scores and require expertise to implement in a validation workflow. We further support operationalization by introducing modifications to calibration testing that (a) reject only overconfident predictions, allowing for pessimistic or cautious predictions in safety-critical settings, and (b) tolerate small, operationally acceptable deviations even for large numbers of validation samples. We organize the calibration checking process into a modular pipeline comprising four steps: (i) the data model, (ii) the chosen metric, (iii) the hypothesis formulation, and (iv) the testing procedure. Each step consists of independently swappable components, thereby supporting a large variety of possible use-cases and trade-offs. We demonstrate the applicability of the framework on two complementary example problems, weather forecasting and robot pose estimation.
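One concrete instance of such an accept/reject calibration check, assuming Gaussian forecasts: compute the probability integral transform (PIT) of the outcomes and run a Kolmogorov-Smirnov test against uniformity. The 5% threshold constant and the Gaussian forecast assumption are illustrative choices, not the paper's full pipeline.

```python
import numpy as np
from math import erf, sqrt

def pit_ks_check(mu, sigma, y, crit=1.358):     # 1.358 ~ asymptotic KS 5% level
    """Return True (accept) if outcomes y look calibrated w.r.t. N(mu, sigma)."""
    pit = np.array([0.5 * (1 + erf((yi - m) / (s * sqrt(2))))
                    for m, s, yi in zip(mu, sigma, y)])
    pit.sort()
    n = len(pit)
    grid = np.arange(1, n + 1) / n
    ks = max(np.abs(grid - pit).max(), np.abs(pit - (grid - 1 / n)).max())
    return ks * sqrt(n) <= crit                 # single accept/reject decision

rng = np.random.default_rng(0)
mu = rng.normal(size=2000)
y_good = mu + rng.normal(size=2000)             # outcomes match forecast spread
ok = pit_ks_check(mu, np.ones(2000), y_good)
bad = pit_ks_check(mu, np.full(2000, 0.3), y_good)  # overconfident forecaster
print(ok, bad)
```

The overconfident forecaster piles PIT mass near 0 and 1 and is rejected; a one-sided variant of the statistic would implement the paper's "reject only overconfidence" modification.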

[LG-58] Order-Sensitive Sequential Interventions on Ideal Lattices

链接: https://arxiv.org/abs/2604.26472
作者: Dmitry Pasechnyuk-Vilensky
类目: Combinatorics (math.CO); Machine Learning (cs.LG)
*备注: 18 pages

点击查看摘要

Abstract:We study sequential interventions under prerequisite constraints. In this setting, admissible intervention sequences are paths in the ideal lattice of a finite prerequisite poset rather than unconstrained action strings. We give an exact local-to-global theory of order sensitivity on this state space. First, we prove that any two admissible paths with the same endpoints differ by a finite sequence of elementary diamond swaps. Second, for edge-additive path valuations, we show that path-independence is equivalent to vanishing diamond curvature, yielding an endpoint potential with a canonical Möbius parameterization on the ideal lattice. Third, we prove that a local diamond field is induced by an edge-based path model if and only if it satisfies cube consistency, with uniqueness after fixing a reference-tree gauge. Under reduced-state longitudinal assumptions, supported reference paths identify reference-path scores, whereas local order effects require two-sided support of both orders on each diamond. These results yield exact planning consequences, including an order-insensitivity bound and dynamic programming on the truncated ideal lattice.

[LG-59] Probabilistic data quality assessment for structural monitoring data via outlier-resistant conditional diffusion model

链接: https://arxiv.org/abs/2604.26366
作者: Qi Li(1,2),Yong Huang(1,2),Hui Li(1,2) ((1) Key Lab of Smart Prevention and Mitigation of Civil Engineering Disasters of the Ministry of Industry and Information Technology, School of Civil Engineering, Harbin Institute of Technology, Harbin, 150090, China (2) Key Lab of Structures Dynamic Behavior and Control of the Ministry of Education, Harbin Institute of Technology, Harbin, 150090, China)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 43 pages, 15 figures and 2 tables

点击查看摘要

Abstract:Data quality assessment is an essential step that ensures the reliability of the subsequent structural health monitoring (SHM) tasks. This study proposes a prediction deviation-based SHM data quality assessment method using a univariate implicit auto-regressive model, enabling outlier diagnosis and data cleaning. The proposed conditional diffusion model (CDM) augments the standard diffusion model with a conditional embedding module to incorporate temporal context, quartile normalization to mitigate distribution skew, and a Huber loss to enhance robustness against outliers. Within this univariate implicit autoregressive framework, each data point is assigned an outlier probability, quantifying its degree of “outlier-ness”, and a global quality evaluation score is computed to characterize the overall dataset quality. Extensive case studies utilizing operational data from real-world structures demonstrate that the proposed framework significantly improves the accuracy of data quality assessment, outperforming other strong baselines representative of clustering, isolation-based, and deep reconstruction methods. The effectiveness and robustness of the proposed framework are further demonstrated by the findings of ablation experiments and hyperparameter analysis.
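The prediction-deviation scoring step can be illustrated with a trivial one-step predictor standing in for the conditional diffusion model: each point gets an outlier probability from how far it deviates from its prediction, scaled by a robust spread estimate. Everything below is an assumption-laden toy, not the paper's CDM.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 20, 400)) + 0.05 * rng.normal(size=400)
x[150] += 1.5                            # inject one outlier into the series

pred = x[:-1]                            # naive predictor: next value = last value
resid = x[1:] - pred
med = np.median(resid)
scale = 1.4826 * np.median(np.abs(resid - med))   # robust (MAD-based) sigma
z = np.abs(resid - med) / scale
outlier_prob = np.array([erf(zi / sqrt(2)) for zi in z])   # P(|N(0,1)| < z)

# the two residuals touching the injected point x[150] score highest
print(int(np.argmax(z)), round(float(outlier_prob[np.argmax(z)]), 4))
```

The per-point probabilities play the role of the paper's "outlier-ness" scores; averaging or thresholding them gives a global quality score for the whole record.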

[LG-60] DiffAnon: Diffusion-based Prosody Control for Voice Anonymization INTERSPEECH2026

链接: https://arxiv.org/abs/2604.26281
作者: Ismail Rasim Ulgen,Zexin Cai,Nicholas Andrews,Philipp Koehn,Berrak Sisman
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Submitted to Interspeech 2026

点击查看摘要

Abstract:To preserve or not to preserve prosody is a central question in voice anonymization. Prosody conveys meaning and affect, yet is tightly coupled with speaker identity. Existing methods either discard prosody for privacy or lack a principled mechanism to control the utility-privacy trade-off, operating at fixed design points. We propose DiffAnon, a diffusion-based anonymization method with classifier-free guidance (CFG) that provides explicit, continuous inference-time control over prosody preservation. DiffAnon refines acoustic detail over semantic embeddings of an RVQ codec, enabling smooth interpolation between anonymization strength and prosodic fidelity within a single model. To the best of our knowledge, it is the first voice anonymization framework to provide structured, interpolatable inference-time prosody control. Experiments demonstrate structured trade-off behavior, achieving strong utility while maintaining competitive privacy across controllable operating points.
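Classifier-free guidance reduces to one line at inference time: blend the unconditional and conditional denoiser outputs with a guidance weight. The stubbed vectors below only show the interpolation that gives DiffAnon its continuous prosody-control knob; real inputs would be denoiser predictions over codec embeddings.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: w=0 ignores the condition, w=1 follows it,
    w>1 amplifies it."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 0.0])     # stub: unconditional denoiser output
eps_c = np.array([1.0, -1.0])    # stub: prosody-conditioned denoiser output
for w in (0.0, 1.0, 2.0):
    print(w, cfg_combine(eps_u, eps_c, w))
```

Sweeping w at inference time is what lets a single trained model interpolate between anonymization strength and prosodic fidelity.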

[LG-61] Fitting Large Nonlinear Mixed Effects Models Using Variational Expectation Maximization

链接: https://arxiv.org/abs/2604.26160
作者: Mohamed Tarek,Pedro Afonso
类目: Methodology (stat.ME); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Mathematical Software (cs.MS); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Nonlinear Mixed Effects (NLME) models are widely used in pharmacometrics and related fields to analyze hierarchical and longitudinal data. However, as the number of parameters and random effects increases, traditional methods for maximizing the marginal likelihood become computationally expensive. This paper explores the Variational Expectation Maximization (VEM) algorithm, a scalable alternative for fitting NLME models. Originally introduced in the context of probabilistic graphical models and later popularized through variational autoencoders, VEM has not been extensively applied to NLME modeling. By leveraging flexible variational families and reverse-mode automatic differentiation, VEM can efficiently maximize the marginal likelihood, scaling to NLME models with over 15,000 population parameters. This work provides a detailed description of VEM, compares it to other NLME fitting algorithms, and highlights its scalability through computational experiments. Using the Pumas statistical software, we fit two test models: 1) a standard warfarin model, and 2) a DeepNLME Friberg model with 15,410 population parameters and 16 random effects. The warfarin model was fitted to completion to demonstrate the correctness of VEM, while the DeepNLME Friberg model was fitted for a limited number of iterations to measure the time per iteration and demonstrate VEM’s scalability.

[LG-62] Mixture of Experts Framework in Machine Learning Interatomic Potentials for Atomistic Simulations

链接: https://arxiv.org/abs/2604.26143
作者: Gabriel de Miranda Nascimento,Marc L. Descoteaux,Laura Zichi,Chuin Wei Tan,William C. Witt,Nicola Molinari,Sriteja Mantha,Daniil Kitchaev,Mordechai Kornbluth,Karim Gadelrab,Charles Tuffile,Boris Kozinsky
类目: Computational Physics (physics.comp-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:First-principles atomistic simulations are essential for understanding complex material phenomena but are fundamentally limited by their computational cost. While Machine Learning Interatomic Potentials (MLIPs) have drastically improved cost for a given accuracy, their inference cost remains a bottleneck for massive systems or long timescales. To address this, we introduce a multifidelity “Mixture-of-Experts” framework based on the E(3)-equivariant Allegro architecture. Our method spatially partitions the simulation domain into a chemically complex region (e.g., reactive interfaces) and a simple region (e.g., bulk lattice), assigning models of varying capacity to each. Among the challenges in such static domain decomposition, the mechanical mismatch between models at the interface is particularly critical, as it can generate artificial stress fields and instability. We address this challenge with a co-training strategy in which the loss function includes agreement constraints – penalties on per-atom energy and force discrepancies between models evaluated on shared bulk environments – forcing the independent models to learn a consistent physical description of the bulk material. We validate this approach on a realistic Pt+CO catalytic system, demonstrating that the co-trained models maintain exact energy conservation, align their bulk mechanical response (e.g., equation of state and bulk modulus), and achieve predictive accuracy comparable to a full high-fidelity simulation at more than twice the computational speed.

[LG-63] Sparse Graph Learning from Sparse Data via Fiedler Number Maximization

Link: https://arxiv.org/abs/2604.26132
Authors: Bahar Oveisgharan,Gene Cheung,Andrew Eckford
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:We aim to learn a sparse and connected graph from sparse data, where the number of observations K can be substantially smaller than the signal dimension N for signals x in R^N, and the underlying distribution is unknown. In this severely ill-posed setting, we incorporate Fiedler number (the second eigenvalue of the graph Laplacian matrix that quantifies connectedness) as a robust regularization term in the sparse graph learning objective. We first develop a greedy algorithm that iteratively selects one edge globally for weakening/removal to reduce the objective, leveraging eigenvalue perturbation theorems that bound the adverse effect of an edge change to the Fiedler number. Next, we design a parallel variant, based on the Cheeger’s inequality, that recursively partitions an input graph into two sub-graphs using an approximate Cheeger cut to distributedly find an optimal edge. Simulation experiments show that Fiedler number maximization robustifies sparse graph estimates, outperforming previous sparse graph learning algorithms.
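The selection criterion in the greedy step can be sketched directly: pick the candidate edge whose removal hurts the Fiedler number (second-smallest Laplacian eigenvalue) the least. Note the paper's algorithm avoids recomputing eigendecompositions per candidate by using eigenvalue perturbation bounds; this brute-force numpy sketch only illustrates the objective being greedily preserved:

```python
import numpy as np

def fiedler_number(W):
    """Second-smallest eigenvalue of the combinatorial Laplacian L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    return np.sort(np.linalg.eigvalsh(L))[1]

def greedy_weaken(W, candidates):
    """Among candidate edges, pick the one whose removal keeps the graph
    most connected, i.e. maximizes the resulting Fiedler number."""
    best_edge, best_f = None, -np.inf
    for (i, j) in candidates:
        W2 = W.copy()
        W2[i, j] = W2[j, i] = 0.0  # remove edge (i, j)
        f = fiedler_number(W2)
        if f > best_f:
            best_edge, best_f = (i, j), f
    return best_edge, best_f
```

For example, removing any edge of a unit-weight triangle drops the Fiedler number from 3 to 1 (the triangle becomes a 3-node path), which is the kind of connectedness loss the regularizer penalizes.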

[LG-64] Robust Representation Learning through Explicit Environment Modeling

Link: https://arxiv.org/abs/2604.26128
Authors: Yuli Slavutsky,David M. Blei
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:We consider learning from labeled data collected across multiple environments, where the data distribution may vary across these environments. This problem is commonly approached from a causal perspective, seeking invariant representations that retain causal factors while discarding spurious ones. However, this framework assumes that the environment has no direct effect on the target. In contrast, we consider settings in which this assumption fails, but still aim to learn representations that support robust prediction on average across previously unseen environments. To this end, we study representations learned by explicitly modeling variation across environments and then marginalizing that variation out. We analyze the resulting representations and characterize when they are preferable to those learned by causal invariant-representation methods. We propose a concrete method based on generalized random-intercept models, a class of predictors in which such marginalization is possible, and study their generalization properties. Empirically, we show that these models outperform invariant-learning methods across a range of challenging settings.
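The "model the environment, then marginalize it out" idea can be made concrete in the simplest random-intercept setting: each environment shifts the target by its own intercept, a shared slope is estimated from within-environment variation, and prediction for an unseen environment averages the intercept away. This one-dimensional sketch is purely illustrative; the paper's generalized random-intercept models are more general:

```python
def fit_random_intercept(data_by_env):
    """Shared slope from environment-centered data; per-environment
    intercepts from residual means. data_by_env: {env: [(x, y), ...]}."""
    xs_c, ys_c = [], []
    for pairs in data_by_env.values():
        mx = sum(x for x, _ in pairs) / len(pairs)
        my = sum(y for _, y in pairs) / len(pairs)
        xs_c += [x - mx for x, _ in pairs]   # remove env-specific shift in x
        ys_c += [y - my for _, y in pairs]   # remove env-specific shift in y
    slope = sum(a * b for a, b in zip(xs_c, ys_c)) / sum(a * a for a in xs_c)
    intercepts = {e: sum(y - slope * x for x, y in pairs) / len(pairs)
                  for e, pairs in data_by_env.items()}
    return slope, intercepts

def predict_marginal(x, slope, intercepts):
    """Marginalize the environment effect: average intercept over
    training environments, targeting good performance on average."""
    b = sum(intercepts.values()) / len(intercepts)
    return slope * x + b
```

Unlike invariance-based methods, nothing here assumes the environment has no direct effect on the target; the effect is modeled explicitly and then averaged out.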

[LG-65] Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection

Link: https://arxiv.org/abs/2604.26057
Authors: Jaskirat Sudan,Hashim Ali,Surya Subramani,Hafiz Malik
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Supervised contrastive learning (SupCon) is widely used to shape representations, but has seen limited targeted study for audio deepfake detection. Existing work typically combines contrastive terms with broader pipelines; however, the focus on SupCon itself is missing. In this work, we run a controlled study on wav2vec2 XLS-R (300M) that varies (i) similarity in SupCon (cosine vs angular similarity derived from the hyperspherical angle) and (ii) negative scaling using a warm-started global cross-batch queue. Stage 1 fine-tunes the encoder and projection head with SupCon; Stage 2 freezes them and trains a linear classifier with BCE. Trained on ASVspoof 2019 LA and evaluated on ASV19 eval plus ITW and ASVspoof 2021 DF/LA, Cosine SupCon with a delayed queue achieves the best ITW EER (8.29%) and pooled EER (4.44), while angular similarity performs strongly without queued negatives (ITW 8.70), indicating reduced reliance on large negative sets.
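The two similarity choices the study varies can be written down directly. The angular form below (1 - θ/π, mapping the hyperspherical angle θ to [0, 1]) is one common parameterization; the paper's exact formulation may differ:

```python
import math

def cosine_sim(u, v):
    """Standard cosine similarity between two embedding vectors."""
    du = math.sqrt(sum(x * x for x in u))
    dv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (du * dv)

def angular_sim(u, v):
    """Similarity derived from the hyperspherical angle: 1 - theta/pi.
    Clamping guards acos against floating-point drift outside [-1, 1]."""
    c = max(-1.0, min(1.0, cosine_sim(u, v)))
    return 1.0 - math.acos(c) / math.pi
```

Either function can be dropped into the SupCon logits (scaled by a temperature) in place of the usual cosine term; the angular variant changes how sharply near-orthogonal negatives are penalized.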

[LG-66] Learning Neural Operator Surrogates for the Black Hole Accretion Code

Link: https://arxiv.org/abs/2604.25985
Authors: Matthias Nägele,Cedric Bös,Chester Tan,Christian M. Fromm,Ingo Scholtes,Karl Mannheim
Subjects: High Energy Astrophysical Phenomena (astro-ph.HE); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:General-relativistic magnetohydrodynamic (GR-MHD) simulations are essential for studying black hole accretion, relativistic jets, and magnetic reconnection, yet their computational cost severely limits systematic parameter exploration. We investigate neural operator surrogates for two astrophysically relevant simulation scenarios produced by the Black Hole Accretion Code (\textttBHAC). First, a Physics Informed Fourier Neural Operator (PINO) is trained on the special-relativistic resistive MHD (SRRMHD) evolution of the Orszag-Tang vortex over a range of resistivities spanning the Sweet-Parker and fast reconnection regimes. By embedding the governing equations as an additional loss term evaluated at finer temporal resolution than the available data supervision, the model learns dynamics at time steps where no simulation data is provided, enabling recovery of plasmoid formation that a data-only baseline trained on the same sparse snapshots fails to reproduce. To our knowledge, the present work is the first application of a physics informed neural operator to special relativistic resistive MHD, and the first to investigate the capability of such models to resolve plasmoid formation in SRRMHD. In a second line of investigation, an OFormer-style Transformer Neural Operator is trained on the evolution of spine-sheath relativistic jets created with \textttBHAC, in special-relativistic MHD (SRMHD). The model is directly applied on the adaptive mesh, highlighting the need for linear attention due to long sequences. The neural surrogate model is capable of capturing most of the major details, especially in early predictions. To our knowledge, this constitutes the first application of a neural operator directly on a high resolution adaptive mesh refinement grid in the context of MHD simulations. 
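The key training idea, supervising on sparse simulation snapshots while evaluating the equation residual on a finer time grid where no data exists, can be sketched generically. This is illustrative pseudologic, not PINO or BHAC code: `residual` stands in for the governing SRRMHD equations, and the scalar setting replaces the actual field-valued problem:

```python
def physics_informed_loss(predict, residual, data, w_pde=1.0):
    """Data term on sparse supervised snapshots plus a PDE-residual term
    evaluated on a finer time grid spanning the same interval.

    predict:  callable t -> model output
    residual: callable (predict, t) -> governing-equation residual at t
    data:     {time: target} at sparse supervised snapshots
    """
    data_times = sorted(data)
    data_loss = sum((predict(t) - data[t]) ** 2 for t in data_times) / len(data_times)
    # Enforce the physics between snapshots, where no data supervision exists.
    t0, t1 = data_times[0], data_times[-1]
    fine = [t0 + (t1 - t0) * k / 20 for k in range(21)]
    pde_loss = sum(residual(predict, t) ** 2 for t in fine) / len(fine)
    return data_loss + w_pde * pde_loss
```

The residual term is what lets the model learn dynamics (e.g. plasmoid formation) at time steps a data-only baseline never sees.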

[LG-67] Occam's Razor is Only as Sharp as Your ELBO

Link: https://arxiv.org/abs/2604.25984
Authors: Ethan Harvey,Michael C. Hughes
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:The marginal likelihood, also known as the evidence, is regarded as a mathematical embodiment of Occam’s razor, enabling model selection that avoids overfitting. The evidence lower bound (ELBO) objective from variational inference has also been used for similar purposes. Prior work has shown that restricting the approximate posterior family via a mean-field approximation can lead the ELBO to underfit. In this paper, we show how ELBO-based hyperparameter learning in a simple over-parameterized regression model can also produce overfitting, depending on the assumed rank of the covariance matrix in a Gaussian approximate posterior. Surprisingly, among only the underfit and overfit options, Bayesian model selection via the evidence itself sometimes prefers the overfit version, while the ELBO does not. Bayesian practitioners hoping to scale to large models should be cautious about how reduced-rank assumptions needed for tractability may impact the potential for model selection.
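The gap the abstract turns on is the identity log p(y) = ELBO + KL(q ‖ posterior): any restriction of the approximate family (mean-field, reduced-rank covariance) shows up as a KL gap, so model selection by ELBO and by evidence can disagree. A toy discrete-latent example makes the identity concrete (unrelated to the paper's regression model, purely illustrative):

```python
import math

def log_evidence(prior, lik):
    """Exact log marginal likelihood: log sum_z p(z) p(y|z)."""
    return math.log(sum(p * l for p, l in zip(prior, lik)))

def elbo(q, prior, lik):
    """ELBO = E_q[log p(y,z) - log q(z)].
    Equals log_evidence minus KL(q || true posterior); maximized when
    q matches the posterior exactly."""
    return sum(qz * (math.log(pz * lz) - math.log(qz))
               for qz, pz, lz in zip(q, prior, lik) if qz > 0)
```

With prior (0.5, 0.5) and likelihoods (0.8, 0.2), the posterior is (0.8, 0.2): plugging it in recovers the evidence exactly, while any other q falls short by the KL gap, which is the mechanism behind evidence/ELBO disagreement in model selection.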

[LG-68] Adversarial Robustness of NTK Neural Networks

Link: https://arxiv.org/abs/2604.25965
Authors: Yuxuan Hou
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Deep learning models are widely deployed in safety-critical domains, but remain vulnerable to adversarial attacks. In this paper, we study the adversarial robustness of NTK neural networks in the context of nonparametric regression. We establish minimax optimal rates for adversarial regression in Sobolev spaces and then show that NTK neural networks, trained via gradient flow with early stopping, can achieve this optimal rate. However, in the overfitting regime, we prove that the minimum norm interpolant is vulnerable to adversarial perturbations.

Attachment Download

Click to download today's complete paper list